Data Mining Memes in the Digital Culture Web Archive

A screenshot of the Web Cultures Web Archive

The Library of Congress Web Cultures Web Archive launched to the public last year. This collection of the American Folklife Center, which includes a series of sites documenting the ways that cultures have developed and changed online, has already garnered a good bit of attention (see articles from Slate, Smithsonian Magazine and GeekWire).

You can view the web archives online through our instance of the Wayback Machine, the software we run to replay and provide access to archived websites. That said, many of the sites archived in the collection, like GIPHY and Meme Generator, are more like databases or datasets of digital resources. GIPHY maintains a massive online collection of animated GIFs, and Meme Generator presents a corpus of meme images, or image macros. For further context, meme images (image macros) are widely used images overlaid with text that follows a range of templates used in online communication, and animated GIFs are short, looping moving-image files often used as a kind of online gesture.

As part of an effort to analyze how comprehensive our web crawls are, we ended up deriving data that we realized may be useful for users of the collection. In working on methods to track the number of various kinds of files in individual web archives, Chase Dooley, a Digital Collections Management Specialist on the Web Archiving Team here in the Digital Content Management section, was able to generate lists of the individual GIFs and memes in these web archives. For these kinds of web archives, researchers often want to explore data about these resources more than they want to replay what the site looked like at a given point in time.

Derivative Data from Web Archives

The Collections as Data events, both hosted by the Labs team here at the Library of Congress, and related events supported by the Institute of Museum and Library Services have brought increased attention to the needs of researchers to interact with a range of derivative data forms for digital collections. In that vein, the Web Archiving Team provided a good bit of support to the Library of Congress Digital Scholars Lab Pilot Project (PDF) which focused on a use case of deriving data sets for researchers.

The Web Archiving Program acquires web content through web crawling, a highly automated process. As part of an effort to review the coverage of web crawls of GIPHY and Meme Generator, Chase was able to query our web archives data, first finding the number of unique GIFs and distinct meme instances in each web archive. The GIPHY dataset includes 14,787 total GIFs, of which 10,972 are unique. The Meme Generator dataset includes 86,310 total meme images, which represent 57,652 unique memes.
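To illustrate the difference between a total count and a unique count, here is a minimal Python sketch. It assumes each captured file is represented by a record carrying a content digest, and that captures sharing a digest count as one unique item. The field names `url` and `digest` and the sample records are hypothetical; they are not the actual structure of our web archive data, and this is not necessarily the method Chase used.

```python
def count_total_and_unique(records, key="digest"):
    """Return (total, unique) counts for a list of capture records.

    Assumption: records with the same value for `key` are copies of the
    same item, so they count once toward the unique total.
    """
    total = len(records)
    unique = len({record[key] for record in records})
    return total, unique


# Hypothetical sample: three captures, two of which are the same file.
sample_records = [
    {"url": "https://example.com/a.gif", "digest": "abc123"},
    {"url": "https://example.com/b.gif", "digest": "abc123"},
    {"url": "https://example.com/c.gif", "digest": "def456"},
]

total, unique = count_total_and_unique(sample_records)
print(f"{total} total, {unique} unique")
```

Applied to real data, the same idea would yield a pair of numbers like the 14,787 total and 10,972 unique GIFs reported above, although the choice of key decides what counts as "the same" item.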

We reached out to Nicki Saylor, Head of the American Folklife Center Archive, to share some of the derived data, which she saw as potentially useful for the researchers they serve. In her words, “The interest in researching web cultures is only growing, so being able to provide web content in such a readily usable way helps researchers make the most of these collections.”

You can download CSV snapshot files of the meme and GIF data from the LC Labs experiments page. The datasets include some minimal metadata for these memes and GIFs, as well as links to where you can access their web archive copies. This post and the derivative data fit directly with key parts of the recently launched Library of Congress Digital Strategy. Specifically, by sharing these derivative datasets as LC Labs experiments, we are exploring new ways for users to engage with these collections as data. Sharing the results of Chase’s work with the web archives also illustrates the kind of culture of innovation we are cultivating, where staff across the Library can surface new and interesting potential modes for engaging with the collections.

Memes by the Numbers

Let’s look at the 10 most frequently appearing meme templates in the Meme Generator dataset, shown in the table below. For those who may be unfamiliar with these memes, LC has collected Know Your Meme articles about many of them as part of its Web Cultures Web Archive collection since 2014. For example, there are articles on Y U No and Philosoraptor in the Web Cultures Web Archive. In total, there are 5,391 meme instances derived from these 10 meme templates, which means that the top 10 meme templates represent 9% of the 57,652 distinct meme instances in this dataset.

Top 10 Meme Templates by Count            Total Meme Instances
Y U No                                    766
Futurama Fry                              660
Insanity Wolf                             610
Philosoraptor                             530
Success Kid                               510
The Most Interesting Man In The World     507
willy wonka                               474
Foul Bachelor Frog                        469
Socially Awkward Penguin                  446
Advice Yoda Gives                         419
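
The arithmetic behind the 5,391 and 9% figures can be checked with a short Python sketch. The counts are copied from the table above and the total of 57,652 comes from the text; the variable names are simply ones chosen for this illustration.

```python
# Counts copied from the top 10 table in this post.
top_10_counts = {
    "Y U No": 766,
    "Futurama Fry": 660,
    "Insanity Wolf": 610,
    "Philosoraptor": 530,
    "Success Kid": 510,
    "The Most Interesting Man In The World": 507,
    "willy wonka": 474,
    "Foul Bachelor Frog": 469,
    "Socially Awkward Penguin": 446,
    "Advice Yoda Gives": 419,
}

# Distinct meme instances in the Meme Generator dataset, as reported in the post.
TOTAL_DISTINCT_INSTANCES = 57_652

top_10_total = sum(top_10_counts.values())
share = top_10_total / TOTAL_DISTINCT_INSTANCES

print(f"{top_10_total:,} instances from the top 10 templates")
print(f"{share:.0%} of all distinct meme instances")
```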

Let’s look at the Meme Generator dataset another way. If you wanted to find every meme template that has generated at least five distinct meme instances, you would end up with 1,165 distinct meme templates. These account for 56,636 distinct meme instances, or 98% of the 57,652 distinct meme instances in the dataset.
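
For readers who want to try this kind of threshold analysis themselves, here is a minimal Python sketch. It assumes you have one template name per distinct meme instance; the toy list below is made up for illustration, and the function names are hypothetical rather than anything provided with the dataset. The last lines recompute the share from the figures reported above.

```python
from collections import Counter


def templates_with_min_instances(template_names, minimum=5):
    """Count instances per template and keep templates meeting the minimum."""
    counts = Counter(template_names)
    return {name: n for name, n in counts.items() if n >= minimum}


def coverage(selected_counts, total_instances):
    """Fraction of all instances that belong to the selected templates."""
    return sum(selected_counts.values()) / total_instances


# Hypothetical toy data: one template name per distinct meme instance.
toy_templates = ["Y U No"] * 6 + ["Philosoraptor"] * 5 + ["Rare Template"] * 2

selected = templates_with_min_instances(toy_templates, minimum=5)
print(sorted(selected))
print(f"{coverage(selected, len(toy_templates)):.0%} of toy instances covered")

# The same ratio using the figures reported in this post.
reported_share = 56_636 / 57_652
print(f"{reported_share:.0%} of distinct meme instances")
```

Raising or lowering `minimum` shows how quickly coverage changes as less common templates are included or excluded.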

Going Forward

Our analysis of this data has helped our team identify some potential ways to enhance our crawls of these sites and harvest more of their content. At the same time, we are also interested in exploring how making this kind of derivative data available to users might help spur further use of and engagement with the collections. So if you do explore some of this data, we would be thrilled to hear back from you in the comments on this post.

4 Comments

  1. Claudia
    October 11, 2018 at 6:08 pm

    This is great, but there is a bad URL hyperlinked from the text “LC Labs experiments page.” I think it should be this URL: //labs.loc.gov/experiments/webarchive-datasets/?loclr=blogsig

    Thanks for sharing this!

  2. Meghan Ferriter
    October 11, 2018 at 6:33 pm

    Thank you, Claudia! We’ve updated the post with the correct link.

  3. Unconcerned Troll
    October 12, 2018 at 3:46 am

    O RLY?

  4. Andy Lechlak
    October 12, 2018 at 9:21 am

    This is amazingly cool! Being able to reference pop culture from 20, 30, or 40 years ago has always been part of society, but in a time where everything is online, it makes it a little more difficult to preserve a big amount of those decade stereotypes.
