Data Mining Memes in the Digital Culture Web Archive

A screenshot of the Web Cultures Web Archive

The Library of Congress Web Cultures Web Archive launched to the public last year. This collection from the American Folklife Center, which includes a series of sites documenting the ways that cultures have developed and changed online, has already garnered a good bit of attention (see articles from Slate, Smithsonian Magazine and GeekWire).

You can view the web archives online through our instance of the Wayback Machine, the software we run to replay and provide access to archived websites. That said, many of the sites archived in the collection, like GIPHY and Meme Generator, are more like databases or datasets of digital resources. GIPHY maintains a massive online collection of animated GIFs, and Meme Generator presents a corpus of meme images, or image macros. For further context, meme images (image macros) are widely used images overlaid with text that follows a range of templates used in online communication, while animated GIFs are short, looping moving-image files often used as a kind of online gesture.

As part of an effort to analyze how comprehensive our web crawls are, we ended up deriving data that we realized could be useful for users of the collection. In working on methods to track the number of various kinds of files in individual web archives, Chase Dooley, a Digital Collections Management Specialist on the Web Archiving Team here in the Digital Content Management section, was able to generate lists of the individual GIFs and memes in these web archives. For these kinds of web archives, researchers often want to explore data about these resources more than they want to replay what the site looked like at a given point in time.

Derivative Data from Web Archives

The Collections as Data events, both hosted by the Labs team here at the Library of Congress, and related events supported by the Institute of Museum and Library Services have brought increased attention to the needs of researchers to interact with a range of derivative data forms for digital collections. In that vein, the Web Archiving Team provided a good bit of support to the Library of Congress Digital Scholars Lab Pilot Project (PDF) which focused on a use case of deriving data sets for researchers.

The Web Archiving Program acquires web content through web crawling, a highly automated process. As part of an effort to review the coverage of web crawls of GIPHY and Meme Generator, Chase was able to query our web archives data, first finding the number of unique GIFs and distinct meme instances in both web archives. The GIPHY dataset includes 14,787 total GIFs, of which 10,972 are unique. The Meme Generator dataset includes 86,310 total meme images, which represent 57,652 unique memes.
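To make the distinction between total and unique counts concrete, here is a minimal sketch of one way such a tally could be computed from a list of records. It is not a description of the method Chase used; the column name "digest" and the sample rows are hypothetical and stand in for whatever identifier marks two files as the same resource.

```python
import csv
import io


def count_total_and_unique(rows, key):
    """Return (total, unique) counts for the values found under `key`."""
    values = [row[key] for row in rows]
    return len(values), len(set(values))


# Tiny made-up sample; "url" and "digest" are hypothetical column names,
# not the actual headers of the published CSV files.
sample_csv = """url,digest
http://example.org/a.gif,aaa
http://example.org/b.gif,bbb
http://example.org/a-copy.gif,aaa
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
total, unique = count_total_and_unique(rows, "digest")
print(total, unique)
```

In this sample, the first and third rows share a digest, so they count as two files in the total but only one unique resource.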

We reached out to Nicki Saylor, Head of the American Folklife Center Archive, to share some of the derived data, which she saw as potentially useful for the researchers they serve. In her words, “The interest in researching web cultures is only growing, so being able to provide web content in such a readily usable way helps researchers make the most of these collections.”

You can download CSV snapshot files of the meme and GIF data from the LC Labs experiments page. The data sets include some minimal metadata for these memes and GIFs, as well as links to where you can access their web archive copies. This post and the derivative data fit directly with key parts of the recently launched Library of Congress Digital Strategy. Specifically, by sharing these derivative data sets as LC Labs experiments, we are exploring new ways for users to engage with these collections as data. Sharing the results of Chase’s work with the web archives also illustrates the kind of culture of innovation we are cultivating, where staff across the Library can surface new and interesting potential modes for engaging with the collections.

Memes by the Numbers

Let’s look at the 10 most frequently appearing meme templates in the Meme Generator dataset, shown in the table below. For those who may be unfamiliar with these memes, LC has collected Know Your Meme articles about many of them as part of its Web Cultures Web Archive collection since 2014. For example, there are articles on Y U No and Philosoraptor from the Web Cultures Web Archive. In total, there are 5,391 meme instances derived from these 10 meme templates, which means that the top 10 meme templates represent about 9% of the 57,652 distinct meme instances in this dataset.

Top 10 Meme Templates by Count | Total Meme Instances
Y U No | 766
Futurama Fry | 660
Insanity Wolf | 610
Philosoraptor | 530
Success Kid | 510
The Most Interesting Man In The World | 507
willy wonka | 474
Foul Bachelor Frog | 469
Socially Awkward Penguin | 446
Advice Yoda Gives | 419
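The arithmetic behind the 5,391 total and the roughly 9% share can be checked with a few lines of Python. This is only a sketch that re-adds the counts listed in the table; the variable names are ours.

```python
# Counts copied from the table of the top 10 meme templates.
top_10 = {
    "Y U No": 766,
    "Futurama Fry": 660,
    "Insanity Wolf": 610,
    "Philosoraptor": 530,
    "Success Kid": 510,
    "The Most Interesting Man In The World": 507,
    "willy wonka": 474,
    "Foul Bachelor Frog": 469,
    "Socially Awkward Penguin": 446,
    "Advice Yoda Gives": 419,
}

# Distinct meme instances in the Meme Generator dataset, as reported in the post.
DISTINCT_INSTANCES = 57_652

top_10_total = sum(top_10.values())
share = top_10_total / DISTINCT_INSTANCES

print(top_10_total)      # sum of the ten counts
print(f"{share:.1%}")    # share of all distinct meme instances
```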

Let’s look at the Meme Generator dataset another way. If you wanted to find every meme template that has generated at least five distinct meme instances, you would end up with 1,165 distinct meme templates. These account for 56,636 distinct meme instances, or 98% of the 57,652 distinct meme instances in the dataset.
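For readers who want to try this kind of cut on the downloaded data, the following is a minimal sketch of a threshold filter, assuming you have already extracted one template name per distinct meme instance. The function name and the sample list are hypothetical; only the threshold of five comes from the analysis above.

```python
from collections import Counter


def frequent_templates(template_names, minimum=5):
    """Return {template: count} for templates with at least `minimum` instances."""
    counts = Counter(template_names)
    return {name: n for name, n in counts.items() if n >= minimum}


# Made-up sample: each entry stands for one distinct meme instance.
sample = ["Y U No"] * 6 + ["Philosoraptor"] * 5 + ["Rare Template"] * 2

frequent = frequent_templates(sample, minimum=5)
covered = sum(frequent.values())

print(frequent)
print(f"{covered} of {len(sample)} instances covered")
```

Applied to the full dataset, the same two numbers (the count of qualifying templates and the instances they cover) correspond to the 1,165 templates and 56,636 instances reported above.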

Going Forward

Our attempts to work through analysis of this data have helped our team identify some potential ways to enhance our crawls of these sites to harvest more of their content. At the same time, we are also interested in exploring how making this kind of derivative data available to users might help spur further use of and engagement with the collections. So if you do explore some of this data, we would be thrilled to hear back from you in the comments on this post.


