
Let’s Talk Comics: Comics as Data

“this comic goes out to captain picard before the events of star trek: nemesis as it concerns the backing up data.” Dinosaur Comics! January 28th, 2015.

Good news everyone – the first webcomics dataset is available here!!!

Wait, what?

As part of the Library’s work to explore our web archives, my colleagues at LC Labs and the Web Archiving Team have released a dataset generated from content harvested from the Library of Congress’s Dinosaur Comics! web archive.

“Choo.” Dinosaur Comics! May 1, 2003. From the Dinosaur Comics Dataset Download.

So what is the dataset anyway? It includes minimal metadata for about 3,325 image objects from the Dinosaur Comics! web archive as well as the image files themselves. Bet you never thought of Dinosaur Comics! in quite this way, did you?

Image of Dinosaur Comics! Metadata, March 24, 2020. From the Dinosaur Comics! metadata csv.
Word cloud of Dinosaur Comics! comic titles from the metadata csv. Created by Megan Halsband, March 24, 2020.


I’ve been trying to think about the possibilities for what could be done with this data. One option is something as simple as a word cloud of the top words used in the comic titles from the metadata. The comic titles in the metadata actually come from the “ALT TEXT” associated with each image. Each comic has a separate title (as listed in the archive), ALT TEXT, the text within the comic itself, and a unique number at the end of its URL.
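For anyone who wants to try the word cloud idea, the word counts behind it can be sketched in a few lines of Python. This is only a rough sketch under assumptions: the column name `title`, the short stop word list, and the sample rows are placeholders of mine, not the actual layout of the metadata csv.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical stop words; a real word cloud tool would use a longer list.
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "to"}

def top_title_words(csv_file, title_column="title", n=10):
    """Count the most common words in one column of a metadata CSV.

    The column name "title" is an assumption; check the real csv header.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Lowercase the text and keep runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", row[title_column].lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Tiny sample standing in for the real metadata csv (second row is made up).
sample = io.StringIO(
    "digest,title\n"
    "AAA,it is a purposely inauspicious start\n"
    "BBB,a start of a story\n"
)
print(top_title_words(sample, n=3))
```

To run it on the real file, you would pass an open file handle instead of the sample and set `title_column` to whatever the csv header actually calls that field.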


Comic title(s) list from the Dinosaur Comics! webarchive. March 7, 2019.


The metadata, however, doesn’t include the full text of each comic or the titles used in the archive. For example, the first Dinosaur Comics! strip, from February 1, 2003, is titled “today is a beautiful day,” and its ALT TEXT reads “it is a purposely inauspicious start.”

“today is a beautiful day.” Dinosaur Comics! February 1, 2003. From the Dinosaur Comics Dataset Download.

But what else would you do with the data beyond a word cloud? What about a corpus of transcriptions of the comics themselves? If these files could be linked by the digest number (a unique ID in the metadata file), then could they be made available as a keyword searchable resource? Could you create an automated analysis of the title versus the actual text of the image?
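Here is one way the linking idea might look in practice, as a hedged sketch rather than a working tool. It assumes a transcription corpus already exists as a mapping from digest to text (it doesn’t yet, as far as this post knows), and the column name `digest`, the function names, and the sample rows are all hypothetical.

```python
import csv
import io

def build_index(metadata_file, transcripts, id_column="digest"):
    """Attach a transcription to each metadata row with a matching digest.

    `transcripts` is a hypothetical dict of digest -> transcribed comic text.
    """
    index = {}
    for row in csv.DictReader(metadata_file):
        digest = row[id_column]
        if digest in transcripts:
            index[digest] = {**row, "transcript": transcripts[digest]}
    return index

def search(index, keyword):
    """Return digests whose transcript contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return sorted(
        digest
        for digest, record in index.items()
        if needle in record["transcript"].lower()
    )

# Made-up sample data standing in for the metadata csv and a transcript corpus.
metadata = io.StringIO(
    "digest,title\n"
    "AAA,it is a purposely inauspicious start\n"
    "BBB,another title\n"
)
transcripts = {
    "AAA": "Today is a beautiful day to be stomping on things!",
    "BBB": "A made-up line about philosophy.",
}
index = build_index(metadata, transcripts)
print(search(index, "stomping"))
```

Because each record keeps the title next to the transcript, the same index could also feed a comparison of the title against the text of the image.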

An animation of the images from this particular dataset might tell us about the shape of the comic text and the negative space, since the same six clip art panels appear in almost every single comic (there are exceptions).

GIF of three Dinosaur Comics! images. Created by Megan Halsband March 24, 2020 from the Dinosaur Comics Dataset Download.

For other webcomic image sets it might be an interesting way to look at the comic developing over time.

Some of the data fields require more or less context to be useful – for example the timestamp in the csv file refers to the date and time the image was crawled by the Library. We didn’t start crawling this website till 2014 – so an analysis of frequency of crawls and/or posts would work better on anything published after 2015. (Content prior to 2014 was crawled by the Internet Archive.) This graph shows the number of unique images crawled per year – and is particularly high in 2014 because that is when the Library started archiving this site.

“Number of Unique Images Crawled.” Dinosaur Comics! Web Archive Metadata. Graph created by Megan Halsband, March 24, 2020. 
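A count like the one in the graph could be reproduced along these lines. This is a sketch under assumptions, not the method used for the graph: it supposes the csv has columns named `timestamp` and `digest`, and that each timestamp begins with a four-digit year (as in the 14-digit form common in web archives). The sample rows are made up.

```python
import csv
import io
from collections import defaultdict

def unique_images_per_year(csv_file, ts_column="timestamp", id_column="digest"):
    """Count distinct digests per crawl year.

    Assumes the timestamp starts with a four-digit year; the column
    names are placeholders for whatever the real csv uses.
    """
    seen = defaultdict(set)
    for row in csv.DictReader(csv_file):
        year = row[ts_column][:4]
        seen[year].add(row[id_column])
    return {year: len(digests) for year, digests in sorted(seen.items())}

# Made-up rows: image AAA is crawled twice in 2014 but counted once.
sample = io.StringIO(
    "digest,timestamp\n"
    "AAA,20140301120000\n"
    "BBB,20140302120000\n"
    "AAA,20140401120000\n"
    "CCC,20150101120000\n"
)
print(unique_images_per_year(sample))
```

Note that this counts an image once in every year it was crawled. Counting each image only in the year it was first seen would be a small variation, and might suit a look at posting frequency better.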

I have so many more questions than answers at this point on what we could do with webcomics datasets. How would you use them?

By the way – did you know that the Library recently released the Comics Literature and Criticism web archive? Check it out – and tell me what blogs and sites you’re reading in the comments!

Discover more:

  • Follow the Signal – the blog from our colleagues who specialize in digital stewardship!
  • More web comics posts from Headlines and Heroes!


  1. Congratulations on what looks like an excellent data set! You might take a look at a study I did a couple of years ago on comics with alt-text and hidden comics:

    Bramlett, Frank. 2018. Linguistic discourse in web comics: Extending conversation and narrative into alt-text and hidden comics. In Valentin Werner (ed), The language of pop culture, 72-91. New York: Routledge.
