Good news everyone – the first webcomics dataset is available here!!!
As part of the Library’s work to explore our web archives, my colleagues at LC Labs and the Web Archiving Team have made available a dataset generated from content harvested from the Library of Congress’s web archive of qwantz.com (Dinosaur Comics!).
So what is the dataset anyway? It includes minimal metadata for about 3,325 image objects from the Dinosaur Comics! web archive as well as the image files themselves. Bet you never thought of Dinosaur Comics! in quite this way, did you?
I’ve been trying to think about the possibilities for what could be done with this data. One simple option is a word cloud of the most frequent words in the comic titles from the metadata. Those titles actually come from the “ALT TEXT” associated with each image. Each comic has a separate title (as listed in the archive), ALT TEXT, the text within the comic itself, and a unique number at the end of its URL.
The metadata, however, doesn’t include the full text of each comic or the titles used in the archive. For example, the first Dinosaur Comic! from February 1, 2003 is titled “today is a beautiful day,” and the ALT TEXT reads “it is a purposely inauspicious start.”
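If you want to try the word-cloud idea, the counting step might look something like the rough Python sketch below. It is only an illustration: the column names ("digest", "title"), the stop-word list, and the second sample row are my own assumptions, not the actual layout of the metadata file, so check the real CSV header before borrowing any of it.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical stop-word list; extend it to suit your own analysis.
STOP_WORDS = {"a", "an", "the", "is", "it", "of", "and", "to", "in"}


def top_title_words(csv_file, title_column="title", n=10):
    """Count the most frequent words in one column of a metadata CSV.

    `title_column` is an assumed column name, not taken from the dataset.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        words = re.findall(r"[a-z']+", (row.get(title_column) or "").lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Tiny stand-in for the real metadata file. The digests and the second
# title are invented for illustration.
sample = io.StringIO(
    "digest,title\n"
    "AAA,it is a purposely inauspicious start\n"
    "BBB,a start is a start\n"
)
print(top_title_words(sample, n=3))
```

The list of (word, count) pairs that comes back could then be handed to whatever word-cloud tool you like.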
But what else would you do with the data beyond a word cloud? What about a corpus of transcriptions of the comics themselves? If these files could be linked by the digest number (a unique ID in the metadata file), then could they be made available as a keyword searchable resource? Could you create an automated analysis of the title versus the actual text of the image?
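To make the linking idea a little more concrete, here is a minimal sketch of joining transcriptions to metadata by digest and then running a keyword search. No transcription corpus exists in the dataset, so everything here is hypothetical: the field names, the digests, and the placeholder texts are all invented for illustration.

```python
def link_by_digest(metadata_rows, transcriptions):
    """Attach a transcription to each metadata row that shares its digest.

    `metadata_rows` is a list of dicts with an assumed "digest" key;
    `transcriptions` maps digest -> transcribed text (hypothetical corpus).
    Rows without a matching transcription are left out.
    """
    linked = []
    for row in metadata_rows:
        text = transcriptions.get(row["digest"])
        if text is not None:
            linked.append({**row, "transcription": text})
    return linked


def search(linked_rows, keyword):
    """Return digests of comics whose transcription contains the keyword."""
    needle = keyword.lower()
    return [r["digest"] for r in linked_rows
            if needle in r["transcription"].lower()]


# Invented placeholder records, not real data from the archive.
metadata = [
    {"digest": "AAA", "title": "placeholder alt text one"},
    {"digest": "BBB", "title": "placeholder alt text two"},
]
transcriptions = {"AAA": "Placeholder transcription about stomping."}

linked = link_by_digest(metadata, transcriptions)
print(search(linked, "stomping"))
```

Because each linked record carries both the title and the transcription, the same structure could feed a comparison of the title against the text of the comic.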
An animation of the images from this particular dataset might tell us about the shape of the comic text and the negative space, since the same six clip art panels appear in almost every single comic (there are exceptions).
For other webcomic image sets it might be an interesting way to look at the comic developing over time.
Some of the data fields need more context to be useful. For example, the timestamp in the CSV file refers to the date and time the image was crawled by the Library. We didn’t start crawling this website till 2014, so an analysis of frequency of crawls and/or posts would work better on anything published after 2015. (Content prior to 2014 was crawled by the Internet Archive.) This graph shows the number of unique images crawled per year; the count is particularly high in 2014 because that is when the Library started archiving this site.
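A count like the one in the graph could be approximated with a few lines of Python. This sketch assumes each row has a "timestamp" value that begins with a four-digit year (as in the 14-digit timestamps common in web archives) and a "digest" identifying the image. Both field names and the sample rows are my assumptions, so adjust them to match the actual CSV.

```python
from collections import defaultdict


def unique_images_per_year(rows, timestamp_key="timestamp", digest_key="digest"):
    """Count distinct image digests per crawl year.

    Assumes the timestamp string starts with a four-digit year.
    """
    seen = defaultdict(set)
    for row in rows:
        year = row[timestamp_key][:4]
        seen[year].add(row[digest_key])
    # Sort by year so the result reads chronologically.
    return {year: len(digests) for year, digests in sorted(seen.items())}


# Invented sample rows; the same image crawled twice counts once per year.
rows = [
    {"timestamp": "20140301120000", "digest": "AAA"},
    {"timestamp": "20140405120000", "digest": "AAA"},
    {"timestamp": "20150101000000", "digest": "BBB"},
]
print(unique_images_per_year(rows))
```

Keep in mind that these are crawl dates, so a tally like this describes the Library’s archiving activity at least as much as the comic’s posting schedule.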
I have so many more questions than answers at this point on what we could do with webcomics datasets. How would you use them?
By the way – did you know that the Library recently released the Comics Literature and Criticism web archive? Check it out – and tell me what blogs and sites you’re reading in the comments!