Of late, it seems that almost every project I am called to work on involves some aspect of “Big Data.” I have been challenged in the past on whether libraries actually have big data, because as a general rule we do not collect social science or scientific datasets. But I feel strongly that our digital collections (texts, images, GIS, legislative documents, web archives, etc.) can be considered data in addition to being cultural artifacts. I first talked about this in a post on this blog in October 2011.
I have had conversations recently with colleagues from many organizations about their collections, and some have told me that they do not have big data because they do not have datasets. Others have said that they do not have big data because they do not have large-scale collections or massive observational files. So it seems that we need to define not only the “data” in big data, but also the “big.”
Big can most definitely mean small files, but a lot of them.
And they do not have to be seemingly exotic formats, like FITS files for astronomical images or HDF for earth science data. They can be Excel files, which California Digital Library and Microsoft Research are collaborating to preserve. They can be PDFs, perhaps the most common file format in journal publishing; PDF/A became a formal standard meant for preservation in 2005. They can be HTML files in web archives, or TIFF or JP2 page image files from digitized books and newspapers. Our institutions have hundreds of thousands, millions, or even billions of those types of files.
I can give some examples. When working on the first phase of a publication archiving project, we received a relatively small content delivery of 100 GB. That delivery contained 1.3 million files. Another example is the Library of Congress web archives, which currently comprise over 6 billion files. All of the files mentioned are quite small and in quite common formats. And yet, in the aggregate, they have research value and are Big Data.
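To make "small files, but a lot of them" concrete, here is a rough back-of-the-envelope sketch of the average file size in that 100 GB delivery. The function name is my own for illustration, and I am assuming decimal units (1 GB = 1,000,000 KB); the real delivery was surely a mix of sizes rather than uniform files.

```python
def average_file_size_kb(total_gb: float, file_count: int) -> float:
    """Mean file size in kilobytes, assuming decimal units (1 GB = 1,000,000 KB)."""
    return total_gb * 1_000_000 / file_count


# Figures from the delivery described above: 100 GB across 1.3 million files.
avg = average_file_size_kb(100, 1_300_000)
print(f"{avg:.1f} KB per file")  # prints "76.9 KB per file"
```

At well under 100 KB apiece on average, no single file is big. The scale comes entirely from the count.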
As I was writing this post, a tweet came across my twitterstream pointing to this article:
danah boyd & Kate Crawford (2012): CRITICAL QUESTIONS FOR BIG DATA, Information, Communication & Society, DOI:10.1080/1369118X.2012.678878
As much as I was tempted to delete all my text and have my entire post instead just say “Read This Article,” I showed some restraint. I still feel that I need to say that there are many definitions of what constitutes big data, that cultural organizations have big data in every possible definition of the phrase, and that we need to decide how we are going to steward and provide access to our collections as data. But definitely read their article for even more on what big data is and isn’t.