Defining the “Big” in Big Data

Of late it seems that almost every project I have been called to work on involves some aspect of “Big Data.” I have been challenged in the past on whether libraries actually have big data, because as a general rule we don’t collect social science or scientific datasets. But I strongly assert that our digital collections (texts, images, GIS, legislative documents, web archives, etc.) can be considered data in addition to being cultural artifacts. I first talked about this in a post on this blog in October 2011.

"Big Data Can Generate Big Brainstorms" from Flickr user Kevin Krejci, http://www.flickr.com/photos/kevinkrejci/6259499293/

I have had conversations recently with colleagues from many organizations about their collections, and some have told me that they do not have big data because they do not have datasets. Others have said that they do not have big data because they do not have large-scale collections or massive observational files. So it seems that we need to define not only the “data” in big data but also the “big.”

Big can most definitely mean small files, but a lot of them.

And they do not have to be seemingly exotic formats, like FITS files for astronomical images or HDF for earth science data. They can be Excel files, which California Digital Library and Microsoft Research are collaborating to preserve. PDF is perhaps the most common file format in journal publishing; PDF/A became a formal standard meant for preservation in 2005. There are also HTML files in web archives, and TIFF or JP2 page image files from digitized books and newspapers. Our institutions have hundreds of thousands, millions, or even billions of those types of files.

I can give some examples. When working on the first phase of a publication archiving project, we received a relatively small content delivery of 100 GB. That delivery contained 1.3 million files. Or take the Library of Congress web archives, which currently comprise over 6 billion files. All of the files mentioned are quite small and in quite common formats. And yet, in the aggregate, they have research value and are Big Data.
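To put the first example in perspective, here is a quick back-of-the-envelope sketch in Python that works out the average file size in that 100 GB delivery. The function name is just for illustration, and I am assuming decimal units (1 GB = 1,000,000,000 bytes), which the delivery figures above do not specify.

```python
def average_file_size_kb(total_bytes: int, file_count: int) -> float:
    """Return the mean file size in kilobytes (assuming 1 KB = 1,000 bytes)."""
    return total_bytes / file_count / 1_000


# Figures from the delivery described above.
# Assumption: "100 GB" is read as decimal gigabytes.
delivery_bytes = 100 * 10**9
delivery_files = 1_300_000

avg_kb = average_file_size_kb(delivery_bytes, delivery_files)
print(f"Average file size: {avg_kb:.1f} KB")
```

Under those assumptions each file averages roughly 77 KB, which is small by any standard. The "big" comes from the count rather than from the size of any one file.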

As I was writing this post, a tweet came across my twitterstream pointing to this article:

danah boyd & Kate Crawford (2012): CRITICAL QUESTIONS FOR BIG DATA, Information, Communication & Society, DOI:10.1080/1369118X.2012.678878

As much as I was tempted to delete all my text and have my entire post instead just say “Read This Article,” I showed some restraint.  I still feel that I need to say that there are many definitions of what constitutes big data, that cultural organizations have big data in every possible definition of the phrase, and that we need to decide how we are going to steward and provide access to our collections as data.  But definitely read their article for even more on what big data is and isn’t.

3 Comments

  1. Donna Kafel
    May 17, 2012 at 5:10 pm

    You’ve raised some great points! Bigger doesn’t necessarily mean better, as the term sometimes implies. Small collections of data are often undervalued and overlooked. Boyd and Crawford’s article that you refer to is well worth reading. I wrote a commentary “Big Data: Today’s Sacred Cow” about their paper in the e-Science Community blog http://esciencecommunity.umassmed.edu/2011/12/16/big-data-todays-sacred-cow/

  2. Leslie Johnston
    May 17, 2012 at 5:21 pm

    Thanks for sharing that link to your post, Donna! I am definitely looking forward to the new collection edited by Lisa Gitelman.

  3. J. Bernabe
    July 8, 2012 at 8:37 am

    Leslie! Excellent definition of “BIG” and I’m 100% with you… Big is what Big Data is all about!
    Yet “BIG” is one of the common mistakes companies usually make when they decide to embrace Big Data… Big Data requires starting Small to minimize risks.
    I wrote an article about the most common big data mistakes I wanted to share with you:
    http://bigdata-doctor.com/common-fatal-big-data-mistakes/
