Machine Scale Analysis of Digital Collections: An Interview with Lisa Green of Common Crawl

Lesa Green, Director of Common Crawl

Lisa Green, Director of Common Crawl

How do we make digital collections available at scale for today’s scholars and researchers? Lisa Green, director of Common Crawl, tackled this and related questions in her keynote address at Digital Preservation 2013. (You can view her slides and watch a video of her talk online.) As a follow up to ongoing discussions of what users can do with dumps of large sets of data, I’m thrilled to continue exploring the issues she raised in this insights interview.

Trevor: Could you tell us a bit about Common Crawl? What is your mission, what kinds of content do you have and how do you make it available to your users?

Lisa: Common Crawl is a non-profit organization that builds and maintains an open repository of web crawl data that is available for everyone to access and analyze. We believe that the web is is an incredibly valuable dataset capable of driving innovation in research, business, and education and that the more people that have access to this dataset, the greater the benefit to society.  The data is stored on public cloud platforms so that anyone with a access to the internet can access and analyze it.

Common crawl invites users to get started by starting a machine image, building examples, and joining in on discussions.

Common crawl invites users to get started by starting a machine image, building examples, and joining in on discussions.

Trevor: In your talk, you described the importance of machine scale analysis. Could you define that term for us and give some examples of why you think that kind of analysis is important for digital collections?

Lisa: Let me start by describing human scale analysis.  Human scale analysis means that a person ingests information with their eyes and then processes and analyzes it with their brain. Even if several people – or even hundreds of people – work on the analysis, it is not as fast as a computer program can ingest, process, and analyze information. Machine scale analysis is when a computer program does the analysis. A computer program can analysis data millions to billions of times faster than a human. It can run 24 hours a day with no need for rest and it can simultaneously run on multiple machines.

Machine scale analysis is important for digital collections because of the massive volume of data in most digital collections. Imagine that a researcher wanted to study the etymology of a word and planned to use a digital collection to answers questions such as:

  • What is the first occurrence of this word?
  • How did the frequency of occurrence change over time?
  • What types of publication it is first appear in?
  • When did it first appear in other types of publications and how did the types of publications it appeared in change over time?
  • What other words most commonly appear in the same sentence, paragraph or page with the word and how did that change over time?

Answering such questions using human scale analysis would take lifetimes of man hours to search the collection for the given word. Machine scale analysis could retrieve the information in seconds or minutes. And if the researcher wanted to make changes in the questions or criteria, only a small amount of effort would be required to alter the software program, then the program could be rerun and return the new the information in seconds or minutes. If we want to optimize the extraction of knowledge from the enormous amounts of data digital collections, human analysis is simply too slow.

Trevor: What do you think libraries, archives and museums can learn from Common Crawl’s approach?

Lisa: I think it is of crucial importance to preserve data in a format that it can be analyzed by computers. For instance, if material is stored as a PDF, it difficult – and sometimes impossible – for software programs to analysis the material and therefore libraries, archives and museums will be limited in the amount of information that can be extracted from the material in a reasonable amount of time.

Trevor: What kind of infrastructure do you think libraries, archives and museums need to have to be able to provide capability for machine scale analysis? Do you think they need to be developing that capacity on their own systems or relying on third party systems and platforms?

Lisa: The two components are storage and compute capacity. When one thinks of digital preservation, storage is always considered but compute capacity is not always considered. Storage is necessary for preservation and the type of storage system influences access to the collection. Compute capacity is necessary for analysis. Building and maintaining the infrastructure or storage and compute can be expensive, so it doesn’t make much financial sense for each organization to develop it own their own.

One option would be a collaborative, shared system build and used by many organizations. This would allow the costs to be shared, avoid duplicative work and storing duplicate material, and – perhaps most importantly – maximize the number of people who have access to the collections.

Personally I believe a better option would be to utilize existing third party systems and platforms. This option avoids the cost of developing custom systems and often makes it easier to maintain or alter the system as there is a greater pool of technologists familiar with the popular third party platforms.

I am a strong believer in public cloud platforms is because there is no upfront cost for the hardware, no need to maintain or replace hardware, and one only pays for the storage and compute that is used. I think it would be wonderful to see more libraries, museums, and archives storing copies of their collections on public cloud platforms in order to increase access. The most interesting use of your data may be thought of by someone outside your organization and the more people who can access the data, the more minds can work to find valuable insight within your data.

Add a Comment

This blog is governed by the general rules of respectful civil discourse. You are fully responsible for everything that you post. The content of all comments is released into the public domain unless clearly stated otherwise. The Library of Congress does not control the content posted. Nevertheless, the Library of Congress may monitor any user-generated content as it chooses and reserves the right to remove content for any reason whatever, without consent. Gratuitous links to sites are viewed as spam and may result in removed comments. We further reserve the right, in our sole discretion, to remove a user's privilege to post content on the Library site. Read our Comment and Posting Policy.

Required fields are indicated with an * asterisk.