In separate “big data” presentations at the Digital Preservation 2012 meeting, Myron Gutmann of the National Science Foundation and Leslie Johnston of the Library of Congress described scenarios that seemed futuristic and fantastic but were in fact present-day realities. Both presenters spoke about researchers using powerful new processing tools to distill information from massive pools of data.
Imagine, say, a researcher gathering social-science and health information about a troubled area of Africa. Using her home computer she connects online to a high-performance processing station, which, in turn, accesses repositories of medical, atmospheric, political and demographic data. She analyzes the data, using special visualization tools, arrives at some fact-based conclusions and generates several possible courses of action.
Professional researchers, particularly in the scientific community, can do that now. And it won’t be long before advanced research capabilities such as filtering and analyzing data on a large scale, known as data-driven research, are available outside of the professional-research community.
Gutmann is fervent about the possibilities of data-driven research and about how it is revolutionizing scientific exploration and engineering innovations. He said that data is now gathered at an ever-increasing rate from a range of sources, and virtualization and advanced server architectures will enable complex data mining, machine learning and automatic extraction of new knowledge. Gutmann said, “We want to increasingly make use of ‘data that drives our questions’ as well as ‘questions that drive our data.’”
Johnston, chief of repository development for the Library of Congress, is equally enthused about data-driven research. But as knowledgeable as she is about technology, she cautiously balances the potential of data-driven research against mundane practicality.
She notes that researchers may have to work with “big data” where it resides because it is often so massive it is impractical to download. “Data can only move so fast,” said Johnston. “It’s governed by the laws of physics, which dictate that you can’t move things any faster than a certain rate…unless it’s massively parallel. The two most consistent ways that people move data around are still on hard drives or on larger-scale racked systems. High-performance networks, like Internet 2 or LambdaRail, help but such broadband connections between the data communities might not be there yet.”
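Johnston’s point about transfer limits can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the 100-terabyte collection size and the link speeds are assumptions chosen for the example, not figures from either presentation, and the function name `transfer_days` is hypothetical.

```python
def transfer_days(size_terabytes: float, link_gbps: float) -> float:
    """Estimate days needed to move a dataset over a network link.

    Assumes decimal units (1 TB = 10**12 bytes, 1 Gbps = 10**9 bits/s)
    and a link that runs at its full rated speed the whole time, which
    real networks rarely do.
    """
    bits = size_terabytes * 1e12 * 8          # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)        # bits / (bits per second)
    return seconds / 86_400                   # seconds -> days


# Assumed example: a 100 TB collection over three assumed link speeds.
for label, gbps in [("100 Mbps", 0.1), ("1 Gbps", 1.0), ("10 Gbps", 10.0)]:
    print(f"{label}: {transfer_days(100, gbps):.1f} days")
```

Even under these idealized assumptions, a 100 TB download over a 1 Gbps link takes more than nine days, which is why shipping hard drives or analyzing the data where it sits can be the more practical choice.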
Johnston cites examples of researchers who came to the Library of Congress to do sophisticated research on large quantities of data, and of what they intended to do with it. Take the Library’s Web archives, for instance. Johnston said, “When we began archiving election web sites, we imagined users browsing through the web pages, studying the graphics or use of phrases or links. But when our first researchers came to the Library, they wanted to mine the full corpus. And with the Chronicling America collection, which has 5 million page images from historic newspapers, some researchers want to mine for trends across time periods and geographic areas.”
She said that the Library has a lot to consider in accommodating data researchers. “How much ingest processing should be done with data collections?” she said. “Do we process collections to create a variety of derivatives that might be used in various forms of analysis before ingesting them? Do we load collections into analytical tools?
“If we decide that we will simply provide access to data, do we limit it to the native format or provide pre-processed or on-the-fly format transformation services for downloads? Can we handle the download traffic? Do we provide guidance to researchers in using analytical tools? Or do we leave researchers to fend for themselves?”
I asked Johnston to set aside the practices at the Library of Congress for a moment and help me flesh out the fantasy scenario of the super researcher that opens this post. How could such a researcher access many disparate databases? “She’d probably have to deal with many different organizations,” Johnston said, “because ‘big data’ tends to be discipline-based. There is astronomy. There is high-energy physics, earthquake engineering, mathematics. And everybody is sticking to their own discipline, in part because those disciplines have their own shared practices. It is easier to coalesce around a similar data repository.
“And it’s challenging to find where the data resides, to know who has it and where it is. And then it’s challenging to combine it, because everyone’s got a different schema and different standards. Every experiment is run on a different set of instruments or sensors. How do you combine that? It’s another new set of skills and services. How will researchers and scholars be trained to work with combining data and creating new knowledge?”
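To illustrate the kind of work Johnston describes, here is a minimal sketch of reconciling two records that describe the same sort of measurement under different schemas, units and date formats. Everything in it is hypothetical: the field names, the sample values and the helper functions `from_a` and `from_b` are invented for the example and do not come from any real repository.

```python
from datetime import datetime

# Hypothetical records from two repositories with different conventions.
station_a = {"site": "A-01", "temp_f": 68.0, "date": "07/24/2012"}
station_b = {"location_id": "B-17", "temperature_c": 21.5, "observed": "2012-07-24"}


def from_a(rec: dict) -> dict:
    """Map repository A's schema (Fahrenheit, MM/DD/YYYY) to a shared one."""
    return {
        "site_id": rec["site"],
        "temperature_c": round((rec["temp_f"] - 32) * 5 / 9, 2),
        "date": datetime.strptime(rec["date"], "%m/%d/%Y").date().isoformat(),
    }


def from_b(rec: dict) -> dict:
    """Map repository B's schema (already Celsius, ISO dates) to the shared one."""
    return {
        "site_id": rec["location_id"],
        "temperature_c": rec["temperature_c"],
        "date": rec["observed"],
    }


# Once both records share field names, units and date format,
# they can be analyzed side by side.
combined = [from_a(station_a), from_b(station_b)]
for row in combined:
    print(row)
```

Each new source needs its own mapping like these, and someone has to know enough about the instruments and the discipline to write it correctly, which is part of the “new set of skills and services” Johnston mentions.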
Working with “big data” will increasingly require high-performance computational resources. Johnston said it could be a cluster or a series of parallel computers. The more important thing is that the researcher needs massive computation power to work with a large volume of data. It doesn’t have to reside on her laptop; she can access it online through telepresence. “Or maybe regional, discipline-based high-performance computing centers that permit access,” said Johnston. “NEES (the Network for Earthquake Engineering Simulation) is a good example of that. They have a service where researchers uploaded data into their data repositories for preservation but also for reuse. And they have telepresence computing where you can use the data where it sits on their machine and do some of the analysis and visualization right there, using their computing resources without ever bringing the data locally.”
Johnston points out that financial considerations could impede widespread availability of data-driven research. “Who funds the access for researchers?” she said. “Does NSF fund it? Does the White House fund it? Does NEH fund it? Are the research universities funding it for themselves? And who’s paying for the research into new technologies? Are there better conductors? Are there better cables? Are there faster speeds possible? Are there different types of storage media? Plus somebody has to support it and someone needs to manage it.”
Another crucial element for such research is trust: validation, authentication and security. There needs to be a trusted relationship between the user, the processing station and the place where the data resides. And ultimately the data itself must be vetted.
Training will also be necessary for advanced levels of data-driven research. Researchers need to know where the data is, how to use the tools to get the data, how to combine the data if it resides in different repositories and how to analyze the data. Gutmann calls for a new generation of tools. He said, “We need more tools beyond statistics, geographical information systems, network analysis, modeling and scenarios.”
Johnston adds that the tools also need to be easier to use. She also sees an enticing, if complex, road ahead. “How far we can push research using big data depends on entwined factors. It’s about policy and funding, and also about storage, bandwidth and processing.”
She laughs when asked when we can expect all these issues to fall into place. “There’s no such thing as done!” she exclaimed. “It’s an exciting challenge because these sorts of issues are never completely solved. The situation is constantly evolving.”