In separate “big data” presentations at the Digital Preservation 2012 meeting, Myron Gutmann of the National Science Foundation and Leslie Johnston of the Library of Congress described scenarios that seemed futuristic and fantastic but were in fact present-day realities. Both presenters spoke about researchers using powerful new processing tools to distill information from massive pools of data.
Imagine, say, a researcher gathering social-science and health information about a troubled area of Africa. Using her home computer she connects online to a high-performance processing station, which, in turn, accesses repositories of medical, atmospheric, political and demographic data. She analyzes the data, using special visualization tools, arrives at some fact-based conclusions and generates several possible courses of action.
Professional researchers, particularly in the scientific community, can do that now. And it won’t be long before advanced research capabilities such as filtering and analyzing data on a large scale — or data-driven research — will be available outside of the professional-research community.
Gutmann is fervent about the possibilities of data-driven research and about how it is revolutionizing scientific exploration and engineering innovations. He said that data is now gathered at an ever-increasing rate from a range of sources, and virtualization and advanced server architectures will enable complex data mining, machine learning and automatic extraction of new knowledge. Gutmann said, “We want to increasingly make use of ‘data that drives our questions’ as well as ‘questions that drive our data.’”
Johnston, chief of repository development for the Library of Congress, is equally enthused about data-driven research. But as knowledgeable as she is about technology, she cautiously balances the potential of data-driven research against mundane practicality.
She notes that researchers may have to work with “big data” where it resides because it is often so massive it is impractical to download. “Data can only move so fast,” said Johnston. “It’s governed by the laws of physics, which dictate that you can’t move things any faster than a certain rate…unless it’s massively parallel. The two most consistent ways that people move data around are still on hard drives or on larger-scale racked systems. High-performance networks, like Internet 2 or LambdaRail, help but such broadband connections between the data communities might not be there yet.”
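To put rough numbers on Johnston’s point, the following is a minimal sketch that estimates how long a dataset would take to cross a network link. The 100-terabyte dataset, the two link speeds and the 70 percent efficiency factor are all illustrative assumptions, not figures from either presentation.

```python
def transfer_time_hours(size_terabytes: float, link_gbps: float,
                        efficiency: float = 0.7) -> float:
    """Estimate the hours needed to move a dataset over a network link.

    All inputs are hypothetical. `efficiency` is an assumed fraction of the
    nominal link rate that is actually achieved end to end.
    """
    if size_terabytes < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link rate > 0, efficiency in (0, 1]")
    size_bits = size_terabytes * 1e12 * 8          # decimal terabytes to bits
    effective_bps = link_gbps * 1e9 * efficiency   # usable bits per second
    return size_bits / effective_bps / 3600        # seconds to hours


if __name__ == "__main__":
    # Assumed scenarios: a home connection versus a research network link.
    scenarios = [("100 Mbps home connection", 0.1),
                 ("10 Gbps research network", 10.0)]
    for label, gbps in scenarios:
        hours = transfer_time_hours(100, gbps)
        print(f"100 TB over a {label}: about {hours:,.0f} hours")
```

Under these assumptions the home connection needs roughly four months of continuous transfer while the research network needs about a day and a half, which is why shipping hard drives or working with the data where it sits can still be the practical choice.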
Johnston cites examples of researchers coming here to the Library of Congress to do sophisticated research on large quantities of data, and of how their plans for that data differed from what the Library had expected. Take the Library’s Web archives, for instance. Johnston said, “When we began archiving election web sites, we imagined users browsing through the web pages, studying the graphics or use of phrases or links. But when our first researchers came to the Library, they wanted to mine the full corpus. And with the Chronicling America collection, which has 5 million page images from historic newspapers, some researchers want to mine for trends across time periods and geographic areas.”
She said that the Library has a lot to consider in accommodating data researchers. “How much ingest processing should be done with data collections?” she said. “Do we process collections to create a variety of derivatives that might be used in various forms of analysis before ingesting them? Do we load collections into analytical tools?
“If we decide that we will simply provide access to data, do we limit it to the native format or provide pre-processed or on-the-fly format transformation services for downloads? Can we handle the download traffic? Do we provide guidance to researchers in using analytical tools? Or do we leave researchers to fend for themselves?”
I asked Johnston to set aside the practices at the Library of Congress for a moment and help me flesh out the fantasy scenario of the super researcher that opens this post. How could such a researcher access many disparate databases? “She’d probably have to deal with many different organizations,” Johnston said, “because ‘big data’ tends to be discipline-based. There is astronomy. There is high-energy physics, earthquake engineering, mathematics. And everybody is sticking to their own discipline, in part because those disciplines have their own shared practices. It is easier to coalesce around a similar data repository.
“And it’s challenging to find where the data resides, to know who has it and where it is. And then it’s challenging to combine it, because everyone’s got a different schema and different standards. Every experiment is run on a different set of instruments or sensors. How do you combine that? It’s another new set of skills and services. How will researchers and scholars be trained to work with combining data and creating new knowledge?”
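The combining problem Johnston describes can be sketched in miniature. The example below assumes two hypothetical repositories that record the same kind of measurement under different field names and units, and maps both into one shared schema before merging. Every record, field name and converter here is invented for illustration; real repositories would need far more careful mapping and documentation of provenance.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical records: the same kind of observation, described differently.
REPO_A = [{"station": "A1", "temp_c": 21.5, "date": "2012-07-24"}]
REPO_B = [{"site_id": "B7", "temperature_f": 70.7, "observed": "2012-07-24"}]


def from_repo_a(rec: Dict) -> Dict:
    """Map a repository A record to the shared schema (already in Celsius)."""
    return {"site": rec["station"], "date": rec["date"],
            "temp_c": rec["temp_c"], "source": "repo_a"}


def from_repo_b(rec: Dict) -> Dict:
    """Map a repository B record to the shared schema, converting F to C."""
    celsius = (rec["temperature_f"] - 32) * 5 / 9
    return {"site": rec["site_id"], "date": rec["observed"],
            "temp_c": round(celsius, 1), "source": "repo_b"}


def combine(sources: List[Tuple[List[Dict], Callable[[Dict], Dict]]]) -> List[Dict]:
    """Apply each source's converter and return one list in the shared schema."""
    combined: List[Dict] = []
    for records, convert in sources:
        combined.extend(convert(rec) for rec in records)
    return combined


if __name__ == "__main__":
    for row in combine([(REPO_A, from_repo_a), (REPO_B, from_repo_b)]):
        print(row)
```

Keeping a `source` field on every combined record is a deliberate choice in this sketch: it lets a researcher trace each value back to the repository and instrument that produced it.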
Working with “big data” will increasingly require high-performance computational resources. Johnston said it could be a cluster or a series of parallel computers. The more important thing is that the researcher needs massive computation power to work with a large volume of data. It doesn’t have to reside on her laptop; she can access it online through telepresence. “Or maybe regional, discipline-based high-performance computing centers that permit access,” said Johnston. “NEES (the Network for Earthquake Engineering Simulation) is a good example of that. They have a service where researchers uploaded data into their data repositories for preservation but also for reuse. And they have telepresence computing where you can use the data where it sits on their machine and do some of the analysis and visualization right there, using their computing resources without ever bringing the data locally.”
Johnston points out that financial considerations could impede widespread availability of data-driven research. “Who funds the access for researchers?” said Johnston. “Does NSF fund it? Does the White House fund it? Does NEH fund it? Are the research universities funding it for themselves? And who’s paying for the research into new technologies? Are there better conductors? Are there better cables? Are there faster speeds possible? Are there different types of storage media? Plus somebody has to support it and someone needs to manage it.”
Another crucial element for such research is trust: validation, authentication and security. There needs to be a trusted relationship between the user, the processing station and the place where the data resides. And ultimately the data itself must be vetted.
Training will also be necessary for advanced levels of data-driven research. Researchers need to know where the data is, how to use the tools to get the data, how to combine the data if it resides in different repositories and how to analyze the data. Gutmann calls for a new generation of tools. He said, “We need more tools beyond statistics, geographical information systems, network analysis, modeling and scenarios.”
Johnston adds that the tools also need to be easier to use. She also sees an enticing, if complex, road ahead. “How far we can push research using big data depends on entwined factors. It’s about policy and funding, and also about storage, bandwidth and processing.”
She laughs when asked when we can expect all these issues to fall into place. “There’s no such thing as done!” she exclaimed. “It’s an exciting challenge because these sorts of issues are never completely solved. The situation is constantly evolving.”
“Imagine, say, a researcher gathering social-science and health information about a troubled area of Africa. Using her home computer she connects online to a high-performance processing station, which — in turn — accesses repositories of medical, atmospheric, political and demographic data.”
Since she knows next to nothing about the reliability of the data, data gathering procedures, subtleties of the interpretation of these data etc., she quickly makes a number of grand “discoveries”. Her results are even less reliable than the least reliable dataset used. Her work is useless, her life becomes miserable…
Natasha, thank you for your vivid response. I had neglected to address that point and, in response to your comment, I added this paragraph:
“Another crucial element for such research is trust: validation, authentication and security. There needs to be a trusted relationship between the user, the processing station and the place where the data resides. And ultimately the data itself must be vetted.”
You analyze “big data,” and
you think it will be important from now on for identifying civic trends.
But what does data analysis actually mean?
Foreseeing this is important human work.
A supercomputer performs the data analysis.
But how do you interpret these data?
It is that judgment that makes the data important.
I think this criterion calls for a serious debate.
According to the presentation, the name is Mhyron Gutmann.
True. But that’s a mis-spelling in the presentation. Visit the NSF pages at http://1.usa.gov/NpWvPl and http://1.usa.gov/NpWF9f. They spell Dr. Gutmann’s name without the “H.”
There is plenty of attention on gee-whiz data crunching analysis, but there is little investment in data repositories for most disciplines, and most researchers have little incentive to deposit data to them – until funders like NSF and NIH require it – which is the trend in Europe (and then it will be an unfunded mandate) – but I’m optimistic
Leaving aside the first one, the second speaker was also compelling. She mentioned that what LOC thought researchers wanted to do and what they actually wanted to do were vastly disparate. How do we understand this research trend in ways that allow us to properly work with big data researchers? And how do we stay at the front of the curve?
To me so much of the big data talk in the information community is a day late and a dollar short; I’ve been hearing about it from potential employers for over a year. It seems we’re scrambling to catch up rather than being at the leading edge. We sniff at usage rather than saying, hmm, how do we make sure they’re getting the right stuff to draw conclusions from rather than pooh-poohing the method?
I want to help make this.
Currently, I’m enrolled in Udacity’s Deep Learning nanodegree program that just started. I am also learning about Elasticsearch, an open source search engine (supported by a commercial company that offers analytics and other services for it on a monthly basis). It’d be neat if we could make a huge community-based p2p search engine (open to anyone who has a good computer and a decent internet connection) where anyone could both run a shard and search against the crowd.