We hear a constant stream of news about how crunching massive data collections will change everything from soup to nuts. Here on The Signal, it’s fair to say that scientific research data is close to the heart of our hopes, dreams and fears when it comes to big data: we’ve written over two dozen posts touching on the subject.
In the context of all this, it’s exciting to see some major projects getting underway that have big data stewardship closely entwined with their efforts. Let me provide two examples.
The Registry of Data Repositories seeks to become a global registry of “repositories for the permanent storage and access of data sets” for use by “researchers, funding bodies, publishers and scholarly institutions.” The activity is funded by the German Research Foundation through 2014 and currently has 400 repositories listed. With the express goal of covering the complete data repository landscape, re3data.org has developed a typology of repositories that complements existing information offered by individual institutions. The aim is to offer a “systematic and easy to use” service that will strongly enhance data sharing. Key to this intent is a controlled vocabulary that describes repository characteristics, including policies, legal aspects and technical standards.
In a bow to the current trend for visual informatics, the site also offers a set of icons with variable values that represent repository characteristics. The project sees the icons as helpful to users and as a way to help repositories “identify strengths and weaknesses of their own infrastructures” and keep the information up to date.
I really like this model. It hits the trifecta in appealing to creators who seek to deposit data, to users who seek to find data and to individual repositories that seek to evaluate their characteristics against those of their peers. It remains to be seen if it will scale and if it can attract ongoing funding, but the approach is elegant and attractive.
The second example is ELIXIR, an initiative of the EMBL European Bioinformatics Institute. ELIXIR aims to “orchestrate the collection, quality control and archiving of large amounts of biological data produced by life science experiments,” and “is creating an infrastructure, a kind of highway system, that integrates research data from all corners of Europe and ensures a seamless service provision that is easily accessible to all.”
This is a huge undertaking and has the support of many nations that are contributing millions of dollars to build a “hub and nodes” network. It will connect public and private bioscience facilities throughout Europe and promote shared responsibility for biological data delivery and management. The intention is to provide a single interface to hundreds of distributed databases and a rich array of bioinformatics analysis tools.
ELIXIR is a clear demonstration of how a well-articulated need can drive massive investment in data management. The project has a well-honed business case that presents an irresistible message. “Biological information is of vital significance to life sciences and biomedical research, which in turn are critical for tackling the Grand Challenges of healthcare for an ageing population, food security, energy diversification and environmental protection,” reads the executive summary. “The collection, curation, storage, archiving, integration and deployment of biomolecular data is an immense challenge that cannot be handled by a single organisation.” This is what the Blue Ribbon Task Force on Sustainable Digital Preservation and Access termed “the compelling value proposition” needed to drive the enduring availability of digital information.
As a curious aside, it’s worth noting that projects such as ELIXIR may have an unexpected collateral impact on data preservation. Ewan Birney, a scientist and administrator working on ELIXIR, was so taken with the challenge of what he termed “a 10,000 year archive” holding a massive data store that he and some colleagues (over a couple of beers, no less) came up with a conjecture for how to store digital data using DNA. The idea was sound enough to merit a letter in Nature, “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA.” So, drawing the attention of bioinformaticians and other scientists to the digital preservation challenge may well lead to stunning leaps in practices and methods.
Perhaps one day the biggest of big data can even be reduced to the size of a bowl of alphabet soup or a bowl of mixed nuts!