Repositories: Not Just About Publications Any More

Not that repositories ever really were only about published scholarly output, but for some organizations that was the easiest first bar to reach.  But at Open Repositories 2012, it was clear that the bar has been raised.

Presenting on big data and digital collection repository development at LC, photo by eurovision_nicola at

OR2012, held at the University of Edinburgh from July 9-13, 2012, had over 480 registered attendees from over 40 countries.  The participants included developers, librarians, library administrators and service providers.  The topics were, as usual, quite wide-ranging, including calls for increased open access to scholarly publications,  introductions to core repository technologies, presentations on new types of repository services, the need for name identifiers/disambiguation and the ever-popular developer challenge.  The entire conference was live-blogged, which provided some remarkable coverage.  And there is a tweet archive for the tag #or2012. Oh, and there is a Flickr pool.

But if there was one word that was woven into almost every presentation, it was this:  DATA.

Which made me very happy.  Because anyone who reads my posts here or has seen me speak anywhere in the past 9 months knows that I go on and on about two things: Library, Archive, and Museum collections are now being mined as data by researchers, which requires new management strategies and new self-serve services.  And, consequently, cultural institutions all have big data and need different IT infrastructures for processing and serving these collections.  And I said as much in my presentation at OR2012 (check out the live blog post from the session I spoke in).

Just about everyone was discussing RDM, or Research Data Management. It has become clear that institutional repositories must manage not only scholarly publications but also the data that was created through observation and experimentation or collected and published, in order to support the “re-” activities: review, reuse, replicability and reproducibility.  RDM platforms are needed to help researchers capture, share and publish their datasets.  The public-facing discovery infrastructure is but a small part of this effort: the greater need and effort lie in capturing data from the original instruments and formats, and in transferring and documenting datasets reliably enough to support a forensic level of authenticity for future researchers.  The Digital Curation Centre has a great blog post reviewing some of the sessions on this topic.

Piped into the OR 2012 Reception, photo by eurovision_nicola at

Another word which was everywhere was “identifiers.”  Disambiguation of researchers/authors has been a known issue since the earliest Institutional Repositories, where one publication might be by “Leslie Johnston,” and another by “L. Johnston,” and another by “Leslie  L. Johnston,” depending upon the publication’s stylebook. Is that the same person to someone searching for all my publications?  ORCID is the more mature service in the assignment and resolution of identifiers, but the status of ISNI, aka ISO 27729, was also presented.  There will likely never be a single unique identifier, as there are these two international services, national services, and institutional services.  The catch will be crosswalking between all the identifiers.  The same can be said for article or item identifiers, such as the DOI, which has a high level of buy-in in the publishing realm, but uneven adoption for other types of objects.

There was also a lot of discussion about scale.  Not a lot of solutions yet, but a lot of discussion.  I heard some very promising presentations about optimization for Solr and the use of NoSQL databases.

Linked Open Data was, unsurprisingly, a topic of discussion.  The opening plenary by Cameron Neylon from the Public Library of Science very much emphasized this point: “It’s about links; it’s about connectedness.”  And it’s not just linking between objects and repositories, but synchronization between them. Some very interesting early work was presented on the Webtracks and ResourceSync projects.

A number of NDIIPP partners presented their projects at Open Repositories, including DuraCloud and Chronopolis.

I can never say enough about the Developer’s Challenge at OR, where developers pitch ideas, refine them, and often develop code in just a few hours.  DevCSI, which sponsored the challenge, covered the event and the winners.

I would encourage anyone interested in any of these topics to read through the comprehensive session live-blogging, and to check out videos on the OR2012 YouTube channel.  My own session is apparently missing, due to unrecoverable file corruption (really).
