Preserving Online Science: Reflections

50 years from now, what web content from today will be invaluable for understanding science in our age? What kinds of uses do you imagine this science content could serve? Lastly, where are the natural curatorial homes for this online content and how can we work together to collect, preserve, and provide access to science on the web? These were the three principal questions up for discussion at Science at Risk: Toward a National Strategy for Preserving Online Science, a recent NDIIPP summit. Thanks to generous support from the Alfred P. Sloan Foundation, we were able to invite a small but diverse set of science bloggers, representatives from citizen science projects, and individuals working on innovative online science publications to share their work with archivists, librarians, curators, and historians from a diverse array of cultural heritage organizations, and to work through these questions together.

I had a lot of fun putting this summit together, and we will be working with some of the participants on a resulting report. With that said, I thought I would take this chance to share some links to reactions from workshop participants and offer up some of my initial reactions here as grist for further discussion. Anthony Salvagno, a 5th year Physics PhD student at the University of New Mexico and a practitioner of Open Notebook Science, shared his responses to an informal set of questions we asked all the participants on his blog. Bora Zivkovic, the Blogs Editor at Scientific American, shared some of his thoughts in Science Blogs – definition, and a history. I’ve also blogged about some of these issues here before. Aside from these posts, this tweetdoc has 19 pages of tweets (and a lot of great links) from the first day of the meeting.

Defining Science Blogging is its Own Challenge
Bora provided a nice insider’s history and description of how science blogging has developed, and in particular, how a range of aggregators and blogging networks have come to play a role in vetting what counts inside the community as a science blog. Sites such as Nature Blogs are actually being used as a new kind of metric for understanding the impact of scientific research. For example, see the trackbacks and links to posts from Research Blogging about the essay Why Most Published Research Findings Are False from PLoS Medicine. These blogging networks could be ideal ways to capture and preserve science blogs. They represent organized and vetted collections by design. However, defining science blogs by these aggregators might be overly restrictive.

From some of our conversations at the meeting, it strikes me that there are several different areas of science blogging. In my mind, most of these blogs fall into three categories: blogs of scientists, blogs about science, and blogs focusing on divisive issues relating to science and public policy (anti-vaccination and creation science blogging, for example). If the aggregators and blogging networks were chosen as the primary tools for building lists of science blogs to collect, one would likely get a lot of blogs in the first two categories but miss a lot of the blogging around science conflicts. In light of this, I would suggest that cultural heritage organizations considering collecting science blogs think broadly about what to collect. Ultimately, I would hazard a guess that many of the blogs representing each side of these divisive issues are going to be some of what future historians will be most interested in, and that they are likely the most at risk of being lost as ephemera.

Citizen Science Discussion Forums are Valuable and at Risk

After presentations and discussions about citizen science projects from the Cornell Lab of Ornithology and the Zooniverse, aside from being wowed by the sophistication of these projects, I think it was clear that there is a part of their work that matters for the historical record but is not necessarily what is most valuable to the projects themselves. In both cases, these organizations are highly committed to actively managing and migrating the scientific data they are collecting, but the community interactions and discussions that resulted in the creation of that data are not nearly as critical to the citizen science projects’ missions.

For example, aside from the data sets generated by a project like Galaxy Zoo, something like the Galaxy Zoo forums records how that information was produced and the interactions within the project’s user community. This forum has hundreds of thousands of posts in discussion threads about the project, about teaching with the project, and about images that individuals find to be particularly stunning. The free-form discussions here are a rich place for exploring how people are using, reacting to, and discussing the project. Beyond that, discussions in the forums have actually resulted in the discovery of a completely new kind of galaxy by forum participants, and of Hanny’s Voorwerp, a new astronomical object named for the Dutch school teacher who discovered it. This kind of content is a rich resource for future historians for understanding science in our times and understanding these kinds of projects; however, it is, by necessity, a lower-priority kind of project data for the citizen science projects themselves.

One of my takeaways from the discussion was that these kinds of community interaction and discussion components attached to various Citizen Science projects represent a clearly valuable and clearly at risk set of online science content for cultural heritage organizations to consider collecting.

Documentation of Public Understanding of Science and Science Policy is Similarly at Significant Risk

The discussion forums of Citizen Science projects point to something that I think might be a broader issue. It strikes me that what is much more likely to be at risk here is the ephemeral: the kinds of web content that record interesting information about the presentation of science and about discussions of science on the open web, the things that are least like journal articles or data sets. (Don’t get me wrong, data sets, particularly smaller data sets, are also at-risk content.) One might include everything from discussions of a science memorial on Yelp, to discussions of evolution in the forums for a videogame like Spore, to the diverse deluge of content that shows up in the range of science reddits.

So those are some of my preliminary thoughts. What do you think? If other participants from the workshop want to share comments or links to where they have blogged elsewhere, please do so. Beyond that, I would love to hear your responses to the questions we posed to participants. 50 years from now, what kinds of online science content will be invaluable for understanding science in our age? What kinds of uses do you imagine this science content could serve? Lastly, where are the natural curatorial homes for this online content and how can we work together to collect, preserve, and provide access to science on the web?
