Dodge that Memory Hole: Saving Digital News

Newspapers are some of the most-used collections at libraries. They have been carefully selected and preserved and represent what is often referred to as “the first draft of history.” Digitized historical newspapers provide broad and rich access to a community’s past, enabling new kinds of inquiry and research. However, these kinds of resources are at risk of being lost to future users. Networked digital technologies have changed how we communicate with each other and have rapidly changed how information is disseminated. These changes have had a drastic effect on the news industry, disrupting delivery mechanisms, upending business models and dispersing resources across the World Wide Web.

Current library acquisition and preservation methods for news are closely linked to the physical newspaper. Ensuring that the new modes of journalism, which are moving toward a “digital- and mobile-first” model, are captured and preserved at libraries and other memory institutions is the main goal of the Dodging the Memory Hole series of events. The first was organized in November 2014 by the Reynolds Journalism Institute at the University of Missouri.  The most recent took place in May of 2015 and was organized by the Educopia Institute at the Charlotte Mecklenburg Public Library in Charlotte, NC.

Hong Kong, 31st day of the Umbrella Revolution, taken October 28, 2014 by Pasu Au Yeung

I had the opportunity to close out the May meeting and highlight areas where continued work would have an impact in helping libraries collect, preserve and provide access to born-digital news. A (slightly longer but hopefully clearer) version of my talk (pdf) is below.

I want to start with a photograph from last year’s protest in Hong Kong known as the Umbrella Revolution. The picture speaks to the complexity of the problem we face in capturing and preserving the news of today. The protest was unique in that it was one of the first protests in China organized, sustained and broadcast via social media. Capturing a diverse set of materials about this news event would mean capturing the stories from established media companies and the writings and images from individual blogs and other social media. This is especially important in the case of the Umbrella Revolution because official media outlets (and social media accounts) in China are often censored. This protest was also an example of how activism in general has adapted due to networked digital technologies. Future researchers studying social and political movements happening right now would never get the whole story without access to the social media.

The role of the journalist is to get the story out, and like other publishers in the digital age, journalists have had to adapt to stay relevant. Digital storytelling is becoming more dynamic, exemplified by publications like Highline, a new long-form product from Huffington Post that is richly illustrated with audio and visual elements and is translated into a variety of languages. We can expect that, in the pursuit of getting the story out and advancing storytelling, news content will come from more sources, be more dynamic and continue using all kinds of formats and distribution mechanisms.

Memory hole.

Libraries have also been transformed by digital technologies. There are a large number of digitized collections; we are creating vast and rich resources and, I think, providing great access and good stewardship to a large amount of this digitized content. Chronicling America and the Digital Public Library of America are great examples of this. However, there are gaps–or holes–in our collections, especially the born-digital content about contemporary events. Libraries haven’t broadly adapted their collecting practices to the current publishing environment, which today is dominated by the web.

Several people at this meeting mentioned the study done by Andy Jackson (ppt) at the British Library. I have his permission to share these slides, which he presented at the recent General Assembly of the International Internet Preservation Consortium. It is a simple but powerful study of ten years’ worth (2004-2014) of content from the UK Web Archive. It aims to find out what they have in their archive that is no longer on the live web. He looked at a sample of URLs per year and compared the archived content at each URL with what is at the same URL on the live web today. He broke down and color-coded the URLs according to a percentage scale expressing whether the content was moved, changed, missing or gone. He found that after one year half of the content was either gone or had been changed so much as to be unrecognizable. After ten years almost no content still resides at its original URL. This analysis was done across all domains, but it is reasonable to assume that news content wouldn’t fare any better if subjected to the same type of analysis.
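To make the idea behind this kind of comparison concrete, here is a minimal sketch in Python. It is not the method used in the British Library study; the `LiveCheck` record, the category definitions and the 0.5 similarity threshold are all assumptions made for illustration. It takes the archived text of a page plus the result of re-fetching the same URL today, and sorts the URL into one of the buckets described above.

```python
from collections import Counter
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class LiveCheck:
    """Result of re-fetching an archived URL on the live web (hypothetical shape)."""
    status: Optional[int]     # HTTP status code, or None if the host did not respond
    text: str = ""            # body text returned by the live site
    redirected: bool = False  # True if the request ended up at a different URL


def classify(archived_text: str, live: LiveCheck, threshold: float = 0.5) -> str:
    """Sort one URL into a bucket; the bucket definitions are assumptions."""
    if live.status is None:
        return "gone"       # nothing answers at this address any more
    if live.status >= 400:
        return "missing"    # the site answers, but the page is not there
    if live.redirected:
        return "moved"      # the content now lives at another URL
    # Compare archived and live text; ratio() is 1.0 for identical strings.
    similarity = SequenceMatcher(None, archived_text, live.text).ratio()
    return "unchanged" if similarity >= threshold else "changed"


def summarize(sample):
    """Return the share of each bucket for a list of (archived_text, LiveCheck) pairs."""
    counts = Counter(classify(archived, live) for archived, live in sample)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

Running `summarize` over a sample of URLs drawn from each crawl year would give a per-year breakdown similar in spirit to the color-coded chart, though a real study would need a more careful similarity measure and handling of pages that return an error page with a success status.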

Fifty percent of URLs in the UK Web Archive have lost or missing content after one year. After ten years nearly all content is moved, changed, missing or gone. Credit: Taken from a presentation given by Andy Jackson at the IIPC GA, Apr 27, 2015. The full presentation is available at netpreserve.org.

We have clear data that if content is not captured from the web soon after its creation, it is at risk. Which brings me to where I think our main challenge is with collecting born-digital news: library acquisition policies and practices. Libraries collect the majority of their content by buying something–a newspaper subscription, a standing order for a serial publication, a package of titles from a publisher, an access license from an aggregator, etc. The news content that’s available for purchase and printed in a newspaper is a small subset of the content that’s created and available online. Videos, interactive graphs, comments and other user-generated data are almost exclusively available online. The absence of an acquisition stream for this content puts it at risk of being lost to future library and archives users.

Establishing relationships (and eventually agreements) with the organizations that create, distribute and own news content is one of the more promising strategies for libraries to collect digital news content. Brian Hocker from KXAS-TV, an NBC affiliate in the Dallas area, shared the story of how KXAS partnered with the University of North Texas Libraries to digitize, share and ultimately preserve their station’s video archives as part of the Portal for Texas History. Jim Kroll from the Denver Public Library also shared his story of acquiring the archives of the Rocky Mountain News after the newspaper ceased publication. Both stories emphasized the importance of establishing lasting relationships with decision-makers from news outlets in their respective communities. They also each created donor agreements that provided community access to the news archives, which can serve as models for future agreements.

The relationships that enabled these agreements were the result of what I think of as entrepreneurial collection development in the model of acquiring special collections. The archives were pursued actively and over time; they represented a new type of content, required a new type of relationship with a donor and were a good fit–both geographically and topically–with existing collections at UNT and DPL.

Web archiving is another promising strategy to capture and preserve born-digital news. The Library of Congress recently announced its effort to save news websites, specifically those not affiliated with traditional news companies. Ben Walsh, creator of PastPages.org, announced that his service is now Memento-compliant, which will allow the archived front pages of websites from major-market newspapers that PastPages collects to be available in a Memento search. These projects will capture content at a national level, but hyper-local news sites, citizen journalism and other niche blogs–news that used to be published as community newsletters or pamphlets–are most likely not being captured. Internet Archive’s Archive-It service is a mechanism for smaller libraries to engage in web archiving and capture some of this unique content. Capturing the social media around news events continues to be challenging, but tools have been developed to capture tweets, and collections of tweets around news events are being captured and shared.

The Dodging the Memory Hole events have thus far been excellent opportunities to bring librarians, archivists, the news industry and technologists together to help save news content for future generations. Look for more from this group on awareness raising, studies on what news content has already been lost, collaborations with the developers of news content management systems, and more guidance on developing donation agreements. To read more about the event, check out Trevor Owens’ report on the IMLS blog.
