All the News That’s Fit to Archive

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.

The Library has had a web archiving program since the early 2000s.  As with other national libraries, the Library of Congress web archiving program started out harvesting the web sites of its national election campaigns, followed by collections harvesting sites for a period of time connected with events (for example, an Iraq War web archive and a papal transition 2005 web archive), along with collecting the sites of the U.S. House and Senate and the legislative branch of government more broadly.

An American of the 1930s getting his news by reading a newspaper. These days he'd likely be looking at a computer screen. Photo courtesy of the Library of Congress Prints and Photographs division.

The question for the Library of Congress of “what else” to harvest beyond these collections is harder to answer than one might think, because the Library's relatively small web archiving capacity (shaped by our permissions approach) is dwarfed by the immense scale of the Internet.  About six years ago we started a collection now known as Public Policy Topics, for which we acquire sites with content reflecting different viewpoints and research on a broad selection of public policy questions, including the sites of national political parties, selected advocacy organizations and think tanks, and other organizations with a national voice in America’s policy discussions that could be of interest to future researchers.  We are adding more sites to Public Policy Topics continuously.

Eventually I decided to include some news web sites that contained significant discussion of policy issues from particular points of view – sites ranging from DailyKos.com to Townhall.com, from TruthDig.com to Redstate.com.  We started crawling these sites on a weekly basis to try to assure complete capture over time and to build a representation of how each site looked as different news events came and went in the public consciousness (and on these web sites).  We have been able to assess the small number of such sites that we have crawled and have decided that the results are acceptable.  But this was obviously not a very large-scale effort compared to the increasing number of sites presenting general news on the Internet – for many people, their current equivalent of a newspaper.

Newspapers – they are a critical source for historical research and the Library of Congress has a long history of collecting and providing access to U.S. (and other countries’) newspapers.  Having started to collect a small number of “newspaper-like” U.S. news sites for the Public Policy Topics collection, I began a conversation with three reference librarian colleagues from the Newspaper & Current Periodical Reading Room – Amber Paranick, Roslyn Pachoca and Gary Johnson – about expanding this effort to a new collection, a “General News on the Internet” web archive.  They explained to me:

Our newspaper collections are invaluable to researchers.  Newspapers provide a first-hand draft of history.  They provide supplemental information that cannot be found anywhere else.  They ‘fill in the gaps,’ so to speak. The way people access news has been changing and evolving ever since newspapers were first being published. We recognized the need to capture news published in another format.  It is reasonable to expect us to continue to connect these kinds of resources to our current and future patrons. Websites tend to be ephemeral and may disappear completely.  Without a designated archive, critical news content may be lost.

In short, my colleagues shared my interest, concern and enthusiasm for starting a larger collection of Internet-only general news sites as a web archiving collection.  I’ll let them explain their thinking further:

When we first got started on the project, we weren’t sure how to proceed.  Once we established clear boundaries on what to include, what types of news sites would be within scope for this collection, our selection process became easier. We asked for help in finding websites from our colleagues. 

We felt it was important to include sites that focus on general news with a significant national presence where there are articles that have an author’s voice, such as HuffingtonPost.com or BuzzFeed.com (even as some of these sites also contain articles that are meant simply to attract visitors, so-called “click bait”).  We wanted to include a variety of sites that represent more cutting-edge ways of presenting general news, such as Vox.com and TheVerge.com, and we felt sites that focus on parody such as TheOnion.com were also important to have represented.  Of course, these sites are not the only sources from which people obtain their news, but we tried to choose a variety that included more trendy or popular sources as well as the conventional or traditional types.  Again, the idea is to assure future users have access to a significant representation of how Americans accessed news at this time using the Internet.

The Library of Congress has an internal process for proposing new web archiving collections.  I worked with Amber, Roslyn and Gary and they submitted a “General News on the Internet” project proposal and it was approved.  Yay!  Then the work began – Amber, Roslyn and Gary describe some of the hurdles:

We understand that archiving video content is a problem. We thought websites like NowThisNews.com could be great candidates, but because they contained so much video and used a kind of Tumblr-like portal as their entry point for news, we had to reject them.  Since we do not do “one hop out” crawling, the linked-to content that is the substantive content (i.e., the news) would be entirely missed.  Also, websites like Vice.com change their content so frequently that it might be impossible to capture all of it.

In addition, it was decided that the collection would not include general news sites associated primarily with other delivery vehicles, such as CNN.com or NYTimes.com.  Many sites of this type also have paywalls, which would limit what we could archive.

We also encountered another type of challenge with Drudgereport.com.  Since it is primarily a news aggregator, with most of the site consisting of links to news on other sites, it would be tough to capture those many links given the limitations in crawling (again, the “one hop” limitation – we don’t harvest links that lead to a different site).  In the end we decided to proceed in archiving The Drudge Report since it is well known for the content that is original to that site.
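The scope rule described above – a crawl that stays on the seed site and takes no hops out to other hosts – can be sketched in a few lines.  This is a simplified illustration of the general idea, not the Library's actual crawler configuration; the function name and URLs are made up for the example:

```python
from urllib.parse import urlparse

def in_scope(seed_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is within a zero-hops-out crawl
    scope, i.e. it lives on the same host as the seed.  Links to any
    other host are skipped rather than followed."""
    seed_host = urlparse(seed_url).netloc.lower()
    candidate_host = urlparse(candidate_url).netloc.lower()
    return candidate_host == seed_host

# A page elsewhere on the seed site is harvested:
in_scope("http://www.drudgereport.com/", "http://www.drudgereport.com/archive")  # True

# A headline linking out to another site's story is not:
in_scope("http://www.drudgereport.com/", "http://www.example-news.com/story")  # False
```

Under a rule like this, an aggregator's front page is captured, but the stories it points to on other hosts fall outside the crawl – which is exactly the trade-off weighed for The Drudge Report.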

The harvesting for this collection has now been underway for several months; we are examining the results.  We look forward to making an archived version of today’s news as brought to you by the Internet available to Library of Congress patrons for many tomorrows.

What news sites do you think we should collect?
