All the News That’s Fit to Archive

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.

The Library has had a web archiving program since the early 2000s.  As with other national libraries, the Library of Congress web archiving program started out harvesting the web sites of its national election campaigns, followed by some collections to harvest sites for period of time connected with events (for example, an Iraq War web archive and a papal transition 2005 web archive along with collecting the sites of the U.S. House and Senate and the legislative branch of government more broadly.

An American of the 1930s getting his news by reading a newspaper. These days he'd likely be looking at a computer screen. Photo courtesy of the Library of Congress Prints and Photographs division.

An American of the 1930s getting his news by reading a newspaper. These days he’d likely be looking at a computer screen. Photo courtesy of the Library of Congress Prints and Photographs division.

The question for the Library of Congress of “what else” to harvest beyond these collections is harder to answer than one might think because of the relatively small web archiving capacity of the Library of Congress (which is influenced by our permissions approach) compared to the vast immenseness of the Internet.  About six years ago we started a collection now known as the Public Policy Topics, for which we would acquire sites with content reflecting different viewpoints and research on a broad selection of public policy questions, including the sites of national political parties, selected advocacy organizations and think tanks and other organizations with a national voice in America’s policy discussions that could be of interest to future researchers.  We are adding more sites to Public Policy Topics continuously.

Eventually I decided to include some news web sites that contained significant discussion of policy issues from particular points of view – sites ranging from DailyKos.com to Townhall.com, from TruthDig.com to Redstate.com.  We started crawling these sites on a weekly basis to try to assure complete capture over time and to build a representation of how the site looked as different news events came and went in the public consciousness (and on these web sites).  We have been able to assess the small number of such sites that we have crawled and have decided that the results are acceptable.  But this was obviously not a very large-scale effort compared to the increasing number of sites presenting general news on the Internet -for many people, their current equivalent of a newspaper.

Newspapers – they are a critical source for historical research and the Library of Congress has a long history of collecting and providing access to U.S. (and other countries’) newspapers.  Having started to collect a small number of “newspaper-like” U.S. news sites for the Public Policy Topics collection, I began a conversation with three reference librarian colleagues from the Newspaper & Current Periodical Reading Room – Amber Paranick, Roslyn Pachoca and Gary Johnson ­- about expanding this effort to a new collection, a “General News on the Internet” web archive.  They explained to me:

Our newspaper collections are invaluable to researchers.  Newspapers provide a first-hand draft of history.  They provide supplemental information that cannot be found anywhere else.  They ‘fill in the gaps,’ so to speak. The way people access news has been changing and evolving ever since newspapers were first being published. We recognized the need to capture news published in another format.  It is reasonable to expect us to continue to connect these kinds of resources to our current and future patrons. Websites tend to be ephemeral and may disappear completely.  Without a designated archive, critical news content may be lost.

In short, my colleagues shared my interest, concern and enthusiasm for starting a larger collection of Internet-only general news sites as a web archiving collection.  I’ll let them explain their thinking further:

When we first got started on the project, we weren’t sure how to proceed.  Once we established clear boundaries on what to include, what types of news sites would be within scope for this collection, our selection process became easier. We asked for help in finding websites from our colleagues. 

We felt it was important to include sites that focus on general news with significant national presence where there are articles that have an author’s voice, such as with HuffingtonPost.com or BuzzFeed.com (even as some of these sites also contain articles that are meant to attract visitors, so-called “click bait).  We wanted to include a variety of sites that represent more cutting edge ways of presenting general news, such as Vox.com and TheVerge, and we felt sites that focus on parody such as TheOnion.com were also important to have represented.  Of course, these sites are not the only sources from which people obtain their news, but we tried to choose a variety that included more trendy or popular sources as well as the conventional or traditional types.  Again, the idea is to assure future users have access to a significant representation of how Americans accessed news at this time using the Internet.

The Library of Congress has an internal process for proposing new web archiving collections.  I worked with Amber, Roslyn and Gary and they submitted a “General News on the Internet” project proposal and it was approved.  Yay!  Then the work began – Amber, Roslyn and Gary describe some of the hurdles:

We understand that archiving video content is a problem. We thought websites like NowThisNews.com could be great candidates but in effect, because they contained so much video and a kind of Tumblr-like portal entry point for news, we had to reject them.  Since we do not do “one hop out” crawling, the linked-to content that is the substantive content (i.e., the news) would be entirely missed.   Also, websites like Vice.com change their content so frequently, it might be impossible to capture all of its content.

In addition, it was decided that sites chosen would not include general news sites associated primarily with other delivery vehicles, such as CNN.com or NYTimes.com.  Many of these types also have paywalls and therefore obviously would create limitations when trying to archive.

We also encountered another type of challenge with Drudgereport.com.  Since it is primarily a news-aggregator with most of the site consisting of links to news on other sites it would be tough to include the many links with the limitations in crawling (again, the “one hop” limitation – we don’t harvest links that are on a different URL).  In the end we decided to proceed in archiving The Drudge Report site since it is well known for the content that is original to that site.

The harvesting for this collection has now been underway for several months; we are examining the results.  We look forward to making an archived version of today’s news as brought to you by the Internet available to Library of Congress patrons for many tomorrows.

What news sites do you think we should collect?

An Online Event & Experimental Born Digital Collecting Project: #FolklifeHalloween2014

If you haven’t heard, as the title of the press release explains, the Library of Congress Seeks Halloween Photos For American Folklife Center Collection.  As of writing this morning, there are now 288 photos shared on Flickr with the #folklifehalloween2014 tag. If you browse through the results, you can see a range of ways folks […]

Gossiping About Digital Preservation

In September the Library held its annual Designing Storage Architectures for Digital Collections meeting. The meeting brings together technical experts from the computer storage industry with decision-makers from a wide range of organizations with digital preservation requirements to explore the issues and opportunities around the storage of digital information for the long-term. I always learn […]

Five Questions for Will Elsbury, Project Leader for the Election 2014 Web Archive

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress. Since the U.S. national elections of 2000, the Library of Congress has been harvesting the web sites of candidates for elections for Congress, state governorships and the presidency. These collections  require considerable manual effort to identify […]

The Library of Congress Wants Your File Format Ideas

In June of this year, the Library of Congress announced a list of formats it would prefer for digital collections. This list of recommended formats is an ongoing work; the Library will be reviewing the list and making revisions for an updated version in June 2015. Though the team behind this work continues to put […]

Beyond Us and Them: Designing Storage Architectures for Digital Collections 2014

The following post was authored by Erin Engle, Michelle Gallinger, Butch Lazorchak, Jane Mandelbaum and Trevor Owens from the Library of Congress. The Library of Congress held the 10th annual Designing Storage Architectures for Digital Collections meeting September 22-23, 2014. This meeting is an annual opportunity for invited technical industry experts, IT  professionals, digital collections […]

Library to Launch 2015 Class of NDSR

The Library of Congress Office of Strategic Initiatives, in partnership with the Institute of Museum and Library Services, has recently announced the 2015 National Digital Stewardship Residency program, which will be held in the Washington, DC area starting in June 2015. As you may know (NDSR was well represented on the blog last year), this […]