The First Decade of Web Archiving at the Library of Congress

The following is a guest post by Abbie Grotke, Web Archiving Team Lead at the Library of Congress.

Eleven years ago, the Library of Congress established a pilot web archiving project to study methods to evaluate, select, collect, catalog, provide access to and preserve at-risk born-digital content for future generations. We could write a book (or at least a few blog posts!) about lessons learned since then, yet we continue to face a variety of challenges.


Library of Congress Web Archives

Future blog posts will cover some of our technological and legal challenges, but today and in a following post I’ll talk about our “social” issues: with the Web so big, what do we preserve and who preserves it?

There are a few approaches to Web archiving: bulk or domain harvesting (for example, the National Library of Iceland archives all of .is), and selective, thematic or event-based harvesting.

Identifying what the “U.S. Web” might be is nearly impossible. Archiving “.us” certainly doesn’t cut it. There’s no one master list to which we can refer to incorporate all of the websites generated by and hosted in the United States (although the recent news about publishing a list of all top-level government domains gives us some hope). So it comes as no surprise that the Library has taken a selective approach to Web archiving.

We’ve collected over 240 terabytes of content in almost 40 event and thematic collections. Our strengths are in government, public policy and law: we archive U.S. national elections, House, Senate and committee sites, changes in the Supreme Court, and legal blogs (“blawgs”).


League of Women Voters, archived March 30, 2006, from the Manuscript Division Archive of Organizational Websites.

We also build web archives with our special collection divisions – the Manuscript, Prints and Photographs, and Music divisions are archiving sites related to their physical holdings. In recent years, Library staff in overseas offices in Egypt, Brazil, Indonesia, India and Pakistan have captured born-digital content documenting elections and other events.

Our collections are in various stages of production, so not all are available to researchers yet. We currently contract with the Internet Archive to do our large-scale web crawling. Once a collection is completed (or, in the case of ongoing archives, after about a year of archiving), we transfer the content from the Internet Archive to the Library of Congress.


"Resource page" showing dates of capture from the Election 2008 Web Archive.

Then staff in the Library’s Acquisitions and Bibliographic Access division catalog the sites, and we work together to prepare them for public access. This process can take anywhere from one year to much longer. To give you an idea, we only recently launched the Election 2008 Web Archive.

Next time I’ll talk about how we work with other libraries and archives to build collaborative web archives.

In the meantime, check out our publicly available collections by visiting the Library of Congress Web Archives.
