Introducing the Federal Web Archiving Working Group

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.

View of Library of Congress from U.S. Capitol dome in winter. Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

“Publishing of federal information on government web sites is orders of magnitude more than was previously published in print.  Having GPO, NARA and the Library, and eventually other agencies, working collaboratively to acquire and provide access to these materials will collectively result in more information being available for users and will accomplish this more efficiently.” – Mark Sweeney, Associate Librarian for Library Services, Library of Congress.

“Harvesting born-digital agency web content, making it discoverable, building digital collections, and preserving them for future generations all fulfill the Government Publishing Office’s mission, Keeping America Informed. We are pleased to be partnering with the Library and NARA to get this important project off the ground. Networking and collaboration will be key to our success government-wide.” – Mary Alice Baish, Superintendent of Documents, Government Publishing Office.

“The Congressional Web Harvest is an invaluable tool for preserving Congress’ web presence. The National Archives first captured Congressional web content for the 109th Congress in 2006, and has covered every Congress since, making more than 25 TB of content publicly available at webharvest.gov. This important resource chronicles Congress’ increased use of the web to communicate with constituents and the wider public. We value this collaboration with our partners at the Government Publishing Office and the Library of Congress, and look forward to sharing our results with the greater web archiving community.” – Richard Hunt, Director of the Center for Legislative Archives, National Archives and Records Administration.

Today most information that federal government agencies produce is created in electronic format and disseminated over the World Wide Web. Few federal agencies have any legal obligation to preserve the web content they produce over the long term, and few deposit such content with the Government Publishing Office or the National Archives and Records Administration. As a result, such materials are vulnerable to being lost.

Exterior of Government Printing Office I [Today known as the Government Publishing Office]. Courtesy of the Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

How much information are we talking about? Just quantifying an answer to that question turns out to be a daunting task. James Jacobs, Data Services Librarian Emeritus, University of California, San Diego, prepared a background paper (PDF) looking at the problem of digital government information for the Center for Research Libraries for the “Leviathan: Libraries and Government in the Age of Big Data” conference organized in April 2014:

The most accurate count we currently have is probably from the 2008 “end of term crawl.” It attempted to capture a snapshot “archive” of “the U.S. federal government Web presence” and, in doing so, revealed the broader scope of the location of government information on the web. It found 14,338 .gov websites and 1,677 .mil websites. These numbers are certainly a more comprehensive count than the official GSA list and more accurate as a count of websites than the ISC count of domains. The crawl also included government information on sites that are not .gov or .mil. It found 29,798 .org, 13,856 .edu, and 57,873 .com websites that it classified as part of the federal web presence. Using these crawl figures, the federal government published information on 135,215 websites in 2008.

In other words, a sea of information in 2008 and now, in 2015, still more.

A function of government is to provide information to its citizens through publishing and to preserve some selected portion of these publications. Clearly some (if not most) .gov web sites are “government publications” and the U.S. federal government puts out information on .mil, .com, and other domains as well. What government agencies are archiving federal government sites for future research on a regular basis? And why? To what extent?

In part inspired by discussions at last year’s Leviathan conference, and in part fulfilling earlier conversations, managers and staff of three federal agencies that each do selective harvesting of federal web sites decided to start meeting and talking on a regular basis – the Government Publishing Office, the National Archives and Records Administration and the Library of Congress.

Managers and staff involved in web archiving from these three agencies have now met five times and have plans to continue meeting on a monthly basis during the remainder of 2015. At the most recent meeting we added a representative from the National Library of Medicine. So far we have been learning about what each of the agencies is doing with harvesting and providing access to federal web sites and why – whether it is the result of a legal mandate or because of other collection development policies. We expect to involve representatives of other federal agencies as seems appropriate over time.

Entrance of National Archives on Constitution Ave. side I. Courtesy of the Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

So far one thing we have agreed on is that we have enjoyed our meetings. The world of web archiving is a small one, and sharing our experiences with each other turns out to be both productive and pleasant. Now that we better understand what we are doing individually and collectively, we can discuss how to make our combined efforts more efficient and effective going forward, for example by reducing duplication of effort.

And that’s the kind of thing we hope comes out of this – a shared collective development strategy, if only informally developed. The following are some specific activities we are looking at:

  • Developing and describing web archiving best practices for federal agencies, a web archiving “FADGI” (Federal Agencies Digitization Guidelines Initiative), that could also be of interest to others outside of the federal agency community.
  • Investigating common metrics for use of our web archives of federal sites.
  • Establishing outreach to federal agency staff members who create the sites in order to improve our ability to harvest them.
  • Understanding what federal agencies (those that do something) are doing to archive their sites themselves and how that work can be integrated with our efforts.
  • Maintaining a seed list of federal URLs that records who is archiving what (and which sites are not being harvested).

As the work progresses we look forward to communicating via blog posts and other means about what we accomplish. We hope to hear from you, via the comments on blog posts like this one, with your questions or ideas.

2 Comments

  1. Lynda Schmitz Fuhrig
    February 23, 2015 at 11:15 am

    The Smithsonian Institution Archives would be happy to provide feedback and to participate in the working group.

  2. michael neubert
    February 23, 2015 at 12:13 pm

    Smithsonian – great! We know where to find you . . . . thanks and we’ll be in touch.
