Introducing the Federal Web Archiving Working Group

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.

View of Library of Congress from U.S. Capitol dome in winter. Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

“Publishing of federal information on government web sites is orders of magnitude more than was previously published in print.  Having GPO, NARA and the Library, and eventually other agencies, working collaboratively to acquire and provide access to these materials will collectively result in more information being available for users and will accomplish this more efficiently.” – Mark Sweeney, Associate Librarian for Library Services, Library of Congress.

“Harvesting born-digital agency web content, making it discoverable, building digital collections, and preserving them for future generations all fulfill the Government Publishing Office’s mission, Keeping America Informed. We are pleased to be partnering with the Library and NARA to get this important project off the ground. Networking and collaboration will be key to our success government-wide.” – Mary Alice Baish, Superintendent of Documents, Government Publishing Office.

“The Congressional Web Harvest is an invaluable tool for preserving Congress’ web presence. The National Archives first captured Congressional web content for the 109th Congress in 2006, and has covered every Congress since, making more than 25 TB of content publicly available at webharvest.gov. This important resource chronicles Congress’ increased use of the web to communicate with constituents and the wider public. We value this collaboration with our partners at the Government Publishing Office and the Library of Congress, and look forward to sharing our results with the greater web archiving community.” – Richard Hunt, Director of the Center for Legislative Archives, National Archives and Records Administration.

Today most information that federal government agencies produce is created in electronic format and disseminated over the World Wide Web. Few federal agencies have any legal obligation to preserve the web content they produce over the long term, and few deposit such content with the Government Publishing Office or the National Archives and Records Administration, so such materials are vulnerable to being lost.

Exterior of Government Printing Office I [Today known as the Government Publishing Office]. Courtesy of the Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

How much information are we talking about? Just quantifying an answer to that question turns out to be a daunting task. James Jacobs, Data Services Librarian Emeritus, University of California, San Diego, prepared a background paper (PDF) looking at the problem of digital government information for the Center for Research Libraries for the “Leviathan: Libraries and Government in the Age of Big Data” conference organized in April 2014:

The most accurate count we currently have is probably from the 2008 “end of term crawl.” It attempted to capture a snapshot “archive” of “the U.S. federal government Web presence” and, in doing so, revealed the broader scope of the location of government information on the web. It found 14,338 .gov websites and 1,677 .mil websites. These numbers are certainly a more comprehensive count than the official GSA list and more accurate as a count of websites than the ISC count of domains. The crawl also included government information on sites that are not .gov or .mil. It found 29,798 .org, 13,856 .edu, and 57,873 .com websites that it classified as part of the federal web presence. Using these crawl figures, the federal government published information on 135,215 websites in 2008.

In other words, a sea of information in 2008 and now, in 2015, still more.

A function of government is to provide information to its citizens through publishing and to preserve some selected portion of these publications. Clearly some (if not most) .gov web sites are “government publications” and the U.S. federal government puts out information on .mil, .com, and other domains as well. What government agencies are archiving federal government sites for future research on a regular basis? And why? To what extent?

In part inspired by discussions at last year’s Leviathan conference, and in part fulfilling earlier conversations, managers and staff of three federal agencies that each selectively harvest federal web sites (the Government Publishing Office, the National Archives and Records Administration and the Library of Congress) decided to start meeting and talking on a regular basis.

Managers and staff involved in web archiving from these three agencies have now met five times and have plans to continue meeting on a monthly basis during the remainder of 2015. At the most recent meeting we added a representative from the National Library of Medicine. So far we have been learning about what each of the agencies is doing with harvesting and providing access to federal web sites and why – whether it is the result of a legal mandate or because of other collection development policies. We expect to involve representatives of other federal agencies as seems appropriate over time.

Entrance of National Archives on Constitution Ave. side I. Courtesy of the Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

So far one thing we have agreed on is that we have enjoyed our meetings – the world of web archiving is a small one, and sharing our experiences with each other turns out to be both productive and pleasant. Now that we better understand what each of us is doing, individually and collectively, we can discuss how to make our combined efforts more efficient and effective, for example by reducing duplication of effort.

And that’s the kind of thing we hope comes out of this – a shared collection development strategy, if only informally developed. The following are some specific activities we are looking at:

  • Developing and describing web archiving best practices for federal agencies, a web archiving “FADGI” (Federal Agencies Digitization Guidelines Initiative), that could also be of interest to others outside of the federal agency community.
  • Investigating common metrics for use of our web archives of federal sites.
  • Establishing outreach to federal agency staff members who create the sites in order to improve our ability to harvest them.
  • Understanding what federal agencies are doing (those that do something) to archive their sites themselves and how that work can be integrated with our efforts.
  • Maintaining a seed list of federal URLs and who is archiving what (and which sites are not being harvested).
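
To make the last item concrete, here is a minimal sketch in Python of what a shared seed list might look like. It is an illustration only, not a description of any system the three agencies use: the URLs, the agency abbreviations, and the function names (build_coverage, unharvested, duplicated) are all hypothetical. Each record pairs a seed URL with an agency harvesting it, or with None when no one is harvesting the site yet.

```python
# Hypothetical seed records: (seed URL, agency harvesting it).
# All URLs and assignments below are made up for illustration.
SEEDS = [
    ("https://www.example.gov/", "GPO"),
    ("https://www.example.gov/", "LC"),
    ("https://data.example.gov/", "NARA"),
    ("https://www.example.mil/", None),  # known site, no harvester yet
]


def build_coverage(seeds):
    """Map each seed URL to the sorted list of agencies archiving it."""
    coverage = {}
    for url, agency in seeds:
        # Register the URL even when no agency is attached to it.
        agencies = coverage.setdefault(url, set())
        if agency is not None:
            agencies.add(agency)
    return {url: sorted(agencies) for url, agencies in coverage.items()}


def unharvested(coverage):
    """Seed URLs that no agency is currently archiving."""
    return sorted(url for url, agencies in coverage.items() if not agencies)


def duplicated(coverage):
    """Seed URLs archived by more than one agency."""
    return sorted(url for url, agencies in coverage.items() if len(agencies) > 1)


if __name__ == "__main__":
    coverage = build_coverage(SEEDS)
    print("Not harvested:", unharvested(coverage))
    print("Harvested by more than one agency:", duplicated(coverage))
```

A list like this would surface both gaps (sites no one harvests) and overlaps (sites several agencies harvest). Overlap is not necessarily a problem, since agencies may collect the same site under different mandates, but it is a natural starting point for a conversation about duplication of effort.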

As the work progresses we look forward to communicating via blog posts and other means about what we accomplish. We hope to hear from you, via the comments on blog posts like this one, with your questions or ideas.

Boxes of Hard Drives and Other Challenges at WGBH: An NDSR Project Update

The following is a guest post by Rebecca Fraimow, National Digital Stewardship Resident at WGBH in Boston. I have a pretty comprehensive list of goals to accomplish over the course of my time as the National Digital Stewardship Resident at WGBH’s Media, Library and Archives. That is: Document WGBH’s existing ingest workflow for production media […]

DPOE Interview: Three Trainers Launch Virtual Courses

The following is a guest post by Barrie Howard, IT Project Manager at the Library of Congress. This is the first post in a series about digital preservation training inspired by the Library’s Digital Preservation Outreach & Education (DPOE) Program.  Today I’ll focus on some exceptional individuals, who among other things, have completed one of […]

All in the (Apple ProRes 422 Video Codec) Family

We’ve spent a lot of time recently thinking about digital video issues. As mentioned in a previous blog post, the Federal Agencies Digitization Guidelines Initiative published several reports on this topic including “Creating and Archiving Born Digital Video.” Work on the “Eight Federal Case Histories” (PDF) report nudged us to add the Apple ProRes 422 […]

From the Field: More Insight Into Digital Preservation Training Needs

The following is a guest post by Jody DeRidder, Head of Digital Services at the University of Alabama Libraries.  This post reports on efforts in the digital preservation community that align with the Library’s Digital Preservation Outreach & Education (DPOE) Program. Jody, among many other accomplishments, has completed one of the DPOE Train-the-Trainer workshops and […]

Digital Audio Preservation at MIT: an NDSR Project Update

The following is a guest post by Tricia Patterson, National Digital Stewardship Resident at MIT Libraries. This month marks the mid-way point of my National Digital Stewardship Residency at MIT Libraries, a temporal vantage point that allows me to reflect triumphantly on what has been achieved so far and peer fearlessly ahead at all that […]

The DPC’s 2014 Digital Preservation Awards

In November, our colleagues at the Digital Preservation Coalition presented their Digital Preservation 2014 awards. These awards, which are given every two years, were established in 2004 to help raise awareness about digital preservation. The Library of Congress welcomes any public recognition of excellence in digital preservation. We, too, have given our own awards, most recently […]

Web Archive Management at NYARC: An NDSR Project Update

The following is a guest post by Karl-Rainer Blumenthal, National Digital Stewardship Resident at the New York Art Resources Consortium (NYARC). A tipping point from traditional to emergent digital technologies in the regular conduct of art historical scholarship threatens to leave unprepared institutions and their researchers alike in a “digital black hole.” NYARC–the partnership of […]