Introducing the Federal Web Archiving Working Group

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.

View of Library of Congress from U.S. Capitol dome in winter. Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

View of Library of Congress from U.S. Capitol dome in winter. Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

“Publishing of federal information on government web sites is orders of magnitude more than was previously published in print.  Having GPO, NARA and the Library, and eventually other agencies, working collaboratively to acquire and provide access to these materials will collectively result in more information being available for users and will accomplish this more efficiently.” – Mark Sweeney, Associate Librarian for Library Services, Library of Congress.

“Harvesting born-digital agency web content, making it discoverable, building digital collections, and preserving them for future generations all fulfill the Government Publishing Office’s mission, Keeping America Informed. We are pleased to be partnering with the Library and NARA to get this important project off the ground. Networking and collaboration will be key to our success government-wide.” – Mary Alice Baish, Superintendent of Documents, Government Publishing Office.

“The Congressional Web Harvest is an invaluable tool for preserving Congress’ web presence. The National Archives first captured Congressional web content for the 109th Congress in 2006, and has covered every Congress since, making more than 25 TB of content publicly available at webharvest.gov. This important resource chronicles Congress’ increased use of the web to communicate with constituents and the wider public. We value this collaboration with our partners at the Government Publishing Office and the Library of Congress, and look forward to sharing our results with the greater web archiving community.” – Richard Hunt, Director of the Center for Legislative Archives, National Archives and Records Administration.

Today most information that federal government agencies produce is created in electronic format and disseminated over the World Wide Web. Few federal agencies have any legal obligation to preserve web content that they produce long-term and few deposit such content with the Government Publishing Office or the National Archives and Records Administration – such materials are vulnerable to being lost.

Exterior of Government Printing Office I  [Today known as the Government Publishing Office]. Courtesy of the Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

Exterior of Government Printing Office I [Today known as the Government Publishing Office]. Courtesy of the Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

How much information are we talking about? Just quantifying an answer to that question turns out to be a daunting task. James Jacobs, Data Services Librarian Emeritus, University of California, San Diego, prepared a background paper (PDF) looking at the problem of digital government information for the Center for Research Libraries for the “Leviathan: Libraries and Government in the Age of Big Data” conference organized in April 2014:

The most accurate count we currently have is probably from the 2008 “end of term crawl.” It attempted to capture a snapshot “archive” of “the U.S. federal government Web presence” and, in doing so, revealed the broader scope of the location of government information on the web. It found 14,338 .gov websites and 1,677 .mil websites. These numbers are certainly a more comprehensive count than the official GSA list and more accurate as a count of websites than the ISC count of domains. The crawl also included government information on sites that are not .gov or .mil. It found 29,798 .org, 13,856 .edu, and 57,873 .com websites that it classified as part of the federal web presence. Using these crawl figures, the federal government published information on 135,215 websites in 2008.

In other words, a sea of information in 2008 and now, in 2015, still more.

A function of government is to provide information to its citizens through publishing and to preserve some selected portion of these publications. Clearly some (if not most) .gov web sites are “government publications” and the U.S. federal government puts out information on .mil, .com, and other domains as well. What government agencies are archiving federal government sites for future research on a regular basis? And why? To what extent?

In part inspired by discussions at last year’s Leviathan conference, and in part fulfilling earlier conversations, managers and staff of three federal agencies that each do selective harvesting of federal web sites decided to start meeting and talking on a regular basis – the Government Publishing Office, the National Archives and Records Administration and the Library of Congress.

Managers and staff involved in web archiving from these three agencies have now met five times and have plans to continue meeting on a monthly basis during the remainder of 2015. At the most recent meeting we added a representative from the National Library of Medicine. So far we have been learning about what each of the agencies is doing with harvesting and providing access to federal web sites and why – whether it is the result of a legal mandate or because of other collection development policies. We expect to involve representatives of other federal agencies as seems appropriate over time.

Entrance of National Archives on Constitution Ave. side I. Courtesy of the Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

Entrance of National Archives on Constitution Ave. side I. Courtesy of the Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

So far one thing we have agreed on is that we have enjoyed our meetings – the world of web archiving is a small one, and sharing our experiences with each other turns out to be both productive and pleasant. Now that we better understand what we are all doing individually and collectively, we are able to discuss how we can be more efficient and effective in the aggregated results of what we are doing going forward, for example by reducing duplication of effort.

And that’s the kind of thing we hope comes out of this – a shared collective development strategy, if only informally developed. The following are some specific activities we are looking at:

  • Developing and describing web archiving best practices for federal agencies, a web archiving “FADGI” (Federal Agencies Digitization Guidelines Initiative), that could also be of interest to others outside of the federal agency community.
  • Investigate common metrics for use of our web archives of federal sites.
  • Establishing outreach to federal agency staff members who create the sites in order to improve our ability to harvest them.
  • Understand what federal agencies are doing (those that do something) to archive their sites themselves and how that work can be integrated with our efforts.
  • Maintain a seed list of federal URLs and who is archiving what (and which sites are not being harvested).

As the work progresses we look forward to communicating via blog posts and other means about what we accomplish. We hope to hear from you, via the comments on blog posts like this one, with your questions or ideas.

All the News That’s Fit to Archive

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress. The Library has had a web archiving program since the early 2000s.  As with other national libraries, the Library of Congress web archiving program started out harvesting the web sites of its national election campaigns, followed […]

An Online Event & Experimental Born Digital Collecting Project: #FolklifeHalloween2014

If you haven’t heard, as the title of the press release explains, the Library of Congress Seeks Halloween Photos For American Folklife Center Collection.  As of writing this morning, there are now 288 photos shared on Flickr with the #folklifehalloween2014 tag. If you browse through the results, you can see a range of ways folks […]

Gossiping About Digital Preservation

In September the Library held its annual Designing Storage Architectures for Digital Collections meeting. The meeting brings together technical experts from the computer storage industry with decision-makers from a wide range of organizations with digital preservation requirements to explore the issues and opportunities around the storage of digital information for the long-term. I always learn […]

Five Questions for Will Elsbury, Project Leader for the Election 2014 Web Archive

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress. Since the U.S. national elections of 2000, the Library of Congress has been harvesting the web sites of candidates for elections for Congress, state governorships and the presidency. These collections  require considerable manual effort to identify […]

The Library of Congress Wants Your File Format Ideas

In June of this year, the Library of Congress announced a list of formats it would prefer for digital collections. This list of recommended formats is an ongoing work; the Library will be reviewing the list and making revisions for an updated version in June 2015. Though the team behind this work continues to put […]

Beyond Us and Them: Designing Storage Architectures for Digital Collections 2014

The following post was authored by Erin Engle, Michelle Gallinger, Butch Lazorchak, Jane Mandelbaum and Trevor Owens from the Library of Congress. The Library of Congress held the 10th annual Designing Storage Architectures for Digital Collections meeting September 22-23, 2014. This meeting is an annual opportunity for invited technical industry experts, IT  professionals, digital collections […]

Library to Launch 2015 Class of NDSR

The Library of Congress Office of Strategic Initiatives, in partnership with the Institute of Museum and Library Services, has recently announced the 2015 National Digital Stewardship Residency program, which will be held in the Washington, DC area starting in June 2015. As you may know (NDSR was well represented on the blog last year), this […]