The following is a guest post from Michael Neubert, a supervisory digital projects specialist at the Library of Congress.
In February of this year I wrote a post here, “Introducing the Federal Web Archiving Working Group,” about a collaborative effort by representatives of the National Archives and Records Administration (NARA), the Government Publishing Office (GPO), and the Library of Congress to work together in various ways on archiving federal government agency websites.
Since that time we have expanded participation beyond NARA, GPO, and the Library to include additional federal agencies that focus more on harvesting their own sites than on harvesting the sites of other agencies: the National Library of Medicine, the Smithsonian Institution, and the Department of Health and Human Services. We plan to reach out to more soon. Because web archiving of federal sites is a relatively new activity, and the community of interested staff and managers at federal agencies is small, we have realized that there are things we can learn from one another.
Lynda Schmitz Fuhrig, electronic records archivist, is the representative to the Federal Web Archiving Working Group from the Smithsonian Institution Archives (SIA). SIA “captures, preserves, and makes available to the public the history of this extraordinary Institution. From its inception in 1846 to the present, the records of the history of the Institution—its people, its programs, its research, and its stories—have been gathered, organized, and disseminated so that everyone can learn about the Smithsonian. The history of the Smithsonian is a vital part of American history, of scientific exploration, and of international cultural understanding.” Since the late 1990s this has included archiving of the websites and social media presence for Smithsonian’s various museums, research centers, and offices.
Michael: Why does the Smithsonian Institution archive its own sites? What is your process?
Lynda: As the official recordkeeper of the Smithsonian, we document what the Institution does in terms of exhibits and program planning, construction of buildings, and many other aspects. Our websites and social media accounts also serve as the public face of the Smithsonian. Many of them contain significant content of historical and research value that is now not found elsewhere. These are considered records of the Institution. It is also interesting to see how websites evolve over time. It would be irresponsible of us as an archives to rely only upon other organizations to archive our websites.
We use the web crawling service from Archive-It to capture most of these sites. In addition to Archive-It hosting our web archives, we also retain copies of the files in our collections. We use some other tools to capture specific tweets or hashtags or sites that are a little more challenging due to how they are constructed and the dynamic nature of social media content.
In terms of public-facing websites, we try to capture them every 12 to 18 months. It is more frequent if a redesign is happening, and the archiving will happen before and after the update/refresh. An archivist appraises the content on the social media sites to determine if it has been replicated and captured elsewhere in some instances. For example, a museum’s postings on Facebook and Twitter could be similar and don’t require frequent captures. We now have more than 400 websites and blogs and more than 600 social media accounts that include Twitter, Facebook, Instagram, and YouTube across the Institution.
Michael: You’ve been participating in the Federal Web Archiving Working Group since June 2015. What did you hope to learn or accomplish with this group and how is it going so far?
Lynda: I am hoping to learn from my colleagues about their experiences and challenges, as well as other tools or approaches they are implementing at their agencies regarding web archiving. It has been interesting to hear about the various collecting missions or directives at other government agencies.
Michael: When you talk to colleagues or managers at the Smithsonian Institution about web archiving, what is the reaction? How do they see the benefit of this activity?
Lynda: Many do understand the value of it, since we reach more people globally via the web than we do through visitors coming to our museums in person. Our websites and social media accounts do indeed document the history of the Institution. Many webmasters know it is important to contact us when they are getting ready to retire a website so we can get a capture and/or retrieve the actual files from a content management system. We also have made various presentations at the Institution about web archiving.
Michael: I can imagine someone suggesting that since the Smithsonian must “back up” its web servers, it seems redundant to archive the websites. How would you explain the difference?
Lynda: It is true that we back up our network servers at the Smithsonian, but backing up is not the same as archiving. By crawling sites we deem appropriate, we have a snapshot in time of the look and feel of a website. Backups serve the purpose of having duplicate files to rely upon in case of disaster or failure. Backups typically are only saved for a certain time period. The website archiving we do is kept permanently. If I wanted to see si.edu from Oct. 9, 2012, there is a good chance the backup tape no longer exists, but if I crawled that site that day I will have those files.
Michael: We have talked about my next question before: what is your view on whether it makes sense to use web archiving to make complete copies of cultural heritage presentation sites, including the records and displays of digitized collection items?
Lynda: Our approach has been to exclude as many collection objects and images as possible from our crawls of the museum websites, as per Smithsonian Institution Archives policy. Of course, there are items that do get crawled because of the nature of the sites, and we usually have the main collections page. Physical collection items fall under the unit responsible for them, and they are something that we would never accession in the Archives.
Personally, I have mixed feelings about this since it is not a “complete” website capture then, especially since the images themselves are only representations online and not the actual object.
We do crawl exhibit websites that contain collection objects though.
This is something that researchers need to be aware of when using web archives. Typically, many website captures are not going to have everything, whether because of excluded content, blocked content, or dynamic content such as Flash elements or calendars that are generated by databases. Capturing the web is not perfect.