One Size Does Not Always Fit All

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.

Marjorie Merriweather Post Hutton Davies, three-quarter length portrait, seated at desk, facing left. Library of Congress, Prints & Photographs Division, by C.M. Stieglitz. 1942.

Recently, I talked with Kristen Regina, Head of Archives and Special Collections at the Hillwood Estate, Museum and Gardens in northwest Washington, and Jaime McCurry, Digital Assets Librarian, about workflows and issues for web archiving, an activity that they are exploring. What could I tell them based on LC’s experiences?

Hillwood Estate, Museum and Gardens opened as a public museum in 1977. It is the former residence of American businesswoman, socialite, philanthropist and collector Marjorie Merriweather Post, and home to one of the most comprehensive collections of Russian imperial art outside of Russia, a distinguished 18th-century French decorative art collection and twenty-five acres of landscaped gardens and natural woodlands.

According to Kristen and Jaime, Hillwood has identified digital stewardship as an area of great importance, both in its strategic planning efforts and in day-to-day operations. This fresh focus has supported the institution’s recent migration to a new digital asset management system; the continuation of digital partnerships, such as Hillwood’s participation in the Google Art Project, to encourage access to its rich digital resources; and, moving forward, the exploration and creation of a well-rounded web archiving program.

As Kristen explained, the team has three specific activities in mind for web archiving:

  1. Archive Hillwood’s online presence, in particular its own website. The site would be archived on a regular basis to support traditional archival efforts related to the museum and its ongoing operational activities. This aligns with the usual reasons that an organization of this type would keep copies of brochures, publications, reports and so on that are provided to the public about the organization.
  2. Targeted harvesting of listings or digital catalogues of materials in scope for the Hillwood collections from the websites of dealers or auction houses such as Sotheby’s or Christie’s. Again, this mirrors the collecting of analogous paper materials.
  3. Harvesting on a continuing or one-time basis of sites (or more often parts of sites) of peer institutions, particularly in Russia, and web-based publications about Hillwood or topics relevant to its collections or collecting priorities.

One could come up with any number of challenges associated with each of these activities. Thinking them over after our meeting, though, I was struck that each has its own distinct challenges, and that the best solution for one might not contribute to solving either of the other two. This is in contrast to my usual way of thinking about problem solving in web archiving.

Hillwood Estate, Museum & Gardens in northwest Washington, D.C. (Courtesy of Hillwood Estate, Museum and Gardens Archives.)

Archiving the organization’s site: This is a fairly typical activity for many organizations nowadays and can easily be arranged with a vendor who will periodically “crawl” the organization’s website from top to bottom, capturing as much of the site as is technically possible by following browsable links. This can be done at whatever frequency is desired. The “traditional” approach, however, is to do a complete harvest of the site each time. If the site as a whole is revised only rarely, this can be a considerable amount of effort to make a copy of something that has changed only slightly since the earlier crawl. (The resulting files are de-duped, so at least duplicated copies of the site materials are not stored.) At the same time, a portion of the site might undergo any number of changes between scheduled full harvests, and those changes would be missed.

The solution in this situation is to take a completely different approach and contract with an organization that can harvest individual pages when they change, on the basis of an RSS-like notification. For a website that is mostly unchanged over time, this would be much more economical, and yet it would still assure the organization’s management that the question, “what did our site look like on date X?” can be answered accurately in the future. A full crawl could be attempted once a year as a baseline.
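To make the notification idea concrete, here is a minimal sketch of how a harvester might decide which pages to re-capture. It assumes the site publishes an RSS 2.0 feed in which each item carries the page's link and the date it last changed. The feed contents, the museum.example URLs and the function name pages_changed_since are hypothetical, invented for illustration; this is not a description of any particular vendor's service.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical change feed; a real one would be fetched from the site.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example museum site changes</title>
    <item>
      <link>https://museum.example/visit/hours</link>
      <pubDate>Mon, 05 Jan 2015 10:00:00 +0000</pubDate>
    </item>
    <item>
      <link>https://museum.example/exhibitions/current</link>
      <pubDate>Tue, 02 Dec 2014 09:30:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def pages_changed_since(feed_xml, last_harvest):
    """Return links of feed items dated after the last harvest."""
    root = ET.fromstring(feed_xml)
    changed = []
    for item in root.iter("item"):
        link = item.findtext("link")
        pub_date = item.findtext("pubDate")
        if not link or not pub_date:
            continue  # skip items that lack a link or a date
        # RSS 2.0 dates use the RFC 822 format, which this parser reads.
        if parsedate_to_datetime(pub_date) > last_harvest:
            changed.append(link.strip())
    return changed


last_harvest = datetime(2014, 12, 15, tzinfo=timezone.utc)
to_capture = pages_changed_since(SAMPLE_FEED, last_harvest)
```

With the sample feed, only the page changed after mid-December would be queued for capture. Everything else stays as it was in the last baseline crawl.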

Targeted crawling of certain types of materials from dealers and auction houses, complementing what other groups such as the New York Art Resources Consortium are collecting: To my mind, this kind of crawling presents a completely different challenge. Hillwood has a relatively narrow and specific collecting profile and most relevant auction houses will have much broader scope. If we assume that Hillwood (or other museums or cultural heritage institutions for that matter) would not want to harvest entire sites and then “throw away” what they don’t need, then the likely solution lies in collaborative effort by collecting institutions. Collaboration is a theme that seems to be gaining traction in discussions of web archiving, which is probably good, but at the same time it presents an entirely different set of challenges.

Harvesting sites of peer institutions, particularly in Russia, and web-based publishing about Hillwood or topics relevant to its collections: These activities seem closer to “traditional” web archiving as I think of it, but are still challenging for a small organization with a small staff. Hillwood’s focus will rarely align directly with that of other institutions, so “scoping” a crawl of another organization’s site to acquire just the relevant materials, and not the entire site, would often be tricky. In addition, there is the ongoing challenge of identifying what these sites and materials might be, which requires staff attention. This in fact seems the greatest challenge here: finding the staff time to identify the sites, scope them properly and later do the quality assurance review of the results.
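As a rough illustration of what “scoping” means in practice, the sketch below applies a simple include rule: keep a URL only if it is on the chosen host and under one of a few chosen sections of the site. The host peer-museum.example, the path prefixes and the function name in_scope are all made up for the example, and real crawlers typically offer richer scoping rules than this.

```python
from urllib.parse import urlsplit


def in_scope(url, host, path_prefixes):
    """Return True if the URL is on the given host and under an allowed path."""
    parts = urlsplit(url)
    if parts.hostname != host:
        return False  # a different site entirely
    return any(parts.path.startswith(prefix) for prefix in path_prefixes)


# Hypothetical scope: only the decorative arts and exhibitions sections.
HOST = "peer-museum.example"
PREFIXES = ["/collections/decorative-arts/", "/exhibitions/"]

candidates = [
    "https://peer-museum.example/collections/decorative-arts/porcelain",
    "https://peer-museum.example/shop/tickets",
    "https://other-site.example/exhibitions/2015",
]
selected = [url for url in candidates if in_scope(url, HOST, PREFIXES)]
```

Writing the rule takes a few lines. Deciding which hosts and sections belong in it, and checking afterwards that the crawl captured what was intended, is the part that takes staff time.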

Having worked on web archiving collection-building at the Library of Congress for about five years, I am increasingly struck by the singular nature of the web archiving tools. Perhaps this is reflective of the relatively youthful nature of the activity, perhaps it reflects a certain gratitude that there are tools that at least do one thing. But as I look at some of what we at the Library would like to expand our activities to do and talk to people like Kristen and Jaime, I learn about different use cases that lead me to think about different problems, both technical and organizational, than the ones we have focused on so far.
