The following is a guest post by Abbie Grotke, Web Archiving Team Lead.
The United States national elections are a year away, but the Library of Congress is already busy archiving presidential campaign websites and preparing to archive House and Senate campaign sites and more starting in March 2012. This actually isn't the earliest we've started: for the 2008 archive we began a full nineteen months before the election.
Our prior election archives (since 2000) are publicly available; Election 2010 is in production and isn't yet available for researcher access.
Building the 2012 archive takes a large team of people around the Library. Recommending Officers research official campaign website URLs, confirm that the candidates are on the ballots, and then load that data into our nomination tool; technical team members prepare sites for archiving and do quality review to make sure we're getting good captures of the sites.
Campaign websites are prone to frequent change and often disappear soon after the elections are over, so we must act to preserve as much as possible before November 6, 2012. Things will get particularly hectic for all of us around June, when the primary schedule gets busy.
We began collecting the presidential campaign sites earlier this month, and our Web Archiving Team is reviewing the results of the crawl. Using the Heritrix archival crawler, we archive each site weekly so that future researchers can compare changes over time. We've noticed that some of the sites we started archiving a few weeks ago have already changed their look and feel! We've also noticed a few early, but not surprising, trends:
- Most campaign sites open with splash screens soliciting donations or encouraging visitors to sign up for email alerts. These can be tricky when archiving: they have not yet interfered with capture, but they cause trouble upon replay in the open-source Wayback Machine, so the results in the archive can be strange (one campaign site's splash screen flickered like a broken sign every few seconds and wouldn't let us click through to the main site).
- Candidates often post content on domains other than their main websites, from social media content to entire websites devoted to a particular topic (an example from the Election 2008 Web Archive is HillaryClinton.com's spin-off site, thehillaryiknow.com). Our goal is to document as much of a candidate's web presence as we can, so we give the crawler specific instructions to go get that content. Unfortunately, these sites are not always linked from the main site, so we may not discover them through our normal process for identifying content generated by the campaign site. The good news is that the social media sites in use by candidates (so far) seem to have narrowed down to a few usual suspects; as long as the content is public and is linked from their sites, we'll try to archive it.
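For readers curious what "specific instructions" for off-site content might look like, here is a minimal sketch in Python. It is not our Heritrix configuration; the function names and the idea of a simple allow-list of hosts are assumptions made for illustration, using the 2008 example sites mentioned above.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def in_scope(url: str, allowed_hosts: set[str]) -> bool:
    """True if the URL's host equals, or is a subdomain of, an allowed host."""
    host = host_of(url)
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)


# Hypothetical scope list: the main campaign site plus a spin-off site
# that was added by hand because the main site may not link to it.
allowed = {"hillaryclinton.com", "thehillaryiknow.com"}

print(in_scope("http://www.thehillaryiknow.com/stories", allowed))
print(in_scope("http://example.org/", allowed))
```

The point of the sketch is only that a spin-off domain has to be named explicitly; a crawler that follows links from the main site alone would never reach a site that the main site does not link to.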
With so much to do in the months ahead, we're looking forward to collaborating in 2012 with other organizations that will be selecting and archiving content related to the elections, beyond what we're doing with campaign sites. I'll share more on that later, as the details get worked out.
If your institution has done any archiving of campaign or other election websites, whether for local, state, or national elections, what challenges have you faced?