Archiving the “Intellectual” Components of a Website

The following is a guest post by Abbie Grotke, Web Archiving Team Lead.

You might imagine that with the web being in its twenties everyone would know exactly what a website is. But you’d be surprised – those of us in the web archiving business spend quite a bit of time pondering what makes up a particular organization or person’s website. We’re often challenged to determine what domains constitute any given site we are trying to archive.

Members of "Puppies for a Stronger Government" gathering to design their website? No, just an image from Prints and Photographs collection, Library of Congress LC-USZ62-60676

The Library of Congress must send notices or permission requests to site owners for most content that we collect.  In some cases, content that we collect might have links to other content that either we don’t want or whose owner we didn’t contact.  Knowing the boundaries of a website is an important part of our process.

When we archive websites, our Library recommending officers select a seed URL (the starting point for the crawler) and enter it into a curator tool that is used to manage our web archiving workflow. The seed URL is also the URL that we catalog so that researchers can access archived content.  Seed URLs can be any URL on the web, but typically they are the top level of a site's domain.

Let me provide a close-to-home example. Everyone knows the Library’s URL is If we set our crawler to archive that domain, it will follow links to subdomains, pages and content, such as this blog post.

But what about or Or our newly launched

All of those domains also equal the Library of Congress website. Our Web Archiving Team has a term for these, and things like an organization or candidate’s social media accounts. We call them “intellectually part of the site.” Humans recognize that they are related, but the machines, the crawling tools, cannot. We have to give special instructions to the crawlers to get that content. We do that by “scoping” them so the crawler knows what links it should follow for that site.
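To make the idea of scoping concrete, here is a minimal sketch in Python of how a crawler might decide whether to follow a link. It is an illustration only, not the Library's actual tooling: the function names, the rule that subdomains of the seed are followed automatically, and the example.org / example.net domains are all assumptions made for this example.

```python
from typing import Iterable
from urllib.parse import urlsplit


def host_in_scope(host: str, scoped_domains: Iterable[str]) -> bool:
    """Return True if host equals, or is a subdomain of, any scoped domain."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in scoped_domains)


def should_follow(link: str, seed: str, extra_scopes: Iterable[str]) -> bool:
    """Decide whether a crawl that started at `seed` should follow `link`.

    `extra_scopes` holds the hand-picked domains that a person judged to be
    "intellectually part of the site" (hypothetical parameter name).
    """
    seed_host = urlsplit(seed).hostname or ""
    # Treat "www.example.org" and "example.org" as the same site.
    if seed_host.startswith("www."):
        seed_host = seed_host[4:]

    scoped = {seed_host} | {s.lower() for s in extra_scopes}
    scoped.discard("")  # ignore a seed that had no host at all

    link_host = urlsplit(link).hostname
    if link_host is None:  # e.g. mailto: links have no host
        return False
    return host_in_scope(link_host, scoped)


if __name__ == "__main__":
    seed = "https://www.example.org/"
    # A subdomain of the seed is followed without special instructions.
    print(should_follow("https://blogs.example.org/post", seed, []))
    # A different domain is skipped until a person adds it as a scope.
    print(should_follow("https://example.net/about", seed, []))
    print(should_follow("https://example.net/about", seed, ["example.net"]))
```

The point of the sketch is that the machine only compares host names; deciding that a second domain belongs in `extra_scopes` is still a human judgment.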

Our Recommending Officers, who are selecting content for the archives, are constantly challenged in identifying what seed URLs to archive out of the billions of pages out there. After a seed URL is nominated, our team, often in consultation with the RO, scopes it to identify what is intellectually part of the site. We often go back as sites change or evolve to re-evaluate the scoping.

List of related URLs

In our election archives, we find a lot of people and organizations buy up new web domains to support whatever hot topic is in the news (domains are so cheap, why not!). Here’s an (albeit silly) example to help illustrate this issue:

Let’s pretend that there’s a site that’s been identified for archiving that was created by an organization called Puppies for a Stronger Government, and they are at (not a real URL–at least not as of this writing). This site has social media accounts that we’ll want to archive, but they’ve also just announced that a candidate (a cat, of all things!) is running on a platform that they don’t agree with. So they quickly buy up a few domains to get their message out to a variety of audiences: and and They put links to these new domains on their website, and use the same design and contact information.

We’re obviously not really archiving puppy and kitten debates, as much as I might like to pretend we are. But this does illustrate the type of thing we see with content we archive.

Are these new domains intellectually part of the original site? Or should we treat them as new sites entirely? Sometimes, particularly if there aren’t design clues or obvious contact information, it is hard to tell. We can archive them all; that’s not a problem – but how we handle them (as a seed or a scope) affects how comprehensively we might collect something, and whether or not we catalog them for access. If we treat something as a “seed” it gets more weight in our workflow process; a scope is just an added bonus.
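The difference between the two treatments can be sketched as a small data model. This is a simplified illustration under stated assumptions, not a description of the curator tool: the `SiteRecord` class, its field names, and the example.org / example.net URLs are hypothetical, and the sketch assumes that only seeds are cataloged while scopes are merely collected alongside them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SiteRecord:
    """One nominated site: a cataloged seed plus any scoped extras."""
    seed: str                                        # cataloged for access
    scopes: List[str] = field(default_factory=list)  # collected, not cataloged


def catalog_urls(records: List[SiteRecord]) -> List[str]:
    """URLs that would get their own catalog entry (seeds only)."""
    return [r.seed for r in records]


def crawl_targets(records: List[SiteRecord]) -> List[str]:
    """Everything the crawler is allowed to collect (seeds and scopes)."""
    targets: List[str] = []
    for r in records:
        targets.append(r.seed)
        targets.extend(r.scopes)
    return targets


# Option A: the new domain is treated as part of the original site.
as_scope = [SiteRecord("https://example.org/", scopes=["https://example.net/"])]

# Option B: the new domain is treated as a site in its own right.
as_seeds = [SiteRecord("https://example.org/"),
            SiteRecord("https://example.net/")]

if __name__ == "__main__":
    for label, records in (("as scope", as_scope), ("as seeds", as_seeds)):
        print(label, "catalog:", catalog_urls(records))
        print(label, "crawl:  ", crawl_targets(records))
```

Under these assumptions both options collect the same two URLs, but only the second gives the new domain its own catalog entry.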

And while there is generally an assumption that a website = a domain name, domain names do change. Most would agree that the same website content and organization at a new domain name is still the same “website.”

So, next time you’re asked “what’s your website?” how will you respond?

9/28/12: Typographical errors fixed.

Yes, The Library of Congress Has Video Games: An Interview with David Gibson

Video games represent one of the most difficult challenges for digital preservationists. They are created for a diverse array of hardware and software platforms, are rife with rights issues, and, as expressive creative works, are objects whose artifactual qualities one hopes to attend to at the highest levels. Despite being one of the most challenging forms of content, there […]

Exhibiting Video Games: An interview with Smithsonian’s Georgina Goodlander

For this installment of Insights, the National Digital Stewardship Alliance Innovation Working Group’s ongoing series of interviews, I talk with Georgina Goodlander, the Web & Social Media Content Manager for the Smithsonian American Art Museum and Exhibition Coordinator for the Museum’s  The Art of Video Games exhibition. There are already some nice interviews exploring the subject […]

Born Digital Minimum Processing and Access

The following guest post from Kathleen O’Neill, Archives Specialist in The Library of Congress Manuscript Division continues our series of posts reflecting on CurateCamp Processing. Meg Phillips’s earlier post on More Product, Less Process for Born Digital Collections focused on developing minimum standards for ingest and processing with the goal of making the maximum number […]

Digital Cultural Heritage DC Meetup Launched

I had the pleasure of joining a number of colleagues at the inaugural meetup for Digital Cultural Heritage DC last night.  The informal group bills itself as a monthly gathering for those “working to acquire, preserve, steward, provide access to, exhibit and interpret digital cultural heritage,” and “a great opportunity for networking with others from […]

New Residency Program Moves Forward

Heads up, recent grad students!  The National Digital Stewardship Residency program is in the works, and Library of Congress staff members are currently working with other institutions in the Washington, D.C. area to set up this new program, which promises to be a great opportunity for students to gain real world experience with digital preservation […]

Communities of Practice Make it Possible: Digital Preservation at Smaller Institutions

The following is a guest post by Jennifer Gunter King, Director, Harold F. Johnson Library, Hampshire College. In July, scholars, entrepreneurs and digital preservation practitioners gathered in Arlington, Va., for the annual meeting of the National Digital Information Infrastructure and Preservation Program and the National Digital Stewardship Alliance, DigitalPreservation 2012. NDIIPP program management director Martha […]

New Web Archiving Resources

The launch of a new web site is the perfect opportunity for an organization to revamp itself. Information is refreshed and updated, new initiatives are touted while old content and projects get shuffled out of plain sight. Graphics and architectures change to better meet user needs and underlying technologies allow for easier management. Even an […]

Sharing, Theft, and Creativity: deviantART’s Share Wars and How an Online Arts Community Thinks About Their Work

Dan Perkel is a Design Researcher at IDEO. Last year, he finished his doctoral work at Berkeley’s iSchool. His dissertation, Making Art, Creating Infrastructure: deviantART and the Production of the Web, involved an extensive ethnographic study of deviantART, a massive online community site built around sharing digital art. As part of our ongoing Insights […]