Legal Issues in Web Archiving

This is a guest post by Abbie Grotke, Web Archiving Team Lead at the Library of Congress.

What do you get when you mix six lawyers and more than 25 web archivists in a room together? No, not a joke. When you’re serious about the topics of legal deposit, permissions responsibilities, access concerns, and robots.txt, you get a great discussion, that’s what you get!

Experts in the field of copyright law joined members of the International Internet Preservation Consortium at our first Legal Issues Roundtable, held during the IIPC meetings earlier this month.

Two lawyers. Library of Congress Prints and Photographs Division, LC-USZ61-1806.

As a web archivist at LC, I am often confronted with legal issues. Our team is responsible for working directly with the Library’s lawyers to get guidance on permissions policies for our web archives. Policy decisions affect every step of the workflow – from what notices or permission requests are sent to the site owners we’re interested in archiving, to what sites we can archive based on those notices and responses, and what content we can provide access to and in what way. This is definitely an area I am interested in, so I offered to help organize the session.

The intent behind the roundtable was to discuss and compare the impact of international and national legislation and policies on web archiving activities. The organizers were particularly interested in exploring what the current challenges were for IIPC members, and how IIPC could help with some of those challenges.

The roundtable agenda offered a chance for participants to speak about their own situations back home. There was discussion of some of the different legal frameworks, including the need for legislation as opposed to the availability in some countries of existing legal doctrines such as fair use. We also discussed permissions-based approaches where neither exists, or where the legal frameworks are unclear.

Many IIPC organizations are able (or mandated) to collect and preserve web content created in their countries and by their countries’ citizens. Some of the challenges for these organizations include:

  • Access to content. Some may only allow researchers to use the archives on the library premises, in some cases because of privacy laws or concerns. A few IIPC members reported that they hope to stretch the concept of “premises” to include other branches or partner libraries, to allow for broader access.
  • Laws covering some types of electronic content do not always extend to websites.
  • Identifying what falls within the scope of a domain. Some examples: In France, the .fr is an obvious indicator, but in reality only 1/3 of French websites are on .fr. Fortunately, the French law does not specify that only .fr must be collected, so the library can preserve content on other domains that are produced by French citizens. To find this content, they research to see where the creator of the site is living, and use a written framework that can be shared if challenged about archiving. In Denmark, they have similar practices, archiving content from or about Denmark and designed for a Danish audience.
  • A number of national library and archive representatives in attendance reported that their countries are in the process of changing or evaluating legal deposit laws to include websites or to further clarify regulations around websites; IIPC members have been actively participating in the processes, where possible. Some reported on successes, others had disappointing results: law changes have been refused, delayed or not implemented in ways that benefit the national libraries or archives.

Many IIPC members have a permission-based approach like we do at the Library of Congress. Challenges reported here include:

  • Lack of response from site owners. Members seeking permission reported a 30-50% response rate; it’s not that websites are denying permission. They just aren’t responding to our attempts to contact them.
  • Patchy, unbalanced collections as a result of permissions not granted.
  • The tremendous effort required to contact site owners and notify or obtain permission can sometimes overwhelm staff resources.
  • Risk assessments and fair use analysis may allow some organizations to do more; however, some are hesitant to go down this path and instead take a more cautious approach.

One big discussion at the roundtable revolved around robots.txt, a file that websites use to provide instructions to crawlers. Robots exclusions are a known Internet convention, but do they have any legal meaning? Our lawyers in attendance told of some legal cases where the absence of a robots.txt meant a site could be crawled, but were aware of no cases that stated that the presence of one meant a site couldn’t be crawled.
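For readers who have not looked inside a robots.txt file, here is a minimal sketch of how a crawler might read one, using Python's standard urllib.robotparser module. The sample rules, the example.org URLs, and the "example-archive-bot" user agent are all made up for illustration; they do not describe any IIPC member's crawler or any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /css/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "example-archive-bot" is a placeholder user agent name.
allowed_home = parser.can_fetch(
    "example-archive-bot", "https://example.org/index.html"
)
blocked_private = parser.can_fetch(
    "example-archive-bot", "https://example.org/private/report.html"
)

print("home page allowed:", allowed_home)
print("private page allowed:", blocked_private)
```

Note that the file only states the site owner's request. Whether a crawler honors it is a policy choice made by whoever runs the crawler, which is exactly what the roundtable was debating.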

No, not this type of robot... "Mr. Televox, the perfect servant, who is never late or insolent ..." Library of Congress Prints and Photographs Division, LC-USZ62-106927.

In web archiving, many organizations respect robots.txt instructions; however, doing so can interfere with archiving in a number of ways. As you may have read in yesterday’s post by Jimi Jones, robots.txt can block entire sites or specific parts of sites. Sometimes style sheets and images are blocked, elements that are important when you are trying to document the look and feel of a website. Some IIPC members obey robots.txt except when it comes to inline images and stylesheets, so the website is better represented. Others who are seeking permission bypass the robots.txt so that the sites archived are as complete as possible.
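The "obey robots.txt except for inline images and stylesheets" approach can be sketched roughly as follows. This is only an illustration of the idea, not how any particular archive implements it: the suffix list, the helper names is_page_requisite and should_fetch, the sample rules, and the "example-archive-bot" user agent are all assumptions. In particular, it guesses from the file extension whether a URL is a stylesheet or image, whereas a real crawler would more likely know this from how the resource was discovered on the page.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Assumed list of suffixes treated as "look and feel" resources.
EMBEDDED_SUFFIXES = {".css", ".png", ".jpg", ".jpeg", ".gif", ".svg"}


def is_page_requisite(url: str) -> bool:
    """Guess from the URL path whether this is a stylesheet or image."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix in EMBEDDED_SUFFIXES


def should_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Follow robots.txt, except for stylesheets and images."""
    if is_page_requisite(url):
        return True
    return parser.can_fetch(user_agent, url)


# Hypothetical rules that block both a stylesheet directory and a members area.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /css/", "Disallow: /members/"])

AGENT = "example-archive-bot"  # placeholder user agent name
stylesheet_ok = should_fetch(rules, AGENT, "https://example.org/css/site.css")
members_ok = should_fetch(rules, AGENT, "https://example.org/members/list.html")

print("stylesheet fetched:", stylesheet_ok)
print("members page fetched:", members_ok)
```

Under this sketch the blocked stylesheet is still collected, so the archived page keeps its appearance, while the blocked members page is left alone.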

We wondered if more could be done to educate site owners about the impact of robots.txt on preservation efforts. Some participating in our discussion suspect that more and more content management systems are inserting standard robots.txt files, and site owners may not be aware they can customize them.

Finally, we discussed how IIPC could help its members. Some ideas that emerged:

  • Document best practices, to help others who are struggling with lawyers on web archiving issues.
  • Provide more legal resources and information on the IIPC website, such as a map of where legislation supporting web archiving exists and where it does not, and a directory of resources.
  • Share permission letters and formal licenses with other members.
  • Further study (perhaps a white paper?) on the use and abuse of robots.txt. Roundtable participants shared a 2006 paper by Smitha Ajay and Jaliya Ekanayake titled Analysis of the Usage Statistics of Robots Exclusion Standard (PDF); however, more recent and expansive information might be useful, particularly for lawyers working on policy issues related to web archiving and the use of robots.txt. There was particular interest in documentation explaining why archivists might not follow robots.txt exclusions, and why we don’t believe it is being used as a proxy for copyright.

We certainly didn’t resolve any of our challenges over the course of this afternoon, but it was great to be able to share stories and learn more about the approaches that our colleagues at other organizations are taking. At least we know we are not alone. We will continue to share our successes and challenges, providing more resources for each other as we move ahead.

One Comment

  1. Ken Krugler
    June 6, 2012 at 3:53 pm

    Interesting discussion, thanks! Over at the Web Crawler Masters group on LinkedIn, we’ve been discussing the recent ruling by a Canadian court on the legality of scraping – see
