Web Archiving and Mainstreaming Special Collections: The Case of the Latin American Government Documents Archive

Kent Norsworthy from University of Texas Libraries

When historians of the future want to understand Latin American governments, they are going to be thrilled that curators like Kent Norsworthy from the University of Texas Libraries have been preserving Latin American government websites. Since 2005 the Latin American Government Documents Archive has been collecting, preserving, and providing access to ministerial and presidential documents from 18 Latin American and Caribbean countries. I had the pleasure of hearing Kent describe this collection in a presentation at the International Internet Preservation Consortium’s General Assembly (PDF of his slides). I am thrilled to have this opportunity to discuss and share their work here on the blog.

Trevor: Could you give us an example of some of the kinds of things you have collected that are already inaccessible on the live web? In your talk you gave a powerful example of how the collection captures, illustrates, and provides access to changes of government web presences in a coup. It would be ideal if you could walk us through and give us links to explore what a change in government looks like in the collection.

Kent: Since we have been crawling since 2005 at sites where the shelf life of Web content is pretty short, in many cases just a few weeks or months, a large proportion of everything in our archive has long since disappeared from the live web. A good example would be official government reports, many of which are issued periodically, say once per year or quarterly. What we typically see happen is a new administration will take power and just delete all of the old reports before they start putting their own up. In other cases, in the midst of an existing administration, the ministry will migrate over to a new Content Management System, and in the process just delete the older issues of the reports. The same thing happens with the full text or even audio or video of speeches: they’ll be up on these sites for a few months, sometimes a year or two, but eventually they are all taken down. In some cases it’s for political or ideological reasons; in others it’s a logistical or systems issue. At root, the problem is that in most cases there is no entity that is mandated to archive this content. Numerous government entities are charged with producing and publishing the content, but not with saving or preserving it, at least not when the content is born digital.

Beyond that, there are other cases where large swaths of content simply disappear from these government web sites. For example, if there is a major policy shift, often documentation of the previous policies will be scrubbed. If a particular minister falls out of favor and is dismissed from the government, often most of the content associated with the programs they oversaw is either taken down or reworked. We have even seen cases of territorial disputes between neighboring countries where the governmental entity charged with certifying and issuing official maps has removed maps from its site because they were deemed to undermine certain claims.

The most extreme or extensive example of this willful destruction of content would be a coup d’etat, where the new regime basically wants to erase all traces of the previous administration as quickly as possible. My presentation at the Library of Congress included “before and after” web archived screenshots from the 2009 coup in Honduras that resulted in the overthrow of the elected administration of President Manuel Zelaya. We typically crawl the sites on our list, about 275 Latin American governmental web sites, once per quarter. Here is our “before” snapshot of the Home Page for Zelaya’s administration; think of it as the Honduran equivalent of whitehouse.gov. Linked from here, and thus included in our archive, are dozens of reports, policy papers, speeches, etc., documenting all aspects of Zelaya’s period in the presidency.

“before” snapshot of the Home Page for Zelaya’s administration

Here you can see a subsequent crawl of the same address, conducted after the coup, by which point someone from the new regime had gone in and wiped all the files belonging to the site. When we pointed the crawler at the URL for the Honduran presidency, all we got was the default server test page!

Site during the coup shows default server test page

Then, at some point prior to our next crawl of the same URL, the coup regime had its new site up and running, again featuring none of the content that had previously resided at that address.

The coup regime's new site up and running

Eventually, the Zelaya folks put up a “government in exile” site, which we also crawled, where they posted their take on the coup and plans for attempting to reverse it. This site itself was only up for a few months, but future historians interested in studying the coup in Honduras will be able to consult a complete copy in our LAGDA Web archive.

Zelaya “government in exile” site

Trevor: Could you tell us a bit about how this collection started?

Kent: We started out as an Archive-It “pilot partner” in 2005. The Internet Archive was looking for institutions interested in testing their new web archiving service, and we had already been doing some work on trying to archive content from these government web sites, so it was a perfect fit. The actual need grew directly out of an issue we faced in the library. UT Austin’s Benson Latin American Collection had been collecting government reports in print from Latin American countries for decades. In some cases, we have copies of such reports dating back to the late 19th century, and in many cases we have runs of the same report issued annually over the course of 40 or 50 years.

With the advent of born digital beginning in the early 2000s, it became increasingly difficult to sustain this print-based collecting activity. We initially tried going to the web sites and printing the reports out from there. We also experimented with putting links to URLs for these reports in our library catalog, and even downloading individual reports and putting them in the Institutional Repository. Faced with the inefficiency and unsustainability of these approaches, we realized that web archiving the web sites that contain the reports might be a more feasible approach.

Trevor: How is this collection being used and who is using the collection? I imagine readers will be very interested in these, so please describe as many examples of different kinds of use as you can.

Kent: The main users of the collection are students and faculty, both here at UT Austin and at other institutions. I have talked to students who have used the collection to gather data or information for classroom assignments on a specific topic for which they cannot locate the content on the live web or in specialized databases. Again, a lot of the information published on these government web sites is simply not collected or archived by other entities, so if you can’t find it on the live web, chances are your next best bet is going to be to look for it in our LAGDA web archive.

Graduate students and faculty use the collection as a source of data for their research. We have one professor, a political scientist, whose research involves comparative analysis of Latin American electoral systems. He compares election results from multiple countries in the region across as broad a time period as he can obtain results for. While he does most of his data collection in country, he has told me that in some cases, he has found election results data in our LAGDA collection that he was unable to track down in other places. Another example would be with speeches: LAGDA contains the full text, as well as in some cases audio or video, of hundreds of presidential speeches. Several different types of research rely on large collections of speeches such as the ones contained in the LAGDA collection: content analysis, other types of linguistic research, work by political scientists or sociologists looking at ideology, etc.

Trevor: How does this collection fit in with other collections and areas of research activity at the University?

Kent: It sits right at the intersection of a lot of activity. Recognizing that when it comes to born digital, as a research library “we can’t collect it all”, we decided to put a couple of stakes in the ground and establish that the areas of human rights and Latin America would be major collecting priorities for us going forward. Our University has outstanding resources across campus in both these areas, and over the past few years we have launched several major digital initiatives in human rights and Latin America that not only draw on this expertise but also bolster it by making new scholarly resources available in the digital realm.

Our Teresa Lozano Long Institute of Latin American Studies, which is out of the College of Liberal Arts, has over 140 affiliated faculty from a broad array of disciplines. In addition to an undergraduate major, we have programs leading to an MA and PhD in Latin American Studies. The Institute is currently engaged in a broad collaborative endeavor with the Benson Collection, which is part of the UT Libraries, and our Latin American Government Documents web archive is one of the initiatives that is helping to propel this collaboration forward.

In the area of human rights, our Human Rights Documentation Initiative is similarly a multidisciplinary effort that brings together faculty expertise across campus in the context of a project that is anchored in the Libraries. There are three major components to the work done by the HRDI: long term preservation of fragile and vulnerable records of human rights struggles globally, promotion and secure usage of human rights archival material, and advancement of human rights research and advocacy around the world. One of the core HRDI collecting activities is an ongoing web archive of human rights related resources.

In terms of linking up with resources and expertise on campus, I should also mention the work we are doing through our LAGDA web archive with the campus supercomputing facility, the Texas Advanced Computing Center. We are working with TACC to research ways in which advanced text mining techniques and cloud computing can be used to facilitate the programmatic classification of and access to content in our web archives. The LAGDA collection alone is currently about 6TB in size, so the work that is being done on this front is helping the Libraries as we strive to meet the challenges of storage, preservation, access and other issues associated with “big data”.

Trevor: In your talk you offered this collection as an example of the role web archiving can play in research library special collections. Could you tell us a bit about what you see as the primary value of this kind of collection as a research library special collection? Do you think this is a model for other research libraries, and if so, what do you see as the key features that research libraries should be focusing on in developing plans for web archive special collections?

Kent: The primary value of the collection is as a resource for our faculty and students, as I have touched on above. To the extent that we are providing vital digital content and services that engage the processes of research and teaching, we are helping the library to fulfill its mission.

Beyond that, I would touch upon a couple of additional aspects. First is the issue of Web archiving as an example of, or a mechanism for, mainstreaming of special collections within the research library. In this era of ever-growing budget constraints in libraries and in higher education, in order for special collections to thrive we must have a strong presence on campus in terms of cutting edge digital initiatives. Furthermore, those initiatives must be structured in a way that integrates them into the mainstream of library activity. In the case of the web archiving projects we are engaged in, which originated out of special collections, we have strived to integrate processes as appropriate into existing library workflows, including in the areas of acquisition, cataloging, metadata, and long term preservation. To cite one example, both the HRDI web archive staff and I, on behalf of the LAGDA collection, have worked closely with the UT Libraries Metadata Librarian, Amy Rushing, to facilitate the integration of our metadata into existing Libraries discovery systems. Amy is now part of a broader effort involving several institutions nationally that is seeking to elaborate a set of metadata best practices for web archiving.

The second is the imperative for collaboration among research libraries in doing Web archiving. Web archiving is an area that is a natural fit for collaboration. On the one hand, as I suggested earlier, the web is so vast that no single research library is going to be able to even aspire to “collect it all,” so we will need to agree on who will take responsibility for collecting which piece. On the other hand, the disincentives to cooperation that existed with print are not really present when we talk about born digital collections such as web archives. We don’t need 50 or 100 research libraries each acquiring their own copy; we can collect it in one place, but provide access to the content across all of our institutions. In order to do that, we need to come together as a group and figure out not just the appraisal and the crawling piece, not just the access and the preservation piece, but a broader approach where we look at the whole web archiving lifecycle and figure out which institutions can take responsibility for what.

I was at a meeting at Columbia University a couple of weeks ago, “Web Archiving Policies and Practices in the US: 2012 Summit,” where we were attempting to start doing just that. We hope to meet again to continue work on the development of best practices and to move toward some type of collaborative mechanism for web archiving among US-based memory institutions. If any of your readers are interested in joining this type of discussion, I would urge them to contact someone from this group.
