Why Web Archiving?: A Conversation with Web Archivists and Researchers

On May 23, the Library of Congress hosted “#WhyWebArchiving: Preserving Internet Content for Research Use,” a virtual event that brought together Library subject experts actively involved in building web archives with researchers who have used the Library’s web archives in their work. The event kicked off the 2022 Web Archiving Conference, which the Library co-hosted with the International Internet Preservation Consortium (IIPC). It included discussions of the importance of web archiving, how curators respond to current events and decide which sites to preserve, how to become more involved in web archiving, and how to use web archives in research and teaching. A video recording of the panel is now available online, and you can also read about some of the highlights here.

Screenshot capturing the speakers of the live-streamed event

The speakers of #WhyWebArchiving: Elizabeth Osborne, Ian Milligan, Amelia Acker, Benjamin Lee, and Jennifer Harbster

The panel’s moderator, Ian Milligan, opened the event with reflections on web archiving from his perspective as a historian who studies web archives and how they can be used for historical research. While it is now common to call a historian who uses web archives in their work a “web historian,” Milligan predicts that in the future, scholars engaged in that type of research will just be called “historians.” The internet and web archives will lie “at the heart of…any study that’s looking back to the mid-1990s or later,” Milligan explained, “and if we as a society want to do justice to understanding almost any dimension of our society, culture, [or] politics since the mid-1990s, we’re going to have to…use web-based primary sources.” Researchers are going to need that information, Milligan continued, so it’s important for web archivists at the Library and other institutions to preserve and provide access to it.

Milligan knows, however, that web archiving is a complex and challenging task. “Just think about how much of an interplay it takes to display even one website,” he said, going on to name just a few of the many components that are present in the Library’s own website: images, formatting, links, connections to social media, and the ability to zoom into documents. And not only that, Milligan added, “If I go to loc.gov tomorrow, it will be different than it was today. And it will be different than it was yesterday. So what does it mean to preserve loc.gov? Do we preserve it every day? Do we preserve it every month, every week? Every second?…[and] what if it’s interactive?” When it comes to web archiving, “the possibilities are limitless, but also terrifying, as the amount of data we’re talking about in these conversations scales to levels few could have imagined two decades ago: hundreds of billions of archived pages, hundreds of petabytes.”

The homepage for the Library of Congress website, as it appeared in 1997.

This capture from 1997 is one of the earliest that the Library has of loc.gov. The website has been captured over 6,000 times since 1997.

The challenges of web archiving are well known to the librarians who build these collections, like J.J. Harbster (Science, Technology, and Business Division) and Elizabeth Osborne (Law Library). As Harbster explained, “I can’t collect and preserve everything. I think that’s a romantic notion and nobody can achieve that…What I can achieve is a curated, representative collection…taking a snapshot of a moment in time, identifying content that tells a story [or] answers a question.” In her role as Head of the Science Reference section, she contributed to event-based collections related to Hurricane Katrina and the COVID-19 pandemic, while also developing thematic collections like the Science Blogs Web Archive. When she first began working on the Coronavirus Web Archive, she read about other pandemics in history, particularly the Influenza Pandemic of 1918, and took note of what kind of information the authors were citing and the topics they focused on. This reaffirmed the need to balance the science, policy, and business content the Library was collecting with “human stories” about how the pandemic affected people’s everyday lives. The still-growing collection, which includes over 450 items, was a collaborative effort, and it contains content on a variety of subjects, like “corona cuisine,” fashion, religion, and the performing arts.

A screenshot of the homepage of Sofa Shakespeare, as it appeared in May 2020. There is an embedded video of a performance with the text, "Thousands of Shakespearean actors & fans all over the world, waiting out a global pandemic in isolation. One minute of a play to record as you choose. One epic performance. This is: Sofa Shakespeare."

The Coronavirus Web Archive includes content like Sofa Shakespeare, an online project for professionals and fans to perform plays by William Shakespeare during COVID-19 isolation. This capture is from May 19, 2020.

As a Senior Legal Reference Librarian at the Law Library and manager of the Legal Blawgs Web Archive, Osborne sees another dimension to the importance of web archiving. Collecting and preserving legal blogs is integral to the Law Library’s mission to “provide [access] to an unrivaled collection of U.S., foreign, comparative, and international law,” she said, because blogs can offer a perspective that might be missing from more traditional forms of legal scholarship. “Bloggers are pushing out newer ideas and discussing areas where the law is changing,” she explained, and because writers can publish on a blog faster than they can through other avenues, they can respond more quickly to current events or developments in the field. Blogs can also be very niche, delving deep into specific topics, and they are often interactive, “invit[ing] engagement and critique in the form of comments and responses.” Osborne, who is herself an attorney, pointed out that blogs have been cited in courts for their legal analysis. These are all significant reasons for preserving blogs, which otherwise tend to have short lifespans on the internet, but Osborne admitted that she doesn’t know exactly how future researchers will use the archive and had to learn to be “comfortable with that ambiguity.” To account for all the possibilities, Osborne said, “I have made it my goal to try to make sure that I have cast a wide net and bring into the collection blogs that are diverse, represent[ing] a variety of topics and voices.”

The homepage for the Concurring Opinions blog, as it appeared in October 2005. Articles include titles "Fictions, Concessions & Genossenschaft," "Must see Tv...," and "Preparing for a Bird Flu Pandemic."

The Legal Blawgs Web Archive includes content like the Concurring Opinions blog. This capture is from October 14, 2005.

Benjamin Lee and Amelia Acker are on the other side of web archiving, having utilized the Library’s web archives as researchers. Lee, a Ph.D. candidate in Computer Science at the University of Washington, is interested in addressing the problems of scale that Milligan identified when describing the increasing size and complexity of web archives. In his research, Lee explores using machine learning to improve search and discovery for “massive” digital collections. He put his ideas into practice as a 2020 Innovator-in-Residence at the Library, where he created Newspaper Navigator, a tool for searching through millions of digitized historic newspapers. He is excited about similar opportunities with web archives, and sees the possibilities of going beyond keyword searching to use other forms of textual and visual analysis to “understand the broader contours” of web archives, identify interesting patterns, and provide new ways to approach collections. An example of how this might work is the project he undertook with Trevor Owens to analyze a dataset of born-digital government PDFs. Their work demonstrated how machine learning might be used to rethink how web archives are processed and enrich the search experience for the user.

This image demonstrates Newspaper Navigator’s interactive machine learning interface. In this example, the AI has been trained to retrieve images of birds. Read “The Newspaper Navigator Dataset: Extracting Headlines and Visual Content from 16 Million Historic Newspaper Pages in Chronicling America” to learn more.

Acker, an Assistant Professor at the University of Texas at Austin, comes to web archives from the perspective of an information scientist interested in “how digital preservation approaches are changing and how that impacts the way we understand society and how we know ourselves.” She is concerned about privatized information infrastructures and believes the work that the Library and other institutions are doing to preserve born-digital information is vital because researchers can’t rely on platforms to preserve and provide access to data long term. In her teaching, Acker is also particularly fond of the Library’s Web Cultures Web Archive, which she has used as a tool to teach graduate students about metadata. The collection inspired a research project with Anne C. Loos and Julia Sufrin to analyze data from the archives of Meme Generator, a site that enables users to create and share memes. Acker said her students were excited to see “vintage” memes from 2012 and study how they evolved over time. Memes may seem like just silly entertainment, but they have the potential to reveal insights into “cultural expectations of humor, politics, [and] current events,” she explained.

A bar graph of the ten most frequent base memes, including "Y U No," "Futurama Fry," "Insanity Wolf," "Philosopher," "Success Kid," "The Most Interesting Man in the World," "Willy Wonka," "Foul Bachelor Frog," "Socially Awkward Penguin," and "Advice Yoda Gives."

This bar chart is from “The Neil deGrasse Tyson Problem: Methods for Exploring Base Memes in Web Archives,” a 2020 article written by Amelia Acker, Anne C. Loos, and Julia Sufrin. It compares the most frequently used base meme images in a dataset derived from the Meme Generator web archive.

The event continued with questions from Milligan and the audience. In response to a question about how web archivists respond to current events, Harbster described how she is still working on the Coronavirus Web Archive, as the pandemic continues to move into new phases. “I think it’s good to be aware and flexible,” she said, so that as the pandemic evolves, the collection can too. The Coronavirus Web Archive is an example of an event-based collection, created by the Library to capture a specific event (other examples include web archives for September 11, 2001, the Iraq War, and U.S. elections). Osborne, whose Legal Blawgs Web Archive is a thematic collection rather than an event-based one, emphasized the importance of also developing well-balanced thematic collections that ideally archive content related to current events as part of their routine collecting. For her, current events are a prompt to reevaluate thematic collections and fill in any gaps. With the pandemic, for example, it was important to check that the Legal Blawgs Web Archive had sufficient content related to health law and to consider whether there were any “newer voices” or “different perspectives” that she could add.

Acker added, “One thing that I have found really exciting is that we’re beginning to see that web archiving is becoming something of a citizen science project,” referencing the grassroots work of volunteers in data rescue and web archiving projects like EDGI (Environmental Data & Governance Initiative) and SUCHO (Saving Ukrainian Cultural Heritage Online). These groups, which include people from a variety of backgrounds, demonstrate that web archiving tools are becoming more accessible and easier to use and that concerned citizens can become involved in the work to preserve digital heritage too. She described “informal to more formal web archiving project happen[ing] really quickly” now, in part because a “silver lining” of the pandemic is that people are now more familiar with tools for virtual collaboration.

Homepage of the Isis the Scientist blog, as it appeared in January 2013. Header image is of two women holding parasols and the first article is entitled "On the Importance of Viable Alternative Careers."

The Science Blogs Web Archive includes content like Isis the Scientist, a blog written by a female scientist. This capture is from January 5, 2013.

Acker elaborated on this idea when an audience member asked, “Can anybody contribute to web archiving? How do you get started?” She said, “I like to think that there are two kinds of swimlanes of web archiving. There’s macro efforts that we see from huge organizations. That’s formal collecting [with] lots of resources, and then [there’s] micro efforts,” which she described as more informal collecting by individuals or groups. Both are important, she said. She recommended the Internet Archive’s Save Page Now as a way for anyone to archive individual web pages. “There’s a whole continuum of ways to get involved,” including connecting with professional organizations like IIPC to learn about web archiving programs and workshops.
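
For readers curious about what a “micro effort” can look like in practice, here is a minimal sketch in Python. It assumes the commonly used Save Page Now address pattern, in which the page to be captured is appended to https://web.archive.org/save/. The function name save_page_now_url is hypothetical and was not mentioned by the panel. The sketch only builds the address and sends nothing over the network, and the Internet Archive may apply rate limits or change how the service behaves.

```python
from urllib.parse import quote

# Assumed address pattern for the Internet Archive's Save Page Now service.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_page_now_url(page_url: str) -> str:
    """Build the address that asks Save Page Now to capture page_url.

    Hypothetical helper for illustration; it does not contact the service.
    """
    if not page_url.startswith(("http://", "https://")):
        raise ValueError("expected an absolute http(s) URL")
    # Leave URL delimiters alone and escape only characters such as spaces.
    return SAVE_ENDPOINT + quote(page_url, safe=":/?&=#%~+,;@")


if __name__ == "__main__":
    # Opening the printed address in a browser requests a capture.
    print(save_page_now_url("https://www.loc.gov/"))
```

Pasting the printed address into a browser is one way to request a capture of a single page. Larger or recurring collecting efforts are better served by the programs and workshops that organizations like IIPC point to.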

Another audience member asked, “If you can’t collect everything, how does one justify your bounds?” Harbster responded that she takes guidance from knowing what current researchers have asked for. She also said it’s important to have a clear scope and purpose, citing the example of the Coronavirus Web Archive, where the amount of content produced could be overwhelming. For that project, a web archivist created a thorough rubric of criteria to consider when deciding if content should be included in the archive. This clarified the process of collecting, but Harbster also mentioned that with ongoing events like the pandemic, it’s important to revisit a collection’s scope and purpose in order to respond to new developments or fill any gaps in the collection. Osborne concurred and added, “I think the most important thing is don’t work in a bubble.” She noted that it’s important for her to work collaboratively, so that “a lot of other people are involved in looking at things, improving and discussing [the collection].”

Watch the video to hear the panelists talk more about these issues and answer other questions, such as: Is anyone preserving misinformation sites? Why do web archivists use the word ‘curate’ when speaking about building collections, and how do they handle the potential for bias in collecting? What about the potential for duplication? And how would a person further develop their web archiving and digital humanities skills?

Why do you think web archiving is important? Let us know in the comments!
