On May 23, the Library of Congress hosted “#WhyWebArchiving: Preserving Internet Content for Research Use,” a virtual event that brought together Library subject experts actively involved in building web archives with researchers who have used the Library’s web archives in their work. The event kicked off the 2022 Web Archiving Conference, which the Library co-hosted with the International Internet Preservation Consortium (IIPC), and it included discussions on topics like the importance of web archiving, how curators respond to current events and decide which sites to preserve, suggestions on how to become more involved in web archiving, and examples of how to use web archives in research and teaching. A video recording of the panel is now available online, and you can also read about some of the highlights here.
The panel’s moderator, Ian Milligan, opened the event with reflections on web archiving from his perspective as a historian who studies web archives and how they can be used for historical research. While it is now common to call a historian who uses web archives in their work a “web historian,” Milligan predicts that in the future, scholars engaged in that type of research will just be called “historians.” The internet and web archives will lie “at the heart of…any study that’s looking back to the mid-1990s or later,” Milligan explained, “and if we as a society want to do justice to understanding almost any dimension of our society, culture, [or] politics since the mid-1990s, we’re going to have to…use web-based primary sources.” Researchers are going to need that information, Milligan continued, so it’s important for web archivists at the Library and other institutions to preserve and provide access to it.
Milligan knows, however, that web archiving is a complex and challenging task. “Just think about how much of an interplay it takes to display even one website,” he said, going on to name just a few of the many components that are present in the Library’s own website: images, formatting, links, connections to social media, and the ability to zoom into documents. And not only that, Milligan added, “If I go to loc.gov tomorrow, it will be different than it was today. And it will be different than it was yesterday. So what does it mean to preserve loc.gov? Do we preserve it every day? Do we preserve it every month, every week? Every second?…[and] what if it’s interactive?” When it comes to web archiving, “the possibilities are limitless, but also terrifying, as the amount of data we’re talking about in these conversations scales to levels few could have imagined two decades ago: hundreds of billions of archived pages, hundreds of petabytes.”
The challenges of web archiving are well known to the librarians who do this work, like J.J. Harbster (Science, Technology, and Business Division) and Elizabeth Osborne (Law Library). As Harbster explained, “I can’t collect and preserve everything. I think that’s a romantic notion and nobody can achieve that…What I can achieve is a curated, representative collection…taking a snapshot of a moment in time, identifying content that tells a story [or] answers a question.” In her role as Head of the Science Reference section, she contributed to event-based collections related to Hurricane Katrina and the COVID-19 pandemic, while also developing thematic collections like the Science Blogs Web Archive. When she first began working on the Coronavirus Web Archive, she read about other pandemics in history, particularly the Influenza Pandemic of 1918, and she took note of what kind of information the authors were citing and the topics they focused on. This reaffirmed the need to balance the science, policy, and business content the Library was collecting with “human stories” about how the pandemic affected people’s everyday lives. The still-growing collection, which includes over 450 items, was a collaborative effort, and it contains content on a variety of subjects, like “corona cuisine,” fashion, religion, and the performing arts.
As a Senior Legal Reference Librarian at the Law Library and manager of the Legal Blawgs Web Archive, Osborne sees another dimension to the importance of web archiving. Collecting and preserving legal blogs is integral to the Law Library’s mission to “provide [access] to an unrivaled collection of U.S., foreign, comparative, and international law,” she said, because blogs can offer a perspective that might be missing from more traditional forms of legal scholarship. “Bloggers are pushing out newer ideas and discussing areas where the law is changing,” she explained, and because writers can publish on a blog faster than they can through other avenues, they can respond more quickly to current events or developments in the field. Blogs can also be very niche, delving deep into specific topics, and they are often interactive, “invit[ing] engagement and critique in the form of comments and responses.” Osborne, who is herself an attorney, pointed out that blogs have been cited in courts for their legal analysis. These are all significant reasons for preserving blogs, which otherwise tend to have short lifespans on the internet, but Osborne admitted that she doesn’t know exactly how future researchers will use the archive and had to learn to be “comfortable with that ambiguity.” To account for all the possibilities, Osborne said, “I have made it my goal to try to make sure that I have cast a wide net and bring into the collection blogs that are diverse, represent[ing] a variety of topics and voices.”
Benjamin Lee and Amelia Acker are on the other side of web archiving, having utilized the Library’s web archives as researchers. Lee, a Ph.D. candidate in Computer Science at the University of Washington, is interested in addressing the problems of scale that Milligan identified when describing the increasing size and complexity of web archives. In his research, Lee explores using machine learning to improve search and discovery for “massive” digital collections. He put his ideas into practice as a 2020 Innovator-in-Residence at the Library, where he created Newspaper Navigator, a tool for searching through millions of digitized historic newspapers. He is excited about similar opportunities with web archives, and sees the possibilities of going beyond keyword searching to use other forms of textual and visual analysis to “understand the broader contours” of web archives, identify interesting patterns, and provide new ways to approach collections. An example of how this might work is the project he undertook with Trevor Owens to analyze a dataset of born-digital government PDFs. Their work demonstrated how machine learning might be used to rethink how web archives are processed and enrich the search experience for the user.
Acker, an Assistant Professor at the University of Texas at Austin, comes to web archives from the perspective of an information scientist interested in “how digital preservation approaches are changing and how that impacts the way we understand society and how we know ourselves.” She is concerned about privatized information infrastructures and believes the work that the Library and other institutions are doing to preserve born-digital information is vital because researchers can’t rely on platforms to preserve and provide access to data long term. In her teaching, Acker is also particularly fond of the Library’s Web Cultures Web Archive, which she has used as a tool to teach graduate students about metadata. The collection inspired a research project with Anne C. Loos and Julia Sufrin to analyze data from the archives of Meme Generator, a site that enables users to create and share memes. Acker said her students were excited to see “vintage” memes from 2012 and study how they evolved over time. Memes may seem like just silly entertainment, but they have the potential to reveal insights into “cultural expectations of humor, politics, [and] current events,” she explained.
The event continued with questions from Milligan and the audience. In response to a question about how web archivists respond to current events, Harbster described how she is still working on the Coronavirus Web Archive as the pandemic continues to move into new phases. “I think it’s good to be aware and flexible,” she said, so that as the pandemic evolves, the collection can too. The Coronavirus Web Archive is an example of an event-based collection, created by the Library to capture a specific event (other examples include web archives for September 11, 2001, the Iraq War, and U.S. elections). Osborne, whose Legal Blawgs Web Archive is a thematic collection and not an event-based one, emphasized the importance of also developing well-balanced thematic collections that would ideally be archiving content related to current events as part of their routine collecting. For her, current events are a prompt to reevaluate thematic collections and fill in any gaps. With the pandemic, for example, it was important to check that the Legal Blawgs Web Archive had sufficient content related to health law and to consider if there were any “newer voices” or “different perspectives” that she could add.
Acker added, “One thing that I have found really exciting is that we’re beginning to see that web archiving is becoming something of a citizen science project,” referencing the grassroots work of volunteers in data rescue and web archiving projects like EDGI (Environmental Data & Governance Initiative) and SUCHO (Saving Ukrainian Cultural Heritage Online). These groups, which include people from a variety of backgrounds, demonstrate that web archiving tools are becoming more accessible and easier to use and that concerned citizens can become involved in the work to preserve digital heritage too. She described “informal to more formal web archiving project[s] happen[ing] really quickly” now, in part because a “silver lining” of the pandemic is that people are now more familiar with tools for virtual collaboration.
Acker elaborated on this idea when an audience member asked, “Can anybody contribute to web archiving? How do you get started?” She said, “I like to think that there are two kinds of swimlanes of web archiving. There’s macro efforts that we see from huge organizations. That’s formal collecting [with] lots of resources, and then [there’s] micro efforts,” which she described as more informal collecting by individuals or groups. Both are important, she said. She recommended the Internet Archive’s Save Page Now as a way for anyone to archive individual web pages. “There’s a whole continuum of ways to get involved,” she added, including connecting with professional organizations like IIPC to learn about web archiving programs and workshops.
Another audience member asked, “If you can’t collect everything, how does one justify your bounds?” Harbster responded that she takes guidance from knowing what current researchers have asked for. She also said it’s important to have a clear scope and purpose, citing the example of the Coronavirus Web Archive, where the amount of content produced could be overwhelming. For that project, a web archivist created a thorough rubric of criteria to consider when deciding if content should be included in the archive. This clarified the process of collecting, but Harbster also mentioned that with ongoing events like the pandemic, it’s important to revisit a collection’s scope and purpose in order to respond to new developments or fill any gaps in the collection. Osborne concurred and added, “I think the most important thing is don’t work in a bubble.” She noted that it’s important for her to work collaboratively, so that “a lot of other people are involved in looking at things, improving and discussing [the collection].”
Watch the video to hear the panelists talk more about these issues and answer other questions, such as: Is anyone preserving misinformation sites? Why do web archivists use the word “curate” when speaking about building collections, and how do they handle the potential for bias in collecting? What about the potential for duplication? And how would a person further develop their web archiving and digital humanities skills?
Why do you think web archiving is important? Let us know in the comments!