Pioneers of the internet and the world wide web unveiled their ideas on how to preserve the web’s contents at a day-long symposium called “Saving the Web” held Thursday, June 16 at The John W. Kluge Center at the Library of Congress. Hosted by computer scientist Dame Wendy Hall, the event featured web pioneers and experts Vint Cerf, Ramesh Jain, James Hendler and others and was attended by more than 150 people.
Cerf, who co-designed the TCP/IP protocols and the architecture of the internet, unveiled the term “digital vellum.” Vellum refers to parchment made from calf skin. It has great stability and permanence if kept in the right environment. The web is perhaps the exact opposite. It is temporal. It changes quickly. The average web page remains online for barely 100 days and URLs are not stable. How do you make it permanent?
Cerf said that preservation should not be an afterthought or an accident. It should be built into the web: a self-archiving web. Objects, metadata and the connectivity of the web should all be captured. New systems must be developed to archive not just static pages but links, what pages connect to, and the experience of navigating. Executable code also needs to be preserved. Cerf suggested that government-sponsored research should incorporate the preservation of the data that results from it along with its metadata. Preservation of the web should be purposeful. “I don’t like preservation by accident,” Cerf said. What are the consequences of not acting? Cerf suggested that in the 22nd century, humanity may know more about Abraham Lincoln’s presidency than Barack Obama’s. Will future historians be able to write about our politics without access to today’s websites and social media?
Not all of the web can be saved, and some speakers suggested that not all of it should be. Political scientist David Lazer noted that much of the web today is bots and spam. How does one separate the wheat from the chaff? One idea may be to ask relevant stakeholders. What websites and web content would political scientists want about the 2016 election? This could inform what libraries and archives choose to preserve. Of course, this approach creates priorities that bring cultural and political biases, a point raised by Richard Price of the British Library. We may preserve political election websites but we would miss the political ideas that do not reach the mainstream.
Organizations are already preserving elements of the web. The Internet Archive, represented at the symposium by Jefferson Bailey, will soon reach 50 billion web captures. It is investigating building bots in order to preserve at the massive scale necessary. Abbie Grotke spoke on behalf of the Library of Congress Web Archives, which makes a deep regular crawl of large sites such as State.gov and employs RSS feed crawling to keep up with fast changes to websites. Still, with limited staff and resources, it is a challenge to keep pace. And presently the team can only focus on acquiring the data, not on making it accessible.
James Hendler, an originator of the semantic web, illuminated the numerous challenges of the modern-day internet, namely that the data that powers the web lives in its structure. Information in a database is only so useful; descriptive data about the database tells a more complete story. It is this data that needs to be housed, searchable and findable. Students will need to acquire the literacy to work with these data sets. “Data is for this generation what writing was for [the] past,” Hendler said.
Once the web is preserved there is much we can do with it. Katy Borner of Indiana University Bloomington displayed stunning maps and data visualizations. The closing panel discussed the research that can be done using web and social media content, and in particular the perils of not having that information in the archive. Matthew Weber of Rutgers pointed out that CNN’s coverage of Hurricane Katrina is not preserved in its entirety on their website; however, it is preserved in the Internet Archive. Comparing the two sites, one sees that CNN perhaps made deliberate choices about what to save and what not to save. Philip Napoli, also of Rutgers, discussed how archiving of journalistic content at the local level is nearly nonexistent. Could you do a project on local radio coverage from a certain time period? Likely not. It’s easier to study local news coverage from 1940 than from 2015. And Katrin Weller, an information scientist and former Kluge Fellow in Digital Studies, discussed how the context of what we’re doing now on social media may be completely lost to future historians. How can we be sure that scholars of the future will be able to decode a tweet or understand what a hashtag was?
To close the day, Dame Wendy Hall unveiled a new Web Observatory Search. The new tool allows a full web search across known data sets, searching the metadata of documents, not the contents. The project is open source, and organizations that place their metadata in the system in the proper format will have their data searchable and discoverable by users worldwide. While Hall recognized that digital storage space will be needed in large quantities, she was more concerned about making sense of all the information once it is stored. In a very short time span, the web has grown and changed our world dramatically. What do we save? How do we save it? How will anyone make sense of it? What tools and policies can be developed to ensure the record of our current moment does not disappear? One thing all participants agreed upon: the principles of openness and multidisciplinarity that created the web will be essential to preserving it.
“Saving the Web: The Ethics and Challenges of Preserving What’s on the Internet” occurred Thursday, June 16, 2016 at The John W. Kluge Center at the Library of Congress. The event was live-tweeted: #SaveTheWeb. Videos will be posted to our YouTube page.