“Distant Reading” and Web Archiving

The following is a guest post by Andrea Fox, Web Archiving Intern at the Library of Congress.

Andrea Fox

When Abbie Grotke of Web Archiving took me on for an internship, I thought well of myself for a few minutes until realizing I had no clue what Web Archiving was or what it wanted from me. Abbie didn’t seem to mind I had no digital background and was majoring in linguistics (studying over the summer allowed me a break in winter classes). She may have had misgivings after our phone interview.

 Abbie: So, this internship isn’t really related to linguistics, but you say you’re interested in archival work and organization. Do you have good computer skills?

 Me: Yes, I do the computer.

 Abbie: Right. And how about your experience with detail-oriented work?

 Me: Oh, I’ve worked with a number of details.

 Abbie: …Okay. Back to your earlier question about what artifacts we’re archiving. You do understand you’ll be helping with archiving the web itself and not—as those unfamiliar with digital futures might conclude from our department title—archiving by means of the web?

 Me: Yes, absolutely. I am in no way hiding the fact I didn’t know that until this moment.

This dialog, if not factual, gives an impression of my feelings at the time. Though born in DC, I’ve lived in isolated areas for most of my life. I wasn’t nervous about working in the city, but knowing now that I am capable of becoming disoriented in the one block between Eastern Market and the Eastern Market Metro, perhaps I should’ve been.

Once I met the Web Archiving team, my fears dissolved. They introduced me to concepts quickly but patiently and gave me feedback as I completed data cleanup tasks to prepare archived resources for access. Scrolling through thousands of collection entries to catch repeated or inconsistent titles turned up several interesting finds. In a series of websites of U.S. election candidates, for instance, I noticed such contenders as Vermin Supreme, Jon Trailerpark Jackson, and Kinky Friedman.

In the meantime I also worked with Michael Neubert. Remembering my linguistics major, Mr. Neubert encouraged me to look into computational analysis of digitized texts. He suggested I start by exploring how researchers can use a collection like Chronicling America to perform a “distant reading” of thousands of pages, finding patterns an individual reader cannot.

The report (pdf) begins with an exploration of the Google Ngram Viewer, a tool that graphs word and phrase frequencies in Google Books’ collection over time. I played around with different languages for a while, working off of linguistic trends I’d read of and wanted to test.

Spelling changes often demonstrate gratifying patterns. Here you can see how connoisseur, known to modern English and old French, subsequently changes to connaisseur in modern French:

Figure 1. Borrowing of French connoisseur (blue) into English (green) and subsequent change of the French spelling to connaisseur (red). (Experiment with the original graph at http://tinyurl.com/qfc8byj).

Another transformation occurred after the English used their beef from the French bœuf to form beefsteak, a word that was then reborrowed in an altered form into French:

Figure 2. Borrowing of French bœuf into English beef, followed by reborrowing of English compound beefsteak into French bifteck (both multiplied to show detail). (Original graph at http://tinyurl.com/q6vkcr2).

I couldn’t have made these comparisons without previous knowledge of words that have undergone change. Researchers who specialize in analyzing texts on a large scale, however, could potentially automate and expand these types of linguistic searches with more advanced tools, making conclusions larger than the words themselves (read the paper (pdf) produced by the Google Ngram Viewer team). Though the Ngram Viewer provides a relatively shallow view—the user can’t see the original context of the searched words—its scope allows for unusual insights.
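To make the idea of automating such a search concrete, here is a minimal sketch of the kind of count that sits behind a frequency graph: for each year, the share of all words that match a given spelling. It is not how the Ngram Viewer itself is implemented; the function name `relative_frequency` and the four-sentence toy corpus are invented purely for illustration.

```python
from collections import Counter, defaultdict

PUNCTUATION = ".,;:!?\"'()"

def relative_frequency(docs, terms):
    """Return {term: {year: share of that year's tokens}}.

    docs  -- iterable of (year, text) pairs (hypothetical input format)
    terms -- the spellings to track
    """
    wanted = {t.lower() for t in terms}
    totals = Counter()            # all tokens seen per year
    hits = defaultdict(Counter)   # matches per term, per year
    for year, text in docs:
        tokens = [tok.strip(PUNCTUATION) for tok in text.lower().split()]
        totals[year] += len(tokens)
        for tok in tokens:
            if tok in wanted:
                hits[tok][year] += 1
    return {
        term: {year: hits[term][year] / totals[year] for year in sorted(totals)}
        for term in sorted(wanted)
    }

# Invented toy corpus: the years and sentences are placeholders, not real data.
toy_corpus = [
    (1700, "un connoisseur de vin"),
    (1700, "le connoisseur parle"),
    (1900, "un connaisseur de vin"),
    (1900, "le connaisseur et le connoisseur"),
]

freq = relative_frequency(toy_corpus, ["connoisseur", "connaisseur"])
for term, by_year in freq.items():
    for year, share in by_year.items():
        print(f"{term} {year}: {share:.3f}")
```

Dividing by the year's total word count, rather than reporting raw counts, is what lets years with very different amounts of text be compared on one graph.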

Moving beyond spelling changes, I searched for revolution in eight languages (nine groups of texts when you count British English) to get a superficial idea of which countries discussed the term when. The result looks like a fluke:

Figure 3. Revolution in Simplified Chinese, German, Italian, Spanish, Hebrew, British English, French, American English, and Russian. (Original graph at http://tinyurl.com/pyt9ox4).

The major spike in Chinese corresponds with the 1966 Cultural Revolution. What shocked me is how the Chinese frequency dwarfs that of the other languages: at its peak in 1969, the simplified Chinese 革命 appears nearly twenty times as often as its closest competitor, the German Revolution. As a benchmark, the word the hovers around 5 percent usage in English. Have, I, and for (0.35 to 0.6 percent) appear roughly as often as 革命 in Chinese (0.35 percent). (For comparison, see the same graph excluding Chinese at http://tinyurl.com/opxqlxs).

Barring a double meaning not listed in the dictionary, an imbalance in the Chinese texts, or some bizarre mechanism behind Google Books, it seems only a censorship bubble of strictly policed word choice could cause this disproportion. I found a similar asymmetry in the graphs of leader and censorship. Though not nearly as exaggerated as the spike in revolution, these findings suggest a political agenda has skewed the results.

Several other words, listed at the end of this post, show upturns, though some during different periods. I’ve included pairs of words in which only one demonstrates a strong pattern. I used neutral words, such as town and language, as controls. Apart from a tendency for Hebrew to rank highly with more concrete words—though not as highly and narrowly as Chinese—these words don’t seem to demonstrate significant patterns.

I’ve chosen to share these Google Ngrams because they are visually appealing and user-friendly (you can click on any one of the chart links above or in the list below and adjust the input phrase(s), time period, and number of languages for new results). More advanced and research-driven tools, however, have already been put to use on bodies of text, including the Library of Congress’ Chronicling America collection, that do not have the same access restrictions as does Google Books. The technique of topic modeling, for instance, can point to trends in ChronAm’s digitized newspapers by identifying distinct “topics,” each with its own cluster of related words, based on the likelihood of these words appearing near one another.
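The phrase "words appearing near one another" can be illustrated with a much simpler count than a real topic model. The sketch below only tallies how often two words fall within a few tokens of each other; actual topic modeling fits a probabilistic model on top of statistics like these, and nothing here reflects how the Chronicling America research was carried out. The function name `cooccurrence_counts`, the `window` parameter, and the three sample headlines are all invented for illustration.

```python
from collections import Counter

def cooccurrence_counts(documents, window=3):
    """Count unordered word pairs that occur within `window` tokens of each other.

    documents -- iterable of plain-text strings (hypothetical input format)
    window    -- span in tokens, counting the first word of the pair itself
    """
    pairs = Counter()
    for text in documents:
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            # Look only ahead, so each nearby pair is counted once per position.
            for other in tokens[i + 1 : i + window]:
                if other != word:
                    pairs[tuple(sorted((word, other)))] += 1
    return pairs

# Invented sample text, not drawn from any real newspaper.
headlines = [
    "railroad strike workers union",
    "wheat harvest farm prices",
    "union workers strike railroad",
]

pairs = cooccurrence_counts(headlines)
for pair, count in pairs.most_common(5):
    print(pair, count)
```

Pairs such as strike and workers turn up together repeatedly, while strike and wheat never do, which is the raw signal a topic model would group into separate clusters of related words.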

There is tremendous potential to conduct these types of analysis on web archive collections to help researchers navigate large quantities of available material. When you let the machine page or scroll through thousands of pages and begin the pattern-making process, you can better focus on those patterns–whether expected or unexpected–and decide which ones warrant your attention.

After three months, I’ve accomplished more than I thought I would. I still don’t understand one in four words in the weekly Web Archiving meetings, but I’ll chalk it up to lack of technical background and smile smugly again.

Explore more cross-linguistic Google Ngram graphs:

Leader (strong increase): http://tinyurl.com/k98spek

Censorship (moderate increase): http://tinyurl.com/mxpusul

Government (strong increase): http://tinyurl.com/k7wsqlm

Peace (moderate increase): http://tinyurl.com/oxrckho

War (slight increase): http://tinyurl.com/p5dtjp8

Give (moderate increase): http://tinyurl.com/pfk3jph

Take (no increase): http://tinyurl.com/p3wc8wc

Doubt as a verb (moderate increase): http://tinyurl.com/p9bsq2n

Deny (moderate increase): http://tinyurl.com/lcvw3st

Forget (slight increase): http://tinyurl.com/ngcpksx

Town (control, no increase): http://tinyurl.com/mcyvjzl

Language (control, no increase): http://tinyurl.com/khonlsn

Cook as a verb (control, no increase): http://tinyurl.com/kktk37p

Meat (control, Hebrew increases): http://tinyurl.com/mxv2qxb

Shoe (control, Hebrew increases): http://tinyurl.com/krsgdvl

Sky (control, Hebrew increases): http://tinyurl.com/m67nfor
