Digital Collections and Data Science

Unlock the access. By James Levine on Flickr.

Researchers of varying technical abilities are increasingly applying data science tools and methods to digital collections. As a result, new ways are emerging for processing and analyzing the digital collections’ raw material: the data.

For example, instead of pondering a single digital item at a time, such as a news story, photo or weather event, a researcher can analyze massive quantities of digital items at once to find patterns, trends and connections. Visualizing those results can be revelatory. Ian Milligan, assistant professor in the Department of History at the University of Waterloo, said, “Visualizations can, at a glance, tell you more than if you get mired down in the weeds of reading document after document after document.”

The NEH Chronicling America Data Challenge is an example of building data visualizations from a large, publicly available data set. Recently, the National Endowment for the Humanities invited people to “Create a web-based tool, data visualization or other creative use of the information found in the (Library of Congress’s) Chronicling America historic newspaper database.” The results are diverse and imaginative. According to the NEH website,

  • America’s Public Bible “tracks Biblical quotations in American newspapers to see how the Bible was used for cultural, religious, social, or political purposes”
  • American Lynching “explores America’s long and dark history with lynching, in which newspapers acted as both a catalyst for public killings and a platform for advocating for reform”
  • Historical Agricultural News, “a search tool site for exploring information on the farming organizations, technologies, and practices of America’s past”
  • Chronicling Hoosier “tracks the origins of the word ‘Hoosier’”
  • USNewsMap.com “discovers patterns, explores regions, investigates how stories and terms spread around the country, and watches information go viral before the era of the internet”
  • Digital APUSH “uses word frequency analysis…to discover patterns in news coverage.”
Dame Wendy Hall and Vint Cerf at Archives Unleashed 2.0: Web Archive Datathon. Photo by Mike Ashenfelder.

The explicit purpose of the Library of Congress’s Archives Unleashed 2.0: Web Archive Datathon was exploratory, open-ended discovery. The data came from a variety of sources, such as the Internet Archive’s crawl of the Cuban web domain. Participants divided into teams, each with a general objective of what to do with the data. The room bustled with people clacking laptop keys, poking at screens, bunching around whiteboards and scrawling rapidly on easel pads. At one table, a group queried website data related to the Supreme Court nominations of Justice Samuel Alito and Justice John Roberts. They showed off a word cloud view of their results and pointed out that the cloud for the archived House of Representatives websites prominently displayed the words “Washington Post,” while the word cloud for the Senate prominently displayed “The New York Times.” The group was clearly excited by the discovery. This was solid data, not conjecture. But what did it mean? And who cares?

Well, it was an intriguing fact, one that invites further research. And surely someone might be curious enough to research it someday to figure out the “why” of it. And the results of that research, in turn, might open a new avenue of information that the researcher didn’t know was available or even relevant.

Events such as hackathons and the upcoming Collections as Data conference bring together librarians, archivists, digital humanities professionals, data scientists, artists, scholars — people from disparate backgrounds, evidence that computation of large data sets in research is blurring the lines between disciplines. There are a lot of best practices to be shared.

Data labs
A variety of digital research centers, scholars’ labs, digital humanities labs, learning labs and visualization labs are opening in libraries, universities and other institutions. Despite their variety, these data labs are converging on a set of identifiable, standard components that include

  • A work space
  • Hardware resources
  • Network access
  • Databases and data sets
  • Teaming researchers with technologists
  • Powerful processing capability
  • Software resources and tools
  • Repositories for end-result data sets.

A work space
A quiet room or rooms should be available for brainstorming. Whiteboards and easel pads enable people to quickly jot down ideas and diagram half-formed thoughts. A brain dump, no matter how unfocused, contains bits of value that may clump into solid ideas and strategies. The room also needs enough tables, chairs and power outlets.

Nikola Tesla, with his equipment for producing high-frequency alternating currents. Wellcome Library, London. M0014782.

Hardware resources
The lab should provide computer workstations, monitors, laptops, conference phones and possibly a webcam for video teleconferencing.

Network access
Because of the constant flow of network requests and transactions, some of them moving potentially large files, Wi-Fi must be consistent and reliable, and wired networks should be provisioned for the highest bandwidth possible.

Databases and data sets
The data may need to be cleaned. Web harvesting, for example, grabs almost everything related to the seed URL, even with some filtering, and the archive often includes web pages that the researcher does not care about. Databases and data sets, if they are to be accessed over the network, should be small enough to move around easily. A researcher can also download large databases in advance of the scheduled work time.
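A minimal sketch of that kind of cleaning step, in Python: keep only HTML records from the seed domain and drop everything else the crawl swept up. The record dicts here are hypothetical; real web archives are usually WARC files, read with tools such as warcio.

```python
# Sketch of cleaning a harvested web archive: keep only HTML records
# from the seed domain, discard trackers, images and off-domain pages.
from urllib.parse import urlparse

def clean_records(records, seed_domain):
    """Return records whose URL is on the seed domain and whose
    content type is HTML."""
    kept = []
    for rec in records:
        host = urlparse(rec["url"]).netloc
        if host.endswith(seed_domain) and rec["mime"] == "text/html":
            kept.append(rec)
    return kept

# Illustrative records, not real crawl data.
records = [
    {"url": "http://example.cu/news", "mime": "text/html"},
    {"url": "http://ads.tracker.com/px", "mime": "image/gif"},
    {"url": "http://example.cu/logo.png", "mime": "image/png"},
]
print(len(clean_records(records, "example.cu")))  # keeps only the first record
```

In practice the filtering rules (domains, MIME types, date ranges) come from the researcher, which is one reason the researcher-technologist pairing described below matters.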

Teaming researchers with technologists
In a complementary collaboration between a researcher or subject-matter expert and an information technologist, the researcher conveys what she would like to query the data for and the technologist makes it happen. The researcher may analyze the results and suggest refinements to the technologist. Some workshops, such as Ian Milligan’s web archiving analysis workshop, require their researchers to take a Data Carpentry workshop, an overview of the computation, programming and analysis methods that a data researcher might need. The researcher could then either conduct data analyses herself or become conversant enough in data analysis methods to understand her options and communicate with the technologist.

Powerful processing capability
Processing large data sets places a heavy load on computing resources, so a lab needs ample processing muscle. At the Archives Unleashed event, it took one group ten hours to process their query. Milligan is a big proponent of cloud processing and storage, using powerful network systems supported and maintained by others. He said, “We started out using machines on the ground and we found the issue was to have real sustainable storage that’s backed up and not risky to use, that’s going to have to live in network storage anyway. We found that we’re moving data all over the place and we do some of our stuff on our server itself and when we have to spin up other machines, it’s so much quicker to actually move stuff — especially when you’re working with terabytes of web archives — until you get to that last mile of the actual Ethernet cable coming to your workstation. That’s turning out to be the mass bottleneck. Our federation of provincial computing organizations has a sort of Amazon AWS-esque dashboard where we can spin up virtual machines. We have a big server at York University and we sometimes use Jimmy Lin’s old cluster down at Maryland. So the physical equipment turns out not to be that important when we have so many network resources to draw on.”

Software resources and tools
As data labs spring up, newer and better tools are appearing too. Data labs may offer a gamut of tools for

  • Content and file management
  • Data visualization
  • Document processing
  • Geospatial analysis
  • Image editing
  • Moving image editing
  • Network analysis
  • Programming
  • Text mining
  • Version control

The Digital Research Tools site is a good comprehensive resource to begin with for an overview of resources.
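As a small illustration of the text-mining category above, the word-frequency counting behind a word cloud can be sketched in a few lines of Python. The documents and stop-word list here are invented for the example.

```python
# Count terms across many documents at once -- the basic analysis
# behind the word clouds built at the datathon -- instead of reading
# the documents one by one.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "to"}  # illustrative, not exhaustive

def top_terms(documents, n=5):
    """Tokenize all documents, drop stop words, and return the n most
    common terms with their counts."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z']+", doc.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

docs = [
    "The nomination of the Justice was covered in the Washington Post.",
    "The Washington Post reported on the Senate hearing.",
]
print(top_terms(docs, 3))
```

A real project would scale the same idea across thousands of pages and feed the counts to a visualization tool, but the core computation is no more than this.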

Repositories for end-result data sets
The data set at the end of a project may be of value to other researchers and the researcher might want her project to be discoverable. The data set should include metadata to describe the project and how to repeat the work in order to arrive at the same data set and conclusions. The repository where the data set resides should have long-term preservation reliability.
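A minimal, hypothetical metadata record for such a deposit might look like the following. The field names are illustrative rather than drawn from a formal standard such as Dublin Core or DataCite, and every value is invented for the example.

```python
# Sketch of a metadata record deposited alongside an end-result data
# set: enough description that another researcher can discover the
# project and repeat the processing steps.
import json

record = {
    "title": "Supreme Court nomination coverage, word frequencies",
    "creator": "Example Researcher",
    "date_created": "2016-06-15",
    "source_data": "Archives Unleashed 2.0 web archive collection",
    "processing_steps": [
        "Filter archive to House and Senate seed domains",
        "Extract plain text from HTML records",
        "Count term frequencies, excluding stop words",
    ],
    "software": "Python 3, custom scripts (see project repository)",
    "license": "CC0",
}

# Serialize for deposit in the repository.
print(json.dumps(record, indent=2))
```

Recording the processing steps and software alongside the data is what lets a later researcher arrive at the same data set and conclusions.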

Conclusion
Data science is drifting out of the server rooms and into the general public. The sharp differences among professions and areas of interest are getting fuzzier all the time as researchers increasingly use information technology tools. Archaeologists practice 3D modelling. Humanities scholars practice data visualization. Students of all kinds query databases.

For the near future, interacting with data is a specialized skill that requires a basic understanding of data science and knowledge of its tools, whether through training or teaming up with knowledgeable technologists. But eventually the relevant instruction should be made widely available, whether in person or by video, and tools need to be simplified, especially as API-enabled databases proliferate and more sources of data become available.

In time, computationally enhanced research will not be a big deal and cultural institutions’ data resources and growing digital collections will be ready for researchers to access, use and enjoy.
