Researchers of varying technical ability are increasingly applying data science tools and methods to digital collections. As a result, new ways are emerging for processing and analyzing those collections’ raw material: the data.
For example, instead of pondering a single digital item at a time, such as a news story, photo or weather event, a researcher can process massive quantities of digital items at once to find patterns, trends and connections. Visualizing the results can be revelatory. Ian Milligan, assistant professor in the Department of History at the University of Waterloo, said, “Visualizations can, at a glance, tell you more than if you get mired down in the weeds of reading document after document after document.”
The NEH Chronicling America Data Challenge is an example of building data visualizations and tools from a large, publicly available data set. Recently, the National Endowment for the Humanities invited people to “Create a web-based tool, data visualization or other creative use of the information found in the (Library of Congress’s) Chronicling America historic newspaper database.” The results are diverse and imaginative. According to the NEH website,
- “America’s Public Bible tracks Biblical quotations in American newspapers to see how the Bible was used for cultural, religious, social, or political purposes”
- “American Lynching…explores America’s long and dark history with lynching, in which newspapers acted as both a catalyst for public killings and a platform for advocating for reform”
- “Historical Agricultural News, a search tool site for exploring information on the farming organizations, technologies, and practices of America’s past”
- “Chronicling Hoosier tracks the origins of the word ‘Hoosier’”
- “USNewsMap.com discovers patterns, explores regions, investigates how stories and terms spread around the country, and watches information go viral before the era of the internet”
- “Digital APUSH…uses word frequency analysis…to discover patterns in news coverage.”
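Projects like these are possible in part because Chronicling America exposes its search results as JSON through a public web API. As a minimal sketch of the kind of year-by-year word-frequency query behind a project like Digital APUSH (the endpoint and parameter names follow the API’s public documentation, but treat the exact fields, such as "totalItems", as assumptions to verify), a few lines of Python suffice:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Chronicling America's page-search endpoint returns JSON results.
BASE = "https://chroniclingamerica.loc.gov/search/pages/results/"

def page_hits(term, year):
    """Number of OCR'd newspaper pages matching `term` in `year`."""
    params = urlencode({
        "andtext": term,
        "date1": year,
        "date2": year,
        "dateFilterType": "yearRange",
        "rows": 1,          # only the total count is needed
        "format": "json",
    })
    with urlopen(f"{BASE}?{params}") as response:
        return json.load(response)["totalItems"]

# How often does "hoosier" appear in digitized pages, year by year?
for year in range(1900, 1911):
    print(year, page_hits("hoosier", year))
```

Plotting counts like these over time is a first step toward the frequency patterns and trends the challenge entries explore.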
The explicit purpose of the Library of Congress’s Archives Unleashed 2.0 Web Archive Datathon was exploratory, open-ended discovery. The data came from a variety of sources, such as the Internet Archive’s crawl of the Cuban web domain. Participants divided into teams, each with a general objective of what to do with the data. The room bustled with people clacking laptop keys, poking at screens, bunching around whiteboards and scrawling rapidly on easel pads. At one table, a group queried website data related to the Supreme Court nominations of Justice Samuel Alito and Justice John Roberts. They showed off a word-cloud view of their results and pointed out that the cloud for the archived House of Representatives websites prominently displayed the words “Washington Post,” while the word cloud for the archived Senate websites prominently displayed “The New York Times.” The group was clearly excited by the discovery. This was solid data, not conjecture. But what did it mean? And who cares?
Well, it was an intriguing fact, one that invites further research. Surely someone will someday be curious enough to dig into the “why” of it. And the results of that research, in turn, might open a new avenue of information that the researcher didn’t know was available or even relevant.
Events such as hackathons and the upcoming Collections as Data conference bring together librarians, archivists, digital humanities professionals, data scientists, artists and scholars. That mix of disparate backgrounds is evidence that the computation of large data sets is blurring the lines between research disciplines, and there are plenty of best practices to be shared across them.
Data labs
A variety of digital research centers, scholars’ labs, digital humanities labs, learning labs and visualization labs are opening in libraries, universities and other institutions. But, despite their differences, these data labs are coalescing around identifiable, standard components that include
- A work space
- Hardware resources
- Network access
- Databases and data sets
- Teaming researchers with technologists
- Powerful processing capability
- Software resources and tools
- Repositories for end-result data sets
A work space
A quiet room or rooms should be available for brainstorming. Whiteboards and easel pads enable people to quickly jot down ideas and diagram half-formed thoughts. A brain dump, no matter how unfocused, contains bits of value that may clump into solid ideas and strategies. The room also needs enough tables, chairs and power outlets.
Hardware resources
The lab should provide computer workstations, monitors, laptops, conference phones and possibly a webcam for video teleconferencing.
Network access
Because of the constant flow of network requests and transactions, some of them moving potentially large files around, Wi-Fi must be consistent and reliable, and wired networks should be optimized for the highest bandwidth possible.
Databases and data sets
The data may need to be cleaned. Web harvesting, for example, grabs almost everything related to the seed URL, even with some filtering, and the archive often includes web pages that the researcher does not care about; a first cleaning pass might look like the sketch below. Databases and data sets, if they are to be accessed over the network, should be small enough to move about easily. A researcher can also download large databases in advance of the scheduled work time.
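What counts as “clean” depends on the project, but for a web harvest it often starts with discarding captures that fall outside the researcher’s scope. In this minimal sketch the host names, file names and filter rules are all hypothetical stand-ins; a real harvest would supply its own:

```python
from urllib.parse import urlparse

# Hypothetical scope rules: keep only pages from hosts of interest
# and drop crawler by-catch such as stylesheets, scripts and images.
WANTED_HOSTS = {"www.senate.gov", "www.house.gov"}
UNWANTED_SUFFIXES = (".css", ".js", ".png", ".gif", ".jpg", ".ico")

def in_scope(url):
    parsed = urlparse(url)
    return (parsed.hostname in WANTED_HOSTS
            and not parsed.path.lower().endswith(UNWANTED_SUFFIXES))

# harvested_urls.txt stands in for a one-URL-per-line listing
# exported from the crawl; keep only the records worth analyzing.
with open("harvested_urls.txt") as source, \
        open("in_scope_urls.txt", "w") as cleaned:
    for line in source:
        url = line.strip()
        if url and in_scope(url):
            cleaned.write(url + "\n")
```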
Teaming researchers with technologists
In a complementary collaboration between a researcher or subject-matter expert and an information technologist, the researcher conveys what she would like to query the data for and the technologist makes it happen. The researcher may then analyze the results and suggest refinements to the technologist. Some workshops, such as Ian Milligan’s web archive analysis workshop, require researchers to first take a Data Carpentry workshop, an overview of the computation, programming and analysis methods a data researcher might need. With that grounding, the researcher can either conduct data analyses herself or become conversant enough in data analysis methods to understand her options and communicate with the technologist.
Powerful processing capability
Processing large data sets places a heavy load on computing resources, so a lab needs ample processing muscle; at the Archives Unleashed event, one group’s query took ten hours to run. Milligan is a big proponent of cloud processing and storage, using powerful network systems supported and maintained by others. He said, “We started out using machines on the ground and we found the issue was to have real sustainable storage that’s backed up and not risky to use, that’s going to have to live in network storage anyway. We found that we’re moving data all over the place and we do some of our stuff on our server itself and when we have to spin up other machines, it’s so much quicker to actually move stuff — especially when you’re working with terabytes of web archives — until you get to that last mile of the actual Ethernet cable coming to your workstation. That’s turning out to be the mass bottleneck. Our federation of provincial computing organizations has a sort of Amazon AWS-esque dashboard where we can spin up virtual machines. We have a big server at York University and we sometimes use Jimmy Lin’s old cluster down at Maryland. So the physical equipment turns out not to be that important when we have so many network resources to draw on.”
Software resources and tools
As data labs spring up, newer and better tools are appearing with them. Data labs may offer a gamut of tools for
- Content and file management
- Data visualization
- Document processing
- Geospatial analysis
- Image editing
- Moving image editing
- Network analysis
- Programming
- Text mining (see the sketch below)
- Version control
The Digital Research Tools site is a good, comprehensive place to start for an overview of what is available.
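Many of these categories need surprisingly little code once the data is in hand. As a minimal text-mining sketch, assuming a hypothetical folder of plain-text files (OCR output from digitized pages, for instance), the word frequencies that feed visualizations like the datathon word clouds take only the Python standard library:

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical input: a folder of plain-text documents, such as
# OCR output extracted from digitized newspaper pages.
CORPUS_DIR = Path("corpus")

# Common words to skip so the counts surface meaningful terms.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
             "is", "was", "for", "on", "that", "with", "as", "by"}

counts = Counter()
for text_file in CORPUS_DIR.glob("*.txt"):
    text = text_file.read_text(errors="ignore").lower()
    words = re.findall(r"[a-z']+", text)
    counts.update(word for word in words if word not in STOPWORDS)

# The twenty most frequent terms across the whole corpus.
for word, count in counts.most_common(20):
    print(f"{count:6d}  {word}")
```

Dedicated text-mining and visualization tools add stemming, named-entity recognition and richer displays, but the underlying counting is often this simple.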
Repositories for end-result data sets
The data set at the end of a project may be of value to other researchers, and the researcher might want her project to be discoverable. The data set should therefore include metadata that describes the project and explains how to repeat the work in order to arrive at the same data set and conclusions; a minimal record might look like the sketch below. The repository where the data set resides should also offer reliable long-term preservation.
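There is no single required format, and a real deposit would follow the target repository’s own metadata schema. As an illustrative sketch only, with hypothetical field names and values, such a record could be as simple as a small JSON file generated alongside the data:

```python
import json

# Illustrative fields and values; a real deposit would follow the
# target repository's required metadata schema.
metadata = {
    "title": "Example: nomination coverage in archived congressional websites",
    "creator": "A. Researcher",
    "date_created": "2016-06-15",
    "source_collection": "Example web archive crawl",
    "processing_steps": [
        "Filtered the harvest to in-scope hosts",
        "Extracted plain text from archived HTML",
        "Computed word frequencies per chamber",
    ],
    "software": ["Python 3", "custom filtering and counting scripts"],
    "license": "CC0-1.0",
}

with open("dataset_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```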
Conclusion
Data science is drifting out of the server rooms and into the general public. The sharp differences among professions and areas of interest get fuzzier all the time as researchers increasingly use information technology tools. Archaeologists practice 3D modeling. Humanities scholars practice data visualization. Students of all kinds query databases.
For the near future, interacting with data is a specialized skill that requires a basic understanding of data science and knowledge of its tools, whether gained through training or by teaming up with knowledgeable technologists. Eventually, though, the relevant instruction should be made widely available, whether in person or by video, and the tools need to be simplified, especially as API-enabled databases proliferate and more sources of data become available.
In time, computationally enhanced research will not be a big deal and cultural institutions’ data resources and growing digital collections will be ready for researchers to access, use and enjoy.