June 2022
LC LABS LETTER
Monthly News from the Library of Congress Labs Team
Data & Libraries
“Data” has long been a part of library practices. For years even before the prevalence of digital technologies, library catalogs recorded important data (i.e. information) describing elements of an item or a collection. With the rise of the internet and proliferation of computers, the field faced a growing need to make descriptive data “machine readable.” For example, the development of Machine Readable Cataloging (MARC) standards in the 1960s and 1970s standardized cataloging information, thus allowing it to be stored digitally and shared more easily with other organizations.
Our team, LC Labs, investigates the new possibilities that emerge when digital collections themselves, not just the information describing them, are treated as data that can be accessed, analyzed, transformed, and visualized with the assistance of computers. We’ve hosted conferences, hackathons, software carpentry workshops, data jams, design sprints, and data challenges. You name it!
This month’s issue is devoted to updates on LC Labs’ latest collaborations in this exciting space.
Providing access to collections “as data”
“Collections as data” is a widely used term in academic libraries and research to denote new models of scholarship and inquiry that apply computational methods to digital cultural heritage materials. Many libraries, archives, and museums have developed resources tailored to the needs and questions of computer scientists and researchers seeking to access collections content programmatically (i.e. using programming languages).
LC Labs is proud to have collaborated on updating the documentation available at loc.gov/apis, a site documenting the many application programming interfaces (APIs) the Library of Congress makes publicly available. Our team of developers, digital collections specialists, and Labs staff have also worked to organize the Library of Congress Data Exploration repository, which contains over 15 different Jupyter Notebooks for exploring the Library’s collections via APIs and select derivative datasets.
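To give a sense of what programmatic access looks like, here is a minimal sketch in Python. It assumes the `q` (search terms), `fo=json` (JSON output), and `c` (results per page) query parameters described in the loc.gov API documentation. The helper name `build_search_url` is hypothetical, and the code only builds a request URL without sending anything over the network.

```python
from urllib.parse import urlencode

# Base address of the Library's public website, which also serves the JSON API.
BASE_URL = "https://www.loc.gov"


def build_search_url(query, per_page=25):
    """Build a loc.gov search URL that asks for JSON instead of a web page.

    Hypothetical helper for illustration. Assumed parameters:
    q = search terms, fo = output format, c = results per page.
    """
    params = {"q": query, "fo": "json", "c": per_page}
    return f"{BASE_URL}/search/?{urlencode(params)}"


url = build_search_url("election web archive", per_page=10)
print(url)
```

Opening the printed URL in a browser, or passing it to an HTTP client, should return structured data rather than a rendered page, which is the starting point for most of the notebooks in the repository.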
One of the latest additions to the Data Exploration repo is a Jupyter Notebook that digs into a derivative dataset created from the United States’ Election Web Archives. This is a unique collection that preserves over twenty years of campaign websites for candidates in presidential, gubernatorial, and congressional elections. Web archive datasets are typically large and complex files, so it can be difficult to commit to downloading and working with them without knowing what information they contain. This notebook demonstrates how to explore the web archives’ CDX files and underlying data. It runs code in a browser so you can see, in real time, what these data contain before you decide to work with them.
Specifically, the Jupyter Notebook runs code that shows how to:
- Put the data into a dataframe, akin to a spreadsheet, to make the data more comprehensible and easier to manipulate and analyze.
- Limit the scope of the data if you don’t have the computing power to analyze the entire dataset (for example, by looking at data for just one year or limiting it to a specific number of rows).
- Analyze the prepared data along two potential avenues: by mimetype (media type) and by the most commonly occurring words in the text within the CDX files.
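The steps above can be sketched in a few lines of Python. This is not the notebook's own code: it uses only the standard library, with a list of dictionaries standing in for the dataframe, and the sample rows and four-column layout are invented for illustration. Real CDX files carry more fields. All function names are hypothetical.

```python
from collections import Counter

# Hypothetical sample rows, invented for illustration; not real archive data.
# Assumed columns: capture timestamp, original URL, mimetype, HTTP status.
SAMPLE_CDX = """\
20001102120000 http://example.com/ text/html 200
20001102120500 http://example.com/logo.gif image/gif 200
20041015093000 http://example.org/issues text/html 200
20041015093100 http://example.org/style.css text/css 200
20041101080000 http://example.org/missing text/html 404
"""

COLUMNS = ["timestamp", "original", "mimetype", "status"]


def load_rows(text, limit=None):
    """Parse space-separated lines into row dictionaries (a table stand-in).

    If limit is given, stop after that many rows to keep the sample small.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if limit is not None and len(rows) >= limit:
            break
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def rows_for_year(rows, year):
    """Keep only captures whose timestamp starts with the given year."""
    return [row for row in rows if row["timestamp"].startswith(str(year))]


def mimetype_counts(rows):
    """Count how often each mimetype (media type) appears."""
    return Counter(row["mimetype"] for row in rows)


rows = load_rows(SAMPLE_CDX)
rows_2004 = rows_for_year(rows, 2004)
counts = mimetype_counts(rows_2004)
print(counts.most_common())
```

With a dataframe library such as pandas, the same steps correspond to loading the file into a table, filtering or truncating it, and counting values in the mimetype column. The word-frequency analysis is left out of this sketch.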
For additional information, check out this Signal Blog post authored by the Library’s Web Archives team, and feel free to contact [email protected] with additional questions or if you have used this dataset yourself!
Finally, both Jupyter Notebooks and API documentation are among the resources that Computing Cultural Heritage in the Cloud, the Mellon-funded LC Labs initiative, is considering and creating as we assess the feasibility of a cloud service model for researcher support.
If you have experience using the Library’s APIs and want to share your feedback, please email [email protected].
Exploring data in the Library’s collections as primary sources
Working with machine-readable data that describe and constitute digital collections is one major, and exciting, way for researchers and librarians to work with data. However, in today’s digital age, “data” are collected, analyzed, and represented about all manner of material and topics—including, but not limited to, library collections.
The prevalence of data has also led to an increased emphasis on data analysis skills in K-12 education. Some readers of this newsletter may remember that LC Labs member Eileen J. Manchester worked with Peter DeCraene to explore the suitability of data sets in the Library’s collection for use as primary sources in the K-12 STEM classroom. We found that it was useful to create a derivative dataset, a subset of the Grand Comics Database, tailored for our purpose.
In this closeout conversation on the Signal Blog, Eileen and Peter discuss how messy, yet rewarding, it is to think of library collections as data in a classroom setting. Peter also gives several examples of historical data sources that shed light on the long history of research and data collection before computers.
Inspiring new creative works using digital collections
Finally, we want to stress that “computational uses” of digital collections do not have to be quantitative data analyses or formal research projects! The grant projects funded in the first round of the Connecting Communities Digital Initiative (CCDI) are formidable examples of using the Library’s resources to inspire creativity, prompt reflection, and build community around the United States.
We invite you to attend a livestreamed presentation by the inaugural grantees and to learn more about CCDI. The grantees’ presentations, as well as an advisory board member panel, will take place via Zoom from 12:45 p.m. to 3 p.m. Eastern Time on Wednesday, July 6. Sign up via this registration page if you are interested.
The second round of CCDI grant opportunities will close on September 30, 2022.
Curio
- What’s new online: Staff from the Library’s Digital Content Management Section routinely share updates on the Signal Blog about new content added to loc.gov. In case you missed it, you can catch up on January’s additions as well as new content added in time for Memorial Day 2022.
- Fun with File Formats: The Library’s Digital Collections Management and Services Division recently published a round of exciting updates on work done by the Federal Agencies Digital Guidelines Initiative (FADGI) working groups. The team revised the Technical Guidelines for Digitizing Cultural Heritage Materials and published new additions to the Sustainability of Digital Formats site, one of the premier resources in the world for in-depth technical information about digital file formats. FADGI is also a finalist for the prestigious Digital Preservation Coalition (DPC) 20th Anniversary Award!
To subscribe to the monthly LC Labs Letter, visit https://updates.loc.gov/accounts/USLOC/subscriber/new?topic_id=USLOC_182
For more information about LC Labs, visit us at https://labs.loc.gov/
Questions? Contact LC Labs at [email protected]