LC Labs Letter: Data and Libraries

June 2022

LC LABS LETTER
Monthly News from the Library of Congress Labs Team

Data & Libraries

“Data” has long been a part of library practices. Long before digital technologies became prevalent, library catalogs recorded important data (i.e., information) describing elements of an item or a collection. With the rise of the internet and proliferation of computers, the field faced a growing need to make descriptive data “machine readable.” For example, the development of Machine Readable Cataloging (MARC) standards in the 1960s and 1970s standardized cataloging information, thus allowing it to be stored digitally and shared more easily with other organizations.

Our team, LC Labs, investigates new possibilities that emerge when treating digital collections themselves (not just the information describing them) as data that can be accessed, analyzed, transformed, and visualized with the assistance of computers. We’ve hosted conferences, hackathons, software carpentry workshops, data jams, design sprints, data challenges: you name it!

This month’s issue is devoted to updates on LC Labs’ latest collaborations in this exciting space.

Providing access to collections “as data”

“Collections as data” is a widely used term in academic libraries and research to denote new models of scholarship and inquiry that apply computational methods to digital cultural heritage materials. Many libraries, archives, and museums have developed resources tailored to the needs and questions of computer scientists and researchers seeking to access collections content programmatically (i.e., using programming languages).

LC Labs is proud to have collaborated on updating the documentation available at loc.gov/apis, a site documenting the many application programming interfaces (APIs) the Library of Congress makes publicly available. Our team of developers, digital collections specialists, and Labs staff have also worked to organize the Library of Congress Data Exploration repository, which contains over 15 different Jupyter Notebooks for exploring the Library’s collections via APIs and select derivative datasets.
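For readers new to programmatic access, here is a minimal sketch of what a request to the loc.gov JSON API might look like. It assumes the `fo=json`, `q`, `c`, and `sp` query parameters and a `results` list with `title` fields in the response; the function names are hypothetical, so check the documentation at loc.gov/apis before relying on any of it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.loc.gov"


def build_search_url(query, count=5, page=1):
    """Build a loc.gov search URL that asks for JSON instead of HTML.

    Assumed parameters: q (search terms), fo=json (response format),
    c (results per page), sp (page number).
    """
    params = {"q": query, "fo": "json", "c": count, "sp": page}
    return f"{BASE_URL}/search/?{urlencode(params)}"


def fetch_titles(url):
    """Fetch one page of results and return the item titles.

    Assumes the response is a JSON object with a "results" list whose
    entries carry a "title" key; missing keys yield None.
    """
    with urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return [item.get("title") for item in payload.get("results", [])]


if __name__ == "__main__":
    # Only prints the URL; call fetch_titles(url) to make a network request.
    url = build_search_url("campaign websites")
    print(url)
```

Separating URL construction from the network call makes it easy to inspect the request in a browser first, and to space out repeated calls politely when paging through many results.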

One of the latest additions to the Data Exploration repo is a Jupyter Notebook that digs into a derivative dataset created from the United States’ Election Web Archives. This is a unique collection that preserves over twenty years of campaign websites for candidates in presidential, gubernatorial, and congressional elections. Web archives data sets are typically large and complex files, so it can be difficult to commit to downloading and working with them without knowing what information they contain. This notebook demonstrates how to explore the web archives’ CDX files and underlying data. It runs code in a browser so you can see, in real time, what these data entail before you decide to work with them.

Specifically, the Jupyter Notebook walks through code that shows how to:

  • Put the data into a dataframe, akin to a spreadsheet, to help make the data more comprehensible and easier to manipulate and analyze.
  • Limit the scope of the data if you don’t have the computing power to do an analysis on the entire dataset (for example, looking at data for just one year or limiting it to a specific number of rows).
  • Explore potential avenues for analysis once the data are prepared, such as by mimetype (media type) and by the most commonly occurring words in the text within the CDX files.
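The steps above can be sketched in a few lines of Python. This is not the notebook's code: it uses a plain list of dictionaries as a stand-in for a dataframe, assumes a seven-field, space-separated CDX layout, and runs on invented sample rows rather than real archive data. All function names are hypothetical.

```python
from collections import Counter

# Invented sample rows for illustration only; not taken from the real dataset.
SAMPLE_CDX = """\
com,example)/ 20001015120000 http://example.com/ text/html 200 AAAA 1234
com,example)/logo.gif 20001015120005 http://example.com/logo.gif image/gif 200 BBBB 567
org,example)/ 20041020093000 http://example.org/ text/html 200 CCCC 2345
org,example)/flyer.pdf 20041020093010 http://example.org/flyer.pdf application/pdf 200 DDDD 8901
"""

# Assumed field order; real CDX files may use a different layout.
FIELDS = ["urlkey", "timestamp", "original", "mimetype",
          "statuscode", "digest", "length"]


def parse_cdx(text):
    """Turn CDX text into a table-like list of row dictionaries."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip header lines or rows that do not fit the layout
        rows.append(dict(zip(FIELDS, parts)))
    return rows


def limit_scope(rows, year=None, max_rows=None):
    """Keep only captures from one year and/or the first max_rows rows."""
    if year is not None:
        rows = [r for r in rows if r["timestamp"].startswith(str(year))]
    if max_rows is not None:
        rows = rows[:max_rows]
    return rows


def count_mimetypes(rows):
    """Count how often each media type appears."""
    return Counter(r["mimetype"] for r in rows)


if __name__ == "__main__":
    rows = parse_cdx(SAMPLE_CDX)
    subset = limit_scope(rows, year=2000)
    print(count_mimetypes(subset).most_common())
```

With a dataframe library such as pandas, the same three steps would typically become a file read, a filter on the timestamp column, and a value count on the mimetype column.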

For additional information check out this Signal Blog post authored by the Library’s Web Archives team and feel free to contact [email protected] for additional questions or if you have used this dataset yourself!

Finally, Jupyter Notebooks and API documentation are among the resources that the LC Labs Mellon-funded Computing Cultural Heritage in the Cloud initiative is considering and creating as we assess the feasibility of a cloud service model for researcher support.

If you have experience using the Library’s APIs and want to share your feedback, please email [email protected].

Exploring data in the Library’s collections as primary sources

Working with machine-readable data that describe and constitute digital collections is one major, and exciting, way for researchers and librarians to work with data. However, in today’s digital age, “data” are collected, analyzed, and represented about all manner of material and topics—including, but not limited to, library collections.

The prevalence of data has also led to an increased emphasis on data analysis skills in K-12 education. Some readers of this newsletter may remember that LC Labs member Eileen J. Manchester worked with Peter DeCraene to explore the suitability of data sets in the Library’s collection for use as primary sources in the K-12 STEM classroom. We found that it was useful to create a derivative dataset, a subset of the Grand Comics Database, tailored for our purpose.

In this close-out conversation on the Signal Blog, Eileen and Peter discuss how messy, yet rewarding, it is to think of library collections as data in a classroom setting. Peter also gives several examples of historical data sources that shed light on the long history of research and data collection before computers.

Inspiring new creative works using digital collections

Finally, we want to stress that “computational uses” of digital collections do not have to be quantitative data analyses or formal research projects! The projects funded in the first round of Connecting Communities Digital Initiative (CCDI) grants are formidable examples of using the Library’s resources to inspire creativity, prompt reflection, and build community around the United States.

We invite you to attend a livestreamed presentation by the inaugural grantees and to learn more about CCDI. The grantees’ presentations, as well as an advisory board member panel, will take place via Zoom from 12:45pm to 3pm EST on Wednesday, July 6. Sign up via this registration page if you are interested.

The second round of CCDI grant opportunities will close on September 30, 2022.

Curio

  • What’s new online: Staff from the Library’s Digital Content Management Section routinely share updates on the Signal Blog about new content added to loc.gov. In case you missed it, you can catch up on January’s additions as well as new content added in time for Memorial Day 2022.

To subscribe to the monthly LC Labs Letter, visit https://updates.loc.gov/accounts/USLOC/subscriber/new?topic_id=USLOC_182

For more information about LC Labs, visit us at https://labs.loc.gov/

Questions? Contact LC Labs at [email protected]

FADGI is a Finalist for the Digital Preservation Coalition 20th Anniversary Award

Today’s guest post is from Kate Murray, Tom Rieger and Hana Beckerle, leaders of the FADGI working groups at the Library of Congress. The Federal Agencies Digital Guidelines Initiative (FADGI) is thrilled to announce that it is a finalist for the prestigious Digital Preservation Coalition (DPC) 20th Anniversary Award! The DPC 20th Anniversary Award celebrates a […]

In conversation: LC Labs staff and Einstein Educator Fellow discuss library data, STEM education, and primary source analysis

The following blog post is a conversation between Eileen J. Manchester of LC Labs and Peter DeCraene, the 2020-2022 Albert Einstein Distinguished Educator Fellow at the Library of Congress. Eileen and Peter reflect on how messy, yet rewarding, it is to think of library collections as data in a classroom educational setting.

FADGI Publishes Revision to Influential Still Image Digitization Guidelines

Today’s guest post is from Hana Beckerle, a 2021/22 Librarian-in-Residence at the Library of Congress. The Federal Agencies Digital Guidelines Initiative (FADGI) Still Image Working Group is pleased to announce the publication of the 3rd edition of the Technical Guidelines for Digitizing Cultural Heritage Materials. The newly-revised Guidelines are in draft form and are open for […]

New from FADGI: Mapping FFV1 into MXF

Today’s guest post is from Kate Murray, Digital Projects Coordinator in the Digital Collections Management and Services Division at the Library of Congress. The Federal Agencies Digital Guidelines Initiative (FADGI) AudioVisual working group is pleased to announce new resources to support diverse digital preservation workflows using the open source FFV1 video encoding. FADGI, through its […]

What’s new online at the Library of Congress – Memorial Day Weekend 2022

Interested in learning more about what’s new in the Library of Congress’ digital collections? The Signal now shares out semi-regularly about new additions to publicly-available digital collections and we can’t wait to show off all the hard work from our colleagues from across the Library. Read on for a sample of what’s been added recently […]

New Article Explores Preservation and Access to Two Historical Literary Audio Archives

This blog post was co-authored by Camille Salas (Assistant Head, Digital Content Management Section), Kristy Darby (Digital Collections Specialist, Digital Content Management Section), and Marcus Nappier (Digital Collections Specialist, Digital Content Management Section). On October 26, 2020, Catalina Gomez of the Latin American, Caribbean & European Division, Anne Holmes (formerly of the Literary Initiatives Division), […]

Performing Arts in the Coronavirus Web Archive: Part 3

This post was originally written by Melissa Wertheimer, a Music Reference Specialist at the Library of Congress, for In the Muse: Performing Arts Blog. In Part 1 of this series, I walked readers through Coronavirus Web Archive items within the theme of financial relief efforts in the performing arts. Part 2 of this series highlighted collection items related to medical and public health […]