Integrating Wikidata at the Library of Congress

This is a guest post by Matt Miller, a Linked Data Applications Technical Specialist in the Network Development and MARC Standards Office in Library Services.

Wikidata is described as “a free and open knowledge base that can be read and edited by both humans and machines.” Much like its better-known sibling Wikipedia, Wikidata is a collaborative, community-driven project. While users of Wikipedia create and edit encyclopedia articles, contributors to Wikidata create structured data. The quickest way to understand Wikidata is by looking at an entry. For example, if you look up an author, you see a list of claims about that person: birth date, notable works, occupation, and many other biographical statements. Combined with the hundreds of millions of other entries in Wikidata, these claims form a very powerful database. You can learn more about Wikidata from this short video.

There has been growing interest in Wikidata from libraries and other cultural heritage organizations. Of the many potential uses for Wikidata, one emerging area of focus has been using it as a hub for institutional identifiers. Many organizations maintain unique identifiers for people, subjects, works, etc. If these IDs are all added to Wikidata, then anyone who knows the Wikidata ID can seamlessly access data from dozens of sources. Returning to the author example above, the Wikidata page for Virginia Woolf has ninety external links to various organizations. Many of these are national libraries, museums, and other cultural heritage institutions, including the Library of Congress.
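The hub idea can be sketched in a few lines of Python. This is a minimal illustration, not part of the Library's tooling: it assumes an entity document shaped like Wikidata's JSON export (a `claims` map whose statements carry a `mainsnak`), and the function name `external_ids` and the hand-written `sample_entity` are hypothetical. P244 is the Wikidata property for the Library of Congress authority ID.

```python
from typing import Dict, List


def external_ids(entity: dict) -> Dict[str, List[str]]:
    """Collect external-identifier values from one Wikidata entity document."""
    found: Dict[str, List[str]] = {}
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            snak = statement.get("mainsnak", {})
            # Keep only identifier-typed claims that actually carry a value.
            if snak.get("datatype") != "external-id":
                continue
            if snak.get("snaktype") != "value":
                continue
            found.setdefault(prop, []).append(snak["datavalue"]["value"])
    return found


# Hand-written stand-in for a small slice of an entity document (not live data).
sample_entity = {
    "claims": {
        "P244": [  # Library of Congress authority ID
            {"mainsnak": {"snaktype": "value", "datatype": "external-id",
                          "datavalue": {"value": "n79041870", "type": "string"}}}
        ],
        "P31": [  # "instance of" points at another item, so it is skipped
            {"mainsnak": {"snaktype": "value", "datatype": "wikibase-item",
                          "datavalue": {"value": {"id": "Q5"},
                                        "type": "wikibase-entityid"}}}
        ],
    }
}

print(external_ids(sample_entity))
```

Run against a full entity document, the same loop would return one entry per institution that has an identifier recorded on the item, which is what makes a single Wikidata ID useful as a jumping-off point.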

The Library of Congress maintains many widely used authority files. Two of the largest are the Name Authority File (NAF) and the Library of Congress Subject Headings (LCSH). The Network Development and MARC Standards Office maintains the Linked Open Data version of these files at id.loc.gov. For example, authority data for Virginia Woolf is located at //id.loc.gov/authorities/names/n79041870. This data ensures that items being cataloged all reference the same person. One of the goals of linked data is to link out to others’ data. On id.loc.gov we maintain links to many other institutions’ authority files, including the French and German national libraries, other government services such as the Department of Agriculture, and other cultural institutions like the Getty Museum. These links appear on the page and are also present in the machine-readable data. Given Wikidata’s potential as a hub of identifiers, we wanted our authority records to link out to Wikidata as well.

 


An example of id.loc.gov links for Virginia Woolf

Before adding the Wikidata IDs to the id.loc.gov system, I wanted to check how many Library of Congress authority IDs were already in Wikidata. You can run queries like this using their query interface. At the beginning of 2019 there were around 650,000 IDs already in Wikidata, added over time by Wikimedians as they edited data. Over the next few months, using various existing mappings such as the OCLC VIAF project, I bulk loaded 400,000 more LC identifiers into Wikidata. This brought the total number of Library of Congress identifiers in Wikidata to over one million. The majority of these links are to the Name Authority File, with around 35,000 of them linking to the LCSH subject file. These links to Wikidata now appear on over one million id.loc.gov authority pages and in the data.
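A count like this can be expressed as a SPARQL query against the Wikidata Query Service. The sketch below is one plausible form, not necessarily the query used for the figures above: it builds the query text and a request URL in Python without sending anything over the network. The helper names `count_query` and `request_url` are hypothetical. P244 is the Library of Congress authority ID property, and the optional prefix filter relies on LCSH identifiers beginning with "sh".

```python
from typing import Optional
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
LC_AUTHORITY_ID = "P244"  # Wikidata property: Library of Congress authority ID


def count_query(id_prefix: Optional[str] = None) -> str:
    """Build a SPARQL query counting items that carry an LC authority ID.

    If id_prefix is given (for example "sh" for subject headings), only
    identifiers starting with that prefix are counted.
    """
    lines = [
        "SELECT (COUNT(DISTINCT ?item) AS ?total) WHERE {",
        f"  ?item wdt:{LC_AUTHORITY_ID} ?lcId .",
    ]
    if id_prefix is not None:
        lines.append(f'  FILTER(STRSTARTS(?lcId, "{id_prefix}"))')
    lines.append("}")
    return "\n".join(lines)


def request_url(query: str) -> str:
    """Return a GET URL that asks the query service for JSON results."""
    return f"{WDQS_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"


print(count_query())        # every item with an LC authority ID
print(count_query("sh"))    # only identifiers that look like LCSH
print(request_url(count_query()))
```

The printed query text can also be pasted directly into the query interface, which already defines the `wdt:` prefix.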

While we have added many external links to id.loc.gov in the past, the addition of Wikidata is very different. Previously, these types of mappings have been fairly static and provided by the contributing institution. With Wikidata, the contributing institution is an active, open community of editors. This means anyone can contribute to the process. When a Library of Congress identifier is added to a Wikidata page, it will appear on id.loc.gov once the data is refreshed. Anyone can help us build connections between these two knowledge systems by adding Library of Congress identifiers to Wikidata.

Wikidata also represents a very different type of data than that found in traditional authority data. Wikidata often contains more extensive biographical data. To simplify, Wikidata contains information about the person or thing, and the data found in id.loc.gov helps connect it to library resources. If you combine the two, you can see a compelling reason why it is important to build these connections.

Proof-of-Concept

Using records from the Library of Congress Prints & Photographs Division I built an interface that combines Library of Congress collection items with Wikidata information. This tool demonstrates the possibilities in connecting these two knowledge systems:


Virginia Woolf’s entry in the PnP Wikidata Tool – try it here

You are able to browse 66,000 pictures representing 13,300 entities, based on the biographical metadata from Wikidata. You can start asking interesting questions like “Show some images held at LC of women who have won a Nobel Prize” or “What are some images of works by photographers born in the early 1900s available at LC?” This tool is by no means comprehensive. More than 1.3 million records with almost 80,000 contributor names are in the catalog for pictures. Some records don’t link to a digital image and many names don’t yet appear in Wikidata, two required factors for an image to appear in this tool. But this proof-of-concept demonstration gives a valuable glimpse into potential future avenues of discovery. Explore the PnP Wikidata Tool on its LC Labs experiment page.
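The two conditions for an image to appear (a linked digital image and a contributor name that is present in Wikidata) amount to a simple filter plus a join on the LC identifier. The following is a toy sketch of that logic, not the tool's actual code: the `PictureRecord` class, the `browsable` function, and all sample values (titles, example.org URLs, the `Q_EXAMPLE` ID) are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class PictureRecord:
    """Hypothetical, simplified catalog record for one picture."""
    title: str
    contributor_lccn: str
    image_url: Optional[str]  # None when the record has no digital image


def browsable(
    records: List[PictureRecord], lccn_to_qid: Dict[str, str]
) -> List[Tuple[PictureRecord, str]]:
    """Pair each usable record with the Wikidata ID of its contributor.

    A record is usable only if it links to a digital image AND its
    contributor's LC identifier has a known Wikidata mapping.
    """
    return [
        (record, lccn_to_qid[record.contributor_lccn])
        for record in records
        if record.image_url and record.contributor_lccn in lccn_to_qid
    ]


# Placeholder data for illustration only.
records = [
    PictureRecord("Placeholder portrait", "n79041870",
                  "https://example.org/portrait.jpg"),
    PictureRecord("Placeholder print, not digitized", "n79041870", None),
    PictureRecord("Placeholder photo, name not in Wikidata", "n00000000",
                  "https://example.org/photo.jpg"),
]
lccn_to_qid = {"n79041870": "Q_EXAMPLE"}

for record, qid in browsable(records, lccn_to_qid):
    print(record.title, "->", qid)
```

Once each picture is paired with a Wikidata ID, biographical facts from Wikidata (occupation, awards, birth year) can be used to filter the pictures, which is what makes questions like the ones above answerable.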

As the Library of Congress continues to work on the BIBFRAME initiative, in the coming months work resources (bibliographic descriptions of books, images, etc.) will begin to appear in the id.loc.gov system. This will simplify the link from metadata in Wikidata to actual resources held at the Library of Congress. As these linked data systems continue to develop at the Library of Congress, we will begin to see exciting new possibilities for access and discovery.

2 Comments

  1. Jesse Johnston
    May 24, 2019 at 11:30 am

    This is exciting work. Thanks for sharing about this with us, Matt!

  2. Sam Smith
    July 18, 2019 at 5:05 pm

    Library of Congress: Meghan Ferriter and Matt Miller
    Thank you for this informative article. I have been working with Dr. Yongqun He, a specialist in medical ontologies, to enhance Wikidata for historical research with an improved ontological framework. I attended Wikiconference in Columbus, OH, last October and had an “Ontology Omlette” breakfast with Smithsonian, Harvard Library and Vanderbilt University participants on the topic of using Wikidata and ontologies for search, research and analysis. I am hopeful that U of Michigan will take an interest in a comprehensive effort to make Wikidata a focal point for coordinating information – just as you suggest in the article. Please let me know if you would like to hear more about this “structured history” project. Best regards, Sam Smith – Grosse Pointe, Michigan
