The Digital Content Management section has been working to extract and make available sets of files from the Library’s significant Web Archives holdings. The outcome of the project is a series of web archive file datasets, each containing 1,000 files of related media types selected from .gov domains. You can read more about this series here.
PowerPoint presentations have become a nearly ubiquitous form of communication document in the digital era. At the most basic level, PowerPoint files present a sequence of slides containing text, images and multimedia. Today, we are excited to share a dataset of 1,000 random slide decks from U.S. government websites, collected via the Library of Congress Web Archive, such as the presentation on transporting hazardous materials in Figure 1. You can download a CSV file of data about the files, learn more about the dataset from this README, or download the entire 3.7 GB dataset of the actual files.
Understanding the 1,000 U.S. Government Slide Decks
The dataset contains 1,000 purported PowerPoint files residing on the .gov United States government domain, randomly selected from the Library of Congress Web Archive. More specifically, it includes 1,000 files that asserted an association with PowerPoint in their Media Type. Nearly all of these are .ppt files. Of note, newer PowerPoint files that use the extension .pptx use a different Media Type, and as a result only 11 files in the corpus end in .pptx. As part of our analysis and creation of this dataset, we ran each file through Apache Tika to collect additional metadata. For example, by aggregating the slide count and word count fields from the metadata CSV, we discovered that the dataset contains 22,542 individual slides and 1,340,722 individual words. The words may appear on the slides themselves or in the notes field associated with a slide. Some files in the dataset did not report a slide count or a word count and, as such, were not included in these aggregate numbers. The README for this dataset contains more information about these and all the other fields included in the metadata CSV.
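For readers who want to reproduce totals like these from the metadata CSV, here is a minimal sketch. The column names `slide_count` and `word_count` and the sample rows are assumptions made for illustration; the README lists the actual field names. Rows with a missing count are skipped, as described above.

```python
import csv
import io

# Made-up sample rows. The column names "slide_count" and "word_count" are
# assumptions; check the README for the real field names.
SAMPLE_CSV = """url,slide_count,word_count
https://example.gov/a.ppt,10,500
https://example.gov/b.ppt,,
https://example.gov/c.ppt,30,1500
"""


def aggregate_counts(csv_file, slide_col="slide_count", word_col="word_count"):
    """Sum slide and word counts, skipping rows where a count is missing."""
    totals = {"slides": 0, "words": 0, "files_with_slides": 0, "files_with_words": 0}
    for row in csv.DictReader(csv_file):
        slides = (row.get(slide_col) or "").strip()
        words = (row.get(word_col) or "").strip()
        # Only whole-number values are counted; blanks are left out.
        if slides.isdigit():
            totals["slides"] += int(slides)
            totals["files_with_slides"] += 1
        if words.isdigit():
            totals["words"] += int(words)
            totals["files_with_words"] += 1
    return totals


totals = aggregate_counts(io.StringIO(SAMPLE_CSV))
print(totals)
```

To run this against the downloaded CSV, you could pass a file object opened with `open(path, newline="")` in place of the `io.StringIO` sample. Tracking how many files actually reported each count also makes it possible to compute an average over only those files.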
The data suggests that, on average, these slide decks are 22 slides long and contain 1,340 words each. As the scatter plot below illustrates, however, a small number of outliers significantly skew both the slide counts and the detected word counts.
The outliers in Figure 2 demonstrate the varied ways that PowerPoint is used for government publishing. For example, consider the furthest outlier with regard to both the number of slides and the number of words detected: 288 and 29,939, respectively. The length of the deck and the extensive text notes included with the slides make this employee training guide from the state of Washington feel more like a book than a presentation. Similarly, this slide deck from the U.S. Department of Transportation on transporting hazardous materials contains 147 slides and 7,693 words.
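One way to see how much a few very long decks pull the average upward is to compare the mean with the median. The sketch below uses a short, made-up list of slide counts (only the 288 echoes a figure from the dataset); it is an illustration, not a calculation on the real data.

```python
from statistics import mean, median


def summarize(counts):
    """Return the mean, median and maximum of a list of counts."""
    return {"mean": mean(counts), "median": median(counts), "max": max(counts)}


# Made-up slide counts; the single long deck stands in for an outlier.
slide_counts = [12, 15, 18, 20, 22, 288]
summary = summarize(slide_counts)
print(summary)
```

In this toy example the mean (62.5) is more than three times the median (19), which is the kind of gap that suggests looking at the median alongside the average.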
U.S. Government Slide Decks Over Time
Files in this dataset were captured between 1997 and 2017. It is important to note, however, that the capture date can differ from the creation date field, which was derived through Apache Tika. For example, the earliest creation date found in the dataset belongs to a 1994 slide deck on a leadership program from NASA. However, it was not captured in the web archives until six years later, in 2000.
Figure 3 illustrates the gap between the original creation date of the files and their capture date, and it underscores the necessity of understanding the data, its provenance, and the nuances of its metadata. Further analysis in this arena would be fascinating, and we encourage you to dive in and let us know what you find!
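As a starting point for that kind of analysis, the following sketch computes the lag between a file's creation date and its capture date. It assumes both dates are available as ISO-formatted strings (YYYY-MM-DD); the actual format of the date fields in the metadata CSV may differ, and the example dates are made up rather than taken from the dataset.

```python
from datetime import date


def capture_lag_days(created_iso, captured_iso):
    """Days between creation and capture; negative if capture precedes creation."""
    created = date.fromisoformat(created_iso)
    captured = date.fromisoformat(captured_iso)
    return (captured - created).days


# Made-up dates, loosely modeled on a deck created in 1994 and captured in 2000.
lag = capture_lag_days("1994-06-01", "2000-06-01")
print(lag, "days, or roughly", round(lag / 365.25, 1), "years")
```

A negative result would mean the reported creation date falls after the capture date, which is a useful signal that the embedded metadata for that file deserves a closer look.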
What Will You Do With 1,000 U.S. Government Slide Decks?
We are curious about the ways that you might explore and use this set of slide decks. Even from this initial exploration, it is clear that these varied resources have become an important part of the way the government communicates and publishes.
Valiant, interesting effort. However, not being able to see the contents, or which slides come from which agency, is a problem. Maybe another pass through the contents would help. I’d rather not download a huge zip file just to see what might be there, IMHO.
Hi DrWeb, Happy to help! You can download just the CSV of metadata about the 1,000 files here. From the CSV, you can also get links to the individual files, so you can download only the ones you are interested in. The metadata has all of the original URLs for the files, so you can see which agency websites they are each from.
This is really wonderful and a great resource for the community. Thank you!
Would it be possible to identify other types of presentation files? Do you have any other metadata that could help to do so other than “media type”?
If it’s any use here is a query in Wikidata that includes many more types of formats that are created by presentation software applications:
I will also see about getting this added to the test corpora section on https://digipres.org