In the Library’s Web Archives: US Government Audio on Shuffle

The Digital Content Management section has been working on a project to extract and make available sets of files from the Library’s significant web archives holdings. This is another step to explore the web archives and make them more widely accessible and usable. Our aim in creating these sets is to identify reusable, “real world” content in the Library’s digital collections, which we can provide for public access. The outcome of the project will be a series of datasets, each containing 1,000 files of related media types selected from .gov domains. We will announce and explore these datasets here on The Signal, and the data will be made available through LC Labs. Although we invite usage and interest from a wide range of digital enthusiasts, we are particularly hoping to interest practitioners and scholars working on digital preservation education and digital scholarship projects.

We introduced the file datasets and our methods of creating them in our recent post about our PDF dataset, and are glad to continue the series with the release of our audio dataset. More information about the datasets can be found on the project’s LC Labs Experiments page, and the audio files can be downloaded directly from this link.

Understanding the 1,000 Audio File Dataset

The second installment in our series is a set of 1,000 files with audio media types, randomly selected from archived .gov domain sites. In addition to file format data extracted by Apache Tika, we have included audiovisual format data extracted with MediaInfo, such as bit rate and audio encoding, in the dataset’s accompanying metadata CSV (link to download). More in-depth information about the data can be found in the dataset’s README (link to download).

We examined the dataset and found some significant points to share. Uncompressed, the 1,000 .gov audio files comprise 5.1 Gigabytes. Many of the audio files lack a definite creation date, so the harvested date (which comes from the timestamp column in the metadata CSV) is one of the useful data points for dating the files. Like the PDF file dataset, items in this set were harvested during web crawls conducted over two decades, from 1996 to 2017. The years in which the most objects in this dataset were harvested are 2009 and 2010. 
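As a rough sketch of how harvest years could be tallied from the metadata CSV: the snippet below assumes the timestamp column holds 14-digit web archive timestamps (YYYYMMDDhhmmss). That layout is an assumption rather than something stated in this post, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; only the "timestamp" column name comes from the post.
# The 14-digit YYYYMMDDhhmmss layout is an assumption about the CSV.
SAMPLE_CSV = """timestamp
19970601120000
20090301120000
20090715083000
20100102000000
"""

def harvest_year_counts(csv_text):
    """Count rows per harvest year, reading the year from the first four digits."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(int(row["timestamp"][:4]) for row in reader)

counts = harvest_year_counts(SAMPLE_CSV)
for year in sorted(counts):
    print(year, counts[year])
```

To run this against the real file, the text of the downloaded metadata CSV would be passed in place of SAMPLE_CSV.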

A bar chart showing the number of objects harvested per year. The x-axis is years 1997-2017, and the y-axis is number of objects 0-250. The chart shows that the years 2009 and 2010 were the years where objects in the set were harvested most frequently.

The earliest harvested recording in the set was collected on the afternoon of January 18th, 1997. By happy chance, it is a short recording of birds chirping, which I find charming and evocative. It seems fitting that the earliest sounds in the set come from the natural world.

A variety of file types are present in the audio dataset. There are twenty-one different media type values, eighteen different file extensions, and at least ten different file formats. Some formats can have more than one corresponding file extension. For example, the file extensions associated with RealMedia files are “.ra”, “.rm”, and “.ram”. The data within those extensions is either RealAudio or RealMedia Metafile format (or sometimes even RealVideo!). All three of those extensions often share the “audio/x-pn-realaudio” media type despite the RealMedia Metafile format (“.ram” extension or, confoundingly, sometimes “.rm”) being very different from the RealAudio file format (usually “.ra” or “.rm” extensions). These distinctions can be confusing; the media type or the extension suggests that you’re opening a file that will play sounds, but that might not be the case. The top three media types across the set are “audio/mpeg” (38%), “audio/x-wav” (25%), and “audio/x-pn-realaudio” (23%). The top three file extensions are “.mp3” (42%), “.wav” (26%), and “.ram” (14%).
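The relationship between media types and extensions described above can be explored with a small tally. This is a minimal sketch: the (media type, extension) pairs are hypothetical stand-ins for rows of the metadata CSV, and the function names are ours, not part of the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical (media type, extension) pairs standing in for metadata CSV rows.
rows = [
    ("audio/mpeg", ".mp3"),
    ("audio/x-wav", ".wav"),
    ("audio/x-pn-realaudio", ".ra"),
    ("audio/x-pn-realaudio", ".rm"),
    ("audio/x-pn-realaudio", ".ram"),
    ("audio/mpeg", ".mp3"),
]

def share_by_value(values):
    """Return each distinct value's share of the total as a percentage."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: 100 * n / total for value, n in counts.most_common()}

def extensions_by_media_type(pairs):
    """Collect the set of extensions seen under each media type."""
    grouped = defaultdict(set)
    for media_type, extension in pairs:
        grouped[media_type].add(extension)
    return dict(grouped)

media_shares = share_by_value(media_type for media_type, _ in rows)
grouped = extensions_by_media_type(rows)
print(media_shares)
print(sorted(grouped["audio/x-pn-realaudio"]))
```

Grouping extensions under each media type makes the one-to-many cases, such as the three Real extensions sharing a single media type, easy to spot.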

A further wrinkle in characterizing audio files, and in relating media types, extensions, and formats, is that some audio file formats (“wrappers” or “containers”) can wrap different types of audio bitstream encodings. For instance, a WAVE format file can technically wrap an MPEG Audio (MP3) encoded audio stream. Though using WAVE files in this way is uncommon, this recording appears to be one of a few examples in the set. RealAudio format files can also wrap a variety of different encodings. The most frequent audio encodings across the dataset are MPEG Audio (42% of the set, primarily corresponding to the “.mp3” extension and MP3 file format) and Pulse Code Modulated (PCM) Audio (26%, corresponding to WAVE format and “.wav” extension).

A pie chart showing the number of objects by audio encoding. There are segments for MPEG Audio (41.9%), PCM (26.0%), ACELP (2.7%), WMA (3.6%), None Identified (22.8%), and Other (3%).

As we’ve mentioned, extensions and media types corresponding to RealMedia format files are well represented in the set. However, encodings common to RealMedia are not a large portion of the audio encodings identified in the dataset. This is because many of the RealMedia files are actually RealMedia Metafile format, containing only links to internet streams or other audio files instead of an audio bitstream. An example of that sort of file from the set is this “.ram” from our own website. In fact, MediaInfo was unable to identify an encoding or format for 23% of the files – in most cases these files do not contain any recorded sound. If you’re interested in the technical aspects of audio formats, please find more information here.
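One way to tell a metafile from a file that actually carries sound is to look at its first bytes. The sketch below is a heuristic of our own, not the method used to build the dataset: it assumes the commonly documented signatures “.ra” followed by byte 0xFD for RealAudio and “.RMF” for RealMedia, and treats a plain-text file whose first line is a stream URL as a metafile. The function name and the list of URL schemes are illustrative assumptions.

```python
def classify_real_file(data: bytes) -> str:
    """Guess what kind of Real file a byte string is, using its leading bytes only."""
    if data.startswith(b".ra\xfd"):   # assumed RealAudio signature
        return "RealAudio"
    if data.startswith(b".RMF"):      # assumed RealMedia container signature
        return "RealMedia"
    # Metafiles are assumed to be plain text: stream URLs, one per line.
    try:
        text = data[:1024].decode("ascii").strip()
    except UnicodeDecodeError:
        return "unknown"
    schemes = ("rtsp://", "pnm://", "http://", "https://", "file://")
    first_line = text.splitlines()[0].strip().lower() if text else ""
    if first_line.startswith(schemes):
        return "RealMedia Metafile"
    return "unknown"

# Hypothetical byte strings, not files from the dataset.
print(classify_real_file(b".ra\xfd\x00\x04"))
print(classify_real_file(b"rtsp://example.gov/audio/clip.rm\n"))
```

A check like this would only be a first pass; a tool such as MediaInfo remains the better source for the actual encoding.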

When exploring duration, we discovered that the file with the longest duration is 4 hours, 43 minutes, and 7 seconds. It is a recording of a webinar on river herring extinction risk. Of the 762 files with duration information, the shortest is 113 milliseconds, and it is a recording of a touch-tone phone beep. Removing the files with no duration from the set, the average recording length is 9 minutes and 49 seconds. Interestingly, of the 762 files, 51% are shorter than one minute and 67% are shorter than 3 minutes. There are some long recordings that push the arithmetic mean upwards. The sum duration of all the files is 5 days, 4 hours, 46 minutes, and 8 seconds.
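The duration figures above can be cross-checked with a little arithmetic: dividing the reported total by the 762 files with duration information should give the reported average. The sketch below does that, then uses a made-up list of durations (only the longest value, 4 hours 43 minutes 7 seconds, comes from the post) to show how one long recording pulls the mean well above the median.

```python
from statistics import mean, median

def format_duration(total_seconds):
    """Break a number of seconds into (days, hours, minutes, seconds)."""
    seconds = int(total_seconds)
    days, seconds = divmod(seconds, 86_400)
    hours, seconds = divmod(seconds, 3_600)
    minutes, seconds = divmod(seconds, 60)
    return days, hours, minutes, seconds

# Figures reported in the post.
FILES_WITH_DURATION = 762
total_seconds = 5 * 86_400 + 4 * 3_600 + 46 * 60 + 8  # 5 d 4 h 46 min 8 s

average_seconds = total_seconds / FILES_WITH_DURATION
print(format_duration(average_seconds))  # (0, 0, 9, 49), i.e. 9 min 49 s

# Made-up durations in seconds: mostly short clips plus one long recording.
toy = [5, 20, 45, 50, 120, 16_987]
print(mean(toy), median(toy))
```

In the toy list the median is 47.5 seconds while the mean is close to 48 minutes, the same kind of skew the real set shows.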

Interacting with the metadata allows us to observe patterns, though this sample is too small in comparison to the full extent of our web archives (let alone the internet) to reveal true trends. But analyzing the small set helps us form questions to ask of the data across our full archive. For instance, charting the distribution of the six most frequent extensions in the dataset harvested over time yields this chart:

A line chart with six lines, one for mp3, wav, ram, rm, wma, and ra. The x-axis is years 1997-2017, and the y-axis is number of objects 0-140. The chart shows that the years 2009 and 2010 were the years where objects in the set were harvested most frequently. There is a large spike in those years for mp3 and wav.

The visualization makes us wonder: were the audio files on .gov domains mostly WAVE and RealMedia in the early years of web archiving? Did MP3 really begin to appear on the web more significantly around 2003? If the 2009-2010 spike isn’t a quirk of the small sample size, why were so many more “.wav”, “.mp3”, and “.rm” files harvested then? Did the government start hosting more audio in those years? Or did the Library’s collecting patterns make a difference in the amount of audio collected? They’re all compelling questions for examination across the full archives.

We are sure there are more questions the dataset can spark, but we will leave those for you to ponder! We’re interested to know what you find in this dataset and what questions about our archive the files inspire.

 
