The Digital Content Management section has been working on a project to extract and make available sets of files from the Library’s significant web archives holdings. This is another step to explore the web archives and make them more widely accessible and usable. Our aim in creating these sets is to identify reusable, “real world” content in the Library’s digital collections, which we can provide for public access. The outcome of the project will be a series of datasets, each containing 1,000 files of related media types selected from .gov domains. We will announce and explore these datasets here on The Signal, and the data will be made available through LC Labs. Although we invite usage and interest from a wide range of digital enthusiasts, we are particularly hoping to interest practitioners and scholars working on digital preservation education and digital scholarship projects.
We introduced the file datasets and our methods of creating them in our recent post about our PDF dataset, and we are glad to continue the series with the release of our audio dataset. More information about the datasets can be found on the project’s LC Labs Experiments page, and the audio files can be downloaded directly from this link.
Understanding the 1,000 Audio File Dataset
The second installment in our series is a set of 1,000 files with audio media types, randomly selected from archived .gov domain sites. In addition to file format data extracted by Apache Tika, we have included audiovisual format data extracted with MediaInfo, such as bit rate and audio encoding, in the dataset’s accompanying metadata CSV (link to download). More in-depth information about the data can be found in the dataset’s README (link to download).
We examined the dataset and found some significant points to share. Uncompressed, the 1,000 .gov audio files comprise 5.1 Gigabytes. Many of the audio files lack a definite creation date, so the harvested date (which comes from the timestamp column in the metadata CSV) is one of the useful data points for dating the files. Like the PDF file dataset, items in this set were harvested during web crawls conducted over two decades, from 1996 to 2017. The years in which the most objects in this dataset were harvested are 2009 and 2010.
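For readers who want to tally harvest dates themselves, here is a minimal Python sketch. It assumes the timestamp column holds 14-digit values of the form YYYYMMDDhhmmss, a common convention in web archive indexes; check the README for the actual format. The sample rows and the two-column layout are invented for illustration and are not taken from the metadata CSV.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Invented sample rows; the real metadata CSV has many more columns.
SAMPLE_CSV = """timestamp,extension
19980203141500,wav
20090304120000,mp3
20090512081500,mp3
20100621235959,rm
"""

def parse_timestamp(value):
    """Parse an assumed 14-digit YYYYMMDDhhmmss harvest timestamp."""
    return datetime.strptime(value, "%Y%m%d%H%M%S")

def harvest_years(csv_text):
    """Count how many rows were harvested in each calendar year."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(parse_timestamp(row["timestamp"]).year for row in reader)

counts = harvest_years(SAMPLE_CSV)
earliest_year = min(counts)
busiest_year, busiest_count = counts.most_common(1)[0]
print(earliest_year, busiest_year, busiest_count)
```

To run this against the real data, the same functions could be pointed at the text of the downloaded CSV instead of the sample string.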
The earliest harvested recording in the set was collected on the afternoon of January 18th, 1997. By happy chance, it is a short recording of birds chirping. It seems fitting that the oldest sound in the set comes from the natural world.
A variety of file types are present in the audio dataset. There are twenty-one different media type values, eighteen different file extensions, and at least ten different file formats. Some formats can have more than one corresponding file extension. For example, the file extensions associated with RealMedia files are “.ra”, “.rm”, and “.ram”. The data in files with those extensions is either RealAudio or RealMedia Metafile format (or sometimes even RealVideo!). All three of those extensions often share the “audio/x-pn-realaudio” media type despite the RealMedia Metafile format (“.ram” extension or, confoundingly, sometimes “.rm”) being very different from the RealAudio file format (usually “.ra” or “.rm” extensions). These distinctions can be confusing; the media type or the extension suggests that you’re opening a file that will play sounds, but that might not be the case. The top three media types across the set are “audio/mpeg” (38%), “audio/x-wav” (25%), and “audio/x-pn-realaudio” (23%). The top three file extensions are “.mp3” (42%), “.wav” (26%), and “.ram” (14%).
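One way to see past a misleading extension or media type is to look at a file’s first bytes. The Python sketch below is an illustration only, not the method used to build the dataset. It relies on the commonly documented signatures “.ra” followed by byte 0xFD for RealAudio and “.RMF” for the RealMedia container, and it treats a file that begins with a URL as a Metafile. The function name and the list of URL schemes are our own assumptions.

```python
def classify_real_file(data):
    """Guess what a RealMedia-family file contains from its leading bytes."""
    if data.startswith(b".ra\xfd"):
        return "RealAudio"
    if data.startswith(b".RMF"):
        return "RealMedia container"
    # A Metafile is plain text holding one or more links (assumed schemes).
    text = data[:200].decode("ascii", errors="ignore").strip().lower()
    if text.startswith(("rtsp://", "pnm://", "http://", "https://", "file://")):
        return "RealMedia Metafile (links only)"
    return "unknown"

# Invented byte strings standing in for the start of three files.
print(classify_real_file(b".ra\xfd\x00\x04"))
print(classify_real_file(b".RMF\x00\x00\x00\x12"))
print(classify_real_file(b"rtsp://example.gov/audio/clip.rm\n"))
```

A check like this only inspects the header, so it cannot say whether the audio inside a container will actually play.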
A further wrinkle in characterizing audio files, and in relating media types, extensions, and formats, is that some audio file formats (“wrappers” or “containers”) can wrap different types of audio bitstream encodings. For instance, a WAVE format file can technically wrap an MPEG Audio (MP3) encoded audio stream. Though using WAVE files in this way is uncommon, this recording appears to be one of a few examples in the set. RealAudio format files can also wrap a variety of different encodings. The most frequent audio encodings across the dataset are MPEG Audio (42% of the set, primarily corresponding to the “.mp3” extension and MP3 file format) and Pulse Code Modulated (PCM) Audio (26%, corresponding to WAVE format and “.wav” extension).
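To make the wrapper idea concrete, the following Python sketch reads the format tag from a WAVE file’s “fmt ” chunk, which records the encoding of the wrapped stream. The tag values 0x0001 (PCM) and 0x0055 (MPEG Layer 3) come from the WAVE format registry. The helper that builds a header-only test file is hypothetical and exists only so the example runs without a real recording. This is a simplified reader and not a substitute for MediaInfo.

```python
import struct

# Format tags from the WAVE registry: 0x0001 is PCM, 0x0055 is MPEG Layer 3.
FORMAT_TAGS = {0x0001: "PCM", 0x0055: "MPEG Layer 3 (MP3)"}

def wave_encoding(data):
    """Return the encoding named by the format tag in a WAVE fmt chunk."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt ":
            if pos + 10 > len(data):
                raise ValueError("truncated fmt chunk")
            (format_tag,) = struct.unpack_from("<H", data, pos + 8)
            return FORMAT_TAGS.get(format_tag, f"other (0x{format_tag:04x})")
        # Chunks are padded to an even number of bytes.
        pos += 8 + chunk_size + (chunk_size & 1)
    raise ValueError("no fmt chunk found")

def fake_wave_header(format_tag):
    """Build a tiny header-only WAVE byte string (hypothetical test data)."""
    # Fields: tag, channels, sample rate, byte rate, block align, bit depth.
    fmt_body = struct.pack("<HHIIHH", format_tag, 1, 8000, 8000, 1, 8)
    fmt_chunk = b"fmt " + struct.pack("<I", len(fmt_body)) + fmt_body
    return b"RIFF" + struct.pack("<I", 4 + len(fmt_chunk)) + b"WAVE" + fmt_chunk

print(wave_encoding(fake_wave_header(0x0001)))
print(wave_encoding(fake_wave_header(0x0055)))
```

Both test headers would carry a “.wav” extension, yet only the first wraps PCM audio.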
As we’ve mentioned, extensions and media types corresponding to RealMedia format files are well represented in the set. However, encodings common to RealMedia are not a large portion of the audio encodings identified in the dataset. This is because many of the RealMedia files are actually RealMedia Metafile format, containing only links to internet streams or other audio files instead of an audio bitstream. An example of that sort of file from the set is this “.ram” from our own website. In fact, MediaInfo was unable to identify an encoding or format for 23% of the files – in most cases these files do not contain any recorded sound. If you’re interested in the technical aspects of audio formats, please find more information here.
When exploring duration, we discovered that the longest file runs 4 hours, 43 minutes, and 7 seconds. It is a recording of a webinar on river herring extinction risk. Of the 762 files with duration information, the shortest is 113 milliseconds, and it is a recording of a touch-tone phone beep. Excluding the files with no duration information, the average recording length is 9 minutes and 49 seconds. Interestingly, 51% of those 762 files are shorter than one minute and 67% are shorter than three minutes, so a few long recordings pull the arithmetic mean well above the typical length. The sum duration of all the files is 5 days, 4 hours, 46 minutes, and 8 seconds.
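The average can be checked from the figures above: dividing the total duration by the 762 files with duration information gives the reported mean. The short Python sketch below does that arithmetic, then uses a handful of invented durations (only the longest value, 16,987 seconds, matches a figure from this post) to show how one long recording pulls a mean far above the median.

```python
from statistics import mean, median

def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a days/hours/minutes/seconds duration to whole seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def format_duration(total_seconds):
    """Format a duration as whole minutes and seconds, dropping fractions."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return f"{minutes} min {seconds} s"

# Figures reported in the post.
total = to_seconds(days=5, hours=4, minutes=46, seconds=8)
files_with_duration = 762
average = total / files_with_duration
print(total, format_duration(average))

# Invented durations in seconds; the last equals 4 h 43 min 7 s.
toy = [5, 20, 40, 55, 90, 200, to_seconds(hours=4, minutes=43, seconds=7)]
print(format_duration(mean(toy)), format_duration(median(toy)))
```

With skewed data like this, the median is often a better description of a typical recording than the mean.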
Interacting with the metadata allows us to observe patterns, though this sample is too small in comparison to the full extent of our web archives (let alone the internet) to reveal true trends. But analyzing the small set helps us form questions to ask of the data across our full archive. For instance, charting the distribution of the six most frequent extensions in the dataset harvested over time yields this chart:
The visualization makes us wonder: were WAVE and RealMedia the predominant audio formats on .gov domains in the early years of web archiving? Did MP3 really begin to appear on the web more significantly around 2003? If the 2009-2010 spike isn’t a quirk of the small sample size, why were so many more “.wav”, “.mp3”, and “.rm” files harvested then? Did the government start hosting more audio in those years? Or did the Library’s collecting patterns make a difference in the amount of audio collected? They’re all compelling questions for examination across the full archives.
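For anyone who would like to rebuild the counts behind a chart like this one, here is a rough Python sketch. It tabulates, for each harvest year, how many files carry each of the most frequent extensions. The (year, extension) pairs are invented, and the function name and its top_n parameter are our own; in practice the pairs would be derived from the timestamp and extension information in the metadata CSV.

```python
from collections import Counter, defaultdict

# Invented (harvest year, extension) pairs for illustration.
ROWS = [
    (1998, "wav"), (1999, "ra"), (2003, "mp3"), (2009, "mp3"),
    (2009, "wav"), (2009, "mp3"), (2010, "rm"), (2010, "mp3"), (2010, "au"),
]

def extension_by_year(rows, top_n=6):
    """Tabulate yearly counts for the top_n most frequent extensions."""
    totals = Counter(ext for _, ext in rows)
    top = [ext for ext, _ in totals.most_common(top_n)]
    table = defaultdict(Counter)
    for year, ext in rows:
        if ext in top:
            table[year][ext] += 1
    # Return the years in chronological order.
    return top, dict(sorted(table.items()))

top, table = extension_by_year(ROWS, top_n=3)
for year, counts in table.items():
    print(year, " ".join(f"{ext}={counts[ext]}" for ext in top))
```

The resulting table can be handed to any plotting tool as a stacked bar or line chart.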
We are sure there are more questions the dataset can spark, but we will leave those for you to ponder! We’re interested to know what you find in this dataset and what questions about our archive the files inspire.
Comments (3)
Other than the metadata, has any voice-recognition software been applied against the spoken-word audio? I am under the impression that, like OCR, translating audio into searchable text is imperfect but is getting better. At least the text version of my phone’s voicemail messages gives me an idea of who is calling and what the message is about, which is valuable info, even if a non-trivial portion of the transcript is (sometimes-amusing) nonsense.
That’s a great suggestion! In exploring this data, we hoped to provide just enough information to spark researcher interest just like this. So while we didn’t run the audio files through any additional processing software, we think that’s an excellent idea and would love to see what might come of it!
I think that Pulse Code Modulation is a better audio representation, but that is because most Microsoft Windows 95 (and some with 3.11 Windows for Workgroups) used this form of wave recording standard. I was able to enjoy the dynamics of Cool Edit 95 and later Cool Edit Pro 1.2 for a good many years (from 1995 to 2000+) until David Johnston sold his software to Adobe for Audition 1.0. Now, it’s great to see Audacity is a workhorse open source project, and those in the radio and recording industry can benefit from fast fourier transforms (FFTs) in an open sourced software program.