The digital preservation community is a connected and collaborative one. I first heard about the Europe-based PREFORMA project last summer at a Federal Agencies Digitization Guidelines Initiative meeting when we were discussing the Digital File Formats for Videotape Reformatting comparison matrix. My interest was piqued because I heard about their incorporation of FFV1 and Matroska, both included in our matrix but not yet well adopted within the federal community. I was drawn first to PREFORMA’s format standardization efforts – Disclosure and Adoption are two of the sustainability factors we use to evaluate digital formats on the Sustainability of Digital Formats website – but the wider goals of the project are equally interesting.
In this interview, I was excited to learn more about the PREFORMA project from MediaConch’s Project Manager Dave Rice and Archivist Ashley Blewer.
Kate: Tell me about the goals of the PREFORMA project and how you both got involved. What are your specific roles?
Dave: The goals of the PREFORMA project are best summarized by their foundational document called the PREFORMA Challenge Brief (PDF). The Brief describes an objective to “establish a set of tools and procedures for gaining full control over the technical properties of digital content intended for long-term preservation by memory institutions”. The brief recognizes that although memory institutions have honed decades of expertise for the preservation of specific materials, we need additional tools and knowledge to achieve the same level of preservation control with digital audiovisual files.
For initial work, the PREFORMA consortium selected several file formats including TIFF, PDF/A, lossless FFV1 video, the Matroska container, and PCM audio. After a comprehensive proposal process, three suppliers were selected to move forward with development. A project called veraPDF focusing on PDF/A is led by a consortium comprised of Open Preservation Foundation, PDF Association, Digital Preservation Coalition, Dual Lab, and KEEP SOLUTIONS. The TIFF format is addressed by DPF Manager led by Easy Innova. Ashley and I work as part of the MediaArea.net team. Our project is called MediaConch and focuses on the selected audiovisual formats: Matroska, FFV1, and PCM. MediaArea is led by Jérôme Martinez, who is the originator and principal developer of MediaInfo.
Ashley: Dave and Jérôme have collaborated in the past on open source software projects such as BWF MetaEdit (developed by AudioVisual Preservation Solutions as part of a FADGI initiative to support embedded metadata) and QCTools. QCTools, developed by BAVC with support from the National Endowment for the Humanities, was profiled in a blog post last year. Dave had also brought me in to do some work on the documentation and design of QCTools. When QCTools development was wrapping up, we submitted a proposal to PREFORMA and were accepted into the initial design phase. During that phase, we competed with other teams to deliver the software structure and design. We were then invited to continue to Phase II of the project: the development prototyping stage. We are currently in month seven (out of 22) of this second phase.
The majority of the work happens in Europe, which is where the software development team is based. Jérôme Martinez is the technical lead of the project. Guillaume Roques works on MediaConchOnline, database management, and performance optimization. Florent Tribouilloy develops the graphical user interface, reporting, and metadata extraction.
Here in the U.S., Dave Rice works as project manager and leads the team in optimizations for archival practice, system OAIS compliance, and format standardization. Erik Piil focuses on technical writing, creation of test files, and file analysis. Tessa Fallon leads community outreach and standards organization, mostly involving our plans to improve the standards documentation for both the Matroska and FFV1 formats through the Internet Engineering Task Force. I work on documentation, design and user experience, as well as some web development. Our roles are somewhat fluid, and often we will each contribute to tasks ranging from analyzing bitstream trace outputs to writing press releases for the latest software features.
Kate: The standardization of digital formats is a key piece in the PREFORMA puzzle as well as being something we consider when evaluating the Disclosure factor in the Sustainability of Digital Formats website. What’s behind the decision to pursue standardization through the Internet Engineering Task Force instead of an organization like the Society of Motion Picture and Television Engineers? What’s the process like and where are you now in the sequence of events? From the PREFORMA perspective, what’s to be gained through standardization?
Dave: A central aspect of the PREFORMA project is to create a conformance checker that would be able to process files and report on the extent to which they deviate from or conform to their associated specification. Early in the development of our proposal for Matroska and FFV1, we realized that the state of the specification compromised how effectively and precisely we could create a conformance checker. Additionally, as we interviewed many archives that were using FFV1 and/or Matroska for preservation, we found that the state of the standardization of these formats was the most shared concern. This research led us to include efforts towards facilitating the further standardization of both FFV1 and Matroska through an open standards body into our proposal. After reaching agreement from the FFmpeg and Matroska communities, we developed a standardization plan (PDF), which was included in our overall proposal.
As several standards organizations were considered, it was important to gain feedback on the process from several stakeholder communities. These discussions informed our decision to approach the IETF, which appeared the most appropriate for the project needs as well as the needs of our communities. The PREFORMA project is designed with significant emphasis and mandate on an open source approach, including not only the licensing requirements of the results, but also a working environment that promotes disclosure, transparency, participation, and oversight. The IETF subscribes to these same ideals; the standards documents are freely and easily available without restrictive licensing and much of the procedure behind the standardization is open to research and review.
The IETF also strives to promote involvement and participation; their recent conferences include IRC channels, audio and video streams for each meeting, and an assigned IRC channel representative to facilitate communication between the room and virtual attendees. In addition to these attributes, the format communities involved (Matroska, FFmpeg, and libav) were already familiar with the IETF from earlier and ongoing efforts to standardize open audiovisual formats such as Opus and Daala. Through an early discovery process we gathered the requirements and qualities needed in a successful standardization process for Matroska and FFV1 from memory institutions, format authors, format implementation communities, and related technical communities. From here we assessed standards bodies according to traits such as disclosure, transparency, open participation, and freedom in licensing, confirming that IETF is the most appropriate venue for standardizing Matroska and FFV1 for preservation use.
At this stage of the process we have presented our proposal for the standardization of Matroska and FFV1 at the July 2015 IETF93 conference. After soliciting additional input and feedback from IETF members and the development communities, we have a proposed working group charter under consideration that encompasses FFV1, Matroska, and FLAC. If accepted, this will provide a venue for the ongoing standardization work on these formats towards the specific goals of the charter.
I should point out that other PREFORMA projects are involved in standardization efforts as well. The Easy Innova team are working on furthering TIFF standardization in their TIFF/A initiative.
Kate: Let’s talk about two formats of interest for this project, FFV1 and Matroska. What are some of the unique features of these formats that make them viable for preservation use and for the goals of PREFORMA?
Dave: FFV1 is a very efficient lossless video codec from the FFmpeg project that is designed in a manner responsive to the requirements of digital preservation. A number of archivists participated in and reviewed efforts to design, standardize, and test FFV1 version 3. The new features in FFV1 version 3 included more self-descriptive properties to store its own information regarding field dominance, aspect ratio, and colorspace so that it is not reliant on a container format to store this information. Other codecs that rely heavily on their containers for technical description often face interoperability challenges. FFV1 version 3 also facilitates storage of cyclic redundancy checks in frame headers to allow verification of the encoded data and stores error status messages. FFV1 version 3 is also a very flexible codec allowing adjustments to the encoding process based on different priorities such as size efficiency, data resilience, or encoding speed. Over the past year or two, FFV1 may be seen as having reached a tipping point for preservation use. Its speed, accessibility, and digital preservation features make it an increasingly attractive option for lossless video encoding that can be found in more and more large scale projects; the standardization of FFV1 through an open standards organization certainly plays a significant role in the consideration of FFV1 as a preservation option.
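To make the encoding trade-offs concrete, here is a minimal Python sketch that assembles an FFmpeg command line for an FFV1 version 3 encode. It is an illustration rather than a recommended workflow: the file names are hypothetical, the helper name ffv1_command is invented for this example, and the option names (-level, -g, -slices, -slicecrc) reflect FFmpeg's ffv1 encoder as commonly documented, so they should be checked against the FFmpeg build in use.

```python
import shlex


def ffv1_command(source, target, slices=16, slice_crc=True, gop_size=1):
    """Build (but do not run) an FFmpeg argument list for an FFV1 v3 encode.

    Assumptions for illustration: more slices favour data resilience and
    multithreaded speed at a small cost in size; slice CRCs add per-slice
    fixity; a GOP size of 1 makes every frame independently decodable.
    """
    return [
        "ffmpeg", "-i", source,
        "-c:v", "ffv1",          # select the FFV1 video encoder
        "-level", "3",           # request FFV1 version 3
        "-g", str(gop_size),     # GOP size
        "-slices", str(slices),  # number of slices per frame
        "-slicecrc", "1" if slice_crc else "0",  # CRC per slice
        "-c:a", "copy",          # pass the audio stream through unchanged
        target,
    ]


# Hypothetical file names; print the command so it can be reviewed first.
command = ffv1_command("input.mov", "output.mkv")
print(shlex.join(command))
```

Building the argument list separately from running it keeps the chosen settings reviewable and easy to record alongside the resulting file.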
Matroska is an open-licensed audiovisual container format with extensive and flexible features and an active user community. The format is supported by a set of core utilities for manipulating and assessing Matroska files, such as mkvtoolnix and mkvalidator. Matroska is based on EBML, Extensible Binary Meta Language. An EBML file is comprised of one or many defined “Elements”. Each element is comprised of an identifier, a value that notes the size of the element’s data payload, and the data payload itself. Matroska integrates a flexible and semantically comprehensive hierarchical metadata structure as well as digital preservation features such as the ability to provide CRC checksums internally per selected elements. Because of its ability to use internal, regional CRC protection it is possible to update a Matroska file to log OAIS events without any compromise to the fixity of its audiovisual payload. Standardization efforts are currently renewed with an initial focus on Matroska’s underlying EBML format. For those who would like to participate I’d recommend contributing to the EBML specification GitHub repository or joining the matroska-devel mailing list.
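The identifier, size, and payload layout described above can be sketched in a few lines of Python. This is a simplified reading of EBML for illustration only, not MediaConch's parser: it handles variable-length integers of up to eight bytes, does not treat the reserved "unknown size" value specially, and the function names are invented for the example. The sample bytes are a hand-built EBML header containing only an EBMLVersion and a DocType element.

```python
import io


def read_vint(stream, keep_marker):
    """Read one EBML variable-length integer from a binary stream."""
    first = stream.read(1)
    if not first:
        raise EOFError("no more data")
    lead = first[0]
    if lead == 0:
        raise ValueError("integers longer than 8 bytes are not handled here")
    # The position of the first set bit in the lead byte gives the length.
    length = 8 - lead.bit_length() + 1
    rest = stream.read(length - 1)
    if len(rest) != length - 1:
        raise EOFError("truncated variable-length integer")
    # Element IDs keep the length marker bit; data sizes drop it.
    value = lead if keep_marker else lead & ((1 << (8 - length)) - 1)
    for byte in rest:
        value = (value << 8) | byte
    return value


def read_element(stream):
    """Return (element_id, payload_size, payload) for the next element."""
    element_id = read_vint(stream, keep_marker=True)
    size = read_vint(stream, keep_marker=False)
    payload = stream.read(size)
    if len(payload) != size:
        raise EOFError("truncated element payload")
    return element_id, size, payload


def iter_elements(data):
    """Yield every element stored back to back in a bytes object."""
    stream = io.BytesIO(data)
    while stream.tell() < len(data):
        yield read_element(stream)


# Hand-built sample: EBML header (ID 0x1A45DFA3) holding
# EBMLVersion (ID 0x4286) = 1 and DocType (ID 0x4282) = "matroska".
sample = bytes.fromhex("1a45dfa38f" "42868101" "428288") + b"matroska"

header_id, header_size, header_payload = read_element(io.BytesIO(sample))
print(hex(header_id), header_size)
for child_id, child_size, child_payload in iter_elements(header_payload):
    print(hex(child_id), child_size, child_payload)
```

Because every element announces its own size, a reader can skip elements it does not understand, which is part of what makes the hierarchical metadata structure extensible.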
Ashley: Matroska is especially appealing to me as a former cataloger and someone who has migrated data between metadata management systems because of its inherent ability to store a large breadth of descriptive metadata within the file itself. Archivists can integrate content descriptions directly into files. In the event of a metadata management software sunsetting or potential loss occurring during the file’s lifetime of duplication and migration, the file itself can still harbor all the necessary intellectual details required to understand the content.
It’s great to have those self-checking mechanisms in place to set and verify fixity inherently built into a file format’s infrastructure instead of requiring an archivist to do supplemental work on top by storing technical requirements, checksums, and descriptive metadata alongside a file for preservation purposes. By using Matroska and FFV1 together, an archivist can get full coverage of every aspect of the file. And if fixity fails, the point where that failure occurs can be easily pinpointed. This level of precision is ideal for preservation and as harbinger for archivists in the future. Since error warnings can be frame/slice-level specific, assessing problems becomes much easier. It’s like being able to use a microscope to analyze a record instead of being limited to plain eyesight. It avoids the problem of “I have a file, it’s not validating against a checksum that represents the entirety of a file, and it’s a 2 hour long video. Where do I begin in diagnosing this problem?”
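The difference between a single whole-file checksum and many regional ones can be shown with a short Python sketch. It is only an analogy under stated assumptions: it splits a byte string into fixed-size regions and stores a CRC-32 for each, whereas FFV1 and Matroska attach their CRCs to slices and elements defined by the formats themselves. The function names and the region size are invented for the example.

```python
import zlib


def region_checksums(data, region_size):
    """Compute one CRC-32 per fixed-size region of the data."""
    return [
        zlib.crc32(data[start:start + region_size])
        for start in range(0, len(data), region_size)
    ]


def damaged_regions(data, expected, region_size):
    """Return the indexes of regions whose CRC no longer matches."""
    actual = region_checksums(data, region_size)
    return [
        index
        for index, (now, before) in enumerate(zip(actual, expected))
        if now != before
    ]


REGION_SIZE = 512

# A stand-in for a media file: 4096 bytes, so eight regions of 512 bytes.
original = bytes(range(256)) * 16
expected = region_checksums(original, REGION_SIZE)

# Simulate bit rot by flipping one byte inside the third region.
corrupted = bytearray(original)
corrupted[1300] ^= 0xFF
corrupted = bytes(corrupted)

# A whole-file checksum only says that something changed somewhere.
print(zlib.crc32(original) != zlib.crc32(corrupted))

# Regional checksums say where to look.
print(damaged_regions(corrupted, expected, REGION_SIZE))
```

With the whole-file checksum alone, the two-hour video in the example above would have to be inspected from start to finish; with regional checksums, only the flagged region needs attention.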
Kate: What communities are currently using them? Would it be fair to say that FFV1 and Matroska are still emerging formats in terms of adoption in the US?
Ashley: Indiana University has embarked upon a project to digitally preserve all of its significant audio and video recordings in the next four years. Mike Casey, director of technical operations for the Media Preservation Initiative project, confirmed in a personal email that “after careful examination of the available options for video digitization formats, we have selected FFV1 in combination with Matroska for our video preservation master files.”
Dave: The Wikipedia page for FFV1 has an initial list of institutions using or considering FFV1. Naturally, users do not need to announce publicly that they use it, but there’s been an increase in messages to related community forums.
Kate: Do you expect that the IETF standardization process will likely help increase adoption?
Ashley: I think a lot of people are unsure of these formats because they aren’t currently backed by a standards body. Matroska has been around for a long time and is a sturdy open source format. Open source software can have great community support but getting institutional support isn’t usually a priority. We have been investing time into clarifying the Matroska technical specifications in anticipation of a future release.
The harder case to be made regarding adoption in libraries and archives is with FFV1, as this codec is relatively new, less familiar, and has yet to be fully standardized. Access to creating FFV1 encoded files is limited to people with a lot of technical knowledge.
Kate: One of my favorite parts of my job is playing format detective in which I use a set of specialized tools to determine what the file is – the file extension isn’t always a reliable or specific enough marker – and if the file has been produced according to the specifications of a standard file format. But the digital preservation community needs more flexible and more accurate format identification and conformance toolsets. How will PREFORMA contribute to the toolset canon?
Ashley: The initial development with MediaConch began with creating an extension of MediaInfo, which is already heavily integrated into many institutions in the public and private sectors as a microservice to gather information about media files. The MediaConch software will go beyond just providing useful information about the file and help ensure that the file is what it says it is and can continually be checked through routine services to ensure the file’s integrity far into the future.
A major goal for PREFORMA is the extensibility of the software being developed — working across all computer platforms, working to check files at the item level or in batches, and cross-comparability between the different formats. We collaborate with Easy Innova and veraPDF to discover and implement compatible methods of file checking. The intent is to avoid creating a tool that exists within a silo. Even though we are three teams working on different formats, we can, in the end, be compatible through API endpoints, not just for the three funded teams but to other specialized tools or archival management programs like Archivematica. Keeping the software open source for future accessibility and development is not optional — it’s required by the PREFORMA tender.
Dave: Determining if a file has been produced according to the specifications of a standard file format is a central issue to PREFORMA and unfortunately there are not nearly enough tools to do so. I credit Matroska for developing a utility, mkvalidator, alongside the development of their format specifications, but to have this type of conformance utility accompany the specification is unfortunately a rarity.
Our current role in the PREFORMA project is fairly specific to certain formats but there are some components of the project which contribute to file format investigation. Already we have released a new technical metadata report, MediaTrace, which may be generated via MediaInfo or MediaConch. The MediaTrace report will help with advanced ‘format detective’ investigations as it presents the entire structure of an audiovisual file in an orderly way. The report may be used directly, but within our PREFORMA project it plays a crucial role in supporting conformance checks of Matroska. MediaConch is additionally able to display the structure of Matroska files and will eventually allow metadata fixes and repairs to both Matroska and FFV1.
MediaArea seeks input and feedback on the standard, specifications and future of each format for future development of the preservation-standard conformance checker software. If you work with these formats and are interested in contributing your requirements and/or test files, please contact us at [email protected].
Comments (5)
I didn’t see the discussion about why any form of compression, FFV1 or whatever, should be chosen. Also I didn’t see the discussion about why Matroska would fare any better than MXF in terms of guaranteed interchangeability/validity for all implementations. Common formats aren’t as ironclad as people may think. There is data that a large percentage of PDFs don’t conform in some way, and even data that ‘most web pages are illegal’. [http://blog.dshr.org/2009/01/postels-law.html] So how do we know that all Matroska files, made by all Matroska implementations, will actually be valid?
mkvalidator came later in the Matroska lifetime. It arrived at about the same time as WebM, which had stricter rules and fewer elements. Players were refusing invalid files, so a tool was needed to find out why some files would not play.
Historically there haven’t been many muxers available though. Up to that date there were mostly four: mkvmerge, ffmpeg, gstreamer and the DivX one. So fewer formal checks were needed. Now the situation is a bit different. And there might be some invalid files produced.
But at least we can tell when they are.
Steve Lhomme
Hello,
This is a response from PACKED. We were the work package leader in PREFORMA, responsible for organizing the pre-commercial procurement procedure. So we have been responsible for writing the tender documents and identifying the file formats to be addressed by the PREFORMA conformance checkers.
The documents we produced are available at the preforma website: http://www.preforma-project.eu/call-documents.html
• About formats
You rightly refer to the problem that formats aren’t ironclad. That is exactly what PREFORMA is about: providing keepers and producers of digital documents with a tool to test if a file adheres to the standard specification of its file format, so that we will be able in the future to build the right applications to access it.
• About the chosen formats
PREFORMA was an opportunity to take a stance and advance the adoption of file formats that primarily fit long-term preservation. In particular for AV, this was an opportunity to put emphasis on long term challenges instead of short term implementation challenges.
For PREFORMA, fitness for long-term preservation depended on four criteria:
#1. A format that can capture content in an uncompressed or mathematically lossless encoding, and retain as many original properties as possible (e.g. bit-depth, resolution, …) because this is a requirement for implementing an effective preservation strategy. This meant ruling out all well-adopted lossy codecs (e.g. D10).
We did not take a position in the uncompressed/lossless discussion, but based on our experience in digitisation projects we considered advantages of lossless compression for storage cost and self-description. Advocating for lossless encodings such as FFV1 or JPEG2K is easier than advocating for uncompressed when you talk to collection managers and directors or even IT managers. While we believe that storage cost is likely a short-term concern, for now, we still feel a clear financial concern on the part of decision-makers. Uncompressed would complicate adoption by a lot of collections where storage costs often represent a big argument to choose a lossy format. Beyond storage, lossless compression gives advantages for self-description and fixity (frame checksums, error concealment, etc). Among other features, FFV1 version 3 offers integrity validation by CRC checksums, which we feel is very good for preservation.
#2. A format that is Free/Libre (cf EIFv1)
This was a key issue for the PREFORMA project. Software developed within the scope of the project must be licensed under a specific copyleft license (i.e. “GPLv3 or later and MPLv2 or later”) to ensure that the tools for checking archive files will be available over very long periods. This requirement ruled out the MXF wrapper, since the current licensing of the required MXF specifications does not allow for implementing a validator under the given license. Moreover the specifications for MXF are behind a paywall, which limits their collaborative use.
#3. A format that is well documented
The questions we asked ourselves were: is there a specification available? Is it accessible and maintained? Otherwise, you can’t write an implementation and create false or correct test files.
#4. A format that has been adopted
Some formats have very promising specifications, but we have been looking for a degree of adoption that proved that the file format is viable. A format that is usable in real life situations.
• What did these criteria lead to:
None of the standard specifications of text, image or audiovisual material matched all four criteria. After discussion in the project board about the preferred way to make a compromise, the project board decided to insist on the first two criteria and see how the shortcomings in the two remaining criteria could be solved within the scope of the project. Concretely: To what extent can a lack of documentation be solved in the project? To what extent can the project demonstrate the viability of the format?
About the AV formats:
The result of this exercise for audiovisual was a set of two containers (MKV and Ogg) and three video/image codecs (JPEG2000, FFV1 and Dirac). Tenderers have been invited to make an R&D project proposal for a combination of a container and codec from this list.
Please note that the procurement procedure had two steps before the actual development could start. Proposals have been evaluated first on the criteria of the Invitation to Tender. PREFORMA invited two tenderers to make a design. The two designs have been evaluated based on how they dealt with the shortcomings of the file formats.
• As for MXF vs MKV:
Cost for acquiring necessary specifications and access to the standardization process of MXF was too high for an open source project, in particular in comparison with MKV. Moreover, PREFORMA could not tender for MXF because of Swedish Law that obliges public services to tender for open formats as defined in the EIFv1.
PREFORMA considered that working on the documentation and standardisation during the course of the project would be more feasible with the MKV community than with SMPTE for MXF. Given that there was already a validator available for MKV, the project expected that the investment in this container would pay off more.
We chose standards where we could imagine that the existing shortcomings could be solved within the scope of the project. We evaluated the tenders in two stages. So we have chosen the project that offered the best strategy for solving the existing shortcomings.
Even if we would like to switch from MKV to MXF now, MXF does not permit wrapping FFV1 essence. From what we’ve been told it would be possible, but it would require some money to pay for a codec id registration.
Hope that it answers your questions.
Besides being future-proof in terms of containable audio and video formats, Matroska (MKV) is also far more useful than competing containers such as MXF, MP4 and MOV when it comes to metadata support.
The problem with MOV and MP4 is the fixed number of elements (atoms) and the lack of MPEG7 metadata support in existing software to support the types of descriptive metadata required by cultural heritage institutions.
As for MXF, its (also fixed) metadata structure is very extensive but highly centered around the needs of broadcasting companies and much less relevant for other cultural heritage institutions… And as already mentioned: Less accessible because of the paywall.
In comparison, Matroska is technically far more agnostic – something that has ensured the format’s growing popularity since 2002. I regard Microsoft’s recent addition of native MKV and FLAC support (albeit with somewhat insufficient format support) in Windows 10 as an acknowledgement of that fact.
I can recommend some clean & free software options, if you want to “play” with these different containers:
– Xmedia Recode (add metadata & chapters to various containers without transcoding) – http://www.xmedia-recode.de/en/download.html
– Hybrid (More advanced & with HEVC support) – http://www.selur.de/downloads
FADGI has published the AS-07 MXF application specification for Archiving and Preservation, which encourages the use of the Generic Stream Partition (GSP) for additional metadata. The GSP itself will have metadata objects to describe the content. The AS-07 specification document is available at no cost from the Advanced Media Workflow Association (AMWA) website but the underlying SMPTE specifications are behind the SMPTE pay wall.