FADGI’s embARC: Extending embedded metadata support and validation for DPX and MXF files

Today’s guest post is from Kate Murray, Digital Projects Coordinator in Digital Collections Management and Services at the Library of Congress and Bertram Lyons, Partner and Managing Director for Software at AVP.

Note: This is the last in a series of updates from the Federal Agencies Digital Guidelines Initiative (FADGI) Audio-Visual working group. For the previous installments, see “That’s Our Cue! Updates for the FADGI Embedded Metadata Guidelines and BWF MetaEdit for the Cue Chunk in Broadcast Wave Files” and “Reading the (Same) Signals: Using FADGI’s ADCTest for Quality Control in Outsourced Audio Digitization.”


Fig. 1: embARC is a free, open source software application that enables users to audit and correct embedded metadata to comply with FADGI guidelines for DPX and MXF files.

embARC, short for “metadata embedded for archival content,” is a free, open source software application that enables users to audit and correct embedded metadata to comply with FADGI guidelines for DPX (Guidelines for Embedded Metadata within DPX File Headers for Digitized Motion Picture Film) and MXF (SMPTE RDD 48: MXF Archive and Preservation Format) files (figure 1).

DPX, short for Digital Picture Exchange, is a pixel-based (raster) file format intended for very high quality moving image content with attributes defined in a binary file header. MXF, short for Material Exchange Format, is an object-based file format that wraps video, audio, and other bitstreams (“essences”), optimized for content interchange or archiving by creators and/or distributors, and intended for implementation in devices ranging from cameras and video recorders to computer systems.

embARC was first released in 2019 and is developed with support from the Library of Congress and FADGI (Federal Agencies Digital Guidelines Initiative) and in collaboration with AVP and PortalMedia.

Recent development in 2020-2021 has expanded the scope of embARC to meet the evolving user needs and workflows of the audiovisual preservation community. This most recent release is an important milestone, introducing the first beta release of the CLI as well as the first official release of the GUI, which now includes functionality for the MXF file format.

New CLI beta version released

While developing the CLI (command line interface) version was already in the project work plan, discussions with local and international colleagues seeking to integrate embARC’s robust functionality to fill gaps in digital preservation workflows caused us to prioritize this work and bump it up in the timeline. Give the people what they want, FADGI says!

The CLI version lets users include embARC services in automated workflows alongside other applications, without any user interface interaction, so that files can complete a specific task and move on to the next step more easily. For DPX, embARC can process individual files or an entire sequence without altering the image data. For MXF files, which have much more complex metadata than DPX, embARC supports single file analysis. For this blog post, we’ll look at the CLI functionality for DPX files, but a more complete explanation, including MXF files, is available in the embARC CLI User Guide.

embARC CLI users can read and extract metadata from selected files. The output begins with a summary of the total files processed, the number of those files that were DPX, and any non-DPX files found. This summary is based on triage format detection, so users get feedback if a non-DPX file is passed as an argument instead of a DPX file.

Fig. 2: Output of DPX sequence import.

When users request to process a sequence of DPX files, the summary result also includes the results of nine custom tests that embARC runs to produce boolean (pass/fail) outputs. These tests look for file sequencing and duplication errors, as well as file truncation errors, and these are reported to the user.
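The nine tests are not enumerated in this post, so the following is only a minimal Python sketch of what one sequencing and duplication check could look like. It is not embARC's implementation: the function name `check_sequence` and the assumption that the frame number is the last run of digits before the `.dpx` extension are ours, for illustration only.

```python
import re
from collections import Counter

# Assumption: the frame number is the run of digits right before ".dpx".
FRAME_PATTERN = re.compile(r"(\d+)\.dpx$", re.IGNORECASE)


def check_sequence(filenames):
    """Report missing and duplicated frame numbers in a list of DPX file names."""
    frames = []
    for name in filenames:
        match = FRAME_PATTERN.search(name)
        if match:
            frames.append(int(match.group(1)))

    if not frames:
        return {"missing": [], "duplicates": [], "contiguous": False}

    counts = Counter(frames)
    # Frame numbers that appear more than once.
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    # Frame numbers absent between the lowest and highest frame found.
    missing = [n for n in range(min(frames), max(frames) + 1) if n not in counts]

    return {
        "missing": missing,
        "duplicates": duplicates,
        "contiguous": not missing and not duplicates,
    }
```

Each key in the returned dictionary maps naturally onto a pass/fail result: an empty `missing` list passes a gap test, and an empty `duplicates` list passes a duplication test.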

After the summary results, the output articulates the file metadata for the single file that was processed or a comparative metadata analysis for all files if a sequence was processed. The information is structured according to the standard SMPTE structures as defined in the DPX specification (ST 268). The data is delivered in three columns: byte offset from byte 0, field name, field value. Users can also have this data written to a target file in JSON or CSV format for use in other applications.
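To make the three-column layout concrete, here is a rough Python sketch that reads a handful of fields from a DPX file information header and returns them as (byte offset, field name, field value) rows. It is an illustration, not embARC's code: the `FIELDS` selection and the function name are ours, and only a few fields are covered. The offsets and the `SDPX`/`XPDS` magic numbers follow the file information header layout in ST 268.

```python
import struct

# (byte offset, field name, struct format) for a few file information header fields.
FIELDS = [
    (0, "Magic number", "4s"),
    (4, "Offset to image data", "I"),
    (8, "Version", "8s"),
    (16, "File size", "I"),
    (36, "Image file name", "100s"),
    (160, "Creator", "100s"),
]


def read_header_fields(data):
    """Return (offset, name, value) rows from the first bytes of a DPX file."""
    magic = bytes(data[:4])
    if magic == b"SDPX":
        order = ">"  # big-endian file
    elif magic == b"XPDS":
        order = "<"  # little-endian file
    else:
        raise ValueError("not a DPX file: unexpected magic number")

    rows = []
    for offset, name, fmt in FIELDS:
        (value,) = struct.unpack_from(order + fmt, data, offset)
        if isinstance(value, bytes):
            # Text fields are null-padded ASCII; keep the part before the first null.
            value = value.split(b"\0", 1)[0].decode("ascii", errors="replace")
        rows.append((offset, name, value))
    return rows
```

In practice the `data` argument would come from reading the start of a file, for example `open(path, "rb").read(2048)`, and each row could then be printed or written out with Python's `csv` or `json` modules.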

For sequences, the comparative analysis compares every metadata field and value across all of the files. It gives you a quick view of the values that stay the same across the sequence and flags the fields that hold more than one value, so that you can examine them more closely in the CSV or JSON output, if desired.
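The idea behind the comparison can be sketched in a few lines of Python. This is only an illustration of the technique under our own assumptions (metadata already parsed into one dictionary per file, and the hypothetical function name `compare_fields`), not a description of how embARC does it internally.

```python
def compare_fields(per_file_metadata):
    """Split fields into those with one value across all files and those that vary.

    per_file_metadata maps a file name to a dict of field name -> value.
    """
    values_by_field = {}
    for fields in per_file_metadata.values():
        for name, value in fields.items():
            values_by_field.setdefault(name, set()).add(value)

    # Fields whose value is identical in every file that carries the field.
    static = {
        name: next(iter(values))
        for name, values in values_by_field.items()
        if len(values) == 1
    }
    # Fields with more than one distinct value: candidates for closer review.
    varying = {
        name: sorted(values, key=str)
        for name, values in values_by_field.items()
        if len(values) > 1
    }
    return static, varying
```

Note that this simple version does not flag a field that is present in some files and absent in others; a fuller check would track that as well.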

Another new feature we have added for the CLI is conformance checking for one or more DPX files. With the optional conformance checking flags, embARC allows users to submit a set of rules using a conformance JSON template (see the example in figure 4). embARC will evaluate one file or a sequence of files based on the submitted rules and will provide a summary conformance report in the terminal as well as a CSV list of test results for each file and each test evaluated.

Fig. 3: DPX sequence metadata output. Fields with multiple values are flagged for further analysis.

The embARC conformance template is a JSON document with the following elements.

  • Rules – an array that contains all Rule objects to be evaluated.
  • Rule – each Rule is an object that contains a Column, Operator, and Value property.
  • Column – this property specifies the metadata field to target for the Rule. See Appendix B for a controlled list of possible column values.
  • Operator – this property specifies the evaluation operator for the Rule, for example Min, Max, Equals, or Contains. See Appendix B for a controlled list of possible operators.
  • Value – this property provides the value that will be evaluated against for the particular Column and Operator in the Rule.

Following (figure 4) is an example valid conformance rule set:

Fig. 4: Conformance rule set example for DPX.

The conformance document can contain as many Rules as needed as long as they follow the pattern consistently.
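For readers who would rather generate a rule set than write it by hand, here is a small Python sketch that builds one and checks that every Rule carries the three properties described above. Treat the details as assumptions: we assume each Rule object sits directly inside the Rules array, the helper name `build_rule_set` is ours, and the Column names in the usage note are placeholders. The CLI user guide (Appendix B) remains the authority on the exact shape, columns, and operators.

```python
import json

REQUIRED_KEYS = {"Column", "Operator", "Value"}


def build_rule_set(rules):
    """Return a JSON string with a top-level "Rules" array (assumed layout)."""
    for index, rule in enumerate(rules):
        missing = REQUIRED_KEYS - rule.keys()
        if missing:
            raise ValueError(f"rule {index} is missing: {sorted(missing)}")
    return json.dumps({"Rules": list(rules)}, indent=2)
```

A call such as `build_rule_set([{"Column": "Creator", "Operator": "Equals", "Value": "Example Lab"}])` returns text that can be saved to a file; the Column value here is a placeholder and should be swapped for one from the controlled list.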

As a result of conformance testing, embARC provides a summary result in the terminal that includes a count of files tested and files failed, as well as a list of files that failed any test. Additionally, embARC outputs a CSV file containing a row for each test carried out and the result (PASS/FAIL).
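To show how rules of this kind turn into one PASS/FAIL row per file and per test, here is a hedged Python sketch. The operator semantics are our assumptions (Min as "at least", Max as "at most", Equals and Contains as string comparisons), as are the CSV column headings, the function names, and the treatment of a missing field as a failure. embARC's actual behavior and report layout may differ.

```python
import csv
import io

# Assumed meanings for the four operators named in the post.
OPERATORS = {
    "Equals": lambda actual, expected: str(actual) == str(expected),
    "Contains": lambda actual, expected: str(expected) in str(actual),
    "Min": lambda actual, expected: float(actual) >= float(expected),
    "Max": lambda actual, expected: float(actual) <= float(expected),
}


def evaluate(files, rules):
    """files maps file name -> {field: value}; returns one result row per test."""
    rows = []
    for file_name, fields in files.items():
        for rule in rules:
            check = OPERATORS.get(rule["Operator"])
            if check is None:
                raise ValueError(f"unsupported operator: {rule['Operator']}")
            actual = fields.get(rule["Column"])
            # A field that is absent from the file fails the test in this sketch.
            passed = actual is not None and check(actual, rule["Value"])
            rows.append({
                "File": file_name,
                "Column": rule["Column"],
                "Operator": rule["Operator"],
                "Value": rule["Value"],
                "Result": "PASS" if passed else "FAIL",
            })
    return rows


def to_csv(rows):
    """Serialize result rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["File", "Column", "Operator", "Value", "Result"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The terminal summary described above then falls out of the same rows: the number of files tested is the number of distinct File values, and the failed files are those with at least one FAIL row.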

Examples for using the CLI version of embARC, including MXF workflows, are available in Appendix C of the CLI user guide.

embARC GUI version now includes MXF!

The second major update is the inclusion of the MXF format in the embARC GUI (graphical user interface) version which, along with existing DPX functionality, now also supports the FADGI-sponsored SMPTE RDD 48: MXF Archive and Preservation Format guidelines. Adding MXF functionality was a big ask, as MXF is a data-rich and metadata-complex file format.

With this new expanded support in embARC, the user interface has changed so that users are now presented with a splash page to load files upon launching the application. Because embARC’s UI (user interface) supports DPX and MXF in different ways, the splash page gives the application an opportunity to identify the desired file format from the user before launching the full UI. Users can now load DPX or MXF (but not both at the same time!) and the system will select the appropriate UI to show.

In the new MXF UI, embARC supports reading of the following MXF file structures: track descriptors; AS 07 Core Descriptive Metadata Scheme (DMS); Text Data Generic Stream Partitions (GSP); and Binary Data GSP present in the MXF file. Clicking on different tabs will display the fields present in that file section. Note: In-depth explanations for these terms are available in SMPTE RDD 48: MXF Archive and Preservation Format in section 4, Acronyms and Terms.

Fig. 5: MXF descriptor data.

embARC reads a variety of track “descriptors” or metadata about picture essences, audio essences, and data essences. The Descriptors section of the embARC GUI user guide lists all the currently supported descriptors. These descriptors are “read only” in the UI and cannot be edited.

embARC supports creating or editing a single AS 07 Core DMS for any supported MXF file. See Annex D of RDD 48 for more information about the Core Descriptive Metadata Scheme. If no AS 07 Core DMS is present, embARC will allow a user to embed one. If one is already present, embARC will read the existing data and allow a user to edit it, delete it, or add to it.

Additionally, embARC supports reading and downloading stored text-data GSPs (such as XML-based supplementary metadata) and binary-data GSPs (such as still images) in supported MXF files.

These are newly available open-source features for working with MXF files that are not easily found in other tools. We are excited to share them with the community and look forward to continued input and collaboration as we move forward. Comments are always welcome!

Where to find embARC

embARC is available for download now in both Windows and Mac GUI versions from the FADGI website along with all user guides. The embARC beta CLI is available on GitHub.
