The following is a guest post by Jim Mangiafico. Jim is the winner of our Legislative Data Challenges and has been working with the National Archives of the United Kingdom, our partner for the second challenge, to further the work he began during our challenges. He has graciously agreed to provide an update on his exciting progress with Akoma Ntoso and Legislative XML.
It has been a year since the Library’s Legislative Data Challenges, and we have learned much from the comparative study of legal markup. The Data Challenges asked participants to develop tools to translate legislative documents from their native XML formats into Akoma Ntoso, a newer XML schema currently in the process of standardization by the Organization for the Advancement of Structured Information Standards (OASIS). In the past year, I have continued the work begun during the Challenges, writing code for the National Archives of the United Kingdom to generate Akoma Ntoso versions of the laws available at legislation.gov.uk. An Application Program Interface (API) for UK legislation in Akoma Ntoso will soon be made public. In the process, we had to confront fundamental design decisions about the structure of legislative markup, and we developed some new tools that we hope will improve access to legislation.
The biggest difficulty I encountered when translating UK legislation into Akoma Ntoso stems from the differing paragraph models in the two XML formats. The native XML schema governing UK legislation, called the Crown Legislative Markup Language (CLML), follows what it calls a “true” paragraph model, according to which all elements associated with a paragraph are represented as children of the paragraph element. (This is in contrast to, say, HTML, in which lists and other elements are frequently represented as siblings of the <p> elements with which readers naturally associate them.) Consequently, it is possible in CLML to have a section of an act with multiple paragraphs of text, only one of which is grouped with the section’s subsections. For example, the following pattern is not uncommon in CLML:
  <P1>
    <Pnumber>1</Pnumber>
    <P1para>
      <Text>some text</Text>
    </P1para>
    <P1para>
      <Text>some more text</Text>
      <P2></P2>
      <P2></P2>
    </P1para>
  </P1>
Here <P1> denotes a section and <P2> a subsection.
Markup such as this is not easily translated into Akoma Ntoso, which does not contemplate an association between a subsection and any one textual component of its parent section. Akoma Ntoso permits introductory paragraphs before a section’s first subsection and concluding paragraphs after its last, but all subsections must be direct children of their parent section, and there can be nothing between them that is not their sibling. Consequently, we have chosen to translate CLML like the above as follows:
  <section>
    <num>1</num>
    <intro>
      <p>some text</p>
      <p>some more text</p>
    </intro>
    <subsection></subsection>
    <subsection></subsection>
  </section>
As you can see, the semantics of these two fragments differ: the association between the second paragraph of text and the subsections has been lost. We take some comfort in the fact that both will likely be displayed identically to readers, but the extent to which legislative markup benefits from the ability to group subsections within a section remains an open question for us.
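The flattening described above can be illustrated with a short Python sketch. It is not the conversion code used for legislation.gov.uk; it only handles the simplified fragment shown earlier, ignores XML namespaces, and assumes that every text paragraph precedes the subsections, so that all text can go into the <intro>. The function name clml_section_to_akn is hypothetical.

```python
import xml.etree.ElementTree as ET

def clml_section_to_akn(p1):
    """Convert a simplified CLML <P1> into an Akoma Ntoso-style <section>.

    Assumption: all <Text> paragraphs come before the <P2> subsections,
    so every paragraph can be placed in a single <intro>.
    """
    section = ET.Element("section")
    number = p1.find("Pnumber")
    if number is not None:
        ET.SubElement(section, "num").text = number.text

    paragraphs = []   # text of every <Text>, regardless of its <P1para>
    subsections = []  # every <P2>, regardless of its <P1para>
    for para in p1.findall("P1para"):
        for child in para:
            if child.tag == "Text":
                paragraphs.append(child.text)
            elif child.tag == "P2":
                subsections.append(child)

    if paragraphs:
        intro = ET.SubElement(section, "intro")
        for text in paragraphs:
            ET.SubElement(intro, "p").text = text

    # Subsections become direct children of the section. Their content
    # is left empty here, as in the fragment above.
    for _ in subsections:
        ET.SubElement(section, "subsection")
    return section

CLML = (
    "<P1><Pnumber>1</Pnumber>"
    "<P1para><Text>some text</Text></P1para>"
    "<P1para><Text>some more text</Text><P2></P2><P2></P2></P1para>"
    "</P1>"
)
akn = clml_section_to_akn(ET.fromstring(CLML))
print(ET.tostring(akn, encoding="unicode"))
```

Note that the grouping by <P1para> is discarded in the loop: once the paragraphs and subsections are collected into two flat lists, nothing records which paragraph the subsections belonged to.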
Another challenge we faced relates to the need in the UK to specify the territorial applicability of individual parts of legislation. Legislation in the United Kingdom often contains alternative versions of individual sections, each with a geographical restriction. For example, an act may have two versions of Section 1, the first applying to England and Wales and the second to Scotland. CLML has a dedicated attribute for such cases. Akoma Ntoso allows authors to define jurisdictional restrictions in the metadata and to link them to sections of the document body, but to my mind this mechanism is not as elegant as Akoma Ntoso’s vocabulary for capturing temporal restrictions.
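To make the idea of alternative versions concrete, here is a small Python sketch that selects the sections applying to one territory. The markup is neither CLML nor Akoma Ntoso: the attribute name "extent", the codes "E+W" and "S", and the function sections_for are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative markup only. "extent" is a hypothetical attribute name,
# and "E+W" / "S" are made-up territory codes joined by "+".
DOC = """<act>
  <section num="1" extent="E+W"><p>version for England and Wales</p></section>
  <section num="1" extent="S"><p>version for Scotland</p></section>
  <section num="2"><p>no geographical restriction</p></section>
</act>"""

def sections_for(root, territory):
    """Return the sections that apply in the given territory.

    A section with no extent attribute is treated as applying everywhere.
    """
    selected = []
    for section in root.findall("section"):
        extent = section.get("extent")
        if extent is None or territory in extent.split("+"):
            selected.append(section)
    return selected

root = ET.fromstring(DOC)
scottish = sections_for(root, "S")
for section in scottish:
    print(section.get("num"), section.find("p").text)
```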
On the whole, however, we have grown quite fond of the simplicity of the Akoma Ntoso data model, and we have borrowed ideas from it for other projects. For example, The National Archives is very interested in supporting HTML5. We have been experimenting with a near one-to-one serialization of Akoma Ntoso in HTML5 and have produced HTML5 versions of all legislation available on legislation.gov.uk. The goal has been to follow the structure of Akoma Ntoso as closely as possible, while using all of the native semantics of HTML5. The core nodes of the document tree (parts, chapters, sections and other high-level “hierarchical containers” in Akoma Ntoso) are represented as nested HTML <section> elements, allowing the document outline to be parsed faithfully by HTML5 validators. We had a lively debate about the best use of HTML’s <section> element in legislative documents, ultimately deciding not to use it to represent hierarchical levels beneath the subsection, such as clauses. Also, we mirror the rich Akoma Ntoso metadata structure with native HTML elements using RDFa Lite.
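The mapping rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the serialization actually used: the set SECTION_LEVELS, the use of a class attribute to carry the Akoma Ntoso element name, and the fallback to <div> for everything else are all choices made for this example, and the RDFa Lite metadata is left out.

```python
import xml.etree.ElementTree as ET

# Assumed list of hierarchical containers that become HTML <section>
# elements. Levels beneath the subsection are deliberately excluded.
SECTION_LEVELS = {"part", "chapter", "section", "subsection"}

def akn_to_html(node):
    """Recursively map an Akoma Ntoso-style element to an HTML element."""
    if node.tag in SECTION_LEVELS:
        html = ET.Element("section", {"class": node.tag})
    elif node.tag == "p":
        html = ET.Element("p")
    else:
        # Assumption: anything else (num, clause, ...) becomes a <div>
        # whose class records the original element name.
        html = ET.Element("div", {"class": node.tag})
    html.text = node.text
    html.tail = node.tail
    for child in node:
        html.append(akn_to_html(child))
    return html

AKN = (
    "<part><chapter><section><num>1</num>"
    "<subsection><clause><p>some text</p></clause></subsection>"
    "</section></chapter></part>"
)
html = akn_to_html(ET.fromstring(AKN))
print(ET.tostring(html, encoding="unicode"))
```

Because the tree shape is preserved one-to-one, the nesting of HTML <section> elements follows the nesting of the hierarchical containers, while the clause ends up as a <div> inside the innermost <section>.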
Lastly, in the course of developing testing procedures for our document conversions, we began thinking about ways to count elements within legislation and the relationships between them. Now, as part of The National Archives’ Big Data for Law project, we will conduct a “census” of the UK statute book and release data about the frequencies of structural patterns within legislative documents and the changes in those frequencies over time. We’re also using natural language processing to trace changes in statutory language. Look for this soon on legislation.gov.uk.
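As a rough idea of what counting elements and their relationships might look like, the following Python sketch tallies element names and parent-child pairs in a single document. It is only a toy version of such a census: the function name census is hypothetical, and the sample is the small Akoma Ntoso fragment shown earlier rather than real legislation.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def census(root):
    """Count element names and (parent, child) pairs in one document."""
    elements = Counter()
    pairs = Counter()
    for parent in root.iter():  # visits root and every descendant
        elements[parent.tag] += 1
        for child in parent:
            pairs[(parent.tag, child.tag)] += 1
    return elements, pairs

SAMPLE = (
    "<section><num>1</num>"
    "<intro><p>some text</p><p>some more text</p></intro>"
    "<subsection></subsection><subsection></subsection>"
    "</section>"
)
elements, pairs = census(ET.fromstring(SAMPLE))
for (parent, child), count in sorted(pairs.items()):
    print(parent, ">", child, count)
```

Summing such counters across many documents, grouped by year, would give the kind of frequency-over-time data described above.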
I would like to thank John Sheridan, Head of Legislation Services at The National Archives, for giving me the opportunity to do the kind of work I find so rewarding. I hope it proves to be useful.