The following is a guest post by Jim Mangiafico. Jim is the winner of our Legislative Data Challenges and has been working with the National Archives of the United Kingdom, our partner for the second challenge, to further the work he began during the challenges. He has graciously agreed to provide an update on his exciting progress with Akoma Ntoso and Legislative XML.
It has been a year since the Library’s Legislative Data Challenges, and we have learned much from the comparative study of legal markup. The Data Challenges asked participants to develop tools to translate legislative documents from their native XML formats into Akoma Ntoso, a newer XML schema currently in the process of standardization by the Organization for the Advancement of Structured Information Standards (OASIS). In the past year, I have continued the work begun during the Challenges, writing code for the National Archives of the United Kingdom to generate Akoma Ntoso versions of the laws available at legislation.gov.uk. An Application Programming Interface (API) for UK legislation in Akoma Ntoso will soon be made public. In the process, we had to confront fundamental design decisions about the structure of legislative markup, and we developed some new tools that we hope will improve access to legislation.
The biggest difficulty I encountered when translating UK legislation into Akoma Ntoso stems from the differing paragraph models in the two XML formats. The native XML schema governing UK legislation, called the Crown Legislative Markup Language (CLML), follows what it calls a “true” paragraph model, according to which all elements associated with a paragraph are represented as children of the paragraph element. (This is in contrast to, say, HTML, in which lists and other elements are frequently represented as siblings of the <p> elements with which readers naturally associate them.) Consequently, it is possible in CLML to have a section of an act with multiple paragraphs of text, only one of which is grouped with the section’s subsections. For example, the following pattern, in which <P1> denotes a section, is not uncommon in CLML:
<P1>
  <Pnumber>1</Pnumber>
  <P1para>
    <Text>some text</Text>
  </P1para>
  <P1para>
    <Text>some more text</Text>
    <P2></P2>
    <P2></P2>
  </P1para>
</P1>
Markup such as this is not easily translated into Akoma Ntoso, which does not contemplate an association between a subsection and any one textual component of its parent section. Akoma Ntoso permits introductory paragraphs before a section’s first subsection and concluding paragraphs after its last, but all subsections must be direct children of their parent section, and there can be nothing between them that is not their sibling. Consequently, we have chosen to translate CLML like the above as follows:
<section>
  <num>1</num>
  <intro>
    <p>some text</p>
    <p>some more text</p>
  </intro>
  <subsection></subsection>
  <subsection></subsection>
</section>
As you can see, the two fragments differ semantically: the association between the second paragraph of text and the subsections has been lost. We take some comfort in the fact that both will likely be displayed identically to readers, but the extent to which legislative markup benefits from the ability to group subsections within a section remains for us an open question.
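To make the flattening concrete, here is a minimal Python sketch of the translation described above. It is an illustration only, not the code used for legislation.gov.uk: the function name `clml_section_to_akn` is hypothetical, the sample omits the XML namespaces that real CLML and Akoma Ntoso documents carry, and it assumes every <Text> precedes the subsections, so all text lands in <intro>.

```python
import xml.etree.ElementTree as ET

# The CLML pattern from the post, without namespaces (a simplification).
CLML = """
<P1>
  <Pnumber>1</Pnumber>
  <P1para><Text>some text</Text></P1para>
  <P1para>
    <Text>some more text</Text>
    <P2></P2>
    <P2></P2>
  </P1para>
</P1>
"""

def clml_section_to_akn(p1):
    """Flatten a CLML <P1> into an Akoma Ntoso-style <section> (hypothetical helper)."""
    section = ET.Element("section")
    ET.SubElement(section, "num").text = p1.findtext("Pnumber")
    intro = ET.SubElement(section, "intro")
    subsection_count = 0
    for para in p1.findall("P1para"):
        for child in para:
            if child.tag == "Text":
                # Every text paragraph goes into <intro>, whichever <P1para> held it.
                ET.SubElement(intro, "p").text = child.text
            elif child.tag == "P2":
                subsection_count += 1
    # Subsections become direct children of the section, after the intro.
    for _ in range(subsection_count):
        ET.SubElement(section, "subsection")
    return section

akn = clml_section_to_akn(ET.fromstring(CLML))
print(ET.tostring(akn, encoding="unicode"))
```

Note that the second <P1para> leaves no trace in the output, which is exactly the loss of grouping discussed above.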
Another challenge we faced relates to the need in the UK to specify the territorial applicability of individual parts of legislation. Legislation in the United Kingdom often contains alternative versions of individual sections, each with a geographical restriction. For example, an act may have two versions of Section 1, the first applying to England and Wales and the second to Scotland. CLML has a dedicated attribute for such cases. Akoma Ntoso allows authors to define jurisdictional restrictions in the metadata and to link them to sections of the document body, but to my mind this mechanism is not as elegant as Akoma Ntoso’s vocabulary for capturing temporal restrictions.
On the whole, however, we have grown quite fond of the simplicity of the Akoma Ntoso data model, and we have borrowed ideas from it for other projects. For example, The National Archives is very interested in supporting HTML5. We have been experimenting with a near one-to-one serialization of Akoma Ntoso in HTML5 and have produced HTML5 versions of all legislation available on legislation.gov.uk. The goal has been to follow the structure of Akoma Ntoso as closely as possible, while using all of the native semantics of HTML5. The core nodes of the document tree (parts, chapters, sections and other high-level “hierarchical containers” in Akoma Ntoso) are represented as nested HTML <section> elements, allowing the document outline to be parsed faithfully by HTML5 validators. We had a lively debate about the best use of HTML’s <section> element in legislative documents, ultimately deciding not to use it to represent hierarchical levels beneath the subsection, such as clauses. Also, we mirror the rich Akoma Ntoso metadata structure with native HTML elements using RDFa Lite.
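The nesting idea can be sketched in a few lines of Python. This is a guess at the general shape, not the serialization actually used on legislation.gov.uk: the `HCONTAINERS` set, the use of `class` attributes to carry the Akoma Ntoso element name, and the fallback to <div> for everything else are all assumptions made for illustration, and text following child elements (tails) is ignored.

```python
import xml.etree.ElementTree as ET

# Levels rendered as HTML <section>; levels beneath the subsection are
# deliberately left out, as described in the post. The set is illustrative.
HCONTAINERS = {"part", "chapter", "section", "subsection"}

def akn_to_html(node):
    """Map an Akoma Ntoso-style element tree to nested HTML elements (sketch)."""
    if node.tag in HCONTAINERS:
        html = ET.Element("section", {"class": node.tag})
    elif node.tag == "p":
        html = ET.Element("p")
    else:
        # Assumed fallback: any other element becomes a <div> with a class.
        html = ET.Element("div", {"class": node.tag})
    text = (node.text or "").strip()
    html.text = text or None
    for child in node:
        html.append(akn_to_html(child))
    return html

AKN = """
<part>
  <num>1</num>
  <chapter>
    <section>
      <num>1</num>
      <subsection>
        <clause><p>some text</p></clause>
      </subsection>
    </section>
  </chapter>
</part>
"""

html = akn_to_html(ET.fromstring(AKN))
print(ET.tostring(html, encoding="unicode"))
```

Here the part, chapter, section and subsection become four nested <section> elements, while the clause does not.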
Lastly, in the course of developing testing procedures for our document conversions we began thinking about ways to count elements within legislation and the relationships between them. Now, as part of The National Archives’ Big Data for Law project, we will conduct a “census” of the UK statute book and release data about the frequencies of structural patterns within legislative documents and the changes in those frequencies over time. We’re also using natural language processing to trace changes in statutory language. Look for this soon on legislation.gov.uk.
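A very small version of such a count might look like the following Python sketch. It assumes nothing about the methodology of the Big Data for Law project; the `census` function and the sample fragment are hypothetical, and the sketch simply tallies element names and parent-child pairs in a single document.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def census(root):
    """Count element names and parent/child pairs in one document (sketch)."""
    tags = Counter(el.tag for el in root.iter())
    pairs = Counter(
        (parent.tag, child.tag) for parent in root.iter() for child in parent
    )
    return tags, pairs

SAMPLE = (
    "<section><num>1</num>"
    "<intro><p>some text</p><p>some more text</p></intro>"
    "<subsection/><subsection/></section>"
)

tags, pairs = census(ET.fromstring(SAMPLE))
print(tags.most_common())
print(pairs[("section", "subsection")])
```

Running the same count over successive versions of a document would give the raw material for tracking how structural patterns change over time.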
I would like to thank John Sheridan, Head of Legislation Services at The National Archives, for giving me the opportunity to do the kind of work I find so rewarding. I hope it proves to be useful.
Great write-up Jim. I have so many questions I’m not sure where to start:
First of all, were you able to model everything in Akoma Ntoso without having to resort to any custom classes using the base level tags?
Second, were you able to preserve the word order from the printed document?
Third, did you have any cases of generated text in CLML and how did you handle the conversion? I’m talking about printed text that is added by rule rather than being within the text content of the XML itself.
Fourth, have you taken a look at the compliance levels built into the Akoma Ntoso spec? Which level of compliance have you been able to achieve? Did we strike the right balance in the spec, providing a path to adoption that would allow both for early benefit and an achievable result?
Your comments regarding the geographical restrictions are interesting. You mentioned them in your remote presentation in Ravenna, as well. You’re right, while we’ve spent a lot of time working through temporal restrictions of provisions, we’ve not spent nearly as much time dealing with geographical restrictions.
Akoma Ntoso will be out for public review very shortly. I’m inviting you (you don’t really need an invitation) to bring up these issues as part of the process. It will be of great help to us all.
Thanks for sharing your great work! — Grant
Hi Grant. Thanks for your questions.
I use a few of Akoma Ntoso’s generic elements, but I try to use them sparingly. For example, I use one to represent the section-level provisions of Northern Ireland Statutory Rules, which are commonly referred to as “regulations.” Also, I occasionally use them to group elements in the front or back matter, where most of the important Akoma Ntoso elements are inline. Sometimes I want a block-level grouping that seems more significant than a plain element with a class attribute would imply.
Second, yes, I believe I was able to preserve word order throughout. The one possible exception is the order of section titles, some of which appeared in the margins of the original print editions but are now commonly displayed following the section number in the HTML versions on legislation.gov.uk. I’d like to see Akoma Ntoso allow the simple inversion of the number and heading elements within hierarchical containers.
I don’t think I had to deal with any cases of generated text in the body of the documents. Of course, a fair amount of string manipulation was required to generate all of the necessary metadata values.
And finally, yes, I did look at the Akoma Ntoso compliance levels. We have achieved only the first level of compliance, but that may be misleading. We chose not to follow the recommended naming conventions in order to preserve consistency among the APIs on legislation.gov.uk. For example, one can request a specific chapter or section of an act. http://www.legislation.gov.uk/ukpga/2014/1/section/1 returns section 1 of that act, and http://www.legislation.gov.uk/ukpga/2014/1/section/1/data.xml returns the same in CLML. The id attribute of that section has the value “section-1”, and one can always generate the proper URL suffix for a document fragment by replacing certain hyphens with slashes in the desired element’s id. Following the Akoma Ntoso naming conventions for element identifiers would break that rule. In all other respects, however, I think we satisfy the requirements of the higher compliance levels, although I haven’t confirmed that recently. Perhaps the naming conventions could be moved up the compliance scale? They’re not difficult to implement, but one may have reasons not to.
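The id-to-URL rule can be sketched as below. The comment above says only that "certain" hyphens are replaced, so this Python sketch makes the simplifying assumption that every hyphen becomes a slash, which holds for an id such as "section-1" but may not hold for all ids. The function name `fragment_url` and its parameters are hypothetical.

```python
BASE = "http://www.legislation.gov.uk"

def fragment_url(document_path, element_id, clml=False):
    """Build the URL of a document fragment from an element id (sketch).

    Assumption: every hyphen in the id maps to a slash. The real rule
    replaces only certain hyphens, which this sketch does not model.
    """
    suffix = element_id.replace("-", "/")
    url = f"{BASE}/{document_path}/{suffix}"
    if clml:
        # The /data.xml suffix returns the same fragment in CLML.
        url += "/data.xml"
    return url

print(fragment_url("ukpga/2014/1", "section-1"))
print(fragment_url("ukpga/2014/1", "section-1", clml=True))
```

An identifier scheme that does not map onto URL path segments this directly would break the simple substitution, which is the consistency concern raised above.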
I’ll keep an eye out for the public review process.