The five recipients of the inaugural NDSA innovation awards are exemplars of the creativity, diversity, and collaboration essential to supporting the digital community as it works to preserve and make available digital materials. In an effort to learn more about and share the work of the individuals, projects, and institutions that won these awards, I am excited to start the first of what I hope will be a series of interviews with the award winners.
Mat Kelly is a Mobile Applications Developer and Programmer at NASA Langley Research Center and a Graduate Student in Computer Science at Old Dominion University. The awards recognized him for his work on WARCreate, a Google Chrome extension that allows users to create a Web ARChive (WARC) file from any browseable webpage.
Trevor: Could you tell us about WARCreate? Specifically, what are the project’s goals, what problems was it designed to ameliorate, and how does it fit into a larger set of ideas about web archiving?
Mat: WARCreate’s primary purpose is to let users easily preserve web content for which they would otherwise have to resort to ad hoc means (e.g., a browser’s “save page as…” function). The focus of the project is to capture content on social media networks, but really it is about giving the power to preserve to those to whom the preservation of the content would matter most and whom the content is often about (think: Facebook). A secondary goal of the tool is to bring the facilities of institutional archiving (e.g., WARC, Wayback) to personal web archiving so some of the advances in the field can be enjoyed by professional and amateur archivists alike. This has not been easy, as some ideas have had to be shoehorned to work with conventional web archiving technologies, but with each consideration I hope to make personal web archiving a task that is not daunting to a casual user.
Trevor: Where did the idea for this project come from?
Mat: I initially worked on a similar yet vastly different software project called Archive Facebook, which I presented at the NDIIPP – NDSA digital preservation partners meeting. Exploring the methods and output of this tool made me dig deeper into what has been done to overcome the problem of preserving content that users feel is important and is otherwise difficult or impossible to preserve. While Bergman’s work made me aware of the vastness of content inaccessible to crawlers, the work of Cathy Marshall brought me up to speed with more contemporary concerns and how casual users go about accomplishing personal web archiving. These two authors, and many others who published between them, motivated me to help overcome the issues that users in Marshall’s work faced as well as to capture the content that Bergman implicitly said was difficult to preserve.
Web content inaccessible to crawlers is many times larger than the content that crawlers, including Heritrix, can reach, and it is frequently not preserved for that reason. WARCreate’s method of capturing any page that the user can see allows the archivable content, a superset of Heritrix’s archivable content, to be preserved.
Trevor: What have you learned through working on WARCreate? Are there things you have learned about either the process and goals of web archiving or about developing software tools to support digital preservation?
Mat: My exposure to the WARC format prior to developing WARCreate was limited. Through some use cases in the Internet Archive-driven Archive-It service, I learned that many would like a means to archive, but that it cannot be so complicated as to require users to dedicate a lot of time to overcoming the learning curve of new software and formats. Through discussing the project in the early stages with both professional preservationists and casual PC users, I was made aware of some needs and concerns of users but also of the desire for new tools to be interoperable with tools currently used for archiving. This was the premise of utilizing the WARC format – it is an ISO standard and is utilized by one of the more popular end-user systems, the Wayback Machine. Giving exposure to tools like the open source Wayback, the Memento framework, the XAMPP client-side server package and the like is a sort of byproduct of developing a tool with integration in mind. I want to make it useful and easy for casual users while taking advantage of the formats and tools with which professional web archivists are already familiar. I am constantly learning what the user wants by developing the evolving project while trying to keep scope creep at bay by modularizing all of the components, to ensure that those who want to simply create WARCs without the extra integration will be able to do so by using WARCreate.
Trevor: What attracted you to digital preservation as an area of study? Further, do you have any thoughts for how we can get more computer scientists and computer science students interested in working on problems related to digital preservation?
Mat: Old Dominion University has an excellent web science & digital libraries group, and their academic, research and post-graduation success attracted me to the group. In past endeavors, I had worked with some extremely niche projects. With personal digital archiving being a subject that “fell between the cracks” for a long time, with different tools and processes for individuals and organizations, creating a tool like WARCreate was an opportunity to merge the preservation software environments. One problem in computer science is that there is a tendency to think that Google has solved all open problems involving the web, which isn’t true, especially with respect to preservation. With increased research funding, such as the NSF/NIH Big Data program, more attention is being drawn to the problem, and that attention stresses that preservation is a necessary precondition to data mining and use.
Trevor: Did you find any of the sessions at the Digital Preservation 2012 conference particularly interesting or valuable for thinking about your work? If so, please elaborate on what about them intrigued you or connects with how you are thinking about your work.
Mat: Quite a few of the sessions were valuable and interesting, but a select set stuck out as being helpful to my research into personal web archiving. Anil Dash’s talk about content in relation to open formats and Michael Carroll’s description of fair use are going to be excellent starting points when I consider the advantages and ramifications that WARCreate has for a user and for the user from whom they may be trying to preserve their content. I loved the poster and demo sessions, as they started the wheels spinning on how some of the ideas in these concrete implementations are helpful to preservationists and how I should consider these concerns when developing archiving tools in the future.