Butch Lazorchak is an Information Technology Project Manager at the Library of Congress. His research interests include digital geography, digital music preservation and musical communities, K-12 and LIS student outreach and emerging technologies. He holds an MLS from the University of North Carolina-Chapel Hill.
The following is a guest post by Abbie Grotke, Lead Information Technology Specialist on the Web Archiving Team, Library of Congress.
Recently the Library of Congress launched a significant amount of new Web Archive content on the Library’s Web site, as a part of a continued effort to integrate the Library’s Web Archives into the rest of the loc.gov web presence.
This is our first big release since we launched the first iteration of collections into this new interface back in June 2013. The earlier approach to presenting archived web sites made it difficult to increase the amount of content available, so in a “one step back, two steps forward” move, the interface has been simplified and should be more familiar to those working with Web Archives at other institutions – item records point to archived web sites displayed in an open-source version of the Wayback Machine. This simplification allowed the Library to increase the number of sites available in this interface from just under 1,000 to over 5,800. The most recently harvested sites now publicly available were captured in March-April 2012. The simplified approach should also allow us to catch up on moving more current content into the online collections.
There are now 21 named collections available in the new interface; some had been available in our old interface and are newly migrated; other content is entirely new. With this launch, we are particularly excited about the addition of the United States Congressional Web Archives, which for the first time allows researchers to access content collected from December 2002 through April 2012. Each record covers the sessions in which a particular member of Congress served, such as Barack Obama’s two sessions as senator, or Kirsten E. Gillibrand’s service in both the House and Senate, represented on one record despite a URL change.
We still have some work to do to move the U.S. Election Web Archives from our old interface, so for the time being researchers interested in those collections will need to refer back to the old site. Eventually we will be combining the separate Election collections into one U.S. Election Archive that will allow better searchability and access, and migrating them over (and then “turning off” the old interface).
We hope researchers will enjoy access to these new web archive collections.
If you believe the Web (and who doesn’t believe everything they read on the Web?), it boastfully celebrated its 25th birthday last year. Twenty-five years is long enough for the first “children of the Web” to be fully-grown adults, just now coming of age to recognize that the Web that grew up around them has irrevocably changed.
In this particular instance, change is good. It’s only by becoming aware of what we’re losing (or have already lost) that we’ll be spurred to action to preserve it. We’ve been aware of the value of the historic web for a number of years here at the Library of Congress, and we’ve worked hard to understand how to capture the Web through the Library’s Web Archiving program and the work we’ve done with partners at the Memento project and through the International Internet Preservation Consortium.
But let’s go back to those “children of the Web.” Nostalgia is a powerful driver for preservation, but most preservation efforts are driven by full-grown adults. If they’re able to bring a child’s perspective to their work it’s only through the prism of their own memory, and in any event, the nostalgic items they may wish to capture may not be around anymore by the time they get to them. What’s needed is not just a nostalgic memory of the web, but efforts to curate and capture the web with a perspective that includes the interests of the young. And who better to represent the interests of the young than children and teenagers themselves! Luckily the Library of Congress has such a program: the K-12 web archiving program.
The K-12 Web Archiving program has been operating since 2008, engaging hundreds of students at dozens of schools, large and small, across the U.S. in understanding what the Web means to them and why it’s important to capture it. In partnership with the Internet Archive, the program enables schools to set up their own web capture tools and choose sets of web resources to collect; resources that represent the full range of youthful experience, including popular culture, commerce, news, entertainment and more.
Cheryl Lederle, an Educational Resource Specialist at the Library of Congress, notes that the program builds student awareness of the internet as a primary source as well as how quickly it can change. The program might best be understood through the reflections of participating teachers:
“The students gained an understanding of how history is understood through the primary sources that are preserved and therefore the importance of the selection process for what we are digitally preserving. But, I think the biggest gain was their personal investment in preserving their own history for future generations. The students were excited and fully engaged by being a part of the K-12 archiving program and that their choices were being preserved for their own children someday to view.” – MaryJane Cochrane, Paul VI Catholic High School
“The project introduced my students to historical thinking; awareness of digital data as a primary source and documentation of current events and popular culture; and helped foster an appreciation and awareness of libraries and historical archives.” – Patricia Carlton, Mount Dora High School
And participating students:
“Before this project, I was under the impression that whatever was posted on the Internet was permanent. But now, I realize that information posted on the Internet is always changing and evolving.”
“I find it very interesting that you can look back on old websites and see how technology has progressed. I want to look back on the sites we posted in the future to see how things have changed.”
“I was surprised by the fact that people from the next generation will also share the information that I have collected.”
“They’re really going to listen to us and let us choose sites to save? We’re eight!”
Collections from 2008-2014 are available for study on the K-12 Web Archiving site, and the current school year will be added soon. Students examining these collections might:
Compare one school’s collections from different years.
Compare collections preserved by students of different grade levels in the same year.
Compare collections by students of the same grade level, but from different locations.
Create a list of Web sites they think should be preserved and organize them into two or three collections.
What did your students discover about the value of preserving Web sites?
We’ve written about the BitCurator project a number of times, but the project has recently entered a new phase and it’s a great time to check in again. The BitCurator Access project began in October 2014 with funding through the Mellon Foundation. BitCurator Access is building on the original BitCurator project to develop open-source software that makes it easier to access disk images created as part of a forensic preservation process.
Kam Woods has been a part of BitCurator from the beginning as its Technical Lead, and he’s currently a Research Scientist in the School of Information and Library Science at the University of North Carolina at Chapel Hill. As part of our Insights Interview series we talked with Woods about the latest efforts to apply digital forensics to digital preservation.
Butch: How did you end up working on the BitCurator project?
Kam: In late 2010, I took a postdoc position in the School of Information and Library Science at UNC, sponsored by Cal Lee and funded by a subcontract from an NSF grant awarded to Simson Garfinkel (then at the Naval Postgraduate School). Over the following months I worked extensively with many of the open source digital forensics tools written by Simson and others, and it was immediately clear that there were natural applications to the issues faced by collecting organizations preserving born-digital materials. The postdoc position was only funded for one year, so – in early 2011 – Cal and I (along with eventual Co-PI Matthew Kirschenbaum) began putting together a grant proposal to the Andrew W. Mellon Foundation describing the work that would become the first BitCurator project.
Butch: If people have any understanding at all of digital forensics it’s probably from television or movies, but I suspect the actions you see there are pretty unrealistic. How would you describe digital forensics for the layperson? (And as an aside, what do people on television get “most right” about digital forensics?)
Kam: Digital forensics commonly refers to the process of recovering, analyzing, and reporting on data found on digital devices. The term is rooted in law enforcement and corporate security practices: tools and practices designed to identify items of interest (e.g. deleted files, web search histories, or emails) in a collection of data in order to support a specific position in a civic or criminal court case, to pinpoint a security breach, or to identify other kinds of suspected misconduct.
The goals differ when applying these tools and techniques within archives and data preservation institutions, but there are a lot of parallels in the process: providing an accurate record of chain of custody, documenting provenance, and storing the data in a manner that resists tampering, destruction, or loss. I would direct the interested reader to the excellent and freely available 2010 Council on Library and Information Resources report Digital Forensics and Born-Digital Content in Cultural Heritage Institutions (pdf) for additional detail.
You’ll occasionally see some semblance of a real-world tool or method in TV shows, but the presentation is often pretty bizarre. As far as day-to-day practices go, discussions I’ve had with law enforcement professionals often include phrases like “huge backlogs” and “overextended resources.” Sound familiar to any librarians and archivists?
Butch: Digital forensics has become a hot topic in the digital preservation community, but I suspect that it’s still outside the daily activity of most librarians and archivists. What should librarians and archivists know about digital forensics and how it can support digital preservation?
Kam: One of the things Cal Lee and I emphasize in workshops is the importance of avoiding unintentional or irreversible changes to source media. If someone brings you a device such as a hard disk or USB drive, a hardware write-blocker will ensure that if you plug that device into a modern machine, nothing can be written to it, either by you or some automatic process running on your operating system. Using a write-blocker is a baseline risk-reducing practice for anyone examining data that arrives on writeable media.
Creating a disk image – a sector-by-sector copy of a disk – can support high-quality preservation outcomes in several ways. A disk image retains the entirety of any file system contained within the media, including directory structures and timestamps associated with things like when particular files were created and modified. Retaining a disk image ensures that as your external tools (for example, those used to export files and file system metadata) improve over time, you can revisit a “gold standard” version of the source material to ensure you’re not losing something of value that might be of interest to future historians or researchers.
Disk imaging also mitigates the risk of hardware failure during an assessment. There’s no simple, universal way to know how many additional access events an older disk may withstand until you try to access it. If a hard disk begins to fail while you’re reading it, chances of preserving the data are often higher if you’re in the process of making a sector-by-sector copy in a forensic format with a forensic imaging utility. Forensic disk image formats embed capture metadata and redundancy checks to ensure a robust technical record of how and when that image was captured, and improve survivability over raw images if there is ever damage to your storage system. This can be especially useful if you’re placing a material in long-term offline storage.
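To make the imaging step concrete, here is a minimal sketch in Python of the copy-plus-hash idea behind disk imaging; the function name and parameters are illustrative. Real forensic imaging utilities (Guymager, the EWF tools) do much more – embedding capture metadata, computing multiple checksums and writing self-verifying formats like E01 – so this is a conceptual illustration, not a substitute for those tools.

```python
import hashlib

def image_device(source_path, image_path, chunk_size=1024 * 1024):
    """Copy a source device or file into a raw image, hashing the
    stream as it is read so the capture can be verified later.

    Returns the SHA-256 digest of everything that was copied.
    """
    sha256 = hashlib.sha256()
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of media reached
                break
            dst.write(chunk)
            sha256.update(chunk)
    return sha256.hexdigest()
```

In practice the source would be a block device (for example, `/dev/sdb` on Linux) mounted behind a hardware write-blocker, and the returned digest would be recorded alongside the image so future custodians can confirm the copy has not changed.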
There are many situations where it’s not practical, necessary, or appropriate to create a disk image, particularly if you receive a disk that is simply being used as an intermediary for data transfer, or if you’re working with files stored on a remote server or shared drive. Most digital forensics tools that actually analyze the data you’re acquiring (for example, Simson Garfinkel’s bulk extractor, which searches for potentially private and sensitive information and other items of interest) will just as easily process a directory of files as they would a disk image. Being aware of these options can help guide informed processing decisions.
Finally, collecting institutions spend a great deal of time and money assessing, hiring and training professionals to make complex decisions about what to preserve, how to preserve it and how to effectively provide and moderate access in ways that serve the public good. Digital forensics software can reduce the amount of manual triage required when assessing new or unprocessed materials, prioritizing items that are likely to be preservation targets or require additional attention.
Butch: How does BitCurator Access extend the work of the original phases of the BitCurator project?
Kam: One of the development goals for BitCurator Access is to provide archives and libraries with better mechanisms to interact with the contents of complex digital objects such as disk images. We’re developing software that runs as a web service and allows any user with a web browser to easily navigate collections of disk images in many different formats. This includes facilities to examine the contents of the file systems contained within those images, to interact with visualizations of file system metadata and organization (including timelines indicating changes to files and folders), and to download items of interest. There’s an early version and installation guide in the “Tools” section of http://access.bitcurator.net/.
We’re also working on software to automate the process of redacting potentially private and sensitive information – things like Social Security Numbers, dates of birth, bank account numbers and geolocation data – from these materials based on reports produced by digital forensics tools. Automatic redaction is a complex problem that often requires knowledge of specific file format structures to do correctly. We’re using some existing software libraries to automatically redact where we can, flag items that may require human attention and prepare clear reports describing those actions.
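As a rough illustration of the flag-for-review half of that workflow, the sketch below scans text for strings shaped like Social Security numbers or dates of birth and emits a report for human attention. The patterns and function names are made up for this example – this is not BitCurator Access code, and real scanners such as bulk extractor are far more sophisticated and format-aware.

```python
import re

# Illustrative patterns only; production tools use format-aware
# scanners and validation, not bare regular expressions.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "dob": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
}

def flag_sensitive(text):
    """Return a report of possibly sensitive items for human review,
    recording the type, byte offset and matched value of each hit."""
    report = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            report.append({
                "type": label,
                "offset": match.start(),
                "value": match.group(),
            })
    return report
```

A redaction pass could then overwrite the flagged byte ranges in a copy of the source, while the report itself documents what was done and what still needs a human decision.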
Finally, we’re exploring ways in which we can incorporate emulation tools such as those developed at the University of Freiburg using the Emulation-as-a-Service model.
Butch: I’ve heard archivists and curators express ethical concerns about using digital forensics tools to uncover material that an author may not have wished be made available (such as earlier drafts of written works). Do you have any thoughts on the ethical considerations of using digital forensics tools for digital preservation and/or archival purposes?
Kam: There’s a great DPC Technology Watch report from 2012, Digital Forensics and Preservation (pdf), in which Jeremy Leighton John frames the issue directly: “Curators have always been in a privileged position due to the necessity for institutions to appraise material that is potentially being accepted for long-term preservation and access; and this continues with the essential and judicious use of forensic technologies.”
What constitutes “essential and judicious” is an area of active discussion. It has been noted elsewhere (see the CLIR report I mentioned earlier) that the increased use of tools with these capabilities may necessitate revisiting and refining the language in donor agreements and ethics guidelines.
As a practical aside, the Society of American Archivists Guide to Deeds of Gift includes language alerting donors to concerns regarding deleted content and sensitive information on digital media. Using the Wayback Machine, you can see that this language was added mid-2013, so that provides some context for the impact these discussions are having.
Butch: You’ve mentioned DigitalCorpora.org as a resource for teaching with these tools. Can you tell us what it is and how it’s used?
Kam: DigitalCorpora.org was originally created by Simson Garfinkel to serve as a home for corpora he and others developed for use in digital forensics education and research. The set of materials on the site has evolved over time, but several of the currently available corpora were captured as part of scripted, simulated real-world scenarios in which researchers and students played out roles involving mock criminal activities using computers, USB drives, cell phones and network devices.
These corpora strike a balance between realism and complexity, allowing students in digital forensics courses to engage with problems similar to those they might encounter in their professional careers while limiting the volume of distractors and irrelevant content. They’re freely distributed, contain no actual confidential or sensitive information, and in certain cases have exercises and solution guides that can be distributed to instructors. There’s a great paper linked in the Bibliography section of that site entitled “Bringing science to digital forensics with standardized forensic corpora” (pdf) that goes into the need for such corpora in much greater detail.
We’ve used disk images from one corpus in particular – the “M57-Patents Scenario” – in courses taken by LIS students at UNC and in workshops run by the Society of American Archivists. They’re useful in illustrating various issues you might run into when working with a hard drive obtained from a donor, and in learning to work with various digital forensics tools. I’ve had talks with several people about the possibility of building a realistic corpus that simulated, say, a set of hard drives obtained from an artist or author. This would be expensive and require significant planning, for reasons that are most clearly described in the paper linked in the previous paragraph.
Butch: What are the next steps the digital preservation community should address when it comes to digital forensics?
Kam: Better workflow modeling, information sharing and standard vocabularies to describe actions taken using digital forensics tools are high up on the list. A number of institutions do currently document and publish workflows that involve digital forensics, but differences in factors like language and resolution make it difficult to compare them meaningfully. It’s important to be able to distinguish those ways in which workflows differ that are inherent to the process, rather than the way in which that process is described.
Improving community-driven resources that document and describe the functions of various digital forensics tools as they relate to preservation practices is another big one. Authors of these tools often provide comprehensive documentation, but it doesn’t necessarily emphasize those uses or features of the tools that are most relevant to collecting institutions. Of course, a really great tool tutorial doesn’t really help someone who doesn’t know about that tool, or isn’t familiar with the language being used to describe what it does, so you can flip this: describing a desired data processing outcome in a way that feels natural to an archivist or librarian, and linking to a tool that solves part or all of the related problem. We have some of this already, scattered around the web; we just need more of it, and better organization.
Finally, a shared resource for robust educational materials that reflect the kinds of activities students graduating from LIS programs may undertake using these tools. This one more or less speaks for itself.
The following is a guest post by Nora Ohnishi, a former intern with the Web Archiving Team at the Library of Congress.
My name is Nora Ohnishi, and I will graduate with my Masters in Library and Information Science from the University of North Texas in May. I began working for The Library of Congress via the HACU National Internship Program in January 2015. While here I have worked with the Web Archiving Team, part of the Office of Strategic Initiatives.
During a recent meeting of the Federal Web Archiving Working Group, Christie Moffatt from NLM discussed her team’s use of the Archive-It tool and the collections that are publicly available, such as the Global Health Events and Health and Medicine Blogs. Dory Bower and David Wallace from the Government Publishing Office distributed a new publication documenting the Federal Information Preservation Network’s planned activities.
It was after this meeting that I began work on analyzing a spreadsheet of Federal government Web sites published by 18F, a newly-formed organization within the General Services Administration. I review the sites for owners, authors and completeness and determine the size of the site according to Google. After this, I create a nomination using DigiBoard (pdf), the Library’s curatorial tool, for inclusion of the site in the Public Policy Topics Web Archives collection. This work has helped build the Library of Congress collection of Federal Web sites.
It is now early May, and using this spreadsheet, I have successfully evaluated over 2,000 Federal Government sites and created nominations for review! Among my favorites are:
I like these sites for various reasons, but the main reason is that they helped me learn about different government groups or initiatives that I otherwise would not have known about. Additionally, some of the documents and reports on the sites date back to the 1980s, and the site owners may decide to take them down at some point. This reason makes the sites good candidates for inclusion in the Web Archives.
As a result of this process, the Library and the working group can now use this as a launching point for ensuring our work harvesting Web sites does not overlap, and in cases where it does, discuss the reasons, such as different scopes, harvesting levels or other issues. This was at times a formidable task, but it all ties into the reason for our work at the Library of Congress and in the Federal Web Archiving Working Group: to provide future researchers open access to government information. As a newly minted librarian, this is one of my most important philosophies behind library services.
The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.
In a blog post about six months ago I wrote about how the Library of Congress web archiving program was starting to harvest “general” internet news sites such as Daily Kos, Huffington Post and Townhall.com, as well as newer sites such as news.vice.com and verge.com.
Many of these sites are extremely large. How large? While not an exact count (and in fact, far from it), use of the “site” limiter in Google will provide a count of digital objects found and indexed by Google (which is a far larger number than the number of web pages, but gives some sense of relative scale to other sites). A “site:huffingtonpost.com” search in Google returns “about 3,470,000 results.” That is large.
When harvesting sites like these in a “traditional” way the harvesting starts at the home page and follows links found on pages that are in scope, capturing each page and the digital bits and pieces (such as JPEG images) so that playback software can recreate the page accurately later. For example, with HuffingtonPost.com that would mean links on the URL huffingtonpost.com and not links out to some other URL.
Unfortunately such sites are so large that the harvesting process runs out of time. Even though we were harvesting (to stay with this example) HuffingtonPost.com on a weekly basis, capturing whatever we could manage to get each time, there was no assurance that over time we would capture all the content published to the site even once, since with each harvest the process of winding through the site started over again but then followed a different path.
What is the use case served by trying to capture the entire site in one visit (leaving aside completeness)? Presumably it is that the archive ends up with multiple copies of the same news item over time, perhaps showing that some changes were made for one reason or another. This is the accountability use case:
Crawling websites over time allows for modifications to content to be observed and analyzed. This type of access can be useful in ensuring accountability and visibility for web content that no longer exists. On one hand, companies may archive their web content as part of records management practices or as a defense against legal action; on the other, public web archives can show changes in governmental, organizational, or individual policy or practices.
This would be a fine use case for our efforts to support, but if we aren’t able to reliably harvest all the content at least once, it seems less than helpful for users of the archived site to have multiple copies of some news items and none of others (on a completely random basis, from the user perspective).
Blogs, Tweets, and status updates on social media are just as likely sources for news today as traditional newspapers and broadcasts. Traditional media have also adapted to deliver news online. Libraries and archives have always collected newspapers; these are core collections of many local historical societies. If the news that is distributed online is not preserved, there will be a huge hole in our collective memory.
For this use case, completeness of harvest – getting at least one copy of every news story published on the site – is a more useful outcome over time than having multiple copies of only some of the news stories.
And there is another use case that will be better served by completeness – text mining. The Library of Congress does not now support any type of text mining tools to interact with its web archives, but when it does (someday), completeness of capture of all that was published will be more important than multiple copies of mostly the same thing. But how do we achieve this so-called completeness, if not by attempting regular top-to-bottom harvesting?
Borrowing from an approach used by Icelandic colleagues, we have tried to achieve a more complete harvest over time by making use of RSS feeds provided by the news sites. Over the course of a week, RSS for articles that are produced by “target” news sites (such as HuffingtonPost.com) are aggregated into one-time use “seed lists” and the crawler then can go to the news site and harvest just those seeds, reliably. Although there is a certain extra effort in this approach in building custom seed lists week after week, over time (by which I mean years) it will assure completeness of capture. We should get at least one capture of every new item published moving forward in time. This is good.
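The weekly seed-list step can be sketched in a few lines of Python using only the standard library. The function names here are illustrative, and a production setup would fetch the feeds over HTTP and hand the resulting list to a crawler such as Heritrix; this sketch only shows the aggregate-and-deduplicate idea.

```python
import xml.etree.ElementTree as ET

def extract_seed_urls(rss_xml):
    """Pull article URLs out of one RSS 2.0 feed document."""
    root = ET.fromstring(rss_xml)
    seeds = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            seeds.append(link.strip())
    return seeds

def build_seed_list(feed_documents):
    """Aggregate article URLs from several feeds into one
    de-duplicated, one-time-use seed list for the crawler."""
    seen = set()
    seed_list = []
    for rss_xml in feed_documents:
        for url in extract_seed_urls(rss_xml):
            if url not in seen:
                seen.add(url)
                seed_list.append(url)
    return seed_list
```

Run over a week’s worth of feed snapshots, this yields exactly the set of new article URLs to hand to the crawler, which is what makes the capture reliable: the crawler fetches a known list rather than wandering the site.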
We will also do occasional attempts (maybe twice a year) to harvest a whole site thoroughly to fill gaps, perhaps to help with completeness and to provide multiple captures of some content.
What will this mean for future users of these archived news sites? As we begin to make these sites that depend on harvesting-by-RSS-seed-URL available, users may find more broken link problems when browsing (although maybe not – it simply isn’t clear). While our present interface is an archive that reproduces the browse-the-site experience and it can be useful for users to have the “time machine” experience of browsing older versions of a site, this is not the only significant use case. If a user has a particular URL and wants to see it in the archive, browsing is not necessary. We still need to add textual search to support our users, but that would also offer a path around broken links. And over time (again, years) the completeness of coverage on an ongoing basis will build more reliability.
That is what a national library is supposed to be about – building significant collections over time, steadily and reliably. And this is where we want web archiving to be.
The NEH has consistently funded research that addresses the most pertinent issues related to digital stewardship. Its recently revised Research and Development grant program seeks to address major challenges in preserving and providing access to humanities collections and resources, and Josh, a program officer in NEH’s Division of Preservation and Access, will help us understand the new application guidelines and NEH’s perspective on digital stewardship. The deadline for submitting an application is June 25, 2015.
Josh has posted several times on the Signal and we interviewed him about his background and NEH’s digital stewardship interests back in March 2012.
Butch: This year NEH has decided to break its funding for Research and Development into two tiers: Tier I for short-term and Tier II for longer-term projects. Talk about why NEH wanted to split the funding up this way.
Josh: First of all, thank you for this opportunity to discuss the exciting changes to our grant program! Last year, my colleagues and I in the Division of Preservation and Access undertook an intensive year-long review of our Research and Development program. We reached out to the field, including participants in the 2014 NDSA Digital Preservation Conference, to listen to practitioners’ needs. We discovered that the landscape of research has changed dramatically in very short order. For starters, new content formats (especially in the digital space) are emerging and changing the way we understand the humanities. Yes, tools and platforms are critical for the work of humanities scholars, educators, curators, archivists, librarians and students, but just as important is the need to establish standards, practices, methodologies and workflows to promote resource sharing, evaluation, and collaboration.
By introducing the Tier I grant, we believe we can seed projects at all stages of development, from early conceptualization to advanced implementation. In addition, we want to support discrete research and development projects. Sometimes, a small team of humanities practitioners and scientists can assemble rapidly to collect critical data for the field. Altogether, the combination of short- and longer-term projects is intended to capture the fluid dynamic that we see arising from within cultural heritage research and development.
Butch: Give us a little more detail on each of the funding Tiers and examples of the kinds of projects you’d like to see under each.
Josh: We see Tier I as a promising entry point for a wide variety of project types, from the planning of large, multi-year collaborative projects to standalone projects such as basic research experiments, case studies, or tool development. Tier I projects, therefore, may be used to accomplish an expansive range of tasks. For example, a team creating an open source digital asset management system wants to include additional functionalities that take the platform out of its “beta” phase. A group of information scientists, working with humanities scholars, wants to investigate the efficacy of a new linked open data model. Or a group of computer scientists wants to test a new approach to search and discovery within a large humanities data corpus.
At the Tier II level, NEH continues to support projects at an advanced implementation stage. Projects at this level must investigate the development of standards, practices, methodologies or workflows that could be shared and adopted by a wider community of practitioners.
For both tiers, we encourage collaboration across the humanities and sciences, whether information, computer, or natural. We believe pairing people from disparate backgrounds poses the best opportunity to accomplish positive outcomes for cultural heritage. We have included possible research topics and areas in our guidelines (pdf) that may provide some guidance, although please bear in mind the list is not intended to be comprehensive.
Butch: Do you foresee that projects originally funded under Tier I will return for Tier II funding down the road?
Josh: Yes, but it is not a prerequisite to apply. After reviewing many successful R&D projects over the years, we learned that the keys to a successful project begin with considerable planning, preparation, preliminary research and in some instances, prototyping, all of which would be eligible for Tier I support. Even if a project team does not continue into a formal implementation stage, a Tier I project can still provide a tremendous benefit to the field.
Butch: The digital stewardship community has often been challenged in securing stewards and funding support for tools and services that have grown to become part of the community infrastructure, such as web archiving tools. How does NEH see itself in terms of helping to develop and sustain a long-term digital stewardship infrastructure?
Josh: We envision the digital stewardship community, along with the wider cultural heritage R&D community, as building on an expanding scaffolding of data, tools, platforms, standards and practices. Each element has its role in advancing knowledge, forming professional connections and furthering the cause of providing better long-term preservation and access to humanities collections. One of the most gratifying parts of our job is to see how a standard under development and supported by R&D funding is eventually used in projects supported through our other grant programs. We think R&D can have the greatest impact by supporting the development of the elements that serve as the practical and theoretical glue binding the work of the humanities. For this reason, the grants do not support direct infrastructural development, per se, but rather applied research that leads to fundamental changes in our approach to stewardship.
Butch: Starting in 2016, the NEH will host an annual Research and Development Project Directors’ Meeting. Tell us about this meeting and how it will help publicize digital stewardship projects and research.
Josh: Compared to the sciences, the cultural heritage community perhaps has fewer opportunities to reflect upon major preservation and access-related challenges in the field in a public forum. Whether we are considering open access of humanities content, the crisis in audiovisual preservation and access, or a host of other topics, these challenges are clearly complex and demand creative thinking. Starting next spring, NEH will host an open forum that will not only provide recently awarded Project Directors the opportunity to showcase their innovative work, but will also encourage participants to think beyond their own projects and offer expert perspective on a single pre-selected issue. I don’t have much more to share at this stage, but I encourage everyone to stay tuned as information becomes available!
Butch: The revised NEH funding approach seems designed to help build connections across the digital stewardship community. How concerned are NEH and organizations like it about the “silo-ing” of digital stewardship research?
Josh: Maintaining active and productive research connections is essential for the success of digital cultural heritage research and development. It is the reason why, starting this year, we are requiring Tier II applicants to supply a separate dissemination proposal describing how research findings on standards, practices, methodologies and workflows will reach a representative audience. Research in digital stewardship has matured in recent years. Project teams can no longer rely on uploading a set of code and expecting a community to form magically around its sustainability. Thankfully, there are so many resourceful ways in which researchers can reach their constituency, from holding in-person and virtual workshops, to code sprints, to developing online tutorials, to name just a few possibilities.
Butch: The 2015 National Agenda published last fall included a number of solid recommendations for research and development in the area of digital stewardship. In addition to applying for funds from NEH, what can NDSA member organizations concentrate on that will benefit the community as a whole?
Josh: NDSA has done a wonderful job crystallizing the R&D needs of specific areas and drawing attention to new ones. My recommendation, therefore, comes from social, rather than technical, considerations. I think first and foremost NDSA members should not be afraid to self-identify with the cultural heritage research and development community. All too often during our internal review we found that humanities practitioners were content working with the “status quo” as far as tools, platforms, standards, practices and methodologies are concerned. As a consequence, a lot of time and energy is spent adapting commercial or open source tools that were produced with entirely different audiences in mind. As soon as those in cultural heritage realize that their needs are distinct from those of other disciplines, they can begin to develop the necessary partnerships, collaborations, programming, and project focus.
The following is a guest post by Kalev Hannes Leetaru, Senior Fellow, George Washington University Center for Cyber & Homeland Security. Portions adapted from a post for the Knight Foundation.
Imagine a world where language is no longer a barrier to information access, where anyone can access real-time information from anywhere in the world in any language, seamlessly translated into their native tongue, and where their voice is equally accessible to speakers of all the world’s languages. Authors from Douglas Adams to Ethan Zuckerman have long articulated such visions of a post-lingual society in which mass translation eliminates barriers to information access and communication. Yet, even as technologies like the web have broken down geographic barriers and increasingly made it possible to access information from anywhere in the world, linguistic barriers mean most of those voices remain steadfastly inaccessible. For libraries, mass human and machine translation of the world’s information offers enormous possibilities for broadening access to their collections. In turn, as there is greater interest in the non-Western and non-English world’s information, this should lead to a greater focus on preserving it akin to what has been done for Western online news and television.
Yet these translation efforts have substantial limitations. Twitter and Facebook’s on-demand model translates content only as it is requested, meaning a user must discover a given post, know it is of possible relevance, explicitly request that it be translated and wait for the translation to become available. Wikipedia and TED attempt to address this by pre-translating material en masse, but their reliance on human translators and all-volunteer workflows imposes long delays before material becomes available.
Libraries have explored translation primarily as an outreach tool rather than as a gateway to their collections. Facilities with large percentages of patrons speaking languages other than English may hire bilingual staff, increase their collections of materials in those languages and hold special events in those languages. The Denver Public Library offers a prominent link right on its homepage to its tailored Spanish-language site that includes links to English courses, immigration and citizenship resources, job training and support services. Instead of merely translating their English site word-for-word into Spanish, they have created a completely customized parallel information portal. However, searches of their OPAC in Spanish will still only return works with Spanish titles: a search for “Matar un ruisenor” will return only the single Spanish translation of “To Kill a Mockingbird” in their catalog.
On the one hand, this makes sense, since a search for a Spanish title likely indicates an interest in a Spanish edition of the book, but if no Spanish copy is available, it would be useful to at least notify the patron of copies in other languages in case that patron can read any of those other languages. Other sites like the Fort Vancouver Regional Library District use the Google Translate widget to perform live machine translation of their site. This has the benefit that when searching the library catalog in English, the results list can be viewed in any of Google Translate’s 91 languages. However, the catalog itself must still be searched in English or in the language in which a title was published, so this only solves part of the problem.
In fact, the lack of available content for most of the world’s languages was identified in the most recent Internet.org report (PDF) as being one of the primary barriers to greater connectivity throughout the world. Today there are nearly 7,000 languages spoken throughout the world of which 99.7% are spoken by less than 1% of the world’s population. By some measures, just 53% of the Earth’s population has access to measurable online content in their primary language and almost a billion people speak languages for which no Wikipedia content is available. Even within a single country there can be enormous linguistic variety: India has 425 primary languages and Papua New Guinea has 832 languages spoken within its borders. As ever-greater numbers of these speakers join the online world, even English speakers are beginning to experience linguistic barriers: as of November 2012, 60% of tweets were in a language other than English.
Web companies from Facebook and Twitter to Google and Microsoft are increasingly turning to machine translation to offer real-time access to information in other languages. Anyone who has used Google or Microsoft Translate is familiar with the concept of machine translation and both its enormous potential (transparently reading any document in any language) and current limitations (many translated documents being barely comprehensible). Historically, machine translation systems were built through laborious manual coding, in which a large team of linguists and computer programmers sat down and literally hand-programmed how every single word and phrase should be translated from one language to another. Such models performed well on perfectly grammatical formal text, but often struggled with the fluid informal speech characterizing everyday discourse. Most importantly, the enormous expense of manually programming translation rules for every word and phrase and all of the related grammatical structures of both the input and output language meant that translation algorithms were built for only the most widely-used languages.
Advances in computing power over the past decade, however, have led to the rise of “statistical machine translation” (SMT) systems. Instead of humans hand-programming translation rules, SMT systems examine large corpora of material that have been human-translated from one language to another and learn which words from one language correspond to those in the other language. For example, an SMT system would determine that when it sees “dog” in English it almost always sees “chien” in French, but when it sees “fan” in English, it must look at the surrounding words to determine whether to translate it into “ventilateur” (electric fan) or “supporter” (sports fan).
Such translation systems require no human intervention – just a large library of bilingual texts as input. United Nations and European Union legal texts are often used as input given that they are carefully hand translated into each of the major European languages. The ability of SMT systems to rapidly create new translation models on-demand has led to an explosion in the number of languages supported by machine translation systems over the last few years, with Google Translate translating to/from 91 languages as of April 2015.
What would it look like if one simply translated the entirety of the world’s information in real-time using massive machine translation? For the past two years the GDELT Project has been monitoring global news media, identifying the people, locations, counts, themes, emotions, narratives, events and patterns driving global society. Working closely with governments, media organizations, think tanks, academics, NGOs and ordinary citizens, GDELT has been steadily building a high resolution catalog of the world’s local media, much of which is in a language other than English. During the Ebola outbreak last year, GDELT actually monitored many of the earliest warning signals of the outbreak in local media, but was unable to translate the majority of that material. This led to a unique initiative over the last half year to attempt to build a system to literally live-translate the world’s news media in real-time.
Beginning in Fall 2013 under a grant from Google Translate for Research, the GDELT Project began an early trial of what it might look like to try and mass-translate the world’s news media on a real-time basis. Each morning all news coverage monitored by the Portuguese edition of Google News was fed through Google Translate until the daily quota was exhausted. The results were extremely promising: over 70% of the activities mentioned in the translated Portuguese news coverage were not found in the English-language press anywhere in the world (a manual review process was used to discard incorrect translations to ensure the results were not skewed by translation error). Moreover, there was a 16% increase in the precision of geographic references, moving from “rural Brazil” to actual city names.
The tremendous success of this early pilot led to extensive discussions over more than a year with the commercial and academic machine-translation communities on how to scale this approach upwards to be able to translate all accessible global news media in real-time across every language. One of the primary reasons that machine translation today is still largely an on-demand experience is the enormous computational power it requires. Translating a document from a language like Russian into English can require hundreds or even thousands of processors to produce a rapid result. Translating the entire planet requires something different: a more adaptive approach that can dynamically adjust the quality of translation based on the volume of incoming material, in a form of “streaming machine translation.”
The final system, called GDELT Translingual, took around two and a half months to build and live-translates all global news media that GDELT monitors in 65 languages in real-time, representing 98.4% of the non-English content it finds worldwide each day. Languages supported include Afrikaans, Albanian, Arabic (MSA and many common dialects), Armenian, Azerbaijani, Bengali, Bosnian, Bulgarian, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, Estonian, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian (Bokmal), Norwegian (Nynorsk), Persian, Polish, Portuguese (Brazilian), Portuguese (European), Punjabi, Romanian, Russian, Serbian, Sinhalese, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu and Vietnamese.
Just as we digitize books and use speech synthesis to create spoken editions for the visually impaired, we can use machine translation to provide versions of those digitized books in other languages. Imagine a speaker of a relatively uncommon language suddenly being able to use mass translation to access the entire collections of a library and even to search across all of those materials in their native language. In the case of a legal, medical or other high-importance text, one would not want to trust the raw machine translation on its own, but at the very least such a process could be used to help a patron locate a specific paragraph of interest, making it much easier for a bilingual speaker to assist further. For more informal information needs, patrons might even be able to consume the machine translated copy directly in many cases.
Machine translation may also help improve the ability of human volunteer translation networks to bridge common information gaps. For example, one could imagine an interface where a patron can use machine translation to access any book in their native language regardless of its publication language, and can flag key paragraphs or sections where the machine translation breaks down or where they need help clarifying a passage. These could be dispatched to human volunteer translator networks to translate and offer back those translations to benefit others in the community, perhaps using some of the same volunteer collaborative translation models of the disaster community.
As Online Public Access Catalog software becomes increasingly multilingual, eventually one could imagine an interface that automatically translates a patron’s query from his/her native language into English, searches the catalog, and then returns the results back in that person’s language, prioritizing works in his/her native language, but offering relevant works in other languages as well. Imagine a scholar searching for works on an indigenous tribe in rural Brazil and seeing not just English-language works about that tribe, but also Portuguese and Spanish publications.
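A minimal sketch of that query flow might look like the following. The translation step is stubbed out with a tiny lookup table (a real system would call a translation service such as Google Translate), and the catalog records and title links are invented for illustration.

```python
# Toy catalog records; "language" holds an ISO 639-1 code.
CATALOG = [
    {"title": "To Kill a Mockingbird", "language": "en"},
    {"title": "Matar un ruiseñor", "language": "es"},
    {"title": "Ne tirez pas sur l'oiseau moqueur", "language": "fr"},
]

# Stub for the machine-translation step: foreign query -> English.
QUERY_TRANSLATIONS = {"matar un ruiseñor": "to kill a mockingbird"}

# Editions of the same work, linked under an English key.
TITLE_LINKS = {
    "to kill a mockingbird": [
        "To Kill a Mockingbird",
        "Matar un ruiseñor",
        "Ne tirez pas sur l'oiseau moqueur",
    ],
}

def search(query, patron_language):
    """Translate the query, search the catalog, and return results
    with works in the patron's own language listed first."""
    english = QUERY_TRANSLATIONS.get(query.lower(), query.lower())
    titles = TITLE_LINKS.get(english, [])
    hits = [r for r in CATALOG if r["title"] in titles]
    hits.sort(key=lambda r: r["language"] != patron_language)
    return hits
```

A Spanish-speaking patron searching “Matar un ruiseñor” would see the Spanish edition first, followed by the English and French editions rather than an empty results page.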
Much of this lies in the user interface and in making language a more transparent part of the library experience. Indeed, as live spoken-to-spoken translation like Skype’s becomes more common, perhaps eventually patrons will be able to interact with library staff using a Star Trek-like universal translator. As machine translation technology improves and as libraries focus more on multilingual issues, such efforts also have the potential to increase visibility of non-English works for English speakers, countering the heavily Western-centric focus of much of the available information on the non-Western world.
Finally, it is important to note that language is not the only barrier to information access. The increasing fragility and ephemerality of information, especially journalism, poses a unique risk to our understanding of local events and perspectives. While the Internet has made it possible for even the smallest news outlet to reach a global audience, it has also placed journalists at far greater risk of being silenced by those who oppose their views. In the era of digitally published journalism, so much of our global heritage is at risk of disappearing at the pen stroke of an offended government, at gunpoint by masked militiamen, by regretful combatants or even through anonymized computer attacks. A shuttered print newspaper will live on in library archives, but a single unplugged server can permanently silence years of journalism from an online-only newspaper.
In perhaps the single largest program to preserve the online journalism of the non-Western world, each night the GDELT Project sends a complete list of the URLs of all electronic news coverage it monitors to the Internet Archive under its “No More 404” program, where they join the Archive’s permanent index of more than 400 billion web pages. While this is just a first step towards preserving the world’s most vulnerable information, it is our hope that this inspires further development in archiving high-risk material from the non-Western and non-English world.
We have finally reached a technological junction where automated tools and human volunteers are able to take the first, albeit imperfect, steps towards mass translation of the world’s information at ever-greater scales and speeds. Just as the internet reduced geographic boundaries in accessing the world’s information, one can only imagine the possibilities of a world in which a single search can reach across all of the world’s information in all the world’s languages in real-time.
Machine translation has truly come of age to a point where it can robustly translate foreign news coverage into English, feed that material into automated data mining algorithms and yield substantially enhanced coverage of the non-Western world. As such tools gradually make their way into the library environment, they stand poised to profoundly reshape the role of language in the access and consumption of our world’s information. Among the many ways that big data is changing our society, its empowerment of machine translation is bridging traditional distances of geography and language, bringing us ever-closer to the notion of a truly global society with universal access to information.
The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.
Recently, I talked with Kristen Regina, Head of Archives and Special Collections at the Hillwood Estate, Museum and Gardens in northwest Washington, and Jaime McCurry, Digital Assets Librarian, about workflows and issues for web archiving, an activity they are exploring. What could I tell them based on LC’s experiences?
Hillwood Estate, Museum and Gardens opened as a public museum in 1977. It is the former residence of American businesswoman, socialite, philanthropist and collector Marjorie Merriweather Post, and home to one of the most comprehensive collections of Russian imperial art outside of Russia, a distinguished 18th-century French decorative art collection and twenty-five acres of landscaped gardens and natural woodlands.
According to Kristen and Jaime, Hillwood has identified digital stewardship as an area of great importance, both in its strategic planning efforts and in day-to-day operations. This fresh focus has supported the institution’s recent migration to a new digital asset management system, the continuation of digital partnerships such as Hillwood’s participation in the Google Art Project to encourage access to its rich digital resources, and moving forward, the exploration and creation of a well-rounded web archiving program.
As Kristen explained, the team has three specific activities in mind for web archiving:
Archive Hillwood’s online presence, in particular its own website, http://www.hillwoodmuseum.org. The site would be archived on a regular basis to support traditional archival efforts related to the museum and its ongoing operational activities. This aligns with the usual reasons that an organization of this type would keep copies of brochures, publications, reports and so on that are provided to the public about the organization.
Targeted harvesting of listings or digital catalogues of materials in scope for the Hillwood collections on the websites of dealers and auction houses such as Sotheby’s or Christie’s. Again, this mirrors the collecting of analogous paper materials.
Harvesting on a continuing or one-time basis of sites (or more often parts of sites) of peer institutions, particularly in Russia, and web-based publications about Hillwood or topics relevant to its collections or collecting priorities.
One could come up with any number of challenges for each of these activities, but thinking about them after our meeting, I was struck that each poses distinctly different challenges, and the best solution for one might not contribute to solving either of the other two. This was in contrast to my usual thinking about web archiving problem solving.
Archiving the organization’s site: This is a fairly typical activity for many organizations nowadays and can easily be arranged with a vendor who will periodically “crawl” the organization’s web site from top to bottom, capturing as much of the site as is technically possible by following browsable links. This can be done at whatever frequency is desired. The “traditional” approach, however, is to do a complete harvest of the site each time. Depending on the frequency of revision to the site overall, this can be a considerable amount of effort to make a copy of something that has only slightly changed since the earlier crawl. (The resulting files are de-duped, so at least duplicated copies of the site materials are not stored.) At the same time, a portion of the site might have any number of changes that would be missed between scheduled full harvests.
The solution in this situation is to have a completely different approach and to contract with an organization that can harvest those pages when changes are made on the basis of an RSS-like notification. In the case of a web site that is mostly unchanged over time, this would be much more economical, and yet at the same time would allow the assurance to organization management that the question, “what did our site look like on date X?” can be answered accurately in the future. A full crawl could be attempted once a year as a baseline.
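The store-only-what-changed idea underlying both de-duplication and change-driven harvesting can be sketched with a content hash. This is an illustration of the concept rather than any particular vendor's implementation; the URLs and page contents are invented stand-ins for fetched HTML.

```python
import hashlib

# url -> list of (content_hash, date, content) snapshots
archive = {}

def harvest(url, content, date):
    """Archive this capture only if the page content actually changed
    since the previous snapshot; return True if stored."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    snapshots = archive.setdefault(url, [])
    if snapshots and snapshots[-1][0] == digest:
        return False  # unchanged since the last crawl: nothing stored
    snapshots.append((digest, date, content))
    return True
```

A page that is crawled monthly but revised once a year would thus produce two stored snapshots rather than twelve, while a change notification could trigger a harvest the day a revision appears.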
Targeted crawling of certain types of materials from dealers and auction houses, complementing what other groups such as the New York Art Resources Consortium are collecting: To my mind, this kind of crawling presents a completely different challenge. Hillwood has a relatively narrow and specific collecting profile and most relevant auction houses will have much broader scope. If we assume that Hillwood (or other museums or cultural heritage institutions for that matter) would not want to harvest entire sites and then “throw away” what they don’t need, then the likely solution lies in collaborative effort by collecting institutions. Collaboration is a theme that seems to be gaining traction in discussions of web archiving, which is probably good, but at the same time it presents an entirely different set of challenges.
Harvesting sites of peer institutions, particularly in Russia, and web-based publishing about Hillwood or topics relevant to its collections: These activities seem closer to “traditional” web archiving as I think of it, but are still challenging for a small organization with a small staff. Hillwood’s focus will rarely align directly with other institutions’, so “scoping” a crawl of another organization’s site to acquire just the relevant materials, rather than the entire organizational site, would often be tricky. In addition, there is the ongoing challenge of identifying what these sites and materials might be, which requires staff attention. In fact, this seems the greatest challenge here: finding the staff time to identify the sites, scope them properly and later do the quality assurance review of the results.
Having worked on web archiving collection-building at the Library of Congress for about five years, I am increasingly struck by the singular nature of the web archiving tools. Perhaps this is reflective of the relatively youthful nature of the activity, perhaps it reflects a certain gratitude that there are tools that at least do one thing. But as I look at some of what we at the Library would like to expand our activities to do and talk to people like Kristen and Jaime, I learn about different use cases that lead me to think about different problems, both technical and organizational, than the ones we have focused on so far.
As more and more of the world’s citizen-generated information becomes natively geotagged, we increasingly think of information as being created in space and referring to space, using geography to map conversation, target information, and even understand global communicative patterns. Yet, despite the immense power of geotagging, the vast majority of the world’s information does not have native geographic metadata, especially the vast historical archives of text held by libraries. It is not that libraries do not contain spatial information, it is that their rich descriptions of location are expressed in words rather than precise mappable latitude/longitude coordinates. A geotagged tweet can be directly placed on a map, while a textual mention of “a park in Champaign, USA” in a digitized nineteenth century book requires highly specialized “fulltext geocoding” algorithms to identify, disambiguate (determine whether the mention is of Champaign, Illinois or Champaign, Ohio and which park is referred to) and convert textual descriptions of location into mappable geographic coordinates.
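The disambiguation step can be illustrated with a tiny gazetteer lookup: among candidate places sharing a name, prefer the one whose administrative region is mentioned nearby, and otherwise fall back to a population prior. This is a sketch of the general technique only, not GDELT's actual algorithm, and the gazetteer rows are illustrative rather than real data.

```python
# Toy gazetteer: place name -> candidate entries (invented figures).
GAZETTEER = {
    "champaign": [
        {"admin": "illinois", "lat": 40.12, "lon": -88.24, "pop": 88000},
        {"admin": "ohio", "lat": 40.05, "lon": -83.77, "pop": 2000},
    ],
}

def geocode(mention, context_text):
    """Resolve a place-name mention to one gazetteer entry using
    nearby text as a disambiguating cue."""
    candidates = GAZETTEER.get(mention.lower(), [])
    if not candidates:
        return None
    context = context_text.lower()
    for c in candidates:
        if c["admin"] in context:  # e.g. "Champaign, Illinois"
            return c
    # No cue in the surrounding text: fall back to the most
    # populous candidate as a prior.
    return max(candidates, key=lambda c: c["pop"])
```

So “a park in Champaign, Illinois” resolves directly, while the ambiguous “a park in Champaign, USA” falls back to the larger Illinois city.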
Building robust algorithms capable of recognizing mentions of an obscure hilltop or a small rural village anywhere on Earth requires a mixture of state-of-the-art software algorithms and artistic handling of the enormous complexities and nuances of how humans express space in writing. This is made even more difficult by assumptions of shared locality made by content like news media, the mixture of textual and visual locative cues in television, and the inherent transcription error of sources like OCR and closed captioning.
Recognizing location across languages is especially problematic. The majority of textual location mentions on Twitter are in English regardless of the language of the tweet itself. On the other hand, mapping the geography of the world’s news media across 65 languages requires multilingual geocoding that takes into account the enormous complexity of the world’s languages. For example, the extensive noun declension of Estonian means that identifying mentions of “New York” requires recognizing “New York”, “New Yorki”, “New Yorgi”, “New Yorgisse”, “New Yorgis”, “New Yorgist”, “New Yorgile”, “New Yorgil”, “New Yorgilt”, “New Yorgiks”, “New Yorgini”, “New Yorgina”, “New Yorgita”, and “New Yorgiga”. Multiply that by the more than 10 million recognized locations on Earth across 65 languages and one can imagine the difficulty of recognizing textual geography.
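One pragmatic way to handle such morphology (an illustration, not necessarily the method GDELT uses) is to expand each gazetteer entry into its declined surface forms ahead of time and match any of them in text. For Estonian, the nominative and partitive forms plus the genitive stem with each case ending reproduce the fourteen variants listed above.

```python
# Estonian case endings attached to the genitive stem; the empty
# string yields the bare genitive itself.
ESTONIAN_SUFFIXES = ["", "sse", "s", "st", "le", "l", "lt",
                     "ks", "ni", "na", "ta", "ga"]

def estonian_forms(nominative, genitive_stem, partitive):
    """All surface forms of a place name: nominative, partitive,
    and the genitive stem plus each case ending."""
    forms = {nominative, partitive}
    forms.update(genitive_stem + suffix for suffix in ESTONIAN_SUFFIXES)
    return forms

NEW_YORK_FORMS = estonian_forms("New York", "New Yorgi", "New Yorki")

def find_mention(text):
    """Return the longest declined form present in the text, if any,
    so 'New Yorgisse' is reported rather than its prefix 'New Yorgi'."""
    for form in sorted(NEW_YORK_FORMS, key=len, reverse=True):
        if form in text:
            return form
    return None
```

In production such expanded forms would be compiled into a single large pattern-matching structure covering every gazetteer entry, but the precomputation idea is the same.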
For the past decade much of my work has centered on this intersection of location and information across languages and modalities, exploring the geography of massive textual archives through the twin lenses of the locations they describe and the impact of location on the production and consumption of information. A particular emphasis of my work has been expanding the study of textual geography to new modalities and transitioning the field from small human studies to at-scale computational explorations.
Over the past five years my studies have included the first large-scale explorations of the textual geography of news media, social media, Wikipedia, television, academic literature, and the open web, as well as the first large-scale comparisons of geotagging versus textual description of location in citizen media and the largest work on multilingual geocoding. The remainder of this blog post will share many of the lessons I have learned from these projects and the implications and promise they hold for the future of making the geography of library holdings more broadly available in spatial form.
In the early 2000s, while an undergraduate student at the National Center for Supercomputing Applications, I launched an early open cloud geocoding and GIS platform that provided a range of geospatial services through a simple web interface and cloud API. The intense interest in the platform and the incredible variety of applications that users found for the geocoding API foreshadowed the amazing creativity of the open data community in mashing up geographic APIs and data. Over the following several years I undertook numerous small-scale studies of textual geography to explore how such information could be extracted and utilized to better understand various kinds of information behavior.
Some of my early papers include a 2005 study of the geographic focus and ownership of news and websites covering climate change and carbon sequestration (PDF) that demonstrated the importance of the dual role of the geography of content and consumer. In 2006 I co-launched a service that enabled spatial search of US Government funding opportunities (PDF), including alerts of new opportunities relating to specific locations. This reinforced the importance of location in information relevance: a contract to install fire suppression sprinklers in Omaha, Nebraska is likely of little interest to a small business in Miami, Florida, yet traditional keyword search does not contemplate the concept of spatial relevance.
Similarly, in 2009 I explored the impact of a news outlet’s physical location on the Drudge Report’s sourcing behavior and in 2010 examined the impact of a university’s physical location on its national news stature. These twin studies, examining the impact of physical location on news outlets and on newsmakers, emphasized the highly complex role that geography plays in mediating information access, availability, and relevance.
In Fall 2011 I published the first of what has become a half-decade series of studies expanding the application of textual geography to ever-larger and more diverse collections of material. That September, Culturomics 2.0 became the first large-scale study of the geography of the world’s news media, identifying all mentions of location across more than 100 million news articles stretching across half a century.
A key finding was the centrality of geography to journalism: on average a location is mentioned every 200-300 words in a typical news article and this has held relatively constant for over 60 years. Another finding was that mapping the locations most closely associated with a public figure (in this case Osama Bin Laden) offers a strong estimate of that person’s actual location, while the network structure of which locations more frequently co-occur with each other yields powerful insights into perceptions of cultural and societal boundaries.
The following Spring I collaborated with supercomputer vendor SGI to conduct the first holistic exploration of the textual geography of Wikipedia. Wikipedia allows contributors to include precise latitude/longitude coordinates in articles, but because such coordinates must be manually entered in specialized code, just 4% of English-language articles had at least one coordinate as of 2012, totaling just 1.1 million coordinates, primarily centered on the US and Western Europe. In contrast, 59% of English-language articles had at least one textual mention of a recognized location, totaling more than 80.7 million mentions of 2.8 million distinct places on Earth.
In essence, the majority of contributors to Wikipedia appear more comfortable writing the word “London” in an article than looking up its centroid latitude/longitude and entering it in specialized code. This has significant implications for how libraries leverage volunteer citizen geocoding efforts in their collections.
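What gazetteer-based geocoding automates is exactly this: turning the word "London" into its centroid coordinates. A minimal sketch of the idea, with an illustrative two-entry gazetteer; real systems draw on multi-million-entry gazetteers and the disambiguation logic discussed later in this post.

```python
# Illustrative two-entry gazetteer mapping place names to centroids.
GAZETTEER = {
    "london": (51.5074, -0.1278),
    "paris": (48.8566, 2.3522),
}

def geocode_mentions(text):
    """Return (place, lat, lon) for each gazetteer name found in the text."""
    found = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?")  # shed trailing punctuation
        if word in GAZETTEER:
            lat, lon = GAZETTEER[word]
            found.append((word, lat, lon))
    return found

mentions = geocode_mentions(
    "The treaty was signed in London, then ratified in Paris.")
```

The contributor writes prose; the coordinates come for free.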
To explore how such information could be used to provide spatial search for large textual collections, a prototype Google Earth visualization was built to search Wikipedia’s coverage of Libya. A user could select a specific time period and instantly access a map of every location in Libya mentioned across all of Wikipedia with respect to that time period.
Finally, a YouTube video was created that visualizes world history 1800-2012 through the eyes of Wikipedia by combining the 80 million textual location mentions in the English Wikipedia with the 40 million date references to show which locations were mentioned together in an article with respect to a given year. Links were color-coded red for connections with a negative tone (usually indicating physical conflict like war) or green for connections with a positive tone.
Twitter offered a unique opportunity to compare textual and sensor-based geographies: at the time, few concrete details were available regarding Twitter’s geographic footprint, and the majority of social media maps focused on the roughly 2% of tweets that are natively geotagged with precise GPS or cellular triangulation coordinates. Coupled with the very high correlation between electricity availability and geotagged tweets, those geotags offer a unique ground truth of users’ actual confirmed locations against which to compare different approaches to geocoding the textual location cues found in many of the other 98% of tweets.
A key finding was that two-thirds of those 2% of tweets that are geotagged were sent by just 1% of all users, meaning that geotagged information on Twitter is extremely skewed. Another finding was that, across the world, location is primarily expressed in English regardless of the language a user tweets in, and that 34% of tweets have recoverable high-resolution textual locations. From a communicative standpoint, roughly half of tweets concern local events and half of directed tweets target physically nearby users, with the remainder concerning global events or users elsewhere in the world, suggesting that geographic proximity plays only a minor role in communication patterns on broadcast media like Twitter.
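The skew finding is easy to reproduce on synthetic data: sort users by geotagged-tweet count and measure the share of volume contributed by the most active 1%. The counts below are invented purely to mimic the pattern, not drawn from the study.

```python
def top_share(counts, fraction=0.01):
    """Share of total volume contributed by the top `fraction` of users."""
    ranked = sorted(counts, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

# 100 synthetic users: one prolific geotagger and 99 occasional ones.
counts = [200] + [1] * 99
share = top_share(counts)  # the single top-1% user contributes 200/299
```

When a handful of power users dominate the geotagged sample, maps built from geotags alone reflect their habits rather than the population's.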
A common pattern that emerges across both Wikipedia and Twitter is that even when native geotagging is available, the vast majority of location metadata resides in textual descriptions rather than precise GIS-friendly numeric coordinates. This is the case even when geotagging is made transparent and automatic through GPS tagging on mobile devices.
In Spring 2013 I launched the GDELT Project, which extends my earlier work on the geography of the news media by offering a live metadata firehose that geocodes global news media on a daily basis. That Fall I collaborated with Roger Macdonald and the Internet Archive’s Television News Archive to create the first large-scale map of the geography of television news. More than 400,000 hours of closed captioning of American television news, totaling over 2.7 billion words, were geocoded to produce an animated daily map of the geographic focus of television news from 2009-2013.
Closed captioning text proved to be extremely difficult to geocode. Captioning streams are in entirely uppercase letters, riddled with errors like “in two Paris of shoes” and long sequences of gibberish characters, and in some cases have a total absence of punctuation or other boundaries.
This required extensive adaptation of the geocoding algorithms to tolerate an enormous diversity of typographical errors more pathological in nature than those found in OCR’d content – approaches that were later used in creating the first live emotion-controlled television show for NBCUniversal’s Syfy channel. Newscasts also frequently rely on visual on-screen cues such as maps or text overlays for location references, and by their nature incorporate a rapid-fire sequence of highly diverse locations mentioned just sentences apart from each other, making the disambiguation process extremely complex.
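One ingredient of such error tolerance can be sketched with Python's standard-library difflib (a stand-in for the actual algorithms, which are far more sophisticated): lowercase each caption token and fuzzy-match it against a gazetteer so garbled spellings still resolve. The three-entry gazetteer and the 0.8 cutoff are illustrative.

```python
import difflib

# Illustrative three-entry gazetteer; real systems are vastly larger.
GAZETTEER = ["paris", "london", "omaha"]

def match_place(token, cutoff=0.8):
    """Fuzzy-match a noisy caption token against the gazetteer."""
    hits = difflib.get_close_matches(token.lower(), GAZETTEER,
                                     n=1, cutoff=cutoff)
    return hits[0] if hits else None

match_place("LONDON")   # exact match despite all-uppercase captions
match_place("LODNON")   # transposition error still resolves
match_place("SHOES")    # unrelated token is rejected
```

Note that fuzzy matching alone would happily accept the false positive in "in two Paris of shoes"; weeding those out requires the contextual disambiguation described above.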
In Fall 2014 I collaborated with the US Army to create the first large-scale map of the geography of academic literature and the open web, geocoding more than 21 billion words of academic literature spanning the entire contents of JSTOR, DTIC, CORE, CiteSeerX, and the Internet Archive’s 1.6 billion PDFs relating to Africa and the Middle East, as well as a second project creating the first large-scale map of human rights reports. A key focus of this project was the ability to infuse geographic search into academic literature, enabling searches like “find the five most-cited experts who publish on water conflicts with the Nuers in this area of South Sudan” and thematic maps such as a heatmap of the locations most closely associated with food insecurity.
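A thematic map like the food-insecurity heatmap begins with a simple aggregation: count how often each geocoded place co-occurs with a theme keyword. A toy sketch of that tally, with invented sentences and a two-place list standing in for a full gazetteer (the real pipeline operates on billions of words):

```python
from collections import Counter

# Illustrative place list; a real pipeline geocodes against a full gazetteer.
PLACES = {"juba", "malakal"}

def theme_place_counts(sentences, theme):
    """Count how often each place co-occurs in a sentence with the theme."""
    counts = Counter()
    for sentence in sentences:
        words = {w.strip(".,").lower() for w in sentence.split()}
        if theme in words:
            counts.update(words & PLACES)
    return counts

docs = [
    "Drought worsened around Malakal this year.",
    "Juba hosted a trade summit.",
]
tally = theme_place_counts(docs, "drought")
```

Rendered as a heatmap, the counts show which locations the literature most strongly associates with the theme.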
As of Spring 2015 the GDELT Project maps the geography of an ever-growing cross-section of the global news media in realtime across 65 languages. Every 15 minutes it machine translates all of the global news coverage it monitors, from Afrikaans and Albanian to Urdu and Vietnamese, and applies the world’s largest multilingual geocoding system to identify all mentions of location anywhere in the world, from a capital city to a remote hilltop. Over the past several years, GDELT’s mass realtime geocoding of the world’s news media has popularized large-scale automated geocoding, with disciplines from political science to journalism now experimenting with the technology. GDELT’s geocoding capabilities now lie at the heart of numerous initiatives, from cataloging disaster coverage for the United Nations to mapping global conflict with the US Institute of Peace to modeling the patterns of world history.
Most recently, a forthcoming collaboration with cloud mapping platform CartoDB will enable ordinary citizens and journalists to create live interactive maps of the ideas, topics, and narratives pulsing through the global news media using GDELT. The example map below shows the geographic focus of Spanish (green), French (red), Arabic (yellow) and Chinese (blue) news media for a one-hour period from 8-9AM EST on April 1, 2015, placing a colored dot at every location mentioned in the news media of each language. Ordinarily, mapping the geography of language would be an enormous technical endeavor, but by combining GDELT’s mass multilingual geocoding with CartoDB’s interactive mapping, even a non-technical user can create such a map in a matter of minutes. This is a powerful example of what will become possible as libraries increasingly expose the spatial dimension of their collections in data formats that allow them to be integrated into popular mapping platforms. Imagine an amateur historian combining digitized georeferenced historical maps and geocoded nineteenth-century newspaper articles with modern census data to explore how a region has changed over time – these kinds of mashups would be commonplace if the vast archives of libraries were made available in spatial form.
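On the data side, the integration pattern is simple: mapping platforms like CartoDB ingest plain CSV with latitude/longitude columns, so exposing geocoded mentions can be as little as writing rows of language, coordinates, and display color. The mention records below are invented; the color scheme follows the example map described above.

```python
import csv
import io

# Invented geocoded mentions: (language, latitude, longitude).
mentions = [
    ("es", 19.43, -99.13),  # Spanish-language mention of Mexico City
    ("fr", 48.86, 2.35),    # French-language mention of Paris
]
# Color scheme from the example map described above.
COLORS = {"es": "green", "fr": "red", "ar": "yellow", "zh": "blue"}

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["lang", "latitude", "longitude", "color"])
for lang, lat, lon in mentions:
    writer.writerow([lang, lat, lon, COLORS[lang]])
csv_text = buf.getvalue()  # ready to upload to a mapping platform
```

A library exposing its geocoded holdings in formats this simple lets any patron drop them straight onto an interactive map.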
In short, as we begin to peer into the textual holdings of our world’s libraries using massive-scale data mining algorithms like fulltext geocoding, we are for the first time able to look across our collective informational heritage to see macro-level global patterns never before visible. Geography offers a fundamentally new lens through which to observe those patterns, and as libraries increasingly geocode their holdings and make that material available in standard geographic open data formats, they will become conveners of information and innovation, empowering a new era of access to and understanding of our world.