Libraries Looking Across Languages: Seeing the World Through Mass Translation

The following is a guest post by Kalev Hannes Leetaru, Senior Fellow, George Washington University Center for Cyber & Homeland Security. Portions adapted from a post for the Knight Foundation.

Geotagged tweets November 2012 colored by language.


Imagine a world where language is no longer a barrier to information access, where anyone can access real-time information from anywhere in the world in any language, seamlessly translated into their native tongue, and where their voice is equally accessible to speakers of all the world’s languages. Authors from Douglas Adams to Ethan Zuckerman have long articulated such visions of a post-lingual society in which mass translation eliminates barriers to information access and communication. Yet, even as technologies like the web have broken down geographic barriers and increasingly made it possible to access information from anywhere in the world, linguistic barriers mean most of those voices remain steadfastly inaccessible. For libraries, mass human and machine translation of the world’s information offers enormous possibilities for broadening access to their collections. In turn, greater interest in the non-Western and non-English world’s information should lead to a greater focus on preserving it, akin to what has been done for Western online news and television.

There have been many attempts to make information accessible across language barriers using both human and machine translation. During the 2013 Egyptian uprising, Twitter launched live machine translation of Arabic-language tweets from select political leaders and news outlets, an experiment which it expanded for the World Cup in 2014 and made permanent this past January with its official “Tweet translation” service. Facebook launched its own machine translation service in 2011, while Microsoft recently unveiled live spoken translation for Skype. Turning to human translators, Wikipedia’s Content Translation program combines machine translation with human correction in its quest to translate Wikipedia into every modern language, and TED’s Open Translation Project has brought together 20,000 volunteers to translate 70,000 speeches into 107 languages since 2009. Even the humanitarian space now routinely leverages volunteer networks to mass translate aid requests during disasters, while mobile games increasingly combine machine and human translation to create fully multilingual chat environments.

Yet, these efforts have substantial limitations. Twitter and Facebook’s on-demand model translates content only as it is requested, meaning a user must discover a given post, know it is of possible relevance, explicitly request that it be translated and wait for the translation to become available. Wikipedia and TED attempt to address this by pre-translating material en masse, but their reliance on human translators and all-volunteer workflows imposes long delays before material becomes available.

Journalism has experimented only haltingly with large-scale translation. Notable successes such as Project Lingua, Yeeyan.org and Meedan.org focus on translating news coverage for citizen consumption, while journalist-directed efforts such as Andy Carvin’s crowd-sourced translations are still largely regarded as isolated novelties. Even the U.S. government’s foreign press monitoring agency draws nearly half its material from English-language outlets to minimize translation costs. At the same time, its counterterrorism division monitoring the Lashkar-e-Taiba terrorist group remarks of the group’s communications, “most of it is in Arabic or Farsi, so I can’t make much of it.”

Libraries have explored translation primarily as an outreach tool rather than as a gateway to their collections. Facilities with large percentages of patrons speaking languages other than English may hire bilingual staff, increase their collections of materials in those languages and hold special events in those languages. The Denver Public Library offers a prominent link right on its homepage to its tailored Spanish-language site that includes links to English courses, immigration and citizenship resources, job training and support services. Instead of merely translating their English site into Spanish wording, they have created a completely customized parallel information portal. However, searches of their OPAC in Spanish will still only return works with Spanish titles: a search for “Matar un ruisenor” will return only the single Spanish translation of “To Kill a Mockingbird” in their catalog.

On the one hand, this makes sense, since a search for a Spanish title likely indicates an interest in a Spanish edition of the book, but if no Spanish copy is available, it would be useful to at least notify the patron of copies in other languages in case that patron can read any of those other languages. Other sites like the Fort Vancouver Regional Library District use the Google Translate widget to perform live machine translation of their site. This has the benefit that when searching the library catalog in English, the results list can be viewed in any of Google Translate’s 91 languages. However, the catalog itself must still be searched in English or the language that the book title is published in, so this only solves part of the problem.

In fact, the lack of available content for most of the world’s languages was identified in the most recent Internet.org report (PDF) as being one of the primary barriers to greater connectivity throughout the world. Today there are nearly 7,000 languages spoken throughout the world of which 99.7% are spoken by less than 1% of the world’s population. By some measures, just 53% of the Earth’s population has access to measurable online content in their primary language and almost a billion people speak languages for which no Wikipedia content is available. Even within a single country there can be enormous linguistic variety: India has 425 primary languages and Papua New Guinea has 832 languages spoken within its borders. As ever-greater numbers of these speakers join the online world, even English speakers are beginning to experience linguistic barriers: as of November 2012, 60% of tweets were in a language other than English.

Web companies from Facebook and Twitter to Google and Microsoft are increasingly turning to machine translation to offer real-time access to information in other languages. Anyone who has used Google or Microsoft Translate is familiar with the concept of machine translation and both its enormous potential (transparently reading any document in any language) and current limitations (many translated documents being barely comprehensible). Historically, machine translation systems were built through laborious manual coding, in which a large team of linguists and computer programmers sat down and literally hand-programmed how every single word and phrase should be translated from one language to another. Such models performed well on perfectly grammatical formal text, but often struggled with the fluid informal speech characterizing everyday discourse. Most importantly, the enormous expense of manually programming translation rules for every word and phrase and all of the related grammatical structures of both the input and output language meant that translation algorithms were built for only the most widely-used languages.

Advances in computing power over the past decade, however, have led to the rise of “statistical machine translation” (SMT) systems. Instead of humans hand-programming translation rules, SMT systems examine large corpora of material that have been human-translated from one language to another and learn which words from one language correspond to those in the other language. For example, an SMT system would determine that when it sees “dog” in English it almost always sees “chien” in French, but when it sees “fan” in English, it must look at the surrounding words to determine whether to translate it into “ventilateur” (electric fan) or “supporter” (sports fan).
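The idea of learning word correspondences from human-translated text can be sketched in a few lines. The example below is a minimal illustration, not how Google Translate or any production system works: it runs the classic IBM Model 1 expectation-maximization procedure (without a NULL word) over a made-up three-sentence English-French corpus. The corpus and the function names `train_word_table` and `best_translation` are assumptions introduced only for this sketch, and it does not attempt the context-dependent choice described for “fan.”

```python
from collections import defaultdict


def train_word_table(pairs, iterations=10):
    """Estimate t(f | e), the probability that English word e translates to
    French word f, using IBM Model 1 EM on tokenized sentence pairs."""
    # Start every co-occurring (f, e) pair with the same weight.
    t = {(f, e): 1.0 for eng, fre in pairs for e in eng for f in fre}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of e aligning to f
        total = defaultdict(float)  # expected count of e aligning to anything
        for eng, fre in pairs:
            for f in fre:
                norm = sum(t[(f, e)] for e in eng)
                for e in eng:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # Re-estimate the probabilities from the expected counts.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t


def best_translation(table, english_word):
    """Return the French word with the highest learned probability."""
    candidates = {f: p for (f, e), p in table.items() if e == english_word}
    return max(candidates, key=candidates.get)


# Toy parallel corpus (an assumption for illustration only).
corpus = [
    ("the dog".split(), "le chien".split()),
    ("the cat".split(), "le chat".split()),
    ("a dog".split(), "un chien".split()),
]
table = train_word_table(corpus)
print(best_translation(table, "dog"))
print(best_translation(table, "cat"))
```

Even on three sentences, the repeated pairing of “dog” with “chien” pulls probability toward that pair, which is the same statistical signal a real system extracts from millions of sentences.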

Such translation systems require no human intervention – just a large library of bilingual texts as input. United Nations and European Union legal texts are often used as input given that they are carefully hand translated into each of the major European languages. The ability of SMT systems to rapidly create new translation models on-demand has led to an explosion in the number of languages supported by machine translation systems over the last few years, with Google Translate translating to/from 91 languages as of April 2015.

What would it look like if one simply translated the entirety of the world’s information in real-time using massive machine translation? For the past two years the GDELT Project has been monitoring global news media, identifying the people, locations, counts, themes, emotions, narratives, events and patterns driving global society. Working closely with governments, media organizations, think tanks, academics, NGOs and ordinary citizens, GDELT has been steadily building a high resolution catalog of the world’s local media, much of which is in a language other than English. During the Ebola outbreak last year, GDELT actually monitored many of the earliest warning signals of the outbreak in local media, but was unable to translate the majority of that material. This led to a unique initiative over the last half year to attempt to build a system to literally live-translate the world’s news media in real-time.

In Fall 2013, under a grant from Google Translate for Research, the GDELT Project began an early trial of what it might look like to mass-translate the world’s news media on a real-time basis. Each morning all news coverage monitored by the Portuguese edition of Google News was fed through Google Translate until the daily quota was exhausted. The results were extremely promising: over 70% of the activities mentioned in the translated Portuguese news coverage were not found in the English-language press anywhere in the world (a manual review process was used to discard incorrect translations to ensure the results were not skewed by translation error). Moreover, there was a 16% increase in the precision of geographic references, moving from “rural Brazil” to actual city names.

The tremendous success of this early pilot led to extensive discussions over more than a year with the commercial and academic machine-translation communities on how to scale this approach upwards to be able to translate all accessible global news media in real-time across every language. One of the primary reasons that machine translation today is still largely an on-demand experience is the enormous computational power it requires. Translating a document from a language like Russian into English can require hundreds or even thousands of processors to produce a rapid result. Translating the entire planet requires something different: a more adaptive approach that can dynamically adjust the quality of translation based on the volume of incoming material, in a form of “streaming machine translation.”
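One way to picture this kind of volume-driven adjustment is a scheduler that lowers a translation “effort” setting when the backlog grows and raises it when things are quiet. The sketch below is purely illustrative and is not a description of GDELT’s actual design: the function `choose_effort`, the integer effort scale, the 15-minute window and the assumption that cost grows linearly with effort are all hypothetical.

```python
def choose_effort(pending_docs, budget_ms, cost_ms_at_effort_1=50,
                  min_effort=1, max_effort=200):
    """Return the highest effort level whose estimated cost fits the budget.

    Assumption (hypothetical): translating one document takes roughly
    cost_ms_at_effort_1 * effort milliseconds, so doubling the effort
    doubles the time spent per document.
    """
    if pending_docs <= 0:
        return max_effort  # nothing waiting, so use the best quality
    affordable = budget_ms // (pending_docs * cost_ms_at_effort_1)
    # Clamp to the supported range so a flood of material still gets
    # translated, just at the cheapest setting.
    return max(min_effort, min(max_effort, affordable))


window_ms = 15 * 60 * 1000  # hypothetical 15-minute processing window
for backlog in (10, 100, 100_000):
    print(backlog, choose_effort(backlog, window_ms))
```

The design choice worth noting is the clamp: under this scheme a surge in incoming material never causes documents to be skipped, only translated more roughly.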

Figure 10 - Geographic focus of world's news media by language 8-9AM EST on April 1, 2015 (Green = locations mentioned in Spanish media, Red = French media, Yellow = Arabic media, Blue = Chinese media).


The final system, called GDELT Translingual, took around two and a half months to build and live-translates all global news media that GDELT monitors in 65 languages in real-time, representing 98.4% of the non-English content it finds worldwide each day. Languages supported include Afrikaans, Albanian, Arabic (MSA and many common dialects), Armenian, Azerbaijani, Bengali, Bosnian, Bulgarian, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, Estonian, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian (Bokmal), Norwegian (Nynorsk), Persian, Polish, Portuguese (Brazilian), Portuguese (European), Punjabi, Romanian, Russian, Serbian, Sinhalese, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu and Vietnamese.

Building the system didn’t require starting from scratch, as there is an incredible wealth of open tools and datasets available to support all of the pieces of the machine translation pipeline. Open source building blocks utilized include the Moses toolkit and a number of translation models contributed by researchers in the field; the Google Chrome Compact Language Detector; 22 different WordNet datasets; multilingual resources from the GEOnet Names Server, Wikipedia, the Unicode Common Locale Data Repository; word segmentation algorithms for Chinese, Japanese, Thai and Vietnamese; and countless other tools. Much of the work lay in how to integrate all of the different components and constructing some of the key unique new elements and architectures to enable the system to scale to GDELT’s needs. A more detailed technical description of the final architecture, tools, and datasets used in the creation of GDELT Translingual is available on the GDELT website.

Just as we digitize books and use speech synthesis to create spoken editions for the visually impaired, we can use machine translation to provide versions of those digitized books in other languages. Imagine a speaker of a relatively uncommon language suddenly being able to use mass translation to access the entire collections of a library and even to search across all of those materials in their native language. In the case of a legal, medical or other high-importance text, one would not want to trust the raw machine translation on its own, but at the very least such a process could be used to help a patron locate a specific paragraph of interest, making it much easier for a bilingual speaker to assist further. For more informal information needs, patrons might even be able to consume the machine translated copy directly in many cases.

Machine translation may also help improve the ability of human volunteer translation networks to bridge common information gaps. For example, one could imagine an interface where a patron can use machine translation to access any book in their native language regardless of its publication language, and can flag key paragraphs or sections where the machine translation breaks down or where they need help clarifying a passage. These could be dispatched to human volunteer translator networks to translate and offer back those translations to benefit others in the community, perhaps using some of the same volunteer collaborative translation models of the disaster community.

As Online Public Access Catalog software becomes increasingly multilingual, eventually one could imagine an interface that automatically translates a patron’s query from his/her native language into English, searches the catalog, and then returns the results back in that person’s language, prioritizing works in his/her native language, but offering relevant works in other languages as well. Imagine a scholar searching for works on an indigenous tribe in rural Brazil and seeing not just English-language works about that tribe, but also Portuguese and Spanish publications.
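The search flow imagined here (translate the query, search a shared index, rank native-language editions first) can be sketched as follows. Everything in this example is hypothetical: the `Record` fields, the tiny `QUERY_TO_ENGLISH` lookup table standing in for a real machine translation call, and the sample catalog are assumptions made for illustration, not a description of any existing OPAC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    title: str          # title as published
    english_title: str  # English title used as the shared search key
    language: str       # language code of this edition


# Hypothetical stand-in for a machine translation service.
QUERY_TO_ENGLISH = {"matar un ruisenor": "to kill a mockingbird"}


def translate_to_english(query, patron_language):
    """Translate a query into English, falling back to the query itself."""
    query = query.lower()
    if patron_language == "en":
        return query
    return QUERY_TO_ENGLISH.get(query, query)


def search(catalog, query, patron_language):
    """Return matching records, editions in the patron's language first."""
    key = translate_to_english(query, patron_language)
    hits = [r for r in catalog if key in r.english_title.lower()]
    # sorted() is stable, so catalog order is kept within each group.
    return sorted(hits, key=lambda r: r.language != patron_language)


CATALOG = [
    Record("To Kill a Mockingbird", "To Kill a Mockingbird", "en"),
    Record("Matar un ruisenor", "To Kill a Mockingbird", "es"),
    Record("Dune", "Dune", "en"),
]

for record in search(CATALOG, "Matar un ruisenor", "es"):
    print(record.language, record.title)
```

In this sketch a Spanish-language search surfaces the Spanish edition first and the English edition after it, and a title held only in English would still be returned rather than producing an empty result list. Translating the result list back into the patron’s language is left out for brevity.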

Much of this lies in the user interface and in making language a more transparent part of the library experience. Indeed, as live spoken-to-spoken translation like Skype’s becomes more common, perhaps eventually patrons will be able to interact with library staff using a Star Trek-like universal translator. As machine translation technology improves and as libraries focus more on multilingual issues, such efforts also have the potential to increase visibility of non-English works for English speakers, countering the heavily Western-centric focus of much of the available information on the non-Western world.

Finally, it is important to note that language is not the only barrier to information access. The increasing fragility and ephemerality of information, especially journalism, poses a unique risk to our understanding of local events and perspectives. While the Internet has made it possible for even the smallest news outlet to reach a global audience, it has also placed journalists at far greater risk of being silenced by those who oppose their views. In the era of digitally published journalism, so much of our global heritage is at risk of disappearing at the pen stroke of an offended government, at gunpoint by masked militiamen, by regretful combatants or even through anonymized computer attacks. A shuttered print newspaper will live on in library archives, but a single unplugged server can permanently silence years of journalism from an online-only newspaper.

In perhaps the single largest program to preserve the online journalism of the non-Western world, each night the GDELT Project sends a complete list of the URLs of all electronic news coverage it monitors to the Internet Archive under its “No More 404” program, where they join the Archive’s permanent index of more than 400 billion web pages. While this is just a first step towards preserving the world’s most vulnerable information, it is our hope that this inspires further development in archiving high-risk material from the non-Western and non-English world.

We have finally reached a technological junction where automated tools and human volunteers are able to take the first, albeit imperfect, steps towards mass translation of the world’s information at ever-greater scales and speeds. Just as the internet reduced geographic boundaries in accessing the world’s information, one can only imagine the possibilities of a world in which a single search can reach across all of the world’s information in all the world’s languages in real-time.

Machine translation has truly come of age to a point where it can robustly translate foreign news coverage into English, feed that material into automated data mining algorithms and yield substantially enhanced coverage of the non-Western world. As such tools gradually make their way into the library environment, they stand poised to profoundly reshape the role of language in the access and consumption of our world’s information. Among the many ways that big data is changing our society, its empowerment of machine translation is bridging traditional distances of geography and language, bringing us ever-closer to the notion of a truly global society with universal access to information.
