Mapping Words: Lessons Learned From a Decade of Exploring the Geography of Text

The following is a guest post by Kalev Hannes Leetaru, Senior Fellow, George Washington University Center for Cyber & Homeland Security.

It is hard to imagine our world today without maps. Though not the first online mapping platform, the debut of Google Maps a decade ago profoundly reshaped the role of maps in everyday life, popularizing the concept of organizing information in space. When Flickr unveiled image geotagging in 2006, more than 1.2 million photos were geotagged in the first 24 hours. In August 2009, with the launch of geotagged tweets, Twitter announced that organizing posts according to location would usher in a new era of spatial serendipity, allowing users to “switch from reading the tweets of accounts you follow to reading tweets from anyone in your neighborhood or city–whether you follow them or not.”

As more and more of the world’s citizen-generated information becomes natively geotagged, we increasingly think of information as being created in space and referring to space, using geography to map conversation, target information, and even understand global communicative patterns. Yet, despite the immense power of geotagging, the vast majority of the world’s information has no native geographic metadata, especially the vast historical archives of text held by libraries. It is not that libraries lack spatial information; it is that their rich descriptions of location are expressed in words rather than precise, mappable latitude/longitude coordinates. A geotagged tweet can be placed directly on a map, while a textual mention of “a park in Champaign, USA” in a digitized nineteenth-century book requires highly specialized “fulltext geocoding” algorithms to identify the mention, disambiguate it (determine whether it refers to Champaign, Illinois or Champaign, Ohio, and which park is meant), and convert the textual description of location into mappable geographic coordinates.
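The identify/disambiguate/convert pipeline can be sketched as a toy gazetteer lookup. Everything below is an illustrative stand-in, not the actual algorithms used in this work: the tiny gazetteer, the populations, and the context heuristic (prefer a candidate whose state is mentioned nearby, else fall back to the larger place) are assumptions of this sketch.

```python
# A minimal sketch of gazetteer-based fulltext geocoding. The place
# entries and populations below are illustrative, not real gazetteer data.
GAZETTEER = {
    "champaign": [
        {"name": "Champaign", "admin": "Illinois", "country": "US",
         "lat": 40.1164, "lon": -88.2434, "population": 88000},
        {"name": "Champaign", "admin": "Ohio", "country": "US",
         "lat": 40.1376, "lon": -83.7697, "population": 38000},
    ],
}

def geocode_mention(mention, context=""):
    """Resolve a textual place mention to a gazetteer record.

    Disambiguation heuristic (a simplification for this sketch): prefer
    a candidate whose admin region appears in the surrounding text;
    otherwise fall back to the most populous candidate.
    """
    candidates = GAZETTEER.get(mention.lower())
    if not candidates:
        return None
    for cand in candidates:
        if cand["admin"].lower() in context.lower():
            return cand
    return max(candidates, key=lambda c: c["population"])

# "a park in Champaign, USA" names no state, so population decides:
print(geocode_mention("Champaign", "a park in Champaign, USA"))
```

A production system must weigh many more signals (other locations in the document, publication provenance, population priors across millions of candidates), but the same lookup-then-disambiguate structure applies.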

Building robust algorithms capable of recognizing mentions of an obscure hilltop or a small rural village anywhere on Earth requires a mixture of state-of-the-art software algorithms and artistic handling of the enormous complexities and nuances of how humans express space in writing. This is made even more difficult by assumptions of shared locality made by content like news media, the mixture of textual and visual locative cues in television, and the inherent transcription error of sources like OCR and closed captioning.

Recognizing location across languages is especially problematic. The majority of textual location mentions on Twitter are in English regardless of the language of the tweet itself. Mapping the geography of the world’s news media across 65 languages, on the other hand, requires multilingual geocoding that takes into account the enormous complexity of the world’s languages. For example, the extensive noun declension of Estonian means that identifying mentions of “New York” requires recognizing “New York”, “New Yorki”, “New Yorgi”, “New Yorgisse”, “New Yorgis”, “New Yorgist”, “New Yorgile”, “New Yorgil”, “New Yorgilt”, “New Yorgiks”, “New Yorgini”, “New Yorgina”, “New Yorgita”, and “New Yorgiga”. Multiply this by the more than 10 million recognized locations on Earth across 65 languages and one can imagine the difficulty of recognizing textual geography.
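A crude way to see the matching problem is a heuristic that accepts a declined surface form when it extends the base name by a short case ending, optionally after consonant gradation. The four-letter suffix limit and the single k → g gradation rule below are simplifications invented for this sketch; a real multilingual geocoder would use full morphological analysis per language.

```python
def matches(declined, base):
    """Crude test of whether a declined Estonian surface form could
    refer to the place name `base`. Handles only two illustrative
    phenomena: attached case endings of up to four letters (as in
    "New Yorgisse") and the k -> g consonant gradation seen in
    "New York" -> "New Yorgi". Not a real morphological analyzer.
    """
    d, b = declined.lower(), base.lower()
    if d.startswith(b) and len(d) - len(b) <= 4:
        return True  # e.g. "new yorki" = "new york" + "i"
    graded = b[:-1] + {"k": "g"}.get(b[-1], b[-1])  # "york" -> "yorg"
    return d.startswith(graded) and len(d) - len(graded) <= 4

forms = ["New York", "New Yorki", "New Yorgi", "New Yorgisse",
         "New Yorgis", "New Yorgist", "New Yorgile", "New Yorgil",
         "New Yorgilt", "New Yorgiks", "New Yorgini", "New Yorgina",
         "New Yorgita", "New Yorgiga"]
print(all(matches(f, "New York") for f in forms))
```

Even this toy rule over-matches (it would happily accept “New Yorker”), which hints at why disambiguation across 65 languages and 10 million place names is so hard.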

For the past decade much of my work has centered on this intersection of location and information across languages and modalities, exploring the geography of massive textual archives through the twin lenses of the locations they describe and the impact of location on the production and consumption of information. A particular emphasis has been expanding the study of textual geography to new modalities and transitioning the field from small human studies to at-scale computational explorations.

Over the past five years my studies have included the first large-scale explorations of the textual geography of news media, social media, Wikipedia, television, academic literature, and the open web, as well as the first large-scale comparisons of geotagging versus textual description of location in citizen media and the largest work on multilingual geocoding. The remainder of this blog post will share many of the lessons I have learned from these projects and the implications and promise they hold for the future of making the geography of library holdings more broadly available in spatial form.

Figure 1 – Locations of news outlets linked by the Drudge Report 2002-2008.

In the early 2000s, while an undergraduate student at the National Center for Supercomputing Applications, I launched an early open cloud geocoding and GIS platform that provided a range of geospatial services through a simple web interface and cloud API. The intense interest in the platform and the incredible variety of applications that users found for the geocoding API foreshadowed the amazing creativity of the open data community in mashing up geographic APIs and data. Over the following several years I undertook numerous small-scale studies of textual geography to explore how such information could be extracted and used to better understand various kinds of information behavior.

Some of my early papers include a 2005 study of the geographic focus and ownership of news and websites covering climate change and carbon sequestration (PDF) that demonstrated the dual importance of the geography of content and of its consumers. In 2006 I co-launched a service that enabled spatial search of US Government funding opportunities (PDF), including alerts of new opportunities relating to specific locations. This reinforced the importance of location in information relevance: a contract to install fire-suppression sprinklers in Omaha, Nebraska is likely of little interest to a small business in Miami, Florida, yet traditional keyword search does not contemplate the concept of spatial relevance.

Similarly, in 2009 I explored the impact of a news outlet’s physical location on the Drudge Report’s sourcing behavior and in 2010 examined the impact of a university’s physical location on its national news stature. These twin studies, examining the impact of physical location on news outlets and on newsmakers, emphasized the highly complex role that geography plays in mediating information access, availability, and relevance.

Figure 2 – Network of locations that most frequently co-occur with each other in coverage of Osama Bin Laden – center point is 200km from where he was ultimately found.

In Fall 2011 I published the first of what has become a half-decade series of studies expanding the application of textual geography to ever-larger and more diverse collections of material. Published in September 2011, Culturomics 2.0 was the first large-scale study of the geography of the world’s news media, identifying all mentions of location across more than 100 million news articles stretching across half a century.

A key finding was the centrality of geography to journalism: on average a location is mentioned every 200-300 words in a typical news article and this has held relatively constant for over 60 years. Another finding was that mapping the locations most closely associated with a public figure (in this case Osama Bin Laden) offers a strong estimate of that person’s actual location, while the network structure of which locations more frequently co-occur with each other yields powerful insights into perceptions of cultural and societal boundaries.

Figure 3 – Map of countries which are most commonly mentioned together in global news coverage – countries with the same color are more frequently mentioned with other countries of that color than with countries of a different color.

The following Spring I collaborated with supercomputer vendor SGI to conduct the first holistic exploration of the textual geography of Wikipedia. Wikipedia allows contributors to include precise latitude/longitude coordinates in articles, but because such coordinates must be manually entered in specialized code, just 4% of English-language articles had even a single coordinate entry as of 2012, totaling just 1.1 million coordinates, centered primarily on the US and Western Europe. In contrast, 59% of English-language articles had at least one textual mention of a recognized location, totaling more than 80.7 million mentions of 2.8 million distinct places on Earth.

In essence, the majority of contributors to Wikipedia appear more comfortable writing the word “London” in an article than looking up its centroid latitude/longitude and entering it in specialized code. This has significant implications for how libraries leverage volunteer citizen geocoding efforts in their collections.

Figure 4 – Interactive Google Earth interface to Wikipedia’s coverage of Libya.

To explore how such information could be used to provide spatial search for large textual collections, a prototype Google Earth visualization was built to search Wikipedia’s coverage of Libya. A user could select a specific time period and instantly access a map of every location in Libya mentioned across all of Wikipedia with respect to that time period.

Finally, a YouTube video was created that visualizes world history 1800-2012 through the eyes of Wikipedia by combining the 80 million textual location mentions in the English Wikipedia with the 40 million date references to show which locations were mentioned together in an article with respect to a given year. Links were color-coded red for connections with a negative tone (usually indicating physical conflict like war) or green for connections with a positive tone.

Figure 5 – Video of world history 1800-2012 through the eyes of Wikipedia – links indicate locations mentioned together in an article with respect to the given year (red=negative tone, green=positive tone).

That Fall I collaborated with SGI once again, along with social media data vendor GNIP and the University of Illinois Geospatial Information Laboratory to produce the first detailed exploration of the geography of social media, which helped popularize the geocoding and mapping of Twitter. This project produced the first live map of a natural disaster, as well as the first live map of a presidential election.

At the time, few concrete details were available regarding Twitter’s geographic footprint, and the majority of social media maps focused on the small percentage of natively geotagged tweets. Twitter offered a unique opportunity to compare textual and sensor-based geographies in that 2% of tweets are geotagged with precise GPS or cellular-triangulation coordinates. Coupled with the very high correlation between geotagged tweets and the availability of electricity, this offers a unique ground truth of the confirmed locations of Twitter users against which to compare different approaches to geocoding the textual location cues present in many of the other 98% of tweets.
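The core of such a ground-truth comparison is measuring how far a textually geocoded estimate lands from the GPS geotag, typically via great-circle (haversine) distance. The coordinate pair below is an invented example; the haversine formula itself is standard.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points
    (standard haversine formula, mean Earth radius 6371 km)."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Illustrative comparison: a tweet's GPS geotag versus the city-centroid
# coordinate a textual geocoder might assign from a "New York" mention.
gps = (40.7580, -73.9855)        # geotag near Times Square
geocoded = (40.7128, -74.0060)   # geocoded city centroid
error_km = haversine_km(*gps, *geocoded)
print(f"geocoding error: {error_km:.1f} km")
```

Aggregating this error distribution over millions of geotagged tweets is what lets different textual geocoding approaches be benchmarked against a sensor-based ground truth.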

A key finding was that two-thirds of those 2% of geotagged tweets were sent by just 1% of all users, meaning that geotagged information on Twitter is extremely skewed. Another was that across the world location is expressed primarily in English, regardless of the language a user tweets in, and that 34% of tweets have recoverable high-resolution textual locations. From a communicative standpoint, it turns out that half of tweets are about local events and half are directed at physically nearby users, rather than at global events or users elsewhere in the world, suggesting that geographic proximity plays only a minor role in communication patterns on broadcast media like Twitter.

Figure 6 – Animated heatmap of tweets relating to Hurricane Sandy.

A common pattern that emerges across both Wikipedia and Twitter is that even when native geotagging is available, the vast majority of location metadata resides in textual descriptions rather than precise GIS-friendly numeric coordinates. This is the case even when geotagging is made transparent and automatic through GPS tagging on mobile devices.

In Spring 2013 I launched the GDELT Project, which extends my earlier work on the geography of the news media by offering a live metadata firehose geocoding global news media on a daily basis. That Fall I collaborated with Roger Macdonald and the Internet Archive’s Television News Archive to create the first large-scale map of the geography of television news. More than 400,000 hours of closed captioning of American television news, totaling over 2.7 billion words, were geocoded to produce an animated daily map of the geographic focus of television news from 2009-2013.

Figure 7 – Animated map of locations mentioned in American television news 2009.

Closed captioning text proved to be extremely difficult to geocode. Captioning streams are in entirely uppercase letters, riddled with errors like “in two Paris of shoes” and long sequences of gibberish characters, and in some cases have a total absence of punctuation or other boundaries.

This required extensive adaptation of the geocoding algorithms to tolerate an enormous diversity of typographical errors more pathological in nature than those found in OCR’d content – approaches that were later used in creating the first live emotion-controlled television show for NBCUniversal’s Syfy channel. Newscasts also frequently rely on visual on-screen cues such as maps or text overlays for location references, and by their nature incorporate a rapid-fire sequence of highly diverse locations mentioned just sentences apart from each other, making the disambiguation process extremely complex.
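One simple way to tolerate this kind of garbling, sketched here with Python’s standard difflib rather than the specialized algorithms described above, is fuzzy gazetteer lookup over case-folded tokens: every candidate token is compared against the gazetteer with a similarity cutoff so that a mangled caption token can still resolve. The gazetteer and cutoff are illustrative assumptions.

```python
# A sketch of error-tolerant gazetteer matching for all-uppercase,
# typo-riddled caption streams. The tiny gazetteer and the 0.75
# similarity cutoff are illustrative choices for this example.
import difflib

GAZETTEER = {"paris", "london", "baghdad", "kabul"}

def match_caption_token(token, cutoff=0.75):
    """Return the gazetteer entry closest to a (possibly garbled)
    caption token after case-folding, or None if nothing is close
    enough under difflib's similarity ratio."""
    hits = difflib.get_close_matches(token.lower(), GAZETTEER,
                                     n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_caption_token("PARLS"))   # garbled "PARIS" still resolves
print(match_caption_token("XYZZY"))   # gibberish is rejected
```

Fuzzy matching at this tolerance also multiplies false candidates, which is why the disambiguation stage becomes even more critical for captioning than for clean text.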

Figure 8 – Heatmap of the locations most commonly mentioned in US Government publications, academic literature, and the global news media 1979-2014.

In Fall 2014 I collaborated with the US Army to create the first large-scale map of the geography of academic literature and the open web, geocoding more than 21 billion words of academic literature spanning the entire contents of JSTOR, DTIC, CORE, CiteSeerX, and the Internet Archive’s 1.6 billion PDFs relating to Africa and the Middle East, as well as a second project creating the first large-scale map of human rights reports. A key focus of this project was the ability to infuse geographic search into academic literature, enabling searches like “find the five most-cited experts who publish on water conflicts with the Nuer in this area of South Sudan” and thematic maps such as a heatmap of the locations most closely associated with food insecurity.

Figure 9 – Map of global protest and conflict activity drawn from the GDELT Project displayed on the NOAA Science on a Sphere.

As of spring 2015 the GDELT Project maps the geography of an ever-growing cross-section of the global news media in realtime across 65 languages. Every 15 minutes it machine translates all the global news coverage it monitors in 65 languages, from Afrikaans and Albanian to Urdu and Vietnamese, and applies the world’s largest multilingual geocoding system to identify all mentions of location anywhere in the world, from a capital city to a remote hilltop. Over the past several years, GDELT’s mass realtime geocoding of the world’s news media has popularized the use of large-scale automated geocoding, with disciplines from political science to journalism now experimenting with the technology. GDELT’s geocoding capabilities now lie at the heart of numerous initiatives, from cataloging disaster coverage for the United Nations to mapping global conflict with the US Institute of Peace to modeling the patterns of world history.

Most recently, a forthcoming collaboration with cloud mapping platform CartoDB will enable ordinary citizens and journalists to create live interactive maps of the ideas, topics, and narratives pulsing through the global news media using GDELT. The example map below shows the geographic focus of Spanish (green), French (red), Arabic (yellow) and Chinese (blue) news media for a one-hour period from 8-9AM EST on April 1, 2015, placing a colored dot at every location mentioned in the news media of each language. Ordinarily, mapping the geography of language would be an enormous technical endeavor, but by combining GDELT’s mass multilingual geocoding with CartoDB’s interactive mapping, even a non-technical user can create a map in a matter of minutes. This is a powerful example of what will become possible as libraries increasingly expose the spatial dimension of their collections in data formats that allow them to be integrated into popular mapping platforms. Imagine an amateur historian combining digitized georeferenced historical maps and geocoded nineteenth-century newspaper articles with modern census data to explore how a region has changed over time – these kinds of mashups would be commonplace if the vast archives of libraries were made available in spatial form.

Figure 10 – Geographic focus of world’s news media by language 8-9AM EST on April 1, 2015 (Green = locations mentioned in Spanish media, Red = French media, Yellow = Arabic media, Blue = Chinese media).

In short, as we begin to peer into the textual holdings of our world’s libraries using massive-scale data mining algorithms like fulltext geocoding, we are for the first time able to look across our collective informational heritage to see macro-level global patterns never before visible. Geography offers a fundamentally new lens through which to understand and observe those insights, and as libraries increasingly geocode their holdings and make that material available in standard geographic open data formats, they will become conveners of information and innovation, empowering an entirely new era of access to and understanding of our world.
