Computational Linguistics & Social Media Data: An Interview with Bryan Routledge

Bryan Routledge, Associate Professor of Finance, Tepper School of Business, Carnegie Mellon University.

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture last week. This is part of an ongoing series of interviews Julia is conducting to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

What can a Yelp review or a single tweet reveal about society? How about hundreds of thousands of them? In this installment of the Insights Interviews series, I’m thrilled to talk with researcher Bryan Routledge about two of his projects that use a computational linguistic lens to analyze vast quantities of social media data. You can read the article on word choice in online restaurant reviews here, and the article comparing Twitter as a predictive tool with traditional public opinion polls here (PDF).

Julia: The research group Noah’s ARK at the Language Technologies Institute, School of Computer Science at Carnegie Mellon University aims in part to “analyze the textual content of social media, including Twitter and blogs, as data that reveal political, linguistic, and economic phenomena in society.”  Can you unpack this a bit for us? What kind of information can social media provide that other kinds of data can’t?

Bryan: Noah Smith, my colleague in the School of Computer Science at CMU, runs that lab. He is kind enough to let me hang out over there. The research we are working on looks at the connection between text and social science (e.g., economics, finance). The idea is that looking at text through the lens of a forecasting problem (the statistical model linking text to some measured social-science variable) gives insight into both the language and the social parts. Online and easily accessed text brings new data to old questions in economics. More interesting, at least to me, is that grounding the text/language with quantitative external measures (volatility, citations, etc.) gives insight into the text. For example, it is cool to see which words in corporate 10-K annual reports correlate with stock volatility, and how that changes over time.

Different metaphors for expensive and inexpensive restaurants in Yelp reviews. From: Dan Jurafsky, Victor Chahuneau, Bryan R. Routledge, and Noah A. Smith. 2014. Narrative framing of consumer sentiment in online restaurant reviews. First Monday 19:4.

Julia: Your work with social media (Yelp and Twitter) is notable for its large sample sizes and emphasis on quantitative methods, using over 900,000 Yelp reviews and 1 billion tweets. How might archivists of social media better serve social science research that depends on these sorts of data sets and methods?

Bryan: That is a good question. The difficulty for archivists is that it is hard to collect the right data without knowing the research questions. The usual answer of “keep everything!” is impractical. Google’s n-gram project is a good illustration. They summarized a huge volume of books with word counts (single words, two-word pairs, …) by time. This is great for some research, but not for the more recent statistical models that use sentence and paragraph information.
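To make the trade-off concrete, here is a minimal sketch of summarizing text as two-word pair counts by year. It is only an illustration, not Google's actual pipeline: the function name `bigram_counts_by_year` and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def bigram_counts_by_year(documents):
    """Summarize (year, text) pairs as counts of adjacent word pairs per year.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split.
    """
    counts = defaultdict(Counter)
    for year, text in documents:
        words = text.lower().split()
        # zip(words, words[1:]) yields each adjacent pair of words.
        counts[year].update(zip(words, words[1:]))
    return counts


# Made-up sample data, not drawn from any real corpus.
docs = [
    (1990, "the market fell"),
    (1990, "the market rose"),
    (2010, "the market fell again"),
]
counts = bigram_counts_by_year(docs)
```

Once only `counts` is kept, the original sentences cannot be rebuilt from it, which is why a summary like this suits some research questions but not models that need sentence or paragraph structure.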

Julia: Your background and most of your work are in the field of finance, which you have characterized as being fundamentally about predicting the behavior of people. How do you see financial research being influenced by social media and other born-digital content? Could you tell us a bit about what it means to have a financial background doing this kind of research? What can the fields of finance and archives learn from each other?

An example from Yelp reviews of Manhattan restaurants, for menu items containing “steak”: predict the (log) menu item price from the words used to describe the item, by location. In most locations, the word “baby” is neutral: it suggests neither a high nor a low price. The exception is the Wall Street area of lower Manhattan, where it is associated with higher-priced steak.

Bryan: Finance (and economics) is about the collective behavior of large numbers of people in markets. To make research possible you need simple models of individuals. Getting the right mix of simplicity and realism is age-old and ongoing research in the area. More data helps. Macroeconomic data like GDP and stock returns are informative about the aggregate. Data on, say, individual portfolio choices in 401(k) plans lets you refine models. Social media data is this sort of disaggregated data. We can get a signal, very noisy, about what is behind an individual decision. Whether that is ultimately helpful for guiding financial or economic policy is an open, but exciting, question.

More generally, working across disciplines is interesting and fun. It is not always “additive.” The research we have done on menus has nothing to do with finance (other than my observation that in NY restaurants near Wall Street, the word “baby” is associated with expensive menu items). But if we can combine, for example, decision theory from finance with generative text models, we get some cool insights into purposefully drafted documents.

Julia: The data your team collected from Yelp was gathered from the site. Your data from Twitter was collected using Twitter’s Streaming API and “Gardenhose,” which deliver a random sampling of tweets in real-time. I’d be curious to hear what role you think content holders like Yelp or Twitter can or could play in providing access to this kind of raw data.

Bryan: As a researcher with only the interests of science at heart, it would be best if they just gave me access to all their data! Given that much of the data is valuable to the companies (and given privacy concerns, of course), I understand that is not possible. But it is interesting that academic research, and data-sharing more generally, is in a company’s self-interest. Twitter has encouraged a whole ecosystem that has helped them grow. Many companies have an API for that purpose that happens to work nicely for academic research. In general, open access is preferred in academic settings so that all researchers have access to the same data. Interesting papers that rely on proprietary access to Facebook data are less helpful than papers built on Twitter data that others can also obtain.

Julia: Could you tell us a bit about how you processed and organized the data for analysis and how you are working to manage it for the future? Given that reproducibility is such an important concept for science, what ways are you approaching ensuring that your data will be available in the future?

Bryan: This is not my strong suit. But at a high level, the steps are (roughly) “get,” “clean,” “store,” “extract,” “experiment.” The “get” varies with the data source (an API). The “clean” step is just a matter of being careful with special characters and making sure the data line up in the right fields. If the API is sensible, the “clean” is easy. We usually store things in JSON, a flexible format that is usually good for sharing data. The “extract” and “experiment” steps depend on what you are interested in. Word counts? Phrase counts? Other? The key is not to jump from “get” to “extract”: storing the data in as raw a form as possible keeps things flexible.
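The first four steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the research group's code: the function names, the two sample records, and the choice of one JSON object per line are all hypothetical, and `get` returns canned records in place of a real API call.

```python
import html
import json
import re
from collections import Counter


def get():
    """Stand-in for an API call (hypothetical); returns raw records."""
    return [
        {"id": 1, "text": "Great steak,\u00a0 really   tender!\n"},
        {"id": 2, "text": "The steak was dry &amp; overpriced."},
    ]


def clean(record):
    """Unescape HTML entities and normalize whitespace, keeping all fields."""
    text = html.unescape(record["text"])
    text = re.sub(r"\s+", " ", text).strip()
    return {"id": record["id"], "text": text}


def store(records):
    """Serialize records as JSON, one object per line (an assumed layout)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def extract(stored):
    """Compute word counts from the stored records, not from the API."""
    counts = Counter()
    for line in stored.splitlines():
        record = json.loads(line)
        counts.update(re.findall(r"[a-z']+", record["text"].lower()))
    return counts


stored = store([clean(r) for r in get()])
word_counts = extract(stored)
```

Because `extract` reads only from `stored`, a later switch from word counts to phrase counts means rewriting `extract` alone, with no need to collect the data again.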

Julia:  What role, or potential role, do you see for the future of libraries, archives and museums in working with the kinds of data you collect? That is, while your data is valuable for other researchers now, things like 700,000 Yelp reviews of restaurants will be invaluable to all kinds of folks studying culture, economics and society 10, 20, 50 and 100 years from now. So, what kind of role do you think cultural heritage institutions could play in the long-term stewardship of this cultural data? Further, what kinds of relationships do you think might be able to be arranged between researchers and libraries, archives, and museums? For instance, would it make sense for a library to collect, preserve, and provide access to something like the Yelp review data you worked with? Or do you think they should be collecting in other ways?

Sentiment on Twitter as compared to Gallup Poll. Appeared in From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge and Noah A. Smith. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM 2010), pages 122–129, Washington, DC, May 2010

Bryan: This is also a great question and also one for which I do not have a great answer.  I do not know a lot about the research in “digital humanities,” but that would be a good place to look.  People doing digital text-based research on a long-horizon panel of data should provide some insight into what sorts of questions people ask.  Similarly, economic data might provide some hints.  Finance, for example, has a strong empirical component that comes from having easy-to-access stock data (the CRSP).  The hard part for libraries is figuring out which parts to keep.  Sampling Twitter, for example, gets a nice time-series of data but loses the ability to track a group of users or Twitter conversations.

Julia: Talking about the paper you co-authored that analyzed Yelp reviews, Dan Jurafsky said “when you write a review on the web you’re providing a window into your own psyche – and the vast amount of text on the web means that researchers have millions of pieces of data about people’s mindsets.” What do you think are some of the possibilities and limitations for analyzing social media content?

Bryan: There are many limitations, of course. Twitter and Yelp are not just providing a window into things, they are changing the way the world works. “Big data” is not just about larger sample sizes of draws from a fixed distribution. Things are non-stationary. (In an early paper using Twitter data, we could see the “Oprah” effect as the number of users jumped in the day following her show about Twitter.) Similarly, the data we see in social media is not a representative cross section of society. But both of these are the sort of things good modeling (statistical, economic) should, and does, aim to capture. The possibilities of all this new data are exciting. Language is a rich source of data, and it takes challenging models to turn it into useful information. More generally, social media is an integral part of many economic and social transactions. Capturing that in a tractable model makes for an interesting research agenda.
