This is part four in a seven-part resource guide for digital scholarship by Samantha Herron, our 2017 Junior Fellow. Part one is available here, part two, about making digital documents, is here, and part three is about tools to work with data. Part four (below) is all about doing text analysis. The full guide is available as a PDF download.
Clean OCR, good metadata, and richly encoded text open up the possibility for different kinds of computer-assisted text analysis. With instructions from humans (“code”), computers can identify information and patterns across large sets of texts that human researchers would be hard-pressed to discover unaided. For example, computers can find out which words in a corpus are used most and least frequently, which words occur near each other often, what linguistic features are typical of a particular author or genre, or how the mood of a plot changes throughout a novel. Franco Moretti describes this kind of analysis as “distant reading”, a play on the traditional critical method “close reading”. Distant reading implies not the page-by-page study of a few texts, but the aggregation and analysis of large amounts of data.
The strategies of distant reading require scholars to “operationalize” certain concepts–to make abstract ideas quantifiable. For more on text mining see Ted Underwood’s post on what text mining can do and what some of the obstacles are.
Some important text analysis concepts:
- Stylometry –
Stylometry is the practice of using linguistic study to attribute authorship to an anonymous text. Though some of stylometry’s methods and conclusions (both digital and analog) have been disputed, the practice speaks to some of the kinds of evidence researchers hope to surface using text analysis.
One of the early successes of stylometry was in 1964 when Frederick Mosteller and David Wallace used linguistic cues to assign probable authorship to disputed Federalist Papers. Patrick Juola for Scientific American describes it: “[The researchers] showed that the writing style of Alexander Hamilton and James Madison differed in subtle ways. For example, only Madison used the word ‘whilst’ (Hamilton used ‘while’ instead). More subtly, while both Hamilton and Madison used the word ‘by,’ Madison used it much more frequently, enough that you could guess who wrote which papers by looking at how frequently the word was used.” Using these methods, they discovered that the disputed papers were likely written by Madison.
Today, computers can perform these kinds of comparative tasks quickly, and can keep track of many different features that would be difficult to track by hand (e.g. not only the relative presence of the word ‘by,’ but the relative presence of ‘by’ at the beginning vs. the end of a sentence, or the ratio of ‘by’ to other prepositions, etc.).
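As a toy illustration of this kind of comparison, the sketch below counts the relative frequency of a few marker words in two short samples. The sentences and marker words here are invented for the example; they are not Mosteller and Wallace's actual data or method.

```python
from collections import Counter
import re

def relative_freq(text, words):
    """Relative frequency of each target word (count / total tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in words}

# Toy samples standing in for two authors' texts (purely illustrative).
author_a = "The power is vested in the people, and by the people it is held whilst liberty endures."
author_b = "While the states retain power, the union is bound by compact and by consent."

markers = ["whilst", "while", "by"]
print(relative_freq(author_a, markers))
print(relative_freq(author_b, markers))
```

With real corpora, the same counts would be computed across every known text by each candidate author, and the disputed text compared against both profiles.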
Example: Stylometry was recently used by researchers to look into the works of Hildegard of Bingen, a female author from the Middle Ages. Because she was not entirely fluent in Latin, she dictated her texts to secretaries who corrected her grammar. Her last collaborator, Guibert of Gembloux, seemed to have made many changes to her dictation while he was secretary. The researchers used digital stylometry methods to show that collaborative works are often styled very differently from works penned by either author individually.
Stylometric methods and assumptions can also be applied beyond author attribution. If stylometry assumes that underlying linguistic features can function as ‘fingerprints’ for certain authors, linguistic features might also be fingerprints for certain years or genres or national origin of author, and so on. For example, are there linguistically significant identifiers for mystery novels? Can a computer use dialogue to determine if a book was written before 1800? Can computers discover previously unidentified genres? This pamphlet from Stanford Literary Lab gives a good overview of their research into the question of whether computers can determine genre.
- Word counts, etc. and topic models –
Stylometry deals with the attribution and categorization of texts’ style. Other distant reading research looks at semantic content, taking into account the meanings of words as opposed to their linguistic role.
Word frequency – One of the simplest kinds of text analysis is word frequency. Computers can count up and rank which words appear most often in a text or set of texts. Though not computationally complicated, term frequency is often an interesting jumping off point for further analysis, and a useful introduction into some of digital humanities’ debates. Word frequency is the basis for somewhat more sophisticated analyses like topic modeling, sentiment analysis, and ngrams.
To the right is a word cloud for Moby Dick.
A word cloud is a simple visualization that uses font size to represent the relative frequency of words–the bigger the font, the more frequently a word is used.
The word cloud is based on this data: (First column is word, second is word count, third is word frequency)
whale 466 0.004317093
like 315 0.0029182069
ye 283 0.002621754
man 251 0.0023253013
ship 227 0.0021029618
sea 216 0.0020010562
old 212 0.0019639996
captain 210 0.0019454713
dick 199 0.0018435656
moby 199 0.0018435656
said 188 0.00174166
ahab 180 0.0016675468
time 169 0.0015656411
little 165 0.0015285845
white 164 0.0015193204
queequeg 162 0.0015007921
long 150 0.0013896223
great 146 0.0013525657
men 138 0.0012784525
way 134 0.001241396
say 132 0.0012228676
whales 132 0.0012228676
head 124 0.0011487544
good 116 0.0010746412
boat 111 0.0010283205
thought 110 0.0010190563
round 106 0.0009819998
sort 101 0.000935679
hand 98 0.0009078866
world 92 0.00085230166
come 90 0.0008337734
sperm 89 0.00082450925
look 88 0.0008152451
whaling 88 0.0008152451
deck 86 0.0007967168
night 84 0.00077818846
chapter 82 0.0007596602
seen 82 0.0007596602
day 78 0.0007226036
know 78 0.0007226036
tell 78 0.0007226036
things 78 0.0007226036
right 77 0.0007133394
water 76 0.0007040753
away 74 0.000685547
bildad 74 0.000685547
far 74 0.000685547
god 74 0.000685547
You’ll notice that this particular word count (completed using Voyant Tools) doesn’t include certain stop words: ‘fluff’ words like pronouns, articles, conjunctions, and prepositions (e.g. she, that, the, any, but…), keeping only ‘meaning’ words–names, nouns, verbs, adjectives, adverbs.
Mostly, this data aligns with what we already know or would assume about Moby Dick: that it concerns a whale and an old captain at sea. But with this data, we can ask new questions: Is it significant that ‘whale,’ the most frequent word, is used about 150 more times than the runner-up (or even more times if we include the plural ‘whales’ or the verb ‘whaling’)? Why is ‘like’ used so often? Can we safely assume that word count says anything at all about the book’s content or meaning? How does Moby Dick’s word frequency compare to Melville’s other works? To the works of his contemporaries?
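A table like the one above can be produced with a few lines of code. The sketch below is a minimal version: the stop word list is an illustrative fragment (tools like Voyant use much longer ones), and the sample text is just the opening of Moby Dick.

```python
from collections import Counter
import re

# A tiny illustrative stop word list; real tools use much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in",
              "it", "is", "was", "he", "she", "that", "his", "her", "i"}

def word_frequencies(text, top=5):
    """Return (word, count, relative frequency) for the most common non-stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)  # frequency is count divided by all tokens, stop words included
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [(w, c, c / total) for w, c in counts.most_common(top)]

sample = ("Call me Ishmael. Some years ago, never mind how long precisely, "
          "having little or no money in my purse, I thought I would sail about "
          "a little and see the watery part of the world.")
for word, count, freq in word_frequencies(sample):
    print(f"{word}\t{count}\t{freq:.4f}")
```

Run over the full novel, this produces the same word/count/frequency columns shown in the table.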
Voyant is a set of out-of-the-box tools that allows you to manipulate and compare texts. Given a corpus (it is preloaded with two corpora, Jane Austen’s novels and Shakespeare’s plays, but users can also supply their own), Voyant displays word counts and clouds, comparative frequencies over time, concordances, and other visual displays. There are plenty of other, more sophisticated and customizable tools available that do similar tasks, but Voyant is one of the most accessible because it requires no coding by the user.
Here is a link to a list of clean demo corpora to play around with.
Google Books Ngram Viewer is also a powerful example of how word frequencies can be used as a jumping off point for scholarly inquiry. Using Google Books as its massive database, users can track the relative presence of words in books across time.
Here’s what a Google ngram looks like:
This ngram compares the (case-insensitive) frequency of ‘internet’, ‘television’, ‘radio’, ‘telephone’, and ‘telegram’ across the entire Google Books collection from 1850-2000. This graph (we suppose) reflects a lot of interesting historical information: the birth and quick rise of radio, the birth and quicker rise of the Internet, the birth and steady increase of television, which appears to level out in the 1990s. However, ngrams like this also allow us to ask questions: Does the 1944 peak in frequency of the word ‘radio’ in books reflect a historical peak in radio popularity? If not, is there some reason why people might be writing more about radios than using them? Or, why was the telegram so infrequently written of in books? Would running this same ngram on a corpus of newspapers rather than books return different results? And so on.
Here are some interesting and silly ngrams from webcomic xkcd.
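An ngram is just a sequence of n consecutive words, so extracting and counting ngrams from your own tokenized text amounts to sliding a window across it. A minimal sketch (the sentence below is a made-up example):

```python
from collections import Counter

def ngrams(tokens, n):
    """Slide a window of length n across the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the white whale breached and the white whale sounded".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(2))  # the two most frequent word pairs
```

Google's viewer does essentially this at enormous scale, then plots each ngram's yearly frequency relative to all ngrams published that year.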
Word frequency data at both the scale of a single book, and of very many books, asks as many questions as it answers, but can be an interesting jumping off point for beginning to envision texts as data.
Another popular text analysis method is topic modeling. A ‘topic’ is a set of words that frequently collocate in a set of texts (meaning that they occur near each other). In general, a topic modeling tool looks through a corpus and spits out clusters of words that are related to each other. So, in a very hypothetical example, if you fed a topic modeling tool the text of every ecology textbook you could find, it might return topics like ‘dirt rock soil porous’ and ‘tree leaf branch root’ etc.
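Topic modeling proper requires specialized software such as MALLET (discussed below), but the underlying intuition, counting which words occur near each other, can be sketched with the standard library alone. The window size and the two toy documents below are illustrative, and this is co-occurrence counting, not a full statistical topic model.

```python
from collections import Counter

def cooccurrences(docs, window=4):
    """Count how often two words appear within `window` tokens of each other."""
    pairs = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + window, len(tokens))):
                # Sort the pair so ('soil', 'rock') and ('rock', 'soil') count together.
                pairs[tuple(sorted((tokens[i], tokens[j])))] += 1
    return pairs

docs = [
    "porous soil holds water while dense rock sheds water",
    "tree roots anchor soil and tree branches shed leaves",
]
print(cooccurrences(docs).most_common(3))
```

A real topic model adds a statistical layer on top of this intuition, inferring which clusters of co-occurring words best explain the whole corpus.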
The significance of such a tool is more obvious at a large scale. A human can read an article on bananas and state with confidence that the article is about bananas and perhaps that the key words are ‘bananas’ ‘fruit’ ‘yellow’ ‘potassium’… But when working with a corpus that is, say, the text of 100 years of a newspaper, or the text mined from every thread on a subreddit page, the ‘topics’ become more difficult to discern.
Example: Robert K. Nelson at the Digital Scholarship Lab at the University of Richmond authored Mining the Dispatch, a project that uses topic modeling to look at nearly the full run of a newspaper, the Richmond Daily Dispatch, in the early 1860s. For example, one of the topics his model returned was predicted by the words ‘negro years reward boy man named jail delivery give left black paid pay ran color richmond subscriber high apprehension age ranaway free feet delivered.’ Then, by looking at the articles where this topic was most prominent, Nelson determined that this topic most often refers to fugitive slave ads. By tracking the relative presence of this topic through time, one can track the relative presence of fugitive slave ads through time. Other topics identified by the model and named by Nelson include ‘Poetry and Patriotism’, ‘Anti-Northern Diatribes’, ‘Deserters’, ‘Trade’, ‘War Reports’, ‘Death Notices’, and ‘Humor’, among others.
Topic models can reveal latent relationships and track hidden trends. Especially for unindexed corpora (like old newspapers, which often do not have article-level metadata), topic modeling can be used to identify the incidence of certain kinds of content that would take years to tag by hand, if it were possible at all.
A popular topic modeling tool is MALLET, for those comfortable working in the command line. Programming Historian has a tutorial for getting started using MALLET for topic modeling. If you’re not comfortable in the command line, there is a GUI (graphical user interface) tool for implementing MALLET here (meaning you can input files and output topics without entering code yourself), and a blog post from Miriam Posner on how to interpret the output.
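For orientation, a MALLET session looks roughly like the following two steps: import a directory of plain-text files into MALLET's binary format, then train a model on it. The file paths and output names here are placeholders; see the Programming Historian tutorial for details and further options.

```shell
# Step 1: import a folder of .txt files, keeping word order and dropping stop words
bin/mallet import-dir --input path/to/texts --output corpus.mallet \
  --keep-sequence --remove-stopwords

# Step 2: train a 20-topic model and write out the topic keys and per-document mix
bin/mallet train-topics --input corpus.mallet --num-topics 20 \
  --output-topic-keys topic_keys.txt --output-doc-topics doc_topics.txt
```

The `topic_keys.txt` file lists the most probable words for each topic (like the fugitive slave ad word list above), and `doc_topics.txt` records how strongly each topic appears in each document.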
SOME TEXT ANALYSIS TOOLS:
- AntConc: Concordance tool.
- DiRT Directory: Digital Research Tools directory.
- From the Page: Crowdsourcing manuscript transcription software
- Google Ngram Viewer: Explore ngrams in Google books corpus.
- Juxta: For textual criticism (identification of textual variants). Look at base and witness texts side by side, locate variations easily. Offers analytic visualizations like heat maps.
- Natural Language Toolkit: Computational linguistics platform for building Python programs that work with language data. “It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.”
- MALLET: (Machine Learning for Language Toolkit) – Java-based package for sophisticated text analysis. Often used for topic modeling. See description above.
- Programming Historian: Peer-reviewed, novice-friendly tutorials for digital tools and techniques.
- R: Statistical software. Often used for text manipulation, but the language is less user-friendly than other coding languages (say, Python).
- Stanford Natural Language Processing Group: Set of Natural Language Processing tools
- Stylo: Suite of stylometric tools for R.
- Voyant: Web-based text explorer (See above).
- WordHoard: WordHoard contains the entire canon of Early Greek epics, as well as all of Chaucer, Shakespeare, and Spenser. The texts are annotated or tagged by ‘morphological, lexical, prosodic, and narratological criteria’. It has a user interface that allows non-technical users to explore textual data.
WHERE TO GET TEXTS:
The next installment of this guide will be all about spatial humanities, GIS, and timelines.