The presentations at the Library of Congress’ Collections As Data conference coalesced into two main themes: 1) digital collections are composed of data that can be acquired, processed and displayed in countless scientific and creative ways and 2) we should always be aware and respectful that data is manipulated by — and derived from — people.
Jane McAuliffe, director of National and International Outreach, welcomed the attendees and live-stream viewers to “the largest repository of recorded knowledge in the world.” McAuliffe said, “It’s not enough anymore to just open the doors of this building and invite people in. We have to open the knowledge itself for people to explore and use.” She introduced the Library of Congress’s new division, National Digital Initiatives, which organized the event, and said, “Today is a perfect example of the work we want them to do, leveraging the Library to bring all of you together, discussing the best practices and lessons learned from your work, thinking through next steps and what we can do even better moving forward.”
Setting the tone for the conference in his opening keynote address, Jer Thorp, of the Office for Creative Research, touched on how data points and data-collection methods have become distanced and disconnected from the humans that the data describes. His examples included a project that ran a sentiment analysis of high school Twitter feeds to find “the saddest high school in New York.” Researchers initially released inaccurate results, assuming the Tweets came from Hunter College High School when they actually came from a single Twitter account located just south of the school. Students themselves pointed out that high schoolers do not use Twitter. Thorp commented on advertisers freely taking personal data from people’s browsing habits and how information gleaned from browser ads and cookies creates a distorted picture of individuals. To prove his point, he paid a group to write a profile of him based on the ads that targeted his browser. Thorp, an exuberant artist, teacher, father and husband, said, “I learned some things about what advertisers believe about me…I am sad and I live alone and play video games.”
He displayed several data visualizations, among them his travel patterns to and from work and Tweets of “good morning” from people around the world, GIS-mapped to their geolocations (green dots representing people who got up early and red for people who got up late). Shifting to citizen science, he displayed a project that visually correlates weather events with the symptoms reported by chronic pain sufferers. Every week, volunteers submitted information about their pain, which was mapped to weather data. The project benefited both patients and doctors.
Thorp got away from visualization of solid data by asking the rhetorical question, “How do we present the cold clinical magnitude of data alongside the human story?” He demonstrated the Time of the Game project, where he and his colleagues overlapped many digital photos of people, residing in different places, watching the same World Cup game. They centered the television screen in each photo and aligned the photos so that as they flickered in an animated sequence, the TV screen became a hub around which the images of people changed. The visual effect conveyed viscerally — in ways that words could not — a shared, collective experience.
In another example of citizen science, Mark Bouslog, of Zooniverse, spoke about the power of crowdsourcing and what the Zooniverse site calls “people-powered research.” He showcased Zooniverse’s do-it-yourself Project Builder tool and a few tagging and transcription projects. Zooniverse’s Galaxy Zoo, for example, invites volunteers to classify galaxies by stepping through simple observations about each galaxy’s features. “The pattern recognition for the initial classifying, humans excel at, while computers yet do not,” said Bouslog. “The results were incredible. There were over a thousand people contributing and it was eventually determined that the crowd-source consensus was as good as classifications shown by professional astronomers showing that we have gotten it right. Some of the volunteers are listed as co-authors.” Penguin Watch invites volunteers to identify penguins, again through carefully guided input and a controlled vocabulary. The interface is clean and colorful and there are competitive games to engage the volunteers. To date, 37,723 volunteers have cataloged 4,353,970 images. As for transcription of handwritten letters, Bouslog said of projects such as Shakespeare’s World that project organizers make it as simple and foolproof as possible to ensure success. “We don’t ask volunteers to transcribe an entire page or even transcribe the entire sentence. You simply ask them to transcribe one visual line at a time. That they feel they can do in confidence.”
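For readers curious what a “crowd-source consensus” might look like in practice, here is a minimal sketch of one common approach, simple majority voting with an agreement threshold. It is not Zooniverse’s actual aggregation method, which the talk did not detail; the function name `consensus`, the 60 percent threshold, the image IDs and the votes are all made up for illustration.

```python
from collections import Counter


def consensus(labels, min_agreement=0.6):
    """Return (label, share) for the most common label.

    The label is None when the share of volunteers who chose it falls
    below min_agreement (a hypothetical threshold, not a Zooniverse value).
    """
    if not labels:
        return None, 0.0
    label, count = Counter(labels).most_common(1)[0]
    share = count / len(labels)
    return (label if share >= min_agreement else None), share


# Invented volunteer classifications, keyed by a made-up image ID.
votes = {
    "image_001": ["spiral", "spiral", "elliptical", "spiral", "spiral"],
    "image_002": ["elliptical", "spiral", "merger", "elliptical", "spiral"],
}

results = {image_id: consensus(labels) for image_id, labels in votes.items()}

for image_id, (label, share) in results.items():
    status = label if label is not None else "no consensus, needs more votes"
    print(f"{image_id}: {status} ({share:.0%} agreement)")
```

Images that fail to reach the threshold could be shown to more volunteers or passed to an expert, which is one way a project might keep each individual task simple while still producing reliable results.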
Kate Zwaard, of the Library of Congress’ National Digital Initiatives division, addressed the perceived-versus-real tension between innovation and sustainability and how that informs NDI’s work. She talked about Henriette Avram, who helped create MARC standards, and how Avram embodied the complementary skills of computer programming and library science to create a durable cataloging system. Zwaard spoke about the complexity of creating a major collection online, about the number of resources (human, software and hardware) that go into it, and how online collections require complementary skills. She said NDI’s goals are to maximize the benefit of the Library’s digital collections to the American public and to the world; incubate, encourage and promote digital innovation; and collaborate with other cultural-memory institutions and digital creators. Zwaard said, “What we do have the power of is knowing people, knowing technology and being able to connect folks.”
Elizabeth Lorang, of the University of Nebraska–Lincoln, described her research on thinking about text in visual terms and computationally finding text embedded in image files. Using image-recognition software, she and her team searched for poems by their shapes. Their reasoning was that regular newspaper text, for example, has a predictably boxy shape but poetry, with its staggered lines and generous use of space, has distinctive shapes. By means of Image Analysis for Archival Discovery, or AIDA, Lorang and her team discovered a batch of poems in images from the Chronicling America collection.
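To make the “boxy versus staggered” intuition concrete, here is a toy sketch of one shape feature a program could compute: how much the width of each text line varies within a block. This is not AIDA’s method, which the talk did not describe in detail; the helper names (`to_rows`, `text_line_widths`, `raggedness`) and the tiny hand-drawn “images” are assumptions made purely for illustration.

```python
from statistics import mean, pstdev


def to_rows(art):
    """Convert ASCII art ('#' = ink, anything else = blank) to rows of 0/1."""
    return [[1 if ch == "#" else 0 for ch in line] for line in art]


def text_line_widths(rows):
    """Group consecutive inked pixel rows into text lines; return each line's width."""
    widths, current = [], None  # current = (leftmost, rightmost) inked column
    for row in rows:
        ink = [x for x, px in enumerate(row) if px]
        if ink:
            left, right = ink[0], ink[-1]
            if current is None:
                current = (left, right)
            else:
                current = (min(current[0], left), max(current[1], right))
        elif current is not None:
            # A blank pixel row ends the current text line.
            widths.append(current[1] - current[0] + 1)
            current = None
    if current is not None:
        widths.append(current[1] - current[0] + 1)
    return widths


def raggedness(rows):
    """Coefficient of variation of line widths: near 0 for boxy text, higher for staggered lines."""
    widths = text_line_widths(rows)
    if len(widths) < 2:
        return 0.0
    return pstdev(widths) / mean(widths)


# Two invented blocks: justified prose and a short poem with uneven lines.
prose = to_rows(["##########", "..........", "##########", "..........", "##########"])
poem = to_rows(["######....", "..........", "###.......", "..........", "########.."])

print("prose raggedness:", round(raggedness(prose), 3))
print("poem raggedness:", round(raggedness(poem), 3))
```

A real system working on scanned newspaper pages would first need to binarize the image and segment it into columns and blocks, and would likely combine several features rather than rely on one score.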
Leah Weinryb Grohsgal of the National Endowment for the Humanities and Deborah Thomas of the Library of Congress also spoke about the Chronicling America project and about the NEH Data Challenge built upon it. Thomas began by profiling Chronicling America. She said, “American historic newspapers are actually archived in state libraries around the country, so using digital technologies we are able to bring this material back together again through the partnerships with the NEH…The data is available for harvesting or reuse outside of the individual interfaces that we provide through the website. So we have digitized page images. We have optical character recognition, which is machine-readable text. We have metadata, which surrounds every page and issue in a standardized METS format for MODS description characteristics, which describes the place and time of that particular issue as well as the newspaper directives. All of this information can be taken out of the site and analyzed in different ways for researchers that don’t involve the actual website.” Grohsgal talked about the results of the Chronicling America Historic Newspaper data challenge, which she wrote about in The Signal in August, and took a closer look at each of the winning entries: Biblical Quotations in U.S. Newspapers, American Lynching, Historical Agricultural News, Chronicling Hoosiers, USNewsMap.com and Revealing History with Chronicling America.
Nicole Saylor, of the Library of Congress, talked about the American Folklife Center and, among other things, how it engages the public to contribute to AFC’s collections. At Halloween 2014, for example, AFC invited people to share their favorite photos on Flickr with the hashtag #FolklifeHalloween2014. Saylor also spoke about the personal stories acquired by AFC through StoryCorps and the StoryCorps app. Saylor said, “By leveraging or partnering with third-party software platforms, these efforts allow us to focus on preservation and long-term access of records while still supporting immediate and dynamic engagement in the community.” Saylor also touched on bias in metadata, a subject several other speakers addressed throughout the day, and on how resources such as Traditional Knowledge Labels are enabling communities who have a personal stake in the collections to add their own metadata and reflect their own understanding and viewpoint of the content. This same topic was addressed by another group of scholars a few weeks earlier at the Library of Congress’s American Folklife Collections, Collaborations & Connections Symposium.
Matthew Weber, of Rutgers University, spoke about the Archives Unleashed datathon, which the Library of Congress hosted in June. He said of his collaborative digital humanities work, “I’m a communications scholar and historian and Jimmy (Lin) is a computer scientist and we come from entirely different backgrounds and we speak entirely different languages as academics. And yet together in the same room, we are able to work with data and create meaning out of that data as we collaborate.” Weber said such collaborations are becoming more common and stressed the need for data laboratories in which scholars can come together and exchange ideas. He added that scholars need to be educated about data-processing tools for their research.
Ricardo Punzalan, of the University of Maryland, talked about “virtual reunification” as a strategy to enable dispersed collections to be brought together. He said that in the past, the archival community acknowledged in its finding aids that related elements of a physical collection resided in other institutions, but now collections can be linked together virtually. He pointed to the Walt Whitman Archive as an example of unified resources from various institutions. Punzalan also talked about repatriation, returning things to their cultural owners after the objects have been digitized. This cultural sensitivity echoed Saylor’s mention of Traditional Knowledge Labels and a presentation about repatriation given at the IASA conference at the Library of Congress. In a related repatriation note, the Library of Congress recently donated digitized holdings relating to the culture and history of Afghanistan to cultural and educational institutions in Afghanistan for use in their own digital libraries and online repositories.
Bergis Jules, of UC Riverside, talked about Documenting the Now, which builds free and open-source tools for collecting, analyzing and sharing Twitter data. His work was inspired by the activism and protest that followed the police killing of Michael Brown in Ferguson, Missouri. Jules said that there was more to gathering information about such public events than just archiving Tweets and photos and news stories. He said, “We had the responsibility really not to forget that there are in fact people behind all of this data. We are really interested in how our building of these collections might affect people’s lives. It’s also why we are being really transparent with our work, at the same time trying to help build a community of people who also value these ideas….It’s about valuing people enough to care about how we collect and store their data.” Jules also raised the issue of privacy concerns, as Thorp did. “How will our collections of social media data be different than those built by law enforcement or private security firms?” Jules said. “How will the library respond to requests from private security firms and law enforcement for the data?” He said that we need to directly engage with users of social media regarding how collecting this type of data might affect their lives.
Maciej Ceglowski, founder of the social bookmarking site Pinboard, delved even deeper into privacy concerns. “I worry about legitimizing a culture of surveillance,” Ceglowski said. “I am very uneasy when I see social scientists working with Facebook, for example.” He was wary also of finding patterns in data as an end in itself. Ceglowski spoke of the failure of imagination by so-called experts and he encouraged the audience to honor individuality. He recalled how slowly many people embraced Wikipedia. “I saw the (Andrew Mellon Foundation) librarians fail to engage in (Wikipedia in) the early days, a service that they later grew to love, basically because of the lack of trust and openness to an experiment around unstructured tagging,” he said. He acknowledged that collaborating with communities means relinquishing some amount of control, which is frightening and fascinating. He cited how people who would consider themselves technology amateurs actually develop marketable skills just by working and playing on the web, and how communities that form organically through online spaces, such as social bookmarking sites, actually help each other. Ceglowski said, “My dream of the web is for it to feel like the big city, a place where you rub elbows with people that are not like you. A little scary and chaotic and full of many things that you can imagine, and many things where you can’t and also for people to be themselves and for people to create their own spaces and to learn from one another.”
Harriet Green, of the University of Illinois at Urbana-Champaign, spoke about digital scholarship in the humanities and social sciences. Green and her colleagues are developing “Digging Deeper, Reaching Further,” a project empowering users to mine the HathiTrust Digital Library resources. Green said, “We enable researchers to gather the information from the library and work with data sets and analyze and produce new findings. We facilitate the access to textual data.” DDRF’s goal is to train the trainer, to teach librarians fundamental text-mining skills and how to work with data. They also train librarians and researchers together. They will eventually launch pilot workshops and take them on the road to major conferences and key geographic areas.
Trevor Muñoz, of the University of Maryland, spoke about involving the community with the archives that are supposed to represent it. He focused on an initiative titled “African American History, Culture and Digital Humanities,” or AADHUM. One of its goals is to break down barriers with the community whose history it is archiving and to engage with the people the archived material comes from. Muñoz said, “We put the data in the center and asked our community all of the ways in which they might wish to respond to it, without presuming that we know particular methods or techniques that we need to communicate outward to them.” Muñoz said that the communities that AADHUM builds will be as much a measure of success as the programming it produces. He sees AADHUM as being a feedback loop that continuously informs and restructures itself.
Marisa Parham, of Amherst College, spoke about the personal element of archives. “When you are thinking about the raw material that constitutes collections, you are often talking about personal things,” Parham said. She said archivists too often look at collections, especially digital ones, as merely a collection of data and lose sight of the person at the heart of it. She talked about finding photos with no descriptions and about the bias of facial-recognition software, and in each case how the person in the photo can become disconnected from the object. The archivist should avoid being invasive and clinical and should do her best to honor the humanity of the person on whom the archive is built. In speaking about a certain person’s archives, Parham said, “Much of what is at stake in example comes down to ownership. The idea of personal collection is an exercise in exclusion. I have to be careful in thinking about the archive as her collection. It’s a collection of her stuff…We must have humility about the stewardship.”
Thomas Padilla, of UC Santa Barbara, delivered the closing keynote. Padilla stressed the need for expanding an individual’s capacity to act. In the case of web archives, that applies to going beyond searching, browsing and reading archived web pages. “If we peel back the layer, to engage the underlying, less visible structures organizing the representation we see on the screen, it becomes possible to ask more questions of the collection as data,” Padilla said. As an example, he cited Ian Milligan’s network visualization of connections among Canadian Political Parties and Interest Groups.
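One of the “less visible structures” in a web archive is the set of hyperlinks between sites. As a rough sketch of how a link network could be derived from archived pages, the example below counts links from one host to another using only Python’s standard library. It is not Milligan’s pipeline; the class `LinkCollector`, the function `domain_edges`, and the `.example` sites and their HTML are hypothetical stand-ins for real archived pages.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def domain_edges(pages):
    """Count links between hosts.

    pages: iterable of (page_url, html) pairs.
    Returns a Counter keyed by (source_host, target_host).
    """
    edges = Counter()
    for page_url, html in pages:
        collector = LinkCollector()
        collector.feed(html)
        source = urlparse(page_url).netloc
        for href in collector.hrefs:
            # Resolve relative links against the page's own URL.
            target = urlparse(urljoin(page_url, href)).netloc
            # Skip same-site links and links with no host (e.g. mailto:).
            if target and target != source:
                edges[(source, target)] += 1
    return edges


# Invented pages standing in for records pulled from a web archive.
sample_pages = [
    ("http://party-a.example/index.html",
     '<a href="http://group-b.example/news">News</a>'
     '<a href="/about">About</a>'
     '<a href="http://group-b.example/join">Join</a>'),
    ("http://group-b.example/news",
     '<a href="http://party-a.example/">Party A</a>'
     '<a href="mailto:info@group-b.example">Email</a>'),
]

edges = domain_edges(sample_pages)
for (source, target), weight in edges.items():
    print(f"{source} -> {target}: {weight} link(s)")
```

The resulting weighted edge list is the kind of table that a graph tool could then lay out as a network diagram.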
Padilla faulted a lack of administrative support for exploration and experimentation in many institutions and he singled out the Cooper Hewitt museum and the British Library Labs for “brave experiments” that generated innovative results. “You may have heard of (the British Library Labs’) pioneering work, automatically extracting a million images from historical text, and making them available via Flickr under a CC0 license,” Padilla said. “Subsequently, they have encouraged a number of competitions for working with the collections that they release as data. These efforts…provide opportunities to refine the way that we prepare and provide access to collections, and can lead to concrete reciprocal benefits from outside our institutions.” Padilla echoed the sentiments of many of the day’s speakers, that we need to be aware of who assembled the collections, who made the decisions and how they may have influenced the collections with a subtle, or not so subtle, bias. He called for more transparency and openness around digital collections to help avoid systematic bias (gender, racial, geographical or cultural) and he singled out the Digital Library Federation’s Cultural Assessment Group as a step in the right direction.
The Library of Congress’s 500-seat Coolidge Auditorium was filled almost to capacity, with visitors from across the United States. The event was live streamed with real-time closed captioning, drawing 200 to 300 concurrent viewers at a time and more than 6,000 views in total. Individual videos from the event will be posted online soon.
Viewers Tweeted during the conference from across the United States and from as far away as Lebanon, Finland, Italy, Germany, Norway, Switzerland, France, the United Kingdom, Ireland, Brazil, Venezuela, Canada, Mexico, New Zealand, Australia, Indonesia and India.
The Library of Congress has commissioned a report based on the presentations from this event and the small, half-day workshop that followed the next day. We hope to share that with you soon.