
How to Think About Data: A Conversation with Christine Borgman


Members of the Scholars Council are appointed by the Librarian of Congress to advise on matters related to scholarship at the Library, with special attention to the Kluge Center and the Kluge Prize. The Council includes scholars, writers, researchers, and scientists. “Insights” is featuring some of the work of this group of thinkers. Dan Turello continues the series in a conversation with Christine Borgman.

Christine L. Borgman, Distinguished Research Professor and Presidential Chair Emerita in Information Studies at UCLA, is the author of more than 250 publications in information studies, computer science, and communication. These include three books from MIT Press: “Big Data, Little Data, No Data: Scholarship in the Networked World” (2015), “Scholarship in the Digital Age: Information, Infrastructure, and the Internet” (2007), and “From Gutenberg to the Global Information Infrastructure: Access to Information in a Networked World” (2000).

DT: Christine, you have a background in mathematics. What inspired you to study libraries and data?

CB: In the 1970s, women with math degrees had two career options: teaching school or programming. Having done some of each, I sought a career that combined technology, teaching, and communication. Those were the early days of library automation, and my mother, a superlative reference librarian at Wayne State University who hated computers, recognized the profession’s dire need for my skillset. Allen Kent managed an information empire at the University of Pittsburgh that spanned the Library and Information Science graduate program, the university library, the computer center, and his NSF- and NASA-funded research group. As project director, he hired me to be a graduate student researcher on his information retrieval grants, and as my academic advisor, he allowed me to count math and computing courses in data modeling toward my Master’s in Library Science (MLS) degree.

My MLS-information science degree was ahead of its time, and an ideal qualification to become a systems analyst for the Dallas Public Library. We extended a home-grown, semi-automated circulation system into the first online catalog of a major public library. Because the library catalog system was built on the city’s IBM mainframe (in assembler language, no less), it was a distributed system available to the branch libraries and to all City of Dallas employees who had computer terminals on their desks. Leading the library’s design team was a humbling experience and a crash course in collaboration, metadata, human-computer interaction, programming, systems development life cycles, distributed computing, evaluation, management, and much else.

DT: Was the goal to increase accessibility? How did it go?

CB: Dallas still was a frontier city in the 1970s and determined to be a pioneer in many areas, including information technology. Lillian Bradshaw, the legendary director of the Dallas Public Library (and president of the American Library Association, among other accomplishments), was half of the “power couple” of City of Dallas leadership – her husband was chief of the fire department. Together they made the library the top computing priority for the city, ahead of police and fire, which was indeed amazing. We designed a complete integrated library system, with the master bibliographic database (MBDB) at the core and the online catalog, circulation, library acquisitions, and other services as the spokes. The circulation system, which came first, improved the speed of checkout and the tracking of collections. The online catalog greatly improved access to collections and facilitated an inventory of shelf materials. The acquisitions system improved accounting and coordination of purchases between Central and the branches. It was an ambitious project, and while technical constraints limited the size of records and the sophistication of user interfaces, these automated systems did improve access to collections, services to library patrons, and management of resources.
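
Purely as an illustration of the hub-and-spoke design she describes (the original system was written in assembler on the city’s mainframe), here is a minimal sketch in Python with invented record types and field names:

```python
from dataclasses import dataclass
from typing import Optional

# Hub: one master bibliographic (MBDB) record per title.
# All field names here are invented for illustration.
@dataclass
class MasterBibRecord:
    record_id: str        # key shared by every spoke system
    title: str
    author: str
    call_number: str

# Spokes reference the hub by record_id instead of duplicating it.
@dataclass
class CirculationRecord:
    record_id: str        # points back to the MBDB entry
    copy_barcode: str
    checked_out_to: Optional[str] = None

@dataclass
class AcquisitionRecord:
    record_id: str
    branch: str
    purchase_order: str

# A toy MBDB: catalog, circulation, and acquisitions all share one key.
mbdb = {"b0001": MasterBibRecord("b0001", "Example Title", "Doe, Jane", "QA76 .D63")}
circulation = [CirculationRecord("b0001", copy_barcode="0000012345")]
acquisitions = [AcquisitionRecord("b0001", branch="Central", purchase_order="PO-77")]
```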

DT: And from there, you found your way into academia, and to UCLA.

CB: My experiences in information retrieval at the University of Pittsburgh and in library automation at the Dallas Public Library made me appreciate the complexities of human-computer interaction. The term “human-computer interaction” (HCI) had not yet been coined in the 1970s, much less established as a degree program. In 1980, I was thrilled to be accepted into the best information technology program of the day, which was the Department of Communication at Stanford University. In addition to the requisite courses on communication theory, media, and research methods, generous mentoring by William Paisley and Everett Rogers led me to courses in industrial engineering (“man-machine systems”), cognitive psychology, artificial intelligence, sociology, and other fields around the university. The early 1980s were the heady days of technology from which Silicon Valley emerged. Xerox PARC, the Institute for the Future, Apple, SRI, and other local organizations all sought Stanford grad students for their labs. OCLC funded much of my degree as an off-site Graduate Student Researcher. Doug Engelbart and Allen Newell were guest speakers in our classes. I attended all of Herbert Simon’s lectures when he visited as a distinguished scholar. Amos Tversky and Lee Ross taught a small seminar from their early work on heuristics and judgment. One could easily become a perpetual student amid these intellectual riches, but I wrapped up my PhD in three years to continue my career.

By the early 1980s, women with technology degrees – especially with a Stanford PhD – had far more options. While I considered faculty positions in communication and in business schools, the heaviest recruiting was from library schools, as they were then known. Anyone who could teach library automation, online searching, and human-computer interaction was hot property. Robert M. Hayes, Dean of the Graduate School of Library and Information Science at UCLA, a mathematician and a pioneer in library automation, was a persuasive recruiter and a superb mentor. The quality of the UCLA faculty and students, the array of excellent programs across the campus, the resources for developing my research agenda, and the top-rated library collections all were part of the attraction. Having spent 35 years on the UCLA faculty, I can affirm that it was the right choice.

DT: The words “data” and “point” are often used together. But in geometry, points don’t have dimensions, and in your book “Big Data, Little Data, No Data” you make the case that data is best understood in the context of time and storyline, which are dimensional. Could you tell us more?

CB: “Data” and “point,” taken together, usually are regarded as “facts.” Each of these terms is problematic, as one person’s fact is another person’s opinion. The contentious nature of “facts” is a common theme in the history of knowledge, and yet data continue to be defined as facts in research policy documents. When a journal or a funding agency requires that data associated with a research project be released, just what “data” or “facts” should be submitted? A list of numbers? A complete documentation of all steps in a project, including hardware, software, and specimens? Almost anything can be treated as data, at some time, for some purpose. Whether numerical observations, experimental outputs, models, physical specimens, audio or visual recordings, archaeological artifacts, or other representations of the natural world, data can only be understood in a context. Thus, as you suggest, data become useful when they are set in a storyline via documentation such as metadata and provenance information; explanations appropriate for experts in the domain; access to necessary software, hardware, and calibration records; and myriad other details that vary by the uses to which data might later be put. A string of numbers or a spreadsheet of alphanumeric observations is of little value without an origin story.
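
As a loose illustration of that origin story, consider a bare list of numbers alongside the same numbers wrapped in metadata and provenance; every field name and value below is invented for the example:

```python
import json

# A bare string of numbers: little value without an origin story.
readings = [21.4, 21.7, 22.1, 21.9]

# The same numbers wrapped in (hypothetical) metadata and provenance.
dataset = {
    "values": readings,
    "variable": "air temperature",
    "units": "degrees Celsius",
    "instrument": "thermistor sensor node, serial TN-042",
    "calibration": "checked against reference thermometer, 2014-06-01",
    "location": "field site A, 50 m transect",
    "collected": "2014-06-15T06:00Z to 2014-06-15T07:00Z",
    "collector": "project field team",
    "processing": "raw voltages converted to deg C by vendor firmware v2.3",
}

print(json.dumps(dataset, indent=2))
```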

“Turner’s North Carolina almanac: for the year of our Lord …” (Henry D. Turner, 1847), p. 224. From State Library of North Carolina: https://www.flickr.com/photos/internetarchivebookimages

DT: Data derives from the Latin “datum” meaning “given.” It strikes me that there are two ironies here. Data, these days, is more often taken than given – by scientists extracting it from nature, or companies harvesting it from customers. The other irony comes from the second shade of meaning as a “given” in the sense of a reliable base of knowledge, because we have increasingly come to think of data as the result of a framing, of a particular way of viewing certain segments of reality. Could you comment on the idea of consent around data collection? Who is taking, and who is giving, and what are the ethics around these transactions?

CB: Because data exist in a context, they are highly subject to interpretation, and thus to misinterpretation. After years of wrestling with messy data in messy research contexts, I arrived at this phenomenological definition in the “Big Data, Little Data, No Data” book:

“Data are representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship.”

Two scholars, working side by side in lab or field or library, can have legitimately different interpretations of the same observation or object. In our own research, for example, an electrical engineer measured temperature based on consistent readings from a sensor network of his own design, whereas a partner biologist trusted readings from that network only after the sensors had been calibrated against another instrument that was, in turn, calibrated against an international standard. Temperature is not a single scientific construct; it is measured differently for technical, biological, meteorological, medical, and other purposes. It has temporal characteristics, as instantaneous, intermittent, and longitudinal measurements serve different purposes. Common measures such as “average daily temperature” can be computed in many ways.
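
To make the last point concrete, here is a small Python sketch, with invented hourly readings, of two common conventions for computing “average daily temperature” that give different answers for the same day:

```python
# 24 hourly readings in degrees Celsius for one (invented) day.
hourly = [12.0, 11.5, 11.0, 10.8, 10.6, 10.9, 11.8, 13.5,
          15.6, 17.8, 19.5, 21.0, 22.1, 22.8, 23.0, 22.6,
          21.4, 19.8, 18.0, 16.5, 15.2, 14.3, 13.4, 12.7]

# Convention 1: arithmetic mean of all hourly observations.
mean_of_hours = sum(hourly) / len(hourly)

# Convention 2: midpoint of the daily minimum and maximum, a convention
# often used when only min/max thermometer readings are available.
min_max_midpoint = (min(hourly) + max(hourly)) / 2

print(f"mean of 24 hourly readings: {mean_of_hours:.2f} C")
print(f"(min + max) / 2:            {min_max_midpoint:.2f} C")
# The two conventions generally disagree; neither is "the" average,
# which is why documenting the computation matters.
```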

In scholarly contexts, the ethics of data collection rely on accurate reporting of research methods that identify the strengths and limitations of a study. A reader’s ability to trust findings, and the data and interpretation on which those findings are based, is multi-faceted. Peer reviewers, prior to publication, assess the veracity of the work. Once published, scholars consider the reputation of the journal, the authors, research questions, and methods. They may also inspect the dataset, if available. Trust, in the sense of “a reliable base of knowledge,” is the result of checks and balances in the scholarly communication system, and the accumulation of a consistent base of findings over time.

Raper, Howard Riley, “Elementary and Dental Radiography” (New York: Consolidated Dental Mfg., 1913), p. 122. From Columbia University Libraries: www.flickr.com/photos/internetarchivebookimages

Consent has become extremely problematic in an age of ubiquitous data collection, whether for research, marketing, or surveillance purposes. The Code of Fair Information Practices, which was adopted internationally by the early 1970s, focused on notice of data collection and informed consent, plus other important constraints such as access to information being collected about oneself. These policies were designed for local record keeping systems; they long pre-date mobile computing, networked computer systems, and data integration at scale.

Those of us who conduct research with human subjects, where we interview people or collect data about them in other ways, follow elaborate ethical and legal policies for data protection, confidentiality, and reporting. Businesses, especially in the United States, are subject to much less stringent policies. In Europe, the General Data Protection Regulation (GDPR) provides stronger protections for data about individuals.

Whether for research, business, or government data collection, notice and informed consent remain necessary but are by no means sufficient. Much broader governance of the uses of those data, alone and aggregated, is necessary.

DT: Digitally published materials are very easy to disseminate – they are available to anyone, anywhere around the world with access to a web browser. Preservation is a different issue. Could you talk about the challenges of digital preservation? What are the trade-offs of digital versus print?

CB: Preserving and conserving print materials is a non-trivial matter, as the Library of Congress is well aware. The most obvious difference between these formats is that print can survive by benign neglect, given adequate environmental controls. Digital materials are compound objects, often consisting of text, images, numeric models, sounds, animations, and other entities, each of which must be migrated to new forms of hardware and software on a regular basis. Maintaining relationships between these compound objects is extremely problematic, given that each unit may degrade at different rates. Preserving our “software heritage” is the new frontier. Digital data are collected, created, interpreted, and disseminated via software. If you don’t have the software, you don’t have the data.
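
As a modest illustration of that dependence on software, here is a hypothetical Python sketch of a preservation manifest that records a fixity checksum and the software needed to render each part of a compound object (file names and format notes are invented):

```python
import hashlib
import json
from pathlib import Path

def fixity(path: Path) -> str:
    """Return a SHA-256 checksum used to verify a file has not changed."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Hypothetical parts of one compound object, each listed with the software
# needed to render it; without that software, the bits are not usable data.
parts = [
    {"file": "article.pdf",    "requires": "PDF 1.4 reader"},
    {"file": "model.nc",       "requires": "netCDF-4 library"},
    {"file": "analysis.ipynb", "requires": "Python 3 + Jupyter"},
]

manifest = []
for part in parts:
    p = Path(part["file"])
    manifest.append({**part, "sha256": fixity(p) if p.exists() else None})

# The manifest itself must be preserved and migrated along with the files.
print(json.dumps(manifest, indent=2))
```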

Hoe, Robert, “A short history of the printing press and of the improvements in printing machinery from the time of Gutenberg up to the present day” (1902), p. 22. From Cornell University Library: https://www.flickr.com/photos/internetarchivebookimages

 
