The following is a guest post from Jane Mandelbaum, co-chair of the National Digital Stewardship Alliance Innovation Working Group and IT Project Manager at the Library of Congress.
As part of our ongoing series of Insights interviews with individuals doing innovative work related to digital preservation and stewardship, we are interested in talking to practitioners from other fields on how they manage and use data over time.
To that end, I am excited to explore some of these issues and questions with Elizabeth Griffin. Elizabeth is an astrophysicist at the Dominion Astrophysical Observatory in Victoria, Canada. She is Chair of the International Astronomical Union Task Force for the Preservation and Digitization of Photographic Plates, and Chair of the Data at Risk Task Group of the International Council for Science Committee on Data for Science and Technology. Griffin presented on Preserving and Rescuing Heritage Information on Analogue Media (PDF) at Digital Preservation 2014. We’re interested in understanding how astronomers have been managing and using astronomical data and hope that others can learn from the examples of astronomers.
Jane: Do you think that astronomers deal with data differently than other scientists?
Elizabeth: Not differently in principle – data are precious and need to be saved and shared – but the astronomical community has managed to get its act together efficiently, and is consequently substantially more advanced in its operation of data management and sharing than are other sciences. One reason is that the community is relatively small compared to that of other natural sciences and its attendant international nature also requires careful attention to systems that have no borders.
Another is that its heritage records are photographic plates, requiring a Plate Archivist with duties to catalog what has been obtained; those archives contained a manageable number of observations per observatory (until major surveys like the Sloan Digital Sky Survey became a practical possibility). Thus, astronomers could always access past observations, even if only as photographs, so the advantages of archiving even analogue data were established from early times.
Jane: It is sometimes said that astronomers are the scientists who are closest to practitioners of digital preservation because they are interested in using and comparing historical data observations over time. Do astronomers think they are in the digital preservation business?
Elizabeth: Astronomers know (not just “think”!) that they are in the digital preservation business, and have established numerous accessible archives (mirrored worldwide) that enable researchers to access past data. But “historical” indicates different degrees; if a star changes by the day, then yesterday’s (born-digital) data are “historical,” whereas for events that have timescales of the order of a century, then “historical” data must include analogue records on photographic plates.
In the former case, born-digital data abound worldwide; in the latter, they are only patchily preserved in digitized form. But the same element of “change” applies throughout the natural sciences, not just in astronomy. Think of global temperature changes and the attendant alterations to glacier coverage, stream flows, dependent flora and fauna, air pollution and so on. Hand-written data in any of the natural sciences, be they ocean temperatures, weather reports, snow and ice measurements or whatever, all belong to modern research, and all relevant scientists have got to see themselves as being in the digital preservation business, and to devote an aliquot portion of their resources to nurturing those precious legacy data.
We have no other routes to the truth about environmental changes that are on a longer time-scale than our own personal memories or records take us. Digital preservation of these types of data is vital for all aspects of knowledge regarding change in the natural world, and the scientists involved must join astronomers in being part of the digital preservation business.
Jane: What do you think the role of traditional libraries, museums and archives should be when dealing with astronomical data and artifacts?
Elizabeth: Traditional libraries and archives are invaluable for retaining and preserving documents that mapped or recorded observations at any point in the past. Some artifacts also need to be preserved and displayed, because so often the precision with which measurements could be made (and thence the reliability of what was quoted as the “measuring error”) was dependent upon the technology of the time (for instance, the use of metals with low expansion coefficients in graduated scales, the accuracy with which graduation marks could be inscribed into metal, the repeatability of the ruling engine used to produce a diffraction grating, etc.).
There is also cultural heritage to be read in the historic books and equipment, and it is important to keep that link visible if only so as to retain a correct perspective of where we are now at. Science advances by the way people customarily think and by what [new] information they can access to fuel that thinking, so understanding a particular line of argument or theory can depend importantly upon the culture of the day.
Jane: The word “innovation” is often used in the context of science and technology, and teaching science. See for example: The Marshmallow Challenge. How do you think the concept of “innovation” can be most effectively used?
Elizabeth: “Innovation” has become something of a buzz-word in modern science, particularly when one is groping for a new way to dress up an old project for a grant proposal! The public must also be rather bemused by it, since so many new developments today are described as “innovative.” What is important is to teach the concept of thinking outside the box. That is usually how “innovative” ideas get converted into new technologies – not just cranking the same old handle to tease out one more decimal place – so whether you label it “innovation” or something else, the principle of steering away from the beaten track, and working across scientific disciplines rather than entombing them within specialist ivory towers, is the essential factor in true progress.
Jane: “Big data” analysis is often cited as valuable for finding patterns and/or exceptions. How does this relate to the work of astronomers?
Elizabeth: Very closely! Astronomers invented the “Virtual Observatory” in the late 20th Century, with the express purpose of federating large data-sets (those resulting from major all-sky surveys, for instance) but at different wavelengths (say) or with other significantly different properties, so that a new dimension of science could be perceived/detected/extracted. There are so very many objects in an astronomer’s “target list” (our Galaxy alone contains some 10 billion stars, though amongst those are very many different categories and types) and it was always going to be way beyond individual human power and effort to perform such federating unaided. Concepts of “big data” analyses assist the astronomer very substantially in grappling with that type of new science, though obviously there are guidelines to respect, such as making all metadata conform to certain standards.
Jane: What do you think astronomers have to teach others about generating and using the increasing amounts of data you are seeing now in astronomy?
Elizabeth: A great deal, but the “others” also need to understand how we got to where we now are. It was not easy; there was not the “plentiful funding” that some outsiders like to assume, and all along the way there were (and still are) internecine squabbles over competitions for limited grant funds: public data or individual research is never an easy balance to strike! The main point is to design the basics of a system that can work, and to persevere with establishing what it involves.
The basic system needs to be dynamic – able to accommodate changing conditions and moving goal-posts – and to identify resources that will ensure longevity and support. One such resource is clearly the funding to maintain and operate dynamic, distributed databases of the sort that astronomers now find usefully productive; another is the trained personnel to operate, develop and expand the capabilities, especially in an ever-changing environment. A third is the importance of educating early-career scientists in the relevance and importance of computing support for compute-intensive sciences. That may sound tautological, but it is very true that behind every successful modern researcher is a dedicated computing expert.
Teamwork has been an essential ingredient in astronomers’ ability to access and re-purpose large amounts of data. The Virtual Observatory was not built just by computing experts; at least one third of committee members are present-day research astronomers, able to give first-hand explanations or caveats, and to transmit practical ideas. These aspects are important ingredients in the model. At the same time, astronomers still have a very long way to go; only very limited amounts of their non-digital (i.e. pre-digital) data have so far made it to the electronic world; most observations from “history” were recorded on photographic plates and the faithful translation of those records into electronic images or spectra is a specialist task requiring specialist equipment. One of the big battles which such endeavors face is even a familial one, with astronomer contending against astronomer: most want to go for the shiny and new things, not the old and less sophisticated ones, and it is an uphill task to convince one’s peers that progress is sometimes reached by moving backwards!
Jane: What do you think will be different about the type of data you will have available and use in 10 years or 20 years?
Elizabeth: In essence nothing, just as today we are using basically the same type of data that we have used for the past 100+ years. But access to those data will surely be a bit different, and if wishes grew on trees then we will have electronic access to all our large archives of historic photographic observations and metadata, alongside our ever-growing digital databases of new observations.
Jane: Do astronomers share raw data, and if so, how? When they do share, what are their practices for assigning credit and value to that work? Do you think this will change in the future?
Elizabeth: The situation is not quite like that. Professional observing is carried out at large facilities which are nationally or internationally owned and operated. Those data do not therefore “belong” to the observer, though the plans for the observing, and the immediate results which the Principal Investigator(s) of the observing program may have extracted, are intellectual property owned by the P.I. or colleagues unless or until published. The corresponding data may have limited access restrictions for a proprietary period (usually of the order of 1 year, but can be changed upon request).
Many of the data thus stored are processed by some kind of pipeline to remove instrumental signatures, and are therefore no longer “raw”; besides, raw data from space are telemetered to Earth and would have no intelligible content until translated by a receiving station and converted into images or spectra of some kind. Credit to the original observing should be cited in papers emanating from the research that others carry out on the same data once they are placed in the public domain. I hope that will not change in the future. It is all too tempting for some “armchair” astronomers (one thinks particularly of theoreticians) not to carry out their own observing proposals, but to wait and see what they can cream off from public data archives. That is of course above board, but those people do not always appreciate the subtleties of the equipment or the many nuances that may have affected the quality or content of the output.
Jane: Do astronomers value quantitative data derived from observations differently than images themselves?
Elizabeth: Yes, entirely. The good scientist is a skeptic, and one very effective driver for the high profile of our database management schemes is the undeniable truth that two separate researchers may get different quantitative data from the same initial image, be that “image” a direct image of the sky or of an object, or its spectrum. The initial image is therefore the objective element that should ALWAYS be retained for others to use; the quantitative measurements now in the journal are very useful, but are always only subjective, and never definitive.
Jane: How do you think citizen science projects such as Galaxy Zoo can be used to make a case for preservation of data?
Elizabeth: There is a slight misunderstanding here, or maybe just a bad choice of example! Galaxy Zoo is not a project in which citizens obtain and share data; the Galaxy data that are made available to the public have been acquired professionally with a major telescope facility; the telescope in question (the Sloan Telescope) obtained a vast number of sky images, and it is the classification of the many galaxies which those images show that constitutes the “Galaxy Zoo” project. There is no case to be made out of that project for the preservation of data, since the data (being astronomical!) are already, and will continue to be, preserved anyway.
Your question might be better framed if it referred (for instance) to something like eBird, in which individuals report numbers and dates of bird sightings in their locations, and ornithologists are then able to piece together all that incoming information and worm out of it new information like migration patterns, changes in those patterns, etc. It is the publication of new science like that that helps to build the case for data preservation, particularly when the data in question are not statutorily preserved.