The following is a guest post from Jane Mandelbaum, co-chair of the National Digital Stewardship Alliance Innovation Working Group and IT Project Manager at the Library of Congress.
As part of our ongoing series of insights discussions with individuals doing innovative work related to digital preservation and stewardship I am excited to talk with Brian Schmidt. Brian works as an astronomer at the Research School of Astronomy and Astrophysics at the Australian National University and his research is based on a lot of the “big data” that many individuals in the digital preservation and stewardship community have been keenly interested in. Schmidt shared both the 2006 Shaw Prize in Astronomy and the 2011 Nobel Prize in Physics for providing evidence that the expansion of the universe is accelerating.
Jane: I read that you’ve predicted that IT specialists will be at the core of building new telescopes. For example, your SkyMapper project, which is currently scanning the southern sky, has a peak data rate of one terabyte per day. The Australian Square Kilometer Array Pathfinder, an array of 36 radio telescope dishes being built in Australia, will generate two terabytes per second. Can you talk about how you think astronomers and IT specialists will work together on these kinds of projects?
Brian: New telescopes like SkyMapper are creating massive amounts of data, a terabyte of data each night. Processing a terabyte of data a night and making that data useful is as much an interesting computer science problem as it is an astronomy problem. In the past, astronomers did a lot of this kind of computer science work themselves. But the reality is, this has moved beyond what I can do sensibly myself. We need interdisciplinary groups of researchers to work together to meet these challenges. Astronomers need to be able to specify the scientific outcomes and algorithms. But the implementation, the design of systems and databases, and how that data is served are computer science problems. So we work with computer scientists to meet our needs. If you have a lot of data, and you’re not a computer scientist, you really want to use the expertise that is out there.
Jane: Do you think that astronomers deal with data differently than other scientists?
Brian: Astronomers are very open with their data. This is one of the reasons that projects like the Sloan Digital Sky Survey work in our field. Alongside that, our data is a representation of the night sky. Everyone knows what stars look like, which means that people understand what we do in a way that they might not with other sciences. Aside from that, much of our data, for example images of galaxies, is beautiful in a way that something like DNA sequences isn’t. These features are all important for our ability to create complex citizen science projects.
Jane: It is sometimes said that astronomers are the scientists who are closest to practitioners of digital preservation because they are interested in using and comparing historical data observations over time. Do astronomers think they are in the digital preservation business?
Brian: Historical data is of the utmost importance in astronomy. Astronomers are often looking for subtle changes that occur over hundreds of years. For example, if we discover a new asteroid that might come close to Earth, we need to go back to the archives and see what data we have on it to figure out if it is a threat. The more years of data you have, the more accurately you can predict the orbit. Other sciences benefit from this kind of long view of historical data; however, we’re the discipline that has had our act together for the longest period of time.
Jane: What do you think the role of traditional libraries, museums and archives should be when dealing with astronomical data and artifacts?
Brian: I think we are still figuring out the role that libraries, archives and museums have to play in the contemporary work of astronomers. In 2003 a firestorm largely destroyed the library at Mount Stromlo Observatory. As a result of the work of IT and library staff, all of the digital information of the observatory had been backed up and was restored from off site. However, all the paper was just gone. Losing a library of resources is a major loss; however, at this point, astronomy is basically a completely digital field. We keep a small number of books around for reference, but when we want to read the literature we have the Harvard/Smithsonian Astrophysics Data System. Just about every interaction my colleagues and I have with papers and articles is through that portal. We just search and download the full text.
While we have digital access to research and reference material through services like the Astrophysics Data System, there are substantial information challenges we are facing that I think libraries, archives and museums could help with. We’re even more information driven than in the past. Our work could be substantially aided by libraries providing systems for working with and curating data. Libraries need to figure out how to help curate and make available data and data products. Ideally, we would have librarians taking on increasingly specialist niches, across many institutions. In our library, we are bringing in more staff who have expertise in data management – trained astronomers who decide they want to be exporting data to the masses. I think training people in library science curation is important too, and I imagine we will increasingly see individuals with these skill sets and backgrounds embedded in the teams that produce, maintain, and provide access to various data products.
Jane: “Big data” analysis is often cited as valuable for finding patterns and/or exceptions. How does this relate to the work of astronomers?
Brian: Astronomers are often interested in very rare objects. For example, SkyMapper is cataloging 10 billion stars, and we want to find the earliest stars in the Milky Way, which have a specific color signature. We need that many stars to find enough of those early stars to do our research, and as a result, we need to use data mining techniques to find those very few needles in that gigantic haystack. These techniques allow us to do this.
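The needle-in-a-haystack search Brian describes can be illustrated with a minimal sketch. It assumes each catalog entry carries a single numeric color measurement and that rare objects fall below some cut; the field name `color_index`, the threshold, and the synthetic catalog are all hypothetical stand-ins, not SkyMapper's actual selection criteria.

```python
import random


def make_synthetic_catalog(n_stars, seed=0):
    """Build a toy catalog; 'color_index' is a hypothetical color measurement."""
    rng = random.Random(seed)
    return [{"id": i, "color_index": rng.gauss(1.0, 0.2)} for i in range(n_stars)]


def select_candidates(catalog, max_color_index):
    """Keep only stars whose color index falls below the cut (assumed criterion)."""
    return [star for star in catalog if star["color_index"] < max_color_index]


catalog = make_synthetic_catalog(100_000)
# The cut sits about three standard deviations below the typical value,
# so only a small fraction of the toy catalog survives.
candidates = select_candidates(catalog, max_color_index=0.4)
print(f"{len(candidates)} candidates out of {len(catalog)} stars")
```

A real survey would combine several measurements and run the selection inside a database rather than in memory, but the principle is the same: a cheap filter reduces billions of rows to a short list worth following up.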
Jane: What do you think astronomers have to teach others about generating and using the increasing amounts of data you are seeing now in astronomy?
Brian: Astronomers have been very good at developing standards (database and serving standards). There is a persistent danger that every library uses its own standards. You don’t want to have to work across hundreds of standards to make sense of what each piece of data means. You want a standard to be universal and also flexible enough to add things. Astronomy has been doing this for a good while and it’s not easy. Getting standards for data in place that work requires a consensus dictatorship. It requires collaboration between librarians and computer scientists to figure out how to create and maintain data hierarchies. Astronomers developed the FITS data standard in the 1980s and are still using it. In the last five to seven years it has diverged a bit in the field, which suggests we likely need to revisit and revise it. Every time an observatory observes something, there are stars in common between observations that can serve as a point of reference. Linking this data can be very complicated – cross-matching is a difficult problem for 10 billion objects. The obvious thing is to give every object an index number, but you have to allow for uncertainty.
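To make the cross-matching problem concrete, here is a minimal sketch that pairs objects from two catalogs by sky position while allowing for positional uncertainty. The catalog layout (an ID mapped to right ascension and declination in degrees), the function names, and the one-arcsecond tolerance are assumptions made for illustration, not a description of any observatory's pipeline.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays accurate for very small separations.
    sin_ddec = math.sin((dec2 - dec1) / 2)
    sin_dra = math.sin((ra2 - ra1) / 2)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(catalog_a, catalog_b, tolerance_arcsec):
    """Map each ID in catalog_a to the nearest ID in catalog_b within the tolerance.

    Objects with no counterpart inside the tolerance map to None.
    This is a brute-force comparison of every pair of objects.
    """
    tolerance_deg = tolerance_arcsec / 3600.0
    matches = {}
    for id_a, (ra_a, dec_a) in catalog_a.items():
        best_id, best_sep = None, tolerance_deg
        for id_b, (ra_b, dec_b) in catalog_b.items():
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        matches[id_a] = best_id
    return matches


catalog_a = {"a1": (10.0, -30.0), "a2": (200.0, 5.0)}
catalog_b = {"b1": (10.0002, -30.0001), "b2": (150.0, 20.0)}
print(cross_match(catalog_a, catalog_b, tolerance_arcsec=1.0))
```

Comparing every pair is only workable for small catalogs. At the scale of billions of objects, the sky is typically divided with a spatial index so that each object is compared only against its near neighbors.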
Jane: What do you think will be different about the type of data you will have available and use in 10 years or 20 years?
Brian: We are going to continue to have more and more data and information. We now have images of the sky, but in the future we will have images at thousands of wavelengths (compared to 5 or 6 now). We are going to have data cubes that record coordinates and intensity at 16k frequencies from radio telescopes. We are talking about instruments that generate a petabyte of data a night. This quantity of data is a challenge for every part of a system. It’s difficult to store, retrieve, process, and analyze, and exactly how we work with it is a work in progress. We very well may need to be processing this data in real time, finding the signal we care about and disregarding the noise, because the initial raw data is just too much to deal with if we let it pile up.
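The idea of processing data as it arrives, keeping the signal and discarding the rest, can be sketched as a streaming filter. This is an illustration under simple assumptions: the noise baseline and noise level are taken as known in advance, a "signal" is any sample far above that noise, and the function name and parameters are hypothetical.

```python
def keep_signal(chunks, baseline, noise_sigma, n_sigma=5.0):
    """Yield (chunk_index, sample) for samples well above the assumed noise level.

    Each chunk is examined once and then dropped, so raw data never accumulates.
    """
    threshold = baseline + n_sigma * noise_sigma
    for index, chunk in enumerate(chunks):
        for sample in chunk:
            if sample > threshold:
                yield index, sample


# Toy stream: mostly noise around zero, with two strong samples.
stream = [
    [0.1, -0.2, 0.0, 9.0],
    [0.3, -0.1, 0.2, 0.0],
    [7.5, 0.1],
]
detections = list(keep_signal(stream, baseline=0.0, noise_sigma=1.0))
print(detections)
```

Because the function is a generator, it can consume an arbitrarily long stream while holding only one chunk in memory at a time. The trade-off is the one Brian points to: anything the filter discards cannot be revisited later.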
Jane: Speaking of raw data, do astronomers share raw data, and if so, how? When they do share, what are their practices for assigning credit and value to that work? Do you think this will change in the future?
Brian: Astronomers tend to store data in multiple formats. There is the raw data, as it comes off the telescope, and we tend to store a copy of that. However, the average researcher doesn’t care about that. They want it transformed into its final state – fully calibrated, so that we know where every pixel points to in the sky. At this point, all the data we provide access to is processed data. You can make a query and we give back “here’s this star and its properties.” It’s just too hard to query into the actual images we’ve collected. That isn’t how our systems are set up.
Jane: You’ve talked about the value of citizen science projects such as Galaxy Zoo. How do you think these kinds of projects could make a case for preservation of data?
Brian: Citizen science, at its best, serves as outreach/education and the advancement of science simultaneously. We need to be careful that citizen science projects are doing scientifically useful work with the hours and efforts people are putting in. Ideally, we can leverage the work people put into these kinds of projects to calibrate algorithms to double the value of their efforts. Given the immense data challenges facing astronomy and other sciences, and the potential for citizen science projects to bring the public in to help us make sense of this data, I think we are entering into a brave new information world. At this point, we need library and information science to become a lot bolder to stay relevant. There are huge opportunities to do great things in this area. I think timidity is likely the biggest threat to the future potential role that libraries, archives and museums could play in the future of sciences like astronomy. There are huge opportunities here to do great things.
Comments
Thank you Brian and Jane for an informative post — there is much to admire in the cooperative work of the world’s astronomers. For those interested in the FITS format, one good starting point is the description that Caroline Arms assembled for the Library of Congress Format Sustainability Web site: http://www.digitalpreservation.gov/formats/fdd/fdd000317.shtml. The format team at the Library was nudged (and then assisted) in the preparation of this description by Lucio Chiappetti (IASF Milano, IT), who serves as the current chair of the International Astronomical Union’s FITS Working Group.