Dealing with the Deluge

Computers have long played an important role in astronomy. It’s just that, back in the late 19th and early 20th centuries, the word was originally applied to people rather than machines—specifically to the men and women who, during daylight hours, carried out repetitive calculations and measurements derived from some of the earliest photographic surveys of the night sky.

At the time, Harvard College Observatory, under the directorship of Edward C. Pickering, was at the forefront of extracting information from recorded data; most famously, perhaps, when Dundee-born Williamina “Mina” Fleming assisted Pickering in classifying thousands of stars based on the photographed spectra of their light. (As explained in BBC Sky at Night magazine #97, June 2013, during this process Fleming was also the first person to “discover” the Horsehead Nebula, while studying a photographic plate that had been exposed months earlier.)

Thanks to ongoing advances in telescopes, detectors and computer technology, the richness of astronomical observations across the entire electromagnetic spectrum today would undoubtedly have amazed Pickering and Fleming. It’s fair to say, however, that such advances would not be possible without the parallel development of Information Technology.

“Computerisation impacts everywhere, from the control of instruments and the telescopes themselves, all the way through to handling and interpreting the data,” says Nigel Hambly of the Institute for Astronomy at the University of Edinburgh. “The software engineer is an extremely important member of the astronomical team.”

THE CHALLENGE
Hambly and his colleagues at the Royal Observatory, Edinburgh, are currently involved in several major projects that are generating huge amounts of data, most notably the European Space Agency’s (ESA) Gaia mission, which aims to build a three-dimensional map of our own Galaxy. It’s a project that’s expected to generate about 1,000 terabytes (aka one petabyte) of data—enough to fill more than 7,800 iPads!
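That comparison is easy to sanity-check. Assuming 128GB of storage per iPad (the article doesn’t state a capacity, so treat that as an illustrative figure that happens to match the number quoted), the arithmetic works out like this:

```python
# Rough sanity check of the "more than 7,800 iPads" comparison.
# The 128GB-per-iPad figure is an assumption made purely for illustration.
petabyte_in_gb = 1_000 * 1_000   # 1 petabyte = 1,000 terabytes = 1,000,000 gigabytes
ipad_capacity_gb = 128           # assumed storage capacity of a single iPad
print(petabyte_in_gb / ipad_capacity_gb)  # 7812.5, i.e. "more than 7,800"
```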

As the number, range and quality of observations continue to improve, however, are astronomers facing the danger of having more data than they can possibly cope with? Hambly doesn’t think so. “In my experience, during the last 20 years, the ambition of what people have tried to do has always pushed the available technology. In terms of information technology, there’s been such an exponential growth in processing and storage capability that the ambition has gone up in the same way.

“When I first started in this business a couple of decades ago, data storage was enough of an issue that sometimes what would happen in experiments is that the data would be processed in real time and the raw data stream would then be discarded,” he explains. “Now data storage is not so much an issue because the technology has just come on leaps and bounds, so we can archive raw measurements and always have the possibility of going back and looking at them.”

PROCESSING
Hambly’s colleague Michael Read works chiefly on archiving data derived from the panoramic surveys of the southern skies provided by the European Southern Observatory’s VISTA (Visible and Infrared Survey Telescope for Astronomy) project in Chile. While he doesn’t deal with the raw data—the images are “cleaned up” in Cambridge first—he is very much involved with the further processing of the images and with “ingesting” the data from them into an archive that can be used by professional—and amateur—astronomers.

“A lot of amateur astronomers use similar techniques with their own cameras and telescopes,” Read points out. “They’ll make dark and flat-field corrections and calibrate the data in the same way; we just do it on a global scale and, hopefully, to a higher level of detail.” But it’s not just for the benefit of professional astronomers. “We also provide material for citizen science projects. Some of our images have gone into Galaxy Zoo,” he adds.
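For readers who haven’t met these corrections before, the sketch below shows the basic idea in Python with NumPy. It only illustrates the principle Read describes, not the archive’s actual pipeline code, and the function and variable names are invented.

```python
import numpy as np

# Illustrative sketch of dark and flat-field correction, not a survey pipeline.
# raw, dark and flat are 2-D arrays: the science exposure, a combined dark frame
# (matching exposure time) and a combined flat-field frame respectively.
def calibrate(raw: np.ndarray, dark: np.ndarray, flat: np.ndarray) -> np.ndarray:
    dark_subtracted = raw - dark                 # remove thermal/bias signal
    flat_normalised = flat / np.median(flat)     # scale flat so its typical value is 1
    return dark_subtracted / flat_normalised     # correct pixel-to-pixel sensitivity

# Example with synthetic frames:
rng = np.random.default_rng(0)
raw = rng.normal(1000, 10, (512, 512))
dark = np.full((512, 512), 100.0)
flat = rng.normal(1.0, 0.02, (512, 512))
science = calibrate(raw, dark, flat)
```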

And there are byproducts; early in his career, Read was involved with the digitisation of photographic plates taken by telescopes in Australia. “Some of the algorithms we developed were adapted to work on mammograms to help detect breast cancer,” he says.

The scale of the current projects can be astounding, such as cataloguing the billions of objects in images of the Galactic plane. “We’re now producing tables with tens of billions of rows, which we’ve never done before,” Read explains. “We have many, many tens of columns of attributes—stars’ positions, colours, magnitudes and so on. So users can pick out a lot of the data; they can find all the galaxies in a particular region of the sky, or data mine the database for one type of object across the whole sky—say, the most distant quasars. Either way, you can start out with a hundred million objects and narrow it down to a few hundred, and take observations from that.”
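To give a flavour of what such a selection looks like in practice, here is a toy-scale sketch using Python’s built-in sqlite3 module. The table layout, column names and colour cut are invented for illustration; the real archive holds tens of billions of rows and is queried through its own interfaces.

```python
import sqlite3

# Toy illustration of archive-style queries; schema and data are invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (ra REAL, dec REAL, j_mag REAL, k_mag REAL)")
con.executemany(
    "INSERT INTO source VALUES (?, ?, ?, ?)",
    [(180.1, -60.2, 17.5, 16.1), (180.3, -60.1, 19.8, 17.0), (10.0, 5.0, 15.2, 14.9)],
)

# "All the objects in a particular region of the sky": a simple box selection.
region = con.execute(
    "SELECT ra, dec FROM source "
    "WHERE ra BETWEEN 180 AND 181 AND dec BETWEEN -61 AND -60"
).fetchall()

# "Data mine the database for one type of object across the whole sky":
# very red sources (large J - K colour) as crude distant-quasar candidates.
red_objects = con.execute(
    "SELECT ra, dec, j_mag - k_mag AS colour FROM source WHERE j_mag - k_mag > 2.0"
).fetchall()

print(region, red_objects)
```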

Enabling this kind of data-mining—previously impossible without computer assistance—has necessarily required a certain amount of bespoke IT. “We’ve written a lot of ‘curation’ software that ingests the data into the database in bulk,” Read adds. “We have had to overcome certain issues—finding the quickest way of getting the data into the database, bulk-loading, and tuning all that.”
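What “bulk ingestion” can mean in practice is sketched below, under the assumption of a simple CSV input: rows are loaded in large batches and committed once, which is usually far quicker than inserting and committing one row at a time. The file name, schema and batch size are all illustrative, and this is not the archive’s actual curation software.

```python
import csv
import sqlite3

# Illustrative bulk-loading sketch: batch the INSERTs and commit once at the end.
def bulk_load(db_path: str, csv_path: str, batch_size: int = 100_000) -> None:
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS source (ra REAL, dec REAL, mag REAL)")
    with open(csv_path, newline="") as f:
        batch = []
        for row in csv.reader(f):
            batch.append(tuple(float(x) for x in row))
            if len(batch) >= batch_size:
                con.executemany("INSERT INTO source VALUES (?, ?, ?)", batch)
                batch.clear()
        if batch:
            con.executemany("INSERT INTO source VALUES (?, ?, ?)", batch)
    con.commit()   # a single commit keeps per-row transaction overhead low
    con.close()
```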

The Gaia data processing system, for example, was engineered using the programming language Java, running on a Unix-based operating system. “One of the advantages of Java is that it’s cross-platform,” explains Hambly. “It’s very easy to deploy it on different platforms, and that’s one of the reasons it was chosen. We try and choose the best tool for the job at the time the job is being done.”

DOING REAL SCIENCE
As with so much else in our world, accessing all this astronomical data is increasingly being done online. “It’s got to the point now that people are talking about cloud solutions for storage,” explains Hambly, “easily accessible to anybody around the world.”

That said, although today’s digital storage systems come with “quite a lot of redundancy in them now,” there’s still a place for tape when it comes to offline backups. “There have been some pretty good developments in tape technology that mean it’s still feasible to back up very large datasets and stick them on a shelf somewhere as a kind of insurance policy,” he adds. “The actual missions to gather data, be they ground- or space-based, are extremely expensive. To go back and get data again is very often not an option, so we owe it to posterity—and to the taxpayers who ultimately cough up the money for the projects—to preserve the data.”

Certainly, retaining raw data avoids the risk of unknowingly throwing out the baby of an unexpected astronomical discovery along with the bathwater of “background noise”, which might not be recognised as significant depending on what astronomers are looking for at the time. That, arguably, is the raison d’être not just for doing such huge surveys in the first place, but for being able to store, assimilate and process the resulting datasets in an easily accessible digital form. It also opens up the potential for further discoveries, as datasets from around the world can be cross-matched and linked with new software tools and IT infrastructure.

“This systematic, panchromatic approach would enable new science, in addition to what can be done with individual surveys,” according to scientists at the California Institute of Technology, Pasadena, and Johns Hopkins University, Baltimore. “It would enable meaningful, effective experiments within these vast data parameter spaces. It would also facilitate the inclusion of new massive data sets, and optimise the design of future surveys and space missions. Most importantly, [it] would provide access to powerful new resources to scientists and students everywhere, who could do first-rate observational astronomy regardless of their access to large ground-based telescopes.”

. . . . . .

BOXOUT:
* Past: Herschel (2009-2013)
Produced about 3TB over the course of its mission.

* Present: VISTA (2009-present)
About 500TB so far during the mission.

* Future: Gaia
Expected to produce 1PB over its 4.5-year mission.

COMPUTING POWER
What equipment do you actually need to crunch the numbers? The Atacama Large Millimeter/submillimeter Array (ALMA) of radio telescopes (a partnership between North America, Europe, and East Asia in cooperation with the Republic of Chile) began scientific observations in the latter half of 2011 and has been fully operational since March 2013.

Given that the “low end” of data products will contain about 1GB of raw data, it’s not surprising that the ALMA site suggests that only the smallest datasets “can be effectively processed on a laptop” and that, for desktop computers, 8GB “is probably the minimum memory needed for data reduction”.

The very minimum suggested for “low end” data processing is:
• Dual-core Intel Xeon 2.27GHz processor
• 12GB 1333MHz DDR3 SDRAM
• 1×1.5TB disk

First published by BBC Sky at Night magazine, August 2014, Issue #111.
