What is astro-informatics?

For over two hundred years, the usual mode of carrying out astronomical research has been that of a single astronomer or small group of astronomers performing observations of a small number of objects. This traditional approach is now undergoing a dramatic and very rapid change, driven by a revolution—in telescope and detector technology, in computing power and cyber-infrastructure, and in information science—that is without precedent in astronomy. These developments have converged in the last few years and are completely altering the manner in which most astronomical research is carried out, mirroring similar transformations now occurring in other disciplines.

Perhaps most significantly, new large-scale surveys of the sky from space and from the ground are being initiated across the entire electromagnetic spectrum, from radio to gamma-ray wavelengths, generating vast amounts of high-quality data. For example, the 2-Micron All-Sky Survey (2MASS), which provides infrared measurements of 300 million stars over the entire sky, and the Sloan Digital Sky Survey (SDSS), which provides multi-wavelength images and spectra for 100 million celestial objects covering one-quarter of the sky, together represent ~15 terabytes of data, rivaling the total information content of the Library of Congress. And this is only the beginning. The Large Synoptic Survey Telescope, an NSF-funded facility recommended in the most recent National Academy of Sciences decadal survey, will produce upwards of 30 terabytes per day, repeatedly imaging the entire sky every few days and thus making large-scale time-domain studies possible for the first time.

In response, the National Academy of Sciences has recommended, and NSF has funded, the establishment of a National Virtual Observatory (NVO) to link the archival data of space- and ground-based observatories, the catalogs of new and upcoming multi-wavelength surveys, and the computational resources necessary to support comparison and cross-correlation among these resources. Rapid querying of large-scale catalogs, establishment of statistical correlations, discovery of new data patterns and temporal variations, and confrontation with sophisticated numerical simulations are all avenues for new science that are now becoming possible.
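To give a concrete flavor of what rapid querying of large-scale catalogs looks like in practice, the sketch below uses the Virtual Observatory's Table Access Protocol (TAP) and its SQL-like query language, ADQL, through the Python package pyvo. The service URL and the table and column names are placeholders chosen for illustration only; they do not refer to a specific archive.

```python
# Illustrative Virtual Observatory query via the Table Access Protocol (TAP).
# The service URL and the table/column names below are placeholders, not a real archive.
import pyvo

service = pyvo.dal.TAPService("https://archive.example.org/tap")  # placeholder endpoint

# ADQL: select bright sources within 0.5 degrees of a position (hypothetical schema).
adql = """
SELECT TOP 1000 source_id, ra, dec, mag
FROM survey.catalog
WHERE 1 = CONTAINS(POINT('ICRS', ra, dec),
                   CIRCLE('ICRS', 180.0, 0.0, 0.5))
  AND mag < 18
"""

result = service.search(adql)   # synchronous TAP query
table = result.to_table()       # convert to an astropy Table for local analysis
print(len(table), "rows retrieved")
```

The same three-step pattern (connect to a TAP service, express the selection in ADQL, pull the result back as a table for local analysis) applies to any archive that publishes a VO-compliant interface.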

At the same time, however, these developments present a compelling mandate to develop the next generation of researchers with the interdisciplinary expertise needed to fully harness the power of these emerging capabilities, and to function effectively in the large-scale international collaborations that are now taking shape. The size and complexity of astronomical data and simulations require a new breed of scientists with training and skills at the interface of astronomy, physics, computer science, statistics, and information science: astro-informatics.

The interdisciplinarity of data-intensive research and training

Consider: An astronomer studying the formation of massive black holes in galaxy collisions, a cognitive neuroscientist studying the neural basis of perceptual decision making, and a chemist studying the structure of large proteins. On their face, these studies could not be more different. They involve structures that differ vastly in spatial scale, from the molecular (nanometers) to the organic (centimeters) to the galactic (zettameters), and involve systems that evolve on timescales ranging from microseconds to gigayears. They involve very different forces that govern interactions among systems of “particles”: the electrostatic in the case of proteins, the biochemical in the case of neurons, and the gravitational in the case of galaxies. Yet, from a computational standpoint, the astronomer, the cognitive neuroscientist, and the chemist in these scenarios are tackling fundamentally the same class of many-body problems, and all require similar high-performance computing resources, algorithms, and skills to simulate interactions, to model real-life systems, and to test hypotheses.
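As a minimal sketch of this shared computational core, the Python/NumPy example below (purely illustrative, and not drawn from any of the projects above) evolves a set of bodies under a softened inverse-square pair interaction with a leapfrog integrator. Changing the coupling constant and the force law is, at this level, what separates the gravitational and electrostatic versions of the problem, and the overall structure of computing interactions and then stepping the state forward carries over to network models of neurons as well.

```python
# Minimal many-body sketch: pairwise inverse-square interactions plus a leapfrog step.
# The same kernel structure applies whether the coupling constant is gravitational
# (galaxy dynamics) or electrostatic (molecular modeling); values here are toy inputs.
import numpy as np

def pairwise_accelerations(pos, mass, coupling=1.0, softening=1e-3):
    """Accelerations from a softened inverse-square pair interaction (O(N^2))."""
    diff = pos[np.newaxis, :, :] - pos[:, np.newaxis, :]      # (N, N, 3) separations
    dist2 = np.sum(diff**2, axis=-1) + softening**2           # softened squared distances
    inv_dist3 = dist2**-1.5
    np.fill_diagonal(inv_dist3, 0.0)                          # no self-interaction
    weights = mass[np.newaxis, :, None] * inv_dist3[:, :, None]
    return coupling * np.sum(diff * weights, axis=1)

def leapfrog_step(pos, vel, mass, dt):
    """Advance the system by one kick-drift-kick leapfrog step."""
    vel_half = vel + 0.5 * dt * pairwise_accelerations(pos, mass)
    pos_new = pos + dt * vel_half
    vel_new = vel_half + 0.5 * dt * pairwise_accelerations(pos_new, mass)
    return pos_new, vel_new

# Toy usage: 1000 randomly placed bodies evolved for a few steps.
rng = np.random.default_rng(0)
pos = rng.standard_normal((1000, 3))
vel = np.zeros((1000, 3))
mass = np.ones(1000)
for _ in range(5):
    pos, vel = leapfrog_step(pos, vel, mass, dt=1e-3)
```

The O(N^2) pairwise sum in this sketch is precisely the bottleneck that pushes all three communities toward the same scalable algorithms and high-performance computing resources.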

Consider: A team of particle physicists in the U.S. and in Brazil analyzing data flowing from the Large Hadron Collider at CERN in order to search for signs of the elusive Higgs boson. Or a team of astronomers in the U.S. and in South Africa sifting through data streaming from the South African Astronomical Observatory in order to search millions of Sun-like stars for faint “shadows” cast by orbiting exoplanets. Both of these international teams need to employ very similar networking technologies to rapidly transfer and share a deluge of information, up to several terabytes per day, and must deploy similar search algorithms in a highly distributed, parallel fashion. Similarly, contemporary techniques in the brain sciences, from multiunit neurophysiology to multichannel electrophysiology to high-resolution functional magnetic resonance imaging, generate tens of gigabytes of data that must be easily shared and analyzed in a collaborative fashion. These types of data-intensive applications will become increasingly important for real-time analysis and follow-up of data stemming from the large-scale national and international projects now underway, such as the high-energy physics Compact Muon Solenoid (CMS) experiment, the astronomy Large Synoptic Survey Telescope (LSST), the biochemistry Protein Data Bank (PDB), the data-sharing infrastructure for the NSF Science of Learning Centers, and many others.
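The distributed-search pattern described above reduces, in outline, to partitioning the incoming data into chunks, running the same detection filter on every chunk in parallel, and gathering the candidates. The Python sketch below illustrates this with synthetic light curves and a simple dip-detection criterion; the data, the threshold, and the eight-process pool are assumptions made for the illustration rather than any project's actual pipeline.

```python
# Sketch of the "deploy the same search in parallel" pattern: split a batch of
# light curves across worker processes and flag candidate transit-like dips.
# The synthetic data and detection threshold are illustrative assumptions.
from multiprocessing import Pool
import numpy as np

def has_transit_like_dip(flux, threshold=5.0):
    """Flag a light curve whose deepest point sits `threshold` robust sigmas below the median."""
    median = np.median(flux)
    robust_sigma = np.median(np.abs(flux - median)) * 1.4826 + 1e-12
    return (median - flux.min()) / robust_sigma > threshold

def search_chunk(light_curves):
    """Return indices (local to this chunk) of candidate light curves."""
    return [i for i, flux in enumerate(light_curves) if has_transit_like_dip(flux)]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Stand-in for a night's data: 10,000 noisy light curves, a few with injected dips.
    light_curves = [1.0 + 0.001 * rng.standard_normal(500) for _ in range(10_000)]
    for lc in light_curves[::1000]:
        lc[200:210] -= 0.01                                  # inject a 1% "shadow"
    chunks = [light_curves[i::8] for i in range(8)]          # split across 8 workers
    with Pool(processes=8) as pool:
        results = pool.map(search_chunk, chunks)
    n_candidates = sum(len(r) for r in results)
    print(f"{n_candidates} candidate transit signals flagged")
```

In a production setting the worker pool would be replaced by a grid or cluster scheduler and the in-memory chunks by files staged over the network, but the shape of the computation is the same.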

Consider: A team of faculty and students at Fisk University in the U.S. and at the University of the Western Cape in South Africa, both Historically Black Universities, using large-scale databases from the above experiments for statistical analysis and visualization. Indeed, the range of statistical challenges represented by these data archives is truly vast. For example, linking these datasets to complex physical models will require regression and parameter estimation for applications ranging from molecular modeling of proteins to structural equation modeling of the brain to dynamical modeling of galactic structure. In addition, extensions of survival analysis and Bayesian methods will be needed to handle the non-detections (censored data) that commonly occur in large experiments. These are highly non-trivial computational problems that call for novel and efficient implementations of statistical algorithms. Moreover, as statistical and mechanistic models of physical and brain processes become ever more complex and nonlinear, researchers must learn how to statistically evaluate competing models: When is a model sufficiently complex? How much complexity do the data actually justify? Good visualization tools are also essential both for analysis and for dissemination. At the same time, these challenges provide rich opportunities for student training and for capacity-building at institutions with limited resources.
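As one hedged illustration of the model-evaluation question raised above (how much complexity do the data justify?), the sketch below fits nested polynomial models of increasing degree to synthetic data and scores each with the Bayesian Information Criterion, which rewards goodness of fit while penalizing additional parameters. The data, the underlying quadratic model, and the Gaussian-noise assumption are all invented for the example.

```python
# Illustration of penalized model comparison: fit nested polynomial models to
# synthetic data and let the Bayesian Information Criterion (BIC) penalize
# parameters that the data do not support.
import numpy as np

def bic(y, y_model, n_params):
    """Gaussian-likelihood BIC (up to a constant): lower is better."""
    n = len(y)
    rss = np.sum((y - y_model) ** 2)
    return n * np.log(rss / n) + n_params * np.log(n)

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.standard_normal(x.size)  # true model is quadratic

for degree in range(1, 7):
    coeffs = np.polyfit(x, y, degree)
    y_fit = np.polyval(coeffs, x)
    print(f"degree {degree}: BIC = {bic(y, y_fit, degree + 1):8.1f}")
# The BIC typically bottoms out at degree 2 and rises for higher degrees,
# whose extra terms fit noise rather than signal.
```

Analogous penalized-likelihood and Bayesian comparisons apply to the far richer physical and brain models described above, where the cost of evaluating each candidate model is itself a computational challenge.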

Students and faculty working in these areas are often isolated from one another by traditional disciplinary boundaries, and they tend also to be isolated in their use of computational and statistical methods. We believe that the diffusion of data-intensive applications and approaches associated with VIDA will lead to important new scientific discoveries, improve the capabilities of students, and contribute to successful careers. What is innovative about VIDA is the connective tissue that is deliberately formed at the interfaces of the physical, mathematical, and computational sciences in order to develop the future users and innovators of data-intensive applications and high-performance computing in astronomy and astrophysics.