Indiana University

Media Contact

Daphne Siefert-Herron, Manager of Strategic Initiatives, Pervasive Technology Institute at Indiana University

dsiefert [at] indiana [dot] edu, 812-856-1242

NSF funds $433K project on Digital Data Provenance

January 1, 2007

NSF SDCI: A New Toolkit for Provenance Collection, Publishing, and Experience Reuse

Beth Plale, David Leake, Yogesh Simmhan, and Dennis Gannon

Department of Computer Science, School of Informatics
Indiana University

New forms of scientific digital data are being generated in huge quantities by the sophisticated computational analysis and database access steps that scientists in the life and physical sciences execute. In the past, these multi-step computational analysis tasks required a script written by hand by a scientist. Such scripts are notoriously difficult to write, brittle, and difficult to maintain. Software cyberinfrastructure is a suite of tools that gives the scientist a means to specify analysis sequences without having to learn a scripting language. As the volume of digital research data created through computational science experimentation grows, it becomes increasingly critical to capture information on the fly about a data product’s authenticity, validity, and quality. This project, funded by the National Science Foundation, creates a domain-independent tool for capturing and using the provenance of scientific digital data. As there is growing interest in storing scientific data in digital libraries, we are additionally working with colleagues in the Digital Library Program at Indiana University to understand what provenance of scientific data is necessary for the long-term preservation and use of a data object. Finally, provenance information collected automatically about data products, when taken in the aggregate, forms an interesting history of use by scientists. With this information, we are asking what data can be mined to make a domain scientist’s job easier, such as making suggestions about future workflow-driven investigations.

The tools coming out of this project will help scientists in the life and physical sciences better track their interactions with data, will make storage and reuse of scientific data easier, and will help scientists working with computational modeling and analysis tools work more productively.

“Provenance of scientific data is an emerging research area, and one of importance not only to scholars, but to industry as well,” said Beth Plale, Director of the Center for Data and Search Informatics. “As the volume of scientific data from computational analysis grows into the petabyte range, it is increasingly important that provenance information like ownership and validity travel with the scientific data, wherever it eventually resides.”