Indiana University

 

Digital Data Provenance

Overview
Digital data created through computational science experiments and discovery is growing rapidly and extending to new frontiers as discovery and experiment frameworks gain acceptance and as computational power and storage become cheaper. As research data collections become more accessible, it becomes increasingly important to address data validity and quality: to record and manage information about where each data object originated, which processes were applied to it, and by whom. Routinely collecting provenance information about the data products produced during the scientific discovery process can have a transformational impact on scientific discovery.

Provenance collection is, in essence, a form of automatic metadata generation. When metadata collection is automated and done at the point of data product generation, the collected information is more accurate and complete, largely because users no longer need to annotate data after the fact. As digital library solutions for scientific data collections become more common, as current trends indicate, it will be important that specialized metadata catalogs built up around e-Science discovery, such as the provenance database, be used in archival collection for the rich contextual metadata they contain.
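The idea of capturing provenance at the point of data product generation can be illustrated with a minimal sketch. This is not the API of Karma or any other tool listed on this page; the names `record_provenance`, `fingerprint`, and `PROVENANCE_LOG`, and the record fields, are hypothetical and chosen only for illustration. The sketch assumes a processing step is an ordinary Python function and that a short hash is an acceptable stand-in for a data object identifier.

```python
import functools
import hashlib
import os
from datetime import datetime, timezone

# Hypothetical in-memory store; a real system would write to a provenance database.
PROVENANCE_LOG = []


def fingerprint(data):
    """Return a short, stable identifier for a data object (illustrative only)."""
    return hashlib.sha256(repr(data).encode("utf-8")).hexdigest()[:12]


def record_provenance(func):
    """Wrap a processing step so a provenance record is captured when it runs."""
    @functools.wraps(func)
    def wrapper(*inputs):
        started = datetime.now(timezone.utc).isoformat()
        output = func(*inputs)
        PROVENANCE_LOG.append({
            "process": func.__name__,                   # which process was applied
            "agent": os.environ.get("USER", "unknown"),  # by whom
            "inputs": [fingerprint(i) for i in inputs],  # where the data originated
            "output": fingerprint(output),               # the resulting data product
            "started": started,
        })
        return output
    return wrapper


@record_provenance
def normalize(values):
    """Example processing step: scale values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


result = normalize([1, 3])
```

After the call, `PROVENANCE_LOG` holds one record naming the process, the agent, and identifiers for the input and output, without the user annotating anything by hand.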

We are developing tools for provenance generation and collection and for case-based reasoning. The tools and the collected data are available for download for wider community use.

Current Projects

  • Karma Provenance Collection Tool is a standalone provenance collection toolkit. The most recent version, v3.0, features instrumentation using Axis2 handlers.
  • Phala is a case-based reasoning recommender system.  It uses computer models of case-based reasoning to develop a support system that leverages the collective experience of the users of the provenance system to provide suggestions.
  • Workflow Emulator (WORKEM) is a tool for executing synthetic workflows.
  • Gigabyte Provenance Database is a multi-gigabyte collection of provenance information.
  • InstantKarma is a tool for collecting and disseminating provenance of Advanced Microwave Scanning Radiometer - Earth Observing System (AMSR-E) standard data products. It aims to improve the collection, preservation, utility, and dissemination of provenance information within the NASA Earth Science community.
  • NetKarma is a tool for capturing the workflow of Global Environment for Network Innovations (GENI) experiments, including slice creation, slice topology, operational status, and other measurement statistics, and for correlating that workflow with experimental data. NetKarma will allow researchers to see the exact state of the network and to store the configuration of an experiment and its slice.

Resources

Contact

  • Beth Plale [plale at indiana dot edu]

Project Contributors

  • Bin Cao
  • You-Wei Cheah
  • Sribabu Doddapaneni
  • Dennis Gannon
  • Devarshi Ghoshal
  • Stacy Kowalczyk
  • David Leake
  • Yuan Luo
  • Joseph Morwick
  • Beth Plale
  • Prajakta Purohit
  • Lavanya Ramakrishnan
  • Aparna Rao
  • Ed Robertson
  • Yogesh Simmhan
  • Christopher Small
  • Girish Subramanian
  • Yiming Sun