Indiana University

 

Media Contact

Daphne Siefert-Herron Manager of Strategic Initiatives, Pervasive Technology Institute at Indiana University

dsiefert@indiana.edu 812-856-1242

Assuring data authenticity and quality

Karma Toolkit and other provenance tools developed by the Center for Data and Search Informatics help scientists have confidence in their data

March 31, 2009

E-Science brings large-scale computation to bear on scientific problems, often by carrying out computational modeling and analysis organized into workflows that are executed on Internet-wide compute resources. As the volume of digital research data created through e-Science experimentation grows, information about the authenticity, validity, and quality of that data takes on new importance. The provenance of a piece of data is its history: what transformations it has undergone and by whom. It is the foundational information needed to assert attributes of the data such as authenticity, validity, and quality.

At IU’s Center for Data and Search Informatics (DSI), researchers are developing new techniques in the area of data provenance with support from the National Science Foundation and Eli Lilly Corp. These techniques help scientists feel confident that the data they are using in their research has not been damaged or altered.

The Karma project builds upon earlier DSI work on provenance collection and representation by providing a general tool suite for provenance collection, representation, and use. The provenance collection and representation tools are undergoing a major redesign, with release expected in April 2009. The project team is expanding the tool's instrumentation support and developing a more comprehensive information model that captures details about both the data and the software models and tools that contributed to the production of a data object.

Also in 2008, DSI became a core participant in the definition of the Open Provenance Model (OPM). This collaborative community effort aims to define a model, agreed upon by the provenance community, for representing historical provenance graphs. The model additionally defines inferences and a time representation. DSI researchers are enthusiastic about this model and have adopted it as a format in which the Karma provenance tool can receive and generate provenance information.
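To make the idea of a provenance graph concrete, the following is a minimal sketch and not Karma's implementation. The edge labels (used, wasGeneratedBy, wasControlledBy) are taken from OPM; the ProvenanceGraph class, its methods, and the file and process names are hypothetical, and the lineage query is a simple transitive closure that is looser than the specific inferences OPM defines.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Illustrative provenance graph; edges point from an effect to its cause."""

    def __init__(self):
        # effect -> list of (relation, cause) pairs
        self.edges = defaultdict(list)

    def add(self, effect, relation, cause):
        self.edges[effect].append((relation, cause))

    def lineage(self, node):
        """Return every node that `node` depends on, directly or indirectly."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for _, cause in self.edges.get(current, []):
                if cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen


# Hypothetical history of one data object.
g = ProvenanceGraph()
g.add("forecast.nc", "wasGeneratedBy", "run_model")
g.add("run_model", "used", "observations.csv")
g.add("run_model", "wasControlledBy", "alice")
g.add("observations.csv", "wasGeneratedBy", "ingest")

history = g.lineage("forecast.nc")
```

Here history contains the process that produced the file, the input that process used, the agent that controlled it, and the earlier process that produced the input, which is the kind of record a scientist would consult before trusting the data.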

DSI affiliate Professor David Leake of the IU School of Informatics also researched methods for exploiting provenance information by applying case-based reasoning to databases of captured provenance information. These methods are being integrated into a workflow composition interface to support scientists' workflow generation process. This phase of the research has developed and tested domain-independent methods for retrieval and similarity assessment of workflow cases. A testbed case-based system, Phala, supports workflow generation by aiding reuse of portions of prior workflows, and it has been evaluated to compare three alternative strategies for generating suggestions.
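As a rough illustration of case retrieval by similarity, the sketch below ranks stored workflows against a partially built one. It is not Phala's method: the Jaccard measure, the function names, and the workflow and step names are all assumptions chosen for brevity.

```python
def jaccard(a, b):
    """Overlap between two collections of step names, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def retrieve(partial, case_base, k=1):
    """Return the names of the k stored workflows most similar to `partial`."""
    ranked = sorted(
        case_base.items(),
        key=lambda item: jaccard(partial, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical case base built from previously captured workflows.
case_base = {
    "wf_forecast": ["ingest", "regrid", "run_model", "plot"],
    "wf_stats": ["ingest", "clean", "summarize"],
}

best = retrieve(["ingest", "regrid"], case_base)
```

The steps of the retrieved workflow that are missing from the partial one could then be offered to the scientist as suggestions for what to add next.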