Provenance & metadata

The Data to Insight Center (D2I) has a strong research presence in provenance and metadata for scientific data.

Routinely collecting provenance information about the data products generated during scientific research can have a transformational impact on scientific discovery.

Digital data provenance

Provenance collection is, in essence, a form of automatic metadata generation. When metadata collection is automated and performed at the point where a data product is generated, the resulting information is more accurate and complete, largely because users no longer need to annotate the data after the fact.
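
As a minimal sketch of capturing provenance at the point of data product generation, the Python example below wraps each processing step so that a record is written automatically whenever the step runs. It does not represent the Center's tools: the names `record_provenance`, `PROVENANCE_LOG`, `regrid`, and `mean`, as well as the record fields, are hypothetical and chosen only for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from functools import wraps

# Hypothetical in-memory store; a real system would use a provenance database.
PROVENANCE_LOG = []


def _fingerprint(value):
    """Return a short, stable identifier for a JSON-serializable data product."""
    payload = json.dumps(value, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def record_provenance(func):
    """Record inputs, parameters, output, and timing each time func runs."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "inputs": [_fingerprint(a) for a in args],
            "parameters": dict(kwargs),
            "output": _fingerprint(result),
            "started_at": started,
            "ended_at": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper


@record_provenance
def regrid(temperatures, factor=2):
    # Illustrative processing step: keep every factor-th sample.
    return temperatures[::factor]


@record_provenance
def mean(values):
    # Illustrative processing step: arithmetic mean of the samples.
    return sum(values) / len(values)


coarse = regrid([280.1, 281.4, 279.9, 282.0], factor=2)
average = mean(coarse)
```

Because the output fingerprint of `regrid` matches the input fingerprint of `mean`, the two records can be linked into a lineage chain without anyone annotating the data by hand.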

As digital library solutions for scientific data collections become more common, a trend that is already under way, it will be important for archival collections to draw on specialized metadata catalogs built around e-Science discovery, such as a provenance database, for the rich contextual metadata they contain.

The Data to Insight Center, often working with collaborators at IU and at other institutions, is developing tools for provenance generation and collection and for case-based reasoning. The tools and collected data are also available for download for wider community use.

Metadata for scientific data

With the increasing deluge of scientific data, detailed metadata is necessary to enable scientists to share data and find the data and scientific results relevant to their research.

The Data to Insight Center's research emphasizes capturing detailed metadata early in the scientific process. Research has also found that as the distance (spatial or temporal) between data creators and data users increases, additional structured metadata is required. The XMC-Cat suite of tools, which resulted from our research in the Linked Environments for Atmospheric Discovery (LEAD) project and subsequent research projects, enables detailed, automated, incremental metadata capture early in the scientific process. It uses a generalized architecture that can be adapted to the metadata schemas of different scientific communities.
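
The idea of incremental capture against a community-specific schema can be sketched as follows. This is an illustration under stated assumptions, not XMC-Cat's actual design or API: the `MetadataRecord` class, the `ATMOSPHERIC_SCHEMA` field list, and the sample values are all hypothetical.

```python
class MetadataRecord:
    """Accumulates metadata in stages and checks it against a schema."""

    def __init__(self, required_fields):
        # The schema is passed in, so the same class can serve
        # communities with different metadata requirements.
        self.required = tuple(required_fields)
        self.values = {}

    def add(self, **fields):
        """Add whatever metadata is known at this stage of the process."""
        unknown = sorted(set(fields) - set(self.required))
        if unknown:
            raise KeyError(f"fields not in schema: {unknown}")
        self.values.update(fields)

    def missing(self):
        """List the schema fields that have not been captured yet."""
        return [name for name in self.required if name not in self.values]

    def is_complete(self):
        return not self.missing()


# Hypothetical schema for an atmospheric science community.
ATMOSPHERIC_SCHEMA = ("title", "creator", "spatial_extent", "temporal_extent")

record = MetadataRecord(ATMOSPHERIC_SCHEMA)

# Captured when the experiment is configured.
record.add(title="Example forecast run", creator="Example Researcher")
still_missing = record.missing()

# Captured later, when the data product is actually generated.
record.add(spatial_extent="example bounding box",
           temporal_extent="example time range")
```

Passing the schema as data rather than hard-coding it is what lets one capture mechanism be adapted to another community: only the field list changes.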

Archived Projects

Archived projects include XMC-Cat, the Karma Provenance Collection Tool, and the Gigabyte Provenance Database.

For more information and additional projects, please see the Provenance & metadata sections on Open source software and Data sets and tools.


Please contact Beth Plale for more information about projects related to provenance and metadata.



  • Mehmet Aktas
  • Bina Bhaskar
  • Bin Cao
  • Kavitha Chandrasekar
  • You-Wei Cheah
  • Peng Chen
  • Sribabu Doddapaneni
  • Dennis Gannon
  • Devarshi Ghoshal
  • Scott Jensen
  • Stacy Kowalczyk
  • Shobana Krishnan
  • David Leake
  • Yuan Luo
  • Joseph Morwick
  • Beth Plale
  • Prajakta Purohit
  • Lavanya Ramakrishnan
  • Aparna Rao
  • Ed Robertson
  • Kalani Ruwanpathirana
  • Bimalee Salpitikorala
  • Yogesh Simmhan
  • Christopher Small
  • Girish Subramanian
  • Yiming Sun