Gigabyte Synthetic Database

More reliable provenance

About

Provenance of scientific data is a key part of the metadata record that supports the data's ongoing discovery and reuse. Provenance collection systems capture provenance on the fly, but the protocol between the application and the provenance tool may not be reliable. Consequently, the provenance record can be partial, partitioned, or simply inaccurate.

The Gigabyte Synthetic Database is a noisy provenance data collection generated using the Workflow Emulator Tool (WORKEM) from a number of scientific workflow examples that include modeled failures.

Resources

The data set of workflow examples has been compressed and made available for download in XML format.
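
For a first look at the download, a minimal sketch along the following lines may help. It assumes the archive has already been unpacked into a local directory of .xml files. The function names and the directory name in the usage note are hypothetical. Since the XML schema is not described here, the sketch assumes no element names and only tallies whatever tags it finds, streaming each file so that large files need not fit in memory.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def count_tags(xml_source):
    """Stream one XML document (path or binary file object) and tally element tags."""
    counts = Counter()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop this element's text and children once it has been counted
    return counts


def summarize_directory(directory):
    """Tally tags across every .xml file under a directory (hypothetical layout).

    Files that fail to parse are skipped and reported rather than aborting the run.
    """
    totals = Counter()
    skipped = []
    for path in sorted(Path(directory).rglob("*.xml")):
        try:
            totals += count_tags(str(path))
        except ET.ParseError:
            skipped.append(path)
    return totals, skipped
```

Calling summarize_directory("gigabyte_db"), where "gigabyte_db" stands in for wherever the archive was unpacked, returns the tag counts together with a list of any files that could not be parsed. The most common tags give a rough picture of the document structure before writing schema-specific code.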

Project Contributors

Current

  • Beth Plale
  • You-Wei Cheah, main contact
  • Yuan Luo
  • Yiming Sun
  • Lavanya Ramakrishnan, Ph.D., Lawrence Berkeley National Labs

Historical

  • Stacy Kowalczyk
  • Aparna Rao

Citations

Cheah, You-Wei; Plale, Beth; Ramakrishnan, Lavanya (2011): A Noisy 10GB Provenance Database. Data to Insight Center. Dataset. http://dx.doi.org/10.5967/M0VX0DHR