Gigabyte Synthetic Database

More reliable provenance


Provenance of scientific data is a key piece of the metadata record that supports the data's ongoing discovery and reuse. Provenance collection systems capture provenance on the fly, but the protocol between the application and the provenance tool may not be reliable. Consequently, the provenance record can be partial, partitioned, or simply inaccurate.

The Gigabyte Synthetic Database is a noisy data collection generated using the Workflow Emulator Tool (WORKEM) with a number of scientific workflow examples that include modeled failures.
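To make the idea of a noisy provenance record concrete, the following is a minimal sketch of how lost notifications can leave a record partial or partitioned. It is an illustration only and does not reproduce how WORKEM models failures. The event tuples, the function names (`emit_events`, `drop_events`, `count_partitions`), and the `drop_rate` parameter are all hypothetical.

```python
import random


def emit_events(tasks):
    """Build the complete record for a linear workflow (hypothetical format).

    One ("invoked", task) event per task and one ("dataflow", src, dst)
    event between each pair of consecutive tasks.
    """
    events = [("invoked", t) for t in tasks]
    events += [("dataflow", a, b) for a, b in zip(tasks, tasks[1:])]
    return events


def drop_events(events, drop_rate, seed=0):
    """Simulate an unreliable channel: each event is lost with
    probability drop_rate. The seed makes the loss pattern repeatable."""
    rng = random.Random(seed)
    return [e for e in events if rng.random() >= drop_rate]


def count_partitions(events):
    """Count connected components among the tasks that still appear
    in the record, using a small union-find structure."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for event in events:
        if event[0] == "invoked":
            find(event[1])  # register the task as its own component
        else:
            root_a, root_b = find(event[1]), find(event[2])
            parent[root_a] = root_b  # merge the two components
    return len({find(x) for x in list(parent)})


if __name__ == "__main__":
    complete = emit_events(["stage-in", "simulate", "analyze", "stage-out"])
    noisy = drop_events(complete, drop_rate=0.3, seed=42)
    print("events kept:", len(noisy), "of", len(complete))
    print("partitions in noisy record:", count_partitions(noisy))
```

A complete record of a linear workflow forms a single connected graph. Losing a data-flow event splits it into disconnected pieces, which is the partitioned case; losing an invocation event leaves the record partial.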


Access the Gigabyte Synthetic Database through the Karma Axis2 API.

The data set of workflow examples has been compressed and made available for download in XML format.

Project Contributors


  • Beth Plale
  • You-Wei Cheah, main contact
  • Yuan Luo
  • Yiming Sun
  • Lavanya Ramakrishnan, Ph.D., Lawrence Berkeley National Labs


  • Stacy Kowalczyk
  • Aparna Rao


Cheah, You-Wei; Plale, Beth; Ramakrishnan, Lavanya (2011): A Noisy 10GB Provenance Database. Data to Insight Center. Dataset.