Cloud infrastructure scales computing resources on demand and accommodates growing volumes of data. As organizations analyze Big Data for insights and predictions, they need workflows that handle both online and offline processing demands. Different steps in a data-intensive computation workflow may be performed using different processing frameworks, which complicates the lifecycle of a data product that passes through a Big Data analysis workflow. This is especially true of emerging Big Data management solutions such as Data Lakes, in which data from multiple sources are stored in a shared storage system and analyzed by different scientists using different frameworks at different points in time.
Our work draws on experiments with provenance, academic and commercial cloud infrastructures, and access to restricted datasets. We address the problem of “Big Provenance”, i.e., storing and processing the fine-grained provenance collected from data-intensive computations, a volume that can be several times larger than the original data itself. Our experimental work introduces a parallel stream processing approach that summarizes a full provenance stream on the fly while preserving backward and forward provenance in Big Data cloud computing environments.
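As a minimal illustration of what it means for a summary to preserve backward and forward provenance, the Python sketch below collapses a stream of fine-grained derivation edges into group-level edges as they arrive, then answers both kinds of query on the summary. It is not the parallel approach described above: it is single-threaded, and the `summarize`, `reachable`, and `group_of` names, the record identifiers, and the choice to group records by a dataset prefix are all hypothetical assumptions made for the example.

```python
from collections import defaultdict, deque


def summarize(edge_stream, group_of):
    """Collapse fine-grained (source, target) edges into group-level edges.

    Edges are consumed one at a time, so the full fine-grained stream never
    has to be held in memory. `group_of` is a hypothetical function mapping
    a record identifier to its summary group.
    """
    forward = defaultdict(set)   # group -> groups derived directly from it
    backward = defaultdict(set)  # group -> groups it was directly derived from
    for source, target in edge_stream:
        src_group, dst_group = group_of(source), group_of(target)
        if src_group == dst_group:
            continue  # derivations inside one group are dropped from the summary
        forward[src_group].add(dst_group)
        backward[dst_group].add(src_group)
    return forward, backward


def reachable(start, adjacency):
    """Return every group reachable from `start` by following `adjacency`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Hypothetical record-level provenance: each pair reads "target derived from source".
edges = [
    ("raw/1", "clean/1"),
    ("raw/2", "clean/1"),
    ("clean/1", "report/1"),
    ("raw/3", "clean/2"),
]

# Assumption for the example: the text before "/" names the dataset.
forward, backward = summarize(iter(edges), lambda record_id: record_id.split("/")[0])

ancestors_of_report = reachable("report", backward)  # backward provenance
descendants_of_raw = reachable("raw", forward)       # forward provenance
```

Four record-level edges reduce to two dataset-level edges here, yet both query directions can still be answered at dataset granularity. The cost is that record-level detail is lost, which is the trade-off any summarization scheme of this kind has to weigh.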
Cloud infrastructures are also deployed to support the collection and analysis of socio-ecological data. Software developed at D2I continuously ingests survey data from a cloud-based mobile texting service used to collect data for a collaborative food security project. The data are downloaded, processed, and preserved for further monitoring, analysis, visualization, and integration with other types of data, such as sensor data.
In the areas of cloud infrastructure and access to restricted data, our research is carried out in collaboration with the HathiTrust Research Center (HTRC). We develop a data management environment that enables analysis of Big Data textual collections while respecting necessary restrictions, including copyright and other sensitivities. Additionally, we extend this environment, called Data Capsule, to be deployed in the cloud. Current efforts focus on replacing the QEMU-based backend system to support the Jetstream platform, a web-based cloud platform at Indiana University.