Big Data and Cloud Computing

About

Cloud infrastructure scales on demand to meet changing computing needs and growing data volumes. As organizations analyze Big Data for insights and predictions, they need workflows that handle both online and offline processing. Different steps in a data-intensive computation workflow may run on different processing frameworks, which complicates the lifecycle of a data product as it moves through a Big Data analysis workflow. This is especially true of emerging Big Data management solutions such as Data Lakes, in which data from multiple sources are stored in a shared storage system and analyzed by different scientists using different frameworks at different points in time.

Our work draws on experiments with provenance, academic and commercial cloud infrastructures, and access to restricted datasets. We address the problem of “Big Provenance”, i.e., storing and processing fine-grained provenance collected from data-intensive computations, which can be several times larger than the original data itself. Our experimental work introduces a parallel stream processing approach that summarizes a full provenance stream on the fly while preserving backward and forward provenance in Big Data cloud computing environments.
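To make the terms concrete: backward provenance answers "which inputs was this item derived from?", and forward provenance answers "which items were derived from this one?". The following is a minimal illustrative sketch, not the D2I implementation and not the summarization approach itself. The event format (an output identifier plus a list of input identifiers), the class name ProvenanceIndex, and the file names are all hypothetical. It shows only how both directions can be maintained incrementally as provenance events arrive in a stream.

```python
from collections import defaultdict


class ProvenanceIndex:
    """Keeps backward and forward derivation links for a stream of events.

    Hypothetical structure for illustration only; each event says that
    one output item was derived from a list of input items.
    """

    def __init__(self):
        self.parents = defaultdict(set)   # item -> items it was derived from
        self.children = defaultdict(set)  # item -> items derived from it

    def add_event(self, output_id, input_ids):
        # Update both directions at once so neither needs a later rebuild.
        for source in input_ids:
            self.parents[output_id].add(source)
            self.children[source].add(output_id)

    @staticmethod
    def _reachable(start, edges):
        # Iterative depth-first traversal over the chosen edge direction.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def backward(self, item):
        """All direct and indirect inputs of `item`."""
        return self._reachable(item, self.parents)

    def forward(self, item):
        """All items directly or indirectly derived from `item`."""
        return self._reachable(item, self.children)


# Hypothetical stream of provenance events: (output, inputs).
events = [
    ("clean.csv", ["raw_a.csv", "raw_b.csv"]),
    ("model.bin", ["clean.csv"]),
    ("report.pdf", ["model.bin", "raw_b.csv"]),
]

index = ProvenanceIndex()
for output_id, input_ids in events:
    index.add_event(output_id, input_ids)

print(sorted(index.backward("report.pdf")))
print(sorted(index.forward("raw_a.csv")))
```

A real system handling provenance larger than the data itself would need to partition and summarize these links rather than hold them all in memory; this sketch deliberately ignores that.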

Cloud infrastructures are also deployed to support collection and analysis of socio-ecological data. The software developed at D2I continuously ingests survey data from a cloud mobile texting service used to collect data for a collaborative food security project. Data is downloaded, processed and preserved for further monitoring, analysis, visualization, and integration with other types of data, such as sensor data.

In the areas of cloud infrastructure and access to restricted data, our research is carried out in collaboration with the HathiTrust Research Center (HTRC). We develop a data management environment that enables analysis of big data textual collections while respecting necessary restrictions, including copyright and other sensitivities. Additionally, we are extending this environment, called Data Capsule, so that it can be deployed in the cloud. Current efforts focus on replacing the qemu-based backend system to support the Jetstream platform, a web-based cloud platform at Indiana University.

Additionally, D2I provides and contributes to multiple education and outreach opportunities for students interested in data science and big data.

For additional projects, please see the Grid and cloud computing sections on Open source software and Data sets and tools.

Selected Publications

  • Isuru Suriarachchi and Beth A. Plale. 2017. Crossing analytics systems: A case for integrated provenance in data lakes. Proceedings of the 2016 IEEE 12th International Conference on e-Science, e-Science 2016. http://doi.org/10.1109/eScience.2016.7870919
  • Suriarachchi, Isuru & Plale, Beth. 2016. Provenance as Essential Infrastructure for Data Lakes. International Provenance and Annotation Workshop, 2016. http://doi.org/10.1007/978-3-319-40593-3_16
  • Chakraborty, Abhirup & Pathirage, Milinda & Suriarachchi, Isuru & Chandrasekar, Kavitha & Mattocks, Craig & Plale, Beth. (2014). Executing Storm Surge Ensembles on PAAS Cloud. Cloud Computing for Data-Intensive Applications, 2014, pp. 257-276. http://doi.org/10.1007/978-1-4939-1905-5_11
  • Chakraborty, Abhirup & Pathirage, Milinda & Suriarachchi, Isuru & Chandrasekar, Kavitha & Mattocks, Craig & Plale, Beth. (2013). Storm surge simulation and load balancing in Azure cloud. Proceedings of the High Performance Computing Symposium, 2013.

Archived Projects

Archived projects include Sigiri and Streamflow.

Contact

Please contact Beth Plale for more information about projects related to Big Data and Cloud Computing.