Indiana University

 

Indiana University (IU) will establish the National Center for Genome Analysis Support (NCGAS) in partnership with the Texas Advanced Computing Center (TACC). The NCGAS will be an innovative service center (core facility) that supports the national community of NSF-funded researchers who use genome assembly software, particularly software suitable for assembly of data from next-generation sequencers; large-scale phylogenetic software; and other genome analysis software requiring large amounts of memory. This center will be a general source of software support and services that will be provided on the Mason large memory computer cluster at IU, on the TACC Ranger system, and on the San Diego Supercomputer Center Gordon system. The NCGAS will provide the following services:

  • Support for using genome analysis software on the systems named above.
  • Storage of submitted data sets and results for at least one year following analysis.
  • A repository of open source genome analysis software, including hardened, tuned, and optimized versions of particularly important open source software.
  • Support for using this open source software, including extended, in-depth consulting on sequence analysis, as well as tutorials and presentations on the use of NCGAS services.
  • Implementation of a novel public/private service partnership in support of genome analysis and other biological research, on a fee-for-service basis as an alternative to commercial clouds.

These services will particularly support analyses of next-generation sequencer output in the following categories:

  • De novo assembly, which does not use a reference genome and requires that each read be compared to every other read to find overlaps.
  • Metagenomic projects, which simultaneously sequence the combined genomes in an environmental sample such as ocean water or the human mouth.
  • Resequencing, in which a closely related genome has already been completely sequenced and assembled and can serve as a reference.
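
The all-against-all read comparison behind de novo assembly can be illustrated with a minimal sketch. It is not the method of any particular assembler and not NCGAS software; the function names, the toy reads, and the minimum overlap length of 3 are assumptions made for illustration only. It records, for every ordered pair of reads, the longest suffix of the first that matches a prefix of the second. The number of comparisons grows with the square of the number of reads, which is one reason such analyses benefit from large-memory systems.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, or 0 if no such overlap of at least `min_len` bases exists."""
    max_len = min(len(a), len(b))
    # Try the longest candidate first so the first match is the longest.
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def all_overlaps(reads, min_len=3):
    """Compare every read against every other read (all ordered pairs)
    and keep the pairs that overlap by at least `min_len` bases."""
    found = {}
    for a, b in permutations(reads, 2):
        k = overlap(a, b, min_len)
        if k:
            found[(a, b)] = k
    return found


# Hypothetical toy reads, chosen only to illustrate the idea.
reads = ["ATGCGT", "GCGTAC", "GTACCA"]
for (a, b), k in all_overlaps(reads).items():
    print(f"{a} -> {b}: overlap of {k} bases")
```

Production assemblers avoid this naive quadratic scan by indexing reads, but the underlying task of finding overlaps among all reads is the same.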

We are in the midst of dramatic developments in genome sequencing capabilities, driven by the availability of high-throughput, low-cost next-generation sequencers. The proliferation of data generation capabilities creates new opportunities for biological discovery, such as studies involving complete sequences of a thousand humans or several complex microbial communities. There is, however, a rapidly growing gap between the ability to generate sequences and the ability to analyze the resulting data. The NCGAS will develop innovative solutions to current needs in genome assembly and analysis. It will do so by establishing a core of experts and software tools to support such research on a variety of nationally funded cyberinfrastructure systems, and it will add to the suite of available systems a large memory cluster ideal for this work. By developing a community of investigators and technologists and exploring new modalities of provisioning computational resources, such as ‘on demand’ computing, this project aspires to become a sustainable model for meeting the ongoing and increasing need for sequence analysis.

Broader impacts: The NCGAS will advance discovery in genome analysis while promoting the integration of cutting-edge research software tools in education. A key part of the broader impact of the proposed activities will be ‘disciplinary outreach’ – providing new resources and then informing practicing biologists about these resources and enabling them to use them.

Intellectual merit: The NCGAS will enable biologists to assemble genomes that they cannot now assemble using industry-standard methods. By providing tools and services for this work, it will support innovative and potentially transformative genomics research and make new insights and important scientific discoveries possible.