OpenStack at George Mason University

Cloud or cluster? Why not both?

The amount of meaningful research being done at U.S. universities outstrips both the computational power and the staff available to support it. The Extreme Science and Engineering Discovery Environment (XSEDE) was initially funded in 2011 by the National Science Foundation (NSF) to substantially enhance the productivity of a growing community of scholars, researchers, and engineers through access to advanced digital services that support open research, and to coordinate and add significant value to the leading cyberinfrastructure resources funded by the NSF and other agencies.

In support of that mission, the XSEDE Cyberinfrastructure Resource Integration (XCRI) group provides tools to help campus cyberinfrastructure staff manage local resources more efficiently. In the case of George Mason University (GMU), this includes supporting innovative cloud deployment at the local level.

In 2018, XCRI staff began working with GMU to help its IT staff develop significant local infrastructure. GMU staff were interested in developing a resource that could be a cluster or a cloud, depending on an individual’s needs. This forward-thinking idea sought the best of both worlds, along with flexibility in provisioning resources. This system enables GMU’s researchers to work with a cloud like Jetstream or with a cluster like Stampede at the Texas Advanced Computing Center, both nationally available resources funded through the NSF. Having this choice allows researchers to work in the environment where they’re most comfortable.

PTI spoke about the collaboration with Richard Knepper, XCRI manager and director of the Cornell Center for Advanced Computing, and with GMU Office of Research Computing director Jayshree Sarma, computational research specialist Alastair Neil, and graduate research assistant Rafael Madrid.

PTI: What was the impetus for the collaboration?

Alastair Neil: At the 2018 Supercomputing conference, Eric Coulter from XCRI gave a talk about helping universities set up HPC clusters. Jayshree asked his team to come to GMU to run a two-day tutorial to train system administrators on HPC cluster maintenance. It was very well attended—so much so that we had to find a bigger room. At the time, GMU had no HPC system administrators, and that training sparked an interest. That’s where our partnership began.

We wanted to provide a flexible infrastructure for research computing, one that would let us provision virtual machines, bare metal, or HPC compute resources on the fly from our pool of hardware. We periodically needed to pull machines out of an HPC cluster because a researcher would need exclusive use or a considerably different configuration, so we’d need to reimage them. This was very administration-intensive and was neither flexible nor agile. We couldn’t turn requests around in less than a few weeks. Consequently, we decided to pursue offering more responsive and flexible infrastructure via a local cloud. We looked at various architectures and decided on OpenStack, the most robust option. We then sought assistance from XCRI staff because they have extensive experience with administering OpenStack clusters.

PTI: Why is the collaboration between XSEDE and GMU important?

Richard Knepper: For the XCRI team, it was an opportunity to help bolster community efforts to provide innovative local infrastructure. GMU had the plan; XSEDE had the knowledge and expertise to help them achieve their goals. Being able to translate and provide advice was important and allowed us to help them become a new participant within the national research community.

We were able to provide guidance on using OpenStack, which is an open source cloud computing platform. It’s easy to administer and to use, and it works well on systems ranging from the local level to the biggest national clouds.

Ultimately, we want to connect groups who are using similar technologies so that they can work together and learn from each other moving forward. We want these groups to share experiences and to turn to each other for help. We’re building a community.

Jayshree Sarma: The training that XCRI gave our system administrators was important. When we decided on OpenStack, it made sense to reach out to Rich and his team because they’d been running OpenStack for quite some time. We benefited from their knowledge, especially when it came to what hardware we needed.

Neil: They gave us unvarnished advice about what we really needed to purchase versus what the vendor wanted to sell us. There’s a huge spectrum of ways to deploy OpenStack, and research and academic computing doesn’t need the same robustness and redundancy as a commercial cluster. We got a much more realistic view of what we needed and good advice about how to deploy it. And as we get closer to standing up the cluster, they’re advising us on how to present it to users.

Thanks to the NSF, XCRI helps institutions like GMU benefit from the experience of those that came before.

PTI: What are some of the challenges the research community faces that XCRI has been helping to solve?

Knepper: Primarily, making good decisions about architecture, implementation, and policy. We can help groups determine which pieces make a system useful and effective, and which combinations of pieces work well together. Everyone on the XCRI team has had different experiences with vendors, so we can provide varied perspectives that help campuses make good decisions. We can also help them choose between cluster and cloud, or help them implement both. By using OpenStack Ironic for provisioning, campuses can allow for either cloud or cluster on top through on-the-fly provisioning. This is the route GMU has chosen.

PTI: What is Exosphere and how does it play a part in GMU’s local systems?

Knepper: Exosphere is a user-friendly web interface for OpenStack-based research clouds such as Jetstream. This is notable because it provides a consistent user interface across local and national resources. It allows researchers and other non-IT professionals to deploy their code and run services on their own cloud infrastructure without having to understand virtualization or networking concepts. For researchers who choose to work in the cloud, Exosphere will enable GMU to deliver a familiar, user-friendly, powerful interface with customized branding, naming conventions, and single sign-on integration.

Neil: Our team meets with the Exosphere team every two weeks to discuss the progress we’ve made and the issues we’ve uncovered, and to talk generally about the interface. Communication has been productive for us and for them; we’ve helped with finding solutions and reporting bugs.

Exosphere is much more user-friendly than standard OpenStack. The Exosphere team supports continuous integration and maintains a development environment in GitLab, which allows us to get the latest copies of images we want to support, already patched and automatically configured and tailored to our systems. This is incredibly helpful to our team and is in fact a major benefit to our users: they won’t have to wait an hour for an image to be loaded and patched. Their virtual machines will be provisioned in a matter of minutes.

PTI: What are your future plans with the cluster?

Neil: We need to complete work on a few critical aspects before we can let users onto the cluster, particularly authentication. We plan to implement Keycloak, an authentication integration system. Once authentication is in place, we can allow early adopters onto the system. We also need to look at how we’ll manage allocations and resource constraints. Once that piece is in place (we’re aiming for this summer), we’ll be able to allow general access.

PTI: In what other ways does XCRI assist with campus system development?

Knepper: For implementation, we create activation energy. By setting a date to go to a campus and work with their IT staff, we’re creating deadlines that in turn create momentum. We provide advice and troubleshooting that helps the implementation process flow smoothly. Due to the pandemic, we’ve been able to stretch the timelines, taking the opportunity to iterate on solutions. If a question goes outside our area of experience, we leverage the broad range of professionals within the XSEDE Federation to help find the necessary expertise.

We can also assist with policy decisions. We can provide details about what other groups are doing in terms of time, allocations, authentication schemes, federation, etc. There’s a lot of expertise within XCRI, and that allows us to really engage and talk about what we’ve seen at other universities in order to help meet the needs of local groups.