Unifying configuration management across High Performance Computing (HPC) and supporting infrastructure systems is a challenge commonly faced by research computing centers. System operators no longer have the luxury of managing one-off cases manually; automation is essential. Centralized tooling and site-wide configuration promise efficiencies, but the substantial differences between, for example, compute nodes and networking gear can dissuade operators from attempting to manage their entire fleet with a single tool. To solve this, the Minnesota Supercomputing Institute (MSI) at the University of Minnesota assembled a collection of utilities for Institute-wide management of systems, with Puppet 5 as the centerpiece. The solution emerged from the need for more than a dozen system operators and administrators to coordinate changes across a medium- to large-scale data center with over 1500 nodes. A centralized base configuration ensures all systems are in line with University security policies and other compliance needs, while the Puppet infrastructure additionally enables per-cluster or even per-node customizations as needed. This document presents the architecture of MSI’s orchestration and management infrastructure, as well as the workflow followed by operators to provision disparate systems with Puppet and ensure that quality, accountability, and compliance requirements are met. The resulting system is actively used in the day-to-day management of the Institute, including persistent infrastructure services, cluster head and compute nodes, and even networking equipment. The workflow described herein is recommended for similar research computing institutions, including those with as few as five nodes and two operators.
|Original language||English (US)|
|Title of host publication||Proceedings of the Practice and Experience in Advanced Research Computing|
|Subtitle of host publication||Rise of the Machines (Learning), PEARC 2019|
|Publisher||Association for Computing Machinery|
|State||Published - Jul 28 2019|
|Event||2019 Conference on Practice and Experience in Advanced Research Computing: Rise of the Machines (Learning), PEARC 2019 - Chicago, United States|
Duration: Jul 28 2019 → Aug 1 2019
|Name||ACM International Conference Proceeding Series|
|Conference||2019 Conference on Practice and Experience in Advanced Research Computing: Rise of the Machines (Learning), PEARC 2019|
|Period||7/28/19 → 8/1/19|
Bibliographical note
Funding Information:
The authors wish to acknowledge the insightful feedback and assistance from the MSI Operations team, the MSI Application Development team, and the Puppet Training team. Special thanks to Matt Mix for contributions to the Puppet infrastructure code-base and scaling benchmarks for the Mesabi environment. This work was sponsored as internal development by the Minnesota Supercomputing Institute under the University of Minnesota's Office of the Vice President for Research.
© 2019 Association for Computing Machinery.
- ACM proceedings
- Continuous Integration
- Systems Configuration
- Systems Management