A multi-environment HPC-scale puppet infrastructure for compliance and systems automation

Raychel M. Benson, Edward Munsell, Nicholas Bertrand, Michael Baynton, Evan F Bollig, Jeffrey McDonald

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Unifying configuration management across High Performance Computing (HPC) and supporting infrastructure systems is a challenge commonly faced by research computing centers. System operators no longer have the luxury of managing one-off cases manually; automation is essential. Centralized tooling and site-wide configuration promise efficiencies, but the substantial differences between, for example, compute nodes and networking gear can dissuade operators from attempting to manage their entire fleet with a single tool. To solve this, the Minnesota Supercomputing Institute (MSI) at the University of Minnesota assembled an innovative collection of utilities for Institute-wide management of systems, with Puppet 5 as the centerpiece. The solution emerged from the need for more than a dozen system operators/administrators to coordinate changes across a medium- to large-scale data center with over 1500 nodes. A centralized base configuration ensures all systems are in line with University security policies and other compliance needs, while the Puppet infrastructure additionally enables per-cluster or even per-node customizations as needed. This document presents the architecture of MSI’s orchestration and management infrastructure, as well as the workflow followed by operators to provision disparate systems with Puppet and to ensure that quality, accountability, and compliance requirements are met. The resulting system is actively used in the day-to-day management of the Institute, including persistent infrastructure services, cluster head and compute nodes, and even networking equipment. The workflow described herein is recommended for similar research computing institutions, including those with as few as five nodes and two operators.
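The paper itself details MSI's architecture; as a rough illustration only, the pattern the abstract describes (a site-wide compliance base with per-cluster and per-node overrides) maps naturally onto a Hiera hierarchy in Puppet 5. The sketch below is a hypothetical example of that layering, not MSI's actual configuration; the facts.cluster custom fact and the file paths are assumptions.

    # hiera.yaml -- hypothetical lookup hierarchy; the most specific layer wins
    ---
    version: 5
    defaults:
      datadir: data
      data_hash: yaml_data
    hierarchy:
      - name: "Per-node overrides"
        path: "nodes/%{trusted.certname}.yaml"
      - name: "Per-cluster customizations"   # facts.cluster is an assumed custom fact
        path: "clusters/%{facts.cluster}.yaml"
      - name: "Site-wide base: security policy and compliance defaults"
        path: "common.yaml"

Under a hierarchy like this, every node receives the compliance baseline from common.yaml, and a cluster or node diverges only where a more specific file overrides a key, matching the per-cluster and per-node customization the abstract describes.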

Original language: English (US)
Title of host publication: Proceedings of the Practice and Experience in Advanced Research Computing
Subtitle of host publication: Rise of the Machines (Learning), PEARC 2019
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450372275
DOI: https://doi.org/10.1145/3332186.3332240
State: Published - Jul 28, 2019
Event: 2019 Conference on Practice and Experience in Advanced Research Computing: Rise of the Machines (Learning), PEARC 2019 - Chicago, United States
Duration: Jul 28, 2019 - Aug 1, 2019

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 2019 Conference on Practice and Experience in Advanced Research Computing: Rise of the Machines (Learning), PEARC 2019
Country: United States
City: Chicago
Period: 7/28/19 - 8/1/19


Keywords

  • ACM proceedings
  • Continuous Integration
  • Git
  • Puppet
  • Systems Configuration
  • Systems Management
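The Git and Continuous Integration keywords correspond to the quality and accountability workflow the abstract mentions: changes to Puppet code are proposed through Git and validated before they reach nodes. A minimal, hypothetical pre-merge check could run the standard Puppet syntax and lint tools; the use of GitLab CI here, the manifest paths, and a runner image with the puppet agent and puppet-lint preinstalled are assumptions, not details from the paper.

    # .gitlab-ci.yml -- hypothetical syntax/lint gate for a Puppet control repository
    stages:
      - validate

    puppet-checks:
      stage: validate
      script:
        # Both tools are standard Puppet tooling; the paths below are assumed.
        - puppet parser validate manifests/site.pp
        - puppet-lint --fail-on-warnings manifests/

Merging only after such checks pass leaves every change with a reviewable Git history, one common way the accountability requirement can be met.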

Cite this

Benson, R. M., Munsell, E., Bertrand, N., Baynton, M., Bollig, E. F., & McDonald, J. (2019). A multi-environment HPC-scale puppet infrastructure for compliance and systems automation. In Proceedings of the Practice and Experience in Advanced Research Computing: Rise of the Machines (Learning), PEARC 2019 [3332240] (ACM International Conference Proceeding Series). Association for Computing Machinery. https://doi.org/10.1145/3332186.3332240
