Hostkeeper: Automated Systems Administration using 'Make'

Steve Traugott, Infrastructures.Org -- stevegt@TerraLuna.Org
Eric Langhirt, NASA Ames Research Center -- langhirt@nas.nasa.gov
Joel Huddleston, Level 3 Communications -- joel.huddleston@level3.com
Joyce Cao Traugott, TerraLuna, LLC -- joyce@TerraLuna.Org

Abstract
========

Describes in detail an algorithm for running 'make' at system boot time, to provide precise, repeatable change management. The algorithm is suited for infrastructures where automated systems administration is the goal, and lends itself well to disaster recovery, initial deployment, and ongoing maintenance of large systems infrastructures.

Overview
========

Hostkeeper is a reference implementation of an algorithm we briefly described in our 1998 paper, "Bootstrapping an Infrastructure" [bootstrap]. In that paper, we detailed a sequence of steps which can be used to "bootstrap" an infrastructure into existence.

Central to the theme of infrastructure bootstrapping is the idea of thinking of the entire infrastructure as one enterprise-wide cluster of machines. If we can control the "bits on disk" which make up an individual operating system and its hosted applications and data, then we can control the runtime behavior of that operating system and its interactions with the larger infrastructure. To do this, we need to be able to cause reproducible changes on large numbers of unique but managed machines, not only for initial builds, but also for disaster recovery and growth.

The Hostkeeper algorithm is a method of running 'make' [make] during system boot to drive the initial configuration and ongoing maintenance of a UNIX machine. The use of 'make' ensures that changes are processed in a deterministic and repeatable order. Over time, we've found that this technique can provide precise control for infrastructure management.
This algorithm is ideally suited for infrastructures of more than a handful of machines, where automated systems administration [automated] or computer immunology [burgess] is the goal. The machines to be managed need not be identical. The algorithm can also be used to install, configure, and coordinate the execution of cfengine [cfengine], rpm [rpm], Arusha [arusha], shell scripts, Perl scripts, and most local or proprietary vendor package tools.

Because the concept expressed by Hostkeeper is simple, and a typical implementation of it is relatively small in terms of lines of code, it was not obvious to us for the first few years that the concept was in any way revolutionary, or even that it deserved a name. We generally thought of it as a hack that happened to work pretty well. We usually implemented the algorithm using Bourne shell scripts and standard UNIX tools.

But experience gained since 1998 shows us that running 'make' during system boot provides us with more flexibility, simplicity, elegance, and administrative control than any of the alternatives we have tried.

We have also learned, by much trial and error, that the original implementation of the Hostkeeper algorithm was nearly optimal. We've tried using state engines other than 'make', for instance, and have tried using various combinations of other shell scripts and administrative tools. These methods, used without 'make' as a core, typically do not provide the deterministic and repeatable behavior that a reliable infrastructure requires, and often need complex code that does not scale well. We've tried rearranging the division of labor between the components that make up Hostkeeper. Doing this has greatly improved maintainability, but we have been unable to find many opportunities for major new functionality for a few years now. The algorithm appears as stable as our own efforts can make it.
The code and configuration files which drive the 'make' execution incorporate a platform-independent and precise way of specifying what machines in an infrastructure require what features. Feature descriptions and machine descriptions are cleanly separated into different configuration files for ease and clarity of administration. These configuration files constitute a library and change history describing the current state of each machine.

We hope that publication of this algorithm will provide current and future infrastructure architects with a usable, basic component for use in automated systems administration, and a foundation for future work. Operating system vendors and researchers may be interested in the algorithm for inclusion in standard distribution images. We welcome re-implementations, insights, or improvements, and provide a forum for discussion and the latest reference code at http://www.infrastructures.org.

A Quick Note about Names
========================

The original name for the algorithm described here was 'isconf', for "infrastructure configuration", but we referred to it by the name 'Hostkeeper' in the 1998 paper, to align with the name of another tool. The term 'isconf' is still sometimes used and shows up in various other publications and in the reference code itself. In order to prevent further confusion, we are considering using the term 'isconf' to denote the algorithm, and 'Hostkeeper' to denote the reference implementation of that algorithm.

[Note to reviewers: interested in your thoughts on whether we should try to make the distinction between 'isconf' and 'Hostkeeper' in this paper. The difference between the two can be compared to 'dns' the protocol and 'BIND' the implementation.]

Design Goals
============

While the UNIX rc scripts provide a relatively standard way of booting an individual machine, there is not yet an equivalent accepted automatic mechanism for integrating that machine into an enterprise-wide infrastructure.
Integration is still far too often performed manually by one or more systems administrators, and subsequently maintained manually.

At each boot, a machine needs to ensure that it is at the correct patch level, running the correct versions of daemons and other applications, and otherwise fulfilling its designated role in the infrastructure before offering services to the world. If the machine fails to do this, then it is by definition not doing its job.

If the machine's designated role has changed since the previous boot, it needs to be able to detect and respond to that role change, again before offering services to the world. To do this it needs to be able to alter its own disk contents at boot without human intervention.

If the machine has been down for days, weeks, or months, was installed from an older disk image, or is otherwise out of date at boot time, it needs to be able to detect that condition and upgrade itself, again before offering services to the world.

These needs apply regardless of whether the machine is a high-availability server in a glass-walled room, or an ancient Sparc 5 under a user's desk.

Infrastructures which fail to respond to these needs will find themselves running out-of-date or insecure applications and operating system components. These infrastructures will incur unnecessarily high labor costs and cause unnecessarily long hours for the systems administrators who have to maintain them.

Hostkeeper Implementation
=========================

[Probably move some paragraphs from Overview to here.]

[List the components of Hostkeeper -- boot hook, configuration manager, platform detector, host description file, packages description file, and the directory structure where all this is stored on local disk.]

Boot Hook: rc.isconf
====================

The job of the boot hook is to update itself, update the local copy of the Hostkeeper code, and then execute the Configuration Manager.
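
The three steps of the boot hook might be sketched in Bourne shell roughly as follows. This is an illustration under stated assumptions, not the reference code: the names ISCONF_MASTER, ISCONF_HOME, ISCONF_REEXEC, fetch, and boot_hook are hypothetical, and a plain file copy from a locally reachable master directory stands in for whatever transport a site actually uses.

```shell
#!/bin/sh
# Hypothetical boot hook sketch. All names and paths below are assumptions
# made for illustration; they are not taken from the reference code.

ISCONF_MASTER=${ISCONF_MASTER:-/mnt/isconf-master}  # assumed: master copy, reachable as a path
ISCONF_HOME=${ISCONF_HOME:-/var/isconf}             # assumed: local copy of the code

# fetch SRC DST: install SRC over DST only if the two differ.
# Returns 0 if DST was updated, 1 if it was already current, 2 on error.
fetch() {
    [ -f "$1" ] || return 2
    if cmp -s "$1" "$2"; then
        return 1
    fi
    cp -p "$1" "$2.new" && mv "$2.new" "$2" || return 2
    return 0
}

# boot_hook SELF: SELF is the path of the installed hook script.
boot_hook() {
    self=$1

    # Step 1: update the hook itself. If a newer copy arrived, run the new
    # copy once; ISCONF_REEXEC guards against an endless re-exec loop.
    if [ -z "$ISCONF_REEXEC" ] && fetch "$ISCONF_MASTER/rc.isconf" "$self"; then
        ISCONF_REEXEC=1
        export ISCONF_REEXEC
        exec /bin/sh "$self"
    fi

    # Step 2: refresh the local copy of the code tree.
    mkdir -p "$ISCONF_HOME" &&
        cp -Rp "$ISCONF_MASTER/isconf/." "$ISCONF_HOME/" || return 1

    # Step 3: hand control to the Configuration Manager.
    "$ISCONF_HOME/bin/isconf"
}
```

An rc script built this way would end with a call such as boot_hook "$0". Updating the hook before anything else means that a fix to the hook itself takes effect on the same boot that delivers it.
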
[Describe how the boot hook, an rc script, first updates itself, then pulls down the Hostkeeper code from a well-known server using http, rsync, SUP, CVS, or equivalent, and then executes it. Show pseudocode.]

Configuration Manager: isconf
=============================

The Configuration Manager is a Bourne shell script which is called by rc.isconf. The Configuration Manager's main purpose in life is to read the host description file (described below), set up the 'make' execution environment, and then run 'make'.

[show pseudocode]

Platform Detector: platform
===========================

The platform detector (named simply 'platform') is a portable Bourne shell script which, when called, outputs the operating system and hardware types and revision levels in a standardized string format, such as "irix_6.2_mips" or "linux_2.2.17-14_i686". It is similar in concept to the 'config.guess' script used by GNU autoconf [autoconf]. The two scripts differ in scope, though: while 'config.guess' is most interested in function calls and libraries supported by a given machine, 'platform' is interested in the operating system and hardware itself.

[describe why the platform script is crucial to the operation of a heterogeneous configuration management tool, and how and where the platform strings are used.]

[show examples]

Host Description File: hosts.conf
=================================

The hosts.conf file [see listing 1] describes packages and features to be installed on a given host. For hosts not explicitly listed in the file, the DEFAULT entry is used.

Packages Description File: packages.conf
========================================

The packages.conf file [see listing 2] describes the components that make up a given package or feature. We commonly group one or more default sets of packages into a superpackage, which we normally name "Block00", "Block10", etc.
This allows us to group sets of packages to create default classes of machines for a given infrastructure. The packages.conf file is in 'Makefile' format, and it is in fact the makefile which 'make' reads when called by the 'isconf' script.

[in depth description of some of the stanzas in the example file]

Future Work
===========

[describe Perl re-implementation of the Hostkeeper/isconf functionality, currently in progress]

[describe possibility of using Arusha-like modular XML files to generate Makefiles on the fly]

References
==========

[bootstrap] Steve Traugott, Joel Huddleston, "Bootstrapping an Infrastructure", Proceedings of the Twelfth USENIX Systems Administration Conference (LISA '98), Boston, Massachusetts, http://www.infrastructures.org/papers/bootstrap/bootstrap.html

[make] Andrew Oram, Steve Talbott, "Managing Projects with make, 2nd Edition", O'Reilly & Associates, 1991

[automated] Steve Traugott, "Automated Systems Administration", unpublished draft, http://www.infrastructures.org/papers/automated.html

[burgess] Mark Burgess, "Computer Immunology", Proceedings of the Twelfth USENIX Systems Administration Conference (LISA '98), 1998, http://www.iu.hio.no/~mark/research/AIdrift/AIdrift.html

[cfengine] Mark Burgess, configuration engine and high-level policy language, http://www.iu.hio.no/cfengine

[rpm] Red Hat Package Manager, http://www.rpm.org

[arusha] Will Partain et al, The Arusha Project, http://ark.sourceforge.net/index.html

[autoconf] GNU Autoconf, http://www.gnu.org/software/autoconf/

LISTING 1 -- hosts.conf example
===============================

[Note to reviewers: this listing will be edited for size and clarity]

# $Id: isconf,v 1.4 2001/06/08 16:18:19 stevegt Exp $
#
# hosts.conf - describes hosts managed by isconf
#
# Parsed by bin/isconf.
#
# Format:
#
# hostname: VAR1=foo VAR2="foo bar" ...
#
# hostname = the output of 'hostname', NOT a DNS alias
#
# VARn = variables to be defined; multivalued variables and
# special characters must be quoted using /bin/sh syntax. Any and all
# variables specified here are passed as arguments to 'make'.
#
# There is one commonly used variable:
#
# PACKAGES
#
# Passed as command line argument to 'make'. Typically used to
# name the class and/or configuration group of a machine. There
# must be stanzas in the makefile which match the words in
# PACKAGES.
#
# WARNING: You can override any per-platform settings that bin/isconf
# sets (see the $PLATFORM case statement in that file), by placing an
# explicit host entry in etc/environment. This may not be what you
# want.
#
# DEFAULT entry
#
# Variables defined in DEFAULT entry are overridden by individual host
# entries, but only if the host entry explicitly lists the variable.
#
# If host is not listed in this file, then it uses the DEFAULT entry.

DEFAULT: PACKAGES=Block00 PKGBANK=http://isconf/pkgbank CVSROOT=:pserver:isro@isconf:/home/isconf/cvsroot

scramjet: PACKAGES="Block00 cvs-1.11-3 etc/ntp.conf"
node0.terraluna.org: PACKAGES="Block00 etc/ntp.conf"
venus.terraluna.org: PACKAGES=Block00
voyager: PACKAGES=exclude
node3: PACKAGES=exclude

LISTING 2 -- packages.conf example
==================================

[Note to reviewers: this listing will be edited for size and clarity]

SHELL=/bin/sh
HOSTNAME=
PLATFORM=
PACKAGES=
CVSROOT=
PKGBANK=

all: $(PACKAGES)

# NOTE: the exclude stanza is normally never executed and is only here
# to catch errors in the 'isconf' script. The isconf script
# itself should catch PACKAGES=exclude
exclude:
	@echo excluding $(HOSTNAME) from configuration management -- exiting...
	exit 127

Block00: ntp-4.0.99k-15 cvs-1.10.8-8 wget-1.5.3-10 \
	openssh-all-2.5.2p2-1.7.2 ntpclock-2 enable-ntp-3 \
	Block00-cron

Block00-cron: etc/ntp.conf

# generic rpm installer
cvs-1.10.8-8 cvs-1.11-3 tcsh-6.10-1 wget-1.5.3-10 openssh-2.5.2p2-1.7.2 openssh-clients-2.5.2p2-1.7.2:
	rpm -q $@ || rpm -Uvh $(PKGBANK)/$@.i386.rpm
	touch $@

# generic http executor
foo:
	exech $(PKGBANK)/$@
	touch $@

openssh-all-2.5.2p2-1.7.2:
	rpm -q openssh-2.5.2p2-1.7.2 || rpm -Uvh \
		$(PKGBANK)/openssh-2.5.2p2-1.7.2.i386.rpm \
		$(PKGBANK)/openssh-server-2.5.2p2-1.7.2.i386.rpm \
		$(PKGBANK)/openssh-clients-2.5.2p2-1.7.2.i386.rpm
	service sshd restart
	touch $@

etc/ntp.conf:
	isync -o $@
	service ntpd restart

ntp-4.0.99j-7 ntp-4.0.99k-15:
	- service ntpd stop
	rpm -q $@ || rpm -Uvh $(PKGBANK)/$@.i386.rpm
	ntpdate ntp
	chkconfig --add ntpd
	service ntpd start
	# ntpd -g
	touch $@

enable-ntp-3: ntp-4.0.99k-15
	chkconfig --level 2345 ntpd on
	service ntpd restart
	touch $@

ntpclock-2: ntp-4.0.99k-15
	isync -o etc/rc.d/init.d/ntpclock
	chkconfig --add ntpclock
	service ntpd restart
	touch $@

t7a-ntp-conf-1.0-1.sh t7a-ntp-conf-1.0-2.sh: ntp-4.0.99k-15
	exech $(PKGBANK)/$@
	- ln -fs /etc/ntp.conf.$(HOSTNAME) /etc/ntp.conf
	service ntpd restart
	touch $@

# dependencies
cvs-1.10.8-8: tcsh-6.10-1
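
As a rough illustration of how the two listings fit together, the following Bourne shell sketch looks up a host in a hosts.conf-style file, applies the DEFAULT entry first so that the host's own entry overrides only the variables it explicitly lists, and then prints the 'make' command it would run. The function names (lookup_entry, load_host_vars, make_command) are hypothetical, the use of 'make -f packages.conf' is an assumption, and only the three variables that appear in Listing 1 are forwarded; the reference 'isconf' script may differ on all of these points.

```shell
#!/bin/sh
# Hypothetical sketch of the hosts.conf lookup; function names are
# illustrative and are not taken from the reference 'isconf' script.

# lookup_entry FILE NAME: print the assignments that follow "NAME:" in FILE.
# Comment lines never match because the key must start in column one.
lookup_entry() {
    awk -v key="$2:" '
        index($0, key) == 1 { print substr($0, length(key) + 1); exit }
    ' "$1"
}

# load_host_vars FILE HOST: set the DEFAULT variables, then let the host's
# own entry override whichever of them it lists. An unlisted host yields
# an empty entry and so keeps the DEFAULT values.
# NOTE: eval trusts the file contents, which use /bin/sh quoting.
load_host_vars() {
    entry=$(lookup_entry "$1" DEFAULT)
    eval "$entry"
    entry=$(lookup_entry "$1" "$2")
    eval "$entry"
}

# make_command FILE HOST PLATFORM: print the 'make' invocation rather than
# running it, so the sketch can be tried safely on any machine.
make_command() {
    load_host_vars "$1" "$2"
    if [ "$PACKAGES" = exclude ]; then
        echo "# $2 is excluded from configuration management"
        return 0
    fi
    echo make -f packages.conf "HOSTNAME=$2" "PLATFORM=$3" \
        "PACKAGES=\"$PACKAGES\"" "PKGBANK=$PKGBANK" "CVSROOT=$CVSROOT"
}
```

Catching PACKAGES=exclude before 'make' is ever invoked matches the note in Listing 2 that the 'exclude' stanza exists only as a safety net.
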