
Client O/S Update

Prerequisites: Network File Servers, File Replication Servers

Vendors are waking up to the need for decent, large-scale operating system upgrade tools. Unfortunately, due to the "value added" nature of such tools and the lack of published standards, the various vendors are not sharing or cooperating with one another. It is risky to rely on these tools even if you think you will always have only one vendor to deal with. In today's business world of mergers and reorgs, single-vendor networks become a hodge-podge of conflicting heterogeneous networks overnight.

We started our work on a homogeneous network of systems. Eventually we added a second, and then a third, OS to that network. Adding the second OS took about five months. When the third came along, we found that adding it to our network was a simple matter of porting the tools -- it took about a week. Our primary tool was a collection of scripts and binaries that we called Hostkeeper.

Hostkeeper depended on two basic mechanisms: boot-time configuration and ongoing maintenance. At boot, the Hostkeeper client contacted the gold server to determine whether it had the latest patches and upgrades applied to its operating system image. This contact was via an NFS filesystem (/is/conf) mounted from the gold server.
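
A sketch of what that boot-time contact might look like; the rc fragment below is illustrative only, and the server name, mount options, and paths are not the actual Hostkeeper code:

#!/bin/sh
# illustrative rc fragment: mount the gold server's config
# tree read-only so this client can check for new patches
mount -o ro goldserver:/is/conf /is/conf || exit 1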

We used 'make' as our state engine. Every client ran 'make' on each reboot. Each OS/hardware platform had a makefile associated with it (/is/conf/bin/Makefile.{platform}). The targets in the makefile were tags that represented either our own internal revision levels or the patches that made up those revision levels. We borrowed a term from the aerospace industry -- "block 00" was a vanilla machine, "block 10" was with the first layer of patches installed, and so on. The Makefiles looked something like this:

# the "block" targets name our internal revision levels; each
# patch target runs its install script once, then touches a
# timestamp file so 'make' never reruns it on this machine
block00: localize
block10: block00 14235-43 xdm_fix01

14235-43 xdm_fix01:
    /is/conf/patches/$(PLATFORM)/$@/install_patch
    touch $@

localize:
    /is/conf/bin/localize
    touch $@

Note the 'touch' commands at the end of each patch stanza; this prevented 'make' from running the same stanza on the same machine ever again. (We ran 'make' in a local directory where these timestamp files were stored on each machine.)
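
For illustration, the per-boot invocation might have looked something like this; the timestamp directory and the platform-naming scheme here are assumptions, not the real Hostkeeper paths:

#!/bin/sh
# illustrative wrapper: run 'make' from the local timestamp
# directory, so any target already touched there is skipped
PLATFORM=`uname -m`
cd /var/hostkeeper || exit 1
exec make -f /is/conf/bin/Makefile.$PLATFORM PLATFORM=$PLATFORM block10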

We had mechanisms that allowed us to manage custom patches and configuration changes on selected machines. These were usually driven by environment variables set in /etc/environment or the equivalent.
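
For example, a guard of this shape could key a change to selected machines only; the variable name and patch directory are hypothetical:

# in /etc/environment on the selected machines:
#   MULTIHEAD=true
# ...and in the corresponding patch script, a guard like:
if [ "$MULTIHEAD" = "true" ]; then
    /is/conf/patches/$PLATFORM/multihead01/install_patch
fi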

The time required to write and debug a patch script and add it to the makefile was minimal compared to the time it would have taken to apply the same patch to over 200 clients by hand, then to all new machines after that. Even simple changes, such as configuring a client to use a multi-headed display, were scripted. This strict discipline allowed us to exactly recreate a machine in case of disaster.

For operating systems that provided a patch mechanism like 'pkgadd', these scripts were easy to write. For others we had our own methods. These days we would probably use RPM for the latter [rpm].
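
Roughly, the two flavors differed only inside install_patch. A sketch of each follows; the package name, file names, and admin file are placeholders:

#!/bin/sh
# illustrative install_patch for an OS with a native package
# tool; -n runs pkgadd non-interactively, and the admin file
# answers its remaining prompts
pkgadd -n -a ./admin -d . SUNWxdmf

# ...or, on an RPM-based system, roughly:
# rpm -U ./xdm-fix-1.0-1.rpm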

You may recognize many of the functions of 'cfengine' in the above description [burgess]. When we started on this project, 'cfengine' was in its early stages of development, though we were still tempted to use it. In retrospect, the one feature missing from cfengine is some form of lifetime history of ordered changes -- what we're using those 'touch' commands for in 'make'. Attempts since then to use cfengine for this sort of deterministic control have shown us that the mismatch is more than syntactic; the semantics of 'make' are simply a better fit. Cfengine is better suited for partially converging machines from an unknown state. This 'make' technique (or something similar) requires less mental gymnastics to predict what will happen when you run it, but it requires that you start machines from a known image, which we always do [turing].

One tool that bears closer scrutiny is Sun Microsystems' Autoclient. The Autoclient model can best be described as a dataless client whose local files are a cached mirror of the server. The basic strategy of Autoclient is to provide the client with a local disk drive to hold the operating system, and to refresh that operating system (using Sun's CacheFS feature) from a central server. This is a big improvement over the old diskless client offering from Sun, which overloaded servers and networks with NFS traffic.

A downside of Autoclient is its dependence on Sun's proprietary CacheFS mechanism. Another downside is scalability: eventually, the number of clients will exceed what one server can support. This means adding a second server, then a third, and then the problem becomes one of keeping the servers in sync. Essentially, Autoclient does not solve the problem of system synchronization; it delays it. However, that delay may be exactly what the system administrator needs to get a grip on a chaotic infrastructure.
