Why Order Matters: Turing Equivalence in Automated Systems Administration

Hosts in a well-architected enterprise infrastructure are self-administered; they perform their own maintenance and upgrades. By definition, self-administered hosts execute self-modifying code. They do not behave according to simple state machine rules, but can incorporate complex feedback loops and evolutionary recursion.

The implications of this behavior are of immediate concern to the reliability, security, and ownership costs of enterprise computing. In retrospect, it appears that the same concerns also apply to manually-administered machines, in which administrators use tools that execute in the context of the target disk to change the contents of the same disk. The self-modifying behavior of both manual and automatic administration techniques helps explain the difficulty and expense of maintaining high availability and security in conventionally-administered infrastructures.

The practice of infrastructure architecture tool design exists to bring order to this self-referential chaos. Conventional systems administration can be greatly improved upon through discipline, culture, and adoption of practices better fitted to enterprise needs. Creating a low-cost maintenance strategy largely remains an art. What can we do to put this art into the hands of relatively junior administrators? We think that part of the answer includes adopting a well-proven strategy for maintenance tools, based in part upon the theoretical properties of computing.

In this paper, we equate self-administered hosts to Turing machines in order to help build a theoretical foundation for understanding this behavior. We discuss some tools that provide mechanisms for reliably managing self-administered hosts, using deterministic ordering techniques.

Based on our findings, it appears that no tool, written in any language, can predictably administer an enterprise infrastructure without maintaining a deterministic, repeatable order of changes on each host. The runtime environment for any tool always executes in the context of the target operating system; changes can affect the behavior of the tool itself, creating circular dependencies. The behavior of these changes may be difficult to predict in advance, so testing is necessary to validate changed hosts. Once changes have been validated in testing they must be replicated in production in the same order in which they were tested, due to these same circular dependencies.

The least-cost method of managing multiple hosts also appears to be deterministic ordering. All other known management methods seem to include either more testing or higher risk for each host managed.

This paper is a living document; revisions and discussion can be found at Infrastructures.Org, a project of TerraLuna, LLC.

All computer systems management methods can be classified into one of three categories: divergent, convergent, and congruent.

Divergence [!divergence.png] generally implies bad management. Experience shows us that virtually all enterprise infrastructures are still divergent today. Divergence is characterized by the configuration of live hosts drifting away from any desired or assumed baseline disk content.

One quick way to tell if a shop is divergent is to ask how changes are made on production hosts, how those same changes are incorporated into the baseline build for new or replacement hosts, and how they are made on hosts that were down at the time the change was first deployed. If you get different answers, then the shop is divergent.

The symptoms of divergence include unpredictable host behavior, unscheduled downtime, unexpected package and patch installation failure, unclosed security vulnerabilities, significant time spent "firefighting", and high troubleshooting and maintenance costs.

Divergence is generally caused by the class of operations that create non-reproducible change. Examples include ad-hoc manual changes, changes implemented by two independent automatic agents on the same host, and other unordered changes. Scripts which drive rdist, rsync, ssh, scp, or other change agents [rdist] [rsync] [ssh] as a push operation [bootstrap] are also a common source of divergence.

Convergence [!convergence.png] is the process most senior systems administrators first begin when presented with a divergent infrastructure. They tend to start by manually synchronizing some critical files across the diverged machines, then they figure out a way to do that automatically. Convergence is characterized by the configuration of live hosts moving towards an ideal baseline. By definition, all converging infrastructures are still diverged to some degree. (If an infrastructure maintains full compliance with a fully descriptive baseline, then it is congruent according to our definition, not convergent.)

The baseline description in a converging infrastructure is characteristically an incomplete description of machine state. You can quickly detect convergence in a shop by asking how many files are currently under management control. If an approximate answer is readily available and is on the order of a few hundred files or less, then the shop is likely converging legacy machines on a file-by-file basis.

A convergence tool is an excellent means of bringing some semblance of order to a chaotic infrastructure. Convergent tools typically work by sampling a small subset of the disk -- via a checksum of one or more files, for example -- and taking some action in response to what they find. The samples and actions are often defined in a declarative or descriptive language that is optimized for this use. This emulates and preempts the firefighting behavior of a reactive human systems administrator -- "see a problem, fix it". Automating this process provides great economies of scale and speed over doing the same thing manually.
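The sample-and-act cycle described above can be sketched in a few lines of shell. This is only an illustration of the idea, not a description of how cfengine or any other tool is implemented; the function name converge_file and the choice of cksum as the sampling method are our assumptions.

```sh
#!/bin/sh
# converge_file MASTER TARGET   (hypothetical helper, for illustration only)
# Sample one managed file by checksum and act only if it has drifted:
# copy MASTER over TARGET when they differ. Prints "fixed" or "ok".
converge_file() {
    master=$1
    target=$2
    want=$(cksum < "$master")
    if [ -f "$target" ]; then
        have=$(cksum < "$target")
    else
        have=missing
    fi
    if [ "$want" != "$have" ]; then
        cp "$master" "$target" && echo fixed
    else
        echo ok
    fi
}
```

Note that the sketch looks only at the files it is told about. Every other file on the disk remains unmanaged, which is the limitation discussed in the paragraphs that follow.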

Convergence is a feature of Mark Burgess' Computer Immunology principles [immunology]. His cfengine is in our opinion the best tool for this job [cfengine]. Simple file replication tools [sup] [cvsup] [rsync] provide a rudimentary convergence function, but without the other action semantics and fine-grained control that cfengine provides.

Because convergence typically includes an intentional process of managing a specific subset of files, there will always be unmanaged files on each host. Whether current differences between unmanaged files will have an impact on future changes is undecidable, because at any point in time we do not know the entire set of future changes, or what files they will depend on.

It appears that a central problem with convergent administration of an initially divergent infrastructure is that there is no documentation or knowledge as to when convergence is complete. One must treat the whole infrastructure as if the convergence is incomplete, whether it is or not. So without more information, an attempt to converge formerly divergent hosts to an ideal configuration is a never-ending process. By contrast, an infrastructure based upon first loading a known baseline configuration on all hosts, and limited to purely orthogonal and non-interacting sets of changes, implements congruence [#methods/congruence]. Unfortunately, this is not the way most shops use convergent tools such as cfengine.

The symptoms of a convergent infrastructure include a need to test all changes on all production hosts, in order to detect failures caused by remaining unforeseen differences between hosts. These failures can impact production availability. The deployment process includes iterative adjustment of the configuration tools in response to newly discovered differences, which can cause unexpected delays when rolling out new packages or changes. There may be a higher incidence of failures when deploying changes to older hosts. There may be difficulty eliminating some of the last vestiges of the ad-hoc methods mentioned in section [#methods/divergence]. Continued use of ad-hoc and manual methods virtually ensures that convergence cannot complete.

With all of these faults, convergence still provides much lower overall maintenance costs and better reliability than what is available in a divergent infrastructure. Convergence features also provide more adaptive self-healing ability than pure congruence, due to a convergence tool's ability to detect when deviations from baseline have occurred. Congruent infrastructures rely on monitoring to detect deviations, and generally call for a rebuild when they have occurred. We discuss the security reasons for this in section [#methods/congruence].

We have found apparent limits to how far convergence alone can go. We know of no previously divergent infrastructure that, through convergence alone, has reached congruence [#methods/congruence]. This makes sense: convergence is a process of eliminating differences on an as-needed basis, so the managed disk content will generally be a smaller set than the unmanaged content. In order to prove congruence, we would need to sample all bits on each disk, ignore those that are user data, determine which of the remaining bits are relevant to the operation of the machine, and compare those with the baseline.

In our experience, it is not enough to prove via testing that two hosts currently exhibit the same behavior while ignoring bit differences on disk; we care not only about current behavior, but future behavior as well. Bit differences that are currently deemed not functional, or even those that truly have not been exercised in the operation of the machine, may still affect the viability of future change directives. If we cannot predict the viability of future change actions, we cannot predict the future viability of the machine.

Deciding what bit differences are "functional" is often open to individual interpretation. For instance, do we care about the order of lines and comments in /etc/inetd.conf? We might strip out comments and reorder lines without affecting the current operation of the machine; this might seem like a non-functional change, until two years from now. After time passes, the lack of comments will affect our future ability to correctly understand the infrastructure when designing a new change. This example would seem to indicate that even non-machine-readable bit differences can be meaningful when attempting to prove congruence.

Unless we can prove congruence, we cannot validate the fitness of a machine without thorough testing, due to the uncertainties described in section [#thesis/utm/g/disorder]. In order to be valid, this testing must be performed on each production host, due to the factors described in section [#thesis/order/future]. This testing itself requires either removing the host from production use or exposing untested code to users. Without this validation, we cannot trust the machine in mission-critical operation.

Congruence [!congruence.png] is the practice of maintaining production hosts in complete compliance with a fully descriptive baseline [#howto-describe]. Congruence is defined in terms of disk state rather than behavior, because disk state can be fully described, while behavior cannot [#thesis/defining].

By definition, divergence from baseline disk state in a congruent environment is symptomatic of a failure of code, administrative procedures, or security. In any of these three cases, we may not be able to assume that we know exactly which disk content was damaged. It is usually safe to handle all three cases as a security breach: correct the root cause, then rebuild.

You can detect congruence in a shop by asking how the oldest, most complex machine in the infrastructure would be rebuilt if destroyed. If years of sysadmin work can be replayed in an hour, unattended, without resorting to backups, and only user data need be restored from tape, then host management is likely congruent.

Rebuilds in a congruent infrastructure are completely unattended and generally faster than in any other; anywhere from 10 minutes for a simple workstation to 2 hours for a node in a complex high-availability server cluster (most of that two hours is spent in blocking sleeps while meeting barrier conditions with other nodes).

Symptoms of a congruent infrastructure include rapid, predictable, "fire-and-forget" deployments and changes. Disaster recovery and production sites can be easily maintained or rebuilt on demand in a bit-for-bit identical state. Changes are not tested for the first time in production, and there are no unforeseen differences between hosts. Unscheduled production downtime is reduced to that caused by hardware and application problems; firefighting activities drop considerably. Old and new hosts are equally predictable and maintainable, and there are fewer host classes to maintain. There are no ad-hoc or manual changes. We have found that congruence makes cost of ownership much lower, and reliability much higher, than any other method.

Our own experience and calculations show that the investment in converting from divergence to congruence pays for itself in less than 8 months for most organizations. See [!t7a_automation_curve.png]. This graph assumes an existing divergent infrastructure of 300 hosts, 2%/month growth rate, followed by adoption of congruent automation techniques. Typical observed values were used for other input parameters. Automation tool rollout began at the 6-month mark in this graph, causing temporarily higher costs; the investment is recovered 5 months later, where the manual and automatic lines cross at the 11-month mark. Following crossover, we see rapidly increasing cost savings, continuing over the life of the infrastructure. While this graph is calculated, the results agree with actual enterprise environments that we have converted. There is a CGI generator for this graph at Infrastructures.Org, where you can experiment with your own parameters.

Congruence allows us to validate a change on one host in a class, in an expendable test environment, then deploy that change to production without risk of failure. Note that this is useful even when (or especially when) there may be only one production host in that class.

A congruence tool typically works by maintaining a journal of all changes to be made to each machine, including the initial image installation. The journal entries for a class of machine drive all changes on all machines in that class. The tool keeps a lifetime record, on the machine's local disk, of all changes that have been made on a given machine. In the case of loss of a machine, all changes made can be recreated on a new machine by "replaying" the same journal; likewise for creating multiple, identical hosts. The journal is usually specified in a declarative language that is optimized for expressing ordered sets and subsets. This allows subclassing and easy reuse of code to create new host types.
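The journal-and-replay idea can be sketched as follows. This is a minimal illustration under our own assumptions, not the isconf implementation: the journal is taken to be a plain file with one shell command per line, and the names replay_journal, JOURNAL, and RECORD are hypothetical.

```sh
#!/bin/sh
# replay_journal JOURNAL RECORD   (hypothetical names, for illustration)
# JOURNAL: one shell command per line, in the order the changes were tested.
# RECORD:  lifetime record, kept on the local disk, of entries already applied.
# Entries are applied strictly in journal order. The first failure stops the
# replay, so no later entry ever runs against an unknown state.
replay_journal() {
    journal=$1
    record=$2
    touch "$record" || return 1
    done_count=$(( $(wc -l < "$record") ))
    n=0
    while IFS= read -r entry; do
        n=$((n + 1))
        # Skip entries this host has already applied.
        [ "$n" -le "$done_count" ] && continue
        sh -c "$entry" < /dev/null || {
            echo "halted at entry $n: $entry" >&2
            return 1
        }
        printf '%s\n' "$entry" >> "$record"
    done < "$journal"
}
```

Running the function on a fresh machine with an empty record replays the whole history; running it on an existing machine applies only the entries appended since the last run. The sketch assumes that entries are only ever appended to the journal, never edited or reordered.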

There are few tools that are capable of the ordered lifetime journaling required for congruent behavior. Our own isconf [#examples-isconf] is the only specifically congruent tool we know of in production use, though cfengine, with some care and extra coding, appears to be usable for administration of congruent environments. We discuss this in more detail in section [#examples-cfengine].

We recognize that congruence may be the only acceptable technique for managing life-critical systems infrastructures, including those that:

Influence the results of human-subject health and medicine experiments
Provide command, control, communications, and intelligence (C³I) for battlefield and weapons systems environments
Support command and telemetry systems for manned aerospace vehicles, including spacecraft and national airspace air traffic control

Our personal experience shows that awareness of the risks of conventional host management techniques has not yet penetrated many of these organizations. This is cause for concern.

Automated systems administration is very straightforward. There is only one way for a user-side administrative tool to change the contents of disk in a running UNIX machine -- the syscall interface. The task of automated administration is simply to make sure that each machine's kernel gets the right system calls, in the right order, to make it be the machine you want it to be.

If there are N bits on a disk, then there are 2^N possible disk states. In order to maintain the baseline host description needed for congruent management, we need to have a way to describe any arbitrary disk state in a highly compressed way, preferably in a human-readable configuration file or script. For the purposes of this description, we neglect user data and log files -- we want to be able to describe the root-owned and administered portions of disk.

"Order Matters" whether creating or modifying a disk:

A concise and reliable way to describe any arbitrary state of a disk is to describe the procedure for creating that state.

This procedure will include the initial state (bare-metal build) of the disk, followed by the steps used to change it over time, culminating in the desired state. This procedure must be in writing, preferably in machine-readable form. This entire set of information, for all hosts, constitutes the baseline description of a congruent infrastructure. Each change added to the procedure updates the baseline.

There are tools which can help you maintain and execute this procedure.

While it is conceivable that this procedure could be a documented manual process, executing these steps manually is tedious and costly at best, and generally error-prone (though we know of many large mission-critical shops which try). Manual execution of complex procedures is one of the best methods we know of for generating divergence [#methods/divergence].

The starting state (bare-metal install) description of the disk may take the form of a network install tool's configuration file, such as that used for Solaris Jumpstart or RedHat Kickstart. The starting state might instead be a bitstream representing the entire initial content of the disk (usually a snapshot taken right after install from vendor CD). The choice of which of these methods to use is usually dependent on the vendor-supplied install tool -- some will support either method, some require one or the other.

A systems administrator, whether a human or a piece of software [#thesis/selfadmin], can easily break an enterprise infrastructure by executing the right actions in the wrong order. In this section, we will explore some of the ways this can happen.

First we will cover a trivial but devastating example that is easily avoided. This once happened to a colleague while doing manual operations on a machine. He wanted to clean out the contents of a directory which ordinarily had the development group's source code NFS mounted over top of it. Here is what he wanted to do:

	umount /apps/src
	cd /apps/src
	rm -rf .
	mount /apps/src

Here's what he actually did:

	umount /apps/src
		...umount fails, directory in use; while resolving
		this, his pager goes off, he handles the interrupt,
		then...
	cd /apps/src
	rm -rf .

Needless to say, there had also been no backup of the development source tree for quite some time...

In this example, "correct order" includes some concept of sufficient error handling. We show this example because it highlights the importance of a default behavior of "halt on error" for automatic systems administration tools. Not all tools halt on error by default; isconf does [#examples-isconf].
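For comparison, here is one way the intended sequence could be written so that it halts on error. This is a sketch under our own assumptions: the function name clean_mountpoint is hypothetical, and the directory is assumed to have an fstab entry so that mount and umount can be given the mount point alone, as in the listing above.

```sh
#!/bin/sh
# clean_mountpoint DIR   (hypothetical helper, for illustration only)
# Unmount DIR, empty the underlying directory, then remount it.
# The steps are chained with &&, so each one runs only if the previous
# one succeeded: a failed umount can never be followed by the removal.
clean_mountpoint() {
    dir=$1
    umount "$dir" &&
    (
        cd "$dir" &&
        # Remove every entry in the directory, dotfiles included,
        # without removing the directory itself.
        find . ! -name . -prune -exec rm -rf {} +
    ) &&
    mount "$dir"
}
```

A script could get similar protection by starting with "set -e", but an interactive session has no such default, which is part of why manual execution of this kind of procedure is risky.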

We in the UNIX community have long accused Windows developers of poor library management, due to the fact that various Windows applications often come bundled with differing versions of the same DLLs. It turns out that at least some UNIX and Linux distributions appear to suffer from the same problem.

Jeffrey D'Amelia and John Hart [hart] demonstrated this in the case of RedHat RPMs, both official and contributed. They showed that the order in which you install RPMs can matter, even when there are no applicable dependencies specified in the package. We don't assume that this situation is restricted to RPMs; any package management system may be susceptible to this problem. An interesting study would be to investigate similar overlaps in vendor-supplied packages for commercial UNIX distributions.

Detecting this problem for any set of packages involves extensive analysis by talented persons. In the case of [hart], the authors developed a suite of global analysis tools, and repeatedly downloaded and unpacked thousands of RPMs. They still only saw "the tip of the iceberg" (their words). They intentionally ignored the actions of postinstall scripts, and they had not yet executed any packaged code to look for behavioral interactions.

Avoiding the problem is easier; install the packages, record the order of installation, test as usual, and when satisfied with testing, install the same packages in the same order on production machines.

While we've used packages in this example, we'd like to remind the reader that these considerations apply not only to package installation but any other change that affects the root-owned portions of disk.

There is a "chicken and egg" or bootstrapping problem when updating either an automated systems administration tool (ASAT) or its underlying foundations [#thesis/selfadmin/unintended]. Order is important when changes the tool makes can change the ability of the tool to make changes.

For example, cfengine version 2 includes new directives available for use in configuration files. Before using a new configuration file, the new version of cfengine needs to be installed. The new client is named 'cfagent' rather than 'cfengine', so wrapper scripts and crontab entries will also need to be updated, and so on.

For fully automated operation on hundreds or thousands of machines, we would like to be able to upgrade cfengine under the control of cfengine [#thesis/selfadmin/asat/r2]. We want to ensure that the following actions will take place on all machines, including those currently down:

1. fetch new configuration file containing the following instructions
2. install new cfagent binary
3. run cfkey to generate key pair
4. fetch new configuration file containing version 2 directives
5. update calling scripts and crontab entries

There are several ordering considerations here. We won't know that we need the new cfagent binary until we do step 1. We shouldn't proceed with step 4 until we know that 2 and 3 were successful. If we do 5 too early, we may break the ability for cfengine to operate at all. If we do step 4 too early and try to run the resulting configuration file using the old version of cfengine, it will fail.

While this example may seem straightforward, implementing it in a language which does not by default support deterministic ordering requires much use of conditionals, state chaining, or equivalent. If this is the case, then code flow will not be readily apparent, making inspection and edits error-prone. Infrastructure automation code runs as root and has the ability to stop work across the entire enterprise; it needs to be simple, short, and easy for humans to read, like security-related code paths in tools such as PGP or ssh.

If the tool's language does not support "halt on error" by default, then it is easy to inadvertently allow later actions to take place when we would have preferred to abort. Going back to our cfengine example, if we can easily abort and leave the cfengine version 1 infrastructure in place, then we can still use version 1 to repair the damage.
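The ordering and halt-on-error requirements for the five upgrade steps can be sketched in shell. This is an illustration only, not cfengine or isconf code; run_in_order and the step names in the trailing comment are hypothetical, and each step is assumed to be a shell function or command that returns non-zero on failure.

```sh
#!/bin/sh
# run_in_order STEP...   (hypothetical helper, for illustration only)
# Run each named step in the order given. Stop at the first failure and
# report which step failed, leaving all later steps undone.
run_in_order() {
    for step in "$@"; do
        "$step" || {
            echo "halted at: $step" >&2
            return 1
        }
        echo "done: $step"
    done
}

# Hypothetical step names mirroring the five upgrade steps; each would be
# defined elsewhere as a function:
#
#   run_in_order fetch_bootstrap_config install_cfagent run_cfkey \
#                fetch_v2_config update_wrappers_and_cron
```

If install_cfagent fails in this sketch, the wrapper scripts and crontab entries are never touched, so the version 1 infrastructure remains available to repair the damage.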

There are many other examples we could show, some including multi-host "barrier" problems. These include:

Updating ssh to openssh on hundreds of hosts and getting the authorized_keys and/or protocol version configuration out of order. This can greatly hinder further contact with the target hosts. Daniel Hagerty [hagerty] ran into this one; many of us have been bitten by this at some point.
Reconfiguring network routes or interfaces while communicating with the target device via those same routes or interfaces. Ordering errors can prevent further contact with the target, and often require a physical visit to resolve. This is especially true if the target is a workstation with no remote serial console access. Again, most readers have had this happen to them.

While there are many ASATs available, the two we are most familiar with are cfengine and our own isconf [cfengine] [isconf]. In this section, we will look at these two tools from the perspective of Turing equivalence [#thesis], with a focus on how each can be used deterministically.

In general, some of the techniques that seem to work well for the design and use of most ASATs include:

Keep the "Turing tape" a finite size by holding the network content constant [#thesis/utm/rebuild], or versioning it using CVS or another version control tool [cvs] [bootstrap]. This helps prevent some of the more insidious behaviors that are a potential in self-modifying machines [#thesis/selfadmin/unintended].
Continuing in that vein, when using distributed package repositories such as the public Debian [debian] package server infrastructure, always specify version numbers when automating the installation of packages, rather than let the package installation tool (in Debian's case apt-get) select the latest version. If you do not specify the package version, then you may introduce divergence [#methods/divergence]. This risk varies, of course, depending on your choice of 'stable' or 'unstable' distribution, though we suspect it still applies in 'stable', especially when using the 'security' packages. It certainly applies in all cases when you need to maintain your own kernel or kernel modules rather than using the distributed packages.
We have experienced this repeatedly -- machines which built correctly the first time with a given package list will not rebuild with the same package list a few weeks later, due to package version changes on the public servers, and resulting unresolved incompatibilities with local conditions and configuration file contents. Remember, your hosts are unique in the world -- there are likely no others like them. Package maintainers cannot be expected to test every configuration, especially yours. You must retain this responsibility.

We use Debian in this example because it is a distribution we like a lot; note that other package distribution and installation infrastructures, such as the RedHat up2date system, also have this problem.
Expect long dependency or sequence chains when building enterprise infrastructures. If an ASAT can easily support encapsulation and ordering of 10, 50, or even 100 complex atomic actions in a single chain, then it is likely capable of fully automated administration of machines, including package, kernel, build, and even rebuild management. If the ASAT is cumbersome to use when chains become only two or three actions deep, then it is likely most suited for configuration file management, not package, binary, or kernel manipulation.
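The advice above to always specify package versions can be checked mechanically before a build starts. The following is a minimal sketch under our own assumptions: the package list is a plain file with one entry per line in apt-get's name=version form, and the function name unpinned_packages is hypothetical.

```sh
#!/bin/sh
# unpinned_packages LISTFILE   (hypothetical helper, for illustration only)
# LISTFILE holds one package per line in apt-get's name=version form.
# Blank lines and lines starting with '#' are ignored.
# Print every entry that lacks an explicit version, and return non-zero
# if any was found, so that a build script can refuse to continue.
unpinned_packages() {
    awk 'NF && $1 !~ /^#/ && $1 !~ /^[^=]+=[^=]+$/ { print $1; bad = 1 }
         END { exit bad + 0 }' "$1"
}
```

A build script might call this first and abort on failure, for example: unpinned_packages packages.list || exit 1 (the file name is again only an example).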

As we mentioned in section [#foreword], isconf originally began life as a quick hack. Its basic utility has proven itself repeatedly over the last 8 years, and as adoption has grown it is currently managing more production infrastructures than we are personally aware of.

While we show some ISconf makefile examples here, we do not show any example of the top-level configuration file which drives the environment and targets for 'make'. It is this top-level configuration file, and the scripts which interpret it, which are the core of ISconf and enable the typing or classing of hosts. These top-level facilities are also what govern the actions ISconf is to take during boot versus cron or other execution contexts. More information and code are available at ISconf.org and Infrastructures.Org.

We also do not show here the network fetch and update portions of ISconf, and the way that it updates its own code and configuration files at the beginning of each run. This default behavior is something that we feel is important in the design of any automated systems administration tool. If the tool does not support it, end-users will have to figure out how to do it themselves, reducing the usability of the tool.

Version 2 of ISconf was a late-90's rewrite to clean up and make portable the lessons learned from version 1. As in version 1, the code used was Bourne shell, and the state engine used was 'make'.

In listing 1, we show a simplified example of Version 2 usage. While examples related to this can be found in [hart] and in our own makefiles, real-world usage is usually much more complex than the example shown here. We've contrived this one for clarity of explanation.

In this contrived example, we install two packages which we have not proven orthogonal. We in fact do not wish to take the time to detect whether or not they are orthogonal, due to the considerations expressed in section [#thesis/cost]. We may be tool users, rather than tool designers, and may not have the skillset to determine orthogonality, as in section [#thesis/whodecides].

These packages might both affect the same shared library, for instance. Again according to [hart] and our own experience, it is not unusual for two packages such as these to list neither as prerequisites, so we might gain no ordering guidance from the package headers either.

In other words, all we know is that we installed package 'foo', tested and deployed it to production, and then later installed package 'bar', tested it and deployed. These installs may have been weeks or months apart. All went well throughout, users were happy, and we have no interest in unpacking and analyzing the contents of these packages for possible reordering for any reason; we've gone on to other problems.

Because we know this order works, we wish for these two packages, 'foo' and 'bar', to be installed in the same order on every future machine in this class. This makefile will ensure that; the touch $@ command at the end of each stanza will prevent that stanza from being run again. The ISconf code always changes to the timestamps directory before starting 'make' (and takes other measures to constrain the normal behavior of 'make', so that we never try to "rebuild" this target either).

The class name in this case (listing 1) is 'Block12'. You can see that 'Block12' is also made up of many other packages; we don't show the makefile stanzas for these here. These packages are listed as prerequisites to 'Block12', in chronological order. Note that we only want to add items to the end of this list, not the middle, due to the considerations expressed in section [#thesis/selfadmin/prior].

In this example, even though we take advantage of the Debian package server infrastructure, we specify the version of each package that we want, as in the introduction to section [#examples]. We also use a caching proxy when fetching Debian packages, in order to speed up our own builds and reduce the load on the Debian servers to a minimum.

Note that we get "halt-on-error" behavior from 'make', as we wished for in section [#break-cmd]. If any of the commands in the 'foo' or 'bar' sections exit with a non-zero return code, then 'make' aborts processing immediately. The 'touch' will not happen, and we normally configure the infrastructure such that the ISconf failure will be noticed by a monitoring tool and escalated for resolution. In practice, these failures very rarely occur in production; we see and fix them in test. Production failures, by the definition of congruence [#methods/congruence], usually indicate a systemic, security, or organizational problem; we don't want them fixed without human investigation.

Listing 1: ISconf makefile package ordering example.

Block12: cvs ntp foo lynx wget serial_console bar sudo mirror_rootvg

foo:
	apt-get -y install foo=0.17-9
	touch $@

bar:
	apt-get -y install bar=1.0.2-1
	echo apple pear > /etc/bar.conf
	touch $@

...
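The two behaviors that listing 1 borrows from 'make' (run each stanza at most once, and halt on the first error) can be illustrated outside of 'make'. The following is a minimal Python sketch, not ISconf code; the function name `run_ordered`, the stamp directory argument, and the (name, command) pair format are assumptions made for illustration only.

```python
import subprocess
import tempfile
from pathlib import Path


def run_ordered(actions, stamp_dir):
    """Run (name, shell_command) pairs in list order, at most once each.

    A stamp file named after the action plays the role of 'touch $@'.
    Returns the names executed during this call. Raises RuntimeError on
    the first failing command, leaving that action unstamped.
    """
    stamp_dir = Path(stamp_dir)
    stamp_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, command in actions:
        stamp = stamp_dir / name
        if stamp.exists():
            continue  # already applied on this host; never "rebuild" it
        result = subprocess.run(command, shell=True)
        if result.returncode != 0:
            # Halt on error: no stamp is written and later actions do not run.
            raise RuntimeError(
                f"action {name!r} failed with status {result.returncode}"
            )
        stamp.touch()
        executed.append(name)
    return executed
```

In this sketch, a caller would pass the stanzas from listing 1 as an ordered list, for example `[("foo", "apt-get -y install foo=0.17-9"), ...]`. A second call with the same stamp directory does nothing, and a failed action stays unstamped until someone investigates.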

ISconf version 3 was a rewrite in Perl, by Luke Kanies. This version adds more "lessons learned", including more fine-grained control of actions as applied to target classes and hosts. There are more layers of abstraction between the administrator and the target machines; the tool uses various input files to generate intermediate and final file formats which eventually are fed to 'make'.

One feature in particular is of special interest for this paper. In ISconf version 2, the administrator still had the potential to inadvertently create unordered change by an innocent makefile edit. While it is possible to avoid this with foreknowledge of the problem, version 3 uses timestamps in an intermediate file to prevent it from being an issue.

The problem which version 3 fixes can be reproduced in version 2 as follows. Refer to listing 1. If both 'foo' and 'bar' have already been executed (installed) on production machines, and the administrator then adds 'baz' as a prerequisite to 'bar', this would qualify as "editing prior actions" and create the divergence described in [#thesis/selfadmin/prior].

ISconf version 3, rather than using a human-edited makefile, reads other input files which the administrator maintains, and generates intermediate and final files which include timestamps to detect the problem and correct the ordering.
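The "editing prior actions" problem can be detected mechanically. The sketch below is not the timestamp mechanism used by ISconf version 3, whose file formats are not shown here. It substitutes a simpler prefix check, and it only detects the problem rather than correcting the ordering. The function name `check_append_only` and the idea of a per-host list of applied actions are assumptions for illustration.

```python
def check_append_only(applied, planned):
    """Return the actions still to run, or raise if history was edited.

    applied: names already executed on this host, in execution order.
    planned: the full ordered list the administrator currently maintains.
    """
    applied = list(applied)
    planned = list(planned)
    if planned[:len(applied)] != applied:
        # Something was inserted, removed, or reordered before the end of
        # the list, so this host can no longer follow the tested sequence.
        raise ValueError("planned order no longer starts with the applied history")
    return planned[len(applied):]
```

With `applied = ["foo", "bar"]`, a plan of `["foo", "bar", "baz"]` yields `["baz"]`, while a plan of `["foo", "baz", "bar"]` raises an error, because 'baz' was inserted ahead of an action that production hosts have already run.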

ISconf version 4, currently in prototype, represents a significant architectural change from versions 1 through 3. If the current feature plan is fully implemented, version 4 will enable cross-organizational collaboration for development and use of ordered change actions. A core requirement is decentralized development, storage, and distribution of changes. It will enable authentication and signing, encryption, and other security measures. We are likely to replace 'make' with our own state engine, continuing the migration begun in version 3. See ISconf.Org for the latest information.

In section [#methods/congruence], we discussed the concept of maintaining a fully descriptive baseline for congruent management. In [#howto-describe], we discussed in general terms how this might be done. In this section, we will show how we do it in ISconf.

First, we install the base disk image as in section [#howto-describe], usually using vendor-supplied network installation tools. We discuss this process more in [bootstrap]. We might name this initial image 'Block00'. Then we use the process we mentioned in [#examples-isconf-v2] to apply changes to the machine over the course of its life. Each change we add updates our concept of what is the 'baseline' for that class of host.

As we add changes, any new machine we build will need to run isconf longer on first boot, to add all of the accumulated changes to the Block00 image. After about forty minutes' worth of changes have built up on top of the initial image, it helps to be able to build one more host that way, set the hostname/IP to 'baseline', cut a disk image of it, and declare that new image to be the new baseline. This infrequent snapshot or checkpoint not only reduces the build time of future hosts, but reduces the rebuild time and chance of error in rebuilding existing hosts -- we always start new builds from the latest baseline image.

In an ISconf makefile, this whole process is reflected as in (listing 2). Note that whether we cut a new image and start the next install from that, or just pull an old machine off the shelf with a Block00 image and plug it in, we'll still end up with a Block20 image with apache and a 2.2.12 kernel, due to the way the makefile prerequisites are chained.

This example shows a simple, linear build of successive identical hosts with no "branching" for different host classes. Classes add slightly more complexity to the makefile. They require a top-level configuration file to define the classes and target them to the right hosts, and they require wrapper script code to read the config file.

A little more complexity is needed to distinguish actions that should only happen at boot from those that can also happen when cron runs the code every hour or so. There are examples of all of this in the isconf-2i package available from ISconf.Org.

Listing 2: Baseline Management in an ISconf Makefile


  # 01 Feb 97 - Block00 is initial disk install from vendor cd,
  # with ntp etc. added later
  Block00: ntp cvs lynx ...

  # 15 Jul 98 - got tired of waiting for additions to Block00 to build,
  # cut new baseline image, later add ssh etc.
  Block10: Block00 ssh ...

  # 17 Jan 99 - new baseline again, later add apache, rebuild kernel, etc.
  Block20: Block10 apache kernel-2.2.12 ...
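The chaining in listing 2 can be modeled as a depth-first, left-to-right walk of prerequisites, which is how 'make' processes them when run serially. The following Python sketch uses only the targets named in listing 2 and leaves out the elided ones. The helper names `expand` and `remaining`, and the idea of representing a disk image by the list of targets already stamped on it, are assumptions for illustration.

```python
RULES = {
    "Block00": ["ntp", "cvs", "lynx"],
    "Block10": ["Block00", "ssh"],
    "Block20": ["Block10", "apache", "kernel-2.2.12"],
}


def expand(target, rules, order=None):
    """Return targets in the order a serial 'make' would complete them.

    No cycle detection is attempted in this sketch.
    """
    if order is None:
        order = []
    if target in order:
        return order
    for prereq in rules.get(target, []):
        expand(prereq, rules, order)
    order.append(target)
    return order


def remaining(target, rules, already_done):
    """Targets still to run on a host whose image already carries some stamps."""
    return [t for t in expand(target, rules) if t not in already_done]
```

A host built from the vendor image starts with no stamps and runs every target. A host built from the image cut after Block00 completed starts with those stamps and runs only the later targets. In both cases the stamps already on the image followed by the remaining targets give the same full sequence.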

Cfengine is likely the most popular purpose-built tool for automated systems administration today. The cfengine language was optimized for dynamic prerequisite analysis rather than long, deterministic ordered sets.

While the cfengine language wasn't specifically optimized for ordered behavior, it is possible to achieve this with extra work. It should be possible to greatly reduce the amount of effort involved, by using some tool to generate cfengine configuration files from makefile-like (or equivalent) input files. One good starting point might be Tobias Oetiker's TemplateTree II [oetiker].

Automatic generation of cfengine configuration files appears to be a near-requirement if the tool is to be used to maintain congruent infrastructures; the class and action-type structures tend to get relatively complex rather fast if congruent ordering, rather than convergence, is the goal.

Other gains might be made from other features of cfengine; we have made progress experimenting with various helper modules, for instance. Another technique that we have put to good use is to implement atomic changes using very small cfengine scripts, each equivalent to an ISconf makefile stanza. These scripts we then drive within a deterministically ordered framework.

In the cfengine version 2 language there are new features, such as the FileExists() evaluated class function, which may reduce the amount of code. So far, based on our experience over the last few years in trial attempts, it appears that a cfengine configuration file that does the same job as an ISconf makefile would still need two to three times the number of lines of code. We consider this an open and evolving effort though -- check the cfengine.org and Infrastructures.Org websites for the latest information.

If it should turn out that the basic logics of a machine designed for the numerical solution of differential equations coincide with the logics of a machine intended to make bills for a department store, I would regard this as the most amazing coincidence that I have ever encountered. -- Howard Aiken, founder of Harvard's Computer Science department and architect of the IBM/Harvard Mark I.

Turing equivalence in host management appears to be a new factor relative to the age of the computing industry. The downsizing of mainframe installations and distribution of their tasks to midrange and desktop machines by the early 1990s exposed administrative challenges which have taken the better part of a decade for the systems administration community to understand, let alone deal with effectively.

Older computing machinery relied more on dedicated hardware than on software to perform many administrative tasks. Operating systems were limited in their ability to accept changes on the fly, often requiring recompilation for tasks as simple as adding terminals or changing the time zone. Until recently, the most popular consumer desktop operating system still required a reboot when changing IP address.

In the interests of higher uptime, modern versions of UNIX and Linux have eliminated most of these issues; there is very little software or configuration management that cannot be done with the machine "live". We have evolved to a model that is nearly equivalent to that of a Universal Turing Machine, with all of its benefits and pitfalls. To avoid this equivalence, we would need to go back to shutting operating systems down in order to administer them. Rather than go back, we should seek ways to go further forward; understanding Turing equivalence appears to be a good next step.

This situation may soon become more critical, with the emergence of "soft hardware". These systems use Field-Programmable Gate Arrays to emulate dedicated processor and peripheral hardware. Newer versions of these devices can be reprogrammed, while running, under control of the software hosted on the device itself [xilinx]. This will bring us the ability to modify, for instance, our own CPU, using high-level automated administration tools. Imagine not only accidentally unconfiguring your Ethernet interface, but deleting the circuitry itself...

We have synthesized a thought experiment to demonstrate some of the implications of Turing equivalence in host management, based on our observations over the course of several years. The description we provide here is not as rigorous as the underlying theories, and much of it should be considered as still subject to proof. We do not consider ourselves theorists; it was surprising to find ourselves in this territory. The theories cited here provided inspiration for the thought experiment, but the goal is practical management of UNIX and other machines. We welcome any and all future exploration, pro or con.

In the following description of this thought experiment, we will develop a model of system administration starting at the level of the Turing machine. We will show how a modern self-administered machine is equivalent to a Turing machine with several tapes, which is in turn equivalent to a single-tape Turing machine. We will construct a Turing machine which is able to update its own program by retrieving new instructions from a network-accessible tape. We will develop the idea of configuration management for this simpler machine model, and show how problems such as circular dependencies and uncertainty about behavior arise naturally from the nature of computation.

We will discuss how this Turing machine relates to a modern general-purpose computer running an automatic administration tool. We will introduce the implications of the self-modifying code which this arrangement allows, and the limitations of inspection and testing in understanding the behavior of this machine. We will discuss how ordering of changes affects this behavior, and how deterministically ordered changes can make its behavior more deterministic.

We will expand beyond single machines into the realm of distributed computing and management of multiple machines, and their associated inspection and testing costs. We will discuss how ordering of changes affects these costs, and how ordered change apparently provides the lowest cost for managing an enterprise infrastructure.

Readers who are interested in applied rather than mathematical or theoretical arguments may want to review [#howto] or skip to section [#conclusion].

A Turing machine [!turing.png] reads bits from an infinite tape, interprets them as data according to a hardwired program and rewrites portions of the tape based on what it finds. It continues this cycle until it reaches a completion state, at which time it halts [turing].

Because a Turing machine's program is hardwired, it is common practice to say that the program describes or is the machine. A Turing machine's program is stated in a descriptive language which we will call the machine language. Using this language, we describe the actions the machine should take when certain conditions are discovered. We will call each atom of description an instruction. An example instruction might say:

If the current machine state is 's3', and the tape cell at the machine's current head position contains the letter 'W', then change to state 's7', overwrite the 'W' with a 'P', and move the tape one cell to the right.

Each instruction is commonly represented as a quintuple; it contains the letter and current state to be matched, as well as the letter to be written, the tape movement command, and the new state. The instruction we described above would look like:

s3,W ⇒ s7,P,r

Note that a Turing machine's language is in no way algorithmic; the order of quintuples in a program listing is unimportant; there are no branching, conditional, or loop statements in a Turing machine program.

The content of a Turing tape is expressed in a language that we will call the input language. A Turing machine's program is said to either accept or reject a given input language, if it halts at all. If our Turing machine halts in an accept state, (which might actually be a state named 'accept') then we know that our program is able to process the data and produce a valid result -- we have validated our input against our machine. If our Turing machine halts because there is no instruction that matches the current combination of state and cell content [#thesis/turing/machinelang], then we know that our program is unable to process this input, so we reject. If we never halt, then we cannot state a result, so we cannot validate the input or the machine.
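The quintuple form and the accept/reject/never-halt outcomes described above can be made concrete with a small simulator. This is a minimal Python sketch under stated assumptions: the program is a lookup table keyed by (state, letter), 'r' is treated as moving the head one cell to the right and anything else as moving it left, `_` stands for a blank cell, and a step limit stands in for "we never halt", since we cannot decide that in general. The names `run`, `PROGRAM`, and `LOOP` are illustrative.

```python
BLANK = "_"

# The instruction 's3,W => s7,P,r' from the text, plus one hypothetical
# instruction that accepts when a blank cell is read in state s7.
PROGRAM = {
    ("s3", "W"): ("s7", "P", "r"),
    ("s7", BLANK): ("accept", BLANK, "r"),
}

# A hypothetical program that moves right over blanks forever.
LOOP = {("s0", BLANK): ("s0", BLANK, "r")}


def run(program, tape, state, max_steps=1000):
    """Return (verdict, tape_contents) for a single-tape machine.

    verdict is 'accept', 'reject' (no matching instruction), or
    'unknown' (step limit reached before halting).
    """
    cells = dict(enumerate(tape))
    head = 0
    steps = 0

    def render():
        return "".join(cells[i] for i in sorted(cells))

    while state != "accept":
        key = (state, cells.get(head, BLANK))
        if key not in program:
            return "reject", render()
        if steps == max_steps:
            return "unknown", render()
        state, letter, move = program[key]
        cells[head] = letter
        head += 1 if move == "r" else -1
        steps += 1
    return "accept", render()
```

Because the table is keyed by state and letter, the order in which quintuples are listed has no effect, which matches the observation that a Turing machine's language is not algorithmic. An 'unknown' verdict tells us nothing about the input or the machine: a larger step limit might or might not produce a result.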

A Universal Turing Machine (UTM) is able to emulate any arbitrary Turing machine. Think of this as running a Turing "virtual machine" (TVM) on top of a host UTM. A UTM's machine language program [#thesis/turing/machinelang] is made up of instructions which are able to read and execute the TVM's machine language instructions. The TVM's machine language instructions are the UTM's input data, written on the input tape of the UTM alongside the TVM's own input data [!utmtape.png].

Any multiple-tape Turing machine can be represented by a single-tape Turing machine, so it is equally valid to think of our Universal Turing Machine as having two tapes; one for the TVM program, and the other for TVM data.

A Universal Turing Machine appears to be a useful model for analyzing the theoretical behavior of a "real" general-purpose computer; basic computability theory seems to indicate that a UTM can solve any problem that a general-purpose computer can solve [church].

Further work by John von Neumann and others demonstrated one way that machines could be built which were equivalent in ability to Universal Turing Machines, with the exception of the infinite tape size [vonneumann]. The von Neumann architecture is considered to be a foundation of modern general purpose computers [godfrey].

As in von Neumann's "stored program" architecture, the TVM program and data are both stored as rewritable bits on the UTM tape [#thesis/utm] [!utmtape.png]. This arrangement allows the TVM to change the machine language instructions which describe the TVM itself. If it does so, our TVM enjoys the advantages (and the pitfalls) of self-modifying code [nordin].

There is no algorithm that a Turing machine can use to determine whether another specific Turing machine will halt for a given tape; this is known as the "halting problem". In other words, Turing machines can contain constructions which are difficult to validate. This is not to say that every machine contains such constructions, but that an arbitrary machine and tape chosen at random has some chance of containing one.

Note that, since a Turing machine is an imaginary construct [turing], our own brain, a pencil, and a piece of paper are (theoretically) sufficient to work through the tape, producing a result if there is one. In other words, we can inspect the code and determine what it would do. There may be tools and algorithms we can use to assist us in this [laitenberger]. We are not guaranteed to reach a result though -- in order for us to know that we have a valid machine and valid input, we must halt and reach an accept state. Inspection is generally considered to be a form of testing.

Inspection has a cost (which we will use later):

C_inspect

This cost includes the manual labor required to inspect the code, any machine time required for execution of inspection tools, and the manual labor to examine the tool results.

There is no software testing algorithm that is guaranteed to ensure fully reliable program operation across all inputs -- there appears to be no theoretical foundation for one [hamlet]. We suspect that some of the reasons for this may be related to the halting problem [#thesis/turing/halting], Gödel's incompleteness theorem [godel], and some classes of computational intractability problems, such as the Traveling Salesman problem and NP-completeness [greenlaw] [garey] [brookshear] [dewdney].

In practice, we can use multiple test runs to explore the input domain via a parameter study, equivalence partitioning [richardson], cyclomatic complexity analysis [mccabe], pseudo-random input, or other means. Using any or all of these methods, we may be able to build a confidence level for predictability of a given program. Note that we can never know when testing is complete, and that testing only proves incorrectness of a program, not correctness.

Testing cost includes the manual labor required to design the test, any machine time required for execution, and the manual labor needed to examine the test results:

C_test

For software testing to be meaningful, we must also ensure code coverage. Code coverage requirements are generally determined through some form of inspection [#thesis/turing/inspection], with or without the aid of tools. Coverage information is only valid for a fixed program -- even relatively minor code changes can affect code coverage information in unpredictable ways [elbaum]. We must repeat testing [#thesis/test] for every variation of program code.

To ensure code coverage, testing includes the manual labor required to inspect the code, any machine time required for execution of the coverage tools and tests, and the manual labor needed to examine the test results. Because testing for coverage includes code inspection, we know that testing is more expensive than inspection alone:

C_test > C_inspect

Once we have found a UTM tape that produces the result we desire, we can make many copies of that tape, and run them through many identical Universal Turing Machines simultaneously. This will produce many simultaneous, identical results. This is not very interesting -- what we really want to be able to do is hold the TVM program portion of the tape constant while changing the TVM data portion, then feed those differing tapes through identical machines. The latter arrangement can give us a form of distributed or parallel computing.

Altering the tapes [#thesis/turing/replicate] presents a problem though. We cannot in advance know whether these altered tapes will provide valid results, or even reach completion. We can exhaustively test the same program with a wide variety of sample inputs, validating each of these. This is fundamentally a time-consuming, pseudo-statistical process, due to the iterative validations normally required. And it is not a complete solution [#thesis/test].

If we for some reason needed to solve slightly different problems with the distributed machines in [#thesis/turing/replicate], we may decide to use slightly different programs in each machine, rather than add functionality to our original program. But using these unique programs would greatly worsen our testing problem. We would not only need to validate across our range of input data [#thesis/test], but we would also need to repeat the process for each program variant [#thesis/coverage]. We know that testing many unique programs will be more expensive than testing one:

C_many > C_test

It is easy to imagine a Turing Machine that is connected to a network, and which is able to use the net to fetch data from tapes stored remotely, under program control. This is simply a case of a multiple-tape Turing machine, with one or more of the tapes at the other end of a network connection.

Building on [#thesis/utm/net], imagine a Turing Virtual Machine (TVM) running on top of a networked Universal Turing Machine (UTM) [#thesis/utm]. In this case, we might have 3 tapes; one for the TVM program, one for the TVM data, and a third for the remote network tape. It is easy to imagine a sequence of TVM operations which involve fetching a small amount of data from the remote tape, and storing it on the local program tape as additional and/or replacement TVM instructions [#thesis/utm/selfmod]. We will name the old TVM instruction set A. The set of fetched instructions we will name B, and the resulting merger of the two we will name AB. Note that some of the instructions in B may have replaced some of those in A [!ab.png]. Before the fetch, our TVM could be described [#thesis/turing/machinelang] as an A machine, after the fetch we have an AB machine -- the TVM's basic functionality has changed. It is no longer the same machine.

Note that, if any of the instructions in set B replace any of those in set A, [#thesis/utm/net/fetch], then the order of loading these sets is important. A TVM with the instruction set AB will be a different machine than one with set BA [!ba.png].
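The effect of load order can be shown with instruction sets represented as lookup tables, where a later load replaces any earlier instruction with the same (state, letter) key. This is a minimal Python sketch; the contents of `A` and `B` and the helper name `load` are hypothetical and chosen only so that one instruction overlaps.

```python
# Hypothetical instruction sets; both define an instruction for ("s1", "0").
A = {
    ("s0", "0"): ("s1", "1", "r"),
    ("s1", "0"): ("accept", "0", "r"),
}
B = {
    ("s1", "0"): ("s0", "0", "r"),
    ("s2", "1"): ("accept", "1", "r"),
}


def load(*instruction_sets):
    """Merge instruction sets in the order given; later sets win on overlap."""
    program = {}
    for instructions in instruction_sets:
        program.update(instructions)
    return program


AB = load(A, B)  # the overlapping instruction comes from B
BA = load(B, A)  # the overlapping instruction comes from A
```

Here `AB` and `BA` contain the same keys but differ on `("s1", "0")`, so they describe different machines. If the two sets had no keys in common, either order would give the same table, which is why the overlap condition matters.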

It is easy to imagine that the TVM in [#thesis/utm/net/fetch] could later execute an instruction from set B, which could in turn cause the machine to fetch another set of one or more instructions in a set we will call C, resulting in an ABC machine.

After each fetch described in section [#thesis/utm/net/refetch], the local program and data tapes will contain bits from (at least) three sources: the new instruction set just copied over the net, any old instructions still on tape, and the data still on tape from ongoing execution of all previous instructions.

The choice of next instruction to be fetched from the remote tape in section [#thesis/utm/net/refetch] can be calculated by the currently available instructions on the local program tape, based on current tape content [#thesis/utm/net/content].

The behavior of one or more new instructions fetched in [#thesis/utm/net/refetch] can (and usually will) be influenced by other content on the local tapes [#thesis/utm/net/content]. With careful inspection and testing we can detect some of the ways content will affect instruction fetches, but due to the indeterminate results of software testing [#thesis/test], we may never know if we found all of them.

Let us go back to our three TVM instruction sets, A, B, and C [#thesis/utm/net/refetch]. These were loaded over the net and executed using the procedure described in [#thesis/utm/net/nextcall]. Assume we start with blank local program and data tapes. Assume our UTM is hardwired to fetch set A if the local program tape is found to be blank. If we then run the TVM, A can collect data over the net and begin processing it. At some point later, A can cause set B to be loaded. Our local tapes will now contain the TVM data resulting from execution of A, and the new TVM machine instructions AB. If the TVM later loads C, our program tape will contain ABC.

If the networked UTM machine constructed in [#thesis/utm/g] always starts with the same (blank) local tape content, and the remote tape content does not change, then we can demonstrate that an A TVM will always evolve to an AB, then an ABC machine, before halting and producing a result.

Assuming the network-resident data never changes, we can rebuild our networked UTM at any time and restore it to any prior state by clearing the local tapes, resetting the machine state, and restarting execution with the load of A [#thesis/utm/g]. The machine will execute and produce the same intermediate and final results as it did before, as in section [#thesis/utm/g/order].

If the network-resident data does change, though, we may not be able to rebuild to an identical state. For example, if someone were to alter the network-resident master copy of the B instruction set after we last fetched it, then it may no longer produce the same intermediate results and may no longer fetch C [#thesis/utm/net/nextcall]. We might instead halt at AB.
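The rebuild argument can be traced with a toy model. In the Python sketch below, the remote tape is a dictionary mapping a set name to its instructions and to the name of the set it causes to be fetched next. This is a deliberate simplification of the text, where the next fetch is computed by the running program from tape content. The names `rebuild`, `REMOTE`, and `ALTERED`, and the instruction labels, are hypothetical.

```python
REMOTE = {
    "A": ({"i1": "a1", "i2": "a2"}, "B"),
    "B": ({"i2": "b2"}, "C"),
    "C": ({"i3": "c3"}, None),
}

# The same remote tape after someone edits the master copy of B,
# so that it no longer leads to a fetch of C.
ALTERED = dict(REMOTE, B=({"i2": "b2-changed"}, None))


def rebuild(remote, first="A", max_fetches=10):
    """Start from blank local tapes and replay fetches from the remote tape.

    Returns (load_order, local_program).
    """
    order, program = [], {}
    name = first
    while name is not None and len(order) < max_fetches:
        instructions, next_name = remote[name]
        program.update(instructions)  # later sets overlay earlier ones
        order.append(name)
        name = next_name
    return order, program
```

Calling `rebuild(REMOTE)` any number of times gives the same load order A, B, C and the same local program. Calling `rebuild(ALTERED)` stops at AB with different content, which is the divergence the paragraph above describes.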

Without careful (and possibly intractable) inspection [#thesis/turing/inspection], we cannot prove in advance whether a BCA or CAB machine can produce the same result as an ABC machine. It is possible that these, or other, variations might yield the same result. We can validate the result for a given input [#thesis/turing/validate]. We would also need to do iterative testing [#thesis/turing/testing] to demonstrate that multiple inputs would produce the same result. Our cost of testing multiple or partially ordered sequences is greater than that required to test a single sequence:

C_partial > C_test

If the behavior of any instruction from B in [#thesis/utm/g/order] is in any way dependent on other content found on tape [#thesis/utm/net/content] [#thesis/utm/net/nextcall] [#thesis/utm/net/nextvars], then we can expect our TVM to behave differently if we load B before loading A [#thesis/utm/i/unique]. We cannot be certain that a UTM loaded with only a B instruction set will accept the input language, or even halt, until after we validate it [#thesis/turing/validate].

We might want to roll back from the load or execution of a new instruction set. In order to do this, we would need to return the local program and data tape to a previous content. For example, if machine A executes and loads B, our instruction set will now be AB. We might roll back by replacing our tape with the A copy.

Due to [#thesis/utm/g/dependence], it is not safe to try to roll back the instruction set of machine AB to recreate machine A by simply removing the B instructions. Some of B may have replaced A. The AB machine, while executing, may have even loaded C already [#thesis/utm/g], in which case you won't end up with A, but with AC. If the AB machine executed for any period of time, it is likely that the input data language now on the data tape is only acceptable to an AB machine -- an A machine might reject it or fail to halt [#thesis/turing/validate]. The only safe rollback method seems to be something similar to [#thesis/utm/g/rollback].

It is easy to imagine an automatic process which conducts a rollback. For example, in [#thesis/utm/g/rollback], machine AB itself might have the ability to clear its own tapes, reset the machine state, and restart execution at the beginning of A, as in section [#thesis/utm/rebuild].

But the system described in [#thesis/utm/g/rollback/auto] will loop infinitely. Each time A executes, it will load B, then AB will execute and reset the local tapes again. In practice, a human might detect and break this loop; to represent this interaction, we would need to add a fourth tape, representing the user detection and input data.

It is easy to imagine an automatic process which emulates a rollback while avoiding loops, without requiring the user input tape in [#thesis/utm/g/rollback/loop]. For example, instruction set C might contain the instructions from A that B overlaid. In other words, installing C will "rollback" B. Note that this is not a true rollback; we never return to a tape state that is completely identical to any previous state. Although this is an imperfect solution, it is the best we seem to be able to do without human intervention.
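The emulated rollback can be illustrated with instruction sets as dictionaries. In this minimal Python sketch, C is built from the instructions of A that B overlaid, and is then loaded after B. The instruction labels and the helper name `overlaid_by` are hypothetical.

```python
A = {"i1": "a1", "i2": "a2"}
B = {"i2": "b2", "i3": "b3"}


def overlaid_by(old, new):
    """Instructions from 'old' that loading 'new' would replace."""
    return {key: value for key, value in old.items()
            if key in new and new[key] != value}


# C restores what B overlaid; it is a new change, not an undo.
C = overlaid_by(A, B)

# Loading A, then B, then C: later sets win on overlap.
ABC = {**A, **B, **C}
```

After loading C, the overlaid instruction `i2` is back to its A value, but `i3` from B is still present, so `ABC` is not equal to `A`. The change history has grown from A, B to A, B, C rather than shrinking, which is the sense in which this is not a true rollback.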

The loop in section [#thesis/utm/g/rollback/loop] will cause our UTM to never reach completion -- we will not halt, and cannot validate a result [#thesis/turing/validate]. A method such as [#thesis/utm/g/rollback/emulate] can prevent a rollback-induced loop, but is not a true rollback -- we never return to an earlier tape content. If these, or similar, methods are the only ones available to us, it appears that program-controlled tape changes must be monotonic -- we cannot go back to a previous tape content under program control, otherwise we loop.

You are in a maze of twisty little passages, all alike. -- Will Crowther's "Adventure"

Let us now look at a conventional application program, running as an ordinary user on a correctly configured UNIX host. This program can be loaded from disk into memory and executed. At no time is the program able to modify the "master" copy of itself on disk. An application program typically executes until it has output its results, at which time it either sleeps or halts. This application is equivalent to a fixed-program Turing machine [#thesis/turing] in the following ways: Both can be validated for a given input [#thesis/turing/validate] to prove that they will produce results in a finite time and that those results are correct. Both can be tested over a range of inputs [#thesis/test] to build confidence in their reliability. Neither can modify its own executable instructions; in the UNIX machine they are protected by filesystem permissions; in the Turing machine they are hardwired. (We stipulate that there are some ways in which [#thesis/conventional] and [#thesis/turing] are not equivalent -- a Turing machine has a theoretically infinite tape, for instance.)

We can say that the application program in [#thesis/conventional] is running on top of an application virtual machine (AVM). If the application is written in Java, for example, the AVM consists of the Java Virtual Machine. In Perl, the AVM is the Perl bytecode VM. For C programs, the AVM is the kernel system call interface. Low-level code in shared libraries used by a C program uses the same syscall interface to interact with the hardware -- shared libraries are part of the C AVM. A Perl program can load modules -- these become part of the program's AVM. A C or Perl program that calls system() or exec() relies on any executables called -- these other executables, then, are part of the C or Perl program's AVM. Any executables called via exec() or system() in turn may require other executables, shared libraries, or other facilities. Many, if not most, of these components are dependent on one or more configuration files. These components all form an AVM chain of dependency for any given application. Regardless of the size or shape of this chain, all application programs on a UNIX machine ultimately interact with the hardware and the outside world via the kernel syscall interface.

When we perform system administration actions as root on a running UNIX machine, we can use tools found on the local disk to cause the machine to change portions of that same disk. Those changes can include executables, configuration files, and the kernel itself. Changes can include the system administration tools themselves, and changed components and configuration files can influence the fundamental behavior and viability of those same executables in unforeseen ways, as in section [#thesis/coverage], as applied to changes in the AVM chain [#thesis/avm].

A self-administered UNIX host runs an automatic systems administration tool (ASAT) periodically and/or at boot. The ASAT is an application program [#thesis/conventional], but it runs as root rather than an ordinary user. While executing, the ASAT is able to modify the "master" copy of itself on disk, as well as the kernel, shared libraries, filesystem layout, or any other portion of disk, as in section [#thesis/sysadmin].

The ASAT described in section [#thesis/selfadmin] is equivalent to a Turing Virtual Machine [#thesis/utm] in the ways described in section [#thesis/conventional]. In addition, a self-administered host running an ASAT is similar to a Universal Turing Machine in that the ASAT can modify its own program code [#thesis/utm/selfmod].

A self-administered UNIX host connected to a network is equivalent to a network-connected Universal Turing Machine [#thesis/utm/net] in the following ways: The host's ASAT [#thesis/selfadmin] can fetch and execute an arbitrary new program as in section [#thesis/utm/net/fetch]. The fetched program can fetch and execute another as in [#thesis/utm/net/refetch]. Intermediate results can control which program is fetched next, as in [#thesis/utm/net/nextcall]. The behavior of each fetched program can be influenced by the results of previous programs.

When we do administration via automated means [#thesis/selfadmin], we rely on the executable portions of disk, controlled by their configuration files, to rewrite those same executables and configuration files [#thesis/sysadmin]. Like the Universal Turing Machine in [#thesis/utm/g/monotonic], changes made under program control must be assumed to be monotonic: they are non-reversible short of "resetting the tape state" by reformatting the disk.

An ASAT [#thesis/selfadmin] runs in the context of the host kernel and configuration files, and depends either directly or indirectly on other executables and shared libraries on the host's disk [#thesis/utm/g/dependence].

The circular dependency of the ASAT AVM dependency tree [#thesis/avm] forces us to assume that, even though we may not ever change the ASAT code itself, we can unintentionally change its behavior if we change other components of the operating system. This is similar to the indeterminacy described in [#thesis/utm/net/nextvars].

It is not enough for an ASAT designer to statically link the ASAT binary and carefully design it for minimum dependencies. Other executables, their shared libraries, scripts, and configuration files might be required by ASAT configuration files written by a system administrator -- the tool's end user.

When designing tools we cannot know whether the system administrator is aware of the AVM dependency tree (we certainly can't expect them to have read this paper). We must assume that there will be circular dependencies, and we must assume that the tool designer will never know what these dependencies are. The tool must support some means of dealing with them by default. We've found over the last several years that a default paradigm of deterministic ordering will do this.

We cannot always keep all hosts identical; a more practical method, for instance, is to set up classes of machines, such as "workstation" and "mail server", and keep the code within a class identical. This reduces the amount of coverage testing required [#thesis/coverage]. This testing is similar to that described in section [#thesis/turing/replicate/unique].

The question of whether a particular piece of software is of sufficient quality for the job remains intractable [#thesis/test].

But in practice, in a mission-critical environment, we still want to try to find most defects before our users do. The only accurate way to do this is to duplicate both program and input data, and validate the combination [#thesis/turing/validate]. In order for this validation to be useful, the input data would need to be an exact copy of real-world, production data, as would the program code. Since we want to be able to not only validate known real-world inputs but also test some possible future inputs [#thesis/test], we expect to modify and disrupt the data itself.

We cannot do this in production. Application developers and QA engineers tend to use test environments to do this work. It appears to us that systems administrators should have the same sort of test facilities available for testing infrastructure changes, and should make good use of them.

Because the ASAT [#thesis/selfadmin] is itself a complex, critical application program, it needs to be tested using the procedure in [#thesis/selfadmin/testing/intractable]. Because the ASAT can affect the operation of the UNIX kernel and all subsidiary processes, this testing usually will conflict with ordinary application testing. Because the ASAT needs to be tested against every class of host [#thesis/selfadmin/testing/classes] to be used in production, this usually requires a different mix of hosts than that required for testing an ordinary application.

The considerations in section [#thesis/selfadmin/testing/asat] dictate a need for an infrastructure test environment for testing automated systems administration tools and techniques. This environment needs to be separate from production, and needs to be as identical as possible in terms of user data and host class mix.

Changes made to hosts in the test environment [#thesis/selfadmin/testenv], once tested [#thesis/turing/testing], need to be transferred to their production counterpart hosts. When doing so, the ordering precautions in section [#thesis/utm/g/dependence] need to be observed. Over the last several years, we have found that if you observe these precautions, then you will see the benefits of repeatable results as shown in [#thesis/utm/g/order]. In other words, if you always make the same changes first in test, then production, and you always make those changes in the same order on each host, then changes that worked in test will work in production.
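The replication rule above can be illustrated with a minimal sketch. It assumes a host's disk can be modeled as a dictionary of path to content, and that changes live in a single ordered journal; the names Host, replay, install_pkg, and configure are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Disk = Dict[str, str]                        # hypothetical model: path -> content
Change = Tuple[str, Callable[[Disk], None]]  # (change name, action on the disk)


@dataclass
class Host:
    name: str
    disk: Disk = field(default_factory=dict)
    applied: List[str] = field(default_factory=list)  # change names, in applied order


def replay(host: Host, journal: List[Change]) -> None:
    """Apply, in journal order, every change the host has not yet applied."""
    names = [name for name, _ in journal]
    # The host's history must be a prefix of the journal; otherwise it has
    # received changes in some other order and can no longer be trusted.
    if host.applied != names[:len(host.applied)]:
        raise RuntimeError(f"{host.name} has diverged from the journal")
    for name, action in journal[len(host.applied):]:
        action(host.disk)
        host.applied.append(name)


def install_pkg(disk: Disk) -> None:
    disk["/usr/bin/tool"] = "v1"


def configure(disk: Disk) -> None:
    # Depends on the result of install_pkg, so it must run second.
    disk["/etc/tool.conf"] = "for " + disk["/usr/bin/tool"]


journal: List[Change] = [("install_pkg", install_pkg), ("configure", configure)]

test_host = Host("test01")
prod_host = Host("prod01")
replay(test_host, journal)   # validate in test first
replay(prod_host, journal)   # then replicate to production in the same order
```

Because both hosts start from the same state and receive the same changes in the same order, their modeled disks end up identical. Calling replay a second time with an unchanged journal applies nothing.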

Because an ASAT [#thesis/selfadmin] installed on many machines must be able to be updated without manual intervention, it is our standard practice to always have the tool update itself as well as its own configuration files and scripts. This allows the entire system state to progress through deterministic and repeatable phases, with the tool, its configuration files, and other possibly dependent components kept in sync with each other.

By having the ASAT update itself, we know that we are purposely adding another circular dependency beyond that mentioned in section [#thesis/selfadmin/unintended]. This adds to the urgency of the need for ordering constraints such as [#thesis/selfadmin/order].

We suspect control loop theory applies here; this circular dependency creates a potential feedback loop. We need to "break the loop" and prevent runaway behavior such as oscillation (replacing the same file over and over) or loop lockup (breaking the tool so that it cannot do anything anymore). Deterministically ordered changes seem to do the trick, acting as an effective damper.

We stipulate that this is not standard practice for all ASAT users. But all tools must be updated at some point; there are always new features or bug fixes which need to be addressed. If the tool cannot support a clean and predictable update of its own code, then these very critical updates must be done "out of band". This defeats the purpose of using an ASAT, and ruins any chance of reproducible change in an enterprise infrastructure.

Due to [#thesis/selfadmin/order], if we allow the order of changes to be A, B, C on some hosts, and A, C, B on others, then we must test both versions of the resulting hosts [#thesis/turing/replicate/unique]. We have inadvertently created two host classes [#thesis/selfadmin/testing/classes]; due to the risk of unforeseen interactions we must also test both versions of hosts for all future changes as well, regardless of ordering of those future changes. The hosts have diverged [#methods/divergence].
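As a small illustration of how two orderings of the same changes can yield different hosts, consider the following sketch. The disk is again modeled as a dictionary, and the three changes and their file paths are invented for the example.

```python
from typing import Callable, Dict, List

Disk = Dict[str, str]  # hypothetical model: path -> content


def change_a(disk: Disk) -> None:
    # A: write an initial configuration file.
    disk["/etc/app.conf"] = "port=80"


def change_b(disk: Disk) -> None:
    # B: append a setting to whatever configuration exists.
    disk["/etc/app.conf"] = disk.get("/etc/app.conf", "") + "\nssl=on"


def change_c(disk: Disk) -> None:
    # C: snapshot the current configuration into a backup file.
    disk["/etc/app.conf.bak"] = disk.get("/etc/app.conf", "")


def apply_in_order(changes: List[Callable[[Disk], None]]) -> Disk:
    """Start from an empty disk and apply the changes in the given order."""
    disk: Disk = {}
    for change in changes:
        change(disk)
    return disk


host_abc = apply_in_order([change_a, change_b, change_c])
host_acb = apply_in_order([change_a, change_c, change_b])

# Same three changes, different order, different resulting disk content.
diverged = host_abc != host_acb
```

Here the backup file written by C captures the configuration either before or after B has run, so the two hosts no longer have the same content even though each received every change.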

It is tempting to ask "Why don't we just test changes in production, and roll back if they don't work?" This does not work unless you are able to take the time to restore from tape, as in section [#thesis/utm/g/rollback]. There's also the user data to consider -- if a change has been applied to a production machine, and the machine has run for any length of time, then the data may no longer be compatible with the earlier version of code [#thesis/utm/g/rollback/dependence]. When using an ASAT in particular, it appears that changes should be assumed to be monotonic [#thesis/selfadmin/monotonic].

It appears that editing, removing, or otherwise altering the master description of prior changes [#thesis/utm/rebuild/change] is harmful if those changes have already been deployed to production machines. Editing previously-deployed changes is one cause of divergence [#methods/divergence]. A better method is to always "roll forward" by adding new corrective changes, as in section [#thesis/utm/g/rollback/emulate].

It is extremely tempting to try to create a declarative or descriptive language L that is able to overcome the ordering restrictions in [#thesis/selfadmin/order] and [#thesis/selfadmin/prior]. The appeal of this is obvious: "Here are the results I want, go make it so."

A tool that supports this language would work by sampling subsets of disk content, similar to the way our Turing machine samples individual tape cells [#thesis/turing]. The tool would read some instruction set P, which was written in L by the sysadmin. While sampling disk content, the tool would keep track of some internal state S, similar to our Turing machine's state [#thesis/turing/machinelang]. Upon discovering a state and disk sample that matched one of the instructions in P, the tool could then change state, rewrite some part of the disk, and look at some other part of the disk for something else to do. Assuming a constant instruction set P, and a fixed virtual machine in which to interpret P, this would provide repeatable, validatable results [#thesis/turing/validate].
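The hypothetical tool described above can be sketched as a small rule interpreter. This is only an illustration under the stated assumptions of a constant instruction set P and a fixed interpreter; the Rule type, the run function, the state names, and the file paths are all invented here, and the disk is modeled as a dictionary.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    write: str        # content to write at the sampled path
    next_path: str    # which part of the disk to sample next
    next_state: str   # new internal state S


# Instruction set P: (state, sampled path, observed content) -> Rule
Program = Dict[Tuple[str, str, Optional[str]], Rule]


def run(program: Program, disk: Dict[str, str], state: str, path: str,
        max_steps: int = 100) -> str:
    """Sample the disk, match an instruction, rewrite, and move on.

    Returns the final state when no instruction matches (the tool halts).
    """
    for _ in range(max_steps):
        rule = program.get((state, path, disk.get(path)))
        if rule is None:
            return state
        disk[path] = rule.write
        path, state = rule.next_path, rule.next_state
    raise RuntimeError("step limit reached without halting")


P: Program = {
    ("start", "/etc/motd", None): Rule("welcome", "/etc/issue", "motd_done"),
    ("motd_done", "/etc/issue", None): Rule("authorized use only", "/etc/issue", "done"),
}

disk: Dict[str, str] = {}
final_state = run(P, disk, "start", "/etc/motd")
```

The result is repeatable only because run and P are held constant in this sketch; nothing in the model lets an instruction alter the interpreter itself.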

Since the tool in section [#thesis/selfadmin/lang] is an ASAT [#thesis/selfadmin], influenced by the AVM dependency tree [#thesis/avm], it is equivalent to a Turing Virtual Machine as in [#thesis/selfadmin/equiv]. This means that it is subject to the ordering constraints of [#thesis/selfadmin/order]. If the host is networked, then the behavior shown in [#thesis/utm/net/fetch] through [#thesis/utm/net/nextvars] will be evident.

Due to [#thesis/selfadmin/lang/equiv], there appears to be no language, declarative or imperative, that is able to fully describe the desired content of the root-owned, managed portions of a disk while neglecting ordering and history. This is not a language problem: The behavior of the language interpreter or AVM [#thesis/avm] itself is subject to current disk content in unforeseen ways [#thesis/sysadmin].

We stipulate that disk content can be completely described in any language by simply stating the complete contents of the disk. This is still a case of ordering, a case in which there is only one change to be made. Cloning, discussed in section [#predict], is an applied example of this case. This class of change seems to be free of the circular dependencies of an AVM; the new disk image is usually applied when running from an NFS or ramdisk root partition, not while modifying a live machine.

A tool constructed as in section [#thesis/selfadmin/lang] is useful for a very well-defined purpose: when hosts have diverged [#thesis/order/future] beyond any ability to keep track of what changes have already been made. At this point, you have two choices: rebuild the hosts from scratch, using a tool that tracks lifetime ordering, or use a convergence tool to gain some control over them.

It is tempting to ask "Does every change really need to be strictly sequenced? Aren't some changes orthogonal?" By orthogonal we mean that the subsystems affected by the changes are fully independent, non-overlapping, cause no conflict, and have no interaction with each other, and therefore are not subject to ordering concerns.

While it is true that some changes will always be orthogonal, we cannot easily prove orthogonality in advance. It might appear that some changes are "obviously unrelated" and therefore not subject to sequencing issues. The problem is, who decides? We stipulate that talent and experience are useful here, for good reason: it turns out that orthogonality decisions are subject to the same pitfalls as software testing.

For example, inspection [#thesis/turing/inspection] and testing [#thesis/test] can help detect changes which are not orthogonal. Code coverage information [#thesis/coverage] can be used to ensure the validity of the testing itself. But in the end, none of these provide assurance that any two changes are orthogonal, and like other testing, we cannot know when we have tested or inspected for orthogonality enough.

Due to this lack of assurance, the cost of predicting orthogonality needs to accrue the potential cost of any errors that result from a faulty prediction. This error cost includes lost revenue, labor required for recovery, and loss of goodwill. We may be able to reduce this error cost, but it cannot be zero -- a zero cost implies that we never make mistakes when analyzing orthogonality. Because the cost of prediction includes this error cost as well as the cost of testing, we know that prediction of orthogonality is more expensive than either the testing or error cost alone:

C_predict > C_error
C_predict > C_test

As a crude negative proof, let us take a look at what would happen if we were to allow the order of changes to be totally unsequenced on a production host. First, if we were to do this, it is apparent that some sequences would not work at all, and probably damage the host [#thesis/utm/g/dependence]. We would need to have a way of preventing them from executing, probably by using some sort of exclusion list. In order to discover the full list of bad sequences, we would need to test and/or inspect each possible sequence.

This is an intractable problem: the number of possible orderings of M changes is M!. If each build/test cycle takes an hour, then any number of changes beyond 7 or 8 becomes impractical -- testing all orderings of 8 changes would require 4.6 years. In practice, we see change sets much larger than this; the ISconf version 2i makefile for building HACMP clusters, for instance, has sequences as long as 121 operations -- that's 121!/24/365, or 9.24*10^196 years. It is easier to avoid unsequenced changes.
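The arithmetic behind these figures can be checked with a few lines of Python. The sketch assumes, as the text does, one hour per build/test cycle and a 365-day year; the function name is ours.

```python
import math

HOURS_PER_YEAR = 24 * 365


def years_to_test_all_orderings(m: int, hours_per_cycle: int = 1) -> float:
    """Years needed to build and test every one of the M! orderings of M changes."""
    return math.factorial(m) * hours_per_cycle / HOURS_PER_YEAR


eight = years_to_test_all_orderings(8)      # 40320 hours, roughly 4.6 years
hacmp = years_to_test_all_orderings(121)    # on the order of 10^196 years

print(f"8 changes:   {eight:.1f} years")
print(f"121 changes: {hacmp:.2e} years")
```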

The cost of testing and inspection required to enable randomized sequencing appears to be greater than the cost of testing a subset of all sequences [#thesis/utm/g/disorder], and greater than the testing, inspection, and accrued error of predicting orthogonality [#thesis/whodecides]:

C_random > C_predict > C_partial

As a self-administering machine changes its disk contents, it may change its ability to change its disk contents. A change directive that works now may not work in the same way on the same machine in the future and vice versa [#thesis/utm/g/dependence]. There appears to be a need to constrain the order of change directives in order to obtain predictable behavior.

In contrast to [#thesis/selfadmin/lang/no], a language that supports execution of an ordered set of changes appears to satisfy [#thesis/selfadmin/dependence], and appears to have the ability to fully describe any arbitrary disk content, as in [#howto-describe].

In practice, sysadmins tend to make changes to UNIX hosts as they discover the need for them, in response to a user request, security concern, or bug fix. If the goal is minimum work for maximum reliability, then it would appear that the "ideal" sequence is the one which is first known to work -- the sequence in which the changes were created and tested. This sequence carries the least testing cost. It carries a lower risk than a sequence which has been partially tested or not tested at all.

The costs in sections [#thesis/turing/inspection], [#thesis/test], [#thesis/utm/g/disorder], [#thesis/whodecides], and [#thesis/random] are related to each other as shown in [!costs.png]. This leads us to these conclusions:

Validating, inspecting, testing, and deploying a single sequence (C_test) appears to be the least-cost host change management technique.
Adequate testing of partially-ordered sequences (C_partial) is more expensive.
Predicting orthogonality between partial sequences (C_predict) is yet more expensive.
The testing required to enable random change sequences (C_random) is more expensive than any other testing, due to the M! combinatorial explosion involved.

The behavioral attributes of a complex host seem to be effectively infinite over all possible inputs, and therefore difficult to fully quantify [#thesis/test]. The disk size is finite, so we can completely describe hosts in terms of disk content [#howto-describe], but we cannot completely describe hosts in terms of behavior. We can easily test all disk content, but we do not seem to be able to test all possible behavior.

This point has important implications for the design of management tools -- behavior seems to be a peripheral issue, while disk content seems to play a more central role. It would seem that tools which test only for behavior will always be convergent at best. Tools which test for disk content have the potential to be congruent, but only if they are able to describe the entire disk state. One way to describe the entire disk is to support an initial disk state description followed by ordered changes, as in [#howto-describe].

There appears to be a general statement we can make about software systems that run "on top of" others in a "virtual machine" or other software-constructed execution environment [#thesis/avm]:

If any virtual machine instruction has the ability to alter the virtual machine instruction set, then different instruction execution orders can produce different instruction sets. Order of execution of these instructions is critical in determining the future instruction set of the machine. Faulty order has the potential to remove the ability for the machine to update the instruction set or to function at all.
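A toy model may make this statement more concrete. In the sketch below, the "instruction set" is a dictionary of named Python functions, and two of the instructions modify that dictionary; the instruction names and the run_vm function are invented for illustration and are not drawn from any real tool.

```python
from typing import Callable, Dict, List

# An instruction receives the live instruction set and an output log.
Instruction = Callable[[Dict[str, "Instruction"], List[str]], None]


def run_vm(order: List[str]) -> List[str]:
    """Execute named instructions in the given order and return the log."""

    def greet_v1(isa: Dict[str, Instruction], log: List[str]) -> None:
        log.append("hello from v1")

    def greet_v2(isa: Dict[str, Instruction], log: List[str]) -> None:
        log.append("hello from v2")

    def install_v2(isa: Dict[str, Instruction], log: List[str]) -> None:
        # Alters the instruction set: replaces the 'greet' instruction.
        log.append("install_v2")
        isa["greet"] = greet_v2

    def remove_installer(isa: Dict[str, Instruction], log: List[str]) -> None:
        # Alters the instruction set: removes the ability to upgrade.
        log.append("remove_installer")
        isa.pop("install_v2", None)

    isa: Dict[str, Instruction] = {
        "greet": greet_v1,
        "install_v2": install_v2,
        "remove_installer": remove_installer,
    }
    log: List[str] = []
    for name in order:
        instruction = isa.get(name)
        if instruction is None:
            log.append(f"{name}: no such instruction")
            continue
        instruction(isa, log)
    return log


upgrade_then_remove = run_vm(["install_v2", "remove_installer", "greet"])
remove_then_upgrade = run_vm(["remove_installer", "install_v2", "greet"])
```

In the first ordering the machine ends up running the upgraded greet instruction. In the second, the installer is gone before it can run, so the upgrade never happens and the machine has lost the ability to perform it.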

This applies to any application, automatic administration tool [#thesis/selfadmin/equiv], or shared library code executed as root on a UNIX machine (it also applies to other cases on other operating systems). These all interact with hardware and the outside world via the operating system kernel, and have the ability to change that same kernel as well as higher-level elements of their "virtual machine". This statement appears to be independent of the language of the virtual machine instruction set [#thesis/selfadmin/lang/no].
