Junkyard

While there is now prior art which frees us from some of the labor-intensive tasks of administration [cfengine][tivoli][unicenter][XXX], no currently available solution has been proven to be capable of error-free host management.

Systems administrators still tend to build and maintain hosts the same way the automotive industry built cars in the early 1900s: An individual craftsman manually manipulates a machine into being, and manually maintains it afterward. This is expensive in terms of labor, time, reliability, and sysadmin quality of life.

The automotive industry discovered first mass production, then mass customization using standard tooling. The "standardized tooling" for systems administration is not yet complete. XXX

Like the term "computer" itself, the term "systems administrator" may in time come to mean a piece of technology or an outsourced service, rather than a human doing repetitive work. This goal is currently a major focus of the Infrastructure Architect (IA) career field. [infrastructures]

But in order for this goal to be reached, automated systems administration tools and techniques will need to become simple enough to not require an XXX. While there are many general tools now available [cfengine] [isconf] [opsware] [tivoli] [unicenter], none yet meets these criteria.

There is no written consensus on any method for automatic management of the software installed on computers.

Until there is industry consensus in favor of a method for managing computers, tool requirements will be difficult to specify, and management costs will remain high.

In 1998, Joel Huddleston and Steve Traugott offered an infrastructure build checklist and management philosophy [bootstrapping], but declined to advocate many specifics.

But the tools to support Infrastructure Architecture do not seem to be mature yet. While trying to use cfengine in my next major project, for NASA, I hit the realization that I couldn't easily make it explicitly remember -- or act on -- what it had done over the life of a machine; the cfengine "memory horizon" was limited to

But the tools to support Infrastructure Architecture do not seem to be mature yet. While many automated administration tools have now been published [cfengine][opsware][tivoli][unicenter][centerrun][ see selected papers book ], none yet meet [#requirements]

Four years later, based on these findings, Lance Brown and I would like to raise the bar a little more. We'd like to describe one previously unpublished (and originally underestimated) principle of that 1998 toolset -- ordered change [#ordered]. We think that adhering to this principle has served many folks well, and hope that awareness of this principle will help others in tool design. If nothing else, we'd again like to offer a lightning rod for discussion.

[talk about ordering versus convergence]

The question [ of whether strict ordering of administrative actions is more important than convergence ] has important implications for our industry. The answer will guide our actions as well as our effectiveness for decades to come.

Those of us who practice deterministic ordering, and those of us who do not, will quite literally work against each other if administering the same set of machines.

The first two of these, divergence and convergence, are characterized by an assumption that uncontrolled change is to be expected during the lifecycle of a host [#uncontrolled]. A congruent methodology defines uncontrolled change as a security breach [#security].

Because divergence and convergence both accommodate uncontrolled change, they cannot enforce ordered change [#ordered]. In order to gain some control over machines, a convergent tool or methodology analyzes samples of current disk content and acts accordingly to change the disk descriptive language

The latter method, congruence, differs by assuming that drift is undesirable.

ISconf attempts to maintain strict ordering through the use of makefiles. [make]

Network-attached self-administering hosts [#selfadmin] are in practice full Turing machines. They change their own executables, and have an infinite tape.

We tend to think of Von-Neumann [#vn] behavior in terms of limited inputs and outputs such as punched cards and paper tape -- write-once media. But in recent decades, Von-Neumann machines gained first rewritable non-volatile program storage, then network communications. The effect of this on administrative behavior is normally overlooked today.

[ network-attached self-administering hosts have an infinite tape ] [ we usually think of tape = ram; instead think of tape = disk ]
If we alter the network-resident, "master" copy of B after it has executed once, then we can cause it to do something other than reset the tape the second time through. But, having altered B, we can no longer call it B --

We have seen research into an alternative, intriguing idea: "Rather than try so hard to preserve order, and rather than ask a human to try to write an optimized convergence program for a given set of hosts, why don't we try the opposite approach -- let the hosts figure it out?" One way of doing this is to use a pseudo-random ordering mechanism. We execute the set of change actions in random order, monitoring host health and other indicators to detect the effectiveness of a given change [couch][burgess-sandnes]. This appears to be a form of genetic algorithm, with the monitoring providing the fitness function [koza]. In order for this
If we don't constrain changes to a deterministic path, then we by definition do not know the history of the machine.
If we do not know the history of the machine, then we do not know its current state without examining all N bits of the disk.
If we do not know the current state, then we cannot in advance predict the outcome of future changes to that machine.
If we cannot predict the outcome of future changes to a self-administering machine, then we cannot trust its reliability in a mission-critical environment.
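
The chain of reasoning above can be illustrated with a minimal sketch. It assumes, hypothetically, that each host keeps an append-only journal of the changes it has applied and that a master list defines the one permitted order; the names `pending_changes` and `bring_up_to_date` are ours and do not describe any particular tool. A host whose journal is a prefix of the master list has a known history, so its next steps are known without inspecting the disk.

```python
def pending_changes(master, journal):
    """Return the changes still to apply, in master order.

    Raises ValueError if the journal is not a prefix of the master
    sequence, meaning the host has left the deterministic path and
    its history can no longer be trusted.
    """
    if journal != master[:len(journal)]:
        raise ValueError("journal diverges from master sequence")
    return master[len(journal):]


def bring_up_to_date(master, journal, apply):
    """Apply every pending change in order, journaling each one."""
    for name in pending_changes(master, journal):
        apply(name)           # hypothetical callback that performs the change
        journal.append(name)  # record it so the history stays known
    return journal
```

A host whose journal reads `["base", "B"]` against a master list of `["base", "A", "B"]` is rejected rather than "repaired", because its history is no longer one that was tested.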

	XXX

- sampling disk state alone isn't enough to determine disk
	state, unless you sample the *whole* disk

- ...so state machines driven by only disk samples produce
	non-deterministic behavior

- ...so the equivalence of behavior of two different state
	machines driven only by disk samples is undecidable
	(that's easy)

- configuration driven by only deterministic ordering of
	individual state machines produces deterministic behavior

- because the state machines reside on and execute within
	the context of the driven disk, they can modify each
	other, as well as themselves

- ...so the behavior of individual state machines is
	dependent on the order of invocation of all state machines

- ...so whether two different orders of configuration
	operations exhibit the same behavior is undecidable (takes
	me a while to get here)
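
The first point in the list above can be shown with a toy model. This is only a sketch: a disk is modeled as a dictionary from path to content, and the paths, contents, and the `sample` helper are all hypothetical. Two disks can agree on every sampled path and still differ elsewhere, so a tool driven only by those samples cannot tell them apart.

```python
def sample(disk, paths):
    """Return the content observed at the sampled paths only."""
    return {path: disk.get(path) for path in paths}


# Two hypothetical disks that differ in one unsampled file.
disk_x = {
    "/etc/hosts": "v1",
    "/lib/libfoo.so": "1.0",
    "/usr/bin/app": "built-against-1.0",
}
disk_y = {
    "/etc/hosts": "v1",
    "/lib/libfoo.so": "1.0",
    "/usr/bin/app": "built-against-0.9",
}

sampled_paths = ["/etc/hosts", "/lib/libfoo.so"]

# The samples agree even though the disks do not.
same_samples = sample(disk_x, sampled_paths) == sample(disk_y, sampled_paths)
same_disks = disk_x == disk_y
```

Only sampling every path, the equivalent of examining the whole disk, would make `same_samples` a reliable stand-in for `same_disks`.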

Each host in an infrastructure must, over the course of its entire life, follow an ordered, contiguous procedure which is validated elsewhere and is known to work. Failing to do this will result in unpredictable and divergent behavior.

Due to the stateful, Turing-like behavior of a Von-Neumann machine with a disk drive, a given version of a configuration file, executable, or shared library cannot be depended on to work correctly in all stages in the lifecycle of a machine. You need to use the right version for the current state of the machine. This is common sense for most systems administrators.

Automated systems administration is the practice of using software to apply all changes to target machines in an enterprise infrastructure. In order for an infrastructure to qualify under this definition, the labor hours expended in applying any given change must not be proportional to the number of hosts.

Our primary driver is the need to avoid unpredictable behavior when automating the administration of mission-critical, usually commercial, enterprise infrastructures.

Infrastructure Architects [#iarch] and application developers [#appdev] have recognized this potential. Recent years have brought a small explosion in the number and variety of systems management toolsets [cfengine] [ark] [tivoli] [opsware] [unicenter] [centerrun] [search for 'tool' in biblio].

The single, coherent toolset which an Infrastructure Architect [#iarch] uses to deploy and manage an enterprise infrastructure.

To qualify as an Infrastructure Management Toolset (IMT), a toolset must:

Manage all phases of a host's lifecycle, from initial build and ongoing patches to monitoring, security measures, and retirement.
Produce deterministic, repeatable results. [#deterministic] [#repeatable] A properly-used IMT must allow no divergence [#divergence] other than that caused by security breaches.
Not require administration techniques which call for extraordinary talents on the part of the Infrastructure Administrators [#iadmin]. An Intermediate level Systems Administrator [sage] must be able to learn and apply these techniques within several hours of one-on-one training.
Manage each host continuously, and not require rebuilds of production hosts in order to deploy changes.
Enable a division between, and separate management of, development, testing, production, and disaster recovery environments. The IMT must not require testing in production.
Enable changes to be targeted for particular hosts or groups of hosts.
Enable unified management of client and server machines using the same toolset and configuration files. This is simple to implement with [#imt/targeting].
Enable each host or group of hosts to be unique, if business requirements dictate this. This uniqueness is not ad-hoc [#ad-hoc] though; the characteristics of each host are still managed by [#imt/targeting].
Enable change actions to be targeted for a particular operating context, based on time or on host state. For example, some changes might be applied while a given machine is live, such as from cron, and others only at boot. Example contexts include "boot", "idle", or "Sunday evening".
We know of no tools which currently satisfy all of these requirements; we hope publication of this paper helps produce some. One which comes close is ISconf [isconf].
- walking dependency tree depth-first

- preserving serialization of reusable sequences

- preserving serialization of reusable subsequences

- implicit assertion test of zero return codes for external
	commands

- implicit reproducible serialization of operations not
	explicitly serialized

- semaphores which child processes can use to signal async
	events to parent processes (like 'rebuild kernel and
	reboot when make is done')

- a hierarchical grouping of host attributes, so that host
	function can be determined quickly by eye

- re-usable ordered sets of grouped subsequences, so that
	new hosts can be created by prototype rather than by class
	(This is important.  True class-based configuration tools
	don't seem to work in the field, while prototype-based
	systems consistently do.  I don't think I'm able to
	explain why yet; it may be nothing more than "this is the
	way sysadmins think", or there may be a more theoretical
	basis.  I suspect, again, that it has to do with testing.)

These are things which ISconf/make already does.  In
addition, over the years those of us using ISconf have
concluded that we also need: 

- postrequisites (like 'do foo after I'm done')

- decentralized state machine specification (rather than a
	monolithic makefile or script)

- lexically bound syntax (I want to be able to specify each
	operation in the language most suitable for that
	operation)

- separation of action code from site-specific configuration
	data

- decentralized editing of state machine specifications (no
	need to log into gold server to update makefile)

- state machine and file transport language integrated, as
	in cfengine, to remove need for NFS mounts to get packages
	on demand

- embedded documentation, like POD; including dynamically
	generated runbooks and training checklists generated per
	host class (the latter actually looks easy -- name a
	person to be trained as a "target host" and "build" them,
	checking off training actions as they complete)
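
Two of the behaviors listed above, the implicit assertion of zero return codes and postrequisites, can be sketched briefly. This is an illustration under our own assumptions, not ISconf's implementation: `run_sequence` is a hypothetical helper, and `OK` and `FAIL` are stand-in commands that simply exit with a fixed status.

```python
import subprocess
import sys

# Hypothetical commands standing in for real change actions.
OK = [sys.executable, "-c", "raise SystemExit(0)"]
FAIL = [sys.executable, "-c", "raise SystemExit(3)"]


def run_sequence(steps, postrequisites=()):
    """Run (name, argv) steps strictly in order.

    Every external command is implicitly asserted to return zero; the
    first nonzero return code stops the run, so no later step executes
    against an untested state. Postrequisites ("do foo after I'm done")
    run only after every ordinary step has succeeded.
    """
    done = []
    for name, argv in list(steps) + list(postrequisites):
        result = subprocess.run(argv)
        if result.returncode != 0:
            raise RuntimeError(
                f"step {name!r} exited with {result.returncode}; "
                f"completed so far: {done}"
            )
        done.append(name)
    return done
```

Calling `run_sequence([("patch", OK)], postrequisites=[("reboot", OK)])` runs the patch and then the reboot stand-in, while a sequence containing `FAIL` stops at that step and raises.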

An Infrastructure Administrator (IAdmin) is basically a Systems Administrator who manages an Enterprise Virtual Machine (EVM) [#evn] rather than individual hosts. This individual is not expected to create EVM sites -- that's the job of an Infrastructure Architect. [#iarch]

Ordinarily a managed site will have one or more Infrastructure Administrators on staff. An EVM site generally does not have, or need, ordinary Systems Administrators -- conventional systems administration techniques will damage an EVM. [#iadmin/law1]

To qualify as an Infrastructure Administrator, an individual must:

Meet all requirements stated in the SAGE Intermediate Systems Administrator job description.
Demonstrate an understanding of the Infrastructure Architecture [#iarch] field.
Be comfortable using the site's Infrastructure Management Toolset (IMT) [#imt] to apply routine patches and changes to hosts.
Be able to handle, with little or no direction, most day-to-day implementation and testing of minor patches and changes for the entire infrastructure.
Be able to recognize when a change is beyond the scope of the current IMT, and escalate to an Infrastructure Architect. [#iarch]
Never make manual changes on an EVM node -- always use the IMT. Acquiring this discipline is perhaps the most difficult task for an individual making the conversion from Systems Administrator to Infrastructure Administrator.

The Infrastructure Architecture field is the practice of designing and implementing systems which manage an enterprise infrastructure as one large "virtual machine". [#evn] This practice was described by Steve Traugott and Joel Huddleston in 1998. [bootstrapping] That paper included a tentative Infrastructure Architect career definition, in the "Sysadmin, or Infrastructure Architect?" section.

In the intervening 4 years, we've solicited feedback and discussion of the term, as well as SAGE participation. We formed the Infrastructures.Org community to build consensus, and have led numerous USENIX conference Guru and BOF sessions, as well as talks for other industry groups. [usenix] [svlug] [baylisa] [baylug] [sfbug] Over the past year, a Google query for "Infrastructure Architect" has consistently returned Steve Traugott's resume [stevegt] as the most-referenced page, and Infrastructures.Org as the most-referenced noncommercial site. (Google's algorithm attempts to ensure that these rankings are based on interest by other parties rather than anything under page owners' direct control.) [google]

These factors, taken together, lead us to believe that we should by now be safe in setting down the following, more concise, definition. As before, we welcome feedback on the infrastructures mailing list at http://Infrastructures.Org.

The Infrastructure Architect (IArch) career serves as a "next step beyond" Senior Systems Administrator [#sysadmin].

There are major skillset and mindset differences between a Senior Systems Administrator and an IArch. The SAGE job description for a Senior Systems Administrator states only that the individual have an "Ability to identify tasks which require automation and automate them." [sage] This requirement is woefully inadequate for Infrastructure Architecture work.

To qualify as an Infrastructure Architect, an individual should:

Meet all requirements stated in the SAGE Senior Systems Administrator job description.
Meet all requirements stated in our own Infrastructure Administrator job description. [#iadmin]
Prefer to use a coherent and rational plan for audit, deployment, and rework of enterprise infrastructures. One such plan is the "bootstrapping checklist" described in [bootstrapping] and Infrastructures.Org. This approach differs markedly from the largely reaction-oriented patterns of conventional systems administration.
Prefer to use a single, coherent Infrastructure Management Toolset (IMT) for managing all hosts in any given infrastructure. [#imt] This differs from the ad-hoc [#ad-hoc] "on-demand scripting" approach of conventional systems administration.
Demonstrate an ability to train and lead Infrastructure Administrators. [#iadmin]
Demonstrate a consistent ability to factor business plan and financial concerns into daily technical decisions. This does not mean "make bad technical decisions because it's politically expedient", nor does it mean "go for the 90-day ROI when it hurts us in the long run". It does mean "always look for lower TCO" [#tco], and "don't waste time and money implementing a bad business plan -- if only you can see it's bad, then only you can get it fixed".
Demonstrate an awareness and concern for the welfare of the members of the organization. The fundamental goal of Infrastructure Architecture is to raise their quality of life. We do this by accepting responsibility for the integrity of the enterprise infrastructure on which their jobs depend. This contrasts markedly with BOFH. [bofh] Seriously, this does mean that "shut it off and see who screams" is a management practice you'd want to plan away from, likewise "go back and submit a ticket", serially numbered usernames, tiny disk quotas, and filesharing-by-sneakernet.
Demonstrate an ability to create lasting cultural change in support of [#iadmin/law1], [#iarch/business], and [#iarch/people]. Without this change, IT staff members will work against each other. This change must fully encompass both line level and IT management, or these, too, will work against each other.

Assume we have two identical hosts, fred and barney, and two arbitrary change packages, A and B. Assume we want to install A and B on both hosts, and we want fred and barney's behavior to remain identical.

The least-cost method of ensuring that the behavior of fred and barney will remain identical is to install A and B in the same order on both hosts.

We use the word "package" above, but A and B represent any arbitrary change to any disk content.
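
A toy model makes the point concrete. This is a sketch under stated assumptions: the disk is a dictionary from path to content, and `change_a`, `change_b`, and the file names are hypothetical. B's result depends on what it finds on disk, so running it before or after A leaves different content behind.

```python
def change_a(disk):
    """Hypothetical change A: upgrade a shared library to 2.0."""
    disk = dict(disk)  # work on a copy so the base image is untouched
    disk["/lib/libfoo.so"] = "2.0"
    return disk


def change_b(disk):
    """Hypothetical change B: install an application that records
    whichever library version is present at install time."""
    disk = dict(disk)
    disk["/usr/bin/app"] = "linked-against-" + disk.get("/lib/libfoo.so", "none")
    return disk


def install(disk, changes):
    """Apply the changes to the disk in the order given."""
    for change in changes:
        disk = change(disk)
    return disk


base = {"/lib/libfoo.so": "1.0"}
fred = install(base, [change_a, change_b])
barney = install(base, [change_b, change_a])       # same changes, other order
fred_again = install(base, [change_a, change_b])   # same order as fred
```

Here `fred` and `fred_again` end up identical, while `barney` carries an application linked against the old library even though all three hosts received the same two changes.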

Note that we say "fred installs" or "barney installs" these packages. This is important to understand -- as systems administrators, we tend to forget that the most we can do on a live machine is direct it to perform actions on our behalf. Any command or tool we use executes in the context of, and is subject to the interpretation of, the entire host operating system.

Note also that we say that there is no algorithm "fred or barney" can use -- fred and barney, being computing machines, have limited powers. As humans, we can halt them both, reboot each from the same CD, and compare the disks to make sure that altering the package order didn't make the disks different. It would take us a while, but we could do it. If we find that the disks are different, we can inspect the differences and decide whether we think they will ever cause a problem, but this will take even longer, and in some cases we may never be sure. None of these "manual effort" methods are effective for large infrastructures with numerous machines or rapid deployment requirements.

as well as those which operate clusters of identical workstations or servers. This includes organizations which use "staging" and "production" environments, perform software quality assurance, or otherwise require reliable operation and high uptime from multiple machines.

The cost of discovering and testing this sequence is less than the cost of testing additional sequences [#thesis/utm/g/disorder]:
Ctest < Cpartial
The cost of predicting whether an alteration in the order of changes will cause a behavior difference is greater than the cost of inspecting the code to determine
Ctest > Cvalidate
Ctest > Cinspect
Cmany > Ctest
Cpredict > Cerror
Cpredict > Ctest
Crandom > Cpartial

XXX EVM

A distributed computing infrastructure [#thesis/turing/replicate] made up of UNIX machines is subject to the testing issues in [#thesis/turing/testing] if all code installed on the machines is identical, and [#thesis/turing/replicate/unique] if any code installed on the machines is unique. This is a clear incentive to try to keep as much code as possible standardized across machines. XXX show expressions why

Due to [#thesis/whodecides], it appears that change orthogonality is undecidable. When designing tools for field use, we should assume that all changes are subject to sequencing issues. This is particularly important if we expect deterministic, reliable behavior in operation and repeatable rebuilds for disaster recovery. XXX glue to cost summary

[kill?] A note about our use of pronouns: This text will use the words "we" and "us" loosely. Lance and I describe here the efforts and findings of a community of people generally but not exclusively centered around Infrastructures.Org. As a founder of that community, I'd like to be able to give voice to our 245 (and growing) members. While I know that nothing I say will match the thoughts of all of those rugged individualists all of the time, I'll do my best to capture current consensus. You'll have to judge for yourself how well I do that -- the list archives are publicly available on the web site.

The least-cost way to ensure reliable behavior in an enterprise infrastructure is to always implement changes to hosts in a deterministic order.

Von-Neumann machines are those which hold code and data in the same storage, conventionally in separate areas. A Von-Neumann machine is not a Turing machine, but a Turing machine can emulate a Von-Neumann machine. XXX

[ show diagram of text and data areas ]