<h2>Junkyard</h2>

<p> While there is now prior art which frees us from some of
	the labor-intensive tasks of administration
	[cfengine][tivoli][unicenter][XXX], no currently available
	solution has been proven capable of error-free host
	management.  </p>

	<p> Systems administrators still tend to build and
		maintain hosts the same way the automotive industry
		built cars in the early 1900s: an individual craftsman
		manually manipulates a machine into being, and manually
		maintains it afterward.  This is expensive in terms of
		labor, time, reliability, and sysadmin quality of life.
	</p> 

<p> The automotive industry discovered first mass
	production, then mass customization using standard
	tooling.  The "standardized tooling" for systems
	administration is not yet complete.  XXX


<p> Like the term "computer" itself, the term "systems
administrator" may in time come to mean a piece of
technology or an outsourced service, rather than a human
doing repetitive work.  This goal is currently a major focus
of the Infrastructure Architect (IA) career field.
[infrastructures]

<p> But in order for this goal to be reached, automated
systems administration tools and techniques will need to
become simple enough to not require an XXX.  While there are
many general tools now available 
[cfengine] [isconf] [opsware] [tivoli] [unicenter], 
none yet meets this criterion.

	<p> There is no written consensus on any method for
		automatic management of the software installed on
		computers.  </p>

	<p> Until there is industry consensus in favor of a method
		for managing computers, tool requirements will be
		difficult to specify, and management costs will remain
		high.  </p>

	<p> In 1998, Joel Huddleston and Steve Traugott offered an
		infrastructure build checklist and management philosophy
		[bootstrapping], but declined to advocate many
		specifics.  </p>


	<p> But the tools to support Infrastructure Architecture
		do not seem to be mature yet.  While trying to use
		cfengine in my next major project, for NASA, I hit the
		realization that I couldn't easily make it explicitly
		remember -- or act on -- what it had done over the life
		of a machine; the cfengine "memory horizon" was limited
		to 


	<p> But the tools to support Infrastructure Architecture
		do not seem to be mature yet.  While many automated
		administration tools have now been published
		[cfengine][opsware][tivoli][unicenter][centerrun][ see
		selected papers book ], none yet meet 
		[#requirements] </p>

		<p> Four years later, based on these findings, Lance Brown
		and I would like to raise the bar a little more.  We'd
		like to describe one previously unpublished (and
		originally underestimated) principle of that 1998
		toolset -- ordered change [#ordered].  We think that
		adhering to this principle has served many folks well,
		and hope that awareness of this principle will help
		others in tool design.  If nothing else, we'd again like
		to offer a lightning rod for discussion. </p>

	
<p> [talk about ordering versus convergence]

<p> The question [ of whether strict ordering of
administrative actions is more important than convergence ]
has important implications for our industry.  The answer
will guide our actions as well as our effectiveness for
decades to come.  

<p> Those of us who practice deterministic ordering, and
those of us who do not, will quite literally work against
each other if administering the same set of machines.


	<p> The first two of these, divergence and convergence,
		are characterized by an assumption that uncontrolled
		change is to be expected during the lifecycle of a host
		[#uncontrolled].  A congruent methodology defines
		uncontrolled change as a security breach
		[#security].</p>

	
	<p> Because divergence and convergence both accommodate
		uncontrolled change, they cannot enforce ordered change
		[#ordered].
		
			In order to gain some control over
		machines, a convergent tool or methodology analyzes
		samples of current disk content and acts accordingly to
		change the disk 
		descriptive language 
	
	<p> The latter method, congruence, differs by assuming
		that drift is undesirable.  </p>
		
	<p> ISconf attempts to maintain strict ordering through
		the use of makefiles.  [make]  


	<p> Network-attached self-administering hosts [#selfadmin]
		are in practice full Turing machines.  They change their
		own executables, and have an infinite tape.  </p>

	<section name="oldhabits" title="Old Habits">

		<p>We tend to think of Von-Neumann [#vn] behavior in
			terms of limited inputs and outputs such as punched
			cards and paper tape -- write-once media.  But in
			recent decades, Von-Neumann machines gained first
			rewritable non-volatile program storage, then network
			communications.  This fact is normally overlooked
			today when we reason about administrative behavior. </p>

	[ network-attached self-administering hosts have an
	infinite tape ]

	[ we usually think of tape = ram; instead think of tape =
	disk ]

	</section>
	
		If we alter the
		network-resident, "master" copy of <b>B</b> after it has
		executed once, then we can cause it to do something
		other than reset the tape the second time through.  But,
		having altered <b>B</b>, we can no longer call it <b>B</b> -- 


		We have seen research into an alternative, intriguing
		idea:  "Rather than try so hard to preserve order, and
		rather than ask a human to try to write an optimized
		convergence program for a given set of hosts, why don't
		we try the opposite approach -- let the hosts figure it
		out?"  One way of doing this is to use a pseudo-random
		ordering mechanism.  We execute the set of change
		actions in random order, monitoring host health and
		other indicators to detect the effectiveness of a given
		change [couch][burgess-sandnes].  This appears to be a
		form of genetic algorithm, with the monitoring providing
		the fitness function [koza].   In order for this 

	<section type='point' name="XXX" > If we don't constrain
		changes to a deterministic path, then we by definition
		do not know the history of the machine.  </section>
	
	<section type='point' name="XXX" > If we do not know the
		history of the machine, then we do not know its current
		state without examining all N bits of the disk.
	</section>
	
	<section type='point' name="XXX" >If we do not know the
		current state, then we cannot in advance predict the
		outcome of future changes to that machine.  </section>
	
	<section type='point' name="XXX" > If we cannot predict the
		outcome of future changes to a self-administering
		machine, then we cannot trust its reliability in a
		mission-critical environment.  </section>
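
<p> The chain of points above can be illustrated with a minimal
	Python sketch.  It assumes a disk can be modeled as a mapping
	from path to content, and a host's history as an ordered
	journal of writes and deletions; the names <i>replay</i>,
	<i>base_image</i>, and <i>journal</i> are hypothetical and
	do not come from any existing tool.  If the history is
	known, the current state follows by replay rather than by
	examining the disk, and the outcome of a further change can
	be predicted in advance.  </p>

```python
def replay(base_image, journal):
    """Return the disk state implied by applying journal entries in order.

    Each entry is (path, content); content None means "delete the path".
    """
    disk = dict(base_image)
    for path, content in journal:
        if content is None:
            disk.pop(path, None)
        else:
            disk[path] = content
    return disk


# Hypothetical base image and ordered change history for one host.
base_image = {"/etc/motd": "welcome"}
journal = [
    ("/etc/ntp.conf", "server ntp1"),
    ("/etc/motd", None),
    ("/etc/ntp.conf", "server ntp2"),
]

# Known history gives the current state without reading the disk.
current = replay(base_image, journal)

# It also lets us predict the result of a proposed change before applying it.
proposed = ("/etc/motd", "maintenance at 02:00")
predicted = replay(base_image, journal + [proposed])
```

<p> The sketch only holds while every change goes through the
	journal; a single unrecorded change invalidates both
	<i>current</i> and <i>predicted</i>.  </p>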

<pre>

	XXX

- sampling disk state alone isn't enough to determine disk
	state, unless you sample the *whole* disk

- ...so state machines driven by only disk samples produce
	non-deterministic behavior

- ...so the equivalence of behavior of two different state
	machines driven only by disk samples is undecidable
	(that's easy)

- configuration driven by only deterministic ordering of
	individual state machines produces deterministic behavior

- because the state machines reside on and execute within
	the context of the driven disk, they can modify each
	other, as well as themselves

- ...so the behavior of individual state machines is
	dependent on the order of invocation of all state machines

- ...so whether two different orders of configuration
	operations exhibit the same behavior is undecidable (takes
	me a while to get here)

</pre>
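
<p> The first note above, that sampling part of a disk does not
	determine its state, can be sketched in a few lines of
	Python.  The paths, file contents, and the
	<i>apply_change</i> function are hypothetical, chosen only
	to show two disks which look identical to a sampling tool
	yet respond differently to the same change.  </p>

```python
# Hypothetical set of paths a sampling tool inspects.
SAMPLED_PATHS = ["/etc/passwd", "/etc/resolv.conf"]


def sample(disk, paths=SAMPLED_PATHS):
    """Return only the part of the disk that the sampling tool looks at."""
    return {path: disk.get(path) for path in paths}


def apply_change(disk):
    """Hypothetical change whose outcome depends on unsampled content."""
    new = dict(disk)
    new["/usr/bin/app"] = "app built for " + disk["/usr/lib/libc.so"]
    return new


disk_one = {
    "/etc/passwd": "root:x:0:0",
    "/etc/resolv.conf": "nameserver 10.0.0.1",
    "/usr/lib/libc.so": "libc-2.2",
}
disk_two = dict(disk_one)
disk_two["/usr/lib/libc.so"] = "libc-2.3"  # differs only outside the sample

same_samples = sample(disk_one) == sample(disk_two)
same_outcome = apply_change(disk_one) == apply_change(disk_two)
```

<p> Here <i>same_samples</i> is true while
	<i>same_outcome</i> is false: a tool driven only by the
	samples has no basis for telling the two hosts apart.  </p>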

Each host in an infrastructure must, over the course of its
entire life, follow an ordered, contiguous procedure which
is validated elsewhere and is known to work.  Failing to do
this will result in unpredictable and divergent behavior.

<p> Due to the stateful, Turing-like behavior of a
	Von-Neumann machine with a disk drive, a given version of a
	configuration file, executable, or shared library cannot
	be depended on to work correctly at every stage in the
	lifecycle of a machine.  You need to use the right version
	for the current state of the machine.  This is common
	sense for most systems administrators.

<section name="problem" title="The Problem Domain">

<p> <i>Automated systems administration</i> is the practice
of using software to apply all changes to target machines in
an enterprise infrastructure.  In order for an
infrastructure to qualify under this definition, the labor
hours expended in applying any given change must <b>not</b>
be proportional to the number of hosts.

<p> Our primary driver is the need to avoid unpredictable
behavior when automating the administration of
mission-critical, usually commercial, enterprise
infrastructures.  

</section>

	<p>	Infrastructure Architects [#iarch] and application
		developers [#appdev] have recognized this potential.
		Recent years have brought a small explosion in the
		number and variety of systems management toolsets
		[cfengine] [ark] [tivoli] [opsware] [unicenter]
		[centerrun] [search for 'tool' in biblio].  </p>


	<section type="def" name='imt' title="Infrastructure
		Management Toolset">

		<p> The single, coherent toolset which an Infrastructure
			Architect [#iarch] uses to deploy and manage an
			enterprise infrastructure.</p>

		<p> To qualify as an Infrastructure Management Toolset
			(IMT), a toolset must:</p>

		<section type='point' name='imt/phases'> Manage all
			phases of a host's lifecycle, from initial build and
			ongoing patches to monitoring, security measures, and
			retirement.</section>

		<section type='point' name='imt/repeatable'>Produce
			deterministic, repeatable results.  [#deterministic]
			[#repeatable]  A properly-used IMT must allow no
			divergence [#divergence] other than that caused by
			security breaches.  </section>

		<section type='point' name='imt/talents'> An IMT must
			not require administration techniques which call for
			extraordinary talents on the part of the
			Infrastructure Administrators [#iadmin].  An
			Intermediate level Systems Administrator [sage] must
			be able to learn and apply these techniques within
			several hours of one-on-one training.  </section>

		<section type='point' name='imt/continuous'>Manage each
			host continuously, and not require rebuilds of
			production hosts in order to deploy changes.</section>

		<section type='point' name='imt/testing'>Enable a
			division between, and separate management of,
			development, testing, production, and disaster
			recovery environments.  The IMT must not require
			testing in production.</section>

		<section type='point' name='imt/targeting'>Enable
			changes to be targeted for particular hosts or groups
			of hosts.</section>

		<section type='point' name='imt/servers'>Enable unified
			management of client and server machines using the
			same toolset and configuration files.  This is simple
			to implement with [#imt/targeting].  </section>

		<section type='point' name='imt/unique'>Enable each host
			or group of hosts to be unique, if business
			requirements dictate this.  This uniqueness is not
			ad-hoc [#ad-hoc] though; the characteristics of each
			host are still managed by [#imt/targeting].</section>

		<section type='point' name='imt/context'>Enable change
			actions to be targeted for a particular operating
			context, based on time or on host state.  For example,
			some changes might be applied while a given machine is
			live, such as from cron, and others only at boot.
			Example contexts include "boot", "idle", or "Sunday
			evening".  </section>

		<section type='point' name='imt/none'> We know of no
			tools which currently satisfy all of these
			requirements; we hope publication of this paper helps
			produce some.  One which comes close is ISconf
			[isconf].</section>
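
		<p> The targeting and context requirements above
			([#imt/targeting], [#imt/context]) can be sketched as
			plain data plus one selection function.  This is an
			illustration under assumed names, not the ISconf data
			model: the <i>Change</i> class, the group names, and
			the change names are all hypothetical.  The hosts
			<i>fred</i> and <i>barney</i> and the contexts "boot"
			and "idle" are borrowed from elsewhere in this text.
			</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    name: str
    groups: frozenset    # host groups this change is targeted at
    contexts: frozenset  # operating contexts in which it may run


# Hypothetical group membership; one toolset covers clients and servers.
HOST_GROUPS = {
    "fred": {"all", "servers"},
    "barney": {"all", "clients"},
}

# The list order is the order in which changes are applied.
CHANGES = [
    Change("install-ntp", frozenset({"all"}), frozenset({"boot", "idle"})),
    Change("upgrade-kernel", frozenset({"servers"}), frozenset({"boot"})),
    Change("desktop-fonts", frozenset({"clients"}), frozenset({"idle"})),
]


def changes_for(host, context):
    """Return, in list order, the changes targeted at host in this context."""
    groups = HOST_GROUPS[host]
    return [
        change.name
        for change in CHANGES
        if not change.groups.isdisjoint(groups) and context in change.contexts
    ]


fred_at_boot = changes_for("fred", "boot")
barney_when_idle = changes_for("barney", "idle")
```

		<p> Because selection only filters a single ordered
			list, two hosts in the same groups always see the
			same changes in the same order.  </p>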

	<pre>
- walking dependency tree depth-first

- preserving serialization of reusable sequences

- preserving serialization of reusable subsequences

- implicit assertion test of zero return codes for external
	commands

- implicit reproducible serialization of operations not
	explicitly serialized

- semaphores which child processes can use to signal async
	events to parent processes (like 'rebuild kernel and
	reboot when make is done')

- a hierarchical grouping of host attributes, so that host
	function can be determined quickly by eye

- re-usable ordered sets of grouped subsequences, so that
	new hosts can be created by prototype rather than by class
	(This is important.  True class-based configuration tools
	don't seem to work in the field, while prototype-based
	systems consistently do.  I don't think I'm able to
	explain why yet; it may be nothing more than "this is the
	way sysadmins think", or there may be a more theoretical
	basis.  I suspect, again, that it has to do with testing.)

These are things which ISconf/make already does.  In
addition, over the years those of us using ISconf have
concluded that we also need: 

- postrequisites (like 'do foo after I'm done')

- decentralized state machine specification (rather than a
	monolithic makefile or script)

- lexically bound syntax (I want to be able to specify each
	operation in the language most suitable for that
	operation)

- separation of action code from site-specific configuration
	data

- decentralized editing of state machine specifications (no
	need to log into gold server to update makefile)

- state machine and file transport language integrated, as
	in cfengine, to remove need for NFS mounts to get packages
	on demand

- embedded documentation, like POD; including dynamically
	generated runbooks and training checklists generated per
	host class (the latter actually looks easy -- name a
	person to be trained as a "target host" and "build" them,
	checking off training actions as they complete)

</pre>

	</section>

	<section type="def" name='iadmin' title="Infrastructure
		Administrator">  

		<p> An Infrastructure Administrator (IAdmin) is
			basically a Systems Administrator who manages an
			Enterprise Virtual Machine (EVM) [#evn] rather than
			individual hosts.  This individual is not expected to
			create EVM sites -- that's the job of an
			Infrastructure Architect.  [#iarch] </p>
		
		<p> Ordinarily a managed site will have one or more
			Infrastructure Administrators on staff.  An EVM site
			generally does <b>not</b> have, or need, ordinary
			Systems Administrators -- conventional systems
			administration techniques will damage an EVM.
			[#iadmin/law1]  </p>
			
		<p> To qualify as an Infrastructure Administrator, an
			individual must:</p>

		<section type="point" name="iadmin/sage"> Meet all
			requirements stated in the SAGE Intermediate Systems
			Administrator job description. </section>

		<section type="point" name="iadmin/iarch"> Demonstrate
			an understanding of the Infrastructure Architecture
			[#iarch] field.  
		</section>

		<section type="point" name="iadmin/changes">	Be
			comfortable using the site's Infrastructure Management
			Toolset (IMT) [#imt] to apply routine patches and
			changes to hosts.  </section>
    
		<section type="point" name="iadmin/changes">	Be able to
			handle, with little or no direction, most day-to-day
			implementation and testing of minor patches and
			changes for the entire infrastructure.  <sref
				name="iarch"/>  </section>
		
		<section type="point" name="iadmin/changes">	Be able to
			recognize when a change is beyond the scope of the
			current IMT, and escalate to an Infrastructure
			Architect. [#iarch] </section>
		
		<section type="point" name="iadmin/law1"> Never make
			manual changes on an EVM node -- always use the IMT.
			Acquiring this discipline is perhaps the most
			difficult task for an individual making the conversion
			from Systems Administrator to Infrastructure
			Administrator.  </section> 

	</section>

	<section type="def" name='iarch' title="Infrastructure
		Architecture">  

		<p> The Infrastructure Architecture field is the
			practice of designing and implementing systems which
			manage an enterprise infrastructure as one large
			"virtual machine".  [#evn]  This practice was
			described by Steve Traugott and Joel Huddleston in
			1998.  [bootstrapping]  That paper included a
			tentative Infrastructure Architect career definition,
			in the "Sysadmin, or Infrastructure Architect?"
			section.  </p>
			
		<p> In the intervening four years, we've solicited feedback
			and discussion of the term, as well as SAGE
			participation.  We formed the Infrastructures.Org
			community to build consensus, and have led numerous
			USENIX conference Guru and BOF sessions, as well as
			talks for other industry groups.  [usenix] [svlug]
			[baylisa] [baylug] [sfbug]  Over the past year, a Google
			query for "Infrastructure Architect" has consistently
			returned Steve Traugott's resume [stevegt] as the
			most-referenced page, and Infrastructures.Org as the
			most-referenced noncommercial site.  (Google's
			algorithm attempts to ensure that these rankings are
			based on interest by other parties rather than
			anything under page owners' direct control.)  [google]
		</p>

		<p> These factors, taken together, lead us to believe
			that we should by now be safe in proposing the
			following, more concise, definition.  As before, we
			welcome feedback on the <i>infrastructures</i> mailing
			list at http://Infrastructures.Org.</p>

		<p> The Infrastructure Architect (IArch) career serves
			as a "next step beyond" Senior Systems Administrator
			[#sysadmin].  </p> 

		<p> There are major skillset and mindset differences
			between a Senior Systems Administrator and an IArch.
			The SAGE job description for a Senior Systems
			Administrator requires only that the individual have an
			"Ability to identify tasks which require automation
			and automate them." [sage] This requirement is
			woefully inadequate for Infrastructure Architecture
			work.  </p>

		<p> To qualify as an Infrastructure Architect, an
			individual should:</p>

		<section type="point" name="iarch/sage"> Meet all
			requirements stated in the SAGE Senior Systems
			Administrator job description. </section>

		<section type="point" name="iarch/sage"> Meet all
			requirements stated in our own Infrastructure
			Administrator job description. [#iadmin] </section>

		<section type="point" name="iarch/plan"> Prefer to use a
			coherent and rational plan for audit, deployment, and
			rework of enterprise infrastructures.  One such plan
			is the "bootstrapping checklist" described in
			[bootstrapping] and Infrastructures.Org.  This
			approach differs markedly from the largely
			reaction-oriented patterns of conventional systems
			administration. </section>

		<section type="point" name="iarch/imt"> Prefer to use a
			single, coherent Infrastructure Management Toolset
			(IMT) for managing all hosts in any given
			infrastructure.  [#imt]  This differs from the ad-hoc
			[#ad-hoc] "on-demand scripting" approach of
			conventional systems administration.  </section> 

		<section type="point" name="iarch/lead"> Demonstrate an
			ability to train and lead Infrastructure
			Administrators.  [#iadmin] </section> 

		<section type="point" name="iarch/business"> Demonstrate
			a consistent ability to factor business plan and
			financial concerns into daily technical decisions.
			This does not mean "make bad technical decisions
			because it's politically expedient", nor does it mean
			"go for the 90-day ROI when it hurts us in the long
			run".  It <b>does</b> mean "always look for lower TCO"
			[#tco], and "don't waste time and money implementing a
			bad business plan -- if only you can see it's bad,
			then only you can get it fixed".  </section>

		<section type="point" name="iarch/people"> Demonstrate
			an awareness and concern for the welfare of the
			members of the organization.  The fundamental goal of
			Infrastructure Architecture is to raise their quality
			of life.  We do this by accepting responsibility for
			the integrity of the enterprise infrastructure on
			which their jobs depend.  This contrasts markedly with
			BOFH.  [bofh]  Seriously, this does mean that "shut it
			off and see who screams" is a management practice
			you'd want to plan away from, as are "go back and
			submit a ticket", serially numbered usernames, tiny
			disk quotas, and filesharing-by-sneakernet.</section> 

		<section type="point" name="iarch/culture"> Demonstrate an
			ability to create lasting cultural change in support
			of [#iadmin/law1], [#iarch/business], and
			[#iarch/people].  Without this change, IT staff
			members will work against each other.  This change
			must fully encompass both line level and IT
			management, or these, too, will work against each
			other.  </section> 

	</section>

			<p> Assume we have two identical hosts, <i>fred</i>
				and <i>barney</i>, and two arbitrary change
				packages, <i>A</i> and <i>B</i>.  Assume we want to
				install A and B on both hosts, and we want fred and
				barney's behavior to remain identical.  </p>

			<p> <b> The least-cost method of ensuring that the
					behavior of fred and barney will remain identical
					is to install A and B in the same order on both
					hosts.  </b> </p>

	<p> We use the word "package" above, but A and B represent
		any arbitrary change to any disk content.  </p>
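
	<p> A minimal Python sketch of the fred and barney case
		follows.  It models a disk as a mapping from path to
		content and each package as a function from one disk
		state to the next.  The two package functions and their
		file names are hypothetical; they stand in for any pair
		of changes where one inspects what the other has left
		on disk.  </p>

```python
def package_a(disk):
    """Hypothetical package A: installs a library and its config file."""
    new = dict(disk)
    new["/usr/lib/libfoo.so"] = "libfoo-1.0"
    new["/etc/foo.conf"] = "version=1.0"
    return new


def package_b(disk):
    """Hypothetical package B: its install script inspects the disk first."""
    new = dict(disk)
    if "/usr/lib/libfoo.so" in new:
        new["/usr/bin/bar"] = "bar linked against " + new["/usr/lib/libfoo.so"]
    else:
        new["/usr/bin/bar"] = "bar with bundled libfoo"
        new["/usr/lib/libfoo.so"] = "libfoo-0.9"
    return new


def install(disk, packages):
    """Apply packages to a copy of disk, strictly in the order given."""
    for package in packages:
        disk = package(disk)
    return disk


# fred and barney start from identical disks.
base = {"/etc/hostname": "generic"}

fred = install(base, [package_a, package_b])
barney_same_order = install(base, [package_a, package_b])
barney_swapped = install(base, [package_b, package_a])
```

	<p> With the same order, <i>fred</i> and
		<i>barney_same_order</i> end up with identical disks.
		With the order swapped, <i>barney_swapped</i> carries a
		different <i>/usr/bin/bar</i>, even though both hosts
		"have A and B installed".  </p>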

	<p> Note that we say "fred installs" or "barney installs"
		these packages.  This is important to understand -- as
		systems administrators, we tend to forget that the most
		we can do on a live machine is direct it to perform
		actions on our behalf.  Any command or tool we use
		executes in the context of, and is subject to the
		interpretation of, the entire host operating system.
	</p>

	<p> Note also that we say that there is no algorithm "fred
		or barney" can use -- fred and barney, being computing
		machines, have limited powers.  As humans, we can halt
		them both, reboot each from the same CD, and compare the
		disks to make sure that altering the package order
		didn't make the disks different.  It would take us a
		while, but we could do it.  If we find that the disks
		are different, we can inspect the differences and decide
		whether we think they will ever cause a problem, but
		this will take even longer, and in some cases we may
		never be sure.  None of these "manual effort" methods
		are effective for large infrastructures with numerous
		machines or rapid deployment requirements. </p>

		
		as well as
		those which operate clusters of identical workstations
		or servers.  This includes organizations which use
		"staging" and "production" environments, perform
		software quality assurance, or otherwise require
		reliable operation and high uptime from multiple
		machines.  </p>


			The cost of
			discovering and testing this sequence is less than the
			cost of testing additional sequences
			[#thesis/utm/g/disorder]: 


		<blockquote>
			C<sub>test</sub> &lt; C<sub>partial</sub>
		</blockquote>

		The cost of predicting whether an alteration in the
		order of changes will cause a behavior difference is
		greater than the cost of inspecting the code to
		determine 

				<b>C<sub>test</sub></b> > C<sub>validate</sub>
				C<sub>test</sub> > C<sub>inspect</sub>
			<b>C<sub>many</sub></b> > C<sub>test</sub>
			<b>C<sub>predict</sub></b> > <b>C<sub>error</sub></b>
			C<sub>predict</sub> > C<sub>test</sub>
			<b>C<sub>random</sub></b> > C<sub>partial</sub>

		XXX EVM


	A distributed computing infrastructure
	[#thesis/turing/replicate] made up of UNIX machines is
	subject to the testing issues in [#thesis/turing/testing]
	if all code installed on the machines is identical, and
	[#thesis/turing/replicate/unique] if any code installed on
	the machines is unique.  This is a clear incentive to try
	to keep as much code as possible standardized across
	machines.  XXX show expressions why

		Due to [#thesis/whodecides], it appears that change
		orthogonality is <i>undecidable</i>.  When designing
		tools for field use, we should assume that all changes
		are subject to sequencing issues.  This is particularly
		important if we expect deterministic, reliable behavior
		in operation and repeatable rebuilds for disaster
		recovery.  XXX glue to cost summary
	
		
	<p> [kill?] A note about our use of pronouns: This text
		will use the words "we" and "us" loosely.  Lance and I
		describe here the efforts and findings of a community of
		people generally but not exclusively centered around
		Infrastructures.Org.  As a founder of that community,
		I'd like to be able to give voice to our 245 (and
		growing) members.  While I know that nothing I say will
		match the thoughts of all of those rugged individualists
		all of the time, I'll do my best to capture current
		consensus.  You'll have to judge for yourself how well I
		do that -- the list archives are publicly available on
		the web site. </p>

		<p> <b> The least-cost way to ensure reliable behavior
				in an enterprise infrastructure is to always
				implement changes to hosts in a deterministic order.
		</b> </p>


<section type="def" name="def-vn" title="Von-Neumann
	Machine"> 
	
	<p> Von-Neumann machines are those which hold code and
		data in separate areas of the same rewritable storage.
		A Von-Neumann machine is not a Turing machine, but a
		Turing machine can emulate a Von-Neumann machine.  XXX </p>

[ show diagram of text and data areas ]

</section>