Thinking:

Automated systems administration is very straightforward.  There is
only one way to change the contents of disk or RAM in a running UNIX
machine -- the syscall interface.  The task of automated
administration is simply to make sure that each machine's kernel gets
the right system calls, in the right order, to make it be the machine
you want it to be.  

- sampling disk state alone isn't enough to determine disk state,
  unless you sample the *whole* disk

- ...so state machines driven by only disk samples produce
  non-deterministic behavior

- ...so the equivalence of behavior of two different state machines
  driven only by disk samples is undecidable  (that's easy)

- configuration driven by only deterministic ordering of individual
  state machines produces deterministic behavior

- because the state machines reside on and execute within the context
  of the driven disk, they can modify each other, as well as
  themselves

- ...so the behavior of individual state machines is dependent on the
  order of invocation of all state machines

- ...so whether two different orders of configuration operations
  exhibit the same behavior is undecidable (takes me a while to get
  here)
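
A toy illustration of the order-dependence claimed above.  This is only
a sketch: the "disk" is a Python dict, and install_bash and
write_greeting are hypothetical operations invented for the example,
not anything from ISconf or cfengine.  It shows nothing about
undecidability; it only shows that two operations which read state
left behind by each other produce different final states when run in
different orders.

```python
from itertools import permutations

def install_bash(disk):
    # Hypothetical operation: replaces the shell recorded on the toy disk.
    disk["shell"] = "bash"

def write_greeting(disk):
    # Hypothetical operation: its result depends on whatever earlier
    # operations left on the disk.
    disk["greeting"] = "hello from " + disk["shell"]

OPS = {"install_bash": install_bash, "write_greeting": write_greeting}

def run(order):
    """Apply the named operations, in order, to a fresh toy disk."""
    disk = {"shell": "sh", "greeting": ""}
    for name in order:
        OPS[name](disk)
    return disk

# Run every possible ordering and collect the final disk states.
outcomes = {order: run(order) for order in permutations(OPS)}
distinct_outcomes = {tuple(sorted(d.items())) for d in outcomes.values()}

for order, disk in outcomes.items():
    print(order, "->", disk)
```

With only two operations there are already two distinct final states,
one per ordering.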

Body:

if sequence changes, then revalidation required
  sampling not enough
  ordering is
    only one way to change a unix machine -- syscall interface
      make sure same syscalls in same order
    behavior will always change if disk state changes
      cat /dev/*
problem domain
  simplicity, harried admins, commercial environment
  remove unpredictability, nondeterminism
    through validation, repeatability, invariance [alva]
      log files, user data not invariant, not subject to validation
      comment lines in config files are subject to validation
        otherwise why have comment lines?  they are human code
  congruence, not convergence
    diminishing returns as automation increases [Northrup]
  enterprise-wide consistent builds and subsequent management
    continuous management over life of machine
      some operations can be done live
      some must be done at boot
      most operations must be repeated 
      no rebuilds in production
      no testing in production
      no convergence
    more you automate, the easier [layers]
      convergence is opposite
    testing is role of admin, not vendor
      vendor unit testing
      admin integration testing
  full auto
    can't be done halfway
    politics main issue until we reach critical mass of automated sites
  problem solving vs. corrective action
    humans solve problems
    code implements action
  tend to be commercial
    why
  don't tend to be academic
    why
      athena exception
  should they be different?
disk state
  2^N
  only spot checks are possible
    suitable only for (partial) convergence, not congruence
    2^(N-S)
    cfengine
      encourages use of spot checks
      100k is .005 percent of a 2GB disk
      any rule can be triggered by 2^(N-S) disk states
        non-deterministic?
    sampling cannot be deterministic, no matter how careful
      action triggered by result of sample can be retriggered by future change 
      editing prior actions 
  ignore it
    ISconf
      act only on permanent history
sequence
  is a full expression of a host
    managed
    unmanaged
  2!
    hacmp cluster build
    culling and leaves
  orthogonality
    local admin talent -- who decides?
      difficult to train
  reproducible outcome
    testing and validation
  changing order
    means changing set
    randomization
      N! means infeasible to prove same outcome
        N as low as 8 or 9
      full regression math 
      genetic algorithm 
        build is 1-hour fitness function
          DNA 8 = 4.6 years
  determining set
    discovery order
  dependency graph
    8-30 deep for hacmp
    breaking implicit order breaks build 
      show examples
  discovering sequence is easy
    discovery order
    P, P'
      ISconf makefile edit
  deterministic works
    easy to train
    start at beginning of host life
    only add new to end
    testing and validation
  new hosts
    create by prototype, not class
    include known subsequences
      internal ordering preserved
  ISconf/make
    expresses sequenced groups of sequenced operations
    human error (missing deps) masked by deterministic make behavior
      randomized known to break
    only one possible disk state represented by given timestamps
      deterministic state transitions
  convergence-only tools
    don't care about sequence
    can't do it
    "can't just tack things onto end" [alva]
      must recreate convergence via rewrite
        editing prior actions
        admin can't be depended on to detect non-orthogonal subsystems
  security
    not exploited
  unforeseen dependencies
    cause build failures during development, common failure mode
  dependent order between peers
    x43,x42
      commutativity
    cfengine 1.X ignores?
    make implicitly avoids altering order
  parallelizing bad
  multiple admin agents bad
chomsky's hierarchy
  four levels
  state
    configuration tools pretend to be in here
      cfengine 
        detect/act
        no permanent history
      ISconf 
        timestamp state
        act only on permanent history
  turing
    useful model for illustrating points
    machines+config tool combo actually here
    difficult for admin to avoid turing equivalence
      more complexity than simple state mods means difficult to predict
      avoid pitfalls through ordering
    reload ruleset?
    cfengine
      one-tape 
    ISconf
      two-tape
    all collapse into one-tape
      self-modifying
        avoid pitfalls thru ordering
        direct
        indirect -- kernel shared lib etc.
          difficult to predict
          must be assumed self-modifying
      church?
      n-tape to 1-tape equivalence theorem
      complexity requires simplification
        ordering is a simplification
church
  order/outcome undecidable?
  discovering a valid ordering is like halting problem
table comparing cfengine, isconf
out of band changes
  security breach
LSB
  "transitivity of validation"
cannot prove that two configurations always exhibit same behavior?
  so two configuration operations undecidable by lifting?
test environments
  needed
  testing in production bad
enterprise consistency via reproducible sequences
  allows us to ignore orthogonality
'make' isn't the only way to go
ISconf
  stable for years
    barrier problem
  cloth wings and piano wire
  6 hours to implement at cat
  4 days clearing fud first
future
  kernel level support to intercept open() etc.
ISconf vs cfengine
  orthogonal to each other
  combination of convergence and congruence might be ideal
alva's recap
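
The arithmetic behind three figures in the outline above (the
100k-of-2GB sampling fraction, the 2^(N-S) states consistent with a
sample, and the 8! one-hour builds) can be checked with a short
sketch.  The function names are made up for this note, and the 2^(N-S)
check uses a 4-bit toy "disk" because enumerating a real one is
exactly what is infeasible.

```python
import math

HOURS_PER_YEAR = 24 * 365

def sampled_percent(sample_bytes, disk_bytes):
    """Percentage of the disk covered by a sample of the given size."""
    return 100.0 * sample_bytes / disk_bytes

def states_matching_sample(n_bits, mask, sample):
    """Count the n_bits-wide disk states that agree with `sample` on the
    bits selected by `mask`; every other bit is unconstrained."""
    return sum(1 for state in range(2 ** n_bits) if state & mask == sample)

def years_to_build_every_order(n_ops, hours_per_build=1.0):
    """Time to try every ordering of n_ops operations, one build each."""
    return math.factorial(n_ops) * hours_per_build / HOURS_PER_YEAR

# 100k sampled out of a 2GB disk (decimal units assumed).
print(sampled_percent(100_000, 2_000_000_000))

# N = 4 bits of disk, S = 2 bits sampled: 2^(N-S) states match the sample.
print(states_matching_sample(4, 0b0011, 0b0001))

# 8 operations, one hour per build.
print(round(years_to_build_every_order(8), 1))
```

Under the same assumptions, raising N from 8 to 9 operations pushes
the exhaustive test from about 4.6 years to more than 41.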


Tool Requirements:

- walking dependency tree depth-first

- preserving serialization of reusable sequences

- preserving serialization of reusable subsequences

- implicit assertion test of zero return codes for external commands

- implicit reproducible serialization of operations not explicitly
  serialized

- semaphores which child processes can use to signal async events to
  parent processes (like 'rebuild kernel and reboot when make is done')

- a hierarchical grouping of host attributes, so that host function
  can be determined quickly by eye

- re-usable ordered sets of grouped subsequences, so that new hosts
  can be created by prototype rather than by class (This is important.
  True class-based configuration tools don't seem to work in the
  field, while prototype-based systems consistently do.  I don't think
  I'm able to explain why yet; it may be nothing more than "this is
  the way sysadmins think", or there may be a more theoretical basis.
  I suspect, again, that it has to do with testing.)

These are things which ISconf/make already does.  In addition, over
the years those of us using ISconf have concluded that we also need: 

- postrequisites (like 'do foo after I'm done')

- decentralized state machine specification (rather than a monolithic
  makefile or script)

- lexically bound syntax (I want to be able to specify each operation
  in the language most suitable for that operation)

- separation of action code from site-specific configuration data

- decentralized editing of state machine specifications (no need to
  log into gold server to update makefile)

- state machine and file transport language integrated, as in
  cfengine, to remove need for NFS mounts to get packages on demand

- embedded documentation, like POD; including dynamically generated
  runbooks and training checklists generated per host class (the
  latter actually looks easy -- name a person to be trained as a
  "target host" and "build" them, checking off training actions as
  they complete)
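
Three of the requirements in the first list (depth-first walk of the
dependency tree, reproducible serialization of peers that nobody
ordered explicitly, and the implicit assertion of zero return codes)
can be sketched in a few lines.  This is not how ISconf or make is
implemented; DEPS, COMMANDS, build_order and run_steps are
hypothetical names, and the "commands" are no-op Python invocations
standing in for real administrative actions.

```python
import subprocess
import sys

def build_order(target, deps, done=None, visiting=None):
    """Depth-first walk: every prerequisite lands before its dependent.
    Peers are visited in the order they are listed, so the resulting
    sequence is the same on every run."""
    if done is None:
        done = []
    if visiting is None:
        visiting = set()
    if target in done:
        return done
    if target in visiting:
        raise ValueError("dependency cycle at " + target)
    visiting.add(target)
    for prereq in deps.get(target, []):
        build_order(prereq, deps, done, visiting)
    visiting.discard(target)
    done.append(target)
    return done

def run_steps(order, commands):
    """Run each step's command; check=True makes any nonzero return
    code raise CalledProcessError and stop the sequence."""
    for name in order:
        subprocess.run(commands[name], check=True)

# Hypothetical dependency graph for one host.
DEPS = {"web": ["os", "httpd"], "httpd": ["os", "ssl"], "ssl": ["os"]}

# Placeholder commands: each one just exits with status 0.
COMMANDS = {name: [sys.executable, "-c", "pass"]
            for name in ("os", "ssl", "httpd", "web")}

order = build_order("web", DEPS)
print(order)
run_steps(order, COMMANDS)
```

Listing order stands in for "implicit serialization" here; a tool
could just as well sort peers by name, as long as the rule never
changes between runs.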


Conclusion:

I suspect deterministic ordering is the airfoil of self-administering
systems.  If I'm right, then other techniques, while
fascinating, might be roughly equivalent to wing-flapping mechanisms.
It took centuries for aerodynamics experimenters to figure out and
prove to each other that the flapping wasn't important, but the
airfoil was.  I think we may be in a similar situation here.

I know that to encourage things to get out of sequence is to risk
breaking production machines in such a way that users will notice.
This is exactly what we're supposed to prevent.  It's a sysadmin's
version of the Hippocratic oath: "First, do no harm".

"But ordering isn't as important as long as you test your changes
before putting them in production."  Sure.  But to do that, the
production machines have to have been built *exactly* the same way
that the test environment was, or else the test is invalid.  The only
way to ensure that test and production are built the same way is to,
well, build them the same way, applying changes in the same order in
each case.
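
A minimal sketch of "build them the same way", assuming an
append-only journal of named operations and a per-host permanent
history.  Host, apply_pending, set_key and the journal entries are
all invented for this example; the real ISconf keeps its history
differently.  The point is only that a host applies whatever it has
not yet applied, in journal order, and refuses to continue if its
history is not a prefix of the journal.

```python
class Host:
    """Toy host: a state dict plus a permanent history of applied ops."""
    def __init__(self, name):
        self.name = name
        self.state = {}
        self.history = []

def apply_pending(host, journal):
    """Apply, in order, every journal entry the host has not yet run."""
    names = [name for name, _ in journal]
    if host.history != names[:len(host.history)]:
        raise RuntimeError(host.name + ": history is not a prefix of the journal")
    for name, op in journal[len(host.history):]:
        op(host.state)
        host.history.append(name)

def set_key(key, value):
    """Build a hypothetical operation that records one setting."""
    def op(state):
        state[key] = value
    return op

journal = [
    ("001-base", set_key("os", "base")),
    ("002-httpd", set_key("httpd", "installed")),
]

test_host, prod_host = Host("test"), Host("prod")

apply_pending(test_host, journal)                    # test gets built first
journal.append(("003-ssl", set_key("ssl", "enabled")))  # new work goes on the end
apply_pending(test_host, journal)                    # test picks up only 003
apply_pending(prod_host, journal)                    # prod replays 001..003

print(test_host.history == prod_host.history, test_host.state == prod_host.state)
```

Because new operations are only ever added to the end, a host built
months ago and a host built today walk through the same sequence.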

"But you can just test in production, and if the change broke
something, you can always back it out.  If the backout fails, then
just re-install."  If anyone still feels this way, then they should
re-read the above paragraphs about reliability and downtime.  Testing
in production, and relying on scheduled downtime and backout windows,
eats into uptime numbers and precludes 24x7 operation.  A global
economy has no "off hours".  I think about this every time I'm waiting
in line at the Hertz counter after arriving in Tampa at 2 a.m. on a
Friday night, right smack in the middle of the Hertz scheduled
maintenance downtime.


References:

Daniel Hagerty
> total state of the machine is very important
> to ongoing mutation.  No, I haven't actually done work towards this, I
> just noticed it in how my infrastructure was failing.

Alva:
I recently bought a new book "Testing IT" by Watson?.  This book
serves as a really good starting point for an essay on validation and
its implications.

Alva:
Yes. The verdict so far is that cfengine 1.x is dead and that cfengine 2.x
will probably track long-term state.