Thinking:

Automated systems administration is very straightforward. There is only one way to change the contents of disk or RAM in a running UNIX machine -- the syscall interface. The task of automated administration is simply to make sure that each machine's kernel gets the right system calls, in the right order, to make it be the machine you want it to be.

- sampling disk state alone isn't enough to determine disk state, unless you sample the *whole* disk
- ...so state machines driven by only disk samples produce non-deterministic behavior
- ...so the equivalence of behavior of two different state machines driven only by disk samples is undecidable (that's easy)
- configuration driven by only deterministic ordering of individual state machines produces deterministic behavior
- because the state machines reside on and execute within the context of the driven disk, they can modify each other, as well as themselves
- ...so the behavior of individual state machines is dependent on the order of invocation of all state machines
- ...so whether two different orders of configuration operations exhibit the same behavior is undecidable (takes me a while to get here)

Body:

if sequence changes, then revalidation required sampling not enough ordering is only one way to change a unix machine -- syscall interface make sure same syscalls in same order behavior will always change if disk state changes cat /dev/* problem domain simplicity, harried admins, commercial environment remove unpredictability, nondeterminism through validation, repeatability, invariance [alva] log files, user data not invariant, not subject to validation comment lines in config files are subject to validation otherwise why have comment lines?
they are human code congruence, not convergence diminishing returns as automation increases [Northrup] enterprise-wide consistent builds and subsequent management continuous management over life of machine some operations can be done live some must be done at boot most operations must be repeated no rebuilds in production no testing in production no convergence more you automate, the easier [layers] convergence is opposite testing is role of admin, not vendor vendor unit testing admin integration testing full auto can't be done halfway politics main issue until we reach critical mass of automated sites problem solving vs. corrective action humans solve problems code implements action tend to be commercial why don't tend to be academic why athena exception should they be different? disk state 2^N only spot checks are possible suitable only for (partial) convergence, not congruence 2^(N-S) cfengine encourages use of spot checks 100k is .005 percent of a 2Gb disk any rule can be triggered by 2^(N-S) disk states non-deterministic? sampling cannot be deterministic, no matter how careful action triggered by result of sample can be retriggered by future change editing prior actions ignore it ISconf act only on permanent history sequence is a full expression of a host managed unmanaged 2! hacmp cluster build culling and leaves orthogonality local admin talent -- who decides? difficult to train reproducible outcome testing and validation changing order means changing set randomization N!
means infeasible to prove same outcome N as low as 8 or 9 full regression math genetic algorithm build is 1-hour fitness function DNA 8 = 4.6 years determining set discovery order dependency graph 8-30 deep for hacmp breaking implicit order breaks build show examples discovering sequence is easy discovery order P, P' ISconf makefile edit deterministic works easy to train start at beginning of host life only add new to end testing and validation new hosts create by prototype, not class include known subsequences internal ordering preserved ISconf/make expresses sequenced groups of sequenced operations human error (missing deps) masked by deterministic make behavior randomized known to break only one possible disk state represented by given timestamps deterministic state transitions convergence-only tools don't care about sequence can't do it "can't just tack things onto end" [alva] must recreate convergence via rewrite editing prior actions admin can't be depended on to detect non-orthogonal subsystems security not exploited unforeseen dependencies cause build failures during development, common failure mode dependent order between peers x43,x42 commutativity cfengine 1.X ignores? make implicitly avoids altering order parallelizing bad multiple admin agents bad chomsky's hierarchy four levels state configuration tools pretend to be in here cfengine detect/act no permanent history ISconf timestamp state act only on permanent history turing useful model for illustrating points machines+config tool combo actually here difficult for admin to avoid turing equivalence more complexity than simple state mods means difficult to predict avoid pitfalls through ordering reload ruleset? cfengine one-tape ISconf two-tape all collapse into one-tape self-modifying avoid pitfalls thru ordering direct indirect -- kernel shared lib etc. difficult to predict must be assumed self-modifying church?
n-tape to 1-tape equivalence theorem complexity requires simplification ordering is a simplification church order/outcome undecidable? discovering a valid ordering is like halting problem table comparing cfengine, isconf out of band changes security breach LSB "transitivity of validation" cannot prove that two configuration always exhibit same behavior? so two configuration operations undecidable by lifting? test environments needed testing in production bad enterprise consistency via reproducible sequences allows us to ignore orthogonality 'make' isn't the only way to go ISconf stable for years barrier problem cloth wings and piano wire 6 hours to implement at cat 4 days clearing fud first future kernel level support to intercept open() etc. ISconf vs cfengine orthogonal to each other combination of convergence and congruence might be ideal alva's recap

Tool Requirements:

- walking dependency tree depth-first
- preserving serialization of reusable sequences
- preserving serialization of reusable subsequences
- implicit assertion test of zero return codes for external commands
- implicit reproducible serialization of operations not explicitly serialized
- semaphores which child processes can use to signal async events to parent processes (like 'rebuild kernel and reboot when make is done')
- a hierarchical grouping of host attributes, so that host function can be determined quickly by eye
- re-usable ordered sets of grouped subsequences, so that new hosts can be created by prototype rather than by class (This is important. True class-based configuration tools don't seem to work in the field, while prototype-based systems consistently do. I don't think I'm able to explain why yet; it may be nothing more than "this is the way sysadmins think", or there may be a more theoretical basis. I suspect, again, that it has to do with testing.)

These are things which ISconf/make already does.
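
Three of the requirements above (walking the dependency tree depth-first, reproducible serialization of operations not explicitly serialized, and the implicit assertion of zero return codes) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how ISconf or make is implemented: the names `build_order`, `run`, `DEPS`, and the host/package names are hypothetical, and plain callables returning an integer stand in for external commands.

```python
from typing import Callable, Dict, List, Set


def build_order(deps: Dict[str, List[str]], target: str) -> List[str]:
    """Walk the dependency tree depth-first, prerequisites first.

    Prerequisites are visited in the order they are listed, so peers that
    are not explicitly serialized still come out in the same order on
    every run.
    """
    order: List[str] = []
    done: Set[str] = set()
    visiting: Set[str] = set()

    def visit(node: str) -> None:
        if node in done:
            return
        if node in visiting:
            raise ValueError(f"dependency cycle at {node!r}")
        visiting.add(node)
        for prereq in deps.get(node, []):
            visit(prereq)
        visiting.discard(node)
        done.add(node)
        order.append(node)

    visit(target)
    return order


def run(order: List[str],
        actions: Dict[str, Callable[[], int]],
        history: List[str]) -> List[str]:
    """Run each operation not yet recorded in `history`, in order.

    A nonzero return code is treated as a failed assertion: the run stops
    and nothing after the failing operation is recorded.
    """
    for name in order:
        if name in history:
            continue  # act only on permanent history: never redo a step
        rc = actions[name]()
        if rc != 0:
            raise RuntimeError(f"{name} failed with return code {rc}")
        history.append(name)
    return history


# Hypothetical dependency graph for a single host.
DEPS = {
    "web_host": ["base_os", "apache"],
    "apache": ["base_os", "openssl"],
    "openssl": ["base_os"],
    "base_os": [],
}

ORDER = build_order(DEPS, "web_host")
# ORDER == ["base_os", "openssl", "apache", "web_host"]
```

Because `history` only ever grows at the end, running the same order against a fresh host replays the same sequence, and running it again against a finished host does nothing.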

In addition, over the years those of us using ISconf have concluded that we also need:

- postrequisites (like 'do foo after I'm done')
- decentralized state machine specification (rather than a monolithic makefile or script)
- lexically bound syntax (I want to be able to specify each operation in the language most suitable for that operation)
- separation of action code from site-specific configuration data
- decentralized editing of state machine specifications (no need to log into gold server to update makefile)
- state machine and file transport language integrated, as in cfengine, to remove need for NFS mounts to get packages on demand
- embedded documentation, like POD; including dynamically generated runbooks and training checklists generated per host class (the latter actually looks easy -- name a person to be trained as a "target host" and "build" them, checking off training actions as they complete)

Conclusion:

I suspect deterministic ordering is the airfoil of self-administering systems. If I'm right, then other techniques, while fascinating, might be roughly equivalent to wing-flapping mechanisms. It took centuries for aerodynamics experimenters to figure out and prove to each other that the flapping wasn't important, but the airfoil was. I think we may be in a similar situation here.

I know that to encourage things to get out of sequence is to risk breaking production machines in such a way that users will notice. This is exactly what we're supposed to prevent. It's a sysadmin's version of the Hippocratic oath: "First, do no harm".

"But ordering isn't as important as long as you test your changes before putting them in production." Sure. But to do that, the production machines have to have been built *exactly* the same way that the test environment was, or else the test is invalid. The only way to ensure that test and production are built the same way is to, well, build them the same way, applying changes in the same order in each case.
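
The point about applying changes in the same order can be illustrated with a toy model. The sketch below is not ISconf code and makes several assumptions: disk state is modeled as a Python dict, and `install_package`, `patch_config`, and the path `/etc/app.conf` are hypothetical names invented for the example. It shows two operations that do not commute, and reproduces the arithmetic from the outline notes (a one-hour build and 8 operations means 8! builds, roughly 4.6 years of full regression testing).

```python
import math
from itertools import permutations
from typing import Callable, Dict, Optional, Sequence, Set, Tuple

State = Dict[str, str]
Op = Callable[[State], State]


def install_package(state: State) -> State:
    """Hypothetical operation: installing writes a default config file."""
    new = dict(state)
    new["/etc/app.conf"] = "port=80"
    return new


def patch_config(state: State) -> State:
    """Hypothetical operation: edit the config file, if it exists."""
    new = dict(state)
    if "/etc/app.conf" in new:
        new["/etc/app.conf"] = new["/etc/app.conf"].replace("port=80", "port=8080")
    return new


def apply_all(ops: Sequence[Op], state: Optional[State] = None) -> State:
    """Apply operations in the given order, starting from `state`."""
    current = dict(state or {})
    for op in ops:
        current = op(current)
    return current


def distinct_outcomes(ops: Sequence[Op]) -> Set[Tuple[Tuple[str, str], ...]]:
    """Try every ordering of `ops` and collect the distinct final states."""
    return {tuple(sorted(apply_all(order).items())) for order in permutations(ops)}


def regression_years(n_ops: int, hours_per_build: float = 1.0) -> float:
    """Years needed to build and test all n_ops! orderings, one at a time."""
    return math.factorial(n_ops) * hours_per_build / (24 * 365)


# Same two operations, different order, different machine:
# apply_all([install_package, patch_config])  -> {"/etc/app.conf": "port=8080"}
# apply_all([patch_config, install_package])  -> {"/etc/app.conf": "port=80"}
# regression_years(8) is about 4.6
```

With only two operations the difference is easy to spot by eye; the factorial growth is what makes it impractical to validate a reordered sequence by exhaustive rebuilds once a host has more than a handful of operations.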
"But you can just test in production, and if the change broke something, you can always back it out. If the backout fails, then just re-install". If anyone still feels this way, then they should re-read the above paragraphs about reliability and downtime. Testing in production, and relying on scheduled downtime and backout windows, eats into uptime numbers and precludes 24x7 operation. A global economy has no "off hours". I think about this every time I'm waiting in line at the Hertz counter after arriving in Tampa at 2 a.m. on a Friday night, right smack in the middle of the Hertz scheduled maintenance downtime. References: Daniel Hagerty > total state of the machine is very important > to ongoing mutation. No, I haven't actually done work towards this, I > just noticed it in how my infrastructure was failing. Alva: I recently bought a new book "Testing IT" by Watson?. This book serves as a really good starting point for an essay on validation and its implications. Alva: Yes. The verdict so far is that cfengine 1.x is dead and that cfengine 2.x will probably track long-term state.