How Complex Systems Fail | the morning paper

How Complex Systems Fail – Cook 2000

This is a wonderfully short, easy-to-read paper on how complex systems fail. It’s written by a doctor (MD) in the context of systems of patient care, but that makes it all the more fun to translate the lessons into complex IT systems, including their human operator components. The paper consists of 18 observations. Here are some of my favourites…

Complex systems are intrinsically hazardous, which over time drives the creation of defense mechanisms against those hazards. (Things can go wrong, and we build up mechanisms to try to prevent that from happening.)
Complex systems are heavily and successfully defended against failure, since the high consequences of failure lead to the build-up of defenses against it over time.
Because of this, a catastrophe requires multiple failures – single point failures are generally not sufficient to trigger catastrophe.
The complexity of complex systems makes it impossible for them to run without multiple flaws being present. Because these flaws are individually insufficient to cause failure, they are regarded as minor factors during operations.
Complex systems therefore run in degraded mode as their normal mode of operation!
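
To make the last few observations concrete, here is a minimal sketch of layered defenses. The flaw probabilities are made-up numbers, and the model assumes the layers fail independently, which real systems often do not. It only illustrates why "some flaw is present" can be the everyday state while "every defense is breached at once" stays rare.

```python
from math import prod

# Hypothetical probability that each defensive layer is flawed at a given moment.
flaw_probability = [0.10, 0.05, 0.20, 0.10]

def p_all_flawed(ps):
    """Probability that every layer is flawed at once (assumes independence)."""
    return prod(ps)

def p_any_flawed(ps):
    """Probability that at least one layer is flawed (assumes independence)."""
    return 1 - prod(1 - p for p in ps)

catastrophe = p_all_flawed(flaw_probability)
degraded = p_any_flawed(flaw_probability)

print(f"at least one latent flaw present: {degraded:.3f}")
print(f"all defenses breached together:   {catastrophe:.6f}")
```

With these illustrative numbers the system spends more than a third of its time with at least one latent flaw, yet the joint breach is several orders of magnitude less likely. Adding more components pushes the first number towards 1, which is the "degraded mode is normal" point.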

After accident reviews nearly always note that the system has a history of prior ‘proto-accidents’ that nearly generated catastrophe. Arguments that these degraded conditions should have been recognized before the overt accident are usually predicated on naïve notions of system performance. System operations are dynamic, with components (organizational, human, technical) failing and being replaced continuously.

Catastrophe is always just around the corner!
Post-accident attribution to a ‘root cause’ is nearly always wrong.

Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessarily insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident. Indeed, it is the linking of these causes together that creates the circumstances required for the accident. Thus, no isolation of the ‘root cause’ of an accident is possible. The evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.
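
The "individually insufficient, jointly sufficient" argument above can be sketched as a toy model. The contributor names are hypothetical and the accident rule (all contributors must be present) is a deliberate simplification, not something taken from the paper.

```python
# Hypothetical contributors to one incident; the names are illustrative only.
contributors = {
    "stale_config": True,
    "alert_muted": True,
    "failover_untested": True,
}

def accident(state):
    """Toy rule: the accident needs every contributor present at once."""
    return all(state.values())

def but_for_causes(state):
    """Contributors whose removal alone would have prevented the accident."""
    return [name for name in state
            if accident(state) and not accident({**state, name: False})]

def sufficient_alone(state):
    """Contributors that would cause the accident with all others absent."""
    return [name for name in state
            if accident({k: (k == name) for k in state})]

print(but_for_causes(contributors))    # every contributor passes the test
print(sufficient_alone(contributors))  # none is sufficient on its own
```

Every contributor passes the "would it have happened without this?" test equally, and none is sufficient alone, so picking one of them as the root cause is a choice made by the investigator rather than a property of the system.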

Hindsight bias (believing, once the outcome is known, that the consequences of an action should have been obvious beforehand) remains the primary obstacle to accident investigation.
Dev and Ops is a combined role 😉 :

The system practitioners operate the system in order to produce its desired product and also work to forestall accidents. This dynamic quality of system operation, the balancing of demands for production against the possibility of incipient failure is unavoidable. Outsiders rarely acknowledge the duality of this role. In non-accident filled times, the production role is emphasized. After accidents, the defense against failure role is emphasized. At either time, the outsider’s view misapprehends the operator’s constant, simultaneous engagement with both roles.

Changes introduce new forms of failure.
“Post-accident remedies usually increase the coupling and complexity of the system. This increases the potential number of latent failures and also makes the detection and blocking of accident trajectories more difficult.”
Safety is a characteristic of the system, and not of individual components (you can build a reliable system out of unreliable parts). And,

Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system. This means that safety cannot be manipulated like a feedstock or raw material. The state of safety in any system is always dynamic; continuous systemic change insures that hazard and its management are constantly changing.
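
The earlier parenthetical about building a reliable system out of unreliable parts can be illustrated with a small calculation. This is a sketch under strong assumptions: identical replicas, a hypothetical per-replica availability of 0.9, a majority-quorum rule, and fully independent failures.

```python
from math import comb

def majority_availability(n, a):
    """Probability that a strict majority of n replicas is up,
    assuming each is up independently with probability a."""
    need = n // 2 + 1
    return sum(comb(n, k) * a**k * (1 - a)**(n - k)
               for k in range(need, n + 1))

# Compare a single replica with 3- and 5-replica majority quorums.
for n in (1, 3, 5):
    print(n, round(majority_availability(n, 0.9), 5))
```

The gain depends entirely on the independence assumption. Coupling between components, which the quote above on post-accident remedies warns about, erodes it, and that is one way to read the claim that safety belongs to the system rather than to its parts.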

Failure-free operations require experience with failure. More robust system performance is likely to arise in systems where operators can discern ‘the edge of the envelope’ and how their actions move system performance towards or away from it.