The role of software in spacecraft accidents

The role of software in spacecraft accidents, Leveson, AIAA Journal of Spacecraft and Rockets, 2004

With thanks to @Di4naO (Thomas Depierre) who first brought this paper to my attention.

Following on from yesterday’s look at safety in AI systems, I thought it would make an interesting pairing to follow up with this 2004 paper from Nancy Leveson studying spacecraft accidents. What lessons can we learn that might help us think about, e.g., autonomous vehicles?

Leveson analyses five spacecraft accidents that occurred between 1996 and 2000: the explosion of the Ariane 5 launcher, the loss of the Mars Climate Orbiter, the destruction of the Mars Polar Lander later that same year, the placing of a Milstar satellite in an incorrect and unusable orbit, and the loss of contact with the SOHO (Solar and Heliospheric Observatory) spacecraft.

On the surface, the events and conditions involved in the accidents appear to be very different. A more careful, detailed analysis of the systemic factors, however, reveals striking similarities.

These similarities occur across three broad areas: the attitude towards safety (safety culture, or lack thereof); management and organisational factors; and technical deficiencies. As a bonus, I reckon you could run s/safety/security/g on the text of this paper and most of the lessons would equally apply.

What happened (the accidents)

  • The Ariane 5 explosion was put down to specification and design errors in inertial reference system software that had been reused from the Ariane 4.
  • The Mars Climate Orbiter (MCO) was lost on entry to the Martian atmosphere due to the use of imperial rather than metric units in the software feeding its trajectory models (one way to defend against this class of error is sketched just after this list).
  • The Mars Polar Lander (MPL) was destroyed during its landing attempt. Touchdown sensors were known to generate a false momentary signal when the landing legs were deployed, but the software requirements made no mention of this. The best guess is that the software interpreted these spurious signals as valid touchdown events and shut down the engines about 40 metres above the ground, causing the lander to fall to the surface, where it was destroyed on impact.
  • The Titan IV B-32 mission to place a Milstar satellite in orbit failed due to an incorrect roll rate filter constant (a configuration error).
  • Contact with SOHO was lost for four months due to a series of errors in making software changes, and problems during recovery from entering an emergency safe mode.
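
The MCO failure mode (mixing pound-force seconds and newton-seconds) is one you can push back against in code. Here’s a minimal sketch of the idea, written in Python and entirely my own illustration rather than anything from the mission software: wrap physical quantities in a type that carries its units, so that a conversion has to be stated explicitly at the interface instead of being assumed.

```python
# Illustrative sketch only (not flight code): representing physical quantities
# with explicit units so that an imperial/metric mix-up fails loudly instead
# of silently corrupting a trajectory model.

from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605  # 1 pound-force second = 4.448... newton seconds

@dataclass(frozen=True)
class Impulse:
    newton_seconds: float  # canonical SI representation

    @classmethod
    def from_lbf_seconds(cls, value: float) -> "Impulse":
        # the unit conversion is forced to happen, visibly, at the boundary
        return cls(value * LBF_S_TO_N_S)

    def __add__(self, other: "Impulse") -> "Impulse":
        if not isinstance(other, Impulse):
            raise TypeError("can only add Impulse to Impulse")
        return Impulse(self.newton_seconds + other.newton_seconds)

# A raw float with unstated units would be accepted anywhere; an Impulse
# forces the conversion question to be answered at the interface.
ground_estimate = Impulse.from_lbf_seconds(1.2)   # one team working in lbf-s
onboard_estimate = Impulse(5.0)                   # another working in SI
total = ground_estimate + onboard_estimate
print(total.newton_seconds)                       # unambiguously in N·s
```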

Safety Culture Flaws

The safety culture is the general attitude and approach to safety reflected by those working in an industry. The accident reports all described various aspects of complacency and a discounting or misunderstanding of the risks associated with software. Success is, ironically, one of the progenitors of accidents when it leads to overconfidence and cutting corners or making tradeoffs that increase risk. This phenomenon is not new, and is extremely difficult to counter when it enters the engineering culture of an organization.

The pressure of meeting cost and schedule goals led to too many corners being cut in the application of proven engineering practices and in the needed checks and balances. Overconfidence led to inadequate testing and review of changes, and protections built into processes were bypassed.

While management may express their concern for safety and mission risks, true priorities are shown during resource allocation… the first things to be cut are often system safety, system engineering, mission assurance, and operations, which are assigned a low priority and assumed to be the least critical parts of the project.

In such a climate, a culture of denial can take hold, where any evidence of significant risk has a tendency to be dismissed.

In most of the cases of failure, the individual software components were actually “working as designed.” It was the dysfunctional interactions among components that led to catastrophe. These are design errors, and thus appropriate techniques for handling design errors must be used.

Management and Organisational Factors

The five accidents studied during this exercise, as well as most other major accidents, exhibited common organizational and managerial flaws, notably a diffusion of responsibility and authority, limited communication channels, and poor information flow.

“Faster, Better, Cheaper” can too easily turn into just “cheaper” (cutting budgets). The resulting reduction in personnel and budgets left no one responsible for specific critical tasks. In many of the accidents, people were also simply overworked. Processes involving multiple parties were neither well defined nor completely understood by any of the participants, and plenty of things fell through the gaps between groups. Inadequate transition from development to operations also played a role in several of the accidents.

Most important, responsibility for safety does not seem to have been clearly defined outside of the quality assurance function on any of these programs. All the accident reports (except the Titan/Centaur) are surprisingly silent about their safety programs. One would think that the safety activities and why they had been ineffective would figure prominently in the reports.

It’s a common mistake, argues Leveson, to place safety efforts solely inside the QA function: “while safety is certainly one property (among many) that needs to be assured, safety cannot be engineered into a design through after-the-fact assurance activities alone.”

A dedicated and effective safety program ensures that someone is focusing attention on what the system is not supposed to do, and not just on what it is supposed to do. Both perspectives are necessary.
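
To make the “not supposed to do” half concrete, here’s a tiny hypothetical sketch (names and thresholds are invented, not drawn from any of these systems) of a negative requirement expressed as an executable guard sitting next to the normal control logic. In MPL terms, something like: engine shutdown shall not be commanded while the vehicle is still well above the surface, regardless of what the touchdown sensors claim.

```python
# Hypothetical illustration: a negative ("shall not") requirement expressed as a
# guard alongside the normal control logic. Names and thresholds are invented
# for the example, not taken from any flight software.

TOUCHDOWN_MAX_ALTITUDE_M = 5.0  # assumed: touchdown is only plausible below this altitude

def shutdown_permitted(altitude_m: float, touchdown_sensed: bool) -> bool:
    """Functional view: shut the engines down when touchdown is sensed."""
    # Safety constraint ("shall not"): never command shutdown while the vehicle
    # is still well above the surface, whatever the sensors report.
    if altitude_m > TOUCHDOWN_MAX_ALTITUDE_M:
        return False
    return touchdown_sensed

# A spurious sensor signal at 40 m (the MPL scenario) is rejected by the guard.
assert shutdown_permitted(altitude_m=40.0, touchdown_sensed=True) is False
assert shutdown_permitted(altitude_m=1.0, touchdown_sensed=True) is True
```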

Technical Deficiencies

The cultural and managerial flaws outlined above manifested themselves in technical deficiencies: (i) inadequate system and software engineering, (ii) inadequate review activities, (iii) ineffective system safety engineering, (iv) inadequate human factors engineering, and (v) flaws in the test and simulation environments.

I found it interesting as I worked through these to think about the parallels in machine learning and agent-based systems. They’re not always obvious…

Preventing system accidents falls into the province of system engineering — those building individual components have little control over events arising from dysfunctional interactions among components. As the systems we build become more complex (much of that complexity being made possible by the use of computers), system engineering will play an increasingly important role in the engineering effort. In turn, system engineering will need new modeling and analysis tools that can handle the complexity inherent in the systems we are building.

In almost all software-related aerospace accidents, it turns out that the software was behaving as designed, but the designed behaviour was not safe from a system viewpoint. Safety-related software errors arose from discrepancies between the documented requirements and what was actually needed for correct functioning of the system, and from misunderstandings about the software’s interface with the rest of the system. Note that these bugs originate in the mental models in the heads of the designers. More precisely specifying those same (flawed) models, e.g., using formal methods, isn’t going to help here. (Although the mental processes the designers go through in order to formally specify things certainly might help sharpen their mental models in places.)

Inadequate documentation of design rationale to allow effective review of design decisions is a very common problem in system and software specifications. The Ariane report recommends that justification documents be given the same attention as code and that techniques for keeping code and its justifications consistent be improved.

Where does all this fit in the world of evolutionary architectures and emergent designs (and learned behaviours!)? There’s a lean principle called the ‘last responsible moment.’ As a working tool it’s sometimes hard to know ‘in the moment’ if this really is the last responsible one or not (it’s much easier to tell with hindsight!). But in the abstract, we can think about this question: when is the last responsible moment in a project lifecycle to start thinking about safety? If that software is, for example, responsible for propelling me along a road at 60mph in a metal box, then I find it very hard not to answer “pretty damn early!”. (Related thought exercise: when would you say in general is the last responsible moment to start thinking about security?).

Moving on…

One of the most basic concepts in engineering critical systems is to “keep it simple.” … All the accidents, except MCO, involved either unnecessary software functions or software operating when it was not necessary. (The MCO report does not mention or discuss the software features).

Simple, understandable, explainable software wins when it comes to reasoning about safety. And DNNs?

Code reuse (also think ‘transfer learning’) is another common cause of safety problems. And not because the code being reused has bugs within it.

It is widely believed that because software has executed safely in other applications, it will be safe in the new one. This misconception arises from confusion between software reliability and safety: most accidents involve software that is doing exactly what it was designed to do, but the designers misunderstood what behavior was required and would be safe, i.e., it reliably performs the wrong function. The blackbox (externally visible) behavior of a component can only be determined to be safe by analyzing its effects on the system in which it will be operating, that is, by considering the specific operational context.
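
The Ariane 5 loss is the canonical illustration: an alignment function reused from Ariane 4 converted a 64-bit floating-point value related to horizontal velocity into a 16-bit signed integer, a conversion that could not overflow on Ariane 4 trajectories but did on Ariane 5’s. The sketch below captures the shape of that trap; it is my own Python illustration with invented numbers, not the real (Ada) flight code.

```python
# Illustrative sketch of the Ariane-style reuse trap (values are invented):
# a narrowing conversion that is safe within the envelope it was designed for,
# and fails only when the reused component meets a new operating context.

INT16_MIN, INT16_MAX = -32768, 32767

def to_int16(value: float) -> int:
    """Convert to a signed 16-bit integer, assuming the value always fits."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        # On Ariane 4 this case was considered impossible, so no handler was
        # provided; here it at least surfaces as an explicit error.
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return result

ariane4_like_value = 20_000.0   # within the envelope the code was designed for
ariane5_like_value = 64_000.0   # only reachable on the new, faster trajectory

print(to_int16(ariane4_like_value))      # fine: "reliable" in its original context

try:
    print(to_int16(ariane5_like_value))  # same code, unsafe in the new system
except OverflowError as exc:
    print("overflow:", exc)
```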

Code reuse and ‘keep it simple’ often seem to be in tension with one another: COTS software is often constructed with as many features as possible to make it commercially useful in a variety of systems (these days you could also think ‘open source libraries and frameworks’). “Thus there is a tension between using COTS versus being able to perform a safety analysis and have confidence in the safety of the system.”

Safety Engineering Practices

Although system safety engineering textbooks and standards include principles for safe design, software engineers are almost never taught them. As a result, software often does not incorporate basic safe design principles — for example, separating and isolating critical functions, eliminating unnecessary functionality, designing error-reporting messages such that they cannot be confused with critical data (which occurred in the Ariane 5 loss), and reasonableness checking of inputs and internal states…. We need to start applying safe design principles to software just as we do for hardware.
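
A couple of these principles translate directly into everyday code. As a rough sketch (mine, with invented names and limits, assuming nothing about the original systems): keep error reports structurally distinct from data, so that a status value can never be read as a measurement, and apply reasonableness checks to inputs before they reach the control logic.

```python
# Sketch of two of the principles above, in ordinary Python:
# 1. error reports are a separate type, so they cannot be mistaken for data;
# 2. inputs are checked for physical reasonableness before use.
# All names and limits here are invented for illustration.

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Reading:
    pitch_rate_deg_s: float

@dataclass(frozen=True)
class SensorFault:
    code: int
    description: str

MAX_PLAUSIBLE_PITCH_RATE = 50.0  # assumed physical limit for the example

def validate(raw: float) -> Union[Reading, SensorFault]:
    """Reasonableness check: out-of-range values become a fault, not data."""
    if abs(raw) > MAX_PLAUSIBLE_PITCH_RATE:
        return SensorFault(code=17, description="pitch rate outside plausible range")
    return Reading(pitch_rate_deg_s=raw)

result = validate(400.0)
if isinstance(result, Reading):
    # only a Reading ever reaches the control law
    print("use", result.pitch_rate_deg_s)
else:
    print("fault:", result.description)
```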

Unsafe behaviour should be identified before software development begins, and the design rationale and features used to prevent that unsafe behaviour should be documented so that they can be reviewed. “This presupposes, of course, a system safety process to provide the information, which does not appear to have existed for the projects that were involved in the accidents studied.” Providing the information needed to make safety-related engineering decisions is the major contribution of system safety techniques to engineering.

It has been estimated that 70 to 90% of the safety-related decisions in an engineering project are made during the early concept development stage. When hazard analyses are not performed, are done only after the fact (for example, as a part of quality or mission assurance of a completed design), or are performed but the information is never integrated into the system design environment, they can have no effect on these decisions and the safety effort reduces to a cosmetic and perfunctory role.
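
One lightweight way to make those early decisions visible and reviewable is to record each identified hazard alongside the design feature intended to control it and how that will be verified. The record below is a generic illustration (my own format, not one used on any of these projects), populated with the MPL scenario from earlier.

```python
# Generic illustration of a reviewable hazard-log entry; the fields and the
# example content are mine, not taken from the accident reports or any
# project's actual system-safety documentation.

from dataclasses import dataclass

@dataclass
class HazardLogEntry:
    hazard: str              # what must not happen
    cause: str               # how it could happen
    safety_constraint: str   # the "shall not" requirement placed on the design
    design_feature: str      # how the design enforces the constraint
    verification: str        # how reviewers and testers will confirm it

mpl_touchdown = HazardLogEntry(
    hazard="Premature descent-engine shutdown above the surface",
    cause="Transient touchdown-sensor signal during landing-leg deployment",
    safety_constraint="Engine shutdown shall not be commanded above touchdown altitude",
    design_feature="Touchdown indications ignored until radar altitude is below threshold",
    verification="Simulation of leg-deployment transients across the descent profile",
)

print(mpl_touchdown.hazard, "->", mpl_touchdown.design_feature)
```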

Having someone responsible for safety ensures that attention is paid to what the system is not supposed to do.