Analyzing software requirements errors in safety-critical embedded systems

Analyzing software requirements errors in safety-critical embedded systems Lutz, IEEE Requirements Engineering, 1993

With thanks once more to @Di4naO (Thomas Depierre) who first brought this paper to my attention.

We’re going even further back in time today to 1993, and a paper analysing safety-critical software errors uncovered during integration and system testing of the Voyager and Galileo spacecraft. There are 87 software errors in Voyager and 122 in Galileo classified as safety-related, since they have ‘potentially significant or catastrophic effects’. Unlike the errors we looked at yesterday, you could make the case that the overall system development process here was effective, in the sense that the problems were caught before the system was deployed. Lutz is interested in tracking down why so many safety-critical errors are found so late in the process, though.

The analysis methodology is a “3 whys” process:

This approach allows classification not only of the documented software error (called the program fault), but also of the earlier human error (the root cause, e.g., a misunderstanding of an interface specification), and, before that, of the process flaws that contribute to the likelihood of the error’s occurrence (e.g., inadequate communication between systems engineering and software development teams).

Having thus created an error profile of safety-related software errors, the paper concludes with a set of six guidelines to help prevent them.

The first why: program faults

Safety-related program faults break down into four categories:

  1. Behavioural faults — the software doesn’t follow the functional requirements. This category accounts for 52% of Voyager issues, and 47% of Galileo’s. (Functionality is present, but incorrect).
  2. Operating faults — a required operation is omitted from the software. “Often the omitted operation involves the failure to perform adequate reasonableness checks on data input to a module.” (Functionality is not present).
  3. Conditional faults — nearly always an erroneous value on a condition or limit. Conditional faults had a high (73%) chance of being safety related. “The association between conditional faults and safety-related software errors emphasizes the importance of specifying the correct values for any data used in control decisions in safety-critical, embedded software.” (The sketch after this list illustrates both an operating and a conditional fault.)
  4. Interface faults — incorrect interactions with other system components.
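To make categories 2 and 3 concrete, here is a minimal hypothetical C sketch (my illustration, not from the paper; the names, scale factor, and limits are all invented). The missing range check on the raw reading is an operating fault; the wrong limit in the guard condition is a conditional fault.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical telemetry handler -- all names and limits are invented.
 * Assume a 12-bit ADC and a documented safe bus-voltage range of 22-36 V. */

static float adc_to_volts(uint16_t raw)
{
    /* Operating fault: no reasonableness check that raw <= 0x0FFF,
     * so a corrupted reading flows straight into the control decision. */
    return (float)raw * (40.0f / 4095.0f);
}

static bool bus_voltage_ok(uint16_t raw)
{
    float volts = adc_to_volts(raw);

    /* Conditional fault: the upper limit should be 36.0f, so valid
     * readings near the top of the range are wrongly rejected. */
    return (volts > 22.0f) && (volts < 35.0f);
}

int main(void)
{
    printf("raw=3600 -> ok=%d\n", bus_voltage_ok(3600)); /* ~35.2 V: within spec, wrongly rejected */
    printf("raw=9999 -> ok=%d\n", bus_voltage_ok(9999)); /* corrupted raw (>0x0FFF) never flagged as invalid data */
    return 0;
}
```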

The second why: from program faults to human factors

Communication errors between teams (rather than within teams) are the major cause of interface faults:

Safety-related interface faults are associated overwhelmingly with communications errors between a development team and others (often between software developers and systems engineers), rather than communications errors within teams.

The primary cause of the behavioural, operating, and conditional faults is errors in recognising (understanding) the requirements.

The third why: from human factors to process flaws

The third stage of the analysis examines flaws or inadequacies in the control of system complexity, as well as associated process flaws in the communication or development methods used.

For safety-related interface faults, the most common complexity control flaw is interfaces not adequately identified or understood.

On the process side, the flaws include a lack of documentation of hardware behaviour, poor communication between hardware and software teams, and undocumented (or uncommunicated) interface specifications. In summary, the software developers are working with incomplete information and erroneous assumptions about the environment in which the software will operate. Not all of these things are perfectly knowable up front though:

Leveson (1991) listed a set of common assumptions that are often false for control systems, resulting in software errors. Among these assumptions are that the software specification is correct, that it is possible to predict realistically the software’s execution environment (e.g., the existence of transients), and that it is possible to anticipate and specify correctly the software’s behavior under all possible circumstances.
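To make this kind of interface gap concrete, here is a purely hypothetical C sketch (not from the paper; the register, scaling, and values are invented) of a software team and a hardware team holding different, undocumented assumptions about the same register:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical example -- the register, scaling, and values are invented.
 * Hardware convention (never written down): the gyro rate register holds
 * tenths of a degree per second, two's complement. */
static int16_t read_gyro_rate_register(void)
{
    return -1234;   /* stand-in for a memory-mapped read: -123.4 deg/s */
}

int main(void)
{
    int16_t raw = read_gyro_rate_register();

    /* Software team's assumption (from an incomplete interface spec):
     * the register is already in whole degrees per second. */
    float assumed_deg_per_s = (float)raw;
    float actual_deg_per_s  = (float)raw / 10.0f;

    printf("assumed: %.1f deg/s, actual: %.1f deg/s\n",
           assumed_deg_per_s, actual_deg_per_s);
    return 0;
}
```

The program fault shows up as an incorrect interaction with another system component; the root cause is a misunderstanding of the interface; and the process flaw is that the convention was never documented or communicated across the team boundary.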

For functional faults, the most common cause is requirements that have not been identified. Missing requirements are involved in nearly half of the safety-related errors that involve recognising requirements. Imprecise or unsystematic specifications were more than twice as likely to be associated with safety-related functional faults as with non-safety-related ones.

These results suggest that the sources of safety-related software errors lie farther back in the software development process — in inadequate requirements — whereas the sources of non-safety-related errors more commonly involve inadequacies in the design phase.

Six recommendations for reducing safety-related software errors

Since safety-related software errors tend to be produced by different mechanisms than non-safety-related errors, we should be able to improve system safety by targeting safety-related error causes. Lutz presents six guidelines:

  1. Focus on the interfaces between the software and the system in analyzing the problem domain, since these interfaces are a major source of safety-related software errors.
  2. Identify safety-critical hazards early in the requirements analysis.
  3. Use formal specification techniques in addition to natural-language software requirements specifications. “The capability to describe dynamic events, the timing of process interactions in distinct computers, decentralized supervisory functions, etc., should be considered in choosing a formal method.”
  4. Promote informal communication among teams: “the goal is to be able to modularize responsibility in a development project without modularizing communication about the system under development.” For example, the identification and tracking of safety hazards in a system is clearly best done across team boundaries.
  5. As requirements evolve, communicate the changes to the development and test teams: “frequently, changes that appear to involve only one team or system component end up affecting other teams or components at some later date…”
  6. Include requirements for “defensive design.” This includes things like input validity checking, error-handling, overflow protection, signal saturation limits, and system behaviour under unexpected conditions (see the sketch below).

Requirements specifications that account for worst-case scenarios, models that can predict the range of possible (rather than allowable) values, and simulations that can discover unexpected interactions before system testing contribute to the system’s defense against hazards.
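As a closing illustration of guideline 6, here is a minimal hypothetical C sketch of what “defensive design” requirements might translate into (the command names and limits are invented, not from the paper): validate inputs, saturate outputs, and define a safe behaviour for unexpected conditions.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical defensive-design sketch -- names and limits are invented. */
#define THRUSTER_CMD_MIN  0
#define THRUSTER_CMD_MAX  1000   /* assumed actuator limit */

/* Signal saturation: clamp a computed value to its physical range rather
 * than letting it drive the actuator out of bounds. */
static int32_t saturate(int32_t value, int32_t lo, int32_t hi)
{
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}

/* Input validity check plus defined behaviour under unexpected conditions:
 * an out-of-range command falls back to a safe default (thruster off)
 * instead of being passed through unchecked. */
static int32_t apply_thruster_cmd(int32_t cmd)
{
    if (cmd < THRUSTER_CMD_MIN || cmd > THRUSTER_CMD_MAX) {
        return THRUSTER_CMD_MIN;   /* safe default */
    }
    return cmd;
}

int main(void)
{
    int32_t computed = 1500;   /* e.g. raw output of a control law */
    printf("saturated control output: %d\n",
           saturate(computed, THRUSTER_CMD_MIN, THRUSTER_CMD_MAX));

    printf("cmd=500  -> %d\n", apply_thruster_cmd(500));   /* valid: passed through */
    printf("cmd=5000 -> %d\n", apply_thruster_cmd(5000));  /* invalid: safe default */
    return 0;
}
```

The point of the guideline is that these behaviours should be stated as requirements up front, not left for developers to improvise during implementation.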