The objective of modeling and analysis is to demonstrate that a system is reliable over a certain mission time or safe to within the desired degree of risk, and to determine and address critical components of hazards. This generally involves demonstrating that the system operates as intended even in the presence of a prescribed number, set, or set of sequences of faults. The demonstration becomes difficult as the set of faults grows, and approaches impossibility as the set becomes large. It depends heavily on fault and hazard models, so a great deal of experimental data is helpful in attaining realism. To simplify analysis, some approaches first define the set of errors or hazards, and then work backward to find all sequences of faults that could produce them.
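The backward approach can be illustrated with a small fault tree. What follows is a minimal sketch under stated assumptions, not a method prescribed here: the tree, its event names (`overpressure`, `relief_valve_stuck`, and so on) and the helper `cut_sets` are all hypothetical, and the sketch ignores the ordering of faults, so it finds sets of faults rather than sequences.

```python
from itertools import product

# Hypothetical fault tree. Each hazard or intermediate event maps to
# (gate, children); any name that is not a key is treated as a basic fault.
TREE = {
    "overpressure": ("AND", ["relief_valve_stuck", "pressure_rises"]),
    "pressure_rises": ("OR", ["controller_fault", "sensor_fault"]),
}


def cut_sets(event, tree):
    """Work backward from `event` to the minimal sets of basic faults
    that are sufficient to produce it."""
    if event not in tree:
        return [frozenset([event])]  # a basic fault causes itself
    gate, children = tree[event]
    child_sets = [cut_sets(child, tree) for child in children]
    if gate == "OR":
        # Any one child's fault set is enough.
        combined = [s for sets in child_sets for s in sets]
    else:
        # AND gate: one fault set from every child must occur together.
        combined = [frozenset().union(*combo) for combo in product(*child_sets)]
    unique = set(combined)
    # Drop any set that strictly contains another; it is not minimal.
    minimal = [s for s in unique if not any(other < s for other in unique)]
    return sorted(minimal, key=sorted)


if __name__ == "__main__":
    for faults in cut_sets("overpressure", TREE):
        print(sorted(faults))
```

Enumerating fault sets this way also shows why the analysis becomes hard as the fault set grows: each AND gate multiplies the number of combinations that must be considered.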
Hazard analysis starts from system hazard analysis, and identifies the hazards associated with different implementation levels so that the level covering each fault can be explicitly considered. Hazards are generally broken down into priorities, the highest of which is usually the completion of the mission being designed for, and the lowest of which is usually minimizing stresses on the system. Safety requirements are then derived from the hazard analysis. These should include requirements for detecting, eliminating, and controlling hazards, for limiting damage in case of an accident, for ways in which the system can fail safely, and for the extent to which failure is tolerable. Safety requirements may conflict with other requirements, and these conflicts must be identified and resolved.
Verification and analysis are not usually considered sufficient for protection in highly hazardous situations, because these techniques are so complex as to be error-prone, and eliminating all hazards may be too costly for practical use. A goal that often conflicts with safety is the desire to minimize verification and certification requirements, which are quite costly and time-consuming. In many cases, we may choose to control hazards rather than prevent their occurrence. Risk is reduced by reducing hazard likelihood, severity, or both. Hazards can be prevented, or they can be detected and treated. Prevention of hazards tends to involve reducing functionality or inhibiting design freedom, while detection of hazards is often difficult and unreliable.
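The two ways of reducing risk can be made concrete with a small calculation. This sketch assumes one common simplification that is not stated above: risk is measured as expected loss, the product of likelihood and severity. The `Hazard` class, the hazard name, and all of the numbers are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Hazard:
    name: str
    likelihood: float  # assumed probability of occurrence per mission (0..1)
    severity: float    # assumed loss if the hazard occurs, in arbitrary units

    def risk(self) -> float:
        # Assumed risk measure: expected loss = likelihood x severity.
        return self.likelihood * self.severity


baseline = Hazard("tank_rupture", likelihood=0.01, severity=1000.0)
# Reducing likelihood: for example, a design change that makes the hazard rarer.
less_likely = replace(baseline, likelihood=0.001)
# Reducing severity: for example, containment that limits the damage.
less_severe = replace(baseline, severity=100.0)

if __name__ == "__main__":
    for hazard in (baseline, less_likely, less_severe):
        print(hazard.name, hazard.likelihood, hazard.severity, hazard.risk())
```

Under this assumed measure either change lowers the risk, which is why controlling a hazard can be an acceptable alternative to preventing it outright.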
The goal of hazard prevention is to attain intrinsic safety by designing the system so that faults are non-hazardous. General techniques for hazard prevention include minimization of complexity, separation of safety-critical functions and data, limiting actions of subsystems, minimizing interfaces (not only for reliability enhancement, but to aid in verification and certification of safety), firewalls (both in the physical and logical sense), authority limitation, access limitations, minimization of hazardous states or time in hazardous states, control flow limitations, sequence controls, and hierarchical design.
The goal of hazard detection is the detection of unsafe states. The first step in detection is identification of safety-critical items. General techniques for detection include checking safety assertions, periodic or continuous monitoring, exception handling, watchdog timers, acceptance tests, and algorithm redundancy. Typically, these checks use replication of operations in time and/or in space, timing checks, reversal checks, coding checks, reasonableness checks, structural checks, diagnostic checks, and checks for hazardous conditions.
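Two of the checks named above, a watchdog timer (a timing check) and a reasonableness check, can be sketched in a few lines. This is an illustrative sketch only: the `Watchdog` class, the `reasonable` function, and their parameters are hypothetical, and a real watchdog is normally implemented in hardware or by the operating system rather than polled in application code.

```python
import time


class Watchdog:
    """Timing check: reports a fault if `kick()` has not been called
    within the last `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable so the timer can be tested
        self.last_kick = clock()

    def kick(self):
        # Called by the monitored task to show it is still making progress.
        self.last_kick = self.clock()

    def expired(self):
        return self.clock() - self.last_kick > self.timeout


def reasonable(previous, current, low, high, max_step):
    """Reasonableness check: the new reading must lie within its physical
    range and must not jump further than `max_step` from the last reading."""
    return low <= current <= high and abs(current - previous) <= max_step


if __name__ == "__main__":
    watchdog = Watchdog(timeout=0.5)
    watchdog.kick()
    print("watchdog expired:", watchdog.expired())
    print("reading plausible:", reasonable(20.0, 21.5, 0.0, 100.0, 5.0))
```

Neither check proves that the system is safe; each only flags a state that is suspicious enough to warrant treatment, which is one reason detection is described as difficult and unreliable.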
The goal of recovery is the return of a system to a non-hazardous state, given that it is determined to be in a hazardous state, or more generally, the movement of a system towards less and less hazardous states. This may involve the use of 'fail-operational' states in which operation may continue even in the presence of failures, 'fail-soft' states in which the system gracefully degrades to maintain more critical systems when failures prevent full functionality, and 'fail-safe' states in which the system is designed to fail in such a manner as to assure system safety. Redundancy may be used to mask faults, to allow 'backward' recovery in which backup states are maintained for continuation of computation after subsequent failures, to allow 'forward' recovery in which redundant data and program structures are used to compensate for errors after they occur, for dynamic alteration of flow of control, for reconfiguration and graceful degradation, and to allow design with wide margins of safety.
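Backward recovery combined with a fail-safe fallback can be sketched as follows. This is a minimal illustration in the style of a recovery block, not a definitive implementation: the function `recovery_block`, the valve-control variants, and the acceptance test are all hypothetical.

```python
import copy


def recovery_block(state, variants, acceptance_test, fail_safe):
    """Run each variant from a saved checkpoint until one passes the
    acceptance test; if none does, enter the fail-safe state."""
    checkpoint = copy.deepcopy(state)          # backup state for backward recovery
    for variant in variants:
        candidate = copy.deepcopy(checkpoint)  # roll back before every attempt
        try:
            variant(candidate)
        except Exception:
            continue                           # an exception counts as a detected failure
        if acceptance_test(candidate):
            return candidate
    return fail_safe(checkpoint)


# Hypothetical example: command a valve opening between 0.0 and 1.0.
def primary(s):
    s["valve"] = s["demand"] * 2.0             # deliberately faulty: overshoots the range


def backup(s):
    s["valve"] = min(s["demand"], 1.0)         # simpler redundant algorithm


def accept(s):
    return 0.0 <= s["valve"] <= 1.0


def shut_down(s):
    return {**s, "valve": 0.0, "mode": "fail-safe"}


if __name__ == "__main__":
    start = {"demand": 0.8, "valve": 0.0}
    print(recovery_block(start, [primary, backup], accept, shut_down))
    print(recovery_block(start, [primary], accept, shut_down))
```

The sketch assumes the closed valve is the safe state; which state is actually safe depends on the system and has to come from the hazard analysis.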
Verification and certification of safety typically involve checks in the design, analysis, implementation, and maintenance phases of a system's life cycle. Commonly used techniques include proof of program correctness, extensive simulation and testing, and extensive engineering changes to systems in place once potentially hazardous faults are detected.
Clearly the safety issue is closely linked to the fault tolerance issue, and just as clearly, there are no perfect solutions to protection from the realities of the world; life is a terminal disease. The best we can do is to assess the risks associated with events, and work towards reducing those risks to acceptable levels given the resources we are willing to spend in the process.