Things fall apart. This is an unavoidable fact of the universe in which we live. The study of fault tolerant computing is the study of how to compute without error in the presence of faults. Safety is defined as freedom from those conditions that can cause death, injury, occupational illness, or damage to or loss of equipment or property. Thus these two fields cover very similar ground, use very similar techniques, and are covered here together.
Fault tolerant computing is often divided into two categories [Siewiorek82]. The techniques of 'fault avoidance' are those methods by which faults are made inherently less likely to occur. These predominantly include the use of parts that remain stable under the types of perturbations expected, physical protection mechanisms that prevent faults by limiting which agents have physical access to the parts, and other similar methods. Techniques of 'fault tolerance' are those that enable a system to continue operation in the presence of faults. These predominantly consist of the well thought out use of redundancy to 'cover' faults so they don't result in errors.
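As an illustration of how redundancy can 'cover' a fault, the following is a minimal sketch of triple modular redundancy (TMR) with a majority voter. It is not taken from [Siewiorek82]; the names tmr_vote, run_tmr, good, and faulty are hypothetical, and the voter itself is assumed to be fault free, which a real design would also have to address.

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Return the value produced by at least two of three redundant modules.

    Raises ValueError when all three results differ, since no single
    result can then be trusted (the fault is not covered).
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: fault not covered")
    return value

def run_tmr(modules, x):
    """Run three redundant modules on the same input and vote on the results."""
    results = [m(x) for m in modules]
    return tmr_vote(*results)

# Hypothetical modules: two behave correctly, one has a fault.
def good(x):
    return x * x

def faulty(x):
    return x * x + 1  # simulated fault: result is off by one

# The single faulty module is outvoted, so the fault does not become an error.
print(run_tmr([good, faulty, good], 7))  # prints 49
```

A single faulty module is masked by the other two. Two modules that fail in the same way would outvote the correct one, which is one reason the redundancy must be well thought out.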
Ignoring the well documented link between physical and mental well being, we consider safety as predominantly concerned with the design of systems that provide assurance of the prevention of physical harm. Protection in general, and safety in particular, is a relative concept, in that no system can be designed so as to prevent all possible physical harm under all possible conditions. Even the definition of harm is a relative concept in that causing harm to one individual may prevent harm from coming to another.
Fault tolerant computing and safety are therefore predominantly concerned with the design of systems that optimize some set of tradeoffs between risks and techniques by which they may be compensated for. We concentrate our discussion on the techniques available for the reduction of risk, and the basic nature of the tradeoffs in their use.
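One common simplification of such a tradeoff, assumed here purely for illustration, treats risk as the probability of a harmful event multiplied by the cost of the harm, and compares the reduction in that expected loss against the cost of the technique that achieves it. The function names and all of the figures below are made up.

```python
def expected_loss(probability, cost_of_harm):
    """Expected loss under the assumed model: probability times cost of harm."""
    return probability * cost_of_harm

def net_benefit(p_before, p_after, cost_of_harm, cost_of_technique):
    """Reduction in expected loss minus the cost of the risk-reduction technique.

    A positive result suggests the technique is worth its cost under this
    simplified model; a negative result suggests it is not.
    """
    reduction = (expected_loss(p_before, cost_of_harm)
                 - expected_loss(p_after, cost_of_harm))
    return reduction - cost_of_technique

# Made-up figures: a technique costing 5000 lowers the chance of a
# 1,000,000 loss from 1% to 0.1%.
benefit = net_benefit(0.01, 0.001, 1_000_000, 5_000)
print("worth using" if benefit > 0 else "not worth using")
```

Real tradeoffs are rarely this simple: harm to people is not easily reduced to a single cost, and the probabilities themselves are usually uncertain.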
1 - Describe:
a) The difference between fault tolerance and fault avoidance
b) Why no system can be perfectly safe
c) Why software is rarely without errors
d) Why redundancy is the major technique for tolerating faults
2 - Describe and give detailed examples of:
a) TMR
b) Duplicate and compare
c) Hot standbys
d) Tempest attacks and protection
e) How backup can fail to protect information integrity
f) How a backup tape can be exploited to leak secrets
g) A color change operation
3 - List 10 hazards in a critical control system and how they can be covered
4 - Perform a sample of each of the following, using randomly generated or made up data:
a) Hazard analysis
b) Risk analysis
c) Hazard coverage exercise
5 - Describe and differentiate:
a) N-version programming and N-modular redundancy
b) Exhaustive testing and proof of correctness
c) Intrinsic safety, built-in self-test, and fault tolerance
d) Faults and hazards
e) Avoidance, prevention, detection, warning, and correction