All.Net

Fri Apr 8 06:49:41 PDT 2016

Risk Management: Failsafes: When should failsafes be required and how should they be determined?

Options:

Option 1: Event sequences leading to potentially serious negative consequences should have failsafe mechanisms.
Option 2: Interdependent failures should be mitigated in advance through failsafes and alternative operating modes.
Option 3: Single points of failure (SPOFs) should be mitigated in advance through failsafes.
Option 4: Control should be separated from and less forceful than safety so that control cannot override failsafe mechanisms.
Option 5: Combined control sequences should be less forceful than safety mechanisms to prevent complex intentional overwhelming of safety systems.
Option 6: Failsafe mechanisms available within system mechanisms should be properly configured.
Option 7: No special attention to failsafes need be paid.

Decision:

The suggested approach to failsafes should be as follows:

Risk Level	Skill	Maturity	Alternatives
High	High	Managed+	Event sequences leading to potentially serious negative consequences should have failsafe mechanisms. AND Interdependent failures should be mitigated in advance through failsafes and alternative operating modes. AND Single points of failure (SPOFs) should be mitigated in advance through failsafes. AND Control should be separated from and less forceful than safety so that control cannot override failsafe mechanisms. AND Combined control sequences should be less forceful than safety mechanisms to prevent complex intentional overwhelming of safety systems. AND Failsafe mechanisms available within system mechanisms should be properly configured.
High	---	Defined-	This situation should be avoided - do not proceed under this condition.
High	Med-	---	This situation should be avoided - do not proceed under this condition.
Medium	Med+	Defined+	Interdependent failures should be mitigated in advance through failsafes and alternative operating modes. AND Single points of failure (SPOFs) should be mitigated in advance through failsafes. AND Control should be separated from and less forceful than safety so that control cannot override failsafe mechanisms. AND Failsafe mechanisms available within system mechanisms should be properly configured.
Medium	---	Repeatable-	This situation should be avoided - do not proceed under this condition.
Medium	Low	---	This situation should be avoided - do not proceed under this condition.
Low	Low	Repeatable+	Control should be separated from and less forceful than safety so that control cannot override failsafe mechanisms. AND Failsafe mechanisms available within system mechanisms should be properly configured. AND No special attention to failsafes need be paid.
Low	Low	Initial-	This situation should be avoided - do not proceed under this condition.

Failsafe risk mitigation approach
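
The table above can be read as a first-match lookup. The following is a minimal sketch of that reading, not part of the original decision material. It assumes that "+" means "this level or higher", "-" means "this level or lower", and "---" means "any", and it assumes the orderings in SKILL and MATURITY (the table names the levels but does not state the scales). Option numbers refer to the Options list; the names ROWS, AVOID, matches, and lookup are hypothetical.

```python
# Orderings are assumptions: the table names levels but not the full scales.
SKILL = ["Low", "Med", "High"]
MATURITY = ["Initial", "Repeatable", "Defined", "Managed", "Optimizing"]

AVOID = "avoid"  # stands for "do not proceed under this condition"

# (risk, skill spec, maturity spec, result) in the same order as the table.
ROWS = [
    ("High", "High", "Managed+", [1, 2, 3, 4, 5, 6]),
    ("High", "---", "Defined-", AVOID),
    ("High", "Med-", "---", AVOID),
    ("Medium", "Med+", "Defined+", [2, 3, 4, 6]),
    ("Medium", "---", "Repeatable-", AVOID),
    ("Medium", "Low", "---", AVOID),
    ("Low", "Low", "Repeatable+", [4, 6, 7]),
    ("Low", "Low", "Initial-", AVOID),
]

def matches(spec, value, scale):
    """'---' matches anything, 'X+' X or higher, 'X-' X or lower, else equality."""
    if spec == "---":
        return True
    if spec[-1] in "+-":
        base, idx = scale.index(spec[:-1]), scale.index(value)
        return idx >= base if spec[-1] == "+" else idx <= base
    return spec == value

def lookup(risk, skill, maturity):
    """Return the first matching row's result, or None if no row applies."""
    for r, s, m, result in ROWS:
        if r == risk and matches(s, skill, SKILL) and matches(m, maturity, MATURITY):
            return result
    return None
```

Under these assumptions, lookup("Medium", "High", "Defined") selects Options 2, 3, 4, and 6, while a combination the table does not list (e.g., low risk with high skill) returns None rather than a guess.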

Basis:

Event sequences leading to potentially serious negative consequences should have failsafe mechanisms.
When consequences are sufficiently high to warrant thorough examination of the situation, failsafe mechanisms should be preferred over alternatives.

Interdependent failures should be mitigated in advance through failsafes and alternative operating modes.
Interdependencies that cannot be resolved by redundancy or hardening (e.g., common-mode failures, insider malicious acts, etc.) should be covered by failsafe mechanisms where feasible.

Single points of failure (SPOFs) should be mitigated in advance through failsafes.
All SPOFs should be examined to determine whether they can produce direct consequences above management-specified tolerances, and if so, they should be protected with failsafe mechanisms.
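
One way to screen for SPOFs is sketched below, under stated assumptions: the system is modeled as a directed dependency graph, a SPOF is any component whose loss breaks every path from a start node to a goal node, and each component carries a numeric consequence rating that can be compared with a tolerance. The graph model, the ratings, and the function names are illustrative and are not taken from the source.

```python
def reachable(graph, start, goal, removed):
    """Depth-first search from start to goal, skipping the removed node."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nxt in graph.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def spofs_needing_failsafes(graph, start, goal, consequence, tolerance):
    """Components whose loss breaks every start-to-goal path and whose
    direct consequence rating exceeds the tolerance."""
    nodes = set(graph) | {n for outs in graph.values() for n in outs}
    spofs = [n for n in nodes - {start, goal}
             if not reachable(graph, start, goal, removed=n)]
    return sorted(n for n in spofs if consequence.get(n, 0) > tolerance)

# Hypothetical plant fragment: two redundant pumps feed a single valve.
GRAPH = {
    "power": ["pump_a", "pump_b"],
    "pump_a": ["valve"],
    "pump_b": ["valve"],
    "valve": ["chamber"],
}
RATINGS = {"pump_a": 9, "pump_b": 9, "valve": 8}  # made-up consequence ratings
```

With a tolerance of 5, spofs_needing_failsafes(GRAPH, "power", "chamber", RATINGS, 5) reports only the valve: the pumps have high ratings but are redundant, so neither is a single point of failure in this model.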

Control should be separated from and less forceful than safety so that control cannot override failsafe mechanisms.
As a general rule, failsafe and other safety systems and mechanisms should be separate and different from alterable and/or tunable control mechanisms. This prevents alterations to control settings (e.g., closing a lock before the door is shut) from creating hazardous conditions (e.g., the door is opened even though it is locked) or causing failsafe to fail into an unsafe mode (e.g., the lock is broken and the door is opened). In addition, when a control mechanism acts in opposition to a failsafe mechanism, the control mechanism should not be capable of overwhelming the failsafe mechanism (e.g., the door lock should be strong enough so that if the door opening mechanism attempts to open the door when locked, the door does not open and the lock does not break).
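
Software can only illustrate the separation, not the relative physical strength, of control and safety. The sketch below assumes a hypothetical door whose interlock reads its own pressure sensor and exposes no setter, so nothing on the control side can alter the safety decision. The class names, the ambient pressure, and the tolerance band are assumptions made for illustration.

```python
class Interlock:
    """Safety side: decides only from its own sensor; has no tunable setters."""
    def __init__(self, read_pressure_kpa, ambient_kpa=101.3, band_kpa=2.0):
        self._read = read_pressure_kpa   # callable returning pressure in kPa
        self._ambient = ambient_kpa      # assumed normal atmospheric pressure
        self._band = band_kpa            # assumed acceptable deviation

    def permits_open(self):
        try:
            pressure = self._read()
        except Exception:
            return False  # any sensor fault fails into the locked (safe) state
        return abs(pressure - self._ambient) <= self._band

class Door:
    """Control side: a request to open succeeds only if the interlock permits it."""
    def __init__(self, interlock):
        self._interlock = interlock
        self.is_open = False

    def request_open(self):
        if self._interlock.permits_open():
            self.is_open = True
        return self.is_open
```

For example, Door(Interlock(lambda: 150.0)).request_open() returns False because the chamber is pressurized, and a sensor that raises an error leaves the door locked rather than open.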

Combined control sequences should be less forceful than safety mechanisms to prevent complex intentional overwhelming of safety systems.
Malicious use of control mechanisms to invoke sequences of events (e.g., sequences that set up positive feedback in a mechanical system) should be unable to overwhelm or damage the failsafe mechanism except to put it into a safe mode. This applies whether the sequences act in a direct manner (e.g., by trying repeatedly at a given frequency to open a gate which is locked), in an indirect manner (e.g., through time variant operation of an electrical mechanism that couples to a mechanical mechanism), or as a combination of multiple mechanisms (e.g., timed sequences of pressure changes in the chamber the door seals combined with attempts to force the door open when locked).

Failsafe mechanisms available within system mechanisms should be properly configured.
If and to the extent that there are failsafe mechanisms present, they should be configured so as to fail in safe modes as determined by analysis.

No special attention to failsafes need be paid.
To the extent that failsafes exist, they should be operated properly, but no special attention should be paid to seek out settings or make changes to manufacturer defaults. Manufacturers' specifications should be followed.

Failsafe prioritization.
Failsafe mechanisms should be designed so as to prefer more certain methods over less certain ones. Use the following prioritization in evaluating the surety of failsafe mechanisms.

Physics is more sure than engineering: To the extent that fundamental principles of physics (e.g., gravity will cause this to drop if that breaks) can be directly applied for failsafe mechanisms, this is generally more sure than engineered solutions (e.g., gravity will cause this to drop, forcing that to rise through this engineered component). This also applies to information technology (e.g., the control signals from the SCADA system to the PLC are also used to power the normally open safety relay closed, so that loss of SCADA control signals for more than 10 seconds will cause the relay to open, thus disabling the ignition sequence and forcing a final countdown restart). A corollary is that fighting nature is harder than applying it.
Passive is more sure than active: To the extent that passive mechanisms (e.g., the centrifuge has a dynamic coefficient of friction such that the motor will overheat and fail before the centrifuge goes fast enough to break out of the enclosure) can be used, they are more sure than active mechanisms (e.g., the control system checks velocity and weight and when the combination reaches 95% of the force necessary to break out of the enclosure, refuses to allow the motor voltage to increase). In the more general case, anything that requires a programmed sequence of actions is less sure than something that does not (e.g., combinatorics is more sure than sequencers).
Closer is more sure than further away: To the extent that a control mechanism is more proximate to the mechanism it controls (e.g., right next to it as opposed to 50 miles away), there are fewer things that can interfere with the control linkage (e.g., the wires between the controller and the valve). However, beware of common mode failures. Proximity also tends to make common mode failures (e.g., heating of the mechanism damages the control causing it to fail closed and increase the heating) more likely, so be careful to anticipate the operating range under failure modes to account for minimum distance requirements.
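
The SCADA example above describes a relay that stays closed only while control signals keep arriving. In the source this is achieved physically, by using the signals to power the relay; the following is only a behavioral simulation of that idea, assuming explicit timestamps in seconds and a latch that stays open until a deliberate restart. The class and method names are hypothetical.

```python
class WatchdogRelay:
    """Normally open relay, held closed only while control signals keep arriving."""
    def __init__(self, timeout_s=10.0):
        self.timeout_s = timeout_s
        self._last_signal = None   # no signal seen yet, so the relay starts open
        self._tripped = False

    def _update(self, now):
        # Latch open once the gap since the last signal exceeds the timeout.
        if self._last_signal is not None and now - self._last_signal > self.timeout_s:
            self._tripped = True

    def signal(self, now):
        """Record a control signal at time `now`; ignored once tripped."""
        self._update(now)
        if not self._tripped:
            self._last_signal = now

    def closed(self, now):
        """True only if signals have arrived and no timeout has occurred."""
        self._update(now)
        return self._last_signal is not None and not self._tripped

    def restart(self, now):
        """Models the forced countdown restart: clear the latch deliberately."""
        self._tripped = False
        self._last_signal = now
```

The latch is a design choice made for this sketch: once the timeout has been exceeded, later signals do not re-close the relay, which mirrors the requirement that the sequence be restarted rather than silently resumed.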

Failsafe analysis methodology.
In analysis of failure modes for failsafe determinations (i.e., determining what mode is safer to fail into), the following methodologies are available and should be applied.

Fault to failure to consequence analysis

This approach is generally preferred for components that are being used as part of larger systems because it allows the ultimate consequences to be ignored in favor of the technical outcomes that can be produced by the component(s).

Fault modeling: Based on historical experience and data (e.g., manufacturers' data, experimental data, plant experience, proximate and root cause analysis, etc.), classes of faults (e.g., stuck open, stuck closed, transient, bridging, operator error or malice, etc.) associated with components (e.g., PLC model #33457, manufacturing run 23, valve controllers, the components comprising it, its operating environment, its protocols, etc.) should be identified, cataloged, and modeled.

Fault set determination: For the relevant components (e.g., the gas mixing chamber PLC), composites (e.g., the gas mixing chamber and associated components), the entire plant (e.g., the fuel factory), and its environment (e.g., a war zone in Tornado Alley next to a global child refugee camp), the set of relevant faults should be identified. This depends on the level of analysis being performed (e.g., all SPOFs, multiple fault models, event sequence analysis, common mode failure analysis, human accidental, intentional, and malicious acts, etc.).

Analysis linking faults to failures: Analysis of the paths from faults to failures is then undertaken to determine all failures that can result from all faults in the fault set(s) (e.g., various faults in the PLC as well as operator override can lead to the ability to open the door to the gas mixing chamber while it has gas present).

Failure consequence analysis: If this is the final plant or composite, then analysis of the consequences of each failure should be undertaken to rate the consequences per the risk rating methodology in use. At the level of the PLC, all the failures are essentially conditions that can exist from underlying faults, but no consequences are identified other than "wrong output". At the level of the gas mixing chamber, consequences could include having the access door open while gas was in the gas mixing chamber. At the level of the gas mixing chamber and associated components, consequences could include ignition of the gas, explosion, and death to present operator(s). As the context increases, the greater consequences to the children in the refugee camp may be revealed.
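
The forward direction (fault set, then failures, then rated consequences) can be sketched as simple rule propagation. This is an illustration under assumptions, not the methodology itself: the rules, event names, and severity ratings below are hypothetical, loosely based on the gas mixing chamber example.

```python
def propagate(faults, rules):
    """Forward-chain: add every effect whose causes are all present."""
    state = set(faults)
    changed = True
    while changed:
        changed = False
        for causes, effect in rules:
            if effect not in state and causes <= state:
                state.add(effect)
                changed = True
    return state

# (set of causes, resulting failure); all names are made up for illustration.
RULES = [
    (frozenset({"plc_output_stuck"}), "door_unlocked_with_gas"),
    (frozenset({"operator_override"}), "door_unlocked_with_gas"),
    (frozenset({"door_unlocked_with_gas", "ignition_source"}), "explosion"),
]
SEVERITY = {"door_unlocked_with_gas": 3, "explosion": 5}  # hypothetical ratings

def rated_failures(faults):
    """Failures reachable from the fault set, with their severity ratings."""
    reached = propagate(faults, RULES) - set(faults)
    return {f: SEVERITY[f] for f in reached if f in SEVERITY}
```

A single PLC fault yields only the unlocked door, while operator override combined with an ignition source also reaches the explosion, which reflects how consequences grow as the context widens.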

Consequence to failure to fault analysis

This approach is generally preferred for large composite systems because it limits the analysis sooner and further but has less thorough coverage of details and can miss cases that were not anticipated or found to be realistic.

Consequence hypotheses: A set of potentially serious negative consequences of interest is generated and used as the basis for the analysis. In this analysis, feasibility of consequence is ignored with the goal of producing a comprehensive set of serious negative consequences (e.g., the gas mixing chamber blows up).

Consequence to failure analysis: From the set of consequences, hypothetical proximate causes that could produce those consequences are generated (e.g., an ignition source comes into contact with the gas in the gas mixing chamber). From those proximate causes, causal sequences and root causes are sought to the level of components in the plant (e.g., the gas chamber PLC could fail and allow the chamber to be opened when gas is present) and threats and their ultimate actions (e.g., authorized employee decides to commit suicide by lighting a match and throwing it into the gas mixing chamber). This results in a set of failure modes and sequences of interest (e.g., the PLC could be overridden by an employee who then decides to commit suicide ...). In this analysis, it is assumed that failures can happen and that normal controls (e.g., the employee reliability program and PLC change controls) are not effective.

Analysis linking failures to faults: Analysis then traces the proximate causes back to a set of root causes that may produce these consequences and combines them to identify fault sets (e.g., the gas mixing chamber access door can be opened when gas is present by manual override of the PLC AND the same person who can override the PLC controls can access the access door) that may be addressed by failsafe mechanisms (e.g., a mechanical interlock that prevents physical opening of the door when the internal pressure differs from normal atmospheric pressure) as opposed to another compensating control (e.g., separation of duties prohibits the individual who has physical PLC access from also having physical access to the gas mixing chamber door).
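
Working backward from a consequence to fault sets resembles computing the minimal cut sets of a fault tree. The source does not name that technique; it is swapped in here as one plausible way to make the step concrete. The tree, the gate structure, and the event names are assumptions for illustration.

```python
from itertools import product

def cut_sets(node, tree):
    """Return the fault sets (frozensets) that can produce `node`.
    `tree` maps an event to ("AND" or "OR", [children]); leaves are faults."""
    if node not in tree:
        return {frozenset([node])}
    gate, children = tree[node]
    child_sets = [cut_sets(child, tree) for child in children]
    if gate == "OR":
        return set().union(*child_sets)
    # AND: take one fault set from each child and merge them.
    return {frozenset().union(*combo) for combo in product(*child_sets)}

def minimal(sets):
    """Drop any fault set that strictly contains another fault set."""
    return {s for s in sets if not any(t < s for t in sets)}

# Hypothetical tree for "the gas mixing chamber blows up".
TREE = {
    "chamber_explosion": ("AND", ["door_open_with_gas", "ignition_source"]),
    "door_open_with_gas": ("OR", ["plc_fails_open", "insider_opens_door"]),
    "insider_opens_door": ("AND", ["plc_manual_override", "door_access"]),
}
```

Here minimal(cut_sets("chamber_explosion", TREE)) yields two fault sets, one of which requires both PLC manual override and door access. A fault set that depends on one person holding both capabilities is the kind that either a mechanical interlock or separation of duties could address.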

Determination of preferred (safest) failure mode(s) and mechanism(s): Given all the scenarios in which different event sequences may produce potentially serious negative consequences (e.g., failure to allow physical access to the gas mixing chamber during an overpressure event may result in a breach of the chamber resulting in poisoning of plant personnel) and the different potential "safe" modes for failures (e.g., a relief valve releasing noxious gases to the environment), some of which may conflict with each other (e.g., the relief valve could be damaged by mechanical attack leading to a mixing chamber explosion), failsafe modes and mechanisms are selected so as to cover, or leave uncovered with management acceptance of the risk, each sequence with consequences above management-specified thresholds.
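
The closing condition above (every sequence over the threshold is either covered by a failsafe or explicitly accepted) lends itself to a simple check. The sketch below assumes numeric consequence ratings and a mapping from each failsafe to the sequences it covers; the function name and the sample data are hypothetical.

```python
def uncovered(sequences, failsafes, accepted, threshold):
    """Sequences rated above the threshold that no failsafe covers
    and that management has not accepted."""
    covered = set().union(*failsafes.values())
    return sorted(name for name, rating in sequences.items()
                  if rating > threshold
                  and name not in covered
                  and name not in accepted)

# Made-up ratings and coverage for illustration only.
SEQUENCES = {
    "door_opened_with_gas": 5,
    "overpressure_breach": 4,
    "relief_valve_attacked": 4,
    "sensor_drift": 1,
}
FAILSAFES = {
    "pressure_interlock": {"door_opened_with_gas"},
    "relief_valve": {"overpressure_breach"},
}
ACCEPTED = set()
```

With a threshold of 3, uncovered(SEQUENCES, FAILSAFES, ACCEPTED, 3) returns only "relief_valve_attacked", which then needs either an additional failsafe or a recorded management acceptance.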

Engineering limits on failsafes: All failsafes must be engineered, and engineering helps to define the limits of these failsafes. Failsafes are designed to withstand the anticipated ranges of events to the level necessary in order to fail in the desired (safe) mode.

Root (and proximate) cause analysis and improvement: Over time, better understanding and analysis and changing environments (i.e., operating ranges and threats) lead to anticipatable and/or realized events with consequences. These are analyzed over time to identify and better understand proximate and root causes of failures and to adapt the approach to and mechanisms for failsafes.