Fri Apr 8 06:47:17 PDT 2016
Risk Management: Failsafes: When are failsafes required and how are they determined?
Option 1: Event sequences leading to potentially serious negative consequences should have failsafe mechanisms.
Option 2: Interdependent failures should be mitigated in advance through failsafes and alternative operating modes.
Option 3: Single points of failure (SPOFs) should be mitigated in advance through failsafes.
Option 4: Control should be separated from and less forceful than safety so that control cannot override failsafe mechanisms.
Option 5: Combined control sequences should be less forceful than safety mechanisms to prevent complex intentional overwhelming of safety systems.
Option 6: Failsafe mechanisms available within system mechanisms should be properly configured.
Option 7: No special attention to failsafes need be paid.
Event sequences leading to potentially serious
negative consequences should have failsafe mechanisms.
consequences are sufficiently high to warrant thorough examination of
the situation, failsafe mechanisms should be preferred over
Interdependent failures should be mitigated in
advance through failsafes and alternative operating modes.
Interdependencies that cannot be resolved by redundancy or hardening
(e.g., common-mode failures, insider malicious acts, etc.) should be
covered by failsafe mechanisms where feasible.
Single points of failure (SPOFs) should be
mitigated in advance through failsafes.
All SPOFs should be
examined to determine whether they can produce direct consequences
above management specified tolerances, and if so, they should be
protected with failsafe mechanisms.
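The SPOF screening described above can be sketched as a small check. This is a minimal illustration under stated assumptions, not a prescribed method: the component names, the system_ok rule, the 0 to 10 consequence ratings, and the tolerance value are all hypothetical.

```python
from typing import Callable, Dict, List


def find_spofs(components: List[str],
               system_ok: Callable[[Dict[str, bool]], bool]) -> List[str]:
    """Return the components whose failure alone stops the system working."""
    spofs = []
    for failed in components:
        # Every component works except the one under test.
        state = {name: (name != failed) for name in components}
        if not system_ok(state):
            spofs.append(failed)
    return spofs


def spofs_needing_failsafe(spofs: List[str],
                           consequence: Dict[str, int],
                           tolerance: int) -> List[str]:
    """Keep only SPOFs whose direct consequence exceeds the tolerance."""
    return [name for name in spofs if consequence[name] > tolerance]


# Hypothetical plant fragment: two redundant pumps, one PLC, one valve.
COMPONENTS = ["pump_a", "pump_b", "plc", "valve"]


def system_ok(state: Dict[str, bool]) -> bool:
    # Assumed rule: either pump suffices, but the PLC and valve are both needed.
    return (state["pump_a"] or state["pump_b"]) and state["plc"] and state["valve"]


# Assumed direct-consequence ratings (0..10) and management tolerance.
CONSEQUENCE = {"pump_a": 2, "pump_b": 2, "plc": 8, "valve": 3}
TOLERANCE = 5

spofs = find_spofs(COMPONENTS, system_ok)
needs_failsafe = spofs_needing_failsafe(spofs, CONSEQUENCE, TOLERANCE)
print("SPOFs:", spofs)
print("SPOFs needing failsafes:", needs_failsafe)
```

In this made-up fragment the redundant pumps are not SPOFs, the valve is a SPOF below tolerance, and only the PLC would call for a failsafe.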
Control should be separated from and less forceful
than safety so that control cannot override failsafe mechanisms.
As a general rule, failsafe and other safety
systems and mechanisms should be separate and different from alterable
and/or tunable control mechanisms. This prevents alterations to
control settings (e.g., closing a lock before the door is shut) from
creating hazardous conditions (e.g., the door is opened even though
it is locked) or causing failsafe to fail into an unsafe mode (e.g.,
the lock is broken and the door is opened). In addition, when a
control mechanism acts in opposition to a failsafe mechanism, the
control mechanism should not be capable of overwhelming the failsafe
mechanism (e.g., the door lock should be strong enough so that if the
door opening mechanism attempts to open the door when locked, the
door does not open and the lock does not break).
Combined control sequences should be less forceful
than safety mechanisms to prevent complex intentional overwhelming of
safety systems.
The potential malicious use of control
mechanisms to invoke sequences of events (e.g., sequences that set up
positive feedback in a mechanical system) in a direct (e.g., by trying
repeatedly at a given frequency to open a gate which is locked) or
indirect (e.g., through time variant operation of an electrical
mechanism that couples to a mechanical mechanism) manner, and as a
combination of multiple mechanisms (e.g., timed sequences of pressure
changes in the chamber the door seals combined with attempts to force
the door open when locked) should be unable to overwhelm or damage the
failsafe mechanism except to put it into a safe mode.
Failsafe mechanisms available within system mechanisms
should be properly configured.
If and to the extent that there
are failsafe mechanisms present, they should be configured so as to
fail in safe modes as determined by analysis.
No special attention to failsafes need be paid.
To the extent that failsafes exist, they should be
operated properly, but no special attention should be paid to seek out
settings or make changes to manufacturer defaults. Manufacturers'
specifications should be followed.
Failsafe mechanisms should be designed so as to prefer more certain
methods over less certain ones. Use the following prioritization in
evaluating the surety of failsafe mechanisms.
- Physics is more sure than engineering: To the
extent that fundamental principles of physics (e.g., gravity will
cause this to drop if that breaks) can be directly applied for
failsafe mechanisms, this is generally more sure than engineered
solutions (e.g., gravity will cause this to drop, forcing that to rise
through this engineered component). This also applies to information
technology (e.g., the control signals from the SCADA system to the PLC
are also used to power the normally open safety relay closed, so that
loss of SCADA control signals for more than 10 seconds will cause the
relay to open, thus disabling the ignition sequence and forcing a
final countdown restart). A corollary is that fighting nature is
harder than applying it.
- Passive is more sure than active: To the
extent that passive mechanisms (e.g., the centrifuge has a dynamic
coefficient of friction such that the motor will overheat and fail
before the centrifuge goes fast enough to break out of the enclosure)
can be used, they are more sure than active mechanisms (e.g., the
control system checks velocity and weight and when the combination
reaches 95% of the force necessary to break out of the enclosure,
refuses to allow the motor voltage to increase). In the more general
case, anything that requires a programmed sequence of actions is less
sure than something that does not (e.g., combinatorics is more sure
- Closer is more sure than further away: To the
extent that a control mechanism is more proximate to the mechanism it
controls (e.g., right next to it as opposed to 50 miles away), there
are fewer things that can interfere with the control linkage (e.g.,
the wires between the controller and the valve). However, beware of
common mode failures. Proximity also tends to make common mode
failures (e.g., heating of the mechanism damages the control causing
it to fail closed and increase the heating) more likely, so be careful
to anticipate the operating range under failure modes to account
for minimum distance requirements.
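The signal-loss relay example in the first item above can be modeled for analysis purposes as follows. This is only a behavioral sketch: in the arrangement described, the relay is held closed by the control signals themselves, which is exactly why it is more sure than a programmed check like this one. The class and function names are hypothetical, and time is passed in explicitly so the behavior can be examined without waiting.

```python
from typing import Optional


class SignalLossRelay:
    """Model of a normally open relay held closed only while signals arrive."""

    def __init__(self, timeout_s: float = 10.0) -> None:
        self.timeout_s = timeout_s
        # No signal seen yet, so the relay starts in its open (safe) state.
        self.last_signal_at: Optional[float] = None

    def signal(self, now: float) -> None:
        """Record that a control signal arrived at time `now` (seconds)."""
        self.last_signal_at = now

    def is_closed(self, now: float) -> bool:
        """Closed only if a signal arrived within the last timeout_s seconds."""
        if self.last_signal_at is None:
            return False
        return (now - self.last_signal_at) <= self.timeout_s


def ignition_permitted(relay: SignalLossRelay, now: float) -> bool:
    # An open relay disables the ignition sequence.
    return relay.is_closed(now)


relay = SignalLossRelay(timeout_s=10.0)
relay.signal(now=0.0)
print(ignition_permitted(relay, now=5.0))   # signal is recent
print(ignition_permitted(relay, now=10.5))  # more than 10 seconds of silence
```

The model defaults to the safe state: with no signal ever received, or after more than the timeout without one, ignition is not permitted.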
Failsafe analysis methodology.
For the analysis of failure modes for failsafe determinations (i.e., determining what
mode is safer to fail into), the following methodologies are available
and should be applied.
Copyright(c) Fred Cohen, 1988-2015 - All Rights Reserved
Fault to failure to consequence analysis
This approach is generally preferred for components
that are being used as part of larger systems because it allows the
ultimate consequences to be ignored in favor of the technical outcomes
that can be produced by the component(s).
Fault modeling: Based on historical experience
and data (e.g., manufacturers' data, experimental data, plant
experience, proximate and root cause analysis, etc.), classes of
faults (e.g., stuck open, stuck closed, transient, bridging, operator
error or malice, etc.) associated with components (e.g., PLC model
#33457, manufacturing run 23, valve controllers, the components
comprising it, its operating environment, its protocols, etc.) should
be identified, cataloged, and modeled.
Fault set determination: For the composite
comprising relevant components (e.g., the gas mixing chamber PLC),
composites (e.g., the gas mixing chamber and associated components),
the entire plant (e.g., the fuel factory), and its environment (e.g., a war
zone in Tornado Alley next to a global child refugee camp), the set of
relevant faults should be identified. This depends on the level of
analysis being performed (e.g., all SPOFs, multiple fault models,
event sequence analysis, common mode failure analysis, human
accidental, intentional, and malicious acts, etc.).
Analysis linking faults to failures: Analysis
of the paths from faults to failures is then undertaken to determine
all failures that can result from all faults in the fault set(s)
(e.g., various faults in the PLC as well as operator override can lead
to the ability to open the door to the gas mixing chamber while it has
Failure consequence analysis: If this is the
final plant or composite, then analysis of the consequences of each
failure should be undertaken to rate the consequences per the risk
rating methodology in use. At the level of the PLC, all the failures
are essentially conditions that can exist from underlying faults, but
no consequences are identified other than "wrong output". At the level
of the gas mixing chamber, consequences could include having the
access door open while gas was in the gas mixing chamber. At the level
of the gas mixing chamber and associated components, consequences
could include ignition of the gas, explosion, and death to present
operator(s). As the context increases, the greater consequences to the
children in the refugee camp may be revealed.
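The forward direction of this analysis (fault set, then failures, then rated consequences) can be sketched with simple lookup tables. Everything in the tables below is an illustrative assumption loosely based on the gas mixing chamber example: the fault names, the failure names, and the numeric ratings are hypothetical and would come from fault modeling and the risk rating methodology in use.

```python
from typing import Dict, List, Set, Tuple

Fault = Tuple[str, str]  # (component, fault class)

# Assumed result of analysis linking faults to failures.
FAULT_TO_FAILURES: Dict[Fault, Set[str]] = {
    ("plc", "stuck_open"): {"door_unlocked_with_gas_present"},
    ("plc", "operator_override"): {"door_unlocked_with_gas_present"},
    ("plc", "stuck_closed"): {"door_locked_when_chamber_empty"},
}

# Assumed failure consequence analysis at the composite level,
# as (description, rating) pairs on a hypothetical 0..10 scale.
FAILURE_TO_CONSEQUENCES: Dict[str, List[Tuple[str, int]]] = {
    "door_unlocked_with_gas_present": [("gas ignition and explosion", 9)],
    "door_locked_when_chamber_empty": [("maintenance delay", 2)],
}


def failures_from(fault_set: Set[Fault]) -> Set[str]:
    """Collect every failure that any fault in the fault set can produce."""
    failures: Set[str] = set()
    for fault in fault_set:
        failures |= FAULT_TO_FAILURES.get(fault, set())
    return failures


def rated_consequences(failures: Set[str]) -> List[Tuple[int, str, str]]:
    """Return (rating, failure, description) tuples, worst first."""
    rated = []
    for failure in failures:
        for description, rating in FAILURE_TO_CONSEQUENCES.get(failure, []):
            rated.append((rating, failure, description))
    return sorted(rated, reverse=True)


fault_set = {("plc", "stuck_open"), ("plc", "operator_override"),
             ("plc", "stuck_closed")}
for rating, failure, description in rated_consequences(failures_from(fault_set)):
    print(rating, failure, description)
```

At the component level only the first table would be filled in; the second table is what gets added as the context widens to the composite and the plant.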
Consequence to failure to fault analysis
This approach is generally preferred for large
composite systems because it limits the analysis sooner and further
but has less thorough coverage of details and can miss cases that were
not anticipated or found to be realistic.
Consequence hypotheses: A set of potentially
serious negative consequences of interest is generated and used as
the basis for the analysis. In this analysis, feasibility of
consequence is ignored with the goal of producing a comprehensive set
of serious negative consequences (e.g., the gas mixing chamber blows up).
Consequence to failure analysis: From the set
of consequences, hypothetical proximate causes that could produce
those consequences are generated (e.g., an ignition source comes into
contact with the gas in the gas mixing chamber). From those proximate
causes, causal sequences and root causes are sought to the level of
components in the plant (e.g., the gas chamber PLC could fail and
allow the chamber to be opened when gas is present) and threats and
their ultimate actions (e.g., authorized employee decides to commit
suicide by lighting a match and throwing it into the gas mixing
chamber). This results in a set of failure modes and sequences of
interest (e.g., the PLC could be overridden by an employee who then
decides to commit suicide ...). In this analysis, it is assumed that
failures can happen and that normal controls (e.g., the employee
reliability program and PLC change controls) are not effective.
Analysis linking failures to faults: Analysis
then traces the proximate causes back to a set of root causes that may
produce these consequences and combines them to identify fault sets
(e.g., the gas mixing chamber access door can be opened when gas is
present by manual override of the PLC AND the same person who can
override the PLC controls can access the access door) that may be
addressed by failsafe mechanisms (e.g., a mechanical interlock that
prevents physical opening of the door when the internal pressure
differs from normal atmospheric pressure) as opposed to another
compensating control (e.g., separation of duties prohibits the
individual who has physical PLC access from also having physical
access to the gas mixing chamber door).
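Working backward from a consequence to fault sets resembles computing the minimal cut sets of a fault tree, a standard technique that is swapped in here for illustration rather than taken from the text. The tree below is a hypothetical rendering of the gas mixing chamber example, with made-up event names, and it deliberately ignores likelihood, in keeping with the assumption that failures can happen and normal controls are not effective.

```python
from itertools import product
from typing import FrozenSet, List, Tuple, Union

# A node is a basic event name or a gate: ("AND" | "OR", [child nodes]).
Node = Union[str, Tuple[str, list]]


def minimize(sets: List[FrozenSet[str]]) -> List[FrozenSet[str]]:
    """Drop duplicates and any set that strictly contains another set."""
    unique = set(sets)
    return [s for s in unique if not any(other < s for other in unique)]


def cut_sets(node: Node) -> List[FrozenSet[str]]:
    """Return the minimal fault sets that are sufficient to produce `node`."""
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any one child's fault set is enough.
        result = [s for sets in child_sets for s in sets]
    elif gate == "AND":
        # One fault set is needed from every child, taken together.
        result = [frozenset().union(*combo) for combo in product(*child_sets)]
    else:
        raise ValueError(f"unknown gate: {gate}")
    return minimize(result)


# Hypothetical tree for "the gas mixing chamber blows up".
DOOR_OPENED_WITH_GAS = ("OR", [
    "plc_fails_open",
    ("AND", ["plc_manual_override", "same_person_has_door_access"]),
])
TOP = ("AND", ["ignition_source_present", DOOR_OPENED_WITH_GAS])

for fault_set in sorted(sorted(s) for s in cut_sets(TOP)):
    print(fault_set)
```

Each printed fault set is a candidate for coverage by a failsafe mechanism, such as the mechanical interlock, or by another compensating control, such as separation of duties.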
Determination of preferred (safest) failure
mode(s) and mechanism(s): Given all the scenarios in which
different event sequences may produce potentially serious negative
consequences (e.g., failure to allow physical access to the gas mixing
chamber during an overpressure event may result in a breach of the
chamber resulting in poisoning of plant personnel) and the different
potential "safe" modes for failures (e.g., a relief valve releasing
noxious gases to the environment) some of which may conflict with
each other (e.g., the relief valve could be damaged by mechanical
attack leading to a mixing chamber explosion), failsafe modes and
mechanisms are selected so as to cover, or leave uncovered with
management acceptance of the risk, each sequence with consequences
above management specified thresholds.
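The coverage requirement above (every sequence over threshold is either covered by a selected failsafe or explicitly accepted by management) can be checked mechanically. This is a minimal sketch under stated assumptions: the sequence names, ratings, threshold, and coverage table are hypothetical, and conflicts between candidate safe modes are not modeled.

```python
from typing import Dict, List, Set

# Assumed event sequences with consequence ratings on a 0..10 scale.
SEQUENCES: Dict[str, int] = {
    "door_opened_with_gas_present": 9,
    "overpressure_breach": 8,
    "relief_valve_damaged_by_attack": 7,
    "maintenance_delay": 2,
}

# Assumed mapping from each candidate failsafe to the sequences it covers.
FAILSAFE_COVERS: Dict[str, Set[str]] = {
    "mechanical_interlock": {"door_opened_with_gas_present"},
    "relief_valve": {"overpressure_breach"},
}

THRESHOLD = 5  # assumed management specified threshold


def uncovered_sequences(sequences: Dict[str, int],
                        selected: List[str],
                        covers: Dict[str, Set[str]],
                        accepted: Set[str],
                        threshold: int) -> List[str]:
    """List sequences over threshold that are neither covered nor accepted."""
    covered: Set[str] = set().union(*(covers[name] for name in selected))
    return sorted(
        name for name, rating in sequences.items()
        if rating > threshold and name not in covered and name not in accepted
    )


selected = ["mechanical_interlock", "relief_valve"]
print(uncovered_sequences(SEQUENCES, selected, FAILSAFE_COVERS,
                          accepted=set(), threshold=THRESHOLD))
```

A non-empty result means the selection is incomplete: each listed sequence needs either another mechanism or a recorded management acceptance of the risk.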
Engineering limits on failsafes: All failsafes
must be engineered, and engineering helps to define the limits of
these failsafes. Such limits are designed so as to withstand the
anticipated ranges of events to the level necessary in order to fail
in the desired (safe) mode.
Root (and proximate) cause analysis and
improvement: Over time, better understanding and analysis and
changing environments (i.e., operating ranges and threats) lead to
anticipatable and/or realized events with consequences. These are
analyzed over time to identify and better understand proximate and
root causes of failures and to adapt the approach to and mechanisms