Fault tolerance

Fault tolerance is the property that allows a system to continue operating properly in the event of the failure of one or more of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability or life-critical systems. The ability to maintain functionality when portions of a system break down is referred to as graceful degradation. [1]

A fault-tolerant design enables a system to continue its intended operation, possibly at a reduced level, rather than failing completely, when some part of the system fails. [2] The term is most commonly used to describe computer systems designed to remain more or less fully operational, with perhaps a reduction in throughput or an increase in response time, in the event of some partial failure. That is, the system as a whole is not stopped due to problems either in the hardware or the software. An example in another field is a motor vehicle designed so that it can still be driven if one of its tires is punctured, or a structure that is able to retain its integrity in the presence of damage due to such causes as fatigue, corrosion, manufacturing flaws, or impact.

Within the scope of an individual system, fault tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and in general, aiming for self-stabilization so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making it sufficiently reliable is very high, a better solution may be to use some form of duplication. In any case, if the consequence of a system failure is catastrophic, the system must be able to revert to a safe mode. This is similar to roll-back recovery but can be a human action if humans are present in the loop.


An example of graceful degradation by design in an image with transparency. The top two images are each the result of viewing the composite image in a viewer that recognizes transparency. The bottom two images are the result in a viewer with no support for transparency. Because the transparency mask (center bottom) is discarded, only the overlay (center top) remains; the image on the left has been designed to degrade gracefully, and hence is still meaningful.

A highly fault-tolerant system might continue at the same level of performance even though one or more components have failed. For example, a building with a backup electrical generator will provide the same voltage to wall outlets even if the grid power fails.

A system that is designed to fail safe, or fail-secure, or fail gracefully, whether it functions at a reduced level or fails completely, does so in a way that protects people, property, or data from injury, damage, intrusion, or disclosure. In computers, a program might fail-safe by executing a graceful exit (as opposed to an uncontrolled crash) in order to prevent data corruption after experiencing an error. A similar distinction is made between "failing well" and "failing badly".
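One way a program can fail safe with respect to its data is to make sure an error in the middle of a write never damages what was already stored. The sketch below is only an illustration: the function name `save_atomically` is hypothetical, and the approach (write to a temporary file, then swap it into place) is one common technique, not something the article prescribes.

```python
import os
import tempfile


def save_atomically(path, text):
    """Write text to path so that a failure never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        # Only once the write has fully succeeded is the old file replaced.
        os.replace(tmp_path, path)
    except BaseException:
        # Fail safe: discard the partial output, keep the original untouched,
        # then let the error propagate so the caller can exit gracefully.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the write raises, the previous contents of `path` are still intact and no temporary file is left behind; the caller can then log the error and shut down cleanly.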

Fail-deadly is the opposite strategy, which can be used in weapon systems that are designed to kill or injure targets even if part of the system is damaged or destroyed.

A system that is designed to experience graceful degradation, or to fail soft (used in computing, similar to "fail safe" [3]), operates at a reduced level of performance after some component failures. For example, a building may operate lighting at reduced levels and elevators at reduced speeds if grid power fails, rather than either shutting down completely or continuing to operate at full power. In computing, an example of graceful degradation is that if insufficient network bandwidth is available to stream an online video, a lower-resolution version might be streamed in place of the high-resolution version. Progressive enhancement is another example in computing, where web pages are available in a basic functional format for older, small-screen, or limited-capability web browsers, but in an enhanced version for more capable browsers.
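The video-streaming case can be sketched in a few lines. The rendition labels and bandwidth thresholds below are made-up illustrative values, and `pick_rendition` is a hypothetical helper, not part of any real streaming API.

```python
# Hypothetical renditions: (label, minimum bandwidth in kbit/s), best first.
RENDITIONS = [("1080p", 5000), ("720p", 2500), ("480p", 1000), ("240p", 300)]


def pick_rendition(bandwidth_kbps):
    """Return the best rendition the measured bandwidth can sustain."""
    for label, required in RENDITIONS:
        if bandwidth_kbps >= required:
            return label
    # Degrade gracefully: serve the lowest quality rather than nothing at all.
    return RENDITIONS[-1][0]
```

With these assumed thresholds, a drop in bandwidth lowers the picture quality step by step instead of stopping playback.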

In fault-tolerant computer systems, programs that are considered robust are designed to continue operation despite an error, exception, or invalid input, instead of crashing completely. Software brittleness is the opposite of robustness. Resilient networks continue to transmit data despite the failure of some links or nodes; resilient buildings and infrastructure are likewise expected to prevent complete failure in situations like earthquakes, floods, or collisions.

A system with high failure transparency will alert users that a component failure has occurred, even if it continues to operate with full performance, so that the failure can be repaired or imminent complete failure anticipated. Likewise, a fail-fast component is designed to report at the first point of failure, rather than allowing downstream components to fail and generate reports then. This allows easier diagnosis of the underlying problem, and may prevent improper operation in a broken state.
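A fail-fast component can be illustrated with a small configuration check. This is a minimal sketch; `ConfigError`, `load_port`, and the `port` setting are hypothetical names chosen for the example.

```python
class ConfigError(ValueError):
    """Raised as soon as an invalid setting is seen."""


def load_port(config):
    """Fail fast: reject a bad 'port' setting at the point where it is read."""
    raw = config.get("port")
    if raw is None:
        raise ConfigError("missing required setting 'port'")
    try:
        port = int(raw)
    except (TypeError, ValueError):
        raise ConfigError(f"'port' must be an integer, got {raw!r}") from None
    if not 1 <= port <= 65535:
        raise ConfigError(f"'port' out of range: {port}")
    return port
```

Because the error is raised where the bad value enters the system, the report names the actual cause, rather than a confusing failure appearing later in some downstream component that tried to use the value.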


Redundancy

Redundancy is the provision of functional capabilities that would be unnecessary in a fault-free environment. [4] This can consist of backup components that automatically "kick in" should one component fail. For example, large cargo trucks can lose a tire without any major consequences. They have many tires, and no one tire is critical (with the exception of the front tires, which are used to steer, but generally carry less load, each and in total, than the other four to 16, so are less likely to fail). The idea of incorporating redundancy in order to improve the reliability of a system was pioneered by John von Neumann in the 1950s. [5]

Two types of redundancy are possible: [6] space redundancy and time redundancy. Space redundancy provides additional components, functions, or data items that are unnecessary for fault-free operation. Space redundancy is further classified into hardware, software, and information redundancy, depending on the type of redundant resources added to the system. In time redundancy the computation or data transmission is repeated and the result is compared to a stored copy of the previous result. This kind of testing is currently referred to as In Service Fault Tolerance Testing, or ISFTT for short.
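Time redundancy, as described above, can be sketched as repeating a computation and comparing each result with the stored copy of the previous one. The helper name `run_with_time_redundancy` and the `attempts` parameter are assumptions made for this illustration.

```python
def run_with_time_redundancy(compute, *args, attempts=3):
    """Repeat compute(*args) until two consecutive results agree."""
    previous = compute(*args)  # stored copy of the previous result
    for _ in range(attempts - 1):
        current = compute(*args)
        if current == previous:
            return current  # two runs agree: accept the result
        previous = current  # mismatch: keep the newer result and retry
    raise RuntimeError("results did not agree within the allowed attempts")
```

This catches transient faults, which are unlikely to corrupt two successive runs in the same way; a permanent fault that always produces the same wrong answer would go unnoticed, which is where space redundancy helps.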


Providing fault-tolerant design for every component is normally not an option. Associated redundancy brings a number of penalties: increases in weight, size, power consumption, and cost, as well as in the time needed to design, verify, and test. Therefore, a number of choices have to be examined to determine which components should be fault tolerant: [7]

  • How critical is the component? In a car, the radio is not critical, so this component has less need for fault tolerance.
  • How likely is the component to fail? Some components, like the drive shaft in a car, are not likely to fail, so no fault tolerance is needed.
  • How expensive is it to make the component fault tolerant? Requiring a redundant car engine, for example, would likely be too expensive to be considered.

An example of a component that passes all the tests is a car's occupant restraint system. While we do not normally think of it, the primary occupant restraint system is gravity. If the vehicle rolls over or undergoes severe g-forces, then this primary method of occupant restraint may fail. Restraining the occupants during such an accident is absolutely critical to safety, so we pass the first test. Accidents causing occupant ejection were quite common before seat belts, so we pass the second test. The cost of a redundant restraint method like seat belts is quite low, both economically and in terms of weight and space, so we pass the third test. Therefore, adding seat belts to all vehicles is an excellent idea. Other "supplemental restraint systems", such as airbags, are more expensive and so pass that test by a smaller margin.

Another long-standing example of this principle in practice is the braking system: while the brake mechanisms themselves are critical, they are not particularly prone to sudden (rather than progressive) failure, and are in any case necessarily duplicated to allow even and balanced application of brake force to all wheels. It would also be prohibitively costly to further double up the main components, and they would add considerable weight. However, the similarly critical systems for actuating the brakes under driver control are inherently less robust, generally using a cable (which can rust, stretch, jam, or snap) or hydraulic fluid (which can leak, boil and develop bubbles, or absorb water and thus lose effectiveness). Thus in most modern cars the footbrake hydraulic circuit is divided diagonally to give two smaller points of failure, the loss of either only reducing brake power by 50% and not causing as dangerous a brake-force imbalance as a straight front-back or left-right split. Should the hydraulic circuit fail completely (a relatively rare occurrence), there is a fail-safe in the form of the cable-actuated parking brake, which operates the otherwise relatively weak rear brakes but can still bring the vehicle to a safe halt in conjunction with transmission or engine braking, so long as the demands on it are in line with normal traffic flow. The cumulatively unlikely combination of total footbrake failure with the need for harsh braking in an emergency would likely result in a collision, but one at a lower speed than would otherwise have been the case.

In comparison with the footbrake, the parking brake itself is a less critical item, and unless it is being used as a one-time backup for the footbrake, it will not cause immediate danger if it is found to be nonfunctional at the moment of application. Therefore, no redundancy is built into it (it typically uses a cheaper, lighter, but less hardwearing cable actuation system), and it can suffice, if this happens on a hill, to use the footbrake to momentarily hold the vehicle still before driving off to find a flat piece of road on which to stop. Alternatively, on shallow gradients, the transmission can be shifted into Park, Reverse, or first gear, and the transmission lock or engine compression used to hold the vehicle stationary, as there is no need for them to include the sophistication to first bring it to a halt.

On motorcycles, a similar level of fail-safety is provided by simpler methods. First, the front and rear brake systems are entirely separate, irrespective of their method of activation, allowing one to fail while leaving the other unaffected. Second, the rear brake is relatively strong compared to its automotive cousin, even being a powerful disc on sports models, even though the usual intent is for the front system to provide the vast majority of braking force; because the overall vehicle weight is more central and the rider can lean back to put more weight over the rear wheel, more brake force can be applied there before the wheel locks up.


The basic characteristics of fault tolerance require:

  1. No single point of failure – If a system experiences a failure, it must continue to operate without interruption during the repair process.
  2. Fault isolation to the failing component – When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation. Recovery from a fault condition requires classifying the fault or failing component. The National Institute of Standards and Technology (NIST) categorizes faults based on locality, cause, duration, and effect.
  3. Fault containment to prevent propagation of the failure – Some failure mechanisms can cause a system to fail by propagating the failure to the rest of the system. An example of this kind of failure is the "rogue transmitter" that can swamp legitimate communication in a system and cause overall system failure. Firewalls or other mechanisms that isolate a rogue transmitter or failing component are required to protect the system.
  4. Availability of reversion modes

In addition, fault-tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at the hardware level. The figure of merit is called availability and is expressed as a percentage. For example, a five nines system would provide 99.999% availability.
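To make an availability percentage concrete, it helps to convert it into the downtime it permits. The sketch below assumes a 365-day year; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(availability_percent):
    """Total outage time per year that a given availability still permits."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR
```

Under that assumption, a five nines (99.999%) system may be down for only about 5.26 minutes per year, planned and unplanned outages combined, while a three nines (99.9%) system may be down for about 525.6 minutes, or roughly 8.8 hours.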

Fault-tolerant systems are typically based on the concept of redundancy.


Spare components address the first fundamental characteristic of fault tolerance in three ways:

  • Replication: Providing multiple identical instances of the same system or subsystem, directing tasks or requests to all of them in parallel, and choosing the correct result on the basis of a quorum;
  • Redundancy: Providing multiple identical instances of the same system and switching to one of the remaining instances in case of a failure (failover);
  • Diversity: Providing multiple different implementations of the same specification, and using them to cope with errors in a specific implementation.
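The failover approach from the list above can be sketched as trying each instance in turn until one succeeds. Here each "instance" is simply modelled as a Python callable, and `call_with_failover` is a hypothetical helper; real systems would also need health checks and timeouts, which are omitted.

```python
def call_with_failover(instances, request):
    """Send the request to the first instance that handles it without error."""
    errors = []
    for instance in instances:
        try:
            return instance(request)
        except Exception as exc:  # this instance failed: switch to the next
            errors.append(exc)
    raise RuntimeError(f"all {len(instances)} instances failed: {errors}")
```

The caller sees a normal result as long as at least one instance is working; only when every instance has failed does the error surface.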

All implementations of RAID (redundant array of independent disks), except RAID 0, are examples of a fault-tolerant storage device that uses data redundancy.
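As an illustration of data redundancy, the sketch below stores one extra XOR parity block alongside the data blocks, the idea behind the parity-based RAID levels. It is a toy model with made-up block contents, not an implementation of any actual RAID layout.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data "disks" plus one parity "disk".
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# Suppose the second disk is lost: rebuild it from the survivors and parity.
recovered = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block reproduces the missing one, so `recovered` equals `b"EFGH"`. A single parity block can only cover the loss of one disk at a time.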

A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each replication, and the same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed dual modular redundant (DMR). The voting circuit can then only detect a mismatch, and recovery relies on other methods. A machine with three replications of each element is termed triple modular redundant (TMR). The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications.
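The voting step can be modelled in software as a strict majority vote over the replicas' outputs. This is only a sketch of the logic (real voters are hardware circuits), and the function name `vote` is an assumption.

```python
from collections import Counter


def vote(outputs):
    """Return (majority value, indices of replicas that disagreed with it)."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        # No strict majority: the mismatch is detected but cannot be resolved.
        raise RuntimeError(f"no majority among outputs {outputs!r}")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

With three replicas, `vote([7, 7, 9])` returns `(7, [2])`: the correct value plus the index of the replica to exclude. With two replicas, a disagreement such as `vote([7, 9])` raises an error, which mirrors the point that DMR can detect a mismatch but not correct it.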

Lockstep fault-tolerant machines are most easily made fully synchronous, with each gate of each replication making the same state transition on the same edge of the clock, and the clocks to the replications being exactly in phase. However, it is possible to build lockstep systems without this requirement.

Bringing the replications into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replica can be copied to another replica.

One variant of DMR is pair-and-spare. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. Another pair operates exactly the same way. A final circuit selects the output of the pair that does not proclaim that it is in error. Pair-and-spare requires four replicas rather than the three of TMR, but has been used commercially.
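The pair-and-spare selection logic can be sketched as follows. Replicas are modelled as plain functions, and the names `pair_output` and `pair_and_spare` are hypothetical; real designs implement this in hardware.

```python
def pair_output(replica_a, replica_b, x):
    """Run one lockstep pair; return (output, error_flag)."""
    a, b = replica_a(x), replica_b(x)
    return a, a != b  # a mismatch within the pair raises its error flag


def pair_and_spare(primary_pair, spare_pair, x):
    """Select the output of the first pair that does not report an error."""
    for pair in (primary_pair, spare_pair):
        output, error = pair_output(*pair, x)
        if not error:
            return output
    raise RuntimeError("both pairs report an error")
```

Each pair can only say "we disagree", not which of its two members is wrong, so the final selector simply ignores a pair whose error flag is set and uses the other one.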


The advantages of fault-tolerant design are obvious, while many of its disadvantages are not:

  • Interference with fault detection in the same component. With a fault-tolerant tire design on a passenger vehicle, for example, it may not be obvious to the driver when a tire has been punctured. This is usually handled with a separate "automated fault-detection system". In the case of the tire, an air pressure monitor detects the loss of pressure and notifies the driver. The alternative is a "manual fault-detection system", such as manually inspecting all tires at each stop.
  • Interference with fault detection in another component. Another variation of this problem is when fault tolerance in one component prevents fault detection in a different component. For example, if component B performs some operation based on the output from component A, then fault tolerance in B can hide a problem with A. If component B is later changed to a less fault-tolerant design, the system may fail suddenly, making it appear that the new component B is the problem. Only after the system has been carefully scrutinized will it become clear that the root problem is actually with component A.
  • Reduction of priority of fault correction. Even if the operator is aware of the fault, having a fault-tolerant system is likely to reduce the importance of repairing the fault. If the fault is not corrected, this will eventually lead to system failure, when the fault-tolerant component fails completely or when all redundant components have failed.
  • Test difficulty. For certain critical fault-tolerant systems, such as a nuclear reactor, there is no easy way to verify that the backup components are functional. The most infamous example of this is Chernobyl, where operators tested the emergency backup by disabling primary and secondary cooling. The backup failed, resulting in a meltdown and a massive release of radiation.
  • Cost. Both fault-tolerant components and redundant components tend to increase cost. This may be a purely economic cost or may include other measures, such as weight. Manned spaceships, for example, have so many redundant and fault-tolerant components that their weight is increased dramatically over unmanned systems, which do not require the same level of safety.
  • Inferior components. A fault-tolerant design may allow for the use of inferior components, which would have otherwise made the system inoperable. While this practice has the potential to mitigate the cost increase, the use of multiple inferior components may lower the reliability of the system to a level equal to, or even worse than, a comparable non-fault-tolerant system.


Hardware fault tolerance sometimes requires that broken parts be taken out and replaced with new parts while the system is still operational (in computing known as hot swapping). Such a system implemented with a single backup is known as single point tolerant, and represents the vast majority of fault-tolerant systems. In such systems the mean time between failures should be long enough for the operators to have time to fix the broken devices (mean time to repair) before the backup also fails. It helps if the time between failures is as long as possible, but this is not specifically required in a fault-tolerant system.
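The relationship between mean time between failures (MTBF) and mean time to repair (MTTR) can be given a rough number. The sketch below rests on two assumptions that are not from the text: failures of the backup follow an exponential distribution (a constant failure rate), and the repair takes exactly the MTTR. The function name is hypothetical.

```python
import math


def backup_failure_probability(mtbf_hours, mttr_hours):
    """Chance that the lone backup fails before the repair is finished.

    Assumes exponentially distributed failures with mean mtbf_hours and a
    repair that takes exactly mttr_hours.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    # P(failure within t) = 1 - exp(-t / MTBF); expm1 keeps small values precise.
    return -math.expm1(-mttr_hours / mtbf_hours)
```

Under these assumptions, a backup with an MTBF of 10,000 hours and a 10-hour repair has roughly a 0.1% chance of failing before the repair completes; shortening the repair time or lengthening the MTBF both reduce that risk.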

Fault tolerance has been notably successful in computer applications. Tandem Computers built their entire business on such machines, which used single-point tolerance to create their NonStop systems with uptimes measured in years.

Fail-safe architectures may also encompass computer software, for example by process replication.

Data formats may also be designed to degrade gracefully. HTML, for example, is designed to be forward compatible, allowing new HTML elements to be ignored by Web browsers that do not understand them, without causing the document to become unusable.
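The same idea can be shown with a toy renderer that knows only a couple of tags and silently skips any tag it does not recognize while keeping the enclosed text. `TolerantRenderer`, `render`, and the made-up `<sparkle>` tag are illustrative assumptions; the parser class comes from Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class TolerantRenderer(HTMLParser):
    """Renders the tags it knows and ignores the rest, keeping their text."""

    MARKERS = {"b": "*", "i": "_"}  # the only tags this toy renderer knows

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.MARKERS.get(tag, ""))  # unknown tag: emit nothing

    def handle_endtag(self, tag):
        self.out.append(self.MARKERS.get(tag, ""))

    def handle_data(self, data):
        self.out.append(data)  # text is always kept


def render(html):
    renderer = TolerantRenderer()
    renderer.feed(html)
    renderer.close()
    return "".join(renderer.out)
```

Here `render("<b>Hi</b> <sparkle>there</sparkle>")` produces `*Hi* there`: the unknown tag loses its special effect, but the content it wraps is still shown.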

Related terms

There is a difference between fault tolerance and systems that rarely have problems. For instance, the Western Electric crossbar systems were highly fault resistant. But when a fault did occur they stopped operating completely, and therefore were not fault tolerant.

See also

  • Control reconfiguration
  • Damage tolerance
  • Defense in depth
  • Elegant degradation
  • Error-tolerant design (human-error-tolerant design)
  • Failure semantics
  • Fault-tolerant computer system
  • List of system quality attributes
  • Resilience (ecology)
  • Resilience (network)
  • Safe-life design

References

  1. Adaptive Fault Tolerance and Graceful Degradation, Oscar González et al., 1997, University of Massachusetts – Amherst
  2. Johnson, B. W. (1984). "Fault-Tolerant Microprocessor-Based Systems", IEEE Micro, vol. 4, no. 6, pp. 6-21
  3. Stallings, W. (2009). Operating Systems: Internals and Design Principles, sixth edition
  4. Laprie, J. C. (1985). "Dependable Computing and Fault Tolerance: Concepts and Terminology", Proceedings of the 15th International Symposium on Fault-Tolerant Computing (FTCS-15), pp. 2-11
  5. von Neumann, J. (1956). "Probabilistic Logics and Synthesis of Reliable Organisms from Unreliable Components", in Automata Studies, eds. C. Shannon and J. McCarthy, Princeton University Press, pp. 43-98
  6. Avizienis, A. (1976). "Fault-Tolerant Systems", IEEE Transactions on Computers, vol. 25, no. 12, pp. 1304-1312
  7. Dubrova, E. (2013). "Fault-Tolerant Design", Springer, 2013, ISBN 978-1-4614-2112-2

Copyright computerforum.eu 2019