Fault-tolerant computer systems are systems designed around the concept of fault tolerance. In essence, they must be able to continue working to a satisfactory level in the presence of faults.
Fault tolerance is not just a property of individual machines; it can also characterize the rules by which they interact. For example, the Transmission Control Protocol (TCP) is designed to allow reliable two-way communication in a packet-switched network, even in the presence of communications links that are imperfect or overloaded. It does this by requiring the endpoints of the communication to expect packet loss, duplication, reordering, and corruption, so that these conditions do not damage data integrity and only reduce throughput by a proportional amount.
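The receiver-side half of this idea can be sketched with sequence numbers: duplicates are discarded, out-of-order segments are buffered, and data is released to the application only in order. This is a minimal illustration in the spirit of TCP, not the actual protocol implementation; all names are illustrative.

```python
# Sketch: tolerating loss, duplication, and reordering with sequence
# numbers, in the spirit of TCP. Illustrative, not the real protocol.

class ReorderingReceiver:
    def __init__(self):
        self.next_seq = 0     # next in-order sequence number expected
        self.buffer = {}      # out-of-order segments held for later
        self.delivered = []   # data released to the application, in order

    def receive(self, seq, data):
        if seq < self.next_seq or seq in self.buffer:
            return  # duplicate: ignore it; data integrity is unaffected
        self.buffer[seq] = data
        # release any contiguous run starting at next_seq
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

    def ack(self):
        return self.next_seq  # cumulative ack: everything below is received

rx = ReorderingReceiver()
# Segments arrive reordered and duplicated...
for seq, data in [(1, "b"), (0, "a"), (0, "a"), (3, "d"), (2, "c")]:
    rx.receive(seq, data)
print("".join(rx.delivered))  # ...but are delivered in order: "abcd"
```

A lost segment simply never arrives; the sender, seeing the cumulative acknowledgement stall, would retransmit it, reducing throughput but not correctness.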
Recovery from errors in fault-tolerant systems can be characterized as either ‘roll-forward’ or ‘roll-back’. When the system detects that it has made an error, roll-forward recovery takes the system state at that time and corrects it, to be able to move forward. Roll-back recovery reverts the system state back to some earlier, correct version, for example using checkpointing, and moves forward from there. Roll-back recovery requires that the operations between the checkpoint and the detected erroneous state can be made idempotent. Some systems make use of both roll-forward and roll-back recovery for different errors or different parts of one error.
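Roll-back recovery with a checkpoint can be sketched as follows. This is a toy model with illustrative names; the key point is that the operations replayed after the rollback are idempotent (set-style updates), so reapplying them is safe.

```python
import copy

# Sketch of roll-back recovery via checkpointing (illustrative names).
# Operations are idempotent set-style updates, so any that ran between
# the checkpoint and the detected error can safely be replayed.

class CheckpointedStore:
    def __init__(self):
        self.state = {}
        self._checkpoint = {}

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)  # known-good snapshot

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)  # revert to snapshot

    def apply(self, key, value):
        self.state[key] = value  # idempotent: reapplying is harmless

store = CheckpointedStore()
store.apply("balance", 100)
store.checkpoint()
store.apply("balance", 250)    # operation after the checkpoint
# ... an error is detected here, so roll back and replay ...
store.rollback()               # state reverts to the checkpoint (100)
store.apply("balance", 250)    # idempotent replay reaches the same state
print(store.state["balance"])  # 250
```

If the replayed operation were not idempotent (say, `balance += 150`), replaying it after a partial execution could double-apply the update, which is exactly why roll-back recovery imposes the idempotence requirement.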
Types of fault tolerance
Most fault-tolerant computer systems are designed to handle several feasible failures, including hardware-related faults such as hard disk failures, input or output device failures, or other temporary or permanent failures; software bugs and errors; interface errors between hardware and software, including driver failures; operator errors, such as erroneous keystrokes, bad command sequences, or installing unexpected software; and physical damage or other flaws introduced to the system from an outside source.
Hardware fault tolerance is the most common application of these systems, designed to prevent failures due to hardware components. Most basically, this is provided by redundancy, particularly dual modular redundancy. Typically, components have multiple backups and are separated into smaller segments that act to contain a fault, and extra redundancy is built into all physical connectors, power supplies, fans, etc. There are special software and instrumentation packages designed to detect failures, such as fault masking, in which, when the main system and its backups do not give the same results, the flawed output is ignored.
Software fault tolerance is based on real-time or static “emergency” subprograms that fill in for programs that crash. There are many ways to conduct such fault regulation, depending on the application and the available hardware.
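A minimal sketch of the “emergency subprogram” pattern: if the primary routine fails, a simpler, pre-verified fallback supplies a degraded but safe answer. All function names and the last-known-good default here are illustrative assumptions, not a real avionics interface.

```python
# Sketch of software fault tolerance via a static "emergency" fallback.
# Names and values are illustrative.

def primary_airspeed_estimate(sensor_reading):
    # Primary path: may fail on bad input (e.g., a dropped sensor value).
    return 3600.0 / sensor_reading  # raises ZeroDivisionError on 0

def fallback_airspeed_estimate(_sensor_reading, last_good=120.0):
    # Emergency subprogram: return the last known-good value.
    return last_good

def airspeed(sensor_reading):
    try:
        return primary_airspeed_estimate(sensor_reading)
    except (ZeroDivisionError, TypeError):
        # Primary crashed: the static fallback fills in.
        return fallback_airspeed_estimate(sensor_reading)

print(airspeed(30))  # primary path works: 120.0
print(airspeed(0))   # primary faults; fallback supplies 120.0
```

The fallback is deliberately much simpler than the primary path, on the reasoning that a small, static subprogram is easier to verify and less likely to share the primary's failure mode.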
The first known fault-tolerant computer was SAPO, built in 1951 in Czechoslovakia by Antonín Svoboda. Its basic design was magnetic drums connected via relays, with a voting method of memory error detection (triple modular redundancy). Several other machines were developed along this line, mostly for military use. Eventually, they separated into three distinct categories: machines that would last a long time without any maintenance, such as the ones used on NASA space probes and satellites; computers that were very dependable but required constant monitoring, such as those used to monitor and control nuclear power plants or supercollider experiments; and finally, computers with a high amount of runtime which would be under heavy use, such as many of the supercomputers used by insurance companies for their probability monitoring.
Most of the development in the so-called LLNM (Long Life, No Maintenance) computing was done by NASA during the 1960s, in preparation for Project Apollo and other research aspects. NASA’s first machine went into a space observatory, and their second attempt, the JSTAR computer, was used in Voyager. This machine, the JPL Self-Testing-And-Repairing computer, had a backup of memory arrays to use memory recovery methods; it could detect its own errors and fix them or bring up redundant modules as needed. The computer is reportedly still working.
Hyper-dependable computers were pioneered mainly by aircraft manufacturers, nuclear power companies, and the railroad industry in the USA. These needed computers with massive amounts of uptime that would fail gracefully enough with a fault to allow continued operation, while relying on the fact that the computer output would be constantly monitored by humans to detect faults. Again, IBM developed the first computer of this kind for NASA for guidance of Saturn V rockets, but later on BNSF, Unisys, and General Electric built their own.
The 1970 F14 CADC had built-in self-test and redundancy. 
In general, the early efforts at fault-tolerant designs were focused mainly on internal diagnosis, where a fault would indicate that something was failing and a worker could replace it. SAPO, for instance, had a method by which faulty memory drums would emit a noise before failure. Later efforts showed that, to be fully effective, the system had to be self-repairing and diagnosing – isolating a fault and then implementing a redundant backup while alerting a need for repair. This is known as N-model redundancy, where faults cause automatic fail-safes and a warning to the operator, and it is still one of the most common forms of fault-tolerant design.
Voting was another initial method, with multiple redundant backups operating constantly and checking each other’s results, so that, for example, if four components reported an answer of 5 and one component reported an answer of 6, the other four would “vote” that the fifth component was faulty and have it taken out of service. This is called M out of N majority voting.
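The voting scheme just described can be sketched directly. This is a simplified majority voter over component outputs; component identifiers and the no-majority error are illustrative.

```python
from collections import Counter

# Sketch of M-out-of-N majority voting over redundant components.
# The majority answer wins; dissenting components are flagged as faulty.

def majority_vote(outputs):
    """outputs: list of (component_id, value); returns (value, faulty_ids)."""
    counts = Counter(value for _, value in outputs)
    winner, votes = counts.most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError("no majority: the fault cannot be masked")
    faulty = [cid for cid, value in outputs if value != winner]
    return winner, faulty

# Four components report 5; one faulty component reports 6.
value, faulty = majority_vote([("A", 5), ("B", 5), ("C", 6), ("D", 5), ("E", 5)])
print(value, faulty)  # 5 ['C']  -> component C would be taken out of service
```

Note that voting masks the fault (the system's output is still 5) while simultaneously diagnosing which component to service, which is what distinguishes it from simple failover.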
Historically, the trend has been to move further from N-model redundancy and more toward M out of N voting, owing to the complexity of systems and the difficulty of ensuring that the transition from a fault-negative to a fault-positive state did not disrupt operations.
Tandem and Stratus were among the first companies specializing in the design of fault-tolerant computer systems for online transaction processing.
Fault tolerance verification and validation
The most important requirement of design in a fault-tolerant computer system is making sure it actually meets its requirements for reliability. This is done by using various failure models to simulate various failures, and analyzing how well the system reacts. These statistical models are very complex, involving latency curves, error rates, and the like. The most commonly used models are HARP, SAVE, and SHARPE in the USA, and SURF or LASS in Europe.
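As a toy illustration of simulating failures to evaluate a design, the following Monte Carlo sketch estimates the reliability of a triple-modular-redundant system against a single module. This is an assumption-laden miniature, not HARP, SAVE, SHARPE, SURF, or LASS; the failure probability and trial count are arbitrary.

```python
import random

# Toy Monte Carlo reliability model: simulate independent module failures
# and measure how often a 2-of-3 (TMR) system still produces a result.

def simulate_tmr(p_fail, trials=100_000, seed=42):
    """Estimated probability that at least 2 of 3 modules survive."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        failures = sum(rng.random() < p_fail for _ in range(3))
        ok += failures <= 1  # the voter masks at most one failed module
    return ok / trials

p = 0.1
print(f"single module survives: {1 - p:.3f}")          # 0.900
print(f"TMR system survives:   {simulate_tmr(p):.3f}")  # analytically ~0.972
```

The analytic value is (1-p)^3 + 3p(1-p)^2 = 0.972 for p = 0.1, so the simulation doubles as a sanity check on the model; real evaluation tools add latency distributions, repair rates, and correlated failures on top of this basic idea.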
Fault tolerance research
Research into the kinds of tolerances needed for critical systems involves a large amount of interdisciplinary work. The more complex the system, the more carefully all possible interactions have to be considered and prepared for. Considering the importance of high-value systems in transportation, public utilities, and the military, the field of topics that touch on this research is very wide: it can include such obvious subjects as software modeling and reliability or hardware design, as well as arcane elements such as stochastic models, graph theory, formal or exclusionary logic, parallel processing, remote data transmission, and more.
Failure-oblivious computing is a technique that enables computer programs to continue executing despite memory errors. The technique handles attempts to read invalid memory by returning a manufactured value to the program, which in turn makes use of the manufactured value and ignores the former memory value it tried to access. This is a great contrast to typical memory checkers, which inform the program of the error or abort the program. In failure-oblivious computing, no attempt is made to inform the program that an error occurred.
The approach has performance costs: because the technique rewrites code to insert dynamic checks for address validity, execution time will increase by 80% to 500%.
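The behaviour can be illustrated in miniature with a bounds-checked array whose invalid reads manufacture a value and whose invalid writes are discarded. Real failure-oblivious systems rewrite compiled code rather than using a wrapper class; this is only an analogy, and the manufactured default of 0 is an arbitrary choice.

```python
# Miniature analogy for failure-oblivious computing: invalid reads do not
# raise; they return a manufactured value so execution simply continues.

class ObliviousArray:
    def __init__(self, data, manufactured=0):
        self.data = list(data)
        self.manufactured = manufactured

    def read(self, index):
        if 0 <= index < len(self.data):
            return self.data[index]
        return self.manufactured      # invalid read: manufacture a value

    def write(self, index, value):
        if 0 <= index < len(self.data):
            self.data[index] = value  # invalid writes are silently discarded

a = ObliviousArray([10, 20, 30])
print(a.read(1))    # 20: a valid read behaves normally
print(a.read(99))   # 0: out of bounds, but no crash and no error report
a.write(99, 7)      # discarded rather than corrupting adjacent state
print(a.data)       # [10, 20, 30]
```

The dynamic bounds checks in `read` and `write` are the analogue of the inserted validity checks that account for the technique's runtime overhead.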
Recovery shepherding is a lightweight technique that enables software programs to recover from otherwise fatal errors. In contrast to failure-oblivious computing, recovery shepherding works on the compiled program binary directly and does not require recompiling the program. It uses the just-in-time binary instrumentation framework Pin. It attaches to the application process when an error occurs, repairs the execution, tracks the repair effects as the execution continues, contains the repair effects within the application process, and detaches from the process after all repair effects are flushed from the process state. It does not interfere with the normal execution of the program and therefore incurs negligible overhead. For systematically collected real-world null-dereference and divide-by-zero errors, a prototype implementation enables the application to continue to provide acceptable output and service to its users on the error-triggering inputs.
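The repair strategy (though not the Pin-based binary-instrumentation mechanism, which cannot be reproduced in a few lines) can be sketched as: when a divide-by-zero fires, substitute a default result so execution continues, and record the repair so its effects can be tracked. All names and the choice of 0.0 as the substituted result are illustrative.

```python
# Toy sketch of the recovery-shepherding repair strategy: substitute a
# manufactured result for a divide-by-zero and track the repair effect.
# (Real recovery shepherding instruments the compiled binary via Pin.)

def shepherded_div(a, b, repairs):
    try:
        return a / b
    except ZeroDivisionError:
        repairs.append(("div-by-zero", a, b))  # track the repair effect
        return 0.0                             # manufactured result

repairs = []
rates = [shepherded_div(total, count, repairs)
         for total, count in [(10, 2), (7, 0), (9, 3)]]
print(rates)    # [5.0, 0.0, 3.0] -- the error-triggering input is survived
print(repairs)  # [('div-by-zero', 7, 0)]
```

Tracking the repairs is what allows a shepherding system to contain their effects: once no state derived from a manufactured value remains live, the instrumentation can detach and the program runs at full speed again.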
See also

- Byzantine fault tolerance
- Computer cluster
- Data redundancy
- Error detection and correction
- Fall back and forward
- Fault tolerance
- Graceful exit
- Immunity aware programming
- Intrusion tolerance
- List of system quality attributes
- Multipath routing
- Progressive enhancement
- Resilience (network)
- Robustness (computer science)
- Rollback (data management)
- Separation of protection and security