Fault tolerance
Fault tolerance is the ability of a system to contain the propagation of faults (e.g. failed transistor, shorted connector, intermittent data bus). Faults may manifest as errors (e.g. bad data value, missing message) causing incorrect system state. Failure tolerant systems mask errors and maintain failure-free operation in the presence of one or more faulty components. This capability is essential for high-availability, mission-critical, or even life-critical systems.
Fault tolerance specifically refers to a system's capability to handle faults without any degradation or downtime. In the event of an error, end-users remain unaware of any issues. Conversely, a system that experiences errors with some interruption in service or graceful degradation of performance is termed 'resilient'. In resilience, the system adapts to the error, maintaining service but acknowledging a certain impact on performance.
Typically, fault tolerance describes computer systems, ensuring the overall system remains functional despite hardware or software issues. Non-computing examples include structures that retain their integrity despite damage from fatigue, corrosion or impact.