Tags

design system system-reliability

Answers (3)

Accepted Answer
March 29, 2025 Score: 9 Rep: 85,842 Quality: Expert Completeness: 50%

I haven't got the book, but the first page has this:

[Image: excerpt from the book's first page]

It seems to me that unless one of the chapters specifically defines "Fault Tolerance" somewhere, they are just using "reliability" and "fault tolerance" to mean the same thing.

In fact, if you read a little further, there are several paragraphs about fault tolerance vs resilience vs failures and preventing faults.

Reliability

"...we can understand reliability as meaning, roughly, 'continuing to work correctly, even when things go wrong.'"

Fault Tolerance

"The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient."

As developers, we are not particularly scientific about defining the terms we use. You always have to ask "what does X mean... in this context?" In this case the context is the book, and other people can and will use the terms to mean different things in other books, talks, and so on.

March 29, 2025 Score: 4 Rep: 119,888 Quality: Medium Completeness: 50%

A reliable nuclear reactor keeps producing power without a life-threatening meltdown.

Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to function without failure.

Wikipedia - Reliability Engineering

A fault-tolerant nuclear reactor stops producing power to avoid a life-threatening meltdown.

A highly fault-tolerant system might continue at the same level of performance even though one or more components have failed. For example, a building with a backup electrical generator will provide the same voltage to wall outlets even if the grid power fails.

A system that is designed to fail safe, or fail-secure, or fail gracefully, whether it functions at a reduced level or fails completely, does so in a way that protects people, property, or data from injury, damage, intrusion, or disclosure. In computers, a program might fail-safe by executing a graceful exit (as opposed to an uncontrolled crash) to prevent data corruption after an error occurs. A similar distinction is made between "failing well" and "failing badly".

Wikipedia - Fault Tolerance

Thus a reactor that has never killed anyone, but also keeps shutting down, could be described as fault-tolerant because it maintained safety. But it's hardly reliable.

Usage of the two terms overlaps in many ways. But this question isn't about that. It's about their differences. The difference is subtle but it's real.

Life isn't the only thing to protect. In many systems, the process-halting exception protects data from being corrupted by a system in an undefined state.

A desire for reliability may drive you to think halting a system is a bad thing. But bad as that is, it’s better than letting the system tear itself apart.
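To make the "halting protects data" point concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only (`apply_transfer`, `InvariantViolation`, the balances dictionary); it is one way to halt rather than commit an untrusted state, not a prescribed design.

```python
class InvariantViolation(Exception):
    """Raised when the computed state can no longer be trusted."""


def apply_transfer(balances, src, dst, amount):
    """Return updated balances, or halt instead of committing a bad state."""
    total_before = sum(balances.values())
    updated = dict(balances)  # work on a copy; the caller's data stays intact
    updated[src] -= amount
    updated[dst] += amount
    # Invariants: the total is conserved and no balance goes negative.
    if sum(updated.values()) != total_before or min(updated.values()) < 0:
        raise InvariantViolation(f"refusing to commit transfer of {amount}")
    return updated
```

The caller loses availability when the exception fires (the transfer does not happen), but the stored balances are never overwritten with values that break the invariants. That is the trade the reactor makes when it shuts down.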

So no, fault tolerance is not "sufficient" for reliability. For that you need more than one reactor.
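As a rough sketch of the "more than one reactor" idea: each unit may trip safely on its own, and the overall service stays up only because another unit can take over. The names here (`supply_power`, the callables standing in for reactors, `RuntimeError` as the "tripped" signal) are assumptions made up for this example.

```python
def supply_power(sources):
    """Return output from the first source that has not shut itself down."""
    for source in sources:
        try:
            return source()
        except RuntimeError:
            continue  # this unit tripped safely; fall through to the next one
    raise RuntimeError("all sources are offline")
```

Each source is individually fault-tolerant in the sense used above (it stops rather than melting down). Reliability, in the sense of power still coming out, only appears at the level of the group.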

March 29, 2025 Score: -1 Rep: 12,819 Quality: Medium Completeness: 30%

I would offer a different definition.

The reliability of a machine is the extent to which it continues to do what it should without further supervision and intervention - "what it should" being whatever expectation the people relying on it have.

Some machines are unreliable for our use not because they are inconsistent, but because the people relying on them can't discern their capabilities or limitations.

Machines can be reliable without being particularly fault-tolerant.

In fact a lot of engineering is about preventing machines from sustaining faults or adverse conditions (like clean rooms for manufacturing), or preventing the effect of faults from spreading through the use of redundancy in subsystems (like backup electric power generators), not making machines actually withstand faults or adverse conditions.

What we think of as "fault-tolerance" is usually the idea that the machine works in a reasonable range of cases without requiring a supervisor's intervention, or presses on to make further progress through retries or a range of other methods without demanding constant attention. The cases might be "faults" in a sense, but they are in fact ordinary cases that the machine is designed to handle: variations in circumstances, rather than true faults which stop the machine or cause it to go haywire.

Sometimes machines can incorporate unexpected reliability in their design. For example, machines with poor tolerances between moving parts (often considered an undesirable quality) might unexpectedly withstand the introduction of sand or dirt into their mechanisms.
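The "retries without demanding constant attention" idea might look something like the following Python sketch. The helper name `call_with_retries` and its parameters are invented for illustration; treating `OSError` as the "ordinary, expected" failure is also just an assumption for the example.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(); on OSError, retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: now a person has to look at it
            sleep(base_delay * 2 ** attempt)
```

Transient errors are absorbed as ordinary cases the machine was designed for. Only when the retries are exhausted does the failure escape and demand supervision.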

The overall point is, machine reliability should be thought of from the perspective of a relationship between man and machine, "doing as it should without requiring undue amounts of supervision labour to be available or applied".

I wouldn't try to define every aspect of "what it should", because that is very sensitive to the eye of the beholder (including what they think or know about what the machine does or will do) and the circumstances in which the machine is applied.