Why Do Distributed Systems Fail? (Part 1)
Distributed systems are tricky: it's easy to make wrong assumptions that lead to problems down the road. Back in the 90s, computer scientist L. Peter Deutsch and his colleagues at Sun Microsystems identified eight common misconceptions, or "fallacies," that trip up engineers working on distributed systems. Surprisingly, these fallacies are still relevant today:
- The Network is Reliable: It's risky to assume networks are 100% reliable. Networks can and do fail in various ways.
- Latency is Zero: While we might wish our networks had no latency, that's simply not physically possible, since even light takes time to travel a distance. Ignoring the inevitable delay in data transmission leads to unrealistic expectations of system performance.
- Bandwidth is Infinite: This overlooks the physical and practical limitations on data transfer rates.
- The Network is Secure: Assuming the network is inherently secure leads to vulnerabilities and gaps in protective measures. It's no wonder security is a growing industry.
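- Topology Doesn't Change: Network configurations are dynamic. Nodes, links, and routes are added, removed, and moved all the time.
- There is One Administrator: In practice, different teams, organizations, and providers each manage part of a distributed system, and no single person controls all of it.
- Transport Cost is Zero: Moving data consumes resources, both in compute (serializing and sending it) and in money (infrastructure and bandwidth).
- The Network is Homogeneous: Real networks mix different hardware, operating systems, protocols, and standards.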
These fallacies, if not recognized and addressed, can lead to design flaws, performance issues, and security vulnerabilities in distributed systems. In the following sections, we will break down each of these misconceptions, exploring their implications and how to mitigate the risks they pose in real-world applications.
Fallacy 1: The Network is Reliable
The belief that 'The Network is Reliable' is one of the most common and risky assumptions in distributed computing. It leads engineers to underestimate both the likelihood and the impact of network failures. In reality, networks are susceptible to a range of issues, from temporary outages and packet loss to more severe disruptions caused by hardware failures, software bugs, or external factors like natural disasters.
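To make this concrete, here is a minimal sketch of one common way to treat network failures as expected rather than exceptional: retrying a failed call with exponential backoff and jitter. Every name in it (`TransientNetworkError`, `call_with_retries`, `make_flaky_operation`) is hypothetical and chosen for illustration. The "remote call" is simulated so the example runs without a network.

```python
import random
import time


class TransientNetworkError(Exception):
    """Hypothetical error standing in for timeouts, resets, or dropped packets."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Run operation(), retrying on TransientNetworkError.

    Waits a random time between attempts (exponential backoff with jitter)
    and re-raises the error once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The backoff ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: pick a random delay so many clients don't retry in lockstep.
            sleep(random.uniform(0, ceiling))


def make_flaky_operation(failures_before_success):
    """Simulate a remote call that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientNetworkError("simulated packet loss")
        return f"ok after {state['calls']} calls"

    return operation


if __name__ == "__main__":
    # The simulated call fails twice, then succeeds on the third attempt.
    print(call_with_retries(make_flaky_operation(2)))
```

Retries like this are only safe when the operation is idempotent, meaning that repeating it has the same effect as running it once. Otherwise a request that succeeded but whose response was lost may be applied twice.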