SRE Prodverbs

"Prodverbs" (or production proverbs) is a cool collection of sayings maintained by Google's SRE team, which all of us who write or maintain distributed systems should understand. Let's explore each one:
1️⃣ If two systems must agree for them to work, someday they will inevitably disagree
This prodverb captures a core challenge in distributed systems: the need for agreement among multiple components. Processes often need to reach consensus on a single value or state to stay consistent and coordinate their actions. However, relying on perfect agreement is unrealistic in the face of network partitions, split-brain situations, hardware failures, and so on. The prodverb is a reminder that building reliable distributed systems requires careful design and proven consensus algorithms, such as Paxos and Raft. These ensure that the processes in a group never decide on conflicting values and, as long as a majority of them can communicate, eventually converge on a single agreed-upon value, even in the presence of failures and unreliable networks.
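To make the majority idea concrete, here is a minimal sketch of the quorum arithmetic that Paxos and Raft both build on. It is not an implementation of either algorithm; the function names (`majority`, `quorums_always_overlap`, `can_commit`) are made up for illustration.

```python
from itertools import combinations


def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def quorums_always_overlap(cluster_size: int) -> bool:
    """Brute-force check that any two majority quorums share at least one node."""
    size = majority(cluster_size)
    quorums = [set(c) for c in combinations(range(cluster_size), size)]
    return all(a & b for a in quorums for b in quorums)


def can_commit(acks: int, cluster_size: int) -> bool:
    """Hypothetical rule: a write counts as committed once a majority acknowledged it."""
    return acks >= majority(cluster_size)


if __name__ == "__main__":
    print(majority(5))                # size of a majority in a 5-node cluster
    print(quorums_always_overlap(5))  # every pair of majorities intersects
    print(can_commit(2, 5))           # two acks are not enough with five nodes
```

Because any two majorities share a node, two sides of a partition cannot both gather a quorum, which is what prevents a split-brain from committing conflicting values.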
2️⃣ Decrease variance, increase mean
Focus on improving the typical user experience rather than over-optimizing for edge cases. Reducing variability makes performance more predictable for the majority of users.
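A small sketch shows why the mean alone can hide variability. The latency samples are invented for illustration, and `summarize` is a hypothetical helper, not part of any monitoring tool.

```python
import statistics


def summarize(latencies_ms):
    """Return mean, population standard deviation, and nearest-rank p99."""
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: ceil(0.99 * n), computed with integer arithmetic.
    rank = max(1, (99 * len(ordered) + 99) // 100)
    return {
        "mean": statistics.fmean(ordered),
        "stdev": statistics.pstdev(ordered),
        "p99": ordered[rank - 1],
    }


# Two made-up services with the same mean latency of 100 ms.
steady = [100] * 100              # every request takes 100 ms
spiky = [50] * 95 + [1050] * 5    # mostly fast, but 5% of requests are very slow

if __name__ == "__main__":
    print(summarize(steady))
    print(summarize(spiky))
```

Both services report a mean of 100 ms, but the spiky one has a standard deviation of roughly 218 ms and a p99 of 1050 ms, so one user in twenty gets a far worse experience than the average suggests.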
3️⃣ Backups are only as good as the last restore
Backups are a critical part of our reliability toolkit, but they're not something we can just set and forget. This saying emphasizes the importance of regularly testing our restore processes. Automating and exercising restores helps ensure that when we need our backups, they'll actually work.
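An automated restore drill can be sketched as follows. This assumes a backup is a plain file copy and that a SHA-256 digest is recorded at backup time; real backup systems differ, and `backup` and `restore_drill` are hypothetical names.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup(source: Path, backup_dir: Path):
    """Copy the source into backup_dir and record its digest at backup time."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / (source.name + ".bak")
    shutil.copy2(source, target)
    return target, sha256(source)


def restore_drill(backup_file: Path, expected_digest: str) -> bool:
    """Restore into a scratch directory and verify the restored data's digest."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restored"
        shutil.copy2(backup_file, restored)
        return sha256(restored) == expected_digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        data = Path(workdir) / "data.db"
        data.write_bytes(b"important rows")
        backup_file, digest = backup(data, Path(workdir) / "backups")
        print("restore ok:", restore_drill(backup_file, digest))
```

Running a drill like this on a schedule, and alerting when it fails, turns "we have backups" into "we know we can restore".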
4️⃣ If you have no SLOs, toil is your job
Without clear reliability targets, SRE work can become an endless cycle of reactive firefighting. Service Level Objectives align teams around key metrics and help prioritize work that meaningfully improves reliability. This prodverb reminds us that if we don't set our own agenda with SLOs, toil will rule the day.
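One common way to make an SLO actionable is an error budget: the amount of failure the target allows over a window. The sketch below assumes a simple request-based availability SLO; the function names and numbers are illustrative only.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests that may fail in the window without missing the SLO."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is missed."""
    return 1 - failed / error_budget(slo_target, total_requests)


if __name__ == "__main__":
    # A 99.9% target over one million requests allows about 1,000 failures.
    print(error_budget(0.999, 1_000_000))
    # With 250 failures so far, about three quarters of the budget is left.
    print(budget_remaining(0.999, 1_000_000, 250))
```

When the remaining budget is healthy, the team can spend time on planned work; when it runs low, reliability work takes priority. That gives a concrete rule for setting the agenda instead of reacting to every alert.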
5️⃣ Hope is not a strategy
This prodverb reminds us that wishful thinking is not enough to build reliable systems. We can't just hope that things will work out or that failures won't happen.
Key aspects of a solid strategy include, among others:
Designing for failure
Capacity planning
Security and access controls
Incident response planning
This prodverb encapsulates a mindset shift from passively wishing for the best to actively working to make the system as robust as possible.
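As one small illustration of designing for failure, the sketch below retries a flaky call with exponential backoff and jitter instead of hoping it succeeds the first time. The helper name `call_with_retries` and its default parameters are assumptions chosen for the example.

```python
import random
import time


def call_with_retries(operation, attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call operation(), retrying on any exception with capped exponential backoff.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Cap the backoff, then pick a random delay below it ("full jitter")
            # so that many clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Fails twice with a transient error, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(call_with_retries(flaky))
```

Retries only help with transient faults. In a real service, a wrapper like this would usually retry only on specific exception types and be paired with timeouts, so that retries do not pile more load onto an already struggling dependency.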
6️⃣ Scale maintenance sublinearly with the growth of the service
What worked for 10 servers may not cut it for 10,000. The goal is for operational effort to grow more slowly than the service itself. Plan for the future and explore ways to reduce maintenance overhead: automate tasks, simplify architectures, and empower dev teams to share the load.
7️⃣ May all your incidents be novel
In a perfect world, we'd never have repeat incidents. While that may not be realistic, this saying urges us to learn from every incident. By performing blameless postmortems and addressing the root cause of each problem, we can work to ensure that the same type of incident won't happen again.