Retries, Backoff and Jitter
💡 In one of the previous posts we introduced the Eight Pillars of Fault-tolerant Systems, and today we will discuss "Retries".
In distributed systems, failures and latency issues are inevitable. Services can fail due to overloaded servers, network issues, bugs, and various other factors. As engineers building distributed systems, we need strategies to make our services robust and resilient in the face of such failures. One useful technique is using retries.
Understanding Retries
At a basic level, a retry simply involves attempting an operation again after a failure. This helps mask transient failures from the end user. Together with the fail-fast pattern, retries are essential in distributed systems, where partial failures and temporary blips happen frequently. Without retries, these minor glitches would degrade the user experience and result in lost availability.
Scenarios Where Retries Are Beneficial
Transient Failures - These are short-lived blips in availability, performance, or consistency. Common causes include network congestion, load spikes, database connection issues, and brief resource bottlenecks. A retry can allow the request to pass through successfully once the disruption has cleared.
Partial Failures - In large distributed systems, it's common for a percentage of requests to fail at any given time due to nodes going down, network partitions, software bugs, and various edge cases. Retries help smooth over these intermittent inconsistencies and exceptions. The retry logic hides the partial failure from the end user so the system keeps running reliably.
In both cases, requests are given multiple chances to succeed before surfacing the failure to the user or a client. Retries transform an unreliable system into one that performs consistently.
Challenges with Retries
While retries are very useful, they also come with some risks and challenges that must be mitigated:
Load Amplification - if a system is already struggling with high load or is completely down, retries can amplify problems by slamming the system with additional requests. This overburdening can lengthen the outage or trigger cascading failures.
Solution: implement exponential backoff between retries to progressively increase the wait time. Limit the total number of retry attempts. Use a circuit breaker to stop retries when error thresholds are exceeded. This prevents overloading the already struggling system.
Side Effects - some operations, like creating a resource or transferring money, have real-world "side effects". Repeatedly retrying them can unintentionally duplicate those outcomes, resulting in double charges or duplicate records.
Solution: design interfaces and systems to be idempotent whenever possible. Idempotent operations can be safely retried. For non-idempotent operations, make the requests uniquely identifiable so duplicates can be filtered out or failed right away.
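As a rough illustration of the idempotency-key approach, the handler below rejects a request whose key has already been processed. The Idempotency-Key header name, the in-memory sync.Map store, and the handler itself are illustrative assumptions rather than a specific framework's API:
import (
	"net/http"
	"sync"
)

// seen remembers idempotency keys that have already been processed.
// A real system would persist these with an expiry instead of using memory.
var seen sync.Map

// handleTransfer applies a money transfer at most once per idempotency key.
func handleTransfer(w http.ResponseWriter, r *http.Request) {
	key := r.Header.Get("Idempotency-Key") // unique per logical operation
	if key == "" {
		http.Error(w, "missing Idempotency-Key", http.StatusBadRequest)
		return
	}
	// LoadOrStore reports loaded=true if the key was seen before.
	if _, loaded := seen.LoadOrStore(key, struct{}{}); loaded {
		w.WriteHeader(http.StatusConflict) // duplicate retry: don't apply it twice
		return
	}
	// ... perform the transfer exactly once ...
	w.WriteHeader(http.StatusOK)
}
A retried request carries the same key, so it is detected as a duplicate instead of charging the customer twice.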
Fairness & Capacity - retries aren't cost-free, as each one competes for resources alongside new incoming requests. Additionally, without centralized control over retry load, retries from many clients can eat into downstream capacity.
Solution: implement a rate limiter on your retry load upstream. This bounds the worst-case load that retries can place on an already struggling downstream service and ensures that only a small, fixed percentage of retry requests compete with new requests, rather than an unbounded amount.
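For example, a client-side retry budget can be enforced with a token-bucket limiter. The sketch below uses the golang.org/x/time/rate package; the limit of 5 retries per second with a burst of 5 is an assumed value for illustration, not a recommendation:
import "golang.org/x/time/rate"

// retryLimiter allows at most 5 retry attempts per second, with a burst of 5.
// Tune these numbers to the downstream capacity you can afford to consume.
var retryLimiter = rate.NewLimiter(rate.Limit(5), 5)

func shouldRetry() bool {
	// Allow reports whether a retry token is available right now.
	// If not, drop the retry instead of piling load onto a struggling service.
	return retryLimiter.Allow()
}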
Synchronized Retries - if multiple clients time out at the same time and retry simultaneously, it can create a "retry storm" that overloads the system. This thundering herd problem can be worse than the original issue.
Solution: add jitter to randomize the wait times before retrying, which avoids synchronized spikes in traffic. Limit the number of retries and use circuit breakers to prevent overloading the system.
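Since circuit breakers appear in several of these solutions, here is a minimal, illustrative sketch of one. The consecutive-failure threshold and cool-down fields are assumptions; production-grade breakers track more state (half-open probes, rolling error rates) than this:
import (
	"errors"
	"sync"
	"time"
)

// ErrCircuitOpen is returned while the breaker is refusing requests.
var ErrCircuitOpen = errors.New("circuit breaker is open")

// CircuitBreaker trips after maxFailures consecutive failures and rejects
// calls for the cooldown period before letting traffic through again.
type CircuitBreaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
}

func (cb *CircuitBreaker) Call(op func() error) error {
	cb.mu.Lock()
	if cb.failures >= cb.maxFailures && time.Since(cb.openedAt) < cb.cooldown {
		cb.mu.Unlock()
		return ErrCircuitOpen // fail fast: no retry reaches the backend
	}
	cb.mu.Unlock()

	err := op()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.failures >= cb.maxFailures {
			cb.openedAt = time.Now() // (re)open the breaker
		}
		return err
	}
	cb.failures = 0 // a success closes the breaker
	return nil
}
A caller would construct something like &CircuitBreaker{maxFailures: 5, cooldown: 30 * time.Second} and wrap each downstream call in cb.Call.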
Exponential backoff
Exponential backoff is a retry strategy where the delay between retries increases exponentially. For example, starting from a 100 ms base delay, successive retries would wait roughly 100 ms, 200 ms, 400 ms, 800 ms, and so on.
The increasing waits help prevent hammering an already struggling system with constant rapid retries. The progressively longer delays give the backend services time to recover from disruptions.
Here is an example implementation in Go:
import "time"

// RetryWithBackoff retries an operation with exponential backoff.
func RetryWithBackoff(retries int) {
	backoff := 100 * time.Millisecond // start at 100ms
	for r := 0; r < retries; r++ {
		// Attempt the operation here; break out of the loop on success.
		// ...
		time.Sleep(backoff) // wait before the next attempt
		backoff *= 2        // double the delay for the next retry
	}
}
This implements basic exponential backoff by doubling the backoff duration on each retry. It's also good practice to cap the total number of retry attempts, as the retries parameter does here.
While exponential backoff spreads out retries over time, it can still cause clusters and spikes in traffic when multiple clients time out at the same moments. Contention hasn't really been reduced - we've just introduced periods of no competing requests instead of a constant barrage.
This clustering effect is the "thundering herd" problem. If 100 clients all retry at the exact same 400ms intervals, they still overload the system in synchronized waves. The solution is to add jitter.
Jitter
Jitter is a random variation in timing that's introduced to "spread out" retries more evenly over time, reducing synchronized request bursts.
There are various ways to implement jitter, but a common approach is "full jitter". With full jitter, the retry delay is randomized between 0 and the computed delay. For example, if your computed delay is 400ms, a retry could happen at any time between 0ms and 400ms.
import (
	"math/rand"
	"time"
)

// RetryWithBackoff retries an operation using exponential backoff with full jitter.
func RetryWithBackoff(retries int) {
	baseDelay := 100 * time.Millisecond // start at 100ms
	maxDelay := 10 * time.Second        // cap on the computed delay
	for r := 0; r < retries; r++ {
		// Attempt the operation here; break out of the loop on success.
		// ...

		// Cap the computed delay so it doesn't grow indefinitely.
		if baseDelay > maxDelay {
			baseDelay = maxDelay
		}
		// Full jitter: wait for a random duration between 0 and the computed delay.
		delay := time.Duration(rand.Int63n(int64(baseDelay)))
		time.Sleep(delay)

		baseDelay *= 2 // double the computed delay for the next retry
	}
}
Here, each retry sleeps for a random duration between zero and the current backoff value, and a maximum delay cap ensures the backoff doesn't grow indefinitely.
A Note on Deadlines
When implementing retries, you also have to think about deadlines. Deadlines provide a time boundary for an operation to complete, including both the initial attempt and any subsequent retries. They ensure that tasks either complete successfully within that time or fail promptly.
Understanding Deadlines vs. Timeouts:
Timeouts typically represent the time allocated for a specific operation or a single request. For instance, you might set a timeout of 5 seconds for a database read operation or 100 ms to connect to a downstream service.
Deadlines represent the absolute point in time by which an operation must complete, factoring in all attempts and retries. If you start an operation at 12:00 PM with a deadline of 1 minute, the operation should complete by 12:01 PM, regardless of the number of retries in between.
Why Use Deadlines?
First, deadlines offer predictability - they provide an upper bound on how long an operation can take end-to-end across retries and multiple services. This predictability is important for maintaining service level agreements and ensuring good user experience.
Second, deadlines allow more efficient resource management. By preventing operations from running indefinitely, systems can better plan resource usage.
Finally, deadlines can be propagated across multiple microservices and system components. This ensures the end-to-end operation respects the time constraint, rather than each hop enforcing its own limit in isolation. Overall, deadlines bring predictability, improve resource planning, and enforce constraints across distributed systems.
Most modern languages and libraries provide ways to enforce deadlines across threads and processes.
In Go, deadlines can be implemented via contexts:
// Create a context that is cancelled once the deadline passes
ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(1*time.Minute))
defer cancel()

// Make the request (and any retries) within the context
err := MakeRequest(ctx)

// If the context expired, we hit the deadline
if errors.Is(err, context.DeadlineExceeded) {
	// handle the deadline being exceeded
}
This will cleanly interrupt the retry sequence when the deadline is hit.
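Putting the pieces together, the earlier backoff loop can be made deadline-aware. This is a sketch rather than the original post's code; RetryWithBackoffCtx and the op callback are illustrative names:
import (
	"context"
	"time"
)

// RetryWithBackoffCtx retries op with exponential backoff, but stops as soon
// as the context's deadline expires or the context is cancelled.
func RetryWithBackoffCtx(ctx context.Context, retries int, op func(context.Context) error) error {
	backoff := 100 * time.Millisecond
	var err error
	for r := 0; r < retries; r++ {
		if err = op(ctx); err == nil {
			return nil // success
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // deadline hit: stop retrying immediately
		case <-time.After(backoff):
			backoff *= 2 // otherwise back off and try again
		}
	}
	return err // retries exhausted before the deadline
}
With the deadline context from above, err := RetryWithBackoffCtx(ctx, 5, MakeRequest) gives every attempt and every backoff pause a shared time budget.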
Conclusion
Implementing retries brings significant benefits for distributed systems, including masking transient failures and increasing service reliability. Exponential backoff and jitter help spread out retries without overloading fragile services.
However, there are several important considerations when adding retries:
Use idempotent operations whenever possible to avoid side effects
Implement exponential backoff and jitter to smooth retries
Set deadlines to prevent endless retries from wasting resources
Monitor metrics like retry counts, rates, and latency
Monitor overall system health and use circuit breakers to stop retries when the system is struggling