The Fail Fast Principle

Sep 26, 2023

🚀 In a previous post we introduced the Eight Pillars of Fault-tolerant Systems. Today we discuss the fail fast principle.

The fail fast principle is a design pattern in which an application reports an error as soon as it is detected, rather than trying to continue execution in a possibly invalid state. Detecting and propagating failures immediately keeps localized faults from cascading across system components.

Applying fail fast principles in distributed architectures provides several advantages:

Localizes failures - Failing components quickly contains issues before they cascade. Failures are isolated to specific services.
Reduces debugging costs - When processes terminate immediately at the source of errors, it's easier to pinpoint root causes based on crash logs and traces.
Allows graceful degradation - Services shutting down rapidly allows load balancers to route traffic to healthy nodes. The overall system remains operational (in a degraded mode).
Improves reliability - By assuming processes can crash anytime, developers build more resilient systems. Failures are handled gracefully.

Practical Examples

Let's consider three scenarios where the fail fast pattern applies.

Failing Fast with Network Calls

Network communication between services is prone to timeouts and failures. Make requests fail fast by setting short timeouts and immediately returning errors:

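// Timeout after 100ms (covers connecting, redirects, and reading the body)
client := &http.Client{Timeout: 100 * time.Millisecond}

resp, err := client.Get("http://remote-service")
if err != nil {
  // Fail fast: wrap with %w so callers can still inspect the underlying error
  return fmt.Errorf("request to remote-service failed: %w", err)
}
defer resp.Body.Close() // release the connection once the response is handled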

A short timeout keeps the system from waiting on delayed responses, and returning the error immediately avoids retrying requests that are unlikely to succeed. Without aggressive downstream timeouts, your service keeps those connections open, which can exhaust sockets and other resources and bring the service to a halt.

Validating Startup Health Checks

Services should check dependent resources like databases during initialization and terminate early if unavailable:

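// sql.Open may only validate its arguments; it does not necessarily connect.
// The DSN is a placeholder, assuming the go-sql-driver/mysql driver format.
db, err := sql.Open("mysql", "user:password@tcp(localhost:3306)/dbname")
if err != nil {
  log.Fatalf("invalid database configuration: %v", err)
}

// Ping forces a real connection, so an unreachable database stops startup here.
if err := db.Ping(); err != nil {
  log.Fatalf("database unavailable: %v", err)
}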

Failing fast on startup ensures components don't stay up in degraded modes. It also reduces debugging costs and MTTR, provided proper monitoring and alerting are in place.

Securing APIs with Request Validation

APIs should validate headers, auth tokens, and payload before handling requests:

func authenticate(r *http.Request) error {
  token := r.Header.Get("Auth-Token")
  if token == "" {
    return errors.New("no auth token provided")
  }
  
  // Validate token...

  return nil
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
  if err := authenticate(r); err != nil {
    http.Error(w, "authentication failed", http.StatusUnauthorized)
    return
  }

  // Process request
}

Defensive programming with proper request validation is fundamental to secure cloud-native applications. The fail fast principle says to reject bad inputs early before any damage is done.

Best Practices

Applied carelessly, the fail fast pattern can add overhead and even make a system less stable, so it needs to be applied deliberately.

Backoff Strategies

Backoff strategies matter when clients retry against a failed component or service that is being restarted. Spacing out the retries prevents a thundering herd problem, where all clients retry simultaneously and overload the recovering service.

Two common backoff approaches are:

Fixed backoff - Wait a predetermined amount of time between retry attempts (e.g. 5 seconds). The delay stays constant.
Exponential backoff - Progressively wait longer between retries, using exponentially increasing waits like 100ms, 200ms, 400ms, etc.

Exponential backoff with jitter is generally preferable as it provides better distribution across clients. The random jitter prevents clients from retrying in lockstep.

Backoffs should also include a maximum delay cap so that the wait between retries doesn't grow without bound.

Failure Context

When a component fails fast, include contextual debugging information in the logs/errors beyond just a stack trace. For example:

// BAD
fmt.Errorf("invalid user id")

// GOOD
fmt.Errorf("invalid user id %d received in OrderRequest on endpoint %s", userID, r.URL)

Some tips:

Log key parameter values, request info, identifiers related to the failure.
Be careful about logging sensitive data: mask passwords and PII.
Surface original error messages from underlying dependencies when wrapping errors.
Capture metrics and request traces on failures to aid in post-mortems.
Include a unique request/failure identifier to correlate logs across services.

Providing rich failure context speeds up diagnosing root causes of problems without needing a debugger or reproducing locally. This enables faster recovery and resolution.

Dependency Isolation

Isolating non-critical services from core components prevents their failure from cascading. Patterns like bulkheads and circuit breakers help contain failures:

Segregate risky operations into separate processes or clusters so they can fail independently.
Use circuit breakers to isolate points of access when downstream dependencies fail.
Implement request queues or pools to bound concurrent resource usage.
Containerize services to isolate resources and dependencies.

Final Thoughts

The fail fast pattern is highly relevant for building reliable distributed systems. Rapid error detection and failure propagation prevent localized issues from cascading across system components.

For core services, optimizing for recoverability and graceful degradation may be preferable to failing at the slightest issue. Compensating actions like caching and retries may help mask transient failures. For non-critical paths, failing immediately protects overall system integrity.

Codereliant’s Substack
