Error Handling
Core Principles
- Never silently swallow errors — every error must be handled or propagated
BAD: catch (e) { /* ignore */ }
GOOD: catch (e) { logger.error("payment failed", { orderId, error: e }); throw e; }
- Use typed errors with error codes for programmatic handling — callers can branch on error type
- Return Result types for recoverable operations, so callers must handle the failure case explicitly
- Throw/raise only for programmer errors (bugs), not expected failures — expected failures are control flow, not exceptions
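The principles above can be sketched as follows. This is a minimal illustration, not a prescribed API: `PaymentError`, its error codes, the `Result` type, and `charge` are all hypothetical names chosen for the example.

```typescript
// Hypothetical error codes for an illustrative payment flow.
type PaymentErrorCode = "CARD_DECLINED" | "INSUFFICIENT_FUNDS" | "GATEWAY_TIMEOUT";

// Typed error: callers branch on `code`, not on message text.
class PaymentError extends Error {
  readonly code: PaymentErrorCode;

  constructor(code: PaymentErrorCode, message: string) {
    super(message);
    this.name = "PaymentError";
    this.code = code;
  }
}

// Minimal Result type for recoverable operations.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

// Expected failures are returned; only a programmer error (invalid argument) throws.
function charge(balanceCents: number, amountCents: number): Result<number, PaymentError> {
  if (!Number.isInteger(amountCents) || amountCents <= 0) {
    throw new RangeError(`amountCents must be a positive integer, got ${amountCents}`);
  }
  if (amountCents > balanceCents) {
    return {
      ok: false,
      error: new PaymentError(
        "INSUFFICIENT_FUNDS",
        `Balance ${balanceCents} is below charge ${amountCents}`,
      ),
    };
  }
  return { ok: true, value: balanceCents - amountCents };
}

// The caller is forced to look at `ok` before it can touch `value`.
const result = charge(500, 700);
if (result.ok) {
  console.log(`new balance: ${result.value}`);
} else if (result.error.code === "INSUFFICIENT_FUNDS") {
  console.log("ask the customer for another payment method");
}
```

Because `Result` is a discriminated union, the compiler rejects code that reads `value` without first checking `ok`.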
Error Messages
- Write actionable error messages: what happened, why, and how to fix it
BAD: "Error occurred"
GOOD: "Failed to connect to database at localhost:5432 — check DB_HOST and ensure PostgreSQL is running"
- Include relevant context: operation, input, expected vs actual
- Never expose internal details (stack traces, SQL) to end users — attackers use these to find vulnerabilities
- Log full details server-side, return sanitized messages to clients
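One way to keep full details on the server while returning a sanitized message is sketched below. The `ClientError` shape, the `ORDER_LOOKUP_FAILED` code, and the in-memory `serverLog` (standing in for a real logger) are assumptions made for illustration.

```typescript
// Hypothetical shape of what the client is allowed to see.
interface ClientError {
  code: string;
  message: string;
  referenceId: string;
}

// An in-memory array stands in for a real server-side logger.
const serverLog: Array<Record<string, unknown>> = [];

function toClientError(err: unknown, referenceId: string): ClientError {
  const detail =
    err instanceof Error
      ? { name: err.name, message: err.message, stack: err.stack }
      : { value: String(err) };

  // Full detail (including the stack trace) stays on the server.
  serverLog.push({ level: "error", referenceId, ...detail });

  // The client gets what happened and what to do next, with no internals.
  return {
    code: "ORDER_LOOKUP_FAILED",
    message:
      "We could not load your order. Retry in a few minutes, or contact support and quote the reference ID.",
    referenceId,
  };
}
```

The `referenceId` is the only link between the two views: support can look up the full server-side record without the client ever seeing it.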
Logging & Observability
- Log errors with structured context (who, what, where, when) — enables filtering and alerting in log aggregation tools
- Use appropriate severity levels: debug, info, warn, error, fatal
- Include correlation IDs for request tracing — essential for debugging in distributed systems
- Do not log sensitive data (passwords, tokens, PII)
- Record errors as OpenTelemetry span events with structured attributes — enables correlation across distributed services
- Include trace IDs in error responses — allows support teams to trace user-reported errors through the entire system
- Track error rates as SLIs and define error budgets — pause feature releases when the budget is depleted
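A minimal sketch of structured logging with a correlation ID and redaction of sensitive fields follows. The `logEvent` and `redact` helpers and the `SENSITIVE_KEYS` list are hypothetical; a real service would normally configure this in its logging library, and the redaction here is shallow (top-level keys only).

```typescript
type Severity = "debug" | "info" | "warn" | "error" | "fatal";

// Hypothetical list of keys that must never reach the logs.
const SENSITIVE_KEYS = new Set(["password", "token", "authorization", "ssn", "email"]);

// Replace the values of sensitive top-level keys; nested objects are not inspected.
function redact(context: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(context)) {
    safe[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : value;
  }
  return safe;
}

// Emit one JSON object per log line so aggregation tools can filter on fields.
function logEvent(
  severity: Severity,
  message: string,
  correlationId: string,
  context: Record<string, unknown> = {},
): string {
  const line = JSON.stringify({
    ...redact(context),
    timestamp: new Date().toISOString(),
    severity,
    message,
    correlationId,
  });
  console.log(line);
  return line;
}
```

For example, `logEvent("error", "payment failed", "req-42", { orderId: "o-1", token: "abc" })` writes a line in which `orderId` is preserved and `token` is replaced by `[REDACTED]`.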
Circuit Breaker
- Use circuit breakers for calls to external services — stop calling a failing service and fail fast instead of queuing timeouts
- Circuit breaker states: Closed (normal), Open (fail fast), Half-Open (probe recovery) — monitor state transitions as metrics
- Combine with retries: retries handle transient blips, circuit breakers handle persistent failures
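The three states can be sketched as a small state machine. This is an illustrative implementation under simplifying assumptions: the `CircuitBreaker` class is hypothetical, it counts consecutive failures rather than a failure rate, and it does not limit the number of concurrent probes while half-open.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  // `now` is injectable so the timing can be controlled in tests.
  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  getState(): BreakerState {
    // An open breaker becomes half-open once the reset timeout has elapsed.
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.transition("half-open");
    }
    return this.state;
  }

  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (this.getState() === "open") {
      throw new Error("circuit open: failing fast without calling the dependency");
    }
    try {
      const value = await operation();
      this.consecutiveFailures = 0;
      this.transition("closed"); // a successful probe closes the circuit
      return value;
    } catch (err) {
      this.consecutiveFailures += 1;
      // A failed probe, or too many consecutive failures, opens the circuit.
      if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
        this.openedAt = this.now();
        this.transition("open");
      }
      throw err;
    }
  }

  private transition(next: BreakerState): void {
    if (next !== this.state) {
      console.log(`circuit breaker: ${this.state} -> ${next}`); // emit a metric here in practice
      this.state = next;
    }
  }
}
```

When combining with retries, the retry loop would typically sit outside `call`, so that an open circuit ends the retry sequence immediately instead of being retried.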
Bulkhead Isolation
- Isolate failure domains with bulkheads — a failure in one integration should not exhaust resources for unrelated requests
- Use separate thread pools or connection pools per external dependency
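In a single-threaded runtime there are no thread pools to separate, so the sketch below approximates a bulkhead with a per-dependency concurrency limit and a bounded wait queue. The `Bulkhead` class and its limits are hypothetical.

```typescript
class Bulkhead {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly maxConcurrent: number;
  private readonly maxQueued: number;

  constructor(maxConcurrent: number, maxQueued: number) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueued = maxQueued;
  }

  async run<T>(operation: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      if (this.waiting.length >= this.maxQueued) {
        // Reject instead of queuing without bound.
        throw new Error("bulkhead full: call rejected");
      }
      // Wait until a finishing call hands its slot over.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active += 1;
    }
    try {
      return await operation();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // slot passes directly to the next waiter; `active` is unchanged
      } else {
        this.active -= 1;
      }
    }
  }
}

// One bulkhead per external dependency, so a slow payment provider
// cannot use up the capacity reserved for search.
const bulkheads = {
  payments: new Bulkhead(2, 10),
  search: new Bulkhead(5, 20),
};
```

A call would then go through its own compartment, for example `bulkheads.payments.run(() => chargeCard(order))`, where `chargeCard` is a placeholder for the real client call.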
Resilience
- Implement timeouts for all external calls — prevents a single slow service from blocking the entire system
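- Use retries with exponential backoff for transient failures; cap the number of attempts and do not retry errors that will fail the same way again
- Provide fallback behavior where appropriate, such as cached data, a default value, or a degraded feature
- Validate configuration at startup and fail fast on errors, rather than on first use; this surfaces problems before they reach users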
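Timeouts and retries with exponential backoff can be combined as in the sketch below. The helper names (`withTimeout`, `backoffDelayMs`, `retry`) and the `RetryOptions` fields are hypothetical. Two simplifications are worth noting: the timeout stops waiting but does not cancel the underlying work, and the delay has no random jitter, which production code usually adds to avoid synchronized retries.

```typescript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

interface RetryOptions {
  maxAttempts: number;  // total attempts, including the first
  baseDelayMs: number;  // delay before the first retry
  maxDelayMs: number;   // upper bound on any single delay
  timeoutMs: number;    // per-attempt timeout
}

// Delay before the retry that follows attempt `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function retry<T>(
  operation: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  options: RetryOptions,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await withTimeout(operation(), options.timeoutMs);
    } catch (err) {
      const outOfAttempts = attempt + 1 >= options.maxAttempts;
      // Non-transient errors are rethrown immediately; retrying would not help.
      if (outOfAttempts || !isTransient(err)) throw err;
      await sleep(backoffDelayMs(attempt, options.baseDelayMs, options.maxDelayMs));
    }
  }
}
```

With `baseDelayMs: 100` and `maxDelayMs: 2000`, the delays are 100, 200, 400, 800, 1600, then 2000 ms for every later retry. A fallback can be attached at the call site, for example `retry(fetchPrices, isTransient, options).catch(() => cachedPrices)`, where `fetchPrices` and `cachedPrices` are placeholders.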
Async Error Handling
- Use dead letter queues for messages that fail processing after max retries — prevents poison messages from blocking the queue
- Implement compensation (saga) patterns for multi-step distributed operations — partial completion must be reversible
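The compensation pattern can be sketched as a runner that undoes completed steps in reverse order when a later step fails. `SagaStep`, `SagaOutcome`, and `runSaga` are hypothetical names; a production saga would also persist its progress so it can resume after a crash, which this in-memory sketch does not do.

```typescript
interface SagaStep {
  name: string;
  execute: () => Promise<void>;
  compensate: () => Promise<void>; // must undo the effect of `execute`
}

type SagaOutcome =
  | { ok: true }
  | { ok: false; failedStep: string; error: unknown };

async function runSaga(steps: SagaStep[]): Promise<SagaOutcome> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.execute();
      completed.push(step);
    } catch (error) {
      // Undo the steps that already succeeded, most recent first.
      for (const done of completed.reverse()) {
        try {
          await done.compensate();
        } catch (compensationError) {
          // A failed compensation needs operator attention; never drop it silently.
          console.error(`compensation for "${done.name}" failed`, compensationError);
        }
      }
      return { ok: false, failedStep: step.name, error };
    }
  }
  return { ok: true };
}
```

For an order flow with steps reserve, charge, and ship, a failure in ship would trigger the compensation for charge (refund) and then for reserve (release stock).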
Frontend Error Boundaries
- Wrap UI sections in error boundaries — a crash in one widget should not take down the entire page
- Display user-friendly fallback UI with a retry option — never show raw stack traces
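The idea can be sketched without a UI framework: each widget is wrapped so that a thrown error is reported and replaced by fallback markup. The `Widget` type, `withErrorBoundary`, and `renderPage` are hypothetical. In React, the same role is played by a class component that implements `static getDerivedStateFromError` and `componentDidCatch`.

```typescript
// A widget is any function that renders to an HTML string (framework-free stand-in).
type Widget = () => string;

function withErrorBoundary(
  name: string,
  widget: Widget,
  report: (name: string, err: unknown) => void,
): Widget {
  return () => {
    try {
      return widget();
    } catch (err) {
      report(name, err); // full detail goes to error reporting, not onto the page
      return (
        `<div class="widget-error" role="alert">` +
        `This section could not be loaded. ` +
        `<button data-retry="${name}">Retry</button></div>`
      );
    }
  };
}

// Each widget renders independently; one failure does not stop the rest.
function renderPage(widgets: Widget[]): string {
  return widgets.map((widget) => widget()).join("\n");
}
```

The fallback markup carries a `data-retry` attribute so that page code can re-render just the failed widget when the user clicks the button.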
Cleanup
- Always release resources in finally blocks or equivalent
- Close database connections, file handles, and network sockets — leaked connections exhaust pools and cause outages
- Roll back partial operations on failure
- Leave the system in a consistent state after errors
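The cleanup rules above can be sketched as a transaction wrapper that rolls back on failure and always releases the connection. The `Connection` interface and `inTransaction` helper are hypothetical; a real database driver's API will differ.

```typescript
// Hypothetical connection interface, for illustration only.
interface Connection {
  begin(): Promise<void>;
  commit(): Promise<void>;
  rollback(): Promise<void>;
  release(): void; // return the connection to the pool
}

async function inTransaction<T>(
  acquire: () => Promise<Connection>,
  work: (conn: Connection) => Promise<T>,
): Promise<T> {
  const conn = await acquire();
  try {
    await conn.begin();
    try {
      const result = await work(conn);
      await conn.commit();
      return result;
    } catch (err) {
      try {
        await conn.rollback(); // undo partial writes so the data stays consistent
      } catch (rollbackErr) {
        // Log the rollback failure but keep the original error as the one thrown.
        console.error("rollback failed", rollbackErr);
      }
      throw err;
    }
  } finally {
    conn.release(); // runs on success and on every failure path
  }
}
```

Placing `release` in the outer `finally` means the connection returns to the pool even when `begin`, `commit`, or `rollback` itself throws.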