Error Handling

Core Principles

BAD: catch (e) { /* ignore */ } GOOD: catch (e) { logger.error("payment failed", { orderId, error: e }); throw; }

Use typed errors with error codes for programmatic handling — callers can branch on error type
Return Result types for recoverable operations
Throw/raise only for programmer errors (bugs), not expected failures — expected failures are control flow, not exceptions

BAD: "Error occurred" GOOD: "Failed to connect to database at localhost:5432 — check DB_HOST and ensure PostgreSQL is running"

Include relevant context: operation, input, expected vs actual
Never expose internal details (stack traces, SQL) to end users — attackers use these to find vulnerabilities
Log full details server-side, return sanitized messages to clients

Log errors with structured context (who, what, where, when) — enables filtering and alerting in log aggregation tools
Use appropriate severity levels: debug, info, warn, error, fatal
Include correlation IDs for request tracing — essential for debugging in distributed systems
Do not log sensitive data (passwords, tokens, PII)
Record errors as OpenTelemetry span events with structured attributes — enables correlation across distributed services
Include trace IDs in error responses — allows support teams to trace user-reported errors through the entire system
Track error rates as SLIs and define error budgets — pause feature releases when the budget is depleted

Use circuit breakers for calls to external services — stop calling a failing service and fail fast instead of queuing timeouts
Circuit breaker states: Closed (normal), Open (fail fast), Half-Open (probe recovery) — monitor state transitions as metrics
Combine with retries: retries handle transient blips, circuit breakers handle persistent failures

Isolate failure domains with bulkheads — a failure in one integration should not exhaust resources for unrelated requests
Use separate thread pools or connection pools per external dependency

Implement timeouts for all external calls — prevents a single slow service from blocking the entire system
Use retries with exponential backoff for transient failures
Provide fallback behavior where appropriate
Fail fast on configuration errors at startup, not at runtime — surfaces problems before they reach users

Use dead letter queues for messages that fail processing after max retries — prevents poison messages from blocking the queue
Implement compensation (saga) patterns for multi-step distributed operations — partial completion must be reversible

Wrap UI sections in error boundaries — a crash in one widget should not take down the entire page
Display user-friendly fallback UI with a retry option — never show raw stack traces

Always release resources in finally blocks or equivalent
Close database connections, file handles, and network sockets — leaked connections exhaust pools and cause outages
Roll back partial operations on failure
Leave the system in a consistent state after errors