Production Mindset

Core Principle

  • Treat every project as production-grade — no workarounds, no temporary solutions
  • Implement as if deploying to production TODAY
  • All solutions must follow current industry standards

Observability — The Three Pillars

  • Instrument with structured logs, metrics, and distributed traces from the first commit — observability is a design-time concern, not a runtime patch
  • Use OpenTelemetry as the instrumentation standard — vendor-neutral and widely supported
  • Include correlation IDs (trace IDs) in every log entry and API response — enables end-to-end debugging across services
  • Track error rates as SLIs so they feed the error budgets defined for each SLO
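
As a rough illustration of the correlation-ID rule above, the sketch below emits one JSON object per log line and echoes the same ID in the response. It uses only the Python standard library as a stand-in for a real OpenTelemetry setup. The names `correlation_id`, `JsonFormatter`, `build_logger`, `handle_request`, and the "checkout" logger are hypothetical.

```python
import contextvars
import io
import json
import logging
import uuid
from typing import Optional

# Holds the correlation ID of the request being handled (hypothetical name).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes structured lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")  # assumed service name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id: Optional[str] = None) -> dict:
    """Reuse the caller's ID if present, otherwise mint one, and echo it back."""
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("order accepted")
        return {"status": "ok", "correlation_id": cid}
    finally:
        correlation_id.reset(token)  # do not leak the ID into the next request


if __name__ == "__main__":
    buffer = io.StringIO()
    response = handle_request(build_logger(buffer), incoming_id="abc123")
    print(buffer.getvalue().strip())
    print(response)
```

In a real service the ID would typically come from the active trace context rather than a hand-rolled context variable. The point here is only that the log line and the response carry the same value.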

SLOs and Reliability Targets

  • Define Service Level Objectives (SLOs) for every user-facing service: availability, latency, error rate
  • Measure with Service Level Indicators (SLIs) derived from real user traffic; synthetic checks can supplement that telemetry but should not replace it
  • Use error budgets to gate releases — if the budget is depleted, prioritize reliability over features
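
The error-budget arithmetic behind these rules can be sketched as follows. This is a minimal illustration, assuming a request-based availability SLO. The class name `ErrorBudget` and its fields are hypothetical, and the numbers in the usage example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float     # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # requests that violated the SLI in that window

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates over the window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once it is overspent."""
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / self.allowed_failures

    def release_allowed(self) -> bool:
        """Gate feature releases on having budget left."""
        return self.remaining_fraction > 0.0


if __name__ == "__main__":
    # Invented figures: a 99.9% target over 1,000,000 requests tolerates
    # roughly 1,000 failures, so 400 failures leave about 60% of the budget.
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(round(budget.allowed_failures), round(budget.remaining_fraction, 2))
    print(budget.release_allowed())
```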

Database & Migrations

  • Write every schema change to a migration file before applying it; ad-hoc schema changes cause drift between environments
  • Only execute migrations with explicit user confirmation
  • Use connection pooling appropriate to your database and expected load
  • Test migrations against a copy of production schema before deploying
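
One possible shape for a runner that follows the first two rules is sketched below. It uses SQLite from the standard library purely for illustration. The `schema_migrations` table, the `*.sql` file layout, and the function names are assumptions, not a prescribed tool.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import List


def pending_migrations(conn: sqlite3.Connection, directory: Path) -> List[Path]:
    """Return migration files, in name order, that are not yet recorded as applied."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    return [p for p in sorted(directory.glob("*.sql")) if p.name not in applied]


def apply_migrations(conn: sqlite3.Connection, directory: Path, confirmed: bool) -> List[str]:
    """Apply pending migration files, but only after explicit confirmation."""
    pending = pending_migrations(conn, directory)
    if pending and not confirmed:
        names = ", ".join(p.name for p in pending)
        raise PermissionError(f"refusing to run unconfirmed migrations: {names}")
    applied = []
    for path in pending:
        conn.executescript(path.read_text())  # schema changes come only from files
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (path.name,))
        conn.commit()
        applied.append(path.name)
    return applied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        migrations = Path(tmp)
        (migrations / "001_create_users.sql").write_text(
            "CREATE TABLE users (id INTEGER PRIMARY KEY);"
        )
        conn = sqlite3.connect(":memory:")
        print(apply_migrations(conn, migrations, confirmed=True))
        conn.close()
```

Running the same function against a restored copy of the production schema first is one way to cover the last rule in the list.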

UI Standards

  • Ensure all UI is responsive across mobile, tablet, and desktop breakpoints
  • Ensure accessibility compliance (WCAG 2.1 AA minimum); this is not optional, and it is a legal requirement in many jurisdictions
  • Use semantic HTML elements for screen reader compatibility
  • Test with keyboard navigation and screen readers

Reliability

  • Implement retries with exponential backoff for transient failures
  • Set timeouts on all external calls — unbounded waits cascade into outages
  • Use health checks and readiness probes in containerized deployments
  • Design for graceful degradation — partial functionality is better than total failure
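
A minimal sketch of the retry rule is shown below. `TransientError` and `call_with_retries` are hypothetical names, and the random jitter is an addition that the list above does not require. The wrapped operation is assumed to enforce its own timeout, for example by passing `timeout=` to the HTTP client call.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (a 503, a timeout)."""


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last failure to the caller
            # Backoff ceiling doubles each attempt: base, 2*base, 4*base, ...
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Assumption: sleep a random amount up to the ceiling ("full jitter")
            # so that many clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Only errors known to be transient are retried here. Retrying every exception would repeat non-idempotent or permanently failing calls.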

Progressive Rollout

  • Use feature flags to decouple deployment from release — deploy code to production without exposing it to users
  • Roll out features progressively: internal users, then canary (1-5% of traffic), then beta (10-25%), then general availability
  • Monitor error rates and latency at each stage, and roll back automatically if SLO thresholds are breached
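
Percentage-based exposure is often implemented by hashing a stable user identifier into a bucket. The sketch below assumes that approach. `bucket_for`, `is_enabled`, and the flag name in the usage example are hypothetical and do not come from any particular feature-flag product.

```python
import hashlib
from typing import FrozenSet


def bucket_for(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(
    flag: str,
    user_id: str,
    rollout_percent: int,
    internal_users: FrozenSet[str] = frozenset(),
) -> bool:
    """Expose the flag to internal users first, then to a growing share of users."""
    if user_id in internal_users:
        return True
    return bucket_for(flag, user_id) < rollout_percent


if __name__ == "__main__":
    staff = frozenset({"alice"})
    for percent in (0, 5, 25, 100):  # internal, canary, beta, general availability
        exposed = sum(is_enabled("new-checkout", f"user-{i}", percent, staff) for i in range(1000))
        print(percent, exposed)
```

Because the bucket depends only on the flag and the user, anyone exposed at 5% stays exposed at 25%, so raising the percentage never flips a user back to the old behavior.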

Deployment Strategies

  • Use blue-green or canary deployments for zero-downtime releases; never shift all traffic to a new version in one step
  • Automate rollback based on health check and SLO signals — manual rollback is too slow for production incidents
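
The automated decision can be as small as a pure function over the metrics collected for the new version. The sketch below is one hedged way to express it. `StageMetrics`, `SloThresholds`, `decide`, and the threshold values in the usage example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageMetrics:
    requests: int          # requests served by the new version so far
    errors: int            # of those, how many failed
    p99_latency_ms: float  # observed 99th percentile latency


@dataclass(frozen=True)
class SloThresholds:
    max_error_rate: float
    max_p99_latency_ms: float
    min_requests: int = 100  # assumed floor so tiny samples do not trigger decisions


def decide(metrics: StageMetrics, slo: SloThresholds) -> str:
    """Return 'hold', 'rollback', or 'promote' for the current rollout stage."""
    if metrics.requests == 0 or metrics.requests < slo.min_requests:
        return "hold"  # not enough traffic yet to judge either way
    error_rate = metrics.errors / metrics.requests
    if error_rate > slo.max_error_rate:
        return "rollback"
    if metrics.p99_latency_ms > slo.max_p99_latency_ms:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    slo = SloThresholds(max_error_rate=0.01, max_p99_latency_ms=500.0)
    print(decide(StageMetrics(requests=5000, errors=12, p99_latency_ms=310.0), slo))
    print(decide(StageMetrics(requests=5000, errors=90, p99_latency_ms=310.0), slo))
```

A deployment pipeline would call something like this on a timer and act on the result without waiting for a human.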

Infrastructure as Code

  • Define all infrastructure in version-controlled code (Terraform, Pulumi, CDK) — manual provisioning drifts and is unreproducible
  • Review infrastructure changes through the same PR process as application code

Chaos Engineering

  • Test failure scenarios proactively in staging and production — verify that retries, circuit breakers, and failovers actually work
  • Start small: kill a single pod or inject latency on one route — expand scope as confidence grows
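
At the application level, "inject latency on one route" can be approximated by wrapping a single dependency call. The sketch below assumes that approach. `with_chaos` and `InjectedFault` are hypothetical names, and real experiments usually rely on dedicated tooling with a kill switch.

```python
import random
import time
from typing import Callable, Optional


class InjectedFault(Exception):
    """Raised in place of a real failure during a chaos experiment."""


def with_chaos(
    operation: Callable,
    failure_rate: float = 0.0,
    added_latency_s: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable:
    """Wrap one call path so it sometimes fails or responds slowly."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if added_latency_s > 0:
            sleep(added_latency_s)  # simulate a slow dependency
        if rng.random() < failure_rate:
            raise InjectedFault("injected by chaos experiment")
        return operation(*args, **kwargs)

    return wrapped


if __name__ == "__main__":
    def fetch_price(sku: str) -> int:
        return 42  # stand-in for a real downstream call

    # Start small: 10% injected failures and 50 ms of extra latency on one call.
    chaotic_fetch = with_chaos(fetch_price, failure_rate=0.1, added_latency_s=0.05)
    try:
        print(chaotic_fetch("sku-1"))
    except InjectedFault as exc:
        print("fallback path exercised:", exc)
```

The experiment is only useful if the caller's retries, timeouts, and fallbacks are then observed handling the injected faults.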

Incident Response

  • Define an incident response playbook before the first incident — roles, communication channels, escalation paths
  • Conduct blameless postmortems for every significant incident — document what happened, why, and what changes prevent recurrence

No Shortcuts

  • Never suppress linter errors without a documented reason
  • Never skip tests to meet a deadline; skipped tests are technical debt that compounds with every later change
  • Never use // @ts-ignore or equivalent without a ticket reference explaining why
  • If a proper solution takes longer, discuss scope with the team — do not hack around it