Production Mindset

Core Principle

Treat every project as production-grade — no workarounds, no temporary solutions
Implement as if deploying to production TODAY
All solutions must follow current industry standards

Instrument with structured logs, metrics, and distributed traces from the first commit — observability is a design-time concern, not a runtime patch
Use OpenTelemetry as the instrumentation standard — vendor-neutral and widely supported
Include correlation IDs (trace IDs) in every log entry and API response — enables end-to-end debugging across services
Track error rates as SLIs and define error budgets — pause feature releases when the budget is depleted

Define Service Level Objectives (SLOs) for every user-facing service: availability, latency, error rate
Measure with Service Level Indicators (SLIs) derived from real telemetry, not synthetic checks
Use error budgets to gate releases — if the budget is depleted, prioritize reliability over features

Never apply migrations that are not written to a migration file first — ad-hoc schema changes cause drift
Only execute migrations with explicit user confirmation
Use connection pooling appropriate to your database and expected load
Test migrations against a copy of production schema before deploying

Ensure all UI is responsive across mobile, tablet, and desktop breakpoints
Ensure accessibility compliance (WCAG 2.1 AA minimum) — not optional, it is a legal requirement in many jurisdictions
Use semantic HTML elements for screen reader compatibility
Test with keyboard navigation and screen readers

Implement retries with exponential backoff for transient failures
Set timeouts on all external calls — unbounded waits cascade into outages
Use health checks and readiness probes in containerized deployments
Design for graceful degradation — partial functionality is better than total failure

Use feature flags to decouple deployment from release — deploy code to production without exposing it to users
Roll out features progressively: internal > canary (1-5%) > beta (10-25%) > general availability
Monitor error rates and latency at each stage — automated rollback if SLO thresholds are breached

Use blue-green or canary deployments for zero-downtime releases — never deploy directly to all traffic
Automate rollback based on health check and SLO signals — manual rollback is too slow for production incidents

Define all infrastructure in version-controlled code (Terraform, Pulumi, CDK) — manual provisioning drifts and is unreproducible
Review infrastructure changes through the same PR process as application code

Test failure scenarios proactively in staging and production — verify that retries, circuit breakers, and failovers actually work
Start small: kill a single pod or inject latency on one route — expand scope as confidence grows

Define an incident response playbook before the first incident — roles, communication channels, escalation paths
Conduct blameless postmortems for every significant incident — document what happened, why, and what changes prevent recurrence

Never suppress linter errors without a documented reason
Never skip tests to meet a deadline — technical debt compounds faster than financial debt
Never use // @ts-ignore or equivalent without a ticket reference explaining why
If a proper solution takes longer, discuss scope with the team — do not hack around it