Reliability & Observability Initiative
Shifting a team from reactive firefighting to deliberate, measured reliability.
- Role: Engineering Lead
- Year: 2023
- Status: Ongoing
- Domain: Reliability
Impact
- Established SLOs and error budgets for critical services
- Reduced mean time to resolution through tracing and better signals
- Introduced blameless retrospectives that surfaced root causes earlier
Context
The team was capable but reactive. Incidents were resolved through heroics and tribal knowledge rather than instrumentation, and the same classes of problems recurred because root causes were never fully understood.
Approach
The initiative made reliability a discipline rather than a reaction:
- Consistent instrumentation. Standardizing on OpenTelemetry meant traces and metrics looked the same everywhere, so an engineer could navigate any service during an incident.
- SLOs and error budgets. Defining what “reliable enough” meant for each critical service turned reliability into a measurable, prioritizable concern.
- Blameless retrospectives. Making it safe to discuss failures honestly surfaced root causes early, while they were still cheap to fix.
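To make the SLO and error-budget idea concrete, the arithmetic can be sketched as below. This is a minimal illustration and not the initiative's actual tooling: the `SLO` class, the function names, the `checkout` service, and the 99.9% target are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """An availability target over a rolling window (hypothetical shape)."""
    name: str
    target: float      # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int   # length of the evaluation window

    def __post_init__(self) -> None:
        # A target of exactly 1.0 leaves no error budget at all,
        # so it is rejected here rather than handled downstream.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget(slo: SLO, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates within the window."""
    return (1.0 - slo.target) * total_requests


def budget_remaining(slo: SLO, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    return 1.0 - failed_requests / error_budget(slo, total_requests)


if __name__ == "__main__":
    # Assumed example values, chosen only for illustration.
    checkout = SLO(name="checkout", target=0.999, window_days=30)
    print(f"budget: {error_budget(checkout, 1_000_000):.0f} failed requests")
    print(f"remaining: {budget_remaining(checkout, 1_000_000, 250):.0%}")
```

With these assumed numbers, a 99.9% target over one million requests tolerates roughly 1,000 failures, so 250 failures leave about three quarters of the budget. A remaining fraction near zero is the signal to prioritize reliability work over new features.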
Outcome
Incidents became survivable and, increasingly, rare. Engineers could go from “users are unhappy” to “this is the cause” in minutes instead of hours, and the culture shifted from blame to learning, which is what actually prevents the next outage.
Key architecture decisions
- Consistent instrumentation via OpenTelemetry across services
- Service-level objectives backed by actionable alerting
- Runbooks and on-call practices that made incidents survivable
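One common way to back a service-level objective with actionable alerting is to alert on burn rate, meaning how fast the error budget is being spent, rather than on raw error counts. The sketch below shows that technique under stated assumptions; the source does not say which alerting scheme was used. The function names are hypothetical, and the 14.4 threshold is a commonly cited value that corresponds to spending 2% of a 30-day budget in one hour.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the full window;
    higher values exhaust it proportionally sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic means nothing was burned
    allowed_error_rate = 1.0 - slo_target
    return (failed / total) / allowed_error_rate


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast, which filters out brief blips."""
    return short_window_rate >= threshold and long_window_rate >= threshold


if __name__ == "__main__":
    # Assumed sample counts for a five-minute and a one-hour window.
    short = burn_rate(failed=20, total=1_000, slo_target=0.999)
    long_ = burn_rate(failed=900, total=50_000, slo_target=0.999)
    print(should_page(short, long_))
```

Requiring both a short and a long window to exceed the threshold is a design choice that trades a little detection latency for fewer false pages, which keeps each alert tied to real budget loss.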