Engineering Note by Yasin Engin
Fast Incident Triage for Lab and Production
A short triage model to reduce mean time to recovery when network automation fails.
Tags: incident-response, sre, troubleshooting
Three-phase triage model
Use this order every time:
- Scope: identify impacted services and blast radius.
- Stabilize: stop automation loops and prevent new writes.
- Recover: roll back or patch, then resume controlled traffic.
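The fixed order can be enforced in tooling rather than left to memory. The following is a minimal sketch, not a prescribed implementation: the `Phase` and `Triage` names are hypothetical, and it assumes each phase is marked complete explicitly by the person driving the incident.

```python
from enum import IntEnum
from typing import List, Optional


class Phase(IntEnum):
    """The three triage phases, numbered in the order they must run."""
    SCOPE = 1
    STABILIZE = 2
    RECOVER = 3


class Triage:
    """Tracks completed phases and refuses to skip ahead (hypothetical helper)."""

    def __init__(self) -> None:
        self.completed: List[Phase] = []

    def next_phase(self) -> Optional[Phase]:
        """Return the phase that must be done next, or None when all are done."""
        for phase in Phase:
            if phase not in self.completed:
                return phase
        return None

    def complete(self, phase: Phase) -> None:
        """Mark a phase as done, but only if it is the expected one."""
        expected = self.next_phase()
        if expected is None:
            raise RuntimeError("all triage phases are already complete")
        if phase != expected:
            raise ValueError(f"expected {expected.name}, got {phase.name}")
        self.completed.append(phase)


if __name__ == "__main__":
    triage = Triage()
    triage.complete(Phase.SCOPE)
    triage.complete(Phase.STABILIZE)
    print("next:", triage.next_phase().name)
```

Raising on an out-of-order phase is a deliberate choice here: attempting recovery before automation loops are stopped is the mistake the ordering is meant to prevent.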
Data you always collect
- Last successful deployment ID.
- Change diff for affected nodes.
- Error rates and latency by region.
- Queue lag, if a message broker is used.
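One way to make this list hard to forget is to capture it in a single record that can report its own gaps. This is a sketch under stated assumptions: the `TriageSnapshot` and `RegionHealth` names, the field names, and the units (seconds for lag, milliseconds for latency) are all illustrative and not taken from any particular tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RegionHealth:
    error_rate: float   # fraction of failed requests, 0.0 to 1.0 (assumed unit)
    latency_ms: float   # latency in milliseconds (percentile is an assumption left to the team)


@dataclass
class TriageSnapshot:
    """The data collected for every incident (hypothetical structure)."""
    last_good_deployment_id: Optional[str] = None
    change_diff: Dict[str, str] = field(default_factory=dict)          # node name -> diff text
    region_health: Dict[str, RegionHealth] = field(default_factory=dict)
    uses_broker: bool = False                                          # is a message broker involved?
    queue_lag_seconds: Optional[float] = None

    def missing(self) -> List[str]:
        """List the items that have not been collected yet."""
        gaps = []
        if not self.last_good_deployment_id:
            gaps.append("last successful deployment id")
        if not self.change_diff:
            gaps.append("change diff for affected nodes")
        if not self.region_health:
            gaps.append("error rates and latency by region")
        # Queue lag is only required when a broker is in the path.
        if self.uses_broker and self.queue_lag_seconds is None:
            gaps.append("queue lag")
        return gaps


if __name__ == "__main__":
    snapshot = TriageSnapshot(uses_broker=True)
    for item in snapshot.missing():
        print("still needed:", item)
```

Treating queue lag as conditional mirrors the list above: it counts as missing only when a broker is actually in use.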
Team rule
One person drives the timeline, one validates the rollback, and one communicates status.
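The rule can be checked at the start of an incident with a few lines of code. The sketch below assumes the rule means three different people, which the note implies but does not state outright; the role keys and the `check_roles` function are hypothetical names.

```python
from typing import Dict, List

# Role keys are illustrative labels for the three duties in the team rule.
ROLES = ("timeline", "rollback_validation", "status_comms")


def check_roles(assignment: Dict[str, str]) -> List[str]:
    """Return a list of problems with a role assignment; empty means it is valid."""
    problems = []
    for role in ROLES:
        if not assignment.get(role):
            problems.append(f"unassigned role: {role}")
    # Only compare people for roles that are actually filled.
    people = [assignment[role] for role in ROLES if assignment.get(role)]
    if len(set(people)) < len(people):
        problems.append("one person holds more than one role")
    return problems


if __name__ == "__main__":
    issues = check_roles({"timeline": "alice", "rollback_validation": "alice"})
    for issue in issues:
        print(issue)
```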