Site Reliability Engineering - Notes

Hope is not a strategy.
The traditional sysadmin approach does not scale with traffic.
Sysadmins want stability; developers want features. This may cause strife between teams.
Knowledge of UNIX internals and networking (Layer 1 to Layer 3) is a plus for SRE work.
SREs are capped at 50% “ops” work; the rest of their time goes to development.
SRE responsibilities: availability, latency, performance, efficiency, change management, monitoring, emergency response, capacity planning.
A 100% reliability target is hard to achieve and almost always unnecessary.
The error budget is what the SLO leaves over: 1 minus the SLO (e.g. a 99.9% availability SLO leaves a 0.1% budget for unavailability). Spend it on launching new features.
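
The error budget arithmetic can be sketched in a few lines. This is a minimal illustration, assuming a time-based availability SLO measured over a 30-day window; the function names and the window length are assumptions, not part of these notes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
```

When the remaining fraction reaches zero, the budget is spent and the sensible move is to slow down launches until the window rolls over.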
Software should monitor and humans should only be alerted when they need to take action.
Monitoring output: alerts (a human must act immediately), tickets (a human must act, but not immediately), logging (nobody needs to look unless asked; kept for later diagnosis).
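
The three kinds of monitoring output amount to a small decision rule. The sketch below is only an illustration; the `Output` enum, the `route` function, and its two boolean inputs are hypothetical names chosen here, not an API from any monitoring system.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # nobody needs to look; kept for later diagnosis


def route(needs_human_action: bool, urgent: bool) -> Output:
    """Pick the monitoring output for an event."""
    if not needs_human_action:
        return Output.LOG
    return Output.ALERT if urgent else Output.TICKET
```

Anything that does not require a human ends up in the log, which keeps alerts reserved for events that actually need a person right away.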
Disaster playbooks are very helpful for reducing MTTR (mean time to repair) and improving emergency response.
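
To see whether playbooks actually help, MTTR has to be measured. A minimal sketch, assuming each incident is recorded as a (detected, resolved) pair of timestamps; that record shape and the function name are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def mean_time_to_repair(incidents: Sequence[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)
```

Comparing this figure for incidents handled with and without a playbook gives a rough measure of how much the playbooks are worth.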
Change management: Progressive rollouts -> Detect problems quickly -> Roll back safely when problems arise.
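
The rollout, detect, roll back loop can be sketched as follows. This is an illustration under stated assumptions: `deploy`, `error_rate`, and `rollback` are hypothetical callables supplied by the caller, and a single error-rate threshold stands in for whatever health signal a real system would use.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    error_rate: Callable[[], float],
    rollback: Callable[[], None],
    max_error_rate: float = 0.01,
) -> bool:
    """Widen the rollout stage by stage; roll back on the first unhealthy stage.

    Returns True if every stage completed, False if the change was rolled back.
    """
    for fraction in stages:
        deploy(fraction)                   # expose the change to this share of traffic
        if error_rate() > max_error_rate:  # detect problems before going wider
            rollback()
            return False
    return True
```

A call might use stages such as [0.01, 0.1, 0.5, 1.0], so that a bad change is caught while it affects only a small share of users.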
Capacity Planning: Organic and inorganic demand forecasting, regular load testing.
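
A capacity plan combines the demand forecast with load-test results. The sketch below assumes organic demand grows at a compound rate per period, inorganic demand (launches, campaigns) is added on top as a fixed load, and the per-instance throughput comes from load testing. The growth model, the 30% default headroom, and all names are illustrative assumptions.

```python
import math


def required_capacity(
    current_peak_qps: float,
    organic_growth_rate: float,
    periods: int,
    inorganic_qps: float = 0.0,
    headroom: float = 0.3,
) -> float:
    """Forecast peak demand after `periods` and add headroom on top."""
    # Organic demand: natural growth, modelled here as compound growth.
    organic = current_peak_qps * (1.0 + organic_growth_rate) ** periods
    # Inorganic demand: extra load from planned events, added as a flat amount.
    return (organic + inorganic_qps) * (1.0 + headroom)


def instances_needed(capacity_qps: float, qps_per_instance: float) -> int:
    """Instances required, given the per-instance throughput seen in load tests."""
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    return math.ceil(capacity_qps / qps_per_instance)
```

Rerunning the load test regularly keeps `qps_per_instance` honest, since the throughput of one instance changes as the software changes.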