Site Reliability Engineering Study Guide
📖 Core Concepts
Site Reliability Engineering (SRE) – A software‑engineering discipline focused on keeping large services available, fast, and efficient.
Primary goals – Keeping services available, responsive, and efficient during deployments, hardware failures, and security attacks.
Automation & Infrastructure‑as‑Code – SREs heavily script and automate operations to reduce manual toil.
Reliability metrics –
Service Level Indicator (SLI): measurable attribute of service performance (e.g., latency).
Service Level Objective (SLO): target value for an SLI (e.g., 99.9% request success).
Error budget: the allowable deviation from the SLO; drives release cadence.
Observability – Designing systems so any “arbitrary question” about state can be answered without prior knowledge.
Toil – Repetitive, manual work that scales linearly with service size; SREs aim to minimize it.
Chaos engineering – Deliberately injecting failures to verify resilience.
Deployment models – Kitchen Sink, Infrastructure, Product/Application, Embedded – each defines the scope of SRE ownership.
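The SLI/SLO relationship above can be sketched in a few lines of Python. This is a toy illustration with made-up request counts, not a real monitoring pipeline:

```python
# Toy sketch: compute an availability SLI from request counts and
# check it against an SLO. All numbers are illustrative assumptions.
def availability_sli(successful: int, total: int) -> float:
    """SLI: fraction of requests served successfully."""
    return successful / total

SLO = 0.999  # target: 99.9% request success

sli = availability_sli(successful=999_500, total=1_000_000)
print(f"SLI = {sli:.4%}, SLO met: {sli >= SLO}")
```

In practice the SLI would be computed by the monitoring system over a rolling window, but the comparison against the SLO target is exactly this simple.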
---
📌 Must Remember
SRE = DevOps implementation that zeroes in on reliability (DevOps is broader).
Core SRE responsibilities: availability, latency, performance, efficiency, change management, monitoring, emergency response, capacity planning.
Error budget = (1 – SLO); when exhausted, pause releases and focus on reliability.
Observability ≠ monitoring – observability lets you ask new questions; monitoring only answers predefined ones.
Toil reduction is a primary KPI for SRE teams.
Chaos engineering is a practiced SRE technique, not a one‑off test.
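The "error budget = (1 – SLO)" formula turns directly into arithmetic. A minimal sketch, using an assumed 99.9% SLO and an assumed 10M requests per 30-day window:

```python
# Toy sketch: translate an SLO into an error budget, expressed both
# as a request allowance and as allowed downtime over a 30-day window.
SLO = 0.999
error_budget = 1 - SLO  # 0.1% of requests may fail

monthly_requests = 10_000_000  # assumed traffic volume
allowed_failures = monthly_requests * error_budget  # ~10,000 failed requests

window_minutes = 30 * 24 * 60  # 43,200 minutes in 30 days
allowed_downtime_min = window_minutes * error_budget  # ~43.2 minutes

print(round(allowed_failures), round(allowed_downtime_min, 1))
```

When the budget is exhausted (e.g., more than ~43 minutes of full downtime in the window), releases pause and the team works on reliability instead.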
---
🔄 Key Processes
Define Reliability Goals
Choose SLIs → set SLOs → compute error budget.
Measure & Monitor
Instrument code → collect metrics → alert on SLI breaches.
Incident Management Flow
Detect → Triage → Mitigate → Post‑mortem → Action items.
Capacity Planning Cycle
Forecast demand → model usage → provision resources → revisit after changes.
Toil Reduction Loop
Identify repetitive tasks → automate via scripts/CI‑CD → verify automation → repeat.
Chaos Experiment
Pick failure mode → inject fault → observe impact → improve design.
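The chaos-experiment loop above can be sketched as a fault-injection wrapper. This is a toy illustration, not a real chaos framework: the `flaky` wrapper and failure rate are invented for the example, and the "observe impact" step is played by checking whether the caller's retry logic masks the injected fault:

```python
# Toy chaos-experiment sketch: inject faults into a fraction of calls,
# then observe whether client-side retries absorb the failures.
import random

def flaky(func, failure_rate: float, rng: random.Random):
    """Wrap func so a random fraction of calls raises an injected fault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper

def call_with_retries(func, attempts: int = 3):
    """The resilience mechanism under test: simple bounded retries."""
    for i in range(attempts):
        try:
            return func()
        except ConnectionError:
            if i == attempts - 1:
                raise  # fault survived all retries: experiment found a gap

rng = random.Random(42)  # seeded so the experiment is reproducible
service = flaky(lambda: "ok", failure_rate=0.3, rng=rng)
print(call_with_retries(service))
```

Real chaos tooling injects faults at the infrastructure level (network, process, zone) under strict scoping and abort conditions, but the experimental loop is the same: pick a failure mode, inject it, and verify the system's defenses actually engage.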
---
🔍 Key Comparisons
SRE vs. DevOps – SRE: reliability‑focused implementation; DevOps: broader culture of collaboration and delivery.
Kitchen Sink vs. Embedded Model – Kitchen Sink: SRE team owns many services end‑to‑end; Embedded: SREs sit inside product teams, applying reliability practices locally.
Observability vs. Monitoring – Observability: answers any unknown question; Monitoring: watches predefined metrics.
Error Budget vs. Uptime SLA – Error Budget: internal, flexible, drives release decisions; SLA: external contractual guarantee with penalties, typically looser than the internal SLO.
---
⚠️ Common Misunderstandings
“SRE is just ops” – Wrong; SRE is a software‑engineering role that builds automation, not a manual ops shop.
“Higher uptime automatically means good SRE” – Not enough; must also meet latency & error‑budget targets.
“Chaos engineering = breaking production” – Chaos experiments are controlled, scoped, and designed to be safe.
“All SRE teams use the same deployment model” – Teams choose Kitchen Sink, Infrastructure, Product, or Embedded based on organization size and service ownership.
---
🧠 Mental Models / Intuition
“Error budget as fuel” – When you have plenty, you can safely push new features; when low, you must refuel by fixing reliability.
“Observability is a window, not a door” – It lets you see inside the system; you still need actions (automation) to act on what you see.
“Toil is a leak; automation is the patch” – Find leaks (repetitive work) and seal them with code.
---
🚩 Exceptions & Edge Cases
Small organizations may combine SRE and DevOps roles; a dedicated SRE team structure (e.g., the Embedded model) may not exist.
Error budget exhaustion does not always halt all releases; critical hot‑fixes may proceed with explicit risk acknowledgment.
Chaos experiments must be paused during major incidents to avoid compounding failures.
---
📍 When to Use Which
Choose Kitchen Sink model when the organization has many cross‑service dependencies and a central reliability team can provide global oversight.
Pick Embedded model for fast‑moving product teams that need reliability expertise directly in the development loop.
Apply chaos engineering when the system has mature monitoring/observability and can tolerate controlled failures.
Use error‑budget gating for services with strict SLOs; otherwise, rely on informal reliability reviews.
---
👀 Patterns to Recognize
Repeated “manual script” tasks → flag as toil → candidate for automation.
SLO breach alerts → check error‑budget status before deciding on rollback.
High latency spikes coinciding with deployment windows → suspect change‑management issue.
Post‑mortem lacking actionable items → indicates weak incident‑learning process.
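The "check error-budget status before deciding on rollback" pattern is often framed as a burn-rate check. A minimal sketch; the 2x threshold here is an illustrative assumption, not a standard value:

```python
# Toy sketch: burn rate = (fraction of error budget consumed) /
# (fraction of the SLO window elapsed). A burn rate of 1.0 means the
# budget will last exactly to the end of the window.
def burn_rate(budget_consumed: float, window_elapsed: float) -> float:
    return budget_consumed / window_elapsed

def should_roll_back(budget_consumed: float, window_elapsed: float,
                     threshold: float = 2.0) -> bool:
    """Roll back if the budget is burning faster than threshold x nominal."""
    return burn_rate(budget_consumed, window_elapsed) > threshold

# 40% of the budget gone only 10% into the window: burn rate 4x nominal.
print(should_roll_back(budget_consumed=0.4, window_elapsed=0.1))
```

An SLO breach alert alone does not dictate a rollback; the same breach is urgent when the budget is nearly spent and tolerable when plenty remains.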
---
🗂️ Exam Traps
Confusing “observability” with “monitoring” – exam answers that define observability as simply “having dashboards” are wrong.
Selecting “DevOps = SRE” – they are related but not identical; look for the reliability‑specific focus.
Choosing “Chaos engineering only for disaster recovery” – it’s actually for testing overall system resilience, not just DR plans.
Assuming the Kitchen Sink model is always best – size and service ownership dictate the appropriate model; a blanket answer will be a distractor.