Site Reliability Engineer Roadmap

A beginner-to-job-ready path covering Linux, networking, coding, observability, CI/CD, and SRE principles needed to operate and scale reliable production systems.


Stage 1: Linux & Networking Foundations

Linux Command Line & Administration
SREs live in terminals, and Linux is the dominant server OS.
tutorial: The Linux Command Line (full book free online) · course: Linux Basics for Hackers – No Starch Press intro
Networking Fundamentals (TCP/IP, DNS, HTTP)
Diagnosing outages requires deep network protocol knowledge.
doc: MDN – HTTP Overview · tutorial: Cloudflare Learning Center – How DNS Works
Shell Scripting (Bash)
Automating repetitive ops tasks is a core SRE responsibility.
tutorial: Bash Scripting Tutorial – Ryan's Tutorials · doc: GNU Bash Manual
System Performance Basics
Identifying CPU, memory, and I/O bottlenecks is daily SRE work.
tutorial: Linux Performance Tools by Brendan Gregg
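
As a small illustration of the bottleneck checks described under System Performance Basics, the sketch below parses the Linux load average and normalizes it by CPU count. The helper names `parse_loadavg` and `load_per_cpu` are hypothetical, and the sample string only imitates the layout of /proc/loadavg. Treat the result as a rough signal, not a diagnosis.

```python
import os


def parse_loadavg(text: str) -> tuple:
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def load_per_cpu(load: float, cpu_count: int) -> float:
    """Normalize a load average by CPU count (hypothetical helper).

    Values well above 1.0 often point to contention. On Linux the load
    average also counts tasks in uninterruptible I/O wait, so a high value
    is not always a CPU problem.
    """
    return load / max(cpu_count, 1)


if __name__ == "__main__":
    # Sample text in the /proc/loadavg layout; the values are illustrative.
    sample = "0.52 0.58 0.59 1/389 12345"
    one, five, fifteen = parse_loadavg(sample)
    cpus = os.cpu_count() or 1
    print(f"1-minute load per CPU: {load_per_cpu(one, cpus):.2f}")
```

On a real Linux host the sample string could be replaced by the contents of /proc/loadavg.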

Stage 2: Programming & Scripting Proficiency

Python for Automation & Tooling
Python is the dominant language for SRE scripts and tooling.
course: freeCodeCamp – Scientific Computing with Python · doc: Python Official Documentation
Go (Golang) Fundamentals
Many core SRE tools, including Kubernetes and Prometheus, are written in Go.
doc: A Tour of Go – Official Interactive Tour · tutorial: Go by Example
Git & Version Control
All infrastructure and code changes must be version-controlled.
doc: Pro Git Book (free) · tutorial: GitHub Skills – Introduction to GitHub
Data Formats: JSON, YAML, Regex
Config files and APIs rely on JSON and YAML, and log parsing relies on regular expressions.
doc: YAML Official Specification · tutorial: Regex101 – Interactive Regex Tester
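
To make the link between regular expressions and structured data concrete, here is a minimal sketch that turns one log line into a JSON record using only the Python standard library. The log layout, the pattern `LOG_PATTERN`, and the function `parse_line` are assumptions made for illustration. Real log formats differ and need their own patterns.

```python
import json
import re
from typing import Optional

# Hypothetical access-log layout, chosen only for this example:
#   <client ip> "<METHOD> <path>" <status> <duration>ms
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) "(?P<method>[A-Z]+) (?P<path>\S+)" '
    r'(?P<status>\d{3}) (?P<ms>\d+)ms'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert numeric fields so downstream tools can compare and aggregate.
    record["status"] = int(record["status"])
    record["ms"] = int(record["ms"])
    return record


if __name__ == "__main__":
    line = '10.0.0.7 "GET /healthz" 200 12ms'
    print(json.dumps(parse_line(line), sort_keys=True))
```

Returning None for lines that do not match keeps malformed input from raising an exception in the middle of a batch.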

Stage 3: Containers, Orchestration & Infrastructure as Code

Docker & Containerization
Modern services run in containers; SREs must build and debug them.
doc: Docker Official Documentation – Get Started · tutorial: Play with Docker – Browser-based Labs
Kubernetes
Kubernetes is the de facto standard for deploying and scaling containerized workloads.
doc: Kubernetes Official Documentation · tutorial: Kubernetes Basics Interactive Tutorial
Terraform – Infrastructure as Code
Reproducible infrastructure requires declarative IaC tooling.
doc: Terraform Official Documentation · tutorial: HashiCorp Learn – Get Started with Terraform
Cloud Platforms (AWS / GCP / Azure)
Most production infrastructure runs on public cloud providers.
course: AWS Cloud Practitioner Essentials (free) · doc: Google Cloud Documentation

Stage 4: Observability – Monitoring, Logging & Tracing

Metrics & Prometheus
SREs measure system health through metrics, and Prometheus is a de facto standard for collecting them in cloud-native systems.
doc: Prometheus Official Documentation · tutorial: Prometheus Getting Started Guide
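
To show what a scraped metric looks like on the wire, the sketch below renders a counter in the Prometheus text exposition format using plain string formatting. The functions `escape_label_value` and `render_counter` and the metric name are hypothetical and exist only for this example. A real service would normally use an official client library instead of formatting the text by hand.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_counter(name: str, help_text: str, samples: list) -> str:
    """Render a counter as text; samples is a list of (labels, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Sort labels so the output is stable from one call to the next.
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric name and values, for illustration only.
    print(
        render_counter(
            "http_requests_total",
            "Total HTTP requests.",
            [({"method": "get", "code": "200"}, 1027)],
        ),
        end="",
    )
```

Each sample line pairs a metric name and its label set with a numeric value, which is the shape a Prometheus server reads when it scrapes a target.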
Grafana for Dashboards
Visualizing metrics enables fast incident detection and review.
doc: Grafana Official Documentation · tutorial: Grafana Fundamentals Tutorial
Centralized Logging (ELK / Loki)
Logs are essential evidence for diagnosing production failures.
doc: Elastic – Getting Started with Elasticsearch · doc: Grafana Loki Documentation
Distributed Tracing with OpenTelemetry
Tracing pinpoints latency sources across microservice boundaries.
doc: OpenTelemetry Official Documentation · tutorial: OpenTelemetry Getting Started

Stage 5: CI/CD, Release Engineering & Incident Management

CI/CD Pipelines (GitHub Actions / Jenkins)
Reliable, automated deployments reduce human error in releases.
doc: GitHub Actions Documentation · tutorial: GitHub Actions Quickstart
GitOps with ArgoCD / Flux
GitOps keeps the desired deployment state in Git, which makes changes auditable and lets controllers correct drift automatically.
doc: Argo CD Official Documentation · doc: Flux CD Documentation
Incident Management & Postmortems
Structured incident response minimizes MTTR and prevents recurrence.
doc: Google SRE Book – Chapter 14: Managing Incidents (free) · tutorial: PagerDuty Incident Response Guide
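
Since MTTR is the number that incident response aims to reduce, a short sketch of how it can be computed may help. It assumes MTTR means the average time from detection to resolution. Teams define the start and end points differently, and the function name and timestamps below are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list) -> timedelta:
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not integers
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Made-up incident timestamps, for illustration only.
    incidents = [
        (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
        (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
    ]
    print(mean_time_to_recovery(incidents))
```

An average hides outliers, so a single long incident can dominate the figure. Looking at the individual durations alongside the mean gives a fuller picture.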
Chaos Engineering
Proactively injecting failures builds confidence in system resilience.
doc: Principles of Chaos Engineering · doc: Chaos Monkey – Netflix OSS Documentation

Stage 6: SRE Principles, SLOs & Reliability Design

SLIs, SLOs, and Error Budgets
Error budgets balance reliability and feature velocity systematically.
doc: Google SRE Book – Chapter 4: SLOs (free)
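
The arithmetic behind an error budget is short enough to sketch. The example assumes an availability SLO expressed as a fraction (0.999 for 99.9%) and a fixed window. The function names and request counts are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    A negative result means the budget is overspent.
    """
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("no error budget exists for this SLO and traffic")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(f"{error_budget_minutes(0.999):.1f} minutes")
    # 250 failures against about 1,000 allowed leaves roughly 75% of the budget.
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} remaining")
```

When the remaining budget approaches zero, the usual policy is to slow feature releases and spend engineering time on reliability until the budget recovers.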
Capacity Planning & Load Testing
Predicting and validating capacity prevents saturation incidents.
doc: k6 Load Testing Documentation · tutorial: k6 Getting Started Guide
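
Before running a load test, a back-of-the-envelope estimate gives a starting point to validate. The sketch below sizes a fleet from peak traffic with a utilization headroom and applies Little's law (in-flight requests = arrival rate × mean latency). The function names, the default 70% utilization target, and the traffic figures are assumptions for illustration, not recommendations.

```python
import math


def required_instances(
    peak_rps: float, per_instance_rps: float, target_utilization: float = 0.7
) -> int:
    """Instances needed to serve peak_rps while staying under the target utilization."""
    if per_instance_rps <= 0 or not 0.0 < target_utilization <= 1.0:
        raise ValueError("per_instance_rps must be > 0 and utilization in (0, 1]")
    return math.ceil(peak_rps / (per_instance_rps * target_utilization))


def in_flight_requests(rps: float, mean_latency_s: float) -> float:
    """Little's law: average concurrency = arrival rate * mean time in system."""
    return rps * mean_latency_s


if __name__ == "__main__":
    # Hypothetical figures: 1200 req/s peak, 100 req/s per instance at saturation.
    print(required_instances(1200, 100))
    # 200 req/s at 250 ms mean latency keeps about 50 requests in flight.
    print(in_flight_requests(200, 0.25))
```

The per-instance throughput is the number a load test should measure. The estimate is only as good as that input.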
Toil Reduction & Automation Strategy
Eliminating toil is the SRE mandate that frees time for engineering.
doc: Google SRE Book – Chapter 5: Eliminating Toil (free)
Security & Compliance Basics for SRE
SREs own production; secure defaults and access controls are essential.
doc: OWASP Top 10 · tutorial: AWS Security Best Practices Documentation

Stage 7: Job Readiness & Career Launch

SRE Interview Preparation
System design and coding interviews gate SRE job offers.
tutorial: Google SRE Interview Tips (Google Careers) · course: freeCodeCamp – System Design for Interviews
Build a Portfolio & Home Lab
Demonstrable projects prove hands-on skills to employers.
tutorial: GitHub – Building an SRE Home Lab Guide · tutorial: Killercoda Interactive Scenarios (Katacoda was shut down in 2022)
Certifications (CKA, AWS SAA)
Recognized certs validate Kubernetes and cloud skills to hiring managers.
doc: CNCF – Certified Kubernetes Administrator (CKA) · doc: AWS Certified Solutions Architect – Associate
Community & Continuous Learning
SRE practices evolve rapidly; community keeps skills current.
doc: Google SRE Books (all three, free online) · tutorial: SREcon Conference Talks – USENIX

Generated & verified by RM Full Stack & AI Engineer