Data Engineer Roadmap

A structured path from programming fundamentals through production-grade data pipelines, warehousing, and cloud infrastructure, covering the core skills expected in a data engineering role.

✓ Every resource link below is verified live.

Stage 1: Programming & SQL Foundations

Python Fundamentals
Primary language for data scripting and pipeline development.
doc: Python Official Docs · course: freeCodeCamp Scientific Computing with Python
SQL & Relational Databases
Core skill for querying, transforming, and modeling structured data.
course: freeCodeCamp Relational Database Certification · doc: PostgreSQL Official Documentation
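A minimal sketch of the kind of querying this topic covers, using Python's built-in sqlite3 module so it runs without a database server. The orders table, its columns, and its rows are made up for illustration; the same SQL would work on PostgreSQL with minor changes.

```python
import sqlite3

# In-memory database; the orders table and its rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
)

# Aggregate revenue per customer, largest total first.
totals = conn.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

print(totals)
```

Parameterized inserts (the ? placeholders) are used instead of string formatting, which is the habit worth building early.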
Linux & Command Line Basics
Most data tools run on Linux, so shell fluency is needed every day.
tutorial: The Linux Command Line (free online) · doc: GNU Bash Manual
Git & Version Control
Track pipeline code changes and collaborate on engineering teams.
doc: Git Official Documentation · tutorial: GitHub Skills

Stage 2: Data Modeling & Warehousing Concepts

Data Modeling Fundamentals
Well-designed schemas underpin every reliable data product.
tutorial: dbt Learn: Data Modeling · doc: Kimball Group Design Tips
Data Warehouse Concepts
Understand OLAP, star schemas, and analytical query patterns.
video: freeCodeCamp Data Warehousing Full Course · doc: Google BigQuery Concepts Overview
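To make the star schema idea concrete, here is a small sketch with one fact table and one dimension table, again using sqlite3 as a stand-in for a real warehouse. The table names (fact_sales, dim_product) and the data are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One dimension table (descriptive attributes) and one fact table (measures).
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    category   TEXT
);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,
    revenue    REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO fact_sales (product_id, quantity, revenue) VALUES
    (1, 2, 40.0), (1, 1, 20.0), (2, 3, 90.0);
""")

# Typical analytical query: join facts to a dimension, then aggregate.
revenue_by_category = conn.execute(
    "SELECT d.category, SUM(f.revenue) FROM fact_sales AS f "
    "JOIN dim_product AS d ON d.product_id = f.product_id "
    "GROUP BY d.category ORDER BY d.category"
).fetchall()
conn.close()

print(revenue_by_category)
```

The fact table holds only keys and numeric measures, while descriptive attributes live in the dimension; analytical queries then follow the same join-and-aggregate shape.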
dbt (data build tool)
Industry-standard tool for SQL-based transformations in warehouses.
doc: dbt Official Documentation · course: dbt Fundamentals Free Course
NoSQL & Document Databases
Many pipelines ingest unstructured or semi-structured data sources.
doc: MongoDB Official Documentation · tutorial: freeCodeCamp MongoDB & Mongoose

Stage 3: Python for Data Engineering

Pandas & Data Wrangling
Essential for local data exploration, cleaning, and batch transforms.
doc: Pandas Official Documentation · tutorial: Kaggle Pandas Course
Working with APIs & File Formats
Routinely ingest JSON, CSV, and Parquet files and pull data from REST or GraphQL APIs.
doc: Python Requests Library Docs · doc: Apache Parquet Format Overview
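A short sketch of a common ingestion step: parse a nested JSON payload and flatten it into CSV rows. It uses only the standard library; the payload is a hard-coded, hypothetical API response so the example runs offline, and in practice it would come from an HTTP call.

```python
import csv
import io
import json

# Hypothetical API response body, hard-coded so the example runs offline.
payload = (
    '{"results": ['
    '{"id": 1, "name": "alice", "tags": ["a", "b"]}, '
    '{"id": 2, "name": "bob", "tags": []}'
    "]}"
)

records = json.loads(payload)["results"]

# Flatten each nested record into a flat row suitable for CSV.
rows = [
    {"id": r["id"], "name": r["name"], "tag_count": len(r["tags"])}
    for r in records
]

# Write to an in-memory buffer; a real pipeline would write to a file or object store.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["id", "name", "tag_count"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```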
Python Testing for Pipelines
Reliable pipelines require automated unit and integration tests.
doc: pytest Official Documentation · tutorial: Real Python: Testing Your Code
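As a rough illustration of unit testing a pipeline step, the sketch below pairs a small pure transform with two tests written in the plain-assert style that pytest collects (functions named test_*). The normalize_email function is a hypothetical example, not part of any library.

```python
def normalize_email(raw):
    """Trim whitespace and lowercase an email; return None for blank input."""
    if raw is None:
        return None
    cleaned = raw.strip().lower()
    return cleaned or None


def test_normalize_email_strips_and_lowercases():
    assert normalize_email("  Alice@Example.COM ") == "alice@example.com"


def test_normalize_email_handles_missing_values():
    assert normalize_email(None) is None
    assert normalize_email("   ") is None
```

Keeping transforms as pure functions (input in, output out, no I/O) is what makes them this easy to test; saved to a file such as test_transforms.py, the tests would run with the pytest command.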
Object-Oriented & Functional Python Patterns
Write maintainable, reusable pipeline components at scale.
doc: Python OOP Official Tutorial

Stage 4: Pipeline Orchestration & Batch Processing

Apache Airflow
Widely adopted orchestrator for scheduling and monitoring DAGs (directed acyclic graphs of tasks).
doc: Apache Airflow Official Documentation · video: freeCodeCamp Airflow Full Course
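The core idea behind an orchestrator's DAG is that tasks run only after their upstream dependencies finish. The sketch below shows that ordering with the standard library's graphlib module (Python 3.9+). It is not the Airflow API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (task names are hypothetical).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
}

# static_order() yields tasks so that every dependency comes before its dependents.
run_order = list(TopologicalSorter(dependencies).static_order())

print(run_order)
```

An orchestrator adds scheduling, retries, and monitoring on top of this ordering, but the dependency graph is the part you design yourself.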
Apache Spark & PySpark
Process datasets too large to fit in a single machine's memory.
doc: Apache Spark Official Documentation · tutorial: Databricks PySpark Getting Started
ETL vs ELT Patterns
Choose the right pattern to match modern cloud warehouse capabilities.
doc: dbt ETL vs ELT Guide
Data Quality & Validation
Catch bad data early before it corrupts downstream analytics.
doc: Great Expectations Official Docs · doc: dbt Tests Documentation
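A minimal sketch of row-level validation, assuming two simple rules: required columns must not be null, and amount must be non-negative. The function name, column names, and rules are illustrative; tools such as Great Expectations or dbt tests express similar checks declaratively.

```python
def validate_rows(rows, required=("id", "amount")):
    """Split rows into (valid, rejected) using not-null and range checks."""
    valid, rejected = [], []
    for row in rows:
        missing = [col for col in required if row.get(col) is None]
        if missing:
            rejected.append((row, f"missing: {', '.join(missing)}"))
        elif row["amount"] < 0:
            rejected.append((row, "amount must be non-negative"))
        else:
            valid.append(row)
    return valid, rejected


rows = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.0},
    {"id": 3, "amount": -2.0},
]
valid, rejected = validate_rows(rows)

print(len(valid), [reason for _, reason in rejected])
```

Routing rejected rows to a separate output with a reason, rather than silently dropping them, keeps bad data visible without letting it reach downstream tables.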

Stage 5: Streaming & Real-Time Data

Apache Kafka Fundamentals
Industry-standard event streaming backbone for real-time pipelines.
doc: Apache Kafka Official Documentation · tutorial: Confluent Kafka Tutorials
Stream Processing with Flink or Spark Structured Streaming
Transform and aggregate data in motion, not just at rest.
doc: Spark Structured Streaming Guide · doc: Apache Flink Official Documentation
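Aggregating data in motion usually means grouping events into time windows. The sketch below counts events per fixed, non-overlapping (tumbling) window in plain Python. It only illustrates the concept and does not use the Flink or Spark APIs; the events and the integer-second timestamps are assumptions.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window_start, key) for fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# (timestamp_in_seconds, event_type) pairs; values are made up.
events = [(5, "click"), (42, "click"), (61, "view"), (119, "click")]
counts = tumbling_window_counts(events)

print(counts)
```

Real stream processors also handle late and out-of-order events, state that outlives a single process, and fault tolerance, which is where most of the learning effort in this stage goes.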
Change Data Capture (CDC)
Sync database changes into pipelines without full table scans.
doc: Debezium Official Documentation
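The idea of CDC can be sketched as replaying a log of change events against a replica instead of re-reading the whole source table. The event shape below (op, key, after) is a simplified assumption for illustration and is not Debezium's actual event envelope.

```python
def apply_change(replica, event):
    """Apply one change event to an in-memory replica keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["after"]
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")
    return replica


replica = {}
# Hypothetical change log, in the order the source database committed the changes.
events = [
    {"op": "insert", "key": 1, "after": {"name": "alice"}},
    {"op": "insert", "key": 2, "after": {"name": "bob"}},
    {"op": "update", "key": 1, "after": {"name": "alice b"}},
    {"op": "delete", "key": 2},
]
for event in events:
    apply_change(replica, event)

print(replica)
```

Because each event touches a single key, the cost of staying in sync scales with the number of changes rather than the size of the table.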

Stage 6: Cloud Platforms & Infrastructure

Cloud Fundamentals (AWS or GCP)
Most production data infrastructure now runs on a cloud provider.
doc: AWS Getting Started Resource Center · doc: Google Cloud Documentation
Infrastructure as Code with Terraform
Provision repeatable, version-controlled cloud data infrastructure.
doc: Terraform Official Documentation · tutorial: HashiCorp Learn Terraform
Docker & Containerization
Package pipelines and dependencies for consistent deployment everywhere.
doc: Docker Official Documentation · tutorial: Docker Getting Started Tutorial
Cloud Data Services (S3, BigQuery, Redshift, GCS)
Store and query petabytes cheaply using managed cloud services.
doc: Amazon S3 Developer Guide · doc: Google BigQuery Documentation

Stage 7: Production Engineering & Job Readiness

Observability & Pipeline Monitoring
Production pipelines need alerting, logging, and data SLA tracking.
doc: OpenTelemetry Documentation · doc: Apache Airflow Monitoring Guide
Data Lakehouse Architecture
Delta Lake and Iceberg unify streaming and batch on object storage.
doc: Delta Lake Official Documentation · doc: Apache Iceberg Official Documentation
System Design for Data Engineers
Both interviews and day-to-day work require designing scalable pipeline architectures.
video: freeCodeCamp System Design Full Course
Portfolio Projects & Interview Prep
Tangible end-to-end projects prove job-ready skills to employers.
tutorial: DataTalks.Club Data Engineering Zoomcamp (free)

Generated & verified by RM Full Stack & AI Engineer