Data engineers tend to do quite a bit of learning on the job. That’s to be expected. As a company’s data evolves, so does the way the company stores, processes, and analyzes it. This means the data engineer’s work is always changing, and learning new skills and methodologies quickly is part of the job description.
Data engineering books are one of the best ways to learn on the job. Cookbooks, reference guides, and the like contain practical information as well as deep dives into data engineering theory. Studying these references will give you practical skills that help you stay ahead of the curve.
These are the best data engineering books to keep on your desk. We've covered a range of topics, including AWS, data cleaning, and Python.
For sample data engineer interview questions, see our Data Engineer Interview Questions guide.
1. Data Science on AWS
by Chris Fregly and Antje Barth
This is really an end-to-end book, but data engineers will find here a solid introduction to building cloud pipelines in AWS. In particular, the focus is on pipelines for AI and machine learning applications, including natural language processing, fraud detection and computer vision tools.
Throughout, the authors sprinkle in insights to help reduce costs and improve pipeline performance. Ultimately, the guide ties all the concepts together into a blueprint for a scalable, replicable machine learning pipeline, making it an essential guide for anyone scaling AWS AI pipelines. Key concepts include:
- How the Amazon AI and ML stacks apply to real-world cases like fraud detection
- Practical step-by-step use cases
- Amazon AWS pipelines
- Scaling operations pipelines in AWS
- Data ingestion techniques
Who Will Find It Most Useful: If you’re interested in ML and AI data engineering projects (especially those based on AWS), pick up a copy of this book.
2. Data Engineering with Python: Work with massive datasets to design data models and automate data pipelines using Python
by Paul Crickard
Python is one of the most tested skills in data engineering interviews, so if you’re looking for an introduction, this is the resource. Crickard’s guide provides an engaging and practical primer on Python’s use in data engineering - covering the basics and going on to advanced concepts that are necessary for building and scaling pipelines.
All areas of data engineering are covered, including data cleaning, data processing, and working on production databases. This guide was published in 2020, and contains plenty of up-to-date and highly relevant information. Key concepts include:
- ETL pipelines
- Data processing and data cleaning
- Building robust pipelines in Python
- Fundamentals and basic Python data engineering concepts
Who Will Find It Most Useful: Any data engineer who wants to level up their Python coding skills and knowledge of ETL pipeline tools.
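To make the ETL idea concrete, here is a minimal sketch of an extract-transform-load step in plain Python. It is not taken from Crickard's book. The sample CSV, the `orders` table, and every function name below are assumptions made up for illustration, and a production pipeline would add logging, retries, and schema validation.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: untidy whitespace, mixed case, one missing amount.
RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,5.00
3,carol,
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize customer names and drop rows without an amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records instead of guessing a value
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().lower(), float(amount))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(text, conn):
    """Run extract -> transform -> load and return the number of rows loaded."""
    rows = transform(extract(text))
    load(rows, conn)
    return len(rows)
```

Calling `run_pipeline(RAW_CSV, sqlite3.connect(":memory:"))` runs all three stages against an in-memory database. Keeping each stage as its own function makes it easier to test and swap out individually.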
3. The Data Warehouse Toolkit
by Ralph Kimball and Margy Ross
Since its first edition, which introduced many to the concept of dimensional modeling, this resource by Ralph Kimball has only gotten better. The third edition is the go-to resource for designing fast dimensional databases built for efficient querying.
Although the work does contain 12 case studies, and focuses a lot on the business side of things, there’s plenty of knowledge that will help a data engineer grow and learn, from the fundamentals of pipeline design, all the way through complex considerations. Key concepts include:
- In-depth review of ETL systems and design
- 34 ETL subsystems and techniques
- Case studies from industry, including healthcare, education, finance and e-commerce (with sample data)
- Design considerations for dimension and fact tables
- Tips for collaborating on design with stakeholders
Who Will Find It Most Useful: A must-have resource for anyone that wants to dive deep into dimensional modeling and adjacent data warehousing techniques.
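For readers new to dimensional modeling, the following is a small sketch of a star schema: one fact table joined to dimension tables. It is an illustration only, not an example from the book. The table names (`dim_product`, `dim_date`, `fact_sales`), columns, and sample rows are all assumptions, and SQLite stands in for a real warehouse.

```python
import sqlite3


def build_star_schema(conn):
    """Create two dimension tables and one fact table, then load sample rows."""
    conn.executescript(
        """
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,
            name TEXT,
            category TEXT
        );
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY,
            iso_date TEXT,
            month TEXT
        );
        -- The fact table stores measures plus foreign keys to each dimension.
        CREATE TABLE fact_sales (
            date_key INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            quantity INTEGER,
            revenue REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "notebook", "stationery"), (2, "pen", "stationery"), (3, "lamp", "home")],
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(20240101, "2024-01-01", "2024-01"), (20240102, "2024-01-02", "2024-01")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(20240101, 1, 2, 10.0), (20240101, 3, 1, 25.0), (20240102, 2, 5, 7.5)],
    )


def revenue_by_category(conn):
    """Aggregate a fact-table measure by a dimension attribute."""
    return conn.execute(
        """
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
        """
    ).fetchall()
```

The query in `revenue_by_category` shows the typical access pattern: filter or group by descriptive attributes in a dimension, and sum numeric measures from the fact table.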
4. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann
The strengths of this book are its depth and real-world practicality. It’s one of the most comprehensive data engineering resources (which is a big reason it has more than 1,600 5-star reviews). Kleppmann’s book is organized around three fundamentals (reliability, scalability, and maintainability) and is designed to help you understand how data architecture relates to each of them.
Really, it’s the all-in-one design guide, which is extremely helpful for interviews. In particular, you’ll gain the vocabulary to talk about the pros and cons of a particular solution, and ultimately, improve your ability to assess which technology is best for a specific business problem. Key concepts include:
- Foundational data engineering concepts (processing, encoding, structures, models, etc.)
- Clear explanations of data engineering theory
- Practical design tips and considerations
- Real-world case studies that “go under the hood”
Who Will Find This Most Useful: This is the best book on theory you will find, and it’s great for beginners through mid-career data engineers.
5. Snowflake Cookbook
by Hamid Mahmood Qureshi and Hammad Sharif
Snowflake has established itself as a powerful cloud-based data warehousing solution, and it has quickly gained a following in the business community. The platform has its nuances, and this user-friendly cookbook is the best resource to learn the ins and outs of Snowflake.
The book provides a deep-dive on the basics, and will help beginners quickly develop a baseline of knowledge. Some of the topics Snowflake Cookbook covers include:
- Data processing techniques (including SQL queries and statements)
- Scaling warehouses in Snowflake
- Cloud-based data warehousing
- Building pipelines with Snowflake
Who Will Find It Most Useful: This is the best primer on Snowflake. The authors provide an easy-to-grasp look at the platform, from the fundamentals through advanced concepts.
6. Data Pipelines Pocket Reference
by James Densmore
Don’t judge a book by (the size of) its cover. That’s especially true about the Data Pipelines Pocket Reference, one of the most helpful primers on data engineering. Sure, it might fit in your pocket, but it’s full of helpful and practical definitions and tips. This is a must-own for any early-career data engineer.
In particular, the book, which was published in 2021, covers key data engineering concepts, from the simple (“how a data pipeline works”) to advanced pipeline maintenance considerations. Key concepts include:
- Explanations of basic pipeline concepts
- Helpful visualizations for how data pipelines work
- A breakdown of common data engineering tools
- An explanation of how pipelines are used in analytics and reporting
Who Will Find This Most Useful: This is like a “phrase book” you carry with you on vacation. It’s a go-to resource for those new to the field.
7. 97 Things Every Data Engineer Should Know
by Tobias Macey
A timely and relevant book (published in June 2021), 97 Things features essays and interviews with data engineers at top companies including Google, LinkedIn, Twitter and Microsoft. The book is full of practical tips and guidance, and will get you up to speed on the latest best practices in data engineering.
Beyond the technical side of things, it also contains a lot of helpful information about launching your data engineer career. Key concepts include:
- Data engineer career advice
- Best practices used by top companies
- The latest metadata techniques
- Tips for cleaning, storing and processing data
Who Will Find This Most Useful: A great guide if you're looking for data engineer career advice, or want to familiarize yourself with the latest issues in data engineering.
8. Machine Learning Engineering
by Andriy Burkov
Since its publication in 2020, this book has become the go-to resource for machine learning engineering. In fact, if you’re looking into ML engineer roles or just want to level up your machine learning skills, Burkov gives you all the fundamentals.
You’ll find insights and in-depth coverage of ML fundamentals that move beyond just an under-the-hood look at algorithms. There’s a step-by-step process for engineering machine learning apps here. Key concepts include:
- Processing data at scale
- ML engineering prototyping
- Product management and design advice and tips
- Reliability engineering how-to
Who Will Find It Most Useful: The best book for data scientists or engineers who are interested in ML engineer roles.
9. Python Data Cleaning Cookbook: Modern techniques and Python tools to detect and remove dirty data and extract key insights
by Michael Walker
Data engineers know that bad data equals bad results. You can’t expect project success if you aren’t feeding your models clean, reliable data. But you have to know how to clean data efficiently - and that's something this Python book will help you do.
This book is chock full of insights and modern techniques for data cleaning. Learn to use Python to handle missing values, monitor data for anomalies, manage outliers, and much more. Key concepts include:
- Data wrangling with Python
- Modern data cleaning techniques
- Engineering/pipeline concepts for Python
- EDA techniques and tips
Who Will Find It Most Useful: Hands down, the best resource for learning about data cleaning in Python.
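As a taste of the kind of cleaning involved, here is a short sketch that fills missing values and flags outliers using only Python's standard library. It is not drawn from Walker's book, which works with dedicated data libraries. The function names, the median fill, and the 1.5 × IQR rule are assumptions chosen for illustration.

```python
import statistics


def fill_missing(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    return [median if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it lies outside the IQR fences.

    The fences are q1 - k*IQR and q3 + k*IQR, a common rule of thumb.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]
```

A typical use is to call `fill_missing` on a column of readings first, then pass the result to `flag_outliers` and review the flagged rows by hand rather than deleting them automatically.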