Whether you’re just entering the data engineer job market or your interview is tomorrow, practice is an essential part of the data engineer interview preparation process.
Data engineering interview questions assess your data engineering skills and domain expertise. These questions are based on a company’s tech stack and technology goals, and they test your ability to perform job functions.
After analyzing 1000+ data engineering interview questions, here are the most common skills tested across all companies and roles.
To help, we’ve counted down the top 100 data engineering interview questions. These questions are from real-life interview experiences, and they cover essential skills for data engineers, including:
Summary. These data engineer practice questions cover all of these subjects. Before your interview, familiarize yourself with all of them, and practice questions from each category.
Behavioral questions assess soft skills (e.g. communication, leadership, adaptability), your skill level, and how you’d fit on the company’s data engineering team.
Behavioral questions are common early in the data engineering interview process (e.g. the recruiter call) and include questions about your past experience.
Examples of behavioral interview questions for a data engineer role would be:
Questions like this assess a lot of different soft skills, including your ability to communicate and how you respond to adversity. Your answer should clearly convey:
Your response might demonstrate your experience level, that you take the initiative, and your problem-solving approach. This is your chance to show the unique skills and creative solutions you bring to the table.
Don’t have this type of experience? You can relate your experiences to coursework or projects. Or you can talk hypothetically about your knowledge of data governance and how you would apply that in the role.
Candidates should have an understanding of how data engineering plays into product development. Interviewers want to know how well you’ll fit in with the team, your organizational ability in product development, or how you might simplify an existing workflow.
One Tip: Get to know the company’s products and business model before the interview. Knowing this will help you relate your most relevant skills and experiences. Plus, it shows you did your homework and care about the position.
4. Describe a time you had to explain a complex subject to a non-technical person.
5. Why are you interested in working at our company?
6. How would you describe your communication style?
7. Describe a time that you failed. How did you respond?
8. Provide an example of a goal you did not meet and how you handled it.
9. How did you handle meeting a tight deadline?
10. Tell me about a time you used data to influence a decision or solve a problem.
11. Talk about a time when you had to persuade someone?
12. What data engineering projects have you worked on? Which was most rewarding?
13. Why are you interested in this role? Working at this company? A career in data engineering?
Easy technical questions are used to weed out candidates without the right experience. This type of question assesses your experience level, comfort with specific tools, and the depth of your domain expertise. Basic technical questions include:
Data cleaning and data processing are key job responsibilities in engineering roles. Inevitably, unexpected issues will come up. Interviewers ask questions like these to determine A) how well you adapt, B) the depth of your experience, and C) your technical problem-solving ability.
Clearly explain the issue, what you proposed, the steps you took to solve the problem, and the outcome.
There are many variations to this type of question. A different version would be about a specific ETL tool, “Are you experienced with Apache Spark or Amazon Redshift?” Basically, if a tool is in the job description, it might come up in a question like this.
One tip: Be sure to mention any trainings, how long you’ve used the tech, and specific tasks you can perform.
This question is basically asking: What do you do when there are gaps in your technical expertise? In your response, you might include:
17. How would you design a data warehouse given X criteria?
18. How would you design a data pipeline?
19. What questions do you ask before designing data pipelines?
20. How do you gather stakeholder input before beginning a data engineering project?
21. What is your experience with X skill on Python?
22. What experience do you have with cloud technologies?
23. What are some challenges unique to cloud computing?
24. What’s the difference between structured and unstructured data?
25. What are the key features of Hadoop?
SQL questions for data engineers cover fundamental concepts, like joins, subqueries, case statements and filters. In addition, you may be asked to write SQL code to pull metrics or questions that determine how you handle errors and NULL values. Common SQL questions include:
Hint: Given two date ranges, what determines if the subscriptions overlap? If one range is neither completely after the other, nor completely before the other, then the two ranges must overlap.
To answer this SQL question, you can think of De Morgan’s law, which says that:
Not (A Or B) <=> Not A And Not B.
What is that equivalent to? And how could we model that out for a SQL query?
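As a rough illustration of the hint above, here is a minimal Python sketch of the overlap test. The function name `ranges_overlap` and the sample subscriptions are hypothetical, and the sketch assumes both end dates are inclusive; if the interviewer defines end dates as exclusive, the comparisons would become strict.

```python
from datetime import date


def ranges_overlap(start_a, end_a, start_b, end_b):
    """Return True when two inclusive date ranges share at least one day."""
    # "No overlap" means: A ends before B starts OR A starts after B ends.
    # By De Morgan's law, negating that OR gives an AND of the negated parts.
    return end_a >= start_b and start_a <= end_b


# Hypothetical subscriptions, used only for illustration.
january = (date(2019, 1, 1), date(2019, 1, 31))
mid_winter = (date(2019, 1, 15), date(2019, 2, 15))
march = (date(2019, 3, 1), date(2019, 3, 31))

print(ranges_overlap(*january, *mid_winter))  # True
print(ranges_overlap(*january, *march))       # False
```

In SQL, the same two comparisons would typically appear in the join or WHERE condition of a self-join on the subscriptions table.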
You might want to clarify a question like this and ask some follow-up questions of your own. Specifically, you might be interested in A) what kind of data is being processed, and B) what types of values are most likely to be duplicated.
With some clarity, you’ll be able to suggest more relevant strategies. For example, you might propose using the DISTINCT keyword or a UNIQUE constraint to reduce duplicate data. Or you could walk the interviewer through how a GROUP BY clause could be used.
This question is an example of a multi-part logic-based SQL question that data engineers face. With this SQL question (see a solution here), you need:
Calculate the total number of employees that are making $100,000+ by department. This means that we’re going to have to run a GROUP BY on the department name since we want a new row for each department.
Formula to differentiate employees making $100,000+ vs. those that make less. What does that formula look like?
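One way to sketch these two steps is a conditional sum inside a GROUP BY, run here against an in-memory SQLite database from Python. The `employees` table, its `department` and `salary` columns, and the sample rows are assumptions made for illustration; the real interview schema may differ.

```python
import sqlite3

# In-memory database with a hypothetical, simplified employees table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, department TEXT, salary INTEGER);
    INSERT INTO employees (department, salary) VALUES
        ('engineering', 120000), ('engineering', 95000),
        ('marketing', 101000), ('marketing', 60000), ('marketing', 55000);
""")

# The CASE expression is the "formula": 1 for salaries of $100,000+, else 0.
query = """
    SELECT department,
           SUM(CASE WHEN salary >= 100000 THEN 1 ELSE 0 END) AS high_earners,
           COUNT(*) AS total_employees
    FROM employees
    GROUP BY department
    ORDER BY department
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('engineering', 1, 2), ('marketing', 1, 3)]
```

Dividing `high_earners` by `total_employees` would give a per-department share, if the question asks for one.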
Whenever the question asks about finding “0 values,” e.g. users or neighborhoods, start thinking LEFT JOIN! An inner join returns only the rows that have a match in both tables; a LEFT JOIN keeps every row from the left table and fills the right table’s columns with NULL where there is no match.
With this question, our predicament is to find all the neighborhoods without users. To do this we must do a left join from the neighborhoods table to the users table. Here’s an example solution:
SELECT n.name FROM neighborhoods AS n LEFT JOIN users AS u ON n.id = u.neighborhood_id WHERE u.id IS NULL
This question has been asked in Facebook data engineer interviews.
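To see the LEFT JOIN solution behave as described, the query can be run against a small in-memory SQLite database. The table contents below are made up for illustration; only the table and column names come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE neighborhoods (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, neighborhood_id INTEGER);
    -- Hypothetical sample data: nobody lives in Riverside.
    INSERT INTO neighborhoods (id, name) VALUES (1, 'Downtown'), (2, 'Riverside'), (3, 'Hilltop');
    INSERT INTO users (id, neighborhood_id) VALUES (10, 1), (11, 1), (12, 3);
""")

# Unmatched neighborhoods come back with NULL user columns, so filter on those.
empty_neighborhoods = conn.execute("""
    SELECT n.name
    FROM neighborhoods AS n
    LEFT JOIN users AS u ON n.id = u.neighborhood_id
    WHERE u.id IS NULL
""").fetchall()
print(empty_neighborhoods)  # [('Riverside',)]
```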
30. Write a query to get the current salary data for each employee.
31. What is the difference between DELETE and TRUNCATE?
32. Write a query to find the nominee who has won the most awards.
33. What are aggregate functions in SQL?
34. What SQL commands can be used in ETL?
35. How do you change a column name by writing a query in SQL?
36. How would you design the database for a recommendation engine?
37. What’s the difference between WHERE and HAVING?
38. What is an index in SQL? When would you use an index?
Be prepared for a wide range of data engineer Python questions. Expect questions about 1) data structures and data manipulation (e.g. Python lists, data types, data munging with pandas), 2) explanations (e.g. tell us about search/merge), 3) Python coding tests. Sample Python questions include:
Bigrams are two words that are placed next to each other, and they’re relevant in feature engineering for NLP models. With this question we’re looking for an output like this:
def find_bigrams(sentence) -> [('have', 'free'), ('free', 'hours'), ('hours', 'and'), ('other', 'activities.')]
Solution overview: To actually parse them out of a string, you must split the input string. You can do this with the Python string method .split() to create a list with each individual word as an element. Then, create another empty list that will eventually be filled with tuples.
This question has been asked in Google data engineer interviews.
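The solution overview for the bigram question could be sketched as follows. This is one possible implementation rather than a reference answer; it splits on whitespace only, so punctuation stays attached to words, and the sample sentence is hypothetical.

```python
def find_bigrams(sentence):
    """Return a list of (word, next_word) tuples for a sentence."""
    words = sentence.split()  # split on whitespace into individual words
    bigrams = []              # will be filled with tuples
    for i in range(len(words) - 1):
        bigrams.append((words[i], words[i + 1]))
    return bigrams


print(find_bigrams("have free hours and love children"))
# [('have', 'free'), ('free', 'hours'), ('hours', 'and'), ('and', 'love'), ('love', 'children')]
```

A more compact variant would be `list(zip(words, words[1:]))`, which pairs each word with its successor.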
A cache database is a fast storage solution for short-lived structured or unstructured data. Generally, this database is much smaller in size than a production database and can be stored in memory.
Caching is helpful for faster data retrieval, because the data can be accessed from the temporary location. There are many ways to implement caching in Python. You can create local data structures to build the cache, or host a cache as a server, for example.
The built-in data types in Python include lists, tuples, dictionaries and sets. These data types are already defined and supported by Python, and they act as containers for grouping data by type. User-defined data types share commonalities with the primitive types, and they are based on these concepts. But ultimately, they allow users to create their own data structures, including queues, trees and linked lists.
This question asks you to aggregate lists in Python, and your goal is an output like this:
def weekly_aggregation(ts) -> [ ['2019-01-01', '2019-01-02'], ['2019-01-08'], ['2019-02-01', '2019-02-02'], ['2019-02-05'], ]
HINT: This question sounds like it should be a SQL question, doesn’t it? Weekly aggregation implies a form of GROUP BY in a regular SQL or pandas question. But since it’s a scripting question, it’s trying to pry out if the candidate deals with unstructured data. Data scientists deal with a lot of unstructured data.
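A minimal sketch of the weekly grouping in plain Python is shown below. It assumes the input is a list of ISO-formatted date strings and that weeks are counted as 7-day windows starting from the earliest date; both are assumptions, though they reproduce the sample output above.

```python
from datetime import datetime


def weekly_aggregation(ts):
    """Group ISO date strings into lists, one list per 7-day window."""
    if not ts:
        return []
    dates = sorted(ts)  # ISO date strings sort chronologically
    first = datetime.strptime(dates[0], "%Y-%m-%d")
    weeks = {}
    for day in dates:
        # Number of whole weeks elapsed since the first date.
        index = (datetime.strptime(day, "%Y-%m-%d") - first).days // 7
        weeks.setdefault(index, []).append(day)
    return [weeks[i] for i in sorted(weeks)]


ts = ["2019-01-01", "2019-01-02", "2019-01-08",
      "2019-02-01", "2019-02-02", "2019-02-05"]
print(weekly_aggregation(ts))
# [['2019-01-01', '2019-01-02'], ['2019-01-08'], ['2019-02-01', '2019-02-02'], ['2019-02-05']]
```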
Given that we have to return the first character that repeats, we should be able to go through the string in one loop, save each unique character, and then just check whether the current character already exists in that saved set. If it does, return the character. Here’s a sample output for this question:
input = "interviewquery" output = "i" input = "interv" output = "None"
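The single-loop approach could look like the sketch below. The function name is hypothetical, and the sketch returns Python’s `None` object when nothing repeats rather than the string "None".

```python
def first_recurring_character(text):
    """Return the first character that appears a second time, or None."""
    seen = set()
    for char in text:
        if char in seen:  # set membership is O(1) on average
            return char
        seen.add(char)
    return None


print(first_recurring_character("interviewquery"))  # i
print(first_recurring_character("interv"))          # None
```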
44. What is the difference between “is” and “==”?
45. What is a decorator?
47. Are lookups faster with dictionaries or lists in Python?
48. How familiar are you with TensorFlow? Keras? OpenCV? SciPy?
49. What is the difference between a list and a tuple?
50. What is data smoothing? How do you do it?
51. What is NumPy used for? What are its benefits?
52. Given a list of integers, find all combinations that sum to the value N.
Data modeling and database design questions assess your knowledge of entity-relationship modeling, normalization and denormalization tradeoffs, dimensional modeling and related concepts. Common questions include:
Let’s first approach this problem by understanding the scope of the dating app and what functionality we must design around.
Start by listing out 1) key app functions for users (e.g. onboarding, matching, messaging), and 2) specific feature goals to account for (e.g. hard or soft user preferences or how the matching algorithm works).
With this information, we can create an initial design for the database.
In this two-part table schema question, we’re tracking not just enter/exit times, but also car make, model and license plate info.
The car model to license plate information will be one-to-many, given that each license plate represents a single car, and a car model can be replicated many times. Here’s an example for crossings (left) and model/license plate (right):
This question is more architecture-based and assesses your experience in developing databases, setting up architectures, and in this case, representing client-side tracking in the form of clicks.
A simple but effective design schema would be to represent each action with a specific label. In this case, assigning each click event a name or label describing its specific action.
Many database design questions for data engineers are vague, and require a follow up. With a question like this, you might want to ask:
Don’t rush into answers to questions. A helpful tip for all Python and technical questions is to ask for more information. This shows you’re thoughtful and look at problems from every angle.
58. How do joins affect the size of a dataset? Provide an example.
59. How would you architect a distributed solution to convert a billion JPEG images to PNG?
60. What are the features of a physical data model?
61. What database relationships do you know?
62. How would you handle data loss during a migration?
63. What are 3 types of data models?
64. What is normalization? Denormalization?
65. What critical relationship types are found in data models?
66. What are some things to avoid when building a data model?
67. Why are NoSQL databases more useful than relational databases?
Data engineering case studies, or “data modeling case studies,” are scenario-based data engineering problems. Many questions focus on designing architecture, and then you walk the interviewer through developing a solution.
Many case study questions for data engineers are similar to database design questions. With a question like this, start with clarifying questions. You might want to know goals for the notification system, user information, and the types of notifications that are being sent.
Then, you’ll want to make assumptions. A basic solution might start with the notifications:
With this debugging data question, you should start with some clarification, e.g. how far back does the bug extend? What’s the table schema like? One potential solution would be to look at other dimensions and columns that might be able to answer if someone is actually married (like marriage data or spouse’s name).
This question has been asked in Amazon data engineer interviews.
When answering this question, you might want to start by asking about the goals and uses of the database. You want to design a database around how the data will be used.
71. How would you build a data pipeline for data that originates at POS systems?
72. Design an ETL to process a billion events every day and generate a daily report.
73. What database optimizations might you consider for a Tinder-style app?
74. How would you design a relational database of customer data?
75. How do you go about debugging an ETL error?
76. What’s your approach to design methodologies and design patterns?
77. What architectural patterns do you have the most experience with?
Data engineers and data scientists work hand in hand. Data engineers are responsible for developing ETL processes, analytical tools, and storage tools and software. Thus, expertise with existing ETL and BI solutions is a much-needed requirement.
ETL refers to how the data is taken (extraction) from a data source, converted (transformation) into a format that can be easily analyzed, and stored (loading) into a data warehouse. The ETL process then loads the converted data into a database or BI platform in order to be used and viewed by anyone in the organization.
lifetime_plays that records the song count by date for each user.
For this problem, we use the INSERT INTO keywords to add rows to the lifetime_plays table. If we set this query to run daily, it becomes a daily extract, transform, and load (ETL) process.
The rows we add are selected from the subquery that selects the song_id, and count columns from the song_plays table for the current date.
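The daily INSERT INTO ... SELECT step might be sketched like this, using an in-memory SQLite database. The column names (`user_id`, `song_id`, `date_played`, `play_count`), the sample rows, and the `load_daily_plays` helper are all assumptions for illustration; the count column is named `play_count` here so that it does not read like the COUNT function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE song_plays (user_id INTEGER, song_id INTEGER, date_played TEXT);
    CREATE TABLE lifetime_plays (
        date_played TEXT, user_id INTEGER, song_id INTEGER, play_count INTEGER
    );
    -- Hypothetical raw events: three plays on 2024-01-01, one the day before.
    INSERT INTO song_plays VALUES
        (1, 100, '2024-01-01'), (1, 100, '2024-01-01'),
        (1, 200, '2024-01-01'), (2, 100, '2023-12-31');
""")


def load_daily_plays(conn, run_date):
    """Aggregate one day's plays and append them to lifetime_plays."""
    conn.execute("""
        INSERT INTO lifetime_plays (date_played, user_id, song_id, play_count)
        SELECT date_played, user_id, song_id, COUNT(*)
        FROM song_plays
        WHERE date_played = ?
        GROUP BY date_played, user_id, song_id
    """, (run_date,))
    conn.commit()


load_daily_plays(conn, "2024-01-01")
rows = conn.execute(
    "SELECT user_id, song_id, play_count FROM lifetime_plays ORDER BY song_id"
).fetchall()
print(rows)  # [(1, 100, 2), (1, 200, 1)]
```

A scheduler would call `load_daily_plays` once per day with the current date, which is what turns the query into a daily ETL job.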
This question gets asked in data engineer interviews to assess your knowledge of ETL processes and tools. Explain the tools you’ve used most frequently, any trainings you have had, and common tasks you’ve done using these tools.
One tip: Talk about the trade-offs of one tool over another.
With a question like this, you would be provided with a table representing company payroll schema.
HINT: The first step we need to do would be to remove duplicates and retain the current salary for each user. Given we know there aren’t any duplicate first and last name combinations, we can remove duplicates from the employees table by running a GROUP BY on two fields, the first and last name. This allows us to then get a unique combinational value between the two fields.
There are many problems that can arise from concurrent transactions. One is lost updates. Lost updates occur when a committed value written by one transaction is overwritten by a subsequent committed write from a concurrent transaction. Another is write skew, which happens when an update is made within a transaction based upon stale data.
82. What’s the difference between ETL and ELT?
83. What is an initial load in ETL? What about full load?
84. What are the three layers of ETL?
85. What are database snapshots? What’s the importance of them?
86. What are views in ETL? What are they built with?
87. What are partitions? Why might you increase the number of partitions?
88. What is a session in ETL?
89. Describe how an operational data store is used.
Data engineers focus mostly on data modeling and data architecture, but a basic knowledge of algorithms and data structure is also needed. Of particular importance is the data engineer’s ability to develop inexpensive methods for the transfer of large amounts of data. If you’re responsible for a database with potentially millions (let alone billions) of records, it’s important to find the most efficient solution. Common algorithm questions include:
nums = [0, 1, 2, 4, 5] missingNumber(nums) -> 3
There are two ways we can solve this problem: through logical iteration or mathematical formulation.
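Both approaches could be sketched as follows. The function names are hypothetical, and the sketch assumes the list holds the integers 0 through n with exactly one value missing.

```python
def missing_number_by_iteration(nums):
    """Check each candidate from 0 to n against a set of the values present."""
    present = set(nums)
    for candidate in range(len(nums) + 1):
        if candidate not in present:
            return candidate
    return None  # unreachable when exactly one value is missing


def missing_number_by_formula(nums):
    """Subtract the actual sum from the expected sum 0 + 1 + ... + n."""
    n = len(nums)  # with one value missing, the largest value is n
    return n * (n + 1) // 2 - sum(nums)


nums = [0, 1, 2, 4, 5]
print(missing_number_by_iteration(nums))  # 3
print(missing_number_by_formula(nums))    # 3
```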
Example S 3 4 5 6 2 4 6 1 1 9 9 9 9 E
Here’s a solution: recursively backtrack to the end while saving the maximum water level along the path on each function call. Track a set of visited cells to trim the search space. O(n^2)
The brute force solution to this question would be to try every permutation, and verify if it’s a palindrome. If we find one, then return true, otherwise return false. You can see the full solution on Interview Query.
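As an alternative to the brute-force search, and not the solution referenced above, a counting approach avoids generating permutations: a string can be rearranged into a palindrome only if at most one character occurs an odd number of times. A minimal sketch, with a hypothetical function name:

```python
from collections import Counter


def can_form_palindrome(text):
    """Return True if some permutation of text reads the same both ways."""
    # Every character must pair up, except possibly one middle character.
    odd_counts = sum(1 for count in Counter(text).values() if count % 2)
    return odd_counts <= 1


print(can_form_palindrome("carrace"))  # True  (e.g. "racecar")
print(can_form_palindrome("hello"))    # False
```

This runs in time linear in the length of the string, compared with factorial time for trying every permutation.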
A function that is O(1) means it does not grow with the size of the input data.
That means, for this problem, the function must loop through the stream, inputting 2 entries at a time and choosing between the two of them with a random method.
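One common way to do this is reservoir sampling with a reservoir of size one, sketched below. The function name and the optional `rng` parameter are hypothetical. At each step only two values are in play, the one kept so far and the new arrival, so memory use stays constant.

```python
import random


def random_from_stream(stream, rng=random):
    """Pick one item uniformly at random from a stream using O(1) space."""
    chosen = None
    for seen, value in enumerate(stream, start=1):
        # Keep the new value with probability 1/seen, otherwise keep the old one.
        if rng.randrange(seen) == 0:
            chosen = value
    return chosen


print(random_from_stream(iter(range(1000))))  # some value between 0 and 999
```

The first item is always kept at first (probability 1/1), and each later item replaces it with a shrinking probability, which works out to an equal chance for every item once the stream ends.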
94. Write a Python function to transform the input vector containing two different strings into integers 0 and 1?
95. What algorithms support missing values?
96. What is the difference between linear and non-linear data structures?
97. Give some examples of uses for linked lists and arrays.
98. Use list comprehension to print odd numbers between 0 and 100.
99. How would you implement a queue using a stack?
100. What is a dequeue?
The best way to prepare for a data engineer interview is practice. Practice as many example interview questions as possible, focusing primarily on the most important skills for the job, as well as where you have gaps in knowledge.
You’ll find a variety of interview prep resources on Interview Query, including:
Need some more help? Consider a data engineering coach for mock interview help or developing an interview prep strategy.