Interview Query

Top 100 Data Engineer Interview Questions for 2022

Explore common data engineering interview questions. See example data engineer questions for SQL, ETL, database design and more.

Overview

What are the most common data engineer interview questions? Here’s a broad answer: The majority of questions fall into two categories - behavioral and technical.

1. Behavioral questions are common early in the process (e.g. recruiter call) and include questions about your experience. An example of a behavioral question for a data engineer role would be:

  • Tell us about a data engineering project you worked on. What challenges did you face?

2. Technical interview questions assess your data engineering skills and domain expertise. These questions are based on a company’s tech stack and technology goals, and they test your ability to perform job functions. Here’s an example of a technical question:

  • Write a SQL query to perform X action.

Here’s an overview of the subjects that get asked in data engineer interviews:

[Chart: subjects asked in Data Engineer interviews: A/B testing, algorithms, analytics, machine learning, probability, product metrics, Python, SQL, statistics]

How to Prepare for a Data Engineer Interview

Whether you’re just getting on the job market, or your interview is tomorrow, practice is an essential part of data engineer interview prep.

The more you practice, the better your results will be in the interview room.

To help, we’ve counted down 100 top data engineering interview questions. These questions are from real-life interview experiences, and they cover essential skills for data engineers, including:

Summary. These data engineer practice questions cover all of these subjects. Before your interview, familiarize yourself with all of them, and practice questions from each category.

Behavioral Interview Questions for Data Engineers

Behavioral questions assess soft skills (e.g. communication, leadership, adaptability), your skill level, and how you’d fit on the company’s data engineering team. Common behavioral questions include:

1. Describe a data engineering problem you worked on. What were some challenges that you faced?

Questions like this assess a lot of different soft skills, including your ability to communicate and how you respond to adversity. Your answer should clearly convey:

  • The situation
  • Specific tactics you proposed
  • What actions you took
  • The results you achieved

2. Talk about a time you noticed a discrepancy in company data or an inefficiency in data processing. What did you do?

Your response might demonstrate your experience level, that you take the initiative, and your problem-solving approach. This is your chance to show the unique skills and creative solutions you bring to the table.

Don’t have this type of experience? You can relate your experiences to coursework or projects. Or you can talk hypothetically about your knowledge of data governance and how you would apply that in the role.

3. You’re asked to develop a new product. Where would you begin?

Candidates should have an understanding of how data engineering plays into product development. Interviewers want to know how well you’ll fit in with the team, your organizational ability in product development, or how you might simplify an existing workflow.

One Tip: Get to know the company’s products and business model before the interview. Knowing this will help you relate your most relevant skills and experiences. Plus, it shows you did your homework and care about the position.

MORE BEHAVIORAL PRACTICE QUESTIONS

4. Describe a time you had to explain a complex subject to a non-technical person.

5. Why are you interested in working at our company?

6. How would you describe your communication style?

7. Describe a time that you failed. How did you respond?

8. Provide an example of a goal you did not meet and how you handled it.

9. How did you handle meeting a tight deadline?

10. Tell me about a time you used data to influence a decision or solve a problem.

11. Talk about a time when you had to persuade someone.

12. What data engineering projects have you worked on? Which was most rewarding?

13. Why are you interested in this role? Working at this company? A career in data engineering?

Basic Technical Questions

Easy technical questions are used to weed out candidates without the right experience. This type of question assesses your experience level, comfort with specific tools, and the depth of your domain expertise. Basic technical questions include:

14. Describe a time you had difficulty merging data. How did you solve this issue?

Data cleaning and data processing are key job responsibilities in engineering roles. Inevitably, unexpected issues will come up. Interviewers ask questions like these to determine A) how well you adapt, B) the depth of your experience, and C) your technical problem-solving ability.

Clearly explain the issue, what you proposed, the steps you took to solve the problem, and the outcome.

15. What ETL tools do you have experience with? What tools do you prefer?

There are many variations of this type of question. A different version would ask about a specific tool, such as “Are you experienced with Apache Spark or Amazon Redshift?” Basically, if a tool is in the job description, it might come up in a question like this.

One tip: Be sure to mention any trainings, how long you’ve used the tech, and specific tasks you can perform.

16. Tell me about a situation where you dealt with a technology that you weren’t familiar with.

This question is basically asking: What do you do when there are gaps in your technical expertise? In your response, you might include:

  • Education and/or data engineering bootcamps
  • Self-guided learning
  • Working with specialists and collaborators

MORE BASIC TECH PRACTICE QUESTIONS

17. How would you design a data warehouse given X criteria?

18. How would you design a data pipeline?

19. What questions do you ask before designing data pipelines?

20. How do you gather stakeholder input before beginning a data engineering project?

21. What is your experience with X skill in Python?

22. What experience do you have with cloud technologies?

23. What are some challenges unique to cloud computing?

24. What’s the difference between structured and unstructured data?

25. What are the key features of Hadoop?

SQL Interview Questions for Data Engineers

SQL questions for data engineers cover fundamental concepts, like joins, subqueries, case statements and filters. In addition, you may be asked to write SQL code to pull metrics or questions that determine how you handle errors and NULL values. Common SQL questions include:

26. Write a query that returns true or false whether or not each user has a subscription date range that overlaps with any other user. (Asked by Twitch)

Hint: Given two date ranges, what determines if the subscriptions overlap? If one range is neither completely after the other, nor completely before the other, then the two ranges must overlap.

To answer this SQL question, you can think of De Morgan’s law, which says that:

Not (A Or B) <=> Not A And Not B.

What is that equivalent to? And how could we model that out for a SQL query?
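One way to model the hint is sketched below, run through Python's built-in sqlite3 module so it can be tried locally. The subscriptions table and its user_id, start_date and end_date columns are assumptions for illustration (the question's real schema is not shown here), as is the choice to treat ranges that share an endpoint as overlapping. Applying De Morgan's law, "not (A ends before B starts or A starts after B ends)" becomes "A starts on or before B ends and A ends on or after B starts".

```python
import sqlite3

# Hypothetical schema: subscriptions(user_id, start_date, end_date), ISO dates.
OVERLAP_QUERY = """
SELECT a.user_id,
       MAX(b.user_id IS NOT NULL) AS overlap
FROM subscriptions AS a
LEFT JOIN subscriptions AS b
    ON a.user_id != b.user_id
   AND a.start_date <= b.end_date   -- A does not start after B ends
   AND a.end_date >= b.start_date   -- A does not end before B starts
GROUP BY a.user_id
ORDER BY a.user_id
"""


def find_overlaps(rows):
    """Return (user_id, overlaps_with_someone) for each user in rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subscriptions "
        "(user_id INTEGER, start_date TEXT, end_date TEXT)"
    )
    conn.executemany("INSERT INTO subscriptions VALUES (?, ?, ?)", rows)
    result = [(uid, bool(flag)) for uid, flag in conn.execute(OVERLAP_QUERY)]
    conn.close()
    return result
```

The LEFT JOIN keeps users with no overlapping partner, so they come back as false instead of disappearing from the result.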

27. How do you handle duplicate data in SQL?

You might want to clarify a question like this and ask some follow-up questions of your own. Specifically, you might be interested in A) what kind of data is being processed, and B) what types of values are most likely to be duplicated.

With some clarity, you’ll be able to suggest more relevant strategies. For example, you might propose using the DISTINCT keyword or a UNIQUE constraint to reduce duplicate data. Or you could walk the interviewer through how a GROUP BY clause could be used.
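As a rough illustration of two of these strategies, the sketch below uses Python's sqlite3 module with a made-up events table (user_id, action); the table and column names are assumptions, not part of the question. The first query collapses duplicates with DISTINCT, and the second uses GROUP BY with HAVING to show which rows are duplicated and how often.

```python
import sqlite3


def _load(rows):
    # Hypothetical table used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return conn


def distinct_rows(rows):
    """Return each (user_id, action) pair once."""
    conn = _load(rows)
    result = conn.execute(
        "SELECT DISTINCT user_id, action FROM events "
        "ORDER BY user_id, action"
    ).fetchall()
    conn.close()
    return result


def duplicate_groups(rows):
    """Return (user_id, action, copies) for pairs that appear more than once."""
    conn = _load(rows)
    result = conn.execute(
        "SELECT user_id, action, COUNT(*) AS copies FROM events "
        "GROUP BY user_id, action HAVING COUNT(*) > 1 "
        "ORDER BY user_id, action"
    ).fetchall()
    conn.close()
    return result
```

The GROUP BY version is often the more useful one to talk through, because it tells you where the duplicates are before you decide how to remove them.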

28. Given a table of employees and departments, write a query to select the top 3 departments with at least 10 employees. Rank them by the percentage of employees making $100,000+.

This question is an example of a multi-part, logic-based SQL question that data engineers face. To solve it, you need to:

  • Calculate the total number of employees making $100,000+ by department. This means running a GROUP BY on the department name, since we want one row for each department.

  • Write a formula that differentiates employees making $100,000+ from those that make less. What does that formula look like?
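The two steps above can be sketched as follows, using Python's sqlite3 module. The employees(id, department_id, salary) and departments(id, name) tables are assumed for illustration, and "$100,000+" is read as a salary of at least 100,000. The CASE expression is one possible form of the "formula": it turns each employee into 1.0 or 0.0, so the average per department is the share of high earners.

```python
import sqlite3

# Hypothetical schema: employees(id, department_id, salary), departments(id, name).
TOP_DEPARTMENTS_QUERY = """
SELECT d.name,
       AVG(CASE WHEN e.salary >= 100000 THEN 1.0 ELSE 0.0 END) AS pct_high,
       COUNT(*) AS employee_count
FROM employees AS e
JOIN departments AS d
    ON e.department_id = d.id
GROUP BY d.id, d.name
HAVING COUNT(*) >= 10          -- only departments with at least 10 employees
ORDER BY pct_high DESC
LIMIT 3
"""


def top_departments(departments, employees):
    """departments: (id, name) rows; employees: (id, department_id, salary) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE departments (id INTEGER, name TEXT)")
    conn.execute(
        "CREATE TABLE employees (id INTEGER, department_id INTEGER, salary INTEGER)"
    )
    conn.executemany("INSERT INTO departments VALUES (?, ?)", departments)
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", employees)
    result = conn.execute(TOP_DEPARTMENTS_QUERY).fetchall()
    conn.close()
    return result
```

Filtering on department size in HAVING rather than WHERE matters here, because the employee count only exists after the GROUP BY has run.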

29. You’re given two tables a users table (with demographic information) and a neighborhoods table. Write a query that returns all of the neighborhoods that have 0 users.

Whenever the question asks about finding “0 values,” e.g. users or neighborhoods, start thinking LEFT JOIN! An inner join returns only the rows that match in both tables; a LEFT JOIN keeps every row from the left table and fills the right table’s columns with NULL where there is no match.

With this question, our predicament is to find all the neighborhoods without users. To do this we must do a left join from the neighborhoods table to the users table. Here’s an example solution:

SELECT n.name
FROM neighborhoods AS n
LEFT JOIN users AS u
    ON n.id = u.neighborhood_id
WHERE u.id IS NULL

MORE SQL PRACTICE QUESTIONS

30. Write a query to get the current salary data for each employee.

31. What is the difference between DELETE and TRUNCATE?

32. Write a query to find the nominee who has won the most awards.

33. What are aggregate functions in SQL?

34. What SQL commands can be used in ETL?

35. How do you change a column name by writing a query in SQL?

36. How would you design the database for a recommendation engine?

37. What’s the difference between WHERE and HAVING?

38. What is an index in SQL? When would you use an index?

Data Engineer Python Interview Questions

Be prepared for a wide range of data engineer Python questions. Expect questions about 1) data structures and data manipulation (e.g. Python lists, data types, data munging with pandas), 2) explanations (e.g. tell us about search/merge), 3) Python coding tests. Sample Python questions include:

39. Write a function find_bigrams to take a string and return a list of all bigrams.

Bigrams are two words that are placed next to each other, and they’re relevant in feature engineering for NLP models. With this question we’re looking for an output like this:

def find_bigrams(sentence) ->

 [('have', 'free'),
 ('free', 'hours'),
 ('hours', 'and'),
 ('other', 'activities.')]

Solution overview: To actually parse them out of a string, you must split the input string. You can do this with the Python string method .split() to create a list with each individual word as an element. Then, create another empty list that will eventually be filled with tuples.
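A minimal sketch of that overview is shown below. It assumes words are separated by whitespace and leaves case and punctuation untouched; whether the interviewer expects lowercasing or punctuation stripping is worth asking.

```python
def find_bigrams(sentence):
    """Return a list of (word, next_word) tuples for a whitespace-separated string."""
    words = sentence.split()
    bigrams = []
    # Pair each word with the word that follows it.
    for i in range(len(words) - 1):
        bigrams.append((words[i], words[i + 1]))
    return bigrams
```

A sentence with fewer than two words simply returns an empty list.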

40. What is a cache database? And why would you use one?

A cache database is a fast storage solution for short-lived structured or unstructured data. Generally, this database is much smaller in size than a production database and can be stored in memory.

Caching is helpful for faster data retrieval, because the data can be accessed from the temporary location. There are many ways to implement caching in Python. You can create local data structures to build the cache, or host a cache as a server, for example.
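To make the "local data structure" option concrete, here is a small sketch of an in-memory cache built on a dictionary. The SimpleCache class and the loader callable are hypothetical names for illustration; a real cache would usually also need eviction and expiry, which are left out here.

```python
class SimpleCache:
    """Dictionary-backed cache that calls a slow loader only on a miss."""

    def __init__(self, loader):
        self._loader = loader   # function that fetches a value from the slow source
        self._store = {}        # in-memory storage for values already fetched
        self.misses = 0         # how many times the slow source was hit

    def get(self, key):
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._loader(key)
        return self._store[key]
```

Repeated calls to get() with the same key return the stored value without touching the slow source again, which is the behavior the answer above describes.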

41. What are some primitive data structures in Python? What are some user-defined data structures?

Primitive data types in Python include integers, floats, strings and Booleans. The built-in data structures include lists, tuples, dictionaries and sets. These are already defined and supported by Python, and they act as containers for grouping data. User-defined data structures build on these built-in types, and they allow users to create their own structures, including queues, trees and linked lists.
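As one example of a user-defined structure, the sketch below builds a first-in, first-out queue on top of a singly linked list. The Node and LinkedQueue names are illustrative only; in practice collections.deque from the standard library covers the same need.

```python
class Node:
    """One element of a singly linked list."""

    def __init__(self, value):
        self.value = value
        self.next = None


class LinkedQueue:
    """FIFO queue: items leave in the order they were added."""

    def __init__(self):
        self.head = None   # next item to leave the queue
        self.tail = None   # most recently added item
        self.size = 0

    def enqueue(self, value):
        node = Node(value)
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node
        self.size += 1

    def dequeue(self):
        if self.head is None:
            raise IndexError("dequeue from empty queue")
        node = self.head
        self.head = node.next
        if self.head is None:
            self.tail = None
        self.size -= 1
        return node.value
```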

42. Given a list of timestamps in sequential order, return a list of lists grouped by week using the first timestamp as the starting point.

This question asks you to aggregate lists in Python, and your goal is an output like this:

def weekly_aggregation(ts) -> [
    ['2019-01-01', '2019-01-02'],
    ['2019-01-08'],
    ['2019-02-01', '2019-02-02'],
    ['2019-02-05'],
]

HINT: This question sounds like it should be a SQL question, doesn’t it? Weekly aggregation implies a form of GROUP BY in a regular SQL or pandas question. But since it’s a scripting question, it’s probing whether the candidate can work with unstructured data. Data scientists deal with a lot of unstructured data.
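One possible scripting approach is sketched below. It assumes the timestamps are 'YYYY-MM-DD' strings already in ascending order, as in the sample output, and that a "week" means a 7-day window counted from the first timestamp rather than a calendar week.

```python
from datetime import datetime


def weekly_aggregation(ts):
    """Group sorted 'YYYY-MM-DD' strings into 7-day buckets from the first one."""
    if not ts:
        return []
    start = datetime.strptime(ts[0], "%Y-%m-%d")
    groups = []
    current_week = None
    for stamp in ts:
        # Whole weeks elapsed since the first timestamp.
        week = (datetime.strptime(stamp, "%Y-%m-%d") - start).days // 7
        if week != current_week:
            groups.append([])
            current_week = week
        groups[-1].append(stamp)
    return groups
```

Weeks with no timestamps are skipped rather than returned as empty lists, which matches the sample output above.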

43. Given a string, write a function recurring_char to find its first recurring character.

Since we need to return the first character that recurs, we should be able to go through the string in one loop, saving each unique character in a set and checking whether the current character already exists in that set. If it does, return the character. Here’s a sample output for this question:

input = "interviewquery"
output = "i"

input = "interv"
output = "None"
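The single-loop idea can be sketched like this. The sketch returns Python's None object when nothing recurs; if the interviewer wants the string "None" as in the sample, that is a one-line change.

```python
def recurring_char(text):
    """Return the first character that appears a second time, or None."""
    seen = set()
    for char in text:
        if char in seen:
            return char
        seen.add(char)
    return None
```

Set membership checks are constant time on average, so the whole function runs in a single pass over the string.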

MORE PYTHON PRACTICE PROBLEMS

44. What is the difference between “is” and “==”?

45. What is a decorator?

46. What is a decorator?

47. Are lookups faster with dictionaries or lists in Python?

48. How familiar are you with TensorFlow? Keras? OpenCV? SciPy?

49. What is the difference between a list and a tuple?

50. What is data smoothing? How do you do it?

51. What is NumPy used for? What are its benefits?

52. Given a list of integers, find all combinations that sum to the value N.

Database Design and Data Modeling Questions

Data modeling and database design questions assess your knowledge of entity-relationship modeling, normalization and denormalization tradeoffs, dimensional modeling and related concepts. Common questions include:

53. Design a database to represent a Tinder-style dating app. What does the schema look like?

Let’s first approach this problem by understanding the scope of the dating app and what functionality we must design around.

Start by listing out 1) key app functions for users (e.g. onboarding, matching, messaging), and 2) specific feature goals to account for (e.g. hard or soft user preferences or how the matching algorithm works).

With this information, we can create an initial design for the database.

54. Create a table schema for the Golden Gate Bridge to track how long each car took to enter and exit the bridge.

55. Write a query on the given tables to get the car model with the fastest average times for the current day.

In this two-part table schema question, we’re tracking not just enter/exit times, but also car make, model and license plate info.

The car model to license plate information will be one-to-many, given that each license plate represents a single car, and a car model can be replicated many times. Here’s an example schema for the crossings table, followed by the car model table:

Crossings:

Column          Type
id              INTEGER
license_plate   VARCHAR
enter_time      DATETIME
exit_time       DATETIME
car_model_id    INTEGER

Car models:

Column          Type
id              INTEGER
model_name      VARCHAR
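For the second part of the question, a query along these lines might work. It is sketched with Python's sqlite3 module; the table names crossings and car_models are assumptions, the day is passed in as a parameter so the example is repeatable (a production query might use the database's current-date function instead), and crossing time is measured in seconds using SQLite's julianday().

```python
import sqlite3

# Hypothetical table names; columns follow the example schema above.
FASTEST_MODEL_QUERY = """
SELECT m.model_name,
       AVG((julianday(c.exit_time) - julianday(c.enter_time)) * 86400.0)
           AS avg_seconds
FROM crossings AS c
JOIN car_models AS m
    ON c.car_model_id = m.id
WHERE DATE(c.enter_time) = ?
GROUP BY m.id, m.model_name
ORDER BY avg_seconds ASC
LIMIT 1
"""


def fastest_model(models, crossings, day):
    """Return the model name with the lowest average crossing time on `day`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE car_models (id INTEGER, model_name TEXT)")
    conn.execute(
        "CREATE TABLE crossings (id INTEGER, license_plate TEXT, "
        "enter_time TEXT, exit_time TEXT, car_model_id INTEGER)"
    )
    conn.executemany("INSERT INTO car_models VALUES (?, ?)", models)
    conn.executemany("INSERT INTO crossings VALUES (?, ?, ?, ?, ?)", crossings)
    row = conn.execute(FASTEST_MODEL_QUERY, (day,)).fetchone()
    conn.close()
    return row[0] if row else None
```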

56. How would you create a schema to represent client click data on the web?

This question is more architecture-based and assesses experience with developing databases, setting up architectures, and in this case, representing client-side tracking in the form of clicks.

A simple but effective design schema would be to represent each action with a specific label, in this case by assigning each click event a name or label that describes its specific action.

57. You have a table with a billion rows. How would you add a column inserting data without affecting user experience?

Many database design questions for data engineers are vague and require a follow-up. With a question like this, you might want to ask:

  • What’s the potential impact of downtime?

Don’t rush into answers to questions. A helpful tip for all Python and technical questions is to ask for more information. This shows you’re thoughtful and look at problems from every angle.

MORE DATABASE DESIGN PRACTICE QUESTIONS

58. How do joins affect the size of a dataset? Provide an example.

59. How would you architect a distributed solution to convert a billion JPEG images to PNG?

60. What are the features of a physical data model?

61. What database relationships do you know?

62. How would you handle data loss during a migration?

63. What are 3 types of data models?

64. What is normalization? Denormalization?

65. What critical relationship types are found in data models?

66. What are some things to avoid when building a data model?

67. Why are NoSQL databases more useful than relational databases?

Data Engineering Case Study

Data engineering case studies, or “data modeling case studies,” are scenario-based data engineering problems. Many questions focus on designing architecture, and then you walk the interviewer through developing a solution.

68. You’re tasked with building a notification system for a Reddit-style app. What would the backend and data model look like?

Many case study questions for data engineers are similar to database design questions. With a question like this, start with clarifying questions. You might want to know goals for the notification system, user information, and the types of notifications that are being sent.

Then, you’ll want to make assumptions. A basic solution might start with the notifications:

  • Trigger-based notifications - This might be something like an email notification for comment replies on a submitted post.
  • Scheduled notifications - This might be a targeted push notification for new content. These are notifications designed to drive engagement.

69. You are analyzing auto insurance data and find that the marriage attribute column is marked TRUE for all customers.

How would you debug what happened? What data would you look into and how would you find out who is actually married and who is not?

With this debugging data question, you should start with some clarification, e.g. how far back does the bug extend? What’s the table schema like? One potential solution would be to look at other dimensions and columns that might be able to answer if someone is actually married (like marriage data or spouse’s name).

70. Design a relational database for storing metadata about songs. This would include metadata like song title, song length, artist, album, release year, genre, etc.

How would you go about designing this database?

When answering this question, you might want to start with questions about the goals and uses of the database. You want to design the database around how the data will be used.

MORE CASE STUDY QUESTIONS

71. How would you build a data pipeline for data that originates at POS systems?

72. Design an ETL to process a billion events every day and generate a daily report.

73. What database optimizations might you consider for a Tinder-style app?

74. How would you design a relational database of customer data?

75. How do you go about debugging an ETL error?

76. What’s your approach to design methodologies and design patterns?

77. What architectural patterns do you have the most experience with?

Data Engineering ETL Questions

Data engineers and data scientists work hand in hand. Data engineers are responsible for developing ETL processes, analytical tools, and storage tools and software. Thus, expertise with existing ETL and BI solutions is a much-needed requirement.

ETL refers to how the data is taken (extraction) from a data source, converted (transformation) into a format that can be easily analyzed, and stored (loading) into a data warehouse. The ETL process then loads the converted data into a database or BI platform in order to be used and viewed by anyone in the organization.

78. Let’s say we want to create an aggregate table called lifetime_plays that records the song count by date for each user.

Write a SQL query that could make this ETL each day.

For this problem, we use the INSERT INTO keywords to add rows to the lifetime_plays table. If we set this query to run daily, it becomes a daily extract, transform, and load (ETL) process.

The rows we add are selected from the subquery that selects the created_at date, user_id, song_id, and count columns from the song_plays table for the current date.
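A sketch of what that daily job could look like is below, using Python's sqlite3 module. The song_plays columns (id, created_at, user_id, song_id) are assumed from the description, the count column is given the hypothetical name play_count, and the date is passed as a parameter so the example can be rerun for any day.

```python
import sqlite3

# Hypothetical column names, inferred from the description above.
DAILY_ETL_QUERY = """
INSERT INTO lifetime_plays (created_at, user_id, song_id, play_count)
SELECT DATE(created_at), user_id, song_id, COUNT(*)
FROM song_plays
WHERE DATE(created_at) = ?
GROUP BY DATE(created_at), user_id, song_id
"""


def run_daily_etl(plays, day):
    """Load (id, created_at, user_id, song_id) rows, run the ETL for `day`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE song_plays "
        "(id INTEGER, created_at TEXT, user_id INTEGER, song_id INTEGER)"
    )
    conn.execute(
        "CREATE TABLE lifetime_plays "
        "(created_at TEXT, user_id INTEGER, song_id INTEGER, play_count INTEGER)"
    )
    conn.executemany("INSERT INTO song_plays VALUES (?, ?, ?, ?)", plays)
    conn.execute(DAILY_ETL_QUERY, (day,))
    result = conn.execute(
        "SELECT created_at, user_id, song_id, play_count FROM lifetime_plays "
        "ORDER BY user_id, song_id"
    ).fetchall()
    conn.close()
    return result
```

A point worth raising in the interview is idempotency: as written, running the job twice for the same day would insert the same rows twice.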

79. What ETL tools are you most comfortable working with?

This question gets asked in data engineer interviews to assess your knowledge of ETL processes and tools. Explain the tools you’ve used most frequently, any trainings you have had, and common tasks you have done using these tools.

One tip: Talk about the trade-offs of one tool over another.

80. Due to an ETL error, an insert was done instead of an update to the yearly salary data for employees. How would you get the current salary of each employee?

With a question like this, you would be provided with a table representing company payroll schema.

HINT: The first step is to remove duplicates and retain the current salary for each user. Given we know there aren’t any duplicate first and last name combinations, we can remove duplicates from the employees table by running a GROUP BY on two fields, the first and last name. This gives us one row for each unique combination of the two fields.
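The hint can be sketched as follows with Python's sqlite3 module. The employees(id, first_name, last_name, salary) columns are assumed, and so is the rule for picking the current row: the sketch treats the highest id for each name as the most recent insert, which would need confirming against the real payroll schema.

```python
import sqlite3

# Assumption: a larger id means a later insert, so MAX(id) marks the current row.
CURRENT_SALARY_QUERY = """
SELECT e.first_name, e.last_name, e.salary
FROM employees AS e
JOIN (
    SELECT first_name, last_name, MAX(id) AS max_id
    FROM employees
    GROUP BY first_name, last_name
) AS latest
    ON e.id = latest.max_id
ORDER BY e.id
"""


def current_salaries(rows):
    """rows: (id, first_name, last_name, salary). Returns one row per employee."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees "
        "(id INTEGER, first_name TEXT, last_name TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(CURRENT_SALARY_QUERY).fetchall()
    conn.close()
    return result
```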

81. You have two ETL jobs that feed into a single production table each day. What problems might this cause?

There are many problems that can arise from concurrent transactions. One is lost updates. Lost updates occur when a committed value written by one transaction is overwritten by a subsequent committed write from a concurrent transaction. Another is write skew, which happens when an update is made within a transaction based upon stale data.

MORE ETL PRACTICE QUESTIONS

82. What’s the difference between ETL and ELT?

83. What is an initial load in ETL? What about full load?

84. What are the three layers of ETL?

85. What are database snapshots? What’s the importance of them?

86. What are views in ETL? What are they built with?

87. What are partitions? Why might you increase the number of partitions?

88. What is a session in ETL?

89. Describe how an operational data store is used.

Data Structures and Algorithms Questions

Data engineers focus mostly on data modeling and data architecture, but a basic knowledge of algorithms and data structures is also needed. Of particular importance is the data engineer’s ability to develop inexpensive methods for the transfer of large amounts of data. If you’re responsible for a database with potentially millions (let alone billions) of records, it’s important to find the most efficient solution. Common algorithm questions include:

90. Write a function that returns the missing number in the array. Complexity of O(N) required.

Example:

nums = [0, 1, 2, 4, 5]
missingNumber(nums) -> 3

There are two ways we can solve this problem: through logical iteration or mathematical formulation.
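Both approaches are sketched below, assuming the array holds distinct integers from 0 to N with exactly one value missing, as in the example. The function names are illustrative. The mathematical version compares the expected sum of 0..N with the actual sum; the iterative version checks each candidate against a set. Both run in O(N) time.

```python
def missing_number_sum(nums):
    """Mathematical formulation: expected sum of 0..n minus the actual sum."""
    n = len(nums)  # with one value missing, the full range is 0..n
    return n * (n + 1) // 2 - sum(nums)


def missing_number_set(nums):
    """Logical iteration: first value in 0..n that is not present."""
    seen = set(nums)
    for candidate in range(len(nums) + 1):
        if candidate not in seen:
            return candidate
    return None
```

The sum version uses O(1) extra space, while the set version uses O(N) space but is easier to extend if the assumptions change.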

91. Given a grid with a start and an end, find the maximum water height you can traverse before no path exists. You can only move in horizontal and vertical directions.

Example

S 3 4 5 6
2 4 6 1 1
9 9 9 9 E

Here’s a solution: Recursively backtrack to the end while saving the max path water level on each function call. Track a set of visited cells to trim the search space. O(n^2)

92. Given a string, determine whether any permutation of it is a palindrome.

The brute force solution to this question would be to try every permutation, and verify if it’s a palindrome. If we find one, then return true, otherwise return false. You can see the full solution on Interview Query.
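As an alternative to brute force (and not necessarily the solution referenced above), a counting approach avoids generating permutations at all. A string can be rearranged into a palindrome only if at most one character has an odd count. The sketch below treats every character, including spaces and case, as significant, which is an assumption.

```python
from collections import Counter


def has_palindrome_permutation(text):
    """True if some rearrangement of text reads the same in both directions."""
    odd_counts = sum(1 for count in Counter(text).values() if count % 2 == 1)
    # A palindrome pairs up every character, except possibly one in the middle.
    return odd_counts <= 1
```

This runs in O(N) time, compared with the factorial growth of checking every permutation.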

93. Given a stream of numbers, select a random number from the stream, with O(1) space in selection.

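An algorithm that uses O(1) space is one whose memory use does not grow with the size of the input data.

That means, for this problem, the function must loop through the stream while holding only 2 entries at a time, the current selection and the incoming entry, and choose between the two of them with a random method. For the selection to be uniform, the i-th entry should replace the current selection with probability 1/i.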
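This technique is commonly known as reservoir sampling with a reservoir of size one. A minimal sketch is below; the function name and the optional rng parameter (added so results can be made repeatable) are illustrative choices.

```python
import random


def random_from_stream(stream, rng=random):
    """Pick one item uniformly at random from an iterable, using O(1) space."""
    chosen = None
    for count, value in enumerate(stream, start=1):
        # Replace the current pick with probability 1/count.
        if rng.randrange(count) == 0:
            chosen = value
    return chosen
```

The first item is always picked initially (probability 1/1), and each later item gets its 1/i chance to replace it, so every item ends up equally likely without the stream ever being stored.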

MORE ALGORITHM PRACTICE QUESTIONS

94. Write a Python function to transform the input vector containing two different strings into integers 0 and 1?

95. What algorithms support missing values?

96. What is the difference between linear and non-linear data structures?

97. Give some examples of uses for linked lists and arrays.

98. Use list comprehension to print odd numbers between 0 and 100.

99. How would you implement a queue using a stack?

100. What is a dequeue?

More Data Engineer Interview Resources

The best way to prepare for a data engineer interview is practice. Practice as many example interview questions as possible, focusing primarily on the most important skills for the job, as well as where you have gaps in knowledge.

You’ll find a variety of interview prep resources on Interview Query, including:

Need some more help? Consider a data engineering coach for mock interview help or developing an interview prep strategy.