Interview Query

50+ Data Engineer Interview Questions

Explore common data engineering interview questions. See example data engineer questions for SQL, ETL, database design and more.

Overview

Data engineers are called the backbone of all data science and engineering. Running data analytics, building predictive models, and developing software is how all modern businesses are built and scaled, but this isn’t possible without having good data to begin with.

Data engineers are responsible for ingesting data in a way that is reliable, timely, and accurate. In more technical terms, the responsibilities of data engineers are in developing interfaces and processes for the flow and access of data across a business.

What to Expect in Data Engineer Interviews

Data engineer interviews assess your ability to generate actionable business insights from Big Data, and they typically focus on testing your knowledge of data architecture and frameworks, as well as your understanding of key operational metrics.

The majority of technical data engineering interviews have three phases: An initial screen, a technical screen and an on-site interview.

Here’s what you can expect from each stage of the data engineer interview process:

1. Initial Screen

Initial screens are generally 30 minutes long and are conducted with a recruiter or HR. These screens assess if you’re the right fit for the role, and you can expect some behavioral questions about your experience in data engineering.

Sample question: Describe a data engineering project you worked on? What were challenges you faced?

2. Technical Screen

Technical screens for data engineering roles are typically one hour, and include live coding sessions in SQL and Python. SQL questions typically cover intermediate-to-advanced concepts, like sub-queries and window functions, while the Python portion will cover your understanding of loops, strings and dictionaries. The key here is efficiency in solving coding problems, as well as your ability to work through complex problems.

Sample question: Write a SQL query to perform X function.

3. Onsite Interview

On-site interviews for data engineering positions typically include 3-5 interviews (up to an hour long). Each session has a general focus. For example, you might have an ETL round, a data modeling round, a case study round and/or a behavioral round. Data modeling interviews include SQL and Python questions based on business scenarios. ETL interviews look at your ability to use SQL and Python to write ETL code.

Sample question: Describe how you would design a database for a Reddit-style app.

What to Study for Data Engineer Interviews

Interview Query regularly analyzes the data engineering interview process, and based on more than 10,000 interviews, we’ve found that SQL questions are most common in data engineering interviews.

In fact, they’re asked in about 90% of interviews for engineering roles:

algorithmsmachine learningprobabilityproduct metricspythonsqlstatistics
Data Engineer
High confidence

Here are some of the most frequently asked question topics for data engineers:

We’ll dive into each of these areas and offer sample questions, as well as approaches to answering common interview questions.

SQL Questions for Data Engineers

SQL questions in data engineering interviews cover fundamental concepts, like joins, subqueries, case statements and filters. In addition, you’ll be asked to write SQL code to answer business questions and solve problems and will be assessed on your ability to handle errors and NULL values.

Q1. You’re given two tables a users table with demographic information and the neighborhood they live in and a neighborhoods table.

Write a query that returns all of the neighborhoods that have 0 users.

Users table:

columns type
id int
name varchar
neighborhood_id int
created_at datetime

Neighborhoods table

columns type
id int
name varchar
city_id int

Hint: Whenever the question asks about finding values with 0 something, e.g. users or neighborhoods, start thinking LEFT JOIN! An inner join finds any values that are in both tables; a LEFT JOIN keeps only the values in the left table.

In this question, our predicament is to find all the neighborhoods without users. To do this we must do a left join from the neighborhoods table to the users table which will give us an output like this:

neighborhoods.name users.id
castro 123
castro 124 c
ole valley null
castro heights 534
castro heights 564

If we then add in a where condition of WHERE users.id IS NULL, then we will get every single neighborhood without a singular user.

SELECT n.name
FROM neighborhoods AS n LEFT JOIN users AS u ON n.id = u.neighborhood_id WHERE u.id IS NULL

An alternative method is to use the IN clause. This method involves finding all the neighborhoods that the user’s live in, and then using that as a filter to subset all of the existing neighborhoods in the neighborhoods table.

SELECT name FROM neighborhoods WHERE id NOT IN ( SELECT DISTINCT neighborhood_id FROM users )

This method might be slightly more inefficient depending on the SQL interpreter used. In this secondary method, we have to find the DISTINCT number of neighborhood IDs which would result in scanning the table once before then doing a second scan on the neighborhoods table.

The first method however does a join but then only needs to do one scan through the combined tables to filter out the NULL values.

Overall if you’re using an advanced SQL interpreter, it will find the most efficient method out of the existing ones no matter what.

Q2. Write a query that returns true or false whether or not each user has a subscription date range that overlaps with any other user. (Asked by Twitch)

Hint: Given two date ranges, what determines if the subscriptions overlap? If one range is neither completely after the other, nor completely before the other, then the two ranges must overlap.

To answer this question, you can think of De Morgan’s law, which says that:

Not (A Or B) <=> Not A And Not B.

What is that equivalent to? And how could we model that out for a SQL query?

Q3. What are some of the ways you can deal with duplicate data in SQL?

You’ll likely want to clarify a question like this with some follow-up questions of your own. Specifically, you might be interested in what kind of data is being processed and what types of values are most likely to be duplicated.

Then, you could suggest some strategies. You might propose using the DISTINCT or UNIQUE key to reduce duplicate data. Or you could walk the interviewer through how the GROUP BY key could be used.

Q4. Write a query to select the top three departments with at least ten employees and rank them according to the percentage of their employees making over 100K in salary.

Hint: With this question, we know that we need to calculate the total number of employees that are making over $100K by each department. This means that we’re going to have to run a GROUP BY on the department name since we want a new row for each department.

We also need a formula to represent how we can differentiate employees that make over $100K and those that make less. What does that formula look like?

Data Engineer Python Questions

Be prepared for a wide range of data engineer Python questions. The most common topics include distribution-based questions, data munging with pandas, and data manipulation. Overall, data engineer Python questions tend to focus on three core categories:

1) Data structures and manipulation - You’ll likely be asked questions about using Python lists, data types and basic Python operations like searching and other data manipulation techniques. Typically these are definitions-based questions.

2) Python definitions - This type of question includes definitions of sequences, especially search, merge and sort functions, as well as creating new data by combining existing data. Algorithms-based Python questions might be asked too.

3) Python practice programs - Data engineering interviews all include coding exercises. This technical portion will ask you to solve problems based on an existing data set, e.g. practical application of your Python programming skills.

Q1. Write a function find_bigrams to take a string and return a list of all bigrams.

Example:

sentence = """
Have free hours and love children?
Drive kids to school, soccer practice
and other activities.
"""
def find_bigrams(sentence) ->

 [('have', 'free'),
 ('free', 'hours'),
 ('hours', 'and'),
 ('and', 'love'),
 ('love', 'children?'),
 ('children?', 'drive'),
 ('drive', 'kids'),
 ('kids', 'to'),
 ('to', 'school,'),
 ('school,', 'soccer'),
 ('soccer', 'practice'),
 ('practice', 'and'),
 ('and', 'other'),
 ('other', 'activities.')]

At its core, bigrams are two words that are placed next to each other. Two words versus one in feature engineering for a NLP model gives an interaction effect.

To actually parse them out of a string, we need to first split the input string. We would use the python function .split() to create a list with each individual word as an input. Create another empty list that will eventually be filled with tuples.

Then, once we’ve identified each individual word, we need to loop through k-1 times (if k is the amount of words in a sentence) and append the current word and subsequent word to make a tuple. This tuple gets added to a list that we eventually return. Remember to use the python function .lower() to turn all the words into lowercase!

def find_bigrams(sentence):
  input_list = sentence.split()
  bigram_list = []

  # Now we have to loop through each word
  for i in range(len(input_list)-1):
    #strip the whitespace and lower the word to ensure consistency
    bigram_list.append((input_list[i].strip().lower(), input_list[i+1].strip().lower()))
  return bigram_list

Q2. What are some primitive data structures in Python? What are some user-defined data structures?

The built-in data types in Python include lists, tuples, dictionaries and sets. These data types are already defined and supported by Python, and they act as containers for grouping data by type. User-defined data types share commonalities with the primitive types, and they are based on these concepts. But ultimately, they allow users to create their own data structures, including queues, trees and linked lists.

Q3. Write a function to create a queue and display all the members and size of the queue.

Here’s a solution:

import queue
q = queue.Queue()
for x in range(4):
q.put(x)
print("Members of the queue:")
y=z=q.qsize()
for n in list(q.queue):
print(n, end=" ")
print("\nSize of the queue:")
print(q.qsize())
Sample Output:
Members of the queue:
0 1 2 3
Size of the queue:
4

Q4. Given a string, write a function recurring_char to find its first recurring character. Return None if there is no recurring character.

Try this question on Interview Query.

Database Design and Data Modeling Questions

With data modeling and database design questions, you should be prepared to answer questions about entity-relationship modeling and normalization and denormalization tradeoffs, as well as dimensional modeling and related concepts.

Often, the data modeling interviews start with being presented an application, and then you have to design the schema for it. In most cases, the interviewer will then ask you to write a query to produce metrics for the business.

Q1. Design a database to represent a Tinder-style dating app. What does the schema look like?

Let’s first approach this problem by understanding the scope of the dating app and what functionality we must design around.

If we were to list the key Tinder app capabilities, it would be something like:

  • Onboarding - User opens up, adds preferences, add pictures, and starts swiping on users.
  • Matching - If the user matches with another user, we notify them and create a messaging system.
  • Messaging - Users can message each other only if both have been matched. Users can also leave conversations at any time.

Next, we should figure out if we have specific feature goals that we have to account for within engineering and system design.

For example if we look at swiping:

  • Do we care about how advanced the matching algorithm works? Do we need to account for a similarity algorithm or could we just do a dumb matching algorithm?
  • Which preferences by users are hard versus soft?
    • For example, distance filters and gender filters are usually hard while race, age, and others could be softer

Let’s assume that matching will be based off of hard filters for now but design a database that makes it easier to increase the complexity of our matching algorithm. Our priorities should include efficiency for users to quickly onboard and retrieve matches as fast as possible to not allowing users to drop-off.

With this information, we can create an initial design for the database.

Q2. Let’s say we want to run some data collection on the Golden Gate bridge.

What would the table schema look like if we wanted to track how long each car took coming into San Francisco to enter and exit the bridge?

Write a query on the given tables to get the car model with the fastest average times for the current day.

In this question, we have to track time entered and exited leaving the bridge, but also the car make and model along with license plate information. We know that the car model to license plate information will be one-to-many, given that each license plate represents a single car, and a car model can be replicated many times.

Q3. How would you create a schema to represent client click data on the web?

This question is more architecture-based and assesses experience within developing databases, setting up architectures, and in this case, representing client-side tracking in the form of clicks.

What exactly does click data on the web mean?

Any form of button clicks, scrolls, or action at all as an interaction with the client interface, in this case, desktop, would be somehow represented into a schema form for the end-user to query. This does not include client views, however.

A simple but effective design schema would be to first represent each action with a specific label. In this case, assigning each click event a name or label describing its specific action.

Data Engineering Case Study

Data engineering case studies are becoming more and more common in interviews, but you might be wondering:

What is a data engineering case study?

Data engineering case studies, or “data modeling case studies,” are scenario-based data engineering problem. For example, many questions focus on architecture, e.g. design the database for a Tinder-style app. Then you walk the interviewer developing a solution. With data engineering case studies, you should talk through design considerations, steps for building the solution (like an ETL pipeline), and potential pitfalls.

Q1. You’re tasked with building a notification system for a simple Reddit-style app.

What would the backend and data model look like?

A first step with a question like this is to ask clarifying questions. You might want to know goals for the notification system, user information, and the types of notifications that are being sent.

Then, you’ll want to make assumptions. You could start with the notifications:

Notifications have two general types:

  • Trigger-based notifications - This might be something like an email notification for comment replies on a submitted post.
  • Scheduled notifications - This might be a targeted push notification for new content. These are notifications designed to drive engagement.

The key difference here is that one is event-triggered and the other is user-triggered. Want to see more.

Q2. You are analyzing auto insurance data and find that the marriage attribute column is marked TRUE for all customers.

How would you debug what happened? What data would you look into and how would you find out who is actually married and who is not?

Try this question on Interview Query.

Q3. Design a relational database for storing metadata about songs. This would include metadata like song title, song length, artist, album, release year, genre, etc.

How would you go about designing this database?

Try this question on Interview Query.

Behavioral Questions for Data Engineers

In data engineering interviews, hiring managers want to assess whether you’re a good culture fit, are open to collaboration, can handle adversity, and know that your resume stands up to scrutiny.

Behavioral interviews test your soft skills and determine if you’ll be the right fit for the position. For many questions, you can use a simple format to structure your answers:

  1. Describe the project or situation.
  2. Talk about your approach to addressing the situation.
  3. Walk the interviewer through the steps you took to solve the problem.
  4. Provide measurable results.

Here are some common behavioral questions for data engineers:

Q1. Describe a Big Data problem you worked on. What were some challenges that you faced?

A question like this assesses your judgement, your adaptability and creative problem-solving skills. Every project has challenges. It could be budgetary, technology limitations or a communication failure.

Don’t be afraid to clearly state the challenges; just make sure you back it up with how you responded and overcame the challenges. Another tip: A simple framework here would help you clearly walk the interviewer through your thought process.

Q2. Tell me about a situation where you dealt with a technology that you weren’t familiar with.

Q3. What is your experience using cloud technologies?

Q4. Describe a time you had to explain a complex subject to a non-technical person.

Q5. Why are you interested in working at our company? Why did you choose the field of data engineering?

Additional Engineering Behavioral Questions

Behavioral questions also want to assess your relevant past experience and quiz you on your familiarity with various tools and architectures.

Q6. When would you use Java instead of Python?

Q7. How would you design a data warehouse given X criteria?

Q8. How would you design a data pipeline?

Q9. What is the architecture of Spark?

Data Engineering ETL Questions

Data engineers and data scientists work hand in hand. Data engineers are responsible for developing ETL processes, analytical tools, and storage tools and software. Thus, expertise with existing ETL and BI solutions is a much-needed requirement.

ETL refers to how the data is taken (extraction) from a data source, converted (transformation) into a format that can be easily analyzed, and stored (loading) into a data warehouse. The ETL process then loads the converted data into a database or BI platform in order to be used and viewed by anyone in the organization.

Q1. Let’s say we want to create an aggregate table called lifetime_plays that records the song count by date for each user.

Write a SQL query that could make this ETL each day.

column type
id integer
created_at datetime
user_id integer
song_id integer

For this problem, we use the INSERT INTO keywords to add rows to the lifetime_plays table. If we set this query to run daily, it becomes a daily extract, transform, and load (ETL) process.

The rows we add are selected from the subquery that selects the created_at date, user_idsong_id, and count columns from the song_plays table for the current date.

Q2. Let’s say you have a table with a billion rows. How would you add a column inserting data from the original source without affecting the user experience?

Before jumping into the question we should remember to clarify a few details that we can potentially get out of the interviewer. It helps to ask questions to understand and show that you can think holistically about the problem. Rushing too fast into a solution is a red flag for many interviewers.

Data Structures and Algorithms Questions

Data engineers focus mostly on data modeling and data architecture, but a basic knowledge of algorithms and data structure is also needed.

Of particular importance is the data engineer’s ability to develop inexpensive methods for the transfer of large amounts of data. If you’re responsible for a database with potentially millions (let alone billions) of records, it’s important to find the most efficient solution.

These questions are common for software engineers and similarly apply to data engineers as well.

Q1. Write a function that returns the missing number in the array. Complexity of O(N) required.

Example:

nums = [0, 1, 2, 4, 5]
missingNumber(nums) -> 3

There are two ways we can solve this problem: through logical iteration and mathematical formulation. We can look at both as they both hold O(N) complexity.

The first would be through general iteration through the array. We can pass in the array and create a set that will hold each value in the input array. Then, we create a for loop that will span the range from 0 to n, and look to see if each number is in the set we just created. If it isn’t, we return the missing number.

def missingNumber(nums):
    num_set = set(nums)
    n = len(nums) + 1
    for number in range(n):
        if number not in num_set:
            return number

The second solution requires formulating an equation. If we know that one number is supposed to be missing from 0 to n, then we can solve for the missing number by taking the sum of numbers from 0 to n and subtracting it from the sum of the input array with the missing value.

An equation for the sum of numbers from 0 to n is n*(n+1)/2. Now all we have to do is apply the internal sum function to the input array, and then subtract the values from each other

def missing_number(nums):
    n = len(nums)
    total = n*(n+1)/2
    sum_of_nums = sum(nums)
    return total - sum_of_nums

Q2. Given a grid and a start and end, find what the max water height that you can traverse to before there exists no path. You can only go in horizontal and vertical directions.

Example

S 3 4 5 6
2 4 6 1 1
9 9 9 9 E

Here’s a solution:

Recursive backtrack to the end while saving the max path water level on each function call. Track a visited cell set to trim the search space. O(n^2)

Q3. Given a string, determine whether any permutation of it is a palindrome.

The brute force solution to this question would be to try every permutation, and verify if it’s a palindrome. If we find one, then return true, otherwise return false. You can see the full solution on Interview Query.

Q4. Given a stream of numbers, select a random number from the stream, with O(1) space in selection.

A function that is O(1) means it does not grow with the size of the input data.

That means, for this problem, the function must loop through the stream, inputting 2 entries at a time and choosing between the two of them with a random method.

More Interview Resources

See company specific data engineering interview guides here:

Sign up for Interview Query, for 400+ questions, company specific guides, a data science course and more to prep for your interview.