Data engineers are called the backbone of all data science and engineering. Running data analytics, building predictive models, and developing software is how all modern businesses are built and scaled, but this isn’t possible without having good data to begin with.
Data engineers are responsible for ingesting data in a way that is reliable, timely, and accurate. In more technical terms, the responsibilities of data engineers are in developing interfaces and processes for the flow and access of data across a business.
Data engineer interviews assess your ability to generate actionable business insights from Big Data, and they typically focus on testing your knowledge of data architecture and frameworks, as well as your understanding of key operational metrics.
The majority of technical data engineering interviews have three phases: An initial screen, a technical screen and an on-site interview.
Here’s what you can expect from each stage of the data engineer interview process:
Initial screens are generally 30 minutes long and are conducted with a recruiter or HR. These screens assess if you’re the right fit for the role, and you can expect some behavioral questions about your experience in data engineering.
Sample question: Describe a data engineering project you worked on? What were challenges you faced?
Technical screens for data engineering roles are typically one hour, and include live coding sessions in SQL and Python. SQL questions typically cover intermediate-to-advanced concepts, like sub-queries and window functions, while the Python portion will cover your understanding of loops, strings and dictionaries. The key here is efficiency in solving coding problems, as well as your ability to work through complex problems.
Sample question: Write a SQL query to perform X function.
On-site interviews for data engineering positions typically include 3-5 interviews (up to an hour long). Each session has a general focus. For example, you might have an ETL round, a data modeling round, a case study round and/or a behavioral round. Data modeling interviews include SQL and Python questions based on business scenarios. ETL interviews look at your ability to use SQL and Python to write ETL code.
Sample question: Describe how you would design a database for a Reddit-style app.
Interview Query regularly analyzes the data engineering interview process, and based on more than 10,000 interviews, we’ve found that SQL questions are most common in data engineering interviews.
In fact, they’re asked in about 90% of interviews for engineering roles:
Here are some of the most frequently asked question topics for data engineers:
We’ll dive into each of these areas and offer sample questions, as well as approaches to answering common interview questions. For more help, see our guide How to Prepare for Data Engineering Interviews.
SQL questions in data engineering interviews cover fundamental concepts, like joins, subqueries, case statements and filters. In addition, you’ll be asked to write SQL code to answer business questions and solve problems and will be assessed on your ability to handle errors and NULL values.
Hint: Whenever the question asks about finding values with 0 something, e.g. users or neighborhoods, start thinking LEFT JOIN! An inner join finds any values that are in both tables; a LEFT JOIN keeps only the values in the left table.
In this question, our predicament is to find all the neighborhoods without users. To do this we must do a left join from the neighborhoods table to the users table which will give us an output like this:
If we then add in a where condition of WHERE users.id IS NULL, then we will get every single neighborhood without a singular user.
SELECT n.name FROM neighborhoods AS n LEFT JOIN users AS u ON n.id = u.neighborhood_id WHERE u.id IS NULL
An alternative method is to use the IN clause. This method involves finding all the neighborhoods that the user’s live in, and then using that as a filter to subset all of the existing neighborhoods in the neighborhoods table.
SELECT name FROM neighborhoods WHERE id NOT IN ( SELECT DISTINCT neighborhood_id FROM users )
This method might be slightly more inefficient depending on the SQL interpreter used. In this secondary method, we have to find the DISTINCT number of neighborhood IDs which would result in scanning the table once before then doing a second scan on the neighborhoods table.
The first method however does a join but then only needs to do one scan through the combined tables to filter out the NULL values.
Overall if you’re using an advanced SQL interpreter, it will find the most efficient method out of the existing ones no matter what.
Hint: Given two date ranges, what determines if the subscriptions overlap? If one range is neither completely after the other, nor completely before the other, then the two ranges must overlap.
To answer this question, you can think of De Morgan’s law, which says that:
Not (A Or B) <=> Not A And Not B.
What is that equivalent to? And how could we model that out for a SQL query?
You’ll likely want to clarify a question like this with some follow-up questions of your own. Specifically, you might be interested in what kind of data is being processed and what types of values are most likely to be duplicated.
Then, you could suggest some strategies. You might propose using the DISTINCT or UNIQUE key to reduce duplicate data. Or you could walk the interviewer through how the GROUP BY key could be used.
Hint: With this question, we know that we need to calculate the total number of employees that are making over $100K by each department. This means that we’re going to have to run a GROUP BY on the department name since we want a new row for each department.
We also need a formula to represent how we can differentiate employees that make over $100K and those that make less. What does that formula look like?
Be prepared for a wide range of data engineer Python questions. The most common topics include distribution-based questions, data munging with pandas, and data manipulation. Overall, data engineer Python questions tend to focus on three core categories:
1) Data structures and manipulation - You’ll likely be asked questions about using Python lists, data types and basic Python operations like searching and other data manipulation techniques. Typically these are definitions-based questions.
2) Python definitions - This type of question includes definitions of sequences, especially search, merge and sort functions, as well as creating new data by combining existing data. Algorithms-based Python questions might be asked too.
3) Python practice programs - Data engineering interviews all include coding exercises. This technical portion will ask you to solve problems based on an existing data set, e.g. practical application of your Python programming skills.
sentence = """ Have free hours and love children? Drive kids to school, soccer practice and other activities. """ def find_bigrams(sentence) -> [('have', 'free'), ('free', 'hours'), ('hours', 'and'), ('and', 'love'), ('love', 'children?'), ('children?', 'drive'), ('drive', 'kids'), ('kids', 'to'), ('to', 'school,'), ('school,', 'soccer'), ('soccer', 'practice'), ('practice', 'and'), ('and', 'other'), ('other', 'activities.')]
At its core, bigrams are two words that are placed next to each other. Two words versus one in feature engineering for a NLP model gives an interaction effect.
To actually parse them out of a string, we need to first split the input string. We would use the python function .split() to create a list with each individual word as an input. Create another empty list that will eventually be filled with tuples.
Then, once we’ve identified each individual word, we need to loop through k-1 times (if k is the amount of words in a sentence) and append the current word and subsequent word to make a tuple. This tuple gets added to a list that we eventually return. Remember to use the python function .lower() to turn all the words into lowercase!
def find_bigrams(sentence): input_list = sentence.split() bigram_list =  # Now we have to loop through each word for i in range(len(input_list)-1): #strip the whitespace and lower the word to ensure consistency bigram_list.append((input_list[i].strip().lower(), input_list[i+1].strip().lower())) return bigram_list
The built-in data types in Python include lists, tuples, dictionaries and sets. These data types are already defined and supported by Python, and they act as containers for grouping data by type. User-defined data types share commonalities with the primitive types, and they are based on these concepts. But ultimately, they allow users to create their own data structures, including queues, trees and linked lists.
Here’s a solution:
import queue q = queue.Queue() for x in range(4): q.put(x) print("Members of the queue:") y=z=q.qsize() for n in list(q.queue): print(n, end=" ") print("\nSize of the queue:") print(q.qsize()) Sample Output: Members of the queue: 0 1 2 3 Size of the queue: 4
Try this question on Interview Query.
With data modeling and database design questions, you should be prepared to answer questions about entity-relationship modeling and normalization and denormalization tradeoffs, as well as dimensional modeling and related concepts.
Often, the data modeling interviews start with being presented an application, and then you have to design the schema for it. In most cases, the interviewer will then ask you to write a query to produce metrics for the business.
Let’s first approach this problem by understanding the scope of the dating app and what functionality we must design around.
If we were to list the key Tinder app capabilities, it would be something like:
Next, we should figure out if we have specific feature goals that we have to account for within engineering and system design.
For example if we look at swiping:
Let’s assume that matching will be based off of hard filters for now but design a database that makes it easier to increase the complexity of our matching algorithm. Our priorities should include efficiency for users to quickly onboard and retrieve matches as fast as possible to not allowing users to drop-off.
With this information, we can create an initial design for the database.
In this question, we have to track time entered and exited leaving the bridge, but also the car make and model along with license plate information. We know that the car model to license plate information will be one-to-many, given that each license plate represents a single car, and a car model can be replicated many times.
This question is more architecture-based and assesses experience within developing databases, setting up architectures, and in this case, representing client-side tracking in the form of clicks.
What exactly does click data on the web mean?
Any form of button clicks, scrolls, or action at all as an interaction with the client interface, in this case, desktop, would be somehow represented into a schema form for the end-user to query. This does not include client views, however.
A simple but effective design schema would be to first represent each action with a specific label. In this case, assigning each click event a name or label describing its specific action.
Data engineering case studies are becoming more and more common in interviews, but you might be wondering:
What is a data engineering case study?
Data engineering case studies, or “data modeling case studies,” are scenario-based data engineering problem. For example, many questions focus on architecture, e.g. design the database for a Tinder-style app. Then you walk the interviewer developing a solution. With data engineering case studies, you should talk through design considerations, steps for building the solution (like an ETL pipeline), and potential pitfalls.
A first step with a question like this is to ask clarifying questions. You might want to know goals for the notification system, user information, and the types of notifications that are being sent.
Then, you’ll want to make assumptions. You could start with the notifications:
Notifications have two general types:
The key difference here is that one is event-triggered and the other is user-triggered. Want to see more.
Try this question on Interview Query.
Try this question on Interview Query.
In data engineering interviews, hiring managers want to assess whether you’re a good culture fit, are open to collaboration, can handle adversity, and know that your resume stands up to scrutiny.
Behavioral interviews test your soft skills and determine if you’ll be the right fit for the position. For many questions, you can use a simple format to structure your answers:
Here are some common behavioral questions for data engineers:
A question like this assesses your judgement, your adaptability and creative problem-solving skills. Every project has challenges. It could be budgetary, technology limitations or a communication failure.
Don’t be afraid to clearly state the challenges; just make sure you back it up with how you responded and overcame the challenges. Another tip: A simple framework here would help you clearly walk the interviewer through your thought process.
Behavioral questions also want to assess your relevant past experience and quiz you on your familiarity with various tools and architectures.
Data engineers and data scientists work hand in hand. Data engineers are responsible for developing ETL processes, analytical tools, and storage tools and software. Thus, expertise with existing ETL and BI solutions is a much-needed requirement.
ETL refers to how the data is taken (extraction) from a data source, converted (transformation) into a format that can be easily analyzed, and stored (loading) into a data warehouse. The ETL process then loads the converted data into a database or BI platform in order to be used and viewed by anyone in the organization.
lifetime_playsthat records the song count by date for each user.
For this problem, we use the
INSERT INTO keywords to add rows to the
lifetime_plays table. If we set this query to run daily, it becomes a daily extract, transform, and load (ETL) process.
The rows we add are selected from the subquery that selects the
song_id, and count columns from the
song_plays table for the current date.
Before jumping into the question we should remember to clarify a few details that we can potentially get out of the interviewer. It helps to ask questions to understand and show that you can think holistically about the problem. Rushing too fast into a solution is a red flag for many interviewers.
Data engineers focus mostly on data modeling and data architecture, but a basic knowledge of algorithms and data structure is also needed.
Of particular importance is the data engineer’s ability to develop inexpensive methods for the transfer of large amounts of data. If you’re responsible for a database with potentially millions (let alone billions) of records, it’s important to find the most efficient solution.
These questions are common for software engineers and similarly apply to data engineers as well.
nums = [0, 1, 2, 4, 5] missingNumber(nums) -> 3
There are two ways we can solve this problem: through logical iteration and mathematical formulation. We can look at both as they both hold O(N) complexity.
The first would be through general iteration through the array. We can pass in the array and create a set that will hold each value in the input array. Then, we create a for loop that will span the range from 0 to n, and look to see if each number is in the set we just created. If it isn’t, we return the missing number.
def missingNumber(nums): num_set = set(nums) n = len(nums) + 1 for number in range(n): if number not in num_set: return number
The second solution requires formulating an equation. If we know that one number is supposed to be missing from 0 to n, then we can solve for the missing number by taking the sum of numbers from 0 to n and subtracting it from the sum of the input array with the missing value.
An equation for the sum of numbers from 0 to n is n*(n+1)/2. Now all we have to do is apply the internal sum function to the input array, and then subtract the values from each other
def missing_number(nums): n = len(nums) total = n*(n+1)/2 sum_of_nums = sum(nums) return total - sum_of_nums
Example S 3 4 5 6 2 4 6 1 1 9 9 9 9 E
Here’s a solution:
Recursive backtrack to the end while saving the max path water level on each function call. Track a visited cell set to trim the search space. O(n^2)
The brute force solution to this question would be to try every permutation, and verify if it’s a palindrome. If we find one, then return true, otherwise return false. You can see the full solution on Interview Query.
A function that is O(1) means it does not grow with the size of the input data.
That means, for this problem, the function must loop through the stream, inputting 2 entries at a time and choosing between the two of them with a random method.