Case study interviews for data engineers are one of the most challenging stages of the interview process.
During a data engineer case study interview, candidates are asked scenario-based questions that deal with architecture or a broader data engineering problem. They then have to brainstorm solutions and walk the interviewer through how they would design and build one.
Data engineer interviews test a variety of skills. They provide insights into your technical abilities, especially in ETL architecture and design. But they’re also used to assess your problem-solving ability, communication style, command of design best practices, and ability to think of potential trade-offs to a solution.
How do you go about solving a data engineer case study? We’ve highlighted what you can expect, as well as provided sample data engineer case study interview questions in this overview guide.
During the data engineer case study interview, you’ll be provided a scenario-based problem. Typically, these questions focus on architecture and require you to develop a solution and walk the interviewer through how you would design and build the solution.
For example, you might be asked:
“You’re tasked with building a data pipeline for POS data from a store like Walmart. This data will be used by data scientists. How would you do it?”
With a case study question, the first step is to ask clarifying questions. You should gather as much information as you need. Then, you would propose your solution.
A few tips for data engineer case interviews include:
Ultimately, these questions focus on a range of subjects, including database design, data warehousing, ETL pipelines, and data modeling.
Let’s go through an example data engineer case study.
The majority of case studies for data engineers are multi-faceted. They build on top of one another. For example, you might start with a basic question about database design, and then progress to more challenging questions about your initial design.
This is what makes data engineer case interviews so difficult to pass, especially if the baseline solution is wrong.
Here’s a sample case question:
You’re tasked with building a notification system for a simple Reddit-style app.
What would the backend and data model look like?
Hint: Always start with a clarifying question. For our sample, we might want to know the goals of the notification system, what types of notifications there will be, etc.
First, define how you’ll want to think about notifications. Notifications generally fall into two types: event-triggered and user-triggered. The key difference is whether a system event or a user action initiates the notification.
Next, you can walk the interviewer through how you would design a simple database for notifications. This database might include two tables: one for the notifications themselves and a notification_metrics table for tracking what happens to each one.
All notifications that get triggered would be sent to a task manager that runs the task asynchronously from the actual application. We don’t want to send these notifications from within the app itself, because a sudden influx of notifications could bring down the application.
When we log the notification, we could insert a row into the notification_metrics table to record the event.
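A minimal sketch of this two-table design in SQLite (all table and column names other than notification_metrics are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# notifications: one row per notification queued by the task manager.
cur.execute("""
    CREATE TABLE notifications (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        notification_type TEXT NOT NULL,   -- e.g. 'reply', 'upvote'
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# notification_metrics: append-only log of what happened to each notification.
cur.execute("""
    CREATE TABLE notification_metrics (
        id INTEGER PRIMARY KEY,
        notification_id INTEGER NOT NULL REFERENCES notifications(id),
        event TEXT NOT NULL,               -- e.g. 'sent', 'opened'
        event_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# When a notification is sent, record it in the metrics table.
cur.execute("INSERT INTO notifications (user_id, notification_type) VALUES (?, ?)",
            (1, "reply"))
notif_id = cur.lastrowid
cur.execute("INSERT INTO notification_metrics (notification_id, event) VALUES (?, ?)",
            (notif_id, "sent"))
conn.commit()
```

Keeping metrics in their own table means the write path for sending a notification stays small, while analytics can grow independently.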
Let’s say that we want to also track reads and opens, as well as follow-through rates.
Would we update the notification_metrics table in real-time?
If yes, we could set up a webhook to run an update command. However, we need to compare the performance of an update on the notification_metrics table against a simple insert into a separate events-style table.
Let’s assume for design reasons that we mainly want to do an insert.
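To make that trade-off concrete, here is a small SQLite sketch contrasting the update-in-place approach with the append-only insert approach (table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Option 1: mutable counters, updated in place by the webhook.
cur.execute("""
    CREATE TABLE notification_metrics (
        notification_id INTEGER PRIMARY KEY,
        opens INTEGER DEFAULT 0
    )
""")

# Option 2: append-only events table; counts are derived at read time.
cur.execute("""
    CREATE TABLE notification_events (
        notification_id INTEGER,
        event TEXT,
        event_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# Update path: concurrent opens of the same notification contend on one row.
cur.execute("INSERT INTO notification_metrics (notification_id) VALUES (1)")
cur.execute("UPDATE notification_metrics SET opens = opens + 1 WHERE notification_id = 1")

# Insert path: each open is just a new row; no contention on existing rows.
cur.execute("INSERT INTO notification_events (notification_id, event) VALUES (1, 'open')")
cur.execute("INSERT INTO notification_events (notification_id, event) VALUES (1, 'open')")

cur.execute("SELECT COUNT(*) FROM notification_events "
            "WHERE notification_id = 1 AND event = 'open'")
opens_from_events = cur.fetchone()[0]
```

The insert path trades cheap writes for more expensive reads, which is usually the right call when metrics are consumed by batch analytics rather than the live application.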
What’s the refresh cadence for the ETL that pipes out notification analytics?
For this cadence, we have to make some assumptions about how fresh notification analytics need to be. More often than not, when handling a huge amount of data, we care just as much about data quality as we do about performance.
Ultimately, more questions would come up, like “What if notifications are delayed? How would we handle a backfill?” Plus, the interviewer will likely ask questions along the way.
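One common way to make the backfill question tractable is an idempotent, partitioned ETL job: re-running an hour simply overwrites that hour’s aggregates instead of double-counting. A sketch in SQLite (schema and names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE notification_events "
            "(notification_id INTEGER, event TEXT, event_hour TEXT)")
cur.execute("CREATE TABLE notification_analytics "
            "(event_hour TEXT PRIMARY KEY, sends INTEGER, opens INTEGER)")

cur.executemany("INSERT INTO notification_events VALUES (?, ?, ?)", [
    (1, "sent", "2024-01-01T10"),
    (1, "open", "2024-01-01T10"),
    (2, "sent", "2024-01-01T10"),
])

def run_etl(hour):
    # Delete-then-insert makes the job idempotent: re-running the same hour
    # (e.g. to backfill delayed notifications) overwrites rather than double-counts.
    cur.execute("DELETE FROM notification_analytics WHERE event_hour = ?", (hour,))
    cur.execute("""
        INSERT INTO notification_analytics
        SELECT event_hour,
               SUM(event = 'sent'),
               SUM(event = 'open')
        FROM notification_events
        WHERE event_hour = ?
        GROUP BY event_hour
    """, (hour,))

run_etl("2024-01-01T10")
run_etl("2024-01-01T10")  # safe to re-run after late-arriving data lands
```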
Your goal should be to methodically walk the interviewer through a solution, discuss the trade-offs, make assumptions, and ask questions if needed.
Here’s another similar data engineer case study question focusing on database design.
Design a database to represent a Tinder-style dating app. What does the schema look like and what are some optimizations that you think we might need?
First, you want to approach this problem from the product perspective. It’s a dating app, and we have to know what user functionalities we must design around.
It’d be good to start by listing some of the key capabilities of a Tinder-style app: creating a profile, swiping on other users, matching, and messaging between matched users.
Next, we should identify whether there are specific feature goals we need to account for within engineering or system design. For example, if we look at swiping, we might ask how matches are determined and how complex the matching algorithm needs to be.
When you make assumptions, be sure to ask if you’re on the right track.
The interviewer will likely provide additional information to help refine your solution like this:
Let’s say the matching will be based on hard filters for now, but you should design a database that makes it easier to increase the complexity of our matching algorithm.
Your priorities should include letting users onboard quickly and retrieve matches efficiently.
We will need a range of tables to represent users, swipes, and messages between them.
Let’s walk through the design.
The swipes table is the most challenging one. We know we have to save the state of each user in the pair. We have two users, user_a and user_b, where user_a denotes whichever user swiped on the other first. Since this table will be massive and acts as a log of the actions that follow each user’s swipe, we need to make sure we save the specific state and action produced by the two-user combination.
Therefore, the states are the possible combinations of the two users’ swipes: both swiped right (a match), only one swiped right, or neither did.
Each of these states could be represented by a numerical categorization in the database, for simplicity and to save space. Beyond these completed states, we could also have states for when the other user hasn’t swiped yet. For example, if User A swipes right on User B but User B hasn’t swiped on User A, we would save a pending state that could be updated later on.
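One way to sketch this numeric state encoding in Python (the specific states and values are hypothetical, chosen to illustrate updating a pending row once the second user swipes):

```python
from enum import IntEnum

class SwipeState(IntEnum):
    # Hypothetical numeric encoding of the two-user swipe state;
    # user_a is whichever user swiped first.
    A_LIKED_PENDING = 0   # user_a swiped right, user_b hasn't swiped yet
    A_PASSED = 1          # user_a swiped left; no match possible
    MATCHED = 2           # both swiped right
    B_PASSED = 3          # user_a swiped right, user_b swiped left

def apply_b_swipe(state: SwipeState, b_swiped_right: bool) -> SwipeState:
    """Resolve a pending row once user_b finally swipes."""
    if state != SwipeState.A_LIKED_PENDING:
        return state  # already resolved; nothing to update
    return SwipeState.MATCHED if b_swiped_right else SwipeState.B_PASSED
```

Storing a small integer per pair keeps the row narrow, which matters once this table reaches billions of rows.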
This relational format allows us to easily query common metrics such as total matches per user, total swipes per user, and the matches-per-swipe ratio.
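For instance, with a hypothetical numeric state encoding (0 = pending, 1 = no match, 2 = match), the matches-per-swipe ratio is a simple aggregate:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical swipes table: one row per ordered user pair, plus a numeric state.
cur.execute("CREATE TABLE swipes (user_a INTEGER, user_b INTEGER, state INTEGER)")
cur.executemany("INSERT INTO swipes VALUES (?, ?, ?)", [
    (1, 2, 2),  # match
    (1, 3, 1),  # no match
    (1, 4, 0),  # pending
    (2, 3, 2),  # match
])

# Matches-per-swipe ratio for user 1: matched pairs over all pairs they appear in.
cur.execute("""
    SELECT CAST(SUM(state = 2) AS REAL) / COUNT(*)
    FROM swipes
    WHERE user_a = 1 OR user_b = 1
""")
match_ratio = cur.fetchone()[0]
```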
Additionally, we separated the locations table out from the user profile. This is because of the clear need to improve matching once hard filters are set. While filtering on something like gender is a simple query, computing geo-location coordinates is harder in SQL; it becomes tractable if we can first narrow the result set down to the zip-code level.
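A sketch of that two-step approach, assuming a hypothetical locations table: a cheap SQL hard filter on zip code first, then exact distances computed in application code with the haversine formula.

```python
import sqlite3
from math import radians, sin, cos, asin, sqrt

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE locations "
            "(user_id INTEGER PRIMARY KEY, zip_code TEXT, lat REAL, lon REAL)")
cur.executemany("INSERT INTO locations VALUES (?, ?, ?, ?)", [
    (1, "94107", 37.7697, -122.3933),
    (2, "94107", 37.7764, -122.3947),
    (3, "10001", 40.7506, -73.9972),
])

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points, in kilometers.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Step 1: hard filter in SQL narrows candidates to the same zip code.
cur.execute("SELECT user_id, lat, lon FROM locations "
            "WHERE zip_code = ? AND user_id != ?", ("94107", 1))

# Step 2: exact distance only for that small candidate set.
my_lat, my_lon = 37.7697, -122.3933
nearby = [(uid, haversine_km(my_lat, my_lon, lat, lon))
          for uid, lat, lon in cur.fetchall()]
```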
Lastly, let’s look at database engineering optimizations that we can perform.
The optimizations with the biggest payoff would likely be those that increase the speed and performance of the locations and swipes tables.
While we can easily add an index to the locations table on something like zip code (assuming U.S.-only users), we can’t really add one to the swipes table given its size.
One option is to implement a sharded design for our database. While an index builds an ordered copy of the indexed columns so you can read off a table efficiently, sharding splits the data across multiple nodes so that the specific record you want lives on exactly one of them. This allows for a more bounded retrieval time.
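A minimal sketch of hash-based shard routing (the shard count and row layout are assumptions):

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_for_user(user_id: int) -> int:
    # Stable hash routing: a given user's swipe rows always live on one shard,
    # so a lookup touches a single node instead of the whole table.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Each shard holds only its slice of the swipes "table".
shards = [[] for _ in range(NUM_SHARDS)]
for user_a, user_b, state in [(1, 2, 2), (3, 4, 1), (5, 6, 0)]:
    shards[shard_for_user(user_a)].append((user_a, user_b, state))
```

Using a cryptographic hash rather than Python’s built-in `hash()` keeps routing deterministic across processes, which matters when many services must agree on where a row lives.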
Lastly, the ultimate optimization we can make is to remove the swipes table from the database entirely and treat incoming swipes as a log, scaled out with a streaming service.
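A toy sketch of the log-based approach, using an in-memory deque as a stand-in for a real streaming service: swipes are produced as append-only events, and a consumer derives matches, with no swipes table at all.

```python
from collections import deque

# Stand-in for a streaming topic: an append-only log of swipe events.
swipe_log = deque()

def produce(user_a: int, user_b: int, liked: bool) -> None:
    swipe_log.append((user_a, user_b, liked))

def consume_matches(log):
    """Replay the log and derive matches: a match occurs when both users liked."""
    likes, matches = set(), []
    for user_a, user_b, liked in log:
        if not liked:
            continue
        if (user_b, user_a) in likes:   # the other user already liked back
            matches.append(tuple(sorted((user_a, user_b))))
        likes.add((user_a, user_b))
    return matches

produce(1, 2, True)
produce(2, 1, True)   # completes the match
produce(3, 1, False)  # left swipe; ignored by the consumer
matches = consume_matches(swipe_log)
```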
Try these additional case questions, covering database design, data architecture, and data debugging:
Let’s say you work at Spotify.
We want to design a relational database for storing metadata about songs.
We want to include metadata like song title, song length, the date the song was added to the platform, artist, album, track number (on the album), the song’s release year, and its genre.
How would you go about designing this database?
How would you create a schema to represent client click data on the web?
Let’s say we want to run some data collection on the Golden Gate Bridge.
What would the table schema look like if we wanted to track how long each car took coming into San Francisco to enter and exit the bridge? Let’s say we want to track additional descriptives like the car model and license plate.
Suppose you are analyzing auto insurance data. You find that in the demographic information for all insurance clients, the marriage attribute column is marked TRUE for all customers.
How would you debug what happened? What data would you look into and how would you find out who is actually married and who is not?
In addition to data engineer case study questions, your interview will likely include questions on SQL, data structures and algorithms, data modeling and database design. See helpful sample questions from our guide: Top Data Engineering Interview Questions.