Interviews for data analysts and data scientists typically include SQL questions. That's something that hasn't changed, even as new tools like Pandas, Julia, Spark and NoSQL attempt to dethrone and replace it. Typically, SQL data science interviews ask a range of questions, all the way from basic definition-based SQL questions, to advanced querying exercises and questions.
To help you prep for your SQL interview, we've curated some of the most common SQL practice problems, exercises and questions (with answers) in a range of categories.
What Types of SQL Interview Questions Get Asked?
SQL is one of the most common topics in data science interviews. That's true across the board. According to Interview Query analysis, they're asked 70% of the time in Facebook data science interviews, and 94% of the time in Amazon's business intelligence interviews.
Ultimately, the most common questions for SQL interviews fall into seven key categories:
- Definition based SQL questions
- Basic SQL questions
- Reporting and metrics SQL questions
- Analytics SQL questions
- ETL SQL questions
- Database design questions
- Logic based SQL questions
In other words, as you prepare for a SQL interview, test your knowledge in these areas. Practice questions, SQL exercises and courses can help you quickly increase your core competencies in each of these categories.
Definition Based SQL Questions
Definition-based SQL questions are commonly asked in interviews, and essentially, it's a way for interviewers to quickly gauge your SQL knowledge. To prepare, all you have to do is study a list of definitions of SQL terms and applications. These questions will include understanding the differences between SQL joins, what kinds of aggregations exist, and knowing the basic functions like
CASE WHEN or
HAVING. Useful definition-based questions for SQL interviews include:
Q1. What is a join in SQL?
JOIN is a keyword used to merge together two or more tables on the same key.
Q2. Which SQL command is used to add rows to a table?
Q3. What is OLAP and OLTP? And when do you denormalize data?
OLTP are databases intended for online transaction processing and OLAP are databases intended for online analytical processing. Denormalize when its OLAP operations and normalize when OLTP.
Q4. What's the difference between
The main difference is that a
WHERE clause is used to filter rows before grouping and
HAVING is used to exclude records after grouping.
Q5. When do you use the
CASE WHEN function?
CASE WHEN lets you write complex conditional statements on the
SELECT clause and also allows you to pivot data from wide to long formats.
Q6. What's the difference between a
LEFT JOIN and an
Essentially, a join is used to combine data from multiple data sources. A
LEFT JOIN combines returns all rows from the left table, even if there are no matches in the right table. An
INNER JOIN, on the other hand, only returns rows if there is a match in both of the tables.
Q7. When would you use
UNION ALL? What if there were no duplicates?
UNION is used to join multiple data sets.
UNION is thusly used to combine two queries into a single result, based on the parameters of the query. Conversely,
UNION ALL extracts all rows from a set of two tables, including duplicates.
Q8. What's the difference between
COUNT function is used to count the number of rows specified by the query. It returns all the rows specified by a
COUNT DISTINCT eliminates duplicate values before the count is applied, or it returns the number of unique rows in a table. Additionally, it does not count rows with
Q9. When would you use a
HAVING clause versus a
You would use
HAVING if you're using them together with the GROUP BY clause. For example,
WHERE would be used to filter rows before the GROUP BY clause, while
HAVING would filter rows after the clause.
Basic SQL Query Interview Questions
Want to improve your SQL skills before your interview? Build your competency with our SQL course.
Basic SQL questions are generally easy and focus on assessing if you know the basics. For example, basic querying questions will require you to perform entry-level queries, like getting the
COUNT of a table, knowing what the
HAVING clause does, what a unique key is, and figuring out how to utilize
LEFT JOIN versus
INNER JOIN to give you the values that you need.
Q1. We're given two tables, a users table with demographic information and the neighborhood they live in and a neighborhoods table. Write a query that returns all of the neighborhoods that have 0 users.
Hint: Our aim to find all the neighborhoods without users. In a sense we need all the neighborhoods that do not have a singular user living in them. This means we have to introduce a concept of existence of a column in one table, but not in the other.
Q2. Write a query to return pairs of projects where the end date of one project matches the start date of another project.
Here, we're trying to use values in a single table to generate comparisons. Specifically, we're looking for pairs of projects such that one project starts on the same day that another project ends.
This will require us to perform a self-join.
Hint: Remember, when performing a self-join, the order of your values is very important. You'll want to be particular about aliasing to make sure you keep your tables straight.
Q3. You're given a table that represents search results from searches on Facebook. The query column is the search term, position column represents each position the search result came in, and the rating column represents the human rating of the search result from 1 to 5 where 5 is high relevance and 1 is low relevance.
1. Write a query to compute a metric to measure the quality of the search results for each query.
2. You want to be able to compute a metric that measures the precision of the ranking system based on position. For example, if the results for dog and cat are....
|dog||1000||1||2||picture of hotdog|
|cat||123||1||4||picture of cat|
...we would rank 'cat' as having a better search result ranking precision than 'dog' based on the correct sorting by rating.
Write a query to create a metric that can validate and rank the queries by their search result precision. Round the metric (avg_rating column) to 2 decimal places.
The ratings of each search result (1 to 5 where 5 is high relevance and 1 is low relevance) could be used to compute a metric that measures the quality of the search results for each query.
For example, if the search query for 'tiger' has 5s for each result, then that would average to be a perfect result. Therefore, taking the average of the ratings seems like a good way to measure the quality of the search results.
Q4. We're given a table called employers that consists of a user_id, year, and employer EIN label. Users can have multiple employers dictated by the different EIN labels.
Write a query to add a flag to each user if they've added a new employer in the last year in the table.
# employer # # user_id year employer_ein # 34323 2018 A # 34323 2018 B # 34323 2018 C # 34323 2017 F # 34323 2017 A # 34323 2017 B # # 86323 2018 A # 86323 2018 B # 86323 2018 C # 86323 2017 B # # 98787 2018 A # 98787 2018 B # 98787 2018 F # 98787 2017 F # 98787 2017 B # 98787 2017 A # # 55559 2018 A # 55559 2018 B # 55559 2018 C
# Output # user_id year new_ein_flag # 34323 2018 1 # 86323 2018 1 # 98787 2018 0
Q5. Write a SQL query to select the 2nd highest salary in the engineering department. If more than one person shares the highest salary, the query should select the next highest salary.
employees +---------------+---------+ | id | int | | first_name | varchar | | last_name | varchar | | salary | int | | department_id | int |--+ +---------------+---------+ | | departments | +---------------+---------+ | | id | int |<-+ | name | varchar | +---------------+---------+
First, we need the name of the department to be associated with each employee in the employees table, to understand which department each employee is a part of.
The “department_id” field in the employees table is associated with the “id” field in the departments table. We call the “department_id” a foreign key because it is a column that references the primary key of another table, which in this case is the “id” field in the departments table.
Based on this common field, we can join both tables, using INNER JOIN, to associate the name of the department name to the employees that are a part of those departments.
INNER JOIN departments
ON employees.department_id = departments.id
Q6. Given the revenue transactions table above, write a query that finds the third purchase of every user.
We need an indicator of which purchase was the third value. Whenever we think of ranking our dataset, it's helpful to then immediately think of a specific window function we can use.
We need to apply the RANK function to the transactions table. The RANK function is a window function that assigns a rank to each row in the partition of the result set.
Reporting and Metrics SQL Interview Questions
Reporting and metrics SQL questions are probably the most common type of SQL question to show up in interviews. Reporting SQL interview questions replicate the work that many business and reporting analysts do on a day-to-day basis. This means writing queries that end up in dashboards and key metrics.
These questions are generally some of the most common problems that analysts and data scientists will run into when pulling metrics for their day-to-day job. The output of what they need to pull is very clearly defined and the queries themselves require complex joins, sub-queries, self-joins, window functions and more.
Metrics-Based SQL: Easy Questions
Typically, easy SQL reporting questions tend to require basic knowledge of SQL functions. Some basic questions include:
Q1. What's the total distance traveled for all riders on Lyft in the month of March?
Q2. Given three tables, representing customer transactions and customer attributes: Write a query to get the average order value by gender.
Q3. Given three tables: user_dimension, account_dimension, and download_facts, find the average number of downloads for free vs paying customers broken out by day.
Note: The account_dimension table maps users to multiple accounts where they could be a paying customer or not. Also, round average_downloads to 2 decimal places.
Let's first break it down. What values in which tables can we join together to get the data that we need?
Ideally the data should be broken down such that we could easily graph the values to visualize two line plots of free vs paying users. The x-axis would represent the date and the y-axis would represent the average number of downloads.
The user_dimension table represents the mapping between account ids and user ids while the account_dimension table holds the value if the customer is paying or not. Lastly the download_facts table has the date and number of downloads per user.
SELECT * FROM user_dimension AS ud INNER JOIN account_dimension AS ad ON ud.account_id = ad.account_id LEFT JOIN download_facts AS df ON ud.user_id = df.user_id
Reporting with SQL: Advanced Questions
Advanced SQL reporting questions for business analysts typically require some advanced SQL skills, as well as problem-solving ability. Some example exercises include:
Q4. Write a query to get the month-over-month change of new users in January.
Q5. Given a table of job postings, write a query to breakdown the number of users that have posted their jobs once versus the number of users that have posted at least one job multiple times.
This question is kind of complicated so it's helpful to break it into multiple steps. First let's visualize what the output would look like.
We want the value of two different metrics, the number of users that have posted their jobs once versus the number of users that have posted at least one job multiple times. What does that mean exactly?
Well if a user has 5 jobs but only posted them once, then they are part of the first statement. But if they have a 5 jobs and posted a total of 7 times, that means that they had to at least posted one job multiple times.
To get to that point, we need a table with the count of user-job pairings and the number of times each job gets posted. We can pretty easily get to that point with just a simple GROUP BY on two variables, user_id and job_id.
Now we just need a way to differentiate the users that posted each job once from users that posted multiple times. Let's go back to our example. We can deduce that if we take the sum of the number of rows for each user and the sum of the number of times each job is posted, they must be equal for the user to have posted each job only once, since we grouped by the user and the job id.
Q6. Write a query to get a histogram of the number of posts per user in 2020.
Q7. We're given a table of product purchases. Each row in the table represents an individual user product purchase. Write a query to get the number of customers that were upsold by purchasing additional products.
Note: If the customer purchased two things on the same day that does not count as an upsell as they were purchased within a similar timeframe.
Hint: An upsell is determined by multiple days by the same user. Therefore we have to group by both the date field and the user_id to get each transaction broken out by day and user.
Q8. Given a table of students and their SAT test scores, write a query to return the two students with the closest test scores with the score difference.
If there are multiple students with the same minimum score difference, select the student name combination that is higher in the alphabet.
Hint: Given the problem statement is referencing one table with only two columns, we have to self-reference different creations of the same table. It's helpful to think about this problem in the form of two different tables with the same values.
Analytics SQL Interview Questions
Analytics SQL interview questions, which you might hear referred to as SQL case studies, are some of the trickiest interview questions that you will face. This is because they test two concepts:
- Understanding what metrics we need to answer the question.
- Writing the correct SQL query that will output these metrics.
Analytics SQL interview questions are designed to test how you would think about solving a problem, and are purposely left more ambiguous than other types of problems. The tough part is that you not only have to think critically about what the SQL output has to look like, you also need to understand EXACTLY the right data points to pull.
For example, an interviewer might ask you to write a SQL query (given a few tables) to understand which AB test variant won. But there might not even be any understanding of what winning actually means.
Q1. We ran an A/B test on two different sign up funnels. Write a query to see which variant "won."
Q2. The schema below is for a retail online shopping company consisting of two tables, attribution and user_sessions.
- The attribution table logs a session visit for each row.
- If conversion is true, then the user converted to buying on that session.
- The channel column represents which advertising platform the user was attributed to for that specific session.
- Lastly the `user_sessions` table maps many to one session visits back to one user.
First touch attribution is defined as the channel to which the converted user was associated with when they first discovered the website.
Calculate the first touch attribution for each user_id that converted.
Imagine the full path of a user converting to a purchase after multiple visits.
- 1st Session: User sees Facebook ad -> Clicks to order -> Leaves
- 2nd Session: User sees Google ad -> Leaves
- 3rd Session: User types in website -> Clicks to order -> Purchases
How do we figure out the beginning path of the Facebook ad and connect it to the end purchasing user?
We need to do two actions: 1) subset all of the users that converted to customers and 2) figure out their first session visit to attribute the actual channel. We can do that by creating a sub-query that only gets the distinct users that have actually converted.
Q3. We're looking to understand the effect of a new Uber driver incentive promotion released in the past month on driver behavior. Write a query to figure out if the incentive worked as indicated.
Q4. Let's say we want to build a naive recommender. We're given two tables, one table called `friends` with a user_id and friend_id columns representing each user's friends, and another table called `page_likes` with a user_id and a page_id representing the page each user liked.
Write an SQL query to create a metric to recommend pages for each user based on recommendations from their friends liked pages.
Note: It shouldn't recommend pages that the user already likes.
Let's solve this problem by visualizing what kind of output we want from the query. Given that we have to create a metric for each user to recommend pages, we know we want something with a user_id and a page_id along with some sort of recommendation score.
Let's try to think of an easy way to represent the scores of each user_id and page_id combo. One naïve method would be to create a score by summing up the total likes by friends on each page that the user hasn't currently liked. Then the max value on our metric will be the most recommendable page.
The first thing we have to do is then to write a query to associate users to their friends liked pages. We can do that easily with an initial join between the two tables.
Q5. Amazon released a new recommendation widget on their landing page. Write a query to determine the impact the recommendation widget made on user behavior for one metric.
Q6. Write a query to show the number of users, number of transactions placed, and total order amount per month in the year 2020. Assume that we are only interested in the monthly reports for a single year (January-December).
Hint: We've decided to
DISTINCTusers in our transactions table to generate our number of monthly customers. How can we take advantage of the relationship between our transactions and products tables to calculate the total purchase amount per month of any given customer?
Q7. Given a table of transactions and products, write a query to return the product id and average product price for that id. Only return the products where the average product price is greater than the average price of all transactions.
Q8. Given a table of product subscriptions with a subscription start date and end date for each user, write a query that returns true or false whether or not each user has a subscription date range that overlaps with any other user.
Q9. In the table below, column action represents either ('post_enter', 'post_submit', 'post_canceled') for when a user starts a post (enter), ends up canceling it (cancel), or ends up posting it (submit).
Write a query to get the post success rate for each day in the month of January 2020.
Hint: Let's see if we can clearly define the metrics we want to calculate before just jumping into the problem. We want post success rate for each day over the past week.
To get that metric let's assume post success rate can be defined as:
(total posts created) / (total posts entered)
Additionally since the success rate must be broken down by day, we must make sure that a post that is entered must be completed on the same day.
ETL SQL Interview Questions
ETL stands for "Extract, Transfer, Load" and describes the process for which data flows between different data warehousing systems.
Extract does the process of reading data from a database. Transform converts data into a format that could be appropriate for reporting, analysis, machine learning, etc., and Load writes the transformed data into another database, table, or any other data storage service that can be then used by another data scientist or engineer for reporting.
In the interview, ETL tools and concepts are important to know for virtually all roles. The more difficult interview questions, however, will likely be focused and asked in data engineering, business intelligence, and related interviews.
Q1. What's the difference between
Q2. What is a
PRIMARY KEY in SQL syntax?
Q3. We have a table called `song_plays` that tracks each time a user plays a song. Let's say we want to create an aggregate table called `lifetime_plays` that records the song count by date for each user. Write a SQL query that could make this ETL each day.
Hint: For this problem, we use the
INSERT INTO keywords to add rows into the
lifetime_plays table. If we set this query to run daily, it becomes a daily extract, transform, and load (ETL) process.
Q4. What is a
Q5. Let's say you have a table with a billion rows. How would you add a column inserting data from the original source without affecting the user experience?
Before jumping into the question we should remember to clarify a few details that we can potentially get out of the interviewer. It helps to ask questions to understand and show that you can think holistically about the problem. Rushing too fast into a solution is a red flag for many interviewers.
Q6. List an example of when you would add an
INDEX to a table?
Q7. What's the difference between a
PARTITION and an
Q8. Let’s say we have a table representing a company payroll schema. Due to an ETL error, the employees table instead of updating the salaries every year when doing compensation adjustments, did an insert instead. The head of HR still needs the current salary of each employee.
Write a query to get the current salary for each employee.
Note: Assume no duplicate combination of first and last names (e.g. two John Smiths)
Q9. Does creating a view require storage in a database?
Q10. Let's say that we have two ETL jobs that feed into a single production table each day. Can you think of any problems this might bring up?
Database Design SQL Interview Questions
Database design SQL questions test your knowledge of data architecture and design. Most importantly, it tests whether you know how to design a database from scratch, given a business idea, application, or any other software that needs to interact with a database.
Many times, when databases need to scale for performance or breadth size of tables, we need to realize how to modify our database to fit the new requirements. Starting from a solid initial design is always important when building out databases.
Q1. Let's say we're working at Spotify. In the users table we have columns such as
location but also columns like
favorite_artist. Should we do something about this table?
Q2. Design a database to represent a Tinder style dating app. What does the schema look like and what are some optimizations that you think we might need?
Let’s first approach this problem by understanding the scope of the dating app and what functionality we must design around.
If we were to list the key Tinder app capabilities, it would be something like:
- Onboarding - User opens up, adds preferences, add pictures, and starts swiping on users.
- Matching - If the user matches with another user, we notify them and create a messaging system.
- Messaging - Users can message between each other only if both have been matched. Users can also leave conversations at any time.
Q3. How would you design a database for a fast food restaurant?
Q4. Let's say we want to run some data collection on the Golden Gate bridge.
1. What would the table schema look like if we wanted to track how long each car took coming into San Francisco to enter and exit the bridge? Let's say we want to track additional descriptives like the car model and license plate.
2. Write a query on the given tables to get the time of the fastest car on the current day.
3. Write a query on the given tables to get the car model with the average fastest times for the current day. (Example: Let's say three Ferraris crossed the bridge in an average of one minute).
This question functions slightly more like data engineering or architecture program. Given a certain use case or application, we have to model how we want to store the information.
In this case we're given that we have to track time entered and exited leaving the bridge, but also the car make and model along with license plate information. We know that the car model to license plate information will be one to many, given that each license plate represents a single car, and a car model can be replicated many times.
Q5. Let's say we are productionizing a machine learning model. How would you set up the data architecture to ensure that the model serves all of our users in real time? How would you ensure that the model can be re-trained each day with new data coming in?
Q6. How would you create a schema to represent client click data on the web?
Hint: What exactly does click data on the web mean? Any form of button clicks, scrolls, or action at all is an interaction with the client interface–in this case desktop–and would somehow be represented into a schema form for the end user to query. This does not include client views. A simple but effective design schema would be to first represent each action with a specific label. In this case, assigning each click event a name or label describing its specific action.
Q7. Let's say we have a table representing vacation bookings. How would you make an aggregate table represented below called listing_bookings with values grouped by the listing_id and columns that represented the total number of bookings in the last 90 days, 365 days, and all time?
listing bookings table:
Logic based SQL Interview Questions
Logic based SQL interview questions are very tricky. They aren't really based on real life examples so much as putting the trickiness of algorithms and data structure interviews into SQL questions. This is exemplified on sites such as LeetCode, where you'll see a lot of interview questions that aren't very practical for real life scenarios.
Q1. Write a query to create a new table, named flight routes, that displays unique pairs of two locations.
Example: Duplicate pairs from the flights table, such as Dallas to Seattle and Seattle to Dallas, should have one entry in the flight routes table.
Q2. Let's say we have a table with an id and name field. The table holds over 100 million rows and we want to sample a random row in the table without throttling the database.
Write a query to randomly sample a row from this table.
Hint: We know that the RAND() function actually returns a floating-point between 0 and 1. So if we were to run this function, SELECT RAND(), we would get a random decimal point to some Nth degree of precision. RAND() essentially allows us to seed a random value. How can we use this to select a random row quickly?
Q3. Given a users table, write a query to get the cumulative number of new users added by day, with the total reset every month.
Hint: This question first seems like it could be solved by just running a COUNT(*) and grouping by date. Or maybe it's just a regular cumulative distribution function? But we have to notice that we are actually grouping by a specific interval of month and date. And that when the next month comes around, we want to the reset the count of the number of users.
Strategies for the live SQL interview
Let's go over the common strategies when tackling SQL interview questions.
1.Repeat the problem statement
When presented with a SQL question, listen carefully to the problem description and repeat back what you think the crux of the problem is. The interviewer can then help verify if your understanding is correct.
2. Understand the edge cases
If time permits, write out a base case and an edge case to show that you understand the problem. For example: if the interviewer asks you to pull the average number of events per user per day, write out an example scenario where you're verifying this metric.
Do duplicate events matter? Are we looking at distinct users? These are questions we need to clarify.
3. Try working backwards if the problem is tricky
Sketching out what the output of the SQL question will look like is a great strategy towards solving the problem. Usually, if I know what the end output table is supposed to look like, I can work backwards from there on what functions need to be applied before.
For example, if the output looks like this:
date | average events per user ------------+----------------------- 2021-12-01 | 3.5 2021-12-02 | 4.0
I know that the table before this aggregation would have to look something like this.
date | event | user_id -----------+-------+-------- 2021-12-01 | click | 1 2021-12-01 | view | 1 ......
And then, I can figure out what functions I should use to get to my desired output!
4. Pattern match to different functions
As you practice more and more SQL exercises, what you'll find is that many SQL problems follow similar patterns. There are techniques we can use in SQL, like utilizing
HAVING on aggregations, self-joins and cross-joins, and applying window functions. But, additionally, we'll see problems that run in a similar vein.
For example, writing a query to get the second highest salary or writing a query to isolate every fifth purchase by a user utilizes the same
RANK function in SQL.
Understanding the commonalities between questions will help you understand the first step to solving SQL questions faster because you can re-use similar code and stitch together techniques on top of each other.
5. Start writing SQL
Finally, it's important to just start writing SQL. It's better to start writing an imperfect solution vs trying to perfectly understand the problem or trying to perfect the solution on the first try.
Verbalize your assumptions and what you're doing as you write SQL and your interviewer can then be put on the same page as you.
SQL Study plan for your next interview
SQL interview questions on Interview Query are bucketed into easy, medium, and hard and go over every SQL concept that we went through in this blog post.
Use Interview Query to level yourself at how good you are at SQL. Here’s how you should approach each problem.
- Try the SQL question without running any SQL code in the engine. Give yourself 15 to 20 minutes to work on each problem by just reading the problem statement and using the editor.
- Work out all the bugs and make sure that when you click run or submit, you feel like this is the definitive solution that will work without syntax errors.
- If you get stuck, try the problem hint. The hints on certain questions will push you into learning how to think about solving the problem at hand.
- Try submitting your code to check if it passes our test cases. If it misses a few test cases, figure out why by debugging your code.
- If you need to debug your code, you can hit the
RUN SQLbutton to try it on our test database. Run your code and check for syntax errors. If you get any, re-read your query to fix the necessary errors.
- Check your solution against ours. Take note of whether there are any optimizations you can use to write a more efficient query.
Thanks for Reading!
Start studying for your next SQL interview with resources from Interview Query. Try our SQL course or check out these helpful guides: Amazon SQL Questions, the SQL Case Study Interview, and Google SQL Questions.