Interview Query

Top 25+ Data Science SQL Interview Questions

Overview

Full-time data scientists use SQL every day to perform analysis and gather insights from data, which is why SQL is tested so frequently (around 70% of the time) in data science interviews.

Data science SQL interview questions mimic the work that data scientists actually perform. These questions ask candidates to pull metrics, aggregate data and conduct analysis. In the majority of interviews, you will be required to write out or whiteboard SQL queries in a short timeframe. Occasionally, you may have the opportunity to take this portion home and present your findings later, but you should not count on that format.

Data science SQL questions fall into three main categories:

  • Beginner SQL data science questions - These questions are definition-based, include use case examples or contain comparisons of SQL functions. A basic SQL data science question might be: What is the difference between WHERE and HAVING?

  • Intermediate data science SQL questions - Intermediate query writing questions will be the primary focus in data science interviews. You are provided with data or table schema and required to set up a SQL query to generate the requested metric. These questions will cover basic joins, aggregations and date manipulations.

  • Hard SQL data science questions - Hard SQL questions test SQL concepts like subqueries, window functions, and advanced joins. Additionally, you can expect multistep SQL case studies that ask you to investigate a business problem.

SQL Concepts for Data Science Interviews

In data scientist interviews, you will be asked to write SQL queries and, usually, these queries cover intermediate-to-hard SQL concepts. Although you can never truly know what will be asked, five SQL concepts are tested most frequently based on our survey of the industry.

If you can prepare for these concepts - which include aggregations, joins, date manipulations, window functions and advanced subqueries - you will do well in the SQL portion of any data science interview. These are the most frequently tested technical SQL concepts in data science interviews:

  • Aggregations - Aggregation using functions like SUM(), AVG(), and COUNT() are tested in nearly every interview. Categorizations using CASE statements also fall into this category. You should have a strong command of the aggregate functions in SQL for your data science interview.
  • JOINS - This is a must-have SQL skill for data science jobs. Data scientists pull data from more than one table each and every day on the job, and questions in this category test intermediate-to-hard JOINs in SQL.
  • Date manipulations - Time series analysis requires the ability to aggregate and process raw data by weeks, months, etc. This requires strong working knowledge of date manipulation in SQL.
  • Subqueries and Common Table Expressions (CTEs) - You should be prepared for basic subqueries using the FROM clause, as well as advanced subqueries using WHERE. These subqueries can be complex, and the best way to become comfortable in working through them is to practice as much as possible.
  • Window functions - Window functions are basically advanced aggregate functions that allow you to combine a wide variety of data.

What Types of SQL Questions Can You Expect?

Typically, when you are asked to write SQL code, you will be asked to perform a task. Some of the most common tasks you will be asked to demonstrate proficiency with include:

  • Reporting and metrics - These questions are the most common, and ask you to write queries that end up in dashboards. Although the output is clearly defined, these questions typically require complex joins, sub-queries, inner joins and window functions to reach the final product..

  • Analytics case studies - These questions are multi-step problems that test two skills: 1) understanding what metrics you require to answer the question, and 2) writing the correct SQL query to pull the metrics you have identified. Analytics case studies test your problem-solving skills, and are generally left more ambiguous than other types of SQL problems.

  • Database design - These questions test your knowledge of data architecture and design. Most importantly, they test whether you know how to design a database from scratch when you are given a business idea, application, or any other software that needs to interact with a database.

  • ETL questions - These questions test your ability to perform ETL tasks, and translates to extracting data from a database, transforming the data, and loading the data into another database or table. These questions are most common in engineering roles, but data scientists should expect them to appear periodically.

  • Logic-based questions - Logic-based SQL interview questions are very tricky. They are not typically based on real life examples so much as injecting the complexities of algorithms and data structure interviews into SQL questions. This is exemplified on sites such as LeetCode, where you will see quite a few interview questions that are not practical for real world scenarios.

How to Practice SQL Questions for Data Science Interviews Using the Interview Query Database?

image

The best approach for studying SQL is to practice as many problems as possible. To that end, Interview Query is constantly updating our question database, and already stands with 500+ problems banked for you to practice against. Here is a simple process to use for SQL practice problems from our database:

  1. Try the SQL question without running any SQL code in the engine. Give yourself 15 to 20 minutes to work on each problem by just reading the problem statement and using the editor.

  2. Work out all the bugs and make sure that when you click run or submit, you feel like this is the definitive solution that will work without syntax errors.

  3. If you get stuck, try the problem hint. The hints on certain questions will push you into learning how to think about solving the problem at hand.

  4. Try submitting your code to check if it passes our test cases. If it misses a few test cases, figure out why by debugging your code.

  5. If you need to debug your code, you can hit the RUN SQL button to try it on our test database. Run your code and check for syntax errors. If you get any, re-read your query to fix the necessary errors.

  6. Check your solution against ours. Take note of whether there are any optimizations you can use to write a more efficient query.

Beginner SQL Questions for Data Scientists

image

Definition-based basic SQL questions are commonly asked in interviews, and these questions are used by interviewers to quickly gauge your SQL knowledge. Commonly, these questions will lead to more challenging questions.

For example, you might be asked: “What is the difference between an INNER JOIN and a LEFT JOIN”? If you answer correctly, the interviewer might then provide a scenario and ask if you would use the INNER JOIN or LEFT JOIN to solve.

Some of the most common basic SQL questions for data scientists include:

1. What is a JOIN in SQL?

A JOIN is a keyword used to merge together two or more tables on the same key.

2. Which SQL command is used to add rows to a table?

The INSERT command is used to add rows to an existing table.

3. What is OLAP and OLTP? When do you denormalize data?

OLAP are databases intended for online analytical processing, while OLTP are databases intended for online transaction processing. Denormalize the data when it falls under OLAP operations and normalize when OLTP.

4. What is the difference between WHERE and HAVING?

The WHERE clause is used to filter rows before grouping and HAVING is used to exclude records after grouping.

5. When do you use the CASE WHEN function?

CASE WHEN lets you write complex conditional statements on the SELECT clause, and also allows you to pivot data from wide to long formats.

6. What is the difference between a LEFT JOIN and an INNER JOIN?

Breaking it down, a JOIN is used to combine data from multiple data sources. A LEFT JOIN combines and returns all rows from the left table, even if there are no matches in the right table. An INNER JOIN, only returns rows if there is a match in both of the tables.

7. When would you use UNION vs UNION ALL? What if there were no duplicates?

UNION is used to join multiple data sets. UNION is thus used to combine two queries into a single result, based on the parameters of the query. Conversely, UNION ALL extracts all rows from a set of two tables, including duplicates.

8. What is the difference between COUNT and COUNT DISTINCT?

The COUNT function is used to return the number of rows specified by the query, which is to say those rows that are specified by a WHERE condition. COUNT DISTINCT eliminates duplicate values before the count is applied, or it returns the number of unique rows in a table. Additionally, it does not count rows with NULL values.

9. When would you use a HAVING clause versus a WHERE clause?

You would use WHERE and HAVING if you are using them together with the GROUP BY clause. For example, WHERE would be used to filter rows before the GROUP BY clause, while HAVING would filter rows after the clause.

Easy SQL Data Science Questions

Easy SQL questions jump into basic SQL query writing. These questions are most common in technical screens, but they could show up in the on-site interview for more junior-level data science roles.

Easy questions in SQL interviews test your ability to perform basic aggregations like using COUNT or HAVING, if you can use basic joins or date-time manipulations.

10. You have a flights table that contains information about all the flights an airline has booked. Select all entries from the flights table.

Hint: The * serves as a wildcard that will capture all columns in a table.

11. Given two tables, users and neighborhood, write a query to find all neighborhoods that have zero users.

In this Google SQL interview question, we are given two tables. The users table includes a user_id, name, neighborhood_id, and a created_at timestamp. The second table includes a user_id, neighborhoods_name, and a city_id.

One hint: Whenever a SQL question asks you to find values with 0, immediately think of the concept of a LEFT JOIN. Remember that an INNER JOIN finds values that are in both tables, while a LEFT JOIN keeps only the values in the left table.

One additional step in this question would be to use the WHERE to select only NULL values. Here is a simple solution:

SELECT n.name   
FROM neighborhoods AS n 
LEFT JOIN users AS u
    ON n.id = u.neighborhood_id
WHERE u.id IS NULL

12. Write a query to return pairs of projects where the end date of one project matches the start date of another project.

Here is the schema for the projects table:

Columns Type
id INTEGER
title VARCHAR
start_date DATETIME
end_date DATETIME
budget FLOAT

With this question, our aim is to use values in a single table for comparisons. Specifically, we are looking to select pairs of projects in which one project starts on the same day another project ends.

This task requires us to perform an INNER JOIN. The order of the values must be precise to correctly perform this function. Pay particular attention of aliasing to make sure you keep your tables straight. For instance, you do not want to name one of your tables project_starts, but use that table to source the value for project_title_end, or vice-versa.

13. You are given a table that represents search results, and includes a search position and search rating. Write a query to measure the quality of the results for each search term.

More Context: The search results table includes:

  • Query (search term).
  • Result_ID
  • Position
  • Rating (1 to 5 where 5 is high relevance and 1 is low relevance)

For example, the query “cat” shows a picture of a cat, and includes a rating of 4 and a position of 1. However, the query “dog” shows a picture of a hotdog, with a rating of 2 and a position of 1. The ranking precision of “cat” is better than “dog.” How would you write a query to create a metric that could validate the position by their search precision?

Hint: Taking the average of the ratings seems like a good way to measure the quality of the search results.

We now have to find a way to weight the position in accordance to the rating to normalize the metric score. For example, in a table of two search results for one query:

  • If the first result is rated a 5 (position = 1, rating = 5) and the last result is rated a 1 (position = 2, rating = 1), this is good.
  • If the first result is rated a 5 (position = 1, rating = 5) and the last result is also rated a 5 (position = 2, rating = 5), this is even better
  • If the first result is rated a 1 (position = 1, rating = 1) and last result is rated a 5 (position = 2, rating = 5), this is bad

14. Write a SQL query to select the second-highest salary in the engineering department.

You are presented with an employees table (employee_id, first_name, last_name, salary and department_id) and a departments table (id, name). The department_id field in the employees table is associated with the “id” field in the departments table.

With this knowledge, department_id is a “foreign key” because it is a column that references the primary key of another table. Based on this common field, we can join both tables using INNER JOIN to associate the name of the department name to the employees that are a part of these departments.

SELECT salary
FROM employees
INNER JOIN departments
ON employees.department_id = departments.id

What else would you need to do to filter to just the employees in the engineering department, their order by salary, and selection of the top two salaries?

15. Given three tables representing customer transactions and customer attributes, write a query to get the average order value by gender.

You are provided with a transactions table (id, user_id, created_at, product_id, and quantity), a users table (id, name, and sex), a products table (id, name, and price). To answer the question, we would need to apply an INNER JOIN between the users and transactions tables to find the value.

Here is a quick solution:

SELECT
    u.sex
    , ROUND(AVG(quantity  *price), 2) AS aov
FROM users AS u
INNER JOIN transactions AS t
   ON u.id = t.user_id
INNER JOIN products AS p
    ON t.product_id = p.id
GROUP BY 1

16. Write a query to get the total amount spent on each item in the ‘purchases’ table by users that registered in 2022.

Let us say you have a database with two tables. The first table is composed of user information, including their registration date, and the second table is of purchases, which has the purchase history (if any) for all users.

 Here is a process you can use to solve this: 
  • Use INNER JOIN or JOIN to connect the users and purchases table by the user_id column.
  • Filter the results by using the WHERE clause.
  • Use GROUP BY to aggregate item, and apply SUM to calculate the amount spent.

17. You are given a table containing information about the rides of users for a ride-sharing company. Write a query to find the duration of each ride over two hours.

Note: Find the duration in minutes and sort results by duration in descending order.

rides table:

Columns Type
id INTEGER
passenger_user_id INTEGER
start_dt DATETIME
end_dt DATETIME
distance FLOAT
ride_region VARCHAR
is_completed INTEGER

Output:

Column Type
duration_minutes INTEGER

Here is a quick solution:

SELECT  TIMESTAMPDIFF(MINUTE,start_dt,end_dt) AS duration_minutes 
FROM rides 
WHERE TIMESTAMPDIFF(MINUTE,start_dt,end_dt) > 120 
ORDER BY duration_minutes DESC

Learn how to prepare for the SQL interview with our SQL course

Medium SQL Data Science Questions

Medium SQL coding questions jump into more complex joins, aggregations, window functions and more. Generally, these questions are more advanced reporting and metrics questions that require multi-step aggregations and creative problem-solving to answer.

18. Given a transactions table, write a query that finds the third purchase of every user.

We need an indicator of which purchase was the third value. Whenever we think of ranking our dataset, it is helpful to then immediately think of a specific window function we can use.

We need to apply the RANK function to the transactions table. The RANK function is a window function that assigns a rank to each row in the partition of the result set.

SELECT *, RANK() OVER (
    PARTITION BY user_id ORDER BY created_at ASC
        ) AS rank_value
FROM transactions
LIMIT 100

What else do we need to do, now that we have the transactions in ranked order?

19. Find the average number of downloads for free vs. paying customers, broken down by day.

More Context: You are given two tables, accounts (acount_id, paying_customer) and downloads (account_id, download_date, downloads).

We need to use data from both tables to solve the question, so the first thing we figure out is how to join them together. Since we should only consider accounts that had downloads, we can use an INNER JOIN (or just JOIN).

An INNER JOIN will discard accounts with no records present in the downloads table. To calculate an average, we can use the AVG function. The AVG function is an aggregate function, so we need to apply a GROUP BY function to group results by the columns date and paying customer.

20. Given a table of job postings, write a query to break down the number of users that have posted their jobs once versus the number that have posted at least one job multiple times.

If a user has five jobs but has only posted each job once, then they are part of the single_post user base. But if the user has five jobs and has posted a total of seven times, then at least one job must have multiple postings.

In general, if a user’s total number of postings exceeds that user’s total number of distinct jobs, the pigeonhole principle tells us at least one must have been posted multiple times.

We first write a subquery to get an organized version of the job_postings and name it user_job.

We want a count of total job postings per user and job. Since each job posting has a unique id, we write our subquery to count posting ids and distinct job ids per user.

We use COUNT DISTINCT on job_id to get a unique row for each job and COUNT on id as all id are already unique. We then GROUP BY user_id so we can compare the number of distinct jobs per user, denoted as num_jobs, with the number of total posts per user denoted as n_posts.

21. Given a table product purchases, write a query to get the number of customers that were upsold additional products.

Note: If the customer purchased two things on the same day, we do not count that as an upsell, as they were purchased within a similar timeframe.

Questions like this are common in Amazon SQL interviews.

Hint: An upsell is determined by multiple days by the same user. Therefore, we have to group by both the date field and the user_id to get each transaction broken out by day and user.

SELECT 
    user_id
    , DATE(created_at) AS date
FROM transactions
GROUP BY 1,2

Now we just have to filter for the users that purchased on multiple dates. How can we do this?

Evaluating what is given to us, we have an initial query that shows the purchases on each date. So effectively we need to now count the number of distinct dates that a user purchases on.

22. Given a table of students and their SAT scores, write a query to return the two students with the closest test scores.

Note: If there are multiple students with the same minimum score difference, select the student name combination that is higher in the alphabet.

Given that the problem statement is referencing one table with only two columns, we have to self-reference different creations of the same table. It is helpful to think about this problem in the form of two different tables with the same values.

There are two parts to this question:

  • Comparing each combination of students and their SAT scores.
  • Determining which two students’ scores are then the closest.

For the first part of the comparison, we have two of the same tables s1 and s2. Since we want to compare each student against each other, we can do a variation of the CROSS JOIN by joining each user against each other.

INNER JOIN scores AS s2
    ON s1.student != s2.student

23. Write a SQL query to create a metric to recommend pages for each user based on recommendations from pages their friends have already liked.

More Context: You want to build a naïve recommender, and you are given two tables. One is called friends (user_id and friend_id) and another called page_likes (with user_id and page_id). Note: it shouldn’t recommend pages that the user already likes.

Let’s solve this Facebook SQL interview question by visualizing what kind of output we want from the query. Given that we have to create a metric by which to evaluate recommending a user to a page, we know we want something with a user_id and a page_id along with some sort of recommendation score.

Let’s try to think of an easy way to represent the scores of each user_id and page_id combo. One naïve method would be to create a score by summing up the total likes by friends on each page that the user has not currently liked. Then the max value on our metric will be the most recommendable page.

The first thing we have to do is then to write a query to associate users to their friends’ liked pages. We can do that easily with an initial join between the two tables.

WITH t1 AS (
    SELECT 
        f.user_id 
        , f.friend_id
        , pl.page_id 
    FROM friends AS f
    INNER JOIN page_likes AS pl
        ON f.friend_id = pl.user_id
)

What comes next to reaching the desired output?

Hard SQL Data Science Questions

Hard SQL questions in data science interviews can include complex queries that involve sub-queries, window functions and advanced joins. However, one of the most common types of hard SQL questions is the SQL case study.

24. Write a query to show the number of users, number of transactions placed and total order amount per month in the year 2020.

Assume that we are only interested in the monthly reports for a single year (January to December). You are given three tables: transactions, products and users. But do you need the data from all three tables?

Since there is no information in the users table that explicitly relates to the outcome of our query, we can use just the transactions and products table to generate our number of customers and their total purchase amounts per month.

How will we get the number of customers per month, excluding repeat customers?

Hint: We have decided to COUNT the DISTINCT users in our transactions table to generate our number of monthly customers. How can we take advantage of the relationship between our transactions and products tables to calculate the total purchase amount per month of any given customer?

25. Calculate the first-touch attribution for each user_id that converted.

The schema below is for a retail online shopping company consisting of two tables, attribution and user_sessions. Here are some details of the two tables:

  • The attribution table logs a session visit for each row.
  • If conversion is true, then the user converted to a purchase on that session.
  • The channel column represents which advertising platform the user was attributed to for that specific session.
  • Lastly the user_sessions table maps session visits back to one user, from a single visit all the way up to serval on the same day.

How do we solve this one? First-touch attribution is defined as the channel to which the converted user was associated with when they first discovered the website. It is helpful to sketch out the attribution model for converting users:

  • 1st Session: User sees Facebook ad -> Clicks to order -> Leaves
  • 2nd Session: User sees Google ad -> Leaves
  • 3rd Session: User types in website -> Clicks to order -> Purchases

How do we figure out the beginning path of the Facebook ad and connect it to the end purchasing user?

We need to do two actions: 1) subset all of the users that converted to customers, and 2) figure out their first session visit to attribute the actual channel. We can do that by creating a subquery that only gets the distinct users that have actually converted.

26. Write a query to find each user with a subscription date range that overlaps with any other user.

You are provided with a table of product subscriptions with subscription start and end dates for each user.

Take a look at each of the conditions first and see how they could be triggered. Given two date ranges, what determines if the subscriptions would overlap?

Let’s set an example with two date ranges: A and B.

Let Condition A>B demonstrate that DateRange A is completely after DateRange B. _ |—- DateRange A ——| |—Date Range B —–| _

When would Condition A>B be true?

27. Write a query to create a new table, named flights, that displays unique pairs of two locations.

Note: Duplicate pairs from the flights table, such as Dallas to Seattle and Seattle to Dallas, should have one entry in the flight routes output table.

flights table

Column Type
id INTEGER
source_location VARCHAR
destination_location VARCHAR

How could you solve this using the LEAST and GREATEST functions?

Strategies for Answering Data Science SQL Questions

image

Let’s review the common strategies when tackling SQL interview questions.

1. Repeat the problem statement

When presented with a SQL question, listen carefully to the problem description and repeat back what you think the crux of the problem is. The interviewer can then help verify if your understanding is correct.

2. Understand the edge cases

If time permits, write out a base case and an edge case to show that you understand the problem. For example: if the interviewer asks you to pull the average number of events per user per day, write out an example scenario where you’re verifying this metric.

Do duplicate events matter? Are we looking at distinct users? These are questions we need to clarify.

3. Try working backwards if the problem is tricky

Sketching out what the output of the SQL question will look like is a great strategy towards solving the problem. Usually, if we know what the end output table is supposed to look like, we can work backwards from there on what functions need to be applied before.

For example, if the output looks like this:

date average events per user
2021-12-01 3.5
2021-12-02 4.0

We know that the table before this aggregation would have to look something like this:

date event user_id
2021-12-01 click 1
2021-12-01 view 1
……

From here, we can figure out what functions I should use to get to my desired output!

4. Pattern match to different functions

As you practice more and more SQL exercises, what you will find is that many SQL problems follow similar patterns. There are techniques we can use in SQL, like utilizing HAVING on aggregations, self-joins and cross-joins, and applying window functions, and these functions will appear repeatedly the more problems we practice on. For example, writing a query to get the second highest salary or writing a query to isolate every fifth purchase by a user utilizes the same RANK function in SQL.

Understanding the commonalities between questions will help you understand the first step to solving SQL questions faster, because you can re-use similar code and stitch together techniques on top of each other.

5. Start writing SQL

Finally, it’s important to just start writing SQL. It is better to start writing an imperfect solution than to try and perfectly understand the problem or trying to perfect the coding solution on the first try.

Verbalize your assumptions and what you are doing as you write SQL, and your interviewer can then be put on the same page as you, even helping you to navigate your response. These processes are most often collaborative in a business environment, so you can show your ability to brainstorm and communicate during the interview.