SQL interviews for data analysts are very often merely screening interviews, but without a strong grasp of the language, they can be difficult to pass.
What do you need to study in order to prepare?
SQL questions for data analysts tend to fall into three buckets:
If you’re not prepared, it’s very easy to get overwhelmed. To help, we’ve highlighted everything you need to know about SQL case interviews, including how they’re conducted and different types of questions that can be asked, along with 15 sample SQL questions to help you study.
Reporting and metrics questions are the most common type of SQL questions that data analysts face. These mirror the day-to-day work of an analyst, assessing a candidate’s ability to pull business and reporting metrics that typically end up in dashboards.
In general, the metric that you need to pull will be clearly defined, and you’ll just have to write a clean SQL query to pull it.
Beyond metrics and reporting, a variety of concepts can be covered in SQL interviews for data analysts. Depending on the role, you may be asked to write SQL queries related to:
SQL technical screens are a part of nearly every analyst interview. In these screens, candidates are asked to answer real-world problems using SQL.
Most commonly, candidates are provided a dataset and asked to write a SQL query to return the desired data.
How do companies test proficiency? There are three main types of SQL interviews:
Regardless of the interview format, the types of SQL questions that get asked can range from basic definitions to advanced query writing and analysis-based problem-solving.
The three most common types of SQL questions for data analytics positions are:
1) Basic SQL questions
There are two types: definitions and basic SQL code writing.
Definitions-based SQL questions ask you to explain technical concepts, compare functions, and define how a concept is used. These questions are typically asked early in the interview process to quickly assess your proficiency.
On the other hand, a basic SQL coding question for analysts might ask you to write a simple query like getting the COUNT of a table or using the HAVING clause.
2) Reporting and metrics questions
This is the most common type of SQL question that data analysts face, and they mirror the work that analysts do on the job.
Typically, you will be provided with the desired output, and then have to write the query to return the value. Metrics SQL questions require more advanced coding skills. These problems typically require complex joins, sub-queries, or window functions to solve. ****
3) Analytics SQL questions
Analytics questions are usually asked later in the interview process, and they’re sometimes referred to as “analytics case studies.” Most commonly, a real-world problem is posed. You then have to define the metrics to solve the question and write a query to find the output.
As we’ve mentioned before, data analyst technical interviews generally start with some beginner SQL questions. This includes definitions of common features and functions in SQL, as well as basic SQL queries. For example, you might be asked to use the COUNT or AVG function. Simple joins may also be asked. Here are some easy SQL practice questions:
You’re given two tables, a users table with demographic information and the neighborhood they live in, and a neighborhoods table.
Write a query that returns all of the neighborhoods that have 0 users.
Users table:
columns | type |
---|---|
id | int |
name | varchar |
neighborhood_id | int |
created_at | datetime |
Neighborhoods table:
columns | type |
---|---|
id | int |
name | varchar |
city_id | int |
Output:
columns | type |
---|---|
neighborhood_name | varchar |
Hint: Our predicament is to find all the neighborhoods without users. In a sense, we need all the neighborhoods that do not have a singular user living in them. This means we have to introduce a concept of existence of a column in one table, but not in the other.
Let’s say we have our dataset looking like this:
neighborhoods.name | users.id |
---|---|
castro | 123 |
castro | 124 |
cole valley | null |
castro heights | 534 |
castro heights | 564 |
How do we remove the NULL?
What SQL query would we use to get our data so that we can display the neighborhood of Cole Valley as not having any users?
Whenever the question asks about finding values with 0 something (users, employees, posts, etc..), immediately think of the concept of LEFT JOIN! An inner join finds any values that are in both tables, a left join keeps only the values in the left table.
Our predicament is to find all the neighborhoods without users. To do this, we have to do a left join from the neighborhoods table to the users table.
If we then add in a where condition of WHERE users.id IS NULL, then we will get every single neighborhood without a singular user.
SELECT n.name
FROM neighborhoods AS n
LEFT JOIN users AS u
ON n.id = u.neighborhood_id
WHERE u.id IS NULL
An alternative method is to use the IN clause. This method involves finding all the neighborhoods that the users live in, and then using that as a filter to subset all of the existing neighborhoods in the neighborhoods table.
SELECT
name
FROM neighborhoods
WHERE id NOT IN (
SELECT DISTINCT neighborhood_id
FROM users
)
This method might be slightly more inefficient depending on the SQL interpreter used. In this secondary method, we have to find the DISTINCT number of neighborhood IDs, which would result in scanning the table once before doing a second scan on the neighborhoods table.
The first method, however, does a join but then only needs to do one scan through the combined tables to filter out the NULL values.
Overall, if you’re using an advanced SQL interpreter, it will find the most efficient method out of the existing ones no matter what.
Q1. Define the most common aggregate functions in SQL.
The big three in SQL are: COUNT, SUM, and AVG. Have definitions ready for common terms like these.
Q2. When would you use the GROUP BY function?
If we were to try to formulate a query selecting a number of rows with a WHERE clause and then an aggregate value alongside them, we would find that the query would return an error. That’s because SQL doesn’t know how to display all of the results of our query, which includes a single aggregate value but also a list of values, in a single table.
The GROUP BY clause enables us to compute aggregate values alongside discrete data by collapsing the distinct rows of our database into summary rows that share a particular characteristic.
Q3. Why would you use the WITH clause?
When multiple subqueries are needed, SQL code becomes less and less readable. In general, it is best practice to isolate subqueries with the WITH clause. A WITH clause sets up an isolated temporary table that your query can reference.
Q4. Given the tables users and rides, write a query to report the distance traveled by each user in descending order.
Hint: We need to accomplish two things. One is to figure out the total distance traveled for each user_id. The second is to then get the name of the user and order by the distance traveled.
Users table:
column | type |
---|---|
id | integer |
name | string |
Rides table:
column | type |
---|---|
id | integer |
passenger_user_id | integer |
distance | float |
Output:
column | type |
---|---|
name | string |
distance_traveled | float |
Intermediate SQL questions require candidates to perform more complex SQL functions. These questions might ask you to perform joins, sub-queries, self-joins, and window functions. Here are some sample medium SQL questions for data analysts:
Given a table of students and their SAT test scores, write a query to return the two students with the closest test scores with the score difference.
If there are multiple students with the same minimum score difference, select the student name combination that is higher in the alphabet.
Scores table:
column | type |
---|---|
id | integer |
student | varchar |
score | integer |
Example input:
id | student | score |
---|---|---|
1 | Jack | 1700 |
2 | Alice | 2010 |
3 | Miles | 2200 |
4 | Scott | 2100 |
Example output:
one_student | other_student | score_diff |
---|---|---|
Alice | Scott | 90 |
Hint: It’s helpful to think about this problem in the form of two different tables with the same values.
There are two parts to this question, the first part is comparing each combination of students and their SAT scores.
The second part is figuring out which two students’ scores are the closest.
Let’s work on the first part of comparison.
We have two of the same tables and we’ll label one s1 and one s2. Since we want to compare each student against each other student, we can do a variation of the CROSS JOIN by joining each user against each other.
INNER JOIN scores AS s2
ON s1.student != s2.student
Notice that we can compare each user while avoiding matching the student to their own test score. Now what would the resulting table look like if we ran this query?
SELECT
s1.student AS one_student
, s2.student AS other_student
, s1.score
, s2.score
FROM scores AS s1
INNER JOIN scores AS s2
ON s1.id != s2.id
Q1. We are given a table of product purchases. Each row in the table represents an individual user product purchase.
Write a query to get the number of customers that were upsold by purchasing additional products.
Transactions table:
column | type |
---|---|
id | integer |
user_id | integer |
created_at | datetime |
product_id | integer |
quantity | integer |
Note that if the customer purchased two things on the same day, it does not count as an upsell as they were purchased within a similar timeframe.
Q2. We have a table with an id and name field. The table holds over 100 million rows and we want to sample a random row in the table without throttling the database.
Write a query to randomly sample a row from this table.
Big table:
column | type |
---|---|
id | int |
name | varchar |
**Hint: **We know that the RAND() function actually returns a floating-point between 0 and 1. So, if we were to run this function:
SELECT RAND()
We would get a random decimal point to some Nth degree of precision. RAND() essentially allows us to seed a random value. How can we use this to select a random row quickly?
Q3. Given a table of job postings, write a query to retrieve the number of users that have posted each job only once and the number of users that have posted at least one job multiple times.
Each user has at least one job posting. Thus the sum of single_post and multiple_posts should equal the total number of distinct user_id’s.
Job postings table:
column | type |
---|---|
id | integer |
job_id | integer |
user_id | integer |
date_posted | datetime |
Hint: We want the value of two different metrics, the number of users that have posted their jobs once and the number of users that have posted at least one job multiple times. What does that mean exactly?
Q4. You’re given a table that represents search results from searches on Facebook. The query
column is the search term, position
column represents each position the search result came in, and the rating
column represents the human rating of the result from 1 to 5, where 5 is high relevance and 1 is low relevance.
Write a query to get the percentage of search queries where all of the ratings for the query results are less than a rating of 3. Please round your answer to two decimal points.
Search results table:
column | type |
---|---|
query | varchar |
result_id | integer |
position | integer |
rating | integer |
Hint: Here’s a neat trick. If a question includes the phrasing of “where ALL of the values are less/greater etc..” we should not be using a WHERE clause for our filtering.
Why is that? Because we have to do the filtering after a GROUP BY when looking for the cases where a certain condition must be applied to a distinct value. In this case, the distinct value is a query in which there are multiple search results per query.
Advanced SQL questions tend to fall into two categories: logic-based questions similar to what you see on Leetcode, and complex SQL query writing. Logic-based questions tend to look at data structures and algorithms, while advanced queries tend to assess skills like debugging code, using indices to tune SQL queries, and the ability to use advanced SQL causes.
Given the two tables, write a SQL query that creates a cumulative distribution of number of comments per user. Assume bin buckets class intervals of one.
Users table:
columns | type |
---|---|
id | integer |
name | string |
created_at | datetime |
neighborhood_id | integer |
sex | string |
Comments table:
columns | type |
---|---|
user_id | integer |
body | string |
created_at | datetime |
Output:
columns | type |
---|---|
frequency | integer |
cum_total | float |
Hint: What is a cumulative distribution exactly? If we were to imagine our output and figure out what we wanted to display on a cumulative distribution graph, what would the dataset look like?
A cumulative distribution will display the total value count of everything that has at max some frequency value.
So, for example, from our comments dataset, we want to know how many users made less than one comment, less than two comments, less than three etc…
frequency | cumulative |
---|---|
0 | 10 |
1 | 25 |
2 | 27 |
We can start out with the query to count the frequency of each user by joining users
to comments
and then grouping by the user id to get the number of comments per user.
WITH hist AS (
SELECT users.id, COUNT(comments.user_id) AS frequency
FROM users
LEFT JOIN comments
ON users.id = comments.user_id
GROUP BY 1
)
SELECT * FROM hist
Now that we have our histogram, how do we get a cumulative distribution? Specifically, we want to see our frequency table go from:
frequency | count |
---|---|
0 | 10 |
1 | 25 |
2 | 27 |
to:
frequency | cumulative |
---|---|
0 | 10 |
1 | 25 |
2 | 27 |
Hint: Did you make sure that users that did not comment get factored into the cumulative distribution graph?
Let’s see if we can find a pattern and logical grouping that gets us what we want. The constraints given to us are that we will probably have to self-join since we can compute the cumulative total from the data in the existing histogram table.
If we can model out that computation, we’ll find that the cumulative is taken from the sum all of the frequency counts lower than the specified frequency index. In which we can then run our self join on a condition where we set the left f1 table frequency index as greater than the right table frequency index.
WITH hist AS (
SELECT users.id, COUNT(comments.user_id) AS frequency
FROM users
LEFT JOIN comments
ON users.id = comments.user_id
GROUP BY 1
),
freq AS (
SELECT frequency, COUNT(*) AS num_users
FROM hist
GROUP BY 1
)
SELECT *
FROM freq AS f1
LEFT JOIN freq AS f2
ON f1.frequency >= f2.frequency
Now we just have to sum up the num_users column while grouping by the f1.frequency index.
WITH hist AS (
SELECT users.id, COUNT(c.user_id) AS frequency
FROM users
LEFT JOIN comments as c
ON users.id = c.user_id
GROUP BY 1
),
freq AS (
SELECT frequency, COUNT(*) AS num_users
FROM hist
GROUP BY 1
)
SELECT f1.frequency, SUM(f2.num_users) AS cum_total
FROM freq AS f1
LEFT JOIN freq AS f2
ON f1.frequency >= f2.frequency
GROUP BY 1
Try this question on Interview Query.
Q1. The schema above is for a retail online shopping company consisting of two tables, attribution and user_sessions.
First touch attribution is defined as the channel to which the converted user was associated with when they first discovered the website.
Calculate the first touch attribution for each user_id that converted.
Attribution table:
column | type |
---|---|
session_id | integer |
channel | string |
conversion | boolean |
User_sessions table:
column | type |
---|---|
session_id | integer |
created_at | datetime |
user_id | integer |
Output:
user_id | channel |
---|---|
123 | |
145 | |
153 | |
172 | organic |
173 |
**Hint: **How do we figure out the beginning path of the Facebook ad and connect it to the end purchasing user?
Q2. We are given a table of bank transactions with three columns: user_id, a deposit or withdrawal value (determined if the value is positive or negative), and created_at time for each transaction.
Write a query to get the total three day rolling average for deposits by day.
Bank_transactions table:
column | type |
---|---|
user_id | int |
created_at | datetime |
transaction_value | float |
Output:
column | type |
---|---|
dt | datetime |
rolling_three_day | float |
Hint: Usually if the problem states to solve for a moving/rolling average, we’re given the dataset in the form of a table with two columns, the date and the value.
This problem, however, is taken one step further with a table of just transactions with values conditioned to filtering for only deposits, and remove records representing withdrawals, denoted by a negative value (e.g. -10).
Q3. Given a table of product subscriptions with a subscription start date and end date for each user, write a query that returns true or false whether or not each user has a subscription date range that overlaps with any other user.
Subscriptions table:
column | type |
---|---|
user_id | int |
start_date | date |
end_date | date |
Example:
user_id | start_date | end_date |
---|---|---|
1 | 2019-01-01 | 2019-01-31 |
2 | 2019-01-15 | 2019-01-17 |
3 | 2019-01-29 | 2019-02-04 |
4 | 2019-02-05 | 2019-02-10 |
Output:
user_id | overlap |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
4 | 0 |
**Hint: **Let’s take a look at each of the conditions first and see how they could be triggered. Given two date ranges, what determines if the subscriptions would overlap?
Q4. We are given two tables, a table of notification deliveries and a table of users with created and purchase conversion dates. If the user hasn’t made a purchase, then the conversion_date
column is NULL.
Write a query to get the distribution of total push notifications before a user converts.
Notification deliveries table:
column | type |
---|---|
notification | varchar |
user_id | int |
created_at | datetime |
Users table:
column | type |
---|---|
id | int |
created_at | datetime |
conversion_date | datetime |
Output:
column | type |
---|---|
total_pushes | int |
frequency | int |
Hint: If we’re looking for the distribution of total push notifications before a user converts, what should the end result look like? What conditions for the JOIN would we need to use to return that result?
SQL is one of the most critical skills tested in data analyst interviews, but you can’t just simply brush up on SQL and expect to pass. You’ll also want to focus on Python and algorithms, product metrics, A/B testing and machine learning.
Need some help? See our guide Top Data Analyst Interview Questions for the most common questions asked in these subject areas. Or see the SQL lessons in our data science course.