Interview Query

15 SQL Questions for Data Analysts

Explore easy, medium and hard SQL questions for data analysts. Plus, see what gets asked and how you can study.

SQL Interviews for Data Analysts: Overview

SQL interviews for data analysts are often just screening interviews, but without a strong grasp of the language, they can be difficult to pass.

What do you need to study in order to prepare?

SQL questions for data analysts tend to fall into three buckets:

  • Definitions-based questions - e.g. What are the differences between WHERE and HAVING?
  • Reporting and metrics - e.g. Write a query to return X metric.
  • Analytics cases - e.g. Here’s a business problem. What metrics would you use to solve it? Write a query to find this metric.

If you’re not prepared, it’s very easy to get overwhelmed. To help, we’ve highlighted everything you need to know about SQL case interviews, including how they’re conducted and different types of questions that can be asked, along with 15 sample SQL questions to help you study.

What Kind of SQL Questions are Asked in Data Analyst Interviews?

Reporting and metrics questions are the most common type of SQL questions that data analysts face. These mirror the day-to-day work of an analyst, assessing a candidate’s ability to pull business and reporting metrics that typically end up in dashboards.

In general, the metric that you need to pull will be clearly defined, and you’ll just have to write a clean SQL query to pull it.

Beyond metrics and reporting, a variety of concepts can be covered in SQL interviews for data analysts. Depending on the role, you may be asked to write SQL queries related to:

  • Business and product cases
  • Statistics and probability
  • A/B testing

How Is SQL Tested in Data Analyst Interviews?

SQL technical screens are a part of nearly every analyst interview. In these screens, candidates are asked to answer real-world problems using SQL.

Most commonly, candidates are provided a dataset and asked to write a SQL query to return the desired data.

How do companies test proficiency? There are three main types of SQL interviews:

  1. Whiteboarding - In a whiteboard test, you’re required to write SQL queries by hand, which allows companies to assess your understanding of SQL concepts and your problem-solving ability.
  2. Coding tests - Many companies ask you to write code and run queries in live interviews. With live coding screens, you can check for syntax errors while you work and it provides companies a way to see your coding efficiency.
  3. SQL case studies - In case interviews, you’re given a real-world problem and asked to use your SQL skills to solve the problem. These are typically open-ended questions that leave room for analysis and creative problem-solving.

Prep for your data analyst interview with our guide: Top Data Analyst Interview Questions for 2021.

Types of SQL Questions for Data Analysts

Regardless of the interview format, the types of SQL questions that get asked can range from basic definitions to advanced query writing and analysis-based problem-solving.

The three most common types of SQL questions for data analytics positions are:

1) Basic SQL questions

There are two types: definitions and basic SQL code writing.

Definitions-based SQL questions ask you to explain technical concepts, compare functions, and define how a concept is used. These questions are typically asked early in the interview process to quickly assess your proficiency.

On the other hand, a basic SQL coding question for analysts might ask you to write a simple query like getting the COUNT of a table or using the HAVING clause.
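To make the distinction concrete, here is a minimal sketch using Python's sqlite3 module and an invented pets table (the table and data are purely illustrative):

```python
import sqlite3

# Hypothetical table, just to demonstrate COUNT and HAVING.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE pets (id INTEGER, species TEXT);
INSERT INTO pets VALUES (1, 'cat'), (2, 'cat'), (3, 'dog');
""")

# COUNT of a whole table.
total = cur.execute("SELECT COUNT(*) FROM pets").fetchone()[0]

# HAVING filters on an aggregate, after grouping.
rows = cur.execute("""
    SELECT species, COUNT(*) AS n
    FROM pets
    GROUP BY species
    HAVING COUNT(*) >= 2
""").fetchall()
print(total, rows)  # 3 [('cat', 2)]
```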

2) Reporting and metrics questions

This is the most common type of SQL question that data analysts face, and these questions mirror the work that analysts do on the job.

Typically, you will be provided with the desired output, and then have to write the query to return the value. Metrics SQL questions require more advanced coding skills. These problems typically require complex joins, sub-queries, or window functions to solve.

3) Analytics SQL questions

Analytics questions are usually asked later in the interview process, and they’re sometimes referred to as “analytics case studies.” Most commonly, a real-world problem is posed. You then have to define the metrics to solve the question and write a query to find the output.

Easy SQL Questions for Data Analysts

As we’ve mentioned before, data analyst technical interviews generally start with some beginner SQL questions. This includes definitions of common features and functions in SQL, as well as basic SQL queries. For example, you might be asked to use the COUNT or AVG function. Simple joins may also be asked. Here are some easy SQL practice questions:

Question:

You’re given two tables, a users table with demographic information and the neighborhood they live in, and a neighborhoods table.

Write a query that returns all of the neighborhoods that have 0 users.

Users table:

columns type
id int
name varchar
neighborhood_id int
created_at datetime

Neighborhoods table:

columns type
id int
name varchar
city_id int

Output:

columns type
neighborhood_name varchar

Hint: Our task is to find all the neighborhoods without users. In other words, we need all the neighborhoods that do not have a single user living in them. This means we have to introduce the concept of a value existing in one table but not in the other.

Solution:

Let’s say we have our dataset looking like this:

neighborhoods.name users.id
castro 123
castro 124
cole valley null
castro heights 534
castro heights 564

How do we remove the NULL?

What SQL query would we use to get our data so that we can display the neighborhood of Cole Valley as not having any users?

Whenever a question asks about finding values with 0 of something (users, employees, posts, etc.), immediately think of the concept of a LEFT JOIN! An inner join returns only the rows with matching values in both tables, while a left join keeps every row from the left table, whether or not it has a match.

Our task is to find all the neighborhoods without users. To do this, we have to do a left join from the neighborhoods table to the users table.

If we then add a where condition of WHERE users.id IS NULL, we will get every neighborhood without a single user.

SELECT n.name   
FROM neighborhoods AS n 
LEFT JOIN users AS u
    ON n.id = u.neighborhood_id
WHERE u.id IS NULL

An alternative method is to use the IN clause. This method involves finding all the neighborhoods that the users live in, and then using that as a filter to subset all of the existing neighborhoods in the neighborhoods table.

SELECT
  name
FROM neighborhoods
WHERE id NOT IN (
    SELECT DISTINCT neighborhood_id
    FROM users
)

This method may be slightly less efficient, depending on the SQL engine. Here we have to find the DISTINCT neighborhood IDs, which means scanning the users table once before doing a second scan on the neighborhoods table.

The first method, however, does a join but then only needs to do one scan through the combined tables to filter out the NULL values.

Overall, if you’re using an advanced SQL interpreter, it will find the most efficient method out of the existing ones no matter what.

Try this question on Interview Query.

Q1. Define the most common aggregate functions in SQL.

The big three in SQL are: COUNT, SUM, and AVG. Have definitions ready for common terms like these.

Q2. When would you use the GROUP BY function?

If we write a query that selects a number of rows with a WHERE clause and then tries to return an aggregate value alongside them, the query returns an error. That’s because SQL doesn’t know how to display a single aggregate value next to a list of row-level values in a single table.

The GROUP BY clause enables us to compute aggregate values alongside discrete data by collapsing the distinct rows of our database into summary rows that share a particular characteristic.
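A minimal sketch of this behavior, using Python's sqlite3 and an invented orders table:

```python
import sqlite3

# GROUP BY collapses rows that share a key into summary rows.
# The `orders` table here is hypothetical, purely for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
INSERT INTO orders VALUES
    (1, 'alice', 10.0),
    (2, 'alice', 20.0),
    (3, 'bob',   5.0);
""")

# One summary row per customer, with aggregates alongside the key.
rows = cur.execute("""
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY customer
""").fetchall()
print(rows)  # [('alice', 2, 30.0), ('bob', 1, 5.0)]
```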

Q3. Why would you use the WITH clause?

When multiple subqueries are needed, SQL code becomes less and less readable. In general, it is best practice to isolate subqueries with the WITH clause. A WITH clause sets up an isolated temporary table that your query can reference.
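For example (a sketch with a made-up sales table), a WITH clause names an intermediate result so the outer query stays readable:

```python
import sqlite3

# Hypothetical sales table, just to demonstrate the WITH clause.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('north', 100), ('north', 50), ('south', 30);
""")

# region_totals is a named temporary result the outer query can reference.
rows = cur.execute("""
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region, total
    FROM region_totals
    WHERE total > 50
""").fetchall()
print(rows)  # [('north', 150.0)]
```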

Q4. Given the tables users and rides, write a query to report the distance traveled by each user in descending order.

Hint: We need to accomplish two things. One is to figure out the total distance traveled for each user_id. The second is to then get the name of the user and order by the distance traveled.

Users table:

column type
id integer
name string

Rides table:

column type
id integer
passenger_user_id integer
distance float

Output:

column type
name string
distance_traveled float
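One possible approach (a sketch with invented sample data, not necessarily the intended solution) is a join, then a GROUP BY per user, then an ORDER BY on the summed distance:

```python
import sqlite3

# Invented sample rows matching the schemas described above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE users (id INTEGER, name TEXT);
CREATE TABLE rides (id INTEGER, passenger_user_id INTEGER, distance REAL);
INSERT INTO users VALUES (1, 'ana'), (2, 'ben');
INSERT INTO rides VALUES (1, 1, 5.0), (2, 1, 2.5), (3, 2, 10.0);
""")

# Sum distance per user, then sort descending.
rows = cur.execute("""
    SELECT u.name, SUM(r.distance) AS distance_traveled
    FROM users AS u
    JOIN rides AS r
        ON u.id = r.passenger_user_id
    GROUP BY u.id, u.name
    ORDER BY distance_traveled DESC
""").fetchall()
print(rows)  # [('ben', 10.0), ('ana', 7.5)]
```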

Medium SQL Questions for Data Analysts

Intermediate SQL questions require candidates to perform more complex SQL functions. These questions might ask you to perform joins, sub-queries, self-joins, and window functions. Here are some sample medium SQL questions for data analysts:

Question:

Given a table of students and their SAT test scores, write a query to return the two students with the closest test scores with the score difference.

If there are multiple students with the same minimum score difference, select the student name combination that is higher in the alphabet.

Scores table:

column type
id integer
student varchar
score integer

Example input:

id student score
1 Jack 1700
2 Alice 2010
3 Miles 2200
4 Scott 2100

Example output:

one_student other_student score_diff
Alice Scott 90

Hint: It’s helpful to think about this problem in the form of two different tables with the same values.

Solution:

There are two parts to this question: the first is comparing each combination of students and their SAT scores.

The second part is figuring out which two students’ scores are the closest.

Let’s work on the first part of comparison.

We have two copies of the same table, which we’ll label s1 and s2. Since we want to compare each student against every other student, we can do a variation of a CROSS JOIN by joining every row against every other row.

INNER JOIN scores AS s2
    ON s1.id != s2.id

Notice that we can compare each user while avoiding matching the student to their own test score. Now what would the resulting table look like if we ran this query?

SELECT 
    s1.student AS one_student
    , s2.student AS other_student
    , s1.score
    , s2.score
FROM scores AS s1
INNER JOIN scores AS s2
    ON s1.id != s2.id
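Running that query pairs every student with every other student. One way to finish from there (a sketch, not necessarily the intended solution) is to compute the absolute score difference per pair, keep each pair only once in alphabetical order, and take the smallest difference with alphabetical tie-breaking:

```python
import sqlite3

# The example input from the question, loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE scores (id INTEGER, student TEXT, score INTEGER);
INSERT INTO scores VALUES
    (1, 'Jack', 1700), (2, 'Alice', 2010),
    (3, 'Miles', 2200), (4, 'Scott', 2100);
""")

rows = cur.execute("""
    SELECT
        s1.student AS one_student,
        s2.student AS other_student,
        ABS(s1.score - s2.score) AS score_diff
    FROM scores AS s1
    INNER JOIN scores AS s2
        ON s1.student < s2.student   -- each pair once, alphabetical order
    ORDER BY score_diff ASC, one_student ASC, other_student ASC
    LIMIT 1
""").fetchall()
print(rows)  # [('Alice', 'Scott', 90)]
```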

Try this question on Interview Query.

Q1. We are given a table of product purchases. Each row in the table represents an individual user product purchase.

Write a query to get the number of customers that were upsold by purchasing additional products.

Transactions table:

column type
id integer
user_id integer
created_at datetime
product_id integer
quantity integer

Note that if the customer purchased two things on the same day, it does not count as an upsell as they were purchased within a similar timeframe.
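One reading of this question, assuming an upsell means a user purchased on more than one distinct day (an assumption based on the note above), can be sketched as:

```python
import sqlite3

# Invented transactions: user 1 buys twice on one day (not an upsell),
# user 2 buys on two different days (an upsell).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE transactions (
    id INTEGER, user_id INTEGER, created_at TEXT,
    product_id INTEGER, quantity INTEGER);
INSERT INTO transactions VALUES
    (1, 1, '2020-01-01', 10, 1),
    (2, 1, '2020-01-01', 11, 1),
    (3, 2, '2020-01-01', 10, 1),
    (4, 2, '2020-02-01', 12, 1);
""")

# A user is upsold if they purchased on more than one distinct date.
rows = cur.execute("""
    SELECT COUNT(*) AS num_upsold
    FROM (
        SELECT user_id
        FROM transactions
        GROUP BY user_id
        HAVING COUNT(DISTINCT DATE(created_at)) > 1
    ) AS upsold
""").fetchall()
print(rows)  # [(1,)]
```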

Q2. We have a table with an id and name field. The table holds over 100 million rows and we want to sample a random row in the table without throttling the database.

Write a query to randomly sample a row from this table.

Big table:

column type
id int
name varchar

Hint: We know that the RAND() function returns a floating-point value between 0 and 1. So, if we were to run this function:

SELECT RAND()

We would get a random decimal to some Nth degree of precision. RAND() essentially gives us a random value we can scale. How can we use this to select a random row quickly?
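One common pattern (a sketch, not the only answer; SQLite's equivalent of RAND() is RANDOM(), which returns a large integer rather than a 0-1 float) is to pick a random point in the id range and grab the first row at or past it, instead of sorting the whole table by a random number:

```python
import sqlite3

# Small stand-in for the 100M-row table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE big_table (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO big_table VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e');
""")

# ABS(RANDOM()) % MAX(id) + 1 seeds a random id in [1, MAX(id)];
# id >= that value handles any gaps, and the primary-key index makes
# the lookup cheap, with no full-table scan or sort.
rows = cur.execute("""
    SELECT id, name
    FROM big_table
    WHERE id >= (SELECT ABS(RANDOM()) % MAX(id) + 1 FROM big_table)
    ORDER BY id
    LIMIT 1
""").fetchall()
print(rows)
```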

Q3. Given a table of job postings, write a query to retrieve the number of users that have posted each job only once and the number of users that have posted at least one job multiple times.

Each user has at least one job posting. Thus, the sum of single_post and multiple_posts should equal the total number of distinct user_ids.

Job postings table:

column type
id integer
job_id integer
user_id integer
date_posted datetime

Hint: We want the value of two different metrics, the number of users that have posted their jobs once and the number of users that have posted at least one job multiple times. What does that mean exactly?
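One possible interpretation, sketched with invented data: count posts per (user, job) pair, then classify each user by whether any of their jobs was posted more than once:

```python
import sqlite3

# Invented rows: user 1 posts two different jobs once each,
# user 2 posts the same job twice.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE job_postings (
    id INTEGER, job_id INTEGER, user_id INTEGER, date_posted TEXT);
INSERT INTO job_postings VALUES
    (1, 100, 1, '2020-01-01'),
    (2, 101, 1, '2020-01-02'),
    (3, 200, 2, '2020-01-01'),
    (4, 200, 2, '2020-01-05');
""")

rows = cur.execute("""
    WITH posts_per_job AS (
        SELECT user_id, job_id, COUNT(*) AS n_posts
        FROM job_postings
        GROUP BY user_id, job_id
    ),
    per_user AS (
        SELECT user_id, MAX(n_posts) AS max_posts
        FROM posts_per_job
        GROUP BY user_id
    )
    SELECT
        SUM(CASE WHEN max_posts = 1 THEN 1 ELSE 0 END) AS single_post,
        SUM(CASE WHEN max_posts > 1 THEN 1 ELSE 0 END) AS multiple_posts
    FROM per_user
""").fetchall()
print(rows)  # [(1, 1)]
```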

Q4. You’re given a table that represents search results from searches on Facebook. The query column is the search term, position column represents each position the search result came in, and the rating column represents the human rating of the result from 1 to 5, where 5 is high relevance and 1 is low relevance.

Write a query to get the percentage of search queries where all of the ratings for the query results are less than a rating of 3. Please round your answer to two decimal points.

Search results table:

column type
query varchar
result_id integer
position integer
rating integer

Hint: Here’s a neat trick. If a question includes phrasing like “where ALL of the values are less than/greater than, etc.”, we should not be using a WHERE clause for our filtering.

Why is that? Because we have to do the filtering after a GROUP BY when looking for the cases where a certain condition must be applied to a distinct value. In this case, the distinct value is a query in which there are multiple search results per query.
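Here is a sketch of that pattern with invented data: the per-query condition is applied after a GROUP BY (via MAX(rating)), and the percentage is computed over the grouped rows:

```python
import sqlite3

# Invented sample: query 'cat' has all ratings < 3, query 'dog' does not.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE search_results (
    query TEXT, result_id INTEGER, position INTEGER, rating INTEGER);
INSERT INTO search_results VALUES
    ('cat', 1, 1, 1), ('cat', 2, 2, 2),
    ('dog', 3, 1, 2), ('dog', 4, 2, 5);
""")

rows = cur.execute("""
    WITH per_query AS (
        SELECT query, MAX(rating) AS max_rating
        FROM search_results
        GROUP BY query          -- the condition applies per query, not per row
    )
    SELECT ROUND(
        100.0 * SUM(CASE WHEN max_rating < 3 THEN 1 ELSE 0 END) / COUNT(*),
        2
    ) AS percentage
    FROM per_query
""").fetchall()
print(rows)  # [(50.0,)]
```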

Hard SQL Questions for Data Analysts

Advanced SQL questions tend to fall into two categories: logic-based questions similar to what you see on Leetcode, and complex SQL query writing. Logic-based questions tend to look at data structures and algorithms, while advanced queries tend to assess skills like debugging code, using indices to tune SQL queries, and the ability to use advanced SQL clauses.

Question:

Given the two tables, write a SQL query that creates a cumulative distribution of the number of comments per user. Assume bin buckets with class intervals of one.

Users table:

columns type
id integer
name string
created_at datetime
neighborhood_id integer
sex string

Comments table:

columns type
user_id integer
body string
created_at datetime

Output:

columns type
frequency integer
cum_total float

Hint: What is a cumulative distribution exactly? If we were to imagine our output and figure out what we wanted to display on a cumulative distribution graph, what would the dataset look like?

Solution:

A cumulative distribution displays, for each frequency value, the total count of everything with at most that frequency value.

So, for example, from our comments dataset, we want to know how many users made at most zero comments, at most one comment, at most two comments, and so on:

frequency cumulative
0 10
1 25
2 27

We can start out with the query to count the frequency of each user by joining users to comments and then grouping by the user id to get the number of comments per user.

WITH hist AS (
    SELECT users.id, COUNT(comments.user_id) AS frequency
    FROM users
    LEFT JOIN comments
        ON users.id = comments.user_id
    GROUP BY 1
)

SELECT * FROM hist

Now that we have our histogram, how do we get a cumulative distribution? Specifically, we want to see our frequency table go from:

frequency count
0 10
1 15
2 2

to:

frequency cumulative
0 10
1 25
2 27

Hint: Did you make sure that users that did not comment get factored into the cumulative distribution graph?

Let’s see if we can find a pattern and logical grouping that gets us what we want. The constraint is that we will probably have to self-join, since we can compute the cumulative total from the data in the existing histogram table.

If we model out that computation, we’ll find that the cumulative total is the sum of all the frequency counts at or below the specified frequency index. We can then run our self join with a condition where the left table’s (f1) frequency index is greater than or equal to the right table’s (f2).

WITH hist AS (
    SELECT users.id, COUNT(comments.user_id) AS frequency
    FROM users
    LEFT JOIN comments
        ON users.id = comments.user_id
    GROUP BY 1
),

freq AS (
    SELECT frequency, COUNT(*) AS num_users
    FROM hist
    GROUP BY 1
)

SELECT * 
FROM freq AS f1
LEFT JOIN freq AS f2
    ON f1.frequency >= f2.frequency

Now we just have to sum up the num_users column while grouping by the f1.frequency index.

WITH hist AS (
    SELECT users.id, COUNT(c.user_id) AS frequency
    FROM users
    LEFT JOIN comments as c
        ON users.id = c.user_id
    GROUP BY 1
),

freq AS (
    SELECT frequency, COUNT(*) AS num_users
    FROM hist
    GROUP BY 1
)

SELECT f1.frequency, SUM(f2.num_users) AS cum_total
FROM freq AS f1
LEFT JOIN freq AS f2
    ON f1.frequency >= f2.frequency
GROUP BY 1

Try this question on Interview Query.

Q1. The schema below is for a retail online shopping company and consists of two tables, attribution and user_sessions.

  • The attribution table logs a session visit for each row.
  • If conversion is true, then the user converted to buying on that session.
  • The channel column represents which advertising platform the user was attributed to for that specific session.
  • The user_sessions table maps many to one session visits back to one user.

First touch attribution is defined as the channel the converted user was associated with when they first discovered the website.

Calculate the first touch attribution for each user_id that converted.

Attribution table:

column type
session_id integer
channel string
conversion boolean

User_sessions table:

column type
session_id integer
created_at datetime
user_id integer

Output:

user_id channel
123 facebook
145 google
153 facebook
172 organic
173 email

Hint: How do we figure out the beginning path of the Facebook ad and connect it to the end purchasing user?
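One possible shape for the answer (a sketch with invented data; the conversion column is modeled as 0/1 since SQLite has no boolean type): find the converted users, find each such user's earliest session, then look up that session's channel:

```python
import sqlite3

# Invented sessions: user 123 first arrives via facebook, converts later
# on a google session; user 145 never converts.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE attribution (session_id INTEGER, channel TEXT, conversion INTEGER);
CREATE TABLE user_sessions (session_id INTEGER, created_at TEXT, user_id INTEGER);
INSERT INTO attribution VALUES
    (1, 'facebook', 0), (2, 'google', 1),
    (3, 'organic',  0);
INSERT INTO user_sessions VALUES
    (1, '2020-01-01', 123), (2, '2020-01-05', 123),
    (3, '2020-01-02', 145);
""")

rows = cur.execute("""
    WITH converted AS (
        SELECT DISTINCT us.user_id
        FROM attribution AS a
        JOIN user_sessions AS us ON a.session_id = us.session_id
        WHERE a.conversion = 1
    ),
    first_sessions AS (
        SELECT user_id, MIN(created_at) AS first_at
        FROM user_sessions
        GROUP BY user_id
    )
    SELECT us.user_id, a.channel
    FROM first_sessions AS fs
    JOIN converted AS c ON c.user_id = fs.user_id
    JOIN user_sessions AS us
        ON us.user_id = fs.user_id AND us.created_at = fs.first_at
    JOIN attribution AS a ON a.session_id = us.session_id
""").fetchall()
print(rows)  # [(123, 'facebook')]
```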

Q2. We are given a table of bank transactions with three columns: user_id, a deposit or withdrawal value (determined if the value is positive or negative), and created_at time for each transaction.

Write a query to get the total three day rolling average for deposits by day.

Bank_transactions table:

column type
user_id int
created_at datetime
transaction_value float

Output:

column type
dt datetime
rolling_three_day float

Hint: Usually if the problem states to solve for a moving/rolling average, we’re given the dataset in the form of a table with two columns, the date and the value.

This problem, however, goes one step further: we’re given a table of raw transactions, so we first have to filter for deposits only, removing records that represent withdrawals, which are denoted by a negative value (e.g. -10).
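A sketch of that two-step shape, assuming an engine with window-function support (SQLite 3.25+ here) and that the ROWS frame counts rows rather than calendar days, so gaps in the date sequence would need extra handling:

```python
import sqlite3

# Invented transactions; the -5.0 row is a withdrawal and is excluded.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE bank_transactions (
    user_id INTEGER, created_at TEXT, transaction_value REAL);
INSERT INTO bank_transactions VALUES
    (1, '2020-01-01', 10.0),
    (2, '2020-01-01', -5.0),
    (1, '2020-01-02', 20.0),
    (2, '2020-01-03', 30.0);
""")

rows = cur.execute("""
    WITH daily_deposits AS (
        SELECT DATE(created_at) AS dt, SUM(transaction_value) AS total
        FROM bank_transactions
        WHERE transaction_value > 0     -- deposits only
        GROUP BY DATE(created_at)
    )
    SELECT dt,
           AVG(total) OVER (
               ORDER BY dt
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS rolling_three_day
    FROM daily_deposits
    ORDER BY dt
""").fetchall()
print(rows)  # [('2020-01-01', 10.0), ('2020-01-02', 15.0), ('2020-01-03', 20.0)]
```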

Q3. Given a table of product subscriptions with a subscription start date and end date for each user, write a query that returns true or false whether or not each user has a subscription date range that overlaps with any other user.

Subscriptions table:

column type
user_id int
start_date date
end_date date

Example:

user_id start_date end_date
1 2019-01-01 2019-01-31
2 2019-01-15 2019-01-17
3 2019-01-29 2019-02-04
4 2019-02-05 2019-02-10

Output:

user_id overlap
1 1
2 1
3 1
4 0

Hint: Let’s take a look at each of the conditions first and see how they could be triggered. Given two date ranges, what determines if the subscriptions would overlap?
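Two date ranges overlap exactly when each one starts on or before the other ends. One way to turn that check into a query (a sketch against the example data, not necessarily the intended solution) is a self join:

```python
import sqlite3

# The example input from the question; ISO date strings compare correctly
# as text in SQLite.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE subscriptions (user_id INTEGER, start_date TEXT, end_date TEXT);
INSERT INTO subscriptions VALUES
    (1, '2019-01-01', '2019-01-31'),
    (2, '2019-01-15', '2019-01-17'),
    (3, '2019-01-29', '2019-02-04'),
    (4, '2019-02-05', '2019-02-10');
""")

# Left join each user to any OTHER user whose range intersects theirs;
# a non-null match means overlap = 1.
rows = cur.execute("""
    SELECT
        s1.user_id,
        MAX(CASE WHEN s2.user_id IS NOT NULL THEN 1 ELSE 0 END) AS overlap
    FROM subscriptions AS s1
    LEFT JOIN subscriptions AS s2
        ON s1.user_id != s2.user_id
        AND s1.start_date <= s2.end_date
        AND s1.end_date >= s2.start_date
    GROUP BY s1.user_id
    ORDER BY s1.user_id
""").fetchall()
print(rows)  # [(1, 1), (2, 1), (3, 1), (4, 0)]
```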

Q4. We are given two tables, a table of notification deliveries and a table of users with created and purchase conversion dates. If the user hasn’t made a purchase, then the conversion_date column is NULL.

Write a query to get the distribution of total push notifications before a user converts.

Notification deliveries table:

column type
notification varchar
user_id int
created_at datetime

Users table:

column type
id int
created_at datetime
conversion_date datetime

Output:

column type
total_pushes int
frequency int

Hint: If we’re looking for the distribution of total push notifications before a user converts, what should the end result look like? What conditions for the JOIN would we need to use to return that result?
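One possible shape for the answer (a sketch with invented data; it assumes pushes are counted only when delivered before conversion_date, and non-converters are excluded):

```python
import sqlite3

# Invented data: user 1 gets 2 pushes before converting, user 2 gets 1
# (a later push arrives after conversion), user 3 never converts.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE notification_deliveries (
    notification TEXT, user_id INTEGER, created_at TEXT);
CREATE TABLE users (id INTEGER, created_at TEXT, conversion_date TEXT);
INSERT INTO users VALUES
    (1, '2020-01-01', '2020-01-10'),
    (2, '2020-01-01', '2020-01-05'),
    (3, '2020-01-01', NULL);
INSERT INTO notification_deliveries VALUES
    ('push', 1, '2020-01-02'),
    ('push', 1, '2020-01-03'),
    ('push', 2, '2020-01-02'),
    ('push', 2, '2020-01-08');
""")

rows = cur.execute("""
    WITH pushes_before AS (
        SELECT u.id, COUNT(n.user_id) AS total_pushes
        FROM users AS u
        LEFT JOIN notification_deliveries AS n
            ON n.user_id = u.id
            AND n.created_at < u.conversion_date
        WHERE u.conversion_date IS NOT NULL
        GROUP BY u.id
    )
    SELECT total_pushes, COUNT(*) AS frequency
    FROM pushes_before
    GROUP BY total_pushes
    ORDER BY total_pushes
""").fetchall()
print(rows)  # [(1, 1), (2, 1)]
```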

More SQL Resources for Data Analysts

SQL is one of the most critical skills tested in data analyst interviews, but you can’t simply brush up on SQL and expect to pass. You’ll also want to focus on Python and algorithms, product metrics, A/B testing, and machine learning.

Need some help? See our guide Top Data Analyst Interview Questions for the most common questions asked in these subject areas. Or see the SQL lessons in our data science course.