Top SQL Case Study Questions

Top SQL Case Study Questions

What is a SQL Case Study?

The majority of SQL interview questions are straightforward. You may be asked for definitions, or to write a clearly defined SQL query.

But SQL case study questions are an entirely different beast.

These questions usually start with a hypothetical business or product issue, e.g. unsubscribe rates are falling. Then, you have to define what metrics could be used to investigate the problem, and then write the query to produce those metrics.

One of the best ways to prepare for SQL case study interviews is to walk through solutions step-by-step. This will show you how to think about metrics in hypotheticals, as well as how to walk interviewers through your logic.

We’ve done that here, with two breakdowns of SQL case questions with clear solutions.

Example SQL Case Question: Unsubscribe Rates

Many SQL case study questions will ask you to investigate correlation. In this example SQL case question, we’re looking into this issue: Unsubscribe rates have increased after a new notification system has been introduced.

Question:

Twitter wants to roll out more push notifications to users because they think users are missing out on good content. Twitter decides to do this in an A/B test.

Say that after more notifications are released, there is a sudden increase in the total number of unsubscribes.

We’re given two tables: events where actions are ‘login’, ‘nologin’, and ‘unsubscribe’ and another table called variants where user’s are bucketed into a control and a variant A/B test.

Given these tables, write a query to display a graph to understand how unsubscribes are affecting login rates over time.

Note: Let’s say that all users are automatically put into the A/B test.

events table

Column Type
user_id INTEGER
created_at DATETIME
action STRING

variants table

Column Type
user_id INTEGER
experiment STRING
variant STRING

Step 1: Start Each SQL Case Study by Making Assumptions

This question asks us to compare multiple variables at play here. Specifically, we’re looking at:

  • There is a new notification system.
  • We’re interested in the effect the new notifications are having on unsubscribes.

We’re not sure how unsubscribes are affecting login rates, but we can plot a graph that would help us visualize how the login rates change before and after an unsubscribe from a user.

We can also see how the login rates compare for unsubscribes for each bucket of the A/B test. Given that we want to measure two different changes, we have to eventually do a GROUP BY of two different variables:

  1. Date
  2. Bucket variant

Step 2: Develop a Hypothesis for the SQL Case Question

In order to visualize this, we’ll need to plot two lines on a 2D graph.

  • The x-axis represents days until unsubscribing with a range of -30 to 0 to 30, in which -30 is thirty days before unsubscribing and 30 is 30 days after unsubscribing.
  • The y-axis represents the average login rate for each day. We’ll be plotting two lines for each of the A/B test variants, control and test.

Now that we have what we’re going to graph, it’s a matter of writing a SQL query to get the dataset for the graph.

We can make sure our dataset looks something like this:

variant days_since_unsub login_rate
control -30 90%
test -30 91%

Each column represents a different axis or line for our graph.

Step 3: SQL Coding + Analysis

We know that we have to get every user that has unsubscribed, so we’ll first INNER JOIN the abtest table to the events table, where there exists an unsubscribe event. Now we’ve isolated all users that have ever unsubscribed.

Additionally, we have to then get every event in which the user has logged in, and divide it by the total number of users that are eligible within the timeframe.

SELECT 
    variant
    , DATEDIFF(e1.created_at, e2.created_at) AS days_since_unsub
    , ROUND(COUNT(DISTINCT CASE WHEN e2.action = 'login' 
THEN 1 ELSE 0 END)/COUNT(DISTINCT abtest.user_id) * 100, 2)AS login_rate
FROM variants as abtest
INNER JOIN events AS e1
    ON abtest.user_id = e1.user_id
        AND e1.action = 'unsubscribe'
INNER JOIN events AS e2
    ON abtest.user_id = e2.user_id
        AND e2.action IN ('login', 'nologin')
        AND DATEDIFF(e1.created_at, e2.created_at) 
            BETWEEN -30 AND 30
GROUP BY 1,2

Example SQL Case Question: LinkedIn Job Titles

Many SQL case questions require creativity to solve. You’re given a hypothesis, but then have to determine how to prove or disprove it with specific metrics. The key here is walking the interviewer through your thought process. This example SQL case question from LinkedIn explores user career paths.

Question:

We’re given a table of user experiences representing each person’s past work experiences and timelines.

Specifically, let’s say we’re interested in analyzing the career paths of data scientists. The titles we care about are bucketed into data scientist, senior data scientist, and data science manager.

We’re interested in determining if a data scientist who switches jobs more often ends up getting promoted to a manager role faster than a data scientist that stays at one job for longer.

Write a query to prove or disprove this hypothesis.

user_experiences table

Column Type
id INTEGER
user_id INTEGER
title STRING
company STRING
start_date DATETIME
end_date DATETIME
is_current_role BOOLEAN

Step 1: Make Assumptions about the SQL Case Question

The hypothesis is that data scientists that end up switching jobs more often get promoted faster.

Therefore, in analyzing this dataset, we can prove this hypothesis by separating the data scientists into specific segments based on how often they shift in their careers.

For example, if we look at the number of job switches for data scientists that have been in their field for five years, we could prove the hypothesis if the number of data science managers increased along with the number of career jumps.

Here’s what that might look like:

  • Never switched jobs: 10% are managers
  • Switched jobs once: 20% are managers
  • Switched jobs twice: 30% are managers
  • Switched jobs three times: 40% are managers

We could look at this over different buckets of time as well to see if the correlation stays consistent after 10 or 15 years in a data science career.

This analysis proves to be correct except for the fact that it doesn’t count the intention of the data scientist. What happens if the data scientist didn’t ever want to become a manager?

Step 2: Come up with a Hypothesis for the SQL Case Question

There’s one flaw in the assumption there. It doesn’t account for the intention of the data scientist. It doesn’t answer the question: What happens if the data scientist didn’t ever want to become a manager?

One way to solve this is to do the analysis backwards.

We can subset all of the existing data science managers and see how often they ended up switching jobs before they got to their first manager position.

Then divide the number of job switches by the amount of time it took for them to achieve the manager position themselves. This way, we can end up with a result that looks like this:

  • Job switches: 1 - Average months to promotion: 50
  • Job switches: 2 - Average months to promotion: 46
  • Job switches: 3 - Average months to promotion: 44

But there is a fault with this analysis as well. What about all those data scientists that have switched jobs / not switched jobs but haven’t become managers yet? They could be one month away from being a manager and be subsetted out of our analysis!

We have to then make some assumptions about the distribution of existing data science managers.

Are the years of experience before they became managers normally distributed? If not, then our results might be a bit biased from our hindsight analysis.

Step 3: Write the SQL Case Query

We first make a CTE called manager_promo with all the user_ids that have been promoted to data science managers.

Next, we count the number of job switches before getting promoted as num_jobs_switched.

Then, we calculate the number of months before promotion to the data science manager position as month_to_promo.

Finally, we order by the number of jobs switched.

WITH manager_promo as (
    SELECT user_id
        , MIN(start_date) as promotion_date
    FROM playground.user_experiences
    WHERE title='data science manager'
    GROUP BY user_id
)

SELECT num_jobs_switched
    , AVG(
        TIMESTAMPDIFF(MONTH, career_started_date, promotion_date)
    ) as month_to_promo
FROM (
    SELECT u.user_id
        , mp.promotion_date
        , COUNT(DISTINCT u.id) as num_jobs_switched
        , MIN(start_date) as career_started_date
    FROM playground.user_experiences u
    INNER JOIN manager_promo mp 
        on u.user_id=mp.user_id
    WHERE u.start_date<mp.promotion_date
    GROUP BY u.user_id, mp.promotion_date
) tt
GROUP BY num_jobs_switched
ORDER BY 2 DESC

Step 4: Perform Analysis and Make Conclusions

Hint: Talk about any conclusions you could draw from your data, but also be prepared to talk about trade-offs and potential flaws.

With the query result, we can draw conclusions about the months it took each distinct user to be promoted to data science manager.

Be warned this solution is not perfect. The edge cases where users never become promoted to data science managers are not considered.

Finally, many adjustments, like creating buckets for different ranges of months (0-20 months to promotion, 20-40 months to promotion, etc.), can present a more digestible, high-level analysis on whether frequent job changes affect promotion opportunities to the data science manager position.

Each bucket would correspond to the average time it took the users in that bucket to be promoted to a data science manager position.

Learn more about SQL questions

This course is designed to help you learn everything you need to know about working with data, from basic concepts to more advanced techniques.

More SQL Resources to Ace Your Interview

If you have an interview coming up, review Interview Query’s data science course, which includes modules in SQL.

SQL interviews are demanding, and the more you practice all types of SQL interview questions and not just case questions, the more confident and efficient you’ll become in answering them.