Interview Query

Top Data Analyst Interview Questions for 2021

Introduction

Over ten years ago, a data analyst interview was very simple. All you needed to say was two things:

  1. “Yes, I know Excel!”

  2. “I’m a great communicator.”

Nowadays, the role of a data analyst has changed. Not only have salaries shot up, but data analysts are more in demand than ever before due to their insight and analytical skillset.

Most data analyst jobs at tech companies require a strong technical skillset combined with good judgement. In this guide, we’ll break down the interview process and the most common questions you’ll face in each round.

Data Analyst Interview Guide

Technical interviews for data analyst roles are typically multi-stage interviews. They start with initial screens designed to weed out candidates, and quickly progress to more technically demanding screens.

Here’s a typical breakdown of the data analyst interview process:

1) Initial Phone Screen

Initial screens are generally calls with recruiters. These screens assess your experience, your interests/specializations, and salary expectations. In some cases, you may be asked basic SQL questions or a simple scenario-based case question.

Sample question: Tell us about a challenging project you have worked on. What were some of the obstacles you had to overcome?

2) Technical Interview

The technical screen assesses your technical skills. In many cases, SQL is a primary focus. SQL questions range from basic definitions to writing intermediate-to-advanced queries. Depending on the job function, you may also be asked technical questions about Python, statistics and probability, algorithms, and A/B testing.

Sample question: Given two tables, write a query to find the number of users that accepted friend requests in the last month.

3) Take-Home Challenge

Take-homes are longer tests that may take several hours to complete. These challenges are designed to evaluate your ability to handle data, perform analysis, and present your results effectively. Typically, these tests will ask you to perform an investigation on a dataset and present your findings.

Sample question: Prepare a summary of sales and website data for the Vice President of Marketing. Include an overview of website traffic and sales, as well as areas for improvement.

4) On-site Interview

Data analyst on-site interviews typically consist of three to five hour-long sessions. They cover traditional SQL and statistics questions, as well as data analytics case studies and behavioral questions.

Sample question: Describe an analytics project you worked on. What were some challenges you faced?

Different Types of Data Analyst Interview Questions

Interview Query regularly analyzes the contents of data analyst interviews. By tagging common keywords and mapping them back to question topics across 10,000+ tech companies, we’ve found that SQL questions are asked most frequently.

In fact, in interviews for data analyst roles, SQL and data manipulation questions are asked 85% of the time.


Here are the types of technical interview questions data analysts get asked most frequently: SQL, Python coding, statistics and probability, and A/B testing and experimentation.

Additionally, for more traditional data analyst roles, expect interview questions around Excel and data visualization.

Let’s first dive into how to approach and answer behavioral interview questions.

Behavioral Interview Questions for Data Analysts

Behavioral questions in data analyst interviews ask about specific situations you’ve been in where you had to apply particular skills or knowledge.

For many data analysts, behavioral questions can be fairly tough.

One tip: Always try to relate the question back to your experience and strengths.

Q1. Describe a time when you spotted an inconsistency. How did you respond?

Successful data analysts can help businesses identify anomalies and respond quickly.

For data sense questions, think about a time that you were able to spot an inconsistency in data quality, and how you eventually addressed it.

Q2. Talk about a time when you had to make a decision under a lot of uncertainty.

Q3. How would you convey insights and the methods you use to a non-technical audience?

Q4. How do you set goals and achieve them? Give us an example.

Q5. Describe a time when you solved a conflict at work.

Q6. Give an example of a situation when you have shown effectiveness, empathy, humbleness, and adaptability.

Q7. Give me an example of a time when you failed on a project.

Q8. Talk about an occasion when you used logic to solve a problem.

Q9. What do you do if you disagree with your manager?

SQL Interview Questions for Data Analysts


Data analysts use SQL to query data to solve complex business problems or find answers for other employees. In general, SQL data analyst questions focus on analytics and reporting:

  • Basic SQL Questions - These include the basics, e.g. definitions, as well as beginner SQL queries.

  • Analytics Questions – These add an extra step on top of a regular SQL question: you first have to figure out what kind of report or graph to build, and then write a query to generate it.

  • Reporting Questions – SQL reporting questions replicate the work many data or business analysts do on a day-to-day basis, e.g. writing queries.

Reporting interview questions focus on writing a query to generate an already-known output, such as producing a report or a metric from an example table.


See more SQL questions and tips for answering in our guide: 15 SQL Questions for Data Analysts.

Basic SQL Interview Questions

Q1: What are the different ways of handling NULL when querying a data set?

To handle such a situation, we can use three different operations:

  • IS NULL − This operator returns true if the column value is NULL.

  • IS NOT NULL − This operator returns true if the column value is not NULL.

  • <=> − This operator compares values and, unlike the = operator, returns true even for two NULL values.
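To see the difference in practice, here is a minimal sketch using Python’s built-in sqlite3 module and an invented in-memory table. Note that <=> is MySQL-specific; SQLite’s NULL-safe comparison operator is IS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@x.com"), (2, None), (3, None)])

# WHERE email = NULL evaluates to NULL, so it matches nothing;
# IS NULL / IS NOT NULL are the correct checks.
eq_null = conn.execute("SELECT COUNT(*) FROM users WHERE email = NULL").fetchone()[0]
is_null = conn.execute("SELECT COUNT(*) FROM users WHERE email IS NULL").fetchone()[0]
not_null = conn.execute("SELECT COUNT(*) FROM users WHERE email IS NOT NULL").fetchone()[0]
print(eq_null, is_null, not_null)  # 0 2 1
```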

Q2: What’s the difference between UNION and UNION ALL? (Asked by Facebook)

UNION and UNION ALL are SQL operators used to concatenate 2 or more result sets. This allows us to write multiple SELECT statements, retrieve the desired results, then combine them together into a final, unified set.

The main difference between UNION and UNION ALL is that:

  • UNION: only keeps unique records

  • UNION ALL: keeps all records, including duplicates
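A quick sketch with SQLite (table names invented) makes the distinction concrete:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (v INTEGER)")
conn.execute("CREATE TABLE b (v INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

# UNION removes the duplicate 2; UNION ALL keeps it.
union = conn.execute("SELECT v FROM a UNION SELECT v FROM b ORDER BY v").fetchall()
union_all = conn.execute("SELECT v FROM a UNION ALL SELECT v FROM b ORDER BY v").fetchall()
print(union)      # [(1,), (2,), (3,)]
print(union_all)  # [(1,), (2,), (2,), (3,)]
```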

Q3: What is the difference between a SQL view and table? (Asked by Kaiser Permanente)

A table is structured with columns and rows. A view is a virtual table extracted from a database by writing a query.

Q4: What’s the difference between an INNER and OUTER JOIN?

The difference between an inner and outer join is that inner joins result in the intersection of two tables, whereas outer joins result in the union of two tables.
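Here is a small sketch of that difference using SQLite (the schema is invented for illustration; SQLite spells a left outer join as LEFT JOIN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (user_id INTEGER, total INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ana"), (2, "bo")])
conn.execute("INSERT INTO orders VALUES (1, 50)")

# Inner join: only users with a matching order.
inner = conn.execute(
    "SELECT u.name, o.total FROM users u "
    "JOIN orders o ON u.id = o.user_id ORDER BY u.id").fetchall()
# Left outer join: every user, with NULL where no order matches.
outer = conn.execute(
    "SELECT u.name, o.total FROM users u "
    "LEFT JOIN orders o ON u.id = o.user_id ORDER BY u.id").fetchall()
print(inner)  # [('ana', 50)]
print(outer)  # [('ana', 50), ('bo', None)]
```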

Reporting SQL Questions

Q5. We have a table with an id and name field. The table holds over 100 million rows and we want to sample a random row in the table without throttling the database. Write a query to randomly sample a row from this table.

big_table

column type
id int
name varchar

Most SQL databases provide a RAND() function, which we would normally call like this:

SELECT * FROM big_table
ORDER BY RAND()

The function will randomly sort the rows in the table. This function works fine and is fast if you only have let’s say around 1,000 rows. It might take a few seconds to run at 10K. And then at 100K maybe you have to go to the bathroom or cook a meal before it finishes.

What happens at 100 million rows?

Someone in DevOps is probably screaming at you.

Random sampling is important in SQL with scale. We don’t want to use the pre-built function because it wasn’t meant for performance. But maybe we can re-purpose it for our own use case.

We know that the RAND() function actually returns a floating-point between 0 and 1. So if we were to instead call:

SELECT RAND()

We would get a random decimal to some Nth degree of precision. RAND() essentially gives us a random seed value between 0 and 1. How can we use this to select a random row quickly?

Let’s try to grab a random number using RAND() from our table that can be mapped to an id. Given we have 100 million rows, we probably want a random number from 1 to 100 million. We can do this by multiplying our random seed from RAND() by the max number of rows in our table.

SELECT CEIL(
    RAND() * (SELECT MAX(id) FROM big_table)
)

We use the CEIL function to round the random value to an integer. Now we have to join back to our existing table to get the value.

What happens if we have missing or skipped id values, though? We can solve for this by joining on all ids greater than or equal to our random value and selecting only the nearest neighbor when there is no direct match.

As soon as one row is found, we stop (LIMIT 1). And we read the rows according to the index (ORDER BY id ASC). Now our performance is optimal.

SELECT r1.id, r1.name
FROM big_table AS r1 
INNER JOIN (
    SELECT CEIL(RAND() * (
        SELECT MAX(id)
        FROM big_table)
    ) AS id
) AS r2
    ON r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1

Q6. Given a table of job postings, write a query to break down the number of users that have posted their jobs once versus the number of users that have posted at least one job multiple times.

Hint: We want the value of two different metrics, the number of users that have posted their jobs once and the number of users that have posted at least one job multiple times. What does that mean exactly?
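One way to read the hint: for each user, count how many times they posted each job, then bucket users by whether any of their jobs was posted more than once. The sketch below runs that logic in SQLite under an assumed schema (user_id, job_id):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_postings (user_id INTEGER, job_id INTEGER)")
conn.executemany("INSERT INTO job_postings VALUES (?, ?)",
                 [(1, 10), (1, 10),   # user 1 posted job 10 twice
                  (2, 20), (2, 21),   # user 2 posted two different jobs once each
                  (3, 30)])           # user 3 posted one job once

posted_once, posted_multiple = conn.execute("""
    WITH per_job AS (
        SELECT user_id, job_id, COUNT(*) AS n
        FROM job_postings
        GROUP BY user_id, job_id
    )
    SELECT SUM(CASE WHEN max_n = 1 THEN 1 ELSE 0 END) AS posted_once,
           SUM(CASE WHEN max_n > 1 THEN 1 ELSE 0 END) AS posted_multiple
    FROM (SELECT user_id, MAX(n) AS max_n FROM per_job GROUP BY user_id)
""").fetchone()
print(posted_once, posted_multiple)  # 2 1
```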

Q7. Let’s say we have a table representing a company payroll schema.

Due to an ETL error, the employees table did an insert for every compensation adjustment instead of updating the existing salaries each year. The head of HR still needs the current salary of each employee.

Write a query to get the current salary for each employee.

Assume no duplicate combinations of first and last names (i.e., no two John Smiths).

The first step would be to remove duplicates and retain only the current salary for each employee.

Given we know there aren’t any duplicate first and last name combinations, we can remove duplicates from the employees table by running a GROUP BY on two fields, the first and last name. This gives us one row per unique name combination.
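One way to express this, sketched in SQLite under the assumption that the id column autoincrements (so the largest id per name is the most recent insert):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees
    (id INTEGER, first_name TEXT, last_name TEXT, salary INTEGER)""")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)",
                 [(1, "john", "smith", 80000),
                  (2, "jane", "doe", 90000),
                  (3, "john", "smith", 85000)])  # raise was inserted, not updated

# keep each employee's most recent row
rows = conn.execute("""
    SELECT e.first_name, e.last_name, e.salary
    FROM employees e
    WHERE e.id = (SELECT MAX(id) FROM employees
                  WHERE first_name = e.first_name
                    AND last_name = e.last_name)
    ORDER BY e.first_name
""").fetchall()
print(rows)  # [('jane', 'doe', 90000), ('john', 'smith', 85000)]
```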

Analytics SQL Questions

Q8. Given a table of search results, write a query to compute a metric to measure the quality of the search results for each query.

search_results table

column type
query varchar
result_id integer
position integer
rating integer

You want to be able to compute a metric that measures the precision of the ranking system based on position. For example, if the results for dog and cat are….

query  result_id  position  rating  notes
dog    1000       1         2       picture of hotdog
dog    998        2         4       dog walking
dog    342        3         1       zebra
cat    123        1         4       picture of cat
cat    435        2         2       cat memes
cat    545        3         1       pizza shops

…we would rank ‘cat’ as having a better search result ranking precision than ‘dog’ based on the correct sorting by rating.

Write a query to create a metric that can validate and rank the queries by their search result precision. Round the metric (avg_rating column) to 2 decimal places.
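One candidate metric, as a hedged sketch: weight each rating by the inverse of its position, so that highly rated results near the top score higher, then average per query. With the example data this ranks cat above dog, as expected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE search_results
    (query TEXT, result_id INTEGER, position INTEGER, rating INTEGER)""")
conn.executemany("INSERT INTO search_results VALUES (?, ?, ?, ?)",
                 [("dog", 1000, 1, 2), ("dog", 998, 2, 4), ("dog", 342, 3, 1),
                  ("cat", 123, 1, 4), ("cat", 435, 2, 2), ("cat", 545, 3, 1)])

# average of rating / position, rounded to 2 decimal places
metric = conn.execute("""
    SELECT query, ROUND(AVG(rating * 1.0 / position), 2) AS avg_rating
    FROM search_results
    GROUP BY query
    ORDER BY avg_rating DESC
""").fetchall()
print(metric)  # [('cat', 1.78), ('dog', 1.44)]
```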

Q9. Given the two tables, write a SQL query that creates a cumulative distribution of the number of comments per user. Assume bin buckets with class intervals of one.

Hint: What is a cumulative distribution exactly? If we were to imagine our output and figure out what we wanted to display on a cumulative distribution graph, what would the dataset look like?

Q10. We are given a table of bank transactions with three columns: user_id, a deposit or withdrawal value (positive for deposits, negative for withdrawals), and a created_at time for each transaction.

Write a query to get the total three day rolling average for deposits by day.

Usually if the problem states to solve for a moving/rolling average, we’re given the dataset in the form of a table with two columns, the date and the value.

This problem, however, goes one step further: we’re given a table of raw transactions, so we first have to filter for deposits only, removing records that represent withdrawals (denoted by a negative value, e.g. -10).
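A sketch of one solution in SQLite (column names assumed; window functions require SQLite 3.25+): filter to deposits, total them by day, then average each day with the two days before it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions
    (user_id INTEGER, value REAL, created_at TEXT)""")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                 [(1, 10, "2024-01-01"), (2, -5, "2024-01-01"),
                  (1, 20, "2024-01-02"), (3, 30, "2024-01-03")])

rows = conn.execute("""
    WITH daily AS (
        SELECT DATE(created_at) AS dt, SUM(value) AS total
        FROM transactions
        WHERE value > 0            -- deposits only
        GROUP BY DATE(created_at)
    )
    SELECT dt,
           AVG(total) OVER (ORDER BY dt
                            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_avg
    FROM daily
""").fetchall()
print(rows)  # [('2024-01-01', 10.0), ('2024-01-02', 15.0), ('2024-01-03', 20.0)]
```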

Data Analytics Case Study

Data analytics case study questions combine a rotating mix of product intuition, business estimation, and data analytics.

Case questions come up in interviews when the job responsibilities lean to more of a heavy analytics space with an emphasis on solving problems and producing insights for management.

Many times data analysts will transition into a heavy analytics role when they’re required to take on more scope around the product and provide insights that upper level management can understand and interpret.

So data analytics case study questions will focus on a particular problem and you will be judged on how you break down the question, analyze the problem, and communicate your insights.

Here’s an example data analytics case study question:

Q1. You work at Stack Overflow on the community team that monitors the health of the platform.

Community members can create a post to ask a question, and other users can reply with answers or comments to that question. The community can express their support for the post by upvoting or downvoting.

post_analytics table:

column type description
id int Primary key of posts table
user_id int ID of the user who created the post
created_at datetime Timestamp of the post
title string Title of the post
body string Text content of the post
comment_count int Total number of comments on a post
view_count int Total number of views on a post
answer_count int Total number of answers on a post
upvotes int Total number of upvotes on the post

1) Given a table of Stack Overflow posts data, suggest three metrics to monitor the health of the community.

2) Write the queries for these metrics in SQL.

This is a classic data analytics case study. A question like this is designed to assess your data intuition, product sense, and ability to isolate key metrics.

Remember: There isn’t one correct answer, but usually, the conversation should head in a similar direction.

For example, this question asks about community health. Broadly, there are several metrics to consider: growth rate, engagement, and user retention would all provide insight into the community’s health.

The challenge with this question is determining how to measure those metrics with the data provided.

Q2. Describe an analytics experiment that you designed. How were you able to measure success?

Case questions sometimes take the form of behavioral questions. Data analysts get tasked with experimenting with data to test new features or campaigns. Many behavioral questions will ask about experiments, but also tap into how you approach measuring your results.

With questions like these, be sure to describe the objective of the experiment, even if it was a simple A/B test. Don’t be afraid to get technical and explain the metrics you used and the process you used to quantify the results.

Q3. You’re given a table that represents search results from searches on Facebook. The query column is the search term, position column represents each position the search result came in, and the rating column represents the human rating from 1 to 5 where 5 is high relevance and 1 is low relevance.

Each row in the search_events table represents a single search with the has_clicked column representing if a user clicked on a result or not. We have a hypothesis that the CTR is dependent on the search result rating.

Write a query to return data to support or disprove this hypothesis.

Start by making assumptions and thinking out loud. With this question, focus on coming up with a metric to support the hypothesis. If the question is unclear or if you think you need more information, be sure to ask.

Answer. The hypothesis is that CTR is dependent on search result rating. Therefore, we want to focus on the CTR metric.

Python Coding Questions

Python coding questions for data analysts are usually pretty simple and not as difficult as the ones seen on Leetcode. Most interviewers just want to test enough basic Python knowledge to know you can write scripts or simple functions to move data between SQL and Excel or onto a dashboard.

Most data analysts never write production code, so their code is never under the scrutiny that comes with holding a website up or performing some critical business function.

Therefore, most coding questions for data analyst interviews are generally on the easier side and mostly test basic functions that are required for data manipulation. Pandas questions may also be asked in this round of the interview.

Here’s an example Python coding question:

Q1. Write a function that can take a string and return a list of bigrams. (Asked by Indeed)

sentence = "Have free hours and love children?"

output = [
 ('have', 'free'),
 ('free', 'hours'),
 ('hours', 'and'),
 ('and', 'love'),
 ('love', 'children')
]

Bigrams are two words that are placed next to each other. To actually parse them out of a string, we need to first split the input string.

We can use the Python method .split() to create a list with each individual word as an element. Then, create another empty list that will eventually be filled with tuples.

Then, once we’ve identified each individual word, we need to loop k-1 times (where k is the number of words in the sentence), appending the current word and the subsequent word as a tuple. Each tuple gets added to the list that we eventually return. Remember to use the Python function .lower() to turn all the words into lowercase!

def find_bigrams(sentence):
  # split on whitespace, trim surrounding punctuation, and lowercase
  # (stripping '?' makes the last bigram ('love', 'children'), matching the expected output)
  words = [w.strip('?.,!').lower() for w in sentence.split()]
  bigram_list = []

  # pair each word with the one that follows it
  for i in range(len(words) - 1):
    bigram_list.append((words[i], words[i + 1]))
  return bigram_list

Q2. Write a function that takes in a list of dictionaries with a key and list of integers and returns a dictionary with the standard deviation of each list.

Hint: You need the equation for standard deviation to answer this question: take the square root of the sum of the squared differences between each data value and the mean, divided by the total number of data points.
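A sketch of that approach, assuming each input dictionary looks like {"key": ..., "values": [...]} (the exact input shape is an assumption):

```python
import math

def list_std_devs(dict_list):
  # population standard deviation of each dictionary's list of values
  out = {}
  for d in dict_list:
    values = d["values"]
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    out[d["key"]] = math.sqrt(variance)
  return out

result = list_std_devs([{"key": "a", "values": [2, 4, 4, 4, 5, 5, 7, 9]}])
print(result)  # {'a': 2.0}
```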

Q3. Given a list of timestamps in sequential order, return a list of lists grouped by week (7 days) using the first timestamp as the starting point.

This question sounds like it should be a SQL question doesn’t it? Weekly aggregation implies a form of GROUP BY in a regular SQL or pandas question. In either case, aggregation on a dataset of this form by week would be pretty trivial.
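Here is one Python sketch, assuming the timestamps are Unix epoch integers in sorted order: bucket each timestamp by how many whole weeks separate it from the first one.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

def group_by_week(timestamps):
  if not timestamps:
    return []
  start = timestamps[0]
  weeks = {}
  for ts in timestamps:
    # integer number of whole weeks since the first timestamp
    weeks.setdefault((ts - start) // SECONDS_PER_WEEK, []).append(ts)
  return [weeks[k] for k in sorted(weeks)]

print(group_by_week([0, 86400, 605000, 1200000]))  # [[0, 86400], [605000, 1200000]]
```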

Q4. Explain negative indexing. What purpose does it serve?

Negative indexing is a feature in Python that allows users to index arrays or lists from the last element. For example, the index -1 returns the last element, while -2 returns the second-to-last element. It is used to access data from the end of a list, or to reverse a number or string.

Example of negative indexing:

a = "Python Data Analyst Questions"
print(a[-1])
>> s

Need some help? Read the Python coding question guide on Interview Query.

Statistics and Probability Interview Questions

Statistics and probability questions for data analysts usually come up in an on-site round as a test of basic fundamentals.

Statistics questions are more likely than probability questions to show up, since statistics are the fundamental building blocks of most day-to-day analytics work.

Q1. Given independent normal distributions X and Y, both with mean 0 and standard deviation 1, what’s the probability that 2X > Y? (Asked by Snapchat)

Given that X and Y both have a mean of 0 and a standard deviation of 1, what does that indicate for the distributions of X and Y?

Let’s look at this question a little closer.

We’re given two normal distributions. The values can be either positive or negative, and since the mean is 0 and the standard deviation is 1, each distribution is symmetric about zero.

In this scenario, we are equally likely to randomly sample a value that is greater than 0 or less than 0 from the distribution.

Now, let’s take examples of random values that we could get from each scenario. There are about six different scenarios here.

  • X > Y: Both positive

  • X > Y: Both negative

  • X < Y: Both positive

  • X < Y: Both negative

  • X > Y: X is positive Y is negative

  • X < Y: X is negative Y is positive

If we treat the six scenarios as a rough simulation and plug the variables into 2X > Y, the statement comes out true about half the time, or 50%.

Why is this the case? Since X and Y are both symmetric about zero, doubling X only stretches its spread without breaking that symmetry. The difference 2X − Y is therefore also symmetric about zero, so it is equally likely to be positive or negative: P(2X > Y) = 1/2.
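A quick Monte Carlo simulation backs up the 50% answer:

```python
import random

random.seed(42)  # fixed seed for reproducibility
N = 200_000
# draw independent standard normals X and Y and check 2X > Y
hits = sum(1 for _ in range(N) if 2 * random.gauss(0, 1) > random.gauss(0, 1))
print(hits / N)  # ~0.5
```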

Q2. What is an unbiased estimator and can you provide an example for a layman to understand?

To answer this question, start by thinking about what a biased estimator looks like. Then, think about how an unbiased estimator differs. Ultimately, an estimator is unbiased if its expected value equals the true value of the parameter, meaning that, on average, the estimates land on the true value.

Q3. Let’s say we have a sample size of N. The margin of error for our sample size is 3. How many more samples would we need to decrease the margin of error to 0.3?

Hint: In order to decrease our margin of error, we’ll probably have to increase our sample size. But by how much?
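The key fact is that the margin of error shrinks with the square root of the sample size. Cutting it from 3 to 0.3 is a factor of 10, which requires 10² = 100 times the original sample, i.e. 99N additional samples:

```python
# margin of error is proportional to 1 / sqrt(n), so shrinking it by a
# factor k requires growing the sample size by a factor k**2
old_moe, new_moe = 3.0, 0.3
factor = (old_moe / new_moe) ** 2   # 100x the original sample size N
additional = factor - 1             # so 99N more samples are needed
print(factor, additional)  # 100.0 99.0
```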

Q4. What’s the Difference Between Correlation and Covariance?

Covariance measures the direction of the linear relationship between two variables, while correlation measures both the strength and direction of that relationship on a normalized scale. Correlation is a function of covariance: it is the covariance divided by the product of the two standard deviations. Note that neither implies causation: a correlation between two variables does not mean that a change in variable X caused the change in variable Y’s value.
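A small illustration of the relationship, using plain-Python implementations of the population formulas: covariance changes with the units of the data, while correlation stays normalized to [-1, 1].

```python
import math

def covariance(xs, ys):
  mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
  return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
  # correlation is covariance normalized by both standard deviations
  return covariance(xs, ys) / (math.sqrt(covariance(xs, xs)) * math.sqrt(covariance(ys, ys)))

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(covariance(xs, ys))             # 2.5
print(round(correlation(xs, ys), 6))  # 1.0
# rescaling ys changes the covariance but not the correlation
print(covariance(xs, [y * 100 for y in ys]))  # 250.0
```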

Q5. You are about to get on a plane to Seattle. You call 3 random friends in Seattle and ask each if it’s raining. Each has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you “yes”, it is raining.

What is the probability that it’s actually raining in Seattle?

Hint: There are several ways to answer this question. Given that a frequentist approach operates on the set of known principles and variables given in the original problem, you can logically deduce that P(Raining)= 1-P(Not Raining).

Since all three friends have given you the same answer as to whether or not it’s raining, what can you determine about the relationship between P(Not Raining) and the probability that each of your friends is lying?
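Following the frequentist hint to its conclusion: since all three friends said “yes”, it is not raining only if every friend independently lied. That is a two-line calculation:

```python
# each friend lies independently with probability 1/3
p_all_lying = (1 / 3) ** 3
p_raining = 1 - p_all_lying   # 26/27
print(round(p_raining, 4))  # 0.963
```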

A/B Testing and Experimentation

A/B testing and experimentation questions for data analysts tend to explore the candidate’s ability to properly conduct A/B tests. You should have strong knowledge of p-values, confidence intervals, and how to assess the validity of an experiment.

Q1. Let’s say that your company is running a standard control and variant A/B test on a feature to increase conversion rates on the landing page. The PM checks the results and finds a .04 p-value.

How would you assess the validity of the result?

In this particular question, you’ll need to clarify the context of how the A/B test was set up and measured.

If we have an A/B test to analyze, there are two main ways in which we can look for invalidity. We could likely re-phrase the question to: How do you set up and measure an A/B test correctly?

Let’s start out by answering the first part of figuring out the validity of the set up of the A/B test:

1. How were the user groups separated?

Can we determine that the control and variant groups were sampled according to the test conditions?

If we’re testing changes to a landing page to increase conversion, can we compare users across the two groups on metrics whose distributions should look the same?

For example, if the groups were randomly bucketed, does the distribution of traffic from different attribution channels still look similar, or is variant A’s traffic coming primarily from Facebook ads and variant B’s from email? If group B has more traffic coming from email, then the test could be biased.

2. Were the variants equal in all other aspects?

The outside world often has a much larger effect on metrics than product changes do. Users can behave very differently depending on the day of week, the time of year, the weather (especially in the case of a travel company like Airbnb), or whether they learned about the website through an online ad or found the site organically.

If variant A’s landing page has a picture of the Eiffel Tower and the submit button at the top of the page, and variant B’s has a large picture of an ugly man and the submit button at the bottom, then we could get conflicting results because multiple features changed at once.

Measurement

Looking at the actual measurement of the p-value, the industry-standard threshold is .05. A p-value below .05 means that, if there were truly no difference between the populations, we would expect to see a result at least this extreme less than 1 time in 20.

However, we have to note a couple of things about the test in the measurement process.

What was the sample size of the test?

Additionally, how long did it take before the product manager measured the p-value? Lastly, how did the product manager measure the p-value and did they do so by continually monitoring the test?

If the product manager ran a t-test with a small sample size, they could easily get a p-value under 0.05. A common source of confusion in A/B testing is how much time you need before drawing a conclusion about the results of an experiment.

The problem with using the p-value as a stopping criterion is that the statistical test that gives you a p-value assumes that you designed the experiment with a sample and effect size in mind. If we continuously monitor the development of a test and the resulting p-value, we are very likely to see an effect, even if there is none. The opposite error is also common when you stop an experiment too early, before an effect becomes visible.

The most important reason is that we perform a statistical test every time we compute a p-value, and the more tests we run, the more likely we are to find an effect that isn’t really there.

How long should we recommend an experiment run for, then? To prevent a false negative (a Type II error), the best practice is to decide, before starting the experiment, on the minimum effect size we care about, then compute how long the experiment must run based on the daily sample size and the certainty we want.
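That best practice can be sketched as a standard two-sample size calculation. The function below is an illustration, not a formula prescribed by the text: given a minimum detectable effect, an estimate of the metric’s standard deviation, a significance level, and a desired power, it returns how many days the experiment needs at a given daily traffic level (Python’s statistics.NormalDist supplies the z-scores).

```python
from statistics import NormalDist

def required_days(effect, std_dev, daily_samples, alpha=0.05, power=0.8):
  # z-scores for the two-sided significance level and the desired power
  z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
  z_beta = NormalDist().inv_cdf(power)
  # classic per-variant sample size for a two-sample comparison of means
  n_per_variant = 2 * ((z_alpha + z_beta) * std_dev / effect) ** 2
  return n_per_variant / daily_samples

# e.g. detect a 1-point lift with standard deviation 10 and 500 new users/day
days = required_days(effect=1.0, std_dev=10.0, daily_samples=500)
print(round(days, 1))  # 3.1
```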

Q2. How can you effectively design an A/B test? Are there times when A/B testing shouldn’t be used?

Split testing fails when you have unclear goals. That’s why it’s imperative to start backwards with that goal. Is it to increase conversions? Are you trying to increase engagement and time spent on page? Once you have that goal, you can start experimenting with variables.

Q3. How much traffic would you need to drive to a page for the result of an A/B test to be statistically significant?

Statistical significance - or having 95% confidence in the results - requires the right volume of data. That’s why most A/B tests run for 2-8 weeks. Comparing metrics like conversions is fairly easy to calculate. In fact, most A/B tools have built-in calculators.

Q4. Let’s say you work at Uber. A PM comes to you considering a new feature where instead of a direct ETA estimate like 5 minutes, it would instead display a range of something like 3-7 minutes.

How would you conduct this experiment and how would you know if your results were significant?

Hint: A question like this asks you to think hypothetically about A/B testing. But the format is the same: Walk the interviewer through setting up the test and how you arrive at a statistically relevant result.

Excel and Data Visualization Interview Questions

Excel is still a widely used tool by data analysts, and in interviews, Excel questions typically focus on advanced features. These questions might ask for definitions, or you may be required to perform some Excel tasks.

Data analysts should also have strong knowledge of data visualization. These questions typically focus on design and presenting data, and may be more behavioral in nature. Be prepared to talk about how you make data accessible on dashboards.

Q1. Explain the Excel VLOOKUP function. What are the limitations of VLOOKUP?

This function allows users to find data from one column, and return a corresponding value from another.

For example, if you were analyzing a spreadsheet of customer data, you might use VLOOKUP to find a customer name and the corresponding phone number.

One limitation of VLOOKUP is that it can only look to the right of the lookup column. For example, you couldn’t return a value from column A if you used column B as the lookup column.

Another limitation is that VLOOKUP only returns the first match; if the spreadsheet contains duplicate records, the duplicates won’t appear in your results.

Q2. What questions would you ask before making a dashboard, chart or visualization?

If you’re asked a question like this, be sure to relate any experience you have. But ultimately, you should be thinking about the types of data that will be used and the relationships you’re trying to show with the data. Some key questions to ask are:

  • What business decisions will be made with this dashboard?
  • Will it be primarily retrospective or real-time?
  • Are the end-users technically savvy?

Q3. What is conditional formatting in Excel? When is a good time to use conditional formatting?

Conditional formatting allows users to change the appearance of a cell based on specified conditions.

Using conditional formatting, you can quickly highlight cells or ranges of cells, based on your conditions. Data analysts use conditional formatting to visualize data, to identify patterns or trends, or to detect potential issues.

Q4. What makes a data visualization effective?

You’ll hear a variation of this question if you’re interviewing for an analyst role that specializes in visualization.

You can talk about design and how you present information to the end-user. For example, you might say that a good visualization only presents necessary data, and that it’s clearly designed with cues to make the data more appealing. Additionally, you should consider scalability and time to deploy.

Q5. What are your favorite data visualization tools?

Data analysts will get asked what tools they have experience with. Choose a few that you’re most comfortable with and explain the features that you like.

Q6. What are some challenges you’ve experienced working with large volumes of data?

One tip: Think of questions like this in terms of Big Data’s 5 Vs: volume, velocity, variety, veracity and value.

More Resources for Data Analyst Interviews