Over ten years ago, a data analyst interview was very simple. All you needed to know to say was two things:
“Yes, I know Excel!”
“I’m a great communicator.”
Nowadays, the role of a data analyst has changed. Not only have salaries shot up, but data analysts are more in-demand than ever before due to their insight and analytical skillset.
Most data analyst jobs at tech companies require a strong technical skillset combined with good judgement. In this guide, we’ll break down the interview process and most common interview questions you’ll see on the interview.
Technical interviews for data analyst roles are typically multi-stage interviews. They start with initial screens designed to weed out candidates, and quickly progress to more technically demanding screens.
Here’s a typical breakdown of the data analyst interview process:
Initial screens are generally calls with recruiters. These screens assess your experience, your interests/specializations, and salary expectations. In some cases, you may be asked basic SQL questions or a simple scenario based case question.
Sample question: Tell us about a challenging project you have worked on. What were some of the obstacles you had to overcome?
The technical screen assesses your technical skills. In many cases, SQL is a primary focus. SQL questions range from basic definitions, to writing intermediate-to-advanced queries. Depending on the job function, you may be asked technical questions about Python, statistics and probability, algorithms and A/B testing.
Sample question: Given two tables, write a query to find the number of users that accepted Friend requests in the last month.
Take-homes are longer tests that may take several hours to complete. These challenges are designed to evaluate your ability to handle data, perform analysis and present your results effectively. Typically, these tests will ask you to perform and investigation on a dataset and present your findings.
Sample question: Prepare a summary of sales and website data for the Vice President of Marketing. Include an overview of website traffic and sales, as well as areas for improvement.
Data analyst on-site interviews typically consist of 3-5 hour-long interviews. They typically cover traditional technical SQL and statistics questions, as well as data analytics case studies and behavioral questions.
Sample question: Describe an analytics project you worked on. What were some challenges you faced?
Interview Query regularly analyzes the contents of data analyst interviews. By tagging common keywords and mapping them back to question topics for over 10K+ tech companies, we’ve found that SQL questions are asked most frequently.
In fact, in interviews for data analyst roles, SQL and data manipulations questions are asked 85% of the time.
Here are the types of technical interview questions data analysts get asked most frequently:
Additionally for more traditional data analyst roles, expect interview questions around:
Let’s first dive into how to approach and answer behavioral interview questions
Behavioral questions in data analyst interviews ask about specific situations you’ve been in, in which you had to apply specific skills or knowledge.
For many data analysts, behavioral questions can be fairly tough.
One tip: Always try to relate the question back to your experience and strengths.
Successful data analysts can help businesses identify anomalies and respond quickly.
For data sense questions, think about a time that you were able to spot an inconsistency in data quality, and how you eventually addressed it.
Data analysts use SQL to query data to solve complex business problems or find answers for other employees. In general, SQL data analyst questions focus on analytics and reporting:
Basic SQL Questions - These include the basics, e.g. definitions, as well as beginner SQL queries.
Analytics Questions – Analytics based questions, you might have to understand what kind of report or graph to build first, and then write a query to generate that report. So it’s an extra step on top of a regular SQL question.
Reporting Questions – SQL reporting questions replicate the work many data or business analysts do on a day-to-day basis, e.g. writing queries.
Reporting interview questions focus on writing a query to generate an already known output. Such as producing a report or a metric given some example table.
For analytics based questions, you might have to understand what kind of report or graph to build first, and then write a query to generate that report. So it’s an extra step on top of a regular SQL question.
See more SQL questions and tips for answering in our guide: 15 SQL Questions for Data Analysts.
To handle such a situation, we can use three different operations:
IS NULL − This operator returns true, if the column value is NULL. IS NOT NULL − This operator returns true, if the column value is not NULL. <=> − This operator compares values, which (unlike the = operator) is true even for two NULL values.
UNION and UNION ALL are SQL operators used to concatenate 2 or more result sets. This allows us to write multiple SELECT statements, retrieve the desired results, then combine them together into a final, unified set.
The main difference between UNION and UNION ALL is that:
UNION: only keeps unique records
UNION ALL: keeps all records, including duplicates
A table is structured with columns and rows. A view is a virtual table extracted from a database by writing a query.
The difference between an inner and outer join is that inner joins result in the intersection of two tables, whereas outer joins result in the union of two tables.
In most SQL databases there exists a RAND() function in which normally we can call:
SELECT * FROM big_table ORDER BY RAND()
The function will randomly sort the rows in the table. This function works fine and is fast if you only have let’s say around 1,000 rows. It might take a few seconds to run at 10K. And then at 100K maybe you have to go to the bathroom or cook a meal before it finishes.
What happens at 100 million rows?
Someone in DevOps is probably screaming at you.
Random sampling is important in SQL with scale. We don’t want to use the pre-built function because it wasn’t meant for performance. But maybe we can re-purpose it for our own use case.
We know that the RAND() function actually returns a floating-point between 0 and 1. So if we were to instead call:
We would get a random decimal point to some Nth degree of precision. RAND() essentially allows us to seed a random value. How can we use this to select a random row quickly?
Let’s try to grab a random number using RAND() from our table that can be mapped to an id. Given we have 100 million rows, we probably want a random number from 1 to 100 million. We can do this by multiplying our random seed from RAND() by the max number of rows in our table.
SELECT CEIL(RAND() * ( SELECT MAX(id) FROM big_table) )
We use the CEIL function to round the random value to an integer. Now we have to join back to our existing table to get the value.
What happens if we have missing or skipped id values though? We can solve for this by running the join on all the ids which are greater or equal than our random value and selects only the direct neighbor if a direct match is not possible.
As soon as one row is found, we stop (LIMIT 1). And we read the rows according to the index (ORDER BY id ASC). Now our performance is optimal.
SELECT r1.id, r1.name FROM big_table AS r1 INNER JOIN ( SELECT CEIL(RAND() * ( SELECT MAX(id) FROM big_table) ) AS id ) AS r2 ON r1.id >= r2.id ORDER BY r1.id ASC LIMIT 1
Hint: We want the value of two different metrics, the number of users that have posted their jobs once and the number of users that have posted at least one job multiple times. What does that mean exactly?
Assume no duplicate combination of first and last names. (I.E. No two John Smiths)
The first step we need to do would be to remove duplicates and retain the current salary for each user.
Given we know there aren’t any duplicate first and last name combinations, we can remove duplicates from the employees table by running a GROUP BY on two fields, the first and last name. This allows us to then get a unique combinational value between the two fields.
You want to be able to compute a metric that measures the precision of the ranking system based on position. For example, if the results for dog and cat are….
|dog||1000||1||2||picture of hotdog|
|cat||123||1||4||picture of cat|
…we would rank ‘cat’ as having a better search result ranking precision than ‘dog’ based on the correct sorting by rating.
Write a query to create a metric that can validate and rank the queries by their search result precision. Round the metric (avg_rating column) to 2 decimal places.
Hint: What is a cumulative distribution exactly? If we were to imagine our output and figure out what we wanted to display on a cumulative distribution graph, what would the dataset look like?
Usually if the problem states to solve for a moving/rolling average, we’re given the dataset in the form of a table with two columns, the date and the value.
This problem, however, is taken one step further with a table of just transactions with values conditioned to filtering for only deposits, and remove records representing withdrawals, denoted by a negative value (e.g. -10).
Data analytics case study questions combine a rotating mix of product intuition, business estimation, and data analytics.
Case questions come up in interviews when the job responsibilities lean to more of a heavy analytics space with an emphasis on solving problems and producing insights for management.
Many times data analysts will transition into a heavy analytics role when they’re required to take on more scope around the product and provide insights that upper level management can understand and interpret.
So data analytics case study questions will focus on a particular problem and you will be judged on how you break down the question, analyze the problem, and communicate your insights.
Here’s an example data analytics case study question:
|id||int||Primary key of posts table|
|user_id||int||ID of the user who created the post|
|created_at||datetime||Timestamp of the post|
|title||string||Title of the post|
|body||string||Text content of the post|
|comment_count||int||Total number of the comments on a post|
|view_count||int||Total number of the views on a post|
|answer_count||int||Total number of answers on a post|
|upvotes||int||Total number of upvotes on the post|
This is a classic data analytics case study. A question like this is designed to assess your data intuition, product sense, and ability to isolate key metrics.
Remember: There isn’t one correct answer, but usually, the conversation should head in a similar direction.
For example, this question asks about community health. Broadly, there are several metrics you’ll want to consider: Growth rate, engagement and user retention would provide insights into the community’s health.
The challenge with this question is determine how to measure those metrics with the data provided.
Case questions sometimes take the form of behavioral questions. Data analysts get tasked with experimenting with data to test new features or campaigns. Many behavioral questions will ask about experiments, but also tap into how you approach measuring your results.
With questions like these, be sure to describe the objective of the experiment, even if it was a simple A/B test. Don’t be afraid to get technical and explain the metrics you used and the process you used to quantify the results.
Start by making assumptions and thinking out loud. With this question, focus on coming up with a metric to support the hypothesis. If the question is unclear or if you think you need more information, be sure to ask.
Answer. The hypothesis is that CTR is dependent on search result rating. Therefore, we want to focus on the CTR metric
Python coding questions for data analysts are usually pretty simple and not as difficult as the ones seen on Leetcode. Mainly most interviewers just want to test basic knowledge of Python to the point that they know you can write scripts or some basic functions to move data between SQL and Excel or onto a dashboard.
Most data analysts never write production code, such as their code is never under scrutiny because it’s not holding a website up or performing some critical business function.
Therefore, most coding questions for data analyst interviews are generally on the easier side and mostly test basic functions that are required for data manipulation. Pandas questions may also be asked in this round of the interview.
Here’s a example Python coding question:
sentence = "Have free hours and love children?" output = [ ('have', 'free'), ('free', 'hours'), ('hours', 'and'), ('and', 'love'), ('love', 'children') ]
Bigrams are two words that are placed next to each other. To actually parse them out of a string, we need to first split the input string.
We would use the Python function .split() to create a list with each individual word as an input. Create another empty list that will eventually be filled with tuples.
Then, once we’ve identified each individual word, we need to loop through k-1 times (if k is the amount of words in a sentence) and append the current word and subsequent word to make a tuple. This tuple gets added to a list that we eventually return. Remember to use the Python function .lower() to turn all the words into lowercase!
def find_bigrams(sentence): input_list = sentence.split() bigram_list =  # Now we have to loop through each word for i in range(len(input_list)-1): #strip the whitespace and lower the word to ensure consistency bigram_list.append((input_list[i].strip().lower(), input_list[i+1].strip().lower())) return bigram_list
Hint: need to use the equation for standard deviation to answer this question. Using the equation, allows us to take the sum of the square of the data value minus the mean, over the total number of data points, all in a square root.
This question sounds like it should be a SQL question doesn’t it? Weekly aggregation implies a form of GROUP BY in a regular SQL or pandas question. In either case, aggregation on a dataset of this form by week would be pretty trivial.
Negative indexing is a function in Python that allows users to index arrays or lists from the last element. For example, the value -1 returns the last element, while -2 returns the second-to-last element. It is used to display data from the end of a list, or to reverse a number or string.
Example of negative indexing:
a = "Python Data Analyst Questions" print (a[-1]) >> s
Need some help? Read the Python coding question guide on Interview Query.
Statistics and probability questions for data analysts will come up usually on an onsite round as a test of basic fundamentals.
Statistics questions are more likely than probability questions to show up as statistics are the fundamental building blocks for many
Given that X and Y both have a mean of 0 and a standard deviation of 1, what does that indicate for the distributions of X and Y?
Let’s look at this question a little closer.
We’re given two normal distributions. The values can either be positive or negative but each value is equally likely to occur. Since we know the mean is 0 and the standard deviation is 1, we understand that the distributions are also symmetrical across the Y-axis.
In this scenario, we are equally likely to randomly sample a value that is greater than 0 or less than 0 from the distribution.
Now, let’s take examples of random values that we could get from each scenario. There are about six different scenarios here.
X > Y: Both positive
X > Y: Both negative
X < Y: Both positive
X < Y: Both negative
X > Y: X is positive Y is negative
X < Y: X is negative Y is positive
We can simulate a random sampling by equating that all six are equally likely to occur. If we play out each scenario and plug the variables into 2X > Y, then we see about half of the time the statement is true, or 50%.
Why is this the case? Generally if we go back to the fact that both distributions are symmetrical across the Y-axis, we can intuitively understand that if both X and Y are random variables across the same distribution, we will see 2X as being on average double positive or double negative the value that Y is.
To answer this question, start by thinking about how a biased estimator looks. Then, think about how an unbiased estimator differs. Ultimately, an estimator is unbiased if its expected value equals the true value of a parameter, meaning that the estimates are in line with the average.
Hint: In order to decrease our margin of error, we’ll probably have to increase our sample size. But by how much?
Covariance measures the linear relationship of variables, while correlation measures the strength and direction of the relationship. Therefore, correlation is a function of a covariance. For example, a correlation between two variables does not mean that the change in variable X caused the change in variable Y’s value.
Hint: There are several ways to answer this question. Given that a frequentist approach operates on the set of known principles and variables given in the original problem, you can logically deduce that P(Raining)= 1-P(Not Raining).
Since all three friends have given you the same answer as to whether or not it’s raining, what can you determine about the relationship between P(Not Raining) and the probability that each of your friends is lying?
A/B testing and experimentation questions for data analysts tend to explore the candidate’s ability to properly conduct A/B tests. You should have strong knowledge of: p-values, confidence intervals, and assessing validity of the experiment.
In this particular question, you’ll need to clarify the context of how the A/B test was set up and measured.
If we have an A/B test to analyze, there are two main ways in which we can look for invalidity. We could likely re-phrase the question to: How do you set up and measure an A/B test correctly?
Let’s start out by answering the first part of figuring out the validity of the set up of the A/B test:
1. How were the user groups separated?
Can we determine that the control and variant groups were sampled accordingly to the test conditions?
If we’re testing changes to a landing page to increase conversion, can we compare the two different users in the groups to see different metrics in which the distributions should look the same?
For example, if the groups were randomly bucketed, does the distribution of traffic from different attribution channels still look similar or is the variant A traffic channel coming primarily from Facebook ads and the variant B from email? If testing group B has more traffic coming from email then that could be a biased test.
2. Were the variants equal in all other aspects?
The outside world often has a much larger effect on metrics than product changes do. Users can behave very differently depending on the day of week, the time of year, the weather (especially in the case of a travel company like Airbnb), or whether they learned about the website through an online ad or found the site organically.
If the variants A’s landing page has a picture of the Eifel Tower and the submit button on the top of the page, and variant B’s landing page has a large picture of an ugly man and the submit button on the bottom of the page, then we could get conflicting results based on the change to multiple features.
Looking at the actual measurement of the p-value, we understand that industry standard is .05, which means that 19 out of 20 times that we perform that test, we’re going to be correct that there is a difference between the populations.
However, we have to note a couple of things about the test in the measurement process.
What was the sample size of the test?
Additionally, how long did it take before the product manager measured the p-value? Lastly, how did the product manager measure the p-value and did they do so by continually monitoring the test?
If the product manager ran a T-test with a small sample size, they could very well easily get a p-value under 0.05. Many times, the source of confusion in AB testing is how much time you need to make a conclusion about the results of an experiment.
The problem with using the p-value as a stopping criterion is that the statistical test that gives you a p-value assumes that you designed the experiment with a sample and effect size in mind. If we continuously monitor the development of a test and the resulting p-value, we are very likely to see an effect, even if there is none. The opposite error is also common when you stop an experiment too early, before an effect becomes visible.
The number one most important reason is that we are performing a statistical test every time you compute a p-value and the more you do it, the more likely you are to find an effect.
How long should we recommend an experiment to run for then? To prevent a false negative (a Type II error), the best practice is to determine the minimum effect size that we care about and compute, based on the sample size (the number of new samples that come every day) and the certainty you want, how long to run the experiment for, before starting the experiment.
Split testing fails when you have unclear goals. That’s why it’s imperative to start backwards with that goal. Is it to increase conversions? Are you trying to increase engagement and time spent on page? Once you have that goal, you can start experimenting with variables.
Statistical significance - or having 95% confidence in the results - requires the right volume of data. That’s why most A/B tests run for 2-8 weeks. Comparing metrics like conversions is fairly easy to calculate. In fact, most A/B tools have built-in calculators.
Hint: A question like this asks you to think hypothetically about A/B testing. But the format is the same: Walk the interviewer through setting up the test and how you arrive at a statistically relevant result.
Excel is still a widely used tool by data analysts, and in interviews, Excel questions typically focus on advanced features. These questions might ask for definitions, or you may be required to perform some Excel tasks.
Data analysts should also have strong knowledge of data visualization. These questions typically focus on design and presenting data, and may be more behavioral in nature. Be prepared to talk about how you make data accessible on dashboards.
This function allows users to find data from one column, and return a corresponding value from another.
For example, if you were analyzing a spreadsheet of customer data, you might use VLOOKUP to find a customer name and the corresponding phone number.
One limitation of VLOOKUP is that it only looks to the right of the column you are analyzing. For example, you couldn’t return a value from column A, if you used column B as the lookup column.
Another limitation is that VLOOKUP only returns the first value; if the spreadsheet contains duplicate records, you wouldn’t see any duplicates.
If you’re asked a question like this, be sure to relate any experience you have. But ultimately, you should be thinking about the types of data that will be used and the relationships you’re trying to show with the data. Some key questions to ask are:
Conditional formatting allows users to change the appearance of a cell based on specified conditions.
Using conditional formatting, you can quickly highlight cells or ranges of cells, based on your conditions. Data analysts use conditional formatting to visualize data, to identify patterns or trends, or to detect potential issues.
You’ll hear a variation of this question if you’re interviewing for an analyst role that specializing in visualization.
You can talk about design and how you present information to the end-user. For example, you might say that a good visualization only presents necessary data, and that it’s clearly designed with cues to make the data more appealing. Additionally, you should consider scalability and time to deploy.
Data analysts will get asked what tools they have experience with. Choose a few that you’re most comfortable with and explain the features that you like.
One tip: Think of questions like this in terms of Big Data’s 5 Vs: volume, velocity, variety, veracity and value.