### Introduction

Preparing for a data analyst interview is a tricky task. More and more, data analysts are asked to take on the responsibilities of data scientists. As a result, **these interviews require technical skills matched with comprehension of data analysis** methods and tools.

You can expect everything from the basics like “What is data cleaning?” to advanced SQL data analysis questions. Ultimately, it depends on the role and the company. To help you prep, we’ve compiled some common **data analyst questions with answers, hints and tips**.

### What Types of Questions Get Asked in Data Analyst Interviews?

Interview Query regularly analyzes data analyst interviews, and we try to understand exactly how the data analyst’s job description has changed over the last few years. We’ve found that **SQL questions are asked most frequently**. In fact, in interviews for data analyst roles, SQL or data processing questions are **asked 85% of the time**. Here are the types of questions data analysts get asked most frequently:

- SQL / data processing
- Product metrics and business case studies
- Python, algorithms, and coding questions
- Statistics and probability
- A/B testing and experimentation
- Machine learning and modeling

### SQL Data Analysis Interview Questions

SQL is a programming language used to access and manipulate databases. Data analysts use SQL to query data to solve complex business problems or find answers for other employees. In general, SQL data analysis questions focus on analytics and reporting:

- **Analytics Questions** – SQL analytics questions test your problem-solving process.
- **Reporting Questions** – SQL reporting questions replicate the work many data or business analysts do on a day-to-day basis, e.g. writing queries.

**Q1: What are the different ways of handling NULL when querying a data set?**

To handle such a situation, we can use three different operations:

- **IS NULL** – This operator returns true if the column value is NULL.
- **IS NOT NULL** – This operator returns true if the column value is not NULL.
- **<=>** – This operator compares two values and, unlike the = operator, returns true even when both values are NULL.

**Q2: What's the difference between UNION and UNION ALL? (Asked by Facebook)**

`UNION` and `UNION ALL` are SQL operators used to concatenate two or more result sets. This allows us to write multiple `SELECT` statements, retrieve the desired results, then combine them into a final, unified set.

The main difference between `UNION` and `UNION ALL` is that:

- `UNION` only keeps unique records
- `UNION ALL` keeps all records, including duplicates

Source: The Data School

**Q3: What is the difference between a SQL view and table? (Asked by Kaiser Permanente)**

A table is a database object that stores data in columns and rows. A view is a virtual table defined by a query; it presents data drawn from one or more tables without storing it itself.

**Q4: What's the difference between an INNER and OUTER JOIN?**

The difference is that an inner join returns only the rows that match in both tables (the intersection), whereas an outer join also keeps the non-matching rows from one or both tables, approaching the union of the two tables.

**Q5: Write a query that returns all of the neighborhoods that have 0 users. (Asked by Nextdoor / Facebook)**

**Example:**

```
'users' table
+-----------------+----------+
| columns | type |
+-----------------+----------+
| id | int |
| name | varchar |
| neighborhood_id | int |
| created_at | datetime |
+-----------------+----------+
'neighborhoods' table
+---------+---------+
| columns | type |
+---------+---------+
| id | int |
| name | varchar |
| city_id | int |
+---------+---------+
```

Hint: Our task is to find all the neighborhoods without users; in other words, every neighborhood that does not have a single user living in it.

This means we have to introduce the **concept of rows existing in one table but not in the other.**
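That existence check can be sketched in Python with made-up rows (the neighborhood names here are purely illustrative):

```python
# hypothetical rows mirroring the two tables
users = [
    {'id': 1, 'neighborhood_id': 2},
    {'id': 2, 'neighborhood_id': 2},
]
neighborhoods = [
    {'id': 1, 'name': 'Sunset'},
    {'id': 2, 'name': 'Mission'},
]

populated = {u['neighborhood_id'] for u in users}
empty = [n['name'] for n in neighborhoods if n['id'] not in populated]
# empty -> ['Sunset']
```

In SQL, the same idea is typically expressed as a `LEFT JOIN` filtered on `NULL`, or with a `NOT IN` / `NOT EXISTS` subquery.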

*Check out our Facebook Data Analyst Interview guide for more insights into Facebook's process.*

**Q6. If you have a table with billions of rows, how would you extract a random row from the table without performance issues?**

Quick Tip: Before jumping into the question, remember to clarify a few details with the interviewer. It shows that you think holistically about problems; rushing into a solution too fast is a red flag for many interviewers.

Given the problem statement, let's clarify a few questions: **What is the potential impact of downtime?**

In this scenario, a table with a billion rows could be one that powers an entire company, where downtime would cost thousands of dollars in sales, or it could just be an offline analytics events table, where downtime would only affect internal employees.

It's crucial to assess with the interviewer **what the potential effects of downtime are for seconds, minutes, and hours.** Figuring out the impact is pertinent to determining our strategy going forward.

*To explore more data analyst questions, check out our Google Data Analyst Interview guide.*

**Q7: Given a table of job postings, write a query to break down the number of users that have posted their jobs once versus the number of users that have posted at least one job multiple times. (Asked by LinkedIn)**

**Example:**

```
'job_postings' table
+-------------+----------+
| column | type |
+-------------+----------+
| id | integer |
| job_id | integer |
| user_id | integer |
| date_posted | datetime |
+-------------+----------+
```

Hint: This question is fairly complicated, so let's break it down into multiple steps. First, let's visualize what the output would look like.

We want the value of two different metrics: the **number of users that have posted their jobs once** versus the **number of users that have posted at least one job multiple times.** What does that mean exactly?

Well, if a user has 5 jobs and posted each of them once, they are part of the first group. But if they have 5 jobs and posted a total of 7 times, then they must have posted at least one job multiple times.
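A small Python sketch with hypothetical posting events makes the distinction concrete: a user belongs to the "posted once" group only when their total postings equal their number of distinct jobs:

```python
from collections import Counter, defaultdict

# hypothetical (user_id, job_id) posting events
posts = [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'a'), (3, 'c'), (3, 'c'), (3, 'd')]

total = Counter(u for u, _ in posts)  # total postings per user
distinct = defaultdict(set)           # distinct jobs per user
for u, j in posts:
    distinct[u].add(j)

posted_once = sum(1 for u in total if total[u] == len(distinct[u]))
posted_multiple = sum(1 for u in total if total[u] > len(distinct[u]))
# posted_once -> 1 (user 1), posted_multiple -> 2 (users 2 and 3)
```

The SQL solution follows the same shape: aggregate postings per user, compare total posts against distinct jobs, then count users in each bucket.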


**Q8. Given a users table, write a query to get the cumulative number of new users added by day, with the total reset every month. (Asked by Twitch)**

**Example:**

```
'users' table
+------------+----------+
| columns | type |
+------------+----------+
| id | int |
| name | varchar |
| created_at | datetime |
+------------+----------+
```

This question first seems like it could be solved by just running a `COUNT(*)` grouped by date. Or maybe it's just a regular cumulative distribution function?

But notice that we are actually **grouping by a specific interval of month and date**, and that when the next month comes around, we want to **reset the count of the number of users**.

The practical benefit for a query like this is that we can get a retention graph that compares the cumulative number of users from one month to another. If we have a goal to acquire 10% more users each month, how do we know if we're on track for this goal on February 15th without having the same number to compare it to for January 15th?

Therefore, how can we make sure that the running total through January 31st rolls back to 0 on February 1st?

Let's first just solve the issue of getting the total count of users. We know that we'll need to know the number of users that sign up each day. This can be written pretty easily.

```
WITH daily_total AS (
    SELECT
        DATE(created_at) AS dt
        , COUNT(*) AS cnt
    FROM users
    GROUP BY 1
)
```
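Before layering SQL window functions on top of this CTE, the reset-each-month logic can be sanity-checked in Python with made-up daily counts:

```python
from itertools import groupby

# hypothetical daily signup counts straddling a month boundary
daily = [('2023-01-30', 3), ('2023-01-31', 2), ('2023-02-01', 5), ('2023-02-02', 1)]

running = []
for _, month_rows in groupby(daily, key=lambda row: row[0][:7]):  # bucket by 'YYYY-MM'
    total = 0
    for _, cnt in month_rows:
        total += cnt               # cumulative within the month...
        running.append(total)
# running -> [3, 5, 5, 6]          # ...resetting to the day's count on Feb 1
```

In SQL, the equivalent is a running `SUM(cnt)` windowed with a `PARTITION BY` on the year and month of `dt`.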

Read more about how to solve SQL interview questions in our data science course.

### Business, Product & Data Analytics Questions

In analyst interviews, data analytics case study questions combine a rotating mix of product intuition, business estimation, and data analytics. These come up often when the job responsibilities lean toward heavier analytics work with an emphasis on solving problems first.

Many times, data analysts transition into a heavier analytics role when they're required to take on more scope around the product instead of analyzing data all day.

**Q1: Let's say you work for a social media company that has just done a launch in a new city. Looking at weekly metrics, you see a slow decrease in the average number of comments per user from January to March in this city.**

**What are some reasons on why the average number of comments per user would be decreasing?**

Let's model out an example scenario to help us see the data.

- **Jan:** 10,000 users, 30,000 comments, 3 comments/user
- **Feb:** 20,000 users, 50,000 comments, 2.5 comments/user
- **Mar:** 30,000 users, 60,000 comments, 2 comments/user

We're given information that total user count is increasing linearly which means that the **decreasing comments/user** is not an effect of a declining user base creating a loss of network effects on the platform.

What else can we hypothesize then?
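One hypothesis worth checking: newer users comment less than established ones, so the blended average falls even if existing users' behavior is unchanged. A quick sanity check with made-up cohort numbers:

```python
# hypothetical cohorts: 10k established users at 3 comments each,
# 20k newer users at 1.5 comments each
established, newer = 10_000, 20_000
blended = (established * 3 + newer * 1.5) / (established + newer)
# blended -> 2.0, matching the March figure with no change in established behavior
```

If this is what's happening, segmenting comments per user by signup cohort would reveal it immediately.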

**Q2: Let's say we're given a dataset of page views where each row represents one page view. How would you differentiate between scrapers and real people? (Asked by LinkedIn)**

Quick Tip: Questions like this have no exact right answer. Theoretical modeling questions are meant to assess whether you can make realistic assumptions and come up with a solution under those assumptions.

We're given a dataset of page views with likely scrapers and real users visiting the site. Because the intent of a scraper is to extract data out of the LinkedIn network, a scraper will almost surely have a lot of page views, and the **duration of these views will likely be rather short** since a robotic scraper can process information much faster than a human.

A real user, on the other hand, tends to visit the page fewer times and spend more time in each visit. The link traversal between users would also be more nuanced. We'd expect **users to traverse the pages more through links on the site** rather than a scraper making requests to different URLs.

Under these assumptions, what solution can we come to?
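One simple answer, under the assumptions above, is a rule-based classifier on per-viewer aggregates. The thresholds below are illustrative, not tuned; a real solution would calibrate them against labeled traffic:

```python
# hypothetical per-viewer aggregates: (page_views, avg_seconds_per_view)
viewers = {'u1': (12, 45.0), 'u2': (8, 30.0), 'bot1': (5000, 0.2)}

def looks_like_scraper(views, avg_secs, view_cutoff=1000, secs_cutoff=1.0):
    # many views, each very short, suggests automation rather than reading
    return views > view_cutoff and avg_secs < secs_cutoff

scrapers = [v for v, (views, secs) in viewers.items() if looks_like_scraper(views, secs)]
# scrapers -> ['bot1']
```

Link-traversal patterns (direct URL requests versus navigation through on-site links) could be added as a further feature.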

**Q3. How do you estimate the average number of people in a family when given a dataset of housing data in a neighborhood? (Asked by Opendoor)**

**Q4: Let’s say you’re working on Facebook Groups. A product manager decides to add threading to comments on group posts. We see comments per user increase by 10% but posts go down 2%. Why would that be?**

**Additionally, what metrics would prove your hypotheses?**

Hint: Threading restructures the flow of comments so that instead of responding to the post, users can now respond to individual comments beneath the post. What effect might this have on a push notification ecosystem?

**Q5. How would you measure the success of Facebook Groups?**

Start with this question: What is the point of Facebook Groups? We could say that Facebook Groups provide a way for users to connect with other users through a shared interest or real life/offline relationship.

What types of general engagement metrics help us measure the success of this goal?

**Q6: A company is netting a loss of revenue for the past 12 months. You have the following transaction data: location, date, and total spent for the past 12 months. How would you approach this question to stem their revenue loss? (Asked by Walmart)**

### Data Analyst Python Coding Questions

Python is a versatile programming language widely used by data analysts. Python questions in analyst interviews typically focus on how you can leverage Python to analyze data. As such, these interview questions revolve around data parsing, data manipulation, text strings, and Python packages like NumPy and Pandas.

**Q1. Write a function that can take a string and return a list of bigrams. (Asked by Indeed)**

```
sentence = """
Have free hours and love children?
Drive kids to school, soccer practice
and other activities.
"""
output = [('have', 'free'),
('free', 'hours'),
('hours', 'and'),
('and', 'love'),
('love', 'children?'),
('children?', 'drive'),
('drive', 'kids'),
('kids', 'to'),
('to', 'school,'),
('school,', 'soccer'),
('soccer', 'practice'),
('practice', 'and'),
('and', 'other'),
('other', 'activities.')]
```

To separate the sentence into bigrams, the first thing we need to do is split the sentence into individual words.

We would need to loop through each word of the sentence and append bigrams to the list.
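Putting those two steps together, one possible implementation (lowercasing first, and leaving punctuation attached to words, as the expected output shows):

```python
def bigrams(sentence):
    # lowercase and split on whitespace; punctuation stays attached to each word
    words = sentence.lower().split()
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

print(bigrams("Have free hours and love children?"))
# [('have', 'free'), ('free', 'hours'), ('hours', 'and'), ('and', 'love'), ('love', 'children?')]
```

A `zip(words, words[1:])` comprehension is an equally idiomatic alternative to the index loop.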

**Q2:** **Write a function that takes in a list of dictionaries with a key and list of integers and returns a dictionary with the standard deviation of each list. (Asked by Snapchat)**

```
Example:
input = [
{
'key': 'list1',
'values': [4,5,2,3,4,5,2,3],
},
{
'key': 'list2',
'values': [1,1,34,12,40,3,9,7],
}
]
output -> {'list1': 1.12, 'list2': 14.19}
```

Remember the equation for standard deviation. To implement this function, we use the following equation: take the sum of the squared differences between each data value and the mean, divide by the total number of data points, and take the square root of the result.

Does the function inside the square root look familiar?

Standard deviation is the square root of the variance. Can we separate the equation above into different, simpler equations, so that the function isn’t a huge equation?
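Following that idea, one way to split the computation into simpler steps: compute the mean, then the variance, then its square root. Note that the example output implies the population formula (dividing by n), not the sample formula (n - 1):

```python
from math import sqrt

def std_devs(data):
    out = {}
    for d in data:
        vals = d['values']
        mean = sum(vals) / len(vals)
        variance = sum((v - mean) ** 2 for v in vals) / len(vals)  # population variance
        out[d['key']] = round(sqrt(variance), 2)
    return out

example = [
    {'key': 'list1', 'values': [4, 5, 2, 3, 4, 5, 2, 3]},
    {'key': 'list2', 'values': [1, 1, 34, 12, 40, 3, 9, 7]},
]
print(std_devs(example))  # {'list1': 1.12, 'list2': 14.19}
```

Clarifying population versus sample standard deviation with the interviewer is worth a sentence before you start coding.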

**Q3: Given a list of timestamps in sequential order, return a list of lists grouped by week (7 days) using the first timestamp as the starting point. (Asked by Postmates)**

```
ts = [
'2019-01-01',
'2019-01-02',
'2019-01-08',
'2019-02-01',
'2019-02-02',
'2019-02-05',
]
output = [
['2019-01-01', '2019-01-02'],
['2019-01-08'],
['2019-02-01', '2019-02-02'],
['2019-02-05'],
]
```

This question sounds like it should be a SQL question, doesn't it? Weekly aggregation implies a form of GROUP BY in a regular SQL or pandas question. In either case, aggregating a dataset of this form by week would be pretty trivial.

But since it's a scripting question, it's trying to find out whether the candidate can deal with unstructured data. Data scientists deal with a lot of unstructured data.

In this function we have to do a few things.

1. Loop through all of the timestamps.
2. Set the beginning timestamp as our reference point.
3. Check if the next timestamp in the array is more than 7 days ahead:
    - If so, start a new group beginning with that timestamp.
    - If not, continue to loop through and append the value to the current group.

```
from datetime import datetime

def group_week(ts, delim='-'):
    '''
    Groups an ordered list of timestamp strings into weeks (7-day windows).
    The first day of the first week is defined by the earliest timestamp.
    Dates should be ordered by year, month, and day.

    Parameters:
        ts: list of str, timestamps
        delim: str, delimiter that separates the date parts
    '''
    out, week, week_ind = [], [], 0
    fmt = f'%Y{delim}%m{delim}%d'
    for i, t in enumerate(ts):
        if i == 0:
            week.append(t)
            start_date = datetime.strptime(t, fmt)
            continue
        t_date = datetime.strptime(t, fmt)
        n = (t_date - start_date).days // 7  # weeks elapsed since the first timestamp
        if n == week_ind:
            week.append(t)
        else:
            # a new week has started: flush the current group
            week_ind = n
            out.append(week)
            week = [t]
    if week:
        out.append(week)  # don't forget the final, in-progress week
    return out
```

**Q4. Explain negative indexing. What purpose does it serve?**

Negative indexing is a feature in Python that allows users to index lists or other sequences from the last element. For example, index -1 returns the last element, while -2 returns the second-to-last element. It is used to access data from the end of a list, or to reverse a string or sequence with slicing.

**Example:**

```
a = "Python Data Analyst Questions"
print(a[-1])  # prints: s
```

### Probability and Statistics Interview Questions

Statistics and probability questions are common interview fundamentals intended to test a data analyst's understanding of key concepts. For example, every data analyst should know what a P-value is, what confidence intervals are, and how to read and analyze an A/B test.

**Q1. Given uniform distributions X and Y, each with mean 0 and standard deviation 1, what's the probability that 2X > Y? (Asked by Snapchat)**

Given that X and Y both have a mean of 0 and a standard deviation of 1, what does that indicate for the distributions of X and Y?
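A Monte Carlo check, assuming X and Y are independent (a Uniform(-√3, √3) variable has mean 0 and standard deviation 1). Since 2X - Y is symmetric around 0, the estimate should land near 1/2:

```python
import random

random.seed(0)
w = 3 ** 0.5   # Uniform(-w, w) has mean 0 and standard deviation 1
n = 100_000
hits = sum(2 * random.uniform(-w, w) > random.uniform(-w, w) for _ in range(n))
print(hits / n)  # ≈ 0.5 by symmetry of 2X - Y around 0
```

The simulation is a sanity check; in the interview, the symmetry argument alone is the expected answer.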

**Q2. What is an unbiased estimator and can you provide an example for a layman to understand?**

To answer this question, start by thinking about how a biased estimator looks. Then, think about how an unbiased estimator differs. Ultimately, an estimator is unbiased if its expected value equals the true value of the parameter, meaning that across many samples, the estimates average out to the true value.
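For instance, the sample mean is an unbiased estimator of the population mean. A quick simulation illustrates the idea, using a hypothetical Uniform(0, 1) population whose true mean is 0.5:

```python
import random

random.seed(0)
n, trials = 5, 20_000
avg_of_means = sum(
    sum(random.random() for _ in range(n)) / n  # one small sample's mean
    for _ in range(trials)
) / trials
print(avg_of_means)  # ≈ 0.5, the true population mean
```

A layman's framing: each small sample may miss high or low, but the misses cancel out on average.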

**Q3. Let's say we have a sample size of N. The margin of error for our sample size is 3. How many more samples would we need to decrease the margin of error to 0.3?**

In order to decrease our margin of error, we'll probably have to increase our sample size. But by how much?
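By how much follows from the fact that the margin of error shrinks like 1/√n, so the required sample size scales with 1/(margin of error)²:

```python
# margin of error ~ 1 / sqrt(n), so required n ~ 1 / (margin of error) ** 2
old_me, new_me = 3, 0.3
factor = (old_me / new_me) ** 2  # total sample size needed, as a multiple of N
# factor ≈ 100: we need 100N samples in total, i.e. 99N additional samples
```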

**Q4. What's the Difference Between Correlation and Covariance?**

Covariance measures how two variables vary together, while correlation measures the strength and direction of their linear relationship on a normalized scale. Correlation is therefore a function of covariance. Keep in mind that neither implies causation: a correlation between two variables does not mean that a change in variable X caused the change in variable Y's value.

**Q5. You are about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining.**

**What is the probability that it's actually raining in Seattle?**

Hint: Given that a frequentist approach operates on the set of known principles and variables given in the original problem, you can logically deduce that P(Raining) = 1 - P(Not Raining). Since all three friends have given you the same answer as to whether or not it's raining, what can you determine about the relationship between P(Not Raining) and the probability that each of your friends is lying?

A **Bayesian approach** utilizes new data related to the problem in order to describe the probability of a given event. What kind of new information could you assume/take into account in order to analyze the probability of certain events?

Think about what happens if you move from **examining the probability that your friends are lying** to the **probability that your friends tell you it’s raining**.

Using the Bayesian approach, the probability of rain in Seattle is a great factor to consider in relation to your friends’ respective answers. Since the probability of rain in Seattle isn’t specifically noted in the original prompt, what can you assume by default to be the value of the probability of rain in Seattle?

How is this then factored into the Bayesian Theorem?
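Putting the Bayesian pieces together in code: assuming a 50/50 prior for rain (the prompt gives none), the posterior works out to 8/9:

```python
def p_rain_given_all_yes(prior, p_truth=2/3, n_friends=3):
    # P(all say yes | raining) and P(all say yes | not raining)
    like_rain = p_truth ** n_friends
    like_dry = (1 - p_truth) ** n_friends
    # Bayes' theorem: posterior = prior * likelihood / evidence
    return prior * like_rain / (prior * like_rain + (1 - prior) * like_dry)

print(p_rain_given_all_yes(0.5))  # 8/9 ≈ 0.889
```

A different prior for rain in Seattle changes the answer, which is exactly the discussion the interviewer wants you to raise.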

**Q6. Amazon has a warehouse system where items on the website are located at different distribution centers across a city. Let's say in one example city, the probability that a specific item X is at location A is 0.6, and the probability that it is at location B is 0.8.**

**Given you're a customer in this example city and the items are only found on the website if they exist in the distribution centers, what is the probability that the item X would be found on Amazon's website?**

Hint: We know that item X will be on the website if the item is at Location A or Location B. Thus, the probability we are aiming to find is the probability that item X is at Location A or B. However, there is also the chance that item X may be at both Location A and Location B.
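Assuming the two locations stock independently, inclusion-exclusion gives the answer:

```python
p_a, p_b = 0.6, 0.8
# P(A or B) = P(A) + P(B) - P(A and B); with independence, P(A and B) = P(A) * P(B)
p_on_site = p_a + p_b - p_a * p_b
print(p_on_site)  # ≈ 0.92
```

Stating the independence assumption out loud is part of the expected answer.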

### A/B Testing and Experimentation

**Q1. Let's say that your company is running a standard control and variant A/B test on a feature to increase conversion rates on the landing page. The PM checks the results and finds a p-value of 0.04.**

**How would you assess the validity of the result?**

Hint: What details is the interviewer leaving out of the question? Are there further assumptions we can make about how the A/B test was set up and measured that might reveal the result is invalid?

**Q2. Let's say you work at Uber. A PM comes to you with a proposed feature: instead of a direct ETA estimate like 5 minutes, the app would display a range like 3-7 minutes.**

**How would you conduct this experiment and how would you know if your results were significant?**

**Q3. How can you effectively design an A/B test? Are there times when A/B testing shouldn't be used?**

Split testing fails when you have unclear goals. That's why it's imperative to work backward from the goal. Is it to increase conversions? Are you trying to increase engagement and time spent on the page? Once you have that goal, you can start experimenting with variables.

Another potential challenge is when tests are overly broad or vague. It's difficult to A/B test two different landing page experiences with many different variables. An overly broad test can result in unclear results; you won't be able to answer why a particular variation was the winner.

**Q4. How much traffic would you need to drive to a page for the result of an A/B test to be statistically significant?**

Statistical significance, conventionally defined as 95% confidence in the results, requires a sufficient volume of data. That's why most A/B tests run for 2-8 weeks. Comparing metrics like conversion rates is fairly easy to calculate; in fact, most A/B testing tools have built-in calculators, or you can use a free online statistical significance calculator.
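Behind such calculators is a standard power analysis. Here's a sketch of one common normal-approximation formula for comparing two conversion rates; the helper name and default parameters are illustrative, not from any specific tool:

```python
from math import ceil
from statistics import NormalDist

def samples_per_variant(p1, p2, alpha=0.05, power=0.80):
    # two-sided test on the difference of two proportions, normal approximation
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

print(samples_per_variant(0.10, 0.12))  # roughly 3,800-3,900 visitors per variant
```

The takeaway: the smaller the lift you want to detect, the more traffic you need, growing with the inverse square of the effect size.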

### Data Analyst Excel Questions for Interviews

Excel is a tool commonly used by data analysts. Therefore, you should have a working knowledge of basic functions like SUMIF, IF statements, and COUNT. Additionally, you should know how to create graphs and pivot tables. In practice, because companies value collaboration, you may also be required to use Google Sheets, which offers largely the same functionality as Excel but is more easily shared between users.

**Q1. Explain the VLOOKUP function. What are the limitations of VLOOKUP?**

VLOOKUP is a common function whose name stands for "Vertical Lookup." Essentially, this function lets users look up a value in one column and return a corresponding value from another. For example, if you were analyzing a spreadsheet of customer data, you might use VLOOKUP to find a customer name and return the corresponding phone number.

One big limitation is that VLOOKUP only looks to the right of the lookup column. For example, you couldn't return a value from column A if you used column B as the lookup column. Another limitation is that VLOOKUP only returns the first match; if the spreadsheet contains duplicate records, you would only ever see the first one.

**Q2. What is conditional formatting? When is a good time to use conditional formatting?**

Conditional formatting allows users to change the appearance of a cell based on specified conditions. Using conditional formatting, you can quickly highlight cells or ranges of cells, based on your conditions. Data analysts use conditional formatting to visualize data, to identify patterns or trends, or to detect potential issues.

**Q3. What questions would you ask before making a dashboard, chart or visualization?**

If you're asked a question like this, be sure to relate any experience you have. But ultimately, you should be thinking about the types of data that will be used and the relationships you're trying to show with the data. Some key questions to ask are:

- What business decisions will be made with this dashboard?
- Will it be primarily retrospective or real-time?
- Are the end-users technically savvy?

### Tableau Interview Questions for Data Analysts

Tableau is a data visualization tool that’s widely used across all industries. Typically, companies use Tableau to create dashboards, which are tools that display live and real-time metrics for businesses to look at. For example, a marketing team might create a dashboard that displays daily conversions, daily marketing spend, and daily website traffic.

**Q1. What's the difference between dimensions and measures? Provide examples of each.**

Dimensions in Tableau are qualitative values, e.g. names, dates and locational data. You can use them to categorize or segment the data. Measures, on the other hand, are numerical data, e.g. traffic volume, sales volume, etc. Measures can be aggregated.

**Q2. What is aggregation in Tableau? What aggregate functions are available?**

Aggregation in Tableau means aggregating multiple row values into a single value. There are many different types of aggregate functions available. Some of the most common are:

- **SUM:** Adds up all the values.
- **AVG:** Calculates the average of the selected values.
- **MEDIAN:** Calculates the median.
- **COUNT:** Returns the number of values in a set of data.

**Q3. What is data blending?**

Data blending is used to combine data from multiple sources. Blending allows users to display data from a primary source alongside data from a secondary source. In Tableau, the common ways to combine data include:

- Blends
- Joins
- Relationships

### Miscellaneous Questions in Data Analyst Interviews

Here are a few more practice interview questions in a variety of categories. These may be asked along with a specific skill, or to test your knowledge of key data analyst job skills.

**Q1. How will you compare the performance of two search engines?**

**Q2. Describe any data cleansing techniques that you've applied.**

**Q3. Name the best Google product that you've used. What would you change about it, and how would you improve it?**

**Q4. Describe the different ways you measure customer checkout rates.**

**Q5. What is the use of regularization? Tell us the difference between L1 and L2 regularization.**

**Q6. What is a random forest? What do you know about Naïve Bayes classifiers?**

**Q7. How would you determine if a given coin is biased?**

**Q8. What do ACF and PACF mean?**

### Behavioral Interview Questions for Data Analysts

These questions ask about specific situations you've been in where you had to apply particular skills or knowledge. For many data analysts, behavioral questions can be fairly tough. One tip: always try to relate the question back to your experience and strengths.

**Q1. Talk about a time where you had to make a decision in a lot of uncertainty.**

**Q2. How do you set goals and achieve them? Give us an example.**

**Q3. Describe a time when you solved a conflict at work.**

**Q4. Give an example of a situation when you have shown effectiveness, empathy, humbleness, and adaptability.**

**Q5. Give us an example of when you failed.**

**Q6. Talk about an occasion when you used logic to solve a problem.**

**Q7. What do you do if you disagree with your senior?**

## Data Analyst Take-Home Assignment

A take-home assignment or challenge is a common part of data analyst interviews. These take-home assignments **must be completed to advance to the onsite round.** The process generally tests your overarching understanding of data analysis and your ability to communicate findings in a written format.

Here's how the **take-home process usually goes:**

**Analyzing Data**

Each analysis starts with a problem or a task. The level of difficulty of these tasks can differ greatly. An example of a simple task would be if you were asked to write a query to provide a statistic, like yesterday’s sales in dollars. An example of a more difficult task is when the answer isn’t clear and you’re asked to explore the data, e.g. if you were asked to figure out why last month’s sales performed much worse than other months.

**Explore and Query**

Once you receive a problem, you’ll usually write a query or a number of queries to explore and gather the information that you need to solve the problem. This means that **you’ll probably need to know SQL or Python (or both) to gather the information you need.**

Continuing with the previous example, if you were asked to figure out why last month’s sales performed much worse than other months, you might query the average customer review rating last month to see if there was a problem with the product, or you might query last month’s marketing spending compared to other months to see if there was a significant cut in marketing spending.

**Gather Insights**

**The next step is to gather your insights**. Sometimes, this means copying and pasting query results into an Excel sheet. Other times, it means saving the queries that you used to find the information you need for the next step.

**Visualize Insights**

Once you gather your insights, **you may be required to visualize your findings**. Sometimes, it'll be as simple as making a bar graph in Excel. Other times, it means creating an extensive dashboard to be used by C-suite executives. The skills required in this step depend on both the company and the project. They include, but are not limited to, PowerPoint, Excel, Tableau, and Matplotlib.

**Communicate Your Findings**

Lastly, **you’ll be required to communicate your results**, whether through a slide deck with several static graphs or a dashboard with several KPI metrics. Similar to the STAR method for answering behavioral questions, you would walk through the problem, the task, the approach you took, and the end result(s).

### Thanks for Reading!

Check out all of our learning resources to practice for your data analyst interview. Try our Data Science Course to learn key concepts. Or test your knowledge with our Data Analyst interview guides for Google and Facebook.