Statistics and A/B testing interview questions appear in approximately 60% of data science interviews. These questions are designed to assess two fundamental data science skills:

- Your foundational knowledge of statistics and experiment design
- Your approach to using A/B testing data to inform product and business decisions

A/B testing and statistics questions are usually asked in tandem in interviews, as statistics are heavily utilized in experiment design. There are four main types of statistics and A/B testing questions that are asked in interviews:

- **Basic A/B testing interview questions:** These questions are easy, definition-based A/B testing questions that determine whether you understand the fundamentals of and statistical concepts behind A/B testing.
- **A/B testing case studies:** These questions provide you with a hypothetical A/B testing scenario, and based on the case, you’ll need to design a functional A/B test or analyze the results of a hypothetical test.
- **Basic statistics interview questions:** These questions test your conceptual knowledge of statistics and your ability to communicate statistical information to a layperson.
- **Statistics case study questions:** These questions ask you to make computations (e.g., mean and variance in a non-normal distribution) and can cover a wide range of statistics and probability concepts.

In data science interviews, easy A/B testing interview questions ask for definitions, gauge your understanding of experiment design, and determine how you approach setting up an A/B test. Some of the most common basic A/B testing interview questions include:

This type of question assesses your foundational knowledge of A/B test design. You don’t want to begin without first understanding the problem the test aims to solve. Some questions you might ask include:

- What does the sample population look like?
- Have we taken steps to ensure our control and test groups are truly randomized?
- Is this a multivariate A/B test? If so, how does that affect the significance of our results?

The most important consideration is that running multiple t-tests exponentially increases the probability of false positives (also called Type I errors).

“Exponentially” here is not a placeholder for “a lot.” If each test has false-positive probability ψ, the probability of never getting a false positive in n independent tests is (1 − ψ)^n, which tends to zero as n → ∞. The probability of at least one false positive, 1 − (1 − ψ)^n, therefore approaches one.

There are two main approaches to consider in this situation:

- Use a correction method such as the Bonferroni correction
- Use an F-test instead
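To make the first option concrete, here is a minimal sketch of how the family-wise error rate grows with the number of tests and how the Bonferroni correction adjusts the per-test threshold. It assumes the tests are independent; the function names are illustrative, not from any particular library.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    for n in (1, 5, 20):
        uncorrected = family_wise_error_rate(0.05, n)
        # Re-compute the family-wise rate using the stricter per-test threshold.
        corrected = family_wise_error_rate(bonferroni_alpha(0.05, n), n)
        print(f"tests={n:2d}  uncorrected={uncorrected:.3f}  corrected={corrected:.3f}")
```

With the corrected threshold, the family-wise rate stays at or below the original 0.05 no matter how many tests are run, at the cost of reduced power for each individual test.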

In general, you should start with understanding what you want to measure. From there, you can begin to design and implement a test. There are four key factors to consider:

- **Setting metrics:** A good metric is simple, directly related to the goal at hand, and quantifiable. Every experiment should have one key metric that determines whether the experiment was a success or not.
- **Constructing thresholds:** Determine by what degree your key metric must change in order for the experiment to be considered successful.
- **Sample size and experiment length:** How large of a group are you going to test on and for how long?
- **Randomization and assignment:** Who gets which version of the test and when? You need at least one control group and one variant group. As the number of variants increases, the number of groups needed increases too.

It’s helpful to first explain the difference between a user-tied test and a user-untied test. A user-tied test is a statistical test in which the experiment buckets users into two groups on the user level. Conversely, a user-untied test is one in which users are not bucketed on the user level.

For example, in a user-untied test on a search engine, traffic is split at the search level instead of the user level given that a search engine generally does not need you to sign in to use the product. However, the search engine still needs to A/B test different algorithms to measure which ones are better.

One potential con of a user-untied test is bias: because users aren’t bucketed on the user level, the same user can potentially see both treatments. What are other pros and cons you can think of?

Typically, the significance level of an experiment is 0.05 and the power is 0.8, but these values may shift depending on how much change needs to be detected to implement the design change. The amount of change needed can be related to external factors such as the time needed to implement the change once the decision has been made.
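As a rough illustration of how significance level, power, and the size of the change to detect interact, the sketch below uses the standard normal-approximation formula for comparing two proportions. The baseline and target conversion rates are hypothetical, and the function name is an illustrative choice.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control: float, p_variant: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per group for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance_sum = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / effect ** 2)


if __name__ == "__main__":
    # Hypothetical rates: 10% baseline, hoping to detect a lift to 12% or to 11%.
    print(sample_size_per_group(0.10, 0.12))
    print(sample_size_per_group(0.10, 0.11))
```

Halving the detectable lift roughly quadruples the required sample, which is why the minimum change worth detecting matters so much when planning a test.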

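A p-value below 0.05 means that, if the null hypothesis were true, you would see data at least as extreme as yours less than 5% of the time. By convention this counts as statistically significant, but it does not prove that your hypothesis is correct.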

**Hint:** Is the interviewer leaving out important details? Are there more assumptions you can make about the context of how the A/B test was set up and measured that will lead you to discovering invalidity?

Looking at the actual measurement of the p-value, you already know that the industry standard is .05, which means that if there were truly no difference between the populations, you would still see a significant result about 1 out of 20 times that you perform the test. However, you have to note a couple of considerations about the test in the measurement process:

- The sample size of the test
- How long it took before the product manager measured the p-value
- How the product manager measured the p-value and whether they did so by continually monitoring the test

What would you do next to assess the test’s validity?

There are numerous scenarios in which bucket testing won’t reach statistical significance or the results end up unclear. Here are some reasons you might avoid A/B testing:

- **Not enough data:** A statistically significant sample size is key for an effective A/B test. If a landing page isn’t receiving enough traffic, you likely won’t have a large enough sample size for an effective test.
- **Your metrics aren’t clearly defined:** An A/B test is only as effective as its metrics. If you haven’t clearly defined what you’re measuring or your hypothesis can’t be quantified, your A/B test results will be unclear.
- **Testing too many variables:** Trying to test too many variables in a single test can lead to unclear results.

When you’re testing more than one variant, the probability that at least one variant reaches significance by chance is high. You can calculate this probability as the complement of the probability that no variant reaches significance by chance.

For a single test at a significance level of 0.05: P(one significant result) = 1 − P(no significant results) = 1 − (1 − 0.05) = 0.05

There is a 5% probability of getting a significant result just by chance alone. This makes intuitive sense given how significance works. Now, what happens when you test 20 variants and get one variant back that is significant? What’s the likelihood it occurred by chance?

P(at least one significant result) = 1 − (1 − 0.05)^20 = 0.64

If none of the variants has a real effect, there is now a 64% chance that you get at least one incorrect significant result. This result is otherwise known as a false positive.
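The 64% figure can be checked empirically. The following is a minimal simulation sketch that relies on one assumption: when there is no real effect, a test's p-value is uniformly distributed between 0 and 1, so each variant is "significant" with probability 0.05. The function name and number of simulated experiments are arbitrary choices.

```python
import random


def simulate_false_positive_rate(n_variants: int = 20, alpha: float = 0.05,
                                 n_experiments: int = 100_000, seed: int = 0) -> float:
    """Share of simulated experiments with at least one falsely significant variant."""
    rng = random.Random(seed)
    experiments_with_hit = 0
    for _ in range(n_experiments):
        # Under the null hypothesis, each p-value is a uniform draw on [0, 1).
        if any(rng.random() < alpha for _ in range(n_variants)):
            experiments_with_hit += 1
    return experiments_with_hit / n_experiments


if __name__ == "__main__":
    print(f"{simulate_false_positive_rate():.3f}")
```

The simulated share should land close to the analytical value of 1 − 0.95^20.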

Experiment length is a function of sample size since you’ll need enough time to run the experiment on X users per day until you reach your total sample size. However, time introduces variance into an A/B test; there may be factors present one week that aren’t present in another, like holidays or weekdays vs. weekends.

The rule of thumb is to run the experiment for about two weeks, provided you can reach your required sample size in that time. Most split tests run for 2-8 weeks. Ultimately, the length of the test depends on many factors, such as traffic volume and the variables that are being tested.

If you’re looking for an alternative to A/B testing, there are two common tests that are used to make UI design decisions:

- **A/B/N tests:** This type of test compares several different versions at once (the N stands for “number,” e.g., the number of variations being tested) and is best for testing major UI design choices.
- **Multivariate:** This type of test compares multiple variables at once, e.g., all the possible combinations that can be used. Multivariate testing saves time, as you won’t have to run numerous A/B tests. This type of test is best when considering several UI design changes.

It is important to ensure the test includes a representative mix of users, with a variety of attributes, so that the results of the A/B test are valid; randomizing insufficiently may result in confounding variables further down the line.

It also matters when A/B tests are given to users. For instance, is every new user given an A/B test? How will that affect assessment of existing users? Conversely, if A/B tests are assigned to all users, and some of those users signed up for the website this week, and others have been around for much longer, is the ratio of new users to existing users representative of the larger population of the site?

Finally, it is also important to ensure that the control and variant groups are of equal size so that they can be easily (and accurately) compared at the end of the test.

In general, there are many different metrics you might consider in an A/B test. But some of the most common are:

- Impression count
- Conversion rate
- Click-through rate (CTR)
- Button hover time
- Time spent on page
- Bounce rate

The variable you should use is based on your hypothesis and what you’re testing. If you’re testing a button variation, button hover time or CTR are probably the best choices. But if you’re testing messaging choices on a long-form landing page, time spent on page and bounce rate would likely be the best metrics to consider.

In general, A/B testing works best at informing UI design changes, as well as with promotional and messaging choices. You might consider an A/B test for:

- UI design decisions
- Testing promotions, coupons, or incentives
- Testing messaging variations (e.g., different headlines or calls-to-action)
- Funnel optimizations

You will want to approach A/B testing case study questions with a clear understanding of the process of A/B test experiment design. Therefore, you want to consider all potential factors when designing an A/B test. The more you can demonstrate a thorough understanding of the scope of the problem, the more attractive you’ll appear as a candidate. Some sample A/B testing case study questions include:

**More context:** The experiment measured the impact financial rewards have on users’ response rates. The treatment group with a $10 reward has a 30% response rate, while the control group with no reward has a 50% response rate. How could you improve this experimental design?

In this case, the question is prompting multiple changes within the experiment design itself. Due to the testing of two independent variables, the test will result in an interaction effect of two different variables, creating four variants:

- Red button - top of the page
- Red button - bottom of the page
- Blue button - top of the page
- Blue button - bottom of the page

The test is now multivariate. Notably, more test variants increase the variance of the results. One way to decrease these effects is to increase the length of time for the test to reach significance in order to reduce variance from day-to-day effects.

**Follow-up question:** Is an A/B test on pricing a good business decision?

A/B testing pricing generally has more downsides than upsides. One major downside is that if two users go to a pricing page and one of them sees a product for $50/month and the other sees one for $25/month, you’re risking an incentive for users to opt out of the A/B test into another bucket, creating real statistical anomalies.

However, an even larger issue with A/B testing pricing is understanding success. When testing a discount rate on a subscription product, you want to know two things:

- Does the customer convert at a higher rate for the discount?
- Is the total lifetime value of the customer higher as well?

Running a recurring revenue A/B test means a pricing test must run for at least 2 months, one month for all of the users to opt-in and a second month to examine the churn rate for all of those initial users.

What else do you need to consider?

**Follow-up question:** Will the metric actually go up by ~5%, more, or less? You can also assume there is no novelty effect.

This question starts with numerous information gaps. Clarifying questions will help you answer more confidently. Some questions to consider asking are:

- How long did the test run for?
- What are the confidence intervals for the test?
- Is there an impact analysis or assumptions of heterogeneity on the users?
- Is the A/B test sample population a representative sample of the whole?
- Are there any interactions with other experiments?

**More context:** One variant in the test has a sample size of 50,000 users, whereas the other has 200,000 users. Would the test result in bias toward the smaller group?

In this case, the interviewer is trying to assess how you would approach the problem given an unbalanced group. Since you are not given context to the situation, you have to either state assumptions or ask clarifying questions.

How long has the A/B test been running? Have they been running for the same duration of time? If the data was collected during different time periods, then bias certainly exists from one group collecting data from a different time period than the other.

Specifically, you want to create control and test groups to test the close friends feature on Instagram stories.

Let’s say you want to run the experiment with a per-user assignment. Half of the users in this experiment would get the close friends feature and half would not. To further define what it means to get the close friends feature, you would say that the users would be able to create close friend stories and see close friends’ stories on Instagram.

However, this in itself presents the problem of the per-user assignment:

- **Test group user creates story:** what does the control group friend see?
- **Control group watches stories:** what happens to all of their friends’ stories in the close friends variant?
- **Test group user does not create a story:** how do you analyze this effect on the friends that were not in this variant?

Per-user assignment fails many times in this scenario because you encounter network effects where you cannot properly hold one variable constant and test for the effect.

Understanding whether your data abides by or violates a normal distribution is an important first step in your subsequent data analysis. This understanding will change which statistical tests you use if you need to immediately look for statistical significance.

For example, a t-test can give misleading results if your data is strongly non-normal and the sample is small, since this test compares group means and assumes those means are approximately normally distributed. If you find out that the distribution of the data is not normal, there are several steps you can take to help fix this problem:

- Perform a Mann-Whitney U-Test
- Utilize bootstrapping
- Gather more data
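Of these options, bootstrapping is the easiest to sketch without extra libraries. The example below builds a percentile confidence interval for the difference in means by resampling each group with replacement. The data values are made up for illustration, and the function name, resample count, and percentile method are assumptions rather than a prescribed procedure.

```python
import random


def bootstrap_mean_diff_ci(group_a, group_b, n_resamples=5000,
                           confidence=0.95, seed=0):
    """Percentile bootstrap interval for mean(group_b) - mean(group_a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original group sizes.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(sum(resample_b) / len(resample_b)
                     - sum(resample_a) / len(resample_a))
    diffs.sort()
    lower_idx = int((1 - confidence) / 2 * n_resamples)
    upper_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return diffs[lower_idx], diffs[upper_idx]


# Hypothetical, right-skewed measurements (e.g., minutes per session).
control = [1, 1, 2, 2, 3, 3, 4, 5, 8, 20]
variant = [2, 3, 3, 4, 5, 5, 6, 8, 12, 30]
low, high = bootstrap_mean_diff_ci(control, variant)

if __name__ == "__main__":
    print(f"95% bootstrap CI for the difference in means: ({low:.2f}, {high:.2f})")
```

If the interval excludes zero, that is evidence of a difference between the groups that does not rely on a normality assumption.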

With this question, you should start with some follow-up clarifying questions like:

- How were the control and variant groups sampled?
- Was it a random sample, and was it done by following the agreed-upon test conditions?
- Is the time frame of the data for both groups the same, and did you follow the agreed-upon duration to run the analysis?
- What does the distribution of data for the two groups look like?
- Is the data normally distributed for both control and variant groups?

Once you’ve obtained clarity, you could calculate the p-value or find the rejection region to determine if you should reject or fail to reject a null hypothesis.

Let’s say you work at an eCommerce startup. You run an A/B test on the company’s checkout product page with the hypothesis that surfacing free shipping will increase conversions.

The experiment group specifies whether the product qualifies for free shipping or not. In the control group, there is no specification of free shipping.

Let’s say we run the experiment for two weeks and get these results.

Control:

- 3056 page visits
- 1413 conversions
- 46% conversion

Experiment:

- 2947 page visits
- 1533 conversions
- 52% conversion

How would you evaluate the results and if the test was successful or not?
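One reasonable starting point, among others, is a two-proportion z-test on the counts above. The sketch below uses the pooled standard error and a two-sided p-value; treating each page visit as an independent observation is an assumption you would want to state in an interview, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null hypothesis
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Counts taken from the control and experiment groups above.
z, p_value = two_proportion_z_test(1413, 3056, 1533, 2947)

if __name__ == "__main__":
    print(f"z = {z:.2f}, p-value = {p_value:.6f}")
```

A small p-value here only addresses statistical significance. You would still want to check randomization, whether two weeks covers a full business cycle, and whether the lift is large enough to matter to the business.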

Basic statistics interview questions weed out candidates who don’t have a firm technical grasp of statistics in addition to candidates who struggle to communicate their findings to others.

You might be asked to describe the difference between Type I (false positive) and Type II (false negative) errors, as well as how to go about detecting them, or to describe what a result with a significance level of 0.05 actually means. In general, if you’re familiar with the statistical concepts relevant to your job as a data scientist, this part of the interview should be pretty straightforward. Some conceptual statistics interview questions include:

The null hypothesis states that there is no relationship or difference between two sets of data. For example, if you were comparing weight loss between a keto and a low-carb diet, the null hypothesis would be that there is no difference in the mean weight loss amount between the two diets.

What an interviewer is looking for here is whether you can answer this question in a way that conveys both your understanding of statistics and your ability to explain this concept to a non-technical worker.

Use an example to illustrate your answer; you could say:

Say we’re conducting an A/B test of an ad campaign. In this type of test, you have two hypotheses. The null hypothesis states that the ad campaign will not have a measurable increase in daily active users. The test hypothesis states that the ad campaign will have a measurable increase in daily active users.

Then, use data to run a statistical test to see whether the evidence supports rejecting the null hypothesis. The p-value helps here: it is the probability of observing data at least as extreme as yours if the null hypothesis were true. Note that this is just a statement about probability given an assumption. The p-value is not a measure of “how likely” the null hypothesis is to be right, nor does it measure “how likely” it is that the observations in the data are due to random chance; these are the most common misinterpretations of the p-value. The only thing the p-value can say is how likely you are to get data like yours if the null hypothesis were true. The difference may seem very abstract and impractical, but using incorrect explanations contributes to the overvaluing of p-values in non-technical circles.

Therefore, a low p-value indicates that it would be extremely unlikely that the data would result in this way if the null hypothesis were true.

Primacy effect involves users being resistant to change, whereas novelty effect involves users becoming temporarily excited by new things. For example, Instagram launches a new feature called close friend stories. Users could be drawn to this new feature because it’s new, but after using it for a while they might lose interest and use it less.

A Type I error is a false positive, or the rejection of a true null hypothesis. A Type II error is a false negative, or the failure to reject a false null hypothesis.

**Hint:** What values can covariance take? What about correlation? The difference between covariance and correlation is that covariance can take on any numeric value, whereas correlation can only take on values between −1 and 1.
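A small numeric sketch can make the hint tangible: rescaling a variable changes its covariance with another variable but leaves the correlation untouched. The height and weight values are invented for illustration, and the helper functions are written out by hand rather than taken from a library.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical data: the same heights expressed in meters and in centimeters.
heights_m = [1.60, 1.65, 1.70, 1.75, 1.80, 1.85]
heights_cm = [h * 100 for h in heights_m]
weights_kg = [55.0, 62.0, 66.0, 72.0, 80.0, 83.0]

if __name__ == "__main__":
    print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
    print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```

Covariance carries the units of both variables, so it grows by a factor of 100 when meters become centimeters, whereas correlation is unitless and always falls between −1 and 1.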

A holdback experiment rolls out a feature to a high proportion of users but holds back the remaining percent of users to monitor their behavior over a longer period of time. This allows analysts to quantify the “lift” of a change over a longer amount of time.

**Hint:** What would a biased estimator look like? How does an unbiased estimator differ?

The key here is explaining this concept in layman’s terms. An unbiased estimator is a statistic used to approximate a population parameter that is correct on average. A statistic in this case is a value computed from data that describes a population in some way, for example, the mean or the median.

To further simplify, approximating a population parameter means using this estimator, or statistic, computed on a sample to measure the same quantity for the entire population. If you use the mean as an example, the sample mean is an unbiased estimator when its expected value equals the population mean. If the sample mean systematically overestimates or underestimates the population mean, then the average difference between the sample mean and the population mean is the bias.

Now go into the layman example by painting a picture for the interviewer that you want to figure out how many people will vote for the Democratic or Republican presidential candidate. You can’t survey the entire eligible voting population of over 100 million, so you take a sample of around 10,000 people over the phone. You find out that 50% are going to vote for the Democratic presidential candidate. Is this an unbiased estimator?

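The central limit theorem says that the mean (or sum) of a large number of independent, identically distributed random variables with finite variance is approximately normally distributed, regardless of the shape of the underlying distribution.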

Most people probably recognize the normal distribution as the ordinary bell curve, in which values are symmetrically distributed around the mean of the curve.
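The central limit theorem is easy to see in a simulation. The sketch below draws repeated samples from a skewed exponential distribution and summarizes the distribution of their means; the sample size, number of samples, and function names are arbitrary illustrative choices.

```python
import random
from math import sqrt


def sample_means(sample_size, n_samples=5000, seed=0):
    """Means of repeated samples drawn from a skewed Exponential(1) distribution."""
    rng = random.Random(seed)
    return [
        sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]


def summarize(means):
    """Return the center, spread, and share of means within one standard deviation."""
    center = sum(means) / len(means)
    spread = sqrt(sum((m - center) ** 2 for m in means) / (len(means) - 1))
    within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)
    return center, spread, within_one_sd


center, spread, within_one_sd = summarize(sample_means(sample_size=50))

if __name__ == "__main__":
    print(f"center={center:.3f} spread={spread:.3f} within_one_sd={within_one_sd:.3f}")
```

Even though individual exponential draws are heavily skewed, the sample means cluster around 1 with a spread near 1/√50, and roughly 68% of them fall within one standard deviation of the center, as a bell curve would predict.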

You need time series models because of a fundamental assumption of most standard regression models: no autocorrelation. In layman’s terms, this means that a data point’s values in the present are not influenced by the values of data points in the past.

The AR component stands for autoregression: the situation in which the value of a current data point is determined in part by the values of previous data points. The MA component stands for moving average: the situation in which the value of a current data point depends in part on the random errors (noise) of previous data points.

ARIMA models contain both an autoregressive component and a moving average component, applied to a series that has been differenced to make it stationary (the “I” stands for integrated). That is, current data points are assumed to be the result of the last p data points and last q errors, along with some random noise.
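To illustrate how the two components generate autocorrelation, here is a minimal simulation sketch with one autoregressive term and one moving average term (p = 1, q = 1) and no differencing. The coefficient values `phi` and `theta` are arbitrary assumptions chosen for the example.

```python
import random


def simulate_arma(n, phi=0.6, theta=0.3, seed=0):
    """Simulate x_t = phi * x_{t-1} + e_t + theta * e_{t-1} with Gaussian noise."""
    rng = random.Random(seed)
    series, prev_value, prev_error = [], 0.0, 0.0
    for _ in range(n):
        error = rng.gauss(0.0, 1.0)
        # AR part uses the previous value; MA part uses the previous error.
        value = phi * prev_value + error + theta * prev_error
        series.append(value)
        prev_value, prev_error = value, error
    return series


def lag1_autocorrelation(series):
    """Correlation between each point and the point immediately before it."""
    mean = sum(series) / len(series)
    numerator = sum((a - mean) * (b - mean) for a, b in zip(series[:-1], series[1:]))
    denominator = sum((a - mean) ** 2 for a in series)
    return numerator / denominator


if __name__ == "__main__":
    print(lag1_autocorrelation(simulate_arma(20_000)))                      # strongly positive
    print(lag1_autocorrelation(simulate_arma(20_000, phi=0.0, theta=0.0)))  # near zero
```

Setting both coefficients to zero reduces the series to white noise, which is the no-autocorrelation case that standard regression models assume.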

Boolean variables are variables with values of 0 or 1. Examples of these variables include gender, whether someone is employed or not, or whether something is gray or white.

The sign of the coefficient is important. A positive coefficient means that, all else equal, the outcome (or its predicted likelihood) is higher when the variable equals 1 than when it equals 0. Conversely, a negative sign implies an inverse relationship between the variable and the outcome you are interested in. Further, if the value for the boolean variable is 1, the coefficient’s effect is counted toward the outcome, and if the value is 0, the variable contributes nothing to the outcome.

Maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation are methods used to estimate a variable with respect to observed data. They both produce a single point estimate as opposed to a full distribution.

If you get into the math behind the two methods, the one distinct difference is that MAP includes a prior assumption about the model parameter. Otherwise, they are essentially identical. In MAP, the likelihood is weighted by the prior, so the estimate is pulled toward values the prior favors. Thus, both methods can be used to estimate parameters but take slightly different approaches to doing so.
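A coin-flip example shows the difference in a few lines. The sketch below assumes a Bernoulli likelihood and a Beta(a, b) prior, for which both estimates have closed forms; the prior parameters and the counts are hypothetical.

```python
def mle_bernoulli(successes: int, trials: int) -> float:
    """Maximum likelihood estimate of a success probability."""
    return successes / trials


def map_bernoulli(successes: int, trials: int,
                  prior_a: float = 2.0, prior_b: float = 2.0) -> float:
    """MAP estimate under a Beta(prior_a, prior_b) prior (mode of the posterior)."""
    return (successes + prior_a - 1) / (trials + prior_a + prior_b - 2)


if __name__ == "__main__":
    # Hypothetical data: 7 heads in 10 flips.
    print(mle_bernoulli(7, 10))            # driven by the data alone
    print(map_bernoulli(7, 10))            # pulled toward 0.5 by the Beta(2, 2) prior
    print(map_bernoulli(7, 10, 1.0, 1.0))  # a flat prior recovers the MLE
```

With a flat Beta(1, 1) prior the two estimates coincide, and as the number of trials grows the prior's influence on the MAP estimate fades.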

In statistics interviews for data science positions, you’ll likely be asked computation questions or provided case studies, which will test your knowledge of statistics and probability. Most will have a right or wrong answer.

When answering this type of question, make sure that you present not only a solution to the problem at hand but also walk the interviewer through the thought process you used when arriving at your answer. Cite any relevant statistical concepts that you’re employing in your solution and make your response as intelligible as possible for the layperson. Here are some sample statistics case study interview questions:

**Hint:** Given that X and Y both have a mean of 0 and a standard deviation of 1, what does that indicate for the distributions of X and Y? What are some of the scenarios you can imagine when you randomly sample from each distribution? Detail each of the possibilities where X > Y and X < Y, as well as possible values of X and Y in each.

The linear combination of two independent normal random variables is a normal random variable itself. How does this change how you solve for the mean of 2X − Y?

**Hint:** The variance of aX − bY depends on the constants a and b, Var(X), Var(Y), and Cov(X,Y). If you know that between independent random variables the covariance equals 0, how can you use this information for computing the variance?

Recall the equation: Var(aX − bY) = a²Var(X) + b²Var(Y) − 2ab·Cov(X, Y)
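As a sanity check on the hints above, a short simulation can estimate the mean and variance of 2X − Y directly. This sketch assumes X and Y are independent standard normal variables, as the hints describe; the function name and sample count are illustrative choices.

```python
import random


def simulate_linear_combination(a=2.0, b=1.0, n=200_000, seed=0):
    """Estimate the mean and variance of a*X - b*Y for independent N(0, 1) draws."""
    rng = random.Random(seed)
    values = [a * rng.gauss(0.0, 1.0) - b * rng.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


sim_mean, sim_variance = simulate_linear_combination()

if __name__ == "__main__":
    # With Cov(X, Y) = 0, theory gives mean 0 and variance a^2 + b^2 = 4 + 1 = 5.
    print(f"mean = {sim_mean:.3f}, variance = {sim_variance:.3f}")
```

The estimates should sit close to 0 and 5, matching a²Var(X) + b²Var(Y) when the covariance term vanishes.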

**Hint:** To decrease the margin of error, you will probably have to increase the sample size. But by how much?
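The margin of error of a sample mean is proportional to 1/√n, so the required sample size scales with the square of the ratio between the current and target margins. The sketch below assumes the standard deviation and confidence level stay fixed; the function name and the numbers in the example are hypothetical.

```python
from math import ceil


def required_sample_size(current_n: int, current_margin: float,
                         target_margin: float) -> int:
    """Sample size needed to shrink the margin of error to target_margin."""
    # margin ~ 1 / sqrt(n)  =>  n_new = n_old * (margin_old / margin_new)^2
    return ceil(current_n * (current_margin / target_margin) ** 2)


if __name__ == "__main__":
    # Halving the margin of error requires four times as many observations.
    print(required_sample_size(100, 3.0, 1.5))
```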

**Hint:** What do the two sequences, HHT and HTT, have in common? Since the question asks you to flip a fair coin until one of the two sequences comes up, it may be useful to consider the problem given a larger sample space. What happens to the probability of each sequence appearing as you flip the coin four times? Five times? N times?
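If you want to check your reasoning, a quick Monte Carlo sketch can estimate how often HHT appears before HTT. The function names, number of simulated games, and seed are arbitrary choices; only the two patterns come from the question.

```python
import random


def first_pattern(rng, patterns=("HHT", "HTT")):
    """Flip a fair coin until one of the patterns appears; return that pattern."""
    recent = ""
    while True:
        # Only the last three flips matter for three-flip patterns.
        recent = (recent + rng.choice("HT"))[-3:]
        if recent in patterns:
            return recent


def estimate_hht_first(n_games=100_000, seed=0):
    """Estimated probability that HHT shows up before HTT."""
    rng = random.Random(seed)
    wins = sum(first_pattern(rng) == "HHT" for _ in range(n_games))
    return wins / n_games


if __name__ == "__main__":
    print(f"{estimate_hht_first():.3f}")
```

The estimate should land near 2/3. Intuitively, once two heads in a row appear, HHT is guaranteed to finish first, while a head followed by a tail still leaves HTT only a 50% chance on the next flip.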

**Hint:** How many scenarios are there in which none of the zebras collide? There are two scenarios in which the zebras do not collide: if they all move clockwise or if they all move counterclockwise. How do you calculate the probability that an individual zebra chooses to move clockwise or counterclockwise? How can you use this individual probability to calculate the probability that all zebras choose to move in the same direction?

**Hint:** What’s the probability of never drawing a pair?

Having the vocabulary to describe a distribution is an important skill when it comes to communicating statistical ideas. There are four important concepts, with supporting vocabulary, that you can use to structure your answer to a question like this:

- Center (mean, median, mode)
- Spread (standard deviation, interquartile range, range)
- Shape (skewness, kurtosis, uni- or bimodal)
- Outliers (do they exist?)

In terms of the distribution of time spent per day on Facebook, you can imagine there may be two groups of people on Facebook: a) people who scroll quickly through their feed and don’t spend too much time on Facebook, and b) people who spend a large amount of their social media time on Facebook. Therefore, how could you describe the distribution using the four terms listed above?

If you assume that your new customer acquisition is uniform throughout each month and that customer churn goes down by 20% month over month, what’s the expected churn for March 1st out of all customers that bought in January?

**Hint:** What if the question is actually 100 cards and you select 3 cards without replacement, does the answer change?

A uniform distribution has a flat density over the range of values from 0 to *d*: any value between 0 and *d* is equally likely to be randomly sampled.

So, practically, if you have *N* samples and you have to estimate what *d* is with zero context of statistics and based on intuition, what value would you choose?

For example, if *N* is 5 and the values are (1, 4, 6, 2, 3), what value would you guess as *d*? Is it the max value of 6?
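One way to test the intuition is to simulate it. The sketch below compares the sample maximum with a rescaled version, max × (N + 1) / N, over many simulated samples. The true value of d used here is a made-up number, and the rescaled estimator is offered as one standard candidate, not necessarily the answer an interviewer expects.

```python
import random


def compare_estimators(d=10.0, n=5, n_trials=100_000, seed=0):
    """Average value of two estimators of d across many samples of size n."""
    rng = random.Random(seed)
    total_max = total_adjusted = 0.0
    for _ in range(n_trials):
        sample_max = max(rng.uniform(0, d) for _ in range(n))
        total_max += sample_max                     # plain sample maximum
        total_adjusted += sample_max * (n + 1) / n  # rescaled maximum
    return total_max / n_trials, total_adjusted / n_trials


avg_max, avg_adjusted = compare_estimators()

if __name__ == "__main__":
    print(f"average max = {avg_max:.3f}, average rescaled max = {avg_adjusted:.3f}")
```

The plain maximum underestimates d on average because a sample can never exceed the true upper bound, while the rescaled version averages out close to d.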

**More context:** Let’s say you are managing products for an ecommerce store. You think products from category 9 have a lower average price than those in all other categories.

Here’s a short solution using SQL:

```
WITH CTE_1 AS -- mean, sample size, and sample variance for each group
(
SELECT * FROM
(SELECT AVG(price) AS mean_1, COUNT(*) AS n_1, VAR_SAMP(price) AS var_1 FROM products WHERE category_id = 9) x
CROSS JOIN
(SELECT AVG(price) AS mean_2, COUNT(*) AS n_2, VAR_SAMP(price) AS var_2 FROM products WHERE category_id <> 9) y
),
CTE_2 AS -- var/n for each group (VAR_SAMP already returns s^2, so it is not squared again)
(
SELECT *, var_1/n_1 AS X_1, var_2/n_2 AS X_2 FROM CTE_1
)
SELECT -- t-value and degrees of freedom (Welch-Satterthwaite formula)
CAST((mean_1-mean_2)/SQRT(X_1+X_2) AS DECIMAL(10,5)) AS t_value,
FLOOR(POWER(X_1+X_2,2)/(POWER(X_1,2)/(n_1-1) + POWER(X_2,2)/(n_2-1))) AS d_o_f
FROM CTE_2
```
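For readers who want to check the query's arithmetic outside the database, the same Welch calculation can be sketched in plain Python. The price lists are hypothetical stand-ins for the two category groups, and the function name is illustrative.

```python
from math import sqrt


def welch_t_test(sample_1, sample_2):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1, mean_2 = sum(sample_1) / n1, sum(sample_2) / n2
    var_1 = sum((x - mean_1) ** 2 for x in sample_1) / (n1 - 1)
    var_2 = sum((x - mean_2) ** 2 for x in sample_2) / (n2 - 1)
    x1, x2 = var_1 / n1, var_2 / n2  # variance of each sample mean
    t_value = (mean_1 - mean_2) / sqrt(x1 + x2)
    dof = (x1 + x2) ** 2 / (x1 ** 2 / (n1 - 1) + x2 ** 2 / (n2 - 1))
    return t_value, dof


if __name__ == "__main__":
    # Hypothetical prices: category 9 versus all other categories.
    category_9 = [12.0, 15.5, 9.9, 14.0, 11.2]
    other_categories = [18.0, 22.5, 16.0, 30.0, 19.9, 25.0, 21.0]
    t_value, dof = welch_t_test(category_9, other_categories)
    print(f"t = {t_value:.3f}, degrees of freedom = {dof:.1f}")
```

A negative t-value would be consistent with category 9 having a lower average price; comparing it against a one-sided critical value for the computed degrees of freedom completes the test.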

**More context:** Capital approval rates have gone down. Let’s say last week it was 85% and the approval rate went down to 82% this week, which is a statistically significant reduction. The first analysis shows that all approval rates stayed flat or increased over time when looking at the individual products:

- Product 1: 84% to 85% week over week
- Product 2: 77% to 77% week over week
- Product 3: 81% to 82% week over week
- Product 4: 88% to 88% week over week

What could be the cause of the decrease?

This would be an example of Simpson’s Paradox, which is a phenomenon in statistics and probability. Simpson’s Paradox occurs when a trend shows in several groups but either disappears or is reversed when combining the data.

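In this specific example, each product’s approval count is binomially distributed with count n_i and probability p_i, so the mean of each distribution is n_i × p_i. The overall approval rate is the sum of n_i × p_i divided by the sum of n_i, so it depends on each product’s share of applications as well as its approval rate. If the mix shifts toward products with lower approval rates, the overall rate can fall even though no individual rate does.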
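The effect is easy to reproduce with a weighted average. In the sketch below, the per-product approval rates come from the question, while the application volumes are invented purely to illustrate a shift in product mix; they are not meant to reproduce the exact 85% and 82% figures.

```python
def overall_rate(volumes, rates):
    """Volume-weighted approval rate across products."""
    return sum(n * p for n, p in zip(volumes, rates)) / sum(volumes)


# Approval rates for products 1 to 4, taken from the question.
last_week_rates = [0.84, 0.77, 0.81, 0.88]
this_week_rates = [0.85, 0.77, 0.82, 0.88]

# Hypothetical application volumes: the mix shifts toward lower-approval products.
last_week_volumes = [300, 50, 100, 550]
this_week_volumes = [200, 450, 250, 100]

last_week_overall = overall_rate(last_week_volumes, last_week_rates)
this_week_overall = overall_rate(this_week_volumes, this_week_rates)

if __name__ == "__main__":
    print(f"last week: {last_week_overall:.3f}, this week: {this_week_overall:.3f}")
```

Every product's rate is flat or higher, yet the overall rate drops because far more applications now come from the products with the lowest approval rates. Checking how volume per product changed week over week is therefore a natural next step in the analysis.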

To prepare for a data science statistics interview, review our probability interview questions, which include questions on fundamental statistics and probability concepts. The A/B testing and statistics module in our data science course will also help you prepare for computation and statistics interviews.