Probability questions are often asked in both data science and analytics interviews at FAANG companies and other big tech firms.
Testing your probability knowledge gives companies a good idea of your analytical reasoning skills and intelligence. The questions often take the form of case studies: you are given a scenario and then asked to compute a probability for it. Besides case studies, conceptual questions are common too.
To recap, the most common types of data science probability questions are:
Ultimately, to answer probability interview questions, you need a strong basis in probability theory, and therefore it is recommended to brush up on it during your interview prep.
Conceptual questions test your knowledge of probability theory. These are short, quiz-like questions that ask about types of distributions, definitions of concepts like the Central Limit Theorem, or use cases for concepts like Bayes’ Theorem.
To answer this type of probability question successfully, your answer must be accessible to a layperson.
Probability distributions describe random variables and the probabilities associated with their different outcomes. In essence, a distribution maps each possible outcome to its probability. For example, a distribution of test grades might look similar to a normal distribution, AKA a bell curve, with the highest number of students receiving Cs and Bs, and a smaller percentage of students failing or receiving a perfect score. In this way, the center of the distribution would be the highest, while outcomes at either end of the scale would fall lower and lower.
A Bernoulli distribution models the event of conducting one trial of an experiment with only two possible outcomes, like a coin flip (Heads/Tails) or which team will win the Super Bowl in a given year (49ers/Chargers). A binomial distribution models the event of conducting n independent trials that each have two possible outcomes, like tossing a coin 100 times (Heads/Tails) or asking 50 people if they have visited Hong Kong (Yes/No).
A probability distribution is not normal if its observations do not cluster symmetrically around the mean to form the bell curve. An example of a non-normal probability distribution is a uniform distribution, in which all values are equally likely to occur within a given range. A random number generator set to produce only the numbers 1-5 would create such a non-normal distribution, as each value would be roughly equally represented in your distribution after several hundred iterations.
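As a rough illustration of the two distributions, the sketch below computes binomial probabilities directly from the counting formula and treats a Bernoulli trial as the n = 1 case. The function name `binomial_pmf` and the fair-coin probability of 0.5 are assumptions made for this example.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# A Bernoulli trial is the binomial special case with n = 1 (one coin flip).
p_heads_once = binomial_pmf(1, 1, 0.5)

# Probability of exactly 50 heads in 100 tosses of a fair coin.
p_fifty_heads = binomial_pmf(50, 100, 0.5)
```

Even the single most likely count (50 heads) has a probability of only about 8%, because the probability mass is spread over 101 possible counts.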
In probability theory and statistics, Bayes’ Theorem describes the probability of an event given conditions that are known to hold: P(A|B) = P(B|A) P(A) / P(B). Essentially, the theorem allows us to update our beliefs about a random event based on new evidence related to that event.
For example, if the risk of customer churn increases the longer a user has been inactive, Bayes’ Theorem allows us to more accurately assess the churn risk for users, because we can condition the probability of churn on how long the user has been inactive.
Covariance can take on any numeric value, while correlation can only take on values between -1 (perfect inverse correlation) and 1 (perfect direct correlation).
Note: A zero value for correlation means there is no linear relationship between the two variables.
Because covariance depends on the units and scale of the variables while correlation is normalized, the relationship between two variables can have a covariance that seems high, but only a middling correlation value.
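To make the scale dependence concrete, here is a small sketch using made-up height and weight data (the numbers are purely illustrative). Converting heights from meters to centimeters multiplies the covariance by 100 but leaves the correlation unchanged.

```python
def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance with the usual n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / (covariance(xs, xs) ** 0.5 * covariance(ys, ys) ** 0.5)


# Hypothetical data, invented for this example only.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weights_kg = [52, 60, 63, 75, 80]
heights_cm = [h * 100 for h in heights_m]

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)    # 100 times larger than cov_m
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)  # same as corr_m
```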
The Law of Large Numbers says that the sample mean converges to the population mean as the sample size grows. In other words, the average of your sample is predictive of the average of the entire population, and it becomes more accurate with a larger sample. The Central Limit Theorem, by contrast, states that as the sample size n becomes larger, the distribution of the sample mean can be approximated by the normal distribution (it will appear more like a normal bell curve), provided the population has finite variance.
Probability mass functions describe discrete distributions. Using a probability mass function, we can determine the probability that a random variable is exactly equal to a target value x.
Probability density functions describe continuous probability distributions. Because the probability of any single exact value is zero for a continuous variable, we use a probability density function to determine the probability that the variable falls within a range around the target value, which can be found by calculating the area under the curve over that interval.
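The contrast can be sketched with a fair six-sided die for the discrete case and a normal variable for the continuous case. The helper names below are assumptions for this example; `statistics.NormalDist` is part of the Python standard library.

```python
from fractions import Fraction
from statistics import NormalDist


def die_pmf(x: int) -> Fraction:
    """Discrete case: probability that a fair die shows exactly x."""
    return Fraction(1, 6) if x in range(1, 7) else Fraction(0)


def normal_interval_probability(lo: float, hi: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Continuous case: area under the normal density between lo and hi."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(hi) - dist.cdf(lo)


p_three = die_pmf(3)                                      # exactly 3 on a die
p_within_one_sd = normal_interval_probability(-1.0, 1.0)  # roughly 0.68
```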
In probability, a confidence interval is a range of values that is expected to contain the true value of the quantity you are estimating. It is typically computed as your estimate plus or minus a margin of error that reflects the variation in your sample.
For example, suppose a presidential popularity poll reported a 93% confidence interval of 50%-55% approval. If you repeated the poll 100 more times with fresh samples and built an interval the same way each time, you would expect about 93 of those intervals to contain the true approval rate and about seven to miss it. More polling would allow you to get closer to the true population average, and narrow the interval.
An unbiased estimator is a statistic whose expected value equals the population parameter it is used to approximate. An example would be taking a random sample of 10,000 voters in a political poll to estimate a figure for the total voting population. In practice, a poll is rarely perfectly unbiased, because that would require every eligible respondent to be equally likely to be sampled and to respond, which is hard to guarantee amongst what is often millions of eligible respondents.
The main object of study in probability is events. An event is simply an outcome, or a set of outcomes, of some experiment, such as a coin landing on heads. Questions on events typically focus on games of chance and ask you to determine the probability of an event occurring.
These probability interview questions deal with independent and dependent events:
Independent events do not affect the outcome of another event, while dependent events do affect the other’s outcomes.
For example, if you were asked to toss a coin 100 times, a coin flip would be an independent event, because the probability of each successive flip would remain 50-50. Getting a heads on the first flip does not influence your chances of either a heads or tails on the second. Drawing a card from a deck (without replacement) would be a dependent event, because with each draw the deck gets smaller, affecting the outcome of each successive draw.
First, start by listing the possibilities. There are four possible outcomes for this family:
We can rule out No. 4 since we know at least one of the children is already a boy. Out of the three remaining, only 1 is right. Therefore, the probability is ⅓.
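The counting argument can be checked by brute force. This sketch assumes the question asks for the probability that both children are boys given that at least one is a boy, and that boys and girls are equally likely; the ordering of the four outcomes in the code is arbitrary.

```python
from fractions import Fraction
from itertools import product

# All (first child, second child) combinations: BB, BG, GB, GG.
families = list(product("BG", repeat=2))

# Condition on the known fact: at least one child is a boy (drops GG).
at_least_one_boy = [family for family in families if "B" in family]

# Of the remaining families, count those where both children are boys.
both_boys = [family for family in at_least_one_boy if family == ("B", "B")]

p_both_boys = Fraction(len(both_boys), len(at_least_one_boy))
```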
First, separate it into two instances. One would be if you grabbed the fair coin, and the second would be for the biased coin. Then, solve the probability of flipping the same side for both. Then, you would account for the probabilities of flipping heads both times, and the probability of flipping tails both times. Finally, you would need to account for the probability of selecting the fair or biased coin.
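The steps above can be sketched as a small function. The probabilities passed in at the bottom are hypothetical placeholders, since the coin parameters from the question are not restated here; substitute the values from the actual prompt.

```python
def prob_same_side_twice(p_pick_fair: float, p_heads_biased: float) -> float:
    """Probability that two flips of one randomly chosen coin land on the same side.

    p_pick_fair:    chance of grabbing the fair coin (assumed parameter)
    p_heads_biased: chance that the biased coin lands heads (assumed parameter)
    """

    def same_side(p_heads: float) -> float:
        # Heads both times or tails both times.
        return p_heads**2 + (1 - p_heads) ** 2

    p_pick_biased = 1 - p_pick_fair
    return p_pick_fair * same_side(0.5) + p_pick_biased * same_side(p_heads_biased)


# Hypothetical inputs: equal chance of either coin, biased coin lands heads 75% of the time.
example = prob_same_side_twice(p_pick_fair=0.5, p_heads_biased=0.75)
```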
More context: Each of three zebras is sitting on a corner of an equilateral triangle. Each zebra randomly picks a direction and only runs along the side, never deviating from the triangle’s outline.
To answer this question, think about the scenarios in which the zebras will not collide:
Then start calculating how the zebras could interact outside of those scenarios that would result in collisions.
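One way to check the reasoning is to enumerate every combination of directions. The sketch assumes each zebra independently picks clockwise or counterclockwise with equal probability and that all zebras run at the same speed, so a collision is avoided only when all three pick the same direction.

```python
from fractions import Fraction
from itertools import product

directions = ("clockwise", "counterclockwise")

# Every way the three zebras can choose a direction: 2^3 = 8 combinations.
choices = list(product(directions, repeat=3))

# No collision only if all three zebras run the same way around the triangle.
no_collision = [choice for choice in choices if len(set(choice)) == 1]

p_no_collision = Fraction(len(no_collision), len(choices))
p_collision = 1 - p_no_collision
```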
Break this down into two different cases for the different color or different shape (suit):
Case 1: Pulling a card of a different color
Assuming the first card pulled is of a certain color, let’s say red, we now have 26 choices for a card of the other color (black) available from a total of 51 remaining cards. Therefore, the probability of pulling a card of a different color is 26⁄51.
Case 2: Pulling a card of a different suit
Since in a normal deck of playing cards all suits are composed of only two different colors (diamonds and hearts are red, clubs and spades are black), a different suit doesn’t necessarily mean a different color than the first card we pick. The “or color” part is then redundant.
We removed one card from the 52 card deck, so we will not pull it again. That means if you started with 4 suits of 13 cards each, you would have 13*3 cards left that would be of a different suit from the first in the deck, and 51 remaining cards in the deck.
Therefore, you have a 39⁄51 chance of pulling a card of a different suit or color.
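Both cases can be verified by building the deck and counting. This is a minimal sketch; the choice of the first card is arbitrary because the deck is symmetric.

```python
from fractions import Fraction

suit_colors = {"hearts": "red", "diamonds": "red", "clubs": "black", "spades": "black"}
deck = [(rank, suit) for suit in suit_colors for rank in range(1, 14)]

first = deck[0]       # any card works as the first draw
remaining = deck[1:]  # 51 cards left

different_color = [c for c in remaining if suit_colors[c[1]] != suit_colors[first[1]]]
different_suit = [c for c in remaining if c[1] != first[1]]

p_different_color = Fraction(len(different_color), len(remaining))  # 26/51
p_different_suit = Fraction(len(different_suit), len(remaining))    # 39/51
```

Every card of a different color is also of a different suit, which is why the second count already covers the first.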
Hint: First, we consider how many ways we can possibly roll a seven. Let D1 and D2 be the results of the first and second roll, respectively. The pairs (D1, D2) that sum to 7 are (1,6), (2,5), (3,4), (4,3), (5,2), and (6,1). Therefore, there are six ways you can roll a 7.
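The count in the hint can be confirmed by listing all 36 ordered outcomes for two fair six-sided dice, as in this short sketch.

```python
from fractions import Fraction
from itertools import product

# All ordered (D1, D2) outcomes for two six-sided dice.
rolls = list(product(range(1, 7), repeat=2))

# Keep the outcomes where the two dice sum to 7.
sevens = [(d1, d2) for d1, d2 in rolls if d1 + d2 == 7]

p_seven = Fraction(len(sevens), len(rolls))
```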
More Context: Let’s say that you are drawing N cards (without replacement) from a standard 52 card poker deck. Each card is unique and part of 4 different suits and 13 different ranks.
See a full solution for this hard probability problem on Interview Query.
Combinatorics refers to the study of counting things. Many probability interview questions ask questions related to combinations, in which the order doesn’t matter, and permutations, in which the order does matter. Combinatorics questions might be “how many ways can you arrange n books?” or “How many ways can you choose a committee of K people from a pool of N?”
The key difference is whether the order matters. In combinations, you select r elements from a set of n (without replacement) and the order does not matter; combinations are frequently used for grouped data. An example would be picking 3 items off a restaurant menu for your entrees: it hardly makes a difference which item you order first or last, since all will come out of the kitchen at the same time.
With permutations, you also pick r elements from a set of n (without replacement), but the order does matter. An example would be selecting your 1st, 2nd, and 3rd choices for kickball teammates or choosing your Top 3 favorite movies in order; in both cases, your top choice comes first.
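Python's standard library exposes both counts directly through `math.comb` and `math.perm`. The menu size of 10 below is a made-up number for illustration.

```python
from math import comb, perm

menu_items = 10  # hypothetical menu size
picks = 3

# Order does not matter: which 3 entrees end up on the table.
entree_combinations = comb(menu_items, picks)

# Order matters: a ranked 1st, 2nd, and 3rd choice from the same 10 options.
ranked_choices = perm(menu_items, picks)
```

Each unordered group of 3 can be arranged in 3! = 6 orders, so the permutation count is six times the combination count.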
Discrete variables are countable, while continuous variables are measurable. A discrete variable would be the number of Fabergé eggs created: there are only so many in the world, and no more are being produced. Other examples of a discrete variable could be the number of students in a class, or the amount of money in your wallet. A continuous variable, on the other hand, would be something like age, because you could keep measuring it more and more precisely, e.g. I am 33 years, 9 months, 2 days, 5 hours, 4 seconds… and so on. Continuous values are infinitely divisible.
You can turn a continuous variable into a discrete variable, by making it countable. For example, you could count a toddler’s age in months.
We can start by thinking about possible outcomes:
With this information, we can then calculate how many flips you would expect to get two tails in a row:
x = (1⁄2)(x+1) + (1⁄4)(x+2) + (1⁄4)2
x = 1⁄2 [ (1+x) + 1⁄2(2+x) + 1 ]
x = 1⁄2 [ 1 + x + 1 + x/2 + 1 ]
x = 3⁄2 + (3⁄4)x
x / 4 = 3⁄2
x = 6
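The same equation can be solved exactly with rational arithmetic. This sketch assumes a fair coin and simply collects the terms of x = (1⁄2)(x+1) + (1⁄4)(x+2) + (1⁄4)2 into a coefficient on x and a constant.

```python
from fractions import Fraction


def expected_flips_for_two_tails() -> Fraction:
    """Solve x = (1/2)(x + 1) + (1/4)(x + 2) + (1/4)(2) for x."""
    half, quarter = Fraction(1, 2), Fraction(1, 4)

    # Total weight on x on the right-hand side: cases where we start over.
    coefficient_on_x = half + quarter

    # Flips already spent in each case, weighted by that case's probability.
    constant = half * 1 + quarter * 2 + quarter * 2

    # x = constant + coefficient_on_x * x  =>  x = constant / (1 - coefficient_on_x)
    return constant / (1 - coefficient_on_x)


expected_flips = expected_flips_for_two_tails()
```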
More context: Let’s say we’re trying to determine fake reviews on our products. Based on past data, 98% of reviews are legitimate and 2% are fake. If a review is fake, there is a 95% chance that the machine learning algorithm identifies it as fake. If a review is legitimate, there is a 90% chance that the machine learning algorithm identifies it as legitimate.
See a full solution for this Bayes’ Rule probability question on Interview Query.
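As a sketch of how Bayes’ Rule applies to the numbers above, the function below computes the probability that a review is actually fake given that the algorithm flagged it as fake. That this is the quantity being asked for is an assumption, since the question itself is not restated here.

```python
def posterior_fake_given_flagged(
    p_fake: float,
    p_flag_given_fake: float,
    p_legit_given_legit: float,
) -> float:
    """P(review is fake | algorithm flags it as fake), via Bayes' Rule."""
    p_legit = 1 - p_fake
    # A legitimate review is wrongly flagged whenever it is not identified as legitimate.
    p_flag_given_legit = 1 - p_legit_given_legit
    # Total probability that any review gets flagged as fake.
    p_flag = p_flag_given_fake * p_fake + p_flag_given_legit * p_legit
    return p_flag_given_fake * p_fake / p_flag


p_fake_given_flagged = posterior_fake_given_flagged(
    p_fake=0.02, p_flag_given_fake=0.95, p_legit_given_legit=0.90
)
```

With these inputs, only about 16% of flagged reviews are actually fake, because legitimate reviews vastly outnumber fake ones.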
Here’s a hint:
We can generalize to two scenarios of drawing an ace on the second card:
If we model the probability of the first scenario we can multiply the two probabilities of each occurrence to get the actual probability.
Now the second scenario:
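The two scenarios can be written out with exact fractions. This sketch assumes a standard 52-card deck with 4 aces, drawn without replacement, and takes the two scenarios to be "first card is an ace" and "first card is not an ace".

```python
from fractions import Fraction

# Scenario 1: ace on the first card, then another ace on the second.
p_ace_then_ace = Fraction(4, 52) * Fraction(3, 51)

# Scenario 2: non-ace on the first card, then an ace on the second.
p_other_then_ace = Fraction(48, 52) * Fraction(4, 51)

# The scenarios are mutually exclusive, so their probabilities add.
p_second_card_is_ace = p_ace_then_ace + p_other_then_ace
```

The total simplifies to 1⁄13, the same as the chance that the first card is an ace.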
Hint: Imagine this as a sample space problem ignoring all other distracting details. If you have to draw three different numbered cards without replacement, and they are all unique, then we are assuming that there will be, effectively, a lowest card, a middle card and a high card.
Let’s make it easy and assume we drew the numbers 1,2, and 3. In our scenario, if we drew (1,2,3), then that would be the winning scenario. But what’s the full range of outcomes we could draw? Let’s map out all of the possibilities.
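Mapping out the possibilities is a short enumeration. The sketch assumes every ordering of the three distinct cards is equally likely and that the only winning draw is the strictly increasing one.

```python
from fractions import Fraction
from itertools import permutations

# Every order in which the cards 1, 2, and 3 could be drawn.
orderings = list(permutations((1, 2, 3)))

# The winning draw is the one that comes out lowest, middle, highest.
winning = [order for order in orderings if list(order) == sorted(order)]

p_win = Fraction(len(winning), len(orderings))
```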
More context: Netflix has hired people to rate movies. Out of all of the raters, 80% of the raters carefully rate movies and rate 60% of the movies as good and 40% as bad. The other 20% are lazy raters and rate 100% of the movies as good. For this question, assume all raters rate the same amount of movies.
Probability distributions are statistical functions that describe the likelihood of obtaining possible values that a random variable can take. In other words, the values of the variable will change based on the underlying probability distribution. This is one of the most common types of data science interview questions.
Hint: In this scenario, it is much easier to solve by figuring out the probability of never rolling a 3. The probability of rolling a 3 on one die is ⅙, so the probability of not rolling a 3 is 1 - ⅙ = 5⁄6. With two dice, we multiply the probabilities together, (5⁄6)(5⁄6), and so on for all N dice, giving (5⁄6)^N. Subtracting that from 1 gives the probability of rolling at least one 3.
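The complement approach from the hint can be sketched in a couple of lines. It assumes the question asks for the probability of at least one 3 among N fair, independent dice; the function name is made up for this example.

```python
from fractions import Fraction


def p_at_least_one_three(n_dice: int) -> Fraction:
    """1 minus the probability that none of the n dice shows a 3."""
    p_no_three_on_one_die = Fraction(5, 6)
    return 1 - p_no_three_on_one_die**n_dice


p_one_die = p_at_least_one_three(1)   # 1/6
p_two_dice = p_at_least_one_three(2)  # 11/36
```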
With this question, the solution requires a multinomial distribution, with n=12 and k=3.
Note: In this problem, the players are indistinguishable from each other.
An additional question: Given that there are N sessions and they convert with probability q, what is the expected number of converted sessions?
A few questions to ask yourself include “Does the outcome of one of the users really affect the outcome of the other user?” and “What are the rules for adding together many different experiments’ expectations?”
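For the additional question, linearity of expectation gives N times q. The sketch below, which assumes sessions convert independently of one another, cross-checks that shortcut against the full binomial sum; the values N = 20 and q = 0.3 are arbitrary.

```python
from math import comb


def expected_conversions(n_sessions: int, q: float) -> float:
    """Linearity of expectation: each session contributes q on average."""
    return n_sessions * q


def expected_conversions_from_pmf(n_sessions: int, q: float) -> float:
    """Same quantity computed the long way, as sum of k * P(exactly k conversions)."""
    return sum(
        k * comb(n_sessions, k) * q**k * (1 - q) ** (n_sessions - k)
        for k in range(n_sessions + 1)
    )


shortcut = expected_conversions(20, 0.3)
long_way = expected_conversions_from_pmf(20, 0.3)
```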
Hint: What are some of the scenarios we can imagine when we randomly sample from each distribution? Write out each of the possibilities where X > Y and X < Y, as well as possible values of X and Y in each.
A question like this tests your knowledge of the concepts of uniform and normal distributions.
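There’s a simple answer to this. To simulate draws from a uniform distribution, you would take draws from the normal distribution and plug them into the normal cumulative distribution function (CDF) for the same random variable. The resulting values are uniformly distributed between 0 and 1.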
This is known as the Universality of the Uniform or Probability Integral Transform.
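A minimal sketch of the transform, using only the standard library: normal draws are passed through the CDF of the same normal distribution. The seed, sample size, and function name are arbitrary choices for this example.

```python
import random
from statistics import NormalDist


def normal_to_uniform(draws, mu: float = 0.0, sigma: float = 1.0):
    """Map draws from N(mu, sigma) to Uniform(0, 1) via that distribution's CDF."""
    dist = NormalDist(mu, sigma)
    return [dist.cdf(x) for x in draws]


rng = random.Random(0)  # fixed seed so the example is repeatable
normal_draws = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
uniform_draws = normal_to_uniform(normal_draws)

# For a uniform variable on (0, 1), the average should sit near 0.5.
uniform_mean = sum(uniform_draws) / len(uniform_draws)
```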
The means would need to be more than two standard deviations apart.
Scenario: There’s a swimmer who is 5.8 feet tall and doesn’t know how to swim. The swimmer wants to swim in a lake with an average depth of 5.6 feet and a standard deviation of 1 foot. Assume that when the lake’s depth exceeds the swimmer’s height, the swimmer dies.
See a full solution to this hard probability problem on Interview Query.
This question requires some memorization. At first glance, we can infer that it’s a binomial distribution problem, given that we have to guess the number of heads out of a number of trials. Therefore, it can be modeled as a binomial distribution with n trials and a probability of success p on each trial.
Take our probability course to review concepts like distributions and combinatorics. Sign up for Interview Query to access a variety of data science interview resources, including: