The Facebook data science interview consists of multiple technical and business case questions, heavily focused on applying technical knowledge to business case scenarios. Facebook data scientists are expected to work cross-functionally and explore, analyze, and aggregate large data sets to provide actionable information.
Interviews at Facebook vary by role and team, but data science interviews commonly include questions on SQL, business cases, analytics, and product metrics.
We’ve gathered this data from parsing thousands of interview experiences sourced from members.
Facebook interview questions generally fall into four main categories:
The technical screen will generally consist of one product question and one data analysis question. Be sure to prepare for both in order to move on to the onsite.
Note: This process is the same for candidates seeking full-time jobs at Facebook and for those seeking Facebook data science internships.
Case study questions in Facebook data science interviews focus heavily on product metrics and business cases.
The question states that the metric dropped from 3% a month ago to 2.5% today. The first thing we have to do is clarify the context around the problem before jumping to conclusions about metrics. Does today fall on a different kind of day (weekday versus weekend) than the day one month ago, so that users are posting less? Is there a special event or seasonality? Is this an ongoing downward trend or a one-time dip?
The second part is understanding the metric itself. What drove the decrease: was it the number of users that increased or the number of posts that decreased? The interviewer will likely ask you to jump into one or both of the metrics to discuss what could have caused the decrease.
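As a rough illustration of that decomposition, the sketch below assumes the metric is posts divided by active users (this excerpt does not define it) and uses made-up counts. Both scenarios land on the same 2.5%, which is why the ratio alone cannot tell us which side moved.

```python
def post_rate(posts: int, users: int) -> float:
    """Hypothetical metric definition: posts per active user."""
    return posts / users

# All counts below are invented for illustration only.
baseline = post_rate(posts=3_000, users=100_000)     # 3% a month ago
more_users = post_rate(posts=3_000, users=120_000)   # denominator grew
fewer_posts = post_rate(posts=2_500, users=100_000)  # numerator shrank

for label, value in [("baseline", baseline),
                     ("more users", more_users),
                     ("fewer posts", fewer_posts)]:
    print(f"{label}: {value:.1%}")
```

Looking at the numerator and denominator separately over time is what distinguishes the two cases.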
Threading restructures the flow of comments so that, instead of responding to the post, users can now respond to individual comments beneath the post. What effect might this have on a push notification ecosystem?
We can start by outlining the structure the interviewer is looking for. Whenever we’re given open-ended product questions like this, it makes sense to structure the answer around well-defined objectives, so we’re not switching between different answers.
SQL questions are the most frequently asked in Facebook data science interviews. See additional practice questions in our guide Facebook SQL Interview Questions.
users table
Columns | Type |
---|---|
id | INTEGER |
name | VARCHAR |
created_at | DATETIME |
neighborhood_id | INTEGER |
mail | VARCHAR |
comments table
Columns | Type |
---|---|
user_id | INTEGER |
body | VARCHAR |
created_at | DATETIME |
Since a histogram is just a display of how many users fall into each comment count, all we really need to do is get each user’s total number of comments in January 2020, and then group by that count.
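One way to sketch this is to run the query against an in-memory SQLite database from Python. The sample rows are invented, the SQL is SQLite-flavored, and keeping users with zero comments (via the LEFT JOIN) is an assumption, since the original prompt is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, name VARCHAR, created_at DATETIME,
                    neighborhood_id INTEGER, mail VARCHAR);
CREATE TABLE comments (user_id INTEGER, body VARCHAR, created_at DATETIME);
-- Invented sample data for illustration.
INSERT INTO users (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
INSERT INTO comments VALUES
  (1, 'x', '2020-01-03 10:00:00'),
  (1, 'y', '2020-01-15 11:00:00'),
  (2, 'z', '2020-01-20 09:00:00'),
  (3, 'w', '2020-02-01 00:00:00'),
  (3, 'v', '2019-12-31 23:59:59');
""")

HISTOGRAM_SQL = """
WITH per_user AS (
    -- Step 1: comments per user in January 2020 (0 for users with none).
    SELECT u.id, COUNT(c.user_id) AS comment_count
    FROM users AS u
    LEFT JOIN comments AS c
      ON c.user_id = u.id
     AND c.created_at >= '2020-01-01'
     AND c.created_at < '2020-02-01'
    GROUP BY u.id
)
-- Step 2: group by that count to get the histogram buckets.
SELECT comment_count, COUNT(*) AS frequency
FROM per_user
GROUP BY comment_count
ORDER BY comment_count;
"""

histogram = conn.execute(HISTOGRAM_SQL).fetchall()
print(histogram)
```

Putting the date filter in the join condition rather than in a WHERE clause is what keeps the zero-comment users in the result.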
In the events table below, the action column takes one of the values ‘post_enter’, ‘post_submit’, or ‘post_canceled’, logged when a user starts a post (enter), ends up canceling it (cancel), or ends up posting it (submit).
events table
Column | Type |
---|---|
id | INTEGER |
user_id | INTEGER |
created_at | DATETIME |
action | VARCHAR |
url | VARCHAR |
platform | VARCHAR |
Write a query to get the post success rate for each day in the month of January 2020.
Let’s see if we can clearly define the metrics we want to calculate before just jumping into the problem. We want the post success rate for each day in January 2020.
To get that metric, we can assume post success rate can be defined as:
(total posts created) / (total posts entered)
Additionally, since the success rate must be broken down by day, we have to assume that a post entered on a given day is also completed on that same day.
Now that we have these requirements, it’s time to calculate our metrics. We know we have to GROUP BY the date to get each day’s posting success rate. We also have to break down how we can compute our two metrics of total posts entered and total posts actually created.
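A minimal sketch of that query, run against an in-memory SQLite database from Python, could look like this. The sample events are invented and the SQL is SQLite-flavored (for example, DATE() and the string date comparisons may need adjusting on other engines).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER, user_id INTEGER, created_at DATETIME,
                     action VARCHAR, url VARCHAR, platform VARCHAR);
-- Invented sample data for illustration.
INSERT INTO events (id, user_id, created_at, action) VALUES
  (1, 1, '2020-01-01 08:00:00', 'post_enter'),
  (2, 1, '2020-01-01 08:05:00', 'post_submit'),
  (3, 2, '2020-01-01 09:00:00', 'post_enter'),
  (4, 2, '2020-01-01 09:01:00', 'post_canceled'),
  (5, 3, '2020-01-02 10:00:00', 'post_enter'),
  (6, 3, '2020-01-02 10:02:00', 'post_submit'),
  (7, 3, '2020-02-01 10:00:00', 'post_enter');
""")

SUCCESS_RATE_SQL = """
SELECT DATE(created_at) AS dt,
       -- Multiply by 1.0 so the division is not done in integers.
       1.0 * SUM(CASE WHEN action = 'post_submit' THEN 1 ELSE 0 END)
           / SUM(CASE WHEN action = 'post_enter' THEN 1 ELSE 0 END)
           AS post_success_rate
FROM events
WHERE created_at >= '2020-01-01' AND created_at < '2020-02-01'
GROUP BY DATE(created_at)
ORDER BY dt;
"""

rates = conn.execute(SUCCESS_RATE_SQL).fetchall()
print(rates)
```

Conditional sums let both metrics come out of a single pass over the table, which avoids a self-join between entered and submitted posts.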
We have a table called friends with user_id and friend_id columns representing each user’s friends, and another table called page_likes with user_id and page_id columns representing the pages each user liked.
friends table
Column | Type |
---|---|
user_id | INTEGER |
friend_id | INTEGER |
page_likes table
Column | Type |
---|---|
user_id | INTEGER |
page_id | INTEGER |
Write an SQL query to create a metric to recommend pages for each user based on the pages their friends have liked.
We can start by visualizing what kind of output we want from the query. Given that we have to create a metric for each user to recommend pages, we know we want something with a user_id and a page_id along with some sort of recommendation score.
How can we easily represent the scores of each user_id and page_id combo? One naive method would be to create a score by summing up the total likes by friends on each page that the user hasn’t currently liked. The max value on our metric would be the most recommendable page.
The first thing we have to do then is to write a query to associate users with their friends’ liked pages. We can do that easily with an initial join between the two tables.
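The naive score described above can be sketched as follows, again using an in-memory SQLite database from Python. The sample rows are invented, and the sketch assumes each row of friends lists one friend of user_id (so a mutual friendship appears as two rows).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
CREATE TABLE page_likes (user_id INTEGER, page_id INTEGER);
-- Invented sample data for illustration.
INSERT INTO friends VALUES (1, 2), (1, 3), (2, 1);
INSERT INTO page_likes VALUES (1, 10), (2, 10), (2, 20), (3, 20), (3, 30);
""")

RECOMMEND_SQL = """
SELECT f.user_id,
       fl.page_id,
       COUNT(DISTINCT f.friend_id) AS num_friend_likes
FROM friends AS f
-- Initial join: attach every page liked by each friend.
JOIN page_likes AS fl
  ON fl.user_id = f.friend_id
-- Anti-join: drop pages the user has already liked.
LEFT JOIN page_likes AS own
  ON own.user_id = f.user_id
 AND own.page_id = fl.page_id
WHERE own.page_id IS NULL
GROUP BY f.user_id, fl.page_id
ORDER BY f.user_id, num_friend_likes DESC, fl.page_id;
"""

recommendations = conn.execute(RECOMMEND_SQL).fetchall()
print(recommendations)
```

Each output row is a (user_id, page_id, score) triple, so the highest-scoring page per user is the most recommendable one under this metric.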
Statistics and probability questions assess your understanding of mathematical concepts and how they’re used in data science interviews at Facebook.
Having the vocabulary to describe a distribution is an important skill as a data scientist when it comes to communicating ideas to your peers. There are four important concepts with supporting vocabulary that you can use to structure your answer to a question like this. These are:
In terms of the distribution of time spent per day on Facebook (FB), one can imagine there may be two groups of people on Facebook:
Based on this, what kind of claims could we make about the distribution of time spent on FB?
Suppose we have 100 raters, each rating one ad independently. What’s the expected number of good ads?
Keep in mind that in order for the rater to rate an ad, the rater must first be selected. So the event that the rater is selected happens first, then the rating happens. How would you represent this fact arithmetically using the basic properties of probability?
Hint: If we only have one rater, we don’t need to test that rater’s personality more than once.
What is the probability that none of the zebras collide?
There are two scenarios in which the zebras do not collide: if they all move clockwise or if they all move counterclockwise.
How do we calculate the probability that an individual zebra chooses to move clockwise or counterclockwise? How can we use this individual probability to calculate the probability that all zebras choose to move in the same direction?
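A short sketch of the calculation, assuming each zebra picks a direction independently with probability 1/2. The number of zebras is not stated in this excerpt, so it is left as a parameter, and the value 3 used below is only an illustrative assumption.

```python
from fractions import Fraction
from itertools import product

def prob_no_collision(num_zebras: int) -> Fraction:
    """Closed form: 2 favourable outcomes out of 2**num_zebras."""
    return Fraction(2, 2 ** num_zebras)

def prob_no_collision_brute_force(num_zebras: int) -> Fraction:
    """Enumerate every direction assignment and count the uniform ones."""
    outcomes = list(product(("cw", "ccw"), repeat=num_zebras))
    same_direction = [o for o in outcomes if len(set(o)) == 1]
    return Fraction(len(same_direction), len(outcomes))

# Assumed example: three zebras.
print(prob_no_collision(3), prob_no_collision_brute_force(3))
```

The brute-force version is only there as a sanity check on the closed form; both treat "all clockwise" and "all counterclockwise" as the only collision-free outcomes, as described above.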
Facebook machine learning interview questions focus on algorithms, systems design, applied modeling and recommendation engines.
Since we are interested in whether someone will be an active user in six months, we can test this assumption by first looking at the existing data. One way to do so is to put users into buckets determined by their friend count six months ago and then look at their activity over the following six months.
If we set a metric to define “active user”, such as if they logged in X number of times, posted once, etc., we can then just compute the averages on these metrics across the buckets to determine whether more friends are associated with higher engagement metrics.
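The bucketing idea can be sketched in a few lines of plain Python. The records, the bucket boundaries, and the choice of login count as the activity metric are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records:
# (friend_count_six_months_ago, logins_over_the_following_six_months)
users = [(3, 2), (8, 5), (15, 20), (40, 35), (45, 50), (120, 90)]

def bucket(friend_count: int) -> str:
    """Assign a user to an (arbitrarily chosen) friend-count bucket."""
    if friend_count < 10:
        return "0-9"
    if friend_count < 50:
        return "10-49"
    return "50+"

logins_by_bucket = defaultdict(list)
for friend_count, logins in users:
    logins_by_bucket[bucket(friend_count)].append(logins)

# Average activity per bucket; a rising trend would support the assumption.
avg_logins = {name: mean(values) for name, values in logins_by_bucket.items()}
print(avg_logins)
```

A rising average across buckets shows association only; it would not by itself establish that adding friends causes higher activity.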
How would you optimize the ratio of public versus private content? How would you build a model, what features would you use, and what metrics would you track?
In Facebook data scientist interviews, system design questions require you to walk the interviewer through building recommendation engines, models, etc.
Let’s think about a simple use case to start out with. Say that we type in the word “hello” as the beginning of a movie title.
If we typed in h-e-l-l-o, then a suitable suggestion might be a movie like “Hello Sunshine” or a Spanish movie named “Hola”.
Let’s now move on to an MVP within the scope. We can begin to think of the solution in the form of a prefix table.
A prefix table maps a prefix (the input string) to an output string, one suggestion at a time to start with. For an MVP, we could input a string and output a suggestion string, with fuzzy matching and context matching added on top.
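A bare-bones prefix table might look like the sketch below. It handles exact, case-insensitive prefixes only, so the fuzzy and context matching mentioned above (which is what would surface “Hola” for “hello”) is out of scope; the function names and the extra title “Her” are made up for illustration.

```python
from collections import defaultdict

def build_prefix_table(titles: list[str]) -> dict[str, list[str]]:
    """Map every lowercase prefix of each title to the titles that start with it."""
    table = defaultdict(list)
    for title in sorted(titles):
        lowered = title.lower()
        for end in range(1, len(lowered) + 1):
            table[lowered[:end]].append(title)
    return dict(table)

def suggest(table: dict[str, list[str]], typed: str, limit: int = 1) -> list[str]:
    """Return up to `limit` suggestions for what the user has typed so far."""
    return table.get(typed.lower(), [])[:limit]

table = build_prefix_table(["Hello Sunshine", "Hola", "Her"])
print(suggest(table, "hello"))
print(suggest(table, "h", limit=3))
```

Here suggestions come back in alphabetical order, which is only a stand-in for a real ranking.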
But now, how do we recommend a certain movie?
In Facebook data scientist interviews, Python questions typically require you to perform algorithmic coding.
Write a function to generate an output which lists the pairs of friends with their corresponding timestamps of the friendship beginning and then the timestamp of the friendship ending.
Note that you are only looking for friendships that have an end date. Because of this, every friendship that will be in our final output is contained within the friends_removed list. So if you start by iterating through friends_removed, you will already have the id pair and the end date of each entry in the final output; you just need to find the corresponding start date for each end date.
The friends_added and friends_removed lists are already sorted by date. Because of this, you can be sure that as long as you iterate from the top through both, you will find the correct pairings of dates, since each end date can only have one corresponding start date appearing before it in time.
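One possible sketch of this approach is below. The input shape is an assumption, since the original prompt is not shown here: each entry is taken to be a dictionary with a `user_ids` pair and a `created_at` timestamp, and the function name and output keys are hypothetical.

```python
def friendship_timelines(friends_added, friends_removed):
    """Pair each removal with the earliest unused addition for the same two users."""
    remaining_adds = list(friends_added)  # copy so the caller's list is untouched
    timelines = []
    for removed in friends_removed:
        pair = sorted(removed["user_ids"])  # [1, 2] and [2, 1] are the same pair
        for i, added in enumerate(remaining_adds):
            if sorted(added["user_ids"]) == pair:
                timelines.append({
                    "user_ids": pair,
                    "start_date": added["created_at"],
                    "end_date": removed["created_at"],
                })
                del remaining_adds[i]  # each start date pairs with one end date
                break
    return timelines

# Invented sample data, sorted by date as the text assumes.
friends_added = [
    {"user_ids": [1, 2], "created_at": "2020-01-01"},
    {"user_ids": [3, 2], "created_at": "2020-01-02"},
    {"user_ids": [2, 1], "created_at": "2020-02-02"},
]
friends_removed = [
    {"user_ids": [2, 1], "created_at": "2020-01-03"},
    {"user_ids": [2, 3], "created_at": "2020-01-05"},
    {"user_ids": [1, 2], "created_at": "2020-02-05"},
]
timelines = friendship_timelines(friends_added, friends_removed)
```

Removing each matched start date from the working copy is what lets a pair of users befriend and unfriend each other more than once without the dates getting mixed up.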
Given a dictionary consisting of many roots and a sentence, stem all the words in the sentence with the root forming it. If a word can be formed by more than one root, replace it with the shortest such root.
Example:
Input:
roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
Output:
"the cat was rat by the bat"
At first, it simply looks like we can just loop through each word and check if the root exists in the word and, if so, replace the word with the root. But since we are technically stemming the words, we have to make sure that the root matches the beginning of the word (its prefix) rather than appearing anywhere within the word.
We’re given a dictionary of roots with a sentence string. Given we have to check each word, try creating a function that takes a word and returns the existing word if it doesn’t match a root or returns the root itself.
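A minimal sketch of that helper and the sentence-level wrapper follows; the function names are hypothetical, and roots are treated as a plain list of strings.

```python
def stem_word(word: str, roots: list[str]) -> str:
    """Return the shortest root that prefixes `word`, or `word` if none does."""
    matches = [root for root in roots if word.startswith(root)]
    return min(matches, key=len) if matches else word

def stem_sentence(roots: list[str], sentence: str) -> str:
    """Stem every whitespace-separated word in the sentence."""
    return " ".join(stem_word(word, roots) for word in sentence.split())

roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
print(stem_sentence(roots, sentence))
```

Using `startswith` rather than `in` is what enforces the prefix requirement, and `min(..., key=len)` handles the shortest-root rule.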
name | age | favorite_color | grade |
---|---|---|---|
Tim Voss | 19 | red | 91 |
Nicole Johnson | 20 | yellow | 95 |
Elsa Williams | 21 | green | 82 |
John James | 20 | blue | 75 |
Catherine Jones | 23 | green | 93 |
Write a function to select only the rows where the student’s favorite color is green or red, and their grade is above 90.
This question requires us to filter a dataframe by two conditions: first, the grade of the student, and second, their favorite color.
Let’s start with filtering by grade since it’s a bit simpler than filtering by strings. We can filter rows in pandas by setting our dataframe equal to itself with the filter in place. In this case:
students_df = students_df[students_df["grade"] > 90]
If we looked at our dataframe after running that line of code, we’d see that every student with a grade of 90 or lower no longer appears in it.
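Combining both conditions might look like the sketch below, which assumes pandas is installed. The function name grades_colors is hypothetical, and the dataframe is rebuilt from the table above so the example runs on its own.

```python
import pandas as pd

def grades_colors(students_df: pd.DataFrame) -> pd.DataFrame:
    """Keep students with grade above 90 whose favorite color is green or red."""
    high_grade = students_df["grade"] > 90
    wanted_color = students_df["favorite_color"].isin(["green", "red"])
    # Combine the two boolean masks element-wise with &.
    return students_df[high_grade & wanted_color]

students_df = pd.DataFrame({
    "name": ["Tim Voss", "Nicole Johnson", "Elsa Williams",
             "John James", "Catherine Jones"],
    "age": [19, 20, 21, 20, 23],
    "favorite_color": ["red", "yellow", "green", "blue", "green"],
    "grade": [91, 95, 82, 75, 93],
})

result = grades_colors(students_df)
print(result)
```

Building the two masks separately keeps each condition readable, and `isin` avoids chaining several equality checks with `|`.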
Facebook pays some of the highest salaries for data scientists at FAANG companies.
Read interview experiences and salary posts in preparation for your next interview.