Facebook Data Scientist Interview Guide


Introduction

The Facebook data science interview consists of multiple technical and business case questions, heavily focused on applying technical knowledge to business case scenarios. Facebook data scientists are expected to work cross-functionally and explore, analyze, and aggregate large data sets to provide actionable information.

Facebook Data Scientist Interview Process

Interviews at Facebook vary by role and team, but data science candidates are commonly asked questions on SQL, business cases, analytics, and product metrics.

We’ve gathered this data from parsing thousands of interview experiences sourced from members.

Facebook Data Scientist Interview Questions

Facebook interview questions generally fall into four main categories:

The technical screen will generally consist of one product question and one data analysis question. Be sure to prepare for both in order to move on to the onsite.

Note: This process is the same for candidates seeking full-time jobs at Facebook and for those applying to Facebook data science internships.

Facebook Case Study Interview Questions

Case study questions in Facebook data science interviews focus heavily on product metrics and business cases.

1. Facebook composer, the posting tool, drops from 3% posts per user last month to 2.5% posts per user today. How would you investigate what happened?

The question states that the drop is from 3% a month ago to 2.5% today. The first thing we have to do is clarify the context around the problem before jumping to conclusions about metrics. Is today a weekday and the day one month ago a weekend, so that users are posting less? Is there a special event or seasonality? Is this an ongoing downward trend or a one-time dip?

The second part is understanding the metric itself. What drove the decrease: was it the number of users that increased or the number of posts that decreased? The interviewer will likely ask you to jump into one or both of the metrics to discuss what could have caused the decrease.
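As a minimal sketch of that decomposition, the function below splits the change in posts per user into the part explained by posts and the part explained by users. The function name and the sample counts are hypothetical and chosen only so the rates match the 3% and 2.5% in the question.

```python
def decompose_ratio_change(posts_before, users_before, posts_after, users_after):
    """Split the change in posts-per-user into a posts effect and a users effect."""
    rate_before = posts_before / users_before
    rate_after = posts_after / users_after
    # Hold users fixed at the earlier level and let only the post count change.
    posts_effect = posts_after / users_before - rate_before
    # Whatever remains is attributable to the change in the user count.
    users_effect = rate_after - posts_after / users_before
    return rate_before, rate_after, posts_effect, users_effect


# Hypothetical counts: posts are flat while the user base grows.
before, after, posts_effect, users_effect = decompose_ratio_change(
    posts_before=3_000, users_before=100_000,
    posts_after=3_000, users_after=120_000,
)
print(before, after, posts_effect, users_effect)
```

In this made-up case the whole drop comes from the denominator, which would point the investigation toward who the new users are rather than toward the composer itself.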

2. A Facebook Groups product manager decides to add threading to comments on group posts. Comments per user increase by 10%, but posts go down 2%. Why would that be? What metrics would prove your hypotheses?

Threading restructures the flow of comments so that, instead of responding to the post, users can now respond to individual comments beneath the post. What effect might this have on a push notification ecosystem?

3. Facebook is rolling out a new feature called “Mentions” which is an app specifically for celebrities on Facebook to connect with their fans. How would you measure the health of the Mentions app?

We can start by breaking down some structure of what the interviewer is looking for. Whenever we’re given these open-ended product questions, it makes sense to think about structuring the questions with well-defined objectives, so we’re not switching between different answers.

  1. Did you begin by stating what the goals of the feature are before jumping into defining metrics? What is the point of the Mentions feature?
  2. Are your answers structured, or do you tend to talk about random points?
  3. Are the metric definitions specific, or are they vague, as in “I would find out if people used Mentions frequently”?

4. How can Facebook figure out when users falsify their attended schools?

5. If 70% of Facebook users on iOS use Instagram, but only 35% of Facebook users on Android use Instagram, how would you investigate the discrepancy?

6. Facebook Newsfeed engagement is down by 10%. How would you find out why?

Facebook SQL Interview Questions

SQL questions are the most frequently asked in Facebook data science interviews. See additional practice questions in our guide Facebook SQL Interview Questions.

1. Write a SQL query to create a histogram of the number of comments per user in the month of January 2020. Assume bin buckets with class intervals of one.

users table

Columns Type
id INTEGER
name VARCHAR
created_at DATETIME
neighborhood_id INTEGER
mail VARCHAR

comments table

Columns Type
user_id INTEGER
body VARCHAR
created_at DATETIME

Since a histogram is just a display of how many users fall into each comment count, all we really need to do is get each user’s total comment count for January 2020 and then group by that count.
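One way this could look, sketched with Python's built-in sqlite3 module so the query can be run locally. The sample rows are invented, and counting users with zero comments (via the LEFT JOIN) is an assumption about what the interviewer wants, so it is worth confirming.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, name VARCHAR, created_at DATETIME,
                    neighborhood_id INTEGER, mail VARCHAR);
CREATE TABLE comments (user_id INTEGER, body VARCHAR, created_at DATETIME);
INSERT INTO users (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c');
-- Invented sample data; the last comment falls outside January 2020.
INSERT INTO comments VALUES
    (1, 'x', '2020-01-03 10:00:00'),
    (1, 'y', '2020-01-15 11:00:00'),
    (2, 'z', '2020-01-20 09:00:00'),
    (2, 'w', '2020-02-01 09:00:00');
""")

HISTOGRAM_SQL = """
WITH per_user AS (
    -- Step 1: comments per user in January 2020 (0 for users with none).
    SELECT u.id, COUNT(c.user_id) AS comment_count
    FROM users AS u
    LEFT JOIN comments AS c
           ON c.user_id = u.id
          AND c.created_at >= '2020-01-01'
          AND c.created_at <  '2020-02-01'
    GROUP BY u.id
)
-- Step 2: group by that count to get the histogram bins.
SELECT comment_count, COUNT(*) AS frequency
FROM per_user
GROUP BY comment_count
ORDER BY comment_count;
"""

histogram = conn.execute(HISTOGRAM_SQL).fetchall()
print(histogram)  # [(0, 1), (1, 1), (2, 1)]
```

The date filter sits in the join condition rather than in a WHERE clause; moving it to WHERE would silently drop the users with no January comments.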

2. In the table below, the action column takes one of the values ‘post_enter’, ‘post_submit’, or ‘post_canceled’, for when a user starts a post (enter), ends up posting it (submit), or ends up canceling it (cancel).

events table

Column Type
id INTEGER
user_id INTEGER
created_at DATETIME
action VARCHAR
url VARCHAR
platform VARCHAR

Write a query to get the post success rate for each day in the month of January 2020.

Let’s see if we can clearly define the metrics we want to calculate before just jumping into the problem. We want the post success rate for each day in January 2020.

To get that metric, we can assume post success rate can be defined as:

(total posts created) / (total posts entered)

Additionally, since the success rate must be broken down by day, we assume that a post that is entered is either submitted or canceled on the same day.

Now that we have these requirements, it’s time to calculate our metrics. We know we have to GROUP BY the date to get each day’s posting success rate. We also have to break down how we can compute our two metrics of total posts entered and total posts actually created.
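A possible version of the query, again run through Python's sqlite3 module with invented sample rows. It relies on the same-day assumption stated above and counts events rather than distinct posts, since the table has no post identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER, user_id INTEGER, created_at DATETIME,
                     action VARCHAR, url VARCHAR, platform VARCHAR);
-- Invented sample data covering two days.
INSERT INTO events (id, user_id, created_at, action) VALUES
    (1, 1, '2020-01-01 08:00:00', 'post_enter'),
    (2, 1, '2020-01-01 08:05:00', 'post_submit'),
    (3, 2, '2020-01-01 09:00:00', 'post_enter'),
    (4, 2, '2020-01-01 09:01:00', 'post_canceled'),
    (5, 3, '2020-01-02 10:00:00', 'post_enter'),
    (6, 3, '2020-01-02 10:02:00', 'post_submit');
""")

SUCCESS_RATE_SQL = """
SELECT DATE(created_at) AS day,
       -- (total posts created) / (total posts entered), forced to a float
       1.0 * SUM(CASE WHEN action = 'post_submit' THEN 1 ELSE 0 END)
           / SUM(CASE WHEN action = 'post_enter' THEN 1 ELSE 0 END) AS success_rate
FROM events
WHERE created_at >= '2020-01-01' AND created_at < '2020-02-01'
GROUP BY DATE(created_at)
ORDER BY day;
"""

rows = conn.execute(SUCCESS_RATE_SQL).fetchall()
print(rows)  # [('2020-01-01', 0.5), ('2020-01-02', 1.0)]
```

In SQLite a day with no post_enter events yields NULL rather than an error; other engines may need an explicit guard against division by zero.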

3. We want to build a naive recommender, and we’re given two tables: friends, with user_id and friend_id columns representing each user’s friends, and page_likes, with user_id and page_id columns representing the pages each user liked.

friends table

Column Type
user_id INTEGER
friend_id INTEGER

page_likes table

Column Type
user_id INTEGER
page_id INTEGER

Write an SQL query to create a metric to recommend pages for each user based on the pages their friends have liked.

We can start by visualizing what kind of output we want from the query. Given that we have to create a metric for each user to recommend pages, we know we want something with a user_id and a page_id along with some sort of recommendation score.

How can we easily represent the score of each user_id and page_id combination? One naive method would be to sum up the total likes by friends on each page that the user hasn’t already liked. The page with the highest score would be the most recommendable one.

The first thing we have to do then is to write a query to associate users with their friends’ liked pages. We can do that easily with an initial join between the two tables.
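The sketch below extends that initial join into a full scoring query, run through Python's sqlite3 module. The sample rows are invented, and it assumes friendships are stored once per direction and that page_likes has no duplicate rows; otherwise COUNT(*) would overcount.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
CREATE TABLE page_likes (user_id INTEGER, page_id INTEGER);
-- Invented sample data: user 1 is friends with users 2 and 3.
INSERT INTO friends VALUES (1, 2), (1, 3), (2, 1), (3, 1);
INSERT INTO page_likes VALUES (1, 10), (2, 10), (2, 20), (3, 20), (3, 30);
""")

RECOMMEND_SQL = """
SELECT f.user_id, fl.page_id, COUNT(*) AS score
FROM friends AS f
-- Initial join: attach each friend's liked pages to the user.
JOIN page_likes AS fl
  ON fl.user_id = f.friend_id
-- Anti-join: drop pages the user has already liked.
LEFT JOIN page_likes AS own
  ON own.user_id = f.user_id
 AND own.page_id = fl.page_id
WHERE own.page_id IS NULL
GROUP BY f.user_id, fl.page_id
ORDER BY f.user_id, score DESC, fl.page_id;
"""

recommendations = conn.execute(RECOMMEND_SQL).fetchall()
print(recommendations)  # [(1, 20, 2), (1, 30, 1), (3, 10, 1)]
```

User 2 gets no recommendations here because their only friend likes a page they have already liked.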

Statistics and Probability Interview Questions

Statistics and probability questions assess your understanding of mathematical concepts and how they’re used in data science interviews at Facebook.

1. What do you think the distribution of time spent per day on Facebook looks like? What metrics would you use to describe that distribution?

Having the vocabulary to describe a distribution is an important skill as a data scientist when it comes to communicating ideas to your peers. There are four important concepts with supporting vocabulary that you can use to structure your answer to a question like this. These are:

  1. Center (mean, median, mode)
  2. Spread (standard deviation, interquartile range, range)
  3. Shape (skewness, kurtosis, uni or bimodal)
  4. Outliers (Do they exist?)

In terms of the distribution of time spent per day on Facebook (FB), one can imagine there may be two groups of people on Facebook:

  1. People who scroll quickly through their feed and don’t spend too much time on FB.
  2. People who spend a large amount of their social media time on FB.

Based on this, what kind of claims could we make about the distribution of time spent on FB?
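To make the vocabulary concrete, here is a small sketch that computes center, spread, shape, and outliers for a list of daily minutes. The sample values are made up, the skewness is the simple moment-based version, and the 1.5 × IQR outlier rule is one common convention rather than the only option.

```python
import statistics


def describe(minutes):
    """Summarize a sample with center, spread, shape, and outlier measures."""
    mean = statistics.fmean(minutes)
    median = statistics.median(minutes)
    stdev = statistics.pstdev(minutes)
    q1, _, q3 = statistics.quantiles(minutes, n=4)
    iqr = q3 - q1
    # Moment-based skewness: positive values indicate a long right tail.
    skewness = sum((x - mean) ** 3 for x in minutes) / (len(minutes) * stdev ** 3)
    # Flag points beyond 1.5 * IQR from the quartiles.
    outliers = [x for x in minutes if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
    return {
        "mean": mean,
        "median": median,
        "stdev": stdev,
        "iqr": iqr,
        "skewness": skewness,
        "outliers": outliers,
    }


# Hypothetical daily minutes: many light users and a few heavy ones.
sample_minutes = [2, 3, 5, 5, 8, 10, 12, 15, 90, 180]
summary = describe(sample_minutes)
print(summary)
```

If the two groups described above exist, we might expect a mean well above the median and positive skewness, as in this toy sample, but that is a hypothesis to check against real data.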

2. We use people to rate ads. There are two types of raters, random and independent, from our point of view:

  • 80% of raters are careful, and they rate an ad as good (60% chance) or bad (40% chance).
  • 20% of raters are lazy, and they rate every ad as good (100% chance).

Suppose we have 100 raters, each rating one ad independently. What’s the expected number of good ads?

Keep in mind that in order for the rater to rate an ad, the rater must first be selected. So the event that the rater is selected happens first, then the rating happens. How would you represent this fact arithmetically using the basic properties of probability?

Hint: If we only have one rater, we don’t need to test that rater’s personality more than once.
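One way to write this out is with the law of total probability, conditioning on the rater type first and the rating second. The short sketch below uses exact fractions; the variable names are ours.

```python
from fractions import Fraction

# Step 1: which type of rater is selected.
p_careful = Fraction(80, 100)
p_lazy = Fraction(20, 100)

# Step 2: probability of a "good" rating given the rater type.
p_good_given_careful = Fraction(60, 100)
p_good_given_lazy = Fraction(1)

# Law of total probability for a single rater.
p_good = p_careful * p_good_given_careful + p_lazy * p_good_given_lazy

# Linearity of expectation over 100 independent raters.
expected_good = 100 * p_good

print(p_good, expected_good)  # 17/25 68
```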

3. Three zebras are chilling in the desert when a lion suddenly attacks. Each zebra is sitting on a corner of an equilateral triangle. Each zebra randomly picks a direction and runs only along the outline of the triangle, toward one of the two neighboring corners.

What is the probability that none of the zebras collide?

There are two scenarios in which the zebras do not collide: if they all move clockwise or if they all move counterclockwise.

How do we calculate the probability that an individual zebra chooses to move clockwise or counterclockwise? How can we use this individual probability to calculate the probability that all zebras choose to move in the same direction?
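A brute-force check is to enumerate every combination of directions, assuming each zebra picks clockwise or counterclockwise with equal probability and independently of the others.

```python
from fractions import Fraction
from itertools import product

directions = ("clockwise", "counterclockwise")

# All 2 * 2 * 2 equally likely ways the three zebras can choose.
outcomes = list(product(directions, repeat=3))

# No collision only when all three zebras pick the same direction.
no_collision = [o for o in outcomes if len(set(o)) == 1]

p_no_collision = Fraction(len(no_collision), len(outcomes))
print(p_no_collision)  # 1/4
```

This agrees with the direct calculation: (1/2)^3 for all clockwise plus (1/2)^3 for all counterclockwise.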

Facebook Machine Learning Interview Questions

Facebook machine learning interview questions focus on algorithms, systems design, applied modeling and recommendation engines.

1. How would you test whether having more friends now increases the probability that a Facebook member is still an active user after 6 months?

Since we are interested in whether someone will still be an active user in six months, we can test this assumption by first looking at the existing data. One way to do so is to put users into buckets determined by their friend count six months ago and then look at their activity over the following six months.

If we set a metric to define “active user”, such as logging in X times or posting at least once, we can then compute the averages of these metrics across the buckets to see whether having more friends is associated with higher engagement.
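The bucketing step might be sketched as follows. The field names, the bucket edges, and the sample users are all hypothetical, and "active" is reduced to a single boolean for simplicity.

```python
from bisect import bisect_right
from collections import defaultdict


def retention_by_friend_bucket(users, bucket_edges=(50, 200)):
    """Return {bucket index: share of users still active}.

    With the default edges, bucket 0 is fewer than 50 friends,
    bucket 1 is 50 to 199, and bucket 2 is 200 or more.
    """
    totals = defaultdict(int)
    active = defaultdict(int)
    for user in users:
        bucket = bisect_right(bucket_edges, user["friends_six_months_ago"])
        totals[bucket] += 1
        active[bucket] += int(user["is_active_now"])
    return {b: active[b] / totals[b] for b in sorted(totals)}


# Hypothetical users, for illustration only.
sample_users = [
    {"friends_six_months_ago": 10, "is_active_now": False},
    {"friends_six_months_ago": 20, "is_active_now": True},
    {"friends_six_months_ago": 100, "is_active_now": True},
    {"friends_six_months_ago": 150, "is_active_now": False},
    {"friends_six_months_ago": 300, "is_active_now": True},
    {"friends_six_months_ago": 500, "is_active_now": True},
]
rates = retention_by_friend_bucket(sample_users)
print(rates)  # {0: 0.5, 1: 0.5, 2: 1.0}
```

A difference between buckets shows an association only; users with many friends may differ in other ways, so a causal claim would need an experiment or additional controls.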

2. We’re given different posts, such as your friends’ baby pictures, Buzzfeed Tasty videos, and birthday posts, and have to decide how to rank them.

How would you optimize the ratio of public versus private content? How would you build a model, what features would you use, and what metrics would you track?

3. You’ve been asked to generate a machine learning model that can map the nicknames of people using Facebook. How do you go about designing this model?

4. A product manager has asked you to develop a method to match users to their siblings on Facebook. How would you evaluate a method or algorithm to match users with their siblings? What metrics might you use?

Facebook System Design Interview Questions

In Facebook data scientist interviews, system design questions require you to walk the interviewer through building recommendation engines, models, etc.

1. How would you build the recommendation algorithm for type-ahead search for Netflix?

Let’s think about a simple use case to start out with. Say that we type in the word “hello” for the beginning of a movie.

If we typed in h-e-l-l-o, then a suitable suggestion might be a movie like “Hello Sunshine” or a Spanish movie named “Hola”.

Let’s now move on to an MVP within the scope. We can begin to think of the solution in the form of a prefix table.

A prefix table maps a prefix (the input string typed so far) to an output string, one suggestion at a time to start with. For an MVP, we could take an input string and return a suggestion string, with added fuzzy matching and context matching.
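A bare-bones prefix table could be sketched like this. It only handles exact, case-insensitive prefix matches, so fuzzy matching (which would be needed to surface “Hola” for “hello”) and context matching are left out. The function names are ours, and “Hellboy” is added to the sample titles purely for illustration.

```python
from collections import defaultdict


def build_prefix_table(titles):
    """Map every lowercase prefix of each title to the titles that start with it."""
    table = defaultdict(list)
    for title in titles:
        key = title.lower()
        for end in range(1, len(key) + 1):
            table[key[:end]].append(title)
    return table


def suggest(table, typed, limit=5):
    """Return up to `limit` titles whose prefix matches what was typed."""
    return table.get(typed.lower(), [])[:limit]


table = build_prefix_table(["Hello Sunshine", "Hola", "Hellboy"])
print(suggest(table, "hell"))   # ['Hello Sunshine', 'Hellboy']
print(suggest(table, "hello"))  # ['Hello Sunshine']
print(suggest(table, "ho"))     # ['Hola']
```

Storing every prefix trades memory for constant-time lookups; a trie would hold the same information more compactly.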

But now, how do we recommend a certain movie?

Facebook Coding Interview Questions

In Facebook data scientist interviews, Python questions are common, and they typically require algorithmic coding.

1. There are two lists of dictionaries representing friendship beginnings and endings: friends_added and friends_removed. Each dictionary contains the user_ids and created_at time of the friendship beginning/ending.

Write a function to generate an output which lists the pairs of friends with their corresponding timestamps of the friendship beginning and then the timestamp of the friendship ending.

Note that you are only looking for friendships that have an end date. Because of this, every friendship that will be in our final output is contained within the friends_removed list. So if you start by iterating through the friends_removed list, you will already have the id pair and the end date of each entry in the final output; you just need to find the corresponding start date for each end date.

The friends_added and friends_removed lists are already sorted by date. Because of this, you can be sure that as long as you iterate from the top through both, you will find the correct pairings of dates, since each end date can only have one corresponding start date appearing before it in time.
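A sketch of this approach is below. The key names user_ids, start_date, and end_date, the date format, and the sample records are assumptions, since the prompt does not spell out the exact structure. The function also assumes both lists are sorted by date and that each pair alternates between being added and removed.

```python
from collections import defaultdict, deque


def friendship_timeline(friends_added, friends_removed):
    """Pair each friendship ending with the earliest unmatched beginning."""
    # Queue of start dates per pair; the pair is sorted so (1, 2) == (2, 1).
    pending_starts = defaultdict(deque)
    for added in friends_added:
        pair = tuple(sorted(added["user_ids"]))
        pending_starts[pair].append(added["created_at"])

    timeline = []
    for removed in friends_removed:
        pair = tuple(sorted(removed["user_ids"]))
        if pending_starts[pair]:
            timeline.append({
                "user_ids": list(pair),
                "start_date": pending_starts[pair].popleft(),
                "end_date": removed["created_at"],
            })
    return timeline


# Hypothetical input, for illustration only.
friends_added = [
    {"user_ids": [1, 2], "created_at": "2020-01-01"},
    {"user_ids": [3, 2], "created_at": "2020-01-02"},
    {"user_ids": [2, 1], "created_at": "2020-02-02"},
    {"user_ids": [4, 1], "created_at": "2020-02-02"},
]
friends_removed = [
    {"user_ids": [2, 1], "created_at": "2020-01-03"},
    {"user_ids": [2, 3], "created_at": "2020-01-05"},
    {"user_ids": [1, 2], "created_at": "2020-02-05"},
]
timeline = friendship_timeline(friends_added, friends_removed)
print(timeline)
```

The friendship between users 4 and 1 never ends, so it does not appear in the output.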

2. In data science, there exists the concept of stemming, which is the heuristic of chopping off the end of a word to clean and bucket it into an easier feature set.

Given a dictionary consisting of many roots and a sentence, stem all the words in the sentence with the root forming it. If a word can be formed by more than one root, replace it with the shortest such root.

Example:

Input:

roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"

Output:

"the cat was rat by the bat"

At first, it looks like we can just loop through each word, check whether a root exists in the word, and, if so, replace the word with the root. But since we are stemming the words, we have to make sure that the root matches the beginning of the word (its prefix) rather than appearing anywhere within it.

We’re given a dictionary of roots with a sentence string. Given we have to check each word, try creating a function that takes a word and returns the root if the word starts with one, or the word itself if no root matches.
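One straightforward way to write that helper and apply it to the sentence is shown below. The function names are ours, and the sketch assumes words are separated by whitespace with no punctuation.

```python
def stem_word(word, roots):
    """Return the shortest root that prefixes the word, or the word itself."""
    matches = [root for root in roots if word.startswith(root)]
    return min(matches, key=len) if matches else word


def stem_sentence(roots, sentence):
    """Stem every whitespace-separated word in the sentence."""
    return " ".join(stem_word(word, roots) for word in sentence.split())


roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
print(stem_sentence(roots, sentence))  # the cat was rat by the bat
```

This checks every root against every word; for a large root list, a trie of roots would avoid the repeated scans.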

3. You’re given a dataframe of students:

name             age  favorite_color  grade
Tim Voss         19   red             91
Nicole Johnson   20   yellow          95
Elsa Williams    21   green           82
John James       20   blue            75
Catherine Jones  23   green           93

Write a function to select only the rows where the student’s favorite color is green or red, and their grade is above 90.

This question requires us to filter a dataframe by two conditions: first, the grade of the student, and second, their favorite color.

Let’s start with filtering by grade since it’s a bit simpler than filtering by strings. We can filter rows in pandas by setting our dataframe equal to itself with the filter in place. In this case:

students_df = students_df[students_df["grade"] > 90]

If we were to look at our dataframe after running that line of code, we’d see that every student with a grade of 90 or lower no longer appears in our dataframe.
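Putting both conditions together, a complete function might look like the sketch below. The function name grades_colors is our choice, and the color filter uses pandas' Series.isin, which is one of several ways to express the "green or red" condition.

```python
import pandas as pd

students_df = pd.DataFrame({
    "name": ["Tim Voss", "Nicole Johnson", "Elsa Williams",
             "John James", "Catherine Jones"],
    "age": [19, 20, 21, 20, 23],
    "favorite_color": ["red", "yellow", "green", "blue", "green"],
    "grade": [91, 95, 82, 75, 93],
})


def grades_colors(students_df):
    """Keep students whose grade is above 90 and whose favorite color is green or red."""
    high_grade = students_df["grade"] > 90
    wanted_color = students_df["favorite_color"].isin(["green", "red"])
    # Combine the two boolean masks element-wise with &.
    return students_df[high_grade & wanted_color]


result = grades_colors(students_df)
print(result)
```

With the table above, only Tim Voss and Catherine Jones satisfy both conditions. Returning a filtered copy instead of reassigning students_df leaves the original dataframe untouched.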

Facebook Data Scientist Salary

Facebook pays some of the highest salaries for data scientists at FAANG companies.

Average Base Salary: $160,099
Average Total Compensation: $228,971

Base Salary
Min: $115K
Max: $205K
Median: $160K
Mean (Average): $160K
Data points: 1,470

Total Compensation
Min: $31K
Max: $417K
Median: $220K
Mean (Average): $229K
Data points: 308
