The Facebook interview consists of multiple technical and business case questions, heavily focused on applying technical knowledge to business case scenarios. Facebook data scientists are expected to work cross-functionally and to explore, analyze, and aggregate large data sets to provide actionable information.
Facebook Interview Questions
Facebook interview questions generally fall into four main categories:
- Product and business sense
- Technical data analysis (SQL, pandas)
- Statistics and probability
- Modeling knowledge and applying data
The technical screen will generally consist of one product question and one data analysis question. Be sure to prepare for both in order to move on to the onsite.
Case Study Interview Questions
Q1: Facebook composer, the posting tool, drops from 3% posts per user last month to 2.5% posts per user today. How would you investigate what happened?
The question states the drop is from 3% a month ago to 2.5% today. The first thing we have to do is clarify the context around the problem before jumping to conclusions about metrics. Do today and the day one month ago fall on different parts of the week (say, a weekday versus a weekend), so that posting behavior naturally differs? Is there a special event or seasonality? Is this an ongoing downward trend or a one-time dip?
The second part is understanding the metric itself. What drove the decrease: was it the number of users that increased or the number of posts that decreased? The interviewer will likely ask you to jump into one or both of the metrics to discuss what could have caused the decrease.
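One way to see which side of the ratio moved is to compute the change in the numerator and the denominator separately. The sketch below is only an illustration: the function name and the counts are hypothetical, chosen so that the ratio goes from 3% to 2.5%.

```python
def decompose_ratio_change(posts_before, users_before, posts_after, users_after):
    """Split a change in posts-per-user into numerator and denominator moves."""
    return {
        "ratio_before": posts_before / users_before,
        "ratio_after": posts_after / users_after,
        "posts_pct_change": (posts_after - posts_before) / posts_before,
        "users_pct_change": (users_after - users_before) / users_before,
    }


# Hypothetical counts, picked only to reproduce the 3% -> 2.5% drop.
result = decompose_ratio_change(3_000, 100_000, 3_000, 120_000)
print(result)
```

In this made-up case the number of posts is flat and the user base grew by 20%, so the drop comes entirely from the denominator. Real data could just as easily show the opposite.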
Q2: A Facebook Groups product manager decides to add threading to comments on group posts. Comments per user increase by 10% but posts go down 2%. Why would that be? What metrics would prove your hypotheses?
Threading restructures the flow of comments so that, instead of responding to the post, users can now respond to individual comments beneath the post.
What effect might this have on a push notification ecosystem?
Q3: Facebook is rolling out a new feature called "Mentions" which is an app specifically for celebrities on Facebook to connect with their fans. How would you measure the health of the Mentions app?
We can start by laying out what the interviewer is looking for. Whenever we're given these open-ended product questions, it makes sense to structure the answer around well-defined objectives so we're not jumping between unrelated points.
1. Did you begin by stating what the goals of the feature are before jumping into defining metrics? What is the point of the Mentions feature?
2. Are your answers structured or do you tend to talk about random points?
3. Are your metric definitions specific, or are they generalities such as “I would find out if people used Mentions frequently”?
Q4: How can Facebook figure out when users falsify their attended schools?
Q5: If 70% of Facebook users on iOS use Instagram, but only 35% of Facebook users on Android use Instagram, how would you investigate the discrepancy?
Q6: Facebook Newsfeed engagement is down by 10%. How would you find out why?
SQL Interview Questions
Q1: Write a SQL query to create a histogram of the number of comments per user in the month of January 2020. Assume bins with a class interval of one.

users table

| columns         | type     |
|-----------------|----------|
| id              | integer  |
| name            | string   |
| created_at      | datetime |
| neighborhood_id | integer  |
| mail            | string   |

comments table

| columns    | type     |
|------------|----------|
| user_id    | integer  |
| body       | text     |
| created_at | datetime |
Since a histogram is just a display of frequencies, all we really need to do is get each user's total comment count for the month of January 2020, and then group by that count.
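The approach above can be sketched with Python's built-in sqlite3 module so that the query runs end to end. The sample rows are hypothetical, and including users with zero comments (via a LEFT JOIN) is our own assumption; the question does not say whether the zero bucket is wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, name TEXT, created_at TEXT,
                    neighborhood_id INTEGER, mail TEXT);
CREATE TABLE comments (user_id INTEGER, body TEXT, created_at TEXT);
-- Hypothetical sample rows for illustration only.
INSERT INTO users (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
INSERT INTO comments VALUES
    (1, 'x', '2020-01-03 10:00:00'),
    (1, 'y', '2020-01-15 11:00:00'),
    (2, 'z', '2020-01-20 09:00:00'),
    (3, 'v', '2020-01-31 23:59:59'),
    (3, 'w', '2020-02-01 00:00:00');
""")

HISTOGRAM_SQL = """
SELECT comment_count, COUNT(*) AS frequency
FROM (
    -- Step 1: comments per user in January 2020 (0 for users with none).
    SELECT u.id, COUNT(c.user_id) AS comment_count
    FROM users AS u
    LEFT JOIN comments AS c
        ON c.user_id = u.id
       AND c.created_at >= '2020-01-01'
       AND c.created_at < '2020-02-01'
    GROUP BY u.id
) AS per_user
-- Step 2: group by the count itself to get the histogram.
GROUP BY comment_count
ORDER BY comment_count;
"""

histogram = conn.execute(HISTOGRAM_SQL).fetchall()
print(histogram)
```

Keeping the date filter inside the join condition, rather than in a WHERE clause, is what lets users without January comments survive as a count of zero.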
Q2: In the table below, column `action` takes one of the values ('post_enter', 'post_submit', 'post_canceled'), recorded when a user starts a post (enter), ends up posting it (submit), or ends up canceling it (cancel).

events table

| column     | type     |
|------------|----------|
| id         | integer  |
| user_id    | integer  |
| created_at | datetime |
| action     | string   |
| url        | string   |
| platform   | string   |
Write a query to get the post success rate for each day in the month of January 2020.
Let's see if we can clearly define the metrics we want to calculate before just jumping into the problem. We want the post success rate for each day in January 2020.
To get that metric, we can assume post success rate can be defined as:
(total posts created) / (total posts entered)
Additionally, since the success rate must be broken down by day, we assume that a post entered on a given day is also submitted or canceled on that same day.
Now that we have these requirements, it's time to calculate our metrics. We know we have to GROUP BY the date to get each day's posting success rate. We also have to break down how we can compute our two metrics of total posts entered and total posts actually created.
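A minimal sketch of this calculation, again run through sqlite3 so it is executable. The event rows are hypothetical, and the query relies on the same-day assumption stated above: submits and enters are simply counted per calendar day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER, user_id INTEGER, created_at TEXT,
                     action TEXT, url TEXT, platform TEXT);
-- Hypothetical sample rows for illustration only.
INSERT INTO events (id, user_id, created_at, action) VALUES
    (1, 1, '2020-01-01 10:00:00', 'post_enter'),
    (2, 1, '2020-01-01 10:05:00', 'post_submit'),
    (3, 2, '2020-01-01 11:00:00', 'post_enter'),
    (4, 2, '2020-01-01 11:01:00', 'post_canceled'),
    (5, 3, '2020-01-02 09:00:00', 'post_enter'),
    (6, 3, '2020-01-02 09:02:00', 'post_submit'),
    (7, 4, '2020-02-01 08:00:00', 'post_enter');
""")

SUCCESS_RATE_SQL = """
SELECT
    DATE(created_at) AS dt,
    -- Multiply by 1.0 to force floating-point division.
    SUM(CASE WHEN action = 'post_submit' THEN 1 ELSE 0 END) * 1.0
        / SUM(CASE WHEN action = 'post_enter' THEN 1 ELSE 0 END) AS success_rate
FROM events
WHERE created_at >= '2020-01-01'
  AND created_at < '2020-02-01'
GROUP BY DATE(created_at)
ORDER BY dt;
"""

daily_rates = conn.execute(SUCCESS_RATE_SQL).fetchall()
print(daily_rates)
```

On a day with no 'post_enter' events, SQLite returns NULL for the division; other databases may raise an error instead, so a guard such as NULLIF on the denominator is worth considering there.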
Q3: We want to build a naive recommender and we're given two tables: one called `friends`, with user_id and friend_id columns representing each user's friends, and another called `page_likes`, with user_id and page_id columns representing the pages each user liked.

`friends` table

| column    | type    |
|-----------|---------|
| user_id   | integer |
| friend_id | integer |

`page_likes` table

| column  | type    |
|---------|---------|
| user_id | integer |
| page_id | integer |
Write an SQL query to create a metric that recommends pages for each user based on the pages their friends have liked.
We can start by visualizing what kind of output we want from the query. Given that we have to create a metric for each user to recommend pages, we know we want something with a user_id and a page_id along with some sort of recommendation score.
How can we easily represent the scores of each user_id and page_id combo? One naive method would be to create a score by summing up the total likes by friends on each page that the user hasn't currently liked. The page with the highest score for a user would then be the most recommendable page.
The first thing we have to do then is to write a query to associate users with their friends' liked pages. We can do that easily with an initial join between the two tables.
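The naive scoring idea can be sketched as follows, using sqlite3 to keep the example runnable. The rows are hypothetical, and we assume that `friends` stores each friendship in both directions and that `page_likes` has no duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
CREATE TABLE page_likes (user_id INTEGER, page_id INTEGER);
-- Hypothetical sample rows; friendships are stored in both directions.
INSERT INTO friends VALUES (1, 2), (1, 3), (2, 1), (3, 1);
INSERT INTO page_likes VALUES (1, 10), (2, 10), (2, 20), (3, 20), (3, 30);
""")

RECOMMEND_SQL = """
SELECT f.user_id, pl.page_id, COUNT(*) AS score
FROM friends AS f
-- Associate each user with the pages their friends liked.
JOIN page_likes AS pl
    ON pl.user_id = f.friend_id
-- Anti-join: drop pages the user has already liked.
LEFT JOIN page_likes AS own
    ON own.user_id = f.user_id
   AND own.page_id = pl.page_id
WHERE own.page_id IS NULL
GROUP BY f.user_id, pl.page_id
ORDER BY f.user_id, score DESC, pl.page_id;
"""

recommendations = conn.execute(RECOMMEND_SQL).fetchall()
print(recommendations)
```

Here user 1 gets page 20 with a score of 2 because two friends liked it, while page 10 is excluded because user 1 already likes it.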
Statistics and Probability Interview Questions
Q1: What do you think the distribution of time spent per day on Facebook looks like? What metrics would you use to describe that distribution?
Having the vocabulary to describe a distribution is an important skill as a data scientist when it comes to communicating ideas to your peers. There are four important concepts, with supporting vocabulary, that you can use to structure your answer to a question like this. These are:
- Center (mean, median, mode)
- Spread (standard deviation, interquartile range, range)
- Shape (skewness, kurtosis, unimodal or bimodal)
- Outliers (Do they exist?)
In terms of the distribution of time spent per day on Facebook (FB), one can imagine there may be two groups of people on Facebook:
- People who scroll quickly through their feed and don’t spend too much time on FB.
- People who spend a large amount of their social media time on FB.
Based on this, what kind of claims could we make about the distribution of time spent on FB?
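To make the vocabulary concrete, here is a small sketch that computes center and spread measures with Python's standard statistics module. The minutes-per-day values are invented for illustration (a cluster of light users and a cluster of heavy users) and are not real Facebook data.

```python
import statistics


def describe(values):
    """Return basic center and spread measures for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "iqr": q3 - q1,
        "range": max(values) - min(values),
    }


# Hypothetical sample: many light users plus a few heavy users.
minutes_per_day = [2, 3, 5, 5, 8, 10, 12, 15, 90, 120, 150]
summary = describe(minutes_per_day)
print(summary)
```

In a sample like this the mean sits well above the median, which is the usual signature of a long right tail and a reason to report the median alongside the mean.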
Q2: We use people to rate ads. There are two types of raters, random and independent from our point of view:
- 80% of raters are careful and they rate an ad as good (60% chance) or bad (40% chance).
- 20% of raters are lazy and they rate every ad as good (100% chance).
Suppose we have 100 raters each rating one ad independently. What's the expected number of ads rated good?
Keep in mind that in order for the rater to rate an ad, the rater must first be selected. So the event that the rater is selected happens first, then the rating happens. How would you represent this fact arithmetically using basic properties of probability?
Hint: If we only have one rater, we don't need to test that rater's personality more than once.
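One way to write this out is with the law of total probability: condition on the rater type first, then on the rating. The sketch below uses exact fractions; the variable names are our own.

```python
from fractions import Fraction

# Probability of drawing each rater type.
p_careful = Fraction(8, 10)
p_lazy = Fraction(2, 10)

# Probability of a "good" rating given the rater type.
p_good_given_careful = Fraction(6, 10)
p_good_given_lazy = Fraction(1)

# Law of total probability: P(good) = sum over types of P(type) * P(good | type).
p_good = p_careful * p_good_given_careful + p_lazy * p_good_given_lazy

# Linearity of expectation over 100 independent raters, one ad each.
expected_good = 100 * p_good

print(p_good, expected_good)
```

Each rater contributes a "good" rating with probability 0.8 × 0.6 + 0.2 × 1 = 0.68, so 100 raters give 68 good ratings in expectation.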
Q3: Three zebras are chilling in the desert when a lion suddenly attacks. Each zebra is sitting on a corner of an equilateral triangle. Each zebra randomly picks a direction and runs only along the outline of the triangle, toward one of the other two corners.
What is the probability that none of the zebras collide?
There are two scenarios in which the zebras do not collide: if they all move clockwise or if they all move counterclockwise.
How do we calculate the probability that an individual zebra chooses to move clockwise or counterclockwise? How can we use this individual probability to calculate the probability that all zebras choose to move in the same direction?
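Because the outcome space is tiny, we can check the reasoning by brute force. This sketch assumes each zebra picks clockwise or counterclockwise with equal probability, independently of the others.

```python
from fractions import Fraction
from itertools import product

DIRECTIONS = ("clockwise", "counterclockwise")

# All 2 ** 3 = 8 equally likely combinations of choices for three zebras.
outcomes = list(product(DIRECTIONS, repeat=3))

# No collision only when every zebra runs the same way around the triangle.
no_collision = [choice for choice in outcomes if len(set(choice)) == 1]

probability = Fraction(len(no_collision), len(outcomes))
print(probability)
```

The enumeration agrees with the direct calculation: 2 × (1/2)³ = 1/4.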
Machine Learning Interview Questions
Q1: How would you test whether having more friends now increases the probability that a Facebook member is still an active user after 6 months?
Since we are interested in whether someone will still be an active user in 6 months, we can test this assumption by first looking at the existing data. One way to do so is to put users into buckets determined by their friend count six months ago and then look at their activity over the following six months.
If we set a metric to define "active user", such as whether they logged in X number of times, posted once, etc., we can then compute the averages of these metrics across the buckets to determine whether having more friends is associated with higher engagement.
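The bucketing idea can be sketched in plain Python. Everything here is hypothetical: the bucket edges, the function names, and the sample users, each given as a (friend count six months ago, is active now) pair.

```python
from collections import defaultdict


def bucket_label(friend_count, edges):
    """Return a label such as '0-49' or '200+' for the bucket holding friend_count."""
    for lower, upper in zip(edges, edges[1:]):
        if lower <= friend_count < upper:
            return f"{lower}-{upper - 1}"
    return f"{edges[-1]}+"


def active_rate_by_bucket(users, edges):
    """Compute the share of active users within each friend-count bucket."""
    counts = defaultdict(lambda: [0, 0])  # label -> [active users, all users]
    for friend_count, is_active in users:
        bucket = counts[bucket_label(friend_count, edges)]
        bucket[0] += int(is_active)
        bucket[1] += 1
    return {label: active / total for label, (active, total) in counts.items()}


# Hypothetical users: (friend count six months ago, active today?)
users = [(10, False), (30, True), (60, True), (120, True), (150, False), (400, True)]
rates = active_rate_by_bucket(users, edges=[0, 50, 200])
print(rates)
```

A rising active rate across buckets would show an association only. Establishing that more friends actually causes retention would need an experiment or careful controls for confounders such as account age.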
Q2: We're given different posts such as your friends' baby pictures, Buzzfeed Tasty videos, and birthday posts and have to decide how to rank them.
How would you optimize the ratio of public versus private content? How would you build a model, what features would you use, and what metrics would you track?
Q3: You've been asked to generate a machine learning model that can map the nicknames of people using Facebook. How do you go about designing this model?
Q4: A product manager has asked you to develop a method to match users to their siblings on Facebook. How would you evaluate a method or algorithm to match users with their siblings? What metrics might you use?
Facebook System Design Interview Questions
Q1: How would you build the recommendation algorithm for type-ahead search for Netflix?
Let's think about a simple use case to start out with. Say that we type in the word "hello" for the beginning of a movie.
If we typed in h-e-l-l-o, then a suitable suggestion might be a movie like "Hello Sunshine" or a Spanish movie named "Hola".
Let's now move on to an MVP within the scope. We can begin to think of the solution in the form of a prefix table.
A prefix table maps a prefix (the input string typed so far) to an output string, one suggestion at a time to start with. For an MVP, we could input a string and output a suggestion string, with fuzzy matching and context matching added on top.
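A bare-bones prefix table can be sketched as a dictionary from every prefix of a title to a short list of suggestions. This is only an illustration under simplifying assumptions: matching is exact and case-insensitive, suggestions are ordered alphabetically rather than ranked, and fuzzy and context matching are left out. The function names and the extra titles "Hellboy" and "Her" are ours.

```python
from collections import defaultdict


def build_prefix_table(titles, max_suggestions=3):
    """Map each lowercase prefix to at most max_suggestions titles."""
    table = defaultdict(list)
    for title in sorted(titles):
        key = title.lower()
        for end in range(1, len(key) + 1):
            bucket = table[key[:end]]
            if len(bucket) < max_suggestions:
                bucket.append(title)
    return table


def suggest(table, typed):
    """Look up suggestions for what the user has typed so far."""
    return table.get(typed.lower(), [])


table = build_prefix_table(["Hello Sunshine", "Hola", "Hellboy", "Her"])
print(suggest(table, "hel"))
print(suggest(table, "hello"))
```

Note that "Hola" is never returned for "hello" here, since the lookup is a strict prefix match. Surfacing it would require the fuzzy or context matching mentioned above.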
But now how do we recommend a certain movie?
Coding Interview Questions
Q1: There are two lists of dictionaries representing friendship beginnings and endings: friends_added and friends_removed. Each dictionary contains the user_ids and created_at time of the friendship beginning/ending.
Write a function to generate an output which lists the pairs of friends with their corresponding timestamps of the friendship beginning and then the timestamp of the friendship ending.
Note that you are only looking for friendships that have an end date. Because of this, every friendship that will be in our final output is contained within the friends_removed list. So if you start by iterating through the friends_removed list, you will already have the id pair and the end date of each listing in our final output. You just need to find the corresponding start date for each end date.
The friends_added and friends_removed lists are already sorted by date. Because of this, you can be sure that as long as you iterate from the top through both, you will find the correct pairings of dates since each end date can only have one corresponding start date appearing before it in time.
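One possible sketch of this pairing logic is below. The exact input shape is not given in the question, so we assume each dictionary looks like {"user_ids": [1, 2], "created_at": "2020-01-01"} with ISO-formatted date strings; the output keys start_date and end_date are also our own choice.

```python
from collections import defaultdict, deque


def friendship_timeline(friends_added, friends_removed):
    """Pair each friendship ending with the earliest unmatched beginning."""
    # Queue of start dates per pair; sorting the ids makes [1, 2] and [2, 1] equal.
    starts = defaultdict(deque)
    for event in friends_added:
        starts[tuple(sorted(event["user_ids"]))].append(event["created_at"])

    timeline = []
    for event in friends_removed:
        pair = tuple(sorted(event["user_ids"]))
        # Inputs are sorted by date, so the oldest unmatched start belongs to this end.
        if starts[pair] and starts[pair][0] <= event["created_at"]:
            timeline.append({
                "user_ids": list(pair),
                "start_date": starts[pair].popleft(),
                "end_date": event["created_at"],
            })
    return timeline


friends_added = [
    {"user_ids": [1, 2], "created_at": "2020-01-01"},
    {"user_ids": [3, 2], "created_at": "2020-01-02"},
    {"user_ids": [2, 1], "created_at": "2020-02-02"},
    {"user_ids": [4, 1], "created_at": "2020-02-02"},
]
friends_removed = [
    {"user_ids": [2, 1], "created_at": "2020-01-03"},
    {"user_ids": [2, 3], "created_at": "2020-01-05"},
    {"user_ids": [1, 2], "created_at": "2020-02-05"},
]
timeline = friendship_timeline(friends_added, friends_removed)
print(timeline)
```

Users 1 and 4 never unfriend each other, so that pair is left out, while users 1 and 2 appear twice because they became friends again after the first ending.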
Q2: In data science, there exists the concept of stemming, which is the heuristic of chopping off the end of a word to clean and bucket it into an easier feature set.
Given a dictionary consisting of many roots and a sentence, stem all the words in the sentence with the root forming it. If a word can be formed by more than one root, replace it with the shortest of those roots.
Input:
roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
Output:
"the cat was rat by the bat"
At first it looks like we can simply loop through each word, check whether a root exists in the word, and if so, replace the word with the root. But since we are stemming the words, we have to make sure that the root matches the word at its prefix rather than anywhere within the word.
We're given a dictionary of roots with a sentence string. Since we have to check each word, try creating a function that takes a word and returns the root if one matches, or the word itself if none does.
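A straightforward sketch of that helper and the sentence-level function might look like this. The function names are our own, and the prefix check uses str.startswith. For a very large root list a trie would likely be faster, but that is beyond this sketch.

```python
def stem_word(word, roots):
    """Return the shortest root that prefixes word, or word itself if none does."""
    matches = [root for root in roots if word.startswith(root)]
    return min(matches, key=len) if matches else word


def stem_sentence(roots, sentence):
    """Stem every whitespace-separated word in the sentence."""
    return " ".join(stem_word(word, roots) for word in sentence.split())


roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
stemmed = stem_sentence(roots, sentence)
print(stemmed)
```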
Q3: You're given a dataframe of students:
| name            | age | favorite_color | grade |
|-----------------|-----|----------------|-------|
| Tim Voss        | 19  | red            | 91    |
| Nicole Johnson  | 20  | yellow         | 95    |
| Elsa Williams   | 21  | green          | 82    |
| John James      | 20  | blue           | 75    |
| Catherine Jones | 23  | green          | 93    |
Write a function to select only the rows where the student's favorite color is green or red and their grade is above 90.
This question requires us to filter a dataframe by two conditions: first, the grade of the student, and second, their favorite color.
Let's start with filtering by grade since it's a bit simpler than filtering by strings. We can filter rows in pandas by reassigning the dataframe to a boolean-indexed version of itself. In this case:
students_df = students_df[students_df["grade"] > 90]
If we were to look at our dataframe after running that line of code, we'd see that every student with a grade of 90 or lower no longer appears in our dataframe.
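Putting both conditions together, a minimal sketch could look like the following. The function name grades_colors is our own, and the color condition uses pandas' Series.isin, which is one of several reasonable ways to express "green or red".

```python
import pandas as pd


def grades_colors(students_df):
    """Keep students whose grade is above 90 and whose favorite color is green or red."""
    high_grade = students_df["grade"] > 90
    liked_color = students_df["favorite_color"].isin(["green", "red"])
    # Combine the two boolean masks with & (element-wise AND).
    return students_df[high_grade & liked_color]


students_df = pd.DataFrame({
    "name": ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"],
    "age": [19, 20, 21, 20, 23],
    "favorite_color": ["red", "yellow", "green", "blue", "green"],
    "grade": [91, 95, 82, 75, 93],
})

result = grades_colors(students_df)
print(result)
```

Building the masks separately and combining them once avoids reassigning the dataframe between steps, which keeps the original data intact for later use.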