In many of our crowd-sourced interview guides, Facebook data scientists, analysts, and engineers are all asked SQL interview questions, more than the average technical interview. That’s why it’s extremely important to prepare for the SQL section of the Facebook technical screen, as it’s sure to come up.
Facebook candidates can be expected to demonstrate a solid foundation of SQL knowledge – if not expertise – in several different categories. With such broad constraints for the technical screen, it may be hard to determine exactly what to focus on for the SQL portion. To help you, we’ve prepared some sample questions to prep for Facebook SQL interviews.
It’s helpful to prep when you have an idea of the SQL questions that get asked at Facebook. Essentially, these questions fall into two categories:
Definition-based questions - Definition questions are used to quickly assess your SQL knowledge. These will be simple questions like, “What is a JOIN?” or more theoretical questions like “Why would you use an index in SQL?” You’ll need a solid grasp of key terms and concepts to nail this portion of the Facebook interview.
Writing queries - Facebook interviews ask a lot of SQL questions. And the majority tend to fall into this category. In Facebook SQL interviews, you’ll likely be asked to write a query using two or more datasets. These are generally based on Facebook (or Instagram) cases.
To answer this question, it really depends on the role. In some positions, like product analysts - for example - you’ll use SQL everyday, and will likely be asked a wide range of SQL query questions. Other roles, like data engineers, will use SQL for data processing and database management, which might ask more theoretical questions.
Here’s a look at SQL by role:
Here are some sample Facebook SQL questions that are definitions, as well as theoretical. As you study, focus on those definitions (especially being able to explain them in layman’s terms), as well as how they might apply theoretically.
A foreign key is a field or multiple fields in one table that can refer back to the primary key in another. The table with the primary key is referred to as the parent or referenced table, while the one with the foreign key is the child table. Foreign keys are used to provide the referential integrity between tables.
No, NULL is not the same as a zero or blank space. Instead, NULL is used in the absence of any value. In other words, the value is unavailable, unknown, unassigned or not appropriate. Zeroes and blank spaces are treated differently, and they can be compared to other zeroes or blank spaces, respectively. NULL cannot be compared to another NULL.
SQL injection, also known as SQLI, is a type of vulnerability that uses malicious SQL code to give attacks access to the backend database. Attackers can then manipulate the database and gain access to private information like user lists and customer details.
One of the first steps would be input validation, to identify the illegitimate user inputs.
Both allow you to find multiple values from the table. But a key difference is the type of data you are selecting with these operators. For example, BETWEEN selects a range of data between two values. Alternatively, IN allows you select multiple values.
In SQL, a cursor is a temporary memory or workstation. They store database tables, and in SQL there are two main types:
A trigger is a stored procedure that automatically runs when a certain event occurs in the database server. For example, DML triggers run when a user tries to modify data. Triggers can be used to:
In the technical screen, you’ll be asked a range of query-based SQL questions. These will ask you to write queries based on provided datasets. The best prep strategy is to practice and write lots of sample queries aligned to real case studies. Here are questions that will help you develop that skill:
users
table
Columns | Type |
---|---|
id |
INTEGER |
name |
STRING |
created_at |
DATETIME |
neighborhood_id |
INTEGER |
mail |
STRING |
comments
table
Columns | Type |
---|---|
user_id |
INTEGER |
body |
VARCHAR |
created_at |
DATETIME |
Hint: Since a histogram is just a display of frequencies of each user, all we really need to do is get the total count of user comments in the month of January 2020 for each user, and then group by that count.
action
represents either (‘post_enter’, ‘post_submit’, ‘post_canceled’) for when a user starts a post (enter), ends up canceling it (cancel), or ends up posting it (submit). Write a query to get the post success rate for each day in the month of January 2020.events
table
Columns | Type |
---|---|
id |
INTEGER |
user_id |
INTEGER |
created_at |
DATETIME |
action |
STRING |
url |
STRING |
platform |
STRING |
Hint: Let’s see if we can clearly define the metrics we want to calculate before just jumping into the problem. We want post success rate for each day over the past week. How would we get that metric?
We can assume post success rate can be defined as: (total posts created) / (total posts entered)
Additionally, since the success rate must be broken down by day, we must make sure that a post that is entered must be completed on the same day.
Now that we have these requirements, it’s time to calculate our metrics. We know we have to GROUP BY the date to get each day’s posting success rate. We also have to break down how we can compute our two metrics of total posts entered and total posts actually created.
More Context: You want to build a naïve recommender, and you are given two tables. One is called friends (user_id and friend_id) and another called page_likes (with user_id and page_id). Note: it shouldn’t recommend pages that the user already likes.
Let’s solve this Facebook SQL interview question by visualizing what kind of output we want from the query. Given that we have to create a metric by which to evaluate recommending a user to a page, we know we want something with a user_id and a page_id along with some sort of recommendation score.
Let’s try to think of an easy way to represent the scores of each user_id and page_id combo. One naïve method would be to create a score by summing up the total likes by friends on each page that the user has not currently liked. Then the max value on our metric will be the most recommendable page.
The first thing we have to do is then to write a query to associate users to their friends’ liked pages. We can do that easily with an initial join between the two tables.
WITH t1 AS (
SELECT
f.user_id
, f.friend_id
, pl.page_id
FROM friends AS f
INNER JOIN page_likes AS pl
ON f.friend_id = pl.user_id
)
What comes next to reaching the desired output?
transactions
table
Columns | Type |
---|---|
id |
integer |
user_id |
INTEGER |
created_at |
DATETIME |
product_id |
INTEGER |
quantity |
INTEGER |
users
table
Columns | Type |
---|---|
id |
INTEGER |
name |
VARCHAR |
sex |
VARCHAR |
products
table
Columns | Type |
---|---|
id |
INTEGER |
name |
STRING |
price |
FLOAT |
Men
Women
Would the difference in AOV (average order value) be significant?
Note: Remember to round your answer to two decimal places.
Part 1:
Write a query to compute a metric to measure the quality of the search results for each query.
This is an unusual SQL problem, given it asks to define a metric and then write a query to compute it. Generally, this should be pretty simple. Can we rank by the metric and figure out which query has the best overall results?
For example, if the search query for ‘tiger’ has 5s for each result, then that would be a perfect result.
The way to compute that metric would be to simply take the average of the rating for all of the results. In which the query can very easily be:
sql
SELECT query, ROUND(AVG(rating), 2) AS avg_rating
FROM search_results
GROUP BY 1
Part 2: You want to be able to compute a metric that measures the precision of the above ranking system based on position. For example, if the results for dog and cat are:
query | result_id | position | rating | notes |
---|---|---|---|---|
dog | 1000 | 1 | 2 | picture of hotdog |
dog | 998 | 2 | 4 | dog walking |
dog | 342 | 3 | 1 | zebra |
cat | 123 | 1 | 4 | picture of cat |
cat | 435 | 2 | 2 | cat memes |
cat | 545 | 3 | 1 | pizza shops |
We would rank ‘cat’ as having a better search result ranking precision than ‘dog’ based on the correct sorting by rating.
Write a query to create a metric that can validate and rank the queries by their search result precision. Round the metric (avg_rating column) to 2 decimal places.
The precision metric is a little more difficult now that we have to account for a second factor, which is position. We now have to find a way to weight the position in accordance to the rating to normalize the metric score.
This type of problem set can get very complicated if we wanted to dive deeper into it. However, the question is clearly more marked towards being practical in figuring out the metric and developing an easy SQL query, rather than developing a search ranking precision scale that optimizes for something like CTR.
To solve the problem, it’s helpful to look at the example to construct an approach towards a metric. For example, if the first result is rated at 5 and the last result is rated at a 1, that’s good. Even better, however, is if the first result is rated 5 and the last result is also rated 5. Bad is if the first result is 1 and the last result is 5.
However, if we use the approach from Part 1, we’ll get the same metric score no matter which way the values are ranked by position. So how do we factor position into the ranking?
employees
table
Columns | Types |
---|---|
id |
int |
first_name |
VARCHAR |
last_name |
VARCHAR |
salary |
INTEGER |
department_id |
INTEGER |
departments
table
Columns | Types |
---|---|
id |
INTEGER |
name |
VARCHAR |
Let’s approach this problem by looking at the output of what the response would look like.
We know that we need to calculate the total number employees that are making over $100K by each department. This means that we’re going to have to run a GROUP BY
on the department name since we want a new row for each department.
We also need a formula to represent how we can differentiate employees that make over $100K and those that make less. We can calculate that by formulating:
(Number of people making over $100K) / (Total number of people in that department)
Now in terms of implementation, if we first do a JOIN between the employees table and the departments table, then we can get all of the data points we need together. Then all that is left is:
users
table
Columns | Types |
---|---|
id | INTEGER |
name | VARCHAR |
neighborhood_id | INTEGER |
created_at | DATETIME |
neighborhoods
table
Columns | Type |
---|---|
id |
INTEGER |
name |
VARCHAR |
city_id |
INTEGER |
Write a query that returns all of the neighborhoods that have 0 users.
Whenever the question asks about finding values with 0 something (users, employees, posts, etc..) immediately think of the concept of LEFT JOIN! An inner join finds any values that are in both tables, a left join keeps only the values in the left table.
Our predicament is to find all the neighborhoods without users. To do this, we must do a left join from the neighborhoods table to the users table, which will give us an output like this:
neighborhoods.name |
users.id |
---|---|
castro | 123 |
castro | 124 |
cole valley | null |
castro heights | 534 |
castro heights | 564 |
If we then add in a where condition of WHERE users.id IS NULL, then we will get every single neighborhood without a singular user.
In nearly all Facebook roles, including Facebook data science internships, SQL interview questions are an essential part of the interview. Prep for these questions with a solid study of the basics. You need a strong foundation of SQL expertise to ace this part of the technical screen. But you’ll also want plenty of experience writing advanced queries as well, especially case-based SQL questions.