LinkedIn’s data science team leverages billions of data points to drive member engagement, business growth, and monetization efforts. With over 500 million members around the world and a mix of B2B and B2C programs, the data science team has a huge impact on product direction. Data science at LinkedIn is generally focused on the business side rather than engineering, and the role functions more like a product analyst or analytics position at many other companies.

The LinkedIn Interview

The LinkedIn interview process is relatively straightforward. Recruiters at LinkedIn like to dogfood their own product, so they will likely send you a message or InMail through LinkedIn to schedule a 30-minute phone screen, during which they'll get to understand your interest in the company and see if the role is a good fit.

Technical Screen

The initial technical screen consists of two separate phone interviews, each lasting 30 to 45 minutes.

One interview is more technically focused, testing concepts in SQL and data processing, while the other runs through a product and business case study. Depending on how your process is scheduled, either interview could come first. However, you are not guaranteed both interviews if you do poorly on one of them. Both interviewers will be members of the LinkedIn data science team, and each call leaves ample time at the end for your questions.

Product and Business Case Study

Q1: We're working on a new feature for LinkedIn chat. We want to implement a green dot to show an “active user,” but given engineering constraints, we can't A/B test it before release.

How would you analyze the effectiveness of this new feature?

Hint: While it may be tempting to point to increased usage of LinkedIn chat as a sign that the green dot is "effective," you may need a more solid metric that ties this increased usage back to the profit-generating aspects of LinkedIn's business model.
Green Dot — Interview Query product metrics problem
See the video solution here.


Q2: We're given a dataset of page views where each row represents one page view. How would you differentiate between scrapers and real people?

There is no single right answer for problems like this. Modeling-based theoretical questions are meant to assess whether you can make realistic assumptions and come up with a solution under those assumptions. The discussion will likely follow whatever path the interviewer explores as you state assumptions and draw conclusions.

We're given a dataset of page views generated by a mix of likely scrapers and real users. Because the intent of a scraper is to extract data from the LinkedIn network, a scraper will almost surely rack up a large number of page views, and each view will likely be quite short, since a bot can process information much faster than a human (it only needs to download the fetched page and do some simple processing, such as extracting URLs that lead to other pages on LinkedIn).

A real user, on the other hand, tends to view fewer pages and spend more time on each one. The link traversal also looks different: we'd expect real users to move between pages mostly by following links on the site, whereas a scraper is more likely to request a series of different URLs directly.
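
To make this concrete, here is a hedged sketch of how you might surface candidate scrapers, assuming hypothetical columns such as `user_id` and `seconds_on_page` in a `page_views` table (the thresholds are made up and would need tuning against real data):

-- Surface accounts with unusually many page views and unusually short views
SELECT
    user_id,
    COUNT(*) AS total_views,
    AVG(seconds_on_page) AS avg_seconds_on_page
FROM page_views
GROUP BY user_id
HAVING COUNT(*) > 1000            -- hypothetical volume threshold
   AND AVG(seconds_on_page) < 2   -- hypothetical duration threshold
ORDER BY total_views DESC;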

Scrapers or Users — Interview Query business case problem
See the full solution here.

SQL Interview Questions

Q1: We're interested in analyzing the career paths of data scientists given a table of user experiences representing each person's past work experiences and timelines. The titles we care about are bucketed into data scientist, senior data scientist, and data science manager.

`user_experiences` table

| column          | type     |
|-----------------|----------|
| id              | integer  |
| user_id         | integer  |
| title           | string   |
| company         | string   |
| start_date      | datetime |
| end_date        | datetime |
| is_current_role | boolean  |

Write a query to prove or disprove the following hypothesis: a data scientist who switches jobs more often ends up getting promoted to a manager role faster than a data scientist who stays at one job for longer.

The hypothesis is that data scientists who switch jobs more often get promoted faster.

Therefore, in analyzing this dataset, we can test the hypothesis by segmenting data scientists by how often they have switched jobs in their careers.

For example, if we looked at data scientists who have been in the field for 5 years, the data would support the hypothesis if the share of data science managers increased with the number of career jumps:

  • Never switched jobs: 10% are managers
  • Switched jobs once: 20% are managers
  • Switched jobs twice: 30% are managers
  • Switched jobs three times: 40% are managers

We could look at this over different buckets of time as well to see if the correlation stays consistent after 10 years and 15 years in a data science career.
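
A hedged sketch of that segmentation, assuming a job switch means moving to a different company and that promotion is flagged by the title reaching data science manager (bucketing by years in the field, as discussed above, is left out for brevity):

-- Bucket each person by number of company switches and compute the share who became managers
WITH per_user AS (
    SELECT
        user_id,
        COUNT(DISTINCT company) - 1 AS job_switches,
        -- assumes titles are stored exactly as the three buckets listed above
        MAX(CASE WHEN title = 'data science manager' THEN 1 ELSE 0 END) AS became_manager
    FROM user_experiences
    WHERE title IN ('data scientist', 'senior data scientist', 'data science manager')
    GROUP BY user_id
)

SELECT
    job_switches,
    AVG(became_manager) AS pct_managers
FROM per_user
GROUP BY job_switches
ORDER BY job_switches;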

Career Jumping — Interview Query sql problem
See the full question and solution here.

Q2: Given a table of job postings, write a query to break down the number of users that have posted their jobs once versus the number of users that have posted at least one job multiple times.

`job_postings` table

| column      | type     |
|-------------|----------|
| id          | integer  |
| job_id      | integer  |
| user_id     | integer  |
| date_posted | datetime |

First, let's visualize what the output would look like.

We want two different metrics: the number of users that have posted each of their jobs only once versus the number of users that have posted at least one job multiple times. What does that mean exactly?

Well, if a user has 5 jobs and posted each of them only once, then they fall into the first group. But if they have 5 jobs and posted a total of 7 times, they must have posted at least one job multiple times.

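A hedged sketch of that breakdown: count how many times each user posted each job, then classify each user by whether any of their jobs was posted more than once.

-- Count posts per user per job, then split users into one-time and repeat posters
WITH posts_per_job AS (
    SELECT
        user_id,
        job_id,
        COUNT(*) AS n_posts
    FROM job_postings
    GROUP BY user_id, job_id
),

per_user AS (
    SELECT
        user_id,
        MAX(n_posts) AS max_posts
    FROM posts_per_job
    GROUP BY user_id
)

SELECT
    SUM(CASE WHEN max_posts = 1 THEN 1 ELSE 0 END) AS posted_once,
    SUM(CASE WHEN max_posts > 1 THEN 1 ELSE 0 END) AS posted_multiple_times
FROM per_user;
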
Repeat Job Postings — Interview Query sql problem
See the full question and solution here.

Q3: Given a table of transactions and products, write a query to get the month-over-month change in revenue for the year 2019. Make sure to round month_over_month to 2 decimal places.

`transactions` table

| column     | type     |
|------------|----------|
| id         | integer  |
| user_id    | integer  |
| created_at | datetime |
| product_id | integer  |
| quantity   | integer  |

`products` table

| column | type    |
|--------|---------|
| id     | integer |
| name   | string  |
| price  | float   |

Whenever there is a question about month-over-month, week-over-week, or year-over-year change, note that it can generally be solved in two different ways.

One is to use the LAG window function, which is available in most modern SQL dialects. Another is to do a sneaky self-join offset by one period.

For both, we first have to sum the transactions and group by month and year. Grouping by the year is technically redundant here because we are only looking at 2019.

WITH monthly_transactions AS (
    -- Total revenue per month across all products, limited to 2019
    SELECT 
        MONTH(created_at) AS month,
        YEAR(created_at) AS year,
        SUM(price * quantity) AS revenue
    FROM transactions AS t
    INNER JOIN products AS p
            ON t.product_id = p.id
    WHERE YEAR(created_at) = 2019
    GROUP BY 1,2
    ORDER BY 1
)

SELECT * FROM monthly_transactions
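
From there, one hedged way to finish with LAG, assuming month-over-month is defined as the percent change from the prior month (the linked solution spells out the exact expected output), is to replace the final SELECT above with something like:

-- Percent change in revenue versus the prior month, rounded to two decimals
SELECT
    month,
    ROUND(
        (revenue - LAG(revenue) OVER (ORDER BY month)) /
        LAG(revenue) OVER (ORDER BY month) * 100, 2
    ) AS month_over_month
FROM monthly_transactions;
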
Month Over Month — Interview Query sql problem
See the full solution here.

Q4: Write a query to return the percentage of users that satisfy either of the following conditions:

  1. posted a job that is more than 180 days old.
  2. posted a job that has the same job_id as one or more previous postings that, altogether, are more than 180 days old.

`job_postings` table

| column      | type     |
|-------------|----------|
| id          | integer  |
| job_id      | integer  |
| user_id     | integer  |
| date_posted | datetime |

180 Day Job Postings — Interview Query sql problem
See the full question here.

Q5: We're given two tables: a table of notification deliveries and a table of users with account creation and purchase conversion dates. If the user hasn't purchased, then the `conversion_date` column is NULL.

`notification_deliveries` table

| column       | type     |
|--------------|----------|
| notification | varchar  |
| user_id      | int      |
| created_at   | datetime |

`users` table

| column          | type     |
|-----------------|----------|
| id              | int      |
| created_at      | datetime |
| conversion_date | datetime |

Write a query to get the distribution of total push notifications before a user converts.

If we're looking for the distribution of total push notifications sent before a user converts, we can see that we want our end result to look something like this:

| total_pushes | frequency |
|--------------|-----------|
| 0            | 100       |
| 1            | 250       |
| 2            | 300       |
| ...          | ...       |

In order to get there, we have to apply a few logical conditions to the JOIN between users and notification_deliveries:

  1. We have to join on the user_id field in both tables.
  2. We have to exclude all users that have not converted.
  3. We have to require that conversion_date is greater than the created_at value in the deliveries table, so that we only count notifications sent before the user converted.

Additionally, we know this has to be a LEFT JOIN from users to deliveries so that we also capture users who converted after receiving zero push notifications.

We can get the count per user, and then group by that count to get the overall distribution.
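
A minimal sketch of that logic, following the conditions above:

-- Count pre-conversion notifications per converted user, then bucket users by that count
WITH pushes_per_user AS (
    SELECT
        u.id AS user_id,
        COUNT(n.user_id) AS total_pushes
    FROM users AS u
    LEFT JOIN notification_deliveries AS n
        ON n.user_id = u.id
       AND n.created_at < u.conversion_date
    WHERE u.conversion_date IS NOT NULL
    GROUP BY u.id
)

SELECT
    total_pushes,
    COUNT(*) AS frequency
FROM pushes_per_user
GROUP BY total_pushes
ORDER BY total_pushes;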

Notification Deliveries — Interview Query sql problem
See the full solution here.

Statistics and Probability Interview Questions

Q1: A deck of 500 cards, numbered from 1 to 500, is shuffled randomly. If you are asked to pick three cards, one at a time, what's the probability of each subsequent card being larger than the previously drawn card?

Hint: Let's say the question is actually 100 cards and you select 3 cards without replacement. Does the answer change?

Imagine this as a sample space problem, ignoring all other distracting details. If you draw three differently numbered cards without replacement, then there will effectively be a lowest card, a middle card, and a highest card.

Let's make it easy and assume we drew the numbers 1, 2, and 3. If we drew them in the order (1, 2, 3), that would be the winning scenario. But what's the full range of orders in which we could have drawn them?
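
Continuing that reasoning as a quick sanity check: three distinct values can be drawn in 3! = 6 equally likely orders, and only one of those orders is strictly increasing. So, whether the deck has 100 or 500 cards:

P(each card larger than the previous) = 1 / 3! = 1 / 6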

500 Cards — Interview Query probability problem
See the full solution here.

Machine Learning Interview Questions

Q1: You're working on a job recommendation engine. You have access to all user LinkedIn profiles, a list of jobs each user applied to, and answers to questions that the user filled in about their job search.

Using this information, how would you build a job recommendation feed?

For this problem, we have to understand what our dataset consists of before being able to build a model for recommendations. More importantly, we need to understand what a recommendation feed might look like for the user.

For example, we expect that the user could go to a tab or open the mobile app and view a list of recommended jobs, sorted with the most strongly recommended at the top.

We can use either an unsupervised or a supervised model. For an unsupervised approach, we could run a nearest neighbors or collaborative filtering algorithm on features from users and jobs. But if we want more accuracy, we would likely go with a supervised classification model.

If we use a supervised model, we need to frame our training dataset as features and an output label (whether or not the user applied to the job).

The expected result is that, for each user, we will have user feature data in the form of their profile and user activity data extracted from the questions they have answered. Additionally, we'll have all of the jobs the user applied to. What we're missing is data on jobs the user did not apply to.
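
As a rough sketch of how that labeled training set could be assembled, using hypothetical tables `job_impressions` (jobs shown to a user) and `job_applications` (jobs the user applied to) to stand in for whatever data is actually available:

-- Label = 1 if the user applied to the job they were shown, 0 otherwise
SELECT
    i.user_id,
    i.job_id,
    CASE WHEN a.job_id IS NOT NULL THEN 1 ELSE 0 END AS applied
FROM job_impressions AS i
LEFT JOIN job_applications AS a
    ON a.user_id = i.user_id
   AND a.job_id = i.job_id;

User profile and job features would then be joined onto these (user_id, job_id) pairs before training the classifier.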

Job Recommendation — Interview Query machine learning problem