LinkedIn’s data science team leverages billions of data points to drive member engagement, business growth, and monetization efforts. With over 500 million members around the world and a mix of B2B and B2C programs, the data science team has a huge impact on product direction. Data science at LinkedIn is generally focused on the business side rather than engineering, and the role functions more like a product analyst or analytics position at many other companies.

The LinkedIn Interview

The LinkedIn interview process is relatively straightforward. Recruiters at LinkedIn like to dogfood their own product, so they will likely send you a message or InMail through LinkedIn to schedule a 30-minute phone screen, during which they'll get to understand your interest in the company and see if the role is a good fit.

Technical Screen

The initial technical screen consists of two separate phone interviews, each lasting 30 to 45 minutes.

One interview is more technically focused, testing concepts in SQL and data processing, while the other runs through a product and business case study. Depending on how your process is scheduled, either interview could come first. However, you are not guaranteed both interviews if you do poorly on one of them. Both interviewers will be members of the LinkedIn data science team, and each call leaves ample time at the end for your questions.

Product and Business Case Study

Q1: We're working on a new feature for LinkedIn chat. We want to implement a green dot to show an “active user,” but given engineering constraints, we can't A/B test it before release.

How would you analyze the effectiveness of this new feature?

Hint: While it may be tempting to point to increased usage of LinkedIn chat as a sign that the green dot is "effective," you may need a more solid metric that ties this increased usage back to the profit-generating aspects of LinkedIn's business model.
Green Dot — Interview Query product metrics problem
See the video solution here.


Q2: We're given a dataset of page views where each row represents one page view. How would you differentiate between scrapers and real people?

There is no single right answer for problems like this. Modeling-based theoretical questions are meant to assess whether you can make realistic assumptions and come up with a solution under those assumptions. The discussion will likely follow whatever path the interviewer explores as you state assumptions and draw conclusions.

We're given a dataset of page views generated by a mix of likely scrapers and real users. Because the intent of a scraper is to extract data from the LinkedIn network, a scraper will almost surely rack up a large number of page views, and each view will likely be quite short, since a bot can process information much faster than a human (it only needs to download the fetched page and do some simple processing, such as extracting URLs that lead to other pages on LinkedIn).

A real user, on the other hand, tends to view fewer pages and spend more time on each one. The link traversal also looks different: we'd expect real users to move between pages mostly by following links on the site, whereas a scraper is more likely to request a series of different URLs directly.
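
To make this concrete, here is a hedged sketch of how you might surface candidate scrapers, assuming hypothetical columns such as `user_id` and `seconds_on_page` in a `page_views` table (the thresholds are made up and would need tuning against real data):

-- Surface accounts with unusually many page views and unusually short views
SELECT
    user_id,
    COUNT(*) AS total_views,
    AVG(seconds_on_page) AS avg_seconds_on_page
FROM page_views
GROUP BY user_id
HAVING COUNT(*) > 1000            -- hypothetical volume threshold
   AND AVG(seconds_on_page) < 2   -- hypothetical duration threshold
ORDER BY total_views DESC;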

Scrapers or Users — Interview Query business case problem
See the full solution here.

SQL Interview Questions

Q1: We're interested in analyzing the career paths of data scientists given a table of user experiences representing each person's past work experiences and timelines. The titles we care about are bucketed into data scientist, senior data scientist, and data science manager.

`user_experiences` table

| column          | type     |
|-----------------|----------|
| id              | integer  |
| user_id         | integer  |
| title           | string   |
| company         | string   |
| start_date      | datetime |
| end_date        | datetime |
| is_current_role | boolean  |

Write a query to prove or disprove the following hypothesis: a data scientist who switches jobs more often ends up getting promoted to a manager role faster than a data scientist who stays at one job for longer.

The hypothesis is that data scientists who switch jobs more often get promoted faster.

Therefore, in analyzing this dataset, we can test the hypothesis by segmenting data scientists by how often they have switched jobs in their careers.

For example, if we looked at data scientists who have been in the field for 5 years, the data would support the hypothesis if the share of data science managers increased with the number of career jumps:

  • Never switched jobs: 10% are managers
  • Switched jobs once: 20% are managers
  • Switched jobs twice: 30% are managers
  • Switched jobs three times: 40% are managers

We could look at this over different buckets of time as well to see if the correlation stays consistent after 10 years and 15 years in a data science career.
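
A hedged sketch of that segmentation, assuming a job switch means moving to a different company and that promotion is flagged by the title reaching data science manager (bucketing by years in the field, as discussed above, is left out for brevity):

-- Bucket each person by number of company switches and compute the share who became managers
WITH per_user AS (
    SELECT
        user_id,
        COUNT(DISTINCT company) - 1 AS job_switches,
        -- assumes titles are stored exactly as the three buckets listed above
        MAX(CASE WHEN title = 'data science manager' THEN 1 ELSE 0 END) AS became_manager
    FROM user_experiences
    WHERE title IN ('data scientist', 'senior data scientist', 'data science manager')
    GROUP BY user_id
)

SELECT
    job_switches,
    AVG(became_manager) AS pct_managers
FROM per_user
GROUP BY job_switches
ORDER BY job_switches;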

Career Jumping — Interview Query sql problem
See the full question and solution here.

Q2: Given a table of job postings, write a query to break down the number of users that have posted their jobs once versus the number of users that have posted at least one job multiple times.

`job_postings` table

| column      | type     |
|-------------|----------|
| id          | integer  |
| job_id      | integer  |
| user_id     | integer  |
| date_posted | datetime |

First, let's visualize what the output would look like.

We want two different metrics: the number of users that have posted each of their jobs only once versus the number of users that have posted at least one job multiple times. What does that mean exactly?

Well, if a user has 5 jobs and posted each of them only once, then they fall into the first group. But if they have 5 jobs and posted a total of 7 times, they must have posted at least one job multiple times.

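A hedged sketch of that breakdown: count how many times each user posted each job, then classify each user by whether any of their jobs was posted more than once.

-- Count posts per user per job, then split users into one-time and repeat posters
WITH posts_per_job AS (
    SELECT
        user_id,
        job_id,
        COUNT(*) AS n_posts
    FROM job_postings
    GROUP BY user_id, job_id
),

per_user AS (
    SELECT
        user_id,
        MAX(n_posts) AS max_posts
    FROM posts_per_job
    GROUP BY user_id
)

SELECT
    SUM(CASE WHEN max_posts = 1 THEN 1 ELSE 0 END) AS posted_once,
    SUM(CASE WHEN max_posts > 1 THEN 1 ELSE 0 END) AS posted_multiple_times
FROM per_user;
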
Repeat Job Postings — Interview Query sql problem
See the full question and solution here.

Q3: Given a table of transactions and products, write a query to get the month-over-month change in revenue for the year 2019. Make sure to round month_over_month to 2 decimal places.

`transactions` table

| column     | type     |
|------------|----------|
| id         | integer  |
| user_id    | integer  |
| created_at | datetime |
| product_id | integer  |
| quantity   | integer  |

`products` table

| column | type    |
|--------|---------|
| id     | integer |
| name   | string  |
| price  | float   |

Whenever there is a question about month-over-month, week-over-week, or year-over-year change, note that it can generally be solved in two different ways.

One is to use the LAG window function, which is available in most modern SQL dialects. Another is to do a sneaky self-join offset by one period.

For both, we first have to sum the transactions and group by month and year. Grouping by the year is technically redundant here because we are only looking at 2019.

WITH monthly_transactions AS (
    -- Total revenue per month across all products, limited to 2019
    SELECT 
        MONTH(created_at) AS month,
        YEAR(created_at) AS year,
        SUM(price * quantity) AS revenue
    FROM transactions AS t
    INNER JOIN products AS p
            ON t.product_id = p.id
    WHERE YEAR(created_at) = 2019
    GROUP BY 1,2
    ORDER BY 1
)

SELECT * FROM monthly_transactions
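
From there, one hedged way to finish with LAG, assuming month-over-month is defined as the percent change from the prior month (the linked solution spells out the exact expected output), is to replace the final SELECT above with something like:

-- Percent change in revenue versus the prior month, rounded to two decimals
SELECT
    month,
    ROUND(
        (revenue - LAG(revenue) OVER (ORDER BY month)) /
        LAG(revenue) OVER (ORDER BY month) * 100, 2
    ) AS month_over_month
FROM monthly_transactions;
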
Month Over Month — Interview Query sql problem
See the full solution here.

Q4: Write a query to return the percentage of users that satisfy either of the following conditions:

  1. posted a job that is more than 180 days old.
  2. posted a job that has the same job_id as one or more previous postings that, altogether, are more than 180 days old.

`job_postings` table

| column      | type     |
|-------------|----------|
| id          | integer  |
| job_id      | integer  |
| user_id     | integer  |
| date_posted | datetime |

180 Day Job Postings — Interview Query sql problem
See the full question here.

Q5: We're given two tables: a table of notification deliveries and a table of users with account creation and purchase conversion dates. If the user hasn't purchased, then the `conversion_date` column is NULL.

`notification_deliveries` table

| column       | type     |
|--------------|----------|
| notification | varchar  |
| user_id      | int      |
| created_at   | datetime |

`users` table

| column          | type     |
|-----------------|----------|
| id              | int      |
| created_at      | datetime |
| conversion_date | datetime |

Write a query to get the distribution of total push notifications before a user converts.

If we're looking for the distribution of total push notifications sent before a user converts, we can see that we want our end result to look something like this:

| total_pushes | frequency |
|--------------|-----------|
| 0            | 100       |
| 1            | 250       |
| 2            | 300       |
| ...          | ...       |

In order to get there, we have to apply a few logical conditions to the JOIN between users and notification_deliveries:

  1. We have to join on the user_id field in both tables.
  2. We have to exclude all users that have not converted.
  3. We have to require that conversion_date is greater than the created_at value in the deliveries table, so that we only count notifications sent before the user converted.

Additionally, we know this has to be a LEFT JOIN from users to deliveries so that we also capture users who converted after receiving zero push notifications.

We can get the count per user, and then group by that count to get the overall distribution.
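
A minimal sketch of that logic, following the conditions above:

-- Count pre-conversion notifications per converted user, then bucket users by that count
WITH pushes_per_user AS (
    SELECT
        u.id AS user_id,
        COUNT(n.user_id) AS total_pushes
    FROM users AS u
    LEFT JOIN notification_deliveries AS n
        ON n.user_id = u.id
       AND n.created_at < u.conversion_date
    WHERE u.conversion_date IS NOT NULL
    GROUP BY u.id
)

SELECT
    total_pushes,
    COUNT(*) AS frequency
FROM pushes_per_user
GROUP BY total_pushes
ORDER BY total_pushes;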

Notification Deliveries — Interview Query sql problem
See the full solution here.

Statistics and Probability Interview Questions

Q1: A deck of 500 cards, numbered from 1 to 500, is shuffled randomly. If you are asked to pick three cards, one at a time, what's the probability of each subsequent card being larger than the previously drawn card?

Hint: Let's say the question is actually 100 cards and you select 3 cards without replacement. Does the answer change?

Imagine this as a sample space problem, ignoring all other distracting details. If you draw three differently numbered cards without replacement, then there will effectively be a lowest card, a middle card, and a highest card.

Let's make it easy and assume we drew the numbers 1, 2, and 3. If we drew them in the order (1, 2, 3), that would be the winning scenario. But what's the full range of orders in which we could have drawn them?
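
Continuing that reasoning as a quick sanity check: three distinct values can be drawn in 3! = 6 equally likely orders, and only one of those orders is strictly increasing. So, whether the deck has 100 or 500 cards:

P(each card larger than the previous) = 1 / 3! = 1 / 6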

500 Cards — Interview Query probability problem
See the full solution here.

Machine Learning Interview Questions

Q1: You're working on a job recommendation engine. You have access to all user LinkedIn profiles, a list of jobs each user applied to, and answers to questions that the user filled in about their job search.

Using this information, how would you build a job recommendation feed?

For this problem, we have to understand what our dataset consists of before being able to build a model for recommendations. More importantly, we need to understand what a recommendation feed might look like for the user.

For example, we expect that the user could go to a tab or open the mobile app and view a list of recommended jobs, sorted with the most strongly recommended at the top.

We can use either an unsupervised or a supervised model. For an unsupervised approach, we could run a nearest neighbors or collaborative filtering algorithm on features from users and jobs. But if we want more accuracy, we would likely go with a supervised classification model.

If we use a supervised model, we need to frame our training dataset as features and an output label (whether or not the user applied to the job).

The expected result is that, for each user, we will have user feature data in the form of their profile and user activity data extracted from the questions they have answered. Additionally, we'll have all of the jobs the user applied to. What we're missing is data on jobs the user did not apply to.
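
As a rough sketch of how that labeled training set could be assembled, using hypothetical tables `job_impressions` (jobs shown to a user) and `job_applications` (jobs the user applied to) to stand in for whatever data is actually available:

-- Label = 1 if the user applied to the job they were shown, 0 otherwise
SELECT
    i.user_id,
    i.job_id,
    CASE WHEN a.job_id IS NOT NULL THEN 1 ELSE 0 END AS applied
FROM job_impressions AS i
LEFT JOIN job_applications AS a
    ON a.user_id = i.user_id
   AND a.job_id = i.job_id;

User profile and job features would then be joined onto these (user_id, job_id) pairs before training the classifier.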

Job Recommendation — Interview Query machine learning problem