Google Data Scientist Interview Questions + Guide in 2024

Written by IQ Team

Published March 11, 2024

Estimated reading time: 23 minutes

Table of contents

Introduction

The Data Science Role at Google

The Google Interview Process

What Gets Asked in Google Data Science Interviews?

Google Behavioral Interview Questions

Google Machine Learning Interview Questions

Google Statistics and Probability Questions

Google Product and Business Case Questions

Google Coding Interview Questions

Google Data Scientist Salary

Google Data Scientist Jobs

Introduction

Based on 2019 statistics, Google processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide. To Google, this presents endless opportunity to help its customers grow and scale, and to data scientists, this presents a treasure trove of information for analysis and interpretation.

Google data science interview questions focus primarily on statistics, algorithms, machine learning, and probability. Google’s interview process is designed to assess your ability to perform analysis on large datasets and generate data-driven insights.

The Data Science Role at Google

Data scientists at Google work across a wide range of teams, products, and features, from enhancing advertising efficacy to network infrastructure optimization.

The Google data science role is primarily an analytics role that is focused on metrics and experimentation. This is distinctly different from the machine learning and product analyst roles at Google, which focus more on the engineering or product sides, respectively. The data science role at Google used to be called quantitative analyst before it was renamed to data scientist to attract more talent.

Required Skills

Google typically hires individuals with two to three years of industry experience in analytics or related fields. Google does have programs for internships and university graduates in data science, and specifically has more advanced roles for new PhD graduates.

Other relevant qualifications include:

Masters or PhD in Statistics, Computer Science, Bioinformatics, Computational Biology, Engineering, Physics, Applied Mathematics, Economics, Operations Research or related quantitative discipline, or equivalent practical experience.
Advanced experience in statistical software (e.g., MATLAB, pandas, Colab, S-Plus, SAS), programming languages like Python, R, C++, and/or Java, and advanced experience in database languages (e.g., SQL) and management systems.
Experience with big data and cloud platforms to deploy large-scale data science solutions.
Experience in data science with a focus on business analytics, designing and building statistical models, visualization, machine learning, digital attribution, forecasting, optimization, and predictive analytics.
Knowledge of advanced statistical concepts and applied experience with machine learning on large datasets.
Demonstrated problem-framing, problem-solving, project management, and people management skills.

Our data shows that Google tests heavily on Statistics & A/B Testing and Machine Learning during their data scientist interview process.

The Google Interview Process

The data scientist interview process at Google is standardized and is similar to that of many other tech companies. The process includes:

1. Initial Screen

The initial screen is a 30-minute phone interview with a recruiter. During this call, you’ll discuss the job and what it’s like to work at Google. The recruiter will also want to learn about your skills, professional experiences, career goals, and, most importantly, if you’re the right fit for Google’s culture.

2. Technical Screen

Google’s data scientist technical screen is video-based (Google Hangouts). You’ll meet with a data scientist and focus on experimental design, statistics, and a probabilistic coding question.

Technical screens also include discussions about your past research and work experience. Be prepared for questions about business issues you’ve faced and your approach to solving them.

3. Onsite Interview

The onsite interview for Google data scientist positions includes five one-on-one rounds with a data scientist. These rounds cover computational statistics, probability, product interpretation, metrics and experimentation, modeling, and behavioral questions.

Each interview lasts for approximately 45 minutes, and there’s a lunch break in between.

Quick Tips for Google Data Science Interviews

You should plan to brush up on any technical skills and try as many practice interview questions and mock interviews as possible. A few tips for acing your Google interview include:

Know Your Google Products: Google questions are standardized and rely heavily on situational scenarios with their products. Study Google’s large breadth of products and understand how you would personally improve or test them.
Be Data Driven: Google’s data science interviews assess how well you can provide business-driving insights with data science. Brush up on your knowledge of statistics and probability, given these questions can be some of the hardest to solve.
Embody the Spirit: Google at its core has a collaborative, employee-focused culture that values innovation. Practice responding to behavioral questions with answers that touch on Google’s core values.

What Gets Asked in Google Data Science Interviews?

Google data science interview questions include both behavioral and technical problems, covering a range of topics. Most Google behavioral questions, for instance, are centered around how well the candidate fits in with Google’s work culture. On the other hand, Google’s technical data science questions span multiple areas, including statistics, machine learning, coding and product sense.

Onsite interviews for Google data science positions are demanding. The panel interview typically consists of five 45-minute interviews with various teams, and you’ll be assessed on your data science knowledge in a range of areas. The most common Google data science interview question topics include:

Behavioral Questions - Behavioral questions are designed to assess your “Googlyness,” e.g. how well you work with others, how you can navigate workplace ambiguity, and how well you can work under pressure.
Statistics & Probability - A strong emphasis is placed on statistics and probability questions in Google interviews. These questions assess your ability to explain complex statistical terms and perform statistical coding.
Product Sense - As a data scientist at Google, you’ll be tasked with using your technical knowledge to solve product and business problems. These questions assess your ability to generate insights that can be used to improve Google products.
Coding - Google data scientists use coding every day to mine datasets and generate insights. As such, you’ll likely be asked SQL, data analysis and Python coding questions.
Machine Learning - Google’s data science teams use machine learning principles to build and improve algorithms. Google’s machine learning questions will assess your machine learning knowledge as it applies to algorithms.

Google Behavioral Interview Questions

Behavioral interview questions usually occur during the recruiter screen and throughout the onsite Google interview. These types of questions are designed to assess your ability to think on your feet, whether you are the right culture fit for Google, and your ability to communicate ideas.

In particular, your answers should touch on what Google looks for in data science candidates:

First is general cognitive ability, which screens for how well candidates learn and adapt to new situations.
Second is role-related knowledge, which is based on the background, skill sets, and experience that are specific and relevant to the role.
Third is leadership. Google’s core culture is about building a team of high-performing individuals who are great team players and can one day step into leadership roles.
Fourth and last is Googlyness, which helps ensure candidates succeed in their roles. Google assesses “comfort with ambiguity,” “bias to action,” and a “collaborative nature.”

1. Why Google?

Provide concrete examples of what interests you in a job at Google. You might talk about your love for Google’s data science culture or how the company encourages employees to continuously learn and expand their skills.

2. Describe a past data science project you worked on.

Hint: Tell the interviewer why the project was successful. Provide any metrics and positive change you were able to bring about.

3. How do you prioritize tasks when working on many different projects?

4. What career goals do you have? How do you plan to achieve them?

5. Do you have a favorite Google product? What do you love about it?

Again, researching Google’s products before the interview is an absolute must. Be sure you can talk confidently about the majority of the company’s product offerings, but also have 2-3 products that you know in depth.

6. Describe a time when a project you were working on wasn’t successful. What did you learn?

Hint: Questions like these can be intimidating. Don’t be afraid to be honest. But also explain how you apply what you have learned as you approach a new project.

Google Machine Learning Interview Questions

The types of machine learning questions asked in Google data science interviews range from basic definition-based questions about regression models or feature selection, to advanced algorithm questions.

Looking for machine learning resources? Check out Interview Query’s Modeling & Machine Learning and Machine Learning Systems Design courses.

1. What is the difference between K-means and EM?

2. Let’s say you have a categorical variable with thousands of distinct values. How would you encode it?

Hint: Does this depend on whether the problem is asking about a regression or a classification model?

Say it’s a regression model. One way we could tackle this problem would be to work backwards from the response variable: group the category values that have similar average responses into a smaller number of clusters, then encode the clusters.

3. What is the function of p-values in high dimensional linear regression?

4. How would you build the recommendation algorithm for type-ahead search for a media company like Netflix?

Hint: We can begin to think of the solution in the form of a prefix table. In a prefix table, the prefix (your input string) maps to an output string, one suggestion at a time to start with. For an MVP, we could input a string and output a suggestion string, with fuzzy matching and context matching added on top.
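As a rough illustration of the prefix-table idea, here is a minimal sketch. The function names, the `max_suggestions` limit, and the tiny title list are all hypothetical, and fuzzy and context matching are left out.

```python
from collections import defaultdict

def build_prefix_table(titles, max_suggestions=3):
    """Map every lowercase prefix of each title to a short list of suggestions."""
    table = defaultdict(list)
    for title in titles:
        key = title.lower()
        for end in range(1, len(key) + 1):
            bucket = table[key[:end]]
            if len(bucket) < max_suggestions:
                bucket.append(title)
    return table

def suggest(table, typed):
    """Return the stored suggestions for the typed string (empty list if none)."""
    return table.get(typed.lower(), [])

# Hypothetical catalog, used only to exercise the sketch.
titles = ["Stranger Things", "Stranger", "The Crown"]
table = build_prefix_table(titles)
print(suggest(table, "str"))
```

In practice the order of suggestions would come from a ranking signal such as popularity; here it is simply insertion order.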

5. Given two strings A and B, return whether or not A can be shifted some number of times to get B.

Example:

A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True

A = 'abc'
B = 'acb'
can_shift(A, B) == False

Hint: This problem is relatively simple if we figure out the underlying algorithm that allows us to easily check for string shifts between strings A and B.
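One possible approach, sketched below, relies on the observation that every rotation of A appears as a substring of A concatenated with itself. This is one way to solve it, not necessarily the expected answer.

```python
def can_shift(a: str, b: str) -> bool:
    """Return True if rotating `a` some number of times can produce `b`."""
    # Lengths must match; every rotation of `a` is a substring of `a + a`.
    return len(a) == len(b) and b in a + a

print(can_shift("abcde", "cdeab"))
print(can_shift("abc", "acb"))
```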

Google Statistics and Probability Questions

Statistics and probability are a core focus in Google onsite panel interviews. To best prepare, make sure you have a strong grasp of statistical concepts and know how to perform statistical coding in Python.

1. Let’s say we have a sample size of N. The margin of error for our sample size is 3. How many more samples would we need to decrease the margin of error to 0.3?

Hint: In order to decrease our margin of error, we’ll probably have to increase our sample size. But by how much?
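A short worked derivation may help here. It assumes the usual margin-of-error formula for a sample mean, with the critical value z and the standard deviation σ held fixed. The symbols E₁, E₂, n₁, and n₂ are introduced for illustration only.

```latex
\begin{aligned}
E &= z \cdot \frac{\sigma}{\sqrt{n}} \\
\frac{E_1}{E_2} &= \sqrt{\frac{n_2}{n_1}} = \frac{3}{0.3} = 10 \\
n_2 &= 100\,n_1 = 100N \quad\Rightarrow\quad n_2 - N = 99N \text{ additional samples}
\end{aligned}
```

Under these assumptions, cutting the margin of error by a factor of 10 takes 100 times the original sample size.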

2. What is the assumption of error in linear regression?

3. Let’s say we use people to rate ads. There are two types of raters, and from our point of view each rater is random and independent of the others.

80% of raters are careful and they rate an ad as good (60% chance) or bad (40% chance).
20% of raters are lazy and they rate every ad as good (100% chance).

1. Suppose we have 100 raters each rating one ad independently. What’s the expected number of good ads?

2. Now suppose we have 1 rater rating 100 ads. What’s the expected number of good ads?

3. Suppose we have 1 ad, rated as bad. What’s the probability the rater was lazy?

Hint: Keep in mind that in order for the rater to rate an ad, the rater must first be selected. So the event that the rater is selected happens first, then the rating happens. How would you represent this fact arithmetically using basic properties of probability?
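The arithmetic behind the three parts can be sketched as follows. This assumes "expected number of good ads" means the expected number of ads rated good, and the variable names are illustrative.

```python
from fractions import Fraction

# Probabilities taken from the question statement.
P_CAREFUL, P_LAZY = Fraction(8, 10), Fraction(2, 10)
P_GOOD_GIVEN_CAREFUL, P_GOOD_GIVEN_LAZY = Fraction(6, 10), Fraction(1)

# Law of total probability: chance that a randomly selected rater says "good".
p_good = P_CAREFUL * P_GOOD_GIVEN_CAREFUL + P_LAZY * P_GOOD_GIVEN_LAZY

# Part 1: 100 independent raters, one ad each.
expected_good_100_raters = 100 * p_good

# Part 2: one rater, 100 ads. Averaging over rater type gives the same mean
# (0.8 * 60 + 0.2 * 100), although the spread of outcomes is different.
expected_good_1_rater = P_CAREFUL * 100 * P_GOOD_GIVEN_CAREFUL + P_LAZY * 100 * P_GOOD_GIVEN_LAZY

# Part 3: Bayes' rule. A lazy rater never rates an ad as bad.
p_bad = 1 - p_good
p_lazy_given_bad = P_LAZY * (1 - P_GOOD_GIVEN_LAZY) / p_bad

print(expected_good_100_raters, expected_good_1_rater, p_lazy_given_bad)
```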

4. What are the assumptions of error in linear regression?

There are several assumptions of linear regression. These assumptions are baked into the dataset and how the model is built. If these assumptions are violated, we fall victim to “garbage in, garbage out.”

5. Explain how a probability distribution could be not normal and give an example scenario.

Hint: Think about things that generally have a normal distribution. Are there other things that we might want to measure that might not be similar to those things? Normal distributions generally measure things like size, mass, content, but what about measures like time, random-number generators, or likelihood?

6. You flip a fair coin 576 times. Without using a calculator, calculate the probability of flipping at least 312 heads.

Hint: What sort of probability distribution should we use to model experiments with only two outcomes?
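In the interview you would do this by hand: the count of heads is binomial with mean 288 and standard deviation 12, so 312 sits two standard deviations above the mean. The sketch below checks that reasoning numerically with the standard library, comparing a plain normal approximation (no continuity correction, an assumption of this sketch) against the exact binomial tail.

```python
from math import comb, erfc, sqrt

n, k = 576, 312
mean = n * 0.5          # 288 expected heads
sd = sqrt(n * 0.25)     # sqrt(144) = 12
z = (k - mean) / sd     # number of standard deviations above the mean

# Upper-tail probability of a standard normal at z.
normal_tail = 0.5 * erfc(z / sqrt(2))

# Exact binomial tail P(X >= 312) for a fair coin.
exact_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n

print(z, normal_tail, exact_tail)
```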

7. What is the difference between parametric and non-parametric testing?

8. Let’s say you have a function that outputs a random integer between a minimum value, N, and maximum value, M.

Now let’s say we take the output from the random integer function and place it into another random function as the max value with the same min value N.

What would the distribution of the samples look like?
What would be the expected value?
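To build intuition, the exact distribution can be computed for small values. The sketch below assumes both random functions are uniform and include both endpoints, which the question does not state. The function name `nested_uniform_pmf` is hypothetical.

```python
from fractions import Fraction

def nested_uniform_pmf(n, m):
    """Exact PMF of Y, where X ~ Uniform{n..m} and Y ~ Uniform{n..X} (inclusive)."""
    pmf = {y: Fraction(0) for y in range(n, m + 1)}
    p_x = Fraction(1, m - n + 1)
    for x in range(n, m + 1):
        for y in range(n, x + 1):
            pmf[y] += p_x * Fraction(1, x - n + 1)
    return pmf

pmf = nested_uniform_pmf(1, 4)
expected = sum(y * p for y, p in pmf.items())
print(pmf, expected)
```

Under these assumptions the probability mass piles up near N and thins out toward M, and the expected value works out to (3N + M) / 4, since E[Y | X] = (N + X) / 2 and E[X] = (N + M) / 2.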

9. You have a deck and you take one card at random and guess what the card is. What is the probability you guess right?

Google Product and Business Case Questions

Google onsite interviews typically include business and product case study questions. To prepare for case interviews, practice product or business metrics questions, and be prepared to propose solutions, analyze the success of a feature, and measure results.

1. How would you detect inappropriate content on YouTube?

2. You are a data scientist at YouTube focused on creators. A PM comes to you worried that amateur video creators could do well before but now it seems like only “superstars” do well.

What data points and metrics would you look at to decide if this is true or not?

Hint: With questions like these, try to rephrase it as a hypothesis. What hypothesis could you draw from the information provided?

3. How would you investigate a 10% drop in usage on Google Docs?

Hint: The first step in product case questions is to clarify the question. With this example, you would want some clarity on the type of drop (e.g., time on page, storage, etc.), as well as the timeframe for the usage drop.

4. Let’s say we’re given a dataset of page views where each row represents one page view.

How would you differentiate between scrapers and real people?

Hint: Modeling-based theoretical questions are meant to assess whether you can make realistic assumptions and come up with a solution under these assumptions.

5. How do you test if a new feature has increased engagement in Google’s ecosystem?

6. If the outcome of an experiment results in one group clicking 5% more than the other, is that a good result?

Hint: Always ask for clarity. With a question like this, we’d need more information to answer.

7. Let’s say that your company is running a standard control and variant A/B test on a feature to increase conversion rates on the landing page. The PM checks the results and finds a p-value of 0.04.

How would you assess the validity of the result?

Hint: What is the interviewer leaving out, and how might we rephrase the question for clarity? We could likely re-phrase the question to: How do you set up and measure an AB test correctly?

9. Given three random variables independent and identically distributed from a uniform distribution of 0 to 4, what is the probability that the median is greater than 3?

If we break down this question, we’ll find that another way to phrase it is to ask what the probability is that at least two of the variables are larger than 3. If we look at the combinations of events that satisfy the condition, they can be divided into two mutually exclusive events.

Event A: All three random variables are larger than 3.
Event B: One random variable is smaller than 3 and two are larger than 3.

Given these two events satisfy the condition of the median > 3 and are mutually exclusive, we can calculate the probability of each event and add them. The question can now be rephrased as P(Median > 3) = P(A) + P(B).

Let’s calculate the probability of event A. The probability that a single random variable is greater than 3 (and at most 4) is ¹⁄₄. So the probability of event A is:

P(A) = (¹⁄₄) * (¹⁄₄) * (¹⁄₄) = ¹⁄₆₄

Event B requires two values to be greater than 3 and one value to be smaller than 3. We can calculate this the same way as the probability of A. The probability of a value being greater than 3 is ¹⁄₄ and the probability of a value being less than 3 is ³⁄₄. Any one of the three variables can be the smaller one, so there are three such arrangements and we multiply by 3.

P(B) = 3 * ((³⁄₄) * (¹⁄₄) * (¹⁄₄)) = ⁹⁄₆₄

Therefore the total probability is P(A) + P(B) = ¹⁄₆₄ + ⁹⁄₆₄ = ¹⁰⁄₆₄ = ⁵⁄₃₂
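The result can be double-checked in a few lines. This sketch computes the same sum exactly and then compares it against a seeded Monte Carlo estimate. The trial count and seed are arbitrary choices.

```python
import random
from fractions import Fraction
from math import comb
from statistics import median

# Exact: at least two of the three variables exceed 3, each with probability 1/4.
p_above = Fraction(1, 4)
exact = sum(comb(3, k) * p_above**k * (1 - p_above) ** (3 - k) for k in (2, 3))

# Monte Carlo check with a fixed seed for repeatability.
rng = random.Random(0)
trials = 100_000
hits = sum(median(rng.uniform(0, 4) for _ in range(3)) > 3 for _ in range(trials))
estimate = hits / trials

print(exact, float(exact), estimate)
```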

Check out the Interview Query Statistics course for more practice with statistical concepts and coding.

Google Coding Interview Questions

At Google, data scientists work with vast datasets and are tasked with using code to generate insights and solutions. Typically, statistical coding (with a tool like Python), SQL queries and algorithmic coding are all covered in Google interviews for data science positions.

1. Write a function to generate N samples from a normal distribution and plot the histogram.

Hint: This is a relatively simple problem because we have to set up our distribution and then generate N samples from it which are then plotted. In this question, we make use of the SciPy library which is a library made for scientific computing.
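The hint points to SciPy. The sketch below deliberately swaps in Python's standard library so it runs without extra installs, and prints a text histogram in place of a plot. The function names, bin width, and bar scaling are illustrative assumptions.

```python
import math
import random
from collections import Counter

def sample_normal(n, mean=0.0, sd=1.0, seed=None):
    """Draw n samples from Normal(mean, sd) using the standard library."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(n)]

def text_histogram(samples, bin_width=0.5, max_bar=40):
    """Return one text line per bin; bar length is proportional to the bin count."""
    counts = Counter(math.floor(x / bin_width) for x in samples)
    tallest = max(counts.values())
    lines = []
    for b in range(min(counts), max(counts) + 1):
        bar = "#" * round(max_bar * counts.get(b, 0) / tallest)
        lines.append(f"{b * bin_width:6.1f} | {bar}")
    return lines

samples = sample_normal(1000, seed=0)
print("\n".join(text_histogram(samples)))
```

With SciPy and Matplotlib available, the same two steps would typically be a call that draws the samples followed by a histogram plot.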

2. You’re given two dataframes. One contains information about addresses and the other contains relationships between various cities and states. Write a function to create a single dataframe with complete addresses in the format of street, city, state, zip code.

Tip: Follow the link to find the relevant data on Interview Query for this question.

3. You are given the layout of a rectangular building with rooms forming a grid. Each room has four doors leading to the rooms to the north, east, south, and west, where exactly one door is unlocked and the other three doors are locked. In each time step, you can move to an adjacent room via an unlocked door.

Your task is to determine the minimum number of time steps required to get from the northwest corner to the southeast corner of the building.

Note: If the path doesn’t lead you to the exit, return -1.

The input is given as:

a non-empty 2d-array of letters 'N', 'E', 'S', 'W' named ‘building’
‘building[0][0]’ represents the open door at the northwest corner.
The rows of this array are associated with the north-south direction.
The columns are associated with the east-west direction.

Example:

E	E	S	W
N	W	S	N
S	E	E	S
E	N	W	W

Expected Output: 6
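Read literally, each room has exactly one usable door, so the route from the start is forced. Under that reading, a minimal sketch just follows the doors until it reaches the southeast corner, leaves the grid, or revisits a room. The function name `min_steps` and the one-directional treatment of doors are assumptions of this sketch.

```python
def min_steps(building):
    """Follow the single unlocked door in each room from the NW to the SE corner."""
    moves = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
    rows, cols = len(building), len(building[0])
    r = c = steps = 0
    seen = set()
    while (r, c) != (rows - 1, cols - 1):
        if (r, c) in seen:
            return -1  # revisiting a room means the path loops forever
        seen.add((r, c))
        dr, dc = moves[building[r][c]]
        r, c = r + dr, c + dc
        if not (0 <= r < rows and 0 <= c < cols):
            return -1  # the unlocked door leads outside the building
        steps += 1
    return steps

example = [
    ["E", "E", "S", "W"],
    ["N", "W", "S", "N"],
    ["S", "E", "E", "S"],
    ["E", "N", "W", "W"],
]
print(min_steps(example))
```

On the example grid this follows E, E, S, S, E, S and returns 6, matching the expected output.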

4. Given a percentile threshold and N samples, write a function to simulate a truncated normal distribution.

Input:

percentile_threshold = 0.75
n = 6
truncated_dist(n, percentile_threshold)

Output:

# with mean of 2 and std deviation of 1
output = [2, 1.1, 2.2, 3, 1.5, 1.3]
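One way to approach this is rejection sampling: find the cutoff value at the given percentile, then keep drawing until enough samples fall at or below it. The sketch assumes the truncation removes the upper tail, and it takes the mean of 2 and standard deviation of 1 from the comment in the example as defaults. The `mean`, `sd`, and `seed` parameters are additions for illustration. Under this reading the cutoff is about 2.67, so the value 3 in the sample output above would be excluded; treat that output as illustrative.

```python
import random
from statistics import NormalDist

def truncated_dist(n, percentile_threshold, mean=2.0, sd=1.0, seed=None):
    """Sample n values from Normal(mean, sd), keeping only those at or below the cutoff."""
    rng = random.Random(seed)
    # Value below which `percentile_threshold` of the distribution lies.
    cutoff = NormalDist(mean, sd).inv_cdf(percentile_threshold)
    samples = []
    while len(samples) < n:
        x = rng.gauss(mean, sd)
        if x <= cutoff:  # reject draws in the truncated upper tail
            samples.append(x)
    return samples

print(truncated_dist(6, 0.75, seed=0))
```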

5. Let’s say you’re given a list of standardized test scores from high schoolers from grades 9 to 12.

Given the dataset, write code in Pandas to return the cumulative percentage of students that received scores within the buckets of <50, <75, <90, <100.
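The dataset's schema is not shown here, so the sketch below assumes a dataframe with hypothetical columns `grade` and `test_score`, and reports the cumulative percentage per grade. It is one plausible reading of the question, not a reference solution.

```python
import pandas as pd

def cumulative_score_buckets(df):
    """Percent of students per grade scoring below each threshold (cumulative)."""
    thresholds = [50, 75, 90, 100]
    out = {}
    for t in thresholds:
        below = df["test_score"] < t                 # boolean Series
        out[f"<{t}"] = below.groupby(df["grade"]).mean() * 100
    return pd.DataFrame(out).round(1)

# Tiny made-up dataset, used only to exercise the function.
scores = pd.DataFrame({
    "grade": [9, 9, 10, 10],
    "test_score": [40, 80, 95, 60],
})
result = cumulative_score_buckets(scores)
print(result)
```

Note that a score of exactly 100 would not fall in any bucket under the strict `<100` reading, which is worth clarifying with the interviewer.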

6. The schema below is for a retail online shopping company consisting of two tables, attribution and user_sessions.

The attribution table logs a session visit for each row. If conversion is true, then the user converted to buying on that session. The channel column represents which advertising platform the user was attributed to for that specific session. Lastly, the user_sessions table maps many session visits back to one user. First touch attribution is defined as the channel with which the converted user was associated when they first discovered the website.

Calculate the first touch attribution for each user_id that converted.

attribution table: