Google Data Engineer Interview Guide

Google Data Engineer Interview GuideGoogle Data Engineer Interview Guide

Overview

Data engineers at Google play an integral role in building the data systems and reporting tools the company uses to innovate and improve. People in these roles are responsible for large volumes of product, marketing, and operations data, and as such, it’s a highly demanding job.

To achieve success, candidates for Google data engineer roles must possess knowledge of the latest data trends and working knowledge of common data engineering and ETL tools. They should know data structures inside and out and have a strong working knowledge of algorithms. Algorithm questions show up in virtually all Google data engineering interviews and SQL, Python, and database design problems.

Google Data Engineer Interview Process

Typically, Google interviews vary by role and team, but commonly data engineer interviews at Google focus on a fairly standardized mix of questions:

We analyzed 10,000+ interview questions to determine the most frequently asked subjects in Google data engineer interviews.

Google Data Engineer Interview Questions

Data engineer interviews at Google are conducted in three phases: an initial coding test, a technical phone screen, and an on-site technical interview. Throughout the process, candidates are asked a variety of data engineer interview questions and are tested on a wide range of concepts and skills, including:

  • Algorithms and Data Structures
  • SQL
  • Python
  • Database Design
  • ETL

SQL is the most frequently asked topic of all these, so you can expect SQL interview questions in all three phases. One unique point to Google: Data engineer interviews test your knowledge of algorithms, so be sure you study up there.

Algorithms and Data Structures Questions

According to our research, algorithm questions are frequently asked in Google data engineer interviews - far greater than the average for all data engineer interviews. Some basic knowledge of algorithms and data structures goes a long way to ace these questions.

In particular, data engineers should understand how these concepts can be applied to efficiently transfer large amounts of data.

1. You have an array of integers of length n spanning 0 to n with one missing. Write a function missing_number that returns the missing number in the array.

Example:

nums = [0,1,2,4,5]
missing_number(nums) -> 3

Note: Complexity of O(N) required.

Hint: Mathematical formulation or logical iteration are two ways to solve this problem while holding O(N) complexity.

2. Given a list of integers, find the index at which the sum of the left half of the list is equal to the right half. If there is no index where this condition is satisfied, return -1.

Hint: The first thing to think about is what number you aim to find. The number you want to find is the sum of the entire list divided by 2. First, how would we write up an equation to add all the values?

Example 1:

nums = [1, 7, 3, 5, 6]
findIndex(nums) -> 2
example 2:

nums = [1,3,5]
findIndex(nums) -> -1

3. Given two strings A and B, write a function can_shift to return whether or not A can be shifted some number of places to get B.

Example:

A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True
A = 'abc'
B = 'acb'
can_shift(A, B) == False

Hint: This problem is relatively simple if we figure out the underlying algorithm that allows us to check for string shifts between strings A and B easily.

First off, we have to set baseline conditions for string shifting. Strings A and B must both be the same length and consist of the same letters. We can check for the former by setting a conditional statement for if A’s length is equivalent to the length of B.

4. What is Big O notation?

Big O notation is used to describe the run time of an algorithm relative to its input. This concept is used to describe how efficient an algorithm is.

SQL Questions

Data engineer interviews at Google typically start with a short online SQL and Python coding screen. This test usually tests general query writing, but data engineering SQL questions will be featured prominently throughout the technical screen and on-site interview. Your prep strategy should include plenty of SQL questions.

1. We have a table with an id and name field. The table holds over 100 million rows, and we want to sample a random row in the table without throttling the database.

Write a query to sample a row from this table randomly.

big table

Column Type
id INTEGER
name VARCHAR

We know that the RAND() function returns a floating point between 0 and 1. So if we were to run this function:

SELECT RAND()

We would get a random decimal point to some Nth degree of precision. RAND() essentially allows us to seed a random value. How can we use this to select a random row quickly?

2. Write a query to identify the manager with the biggest team size.

Note: You may assume there is only one manager with the largest team size.

employees table

Column Type
id INTEGER
first_name STRING
last_name STRING
salary INTEGER
department_id INTEGER
manager_id INTEGER

managers table

Column Type
id INTEGER
name STRING
team STRING

Output:

Column Type
manager STRING
team_size INTEGER

3. Given three tables: user_dimension, account_dimension, and download_facts, find the average number of downloads for free vs. paying customers broken out by day.

Note: The account_dimension table maps users to multiple accounts where they could be paying customers or not. Also, round average_downloads to 2 decimal places. user_dimensions table:

Column Type
user_id INTEGER
account_id INTEGER

account_dimensions table

Column Type
account_id INTEGER
paying_customer BOOLEAN

download_facts table:

Column Type
date DATE
user_id INTEGER
downloads INTEGER

Hint: Let’s first break it down. What values in which tables can we join together to get the data we need?

4. You’re given a product table with name, SKU, and price. How would you write a query to show the highest-priced item?

Hint: A lot of SQL questions will be theoretical that test your ability to write simple queries like this.

Python Questions

Like SQL, your programming skills will be tested on the initial screen, but be prepared for data engineering python interview questions on the tech screen and on-site as well. These questions are designed to test your baseline knowledge of Python and provide interviewers with an idea of how you write Python code.

The key with this type of question is practice because the more you write Python functions, the more it becomes second nature.

1. When might you use the NumPy library vs. pandas?

NumPy is a useful library for processing arrays of numbers, while pandas is best for statistics and machine learning data processing, which is why it’s widely used in data science.

Hint: With a question like this, be prepared to talk about the pros and cons of a library, and situations in which you would use that particular library.

2. Given a list of integers, find all combinations that equal the value N.

integers = [2,3,5], target = 8,
output = [
[2,2,2,2],
[2,3,3],
[3,5]
]

Hint: You may notice in solving this problem that it breaks down into identical subproblems. For example, if given integers = [2,3,5] and target = 8 as in the prompt, we might recognize that if we first solve for the input: integers = [2, 3, 5] and target = 8 - Q2 = 6, we can just add 2 to the output to obtain our final answer. This is a key idea in using recursion.

3. Given a string, write a function recurring_char to find its first recurring character. Return None if there is no recurring character.

Note: Treat upper and lower case letters as distinct characters. You may assume the input string includes no spaces.

Example:

input = "interviewquery"
output = "i"

input = "interv"
output = "None"

We know we have to store a unique set of characters of the input string and loop through the string to check which ones occur twice.

Given that we have to return the first index of the second repeating character, we should be able to go through the string in one loop, save each unique character, and then just check if the character exists in that saved set. If it does, return the character.

Database Design Questions

Data engineers must have a strong grasp of data modeling and database design. That’s a central part of the position, and these types of questions are commonly asked in interviews.

In general, database questions tend to be hypothetical. For example, you might be provided with an application and be asked to design the schema and conduct queries to return business metrics.

1. How would you create a schema to represent client click data on the web?

These types of questions are more architecture-based and are generally given to test experience within developing databases, setting up architectures, and in this case, representing client sidetracking in the form of clicks.

What exactly does click data on the web mean? Any form of button clicks, scrolls, or acts as an interaction with the client interface, in this case, desktop, would be somehow represented into a schema form for the end-user to query. This does not include client views, however.

A simple but effective design schema would be to first represent each action with a specific label. In this case, it is assigning each click event a name or label describing its specific action.

2. Let’s say you have a table with a billion rows.

How would you add a column inserting data from the source without affecting the user experience?

Hint: A question like this tests your problem-solving approach. The question is vague on purpose and should prompt some questions that you might have. Always get clarifying information when a question is overly vague.

3. Design a database to represent a Tinder-style dating app.

What does the schema look like, and what are some optimizations that you think we might need?

Let’s first approach this problem by understanding the scope of the dating app and what functionality we have to design around.

If we were to list the key Tinder app capabilities, it would be something like this:

  • Onboarding - User opens up, adds preferences, adds pictures, and starts swiping on users.
  • Matching - If the user matches with another user, we notify them and create a messaging system.
  • Messaging - Users can message each other only if both have been matched. Users can also leave conversations at any time.

Next, we should figure out if we have specific feature goals that we have to account for within engineering and system design.

4. You are given a table of measurement values from a sensor. Each measurement can happen multiple times each day. Write a query to output the sum of values for every odd measurement and the sum of values for every even measurement by date.

measurements table

Column Type
id INTEGER
value FLOAT
time DATETIME

5. Describe the process of logical design for a database.

Logical design is the process of arranging data in a series of logical relationships called entities and attributes. Entities are larger chunks of data, while attributes are smaller components that help to define the entity. In relational databases, entities usually map to a table.

ETL Questions

ETL - the process of extracting, transforming, and loading data - is a key responsibility for data engineers. Data engineers must design an ETL process that’s efficient and easy to manage. The types of questions you ask will be hypotheticals designed to understand your approach to developing efficient ETL processes.

1. Let’s say we have a table representing a company payroll schema.

Due to an ETL error, the employees table, instead of updating the salaries every year when doing compensation adjustments, did an insert instead. The head of HR still needs the current salary of each employee.

Write a query to get the current salary for each employee.

Assume no same combination of first and last names. (I.E. No two John Smiths)

employees table

Column Type
id INTEGER
first_name STRING
last_name STRING
salary INTEGER
department_id INTEGER

Output:

Column Types
first_name STRING
last_name STRING
salary INTEGER

2.We have a table called song_plays that tracks each user’s playing a song.

Let’s say we want to create an aggregate table called lifetime_plays that records each user’s song count by date.

Write a SQL query that could make this ETL each day.

song_plays table

Column Type
id INTEGER
created_at DATETIME
user_id INTEGER
song_id INTEGER

Hint: We use the INSERT INTO keywords to add rows into the lifetime_plays table for this problem. Setting this query to run daily becomes a daily extract, transform, and load (ETL) process.

One thing to note: Google’s data engineering interviews aren’t just technically demanding; they’re often business-focused. So before your interview, familiarize yourself with Google’s products and try to learn the types of tools and processes they use. This can help you understand the types of data you might be presented with.

Overall, make algorithms and SQL a study focus in your interview prep and mock interviews. Querying and algorithm questions are asked quite frequently, and you’ll be in the best position for these interviews if you’re confident in those two skills.

See more Google Data Engineer interview questions below:

Question
Topics
Difficulty
Ask Chance
Python
R
Algorithms
Easy
Very High
SQL
Easy
Medium
ML System Design
Hard
Medium

This feature requires a user account

Sign up to get your personalized learning path.

feature

Access 600+ data science interview questions

feature

1600+ top companies interview guide

feature

Unlimited code runs and submissions


View all Google Data Engineer questions

One thing to note: Google’s data engineering interviews aren’t just technically demanding; they’re often business-focused. So before your interview, familiarize yourself with Google’s products and try to learn the types of tools and processes they use. This can help you understand the types of data you might be presented with.

Overall, make algorithms and SQL a study focus in your interview prep and mock interviews. Querying and algorithm questions are asked quite frequently, and you’ll be in the best position for these interviews if you’re confident in those two skills.

Google Data Engineer Salary

$148,185

Average Base Salary

$194,015

Average Total Compensation

Min: $101K
Max: $190K
Base Salary
Median: $144K
Mean (Average): $148K
Data points: 87
Min: $3K
Max: $369K
Total Compensation
Median: $195K
Mean (Average): $194K
Data points: 63

View the full Data Engineer at Google salary guide

Google Data Engineer Jobs

👉 Reach 100K+ data scientists and engineers on the #1 data science job board.
Submit a Job
Senior Data Scientist Manager
Software Engineering Manager Ii Google Cloud Compute Infrastructure
Senior Software Engineer Android
Lead Group Product Manager Google Home
Product Manager Google Product Studio
Staff Data Scientist Research
Senior Software Engineer Full Stack Google Cloud Data Management
Senior Product Manager Gmail
Lead Group Outbound Product Manager Google Cloud
Data Scientist Ii Product