Data engineers at Google play an integral role in building the data systems and reporting tools the company uses to innovate and improve. People in these roles are responsible for large volumes of product, marketing, and operations data, and as such, it’s a highly demanding job.
To achieve success, candidates for Google data engineer roles must possess knowledge of the latest data trends and working knowledge of common data engineering and ETL tools. They should know data structures inside and out and have a strong working knowledge of algorithms. Algorithm questions show up in virtually all Google data engineering interviews and SQL, Python, and database design problems.
Typically, Google interviews vary by role and team, but commonly data engineer interviews at Google focus on a fairly standardized mix of questions:
We analyzed 10,000+ interview questions to determine the most frequently asked subjects in Google data engineer interviews.
Data engineer interviews at Google are conducted in three phases: an initial coding test, a technical phone screen, and an on-site technical interview. Throughout the process, candidates are asked a variety of data engineer interview questions and are tested on a wide range of concepts and skills, including:
SQL is the most frequently asked topic of all these, so you can expect SQL interview questions in all three phases. One unique point to Google: Data engineer interviews test your knowledge of algorithms, so be sure you study up there.
According to our research, algorithm questions are frequently asked in Google data engineer interviews - far greater than the average for all data engineer interviews. Some basic knowledge of algorithms and data structures goes a long way to ace these questions.
In particular, data engineers should understand how these concepts can be applied to efficiently transfer large amounts of data.
Example:
nums = [0,1,2,4,5]
missing_number(nums) -> 3
Note: Complexity of O(N) required.
Hint: Mathematical formulation or logical iteration are two ways to solve this problem while holding O(N) complexity.
Hint: The first thing to think about is what number you aim to find. The number you want to find is the sum of the entire list divided by 2. First, how would we write up an equation to add all the values?
Example 1:
nums = [1, 7, 3, 5, 6]
findIndex(nums) -> 2
example 2:
nums = [1,3,5]
findIndex(nums) -> -1
Example:
A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True
A = 'abc'
B = 'acb'
can_shift(A, B) == False
Hint: This problem is relatively simple if we figure out the underlying algorithm that allows us to check for string shifts between strings A and B easily.
First off, we have to set baseline conditions for string shifting. Strings A and B must both be the same length and consist of the same letters. We can check for the former by setting a conditional statement for if A’s length is equivalent to the length of B.
Big O notation is used to describe the run time of an algorithm relative to its input. This concept is used to describe how efficient an algorithm is.
Data engineer interviews at Google typically start with a short online SQL and Python coding screen. This test usually tests general query writing, but data engineering SQL questions will be featured prominently throughout the technical screen and on-site interview. Your prep strategy should include plenty of SQL questions.
Write a query to sample a row from this table randomly.
big
table
Column | Type |
---|---|
id |
INTEGER |
name |
VARCHAR |
We know that the RAND() function returns a floating point between 0 and 1. So if we were to run this function:
SELECT RAND()
We would get a random decimal point to some Nth degree of precision. RAND() essentially allows us to seed a random value. How can we use this to select a random row quickly?
Note: You may assume there is only one manager with the largest team size.
employees
table
Column | Type |
---|---|
id |
INTEGER |
first_name |
STRING |
last_name |
STRING |
salary |
INTEGER |
department_id |
INTEGER |
manager_id |
INTEGER |
managers
table
Column | Type |
---|---|
id |
INTEGER |
name |
STRING |
team |
STRING |
Output:
Column | Type |
---|---|
manager |
STRING |
team_size |
INTEGER |
Note: The account_dimension table maps users to multiple accounts where they could be paying customers or not. Also, round average_downloads to 2 decimal places. user_dimensions table:
Column | Type |
---|---|
user_id |
INTEGER |
account_id |
INTEGER |
account_dimensions
table
Column | Type |
---|---|
account_id |
INTEGER |
paying_customer |
BOOLEAN |
download_facts table:
Column | Type |
---|---|
date |
DATE |
user_id |
INTEGER |
downloads |
INTEGER |
Hint: Let’s first break it down. What values in which tables can we join together to get the data we need?
Hint: A lot of SQL questions will be theoretical that test your ability to write simple queries like this.
Like SQL, your programming skills will be tested on the initial screen, but be prepared for data engineering python interview questions on the tech screen and on-site as well. These questions are designed to test your baseline knowledge of Python and provide interviewers with an idea of how you write Python code.
The key with this type of question is practice because the more you write Python functions, the more it becomes second nature.
NumPy is a useful library for processing arrays of numbers, while pandas is best for statistics and machine learning data processing, which is why it’s widely used in data science.
Hint: With a question like this, be prepared to talk about the pros and cons of a library, and situations in which you would use that particular library.
integers = [2,3,5], target = 8,
output = [
[2,2,2,2],
[2,3,3],
[3,5]
]
Hint: You may notice in solving this problem that it breaks down into identical subproblems. For example, if given integers = [2,3,5] and target = 8 as in the prompt, we might recognize that if we first solve for the input: integers = [2, 3, 5] and target = 8 - Q2 = 6, we can just add 2 to the output to obtain our final answer. This is a key idea in using recursion.
Note: Treat upper and lower case letters as distinct characters. You may assume the input string includes no spaces.
Example:
input = "interviewquery"
output = "i"
input = "interv"
output = "None"
We know we have to store a unique set of characters of the input string and loop through the string to check which ones occur twice.
Given that we have to return the first index of the second repeating character, we should be able to go through the string in one loop, save each unique character, and then just check if the character exists in that saved set. If it does, return the character.
Data engineers must have a strong grasp of data modeling and database design. That’s a central part of the position, and these types of questions are commonly asked in interviews.
In general, database questions tend to be hypothetical. For example, you might be provided with an application and be asked to design the schema and conduct queries to return business metrics.
These types of questions are more architecture-based and are generally given to test experience within developing databases, setting up architectures, and in this case, representing client sidetracking in the form of clicks.
What exactly does click data on the web mean? Any form of button clicks, scrolls, or acts as an interaction with the client interface, in this case, desktop, would be somehow represented into a schema form for the end-user to query. This does not include client views, however.
A simple but effective design schema would be to first represent each action with a specific label. In this case, it is assigning each click event a name or label describing its specific action.
How would you add a column inserting data from the source without affecting the user experience?
Hint: A question like this tests your problem-solving approach. The question is vague on purpose and should prompt some questions that you might have. Always get clarifying information when a question is overly vague.
What does the schema look like, and what are some optimizations that you think we might need?
Let’s first approach this problem by understanding the scope of the dating app and what functionality we have to design around.
If we were to list the key Tinder app capabilities, it would be something like this:
Next, we should figure out if we have specific feature goals that we have to account for within engineering and system design.
measurements
table
Column | Type |
---|---|
id |
INTEGER |
value |
FLOAT |
time |
DATETIME |
Logical design is the process of arranging data in a series of logical relationships called entities and attributes. Entities are larger chunks of data, while attributes are smaller components that help to define the entity. In relational databases, entities usually map to a table.
ETL - the process of extracting, transforming, and loading data - is a key responsibility for data engineers. Data engineers must design an ETL process that’s efficient and easy to manage. The types of questions you ask will be hypotheticals designed to understand your approach to developing efficient ETL processes.
Due to an ETL error, the employees table, instead of updating the salaries every year when doing compensation adjustments, did an insert instead. The head of HR still needs the current salary of each employee.
Write a query to get the current salary for each employee.
Assume no same combination of first and last names. (I.E. No two John Smiths)
employees table
Column | Type |
---|---|
id |
INTEGER |
first_name |
STRING |
last_name |
STRING |
salary |
INTEGER |
department_id |
INTEGER |
Output:
Column | Types |
---|---|
first_name |
STRING |
last_name |
STRING |
salary |
INTEGER |
song_plays
that tracks each user’s playing a song.Let’s say we want to create an aggregate table called lifetime_plays
that records each user’s song count by date.
Write a SQL query that could make this ETL each day.
song_plays table
Column | Type |
---|---|
id |
INTEGER |
created_at |
DATETIME |
user_id |
INTEGER |
song_id |
INTEGER |
Hint: We use the INSERT INTO keywords to add rows into the lifetime_plays table for this problem. Setting this query to run daily becomes a daily extract, transform, and load (ETL) process.
One thing to note: Google’s data engineering interviews aren’t just technically demanding; they’re often business-focused. So before your interview, familiarize yourself with Google’s products and try to learn the types of tools and processes they use. This can help you understand the types of data you might be presented with.
Overall, make algorithms and SQL a study focus in your interview prep and mock interviews. Querying and algorithm questions are asked quite frequently, and you’ll be in the best position for these interviews if you’re confident in those two skills.
Sign up to get your personalized learning path.
Access 600+ data science interview questions
1600+ top companies interview guide
Unlimited code runs and submissions
One thing to note: Google’s data engineering interviews aren’t just technically demanding; they’re often business-focused. So before your interview, familiarize yourself with Google’s products and try to learn the types of tools and processes they use. This can help you understand the types of data you might be presented with.
Overall, make algorithms and SQL a study focus in your interview prep and mock interviews. Querying and algorithm questions are asked quite frequently, and you’ll be in the best position for these interviews if you’re confident in those two skills.
Average Base Salary
Average Total Compensation