Data engineers play an integral role at Google, building the data systems and reporting tools the company uses to innovate and improve. People in these roles are responsible for large volumes of product, marketing, and operations data, and as such, it’s a highly demanding job.
To achieve success, candidates for these roles must possess knowledge of the latest data trends, as well as working knowledge of common data engineering and ETL tools. They should know data structures inside and out, and have strong working knowledge of algorithms. In fact, algorithm questions show up in virtually all Google data engineering interviews, in addition to SQL, Python, and database design problems.
What Types of Questions Can You Expect in Google Interviews?
Data engineer interviews at Google are conducted in three phases: an initial coding test, a technical phone screen, and an on-site technical interview. Throughout the process, candidates are tested on a wide range of concepts and skills, including:
- Algorithms and Data Structures
- Database Design
Of all of these, SQL is the most frequently asked topic, so you can expect SQL questions in all three phases. One point that’s unique to Google: Data engineer interviews test your knowledge of algorithms, so be sure you study up there.
Getting ready for your Google interview? Sign up with Interview Query to get a free practice problem every week!
Algorithms and Data Structures Questions
According to our research, algorithm questions get asked frequently for Google data engineer positions - far greater than the average. To ace these questions, some basic knowledge of algorithms and data structures goes a long way.
In particular, data engineers should understand how these concepts can be applied to transfer large amounts of data efficiently.
Q1. You have an array of integers of length n spanning 0 to n with one missing. Write a function missing_number that returns the missing number in the array.
nums = [0,1,2,4,5] missing_number(nums) -> 3
Note: Complexity of O(N) required.
Hint: There are two ways to solve this problem while holding O(N) complexity: through mathematical formulation or logical iteration.
Q2. Given a list of integers, find the index at which the sum of the left half of the list is equal to the right half. If there is no index where this condition is satisfied, return -1.
Hint: The first thing to think about is what number you are aiming to find. The number that you want to find is the sum of the entire list divided by 2. First, how would we write up an equation to add up all the values?
nums = [1, 7, 3, 5, 6] findIndex(nums) -> 2
nums = [1,3,5] findIndex(nums) -> -1
Q3. Given two strings A and B, write a function can_shift to return whether or not A can be shifted some number of places to get B.
A = 'abcde' B = 'cdeab' can_shift(A, B) == True A = 'abc' B = 'acb' can_shift(A, B) == False
Hint: This problem is relatively simple if we figure out the underlying algorithm that allows us to easily check for string shifts between strings A and B.
First off, we have to set baseline conditions for string shifting. Strings A and B must both be the same length and consist of the same letters. We can check for the former by setting a conditional statement for if the length of A is equivalent to the length of B.
Q5. What is Big O notation?
Big O notation is used to describe the run time of an algorithm, relative to its input. This concept is used to describe how efficient an algorithm is.
Data engineer interviews at Google typically start with a short online SQL and Python coding screen. This test usually tests general query writing, but SQL questions will be featured prominently throughout the technical screen and on-site interview. Your prep strategy should include plenty of SQL questions.
Q1. We have a table with an id and name field. The table holds over 100 million rows and we want to sample a random row in the table without throttling the database.
Write a query to randomly sample a row from this table.
| column | type | | ------ | --------------- | | id | int | | name | varchar |
We know that the RAND() function actually returns a floating-point between 0 and 1. So if we were to run this function:
We would get a random decimal point to some Nth degree of precision. RAND() essentially allows us to seed a random value. How can we use this to select a random row quickly?
Q2. Write a query to identify the manager with the biggest team size. Note: You may assume there is only one manager with the largest team size.
| Column | Type | | -------------- | --------------- | | id | integer | | first_name | string | | last_name | string | | salary | integer | | department_id | integer | | manager_id | integer |
| Column | Type | | ------ | -------------- | | id | integer | | name | string | | team | string |
| column | type | | ---------- | ------ | | manager | string | | team_size | int |
Q3. Given three tables: user_dimension, account_dimension, and download_facts, find the average number of downloads for free vs paying customers broken out by day.
Note: The account_dimension table maps users to multiple accounts where they could be a paying customer or not. Also, round average_downloads to 2 decimal places.
| column | type | | ----------- | ----------- | | user_id | int | | account_id | int |
| column | type | | ---------------- | --------------- | | account_id | int | | paying_customer | boolean |
| column | type | | --------- | ----------- | | date | date | | user_id | int | | downloads | int |
Hint: Let's first break it down. What values in which tables can we join together to get the data that we need?
Q3. You’re given a product table with name, SKU, and price. How would you write a query to show the highest priced item?
Hint: A lot of SQL questions will be theoreticals that test your ability to write simple queries like this.
Like SQL, your programming skills will be tested in the initial screen, but be prepared for Python questions in the tech screen and on-site as well. These questions are designed to test your baseline knowledge of Python and provide interviewers with an idea of how you write Python code.
The key with this type of question is practice, because the more you write Python functions, the more it becomes second nature.
Q1. When might you use the NumPy library vs. pandas?
NumPy is a useful library for processing arrays of numbers, while pandas is best for statistics and machine learning data processing, which is why it’s widely used in data science.
Hint: With a question like this, be prepared to talk about the pros and cons of a library, and situations in which you would use that particular library.
Q2. Given a list of integers, find all combinations that equal the value N.
integers = [2,3,5], target = 8, output = [ [2,2,2,2], [2,3,3], [3,5] ]
Hint: You may notice in solving this problem that it breaks down into identical subproblems. For example, if given integers = [2,3,5] and target = 8 as in the prompt, we might recognize that if we first solve for the input: integers = [2, 3, 5] and target = 8 - 2 = 6, we can just add 2 to the output to obtain our final answer. This is a key idea in using recursion.
Q3. Given a string, write a function recurring_char to find its first recurring character. Return None if there is no recurring character.
Note: Treat upper and lower case letters as distinct characters. You may assume the input string includes no spaces.
input = "interviewquery" output = "i" input = "interv" output = "None"
We know we have to store a unique set of characters of the input string and loop through the string to check which ones occur twice.
Given that we have to return the first index of the second repeating character, we should be able to go through the string in one loop, save each unique character, and then just check if the character exists in that saved set. If it does, return the character.
Database Design Questions
Data engineers must have a strong grasp of data modeling and database design. That’s a central part of the position, and these types of questions are commonly asked in interviews.
In general, database questions tend to be hypotheticals. For example, you might be provided with an application and be asked to design the schema and conduct queries to return business metrics.
Q1. How would you create a schema to represent client click data on the web?
These types of questions are more architecture based and are generally given to test experience within developing databases, setting up architectures, and in this case, representing client side tracking in the form of clicks.
What exactly does click data on the web mean? Any form of button clicks, scrolls, or action at all as an interaction with the client interface, in this case desktop, would be somehow represented into a schema form for the end-user to query. This does not include client views, however.
A simple but effective design schema would be to first represent each action with a specific label. In this case, assigning each click event a name or label describing its specific action.
Q2. Let's say you have a table with a billion rows.
How would you add a column inserting data from the original source without affecting the user experience?
Hint: A question like this tests your problem-solving approach. The question is vague on purpose, and should prompt some questions that you might have. Always get clarifying information when a question is overly vague.
Q3. Design a database to represent a Tinder style dating app.
What does the schema look like and what are some optimizations that you think we might need?
Let’s first approach this problem by understanding the scope of the dating app and what functionality we have to design around.
If we were to list the key Tinder app capabilities, it would be something like:
- Onboarding - User opens up, adds preferences, adds pictures, and starts swiping on users.
- Matching - If the user matches with another user, we notify them and create a messaging system.
- Messaging - Users can message between each other only if both have been matched. Users can also leave conversations at any time.
Next, we should figure out if we have specific feature goals that we have to account for within engineering and system design.
Q4. You are given a table of measurement values from a sensor. Each measurement can happen multiple times each day. Write a query to output the sum of values for every odd measurement and the sum of values for every even measurement by date.
| column | type | | ----------- | --------------- | | id | integer | | value | float | | time | datetime |
Q5. Describe the process of logical design for a database.
Logical design is the process of arranging data in a series of logical relationships called entities and attributes. Entities are larger chunks of data, while attributes are smaller components that help to define the entity. In relational databases, entities usually map to a table.
ETL - the process of extracting, transforming and loading data - is a key responsibility for data engineers. Data engineers must design an ETL process that’s efficient and easy to manage. The types of questions you get asked will be hypotheticals designed to understand your approach to developing efficient ETL processes.
Q1. Let’s say we have a table representing a company payroll schema.
Due to an ETL error, the employees table, instead of updating the salaries every year when doing compensation adjustments, did an insert instead. The head of HR still needs the current salary of each employee.
Write a query to get the current salary for each employee.
Assume no duplicate combination of first and last names. (I.E. No two John Smiths)
| column | type | | -------------- | --------------- | | id | integer | | first_name | string | | last_name | string | | salary | integer | | department_id | integer |
| column | types | | ----------- | ------- | | first_name | string | | last_name | string | | salary | integer |
Q2. We have a table called `song_plays` that tracks each time a user plays a song.
Let's say we want to create an aggregate table called `lifetime_plays` that records the song count by date for each user.
Write a SQL query that could make this ETL each day.
| column | type | | ----------- | --------------- | | id | integer | | created_at | datetime | | user_id | integer | | song_id | integer |
Hint: For this problem, we use the INSERT INTO keywords to add rows into the lifetime_plays table. If we set this query to run daily, it becomes a daily extract, transform, and load (ETL) process.
One thing to note: Google’s data engineering interviews aren’t just technically demanding, they’re often business focused. So before your interview, familiarize yourself with Google’s products and try to learn the types of tools and processes they use. This can help you understand the types of data you might be presented with.
Overall, make algorithms and SQL a study focus in your interview prep and mock interviews. Querying and algorithm questions are asked quite frequently, and you’ll be in the best position for these interviews if you’re confident in those two skills.
Thanks for Reading!
Continue your data engineering interview prep with these helpful resources from Interview Query: Top Data Engineering Interview Questions and the SQL and data modeling sections of our Data Science Course.