The Amazon Data Engineer Guide


Overview: Data Engineering at Amazon

Amazon data engineers play an integral role in the company’s data science operations. They are responsible for wrangling massive datasets, developing scalable engineering solutions and building data solutions that drive real impact at the company.

In other words, data engineers at Amazon are responsible for driving improvements in data systems for the benefit of the business, customers and data science teams.

The data engineering interview process is rigorous and technically demanding, with questions covering SQL, Python, algorithms and database design. In addition to the technical rounds, Amazon data engineer candidates must pass a human resources screening and a behavioral interview, both of which focus on the Amazon Leadership Principles (discussed further below).

Ultimately, Amazon data engineer interviews include three phases: a recruiter screen, a technical screen and an onsite, which itself includes two to four technical or behavioral rounds.

Amazon Data Engineer Teams

Data engineers at Amazon are responsible for designing, developing and maintaining the core data structures, data models and data pipelines at Amazon. Amazon data engineers work across a variety of verticals, including:

  • Advertising - The Amazon Advertising data engineering team ingests, transforms, and enriches terabytes of data per day. The team leads the development of data systems, tools and processes to analyze and leverage Amazon Advertising data.
  • Amazon Alexa - Data engineers on the Alexa team architect, develop and maintain a centralized data system and a single source of truth for the Web Info organization. These teams are responsible for integrating multiple data sources to create BI reports and visualizations.
  • Amazon Devices - Data engineers on the Devices team are strategic partners to the product managers and engineers on the team. Amazon Devices engineers provide expertise on data storage, feature instrumentation and data privacy, as well as creating the data infrastructure and pipelines to drive Amazon’s machine learning projects.
  • Amazon Web Services - The AWS Data Science team uses AWS tools to unify data preparation, machine learning and model deployment. The engineering team is responsible for scaling the abilities and resources for customers by delivering advanced functionality for data visualization, feature engineering, model interpretability and low-latency deployment.
  • Operations Technology - Data engineers on the Operations Technology team tackle some of the most complex challenges in large-scale computing. Most of the work they do involves storing and providing access to data in efficient ways. They deal with very diverse, high-volume data, typically millions of records per day.
  • Retail - Data engineers on the Retail team play a significant role in building Amazon’s large-scale, high-volume, high-performance data integration and delivery services. These data solutions are used for periodic reporting and drive business decision-making.

Amazon Data Engineer Roles and Responsibilities

Data engineers at Amazon tackle complex, large-scale data engineering challenges, with many different streams of data to incorporate. Although the responsibilities vary by vertical, data engineers at Amazon are responsible for:

  • Building different types of data warehousing layers based on specific use cases.
  • Building scalable data infrastructure and understanding distributed systems concepts from a data storage and compute perspective.
  • Utilizing expertise in SQL and having a strong understanding of ETL (Extract-Transform-Load) and data modeling.
  • Ensuring the accuracy and availability of data to customers and understanding how technical decisions can impact the business’s analytics and reporting.
  • Using at least one scripting or programming language proficiently to handle large-volume data processing.
  • Designing and implementing analytical data infrastructure.
  • Interfacing with other technology teams to extract, transform and load data from a wide variety of data sources.
  • Collaborating with various tech teams to implement advanced analytics algorithms.

The Amazon Data Engineer Interview

Amazon data engineer interviews are typically broken into three stages: an initial recruiter screen, a technical screen, and an onsite round. In the technical and onsite rounds, candidates will be asked questions focusing on core data engineering skills like SQL, data modeling, database design, and data warehousing.

Here is a more in-depth look at the Amazon data engineer interview process:

Stage 1: Recruiter Screen

Amazon data engineering interviews start with a phone screen. A recruiter will call you to assess your technical skills, ask about your experience and determine if you are the right fit for the role. At this stage, the recruiter wants to understand how proficient you are in programming, typically in Python and SQL. They also want to understand how your experience aligns with Amazon’s culture.

Tip: Be prepared to talk about your work experience, and map your skills and experiences to Amazon’s core values.

Stage 2: Technical Screen

This interview is typically a 45-minute screen with an Amazon data engineer. This stage of the interview process assesses your technical skills and knowledge. Topics covered in the tech screen include:

  • Data warehousing
  • ETL tools
  • Data structures
  • SQL and Python
  • Data modeling

Commonly, candidates will face a range of SQL questions, covering basic to intermediate SQL concepts like joins, subqueries, window functions and case statements. You may also face a simple data engineering case study question. In this stage, you will be assessed on how efficiently you can write code, as well as your comfort with programming languages, tech solutions and concepts used in the job.

Tip: Although speed is assessed, thoroughness and problem solving are more important to the interviewer. Ask clarifying questions before you jump into a coding solution. Also, think out loud to help the interviewer understand your process. In the real world you won’t be as constrained by time as in the interview setting.

Stage 3: Onsite Interview

The Amazon onsite includes a variety of challenging technical data engineer interview questions, including scenario-based case studies, coding tests and a “Bar Raiser” behavioral interview.

This stage is usually split into three areas:

  • Technical Interviews
  • Bar Raiser Interview
  • HR Interview

The technical interview phase typically includes two to four rounds that seek to assess your problem-solving and coding ability. You may face a data engineering case study, as well as SQL, Python and database design rounds. In coding and case study rounds, you will be asked to propose a solution to a case study question, or to write out or whiteboard Python or SQL code.

Technical interviews may also ask you to investigate a pipeline model, design a data model or determine what is causing an ETL error.

Bar Raiser interviews are unique to Amazon. In this stage, an Amazon employee from a different department will ask you questions and determine if you are the right fit for Amazon. These interviews focus on Amazon’s 16 Leadership Principles (be sure to read through these and commit them to your vocabulary). Prepare answers to data engineer behavioral questions that map your skills and experiences to these 16 principles.

Finally, the HR round focuses on your previous work experiences and projects you have worked on in the past. Keep the focus on yourself and your data engineering skills in this stage.

Hiring Requirements for Amazon Data Engineering Jobs

The qualifications and required skills vary by department and job role. But in general, most data engineering jobs at Amazon require:

  • A natural curiosity and ability to collaborate.
  • Degrees in math, computer science, engineering or a related field (or comparable experience in the industry).
  • Proficiency with data modeling, ETL development and data warehousing.
  • Comfort with tools like Oracle, Redshift and PostgreSQL.
  • Proficiency in programming languages like Python or Java.
  • Familiarity with AWS.
  • Strong SQL skills, including performance tuning.

Amazon Data Engineer Interview Questions

In data engineering interviews at Amazon, the most frequently tested subjects include SQL (asked in 99% of interviews), Python, behavioral questions and database design/data modeling. Use these practice data engineer interview questions to prepare for the Amazon interview.

Behavioral Questions

Remember to incorporate Amazon Leadership Principles into your answers to behavioral questions. Behavioral questions are asked in the recruiter screen, the Bar Raiser Interview and HR Interviews.

1. How would you describe your communication style?

Culture fit is assessed in Amazon behavioral interviews. This question helps the interviewer understand how you communicate, gather information and collaborate with a team. With a question like this, you might draw on any of three principles: “Have Backbone; Disagree and Commit,” “Earn Trust” or “Learn and Be Curious.”

2. Describe a data engineering project you worked on that was challenging. What was challenging about it?

Using a framework such as STAR (Situation, Task, Action, Result) can help you answer a question like this. Describe the situation and task you were faced with, then the actions you took, and finish with the results you achieved.

For example, you might say that:

“In my previous job, I was asked to build an ETL pipeline for streaming data that would gather customer transaction data to be used by the sales team. I started by gathering stakeholder input to learn about the specific needs of the sales team, and then researched options. A challenge arose during the testing phase, as there was significant pipeline lag. I had to review all of the code and optimize it. This turned out to be a great learning experience, as I learned which common code inefficiencies I had been introducing and can now avoid them.”

3. Tell me about a time you had to describe a complex technical subject to a non-technical stakeholder.

A question like this assesses your communication and collaboration skills. You might say:

“I was asked to design a marketing analytics database. However, the marketing department didn’t have extensive analytics or database knowledge. I created a short presentation, helping the team visualize the database schema and held a Q&A session after. The presentation made the schema easy to understand and helped the team better query the data.”

4. Why Amazon?

Expect a variation of this question. Some options for answering include:

  • Describing your excitement for the ecommerce space.
  • Aligning your passion with their company culture.
  • Mentioning referrals who have told you good things about working at Amazon.

5. You are assigned to work on a new engineering project. How do you get started?

Start with the initial stages like gathering stakeholder input, understanding data requirements and creating process or logical data models.

SQL Questions

SQL questions for data engineers typically include basic definitions, scenario-based questions and SQL query writing tests.

1. How do you handle duplicate data in SQL?

Start with clarifying questions. Specifically, you should ask:

  • What type of data are we working with?
  • What types of values are most likely duplicated?

This should arm you with enough information to answer this question confidently. For example, you might suggest using keywords like DISTINCT or GROUP BY to de-duplicate the data, or a UNIQUE constraint to keep duplicates from being inserted in the first place.
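As a minimal sketch of this answer, the example below runs the SQL through Python's built-in sqlite3 module. The `customers` table and its columns are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical table with one fully duplicated row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT, email TEXT);
    INSERT INTO customers VALUES
        ('Ana', 'ana@example.com'),
        ('Ana', 'ana@example.com'),
        ('Ben', 'ben@example.com');
""")

# DISTINCT collapses rows that are identical across the selected columns.
distinct_rows = conn.execute(
    "SELECT DISTINCT name, email FROM customers ORDER BY name"
).fetchall()

# GROUP BY with HAVING reports which values are duplicated and how often.
duplicates = conn.execute("""
    SELECT name, email, COUNT(*) AS copies
    FROM customers
    GROUP BY name, email
    HAVING COUNT(*) > 1
""").fetchall()

print(distinct_rows)
print(duplicates)
```

DISTINCT is enough when whole rows repeat. The GROUP BY form is handy when you first want to inspect which values are duplicated before deciding how to remove them.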

2. Write a query to find the current salary data for each employee.

For this question we have a table representing a company payroll schema. Due to an ETL error the employees table, instead of updating the salaries every year when doing compensation adjustments, did an insert instead. The head of HR still needs the current salary of each employee.

Hint. The first step we need to do would be to remove duplicates and retain the current salary for each user.

Given we know there are no duplicate first and last name combinations, we can remove duplicates from the employees table by running a GROUP BY on two fields, the first and last name. This gives us one group per unique first and last name combination, from which we can pick the most recent salary row.
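One possible way to finish the query is sketched below using Python's built-in sqlite3 module. It assumes a hypothetical `employees` table with an auto-incrementing `id`, so that the highest `id` per employee marks the most recently inserted (current) salary. The source does not give the schema, so treat the column names and that ordering assumption as illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: id increases with every inserted row.
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        first_name TEXT,
        last_name TEXT,
        salary INTEGER
    );
    INSERT INTO employees (first_name, last_name, salary) VALUES
        ('Ava', 'Lee', 90000),
        ('Raj', 'Patel', 85000),
        ('Ava', 'Lee', 95000),
        ('Raj', 'Patel', 88000),
        ('Ava', 'Lee', 101000);
""")

# Group by name to find each employee's latest row, then join back
# to the table to read the salary stored on that row.
current_salaries = conn.execute("""
    SELECT e.first_name, e.last_name, e.salary
    FROM employees AS e
    JOIN (
        SELECT first_name, last_name, MAX(id) AS latest_id
        FROM employees
        GROUP BY first_name, last_name
    ) AS latest
      ON e.id = latest.latest_id
    ORDER BY e.first_name
""").fetchall()

print(current_salaries)
```

Taking MAX(salary) instead would only be correct if salaries never decrease, which is why this sketch relies on the row order rather than the salary value.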

3. Write a query that returns all the neighborhoods with zero users.

Hint. Whenever a question asks about finding values with zero of something (users, employees, posts, etc.), immediately think of a LEFT JOIN. An inner join returns only the rows that match in both tables, while a left join keeps every row from the left table and fills the right-table columns with NULL where there is no match.

Our task is to find all the neighborhoods without users. To do this, we left join the neighborhoods table to the users table and keep only the rows where the user columns are NULL.
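A minimal sketch of that query, run through Python's built-in sqlite3 module, might look like the following. The `neighborhoods` and `users` tables and their columns are assumptions, since the source does not show the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema for illustration.
    CREATE TABLE neighborhoods (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT,
        neighborhood_id INTEGER
    );
    INSERT INTO neighborhoods VALUES (1, 'Castro'), (2, 'Mission'), (3, 'Sunset');
    INSERT INTO users VALUES (1, 'Ana', 1), (2, 'Ben', 1), (3, 'Cy', 3);
""")

# Every neighborhood is kept by the LEFT JOIN; those with no matching
# user have NULL in the users columns, which the WHERE clause selects.
empty_neighborhoods = conn.execute("""
    SELECT n.name
    FROM neighborhoods AS n
    LEFT JOIN users AS u ON u.neighborhood_id = n.id
    WHERE u.id IS NULL
    ORDER BY n.name
""").fetchall()

print(empty_neighborhoods)
```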

4. Write a query to select the top 3 departments with at least 10 employees. Rank them by the percentage of employees making over $100,000.

The first step is to determine what the question is asking. We can break it down into separate conditions:

  • Top 3 departments.
  • Percent of employees making over $100,000 in salary.
  • Department must have at least 10 employees.

What comes next to fully solve the above question?
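One way to combine the three conditions is sketched below with Python's built-in sqlite3 module. The `departments` and `employees` tables, their columns and the sample headcounts are all hypothetical. HAVING enforces the 10-employee minimum, a CASE expression inside AVG gives the share of employees over $100,000, and ORDER BY with LIMIT keeps the top 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema for illustration.
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        salary INTEGER,
        department_id INTEGER REFERENCES departments(id)
    );
""")
conn.executemany(
    "INSERT INTO departments VALUES (?, ?)",
    [(1, "Ads"), (2, "Retail"), (3, "Devices"), (4, "Alexa"), (5, "Ops")],
)

# (department_id, headcount, number earning over $100,000)
headcounts = [(1, 10, 8), (2, 12, 3), (3, 10, 5), (4, 20, 2), (5, 4, 4)]
for dept_id, total, high_earners in headcounts:
    salaries = [120000] * high_earners + [80000] * (total - high_earners)
    conn.executemany(
        "INSERT INTO employees (salary, department_id) VALUES (?, ?)",
        [(salary, dept_id) for salary in salaries],
    )

top_departments = conn.execute("""
    SELECT d.name,
           COUNT(*) AS num_employees,
           AVG(CASE WHEN e.salary > 100000 THEN 1.0 ELSE 0.0 END) AS pct_over_100k
    FROM employees AS e
    JOIN departments AS d ON d.id = e.department_id
    GROUP BY d.id, d.name
    HAVING COUNT(*) >= 10
    ORDER BY pct_over_100k DESC
    LIMIT 3
""").fetchall()

for name, num_employees, pct in top_departments:
    print(name, num_employees, round(pct, 2))
```

In this sample data the "Ops" department has the highest share of high earners but only 4 employees, so the HAVING clause filters it out before ranking.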

Python Questions

Data engineer Python questions can range from definitions of data structures to writing Python code.

1. Given a string, write a function recurring_char to find its first recurring character. Return “None” if there is no recurring character.

We know we have to store a unique set of characters of the input string and loop through the string to check which ones occur twice.

Since we need the first character that recurs, we can go through the string in a single loop, save each unique character, and check whether the current character already exists in that saved set. If it does, return the character.

def recurring_char(input_string):
    # Track every character seen so far; set lookups are O(1) on average.
    seen = set()
    for char in input_string:
        # The first character already in the set is the first recurrence.
        if char in seen:
            return char
        seen.add(char)

    # No character appeared twice.
    return None

2. What data types are available in Python?

Python includes a variety of built-in data types. Beyond numbers, strings and booleans, the main built-in collection types are:

  • Lists
  • Tuples
  • Dictionaries
  • Sets

There are also user-defined data types in Python. Examples include queues, trees and linked lists.

3. Given a list of timestamps in sequential order, return a list of lists grouped by week using the first timestamp as the starting point.

This is a scripting question that asks you to process unstructured data. Specifically, we are being asked to do a few different tasks:

  1. Loop through all of the datetimes.
  2. Set a beginning timestamp as our reference point.
  3. Check if the next time in the array is more than 7 days ahead.
     a. If it is more than 7 days, set the new timestamp as the reference point.
     b. If it is not more than 7 days, continue to loop through and append the last value.
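The steps above can be sketched roughly as follows. The function name `group_by_week` and the 'YYYY-MM-DD' string input are assumptions, as the source does not specify either. This sketch also treats a timestamp exactly 7 days after the reference point as the start of a new week, which is a boundary choice worth confirming with the interviewer.

```python
from datetime import datetime, timedelta


def group_by_week(timestamps):
    """Group sorted 'YYYY-MM-DD' strings into lists of at most one week each."""
    if not timestamps:
        return []

    # The first timestamp is the initial reference point.
    reference = datetime.strptime(timestamps[0], "%Y-%m-%d")
    groups = [[timestamps[0]]]

    for ts in timestamps[1:]:
        current = datetime.strptime(ts, "%Y-%m-%d")
        if current - reference >= timedelta(days=7):
            # Outside the current week: start a new group and move the reference.
            reference = current
            groups.append([ts])
        else:
            # Still inside the current week: append to the latest group.
            groups[-1].append(ts)
    return groups


dates = ["2019-01-01", "2019-01-02", "2019-01-08",
         "2019-02-01", "2019-02-02", "2019-02-05"]
print(group_by_week(dates))
```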

4. You have an array of integers of length n spanning 0 to n with one number missing. Write a function missing_number that returns the missing number in the array.

Hint. There are two ways we can solve this problem. One way is through logical iteration and another way is through mathematical formulation. We can look at both methods as they both hold O(N) complexity.

The first method would be through general iteration through the array. We can pass in the array and create a set which will hold each value in the input array. Then we create a loop that will span the range from 0 to n, and look to see if each number is in the set we just created. If it is not, we return the missing number.
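Both approaches from the hint can be sketched as below. The name `missing_number` comes from the question, while `missing_number_by_sum` is a hypothetical name for the mathematical variant, which subtracts the actual sum from the expected sum of 0 through n.

```python
def missing_number(nums):
    """Iterative approach: look up every candidate from 0 to n in a set."""
    seen = set(nums)
    # The array has n entries, so the full range 0..n has n + 1 values.
    for candidate in range(len(nums) + 1):
        if candidate not in seen:
            return candidate
    return None  # Only reached if the input does not match the problem statement.


def missing_number_by_sum(nums):
    """Mathematical approach: expected sum of 0..n minus the actual sum."""
    n = len(nums)
    return n * (n + 1) // 2 - sum(nums)


print(missing_number([0, 1, 2, 4]))
print(missing_number_by_sum([0, 1, 2, 4]))
```

Both run in O(N) time. The sum-based version avoids the extra set, so it uses O(1) additional memory.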

Database Design Questions

Database design questions asked in Amazon interviews typically provide you with a case and ask you to create a schema for that case.

1. How would you create a schema to represent client click data on the web?

Hint. First, we want to clarify what click data means. You could safely assume that it represents button clicks, scrolls, closing pop-ups, etc. One solution: You could assign each action a label that describes the action.

For example, here we can say that the product is Dropbox and that we want to track each folder click on the UI of an individual person’s Dropbox folder. We can label the clicking on a folder as an action name called folder_click. When the user clicks on the side panel to login or logout and we need to specify the action, we can call it login_click and logout_click.
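To make the idea concrete, here is one possible shape for such a table, sketched with Python's built-in sqlite3 module. The table name `click_events` and every column other than the action labels are assumptions for illustration, not a schema given by the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE click_events (
        event_id    INTEGER PRIMARY KEY,
        user_id     INTEGER NOT NULL,
        action_name TEXT NOT NULL,   -- e.g. folder_click, login_click, logout_click
        target_id   TEXT,            -- e.g. the folder that was clicked, if any
        created_at  TEXT NOT NULL    -- ISO-8601 timestamp of the event
    )
""")

conn.executemany(
    "INSERT INTO click_events (user_id, action_name, target_id, created_at) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "login_click", None, "2024-01-01T09:00:00"),
        (1, "folder_click", "folder_42", "2024-01-01T09:01:00"),
        (1, "folder_click", "folder_7", "2024-01-01T09:02:00"),
        (1, "logout_click", None, "2024-01-01T09:05:00"),
    ],
)

# Labeling each action makes simple aggregations straightforward.
clicks_per_action = conn.execute("""
    SELECT action_name, COUNT(*) AS clicks
    FROM click_events
    GROUP BY action_name
    ORDER BY action_name
""").fetchall()

print(clicks_per_action)
```

Keeping one row per event with a descriptive `action_name` means new click types can be tracked without changing the table structure.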

2. Say you have a table with a billion rows. How would you add a column inserting data from the source, without impacting user experience?

This question is vague, and you would probably want to get some clarity before answering. A full mock interview solution to this question is available on YouTube.

Algorithm Questions

Algorithm questions are asked in Amazon data engineer interviews. However, the focus is primarily on basic algorithmic knowledge, data structures, and easy Python coding tests.

1. Given a list of integers, find the index at which the sum of the left half of the list is equal to the right half. If there is no index where this condition is satisfied, return -1.

Hint. Start by thinking about what number you are trying to find. It is the sum of the entire list divided by 2. How could you create a function to add up all the values?
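Following the hint, a running left-hand sum can be compared against half of the total. The sketch below uses the hypothetical name `equivalent_index` and assumes the returned index is the last position of the left half, with both halves required to be non-empty.

```python
def equivalent_index(nums):
    """Return the last index of the left half when both halves sum equally."""
    total = sum(nums)
    left_sum = 0
    # Stop before the final element so the right half is never empty.
    for i, value in enumerate(nums[:-1]):
        left_sum += value
        # left_sum == total / 2, written without division to stay in integers.
        if left_sum * 2 == total:
            return i
    return -1


print(equivalent_index([1, 7, 3, 5, 6]))
print(equivalent_index([1, 1, 1]))
```

For `[1, 7, 3, 5, 6]` the left half `1 + 7 + 3` and the right half `5 + 6` both sum to 11, so the function returns index 2.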

2. What are the assumptions of linear regression?

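Hint. This is representative of the algorithm questions you might face. You might start by noting that linear regression assumes a linear relationship between the features and the response variable, which is the value you want to predict. Other standard assumptions include independent errors, constant error variance (homoscedasticity) and approximately normally distributed residuals.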

3. How would you approach multicollinearity in multiple linear regression?

Multiple linear regression uses more than one independent variable to predict the dependent variable. One assumption we can make with this technique is that the independent variables are also independent from one another, or that the values do not affect one another.
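One common first check for whether that assumption holds is to look at pairwise correlations between the independent variables. The sketch below is only an illustration of that check: the feature names, the sample values and the 0.9 threshold are all arbitrary assumptions, and the helper names `pearson` and `correlated_pairs` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.9):
    """Return feature-name pairs whose absolute correlation exceeds threshold."""
    return [
        (a, b)
        for a, b in combinations(features, 2)
        if abs(pearson(features[a], features[b])) > threshold
    ]


# Made-up example data: square footage and room count move together.
features = {
    "sqft": [800, 1000, 1200, 1500, 2000],
    "rooms": [2, 3, 3, 4, 5],
    "age_years": [30, 5, 18, 12, 25],
}
flagged = correlated_pairs(features)
print(flagged)
```

Once a highly correlated pair is flagged, typical follow-ups to discuss are dropping one of the two variables, combining them into a single feature, or using a regularized model.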


Amazon Data Engineer Salary

  • Average base salary: $123,457 (min: $75K, max: $160K)
  • Average total compensation: $165,415 (min: $13K, max: $296K)

