In data engineer interviews, Python is the second most frequently asked topic, behind only SQL. You may be asked about a variety of programming languages, but Python is the most important one to know. In fact, it’s a required skill for nearly 70% of data engineer jobs.
Python is widely used in data science, machine learning and AI. If you’re prepping for a data engineer interview, you should therefore have a strong grasp of its fundamentals and practical uses: writing Python functions, explaining the theory behind the language, and defining its core concepts.
Python Questions for Data Engineer Interviews
Data engineers should be prepared for a wide range of Python coding questions. A few of the most common topics include distribution-based questions, data munging with pandas, and data manipulation. Yet, for data engineer positions, Python questions tend to focus on three core categories:
1) Data structures and manipulation - You’ll likely be asked questions about using Python lists, data types and basic Python operations like searching and other data manipulation techniques. Typically these are definitions-based questions.
2) Python definitions - This type of question includes definitions of sequences, especially search, merge and sort functions, as well as creating new data by combining existing data. Algorithms-based Python questions might be asked too.
3) Python practice programs - Data engineering interviews typically include coding exercises. This technical portion will ask you to solve problems based on an existing data set, i.e. to apply your Python programming skills in practice.
Data Structures & Data Manipulation Questions
You can expect theory-based questions - for example, a theoretical exploration about data architecture - as well as simple coding exercises related to data structures and manipulation. Here are some sample questions:
Q1. Write a function find_bigrams to take a string and return a list of all bigrams.
sentence = """
Have free hours and love children?
Drive kids to school, soccer practice
and other activities.
"""

def find_bigrams(sentence) ->
[('have', 'free'),
 ('free', 'hours'),
 ('hours', 'and'),
 ('and', 'love'),
 ('love', 'children?'),
 ('children?', 'drive'),
 ('drive', 'kids'),
 ('kids', 'to'),
 ('to', 'school,'),
 ('school,', 'soccer'),
 ('soccer', 'practice'),
 ('practice', 'and'),
 ('and', 'other'),
 ('other', 'activities.')]
Hint: To separate the sentence into bigrams, the first thing we need to do is split the sentence into individual words.
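Following the hint, one way to sketch this is to split the sentence into words and pair each word with its successor. This is a minimal sketch, not the only valid answer. It assumes that words are separated by whitespace, that punctuation stays attached to its word, and that the output is lowercased, as the expected output suggests.

```python
def find_bigrams(sentence):
    """Return the list of consecutive word pairs (bigrams) in a sentence."""
    # Lowercase to match the expected output, then split on any whitespace
    # (spaces and newlines alike). Punctuation stays attached to its word.
    words = sentence.lower().split()
    # Pair each word with the word that follows it. zip stops at the
    # shorter input, so the last word is never paired with nothing.
    return list(zip(words, words[1:]))


sentence = """
Have free hours and love children?
Drive kids to school, soccer practice
and other activities.
"""
print(find_bigrams(sentence)[:3])
```

A sentence with fewer than two words simply yields an empty list.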
Q2. Given a list of timestamps in sequential order, return a list of lists grouped by week (7 days) using the first timestamp as the starting point.
ts = [
    '2019-01-01',
    '2019-01-02',
    '2019-01-08',
    '2019-02-01',
    '2019-02-02',
    '2019-02-05',
]

def weekly_aggregation(ts) ->
[
    ['2019-01-01', '2019-01-02'],
    ['2019-01-08'],
    ['2019-02-01', '2019-02-02'],
    ['2019-02-05'],
]
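One possible approach is to compute how many whole weeks separate each timestamp from the first one, then group on that number. The sketch below assumes the timestamps are ISO-formatted date strings (YYYY-MM-DD) and that weeks containing no timestamps are skipped rather than returned as empty lists, which is what the expected output implies.

```python
from datetime import date


def weekly_aggregation(ts):
    """Group ISO date strings into 7-day buckets counted from ts[0]."""
    if not ts:
        return []
    start = date.fromisoformat(ts[0])
    groups = {}
    for stamp in ts:
        # Whole weeks elapsed since the first timestamp: days 0-6 fall in
        # week 0, days 7-13 in week 1, and so on.
        week = (date.fromisoformat(stamp) - start).days // 7
        groups.setdefault(week, []).append(stamp)
    # Dictionaries preserve insertion order and the input is sequential,
    # so the buckets come out in chronological order.
    return list(groups.values())


ts = ['2019-01-01', '2019-01-02', '2019-01-08',
      '2019-02-01', '2019-02-02', '2019-02-05']
print(weekly_aggregation(ts))
```

Note that '2019-02-05' lands in its own group because it is 35 days after the start, which begins a new 7-day window.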
Q3. Write a function to locate the left insertion point for a specified value in a sorted list.
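Here's one possible answer, using the bisect module from the standard library:
import bisect

def index(a, x):
    # bisect_left returns the leftmost position at which x can be
    # inserted into the sorted list a while keeping it sorted.
    i = bisect.bisect_left(a, x)
    return i

a = [1,2,4,5]
print(index(a, 6))  # 4: larger than every element, so it goes at the end
print(index(a, 3))  # 2: between 2 and 4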
Q4. Write a function to create a queue and display all the members and size of the queue.
Here's the solution:
import queue

q = queue.Queue()
for x in range(4):
    q.put(x)

print("Members of the queue:")
# q.queue is the underlying deque; copying it into a list lets us
# read the members without removing them from the queue.
for n in list(q.queue):
    print(n, end=" ")

print("\nSize of the queue:")
print(q.qsize())

Sample Output:
Members of the queue:
0 1 2 3
Size of the queue:
4
Q5. What are some primitive data structures in Python? What are some user-defined data structures?
Python’s primitive data types are integers, floats, strings and Booleans. Its built-in data structures include lists, tuples, dictionaries and sets. These are already defined and supported by Python, and they act as containers for grouping related values.
User-defined data structures build on these built-in types and share many of their concepts. They allow users to create their own structures, including queues, trees and linked lists.
Hint: With questions like these, be prepared to talk about the advantages of a particular data structure and when it might be best for a project.
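To make the idea of a user-defined structure concrete, here is a minimal sketch of a singly linked list. The class and method names (Node, LinkedList, push_front) are illustrative choices for this example, not a standard API.

```python
class Node:
    """One element of the list: a value plus a link to the next node."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class LinkedList:
    """A minimal singly linked list built from Node objects."""

    def __init__(self):
        self.head = None
        self.size = 0

    def push_front(self, value):
        # Inserting at the head only rewires one link, so it takes
        # constant time regardless of the list's length.
        self.head = Node(value, self.head)
        self.size += 1

    def __iter__(self):
        # Walk the chain of links from the head until it runs out.
        node = self.head
        while node is not None:
            yield node.value
            node = node.next

    def __len__(self):
        return self.size


numbers = LinkedList()
for n in (1, 2, 3):
    numbers.push_front(n)
print(list(numbers), len(numbers))
```

A talking point for the interview: inserting at the front of a linked list is cheap, whereas reaching the k-th element requires walking k links, unlike indexing into a built-in list.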
Python Definitions Questions
Definition-based questions are commonly asked in technical screens. They’re used to quickly assess your comfort level with Python, and your knowledge of foundational-to-advanced concepts. These will include simple conceptual definitions, as well as opinion-based questions (like what’s the advantage of NumPy arrays vs. Python lists).
Q1. Which Python libraries are most efficient for data processing?
This is a foundational question, but it quickly assesses your familiarity with data processing. Be sure to include NumPy and pandas and list the advantages of both.
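NumPy is the usual choice for numerical work on arrays, since its operations run in compiled code. pandas builds on NumPy to handle labeled, tabular data, which makes it well suited to cleaning, aggregating and preparing data for statistics and machine learning.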
Hint: Be prepared for situational questions as well. The interviewer might give you a situation, and ask which Python library you might use to process the data.
Q2. What is data smoothing? How do you do it?
Data smoothing is an approach used to reduce noise in a data set so that the underlying patterns are easier to recognize. Smoothing out the rough edges, including the influence of outliers, can also improve the input to machine learning models.
In Python, smoothing is done by applying an algorithm to the data. Common choices include the Savitzky-Golay filter and the triangular moving average.
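As a rough illustration, a triangular moving average can be sketched in plain Python. The version below uses one common formulation, a simple moving average applied twice, and the function names are hypothetical. In practice you would more likely reach for a library routine than write this by hand.

```python
def simple_moving_average(values, window):
    """Average each run of `window` consecutive values.

    The result has len(values) - window + 1 points.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def triangular_moving_average(values, window):
    """Smooth twice: a moving average of the moving average.

    Averaging twice weights the middle of each span more heavily than
    its edges, which is where the "triangular" name comes from.
    """
    return simple_moving_average(simple_moving_average(values, window), window)


noisy = [10, 12, 9, 11, 30, 10, 12, 11, 9]
print(triangular_moving_average(noisy, 3))
```

The spike of 30 in the sample input is spread across its neighbors instead of standing out, at the cost of a shorter output series.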
Q3. When to use Python vs Java?
There are a lot of similarities between these two languages. Both are object-oriented and have large libraries that extend their uses. But Python has an edge in data science, in part due to its simplicity and user friendliness. Java is often the preferred language for developing large applications.
Q4. What is NumPy used for? What are its benefits?
NumPy is an open-source library that’s used to analyze data. It adds support for multi-dimensional arrays and matrices to Python, along with a variety of mathematical and statistical operations on them.
Q5. When would you use NumPy Arrays over Python lists?
Python lists are a basic building block, and they’re a useful data container for a variety of functions. However, lists don’t support vectorized operations such as element-wise multiplication.
Lists also require Python to store type information for every element, since they support objects of different types. This means type-dispatching code must be executed each time an operation is performed on an element. Each iteration also has to undergo type checks and Python API bookkeeping, so very few operations are carried out by C loops. NumPy arrays avoid this overhead by storing elements of a single type, which makes them the better choice for large numerical data sets.
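The list behavior described above can be seen in a few lines of plain Python. This sketch deliberately uses only built-in lists, and the variable names are made up for illustration.

```python
prices = [1.5, 2.0, 3.25]
quantities = [2, 4, 1]

# On a list, `*` means repetition, not element-wise multiplication.
repeated = quantities * 2

# Element-wise multiplication needs an explicit Python-level loop,
# with a type check on every element at every step.
totals = [p * q for p, q in zip(prices, quantities)]

# A list can mix types, so each element carries its own type information.
mixed = [1, "two", 3.0]
kinds = [type(item).__name__ for item in mixed]

print(repeated)
print(totals)
print(kinds)
```

With NumPy arrays of a single numeric type, the same element-wise product is written as one `*` expression and runs in a compiled loop.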
Data Engineer Interview: Coding Exercises
These questions will help you practice for the coding exercise portion of the interview. Typically, you’ll be given some information - like a data set - and asked to write Python code to solve the problem. These types of questions can test beginner Python skills, all the way up to advanced sequences and functions in Python.
Q1. Given a string, write a function recurring_char to find its first recurring character. Return None if there is no recurring character.
Note: Treat upper and lower case letters as distinct characters. You may assume the input string includes no spaces.
input = "interviewquery"
output = "i"

input = "interv"
output = None
Hint: We know we have to store a unique set of characters of the input string and loop through the string to check which ones occur twice.
Since we only need the first character that appears a second time, we can go through the string in one loop, save each unique character in a set, and check whether the current character is already in that set. If it is, return the character.
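The one-loop idea from the hint can be sketched as follows. This is one straightforward way to write it, using a set for constant-time membership checks.

```python
def recurring_char(text):
    """Return the first character seen for a second time, or None."""
    seen = set()
    for ch in text:
        # The first character we meet that is already in the set is the
        # first recurring character, so we can stop immediately.
        if ch in seen:
            return ch
        seen.add(ch)
    return None


print(recurring_char("interviewquery"))
print(recurring_char("interv"))
```

Because the comparison is exact, "A" and "a" count as different characters, as the note in the question requires.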
Q2. Given a dictionary consisting of many roots and a sentence, write a function replace_words to stem all the words in the sentence with the root forming it. If a word has many roots that can form it, replace it with the root with the shortest length.
roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
output = "the cat was rat by the bat"
Hint: At first, it simply looks like we can just loop through each word and check if the root exists in the word and if so, replace the word with the root. But since we are technically stemming the words, we have to make sure that the roots are equivalent to the word at its prefix, rather than existing anywhere within the word.
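A minimal sketch of the prefix-based approach from the hint is shown below. It assumes words are separated by single spaces and that every root is a non-empty string. For a very large set of roots, a trie would be a more efficient alternative.

```python
def replace_words(roots, sentence):
    """Replace each word with the shortest root that is a prefix of it."""
    # Try shorter roots first, so the first match is also the shortest.
    ordered = sorted(roots, key=len)
    stemmed = []
    for word in sentence.split():
        # startswith checks the prefix only, so a root that appears in
        # the middle of a word does not count. Words with no matching
        # root are kept unchanged.
        match = next((root for root in ordered if word.startswith(root)), word)
        stemmed.append(match)
    return " ".join(stemmed)


roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
print(replace_words(roots, sentence))
```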
Q3. Given a list of integers, find all combinations that sum to the target value N. An integer may be used more than once in a combination.
integers = [2,3,5]
target = 8
output = [
  [2,2,2,2],
  [2,3,3],
  [3,5]
]
Hint: You may notice, in solving this problem, that it breaks down into identical subproblems.
For example, if given integers = [2,3,5] and target = 8 as in the prompt, we might recognize that if we first solve for the input integers = [2,3,5] and target = 8 - 2 = 6, we can append 2 to each combination in that output to obtain the part of the final answer that contains a 2. This is a key idea in using recursion to solve this problem.
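The recursive idea can be sketched with backtracking: pick a number, then solve the smaller problem for the remaining target. This sketch assumes the integers are positive and may be reused, as the expected output [2,2,2,2] implies. The function name combination_sum is a hypothetical choice.

```python
def combination_sum(integers, target):
    """Return every combination of the integers that sums to target."""
    # Sort and de-duplicate so each combination is generated exactly once,
    # and drop non-positive values, which would make the search endless.
    candidates = sorted({value for value in integers if value > 0})
    results = []

    def search(start, remaining, current):
        if remaining == 0:
            results.append(list(current))
            return
        for i in range(start, len(candidates)):
            value = candidates[i]
            if value > remaining:
                break  # candidates are sorted, so no later value fits
            current.append(value)
            # Pass i rather than i + 1 so the same value can be reused.
            search(i, remaining - value, current)
            current.pop()

    search(0, target, [])
    return results


print(combination_sum([2, 3, 5], 8))
```

Starting each recursive call at index i, never earlier, is what prevents the same combination from appearing in a different order.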
Q4. Given a percentile_threshold, sample size N, and mean and standard deviation m and sd of the normal distribution, write a function truncated_dist to simulate a normal distribution truncated at percentile_threshold.
m = 2
sd = 1
n = 6
percentile_threshold = 0.75

def truncated_dist(m, sd, n, percentile_threshold) ->
[2, 1.1, 2.2, 3, 1.5, 1.3]
# All values in the output sample are in the lower 75% (= percentile_threshold) of the distribution.
Hint: First, we need to calculate where to truncate our distribution. We want a sample where all values are below the percentile_threshold.
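Say we have a point Z and want to calculate the percentage of our normal distribution that resides on or below Z. In order to do this, we would simply plug Z into the CDF of our distribution. Here we need the reverse: the point Z at which the CDF equals percentile_threshold, which the inverse CDF gives us.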
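Building on the hint, the sketch below finds the cutoff with the inverse CDF and then keeps drawing from the normal distribution until it has n values at or below that cutoff. Rejection sampling is a choice made for this example, not necessarily the intended solution, and it becomes slow when percentile_threshold is very small. It relies only on the standard library (statistics.NormalDist requires Python 3.8 or later).

```python
import random
from statistics import NormalDist


def truncated_dist(m, sd, n, percentile_threshold):
    """Draw n values from Normal(m, sd), all at or below the given percentile."""
    # The inverse CDF maps a percentile (strictly between 0 and 1) back
    # to the point on the distribution that sits at that percentile.
    cutoff = NormalDist(mu=m, sigma=sd).inv_cdf(percentile_threshold)
    sample = []
    while len(sample) < n:
        draw = random.gauss(m, sd)
        # Rejection step: discard any draw that falls above the cutoff.
        if draw <= cutoff:
            sample.append(draw)
    return sample


print(truncated_dist(m=2, sd=1, n=6, percentile_threshold=0.75))
```

The output differs from run to run because the draws are random. Call random.seed first if you need reproducible results.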
Conclusion: Final Tips for Data Engineer Programming Interviews
Programming questions are fairly straightforward, especially beginner Python coding exercises, which mainly test whether you can write code efficiently. These technical data engineering interviews tend to focus on hard skills, but don’t forget your soft skills. Here are some quick tips to help you nail a Python coding interview:
- Coding Exercises - Practice programs will help your hard skills in coding become second nature. Make coding challenges a core piece of your interview preparation strategy.
- Consider Soft Skills - Make sure you think about the soft skills that will be essential for a data engineer job. You’ll want to focus on communication (be able to express complex subjects clearly and in layman’s terms), time management (solve problems efficiently), and ability to take direction (problem-solving approach).
- Know Your Data Structures - Study up on the most commonly used structures and algorithms. Be comfortable expressing their uses, key benefits, and situations in which these structures are most useful.