Data science has become a multidisciplinary skill set that requires a combination of statistics, modeling, and coding. Along with the growth in data science, there has also been a rise in data science technical interviews with an emphasis on Python coding questions. Interviewers have moved from asking random coding questions to focusing their questions on specific Python concepts.
Why is Python the main language of choice for data science interviews?
Python has reigned as the dominant language in data science over the past few years, overtaking former strongholds such as R, Julia, Spark, and Scala thanks to its wide breadth of data science libraries, supported by a strong and growing data science community.
One of the main reasons Python is now the preferred language is that its libraries extend its use to the full stack of data science. Each data science language has its own specialty, such as R for data analysis and modeling within academia, or Spark and Scala for big data ETLs and production, but Python has grown its own ecosystem of libraries to the point where they all fit nicely together. At the end of the day, it's much easier to program and perform full stack data science without having to switch languages. This means running exploratory data analysis, creating graphs and visualizations, building the model, and implementing the deployment all in one language.
What's the difference between Python interview questions and software engineering interview questions?
Given this need for Python skills, what kind of questions should you expect in a data science interview? Python requirements for data scientists in interviews are very different from those for software engineers and developers. Data scientists should obviously be comfortable with basic Python syntax (lists, dictionaries, data types) and the popular data analysis libraries like Pandas and Numpy. But where do we draw the line between a software engineering interview question on data structures and algorithms and a Python data science question?
The main difference between the two is that Python-based interview questions are meant to assess your scripting skills: how well you can write code that analyzes, transforms, or manipulates data in some way and that, most of the time, will not run in a production environment.
For example, in software engineering and much of machine learning engineering and infrastructure, many engineers work on building systems, maintaining web applications, and scaling software to millions of users. These tasks require careful engineering to build products that minimize downtime and bugs. On the other side, analytics and data science cater primarily to the internal parts of the organization. This involves importing data from the website to analyze, creating ETLs, and writing scripts that run at a certain cadence.
If we use Facebook as an example, a software engineer would build the web application for Facebook to render friends, profiles, and a newsfeed for the end user to share and connect with friends. A data scientist might be tasked with writing a script that pulls in the number of stories a user visited on the newsfeed each day, analyzes it, and outputs it to a dashboard.
Given that this task doesn't affect the end user experience, engineering is often not the primary concern for a data scientist, since their script would not cause the website to crash if it had bugs or couldn't scale. But the level to which data scientists have to understand data structures and algorithms varies depending on their responsibilities at the organization. Many times, data scientists are tasked with writing production code and function as machine learning engineers.
Python Data Science Interview Questions and Concepts
So what kinds of questions actually count as Python data science questions? We know they fall somewhere between something as simple as "what is a dictionary in Python?" and difficult data structure, algorithm, or object-oriented programming concepts.
There are five main concepts tested in Python data science interview questions.
- Statistics and distribution based questions
- Probability simulation
- String parsing and data manipulation
- Numpy functions and matrices
- Pandas data munging
Python Statistics Questions
Python statistics questions are based on implementing statistical analyses and testing how well you know statistical concepts and can translate them into code. Many times, these questions take the form of random sampling from a distribution, generating histograms, or computing statistical metrics such as standard deviation, mean, or median.
These kinds of questions should be tackled by first understanding statistics at a core level.
Example Python Statistics Question:
Asked By Google:
Write a function to generate X samples from a normal distribution and plot the histogram.
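One possible way to approach this question is sketched below. It is a minimal illustration rather than a reference answer: the function names `sample_normal` and `plot_histogram`, the default mean of 0 and standard deviation of 1, and the bin count are all assumptions, and it relies on Numpy for sampling and matplotlib for plotting.

```python
import numpy as np


def sample_normal(n_samples, mean=0.0, std=1.0, seed=None):
    """Draw n_samples values from a normal distribution."""
    rng = np.random.default_rng(seed)  # seeded generator for reproducibility
    return rng.normal(loc=mean, scale=std, size=n_samples)


def plot_histogram(samples, bins=30):
    """Plot a histogram of the samples and return the figure."""
    # Imported here so that sampling still works where matplotlib is unavailable.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(samples, bins=bins)
    ax.set_xlabel("Value")
    ax.set_ylabel("Count")
    ax.set_title(f"Histogram of {len(samples)} normal samples")
    return fig


samples = sample_normal(1000, seed=0)
print(samples.mean(), samples.std())  # should be close to 0 and 1
# plot_histogram(samples).savefig("histogram.png")  # hypothetical output file name
```

Splitting sampling from plotting makes it easy to check the statistics of the samples before looking at the chart, which is also a natural point to talk through the mean and standard deviation with the interviewer.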
Python Probability Questions
Most Python questions that involve probability are testing your knowledge of the underlying probability concept. These questions are very similar to the Python statistics questions, except they focus on simulating concepts like the binomial distribution or Bayes' theorem.
Since most general probability questions are focused on calculating chances under a certain condition, the answer to almost all of them can be checked by writing Python to simulate the problem. For example, take this data science probability problem from Microsoft:
Amy and Brad take turns in rolling a fair six-sided die. Whoever rolls a "6" first wins the game. Amy starts by rolling first. What's the probability that Amy wins?
Given this scenario, we can write a Python function that simulates the game thousands of times and counts how often Amy wins. Solving this problem requires modeling two players and simulating them taking turns, with Amy rolling first in each game.
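The simulation described above could be sketched as follows. The function names, the number of games, and the seed are illustrative assumptions. For comparison, the exact answer can be derived by hand: Amy wins on her first roll with probability 1/6, and otherwise both players miss (probability 25/36) and the game starts over, so p = 1/6 + (25/36)p, which gives p = 6/11.

```python
import random


def amy_wins_one_game(rng):
    """Play one game; return True if Amy rolls the first 6."""
    amy_turn = True  # Amy always rolls first
    while True:
        if rng.randint(1, 6) == 6:
            return amy_turn
        amy_turn = not amy_turn  # pass the die to the other player


def estimate_amy_win_probability(n_games=100_000, seed=0):
    """Estimate Amy's win probability as the fraction of simulated games she wins."""
    rng = random.Random(seed)
    wins = sum(amy_wins_one_game(rng) for _ in range(n_games))
    return wins / n_games


estimate = estimate_amy_win_probability()
print(f"Simulated: {estimate:.3f}, exact: {6 / 11:.3f}")
```

A single boolean that flips after each roll is enough to track whose turn it is, so there is no need for separate player objects.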
String Parsing and Data Manipulation Python Questions
String parsing questions in Python are probably among the most common. These types of questions focus on how well you can manipulate text data, which almost always needs to be thoroughly cleaned and transformed into a dataset.
These types of questions are common at startups or companies that work with a lot of text that needs to be analyzed on a regular basis. This includes most social media companies like Twitter or LinkedIn and job companies like Indeed or ZipRecruiter.
Example String Parsing:
- Given a log file, parse each line and return a dictionary with the corresponding values.
- Write a function that can take a string and return a list of bigrams.
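Both example questions can be sketched in a few lines. The bigram function below assumes that words are separated by whitespace and that case should be ignored. The log format is not specified in the question, so `parse_log_line` assumes a hypothetical format of space-separated `key=value` tokens; a real interview question would define its own format.

```python
def bigrams(text):
    """Return the list of adjacent word pairs in text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def parse_log_line(line):
    """Parse 'key=value' tokens from one log line into a dictionary.

    Assumed (hypothetical) format: space-separated key=value pairs.
    """
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


print(bigrams("Have free hours and love children"))
print(parse_log_line("level=INFO user=42 action=login"))
```

Zipping the word list with itself shifted by one position avoids index arithmetic and naturally returns an empty list for inputs with fewer than two words.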
Data manipulation questions cover techniques for transforming data outside of Numpy or Pandas. This is common for data engineers designing ETLs that transform data between raw JSON and database reads.
Many times these types of problems will require grouping, sorting, or filtering data using lists, dictionaries, and other Python data structure types. These types of questions test your general knowledge of Python data munging outside of actual Pandas formatting.
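As a rough illustration of grouping, sorting, and filtering with plain Python data structures, the sketch below sums order amounts per user. The `orders` records, the field names, and the `total_by_user` function are all hypothetical and only stand in for whatever data an interviewer provides.

```python
from collections import defaultdict

# Hypothetical records, as they might look after reading raw JSON.
orders = [
    {"user": "ana", "amount": 30},
    {"user": "bob", "amount": 5},
    {"user": "ana", "amount": 20},
    {"user": "cat", "amount": 12},
]


def total_by_user(records, min_amount=0):
    """Drop orders below min_amount, then sum the remaining amounts per user."""
    totals = defaultdict(int)
    for record in records:
        if record["amount"] >= min_amount:  # filtering
            totals[record["user"]] += record["amount"]  # grouping
    # Sorting: largest total first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


print(total_by_user(orders, min_amount=10))
```

This is roughly what a Pandas filter followed by a groupby and sort would do, written out by hand with a dictionary and a sort key.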
Python Numpy and Matrices Problems
Many data science problems involve working with the Numpy library and matrices. These types of problems are not as common as the others but still show up. They involve using Numpy to run matrix multiplication, calculate the Jacobian determinant, and transform matrices in some way.
Example Python Matrix Problem:
- Given a 4x4 Numpy matrix, reverse the matrix.
- Add two Numpy matrices together.
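A short sketch of these operations follows. "Reverse the matrix" is ambiguous, so two interpretations are shown: reversing only the row order, and reversing along both axes. Which one is wanted is worth clarifying with the interviewer. The matrices `a` and `b` are made-up inputs.

```python
import numpy as np

a = np.arange(16).reshape(4, 4)  # 4x4 matrix holding 0..15
b = np.ones((4, 4), dtype=int)   # 4x4 matrix of ones

rows_reversed = a[::-1]          # reverse the order of the rows only
fully_reversed = np.flip(a)      # reverse along both axes
total = a + b                    # element-wise addition
product = a @ b                  # matrix multiplication

print(rows_reversed)
print(fully_reversed)
print(total)
print(product)
```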
Pandas Data Munging
Lastly, Pandas questions are starting to show up more and more in data science interviews. While Pandas can be used in many different ways in data science, including analytics questions similar to SQL problems, these kinds of Pandas questions revolve more around cleaning data.
This means problems like one-hot encoding variables, using the Pandas apply function to group different variables, and cleaning text in different columns.
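The three cleaning tasks mentioned above might look like the sketch below. The dataframe, its column names, and the warm/cool grouping rule are invented purely for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "city": ["  New York", "boston ", "BOSTON"],
    "color": ["green", "red", "green"],
})

# Text cleaning: strip surrounding whitespace and normalize case.
df["city"] = df["city"].str.strip().str.lower()

# apply: map each color to a coarser group with a Python function.
df["color_group"] = df["color"].apply(
    lambda c: "warm" if c in {"red", "yellow"} else "cool"
)

# One-hot encoding: one indicator column per distinct color.
encoded = pd.get_dummies(df, columns=["color"])

print(encoded)
```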
Suppose you have a dataframe of students with columns for each student's favorite color and grade.
Write code using Python Pandas to return the rows where the student's favorite color is green or yellow and their grade is above 90.
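One way to answer this is with a boolean mask. The data below is hypothetical, since the question's table is not reproduced here, and the column names `name`, `favorite_color`, and `grade` are assumptions.

```python
import pandas as pd

# Hypothetical data standing in for the table in the question.
students = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara", "Dev"],
    "favorite_color": ["green", "red", "yellow", "green"],
    "grade": [95, 92, 88, 91],
})

# Combine the two conditions; each side needs its own parentheses or method call
# because & binds more tightly than comparison operators.
mask = students["favorite_color"].isin(["green", "yellow"]) & (students["grade"] > 90)
result = students[mask]

print(result)
```

Using `isin` keeps the color condition readable and scales to longer lists of values better than chaining several equality checks with `|`.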
Python Data Science Interview Strategies
Practice. The easiest way to get better at Python data science interview questions is to do more practice problems. The more questions you practice and understand, the faster you'll find a strategy, because you start to pattern match and group similar problems together.
Clarify upfront. What packages or libraries are you allowed to use? Do you have to build an algorithm from scratch? What runtime are they looking for? Ask questions to understand the scope of the problem first to get a sense of where to start. The worst thing you could do is not clarify their expectations from the get-go!
Solve a simple problem first. This allows you to get an early win and build toward the larger scope of the problem. Additionally, if you have a solution but you know it's not the most efficient, write it out first anyway to get something on paper, and then work backwards to try to find a better one.
Think out loud and communicate. Talk about what you're doing and why. This helps with both your thought process and their understanding of what you're doing. That way you can make sure you and the interviewer are on the same page.
Admit if you don't know. If you don't remember a particular Python method, type, or concept, pretending that you do looks bad to the interviewer. Instead, just mention that you forgot and make an assumption so that the interviewer understands where you're coming from. If you're wrong, they will most likely correct you.
Slow down. Don't jump in headfirst and expect to do well. Take your time to think about the problem and solve it the way you would when you're practicing. Remember that you will most likely have plenty of time to solve the problem.