# Top 25 Data Science Coding Interview Questions for 2024

## Overview

Data science connects user intentions to business interests. The purpose of data science is to extract and analyze user data to help companies make informed decisions about strategy and product changes.

However, as a data scientist, you’ll also be reconstructing the often-broken bridge between technical and non-technical stakeholders with data visualizations and effective communication.

Most data science roles require strong coding skills for data manipulation, statistical forecast modeling, and automation.

To help you prepare for your upcoming data science interview, we’ve compiled a list of data science coding interview questions that you’ll find both challenging and useful.

## Basic Data Science Coding Interview Questions

We treat foundational coding problems, such as database and querying questions, as basic data science coding interview questions. Their difficulty matches the level that most well-known data science companies expect you to handle:

### 1. Write a function named `grades_colors` to select only the rows where the student’s favorite color is green or red and their grade is above 90.

`students_df` table

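| name | age | favorite_color | grade |
| --- | --- | --- | --- |
| Tim Voss | 19 | red | 91 |
| Nicole Johnson | 20 | yellow | 95 |
| Elsa Williams | 21 | green | 82 |
| John James | 20 | blue | 75 |
| Catherine Jones | 23 | green | 93 |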

Example:

Input:

``````python
import pandas as pd

students = {
    "name": ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"],
    "age": [19, 20, 21, 20, 23],
    "favorite_color": ["red", "yellow", "green", "blue", "green"],
    "grade": [91, 95, 82, 75, 93],
}

students_df = pd.DataFrame(students)
``````

Output:

``````
grades_colors(students_df) ->
``````

| name | age | favorite_color | grade |
| --- | --- | --- | --- |
| Tim Voss | 19 | red | 91 |
| Catherine Jones | 23 | green | 93 |
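One possible approach, sketched here rather than given as the reference answer, is to build two boolean masks and combine them. The sketch assumes `pandas` is installed and that the column names match the example input.

```python
import pandas as pd


def grades_colors(students_df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose favorite color is green or red and whose grade is above 90."""
    color_mask = students_df["favorite_color"].isin(["green", "red"])
    grade_mask = students_df["grade"] > 90  # strictly above 90
    return students_df[color_mask & grade_mask]


students_df = pd.DataFrame(
    {
        "name": ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"],
        "age": [19, 20, 21, 20, 23],
        "favorite_color": ["red", "yellow", "green", "blue", "green"],
        "grade": [91, 95, 82, 75, 93],
    }
)
result = grades_colors(students_df)
```

Combining the masks with `&` keeps the filter in a single vectorized pass instead of looping over rows.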

### 2. Write a query to find the `id` of suitable wines for this customer.

Let’s say you run a wine house. You have detailed information about the chemical composition of wines in a `wines` table.

One day, a customer comes asking specifically for a wine that has

• Alcohol content greater than or equal to 13%
• Ash content less than 2.4
• Color intensity less than 3

Note: All percentages are reported with two numbers before the decimal point; for example, 13.55% is represented as `13.55` instead of `0.1355`.

Example:

Input:

`wines` table

| Column | Type |
| --- | --- |
| id | INTEGER |
| alcohol | FLOAT |
| malic_acid | FLOAT |
| ash | FLOAT |
| alcalinity_of_ash | FLOAT |
| magnesium | INTEGER |
| total_phenols | FLOAT |
| flavanoids | FLOAT |
| nonflavanoid_phenols | FLOAT |
| proanthocyanins | FLOAT |
| color_intensity | FLOAT |
| hue | FLOAT |
| od280_or_od315_of_diluted_wines | FLOAT |
| proline | INTEGER |

Output:

| Column | Type |
| --- | --- |
| id | INTEGER |
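The three conditions map directly onto a `WHERE` clause. The sketch below runs one candidate query against an in-memory SQLite database through Python's standard `sqlite3` module; the table holds only the columns the filter needs, and the sample rows are made up for illustration.

```python
import sqlite3

# Candidate query: every condition from the customer's request must hold.
WINE_QUERY = """
SELECT id
FROM wines
WHERE alcohol >= 13
  AND ash < 2.4
  AND color_intensity < 3
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wines (id INTEGER, alcohol FLOAT, ash FLOAT, color_intensity FLOAT)"
)
# Hypothetical rows, not taken from the real dataset.
conn.executemany(
    "INSERT INTO wines VALUES (?, ?, ?, ?)",
    [
        (1, 13.55, 2.30, 2.80),  # meets all three conditions
        (2, 12.90, 2.10, 2.50),  # alcohol too low
        (3, 13.00, 2.40, 2.00),  # ash is not strictly below 2.4
        (4, 14.10, 2.20, 3.00),  # color intensity is not strictly below 3
    ],
)
suitable_ids = [row[0] for row in conn.execute(WINE_QUERY)]
```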

### 3. Write a query that returns all neighborhoods that have 0 users.

We’re given two tables, a `users` table with demographic information and the neighborhood they live in and a `neighborhoods` table.

Example:

Input:

`users` table

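| Column | Type |
| --- | --- |
| id | INTEGER |
| name | VARCHAR |
| neighborhood_id | INTEGER |
| created_at | DATETIME |

`neighborhoods` table

| Column | Type |
| --- | --- |
| id | INTEGER |
| name | VARCHAR |
| city_id | INTEGER |

Output:

| Column | Type |
| --- | --- |
| name | VARCHAR |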
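A common pattern for "rows with no match" is a `LEFT JOIN` followed by a `NULL` check on the joined side. The sketch below tries that idea on an in-memory SQLite database with Python's `sqlite3` module; the sample rows and names are invented for the demonstration.

```python
import sqlite3

# Neighborhoods that find no matching user keep NULLs in the users columns.
EMPTY_NEIGHBORHOODS_QUERY = """
SELECT n.name
FROM neighborhoods AS n
LEFT JOIN users AS u ON u.neighborhood_id = n.id
WHERE u.id IS NULL
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE neighborhoods (id INTEGER, name VARCHAR, city_id INTEGER)")
conn.execute(
    "CREATE TABLE users (id INTEGER, name VARCHAR, neighborhood_id INTEGER, created_at DATETIME)"
)
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO neighborhoods VALUES (?, ?, ?)",
    [(1, "North", 10), (2, "South", 10), (3, "East", 11)],
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "Ana", 1, "2024-01-01 10:00:00"),
        (2, "Ben", 1, "2024-01-02 11:00:00"),
        (3, "Cy", 3, "2024-01-03 12:00:00"),
    ],
)
empty_neighborhoods = [row[0] for row in conn.execute(EMPTY_NEIGHBORHOODS_QUERY)]
```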

### 4. Write an SQL query to select the 2nd highest salary in the engineering department.

Note: If more than one person shares the highest salary, the query should select the next highest salary.

Example:

Input:

`employees` table

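| Column | Type |
| --- | --- |
| id | INTEGER |
| first_name | VARCHAR |
| last_name | VARCHAR |
| salary | INTEGER |
| department_id | INTEGER |

`departments` table

| Column | Type |
| --- | --- |
| id | INTEGER |
| name | VARCHAR |

Output:

| Column | Type |
| --- | --- |
| salary | INTEGER |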
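One way to handle the tie rule is to rank distinct salaries, so a shared top salary counts once. The sketch below uses Python's `sqlite3` module with made-up rows; it assumes the department is stored under the exact name `'engineering'`, which the question does not confirm.

```python
import sqlite3

# DISTINCT collapses ties, so OFFSET 1 lands on the next distinct salary.
SECOND_SALARY_QUERY = """
SELECT DISTINCT e.salary
FROM employees AS e
JOIN departments AS d ON e.department_id = d.id
WHERE d.name = 'engineering'
ORDER BY e.salary DESC
LIMIT 1 OFFSET 1
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, first_name VARCHAR, last_name VARCHAR, "
    "salary INTEGER, department_id INTEGER)"
)
conn.execute("CREATE TABLE departments (id INTEGER, name VARCHAR)")
# Hypothetical sample data: two engineers share the top salary.
conn.executemany(
    "INSERT INTO departments VALUES (?, ?)", [(1, "engineering"), (2, "marketing")]
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A", "One", 100, 1),
        (2, "B", "Two", 100, 1),
        (3, "C", "Three", 90, 1),
        (4, "D", "Four", 80, 1),
        (5, "E", "Five", 95, 2),
    ],
)
second_salary = conn.execute(SECOND_SALARY_QUERY).fetchone()[0]
```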

### 5. Given a table of bank transactions with columns `id`, `transaction_value`, and `created_at` representing the date and time for each transaction, write a query to get the last transaction for each day.

The output should include the ID of the transaction, datetime of the transaction, and the transaction amount. Order the transactions by datetime.

Example:

Input:

`bank_transactions` table

| Column | Type |
| --- | --- |
| id | INTEGER |
| created_at | DATETIME |
| transaction_value | FLOAT |

Output:

| Column | Type |
| --- | --- |
| created_at | DATETIME |
| transaction_value | FLOAT |
| id | INTEGER |
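A possible approach is to find the latest timestamp per calendar day in a subquery and keep only the rows matching those timestamps. The sketch below runs on SQLite through Python's `sqlite3` module with invented rows; it assumes timestamps are unique and stored as ISO-formatted text, so that `DATE()` and `MAX()` behave as expected.

```python
import sqlite3

# The subquery yields one timestamp per day: the latest one.
LAST_TRANSACTION_QUERY = """
SELECT created_at, transaction_value, id
FROM bank_transactions
WHERE created_at IN (
    SELECT MAX(created_at)
    FROM bank_transactions
    GROUP BY DATE(created_at)
)
ORDER BY created_at
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bank_transactions (id INTEGER, created_at DATETIME, transaction_value FLOAT)"
)
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO bank_transactions VALUES (?, ?, ?)",
    [
        (1, "2020-01-01 09:00:00", 10.0),
        (2, "2020-01-01 17:30:00", 20.0),
        (3, "2020-01-02 08:15:00", 5.5),
        (4, "2020-01-02 12:00:00", 7.0),
        (5, "2020-01-03 23:59:59", 1.0),
    ],
)
last_transactions = conn.execute(LAST_TRANSACTION_QUERY).fetchall()
```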

## Python Data Science Coding Interview Questions

A fundamental requirement to succeed as a data scientist involves demonstrating a problem-solving approach and applying algorithm and coding skills to resolve real-world analytical challenges.

To ascertain your coding abilities, data science interviewers typically use Python interview questions. Here are some of them:

### 6. Given two sorted lists, write a function to merge them into one sorted list.

Bonus: What’s the time complexity?

Example:

Input:

``````python
list1 = [1,2,5]
list2 = [2,4,6]
``````

Output:

``````
merge_list(list1, list2) -> [1,2,2,4,5,6]
``````
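A two-pointer merge is one standard way to approach this. The sketch below walks both lists once, so it runs in O(n + m) time for lists of length n and m, which also answers the bonus question for this particular approach.

```python
def merge_list(list1, list2):
    """Merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller of the two front elements.
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    # At most one of these slices is non-empty.
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged
```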

### 8. You are given two rectangles `a` and `b`, each defined by four ordered pairs denoting their corners on the `x`, `y` plane. Write a function `rectangle_overlap` to determine whether or not they overlap. Return `True` if so, and `False` otherwise.

Note: If the two rectangles border one another or share a corner like two diagonally adjacent positions on a chessboard, they are said to overlap.

Note: The lists of ordered pairs are in no particular order. The first entry in list `a` could be the top left corner, while the first in list `b` is the bottom right.

Example:

Input:

``````python
a = [(-3,5), (-3,2), (0,5), (0,2)]
b = [(-1,4), (3,4), (3,1), (-1,1)]
``````

Output:

``````
rectangle_overlap(a, b) -> True
``````

Point `(0,2)` is fully contained in rectangle `b`, and point `(-1,4)` is fully contained in rectangle `a`.
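A minimal sketch follows, assuming both rectangles are axis-aligned (as in the example), so each one can be reduced to its minimum and maximum `x` and `y`. The helper `bounds` is a hypothetical name introduced for illustration. The comparisons are inclusive because shared borders and corners count as overlap.

```python
def bounds(corners):
    """Return (min_x, max_x, min_y, max_y) regardless of corner order."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), max(xs), min(ys), max(ys)


def rectangle_overlap(a, b):
    a_x0, a_x1, a_y0, a_y1 = bounds(a)
    b_x0, b_x1, b_y0, b_y1 = bounds(b)
    # The rectangles overlap when their x-intervals and y-intervals both intersect.
    return a_x0 <= b_x1 and b_x0 <= a_x1 and a_y0 <= b_y1 and b_y0 <= a_y1
```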

### 9. You have an array of integers, `nums` of length `n` spanning `0` to `n` with one missing. Write a function `missing_number` that returns the missing number in the array.

Note: The complexity of O(n) is required.

Example:

Input:

``````
nums = [0,1,2,4,5]
missing_number(nums) -> 3
``````
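One way to meet the O(n) requirement is to compare the expected sum of `0` through `n` with the actual sum of the array. This sketch assumes the input really does contain every value in that range except one.

```python
def missing_number(nums):
    n = len(nums)  # the full range 0..n has n + 1 values
    expected_total = n * (n + 1) // 2
    # A single pass over nums: O(n) time, O(1) extra space.
    return expected_total - sum(nums)
```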

### 10. The probability that it will rain tomorrow is dependent on whether or not it is raining today and whether or not it rained yesterday. Given that it is raining today and that it rained yesterday, write a function `rain_days` to calculate the probability that it will rain on the nth day after today.

Example:

Input:

``````python
n = 5
``````

Output:

``````
rain_days(n) -> 0.39968
``````

## Data Structures Interview Questions

In addition to algorithms and coding, data structure fundamentals—especially trees, lists, and maps—also contribute to successful data science projects. We have a plethora of data structure interview questions in our database; some of which are:

### 11. Given two strings `A` and `B`, write a function `can_shift` to return whether or not `A` can be shifted some number of places to get `B`.

Example:

Input:

``````python
A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True

A = 'abc'
B = 'acb'
can_shift(A, B) == False
``````
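A compact way to test for a shift: every rotation of `A` appears as a substring of `A + A`, so it is enough to check the lengths and then look for `B` there. The following is a sketch of that idea, not necessarily the expected reference answer.

```python
def can_shift(A, B):
    # Equal length is required; otherwise a shorter B could match by accident.
    return len(A) == len(B) and B in A + A
```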

### 12. Build a random forest model from scratch with the following conditions:

• The model takes as input a dataframe `data` and an array `new_point` with a length equal to the number of fields in the `data`.
• All values of both `data` and `new_point` are `0` or `1`, i.e., all fields are dummy variables and there are only two classes.
• Rather than randomly deciding what subspace of the data each tree in the forest will use, like usual, make your forest out of decision trees that go through every permutation of the value columns of the data frame. Split the data according to the value seen in `new_point` for that column.
• Return the majority vote on the class of `new_point`.
• You may use `pandas` and `NumPy` but NOT `scikit-learn`.

Bonus: The `permutations` function in the `itertools` module can help you easily get all permutations of any iterable object.

Example:

Input:

``````
new_point = [0,1,0,1]
print(data)
...
    Var1  Var2  Var3  Var4  Target
0    1.0   1.0   1.0   0.0       1
1    0.0   0.0   0.0   0.0       0
2    1.0   0.0   1.0   0.0       0
3    0.0   1.0   1.0   1.0       1
4    1.0   0.0   1.0   0.0       0
..   ...   ...   ...   ...     ...
95   0.0   1.0   0.0   1.0       0
96   1.0   1.0   0.0   0.0       0
97   0.0   0.0   1.0   1.0       0
98   1.0   0.0   0.0   0.0       0
99   0.0   1.0   0.0   0.0       0

[100 rows x 5 columns]
``````

Output:

``````
random_forest(new_point, data) -> 0
``````

### 13. Write a function `find_intersecting` to find which lines, if any, intersect with any of the others in the given `x_range`.

Say you are given a list of tuples where the first element is the slope of a line and the second element is the y-intercept of a line.

Example:

Input:

``````python
tuple_list = [(2, 3), (-3, 5), (4, 6), (5, 7)]
x_range = (0, 1)
``````

Output:

``````
find_intersecting(tuple_list, x_range) -> [(2,3), (-3,5)]
``````
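A straightforward sketch compares every pair of lines: two lines `y = m1*x + b1` and `y = m2*x + b2` with different slopes meet at `x = (b2 - b1) / (m1 - m2)`. The sketch assumes the endpoints of `x_range` are inclusive and skips pairs with equal slopes, two choices the question leaves open.

```python
from itertools import combinations


def find_intersecting(tuple_list, x_range):
    x_min, x_max = x_range
    hits = set()  # indices of lines that cross at least one other line
    for (i, (m1, b1)), (j, (m2, b2)) in combinations(enumerate(tuple_list), 2):
        if m1 == m2:
            continue  # parallel (or identical) lines are skipped in this sketch
        x = (b2 - b1) / (m1 - m2)
        if x_min <= x <= x_max:
            hits.update((i, j))
    # Preserve the input order of the lines.
    return [line for idx, line in enumerate(tuple_list) if idx in hits]
```

The pairwise check is O(n²) in the number of lines, which is usually acceptable for interview-sized inputs.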

### 14. Build a k-nearest neighbors classification model from scratch with the following conditions:

• Use Euclidean distance (a.k.a. the “2 norm”) as your closeness metric.
• Your function should be able to handle data frames with an arbitrary number of rows and columns.
• If there is a tie in the class of the k-nearest neighbors, rerun the search using k-1 neighbors instead.
• You may use `pandas` and `NumPy` but NOT `scikit-learn`.

Example:

Input:

``````
k = 5
new_point = [0.5,-2,8]
print(data)
...
        Var1      Var2      Var3  Target
0  -3.279536  3.362223  2.847892       2
1  -0.791565  1.742475  2.151587       2
2  -0.785992 -0.938681 -0.459770       0
3  -1.068190  1.461051  0.127130       3
4  -0.367568 -0.870240 -0.225734       0
..       ...       ...       ...     ...
95 -1.327175  1.971085 -0.690689       2
96 -3.203714  1.847649  0.778901       2
97 -0.587640  0.647458  2.094385       2
98  0.363644 -0.509795  2.514191       1
99 -0.673498  2.955285  2.102122       4

[100 rows x 4 columns]
``````

Output:

``````
kNN(k, new_point, data) -> 2
``````
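The conditions above could be sketched as follows. The sketch assumes `pandas` and `NumPy` are installed, that the label column is called `Target` as in the example, and that every other column is a numeric feature. The tiny data frame at the end is made up to exercise the tie rule.

```python
import numpy as np
import pandas as pd


def kNN(k, new_point, data):
    features = data.drop(columns="Target").to_numpy(dtype=float)
    targets = data["Target"].to_numpy()
    # Euclidean (2-norm) distance from every row to the new point.
    distances = np.linalg.norm(features - np.asarray(new_point, dtype=float), axis=1)
    order = np.argsort(distances, kind="stable")
    while k > 0:
        classes, counts = np.unique(targets[order[:k]], return_counts=True)
        winners = classes[counts == counts.max()]
        if len(winners) == 1:
            return winners[0]
        k -= 1  # tie: retry with one fewer neighbor
    return None  # only reached when data is empty or k was not positive


# Hypothetical toy data: with k=2 the vote ties, so the search falls back to k=1.
toy = pd.DataFrame({"Var1": [0.0, 1.0, 10.0], "Var2": [0.0, 0.0, 10.0], "Target": [0, 1, 1]})
prediction = kNN(2, [0.1, 0.0], toy)
```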

### 15. Given a dictionary with keys of letters and values of a list of letters, write a function `closest_key` to find the key with the input value closest to the beginning of the list.

Example:

Input:

``````python
dictionary = {
    'a' : ['b','c','e'],
    'm' : ['c','e'],
}
input = 'c'
``````

Output:

``````
closest_key(dictionary, input) -> 'm'
``````

`c` is at index 1 in the list for `a` and at index 0 in the list for `m`. Hence, the closest key for `c` is `m`.
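A minimal sketch of one approach: scan each key's list, record the index of the value where it appears, and keep the key with the smallest index. The second parameter is named `value` here (an illustrative choice) to avoid shadowing Python's built-in `input`. Keys whose lists lack the value are ignored, and `None` is returned when no list contains it, which is an assumption the question does not spell out.

```python
def closest_key(dictionary, value):
    best_key, best_index = None, None
    for key, letters in dictionary.items():
        if value in letters:
            idx = letters.index(value)  # distance from the start of the list
            if best_index is None or idx < best_index:
                best_key, best_index = key, idx
    return best_key
```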

## NumPy Data Science Coding Interview Questions

NumPy is a fundamental Python library for scientific computing that provides high-performance multidimensional array objects and tools for working with these arrays. For mathematical calculations on large datasets, it is an upgrade over Python’s built-in lists. We have an extensive list of NumPy interview questions, some of which are discussed here:

### 19. Given a list of integers, write a function `gcd` to find the greatest common divisor between them.

Example:

Input:

``````python
int_list = [8, 16, 24]
``````

Output:

``````
gcd(int_list) -> 8
``````
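Because the greatest common divisor is associative, one simple sketch folds the two-argument `math.gcd` from the standard library across the list. It assumes the list is non-empty.

```python
import math
from functools import reduce


def gcd(int_list):
    # gcd(a, b, c) == gcd(gcd(a, b), c), so a left fold is enough.
    return reduce(math.gcd, int_list)
```

In an interview you may be asked to write the pairwise step yourself with Euclid's algorithm instead of calling `math.gcd`.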

## Machine Learning Data Science Coding Interview Questions

Machine learning aids data scientists when they need to gather information faster and assists with trend analysis. While your involvement in building or “coding” ML models will be determined by the company and the type of role you hold, data scientists are generally not expected to approach machine learning interview questions from a strict development standpoint. However, you may be expected to answer algorithm coding questions, such as:

### 21. Write a function `search_list` that returns a Boolean indicating whether the `target` value is in the `linked_list`.

You receive the head of the linked list, which is a dictionary with the following keys:  `value` (contains the value of the node) and `next` (contains the next node in the list, or `None`).

If the linked list is empty, you’ll receive `None` since there is no head node for an empty list.

Example:

Input:

``````
target = 2
linked_list = 3 -> 2 -> 5 -> 6 -> 8 -> None
``````

Output:

``````
search_list(target, linked_list) -> True
``````
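A simple traversal is enough here. In the sketch below, `build_linked_list` is a hypothetical helper added only to construct the dictionary-based nodes described above for testing.

```python
def search_list(target, linked_list):
    node = linked_list  # head node, or None for an empty list
    while node is not None:
        if node["value"] == target:
            return True
        node = node["next"]
    return False


def build_linked_list(values):
    """Hypothetical helper: build nested dict nodes from a Python list."""
    head = None
    for value in reversed(values):
        head = {"value": value, "next": head}
    return head


example = build_linked_list([3, 2, 5, 6, 8])
```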

### 22. Given two strings `A` and `B`, write a function `can_shift` to return whether or not `A` can be shifted some number of places to get `B`.

Example:

Input:

``````python
A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True

A = 'abc'
B = 'acb'
can_shift(A, B) == False
``````

### 23. You’re given two words, `begin_word` and `end_word` which are elements of `word_list`.

Write a function `shortest_transformation` to find the length of the shortest transformation sequence from `begin_word` to `end_word` through the elements of `word_list`.

Note: Only one letter can be changed at a time, and each transformed word in the list must exist inside of `word_list`.

Note: In all test cases, a path does exist between `begin_word` and `end_word`.

Example:

Input:

``````python
begin_word = "same"
end_word = "cost"
word_list = ["same","came","case","cast","lost","last","cost"]
``````

Output:

``````
shortest_transformation(begin_word, end_word, word_list) -> 5
``````

Since the transformation sequence would be:

``````
'same' -> 'came' -> 'case' -> 'cast' -> 'cost'
``````

which is five elements long.
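Breadth-first search is a natural fit, since it reaches each word by the fewest possible one-letter changes. The sketch below counts the words in the sequence, matching the example, and uses a hypothetical helper `differs_by_one`. Returning `0` when no path exists is an assumption, since the question guarantees that a path is always present.

```python
from collections import deque


def differs_by_one(a, b):
    """True when the two words have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def shortest_transformation(begin_word, end_word, word_list):
    queue = deque([(begin_word, 1)])  # (current word, words used so far)
    visited = {begin_word}
    while queue:
        word, length = queue.popleft()
        if word == end_word:
            return length
        for candidate in word_list:
            if candidate not in visited and differs_by_one(word, candidate):
                visited.add(candidate)
                queue.append((candidate, length + 1))
    return 0  # fallback for the no-path case, which the question rules out
```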

### 24. Given two strings, `string1` and `string2`, write a function `max_substring` to return the maximal substring shared by both strings.

Example:

Input:

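``````python
string1 = 'mississippi'

string2 = 'mossyistheapple'
``````

Output:

``````
max_substring(string1, string2) -> 'mssispp'
``````

Note: As the example output shows, the shared characters must appear in the same order in both strings but do not need to be contiguous. If there are multiple max substrings with the same length, just return any one of them.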
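The example output `'mssispp'` is not contiguous in either string, so this sketch treats the task as a longest common subsequence problem, which is an interpretation inferred from the example. It fills a dynamic programming table of suffix lengths and then walks it forward to rebuild one answer.

```python
def max_substring(string1, string2):
    rows, cols = len(string1), len(string2)
    # lengths[i][j] = length of the longest common subsequence
    # of string1[i:] and string2[j:].
    lengths = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if string1[i] == string2[j]:
                lengths[i][j] = 1 + lengths[i + 1][j + 1]
            else:
                lengths[i][j] = max(lengths[i + 1][j], lengths[i][j + 1])

    # Rebuild one optimal subsequence by following the table.
    result = []
    i = j = 0
    while i < rows and j < cols:
        if string1[i] == string2[j]:
            result.append(string1[i])
            i += 1
            j += 1
        elif lengths[i + 1][j] >= lengths[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(result)
```

The table costs O(n·m) time and space for strings of length n and m.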

### 25. Given a sorted list of integers `ints` with no duplicates, write an efficient function `nearest_entries` that takes in integers `N` and `k`.

Additionally, it should do the following:

• Finds the element of the list that is closest to N.
• Then returns that element along with the k-next and k-previous elements of the list.
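Since the list is sorted, binary search can locate the closest element in O(log n). The sketch below uses the standard `bisect` module and makes several assumptions the question leaves open: the function takes the list as its first argument, ties go to the smaller element, and the window is truncated at the ends of the list.

```python
from bisect import bisect_left


def nearest_entries(ints, N, k):
    if not ints:
        return []
    pos = bisect_left(ints, N)  # first index whose value is >= N
    if pos == len(ints):
        closest = pos - 1  # N is larger than every element
    elif pos == 0:
        closest = 0  # N is smaller than or equal to the first element
    else:
        # Compare the neighbors on both sides; ties go to the smaller one.
        closest = pos if ints[pos] - N < N - ints[pos - 1] else pos - 1
    start = max(0, closest - k)
    return ints[start : closest + k + 1]
```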

## Tips to Ace Data Science Coding Interview Questions

To excel in data science coding interviews, focus on a strong foundation in data structures and algorithms.

Practice coding regularly on our platform and utilize our AI Interviewer feature. Understand the trade-offs between different approaches and articulate your thought process clearly, especially in ML coding questions.

Emphasize code readability, efficiency, and test case considerations.

Additionally, delve deep into Python libraries like NumPy, pandas, and scikit-learn for efficient data manipulation and modeling. All the best!