Data science connects user intentions to business interests. The purpose of data science is to extract and analyze user data to help companies make informed decisions about strategy and product changes.

However, as a data scientist, you’ll also be reconstructing the often-broken bridge between technical and non-technical stakeholders with data visualizations and effective communication.

Most data science roles, with few exceptions, require you to be highly proficient in coding so you can manipulate data, design statistical forecasting models, and automate tasks.

To help with that and reinforce your preparation for an upcoming data science interview, we’ve compiled a list of data science coding interview questions that you’ll find both challenging and useful.

We treat foundational coding problems, such as databases and querying, as basic data science coding interview questions. The questions are pitched at the competitive level that most well-known data science companies expect you to perform at:

Write a function named **grades_colors** to select only the rows where the student’s favorite color is green or red and their grade is above 90.

**students_df** table

name | age | favorite_color | grade |
---|---|---|---|
Tim Voss | 19 | red | 91 |
Nicole Johnson | 20 | yellow | 95 |
Elsa Williams | 21 | green | 82 |
John James | 20 | blue | 75 |
Catherine Jones | 23 | green | 93 |

**Example:**

**Input:**

```
import pandas as pd
students = {"name" : ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"], "age" : [19, 20, 21, 20, 23], "favorite_color" : ["red", "yellow", "green", "blue", "green"], "grade" : [91, 95, 82, 75, 93]}
students_df = pd.DataFrame(students)
```

**Output:**

```
def grades_colors(students_df) ->
```

name | age | favorite_color | grade |
---|---|---|---|
Tim Voss | 19 | red | 91 |
Catherine Jones | 23 | green | 93 |
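One possible approach, sketched with pandas boolean masks. It assumes the column names shown in the `students_df` table and rebuilds the sample data so the block runs on its own; it is an illustration rather than the only accepted answer.

```python
import pandas as pd


def grades_colors(students_df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose favorite color is green or red and whose grade is above 90."""
    color_ok = students_df["favorite_color"].isin(["green", "red"])
    grade_ok = students_df["grade"] > 90
    # Combine both boolean masks element-wise and use them to filter rows
    return students_df[color_ok & grade_ok]


# Sample data copied from the example input
students_df = pd.DataFrame({
    "name": ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"],
    "age": [19, 20, 21, 20, 23],
    "favorite_color": ["red", "yellow", "green", "blue", "green"],
    "grade": [91, 95, 82, 75, 93],
})
result = grades_colors(students_df)
print(result)
```

The two masks are kept separate for readability; they could also be written inline as a single expression.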

Let’s say you run a wine house. You have detailed information about the chemical composition of wines in a **wines** table.

One day, a customer comes asking specifically for a wine that has:

- Alcohol content greater than or equal to 13%
- Ash content less than 2.4
- Color intensity less than 3

Return the `id` of suitable wines for this customer.

**Note:** All percentages are reported with two numbers before the decimal point; for example, 13.55% is represented as **13.55** instead of `0.1355`.

**Example:**

**Input:**

**wines** table

Column | Type |
---|---|
id | INTEGER |
alcohol | FLOAT |
malic_acid | FLOAT |
ash | FLOAT |
alcalinity_of_ash | FLOAT |
magnesium | INTEGER |
total_phenols | FLOAT |
flavanoids | FLOAT |
nonflavanoid_phenols | FLOAT |
proanthocyanins | FLOAT |
color_intensity | FLOAT |
hue | FLOAT |
od280_or_od315_of_diluted_wines | FLOAT |
proline | INTEGER |

**Output:**

Column | Type |
---|---|
id | INTEGER |
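The three conditions map directly onto a `WHERE` clause. The sketch below runs such a query through Python’s built-in `sqlite3` module so it can be tried locally; the four rows are hypothetical values invented for illustration, and only the columns the filter needs are created.

```python
import sqlite3

# The query itself: one predicate per customer requirement
QUERY = """
SELECT id
FROM wines
WHERE alcohol >= 13
  AND ash < 2.4
  AND color_intensity < 3
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wines (id INTEGER, alcohol FLOAT, ash FLOAT, color_intensity FLOAT)"
)
# Hypothetical rows, not taken from any real dataset
conn.executemany(
    "INSERT INTO wines VALUES (?, ?, ?, ?)",
    [
        (1, 13.55, 2.1, 2.5),  # satisfies all three conditions
        (2, 12.90, 2.0, 2.0),  # alcohol too low
        (3, 14.10, 2.6, 2.8),  # ash too high
        (4, 13.00, 2.3, 3.4),  # color intensity too high
    ],
)
suitable_ids = [row[0] for row in conn.execute(QUERY)]
print(suitable_ids)  # [1]
```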

We’re given two tables: a **users** table with demographic information and the neighborhood they live in, and a `neighborhoods` table.

**Example:**

**Input:**

**users** table

Columns | Type |
---|---|
id | INTEGER |
name | VARCHAR |
neighborhood_id | INTEGER |
created_at | DATETIME |

**neighborhoods** table

Columns | Type |
---|---|
id | INTEGER |
name | VARCHAR |
city_id | INTEGER |

**Output:**

Columns | Type |
---|---|
name | VARCHAR |

**Note:** If more than one person shares the highest salary, the query should select the next highest salary.

**Example:**

**Input:**

**employees** table

Column | Type |
---|---|
id | INTEGER |
first_name | VARCHAR |
last_name | VARCHAR |
salary | INTEGER |
department_id | INTEGER |

**departments** table

Column | Type |
---|---|
id | INTEGER |
name | VARCHAR |

**Output:**

Column | Type |
---|---|
salary | INTEGER |

We’re given a **bank_transactions** table with the columns `id`, `transaction_value`, and `created_at`.

The output should include the ID of the transaction, datetime of the transaction, and the transaction amount. Order the transactions by datetime.

**Example:**

**Input:**

**bank_transactions** table

Column | Type |
---|---|
id | INTEGER |
created_at | DATETIME |
transaction_value | FLOAT |

**Output:**

Column | Type |
---|---|
created_at | DATETIME |
transaction_value | FLOAT |
id | INTEGER |

A fundamental requirement to succeed as a data scientist involves demonstrating a problem-solving approach and applying algorithm and coding skills to resolve real-world analytical challenges.

To ascertain your coding abilities, data science interviewers typically use Python interview questions. Here are some of them:

*Bonus: What’s the time complexity?*

**Example:**

**Input:**

```
list1 = [1,2,5]
list2 = [2,4,6]
```

**Output:**

```
def merge_list(list1,list2) -> [1,2,2,4,5,6]
```
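The example suggests the task is to merge two already-sorted lists into a single sorted list. Under that assumption, a two-pointer sketch might look like this:

```python
def merge_list(list1, list2):
    """Merge two sorted lists into one sorted list (inputs assumed sorted)."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller head element of the two lists
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    # At most one of these slices is non-empty
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged


print(merge_list([1, 2, 5], [2, 4, 6]))  # [1, 2, 2, 4, 5, 6]
```

Each element is visited once, so this version runs in O(n + m) time for lists of length n and m, which answers the bonus question for this particular approach.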

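You are given two rectangles, `a` and `b`, each defined by four ordered pairs denoting their corners on the `x`, `y` plane. Write a function `rectangle_overlap` to determine whether or not they overlap. Return `True` if so, and `False` otherwise.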

**Note:** *If the two rectangles border one another or share a corner like two diagonally adjacent positions on a chessboard, they are said to overlap.*

**Note:** *The lists of ordered pairs are in no particular order. The first entry in list a could be the top left corner, while the first in list b is the bottom right.*

**Example:**

**Input:**

```
a = [(-3,5), (-3,2),(0,5),(0,2)]
b = [(-1,4), (3,4), (3,1), (-1,1)]
```

**Output:**

```
def rectangle_overlap(a, b) -> True
```

Point **(0,2)** is fully contained in rectangle `b`, and point `(-1,4)` is fully contained in rectangle `a`.
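A minimal sketch of one approach, assuming both rectangles are axis-aligned (as they are in the example): reduce each rectangle to its minimum and maximum `x` and `y`, then check that the intervals intersect on both axes. Non-strict comparisons are used so that shared borders and corners count as overlap, as the note requires.

```python
def rectangle_overlap(a, b):
    """Return True if axis-aligned rectangles a and b overlap or touch."""

    def bounds(corners):
        # Corner order does not matter: only the extremes are used
        xs = [x for x, _ in corners]
        ys = [y for _, y in corners]
        return min(xs), max(xs), min(ys), max(ys)

    a_x0, a_x1, a_y0, a_y1 = bounds(a)
    b_x0, b_x1, b_y0, b_y1 = bounds(b)
    # Intervals must intersect on the x axis and on the y axis
    return a_x0 <= b_x1 and b_x0 <= a_x1 and a_y0 <= b_y1 and b_y0 <= a_y1


a = [(-3, 5), (-3, 2), (0, 5), (0, 2)]
b = [(-1, 4), (3, 4), (3, 1), (-1, 1)]
print(rectangle_overlap(a, b))  # True
```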

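You have an array of integers, `nums`, of length `n`, spanning `0` to `n` with one number missing. Write a function `missing_number` that returns the missing number in the array.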

*Note: The complexity of O(n) is required.*

**Example:**

**Input:**

```
nums = [0,1,2,4,5]
missing_number(nums) -> 3
```

Given that it is raining today and rained yesterday, write a function **rain_days** to calculate the probability that it will rain on the nth day after today.

**Example:**

**Input:**

```
n=5
```

**Output:**

```
def rain_days(n) -> 0.39968
```

In addition to algorithms and coding, data structure fundamentals (especially trees, lists, and maps) also contribute to successful data science projects. We have a plethora of data structure interview questions in our database; here are some of them:

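Given two strings, `A` and `B`, write a function `can_shift` to return whether or not `A` can be shifted some number of places to get `B`.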

**Example:**

**Input:**

```
A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True
A = 'abc'
B = 'acb'
can_shift(A, B) == False
```
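A common trick for this kind of rotation check, shown here as a sketch: every rotation of `A` appears as a substring of `A + A`, so it is enough to compare lengths and then search for `B` in the doubled string.

```python
def can_shift(A, B):
    """Return True if B is a rotation (cyclic shift) of A."""
    # Equal length is required; every rotation of A occurs inside A + A
    return len(A) == len(B) and B in A + A


print(can_shift("abcde", "cdeab"))  # True
print(can_shift("abc", "acb"))      # False
```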

- The model takes as input a dataframe `data` and an array `new_point` with a length equal to the number of fields in the `data`.
- All values of both `data` and `new_point` are `0` or `1`, i.e., all fields are dummy variables and there are only two classes.
- Rather than randomly deciding what subspace of the data each tree in the forest will use, like usual, make your forest out of decision trees that go through every permutation of the value columns of the data frame. Split the data according to the value seen in `new_point` for that column.
- Return the majority vote on the class of `new_point`.
- You may use `pandas` and `NumPy` but **NOT** `scikit-learn`.

**Bonus:** *The* `permutations` *function in the* `itertools` *package can help you easily get all permutations of any iterable object.*

**Example:**

**Input:**

```
new_point = [0,1,0,1]
print(data)
...
Var1 Var2 Var3 Var4 Target
0 1.0 1.0 1.0 0.0 1
1 0.0 0.0 0.0 0.0 0
2 1.0 0.0 1.0 0.0 0
3 0.0 1.0 1.0 1.0 1
4 1.0 0.0 1.0 0.0 0
.. ... ... ... ... ...
95 0.0 1.0 0.0 1.0 0
96 1.0 1.0 0.0 0.0 0
97 0.0 0.0 1.0 1.0 0
98 1.0 0.0 0.0 0.0 0
99 0.0 1.0 0.0 0.0 0
[100 rows x 5 columns]
```

**Output:**

```
def random_forest(new_point, data) -> 0
```
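The conditions above can be sketched roughly as follows. This is one interpretation, not a reference solution: it assumes the label column is called `Target` (as in the printed data), that a tree stops splitting as soon as no training rows match `new_point`, and that ties are broken toward the smaller class label at a leaf. The toy dataframe is hypothetical.

```python
from collections import Counter
from itertools import permutations

import pandas as pd


def random_forest(new_point, data, target="Target"):
    """Majority vote over one tree per permutation of the value columns."""
    value_cols = [col for col in data.columns if col != target]
    point = dict(zip(value_cols, new_point))
    votes = []
    for column_order in permutations(value_cols):
        subset = data
        for col in column_order:
            narrowed = subset[subset[col] == point[col]]
            if narrowed.empty:
                break  # assumption: stop and vote with the rows still left
            subset = narrowed
        # Leaf prediction: most frequent class among the remaining rows
        votes.append(int(subset[target].mode().iloc[0]))
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy data, much smaller than the 100-row example
toy = pd.DataFrame({
    "Var1": [0, 0, 1, 1],
    "Var2": [1, 1, 0, 0],
    "Target": [1, 1, 0, 0],
})
print(random_forest([0, 1], toy))  # 1
```

With `d` value columns the forest has `d!` trees, so this brute-force construction is only practical for a handful of columns.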

Say you are given a list of tuples where the first element is the slope of a line and the second element is the y-intercept of a line.

Write a function `find_intersecting` to find which lines, if any, intersect with any of the others within the given `x_range`.

**Example:**

**Input:**

```
tuple_list = [(2, 3), (-3, 5), (4, 6), (5, 7)]
x_range = (0, 1)
```

**Output:**

```
def find_intersecting(tuple_list, x_range) -> [(2,3), (-3,5)]
```
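A brute-force sketch: for each pair of lines `y = m1*x + b1` and `y = m2*x + b2`, solve for the crossing point `x = (b2 - b1) / (m1 - m2)` and keep both lines if that `x` falls inside the range. Treating the range endpoints as inclusive and skipping parallel (or identical) lines are assumptions made here, since the prompt does not specify them.

```python
from itertools import combinations


def find_intersecting(tuple_list, x_range):
    """Return the lines (slope, intercept) that cross another line within x_range."""
    low, high = x_range
    hits = set()
    for (m1, b1), (m2, b2) in combinations(tuple_list, 2):
        if m1 == m2:
            continue  # parallel lines never cross; identical lines are ignored here
        x = (b2 - b1) / (m1 - m2)
        if low <= x <= high:
            hits.add((m1, b1))
            hits.add((m2, b2))
    # Preserve the input order in the result
    return [line for line in tuple_list if line in hits]


tuple_list = [(2, 3), (-3, 5), (4, 6), (5, 7)]
print(find_intersecting(tuple_list, (0, 1)))  # [(2, 3), (-3, 5)]
```

Checking every pair costs O(n²) for n lines, which is fine for short lists like the one in the example.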

- Use Euclidean distance (a.k.a., the “2 norm”) as your closeness metric.
- Your function should be able to handle data frames with an arbitrary number of rows and columns.
- If there is a tie in the class of the k-nearest neighbors, rerun the search using k-1 neighbors instead.
- You may use `pandas` and `NumPy` but *NOT* `scikit-learn`.

**Example:**

**Input:**

```
k = 5
new_point = [0.5,-2,8]
print(data)
...
Var1 Var2 Var3 Target
0 -3.279536 3.362223 2.847892 2
1 -0.791565 1.742475 2.151587 2
2 -0.785992 -0.938681 -0.459770 0
3 -1.068190 1.461051 0.127130 3
4 -0.367568 -0.870240 -0.225734 0
.. ... ... ... ...
95 -1.327175 1.971085 -0.690689 2
96 -3.203714 1.847649 0.778901 2
97 -0.587640 0.647458 2.094385 2
98 0.363644 -0.509795 2.514191 1
99 -0.673498 2.955285 2.102122 4
[100 rows x 4 columns]
```

**Output:**

```
def kNN(k, new_point, data) -> 2
```
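The conditions above could be implemented along these lines. The sketch assumes the label column is named `Target`, as in the printed data, and interprets "tie" as two or more classes sharing the top vote count; the toy dataframe is hypothetical.

```python
from collections import Counter

import numpy as np
import pandas as pd


def kNN(k, new_point, data, target="Target"):
    """Classify new_point by majority vote among its k nearest rows of data."""
    if k < 1 or len(data) == 0:
        raise ValueError("k must be at least 1 and data must not be empty")
    features = data.drop(columns=[target]).to_numpy(dtype=float)
    labels = data[target].to_numpy()
    # Euclidean (2-norm) distance from every row to new_point
    distances = np.linalg.norm(features - np.asarray(new_point, dtype=float), axis=1)
    nearest_first = np.argsort(distances, kind="stable")
    while True:
        counts = Counter(labels[nearest_first[:k]].tolist()).most_common()
        # A single class, or a strict winner, settles the vote
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0]
        k -= 1  # tie between the top classes: retry with k-1 neighbors


# Hypothetical toy data with two well-separated clusters
toy = pd.DataFrame({
    "Var1": [0.0, 0.0, 1.0, 5.0, 5.0, 6.0],
    "Var2": [0.0, 1.0, 0.0, 5.0, 6.0, 5.0],
    "Target": [0, 0, 0, 1, 1, 1],
})
print(kNN(3, [0.2, 0.2], toy))  # 0
```

Because the distances are sorted once, dropping from k to k-1 neighbors only shortens the slice and never recomputes distances.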

Write a function `closest_key` to find the key with the input value closest to the beginning of the list.

**Example:**

**Input:**

```
dictionary = {
'a' : ['b','c','e'],
'm' : ['c','e'],
}
input = 'c'
```

**Output:**

```
closest_key(dictionary, input) -> 'm'
```

c is at a distance of 1 from a and 0 from m. Hence, the closest key for c is m.
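A straightforward sketch: scan each key’s list, record the index at which the value first appears, and keep the key with the smallest index. Returning `None` when the value appears nowhere is an assumption, and the parameter is named `value` here to avoid shadowing Python’s built-in `input`.

```python
def closest_key(dictionary, value):
    """Return the key whose list holds `value` nearest to its start, or None."""
    best_key, best_index = None, None
    for key, values in dictionary.items():
        if value in values:
            index = values.index(value)  # distance from the start of the list
            if best_index is None or index < best_index:
                best_key, best_index = key, index
    return best_key


dictionary = {
    "a": ["b", "c", "e"],
    "m": ["c", "e"],
}
print(closest_key(dictionary, "c"))  # m
```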

NumPy is a fundamental Python library for scientific computing that provides high-performance multidimensional array objects and tools for working with these arrays. It is an upgrade to Python’s built-in lists for mathematical calculations on large datasets. We have an extensive list of NumPy Interview Questions, some of which are discussed here:

Write a function `gcd` to find the greatest common divisor of a list of integers.

**Example:**

**Input:**

```
int_list = [8, 16, 24]
```

**Output:**

```
def gcd(int_list) -> 8
```
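Since the greatest common divisor is associative, the pairwise `math.gcd` can be folded across the list. The sketch below uses only the standard library and assumes a non-empty list of integers.

```python
from functools import reduce
from math import gcd as pair_gcd


def gcd(int_list):
    """Return the greatest common divisor of a non-empty list of integers."""
    # gcd(a, b, c) == gcd(gcd(a, b), c), so fold the pairwise gcd over the list
    return reduce(pair_gcd, int_list)


print(gcd([8, 16, 24]))  # 8
```

With NumPy available, `np.gcd.reduce(int_list)` is a vectorized alternative, because `np.gcd` is a ufunc.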

Machine learning aids data scientists when they need to gather information faster and assists with trend analysis. While your involvement in building or “coding” ML models will be determined by the company and the type of role you hold, data scientists are generally not expected to approach machine learning interview questions from a strict development standpoint. However, you may be expected to answer algorithm coding questions, such as:

Write a function `search_list` that returns a Boolean indicating if the `target` value is in the `linked_list` or not.

You receive the head of the linked list, which is a dictionary with the following keys: **value** (contains the value of the node) and `next` (contains the next node, or `None` if there is no next node).

If the linked list is empty, you’ll receive **None** since there is no head node for an empty list.

**Example:**

**Input:**

```
target = 2
linked_list = 3 -> 2 -> 5 -> 6 -> 8 -> None
```

**Output:**

```
search_list(target, linked_list) -> True
```
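A minimal sketch of the traversal, assuming each node is a dictionary with `value` and `next` keys as described, and that the last node’s `next` is `None`:

```python
def search_list(target, linked_list):
    """Return True if target is stored in any node of the linked list."""
    node = linked_list
    while node is not None:  # an empty list arrives as None
        if node["value"] == target:
            return True
        node = node["next"]
    return False


# Build 3 -> 2 -> 5 -> 6 -> 8 -> None from the example, back to front
head = None
for value in reversed([3, 2, 5, 6, 8]):
    head = {"value": value, "next": head}

print(search_list(2, head))  # True
```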


You’re given two words, `begin_word` and `end_word`, which are elements of `word_list`.

Write a function **shortest_transformation** to find the length of the shortest transformation sequence from `begin_word` to `end_word` through the elements of `word_list`.

**Note:** *Only one letter can be changed at a time, and each transformed word in the list must exist inside of* `word_list`.

**Note:** *In all test cases, a path does exist between* `begin_word` *and* `end_word`.

**Example:**

**Input:**

```
begin_word = "same"
end_word = "cost"
word_list = ["same","came","case","cast","lost","last","cost"]
```

**Output:**

```
def shortest_transformation(begin_word, end_word, word_list) -> 5
```

Since the transformation sequence would be:

```
'same' -> 'came' -> 'case' -> 'cast' -> 'cost'
```

which is **five** elements long.
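Because every step has the same cost, a breadth-first search finds the shortest sequence. The sketch below counts the words in the sequence (so the example yields 5) and returns 0 if no path exists, which is an assumption, since the note guarantees a path in all test cases. The helper `differs_by_one` is introduced here for illustration.

```python
from collections import deque


def differs_by_one(word1, word2):
    """True if the words have equal length and differ in exactly one position."""
    return len(word1) == len(word2) and sum(
        c1 != c2 for c1, c2 in zip(word1, word2)
    ) == 1


def shortest_transformation(begin_word, end_word, word_list):
    """Return the number of words in the shortest begin -> end sequence."""
    queue = deque([(begin_word, 1)])
    visited = {begin_word}
    while queue:
        word, length = queue.popleft()
        if word == end_word:
            return length
        for candidate in word_list:
            if candidate not in visited and differs_by_one(word, candidate):
                visited.add(candidate)
                queue.append((candidate, length + 1))
    return 0  # assumption: signal "no path" with 0


word_list = ["same", "came", "case", "cast", "lost", "last", "cost"]
print(shortest_transformation("same", "cost", word_list))  # 5
```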

Given two strings, `string1` and `string2`, write a function `max_substring` to return the maximal substring shared by both strings.

**Example:**

**Input:**

```
string1 = 'mississippi'
string2 = 'mossyistheapple'
```

**Output:**

```
def max_substring(string1, string2) -> 'mssispp'
```

**Note:** *If there are multiple max substrings with the same length, just return any one of them.*
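Note that the expected output `'mssispp'` is not contiguous in either input, so the example appears to treat "substring" as a common subsequence. Under that reading, a standard longest-common-subsequence dynamic program is one way to sketch it:

```python
def max_substring(string1, string2):
    """Return a longest common subsequence of string1 and string2."""
    rows, cols = len(string1), len(string2)
    # dp[i][j] = length of the longest common subsequence of
    # string1[i:] and string2[j:]
    dp = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if string1[i] == string2[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to rebuild one optimal answer
    result = []
    i = j = 0
    while i < rows and j < cols:
        if string1[i] == string2[j]:
            result.append(string1[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(result)


print(max_substring("mississippi", "mossyistheapple"))  # mssispp
```

The table has `len(string1) * len(string2)` cells, so time and memory both grow with the product of the two lengths.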

Additionally, it should do the following:

- Find the element of the list that is closest to N.
- Return that element along with the k-next and k-previous elements of the list.

To excel in data science coding interviews, focus on a strong foundation in data structures and algorithms.

Practice coding regularly on our platform and utilize our AI Interviewer feature. Understand the trade-offs between different approaches and articulate your thought process clearly, especially in ML coding questions.

Emphasize code readability, efficiency, and test case considerations.

Additionally, delve deep into Python libraries like NumPy, pandas, and scikit-learn for efficient data manipulation and modeling. All the best!