Python programming questions feature prominently in data science technical interviews. During a typical interview, you’ll likely be asked questions covering key Python coding concepts. Start your practice with these newly updated Python data science interview questions, covering statistics, probability, string parsing, NumPy/matrices, and Pandas.

*Build your Python coding competency with Interview Query’s **new Python course**, covering basic to advanced concepts!*

### Why is Python one of the most important data science programming languages?

Python has been the dominant language in data science over the past few years, displacing tools such as R, Julia, Spark, and Scala in areas where they were once the default. That’s thanks in large part to its broad range of data science libraries, supported by a strong and growing data science community.

One of the main reasons Python is now the preferred language is that its libraries extend its use to the **full stack of data science**. Each data science language has its own specialty, such as R for data analysis and modeling within academia, or Spark and Scala for big data ETLs and production. Python, by contrast, has grown an ecosystem of libraries that all fit nicely together. At the end of the day, **it's much easier to program and perform full-stack data science without having to switch languages**. This means running exploratory data analysis, creating graphs and visualizations, building the model, and implementing the deployment all in one language.

### Python Coding Interview Questions and Concepts

What kinds of questions are actually Python data science questions? They fall somewhere between something as simple as *what is a dictionary in Python* and difficult data structure, algorithm, or object-oriented programming concepts.

There are five main concepts tested in Python data science interviews:

- Statistics and distribution-based questions
- Probability simulation
- String parsing and data manipulation
- NumPy functions and matrices
- Pandas data munging

Check out all of our *regularly updated Python interview questions* on Interview Query. Or see our guide to Python machine learning interview questions.

### Basic Python Questions

Although there are plenty of advanced technical questions, be sure you can quickly and competently answer basic questions like “What data types are used in Python?” and “What is a Python dictionary?”. You don’t want to get caught stumbling on an answer for a basic Python syntax question.

**Q1. What built-in data types are used in Python?**

Python uses several built-in data types, including:

- Number (int, float and complex)
- String (str)
- Tuple (tuple)
- Range (range)
- List (list)
- Set (set)
- Dictionary (dict)

In Python, data types are used to classify or categorize data, and every value has a data type.

**Q2. How are data analysis libraries used in Python? What are some of the most common libraries?**

A key reason Python is such a popular data science programming language is its extensive collection of data analysis libraries. These libraries include functions, tools, and methods for managing and analyzing data. There are Python libraries for a wide range of data science tasks, including processing image and textual data, data mining, and data visualization. The most widely used Python data analysis libraries include:

- Pandas
- NumPy
- SciPy
- TensorFlow
- scikit-learn
- Seaborn
- Matplotlib

**Q3. How is a negative index used in Python?**

Negative indexes are used in Python to access items in lists, strings, and arrays from the end, counting backwards. For example, index -1 refers to the last item (the same element as index n-1 in a sequence of length n), while -2 refers to the second to last. Here’s an example of a negative index in Python:

```
b = "Python Coding Fun"
# Index -1 counts from the end, so this is the last character
print(b[-1])  # prints: n
```

**Q4. What’s the difference between lists and tuples in Python?**

Lists and tuples are classes in Python that store one or more objects or values. Key differences include:

- **Syntax –** Lists are enclosed in square brackets and tuples are enclosed in parentheses.
- **Mutable vs. Immutable –** Lists are mutable, which means they can be modified after being created. Tuples are immutable, which means they cannot be modified.
- **Operations –** Lists have more functionalities available than tuples, including insert and pop operations and sorting.
- **Size –** Because tuples are immutable, they require less memory and are subsequently faster.
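The mutability difference is easy to demonstrate. The short sketch below uses made-up variable names (`colors_list`, `colors_tuple`) purely for illustration: appending to a list works, while assigning to a tuple element raises a `TypeError`.

```python
# A list can be changed in place after it is created
colors_list = ['red', 'green']
colors_list.append('blue')

# A tuple cannot: item assignment raises TypeError
colors_tuple = ('red', 'green')
try:
    colors_tuple[0] = 'blue'
except TypeError as err:
    error_message = str(err)

print(colors_list)
print(error_message)
```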

### Python Statistics Questions

Python statistics questions are based on implementing statistical analyses and testing how well you know statistical concepts and can translate them into code. Many times, these questions take the form of random sampling from a distribution, generating histograms, and computing different statistical metrics such as standard deviation, mean, or median.

**Q5. Write a function to generate N samples from a normal distribution and plot them on the histogram.**

This is a relatively simple problem. We have to set up our distribution and generate N samples from it, which are then plotted. In this question, we make use of the SciPy library which is a library made for scientific computing.
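As a rough sketch of the idea, the version below swaps in the standard library for SciPy and a plotting package so that it runs anywhere: `random.gauss` draws the samples and a text histogram stands in for a real plot. The function names `sample_normal` and `text_histogram` are illustrative, not part of the original question. In an interview you would more likely draw with `scipy.stats.norm.rvs` and plot with `matplotlib.pyplot.hist`.

```python
import random

def sample_normal(n, mean=0.0, std=1.0, seed=None):
    """Draw n samples from a normal distribution with the given mean and std."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]

def text_histogram(samples, bins=10, width=40):
    """Print a simple text histogram and return the count in each bin."""
    lo, hi = min(samples), max(samples)
    step = (hi - lo) / bins or 1.0  # avoid a zero step if all samples are equal
    counts = [0] * bins
    for x in samples:
        # The maximum value would land in bin index `bins`, so clamp it
        idx = min(int((x - lo) / step), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    for i, count in enumerate(counts):
        left_edge = lo + i * step
        print(f"{left_edge:8.2f} | {'#' * round(width * count / peak)}")
    return counts

samples = sample_normal(1000, seed=0)
counts = text_histogram(samples)
```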

**Q6. Write a function that takes in a list of dictionaries with a key and list of integers and returns a dictionary with the standard deviation of each list.**

**Note that this should be done without using the NumPy built-in functions.**

**Example:**

```
string1 = 'abc'
string2 = 'asbsc'
string3 = 'acedb'
isSubSequence(string1, string2) -> True
isSubSequence(string1, string3) -> False
```

Hint: Remember the equation for standard deviation. To implement this function, we need the following equation, where we take the sum of the squared differences between each data value and the mean, divide by the total number of data points, and take the square root: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$. Does the expression inside the square root look familiar?
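Here is one possible sketch of the standard deviation function without NumPy. The input layout is an assumption: each dictionary is taken to have a `'key'` entry naming the list and a `'values'` entry holding the integers. The name `compute_deviation` is also hypothetical. It computes the population standard deviation, dividing by N.

```python
import math

def compute_deviation(records):
    """Return {key: population standard deviation} for each record."""
    result = {}
    for record in records:
        values = record['values']  # assumed field name
        mean = sum(values) / len(values)
        # Variance: the mean of the squared differences from the mean
        variance = sum((x - mean) ** 2 for x in values) / len(values)
        result[record['key']] = math.sqrt(variance)
    return result

example = [{'key': 'list1', 'values': [2, 4, 4, 4, 5, 5, 7, 9]}]
print(compute_deviation(example))
```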

**Q7. Given a list of stock prices in ascending order by datetime, write a function that outputs the max profit by buying and selling at a specific interval.**

**Example:**

```
stock_prices = [10,5,20,32,25,12]
get_max_profit(stock_prices) -> 27
```

**Making it harder, given a list of stock prices and date times in ascending order by datetime, write a function that outputs the profit and start and end dates to buy and sell for max profit.**

```
stock_prices = [10,5,20,32,25,12]
dts = [
'2019-01-01',
'2019-01-02',
'2019-01-03',
'2019-01-04',
'2019-01-05',
'2019-01-06',
]
get_profit_dates(stock_prices,dts) -> (27, '2019-01-02', '2019-01-04')
```

Hint: There are several ways to solve this problem. But a good place to start is by thinking about your goal: If we want to maximize profit, ideally we would want to buy at the lowest price and sell at the highest possible price.
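One of the several possible approaches is a single pass that remembers the lowest price seen so far and checks the profit from selling at each later price. The sketch below assumes you must buy before you sell and that the profit is 0 if prices never rise.

```python
def get_max_profit(stock_prices):
    """Best profit from one buy followed by one later sell."""
    best, lowest = 0, float('inf')
    for price in stock_prices:
        lowest = min(lowest, price)        # cheapest buy so far
        best = max(best, price - lowest)   # profit if we sold today
    return best

def get_profit_dates(stock_prices, dts):
    """Same idea, but also track the buy and sell dates."""
    best_profit, buy_idx, sell_idx = 0, None, None
    low_idx = 0
    for i in range(1, len(stock_prices)):
        if stock_prices[i] < stock_prices[low_idx]:
            low_idx = i
        profit = stock_prices[i] - stock_prices[low_idx]
        if profit > best_profit:
            best_profit, buy_idx, sell_idx = profit, low_idx, i
    if buy_idx is None:  # prices never rose
        return (0, None, None)
    return (best_profit, dts[buy_idx], dts[sell_idx])

stock_prices = [10, 5, 20, 32, 25, 12]
dts = ['2019-01-01', '2019-01-02', '2019-01-03',
       '2019-01-04', '2019-01-05', '2019-01-06']
print(get_max_profit(stock_prices))
print(get_profit_dates(stock_prices, dts))
```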

### Python Probability Questions

Most Python questions that involve probability test your knowledge of probability concepts. These questions are similar to the Python statistics questions, except that they focus on simulating concepts like the binomial distribution or Bayes' theorem.

Since most general probability questions focus on calculating chances under a certain condition, almost all of them can be checked by writing Python to simulate the scenario.

**Q8. Amy and Brad take turns rolling a fair six-sided die. Whoever rolls a “6” first wins the game. Amy starts by rolling first.**

**What’s the probability that Amy wins?**

Given this scenario, we can write a Python function that simulates the game thousands of times and counts how often Amy wins. Solving this problem requires modeling two players who alternate turns, with Amy always rolling first.
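A minimal simulation sketch follows. The function name and the default of 100,000 trials are arbitrary choices. For comparison, the exact answer from the geometric series is 6/11, or about 0.545.

```python
import random

def amy_wins_probability(trials=100_000, seed=0):
    """Estimate the chance that Amy, who rolls first, rolls the first 6."""
    rng = random.Random(seed)
    amy_wins = 0
    for _ in range(trials):
        amy_turn = True
        # Keep rolling until someone gets a 6, switching players after each miss
        while rng.randint(1, 6) != 6:
            amy_turn = not amy_turn
        amy_wins += amy_turn  # True counts as 1
    return amy_wins / trials

print(amy_wins_probability())
```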

**Q9. Every night between 7pm and midnight, two computing jobs from two different sources are randomly started with each one lasting an hour. Unfortunately, when the jobs simultaneously run, they cause a failure in some of the company’s other nightly jobs, resulting in downtime for the company that costs $1,000.**

**The CEO, who has enough time today to hear one word, needs a single number representing the annual (365 days) cost of this problem. Write a function to simulate this problem and output an estimated cost.**

**Bonus: How would you solve this using probability?**

Hint: Let's assume that the start times of the two jobs are independent of one another, and that each job has the property that given two time intervals of equal duration in the 7pm-midnight period, the chance the job starts in the first interval is equal to that of the second interval.

Now how can we further solve this problem?
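Under the assumptions in the hint, one way to sketch the simulation is to draw two independent start times uniformly over the five-hour window and count the nights on which they fall within one hour of each other. The function name and parameters below are illustrative. For the bonus, the same assumptions give an overlap probability of 1 - (4/5)^2 = 0.36, so the expected annual cost is 0.36 × 365 × $1,000 = $131,400.

```python
import random

def simulate_annual_cost(nights=200_000, window_hours=5.0, job_hours=1.0,
                         cost=1000, days=365, seed=0):
    """Estimate the yearly cost of overlapping jobs by simulation."""
    rng = random.Random(seed)
    overlaps = 0
    for _ in range(nights):
        # Start times measured in hours after 7pm
        a = rng.uniform(0, window_hours)
        b = rng.uniform(0, window_hours)
        # Two one-hour jobs overlap when they start less than an hour apart
        if abs(a - b) < job_hours:
            overlaps += 1
    overlap_probability = overlaps / nights
    return overlap_probability * days * cost

print(round(simulate_annual_cost()))
```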

**Q10. Imagine a deck of 500 cards numbered from 1 to 500. If all the cards are shuffled randomly and you are asked to pick three cards, one at a time, what's the probability of each subsequent card being larger than the previous drawn card?**

One way to start this question: Make the sample smaller. Let's say the question is actually 100 cards and you select 3 cards without replacement. Does the answer change?
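A quick simulation can check the intuition that the deck size does not matter. The sketch below, with an illustrative function name, draws three cards without replacement and checks whether they came out in increasing order. Any three distinct cards are equally likely to appear in each of their 3! = 6 orderings, so the exact answer is 1/6.

```python
import random

def increasing_draw_probability(deck_size=500, draws=3, trials=100_000, seed=0):
    """Estimate the chance that cards drawn one at a time are strictly increasing."""
    rng = random.Random(seed)
    deck = range(1, deck_size + 1)
    hits = 0
    for _ in range(trials):
        hand = rng.sample(deck, draws)  # draws without replacement, in draw order
        if all(x < y for x, y in zip(hand, hand[1:])):
            hits += 1
    return hits / trials

print(increasing_draw_probability(deck_size=500))
print(increasing_draw_probability(deck_size=100))
```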

### String Parsing and Data Manipulation Python Questions

String parsing questions are probably among the most common Python questions. They focus on how well you can manipulate text data, which usually needs to be thoroughly cleaned and transformed before it can be used as a dataset.

These types of questions are common at startups and at companies that work with a lot of text that needs to be analyzed on a regular basis. This includes most social media companies like Twitter or LinkedIn and job companies like Indeed or ZipRecruiter.

**Q11. Write a function that can take a string and return a list of bigrams.**

**Example:**

```
sentence = """
Have free hours and love children?
Drive kids to school, soccer practice
and other activities.
"""
output = [('have', 'free'),
('free', 'hours'),
('hours', 'and'),
('and', 'love'),
('love', 'children?'),
('children?', 'drive'),
('drive', 'kids'),
('kids', 'to'),
('to', 'school,'),
('school,', 'soccer'),
('soccer', 'practice'),
('practice', 'and'),
('and', 'other'),
('other', 'activities.')]
```

When separating a sentence into bigrams, the first thing we need to do is split the sentence into individual words. We then loop through the words and append each pair of adjacent words to the list. How many loops would we need if the number of words in the sentence was k?
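One compact way to sketch this is to pair the word list with a copy of itself shifted by one position, which needs a single pass over the words. The function name `find_bigrams` is an assumption. Lowercasing matches the example output above.

```python
def find_bigrams(sentence):
    """Return a list of (word, next_word) tuples in lowercase."""
    words = sentence.lower().split()
    # zip stops at the shorter sequence, so k words give k - 1 bigrams
    return list(zip(words, words[1:]))

sentence = """
Have free hours and love children?
Drive kids to school, soccer practice
and other activities.
"""
print(find_bigrams(sentence))
```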

**Q12. Given two strings A and B, return whether or not A can be shifted some number of times to get B.**

**Example:**

```
A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True
A = 'abc'
B = 'acb'
can_shift(A, B) == False
```
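One common trick, sketched here as a possible solution, is that every rotation of A appears as a substring of A concatenated with itself. So B is a shift of A exactly when the lengths match and B occurs inside A + A.

```python
def can_shift(a, b):
    """Return True if b is some rotation of a."""
    # 'abcde' + 'abcde' contains every rotation of 'abcde' as a substring
    return len(a) == len(b) and b in a + a

print(can_shift('abcde', 'cdeab'))
print(can_shift('abc', 'acb'))
```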

**Q13. Given two strings, string1 and string2, determine if there exists a one-to-one character mapping between each character of string1 to string2.**

**Example 1:**

```
string1 = 'qwe'
string2 = 'asd'
string_map(string1, string2) == True
#q = a, w = s, and e = d
```

**Example 2:**

```
string1 = 'donut'
string2 = 'fatty'
string_map(string1, string2) == False
# n and u would both map to t, so the mapping is not one-to-one
```

In general, we know that both strings must be equal in length. If they aren't then there definitely is not a one to one mapping. Next, we'll look at the most efficient way to determine True or False given the conditions.

We know that if there exists one false condition between characters of string1 and string2, we can immediately determine the mapping as False. However, for the condition to be true, we have to continue checking each character until we've exhausted the string. Given this mindset, let's try looping through both strings and building a key-value dictionary that maps each character of string1 to the character of string2 at the same index. If the character at an index does not equal the character already stored in the dictionary, return False. Because the mapping must be one-to-one, we also need to check the reverse direction, so that two different characters of string1 never map to the same character of string2.
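A minimal sketch of the dictionary approach follows. It keeps one dictionary per direction so that no character of string1 maps to two characters of string2, and no two characters of string1 share a character of string2.

```python
def string_map(string1, string2):
    """Return True if a one-to-one character mapping exists between the strings."""
    if len(string1) != len(string2):
        return False
    forward, backward = {}, {}
    for c1, c2 in zip(string1, string2):
        # setdefault stores the pairing the first time and returns the
        # previously stored partner on later visits
        if forward.setdefault(c1, c2) != c2 or backward.setdefault(c2, c1) != c1:
            return False
    return True

print(string_map('qwe', 'asd'))
print(string_map('donut', 'fatty'))
```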

**Q14. Given a string, return the first recurring character in it, or None if there is no recurring character.**

**Example:**

```
input = "interviewquery"
output = "i"
input = "interv"
output = None
```
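One straightforward sketch is to scan the string once while keeping a set of the characters already seen. The first character found in that set is the first recurring one. The function name is illustrative.

```python
def first_recurring_char(text):
    """Return the first character that appears a second time, or None."""
    seen = set()
    for ch in text:
        if ch in seen:
            return ch
        seen.add(ch)
    return None

print(first_recurring_char("interviewquery"))
print(first_recurring_char("interv"))
```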

### Python Data Manipulation Interview Questions

Data manipulation questions cover techniques for transforming data outside of NumPy or Pandas. This is common for data engineers designing ETLs that transform data between raw JSON and database reads. Many times, these types of problems will require grouping, sorting, or filtering data using lists, dictionaries, and other Python data structure types. These types of questions test your general knowledge of Python data munging outside of actual Pandas formatting.

**Q15. Given a list of timestamps in sequential order, return a list of lists grouped by week (7 days) using the first timestamp as the starting point.**

**Example:**

```
ts = [
'2019-01-01',
'2019-01-02',
'2019-01-08',
'2019-02-01',
'2019-02-02',
'2019-02-05',
]
output = [
['2019-01-01', '2019-01-02'],
['2019-01-08'],
['2019-02-01', '2019-02-02'],
['2019-02-05'],
]
```
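A possible sketch, assuming the timestamps are `YYYY-MM-DD` strings as in the example: compute each timestamp's offset in days from the first one, integer-divide by 7 to get a week number, and group on that number. The function name `weekly_aggregation` is an assumption. Weeks with no timestamps are simply skipped.

```python
from datetime import datetime

def weekly_aggregation(ts):
    """Group sorted date strings into 7-day buckets starting at the first date."""
    if not ts:
        return []
    start = datetime.strptime(ts[0], '%Y-%m-%d')
    groups = {}
    for stamp in ts:
        days_since_start = (datetime.strptime(stamp, '%Y-%m-%d') - start).days
        week = days_since_start // 7
        groups.setdefault(week, []).append(stamp)
    # Dictionaries keep insertion order, so the weeks come out chronologically
    return list(groups.values())

ts = ['2019-01-01', '2019-01-02', '2019-01-08',
      '2019-02-01', '2019-02-02', '2019-02-05']
print(weekly_aggregation(ts))
```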

**Q16. In data science, there exists the concept of stemming, which is the heuristic of chopping off the end of a word to clean and bucket it into an easier feature set.**

**Given a dictionary consisting of many roots and a sentence, stem all the words in the sentence with the root forming it. If a word has many roots that can form it, replace it with the root with the shortest length.**

**Example:**

```
Input:
roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
Output: "the cat was rat by the bat"
```

Hint: At first it looks like we can just loop through each word, check whether a root exists in the word, and if so, replace the word with the root. But since we are stemming the words, we have to make sure that the root matches the word at its prefix rather than anywhere within the word.
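The prefix check in the hint can be sketched with `str.startswith`. Sorting the roots by length first means the first match found is also the shortest one. The function name is illustrative.

```python
def stem_sentence(roots, sentence):
    """Replace each word with its shortest matching root prefix, if any."""
    roots_by_length = sorted(roots, key=len)
    stemmed = []
    for word in sentence.split():
        # Take the first (shortest) root that is a prefix, else keep the word
        root = next((r for r in roots_by_length if word.startswith(r)), word)
        stemmed.append(root)
    return ' '.join(stemmed)

roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
print(stem_sentence(roots, sentence))
```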

**Q17. There are two lists of dictionaries representing friendship beginnings and endings: friends_added and friends_removed. Each dictionary contains the user_ids and created_at time of the friendship beginning/ending.**

**Write a function to generate an output which lists the pairs of friends with their corresponding timestamps of the friendship beginning and then the timestamp of the friendship ending.**

**Note: There can be multiple instances over time when two people became friends and unfriended; only output lists when a corresponding friendship was removed.**

**Example Input:**

```
friends_added = [{'user_ids': [1, 2], 'created_at': '2020-01-01'},
{'user_ids': [3, 2], 'created_at': '2020-01-02'},
{'user_ids': [2, 1], 'created_at': '2020-02-02'},
{'user_ids': [4, 1], 'created_at': '2020-02-02'}]
friends_removed = [{'user_ids': [2, 1], 'created_at': '2020-01-03'},
{'user_ids': [2, 3], 'created_at': '2020-01-05'},
{'user_ids': [1, 2], 'created_at': '2020-02-05'}]
```

**Example Output:**

```
friendships = [{
'user_ids': [1, 2],
'start_date': '2020-01-01',
'end_date': '2020-01-03'
},
{
'user_ids': [1, 2],
'start_date': '2020-02-02',
'end_date': '2020-02-05'
},
{
'user_ids': [2, 3],
'start_date': '2020-01-02',
'end_date': '2020-01-05'
},
]
```

You are only looking for friendships that have an end date. Because of this, every friendship in the final output is contained within the friends_removed list. So if you start by iterating through the friends_removed list, you will already have the id pair and the end date of each entry in the final output. You just need to find the corresponding start date for each end date.
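One way to sketch this matching, under two assumptions: the dates are ISO-formatted strings, so they sort chronologically as text, and each removal pairs with the earliest unmatched addition for the same pair of users. The function name `friendship_timeline` is hypothetical. User ids are sorted so that `[2, 1]` and `[1, 2]` refer to the same friendship.

```python
from collections import defaultdict, deque

def friendship_timeline(friends_added, friends_removed):
    """Pair each friendship removal with its matching start date."""
    # Queue of start dates per user pair, oldest first
    starts = defaultdict(deque)
    for event in sorted(friends_added, key=lambda e: e['created_at']):
        starts[tuple(sorted(event['user_ids']))].append(event['created_at'])

    friendships = []
    for event in sorted(friends_removed, key=lambda e: e['created_at']):
        pair = tuple(sorted(event['user_ids']))
        # Only match a start date that is not after the removal date
        if starts[pair] and starts[pair][0] <= event['created_at']:
            friendships.append({
                'user_ids': list(pair),
                'start_date': starts[pair].popleft(),
                'end_date': event['created_at'],
            })
    return sorted(friendships, key=lambda f: (f['user_ids'], f['start_date']))

friends_added = [{'user_ids': [1, 2], 'created_at': '2020-01-01'},
                 {'user_ids': [3, 2], 'created_at': '2020-01-02'},
                 {'user_ids': [2, 1], 'created_at': '2020-02-02'},
                 {'user_ids': [4, 1], 'created_at': '2020-02-02'}]
friends_removed = [{'user_ids': [2, 1], 'created_at': '2020-01-03'},
                   {'user_ids': [2, 3], 'created_at': '2020-01-05'},
                   {'user_ids': [1, 2], 'created_at': '2020-02-05'}]
print(friendship_timeline(friends_added, friends_removed))
```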

### Python NumPy and Matrices Problems

Many data science problems involve working with the NumPy library and matrices. These types of problems are not as common as the others but still show up. They involve using NumPy to run matrix multiplication, calculate the Jacobian determinant, and transform matrices in some way.

**Q18. Given a 4x4 NumPy Matrix, reverse the matrix.**

**Q19. Add two NumPy matrices together.**
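Both questions can be sketched in a few lines. "Reverse" is ambiguous, so the example shows two readings: `np.flip` with no axis reverses the element order along every axis, while slicing with `[::-1]` reverses only the row order. The sample matrices are made up for illustration.

```python
import numpy as np

matrix = np.arange(16).reshape(4, 4)   # values 0..15 in a 4x4 grid

# Q18: two possible readings of "reverse"
reversed_all = np.flip(matrix)         # reverse rows and columns
reversed_rows = matrix[::-1]           # reverse the row order only

# Q19: element-wise addition of two matrices with the same shape
other = np.ones((4, 4), dtype=int)
total = matrix + other                 # equivalent to np.add(matrix, other)

print(reversed_all)
print(total)
```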

### Pandas Data Munging

Lastly, Pandas questions are showing up more and more in data science interviews. While Pandas can be used in many different ways in data science, including analytics questions similar to SQL problems, these kinds of Pandas questions are more closely aligned with cleaning data.

This means problems like one-hot encoding variables, using the Pandas apply function to group different variables, and text cleaning different columns.

**Q20. Let's say you're given a list of standardized test scores from high schoolers from grades 9 to 12.**

**Given the dataset, write code in Pandas to return the cumulative percentage of students that received scores within the buckets of <50, <75, <90, <100.**

**Example Input:**

```
user_id | grade | test score
--------+-------+-----------
1       | 10    | 85
2       | 10    | 60
3       | 11    | 90
4       | 10    | 30
5       | 11    | 99

Example Output:

grade | test score | percentage
------+------------+-----------
10    | <50        | 30%
10    | <75        | 65%
10    | <90        | 96%
10    | <100       | 99%
11    | <50        | 15%
11    | <75        | 50%
..    | ..         | ..

**Q21. You’re given two dataframes. One contains information about addresses and the other contains relationships between various cities and states:**

| address |
| --- |
| 4860 Sunset Boulevard, San Francisco, 94105 |
| 3055 Paradise Lane, Salt Lake City, 84103 |
| 682 Main Street, Detroit, 48204 |
| 9001 Cascade Road, Kansas City, 64102 |
| 5853 Leon Street, Tampa, 33605 |

| city | state |
| --- | --- |
| Salt Lake City | Utah |
| Kansas City | Missouri |
| Detroit | Michigan |
| Tampa | Florida |
| San Francisco | California |

**Write a function to create a single dataframe with complete addresses in the format of street, city, state, zipcode.**

We need to find a way to merge our two dataframes, but one of them is a long, messy string. How can we modify this string to make merging our dataframes easier?

One option: We can use the Pandas string method .str.split with an expand=True argument and a ', ' delimiter to split our address string into three separate columns. Now how will we perform our merge?
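Putting the split and the merge together might look like the sketch below. The function name `complete_address` and the intermediate column names `street` and `zipcode` are assumptions. The `address`, `city`, and `state` columns follow the tables above.

```python
import pandas as pd

def complete_address(df_addresses, df_cities):
    """Rebuild each address as 'street, city, state, zipcode'."""
    # Split 'street, city, zipcode' into three columns
    parts = df_addresses['address'].str.split(', ', expand=True)
    parts.columns = ['street', 'city', 'zipcode']
    # A left merge keeps every address and its original order
    merged = parts.merge(df_cities, on='city', how='left')
    merged['address'] = (merged['street'] + ', ' + merged['city'] + ', '
                         + merged['state'] + ', ' + merged['zipcode'])
    return merged[['address']]

df_addresses = pd.DataFrame({'address': [
    '4860 Sunset Boulevard, San Francisco, 94105',
    '3055 Paradise Lane, Salt Lake City, 84103',
    '682 Main Street, Detroit, 48204',
    '9001 Cascade Road, Kansas City, 64102',
    '5853 Leon Street, Tampa, 33605',
]})
df_cities = pd.DataFrame({
    'city': ['Salt Lake City', 'Kansas City', 'Detroit', 'Tampa', 'San Francisco'],
    'state': ['Utah', 'Missouri', 'Michigan', 'Florida', 'California'],
})
print(complete_address(df_addresses, df_cities))
```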

**Q22. You're given a dataframe of students:**

| name | age | favorite_color | grade |
| --- | --- | --- | --- |
| Tim Voss | 19 | red | 91 |
| Nicole Johnson | 20 | yellow | 95 |
| Elsa Williams | 21 | green | 82 |
| John James | 20 | blue | 75 |
| Catherine Jones | 23 | green | 93 |

**Write a function to select only the rows where the student's favorite color is green or red and their grade is above 90.**

Hint: We need to filter our dataframe by two conditions: grade and favorite color. How would we go about filtering our dataframe by grade?
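Both conditions can be expressed as boolean masks and combined with `&`. A minimal sketch follows, with an illustrative function name and the column names from the table above.

```python
import pandas as pd

def grades_colors(students_df):
    """Keep students who like green or red and scored above 90."""
    likes_color = students_df['favorite_color'].isin(['green', 'red'])
    high_grade = students_df['grade'] > 90
    # Combine the two boolean masks element-wise
    return students_df[likes_color & high_grade]

students_df = pd.DataFrame({
    'name': ['Tim Voss', 'Nicole Johnson', 'Elsa Williams', 'John James', 'Catherine Jones'],
    'age': [19, 20, 21, 20, 23],
    'favorite_color': ['red', 'yellow', 'green', 'blue', 'green'],
    'grade': [91, 95, 82, 75, 93],
})
print(grades_colors(students_df))
```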

**Q23. You're given a dataframe containing a list of user IDs and their full names (e.g. 'James Emerson'). Transform this dataframe into a dataframe that contains the user ids and only the first name of each user.**
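A short sketch of one approach follows, assuming the columns are named `user_id` and `name` (the question does not specify them) and that the first name is everything before the first space. The second sample row is made up for illustration.

```python
import pandas as pd

def first_names_only(users_df):
    """Return user ids alongside each user's first name."""
    result = users_df[['user_id']].copy()
    # Split on whitespace and keep the first piece of each name
    result['first_name'] = users_df['name'].str.split().str[0]
    return result

users_df = pd.DataFrame({
    'user_id': [1, 2],
    'name': ['James Emerson', 'Nicole Johnson'],
})
print(first_names_only(users_df))
```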

### Python Data Science Interview Strategies

**Practice.** The easiest way to get better at Python data science interview questions is to do more Python practice problems. The more questions you practice and understand, the faster you'll find strategies as you start to pattern match and group similar problems together.

**Clarify Upfront.** What packages or libraries are you allowed to use? Do you have to build an algorithm from scratch? What's the most optimal runtime that they're looking for? Ask questions to understand the scope of the problem first to get a sense of where to start. The worst thing you could do is not clarify their expectations from the get go!

**Solve a simple problem first.** This allows you to get an early win and build toward the larger scope of the problem. Additionally, if you have a solution but you know it's not the most efficient, write it out first anyway to get something on paper and then work backwards to try to find the most optimal one.

**Think out loud and communicate.** Talk about what you're doing and why. This helps with both your thought process and their understanding of what you're doing. That way you can make sure you and the interviewer are on the same page.

**Admit if you don't know.** If you can't remember a Python method, type, or other concept, don't try to hide it, because bluffing looks bad to the interviewer. Instead, mention that you forgot and state the assumption you're making so that the interviewer understands where you're coming from. If you're wrong, they will most likely correct you.

**Slow down.** Don't jump in headfirst and expect to do well. Take your time to think about the problem and solve it the way you would when practicing. Remember that you will most likely have plenty of time to solve the problem.

**Resources**