
Most machine learning interviews are tough to pass. Here are 50+ questions to help you prepare for your next data science or machine learning interview.

Most machine learning questions assess two things:

Your past experience working with machine learning.

Your ability to recall concepts and apply them toward a solution.

Therefore, many of these questions are designed to test your knowledge, e.g. definitions-based and theoretical questions, as well as your ability to **apply ML theory toward a business goal.**

In this guide, we’ll focus on different types of machine learning questions including the different types of algorithms, applied modeling questions, and questions on machine learning system design.

Machine learning interview questions follow a couple of patterns. We’ve broken them down into six different types of questions.

These questions assess your working knowledge of algorithm fundamentals. Often, they’re posed as comparison questions.

Sample question: What is the difference between a parametric learning algorithm and a non-parametric learning algorithm?

Machine learning case studies ask you to explain and walk the interviewer through building a model and the various different tradeoffs you can make.

Sample question: How would you build a model for Product X?

Applied modeling questions take machine learning concepts and ask how they could be applied to fix a certain problem.

These problems are slightly different than case studies as they’re more specific towards understanding machine learning theory rather than a business case.

For example: You’re given a model with 90% accuracy. Should you deploy it?

System design questions look at the design and architecture of recommendation systems, machine learning models, and concepts around scaling these systems.

Sample question: How would you build a Twitter-style social media feed to display relevant posts to users?

These questions are often posed like case studies, but they’re specific to recommendation and search engines. They’re very common in machine learning interviews.

Sample question: How would you build a recommendation engine to recommend news to users on Google?

These questions ask you to code machine learning algorithms from scratch without the use of helper packages. For example, you might be asked to re-create an algorithm from Scikit-learn or NumPy from the ground up.

Sample question: Given a list of tuples representing coordinates on a 2-D plane, write a function to compute the maximum gradient descent coefficient.

Now we’ll take a look at machine learning questions in each of these categories with hints and sample solutions.

Machine learning algorithms questions assess your **conceptual knowledge of machine learning**. Companies mostly ask these questions of machine learning and deep learning specialists who would focus specifically on building and training machine learning models.

Algorithms questions can be asked in many different forms, but three of the most common are:

- Comparing differences between algorithms
- Identifying similarities between algorithms
- Definitions of algorithm terms

**Why do they get asked?**

Algorithm interview questions test your foundational knowledge. For example, a common question like **bias/variance tradeoff** helps the interviewer know how deep your knowledge of the concept truly is, as well as your ability to communicate complex ideas.

With a question like this, you should define both, and then explain your reasoning for your solution.

Random forest regression is based on the ensemble machine learning technique of bagging. The two key concepts of random forests are:

- Random sampling of training observations when building trees.
- Random subsets of features for splitting nodes.

Compared to linear regression, random forests can also handle missing values and high-cardinality features well, while being less affected by outliers. Random forests also tend to perform better with categorical predictors.

Linear regression, on the other hand, is the standard regression technique in which relationships are modeled **using a linear predictor function**, the most common example being y = Ax + B. Linear regression models are often fitted using the least-squares approach.

There are also four main assumptions in linear regression:

- A linear relationship between the predictors and the response
- Independence of the error terms
- Constant variance of the error terms (homoscedasticity), with mean zero
- A normal distribution of the error terms

(No strong correlation between the features, i.e. no multicollinearity, is also commonly checked for.)
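To make the linear side concrete, here is a minimal least-squares fit with NumPy on made-up data, recovering the `y = Ax + B` form above (the data and coefficients are purely illustrative):

```python
import numpy as np

# Toy data following y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, size=x.shape)

# Least-squares fit: stack a column of ones so B (the intercept) is learned too
X = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
A, B = coef
print(A, B)  # should be close to 2 and 1
```

A random forest regressor, by contrast, would typically come from a library like scikit-learn rather than a closed-form fit.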

Bias is the amount by which our predictions are systematically off from the target. Bias measures how “inflexible” the model is.

Variance is the measure of how much the prediction would vary if the model was trained on a different dataset drawn from the same population. It can also be thought of as the “flexibility” of the model.

Regularization is the act of modifying our objective function by adding a penalty term, to reduce overfitting.
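To make “adding a penalty term” concrete, here is a minimal sketch of an L2 (ridge-style) penalty added to an MSE objective; the data and `lam` values are made up:

```python
import numpy as np

def ridge_loss(beta, X, y, lam):
    """MSE objective plus an L2 penalty term: lam * ||beta||^2."""
    residuals = X @ beta - y
    return np.mean(residuals ** 2) + lam * np.sum(beta ** 2)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
beta = np.array([0.5, 0.0])

# With lam = 0 this is plain MSE; lam > 0 penalizes large coefficients
print(ridge_loss(beta, X, y, lam=0.0))  # 0.25
print(ridge_loss(beta, X, y, lam=1.0))  # 0.25 + 0.25 = 0.5
```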

Gradient descent is a method of minimizing the cost function. The form of the cost function will depend on the type of supervised model.

When optimizing our cost function, we compute the gradient to find the direction of steepest ascent. To find the minimum, we repeatedly update our coefficients Beta by stepping in the opposite direction of the gradient, proportional to a step size.
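The update rule above can be sketched in a few lines. Here we minimize a simple one-dimensional cost, (b − 3)², with an arbitrary learning rate and step count:

```python
# Minimize f(b) = (b - 3)^2, whose gradient is f'(b) = 2 * (b - 3)
def gradient_descent(lr=0.1, steps=100):
    b = 0.0  # initial guess for our coefficient Beta
    for _ in range(steps):
        grad = 2 * (b - 3)  # gradient points in the direction of steepest ascent
        b -= lr * grad      # step opposite the gradient to descend
    return b

print(gradient_descent())  # converges toward the minimum at b = 3
```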

Interpreting linear regression coefficients is much simpler than interpreting logistic regression coefficients. A regression coefficient signifies how much the mean of the dependent variable changes given a one-unit shift in the corresponding independent variable, holding all other variables constant.

Maximum likelihood estimation is where we find the distribution that is most likely to have generated the data. To do this, we estimate the parameter theta that maximizes the likelihood function evaluated at x.
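As a small illustration, we can grid-search for the theta that maximizes the log-likelihood of some made-up data under a Gaussian with known variance; the maximizer should match the closed-form MLE, which is the sample mean:

```python
import numpy as np

data = np.array([1.2, 0.8, 1.0, 1.4, 0.6])

def log_likelihood(mu, x, sigma=1.0):
    # Log of the Gaussian likelihood evaluated at the sample, as a function of mu
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (x - mu) ** 2 / (2 * sigma ** 2))

# Evaluate the likelihood over a grid of candidate parameter values
grid = np.linspace(0, 2, 2001)
mu_hat = grid[int(np.argmax([log_likelihood(mu, data) for mu in grid]))]
print(mu_hat, data.mean())  # both should be 1.0
```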

LDA is a predictive modeling algorithm for multi-class classification. LDA will compute the directions that will represent the axes that maximize the separation between classes.

Recall: What proportion of actual positives was identified correctly?

Precision: What proportion of positive identifications was actually correct?

The intuition is that we’re taking the harmonic mean of precision and recall. In a scenario where classes are imbalanced, we’re likely to have extremely high precision and extremely low recall, or vice versa. As a result, this will be reflected in our F1 score, since the lower of the two metrics drags the F1 score down.
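The harmonic-mean intuition is easy to check numerically; in this sketch, a high precision paired with a low recall still produces a low F1 (the numbers are made up):

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall: dominated by the smaller value
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.95, 0.10))  # low recall drags F1 down to ~0.18
print(f1_score(0.60, 0.60))  # balanced metrics give F1 = 0.6
```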

Rather than using only contextual words, we calculate a co-occurrence matrix of all words. GloVe also takes local context into account, per a fixed window size, and then predicts the co-occurrence ratios between words in the neural network.

GloVe will learn this matrix and train word vectors that predict co-occurrence ratios. Loss is weighted by word frequency.

You can reduce overfitting by training the network on more examples, or by reducing the complexity of the network.

The benefit of very deep neural networks is that their performance continues to improve as they are fed larger and larger datasets. Even with a near-infinite number of examples, however, performance eventually plateaus at the limit of what the network’s capacity allows it to learn.

Use named entity recognition techniques or use specific packages to measure cosine similarity and overlap.

Mean Squared Error (MSE) is defined as the mean of the squared differences between the actual and estimated values.

We would use MSE when looking at the accuracy of a regression model.
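A minimal sketch of the definition, with made-up actual and predicted values:

```python
import numpy as np

def mse(actual, predicted):
    # Mean of the squared differences between actual and estimated values
    return np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2)

print(mse([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # (0.25 + 0 + 2.25) / 3
```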

Adding an additional feature does not necessarily improve the performance of GBM or logistic regression. Adding new features without a multiplicative increase in the number of observations leads to a situation whereby we have a complex dataset (a dataset with many features) and only a small number of observations.

Model hyperparameter optimization is the process of finding the best values for a model’s hyperparameters. These can be tuned using the grid search or random search algorithms.
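A grid search can be sketched without any library; here `validation_score` is a hypothetical stand-in for a cross-validated model score (in practice you might use something like scikit-learn’s `GridSearchCV`):

```python
import itertools

# Hypothetical scoring function standing in for cross-validated model accuracy;
# by construction it peaks at learning_rate = 0.1, max_depth = 4
def validation_score(learning_rate, max_depth):
    return -((learning_rate - 0.1) ** 2) - 0.01 * (max_depth - 4) ** 2

param_grid = {
    "learning_rate": [0.01, 0.1, 0.5],
    "max_depth": [2, 4, 8],
}

# Grid search: evaluate every combination of values and keep the best scorer
best_params, best_score = None, float("-inf")
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    score = validation_score(**params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params)  # {'learning_rate': 0.1, 'max_depth': 4}
```

Random search works the same way, except it samples a fixed number of random combinations instead of trying them all.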

Both techniques are used for dimensionality reduction. PCA is unsupervised while LDA is supervised.

In supervised learning, input data is provided to the model along with the output. In unsupervised learning, only input data is provided to the model.

The goal of supervised learning is to train the model so that it can predict the output when it is given new data.

Support Vector Machine (SVM) is a linear model for classification and regression problems. The idea is that the algorithm creates a line or a hyperplane which separates the data into different classes.
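A minimal sketch of the separating-hyperplane idea, with a hand-picked weight vector and bias rather than a trained SVM:

```python
import numpy as np

# Hand-picked hyperplane w·x + b = 0; a real SVM would learn w and b so the
# margin between the two classes is maximized
w = np.array([1.0, 1.0])
b = -3.0

points = np.array([[1.0, 1.0], [0.5, 1.5], [3.0, 2.0], [2.5, 3.0]])
sides = np.sign(points @ w + b)  # which side of the hyperplane each point is on
print(sides)  # first two points on one side, last two on the other
```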

The machine learning case study requires a candidate to evaluate and explain a particular part of the model building process. A common case study problem would be for a candidate to explain how they would build a model for a product that exists at the company.

For the machine learning lifecycle, we have around six different steps that we should touch on from beginning to end:

- Data Exploration & Pre-Processing
- Feature Selection & Engineering
- Model Selection
- Cross Validation
- Evaluation Metrics
- Testing and Roll Out

Need some help? Check out our machine learning case study in our interview course.

Many times, this can be scoped down into a specific portion of the model building process. For instance, taking the example above, we could instead reword the problem to:

- How would you evaluate the predictions of an Uber ETA model?
- What features would you use to predict the Uber ETA for ride requests?

The main point of these case questions is to determine your knowledge of the full modeling lifecycle and how you would apply it to a business scenario.

We want to approach the case study with an understanding of what the machine learning and modeling lifecycle should look like from beginning to end, as well as creating a structured format to make sure we’re delivering a solution that explains our thought process thoroughly.

What algorithm would you use to build this model? What are the tradeoffs between different classifiers?

You can see a full mock interview with a solution for this question on YouTube.

The bank also wants to implement a text messaging service that will text customers when the model detects a fraudulent transaction, so that the customer can approve or deny the transaction with a text response.

How would we build this model?

Need a hint? We know that since we’re working with fraud, there has to be a case where there either is a fraudulent transaction or there isn’t.

We can summarize the problem as building a binary classifier on an imbalanced dataset.

A few considerations we have to make are:

- **How accurate is our data?** Is all of the data labeled carefully? How much fraud are we not detecting if customers don’t even know they’re being defrauded?
- **What model works well on an imbalanced dataset?** Generally, tree-based models come to mind.
- **How much do we care about interpretability?** Building a highly accurate model for our dataset may not be the best approach if we don’t learn anything from it. If our customers are being compromised without us even knowing, then we run into the issue of building a model that we can’t learn from and feature engineer for in the future.
- **What are the costs of misclassification?** If we look at precision versus recall, we can understand which metric we care about given the business problem at hand.

This depends on whether the problem is a regression or a classification model.

If it’s a **regression model**, one way would be to cluster them based on the response by working backwards. You could sort them by the response variable, and then **split the categorical variables into buckets based on the grouping of the response variable**. This could be done by using a shallow decision tree to reduce the number of categories.

Another way given a regression model would be to **target encode** them. Replace each category in a variable with the mean response given that category. Now you have one continuous feature instead of a bunch of categories.

For binary classification, you can target encode the column by finding the **conditional probability of the response variable being a one**, given that the categorical column takes a particular value, and then replacing the categorical column with this numerical value. For example, if you have a categorical city column when predicting loan defaults, and the probability that a person who lives in San Francisco defaults is 0.4, you would replace “San Francisco” with 0.4.
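The target-encoding steps above can be sketched with pandas; the city names and default labels are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["SF", "SF", "NY", "NY", "NY"],
    "default": [1, 0, 0, 0, 1],
})

# Target encode: replace each city with the mean response for that city,
# i.e. the conditional probability of default given the city
encoding = df.groupby("city")["default"].mean()
df["city_encoded"] = df["city"].map(encoding)
print(encoding.to_dict())  # {'NY': 0.333..., 'SF': 0.5}
```

In practice you would compute the encoding on the training set only, to avoid leaking the target into validation data.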

Applied modeling questions take machine learning concepts and ask how they could be applied to fix a certain problem. These questions are a little more nuanced and require more experience, but they are great litmus tests of modeling and machine learning knowledge.

These types of questions are similar to case studies in that they are mostly ambiguous, require more contextual knowledge and information gathering from the interviewer, and are used to really test your understanding in a certain area of machine learning.

Let’s pretend that we have three people: Alice, Bob, and Candace that have all applied for a loan. Simplifying the financial lending loan model, let’s assume the only features are:

- Total number of credit cards
- Dollar amount of current debt
- Credit age

Let’s say Alice, Bob, and Candace all have the same number of credit cards and credit age but not the same dollar amount of current debt.

- Alice: 10 credit cards, 5 years of credit age, **$20K** of debt
- Bob: 10 credit cards, 5 years of credit age, **$15K** of debt
- Candace: 10 credit cards, 5 years of credit age, **$10K** of debt

Alice and Bob get rejected for a loan but **Candace gets approved**. Given this scenario, we can logically point to the fact that Candace’s $10K of debt swung the model to approve her for a loan.

**How did we reason this out?** If the sample size analyzed was instead thousands of people who had the same number of credit cards and credit age with varying levels of debt, we could figure out the model’s average loan acceptance rate for each numerical amount of current debt.

Then we could plot these on a graph to **model out the y-value, average loan acceptance, versus the x-value, dollar amount of current debt**.

**How do we deal with the missing data to construct our model?**

This is a pretty classic modeling interview question. Data cleanliness is a well-known issue within most datasets when building models. Real-life data is messy, missing, and almost always needs to be wrangled with.

The key to answering this interview question is to probe and ask questions to learn more about the specific context. For example, we should clarify if there are any other features missing data in the listings.

If we’re only missing data within the square footage column, we can **build models of different sizes of training data.**

Now, what’s the second method?

Collecting data can be costly. This question assesses a candidate’s skill in practically approaching the problem of evaluating a model.

Specifically, what other kinds of information should we look into when we’re given a dataset and build a model with a *“pretty good”* accuracy rate?

If this is the first version of a model, **how would we ever know if we should put any effort into iteration of the model?** And exactly how can we evaluate the cost of extra effort into the model?

There are a couple of factors to look into.

**1.** **Look at the feature set size to training data size ratio**. If we have an extremely high number of features compared to data points, then the model will be prone to overfitting and inaccuracy.

**2. Train a model on a portion of the data**, the training set, and measure its performance on a validation set, otherwise known as using a holdout set. We hold back some subset of the data from the training of the model, and then use this holdout set to check the model’s performance and establish a baseline.
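The holdout-set idea can be sketched as follows; the 80/20 split and the random data are just illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))     # made-up feature matrix
y = rng.integers(0, 2, size=100)  # made-up binary labels

# Shuffle the row indices, then hold out the last 20% as a validation set
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
train_idx, holdout_idx = idx[:cut], idx[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_hold, y_hold = X[holdout_idx], y[holdout_idx]

# Train on (X_train, y_train), then score on the untouched holdout set
print(len(X_train), len(X_hold))  # 80 20
```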

See a solution for this machine learning interview question on YouTube.

When AUC = 0.5, the classifier is not able to distinguish between the positive and negative classes, meaning the classifier is predicting either a random class or a constant class for all data points.

Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple logistic regression model are highly correlated or associated. Multicollinearity does not reduce the predictive power or reliability of the model as a whole; it only affects calculations regarding individual predictors.

One way is to create groups for the output class.

- Delays of less than 2 hours
- Between 2 and 10 hours
- Over 10 hours

That way, the outliers are absorbed into a specific class, and the problem becomes classification instead of regression.

Another way would be to just filter them out from the analysis.
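A minimal sketch of the bucketing idea; the boundaries below are one possible reading of the groups above:

```python
# Map a skewed continuous target (delay in hours) to a class label so extreme
# outliers fall into the top bucket instead of distorting a regression fit
def delay_bucket(hours):
    if hours < 2:
        return "under_2h"
    elif hours <= 10:
        return "2_to_10h"
    return "over_10h"

delays = [0.5, 1.9, 3.0, 9.5, 48.0]
print([delay_bucket(d) for d in delays])
```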

Machine learning system design interview questions ask you about the design and architecture of machine learning applications. Essentially, these questions test your ability to **solve the problem of deploying a machine learning model** that meets specific business requirements.

To **answer machine learning system design questions**, you should follow a framework:

- Setting the problem statement.
- Architecting the high-level infrastructure.
- Explaining how data moves from one part to the next.
- Understanding how to measure the performance of the machine learning models.
- Dealing with common problems around scale, reliability, and deployment.

See a solution for this question on Interview Query.


You should start to answer this question by outlining metrics and design recommendations.

**Offline Metrics**

- Precision (the fraction of relevant instances among the retrieved instances)
- Recall (the fraction of the total amount of relevant instances that were actually retrieved)
- Ranking Loss
- Logloss

**Online Metrics**

- Use A/B testing to compare:
- Click Through Rates (CTR)
- Watch time
- Conversion rates

**Training**

- User behavior is generally unpredictable, and videos can go viral during the day. Ideally, we want to train many times during the day to capture temporal changes.

**Inference**

- For every user who visits the homepage, the system will have to recommend 100 videos. The latency needs to be under 200ms, ideally sub-100ms.
- For online recommendations, it’s important to find the balance between exploration vs. exploitation. If the model over-exploits historical data, new videos might not get exposed to users. We want to balance between relevance and fresh new content.

Recommendation and search engine questions are technically a combination of case study and system design questions. But they are asked so frequently that it’s important to treat them as their own category.

With this question, let’s assume we have access to all user LinkedIn profiles, a list of jobs each user applied to, and answers to questions that the user filled in about their job search.

Using this information, how would you build a job recommendation feed? What would the job recommendation workflow look like?

Can we lay out the steps the user takes in the actual job recommendation flow, so that we can understand what a potential dataset would look like?

For this problem, we have to understand what our dataset consists of before we can build a recommendation model. Moreover, we need to understand what a recommendation feed might look like for the user.

For example, what we’re expecting is that the user could go to a tab or open up a mobile app and then view a list of recommended jobs, sorted with the most highly recommended at the top.

We can use either an **unsupervised or supervised model**. For an unsupervised model, we could use a nearest neighbors or collaborative filtering algorithm based on features from users and jobs. But if we want more accuracy, we would likely go with a supervised classification algorithm.

With this question, let’s think about a simple use case to start with. Let’s say we type in the word “hello” as the beginning of a movie title.

If we typed in h-e-l-l-o, then a suitable suggestion might be a movie like “Hello Sunshine” or a Spanish movie named “Hola”.

Coding machine learning algorithms from scratch is increasingly common in interviews, especially for specialized subject areas like computer vision. These questions are framed around deriving the machine learning algorithms encapsulated in scikit-learn or other packages from scratch.

The interviewer is mainly testing a raw understanding of coding optimizations, performance, and memory in existing machine learning algorithms. Additionally, this tests whether the candidate really understands the underlying algorithm: could they build it without using anything but the NumPy Python package?

Generally, these types of machine learning interview questions are pretty controversial. They’re hard to complete within a specific timeframe and generally pretty vague in how they’re graded.

Practice with these Python machine learning questions, including sample questions and an overview of the Python machine learning interview process.

**Example:**

```python
dictionary = {
    'a': ['b', 'c', 'e'],
    'm': ['c', 'e'],
}
input = 'c'

closest_key(dictionary, input)  # -> 'm'
```

*c* is at distance 1 from *a* and at distance 0 from *m*. Hence, the closest key for *c* is *m*.

**Hint:** Is your computed distance always positive? Negative values for distance (for example between ‘c’ and ‘a’ instead of ‘a’ and ‘c’) will interfere with getting an accurate result.
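The full problem statement is abbreviated here, but going by the example and the hint, one reading is that a letter’s “distance” from a key is its index in that key’s list. Under that assumption, a sketch:

```python
def closest_key(dictionary, target):
    # Distance of `target` from a key = its index in that key's letter list
    # (always non-negative, per the hint); return the key with the smallest one
    best_key, best_dist = None, float("inf")
    for key, letters in dictionary.items():
        if target in letters:
            dist = letters.index(target)
            if dist < best_dist:
                best_key, best_dist = key, dist
    return best_key

print(closest_key({'a': ['b', 'c', 'e'], 'm': ['c', 'e']}, 'c'))  # 'm'
```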

Note that only one letter can be changed at a time and each transformed word in the list must exist.

**Example:**

Input:

```python
begin_word = "same"
end_word = "cost"
word_list = ["same", "came", "case", "cast", "lost", "last", "cost"]

shortest_transformation(begin_word, end_word, word_list)  # -> 5
# since the transformation sequence is ['same', 'came', 'case', 'cast', 'cost'],
# which is five elements long
```
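Under the usual reading of this problem (a breadth-first search over words that differ by exactly one letter), a sketch might look like:

```python
from collections import deque

def shortest_transformation(begin_word, end_word, word_list):
    # BFS over words differing by exactly one letter; returns the length of the
    # shortest sequence (counting both endpoints), or 0 if none exists
    words = set(word_list)
    if end_word not in words:
        return 0
    queue = deque([(begin_word, 1)])
    seen = {begin_word}
    while queue:
        word, length = queue.popleft()
        if word == end_word:
            return length
        for i in range(len(word)):
            for c in "abcdefghijklmnopqrstuvwxyz":
                candidate = word[:i] + c + word[i + 1:]
                if candidate in words and candidate not in seen:
                    seen.add(candidate)
                    queue.append((candidate, length + 1))
    return 0

print(shortest_transformation(
    "same", "cost",
    ["same", "came", "case", "cast", "lost", "last", "cost"]))  # 5
```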

Become an Interview Query premium member for access to 50+ real machine learning interview questions with solutions. Or take a look at our data science interview course, which features sections in machine learning and machine learning system design.

If you’re looking for company-specific resources, check out our guides to Amazon Machine Learning Questions, Google Machine Learning Questions, and Facebook Machine Learning Questions.