Introduction

Machine learning and modeling interview questions cover some of the most fundamental concepts in data science. And because machine learning is a rapidly evolving field, the material is almost always in need of updating.

Modeling interview questions and the machine learning interview are often an abstraction for testing a candidate's experience in the field, as well as for determining to what degree a data scientist or machine learning engineer can critically apply theory toward a business goal.

Machine learning interview questions follow a few recognizable patterns. While they can seem abstract and overwhelming, we can break them down into six types of questions.

  1. Machine learning algorithms and theory: Common questions
  2. Machine learning case study
  3. Applied modeling questions
  4. Machine learning system design
  5. Recommendation and search engines
  6. Writing algorithms from scratch

As we go through each framework, interview question, and machine learning concept, it’s worth remembering that machine learning and modeling interview questions are ultimately indicative of two things:

  1. A candidate’s past experience working with machine learning.
  2. A candidate's capacity to recall concepts and apply them toward the solutions the interviewer is looking for.

Machine Learning Algorithms Interview Questions

Machine learning algorithms questions test the depth of your conceptual knowledge of machine learning. Companies ask these questions mostly of machine learning and deep learning specialists who will focus specifically on building and training machine learning models.

Algorithms questions can be asked in different forms, such as comparing the differences or similarities between two algorithms. Most importantly, the interviewer is trying to understand your foundational knowledge of the subject.

For example, a common machine learning algorithms interview question concerns the bias/variance tradeoff.

1. Let's say we want to build a model to predict booking prices on Airbnb. Between linear regression and random forest regression, which model would perform better and why?

Need a hint? Think about the differences between linear regression and random forest.

Random forest regression is based on the ensemble machine learning technique of bagging. The two key concepts of random forests are:

  1. Random sampling of training observations when building trees.
  2. Random subsets of features for splitting nodes.

Random forest can also handle missing values and high-cardinality categorical features better than linear regression, and it is less affected by outliers. Random forest will also tend to perform better with categorical predictors.
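
To make the comparison concrete, here is a minimal sketch on synthetic "booking price" data (the features, coefficients, and model settings are made up for illustration), where a non-linear interaction between features favors the tree ensemble over the linear model:

```python
# A minimal sketch comparing the two models on synthetic "booking price" data.
# The features and coefficients below are hypothetical, for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
bedrooms = rng.integers(1, 6, n)
distance_km = rng.uniform(0, 30, n)
is_entire_home = rng.integers(0, 2, n)
# Price has a non-linear interaction term, which favors the tree ensemble.
price = (50 + 40 * bedrooms - 3 * distance_km + 60 * is_entire_home
         + 25 * bedrooms * is_entire_home + rng.normal(0, 20, n))

X = np.column_stack([bedrooms, distance_km, is_entire_home])

for model in (LinearRegression(), RandomForestRegressor(n_estimators=200, random_state=0)):
    scores = cross_val_score(model, X, price, cv=5, scoring="r2")
    print(type(model).__name__, scores.mean().round(3))
```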

2. What is bias in a model?

Bias is the amount by which our predictions are systematically off from the target. It is a measure of how "inflexible" the model is.

3. What is variance in a model?

Variance is the measure of how much the prediction would vary if the model was trained on a different dataset drawn from the same population. It can also be thought of as the "flexibility" of the model.

4. Generally, what happens to bias & variance as we increase the complexity of the model?

Bias decreases and variance increases.

5. What is regularization?

Regularization is the act of modifying our objective function by adding a penalty term, to reduce overfitting.

6. Which regularization method would you prefer to treat correlated variables? Why?

Typically, we should prefer a regularization method that drives feature coefficients to zero, effectively removing correlated features. LASSO could work here; however, if the data has many features relative to the number of observations, elastic net may be a better choice.

7. Describe different regularization methods

L2 regularization (ridge) minimizes the sum of the squared residuals plus lambda times the sum of the squared coefficients. This is called the ridge regression penalty. It increases the bias of the model, making the fit worse on the training data, but also decreases the variance. L1 regularization (LASSO) instead adds lambda times the sum of the absolute values of the coefficients, which can drive some coefficients to exactly zero and therefore acts as a form of feature selection. Elastic net combines the two penalties.
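
As a rough illustration of the difference, here is a minimal sketch (synthetic data, arbitrary alpha values) showing how the L2 penalty shrinks two highly correlated coefficients together while the L1 penalty tends to zero one of them out:

```python
# A minimal sketch of L1 (Lasso) vs. L2 (Ridge) on two highly correlated features.
# The data is synthetic and the alpha values are arbitrary, for illustration only.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a duplicate of x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])

print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)  # both coefficients shrunk, neither zero
print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)  # typically drives one coefficient to ~0
```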

8. What is gradient descent?

Gradient descent is a method of minimizing the cost function. The form of the cost function depends on the type of supervised model. When optimizing the cost function, we compute the gradient, which points in the direction of steepest ascent. To find the minimum, we repeatedly update our parameters (Beta) by stepping in the opposite direction of the gradient, scaled by the learning rate.
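
Here is a minimal gradient descent sketch for ordinary least squares, using only NumPy; the learning rate and iteration count are arbitrary choices for illustration:

```python
# A minimal gradient descent sketch for ordinary least squares, using only NumPy.
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iters):
        residuals = X @ beta - y
        grad = (2 / n) * X.T @ residuals   # gradient of the mean squared error
        beta -= lr * grad                  # step opposite the gradient (steepest descent)
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])  # intercept + one feature
y = 4 + 2.5 * X[:, 1] + rng.normal(scale=0.3, size=200)
print(gradient_descent(X, y))  # should be close to [4, 2.5]
```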

9. What is the difference between a parametric learning algorithm vs non-parametric learning algorithm?

A parametric learning algorithm has a fixed, finite set of parameters that the learning algorithm estimates, regardless of how much data it sees.
A non-parametric learning algorithm does not have a fixed set of parameters. This means that as the dataset grows, the learning algorithm can estimate more and more parameters from the dataset.

10. How do you interpret Linear Regression coefficients?

Interpreting linear regression coefficients is much simpler than interpreting logistic regression coefficients. A regression coefficient signifies how much the mean of the dependent variable changes given a one-unit shift in that independent variable, holding all other variables constant.

11. What is Maximum Likelihood Estimation?

Maximum likelihood estimation is the process of finding the parameter values under which the observed data is most likely to have been generated. To do this, we estimate the parameter theta that maximizes the likelihood function L(theta) = P(data | theta).
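
As a small illustration, here is a hedged sketch that estimates the rate parameter of an exponential distribution by numerically maximizing the log-likelihood, then compares it with the closed-form MLE (1 / sample mean); the data is simulated:

```python
# A minimal MLE sketch: estimate the rate of an exponential distribution by
# maximizing the log-likelihood, then compare with the closed-form MLE (1 / mean).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.exponential(scale=1 / 2.0, size=5000)   # true rate lambda = 2

def neg_log_likelihood(lam):
    # log L(lambda) = n * log(lambda) - lambda * sum(x)
    return -(len(data) * np.log(lam) - lam * data.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100), method="bounded")
print("numerical MLE:", result.x, "closed form:", 1 / data.mean())
```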

12. What is Linear Discriminant Analysis?

LDA is a predictive modeling algorithm for multi-class classification. It computes the directions (linear discriminants) that represent the axes maximizing the separation between the classes.

13. What's the difference between precision and recall?

Recall: What proportion of actual positives was identified correctly?

Precision: What proportion of positive identifications was actually correct?

14. What is the intuition behind F1 score?

The intuition is that we’re taking the harmonic mean of precision and recall. In a scenario where classes are imbalanced, we’re likely to have one of precision or recall be very high while the other is extremely low. This will be reflected in our F1 score, since the lower of the two metrics drags the F1 score down.
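
Here is a minimal sketch of the three metrics on a made-up imbalanced example, showing how the harmonic mean gets dragged toward the lower of precision and recall:

```python
# A minimal sketch of precision, recall, and their harmonic mean (F1) on a toy
# imbalanced example; the label vectors are made up for illustration.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 1 / 4
r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 1 / 2
print(p, r, f1_score(y_true, y_pred)) # F1 = 2pr / (p + r), dragged toward the lower metric
```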


15. Explain what GloVe embeddings are.

Rather than relying only on local context words (as word2vec does), GloVe computes a global co-occurrence matrix over all words, counting how often words appear together within a fixed window size. The model then trains word vectors to predict the co-occurrence ratios between words.

GloVe learns word vectors whose dot products predict these co-occurrence statistics, with the loss weighted by how frequently each word pair co-occurs.
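
For intuition, here is a minimal sketch of just the co-occurrence counting step that GloVe is trained on (fixed window, distance-weighted counts); the toy corpus is made up, and this omits the weighted least-squares training of the vectors themselves:

```python
# A minimal sketch of the co-occurrence counting step behind GloVe
# (fixed window, symmetric, distance-weighted). This is only the matrix-building
# step, not the training of the word vectors.
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    counts[(word, tokens[j])] += 1.0 / abs(i - j)  # closer words count more
    return counts

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
print(cooccurrence_counts(corpus)[("sat", "on")])
```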

It’s clear that these questions are meant to test whether candidates understand the situations in which they would apply different types of models. They’re also mostly definition-based questions, so if you memorize the common machine learning definitions and applications, you will usually do okay in this part.

Machine Learning Case Study

The machine learning case study requires a candidate to evaluate and explain a particular part of the model building process. A common case study problem would be for a candidate to explain how they would build a model for a product that exists at the company.

Example Question: Describe how you would build a model to predict Uber ETAs after a rider requests a ride.

Many times, this can be scoped down into a specific portion of the model building process. For instance, taking the example above, we could instead reword the problem to:

  • How would you evaluate the predictions of an Uber ETA model?
  • What features would you use to predict the Uber ETA for ride requests?

The main point of these case questions is to determine your knowledge of the full modeling lifecycle and how you would apply it to a business scenario.

We want to approach the case study with an understanding of what the machine learning & modeling lifecycle should look like from beginning to end, as well as creating a structured format to make sure we’re delivering a solution that explains our thought process thoroughly.

Using the MECE framework for machine learning case studies

For the machine learning lifecycle, there are six steps that we should touch on from beginning to end:

  • Data Exploration & Pre-Processing
  • Feature Selection & Engineering
  • Model Selection
  • Cross Validation
  • Evaluation Metrics
  • Testing and Roll Out

Read more about how to frame a machine learning case study in our interview course.

1. Let's say that you work at a bank that wants to build a model to detect fraud on the platform.

The bank wants to implement a text messaging service in addition that will text customers when the model detects a fraudulent transaction in order for the customer to approve or deny the transaction with a text response.

How would we build this model?

Need a hint? We know that since we’re working with fraud, there has to be a case where there either is a fraudulent transaction or there isn't.

2. Let's say you have a categorical variable with thousands of distinct values. How would you encode it?

This depends on whether the problem is a regression or a classification model.

If it's a regression model, one way would be to cluster the categories based on the response variable, working backwards: sort the categories by their mean response and then split them into buckets. This could be done by using a shallow decision tree to reduce the number of categories.

Another way, given a regression model, would be to target encode them: replace each category in the variable with the mean response for that category. Now you have one continuous feature instead of a bunch of categories.

For a binary classification problem, you can target encode the column by finding the conditional probability of the response variable being one, given that the categorical column takes a particular value, and then replacing the categorical column with that numerical value. For example, if you have a categorical city column when predicting loan defaults, and the probability that a person who lives in San Francisco defaults is 0.4, you would replace "San Francisco" with 0.4.
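
Here is a minimal target encoding sketch with pandas; the column names and data are hypothetical, and in practice you would compute the encoding on the training fold only (or use out-of-fold encoding and smoothing) to avoid target leakage:

```python
# A minimal target encoding sketch with pandas. The DataFrame and column names
# are hypothetical; compute the mapping on training data only to avoid leakage.
import pandas as pd

df = pd.DataFrame({
    "city": ["SF", "SF", "NYC", "NYC", "NYC", "LA"],
    "default": [1, 0, 0, 0, 1, 0],
})

city_means = df.groupby("city")["default"].mean()   # P(default = 1 | city)
df["city_encoded"] = df["city"].map(city_means)
print(df)
```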

3. You’re tasked with building a model to predict if a driver on Uber will accept a ride request or not.

What algorithm would you use to build this model? What are the tradeoffs between different classifiers?

Check out how Chinmaya answers it in a mock interview below!

Applied Modeling Interview Questions

Applied modeling questions take machine learning concepts and ask how they could be applied to solve a certain problem. These questions are a little more nuanced and require more experience, but they are great litmus tests of modeling and machine learning knowledge.

These types of questions are similar to case studies in that they are mostly ambiguous, require more contextual knowledge and information gathering from the interviewer, and are used to really test your understanding in a certain area of machine learning.

1. You're given a model with 90% accuracy. Should you deploy it?

2. We want to build a model to predict housing prices in the city of Seattle.

We've scraped 100K sold listings over the past three years but found that around 20% of the listings are missing square footage data.

How do we deal with the missing data to construct our model?

This is a pretty classic modeling interview question. Data cleanliness is a well-known issue within most datasets when building models. Real-life data is messy, missing, and almost always needs to be wrangled.

The key to answering this interview question is to probe and ask questions to learn more about the specific context. For example, we should clarify whether any other features in the listings are missing data.

If we're only missing data in the square footage column, we can train models on different fractions of the roughly 80% of listings that do have square footage and see what the learning curve looks like (sketched below). If a model trained on 60% of the available data is only slightly less accurate than one trained on all 80%, then, depending on our accuracy requirements, we may be able to simply drop the listings with missing data. The fact that a roughly 30% increase in data barely improves accuracy might also point to a larger feature selection problem.
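
Here is a hedged sketch of that learning-curve check, using a synthetic stand-in for the scraped listings (the column names and the random forest model are arbitrary choices): train on growing fractions of the rows that do have square footage and see where validation error plateaus.

```python
# A minimal learning-curve sketch for the missing square footage scenario.
# The DataFrame, column names, and model are hypothetical stand-ins.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def learning_curve(df, features, target="price"):
    complete = df.dropna(subset=features)                  # keep listings that have square footage
    train, valid = train_test_split(complete, test_size=0.2, random_state=0)
    for frac in (0.25, 0.5, 0.75, 1.0):
        sample = train.sample(frac=frac, random_state=0)
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(sample[features], sample[target])
        mae = mean_absolute_error(valid[target], model.predict(valid[features]))
        print(f"{frac:.0%} of complete rows -> validation MAE {mae:,.0f}")

# Synthetic stand-in for the scraped listings, with ~20% of square footage missing.
rng = np.random.default_rng(0)
n = 5000
sqft = rng.uniform(400, 3500, n)
sqft[rng.random(n) < 0.2] = np.nan
beds = rng.integers(1, 6, n).astype(float)
price = 200 * np.nan_to_num(sqft, nan=1500) + 30000 * beds + rng.normal(0, 50000, n)
learning_curve(pd.DataFrame({"sqft": sqft, "beds": beds, "price": price}), ["sqft", "beds"])
```
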
Now what's the second method?

Machine Learning System Design Interview Questions

Machine learning system design interview questions cover the higher-level design and architecture of recommendation systems, the deployment of machine learning models, and concepts around scaling these systems. At their core, machine learning system design problems are about understanding how to deploy machine learning models in a way that satisfies all aspects of the business requirements.

Preparing for the machine learning system design interview requires understanding a multi-step process of:

  1. Setting the problem statement.
  2. Architecting the high-level infrastructure.
  3. Explaining how data moves from one part to the next.
  4. Understanding how to measure the performance of the machine learning models.
  5. Dealing with common problems around scale, reliability, and deployment.

Machine Learning Project Design

Example ML System Design Case Study: YouTube Video Recommendations

Now let's go over an example machine learning system design case study question and how to tackle it.

1. Problem Statement

Build a video recommendation system for YouTube users. We want to maximize user engagement and recommend new types of content to users.

Diagram for a video recommendation system

2. Metrics Design and Requirements

Offline Metrics

  • Precision (the fraction of relevant instances among the retrieved instances)
  • Recall (the fraction of the total amount of relevant instances that were actually retrieved)
  • Ranking Loss
  • Logloss

Online Metrics

  • Use A/B testing to compare:
  • Click Through Rates (CTR)
  • Watch time
  • Conversion rates

Training

  • User behavior is generally unpredictable and videos can go viral during the day. Ideally, we want to retrain many times per day to capture these temporal changes.

Inference

  • For every user who visits the homepage, the system has to recommend 100 videos. The latency needs to be under 200ms, ideally sub 100ms.
  • For online recommendations, it’s important to find the balance between exploration vs. exploitation. If the model over-exploits historical data, new videos might not get exposed to users. We want to balance between relevance and fresh new content.

3. Multi-Stage Models

There are two stages: candidate generation and ranking. The reason for two stages is to make the system scale. It’s a common pattern that you will see in many machine learning systems.

Architecture diagram for the video recommendation system

We will explore the two stages in the section below:

  • The candidate model will find the relevant videos based on user watch history and the type of videos the user has watched.
  • The ranking model will optimize for watch likelihood, i.e., videos with a high probability of being watched should be ranked higher. It’s a natural fit for the logistic regression algorithm.

Candidate Generation Model

  • Each user has a list of video watches (videos, minutes_watched).
  • For generating training data, we can make a user-video watch space. We can start by selecting a period of data like last month, last six months, etc. This should find a balance between training time and model accuracy.
  • The candidate generation can be done by matrix factorization (a minimal sketch follows this list). The purpose of candidate generation is to generate “somewhat” relevant content for users based on their watch history. The candidate list needs to be big enough to capture potential matches while still letting the model meet the desired latency.
  • The ideal choice is to use collaborative filtering algorithms, because inference time is fast and they capture the similarity between user tastes in the user-video space.
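
Here is a minimal matrix factorization sketch for candidate generation: learn user and video embeddings from a toy watch matrix with plain gradient descent. The embedding size, learning rate, and the toy matrix are arbitrary choices for illustration; a production system would use ALS or a library built for implicit feedback.

```python
# A minimal matrix factorization sketch for candidate generation, using NumPy only.
# The toy watch matrix, embedding size, and learning rate are placeholders.
import numpy as np

watch = np.array([          # rows = users, cols = videos, 1 = watched
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

rng = np.random.default_rng(0)
n_users, n_videos, k = *watch.shape, 3
U = rng.normal(scale=0.1, size=(n_users, k))    # user embeddings
V = rng.normal(scale=0.1, size=(n_videos, k))   # video embeddings

lr, reg = 0.05, 0.01
for _ in range(2000):
    err = watch - U @ V.T                        # reconstruction error on all entries
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)

scores = U @ V.T
scores[watch == 1] = -np.inf                     # don't re-recommend already-watched videos
print(np.argsort(-scores, axis=1)[:, :2])        # top-2 candidate videos per user
```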

Ranking Model

During inference, the ranking model receives a list of video candidates given by the Candidate Generation Model. For each candidate, the ranking model estimates the probability of that video being watched. It then sorts the video candidates based on the probability and returns the list to the upstream process.

Ranking Model Feature Engineering

Training data

  • We can use User Watched History data. Normally, the ratio of watched to not-watched is about 2:98, so most of the time the user does not watch a recommended video.

Model Building

At the beginning, it’s important to start with a simple model, since we can add complexity later.

  • A fully connected neural network is simple yet powerful for representing non-linear relationships and it can handle big data.
  • We start with a fully connected neural network with sigmoid activation at the last layer. The reason for this is that the sigmoid function returns values in the range [0,1]. Therefore, it’s a natural fit for estimating probability.
  • For deep learning architecture, we can use relu (Rectified Linear Unit) as an activation function for hidden layers. It’s very effective in practice.
  • The loss function can be binary cross-entropy loss (see the sketch below).
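
Here is a minimal PyTorch sketch of the ranking model described above: a fully connected network with ReLU hidden layers, a sigmoid output for watch probability, and binary cross-entropy loss. The feature dimension, layer sizes, and the random training batch are placeholders.

```python
# A minimal sketch of the ranking model: fully connected layers with ReLU,
# a sigmoid output for watch probability, and binary cross-entropy loss.
# Feature dimension, layer sizes, and the random batch are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),              # output in [0, 1] = estimated watch probability
)
loss_fn = nn.BCELoss()                           # binary cross-entropy on the sigmoid output
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

features = torch.randn(256, 64)                  # a batch of (user, video) feature vectors
labels = torch.randint(0, 2, (256, 1)).float()   # 1 = watched, 0 = not watched

for _ in range(10):                              # a few illustrative training steps
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
print(loss.item())
```
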
Model Prediction (Probability = 0.35)

4. Calculation & estimation

For the sake of simplicity, we can make these assumptions:

  • Video views per month are 150 billion.
  • 10% of videos watched are from recommendations, a total of 15 billion videos.
  • On the homepage, a user sees 100 video recommendations.
  • On average, a user watches two videos out of 100 video recommendations.
  • If a user does not click or watch a recommended video within a given time frame, e.g., 10 minutes, then it is counted as a missed recommendation.
  • The total number of users is 1.3 billion.

Data size

  • For 1 month, we collected 15 billion positive labels and 750 billion negative labels.
  • Generally, we can assume that for every data point we collect, we also collect hundreds of features. For simplicity, each row takes 500 bytes to store. In one month, we need 800 billion rows.
  • Total size: 500 * 800 * 10**9 = 4 * 10**14 bytes ≈ 400 terabytes (0.4 petabytes) per month. To save costs, we can keep the last six months or one year of data in the data lake and archive older data in cold storage.

Bandwidth

  • Assume that every second we have to generate a recommendation request for 10 million users. Each request will generate ranks for 1k-10k videos.

Scale

  • Support 1.3 billion users

5. System design

High Level System Design diagram for a video recommendation engine
  • User Watched history stores which videos are watched by a particular user over time.
  • Search Query DB stores historical queries that users have searched in the past.
  • User/Video DB stores a list of users and their profiles along with video metadata.
  • User historical recommendations stores past recommendations for a particular user.
  • Resampling data: It’s part of the pipeline to help scale the training process by down-sampling negative samples.
  • Feature pipeline: A pipeline program to generate all required features for training a model. It’s important for feature pipelines to provide high throughput, since we need to retrain models multiple times. We can use Spark, Elastic MapReduce, or Google Dataproc.
  • Model Repos: Storage for all models; AWS S3 is a popular option.
  • In practice, during inference, it’s desirable to be able to get the latest model near real-time. One common pattern for the inference component is to frequently pull the latest models from Model Repos based on timestamp.

6. Scaling Challenges

Huge data size

  • Solution: Pick 1 month or 6 months of recent data.

Imbalanced data

  • Solution: Perform random negative down-sampling.
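
Here is a minimal sketch of random negative down-sampling with pandas; the DataFrame, label column name, and the 5:1 negative-to-positive ratio are placeholders:

```python
# A minimal random negative down-sampling sketch with pandas.
# The DataFrame, label column, and 5:1 ratio are placeholders for illustration.
import pandas as pd

def downsample_negatives(df, label_col="watched", neg_per_pos=5, seed=0):
    positives = df[df[label_col] == 1]
    negatives = df[df[label_col] == 0].sample(
        n=min(len(positives) * neg_per_pos, (df[label_col] == 0).sum()),
        random_state=seed,
    )
    return pd.concat([positives, negatives]).sample(frac=1, random_state=seed)  # shuffle

toy = pd.DataFrame({"watched": [1] * 3 + [0] * 97, "feature": range(100)})
print(downsample_negatives(toy)["watched"].value_counts())
```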

High availability

  • Solution 1: Use model-as-a-service; each model will run in its own Docker container.
  • Solution 2: We can use Kubernetes to auto-scale the number of pods.

When a user requests a video recommendation, the Application Server requests Video candidates from the Candidate Generation Model. Once it receives the candidates, it then passes the candidate list to the ranking model to get the sorting order. The ranking model estimates the watch probability and returns the sorted list to the Application Server. The Application Server then returns the top videos that the user should watch.

7. Scale the design

  • Scale out (horizontal) multiple Application Servers and use Load Balancers to balance loads.
  • Scale out (horizontal) multiple Candidate Generation Services and Ranking Services.
  • It’s common to deploy these services in a Kubernetes Pod and take advantage of the Kubernetes Pod Autoscaler to scale out these services automatically.
  • In practice, we can also use Kube-proxy so the Candidate Generation Service can call Ranking Service directly, reducing latency even further.
Video Recommendation System At Scale

You can learn more about the Machine Learning System Design interview with our Machine Learning System Design course on Interview Query.

Additional ML System Design Questions

1. How would you build a Twitter-style social media feed to display relevant posts to users?

2. Build an advertising bidding system that presents personalized ads to users.

3. Design a machine learning system that can identify fraudulent transactions.

Recommendation and Search Engines Questions

Recommendation and search engine questions are technically a combination of case study questions and system design questions. But they are asked so frequently that it’s worth treating them as their own category.

1. Let's say that you're working on a job recommendation engine.

You have access to all user LinkedIn profiles, a list of jobs each user applied to, and answers to questions that the user filled in about their job search.

Using this information, how would you build a job recommendation feed?

Need a hint?

What would the job recommendation workflow look like?

Can we lay out the steps a user takes through the job recommendation flow, so that we understand what a potential dataset would look like?

2. How would you build a recommendation engine to recommend news to users on Google?

3. How would you evaluate a new search engine that your co-worker built?

4. How would you build the recommendation algorithm for type-ahead search for Netflix?

Need a hint?

Let's think about a simple use case to start out with. Let's say that we type in the word "hello" as the beginning of a movie title.

If we typed in h-e-l-l-o, then a suitable suggestion might be a movie like "Hello Sunshine" or a Spanish movie named "Hola".

How would we decide between the two?

Writing ML algorithms from scratch

Coding machine learning algorithms from scratch is becoming increasingly common in interviews. These questions are framed around deriving, from scratch, machine learning algorithms that are normally encapsulated in scikit-learn or other packages.

The interviewer is mainly testing a raw understanding of coding optimizations, performance, and memory as they apply to existing machine learning algorithms. These questions also test whether the candidate really understands the underlying algorithm: could they build it using nothing but the NumPy package?

Generally, these types of machine learning interview questions are pretty controversial. They're hard to do within a specific timeframe and generally pretty vague in how they're graded.

Example Questions

  1. Write a function to build K-NN from scratch on a sample input of a list of lists of integers (a minimal sketch follows this list).
  2. Given a list of tuples, write a function to compute the maximum gradient descent coefficient.
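
As an example of what an answer to the first question might look like, here is a hedged K-NN classification sketch using only the standard library; it assumes, purely for illustration, that the last element of each inner list is the class label:

```python
# A hedged K-NN classification sketch using only the standard library.
# Assumption (for illustration): the last element of each training row is the label.
from collections import Counter
import math

def knn_predict(training_rows, query, k=3):
    def distance(row):
        return math.dist(row[:-1], query)            # Euclidean distance on the features
    nearest = sorted(training_rows, key=distance)[:k]
    labels = [row[-1] for row in nearest]
    return Counter(labels).most_common(1)[0][0]      # majority vote among the k neighbors

training_rows = [[1, 1, 0], [2, 1, 0], [8, 9, 1], [9, 8, 1]]
print(knn_predict(training_rows, [1, 2]))            # -> 0
```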


If you have a machine learning interview coming up, check out our machine learning course on Interview Query!

Also, exactly how much machine learning do I need to know?

This is the question I have been asked most often since starting Interview Query.

Why do you think that is?

Because there is an infinite amount of knowledge you can consume in machine learning. Literally infinite. The very definition of machine learning and AI embodies this fact.

Machine learning is a technology that breaks new ground every day. Technically, it should be improving faster and faster, given that machine learning and artificial intelligence are essentially supposed to learn on their own.

However, machine learning tested in an interview is completely different from how it is generally framed in real practice.

A data scientist is not expected to know machine learning at the same level as a machine learning engineer or research scientist. This expectation can be confounded, however, by what the employer thinks a data scientist does versus a machine learning engineer; for example, a role may be titled data scientist but actually be designed around building machine learning infrastructure the whole time.

Data Scientist ML Interviews

The data scientist role is primarily responsible for solving business problems by pulling, munging, and generating insights from data. Data scientists will explore all aspects of the business and work cross-functionally with different teams to do everything from developing dashboards for reporting and exploring analytics for insights, to building models.

The last part of building models is tricky in determining how much machine learning a data scientist should know. Many data science roles that are focused on analytics don’t require any machine learning at all, while some roles are essentially machine learning engineers with a data scientist title. Generally, the main way to understand the difference is to ask everyone at the company about the day-to-day responsibilities of the role that you’re interviewing for.

For example, if we look at the Facebook Data Scientist role, we won't see much machine learning tested in their interview.

Facebook Data Scientist role from Interview Query

But if we compare it with the data science role at C3.ai, we see a huge emphasis on machine learning.

C3.AI Data Science Role from Interview Query

Machine Learning and Data Engineers

Machine learning engineers build and deploy models, develop infrastructure to scale them, and work with data scientists to understand the best use cases. They leverage data tools, programming frameworks, and data pipelines to ensure that models scale appropriately to any technical specifications.

Machine learning engineers should also have a strong knowledge of machine learning and theory, given their responsibility for building tooling and automation over the model creation, training, and evaluation life cycle.

Regular software engineers aren't expected to know too much about machine learning. But data engineers will likely need to know how to scale up data infrastructure alongside the machine learning engineers so that the models can retrieve and output the correct data points.

Research Scientists and AI Researchers

Research scientists are typically roles meant for teams to break new ground with machine learning in the research domain. The level of machine learning and statistics knowledge needed is usually very high.

Given these three roles, the best way to estimate how much machine learning knowledge is needed for the interview is to first understand how embedded in machine learning your job will be. You can figure this out by researching the company, the position, the team, and the backgrounds of your interview panel.