Kroll is a leading global provider of risk and financial advisory solutions, harnessing nearly a century of expertise to empower clients in navigating complex challenges.
As a Data Scientist at Kroll, you will play a pivotal role in validating and ensuring the compliance of Artificial Intelligence (AI) and Machine Learning (ML) models. Your key responsibilities will include designing and implementing processes for model validation, ensuring transparency and explainability of model outputs, and maintaining comprehensive documentation of processes and results. This role requires a solid foundation in statistics, a strong grasp of machine learning and data analysis techniques, and proficiency in programming languages such as Python or R. Furthermore, your ability to collaborate with cross-functional teams and communicate complex technical information clearly will be essential in driving successful outcomes. A keen understanding of compliance standards and a commitment to ethical AI practices will align your contributions with Kroll's dedication to integrity and excellence.
This guide will help you prepare for your interview by providing insights into the key competencies and expectations for the Data Scientist role at Kroll, enabling you to present your qualifications confidently.
The interview process for a Data Scientist role at Kroll is structured to assess both technical expertise and cultural fit within the organization. It typically consists of several stages designed to evaluate your skills in data science, machine learning, and your understanding of compliance and regulatory frameworks.
The process begins with an initial screening, which is often conducted via a phone call with a recruiter. This conversation focuses on your background, qualifications, and motivation for applying to Kroll. The recruiter will also gauge your understanding of the role and how your skills align with the company's needs.
Following the initial screening, candidates may be required to complete a technical assessment. This could involve a take-home assignment or a timed quiz that tests your knowledge of statistics, financial modeling, and programming skills, particularly in Python. The assessment aims to evaluate your analytical abilities and your understanding of machine learning concepts.
The next step typically involves a behavioral interview, where you will meet with a manager or team lead. This interview focuses on your past experiences, problem-solving skills, and how you handle challenges in a work environment. Expect questions that explore your teamwork, communication skills, and your approach to compliance and ethical considerations in data science.
Candidates who progress further may participate in a panel interview with multiple team members. This stage often includes both technical and behavioral questions, allowing the interviewers to assess your fit within the team and your technical knowledge in greater depth. You may be asked to explain your approach to model validation, compliance controls, and how you would collaborate with cross-functional teams.
The final stage usually involves an interview with senior leadership, such as a VP or Director. This interview is more focused on cultural fit and your long-term vision within the company. You may be asked to discuss your understanding of Kroll's business model and how you can contribute to the company's goals, particularly in the context of AI and machine learning.
As you prepare for your interview, it's essential to be ready for a variety of questions that will test your technical knowledge and your ability to apply it in real-world scenarios.
Here are some tips to help you excel in your interview.
Given Kroll's focus on risk and financial advisory, it's crucial to familiarize yourself with financial concepts and terminology. Brush up on key topics such as EBITDA, discounted cash flow (DCF) analysis, and financial modeling techniques. Be prepared to discuss how your data science skills can contribute to financial analysis and decision-making processes. This understanding will not only help you answer technical questions but also demonstrate your ability to connect data science with business outcomes.
The interview process at Kroll often delves into the nuances of data science and machine learning. Be ready to discuss your experience with model validation, statistical analysis, and programming languages like Python. Familiarize yourself with machine learning frameworks such as TensorFlow or PyTorch, and be prepared to explain your approach to model development and testing. Expect questions that require you to demonstrate a deep understanding of algorithms and statistical principles, so practice articulating your thought process clearly.
Kroll places a strong emphasis on compliance and ethical considerations in AI/ML solutions. Be prepared to discuss how you would ensure that your models adhere to legal and regulatory standards, including data privacy and bias mitigation. Familiarize yourself with relevant compliance frameworks and be ready to provide examples of how you have incorporated these principles into your previous work.
During the interview, you may encounter scenario-based questions that assess your problem-solving abilities. Practice articulating your thought process when faced with complex challenges, particularly in the context of data analysis and model validation. Highlight specific examples from your past experiences where you successfully navigated difficult situations or made data-driven decisions.
Strong communication skills are essential for a Data Scientist at Kroll, as you will need to collaborate with cross-functional teams and explain complex technical concepts to non-technical stakeholders. Practice summarizing your projects and findings in a clear and concise manner. Be prepared to discuss how you would communicate model results and insights to clients, ensuring transparency and understanding.
Kroll values cultural fit, so expect behavioral questions that explore your teamwork, adaptability, and motivation. Reflect on your past experiences and be ready to share stories that highlight your ability to work collaboratively, handle feedback, and contribute positively to a team environment. Demonstrating your alignment with Kroll's values will be key to making a strong impression.
Finally, show your enthusiasm for the field by staying updated on the latest advancements in data science and machine learning. Be prepared to discuss recent trends, tools, or methodologies that excite you and how they could be applied at Kroll. This curiosity will not only showcase your passion for the role but also your commitment to continuous improvement and learning.
By following these tips and preparing thoroughly, you'll position yourself as a strong candidate for the Data Scientist role at Kroll. Good luck!
In this section, we’ll review the various interview questions that might be asked during a Data Scientist interview at Kroll. The interview process will likely focus on your technical expertise in data science, machine learning, and statistical analysis, as well as your understanding of compliance and ethical considerations in AI/ML. Be prepared to discuss your past projects, demonstrate your problem-solving skills, and articulate how you can contribute to Kroll's mission.
Understanding the fundamental concepts of machine learning is crucial for this role.
Discuss the definitions of both supervised and unsupervised learning, providing examples of each. Highlight the types of problems each approach is best suited for.
“Supervised learning involves training a model on labeled data, where the outcome is known, such as predicting house prices based on features like size and location. In contrast, unsupervised learning deals with unlabeled data, aiming to find hidden patterns or groupings, like customer segmentation in marketing.”
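If it helps to make the contrast concrete, the two paradigms can be sketched in a few lines of plain Python. The data below are made-up toy values, not anything Kroll-specific:

```python
# Supervised: labeled pairs (x, y); learn a rule mapping x -> y.
labeled = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]  # roughly y = 2x
slope = sum(x * y for x, y in labeled) / sum(x * x for x, _ in labeled)

def predict(x):
    return slope * x  # least-squares line through the origin

# Unsupervised: unlabeled points; discover structure (two clusters here).
points = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
centroids = [min(points), max(points)]
for _ in range(5):  # simple 1-D k-means
    clusters = [[], []]
    for p in points:
        nearest = 0 if abs(p - centroids[0]) <= abs(p - centroids[1]) else 1
        clusters[nearest].append(p)
    centroids = [sum(c) / len(c) for c in clusters]
```

The supervised half uses the labels to fit a predictor; the unsupervised half never sees a label and instead recovers the two groups hiding in the data.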
Model validation is key to ensuring the reliability of AI/ML solutions.
Explain the various techniques for model validation, such as cross-validation, and discuss the importance of metrics like accuracy, precision, and recall.
“I typically use k-fold cross-validation to assess model performance, ensuring that the model generalizes well to unseen data. I also monitor metrics like precision and recall to evaluate the model's effectiveness, especially in imbalanced datasets.”
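A minimal sketch of k-fold cross-validation, assuming a toy "model" that simply predicts the training mean and a negative-MAE scorer (both invented for illustration):

```python
import random

def cross_validate(fit, score, data, k=5, seed=0):
    """Shuffle, split into k folds, train on k-1, score on the held-out fold."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
        test = [data[j] for j in folds[i]]
        scores.append(score(fit(train), test))
    return sum(scores) / len(scores)  # average performance across folds

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
fit_mean = lambda train: sum(train) / len(train)
neg_mae = lambda model, test: -sum(abs(model - t) for t in test) / len(test)
avg_score = cross_validate(fit_mean, neg_mae, data)
```

Because every observation serves as test data exactly once, the averaged score is a less optimistic estimate of generalization than a single train/test split.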
Overfitting is a common issue in machine learning that can lead to poor model performance.
Define overfitting and discuss strategies to prevent it, such as regularization techniques and using more training data.
“Overfitting occurs when a model learns the noise in the training data rather than the underlying pattern. To prevent this, I use techniques like L1 and L2 regularization, and I validate the model on a separate held-out test set to check how it performs on data it has not seen.”
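The shrinking effect of L2 (ridge) regularization can be illustrated with a one-feature model through the origin, where minimizing the penalized squared error gives the closed form w = Σxy / (Σx² + λ). The data are invented toy values:

```python
data = [(1, 2.0), (2, 4.1), (3, 5.9)]  # roughly y = 2x

def ridge_slope(data, lam):
    # Closed-form ridge solution for a single weight through the origin.
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

w_unreg = ridge_slope(data, 0.0)   # plain least squares, ~2.0
w_reg = ridge_slope(data, 10.0)    # stronger penalty shrinks the weight
```

The larger λ is, the smaller the fitted weight — that deliberate bias toward simpler models is exactly what curbs overfitting.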
This question assesses your practical experience and problem-solving skills.
Provide a brief overview of the project, the challenges encountered, and how you overcame them.
“In a recent project, I developed a predictive model for customer churn. One challenge was dealing with missing data, which I addressed by implementing imputation techniques. This improved the model's accuracy significantly.”
A solid understanding of statistics is essential for data analysis.
Explain the Central Limit Theorem and its implications for statistical inference.
“The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the population's distribution. This is crucial for making inferences about population parameters based on sample statistics.”
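A quick simulation makes the theorem tangible: means of samples drawn from a uniform (decidedly non-normal) population cluster around the population mean, with spread close to σ/√n. This sketch uses only the standard library:

```python
import random
import statistics

rng = random.Random(42)
n, trials = 50, 2000  # sample size and number of repeated samples

# Each trial: draw n uniform(0, 1) values and record the sample mean.
sample_means = [
    statistics.fmean(rng.uniform(0, 1) for _ in range(n))
    for _ in range(trials)
]

center = statistics.fmean(sample_means)   # near the population mean 0.5
spread = statistics.stdev(sample_means)   # near (1/sqrt(12)) / sqrt(50) ~ 0.041
```

A histogram of `sample_means` would look bell-shaped even though each individual draw is flat-distributed — which is the theorem in action.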
Handling missing data is a common task in data preparation.
Discuss various methods for dealing with missing data, including imputation and deletion techniques.
“I often use imputation methods, such as mean or median substitution, for numerical data. For categorical data, I might use the mode or create a separate category for missing values. However, I always assess the impact of these methods on the overall analysis.”
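A minimal sketch of the two imputation strategies just described — mean substitution for a numeric column and mode substitution for a categorical one — on hypothetical toy columns:

```python
import statistics

ages = [25, 30, None, 40, None, 35]          # numeric column with gaps
cities = ["NY", None, "LA", "NY", "NY", None]  # categorical column with gaps

# Mean imputation for the numeric column.
age_mean = statistics.fmean(a for a in ages if a is not None)
ages_imputed = [a if a is not None else age_mean for a in ages]

# Mode imputation for the categorical column.
city_mode = statistics.mode(c for c in cities if c is not None)
cities_imputed = [c if c is not None else city_mode for c in cities]
```

As the sample answer notes, it's worth checking how much the imputed values shift summary statistics before trusting downstream results.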
Understanding p-values is critical for hypothesis testing.
Define p-values and their role in statistical hypothesis testing.
“A p-value is the probability of observing data at least as extreme as what we saw, assuming the null hypothesis is true. When the p-value falls below a chosen significance level, commonly 0.05, we reject the null hypothesis and call the observed effect statistically significant.”
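One way to demystify the definition is a permutation test, where the p-value is estimated directly as the fraction of random label shuffles that produce a difference at least as extreme as the observed one. The two groups below are invented toy samples:

```python
import random
import statistics

a = [2.1, 2.5, 2.3, 2.8, 2.6]  # e.g. treatment group (toy data)
b = [1.2, 1.4, 1.1, 1.5, 1.3]  # e.g. control group (toy data)
observed = abs(statistics.fmean(a) - statistics.fmean(b))

rng = random.Random(0)
pooled = a + b
trials, extreme = 2000, 0
for _ in range(trials):
    rng.shuffle(pooled)  # break any real group structure (the null hypothesis)
    diff = abs(statistics.fmean(pooled[:5]) - statistics.fmean(pooled[5:]))
    if diff >= observed:
        extreme += 1
p_value = extreme / trials
```

Because almost no shuffle reproduces a gap as large as the observed one, the estimated p-value is tiny — strong evidence against the null hypothesis of no group difference.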
This question tests your knowledge of statistical hypothesis testing.
Define both types of errors and provide examples of each.
“A Type I error occurs when we reject a true null hypothesis, while a Type II error happens when we fail to reject a false null hypothesis. For instance, in a medical trial, a Type I error could mean concluding a treatment is effective when it is not, while a Type II error could mean missing a truly effective treatment.”
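The Type I error rate can also be checked by simulation: when the null hypothesis is true, a test run at the 5% level should falsely reject about 5% of the time. This sketch assumes a uniform(0, 1) population whose mean really is 0.5:

```python
import random

rng = random.Random(1)
n, trials = 30, 5000
se = (1 / 12) ** 0.5 / n ** 0.5  # standard error of the mean for uniform(0, 1)

rejections = 0
for _ in range(trials):
    sample_mean = sum(rng.uniform(0, 1) for _ in range(n)) / n
    # Two-sided z-test at alpha = 0.05: reject if the mean strays too far from 0.5.
    if abs(sample_mean - 0.5) > 1.96 * se:
        rejections += 1

type_i_rate = rejections / trials  # should land near 0.05
```

Every rejection here is, by construction, a Type I error — the null is true throughout — so the rejection rate converges to the significance level α.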
Familiarity with algorithms is essential for model development.
Discuss a specific classification algorithm, its working principle, and when to use it.
“The decision tree is a popular classification algorithm that splits data into branches based on feature values. Decision trees are easy to interpret and can handle both numerical and categorical data, making them suitable for a wide range of applications.”
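The core mechanism — a single split on one feature — can be sketched as a decision stump, the one-level building block a full tree applies recursively. The data are toy values:

```python
# Toy labeled data: (feature value, class label).
data = [(1.0, 0), (1.5, 0), (2.0, 0), (6.0, 1), (6.5, 1), (7.0, 1)]

def best_split(data):
    """Pick the threshold between adjacent values with the fewest misclassifications."""
    xs = sorted(x for x, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(t):
        return sum((x > t) != bool(y) for x, y in data)

    return min(candidates, key=errors)

threshold = best_split(data)

def classify(x):
    return int(x > threshold)  # one branch per side of the split
```

A full tree repeats this split search on each resulting branch; here a single threshold already separates the two classes perfectly.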
This question assesses your analytical thinking and problem-solving skills.
Explain the factors that influence algorithm selection, such as data type, size, and the specific problem at hand.
“I consider the nature of the data, the problem type, and the desired outcome. For instance, if I have a large dataset with many features, I might choose a random forest for its robustness and resistance to overfitting, while for smaller datasets, I might opt for logistic regression for its simplicity.”
Feature selection is crucial for improving model performance.
Discuss the importance of feature selection and the methods you use to perform it.
“Feature selection helps reduce overfitting and improve model interpretability. I often use techniques like recursive feature elimination and feature importance from tree-based models to identify the most relevant features for my models.”
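A simple filter-style variant of feature selection — ranking features by absolute correlation with the target and keeping the top k — can be sketched as follows (the feature names and data are made up for illustration):

```python
# Toy rows: (feature_a, feature_b, feature_noise, target).
rows = [
    (1, 10, 3, 1.0),
    (2, 8, 1, 2.1),
    (3, 6, 4, 2.9),
    (4, 4, 1, 4.2),
    (5, 2, 5, 5.0),
]
names = ["feature_a", "feature_b", "feature_noise"]
target = [r[-1] for r in rows]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

scores = {name: abs(pearson([r[i] for r in rows], target))
          for i, name in enumerate(names)}
top_two = sorted(scores, key=scores.get, reverse=True)[:2]
```

The informative features (one positively, one negatively correlated with the target) outrank the noise column, which is exactly what a filter method is meant to surface; wrapper methods like recursive feature elimination refine this by re-fitting the model as features are dropped.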
Ensemble methods can enhance model performance.
Define ensemble learning and provide examples of common ensemble techniques.
“Ensemble learning combines multiple models to improve overall performance. Techniques like bagging and boosting are popular; for instance, Random Forest uses bagging to create multiple decision trees and averages their predictions to reduce variance.”
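Bagging itself can be sketched in a few lines: train simple models on bootstrap resamples and average their predictions to reduce variance. Here the toy "model" is just the sample mean:

```python
import random

rng = random.Random(0)
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # toy observations, true mean 4.5

def bootstrap(data):
    """Resample with replacement to the original size."""
    return [rng.choice(data) for _ in data]

# Each "model" is fit (here: its mean computed) on a different bootstrap sample.
models = [sum(s) / len(s) for s in (bootstrap(data) for _ in range(200))]

# The ensemble averages the individual predictions, smoothing out their variance.
ensemble_prediction = sum(models) / len(models)
```

Random Forest follows this same recipe with decision trees as the base models, adding random feature subsetting at each split to further decorrelate the trees.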