Udacity Data Engineer Interview Guide

1. Introduction

Getting ready for a Data Engineer interview at Udacity? The Udacity Data Engineer interview process typically spans multiple rounds and evaluates skills in areas like data pipeline design, ETL processes, SQL and Python programming, and communicating technical insights to diverse audiences. Because Udacity is an online education platform focused on delivering high-quality, data-driven learning experiences, interview preparation is especially important—candidates are expected to demonstrate both technical expertise and the ability to translate complex data concepts into actionable solutions that align with educational goals.

In preparing for the interview, you should:

  • Understand the core skills necessary for Data Engineer positions at Udacity.
  • Gain insights into Udacity’s Data Engineer interview structure and process.
  • Practice real Udacity Data Engineer interview questions to sharpen your performance.

At Interview Query, we regularly analyze interview experience data shared by candidates. This guide uses that data to provide an overview of the Udacity Data Engineer interview process, along with sample questions and preparation tips tailored to help you succeed.

1.1 What Udacity Does

Udacity is a global online education platform specializing in technology-focused courses and nanodegree programs in fields such as data science, artificial intelligence, programming, and cloud computing. The company partners with leading tech organizations to develop industry-relevant curricula that empower learners to advance their careers. Udacity’s mission is to democratize education by providing accessible, hands-on training for in-demand digital skills. As a Data Engineer, you will contribute to building and optimizing learning platforms and data infrastructure that drive student success and support Udacity’s commitment to innovative, skills-based education.

1.2 What does a Udacity Data Engineer do?

As a Data Engineer at Udacity, you will design, develop, and maintain robust data pipelines and architectures that support the company’s online learning platform. Your responsibilities include ensuring the reliable collection, storage, and processing of large volumes of educational and user engagement data. You will collaborate with data scientists, analysts, and product teams to deliver high-quality datasets for analytics, reporting, and personalized learning experiences. By optimizing data workflows and implementing best practices in data management, you play a key role in enabling Udacity to make data-driven decisions and to continually improve its educational offerings.

2. Overview of the Udacity Interview Process

2.1 Stage 1: Application & Resume Review

The process begins with a detailed review of your application materials, focusing on your experience with data engineering concepts such as pipeline development, data modeling, ETL processes, and your proficiency with programming languages like Python and SQL. The hiring team pays close attention to demonstrated experience with cloud data platforms, large-scale data infrastructure, and prior roles involving analytics and data warehousing. Ensure your resume highlights relevant technical projects and quantifiable achievements in building or scaling data systems.

2.2 Stage 2: Recruiter Screen

A recruiter will reach out for an initial conversation, typically lasting 20–30 minutes. This call is designed to assess your overall fit for Udacity’s culture and mission, clarify your motivation for applying, and briefly validate your technical background. Expect questions about your experience with data engineering tools, your problem-solving approach, and your communication skills. Preparation should involve articulating your career path, why you are interested in Udacity, and your high-level technical competencies.

2.3 Stage 3: Technical/Case/Skills Round

This stage often consists of multiple technical interviews and a take-home assignment. The take-home assignment (usually allotted 3–4 days) will test your ability to design, implement, and test data pipelines or ETL workflows, often requiring Python and SQL. You may also be asked to model data structures, optimize queries, and solve real-world data engineering scenarios such as building scalable ingestion pipelines or diagnosing pipeline failures. Subsequent technical rounds (45–60 minutes each) may include live coding, system design (e.g., architecting a data warehouse or digital classroom system), and in-depth discussions on your programming experience, data modeling, and analytics problem-solving. Interviewers are typically senior data engineers, technical leads, or the hiring manager.

2.4 Stage 4: Behavioral Interview

At this stage, you’ll meet with team members or cross-functional partners who will assess your collaboration skills, adaptability, and alignment with Udacity’s values. Expect scenario-based questions about overcoming challenges in data projects, communicating insights to non-technical stakeholders, and handling ambiguous requirements. You should be ready to discuss past experiences where you worked across teams, presented data-driven recommendations, or addressed data quality and pipeline reliability.

2.5 Stage 5: Final/Onsite Round

The final round often includes a series of back-to-back interviews (virtual or onsite) with multiple stakeholders, such as data engineers, analysts, and team leads. These sessions will cover advanced technical topics—such as designing robust and scalable ETL systems, optimizing data storage, and handling large datasets—as well as deeper dives into your previous project work. You may be asked to present your take-home assignment, walk through your design decisions, and respond to follow-up questions on technical trade-offs and best practices. Behavioral and culture-fit questions are also common in this round.

2.6 Stage 6: Offer & Negotiation

If successful, you’ll engage with the recruiter or hiring manager to discuss compensation, benefits, role expectations, and next steps. This stage provides an opportunity to clarify team structure, growth opportunities, and any logistical considerations such as relocation or remote work.

2.7 Average Timeline

The typical Udacity Data Engineer interview process spans approximately 3–6 weeks from initial application to final offer. Fast-track candidates with highly relevant experience and prompt responses may complete the process in as little as 2–3 weeks, while the standard pace allows for multiple rounds and assignment review, often with a week between each stage. The take-home assignment generally provides a 3–4 day window for completion, and scheduling interviews depends on candidate and interviewer availability.

Below, you’ll find a breakdown of actual interview questions that have been asked during the Udacity Data Engineer hiring process.

3. Udacity Data Engineer Sample Interview Questions

3.1 Data Pipeline Design & ETL

Data pipeline and ETL questions assess your ability to design scalable, robust data workflows and solve real-world ingestion, transformation, and storage challenges. Focus on demonstrating your familiarity with automation, error handling, and the ability to optimize for both reliability and efficiency.

3.1.1 Design a robust, scalable pipeline for uploading, parsing, storing, and reporting on customer CSV data
Outline the architecture from ingestion to storage, emphasizing data validation, error handling, and modularity. Discuss how you’d ensure scalability and monitor pipeline health.
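
To make this concrete, here is a minimal Python sketch of the validation-and-quarantine step, assuming a local file drop and SQLite as the store; the table name, required columns, and validate_row helper are all illustrative, not a prescribed solution.

    import csv
    import sqlite3
    from pathlib import Path

    REQUIRED = ["customer_id", "email", "signup_date"]

    def validate_row(row: dict) -> bool:
        # Reject rows missing required fields; a real pipeline would also
        # check types, formats, and referential integrity.
        return all(row.get(col) for col in REQUIRED)

    def ingest_csv(path: Path, conn: sqlite3.Connection) -> None:
        good, bad = [], []
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                (good if validate_row(row) else bad).append(row)
        conn.executemany(
            "INSERT INTO customers (customer_id, email, signup_date) VALUES (?, ?, ?)",
            [(r["customer_id"], r["email"], r["signup_date"]) for r in good],
        )
        conn.commit()
        # Quarantined rows feed an error report instead of failing the whole load.
        print(f"loaded {len(good)} rows, quarantined {len(bad)}")

    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS customers (customer_id TEXT, email TEXT, signup_date TEXT)")
    ingest_csv(Path("upload.csv"), conn)

The point of a sketch like this in an interview is the separation of concerns: validation, loading, and error reporting can each be scaled or swapped independently.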

3.1.2 Design a data pipeline for hourly user analytics
Describe how you’d collect, aggregate, and store event data for timely analytics. Highlight choices around batch vs. streaming, partitioning, and maintaining data quality.
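
As a rough illustration, the core of a batch version of this job can be a few lines of pandas; the event schema here is hypothetical, and a production job would read from object storage and write to a partitioned warehouse table rather than use in-memory frames.

    import pandas as pd

    def hourly_rollup(events: pd.DataFrame) -> pd.DataFrame:
        # Truncate timestamps to the hour, then count events per user per hour.
        events = events.assign(hour=events["event_ts"].dt.floor("h"))
        return (
            events.groupby(["hour", "user_id"])
            .size()
            .reset_index(name="event_count")
        )

    events = pd.DataFrame({
        "user_id": [1, 1, 2],
        "event_ts": pd.to_datetime([
            "2024-01-01 10:05", "2024-01-01 10:40", "2024-01-01 11:02",
        ]),
    })
    print(hourly_rollup(events))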

3.1.3 How would you systematically diagnose and resolve repeated failures in a nightly data transformation pipeline?
Walk through a troubleshooting framework—monitoring, logging, root cause analysis, and iterative fixes. Emphasize communication with stakeholders and documenting the resolution process.
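
One concrete pattern worth having in your back pocket is a retry wrapper with structured logging, so failures are visible and escalate cleanly; this is a generic sketch, and the step and alerting hooks are placeholders.

    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("nightly_etl")

    def run_with_retries(step, name: str, attempts: int = 3, backoff_s: float = 30.0):
        # Retry a flaky step with linear backoff; log every failure with
        # enough context (step name, attempt number) for root cause analysis.
        for attempt in range(1, attempts + 1):
            try:
                return step()
            except Exception:
                log.exception("step=%s attempt=%d failed", name, attempt)
                if attempt == attempts:
                    raise  # retries exhausted: paging/alerting hooks would fire here
                time.sleep(backoff_s * attempt)

    # Usage (load_table is a hypothetical step):
    # run_with_retries(lambda: load_table("daily_events"), "load_daily_events")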

3.1.4 Design an end-to-end data pipeline to process and serve data for predicting bicycle rental volumes
Detail the steps from raw data ingestion through transformation, feature engineering, and serving the data for downstream models or dashboards. Mention automation and error recovery strategies.
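
For the feature-engineering step specifically, a small pandas sketch is often enough to anchor the discussion; the column names are hypothetical, and real rental data would add weather, holidays, and lag features.

    import pandas as pd

    def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
        # Derive calendar features a rental-volume model can consume directly.
        out = df.copy()
        out["hour"] = out["rented_at"].dt.hour
        out["day_of_week"] = out["rented_at"].dt.dayofweek
        out["is_weekend"] = out["day_of_week"] >= 5
        return out

    rentals = pd.DataFrame({
        "rented_at": pd.to_datetime(["2024-06-01 08:15", "2024-06-03 17:40"]),
        "rentals": [42, 77],
    })
    print(add_time_features(rentals))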

3.1.5 Design a scalable ETL pipeline for ingesting heterogeneous data from Skyscanner's partners
Explain how you’d handle schema variability, ensure data consistency, and automate ingestion from multiple sources. Discuss monitoring, alerting, and data validation approaches.
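
A useful way to show how you would tame schema variability is a per-partner mapping layer that normalizes every payload to one canonical record before it touches shared transforms; the partner names and field mappings below are invented for illustration.

    CANONICAL_FIELDS = ["origin", "destination", "price_usd"]

    PARTNER_MAPPINGS = {
        "partner_a": {"from": "origin", "to": "destination", "fare": "price_usd"},
        "partner_b": {"src": "origin", "dst": "destination", "priceUSD": "price_usd"},
    }

    def normalize(partner: str, payload: dict) -> dict:
        # Map raw field names to the canonical schema.
        mapping = PARTNER_MAPPINGS[partner]
        record = {canon: payload.get(raw) for raw, canon in mapping.items()}
        missing = [f for f in CANONICAL_FIELDS if record.get(f) is None]
        if missing:
            # Route to a dead-letter queue instead of corrupting the warehouse.
            raise ValueError(f"{partner} payload missing {missing}")
        return record

    print(normalize("partner_b", {"src": "LHR", "dst": "JFK", "priceUSD": 420}))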

3.2 Data Modeling & Warehousing

These questions evaluate your skills in designing data models and architecting warehouses for efficient querying and reporting. Show your understanding of normalization, schema design, and trade-offs between flexibility and performance.

3.2.1 Design a data warehouse for a new online retailer
Describe your approach to modeling customer, product, and transaction data. Discuss star vs. snowflake schemas and considerations for scalability and analytics.
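
If it helps to sketch the shape of your answer, here is a minimal star schema expressed as DDL and run through SQLite from Python; the table and column names are illustrative.

    import sqlite3

    # One fact table keyed to three dimensions: the classic star layout.
    DDL = """
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
    CREATE TABLE fact_sales (
        sale_id      INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        date_key     INTEGER REFERENCES dim_date(date_key),
        quantity     INTEGER,
        revenue      REAL
    );
    """

    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)  # analytics queries then join fact_sales to the dims

The trade-off to narrate: a star schema denormalizes dimensions for simpler, faster analytical joins, while a snowflake schema normalizes them further at the cost of more joins.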

3.2.2 Model a database for an airline company
Explain the entities, relationships, and keys you’d use. Touch on how you’d handle flight schedules, bookings, and customer data for both operational and analytical use cases.

3.2.3 Let's say that you're in charge of getting payment data into your internal data warehouse
Walk through your approach to ingesting, cleaning, and storing sensitive payment data. Address data security, compliance, and strategies for handling late-arriving or inconsistent records.

3.3 SQL, Data Manipulation & Performance

These questions test your ability to write efficient queries, handle large datasets, and ensure data integrity. Be ready to discuss optimization, error handling, and edge cases.

3.3.1 Write a query to get the current salary for each employee after an ETL error
Explain how you’d identify and correct errors in historical data, ensuring accurate and up-to-date results. Discuss handling duplicate or missing records.
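
One common version of this problem assumes the ETL job inserted a new row on every salary change instead of updating in place, so the latest row per employee (highest id) is the current one; the schema and data below are illustrative, and the query is runnable via SQLite.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE salaries (id INTEGER PRIMARY KEY, employee_id INTEGER, salary INTEGER);
    INSERT INTO salaries (employee_id, salary) VALUES (1, 90000), (1, 95000), (2, 80000);
    """)

    # Keep only the most recently inserted row per employee.
    CURRENT_SALARY = """
    SELECT employee_id, salary
    FROM salaries s
    WHERE id = (SELECT MAX(id) FROM salaries WHERE employee_id = s.employee_id);
    """
    print(conn.execute(CURRENT_SALARY).fetchall())  # [(1, 95000), (2, 80000)]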

3.3.2 Write a query to calculate the conversion rate for each trial experiment variant
Describe grouping, aggregation, and handling missing or null values. Emphasize clarity in your logic and how you’d validate results.
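
A minimal sketch of the aggregation logic, assuming one row per trial user with the variant shown and a converted flag (both hypothetical):

    import pandas as pd

    trials = pd.DataFrame({
        "variant":   ["A", "A", "A", "B", "B"],
        "converted": [1, 0, 1, 0, 1],
    })

    # Conversions and users per variant, then the rate.
    rates = (
        trials.groupby("variant")["converted"]
        .agg(conversions="sum", users="count")
        .assign(conversion_rate=lambda d: d["conversions"] / d["users"])
    )
    print(rates)

Be ready to explain how users with a null converted value would be treated (counted as non-converters vs. excluded), since that choice changes the denominator.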

3.3.3 Write a query to find the engagement rate for each ad type
Explain how you’d join relevant tables, filter for qualified users, and calculate engagement metrics. Discuss performance considerations for large datasets.

3.3.4 How would you modify a billion rows efficiently in a production environment?
Discuss strategies for minimizing downtime, batching updates, and ensuring transactional integrity. Highlight monitoring and rollback plans.
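
The usual pattern is chunked updates with commits between chunks, so locks stay short and progress is resumable; this SQLite sketch is illustrative (the events table and processed flag are hypothetical), and in a production RDBMS you would key batches on an indexed column and watch replication lag.

    import sqlite3
    import time

    BATCH = 10_000

    def backfill(conn: sqlite3.Connection) -> None:
        while True:
            cur = conn.execute(
                """
                UPDATE events SET processed = 1
                WHERE rowid IN (
                    SELECT rowid FROM events WHERE processed = 0 LIMIT ?
                )
                """,
                (BATCH,),
            )
            conn.commit()  # release locks between batches
            if cur.rowcount == 0:
                break
            time.sleep(0.1)  # throttle to protect concurrent traffic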

3.4 Data Quality & Cleaning

Data quality and cleaning are critical for reliable analytics. These questions probe your approach to profiling, cleaning, and validating real-world messy datasets.

3.4.1 Describe a real-world data cleaning and organization project
Share your process for profiling data, identifying issues, and applying cleaning techniques. Explain how you ensured reproducibility and communicated changes.

3.4.2 Discuss the challenges of a specific student test score layout, recommend formatting changes for easier analysis, and identify common issues in “messy” datasets
Describe strategies for standardizing unstructured or inconsistent data and tools you’d use for automation. Highlight your approach to validation and error tracking.
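
A concrete example of the usual fix for “one column per test” layouts is unpivoting to a tidy long format, so adding a new test never requires a schema change; the column names are illustrative.

    import pandas as pd

    wide = pd.DataFrame({
        "student_id": [101, 102],
        "math_score": [88, 92],
        "reading_score": [75, None],  # missing scores survive as NaN, not blanks
    })

    # Wide-to-long: one row per (student, test) pair.
    tidy = wide.melt(id_vars="student_id", var_name="test", value_name="score")
    tidy["test"] = tidy["test"].str.removesuffix("_score")
    print(tidy)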

3.4.3 How would you approach improving the quality of airline data?
Discuss a framework for identifying, prioritizing, and remediating data quality issues. Mention collaboration with stakeholders and setting up ongoing monitoring.

3.5 Communication, Stakeholder Management & Presentations

These questions assess your ability to communicate complex technical concepts to non-technical audiences and collaborate across teams. Focus on clarity, adaptability, and impact.

3.5.1 How to present complex data insights with clarity, adapting to your audience
Discuss tailoring your message, using visuals, and adjusting your approach based on audience needs. Emphasize storytelling and actionable recommendations.

3.5.2 Demystifying data for non-technical users through visualization and clear communication
Share techniques for making data accessible, such as interactive dashboards or simplified metrics. Highlight your experience bridging technical and business domains.

3.5.3 Making data-driven insights actionable for those without technical expertise
Explain how you break down findings, use analogies, and ensure stakeholders understand implications. Discuss feedback loops and iterating on communication style.

3.5.4 Choosing between Python and SQL for a given data task
Describe how you evaluate the strengths and limitations of each tool based on task complexity, performance, and maintainability. Provide examples of when you’d use one over the other.
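
A quick way to frame the comparison is the same aggregation written both ways; the data is invented, and the rough heuristic is that SQL tends to win when the data already lives in a database, while Python wins once you need procedural logic, ML features, or external services.

    import pandas as pd
    import sqlite3

    df = pd.DataFrame({"course": ["ds", "ds", "ai"], "completed": [1, 0, 1]})

    # pandas version
    print(df.groupby("course")["completed"].mean())

    # SQL version of the identical question
    conn = sqlite3.connect(":memory:")
    df.to_sql("enrollments", conn, index=False)
    print(conn.execute(
        "SELECT course, AVG(completed) FROM enrollments GROUP BY course"
    ).fetchall())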

3.6 System Design & Scalability

System design questions evaluate your ability to architect reliable, scalable data solutions. Demonstrate your understanding of trade-offs, modularity, and future-proofing.

3.6.1 System design for a digital classroom service
Outline key components, data flows, and how you’d ensure scalability and fault tolerance. Address data privacy and integration with other systems.

3.6.2 Design a reporting pipeline for a major tech company using only open-source tools under strict budget constraints
Discuss your approach to tool selection, cost optimization, and maintaining performance at scale. Highlight monitoring and support strategies.

3.6.3 Design a solution to store and query raw data from Kafka on a daily basis
Explain how you’d handle high-velocity data, partitioning, and efficient querying. Touch on data retention policies and consistency guarantees.
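
As one possible shape for the answer, the sketch below consumes with the kafka-python client and lands raw events as JSON lines in date-partitioned paths (dt=YYYY-MM-DD) that a daily batch job can load or query; the topic, broker address, and layout are assumptions, and production systems typically use a managed connector (e.g., Kafka Connect) rather than a hand-rolled consumer.

    import json
    from datetime import datetime, timezone
    from pathlib import Path

    from kafka import KafkaConsumer  # assumes the kafka-python package

    consumer = KafkaConsumer(
        "raw-events",                       # hypothetical topic
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
    )

    for msg in consumer:
        # Kafka message timestamps are epoch milliseconds; bucket by UTC date.
        event_date = datetime.fromtimestamp(msg.timestamp / 1000, tz=timezone.utc).date()
        out_dir = Path(f"raw/dt={event_date}")
        out_dir.mkdir(parents=True, exist_ok=True)
        with (out_dir / f"partition-{msg.partition}.jsonl").open("ab") as f:
            f.write(msg.value + b"\n")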

3.7 Behavioral Questions

3.7.1 Describe a challenging data project and how you handled it.
Use the STAR method to outline the problem, your approach to overcoming obstacles, and the final outcome. Highlight technical and interpersonal skills.

3.7.2 How do you handle unclear requirements or ambiguity?
Share your strategy for clarifying objectives—such as stakeholder interviews, prototyping, or iterative feedback—and how you ensure alignment throughout the project.

3.7.3 Tell me about a time you used data to make a decision.
Describe the context, the analysis you performed, and how your insights influenced business or technical outcomes.

3.7.4 Walk us through how you handled conflicting KPI definitions between two teams and arrived at a single source of truth.
Explain your process for aligning stakeholders, reconciling differences, and documenting agreed-upon metrics.

3.7.5 Give an example of automating recurrent data-quality checks so the same dirty-data crisis doesn’t happen again.
Discuss the tools or scripts you implemented, how you monitored results, and the impact on data reliability.

3.7.6 When leadership demanded a quicker deadline than you felt was realistic, what steps did you take to reset expectations while still showing progress?
Describe your communication approach, negotiation tactics, and how you delivered incremental value.

3.7.7 Tell me about a time you delivered critical insights even though 30% of the dataset had nulls. What analytical trade-offs did you make?
Explain your approach to handling missing data, the limitations you communicated, and how you ensured actionable results.

3.7.8 How do you prioritize multiple deadlines, and how do you stay organized while juggling them?
Share your prioritization framework and tools or techniques you use to manage competing tasks.

3.7.9 Talk about a time when you had trouble communicating with stakeholders. How were you able to overcome it?
Describe the situation, the adjustments you made to your communication style, and the eventual resolution.

3.7.10 Tell me about a situation where you had to influence stakeholders without formal authority to adopt a data-driven recommendation.
Detail your approach to building trust, presenting evidence, and securing buy-in across teams.

4. Preparation Tips for Udacity Data Engineer Interviews

4.1 Company-specific tips:

Immerse yourself in Udacity’s mission to democratize education and understand how data engineering supports their technology-driven learning platform. Familiarize yourself with the types of data Udacity collects, such as student engagement metrics, course completion rates, and user feedback, and consider how these inform curriculum development and platform improvements.

Research Udacity’s partnerships with leading tech organizations and explore how data infrastructure enables seamless integration of new courses and features. Be prepared to discuss how you would contribute to building scalable, reliable data systems that empower learners and instructors alike.

Review Udacity’s emphasis on hands-on, project-based learning. Think about how robust data pipelines and high-quality datasets can enhance the student experience, drive personalized recommendations, and support analytics for continuous improvement.

4.2 Role-specific tips:

4.2.1 Demonstrate expertise in designing scalable, modular data pipelines.
Prepare to discuss your approach to architecting end-to-end data pipelines, from ingestion and transformation to storage and reporting. Emphasize your ability to handle diverse data sources, automate ETL processes, and ensure reliability and scalability as data volumes grow.

4.2.2 Highlight your proficiency with SQL and Python for data manipulation and pipeline automation.
Showcase your experience writing efficient SQL queries for data extraction, aggregation, and cleaning, as well as your ability to build and orchestrate ETL workflows using Python. Be ready to solve problems involving large, messy datasets and optimize for performance in production environments.

4.2.3 Illustrate your approach to data modeling and warehouse design.
Discuss your experience designing normalized schemas, choosing between star and snowflake models, and optimizing for analytical queries. Be prepared to explain trade-offs between flexibility, performance, and scalability, especially in the context of educational data.

4.2.4 Share strategies for ensuring data quality and reproducibility.
Describe your process for profiling data, identifying inconsistencies, and applying cleaning techniques. Highlight tools and frameworks you use to automate data validation, monitor pipeline health, and communicate changes to stakeholders.

4.2.5 Prepare examples of diagnosing and resolving pipeline failures.
Walk through your troubleshooting framework, including monitoring, logging, and root cause analysis. Emphasize your ability to communicate with stakeholders, document resolutions, and implement iterative fixes that minimize future disruptions.

4.2.6 Practice communicating technical concepts to non-technical audiences.
Demonstrate your ability to present complex data insights with clarity, using storytelling, visuals, and actionable recommendations tailored to Udacity’s diverse teams. Share examples of making data accessible and driving data-driven decisions across the organization.

4.2.7 Be ready to discuss system design for digital education platforms.
Outline your approach to architecting scalable, fault-tolerant data systems that support online classrooms, personalized learning, and real-time analytics. Address considerations like data privacy, modularity, and integration with other platform components.

4.2.8 Show your adaptability in choosing the right tools for each data engineering task.
Explain how you evaluate the strengths and limitations of Python, SQL, and other technologies based on task complexity, maintainability, and performance. Provide specific examples of making these decisions in past projects.

4.2.9 Prepare stories that showcase collaboration and stakeholder management.
Share experiences where you worked across teams, resolved ambiguous requirements, or influenced decisions without formal authority. Emphasize your ability to align stakeholders on data definitions, priorities, and project goals—essential skills in Udacity’s cross-functional environment.

5. FAQs

5.1 How hard is the Udacity Data Engineer interview?
The Udacity Data Engineer interview is challenging and comprehensive, focusing on both technical depth and your ability to communicate complex concepts clearly. You’ll be tested on data pipeline design, ETL processes, SQL and Python proficiency, and your approach to data quality and system scalability. The process also emphasizes your ability to collaborate and align with Udacity’s mission of democratizing education. Candidates who have hands-on experience building robust, scalable data solutions and who can articulate the impact of their work on the product and its learners will be best positioned for success.

5.2 How many interview rounds does Udacity have for Data Engineer?
Most candidates go through 5–6 rounds: an initial application and resume review, a recruiter screen, one or more technical interviews (including a take-home assignment), behavioral interviews, and a final onsite or virtual round. Each stage is designed to assess a different aspect of your technical expertise, problem-solving skills, and cultural fit with Udacity’s values.

5.3 Does Udacity ask for take-home assignments for Data Engineer?
Yes, a take-home assignment is a standard part of the Udacity Data Engineer interview process. You’ll typically be given 3–4 days to complete a real-world data engineering task—such as designing and implementing a data pipeline or ETL workflow using Python and SQL. This assignment helps Udacity evaluate your technical skills, attention to detail, and ability to deliver production-quality solutions.

5.4 What skills are required for the Udacity Data Engineer?
Key skills include designing scalable data pipelines, strong proficiency in SQL and Python, experience with ETL processes, data modeling, and data warehousing. You should also be adept at ensuring data quality, automating validation checks, optimizing for performance, and troubleshooting pipeline failures. Excellent communication and the ability to present technical concepts to non-technical stakeholders are essential, as is experience collaborating in cross-functional teams.

5.5 How long does the Udacity Data Engineer hiring process take?
The typical hiring process for Udacity Data Engineer roles lasts about 3–6 weeks, depending on scheduling and assignment turnaround. Fast-track candidates may complete it in as little as 2–3 weeks, but most should expect a week between each major stage, especially to allow time for the take-home assignment and subsequent interviews.

5.6 What types of questions are asked in the Udacity Data Engineer interview?
You’ll encounter a mix of technical and behavioral questions. Technical questions cover data pipeline and ETL design, SQL and Python coding, data modeling, system design for scalable platforms, and strategies for ensuring data quality. Behavioral questions focus on collaboration, stakeholder management, communication, and how you’ve handled ambiguity or challenges in past projects. Expect scenario-based questions that are directly relevant to online education and learning analytics.

5.7 Does Udacity give feedback after the Data Engineer interview?
Udacity typically provides feedback through the recruiter, especially if you’ve reached the later stages. While detailed technical feedback may be limited, you can expect high-level insights about your strengths and areas for improvement, particularly after the take-home assignment and final interviews.

5.8 What is the acceptance rate for Udacity Data Engineer applicants?
While Udacity does not publish specific acceptance rates, Data Engineer roles are highly competitive. Based on industry standards and candidate reports, the offer rate is estimated at roughly 3–5% of qualified applicants.

5.9 Does Udacity hire remote Data Engineer positions?
Yes, Udacity offers remote opportunities for Data Engineers, reflecting its global, digital-first mission. Some roles may require occasional travel or in-person collaboration, but remote work is supported and often preferred for technical positions. Be sure to clarify specific expectations with your recruiter based on your location and the team’s needs.

Ready to Ace Your Udacity Data Engineer Interview?

Ready to ace your Udacity Data Engineer interview? It’s not just about knowing the technical skills—you need to think like a Udacity Data Engineer, solve problems under pressure, and connect your expertise to real business impact. That’s where Interview Query comes in with company-specific learning paths, mock interviews, and curated question banks tailored toward roles at Udacity and similar companies.

With resources like the Udacity Data Engineer Interview Guide and our latest case study practice sets, you’ll get access to real interview questions, detailed walkthroughs, and coaching support designed to boost both your technical skills and domain intuition.

Take the next step: explore more case study questions, try mock interviews, and browse targeted prep materials on Interview Query. Bookmark this guide or share it with peers prepping for similar roles. It could be the difference between applying and getting the offer. You’ve got this!