San Diego Metropolitan Transit System (MTS) Data Engineer Interview Guide

1. Introduction

Getting ready for a Data Engineer interview at San Diego Metropolitan Transit System (MTS)? The MTS Data Engineer interview process typically spans a range of question topics and evaluates skills in areas like data pipeline architecture, ETL design, data warehousing, and communicating technical solutions to non-technical stakeholders. Interview preparation is especially important for this role at MTS, as candidates are expected to demonstrate their ability to design robust, scalable data systems that support public transit operations, ensure data accessibility, and drive data-driven decision-making across the organization.

In preparing for the interview, you should:

  • Understand the core skills necessary for Data Engineer positions at MTS.
  • Gain insights into MTS’s Data Engineer interview structure and process.
  • Practice real MTS Data Engineer interview questions to sharpen your performance.

At Interview Query, we regularly analyze interview experience data shared by candidates. This guide uses that data to provide an overview of the MTS Data Engineer interview process, along with sample questions and preparation tips tailored to help you succeed.

1.1. What San Diego Metropolitan Transit System (MTS) Does

San Diego Metropolitan Transit System (MTS) is the primary public transportation provider for the San Diego region, operating an extensive network of buses, trolleys, and paratransit services. MTS serves millions of passengers annually, supporting the area’s mobility, economic development, and environmental sustainability goals. The organization is committed to delivering safe, reliable, and accessible transit solutions. As a Data Engineer at MTS, you will play a vital role in optimizing transit operations and enhancing rider experiences through the effective use of data and technology.

1.2. What Does a San Diego Metropolitan Transit System (MTS) Data Engineer Do?

As a Data Engineer at San Diego Metropolitan Transit System (MTS), you are responsible for designing, building, and maintaining the data infrastructure that supports operational analytics and decision-making across the transit system. You will work with large datasets from sources such as fare collection, vehicle tracking, and scheduling systems to ensure data is accurate, accessible, and secure. Key tasks include developing ETL pipelines, optimizing database performance, and collaborating with analysts, IT, and operations teams to deliver actionable insights. This role is essential for enabling data-driven improvements in transit services and supporting MTS’s mission to provide efficient and reliable transportation to the community.

2. Overview of the San Diego Metropolitan Transit System (MTS) Interview Process

2.1 Stage 1: Application & Resume Review

The process begins with an in-depth review of your application and resume, with a focus on your experience in building data pipelines, ETL processes, data warehousing, and handling large-scale, real-world datasets. The hiring team looks for a solid foundation in SQL, Python, cloud data platforms, and demonstrable experience in designing scalable and reliable data solutions, especially for operational or transportation-related environments. To prepare, ensure your resume clearly highlights your technical contributions, successful project outcomes, and the business impact of your data engineering work.

2.2 Stage 2: Recruiter Screen

A recruiter will reach out for a preliminary phone conversation, typically lasting 20–30 minutes. This stage assesses your motivation for joining MTS, alignment with the organization’s mission, and basic technical and communication skills. Expect to discuss your background, reasons for applying, and general familiarity with data engineering tools and concepts. Preparation should include concise, authentic responses about your interest in public transit and your approach to making data accessible and actionable for diverse stakeholders.

2.3 Stage 3: Technical/Case/Skills Round

The technical interview is usually conducted by a senior data engineer or analytics manager and may include one or more rounds. You’ll be evaluated on your ability to design robust ETL pipelines, architect data warehouses, and solve real-world data integration and transformation challenges. This stage often involves system design questions (e.g., designing a scalable pipeline for streaming transit data), SQL and Python exercises, and scenario-based problem solving—such as diagnosing pipeline failures or optimizing data storage for reporting and analytics. To prepare, review your experience with cloud infrastructure, data modeling, performance optimization, and communicating technical solutions to non-engineering audiences.

2.4 Stage 4: Behavioral Interview

The behavioral round is typically conducted by a hiring manager or a cross-functional team member. You’ll be assessed on your collaboration skills, adaptability, and ability to communicate complex technical concepts to non-technical stakeholders, such as transit operations or planning staff. Expect questions about challenging data projects, how you’ve handled ambiguous requirements, and your strategies for ensuring data quality and reliability in mission-critical systems. Prepare by reflecting on specific examples where you’ve made an impact, navigated setbacks, and fostered cross-team partnerships.

2.5 Stage 5: Final/Onsite Round

The final stage may be onsite or virtual and often consists of multiple interviews with technical leaders, potential peers, and cross-departmental partners. This round tests both your technical depth and your fit within the MTS culture. You may be asked to whiteboard a data system or pipeline, discuss trade-offs in technology choices, and present a past project to a mixed technical/non-technical panel. Demonstrating your ability to translate business needs into scalable data solutions and communicate insights effectively will be key.

2.6 Stage 6: Offer & Negotiation

If successful, you’ll receive an offer from the recruiter or HR team. This stage involves discussing compensation, benefits, role expectations, and start date. Be prepared to negotiate thoughtfully, emphasizing your unique experience and how it aligns with MTS’s mission of delivering reliable, data-driven transit solutions.

2.7 Average Timeline

The typical San Diego MTS Data Engineer interview process spans 3–5 weeks from application to offer. Fast-track candidates with strong, directly relevant experience may move through the process in as little as 2–3 weeks, while the standard pace allows about a week between each stage. Scheduling for final or onsite rounds may vary depending on team availability and candidate preferences.

Next, let’s dive into the specific types of interview questions you can expect throughout this process.

3. San Diego Metropolitan Transit System Data Engineer Sample Interview Questions

3.1. Data Pipeline Design & System Architecture

Expect questions focused on designing scalable, robust, and maintainable data pipelines. The interview will test your ability to architect end-to-end solutions for diverse data sources and business needs. Emphasize modularity, error handling, and adaptability for real-world transit and urban data environments.

3.1.1 Design a scalable ETL pipeline for ingesting heterogeneous data from Skyscanner's partners.
Describe how you would build a modular ETL pipeline that handles diverse schemas, ensures data integrity, and scales with increased partner volume. Discuss best practices for error handling, schema validation, and incremental loading.
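To make your answer concrete, you might sketch one modular step in Python, for example schema validation plus watermark-based incremental loading. The snippet below is a minimal, assumption-heavy illustration; the field names and sample batch are invented, not an actual partner feed.

```python
# Minimal sketch (not a specific production stack) of one ETL step: schema
# validation plus watermark-based incremental loading. Field names and the
# sample batch are illustrative assumptions.
EXPECTED_SCHEMA = {"partner_id": str, "trip_id": str, "fare_usd": float, "updated_at": str}

def validate(record: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the record is clean."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

def incremental_load(records: list[dict], last_watermark: str) -> tuple[list[dict], str]:
    """Keep only clean rows newer than the stored watermark and return the new watermark."""
    fresh = [r for r in records if not validate(r) and r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=last_watermark)
    return fresh, new_watermark

if __name__ == "__main__":
    batch = [
        {"partner_id": "p1", "trip_id": "t1", "fare_usd": 2.50, "updated_at": "2024-05-01T10:00:00Z"},
        {"partner_id": "p1", "trip_id": "t2", "fare_usd": "bad", "updated_at": "2024-05-01T11:00:00Z"},
    ]
    rows, watermark = incremental_load(batch, last_watermark="2024-04-30T00:00:00Z")
    print(len(rows), watermark)  # one clean fresh row; the watermark advances
```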

3.1.2 Design a robust, scalable pipeline for uploading, parsing, storing, and reporting on customer CSV data.
Outline the ingestion process, focusing on validation, error logging, and efficient storage. Highlight strategies for reporting and monitoring pipeline health.
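A hedged sketch of just the parse-and-validate step is below; the column names and log file path are hypothetical. The point to emphasize is that malformed rows are logged and skipped rather than failing the entire upload.

```python
# Sketch of the parse-and-validate step only; column names and the log file
# path are hypothetical. Bad rows are logged and skipped so one malformed line
# does not fail the whole upload.
import csv
import logging

logging.basicConfig(filename="ingest_errors.log", level=logging.WARNING)

REQUIRED_COLUMNS = {"rider_id", "route", "boarding_time"}

def parse_customer_csv(path: str) -> list[dict]:
    good_rows = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"CSV is missing required columns: {missing}")
        for line_no, row in enumerate(reader, start=2):  # header is line 1
            if not row["rider_id"] or not row["boarding_time"]:
                logging.warning("line %d rejected: empty key field(s)", line_no)
                continue
            good_rows.append(row)
    return good_rows
```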

3.1.3 Design an end-to-end data pipeline to process and serve data for predicting bicycle rental volumes.
Explain the steps from raw data ingestion to model serving, including feature engineering and monitoring. Stress the importance of data freshness and reliability.

3.1.4 Design the system supporting an application for a parking system.
Discuss system components, data flows, and scalability concerns. Include considerations for real-time data updates and user-facing analytics.

3.1.5 Design a data warehouse for a new online retailer.
Describe your approach to schema design, partitioning, and supporting diverse analytics queries. Touch on ETL orchestration and data governance.

3.1.6 Create an ingestion pipeline via SFTP.
Detail the steps for securely transferring files, validating content, and automating ingestion. Emphasize error handling and audit trails.
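If it helps to anchor the discussion, here is one possible sketch using the paramiko library; the host, credentials, and directory names are placeholders, and a production version would favor key-based authentication and a checksum manifest from the sender.

```python
# Illustrative SFTP pull using paramiko (pip install paramiko). Host,
# credentials, and paths are placeholders; real deployments should prefer
# key-based auth over passwords.
import hashlib
from pathlib import Path

import paramiko

def pull_files(host: str, user: str, password: str, remote_dir: str, local_dir: str) -> None:
    transport = paramiko.Transport((host, 22))
    transport.connect(username=user, password=password)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        for name in sftp.listdir(remote_dir):
            local_path = Path(local_dir) / name
            sftp.get(f"{remote_dir}/{name}", str(local_path))
            # Record a checksum so downstream steps can verify the file landed intact.
            digest = hashlib.sha256(local_path.read_bytes()).hexdigest()
            print(f"downloaded {name} sha256={digest}")
    finally:
        sftp.close()
        transport.close()
```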

3.2. Data Modeling & Database Design

You’ll be asked to demonstrate proficiency in designing relational and non-relational schemas for transit and operational data. Focus on normalization, indexing, and supporting analytical queries at scale.

3.2.1 Design a database for a ride-sharing app.
Explain your schema choices for users, rides, payments, and location data. Discuss how your design supports efficient querying and scalability.
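One way to ground your answer is a small schema sketch like the one below. It runs against in-memory SQLite purely so the example is self-contained; a production design would more likely use PostgreSQL with a geospatial extension for location data, and the tables shown are simplified assumptions.

```python
# Illustrative-only relational schema for the ride-sharing prompt, executed
# against in-memory SQLite so the snippet is self-contained and runnable.
import sqlite3

DDL = """
CREATE TABLE users    (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL, rating REAL);
CREATE TABLE drivers  (driver_id INTEGER PRIMARY KEY, name TEXT NOT NULL, vehicle TEXT);
CREATE TABLE rides (
    ride_id    INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    driver_id  INTEGER NOT NULL REFERENCES drivers(driver_id),
    start_lat  REAL, start_lon REAL, end_lat REAL, end_lon REAL,
    started_at TEXT, ended_at TEXT
);
CREATE TABLE payments (
    payment_id INTEGER PRIMARY KEY,
    ride_id    INTEGER NOT NULL REFERENCES rides(ride_id),
    amount_usd REAL NOT NULL,
    status     TEXT CHECK (status IN ('pending', 'captured', 'refunded'))
);
-- Indexes chosen for common queries: a rider's history and a driver's daily rides.
CREATE INDEX idx_rides_user   ON rides(user_id, started_at);
CREATE INDEX idx_rides_driver ON rides(driver_id, started_at);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
print([r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")])
```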

3.2.2 Model a database for an airline company.
Describe key tables, relationships, and indexing strategies. Highlight how you’d handle large volumes of transactional data.

3.2.3 Design a solution to store and query raw data from Kafka on a daily basis.
Discuss schema design for high-velocity data, partitioning, and query optimization. Address integration with downstream analytics systems.
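As a rough sketch only, the snippet below lands raw events into date-partitioned JSONL files with the kafka-python client; the topic name, broker address, and file layout are assumptions, and many teams would instead write Parquet to object storage for downstream querying.

```python
# Rough sketch: land raw Kafka events into date-partitioned files so they can
# be queried by day. Assumes the kafka-python client and a JSON-encoded topic
# named "vehicle_positions"; both are assumptions, not project specifics.
import json
from datetime import datetime, timezone
from pathlib import Path

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "vehicle_positions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=10_000,  # stop iterating after 10 s with no new messages
)

for message in consumer:
    # Partition by event date so daily queries can prune to a single directory.
    event_day = datetime.fromtimestamp(message.timestamp / 1000, tz=timezone.utc).date()
    out_dir = Path(f"raw/vehicle_positions/dt={event_day}")
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "events.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(message.value) + "\n")
```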

3.3. Data Quality, Cleaning & Transformation

Expect questions that probe your approach to ensuring high data quality, handling dirty or inconsistent data, and maintaining reliable transformation pipelines. Highlight your diagnostic and remediation strategies.

3.3.1 Describe a real-world data cleaning and organization project you have worked on.
Share your process for profiling, cleaning, and validating large, messy datasets. Discuss trade-offs between speed and thoroughness.

3.3.2 How would you systematically diagnose and resolve repeated failures in a nightly data transformation pipeline?
Explain your troubleshooting workflow, including root cause analysis, alerting, and long-term fixes.
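A small illustration of the prevention side of that answer: wrap each nightly step with structured logging, bounded retries, and an alert when retries are exhausted. The notify_on_call() hook below is a hypothetical stand-in for whatever paging or chat integration a team actually uses.

```python
# Sketch of a retry-and-alert wrapper for nightly pipeline steps.
# notify_on_call() is a hypothetical placeholder for PagerDuty/Slack/email.
import logging
import time
from typing import Callable

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def notify_on_call(step: str, error: Exception) -> None:
    logging.error("ALERT: step %s failed permanently: %s", step, error)

def run_step(step_name: str, fn: Callable[[], None], *, retries: int = 3, backoff_s: float = 30.0) -> None:
    for attempt in range(1, retries + 1):
        try:
            fn()
            logging.info("step %s succeeded on attempt %d", step_name, attempt)
            return
        except Exception as exc:  # in practice, catch narrower exception types
            logging.warning("step %s attempt %d failed: %s", step_name, attempt, exc)
            if attempt == retries:
                notify_on_call(step_name, exc)
                raise
            time.sleep(backoff_s * attempt)

if __name__ == "__main__":
    run_step("load_gtfs_feed", lambda: None, retries=2, backoff_s=1.0)
```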

3.3.3 How would you approach improving the quality of airline data?
Detail steps for profiling, cleansing, and monitoring data quality over time. Mention automation and stakeholder communication.

3.3.4 How do you ensure data quality within a complex ETL setup?
Discuss strategies for validating data across multiple sources, resolving discrepancies, and maintaining documentation.

3.4. Scalability, Performance & Optimization

You’ll be evaluated on your ability to handle large datasets, optimize queries, and design systems for high throughput and low latency. Focus on distributed processing, parallelization, and resource management.

3.4.1 How would you modify a billion rows efficiently?
Describe batch processing strategies, indexing, and minimizing downtime. Discuss trade-offs between speed and safety.
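The core batching idea can be shown in a few lines: update in small primary-key ranges so each transaction stays short and locks stay narrow. The table and column names below are invented, and SQLite is used only so the sketch runs end to end; the same pattern applies to Postgres, MySQL, or a warehouse engine.

```python
# Hedged illustration of chunked updates: one short transaction per key range,
# easy to pause and resume. Table/column names are invented.
import sqlite3

def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 10_000) -> None:
    (max_id,) = conn.execute("SELECT COALESCE(MAX(trip_id), 0) FROM trips").fetchone()
    low = 0
    while low < max_id:
        high = low + batch_size
        with conn:  # commits this chunk; keeps each transaction small
            conn.execute(
                "UPDATE trips SET fare_usd = fare_usd * 1.05 "
                "WHERE trip_id > ? AND trip_id <= ?",
                (low, high),
            )
        low = high

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (trip_id INTEGER PRIMARY KEY, fare_usd REAL)")
    conn.executemany("INSERT INTO trips VALUES (?, ?)", [(i, 2.50) for i in range(1, 25_001)])
    backfill_in_batches(conn, batch_size=10_000)
    print(conn.execute("SELECT MIN(fare_usd), MAX(fare_usd) FROM trips").fetchone())
```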

3.4.2 Design a data pipeline for hourly user analytics.
Explain how you’d architect a pipeline for near-real-time aggregation and reporting. Address scalability and fault tolerance.

3.4.3 Redesign batch ingestion to real-time streaming for financial transactions.
Outline migration steps, streaming technology choices, and monitoring for data loss or duplication.

3.5. Communication, Collaboration & Data Accessibility

Expect questions on making data accessible and actionable for non-technical users, as well as collaborating cross-functionally. Stress clear visualization, documentation, and stakeholder engagement.

3.5.1 How do you demystify data for non-technical users through visualization and clear communication?
Describe techniques for simplifying complex data and tailoring insights to different audiences.

3.5.2 How do you present complex data insights with clarity and adaptability, tailored to a specific audience?
Share your approach to storytelling with data, using visual aids and focusing on actionable recommendations.

3.5.3 How do you make data-driven insights actionable for stakeholders without technical expertise?
Discuss strategies for translating analytics into business impact and driving informed decisions.

3.6. Tooling, Technology Choices & Integration

You’ll be asked to justify your technology choices and demonstrate flexibility in integrating new tools or approaches. Highlight your decision-making framework and experience with both open-source and commercial platforms.

3.6.1 Python vs. SQL: when would you choose each in a data engineering workflow?
Compare use cases for Python and SQL in data engineering workflows. Discuss strengths, weaknesses, and integration strategies.
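A concrete way to frame the comparison is to show the same rollup both ways, as in the sketch below. The tiny ridership sample is made up; the point is where the work runs (set-based aggregation in the database versus flexible in-process transforms in Python), not the numbers.

```python
# Same daily-ridership rollup expressed in SQL and in pandas, on a made-up sample.
import sqlite3
import pandas as pd

boardings = pd.DataFrame({
    "route": ["7", "7", "215"],
    "service_date": ["2024-05-01", "2024-05-01", "2024-05-01"],
    "riders": [120, 95, 80],
})

# SQL: declarative, runs where the data lives, ideal for set-based aggregation.
conn = sqlite3.connect(":memory:")
boardings.to_sql("boardings", conn, index=False)
sql_result = pd.read_sql(
    "SELECT service_date, route, SUM(riders) AS total_riders "
    "FROM boardings GROUP BY service_date, route",
    conn,
)

# Python/pandas: better for procedural logic, custom functions, and API glue.
pandas_result = (
    boardings.groupby(["service_date", "route"], as_index=False)["riders"].sum()
    .rename(columns={"riders": "total_riders"})
)

print(sql_result)
print(pandas_result)
```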

3.6.2 Design a reporting pipeline for a major tech company using only open-source tools under strict budget constraints.
Recommend open-source solutions for ETL, reporting, and monitoring. Justify your choices with respect to scalability and cost.

3.6.3 Design and describe key components of a RAG pipeline
Explain the architecture and integration points for retrieval-augmented generation pipelines, focusing on modularity and extensibility.
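A bare-bones sketch of the three stages (embed, retrieve, generate) is below, using a toy bag-of-words similarity and a stubbed generation step so no particular vector database or LLM provider is assumed; a real pipeline would swap in proper embeddings, an index, and a model call.

```python
# Toy end-to-end RAG structure: embed -> retrieve -> generate. The "embedding"
# is a bag-of-words Counter and generate() is a stub, both purely illustrative.
from collections import Counter
import math

DOCUMENTS = [
    "Trolley headways are 15 minutes on the Blue Line during peak hours.",
    "Fare capping limits riders to a maximum daily charge.",
]

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(DOCUMENTS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def generate(query: str, context: list[str]) -> str:
    # Placeholder for an LLM call; in practice this would hit a model API with the context.
    return f"Answer '{query}' using context: {context}"

question = "How often do Blue Line trolleys run?"
print(generate(question, retrieve(question)))
```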

3.7. Behavioral Questions

3.7.1 Tell me about a time you used data to make a decision.
Describe the context, your analysis approach, and the impact of your recommendation. Focus on business outcomes and stakeholder buy-in.

3.7.2 Describe a challenging data project and how you handled it.
Outline the obstacles, your problem-solving strategies, and how you ensured project success despite setbacks.

3.7.3 How do you handle unclear requirements or ambiguity?
Explain your approach to clarifying objectives, collaborating with stakeholders, and iterating on solutions.

3.7.4 Talk about a time when you had trouble communicating with stakeholders. How were you able to overcome it?
Share specific techniques you used to bridge gaps in understanding and drive alignment.

3.7.5 Describe a situation where two source systems reported different values for the same metric. How did you decide which one to trust?
Discuss your process for investigating discrepancies, validating sources, and communicating findings.

3.7.6 Give an example of automating recurrent data-quality checks so the same dirty-data crisis doesn’t happen again.
Highlight the tools and processes you implemented, and the long-term impact on data reliability.

3.7.7 Tell me about a time you delivered critical insights even though 30% of the dataset had nulls. What analytical trade-offs did you make?
Describe your approach to missing data, how you communicated uncertainty, and the business decision enabled.

3.7.8 Describe a time you had to negotiate scope creep when two departments kept adding “just one more” request. How did you keep the project on track?
Explain your prioritization framework and communication strategies for maintaining focus and quality.

3.7.9 Share a story where you used data prototypes or wireframes to align stakeholders with very different visions of the final deliverable.
Discuss how you leveraged rapid prototyping to drive consensus and refine requirements.

3.7.10 How do you prioritize multiple deadlines, and how do you stay organized while juggling them?
Outline your personal workflow, tools, and strategies for time management and effective delivery.

4. Preparation Tips for San Diego Metropolitan Transit System Data Engineer Interviews

4.1 Company-specific tips:

Demonstrate your understanding of public transit data and the unique challenges faced by a large urban transportation system. Familiarize yourself with the types of data MTS collects—such as fare transactions, vehicle locations, ridership statistics, and scheduling information—and consider how this data can be used to improve operational efficiency and rider experience.

Highlight your commitment to MTS’s mission of delivering safe, reliable, and accessible transit. Be ready to discuss how your work as a data engineer can drive sustainability, support economic development, and enhance service reliability for the San Diego community.

Showcase your ability to communicate complex technical concepts to non-technical stakeholders, such as operations staff or city planners. Practice explaining data engineering solutions in clear, actionable terms, and be prepared to tailor your language for different audiences within the organization.

Research recent MTS initiatives and technology upgrades, such as real-time tracking, contactless payments, or service optimization projects. Reference these examples in your responses to demonstrate your awareness of the organization’s priorities and how your skills can contribute to ongoing improvements.

4.2 Role-specific tips:

Prepare to design robust, scalable ETL pipelines that handle heterogeneous transit data.
Expect to discuss how you would architect modular pipelines that ingest, validate, and transform data from diverse sources—such as bus GPS feeds, fare collection systems, and third-party APIs—while ensuring data integrity and reliability. Practice outlining your approach to error handling, schema evolution, and incremental data loads.

Demonstrate expertise in data modeling and warehousing for large-scale operational analytics.
Be ready to design relational and non-relational schemas that support efficient querying and reporting on transit operations. Emphasize your experience with normalization, indexing, and partitioning strategies that enable fast analytics on high-volume datasets, such as daily ridership or vehicle performance metrics.

Showcase your skills in diagnosing and resolving data quality issues in complex environments.
Prepare examples of how you have identified, investigated, and remediated data quality problems in past projects. Discuss systematic approaches to pipeline monitoring, root cause analysis, and implementing automated quality checks to ensure the reliability of mission-critical data.
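If you want a tangible artifact to reference in this discussion, the sketch below shows a few assertion-style checks that could run after each load; the thresholds and column names are placeholders rather than any agency's actual standards.

```python
# Example of automated, assertion-style data-quality checks run after a load.
# Thresholds and column names are placeholders.
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> list[str]:
    failures = []
    if df["trip_id"].duplicated().any():
        failures.append("duplicate trip_id values found")
    null_rate = df["fare_usd"].isna().mean()
    if null_rate > 0.01:
        failures.append(f"fare_usd null rate {null_rate:.1%} exceeds 1% threshold")
    if (df["fare_usd"].dropna() < 0).any():
        failures.append("negative fares present")
    return failures

if __name__ == "__main__":
    sample = pd.DataFrame({"trip_id": [1, 2, 2], "fare_usd": [2.5, None, -1.0]})
    problems = run_quality_checks(sample)
    if problems:
        raise SystemExit("Data quality check failed: " + "; ".join(problems))
```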

Highlight your ability to optimize data pipelines for performance and scalability.
Discuss strategies for processing large datasets efficiently, such as using batch and stream processing, parallelization, and distributed computing frameworks. Be prepared to explain how you would minimize downtime, manage resource constraints, and ensure data freshness for real-time or near-real-time analytics.

Demonstrate strong communication and collaboration skills.
Provide concrete examples of how you have worked cross-functionally with analysts, IT staff, and business stakeholders to deliver actionable insights. Emphasize your ability to document data flows, create clear visualizations, and translate technical findings into recommendations that drive operational improvements.

Justify your technology choices and integration strategies.
Be prepared to discuss your decision-making framework for selecting tools and platforms—whether open-source or commercial—based on scalability, cost, and organizational needs. Highlight your flexibility in integrating new technologies and your experience with cloud data platforms, scripting languages, and automation tools relevant to MTS’s environment.

Reflect on your approach to handling ambiguity and managing multiple priorities.
Share stories where you navigated unclear requirements, negotiated project scope, or balanced competing deadlines. Articulate your methods for clarifying objectives, setting priorities, and delivering high-quality results under pressure.

Prepare for behavioral questions by focusing on impact and adaptability.
Think through examples where your data engineering work directly supported business decisions, improved data reliability, or enabled new capabilities. Be ready to discuss how you handled setbacks, communicated uncertainty, and built consensus among stakeholders with differing perspectives.

5. FAQs

5.1 How hard is the San Diego Metropolitan Transit System Data Engineer interview?
The MTS Data Engineer interview is challenging, particularly for candidates who haven’t worked with large-scale operational or transit data before. You’ll need to demonstrate deep technical expertise in data pipeline architecture, ETL design, and data warehousing, as well as the ability to communicate technical solutions to non-technical stakeholders. The interview assesses both your technical proficiency and your understanding of how data engineering supports public transit operations and decision-making.

5.2 How many interview rounds does San Diego Metropolitan Transit System have for Data Engineer?
Typically, there are 5–6 interview rounds. The process starts with an application and resume review, followed by a recruiter screen, one or more technical/case interviews, a behavioral interview, and a final onsite or virtual round with technical leaders and cross-functional team members. The last stage is the offer and negotiation.

5.3 Does San Diego Metropolitan Transit System ask for take-home assignments for Data Engineer?
Take-home assignments are occasionally used, especially to assess your practical skills in designing ETL pipelines or solving real-world data integration challenges. These may involve building a small pipeline, modeling a database, or troubleshooting a data quality issue relevant to transit data.

5.4 What skills are required for the San Diego Metropolitan Transit System Data Engineer?
Key skills include designing and building ETL pipelines, data modeling, data warehousing, SQL and Python programming, cloud data platform experience, and strong troubleshooting abilities. You’ll also need excellent communication skills to translate complex concepts for non-technical staff, and a collaborative mindset for working with cross-functional teams. Familiarity with transit or operational data is a plus.

5.5 How long does the San Diego Metropolitan Transit System Data Engineer hiring process take?
The typical timeline is 3–5 weeks from application to offer. Fast-track candidates with highly relevant experience may complete the process in as little as 2–3 weeks, but scheduling for final or onsite rounds can extend the timeline depending on team and candidate availability.

5.6 What types of questions are asked in the San Diego Metropolitan Transit System Data Engineer interview?
Expect a mix of technical, case-based, and behavioral questions. Technical topics include designing scalable ETL pipelines, architecting data warehouses, optimizing performance, and ensuring data quality. Behavioral questions focus on collaboration, communication, and adaptability—especially your ability to make data accessible and actionable for non-technical stakeholders in a public transit environment.

5.7 Does San Diego Metropolitan Transit System give feedback after the Data Engineer interview?
MTS typically provides high-level feedback through recruiters, highlighting strengths and areas for improvement. Detailed technical feedback may be limited, but you can expect to receive guidance on next steps and, if unsuccessful, general suggestions for future applications.

5.8 What is the acceptance rate for San Diego Metropolitan Transit System Data Engineer applicants?
While exact numbers are not public, the Data Engineer role at MTS is competitive, with an estimated acceptance rate of around 3–7% for well-qualified applicants. Candidates with strong experience in public transit, operational analytics, or large-scale data engineering have a distinct advantage.

5.9 Does San Diego Metropolitan Transit System hire remote Data Engineer positions?
MTS offers some flexibility for remote or hybrid work arrangements, especially for technical roles. However, certain positions may require occasional onsite presence for team collaboration, stakeholder meetings, or hands-on work with transit operations systems. Always check the specific job posting and discuss preferences during the interview process.

Ready to Ace Your San Diego Metropolitan Transit System Data Engineer Interview?

Ready to ace your San Diego Metropolitan Transit System (MTS) Data Engineer interview? It’s not just about knowing the technical skills—you need to think like an MTS Data Engineer, solve problems under pressure, and connect your expertise to real business impact. That’s where Interview Query comes in with company-specific learning paths, mock interviews, and curated question banks tailored toward roles at MTS and similar organizations.

With resources like the San Diego Metropolitan Transit System Data Engineer Interview Guide and our latest case study practice sets, you’ll get access to real interview questions, detailed walkthroughs, and coaching support designed to boost both your technical skills and domain intuition.

Take the next step—explore more case study questions, try mock interviews, and browse targeted prep materials on Interview Query. Bookmark this guide or share it with peers prepping for similar roles. It could be the difference between just applying and landing the offer. You’ve got this!