Interview Query

Top 100+ Data Science Interview Questions


Data science interviews are tough for one reason: there’s no consistency in the types of questions that get asked.

The questions asked in data science interviews are highly dependent on the position and company you are interviewing for.

Companies that view data scientists as high-powered data analysts will focus more on SQL questions in interviews, while companies that are looking for data scientists with strong ETL and data engineering skills will focus more on full-stack data system design questions.

What comes up most frequently? Interview Query analyzed the most commonly asked questions in 20,000 plus data scientist interviews.

[Chart: question topics in Data Scientist interviews: A/B testing, algorithms, analytics, machine learning, probability, product metrics, Python, SQL, statistics]

Overall, SQL, machine learning, data structures and algorithms, and statistics and A/B testing questions dominated data science interviews.

SQL is the most important topic. It’s asked in more than 90% of data science interviews. Machine learning (85%+) and Python (80%+) are also frequently asked topics.

How to Prepare for the Data Science Interview

To prepare for a data science interview, follow these four steps:

  1. Determine what questions to study: Use the job posting and read interview experiences to determine which question topics to study.

  2. Benchmark your skills: Try example questions and find your strengths and weaknesses across each question type.

  3. Practice data science questions: Do as many practice problems as possible.

  4. Run through mock interviews: Simulate the interview experience with mock interviews.

Use these top 100 data science interview questions to benchmark your skills and practice.

Explore questions by category:

Behavioral Interview Questions

In data science interviews, behavioral interview questions are meant to assess technical competency, culture fit, past experience, and communication abilities.

1. Give an example of an analysis that you did that drove business impact.

Provide a concrete example and use statistics to back up your claim. You might talk about an increase in user engagement or improved marketing performance.

Be sure to structure your answer. The STAR format (Situation, Task, Actions, Results) is a go-to for many data science interviews. First, you outline the situation: “In my last job, I was working as a marketing analyst.” Then, highlight the task: “Marketing performance had stagnated, and we wanted to increase our marketing ROI.”

Finally, state the actions you took and the results: “I decided to do an in-depth analysis of our audience targeting and found more profitable demographics to target. My analysis resulted in a 10 percent increase in marketing ROAS.”

2. How do you make technical topics accessible to non-technical audiences?

Behavioral questions in data science interviews are commonly used to assess your communication style. This question will help the interviewer understand how well you communicate complex topics. One way to answer this is to talk about data visualization and how you have created visualizations that made your insights more understandable.

3. Tell me about a data science project you have worked on. What challenges did you experience? How did you respond?

Be prepared to talk about any past data science projects that you have worked on. These could be professional projects or ones you have done in your free time. Practice talking about the project using the STAR format (or something similar), and always incorporate the challenges of the project.

Was there a challenge in gathering and cleaning the data? Did you have trouble generating insights? Was communicating the insights difficult?

4. Provide an example of a goal you have achieved. How did you reach it?

A behavioral question like this is designed to understand your learning style, organization, and planning skills. First, define the goal. If you’re interviewing for your first data science job, you might choose a goal like learning to code in Python. Or, you might choose a professional goal like completing a project on a tight deadline.

After you’ve provided an overview of the goal, you can use a format like STAR to outline your exact thought and action process.

5. Tell me about a time you missed a deadline. How did you respond?

Inevitably, questions like this will arise in your interview. These are specifically designed to see how you respond to adversity. Be honest in your response, but also be aware of the lessons you learned and how you applied (or would apply) these lessons in future scenarios.

Let’s say the missed deadline was a result of having unclear expectations at the start of the project. You might say, “I learned how important it is to get stakeholder input and nail down requirements before starting a project.” Then, talk about a time you were able to apply those insights or how you would apply them to a future project.

Additional Behavioral Questions for Data Scientists

Be sure to see our full guide to behavioral questions in data science interviews.

6. How have you used data to elevate the experience of a customer or stakeholder?

7. Provide an example of a goal you did not meet. How did you handle it?

8. How do you handle tight deadlines?

9. What did you do in your last job?

10. Tell me about a time you had to clean and organize a big dataset.

11. How do you learn new data science concepts? Tell us about a time when you had to learn a new skill.

12. What did you do in your last job? What was your greatest accomplishment?

Product Metrics and Analytics

Product metrics and analytics questions assess your product intuition and ability to make data-driven product decisions. Data science metrics questions ask you to investigate analytics, measure success, track feature changes, or assess growth. Sometimes, they take the form of data analytics case studies.

13. Friend requests are down 10% on Facebook. What would you investigate?

See a video mock interview for this product sense question.

See our guide to Facebook data science interview questions.

14. Average comments per user has dropped over a three-month period, despite consistent growth after a new launch. What metrics would you investigate?

This product question about decreasing comments assesses your product intuition. Here’s a sample solution:

You’re told that the user count keeps growing, so the decrease isn’t due to a declining user base. You might want to look at churn and active user engagement to solve this problem.

Active users may be churning off the platform even as total users grow, which would result in fewer comments per user. In this case, you might be interested in the metric: comments per active user.

15. You work for a SaaS company that provides leads to insurance agents and are asked to determine whether providing more leads to customers results in greater retention.

Let’s assume that, for this question, the VP of Sales provides you a graph that shows agents in Month 3 receiving more than three times as many leads as agents in Month 1 as proof. What might be flawed about the VP’s thinking?

Hint: The key is to not confuse the output metrics with the input metrics. In this case, while the question makes us think that more leads is the output metric, what should really be investigated is churn. If we break out customers into cohorts based on the number of leads per month, we can see if churn goes down by cohort each month.

16. How would you verify the claim that Facebook is losing younger users? What metrics would you look at?

With product sense questions in data science interviews, always start with clarifying questions. With this Facebook product data science question, you could ask where the claim came from and how a “younger” user is defined. “Churn” and “active users” are two metrics that would help you verify the claim. In particular, you should be interested in daily, weekly, and monthly rates for both metrics.

By looking at daily, weekly, and monthly churn, we’d have a better grasp on whether younger users are less engaged, e.g. higher weekly churn vs. lower monthly churn, or if Facebook is actually losing younger users.

17. A PM tells you that the weekly active user (WAU) metric is up 5% but email notification open rates are down 2%. What would you investigate to diagnose this problem?

When initially reading this question, we should first assume that it is a debugging question, and then possibly dive into trade-offs.

WAU (weekly active users) and email open rates are usually positively correlated, especially if emails are used as a call to action to bring users back onto the website: an email that is opened and then clicked leads to an active user. That doesn’t mean the two metrics always move together, or that email is the only factor driving WAU. First, you might want to look at debugging: looking at problems in tracking, seasonality, open rates by country/demographic, etc.

If this doesn’t reveal a solution, you’d next want to consider trade-offs.

Additional Product Metrics Questions

Be sure to see our full guide to product metrics questions in data science interviews.

18. How would you measure the success of Uber Eats?

19. Pinterest introduced a new feature: video pins. How would you determine what percentage of pins should be videos?

20. Reddit recently implemented a new search toolbar. What metrics would you measure to determine the impact of the search toolbar?

21. Traffic went down 5%. How would you determine the reason for this?

22. How would you measure the success of Facebook Groups?

23. You work at Twitter and are tasked with assigning negative values to abusive tweets. How would you do that?

Machine Learning Interview Questions

Machine learning and modeling questions assess your experience using machine learning tools and depth of knowledge. Most commonly, these questions examine algorithm concepts, machine learning case studies, and recommendation engines.

24. What are the assumptions of linear regression?

With a question that asks about the assumptions of linear regression, know that there are several assumptions, and that they’re baked into the dataset and how the model is built. The first assumption is that there is a linear relationship between the features and the response variable, otherwise known as the value you’re trying to predict. What else can we assume?

25. What is the difference between XGBoost and random forest?

Random forest is a bagging algorithm: its base learners are decision trees that are generated independently, and in parallel, on bootstrapped samples of the data, and their predictions are combined.

However, in boosting, which XGBoost uses, the trees are built sequentially such that each subsequent tree aims to reduce the errors of the previous trees. Each tree is fit to the residual errors left by its predecessors, so the tree that grows next in the sequence learns from an updated version of the residuals.

26. What metrics would you use to assess accuracy and validity of a spam classifier?

Spam filtering is a binary classification problem in which the emails can either be spam or not spam. Also, be cognizant of how unbalanced the dataset might be, e.g. lots of non-spam emails vs. far fewer spam emails.

With that in mind, define each output scenario:

  • True Positives: The email prediction is spam and the email is actually spam.
  • True Negatives: The email prediction is not spam and the email is also not spam.
  • False Positives: The email prediction is spam and the email is actually not spam.
  • False Negatives: The email prediction is not spam and the email is actually spam.

Given these definitions, what would be the metrics for measuring model accuracy? See the full solution on Interview Query.
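As a rough sketch of how these four counts turn into metrics, the helper below computes accuracy, precision, recall, and F1. The function name and the example counts are made up for illustration, and this is one common set of metrics rather than the full solution referenced above.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of the emails flagged as spam, how many were really spam?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real spam emails, how many did the classifier catch?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical, imbalanced example: 1,000 emails, only 50 of them spam
metrics = classification_metrics(tp=40, tn=940, fp=10, fn=10)
print(metrics)

# A classifier that never predicts spam still looks accurate on this data
never_spam = classification_metrics(tp=0, tn=950, fp=0, fn=50)
print(never_spam)
```

On imbalanced data like this, accuracy stays high even for the classifier that never flags spam, while its recall drops to zero. That is why precision and recall tend to be more informative than accuracy alone for a spam filter.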

27. How would you create a system for Facebook Marketplace to detect if someone is listing a firearm?

This intermediate machine learning question was asked by Facebook. See a video mock interview solution for this question.

28. You want to build a model to predict housing prices, but 20% of the listings are missing square footage. What do you do?

This classic data modeling question was asked by Redfin and Panasonic. A quick solution: you could build models with different volumes of training data, using only the listings that include square footage. With this method, you avoid the 20% of listings with missing values. Imputing the missing square footage would be another option.

29. Is a decision tree model best for predicting if a borrower will pay back a personal loan? How would you evaluate performance of the model?

A few questions to consider are: How would you evaluate performance of the model? And how would you compare a decision tree to other models? See a full solution in this YouTube mock interview: “Square machine learning mock interview.”

See our guide to the Square data scientist interview.

Additional Machine Learning Questions

Be sure to see our full guide to machine learning questions in data science interviews.

30. What are some differences between classification and regression models?

31. An ecommerce website’s pricing algorithm is underpricing products. What steps would you take to diagnose the problem?

32. Explain K-fold cross-validation.

33. How would you use machine learning to predict missing values for a dataset missing 30% of values?

34. What is dimensionality reduction? What are the benefits of using it?

35. What is ensemble learning? How would you explain it to a nontechnical person?

36. How would you evaluate if a decision tree is the right choice for predicting if a borrower will pay back a loan?

SQL Interview Questions for Data Scientists

Data science SQL questions are very common and asked in about 90% of data science interviews. These questions assess your ability to write clean code, usually during a whiteboarding session. Common topics include pulling metrics, analytics case studies, ETL, and database design.

37. How would you test for NULL values in a database?

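NULL is usually used in databases to signify a missing or unknown value. To test whether a field is NULL, use the IS NULL or IS NOT NULL operator. An ordinary comparison such as = NULL does not match any rows, because comparing a value with NULL never evaluates to true.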
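To make the NULL test concrete, here is a small sketch using Python's built-in sqlite3 module. The users table and its columns are hypothetical, invented only for this demonstration.

```python
import sqlite3

# In-memory database with a hypothetical users table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, None)],  # None is stored as NULL
)

# IS NULL / IS NOT NULL are the operators for testing NULL values
missing = conn.execute("SELECT COUNT(*) FROM users WHERE email IS NULL").fetchone()[0]
present = conn.execute("SELECT COUNT(*) FROM users WHERE email IS NOT NULL").fetchone()[0]

# An equality comparison with NULL is never true, so it matches nothing
equals_null = conn.execute("SELECT COUNT(*) FROM users WHERE email = NULL").fetchone()[0]

print(missing, present, equals_null)
```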

38. Given three tables, representing customer transactions and customer attributes, write a query to get the average order value by gender.

For this problem, which has been asked in Uber data science interviews, note that we are going to assume that the question states average order value for all users that have ordered at least once.

Therefore, we can apply an INNER JOIN between users and transactions.


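SELECT
    u.gender  -- column name assumed; it was lost from the extracted query
    , ROUND(AVG(quantity * price), 2) AS aov
FROM users AS u
INNER JOIN transactions AS t
    ON u.id = t.user_id  -- assumes users.id is the key
INNER JOIN products AS p
    ON t.product_id = p.id  -- assumes products.id is the key
GROUP BY u.gender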


39. Given a table with event logs, find the percentage of users that had at least one 7-day streak of visiting the same URL.

Here’s the key observation for this problem: Say that, for a given user and URL, you sort their visit dates in ascending order and assign a rank (row_num) to each date (it doesn’t matter if you use RANK() or ROW_NUMBER() if the dates are unique). If you subtract each date’s rank, in days, from the date itself, you’ll find that days belonging to the same streak produce the same result.
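The date-minus-rank observation can be sketched in plain Python. This is an illustration of the idea rather than the SQL answer itself; the function names and the (user_id, url, day) event format are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def has_streak(visit_days, length=7):
    """Return True if the days contain a run of `length` consecutive dates."""
    run_sizes = defaultdict(int)
    # Rank the unique days in ascending order (rank 0, 1, 2, ...)
    for rank, day in enumerate(sorted(set(visit_days))):
        # Days in the same streak share the same (day - rank) value
        run_sizes[day - timedelta(days=rank)] += 1
    return any(size >= length for size in run_sizes.values())


def share_of_users_with_streak(events, length=7):
    """events: iterable of (user_id, url, day) tuples; returns a fraction in [0, 1]."""
    users = set()
    days_by_user_url = defaultdict(set)
    for user_id, url, day in events:
        users.add(user_id)
        days_by_user_url[(user_id, url)].add(day)
    streak_users = {
        user_id
        for (user_id, _url), days in days_by_user_url.items()
        if has_streak(days, length)
    }
    return len(streak_users) / len(users) if users else 0.0


# Hypothetical log: user 1 visits /home seven days in a row, user 2 only three days
events = [(1, "/home", date(2024, 1, d)) for d in range(1, 8)]
events += [(2, "/home", date(2024, 1, d)) for d in (1, 2, 4)]
share = share_of_users_with_streak(events)
print(share)
```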

40. Write a query to identify customers who placed more than three transactions each in both 2019 and 2020.

This question gives us two tables and asks us to find the names of customers who placed more than three transactions in both 2019 and 2020.

Note that the phrasing of the question implies this logical expression: customer transactions > 3 in 2019 AND customer transactions > 3 in 2020.

Our first query will join the transactions table to the users table so that we can easily reference both the user’s name and the orders together. We can join our tables on the id field of the users table and the user_id field of the transactions table:

FROM transactions t
JOIN users u
    ON u.id = t.user_id

41. Given the employees and departments table, write a query to get the top three highest employee salaries by department.

In this question, we need to order the salaries by department. A window function here is useful. Window functions enable calculations within a certain partition of rows. In this case, the RANK() function would be useful. What would you put in the PARTITION BY and ORDER BY clauses?

Your window function can look something like:

RANK() OVER (PARTITION BY id ORDER BY metric DESC)

with id and metric replaced by the fields relevant to the question.
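To see the RANK() approach end to end, here is a sketch that runs a window-function query through Python's built-in sqlite3 module. The table layouts, column names, and sample rows are assumptions made for illustration, and it needs an SQLite build with window-function support (version 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employees (
    id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, department_id INTEGER
);
INSERT INTO departments VALUES (1, 'engineering'), (2, 'sales');
INSERT INTO employees VALUES
    (1, 'Ann', 150, 1), (2, 'Bob', 140, 1), (3, 'Cid', 130, 1),
    (4, 'Dee', 120, 1), (5, 'Eve', 90, 2), (6, 'Fay', 80, 2);
""")

# Rank salaries within each department, then keep the top three ranks
query = """
SELECT department, employee, salary
FROM (
    SELECT d.name AS department,
           e.name AS employee,
           e.salary,
           RANK() OVER (PARTITION BY d.id ORDER BY e.salary DESC) AS salary_rank
    FROM employees AS e
    JOIN departments AS d ON e.department_id = d.id
) AS ranked
WHERE salary_rank <= 3
ORDER BY department, salary DESC
"""
top_salaries = conn.execute(query).fetchall()
for row in top_salaries:
    print(row)
```

Here the department goes in PARTITION BY and the salary, descending, goes in ORDER BY. Note that RANK() gives tied salaries the same rank, so a department with ties can return more than three rows.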

42. Write a SQL query to create a metric to recommend pages for each user based on recommendations from their friends’ liked pages.

Let’s say we want to build a naive recommender for this SQL problem. We’re given two tables: one table called friends with user_id and friend_id columns representing each user’s friends, and another table called page_likes with a user_id and a page_id representing the page each user liked.

Let’s solve this problem by visualizing what kind of output we want from the query. Given that we have to create a metric for each user to recommend pages, we know we want something with a user_id and a page_id along with some sort of recommendation score.

Let’s try to think of an easy way to represent the scores of each user_id and page_id combo. One naive method would be to create a score by summing up the total likes by friends on each page that the user hasn’t currently liked. Then, the max value on our metric will be the most recommendable page.
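The naive scoring idea can be prototyped in Python before writing the SQL. This is only a sketch of the logic described above: the function name is hypothetical, and the two tables are represented as lists of (user_id, friend_id) and (user_id, page_id) tuples.

```python
from collections import Counter, defaultdict


def recommend_pages(friends, page_likes):
    """Score each (user, page) pair by how many of the user's friends like the page."""
    likes_by_user = defaultdict(set)
    for user_id, page_id in page_likes:
        likes_by_user[user_id].add(page_id)

    scores = defaultdict(Counter)
    for user_id, friend_id in friends:
        for page_id in likes_by_user[friend_id]:
            # Only score pages the user has not already liked
            if page_id not in likes_by_user[user_id]:
                scores[user_id][page_id] += 1
    return scores


# Hypothetical rows: user 1 is friends with users 2 and 3
friends = [(1, 2), (1, 3), (2, 1)]
page_likes = [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (3, "c")]

scores = recommend_pages(friends, page_likes)
print(scores[1].most_common())
```

For user 1, page "b" is liked by two friends and page "c" by one, so "b" is the most recommendable page. Page "a" is excluded because user 1 already likes it.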

Additional SQL Questions for Data Science Interviews

Be sure to see our full guide to SQL questions in data science interviews.

43. When do you use the CASE WHEN function?

44. Write a SQL query to list all customer orders.

45. What is a foreign key? How is it used?

46. What’s the difference between an INNER JOIN and LEFT JOIN?

47. Write a SQL query to return all the neighborhoods with 0 users in them.

48. Write a SQL query to find total distance traveled by Lyft users last month.

49. Write a SQL query to get month-over-month change in new users.

50. What are the differences between DDL, DML, DCL, and TCL?

Python Interview Questions

Python interview questions test your technical coding skills, with questions that explore concepts like data structures, NumPy and data science packages, probability simulation, statistics and distributions, and string parsing / data manipulation.

51. Write a function to select only the rows from a dataframe where the student’s favorite color is green or red and their grade is above 90.

This question requires us to filter a dataframe by two conditions: 1) the grade of the student and 2) their favorite color.

Let’s start with filtering by grade since it’s a bit simpler than filtering by strings. We can filter rows in pandas by setting our dataframe equal to itself with the filter in place.
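A minimal sketch of the two filters, assuming the dataframe has columns named grade and favorite_color (the question does not show the actual column names, and the function name and sample rows are also made up):

```python
import pandas as pd


def grades_colors(students_df):
    """Keep students with a grade above 90 whose favorite color is green or red."""
    # Boolean mask for the numeric condition
    high_grade = students_df["grade"] > 90
    # Boolean mask for the string condition
    preferred_color = students_df["favorite_color"].isin(["green", "red"])
    # Combine the masks with & and use the result to select rows
    return students_df[high_grade & preferred_color]


# Hypothetical sample data
students_df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara", "Dev", "Eli"],
    "favorite_color": ["green", "blue", "red", "red", "green"],
    "grade": [95, 97, 91, 85, 90],
})
result = grades_colors(students_df)
print(result)
```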

52. Write a function to generate N samples from a normal distribution and plot the histogram. You may omit the plot to test your code.

This is a relatively simple problem: we set up our distribution, generate n samples from it, and then plot them. In this question, we make use of SciPy, a library made for scientific computing.

First, we will declare a standard normal distribution. A standard normal distribution, for those of you who may have forgotten, is the normal distribution with mean = 0 and standard deviation = 1. To declare a normal distribution, we use SciPy’s stats.norm(loc, scale) function, where loc is the mean and scale is the standard deviation (not the variance), and specify the parameters as mentioned above.
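For a version that runs without third-party packages, the sketch below swaps SciPy for Python's standard-library statistics.NormalDist and prints a crude text histogram instead of a plot. The function names and the bin width are arbitrary choices for illustration.

```python
import math
from collections import Counter
from statistics import NormalDist, mean, stdev


def sample_normal(n, mu=0.0, sigma=1.0, seed=None):
    """Draw n samples from a normal distribution with mean mu and std dev sigma."""
    return NormalDist(mu=mu, sigma=sigma).samples(n, seed=seed)


def text_histogram(samples, bin_width=0.5):
    """Print one row per bin, with one '#' for every 5 samples in the bin."""
    bins = Counter(math.floor(x / bin_width) * bin_width for x in samples)
    for left_edge in sorted(bins):
        print(f"{left_edge:6.1f} | {'#' * (bins[left_edge] // 5)}")


samples = sample_normal(1000, seed=42)
print(round(mean(samples), 2), round(stdev(samples), 2))
text_histogram(samples)
```

The sample mean and standard deviation should land near 0 and 1, and the histogram rows should be longest around zero.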

53. Amy and Brad take turns rolling a six-sided die. Whoever rolls a “6” first wins. Amy rolls first. What’s the probability that Amy wins?

In this question, we can write a Python function to simulate the scenario and see how frequently Amy wins. To solve this question, you must understand how to model two players and simulate the game with Amy rolling first each time.
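One way to sketch the simulation is shown below. The function name, number of games, and seed are arbitrary. As a check, the exact answer follows from P = 1/6 + (5/6)^2 · P, which gives P = 6/11.

```python
import random


def amy_win_probability(n_games=100_000, seed=0):
    """Estimate the probability that Amy, who rolls first, rolls the first 6."""
    rng = random.Random(seed)
    amy_wins = 0
    for _ in range(n_games):
        amy_turn = True  # Amy always rolls first
        # Keep rolling until someone gets a 6, switching players after each miss
        while rng.randint(1, 6) != 6:
            amy_turn = not amy_turn
        # Whoever held the die when the 6 came up wins the game
        amy_wins += amy_turn
    return amy_wins / n_games


estimate = amy_win_probability()
exact = 6 / 11
print(round(estimate, 3), round(exact, 3))
```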

54. Given a dictionary consisting of many roots and a sentence, write a function to stem all the words in the sentence with the root forming it.

In data science, there exists the concept of stemming, which is the heuristic of chopping off the end of a word to clean and bucket it into an easier feature set. That’s being tested in this Facebook question.

One tip: If a word has many roots that can form it, replace it with the root with the shortest length.



Input:

roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"

Output:

"the cat was rat by the bat"
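A minimal sketch of one possible approach, assuming a root "forms" a word when the word starts with that root (the function name is hypothetical):

```python
def stem_sentence(roots, sentence):
    """Replace each word with the shortest root that it starts with, if any."""
    # Shortest roots first, so the first match is also the shortest match
    ordered_roots = sorted(roots, key=len)
    stemmed = []
    for word in sentence.split():
        # Fall back to the original word when no root matches
        replacement = next((r for r in ordered_roots if word.startswith(r)), word)
        stemmed.append(replacement)
    return " ".join(stemmed)


roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
print(stem_sentence(roots, sentence))  # the cat was rat by the bat
```

For a large dictionary of roots, a trie or a set of roots checked against each word prefix would avoid scanning every root for every word.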

55. Given a percentile threshold and N samples, write a function to simulate a truncated normal distribution.



Input:

m = 2
sd = 1
n = 6
percentile_threshold = 0.75

Output:

truncated_dist(m, sd, n, percentile_threshold) -> [2, 1.1, 2.2, 3, 1.5, 1.3]

All values in the output sample are in the lower 75% = percentile_threshold of the distribution.

Hint: First, for this question, we need to calculate where to truncate our distribution. We want a sample where all values are below the percentile_threshold.
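Following the hint, one possible sketch finds the cutoff with the inverse CDF and then keeps only draws at or below it (rejection sampling). It uses the standard library rather than any particular package, and the optional seed parameter is an addition for reproducibility.

```python
import random
from statistics import NormalDist


def truncated_dist(m, sd, n, percentile_threshold, seed=None):
    """Draw n samples from N(m, sd) that fall at or below the given percentile."""
    rng = random.Random(seed)
    # Value below which `percentile_threshold` of the distribution lies
    cutoff = NormalDist(mu=m, sigma=sd).inv_cdf(percentile_threshold)
    samples = []
    while len(samples) < n:
        draw = rng.gauss(m, sd)
        # Reject any draw above the cutoff and try again
        if draw <= cutoff:
            samples.append(draw)
    return samples


samples = truncated_dist(m=2, sd=1, n=6, percentile_threshold=0.75, seed=7)
cutoff = NormalDist(mu=2, sigma=1).inv_cdf(0.75)
print([round(x, 2) for x in samples], round(cutoff, 2))
```

Rejection sampling is simple but wasteful for low thresholds, since most draws are discarded. Sampling a uniform value in (0, percentile_threshold) and passing it through inv_cdf would avoid the rejections.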

56. Write a function to list the pairs of friends with their corresponding timestamps of the friendship beginning and the timestamp of the friendship ending.

For this Facebook data science interview question, you’re given two lists of dictionaries representing friendship beginnings and endings: friends_added and friends_removed. Each dictionary contains the user_ids and created_at time of the friendship beginning/ending.

Input:

friends_added = [
    {'user_ids': [1, 2], 'created_at': '2020-01-01'},
    {'user_ids': [3, 2], 'created_at': '2020-01-02'},
    {'user_ids': [2, 1], 'created_at': '2020-02-02'},
    {'user_ids': [4, 1], 'created_at': '2020-02-02'}]

friends_removed = [
    {'user_ids': [2, 1], 'created_at': '2020-01-03'},
    {'user_ids': [2, 3], 'created_at': '2020-01-05'},
    {'user_ids': [1, 2], 'created_at': '2020-02-05'}]

Output:

friendships = [
    {'user_ids': [1, 2],
     'start_date': '2020-01-01',
     'end_date': '2020-01-03'},
    {'user_ids': [1, 2],
     'start_date': '2020-02-02',
     'end_date': '2020-02-05'},
    {'user_ids': [2, 3],
     'start_date': '2020-01-02',
     'end_date': '2020-01-05'}]
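One way to sketch a solution is to normalize each pair by sorting its user IDs, then match every friendship start with the earliest unused removal on or after it. The function name is hypothetical, and the sketch assumes dates are ISO-formatted strings (so they sort correctly as text) and that friendships without an ending are left out.

```python
from collections import defaultdict


def friendship_timeline(friends_added, friends_removed):
    """Pair each friendship start with the next removal for the same two users."""
    # Removal dates per normalized pair, oldest first
    removals = defaultdict(list)
    for event in sorted(friends_removed, key=lambda e: e["created_at"]):
        removals[tuple(sorted(event["user_ids"]))].append(event["created_at"])

    friendships = []
    for event in sorted(friends_added, key=lambda e: e["created_at"]):
        pair = tuple(sorted(event["user_ids"]))
        pending = removals[pair]
        # A removal dated before this start cannot end this friendship
        while pending and pending[0] < event["created_at"]:
            pending.pop(0)
        if pending:  # skip friendships that never ended
            friendships.append({
                "user_ids": list(pair),
                "start_date": event["created_at"],
                "end_date": pending.pop(0),
            })
    return sorted(friendships, key=lambda f: (f["user_ids"], f["start_date"]))


friends_added = [
    {"user_ids": [1, 2], "created_at": "2020-01-01"},
    {"user_ids": [3, 2], "created_at": "2020-01-02"},
    {"user_ids": [2, 1], "created_at": "2020-02-02"},
    {"user_ids": [4, 1], "created_at": "2020-02-02"},
]
friends_removed = [
    {"user_ids": [2, 1], "created_at": "2020-01-03"},
    {"user_ids": [2, 3], "created_at": "2020-01-05"},
    {"user_ids": [1, 2], "created_at": "2020-02-05"},
]
friendships = friendship_timeline(friends_added, friends_removed)
for friendship in friendships:
    print(friendship)
```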

Additional Python Data Science Interview Questions

Be sure to see our full guide to Python questions in data science interviews.

57. What built-in data types are used in Python?

58. How is a negative index used in Python?

59. What’s the difference between lists and tuples in Python?

60. Name mutable and immutable objects.

61. What is the difference between print and return?

62. What is the difference between dataframes and matrices?

63. What is a Python module? How is it different from libraries?

Probability Interview Questions

Probability Theory underpins all of statistics and machine learning, and, in data science interviews, probability questions are useful for assessing analytical reasoning. Most commonly, you’ll be asked to calculate probability based on a given scenario.

64. What is an unbiased estimator? Give an example for a layperson.

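An unbiased estimator is a statistic whose expected value equals the population parameter it is used to approximate. An example would be using the share of a random sample of 1,000 voters who support a candidate to estimate that candidate’s support in the total voting population. In practice, sampling problems such as nonresponse mean that few real-world estimates are perfectly unbiased.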

Need some help? See our probability course for an in-depth explanation.

65. Explain how a probability distribution could be not normal and give an example scenario.

A probability distribution is not normal if its observations do not cluster symmetrically around the mean in the shape of a bell curve. An example of a non-normal probability distribution is a uniform distribution, in which all values are equally likely to occur within a given range.

66. You are given two fair 6-sided dice. If the sum of values is 7, then you win $21. However, you must pay $10 per roll. Is the game worth playing?

First, we consider how many ways we can possibly roll a seven. Let D_1 and D_2 be the results of the first and second die, respectively. There are six ways, out of 36 equally likely outcomes, for the sum D_1 + D_2 to equal 7. See the solution on Interview Query.
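The counting step can be checked with a short enumeration. This sketch assumes the $10 is paid on every roll and the $21 is the gross payout on a seven; it is not necessarily how the referenced solution frames the game.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two fair six-sided dice
rolls = list(product(range(1, 7), repeat=2))
ways_to_seven = sum(1 for d1, d2 in rolls if d1 + d2 == 7)

p_seven = Fraction(ways_to_seven, len(rolls))
# Expected net result of one roll: win $21 with probability p_seven, always pay $10
expected_net = p_seven * 21 - 10

print(ways_to_seven, p_seven, float(expected_net))
```

Under these assumptions the expected payout per roll is $3.50 against a $10 cost, so each roll loses $6.50 on average.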

67. You have a shuffled deck of 500 cards numbered 1 to 500. You pick 3 cards. What’s the probability of each subsequent card being larger than the last?

Imagine this as a sample space problem, ignoring all other distracting details. If someone randomly picks three differently numbered unique cards without replacement, then we can assume that there will be a lowest card, a middle card, and a high card.

Let’s make this easy and assume we drew the numbers 1, 2, and 3. In our scenario, if we drew (1,2,3) in that exact order, then that would be the winning scenario.

But what’s the full range of outcomes we could draw? See the full solution to this LinkedIn question on Interview Query.
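To explore the range of outcomes, the sketch below enumerates every ordered draw of three cards. It uses a 10-card stand-in deck, an assumption made to keep the enumeration small; because only the relative order of the three cards matters, the deck size does not change the result.

```python
from fractions import Fraction
from itertools import permutations

# Small stand-in deck; only the relative order of the three drawn cards matters
deck = range(1, 11)

# Every ordered draw of 3 distinct cards, each equally likely
draws = list(permutations(deck, 3))
increasing = [d for d in draws if d[0] < d[1] < d[2]]

p_increasing = Fraction(len(increasing), len(draws))
print(len(draws), len(increasing), p_increasing)
```

Each set of three cards can be drawn in 3! = 6 orders and exactly one of them is increasing, which is why the enumeration gives 1/6.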

68. Netflix has hired people to rate movies. Assuming all raters rate the same number of movies, what is the probability that a movie is rated good?

More context: Out of all of the raters, 80% of the raters carefully rate movies and rate 60% of the movies as good and 40% as bad. The other 20% are lazy raters and rate 100% of the movies as good.

This question is asking us what percentage of movies are being rated good. How would we formally express this probability given the information provided?
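One way to express this formally is the law of total probability: P(good) = P(careful) · P(good | careful) + P(lazy) · P(good | lazy). The sketch below plugs in the numbers from the prompt; the variable names are arbitrary.

```python
from fractions import Fraction

p_careful = Fraction(80, 100)             # share of careful raters
p_lazy = Fraction(20, 100)                # share of lazy raters
p_good_given_careful = Fraction(60, 100)  # careful raters rate 60% of movies good
p_good_given_lazy = Fraction(1)           # lazy raters rate every movie good

# Law of total probability over the two kinds of rater
p_good = p_careful * p_good_given_careful + p_lazy * p_good_given_lazy
print(p_good, float(p_good))  # 17/25 0.68
```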

Additional Probability Data Science Questions

Be sure to see our full guide to probability questions in data science interviews.

69. There is a fair coin (heads, tails) and an unfair coin (both tails). You pick one and flip it 5 times. It comes up tails 5 times. What is the chance you are flipping the unfair coin?

70. A fair six-sided die is rolled twice. What is the probability of getting 1 on the first roll and not getting 6 on the second roll?

71. Say you are given an unfair coin, with an unknown bias towards heads or tails. How can you generate fair odds using this coin?

72. Given draws from a normal distribution with known parameters, how can you simulate draws from a uniform distribution?

73. Give examples of machine learning models with high bias.

74. How many ways can you split 12 people into 3 teams of 4?

Statistics and A/B Testing Interview Questions

Statistics and A/B questions test your ability to design tests, understand the results, and perform statistical computations. Most commonly, they’ll be framed as:

  • Case study - A/B testing case study questions will provide you a testing scenario and ask you to design an A/B test, assess what’s going on with a test, or measure the results.
  • Experiment design - These questions test your ability to design and measure an A/B test, and include many core statistical concepts like P-value, significance, and power in testing.

75. How would you approach designing an A/B test? What factors would you consider?

In general, you should start with understanding what you want to measure. From there, you can begin to design and implement a test. There are four key aspects to consider:

  1. Setting Metrics - A good metric is simple, directly related to the goal at hand, and quantifiable.
  2. Constructing Thresholds - By what degree must your key metric change for the experiment to be considered successful?
  3. Sample Size and Experiment Length - How large a group are we going to test on, and for how long?
  4. Randomization and Assignment - Who gets which version of the test and when? We need at least one control group and one variant group.

76. What are MLE and MAP? What is the difference between the two?

Maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation are methods used to estimate a variable with respect to observed data. Both produce a single point estimate of the variable rather than a full distribution.

In machine learning, you can think of this variable as a model parameter or a weight that the model can learn. Both methods are used to find the optimal parameter given the training dataset. However, they take different approaches: MLE chooses the parameter value that maximizes the likelihood of the observed data, while MAP maximizes the posterior, which also weights each candidate value by a prior distribution.
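A coin-flip example makes the difference concrete. The sketch below estimates the probability of heads with MLE and with MAP under a Beta(alpha, beta) prior; the function names, the flip counts, and the choice of prior are all assumptions made for illustration.

```python
def mle_heads(heads, flips):
    """MLE of P(heads): the observed fraction of heads."""
    return heads / flips


def map_heads(heads, flips, alpha, beta):
    """MAP of P(heads) under a Beta(alpha, beta) prior.

    This is the mode of the Beta(heads + alpha, flips - heads + beta) posterior,
    which is valid when both posterior parameters are greater than 1.
    """
    return (heads + alpha - 1) / (flips + alpha + beta - 2)


# Hypothetical data: 7 heads in 10 flips
mle = mle_heads(7, 10)
# A Beta(2, 2) prior mildly favors a fair coin and pulls the estimate toward 0.5
map_estimate = map_heads(7, 10, alpha=2, beta=2)
# A uniform Beta(1, 1) prior adds no information, so MAP matches MLE
map_uniform = map_heads(7, 10, alpha=1, beta=1)

print(mle, round(map_estimate, 4), map_uniform)
```

With more data the two estimates converge, because the likelihood increasingly outweighs the prior.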

77. How would you explain what happened in the experiment outlined below? How would you improve the experimental design?

Overview: You are designing an experiment to measure the impact financial rewards have on user response rates. The results show that the treatment group (received a $10 reward) has a 30% response rate, while the control group (no rewards) has a 50% response rate.

See a full video solution to this A/B testing experiment design problem on YouTube: “A/B Testing Interview with a Google Data Scientist.”

78. What do you think the distribution of time spent per day on Facebook looks like? What metrics would you use to describe that distribution?

Having the vocabulary to describe a distribution is an important skill as a data scientist when it comes to communicating ideas to your peers. There are four important concepts, with supporting vocabulary, that you can use to structure your answer to a question like this. These are:

  1. Center (mean, median, mode)
  2. Spread (standard deviation, interquartile range, range)
  3. Shape (skewness, kurtosis, unimodal or bimodal)
  4. Outliers (Do they exist?)

See a full solution for this problem on Interview Query.

79. What metrics would you analyze and what statistical methods would you use to identify athletic anomalies indicative of a dishonest user on a sports tracking app?

More context: Suppose you work for a company whose main product is a sports app that tracks and displays running/jogging/cycling data for its users. Some metrics tracked by the app are distance, pace, splits, elevation gain, and heart rate.

You’re given the task of formulating a method to identify dishonest or cheating users – such as users who drive a car while claiming they’re on a bike ride.

Try this question on Interview Query.

80. How would you make a control group for the close friends feature on Instagram Stories and test group to account for network effects?

Usually, A/B test design starts with randomly dividing users into two groups. Then, you give each group a different version of the product, and look for differences in behavior between the groups. Random assignment is done on a per-user basis, typically. Unfortunately, this standardized method doesn’t work well in this A/B testing case question for Instagram Stories. Why is that?

Additional A/B Testing and Statistics Questions

Be sure to see our full guide to A/B testing and statistics questions in data science interviews.

81. What significance level would you target in an A/B test?

82. How would you explain what a p-value is to someone who is not technical?

83. What does it mean for a function to be monotonic?

84. Why is it important that a transformation applied to a metric is monotonic?

85. How is statistical significance assessed?

86. What does selection bias mean?

Business Case Questions

In data science interviews, business case questions provide a scenario and test your problem-solving approach. You’ll be provided with a dataset and problem, and then you’ll be asked to investigate and offer solutions to the problem.

87. DoorDash just launched in NYC and Charlotte. How would you choose Dashers (delivery drivers) for these new cities?

First, think about the criteria. Would they be the same for both cities? Certainly not. That’s because the two cities are very different in terms of density; Dashers in NYC might prefer bikes, while Charlotte’s Dashers might prefer cars. See a full solution to this question on Interview Query.

88. You want to remove duplicate product names from a very large ecommerce database. How would you go about doing this?

More context: The duplicates may be listed under different sellers, names, etc. For example, “iPhone X” and “Apple iPhone 10” are two names that mean the same iPhone.

See a full mock interview solution for this question on YouTube: “Duplicate Products.”

89. Let’s say you are a data scientist on the marketing team for Spotify. How would you determine how much Spotify should pay for an ad in a third party app?

Start with some clarifying questions. How large is the user base of this third-party app? What are the user demographics? Additionally, you’d want to learn more about Spotify’s marketing strategy, e.g., how much Spotify spends on similar campaigns, the goals of the ad campaign, and the target CPA (if that’s the goal).

Then, you can start to estimate pricing based on probable conversion rates.
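As a back-of-the-envelope sketch of that estimate, the function below turns assumed conversion rates and a target CPA into a price ceiling per thousand impressions. The function name and all input numbers are hypothetical and not taken from Spotify.

```python
def max_cpm(click_through_rate: float, conversion_rate: float, target_cpa: float) -> float:
    """Highest price per 1,000 impressions that still meets the target CPA."""
    # Expected sign-ups generated by a single ad impression.
    conversions_per_impression = click_through_rate * conversion_rate
    # Each conversion may cost at most target_cpa, so this is the value of
    # one impression; scale it to 1,000 impressions.
    return conversions_per_impression * target_cpa * 1000


# Assumed inputs: 1% of viewers click, 5% of clickers subscribe, $20 target CPA.
print(round(max_cpm(0.01, 0.05, 20.0), 2))  # 10.0
```

With these assumed inputs, paying more than about $10 per thousand impressions would push the cost per acquisition above the target.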

90. How would you determine if the price of a Netflix subscription is truly the deciding factor for a consumer?

See a full solution for this business case question on YouTube:

Netflix Pricing

Additional Business Case Questions

See our full guide to business case questions in data science interviews.

91. How would you measure the success of Facebook Groups?

92. Determine average lifetime value of a customer based on provided criteria.

93. Is a 50% discounted rider promotion a good or bad idea? How would you measure the promotion?

94. What metrics would you look at to estimate demand for a ride-sharing app?

95. How would you approach forecasting revenue for the coming year?

Database Design and Engineering

Database design questions are asked to test your knowledge of data architecture and design. In data science interviews, database design questions will usually ask you to design a database from scratch for a provided application or business idea.

96. What are the features of a physical data model?

Here’s an overview of the features in physical models:

  • All tables and columns are specified.
  • Foreign keys identify relationships between tables.
  • Denormalization may occur based on user requirements.
  • Physical considerations may cause the physical data model to differ substantially from the logical data model.
  • The physical data model differs from one RDBMS to another. For example, the data type for a column may differ between MySQL and SQL Server.
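To make these features concrete, here is a small sketch using Python’s built-in `sqlite3` module. The `customers` and `orders` tables and their columns are hypothetical. SQLite is used only because it needs no setup, and another RDBMS would use different column types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id      INTEGER PRIMARY KEY,
    -- Foreign key: every order must point at an existing customer.
    customer_id   INTEGER NOT NULL REFERENCES customers(customer_id),
    -- Denormalized copy of the name, kept to avoid a join in reports.
    customer_name TEXT NOT NULL,
    total_cents   INTEGER NOT NULL
);
""")
```

Every table and column is spelled out with a concrete type, the relationship is expressed as a foreign key, and `customer_name` shows denormalization chosen for a reporting requirement.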

97. Explain a challenging data modeling project you have worked on.

With a question like this, focus on outlining the project and the unique challenges of the project. For example, if you worked with a healthcare company, patient privacy could have been a potential problem. Additionally, explain all of the entities that were linked and your process for designing the data model.

98. Let’s say you’re setting up the analytics tracking for a web app. How would you create a schema to represent client click data on the web?

A simple but effective schema for this question starts by representing each action with a specific label. In this case, that means assigning each click event a name describing its specific action.

For example, let’s say the product is Dropbox and we want to track each folder click on the UI of an individual person’s Dropbox. We can label a click on a folder with an action name called folder_click. When the user clicks on the side panel to log in or log out and we need to specify the action, we can call these login_click and logout_click.
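A labeled click event of this kind could be sketched as a small record type. Only the action names come from the example above. The other fields (`user_id`, `occurred_at`, `properties`) and the helper `folder_click` are assumptions about what a tracking schema might carry.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass(frozen=True)
class ClickEvent:
    action: str            # label such as "folder_click" or "login_click"
    user_id: str           # who performed the click (assumed field)
    occurred_at: datetime  # when it happened, in UTC (assumed field)
    # Action-specific details, e.g. which folder was clicked (assumed field).
    properties: Dict[str, Any] = field(default_factory=dict)


def folder_click(user_id: str, folder_id: str) -> ClickEvent:
    """Build the event recorded when a user clicks a folder."""
    return ClickEvent(
        action="folder_click",
        user_id=user_id,
        occurred_at=datetime.now(timezone.utc),
        properties={"folder_id": folder_id},
    )


event = folder_click("user-42", "folder-7")
print(event.action, event.properties)
```

Keeping the action name as a plain label and pushing the details into `properties` lets new click types be added without changing the shape of the record.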

99. How would you design a database that could record rides between riders and drivers for a ride-sharing app? What would the table schema look like?

See the full solution for this database design question on YouTube:

Design a Ride Sharing Schema

100. Design the database schema to support a Yelp-like app.

More context: The application will allow anyone to review restaurants. This app should allow users to sign up, create a profile, and leave reviews on restaurants if they wish. These reviews can include texts and images. A user can only write one review for a restaurant, but can update their review later on.

The two main tables that can support the application functions would be a user table and a review table. You can include an additional restaurants table to better normalize the restaurant information.
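The three tables and the one-review-per-restaurant rule might look like the following sketch, again using Python’s built-in `sqlite3` module. All table and column names are hypothetical, and the single `image_url` column is a simplification. Supporting several images per review would likely need a separate table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    user_id  INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE
);
CREATE TABLE restaurants (
    restaurant_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL
);
CREATE TABLE reviews (
    review_id     INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL REFERENCES users(user_id),
    restaurant_id INTEGER NOT NULL REFERENCES restaurants(restaurant_id),
    body          TEXT NOT NULL,
    image_url     TEXT,
    updated_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (user_id, restaurant_id)  -- one review per user per restaurant
);
"""


def save_review(conn, user_id, restaurant_id, body):
    """Update the user's existing review for this restaurant, or insert a new one."""
    cur = conn.execute(
        "UPDATE reviews SET body = ?, updated_at = CURRENT_TIMESTAMP "
        "WHERE user_id = ? AND restaurant_id = ?",
        (body, user_id, restaurant_id),
    )
    if cur.rowcount == 0:  # no existing review was updated
        conn.execute(
            "INSERT INTO reviews (user_id, restaurant_id, body) VALUES (?, ?, ?)",
            (user_id, restaurant_id, body),
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO users (user_id, username) VALUES (1, 'ana')")
conn.execute("INSERT INTO restaurants (restaurant_id, name) VALUES (1, 'Taqueria')")
save_review(conn, 1, 1, "Good tacos.")
save_review(conn, 1, 1, "Great tacos, updated.")  # replaces the first review
```

The `UNIQUE (user_id, restaurant_id)` constraint enforces the one-review rule at the database level, while `save_review` covers the requirement that a user can update their review later.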

Additional Database Design Questions

Be sure to see our full guide to database design questions in data science interviews.

101. Write an ETL taking data from an API that performs X transformations and results in CSV files.

102. Explain how you would build a system to track changes in a database.

103. How would you design the data model for the notification system of a Reddit-style app?

104. What is a conceptual data model? What is a physical model?

105. What schema is better: Star or snowflake?

Additional Data Science Interview Resources

Join Interview Query today and begin preparing for your interview.

Members have access to:

Gain access to all of these resources when you join.