The Saturation of Data Science

Jay Feng

Remember when the term data scientist first took off around 2010? Harvard Business Review dubbed it the Sexiest Job of the 21st Century, prompting everyone who had ever worked in statistics or analytics to immediately retrofit their positions and titles for the recruiters on LinkedIn madly searching for these so-called unicorn data scientists. The joke at the time was that a data scientist was actually a data analyst who lived in the Bay Area. The other one was that a data scientist was a statistician plus an extra $100K.

Fast forward a few years, and these very same statisticians are now being paid $150K to $300K a year in total compensation. Don’t believe me? Check out the salary charts on Levels.fyi.

I mention these stats because they illustrate how far careers in software engineering and data science have come in the past decade. Both crafts, which were never regarded as highly as traditional paths in medicine, law, or banking, have become the new standard template for a successful career. All it takes is a decent undergraduate degree here, or a bootcamp completion there, and a nice cushy internship over there, with a near-guaranteed outcome of stable income, regular hours, and a slingshot into the upper middle class.

We are entering the saturation of the data science market. Not in job demand, but as a career that is now in the purview of the general public. Ten years ago, it was machine learning enthusiasts and analytics fiends who paved the way for the data science job by doing what they loved. We have now reached the point where college freshmen decide on a data science career out of the rejection of other disciplines.

There’s nothing wrong with that. But as a data science and analytics candidate within this system, the best way to stand out to the companies and startups hiring will always be a genuine appreciation for the curiosity and analysis at the heart of the field. A successful data scientist is not measured by academic rigor or the other optimizations a perennial 4.0 student can make. Rather, it’s the continuous curiosity of using data to unravel why certain people get cancer, why houses sell for a million dollars, why people enjoy reading fake news, and why people love working in data science.

Three SQL questions you must know to pass your data science interview

Jay Feng

I’ve interviewed a lot of data scientist candidates and have found that a lot of SQL interview questions for data science eventually boil down to three generalized types of conceptual understanding.

If you’re a data scientist who’s interviewing, these problems are must-knows! They’re also great filter questions, testing more than the candidate who did a basic one-hour primer on the differences between an INNER and LEFT JOIN. Here’s an example question of each type.

1. Getting the first or last value for each user in a `transactions` table.

            
        `transactions`   
        +---------------+---------+     
        | user_id       | int     |     
        | created_at    | datetime|     
        | product       | varchar |     
        +---------------+---------+ 
        Question: Given the user transactions table above, 
        write a query to get the first purchase for each user.
            
        

Why does this matter? How would you query for the first time a person commented on a post, and then read the post itself? How do we cohort users by start date? All of these analyses need this concept of querying based on the first or last time, and it can definitely be solved without an expensive partition function.

Explanation:

          
      We want to take a table that looks like this:

       user_id | created_at | product  
       --------+------------+--------
        123    | 2019-01-01 | apple    
        456    | 2019-01-02 | banana   
        123    | 2019-01-05 | pear    
        456    | 2019-01-10 | apple   
        789    | 2019-01-11 | banana  

      and turn it into this:

       user_id | created_at | product   
      ---------+------------+--------
       123     | 2019-01-01 | apple      
       456     | 2019-01-02 | banana     
       789     | 2019-01-11 | banana   
      
      How do we get there? 
          
      

We can solve this problem by doing a multi-column join.

First, how do we figure out the first time each user purchased? This should be pretty simple and can be done with a simple GROUP BY aggregation, taking the minimum datetime. Notice that the table has a created_at column. This is the column that determines which row is the first purchase for a specific user, so we can write a query with an aggregation to get the minimum datetime for every user.

          
      SELECT user_id, MIN(created_at) AS min_created_at
      FROM transactions
      GROUP BY 1
          
      
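Running this against the sample table above yields one row per user with their earliest purchase time:

       user_id | min_created_at
       --------+---------------
        123    | 2019-01-01
        456    | 2019-01-02
        789    | 2019-01-11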

Awesome. Now all we have to do is join this result back to the original table on two columns: user_id and created_at. The self join will effectively filter for the first purchase. Then all we have to do is grab all of the columns from the left-side table.

          
      SELECT t.user_id, t.created_at, t.product
      FROM transactions AS t
      INNER JOIN (
          SELECT user_id, MIN(created_at) AS min_created_at
          FROM transactions
          GROUP BY 1
      ) AS t1
          ON t.user_id = t1.user_id
          AND t.created_at = t1.min_created_at
          
      
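For comparison, the same result can be expressed with a window function. This is a sketch, assuming a dialect that supports ROW_NUMBER (e.g. Postgres); on a large table without the right index, the partition can be more expensive than the self join above:

      -- Rank each user's purchases by time and keep only the earliest one
      SELECT user_id, created_at, product
      FROM (
          SELECT
              user_id
            , created_at
            , product
            , ROW_NUMBER() OVER (
                  PARTITION BY user_id
                  ORDER BY created_at
              ) AS purchase_rank
          FROM transactions
      ) AS ranked
      WHERE purchase_rank = 1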

2. Knowing the difference between a LEFT JOIN and INNER JOIN in practice.

          
      `users`
      +---------+---------+     
      | id      | int     |     
      | name    | varchar |     
      | city_id | int     |<-+     
      +---------+---------+  |
                             |
                             |
      `cities`               |
      +---------+---------+  |   
      | id      | int     |<-+   
      | name    | varchar |       
      +---------+---------+
      Question: Given the `users` and `cities` tables above, 
      write a query to return the list of cities without any users.
          
      

Why does this matter? Anyone can memorize the definitions of an inner join and left join when asked during an interview. The Venn diagram provides an adequate explanation. But can the candidate actually implement the difference in practice?

Explanation:

What is the actual difference between a LEFT JOIN and INNER JOIN?

          
      INNER JOIN: returns rows when there is a match in both tables. 
      LEFT JOIN: returns all rows from the left table, even if there are no matches in the right table.
          
      

Okay, so we know that each user in the users table must live in a city, given the city_id field. The cities table, however, doesn’t have a user_id field. So if we run an INNER JOIN between these two tables, joined on the city_id in each table, we’ll get all of the cities that have users, and all of the cities without users will be filtered out.

          
      SELECT cities.name, users.id
      FROM cities
      INNER JOIN users 
          ON users.city_id = cities.id
          
      

But what if we run a LEFT JOIN between cities and users?

          
      cities.name  | users.id
      _____________|__________
      seattle      | 123
      seattle      | 124
      portland     | null
      san diego    | 534
      san diego    | 564
          
      

Here we see that, since we are keeping all of the values from the LEFT side of the join and there’s no match between the city of Portland and any user in the database, that city shows up with a NULL user. Therefore, all we have to do now is add a WHERE filter for rows where the users-side value is NULL.

          
      SELECT cities.name, users.id
      FROM cities
      LEFT JOIN users 
          ON users.city_id = cities.id
      WHERE users.id IS NULL
          
      
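Equivalently, the anti-join can be written with NOT EXISTS, which reads closer to the question as stated. A sketch in standard SQL:

      -- Keep only the cities for which no matching user row exists
      SELECT cities.name
      FROM cities
      WHERE NOT EXISTS (
          SELECT 1
          FROM users
          WHERE users.city_id = cities.id
      )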

3. Aggregations with a conditional statement

          
     `transactions`   
     +---------------+---------+     
     | user_id       | int     |     
     | created_at    | datetime|     
     | product       | varchar |     
     +---------------+---------+
     Question: Given the same user transactions table as before, 
     write a query to get the total purchases made in the morning 
     versus afternoon/evening (AM vs PM) by day. 
          
      

Why does this matter? If you can’t use conditional statements and/or aggregate with conditional statements, there’s no way to run any kind of analytics. How else would you look at differences between populations based on new features or variables?

Explanation:

Notice that whenever the question asks for a "versus" comparison, we’re comparing two groups. Every time we have to compare two groups, we must use a GROUP BY. It’s in the name. Heh.

In this case, we need to create a separate column to actually run our GROUP BY on: whether the created_at field falls in the AM or the PM. Let’s write a condition in SQL to differentiate them.

           
     -- EXTRACT is Postgres syntax; in MySQL this would be HOUR(created_at)
     CASE WHEN EXTRACT(HOUR FROM created_at) > 11
         THEN 'PM' ELSE 'AM' END AS time_of_day
           
       

Pretty simple. We extract the hour from the created_at column and set the new column, time_of_day, to AM or PM based on that condition. Now we just have to GROUP BY both the original created_at field truncated to the day AND the new column we created that differentiates each row. The last aggregation is then the output variable we want, total purchases, computed with the COUNT function.

             
       SELECT
           DATE_TRUNC('day', created_at) AS date
         , CASE WHEN EXTRACT(HOUR FROM created_at) > 11
               THEN 'PM' ELSE 'AM' END AS time_of_day
         , COUNT(*) AS total_purchases
       FROM transactions
       GROUP BY 1, 2
             
         
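A common follow-up is to pivot AM and PM into side-by-side columns rather than separate rows, which is the same conditional moved inside the aggregate. A sketch in Postgres-style SQL (the column names am_purchases and pm_purchases are just illustrative):

       -- Pivot AM/PM into columns with conditional aggregation
       SELECT
           DATE_TRUNC('day', created_at) AS date
         , SUM(CASE WHEN EXTRACT(HOUR FROM created_at) <= 11 THEN 1 ELSE 0 END) AS am_purchases
         , SUM(CASE WHEN EXTRACT(HOUR FROM created_at) > 11 THEN 1 ELSE 0 END) AS pm_purchases
       FROM transactions
       GROUP BY 1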
