A friend of mine called me the other day out of the blue in a state of frustration.
Let’s call him Don.
Don was ranting about a data science interview he’d had thirty minutes earlier. Over the phone, his interviewer had given him two example tables and asked whether he could write a query that joined them to return a value.
When presented with a code editor, my friend didn’t understand why SQL was the only option.
“Um, I don’t know SQL but I can do anything you want in Pandas,” Don replied nervously.
The interviewer went quiet on the other end of the line. He finally spoke up and asked why Don didn’t know SQL. Don was a machine learning engineer at a mid-sized startup. Even so, he got a rejection email the next day.
I was also perplexed about why he didn’t know SQL. Then, after talking to many aspiring data scientists on Interview Query, I realized that they too, like Don, had very little SQL experience. Most data science enthusiasts understood the intricacies of advanced data manipulation, aggregation, joins, and merges within pandas, only to flail when given a SQL question about a simple subquery. So why was SQL the default language for analytics interview questions in data science if so many interviewees didn’t know it?
It’s pretty simple: You can’t query a database in pandas.
SQL is used because most interviews are from huge tech companies and most huge tech companies have a lot of data stored in their databases. Like a ridiculous amount of data. One day’s worth of data at a company like Facebook or Dropbox can be represented by a number with more commas than fingers on your hands!
But sure, if you know the SQL basics, why can’t you just run `select * from events_table where blah = 'doodoo'` and pull the data directly into pandas? Because if your analytics database is like most companies’, that query will take a couple of hours and a thousand clusters to finish. And even then, pandas holds everything in your computer’s memory, and companies keep their data in relational databases for a reason. Your dataset is bigger than five million rows? Have fun watching your laptop slowly break down.
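To make that concrete, here is a small sketch of the difference between pulling every row and letting the database do the heavy lifting first. It uses an in-memory SQLite database as a stand-in for a real analytics warehouse; the `user_id` column and the sample rows are made up for illustration, and only `events_table` and `blah` come from the query above.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for an analytics database (a real one would be far larger).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events_table (user_id INTEGER, blah TEXT)")
conn.executemany(
    "INSERT INTO events_table VALUES (?, ?)",
    [(1, "doodoo"), (1, "doodoo"), (2, "doodoo"), (2, "other"), (3, "other")],
)

# Option 1: pull every matching row into pandas, then aggregate locally.
all_rows = pd.read_sql_query(
    "SELECT * FROM events_table WHERE blah = 'doodoo'", conn
)

# Option 2: let the database aggregate, so only one row per user comes back.
per_user = pd.read_sql_query(
    """
    SELECT user_id, COUNT(*) AS n_events
    FROM events_table
    WHERE blah = 'doodoo'
    GROUP BY user_id
    ORDER BY user_id
    """,
    conn,
)

print(len(all_rows), "rows pulled vs", len(per_user), "rows pulled")
```

With five rows the difference is nothing. With billions, the second query is the one that still fits on your laptop.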
Here’s the ultimate secret though: SQL is not hard.
Most people who have used pandas will pick up SQL like a trivial second language. Most people who understand Excel will also understand SQL very easily. Almost all of the operations that you can do in SQL you can do in pandas and vice versa. And getting good at SQL has its clear advantages, and not just for interviews.
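As a rough illustration of how closely the two map onto each other, here is the same join-and-aggregate question answered both ways. The `users` and `orders` tables and their columns are invented for this example, and SQLite stands in for whatever database you would actually be querying.

```python
import sqlite3

import pandas as pd

# Two made-up example tables.
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

# The pandas way: merge (inner join), then group and sum.
pandas_result = (
    users.merge(orders, on="user_id", how="inner")
    .groupby("name", as_index=False)["amount"]
    .sum()
    .sort_values("name")
    .reset_index(drop=True)
)

# The SQL way: load the same tables into SQLite and run a JOIN with GROUP BY.
conn = sqlite3.connect(":memory:")
users.to_sql("users", conn, index=False)
orders.to_sql("orders", conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT u.name, SUM(o.amount) AS amount
    FROM users AS u
    JOIN orders AS o ON o.user_id = u.user_id
    GROUP BY u.name
    ORDER BY u.name
    """,
    conn,
)

print(pandas_result)
print(sql_result)
```

If you can read the `merge` and `groupby` chain, you can already read the query: `merge` is the `JOIN`, `groupby` is the `GROUP BY`, and `.sum()` is `SUM()`.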
Pandas, with how simple it is to get started, is like data science 101 for aspiring data scientists. It comes from a place where work in data science land is perceived exactly like a Kaggle competition where datasets are pre-cleaned, requirements to pass are easily scored, and all datasets are clearly labeled under a nicely quant folder named
But the world is not like this. The real world is messy and requirements change faster than your CEO's mood swings. Which is why we have SQL, to help query the databases that hold and store our data efficiently and make it easy to run analytics.
So learn SQL. Don certainly did. Because he eventually got a job at Facebook with his new SQL knowledge.
If you want some practice questions, you can find some of the best data science SQL analytics questions on Interview Query.