Today I'm joined by Shashank, who is a data scientist and former business intelligence engineer at Amazon! Shashank and I worked on a business case study problem that is asked of Amazon business intelligence engineers and analysts. It involves breaking down a vague database problem into a workable solution that makes sense and can scale for Amazon's purposes. At the end we go over feedback and tips and tricks for the next one!
Preparing for a data science case study interview? Check out this guide on "Acing the Data Science Case Study Interview" on Interview Query!
Here's the link to watch the video of the mock interview.
Welcome to the mock interview and before we get started I would love to ask you about your background and how you got into data science.
Sure, thank you so much for having me. So a little bit about my background. I've always been a data and numbers guy and I used to work as a business intelligence engineer back before I got my Master's. I did a Master's in Data Analytics before becoming a data scientist for almost two years. Most of my experience has been around machine learning problems and implementation, which led me to learning some DevOps skills as well.
I ended up learning how to convert Python code to PySpark and then moved over to Amazon as a business intelligence engineer. Most of my work was around Tableau reporting, building ETL jobs, figuring out what kind of data needed to be in the reporting databases so that your end reports work well. And just digging deeper into different metrics that the business wants to learn and providing more information on what they could look at instead of what they're already looking at.
Duplicate Products Case Study
Awesome, so I'd love to start out with the first question.
So let's say you're working at a huge e-commerce website like Amazon and you want to get rid of duplicate products that may be listed under different sellers' names in a really large database. So for example we have two of the same product but named differently, like iPhone X and Apple iPhone 10.
Given that we have these two same products with different names, we want to de-duplicate them. But let's say that this example shows up for a lot of different cases. So what's one way that you would go about solving it?
Gotcha, so actually doing this, if it's an established e-commerce company I would assume that they would have some kind of an ID for every product that they have in their inventory. So something like an SKU or an ID. And if it's Amazon, then that's pretty unique and you know that even if the description is different under different sellers, I would assume that they would have the same SKU.
So if you just look at the list of all the SKUs and different sellers and then do a distinct GROUP BY on SKU across all sellers you'll find out which SKUs are replicated. And then once you have that you can go to the business team saying what do you want to do with them.
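To make the SKU check above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `listings` table, its columns, and the sample rows are all hypothetical, chosen only to illustrate the idea of grouping by SKU and keeping the SKUs that appear under more than one seller.

```python
import sqlite3

# Hypothetical listings table; table name, columns, and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (seller TEXT, sku TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [
        ("seller_a", "SKU-001", "iPhone X"),
        ("seller_b", "SKU-001", "Apple iPhone 10"),
        ("seller_c", "SKU-002", "Galaxy S9"),
    ],
)

# SKUs listed by more than one distinct seller are duplicate candidates.
duplicate_skus = conn.execute(
    """
    SELECT sku, COUNT(DISTINCT seller) AS seller_count
    FROM listings
    GROUP BY sku
    HAVING COUNT(DISTINCT seller) > 1
    ORDER BY sku
    """
).fetchall()

print(duplicate_skus)
```

The resulting list of SKUs and seller counts is what would be handed to the business team to decide what to do with the duplicates.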
Okay, let's make it a little bit more complicated and say that we don't have that SKU field and that people are just creating their listings by entering in what they think the product names are, maybe a picture or description as well, and essentially what the create product flow looks like on Amazon right now.
How would we then do the mapping to the SKU or would you think of a different approach towards solving the problem?
Yep a couple of things come to mind.
If we have images for these products that we think may be duplicated, we could try to use algorithms that identify similar images. Then once you have that list of similar images, you look at the descriptions and build a string similarity algorithm which outputs which descriptions sound similar or are close to each other. Now you would have at least two data points that you know these two products are similar. Then it's probably going to be a little bit of manual intervention to identify if they really are similar or not.
The other thing that I can think of is maybe reviews on different products. So imagine that there are two different products just named differently but both of them are the Apple iPhone 10. You would assume that the reviews are pretty much talking about a phone and that it's manufactured by Apple. They probably have the same kinds of experiences and reviews, so you could see if the reviews are very similar to each other and that would give a good indication that the product is probably the same.
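As a rough illustration of the description-similarity idea, the sketch below scores pairs of listing descriptions with `difflib.SequenceMatcher` from the Python standard library. The listing IDs, the descriptions, and the 0.8 cutoff are assumptions made for the example, and character-level matching is only one of many possible string similarity measures. It would not catch a pair like "iPhone X" and "Apple iPhone 10" on its own, which is why the image and review signals are suggested alongside it.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def description_similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Hypothetical listing descriptions keyed by made-up listing IDs.
descriptions = {
    "listing_1": "Apple iPhone X 64GB space gray smartphone",
    "listing_2": "apple iphone x  64gb Space Gray smart phone",
    "listing_3": "Stainless steel kitchen knife set",
}

REVIEW_THRESHOLD = 0.8  # assumed cutoff; would need tuning on real data

# Pairs scoring above the cutoff are queued for manual review.
candidate_pairs = []
for (id_a, text_a), (id_b, text_b) in combinations(descriptions.items(), 2):
    score = description_similarity(text_a, text_b)
    if score >= REVIEW_THRESHOLD:
        candidate_pairs.append((id_a, id_b, round(score, 2)))

print(candidate_pairs)
```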
Okay so let's say we go with all these methods. We're looking at similarity across images, descriptions, and reviews and we're getting this score for each one of them. Now how do we go about deciding if we can de-duplicate them or not?
Would we have a human review every single one? Do we do some sort of scaling process? Because let's say we have to do this for thousands and thousands of products right? What's the next step?
Gotcha. Well from the beginning we don't really know which products are the same or not so we can't really do a supervised learning method. It needs to be an unsupervised technique that first tries to identify what products are similar to each other. I probably would do a clustering technique based on just descriptions and reviews.
We'll definitely need to do some cleaning and tokenization for the text data to bring it to a structured format. Then we can run TF-IDF on the different descriptions and reviews to find out which documents are similar to each other. We'll get some scores and depending on how many documents end up in a particular cluster, we will definitely have to do a manual step to see if they're actually the same or not.
I'm not aware of a clustering technique that works on images but we would probably have to build out features from the image, bring it to a structured format, and then do clustering on top. So we might identify ten different clusters if there are ten items that are duplicated and then look at the clusters' descriptive statistics to see whether the customers in the reviews are really talking about a phone or whether it's actually a tablet or a computer, etc. And then go about a manual investigation from that point.
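The text side of this approach can be sketched in plain Python. The example below tokenizes a few hypothetical descriptions, builds TF-IDF vectors using one common smoothed IDF variant, and groups documents by cosine similarity. Note that the grouping step is a simple greedy threshold pass standing in for a real clustering algorithm, and the sample texts and the 0.5 threshold are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so terms that appear in every document keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_similar(docs, threshold=0.5):
    """Greedy grouping: join the first group whose first member is similar enough."""
    vectors = tfidf_vectors(docs)
    groups = []  # each group is a list of document indices
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Hypothetical listing descriptions.
docs = [
    "Apple iPhone X 64GB smartphone space gray",
    "iPhone X by Apple, 64GB, space gray smartphone",
    "Stainless steel chef knife set for the kitchen",
]
groups = group_similar(docs)
print(groups)
```

Each resulting group would still go through the manual check described above before any listings are merged.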
Okay so let's say we do that and we're going through these clusters and we find that for a couple of them, the algorithm has just clustered phones together instead of forming a cluster that is specific enough to a single product. Or maybe we're getting thousands of different clusters where there may or may not be duplicates.
Is there any way that we can optimize our manual intervention or scale this problem out so that we use the least amount of manual oversight while also figuring out a way to deduplicate efficiently?
I guess it would depend on the features that we actually extract as the more granular the features in our dataset, the better the clusters could be. If we are creating clusters just on the type of device then you're right, I think all phones and all computers will just end up together.
But if we are given that these are also duplicate listings, we would definitely want to look at more information in the listing itself, like the price of the product, the different colors that are available, and which features of iPhones and Androids are similar to each other. The features need to be as close to the product itself as possible so that our clusters are more distinguishable from each other and not as generic as phones and computers.
And then maybe the customers themselves. We can look at purchasing behavior as well. iPhones typically tend to sell out as soon as they are launched, so we can try to use information around when a particular product was launched and then look at the purchase pattern during that time and then try to integrate these features in the dataset.
Mock Interview Feedback
I wanted to do a brief feedback session on the question. What did you think about the first question?
I think the question was good. It was pretty vague in the beginning but I think based on what your cues were, I felt like we wanted a more algorithmic solution than a SQL database one.
So initially I thought that it was more of a simple question where I could tell what kind of distinct ID or which columns I need to be grouping by, but it turned out that we wanted to check this at a higher level. So I think it was a good brainstorming question. There are multiple ways that it could have gone but I think we ended up with a few good starting points.
Gotcha, so two bits of feedback. I think I like scaling out the approach but it would be better to have a broader horizon on the case. For example, not limiting it to just phones and maybe thinking about other cases with an e-commerce company as big as Amazon.
Then I think having more data points is also helpful when explaining these concepts. For example, being able to create assumptions off of one example is tough if there are thousands of these duplicated products on Amazon across thousands of different categories.
It would also be helpful to then think about how much can we automate out and what our threshold error rates are for each product. With iPhones, it's pretty important to minimize the error rate. But let's say that we're selling Pokemon cards that are duplicated. How much of that can we get on an automated solution of doing word matching and then being satisfied with a high enough match rate threshold?
I think for example, if we had a three percent error after doing a manual review of the matching, we can scale that out and then ask if we're okay with that error going forward. So generally what I'm getting to is a disclaimer or at least a dialogue around what could make sense in terms of implementation, not only technical process.
Yeah basically tweak the sensitivity enough for our business cases. Assuming that obviously we don't really care about duplication if it's something that doesn't affect the business too much.
Yeah we do care if multiple sellers are trying to sell really high value products like iPhones or Macs for whatever reason and we don't want these products to go past the first page results.
But in general, the thought process on the case was pretty good and structured and went down the narrow path that I was trying to lead towards.
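The error-rate point from the feedback can be sketched as a simple check: take a manually reviewed sample of automatically matched pairs, estimate the error rate, and compare it with a tolerance that depends on how much the category matters to the business. The category names, sample sizes, and tolerances below are hypothetical; only the three percent example comes from the discussion.

```python
from dataclasses import dataclass


@dataclass
class ReviewSample:
    category: str
    reviewed: int  # auto-matched pairs checked by a human reviewer
    wrong: int     # pairs the reviewer judged NOT to be true duplicates


# Hypothetical tolerances: stricter for high-value categories.
MAX_ERROR_RATE = {"phones": 0.01, "trading_cards": 0.05}


def error_rate(sample: ReviewSample) -> float:
    """Share of reviewed matches that turned out to be wrong."""
    return sample.wrong / sample.reviewed


def can_automate(sample: ReviewSample) -> bool:
    """True if the observed error rate is within the category's tolerance."""
    return error_rate(sample) <= MAX_ERROR_RATE[sample.category]


# Both samples show a three percent error rate, but only one is acceptable.
phones = ReviewSample("phones", reviewed=200, wrong=6)
cards = ReviewSample("trading_cards", reviewed=200, wrong=6)

print(can_automate(phones), can_automate(cards))
```

The design choice here is that the same measured error rate leads to different decisions per category, which is the kind of implementation dialogue the feedback asks for.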
Amazon Interviewing Rubric
Lastly, are there any ideas or thoughts on how these types of questions are measured in real interviews? This is kind of apart from the mock interview but I would love your thoughts on how you think these questions, which are more ambiguous, should get graded.
From a rubric perspective for case interviews, I think it's important to look at the kinds of clarifying questions the candidate asks and how much information beyond the problem statement they're able to extract from the interviewer. If they don't ask questions, then the interviewer could fail them when they actually wanted to give them some information so that the discussion leads down the path that they want it to go.
From what I understand more often than not in the case study interviews, it's the interviewer who decides where the candidate needs to end up at. Even if the candidate has multiple different ideas, it's up to the interviewer to guide it. I think the first step would be checking how many questions are they able to get in the beginning, how many extra data points they are able to think of, and then maybe like five minutes of questioning and then based on all the answers they get, how are they able to like drill down on those data points.
Such as do they have a segmentation approach to estimating different values in that problem statement? And then of course calling out the assumptions you know at every point. Is a candidate calling out what assumptions they are making and then checking if those assumptions make sense or not with the business point of view?
At the end I think there's never really a right answer so it's just about how well the candidate is able to summarize their solution to the problem.
I like that and think that's a great way to kind of define it as well and especially since it's so ambiguous on either side. I think being able to make note of what the good points are is helpful.
Check out our guide to the Amazon Business Analyst interview process.