Let’s say you are designing a marketplace for your website.
Selling firearms is prohibited by your website’s Terms of Service Agreement (not to mention the laws of your country). To this end, you want to create a system that automatically detects if a listing on the marketplace sells a gun.
How would you go about doing this?
The interviewer is looking for a few key points here. We’ll work through the answer with a four-step framework: clarifying the requirements, choosing metrics and data, selecting features and a model, and planning how to deploy and evaluate it.
I’m curious about a few things here. One thing I’m wondering about is how this would work in production. The task is to identify the listings automatically, but what happens with those identifications? Do they go to a human who reviews them, or does the listing come down automatically?
How would this fit in? Is the end of the line always going to be customer service?
I can reframe the question back to you. Given the current system and the context you have, do you think the goal should be to block the marketplace post at the moment it’s created, or to detect it afterward and then send it to customer support?
Let’s say that the current setup is entirely crowdsourced. We have a flag, and a user can flag a listing if they see that it’s a gun; it then goes to someone in customer support, who can remove the listing if they determine it is a gun. That’s the only thing that happens right now.
I was asking that question because if customer service is going to look at it afterward, which would probably be the better way to start, then I need to do an excellent job detecting all the possible firearm listings.
It’s terrible if I miss something that is actually a firearm, because then customer service never sees it, and a missed listing could mean breaking the laws of your country or your own terms of service, which is very costly. False positives, listings that aren’t guns but get flagged anyway, cost some review time and money, but that’s probably okay; we can let customer service deal with them.
We want customer service to see everything pertinent so they can decide. But if review costs are a concern, then we would also care about the false positives.
Let’s think about what an acceptable outcome looks like. Say someone posts a gun, it goes up on the marketplace, and it gets removed within an hour. In that scenario, it’s unlikely that anyone will actually try to purchase the gun from the seller.
A bad scenario is that we get overloaded with items that are not guns, and those sellers all get the typical message: “Your posting was flagged until customer service can review it.” All of a sudden, sellers think this marketplace doesn’t work: “I can’t even list my plants on here without them getting flagged.” When we’re thinking about these different scenarios, what is optimal? What do you think Facebook would want to do in this situation?
It depends on what happens with the model’s output at the end, and we’ve laid out different scenarios. In one, you don’t want to miss anything, so the false negatives are what matter. In another, a listing gets flagged because the model thinks it sees a gun that isn’t there, and that can lead to many issues too. If we’re concerned with both scenarios, we want to minimize both false positives and false negatives, and we can use a metric like the F1 score for that. The F1 score is a combination of precision and recall. Precision asks: of all the listings you predict to be guns, how many actually are? Recall asks: of all the actual gun listings, how many did you catch, and how much did you miss? That seems pertinent here because the number of gun listings will probably be minimal; I’m assuming the actual number of gun posts is maybe even less than 1% of the Facebook marketplace.
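To make those metrics concrete, here is a minimal sketch, assuming scikit-learn and a made-up label set (none of the numbers come from the interview), of how accuracy can look great on an imbalanced problem like this while recall exposes the misses:

```python
# Sketch: precision, recall, and F1 on a heavily imbalanced label set.
# The labels are invented for illustration; 1 = gun listing, 0 = anything else.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0] * 95 + [1] * 5              # ~5% positives; the real rate would be far lower
y_pred = [0] * 95 + [1, 1, 0, 0, 0]      # the model catches only 2 of the 5 gun listings

print("accuracy :", accuracy_score(y_true, y_pred))    # 0.97 -- looks great, but misleading
print("precision:", precision_score(y_true, y_pred))   # 1.00 -- every flag was a real gun
print("recall   :", recall_score(y_true, y_pred))      # 0.40 -- we missed 3 of the 5 guns
print("f1       :", f1_score(y_true, y_pred))          # ~0.57 -- balances the two
```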
Measures like precision and recall ignore the obvious, predominant class and focus on the positive case, actually catching the listings with guns. That class imbalance will also come into consideration for the models we choose. The other thing to consider is what sort of scale, scope, and data we might have.
The model doesn’t need to be that fast, at least in terms of how quickly it has to respond once it’s deployed. Then, in terms of the primary data, we would have access to past postings that customer service has flagged, so we have an extensive data set where we know, for each posting, whether it was for a gun or not.
Yes, I think we can identify whether they were guns. There’s probably a record of whether the listing itself was flagged, plus a categorization of why it was flagged, and for this scenario customer service is probably labeling them as guns or firearms within that category.
So it’s essential to have a model that’s accurate at identifying this small minority of gun postings, and we don’t have many concerns about model training time. With that in mind, and with this large labeled data set already available, the next things to work through are data collection and augmentation, the features, and the model.
I have the flags that users might have given. I have the particular user who posted the listing, their demographics and location, and whether that part of the country tends to have more gun listings or not.
The most significant part here is the text in that data and our ability to leverage it to pull out keywords or patterns of words. So we have user data, flags, contextual knowledge, and text features. Let’s focus on the text and what we can do with it for our model. You want to start with a simpler approach, and that simpler approach might be something like a bag of words.
Let’s assume that we have all the data that Facebook itself has.
We have some baseline data on how their current process is working, and we can have this other baseline where we use the most straightforward approach for text analysis. Technically, we could use more complicated models like attention-based transformers that consider contextual information. But I’ll focus on the simpler model and talk through that. If we have the text data, we can extract the bag of words, which means that we get each body of text, the unique words in the text, and the counts of those words.
The other thing we can do is an approach called TF-IDF, where we scale the value of each word based on how often it appears across different postings. The reason this is valuable is that you might expect gun postings, for instance, to use specific terms not found in other postings, so I’d be helping my model by up-weighting words that are unique to particular listings. The bag of words can be extensive, a vast sparse matrix, so sometimes you want additional reductions, and you might do something like a PCA to bring it down to, say, 500 dimensions.
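As a rough sketch of that text pipeline, assuming scikit-learn and a tiny invented corpus (the listings below are made up, and TruncatedSVD stands in for PCA because it handles the sparse TF-IDF matrix directly), it might look like this:

```python
# Sketch: bag of words / TF-IDF features with a dimensionality reduction step.
# `listing_texts` is an invented stand-in for real marketplace descriptions.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

listing_texts = [
    "Hand-painted ceramic plant pot, pickup only",
    "9mm pistol, lightly used, cash only",          # toy example of a prohibited listing
    "Vintage wooden bookshelf, great condition",
]

# Plain bag of words: the unique tokens in each listing and their counts.
counts = CountVectorizer().fit_transform(listing_texts)

# TF-IDF: down-weights terms common to many postings, up-weights listing-specific terms.
tfidf = TfidfVectorizer(stop_words="english")
X_sparse = tfidf.fit_transform(listing_texts)

# The TF-IDF matrix is large and sparse; project it down to a few hundred dimensions
# (500 in the discussion above -- 2 here only because the toy corpus is tiny).
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_sparse)
print(X_reduced.shape)  # (number of listings, number of components)
```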
The point of this process is that you’re taking the text and putting it into some embedding space, and the idea is that texts that mean similar things end up as nearby points in that space. We can always substitute other, more sophisticated methods later. Now, the model we want to build has to handle the imbalanced sample. For our prototype, we could start with a tree-based model, particularly a gradient-boosted tree, because what’s nice about these models is that the trees make their predictions in sequence.
In this case, each tree takes all our features and predicts whether a listing is a gun posting or not. The algorithm then takes the data points with the most error and scales them up, weighting those misclassified points more heavily for the next tree. In effect, it up-weights the minority sample: the gun listings are very few, and if they keep causing errors, their weight keeps going up and they become more and more important to getting the prediction right. That’s why a gradient-boosted tree would be a good start. The only issue is if you want online training; gradient-boosted trees may not be optimal for that, and we could try other models if we wanted.
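A minimal prototype along those lines, sketched with scikit-learn and two hypothetical loading helpers (`load_listing_texts` and `load_labels` are my placeholders, not anything from the interview), might be:

```python
# Sketch: TF-IDF text features feeding a gradient-boosted tree classifier.
# load_listing_texts() and load_labels() are hypothetical stand-ins for the labeled data.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

texts = load_listing_texts()    # list of listing descriptions
is_gun = load_labels()          # 0/1 labels from past customer-service reviews

gun_detector = make_pipeline(
    TfidfVectorizer(stop_words="english"),             # bag of words with TF-IDF weighting
    TruncatedSVD(n_components=500, random_state=0),    # shrink the sparse matrix
    HistGradientBoostingClassifier(random_state=0),    # gradient-boosted trees
)

X_train, X_test, y_train, y_test = train_test_split(
    texts, is_gun, test_size=0.2, stratify=is_gun, random_state=0
)
gun_detector.fit(X_train, y_train)
print("F1:", f1_score(y_test, gun_detector.predict(X_test)))
```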
The difference between online and offline training is that online training happens while the model is deployed and continually improves, is that right?
Exactly, and in this case we probably would want to update the model fairly often. Gradient-boosted trees are quick to train and fast at delivering predictions, so we could simply retrain the whole model on a schedule. But if, for whatever reason, you wanted to fold in an update every time customer service labels a new listing, then this tree-based approach may not be optimal, depending on your package, and you might want to use other methods, like a neural network, that allow for true online training.
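As one hedged illustration of that online-training alternative, scikit-learn’s MLPClassifier (a small neural network) exposes partial_fit, so fresh customer-service labels could be folded in without retraining from scratch; the HashingVectorizer choice and the variable names here are my own assumptions rather than anything specified in the interview:

```python
# Sketch: an online-updatable alternative using a small neural network.
# HashingVectorizer needs no fitted vocabulary, so new text can be transformed on the fly.
# historical_texts / historical_labels and newly_reviewed_* are hypothetical variables.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neural_network import MLPClassifier

vectorizer = HashingVectorizer(n_features=2**16, stop_words="english")
online_model = MLPClassifier(hidden_layer_sizes=(64,), random_state=0)

# Initial pass over the historical labeled data.
X_hist = vectorizer.transform(historical_texts)
online_model.partial_fit(X_hist, historical_labels, classes=[0, 1])

# Later: fold in a fresh batch of customer-service labels without a full retrain.
X_new = vectorizer.transform(newly_reviewed_texts)
online_model.partial_fit(X_new, newly_reviewed_labels)
```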
There are very few gun posts, and one thing I could have mentioned earlier is that one way to deal with that is to balance the sample. If we had a lot of data, we could take however many gun postings we have and pull a matched sample of non-gun postings of the same size. But say there aren’t enough gun postings to make that matching work; then plain accuracy won’t tell us much, so, as I said, we can use precision and recall and combine them in the F1 score, which ranges between zero and one and tells us how well the model is doing at catching those gun sales. When building the model, we’d train on our historical data: take a sample of data up to some point in time as the training set, and use a later period as the test set. That mimics how things occur in production, where the model is trained on a given collection of data up to a point in time and then has to predict on new data in the future. We might also want to think about how long a given model stays good for and how often we want to rebuild it, because people on the internet get increasingly creative at slipping past these systems, so we’d want to keep updating the model to deal with them.
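One way to set up that time-based evaluation, sketched against a hypothetical pandas DataFrame with `created_at`, `text`, and `is_gun` columns (my naming, not the interview’s), reusing the `gun_detector` pipeline from the earlier sketch:

```python
# Sketch: time-based train/test split to mimic how the model is used in production.
# `listings` is a hypothetical DataFrame with created_at, text, and is_gun columns.
import pandas as pd
from sklearn.metrics import f1_score

cutoff = pd.Timestamp("2021-06-01")                  # arbitrary illustrative date
train = listings[listings["created_at"] < cutoff]    # older listings -> training set
test = listings[listings["created_at"] >= cutoff]    # newer listings -> test set

gun_detector.fit(train["text"], train["is_gun"])
predictions = gun_detector.predict(test["text"])
print("F1 on future listings:", f1_score(test["is_gun"], predictions))
```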
Yeah. That’s because as the malicious actors who are actively trying to sell guns on the internet realize their posts are getting flagged, they start creatively disguising them, and the traditional NLP task of spotting “bullets” or “guns for sale” turns into a task of spotting code names. So we still have to rely on that ongoing identification and manual tagging.
Another question I have is: how do we know whether building a more advanced approach would pay off? Let’s say we want to dive into computer vision. How would you assess the need for using images in your analysis versus only using text, given that image models are probably harder to train?
There’s more complexity and expertise required for images than for text, which has excellent Python packages. How would you approach that situation? How would you know it’s worth folding image analysis into the features versus just going with the basic model?
Yeah. So I guess the question is: what’s the added value of the images, and is it worth bringing all of that in?
Yes. There’s a straightforward approach you could use: build the model with all the features in there and get its predictions. What’s the complete model’s prediction accuracy? Say, for instance, the whole model is at 90%. Then you drop the images from the model and see that without them the accuracy is 85%. Then you do it again, but this time you remove the text data, and now it’s at 60% accuracy. From that comparison, you can tell what each set of features is worth.
So the text is clearly valuable, and the images contribute something too; when you drop the images, accuracy does drop, but is that a significant drop? You could simulate it by randomly sampling the data, especially because we have enough data: randomly sample, retrain the model each time with and without the images, and record the drop in accuracy when the images are removed. Then I could say that, for example, 95% of the time the reduction in accuracy is greater than zero. In that case I might say, yeah, the images are important, because almost every time I try it across these simulated samples, there is that drop in accuracy.
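A hedged sketch of that resampling idea, where `build_model()`, `X_all`, `X_no_images`, and `y` are placeholders for whatever the real pipeline produces (accuracy is used here to match the discussion, though F1 could be swapped in given the imbalance):

```python
# Sketch: estimate how often dropping the image features actually costs accuracy.
# build_model(), X_all, X_no_images, and y are hypothetical placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

drops = []
for run in range(100):  # repeated random resampling runs
    tr, te = train_test_split(np.arange(len(y)), test_size=0.2, random_state=run)

    full = build_model().fit(X_all[tr], y[tr])            # all features, images included
    no_img = build_model().fit(X_no_images[tr], y[tr])    # same model minus image features

    drops.append(
        accuracy_score(y[te], full.predict(X_all[te]))
        - accuracy_score(y[te], no_img.predict(X_no_images[te]))
    )

# If the drop is positive in roughly 95% of runs, the image features are earning their keep.
print("share of runs where images helped:", np.mean(np.array(drops) > 0))
```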
Yeah. That makes sense. The final question is just more about the question itself. What do you think about this question? How well do you think it assesses a candidate’s overall performance? How do you feel your answer would fit into a broader Facebook interview?
I’m not sure about the broader Facebook interview, but this question looks pretty standard; it’s very machine-learning-focused. It tests your knowledge of minority classes, where the thing you’re predicting is very rare, and it has you build the model end to end. To me, it seems like a reasonably common problem you might face in practice: needing to identify something from a particular listing or a particular post.
Yeah, I liked your answer and how you structured everything. I feel that’s an excellent approach for most machine learning questions, since most of them have a very defined beginning, middle, and end: where the data comes from, how you build the model, and then how you would deploy and evaluate it. Focusing on that structure is critical, so I think you did a great job there.