Classification has numerous applications in data science and machine learning, particularly in relation to churn prediction, recommendation engines, sentiment analysis, loan approval, and anomaly detection.
Therefore, if you want to land a data science job, you need to understand how classification works, the math behind it, and the best use cases for this technique.
Besides reading up on the subject, you can build your expertise by exploring hands-on projects. Projects give you practice with classification algorithms, expose you to real-world use cases, and, ultimately, help you build domain expertise.
Want to try a classification project? We’ve highlighted some of the best datasets for classification and machine learning projects (although you might prefer to scrape your own and create an original data source). You’ll also find links to tutorials and pre-set projects for these data sources.
Practice using classification algorithms, like random forests and decision trees, with these datasets and project ideas. Most of these projects focus on binary classification, but there are a few multiclass problems as well. You’ll also find links to tutorials and source code for additional guidance.
Also, check out our list of great beginner data science projects for more ideas.
Build a classifier that predicts whether a building permit’s ‘type’ is ELECTRICAL or not. Note that there are many different types of permits, but we are only interested in ELECTRICAL.
You are free to use any algorithm(s). You are also free to use all or a subset of the information included. Use your classifier to predict whether each building permit contained in xtest_data.csv is ELECTRICAL or not.
How to Do the Project: Please write in Python and comment appropriately to justify your feature selection, model choice(s), validation, etc.
Ideally, your code and comments should tell a complete story, but you are also welcome to write a short note that highlights your work on this project.
For example, feel free to identify something you found particularly challenging and how you tackled it or what avenues you might pursue given substantially more time.
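As a starting point for the permit task, here is a minimal sketch of turning the multiclass ‘type’ column into a binary ELECTRICAL target and measuring a majority-class baseline. The rows and the `estimated_cost` column are made up for illustration; the real permit file will have its own columns.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy rows; the real permit data will look different.
permits = pd.DataFrame({
    "type": ["ELECTRICAL", "PLUMBING", "ELECTRICAL", "ROOFING",
             "MECHANICAL", "ELECTRICAL", "PLUMBING", "ROOFING",
             "ELECTRICAL", "MECHANICAL", "PLUMBING", "ROOFING"],
    "estimated_cost": [1200, 800, 950, 15000, 4000, 700,
                       600, 12000, 1100, 3500, 900, 9000],
})

# Collapse the many permit types into a binary target: ELECTRICAL or not.
permits["is_electrical"] = (permits["type"] == "ELECTRICAL").astype(int)

X = permits[["estimated_cost"]]
y = permits["is_electrical"]

# Stratify so both splits keep a similar share of ELECTRICAL permits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Majority-class baseline: always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = baseline.score(X_val, y_val)

print(f"Share of ELECTRICAL permits: {y.mean():.2f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Because ELECTRICAL is only one of many permit types, the classes are likely to be imbalanced, so any model you build should be compared against a baseline like this rather than judged on raw accuracy alone.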
One of the best sources for classification datasets is the UCI Machine Learning Repository. The Mushroom dataset is a classic, the perfect data source for logistic regression, decision tree, or random forest classification practice. Many of the UCI datasets have extensive tutorials, making this a great source for beginner classification projects. A few specific UCI datasets to consider include the Wine Quality dataset and Iris classification data.
How to Do the Project: Check out this tutorial for an overview of using several algorithms to classify mushrooms, including KNN, decision tree, random forest, and support vector machine classifiers.
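To get a feel for comparing these four classifiers, the sketch below uses the Iris data bundled with scikit-learn as a stand-in, since it loads without a download. The hyperparameters are illustrative defaults, not tuned choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Distance-based models (KNN, SVM) get a scaler; tree models do not need one.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

The Mushroom features are categorical, so they would need to be encoded (for example, one-hot encoded) before the same loop could be applied to them.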
Want to learn image classification? Take a look at the MNIST dataset, which features tens of thousands of images of handwritten digits. MNIST includes a training set of 60,000 images, as well as a test set of 10,000 examples. This is one of the best datasets to practice image classification, and it’s perfect for a beginner.
How to Do the Project: Practice computer vision concepts and build a simple digit recognizer. Here’s a helpful tutorial to get started.
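A simple digit recognizer might start like the sketch below. It assumes scikit-learn's small bundled digits set (8x8 images) as a lightweight stand-in for MNIST, and a plain logistic regression rather than a neural network.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 8x8 grayscale digit images bundled with scikit-learn (not the full MNIST).
digits = load_digits()
X = digits.data / 16.0  # pixel intensities range from 0 to 16
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Multiclass logistic regression over the flattened pixel values.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Once this works, swapping in the real MNIST images and a convolutional network is a natural next step.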
The Titanic Machine Learning Competition is one of the most popular data science competitions on Kaggle. It’s perfect for building expertise with classification algorithms, like K-nearest neighbor and random forest. The competition’s aim is simple: predict who survived the sinking of the Titanic using machine learning.
How to Do the Project: See this tutorial on Medium, which will help you apply seven classification algorithms to the Titanic dataset. Which algorithm do you think has the best accuracy?
Classification is widely used for loan prediction. If you’re interested in fintech jobs, you should absolutely have experience building loan prediction models. A great dataset to start with is the Loan prediction dataset on Kaggle, which you can use to build a yes/no loan approval model. Another finance dataset to check out is this bank marketing dataset. Use the data to determine whether a client will make a deposit.
How to Do the Project: This Kaggle notebook offers a solid explanation of using logistic regression, support vector machine, or decision tree classifiers for loan approval.
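To make the yes/no decision concrete, here is a minimal sketch of how a fitted logistic regression turns applicant features into an approval. The coefficients and the two features are hypothetical, chosen only for illustration; a real model would learn them from the loan data.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, not learned from any real dataset.
INTERCEPT = -4.0
COEF_INCOME_THOUSANDS = 0.05
COEF_GOOD_CREDIT_HISTORY = 2.5


def approval_probability(income_thousands: float, good_credit_history: int) -> float:
    """Logistic regression: sigmoid of a weighted sum of the features."""
    z = (INTERCEPT
         + COEF_INCOME_THOUSANDS * income_thousands
         + COEF_GOOD_CREDIT_HISTORY * good_credit_history)
    return sigmoid(z)


def approve(income_thousands: float, good_credit_history: int,
            threshold: float = 0.5) -> bool:
    """Turn the probability into a yes/no decision at the given threshold."""
    return approval_probability(income_thousands, good_credit_history) >= threshold


print(approve(60, 1))  # applicant with a good credit history
print(approve(60, 0))  # same income, no good credit history
```

The threshold is a business decision: raising it above 0.5 approves fewer loans but reduces the share of approvals that default.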
These projects test more intermediate classification skills, like using convolutional neural networks (CNNs). Any of these datasets and project ideas are great for those who have experience working with machine learning.
As a company, we have to buy our books ahead of time. We took out a loan last month to buy our original batch of books. The value of the loan was the total cost of all the books that we bought.
We made some money back through customers buying our books last month. Next month, we know which books we will be sending to which customers, but we do not know who will buy what books.
The question for you to answer is: Will we be able to both pay back our loan and afford our next book purchase order?
You should create some sort of machine learning model for answering this take-home (as opposed to simply looking at the average conversion rate or something like that).
However, we do not expect you to build models from scratch. NumPy, SciPy, scikit-learn, and everything else are all fair game.
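One way to connect a classifier to the question is to treat each predicted purchase probability as an expected sale. The sketch below assumes a model has already produced those probabilities; every number in it is made up, and purchases are assumed to be independent.

```python
import math

# Hypothetical figures; the real values come from the take-home data.
cash_on_hand = 120.0     # money made back from last month's sales
loan_balance = 100.0     # what is owed on the original loan
next_order_cost = 60.0   # cost of next month's book purchase order

# (book price, predicted probability that the customer buys it)
shipments = [(20.0, 0.6), (15.0, 0.3), (30.0, 0.5), (25.0, 0.8), (10.0, 0.2)]

# Expected revenue: sum of price times purchase probability.
expected_revenue = sum(price * p for price, p in shipments)

# Standard deviation of revenue, assuming independent purchase decisions.
revenue_std = math.sqrt(sum(price ** 2 * p * (1 - p) for price, p in shipments))

obligations = loan_balance + next_order_cost
can_cover = cash_on_hand + expected_revenue >= obligations
# A more cautious check that subtracts one standard deviation.
can_cover_cautious = cash_on_hand + expected_revenue - revenue_std >= obligations

print(f"Expected revenue: {expected_revenue:.2f} (std {revenue_std:.2f})")
print(f"Covers obligations on average: {can_cover}")
print(f"Covers obligations with a one-std margin: {can_cover_cautious}")
```

Reporting both checks makes the answer more honest than a single yes or no, because it shows how sensitive the conclusion is to uncertainty in the predictions.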
Health informatics is a fast-growing field in data science, and there’s a wide range of applications of machine learning in healthcare. This Python project uses the IDC (Invasive Ductal Carcinoma) dataset and asks you to build a model to predict IDC breast cancer. You could also work on a similar project using the UCI Breast Cancer dataset.
How to Do the Project: This tutorial walks you through using Python - along with the Keras library - to build a convolutional neural network.
You might also want to check out the ImageNet dataset, a great source for CNN projects, or this tutorial for building a CNN with Python.
One use case for classification is building prediction models, specifically related to marketing and conversions. The challenge for projects like these is finding reliable data sources.
One option is this Clicks and Conversion Tracking dataset on Kaggle, which features the social media marketing performance of an anonymous brand. If you’re looking for another source, check out this conversion rate dataset on GitHub.
How to Do the Project: There are numerous models you can create to predict conversions, but here’s a helpful tutorial that examines using decision trees to predict conversion rate.
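A decision tree is a good first model here because its learned rules can be printed and read. The sketch below uses synthetic data with two made-up features (`clicks` and `age`) and a made-up conversion rule, so the numbers say nothing about real campaigns.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 500

# Synthetic features; both names are assumptions for illustration.
clicks = rng.integers(0, 20, size=n)
age = rng.integers(18, 65, size=n)

# Made-up ground truth: engaged, younger visitors convert.
converted = ((clicks >= 8) & (age < 45)).astype(int)

X = np.column_stack([clicks, age])
X_train, X_test, y_train, y_test = train_test_split(
    X, converted, test_size=0.25, random_state=0
)

# A shallow tree keeps the learned rules small enough to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

test_accuracy = tree.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
print(export_text(tree, feature_names=["clicks", "age"]))
```

On real conversion data the classes are usually far more imbalanced and noisy, so expect lower scores and consider precision and recall alongside accuracy.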
Building genre classification models will allow you to practice intermediate Python techniques, including K-nearest neighbor and random forest algorithms, as well as the Librosa library.
There are numerous datasets you can use. While the Million Song Dataset is one of the best, there are also music datasets on data.world.
How to Do the Project: Here’s a helpful tutorial that looks at using content-based filters for music genre classification.
The RAVDESS dataset features 7,000+ files, in which actors express various emotions while speaking. In terms of building speech emotion recognition models, this dataset is one of the most comprehensive out there.
You might also want to check out data sources like the LSSED: A Large-Scale Dataset and Benchmark for Speech Emotion Recognition or see this list of emotion recognition datasets.
How to Do the Project: This tutorial walks you through using a convolutional neural network to examine RAVDESS data.
Practice advanced machine learning skills with these datasets and project ideas. Most advanced classification problems deal with multiclass classifiers, deep learning, and image classification.
Assume you have been selected to help the Chicago Police Department build the machine learning services which will power their next generation of mobile crime analytics software. This software aims in particular at predicting, in real time, the category of a crime as soon as it is reported by an emergency call (for instance robbery, assault, theft).
This prediction can only be made with information available at the time of the call (such as time and location) without on-the-ground assessment or knowledge of ex-post action (such as arrest, conviction, or demographics of the victim(s) or offender(s)).
Build a model that can predict whether or not the crime is a THEFT (identified in the Primary Type column) given a relevant set of features at your disposal. Please explain your choice of features in light of the use case highlighted above. Use the training data to train the model and discuss its performance on the test data.
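The feature constraint can be sketched as follows: keep only what is known when the call comes in, and leave out anything recorded afterwards. Only the Primary Type column is named in the brief; the other column names and all rows below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; column names other than "Primary Type" are assumed.
crimes = pd.DataFrame({
    "Date": ["2019-03-01 23:15:00", "2019-03-02 08:30:00", "2019-03-02 14:05:00"],
    "Latitude": [41.88, 41.75, 41.90],
    "Longitude": [-87.63, -87.60, -87.65],
    "Primary Type": ["THEFT", "ASSAULT", "THEFT"],
    "Arrest": [False, True, False],  # known only after the fact: do not use
})

timestamps = pd.to_datetime(crimes["Date"])

# Features available at the time of the call: when and where.
features = pd.DataFrame({
    "hour": timestamps.dt.hour,
    "day_of_week": timestamps.dt.dayofweek,  # Monday is 0
    "is_weekend": (timestamps.dt.dayofweek >= 5).astype(int),
    "latitude": crimes["Latitude"],
    "longitude": crimes["Longitude"],
})

# Binary target: THEFT versus every other crime category.
target = (crimes["Primary Type"] == "THEFT").astype(int)

print(features)
print(target.tolist())
```

Leaving out columns such as the arrest flag avoids leakage: a model trained on them would look accurate on historical data but could not be used at call time.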
The questions for you to answer in this take-home are:
Can you predict whether a crime is a THEFT, and what is the accuracy of your model?
You’ll find a variety of text datasets available online, and many of these are great launching points for a text classification project. Text classification, however, can be tricky, so here are a few specific datasets we thought would be particularly helpful.
The Hate Speech and Offensive Language dataset on Kaggle is a great source for a Python natural language processing problem looking at multiclass classification – determining whether the text is offensive, hate speech, or neither.
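A common baseline for multiclass text classification is TF-IDF features fed into a linear model. The sketch below uses a few made-up, neutral sentences with three topic labels as a stand-in for a three-way labeling such as offensive, hate speech, or neither; the pipeline itself would stay the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy sentences; a real project would load the labeled dataset.
texts = [
    "The striker scored a late goal in the final",
    "The team won the match after extra time",
    "The coach praised the goalkeeper after the game",
    "Simmer the sauce and add fresh basil",
    "Bake the bread until the crust is golden",
    "Whisk the eggs and fold in the flour",
    "The new laptop ships with a faster processor",
    "Update the software to patch the security bug",
    "The phone battery lasts longer after the update",
]
labels = ["sports"] * 3 + ["cooking"] * 3 + ["tech"] * 3

# TF-IDF turns each text into a weighted bag of words;
# logistic regression then learns one set of weights per class.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

new_texts = ["Add basil to the sauce and bake the bread"]
predictions = model.predict(new_texts)
probabilities = model.predict_proba(new_texts)

print(predictions[0])
print(dict(zip(model.classes_, probabilities[0].round(3))))
```

With real data, hold out a test set and look at per-class precision and recall, since classes in text datasets are often imbalanced.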
You might also consider Fake News detection. A team from UC Berkeley built a multiclass classifier to determine if news articles were fake news, clickbait, or neither. You can read about that project here.
How to Do the Project: You can scrape your own text data source. Twitter is a great source for projects like this; see this example, which determined the most hated player in the NFL.
Here’s a great Python project for text classification: What emotion is being conveyed? These problems are difficult, often because there aren’t many reliable labeled datasets and because the task is multiclass.
This Kaggle text emotion dataset is perfect for the problem. But you’ll also find others, like this Rotten Tomatoes sentiment analysis from Stanford, which features a rating from 1 to 25 for movies, or the Sentiment140 dataset, with data on brand sentiment from Twitter.
How to Do the Project: This end-to-end tutorial from The Clever Programmer walks you through data preparation, tokenization, and creating a list of emotional words to classify the text.
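The word-list approach can be sketched in a few lines of plain Python. The tiny lexicon below is hypothetical and far too small for real use; it is only meant to show the tokenize, look up, and count steps, not to reproduce the tutorial's code.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon mapping words to emotions.
EMOTION_WORDS = {
    "happy": "joy", "delighted": "joy", "love": "joy",
    "sad": "sadness", "miserable": "sadness", "cry": "sadness",
    "angry": "anger", "furious": "anger", "hate": "anger",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def classify_emotion(text: str) -> str:
    """Return the emotion whose lexicon words appear most often."""
    counts = Counter(
        EMOTION_WORDS[token] for token in tokenize(text) if token in EMOTION_WORDS
    )
    if not counts:
        return "neutral"  # no emotional words found
    return counts.most_common(1)[0][0]


print(classify_emotion("I am so happy and delighted today!"))
print(classify_emotion("The report is due on Monday."))
```

A lexicon like this ignores negation and context ("not happy" still counts as joy), which is one reason trained classifiers usually outperform it.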
One of the best advanced machine learning projects, gesture recognition is one of the most challenging problems in computer vision. There are two types: static and real-time dynamic gesture recognition. Check out this project, Sign Language and Static Gesture Recognition using scikit-learn, on GitHub, which features a great dataset of ASL images. Be sure to check out this guide to sign language recognition with CNNs.
How to Do the Project: The hardest part will be finding good data, although you also have the option to create your own dataset. This YouTube tutorial shows you how to capture images with OpenCV and label them with LabelImg. Ultimately, it walks you through real-time gesture recognition with TensorFlow.
Object detection and image segmentation are two complex data science problems. One of the best datasets for exploring object detection is the Common Objects in Context dataset, a large-scale dataset featuring numerous images in context.
How to Do the Project: If you want to give it a shot, check out this guide on image segmentation with COCO. The tutorial will show you how to manipulate images with the Python pycocotools library (the COCO API), as well as how to use Keras to process and classify images.
Interview Query offers a variety of ways to learn and practice classification. Check out these project idea lists from Interview Query: