Landing a machine learning job requires a strong portfolio, and one of the best ways to build your portfolio is by doing machine learning projects.
A machine learning project will help you practice skills and show interviewers you can use those skills. Ultimately, a candidate with a strong machine learning portfolio has a better chance at open jobs than one without. However, starting a project to practice machine learning skills from scratch can be difficult, especially if you don’t have a dataset or a well-constructed problem statement.
To provide guidance, we highlighted some of the best datasets and ideas for machine learning projects, including:
This list provides ideas and inspiration for starting a machine learning project from scratch, including available source code, project examples, and links to helpful datasets.
Beginner machine learning projects will help you practice and build competency in core machine learning skills. Beginner projects include basic recommendation engines, prediction models, and entry-level computer vision projects. These are some of the top machine-learning projects for beginners:
An activity is defined as a series of actions in a game that complete a word (nonsensical or otherwise; a full word or simply a letter) as requested by the game. If the child abandons play or never completes the action as the game expects, that is not recorded.
Questions to answer in this takehome:
What do you do when you are missing critical information or need to know more about a specific topic? Today’s most straightforward answer to that question is “Google it!” Millions of people use Google every day, in billions of searches, to find information about a wide variety of topics.
One exciting project idea is to use Pytrends, an unofficial Python API for Google Trends, to analyze what people around the world are searching for. Pytrends can help you obtain different information about what people use Google for. For example, you can find search statistics about a specific topic or trending searches, and categorize that information by time, region, and keywords.
Machine learning is an umbrella term covering multiple subfields, including computer vision. Computer vision has quite a bit of active research as the potential for the automatic interpretation of visual inputs is massive. Because of that potential and research, it attracts a lot of attention from users.
If you’re new to machine learning in general or to computer vision, an excellent place to start is using the MNIST dataset to build a digit recognizer. Building this project will help you get familiar with the basics of computer vision and neural networks.
Categorizing machine learning projects into simple or complex ones is a challenging task. The project’s complexity often depends on how you choose to implement it rather than the project itself. One great example of that is recommendation systems.
At first, you would assume that building a recommendation system is an intermediate or advanced project. But with experience, you can create simple and straightforward code to implement your recommendation engine.
For example, you can use the rich, rare dataset to implement a simple recommendation system using a user-user similarity matrix that recommends items that similar users like.
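As a rough illustration of the user-user approach described above, here is a minimal sketch in plain Python. The ratings are made up for the example (they do not come from any real dataset), and the function names `cosine_similarity`, `similarity_matrix`, and `recommend` are hypothetical. Cosine similarity is one common choice of similarity measure; other measures would work too.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}. Not from a real dataset.
ratings = {
    "ana":   {"book": 5, "lamp": 3, "mug": 4},
    "ben":   {"book": 4, "lamp": 3, "mug": 5, "pen": 5},
    "carla": {"lamp": 5, "desk": 4},
}

def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def similarity_matrix(ratings):
    """User-user similarity stored as a nested dict."""
    users = list(ratings)
    return {
        u: {v: cosine_similarity(ratings[u], ratings[v]) for v in users if v != u}
        for u in users
    }

def recommend(user, ratings, top_n=3):
    """Score items the user has not rated by similarity-weighted ratings."""
    sims = similarity_matrix(ratings)[user]
    scores = {}
    for other, sim in sims.items():
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("ana", ratings))  # items liked by the users most similar to ana
```

On a real dataset you would typically build the matrix once with a numerical library rather than recomputing it per request.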
If you enjoy sports and staying physically active, one project that might interest you is recognizing different human activities using the smartphone dataset.
This dataset contains the fitness activity recordings of 30 people captured through smartphone-enabled inertial sensors. This project aims to use machine learning algorithms to accurately classify the different fitness activities. Mainly, you will need to implement a multiclass classification algorithm and work on your data visualization and analysis skills.
How can you tell whether a brand or bottle of wine is worth your money? You probably need to know a bit about the grape type, the wine age, and its price. Or, you could build a prediction model to determine if a bottle is a quality product or not.
The Wine Quality Data Set is a classic in the UCI Machine Learning Repository (a go-to source for machine learning datasets). Using the wine dataset, you can build a prediction model and gain hands-on experience with data visualization, regression models, and more.
Follow this tutorial for predicting wine quality in Scikit-learn.
Another option: Check out the Red Wine Quality dataset on Kaggle for project inspiration.
If you’re interested in machine learning’s application in health and wellness, you should try this project and build a breast cancer prediction model. Using data from the Breast Cancer Wisconsin Diagnostic Dataset, you can follow this tutorial for building a simple classification-based model for predicting cancer, which walks you through every step, including importing data, data exploration, feature selection, and model selection.
The Iris Flower dataset from UCI is one of the most well-known pattern recognition databases. Many beginners use it to build classification models that determine the species of an Iris from four measurements: sepal length, sepal width, petal length, and petal width.
This is a great beginner machine learning project because there are so many tutorials to get you started.
For example, this step-by-step tutorial walks you through the entire project, from setting up the environment and loading the data to comparing models like logistic regression and support vector machines.
This project provides a great introduction to model building and has also been asked in interviews for data science positions at Twitter and Walmart. (Try the logistic regression from scratch interview question.)
Although you’d likely use Scikit-learn’s logistic regression function in production models, this project gives beginners an in-depth look at the math and provides ideas for developing custom extensions.
See this helpful tutorial for building a logistic regression model on simulated data.
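To make the "from scratch" idea concrete, here is a minimal sketch of logistic regression trained with batch gradient descent on simulated data, using only the standard library. The function names, learning rate, epoch count, and the two-cluster data are all assumptions chosen for illustration; this is not the method from the linked tutorial.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=200):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Gradient of the log-loss w.r.t. the linear score is (p - y)
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b):
    """Label 1 when the predicted probability is at least 0.5."""
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
        for xi in X
    ]

# Simulated data: two Gaussian clusters centered at (-2, -2) and (2, 2)
rng = random.Random(0)
X = [[rng.gauss(c, 1.0), rng.gauss(c, 1.0)] for c in (-2.0, 2.0) for _ in range(100)]
y = [0] * 100 + [1] * 100

w, b = fit_logistic(X, y)
accuracy = sum(p == t for p, t in zip(predict(X, w, b), y)) / len(y)
print(f"training accuracy: {accuracy:.2f}")
```

A natural next step is to compare the learned weights against Scikit-learn’s implementation on the same data.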
Recommendation engines provide hands-on experience with machine learning tools and techniques. For this beginner project, use data from the MovieLens Dataset, which features more than 25 million movie ratings from about 162,000 users.
Follow this movie recommendation system tutorial to see how to build content and collaborative filtering for the engine. You can keep it simple and use a few columns from the dataset to make the system, e.g., genre and release date.
At the intermediate level, machine learning projects dive into more advanced techniques like text mining, text summarization, image recognition, and natural language processing (NLP). Some intermediate machine learning project ideas include:
Purpose
We would like you to use a Jupyter (Python) notebook to work with a slice of this data. You’ll get a sense of the type of questions that we deal with at States Title, and we’ll get a sense of your data science approach.
How you can do it:
Write Python code that allows you to stand up a nationwide title insurance company:
default_notices.csv, train_property_data.csv, and test_property_data.csv, described below.
test_property_data.csv file, which represents your prediction of the overall title risk for the property. This column should behave in such a way that properties with lower risk are predicted to be more profitable than properties with higher risk.
Summarizing a text shortens its body while maintaining its message and meaning. You can build an abstractive text summarizer that uses advanced natural language processing techniques to generate a new, shorter version that conveys the same information. You can build this project using pandas, NumPy, and NLTK in addition to an unsupervised learning algorithm for word representation.
Text mining is the process of structuring and extracting useful information from unstructured text, which by an often-cited estimate makes up about 80% of all data. When we mine text, we effectively transform it into a structured format, facilitating the identification of key patterns and relationships within datasets. If you want to dip your toes into some natural language processing, you can use these datasets to implement multi-level classification or to evaluate the performance of multi-label algorithms.
Music is a big part of everyone’s daily life. Often, people have different tastes in the music they listen to while they work, exercise, or just relax. One exciting project that you can build is a music genre classifier. The idea is to use one or more machine learning algorithms (such as multiclass support vector machines, K-means clustering, or convolutional neural networks) to automatically classify different musical genres from audio. Often this classification is done by filtering audio files using their low-level frequency and time-domain features.
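To give a feel for what "low-level time-domain features" can look like, here is a small sketch that computes two common ones, zero-crossing rate and RMS energy, on synthetic sine waves. The helper names and the synthetic signals are assumptions for illustration; a real genre classifier would extract features like these (plus frequency-domain ones) from actual audio files.

```python
import math

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(signal, signal[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(signal) - 1)

def rms_energy(signal):
    """Root-mean-square amplitude of the signal."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))

def sine_wave(freq_hz, sample_rate=8000, seconds=1.0, amplitude=1.0):
    """Synthetic stand-in for an audio clip (hypothetical test signal)."""
    n = int(sample_rate * seconds)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
        for t in range(n)
    ]

low = sine_wave(110)    # low-pitched tone
high = sine_wave(1760)  # high-pitched tone

# Higher-pitched signals cross zero more often; both have similar energy.
print(zero_crossing_rate(low), zero_crossing_rate(high))
print(rms_energy(low), rms_energy(high))
```

Feature vectors like `[zero_crossing_rate, rms_energy, ...]` per clip are what you would feed to a classifier such as a support vector machine.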
An idea that has long intrigued researchers and companies is the automatic recognition of handwritten characters. This project aims to model a neural network to detect and recognize handwritten characters. To implement this project, you can use the A-Z handwritten alphabet dataset along with Keras, TensorFlow, and pandas.
Fraud detection is an intermediate machine learning skill, and this project will help you prepare for fraud analytics and security roles. Follow this tutorial for using Scikit-learn to investigate fraud on the Enron emails dataset. The dataset features 500,000 messages from 150 former Enron employees, many of whom were high-level executives. With the tutorial, you will build a model to predict persons of interest from the available data.
This project will allow you to build a neural network model to predict stock prices. It is an intermediate machine learning project because it requires knowledge of neural networks and solid Python skills. You can pull data from Yahoo! Finance or use the historical NASDAQ dataset on Kaggle to practice with various Python packages, including NumPy, pandas, Matplotlib, and Keras.
Predicting churn is a skill that’s useful in a variety of industries, including e-commerce, media, and finance. Fortunately, there are a variety of churn prediction datasets you can use to practice the skill. In this tutorial, you’ll learn how to use Python, pandas, Scikit-learn, Recency, Frequency and Monetary value (RFM) analysis, and SMOTE to predict churn using this retail dataset on Kaggle. The data features more than 60,000 transactions.
After processing the data, you’ll use RFM analysis to qualify customers and predict how much a customer will spend in a year.
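As a sketch of the RFM step, the snippet below computes recency (days since last purchase), frequency (number of purchases), and monetary value (total spend) per customer. The transactions, the snapshot date, and the `rfm_table` helper are hypothetical and are not taken from the Kaggle retail dataset or the tutorial.

```python
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount)
transactions = [
    ("c1", date(2023, 1, 5), 40.0),
    ("c1", date(2023, 3, 20), 60.0),
    ("c2", date(2022, 11, 2), 15.0),
    ("c3", date(2023, 3, 28), 200.0),
    ("c3", date(2023, 2, 14), 120.0),
    ("c3", date(2023, 1, 30), 80.0),
]

def rfm_table(transactions, snapshot):
    """Return {customer: (recency_days, frequency, monetary)}."""
    table = {}
    for customer, day, amount in transactions:
        last, freq, total = table.get(customer, (None, 0, 0.0))
        # Track the most recent purchase date seen so far
        last = day if last is None or day > last else last
        table[customer] = (last, freq + 1, total + amount)
    return {
        c: ((snapshot - last).days, freq, total)
        for c, (last, freq, total) in table.items()
    }

rfm = rfm_table(transactions, snapshot=date(2023, 4, 1))
for customer, (recency, frequency, monetary) in sorted(rfm.items()):
    print(customer, recency, frequency, monetary)
```

From here, a common next step is to bin each of the three values into scores and use them as features for the churn model.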
Market Basket Analysis (MBA) is a machine learning technique used in retail. It rests on the idea that customers who buy from one product group are likely to buy related products. For example, if a customer bought baby wipes, there’s a high likelihood the customer would also purchase baby formula. One way to do MBA is to use the Apriori algorithm to identify frequent patterns for association rule mining. You can perform this task using this Kaggle grocery shopper dataset, and this tutorial will walk you through using the Apriori algorithm.
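The snippet below is a compact sketch of the Apriori idea on a handful of made-up baskets: find itemsets whose support clears a threshold, growing them one item at a time, then measure the confidence of a rule such as wipes → formula. The baskets, the threshold, and the function names are assumptions for illustration, not the Kaggle data or the tutorial’s code.

```python
from itertools import combinations

# Hypothetical baskets, for illustration only
baskets = [
    {"wipes", "formula", "diapers"},
    {"wipes", "formula"},
    {"wipes", "bread"},
    {"bread", "milk"},
    {"wipes", "formula", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def apriori(baskets, min_support):
    """Return {frozenset: support} for all frequent itemsets."""
    items = {i for b in baskets for i in b}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        level = {s: support(s, baskets) for s in current}
        level = {s: v for s, v in level.items() if v >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-item candidates, then prune any
        # candidate with an infrequent (k-1)-subset (the Apriori property).
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

frequent = apriori(baskets, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
print("confidence(wipes -> formula):", confidence({"wipes"}, {"formula"}, baskets))
```

On a real dataset you would likely reach for a library implementation, but writing the candidate-generation loop once makes the pruning step much easier to reason about.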
Practice using regression to predict sales in this project. You can follow this Kaggle notebook for an in-depth look at how to perform data cleaning and feature engineering, and ultimately make predictions. Although this tutorial uses a proprietary dataset, numerous open datasets are available, including Black Friday on Kaggle.
Numerous music datasets are available, but one of the most popular is the Million Song Dataset. In this project, you’ll build a recommendation engine that recommends popular songs to users based on their play history. Follow this tutorial to see how to load the data, process it, and build a popularity-based recommendation engine. Ultimately, the engine takes the songs the user has listened to and constructs a co-occurrence matrix, which it uses to score and rank candidate songs.
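Here is a simplified sketch of the co-occurrence idea: count how many users listened to each pair of songs, then rank unheard songs by their average co-occurrence with the songs a user already knows. The listening histories and function names are hypothetical, and the scoring rule is one plausible choice rather than the exact one used in the tutorial.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical listening histories: user -> set of songs
histories = {
    "u1": {"song_a", "song_b", "song_c"},
    "u2": {"song_a", "song_b"},
    "u3": {"song_b", "song_c", "song_d"},
    "u4": {"song_a", "song_d"},
}

def cooccurrence_matrix(histories):
    """co[x][y] = number of users who listened to both x and y."""
    co = defaultdict(lambda: defaultdict(int))
    for songs in histories.values():
        for x, y in permutations(songs, 2):
            co[x][y] += 1
    return co

def recommend_songs(user_songs, co, top_n=2):
    """Rank unheard songs by average co-occurrence with the user's songs."""
    scores = defaultdict(float)
    for heard in user_songs:
        for other, count in co[heard].items():
            if other not in user_songs:
                scores[other] += count / len(user_songs)
    # Sort by score (descending), breaking ties alphabetically
    return sorted(scores, key=lambda s: (-scores[s], s))[:top_n]

co = cooccurrence_matrix(histories)
print(recommend_songs({"song_a", "song_b"}, co))
```

With millions of songs, the full matrix gets large, so practical versions usually restrict it to the songs each user has actually played.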
Advanced machine learning projects dive into the most advanced machine learning skills, including sentiment analysis, deep learning, and computer vision. These are some advanced projects to try:
Assume you have been selected to help the Chicago Police Department build the machine learning services which will power their next generation of mobile crime analytics software. This software aims in particular at predicting, in real time, the category of a crime as soon as it is reported by an emergency call (for instance robbery, assault, theft). This prediction can only be made with information available at the time of the call (such as time and location) without on-the-ground assessment or knowledge of ex-post action (such as arrest, conviction, demographics of the victim(s) or offender(s)).
Build a model that can predict whether or not the crime is a ‘THEFT’ (identified in the Primary Type column) given a relevant set of features at your disposal. Please explain your choice of features in light of the use case highlighted above. Use the training data to train the model and discuss its performance on the test data.
The Questions for you to answer in this take-home are:
The Myers-Briggs Type Indicator is a famous personality test that divides people into 16 different personality types. You will need to answer various questions, which the system then evaluates to determine your personality type. This dataset contains different information about the test that you can then use to evaluate the validity of the test design, analyze its results, and make predictions about the different personality types or categorizations of human behavior.
In most projects, the first step is often obtaining some data to analyze and apply algorithms. For example, you can use the YouTube-Comment-Scraper-Python library to fetch YouTube video comments and then use those to implement various sentiment analyses, hate-speech flaggers, and bot-detection projects. Using this library, you will learn how to implement an automated scraper which will help you focus on exploratory data analysis and feature engineering. Follow this Kaggle notebook to learn how to perform YouTube sentiment analysis.
Mental health is an essential topic of discussion, and the ability to detect and recognize people’s mental health state can help save lives or vastly improve quality of life. If you want to build a project that feels important or if you have struggled with mental health issues, you can use the Twitter dataset (or scrape recent Twitter data) to build a sentiment analysis that recognizes depression cues.
Here is another project for music lovers. This time we are not categorizing music; we are generating it. Many songs today contain elements generated by computers. One approach to generating music is using deep learning or neural networks. If you want to try it, you can experiment with MuseNet or WaveNet, or use a dataset like MAESTRO to classify and develop your own music.
This price prediction machine learning project requires advanced skills and knowledge of bidirectional LSTM neural networks.
Using the LSTM deep neural network, you’ll perform time series predictions in TensorFlow 2 to predict Bitcoin prices. You can pull Bitcoin prices from Yahoo! Finance or the Bitcoin Historical Prices dataset on Kaggle, which features minute-to-minute prices for 2017 to 2020.
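Before any LSTM can be trained, the price series has to be turned into fixed-length input windows with a target value for each. The sketch below shows that preprocessing step in plain Python. The prices are made up, and `make_windows` and `min_max_scale` are hypothetical helper names; the model itself is left out.

```python
def make_windows(series, window, horizon=1):
    """Split a series into (inputs, target) pairs for supervised training."""
    X, y = [], []
    for start in range(len(series) - window - horizon + 1):
        X.append(series[start:start + window])           # past `window` values
        y.append(series[start + window + horizon - 1])   # value `horizon` steps ahead
    return X, y

def min_max_scale(series):
    """Rescale values to the [0, 1] range."""
    lo, hi = min(series), max(series)
    return [(v - lo) / (hi - lo) for v in series]

prices = [100.0, 102.0, 101.0, 105.0, 110.0, 108.0]  # hypothetical prices
scaled = min_max_scale(prices)
X, y = make_windows(scaled, window=3)

for inputs, target in zip(X, y):
    print(inputs, "->", target)
```

In a real project, compute the scaling bounds on the training split only, so that information from the test period does not leak into training.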
Check out this dataset on Kaggle, featuring reviews for 1,000 hotels. The data comes from Datafiniti’s Business Database. To build a sentiment analysis tool, you’ll need to perform advanced web scraping on TripAdvisor. Then, you can follow this tutorial using the Natural Language Toolkit (NLTK) submodule VADER. This project allows you to practice various skills, including web scraping, natural language processing, and sentiment analysis.
This operational analytics project provides practice in several advanced machine learning skills, like LSTM modeling for large time-series models. Follow this tutorial to predict energy consumption for a single household. You can use this Household Energy Consumption dataset from UCI to perform your analysis. However, many energy consumption datasets are available online, including this historical dataset on Kaggle, or you could scrape data from the U.S. Energy Information Administration.
There are numerous datasets you can use for this project, including the Yale Face Database or the AT&T Database of Faces. In this project, you’ll use OpenCV, one of the most popular computer vision libraries, to build a facial recognition tool. You can practice using three main algorithms: Eigenfaces, Fisherfaces, and Local Binary Patterns Histograms. This OpenCV tutorial offers explanations of all the algorithms and step-by-step instructions for using them.
Building a solid portfolio of machine learning projects can make or break your chances of getting the role you are applying for. Luckily, numerous data science projects and free online datasets are available to get you started. You should also be prepared to talk about your projects in interviews and answer the most common data science project interview questions. If you can present a project well, your portfolio will make you much more competitive for machine learning roles. Today, we went through 12 machine learning projects that you can build and add to your portfolio to make it stand out from the crowd and help you get your desired role. These projects range from beginner to advanced, so you are sure to find one that matches your current skill level, with additional ideas that will challenge you to grow.