Interview Query

Top 30 Data Science Projects with Source Code

Getting Started

Data science projects are an incredible learning tool for practicing new skills and deepening your expertise. Whether you want to master data analytics, or need to brush up on machine learning fundamentals, a data science project is one of the best ways to gain hands-on experience and learn by trial and error.

So, where to begin? What data projects should you focus on?

Data science is such a broad field, and there’s an endless array of data projects you can pursue, from building chatbots to testing fraud detection models. Ultimately, it depends on the goals you have and the tools you want to master. Want to improve on data analytics? Perform exploratory data analysis on a dataset that you scraped. Want to master Python? Try building a recommendation engine.

To provide inspiration, we’ve highlighted 30 data science projects, which we’ve broken down into several relevant categories.

These projects will help you practice some of the most useful analytics, data science, and machine learning skills. We’ve also included tips, free datasets to use, and source code. The projects are grouped by type below.

Data Science Projects for Beginners

Project-based learning is one of the best ways to expand your expertise and gain experience working with data science tools and concepts. If you’re just starting out, these beginner data science projects will help you practice the most essential skills:

1. Detecting Fake News with NLP

This is a great beginner data science project for practicing NLP techniques like text classification. You can start with this Fake and Real News Dataset on Kaggle, which features two four-column tables (true and fake news) with the title, text, subject, and date of each article. You can follow along with this tutorial for using Python to determine if an article is real or fake.

How to Do This Project: The tutorial above gives step-by-step directions on using Python, as well as libraries like TensorFlow and PyCaret.
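
Before reaching for a deep learning library, it can help to see text classification in its simplest form. The sketch below is a tiny bag-of-words Naive Bayes classifier written with only the standard library. It is not the tutorial's method, and the headlines in `TRAIN` are made up for illustration; a real project would load the Kaggle files instead.

```python
import math
from collections import Counter

# Hypothetical toy headlines (not taken from the Kaggle dataset).
TRAIN = [
    ("senate passes budget bill after long debate", "real"),
    ("central bank holds interest rates steady", "real"),
    ("city council approves new transit plan", "real"),
    ("shocking miracle cure doctors hate revealed", "fake"),
    ("you will not believe this shocking celebrity secret", "fake"),
    ("aliens secretly control the stock market", "fake"),
]


def train(examples):
    """Count words per label and documents per label."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(predict("shocking secret cure revealed", model))
print(predict("senate approves budget plan", model))
```

Once this makes sense, swapping in a larger dataset and a library model is mostly a matter of changing how the features are built.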

2. Color Detection Beginner Computer Vision Project

Computer vision is a complex field, but plenty of projects are approachable for beginners. This project, for example, will allow you to practice color detection using Python. Start with this helpful DataFlair tutorial, which covers how to build a simple app for color detection.

How to Do the Project: The tutorial above features a color dataset, as well as source code to help you get started.
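
The core idea behind a color detection app can be sketched in a few lines: take a pixel's RGB value and find the closest named color. The palette below is a hypothetical six-color stand-in for the tutorial's much larger color dataset, and squared Euclidean distance is an assumption here, not necessarily the metric the tutorial uses.

```python
# Hypothetical mini palette; a real app would load a full color dataset.
PALETTE = {
    "red": (255, 0, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
    "white": (255, 255, 255),
    "black": (0, 0, 0),
}


def nearest_color(rgb, palette=PALETTE):
    """Return the palette name closest to rgb by squared RGB distance."""
    def distance(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, palette[name]))

    return min(palette, key=distance)


print(nearest_color((250, 10, 5)))
print(nearest_color((10, 20, 200)))
```

In a full app, the `rgb` value would come from the pixel under the mouse click in an image loaded with a library like OpenCV.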

3. Build a Simple Movie Recommendation System

The TMDB 5000 Movie Database is a sprawling movie dataset, which includes 24 columns for each entry. You can use this dataset to build a basic movie recommendation system, which is a perfect beginner project. To get started, check out this Kaggle notebook, which provides a walk-through for all three types of recommendation systems.

How to Do the Project: Follow the tutorial above and start with demographic filtering first, which is the most straightforward method for a recommendation system. Then, work through the content-based and collaborative filtering examples.
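
Demographic filtering ranks the same movies for everyone, usually by a score that balances rating against vote count. One common choice is a vote-weighted rating, sketched below. The formula, the `min_votes` cutoff, and the movie rows are all assumptions for illustration rather than something spelled out above.

```python
# Hypothetical rows: (title, average rating, number of votes).
MOVIES = [
    ("Movie A", 9.0, 10),
    ("Movie B", 8.0, 1000),
    ("Movie C", 6.0, 500),
]


def weighted_rating(avg_rating, votes, min_votes, mean_rating):
    """Shrink a movie's rating toward the overall mean when votes are few."""
    total = votes + min_votes
    return (votes / total) * avg_rating + (min_votes / total) * mean_rating


def rank_movies(movies, min_votes=100):
    """Return titles sorted from highest to lowest weighted rating."""
    mean_rating = sum(rating for _, rating, _ in movies) / len(movies)
    scored = [
        (weighted_rating(rating, votes, min_votes, mean_rating), title)
        for title, rating, votes in movies
    ]
    return [title for _, title in sorted(scored, reverse=True)]


print(rank_movies(MOVIES))
```

Notice that Movie A has the highest raw rating but only 10 votes, so it drops below Movie B once the vote count is taken into account.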

4. Predict Income with Census Data

All of the datasets in the University of California, Irvine Machine Learning Repository are perfect for beginner data science projects. Because they’ve been pre-processed, they’re usually ready for analysis. Plus, there are many tutorials online that will walk you through how to use them and how to analyze performance.

One that is particularly great for beginners is the US Census Income Dataset. This is a great introductory data science classification project that asks you to determine if someone’s income is greater than $50,000 based on attributes in the dataset.

How to Do the Project: Check out this helpful tutorial to get started. It shows you how to build a model to predict income based on a variety of census data.

5. Cassava Leaf Disease Classification

This Kaggle competition introduces a dataset of 20,000+ images of cassava leaves, providing a great classification machine learning project. Your task is to classify each image into one of four disease categories or determine that the leaf is healthy.

How to Do the Project: Take a look at this Kaggle notebook for an overview of working with this dataset. You’ll also want to flip through the Getting Started tutorial from the competition’s authors.

6. Getting Started with Text Mining

Text mining is one of the most in-demand data science skills, and there are many ways you can practice this technique. First, read through this beginner’s guide to text mining for some helpful hints. Another strategy would be to look at text pre-processing tasks, like text normalization. The Google Text Normalization Challenge on Kaggle is a great dataset to work with.

How to Do the Project: After you complete the Text Normalization Challenge, look at doing other projects with text data, like text analytics, text classification, or text clustering.

7. Titanic Dataset for Prediction Modeling

One of the most well-known Kaggle datasets is the Titanic dataset. This is one of the best sources for a predictive modeling project, especially for beginners, as there are numerous notebooks you can view and get help from.

How to Use the Data: Build a model to predict if a passenger survived the sinking of the Titanic. This is a great introduction to using Python for predictive modeling.

8. Beginner Exploratory Data Analysis Project on Life Expectancy

The WHO’s life expectancy dataset is perfect for practicing exploratory analysis. It covers life expectancy for 193 countries, along with several other attributes, so you can use it to build prediction models, determine which factors correlate with longer life expectancy, and much more.

How to Use the Data: This is one of the best datasets for EDA, data visualization, or data storytelling projects. Here’s a helpful overview of doing EDA with the life expectancy dataset.

Data Analytics Projects

Analytics exercises and assignments are great for learning a range of skills: data visualization, EDA, intermediate SQL, regression analysis, etc. The list goes on and on. First, take a look at our list of top data analytics projects for inspiration, or you can try any of the data analytics projects below:

1. Scrape Data for NBA Analytics Projects

Data scraping is a go-to skill for analysts and data scientists, and Python is one of the best tools for scraping your own data. This tutorial shows you how to scrape data from Basketball-Reference. Follow along and build your own free dataset for an NBA analytics project or data visualization project.

How to Do This Project: Customize datasets with scraping and answer a basketball analytics question like “What’s the correlation between free-throw percentage and win percentage?” Or, “What’s the optimal strategy for the 2-for-1 play?”
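
Once you have scraped team stats, a question like the free-throw one above comes down to a correlation coefficient. Here is a minimal sketch of Pearson correlation using only the standard library. The team numbers are invented placeholders, not real NBA data, so the printed value says nothing about actual basketball.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-team values; replace with your scraped dataset.
free_throw_pct = [0.74, 0.78, 0.80, 0.76, 0.82]
win_pct = [0.40, 0.55, 0.60, 0.45, 0.58]

r = pearson(free_throw_pct, win_pct)
print(round(r, 3))
```

A value near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship. Either way, correlation alone does not tell you which stat drives the other.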

2. Airbnb Analytics Project

Ever wondered how Airbnb listings look in your city? Things like listings by neighborhood, the number of listings per host, or average prices? Check out Inside Airbnb. The site is your source for cleaned and aggregated Airbnb data for numerous cities around the world, making it a great source for a data analytics project.

How to Do the Project: Check out Inside Airbnb’s About page for question prompts to get you started.

3. Car Rental Prices Analytics Project

Which car models get rented most frequently? When’s the best day to rent a car? You can answer questions like these using the Cornell Car Rental Dataset on Kaggle. Featuring information on 6,000+ rental cars, the dataset is great for EDA-type data analytics projects.

How to Do the Project: Think up some problem statements before you get started. You might want to analyze fares, car rentals by model, or seasonal trends.

4. Practice Data Cleaning

Data cleaning may be the janitorial work of data analytics, but it’s absolutely essential. Bad data equals bad results, and if you can’t do things like handle missing values, parse dates, or manage inconsistent data entries, you’ll likely run into problems in future data analytics projects. Fortunately, this Kaggle challenge offers three mini data cleaning projects you can try.

How to Do the Project: This is a five-day challenge that provides hands-on practice with a variety of data cleaning tasks. Both source code and data are provided.
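
The three cleaning tasks mentioned above (missing values, date parsing, and inconsistent entries) can each be sketched in a few lines of plain Python. The rows, the two date formats, and the mean-fill strategy below are assumptions chosen for illustration; the challenge itself uses its own datasets and tools.

```python
from datetime import datetime

# Hypothetical messy rows, as they might arrive from a CSV file.
RAW_ROWS = [
    {"city": " New York ", "date": "2021-03-01", "sales": "120"},
    {"city": "new york", "date": "03/02/2021", "sales": ""},
    {"city": "Boston", "date": "2021-03-03", "sales": "80"},
]

# Date formats we assume may appear in the data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format; return None if nothing matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def clean(rows):
    """Normalize city names, parse dates, and fill missing sales."""
    known = [float(r["sales"]) for r in rows if r["sales"].strip()]
    fill_value = sum(known) / len(known)  # mean of the known values
    cleaned = []
    for r in rows:
        cleaned.append({
            "city": r["city"].strip().title(),
            "date": parse_date(r["date"].strip()),
            "sales": float(r["sales"]) if r["sales"].strip() else fill_value,
        })
    return cleaned


cleaned = clean(RAW_ROWS)
for row in cleaned:
    print(row)
```

Filling with the mean is only one option. Depending on the analysis, dropping the row or flagging it may be the better call.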

5. Earth Surface Temperature Visualization Projects

Here’s a great data visualization project for practicing Python and building visualizations. First, check out the Earth Surface Temperature Data on Kaggle. Then, take a look at this Kaggle notebook to see how to conduct some analysis and build visualizations.

How to Do the Project: Here’s a helpful tutorial for doing time-series analysis using the Earth Temperature dataset.

6. Retail Data Analysis with SQL

SQL is one of the go-to languages for data scientists, and SQL projects are one of the best ways to learn intermediate-to-advanced SQL functions. With this project, you can perform sales reporting using SQL on an open retail dataset. Check out this tutorial to get started.

How to Do the Project: Check out this e-commerce dataset on Kaggle, or this churn dataset for a large Telco. You can use the above tutorial to walk you through writing SQL functions for e-commerce reporting or building a customer churn model.
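
If you don't have a database set up, Python's built-in `sqlite3` module is an easy way to practice SQL reporting. The sketch below builds a tiny in-memory table and runs a revenue-by-category report. The `orders` table, its columns, and its rows are hypothetical and do not come from either Kaggle dataset.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER, category TEXT, quantity INTEGER, unit_price REAL)"
)

# Hypothetical order rows for illustration.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "toys", 2, 10.0),
        (2, "toys", 1, 15.0),
        (3, "books", 3, 8.0),
        (4, "garden", 1, 40.0),
    ],
)

# Revenue and order count per category, highest revenue first.
report = conn.execute(
    """
    SELECT category,
           SUM(quantity * unit_price) AS revenue,
           COUNT(*) AS order_count
    FROM orders
    GROUP BY category
    ORDER BY revenue DESC
    """
).fetchall()

for category, revenue, order_count in report:
    print(category, revenue, order_count)

conn.close()
```

From here, you can load a real CSV into the table and move on to joins, window functions, and date-based reporting.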

7. Customer Churn Analysis Project

This project leverages the Telco Customer Churn dataset on Kaggle, which is an IBM dataset. You can read more about it here. Using the dataset, you can perform a number of analytics projects, focused on predicting and analyzing churn.

How to Do the Project: Follow along with this end-to-end tutorial from Amanda Iglesias Moreno on Medium. In particular, you’ll get detailed info on how to use histograms and normalized stacked bars to visualize the data.
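
A normalized stacked bar is really just the share of churned versus retained customers within each group. The sketch below computes those shares, which are the numbers you would then hand to a plotting library. The customer rows and contract labels are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (contract type, churned?) pairs.
CUSTOMERS = [
    ("month-to-month", True),
    ("month-to-month", True),
    ("month-to-month", False),
    ("month-to-month", False),
    ("one-year", True),
    ("one-year", False),
    ("one-year", False),
    ("one-year", False),
    ("two-year", False),
    ("two-year", False),
]


def churn_shares(rows):
    """Return the fraction of churned customers in each group."""
    counts = defaultdict(Counter)
    for group, churned in rows:
        counts[group][churned] += 1
    return {
        group: c[True] / (c[True] + c[False])
        for group, c in counts.items()
    }


shares = churn_shares(CUSTOMERS)
for group, share in shares.items():
    # Each bar has height 1.0: churned share plus retained share.
    print(group, "churned:", share, "retained:", 1 - share)
```

Because every bar sums to 1, groups of very different sizes can be compared side by side.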

8. Stroke Prediction Dataset for Data Visualization Projects

If you’re interested in health informatics, this is a go-to source for a beginner-to-intermediate health analytics project. The Stroke Prediction Dataset features 5,000+ data points that you can use to build a stroke prediction model or practice creating health data visualizations.

How to Use the Data: This is a great source for a data visualization project, specifically data storytelling. Plus, it gives you practice in Python or R.

Machine Learning Projects

We’ve highlighted top Python data science projects, as well as classification projects.

1. DoorDash Take-Home Analytics Project

The DoorDash take-home often focuses on machine learning. Want to prep? Our sample DoorDash take-home asks you to build a model to predict delivery time.

How to Do the Project: Download the take-home on Interview Query, which includes requirements, datasets, and a sample solution. You might also want to see this DoorDash blog post on improving ETA predictions.

2. Build a QR Code / Barcode Scanner App in Python

If you want to get started with computer vision, this is a straightforward project that will allow you to work with concepts like edge detection. Behic Guven’s tutorial on Towards Data Science, which walks you through building this type of app, is a good starting point.

How to Do the Project: You’ll need to use libraries like OpenCV, Pyzbar, and Pillow if you follow along with the tutorial. Or, if you prefer a video walk-through, see this helpful video.

3. Transform Images into Cartoons with OpenCV

This is another fun machine learning project using OpenCV that provides practice in concepts like image transformation, facial recognition, and object detection. Get started with this tutorial on DataFlair, which includes source code and step-by-step instructions.

How to Do the Project: You’ll find source code in the DataFlair tutorial. Another option: check out this helpful mini-project tutorial on Medium.

4. Credit Card Approval Prediction

Risk analysis and assessment is a classic use of machine learning, and it’s one of the best project topics for anyone interested in fintech data science jobs. To start, you can use this dataset on Kaggle, which is perfect for predicting if a client is a “good” or “bad” client for approval.

How to Do the Project: Take a look at this Kaggle credit card approval notebook, which should provide some direction. If you’re interested in data science jobs in the finance industry, this is definitely a project to give a try.

5. Predicting Crypto Prices with Machine Learning

Cryptocurrency exchanges are hiring data scientists at a fast pace. Although most don’t work on pricing predictions (they focus more on business analysis), this is a great project if you want to build your portfolio for a crypto data science job. Follow along with this helpful walk-through from Abhinav Sagar on Medium.

How to Do the Project: In the tutorial, you’ll learn how to build a machine learning app that uses LSTM neural networks to predict crypto prices.
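
Before a sequence model like an LSTM can learn from prices, the series has to be cut into fixed-length input windows, each paired with the value that follows. The sketch below shows only that preparation step, with a hypothetical `make_windows` helper and placeholder prices. The model itself is left to the tutorial.

```python
def make_windows(prices, window):
    """Split a price series into (input window, next value) pairs."""
    inputs, targets = [], []
    for start in range(len(prices) - window):
        inputs.append(prices[start:start + window])
        targets.append(prices[start + window])
    return inputs, targets


# Placeholder values standing in for daily closing prices.
prices = [1, 2, 3, 4, 5]
inputs, targets = make_windows(prices, window=3)

for x, y in zip(inputs, targets):
    print(x, "->", y)
```

In practice you would also scale the prices and split the windows by time, so that the test set only contains dates after the training set.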

6. Build a Recommendation Engine

There are numerous datasets you can use for building recommendation engines, but if you want to build an engine for music, take a look at the Million Song Dataset. Featuring metadata for a million songs, this is a great source for building a recommendation engine with Python.

How to Do the Project: Follow along with this helpful tutorial from Ajinkya Khobragade on Medium. In particular, you’ll learn how to build a collaborative-filtering recommendation engine. Also, see this video on creating TreeMaps in Tableau.
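
To get a feel for collaborative filtering before working with a million songs, here is a minimal user-based sketch: find listeners with similar play counts, then suggest songs they played that you haven't. The listeners, songs, and play counts are invented, and cosine similarity is one common choice rather than the tutorial's stated method.

```python
import math

# Hypothetical play counts: listener -> {song: plays}.
PLAYS = {
    "ana": {"song_a": 5, "song_b": 3},
    "ben": {"song_a": 4, "song_b": 2, "song_c": 5},
    "cy": {"song_c": 4, "song_d": 5},
}


def cosine(u, v):
    """Cosine similarity between two sparse play-count dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[song] * v[song] for song in shared)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(user, plays):
    """Rank unheard songs by similarity-weighted play counts."""
    scores = {}
    for other, their_plays in plays.items():
        if other == user:
            continue
        similarity = cosine(plays[user], their_plays)
        for song, count in their_plays.items():
            if song not in plays[user]:
                scores[song] = scores.get(song, 0.0) + similarity * count
    return sorted(scores, key=scores.get, reverse=True)


print(recommend("ana", PLAYS))
```

Ana and Ben share two songs, so Ben's other favorite ranks first. Cy shares nothing with Ana, so Cy's songs contribute a score of zero.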

7. Practice Classification with this ML Project

The UCI Repository is a go-to source for free data, and this is a classic classification project. Essentially, the project asks you to build a classifier that predicts the species of an Iris flower from its sepal and petal measurements, and there are numerous tutorials and how-tos online.

How to Do the Project: One of the most famous datasets in the UCI Repository is the Iris dataset. This is the perfect source for a beginner machine learning classification project (because there are so many tutorials with source code available online).
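
One of the simplest classifiers to try on flower measurements is k-nearest neighbors: find the most similar labeled flowers and take a vote. The sketch below uses a handful of rows in the style of the Iris dataset (sepal length, sepal width, petal length, petal width). Treat the rows and the choice of `k=3` as illustrative assumptions, and load the full dataset for a real project.

```python
from collections import Counter

# A few illustrative rows: (sepal len, sepal width, petal len, petal width).
SAMPLES = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]


def classify(flower, samples=SAMPLES, k=3):
    """Predict a species by majority vote among the k closest samples."""
    def squared_distance(sample):
        features, _ = sample
        return sum((a - b) ** 2 for a, b in zip(flower, features))

    nearest = sorted(samples, key=squared_distance)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(classify((5.0, 3.4, 1.5, 0.2)))
print(classify((6.5, 3.0, 5.8, 2.2)))
```

With the full dataset, the natural next step is to hold out a test set and measure accuracy for a few values of `k`.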

8. Build a Computer Vision Face-Swapping App in Python

This is a massive dataset featuring more than 200,000 images of celebrity faces. If you’re interested in an OpenCV project, this is your go-to data source. Project ideas include building a Python face-swapping app, facial recognition, or celebrity face generation with deep convolutional GANs.

How to Use the Data: Building a face-swapping app with OpenCV is one of the best ways to gain hands-on experience in computer vision. Check out this tutorial for hints and source code.

BONUS: Data Science Projects for Passive Income

These project ideas will bolster your portfolio and your bank account. Featuring more advanced concepts, these are data science projects to generate passive income. Here’s the good news: You don’t have to build super-complex apps to launch a data science side hustle.

1. Build a Valuable Dataset

Businesses, investors, and governmental agencies pay high prices for quality data. Building a dataset that’s valuable, or at the very least useful, means that you can then generate passive income with a data science project. Here’s a helpful blog post about building valuable datasets that might offer some direction.

How to Do the Project: Typically, industries like real estate, cryptocurrency and NFTs, and finance have potential customers that would be willing to pay for a monthly data subscription. One tip: think about how you can add value (or more data) to an existing dataset, e.g. making it more accessible through an API or aggregating multiple datasets.

2. Concert Ticket Reselling

Concert tickets are one of the highest-value resale items. One project idea: build an app that monitors ticket prices on Craigslist and ticket reseller sites, like StubHub or SeatGeek, and then buys tickets priced below a certain threshold. Take a look at this article on analyzing concert ticket prices for ideas.

How to Do the Project: Take a look at this tutorial for scraping data on Craigslist. Although it looks at scraping used items, you can adapt it to concert tickets or other high-value items.

3. Build a Python Package

Although building a Python package doesn’t directly generate income, it can help you build your personal brand, and it looks great on a resume. Check out this guide to building your first Python package on Towards Data Science to get started.

How to Do the Project: The guide above provides step-by-step directions for developing a Python package, including creating a README file, as well as licensing and deployment.

4. Build Your Own Trading Bot

Let’s preface this one by saying that creating sustainable income with a trading bot is very difficult and high-risk. This might be more of a project you do for learning, rather than a passive income generator. But if you want to give it a shot, check out this guide to building an algorithmic trading app with machine learning on GitConnected. It focuses particularly on day trading U.S. stocks.

How to Do the Project: Other avenues to look at: sports betting or cryptocurrency trading. Again, these are high-risk endeavors that require ongoing maintenance… so maybe not the best passive income data science projects to try.

5. Finding Real Estate Investment Opportunities

Real estate investing is an age-old passive income generator, and data science can help you maximize profit margins. Essentially, you’re looking for homes in areas where average rents cover the monthly mortgage payment or, even better, exceed it.

You can do this by scaling your analysis across the U.S. First, scrape sold home data from sites like Zillow or Redfin, as well as rental data from sites like Zumper and Craigslist. Then, merge the datasets together to determine which areas have the best price-to-rental ratios across segments, like square footage and number of bedrooms.

How to Do the Project: Analysis is the most intense part of this project. After you’ve scraped the data, look for investments in areas you’d like to own properties.
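
The merge-and-compare step described above can be sketched with plain Python before you scale it up. The example groups sale prices and monthly rents by segment (area and bedroom count), then computes a price-to-rent ratio for each. All rows are invented placeholders, and using the median and a 12-month annual rent are assumptions for illustration.

```python
from statistics import median

# Hypothetical scraped rows: (area, bedrooms, amount in dollars).
SALES = [
    ("Area A", 2, 300000),
    ("Area A", 2, 340000),
    ("Area B", 2, 150000),
    ("Area B", 2, 170000),
]
RENTS = [
    ("Area A", 2, 1800),
    ("Area A", 2, 2000),
    ("Area B", 2, 1300),
    ("Area B", 2, 1500),
]


def median_by_segment(rows):
    """Median amount for each (area, bedrooms) segment."""
    groups = {}
    for area, bedrooms, amount in rows:
        groups.setdefault((area, bedrooms), []).append(amount)
    return {segment: median(values) for segment, values in groups.items()}


def price_to_rent(sales, rents):
    """Median sale price divided by median annual rent, per segment."""
    prices = median_by_segment(sales)
    monthly_rents = median_by_segment(rents)
    return {
        segment: prices[segment] / (12 * monthly_rents[segment])
        for segment in prices
        if segment in monthly_rents  # keep segments present in both sets
    }


ratios = price_to_rent(SALES, RENTS)
for segment, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(segment, round(ratio, 2))
```

A lower ratio means rent covers more of the purchase price each year, so those segments are the ones to look at first.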

6. Alternative Asset Arbitrage

Data science simplifies the process of arbitrage, making it easier to find price differences between markets. In fact, that’s exactly the strategy Sam Bankman-Fried used to make millions on crypto-asset arbitrage. Crypto and NFTs seem to be the big ones these days, but the strategy also works in sports betting, concert tickets, sneakers, and trading cards.

How to Do the Project: There are a number of tutorials and articles to read. Take a look at this one that looks at quantifying sneaker resale prices, based on features. Here’s another on crypto-asset arbitrage.
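
At its simplest, spotting an arbitrage opportunity means comparing the same item's price across markets and checking whether the gap survives fees. Here is a minimal sketch of that check. The item names, market names, prices, and the 10% selling fee are all hypothetical, and real markets add shipping, slippage, and timing risk on top.

```python
# Hypothetical listings: (item, market, price in dollars).
LISTINGS = [
    ("sneaker_x", "market_a", 200),
    ("sneaker_x", "market_b", 260),
    ("card_y", "market_a", 50),
    ("card_y", "market_b", 52),
]


def find_spreads(listings, fee_rate=0.10):
    """Return (item, buy market, sell market, profit) where profit > 0."""
    quotes_by_item = {}
    for item, market, price in listings:
        quotes_by_item.setdefault(item, []).append((price, market))

    opportunities = []
    for item, quotes in quotes_by_item.items():
        buy_price, buy_market = min(quotes)
        sell_price, sell_market = max(quotes)
        # Assume the fee is charged on the selling side only.
        profit = sell_price * (1 - fee_rate) - buy_price
        if buy_market != sell_market and profit > 0:
            opportunities.append((item, buy_market, sell_market, profit))
    return opportunities


spreads = find_spreads(LISTINGS)
for item, buy_market, sell_market, profit in spreads:
    print(item, buy_market, "->", sell_market, round(profit, 2))
```

The trading card shows why fees matter: a $2 price gap looks like profit until the selling fee turns it into a loss.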

More Project Ideas from Interview Query

If you don’t want to commit to a project, you might consider answering real data science interview questions.

Practice questions will help you build your data science skills, including Python, SQL, and machine learning, as well as skills essential to a data science career, like business sense and product sense.