Finding a good dataset for your data science projects can be challenging, especially when you’re looking for high-quality, real-world datasets that are free to use. Whether you’re just starting out, working on a class project as a student, or building your portfolio, having access to the right dataset is crucial.
We’ve got you covered! We’ve put together a list of the best free datasets you can use for your data science projects, from machine learning to computer vision. We’ll also cover where to find more free datasets.
There are many open sources where you can find free datasets for your data science projects. Below are some data repositories to help you get started.
Kaggle is one of the largest online platforms for data science and machine learning. It also provides tools for analysis, competitions, and a community where users can collaborate and learn from each other.
Data.gov is the U.S. government’s open data portal, offering free access to over 300,000 datasets. You can find data on health, finance, climate, and more for research and data science projects.
OpenML is an open platform for sharing datasets, algorithms, and experiments. It provides a wide range of datasets for model training, benchmarking, and data science projects.
Hugging Face is an online platform dedicated to data science and machine learning. You can browse datasets, train AI models, share your work, and collaborate with others as you do so.
Awesome Public Datasets is a list of high-quality, topic-centric public data sources. It includes free datasets across various fields, from agriculture to eSports.
KDnuggets includes a collection of free datasets for machine learning, data science, and AI projects. You can also find links to other dataset repositories here.
These free machine learning datasets could be just what you need. They offer real-world data to train, test, and enhance your models for various AI and data science projects.
This dataset contains electricity market data from New South Wales, Australia, collected between May 7, 1996, and December 5, 1998. The data was normalized by A. Bifet and is useful for time-series forecasting and price trend analysis.
Due to the lack of a universally agreed upon feature set in phishing detection research, this dataset highlights key attributes and introduces new predictive features.
The SMS Spam Collection is a dataset of 5,574 SMS messages in English, labeled as either ham (legitimate) or spam. This dataset is useful for building and evaluating machine learning models aimed at filtering unwanted messages and improving text classification systems.
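To get a feel for how a ham/spam filter works before downloading the full collection, here is a minimal sketch of a naive Bayes text classifier in plain Python. Naive Bayes is just one common baseline for this kind of task, not the method prescribed by the dataset. The six training messages are made-up stand-ins, not rows from the real data, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Made-up stand-in messages; the real collection has 5,574 labeled rows.
TRAIN = [
    ("ham", "are we still meeting for lunch today"),
    ("ham", "call me when you get home"),
    ("ham", "see you at the meeting tomorrow"),
    ("spam", "win a free prize call now"),
    ("spam", "free entry claim your prize today"),
    ("spam", "urgent you have won cash call now"),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train_naive_bayes(rows):
    """Count words per label and messages per label."""
    word_counts = {}
    label_counts = Counter()
    for label, text in rows:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, label_counts, vocab


def predict(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, label_counts, vocab = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_messages)
        # Add-one (Laplace) smoothing so unseen words do not zero out a label.
        denominator = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # skip words never seen in training
                score += math.log((word_counts[label][token] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(TRAIN)
print(predict("claim your free prize now", model))  # spam
print(predict("are you home for lunch", model))     # ham
```

With the real dataset you would hold out part of the messages for testing instead of checking two hand-written examples.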
This dataset is designed for predicting students’ adaptability levels in online education based on factors like demographics, academic background, internet access, and learning environment.
This dataset contains a comprehensive collection of images categorized to assist in the development of machine learning models for detecting diseases in various fruits and vegetables.
This dataset includes loan data from 2007 to 2015, detailing loan status, payment history, credit scores, finance inquiries, and collections. With 890,000 observations and 75 variables, it is useful for analyzing credit risk, borrower behavior, and loan performance.
A startup is a new business aiming for growth, often facing high uncertainty and failure rates. This dataset is designed to predict startup success or failure.
HR analytics involves using data to make informed decisions about human resources and employee-related matters. It can provide valuable insights to organizations in terms of employee engagement, productivity, and overall organizational effectiveness.
If you’re a music enthusiast looking to explore audio features, lyrics, and metadata, this dataset is a solid pick for a data science project. Perfect for music analysis, trend discovery, and building recommendation systems.
With census data on demographics and employment, this dataset is good for predicting whether an individual’s income exceeds $50K/year. Useful for classification tasks or for socioeconomic analysis.
With car features like price, capacity, and safety, this dataset is good for evaluating car acceptability. Useful for classification tasks, decision-making models, and recommendation systems.
This dataset is good for analyzing student loan trends, payment behavior, and academic engagement. Useful for predictive modeling, financial analysis, and education research.
This dataset provides average salaries by job title and grade for full-time regular employees. It offers insights into salary distribution across various roles.
Understanding driver-related factors can help in accident prevention and risk assessment. This dataset provides details on motor vehicle operators involved in traffic collisions on county and local roadways, capturing key insights into crash incidents and driver behavior.
Understanding customer trends and improving sales strategies starts with analyzing demographics and purchasing behavior. This dataset includes details like age, gender, and estimated salary, along with whether a user made a purchase.
This dataset provides insights into athlete performance, country participation, and the evolution of the Olympics over time. It is valuable for sports analytics, data visualization, and machine learning applications.
In the competitive mobile market, pricing a product correctly is crucial. This dataset captures sales data from various mobile brands to analyze the relationship between device features and price ranges.
Understanding and predicting student outcomes is key to reducing academic dropout rates. This dataset helps identify students at risk by analyzing enrollment data, including academic history, demographics, and socio-economic factors.
Assessing the health of animals across different species, this dataset helps build predictive models to determine whether an animal’s condition is dangerous based on five key symptoms.
Analyzing wine quality can help winemakers improve production and maintain consistency. This dataset includes key wine attributes, such as acidity and alcohol content, along with quality ratings and wine types.
Build data visualization projects with these helpful datasets. We looked at data with the potential for interesting visualizations and datasets that weren’t too messy or overly complex.
This dataset includes revenue and sales data from Supercell and asks you to visualize a single aspect of the data that you find important.
With more than 11 million nodes and 85 million edges, this dataset is useful for building graphical relationship models of X users.
This is a great dataset for visualizing hotel bookings. You’ll be able to build visualizations that answer questions like:
Design visualizations that show top authors, best-selling titles, and review ratings for the best-selling books on Amazon.
Visualize the impact COVID is having on hiring with this dataset from the Amazon Open Data Registry. It features regularly updated hiring data from 3+ million job organizations.
If you’re interested in political visualizations, FiveThirtyEight is one of the best data sources. Its updated polling data is great for visualizing averages and polling movements.
Build charts to visualize the United States’ international trade, including top imports, top exports, and annual trade balances.
This dataset is useful for Matplotlib visualizations. You can create visualizations of exchange rates and currency valuations over time. The dataset features more than 20 years of daily exchange rate data.
This dataset has more than 400,000 records featuring daily circulation for the San Francisco library system. You can build visualizations related to new acquisitions, most checked-out authors, most checked-out titles, etc.
This Kaggle dataset features daily trending video data from YouTube. Trending videos aren’t necessarily the most watched but are generally the most interacted-with videos. Visualizations include the most popular videos of the year or month or the most trending videos by the artist/creator.
This dataset features more than 31 years of unemployment in numerous countries worldwide. There is a wide range of visualizations you can create, including comparisons of countries, unemployment rates over time, or countries with the lowest unemployment.
This dataset originated on New York State Open Data and features station, line, location information, etc. You can use this dataset to visualize popular lines or subway maps.
This dataset contains information on more than 10,000 athletes in 40+ sports, and it’s a great source for building country medal count visualizations. There’s also coaching data, so you can add medal information by coach.
Say you want to take a big dataset and investigate. As you dive into the data, you discover patterns, trends, and anomalies. These datasets are perfect for exploratory data analysis projects because they contain large amounts of mostly clean data.
This Airbnb dataset, part of a sample data analytics take-home, contains user information for bookings in Brazil.
A fun dataset to explore and great for beginners, this features all of the Netflix original movies up to June 1, 2020, and their corresponding IMDb scores.
This Stripe dataset, which features product usage and marketing data, is perfect for diving into marketing and product analytics to determine how well a product performs.
Featuring 4 years of data from a superstore, this dataset is perfect for analyzing and identifying trends and sales forecasting.
This sample dataset from a Home Depot data science take-home can be used to produce a gross sales forecast for a new product launch.
This dataset is made up of mock marketing analytics data used by master’s in business analytics students. A great source for a marketing analytics project.
This is a great dataset for surfacing actionable insights for animal shelters, including what factors led to successful animal outcomes.
Another FiveThirtyEight dataset, this one features survey data from non-voters in the U.S. A few project ideas are identifying key factors that result in non-voting or building a voting likelihood model.
A sprawling dataset hosted through Amazon’s open data program, the Common Crawl corpus features crawl data from billions of web pages. Check out the Example Projects page for ideas.
This dataset is useful for a sports analytics project. Featuring data on more than 20,000 matches and individual stats from 2008 to 2016, this is great for exploratory data analysis projects on line-ups, team stats, wins, and individual player stats.
This large-scale dataset, which was originally developed in 2018, features product information for more than 600,000 food items. Data includes allergens, ingredients, and nutrition facts, and there is a wide range of data analytics projects you can do with it.
This useful marketing analytics dataset features survey data from 2,500+ millennials. The survey asked respondents which social platform has influenced their online shopping the most.
This dataset features Google Analytics metrics from Austin, TX’s website. This dataset is great for working in Google Analytics or analyzing website traffic.
This dataset features more than 20 million metrics on Uber pickups in NYC in 2014 and 2015. This is great for an exploratory data analysis or analytics project, and you can gather insights into popular pickup locations, common trip routes, and the locations with the longest pickups.
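A common first step in an exploratory analysis like this is to count pickups per hour of the day. The sketch below does that with the standard library. The four timestamps and their format are assumptions for illustration, so check the real file’s columns and date format before reusing it.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps in an assumed "YYYY-MM-DD HH:MM:SS" format.
pickup_times = [
    "2014-04-01 00:11:00",
    "2014-04-01 17:05:00",
    "2014-04-01 17:45:00",
    "2014-04-02 08:30:00",
]


def pickups_by_hour(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count how many pickups fall in each hour of the day (0-23)."""
    return Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)


hourly = pickups_by_hour(pickup_times)
busiest_hour, pickups = hourly.most_common(1)[0]
print(busiest_hour, pickups)  # 17 2
```

The same counting pattern works for weekdays or pickup zones: swap `.hour` for whatever field you want to group by.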
This dataset is a great source for a campaign budget optimization project or for diving into exploratory data analysis for marketing analytics projects.
This dataset contains a wide range of economic and social indicators for countries worldwide, including information about their GDP, population, and education levels.
This dataset contains salaries for roles in the data science field for the year 2023. You can group the data by domain, years of experience, and even by country of employment, allowing many angles for exploratory analysis.
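Grouping is usually the first thing to try with salary data. Here is a small sketch that computes the median salary per group using only the standard library. The rows and the column names `experience_level` and `salary_in_usd` are hypothetical, so check the real file’s header before reusing them.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows and column names, for illustration only.
rows = [
    {"experience_level": "EN", "salary_in_usd": 60000},
    {"experience_level": "EN", "salary_in_usd": 80000},
    {"experience_level": "EN", "salary_in_usd": 70000},
    {"experience_level": "SE", "salary_in_usd": 150000},
    {"experience_level": "SE", "salary_in_usd": 170000},
]


def median_by_group(records, group_key, value_key):
    """Collect values per group, then take the median of each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {name: median(values) for name, values in groups.items()}


median_salary = median_by_group(rows, "experience_level", "salary_in_usd")
print(median_salary)
```

The median is used here instead of the mean because a few very high salaries can pull an average upward.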
Consider working on a data science project with these free healthcare and medical datasets. They can be used for disease prediction, patient outcome analysis, and medical research.
The dataset comprises clinical records of patients with heart failure, detailing various medical attributes that may contribute to heart failure incidents.
This dataset contains medical images used for classifying various skin diseases, including Actinic Keratosis, Atopic Dermatitis, Benign Keratosis, Dermatofibroma, Melanocytic Nevus, Melanoma, Squamous Cell Carcinoma, Tinea Ringworm Candidiasis, and Vascular Lesions.
A medical insurance company aims to deepen its understanding of the factors impacting individual health insurance costs. They have gathered a dataset featuring diverse individual profiles, encompassing details such as age, gender, BMI, number of children, smoking status, and region.
This dataset contains information related to individuals and their risk factors for heart disease. The data includes demographic information such as age and gender, as well as medical history, lifestyle factors, and symptoms associated with heart disease.
This dataset provides global prevalence data on mental health disorders like schizophrenia, depression, anxiety, and substance use disorders. It offers insights into their impact, helping to better understand these conditions and their implications.
The DARWIN (Diagnosis AlzheimeR WIth haNdwriting) dataset includes handwriting samples from 174 participants, both patients and healthy individuals. It supports research in handwriting-based cognitive assessment.
The dataset is designed for machine learning projects focused on predicting the type and severity of brain tumors, as well as understanding various treatment methods and patient outcomes. It contains simulated data for brain tumor diagnosis, treatment, and patient details.
This dataset supports real-time and remote rabies diagnosis in humans and animals, especially in low-resource settings. It includes 12,684 observations across three datasets and can be used for outbreak prediction and resource planning, such as estimating vaccine needs.
Intraoperative anesthesia data addresses gaps in electronic records in low- and middle-income countries (LMICs). It provides key data elements essential for clinical decision-making and research, helping improve patient outcomes and surgical care.
This dataset includes diseases, their symptoms, precautions, and associated weights. It is structured for easy cleaning and processing, making it useful for disease prediction models and healthcare analytics.
If you’re looking into finance and economics for your data science project, these datasets can help you analyze market trends, stock performance, consumer spending, and economic indicators for data-driven insights and predictive modeling.
This dataset classifies people described by a set of attributes as having good or bad credit risks.
The dataset includes various client and campaign-related attributes, with the classification goal of predicting whether a client will subscribe to a term deposit. It is useful for customer behavior analysis and predictive modeling in marketing.
Fraudulent transactions are a growing challenge for fintech companies. This dataset captures 51,000+ transactions, each labeled as fraudulent or legitimate, based on real-world patterns.
This dataset is used for predicting customer churn in the banking industry. It contains information on bank customers and their churn status.
Understand and predict consumer choices in cultural product design with this dataset. It combines purchase behavior, social media interactions, sentiment analysis, and cultural relevance factors.
In the competitive mobile market, pricing a product correctly is crucial. This dataset captures sales data from various mobile brands to analyze the relationship between device features and price ranges.
This dataset provides national accounts data from 1970 onward for over 200 countries, compiled through a global collaboration led by the UN Statistics Division. It includes key economic indicators, making it valuable for analyzing economic trends, policy research, and financial forecasting.
If you want to analyze Apple Inc.’s stock performance, market trends, and financial patterns, this dataset provides historical stock prices sourced from Yahoo Finance.
This dataset provides historical foreign exchange rates for various currencies, making it useful for time series analysis and financial forecasting.
This dataset captures travel insurance purchase decisions, showing which customers opted for insurance and which did not. It’s useful for analyzing customer behavior, risk assessment, and building predictive models for insurance uptake.
Complaints from customers about financial products and services provide valuable insights into consumer issues, financial trends, and company responses. This dataset helps in identifying patterns, assessing service quality, and improving customer satisfaction in the financial sector.
Analyze sales trends and forecast demand with this dataset, which tracks weekly purchase quantities for over 800 products over 52 weeks.
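Before reaching for a complex model, it helps to have a simple baseline for weekly demand. The sketch below forecasts each week as the average of the previous few weeks and scores that baseline with mean absolute error. The five weekly quantities are made up, and the moving average is only one possible baseline, not a method tied to this dataset.

```python
def moving_average_forecast(history, window):
    """Forecast the next value as the mean of the last `window` values."""
    if window <= 0 or len(history) < window:
        raise ValueError("need at least `window` observations")
    recent = history[-window:]
    return sum(recent) / window


def backtest_mae(series, window):
    """One-step-ahead backtest: mean absolute error of the moving average."""
    errors = []
    for i in range(window, len(series)):
        forecast = moving_average_forecast(series[:i], window)
        errors.append(abs(series[i] - forecast))
    return sum(errors) / len(errors)


# Made-up weekly purchase quantities for one product.
weekly_quantities = [10, 12, 14, 16, 18]

next_week = moving_average_forecast(weekly_quantities, window=2)
mae = backtest_mae(weekly_quantities, window=2)
print(next_week, mae)  # 17.0 3.0
```

Note how the baseline lags behind a steadily rising series. Any model you build should beat this error to be worth its extra complexity.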
This dataset tracks monthly U.S. Consumer Price Index (CPI) data, averaging CPI across all U.S. cities. It’s useful for analyzing inflation trends and economic shifts.
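Inflation is usually reported as the percent change in CPI compared with the same month one year earlier. Here is a minimal sketch of that calculation. The index values are made up, and the `(year, month)` dictionary layout is an assumption, not the dataset’s actual format.

```python
def year_over_year_inflation(cpi_by_month):
    """Map (year, month) to percent change vs. the same month a year earlier."""
    rates = {}
    for (year, month), value in cpi_by_month.items():
        previous = cpi_by_month.get((year - 1, month))
        if previous:  # skip months with no value from the prior year
            rates[(year, month)] = (value / previous - 1) * 100
    return rates


# Made-up index values, for illustration only.
cpi = {
    (2020, 1): 250.0,
    (2020, 2): 251.0,
    (2021, 1): 255.0,
    (2021, 2): 257.275,
}

rates = year_over_year_inflation(cpi)
for period, rate in sorted(rates.items()):
    print(period, round(rate, 2))
```

Comparing the same month across years avoids mistaking seasonal price swings for inflation.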
This dataset contains financial data from the Taiwan Economic Journal (1999–2009) to analyze and predict company bankruptcy. Bankruptcy status is determined based on Taiwan Stock Exchange regulations and can be used for financial risk assessment, credit scoring, and predictive modeling.
Customer churning is a critical issue for telecom companies, and this dataset helps analyze factors influencing customer retention. It includes data on 7,043 customers in California, covering demographics, service usage, satisfaction scores, churn scores, and Customer Lifetime Value (CLTV).
Find out how consumers celebrate Valentine’s Day with this dataset, which includes over a decade of survey data from the National Retail Federation.
Explore this dataset featuring major U.S.-traded companies alongside emerging, innovative firms that recently went public. It covers industries such as social media, e-commerce, streaming, electric vehicles, cloud computing, and more, offering insights into market trends and business growth.
This dataset contains annual spending records of clients from a wholesale distributor, categorized by different product types. It can be used for customer segmentation, spending pattern analysis, and business insights.
This dataset tracks trends in productivity and hourly wages in the United States from 1948 to 2021. It includes overall compensation data as well as insights into earnings for production and nonsupervisory workers.
This dataset captures historical sales records from a supermarket chain operating across three branches over a three-month period. It is ideal for predictive analytics, demand forecasting, and business intelligence applications.
Natural language processing (NLP) involves text or speech data used to train models for understanding and processing language. Use the datasets below for your data science projects, including sentiment analysis, speech recognition, and text classification.
This dataset includes 18K job descriptions, with around 800 labeled as fake. It contains both textual data and meta-information about the jobs. The dataset can be used to train classification models to detect fraudulent job postings.
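With roughly 800 fake postings out of about 18,000, the classes are heavily imbalanced, so accuracy alone can be misleading. The sketch below uses those approximate counts to show that a model that never flags anything still looks accurate while catching no fraud. The helper function is hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive (fake) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Approximate counts from the description: about 800 fake (1) out of 18,000.
y_true = [1] * 800 + [0] * 17200

# A "model" that labels every posting as real.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(round(accuracy, 3), precision, recall)  # 0.956 0.0 0.0
```

When you evaluate a classifier on this dataset, report precision and recall for the fake class alongside accuracy.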
This dataset contains labeled tweets from Twitter and labeled comments from Reddit. It is useful for analyzing sentiment trends across multiple social media platforms.
If you’re looking to work on sentiment analysis or recommender systems, the Amazon Product Reviews dataset is a solid pick. It includes reviews, ratings, and metadata, making it great for consumer behavior insights and trend analysis.
You can use this dataset for sentiment classification or for consumer behavior analysis. It contains customer reviews for various products, including details about product categories, brands, user ratings, and sentiments.
This dataset is good for fake news detection, text classification, and NLP. It includes news articles with titles, body text, subjects, and publish dates.
With AI-generated text becoming more common, this dataset is great for analyzing the differences between AI and human writing. It includes 500K essays from both, making it useful for text classification and authorship detection.
Humans express a variety of emotions, and you might be curious about what emotion a message conveys. This dataset provides labeled text for emotion detection, making it useful for building models to classify emotions in text.
If you’re into language translation, this dataset is a great pick. It contains English and French text pairs, making it useful for training NLP models for machine translation.
Phishing URLs are a major cybersecurity threat, and if you want to detect or classify malicious websites, this dataset is a solid pick. It contains 800K+ URLs, split between legitimate and phishing domains, making it useful for training models to identify online threats.
Sarcastic headlines are often used as clickbait, making sarcasm detection an interesting NLP challenge. If you want to classify sarcasm in news headlines, this dataset is a solid pick. It can help media houses analyze engagement strategies and improve search engine recommendations.
With 50,000 highly polar movie reviews, this dataset is good for sentiment analysis, text classification, and NLP model training. It includes both raw text and bag-of-words formats, making it useful for various natural language processing tasks.
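If the bag-of-words format is new to you, the idea is to turn each review into a vector of word counts over a fixed vocabulary. Here is a minimal sketch with two made-up reviews. The dataset’s own bag-of-words files may use a different vocabulary and layout.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Assign each distinct word an index, in alphabetical order."""
    words = sorted({token for doc in documents for token in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(text, vocabulary):
    """Return a count vector; words outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


# Made-up reviews, for illustration only.
reviews = ["A great great movie", "A terrible movie"]

vocabulary = build_vocabulary(reviews)
vectors = [bag_of_words(review, vocabulary) for review in reviews]
print(vocabulary)  # {'a': 0, 'great': 1, 'movie': 2, 'terrible': 3}
print(vectors)     # [[1, 2, 1, 0], [1, 0, 1, 1]]
```

Word order is lost in this representation, which is why it is called a "bag" of words.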
The dataset features annotated sentences from abstracts and introductions of 30 research papers in biology, machine learning, and psychology. Ideal for argument classification, NLP, and academic text analysis.
Uncover what drives success in social media marketing with this dataset, which provides insights into the marketing strategies of top online influencers. It includes survey responses detailing target audience segments, market descriptions, and strategies used for engagement and growth.
This dataset contains tweets posted by fact-checkers between December 1, 2019 and June 2, 2020. It can be used for analyzing misinformation trends or fact-checking patterns.
Take an in-depth look at Boston Airbnb experiences through past customer reviews. This detailed dataset offers valuable perspectives on Airbnb stays.
This dataset can be used to help improve customer service responses to be clearer and more accurate, improving communication with customers.
This dataset includes over 30,000 hours of transcribed speech from diverse speakers. It provides a valuable resource for training speech-to-text models.
You can use this dataset to improve automatic speech recognition models for anime and Japanese media, as it captures unique speech patterns and linguistic features from visual novels.
Analyze sentiment in YouTube comments with this dataset, which includes over one million labeled comments across topics like programming, news, sports, and politics.
Customer reviews provide insights into whether customers enjoyed their meal, which can be used for analyzing sentiment and customer preferences.
Computer vision enables machines to interpret and understand visual data. These computer vision datasets can help you train models for image classification, object detection, image recognition, and more.
The dataset consists of instances randomly sampled from seven outdoor images. Each image was manually segmented to classify every pixel, and each instance represents a 3×3 pixel region.
This dataset is useful for object detection tasks such as license plate detection and recognition. It contains 1,695 images, each with bounding box annotations for license plates.
The dataset is designed for research in emotion recognition and facial expression analysis across diverse races, genders, and ages.
This dataset contains three labels, “Healthy”, “Powdery”, and “Rust”, referring to plant conditions. There are 1,530 images in total, divided into train, test, and validation sets.
The dataset consists of real-time CCTV footage annotated for accident detection and severity classification. It includes video frames with accident labels (Accident/Non-Accident) and severity levels (Minor, Substantial, Critical).
This dataset contains images labeled as “Vehicles” and “Non-Vehicles,” suitable for object detection and classification tasks.
ImageNet is an image database organized according to the WordNet hierarchy. WordNet contains more than 100,000 synsets (concepts described by one or more words or phrases), and ImageNet aims to provide an average of 1,000 images per synset. It’s suitable for image classification and object detection.
With gesture data from 7 videos, this dataset is good for analyzing motion and segmenting gestures. Useful for computer vision and human-computer interaction.
Using high-resolution aerial imagery, this dataset is good for classifying urban land cover to support sustainable urban planning.
Ideal for segmentation tasks, this dataset includes 101 scanned pages from Russian newspapers and magazines with pixel-based ground truth masks. Useful for document analysis, OCR, and computer vision projects.
It contains images of waste items in their natural, unaltered state, enabling comparisons between models trained on ideal objects versus actual discarded materials.
AI-generated images are now more common than ever. Use this dataset to analyze and determine whether an image is real or fake.
Misidentifying drones as birds can pose risks to people, infrastructure, and air traffic. This dataset contains diverse images of birds and drones in motion, helping to improve detection and classification models.
This dataset contains infrared images of people in both drunk and sober states. It can be used for research on impairment detection and computer vision applications.
This dataset contains images of different fruits and vegetables, making it useful for image recognition tasks. It was created to help develop applications that identify food items from photos and suggest recipes based on the detected ingredients.
This dataset contains media of cars from various angles, and you can create an algorithm to detect and identify cars in different scenarios.
It contains high-quality images with carefully annotated bounding boxes, available in both pixel coordinates and YOLO format for easy integration into various models.
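The two annotation formats describe the same box in different ways. YOLO format stores the box center, width, and height as fractions of the image size. The sketch below converts in both directions, assuming the pixel annotations are corner coordinates `(x_min, y_min, x_max, y_max)`. That layout is an assumption, so check the dataset’s documentation.

```python
def pixel_to_yolo(x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert corner pixel coordinates to normalized YOLO values."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return x_center, y_center, width, height


def yolo_to_pixel(x_center, y_center, width, height, img_w, img_h):
    """Convert normalized YOLO values back to corner pixel coordinates."""
    box_w = width * img_w
    box_h = height * img_h
    x_min = x_center * img_w - box_w / 2
    y_min = y_center * img_h - box_h / 2
    return x_min, y_min, x_min + box_w, y_min + box_h


# Made-up box in a 640x480 image, for illustration only.
yolo_box = pixel_to_yolo(100, 200, 300, 400, img_w=640, img_h=480)
pixel_box = yolo_to_pixel(*yolo_box, img_w=640, img_h=480)
print(yolo_box)
print(pixel_box)
```

In a YOLO label file, each line also starts with a class index before these four values.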
Use this dataset for behavior analysis and driver monitoring, covering various driving scenarios and environments to improve traffic and road safety.
X-ray testing is used to detect anomalies and hidden defects. You can use this dataset for developing computer vision models that analyze X-ray images for automated detection and classification tasks.
Masks help protect individuals from respiratory diseases and were a key precaution during COVID-19. This dataset allows you to build a model to detect whether people are wearing masks correctly, incorrectly, or not at all.
Pricing optimization is one of the most important levers for increasing revenue with data. Try to identify prices that maximize revenue for these different products and environments:
Over 1 million transactions from an online retailer, including customer data, product data, and transaction data.
Weekly retail prices and volume data for avocados in various US markets from 2015 to 2018.
Data on beer consumption and prices in Sao Paulo, Brazil, from 2015 to 2018.
Information on Airbnb listings in New York City, including listing prices and attributes such as location, number of bedrooms, and amenities.
Information on Uber pickups in New York City from 2014 to 2015, including pickup times and locations.
Information on over 53,000 diamonds, including their cut, color, clarity, carat weight, and price.
Weekly sales data for 45 Walmart stores across the US from 2010 to 2012, including information on promotions, holidays, and weather conditions.
Using this dataset, you can build predictive pricing models, especially around surge pricing.
Using this dataset, you can build a predictive model for used car pricing. The dataset includes various features such as the car’s make and model, location, year of manufacture, kilometers driven, fuel type, transmission, owner type, mileage, engine size, power, and the number of seats. These attributes can be used to analyze how different factors influence the resale value of cars.
Using this dataset, you can build predictive models to estimate health insurance costs based on various factors such as age, gender, BMI, number of children, smoking status, and region. These models can help insurance companies better understand the risk factors associated with individual clients and set more accurate premiums.
By analyzing patterns within the data, you can also explore how different variables interact with each other to influence the overall charges, potentially identifying key drivers behind high insurance costs.
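For the pricing datasets above, one simple way to look for a revenue-maximizing price is to fit a straight-line demand curve and solve for the price where revenue peaks. The sketch below assumes demand is roughly linear in price, which real data may not satisfy. The price and quantity values are made up.

```python
def fit_linear_demand(prices, quantities):
    """Least-squares fit of quantity = intercept + slope * price."""
    n = len(prices)
    mean_p = sum(prices) / n
    mean_q = sum(quantities) / n
    covariance = sum((p - mean_p) * (q - mean_q) for p, q in zip(prices, quantities))
    variance = sum((p - mean_p) ** 2 for p in prices)
    slope = covariance / variance
    intercept = mean_q - slope * mean_p
    return intercept, slope


def revenue_maximizing_price(intercept, slope):
    """Revenue p * (intercept + slope * p) peaks at p = -intercept / (2 * slope)."""
    if slope >= 0:
        raise ValueError("demand must fall as price rises")
    return -intercept / (2 * slope)


# Made-up observations that follow quantity = 1100 - 200 * price.
prices = [1.0, 1.5, 2.0, 2.5]
quantities = [900, 800, 700, 600]

intercept, slope = fit_linear_demand(prices, quantities)
best_price = revenue_maximizing_price(intercept, slope)
best_revenue = best_price * (intercept + slope * best_price)
print(best_price, best_revenue)  # 2.75 1512.5
```

Here the best price falls outside the observed range of 1.0 to 2.5, so in practice you would treat it as a hypothesis to test, not a number to ship.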
Datasets are important for data science projects because they help with analysis, model training, and real-world applications. With so many free datasets available, you can test ideas, find insights, and create useful solutions. Explore these datasets and start bringing your data science project ideas to life!