101+ Best Free Datasets for Data Science in 2025 [Ultimate Guide]

Overview

Finding a good dataset for your data science projects can be challenging, especially when you’re looking for high-quality, real-world datasets that are free to use. Whether you’re just starting out, a student working on a data science project, or someone working on your portfolio, having access to the right dataset is crucial.

We’ve got you covered! We’ve put together a list of the best free datasets you can use for your data science projects, from machine learning to computer vision. We’ll also cover where to find free datasets.

Where to Find Free Datasets: Data Search Engines & Repositories

There are many open sources where you can find free datasets for your data science projects. Below are some data repositories to help you get started.

1. Kaggle

Kaggle is one of the largest online platforms for data science and machine learning. It hosts a large collection of public datasets and also provides tools for analysis, competitions, and a community where users can collaborate and learn from each other.

2. UCI Machine Learning Repository

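The UCI Machine Learning Repository, maintained by the University of California, Irvine, is a long-running collection of datasets widely used for benchmarking machine learning models. Another option is Data.gov, the U.S. government’s open data portal, which offers free access to over 300,000 datasets. You can find data on health, finance, climate, and more for research and data science projects.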

3. OpenML

OpenML is an open platform for sharing datasets, algorithms, and experiments. It provides a wide range of datasets for model training, benchmarking, and data science projects.

4. Hugging Face

Hugging Face is an online platform dedicated to data science and machine learning. You can browse datasets, train AI models, share your work, and collaborate with others as you do so.

5. Awesome Public Datasets (GitHub)

Awesome Public Datasets is a list of high-quality, topic-centric public data sources. It includes free datasets across various fields, from agriculture to eSports.

6. KDnuggets

KDnuggets includes a collection of free datasets for machine learning, data science, and AI projects. You can also find links to other dataset repositories here.

Datasets for Machine Learning

These free machine learning datasets could be just what you need for your data science project. They offer real-world data to train, test, and enhance your models for various AI and data science projects.

1. NSW Electricity Market Price Prediction Dataset

This dataset contains electricity market data from New South Wales, Australia, collected between May 7, 1996, and December 5, 1998. The data was normalized by A. Bifet and is useful for time-series forecasting and price trend analysis.

2. Phishing Website Detection

Because phishing detection research lacks a universally agreed-upon feature set, this dataset highlights key attributes and introduces new predictive features.

3. SMS Spam Collection Dataset

The SMS Spam Collection is a dataset of 5,574 SMS messages in English, labeled as either ham (legitimate) or spam. This dataset is useful for building and evaluating machine learning models aimed at filtering unwanted messages and improving text classification systems.
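To give a feel for the kind of model this dataset supports, here is a minimal sketch of a Naive Bayes spam filter written with only the Python standard library. The six training messages are invented for illustration and are not taken from the SMS Spam Collection; the function names (`tokenize`, `train`, `predict`) are hypothetical helpers, not part of any library.

```python
import math
from collections import Counter

# Invented toy messages for illustration only; NOT rows from the real dataset.
TRAIN = [
    ("ham", "are we still meeting for lunch today"),
    ("ham", "call me when you get home"),
    ("ham", "see you at the office tomorrow"),
    ("spam", "win a free prize call now"),
    ("spam", "free entry claim your cash prize today"),
    ("spam", "urgent claim your free reward now"),
]


def tokenize(text):
    """Lowercase the message and split it on whitespace."""
    return text.lower().split()


def train(rows):
    """Count words per label and messages per label."""
    word_counts = {"ham": Counter(), "spam": Counter()}
    label_counts = Counter()
    for label, text in rows:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["ham"]) | set(word_counts["spam"])
    return word_counts, label_counts, vocab


def predict(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the log prior: the share of training messages with this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for unseen label/word pairs.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(predict("claim your free prize", model))
print(predict("see you at lunch", model))
```

With the real dataset you would replace `TRAIN` with the labeled messages and hold out a test split to measure accuracy.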

4. Students Adaptability Level in Online Education

This dataset is designed for predicting students’ adaptability levels in online education based on factors like demographics, academic background, internet access, and learning environment.

5. Fruit and Vegetable Disease Detection

This dataset contains a comprehensive collection of images categorized to assist in the development of machine learning models for detecting diseases in various fruits and vegetables.

6. Lending Club Loan Data

This dataset includes loan data from 2007 to 2015, detailing loan status, payment history, credit scores, finance inquiries, and collections. With 890,000 observations and 75 variables, it is useful for analyzing credit risk, borrower behavior, and loan performance.

7. Startup Success Prediction

A startup is a new business aiming for growth, often facing high uncertainty and failure rates. This dataset is designed to predict startup success or failure.

8. Employee’s Performance for HR Analytics

HR analytics involves using data to make informed decisions about human resources and employee-related matters. It can provide valuable insights to organizations in terms of employee engagement, productivity, and overall organizational effectiveness.

9. Spotify Most Popular Songs Dataset

If you’re a music enthusiast looking to explore audio features, lyrics, and metadata, this dataset is a solid pick for a data science project. Perfect for music analysis, trend discovery, and building recommendation systems.

10. Census Income

With census data on demographics and employment, this dataset is good for predicting whether an individual’s income exceeds $50K/year. Useful for classification tasks or for socioeconomic analysis.

11. Car Evaluation

With car features like price, capacity, and safety, this dataset is good for evaluating car acceptability. Useful for classification tasks, decision-making models, and recommendation systems.

12. Student Loan Status

This dataset is good for analyzing student loan trends, payment behavior, and academic engagement. Useful for predictive modeling, financial analysis, and education research.

13. Average Salary by Job Classification

This dataset provides average salaries by job title and grade for full-time regular employees. It offers insights into salary distribution across various roles.

14. Traffic Collision Driver Data

Understanding driver-related factors can help in accident prevention and risk assessment. This dataset provides details on motor vehicle operators involved in traffic collisions on county and local roadways, capturing key insights into crash incidents and driver behavior.

15. User Demographics and Purchase Behavior

Understanding customer trends and improving sales strategies starts with analyzing demographics and purchasing behavior. This dataset includes details like age, gender, and estimated salary, along with whether a user made a purchase.

16. Olympic Games Dataset

This dataset provides insights into athlete performance, country participation, and the evolution of the Olympics over time. It is valuable for sports analytics, data visualization, and machine learning applications.

17. Mobile Price Classification

In the competitive mobile market, pricing a product correctly is crucial. This dataset captures sales data from various mobile brands to analyze the relationship between device features and price ranges.

18. Student Dropout and Academic Success Prediction

Understanding and predicting student outcomes is key to reducing academic dropout rates. This dataset helps identify students at risk by analyzing enrollment data, including academic history, demographics, and socio-economic factors.

19. Animal Condition Classification

Covering animals across different species, this dataset helps you build predictive models that determine whether an animal’s condition is dangerous based on five key symptoms.

20. Wine Quality Prediction

Analyzing wine quality can help winemakers improve production and maintain consistency. This dataset includes key wine attributes, such as acidity and alcohol content, along with quality ratings and wine types.

Datasets for Data Visualization

Build data visualization projects with these helpful datasets. We looked at data with the potential for interesting visualizations and datasets that weren’t too messy or overly complex.

21. Supercell Data Analytics Take-home

This dataset includes revenue and sales data from Supercell and asks you to visualize a single aspect of the data that you find important.

22. X Nodes Dataset (Formerly Twitter)

With more than 11 million nodes and 85 million edges, this dataset is useful for building graphical relationship models of X users.

23. Hotel Booking Demand Data

This is a great dataset for visualizing hotel bookings. You’ll be able to build visualizations that answer questions like:

  • When’s the best time of year to book?
  • How long is the optimal stay length to receive the best rate?
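Before charting questions like these, you usually aggregate the bookings first. The sketch below groups nightly rates by arrival month and picks the cheapest month. The rows and the field names `arrival_month` and `nightly_rate` are invented for illustration; the real dataset's column names may differ.

```python
from collections import defaultdict
from statistics import mean

# Invented sample rows; field names are hypothetical, not the dataset's schema.
BOOKINGS = [
    {"arrival_month": "January", "nightly_rate": 80.0},
    {"arrival_month": "January", "nightly_rate": 90.0},
    {"arrival_month": "July", "nightly_rate": 150.0},
    {"arrival_month": "July", "nightly_rate": 170.0},
    {"arrival_month": "October", "nightly_rate": 110.0},
]


def average_rate_by_month(rows):
    """Group nightly rates by arrival month and average each group."""
    rates = defaultdict(list)
    for row in rows:
        rates[row["arrival_month"]].append(row["nightly_rate"])
    return {month: mean(values) for month, values in rates.items()}


averages = average_rate_by_month(BOOKINGS)
cheapest_month = min(averages, key=averages.get)
print(averages)
print(cheapest_month)
```

The resulting dictionary maps directly onto a bar chart with months on one axis and average rate on the other.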

24. Amazon Top 50 Bestselling Books 2009-2019

Design visualizations that show top authors, best-selling titles, and review ratings for the best-selling books on Amazon.

25. COVID Jobs Impact & Hiring Data

Visualize the impact COVID is having on hiring with this dataset from the Amazon Open Data Registry. It features regularly updated hiring data from 3+ million job organizations.

26. Latest Polls from FiveThirtyEight

If you’re interested in political visualizations, FiveThirtyEight is one of the best data sources. Its updated polling data is great for visualizing averages and polling movements.

27. U.S. International Trade in Goods and Services 1960-Present

Build charts to visualize the United States’ international trade, including top imports, top exports, and annual trade balances.

28. Euro Exchange Rates

This dataset is useful for Matplotlib visualizations. You can create visualizations of exchange rates and currency valuations over time. The dataset features more than 20 years of daily exchange rate data.

29. San Francisco Public Library Usage Data

This dataset has more than 400,000 records featuring daily circulation for the San Francisco library system. You can build visualizations related to new acquisitions, most checked-out authors, most checked-out titles, etc.

30. Trending YouTube Videos Data

This Kaggle dataset features daily trending video data from YouTube. Trending videos aren’t necessarily the most watched but are generally the most interacted-with videos. Visualizations include the most popular videos of the year or month or the most trending videos by the artist/creator.

31. World Unemployment Dataset

This dataset features more than 31 years of unemployment data for numerous countries worldwide. There is a wide range of visualizations you can create, including comparisons of countries, unemployment rates over time, or countries with the lowest unemployment.

32. NYC Subway Entries and Exits Data

This dataset originated on New York State Open Data and features station, line, location information, etc. You can use this dataset to visualize popular lines or subway maps.

33. 2021 Tokyo Olympics Dataset

This dataset contains information on more than 10,000 athletes in 40+ sports, and it’s a great source for building country medal count visualizations. There’s also coaching data, so you can add medal information by coach.

Exploratory Data Analysis

Say you want to take a big dataset and investigate. As you dive into the data, you discover patterns, trends, and anomalies. These datasets are perfect for exploratory data analysis projects because they contain large amounts of mostly clean data.

34. Airbnb Analytics Take-Home

This Airbnb dataset, part of a sample data analytics take-home, contains user information for bookings in Brazil.

35. Netflix Original Films & IMDB Scores

A fun dataset to explore and great for beginners, this features all of the Netflix original movies up to June 1, 2020, and their corresponding IMDb scores.

36. Stripe Analytics Take-Home

This Stripe dataset, which features product usage and marketing data, is perfect for diving into marketing and product analytics to determine how well a product performs.

37. Superstore Sales Dataset

Featuring 4 years of data from a superstore, this dataset is perfect for identifying trends and forecasting sales.

38. Home Depot Product Analysis Take-Home

This sample dataset from a Home Depot data science take-home can be used to produce a gross sales forecast for a new product launch.

39. Marketing Analytics Data

This dataset is made up of mock marketing analytics data used by master’s in business analytics students. A great source for a marketing analytics project.

40. Animal Shelter Analytics Data

This is a great dataset for surfacing actionable insights for animal shelters, including what factors led to successful animal outcomes.

41. Why Americans Don’t Vote: Non-Voter Data

Another FiveThirtyEight dataset, this one features survey data from non-voters in the U.S. A few project ideas are identifying key factors that result in non-voting or building a voting likelihood model.

42. Website Crawling Data

A sprawling dataset hosted on Amazon Web Services, the Common Crawl corpus features crawling data from billions of web pages. Check out the Example Projects page for ideas.

43. European Soccer Dataset

This dataset is useful for a sports analytics project. Featuring data on more than 20,000 matches and individual stats from 2008 to 2016, this is great for exploratory data analysis projects on line-ups, team stats, wins, and individual player stats.

44. Open Food Data

This large-scale dataset, which was originally developed in 2018, features product information for more than 600,000 food items. Data includes allergens, ingredients, and nutrition facts, and there is a wide range of data analytics projects you can do with it.

45. Social Media Influence Survey Data

This useful marketing analytics dataset features survey data from 2,500+ millennials. The survey asked respondents which social platform has influenced their online shopping the most.

46. Google Analytics Dataset

This dataset features Google Analytics metrics from the website of Austin, TX. It is great for practicing with Google Analytics data or analyzing website traffic.

47. Uber Pickup Data for NYC

This dataset features more than 20 million metrics on Uber pickups in NYC in 2014 and 2015. This is great for an exploratory data analysis or analytics project, and you can gather insights into popular pickup locations, common trip routes, and the locations with the longest pickups.

48. Marketing Analytics Dataset

This dataset is a great source for a campaign budget optimization project or for diving into exploratory data analysis for marketing analytics projects.

49. World Bank Dataset

This dataset contains a wide range of economic and social indicators for countries worldwide, including information about their GDP, population, and education levels.

50. Data Science Salaries

This dataset contains salaries for roles in the data science field for the year 2023. You can group the data by domain, years of experience, and even by country of employment, allowing many angles for exploratory analysis.

Health and Medical Datasets

Consider working on a data science project with these free healthcare and medical datasets. They can be used for disease prediction, patient outcome analysis, and medical research.

51. Heart Failure Prediction

The dataset comprises clinical records of patients with heart failure, detailing various medical attributes that may contribute to heart failure incidents.

52. Skin Disease Classification

This dataset contains medical images used for classifying various skin diseases, including Actinic Keratosis, Atopic Dermatitis, Benign Keratosis, Dermatofibroma, Melanocytic Nevus, Melanoma, Squamous Cell Carcinoma, Tinea Ringworm Candidiasis, and Vascular Lesions.

53. Medical Insurance Cost Prediction

A medical insurance company aims to deepen its understanding of the factors impacting individual health insurance costs. They have gathered a dataset featuring diverse individual profiles, encompassing details such as age, gender, BMI, number of children, smoking status, and region.

54. Heart Disease Prediction

This dataset contains information related to individuals and their risk factors for heart disease. The data includes demographic information such as age and gender, as well as medical history, lifestyle factors, and symptoms associated with heart disease.

55. Global Trends in Mental Health Disorder

This dataset provides global prevalence data on mental health disorders like schizophrenia, depression, anxiety, and substance use disorders. It offers insights into their impact, helping to better understand these conditions and their implications.

56. DARWIN Handwriting Dataset for Alzheimer’s Detection

The DARWIN (Diagnosis AlzheimeR WIth haNdwriting) dataset includes handwriting data from 174 participants, covering both patients and healthy individuals. It supports research in handwriting-based cognitive assessment.

57. Brain Tumor Dataset

The dataset is designed for machine learning projects focused on predicting the type and severity of brain tumors, as well as understanding various treatment methods and patient outcomes. It contains simulated data for brain tumor diagnosis, treatment, and patient details.

58. Rabies Diagnosis and Outbreak Prediction

This dataset supports real-time and remote rabies diagnosis in humans and animals, especially in low-resource settings. It includes 12,684 observations across three datasets and can be used for outbreak prediction and resource planning, such as estimating vaccine needs.

59. Intraoperative Anesthesia and Outcomes Dataset

Intraoperative anesthesia data addresses gaps in electronic records in low- and middle-income countries (LMICs). It provides key data elements essential for clinical decision-making and research, helping improve patient outcomes and surgical care.

60. Disease Symptom Prediction

This dataset includes diseases, their symptoms, precautions, and associated weights. It is structured for easy cleaning and processing, making it useful for disease prediction models and healthcare analytics.

Finance and Economic Datasets

If you’re looking into finance and economics for your data science project, these datasets can help you analyze market trends, stock performance, consumer spending, and economic indicators for data-driven insights and predictive modeling.

61. Credit Risk Classification

This dataset classifies people described by a set of attributes as having good or bad credit risks.

62. Bank Marketing Campaign

The dataset includes various client and campaign-related attributes, with the classification goal of predicting whether a client will subscribe to a term deposit. It is useful for customer behavior analysis and predictive modeling in marketing.

63. Fraud Detection

Fraudulent transactions are a growing challenge for fintech companies. This dataset captures 51,000+ transactions, each labeled as fraudulent or legitimate, based on real-world patterns.

64. Bank Customer Churn Prediction

This dataset is used for predicting customer churn in the banking industry. It contains information on bank customers and their churn status.

65. Consumer Preferences in Cultural Products

Understand and predict consumer choices in cultural product design with this dataset. It combines purchase behavior, social media interactions, sentiment analysis, and cultural relevance factors.

66. Mobile Price Classification

In the competitive mobile market, pricing a product correctly is crucial. This dataset captures sales data from various mobile brands to analyze the relationship between device features and price ranges.

67. Global Economy Indicators

This dataset provides national accounts data from 1970 onward for over 200 countries, compiled through a global collaboration led by the UN Statistics Division. It includes key economic indicators, making it valuable for analyzing economic trends, policy research, and financial forecasting.

68. Apple Inc. Stock Prices

If you want to analyze Apple Inc.’s stock performance, market trends, and financial patterns, this dataset provides historical stock prices sourced from Yahoo Finance.

69. Currency Exchange Rates

This dataset provides historical foreign exchange rates for various currencies, making it useful for time series analysis and financial forecasting.

70. Travel Insurance Prediction

This dataset captures travel insurance purchase decisions, showing which customers opted for insurance and which did not. It’s useful for analyzing customer behavior, risk assessment, and building predictive models for insurance uptake.

71. Consumer Complaint Database

Complaints from customers about financial products and services provide valuable insights into consumer issues, financial trends, and company responses. This dataset helps in identifying patterns, assessing service quality, and improving customer satisfaction in the financial sector.

72. Sales Transactions Weekly Dataset

Analyze sales trends and forecast demand with this dataset, which tracks weekly purchase quantities for over 800 products over 52 weeks.
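A simple baseline for demand forecasting on weekly data like this is a moving average: predict next week's quantity as the mean of the last few weeks. The sketch below assumes a 4-week window and uses invented weekly quantities, not values from the dataset; `moving_average_forecast` is a hypothetical helper name.

```python
def moving_average_forecast(history, window=4):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    recent = history[-window:]
    return sum(recent) / window


# Invented weekly purchase quantities for one product (illustration only).
weekly_units = [12, 15, 11, 14, 16, 13, 15, 16]
forecast = moving_average_forecast(weekly_units, window=4)
print(forecast)
```

More sophisticated models should beat this baseline on held-out weeks; if they do not, the extra complexity is probably not paying off.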

73. U.S. Inflation Dataset

This dataset tracks monthly U.S. Consumer Price Index (CPI) data, averaging CPI across all U.S. cities. It’s useful for analyzing inflation trends and economic shifts.
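Inflation is usually reported as the year-over-year percent change in CPI, that is, (CPI this month / CPI in the same month a year earlier - 1) × 100. Here is a minimal sketch of that calculation. The index values are invented for illustration and are not real CPI figures, and keying the data by `(year, month)` tuples is an assumption about how you might load it.

```python
def year_over_year_inflation(cpi_by_month):
    """Return {(year, month): percent change versus the same month one year earlier}."""
    rates = {}
    for (year, month), value in cpi_by_month.items():
        previous = cpi_by_month.get((year - 1, month))
        if previous is not None:  # skip months with no value a year earlier
            rates[(year, month)] = (value / previous - 1) * 100
    return rates


# Invented index values for illustration; NOT real CPI data.
cpi = {
    (2020, 1): 200.0,
    (2020, 2): 202.0,
    (2021, 1): 210.0,
    (2021, 2): 208.06,
}
rates = year_over_year_inflation(cpi)
for period, rate in sorted(rates.items()):
    print(period, round(rate, 2))
```

Months in the first year of the series have no earlier value to compare against, so they are left out of the result.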

74. Company Bankruptcy Prediction

This dataset contains financial data from the Taiwan Economic Journal (1999–2009) to analyze and predict company bankruptcy. Bankruptcy status is determined based on Taiwan Stock Exchange regulations and can be used for financial risk assessment, credit scoring, and predictive modeling.

75. Telco Customer Churn Prediction

Customer churn is a critical issue for telecom companies, and this dataset helps analyze factors influencing customer retention. It includes data on 7,043 customers in California, covering demographics, service usage, satisfaction scores, churn scores, and Customer Lifetime Value (CLTV).

76. Valentine’s Day Consumer Dataset

Find out how consumers celebrate Valentine’s Day with this dataset, which includes over a decade of survey data from the National Retail Federation.

77. Most Watched Stocks of Past Decade (2013-2023)

Explore this dataset featuring major U.S.-traded companies alongside emerging, innovative firms that recently went public. It covers industries such as social media, e-commerce, streaming, electric vehicles, cloud computing, and more, offering insights into market trends and business growth.

78. Wholesale Customer Spending Dataset

This dataset contains annual spending records of clients from a wholesale distributor, categorized by different product types. It can be used for customer segmentation, spending pattern analysis, and business insights.

79. U.S. Productivity and Hourly Compensation (1948-2021)

This dataset tracks trends in productivity and hourly wages in the United States from 1948 to 2021. It includes overall compensation data as well as insights into earnings for production and nonsupervisory workers.

80. Supermarket Sales Dataset

This dataset captures historical sales records from a supermarket chain operating across three branches over a three-month period. It is ideal for predictive analytics, demand forecasting, and business intelligence applications.

Datasets for Natural Language Processing

Natural language processing (NLP) involves text or speech data used to train models for understanding and processing language. Use the datasets below for your data science projects, including sentiment analysis, speech recognition, and text classification.

81. Real or Fake Job Posting Prediction

This dataset includes 18K job descriptions, with around 800 labeled as fake. It contains both textual data and meta-information about the jobs. The dataset can be used to train classification models to detect fraudulent job postings.

82. Twitter and Reddit Sentiment Analysis

This dataset contains labeled tweets from Twitter and labeled comments from Reddit. It is useful for analyzing sentiment trends across multiple social media platforms.

83. Amazon Product Reviews

If you’re looking to work on sentiment analysis or recommender systems, the Amazon Product Reviews dataset is a solid pick. It includes reviews, ratings, and metadata, making it great for consumer behavior insights and trend analysis.

84. Consumer Sentiments and Ratings

You can use this dataset for sentiment classification or for consumer behavior analysis. It contains customer reviews for various products, including details about product categories, brands, user ratings, and sentiments.

85. Fake News vs. Real News Classification

This dataset is good for fake news detection, text classification, and NLP. It includes news articles with titles, body text, subjects, and publish dates.

86. AI vs. Human Text

With AI-generated text becoming more common, this dataset is great for analyzing the differences between AI and human writing. It includes 500K essays from both, making it useful for text classification and authorship detection.

87. Emotion Detection from Text

Humans express a variety of emotions, and you might be curious about what emotion a message conveys. This dataset provides labeled text for emotion detection, making it useful for building models to classify emotions in text.

88. Language Translation (English-French)

If you’re into language translation, this dataset is a great pick. It contains English and French text pairs, making it useful for training NLP models for machine translation.

89. Phishing and Legitimate URLs

Phishing URLs are a major cybersecurity threat, and if you want to detect or classify malicious websites, this dataset is a solid pick. It contains 800K+ URLs, split between legitimate and phishing domains, making it useful for training models to identify online threats.

90. Sarcasm Detection

Sarcastic headlines are often used as clickbait, making sarcasm detection an interesting NLP challenge. If you want to classify sarcasm in news headlines, this dataset is a solid pick. It can help media houses analyze engagement strategies and improve search engine recommendations.

91. IMDB Reviews

With 50,000 highly polar movie reviews, this dataset is good for sentiment analysis, text classification, and NLP model training. It includes both raw text and bag-of-words formats, making it useful for various natural language processing tasks.
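If the bag-of-words format is new to you, it simply represents a text as word counts and ignores word order. Here is a minimal sketch using the standard library. The review sentence is invented, and the simple regex tokenizer is an assumption rather than the preprocessing used to build the dataset.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, extract word tokens, and count occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Invented example sentence; not a review from the dataset.
review = "A great film. Great cast, great story, and not a dull moment."
bow = bag_of_words(review)
print(bow.most_common(2))  # [('great', 3), ('a', 2)]
```

A sentiment model then learns which word counts tend to appear in positive versus negative reviews.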

92. Sentence Classification

The dataset features annotated sentences from abstracts and introductions of 30 research papers in biology, machine learning, and psychology. Ideal for argument classification, NLP, and academic text analysis.

93. Online Influencer Marketing

Uncover what drives success in social media marketing with this dataset, which provides insights into the marketing strategies of top online influencers. It includes survey responses detailing target audience segments, market descriptions, and strategies used for engagement and growth.

94. Fact-Checker Tweets

This dataset contains tweets posted by fact-checkers between December 1, 2019 and June 2, 2020. It can be used for analyzing misinformation trends or fact-checking patterns.

95. Boston Airbnb Reviews

Gain an in-depth look into Boston Airbnb experiences through past customer reviews. This detailed dataset offers valuable perspectives on Airbnb stays.

96. Customer Service Dataset

This dataset can be used to help improve customer service responses to be clearer and more accurate, improving communication with customers.

97. People’s Speech Dataset

This dataset includes over 30,000 hours of transcribed speech from diverse speakers. It provides a valuable resource for training speech-to-text models.

98. Japanese Anime Speech

You can use this dataset to improve automatic speech recognition models for anime and Japanese media, as it captures unique speech patterns and linguistic features from visual novels.

99. YouTube Comments Dataset

Analyze sentiment in YouTube comments with this dataset, which includes over one million labeled comments across topics like programming, news, sports, and politics.

100. Restaurant Reviews

Customer reviews provide insights into whether customers enjoyed their meal, which can be used for analyzing sentiment and customer preferences.

Datasets for Computer Vision

Computer vision enables machines to interpret and understand visual data. These computer vision datasets can help you train models for image classification, object detection, image recognition, and more.

101. Image Segmentation

The dataset consists of instances randomly sampled from seven outdoor images. Each image was manually segmented to classify every pixel, and each instance represents a 3×3 pixel region.

102. License Plate Detection

This dataset is useful for object detection tasks such as license plate detection and recognition. It contains 1695 images, each with bounding box annotations for license plates.

103. Facial Expression Recognition

The dataset is designed for research in emotion recognition and facial expression analysis across diverse races, genders, and ages.

104. Plant Disease Recognition

This dataset contains three labels, “Healthy,” “Powdery,” and “Rust,” referring to plant conditions. There are 1530 images in total, divided into train, test, and validation sets.

105. Road Accidents from CCTV Footage

The dataset consists of real-time CCTV footage annotated for accident detection and severity classification. It includes video frames with accident labels (Accident/Non-Accident) and severity levels (Minor, Substantial, Critical).

106. Vehicle Detection

This dataset contains images labeled as “Vehicles” and “Non-Vehicles,” suitable for object detection and classification tasks.

107. ImageNet Dataset

ImageNet is an image database organized according to the WordNet hierarchy. It has over 100,000 phrases and an average of 1000 images per phrase. It’s suitable for image classification and object detection.

108. Gesture Phase Segmentation Dataset

With gesture data from 7 videos, this dataset is good for analyzing motion and segmenting gestures. Useful for computer vision and human-computer interaction.

109. Urban Land Cover Classification

Using high-resolution aerial imagery, this dataset is good for classifying urban land cover to support sustainable urban planning.

110. Newspaper & Magazine Segmentation

Ideal for segmentation tasks, this dataset includes 101 scanned pages from Russian newspapers and magazines with pixel-based ground truth masks. Useful for document analysis, OCR, and computer vision projects.

111. Waste Classification Dataset

It contains images of waste items in their natural, unaltered state, enabling comparisons between models trained on ideal objects versus actual discarded materials.

112. Fake Image Dataset

AI-generated images are now more common than ever. Use this dataset to analyze and determine whether an image is real or fake.

113. Bird vs. Drone Detection

Misidentifying drones as birds can pose risks to people, infrastructure, and air traffic. This dataset contains diverse images of birds and drones in motion, helping to improve detection and classification models.

114. Drunk vs. Sober Dataset

This dataset contains infrared images of people in both drunk and sober states. It can be used for research on impairment detection and computer vision applications.

115. Fruits and Vegetable Image Recognition

This dataset contains images of different fruits and vegetables, making it useful for image recognition tasks. It was created to help develop applications that identify food items from photos and suggest recipes based on the detected ingredients.

116. Car Object Detection

This dataset contains media of cars from various angles, and you can create an algorithm to detect and identify cars in different scenarios.

117. Face Detection

It contains high-quality images with carefully annotated bounding boxes, available in both pixel coordinates and YOLO format for easy integration into various models.
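The two annotation formats are easy to convert between. YOLO labels store a box as its center point plus width and height, all divided by the image dimensions so the values fall between 0 and 1. The sketch below assumes the pixel format is `(x_min, y_min, x_max, y_max)`, which is a common convention but is not confirmed by the dataset description; check the dataset's documentation before relying on it.

```python
def pixel_to_yolo(box, image_width, image_height):
    """Convert (x_min, y_min, x_max, y_max) in pixels to normalized
    YOLO (x_center, y_center, width, height)."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return x_center, y_center, width, height


def yolo_to_pixel(box, image_width, image_height):
    """Convert normalized YOLO (x_center, y_center, width, height)
    back to (x_min, y_min, x_max, y_max) in pixels."""
    x_center, y_center, width, height = box
    x_min = (x_center - width / 2) * image_width
    y_min = (y_center - height / 2) * image_height
    x_max = (x_center + width / 2) * image_width
    y_max = (y_center + height / 2) * image_height
    return x_min, y_min, x_max, y_max


# Invented box on a hypothetical 512x512 image.
pixel_box = (128, 64, 384, 320)
yolo_box = pixel_to_yolo(pixel_box, 512, 512)
print(yolo_box)  # (0.5, 0.375, 0.5, 0.5)
print(yolo_to_pixel(yolo_box, 512, 512))
```

A full YOLO label line also starts with a class index, which is omitted here to keep the focus on the coordinates.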

118. Driver Behaviour Dataset

Use this dataset for behavior analysis and driver monitoring, covering various driving scenarios and environments to improve traffic and road safety.

119. X-ray Images for X-ray Testing

X-ray testing is used to detect anomalies and hidden defects. You can use this dataset for developing computer vision models that analyze X-ray images for automated detection and classification tasks.

120. Face Mask Detection

Masks help protect individuals from respiratory diseases and were a key precaution during COVID-19. This dataset allows you to build a model to detect whether people are wearing masks correctly, incorrectly, or not at all.

Datasets for Pricing Optimization

Pricing optimization is one of the most important levers for increasing revenue with data. Try to identify prices that maximize revenue for these different products and environments:

121. Online Retail Dataset

Over 1 million transactions from an online retailer, including customer data, product data, and transaction data.

122. Avocado Prices

Weekly retail prices and volume data for avocados in various US markets from 2015 to 2018.

123. Beer Consumption in Sao Paulo

Data on beer consumption and prices in Sao Paulo, Brazil, from 2015 to 2018.

124. New York Airbnb Dataset

Information on Airbnb listings in New York City, including listing prices and attributes such as location, number of bedrooms, and amenities.

125. Uber Pickups in New York City

Information on Uber pickups in New York City from 2014 to 2015, including pickup times and locations.

126. Diamonds Dataset

Information on over 53,000 diamonds, including their cut, color, clarity, carat weight, and price.

127. Walmart Sales Forecasting Dataset

Weekly sales data for 45 Walmart stores across the US from 2010 to 2012, including information on promotions, holidays, and weather conditions.

128. Taxi Trip Pricing

Using this dataset, you can build predictive pricing models, especially around surge pricing.

129. Used Car Price Estimation

Using this dataset, you can build a predictive model for used car pricing. The dataset includes various features such as the car’s make and model, location, year of manufacture, kilometers driven, fuel type, transmission, owner type, mileage, engine size, power, and the number of seats. These attributes can be used to analyze how different factors influence the resale value of cars.

130. Health Insurance Price Predict

Using this dataset, you can build predictive models to estimate health insurance costs based on various factors such as age, gender, BMI, number of children, smoking status, and region. These models can help insurance companies better understand the risk factors associated with individual clients and set more accurate premiums.

By analyzing patterns within the data, you can also explore how different variables interact with each other to influence the overall charges, potentially identifying key drivers behind high insurance costs.
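For the revenue-focused projects in this section, one common starting point is to fit a demand curve and solve for the price that maximizes revenue. The sketch below assumes demand is linear in price, q = a + b·p with b < 0, which is a simplifying assumption and will not hold for every product. Under that assumption, revenue p·(a + b·p) peaks at p = -a / (2b). The price and quantity values are invented for illustration, and both function names are hypothetical.

```python
def fit_linear_demand(prices, quantities):
    """Least-squares fit of quantity = intercept + slope * price."""
    n = len(prices)
    mean_p = sum(prices) / n
    mean_q = sum(quantities) / n
    covariance = sum((p - mean_p) * (q - mean_q) for p, q in zip(prices, quantities))
    variance = sum((p - mean_p) ** 2 for p in prices)
    slope = covariance / variance
    intercept = mean_q - slope * mean_p
    return intercept, slope


def revenue_maximizing_price(intercept, slope):
    """Price where revenue p * (intercept + slope * p) is highest."""
    if slope >= 0:
        raise ValueError("demand must fall as price rises for a maximum to exist")
    return -intercept / (2 * slope)


# Invented observations that follow q = 100 - 10p exactly (illustration only).
prices = [1.0, 2.0, 3.0, 4.0]
quantities = [90.0, 80.0, 70.0, 60.0]

intercept, slope = fit_linear_demand(prices, quantities)
best_price = revenue_maximizing_price(intercept, slope)
best_revenue = best_price * (intercept + slope * best_price)
print(intercept, slope, best_price, best_revenue)
```

Note that the suggested price here lies outside the observed price range, so in a real project you would treat it as a hypothesis to test rather than a safe recommendation.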

Conclusion

Datasets are important for data science projects because they help with analysis, model training, and real-world applications. With so many free datasets available, you can test ideas, find insights, and create useful solutions. Explore these datasets and start bringing your data science project ideas to life!