Top Social Media Datasets


Collecting and combining good data is a challenge every data scientist faces. Even more difficult, however, is turning that data into projects that generate compelling insights and inspire new discussion. As data scientists, we define good data as raw, uncompromised, and observed in its natural state. This is where social media datasets come into play.

Social media datasets have several advantages over other datasets. For one, they contain information that is already linked, organized (e.g., by user profile) and abundant. Moreover, data shared through websites such as Twitter and Reddit tends to be unfiltered, which businesses can leverage to build unadulterated datasets for sentiment analysis or research.

A dataset is a collection of related data points that can be accessed as a single unit, grouped by shared characteristics, or modified individually. Datasets, at their core, are untapped potential waiting to be utilized. As data scientists, we should be able to use datasets for the following:

  • Generating sentiment analysis.
  • Training neural networks.
  • Creating business insights (which drive business decisions).

Twitter Datasets And Project Ideas

Twitter is often cited as one of the best places to explore datasets. Not only are existing datasets numerous, but Twitter’s user base is extensive and is home to communities with polarized ideologies, as well as raw and genuine opinions.

The authenticity of Twitter datasets makes them great for businesses to explore trends that drive business decisions.

We have gathered five datasets that you can build a project around (linked below). The datasets vary in specificity, and they cover:

  • Overspending Analysis Take-Home
  • 2015 New Year’s Resolutions
  • Sentiments on Climate Change
  • Apple (FAANG) 2016 Sentiments
  • The US 2016 Elections Twitter Analysis

These five datasets are seemingly random and unrelated, but this only serves to highlight how much variety you can get from Twitter.

1. Twitter: Overspending Analysis Take-Home

This social media dataset doesn’t focus on users. Instead, it’s more of a product case study challenge. Specifically, the data provided relates to Twitter’s advertising business.

You’re provided a dataset of campaigns, budgets, and impressions, and you’re asked to measure the data generated by an A/B test on a new advertising product.

This take-home tests your ability to:

  • Process social media advertising data
  • Calculate and measure the effectiveness of an A/B test
  • Determine the winner of an A/B test

Overall, this Twitter data science take-home is a product case study: you’re asked to determine how a new product compares to an old product in an A/B test.
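One common way to decide the winner of an A/B test like this is a two-proportion z-test. The sketch below is only an illustration, not the take-home's official solution: the function name and the counts are made up, and it assumes the outcome of interest can be expressed as a rate in each group (for example, the share of campaigns that overspent their budget).

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for H0: both groups share the same rate."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    # Pooled rate under the null hypothesis that the groups do not differ.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value


# Hypothetical counts: overspending campaigns in control (a) and treatment (b).
z, p = two_proportion_z_test(success_a=120, total_a=1000,
                             success_b=90, total_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A negative z here means the treatment group has the lower rate. Whether that difference is large enough to matter is a product decision, so the p-value should be read alongside the effect size rather than on its own.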

2. 2015 New Year’s Resolutions

One of the most organized and detailed social media datasets for Twitter compiles tweets related to 2015 New Year’s resolutions. While 2015 resolutions might seem outdated, many people who make New Year’s resolutions abandon them partway through the year and, as a result, commit to the same promises year after year.

This dataset is helpful because it can help businesses decide when to scale their operations up or down during the year. For example, a quick skim through the dataset will reveal the prevalence of gym and weight-loss-related resolutions at the beginning of the year, just after the resolutions are made.

A project that predicts the rise or fall in demand for certain services can drive business decisions about how to shift service supply throughout the year. Moreover, the dataset is straightforward to navigate, as it sorts the resolutions into relevant categories (for example, philanthropy, health and fitness, and personal growth).

3. Climate Change Twitter Sentiment Analysis

On July 19, 2022, the UK recorded its highest temperature on record. Around the same time, reports of melting ice caps in the Arctic spread around the internet. Research papers dating back to 1896 have helped establish the reality of climate change.

Despite the growing body of scientific evidence, social media shows that some people still doubt the link between carbon emissions and climate change.

For one of your projects, you can perform a sentiment analysis on climate change using this dataset provided by CrowdFlower. This social media dataset contains a column that specifies whether a tweet suggests acceptance or rejection of the climate change phenomenon.

4. Apple (FAANG) 2016 Sentiment Analysis

In January 2022, Apple became the world’s first $3 trillion company. Many factors led to this point, but it has long been understood that Apple’s valuation rides on the public perception of its products as “premium”. This is based in no small part on the products’ build quality and innovative marketing. However, does Twitter share the same opinion?

2016 was a big year for Apple, which introduced many novel products with features still prevalent today. These include:

  • iPhone SE, Apple’s budget iPhone lineup.
  • 9.7-inch iPad Pro, Apple’s first attempt at an iPad Pro in a smaller form factor.
  • iPhone 7 and iPhone 7 Plus, the latter of which sported a dual-camera setup, and which also started the trend of removing the headphone jack.
  • Apple Watch Series 2.
  • AirPods, which were controversial due to the headphone jack issue.
  • The 2016 MacBook Pro, which introduced the contentious Touch Bar as well as the second-generation butterfly keyboard, which was plagued with quality control issues.
  • New software innovations and operating system updates.

2016 is the perfect year for a sentiment analysis project on Apple, as the company introduced many features and hardware changes that created a large divide in the tech community. To this day, there is still no consensus on the Touch Bar.

It would also be interesting to see how opinions have shifted since 2016.

5. US 2016 Elections and ISIS Twitter Analysis

The US 2016 Elections, alongside the peak of the terrorist group ISIS’s self-declared caliphate, were two of the most talked about events and phenomena of 2016, with the elections garnering approximately 60 million tweets and ISIS gaining a quarter of that. In the same year, Paragon Science Inc. compiled a list of all related Twitter networks that mention or are related to the US 2016 elections alongside tweets that heavily emphasize ISIS.

Paragon Science Inc. compiled this social media dataset by employing dynamic anomaly detection technology, a deviation from the chaos theory and dynamical theory that were most often employed to understand these types of phenomena.

The dataset is a lovechild of many methods, utilizing sentiment analysis, network analysis, community detection, and topic detection.

We can use this dataset to develop the following projects:

  • Using natural language processing (NLP), we can gather keywords related to either the US elections or ISIS.
  • The keywords we gather can also feed an exploratory data analysis project that visualizes related keywords and finds weakly or strongly connected graphs.
  • Another project we can explore is the creation of an algorithm that produces political tweets.
  • Moreover, we can also do a statistical data analytics project to identify what factors aided Trump’s victory and Clinton’s loss in the 2016 election.
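As a starting point for the keyword project, a simple frequency count often goes a long way. The following is a minimal sketch using only the Python standard library; the stopword list and the sample tweets are invented for illustration, and a real project would likely use a fuller stopword list and the dataset's actual tweet text.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real project would use a larger one.
STOPWORDS = {"the", "and", "for", "this", "that", "with", "you", "are"}


def top_keywords(tweets, k=3):
    """Return the k most frequent tokens (hashtags and mentions included)."""
    counts = Counter()
    for tweet in tweets:
        # Keep an optional leading # or @ so hashtags and mentions stay distinct.
        tokens = re.findall(r"[#@]?\w+", tweet.lower())
        counts.update(t for t in tokens if len(t) > 2 and t not in STOPWORDS)
    return counts.most_common(k)


# Made-up sample tweets.
sample_tweets = [
    "Vote in the election today",
    "The election debate is tonight",
    "Debate night!",
]
print(top_keywords(sample_tweets, k=2))
```

The same counts can be reused for the exploratory project: tokens that appear together in one tweet can be linked as edges in a keyword graph.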

Facebook Datasets And Project Ideas

If you are actively looking at tech stocks, you might encounter quite a bit of doom-posting about the fall of Facebook. This pessimism does not match the active user numbers, however: currently, 69% of Americans, 79% of Canadians and 329 million people in India alone still actively use Facebook. Moreover, in the 35 to 44 age demographic, Facebook remains the dominant social media site.

For businesses that are explicitly targeting developing countries and/or Millennials in the 35 to 44 age demographic, Facebook datasets might be the most helpful.

For Facebook, we have four datasets to get started with:

  • Exploratory Data Analysis: General Facebook Dataset
  • SNAP Facebook Social Circles
  • Political Advertisements from Facebook
  • A Dataset Compiling Removed Domestic Facebook Pages

1. Exploratory Data Analysis: General Facebook Dataset

A dataset that explores the demographic of Facebook users allows data scientists to grasp the contours of the current active community. Because the dataset contains a lot of general information, it might not be as valuable for a project that requires specific statistics and user preferences.

However, this social media dataset holds comprehensive information such as age, year born and birth date, gender, friend count and likes. Because of that, this dataset can be used for exploratory data analysis.

Exploratory data analysis allows you to explore the data with visual techniques, check trends, confirm assumptions and get a general gist of the data.

2. SNAP Facebook Social Circles

SNAP is an initiative by Stanford University and stands for Stanford Network Analysis Platform. SNAP primarily offers social media datasets for network analysis, as well as libraries for C++ and Python. However, due to the sheer amount of data in SNAP datasets, it is recommended to use the faster C++ libraries.

C++ tends to perform better because it is compiled to native machine code ahead of time, while Python is interpreted, which adds overhead to every operation. Many Python libraries narrow this gap by delegating heavy computation to C or C++ code, but pure Python code still runs considerably slower on large graphs.

SNAP’s social media datasets are not limited to Facebook; they also cover Reddit, Twitter and more. However, for this section, we will focus on their Facebook social circles dataset.

The SNAP Facebook dataset is focused on friend lists (i.e., social circles) commonly used in ego networks. An ego network is built around a focal node (a central node) called the “ego,” with its social connections surrounding it as “alters.” The relationships between alters are also specified.

These alters can, in turn, become focal nodes themselves. Moreover, the SNAP dataset also includes node “features” (e.g., political party) but anonymizes them so that they are unidentifiable. For example, political party A will be labeled “1,” and party B will be labeled “0.”

Projects focusing on an analysis with a particular feature in mind will not do well with this dataset. For instance, if an organization prioritizes campaigns for the supporters of a specific political party, it will be challenging as the dataset itself does not specify which users belong to which political party.

However, projects that analyze the divisiveness and the clustering of nodes (e.g., how friend groups influence the appearance of certain features) will be a perfect fit for this dataset.
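To make the ego, alter and alter-to-alter relationships concrete, here is a minimal sketch that extracts an ego network from an undirected edge list. It uses only the Python standard library; the function names and the toy edge list are hypothetical, and the real SNAP files would first need to be parsed into pairs of node ids.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def ego_network(adjacency, ego):
    """Return the ego's alters and the edges that connect alters to each other."""
    alters = set(adjacency[ego])
    alter_edges = {
        (u, v)
        for u in alters
        for v in adjacency[u]
        if v in alters and u < v  # u < v keeps each undirected edge once
    }
    return alters, alter_edges


# Toy graph: node 0 is the ego; 1, 2 and 3 are its friends.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
adjacency = build_adjacency(edges)
alters, alter_edges = ego_network(adjacency, ego=0)
print(alters, alter_edges)
```

Calling `ego_network` with any alter as the `ego` argument yields that node's own ego network, which is one way to study how friend groups overlap.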

3. Political Advertisements from Facebook

Advertisements that advocate social change, whether political or non-political, are grouped into the “political ad campaigns list.” As such, it is vital to distinguish these two categories from each other.

In this dataset, various ad campaigns from Facebook were collected, including their HTML code, metadata and their “political” or “not political” classification. Do note that the classification is not made by hand but rather by a machine learning classifier.

One project idea is to use this dataset to build neural networks that improve on the current classifier. Moreover, you can use the dataset to train an AI that develops political ad campaigns from scratch (similar in spirit to DALL·E mini, which is admittedly an ambitious project). The dataset also contains the necessary metadata and HTML code that one can use for training neural networks or building a classification project.

4. A Dataset Compiling Removed Domestic Facebook Pages

In 2018, Facebook removed hundreds of US political pages for inauthentic activity (i.e., political spam). The social media dataset below contains daily engagement metrics of three specific removed political pages:

  • Right Wing News
  • Daily Vine
  • Silence is Consent

The data starts from February 14, 2011, and ends on December 9, 2018. The dataset also contains the number of interactions per day, including shares and likes, as well as a day-on-day growth statistic.

One project idea is to create and analyze the trends and growth of the said Facebook pages through a thorough look at the engagement statistics. Alternatively, a pseudo-sentiment analysis (without the use of NLP) can be introduced by analyzing the Facebook reactions (Like, Love, Haha, Wow, Sad, Angry) on these pages using posts from 2016 to 2018 (as Facebook introduced the reactions menu as an alternative to the singular Like button in 2016).
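The reaction-based approach can be sketched as a weighted average over reaction counts. The weights below are purely an assumption made for illustration (how positive a "Haha" or a "Wow" is depends heavily on context), and the sketch assumes you have per-post reaction counts available, which you should verify against the dataset's actual columns.

```python
# Assumed weights: +1 is clearly positive, -1 is clearly negative.
REACTION_WEIGHTS = {
    "like": 0.5,
    "love": 1.0,
    "haha": 0.5,
    "wow": 0.0,
    "sad": -1.0,
    "angry": -1.0,
}


def reaction_score(counts):
    """Return a score in [-1, 1] from a dict of reaction counts."""
    total = sum(counts.get(name, 0) for name in REACTION_WEIGHTS)
    if total == 0:
        return 0.0  # no reactions, so treat the post as neutral
    weighted = sum(
        weight * counts.get(name, 0)
        for name, weight in REACTION_WEIGHTS.items()
    )
    return weighted / total


# Made-up reaction counts for a single post.
post = {"like": 40, "love": 10, "angry": 50}
print(reaction_score(post))
```

Averaging this score per day or per page gives a rough sentiment trend that can be compared across the three pages.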

Reddit Datasets And Project Ideas

While Facebook and Twitter get most of the credit as the conventional platforms for gathering social media datasets, Reddit is a unique platform that many organizations and data scientists underutilize, often overlooking it because of its unfamiliar setup.

In spite of its different format, Reddit’s community-first approach is often more useful in generating business insights on target consumers once you account for the more than 50,000 niche communities (subreddits). While Reddit may not deliver the most extensive datasets, its social media datasets are collected from a user base that is easier to confirm as relevant, as users self-select through the communities they choose to join.

For Reddit, we cover the following datasets:

  • r/financialindependence subreddit demographic dataset.
  • Redditor Demographic 2011 Survey.
  • Reddit growth dataset.
  • Reddit AAPL (Apple Inc.) stock sentiment analysis.

1. r/financialindependence Subreddit Demographic Dataset

One of the most extraordinary things Reddit datasets offer is the ability to look at a subreddit and its members’ demographic and be able to grasp a product’s demand and outlook.

For example, when interested in learning which online games have the most active communities, you can quickly look up the membership numbers of a specific game’s subreddit and compare the numbers to other competing games.

r/financialindependence is a subreddit full of individuals aiming to become financially independent, meaning they wish to generate income without having to actively work for it (passive income streams). This social media dataset contains the subreddit’s demographics (age, income, marital status, country of residence), allowing you to create insights that help businesses better serve such demographics.

Organizations that benefit most from projects like these are companies offering opportunities to generate passive income (e.g., listing real estate on platforms such as Airbnb, investing through platforms such as Binance, and more). Specifically, you can use this social media dataset to predict whether a particular person might be interested in downloading apps like Binance.

2. Redditor Demographic 2011 Survey: Exploratory Data Analysis

The following dataset is akin to Facebook’s exploratory data analysis project but differs in one crucial aspect. Because Reddit is built around communities, this dataset is much more substantial. Aside from providing the general gist of Reddit’s user base, it also gives insight into how a person of a particular demographic might favor specific subreddits over others.

The 2011 Survey contains information about the user such as their age range, education, country of residence, income, and most importantly, their favorite subreddit. Because of how robust it is, we can do the following projects:

  • A project that helps predict, based on demographic, what subreddit a user is likely to be a part of.
  • A project that can help brands determine which subreddits are most relevant to their businesses.
  • A project that determines which subreddits have the highest- and lowest-earning members.

As you can see, the dataset is incredibly flexible. Many more projects come to mind, but these are a notable few.

3. Reddit Growth Dataset

The following dataset contains Reddit’s engagement data (submissions, comments, votes, account membership) per month, from the month of its launch (June 2005) to March 2017.

One project you can do using this data is to predict the engagement numbers for the months after March 2017 and cross-check them against the data currently available (2017 to the current year).
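A simple baseline for this kind of forecast is a straight-line trend fitted by least squares and checked against held-out months. The sketch below uses made-up monthly counts and hypothetical function names; real engagement data is unlikely to grow linearly, so treat this as a baseline to beat rather than a finished model.

```python
def fit_line(values):
    """Least-squares slope and intercept for values indexed 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, steps):
    """Extend the fitted line by the given number of future periods."""
    slope, intercept = fit_line(values)
    n = len(values)
    return [intercept + slope * (n + i) for i in range(steps)]


# Made-up monthly comment counts: fit on the first five, hold out the last two.
monthly_comments = [100, 120, 140, 160, 180]
holdout = [200, 220]
predicted = forecast(monthly_comments, steps=2)
errors = [abs(p - a) for p, a in zip(predicted, holdout)]
print(predicted, errors)
```

If growth looks exponential, one option is to fit the same line to the logarithm of the counts and exponentiate the forecast.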

4. Reddit AAPL Stock Sentiment Analysis

The value of stocks relies heavily on public sentiment, and while Reddit, at first glance, might not seem like a determining factor in stock prices, a closer look at the news will reveal otherwise. One of the most extraordinary events in Reddit history was the inflation of the GameStop stock price driven by a subreddit called r/WallStreetBets.

The subreddit was able to single-handedly inflate the GameStop stock price, revealing a vulnerability in stock pricing and the sheer power of Reddit. When it comes to stocks and trading, Reddit is no pushover and is, in fact, a key tool.

AAPL (Apple Inc.) remains one of the world’s most potent stocks, despite the recent decline in the value of other tech stocks. Running a sentiment analysis of the AAPL stock using Reddit data can be a powerful way to look at the industry.

Since the data only spans November 2016 to October 2021, you can build a machine learning project to predict AAPL’s stock price and cross-check the result against the subsequent AAPL stock prices from October 2021 to the present.

Another project idea is to compare the results of sentiment analysis from a particular time frame (e.g., November 2017) and see how it is reflected in AAPL’s stock price for that period.
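One simple way to quantify how sentiment and price move together is the Pearson correlation between a monthly sentiment series and the matching monthly returns. The numbers below are invented for illustration, and the function is a plain standard-library sketch. Correlation alone does not show that sentiment drives price.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: average monthly sentiment and monthly AAPL return.
monthly_sentiment = [0.10, 0.25, -0.05, 0.30]
monthly_return = [0.01, 0.03, -0.02, 0.06]
r = pearson(monthly_sentiment, monthly_return)
print(f"correlation = {r:.2f}")
```

Shifting one series by a month before computing the correlation is a quick way to check whether sentiment tends to lead or lag the price.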

YouTube Datasets And Project Ideas

YouTube is the world’s leading video streaming platform. Moreover, it is currently one of the world’s most significant outlets for audiovisual ad campaigns, as well as a primary outlet for multimedia content including news, music, travel, education and virtually any other kind of video.

Below, we list four social media datasets that utilize YouTube in particular:

  • Trending YouTube Video Statistics
  • YouTube 8M Dataset
  • Donald Glover’s This is America - YouTube Comments Sentiment Analysis
  • Most Subscribed YouTube Channels

1. Trending YouTube Video Statistics

YouTube’s trending chart is a place most video creators want to be listed, as it reflects the general popularity of a video and the hype it receives within a community. Nevertheless, the trending tab is not uniform around the world, as it elevates different content in every country.

This dataset records trending videos, including their engagement statistics, tags, metadata and more. As specified earlier, the dataset heavily varies per country, and currently, the dataset includes the trending chart of the following countries:

  • USA
  • Great Britain
  • Canada
  • Germany
  • France
  • South Korea
  • Mexico
  • Russia
  • India
  • Japan

Moreover, one can utilize this dataset for the following projects:

  • Create an algorithm that determines what tags and content will trend in a specific country.
  • Analysis of which content tends to trend more.
  • Determining what type of YouTube content generates the most user engagement.
  • Using machine learning algorithms (e.g., RNNs) to generate YouTube comments.

2. YouTube 8M Dataset

Unstructured data, especially multimedia data, is tough to analyze and demands huge amounts of processing power and resources. The YouTube 8M dataset covers 350,000 hours of video and distributes it as precomputed audio-visual features with labels rather than raw video files, which allows for easier processing.

The dataset also contains human-verified annotations describing the video’s audiovisual elements. The dataset can be used to train machine learning algorithms that require video input. You can consider the following projects:

  • Develop a project that determines the genre of a YouTube video based on its content.
  • Create an algorithm that describes and annotates YouTube videos.
  • For a significantly more challenging project, create video content based on prompts.

3. Donald Glover’s This is America - YouTube Comments Sentiment Analysis

At the forefront of music activism, Childish Gambino’s “This Is America” is a viral anthem that depicts and protests the police brutality often experienced by the Black community in America. As of writing, the music video has 846 million views and more than 750,000 comments.

This dataset contains more than 200,000 comments scraped using YouTube’s API. One of the projects you can do is to use natural language processing to run sentiment analysis on the video’s comments.

Natural language processing (NLP) allows computers to understand human language. It takes complex sentences and breaks them down into simple structures, enabling algorithms to better perform numerical interpretation. NLP is often used in sentiment analysis.
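To illustrate the idea of breaking sentences into simple structures and turning them into numbers, here is a deliberately tiny lexicon-based sentiment sketch. The word lists are invented for this example and are far too small for real use; a real project would rely on an established sentiment lexicon or a trained model.

```python
import re

# Toy lexicons, assumed for illustration only.
POSITIVE = {"love", "great", "powerful", "amazing"}
NEGATIVE = {"hate", "boring", "awful", "overrated"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Count positive tokens minus negative tokens."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokenize(text))


def label(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up comments.
for comment in ["I love this, such a powerful video!", "Boring and overrated."]:
    print(comment, "->", label(comment))
```

This approach ignores negation and sarcasm, both of which are common in comment sections, so it works best as a baseline to compare a stronger model against.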

4. Most Subscribed YouTube Channels Dataset

The following dataset contains the most subscribed YouTube channels, including their engagement statistics, date created, video count, genre, and rank. One of the projects you can do is a statistical analysis to determine which genres are most popular.

Other Datasets To Discover

Aside from the datasets above, what datasets can you build projects with and explore? These social media datasets can be of use:

1. LinkedIn Social Media Dataset

This social media dataset contains information about accounts on LinkedIn, the individual’s job experience, their LinkedIn activity, their current company, gender, race and more. You can do statistical analysis to analyze which demographic is favored in which jobs and which type of individual or group is highest or lowest paid.

2. Shopping Influence And Social Media

This dataset contains the answers to a survey questionnaire about how Millennials are influenced by social media, especially in their shopping habits. You can create a project analyzing which demographic on which social media platform has its shopping behavior influenced most heavily.

3. Fake News Detection

While this is not strictly a social media dataset, it is one you can build great projects on. Debuted as part of a hackathon, this dataset contains information that can help you build a fake news detection algorithm using NLP.

Social Media Data Science Projects

All of these free datasets can be used on your next data science project. In particular, social media datasets are great for data analytics projects, especially LinkedIn data. You can also use these sources for advanced machine learning and classification data science projects, as Twitter data works well for text classification projects.