How to Create a Data Science Project from Scratch

Overview

Data science projects help you learn new skills, build your resume, and market yourself to land job interviews.

BUT, the key is: The project needs to be ORIGINAL.

You can jump on Kaggle and participate in a competition. Or find a dataset in the UCI Machine Learning Repository and build a classifier. BUT those projects have been done before… a lot. They’re great for learning, but they won’t make hiring managers say, “I want to work with this person”.

INSTEAD, if your end goal is to land a job, your project needs to be NOVEL. You have to start from scratch. If you do that, it’s much more likely people will take notice.

Where should you start? In this guide, we’re covering the steps needed to create your own data science project, starting with formulating a problem statement and ending with using that project to market yourself and land an interview. This covers everything you need to know to create an original data science project.

Step One: Choose a Topic for Your Project

A good data science project starts with an interesting idea. And that’s usually the hardest step. But here’s a suggestion: Find the overlap between your interests and the available data online.

If you start with a topic you’re really interested in, and know there’s data available to investigate it, you’ll be much more motivated to do the work and find an answer.

Here’s an example: Early in our founder Jay’s data science career, he worked on an NBA analytics problem. The leading question: Does the 2-for-1 strategy matter in NBA games? Fortunately, all the data needed to investigate it is available online, thanks to the play-by-play stats on NBA Reference that go back decades.

So start with something you’re interested in: Comic books, movies, sports, public health… And then think about the data that might be available on that subject.

Step Two: Formulate a Hypothesis

Successful data science and machine learning projects start with clear, measurable hypotheses. But you might be wondering: What makes a good hypothesis in data science?

Here are some guidelines, along with the mistakes they help you avoid:

Be Specific: A common mistake is choosing a topic that’s so broad that it becomes impossible to measure.

Take a question like: How many people are leaving cities because they’re too expensive? That’s really broad, and finding data for it would be nearly impossible, or take so much time that it wouldn’t be worth the effort.

Plus, a question like that makes another critical mistake: There are too many unknowable variables.

A good hypothesis is clear and testable. So a better question about housing for a data science project would be something like: What effect did the 2008 recession have on the population of San Francisco?

This question is much more specific and measurable, and there’s a better chance you can find quality data to investigate it.
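To make "measurable" concrete, here is a minimal sketch of how a question like that can be reduced to a single number you can compute. The function name, the years chosen, and the population values are all assumptions for illustration. The numbers are placeholders, not real census figures.

```python
def percent_change(series, before_year, after_year):
    """Percent change in a yearly series between two years."""
    before, after = series[before_year], series[after_year]
    # Multiply before dividing to keep the arithmetic simple to follow.
    return (after - before) * 100 / before


# Placeholder values, NOT real population data; replace with figures
# from a source you trust before drawing any conclusion.
population_by_year = {2007: 100_000, 2010: 103_000}

change = percent_change(population_by_year, 2007, 2010)
print(f"Population change from 2007 to 2010: {change:.1f}%")
```

If you can't write down a calculation like this for your question, that's usually a sign the hypothesis is still too broad.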

Step Three: Find Data

You can’t solve a data science problem without… quality data. (You never would have guessed that, right?)

So how do you do it? There are really three options for finding interesting data for machine learning projects:

  1. Pay for Data (or Use Open Data). When he was an undergrad, Jay used public data to build a project that predicted crime in Seattle. The project actually got some press. Socrata, which offers open and paid data, is a great source, as are governmental sites.
  2. Use a Free Dataset. There are lots of sources of robust data for free online… but for every free data source, there are hundreds of projects already. So it’s much more difficult to be original with free datasets. One option is to enrich the dataset with additional information. Or you can apply an original technique or ask a different leading question.
  3. Scrape the data. Build a simple Python app and scrape data. In the past, Jay scraped Craigslist data to analyze apartment prices in Seattle. This was another project that got some press.
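For the scraping option, here is a rough sketch of the parsing half of a simple Python scraper, using only the standard library. This is not the scraper Jay built. The HTML snippet, the tag and class names, and the listing values are all made up for illustration, and any real site will use its own markup.

```python
from html.parser import HTMLParser

# Hypothetical markup: the structure and class names are assumptions,
# and the listings themselves are made-up sample values.
SAMPLE_HTML = """
<ul>
  <li class="listing"><span class="title">Listing A</span><span class="price">$1,850</span></li>
  <li class="listing"><span class="title">Listing B</span><span class="price">$1,400</span></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collects title/price pairs from <span class="title"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.listings = []   # finished rows: {"title": str, "price": int}
        self._field = None   # which span we are currently inside, if any
        self._current = {}   # row being assembled for the current <li>

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("title", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field == "title":
            self._current["title"] = data.strip()
        elif self._field == "price":
            # Turn a string like "$1,850" into the integer 1850.
            self._current["price"] = int(data.strip().lstrip("$").replace(",", ""))

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            # Closing </li> means the row is complete.
            self.listings.append(self._current)
            self._current = {}


def parse_listings(html):
    """Return a list of {"title", "price"} dicts found in the HTML string."""
    parser = ListingParser()
    parser.feed(html)
    return parser.listings


if __name__ == "__main__":
    for row in parse_listings(SAMPLE_HTML):
        print(row["title"], row["price"])
```

Fetching the page is left out on purpose. In a real project you would download the HTML first (for example with `urllib.request.urlopen`) and pass it to `parse_listings`, after checking that the site's terms of service and robots.txt allow scraping.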

So, there are tons of places you can look for data. The hard part is processing and cleaning it. Once you’ve got that down, you can start your project and investigate your hypothesis.

Step Four: Start Your Data Science Project

Now, it’s time for the fun part: doing the data science project. Typically, there are a few common steps:

  • Data cleaning - Once you’ve settled on a dataset, you’ll first want to clean and process it.
  • Enriching data - Next, you’ll take all of your pre-processed data and enrich it, narrowing down your feature set and creating features. For example, you could create a time-based feature to analyze differences in times/dates.
  • Analysis and visualization - With the enriched dataset, you can start to explore it. You’ll be looking to isolate insights and also create helpful charts, heatmaps and graphs to visualize the insights you’ve identified.
  • Data modeling - Lastly, you’ll want to take the project further, using your data and machine learning to make predictions or forecasts.
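The four steps above can be sketched end to end with nothing but the Python standard library. Treat this as a toy illustration under stated assumptions: the CSV rows are made up, the column names (`listed_on`, `price`) are hypothetical, and a real project would more likely use libraries such as pandas and scikit-learn. The time-based feature here is a simple weekend flag, and the "model" is a one-variable least-squares line.

```python
import csv
import io
from datetime import datetime
from statistics import mean

# Made-up rows standing in for a real dataset; column names are assumptions.
RAW_CSV = """listed_on,price
2024-01-01,1000
2024-01-02,
2024-01-06,1200
not-a-date,1300
2024-01-07,1400
"""


def clean(rows):
    """Data cleaning: drop rows with a missing price or an unparseable date."""
    cleaned = []
    for row in rows:
        try:
            date = datetime.strptime(row["listed_on"], "%Y-%m-%d")
            price = float(row["price"])
        except (ValueError, TypeError):
            continue  # skip the bad row instead of guessing a value
        cleaned.append({"date": date, "price": price})
    return cleaned


def enrich(rows):
    """Enriching: add a time-based feature (1 for Saturday/Sunday, else 0)."""
    for row in rows:
        row["is_weekend"] = 1 if row["date"].weekday() >= 5 else 0
    return rows


def summarize(rows):
    """Analysis: average price for weekday (0) and weekend (1) rows."""
    groups = {0: [], 1: []}
    for row in rows:
        groups[row["is_weekend"]].append(row["price"])
    return {flag: mean(prices) for flag, prices in groups.items() if prices}


def fit_line(xs, ys):
    """Modeling: ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - slope * x_bar, slope


if __name__ == "__main__":
    rows = enrich(clean(csv.DictReader(io.StringIO(RAW_CSV))))
    print("Average price by weekend flag:", summarize(rows))
    intercept, slope = fit_line(
        [r["is_weekend"] for r in rows], [r["price"] for r in rows]
    )
    print(f"Predicted weekend price: {intercept + slope:.0f}")
```

Keeping each step in its own function makes it easier to document what you did and to swap in a better cleaning rule or model later.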

While you work on the project, remember to DOCUMENT everything. Create a code repository to store code and visualizations, and take notes throughout the process (which will be helpful in the next step).

Step Five: Market Your Project

If you really want to use a data science project to gain exposure, you have to market it. You’re investing a lot of energy… you don’t want it to go unnoticed.

Fortunately, there are tons of easy ways to get your project in front of eyeballs. Some that we’ve had success with are:

  • Write a blog post. Write about your project and get published. You don’t even need a website. You can publish a blog on Medium (or submit to an online publication like Towards Data Science). This is the baseline of what you should be doing.

If you’ve got a great idea, people will want to read about it. And you don’t have to write a thesis. Just focus on the problem, the data you used, challenges, and the conclusions you drew.

  • Do some PR outreach. Journalists are always looking for stories, and reaching out to them really works. Some simple email outreach or submitting a news tip through a news website are fast ways to get exposure for your project.

If you really want to go all out, you can write a press release and pay for distribution (which is useful for topics with broad audiences like housing, crime or public health). Or you can email a short press release to journalists who cover whatever niche your project was in.

  • Build a web app. Share your project via a web app. A lot of data scientists built web apps during COVID to visualize data, often to show the spread of the virus, vaccination rates, etc. Lots of those visualizations were viewed millions of times.
  • Distribute it on social media. There are plenty of places to try: r/dataisbeautiful, Hacker News, sharing with your network on LinkedIn. Publish the blog or web app, and then distribute it. You don’t have to go crazy with it, but if you want to maximize exposure, you should be doing a bit of distribution.

More Resources for Creating a Data Science Project