Data Analytics

Setting Metrics

What is a good metric?

When first sitting down to run an A/B test (assuming you have already decided it makes sense to run one), the first question is what to measure. Deciding what to measure is the most important part of running an A/B test: the metrics you choose determine how you design the experiment around them.

At a high level, metrics are quantifiable measures of how you’re doing: they tell you whether the product is performing well or poorly. However, many people dislike metrics because they feel metrics are over-emphasized to the point of gamification at larger companies.

This sentiment usually reflects poor communication or a poor choice of metrics, not a problem with metrics themselves. A good metric correlates directly with an overall improvement to the product itself, however you define improvement.

A fairly universal example relates to happiness in your life. If your goal were to improve your own happiness, you might choose a quantifiable metric like money or wealth to measure how happy you are. But we inherently know that many very wealthy people are not necessarily happy: the proxy breaks down once it becomes gamed.

This applies to products as well. If my goal for Interview Query were to make it the best data science course to learn from, I might choose a metric such as revenue to tie to that goal. The expectation is that if we make more money, we’ll ostensibly have the best content. But to hit that revenue target, instead of improving the overall content of the material and writing, we could just optimize our conversion funnels.

Therefore, every metric we set should be directly related to the goal at hand. The easiest way to check whether we’ve picked the right metric is to ask: “If this metric were to double while everything else stayed the same, would we achieve our goal?”

Metrics for Experiments

The main rule is to have one key metric per experiment. You should still monitor other metrics to make sure you’re not accidentally tanking them in the process, but designating a single key metric for the experiment is essential.

The reason is that you need a clear threshold the metric can cross to declare success.

For example, take the Interview Query question bank. We want to change the ordering of the question bank so that users try more interview questions. If we set a goal of increasing the number of interview questions viewed per user, then our metric seems relatively simple to measure.
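As a sketch (all numbers below are hypothetical, not from a real experiment), the per-user view metric and a pre-registered success threshold might be computed like this:

```python
# Hypothetical totals for the question-bank reordering experiment.
control_views, control_users = 4200, 2000
treatment_views, treatment_users = 4830, 2000

control_rate = control_views / control_users        # 2.10 views per user
treatment_rate = treatment_views / treatment_users  # 2.415 views per user

relative_lift = (treatment_rate - control_rate) / control_rate  # 0.15

# Pre-registered success threshold: ship only if the lift is at least 10%.
SUCCESS_THRESHOLD = 0.10
ship = relative_lift >= SUCCESS_THRESHOLD
print(f"lift = {relative_lift:.1%}, ship = {ship}")
```

Setting the threshold before the experiment runs is what keeps the decision honest; otherwise it is tempting to move the goalposts to match whatever lift you observe.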

But what if we notice that, as a side effect of our change, users stop adding as many comments on each question? Do we care about this decrease? Maybe, maybe not. Depending on how much we care, we should either change our overarching success metric to account for this interaction, or commit to shipping the change based on our original metric of questions viewed per user.

Once you start testing many metrics on one experiment, your false positive rate increases. Correction methods can counteract this, but they make each test more conservative, so it becomes less likely to detect a real difference in any given test.
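To see why, here is a small illustration of how the chance of at least one false positive grows with the number of independent metrics tested at a significance level of 0.05, and how a Bonferroni correction (one common correction method, named here as an example) tightens the per-test threshold to compensate:

```python
# Probability of at least one false positive across n independent tests,
# each run at alpha = 0.05, plus the Bonferroni-corrected per-test alpha.
alpha = 0.05

for n_metrics in (1, 3, 10):
    fwer = 1 - (1 - alpha) ** n_metrics  # family-wise error rate
    per_test_alpha = alpha / n_metrics   # Bonferroni-corrected threshold
    print(f"{n_metrics:>2} metrics: P(>=1 false positive) = {fwer:.2f}, "
          f"corrected alpha = {per_test_alpha:.4f}")
```

With 10 metrics, the chance of a spurious “significant” result is roughly 40%, which is why a single pre-declared key metric keeps the experiment interpretable.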

Metrics Measurement

There are four categories of evaluation metrics that we should use:

  1. Aggregations -> Sums and counts
  2. Distributions -> Means, medians, or other percentiles
  3. Percentages -> Click-through rate, percentage of users that did X action, etc.
  4. Ratios -> Any two metrics divided by each other
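
A minimal sketch computing one metric of each category from a made-up per-user event log (all numbers hypothetical):

```python
import statistics

# Hypothetical per-user counts of questions viewed in one experiment arm.
views_per_user = [0, 2, 5, 1, 0, 8, 3, 2, 0, 4]
clicks, impressions = 120, 1000  # hypothetical click data

# 1. Aggregations: sums and counts
total_views = sum(views_per_user)  # 25
n_users = len(views_per_user)      # 10

# 2. Distributions: mean, median (or other percentiles)
mean_views = statistics.mean(views_per_user)      # 2.5
median_views = statistics.median(views_per_user)  # 2.0

# 3. Percentages: share of users who did the action
pct_viewed_anything = sum(v > 0 for v in views_per_user) / n_users  # 0.7

# 4. Ratios: any two metrics divided by each other
click_through_rate = clicks / impressions     # 0.12
views_per_user_ratio = total_views / n_users  # 2.5
```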

In the example above, the questions-viewed-per-user metric is a ratio: the total number of questions viewed divided by the total number of users in the experiment.

One thing we have to consider is how sensitive this metric is. As noted, it captures our goal of increasing overall question-bank activity, but it can be misleading when it comes to re-engagement. What if users view all of the questions in one day and then never come back? Or what if users stop making comments entirely?

Again, how much we care about the different metrics influences how we think about calculating the metric we’re using.
