Weighted Average in SQL Guide

In-Depth Guide and Interview Questions

Introduction

Data analysis plays a pivotal role in decision-making processes across various industries. Among the many techniques available, the weighted average is a powerful tool for deriving valuable insights from numerical data. But how exactly can we calculate the weighted average using SQL?

Let’s explore some practical tips and techniques for generating the weighted average using SQL.

What Is the Weighted Average?

What is the weighted average, and how does it differ from the arithmetic mean?

The rationale behind the arithmetic mean is simple: it represents a set of data points as a single output. There are limitations to this approach, since it assumes that all data points are of equal importance, which isn’t always the case in the real world. To account for variation in significance, or “weights”, we use the weighted average.

The general query for the weighted average is:

SELECT SUM(value * weight) / SUM(weight)
FROM your_table;  -- value, weight, and your_table are placeholders

Figure 1.1: The General Form for the Weighted Average

This query represents the weighted average formula, which calculates the weighted average by dividing the sum of the products of each value (valueᵢ) and its corresponding weight (weightᵢ) by the sum of all the weights (weightⱼ).

weighted average = Σᵢ (valueᵢ × weightᵢ) / Σⱼ weightⱼ

Figure 1.0: Formula for the Weighted Average (assuming the weights are not already normalized, i.e., they don’t sum to 1)

We can see that the formula above maps to our SQL query straightforwardly. SUM(value * weight) corresponds to the numerator, while SUM(weight) is the denominator of the formula.
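
For example, if the values 2 and 4 carry weights 1 and 3, the weighted average is (2 × 1 + 4 × 3) / (1 + 3) = 14 / 4 = 3.5, while the plain arithmetic mean of 2 and 4 would be 3.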

It seems simple, right? However, while this general formula is technically the correct answer, not all applications of weighted averages will reflect this general form. To learn how to deal with real-world schemas, let’s take a look at some different examples.

SQL Weighted Average Example with Real-World Schemas

The general query for the weighted average (Figure 1.1) assumes that we’re gathering the weighted average for the whole table. However, data scientists and analysts typically use databases with more complex schemas in their day-to-day workflow.

To see how we can adjust the general query, we’ll use the following table schemas to set up a more realistic example.

Students table

Column        Data Type
student_id    INT (PK)
student_name  VARCHAR(50)

Grades table

Column       Data Type
grade_id     INT (PK)
student_id   INT (FK)
course_name  VARCHAR(50)
grade        DECIMAL(3,2)
units        INT
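
If you want to follow along, here is one possible way to create these tables (a MySQL-style DDL sketch that simply mirrors the schemas above):

-- Hypothetical DDL matching the schemas above (MySQL syntax)
CREATE TABLE students (
    student_id   INT PRIMARY KEY,
    student_name VARCHAR(50)
);

CREATE TABLE grades (
    grade_id    INT PRIMARY KEY,
    student_id  INT,
    course_name VARCHAR(50),
    grade       DECIMAL(3,2),
    units       INT,
    FOREIGN KEY (student_id) REFERENCES students(student_id)
);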

Let’s assume that you’re working as a data analyst for a university. You’re tasked with generating a report on the general weighted average (GWA), essentially a weighted grade point average, of all the students in the university. Since courses have different weights (represented in this case as units), it’s important to use the weighted average instead of the arithmetic mean to calculate the final output.

Since we need to generate the weighted average per student with our table schema, the general form solution by itself won’t work for this problem.

The Grades table has a many-to-one relationship with the Students table: many grades map to one student. To generate one GWA for each student, we can use GROUP BY and SUM, which is the first approach we’ll discuss today.

Approach 1: GROUP BY and SUM

The many-to-one nature of our schema makes GROUP BY a natural choice for our solution. From the general form, our query will transform into:

SELECT
    students.student_name,
    SUM(grades.units) AS 'Total Units Taken',
    ROUND(SUM(grades.grade * grades.units) / SUM(grades.units), 3) AS 'General Weighted Average'
FROM
    grades
INNER JOIN
    students ON grades.student_id = students.student_id
GROUP BY
    students.student_id, students.student_name;

Figure 2.0: GROUP BY and SUM Solution

While this looks very different from the general form presented above, the line ROUND(SUM(grades.grade * grades.units) / SUM(grades.units), 3) AS 'General Weighted Average' is just a modified version of our basic query.

In the next section, we’ll break down our solution in parts to show how we generated this example.

If you need a refresher on the basics of GROUP BY, read through the Interview Query Introduction to SQL course.

GROUP BY with the General Form

Using the table schema, a grade can be mapped to a student using the foreign key, which in this case is student_id.

student_id  student_name  course_name  grade
1           John Smith    Mathematics  3.50
1           John Smith    Physics      4.00
1           John Smith    English      3.80
2           Jane Doe      Chemistry    3.90
2           Jane Doe      Biology      3.70

Figure 2.1: Table before GROUP BY

To work with the data more efficiently, we’ll use the GROUP BY clause on the grades.student_id column.

This will group rows with the same student_id, effectively dividing the original table into smaller sets of data. We can then perform the operations in the general form on each set separately, rather than on the entire table.

student_id  student_name  course_name  grade
1           John Smith    Mathematics  3.50
1           John Smith    Physics      4.00
1           John Smith    English      3.80

student_id  student_name  course_name  grade
2           Jane Doe      Chemistry    3.90
2           Jane Doe      Biology      3.70

Figure 2.2: Tables after GROUP BY

We can now treat grades.grade as the value and grades.units as the weight. The join then matches each student name to a GWA. Using the general form (with GROUP BY, of course), our output will be:

student_name     Total Units Taken  General Weighted Average
John Smith       10                 3.740
Jane Doe         7                  3.814
Michael Johnson  7                  3.371
Emily Williams   6                  3.667
Daniel Brown     6                  3.700
Olivia Davis     4                  3.400

Figure 2.3: Results of the GROUP BY and SUM Query

Approach 2: Window Functions and the Weighted Moving Average

Our first approach is great whenever we want something as simple as a general weighted average. But if we want our data to be dynamic (such as in leaderboards or scoreboards), a moving average may be more suitable. Using our previous table schema, how could we get the moving weighted average with SQL?

Because our values now vary based on previous data (as the data is moving), we’ll need to employ a more advanced SQL feature called window functions. These allow us to aggregate data on a per-row (or per-window) basis instead of across the whole table. While we tackled the first iteration of breaking down the table with GROUP BY, we can create a more granular solution using window functions.

Window Functions

Window functions are functions that use one or multiple rows to return values for each row. They are different from aggregate functions in that aggregate functions return just one value for every group of rows, while window functions return a value for each row.
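
To make the contrast concrete, here is a minimal sketch using the grades table from our schema, showing the aggregate and window versions of the same SUM:

-- Aggregate function: one output row per student
SELECT student_id, SUM(units) AS total_units
FROM grades
GROUP BY student_id;

-- Window function: every grade row is kept, each annotated with its student's total
SELECT student_id, course_name, units,
       SUM(units) OVER (PARTITION BY student_id) AS total_units
FROM grades;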

Our moving weighted average solution should take the form of the following:

SELECT
    student_name,
    course_name,
    grade,
    units,
    ROUND(
        SUM(grade * units) OVER (
            PARTITION BY grades.student_id
            ORDER BY grade_id
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) /
        SUM(units) OVER (
            PARTITION BY grades.student_id
            ORDER BY grade_id
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ),
    3) AS 'Running Average Grade'
FROM
    students
JOIN
    grades ON students.student_id = grades.student_id
ORDER BY
    grades.student_id, grade_id;

Figure 3.0: Window Function Solution for Calculating the Moving Average

While this might look intimidating at first glance, we’ll break it down step-by-step below.

Window Functions with the General Form

The modified version of the general form in this query is:

ROUND(
    SUM(grade * units) OVER (
        PARTITION BY grades.student_id
        ORDER BY grade_id
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) /
    SUM(units) OVER (
        PARTITION BY grades.student_id
        ORDER BY grade_id
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ),
3) AS 'Running Average Grade'

Figure 3.1: The Modified General Form from Figure 3.0

Still quite a big chunk, isn’t it? We can split the general form G, SELECT SUM(value * weight) / SUM(weight), into two parts: G₁ = SUM(value * weight) and G₂ = SUM(weight).

Similarly, we can split the modified general form (see Figure 3.1), G’, into two parts:

G’₁:

SUM(grade * units) OVER (
    PARTITION BY grades.student_id
    ORDER BY grade_id
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)

and G’₂:

SUM(units) OVER (
    PARTITION BY grades.student_id
    ORDER BY grade_id
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)

Note that the snippet SUM(value * weight) from G₁ is still present in G’₁ as SUM(grade * units); the same is true for G₂ and G’₂. What makes them different are the window functions. Window functions are declared using the OVER keyword, typically after an aggregate function (in this case, SUM).

A window function is defined by three sections: PARTITION BY, ORDER BY, and the window frame.

  • PARTITION BY grades.student_id means that the window is defined for each unique student_id, and the calculation is performed separately for each student.
  • ORDER BY grade_id specifies the order in which the rows are considered within each partition. Here, the rows are ordered by the grade_id column.
  • ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is our window frame, and it defines the range of rows included in the calculation. We can read this like plain English: we start the calculation from the beginning of our partition (specified by UNBOUNDED PRECEDING) and continue up until the current row.

Let’s see this in action.

  1. For PARTITION BY, the results will look largely equivalent to Figures 2.1 and 2.2, as partitions function similarly to GROUP BY.
student_id  grade_id  student_name  course_name  grade
1           1         John Smith    Mathematics  3.50
1           2         John Smith    Physics      4.00
1           3         John Smith    English      3.80
2           4         Jane Doe      Chemistry    3.90
2           5         Jane Doe      Biology      3.70

Figure 3.2: Table before PARTITION BY

student_id  grade_id  student_name  course_name  grade
1           1         John Smith    Mathematics  3.50
1           2         John Smith    Physics      4.00
1           3         John Smith    English      3.80

student_id  grade_id  student_name  course_name  grade
2           4         Jane Doe      Chemistry    3.90
2           5         Jane Doe      Biology      3.70

Figure 3.3: Tables after PARTITION BY

  2. We then have the ORDER BY clause, which means we’ll run the window aggregation in the specified order. In this case, we order the rows by grade_id.
  3. Using our window frame definition, the following happens when we run the query:

Figure 3.4: Window Function Aggregation Visualization

The aggregation will be performed on each partition separately. In this case, we have two partitions: Partition 1 (consisting of grade_id 1 to 3) and Partition 2 (for grade_id 4 to 5).

The window function aggregation will consider the order of the rows based on the grade_id column. It’ll start the calculation from the beginning of each partition as defined by UNBOUNDED PRECEDING and continue up to the current row within the window.

Let’s focus on Partition 1 as an example. When the window function is evaluated for the second row (grade_id 2), it calculates the values starting from the beginning of the partition (grade_id 1) up to the current row, grade_id 2.
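
Concretely, for that second row: (3.50 × 4 + 4.00 × 3) / (4 + 3) = 26.00 / 7 ≈ 3.714, which matches the Physics row in the results below.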

Therefore, the results of our query should look like this:

student_name     course_name       grade  units  Running Average Grade
John Smith       Mathematics       3.50   4      3.500
John Smith       Physics           4.00   3      3.714
John Smith       English           3.80   3      3.740
Jane Doe         Chemistry         3.90   4      3.900
Jane Doe         Biology           3.70   3      3.814
Michael Johnson  History           3.20   3      3.200
Michael Johnson  Geography         3.50   4      3.371
Emily Williams   Computer Science  3.90   4      3.900
Emily Williams   Art               3.20   2      3.667
Daniel Brown     Psychology        3.60   3      3.600
Daniel Brown     Sociology         3.80   3      3.700
Olivia Davis     Economics         3.40   4      3.400

Figure 3.5: Results of the Moving Average Query

Practice Weighted Average SQL Questions

In this section, we’ll solve three of the most common SQL weighted average interview questions.

Let’s find out how many you can solve correctly!

1. Weighted Average Email Campaign

Question

A marketing team is doing a review of their past email campaigns and they’re interested in understanding the effectiveness of each campaign. They’ve collected data on the number of users who opened each email and the number of users who clicked on a link within the email. They want to compute a weighted average score for each campaign where the weight of the open rate is 0.3 and the weight of the click rate is 0.7.

Write a SQL query to calculate the weighted average for each campaign.

Note: The weighted average should be rounded to two decimal places.

Example:

Input:

email_campaigns table

Column         Type
campaign_id    INTEGER
campaign_name  VARCHAR
num_users      INTEGER
num_opens      INTEGER
num_clicks     INTEGER

Output:

Column         Type
campaign_name  VARCHAR
weighted_avg   DECIMAL(4,2)

Solution

A preliminary query for this problem looks like this:

SELECT
    campaign_name,
    ROUND(
        (0.3 * (num_opens / num_users)) +
        (0.7 * (num_clicks / num_users)),
    2) AS weighted_avg
FROM
    email_campaigns;

Figure 4.1: Preliminary Solution for Weighted Average Email Campaign

The query first calculates the open rate (num_opens / num_users) and click rate (num_clicks / num_users). The weighted average is then calculated using 0.3 for the weight of the open rate and 0.7 for the click rate.

Note that in many SQL dialects (PostgreSQL and SQL Server, for example), dividing two integers performs integer division, truncating the result. If num_users, num_opens, and num_clicks are integers, you should convert them to a decimal or floating-point type to get the correct result. This is especially important when dealing with ratios or percentages.
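
You can see the difference with a quick experiment (behavior varies by dialect; MySQL’s / operator already returns a decimal result, while PostgreSQL and SQL Server truncate):

SELECT 3 / 2;                         -- 1 in PostgreSQL and SQL Server; 1.5000 in MySQL
SELECT CAST(3 AS DECIMAL(10,2)) / 2;  -- decimal division in all three: 1.50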

With the conversion, the query should look like:

SELECT
    campaign_name,
    ROUND(
        -- casting to ensure that we get a decimal division result
        (0.3 * (CAST(num_opens AS DECIMAL(10,2)) / CAST(num_users AS DECIMAL(10,2)))) +
        (0.7 * (CAST(num_clicks AS DECIMAL(10,2)) / CAST(num_users AS DECIMAL(10,2)))),
    2) AS weighted_avg
FROM
    email_campaigns;

Figure 4.2: Corrected Solution for Weighted Average Email Campaign with Unaccounted Edge Cases

However, as it stands, this query can’t handle a situation where num_users is 0, which would cause a division-by-zero error.

We can prevent this by introducing a conditional clause:

SELECT
    campaign_name,
    -- introducing a conditional clause to avoid a division-by-zero error
    IF(num_users > 0,
        ROUND(
            (0.3 * (CAST(num_opens AS DECIMAL(10,2)) / CAST(num_users AS DECIMAL(10,2)))) +
            (0.7 * (CAST(num_clicks AS DECIMAL(10,2)) / CAST(num_users AS DECIMAL(10,2)))),
        2),
    0) AS weighted_avg
FROM
    email_campaigns;

Figure 4.3: Full Solution

In this version, if num_users is greater than 0, it calculates the weighted average as before. If num_users is 0, it simply returns 0 as the weighted average.
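
Note that IF() is MySQL-specific syntax. If you need a dialect-agnostic version, here’s a sketch of the same guard using the standard NULLIF and COALESCE functions: NULLIF(num_users, 0) turns a zero denominator into NULL (making the whole expression NULL rather than erroring), and COALESCE maps that NULL back to 0.

SELECT
    campaign_name,
    -- NULLIF avoids division by zero; COALESCE converts the resulting NULL to 0
    COALESCE(ROUND(
        (0.3 * (CAST(num_opens AS DECIMAL(10,2)) / NULLIF(num_users, 0))) +
        (0.7 * (CAST(num_clicks AS DECIMAL(10,2)) / NULLIF(num_users, 0))),
    2), 0) AS weighted_avg
FROM
    email_campaigns;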

2. Rolling Weighted Average Sales

Question

The sales department is conducting a performance review and is interested in trends in product sales. They’ve decided to use a weighted moving average as part of their analysis.

Write a SQL query to calculate the 3-day weighted moving average of sales for each product. Use the weights 0.5 for the current day, 0.3 for the previous day, and 0.2 for the day before that. Round the average to two decimal places.

Note: only output the weighted moving average for dates that have two or more preceding dates. You may assume that the table has no missing dates.

Example:

Input:

sales table

Column        Type
date          DATE
product_id    INTEGER
sales_volume  INTEGER

Output:

Column              Type
date                DATE
product_id          INTEGER
weighted_avg_sales  FLOAT

Solution:

Your answer should look something like the following query:

WITH weighted_sales AS (
    SELECT
        date,
        product_id,
        sales_volume,
        -- Calculate the weighted average using the current row's sales_volume and the two previous rows'
        ROUND(
            0.5 * sales_volume +
            0.3 * LAG(sales_volume, 1) OVER (PARTITION BY product_id ORDER BY date) +
            0.2 * LAG(sales_volume, 2) OVER (PARTITION BY product_id ORDER BY date),
        2) AS weighted_avg_sales,
        -- Count the rows (the current day plus up to two preceding days) in a rolling 3-day window
        COUNT(*) OVER (PARTITION BY product_id ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
        AS num_preceding_days
    FROM
        sales
)
SELECT
    date,
    product_id,
    weighted_avg_sales
FROM
    weighted_sales
WHERE
    -- Only keep dates with a full 3-day window (two or more preceding dates)
    num_preceding_days = 3
ORDER BY
    date,
    product_id;

Figure 5.0: Full Solution

We structured our solution using a common table expression (CTE), so let’s focus on that block first. The CTE is the query defined inside the WITH ... AS syntax.

There are two critical parts of this query, each aimed at fulfilling a requirement of the question:

  • (A) generating the weighted moving average
  • (B) only including days with at least two preceding days.

First, let’s dissect the weighted_avg_sales section:

ROUND(
    0.5 * sales_volume +
    0.3 * LAG(sales_volume, 1) OVER (PARTITION BY product_id ORDER BY date) +
    0.2 * LAG(sales_volume, 2) OVER (PARTITION BY product_id ORDER BY date),
2) AS weighted_avg_sales

Figure 5.1: Solution for Requirement A

We have three different nested expressions in this block.

  • 0.5 * sales_volume takes the current day’s sales and applies a weight of 0.5.
  • 0.3 * LAG(sales_volume, 1) OVER (PARTITION BY product_id ORDER BY date) applies a weight of 0.3 to the previous day’s sales volume. The LAG function accesses the sales volume of the preceding row within each product_id group, ordered by date.
  • 0.2 * LAG(sales_volume, 2) OVER (PARTITION BY product_id ORDER BY date) applies a weight of 0.2 to the sales volume from two days before, lagged by 2 within each product_id group.

By summing these terms, we get the weighted average.

Notice that we did not need to divide the sum by the SUM of the weights. This is because the weights in this question are already normalized: 0.5 + 0.3 + 0.2 = 1, so dividing by the sum of the weights would just divide by 1.

For requirement B, we’re interested in the num_preceding_days alias and the WHERE clause in the main query.

The num_preceding_days column represents the count of preceding days within each product_id group. It’s calculated using a window function that counts the number of rows in the window frame, including the current row and the two preceding rows, based on the order of date.

This count determines whether a row is included in the final result set: only rows where num_preceding_days equals 3 are selected, as specified by WHERE num_preceding_days = 3.
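
For example, the first and second dates for a product have num_preceding_days values of 1 and 2 and are filtered out; from the third date onward, the window is full and the count is exactly 3.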

3. Rolling Weighted Average with Skipped Dates

Question

The analytics team of a platform wants to calculate the 3-day rolling weighted average for new daily users, where the most recent day has a weight of 3, the second-most recent day has a weight of 2, and the third-most recent day has a weight of 1. The team wants to analyze the trend and growth of new daily users over a short period. When the system doesn’t detect any new users by the end of the day, it won’t load that date into the database.

Write a SQL query to calculate the 3-day rolling weighted average for new daily users from the users table. Round the average to two decimal places.

Example:

Input:

users table

Column     Type
date       DATE
new_users  INTEGER

Output:

Column            Type
date              DATE
weighted_average  DECIMAL

Solution

-- This CTE calculates the lag offsets and the first date in the dataset.
WITH ordered_users AS (
  SELECT
    date,
    new_users,
    LAG(new_users, 1) OVER (ORDER BY date) AS new_users_lag_1,
    LAG(new_users, 2) OVER (ORDER BY date) AS new_users_lag_2,
    MIN(date) OVER () AS initial_date
  FROM
    users
),
calculated_users AS (
  SELECT
    date,
    initial_date,
    -- This CASE WHEN block determines how to calculate a data point depending
    -- on the date difference from the previous rows.
    CASE
      WHEN DATEDIFF(date, LAG(date, 1) OVER (ORDER BY date)) = 1 THEN
        CASE
          WHEN DATEDIFF(date, LAG(date, 2) OVER (ORDER BY date)) = 2 THEN ROUND((3 * new_users + 2 * new_users_lag_1 + new_users_lag_2) / 6, 2)
          ELSE ROUND((3 * new_users + 2 * new_users_lag_1) / 6, 2)
        END
      WHEN DATEDIFF(date, LAG(date, 1) OVER (ORDER BY date)) = 2 THEN ROUND((3 * new_users + new_users_lag_1) / 6, 2)
      ELSE ROUND(3 * new_users / 6, 2)
    END AS weighted_average
  FROM
    ordered_users
)
SELECT
  date,
  weighted_average
FROM
  calculated_users
WHERE
  -- Ensures that only dates at least two days after the first date in the dataset
  -- are added to the result set. The second row can qualify, as long as at least
  -- one date was skipped after the first row.
  DATEDIFF(date, initial_date) >= 2;

Figure 6.1: Full Solution

In this query, two CTEs are used to organize and isolate the calculations, as well as to separate our conditional filters from other operations.

The first CTE, ordered_users, is used for data organization and preparation. It performs two significant operations:

  • Creates additional columns through the LAG() function as offsets
  • Calculates the earliest date in the dataset (initial_date) using a window function

These are all potentially complicated calculations that are best done separately from the main query for readability and maintenance. The LAG() offsets transform the data as follows:

date        new_users
2023-01-01  100
2023-01-02  50
2023-01-04  80
2023-01-05  120

Figure 6.2: Table before the LAG Offsets

date        new_users  new_users_lag_1  new_users_lag_2
2023-01-01  100        NULL             NULL
2023-01-02  50         100              NULL
2023-01-04  80         50               100
2023-01-05  120        80               50

Figure 6.3: Table after the LAG Offsets

The second CTE, calculated_users, is used to compute the weighted average of new users based on the data prepared in the ordered_users CTE. This weighted average is calculated differently depending on the gap between the current date and the previous date(s).

Next, the OVER (ORDER BY date) clause ensures that the rows are ordered by the date column before the LAG() function is applied.

Ideally, we’ll retrieve the date and the corresponding new users from the first CTE and apply the weighted average logic accordingly, as expressed in the code: (3 * new_users + 2 * new_users_lag_1 + new_users_lag_2), represented as G₁ in our general form.

However, note that some dates may be missing. There are several ways to address this, some more convenient than others (such as generating a complete calendar with Postgres’ generate_series). For the most vendor-agnostic solution, we used the CASE WHEN approach:

CASE
  -- (A)
  WHEN DATEDIFF(date, LAG(date, 1) OVER (ORDER BY date)) = 1 THEN
    CASE
      -- case 1:
      WHEN DATEDIFF(date, LAG(date, 2) OVER (ORDER BY date)) = 2 THEN ROUND((3 * new_users + 2 * new_users_lag_1 + new_users_lag_2) / 6, 2)
      -- case 2:
      ELSE ROUND((3 * new_users + 2 * new_users_lag_1) / 6, 2)
    END
  -- (B)
  -- case 3:
  WHEN DATEDIFF(date, LAG(date, 1) OVER (ORDER BY date)) = 2 THEN ROUND((3 * new_users + new_users_lag_1) / 6, 2)
  -- case 4:
  ELSE ROUND(3 * new_users / 6, 2)
END AS weighted_average

Figure 6.4: CASE WHEN Block

To walk through how we got the weighted sum for G₁ with the missing dates considered, let’s divide this block up with a decision tree (with a worked trace after this list). So, if:

  • The date one row back is indeed the day before the current date (A):
    • Case 1: check if the row before that is two days before our current row. If so, there are no missing dates, and we use 3 * new_users + 2 * new_users_lag_1 + new_users_lag_2 to get our weighted sum.
    • Case 2: If not, the date two days back is missing (it had no new users), so we should only consider the current date and the day before. We use 3 * new_users + 2 * new_users_lag_1.
  • On the other hand, we should also check whether the date one row before our current row is two days back (B):
    • Case 3: If it is, only the day immediately before the current day is missing, and the row one back carries weight 1. Our weighted sum is 3 * new_users + new_users_lag_1.
    • Case 4: If not, all the days prior to our current date within the window are missing, and we evaluate our weighted sum as 3 * new_users.
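
Tracing the sample data from Figure 6.3 through this tree (using the full expressions from the code, including the division by 6 explained below): 2023-01-04 falls under Case 3, since its previous row (2023-01-02) is two days back, giving ROUND((3 × 80 + 50) / 6, 2) = 48.33. 2023-01-05 falls under Case 2, since its previous row is one day back but the row before that is three days back, giving ROUND((3 × 120 + 2 × 80) / 6, 2) = 86.67. These are also the only rows that survive the final WHERE filter, as 2023-01-01 and 2023-01-02 are less than two days after the initial date.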

We’ll then divide our weighted sum in G₁ by six, as that is the sum of the weights (3+2+1), which is represented as G₂ in our general form.

Notice that here we did divide the weighted sum by the SUM of the weights, because the weights in this question (3, 2, and 1) are not normalized.

Lastly, we add the WHERE clause, which is a little tricky too. While it’s tempting to simply discard the first two rows from the result set, that would be a naïve approach.

Consider a situation where the second row is not the day after the first row: in that case, the second date logically belongs in the result set. So, instead of deciding whether to include the second row based on its position in the table, we base the decision on its interval from the first date, using DATEDIFF(date, initial_date) >= 2.

Learn SQL with Interview Query

The best way to get better in SQL is practice. Interview Query offers a variety of SQL learning resources to help you practice and improve your SQL skills, including: