Data analysis plays a pivotal role in decision-making processes across various industries. Among the many techniques available, the weighted average is a powerful tool for deriving valuable insights from numerical data. But how exactly can we calculate the weighted average using SQL?
Let’s explore some practical tips and techniques in generating the weighted average using SQL.
What is the weighted average, and how does it differ from the arithmetic mean?
The rationale behind the arithmetic mean is simple– it represents a set of data points as a singular output. There are limitations to this approach since it assumes that all data points are of equal importance, which isn’t always the case in the real world. To accommodate for variance in significance or “weights”, we use the weighted average.
The general query for the weighted average is:
SELECT (SUM(value * weight) / SUM(weight))
Figure 1.1: The General Form for the Weighted Average
This query represents the weighted average formula, which calculates the weighted average by dividing the total sum of the products of each value
(valueᵢ) and its corresponding weight
(weightᵢ) by the sum of all the weights
Figure 1.0: Formula for the Weighted Average (assuming non-percentile weights)
We can see that the formula above maps to our SQL query straightforwardly.
SUM(value * weight) corresponds to the numerator, while
SUM(weight) is the denominator of the formula.
It seems simple, right? However, while this general formula is technically the correct answer, not all applications of weighted averages will reflect this general form. To learn how to deal with real-world schemas, let’s take a look at some different examples.
The general query for the weighted average (Figure 1.1) assumes that we’re gathering the weighted average for the whole table. However, data scientists and analysts typically use databases with more complex schemas in their day-to-day workflow.
To see how we can adjust the general query, we’ll use the following table schemas to set up a more realistic example.
Let’s assume that you’re working as a data analyst for a university. You’re tasked to generate a report on the general weighted grade point average (GWA) of all the students in the university. Since courses have different weights (represented in this case as units), it’s important to use the weighted average instead of the arithmetic mean to calculate the final output.
Since we need to generate the weighted average per student with our table schema, the general form solution by itself won’t work for this problem.
The grades table has a many-to-one relationship with the Students table, and we can map many grades to one student. To generate one GWA for each student, we can use
GROUP BY and
SUM, which is the first approach we’ll discuss today.
The many-to-one nature of our schema makes
GROUP BY a natural choice for our solution. From the general form, our query will transform into:
SELECT students.student_name, SUM(grades.units) AS 'Total Units Taken', ROUND(SUM(grades.grade * grades.units) / SUM(grades.units), 3) AS 'General Weighted Average' FROM grades INNER JOIN students ON grades.student_id = students.student_id GROUP BY grades.student_id;
Figure 2.0: GROUP BY and SUM Solution
While this looks very different from the general form presented above, the line
ROUND(SUM(grades.grade * grades.units) / SUM(grades.units), 3) AS 'General Weighted Average' is just a modified version of our basic query.
In the next section, we’ll break down our solution in parts to show how we generated this example.
If you need a refresher on the basics of
GROUP BY, read through the Interview Query Introduction to SQL course.
Using the table schema, a grade can be mapped to a student using the foreign key, which in this case is
Figure 2.1: Table before GROUP BY
To work with the data more efficiently, we’ll use the
GROUP BY clause on the
This will group rows with the same
student_id, effectively dividing the original table into smaller sets of data. We can then perform the operations in the general form on each set separately, rather than on the entire table.
Figure 2.2: Tables after GROUP BY
We can now treat
grades.grade as the value and the
grades.units as the weight. The joins are then done to match a student name to a GWA. Using the general form (with GROUP BY, of course), our output will be:
|student_name||Total Units Taken||General Weighted Average|
Figure 2.3: Results of the GROUP BY and SUM Query
Our first approach is great whenever we want something as simple as a general weighted average. But, if we want our data to be dynamic (such as in leaderboards or scoreboards), a moving average may be more suitable. Using our previous table schema, how could we get the moving weighted average with SQL?
Because our values now vary based on previous data (as the data is moving), we’ll need to employ a more advanced SQL feature called window functions. This will allow us to aggregate data on a per-row (or per window) basis instead of by table. While we tackled the first iteration of breaking down the table with GROUP BY, we can create a more granular solution using window functions.
Window functions are functions that use one or multiple rows to return values for each row. They are different from aggregate functions in that aggregate functions return just one value for every group of rows, while window functions return a value for each row.
Our moving weighted average solution should take the form of the following:
SELECT student_name, course_name, grade, units, ROUND((SUM(grade*units) OVER ( PARTITION BY student_id ORDER BY grade_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / SUM(units) OVER ( PARTITION BY student_id ORDER BY grade_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)), 3) AS 'Running Average Grade' FROM Students JOIN Grades ON Students.student_id = Grades.student_id ORDER BY grades.student_id, grade_id;
Figure 3.0: Window Function Solution for Calculating the Moving Average
While this might look intimidating at first glance, we’ll break it down step-by-step below.
The modified version of the general form in this query is:
ROUND((SUM(grade*units) OVER ( PARTITION BY student_id ORDER BY grade_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / SUM(units) OVER ( PARTITION BY student_id ORDER BY grade_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)), 3) AS 'Running Average Grade'
Figure 3.1: The Modified General Form from Figure 3.0
Still quite a big chunk, isn’t it? We can express the general form G
SELECT (SUM(value * weight) / SUM(weight)) into two parts, G₁
SUM(value * weight) and G₂
Similarly, we can express the modified general form (see Figure 3.1) as G’ into two parts:
(SUM(grade*units) OVER ( PARTITION BY student_id ORDER BY grade_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
and G’₂ :
SUM(units) OVER ( PARTITION BY student_id ORDER BY grade_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
Note, the code snippet
SUM(value*weight) from G₁ is still in G’₁ as
SUM(grade*units) . The same thing is true with G₂ . What makes them different are our complex window functions. Window functions are declared using the
OVER keyword, typically after an aggregation function (which in this case is
A window function is defined by three sections:
ORDER BY, and the window frame.
PARTITION BY student_idmeans that the window is defined for each unique
student_idand the calculation is performed separately for each student.
ORDER BY grade_idspecifies the order in which the rows are considered within each partition. Here, the rows are ordered by the
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWis our window frame and it defines the range of rows included in the calculation. We can read this like plain English and understand that we start the calculation from the beginning of our partition (specified by
UNBOUNDED PRECEEDING) up until the current row.
Let’s see this in action.
PARTITION BY, the results will look largely equivalent to Figures 2.1 and 2.2, as partitions function similarly to
Figure 3.2: Table before PARTITION BY
Figure 3.3: Tables after PARTITION BY
ORDER BYclause, which means we’ll run the window aggregation in the specified order. In this case, we order them by
Figure 3.4: Window Function Aggregation Visualization
The aggregation will be performed on each partition separately. In this case, we have two partitions: Partition 1 (consisting of
grade_id 1 to 3) and Partition 2 (for
grade_id 4 to 5).
The window function aggregation will consider the order of the rows based on the
grade_id column. It’ll start the calculation from the beginning of each partition as defined by
UNBOUNDED PRECEDING and continue up to the current row within the window.
Let’s focus on Partition 1 as an example. When the window function is performed on the second row (
grade_id 2), it’ll calculate the values starting from the beginning of the partition (
grade_id 1) up to the current window, which is
Therefore, the results of our query should look like this:
|student_name||course_name||grade||units||Running Average Grade|
|Emily Williams||Computer Science||3.90||4||3.900|
Figure 3.5: Results of the Moving Average Query
In this section, we’ll solve three of the most common SQL weighted average interview questions.
Let’s find out how many you can solve correctly!
A marketing team is doing a review of their past email campaigns and they’re interested in understanding the effectiveness of each campaign. They’ve collected data on the number of users who opened each email and the number of users who clicked on a link within the email. They want to compute a weighted average score for each campaign where the weight of the open rate is 0.3 and the weight of the click rate is 0.7.
Write a SQL query to calculate the weighted average for each campaign.
Note: The weighted average should be rounded to two decimal places.
The query for this problem should include:
SELECT campaign_name, ROUND( (0.3 * (num_opens / num_users)) + (0.7 * (num_clicks / num_users)), 2) AS weighted_avg FROM email_campaigns;
Figure 4.1: Preliminary Solution for Weighted Average Email Campaign
The query first calculates the open rate
(num_opens / num_users) and click rate
(num_clicks / num_users). The weighted average is then calculated using 0.3 for the weight of the open rate and 0.7 for the click rate.
Note that, in SQL, dividing two integers results in an integer. If
num_clicks are integers, you should convert them to float to get the correct result. This is especially important when dealing with ratios or percentages.
With the conversion, the query should look like:
SELECT campaign_name, ROUND( -- casting to assure that we get the correct division result (0.3 * (CAST(num_opens AS DECIMAL(10,2)) / CAST(num_users AS DECIMAL(10,2)))) + (0.7 * (CAST(num_clicks AS DECIMAL(10,2)) / CAST(num_users AS DECIMAL(10,2)))), 2) AS weighted_avg FROM email_campaigns;
Figure 4.2: Corrected Solution for Weighted Average Email Campaign with Unaccounted Edge Cases
However, as it stands, this query can’t handle a situation where
num_users is 0, which would cause a dividing by zero error.
We can prevent this by introducing a conditional clause:
SELECT campaign_name, -- introducing a conditional clause to avoid a division by zero error IF(num_users > 0, ROUND( (0.3 * (CAST(num_opens AS DECIMAL(10,2)) / CAST(num_users AS DECIMAL(10,2)))) + (0.7 * (CAST(num_clicks AS DECIMAL(10,2)) / CAST(num_users AS DECIMAL(10,2)))), 2), 0) AS weighted_avg FROM email_campaigns;
Figure 4.3: Full Solution
In this version, if
num_users is greater than 0, it calculates the weighted average as before. If
num_users is 0, it simply returns 0 as the weighted average.
The sales department is conducting a performance review and is interested in trends in product sales. They’ve decided to use a weighted moving average as part of their analysis.
Write a SQL query to calculate the 3-day weighted moving average of sales for each product. Use the weights 0.5 for the current day, 0.3 for the previous day, and 0.2 for the day before that. Round the average to two decimal points.
Note: only output the weighted moving average for dates that have two or more preceding dates. You may assume that the table has no missing dates.
Your answer should look something like the following query:
WITH weighted_sales AS ( SELECT date, product_id, sales_volume, -- Calculate the weighted average sales using the sales_volume of the current row and the two previous rows (0.5 * sales_volume + 0.3 * LAG(sales_volume, 1) OVER (PARTITION BY product_id ORDER BY date) + 0.2 * LAG(sales_volume, 2) OVER (PARTITION BY product_id ORDER BY date)) AS weighted_avg_sales, -- Count the number of preceding days (including the current day) within a rolling window of 3 days COUNT(*) OVER (PARTITION BY product_id ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS num_preceding_days FROM sales ) SELECT date, product_id, weighted_avg_sales FROM weighted_sales WHERE -- Count the number of preceding days (including the current day) within a rolling window of 3 days num_preceding_days = 3 ORDER BY date, product_id;
Figure 5.0: Full Solution
We structured our solution in this article using a common table expression (CTE), so let’s focus first on that block. The CTE is referenced as the query inside the
WITH ... AS syntax.
There are two critical parts of this query, each aimed at fulfilling a requirement of the question:
First, let’s dissect the
(0.5 * sales_volume + 0.3 * LAG(sales_volume, 1) OVER (PARTITION BY product_id ORDER BY date) + 0.2 * LAG(sales_volume, 2) OVER (PARTITION BY product_id ORDER BY date)) AS weighted_avg_sales
Figure 5.1: Solution for Requirement A
We have three different nested expressions in this block.
0.5 * sales_volumetakes the current day’s sales and applies a weight of 0.5.
0.3 * LAG(sales_volume, 1) OVER (PARTITION BY product_id ORDER BY date)calculates the sales volume of the previous day, lagged by 1, within each
LAGfunction is used to access the sales volume of the preceding day within each
product_idgroup, ordered by
0.2 * LAG(sales_volume, 2) OVER (PARTITION BY product_id ORDER BY date)calculates the sales volume of the day before the previous day, lagged by 2, within each
By calculating their sum, we get their weighted average.
For requirement B, we’re interested in the
num_preceding_days alias and the
WHERE clause in the main query.
num_preceding_days column represents the count of preceding days within each
product_id group. It’s calculated using a window function that counts the number of rows in the window frame, including the current row and the two preceding rows, based on the order of
This count determines whether a row is included in the final result set, as only rows with
num_preceding_days equal to 3 are selected, as specified in
WHERE num_preceding_days = 3.
The analytics team of a platform wants to calculate the 3-day rolling weighted average for new daily users, where the most recent day has a weight of 3, the second-most recent day has a weight of 2, and the third-most recent day has a weight of 1. The team wants to analyze the trend and growth of new daily users over a short period. When the system doesn’t detect any new users by the end of the day, it won’t load that date into the database.
Write a SQL query to calculate the 3-day rolling weighted average for new daily users from the
users table. Round the average to two decimal places.
-- This CTE is used to calculate the lag offsets and the first date in the dataset. WITH ordered_users AS ( SELECT date, new_users, LAG(new_users, 1) OVER (ORDER BY date) AS new_users_lag_1, LAG(new_users, 2) OVER (ORDER BY date) AS new_users_lag_2, MIN(date) OVER () AS initial_date FROM users ), calculated_users AS ( SELECT date, initial_date, -- This 'CASE WHEN' block does the logic in determining how to calculate a datapoint depending -- on the date difference from the previous rows. CASE WHEN DATEDIFF(date, LAG(date, 1) OVER (ORDER BY date)) = 1 THEN CASE WHEN DATEDIFF(date, LAG(date, 2) OVER (ORDER BY date)) = 2 THEN ROUND((3 * new_users + 2 * new_users_lag_1 + new_users_lag_2) / 6, 2) ELSE ROUND((3 * new_users + 2 * new_users_lag_1) / 6, 2) END WHEN DATEDIFF(date, LAG(date, 1) OVER (ORDER BY date)) = 2 THEN ROUND((3 * new_users + new_users_lag_1) / 6, 2) ELSE ROUND(3 * new_users / 6, 2) END AS weighted_average FROM ordered_users ) SELECT date, weighted_average FROM calculated_users WHERE -- Ensures that only the dates at least two days from the first day of calculation -- are added to the result set. We can include the second row, as long as it has -- skipped a date from the first row. DATEDIFF(date, initial_date) >= 2;
Figure 6.1: Full Solution
In this query, two CTEs are used to organize and isolate the calculations, as well as to separate our conditional filters from other operations.
The first CTE,
ordered_users, is used for data organization and preparation. It performs two significant operations:
LAG()function as offsets
initial_date) using a window function
These are all potentially complicated calculations that are best done separately from the main query for readability and maintenance. This results in the following table:
Figure 6.2: Table before the LAG Offsets
Figure 6.3: Table after the LAG Offsets
The second CTE,
calculated_users, is used to compute the weighted average of new users based on the data prepared in the
ordered_users CTE. This weighted average is calculated differently depending on the gap between the current date and the previous date(s).
OVER (ORDER BY date) clause ensures that the rows are ordered by the
date column before the
LAG() function is applied.
Ideally, we’ll retrieve the date and the corresponding new users from the first CTE and apply the weighted average logic accordingly, as expressed in the code: (
3 * new_users+ 2 * new_users_lag_1+ new_users_lag_2), represented as G₁ in our general form.
However, note that some dates may be missing. There are several ways to address this, with some more convenient (such as Postgres’ sequence builder) than others. For the most vendor-agnostic solution, we used the
CASE WHEN approach:
CASE -- (A) WHEN DATEDIFF(date, LAG(date, 1) OVER (ORDER BY date)) = 1 THEN CASE -- case 1: WHEN DATEDIFF(date, LAG(date, 2) OVER (ORDER BY date)) = 2 THEN ROUND((3 * new_users + 2 * new_users_lag_1 + new_users_lag_2) / 6, 2) -- case 2: ELSE ROUND((3 * new_users + 2 * new_users_lag_1) / 6, 2) END -- (B) -- case 3: WHEN DATEDIFF(date, LAG(date, 1) OVER (ORDER BY date)) = 2 THEN ROUND((3 * new_users + new_users_lag_1) / 6, 2) -- case 4: ELSE ROUND(3 * new_users / 6, 2) END AS weighted_average
Figure 6.4: CASE WHEN Block
To walk through how we got the weighted sum for G₁ with the missing dates considered, let’s divide up this block with a decision tree. So, if:
3 * new_users + 2 * new_users_lag_1 + new_users_lag_2to get our weighted sum.
3 * new_users + 2 * new_users_lag_1.
3 * new_users + new_users_lag_2.
3 * new_users.
We’ll then divide our weighted sum in G₁ by six, as that is the sum of the weights (3+2+1), which is represented as G₂ in our general form.
💡 Notice that we divided the sum weighted sum by the
SUM of the weights because we don’t have percentile weights in this question.
Lastly, we add the
WHERE clause, which is a little tricky too. While it’s easy to dismiss the first two rows from our result set, this would be a naïve approach.
Consider a situation wherein the second row is not the day after the first row. This means that it’s logically correct to assume that the second date should be a part of the result set. As such, instead of basing our decision on whether to include the second row based on its placement on the table, we should consider the interval between the first date instead, using
DATEDIFF(date, initial_date) >= 2.