In-Depth Guide and Interview Questions

Data analysis plays a pivotal role in decision-making processes across various industries. Among the many techniques available, the weighted average is a powerful tool for deriving valuable insights from numerical data. But how exactly can we calculate the weighted average using SQL?

Let’s explore some practical tips and techniques in generating the weighted average using SQL.

What is the weighted average, and how does it differ from the arithmetic mean?

The rationale behind the arithmetic mean is simple– it represents a set of data points as a singular output. There are limitations to this approach since it assumes that all data points are of equal importance, which isn’t always the case in the real world. To accommodate for variance in significance or “weights”, we use the **weighted average.**

The general query for the weighted average is:

```
SELECT (SUM(value * weight) / SUM(weight))
```

*Figure 1.1:* *The General Form**for the Weighted Average*

This query represents the weighted average formula, which calculates the weighted average by dividing the total sum of the products of each value `(valueᵢ)`

and its corresponding weight `(weightᵢ)`

by the sum of all the weights `(weight_j)`

.

*Figure 1.0: Formula for the Weighted Average (assuming non-percentile weights)*

We can see that the formula above maps to our SQL query straightforwardly. `SUM(value * weight)`

corresponds to the numerator, while `SUM(weight)`

is the denominator of the formula.

It seems simple, right? However, while this general formula is *technically* the *correct* answer, not all applications of weighted averages will reflect this general form. To learn how to deal with real-world schemas, let’s take a look at some different examples.

The general query for the weighted average (Figure 1.1) assumes that we’re gathering the weighted average for the whole table. However, data scientists and analysts typically use databases with more complex schemas in their day-to-day workflow.

To see how we can adjust the general query, we’ll use the following table schemas to set up a more realistic example.

`Students`

table

Column | Data Type |
---|---|

student_id | INT (PK) |

student_name | VARCHAR(50) |

`Grades`

table

Column | Data Type |
---|---|

grade_id | INT (PK) |

student_id | INT (FK) |

course_name | VARCHAR(50) |

grade | DECIMAL(3,2) |

units | INT |

Let’s assume that you’re working as a data analyst for a university. You’re tasked to generate a report on the general weighted grade point average (GWA) of all the students in the university. Since courses have different weights (represented in this case as units), it’s important to use the weighted average instead of the arithmetic mean to calculate the final output.

Since we need to generate the weighted average per ** student** with our table schema, the general form solution by itself won’t work for this problem.

The grades table has a many-to-one relationship with the Students table, and we can map many grades to one student. To generate one GWA for each student, we can use `GROUP BY`

and `SUM`

, which is the first approach we’ll discuss today.

The many-to-one nature of our schema makes `GROUP BY`

a natural choice for our solution. From the general form, our query will transform into:

```
SELECT
students.student_name,
SUM(grades.units) AS 'Total Units Taken',
ROUND(SUM(grades.grade * grades.units) / SUM(grades.units), 3) AS 'General Weighted Average'
FROM
grades
INNER JOIN
students ON grades.student_id = students.student_id
GROUP BY
grades.student_id;
```

*Figure 2.0: GROUP BY and SUM Solution*

While this looks very different from the general form presented above, the line `ROUND(SUM(grades.grade * grades.units) / SUM(grades.units), 3) AS 'General Weighted Average'`

is just a modified version of our basic query.

In the next section, we’ll break down our solution in parts to show how we generated this example.

If you need a refresher on the basics of `GROUP BY`

, read through the Interview Query Introduction to SQL course.

Using the table schema, a grade can be mapped to a student using the foreign key, which in this case is `student_id`

.

student_id | student_name | course_name | grade |
---|---|---|---|

1 | John Smith | Mathematics | 3.50 |

1 | John Smith | Physics | 4.00 |

1 | John Smith | English | 3.80 |

2 | Jane Doe | Chemistry | 3.90 |

2 | Jane Doe | Biology | 3.70 |

*Figure 2.1: Table before GROUP BY*

To work with the data more efficiently, we’ll use the ** GROUP BY** clause on the

`grades.student_id`

This will group rows with the same ** student_id**, effectively dividing the original table into smaller sets of data. We can then perform the operations in the general form on each set separately, rather than on the entire table.

student_id | student_name | course_name | grade |
---|---|---|---|

1 | John Smith | Mathematics | 3.50 |

1 | John Smith | Physics | 4.00 |

1 | John Smith | English | 3.80 |

student_id | student_name | course_name | grade |
---|---|---|---|

2 | Jane Doe | Chemistry | 3.90 |

2 | Jane Doe | Biology | 3.70 |

*Figure 2.2: Tables after GROUP BY*

We can now treat `grades.grade`

as the ** value** and the

`grades.units`

as the student_name | Total Units Taken | General Weighted Average |
---|---|---|

John Smith | 10 | 3.740 |

Jane Doe | 7 | 3.814 |

Michael Johnson | 7 | 3.371 |

Emily Williams | 6 | 3.667 |

Daniel Brown | 6 | 3.700 |

Olivia Davis | 4 | 3.400 |

*Figure 2.3: Results of the GROUP BY and SUM Query*

Our first approach is great whenever we want something as simple as a general weighted average. But, if we want our data to be dynamic (such as in leaderboards or scoreboards), a moving average may be more suitable. Using our previous table schema, how could we get the **moving weighted average** with SQL?

Because our values now vary based on previous data (as the data is *moving*), we’ll need to employ a more advanced SQL feature called window functions. This will allow us to aggregate data on a per-row (or *per window*) basis instead of by table. While we tackled the first iteration of breaking down the table with GROUP BY, we can create a more granular solution using **window functions.**

Window functionsare functions that use one or multiple rows to return values for each row. They are different from aggregate functions in that aggregate functions return just one value for every group of rows, while window functions return a value for each row.

Our moving weighted average solution should take the form of the following:

```
SELECT
student_name,
course_name,
grade,
units,
ROUND((SUM(grade*units) OVER (
PARTITION BY student_id
ORDER BY grade_id ROWS
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / SUM(units) OVER (
PARTITION BY student_id
ORDER BY grade_id ROWS
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)), 3)
AS 'Running Average Grade'
FROM
Students
JOIN
Grades ON Students.student_id = Grades.student_id
ORDER BY
grades.student_id, grade_id;
```

*Figure 3.0: Window Function Solution for Calculating the Moving Average*

While this might look intimidating at first glance, we’ll break it down step-by-step below.

The modified version of the general form in this query is:

```
ROUND((SUM(grade*units) OVER (
PARTITION BY student_id
ORDER BY grade_id ROWS
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / SUM(units) OVER (
PARTITION BY student_id
ORDER BY grade_id ROWS
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)), 3)
AS 'Running Average Grade'
```

*Figure 3.1: The Modified General Form from Figure 3.0*

Still quite a big chunk, isn’t it? We can express the general form **G** `SELECT (SUM(value * weight) / SUM(weight))`

into two parts, **G₁** `SUM(value * weight)`

and **G₂** `SUM(weight)`

.

Similarly, we can express the modified general form (see Figure 3.1) as **G’** into two parts:

**G’₁** :

```
(SUM(grade*units) OVER (
PARTITION BY student_id
ORDER BY grade_id ROWS
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
```

and **G’₂** :

```
SUM(units) OVER (
PARTITION BY student_id
ORDER BY grade_id ROWS
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```

Note, the code snippet `SUM(value*weight)`

from **G₁** is still in **G’₁** as `SUM(grade*units)`

. The same thing is true with **G₂** . What makes them different are our complex window functions. Window functions are declared using the `OVER`

keyword, typically after an aggregation function (which in this case is `SUM`

).

A window function is defined by three sections: `PARTITION BY`

, `ORDER BY`

, and the window frame.

means that the window is defined for each unique`PARTITION BY student_id`

`student_id`

and the calculation is performed separately for each student.specifies the order in which the rows are considered within each partition. Here, the rows are ordered by the`ORDER BY grade_id`

`grade_id`

column.is our window frame and it defines the range of rows included in the calculation. We can read this like plain English and understand that we start the calculation from the beginning of our partition (specified by`ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`

`UNBOUNDED PRECEEDING`

) up until the current row.

Let’s see this in action.

- For
`PARTITION BY`

, the results will look largely equivalent to Figures 2.1 and 2.2, as partitions function similarly to`GROUP BY`

.

student_id | grade_id | student_name | course_name | grade |
---|---|---|---|---|

1 | 1 | John Smith | Mathematics | 3.50 |

1 | 2 | John Smith | Physics | 4.00 |

1 | 3 | John Smith | English | 3.80 |

2 | 4 | Jane Doe | Chemistry | 3.90 |

2 | 5 | Jane Doe | Biology | 3.70 |

*Figure 3.2: Table before PARTITION BY*

student_id | grade_id | student_name | course_name | grade |
---|---|---|---|---|

1 | 1 | John Smith | Mathematics | 3.50 |

1 | 2 | John Smith | Physics | 4.00 |

1 | 3 | John Smith | English | 3.80 |

student_id | grade_id | student_name | course_name | grade |
---|---|---|---|---|

2 | 4 | Jane Doe | Chemistry | 3.90 |

2 | 5 | Jane Doe | Biology | 3.70 |

*Figure 3.3: Tables after PARTITION BY*

- We then have the
`ORDER BY`

clause, which means we’ll run the window aggregation in the specified order. In this case, we order them by`grade_id`

. - Using our window frame definition, we should have the following happen when running our query:

*Figure 3.4: Window Function Aggregation Visualization*

The aggregation will be performed on each partition separately. In this case, we have two partitions: Partition 1 (consisting of `grade_id`

1 to 3) and Partition 2 (for `grade_id`

4 to 5).

The window function aggregation will consider the order of the rows based on the `grade_id`

column. It’ll start the calculation from the beginning of each partition as defined by `UNBOUNDED PRECEDING`

and continue up to the current row within the window.

Let’s focus on Partition 1 as an example. When the window function is performed on the second row (`grade_id`

2), it’ll calculate the values starting from the beginning of the partition (`grade_id`

1) up to the current window, which is `grade_id`

2.

Therefore, the results of our query should look like this:

student_name | course_name | grade | units | Running Average Grade |
---|---|---|---|---|

John Smith | Mathematics | 3.50 | 4 | 3.500 |

John Smith | Physics | 4.00 | 3 | 3.714 |

John Smith | English | 3.80 | 3 | 3.740 |

Jane Doe | Chemistry | 3.90 | 4 | 3.900 |

Jane Doe | Biology | 3.70 | 3 | 3.814 |

Michael Johnson | History | 3.20 | 3 | 3.200 |

Michael Johnson | Geography | 3.50 | 4 | 3.371 |

Emily Williams | Computer Science | 3.90 | 4 | 3.900 |

Emily Williams | Art | 3.20 | 2 | 3.667 |

Daniel Brown | Psychology | 3.60 | 3 | 3.600 |

Daniel Brown | Sociology | 3.80 | 3 | 3.700 |

Olivia Davis | Economics | 3.40 | 4 | 3.400 |

*Figure 3.5: Results of the Moving Average Query*

In this section, we’ll solve three of the most common SQL weighted average interview questions.

Let’s find out how many you can solve correctly!

A marketing team is doing a review of their past email campaigns and they’re interested in understanding the effectiveness of each campaign. They’ve collected data on the number of users who opened each email and the number of users who clicked on a link within the email. They want to compute a weighted average score for each campaign where the weight of the open rate is 0.3 and the weight of the click rate is 0.7.

Write a SQL query to calculate the weighted average for each campaign.

** Note:** The weighted average should be rounded to two decimal places.

**Input:**

`email_campaigns`

table

Column | Type |
---|---|

campaign_id | INTEGER |

campaign_name | VARCHAR |

num_users | INTEGER |

num_opens | INTEGER |

num_clicks | INTEGER |

**Output:**

Column | Type |
---|---|

campaign_name | VARCHAR |

weighted_avg | DECIMAL(4,2) |

The query for this problem should include:

```
SELECT
campaign_name,
ROUND(
(0.3 * (num_opens / num_users)) +
(0.7 * (num_clicks / num_users)),
2) AS weighted_avg
FROM
email_campaigns;
```

*Figure 4.1: Preliminary Solution for Weighted Average Email Campaign*

The query first calculates the open rate `(num_opens / num_users)`

and click rate `(num_clicks / num_users)`

. The weighted average is then calculated using 0.3 for the weight of the open rate and 0.7 for the click rate.

Note that, in SQL, dividing two integers results in an integer. If `num_users`

, `num_opens`

, and `num_clicks`

are integers, you should convert them to float to get the correct result. This is especially important when dealing with ratios or percentages.

With the conversion, the query should look like:

```
SELECT
campaign_name,
ROUND(
-- casting to assure that we get the correct division result
(0.3 * (CAST(num_opens AS DECIMAL(10,2)) / CAST(num_users AS DECIMAL(10,2)))) +
(0.7 * (CAST(num_clicks AS DECIMAL(10,2)) / CAST(num_users AS DECIMAL(10,2)))),
2) AS weighted_avg
FROM
email_campaigns;
```

*Figure 4.2: Corrected Solution for Weighted Average Email Campaign with Unaccounted Edge Cases*

However, as it stands, this query can’t handle a situation where `num_users`

is 0, which would cause a dividing by zero error.

We can prevent this by introducing a conditional clause:

```
SELECT
campaign_name,
-- introducing a conditional clause to avoid a division by zero error
IF(num_users > 0,
ROUND(
(0.3 * (CAST(num_opens AS DECIMAL(10,2)) / CAST(num_users AS DECIMAL(10,2)))) +
(0.7 * (CAST(num_clicks AS DECIMAL(10,2)) / CAST(num_users AS DECIMAL(10,2)))),
2),
0) AS weighted_avg
FROM
email_campaigns;
```

*Figure 4.3: Full Solution*

In this version, if `num_users`

is greater than 0, it calculates the weighted average as before. If `num_users`

is 0, it simply returns 0 as the weighted average.

The sales department is conducting a performance review and is interested in trends in product sales. They’ve decided to use a weighted moving average as part of their analysis.

Write a SQL query to calculate the 3-day weighted moving average of sales for each product. Use the weights 0.5 for the current day, 0.3 for the previous day, and 0.2 for the day before that. Round the average to two decimal points.

*Note: only output the weighted moving average for dates that have two or more preceding dates. You may assume that the table has no missing dates.*

**Input:**

`sales`

table

Column | Type |
---|---|

date | DATE |

product_id | INTEGER |

sales_volume | INTEGER |

**Output:**

Column | Type |
---|---|

date | DATE |

product_id | INTEGER |

weighted_avg_sales | FLOAT |

Your answer should look something like the following query:

```
WITH weighted_sales AS (
SELECT
date,
product_id,
sales_volume,
-- Calculate the weighted average sales using the sales_volume of the current row and the two previous rows
(0.5 * sales_volume +
0.3 * LAG(sales_volume, 1) OVER (PARTITION BY product_id ORDER BY date) +
0.2 * LAG(sales_volume, 2) OVER (PARTITION BY product_id ORDER BY date))
AS weighted_avg_sales,
-- Count the number of preceding days (including the current day) within a rolling window of 3 days
COUNT(*) OVER (PARTITION BY product_id ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
AS num_preceding_days
FROM
sales
)
SELECT
date,
product_id,
weighted_avg_sales
FROM
weighted_sales
WHERE
-- Count the number of preceding days (including the current day) within a rolling window of 3 days
num_preceding_days = 3
ORDER BY
date,
product_id;
```

*Figure 5.0: Full Solution*

We structured our solution in this article using a common table expression (CTE), so let’s focus first on that block. The CTE is referenced as the query inside the `WITH ... AS`

syntax.

There are two critical parts of this query, each aimed at fulfilling a requirement of the question:

- (A) generating the weighted moving average
- (B) only including days with at least two preceding days.

First, let’s dissect the `weighted_avg_sales`

section:

```
(0.5 * sales_volume +
0.3 * LAG(sales_volume, 1) OVER (PARTITION BY product_id ORDER BY date) +
0.2 * LAG(sales_volume, 2) OVER (PARTITION BY product_id ORDER BY date))
AS weighted_avg_sales
```

*Figure 5.1: Solution for Requirement A*

We have three different nested expressions in this block.

`0.5 * sales_volume`

takes the current day’s sales and applies a weight of 0.5.calculates the sales volume of the previous day, lagged by 1, within each`0.3 * LAG(sales_volume, 1) OVER (PARTITION BY product_id ORDER BY date)`

group. The`product_id`

function is used to access the sales volume of the preceding day within each`LAG`

group, ordered by`product_id`

`date`

.calculates the sales volume of the day before the previous day, lagged by 2, within each`0.2 * LAG(sales_volume, 2) OVER (PARTITION BY product_id ORDER BY date)`

group.`product_id`

By calculating their sum, we get their weighted average.

Notice that we did not need to divide the sum by the `SUM`

of the weights. This is because the weights in this question are percentile, so we can directly multiply the weight by the value without dividing the total by the sum of the weights.

For requirement B, we’re interested in the `num_preceding_days`

alias and the `WHERE`

clause in the main query.

The `num_preceding_days`

column represents the count of preceding days within each `product_id`

group. It’s calculated using a window function that counts the number of rows in the window frame, including the current row and the two preceding rows, based on the order of `date`

.

This count determines whether a row is included in the final result set, as only rows with `num_preceding_days`

equal to 3 are selected, as specified in `WHERE num_preceding_days = 3`

.

The analytics team of a platform wants to calculate the 3-day rolling weighted average for new daily users, where the most recent day has a weight of 3, the second-most recent day has a weight of 2, and the third-most recent day has a weight of 1. The team wants to analyze the trend and growth of new daily users over a short period. When the system doesn’t detect any new users by the end of the day, it won’t load that date into the database.

Write a SQL query to calculate the 3-day rolling weighted average for new daily users from the ** users** table. Round the average to two decimal places.

**Input:**

`stocks`

table

Column | Type |
---|---|

date | DATE |

price | DECIMAL |

**Output:**

Column | Type |
---|---|

date | DATE |

weighted_avg | DECIMAL |

```
-- This CTE is used to calculate the lag offsets and the first date in the dataset.
WITH ordered_users AS (
SELECT
date,
new_users,
LAG(new_users, 1) OVER (ORDER BY date) AS new_users_lag_1,
LAG(new_users, 2) OVER (ORDER BY date) AS new_users_lag_2,
MIN(date) OVER () AS initial_date
FROM
users
),
calculated_users AS (
SELECT
date,
initial_date,
-- This 'CASE WHEN' block does the logic in determining how to calculate a datapoint depending
-- on the date difference from the previous rows.
CASE
WHEN DATEDIFF(date, LAG(date, 1) OVER (ORDER BY date)) = 1 THEN
CASE
WHEN DATEDIFF(date, LAG(date, 2) OVER (ORDER BY date)) = 2 THEN ROUND((3 * new_users + 2 * new_users_lag_1 + new_users_lag_2) / 6, 2)
ELSE ROUND((3 * new_users + 2 * new_users_lag_1) / 6, 2)
END
WHEN DATEDIFF(date, LAG(date, 1) OVER (ORDER BY date)) = 2 THEN ROUND((3 * new_users + new_users_lag_1) / 6, 2)
ELSE ROUND(3 * new_users / 6, 2)
END AS weighted_average
FROM
ordered_users
)
SELECT
date,
weighted_average
FROM
calculated_users
WHERE
-- Ensures that only the dates at least two days from the first day of calculation
-- are added to the result set. We can include the second row, as long as it has
-- skipped a date from the first row.
DATEDIFF(date, initial_date) >= 2;
```

*Figure 6.1: Full Solution*

In this query, two CTEs are used to organize and isolate the calculations, as well as to separate our conditional filters from other operations.

The first CTE, ** ordered_users**, is used for data organization and preparation. It performs two significant operations:

- Creates additional columns through the
`LAG()`

function as offsets - Calculates the earliest date in the dataset (
) using a window function`initial_date`

These are all potentially complicated calculations that are best done separately from the main query for readability and maintenance. This results in the following table:

date | new_users |
---|---|

2023-01-01 | 100 |

2023-01-02 | 50 |

2023-01-04 | 80 |

2023-01-05 | 120 |

*Figure 6.2: Table before the LAG Offsets*

date | new_users | new_users_lag_1 | new_users_lag_2 |
---|---|---|---|

2023-01-01 | 100 | NULL | NULL |

2023-01-02 | 50 | 100 | NULL |

2023-01-04 | 80 | 50 | 100 |

2023-01-05 | 120 | 80 | 50 |

*Figure 6.3: Table after the LAG Offsets*

The second CTE, ** calculated_users**, is used to compute the weighted average of new users based on the data prepared in the

`ordered_users`

Next, the ** OVER (ORDER BY date)** clause ensures that the rows are ordered by the

`date`

`LAG()`

Ideally, we’ll retrieve the date and the corresponding new users from the first CTE and apply the weighted average logic accordingly, as expressed in the code: ( `3 * new_users+ 2 * new_users_lag_1+ new_users_lag_2`

), represented as **G₁** in our general form.

However, note that some dates may be missing. There are several ways to address this, with some more convenient (such as Postgres’ sequence builder) than others. For the most vendor-agnostic solution, we used the `CASE WHEN`

approach:

```
CASE
-- (A)
WHEN DATEDIFF(date, LAG(date, 1) OVER (ORDER BY date)) = 1 THEN
CASE
-- case 1:
WHEN DATEDIFF(date, LAG(date, 2) OVER (ORDER BY date)) = 2 THEN ROUND((3 * new_users + 2 * new_users_lag_1 + new_users_lag_2) / 6, 2)
-- case 2:
ELSE ROUND((3 * new_users + 2 * new_users_lag_1) / 6, 2)
END
-- (B)
-- case 3:
WHEN DATEDIFF(date, LAG(date, 1) OVER (ORDER BY date)) = 2 THEN ROUND((3 * new_users + new_users_lag_1) / 6, 2)
-- case 4:
ELSE ROUND(3 * new_users / 6, 2)
END AS weighted_average
```

*Figure 6.4: CASE WHEN Block*

To walk through how we got the weighted sum for **G₁** with the missing dates considered, let’s divide up this block with a decision tree. So, if:

- The date one row back is indeed the day before the current date
:*(A)***Case 1:**check if the row before that is two days before our current row. In that case, we can assume that. We use*there are no missing dates*`3 * new_users + 2 * new_users_lag_1 + new_users_lag_2`

to get our weighted sum.**Case 2: If the date two days back is missing**and the third date had no new users, so we should only consider the current date and the day before. So, we use`3 * new_users + 2 * new_users_lag_1`

.

- On the other hand, we should also check if the date one row before our current row is two days back
*(B):***Case 3:**If it is, this means that**only the day before the current day is missing.**Our weighted sum should be calculated as`3 * new_users + new_users_lag_2`

.**Case 4:**If not, this means that**all the days prior to our current date are missing.**We should evaluate our weighted sum as`3 * new_users`

.

We’ll then divide our weighted sum in **G₁** by six, as that is the sum of the weights (3+2+1), which is represented as **G₂** in our general form.

Notice that we divided the sum weighted sum by the `SUM`

of the weights because we don’t have percentile weights in this question.

Lastly, we add the `WHERE`

clause, which is a little tricky too. While it’s easy to dismiss the first two rows from our result set, this would be a naïve approach.

Consider a situation wherein the second row is not the day after the first row. This means that it’s logically correct to assume that the second date should be a part of the result set. As such, instead of basing our decision on whether to include the second row based on its placement on the table, we should consider the interval between the first date instead, using `DATEDIFF(date, initial_date) >= 2`

.

The best way to get better in SQL is practice. Interview Query offers a variety of SQL learning resources to help you practice and improve your SQL skills, including: