This article introduces several SQL queries for calculating cumulative sums. Cumulative sums play a vital role in analyzing data in a SQL database: they let us monitor how a metric progresses over time and uncover valuable insights along the way. We will explore different techniques for calculating them so that you can apply these queries confidently in your own projects.

Later on we will go through the three main methods: aggregate functions, window functions, and grouping techniques. But first, let’s nail down what we mean when we talk about a cumulative sum.

A cumulative sum, also known as a running total, is a progressive addition of a sequence of numbers. It is calculated by successively adding each new value in the sequence to the previous sum, such that you can see where a metric stands currently or on a specific date. Cumulative sums are powerful tools for examining trends, monitoring progress, and comparing time-based data.

For example, suppose you manage a sales team and want to track the units sold within the current period. A cumulative sum can help you measure the team’s effectiveness and track the team’s progress against their quota. If you find you are below quota, this information can be valuable for adjusting resources, setting new targets, or assessing the impact of past process changes.

To illustrate the concept of a cumulative sum for the sales team, let’s consider a simple dataset:

Date | Sales |
---|---|
2023-05-01 | 100 |
2023-05-02 | 200 |
2023-05-03 | 150 |

By calculating a cumulative sum, we can determine the running total of sales for each day:

Date | Sales | CumulativeSales |
---|---|---|
2023-05-01 | 100 | 100 |
2023-05-02 | 200 | 300 |
2023-05-03 | 150 | 450 |

If the team was supposed to be at a running total of 1,000 units by the end of the third day, you may want to start thinking about a pep talk to the team, or changes to the process. You may also want to revise those initial targets, or re-evaluate how they are set in the first place.

There are several methods for calculating cumulative sums in SQL. Our focus will be on the use of aggregate functions, window functions and grouping techniques. Each method has unique benefits and use cases, and by mastering these concepts you can efficiently analyze your data and extract valuable insights for your organization.

The SUM() function and the OVER() clause are two essential components for calculating cumulative sums in SQL. The SUM() function is an aggregate function that computes the total of a specified column, while the OVER() clause is used to define a window or set of rows for the function to operate on.

When used together, they enable you to calculate cumulative sums with ease. Let’s revisit our previous example of daily sales:

Date | Sales |
---|---|
2023-05-01 | 100 |
2023-05-02 | 200 |
2023-05-03 | 150 |

To calculate the cumulative sum of sales for each day, we can use the following SQL query:

```
SELECT Date,
       Sales,
       SUM(Sales) OVER (ORDER BY Date) as CumulativeSales
FROM daily_sales;
```

The SUM() function calculates the total sales, and the OVER() clause defines a window of rows ordered by the `Date` column. The window starts from the first row and grows incrementally, allowing the SUM() function to operate on a progressively larger set of rows for each record. This results in a running total, or cumulative sum, for the sales.

The result will be:

Date | Sales | CumulativeSales |
---|---|---|
2023-05-01 | 100 | 100 |
2023-05-02 | 200 | 300 |
2023-05-03 | 150 | 450 |

With a better understanding of how the SUM() function and the OVER() clause interact to create a running sum, we can now present the general syntax:

```
SELECT column1, column2, SUM(column_to_sum) OVER (ORDER BY column_to_order_by) as CumulativeSum
FROM table_name;
```

In this example, the `column_to_sum` is the column for which you want to calculate the cumulative sum, and the `column_to_order_by` is the column that determines the order in which rows are added to the running total.

As you can see, using the SUM() function and the OVER() clause allows us to calculate cumulative sums in SQL effectively. This approach is particularly useful when working with time-based data or other ordered sequences.
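One detail is worth checking against your own data. When `OVER()` contains an `ORDER BY` but no frame clause, standard SQL applies the default frame `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, which treats rows with the same `ORDER BY` value as peers and gives them the same running total. The sketch below compares that default with an explicit `ROWS` frame. The table `daily_sales_dup` and its rows are hypothetical, made up only to show two sales on the same date.

```sql
-- Hypothetical sample table (not from the article): two rows share 2023-05-02.
CREATE TABLE daily_sales_dup (Date DATE, Sales INT);

INSERT INTO daily_sales_dup (Date, Sales) VALUES
  ('2023-05-01', 100),
  ('2023-05-02', 200),
  ('2023-05-02', 50),
  ('2023-05-03', 150);

SELECT Date,
       Sales,
       -- Default frame (RANGE): rows with the same Date are peers and share one total.
       SUM(Sales) OVER (ORDER BY Date) AS RangeTotal,
       -- Explicit ROWS frame: the total advances one physical row at a time.
       SUM(Sales) OVER (ORDER BY Date
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RowsTotal
FROM daily_sales_dup
ORDER BY Date;
```

With the default frame, both 2023-05-02 rows report 350. With the `ROWS` frame they report different intermediate totals, and which tied row comes first is not guaranteed unless you add a tie-breaker column to the `ORDER BY`.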

The `RANK()` window function creates a new column, assigning a rank to each row in the result set based on a predetermined attribute; rows that tie on that attribute receive the same rank. You can specify this ‘predetermined attribute’ through the `ORDER BY` clause inside `OVER()`.

Let’s take an example using a sales table. If we use `RANK() OVER (ORDER BY Date)`, it means that we are ranking the results based on their date. On the other hand, if we use `RANK() OVER (ORDER BY Sales)`, we are ranking them based on their sales amount.

To use the `RANK()` function for cumulative sum calculations, you can employ a subquery or a Common Table Expression (CTE) to rank the rows by date. Then, we create two instances of that ranked table so that we can do a self-join later. Let’s mark these instances `r1` and `r2` for ease of reference.

We self-join `r1` and `r2`, matching each row in `r1` to every row in `r2` where the rank in `r1`’s row is greater than or equal to the rank in `r2`’s row. Let’s see the following example, referencing our previous table:

While we operate on this row in `r1`:

Date | Sales | Rank |
---|---|---|
2023-05-02 | 200 | 2 |

We have two rows in `r2` that match the condition `r1.rank >= r2.rank`, which are the rows:

Date | Sales | Rank |
---|---|---|
2023-05-01 | 100 | 1 |
2023-05-02 | 200 | 2 |

We can then get the `SUM` of the matching rows on `r2`, i.e., ranks 1 and 2, to get the cumulative sum. Finally, we group the results by the date of `r1`, meaning we get the sum with respect to `r1`’s date column, which gives us the cumulative sum for each day.

Consider the following example, where we calculate the cumulative sum of daily sales:

```
WITH RankedSales AS (
SELECT Date, Sales, RANK() OVER (ORDER BY Date) as SalesRank
FROM daily_sales
)
SELECT r1.Date, r1.Sales, SUM(r2.Sales) as CumulativeSales
FROM RankedSales r1
JOIN RankedSales r2 ON r1.SalesRank >= r2.SalesRank
GROUP BY r1.Date, r1.Sales, r1.SalesRank
ORDER BY r1.Date;
```

In this example, we first create a CTE `RankedSales` with the RANK() function to assign a rank to each row based on the `Date` column. We then join `RankedSales` to itself, which pairs every row with all rows of equal or lower rank:

r1.Date | r1.Sales | r1.SalesRank | r2.Date | r2.Sales | r2.SalesRank |
---|---|---|---|---|---|
2023-05-01 | 100 | 1 | 2023-05-01 | 100 | 1 |
2023-05-02 | 200 | 2 | 2023-05-01 | 100 | 1 |
2023-05-02 | 200 | 2 | 2023-05-02 | 200 | 2 |
2023-05-03 | 150 | 3 | 2023-05-01 | 100 | 1 |
2023-05-03 | 150 | 3 | 2023-05-02 | 200 | 2 |
2023-05-03 | 150 | 3 | 2023-05-03 | 150 | 3 |

Finally, we use the GROUP BY clause to aggregate the sales data and calculate the cumulative sum for each date.

r1.Date | r1.Sales | r1.SalesRank | Matching r2 rows (r2.Sales, r2.SalesRank) | CumulativeSales |
---|---|---|---|---|
2023-05-01 | 100 | 1 | (100, 1) | 100 |
2023-05-02 | 200 | 2 | (100, 1), (200, 2) | 300 |
2023-05-03 | 150 | 3 | (100, 1), (200, 2), (150, 3) | 450 |
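The walkthrough above relies on every date appearing exactly once. When dates repeat, `RANK()` gives the tied rows the same rank and then skips ahead, which changes which rows the self-join matches. The sketch below puts `RANK()` side by side with `DENSE_RANK()` and `ROW_NUMBER()` on a hypothetical `rank_demo` table (the table and its rows are invented for illustration).

```sql
-- Hypothetical sample table (not from the article): two rows share 2023-05-02.
CREATE TABLE rank_demo (Date DATE, Sales INT);

INSERT INTO rank_demo (Date, Sales) VALUES
  ('2023-05-01', 100),
  ('2023-05-02', 200),
  ('2023-05-02', 50),
  ('2023-05-03', 150);

SELECT Date,
       Sales,
       RANK()       OVER (ORDER BY Date) AS RankValue,       -- 1, 2, 2, 4 (ties share a rank, then a gap)
       DENSE_RANK() OVER (ORDER BY Date) AS DenseRankValue,  -- 1, 2, 2, 3 (ties share a rank, no gap)
       ROW_NUMBER() OVER (ORDER BY Date) AS RowNumberValue   -- 1 to 4, order among tied rows is arbitrary
FROM rank_demo
ORDER BY Date;
```

In the rank-based self-join, each of the two tied rows would match both 2023-05-02 rows, so each would report the full total for that date (350) rather than an intermediate value.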

The GROUP BY clause combines rows that share the same values in the columns you specify into a single row. It’s often used with aggregate functions like SUM(), COUNT(), or AVG() to perform calculations on each group of rows. To compute cumulative sums with GROUP BY, we self-join the table so that each group of rows represents a growing window of time, much as we did with the RANK() function before, and then aggregate each group to get its cumulative sum. The same result can also be achieved with a subquery instead of a self-join. For instance, let’s revisit our daily sales example:

```
SELECT d1.Date, d1.Sales, SUM(d2.Sales) as CumulativeSales
FROM daily_sales d1
JOIN daily_sales d2 ON d1.Date >= d2.Date
GROUP BY d1.Date, d1.Sales
ORDER BY d1.Date;
```

In this example, we join the `daily_sales` table to itself, retaining only the rows with equal or earlier dates. The self-join effectively creates a growing window of rows for each date. Then, we use the GROUP BY clause to aggregate the sales data by date and sales amount, calculating the cumulative sum for each date.
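For comparison, here is a minimal sketch of the subquery variant of the same idea. Instead of joining and grouping, a correlated subquery sums every row dated on or before the current row. The table name `daily_sales_demo` is hypothetical; the rows mirror the article’s three-day example.

```sql
-- Hypothetical copy of the article's sample data.
CREATE TABLE daily_sales_demo (Date DATE, Sales INT);

INSERT INTO daily_sales_demo (Date, Sales) VALUES
  ('2023-05-01', 100),
  ('2023-05-02', 200),
  ('2023-05-03', 150);

SELECT d1.Date,
       d1.Sales,
       -- For each outer row, add up all sales dated on or before that row.
       (SELECT SUM(d2.Sales)
        FROM daily_sales_demo d2
        WHERE d2.Date <= d1.Date) AS CumulativeSales
FROM daily_sales_demo d1
ORDER BY d1.Date;
-- CumulativeSales per row: 100, 300, 450
```

Like the self-join, this compares every row with every earlier row, so on large tables the window-function form is usually the cheaper choice.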

We will now see how this can be used beyond dates, and applied to other measures of time like day or month.


Calculating cumulative sums for different time frames is crucial for various types of data analysis, such as trend identification and performance measurement. In this section, we will discuss how to compute cumulative sums by date, day, week, and month using SQL.

We already did this in the last section, but here is the query that uses the GROUP BY clause to calculate the cumulative sum by date:

```
SELECT d1.Date, d1.Sales, SUM(d2.Sales) as CumulativeSales
FROM daily_sales d1
JOIN daily_sales d2 ON d1.Date >= d2.Date
GROUP BY d1.Date, d1.Sales
ORDER BY d1.Date;
```

This query calculates the cumulative sales for each date in the `daily_sales` table.

To compute the cumulative sum by day, you can extract the day from the date using the `DAY()` function and then apply a similar technique as before:

```
SELECT DAY(d1.Date) as Day, d1.Sales, SUM(d2.Sales) as CumulativeSales
FROM daily_sales d1
JOIN daily_sales d2 ON DAY(d1.Date) >= DAY(d2.Date)
GROUP BY Day, d1.Sales
ORDER BY Day;
```

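This query calculates the cumulative sales for each day of the month. Note that `DAY()` returns only the day number, so the query gives meaningful results only when the data covers a single month. If you want to get the cumulative sum by week or by month, we can simply modify the `DAY()` functions in our query into `WEEK()` or `MONTH()`, with the same caveat that these functions ignore the year.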
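When the data spans several months or years, one way to sidestep that caveat is to aggregate to one row per calendar month first and then run a window function over the monthly totals. The sketch below assumes MySQL 8 or later (for `YEAR()`, `MONTH()`, CTEs, and window functions); the table `sales_by_day` and its rows are hypothetical.

```sql
-- Hypothetical sample data spanning three months.
CREATE TABLE sales_by_day (Date DATE, Sales INT);

INSERT INTO sales_by_day (Date, Sales) VALUES
  ('2023-04-29', 50),
  ('2023-04-30', 70),
  ('2023-05-01', 100),
  ('2023-05-02', 200),
  ('2023-06-01', 30);

-- Step 1: collapse the daily rows into one total per year and month.
WITH MonthlySales AS (
  SELECT YEAR(Date)  AS SalesYear,
         MONTH(Date) AS SalesMonth,
         SUM(Sales)  AS MonthSales
  FROM sales_by_day
  GROUP BY YEAR(Date), MONTH(Date)
)
-- Step 2: accumulate the monthly totals in calendar order.
SELECT SalesYear,
       SalesMonth,
       MonthSales,
       SUM(MonthSales) OVER (ORDER BY SalesYear, SalesMonth) AS CumulativeSales
FROM MonthlySales
ORDER BY SalesYear, SalesMonth;
-- (2023, 4, 120, 120), (2023, 5, 300, 420), (2023, 6, 30, 450)
```

Grouping by year as well as month keeps, say, May 2023 and May 2024 from being merged into one bucket.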

There are scenarios where cumulative sums need to be reset, such as when analyzing data within different categories or groups. For instance, when calculating the cumulative sales for each product category, you may need to reset the cumulative sum whenever the category changes.

To reset cumulative sums, you can use the `PARTITION BY` clause within the `OVER()` clause. The following query assumes that `daily_sales` also has a `Category` column:

```
SELECT Category,
Date,
Sales,
SUM(Sales) OVER (PARTITION BY Category ORDER BY Date) as CumulativeSales
FROM daily_sales
ORDER BY Category, Date;
```

This query calculates the cumulative sales for each product category, resetting the cumulative sum whenever the category changes. The `PARTITION BY` clause ensures that the cumulative sum starts over for each distinct category.
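To see the reset on concrete numbers, here is a small sketch with a hypothetical `category_sales` table (the table name, categories, and values are made up). It also adds a per-category grand total, obtained by partitioning without an `ORDER BY`, as a point of comparison.

```sql
-- Hypothetical sample data with two categories.
CREATE TABLE category_sales (Category VARCHAR(20), Date DATE, Sales INT);

INSERT INTO category_sales (Category, Date, Sales) VALUES
  ('Books', '2023-05-01', 100),
  ('Books', '2023-05-02', 200),
  ('Toys',  '2023-05-01', 40),
  ('Toys',  '2023-05-02', 60);

SELECT Category,
       Date,
       Sales,
       -- Running total that restarts at the first row of each category.
       SUM(Sales) OVER (PARTITION BY Category ORDER BY Date) AS CumulativeSales,
       -- No ORDER BY: the whole partition is summed, giving the category total.
       SUM(Sales) OVER (PARTITION BY Category) AS CategoryTotal
FROM category_sales
ORDER BY Category, Date;
-- Books: CumulativeSales 100, 300; CategoryTotal 300, 300
-- Toys:  CumulativeSales 40, 100;  CategoryTotal 100, 100
```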

Window functions in SQL are used to perform calculations across a set of rows that are related to the current row. They provide an advanced way to calculate cumulative sums by allowing you to define a range of rows for aggregation.

For example, to calculate a 7-day rolling sum of daily sales, you can use the `ROWS BETWEEN` clause within the `OVER()` clause:

```
SELECT Date,
Sales,
SUM(Sales) OVER (ORDER BY Date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as RollingCumulativeSales
FROM daily_sales
ORDER BY Date;
```

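This query calculates the 7-day rolling sales total, considering the current row and the six preceding rows. The `ROWS BETWEEN` clause defines the range of rows to include in the aggregation. Because the frame counts rows rather than calendar days, it covers exactly seven days only when the table has one row per day with no gaps.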

Window functions provide a powerful and flexible way to calculate cumulative sums in SQL, allowing you to define custom ranges and aggregation methods.
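If the table can have missing days, a frame defined on the date value rather than on row positions may be closer to what you want. The sketch below assumes MySQL 8 or later, where a `RANGE` frame accepts an `INTERVAL` offset; PostgreSQL 11+ supports the same idea with the spelling `INTERVAL '6 days'`. The table `gappy_sales` and its rows are hypothetical.

```sql
-- Hypothetical sample data with a gap between 2023-05-02 and 2023-05-10.
CREATE TABLE gappy_sales (Date DATE, Sales INT);

INSERT INTO gappy_sales (Date, Sales) VALUES
  ('2023-05-01', 100),
  ('2023-05-02', 200),
  ('2023-05-10', 150);

SELECT Date,
       Sales,
       -- Row-based frame: up to six earlier rows, however old they are.
       SUM(Sales) OVER (ORDER BY Date
                        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS RollingByRows,
       -- Value-based frame (MySQL syntax): only rows dated within the last six days.
       SUM(Sales) OVER (ORDER BY Date
                        RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW) AS RollingByDays
FROM gappy_sales
ORDER BY Date;
-- RollingByRows: 100, 300, 450
-- RollingByDays: 100, 300, 150
```

On 2023-05-10 the row-based frame still reaches back to the sales from early May, while the date-based frame includes only that day.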

In some cases, you might need to calculate the cumulative sum across multiple columns. For example, let’s say you have a table containing the sales data for two different products (ProductA and ProductB) and you want to compute the cumulative sum of their combined sales.

Here’s a way to accomplish this:

```
WITH CombinedSales AS (
SELECT Date,
ProductA_Sales + ProductB_Sales as TotalSales
FROM sales_data
),
CumulativeSales AS (
SELECT Date,
SUM(TotalSales) OVER (ORDER BY Date) as CumulativeSales
FROM CombinedSales
)
SELECT * FROM CumulativeSales;
```

In this example, we first create a CTE `CombinedSales` that calculates the combined sales for both products by adding their corresponding sales columns. Then, we create another CTE `CumulativeSales` that computes the running total of `TotalSales` ordered by date. Finally, we select all rows from `CumulativeSales`.
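One thing to watch for when adding columns: in SQL, `a + b` is `NULL` whenever either side is `NULL`, and `SUM()` silently skips `NULL` inputs, so a day with a missing value for one product can drop out of the running total entirely. A minimal sketch of guarding against that with `COALESCE()`, using a hypothetical `sales_data_demo` table with invented rows:

```sql
-- Hypothetical sample data where one product has no recorded value on some days.
CREATE TABLE sales_data_demo (Date DATE, ProductA_Sales INT, ProductB_Sales INT);

INSERT INTO sales_data_demo (Date, ProductA_Sales, ProductB_Sales) VALUES
  ('2023-05-01', 100, 50),
  ('2023-05-02', 200, NULL),
  ('2023-05-03', NULL, 70);

SELECT Date,
       -- NULL on any day where either column is NULL.
       ProductA_Sales + ProductB_Sales AS RawTotal,
       -- Those NULL days are skipped, so the total stalls at 150.
       SUM(ProductA_Sales + ProductB_Sales) OVER (ORDER BY Date) AS NaiveCumulative,
       -- Treat a missing value as 0 before adding.
       SUM(COALESCE(ProductA_Sales, 0) + COALESCE(ProductB_Sales, 0))
         OVER (ORDER BY Date) AS CumulativeSales
FROM sales_data_demo
ORDER BY Date;
-- RawTotal:        150, NULL, NULL
-- NaiveCumulative: 150, 150, 150
-- CumulativeSales: 150, 350, 420
```

Whether a missing value should count as zero is a business decision; the point is only that the choice should be made explicitly.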

Having covered the fundamentals and techniques, let’s now put our knowledge into practice by applying these concepts to two real-world interview questions.

Let’s start with something easy. You are working with the sales team of an e-commerce store to analyze their monthly performance.

They give you the `sales` table that tracks every purchase made on the store. The table contains the columns `id` (purchase id), `product_id`, `date` (purchase date), and `price`.

Write a SQL query to compute the cumulative sum of sales for each product, sorted by `product_id` and date.

*Note: The cumulative sum for a product on a given date is the sum of the price of all purchases of the product that happened on that date and on all previous dates.*

**Input:**

`sales` table

Column | Type |
---|---|
id | INTEGER |
product_id | INTEGER |
date | DATE |
price | FLOAT |

**Output:**

Column | Type |
---|---|
product_id | INTEGER |
date | DATE |
cumulative_sum | FLOAT |

This question is relatively easy and can be solved with several approaches. For this specific question, let’s use the self-join technique to get the cumulative sum:

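```
SELECT
    s1.product_id,
    s1.date,
    SUM(s2.price) AS cumulative_sum
FROM
    -- One row per product and date, so a day with several purchases is not counted more than once
    (SELECT DISTINCT product_id, date FROM sales) s1
JOIN
    sales s2 ON s1.product_id = s2.product_id AND s1.date >= s2.date
GROUP BY
    s1.product_id, s1.date
ORDER BY
    s1.product_id, s1.date;
```

In the query, we perform a self-join on the `sales` table, which means we join the table with itself. The goal is to establish a relationship between rows based on a common attribute, which in this case is the `product_id`. Because `sales` records every purchase, `s1` is first reduced to one row per product and date; joining the raw table to itself would add each earlier purchase once for every purchase made on the current date.

The self-join is achieved through the following part of the code:

```
FROM
    (SELECT DISTINCT product_id, date FROM sales) s1
JOIN
    sales s2 ON s1.product_id = s2.product_id AND s1.date >= s2.date
```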

Here, we have two instances of the `sales` table, referred to as `s1` and `s2`. Rows from `s1` are matched with rows from `s2` according to the conditions in the `ON` clause.

The condition `s1.product_id = s2.product_id` ensures that only rows with the same `product_id` are joined.

Additionally, we have the condition `s1.date >= s2.date`, which ensures that we are joining the rows where the date in `s1` is on or after the date in `s2`.

By performing this self-join, we are able to create pairs of rows from the `sales` table where the `product_id` matches and the `s2` purchase falls on or before the `s1` date.

After the join, we group the results by `product_id` and `date` using the `GROUP BY` clause.

Finally, we calculate the cumulative sum of the `price` column for each group using the `SUM()` function:

```
SUM(s2.price) AS cumulative_sum
```

By summing the `price` column in the joined rows, we obtain the cumulative sum of sales for each product up to and including the current date.

The result is then ordered by `product_id` and `date`.
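As an alternative sketch, the same running total can be written with a window function. The version below first aggregates each product’s purchases per day with `GROUP BY`, then accumulates those daily totals, so several purchases of a product on one date collapse into a single output row. The table `sales_demo` and its rows are hypothetical.

```sql
-- Hypothetical sample data: product 1 has two purchases on 2023-05-01.
CREATE TABLE sales_demo (id INT, product_id INT, date DATE, price FLOAT);

INSERT INTO sales_demo (id, product_id, date, price) VALUES
  (1, 1, '2023-05-01', 10.0),
  (2, 1, '2023-05-01', 5.0),
  (3, 1, '2023-05-02', 20.0),
  (4, 2, '2023-05-01', 7.5);

SELECT product_id,
       date,
       -- Inner SUM: total price per product per day (from GROUP BY).
       -- Outer SUM ... OVER: running total of those daily totals per product.
       SUM(SUM(price)) OVER (PARTITION BY product_id ORDER BY date) AS cumulative_sum
FROM sales_demo
GROUP BY product_id, date
ORDER BY product_id, date;
-- (1, 2023-05-01, 15), (1, 2023-05-02, 35), (2, 2023-05-01, 7.5)
```

The nested `SUM(SUM(price))` works because window functions are evaluated after grouping, so the window sees one row per product and date.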

Let’s turn it up a notch.

As an accountant for a local grocery store, you have been tasked with calculating the daily sales of each product since their last restocking. You have been provided with three tables: `products`, `sales`, and `restocking`. The `products` table contains information about each product, the `sales` table records the sales transactions, and the `restocking` table tracks the restocking events. Write an SQL query to determine the daily sales for each product since their last restocking event.

Using the given table schemas, write an SQL query to retrieve the `product_name`, `date`, and `sales_since_last_restock` for each product. The `sales_since_last_restock` column represents the cumulative sales of the product for each day since its last restocking event. Order the result set by `product_id`.

Example:

**Input:**

Table `products`:

Column | Type |
---|---|
product_id | int |
product_name | varchar |

Table `sales`:

Column | Type |
---|---|
sales_id | int |
product_id | int |
date | date |
sold_count | int |

Table `restocking`:

Column | Type |
---|---|
restock_id | int |
product_id | int |
restock_date | date |

**Output**
The output should include the following columns:

Column | Type |
---|---|
product_name | varchar |
date | date |
sales_since_last_restock | int |

To approach a complex SQL question involving various facets and querying techniques, it’s important to simplify our logic and streamline our thinking process. By doing so, we can focus solely on the requirements at hand. Let’s outline the key objectives of the question:

- Retrieve the latest restock date for each product_id.
- Filter the sales data to rows on or after that date. The filter is applied independently for each product_id, using the date window that starts from its own latest restock.
- Compute the cumulative sum using the filtered data.

By breaking down the approach into these distinct steps, we can effectively tackle the question and generate the expected result set. Let’s take a look at our solution in code:

```
-- 1. First, create a CTE to get the latest restocking date for each product
WITH latest_restock AS (
SELECT
product_id,
MAX(restock_date) AS max_restock_date
FROM
restocking
GROUP BY
product_id
),
-- 2. Then, create another CTE to get the sales that occurred after the latest restocking
filtered_sales AS (
SELECT
s.*
FROM
sales s
JOIN latest_restock lr ON s.product_id = lr.product_id
WHERE
s.date >= lr.max_restock_date
)
-- 3. Finally, calculate the cumulative sum of sales since the latest restocking
SELECT
p.product_name,
fs.date,
SUM(fs.sold_count) OVER (PARTITION BY fs.product_id ORDER BY fs.date) AS sales_since_last_restock
FROM
filtered_sales fs
JOIN products p ON fs.product_id = p.product_id
ORDER BY
fs.product_id, fs.date;
```


Our solution is divided into three parts: two Common Table Expressions (CTEs) and a final `SELECT`. Let’s go through each part:

- The first CTE generates a table that includes the `product_id` and its corresponding latest restock date using the `MAX()` function. To ensure independent results for each `product_id`, we use the GROUP BY clause.
- Using the data from the first CTE, we apply a `WHERE` clause to filter out rows that are not relevant for computation, as specified in the line `s.date >= lr.max_restock_date`. We compare each sale entry’s date to the most recent restocking date for its corresponding `product_id`. This ensures that we only consider sales that occurred on or after the last restocking.
- With the filtered dataset, we can easily calculate the cumulative sum using a window function. By applying the SUM() function on the `sold_count` column and partitioning the data by `product_id`, we obtain the cumulative sum for each `product_id`. The ORDER BY clause is used to determine the order of the data within each partition, in this case, based on the `date`.

By following these steps, we can effectively generate the cumulative sum based on the specified criteria. This approach allows us to analyze sales data and track the progression of sold quantities for each product over time.
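To try the approach on data, here is a compact sketch with hypothetical sample rows. The table names carry a `_q2` suffix and all values are invented. It folds the two CTEs into a correlated subquery in the `WHERE` clause, which should select the same sales rows; treat it as an illustration rather than the reference solution.

```sql
-- Hypothetical sample data (not from the question).
CREATE TABLE products_q2 (product_id INT, product_name VARCHAR(50));
CREATE TABLE restocking_q2 (restock_id INT, product_id INT, restock_date DATE);
CREATE TABLE sales_q2 (sales_id INT, product_id INT, date DATE, sold_count INT);

INSERT INTO products_q2 (product_id, product_name) VALUES
  (1, 'Apples'),
  (2, 'Bread');

INSERT INTO restocking_q2 (restock_id, product_id, restock_date) VALUES
  (1, 1, '2023-05-01'),
  (2, 1, '2023-05-03'),  -- latest restock for Apples
  (3, 2, '2023-05-02');  -- latest restock for Bread

INSERT INTO sales_q2 (sales_id, product_id, date, sold_count) VALUES
  (1, 1, '2023-05-02', 5),  -- before the latest Apples restock: excluded
  (2, 1, '2023-05-03', 4),
  (3, 1, '2023-05-04', 6),
  (4, 2, '2023-05-01', 3),  -- before the latest Bread restock: excluded
  (5, 2, '2023-05-02', 2),
  (6, 2, '2023-05-04', 1);

SELECT p.product_name,
       s.date,
       -- WHERE runs before the window function, so only post-restock rows are summed.
       SUM(s.sold_count) OVER (PARTITION BY s.product_id ORDER BY s.date)
         AS sales_since_last_restock
FROM sales_q2 s
JOIN products_q2 p ON p.product_id = s.product_id
WHERE s.date >= (SELECT MAX(r.restock_date)
                 FROM restocking_q2 r
                 WHERE r.product_id = s.product_id)
ORDER BY s.product_id, s.date;
-- (Apples, 2023-05-03, 4), (Apples, 2023-05-04, 10),
-- (Bread,  2023-05-02, 2), (Bread,  2023-05-04, 3)
```

As in the CTE version, a product with no restocking rows produces no output, because the comparison against a `NULL` maximum is never true.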

Understanding and calculating cumulative sums in SQL is crucial for various data analysis tasks, such as aggregating data over time, analyzing trends, and computing running totals for different metrics. In this article, we’ve covered various techniques for calculating cumulative sums, including using the SUM() function with the OVER() clause, the RANK() function, the GROUP BY clause, and other window functions.

We’ve also discussed calculating cumulative sums for different time frames, and advanced techniques like resetting cumulative sums or calculating cumulative sums over multiple columns.

Keep practicing and honing your skills to become proficient in handling complex SQL queries and data manipulation tasks.

Further your SQL practice with **the SQL learning path**, and brush up your skills with our comprehensive **SQL question database**.
