# SQL Cumulative Sum: A Comprehensive Guide

## Introduction

This article introduces SQL queries for calculating cumulative sums. Cumulative sums play a vital role in analyzing data within a SQL database: they let us monitor how a metric progresses over time and uncover valuable insights. We will explore several techniques for calculating them, so that you can apply these queries confidently in your own projects.

Later on we will go through the three main methods: aggregate functions, window functions, and grouping techniques. But first, let’s nail down what we mean when we talk about a cumulative sum.

## Understanding Cumulative Sum in SQL

A cumulative sum, also known as a running total, is a progressive addition of a sequence of numbers. It is calculated by successively adding each new value in the sequence to the previous sum, such that you can see where a metric stands currently or on a specific date. Cumulative sums are powerful tools for examining trends, monitoring progress, and comparing time-based data.

For example, suppose you manage a sales team and want to track the units sold within the current period. A cumulative sum can help you measure the team’s effectiveness and track the team’s progress against their quota. If you find you are below quota, this information can be valuable for adjusting resources, setting new targets, or assessing the impact of past process changes.

To illustrate the concept of a cumulative sum for the sales team, let’s consider a simple dataset:

| Date | Sales |
| --- | --- |
| 2023-05-01 | 100 |
| 2023-05-02 | 200 |
| 2023-05-03 | 150 |

By calculating a cumulative sum, we can determine the running total of sales for each day:

| Date | Sales | CumulativeSales |
| --- | --- | --- |
| 2023-05-01 | 100 | 100 |
| 2023-05-02 | 200 | 300 |
| 2023-05-03 | 150 | 450 |
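
The arithmetic behind this table can be checked outside the database. Here is a minimal sketch in Python (used only as a convenient way to run the examples, not something the queries themselves depend on) that accumulates the three sales figures from the table:

```python
from itertools import accumulate

# Daily sales from the example table, in date order.
sales = [100, 200, 150]

# Each element is the previous running total plus the current value.
cumulative_sales = list(accumulate(sales))
print(cumulative_sales)
```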

If the team was supposed to be at a running total of 1,000 units by the end of the third day, you may want to start thinking about a pep talk to the team, or changes to the process. You may also want to revise those initial targets, or re-evaluate how they are set in the first place.

## Calculating Cumulative Sum in SQL

There are several methods for calculating cumulative sums in SQL. Our focus will be on the use of aggregate functions, window functions and grouping techniques. Each method has unique benefits and use cases, and by mastering these concepts you can efficiently analyze your data and extract valuable insights for your organization.

### Aggregate Functions: Using the SUM() Function and OVER() Clause

The SUM() function and the OVER() clause are two essential components for calculating cumulative sums in SQL. The SUM() function is an aggregate function that computes the total of a specified column, while the OVER() clause is used to define a window or set of rows for the function to operate on.

When used together, they enable you to calculate cumulative sums with ease. Let’s revisit our previous example of daily sales:

| Date | Sales |
| --- | --- |
| 2023-05-01 | 100 |
| 2023-05-02 | 200 |
| 2023-05-03 | 150 |

To calculate the cumulative sum of sales for each day, we can use the following SQL query:

```sql
SELECT Date, Sales,
       SUM(Sales) OVER (ORDER BY Date) AS CumulativeSales
FROM daily_sales;
```

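The SUM() function calculates the total sales, and the OVER() clause defines a window of rows ordered by the `Date` column. Because the window has an `ORDER BY` but no explicit frame, it defaults to every row from the start of the ordering up to the current row (including any rows that tie with the current row on `Date`). The window therefore grows incrementally, allowing the SUM() function to operate on a progressively larger set of rows for each record. This results in a running total, or cumulative sum, for the sales.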

The result will be:

| Date | Sales | CumulativeSales |
| --- | --- | --- |
| 2023-05-01 | 100 | 100 |
| 2023-05-02 | 200 | 300 |
| 2023-05-03 | 150 | 450 |
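
If you would like to try this query without a database server, the sketch below runs it against an in-memory SQLite database through Python’s built-in `sqlite3` module. The table definition (text dates, integer sales) is an assumption made for illustration, and window functions require SQLite 3.25 or later.

```python
import sqlite3

# In-memory database holding the three example rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2023-05-01", 100), ("2023-05-02", 200), ("2023-05-03", 150)],
)

# Running total: SUM() evaluated over a window ordered by Date.
rows = conn.execute("""
    SELECT Date, Sales,
           SUM(Sales) OVER (ORDER BY Date) AS CumulativeSales
    FROM daily_sales
    ORDER BY Date
""").fetchall()

for row in rows:
    print(row)
```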

With a better understanding of how the SUM() function and the OVER() clause interact to create a running sum, we can now present the general syntax:

```sql
SELECT column1, column2,
       SUM(column_to_sum) OVER (ORDER BY column_to_order_by) AS CumulativeSum
FROM table_name;
```

In this example, the `column_to_sum` is the column for which you want to calculate the cumulative sum, and the `column_to_order_by` is the column used to order the rows within the window.

As you can see, using the SUM() function and the OVER() clause allows us to calculate cumulative sums in SQL effectively. This approach is particularly useful when working with time-based data or other ordered sequences.
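
One detail worth knowing is how the window behaves when the ordering column contains duplicates. With only an `ORDER BY`, rows that share the same value are treated as peers and receive the same running total; an explicit `ROWS` frame with a tie-breaker gives a strictly row-by-row total instead. The sketch below illustrates the difference in SQLite (3.25+), assuming a hypothetical `id` column and a made-up second sale on 2023-05-02:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (id INTEGER, Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [
        (1, "2023-05-01", 100),
        (2, "2023-05-02", 200),
        (3, "2023-05-02", 50),   # second row on the same date (made up)
        (4, "2023-05-03", 150),
    ],
)

rows = conn.execute("""
    SELECT id, Date, Sales,
           -- default frame: rows with the same Date are summed together
           SUM(Sales) OVER (ORDER BY Date) AS range_total,
           -- explicit ROWS frame plus a tie-breaker: one row at a time
           SUM(Sales) OVER (ORDER BY Date, id
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_total
    FROM daily_sales
    ORDER BY Date, id
""").fetchall()

for row in rows:
    print(row)
```

Both columns end at the same grand total; they differ only on the rows that share a date.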

### Window Functions: Utilizing the RANK() Function for Cumulative Sums

The `RANK()` window function creates a new column that assigns a rank to each row in the result set based on a predetermined attribute. Rows that tie on that attribute receive the same rank, so ranks are unique only when the ordering values are. You can specify this ‘predetermined attribute’ through the `ORDER BY` clause in the window function.

Let’s take an example using a sales table. If we use `RANK() OVER (ORDER BY Date)`, it means that we are ranking the results based on their date. On the other hand, if we use `RANK() OVER (ORDER BY Sales)`, the ranks will be assigned based on the sales amount.
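
To see both orderings side by side, here is a small sketch that runs against an in-memory SQLite database (3.25+). The fourth row is made up so that two rows share a sales amount, which shows how `RANK()` handles ties:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [
        ("2023-05-01", 100),
        ("2023-05-02", 200),
        ("2023-05-03", 150),
        ("2023-05-04", 150),  # made-up row that ties with the previous one on Sales
    ],
)

rows = conn.execute("""
    SELECT Date, Sales,
           RANK() OVER (ORDER BY Date)  AS rank_by_date,
           RANK() OVER (ORDER BY Sales) AS rank_by_sales
    FROM daily_sales
    ORDER BY Date
""").fetchall()

for row in rows:
    print(row)
```

The two rows with sales of 150 share rank 2 when ordered by sales, and the next rank assigned is 4.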

To use the `RANK()` function for cumulative sum calculations, you can employ a subquery or a Common Table Expression (CTE) to rank the rows by date. Then, we reference that ranked table twice so that we can self-join it. Let’s mark these instances `r1` and `r2` for ease of reference.

In the self-join, each row of `r1` is matched to every row of `r2` whose rank is less than or equal to its own, i.e. where `r1`’s rank is greater than or equal to `r2`’s rank. Let’s see the following example, referencing our previous table:

While we operate on this row in `r1`:

| Date | Sales | Rank |
| --- | --- | --- |
| 2023-05-02 | 200 | 2 |

We have two rows that match the condition `r1.rank >= r2.rank`, which are the rows:

| Date | Sales | Rank |
| --- | --- | --- |
| 2023-05-01 | 100 | 1 |
| 2023-05-02 | 200 | 2 |

We can then get the `SUM` of the matching rows on `r2`, i.e., ranks 1 and 2 to get the cumulative sum. Finally, we group the results by the date of `r1`, meaning we will get the sum with respect to `r1`’s date column, enabling us to retrieve the cumulative sum on a per-day window.

Consider the following example, where we calculate the cumulative sum of daily sales:

```sql
WITH RankedSales AS (
    SELECT Date, Sales, RANK() OVER (ORDER BY Date) AS SalesRank
    FROM daily_sales
)
SELECT r1.Date, r1.Sales, SUM(r2.Sales) AS CumulativeSales
FROM RankedSales r1
JOIN RankedSales r2 ON r1.SalesRank >= r2.SalesRank
GROUP BY r1.Date, r1.Sales, r1.SalesRank
ORDER BY r1.Date;
```

In this example, we first create a CTE `RankedSales` with the RANK() function to assign a rank to each row based on the `Date` column. Then, we join the `RankedSales` table to itself, retaining only the rows with equal or lower ranks. This self-join effectively constructs a growing window of rows for each date. Let’s visualize this below:

| r1.Date | r1.Sales | r1.SalesRank | r2.Date | r2.Sales | r2.SalesRank |
| --- | --- | --- | --- | --- | --- |
| 2023-05-01 | 100 | 1 | 2023-05-01 | 100 | 1 |
| 2023-05-02 | 200 | 2 | 2023-05-01 | 100 | 1 |
| 2023-05-02 | 200 | 2 | 2023-05-02 | 200 | 2 |
| 2023-05-03 | 150 | 3 | 2023-05-01 | 100 | 1 |
| 2023-05-03 | 150 | 3 | 2023-05-02 | 200 | 2 |
| 2023-05-03 | 150 | 3 | 2023-05-03 | 150 | 3 |

Finally, we use the GROUP BY clause to aggregate the sales data and calculate the cumulative sum for each date.

| r1.Date | r1.Sales | r1.SalesRank | Matching r2 rows (r2.Sales, r2.SalesRank) | CumulativeSales |
| --- | --- | --- | --- | --- |
| 2023-05-01 | 100 | 1 | (100, 1) | 100 |
| 2023-05-02 | 200 | 2 | (100, 1), (200, 2) | 300 |
| 2023-05-03 | 150 | 3 | (100, 1), (200, 2), (150, 3) | 450 |
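
As a sanity check, the sketch below runs the `RANK()` self-join against an in-memory SQLite database (3.25+) and compares it with the `SUM() OVER()` approach. The table definition is assumed for illustration; on the example data both queries should agree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2023-05-01", 100), ("2023-05-02", 200), ("2023-05-03", 150)],
)

# RANK() + self-join + GROUP BY, as described above.
join_rows = conn.execute("""
    WITH RankedSales AS (
        SELECT Date, Sales, RANK() OVER (ORDER BY Date) AS SalesRank
        FROM daily_sales
    )
    SELECT r1.Date, r1.Sales, SUM(r2.Sales) AS CumulativeSales
    FROM RankedSales r1
    JOIN RankedSales r2 ON r1.SalesRank >= r2.SalesRank
    GROUP BY r1.Date, r1.Sales, r1.SalesRank
    ORDER BY r1.Date
""").fetchall()

# The window-function version, for comparison.
window_rows = conn.execute("""
    SELECT Date, Sales, SUM(Sales) OVER (ORDER BY Date) AS CumulativeSales
    FROM daily_sales
    ORDER BY Date
""").fetchall()

print(join_rows)
print(join_rows == window_rows)
```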

### Grouping Techniques: Using the GROUP BY clause

The GROUP BY clause combines rows that share the same values in the columns you specify into a single row. It’s often employed with aggregate functions like SUM(), COUNT(), or AVG() to perform calculations on each group of rows.

To compute cumulative sums using the GROUP BY clause, we self-join the table so that each group of rows represents a growing window of time, much as we did with the RANK() function before, and then aggregate each group to obtain its cumulative sum. The same result can also be achieved with a subquery instead of a self-join. For instance, let’s revisit our daily sales example:

```sql
SELECT d1.Date, d1.Sales, SUM(d2.Sales) AS CumulativeSales
FROM daily_sales d1
JOIN daily_sales d2 ON d1.Date >= d2.Date
GROUP BY d1.Date, d1.Sales
ORDER BY d1.Date;
```

In this example, we join the `daily_sales` table to itself, retaining only the rows with equal or earlier dates. The self-join effectively creates a growing window of rows for each date. Then, we use the GROUP BY clause to aggregate the sales data by date and sales amount, calculating the cumulative sum for each date.
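
For comparison, the same running total can be written with a correlated subquery instead of a self-join. This is a sketch rather than a recommendation: it is run here against an in-memory SQLite database with an assumed table definition, and like the self-join it rescans earlier rows for every date, so it may be slow on large tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2023-05-01", 100), ("2023-05-02", 200), ("2023-05-03", 150)],
)

# For each row d1, the subquery sums every row dated on or before d1.Date.
rows = conn.execute("""
    SELECT d1.Date,
           d1.Sales,
           (SELECT SUM(d2.Sales)
            FROM daily_sales d2
            WHERE d2.Date <= d1.Date) AS CumulativeSales
    FROM daily_sales d1
    ORDER BY d1.Date
""").fetchall()

for row in rows:
    print(row)
```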

We will now see how this can be used beyond dates, and applied to other measures of time like day or month.

## Cumulative Sum Calculations by Date and Time

Calculating cumulative sums for different time frames is crucial for various types of data analysis, such as trend identification and performance measurement. In this section, we will discuss how to compute cumulative sums by date, day, week, and month using SQL.

### Cumulative Sum by Date

We already accomplished this in the last section, but here is the query that uses the GROUP BY clause to calculate the cumulative sum by date:

```sql
SELECT d1.Date, d1.Sales, SUM(d2.Sales) AS CumulativeSales
FROM daily_sales d1
JOIN daily_sales d2 ON d1.Date >= d2.Date
GROUP BY d1.Date, d1.Sales
ORDER BY d1.Date;
```

This query calculates the cumulative sales for each date in the `daily_sales` table.

### Cumulative Sum by Day

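To compute the cumulative sum by day, you can extract the day from the date using the `DAY()` function (available in MySQL and SQL Server; other databases offer equivalents such as `EXTRACT(DAY FROM ...)`) and then apply a similar technique as before: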

```sql
SELECT DAY(d1.Date) AS Day, d1.Sales, SUM(d2.Sales) AS CumulativeSales
FROM daily_sales d1
JOIN daily_sales d2 ON DAY(d1.Date) >= DAY(d2.Date)
GROUP BY Day, d1.Sales
ORDER BY Day;
```

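This query calculates the cumulative sales for each day of the month. It assumes the table holds one row per day within a single month; otherwise, rows that share a day number are joined to each other and counted more than once. For a cumulative sum by week or by month, replacing `DAY()` with `WEEK()` or `MONTH()` is not enough on its own, because many rows then fall into the same period: first aggregate the sales per period, then compute the running total over those per-period totals.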
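
When several rows fall into the same period, a safe pattern is to total the sales per period first and then take the running sum of those totals. The sketch below shows this for months in SQLite (3.25+), where `strftime('%Y-%m', ...)` plays the role that `MONTH()` plays in MySQL; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [
        ("2023-05-01", 100),
        ("2023-05-02", 200),
        ("2023-06-01", 150),
        ("2023-06-15", 50),
        ("2023-07-03", 300),
    ],
)

rows = conn.execute("""
    WITH monthly AS (
        -- Step 1: one row per month with that month's total.
        SELECT strftime('%Y-%m', Date) AS Month,
               SUM(Sales)              AS MonthlySales
        FROM daily_sales
        GROUP BY strftime('%Y-%m', Date)
    )
    -- Step 2: running total over the monthly totals.
    SELECT Month, MonthlySales,
           SUM(MonthlySales) OVER (ORDER BY Month) AS CumulativeSales
    FROM monthly
    ORDER BY Month
""").fetchall()

for row in rows:
    print(row)
```

Keeping the year in the period key (`%Y-%m`) avoids mixing, say, May of one year with May of the next.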

### Resetting Cumulative Sums

There are scenarios where cumulative sums need to be reset, such as when analyzing data within different categories or groups. For instance, when calculating the cumulative sales for each product category, you may need to reset the cumulative sum whenever the category changes.

To reset cumulative sums, you can use the `PARTITION BY` clause within the `OVER()` window function. Here’s an example:

```sql
SELECT Category,
       Date,
       Sales,
       SUM(Sales) OVER (PARTITION BY Category ORDER BY Date) AS CumulativeSales
FROM daily_sales
ORDER BY Category, Date;
```

This query calculates the cumulative sales for each product category, resetting the cumulative sum whenever the category changes. The `PARTITION BY` clause ensures that the cumulative sum starts over for each distinct category.
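
Here is a runnable sketch of the reset behaviour, using an in-memory SQLite database (3.25+). The `Category` column and the sample rows are assumptions made for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Category TEXT, Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [
        ("Books", "2023-05-01", 100),
        ("Books", "2023-05-02", 200),
        ("Toys", "2023-05-01", 50),
        ("Toys", "2023-05-02", 70),
    ],
)

# The running total restarts at the first row of each Category partition.
rows = conn.execute("""
    SELECT Category, Date, Sales,
           SUM(Sales) OVER (PARTITION BY Category ORDER BY Date) AS CumulativeSales
    FROM daily_sales
    ORDER BY Category, Date
""").fetchall()

for row in rows:
    print(row)
```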

### Cumulative Sum with Window Functions

Window functions in SQL are used to perform calculations across a set of rows that are related to the current row. They provide an advanced way to calculate cumulative sums by allowing you to define a range of rows for aggregation.

For example, to calculate a 7-day rolling sum of daily sales, you can use the `ROWS BETWEEN` clause within the `OVER()` window function:

```sql
SELECT Date,
       Sales,
       SUM(Sales) OVER (ORDER BY Date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS RollingCumulativeSales
FROM daily_sales
ORDER BY Date;
```

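This query calculates the 7-day rolling sales total, considering the current row and the six preceding rows. The `ROWS BETWEEN` clause defines the range of rows to include in the aggregation. Note that it counts rows rather than calendar days, so the window spans exactly seven days only when the table has one row per day with no gaps.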
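
To make the difference between a rolling window and a full running total concrete, the sketch below computes both over eight consecutive days in an in-memory SQLite database (3.25+). The sales figures are made up, with one row per day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Date TEXT, Sales INTEGER)")
# Eight consecutive days with sales of 10, 20, ..., 80 (made-up figures).
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [(f"2023-05-{day:02d}", day * 10) for day in range(1, 9)],
)

rows = conn.execute("""
    SELECT Date, Sales,
           -- current row plus the six rows before it
           SUM(Sales) OVER (ORDER BY Date
                            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS RollingSales,
           -- everything from the first row up to the current row
           SUM(Sales) OVER (ORDER BY Date) AS RunningTotal
    FROM daily_sales
    ORDER BY Date
""").fetchall()

for row in rows:
    print(row)
```

The two columns agree for the first seven days and diverge on the eighth, when the first day drops out of the rolling window.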

Window functions provide a powerful and flexible way to calculate cumulative sums in SQL, allowing you to define custom ranges and aggregation methods.

### Calculating Cumulative Sum Over Multiple Columns

In some cases, you might need to calculate the cumulative sum across multiple columns. For example, let’s say you have a table containing the sales data for two different products (ProductA and ProductB) and you want to compute the cumulative sum of their combined sales.

Here’s a way to accomplish this:

```sql
WITH CombinedSales AS (
    SELECT Date,
           ProductA_Sales + ProductB_Sales AS TotalSales
    FROM sales_data
),
CumulativeSales AS (
    SELECT Date,
           SUM(TotalSales) OVER (ORDER BY Date) AS CumulativeSales
    FROM CombinedSales
)
SELECT * FROM CumulativeSales;
```

In this example, we first create a CTE `CombinedSales` that calculates the combined sales for both products by adding their corresponding sales columns. Then, we create another CTE `CumulativeSales` that calculates the cumulative sales using the SUM() function with the OVER() clause. Finally, we select all the data from the `CumulativeSales` CTE.
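
One thing to watch for when adding columns is missing data: in SQL, `value + NULL` is `NULL`, and `SUM()` skips `NULL` inputs, so a day where one product has no figure would silently drop the other product’s sales too. The sketch below, run on an in-memory SQLite database (3.25+) with made-up rows, compares the plain addition with a `COALESCE`-guarded version. Whether a missing value should count as zero is an assumption that depends on your data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_data (Date TEXT, ProductA_Sales INTEGER, ProductB_Sales INTEGER)"
)
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?, ?)",
    [
        ("2023-05-01", 100, 50),
        ("2023-05-02", 200, None),  # no ProductB figure recorded that day
        ("2023-05-03", 150, 25),
    ],
)

rows = conn.execute("""
    SELECT Date,
           -- 200 + NULL is NULL, so the whole day is skipped by SUM()
           SUM(ProductA_Sales + ProductB_Sales) OVER (ORDER BY Date) AS naive_total,
           -- treat a missing figure as zero before adding
           SUM(COALESCE(ProductA_Sales, 0) + COALESCE(ProductB_Sales, 0))
               OVER (ORDER BY Date) AS safe_total
    FROM sales_data
    ORDER BY Date
""").fetchall()

for row in rows:
    print(row)
```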

## Practice Interview Questions

Having covered the fundamentals and techniques, let’s now put our knowledge into practice by applying these concepts to two real-world interview questions.

### Practice Question 1: Cumulative Sales by Product

Let’s start with something easy. You are working with the sales team of an e-commerce store to analyze their monthly performance.

They give you the `sales` table that tracks every purchase made on the store. The table contains the columns `id` (purchase id), `product_id`, `date` (purchase date), and `price`.

Write a SQL query to compute the cumulative sum of sales for each product, sorted by `product_id` and date.

Note: The cumulative sum for a product on a given date is the sum of the price of all purchases of the product that happened on that date and on all previous dates.

### Example:

Input:

`sales` table

| Column | Type |
| --- | --- |
| id | INTEGER |
| product_id | INTEGER |
| date | DATE |
| price | FLOAT |

Output:

| Column | Type |
| --- | --- |
| product_id | INTEGER |
| date | DATE |
| cumulative_sum | FLOAT |

### Solution

This question is relatively easy, and can be solved with many techniques and approaches. For this specific question, let’s use the self-JOIN technique to get the cumulative sum:

```sql
SELECT
    s1.product_id,
    s1.date,
    SUM(s2.price) AS cumulative_sum
FROM
    sales s1
JOIN
    sales s2 ON s1.product_id = s2.product_id AND s1.date >= s2.date
GROUP BY
    s1.product_id, s1.date
ORDER BY
    s1.product_id, s1.date;
```

In the query, we perform a self-join on the `sales` table, which means we join the table with itself. The goal is to establish a relationship between rows based on a common attribute, which in this case is the `product_id` column.

The self-join is achieved through the following part of the code:

```sql
FROM
    sales s1
JOIN
    sales s2 ON s1.product_id = s2.product_id AND s1.date >= s2.date
```

Here, we have two instances of the `sales` table referred to as `s1` and `s2`. By joining `s1` with `s2`, we can compare and match rows from the table based on the condition specified in the `ON` clause.

The condition `s1.product_id = s2.product_id` ensures that only rows with the same `product_id` are considered for the join.

Additionally, we have the condition `s1.date >= s2.date`, which ensures that we are joining the rows where the date in `s1` is greater than or equal to the date in `s2`. This condition helps us capture all the purchases for a particular product up to and including the current date.

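By performing this self-join, we are able to create pairs of rows from the `sales` table where the `product_id` matches and the dates are in chronological order. Note that this relies on each product having at most one purchase per date: if a product is bought several times on the same date, each purchase appears as its own `s1` row, and the grouped sum then counts the matching `s2` rows once per `s1` row. Reducing `s1` to distinct (`product_id`, `date`) pairs before joining avoids this.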

After the join, we group the results by `product_id` and `date` using the `GROUP BY` clause. This allows us to aggregate the rows based on these columns.

Finally, we calculate the cumulative sum of the `price` column for each group using the `SUM()` function:

```sql
SUM(s2.price) AS cumulative_sum
```

By summing the `price` column in the joined rows, we obtain the cumulative sum of sales for each product up to and including the current date.

The result is then ordered by `product_id` and `date` to present the cumulative sum of sales for each product in ascending order.
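
Because the table tracks every purchase, it is worth testing the query on data where one product is bought twice on the same date. The sketch below does this in an in-memory SQLite database with made-up rows, and compares the self-join as written with a variant that first reduces `s1` to distinct (`product_id`, `date`) pairs. The variant is one possible adjustment under that assumption, not the only way to handle it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, product_id INTEGER, date TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2023-05-01", 10.0),
        (2, 1, "2023-05-01", 5.0),   # second purchase of product 1 that day
        (3, 1, "2023-05-02", 20.0),
        (4, 2, "2023-05-01", 7.0),
    ],
)

# Self-join exactly as in the solution: s1 contains one row per purchase.
plain_rows = conn.execute("""
    SELECT s1.product_id, s1.date, SUM(s2.price) AS cumulative_sum
    FROM sales s1
    JOIN sales s2 ON s1.product_id = s2.product_id AND s1.date >= s2.date
    GROUP BY s1.product_id, s1.date
    ORDER BY s1.product_id, s1.date
""").fetchall()

# Variant: s1 contains one row per (product_id, date) pair.
dedup_rows = conn.execute("""
    SELECT s1.product_id, s1.date, SUM(s2.price) AS cumulative_sum
    FROM (SELECT DISTINCT product_id, date FROM sales) s1
    JOIN sales s2 ON s1.product_id = s2.product_id AND s1.date >= s2.date
    GROUP BY s1.product_id, s1.date
    ORDER BY s1.product_id, s1.date
""").fetchall()

print(plain_rows)  # product 1 on 2023-05-01 is counted twice
print(dedup_rows)  # product 1 on 2023-05-01 sums to 15.0
```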

### Practice Question 2: Sales Since Last Restocking

Let’s turn it up a notch.

As an accountant for a local grocery store, you have been tasked with calculating the daily sales of each product since their last restocking. You have been provided with three tables: `products`, `sales`, and `restocking`. The `products` table contains information about each product, the `sales` table records the sales transactions, and the `restocking` table tracks the restocking events. Write an SQL query to determine the daily sales for each product since their last restocking event.

Using the given table schemas, write an SQL query to retrieve the `product_name`, `date`, and `sales_since_last_restock` for each product. The `sales_since_last_restock` column represents the cumulative sales of the product for each day since its last restocking event. Order the result set by `product_id`.

Example:

Input:

Table `products`:

| Column | Type |
| --- | --- |
| product_id | int |
| product_name | varchar |

Table `sales`:

| Column | Type |
| --- | --- |
| sales_id | int |
| product_id | int |
| date | date |
| sold_count | int |

Table `restocking`:

| Column | Type |
| --- | --- |
| restock_id | int |
| product_id | int |
| restock_date | date |

Output: The output should include the following columns:

| Column | Type |
| --- | --- |
| product_name | varchar |
| date | date |
| sales_since_last_restock | int |

### Solution

To approach a complex SQL question involving various facets and querying techniques, it’s important to simplify our logic and streamline our thinking process. By doing so, we can focus solely on the requirements at hand. Let’s outline the key objectives of the question:

1. Retrieve the latest restock dates, ensuring to obtain the most recent restock date per product_id.
2. Filter the data based on the latest restock dates. This filtration process should be performed independently for each product_id, considering the date window starting from their respective latest restock.
3. Compute the cumulative sum using the filtered data.

By breaking down the approach into these distinct steps, we can effectively tackle the question and generate the expected result set. Let’s take a look at our solution in code:

```sql
-- 1. First, create a CTE to get the latest restocking date for each product
WITH latest_restock AS (
    SELECT
        product_id,
        MAX(restock_date) AS max_restock_date
    FROM
        restocking
    GROUP BY
        product_id
),

-- 2. Then, create another CTE to get the sales that occurred on or after the latest restocking
filtered_sales AS (
    SELECT
        s.*
    FROM
        sales s
        JOIN latest_restock lr ON s.product_id = lr.product_id
    WHERE
        s.date >= lr.max_restock_date
)

-- 3. Finally, calculate the cumulative sum of sales since the latest restocking
SELECT
    p.product_name,
    fs.date,
    SUM(fs.sold_count) OVER (PARTITION BY fs.product_id ORDER BY fs.date) AS sales_since_last_restock
FROM
    filtered_sales fs
    JOIN products p ON fs.product_id = p.product_id
ORDER BY
    fs.product_id, fs.date;
```

Our solution is divided into three parts: two Common Table Expressions (CTEs) followed by a final SELECT. Let’s go through each part:

1. The first CTE generates a table that includes the `product_id` and its corresponding latest restock date using the `MAX()` function. To ensure independent results for each `product_id`, we use the GROUP BY clause.
2. Using the data from the first CTE, the second CTE applies a `WHERE` clause to filter out rows that are not relevant for computation, as specified in the line `s.date >= lr.max_restock_date`. We compare each sale entry’s date to the most recent restocking date for its corresponding `product_id`. This ensures that we only consider sales that occurred on or after the last restocking.
3. With the filtered dataset, we can easily calculate the cumulative sum using a window function. By applying the SUM() function on the `sold_count` column and partitioning the data by `product_id`, we obtain the cumulative sum for each `product_id`. The ORDER BY clause is used to determine the order of the data within each partition, in this case, based on the `date`.

By following these steps, we can effectively generate the cumulative sum based on the specified criteria. This approach allows us to analyze sales data and track the progression of sold quantities for each product over time.
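
To check the solution end to end, the sketch below loads a few made-up rows into an in-memory SQLite database (3.25+) and runs the same three-step query. The product names, dates, and quantities are invented for illustration; product 1 is restocked twice so that only sales from its second restock onward should be counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER, product_name TEXT);
    CREATE TABLE sales (sales_id INTEGER, product_id INTEGER, date TEXT, sold_count INTEGER);
    CREATE TABLE restocking (restock_id INTEGER, product_id INTEGER, restock_date TEXT);

    INSERT INTO products VALUES (1, 'Apples'), (2, 'Bread');
    INSERT INTO restocking VALUES
        (1, 1, '2023-05-01'), (2, 1, '2023-05-03'), (3, 2, '2023-05-02');
    INSERT INTO sales VALUES
        (1, 1, '2023-05-02', 5), (2, 1, '2023-05-03', 3), (3, 1, '2023-05-04', 4),
        (4, 2, '2023-05-01', 2), (5, 2, '2023-05-02', 6), (6, 2, '2023-05-03', 1);
""")

rows = conn.execute("""
    WITH latest_restock AS (
        -- latest restock date per product
        SELECT product_id, MAX(restock_date) AS max_restock_date
        FROM restocking
        GROUP BY product_id
    ),
    filtered_sales AS (
        -- keep only sales on or after that date
        SELECT s.*
        FROM sales s
        JOIN latest_restock lr ON s.product_id = lr.product_id
        WHERE s.date >= lr.max_restock_date
    )
    SELECT p.product_name,
           fs.date,
           SUM(fs.sold_count) OVER (PARTITION BY fs.product_id ORDER BY fs.date)
               AS sales_since_last_restock
    FROM filtered_sales fs
    JOIN products p ON fs.product_id = p.product_id
    ORDER BY fs.product_id, fs.date
""").fetchall()

for row in rows:
    print(row)
```

The sale of 5 apples on 2023-05-02 and the sale of 2 loaves on 2023-05-01 both precede the latest restock and are excluded.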

## Conclusion

Understanding and calculating cumulative sums in SQL is crucial for various data analysis tasks, such as aggregating data over time, analyzing trends, and computing running totals for different metrics. In this article, we’ve covered various techniques for calculating cumulative sums, including using the SUM() function with the OVER() clause, the RANK() function, the GROUP BY clause, and other window functions.

We’ve also discussed calculating cumulative sums for different time frames, and advanced techniques like resetting cumulative sums or calculating cumulative sums over multiple columns.

Keep practicing and honing your skills to become proficient in handling complex SQL queries and data manipulation tasks.
