SQL Cumulative Sum: A Comprehensive Guide

Introduction

This article introduces several SQL queries for calculating cumulative sums. Cumulative sums play a vital role in analyzing data in a SQL database: they let us monitor how a metric progresses over time and uncover valuable insights. We will explore different techniques for calculating them so that you can apply these queries confidently in your own projects.

Later on we will go through the three main methods: aggregate functions, window functions and grouping techniques. But first, let’s nail down what we mean when we talk about a cumulative sum.

Understanding Cumulative Sum in SQL

A cumulative sum, also known as a running total, is a progressive addition of a sequence of numbers. It is calculated by successively adding each new value in the sequence to the previous sum, such that you can see where a metric stands currently or on a specific date. Cumulative sums are powerful tools for examining trends, monitoring progress, and comparing time-based data.

For example, suppose you manage a sales team and want to track the units sold within the current period. A cumulative sum can help you measure the team’s effectiveness and track the team’s progress against their quota. If you find you are below quota, this information can be valuable for adjusting resources, setting new targets, or assessing the impact of past process changes.

To illustrate the concept of a cumulative sum for the sales team, let’s consider a simple dataset:

Date Sales
2023-05-01 100
2023-05-02 200
2023-05-03 150

By calculating a cumulative sum, we can determine the running total of sales for each day:

Date Sales CumulativeSales
2023-05-01 100 100
2023-05-02 200 300
2023-05-03 150 450

If the team was supposed to be at a running total of 1,000 units by the end of the third day, you may want to start thinking about a pep talk to the team, or changes to the process. You may also want to revise those initial targets, or re-evaluate how they are set in the first place.
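The arithmetic behind the table above can be sketched outside SQL as well. The snippet below is only an illustration in Python (the variable names are hypothetical); it reproduces the CumulativeSales column by adding each day's sales to the total so far.

```python
from itertools import accumulate

# Daily unit sales from the example table (date, units sold).
daily_sales = [("2023-05-01", 100), ("2023-05-02", 200), ("2023-05-03", 150)]

# accumulate() yields the running total after each element.
running_totals = list(accumulate(units for _, units in daily_sales))

for (day, units), total in zip(daily_sales, running_totals):
    print(day, units, total)
```

Each printed line corresponds to one row of the cumulative table: the date, that day's sales, and the running total up to and including that day.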

Calculating Cumulative Sum in SQL

There are several methods for calculating cumulative sums in SQL. Our focus will be on the use of aggregate functions, window functions and grouping techniques. Each method has unique benefits and use cases, and by mastering these concepts you can efficiently analyze your data and extract valuable insights for your organization.

Aggregate Functions: Using the SUM() Function and OVER() Clause

The SUM() function and the OVER() clause are two essential components for calculating cumulative sums in SQL. The SUM() function is an aggregate function that computes the total of a specified column, while the OVER() clause is used to define a window or set of rows for the function to operate on.

When used together, they enable you to calculate cumulative sums with ease. Let’s revisit our previous example of daily sales:

Date Sales
2023-05-01 100
2023-05-02 200
2023-05-03 150

To calculate the cumulative sum of sales for each day, we can use the following SQL query:

SELECT Date, Sales,
       SUM(Sales) OVER (ORDER BY Date) as CumulativeSales
FROM daily_sales;

The SUM() function calculates the total sales, and the OVER() clause defines a window of rows ordered by the Date column. The window starts from the first row and grows incrementally, allowing the SUM() function to operate on a progressively larger set of rows for each record. This results in a running total, or cumulative sum, for the sales.

The result will be:

Date Sales CumulativeSales
2023-05-01 100 100
2023-05-02 200 300
2023-05-03 150 450
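To try the query without setting up a database server, here is a minimal sketch that runs it against an in-memory SQLite database through Python's sqlite3 module. The choice of SQLite is an assumption made for convenience (it needs version 3.25 or later for window functions); the table and rows mirror the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2023-05-01", 100), ("2023-05-02", 200), ("2023-05-03", 150)],
)

# Running total: the window grows by one row for each later date.
rows = conn.execute(
    """
    SELECT Date, Sales,
           SUM(Sales) OVER (ORDER BY Date) AS CumulativeSales
    FROM daily_sales
    ORDER BY Date
    """
).fetchall()

for row in rows:
    print(row)
```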

With a better understanding of how the SUM() function and the OVER() clause interact to create a running sum, we can now present the general syntax:

SELECT column1, column2, SUM(column_to_sum) OVER (ORDER BY column_to_order_by) as CumulativeSum
FROM table_name;

In this example, the column_to_sum is the column for which you want to calculate the cumulative sum, and the column_to_order_by is the column used to order the rows within the window.

As you can see, using the SUM() function and the OVER() clause allows us to calculate cumulative sums in SQL effectively. This approach is particularly useful when working with time-based data or other ordered sequences.
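One detail worth knowing: when OVER() has an ORDER BY but no explicit frame, standard SQL uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie on the ordering column all receive the same running total. The sketch below (SQLite via Python, with a hypothetical id column and a made-up duplicate date) contrasts that default with an explicit ROWS frame.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (id INTEGER, Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [
        (1, "2023-05-01", 100),
        (2, "2023-05-02", 200),
        (3, "2023-05-02", 50),   # second entry on the same date (hypothetical)
        (4, "2023-05-03", 150),
    ],
)

rows = conn.execute(
    """
    SELECT id, Date, Sales,
           -- default frame: tied dates are summed together
           SUM(Sales) OVER (ORDER BY Date) AS range_total,
           -- explicit ROWS frame with a tiebreaker: strictly row by row
           SUM(Sales) OVER (ORDER BY Date, id
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_total
    FROM daily_sales
    ORDER BY Date, id
    """
).fetchall()

for row in rows:
    print(row)
```

Both rows dated 2023-05-02 show 350 in range_total, while rows_total steps through 300 and then 350. Which behavior is wanted depends on whether the running total should be per date or per row.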

Window Functions: Utilizing the RANK() Function for Cumulative Sums

The RANK() window function creates a new column that assigns a rank to each row in the result set based on a chosen attribute; rows that tie on that attribute share the same rank. You specify the attribute through the ORDER BY clause in the window function.

Let’s take an example using a sales table. If we use RANK() OVER (ORDER BY Date), it means that we are ranking the results based on their date. On the other hand, if we use RANK() OVER (ORDER BY Sales), the ranks will be assigned based on the sales amount.
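As a small illustration of the two orderings, the sketch below ranks the same rows by date and by sales amount. It assumes SQLite via Python, and the fourth row is a hypothetical addition that creates a tie on Sales.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [
        ("2023-05-01", 100),
        ("2023-05-02", 200),
        ("2023-05-03", 150),
        ("2023-05-04", 150),  # hypothetical extra row that ties on Sales
    ],
)

rows = conn.execute(
    """
    SELECT Date, Sales,
           RANK() OVER (ORDER BY Date)  AS rank_by_date,
           RANK() OVER (ORDER BY Sales) AS rank_by_sales
    FROM daily_sales
    ORDER BY Date
    """
).fetchall()

for row in rows:
    print(row)
```

The two rows with 150 in Sales share rank 2 in rank_by_sales, and the next rank issued is 4. Ranking by Date produces no ties here, which is what the cumulative-sum technique relies on.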

To use the RANK() function for cumulative sum calculations, you can employ a subquery or a Common Table Expression (CTE) to rank the rows by date. We then reference that ranked table twice so that we can join it to itself. Let’s call these two instances r1 and r2 for ease of reference.

In the self-join, each row in r1 is matched with every row in r2 whose rank is less than or equal to its own. Let’s see the following example, referencing our previous table:

While we operate on this row in r1:

Date Sales Rank
2023-05-02 200 2

We have two rows that match the condition r1.rank >= r2.rank, which are the rows:

Date Sales Rank
2023-05-01 100 1
2023-05-02 200 2

We can then get the SUM of the matching rows on r2, i.e., ranks 1 and 2 to get the cumulative sum. Finally, we group the results by the date of r1, meaning we will get the sum with respect to r1’s date column, enabling us to retrieve the cumulative sum on a per-day window.

Consider the following example, where we calculate the cumulative sum of daily sales:

WITH RankedSales AS (
  SELECT Date, Sales, RANK() OVER (ORDER BY Date) as SalesRank
  FROM daily_sales
)
SELECT r1.Date, r1.Sales, SUM(r2.Sales) as CumulativeSales
FROM RankedSales r1
JOIN RankedSales r2 ON r1.SalesRank >= r2.SalesRank
GROUP BY r1.Date, r1.Sales, r1.SalesRank
ORDER BY r1.Date;

In this example, we first create a CTE RankedSales with the RANK() function to assign a rank to each row based on the Date column. Then, we join the RankedSales table to itself, retaining only the rows with equal or lower ranks. This self-join effectively constructs a growing window of rows for each date. Let’s visualize this below:

r1.Date r1.Sales r1.SalesRank r2.Date r2.Sales r2.SalesRank
2023-05-01 100 1 2023-05-01 100 1
2023-05-02 200 2 2023-05-01 100 1
2023-05-02 200 2 2023-05-02 200 2
2023-05-03 150 3 2023-05-01 100 1
2023-05-03 150 3 2023-05-02 200 2
2023-05-03 150 3 2023-05-03 150 3

Finally, we use the GROUP BY clause to aggregate the sales data and calculate the cumulative sum for each date.

r1.Date r1.Sales r1.SalesRank r2 rows summed as (r2.Sales, r2.SalesRank) CumulativeSales
2023-05-01 100 1 (100, 1) 100
2023-05-02 200 2 (100, 1), (200, 2) 300
2023-05-03 150 3 (100, 1), (200, 2), (150, 3) 450
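The RANK() self-join can be checked end to end with a short script. This is a sketch that assumes SQLite through Python's sqlite3 module and reuses the three example rows; the query text is the one shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2023-05-01", 100), ("2023-05-02", 200), ("2023-05-03", 150)],
)

rows = conn.execute(
    """
    WITH RankedSales AS (
      SELECT Date, Sales, RANK() OVER (ORDER BY Date) AS SalesRank
      FROM daily_sales
    )
    SELECT r1.Date, r1.Sales, SUM(r2.Sales) AS CumulativeSales
    FROM RankedSales r1
    JOIN RankedSales r2 ON r1.SalesRank >= r2.SalesRank  -- rows up to the current rank
    GROUP BY r1.Date, r1.Sales, r1.SalesRank
    ORDER BY r1.Date
    """
).fetchall()

for row in rows:
    print(row)
```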

Grouping Techniques: Using the GROUP BY clause

The GROUP BY clause collapses rows that share the same values in the specified columns into a single row. It’s often employed with aggregate functions like SUM(), COUNT(), or AVG() to perform calculations on each group of rows.

To compute cumulative sums using the GROUP BY clause, we will self-join the table so that each group of rows represents a growing window of time, much as we did with the RANK() function before. We will then use the GROUP BY clause to aggregate the data and compute the cumulative sum for each group. For instance, let’s revisit our daily sales example:

SELECT d1.Date, d1.Sales, SUM(d2.Sales) as CumulativeSales
FROM daily_sales d1
JOIN daily_sales d2 ON d1.Date >= d2.Date
GROUP BY d1.Date, d1.Sales
ORDER BY d1.Date;

In this example, we join the daily_sales table to itself, retaining only the rows with equal or earlier dates. The self-join effectively creates a growing window of rows for each date. Then, we use the GROUP BY clause to aggregate the sales data by date and sales amount, calculating the cumulative sum for each date.
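On data with one row per date, the self-join should agree with the window-function approach. The sketch below (SQLite via Python, same three example rows) runs both queries and compares the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2023-05-01", 100), ("2023-05-02", 200), ("2023-05-03", 150)],
)

# Self-join: pair each date with itself and every earlier date, then sum.
self_join = conn.execute(
    """
    SELECT d1.Date, d1.Sales, SUM(d2.Sales) AS CumulativeSales
    FROM daily_sales d1
    JOIN daily_sales d2 ON d1.Date >= d2.Date
    GROUP BY d1.Date, d1.Sales
    ORDER BY d1.Date
    """
).fetchall()

# Window function: the same running total without a join.
windowed = conn.execute(
    """
    SELECT Date, Sales, SUM(Sales) OVER (ORDER BY Date) AS CumulativeSales
    FROM daily_sales
    ORDER BY Date
    """
).fetchall()

print(self_join == windowed)
```

The self-join compares every row with every earlier row, so on large tables the window function is usually the cheaper option where the database supports it.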

We will now see how this can be used beyond dates, and applied to other measures of time like day or month.

Cumulative Sum Calculations by Date and Time

Calculating cumulative sums for different time frames is crucial for various types of data analysis, such as trend identification and performance measurement. In this section, we will discuss how to compute cumulative sums by date, day, week, and month using SQL.

Cumulative Sum by Date

We already accomplished this in the last section, but here is the query that uses the GROUP BY clause to calculate the cumulative sum by date:

SELECT d1.Date, d1.Sales, SUM(d2.Sales) as CumulativeSales
FROM daily_sales d1
JOIN daily_sales d2 ON d1.Date >= d2.Date
GROUP BY d1.Date, d1.Sales
ORDER BY d1.Date;

This query calculates the cumulative sales for each date in the daily_sales table.

Cumulative Sum by Day

To compute the cumulative sum by day, you can extract the day from the date using the DAY() function and then apply a similar technique as before:

SELECT DAY(d1.Date) as Day, d1.Sales, SUM(d2.Sales) as CumulativeSales
FROM daily_sales d1
JOIN daily_sales d2 ON DAY(d1.Date) >= DAY(d2.Date)
GROUP BY Day, d1.Sales
ORDER BY Day;

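This query calculates the cumulative sales for each day of the month. Note that DAY() is available in dialects such as MySQL and SQL Server but not in every database, and that the query only gives a meaningful running total when the data covers a single month, since day numbers repeat from month to month. If you want to get the cumulative sum by week or by month, you can replace the DAY() functions in the query with WEEK() or MONTH(); because several rows then fall into the same period, aggregate the sales per period first so that no row is counted more than once.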
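For coarser periods such as months, one way to avoid counting rows more than once is to total the sales per period first and then take the running sum of those totals. The sketch below assumes SQLite via Python, where strftime('%Y-%m', Date) stands in for a MONTH()-style function; the rows are hypothetical and span two months.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [
        ("2023-05-30", 100),
        ("2023-05-31", 200),
        ("2023-06-01", 150),
        ("2023-06-02", 50),
    ],
)

rows = conn.execute(
    """
    WITH monthly AS (
      -- Step 1: one row per month with that month's total
      SELECT strftime('%Y-%m', Date) AS Month, SUM(Sales) AS MonthlySales
      FROM daily_sales
      GROUP BY strftime('%Y-%m', Date)
    )
    -- Step 2: running total over the monthly totals
    SELECT Month, MonthlySales,
           SUM(MonthlySales) OVER (ORDER BY Month) AS CumulativeSales
    FROM monthly
    ORDER BY Month
    """
).fetchall()

for row in rows:
    print(row)
```

Including the year in the period key keeps May 2023 and May 2024 apart, which a bare month number would not.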

Advanced Cumulative Sum Techniques

Resetting Cumulative Sums

There are scenarios where cumulative sums need to be reset, such as when analyzing data within different categories or groups. For instance, when calculating the cumulative sales for each product category, you may need to reset the cumulative sum whenever the category changes.

To reset cumulative sums, you can use the PARTITION BY clause within the OVER() window function. Here’s an example:

SELECT Category,
       Date,
       Sales,
       SUM(Sales) OVER (PARTITION BY Category ORDER BY Date) as CumulativeSales
FROM daily_sales
ORDER BY Category, Date;

This query calculates the cumulative sales for each product category, resetting the cumulative sum whenever the category changes. The PARTITION BY clause ensures that the cumulative sum starts over for each distinct category.
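Here is a minimal sketch of the reset behavior, assuming SQLite via Python and a hypothetical daily_sales table that carries a Category column with two categories.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Category TEXT, Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [
        ("A", "2023-05-01", 100),
        ("A", "2023-05-02", 200),
        ("B", "2023-05-01", 50),
        ("B", "2023-05-02", 70),
    ],
)

rows = conn.execute(
    """
    SELECT Category, Date, Sales,
           SUM(Sales) OVER (PARTITION BY Category ORDER BY Date) AS CumulativeSales
    FROM daily_sales
    ORDER BY Category, Date
    """
).fetchall()

for row in rows:
    print(row)
```

The running total reaches 300 for category A and then starts again at 50 for the first row of category B.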

Cumulative Sum with Window Functions

Window functions in SQL are used to perform calculations across a set of rows that are related to the current row. They provide an advanced way to calculate cumulative sums by allowing you to define a range of rows for aggregation.

For example, to calculate a 7-day rolling cumulative sum of daily sales, you can use the ROWS BETWEEN clause within the OVER() window function:

SELECT Date,
       Sales,
       SUM(Sales) OVER (ORDER BY Date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as RollingCumulativeSales
FROM daily_sales
ORDER BY Date;

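This query sums the current row and the six preceding rows. That equals a 7-day rolling sum only when daily_sales has exactly one row per day with no missing dates, because ROWS BETWEEN counts rows rather than calendar days. The ROWS BETWEEN clause defines the range of rows to include in the aggregation.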
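The difference between counting rows and counting calendar days shows up as soon as dates are missing. The sketch below is an illustration under assumptions: it uses SQLite via Python, hypothetical rows with an 18-day gap, and a correlated subquery with SQLite's date() function as one possible calendar-based alternative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (Date TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2023-05-01", 10), ("2023-05-02", 20), ("2023-05-20", 30)],  # gap after May 2
)

rows = conn.execute(
    """
    SELECT d1.Date, d1.Sales,
           -- counts rows: reaches back across the gap
           SUM(d1.Sales) OVER (ORDER BY d1.Date
                               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rows_based,
           -- counts calendar days: only the 7 days ending on d1.Date
           (SELECT SUM(d2.Sales)
            FROM daily_sales d2
            WHERE d2.Date BETWEEN date(d1.Date, '-6 days') AND d1.Date) AS calendar_based
    FROM daily_sales d1
    ORDER BY d1.Date
    """
).fetchall()

for row in rows:
    print(row)
```

For 2023-05-20 the row-based window still includes the two sales from early May, while the calendar-based sum contains only that day's 30.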

Window functions provide a powerful and flexible way to calculate cumulative sums in SQL, allowing you to define custom ranges and aggregation methods.

Calculating Cumulative Sum Over Multiple Columns

In some cases, you might need to calculate the cumulative sum across multiple columns. For example, let’s say you have a table containing the sales data for two different products (ProductA and ProductB) and you want to compute the cumulative sum of their combined sales.

Here’s a way to accomplish this:

WITH CombinedSales AS (
  SELECT Date,
         ProductA_Sales + ProductB_Sales as TotalSales
  FROM sales_data
),
CumulativeSales AS (
  SELECT Date,
         SUM(TotalSales) OVER (ORDER BY Date) as CumulativeSales
  FROM CombinedSales
)
SELECT * FROM CumulativeSales;

In this example, we first create a CTE CombinedSales that calculates the combined sales for both products by adding their corresponding sales columns. Then, we create another CTE CumulativeSales that calculates the cumulative sales using the SUM() function with the OVER() clause. Finally, we select all the data from the CumulativeSales CTE.
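A point to watch when adding columns: in SQL, a sum such as ProductA_Sales + ProductB_Sales is NULL if either operand is NULL, so a day with one missing value drops out of the running total entirely. The sketch below (SQLite via Python, hypothetical rows with one NULL) compares the plain addition with a COALESCE-guarded version; whether treating NULL as 0 is appropriate depends on what a missing value means in your data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_data (Date TEXT, ProductA_Sales INTEGER, ProductB_Sales INTEGER)"
)
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?, ?)",
    [
        ("2023-05-01", 100, 50),
        ("2023-05-02", None, 70),  # ProductA value missing on this day
        ("2023-05-03", 30, 20),
    ],
)

rows = conn.execute(
    """
    SELECT Date,
           SUM(ProductA_Sales + ProductB_Sales) OVER (ORDER BY Date) AS naive_total,
           SUM(COALESCE(ProductA_Sales, 0) + COALESCE(ProductB_Sales, 0))
               OVER (ORDER BY Date) AS null_safe_total
    FROM sales_data
    ORDER BY Date
    """
).fetchall()

for row in rows:
    print(row)
```

The naive total ignores the 70 units of ProductB sold on 2023-05-02, because that day's combined value evaluates to NULL.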

Practice Interview Questions

Having covered the fundamentals and techniques, let’s now put our knowledge into practice by applying these concepts to two real-world interview questions.

Practice Question 1: Cumulative Sales by Product

Let’s start with something easy. You are working with the sales team of an e-commerce store to analyze their monthly performance.

They give you the sales table that tracks every purchase made on the store. The table contains the columns id (purchase id), product_id, date (purchase date), and price.

Write a SQL query to compute the cumulative sum of sales for each product, sorted by product_id and date.

Note: The cumulative sum for a product on a given date is the sum of the price of all purchases of the product that happened on that date and on all previous dates.

Example:

Input:

sales table

Column Type
id INTEGER
product_id INTEGER
date DATE
price FLOAT

Output:

Column Type
product_id INTEGER
date DATE
cumulative_sum FLOAT

Solution

This question is relatively easy, and can be solved with many techniques and approaches. For this specific question, let’s use the self-join technique to get the cumulative sum:

SELECT
    s1.product_id,
    s1.date,
    SUM(s2.price) AS cumulative_sum
FROM
    sales s1
JOIN
    sales s2 ON s1.product_id = s2.product_id AND s1.date >= s2.date
GROUP BY
    s1.product_id, s1.date
ORDER BY
    s1.product_id, s1.date;

In the query, we perform a self-join on the sales table, which means we join the table with itself. The goal is to establish a relationship between rows based on a common attribute, which in this case is the product_id column.

The self-join is achieved through the following part of the code:

FROM
    sales s1
JOIN
    sales s2 ON s1.product_id = s2.product_id AND s1.date >= s2.date

Here, we have two instances of the sales table referred to as s1 and s2. By joining s1 with s2, we can compare and match rows from the table based on the condition specified in the ON clause.

The condition s1.product_id = s2.product_id ensures that only rows with the same product_id are considered for the join.

Additionally, we have the condition s1.date >= s2.date, which ensures that we are joining the rows where the date in s1 is greater than or equal to the date in s2. This condition helps us capture all the purchases for a particular product up to and including the current date.

By performing this self-join, we create pairs of rows from the sales table where the product_id matches and the s2 purchase happened on or before the s1 date.

After the join, we group the results by product_id and date using the GROUP BY clause. This allows us to aggregate the rows based on these columns.

Finally, we calculate the cumulative sum of the price column for each group using the SUM() function:

SUM(s2.price) AS cumulative_sum

By summing the price column in the joined rows, we obtain the cumulative sum of sales for each product up to and including the current date.

The result is then ordered by product_id and date to present the cumulative sum of sales for each product in ascending order.
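One caveat worth testing: since the table tracks every purchase, a product can have several purchases on the same date. In that case each of those s1 rows is paired with all matching s2 rows, and the sum for that date is multiplied by the number of purchases made on it. The sketch below (SQLite via Python, hypothetical rows) shows the effect and one possible fix, which is to join from the distinct (product_id, date) pairs instead of from the raw table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, product_id INTEGER, date TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2023-01-01", 10.0),
        (2, 1, "2023-01-01", 20.0),  # second purchase on the same date
        (3, 1, "2023-01-02", 5.0),
    ],
)

# Self-join straight from the raw table, as in the solution above.
naive = conn.execute(
    """
    SELECT s1.product_id, s1.date, SUM(s2.price) AS cumulative_sum
    FROM sales s1
    JOIN sales s2 ON s1.product_id = s2.product_id AND s1.date >= s2.date
    GROUP BY s1.product_id, s1.date
    ORDER BY s1.product_id, s1.date
    """
).fetchall()

# Variant: one s1 row per (product_id, date), so each purchase is counted once.
deduplicated = conn.execute(
    """
    SELECT s1.product_id, s1.date, SUM(s2.price) AS cumulative_sum
    FROM (SELECT DISTINCT product_id, date FROM sales) s1
    JOIN sales s2 ON s1.product_id = s2.product_id AND s1.date >= s2.date
    GROUP BY s1.product_id, s1.date
    ORDER BY s1.product_id, s1.date
    """
).fetchall()

print(naive)
print(deduplicated)
```

With these rows the first query reports 60.0 for 2023-01-01, while the deduplicated variant reports the expected 30.0; both agree on 35.0 for 2023-01-02.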

Practice Question 2: Sales Since Last Restocking

Let’s turn it up a notch.

As an accountant for a local grocery store, you have been tasked with calculating the daily sales of each product since their last restocking. You have been provided with three tables: products, sales, and restocking. The products table contains information about each product, the sales table records the sales transactions, and the restocking table tracks the restocking events. Write an SQL query to determine the daily sales for each product since their last restocking event.

Using the given table schemas, write an SQL query to retrieve the product_name, date, and sales_since_last_restock for each product. The sales_since_last_restock column represents the cumulative sales of the product for each day since its last restocking event. Order the result set by product_id.

Example:

Input:

Table products:

Column Type
product_id int
product_name varchar

Table sales:

Column Type
sales_id int
product_id int
date date
sold_count int

Table restocking:

Column Type
restock_id int
product_id int
restock_date date

Output:

The output should include the following columns:

Column Type
product_name varchar
date date
sales_since_last_restock int

Solution

To approach a complex SQL question involving various facets and querying techniques, it’s important to simplify our logic and streamline our thinking process. By doing so, we can focus solely on the requirements at hand. Let’s outline the key objectives of the question:

  1. Retrieve the latest restock dates, ensuring to obtain the most recent restock date per product_id.
  2. Filter the data based on the latest restock dates. This filtration process should be performed independently for each product_id, considering the date window starting from their respective latest restock.
  3. Compute the cumulative sum using the filtered data.

By breaking down the approach into these distinct steps, we can effectively tackle the question and generate the expected result set. Let’s take a look at our solution in code:

-- 1. First, create a CTE to get the latest restocking date for each product
WITH latest_restock AS (
  SELECT 
    product_id, 
    MAX(restock_date) AS max_restock_date
  FROM 
    restocking
  GROUP BY 
    product_id
),

-- 2. Then, create another CTE to get the sales that occurred after the latest restocking
filtered_sales AS (
  SELECT 
    s.*
  FROM 
    sales s
    JOIN latest_restock lr ON s.product_id = lr.product_id
  WHERE 
    s.date >= lr.max_restock_date
)

-- 3. Finally, calculate the cumulative sum of sales since the latest restocking
SELECT 
  p.product_name,
  fs.date,
  SUM(fs.sold_count) OVER (PARTITION BY fs.product_id ORDER BY fs.date) AS sales_since_last_restock
FROM 
  filtered_sales fs
  JOIN products p ON fs.product_id = p.product_id
ORDER BY 
  fs.product_id, fs.date;

Our solution is divided into three parts: two Common Table Expressions (CTEs) followed by a final SELECT. Let’s go through each part:

  1. The first CTE generates a table that includes the product_id and its corresponding latest restock date using the MAX() function. To ensure independent results for each product_id, we use the GROUP BY clause.
  2. Using the data from the first CTE, the second CTE applies a WHERE clause to filter out rows that are not relevant for the computation, as specified in the line s.date >= lr.max_restock_date. We compare each sale entry’s date to the most recent restocking date for its corresponding product_id. This ensures that we only consider sales that occurred on or after the last restocking date.
  3. With the filtered dataset, we can easily calculate the cumulative sum using a window function. By applying the SUM() function on the sold_count column and partitioning the data by product_id, we obtain the cumulative sum for each product_id. The ORDER BY clause is used to determine the order of the data within each partition, in this case, based on the date.

By following these steps, we can effectively generate the cumulative sum based on the specified criteria. This approach allows us to analyze sales data and track the progression of sold quantities for each product over time.
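To see the three steps working together, the sketch below loads a few hypothetical rows into an in-memory SQLite database through Python and runs the solution query. The product names, dates, and counts are invented for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (product_id INTEGER, product_name TEXT);
    CREATE TABLE sales (sales_id INTEGER, product_id INTEGER, date TEXT, sold_count INTEGER);
    CREATE TABLE restocking (restock_id INTEGER, product_id INTEGER, restock_date TEXT);

    INSERT INTO products VALUES (1, 'Apples'), (2, 'Bread');
    INSERT INTO restocking VALUES
      (1, 1, '2023-05-01'), (2, 1, '2023-05-10'), (3, 2, '2023-05-05');
    INSERT INTO sales VALUES
      (1, 1, '2023-05-02', 5),  -- before the latest Apples restock: excluded
      (2, 1, '2023-05-10', 3),
      (3, 1, '2023-05-11', 4),
      (4, 2, '2023-05-04', 7),  -- before the latest Bread restock: excluded
      (5, 2, '2023-05-06', 2),
      (6, 2, '2023-05-07', 1);
    """
)

rows = conn.execute(
    """
    WITH latest_restock AS (
      SELECT product_id, MAX(restock_date) AS max_restock_date
      FROM restocking
      GROUP BY product_id
    ),
    filtered_sales AS (
      SELECT s.*
      FROM sales s
      JOIN latest_restock lr ON s.product_id = lr.product_id
      WHERE s.date >= lr.max_restock_date
    )
    SELECT p.product_name, fs.date,
           SUM(fs.sold_count) OVER (PARTITION BY fs.product_id ORDER BY fs.date)
             AS sales_since_last_restock
    FROM filtered_sales fs
    JOIN products p ON fs.product_id = p.product_id
    ORDER BY fs.product_id, fs.date
    """
).fetchall()

for row in rows:
    print(row)
```

Sales dated before each product's most recent restock do not appear in the output, and the running total restarts for each product.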

Conclusion

Understanding and calculating cumulative sums in SQL is crucial for various data analysis tasks, such as aggregating data over time, analyzing trends, and computing running totals for different metrics. In this article, we’ve covered various techniques for calculating cumulative sums, including using the SUM() function with the OVER() clause, the RANK() function, the GROUP BY clause, and other window functions.

We’ve also discussed calculating cumulative sums for different time frames, and advanced techniques like resetting cumulative sums or calculating cumulative sums over multiple columns.

Keep practicing and honing your skills to become proficient in handling complex SQL queries and data manipulation tasks.