Analytic Functions in SQL

Analytic Functions in SQL

Introduction

SQL has traditionally served as a database querying language for data storage and retrieval. However, with the exponential growth of data, there’s been an increased demand for robust server-side analytic functions that can compute and process data efficiently. In this article, we’ll explore a range of analytic functions in SQL that allow us to extract valuable insights directly from our data.

What Are SQL Analytic Functions?

SQL analytic functions, also known as window functions, are powerful features that perform calculations across rows in a result set, allowing for advanced data analysis and aggregation within queries. This provides data scientists and analysts with a way to implement logic without client-side programming languages that operate individually on the result sets (e.g., Python).

How are SQL analytic functions different from aggregation functions

Like aggregation functions, analytic functions provide greater insights into data trends, patterns, rankings, and comparisons. Fundamentally, they differ on a critical feature: the granularity of their operations.

? We declare an analytic function through the use of the OVER() clause.

While aggregation functions operate on the entire result set, analytic functions work with subsets of the result set called windows. These are user-defined, allowing for more dynamic results.

Exploring Analytic Functions: Understanding Its Sections and Usage

Before we dive into the different types of analytic functions in SQL, we must first understand how they’re declared and how to specify their behavior.

Analytic functions in SQL consist of several parts, which determine how a window function operates over a table:

SQL Analytic Functions

Function

The function component of an analytic function defines the operation that is performed within the window. In other words, while other parts of an analytic function determine WHAT the window is, the function component determines HOW the window is operated on.

Let’s look at an example of the SUM() OVER() analytic function to better understand this concept. The SUM() function calculates the sum over the specified window, allowing us to obtain cumulative totals or perform other types of aggregations. In a way, you can think of the function as the engine that powers the window’s functionality.

Functions vary in purpose and type. Typically, they’re classified under four distinct categories:

  • Ranking Functions
  • Window-Aggregation Functions
  • Statistical Functions
  • Offset Functions

We’ll explain how to use these different types of analytic functions later.

Partitioning Clause (PARTITION BY)

The partitioning clause divides the rows of the table into separate groups or partitions based on one or more column values. The function is applied independently within each partition. This allows you to perform calculations and aggregations on specific subsets of data.

PARTITION BY works similarly to GROUP BY, but instead of dividing the table into sets, it defines how we partition the table into windows. Let’s take a look at the following visualization:

student_id grade_id student_name course_name grade
1 1 John Smith Mathematics 3.50
1 2 John Smith Physics 4.00
1 3 John Smith English 3.80
2 4 Jane Doe Chemistry 3.90
2 5 Jane Doe Biology 3.70

This is the table before PARTITION BY (student_id) is executed.

However, after the partition, the table is divided into windows as defined by the student_id:

student_id grade_id student_name course_name grade
1 1 John Smith Mathematics 3.50
1 2 John Smith Physics 4.00
1 3 John Smith English 3.80
student_id grade_id student_name course_name grade
2 4 Jane Doe Chemistry 3.90
2 5 Jane Doe Biology 3.70

Ordering Clause (ORDER BY)

The ordering clause defines the order in which the rows are processed within each partition. It specifies the column(s) that the rows are sorted by before applying the analytic function.

Framing Clause

The framing clause, which is optional, allows you to specify a subset of rows within each partition on which the analytic function will operate, thereby determining the window used to calculate the function’s result. Think of it as an even more granular version of PARTITION BY. However, instead of partitioning based on a column, it partitions based on position (as dictated by ORDER BY).

The framing clause can be specified using the ROWS BETWEEN or RANGE BETWEEN keywords.

Rows Between

The ROWS BETWEEN clause allows you to define a window or frame of rows based on their relative positions within each partition. It specifies a range of rows that are included in the calculation of the analytic function’s result.

Range Between

The RANGE BETWEEN clause defines a window or frame of rows based on their values rather than their positions. The range of rows included in the calculation is based on the order specified in the ORDER BY clause.

It’s important to note that the behavior of RANGE BETWEEN can vary depending on the specific analytic function and the data type of the ordering column.

Frame Operations

When defining ROWS BETWEEN and RANGE BETWEEN, there are certain operations you can declare to further specify the frame size.

1. UNBOUNDED PRECEDING TO CURRENT ROW: This specifies that the window should include all rows from the beginning of the partition up to and including the current row being evaluated by the function.

Current Row

2. n PRECEDING TO CURRENT ROW: This creates a fixed window size that includes the current row being evaluated by the function and the preceding “n” number of rows.

NPreceeding

? In this example, n = 1.

3. UNBOUNDED PRECEDING TO UNBOUNDED FOLLOWING: This creates a window that includes all of the rows between the start and end of the partition.

All

For the first two frame operations, you can swap in the FOLLOWING keyword instead of PRECEDING to define a range of rows after the current row.

The output from ROWS BETWEEN and RANGE BETWEEN can vary vastly. Let’s look at the following section to understand why.

Range Between Versus Rows Between

ROWS BETWEEN is typically used with analytic functions that operate on continuous or numerical values. It considers the values of the ordering column. This is different from ROWS BETWEEN, which considers the relative position of the rows instead.

The following illustration displays how these clauses produce different windows:

With ROWS BETWEEN, the window includes the current row and the one immediately preceding it based on the specified order (in this case, by units), which are ‘Chemistry’ and ‘English.’

However, with RANGE BETWEEN,, it would be all the rows with unit values between 3 (from 4-1) and 4. This table includes ‘Math,’ ‘Physics,’ ‘English,’ and ‘Chemistry’ because all of them have units between 3 and 4.

?RANGE BETWEEN is a window, while ROWS BETWEEN is a row-based window.

RANGE BETWEEN includes all four rows because the order of the rows doesn’t matter– only the values. So if multiple rows have the same or close values (within 1 unit in this case), they’ll all be included in the window.

Ranking Functions

In this section, we’ll cover ranking functions, which are a group of analytic functions that assign ranks to each row in a result set based on certain criteria. These functions are helpful for data rankings, such as identifying the top performers or finding the relative position of a particular row compared to others.

Row_Number()

This function assigns a unique sequential number to each row within the result set. The numbers are generated based on the order specified in the ORDER BY clause. ROW_NUMBER() does not consider ties, and each row receives a distinct number.

Example:

Company Stock_Price ROW_NUMBER
Metamorphosis LLC 542.12 1
Orange Inc. 542.12 2
MacroString Corp. 902.93 3
Spookify 1252.97 4

(Using ROW_NUMBER() OVER (ORDER BY Stock_Price))

Rank()

RANK() assigns a unique rank to each row within the result set. Rows with the same values and order criteria receive the same rank, and the subsequent rank changes based on the number of rows with the same rank.

Example:

Company Stock_Price RANK
Metamorphosis LLC 542.12 1
Orange Inc. 542.12 1
MacroString Corp. 902.93 3
Spookify 1252.97 4

(Using RANK() OVER (ORDER BY Stock_Price))

Dense Rank()

Similar to RANK(), DENSE_RANK() assigns a unique rank to each row in the result set. However, it doesn’t skip any ranks for tied values.

Example:

Company Stock_Price ROW_NUMBER
Metamorphosis LLC 542.12 1
Orange Inc. 542.12 1
MacroString Corp. 902.93 2
Spookify 1252.97 3

(Using DENSE_RANK() OVER (ORDER BY Stock_Price))

NTILE()

NTILE() divides the result set into equal-sized groups, or “tiles,” and assigns a group number to each row. The number of groups is determined by the argument provided to NTILE().

Example:

Company Stock_Price NTILE
Metamorphosis LLC 542.12 1
Orange Inc. 542.12 1
MacroString Corp. 902.93 2
Spookify 1252.97 2

(Using NTILE(2) OVER (ORDER BY Stock_Price))

Window-Aggregate Functions

Window-aggregation functions in SQL are a specialized group of analytic functions that combine the capabilities of aggregate and window functions. These functions allow you to perform aggregate calculations while retaining a granular level of detail within the result set, providing more flexibility for data analysis.

SUM() OVER()

The SUM() function combined with the OVER() clause calculates the sum of a specified column over a window of rows. The SUM() function takes a single argument, which is the column to be aggregated. This allows us to compute running totals, cumulative sums, or other aggregations within the specified window.

Example:

Company Stock_Price Cumulative_Sum
Metamorphosis LLC 542.12 542.12
Orange Inc. 542.12 1084.24
MacroString Corp. 902.93 1987.17
Spookify 1252.97 3240.14

(Using SUM(Stock_Price) OVER (ORDER BY Stock_Price))

AVG() OVER()

The AVG() function with the OVER() clause calculates the average of a column over a window of rows. It works similarly to SUM() OVER(), but provides the average value instead. This function is useful for calculating moving averages or obtaining average values within specific partitions or orderings.

Example:

Company Stock_Price Running_Average
Metamorphosis LLC 542.12 542.12
Orange Inc. 642.78 592.45
MacroString Corp. 902.93 695.94
Spookify 1252.97 835.20

(Using AVG(Stock_Price) OVER (ORDER BY Company))

MIN() OVER()

The MIN() function with the OVER() clause returns the minimum value of a column over a defined window. This allows us to find the minimum value over a specific range or partition, which is helpful for finding the earliest date or the minimum value within each group.

Company Stock_Price Running_Min
Metamorphosis LLC 542.12 542.12
Orange Inc. 642.78 542.12
MacroString Corp. 902.93 542.12
Spookify 1252.97 542.12

(Using MIN(Stock_Price) OVER (ORDER BY Company))

MAX() OVER()

Similar to MIN() OVER(), the MAX() function with the OVER() clause returns the maximum value of a column over a specified window.

Company Stock_Price Running_Max
Metamorphosis LLC 542.12 542.12
Orange Inc. 642.78 642.78
MacroString Corp. 902.93 902.93
Spookify 1252.97 1252.97

(Using MAX(Stock_Price) OVER (ORDER BY Company))

COUNT() OVER()

The COUNT() function with **OVER() ** provides the number of rows within a window. It can count rows within specific partitions or orderings, which is helpful if you need to find the number of items sold within a period or any other counts within specific groups.

Company Stock_Price Running_Count
Metamorphosis LLC 542.12 1
Orange Inc. 642.78 2
MacroString Corp. 902.93 3
Spookify 1252.97 4

(Using COUNT(Stock_Price) OVER (ORDER BY Company))

Statistical Window Functions

Statistical window functions are another set of analytic functions that perform statistical calculations over a window of rows from a result set. These functions allow us to derive valuable statistical insights and metrics from our data.

STDDEV() OVER()

The STDDEV() OVER() function calculates the standard deviation of a column over a specified window, providing insight into the data’s spread.

Example:

employee_id employee_name department_name department_id salary Sample Standard Deviation
1 Alice HR 101 60000 2500
2 Bob HR 101 55000 2500
4 David Sales 102 80000 5000
3 Charlie Sales 102 70000 5000
5 Eve IT 103 65000 2500
6 Frank IT 103 70000 2500

(Using STDDEV(salary) OVER (PARTITION BY department_id))

VARIANCE() OVER()

The VARIANCE() function calculates the variance (average squared deviation from the mean) of a column over a defined window.

Example:

employee_id employee_name department_name department_id salary Variance
1 Alice HR 101 60000 6250000
2 Bob HR 101 55000 6250000
4 David Sales 102 80000 25000000
3 Charlie Sales 102 70000 25000000
5 Eve IT 103 65000 6250000
6 Frank IT 103 70000 6250000

(Using VARIANCE(salary) OVER (PARTITION BY department_id))

Offset Functions

Offset window functions allow you to access data from rows that are offset from the current row within the window. These functions are helpful when we need to compare the current row to a preceding or successive row. Examples include comparing today’s stock data to last week’s stock data (LAG), or comparing salaries to the lowest salary (FIRST_VALUE with ORDER BY).

LAG()

The LAG() function retrieves a value from a previous row within the window. It takes two arguments: the column you want to retrieve the value from and the offset (number of rows before the current row). For example, LAG(column_name, offset) returns the value from column_name that is offset rows before the current row.

? Note: Offset is an INT type parameter.*

Example:

Company Stock_Price LAG_Stock_Price
Metamorphosis LLC 542.12 NULL
Orange Inc. 542.12 542.12
MacroString Corp. 902.93 542.12
Spookify 1252.97 902.93

(Using LAG(Stock_Price) OVER (ORDER BY Stock_Price) AS LAG_Stock_Price)

LEAD()

The LEAD() function is similar to LAG(), but it retrieves the value from a subsequent row within the window. It also takes a similar set of arguments with the column you want to retrieve the value from and the offset (number of rows after the current row). For example, LEAD(column_name, offset) returns the value from column_name that is offset rows after the current row.

Example:

Company Stock_Price LEAD_Stock_Price
Metamorphosis LLC 542.12 542.12
Orange Inc. 542.12 902.93
MacroString Corp. 902.93 1252.97
Spookify 1252.97 NULL

(Using LEAD(Stock_Price) OVER (ORDER BY Stock_Price) AS LEAD_Stock_Price)

FIRST_VALUE()

FIRST_VALUE() retrieves the value from the first row in the window. It takes one argument: the column you want to retrieve the value from. For example, FIRST_VALUE(column_name) returns the value from column_name in the first row of the window.

Example:

Company Stock_Price FIRST_Stock_Price
Metamorphosis LLC 542.12 542.12
Orange Inc. 542.12 542.12
MacroString Corp. 902.93 542.12
Spookify 1252.97 542.12

(Using FIRST_VALUE(Stock_Price) OVER (ORDER BY Stock_Price) AS FIRST_Stock_Price)

LAST_VALUE()

LAST_VALUE() is similar to FIRST_VALUE(), but it retrieves the value from the last row in the window. It also takes one argument: the column you want to retrieve the value from. For example, LAST_VALUE(column_name) returns the value from column_name in the last row of the window.

Example:

Company Stock_Price LAST_Stock_Price
Metamorphosis LLC 542.12 1252.97
Orange Inc. 542.12 1252.97
MacroString Corp. 902.93 1252.97
Spookify 1252.97 1252.97

(Using LAST_VALUE(Stock_Price) OVER (ORDER BY Stock_Price) as LAST_Stock_Price))

SQL Analytic Functions Interview Questions

Third Purchase

Want to try this out? Answer it on your own on Interview Query.

Given the transactions table below, write a query that finds the third purchase of every user.

Note: Sort the results by the user_id in ascending order. If a user purchases two products at the same time, the lower id field is used to determine which is the first purchase.

Example:

Input:

transactions table

Columns Type
id INTEGER
user_id INTEGER
created_at DATETIME
product_id INTEGER
quantity INTEGER

Output:

Columns Type
user_id INTEGER
created_at DATETIME
product_id INTEGER
quantity INTEGER

Solution

This problem set is relatively straightforward at first. We can first determine the order of purchases for every user by looking at the created_at column and ordering by user_id and the created_at column.

However, we still need an indicator of which purchase was the third value. Whenever we want to rank our dataset, it’s helpful to then immediately think of a specific window function to use.

In this case, we need to apply the RANK function to the transactions table. The RANK function is a window function that assigns a rank to each row in the partition of the result set.

RANK() OVER (PARTITION BY user_id ORDER BY created_at ASC, id ASC) AS rank_value

In this example, the PARTITION BY clause distributes the rows in the result set into partitions by one or more criteria.

The ORDER BY clause then sorts the rows in each partition by the column we indicated– in this case, created_at.

Finally, the RANK() function operates on the rows of each partition and re-initializes when crossing each partition boundary. The end result is a column with the rank of each purchase partitioned by user_id.

All we have to do is then wrap the table in a subquery and filter out where the new column is equal to 3, which is equivalent to subsetting for the third purchase.

SELECT user_id, created_at, product_id, quantity
FROM (
    SELECT
        user_id
        , created_at
        , product_id
        , quantity
        , RANK() OVER (
            PARTITION BY user_id
            ORDER BY created_at ASC, id ASC
        ) AS rank_value
    FROM transactions
) AS t
WHERE rank_value = 3
ORDER BY user_id ASC

Second-longest Flight

Try this on your own on Interview Query.

Given a table of **flights**, extract the 2nd flight with the longest duration between each pair of cities. Order the flights by the flight id ascending.

Note: For any cities X and Y(source_location=X, destination_location=Y) and (source_location=Y, destination_location=X) are counted as the same pair of cities.

Note: If there are fewer than two flights between two cities, there is no 2nd longest flight.

Example:

Input:

flights table

Column Type
id INTEGER
destination_location VARCHAR
source_location VARCHAR
plane_id INTEGER
flight_start DATETIME
flight_end DATETIME

Output:

Column Type
id INTEGER
destination_location VARCHAR
source_location VARCHAR
flight_start DATETIME
flight_end DATETIME

Solution

The solution for this question is quite long, given that we have a graph-based problem that specifically handles undirected edges. Meanwhile, the SQL table clearly shows that our data is directed. To make up for this, we have to conduct operations that consider symmetrical edges, as the question specifies that (source_location=X, destination_location=Y) and (source_location=Y, destination_location=X) are counted as the same pair of cities.

This increases the question’s complexity. Moreover, it’s noted in the question that there are specified edge cases, wherein if there are fewer than two flights between two cities, there is no second-longest flight. This condition adds another level of difficulty, requiring the solution to handle cases where a city pair does not have at least two flights.

Let’s take a look at our solution below:

WITH cte_1 AS (
SELECT  id, destination_location AS x, source_location AS y, flight_start, flight_end, TIMESTAMPDIFF(SECOND, flight_start, flight_end) AS diff, 1 AS helper FROM flights
UNION ALL
SELECT  id, source_location AS x, destination_location AS y, flight_start, flight_end, TIMESTAMPDIFF(SECOND, flight_start, flight_end) AS diff, 2 AS helper FROM flights
),
cte_2 AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY x,y ORDER BY diff DESC) AS ranking FROM cte_1
)
SELECT id, x AS destination_location, y AS source_location, flight_start, flight_end FROM cte_2 WHERE ranking = 2 AND helper= 1
ORDER BY id ASC

To extract the second longest flight between pairs of cities from the given flights table, we can use the following SQL solution:

  1. CTE Creation (cte_1):
    • We start by creating a CTE named cte_1, which serves as a foundation for our subsequent calculations.
    • In the first part of cte_1, we select the flight details, source location, destination location, and duration of each flight using the TIMESTAMPDIFF() function to calculate the duration in seconds.
    • In the second part of cte_1, we union the previous selection with another selection that swaps the source and destination locations, effectively considering flights in both directions.
    • We also include a helper column with the value of 1 for the first part and 2 for the second part. This column will assist in further calculations.
  2. CTE Creation (cte_2):
    • Building upon cte_1, we create another CTE named cte_2 to rank the flights within each pair of cities based on their duration.
    • Using the ROW_NUMBER() window function partitioned by the source and destination locations and ordered by the flight duration in descending order, we assign a ranking to each flight within its respective city pair.
  3. Final Query and Output:
    • In the final query, we select flights from cte_2 where the ranking equals 2 (indicating the second longest flight) and the helper value is 1 (to avoid duplicate entries).
    • The selected columns include the flight ID, destination location, source location, flight start time, and flight end time.
    • The result is ordered by the flight ID in ascending order.

The provided solution efficiently determines the second longest flight between each pair of cities, considering both source and destination locations. By employing CTEs and window functions, the query calculates flight durations, assigns rankings, and filters the desired results.

Learn SQL with Interview Query

The best way to get better in SQL is practice. Interview Query offers a variety of SQL learning resources to help you practice and improve your SQL skills, including: