GROUP BY and HAVING
GROUP BY
In the beginner level course, we saw how aggregate functions can be used to summarize metrics from a table. However, their real utility comes when we combine them with the GROUP BY clause.
Let’s say we have a sales table with the following information:
| employeeID | date | amount |
|---|---|---|
| 1 | 2021-01-01 | 100.00 |
| 2 | 2021-01-01 | 200.00 |
| 1 | 2021-01-02 | 150.00 |
| 3 | 2021-01-03 | 300.00 |
| 2 | 2021-01-03 | 150.00 |
| 4 | 2021-01-04 | 400.00 |
If we wanted to know the total amount sold by each employee with the tools we’ve learned so far, we would need a separate query for each of them.
In this case, this would mean that we would find the total amount sold by each employee separately. For example, to find the amount sold by the first employee, we could use the following query:
SELECT employeeID, SUM(amount) AS total_sold
FROM employees
WHERE employeeID = 1
The WHERE clause filters the sales done by the first employee:
| employeeID | date | amount |
|---|---|---|
| 1 | 2021-01-01 | 100.00 |
| 1 | 2021-01-02 | 150.00 |
And then, the aggregation function finds the total amount sold for this smaller table. The output of the query would be:
| employeeID | total_sold |
|---|---|
| 1 | 250.00 |
If the table is too large, this approach is inconvenient, as it would need to find the total amount sold for each employee separately.
SQL allows us to solve this problem with a single query, using the GROUP BY clause. The GROUP BY clause lets us:
- Divide our table into smaller tables according to a single column.
- Perform an aggregate function over each of them.
- Integrate the results into a single table again.
In this case, the query we would use is:
SELECT employeeID, SUM(amount) AS total_sol
FROM employees
GROUP BY employeeID
Our output would be:
| employeeID | total_amount_sold |
|---|---|
| 1 | 250.00 |
| 2 | 350.00 |
| 3 | 300.00 |
| 4 | 400.00 |
The logic of it works in the following way:
- The
GROUP BYclause divides our original table into four smaller tables according to theiremployeeID.
Table_1:
| employeeID | date | amount |
|---|---|---|
| 1 | 2021-01-01 | 100.00 |
| 1 | 2021-01-02 | 150.00 |
Table_2:
| employeeID | date | amount |
|---|---|---|
| 2 | 2021-01-01 | 200.00 |
| 2 | 2021-01-03 | 150.00 |
Table_3:
| employeeID | date | amount |
|---|---|---|
| 3 | 2021-01-03 | 300.00 |
Table_4:
| employeeID | date | amount |
|---|---|---|
| 4 | 2021-01-04 | 400.00 |
- Then, we select
employeeIDandSUM(amount) AS total_amountfrom each of the tables:
Table 1:
| employeeID | total_amount_sold |
|---|---|
| 1 | 250.00 |
Table 2:
| employeeID | total_sold |
|---|---|
| 2 | 350.00 |
Table 3:
| employeeID | total_sold |
|---|---|
| 3 | 300.00 |
Table 4:
| employeeID | total_sold |
|---|---|
| 4 | 400.00 |
- Finally, the results are integrated into a single table:
| employeeID | total_sold |
|---|---|
| 1 | 250.00 |
| 2 | 350.00 |
| 3 | 300.00 |
| 4 | 400.00 |
In reality, the GROUP BY clause does all those steps at once, and is more efficient than carrying them all out separately.
We can use the GROUP BY clause with any aggregate function we need. We can specify more than one argument in the GROUP BY clause by using a separating comma. The original tables will be divided into smaller groups for which all the columns in the GROUP BY clause have the same values.
For example, GROUP BY location_id, employee_id would divide a table into smaller groups that share the same location_id and employee_id.
HAVING
The HAVING clause is a filter for the groups made by the GROUP BY. It works similarly to the WHERE clause, but it’s applied to groups instead of single rows.
For example, if we wanted to find the total amount sold for employees who performed more than one sale, we could do:
SELECT employeeID,
SUM(amount) AS total_sold
FROM employees
GROUP BY employeeID
HAVING COUNT(*) > 1
Here, the GROUP BY clause would divide our table into groups according to their employeeID and then filter each of the groups according to our condition. In this case, it would discard the smaller tables 3 and 4, as they only have one row.
After filtering the groups, it continues just as the GROUP BY clause does. It performs the aggregate functions in the SELECT on each of the groups and integrates the results into a single table.
So far:
Example query
SELECT first_name, SUM(salary)
FROM employees
JOIN sales
ON employee.id = sales.employeeID
WHERE salary > 7500 AND last_name != "Williams"
GROUP BY employee.id
HAVING COUNT(*) > 1
Execution order
FROM-> Gets input tablesJOIN … ON-> Merges input tablesWHERE-> Filters rows according to conditionGROUP BY-> Separates into smaller tablesHAVING-> Filters smaller tables according to conditionSELECT-> Selects the columns we need (and integrates smaller tables)SUM()and aggregate functions or numerical operations performed on output.
48%
CompletedYou have 29 sections remaining on this learning path.
