SQL for Data Analytics — Aggregation Functions

Madushan Pathirana
7 min readFeb 18, 2023

--

Hi All, This is the fourth article of SQL for Data Analytics article series article. In this article, we will discuss what are aggregation functions are and how to use them. If you like to learn how to SQL basics for data analytics please check out my previous articles 👇 .

If you wish to practice the exercise in this article I recommend you to use Chinook at SQL Online IDE . For more instructions please refer the first article.

Content ;

· What are Aggregation functions?
Aggregation functions in SQL
Sum()
Count()
AVG()
MAX() and MIN()
· Aggregation functions + conditions
· HAVING Clause

What are Aggregation functions?

Assume you need to find each customer's total billing value or total number of customers who bought products in a day from the invoice table.

Summary of Invoices table
-----------------------------------------------
InvoiceId|CustomerId|InvoiceDate |Total|
---------+----------+-------------------+-----+
1| 2|2009-01-01 00:00:00| 1.98|
2| 4|2009-01-02 00:00:00| 3.96|
3| 8|2009-01-03 00:00:00| 5.94|
4| 14|2009-01-06 00:00:00| 8.91|
5| 23|2009-01-11 00:00:00|13.86|
6| 37|2009-01-19 00:00:00| 0.99|
7| 38|2009-02-01 00:00:00| 1.98|
8| 40|2009-02-01 00:00:00| 1.98|
9| 42|2009-02-02 00:00:00| 3.96|
10| 46|2009-02-03 00:00:00| 5.94|
11| 52|2009-02-06 00:00:00| 8.91|
  • In order to find the total billing value (Total column) , you need to consider the sum of the Total column for each customer.
  • By counting the distinct customer IDs in each day we can find the number of customers who bought products in a day.

Aggregate queries provide a more holistic view of the data by further processing the retrieved data. Categorizes the query results according to the contents of a column in the database. Multiple levels of subgroups can be created by specifying multiple columns

They can work on

  • On all the rows in a table
  • A subset of rows in a table selected using a where clause
  • Groups of selected data organized using Group By clause

Syntax;

SELECT  column_1, function(column_2)
FROM table_name
WHERE condition
GROUP BY column_1 ;

note: where conditions are not necessary for aggregating queries. If you are not group by /summarize by any column, it is not required to have the group by clause.

Aggregation functions in SQL

Summary of SQL aggregation function

Lets try out few examples;

Sum()

Find the sum of all invoices ?

SELECT
SUM(TOTAL)
FROM
INVOICES ;

In the above query we select the sum of Total column in invoices table. Since we do not want the sum of each customer or each day , we do not have to use the group by clause.

SUM(TOTAL)       |
-----------------+
2328.600000000004|

Find how much each customer has spent ?

SELECT
CUSTOMERID ,
SUM(TOTAL)
FROM
INVOICES
GROUP BY
CUSTOMERID ;
CustomerId|SUM(TOTAL)        |
----------+------------------+
1| 39.62|
2|37.620000000000005|
3| 39.62|
4| 39.62|
5|40.620000000000005|
6|49.620000000000005|
7| 42.62|
8| 37.62|
9|37.620000000000005|
10|37.620000000000005|
11|37.620000000000005|

In the results sum of the ‘Total’ for each is selected from the Invoices table. ‘Group by’ clause groups the results by non-aggregating columns.

Count()

Count the number of rows in the Invoices table?

SELECT
COUNT(*)
FROM
INVOICE_ITEMS ;
COUNT(*)|
--------+
2240|

In the query we have used Count(*) . In SQL * rows use to select all rows. Similarly count(*) , count all the rows.

Count number of Invoice Ids for each customer in the invoices table ?


SELECT CustomerId ,COUNT(InvoiceId)
from invoices
group by CustomerId ;
CustomerId|COUNT(InvoiceId)|
----------+----------------+
1| 7|
2| 7|
3| 7|
4| 7|
5| 7|
6| 7|
7| 7|
8| 7|

You can use Alias to change the aggregated column name in the results

SELECT CustomerId ,COUNT(InvoiceId) as INVOICE_COUNT
from invoices
group by CustomerId ;
CustomerId|INVOICE_COUNT|  -- << aggrated column name changed to INVOICE_COUNT
----------+-------------+
1| 7|
2| 7|
3| 7|
4| 7|
5| 7|
6| 7|
7| 7|
8| 7|

Count the number of unique customers per country in the invoices table?

SELECT
BILLINGCOUNTRY ,
COUNT(DISTINCT CUSTOMERID) AS NUMBER_OF_CUSTOMERS
FROM
INVOICES
GROUP BY
BILLINGCOUNTRY ;
BillingCountry|number_of_customers|
--------------+-------------------+
Argentina | 1|
Australia | 1|
Austria | 1|
Belgium | 1|
Brazil | 5|
Canada | 8|
Chile | 1|
Czech Republic| 2|

You can use DISTINCT key word with COUNT to , count the number of distinct (unique) items in a column.

Hi freinds, If you like to read more on SQL, Python, and Data Science follow us on Medium. Don’t forget to clap 👏 the article if you like the article and share it with your friends.

AVG()

Calculate the average billing value per customer from the Invoices table ?

SELECT
CUSTOMERID,
AVG(TOTAL) AVG_BILLING_VALUE
FROM
INVOICES
GROUP BY
CUSTOMERID ;
CustomerId|AVG_BILLING_VALUE |
----------+------------------+
1| 5.659999999999999|
2|5.3742857142857146|
3| 5.659999999999999|
4| 5.659999999999999|
5| 5.802857142857143|
6| 7.088571428571429|
7| 6.088571428571428|
8| 5.374285714285714|

You can use multiple non aggregating column with multiple aggergation functions.But you should group by the results with all the non aggrgating columns in the select query.

Select the unique customer count and average billing value (total column) per country ,per city ?

SELECT 
BILLINGCOUNTRY ,
BILLINGCITY ,
COUNT(DISTINCT CUSTOMERID) CUSTOMER_COUNT,
AVG(TOTAL) AVG_BILLING_VALUE
FROM
INVOICES
GROUP BY
BILLINGCOUNTRY ,
BILLINGCITY ;
BillingCountry|BillingCity        |customer_count|avg_billing_value |
--------------+-------------------+--------------+------------------+
Argentina |Buenos Aires | 1|5.3742857142857146|
Australia |Sidney | 1|5.3742857142857146|
Austria |Vienne | 1| 6.088571428571428|
Belgium |Brussels | 1| 5.374285714285714|
Brazil |Brasília | 1|5.3742857142857146|
Brazil |Rio de Janeiro | 1| 5.374285714285714|
Brazil |São José dos Campos| 1| 5.659999999999999|
Brazil |São Paulo | 2| 5.374285714285713|
Canada |Edmonton | 1|5.3742857142857146|
Canada |Halifax | 1|5.3742857142857146|
Canada |Montréal | 1| 5.659999999999999|

MAX() and MIN()

Select the maximum and minimum ‘Total’ value for each country in the invoices table.

SELECT
BILLINGCOUNTRY ,
MAX(TOTAL),
MIN(TOTAL)
FROM
INVOICES I
GROUP BY
BILLINGCOUNTRY ;
BillingCountry|max(Total)|min(Total)|
--------------+----------+----------+
Argentina | 13.86| 0.99|
Australia | 13.86| 0.99|
Austria | 18.86| 0.99|
Belgium | 13.86| 0.99|
Brazil | 13.86| 0.99|
Canada | 13.86| 0.99|
Chile | 17.91| 0.99|
Czech Republic| 25.86| 0.99|

Aggregation functions + conditions

Ok, let's try something little complex. We can combine aggregation functions with where clause. That allows you to filter data based on conditions and aggregate the filtered data.

  1. Select the sum of the billing value (Total column) of each customers from USA ?
SELECT
CUSTOMERID ,
BILLINGCOUNTRY,
SUM(TOTAL)
FROM
INVOICES
WHERE
BILLINGCOUNTRY = 'USA'
GROUP BY
CUSTOMERID ,
BILLINGCOUNTRY
;
CustomerId|BillingCountry|sum(Total)        |
----------+--------------+------------------+
16|USA | 37.62|
17|USA |39.620000000000005|
18|USA |37.620000000000005|
19|USA |38.620000000000005|
20|USA | 39.62|
21|USA | 37.62|
22|USA | 39.62|

Above query select the customers in ‘USA’ then apply the sum(Totla).

2. Tracks table contains the size ( bytes) of a track and the duration of a track. Find the average duration of a track in a genre where size (Bytes) is greater than or equal to 33,500,000 bytes.

SELECT
GENREID ,
AVG(MILLISECONDS)
FROM
TRACKS T
WHERE
BYTES >=33500000
GROUP BY
GENREID ;
GenreId|AVG(MILLISECONDS) |
-------+------------------+
1| 1248796.0|
18| 2625549.076923077|
19|2145041.0215053763|
20|2911783.0384615385|
21| 2614372.222222222|
22| 1585263.705882353|
23| 294294.0|

Above query select the tracks, where size is greater than or equal to 33500000, then calculate the average duration for each all the songs in each genre.

HAVING Clause

What if you want to filter data, after aggregating the data. You can return the sum of each customers invoice value with below query.

SELECT
CUSTOMERID ,
SUM(TOTAL) TOTAL
FROM
INVOICES I
GROUP BY
CUSTOMERID ;

Results

CustomerId|total |
----------+------------------+
1| 39.62|
2|37.620000000000005|
3| 39.62|
4| 39.62|
5|40.620000000000005|
6|49.620000000000005|
7| 42.62|
8| 37.62|
9|37.620000000000005|
10|37.620000000000005|
11|37.620000000000005|
12| 37.62|
13|37.620000000000005|
14|37.620000000000005|
15|38.620000000000005|

What if you need to filter customers with sum(total) (sum of total invoices) more than 40, from the above aggregated result. Then you have to you HAVING clause.

The HAVING Clause enables you to specify conditions that filter which group results appear in the results.The WHERE clause places conditions on the selected columns, whereas the HAVING clause places conditions on groups created by the GROUP BY clause

Sytanx;

SELECT column_name(s) ,function(colum)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition ORDER BY column_name(s);

Lets filter customers with sum of billing value of more than 40.

SELECT
CUSTOMERID ,
SUM(TOTAL) TOTAL
FROM
INVOICES I
GROUP BY
CUSTOMERID
HAVING
SUM(TOTAL)>40;

CustomerId|TOTAL |
----------+------------------+
5|40.620000000000005|
6|49.620000000000005|
7| 42.62|
24| 43.62|
25| 42.62|
26|47.620000000000005|
28|43.620000000000005|
37| 43.62|
43|40.620000000000005|
44|41.620000000000005|
45| 45.62|
46| 45.62|
48| 40.62|
57| 46.62|

We are done with aggregation functions. If you learned something new or cleared your doubts don't forget to clap 👏. Follow me if you like to read more on SQL for data analytics and other data science-related topics please follow me🤗 .

In the next Article, we will discuss how to join two or more tables horizontally 👇.

--

--

Madushan Pathirana
Madushan Pathirana

Written by Madushan Pathirana

"Data Science for Everyone". Hi I am a data enthusiast like you who try to simplify data science for everyone.

No responses yet