SQL for Data Analytics — Aggregation Functions

7 min readFeb 18, 2023

Hi All, This is the fourth article of SQL for Data Analytics article series article. In this article, we will discuss what are aggregation functions are and how to use them. If you like to learn how to SQL basics for data analytics please check out my previous articles 👇 .

SQL for Data Analytics —Installation and select queries

If you are looking for a place to learn SQL for Data Analytics you are at the correct place. In this article series we…

randomdataprojects.medium.com

SQL for Data Analytics — Where clause

You can use SQL not only to extract tables from a database as it is, but you can also filter data based on business…

randomdataprojects.medium.com

If you wish to practice the exercise in this article I recommend you to use Chinook at SQL Online IDE . For more instructions please refer the first article.

Content ;

· What are Aggregation functions?
∘ Aggregation functions in SQL
∘ Sum()
∘ Count()
∘ AVG()
∘ MAX() and MIN()
· Aggregation functions + conditions
· HAVING Clause

What are Aggregation functions?

Assume you need to find each customer's total billing value or total number of customers who bought products in a day from the invoice table.

Summary of Invoices table
-----------------------------------------------
InvoiceId|CustomerId|InvoiceDate        |Total|
---------+----------+-------------------+-----+
        1|         2|2009-01-01 00:00:00| 1.98|
        2|         4|2009-01-02 00:00:00| 3.96|
        3|         8|2009-01-03 00:00:00| 5.94|
        4|        14|2009-01-06 00:00:00| 8.91|
        5|        23|2009-01-11 00:00:00|13.86|
        6|        37|2009-01-19 00:00:00| 0.99|
        7|        38|2009-02-01 00:00:00| 1.98|
        8|        40|2009-02-01 00:00:00| 1.98|
        9|        42|2009-02-02 00:00:00| 3.96|
       10|        46|2009-02-03 00:00:00| 5.94|
       11|        52|2009-02-06 00:00:00| 8.91|

In order to find the total billing value (Total column) , you need to consider the sum of the Total column for each customer.
By counting the distinct customer IDs in each day we can find the number of customers who bought products in a day.

Aggregate queries provide a more holistic view of the data by further processing the retrieved data. Categorizes the query results according to the contents of a column in the database. Multiple levels of subgroups can be created by specifying multiple columns

They can work on

On all the rows in a table
A subset of rows in a table selected using a where clause
Groups of selected data organized using Group By clause

Syntax;

SELECT  column_1, function(column_2)
FROM table_name 
WHERE condition 
GROUP BY  column_1 ;

note: where conditions are not necessary for aggregating queries. If you are not group by /summarize by any column, it is not required to have the group by clause.

Aggregation functions in SQL

Lets try out few examples;

Sum()

Find the sum of all invoices ?

SELECT
 SUM(TOTAL)
FROM
 INVOICES ;

In the above query we select the sum of Total column in invoices table. Since we do not want the sum of each customer or each day , we do not have to use the group by clause.

SUM(TOTAL)       |
-----------------+
2328.600000000004|

Find how much each customer has spent ?

SELECT
 CUSTOMERID ,
 SUM(TOTAL)
FROM
 INVOICES
GROUP BY
 CUSTOMERID ;

CustomerId|SUM(TOTAL)        |
----------+------------------+
         1|             39.62|
         2|37.620000000000005|
         3|             39.62|
         4|             39.62|
         5|40.620000000000005|
         6|49.620000000000005|
         7|             42.62|
         8|             37.62|
         9|37.620000000000005|
        10|37.620000000000005|
        11|37.620000000000005|

In the results sum of the ‘Total’ for each is selected from the Invoices table. ‘Group by’ clause groups the results by non-aggregating columns.

Count()

Count the number of rows in the Invoices table?

SELECT
 COUNT(*)
FROM
 INVOICE_ITEMS ;

COUNT(*)|
--------+
    2240|

In the query we have used Count(*) . In SQL * rows use to select all rows. Similarly count(*) , count all the rows.

Count number of Invoice Ids for each customer in the invoices table ?


SELECT CustomerId ,COUNT(InvoiceId) 
from invoices
group by CustomerId ;

CustomerId|COUNT(InvoiceId)|
----------+----------------+
         1|               7|
         2|               7|
         3|               7|
         4|               7|
         5|               7|
         6|               7|
         7|               7|
         8|               7|

You can use Alias to change the aggregated column name in the results

SELECT CustomerId ,COUNT(InvoiceId) as INVOICE_COUNT
from invoices
group by CustomerId ;

CustomerId|INVOICE_COUNT|  -- << aggrated column name changed to INVOICE_COUNT
----------+-------------+
         1|            7|
         2|            7|
         3|            7|
         4|            7|
         5|            7|
         6|            7|
         7|            7|
         8|            7|

Count the number of unique customers per country in the invoices table?

SELECT
 BILLINGCOUNTRY ,
 COUNT(DISTINCT CUSTOMERID) AS NUMBER_OF_CUSTOMERS
FROM
 INVOICES
GROUP BY
 BILLINGCOUNTRY ;

BillingCountry|number_of_customers|
--------------+-------------------+
Argentina     |                  1|
Australia     |                  1|
Austria       |                  1|
Belgium       |                  1|
Brazil        |                  5|
Canada        |                  8|
Chile         |                  1|
Czech Republic|                  2|

You can use DISTINCT key word with COUNT to , count the number of distinct (unique) items in a column.

Hi freinds, If you like to read more on SQL, Python, and Data Science follow us on Medium. Don’t forget to clap 👏 the article if you like the article and share it with your friends.

AVG()

Calculate the average billing value per customer from the Invoices table ?

SELECT
 CUSTOMERID,
 AVG(TOTAL) AVG_BILLING_VALUE
FROM
 INVOICES
GROUP BY
 CUSTOMERID ;

CustomerId|AVG_BILLING_VALUE |
----------+------------------+
         1| 5.659999999999999|
         2|5.3742857142857146|
         3| 5.659999999999999|
         4| 5.659999999999999|
         5| 5.802857142857143|
         6| 7.088571428571429|
         7| 6.088571428571428|
         8| 5.374285714285714|

You can use multiple non aggregating column with multiple aggergation functions.But you should group by the results with all the non aggrgating columns in the select query.

Select the unique customer count and average billing value (total column) per country ,per city ?

SELECT 
 BILLINGCOUNTRY ,
 BILLINGCITY ,
 COUNT(DISTINCT CUSTOMERID) CUSTOMER_COUNT,
 AVG(TOTAL) AVG_BILLING_VALUE
FROM
 INVOICES
GROUP BY
 BILLINGCOUNTRY ,
 BILLINGCITY ;

BillingCountry|BillingCity        |customer_count|avg_billing_value |
--------------+-------------------+--------------+------------------+
Argentina     |Buenos Aires       |             1|5.3742857142857146|
Australia     |Sidney             |             1|5.3742857142857146|
Austria       |Vienne             |             1| 6.088571428571428|
Belgium       |Brussels           |             1| 5.374285714285714|
Brazil        |Brasília           |             1|5.3742857142857146|
Brazil        |Rio de Janeiro     |             1| 5.374285714285714|
Brazil        |São José dos Campos|             1| 5.659999999999999|
Brazil        |São Paulo          |             2| 5.374285714285713|
Canada        |Edmonton           |             1|5.3742857142857146|
Canada        |Halifax            |             1|5.3742857142857146|
Canada        |Montréal           |             1| 5.659999999999999|

MAX() and MIN()

Select the maximum and minimum ‘Total’ value for each country in the invoices table.

SELECT
 BILLINGCOUNTRY ,
 MAX(TOTAL),
 MIN(TOTAL)
FROM
 INVOICES I
GROUP BY
 BILLINGCOUNTRY ;

BillingCountry|max(Total)|min(Total)|
--------------+----------+----------+
Argentina     |     13.86|      0.99|
Australia     |     13.86|      0.99|
Austria       |     18.86|      0.99|
Belgium       |     13.86|      0.99|
Brazil        |     13.86|      0.99|
Canada        |     13.86|      0.99|
Chile         |     17.91|      0.99|
Czech Republic|     25.86|      0.99|

Aggregation functions + conditions

Ok, let's try something little complex. We can combine aggregation functions with where clause. That allows you to filter data based on conditions and aggregate the filtered data.

Select the sum of the billing value (Total column) of each customers from USA ?

SELECT
 CUSTOMERID ,
 BILLINGCOUNTRY,
 SUM(TOTAL)
FROM
 INVOICES
WHERE
 BILLINGCOUNTRY = 'USA'
GROUP BY
 CUSTOMERID ,
 BILLINGCOUNTRY
;

CustomerId|BillingCountry|sum(Total)        |
----------+--------------+------------------+
        16|USA           |             37.62|
        17|USA           |39.620000000000005|
        18|USA           |37.620000000000005|
        19|USA           |38.620000000000005|
        20|USA           |             39.62|
        21|USA           |             37.62|
        22|USA           |             39.62|

Above query select the customers in ‘USA’ then apply the sum(Totla).

2. Tracks table contains the size ( bytes) of a track and the duration of a track. Find the average duration of a track in a genre where size (Bytes) is greater than or equal to 33,500,000 bytes.

SELECT
 GENREID ,
 AVG(MILLISECONDS)
FROM
 TRACKS T
WHERE
 BYTES >=33500000
GROUP BY
 GENREID ;

GenreId|AVG(MILLISECONDS) |
-------+------------------+
      1|         1248796.0|
     18| 2625549.076923077|
     19|2145041.0215053763|
     20|2911783.0384615385|
     21| 2614372.222222222|
     22| 1585263.705882353|
     23|          294294.0|

Above query select the tracks, where size is greater than or equal to 33500000, then calculate the average duration for each all the songs in each genre.

HAVING Clause

What if you want to filter data, after aggregating the data. You can return the sum of each customers invoice value with below query.

SELECT
 CUSTOMERID ,
 SUM(TOTAL) TOTAL
FROM
 INVOICES I
GROUP BY
 CUSTOMERID ;

Results

CustomerId|total             |
----------+------------------+
         1|             39.62|
         2|37.620000000000005|
         3|             39.62|
         4|             39.62|
         5|40.620000000000005|
         6|49.620000000000005|
         7|             42.62|
         8|             37.62|
         9|37.620000000000005|
        10|37.620000000000005|
        11|37.620000000000005|
        12|             37.62|
        13|37.620000000000005|
        14|37.620000000000005|
        15|38.620000000000005|

What if you need to filter customers with sum(total) (sum of total invoices) more than 40, from the above aggregated result. Then you have to you HAVING clause.

The HAVING Clause enables you to specify conditions that filter which group results appear in the results.The WHERE clause places conditions on the selected columns, whereas the HAVING clause places conditions on groups created by the GROUP BY clause

Sytanx;

SELECT column_name(s) ,function(colum)
FROM table_name 
WHERE condition 
GROUP BY column_name(s) 
HAVING condition ORDER BY column_name(s);

Lets filter customers with sum of billing value of more than 40.

SELECT
 CUSTOMERID ,
 SUM(TOTAL) TOTAL
FROM
 INVOICES I
GROUP BY
 CUSTOMERID
HAVING
 SUM(TOTAL)>40;

CustomerId|TOTAL             |
----------+------------------+
         5|40.620000000000005|
         6|49.620000000000005|
         7|             42.62|
        24|             43.62|
        25|             42.62|
        26|47.620000000000005|
        28|43.620000000000005|
        37|             43.62|
        43|40.620000000000005|
        44|41.620000000000005|
        45|             45.62|
        46|             45.62|
        48|             40.62|
        57|             46.62|

We are done with aggregation functions. If you learned something new or cleared your doubts don't forget to clap 👏. Follow me if you like to read more on SQL for data analytics and other data science-related topics please follow me🤗 .

In the next Article, we will discuss how to join two or more tables horizontally 👇.

SQL for Data Analytics — Joins

In this article, we will discuss how to join multiple tables and how to extract informative insights.

madushanpathriana.medium.com