Upload
others
View
13
Download
0
Embed Size (px)
Citation preview
Using MySQL Common Table Expressions and Window Functions [HOL2986]
Jesper Wisborg Krogh
Senior Principal Technical Support Engineer
Prerequisites
Note: This is not required when using the laptops at HOL2986 at Oracle OpenWorld
2018.
It is assumed the following software is already installed on the machine where you try
the examples in this workbook. If you are attending the hands-on labs session Using
MySQL Common Table Expressions and Window Functions [HOL2986] at Oracle
OpenWorld 2018, you do not need to do anything, and you can skip to the next section.
Software List
MySQL Server 8.0.12 or later
Recommended: MySQL Workbench 8.0.12 or later
Data List
The sakila database: http://downloads.mysql.com/docs/sakila-db.zip - for more
information about the sakila database, see also:
https://dev.mysql.com/doc/sakila/en/.
HOL Information
The following information is useful for the hands-on lab session at Oracle OpenWorld
2018:
Windows Username lab
Windows Password HandsOn18!
MySQL Root Username root@localhost
MySQL Root Password S3cr3t
MySQL Regular User Username oracle@localhost
MySQL Regular User Password Oracle2018
MySQL Schemas sakila
Tip: You can connect to MySQL using the pre-configured connection in MySQL
Workbench.
Common Table Expressions (CTEs)
A common table expression (also commonly known as a CTE) is a named subquery. The
common table expression is defined before the main query and can then be used as if it
was a regular table. It is possible to reference a common table expression more than
once (unlike traditional MySQL subqueries) and a query can have several common table
expression. Additionally, there are in some cases a performance gain.
Common table expressions can be divided into two categories: non-recursive and
recursive. Non-recursive common table expressions are the simplest, so let’s start out
looking at those.
Non-Recursive
In the simple non-recursive form, a common table expression simply specify a subquery,
so you can reference it by name as it is a regular table inside the query. A common table
expression is defined using the WITH keyword (specify only WITH once even if you define
multiple common table expressions).
Basic Construct
As a simple example, consider a common table expression selecting the integer 1 as the
column i and making it available in the main part of the query through the name cte:
WITH cte(i) AS (SELECT 1)
SELECT * FROM cte;
Script: cte_basic.sql
The query consists of several parts:
WITH tells that you are defining a common table expression.
cte is the name of the expression and it has one column named i.
AS tells that the subquery used for the cte expression will be defined next
(between the parentheses).
Finally the main part of the query uses the cte common table expression as the
source of its data.
The output of the query is:
+---+
| i |
+---+
| 1 |
+---+
Granted, this example does not really show why you want to use common table
expressions, but it still displays all the parts of a non-recursive common table
expression. You can use this basic construct to create more complex examples.
Multiple Common Table Expressions in One Query
You can have multiple common table expressions in one query. You separate the
expressions using a comma. The expressions are processed in the order listed, so a given
expression can reference the expressions defined earlier. For example:
WITH cte_1(txt) AS (SELECT 'Hello'),
cte_2(txt) AS (SELECT CONCAT_WS(' ', cte_1.txt, 'World')
FROM cte_1)
SELECT txt FROM cte_2;
Script: cte_multiple.sql
Yes, this is a cumbersome way to create the string “Hello World”, but in some real world
queries, it can greatly simplify complex queries and can give a more linear workflow that
makes it easier to understand what the query does.
Recursive Common Table Expressions
While the non-recursive common table expressions are very useful as just shown, their
use is relatively limited. There is however also the possibility to use recursive
expressions. This is done by defining a seed query and use a UNION to define how
further rows are generated. The second part of the UNION refers to the common table
expression itself to read the previous generated row(s).
Sequences
The most basic recursive common table expression is to generate a sequence, for
example:
WITH RECURSIVE digits(i) AS
(SELECT 1
UNION
SELECT i+1
FROM digits
WHERE i < 10
)
SELECT i
FROM digits
ORDER BY i;
Script: cte_recursive_digits.sql
The recursion terminates when no rows are created, i.e. when the WHERE clause does
not match any of the previous generated rows. So, the query generates the digits 1
through 10 (the WHERE clause applies to the previous generated rows):
+----+
| i |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
+----+
Tip: Both the seed and recursive parts can themselves include UNION.
This may not seem very useful at all, but now consider a sales query. Some
days/weeks/months may not have any sales data, so what do you do if you want to
include those days/weeks/months in the report? There is an example later in the lab
where this will be demonstrated; here let’s consider how to generate the dates for
Oracle OpenWorld 2018:
WITH RECURSIVE oow_days(DayNumber, Day) AS
(SELECT 0, '2018-10-21'
UNION
SELECT DayNumber + 1, `Day` + INTERVAL 1 DAY
FROM oow_days
WHERE `Day` < '2018-10-25'
)
SELECT DayNumber, DAYNAME(`Day`) AS DayName, `Day`
FROM oow_days
ORDER BY `Day`;
Script: cte_recursive_days.sql
The result of the query is:
+-----------+-----------+------------+
| DayNumber | DayName | Day |
+-----------+-----------+------------+
| 0 | Sunday | 2018-10-21 |
| 1 | Monday | 2018-10-22 |
| 2 | Tuesday | 2018-10-23 |
| 3 | Wednesday | 2018-10-24 |
| 4 | Thursday | 2018-10-25 |
+-----------+-----------+------------+
Terminating Recursion
What happens if you forget to add a WHERE clause in the recursive part of the common
table expression? That depends. When the common table expression generates a row
that already exists in the result set, the recursion will be default stop automatically. So,
if your common table expression naturally terminates, there is not any problems. An
example is:
WITH RECURSIVE months(MonthNumber, MonthName) AS
(SELECT MONTH('2018-01-01'), MONTHNAME('2018-01-01')
UNION
SELECT MONTH('2018-01-01' + INTERVAL MonthNumber MONTH),
MONTHNAME('2018-01-01' + INTERVAL MonthNumber MONTH)
FROM months
)
SELECT *
FROM Months
ORDER BY MonthNumber;
Script: cte_recursive_months.sql
(Yes, that is a roundabout way to generate the names of the months. It is left as an
exercise to generate the list is a more straight forward way.)
The result of the query is just 12 rows with the month names:
+-------------+-----------+
| MonthNumber | MonthName |
+-------------+-----------+
| 1 | January |
| 2 | February |
| 3 | March |
| 4 | April |
| 5 | May |
| 6 | June |
| 7 | July |
| 8 | August |
| 9 | September |
| 10 | October |
| 11 | November |
| 12 | December |
+-------------+-----------+
However, in general recursive common table expression do not terminate on their own.
Consider the original sequence of digits but without the WHERE clause:
WITH RECURSIVE digits(i) AS
(SELECT 1
UNION
SELECT i+1
FROM digits
)
SELECT i
FROM digits
ORDER BY i;
Script: cte_recursive_digits_infinite.sql
This query fails with an error 3636:
ERROR: 3636: Recursive query aborted after 1001 iterations. Try
increasing @@cte_max_recursion_depth to a larger value.
MySQL will for safety fail with an error after 1000 rows have been generated. This is to
protect you from queries running indefinitely (and requiring indefinite resources to
store the result) in case you forget to add a WHERE clause that ensures the recursion is
terminated and the recursion does not terminate naturally. The limit can be changed
using the cte_max_recursion_depth option.
You have three options, if you need your expression to generate more than 1000 rows:
Increase cte_max_recursion_depth globally for all connections.
Increase cte_max_recursion_depth for the session, i.e. for that one
connection.
Increase cte_max_recursion_depth just for that one query.
Tip: If you need to generate more than 1000 rows in a recursive common table
expression, remember to increase the value of the cte_max_recursion_depth option.
This can be set at the session level, so it only affects the one session, or it can be set as
an optimizer hint to affect only that one query. It is recommended to increase the value
at as limited scope as possible.
As an example, consider a case where you need all days in a three year period. This will
generate 1095 or 1096 (depending on whether a leap year is included) rows. Since it is
expected that more than 1000 rows are generated, you can use an optimizer hint to set
the value of cte_max_recursion_depth to for example 1100:
WITH RECURSIVE days(DayDate) AS
(SELECT /*+ SET_VAR(cte_max_recursion_depth = 1100) */
'2015-01-01'
UNION
SELECT DayDate + INTERVAL 1 DAY
FROM days
WHERE DayDate < '2017-12-31'
)
SELECT *
FROM days
ORDER BY DayDate;
Script: cte_recursive_days_3_years.sql
This far the data type of the data generated by the common table expressions have
been ignored. How does that work?
Common Table Expression and Data Types
The data type of each column in the generated tables is determined by the data type of
the data that is fetched. For recursive common table expressions, it is the seed query
that determines the data type. A simple way to demonstrate this is to use a recursive
expression where the seed query generates a string and then in the recursive part to
concatenate to it:
WITH RECURSIVE x(i, str) AS
(SELECT 1, 'abc'
UNION
SELECT i+1, CONCAT(str, 'def')
FROM x
WHERE i < 2
)
SELECT *
FROM x;
Script: cte_data_type.sql
Executing this query causes an error:
ERROR: 1406: Data too long for column 'abc' at row 1
The seed query generates the string “abc” causing the data type to be CHAR(3). When
the recursive part of the query tries to concatenate “def”, there is no room for the extra
characters, and the query fails (provided strict mode is enabled; otherwise the same
string is generated again terminating the recursion and giving the wrong result).
To avoid that issue, explicitly use CAST() in the seed query to get the required data
type:
WITH RECURSIVE x(i, str) AS
(SELECT 1, CAST('abc' AS CHAR(6))
UNION
SELECT i+1, CONCAT(str, 'def')
FROM x
WHERE i < 2
)
SELECT *
FROM x;
Script: cte_data_type_fixed.sql
Common table expressions can also be used with tables. Let’s look at that.
Querying Tables
The sakila database includes the film table with information about the films that can
be rented. There is also the actor table with actors and a join table, film_actor, so it is
possible to determine which actors are in which movies.
Let’s say you want to find all movies starring the actor Sidney Crowe (the movies and
actors included are not from the real world) and not only list the film id and title, but
also all of the actors included. This sounds like a simple query, but it quickly becomes
complex unless you break it up. A recursive common table expression is a nice way to
break it up into parts:
WITH RECURSIVE films (film_id, title, first_name, last_name) AS
( -- what movies did Sidney Crowe work in
SELECT f.film_id, f.title, a.first_name, a.last_name
FROM sakila.film f
INNER JOIN sakila.film_actor fa USING (film_id)
INNER JOIN sakila.actor a USING (actor_id)
WHERE a.first_name = 'SIDNEY' AND a.last_name='CROWE'
UNION -- What movies did they then work in
SELECT f.film_id, f.title, a.first_name, a.last_name
FROM films f
INNER JOIN sakila.film_actor fa USING (film_id)
INNER JOIN sakila.actor a USING (actor_id)
)
SELECT film_id, title,
GROUP_CONCAT(
CONCAT(first_name, ' ', last_name)
ORDER BY last_name, first_name
SEPARATOR ', '
) AS Actors
FROM films
GROUP BY film_id, title;
Script: cte_recursive_actors.sql
The seed part of the common table expression finds all movies starring Sidney Crowe.
The recursive part finds all of the other actors in each of the movies. This is an example
where the seed query generates more than one row – that is perfectly fine; the
recursive query will apply to all of the rows found by the seed query. There is no need
for a WHERE clause in this case as the second iteration will generate the same rows as the
first iteration and thus cause the recursion to terminate as by default UNION DISTINCT
is used.
With the films and actors available in the films common table expression, it is easy to
aggregate the result. The query returns 34 films of which the two first and the last are
(the vertical output format has been used here due to the limitation of the page width):
*************************** 1. row ***************************
film_id: 12
title: ALASKA PHANTOM
Actors: VAL BOLGER, SIDNEY CROWE, SYLVESTER DERN, ALBERT JOHANSSON,
GENE MCKELLEN, BURT POSEY, JEFF SILVERSTONE
*************************** 2. row ***************************
film_id: 15
title: ALIEN CENTER
Actors: SIDNEY CROWE, BURT DUKAKIS, MENA HOPPER, KENNETH PALTROW,
RENEE TRACY, HUMPHREY WILLIS
…
*************************** 34. row ***************************
film_id: 994
title: WYOMING STORM
Actors: HARRISON BALE, SIDNEY CROWE, WOODY HOFFMAN, PENELOPE MONROE,
BETTE NICHOLSON, JULIA ZELLWEGER
Feel free to play more with common table expressions. There will be more examples
while discussing window functions. For now, let’s take a look at window functions.
Window Functions
Window functions are analytical functions that can be used to obtain derived values
from the data in a query. You can obtain all of this by handling the result in your
application, however by using window functions you can keep the logic of the query and
analysis together. Choose which works best for your separation between database and
application logic.
Window functions are related to aggregate functions. The difference is that where an
aggregate function works on the total result (using the fields in the GROUP BY clause to
know how to group the results of the functions), a window function works on a subset
of rows (called the window frame).
There are a number of functions specific for use as window functions (see also
https://dev.mysql.com/doc/refman/en/window-function-descriptions.html):
CUME_DIST(): Cumulative distribution value
DENSE_RANK(): Rank of current row within its partition, without gaps
FIRST_VALUE(): Value of argument from first row of window frame
LAG(): Value of argument from row lagging current row within partition
LAST_VALUE(): Value of argument from last row of window frame
LEAD(): Value of argument from row leading current row within partition
NTH_VALUE(): Value of argument from N-th row of window frame
NTILE(): Bucket number of current row within its partition
PERCENT_RANK(): Percentage rank value
RANK(): Rank of current row within its partition, with gaps
ROW_NUMBER(): Number of current row within its partition
Additionally, several of the standard aggregate functions can be used (see
https://dev.mysql.com/doc/refman/en/group-by-functions.html for details):
AVG(): Return the average value of the argument
BIT_AND(): Return bitwise AND
BIT_OR(): Return bitwise OR
BIT_XOR(): Return bitwise XOR
COUNT(): Return a count of the number of rows returned
MAX(): Return the maximum value
MIN(): Return the minimum value
STDDEV_POP(), STDDEV(), STD(): Return the population standard deviation
STDDEV_SAMP(): Return the sample standard deviation
SUM(): Return the sum
VAR_POP(), VARIANCE(): Return the population standard variance
VAR_SAMP(): Return the sample variance
In the execution order, window functions are evaluated after HAVING and before ORDER
BY. A window function can be included as a field in ORDER BY but not in GROUP BY.
Syntax
There are two different ways to use window function (full details in
https://dev.mysql.com/doc/refman/en/window-functions.html):
Inline:
SELECT …
function() OVER [partition_clause]
[order_clause]
[frame_clause]
…
Named windows:
SELECT …
function() OVER window_name
…
WINDOW window_name AS ([partition_clause]
[order_clause]
[frame_clause]) Named windows allows you to re-use a window definition which can make it
easier to understand what the query is doing.
There are three parts to the window definition:
partition_clause: What to partition by – this is similar to a GROUP BY clause.
order_clause: What to order by – this is similar to an ORDER BY clause.
frame_clause: What the frame consists of; essentially the number of rows to
include for the window frame before and after the current row. For windows
with an order clause, the default frame is all rows previous in the partition;
without an order clause the default frame is all rows in the partition.
Not all parts have a meaning for all functions. For example, the ROW_NUMBER() function
will always apply to a single row irrespective of the frame clause.
As a trivial example to show the difference between the two syntaxes, consider the
following example where the cities in France and United Kingdom are found. The inline
version of the query is:
SELECT city_id, city, country_id,
ROW_NUMBER() OVER(PARTITION BY country_id) AS RowNumber
FROM sakila.city
WHERE country_id IN (34, 102)
ORDER BY city;
Script: window_country_inline.sql
The version using a named window is:
SELECT city_id, city, country_id,
ROW_NUMBER() OVER w_country AS RowNumber
FROM sakila.city
WHERE country_id IN (34, 102)
WINDOW w_country AS (PARTITION BY country_id)
ORDER BY city;
Script: window_country_named.sql
These two queries do the same thing, so you can choose the syntax that suits you the
best, but remember that a named window can be used multiple times in the same
query. The result of the queries is:
+---------+-----------------+------------+-----------+
| city_id | city | country_id | RowNumber |
+---------+-----------------+------------+-----------+
| 88 | Bradford | 102 | 1 |
| 92 | Brest | 34 | 1 |
| 149 | Dundee | 102 | 2 |
| 297 | Le Mans | 34 | 2 |
| 312 | London | 102 | 3 |
| 494 | Southampton | 102 | 4 |
| 495 | Southend-on-Sea | 102 | 5 |
| 496 | Southport | 102 | 6 |
| 500 | Stockport | 102 | 7 |
| 543 | Toulon | 34 | 3 |
| 544 | Toulouse | 34 | 4 |
| 589 | York | 102 | 8 |
+---------+-----------------+------------+-----------+
Introductory Examples
As an example exploring window functions some more consider the following example.
The recursive common table expression is used to create 10 rows where val1 for the
first five rows (id = 1-5) is “abc” and for the last five rows (id = 6-10) is “def”. The val2
column contains a random integer between 0 and 9.
The window functions work at two scopes: global, i.e. across all rows, and partitioned by
val1. The complete example is:
WITH RECURSIVE rand_data(id, val1, val2) AS
(SELECT 1, 'abc', FLOOR(RAND()*10)
UNION
SELECT id+1, IF(id < 5, 'abc', 'def'), FLOOR(RAND()*10)
FROM rand_data
WHERE id < 10
)
SELECT id, val1, val2,
SUM(val2) OVER() AS Total,
SUM(val2) OVER w_all_id AS RunningTotal,
ROUND(AVG(val2) OVER w_all_running, 2) AS RunningAverage,
RANK() OVER w_all_val2 AS 'Rank',
SUM(val2) OVER w_val1_id AS Val1RunningTotal,
ROUND(AVG(val2) OVER w_val1_running, 2) AS Val1RunningAverage,
RANK() OVER w_val1_val2 AS Val1Rank
FROM rand_data
WINDOW w_all_id AS (ORDER BY id),
w_all_val2 AS (ORDER BY val2 DESC),
w_all_running AS (ORDER BY id
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING),
w_val1_id AS (PARTITION BY val1
ORDER BY id
ROWS UNBOUNDED PRECEDING),
w_val1_val2 AS (PARTITION BY val1
ORDER BY val2 DESC
ROWS UNBOUNDED PRECEDING),
w_val1_running AS (PARTITION BY val1
ORDER BY id
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
ORDER BY id;
Script: window_rand_data.sql
For the w_val1_id and w_val1_val2 windows, the ROW UNBOUNDED PRECEDING is not
needed as it is being set to the default (for windows without an ORDER BY the default is
ROW UNBOUNDED PRECEDING ROW UNBOUNDED FOLLOWING), but it is OK to include it
anyway and shows how to specify an unbounded frame.
Try to execute the query and notice the difference between the “global” scope windows
(no PARTITION BY clause) and the windows partitioning by val1. An example out is
(since the data is random, the output will differ for each execution):
The Total column (SUM(val2) OVER() AS Total) is the same you would have gotten if
you did the sum grouping all rows into a single row. The global frames span all rows
whereas the frames partitioned by val1 reset when the value of val1 changes.
The rest of the lab is dedicated to exploring how common table expressions and window
functions can be used.
Sales Query
Non-CTE
A more realistic query is to check sales data per month and calculate the change
compared to the previous month. Sales per month can be found as:
SELECT DATE_FORMAT(r.rental_date, '%Y-%m-01') AS FirstOfMonth,
SUM(p.amount) as Sales
FROM sakila.payment p
INNER JOIN sakila.rental r USING (rental_id)
GROUP BY FirstOfMonth;
Script: monthly_sales.sql
Which gives:
+--------------+----------+
| FirstOfMonth | Sales |
+--------------+----------+
| 2005-05-01 | 4823.44 |
| 2005-06-01 | 9629.89 |
| 2005-07-01 | 28368.91 |
| 2005-08-01 | 24070.14 |
| 2006-02-01 | 514.18 |
+--------------+----------+
By joining on the query itself, it is possible to include the change in sales from the
previous month. In MySQL 5.7 and earlier this could be done using subqueries in the
FROM part, like:
SELECT YEAR(cur.FirstOfMonth) AS 'Year',
MONTHNAME(cur.FirstOfMonth) AS 'Month',
cur.Sales, (cur.Sales - IFNULL(prev.Sales, 0)) AS Delta
FROM (
SELECT DATE_FORMAT(r.rental_date, '%Y-%m-01') AS FirstOfMonth,
SUM(p.amount) as Sales
FROM sakila.payment p
INNER JOIN sakila.rental r USING (rental_id)
GROUP BY FirstOfMonth
) cur
LEFT OUTER JOIN (
SELECT DATE_FORMAT(r.rental_date, '%Y-%m-01') AS FirstOfMonth,
SUM(p.amount) as Sales
FROM sakila.payment p
INNER JOIN sakila.rental r USING (rental_id)
GROUP BY FirstOfMonth
) prev ON prev.FirstOfMonth = cur.FirstOfMonth - INTERVAL 1
MONTH
ORDER BY cur.FirstOfMonth;
Script: non-cte_monthly_sales.sql
While this gives the correct result, it has some drawbacks:
The logic is duplicated
It is hard to read
The subquery is executed twice meaning execution and storage (of the subquery
result) overhead.
It is error prone – when you change the business logic, you have to remember to
update both subqueries.
A better approach is to use a common table expression.
Using CTE
The subquery in the sales query can be rewritten as a common table expression and
then used twice to iterate over the current and previous months and get the sales for
the current month and the difference to the previous month:
WITH monthly_sales(FirstOfMonth, Sales) AS
(SELECT DATE_FORMAT(r.rental_date, '%Y-%m-01') AS FirstOfMonth,
SUM(p.amount) as Sales
FROM sakila.payment p
INNER JOIN sakila.rental r USING (rental_id)
GROUP BY FirstOfMonth
)
SELECT YEAR(cur.FirstOfMonth) AS 'Year',
MONTHNAME(cur.FirstOfMonth) AS 'Month',
cur.Sales, (cur.Sales - IFNULL(prev.Sales, 0)) AS Delta
FROM monthly_sales cur
LEFT OUTER JOIN monthly_sales prev
ON prev.FirstOfMonth = cur.FirstOfMonth - INTERVAL 1 MONTH
ORDER BY cur.FirstOfMonth;
Script: cte_monthly_sales.sql
Note: This assumes that the sale is 0 in months where there are no sales data.
Without a common table expression, it would have been necessary to write the
subquery twice. Additionally, the common table expression helps separate the various
parts of the query: it becomes clearer how the sales per month is calculated and that
the source data of the current and previous months is the same. So, not only will this
perform better than in-lining the subquery (it’s executed once instead of twice), it is less
error prone, and easier to read.
The result of the query is:
+------+----------+----------+----------+
| Year | Month | Sales | Delta |
+------+----------+----------+----------+
| 2005 | May | 4823.44 | 4823.44 |
| 2005 | June | 9629.89 | 4806.45 |
| 2005 | July | 28368.91 | 18739.02 |
| 2005 | August | 24070.14 | -4298.77 |
| 2006 | February | 514.18 | 514.18 |
+------+----------+----------+----------+
A third possibility is to add the use of a window function.
Using CTE and Window Function
There is a window function that can simplify the query further. The LAG() function
(https://dev.mysql.com/doc/refman/8.0/en/window-function-
descriptions.html#function_lag) takes a value from an earlier row (the default is the
previous row); this is an example of a non-aggregate window function. Using this, it is
only necessary to reference the monthly sales once:
WITH monthly_sales(FirstOfMonth, Sales) AS
(SELECT DATE_FORMAT(r.rental_date, '%Y-%m-01') AS FirstOfMonth,
SUM(p.amount) as Sales
FROM sakila.payment p
INNER JOIN sakila.rental r USING (rental_id)
GROUP BY FirstOfMonth
)
SELECT YEAR(FirstOfMonth) AS 'Year',
MONTHNAME(FirstOfMonth) AS 'Month',
Sales,
(Sales - LAG(Sales, 1, 0) OVER (ORDER BY FirstOfMonth)) AS Delta
FROM monthly_sales
ORDER BY FirstOfMonth;
Script: window_monthly_sales_1.sql
This almost works as expected – before discussing what the problem with the query is,
let’s look at the result:
+------+----------+----------+-----------+
| Year | Month | Sales | Delta |
+------+----------+----------+-----------+
| 2005 | May | 4823.44 | 4823.44 |
| 2005 | June | 9629.89 | 4806.45 |
| 2005 | July | 28368.91 | 18739.02 |
| 2005 | August | 24070.14 | -4298.77 |
| 2006 | February | 514.18 | -23555.96 |
+------+----------+----------+-----------+
For the months in 2005 this returns the same result as for the original query (in
cte_monthly_sales.sql), but what happens for February 2006? The LAG() function
uses the previous row – not the previous month. One solution is to add a recursive
common table expression to ensure we include all months even if there are no sales
data. This can be done as follows:
WITH RECURSIVE months(FirstOfMonth) AS
(SELECT '2005-05-01'
UNION
SELECT FirstOfMonth + INTERVAL 1 MONTH
FROM months
WHERE FirstOfMonth < '2006-02-01'
),
monthly_sales(FirstOfMonth, Sales) AS
(SELECT DATE_FORMAT(r.rental_date, '%Y-%m-01') AS FirstOfMonth,
SUM(p.amount) as Sales
FROM sakila.payment p
INNER JOIN sakila.rental r USING (rental_id)
GROUP BY FirstOfMonth
)
SELECT YEAR(FirstOfMonth) AS 'Year',
MONTHNAME(FirstOfMonth) AS 'Month',
IFNULL(Sales, 0) AS Sales,
(IFNULL(Sales, 0) - LAG(IFNULL(Sales, 0), 1, 0)
OVER (ORDER BY FirstOfMonth)) AS Delta
FROM months
LEFT OUTER JOIN monthly_sales USING (FirstOfMonth)
ORDER BY FirstOfMonth;
Script: window_monthly_sales_2.sql
Now the query will include all months between May 2005 and February 2006 (both
included):
+------+-----------+----------+-----------+
| Year | Month | Sales | Delta |
+------+-----------+----------+-----------+
| 2005 | May | 4823.44 | 4823.44 |
| 2005 | June | 9629.89 | 4806.45 |
| 2005 | July | 28368.91 | 18739.02 |
| 2005 | August | 24070.14 | -4298.77 |
| 2005 | September | 0.00 | -24070.14 |
| 2005 | October | 0.00 | 0.00 |
| 2005 | November | 0.00 | 0.00 |
| 2005 | December | 0.00 | 0.00 |
| 2006 | January | 0.00 | 0.00 |
| 2006 | February | 514.18 | 514.18 |
+------+-----------+----------+-----------+
Which of the monthly sales examples gives the right result depends on the question you
want to answer and your business requirements.
Best Customer per Store
A common query pattern is to find the data associated with the largest or smallest value
in a set. For example: “what is the largest city in each country”, “which sales
representative sold most for each region”, etc.
Consider the sakila database, it could be interesting to know which customer spends
the most money for each store. Let’s first see how to do this without using common
table expressions or window functions.
Non-CTE
There are a few ways to find that without using common table expressions, one is:
SELECT c.store_id, c.customer_id, c.first_name, c.last_name, c.Amount
FROM (SELECT c1.store_id, c1.customer_id,
c1.first_name, c1.last_name,
SUM(p1.amount) AS Amount
FROM sakila.customer c1
INNER JOIN sakila.payment p1 USING (customer_id)
GROUP BY c1.store_id, c1.customer_id
) c
INNER JOIN (
SELECT store_id, MAX(Amount) AS MaxStoreAmount
FROM (SELECT store_id, customer_id, SUM(p2.amount) AS Amount
FROM sakila.customer c2
INNER JOIN sakila.payment p2 USING (customer_id)
GROUP BY store_id, customer_id
) tmp
GROUP BY store_id
) s ON s.store_id = c.store_id AND s.MaxStoreAmount = c.Amount;
Script: best_customer_non-cte.sql
This is not exactly easy to follow. The first subquery (c) finds how much money each
customer has spent and which store the customer is associated with. The second
subquery has two parts – and outer and inner part. The inner part again finds the
amount spent for each customer; the outer then finds the maximum amount for each
store.
The result is:
+----------+-------------+------------+-----------+--------+
| store_id | customer_id | first_name | last_name | Amount |
+----------+-------------+------------+-----------+--------+
| 1 | 148 | ELEANOR | HUNT | 216.54 |
| 2 | 526 | KARL | SEAL | 221.55 |
+----------+-------------+------------+-----------+--------+
Changing the subqueries to common table expressions is the first way to improve the
query.
Using CTE
When the same query is re-written to use a common table expression, there is no need
to duplicate the subquery finding the amount of money spent per customer:
WITH customer_amount (store_id, customer_id,
first_name, last_name, Amount) AS
(SELECT c.store_id, c.customer_id,
c.first_name, c.last_name, SUM(p.amount)
FROM sakila.customer c
INNER JOIN sakila.payment p USING (customer_id)
GROUP BY store_id, customer_id
),
store_amount (store_id, MaxStoreAmount) AS
(SELECT store_id, MAX(Amount)
FROM customer_amount
GROUP BY store_id
)
SELECT ca.store_id, ca.customer_id,
ca.first_name, ca.last_name, ca.Amount
FROM customer_amount ca
INNER JOIN store_amount sa ON sa.store_id = ca.store_id
AND sa.MaxStoreAmount = ca.Amount;
Script: best_customer_cte.sql
However, this can be improved even further.
Using CTE and Window Function
When adding the use of window functions, you can use the RANK() function partitioning
by the store and ordering by the sales amount descending. Doing that, the customer
spending the most money per store can simply be found looking for the rows where
RANK() returns 1:
WITH customer_amount (store_id, customer_id,
first_name, last_name, Amount, `Rank`) AS
(SELECT c.store_id, c.customer_id,
c.first_name, c.last_name, SUM(p.amount),
RANK() OVER(
PARTITION BY c.store_id
ORDER BY SUM(p.amount) DESC
) AS 'Rank'
FROM sakila.customer c
INNER JOIN sakila.payment p USING (customer_id)
GROUP BY store_id, customer_id
)
SELECT store_id, customer_id,
first_name, last_name, Amount
FROM customer_amount
WHERE `Rank` = 1;
Script: best_customer_cte_window.sql
Now there is just one subquery and it is trivial to change the query for example to return
the top three customers per store.
Fun
If you would like to have some fun with common table expressions, then the following
query generates the Mandelbrot set as ASCII-art:
WITH RECURSIVE x(i) AS (
SELECT CAST(0 AS DECIMAL(13, 10))
UNION ALL
SELECT i + 1
FROM x
WHERE i < 101
),
Z(Ix, Iy, Cx, Cy, X, Y, I) AS (
SELECT Ix, Iy, X, Y, X, Y, 0
FROM (SELECT CAST(-2.2 + 0.031 * i AS DECIMAL(13, 10)) AS X,
i AS Ix FROM x) AS xgen
CROSS JOIN (
SELECT CAST(-1.5 + 0.031 * i AS DECIMAL(13, 10)) AS Y,
i AS iY FROM x
) AS ygen
UNION ALL
SELECT Ix, Iy, Cx, Cy,
CAST(X * X - Y * Y + Cx AS DECIMAL(13, 10)) AS X,
CAST(Y * X * 2 + Cy AS DECIMAL(13, 10)), I + 1
FROM Z
WHERE X * X + Y * Y < 16.0
AND I < 27
),
Zt (Ix, Iy, I) AS (
SELECT Ix, Iy, MAX(I) AS I
FROM Z
GROUP BY Iy, Ix
ORDER BY Iy, Ix
)
SELECT GROUP_CONCAT(
SUBSTRING(
' .,,,-----++++%%%%@@@@#### ',
GREATEST(I, 1),
1
) ORDER BY Ix SEPARATOR ''
) AS 'Mandelbrot Set'
FROM Zt
GROUP BY Iy
ORDER BY Iy;
Script: mandelbrot_set.sql – Based on: https://wiki.postgresql.org/wiki/Mandelbrot_set
Not at all useful, but it does show the power of common table expressions. This is best
executed in MySQL Shell or the mysql command-line client.
This concludes the guided part of the lab. You are encouraged to continue to playing
with common table expressions and window functions.
References
If you want to learn more about the MySQL Document Store, the following references
are useful:
WITH Syntax (Common Table Expressions):
https://dev.mysql.com/doc/refman/en/with.html
Window Functions: https://dev.mysql.com/doc/refman/en/window-
functions.html
Aggregate (GROUP BY) Function Descriptions:
https://dev.mysql.com/doc/refman/en/group-by-functions.html
MySQL Workbench: https://dev.mysql.com/doc/workbench/en/
Blogs:
o MySQL 8.0 Labs: [Recursive] Common Table Expressions in MySQL
(CTEs): http://mysqlserverteam.com/mysql-8-0-labs-recursive-common-
table-expressions-in-mysql-ctes/
o MySQL 8.0 Labs: [Recursive] Common Table Expressions in MySQL
(CTEs), Part Two – how to generate series:
http://mysqlserverteam.com/mysql-8-0-labs-recursive-common-table-
expressions-in-mysql-ctes-part-two-how-to-generate-series/
o MySQL 8.0 Labs: [Recursive] Common Table Expressions in MySQL
(CTEs), Part Three – hierarchies: http://mysqlserverteam.com/mysql-8-0-
labs-recursive-common-table-expressions-in-mysql-ctes-part-three-
hierarchies/
o http://mysqlserverteam.com/mysql-8-0-1-recursive-common-table-
expressions-in-mysql-ctes-part-four-depth-first-or-breadth-first-traversal-
transitive-closure-cycle-avoidance/ http://mysqlserverteam.com/mysql-
8-0-1-recursive-common-table-expressions-in-mysql-ctes-part-four-
depth-first-or-breadth-first-traversal-transitive-closure-cycle-avoidance/