
Billion Goods in Few Categories
How Histograms Save a Life?

Sveta Smirnova
Percona

• Introduction
• The Use Case
  • The Cardinality: Two Levels
  • Example
• Why the Difference?
• Even Worse Use Case
  • ANALYZE TABLE Limitations
  • Example
• How Histograms Work?
• Left Overs

Table of Contents

The column_statistics data dictionary table stores histogram statistics about column values, for use by the optimizer in constructing query execution plans

MySQL User Reference Manual

Optimizer Statistics aka Histograms

• MySQL Support engineer
• Author of
  • MySQL Troubleshooting
  • JSON UDF functions
  • FILTER clause for MySQL
• Speaker
  • Percona Live, OOW, Fosdem, DevConf, HighLoad...

Sveta Smirnova

Introduction

• Hardware
• Wise options
• Optimized queries
• Brain

Everything can Be Resolved!

• This talk is about
  • How I spent the last three years
  • Resolving the same issue
  • For different customers
• Task was to speed up the query

Not Everything


• Specific data distribution
• Access on different fields
  • ON goods.shop_id = shop.id
  • WHERE shop.location IN (...)
  • GROUP BY goods.category, shop.profile
  • ORDER BY shop.distance, goods.quantity
• Index cannot be used effectively

Not All the Queries Can be Optimized

• Data distribution varies
  • Big difference between number of values
    Red 1,000,000 / Green 2 / Blue 100,000
  • Constantly changing
    Red 100,000 / Green 1,000,000 / Blue 10
    Red 1,000 / Green 2,000 / Blue 50,000
• Cardinality is not correct
  • Was not updated in time
  • Updates too often
  • Calculated wrongly
• Index maintenance is expensive
  • Hardware resources
  • Slow updates
  • Window to run CREATE INDEX
• Optimizer does not work as we wish it
  • Examples in my talk @Percona Live Frankfurt

Latest Support Tickets

• Topic based on real Support cases
  • Couple of them are still in progress
• All examples are 100% fake
  • They are created so that
    • No customer can be identified
    • Everything generated: table names, column names, data
  • Use case itself is fictional
• All examples are simplified
  • Only columns required to show the issue
  • Everything extra removed
  • Real tables usually store much more data
• All disasters happened with version 5.7

Disclaimer

The Use Case

• categories
  • Less than 20 rows
• goods
  • More than 1M rows
  • 20 unique cat_id values
  • Many other fields
    • Price
    • Date: added, last updated, etc.
    • Characteristics
    • Store
    • ...

Two Tables


select *
from goods
join categories
on (categories.id=goods.cat_id)
where date_added between '2018-07-01' and '2018-08-01'
and cat_id in (16,11)
and price >= 1000 and price <= 10000 [ and ... ]
[ GROUP BY ... [ORDER BY ... [ LIMIT ...]]]

• Select from the small table
• For each cat_id select from the large table
• Filter result on date_added[ and price[...]]
• Slow with many items in the category

Option 1: Select from the Small Table First


Option 1: Illustration

• Filter rows by date_added[ and price[...]]
• Get cat_id values
• Retrieve rows from the small table
• Slow if number of rows, filtered by date_added, is larger than number of goods in the selected categories

Option 2: Select From the Large Table First


Option 2: Illustration
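The trade-off between the two options can be sketched as a comparison of how many rows of the large table each plan has to read. This is only a rough model: the category sizes and the date-range row count are hypothetical numbers, and the sketch assumes option 1 reads goods through an index on cat_id while option 2 reads them through an index on date_added.

```python
def rows_option1(goods_per_category, selected_categories):
    """Small table first: every row of each selected category is read from goods."""
    return sum(goods_per_category[c] for c in selected_categories)


def rows_option2(rows_in_date_range):
    """Large table first: every row in the date range is read, whatever its category."""
    return rows_in_date_range


def cheaper_option(goods_per_category, selected_categories, rows_in_date_range):
    """Pick the plan that reads fewer rows from the large table."""
    opt1 = rows_option1(goods_per_category, selected_categories)
    opt2 = rows_option2(rows_in_date_range)
    return "option 1" if opt1 <= opt2 else "option 2"


# Hypothetical, heavily skewed distribution of goods per category.
goods_per_category = {11: 2, 16: 1_000_000, 3: 100_000}

print(cheaper_option(goods_per_category, [11], 50_000))      # option 1
print(cheaper_option(goods_per_category, [11, 16], 50_000))  # option 2
```

With skewed data the better plan flips depending on which categories are requested, so no single fixed join order is right for every query.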

• CREATE INDEX index_everything ON goods (cat_id, date_added[, price[, ...]])
• It resolves the issue
• But not in all cases

What if We use Combined Indexes?


• Maintenance cost
  • Slower INSERT/UPDATE/DELETE
  • Disk space
• Index not useful for selecting rows
    JOIN categories ON (categories.id=goods.cat_id)
    JOIN shops ON (shops.id=goods.shop_id)
    [ JOIN ... ]
    date_added between '2018-07-01' and '2018-08-01'
    cat_id in (16,11) AND price >= 1000 AND price <= 10000 [ AND ... ]
    GROUP BY product_type
    ORDER BY date_updated DESC
    LIMIT 50,100
• Tables may have wrong cardinality

The Problem

The Use Case
The Cardinality: Two Levels

The Query → Parser → Optimizer → Storage Engine

MySQL Architecture

• Optimizer
• Engine
  • MyRocks
  • InnoDB
  • Any

MySQL Has a Layered Architecture

• Number of unique values in the index
• Optimizer uses it for the query execution plan
• Example
  • ID: 1,2,3,4,5
    • Number of rows: 5
    • Cardinality: 5
  • Gender: m,f,f,f,f,m,m,m,m,m,m,f,f,m,f,m,f
    • Number of rows: 17
    • Cardinality: 2

Cardinality
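The two examples above can be checked with a few lines of Python. This is just an illustration of the definition (number of distinct values), not of how any storage engine computes it.

```python
def cardinality(values):
    """Number of unique values in a column or index."""
    return len(set(values))


ids = [1, 2, 3, 4, 5]
genders = "m,f,f,f,f,m,m,m,m,m,m,f,f,m,f,m,f".split(",")

print(len(ids), cardinality(ids))          # 5 5
print(len(genders), cardinality(genders))  # 17 2
```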

• Stores statistics on disk
  • mysql.innodb_table_stats
  • mysql.innodb_index_stats
• Returns statistics to Optimizer
  • In ha_innobase::info
  • handler/ha_innodb.cc
• When the table is opened
  • flag = HA_STATUS_CONST
  • Reads data from disk
  • Stores it in memory
• Subsequent table accesses
  • flag = HA_STATUS_VARIABLE
  • Statistics from memory
  • Up to date Primary Key data

InnoDB: Overview

• Table created with option STATS_AUTO_RECALC = 0

• Before ANALYZE TABLE
mysql> show index from test\G
*************************** 2. row ***************************
Table: test
Non_unique: 1
Key_name: f1
Seq_in_index: 1
Column_name: f1
Collation: A
Cardinality: 64

• After ANALYZE TABLE
mysql> show index from test\G
*************************** 2. row ***************************
Table: test
Non_unique: 1
Key_name: f1
Seq_in_index: 1
Column_name: f1
Collation: A
Cardinality: 2

• After inserting rows
mysql> show index from test\G
*************************** 2. row ***************************
Table: test
Non_unique: 1
Key_name: f1
Seq_in_index: 1
Column_name: f1
Collation: A
Cardinality: 16

• After restart
mysql> show index from test\G
*************************** 2. row ***************************
Table: test
Non_unique: 1
Key_name: f1
Seq_in_index: 1
Column_name: f1
Collation: A
Cardinality: 2

InnoDB: Flow

• Takes data from the engine
  • Class ha_statistics
  • sql/handler.h
• Does not have Cardinality field at all
• Uses formula to calculate Cardinality

Optimizer: Overview

• n_rows: number of rows in the table
  • Naturally up to date
  • Constantly changing!
• rec_per_key: number of duplicates per key
  • Calculated by InnoDB at the time of ANALYZE
  • rec_per_key = n_rows / unique values
  • Does not change!
• Cardinality = n_rows / rec_per_key

Optimizer: Formula
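A small sketch of this formula shows why Cardinality drifts once rows are added after ANALYZE. It is a simplified model of the slide's arithmetic, not the server's code, and the row counts (1,000 rows at ANALYZE time, 8,000 later) are hypothetical values chosen so that the result goes from 2 to 16, as in the SHOW INDEX flow shown earlier.

```python
def rec_per_key(n_rows_at_analyze, unique_values):
    """Average number of rows per key value, frozen at ANALYZE time."""
    return n_rows_at_analyze / unique_values


def optimizer_cardinality(n_rows_now, rpk):
    """Cardinality as derived from the current row count and the frozen rec_per_key."""
    return n_rows_now / rpk


# Hypothetical table: 1,000 rows and 2 unique values when ANALYZE TABLE ran.
rpk = rec_per_key(1_000, 2)

print(optimizer_cardinality(1_000, rpk))  # 2.0, right after ANALYZE
print(optimizer_cardinality(8_000, rpk))  # 16.0, after inserts with no new values
```

The number of unique values never changed in this sketch, yet the derived Cardinality grew eightfold together with n_rows.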

• Engine stores persistent statistics
  InnoDB
    Storage: Tables
    Statistics: As Calculated
    Row Count: Only in Memory
• Optimizer calculates Cardinality every time it accesses engine statistics
• Weak user control

Persistent Statistics Are Not Persistent

The Use Case
Example

• EXPLAIN without histograms
mysql> explain select goods.* from goods
-> join categories on (categories.id=goods.cat_id)
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between '2000-01-01' and '2001-01-01' -- Large range
-> order by goods.cat_id
-> limit 10\G -- We ask for 10 rows only!

Example

• EXPLAIN without histograms
*************************** 1. row ***************************

select_type: SIMPLE

table: categories -- Small table first

partitions: NULL

type: index

possible_keys: PRIMARY

key: PRIMARY

key_len: 4

ref: NULL

rows: 20

filtered: 70.00

Extra: Using where; Using index; Using temporary; Using filesort

Example

• EXPLAIN without histograms
*************************** 2. row ***************************

select_type: SIMPLE

table: goods -- Large table

partitions: NULL

type: ref

possible_keys: cat_id_2

key: cat_id_2

key_len: 5

ref: orig.categories.id

rows: 51827

filtered: 11.11 -- Default value

Extra: Using where

2 rows in set, 1 warning (0.01 sec)

Example

• Execution time without histograms
mysql> flush status;
Query OK, 0 rows affected (0.00 sec)
mysql> select goods.* from goods
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between '2000-01-01' and '2001-01-01'
-> limit 10;

ab9f9bb7bc4f357712ec34f067eda364 -

10 rows in set (56.47 sec)

Example

• Engine statistics without histograms
mysql> show status like 'Handler%';

+----------------------------+--------+

| Variable_name | Value |

+----------------------------+--------+

| Handler_read_next | 964718 |

| Handler_read_prev | 0 |

| Handler_read_rnd | 10 |

| Handler_read_rnd_next | 951671 |

| Handler_write | 951670 |

+----------------------------+--------+

Example

• Now let's add the histogram
mysql> analyze table goods update histogram on date_added;
+------------+-----------+----------+-------------------------------------------------------+
| Table      | Op        | Msg_type | Msg_text                                              |
+------------+-----------+----------+-------------------------------------------------------+
| orig.goods | histogram | status   | Histogram statistics created for column 'date_added'. |
+------------+-----------+----------+-------------------------------------------------------+

1 row in set (2.01 sec)

Example

• EXPLAIN with the histogram
mysql> explain select goods.* from goods
-> join categories
-> on (categories.id=goods.cat_id)
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between '2000-01-01' and '2001-01-01'
-> limit 10\G

Example

• EXPLAIN with the histogram
*************************** 1. row ***************************

select_type: SIMPLE

table: goods -- Large table first

partitions: NULL

type: index

possible_keys: cat_id_2

key: cat_id_2

key_len: 5

ref: NULL

rows: 10 -- Same as we asked

filtered: 98.70 -- True numbers

Extra: Using where

Example

• EXPLAIN with the histogram
*************************** 2. row ***************************

select_type: SIMPLE

table: categories -- Small table

partitions: NULL

type: eq_ref

possible_keys: PRIMARY

key: PRIMARY

key_len: 4

ref: orig.goods.cat_id

rows: 1

filtered: 100.00

Extra: Using index

Example

• Execution time with the histogram
mysql> flush status;
mysql> select goods.* from goods
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between '2000-01-01' and '2001-01-01'
-> limit 10;
eeb005fae0dd3441c5c380e1d87fee84 -
10 rows in set (0.00 sec) -- 56/0 times faster!

Example

• Engine statistics with the histogram
mysql> show status like 'Handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 4     |
| Handler_mrr_init           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 1     |
| Handler_read_key           | 3     |
| Handler_read_last          | 0     |
| Handler_read_next          | 9     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
18 rows in set (0.00 sec)

Example

Why the Difference?

[Figure: Indexes: Number of Items with Same Value]

[Figure: Indexes: Cardinality]

[Figure: Histograms: Number of Values in Each Bucket]

[Figure: Histograms: Data in the Histogram]

Even Worse Use Case

Even Worse Use Case
ANALYZE TABLE Limitations

• ANALYZE TABLE often
• Use large number of STATS_SAMPLE_PAGES

Solutions in 5.7-

• Counts number of pages in the table
• Takes STATS_SAMPLE_PAGES
• Counts number of unique values in secondary index in these pages
• Divides number of pages in the table by number of sample pages and multiplies result by number of unique values

How ANALYZE TABLE Works with InnoDB?

• Number of pages in the table: 20,000
• STATS_SAMPLE_PAGES: 20 (default)
• Unique values in the secondary index:
  • In sample pages: 10
  • In the table: 11
• Cardinality: 20,000 * 10 / 20 = 10,000

Example


• Number of pages in the table: 20,000
• STATS_SAMPLE_PAGES: 5,000
• Unique values in the secondary index:
  • In sample pages: 10
  • In the table: 11
• Cardinality: 20,000 * 10 / 5,000 = 40

Example 2
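Both examples follow the same arithmetic, which can be written as a one-line function. This sketch models only the simplified extrapolation described on the slides (pages in the table, divided by sample pages, multiplied by unique values seen in the sample); it is not a reimplementation of InnoDB's sampling code.

```python
def estimated_cardinality(total_pages, sample_pages, unique_in_sample):
    """Extrapolate the unique values seen in the sample to the whole table."""
    return total_pages * unique_in_sample / sample_pages


real_unique_values = 11  # what the table actually contains in both examples

print(estimated_cardinality(20_000, 20, 10))     # 10000.0 with the default sample
print(estimated_cardinality(20_000, 5_000, 10))  # 40.0 with 5,000 sample pages
```

Even with 250 times more sample pages the estimate of 40 is still well above the 11 values really present, because a low-cardinality index repeats the same few values in almost every sampled page.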

• Time consuming
mysql> select count(*) from goods;
+----------+
| count(*) |
+----------+
| 80303000 |
+----------+
• With default STATS_SAMPLE_PAGES
mysql> analyze table goods;
• With bigger number
mysql> alter table goods STATS_SAMPLE_PAGES=5000;
Records: 0 Duplicates: 0 Warnings: 0
mysql> analyze table goods;
• 27.13/0.32 = 85 times slower!
• Not always a solution

Use Larger STATS_SAMPLE_PAGES?

Even Worse Use Case
Example

• goods_characteristics
CREATE TABLE `goods_characteristics` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `good_id` varchar(30) DEFAULT NULL,
  `size` int(11) DEFAULT NULL,
  `manufacturer` varchar(30) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `good_id` (`good_id`,`size`,`manufacturer`),
  KEY `size` (`size`,`manufacturer`)
) ENGINE=InnoDB AUTO_INCREMENT=196606
DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Two Similar Tables

• goods_shops
CREATE TABLE `goods_shops` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `good_id` varchar(30) DEFAULT NULL,
  `location` varchar(30) DEFAULT NULL,
  `delivery_options` varchar(30) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `good_id` (`good_id`,`location`,`delivery_options`),
  KEY `location` (`location`,`delivery_options`)
) ENGINE=InnoDB AUTO_INCREMENT=131071
DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Two Similar Tables

• Size
mysql> select count(*) from goods_characteristics;

+----------+

| count(*) |

+----------+

| 131072 |

+----------+

mysql> select count(*) from goods_shops;

+----------+

| count(*) |

+----------+

| 65536 |

+----------+

Two Similar Tables

• Data Distribution: goods_characteristics
mysql> select count(*) num_rows, good_id, size
-> from goods_characteristics group by good_id, size;
+----------+---------+------+
| num_rows | good_id | size |
+----------+---------+------+
|    65536 | laptop  |    7 |
|     8187 | laptop  |    8 |
|     8190 | laptop  |    9 |
|     8188 | laptop  |   10 |
|     8192 | laptop  |   11 |
|     8189 | laptop  |   12 |
|     8189 | laptop  |   13 |
|     8191 | laptop  |   14 |
|     8190 | laptop  |   15 |
|       10 | laptop  |   16 |
|       10 | laptop  |   17 |
+----------+---------+------+

Two Similar Tables
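The size distribution above makes it easy to see what a single Cardinality number hides. The sketch below compares the actual row count per size with the estimate obtained by spreading all rows evenly over the distinct values. The even-spread estimate is a simplification used here for illustration, looking at the size column alone.

```python
# Rows per laptop size, copied from the goods_characteristics distribution above.
rows_per_size = {
    7: 65536, 8: 8187, 9: 8190, 10: 8188, 11: 8192, 12: 8189,
    13: 8189, 14: 8191, 15: 8190, 16: 10, 17: 10,
}

n_rows = sum(rows_per_size.values())
cardinality = len(rows_per_size)
uniform_estimate = n_rows / cardinality  # rows per value if data were uniform


def actual_vs_estimate(size):
    """How many times the real row count differs from the uniform estimate."""
    return rows_per_size[size] * cardinality / n_rows


print(n_rows, cardinality)     # 131072 11
print(actual_vs_estimate(7))   # 5.5
print(actual_vs_estimate(16))
```

Size 7 holds 5.5 times more rows than a uniform estimate suggests, while sizes 16 and 17 hold only a tiny fraction of it, so any plan based on the average is wrong for both ends of the distribution.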

• Data Distribution: goods_characteristics
mysql> select count(*) num_rows, good_id, manufacturer
-> from goods_characteristics group by good_id, manufacturer order by num_rows desc;
+----------+---------+--------------+
| num_rows | good_id | manufacturer |
+----------+---------+--------------+
|     8189 | laptop  | HP           |
|     8189 | laptop  | Lenovo       |
|       10 | laptop  | Casper       |
+----------+---------+--------------+

Two Similar Tables

• Data Distribution: goods_shops
mysql> select count(*) num_rows, good_id, location

-> from goods_shops group by good_id, location order by num_rows desc;

+----------+---------+---------------+

| num_rows | good_id | location |

+----------+---------+---------------+

+----------+---------+---------------+

Two Similar Tables

• Data Distribution: goods_shops
mysql> select count(*) num_rows, good_id, delivery_options
-> from goods_shops group by good_id, delivery_options order by num_rows desc;
+----------+---------+------------------+
| num_rows | good_id | delivery_options |
+----------+---------+------------------+
|     8192 | laptop  | DHL              |
|     8191 | laptop  | PTT              |
|     8189 | laptop  | Gruzovichkof     |
|     8188 | laptop  | Courier          |
+----------+---------+------------------+

Two Similar Tables

Histogram statistics are useful primarily for nonindexed columns. Adding an index to a column for which histogram statistics are applicable might also help the optimizer make row estimates. The tradeoffs are:

• An index must be updated when table data is modified.
• A histogram is created or updated only on demand, so it adds no overhead when table data is modified. On the other hand, the statistics become progressively more out of date when table modifications occur, until the next time they are updated.

MySQL User Reference Manual

Optimizer Statistics aka Histograms

mysql> alter table goods_characteristics stats_sample_pages=5000;

mysql> alter table goods_shops stats_sample_pages=5000;

mysql> analyze table goods_characteristics, goods_shops;

+----------------------------+---------+----------+----------+

+----------------------------+---------+----------+----------+

Index Statistics is More than Good

• The query
mysql> select count(*) from goods_shops join goods_characteristics
-> using (good_id)
-> where size < 12 and
-> manufacturer in ('Lenovo', 'Dell', 'Toshiba', 'Samsung', 'Acer')
-> and (location in ('Moscow', 'Kiev') or
-> delivery_options in ('Premium', 'Urgent'));

^C^C -- query aborted

ERROR 1317 (70100): Query execution was interrupted

Performance

• Handlers
mysql> show status like 'Handler%';

+----------------------------+-------------+

| Handler_commit | 0 |

| Handler_delete | 0 |

| Handler_discover | 0 |

| Handler_external_lock | 4 |

| Handler_mrr_init | 0 |

| Handler_prepare | 0 |

| Handler_read_first | 1 |

| Handler_read_key | 13043 |

| Handler_read_last | 0 |

| Handler_read_next | 854,767,916 |

Performance

• Table order
mysql> explain select count(*) from goods_shops join goods_characteristics
-> using (good_id) where size < 12 and
+----+-----------------------+-------+---------+--------+----------+---------------+
| id | table                 | type  | key     | rows   | filtered | Extra         |
+----+-----------------------+-------+---------+--------+----------+---------------+
|  1 | goods_characteristics | index | good_id | 131072 |    25.00 | Using...      |
|  1 | goods_shops           | ref   | good_id |  65536 |    36.00 | Using...      |
+----+-----------------------+-------+---------+--------+----------+---------------+

Performance

• Table order matters
mysql> explain select count(*) from goods_shops straight_join goods_characteristics
-> using (good_id) where size < 12 and
+----+-----------------------+-------+---------+--------+----------+---------------+
| id | table                 | type  | key     | rows   | filtered | Extra         |
+----+-----------------------+-------+---------+--------+----------+---------------+
|  1 | goods_shops           | index | good_id |  65536 |    36.00 | Using...      |
|  1 | goods_characteristics | ref   | good_id | 131072 |    25.00 | Using...      |
+----+-----------------------+-------+---------+--------+----------+---------------+

Performance

• Table order matters
mysql> select count(*) from goods_shops straight_join goods_characteristics

-> using (good_id)

+----------+

| count(*) |

+----------+

| 816640 |

+----------+

Performance

• Table order matters
mysql> show status like 'Handler_read_next';

+-------------------+-----------+

Performance

• Not for all data
mysql> select count(*) from goods_shops straight_join goods_characteristics
-> using (good_id)
-> where (size > 15 or manufacturer in ('Sony', 'Casper'))
-> and location in
-> ('New York', 'San Francisco', 'Paris', 'Berlin', 'Brussels', 'London')
-> and delivery_options in
-> ('DHL','Normal Post', 'Tracked', 'Fedex', 'No delivery');

^C^C -- query aborted

ERROR 1317 (70100): Query execution was interrupted

Performance

• Not for all data
mysql> show status like 'Handler%';

+----------------------------+------------+

| Handler_commit | 10 |

| Handler_delete | 0 |

| Handler_discover | 0 |

| Handler_external_lock | 28 |

| Handler_mrr_init | 0 |

| Handler_prepare | 0 |

| Handler_read_first | 1 |

| Handler_read_key | 143 |

| Handler_read_last | 0 |

Performance

mysql> analyze table goods_shops update histogram
-> on location, delivery_options;
+-------------+-----------+----------+-------------------------------------------------------------+
| Table       | Op        | Msg_type | Msg_text                                                    |
+-------------+-----------+----------+-------------------------------------------------------------+
| goods_shops | histogram | status   | Histogram statistics created for column 'delivery_options'. |
| goods_shops | histogram | status   | Histogram statistics created for column 'location'.         |
+-------------+-----------+----------+-------------------------------------------------------------+

Histograms to The Rescue

mysql> analyze table goods_characteristics update histogram
-> on size, manufacturer;
+-----------------------+-----------+----------+---------------------------------------------------------+
| Table                 | Op        | Msg_type | Msg_text                                                |
+-----------------------+-----------+----------+---------------------------------------------------------+
| goods_characteristics | histogram | status   | Histogram statistics created for column 'manufacturer'. |
| goods_characteristics | histogram | status   | Histogram statistics created for column 'size'.         |
+-----------------------+-----------+----------+---------------------------------------------------------+

• The query
mysql> select count(*) from goods_shops join goods_characteristics

-> using (good_id)

+----------+

| count(*) |

+----------+

| 816640 |

+----------+

• The query
mysql> show status like 'Handler_read_next';

+-------------------+-----------+

• Filtering effect
mysql> explain select count(*) from goods_shops join goods_characteristics
-> using (good_id)
+----+-----------------------+-------+---------+--------+----------+----------+
| id | table                 | type  | key     | rows   | filtered | Extra    |
+----+-----------------------+-------+---------+--------+----------+----------+
|  1 | goods_shops           | index | good_id |  65536 |     0.06 | Using... |
|  1 | goods_characteristics | ref   | good_id | 131072 |    15.63 | Using... |
+----+-----------------------+-------+---------+--------+----------+----------+
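A back-of-the-envelope reading of the EXPLAIN numbers helps to see why the filtering effect matters. The sketch below multiplies rows by filtered for each table and then multiplies the two results, which is a rough approximation of how many row combinations the join is expected to produce. It is not the optimizer's cost model, only the arithmetic on the values shown in the EXPLAIN outputs.

```python
def rows_after_filter(rows, filtered_percent):
    """Rows expected to survive the WHERE conditions on one table."""
    return rows * filtered_percent / 100


def join_estimate(first, second):
    """Rough number of row combinations: filtered rows of both tables multiplied."""
    return rows_after_filter(*first) * rows_after_filter(*second)


# (rows, filtered) pairs as shown by EXPLAIN, first table in the join order first.
without_histograms = [(131072, 25.00), (65536, 36.00)]
with_histograms = [(65536, 0.06), (131072, 15.63)]

print(rows_after_filter(131072, 25.00))  # 32768.0
print(round(join_estimate(*without_histograms)))
print(round(join_estimate(*with_histograms)))
```

With default filtering the estimate is in the hundreds of millions, the same order of magnitude as the Handler_read_next counter of the aborted query. With histograms it drops to roughly eight hundred thousand, close to the 816640 rows the query actually counts.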

How Histograms Work?

↓ sql/sql_planner.cc
↓ calculate_condition_filter
↓ Item_func_*::get_filtering_effect
• get_histogram_selectivity
• Seen as a percent of filtered rows in EXPLAIN

Low Level

• Example data
mysql> create table example(f1 int) engine=innodb;

mysql> insert into example values(1),(1),(1),(2),(3);

mysql> select f1, count(f1) from example group by f1;

+------+-----------+

| f1 | count(f1) |

+------+-----------+

| 1 | 3 |

| 2 | 1 |

| 3 | 1 |

+------+-----------+


Filtered Rows

• Without a histogram
mysql> explain select * from example where f1 > 0\G

*************************** 1. row ***************************

select_type: SIMPLE

table: example

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 5

filtered: 33.33

Extra: Using where

1 row in set, 1 warning (0.00 sec)

Filtered Rows

*************************** 1. row ***************************

select_type: SIMPLE

table: example

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 5

filtered: 33.33

Extra: Using where

Filtered Rows

*************************** 1. row ***************************

select_type: SIMPLE

table: example

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 5

filtered: 33.33

Extra: Using where

Filtered Rows

*************************** 1. row ***************************

select_type: SIMPLE

table: example

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 5

filtered: 33.33

Extra: Using where

Filtered Rows

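• With the histogram
mysql> analyze table example update histogram on f1 with 3 buckets;
+-----------------+-----------+----------+-----------------------------------------------+
| Table           | Op        | Msg_type | Msg_text                                      |
+-----------------+-----------+----------+-----------------------------------------------+
| hist_ex.example | histogram | status   | Histogram statistics created for column 'f1'. |
+-----------------+-----------+----------+-----------------------------------------------+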

Filtered Rows

• With the histogram
mysql> select * from information_schema.column_statistics
-> where table_name='example'\G
*************************** 1. row ***************************
SCHEMA_NAME: hist_ex
TABLE_NAME: example
COLUMN_NAME: f1
HISTOGRAM: {"buckets": [[1, 0.6], [2, 0.8], [3, 1.0]],
"data-type": "int", "null-values": 0.0, "collation-id": 8,
"last-updated": "2018-11-07 09:07:19.791470",
"sampling-rate": 1.0, "histogram-type": "singleton",
"number-of-buckets-specified": 3}

Filtered Rows

• With the histogram
mysql> explain select * from example where f1 > 0\G

*************************** 1. row ***************************

select_type: SIMPLE

table: example

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 5

filtered: 100.00 -- all rows

Extra: Using where

Filtered Rows

*************************** 1. row ***************************

select_type: SIMPLE

table: example

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 5

filtered: 40.00 -- 2 rows

Extra: Using where

Filtered Rows

*************************** 1. row ***************************

select_type: SIMPLE

table: example

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 5

filtered: 20.00 -- one row

Extra: Using where

Filtered Rows

*************************** 1. row ***************************

select_type: SIMPLE

table: example

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 5

filtered: 20.00 -- one row

Extra: Using where

Filtered Rows
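The filtered values above can be reproduced from the singleton buckets stored in column_statistics, where each bucket holds a value and the cumulative fraction of rows up to and including it. The sketch below is a simplified reading of that structure, not the server's get_histogram_selectivity code. Only the predicate f1 > 0 appears on the slides; f1 > 1 and f1 > 2 are assumed here as predicates that would give the 40.00 and 20.00 results.

```python
def selectivity_greater_than(buckets, value):
    """Fraction of rows with column > value, from cumulative singleton buckets."""
    cumulative_le = 0.0  # fraction of rows with column <= value
    for bucket_value, cumulative in buckets:
        if bucket_value <= value:
            cumulative_le = cumulative
        else:
            break
    return 1.0 - cumulative_le


# Buckets exactly as stored for column f1: [value, cumulative frequency].
buckets = [[1, 0.6], [2, 0.8], [3, 1.0]]

for v in (0, 1, 2):
    percent = selectivity_greater_than(buckets, v) * 100
    print(f"f1 > {v}: filtered {percent:.2f}")
```

The three lines printed are 100.00, 40.00 and 20.00, which correspond to five, two and one of the five rows in the example table.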

[Figure: Indexes: Cardinality]

[Figure: Histograms]

Left Overs

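+----------------+-----------------+----------------+
|                | Histograms      | Indexes        |
+----------------+-----------------+----------------+
| Maintained by  | Optimizer       | Storage Engine |
| Updated        | On Demand       | On every DML ∗ |
| Storage        | Light           | Heavy          |
| Optimizer Uses | Real Numbers ∗∗ | Cardinality    |
+----------------+-----------------+----------------+
∗ Unless persistent statistics used
∗∗ For up to 1024 buckets

Histograms vs Indexes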

• CREATE INDEX
  • Metadata lock
  • Can be blocked by any query
• UPDATE HISTOGRAM
  • Backup lock
  • Can be blocked only by a backup
  • Can be created any time without fear

Maintenance: Locking


• CREATE INDEX
  • Locks writes
  • Locks reads ∗
  • Every DML updates the index
• UPDATE HISTOGRAM
  • Uses up to histogram_generation_max_mem_size
  • Persistent after creation
  • DML do not touch it
∗ PS-2503: Before Percona Server 5.6.38-83.0/5.7.20-18; Upstream

Maintenance: Load

• Helps if query plan can be changed
• Not a replacement for the index:
  • GROUP BY
  • ORDER BY
  • Query on a single table ∗
∗ Only if filtering effect can change the plan

Histograms

• Data distribution is uniform
• Range optimization can be used
• Full table scan is fast

When Histograms are Not Helpful?

• Index statistics collected by the engine
• Optimizer calculates Cardinality each time it accesses statistics
• Indexes don't always improve performance
• Histograms can help
  • Still new feature
• Histograms do not replace other optimizations!

Conclusion

• MySQL User Reference Manual
• Blog by Erik Froseth
• Blog by Frederic Descamps
• Talk by Oystein Grovlen @Fosdem
• Talk by Sergei Petrunia @PerconaLive
• WL #8707

More information

www.slideshare.net/SvetaSmirnova

twitter.com/svetsmirnova

github.com/svetasmirnova

Thank you!

