
Copyright 2010 Robert Haas, EnterpriseDB Corporation. Creative Commons 3.0 Attribution. The PostgreSQL Query Planner

The PostgreSQL Query Planner

Robert Haas
Drexel University

CS 500 Database Theory


Why Does My Query Need a Plan?

• SQL is a declarative language.

• In other words, a SQL query is not a program.

• No control flow statements (e.g. for, while) and no way to control order of operations.

• SQL describes results, not process.


The Best Plan May Not Be Obvious

CREATE TABLE foo (a integer, txt varchar);

CREATE INDEX foo_a ON foo (a);

...insert some data...

SELECT * FROM foo WHERE a = 1;

What should the planner do?


Data Distribution Affects Plan Choice

SELECT * FROM foo WHERE a = 1

• Plan #1 (10,000 rows, a = 1 .. 10000):

Index Scan using foo_a on foo
  Index Cond: (a = 1)

• Plan #2 (10,000 rows, 90% have a = 1):

Seq Scan on foo
  Filter: (a = 1)


Data Distribution Affects Plan Choice

SELECT * FROM foo WHERE a = 1

• Plan #3 (10,000 rows, a = 1 .. 10, 1000 times each):

Bitmap Heap Scan on foo
  Recheck Cond: (a = 1)
  -> Bitmap Index Scan on foo_a
        Index Cond: (a = 1)


Join Planning

CREATE TABLE foo (a integer, txt varchar);
CREATE TABLE bar (a integer, txt varchar);
CREATE INDEX foo_a ON foo (a);
CREATE INDEX bar_a ON bar (a);

...insert some data...

SELECT * FROM foo, bar WHERE foo.a = bar.a

What should the planner do? (at least 14 choices!)


Goals of Query Planning

• Make queries run fast.
  – Minimize disk I/O.
  – Prefer sequential I/O to random I/O.
  – Minimize CPU processing.

• Don't use too much memory in the process.

• Deliver correct results.


Query Planner Decisions

• Access strategy for each table.
  – Sequential Scan, Index Scan, Bitmap Index Scan.

• Join strategy.
  – Join order.
  – Join strategy: nested loop, merge join, hash join.
  – Inner vs. outer.

• Aggregation strategy.
  – Plain, sorted, hashed.


Table Access Strategies

• Sequential Scan (Seq Scan)
  – Read every row in the table.

• Index Scan or Bitmap Index Scan
  – Read only part of the table by using the index to skip uninteresting parts.
  – Index scan reads index and table in alternation.
  – Bitmap index scan reads index first, populating bitmap, and then reads table in sequential order.


Sequential Scan

• Always works – no need to create indices in advance.

• Doesn't require reading the index, which has both I/O and CPU cost.

• Best way to access very small tables.

• Usually the best way to access all or nearly all of the rows in a table.


Index Scan

• Potentially huge performance gain when reading only a small fraction of rows in a large table.

• Only table access method that can return rows in sorted order – very useful in combination with LIMIT.

• Random I/O against base table!
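The trade-off between the two scan types can be sketched with a deliberately crude cost model. This is not PostgreSQL's actual cost formula: it ignores index pages, CPU costs, and caching, and assumes the worst case of one random heap page fetch per matching row. The page costs are the seq_page_cost and random_page_cost defaults; the table shape (100 rows per page) is a made-up assumption.

```python
# Toy cost model (an assumption for illustration, not PostgreSQL's formulas).
SEQ_PAGE_COST = 1.0      # default seq_page_cost
RANDOM_PAGE_COST = 4.0   # default random_page_cost


def seq_scan_cost(pages):
    """Read every page of the table once, sequentially."""
    return pages * SEQ_PAGE_COST


def index_scan_cost(matching_rows):
    """Worst case: one random heap page fetch per matching row."""
    return matching_rows * RANDOM_PAGE_COST


def cheaper_scan(total_rows, rows_per_page, selectivity):
    """Return which scan the toy model prefers for a given selectivity."""
    pages = -(-total_rows // rows_per_page)  # ceiling division
    matching_rows = total_rows * selectivity
    if index_scan_cost(matching_rows) < seq_scan_cost(pages):
        return "index scan"
    return "seq scan"


# 10,000 rows, hypothetical 100 rows per page.
one_row = cheaper_scan(10_000, 100, 0.0001)   # a = 1 matches 1 row
most_rows = cheaper_scan(10_000, 100, 0.90)   # a = 1 matches 90% of rows
```

In this model the index scan only wins when a very small fraction of the rows match, because each match is charged a full random page read.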


Bitmap Index Scan

• Scans all index rows before examining base table, populating a TID bitmap.

• Table I/O is sequential, with skips; results in physical order.

• Can efficiently combine data from multiple indices – TID bitmap can handle boolean AND and OR operations.

• Handles LIMIT poorly.
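A minimal sketch of the idea, using Python sets of (page, offset) pairs as a stand-in for the TID bitmap. The real structure is far more compact; the toy heap, the column b, and all function names here are assumptions made for illustration.

```python
# Toy heap: TID (page, offset) -> row. Purely illustrative data.
heap = {
    (0, 1): {"a": 1, "b": 7},
    (0, 2): {"a": 2, "b": 1},
    (1, 1): {"a": 1, "b": 1},
    (2, 1): {"a": 3, "b": 1},
    (2, 2): {"a": 1, "b": 1},
}


def build_index(column):
    """Map each value of `column` to the set of TIDs that hold it."""
    index = {}
    for tid, row in heap.items():
        index.setdefault(row[column], set()).add(tid)
    return index


index_a = build_index("a")
index_b = build_index("b")


def bitmap_index_scan(index, value):
    """Collect every matching TID before touching the heap."""
    return set(index.get(value, ()))


def bitmap_heap_scan(bitmap):
    """Fetch rows in physical (page, offset) order, skipping unneeded pages."""
    return [heap[tid] for tid in sorted(bitmap)]


# WHERE a = 1 AND b = 1: intersect the two bitmaps.
both = bitmap_index_scan(index_a, 1) & bitmap_index_scan(index_b, 1)
# WHERE a = 1 OR b = 1: union the two bitmaps.
either = bitmap_index_scan(index_a, 1) | bitmap_index_scan(index_b, 1)

rows_both = bitmap_heap_scan(both)
```

Because no row is returned until the whole bitmap has been built, a small LIMIT cannot stop the index work early.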


Join Planning

• Fixing the join order and join strategy is the “hard part” of query planning.

• # of possibilities grows exponentially with number of tables.

• When search space is small, planner does a nearly exhaustive search.

• When search space is too large, planner uses heuristics or GEQO to limit planning time and memory usage.
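To get a feel for the growth, the sketch below counts join orders only, before multiplying by the choice of join strategy and access method at each step. It uses the standard combinatorial counts: n! left-deep orders, and (2n-2)!/(n-1)! binary join trees when the outer/inner assignment matters. The function names are illustrative.

```python
from math import factorial


def left_deep_orders(n):
    """Join orders where every join adds one base table to the running result."""
    return factorial(n)


def bushy_orders(n):
    """All binary join trees over n tables, distinguishing outer from inner."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


# Print a small table of how quickly the search space grows.
for n in (2, 3, 4, 6, 8, 10):
    print(n, left_deep_orders(n), bushy_orders(n))
```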


Join Strategies

• Nested loop.

• Merge join.

• Hash join.

Each join strategy takes an “outer” relation and an “inner” relation and produces a result relation.


Nested Loop Pseudocode

for (each outer tuple)
    for (each inner tuple)
        if (join condition is met)
            emit result row;

• Outer or inner loop could be scanning output of some other join, or a base table. Base table scan could be using an index.

• Cost is roughly proportional to product of table sizes – bad if BOTH are large.
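The pseudocode translates almost directly into a runnable sketch. Rows are plain dicts and the relation contents are made up; the inner input is a list so that it can be scanned again for every outer row.

```python
def nested_loop_join(outer, inner, condition):
    """Compare every outer row with every inner row; emit the matching pairs."""
    result = []
    for o in outer:
        for i in inner:          # the inner input is rescanned per outer row
            if condition(o, i):
                result.append((o, i))
    return result


# Illustrative data only.
foo = [{"x": 1}, {"x": 2}, {"x": 2}]
bar = [{"x": 2}, {"x": 3}]

pairs = nested_loop_join(foo, bar, lambda o, i: o["x"] == i["x"])
comparisons = len(foo) * len(bar)   # work grows with the product of the sizes
```

Unlike the other join strategies, the condition can be any predicate, not only equality.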


Nested Loop Example #1

SELECT * FROM foo, bar WHERE foo.x = bar.x

Nested Loop
  Join Filter: (foo.x = bar.x)
  -> Seq Scan on bar
  -> Materialize
        -> Seq Scan on foo

This might be very slow!


Nested Loop Example #2

SELECT * FROM foo, bar WHERE foo.x = bar.x

Nested Loop
  -> Seq Scan on foo
  -> Index Scan using bar_pkey on bar
        Index Cond: (bar.x = foo.x)

Nested loop with inner index-scan! Much better... though probably still not the best plan.
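A rough sketch of why the inner index scan helps: each outer row triggers a lookup in a sorted structure instead of a full pass over the inner relation. A sorted list searched with bisect stands in for the b-tree index here; that substitution, the data, and the function names are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right


def build_index(rows, key):
    """Stand-in for a b-tree: sorted keys plus the row positions they point to."""
    entries = sorted((row[key], pos) for pos, row in enumerate(rows))
    keys = [k for k, _ in entries]
    positions = [p for _, p in entries]
    return keys, positions


def index_scan(index, rows, value):
    """Return only the rows whose key equals `value`, via binary search."""
    keys, positions = index
    lo = bisect_left(keys, value)
    hi = bisect_right(keys, value)
    return [rows[p] for p in positions[lo:hi]]


def nested_loop_index_join(outer, inner, index, key):
    """Nested loop whose inner side is an index lookup, not a full scan."""
    result = []
    for o in outer:
        for i in index_scan(index, inner, o[key]):
            result.append((o, i))
    return result


foo = [{"x": 3}, {"x": 1}, {"x": 5}]
bar = [{"x": 5, "v": "e"}, {"x": 1, "v": "a"}, {"x": 2, "v": "b"}]
bar_index = build_index(bar, "x")

joined = nested_loop_index_join(foo, bar, bar_index, "x")
```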


Merge Join

• Only handles equality joins – something like a.x = b.x.

• Put both input relations into sorted order (using sort or index scan) and scan through the two in parallel, matching up equal values.

• Normally visits each input tuple only once, but may need to “rescan” portions of the inner input if there are duplicate values in the outer input.
  – Take OUTER={1 2 2 3} and INNER={2 2 3 4}
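The rescan behaviour can be seen in a small sketch that joins two sorted lists of keys. The inner position is left at the start of the current group of equal values, so a duplicate outer value walks that group again. This is a simplified illustration, not the executor's actual implementation.

```python
def merge_join(outer, inner):
    """Equality-join two lists of keys that are already sorted ascending."""
    result = []
    mark = 0  # start of the inner group that may match the current outer key
    for o in outer:
        # Skip inner keys that are too small to match this or any later outer key.
        while mark < len(inner) and inner[mark] < o:
            mark += 1
        # Scan the matching group without moving the mark, so that a
        # duplicate outer key can rescan the same inner rows.
        j = mark
        while j < len(inner) and inner[j] == o:
            result.append((o, inner[j]))
            j += 1
    return result


OUTER = [1, 2, 2, 3]
INNER = [2, 2, 3, 4]
matches = merge_join(OUTER, INNER)
```

With these inputs, each of the two outer 2s pairs with both inner 2s, and the inner 2s are read twice.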


Merge Join Example

SELECT * FROM foo, bar WHERE foo.x = bar.x

Merge Join
  Merge Cond: (foo.x = bar.x)
  -> Sort
        Sort Key: foo.x
        -> Seq Scan on foo
  -> Materialize
        -> Sort
              Sort Key: bar.x
              -> Seq Scan on bar


Hash Join

• Like merge join, only handles equality joins.

• Hash each row from the inner relation to create a hash table. Then, hash each row from the outer relation and probe the hash table for matches.

• Very fast – but requires enough memory to store inner tuples. Can get around this using multiple “batches”.

• Not guaranteed to retain input ordering.
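A single-batch sketch of the build and probe phases, with a Python dict standing in for the hash table. The multi-batch case is not modelled, and the data and names are illustrative assumptions.

```python
from collections import defaultdict


def hash_join(outer, inner, key):
    """Build a hash table on the inner rows, then probe it with each outer row."""
    table = defaultdict(list)
    for row in inner:                 # build phase: inner must fit in memory
        if row[key] is not None:      # NULL never satisfies an equality join
            table[row[key]].append(row)

    result = []
    for o in outer:                   # probe phase: one lookup per outer row
        for i in table.get(o[key], ()):
            result.append((o, i))
    return result


foo = [{"x": 2}, {"x": 1}, {"x": None}, {"x": 2}]
bar = [{"x": 2, "v": "b"}, {"x": None, "v": "n"}, {"x": 4, "v": "d"}]

joined = hash_join(foo, bar, "x")
```

Each input is read once, which is why this is fast when the inner relation is small enough to hash in memory.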


Hash Join Example

SELECT * FROM foo, bar WHERE foo.x = bar.x

Hash Join
  Hash Cond: (foo.x = bar.x)
  -> Seq Scan on foo
  -> Hash
        -> Seq Scan on bar


Join Removal

• Upcoming PostgreSQL 9.0 feature!

• Consider the following query:

SELECT foo.x, foo.y, foo.z
FROM foo LEFT JOIN bar ON foo.x = bar.x;

• If there is a unique index on bar (x), then, instead of joining foo and bar, we can just read foo, and ignore bar.

• Common scenario using views or query generators.
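Why uniqueness matters can be checked with a small simulation of the LEFT JOIN that keeps only foo's columns. The data and the function name are made up for illustration.

```python
def left_join_keep_foo(foo, bar, key):
    """foo LEFT JOIN bar ON foo.key = bar.key, returning only foo's columns."""
    result = []
    for f in foo:
        matches = [b for b in bar if f[key] is not None and b[key] == f[key]]
        # LEFT JOIN: an unmatched foo row still appears exactly once.
        copies = max(1, len(matches))
        result.extend([f] * copies)
    return result


foo = [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}, {"x": None, "y": "c"}]
bar_unique = [{"x": 1}, {"x": 3}]    # bar.x is unique
bar_dupes = [{"x": 1}, {"x": 1}]     # bar.x is not unique

same_as_foo = left_join_keep_foo(foo, bar_unique, "x") == foo
with_dupes = left_join_keep_foo(foo, bar_dupes, "x")
```

With unique bar.x every foo row comes out exactly once, so the join changes nothing. Without uniqueness, rows can be duplicated and the join cannot be dropped.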


Join Removal – Continued

• PostgreSQL 9.0 will only be able to remove LEFT joins.

• Current project for PostgreSQL 9.1: remove INNER joins.

• Consider: SELECT foo.x, foo.y, foo.z FROM foo, bar WHERE foo.x = bar.x;

• Need: (1) foo.x is NOT NULL, (2) foreign key foo (x) references bar (x).


Join Reordering

• SELECT * FROM foo
  JOIN bar ON foo.x = bar.x
  JOIN baz ON foo.y = baz.y

• SELECT * FROM foo
  JOIN baz ON foo.y = baz.y
  JOIN bar ON foo.x = bar.x

• SELECT * FROM foo
  JOIN (bar JOIN baz ON true)
  ON foo.x = bar.x AND foo.y = baz.y


Not The Same Thing!

• SELECT * FROM
  (foo JOIN bar ON foo.x = bar.x)
  LEFT JOIN baz ON foo.y = baz.y

• SELECT * FROM
  (foo LEFT JOIN baz ON foo.y = baz.y)
  JOIN bar ON foo.x = bar.x


EXPLAIN Estimates

Hash Join (cost=8.28..404.52 rows=9000 width=118)
  Hash Cond: (foo.x = bar.x)
  -> Hash Join (cost=3.02..275.52 rows=9000 width=12)
        Hash Cond: (foo.y = baz.y)
        -> Seq Scan on foo (cost=0.00..145.00 rows=10000 width=8)
        -> Hash (cost=1.90..1.90 rows=90 width=4)
              -> Seq Scan on baz (cost=0.00..1.90 rows=90 width=4)
  -> Hash (cost=4.00..4.00 rows=100 width=106)
        -> Seq Scan on bar (cost=0.00..4.00 rows=100 width=106)


EXPLAIN ANALYZE

Hash Join (cost=8.28..404.52 rows=9000 width=118) (actual time=0.743..51.582 rows=9000 loops=1)
  Hash Cond: (foo.x = bar.x)
  -> Hash Join (cost=3.02..275.52 rows=9000 width=12) (actual time=0.368..30.964 rows=9000 loops=1)
        Hash Cond: (foo.y = baz.y)
        -> Seq Scan on foo (cost=0.00..145.00 rows=10000 width=8) (actual time=0.021..9.908 rows=10000 loops=1)
        -> Hash (cost=1.90..1.90 rows=90 width=4) (actual time=0.280..0.280 rows=90 loops=1)
              Buckets: 1024 Batches: 1 Memory Usage: 4kB
              -> Seq Scan on baz (cost=0.00..1.90 rows=90 width=4) (actual time=0.010..0.138 rows=90 loops=1)
  -> Hash (cost=4.00..4.00 rows=100 width=106) (actual time=0.354..0.354 rows=100 loops=1)
        Buckets: 1024 Batches: 1 Memory Usage: 14kB
        -> Seq Scan on bar (cost=0.00..4.00 rows=100 width=106) (actual time=0.007..0.167 rows=100 loops=1)
Total runtime: 59.376 ms


Review of Join Planning

• Join Order

• Join Strategy
  – Nested loop
  – Merge join
  – Hash join
  – Join removal

• Inner vs. outer


Aggregates and DISTINCT

• Plain aggregate.
  – e.g. SELECT count(*) FROM foo;

• Sorted aggregate.
  – Sort the data (or use pre-sorted data); when you see a new value, aggregate the prior group.

• Hashed aggregate.
  – Insert each input row into a hash table based on the grouping columns; at the end, aggregate all the groups.
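The two grouped strategies can be sketched for a count(*) per group. The input is a plain list of grouping-key values; the names and data are illustrative.

```python
def sorted_aggregate(sorted_keys):
    """count(*) per group over input already sorted by the grouping key."""
    result = []
    current, count = None, 0
    for k in sorted_keys:
        if count and k != current:
            # A new value means the prior group is complete: emit it now.
            result.append((current, count))
            count = 0
        current = k
        count += 1
    if count:
        result.append((current, count))   # emit the final group
    return result


def hashed_aggregate(keys):
    """count(*) per group using a hash table; input order does not matter."""
    groups = {}
    for k in keys:
        groups[k] = groups.get(k, 0) + 1
    return groups                          # nothing is final until the end


by_sort = sorted_aggregate([1, 1, 2, 3, 3, 3])
by_hash = hashed_aggregate([3, 1, 3, 2, 1, 3])
```

The sorted version holds one group in memory at a time and can emit results early. The hashed version skips the sort but must keep every group until the input is exhausted.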


Statistics

• All of the decisions discussed earlier in this talk are made using statistics.
  – Seq scan vs. index scan vs. bitmap index scan
  – Nested loop vs. merge join vs. hash join

• ANALYZE (manual or via autovacuum) gathers this information.

• You must have good statistics or you will get bad plans!


Confusing The Planner

SELECT * FROM foo WHERE a = 1 AND b = 1

• If 20% of the rows have a = 1 and 10% of the rows have b = 1, the planner will assume that 20% * 10% = 2% of the rows meet both criteria.

SELECT * FROM foo WHERE (a + 0) = a

• Planner doesn't have a clue, so will assume 0.5% of rows will match.
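The arithmetic, and how far it can drift from reality, can be sketched on synthetic data. The 10,000-row table below is an assumption built so that every row with b = 1 also has a = 1, which is exactly the kind of correlation the independence assumption misses.

```python
def combined_selectivity(*selectivities):
    """Multiply per-clause selectivities, assuming the clauses are independent."""
    product = 1.0
    for s in selectivities:
        product *= s
    return product


TOTAL_ROWS = 10_000
DEFAULT_EQ_SELECTIVITY = 0.005   # the 0.5% guess from the slide

# Synthetic, correlated data: 20% have a = 1, 10% have b = 1,
# and b = 1 only occurs where a = 1.
rows = [
    {"a": 1 if n % 5 == 0 else 0, "b": 1 if n % 10 == 0 else 0}
    for n in range(TOTAL_ROWS)
]

estimated_both = round(TOTAL_ROWS * combined_selectivity(0.20, 0.10))
actual_both = sum(1 for r in rows if r["a"] == 1 and r["b"] == 1)

# (a + 0) = a is true for every non-NULL a, but the estimate is a flat guess.
estimated_opaque = round(TOTAL_ROWS * DEFAULT_EQ_SELECTIVITY)
actual_opaque = sum(1 for r in rows if (r["a"] + 0) == r["a"])
```

Here the first estimate is low by a factor of five and the second by a factor of two hundred.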


What Can Go Wrong?

• If the planner underestimates the row count, it may choose an index scan instead of a sequential scan, or a nested loop instead of a hash or merge join.

• If the planner overestimates the row count, it may choose a sequential scan instead of an index scan, or a merge or hash join instead of a nested loop.

• Small values for LIMIT tilt the planner toward fast-start plans and magnify the effect of bad estimates.


Query Planner Parameters

• seq_page_cost (1.0), random_page_cost (4.0) – Reduce these costs to account for caching effects. If database is fully cached, try 0.005.

• default_statistics_target (10 or 100) – Level of detail for statistics gathering. Can also be overridden on a per-column basis.

• enable_hashjoin, enable_sort, etc. – Just for testing.

• work_mem – Amount of memory per sort or hash.

• from_collapse_limit, join_collapse_limit, geqo_threshold – Sometimes need to be raised, but be careful!


Things That Are Slow

• DISTINCT.

• PL/pgsql loops.
  FOR x IN SELECT ... LOOP SELECT ... END LOOP

• Repeated calls to SQL or PL/pgsql functions.
  SELECT id, some_function(id) FROM table;


Upcoming Features

• Join removal (right now just for LEFT joins).

• Machine-readable EXPLAIN output.

• Hash statistics.

• Better model for Materialize costs.

• Improved use of indices to handle MIN(x), MAX(x), and x IS NOT NULL.


Questions?

Any questions?