37
Why physical storage of your database tables might matter Apoorva Aggarwal

Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

Why physical storage of your database tables might matter

Apoorva Aggarwal

Page 2: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

About MeData Platform Engineer at Grofers helping build systems for data informed decisions

@apoorva_1993 apoorvaaggarwal

Page 3: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer
Page 4: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer
Page 5: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

The War ZoneBird’s eye view

Page 6: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

Behind the scenesNarrowing down into problem space

Page 7: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

We looked at the query plan

Page 8: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

Default index in postgres is of type BTree which constitutes a BTree or a balanced tree of index entries and index leaf nodes which store physical address of an index entry.

An index lookup requires three steps:

1. Tree traversal - O(log n)2. Following the leaf node chain - O(n) ~

O(200)3. Fetching the table data

Page 9: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

In postgres, location of a row is given by ctid which is a tuple. ctid is of type tid (tuple identifier), called ItemPointer in the C code. Per documentation:

This is the data type of the system column ctid. A tuple ID is a pair (block number, tuple index within block) that identifies the physical location of the row within its table.

Distribution of Data

Page 10: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

CLUSTER command

- Physically rearranges the rows on disk based on a given column.

- Acquiring a READ WRITE lock on table and requiring 2.5x the table size made it tricky to use.

Possible solutions to root cause and exploring feasibility Insert cluster gif

Page 11: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

An attempt to fix the problem right from the source

Page 12: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

Understanding how Spark writes data to a table

Page 13: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer
Page 14: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer
Page 15: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer
Page 16: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer
Page 17: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer
Page 18: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

Looking at the code we found that we had partitioned on product_id for a particular transformation operation.

Note how rows containing a particular product ID are in one partition

Page 19: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

Testing hypothesis

Page 20: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer
Page 21: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

Attempting to align the data

Page 22: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

We repartitioned the dataframe

Page 23: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

New data distributionAlas, the table is still not pivoted on customer_id.

What did we do wrong?

Page 24: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer
Page 25: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

Understanding Repartitioning

Page 26: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

Visualizing the dataframe

Page 27: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

Bring all rows of a product together?How?

Sort them maybe!?

Page 28: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

After sorting rows within a partition by customer_id

Page 29: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

Let’s write this to the database table And see distribution

Page 30: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer
Page 31: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer
Page 32: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

Testing the solutionThe final test still remains. Will the query execution now be faster. Let’s see what the query planner says

Page 33: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer
Page 34: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

A plain index scan fetches one tuple-pointer at a time from the index, and immediately visits that tuple in the table. A bitmap scan fetches all the tuple-pointers from the index in one go, sorts them using an in-memory "bitmap" data structure, and then visits the table tuples in physical tuple-location order.

Why a bitmap scan

Page 35: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

Execution time came down from ~100 ms to ~3ms.

Page 36: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer
Page 37: Why physical storage of your database tables might matter · 2020-03-16 · Why physical storage of your database tables might matter Apoorva Aggarwal. About Me Data Platform Engineer

We are hiring!

Email: [email protected]