
TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL


Page 1: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Building a Custom Data Warehouse Using PostgreSQL

TOASTing an Elephant

Illustration by Zoe Lubitz

Page 2: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

David Kohn

Chief Elephant Toaster and Data Engineer at Moat

[email protected]

Page 3: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

We measure attention online, for both advertisers and publishers.

We don't track cookies or IP addresses.

Rather we process billions of events per day that allow us to measure how many people saw an ad or interacted with it.

We are a neutral third party and our metrics are used by both advertisers and publishers to measure their performance and agree on a fair price.

Those billions of events are aggregated in our realtime system and end up as millions of rows per day added to our stats databases.

Page 4: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Moat Interface

Page 5: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Moat Interface

Page 6: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Moat Interface

Page 7: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

tuple: client | date | filter1 … filterN | metric1 … metricN

Partition Keys | Filters (~10 text) | Metrics (~170 int8)

Production queries target a single client.

Production queries sum all of these. Subsets are hierarchical.

Basic Row Structure
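For concreteness, a minimal sketch of what such a table could look like (hypothetical names; only two of the ~10 filters and two of the ~170 metrics shown):

-- Trimmed-down, assumed version of the wide rollup row
CREATE TABLE rollup_table_name (
    client  text   NOT NULL,  -- partition key
    date    date   NOT NULL,  -- partition key
    filter1 text,             -- ~10 text filter columns in production
    filter2 text,
    metric1 bigint,           -- ~170 int8 metric columns in production
    metric2 bigint
);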

Page 8: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

tuple: client | date | filter1 … filterN | metric1 … metricN

Partition Keys | Filters (~10 text) | Metrics (~170 int8)

Production queries target a single client.

Production queries sum all of these. Subsets are hierarchical.

Basic Row Structure

SELECT filter1, filter2 …
       SUM(metric1), SUM(metric2) … SUM(metricN)
FROM rollup_table_name
WHERE client = 'foo' AND date >= 'bar' AND date <= 'baz'
GROUP BY filter1, filter2 …

Typical Query

Page 9: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Moat Interface

Client Filters Date Range

Metrics

Page 10: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Moat Interface

Client Filters Date Range

Metrics (and there are a lot more of them you can choose)

Page 11: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Lots of Data

Page 12: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Sum large amounts of data quickly (but only a small fraction of total data, easily partition-able)

Sum all columns of very wide rows

Compress data (for storage and i/o reasons)

Support medium read concurrency (or at least degrade predictably), i.e. 4-12 requests/second, some of which can take minutes to finish

Data is derivative and structured to meet the needs of the client-facing app: high read/aggregation throughput for clients

ETL quickly, some bulk delete/redo operations, once per day

Requirements

Page 13: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Should we choose a row store or a column store?

Page 14: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Old Systems

Postgres

• 2 masters + 2 replicas each

• Handled last 7 days

• High concurrency

• Highly disk bound

• Heavily partitioned

• Shield for column stores

Vertica

• ~3 mos/cluster (30 TB license - 8 nodes - $$$)

• Fast, but slowed down under concurrency

• Performance degradation unpredictable

• Projections can lead to slow ETL

Redshift

• 1 cluster (8 nodes, spinning disk)

• 2012-Present

• No rollup tables, too big

• Incredibly slow for client-facing queries (many columns)

• Bulk-insert ETL, delete/update hard

Page 15: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Row Store

[Diagram: a table on disk is a collection of pages; each page has a header and holds tuples; each tuple is a header plus its attrs]

A table is a collection of rows, each row split into columns/attrs

Each row must fit into a page.

Page 16: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL


Row Store

• Accesses small subsets of rows quickly

• Little penalty for many columns selected

• Great for individual inserts, updates and deletes

• Often normalized data structure

• OLTP workloads

• High concurrency, less throughput per user

• Data stored uncompressed, unless too large for a block

Page 17: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

[Diagram: a table on disk stored one attr at a time; each attr's pages hold compressed values (possibly with surrogate keys)]

Column Store

A table is a collection of columns.

Each column is split into values; position corresponds to row.

Values in columns often compressed.

Page 18: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL


Column Store

• Scans and aggregates large numbers of rows quickly

• Best when selecting a subset of columns

• Great for bulk inserts, harder to delete or update

• Often denormalized data structure

• OLAP workloads

• Lower concurrency, much higher throughput per user

• Data can be compressed

Page 19: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

[Diagram: a tuple whose last attr is too big to fit in its page]

What happens when an attr is too big to fit in a page?

Page 20: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

TOAST: The Oversized Attribute Storage Technique

[Diagram: the oversized attr is LZ-compressed, split into segments, and stored as (tuple id, compressed attr segment) rows across pages of a separate TOAST table; the main tuple keeps a pointer to it]
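As a rough illustration (standard catalog queries; 'some_table' is a placeholder name), you can see which columns are TOAST-able and how much of a table lives in its TOAST table:

-- Which columns can be compressed and/or moved out of line
-- ('x' = EXTENDED, the default for varlena types such as text and arrays)
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'some_table'::regclass AND attnum > 0;

-- Heap size vs. size including the associated TOAST table
SELECT pg_size_pretty(pg_relation_size('some_table')) AS heap_only,
       pg_size_pretty(pg_table_size('some_table'))    AS heap_plus_toast;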

Page 21: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Project Marjory

Page 22: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Project Marjory

Page 23: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Moat Interface

Client Filters Date Range

Metrics

Page 24: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

tuple: client | date | filter1 … filterN | metric1 … metricN

Partition Keys | Filters (~10 text) | Metrics (~170 int8)

Original Row

Page 25: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

tuple: client | date | filter1 … filterN | metric1 … metricN

Partition Keys | Filters (~10 text) | Metrics (~170 int8)

Original Row

Subtype

subtype: filter1 … filterN | metric1 … metricN

Page 26: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

tuple: client | date | filter1 … filterN | metric1 … metricN

Partition Keys | Filters (~10 text) | Metrics (~170 int8)

Original Row

Subtype

subtype: filter1 … filterN | metric1 … metricN

MegaRow

tuple: client | date | segment | array of subtype

Partition Keys | Array of Composite Type (~5000 rows/array)
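A minimal sketch of this layout under assumed, trimmed-down names (the real subtype has ~10 filters and ~170 metrics; later slides add more array columns per rollup level):

-- Composite type holding one original row's filters and metrics
CREATE TYPE subtype AS (
    filter1 text,
    filter2 text,
    metric1 bigint,
    metric2 bigint
);

-- One MegaRow per (date, client, segment); the ~5000-element array is what
-- TOAST compresses and stores out of line
CREATE TABLE array_table_name (
    date      date NOT NULL,
    client    text NOT NULL,
    segment   text NOT NULL,
    byfilter1 subtype[]
);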

Page 27: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

INSERT INTO array_table_name
SELECT date, client, segment,
       ARRAY_AGG( (filter1, filter2 … metric1, metric2 … metricN)::subtype )
FROM temp_table_for_etl
GROUP BY date, client, segment

Typical ETL Query

Page 28: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

INSERT INTO array_table_name
SELECT date, client, segment,
       ARRAY_AGG( (filter1, filter2 … metric1, metric2 … metricN)::subtype )
FROM temp_table_for_etl
GROUP BY date, client, segment

Typical ETL Query

Reporting Query

SELECT a.date, a.client,
       s.filter1 … s.filterN,
       SUM(s.metric1) … SUM(s.metricN)
FROM array_table_name a,
     LATERAL UNNEST(subtype[]) s (filter1, filter2, … metricN)
WHERE client = 'foo' AND date >= 'bar' AND date <= 'baz'
GROUP BY a.date, a.client, s.filter1 … s.filterN
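A concrete, runnable form of that reporting query against the trimmed-down hypothetical schema sketched earlier (literal values are placeholders):

SELECT a.date, a.client,
       s.filter1, s.filter2,
       SUM(s.metric1) AS metric1,
       SUM(s.metric2) AS metric2
FROM array_table_name a,
     LATERAL UNNEST(a.byfilter1) AS s   -- expands the composite array back into rows
WHERE a.client = 'foo'
  AND a.date BETWEEN '2017-01-01' AND '2017-01-10'
GROUP BY a.date, a.client, s.filter1, s.filter2;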

Page 29: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

1 Client, 10 days, ~150,000 rows/day (~1.5m rows total)

[Benchmark chart: Marjory vs. Redshift]

Page 30: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

1 Client, 10 days, ~3,000,000 rows/day (~30m rows total)

[Benchmark chart: Marjory vs. Redshift]

Page 31: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

1 Client, 4 months, ~150,000 rows/day (~18m rows total)

[Benchmark chart: Marjory vs. Redshift]

Page 32: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

1 Client, 4 months, ~150,000 rows/day (~18m rows total)

[Benchmark chart: Marjory vs. Redshift]

Page 33: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

1 Client, 4 months, ~150,000 rows/day (~18m rows total)

[Benchmark chart: Marjory vs. Redshift]

Page 34: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

The Good

• Performs quite well on our typical queries (lots of columns, large subset of rows)

• Sort order matters less than in column stores

• Query time scales with the number of rows unpacked and aggregated, and depends only lightly on the number of columns

• Utilizes resources efficiently for concurrency (Postgres' stinginess can serve us well)

• 8-10x compression for our data (with a bit of extra tuning of our composite type)

• All done in PL/pgSQL etc., no C code required

The Not-So-Good

• Doesn't do as well on general SQL queries; you have to unpack all of the rows

• Not gaining much over a column store if you're accessing only a few columns (one might be able to design it differently, though)

• Doesn't dynamically scale the number of workers to the size of the query (Postgres' stinginess doesn't serve us well for more typical BI cases, but that wasn't what we optimized for)

• Isn't going to do as well when scanning very large numbers of rows (i.e. more typical BI)

• All done in PL/pgSQL etc., no C code required

Page 35: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Trade generality for fit to our use case.

Page 36: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

I’ll Drink to That!

Illustration by Zoe Lubitz

Page 37: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Rollups

SELECT filter1, filter2, … filterN,
       SUM(metric1), … SUM(metricN)
FROM temp_table_for_etl
GROUP BY GROUPING SETS (
    (filter1, filter2, … filterN-1, filterN),
    (filter1, filter2, … filterN-1),
    …
    (filter1, filter2),
    (filter1)
)

INSERT INTO byfilter1 … INSERT INTO byfilter2 …
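GROUPING SETS computes every rollup level in one pass; a hedged sketch (hypothetical column lists) of routing one level into its own table with the GROUPING() function, which returns 1 when a column was aggregated away in a given set:

WITH rolled AS (
    SELECT filter1, filter2,
           SUM(metric1) AS metric1,
           GROUPING(filter2) AS filter2_rolled_up  -- 1 for the (filter1)-only set
    FROM temp_table_for_etl
    GROUP BY GROUPING SETS ((filter1, filter2), (filter1))
)
INSERT INTO byfilter1 (filter1, metric1)
SELECT filter1, metric1
FROM rolled
WHERE filter2_rolled_up = 1;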

Page 38: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

MegaRow

tuple: client | date | segment | byfilter1[ ] | byfilter2[ ] | byfilter3[ ] | byfilter4[ ]

Partition Keys, plus one array of subtype per rollup level

Page 39: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

MegaRow

[Diagram: the same partition keys with the byfilter1[ ]-byfilter4[ ] rollup arrays, each holding a different number of subtype elements]

Page 40: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

MegaRow

[Diagram: the same row with some of the byfilterN[ ] rollup arrays NULL]

Page 41: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Summary Statistics

tuple: client | date | segment | rollup arrays | total_rows | metadata

Partition Keys | Rollup Arrays | Summary Statistics (total_rows, metadata)

Page 42: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL


SELECT date, client, SUM(total_rows) AS rows_per_day
FROM array_table_name
GROUP BY date, client

Count Rows/Day by Client

Page 43: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Distinct Filter Values

tuple: client | date | segment | rollup arrays | total_rows | metadata | distinct-value arrays

Partition Keys | Rollup Arrays | Summary Stats | Distinct Lists

Page 44: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Targeted Reporting Query

SELECT a.date, a.client,
       s.filter1, s.filter2, … s.metricN
FROM array_table_name a,
     LATERAL UNNEST(subtype[]) s (filter1, filter2, … metricN)
WHERE client = 'foo' AND date >= 'bar' AND date <= 'baz'
  AND s.filter1 = 'fizz'


Page 45: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Targeted Reporting Query

SELECT a.date, a.client,
       s.filter1, s.filter2, … s.metricN
FROM array_table_name a,
     LATERAL UNNEST(subtype[]) s (filter1, filter2, … metricN)
WHERE client = 'foo' AND date >= 'bar' AND date <= 'baz'
  AND s.filter1 = 'fizz'


SELECT a.date, a.client,
       s.filter1, s.filter2, … s.metricN
FROM array_table_name a,
     LATERAL UNNEST(subtype[]) s (filter1, filter2, … metricN)
WHERE client = 'foo' AND date >= 'bar' AND date <= 'baz'
  AND s.filter1 = 'fizz'
  AND a.distinct_filter1 @> '{fizz}'::text[]
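A hedged sketch of how such a distinct-value list could be maintained at ETL time, reusing the trimmed-down hypothetical schema from earlier and assuming an added array column:

-- Hypothetical column holding the distinct filter1 values present in the row's array
ALTER TABLE array_table_name ADD COLUMN distinct_filter1 text[];

INSERT INTO array_table_name (date, client, segment, byfilter1, distinct_filter1)
SELECT date, client, segment,
       ARRAY_AGG((filter1, filter2, metric1, metric2)::subtype),
       ARRAY_AGG(DISTINCT filter1)
FROM temp_table_for_etl
GROUP BY date, client, segment;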

Page 46: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Stats

• Marjory (all data since 2012) has about the same on-disk footprint as Elmo (last 33ish days)

• ~20x compression compared to normal-format Postgres (~10x from TOAST + ~2x from avoided storage of rollups)

• 5 Marjory instances, each with all of the data for all time (on locally attached spinning-disk drives), have basically taken over what we had on our Vertica and Redshift instances (at least 16 instances)

• The overall tradeoff is I/O for CPU, so we had to do some tuning to get parallel plans chosen and running properly

Page 47: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

ALTER TABLE array_table_name ALTER client SET STATISTICS 10000;
ALTER TABLE array_table_name ALTER byfilter1 SET STATISTICS 0;
ALTER TABLE array_table_name ALTER byfilter2 SET STATISTICS 0;
...
ALTER TABLE array_table_name ALTER byfilterN SET STATISTICS 0;

Only Do Meaningful Statistics (But Make Them Good)

Useful Tuning Tips

Page 48: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

ALTER TABLE array_table_name ALTER client SET STATISTICS 10000;
ALTER TABLE array_table_name ALTER byfilter1 SET STATISTICS 0;
ALTER TABLE array_table_name ALTER byfilter2 SET STATISTICS 0;
...
ALTER TABLE array_table_name ALTER byfilterN SET STATISTICS 0;

Only Do Meaningful Statistics (But Make Them Good)

Useful Tuning Tips

Make Data-Type Specific Functions For Unnest With Proper Stats

CREATE FUNCTION unnest(byfilter4) RETURNS SETOF array_subtype AS $func$ ... $func$ LANGUAGE plpgsql ROWS 5000 COST 5000;
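A hedged sketch of what such a wrapper might look like for the trimmed-down subtype (the body on the slide is elided; names here are assumptions):

-- Wrap the generic unnest so the planner gets realistic row-count and cost
-- estimates for unpacking one ~5000-element MegaRow array
CREATE FUNCTION unnest_subtype(arr subtype[])
RETURNS SETOF subtype AS $func$
BEGIN
    RETURN QUERY SELECT * FROM pg_catalog.unnest(arr);
END;
$func$ LANGUAGE plpgsql IMMUTABLE
ROWS 5000
COST 5000;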

Page 49: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

ALTER TABLE array_table_name ALTER client SET STATISTICS 10000;
ALTER TABLE array_table_name ALTER byfilter1 SET STATISTICS 0;
ALTER TABLE array_table_name ALTER byfilter2 SET STATISTICS 0;
...
ALTER TABLE array_table_name ALTER byfilterN SET STATISTICS 0;

Only Do Meaningful Statistics (But Make Them Good)

Useful Tuning Tips

Make Data-Type Specific Functions For Unnest With Proper Stats

CREATE FUNCTION unnest(byfilter4) RETURNS SETOF array_subtype AS $func$ ... $func$ LANGUAGE plpgsql ROWS 5000 COST 5000;

min_parallel_relation_size, parallel_setup_cost, parallel_tuple_cost, max_worker_processes, max_parallel_workers_per_gather, cpu_operator_cost?

Futz With Parallelization Parameters Until They Work
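Illustrative values only (the right numbers are hardware- and workload-dependent, and min_parallel_relation_size became min_parallel_table_scan_size in PostgreSQL 10):

ALTER SYSTEM SET max_worker_processes = 16;              -- requires a restart
ALTER SYSTEM SET max_parallel_workers_per_gather = 8;
ALTER SYSTEM SET min_parallel_relation_size = '1MB';     -- default 8MB
ALTER SYSTEM SET parallel_setup_cost = 100;              -- default 1000; lower favors parallel plans
ALTER SYSTEM SET parallel_tuple_cost = 0.01;             -- default 0.1
-- cpu_operator_cost (default 0.0025) may also need a look for CPU-bound unpacking
SELECT pg_reload_conf();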

Page 50: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

Yep. CPU Bound

Page 51: TOASTing an Elephant : Building a Custom Data Warehouse Using PostgreSQL

[email protected]

Illustration by Zoe Lubitz

We’re hiring!

http://grnh.se/os4er71