Cloud Architecture Patterns: Running PostgreSQL at Scale
(when RDS will not do what you need)
Corey Huinker
Corlogic Consulting
December 2018
First, we need a problem to solve.
This is You
You Get An Idea For a Product
You make a product! ...now you have to sell it.
To advertise the product, you need an ad...
...so you talk to an ad agency.
But placing ads has challenges
Need to find websites with visitors who:
● Would want to buy your product
● Are able to buy your product
● Would like the style of your advertisement
A Website's Claims about its Visitors...
...are not always accurate.
Buying ad-space on websites directly is usually not possible. You must use an auction service.
How Modern Ad Tracking Is Done
● Each advertisement is wrapped in a JavaScript program
● The program starts when the web page loads
● The program sends a message every ~5 seconds until the page closes
● The program also sends messages when important events happen
○ Is the advertisement in a space that fits the size of the image?
○ Is the advertisement in a part of the screen that is visible to the user?
○ Did the mouse pass over the ad?
○ Did the video begin to play?
○ Is the audio muted?
○ Did the video finish?
● These messages are collected by a "pixel server" and combined to construct a timeline of the life of the advertisement on that web page
Now your monitored ad reports events.
Focal points of ad monitoring
We want to know:
● How many times did the ad land on a page? ("Impressions")
● How many times did the ad land in a favorable spot on the page?
● How many times did the ad fit into the space allotted?
● How many times was the ad visible for 10 seconds? 20? 30?
● Did the viewer interact with the ad in any way?
The Real Purpose of ad monitoring
We want to know:
● How many of our ads were seen by actual humans? ("Engagement")
● How many of our ads were seen by NHT, Non-Human Traffic? ("bots")
● How do these numbers compare with the claims of the website?
● How do these numbers compare with the claims of the auction service?
Ultimately, we want to know how much of our money was wasted, so we can change where we spend money in the future.
Ad Tracking creates a lot of data
● Not all impressions report their events
○ Default rate is about 3%
○ Full reporting would require 30x the infrastructure for only a small gain in accuracy
○ Customers can pay for a higher sampling rate
● Approximate sampling events recorded per day: 50,000,000,000
○ Full reporting would be 1.5T events per day
● Sampling events are chained together to tell the story of that impression
● Impression data is then aggregated by date, ad campaign, and browser type
● After aggregation, we have about 500M rows per day
● Each row has > 125 viewability measures
The Original Architecture (2013)
Tagged Ads on Web Browsers
Viewable Events
(Billions/day)
Pixel Servers
Stats Cache Aggregators
csv.gz files on S3 (one per customer per day)
Log shipping
Daily Summary Files
100s of files (~100 GB total size)
Redshift (2013-2015)
Vertica (2014-2016)
Daily ETLs
MySQL OLTP
(2013-2014)
MySQL DW DBs (2013-2014)
User Query (2013)
Website Request
Partial Queries
Redshift(2013-2015)
Vertica(2014-2016)
Searches
MySQL DBs - Shard By Date
Stats Cache Aggregators (Stats-Cache API)
MySQL OLTP DB
Partial Queries
Partial Query Results combined at the application level
● 3 dialects of SQL
● 1 custom API
CSV.gz files not accessed directly by queries
Capturing Ad Activity Events
● "Pixel" server: a website that only serves up one 1x1 pixel image
● Captures data about visiting browsers in web logs
● Needs to be fast to not delay user experience or risk losing event data
● Data must be read quickly to give customers real-time results
Tagged Ads on Web Browsers
Viewable Events
(Billions/day)
Pixel Servers
● Number of Servers: ~500
○ Probably too many servers
○ Over-provisioned for reliability
● EC2 Type: t3.xlarge or similar
○ Low CPU workload
○ Low disk I/O workload
○ High network bandwidth
○ Low latency
Real-time Event Accumulation and Aggregation
● Stats Cache machines consume syslogs from Pixel Servers
● Log events from the same browser are combined to form the ad outcome
● Outcomes are aggregated by ad campaign, product brand, etc. (10 different aggregations)
● Each Stats Cache is an incomplete shard of today's data
● At end of day, all shard data is combined into one CSV per customer
Pixel Servers
Stats Cache
S3 - CSVs
Log shipping
Daily Summary Files
100s of files (~100 GB total size)
● Number of Servers: ~450
● Custom in-memory DB
● No disk storage of data
● CPU load: nearly 100%
● Cannot use swap
● EC2 type: r5.2xlarge or similar
Real-time Stats Reporting (2013)
Aggregation in Application Code
API Response
API Request
Stats Cache Shards
● Aggregating in application code
○ high memory usage
○ high network usage
○ high potential for error
Non-Real-time Data Reporting (2013)
Aggregation in Application Code
API Response
API Request
MySQL Shards(by date)
Redshift(for very large clients)
MySQL OLTP DB
First Steps to Fix MySQL OLTP DB
● Converted to PostgreSQL 9.4 - Logical Replication not yet available
● Conversion took 2-4 weeks using 2 programmers
● Added triggers on MySQL tables to identify modified rows
● Used mysql_fdw to create migration tables on PostgreSQL
● Created each new PostgreSQL table as SELECT * FROM the foreign table
● Scheduled tasks update PostgreSQL by reading new records in trigger tables
● Moved read-only workloads to the PostgreSQL instance
● Migrated read-write apps in stages
● Only downtime was in the final cut-over
● Final system: single 32-core EC2 master with 1-2 physical read replicas
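The trigger-plus-mysql_fdw pattern above can be sketched roughly as below. All table, host, and credential names are hypothetical, and the MySQL-side trigger is shown only as a comment; this is a sketch of the technique, not the actual migration code. Note the 9.4-era constraints: no IMPORT FOREIGN SCHEMA (9.5+) and no ON CONFLICT, hence the delete-then-insert refresh.

```sql
-- On the MySQL side, a trigger records modified keys, e.g.:
--   CREATE TRIGGER orders_mod AFTER UPDATE ON orders FOR EACH ROW
--     INSERT INTO orders_changes (id, changed_at) VALUES (NEW.id, NOW());

-- On the PostgreSQL side, expose the MySQL tables via mysql_fdw:
CREATE EXTENSION mysql_fdw;

CREATE SERVER legacy_mysql FOREIGN DATA WRAPPER mysql_fdw
    OPTIONS (host 'mysql.internal', port '3306');

CREATE USER MAPPING FOR CURRENT_USER SERVER legacy_mysql
    OPTIONS (username 'migrator', password 'secret');

CREATE FOREIGN TABLE mysql_orders (id bigint, customer_id bigint, updated_at timestamp)
    SERVER legacy_mysql OPTIONS (dbname 'app', table_name 'orders');

CREATE FOREIGN TABLE mysql_orders_changes (id bigint, changed_at timestamp)
    SERVER legacy_mysql OPTIONS (dbname 'app', table_name 'orders_changes');

-- One-time bulk copy:
CREATE TABLE orders AS SELECT * FROM mysql_orders;

-- Scheduled incremental refresh: re-copy only rows the MySQL trigger flagged
BEGIN;
DELETE FROM orders WHERE id IN (SELECT id FROM mysql_orders_changes);
INSERT INTO orders
    SELECT * FROM mysql_orders
    WHERE id IN (SELECT id FROM mysql_orders_changes);
COMMIT;
```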
Next To Fix: MySQL Data Warehouse shards
● Performed adequately when daily volume was < 1% of current volume
● Impossible to add new columns to tables
● Easier to create a new shard than to modify an existing one
● New metrics being added every few weeks or days (over 100 metrics)
● Dozens of shards; some cover a month of data, others only a few days
● Each new shard adds workload to the application-level aggregator
Understanding User Interest In Data
[Chart: user interest in data vs. age of data - today's real-time data, yesterday, 2-7 days ago, 8-30 days ago, older]
● 85% of API requests are for data <= 7 days old
● This follows Zipf's Law: https://en.wikipedia.org/wiki/Zipf's_law
● Conclusions:
○ put newest data on fastest servers
○ move older data onto fewer, slower servers
Postgres For The Most Needed Data
● Vanilla PostgreSQL 9.4 instance
● i3.8xlarge or similar: 32 cores, 240GB RAM, 5TB disk
● Data partitioned by day
● Drop any partitions > 10 days old
● All data is a copy of data in S3, so no need for backups
● Focus on loading the data as quickly as possible (< 2 hours)
● Smaller customers' data available earlier
● Adjust application logic to make this data visible earlier
● Codename: L7
CSV files on S3 (one per customer+date) → Daily ETL → L7 DB
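On 9.4, day partitioning of this kind would have used table inheritance plus CHECK constraints (declarative partitioning only arrived in v10). A minimal sketch, with hypothetical table and column names:

```sql
CREATE TABLE impressions (
    event_date  date   NOT NULL,
    customer_id bigint NOT NULL,
    metrics     numeric[]          -- placeholder for the >125 measures
);

-- One child table per day; the CHECK constraint lets constraint exclusion
-- skip irrelevant days at query time
CREATE TABLE impressions_20181201 (
    CHECK (event_date = DATE '2018-12-01')
) INHERITS (impressions);

-- Daily ETL loads directly into the child table
COPY impressions_20181201 FROM '/staging/2018-12-01.csv' CSV;

-- Retention is a metadata operation: instant, no VACUUM churn
DROP TABLE impressions_20181121;  -- partition now > 10 days old
```

Dropping a whole partition, rather than DELETEing rows, is what makes the "keep only 10 days" policy cheap.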
What Didn't Work: Redshift
● Intended to complement MySQL
● Performed adequately when daily volume was < 1% of current volume
● Needed sub-second response, was getting 30s+ response
● Was the only machine that had a copy of data across all time
● HDD was slow; tried SSD instances, but had limited space
● Eventually grew to a 26-node cluster with 32 cores per node
● Could not distinguish a large query from a small one
● Had no insight into how the data was partitioned
● Reorganizing data according to AWS suggestions would have resulted in vacuums taking several days
What Didn't Work: Vertica
● Intended to complement MySQL
● Good response times over larger data volumes
● Needed local disk to perform adequately, which limited disk size
● Each cluster could only hold a few months of data
● 5-node clusters, 32 cores each
● Could only have K-safety of 1, or else load took too long (2 hrs vs 10)
● Nodes failed daily until a glibc bug was fixed
● Expensive
Storing More History with PostgreSQL
● Goal: increase storage of L7 to replace Vertica and/or Redshift
● Combined 30 small EBS drives via RAID-0 to make one 30TB drive
○ This method had more IOPS than a single provisioned EBS drive of the same size
● Same hardware as an L7 could now store ~40 days of data
● As the number of customers increased, 40 days would shrink to 25
● Same strategy as L7, just keep the data longer
● Codename: Elmo - it stores "mo" (more) data
CSV files on S3 (one per customer+date) → Daily ETL → Elmo Clusters
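The 30-drive RAID-0 arrangement can be sketched as an ops fragment; device names and filesystem choice are hypothetical, and it assumes 30 equally-sized EBS volumes are already attached:

```
# Stripe 30 EBS volumes into one md device; IOPS add up across members,
# and any single volume failure loses the array (acceptable: data is a
# copy of what lives in S3, per the L7/Elmo design)
mdadm --create /dev/md0 --level=0 --raid-devices=30 \
      /dev/xvd[b-z] /dev/xvda[a-e]
mkfs.xfs /dev/md0
mount /dev/md0 /var/lib/postgresql
```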
Typeahead Search
What we need:
● "Type-ahead" search queries, like Google search autocomplete
● Query must finish in < 100ms
● Queries can be across any time range, so all customer data must be covered
● Not all statistics are needed
● Only show best 10 matches
Typeahead Search
What we did:
● Re-structured data to only store each searchable text string once
● Combined all data for a customer's day into one row using arrays
● PostgreSQL will compress those arrays via TOAST
● When compressed, all data can fit in 40TB
● Use btree_gin indexes for full-text search
● All search ETL handled by one 32-core machine (i3.8xlarge)
● All search requests handled by 2 replicas (i3.8xlarge)
CSV files on S3 (one per customer+date) → Daily ETL → Search DB
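A minimal sketch of the search layout described above, with hypothetical table and column names (the real schema is not shown in the talk):

```sql
CREATE EXTENSION btree_gin;   -- lets scalar columns join a GIN index

CREATE TABLE search_terms (
    customer_id  bigint,
    day          date,
    term         text,        -- each searchable string stored once
    impressions  bigint[],    -- per-metric arrays, TOAST-compressed
    term_tsv     tsvector     -- precomputed from term
);

-- One GIN index covers both the customer filter and the text match
CREATE INDEX ON search_terms USING gin (customer_id, term_tsv);

-- Type-ahead query: prefix match, best 10 rows, target < 100 ms
SELECT term
FROM search_terms
WHERE customer_id = 42
  AND term_tsv @@ to_tsquery('simple', 'acme:*')
LIMIT 10;
```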
Applying TOAST to Regular Data
● Combined all of a customer's data for one day into one row with arrays
● TOAST compression shifts workload from scarce IOPS to abundant CPU
● Some customers' data too large for a single row
● Split those customers' data into several "chunk" rows
● Used same hardware as other instances (i3.8xlarge)
● Same RAID-0 as used in the Elmo instances could now hold all customer data
● ETL too slow to be handled by just one machine (compression takes time)
● 5 32-core machines with an ETL load-sharing feature: each one processes a client/day, then shares it with the other nodes
● Replaced all Redshift (1 5-node cluster) and Vertica (9 5-node clusters) instances!
● Big cost savings
● Codename: Marjory (Elmo and Marjory are Muppets from Jim Henson TV shows)
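The "chunk row" idea can be illustrated as below; all names are hypothetical, and the point is only the shape: one customer/day becomes a handful of wide, TOAST-compressed rows instead of millions of narrow ones.

```sql
CREATE TABLE daily_rollup (
    customer_id bigint NOT NULL,
    day         date   NOT NULL,
    chunk_no    int    NOT NULL DEFAULT 0,  -- 0..n when one row is too big
    dim_keys    text[],     -- aggregation keys packed into this chunk
    measures    numeric[],  -- >125 metrics per key, compressed via TOAST
    PRIMARY KEY (customer_id, day, chunk_no)
);

-- Reading a customer/day = fetching a few chunk rows, not millions of rows
SELECT dim_keys, measures
FROM daily_rollup
WHERE customer_id = 42 AND day = DATE '2017-06-01'
ORDER BY chunk_no;
```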
Applying TOAST to Regular Data
CSV files on S3 (one per customer+date) → Daily ETL → Marjory DBs, with compressed data sharing between nodes
APIs to Foreign Data Wrappers
● The Stats-Cache API data must be added to any data which is fetched from PostgreSQL
● Existing in-memory database written in Python
● The re-aggregation of this data was handled in regular code, not SQL
● This is slow and error-prone
● We created a Foreign Data Wrapper using the multicorn Python API
● The FDW takes the SQL query and makes an API call, then puts the results in a result set
● The API now looks like a set of PostgreSQL tables
● Aggregation in SQL is much faster
● Code much simpler
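The SQL side of a multicorn-based wrapper looks roughly like this. The Python class `statscache_fdw.StatsCacheFDW` and the `endpoint` option are hypothetical stand-ins for the talk's actual wrapper; only the `CREATE EXTENSION`/`CREATE SERVER ... OPTIONS (wrapper ...)` pattern is multicorn's documented usage.

```sql
CREATE EXTENSION multicorn;

-- 'wrapper' names a Python class implementing multicorn's
-- ForeignDataWrapper interface; its execute() makes the HTTP call
CREATE SERVER stats_cache FOREIGN DATA WRAPPER multicorn
    OPTIONS (wrapper 'statscache_fdw.StatsCacheFDW');

CREATE FOREIGN TABLE stats_cache_today (
    customer_id bigint,
    campaign_id bigint,
    impressions bigint,
    viewable    bigint
) SERVER stats_cache
  OPTIONS (endpoint 'http://stats-cache.internal/v1/query');

-- Re-aggregation is now plain SQL instead of application code:
SELECT campaign_id, sum(impressions), sum(viewable)
FROM stats_cache_today
WHERE customer_id = 42
GROUP BY campaign_id;
```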
Complex Foreign Data Wrappers
● Codename: Frackles
● Store csv.gz data in compressed SQLite files on S3
● Each query starts a web server
● Start one AWS Lambda per customer/day
● Each lambda fetches and queries one SQLite file
● Results are reported back to the web server
● Web server aggregates results and returns them as a result set
● Queries are slow, but data is available sooner
● Very short ETL, but queries are slower than on dedicated servers
● Very good for queries across long date ranges
● AWS now offers Athena, a similar (but costly) service
SQL Query → Frackles FDW → many AWS Lambdas, each reading one SQLite file on S3
Other Tools: PMPP
● Poor Man's Parallel Processing
● https://github.com/coreyhuinker/pmpp
● Written by me
● First written in PL/pgSQL, but re-coded in C for performance reasons
● Set-returning function that takes db names + queries as input
● Allows an application to send multiple queries in parallel to multiple servers
● All the queries have the same shape (columns, types)
● User can re-aggregate data returned from the set-returning function
● Any machine that talks libpq could be queried (PgSQL, Vertica, Redshift)
● Allows for partial aggregation on DW boxes
● Secondary aggregation can occur on the local machine
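A PMPP call has roughly the shape below; treat the exact signature as illustrative and check the project README, and note that `partial_result` and the query/connection strings are hypothetical. Every distributed query must return the same row shape, and the outer GROUP BY performs the secondary aggregation locally.

```sql
-- Assume: CREATE TYPE partial_result AS (campaign_id bigint, impressions bigint);
SELECT campaign_id, sum(impressions) AS impressions
FROM pmpp.distribute(
        null::partial_result,           -- row shape shared by all queries
        'host=elmo1 dbname=stats',      -- target to query in parallel
        array[
          $$SELECT campaign_id, sum(impressions)
            FROM impressions_20170601 GROUP BY campaign_id$$,
          $$SELECT campaign_id, sum(impressions)
            FROM impressions_20170602 GROUP BY campaign_id$$
        ])
GROUP BY campaign_id;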
Other Tools: Decanters
● Large queries can exhaust memory on an application machine
● A decanter lets wine "breathe"; these machines let the data "breathe"
● Abundant CPUs, abundant memory per CPU, minimal disk
● Some very small lookup tables replicated for performance reasons
● All other local tables are FDWs to the OLTP database (postgres_fdw)
● Common use: big PMPP query to Stats-Cache, Elmo, Marjory, Frackles, each one doing a local aggregation
● Final aggregation happens on the decanter
● Can occasionally experience OOM (rather than an important machine doing so)
● A new decanter can spin up and enter the load balancer in 5 minutes
● No engineering time to be spent rescuing failed decanters
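Provisioning a decanter is mostly postgres_fdw boilerplate; a sketch with hypothetical host and credential names (IMPORT FOREIGN SCHEMA requires 9.5+):

```sql
CREATE EXTENSION postgres_fdw;

CREATE SERVER oltp FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'oltp.internal', dbname 'app');

CREATE USER MAPPING FOR CURRENT_USER SERVER oltp
    OPTIONS (user 'decanter', password 'secret');

-- Everything except the tiny replicated lookup tables lives remotely;
-- pull in the OLTP schema as foreign tables
IMPORT FOREIGN SCHEMA public FROM SERVER oltp INTO public;
```

Because a decanter holds no unique state, one that OOMs can simply be discarded and replaced.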
ETL Process (2017)
Tagged Ads → Viewable Events
Pixel Servers
Stats Aggregators
S3 - CSVs
S3 - SQLite
Log shipping
Daily Summaries
Elmo Clusters
Marjory Clusters
Search Clusters
Daily ETLs
User Queries (2017)
User Stats Requests
Elmo Clusters
MarjoryClusters
S3 - SQLite
PMPP Requests
OLTP DB
Third Party DW
Search Clusters
Searches
Pg FDW
Frackles FDW
Pg FDW
Decanters
Live Stats Aggregators
Stats-Cache FDW
Why Not RDS?
● No ability to install custom extensions (PMPP, pg_partman, etc.)
● No place to do local \copy operations
● Reduced insight into the server load (this is better now with RDS Performance Insights)
● Reduced ability to tune the pg server
● No ability to try beta versions
● Expense
Why Not Aurora?
● Had early adopter access
● AWS devs said that it was not geared for DW workloads
● I/O sometimes good, sometimes bad
● Wasn't ready yet
● Data volumes necessitate advanced partitioning
● Advanced partitioning was not available until v10
● Expense
Why Not Athena?
● Athena had no concept of constraint exclusion to avoid reading irrelevant files
● Costs $5/TB of data read
● Most queries would cost > $100 each
● Running thousands of queries per hour
Questions?