Cloud Architecture Patterns: Running PostgreSQL at Scale
(when RDS will not do what you need)
Corey Huinker
Corlogic Consulting
December 2018
First, we need a problem to solve.
This is You
You Get An Idea For a Product
You make a product! ...now you have to sell it.
To advertise the product, you need an ad...
...so you talk to an ad agency.
But placing ads has challenges
Need to find websites with visitors who:
● Would want to buy your product
● Are able to buy your product
● Would like the style of your advertisement
A Website's Claims about its Visitors...
...are not always accurate.
Buying ad-space on websites directly is usually not possible. You must use an auction service.
How Modern Ad Tracking Is Done
● Each advertisement is wrapped in a JavaScript program
● The program starts when the web page loads
● The program sends a message every ~5 seconds until the page closes
● The program also sends messages when important events happen
○ Is the advertisement in a space that fits the size of the image?
○ Is the advertisement in a part of the screen that is visible to the user?
○ Did the mouse pass over the ad?
○ Did the video begin to play?
○ Is the audio muted?
○ Did the video finish?
● These messages are collected by a "pixel server" and combined to construct a timeline of the life of the advertisement on that web page
Now your monitored ad reports events.
Focal points of ad monitoring
We want to know:
● How many times did the ad land on a page? ("Impressions")
● How many times did the ad land in a favorable spot on the page?
● How many times did the ad fit into the space allotted?
● How many times was the ad visible for 10 seconds? 20? 30?
● Did the viewer interact with the ad in any way?
The Real Purpose of ad monitoring
We want to know:
● How many of our ads were seen by actual humans? ("Engagement")
● How many of our ads were seen by NHT, Non-Human Traffic? ("bots")
● How do these numbers compare with the claims of the website?
● How do these numbers compare with the claims of the auction service?
Ultimately, we want to know how much of our money was wasted, so we can change where we spend money in the future.
Ad Tracking creates a lot of data
● Not all impressions report their events
○ Default rate is about 3%
○ Full reporting would require 30x the infrastructure for only a small gain in accuracy
○ Customers can pay for a higher sampling rate
● Approximate sampling events recorded per day: 50,000,000,000
○ Full reporting would be 1.5T events per day
● Sampling events are chained together to tell the story of that impression
● Impression data is then aggregated by date, ad campaign, and browser type
● After aggregation, we have about 500M rows per day
● Each row has > 125 viewability measures
The Original Architecture (2013)
Tagged Ads on Web Browsers
Viewable Events
(Billions/day)
Pixel Servers
Stats Cache Aggregators
csv.gz files on S3 (one per customer per day)
Log shipping
Daily Summary Files
100s of files (~100 GB total size)
Redshift (2013-2015)
Vertica (2014-2016)
Daily ETLs
MySQL OLTP
(2013-2014)
MySQL DW DBs (2013-2014)
User Query (2013)
Website Request
Partial Queries
Redshift(2013-2015)
Vertica(2014-2016)
Searches
MySQL DBs - Shard By Date
Stats Cache Aggregators (Stats-Cache API)
MySQL OLTP DB
Partial Queries
Partial Query Results combined at the application level
● 3 dialects of SQL
● 1 custom API
CSV.gz files not accessed directly by queries
Capturing Ad Activity Events
● "Pixel" server: a website that only serves up one 1x1 pixel image
● Captures data about visiting browsers in web logs
● Needs to be fast to not delay user experience or risk losing event data
● Data must be read quickly to give customers real-time results
Tagged Ads on Web Browsers
Viewable Events
(Billions/day)
Pixel Servers
● Number of Servers: ~500
○ Probably too many servers
○ Over-provisioned for reliability
● EC2 Type: t3.xlarge or similar
○ Low CPU workload
○ Low disk I/O workload
○ High network bandwidth
○ Low latency
Real-time Event Accumulation and Aggregation
● Stats Cache machines consume syslogs from Pixel Servers
● Log events from the same browser are combined to form the ad outcome
● Outcomes are aggregated by ad campaign, product brand, etc. (10 different aggregations)
● Each Stats Cache is an incomplete shard of today's data
● At end of day, all shard data is combined into one CSV per customer
Pixel Servers
Stats Cache
S3 - CSVs
Log shipping
Daily Summary Files
100s of files (~100 GB total size)
● Number of Servers: ~450
● Custom in-memory DB
● No disk storage of data
● CPU load: nearly 100%
● Cannot use swap
● EC2 type: r5.2xlarge or similar
Real-time Stats Reporting (2013)
Aggregation in Application Code
API Response
API Request
Stats Cache Shards
● Aggregating in application code
○ high memory usage
○ high network usage
○ high potential for error
Non-Real-time Data Reporting (2013)
Aggregation in Application Code
API Response
API Request
MySQL Shards(by date)
Redshift(for very large clients)
MySQL OLTP DB
First Steps to Fix MySQL OLTP DB
● Converted to PostgreSQL 9.4 - Logical Replication not yet available
● Conversion took 2-4 weeks using 2 programmers
● Added triggers on MySQL tables to identify modified rows
● Used mysql_fdw to create migration tables on PostgreSQL
● Created each new PostgreSQL table as SELECT * FROM the foreign table
● Scheduled tasks update PostgreSQL by reading new records in trigger tables
● Moved read-only workloads to the PostgreSQL instance
● Migrated read-write apps in stages
● Only downtime was in the final cut-over
● Final system: single 32-core EC2 master with 1-2 physical read replicas
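The trigger-plus-mysql_fdw pattern above can be sketched roughly as below. All table, host, and credential names are hypothetical, and the MySQL-side trigger is shown only as a comment; this is a sketch of the technique, not the actual migration code. Note the 9.4-era constraints: no IMPORT FOREIGN SCHEMA (9.5+) and no ON CONFLICT, hence the delete-then-insert refresh.

```sql
-- On the MySQL side, a trigger records modified keys, e.g.:
--   CREATE TRIGGER orders_mod AFTER UPDATE ON orders FOR EACH ROW
--     INSERT INTO orders_changes (id, changed_at) VALUES (NEW.id, NOW());

-- On the PostgreSQL side, expose the MySQL tables via mysql_fdw:
CREATE EXTENSION mysql_fdw;

CREATE SERVER legacy_mysql FOREIGN DATA WRAPPER mysql_fdw
    OPTIONS (host 'mysql.internal', port '3306');

CREATE USER MAPPING FOR CURRENT_USER SERVER legacy_mysql
    OPTIONS (username 'migrator', password 'secret');

CREATE FOREIGN TABLE mysql_orders (id bigint, customer_id bigint, updated_at timestamp)
    SERVER legacy_mysql OPTIONS (dbname 'app', table_name 'orders');

CREATE FOREIGN TABLE mysql_orders_changes (id bigint, changed_at timestamp)
    SERVER legacy_mysql OPTIONS (dbname 'app', table_name 'orders_changes');

-- One-time bulk copy:
CREATE TABLE orders AS SELECT * FROM mysql_orders;

-- Scheduled incremental refresh: re-copy only rows the MySQL trigger flagged
BEGIN;
DELETE FROM orders WHERE id IN (SELECT id FROM mysql_orders_changes);
INSERT INTO orders
    SELECT * FROM mysql_orders
    WHERE id IN (SELECT id FROM mysql_orders_changes);
COMMIT;
```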
Next To Fix: MySQL Data Warehouse shards
● Performed adequately when daily volume was < 1% of current volume
● Impossible to add new columns to tables
● Easier to create a new shard than to modify an existing one
● New metrics being added every few weeks or days (over 100 metrics)
● Dozens of shards; some cover a month of data, others only a few days
● Each new shard adds workload to the application-level aggregator
Understanding User Interest In Data
[Chart: user interest in data vs. age of data - today's real-time data, yesterday, 2-7 days ago, 8-30 days ago, older]
● 85% of API requests are for data <= 7 days old
● This follows Zipf's Law: https://en.wikipedia.org/wiki/Zipf's_law
● Conclusions:
○ put newest data on fastest servers
○ move older data onto fewer, slower servers
Postgres For The Most Needed Data
● Vanilla PostgreSQL 9.4 instance
● i3.8xlarge or similar: 32 cores, 240GB RAM, 5TB disk
● Data partitioned by day
● Drop any partitions > 10 days old
● All data is a copy of data in S3, so no need for backups
● Focus on loading the data as quickly as possible (< 2 hours)
● Smaller customers' data available earlier
● Adjust application logic to make this data visible earlier
● Codename: L7
CSV files on S3 (one per customer+date) → Daily ETL → L7 DB
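On 9.4, day partitioning of this kind would have used table inheritance plus CHECK constraints (declarative partitioning only arrived in v10). A minimal sketch, with hypothetical table and column names:

```sql
CREATE TABLE impressions (
    event_date  date   NOT NULL,
    customer_id bigint NOT NULL,
    metrics     numeric[]          -- placeholder for the >125 measures
);

-- One child table per day; the CHECK constraint lets constraint exclusion
-- skip irrelevant days at query time
CREATE TABLE impressions_20181201 (
    CHECK (event_date = DATE '2018-12-01')
) INHERITS (impressions);

-- Daily ETL loads directly into the child table
COPY impressions_20181201 FROM '/staging/2018-12-01.csv' CSV;

-- Retention is a metadata operation: instant, no VACUUM churn
DROP TABLE impressions_20181121;  -- partition now > 10 days old
```

Dropping a whole partition, rather than DELETEing rows, is what makes the "keep only 10 days" policy cheap.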
What Didn't Work: Redshift
● Intended to complement MySQL
● Performed adequately when daily volume was < 1% of current volume
● Needed sub-second response, was getting 30s+ response
● Was the only machine that had a copy of data across all time
● HDD was slow; tried SSD instances, but had limited space
● Eventually grew to a 26-node cluster with 32 cores per node
● Could not distinguish a large query from a small one
● Had no insight into how the data was partitioned
● Reorganizing data according to AWS suggestions would have resulted in vacuums taking several days
What Didn't Work: Vertica
● Intended to complement MySQL
● Good response times over larger data volumes
● Needed local disk to perform adequately, which limited disk size
● Each cluster could only hold a few months of data
● 5-node clusters, 32 cores each
● Could only have K-safety of 1, or else load took too long (2 hrs vs 10)
● Nodes failed daily until a glibc bug was fixed
● Expensive
Storing More History with PostgreSQL
● Goal: increase storage of L7 to replace Vertica and/or Redshift
● Combined 30 small EBS drives via RAID-0 to make one 30TB drive
○ This method had more IOPS than a single provisioned EBS drive of the same size
● Same hardware as an L7 could now store ~40 days of data
● As the number of customers increased, 40 days would shrink to 25
● Same strategy as L7, just keep the data longer
● Codename: Elmo - it stores "mo" (more) data
CSV files on S3 (one per customer+date) → Daily ETL → Elmo Clusters
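The 30-drive RAID-0 arrangement can be sketched as an ops fragment; device names and filesystem choice are hypothetical, and it assumes 30 equally-sized EBS volumes are already attached:

```
# Stripe 30 EBS volumes into one md device; IOPS add up across members,
# and any single volume failure loses the array (acceptable: data is a
# copy of what lives in S3, per the L7/Elmo design)
mdadm --create /dev/md0 --level=0 --raid-devices=30 \
      /dev/xvd[b-z] /dev/xvda[a-e]
mkfs.xfs /dev/md0
mount /dev/md0 /var/lib/postgresql
```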
Typeahead Search
What we need:
● "Type-ahead" search queries, like Google search autocomplete
● Query must finish in < 100ms
● Queries can be across any time range, so all customer data must be covered
● Not all statistics are needed
● Only show best 10 matches
Typeahead Search
What we did:
● Re-structured data to only store each searchable text string once
● Combined all data for a customer's day into one row using arrays
● PostgreSQL will compress those arrays via TOAST
● When compressed, all data can fit in 40TB
● Use btree_gin indexes for full-text search
● All search ETL handled by one 32-core machine (i3.8xlarge)
● All search requests handled by 2 replicas (i3.8xlarge)
CSV files on S3 (one per customer+date) → Daily ETL → Search DB
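A minimal sketch of the search layout described above, with hypothetical table and column names (the real schema is not shown in the talk):

```sql
CREATE EXTENSION btree_gin;   -- lets scalar columns join a GIN index

CREATE TABLE search_terms (
    customer_id  bigint,
    day          date,
    term         text,        -- each searchable string stored once
    impressions  bigint[],    -- per-metric arrays, TOAST-compressed
    term_tsv     tsvector     -- precomputed from term
);

-- One GIN index covers both the customer filter and the text match
CREATE INDEX ON search_terms USING gin (customer_id, term_tsv);

-- Type-ahead query: prefix match, best 10 rows, target < 100 ms
SELECT term
FROM search_terms
WHERE customer_id = 42
  AND term_tsv @@ to_tsquery('simple', 'acme:*')
LIMIT 10;
```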
Applying TOAST to Regular Data
● Combined all of a customer's data for one day into one row with arrays
● TOAST compression shifts workload from scarce IOPS to abundant CPU
● Some customers' data too large for a single row
● Split those customers' data into several "chunk" rows
● Used same hardware as other instances (i3.8xlarge)
● Same RAID-0 as used in the Elmo instances could now hold all customer data
● ETL too slow to be handled by just one machine (compression takes time)
● 5 32-core machines with an ETL load-sharing feature: each one processes a client/day, then shares it with the other nodes
● Replaced all Redshift (1 5-node cluster) and Vertica (9 5-node clusters) instances!
● Big cost savings
● Codename: Marjory (Elmo and Marjory are Muppets from Jim Henson TV shows)
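The "chunk row" idea can be illustrated as below; all names are hypothetical, and the point is only the shape: one customer/day becomes a handful of wide, TOAST-compressed rows instead of millions of narrow ones.

```sql
CREATE TABLE daily_rollup (
    customer_id bigint NOT NULL,
    day         date   NOT NULL,
    chunk_no    int    NOT NULL DEFAULT 0,  -- 0..n when one row is too big
    dim_keys    text[],     -- aggregation keys packed into this chunk
    measures    numeric[],  -- >125 metrics per key, compressed via TOAST
    PRIMARY KEY (customer_id, day, chunk_no)
);

-- Reading a customer/day = fetching a few chunk rows, not millions of rows
SELECT dim_keys, measures
FROM daily_rollup
WHERE customer_id = 42 AND day = DATE '2017-06-01'
ORDER BY chunk_no;
```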
Applying TOAST to Regular Data
CSV files on S3 (one per customer+date) → Daily ETL → Marjory DBs, with compressed data sharing between nodes
APIs to Foreign Data Wrappers
● The Stats-Cache API data must be added to any data which is fetched from PostgreSQL
● Existing in-memory database written in Python
● The re-aggregation of this data was handled in regular code, not SQL
● This is slow and error-prone
● We created a Foreign Data Wrapper using the multicorn Python API
● The FDW takes the SQL query and makes an API call, then puts the results in a result set
● The API now looks like a set of PostgreSQL tables
● Aggregation in SQL is much faster
● Code much simpler
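The SQL side of a multicorn-based wrapper looks roughly like this. The Python class `statscache_fdw.StatsCacheFDW` and the `endpoint` option are hypothetical stand-ins for the talk's actual wrapper; only the `CREATE EXTENSION`/`CREATE SERVER ... OPTIONS (wrapper ...)` pattern is multicorn's documented usage.

```sql
CREATE EXTENSION multicorn;

-- 'wrapper' names a Python class implementing multicorn's
-- ForeignDataWrapper interface; its execute() makes the HTTP call
CREATE SERVER stats_cache FOREIGN DATA WRAPPER multicorn
    OPTIONS (wrapper 'statscache_fdw.StatsCacheFDW');

CREATE FOREIGN TABLE stats_cache_today (
    customer_id bigint,
    campaign_id bigint,
    impressions bigint,
    viewable    bigint
) SERVER stats_cache
  OPTIONS (endpoint 'http://stats-cache.internal/v1/query');

-- Re-aggregation is now plain SQL instead of application code:
SELECT campaign_id, sum(impressions), sum(viewable)
FROM stats_cache_today
WHERE customer_id = 42
GROUP BY campaign_id;
```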
Complex Foreign Data Wrappers
● Codename: Frackles
● Store csv.gz data in compressed SQLite files on S3
● Each query starts a web server
● Start one AWS Lambda per customer/day
● Each lambda fetches and queries one SQLite file
● Results are reported back to the web server
● Web server aggregates results and returns them as a result set
● Queries are slow, but data is available sooner
● Very short ETL, but queries are slower than on dedicated servers
● Very good for queries across long date ranges
● AWS now offers Athena, a similar (but costly) service
SQL Query → Frackles FDW → many AWS Lambdas, each reading one SQLite file on S3
Other Tools: PMPP
● Poor Man's Parallel Processing
● https://github.com/coreyhuinker/pmpp
● Written by me
● First written in PL/pgSQL, but re-coded in C for performance reasons
● Set-returning function that takes db names + queries as input
● Allows an application to send multiple queries in parallel to multiple servers
● All the queries have the same shape (columns, types)
● User can re-aggregate data returned from the set-returning function
● Any machine that talks libpq could be queried (PgSQL, Vertica, Redshift)
● Allows for partial aggregation on DW boxes
● Secondary aggregation can occur on the local machine
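A PMPP call has roughly the shape below; treat the exact signature as illustrative and check the project README, and note that `partial_result` and the query/connection strings are hypothetical. Every distributed query must return the same row shape, and the outer GROUP BY performs the secondary aggregation locally.

```sql
-- Assume: CREATE TYPE partial_result AS (campaign_id bigint, impressions bigint);
SELECT campaign_id, sum(impressions) AS impressions
FROM pmpp.distribute(
        null::partial_result,           -- row shape shared by all queries
        'host=elmo1 dbname=stats',      -- target to query in parallel
        array[
          $$SELECT campaign_id, sum(impressions)
            FROM impressions_20170601 GROUP BY campaign_id$$,
          $$SELECT campaign_id, sum(impressions)
            FROM impressions_20170602 GROUP BY campaign_id$$
        ])
GROUP BY campaign_id;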
Other Tools: Decanters
● Large queries can exhaust memory on an application machine
● A decanter lets wine "breathe"; these machines let the data "breathe"
● Abundant CPUs, abundant memory per CPU, minimal disk
● Some very small lookup tables replicated for performance reasons
● All other local tables are FDWs to the OLTP database (postgres_fdw)
● Common use: big PMPP query to Stats-Cache, Elmo, Marjory, Frackles, each one doing a local aggregation
● Final aggregation happens on the decanter
● Can occasionally experience OOM (rather than an important machine doing so)
● A new decanter can spin up and enter the load balancer in 5 minutes
● No engineering time to be spent rescuing failed decanters
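Provisioning a decanter is mostly postgres_fdw boilerplate; a sketch with hypothetical host and credential names (IMPORT FOREIGN SCHEMA requires 9.5+):

```sql
CREATE EXTENSION postgres_fdw;

CREATE SERVER oltp FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'oltp.internal', dbname 'app');

CREATE USER MAPPING FOR CURRENT_USER SERVER oltp
    OPTIONS (user 'decanter', password 'secret');

-- Everything except the tiny replicated lookup tables lives remotely;
-- pull in the OLTP schema as foreign tables
IMPORT FOREIGN SCHEMA public FROM SERVER oltp INTO public;
```

Because a decanter holds no unique state, one that OOMs can simply be discarded and replaced.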
ETL Process (2017)
Tagged Ads → Viewable Events
Pixel Servers
Stats Aggregators
S3 - CSVs
S3 - SQLite
Log shipping
Daily Summaries
Elmo Clusters
Marjory Clusters
Search Clusters
Daily ETLs
User Queries (2017)
User Stats Requests
Elmo Clusters
MarjoryClusters
S3 - SQLite
PMPP Requests
OLTP DB
Third Party DW
Search Clusters
Searches
Pg FDW
Frackles FDW
Pg FDW
Decanters
Live Stats Aggregators
Stats-Cache FDW
Why Not RDS?
● No ability to install custom extensions (PMPP, pg_partman, etc.)
● No place to do local \copy operations
● Reduced insight into the server load (this is better now with RDS Performance Insights)
● Reduced ability to tune the pg server
● No ability to try beta versions
● Expense
Why Not Aurora?
● Had early adopter access
● AWS devs said that it was not geared for DW workloads
● I/O sometimes good, sometimes bad
● Wasn't ready yet
● Data volumes necessitate advanced partitioning
● Advanced partitioning was not available until v10
● Expense
Why Not Athena?
● Athena had no concept of constraint exclusion to avoid reading irrelevant files
● Costs $5/TB of data read
● Most queries would cost > $100 each
● Running thousands of queries per hour
Questions?