© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Mark McBride, Senior Software Engineer, Capital Games, Electronic Arts
Bill Weiner, SVP Operations, 47Lining
11/28/16
How EA Leveraged Amazon Redshift and AWS
Partner 47Lining to Gather Meaningful Player
Insights - GAM301
Speakers
Mark McBride, Senior Software Engineer, Capital Games, Electronic Arts
Bill Weiner, SVP Operations, 47Lining & Redshift Whisperer
What to Expect from the Session
• Analytics Architecture
• Challenges
• Effective patterns for ingest, de-dup, aggregate, vacuum
into Redshift
• How to balance rapid ingest and query speeds
• Strategies for data partitioning / orchestration
• Best practices for schema optimization, performant data
summaries, and incremental updates
• And how we built a Redshift solution to ingest 1 billion
rows of data per day
Life Before Redshift
• External solutions
• “One size fits all” for processing all games
• Serves the needs of central teams, but no focus on the
game team, no dedicated resource to us
• Lack of depth in data
• Client driven
Vision
• Discover how players play our game
• Drive better feature development
• Healthier operations through data
• Rapid iteration and evolution of telemetry gathering
• Decoupled from game server
• Frictionless access to data
• Easily query-able data
• Wall displays
Architecture
Architecture – Persisting to S3
Game Clients → Game Servers → (Put Events) → Amazon Kinesis → S3 Worker → S3 Bucket
Architecture – Game Client
• iOS/Android clients
• Produces client specific events like screen transitions
• Events are batched up and sent to the game server
every minute
• In between flushes to server, events are persisted to disk
• If the client crashes, events will be sent on the next
session
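The client behavior above can be sketched as a small batcher. This is a hypothetical illustration, not EA's client code: names like `EventBatcher` and the JSON-lines journal format are assumptions.

```python
import json, os, time

class EventBatcher:
    """Hypothetical sketch of the client-side batching described above:
    events are buffered, persisted to disk between flushes, and re-sent
    after a crash because the on-disk journal survives."""

    def __init__(self, journal_path, flush_interval=60.0):
        self.journal_path = journal_path
        self.flush_interval = flush_interval
        self.last_flush = time.monotonic()
        # Recover events left over from a previous (possibly crashed) session.
        self.buffer = self._load_journal()

    def _load_journal(self):
        if os.path.exists(self.journal_path):
            with open(self.journal_path) as f:
                return [json.loads(line) for line in f]
        return []

    def record(self, event):
        self.buffer.append(event)
        # Persist immediately so a crash does not lose the event.
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(event) + "\n")

    def maybe_flush(self, send):
        # Flush at most once per interval, as one batched server request.
        if time.monotonic() - self.last_flush >= self.flush_interval and self.buffer:
            send(self.buffer)
            self.buffer = []
            open(self.journal_path, "w").close()  # truncate the journal
            self.last_flush = time.monotonic()
```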
Architecture - Game Server
• EC2 Instances - Tomcat/Java
• Produces the majority of events
• Events are sent asynchronously to Kinesis
• ActiveMQ broker is responsible for the durability of the
message
• Persisted to disk until sent
• Retries with exponential backoff
• Dead Letter Queue
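The durability steps above (retry with exponential backoff, then dead-letter) can be sketched as follows. This is an assumed illustration of the pattern, not the actual ActiveMQ broker configuration; `send`, `sleep`, and the return protocol are inventions for the sketch.

```python
import random

def send_with_backoff(send, event, max_attempts=5, base_delay=1.0,
                      sleep=lambda s: None):
    """Sketch of the broker behavior described above: retry a failing send
    with exponential backoff, and hand the event to a dead-letter queue
    when attempts are exhausted."""
    dead_letter_queue = []
    for attempt in range(max_attempts):
        try:
            send(event)
            return True, dead_letter_queue
        except Exception:
            # Exponential backoff with jitter: ~1s, 2s, 4s, ...
            sleep(base_delay * (2 ** attempt) * (1 + random.random() * 0.1))
    dead_letter_queue.append(event)
    return False, dead_letter_queue
```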
Architecture - Kinesis
• One Kinesis stream with 10 shards partitioned by event UUID
• 24 hour retention
• Provides fault tolerance to the game server: Redshift can be
offline and the game server isn't impacted
• Game Server batches many events into one Kinesis record
on every client request
• Records are compressed
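The batching and compression described above might look like this; the JSON-plus-gzip payload format is an assumption for illustration, as is partitioning by a random event UUID per the slide.

```python
import gzip, json, uuid

def pack_record(events):
    """Sketch: batch many events into one compressed Kinesis record,
    partitioned by a random event UUID (payload format is assumed)."""
    payload = gzip.compress(json.dumps(events).encode("utf-8"))
    return {"Data": payload, "PartitionKey": str(uuid.uuid4())}

def unpack_record(record):
    """Inverse of pack_record, as the S3 worker would apply it."""
    return json.loads(gzip.decompress(record["Data"]).decode("utf-8"))
```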
Architecture – S3 Kinesis Worker
• Elastic Beanstalk
• Decompress records
• Transform hierarchical JSON structure into flat structure
• Patch missing data (e.g., PlayerId)
• Clean/truncate data (e.g., 0/1 -> true/false)
• Filter out unrepairable data (e.g., bad timestamps)
• Report operational metrics
• Write to S3 when thresholds are met
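The worker's transform steps can be sketched like this. Field names (`playerId`, `timestamp`, the `is*` boolean convention) are assumptions, not the actual event schema.

```python
def flatten(obj, prefix=""):
    """Flatten a hierarchical JSON structure into dot-separated keys."""
    out = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))
        else:
            out[key] = v
    return out

def transform(raw, fallback_player_id=None):
    """Sketch of the worker steps above: flatten, patch a missing PlayerId,
    coerce 0/1 flags to booleans, and drop events with bad timestamps."""
    event = flatten(raw)
    event.setdefault("playerId", fallback_player_id)        # patch missing data
    for k, v in event.items():
        if v in (0, 1) and k.split(".")[-1].startswith("is"):
            event[k] = bool(v)                              # clean: 0/1 -> bool
    ts = event.get("timestamp")
    if not isinstance(ts, (int, float)) or ts <= 0:         # filter bad timestamps
        return None
    return event
```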
Architecture – S3
• S3 files organized by
hour: yyyy/MM/dd/HH/SequenceStart-SequenceEnd.gz
• Compressed JSON
• Long-term “truth” storage
• Cheap
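The hourly key layout above can be produced with a one-line formatter; the function name is an invention for the sketch.

```python
def s3_key(event_time, seq_start, seq_end):
    """Sketch of the hourly key layout above:
    yyyy/MM/dd/HH/SequenceStart-SequenceEnd.gz"""
    return f"{event_time:%Y/%m/%d/%H}/{seq_start}-{seq_end}.gz"
```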
Architecture - S3 to Redshift
S3 → Ingest Data Pipeline → Amazon Redshift
Amazon Elastic Beanstalk: DeDupe & Analyze, Vacuum
Data Pipeline: SQL ETL
Architecture – Ingest Data Pipeline
• Every hour data pipeline job bulk inserts all S3 files into
EventsDups table.
• COPY EventsDups FROM s3://sw-prod-kinesis-events/#{format(minusHours(@scheduledStartTime,1),'YYYY/MM/dd/HH/')}
• Monitor for failures!
• Consider manifest driven ingest next time!
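The Data Pipeline expression above resolves to the previous hour's S3 prefix. A sketch of building the equivalent COPY statement in code follows; the table name comes from the slide, but the credentials clause and format options are placeholder assumptions.

```python
from datetime import datetime, timedelta

def copy_statement(scheduled_start, bucket="sw-prod-kinesis-events"):
    """Sketch of the hourly ingest above: compute the previous hour's
    prefix (minusHours(@scheduledStartTime, 1)) and emit a COPY for it.
    Credentials/format details are placeholders, not the real pipeline."""
    prefix = (scheduled_start - timedelta(hours=1)).strftime("%Y/%m/%d/%H/")
    return (f"COPY EventsDups FROM 's3://{bucket}/{prefix}' "
            "CREDENTIALS '...' JSON 'auto' GZIP")
```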
Table Progression
Ingest → Deduped → Aggregate → Events Table
• Ingest: asynchronous copy of new data from the S3 Worker
• Deduped: deduplication of incoming data, then deduplication against the Events Table and insertion
• Aggregate: aggregation of data
Architecture – Deduper
• Why deduplication?
• Redshift doesn't enforce uniqueness constraints.
• Distributed systems are complicated. Allow for retries when in
doubt.
• Data pipeline jobs can fail. Allow one to rerun ingest.
Architecture – Deduper Implementation
• Critical that a proper definition of duplicates is created
• Not based on all columns being the same
• Using the unique set of event identifying columns events
can be deduplicated both in the ingest table and against
the events table
Architecture – Deduper Implementation
• Beanstalk webapp that polls EventsDups table for work
• Deduplication is performed using the following columns to establish
uniqueness:
Column | Description
Raw Event Timestamp | Timestamp for event
User Id | Player identifier
Session Id | Unique to each session
Step | Each event gets a unique number generated from a memcached increment operation
Event Type | Integer unique to each event
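The deduplication rule above can be sketched in a few lines. Column names follow the table above; the function shape is an assumption, not the actual Beanstalk webapp.

```python
def dedupe(incoming, existing_keys=frozenset()):
    """Sketch of the rule above: uniqueness is the tuple of
    event-identifying columns, not whole-row equality.  Drops duplicates
    both within the incoming batch and against the main events table."""
    def key(e):
        return (e["raw_event_timestamp"], e["user_id"],
                e["session_id"], e["step"], e["event_type"])
    seen = set(existing_keys)
    unique = []
    for event in incoming:
        k = key(event)
        if k not in seen:
            seen.add(k)
            unique.append(event)
    return unique
```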
Schema : Events
Sort Key | Description
Ingest Time | Unix time (UTC) when event is captured on Kinesis
Stat Date | Raw Event Timestamp in yyyy-mm-dd format
Player Id | Randomly generated UUID (distribution key)
Raw Event Timestamp | Unix time (UTC) when event is triggered on server
Event Type | Integer unique to each event; 2924 = BattleSummaryEvent
Standard Fields | Country, Device, Network, Platform...
Event Value 1-10 | For each event type a set of 10 fields can be set
Architecture – Vacuum
• Why Vacuum?
• Reclaim and reuse space that is freed when you delete and
update rows … we only insert …
• Ensure new data is properly sorted with existing table
• This is important in providing quality statistics to the query
optimizer.
• We Vacuum once a day, which balances the time to Vacuum
against the ability to provide performant statistics.
Architecture – Analyze
• Why Analyze?
• Any time you add, modify, or delete a significant number
of rows, you should run the ANALYZE command to maintain the
query optimizer's statistics.
• This occurs when the table is vacuumed.
• We analyze on every 4th successful deduper process.
• ANALYZE is resource intensive. Balance the time to analyze
against the optimizer's ability to generate good plans.
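The every-4th-run cadence above is simple counter logic; a minimal sketch, with the class name and counter as assumptions:

```python
class MaintenanceScheduler:
    """Sketch of the cadence described above: run ANALYZE only on every
    Nth successful deduper run to balance its cost against the
    optimizer's need for fresh statistics."""

    def __init__(self, analyze_every=4):
        self.analyze_every = analyze_every
        self.successful_dedupes = 0

    def on_dedupe_success(self):
        # Returns True when this run should also trigger ANALYZE.
        self.successful_dedupes += 1
        return self.successful_dedupes % self.analyze_every == 0
```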
Architecture – ETL – User Retention Daily
• Data Pipeline scheduled once an hour – along with many other aggregate tables
• Upsert into table looking back a week into events table
• Executed after users aggregate table is updated
Sort Key | Description
PlayerId | Randomly generated UUID (distribution key)
Platform | Apple/Google
Country | US, GB...
Stat Date | Row for every day the player has played
Days In Game | Number of days in game
Revenue | Summary revenue data
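The hourly upsert above can be sketched in memory: rows are keyed by (player_id, stat_date) and an incoming row replaces the existing one, mirroring the delete-then-insert upsert pattern commonly used with Redshift. Field names are assumptions.

```python
def upsert_retention(table, updates):
    """Sketch of the upsert above: merge updated retention rows into the
    table, keyed by (player_id, stat_date)."""
    merged = {(r["player_id"], r["stat_date"]): r for r in table}
    for r in updates:
        merged[(r["player_id"], r["stat_date"])] = r
    return list(merged.values())
```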
Architecture – Scaling Growth
1 Billion Events!!!!
The Force
Awakens!!!
World Wide Launch!!!
Technical
Challenges & Solutions
Amazon Redshift system architecture
Leader node
• SQL endpoint (JDBC/ODBC)
• Stores metadata
• Coordinates query execution
Compute nodes (10 GigE HPC interconnect)
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3; load from Amazon DynamoDB, Amazon EMR, or SSH
Two hardware platforms, optimized for data processing
• DS2: HDD; scale from 2 TB to 2 PB
• DC1: SSD; scale from 160 GB to 326 TB
Architecture – Scaling Challenges
• 650 minutes to vacuum
• 1,550 minutes to deduplicate
Goals of Sorting
• Physically sort data within blocks and throughout a table
• Enable range-restricted scans (block rejection) to prune blocks by
leveraging zone maps
• The optimal SORTKEY depends on:
• Query patterns
• Data profile
• Business requirements
Choosing a SORTKEY
COMPOUND
• Most common
• Well-defined filter criteria
• Time-series data
INTERLEAVED
• Edge cases
• Large tables (>1 billion rows)
• No common filter criteria
• Non time-series data
Best Practices for Time-Series Data
• Organize time-series data with the newest data at the "end" of the table
• Sort primarily on a query predicate (date, identifier, …)
• Optionally, choose a column frequently used for aggregates
http://docs.aws.amazon.com/redshift/latest/d
g/vacuum-load-in-sort-key-order.html
It is important to have
sort keys that ensure
that new data is
"located", per sort key
order, at the end of the
time-series table
Events Out of Time
(Figure: plotted by time, incoming Ingest Table events land throughout the Events Table post-vacuum)
Altered Timestamp
By creating a synthetic timestamp sort key, the incoming
rows all vacuum to the end of the main Events Table
(Figure: Ingest Table rows appended at the end of the Events Table, by time)
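One way to realize the synthetic timestamp idea above is to truncate the ingest time to a coarse bucket, so every row in a batch shares a sort-key value greater than anything already in the table. The bucket width and function name are assumptions for the sketch.

```python
def synthetic_sort_timestamp(ingest_time, bucket_seconds=3600):
    """Sketch of the altered-timestamp idea above: truncate the Unix
    ingest time to a coarse bucket so fresh batches sort (and vacuum)
    to the end of the time-series table."""
    return ingest_time - (ingest_time % bucket_seconds)
```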
Best Practice for Sort Key Selection
http://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html
Compound Sort Key:
A compound key is made up of all of the columns listed
in the sort key definition, in the order they are listed. A
compound sort key is most useful when a query's filter
applies conditions, such as filters and joins, that use a
prefix of the sort keys. The performance benefits of
compound sorting may decrease when queries depend
only on secondary sort columns, without referencing the
primary columns.
Optimizing the Effectiveness of Zone Maps
(Figure: block-level zone maps over four blocks, comparing time-only query performance with a truncated synthetic timestamp; the query predicate can reject blocks whose zone map range falls outside it)
Balancing Vacuum and Query Speed
• Four ingest batches come in with the same truncated synthetic timestamp
• After vacuum, the secondary and tertiary sort keys reorder the rows, improving the sorting power of these later sort keys
• Vacuum time grows as the number of overlapping batches increases
• Improved grouping of the secondary and tertiary sort key values improves query speed where these are used
(Figure: pre-vacuum vs. post-vacuum block layout)
Architecture – Scaling Challenges
Goals of Distribution
• Distribute data evenly for parallel processing
• Minimize data movement
• Co-located joins
• Localized aggregations
(Figure: distribution styles across two nodes with two slices each)
• Key: same key to same location
• All: full table data on first slice of every node
• Even: round-robin distribution
Choosing a Distribution Style
Key
• Large FACT tables
• Rapidly changing tables used
in joins
• Localize columns used within
aggregations
All
• Slowly changing data
• Reasonable size (i.e., a few
million but not hundreds of
millions of rows)
• No common distribution key
for frequent joins
• Typical use case: joined
dimension table without a
common distribution key
Even
• Tables not frequently joined or
aggregated
• Large tables without
acceptable candidate keys
Best Practice for Distribution
http://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html
Data redistribution can account for a substantial portion of the cost of a
query plan, and the network traffic it generates can affect other database
operations and slow overall system performance
1. To distribute the workload uniformly among the nodes in the cluster.
Uneven distribution, or data distribution skew, forces some nodes to
do more work than others, which impairs query performance
2. To minimize data movement during query execution. If the rows that
participate in joins or aggregates are already collocated on the
nodes with their joining rows in other tables, the optimizer does not
need to redistribute as much data during query execution
Unauthenticated (Anonymous) Events
• A small percentage of unauthenticated
events located on a single slice of a
large cluster leads to significant skew
(Figure: Events Table across Slice 0 … Slice N, showing node-level and slice-level skew)
Split Events Tables
(Figure: authenticated and unauthenticated Events Tables, each distributed across Slice 0 … Slice N)
• By splitting events into two tables, query speed improved because
unauthenticated events no longer skew the distribution
• A UNION ALL view can be used to query all event data when needed
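The UNION ALL view can be generated mechanically; a sketch follows, where the view and table names are assumptions, not the actual schema.

```python
def union_all_view(view, tables):
    """Sketch of the pattern above: a UNION ALL view stitches the
    authenticated and unauthenticated event tables back together
    when all event data is needed."""
    selects = " UNION ALL ".join(f"SELECT * FROM {t}" for t in tables)
    return f"CREATE VIEW {view} AS {selects};"
```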
Long Deduplication Time
• Incoming events need to be scrubbed
to prevent duplicate events
• Duplicates are removed from incoming data
• Scanning the full Events Table for
deduplication slows as the Events Table
grows
(Figure: Ingest Table deduplicated against the full Events Table, by time)
Time Restricted Deduplication
• Incoming events evaluated for ranges
on specific columns
• Scan of main Events Table
limited to range of incoming
events
(Figure: Ingest Table deduplicated against only the matching time range of the Events Table)
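The range restriction above can be sketched as: compute the timestamp range of the incoming batch and only compare against the slice of the main table inside that range. Column names follow the dedupe table earlier; the in-memory form is an illustration, not the real SQL.

```python
def time_restricted_dedupe(incoming, events_table):
    """Sketch of the optimization above: restrict the scan of the main
    events table to the timestamp range of the incoming batch instead
    of scanning the whole table."""
    if not incoming:
        return []
    lo = min(e["raw_event_timestamp"] for e in incoming)
    hi = max(e["raw_event_timestamp"] for e in incoming)
    # Only this restricted range of the (sorted, zone-mapped) table is scanned.
    existing = {(e["raw_event_timestamp"], e["user_id"], e["session_id"],
                 e["step"], e["event_type"])
                for e in events_table
                if lo <= e["raw_event_timestamp"] <= hi}
    return [e for e in incoming
            if (e["raw_event_timestamp"], e["user_id"], e["session_id"],
                e["step"], e["event_type"]) not in existing]
```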
Growing Aggregation Time
• Per-player statistics
• Full aggregation: Events Table → Aggregate
Incremental Aggregation
• Events Table → Temp → Aggregate
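The incremental pattern above, aggregating only new events into a temp result and merging it into the existing per-player aggregate, can be sketched in memory. Revenue-summing is an assumed example metric.

```python
def incremental_aggregate(aggregate, new_events):
    """Sketch of the incremental pattern above: aggregate just the new
    batch into a temp result, then merge it into the existing per-player
    aggregate instead of re-scanning the full events table."""
    temp = {}
    for e in new_events:                       # aggregate only the new batch
        temp[e["player_id"]] = temp.get(e["player_id"], 0) + e.get("revenue", 0)
    merged = dict(aggregate)
    for player, revenue in temp.items():       # merge temp into the aggregate
        merged[player] = merged.get(player, 0) + revenue
    return merged
```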
Benefits
Benefits Detail
- Churn Prediction
- Cheater Detection
- Adaptive AI
- Changing the definition of success
Next Steps
• Data retention
• Machine Learning
• Firehose
• Kinesis Analytics
Thank you!
Remember to complete
your evaluations!