Presentation from Tableau Customer Conference 2013 on building a real time reporting/analytics platform. Topics discussed include definitions of big data and real time, technology choices and rationale, use cases for real time big data, architecture, and pitfalls to avoid.
Billions of Rows, Millions of Insights
Right Now
Developing a Landscape for Real Time Information
Who is Spil Games?
• 180 million monthly and 12 million daily players
• More than one billion gameplays monthly
• Active in every country of the world (even Vatican City!)
One of the largest casual gaming companies on the planet
Local EVERYWHERE
Titles
We are a platform first, but also a publisher and a developer
The paradigm is shifting
The Data Warehouse
• Highly consistent
• Highly connectable
• Inflexible
• Slow

The Data Lake
• Flexible
• Fast
• Going to get wet
You always need both
Traditionally, we define data based on what we expect
With streaming data, we capture first and define later
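The capture-first approach is often called schema-on-read: raw events are appended untouched, and a schema is applied only when the data is read. A minimal sketch in Python (the field names here are illustrative, not Spil's actual schema):

```python
import json

# Capture first: append raw events verbatim; no schema is enforced at write time.
raw_events = [
    '{"visitorId": 123, "page": "home", "timing": 1730}',
    '{"visitorId": 456, "country": "SE"}',  # different shape -- still accepted
]

# Define later: apply a schema only at read time, tolerating missing fields.
def read_with_schema(raw, fields):
    record = json.loads(raw)
    return {f: record.get(f) for f in fields}

rows = [read_with_schema(e, ["visitorId", "timing"]) for e in raw_events]
print(rows)  # the second event yields timing=None instead of a load failure
```

The trade-off is exactly the one on this slide: nothing is rejected at capture time (flexible, fast), but consistency checks move to read time.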
Defining BIG Data
The Four Vs
Velocity
Variety
Volume
Veracity
Small Data = BIG Data?
Real Time
ETL, Events, Excel
Drinking from the Firehose
Heuristics
VALUE: The Only V that Matters
Defining the VELOCITY and VARIETY
Traditional ETL
• Once a day
• Once a week
• Delayed

“Real Time”
• Faster than human perception
• <200 milliseconds

“In Time”
Information is available fast enough to influence decisions:
• While in the shop/on the site (minutes)
• While the query runs (seconds)
• While the page loads (milliseconds)
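The three "in time" tiers can be read as a simple latency classifier. A sketch (tier names come from the bullets above; the 60-second cutoff between "seconds" and "minutes" is an assumed boundary):

```python
def in_time_tier(latency_ms):
    """Map delivery latency to the decision it can still influence."""
    if latency_ms < 200:         # while the page loads (milliseconds)
        return "page load"
    if latency_ms < 60_000:      # while the query runs (seconds)
        return "query run"
    return "shop/site visit"     # while in the shop/on the site (minutes)

print(in_time_tier(150))      # page load
print(in_time_tier(5_000))    # query run
print(in_time_tier(120_000))  # shop/site visit
```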
The Velocity Continuum
Deriving the VALUE at Spil
Informing Decisions
• Day to day business reporting
• Analytical reporting for self-service analysis
• Business analytics for advising decisions
• Descriptive models to explain our business
• Customer Lifetime Value
• Marketing ROI

Making Decisions
• Customer content recommendations
• Email campaign targeting
• Site learning and optimization
• System monitoring and alerting
Why Real Time Reporting Matters
Value of Reporting
Real time reporting is a paradigm-shifting component of our cloud-based big data strategy!
“I need to see everything happening RIGHT NOW”
• System Monitoring
• Product Changes
• In-Time Customer Support
Real Time Systems Requirements
Requirement | Rationale | Our Experience
Scalable with fast loads | Must handle intraday variable load | Load swings up to 300% during the day
Fast join performance | Synthesizing traditional ETL data and real time events on the fly | Denormalization is great but volume expensive; 3NF is BAD
Resilient | Real time means as few buffers as possible | Tableau extracts can slow the process too much
Good query optimizer | Minor inefficiencies translate to expensive performance hits | The best MySQL engine is still too slow for BIG data aggregation
Concurrent loading and querying | No offline processing for real time data | ETLs running at the same time as queries up to 20% of the time
Solution: C-Store Databases
C-Stores and Fast Dashboards
• C-Stores persist each column independently and allow column compression
• Queries retrieve data only from needed columns
Example: 7 billion rows, 25 columns, 10 bytes/column = 1.6 TB table
Query: SELECT A, SUM(D) FROM table WHERE C >= X GROUP BY A;
Row Store: 1.6 TB of data
Column Store (30% compression): <195 GB data
The Result: Dashboards can run direct on large tables
Dashboard on 7 billion row table with two joins, <20 seconds to refresh
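The arithmetic behind this example can be checked directly. A short sketch (the 30% compression figure is the slide's own; the three touched columns are A, C, and D from the query):

```python
ROWS = 7_000_000_000
COLUMNS = 25
BYTES_PER_COLUMN = 10

# A row store must scan every column of every row for this query.
row_store_bytes = ROWS * COLUMNS * BYTES_PER_COLUMN
print(f"Row store scan: {row_store_bytes / 2**40:.2f} TiB")  # ~1.6 TiB

# The query touches only columns A, C, and D: 3 of 25 columns,
# shrunk further by ~30% column compression.
TOUCHED_COLUMNS = 3
COMPRESSION = 0.30
c_store_bytes = ROWS * TOUCHED_COLUMNS * BYTES_PER_COLUMN * (1 - COMPRESSION)
print(f"Column store scan: {c_store_bytes / 2**30:.0f} GiB")  # well under the <195 GB bound
```

Scanning roughly a tenth of the bytes is what makes dashboards viable directly on the 7-billion-row fact table.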
Our Technical Infrastructure
How much data do we handle?
Ingestion
• Through Map/Reduce: 1.2 billion events/day (150 million rows/day into DWH)
• Through ETL: 100-200 million rows/day into DWH

Persistence
• Map/Reduce: 20 billion rows
• Vertica: 45 billion rows
• Long term storage: all of 2013’s events

Usage
• Predictive models: >500 million scores per day
• ETLs to production DBs: >10 models
• Reporting: 150 dashboards, 80 data sources
• Queries: >2,000 per day
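To get a feel for sustained load, the daily totals translate into per-second averages (averages only; as noted above, intraday load swings up to 300%):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

events_per_day = 1_200_000_000       # through Map/Reduce
dwh_rows_per_day = 150_000_000       # event rows landing in the DWH
model_scores_per_day = 500_000_000   # predictive model scores

print(events_per_day // SECONDS_PER_DAY)        # ~13,888 events/second on average
print(dwh_rows_per_day // SECONDS_PER_DAY)      # ~1,736 rows/second into the DWH
print(model_scores_per_day // SECONDS_PER_DAY)  # ~5,787 scores/second
```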
Data Flow for Event Data
{
  token: "BAEDIDtxmZoAWAEA",
  sessionId: 1358331540132,
  visitorId: 515876866411417,
  pageInSession: 3,
  environment: "stg",
  eventList: [{
    eventCategory: "displayAds",
    eventAction: "fetch",
    eventLabel: "Miniclip,leaderboard,160x60,SE,2.9",
    eventValue: 1, // the depth in the daisy chaining
    pageInSession: 2,
    timing: 1730
  }]
}
JSON event data is generated by client
Visitor | Session | Page | Timing | Type      | Action | Source   | Value
123     | 456     | 3    | 1730   | DisplayAd | Fetch  | Miniclip | 2
Data is structured in Map/Reduce and put into flat files
Data is loaded into Vertica for Reporting + Analysis
Tableau queries directly from fact tables
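The structuring step in the middle can be sketched as a flattening function: each nested entry in eventList becomes one flat, session-stamped row ready for a delimited load file. Field names follow the sample event above; the function itself is illustrative, not Spil's actual Map/Reduce job:

```python
def flatten(event):
    """Turn one nested JSON event into flat rows, one per entry in eventList."""
    rows = []
    for e in event.get("eventList", []):
        rows.append({
            "visitor": event["visitorId"],
            "session": event["sessionId"],
            # the nested event may override the session-level page number
            "page":    e.get("pageInSession", event.get("pageInSession")),
            "timing":  e.get("timing"),
            "type":    e.get("eventCategory"),
            "action":  e.get("eventAction"),
            "value":   e.get("eventValue"),
        })
    return rows

sample = {
    "visitorId": 515876866411417,
    "sessionId": 1358331540132,
    "pageInSession": 3,
    "eventList": [{"eventCategory": "displayAds", "eventAction": "fetch",
                   "eventValue": 1, "pageInSession": 2, "timing": 1730}],
}
for row in flatten(sample):
    # One tab-separated line per row, as the job would emit into flat files.
    print("\t".join(str(row[k]) for k in ("visitor", "session", "page",
                                          "type", "action", "timing", "value")))
```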
Why we chose our tech
• Affordable• Highly available and resilient
• Extremely fast development due to SQL
• Excellent query performance = lazy optimization
• Right price• Easy (and fun!) development• Excellent library availability
• Industry standard for Map/Reduce• Cheap storage of “data lake”
• Easy integration with existing tech
What we’ve learned along the way
• Denormalize like crazy (or cheat with pre-join projections)
• Map/Reduce doesn’t like “real” time (try Storm)
• Network is the first limit you hit
• Let Tableau write the SQL, but optimize the projections
• Tableau’s caching is inflexible, scripting can solve (kind of)
Q&A + Demo