
Billions of Rows, Millions of Insights, Right Now


DESCRIPTION

Presentation from Tableau Customer Conference 2013 on building a real time reporting/analytics platform. Topics discussed include definitions of big data and real time, technology choices and rationale, use cases for real time big data, architecture, and pitfalls to avoid.


Page 1: Billions of Rows, Millions of Insights, Right Now

Billions of Rows, Millions of Insights

Right Now

Developing a Landscape for Real Time Information

Page 2: Billions of Rows, Millions of Insights, Right Now

Who is Spil Games?

• 180 million monthly and 12 million daily players
• More than one billion gameplays monthly
• Active in every country of the world (even Vatican City!)

One of the largest casual gaming companies on the planet

Local EVERYWHERE

Titles

We are a platform first, but also a publisher and a developer

Page 3: Billions of Rows, Millions of Insights, Right Now

The paradigm is shifting

The Data Lake
• Highly consistent
• Highly connectable
• Inflexible
• Slow

• Flexible
• Fast
• Going to get wet

You always need both

Traditionally, we define data based on what we expect

With streaming data, we capture first and define later

Page 4: Billions of Rows, Millions of Insights, Right Now

Defining BIG Data

The Four Vs

Velocity

Variety

Volume

Veracity

Small Data = BIG Data?

Real Time

ETL, Events, Excel

Drinking from the Firehose

Heuristics

VALUE: The Only V that Matters

Page 5: Billions of Rows, Millions of Insights, Right Now

Defining the VELOCITY and VARIETY

Traditional ETL
• Once a day
• Once a week
• Delayed

“Real Time”
• Faster than human perception
• <200 milliseconds

“In Time”: Information is available fast enough to influence decisions
• While in the shop/on the site (minutes)
• While the query runs (seconds)
• While the page loads (milliseconds)

The Velocity Continuum
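The continuum above can be read as a simple classification of delivery latency. The sketch below is one possible reading, not something shown in the deck: the 200 millisecond boundary comes from the slide, while the one-hour cutoff for "in time" (`IN_TIME_LIMIT_SECONDS`) and the function name `classify_latency` are assumptions made for illustration. On the slide, "in time" depends on the decision being supported, so a single fixed cutoff is a simplification.

```python
# Assumed upper bound for "in time" delivery (the slide only says "minutes").
IN_TIME_LIMIT_SECONDS = 60 * 60

# Boundary taken from the slide: "<200 milliseconds".
REAL_TIME_LIMIT_SECONDS = 0.2


def classify_latency(seconds: float) -> str:
    """Place a data-delivery latency on the velocity continuum."""
    if seconds < 0:
        raise ValueError("latency cannot be negative")
    if seconds < REAL_TIME_LIMIT_SECONDS:
        return "real time"
    if seconds <= IN_TIME_LIMIT_SECONDS:
        return "in time"
    # Daily or weekly batch loads end up here.
    return "traditional ETL"


if __name__ == "__main__":
    for latency in (0.05, 5, 300, 24 * 60 * 60):
        print(latency, "->", classify_latency(latency))
```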

Page 6: Billions of Rows, Millions of Insights, Right Now

Deriving the VALUE at Spil

Informing Decisions Making Decisions

• Day to day business reporting

• Analytical reporting for self-service analysis

• Business analytics for advising decisions

• Descriptive models to explain our business

• Customer Lifetime Value
• Marketing ROI

• Customer content recommendations

• Email campaign targeting

• Site learning and optimization

• System monitoring and alerting

Page 7: Billions of Rows, Millions of Insights, Right Now

Why Real Time Reporting Matters

Value of Reporting

Real time reporting is a paradigm-shifting component of our cloud based big data strategy!

I need to see everything happening RIGHT NOW

System Monitoring

Product Changes

In-Time Customer Support

Page 8: Billions of Rows, Millions of Insights, Right Now

Real Time Systems Requirements

Requirement | Rationale | Our Experience
Scalable with fast loads | Must handle intraday variable load | Load swings up to 300% during the day
Fast join performance | Synthesizing traditional ETL data and real time events on the fly | Denormalization is great but volume expensive; 3NF is BAD
Resilient | Real time means as few buffers as possible | Tableau extracts can slow the process too much
Good query optimizer | Minor inefficiencies translate to expensive performance hits | The best MySQL engine is still too slow for BIG data aggregation
Concurrent loading and querying | No offline processing for real time data | ETLs running at the same time as queries up to 20% of the time

Solution: C-Store Databases

Page 9: Billions of Rows, Millions of Insights, Right Now

C-Stores and Fast Dashboards

• C-Stores persist each column independently and allow column compression

• Queries retrieve data only from needed columns

Example: 7 billion rows, 25 columns, 10 bytes/column = 1.6 TB table

Query: SELECT A, SUM(D) FROM table WHERE C >= X GROUP BY A;

Row Store: 1.6 TB of data
Column Store (30% compression): <195 GB of data

The Result: Dashboards can run direct on large tables

Dashboard on 7 billion row table with two joins, <20 seconds to refresh
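The sizing example on this slide can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, not the presenters' calculation. It assumes the slide's units are binary (TiB and GiB), that the query touches exactly three columns (A, C and D), and that "30% compression" means a 30% size reduction. The function names are made up for illustration.

```python
ROWS = 7_000_000_000      # from the slide
COLUMNS = 25              # from the slide
BYTES_PER_VALUE = 10      # from the slide: 10 bytes per column value

GIB = 1024 ** 3
TIB = 1024 ** 4


def row_store_bytes(rows: int, columns: int, bytes_per_value: int) -> int:
    """A row store reads every column of every row it scans."""
    return rows * columns * bytes_per_value


def column_store_bytes(rows: int, columns_read: int, bytes_per_value: int,
                       compression: float = 0.0) -> float:
    """A column store reads only the referenced columns.

    `compression` is the assumed fractional size reduction (0.30 = 30% smaller).
    """
    return rows * columns_read * bytes_per_value * (1.0 - compression)


full_scan = row_store_bytes(ROWS, COLUMNS, BYTES_PER_VALUE)
three_columns = column_store_bytes(ROWS, 3, BYTES_PER_VALUE)
three_columns_compressed = column_store_bytes(ROWS, 3, BYTES_PER_VALUE, 0.30)

if __name__ == "__main__":
    print(f"row store scan:          {full_scan / TIB:.2f} TiB")
    print(f"3 columns, uncompressed: {three_columns / GIB:.1f} GiB")
    print(f"3 columns, compressed:   {three_columns_compressed / GIB:.1f} GiB")
```

Under these assumptions the three uncompressed columns already come to just over 195 GiB, so any compression at all puts the column-store read below the slide's "<195 GB" figure.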

Page 10: Billions of Rows, Millions of Insights, Right Now

Our Technical Infrastructure

Page 11: Billions of Rows, Millions of Insights, Right Now

How much data do we handle?

Ingestion
• Through Map/Reduce: 1.2 Billion Events/Day (150 Million Rows/Day into DWH)
• Through ETL: 100-200 Million Rows/Day into DWH

Persistence
• Map/Reduce: 20 Billion Rows
• Vertica: 45 Billion Rows
• Long Term Storage: All of 2013 Events

Usage
• Predictive models: >500 million scores per day
• ETLs to Production DBs: >10 Models
• Reporting: 150 Dashboards, 80 data sources
• Queries: >2000 per day
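To put the daily ingestion volumes in per-second terms, the figures from this slide can be converted as follows. This is a rough sketch that assumes load is spread evenly over 24 hours, which the deck itself says is not the case (load swings during the day), so the results are averages only. The helper name `per_second` is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

EVENTS_PER_DAY = 1_200_000_000   # events through Map/Reduce, from the slide
DWH_ROWS_PER_DAY = 150_000_000   # rows those events become in the DWH


def per_second(per_day: float) -> float:
    """Average rate per second, assuming an even spread over the day."""
    return per_day / SECONDS_PER_DAY


events_per_second = per_second(EVENTS_PER_DAY)
dwh_rows_per_second = per_second(DWH_ROWS_PER_DAY)
# How many raw events are condensed into one warehouse row on average.
events_per_dwh_row = EVENTS_PER_DAY / DWH_ROWS_PER_DAY

if __name__ == "__main__":
    print(f"average events/second:   {events_per_second:,.0f}")
    print(f"average DWH rows/second: {dwh_rows_per_second:,.0f}")
    print(f"events per DWH row:      {events_per_dwh_row:.0f}")
```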

Page 12: Billions of Rows, Millions of Insights, Right Now

Data Flow for Event Data

{
  token: "BAEDIDtxmZoAWAEA",
  sessionId: 1358331540132,
  visitorId: 515876866411417,
  pageInSession: 3,
  environment: "stg",
  eventList: [{
    eventCategory: 'displayAds',
    eventAction: 'fetch',
    eventLabel: 'Miniclip,leaderboard,160x60,SE,2.9',
    eventValue: 1, // the depth in the daisy chaining
    pageInSession: 2,
    timing: 1730
  }]
}

JSON event data is generated by client

Visitor | Session | Page | Timing | Type | Action | Source | Value
123 | 456 | 3 | 1730 | DisplayAd | Fetch | Miniclip | 2

Data is structured in Map/Reduce and put into flat files

Data is loaded into Vertica for Reporting + Analysis

Tableau queries directly from fact tables
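The structuring step in this flow (nested client event in, flat row out) might look roughly like the sketch below. It is not the Map/Reduce job from the talk. The field-to-column mapping is an assumption inferred from the example row: for instance, "Source" is taken to be the first comma-separated part of `eventLabel`. The names `flatten_event`, `to_flat_file_line` and `COLUMNS` are hypothetical. The sketch copies values verbatim, so it does not reproduce the simplified identifiers shown in the slide's example row.

```python
from typing import Any, Dict, List

# Assumed column order, following the example row on the slide.
COLUMNS = ["visitor", "session", "page", "timing",
           "type", "action", "source", "value"]


def flatten_event(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one client payload into one flat row per entry in eventList."""
    rows = []
    for event in payload.get("eventList", []):
        # Assumption: the first part of the label identifies the source.
        label_parts = event.get("eventLabel", "").split(",")
        rows.append({
            "visitor": payload["visitorId"],
            "session": payload["sessionId"],
            "page": payload["pageInSession"],
            "timing": event.get("timing"),
            "type": event["eventCategory"],
            "action": event["eventAction"],
            "source": label_parts[0],
            "value": event["eventValue"],
        })
    return rows


def to_flat_file_line(row: Dict[str, Any]) -> str:
    """Serialize a row as one tab-separated line for a flat file."""
    return "\t".join(str(row[column]) for column in COLUMNS)


# The sample event from the slide, written as a Python dict.
sample = {
    "token": "BAEDIDtxmZoAWAEA",
    "sessionId": 1358331540132,
    "visitorId": 515876866411417,
    "pageInSession": 3,
    "environment": "stg",
    "eventList": [{
        "eventCategory": "displayAds",
        "eventAction": "fetch",
        "eventLabel": "Miniclip,leaderboard,160x60,SE,2.9",
        "eventValue": 1,
        "pageInSession": 2,
        "timing": 1730,
    }],
}

rows = flatten_event(sample)

if __name__ == "__main__":
    for row in rows:
        print(to_flat_file_line(row))
```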

Page 13: Billions of Rows, Millions of Insights, Right Now

Why we chose our tech

• Affordable
• Highly available and resilient

• Extremely fast development due to SQL
• Excellent query performance = lazy optimization

• Right price
• Easy (and fun!) development
• Excellent library availability

• Industry standard for Map/Reduce
• Cheap storage of “data lake”

• Easy integration with existing tech

Page 14: Billions of Rows, Millions of Insights, Right Now

What we’ve learned along the way

• Denormalize like crazy (or cheat with pre-join projections)

• Map/Reduce doesn’t like “real” time (try Storm)

• Network is the first limit you hit

• Let Tableau write the SQL, but optimize the projections

• Tableau’s caching is inflexible, scripting can solve (kind of)
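As an illustration of the "denormalize like crazy" advice, the sketch below copies dimension attributes onto each fact row at load time, so that dashboards can query one wide table without a join. It is a generic Python example, not the team's ETL code and not Vertica's pre-join projection feature. The table contents, field names (`game_id`, `title`, `genre`) and the function `denormalize` are all hypothetical.

```python
from typing import Any, Dict, List


def denormalize(facts: List[Dict[str, Any]],
                games_by_id: Dict[int, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attach dimension attributes to every fact row (a load-time join)."""
    wide_rows = []
    for fact in facts:
        # Unknown dimension keys yield None attributes rather than dropping the row.
        game = games_by_id.get(fact["game_id"], {})
        wide_rows.append({
            **fact,
            "game_title": game.get("title"),
            "game_genre": game.get("genre"),
        })
    return wide_rows


# Hypothetical dimension table, keyed by game id.
games = {
    1: {"title": "Example Racer", "genre": "racing"},
    2: {"title": "Example Puzzle", "genre": "puzzle"},
}

# Hypothetical fact rows as they might arrive from the event stream.
gameplays = [
    {"visitor": 123, "game_id": 1, "plays": 4},
    {"visitor": 456, "game_id": 2, "plays": 1},
    {"visitor": 789, "game_id": 3, "plays": 2},  # no matching dimension row
]

wide = denormalize(gameplays, games)

if __name__ == "__main__":
    for row in wide:
        print(row)
```

The trade-off matches the requirements table earlier in the deck: the wide rows repeat the dimension values, which costs storage volume but removes the join at query time.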

Page 15: Billions of Rows, Millions of Insights, Right Now

Q&A + Demo