A Big Data Primer
Stacia Misner E-mail: [email protected] Twitter: @StaciaMisner Blog: blog.datainspirations.com
Copyright © 2013 by Data Inspirations Inc. All rights reserved.
Session Overview
• What’s the Fuss? • What’s in the Big Data Stack? • Where Do I Start?
What’s the Fuss?
• Some Background… • Classic Data Analysis versus Big Data • Why Now? • Why Bother?
Some Background…
Google Trends: “Big Data”
Has Big Data Jumped the Shark?
• Volume • Velocity • Variety • Variability
Is Big Data the Next Frontier?
Classic Data Analysis
Data Warehouse & BI Solutions
ETL
…Uses Just a Subset
Classic Data Analysis
Data Warehouse & BI Solutions
ETL
…Requires Structure
Variety Includes Unstructured Data
Big Data versus Traditional BI
http://blogs.forrester.com/brian_hopkins/11-08-29-big_data_brewer_and_a_couple_of_webinars
Why Now? The Times… They Are A’Changin’
Cost of Storage Decreasing
• 1970: 1 TB = $1,000,000 • 2013: 1 TB < $100
Direct attached storage, not Enterprise SAN!
The Times… They Are A’Changin’
Data Volumes Increasing
• All Books: 15 TB • Daily Tweets: 15 TB
The Times… They Are A’Changin’
Processing Power Increasing
3 Billion Base Pairs to Analyze
• Then… 10 Years, Completed in 2003
• Now… 1 Week, At 1/10th the Cost
Why Now?
Powerful, Scalable, Cheap, Elastic
Why Bother?
• Make more data available faster
• Deliver access to more detailed, accurate information to adjust just-in-time
• Segment customers at a more granular level for personalization of products and services
• Perform more sophisticated analytics
• Improve products
Case Study: Customer, Product, Promotion Data -> Personalized Promotions
• Before Big Data: 8 weeks • After Big Data: 1 week and dropping
http://wiki.apache.org/hadoop/PoweredBy
What’s In the Big Data Stack?
• Key Differences • Hadoop Ecosystem • Hadoop and Analysis Services
Key Differences
Scale Out As Needed With Commodity Hardware
Impose Schema On Read
Basically Available, Soft-state, Eventually consistent (BASE)
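"Impose Schema On Read" can be illustrated with a minimal sketch: raw records are stored untouched, and a schema is applied only when someone reads them. The sample records, the `SCHEMA` list, and the `read_with_schema` helper are hypothetical names invented for this illustration; they are not part of Hadoop or any other product.

```python
import csv
import io

# Raw records are stored as-is; nothing validates them at write time.
# (Hypothetical sample data for illustration only.)
RAW_LINES = """2013-01-05,widget,3,9.99
2013-01-06,gadget,not-a-number,4.50
2013-01-07,widget,2
"""

# Hypothetical schema chosen by the reader at query time: (field name, type).
SCHEMA = [("day", str), ("product", str), ("quantity", int), ("price", float)]


def read_with_schema(raw_text, schema):
    """Apply the schema while reading; fields that do not fit become None."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for position, (name, cast) in enumerate(schema):
            try:
                row[name] = cast(fields[position])
            except (IndexError, ValueError):
                # Missing or malformed field: keep the record, flag the value.
                row[name] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, SCHEMA)
# Sum only the quantities that parsed cleanly under this schema.
total_units = sum(r["quantity"] for r in rows if r["quantity"] is not None)
print(total_units)
```

A different question could reuse the same raw lines with a different schema, which is the contrast with a data warehouse that fixes the structure before loading.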
Hadoop Ecosystem
HDFS
MapReduce
Note: This is only a subset of the ecosystem!
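The MapReduce idea can be sketched in a few lines of plain Python as a word count. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's API; the function names are assumptions made for this example, and a real cluster would run the map and reduce steps in parallel across many nodes over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence within one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big hype", "big data stack"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, the work can be distributed across commodity hardware.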
Problem to Solve
• Elasticity
  o Ability to analyze structured and unstructured data
  o DW imposes structure for questions we know we want answered
  o Need ability to incorporate other types of data on demand
• Scale
  o Low cost commodity hardware
  o Distributed workload
Hadoop & Analysis Services – High Latency
Hadoop & Analysis Services – Medium Latency
Linked Server HiveODBC driver
Hadoop & Analysis Services – Medium Latency
Analysis Management Objects (AMO) to push data into SSAS
Hadoop & Analysis Services – Low Latency
Options: • Impala (Cloudera) • Spark and Shark (UC Berkeley) • Stinger (Hortonworks)
Where Do I Start?
• Big Data Lifecycle • Approaches
Big Data Lifecycle
Discovery
Data Preparation
Model Planning
Model Building
Result Communication
Production
Discovery: • Look at internal/external processes – What is a challenge? Where could overwhelming advantage be useful? • Formulate hypothesis
Big Data Business Models
Big Data Lifecycle – Data Preparation
• Explore the data in a sandbox • Condition the data
Big Data Lifecycle – Model Planning
• Decide on methods and models • Examine data for key variables
Big Data Lifecycle – Model Building
• Create data sets for testing, training, and production • Set up hardware environment
Big Data Lifecycle – Result Communication
• Validate (or not) hypothesis • Share findings
Big Data Lifecycle – Production
• Pilot project • Operationalize
Approaches – Store and Analyze
• Integrate and consolidate
  o Better data quality
  o Access to history
  o Higher storage requirements and latency impact
• Choose hardware
  o Massively Parallel Processing (PDW)
  o Tabular – data compression
  o RDBMS – column-store
  o NoSQL – multiple variable data sources
• Analyze data at rest
Approaches – Analyze and Store
• Filter and aggregate data before adding to DW
  o Reduce action time (receipt of raw data to decision point) to attain greater business agility
  o Lower storage and administrative overhead
• Analyze data in motion (complex event processing)
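The "filter and aggregate before adding to DW" step can be sketched as a single pass over incoming events that keeps only a small summary. The event fields (`day`, `product`, `amount`), the threshold, and the `filter_and_aggregate` helper are hypothetical and chosen only for illustration; a production system would more likely use a complex event processing engine.

```python
from collections import defaultdict


def filter_and_aggregate(events, min_amount):
    """Consume events one at a time and keep only per-day, per-product totals.

    Events below min_amount are dropped, and raw events are never stored,
    so only the summary would need to be loaded into the warehouse.
    """
    totals = defaultdict(float)
    for event in events:
        if event["amount"] < min_amount:
            continue  # filter: ignore events too small to matter
        totals[(event["day"], event["product"])] += event["amount"]
    return dict(totals)


# Hypothetical stream of raw events (an iterator, consumed in one pass).
raw_events = [
    {"day": "2013-01-05", "product": "widget", "amount": 10.0},
    {"day": "2013-01-05", "product": "widget", "amount": 5.0},
    {"day": "2013-01-05", "product": "gadget", "amount": 0.5},
    {"day": "2013-01-06", "product": "gadget", "amount": 20.0},
]
summary = filter_and_aggregate(iter(raw_events), min_amount=1.0)
print(summary)
```

Four raw events shrink to two summary rows here, which is the storage and latency trade-off this approach aims for.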
Overwhelmed? Prototype First!
• Define a small project – focus on one product, for example
• Capture data for the subset of focus for limited duration (one month)
• Take action on analytics and measure resulting change
http://www.microsoft.com/bigdata
Session Review
• What’s the Fuss? • What’s in the Big Data Stack? • Where Do I Start?
Resources
• Big data has jumped the shark (9/11/2011)
  o www.dbms2.com/2011/09/11/big-data-has-jumped-the-shark/
• Big data: The next frontier for innovation, competition, and productivity (aka The McKinsey report)
  o http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
• What a Big Data Model Looks Like
  o http://blogs.hbr.org/cs/2012/12/what_a_big-data_business_model.html
• Architectures for Running SSAS on Data in Hadoop Hive
  o http://thinknook.com/architectures-for-running-sql-server-analysis-service-ssas-on-data-in-hadoop-hive-2013-02-25/