Basic primer on what big data is, what drives it, the technologies that support it, and industry examples
Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
Big Data 101
Pradeep Varadan, Enterprise Architecture
Mar 2014
Agenda
• What is Big Data?
– Hype
– Facts
– Definition
• Why the upsurge?
– Rethinking data
– Rethinking processes
• Technology
– Current constraints
– RDBMS vs. Hadoop
– Hadoop
– NoSQL
• Use Cases
– Cross-industry examples
– Netflix
What is Big Data?
[Figure labels: data, Big Data]
What is Big Data?
• Social media
• Server logs
• Web clickstream
• Machine/sensor
• Geo-location
Era | Hobbyist | Desktop | Internet | Big Data
Scale | KB | GB | PB | ZB
What is Big Data?
“High-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” (Gartner)
Rethinking data #1: Move from samples to populations
TRADITIONAL APPROACH | BIG DATA APPROACH
Analyze small subsets of information | Analyze all information
Analyzed information is a small part of all available information | All available information is analyzed
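To make the sample-versus-population contrast concrete, here is a minimal sketch. The data is synthetic and the function name `sample_vs_population` is hypothetical; it only illustrates that a sample gives an estimate while the full data set gives the exact figure.

```python
import random
import statistics

def sample_vs_population(population, sample_size, seed=0):
    """Return (sample mean, population mean) for a list of numbers."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    sample = rng.sample(population, sample_size)
    return statistics.mean(sample), statistics.mean(population)

# Synthetic "session lengths" in seconds; illustrative data only.
rng = random.Random(42)
population = [rng.randint(1, 600) for _ in range(100_000)]

sample_mean, population_mean = sample_vs_population(population, 1_000)
print(f"traditional (1% sample): {sample_mean:.1f}")
print(f"big data (all records):  {population_mean:.1f}")
```

The two numbers are usually close; the gap between them is the sampling error that the "analyze all information" approach avoids.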
Rethinking data #2: Let data do the talking
TRADITIONAL APPROACH | BIG DATA APPROACH
Start with hypothesis and test against selected data | Explore all data and identify correlations
Hypothesis → Question → Data → Answer | Data → Exploration → Correlation → Insight
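The exploratory approach can be sketched as scoring every pair of columns and surfacing the strongest relationships, rather than testing one hypothesis chosen in advance. The column names and values below are made up, and `strongest_correlations` is a hypothetical helper, not part of any tool named in this deck.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def strongest_correlations(columns, top=3):
    """Score every pair of columns and return the strongest ones first."""
    scored = [
        (abs(pearson(columns[a], columns[b])), a, b)
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(scored, reverse=True)[:top]

# Hypothetical metrics; the values are made up for illustration.
columns = {
    "page_views": [10, 20, 30, 40, 50],
    "purchases": [1, 2, 3, 4, 6],
    "support_calls": [5, 3, 4, 1, 2],
}
for score, a, b in strongest_correlations(columns):
    print(f"{a} ~ {b}: |r| = {score:.2f}")
```

The ranked pairs are candidates for a closer look; which ones turn into insight is still a judgment made after the exploration.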
Rethinking processes #1: Fail fast or progress iteratively
TRADITIONAL APPROACH | BIG DATA APPROACH
Carefully cleanse information before any analysis | Analyze information as is, cleanse as needed
Small amount of carefully organized information | Large amount of messy information
Rethinking processes #2: Provide insight in real time
TRADITIONAL APPROACH | BIG DATA APPROACH
Analyze data after it’s been processed and landed in a warehouse or mart | Analyze data in motion as it’s generated, in real time
Data → Repository → Analysis → Insight | Data → Analysis → Insight
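The difference between the two flows can be sketched in a few lines: the batch version lands everything in a repository before analyzing, while the streaming version updates its answer as each event arrives. The function names and the sensor readings are hypothetical.

```python
def running_average(events):
    """Yield the average so far after each event (analysis on data in motion)."""
    total = 0.0
    count = 0
    for value in events:
        total += value
        count += 1
        yield total / count

def batch_average(events):
    """Land all events in a repository first, then analyze (traditional)."""
    repository = list(events)
    return sum(repository) / len(repository)

# Hypothetical sensor readings arriving one at a time.
readings = [20.0, 22.0, 21.0, 25.0]
for seen, average in enumerate(running_average(readings), start=1):
    print(f"after {seen} readings: {average:.2f}")
print(f"batch result: {batch_average(readings):.2f}")
```

Both paths end at the same figure; the streaming path simply has a usable answer at every step along the way.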
Constraints of the current environment
Category | Existing optimization | Ask
Data type | Structured | Unstructured
H/W scalability | Vertical | Horizontal
Reliability | Pricey H/W | Free S/W
Interoperability | Closed by vendor | Open source
I/O | Write less, read more | Write more, read less
Insight | Newspaper/daily | Near real time
Data retention | Filtered/limited | Unfiltered/unlimited
Big Data Technologies
• Hadoop
• NoSQL
• Analytics/Visualization (Out of Scope)
How did Hadoop come about?
Year | Google
2004 | GFS, MapReduce
2005 | Sawzall
2006 | Bigtable
2010 | Dremel/F1
… | …
2012 | Spanner

Year | Open Source
2006 | HDFS
2008 | Pig, Hive
2008 | HBase
2013 | Impala
… | …
? | ?
HDFS: Distributed compute and storage
[Diagram: a NameNode and a JobTracker coordinating worker nodes, each running a DataNode (DN) and a TaskTracker (TT). Legend: DFS message path, MapReduce processing messages]
MapReduce: visual example
[Diagram stages: Distribute, Map, Shuffle, Reduce]
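The stages named on this slide can be sketched as a single-process word count. This is an illustration of the pattern only, not the Hadoop API; the function names are hypothetical, and in a real cluster the map and reduce calls would run on different nodes.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(mapped_chunks):
    """Shuffle: group all emitted values by key across chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    """Run map over every chunk, then shuffle and reduce the results."""
    mapped = [map_phase(chunk) for chunk in chunks]  # parallel on a cluster
    return reduce_phase(shuffle_phase(mapped))

# Distribute: the input is assumed to be already split into chunks.
chunks = ["big data big insight", "data in motion", "big cluster"]
print(word_count(chunks))
```

Because each map call sees only its own chunk and each reduce call sees only one key, both stages can be spread across machines without coordination beyond the shuffle.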
Hadoop Reference architecture
Welcome to the zoo!
hadoop - HDFS, MapReduce
sqoop - database to HDFS
flume - logs to HDFS
hbase - columnar store modeled on Bigtable; key, value
pig - python, ruby, php
hive - SQL queries
oozie - workflow coordination; XML-based scheduler/job orchestration
zookeeper - coordinator; misc admin functions: locking, messaging, mailboxes, leader election
fuse-dfs - HDFS volumes in Linux
avro - data serialization/RPC
mahout - machine learning
dumbo - Python library for streaming
vaidya - performance benchmarking framework
chukwa - cluster monitor
lucene - text search
scribe - log collection
storm - real-time processing
Hadoop companies
Interfaces to Hadoop
[Diagram labels: Analytics, Data Prep, CRM]
Hadoop vs. Relational Databases
• Hadoop: schema-on-read (write first, think later)
• RDBMS: schema-on-write (think first, write next)
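The two approaches can be contrasted in a small sketch. The schema, field names, and helper functions are hypothetical; a list stands in for both the relational table and the raw file store.

```python
import json

REQUIRED_FIELDS = {"user_id": int, "action": str}  # hypothetical schema

def write_with_schema(table, record):
    """Schema-on-write: validate before storing; reject bad records."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"bad or missing field: {field}")
    table.append(record)

def write_raw(store, line):
    """Schema-on-read: store the raw line untouched."""
    store.append(line)

def read_with_schema(store):
    """Apply the schema only at read time; skip lines that do not fit."""
    for line in store:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and all(
            isinstance(record.get(f), t) for f, t in REQUIRED_FIELDS.items()
        ):
            yield record

raw_store = []
write_raw(raw_store, '{"user_id": 1, "action": "click"}')
write_raw(raw_store, "not json at all")  # accepted at write time
print(list(read_with_schema(raw_store)))

table = []
write_with_schema(table, {"user_id": 2, "action": "view"})
```

With schema-on-read the malformed line costs nothing at ingest and is simply ignored by this particular reader; a different reader with a different schema could still make use of it.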
NoSQL – Not Only SQL
NoSQL Types
• Column family:
– Aggregate/OLAP oriented; the primary key is the data, which maps back to row IDs
– HBase, Accumulo, Cassandra
– NSA uses Accumulo with cell-level security for PRISM
• Document store:
– Object-oriented encapsulation; encodings include XML, YAML, JSON, and BSON
– MarkLogic, MongoDB, Couchbase
– MetLife uses MongoDB for “The Wall”, a customer 360 view CRM
• Key-value:
– (key, value) based lookups; associative array with hash table
– Dynamo, Riak, Voldemort
– LinkedIn used Voldemort behind “Who viewed my profile?”
• Graph:
– Graph structures with nodes, edges, and properties; index-free adjacency
– Neo4j, AllegroGraph, Virtuoso
– TwitLogic: semantic web using Twitter data
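The four data models above can be pictured with plain Python structures. This is only a sketch of the shape of the data; the keys and values are invented, and none of it reflects the storage format or API of the products listed.

```python
# Key-value: opaque values looked up by key (a hash table).
key_value = {"user:42": b"serialized-profile-bytes"}

# Document: self-describing nested records, typically JSON-like.
documents = [
    {"_id": 42, "name": "Ada", "policies": [{"type": "life", "active": True}]},
]

# Column family: row key -> column family -> column -> value.
column_family = {
    "row-42": {
        "profile": {"name": "Ada", "city": "London"},
        "activity": {"last_login": "2014-03-01"},
    },
}

# Graph: nodes with properties plus edges held as adjacency lists, so a
# neighbour lookup does not need a separate index.
nodes = {"ada": {"kind": "person"}, "bob": {"kind": "person"}}
edges = {"ada": [("follows", "bob")], "bob": []}

def neighbours(node, relation):
    """Follow edges of one relation type out of a node."""
    return [target for rel, target in edges[node] if rel == relation]

print(key_value["user:42"])
print(documents[0]["policies"][0]["type"])
print(column_family["row-42"]["profile"]["city"])
print(neighbours("ada", "follows"))
```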
SQL vs. NoSQL
SQL | NoSQL
Relational | Distributed/Hierarchical
Tables | Key-value pairs, documents, graphs, column families
Pre-defined schema | Dynamic schema
Vertically scalable | Horizontally scalable
SQL | UnQL (more programming)
Complex queries on small data | Simple queries on large data
ACID | BASE
Defined data model | Model inside application
Cumbersome setup (DBA) | Ease of setup
Simple data | Complex data
Eventually consistent
“CAP Theorem is a set of basic requirements that describe any distributed system, not just storage or database.”
“You cannot have a clustered system that supports all of the following three qualities: consistency, availability, partition tolerance.” (CAP Theorem, Prof. Eric Brewer, Berkeley)
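A toy model can show what "eventually consistent" means in practice: a write is accepted by one replica while the other is unreachable, a read from the other replica is stale, and a later sync brings them back in line. The `Replica` class and `sync` function are hypothetical, and the highest-version-wins rule is a deliberate simplification that ignores conflicting concurrent writes.

```python
class Replica:
    """One node holding a copy of the data as {key: (version, value)}."""

    def __init__(self):
        self.data = {}

    def write(self, key, value):
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))[1]

def sync(a, b):
    """Anti-entropy pass: both replicas keep the higher version of each key."""
    for key in set(a.data) | set(b.data):
        newest = max(
            a.data.get(key, (0, None)),
            b.data.get(key, (0, None)),
            key=lambda entry: entry[0],
        )
        a.data[key] = b.data[key] = newest

east, west = Replica(), Replica()
east.write("plan", "unlimited")  # accepted while the replicas are apart
print(west.read("plan"))         # stale read: prints None
sync(east, west)                 # partition heals, replicas converge
print(west.read("plan"))         # prints unlimited
```

The system stayed available for the write during the partition and paid for it with a temporarily inconsistent read, which is the trade-off the CAP theorem describes.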
Big Data Use Cases
Use Cases: Netflix
“House of Cards” is one of the first major test cases of this Big Data-driven creative strategy. Detailed knowledge of Netflix subscriber viewing preferences clinched their decision to license a remake of the popular and critically well-regarded 1990 BBC miniseries. Netflix’s data indicated that the same subscribers who loved the original BBC production also gobbled down movies starring Kevin Spacey or directed by David Fincher. Therefore, concluded Netflix executives, a remake of the BBC drama with Spacey and Fincher attached was a no-brainer, to the point that the company committed $100 million for two 13-episode seasons.
Where are we headed?
• H/W
– Couch - cluster of unreliable commodity hardware
– Software-defined storage reliability
• S/W
– HDFS will be the new UNIX (distributed FS)
– Open Source software
• Data Ingestion
– Online transactions + Batch file + Streaming torrents
• Technical Architecture
– Shared nothing
– Data centric (Process will move to data)
• Backup and recovery?
• Scalability
– Horizontal
– Vertical
• Mixed workloads
References
• McKinsey
• Gartner
• Forrester
• Wikibon
• IBM big data
• Oracle Big Data
• Aster
• MapR
• Cloudera
• Wikipedia Big Data
• Wikipedia NoSQL
• MongoDB
• Use Cases
Thank you
pradeepvaradan [email protected]