Keynote during BiDaTA 2013 in Genoa, a special track of the ADBIS 2013 conference. URL: http://dbdmg.polito.it/bidata2013/index.php/keynote-presentation
1
Hadoop is dead, long live Hadoop!
Lars George | EMEA Chief Architect @larsgeorge
A Eulogy and Proclamation
What the Press Says…
2
Source: http://blogs.the451group.com/information_management/2012/07/09/hadoop-is-dead-long-live-hadoop/
3
Big Data… WTH? A brief reasoning for Hadoop’s existence.
4
— Bubble Buddy, Head of IT
Big Data – A Misnomer
• A misleading name that invites quick assumptions • Current challenges are driven by many factors, not just the size of data
• ANY company can use the Big Data principles to improve specific business metrics • Increased data retention • Access to all the data • Machine learning for pattern detection, recommendations
• But what has happened to cause this all?
5
Explosive Data Growth
6
1.8 trillion gigabytes of data was created in 2011…
§ More than 90% is unstructured data § Approx. 500 quadrillion files § Quantity doubles every 2 years
[Chart: gigabytes of data created (in billions), 2005-2015, structured vs. unstructured data. Source: IDC 2011]
The ‘Big Data’ Phenomenon
7
Big Data Drivers:
§ The proliferation of data capture and creation technologies
§ Increased “interconnectedness” drives consumption (creating more data)
§ Inexpensive storage makes it possible to keep more, longer
§ Innovative software and analysis tools turn data into information
Big Data encompasses not only the content itself, but how it’s consumed.
More Devices
More Consumption
More Content
New & Better Information
§ Every gigabyte of stored content can generate a petabyte or more of transient data*
§ The information about you is much greater than the information you create
*Source: IDC 2011
The Current Solutions
8
Current Database Solutions are designed for structured data.
§ Optimized to answer known questions quickly § Schemas dictate form/context
§ Difficult to adapt to new data types and new questions
§ Expensive at Petabyte scale
[Chart: gigabytes of data created (in billions), 2005-2015; structured data makes up only about 10% of the total, the rest is unstructured.]
Data Management Strategies Have Stayed the Same
• Raw data on SAN, NAS and tape • Data moved from storage to compute • Relational models with predesigned schemas
Too Much Data, Too Many Sources
• Can’t ingest fast enough • Costs too much to store • Exists in different places • Archived data is lost
Can’t Use It The Way You Want To
• Analysis and processing takes too long • Data exists in silos • Can’t ask new questions • Can’t analyze unstructured data
The Big Data Challenge
18
Big Data Contains Limitless Insights… BUT its VOLUME, VARIETY and VELOCITY demand a new approach.
[Data sources shown: web logs, social media, transactional data, smart grids, operational data, digital content, R&D data, ad impressions, files]
Big Data Challenges
19
Cost-effectively managing the volume, velocity and variety of data
Deriving value across structured and unstructured data
Adapting to context changes and integrating new data sources and types
Big Data Solution Requirements
20
Cost-effectively manage the volume, variety and velocity of data
Process and analyze large, complex data sets…quickly
Flexibly adapt to context changes and new data types
21
Google’s Approach to Big Data
Hadoop’s Pedigree
A Timeline View #1
22
Google File System
• Foundation of scalable, fail-safe, self-healing storage • One central place of truth • Cost-effective hardware finally available
• 19” rack servers with a decent amount of disk space
• Handling of failures built in • Components or entire servers • At scale there are always hardware faults
• Simple file system interface • Finally no need for expensive, proprietary systems
23
Storage
MapReduce
• First take on a distributed data processing framework • Same concepts as the Google File System, i.e.
• Fail-safe and scalable • Handles a wide range of data processing problems
• BUT not all of them (more later) • Simple API reading and writing Key/Value pairs (see the sketch below) • Framework handles the heavy task of data movement • Core concept is data locality, heavy I/O
• Brings code to the data, not the opposite (i.e. no HPC) • Accessible in many programming languages
24
Processing
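The Key/Value nature of this API is easiest to see in the canonical word-count example. A minimal sketch, written against Hadoop’s open-source MapReduce API (covered later in this deck) rather than Google’s internal C++ framework; the class names are illustrative only.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits a (word, 1) pair for every word in an input line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // Key/Value pair out
      }
    }
  }
}

// Reducer: sums the counts the framework has grouped and sorted by word.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(word, new IntWritable(sum));
  }
}

The framework moves the intermediate pairs between nodes; user code only ever sees local keys and values.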
BigTable
• Adds database-like random access to data • Effectively a Key/Value store with table semantics (see the sketch below) • Used for small data points
• Usually less than a megabyte per Key/Value • Forfeits advanced concepts for ease of scalability
• No transactions, no query language • Powers many applications at Google • Uses the Google File System as its storage layer • Tight integration with MapReduce for batch processing
25
Random Access
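BigTable itself is Google-internal, but its open-source counterpart HBase (part of the Hadoop stack shown later) exposes the same Key/Value-with-table-semantics model. A minimal sketch against the HBase client API of that era; the table, column family and row key are invented for illustration, and the table is assumed to exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomAccessExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");   // hypothetical table with family 'd'

    // Write one small data point under a row key.
    Put put = new Put(Bytes.toBytes("user-0001"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("email"), Bytes.toBytes("jane@example.com"));
    table.put(put);

    // Random read of the same cell by row key, no query language involved.
    Result result = table.get(new Get(Bytes.toBytes("user-0001")));
    byte[] email = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("email"));
    System.out.println(Bytes.toString(email));

    table.close();
  }
}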
Dremel, Tenzing, Pregel
• Dremel adds a specific file format and query language • Used for highly selective queries, data exploration • File layout is optimized for very effective scanning • Runs alongside MapReduce and the File System
• Tenzing adds SQL over various data sources • Can query raw files, Dremel files, or BigTable data etc. • Brings a “known” paradigm to stored data
• Pregel adds a graph processing API
26
Query API
Percolator, Megastore
• Additions to BigTable to add “missing” features • Percolator uses BigTable to update the search index incrementally, needs transactions (a single-row analog in HBase is sketched below) • Distributes updates with multi-phase commits
• Megastore drives Google App Engine to also add transactions for the user API • Uses ranges of rows as entity groups • Reduces locking to small subsets • Optimistic, roll-forward-only transactions • Java layer over the BigTable API
27
Transactions
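Neither Percolator nor Megastore is open source, but the flavor of optimistic, conditional updates can be hinted at with HBase’s single-row check-and-put primitive. A minimal sketch only; the table and values are invented, and this is nothing like a full multi-row transaction.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ConditionalUpdateExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "documents");   // hypothetical table with family 'd'

    byte[] row = Bytes.toBytes("doc-42");
    byte[] family = Bytes.toBytes("d");
    byte[] qualifier = Bytes.toBytes("version");

    // Apply the new version only if the cell still holds the value we read earlier;
    // if another writer got there first the put is rejected and we would retry.
    Put put = new Put(row);
    put.add(family, qualifier, Bytes.toBytes("2"));
    boolean applied = table.checkAndPut(row, family, qualifier, Bytes.toBytes("1"), put);
    System.out.println(applied ? "update applied" : "lost the race, retry");

    table.close();
  }
}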
Spanner, F1
• Future of Google’s distributed storage and processing system
• Spanner is a scalable, multi-version, globally-distributed, and synchronously-replicated database • Replicates across datacenters • Uses TrueTime (atomic clocks) for synchronization • Uses Colossus for storage (a GFS successor)
• F1 replaced MySQL for the AdWords service • SQL over data stored in Spanner • Colocated with Spanner processes
28
World-Wide Data
29
The Hadoop Story
A Eulogy
What is Apache Hadoop?
30
Has the Flexibility to Store and Mine Any Type of Data
§ Ask questions across structured and unstructured data that were previously impossible to ask or solve
§ Not bound by a single schema
Excels at Processing Complex Data
§ Scale-out architecture divides workloads across multiple nodes
§ Flexible file system eliminates ETL bottlenecks
Scales Economically
§ Can be deployed on commodity hardware
§ Open source platform guards against vendor lock-in
Hadoop Distributed File System (HDFS)
Self-Healing, High Bandwidth Clustered Storage
MapReduce/YARN
Distributed Computing Framework
Apache Hadoop is an open source platform for data storage and processing that is…
✓ Scalable ✓ Fault tolerant ✓ Distributed
CORE HADOOP SYSTEM COMPONENTS
Core Hadoop: HDFS
31
Self-healing, high bandwidth
[Diagram: an incoming file is split into blocks 1-5; HDFS stores three replicas of each block, spread across the nodes of the cluster.]
HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
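A small sketch of the client view of this, using the standard Hadoop FileSystem Java API: write a file, then ask where its blocks and replicas landed. The path is made up, and the cluster configuration is assumed to be on the classpath.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Write a file; HDFS transparently splits it into blocks and replicates them.
    Path path = new Path("/tmp/example.txt");
    FSDataOutputStream out = fs.create(path);
    out.writeUTF("hello hadoop");
    out.close();

    // Ask the NameNode where the blocks (and their replicas) ended up.
    FileStatus status = fs.getFileStatus(path);
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("block at offset " + block.getOffset()
          + " stored on " + Arrays.toString(block.getHosts()));
    }
  }
}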
Core Hadoop: MapReduce
32
[Diagram: a MapReduce job runs its tasks in parallel on the nodes that hold blocks 1-5, then merges the partial results.]
Processes large jobs in parallel across many nodes and combines the results.
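A minimal driver sketch for the mapper and reducer shown earlier, using Hadoop’s standard MapReduce API; the input and output paths are taken from the command line and are otherwise arbitrary.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    // Wire up the mapper/reducer sketched earlier in this deck.
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input splits map to HDFS blocks by default, so tasks run where the data lives.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}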
Why Hadoop Was Created
33
New opportunities to derive value from all your data.
Exploding Data Volumes & Types
Driving The Need For A Flexible, Scalable Solution
It’s difficult to handle data this diverse, at this scale. Traditional platforms can’t keep pace.
WEB LOGS, SOCIAL MEDIA, TRANSACTIONAL DATA, SMART GRIDS, OPERATIONAL DATA, DIGITAL CONTENT, R&D DATA, AD IMPRESSIONS, FILES
BIG DATA: • Any Kind • From Any Source • Structured & Unstructured • At Scale
HARD PROBLEMS: • Deep Analysis • Exhaustive & Detailed • Sophisticated Algorithms • Generate Results Quickly
NEW OPPORTUNITIES: • Extract More Value • From More Data • More Cost Effectively • With Greater Flexibility
The Core Values of Hadoop
34
A platform for
§ Designed to store and process data at petabyte scale
§ Scale-out architecture increases capacity and processing power linearly
§ Perform operations in parallel across the entire cluster
§ Store data in any format – free from rigid schemas
§ Define context at the time you ask the question
§ Process and analyze data using virtually any programming language
§ Build out your cluster on your hardware of choice
§ Open source software guards against vendor lock-in
§ Wide integration ensures investment protection
Hadoop In Practice
35
36
Cloudera Software Stack
Turnkey solution for Big Data and Advanced Analytics use-cases
CDH 100% OPEN SOURCE HADOOP DISTRIBUTION
CORE PROJECTS: HDFS, MAPREDUCE, FLUME, HCATALOG, HIVE, HUE, MAHOUT, OOZIE, PIG, SQOOP, WHIRR, ZOOKEEPER
PREMIUM PROJECTS: HBASE, IMPALA, SEARCH (BETA)
CONNECTORS: MICROSTRATEGY, NETEZZA, ORACLE, QLIKVIEW, TABLEAU, TERADATA
CLOUDERA MANAGER: END-TO-END SYSTEM MANAGEMENT (DEPLOYMENT, MONITORING, API, SNMP, CONFIG ROLLBACKS, PHONE HOME, SERVICE MGMT, DIAGNOSTICS, ROLLING UPGRADES, LDAP, REPORTING, BACKUP/DR)
CLOUDERA NAVIGATOR: END-TO-END DATA MANAGEMENT (ACCESS MGMT, DATA AUDIT)
CLOUDERA SUPPORT: BEST-IN-CLASS TECHNICAL SUPPORT, COMMUNITY ADVOCACY & INDEMNIFICATION
37
Spin some YARN!
Reborn again!
Back to the Press again…
38
Source: http://gigaom.com/2012/07/07/why-the-days-are-numbered-for-hadoop-as-we-know-it/
A Timeline View #2
39
First: What is MapReduce 1?
40
Motivations to Change MR1
41
• Scaling beyond ~4,000 nodes • Fewer, larger clusters
• No single source of truth, data in “silos” again
• HA of the Job Tracker is difficult • Large, complex state
• Poor resource utilization • Slots in MR1 are for either map or reduce
YARN: Yet Another Resource Negotiator
42
Split of Responsibilities
43
The Job Tracker is split into a Resource Manager and an Application Master:
• Resource Manager: one per cluster • Long-lived • App-level scheduling
• Application Master: one per app instance • Short-lived • Task-level scheduling and monitoring
Fine-grained Resource Control
• The Node Manager is a generalized Task Tracker
• Task Tracker: fixed number of map and reduce slots
• Node Manager: containers with variable resource limits (see the configuration sketch below)
44
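What “variable resource limits” looks like from a client’s point of view: an MR2 job can request per-task container sizes through standard Hadoop 2.x configuration properties. A minimal sketch; the values are arbitrary examples, and the rest of the job setup is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ContainerSizingExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Per-task container sizes requested from YARN; no fixed map/reduce slots anymore.
    conf.set("mapreduce.map.memory.mb", "1536");
    conf.set("mapreduce.reduce.memory.mb", "3072");
    conf.set("mapreduce.map.cpu.vcores", "1");
    conf.set("mapreduce.reduce.cpu.vcores", "2");

    Job job = Job.getInstance(conf, "container sizing demo");
    // ... set mapper, reducer, input and output paths as usual ...
  }
}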
Node Manager: Containers
45
YARN + MapReduce 2
46
• YARN “runs” MapReduce as an application • MR is user space • YARN is kernel
YARN Applications
• Distributed shell • Open MPI • Master-worker • Apache Giraph, Hama • Spark
47
48
Summary
What the future may hold
Enterprise Data Evolution
[Chart: amount of data vs. business impact, from improving operational efficiency to creating competitive advantage]
• 1980s, RDBMS/EDW: data collection & reporting
• 2000s, Hadoop-optimized infrastructure: process data faster • Store data more cost-effectively • Simplify infrastructure
• 2010s, next-gen data computing platform (the data-driven organization): combine data from across the business • Ask new questions immediately • Enable new real-time applications
Playing Catchup
• Improve overall performance • Google’s code is a kernel module, C++, as low-level as possible • Hadoop is Java, for ease of development in open source • Maybe rewrite parts of the stack? • Overall goal: saturate machine specs (I/O, CPU, RAM)
• Add missing features • Everything is based on “hearsay”, aka research papers and presentations
• Add what is necessary, or just for the sake of it?
50
Further Extend or Invent?
• YARN is a good example of what can be done • Look at every component and evaluate • Work with research, universities, and companies to drive new development
• What else can be done with all that data?
51
52
— Jim Gray, Computer Scientist
From Framework to Platform to Commodity
• Hadoop distributions are already a commodity • Move up the stack to reach the commercial space
• Simplify data processing • Continuuity • WibiData (Kiji) • Cloudera CDK
• Pure Hadoop solutions • Datameer • Platfora
53
Hadoop… live long and prosper!
54
Lars George, EMEA Chief Architect, Cloudera @larsgeorge
Thank you!