Keynote during BiDaTA 2013 in Genoa, a special track of the ADBIS 2013 conference. URL: http://dbdmg.polito.it/bidata2013/index.php/keynote-presentation
1
Hadoop is dead, long live Hadoop!
Lars George | EMEA Chief Architect @larsgeorge
A Eulogy and Proclamation
What the Press Says…
2
Source: http://blogs.the451group.com/information_management/2012/07/09/hadoop-is-dead-long-live-hadoop/
3
Big Data… WTH? A brief reasoning for Hadoop’s existence.
4
— Bubble Buddy, Head of IT
Big Data – A Misnomer
• A misleading name that invites quick assumptions • Current challenges are driven by many factors, not just the size of data
• ANY company can use the Big Data principles to improve specific business metrics • Increased data retention • Access to all the data • Machine learning for pattern detection, recommendations
• But what has happened to cause this all?
5
Explosive Data Growth
6
1.8 trillion gigabytes of data was created in 2011…
§ More than 90% is unstructured data § Approx. 500 quadrillion files § Quantity doubles every 2 years
[Chart: gigabytes of data created (in billions), 2005-2015, structured vs. unstructured data. Source: IDC 2011]
The ‘Big Data’ Phenomenon
7
Big Data Drivers:
§ The proliferation of data capture and creation technologies
§ Increased “interconnectedness” drives consumption (creating more data)
§ Inexpensive storage makes it possible to keep more, longer
§ Innovative software and analysis tools turn data into information
Big Data encompasses not only the content itself, but how it’s consumed.
More Devices
More Consumption
More Content
New & Better Information
§ Every gigabyte of stored content can generate a petabyte or more of transient data*
§ The information about you is much greater than the information you create
*Source: IDC 2011
The Current Solutions
8
Current Database Solutions are designed for structured data.
§ Optimized to answer known questions quickly § Schemas dictate form/context
§ Difficult to adapt to new data types and new questions
§ Expensive at Petabyte scale
[Chart: gigabytes of data created (in billions), 2005-2015; structured data makes up only about 10% of the total, the rest is unstructured.]
Data Management Strategies Have Stayed the Same
• Raw data on SAN, NAS and tape • Data moved from storage to compute • Relational models with predesigned schemas
Too Much Data, Too Many Sources
• Can’t ingest fast enough • Costs too much to store • Exists in different places • Archived data is lost
Can’t Use It The Way You Want To
• Analysis and processing takes too long • Data exists in silos • Can’t ask new questions • Can’t analyze unstructured data
The Big Data Challenge
18
Big Data Contains Limitless Insights… BUT its VOLUME, VARIETY and VELOCITY demand a new approach.
[Data sources shown: web logs, social media, transactional data, smart grids, operational data, digital content, R&D data, ad impressions, files]
Big Data Challenges
19
Cost-effectively managing the volume, velocity and variety of data
Deriving value across structured and unstructured data
Adapting to context changes and integrating new data sources and types
Big Data Solution Requirements
20
Cost-effectively manage the volume, variety and velocity of data
Process and analyze large, complex data sets…quickly
Flexibly adapt to context changes and new data types
21
Google’s Approach to Big Data
Hadoop’s Pedigree
A Timeline View #1
22
Google File System
• Foundation of scalable, fail-safe, self-healing storage • One central place of truth • Cost-effective hardware finally available
• 19” rack servers with a decent amount of disk space
• Handling of failures built in • Components or entire servers • At scale there are always hardware faults
• Simple file system interface • Finally no need for expensive, proprietary systems
23
Storage
MapReduce
• First take on a distributed data processing framework • Same concepts as the Google File System, i.e.
• Fail-safe and scalable • Handles a wide range of data processing problems
• BUT not all of them (more later) • Simple API reading and writing Key/Value pairs (see the sketch below) • Framework handles the heavy task of data movement • Core concept is data locality, heavy I/O
• Brings code to the data, not the opposite (i.e. no HPC) • Accessible in many programming languages
24
Processing
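The Key/Value nature of this API is easiest to see in the canonical word-count example. A minimal sketch, written against Hadoop’s open-source MapReduce API (covered later in this deck) rather than Google’s internal C++ framework; the class names are illustrative only.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits a (word, 1) pair for every word in an input line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // Key/Value pair out
      }
    }
  }
}

// Reducer: sums the counts the framework has grouped and sorted by word.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(word, new IntWritable(sum));
  }
}

The framework moves the intermediate pairs between nodes; user code only ever sees local keys and values.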
BigTable
• Adds database-like random access to data • Effectively a Key/Value store with table semantics (see the sketch below) • Used for small data points
• Usually less than a megabyte per Key/Value • Forfeits advanced concepts for ease of scalability
• No transactions, no query language • Powers many applications at Google • Uses the Google File System as its storage layer • Tight integration with MapReduce for batch processing
25
Random Access
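BigTable itself is Google-internal, but its open-source counterpart HBase (part of the Hadoop stack shown later) exposes the same Key/Value-with-table-semantics model. A minimal sketch against the HBase client API of that era; the table, column family and row key are invented for illustration, and the table is assumed to exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomAccessExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");   // hypothetical table with family 'd'

    // Write one small data point under a row key.
    Put put = new Put(Bytes.toBytes("user-0001"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("email"), Bytes.toBytes("jane@example.com"));
    table.put(put);

    // Random read of the same cell by row key, no query language involved.
    Result result = table.get(new Get(Bytes.toBytes("user-0001")));
    byte[] email = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("email"));
    System.out.println(Bytes.toString(email));

    table.close();
  }
}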
Dremel, Tenzing, Pregel
• Dremel adds a specific file format and query language • Used for highly selective queries, data exploration • File layout is optimized for very effective scanning • Runs alongside MapReduce and the File System
• Tenzing adds SQL over various data sources • Can query raw files, Dremel files, or BigTable data etc. • Brings a “known” paradigm to stored data
• Pregel adds a graph processing API
26
Query API
Percolator, Megastore
• Additions to BigTable to add “missing” features • Percolator uses BigTable to update the search index incrementally, needs transactions (a single-row analog in HBase is sketched below) • Distributes updates with multi-phase commits
• Megastore drives Google App Engine to also add transactions for the user API • Uses ranges of rows as entity groups • Reduces locking to small subsets • Optimistic, roll-forward-only transactions • Java layer over the BigTable API
27
Transactions
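Neither Percolator nor Megastore is open source, but the flavor of optimistic, conditional updates can be hinted at with HBase’s single-row check-and-put primitive. A minimal sketch only; the table and values are invented, and this is nothing like a full multi-row transaction.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ConditionalUpdateExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "documents");   // hypothetical table with family 'd'

    byte[] row = Bytes.toBytes("doc-42");
    byte[] family = Bytes.toBytes("d");
    byte[] qualifier = Bytes.toBytes("version");

    // Apply the new version only if the cell still holds the value we read earlier;
    // if another writer got there first the put is rejected and we would retry.
    Put put = new Put(row);
    put.add(family, qualifier, Bytes.toBytes("2"));
    boolean applied = table.checkAndPut(row, family, qualifier, Bytes.toBytes("1"), put);
    System.out.println(applied ? "update applied" : "lost the race, retry");

    table.close();
  }
}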
Spanner, F1
• Future of Google’s distributed storage and processing system
• Spanner is a scalable, multi-version, globally-distributed, and synchronously-replicated database • Replicates across datacenters • Uses TrueTime (atomic clocks) for synchronization • Uses Colossus for storage (a GFS successor)
• F1 replaced MySQL for the AdWords service • SQL over data stored in Spanner • Colocated with Spanner processes
28
World-Wide Data
29
The Hadoop Story
A Eulogy
What is Apache Hadoop?
30
Has the Flexibility to Store and Mine Any Type of Data
§ Ask questions across structured and unstructured data that were previously impossible to ask or solve
§ Not bound by a single schema
Excels at Processing Complex Data
§ Scale-out architecture divides workloads across multiple nodes
§ Flexible file system eliminates ETL bottlenecks
Scales Economically
§ Can be deployed on commodity hardware
§ Open source platform guards against vendor lock-in
Hadoop Distributed File System (HDFS)
Self-Healing, High Bandwidth Clustered Storage
MapReduce/YARN
Distributed Computing Framework
Apache Hadoop is an open source platform for data storage and processing that is…
✓ Scalable ✓ Fault tolerant ✓ Distributed
CORE HADOOP SYSTEM COMPONENTS
Core Hadoop: HDFS
31
Self-healing, high bandwidth
[Diagram: an incoming file is split into blocks 1-5; HDFS stores three replicas of each block, spread across the nodes of the cluster.]
HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
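A small sketch of the client view of this, using the standard Hadoop FileSystem Java API: write a file, then ask where its blocks and replicas landed. The path is made up, and the cluster configuration is assumed to be on the classpath.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Write a file; HDFS transparently splits it into blocks and replicates them.
    Path path = new Path("/tmp/example.txt");
    FSDataOutputStream out = fs.create(path);
    out.writeUTF("hello hadoop");
    out.close();

    // Ask the NameNode where the blocks (and their replicas) ended up.
    FileStatus status = fs.getFileStatus(path);
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("block at offset " + block.getOffset()
          + " stored on " + Arrays.toString(block.getHosts()));
    }
  }
}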
Core Hadoop: MapReduce
32
[Diagram: a MapReduce job runs its tasks in parallel on the nodes that hold blocks 1-5, then merges the partial results.]
Processes large jobs in parallel across many nodes and combines the results.
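A minimal driver sketch for the mapper and reducer shown earlier, using Hadoop’s standard MapReduce API; the input and output paths are taken from the command line and are otherwise arbitrary.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    // Wire up the mapper/reducer sketched earlier in this deck.
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input splits map to HDFS blocks by default, so tasks run where the data lives.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}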
Why Hadoop Was Created
33
New opportunities to derive value from all your data.
Exploding Data Volumes & Types
Driving The Need For A Flexible, Scalable Solution
It’s difficult to handle data this diverse, at this scale. Traditional platforms can’t keep pace.
WEB LOGS, SOCIAL MEDIA, TRANSACTIONAL DATA, SMART GRIDS, OPERATIONAL DATA, DIGITAL CONTENT, R&D DATA, AD IMPRESSIONS, FILES
BIG DATA: • Any Kind • From Any Source • Structured & Unstructured • At Scale
HARD PROBLEMS: • Deep Analysis • Exhaustive & Detailed • Sophisticated Algorithms • Generate Results Quickly
NEW OPPORTUNITIES: • Extract More Value • From More Data • More Cost Effectively • With Greater Flexibility
The Core Values of Hadoop
34
A platform for
§ Designed to store and process data at petabyte scale
§ Scale-out architecture increases capacity and processing power linearly
§ Perform operations in parallel across the entire cluster
§ Store data in any format – free from rigid schemas
§ Define context at the time you ask the question
§ Process and analyze data using virtually any programming language
§ Build out your cluster on your hardware of choice
§ Open source software guards against vendor lock-in
§ Wide integration ensures investment protection
Hadoop In Practice
35
36
Cloudera Software Stack
Turnkey solution for Big Data and Advanced Analytics use-cases
CDH 100% OPEN SOURCE HADOOP DISTRIBUTION
CORE PROJECTS: HDFS, MAPREDUCE, FLUME, HCATALOG, HIVE, HUE, MAHOUT, OOZIE, PIG, SQOOP, WHIRR, ZOOKEEPER
PREMIUM PROJECTS: HBASE, IMPALA, SEARCH (BETA)
CONNECTORS: MICROSTRATEGY, NETEZZA, ORACLE, QLIKVIEW, TABLEAU, TERADATA
CLOUDERA MANAGER: END-TO-END SYSTEM MANAGEMENT (DEPLOYMENT, MONITORING, API, SNMP, CONFIG ROLLBACKS, PHONE HOME, SERVICE MGMT, DIAGNOSTICS, ROLLING UPGRADES, LDAP, REPORTING, BACKUP/DR)
CLOUDERA NAVIGATOR: END-TO-END DATA MANAGEMENT (ACCESS MGMT, DATA AUDIT)
CLOUDERA SUPPORT: BEST-IN-CLASS TECHNICAL SUPPORT, COMMUNITY ADVOCACY & INDEMNIFICATION
37
Spin some YARN!
Reborn again!
Back to the Press again…
38
Source: http://gigaom.com/2012/07/07/why-the-days-are-numbered-for-hadoop-as-we-know-it/
A Timeline View #2
39
First: What is MapReduce 1?
40
Motivations to Change MR1
41
• Scaling beyond ~4,000 nodes • Fewer, larger clusters
• No single source of truth, data in “silos” again
• HA of the Job Tracker is difficult • Large, complex state
• Poor resource utilization • Slots in MR1 are for either map or reduce
YARN: Yet Another Resource Negotiator
42
Split of Responsibilities
43
The Job Tracker is split into a Resource Manager and an Application Master:
• Resource Manager: one per cluster • Long-lived • App-level scheduling
• Application Master: one per app instance • Short-lived • Task-level scheduling and monitoring
Fine-grained Resource Control
• The Node Manager is a generalized Task Tracker
• Task Tracker: fixed number of map and reduce slots
• Node Manager: containers with variable resource limits (see the configuration sketch below)
44
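What “variable resource limits” looks like from a client’s point of view: an MR2 job can request per-task container sizes through standard Hadoop 2.x configuration properties. A minimal sketch; the values are arbitrary examples, and the rest of the job setup is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ContainerSizingExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Per-task container sizes requested from YARN; no fixed map/reduce slots anymore.
    conf.set("mapreduce.map.memory.mb", "1536");
    conf.set("mapreduce.reduce.memory.mb", "3072");
    conf.set("mapreduce.map.cpu.vcores", "1");
    conf.set("mapreduce.reduce.cpu.vcores", "2");

    Job job = Job.getInstance(conf, "container sizing demo");
    // ... set mapper, reducer, input and output paths as usual ...
  }
}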
Node Manager: Containers
45
YARN + MapReduce 2
46
• YARN “runs” MapReduce as an application • MR is user space • YARN is kernel
YARN Applications
• Distributed shell • Open MPI • Master-worker • Apache Giraph, Hama • Spark
47
48
Summary
What the future may hold
Enterprise Data Evolution
[Chart: amount of data vs. business impact, from improving operational efficiency to creating competitive advantage]
• 1980s, RDBMS/EDW: data collection & reporting
• 2000s, Hadoop-optimized infrastructure: process data faster • Store data more cost-effectively • Simplify infrastructure
• 2010s, next-gen data computing platform (the data-driven organization): combine data from across the business • Ask new questions immediately • Enable new real-time applications
Playing Catchup
• Improve overall performance • Google’s code is a kernel module, C++, as low-level as possible • Hadoop is Java, for ease of development in open source • Maybe rewrite parts of the stack? • Overall goal: saturate machine specs (I/O, CPU, RAM)
• Add missing features • Everything is based on “hearsay”, aka research papers and presentations
• Add what is necessary, or just for the sake of it?
50
Further Extend or Invent?
• YARN is a good example of what can be done • Look at every component and evaluate • Work with research, universities, and companies to drive new development
• What else can be done with all that data?
51
52
— Jim Gray, Computer Scientist
From Framework to Platform to Commodity
• Hadoop distributions are already a commodity • Move up the stack to reach the commercial space
• Simplify data processing • Continuuity • WibiData (Kiji) • Cloudera CDK
• Pure Hadoop solutions • Datameer • Platfora
53
Hadoop… live long and prosper!
54
Lars George, EMEA Chief Architect, Cloudera @larsgeorge
Thank you!