Hadoop for Shanghai Dev Meetup

© Copyright 2011 EMC Corporation. All rights reserved.

Hadoop (Shanghai Developer Meetup – Sept 15, 2011)

余家昌 (Andrew Yu)

EMC Greenplum


The Elephant Chase



Yahoo! Hadoop use cases

• Personalized Yahoo! Homepage

• Yahoo! Mail anti-spam

• Search and Ad pipelines

• Ad inventory prediction

• Data analytics

• etc


Enterprise Use Case: “Big ETL”

Challenge: Transform Massive Data Flows Containing Data Needed for Complex Analysis

• Examples:

– Web Traffic Reduction

– Network Traffic & Performance Analysis

– Location Analytics for People and Goods

– Smart Electric Power Grid

– Genome Analysis

– Clinical Outcome Research & Analysis

• Data Sources:

– Web server & app server logs

– CDRs / xDRs

– Router & Switching Subsystem Logs

– Sensor networks

Solution: Hadoop/MapReduce as ETL fabric to load to Analytic Database

• Components:

– Hadoop: Massively-parallel ingest, storage and analysis

– MapReduce: Runs multiple cascaded custom analysis / extraction on captured data

– Connectors: Move structured data to Analytic DB

• Hadoop’s Roles:

– Capture TBs/day of machine-generated data

– Quality: Run data quality tasks in MapReduce

– Execute MapReduce flows

– Extract/Combine data/metadata

– Move processed data to analytic DB

• Limitations & Cautions:

– Software development, More parts (Cascading/Flow), Maintainability
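
The "Big ETL" flow above (capture, data-quality check, MapReduce transform, load to the analytic DB) can be illustrated with a small in-process sketch. It does not use the Hadoop API. The log layout (`ip url bytes`), the function names and the `|` delimiter for load rows are all assumptions made for illustration.

```python
from collections import defaultdict

def map_log_line(line):
    """Map step: parse one log line and emit (url, bytes) pairs."""
    parts = line.split()
    # Data-quality task: drop malformed records instead of failing the job.
    if len(parts) != 3 or not parts[2].isdigit():
        return []
    _ip, url, size = parts
    return [(url, int(size))]

def reduce_bytes(url, sizes):
    """Reduce step: total bytes served per URL."""
    return (url, sum(sizes))

def run_job(lines):
    """In-process stand-in for the shuffle between map and reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_log_line(line):
            groups[key].append(value)
    return [reduce_bytes(key, values) for key, values in sorted(groups.items())]

def to_load_rows(results):
    """Format reducer output as delimited rows for a bulk load (assumed format)."""
    return [f"{url}|{total}" for url, total in results]

# Hypothetical sample input; the second-to-last line is deliberately malformed.
SAMPLE_LOG = [
    "10.0.0.1 /index.html 512",
    "10.0.0.2 /index.html 256",
    "garbage line",
    "10.0.0.3 /about.html 100",
]

if __name__ == "__main__":
    for row in to_load_rows(run_job(SAMPLE_LOG)):
        print(row)
```

On a real cluster the map and reduce functions would run in parallel across nodes and a connector would perform the load; here everything runs in one process so the flow is easy to follow.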


Enterprise Use Case: Fraud Detection

Challenge: Identify & alert on fraudulent activity patterns

• Examples:

– ESPs: Email fraud

– Finance/Banking: Bank fraud

– Advertising: Click fraud

– Telecom: Network fraud

• Data Sources:

– Web & app server logs

– IP/Call Records

– Email Traffic

– Customer Transaction Data

– Banking/Credit Data

Solution: Hadoop/MapReduce to filter & correlate communications

• Components:

– Hadoop: Massively-parallel ingest, storage and analysis

– Mahout: Machine learning tool for building fraud algorithms

– MapReduce: Rapid analysis & algorithm deployment

• Hadoop’s Role(s):

– Massive ingest of historical/real-time data

– Build/Validate model for fraud detection manually or using Mahout

– Parallel MapReduce jobs for near real-time fraud detection


• Limitations & Cautions:

– Software development, Partial Solution (not Real-time, not Interactive)
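
As one illustration of the "filter & correlate" idea for the click-fraud example, the sketch below counts clicks per (IP, ad) pair in map/reduce style and flags pairs above a cutoff. The record layout, the `CLICK_THRESHOLD` value and the function names are hypothetical. A production model would be built and validated as the slide describes, manually or with Mahout, not with a fixed threshold.

```python
from collections import Counter

# Hypothetical cutoff: more clicks than this from one IP on one ad is suspicious.
CLICK_THRESHOLD = 3

def map_click(record):
    """Map step: key each click by (ip, ad_id) and emit a count of 1."""
    ip, ad_id = record
    return ((ip, ad_id), 1)

def detect_suspects(records, threshold=CLICK_THRESHOLD):
    """Reduce step: sum counts per key and keep keys above the threshold."""
    counts = Counter()
    for record in records:
        key, one = map_click(record)
        counts[key] += one
    return sorted(key for key, total in counts.items() if total > threshold)

if __name__ == "__main__":
    clicks = [("1.1.1.1", "ad7")] * 5 + [("2.2.2.2", "ad7")] * 2
    print(detect_suspects(clicks))
```

Because this runs as a batch over collected records, it matches the slide's caution that the approach is neither real-time nor interactive.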


Enterprise Use Case: Cluster Analysis

Challenge: Grouping a collection of data according to common similarities

• Examples:

– Customer segmentation

– Financial cost/risk analysis

– Patient-centric healthcare

– Financial stock classification

– Social network analysis

• Data Sources:

– Health records

– Sales data

– Human genome sequences

– Financial trading data

– Facebook/Twitter/LinkedIn

Solution: Process and Refine in Hadoop and load into Analytical DB

• Components:

– Hadoop: Flexible data storage as volume increases and structures vary

– MapReduce: Cascading allows data processing with minimal adjustments

– Optional: Connectors to move results to Analytic DB

• Hadoop’s Role(s):

– Flexible: Allows agile implementation and unit testing of algorithms

– Large-scale analysis in Hadoop creates more accurate groupings

– Rapid, parallel processing in MapReduce


• Limitations & Cautions:

– Software development, Complex Integration with Sources
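
The slide does not name a clustering algorithm. As an assumption, the sketch below uses k-means, which fits the map/reduce pattern well: the map step assigns each point to its nearest centroid and the reduce step averages each group. All names and the sample data are illustrative.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def kmeans_step(points, centroids):
    """One round: assign points to centroids (map), then average groups (reduce)."""
    groups = defaultdict(list)
    for point in points:
        groups[nearest(point, centroids)].append(point)
    new_centroids = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(dim) / len(members) for dim in zip(*members))
        )
    return new_centroids

if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans_step(points, [(0.0, 0.0), (10.0, 10.0)]))
```

A full run would repeat `kmeans_step` until the centroids stop moving. Each round corresponds to one MapReduce job, which is where the parallel processing noted above pays off on large data sets.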


Greenplum HD: Community Edition Stack

100% APACHE

Currently supported:

• Hadoop Distributed File System (HDFS)

• MapReduce Framework (MapRed)

• Pig

• Hive

• HBase

• ZooKeeper

Future releases may include support for Oozie and Mahout

http://hadoop.apache.org/

http://yahoo.github.com/oozie


Greenplum HD: Enterprise Edition Stack

100% APACHE INTERFACE

Currently supported:

• Hadoop Distributed File System (HDFS)

• MapReduce Framework (MapRed)

• Pig

• Hive

• HBase

• ZooKeeper

• Enhanced Monitoring

Future releases may include support for Oozie and Mahout

http://hadoop.apache.org/

http://yahoo.github.com/oozie


Greenplum HD: Enterprise Edition

Enterprise-Ready Hadoop Platform for Unstructured Data

• Faster: 2 – 5x faster than Apache Hadoop

• Reliable: High Availability, Mirroring

• Easier to Use: NFS mountable, System Management


Greenplum Enterprise HD is Faster than Other Distributions

[Charts: DFSIO throughput in MB/sec for Read and Write (higher is better); Terasort elapsed time in minutes on 3.5 TB (lower is better)]

10-node cluster, 2x Quad-Core, 24G DRAM, 12 x 1TB SATA Drives @ 7200 rpm, Quad NICs


Greenplum Enterprise HD Distributed Name Node

• Fully distributed service running on all Hadoop nodes

• Automatic and transparent failover

• Persistent metadata

• Highly scalable in number of files

[Diagram: ten Hadoop nodes, each running a Name Node (NN) service]


Greenplum Enterprise HD Job Tracker High Availability

• Assures business continuity

• Designed for mission critical use

– Automatic stateful restart

– Task Tracker reconnects without task loss

– Persistent completed task state

[Diagram: Greenplum Enterprise HD Distribution for Apache Hadoop, showing Enterprise HD MapReduce, Enterprise HD Lockless Storage Services, Distributed Name Node and Job Tracker HA]
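
The slide gives no implementation detail for "persistent completed task state". The sketch below only illustrates the general idea: if a tracker records finished tasks durably, a restarted tracker can skip them. The `JobState` class, the JSON file format and the task ids are hypothetical and are not the product's actual mechanism.

```python
import json
import os
import tempfile

class JobState:
    """Tracks completed task ids in a file so they survive a restart."""

    def __init__(self, path):
        self.path = path
        self.completed = set()
        # Stateful restart: reload whatever a previous instance persisted.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as handle:
                self.completed = set(json.load(handle))

    def mark_completed(self, task_id):
        """Record a finished task and persist the full set atomically."""
        self.completed.add(task_id)
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            json.dump(sorted(self.completed), handle)
        os.replace(tmp_path, self.path)  # atomic rename over the old state

    def pending(self, all_tasks):
        """Tasks that still need to run, in their original order."""
        return [task for task in all_tasks if task not in self.completed]

if __name__ == "__main__":
    state_file = os.path.join(tempfile.mkdtemp(), "job_state.json")
    tracker = JobState(state_file)
    tracker.mark_completed("task-1")
    restarted = JobState(state_file)  # simulates a tracker restart
    print(restarted.pending(["task-1", "task-2"]))
```

Writing to a temporary file and renaming it means a crash mid-write leaves the previous state intact, which is the property a restart relies on.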


Greenplum Enterprise HD Snapshots

• Intelligent Snapshots

– Automatic data deduplication

– Block sharing for space savings

• Fast and flexible

– Zero performance loss when writing to the original

• Easy to manage

– Scheduled or on-demand

– Drag and drop recovery

[Diagram: redirect on write for snapshot. Blocks A, B, C, C’, D are referenced by Snapshots 1, 2 and 3 in Enterprise HD Lockless Storage Services; Hadoop / HBase applications and NFS applications read and write]
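
Redirect-on-write, as named on this slide, can be sketched with a toy block store: a snapshot copies only the block map, and a later write goes to a new block so the snapshot keeps pointing at the old one. The `Volume` class and its methods are hypothetical names for illustration, not the product's interface.

```python
class Volume:
    """Toy block store with redirect-on-write snapshots."""

    def __init__(self, blocks):
        self.store = dict(enumerate(blocks))  # block id -> data
        self.live = list(self.store)          # live view: ordered block ids
        self.snapshots = {}                   # snapshot name -> block ids
        self._next_id = len(self.store)

    def snapshot(self, name):
        """Copy the block map only; no data blocks are duplicated."""
        self.snapshots[name] = list(self.live)

    def write(self, index, data):
        """Redirect the write to a fresh block, leaving the old block intact."""
        new_id = self._next_id
        self._next_id += 1
        self.store[new_id] = data
        self.live[index] = new_id

    def read(self, index, snapshot=None):
        """Read from the live view, or from a named snapshot."""
        block_map = self.live if snapshot is None else self.snapshots[snapshot]
        return self.store[block_map[index]]

if __name__ == "__main__":
    volume = Volume(["A", "B", "C", "D"])
    volume.snapshot("snapshot-1")
    volume.write(2, "C'")
    print(volume.read(2), volume.read(2, snapshot="snapshot-1"))
```

After the write, the live view and the snapshot share blocks A, B and D and differ only in C versus C', which mirrors the block-sharing point in the bullets. The sketch never frees unreferenced blocks; a real system would need that.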


Greenplum Enterprise HD Mirroring

• Business Continuity

– Efficient design

– Differential deltas are updated

– Data is compressed and check-summed

• Easy to manage

– Scheduled or on-demand

– Consistent point-in-time

[Diagram: mirroring over WAN between Datacenter 1, Datacenter 2 and Cloud; volumes labeled Production and Research]
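
A minimal sketch of the mirroring bullets (differential deltas, compression, checksums), assuming fixed block boundaries and using SHA-256 and zlib from the Python standard library. The slide does not say which algorithms or formats the product uses, so every choice here is an assumption.

```python
import hashlib
import zlib

def block_digests(blocks):
    """Checksums describing what the mirror currently holds."""
    return [hashlib.sha256(block).hexdigest() for block in blocks]

def build_delta(source_blocks, mirror_digests):
    """Collect only changed or new blocks, compressed, with their checksum."""
    delta = {}
    for index, block in enumerate(source_blocks):
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(mirror_digests) or mirror_digests[index] != digest:
            delta[index] = (zlib.compress(block), digest)
    return delta

def apply_delta(mirror_blocks, delta):
    """Decompress each block, verify its checksum, then update the mirror."""
    result = list(mirror_blocks)
    for index, (payload, digest) in sorted(delta.items()):
        block = zlib.decompress(payload)
        if hashlib.sha256(block).hexdigest() != digest:
            raise ValueError(f"checksum mismatch for block {index}")
        if index < len(result):
            result[index] = block
        else:
            result.append(block)
    return result

if __name__ == "__main__":
    source = [b"aaa", b"bbb-changed", b"ccc"]
    mirror = [b"aaa", b"bbb"]
    delta = build_delta(source, block_digests(mirror))
    print(sorted(delta), apply_delta(mirror, delta) == source)
```

Only blocks 1 and 2 cross the WAN in this example; block 0 is unchanged and is skipped. The sketch does not handle a source that shrinks, which a real mirror would have to.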


Greenplum Enterprise HD Direct Access Using NFS

[Diagram: Greenplum Enterprise HD Distribution for Apache Hadoop, showing Enterprise HD MapReduce, Enterprise HD Lockless Storage Services, Distributed Name Node and Job Tracker HA]

• Simple application integration

– Leverage NFS for random read/write access

• Direct access for standard Hadoop tools

– Command line utilities

– File browsers

– Desktop utilities
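
Once a cluster is mounted over NFS, applications can in principle use ordinary file calls, with no Hadoop client library. The sketch below shows an append followed by a random read at a byte offset. The mount point would be something like `/mnt/gphd`, which is a hypothetical path; a temporary directory stands in for it so the example runs anywhere.

```python
import os
import tempfile

def append_record(mount_point, relative_path, record):
    """Append one text record using plain file I/O under the mount point."""
    path = os.path.join(mount_point, relative_path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # newline="\n" keeps byte offsets identical across platforms.
    with open(path, "a", encoding="utf-8", newline="\n") as handle:
        handle.write(record + "\n")
    return path

def read_range(path, offset, length):
    """Random read: seek to a byte offset and read `length` bytes."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)

# Stand-in for a hypothetical NFS mount of the cluster.
MOUNT_POINT = tempfile.mkdtemp()

if __name__ == "__main__":
    log_path = append_record(MOUNT_POINT, "logs/app.log", "hello")
    append_record(MOUNT_POINT, "logs/app.log", "world")
    print(read_range(log_path, 6, 5))
```

The same property is what lets the command line utilities, file browsers and desktop tools listed above work against the cluster unchanged.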


Greenplum Enterprise HD Simple Management

• Intuitive

• Insightful

• Complete

• One node or thousands


Greenplum HD: Software Distributions

Features                      Community Edition           Enterprise Edition
Apache Compatibility          100% Apache Open Source     100% API Compatible
Name Node High Availability   Reference Implementation    Distributed and High Availability
Job Tracker HA                Reference Implementation    JT High Availability
Name Node Scalability         NN Metadata in Memory       Distributed Name Node
Premium Support               Yes                         Yes
Performance                   -                           2 - 5x faster than Community Edition
Snapshots                     No                          Yes
Mirrors                       No                          Yes
NFS Mounts                    No                          Yes
System Management             No                          Yes
Available for Ordering        May 9th 2011                Q3
Pricing                       Per Node Pricing            Per Node Pricing


Greenplum HD on Data Computing Appliance

• Introducing the world’s first:

– High-performance

– Purpose-built

– Data co-processing Hadoop appliance

• Combining Greenplum Database and Greenplum Hadoop in one appliance


GPDB GPHD Interoperability

[Diagram: GPDB external tables map to files on GPHD, so GPHD data moves in and out within a GPDB query]


Greenplum Database External Tables for Hadoop

• Bring GPDB relational expressive power to HDFS

– HDFS data presented as external tables

– HDFS data supporting full SQL syntax

• Have ALL, PART or NONE of your data in HDFS

• Leverage full parallelism of both Hadoop and GPDB

– GPDB can read from/write to HDFS

Example:

SELECT count(*) FROM HDFS_data h, GPDB_data g WHERE h.key = g.key;

INSERT INTO HDFS_data SELECT * FROM GPDB_data;


Greenplum Enterprise HD HDFS Integration – Parallelized Flow

• Reading:

– Each GPDB segment reads a portion of the file

  • Segment i of n reads the i/n-th portion

– Access offset from HDFS namenode

– Read data directly from HDFS datanode

• Writing:

– Each GPDB segment writes a file

– HDFS balancing distributes the load evenly across datanodes
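
The "segment i of n reads the i/n-th portion" rule can be made concrete with a little arithmetic. The sketch below computes byte ranges using integer division so the ranges cover the file exactly, with no gaps or overlaps. The function names and the 0-based segment index are assumptions. A real reader would also have to align ranges to record boundaries, which this omits.

```python
def segment_range(file_size, segment_index, segment_count):
    """Byte range [start, end) read by segment i of n (0-based index)."""
    if not 0 <= segment_index < segment_count:
        raise ValueError("segment_index out of range")
    start = file_size * segment_index // segment_count
    end = file_size * (segment_index + 1) // segment_count
    return start, end

def all_ranges(file_size, segment_count):
    """Ranges for every segment, in segment order."""
    return [
        segment_range(file_size, index, segment_count)
        for index in range(segment_count)
    ]

if __name__ == "__main__":
    # A 10-byte file split across 4 segments.
    print(all_ranges(10, 4))
```

Because each range ends exactly where the next begins, every byte is read by exactly one segment, which is what lets all segments pull from the datanodes in parallel.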


Big Data Analytics “Stack”

• Greenplum Chorus: Enterprise Collaboration Platform for Data

• Greenplum Database: World’s Most Scalable MPP Database Platform

• Analytic Toolsets (Business Analytics, BI, Statistics, etc.)

• Greenplum HD: Enterprise Analytics Platform for Unstructured Data

• Greenplum Data Computing Appliances: Purpose-built for Big Data Analytics


THANK YOU
