Introduction to the Hadoop ecosystem

Page 1: Introduction to the Hadoop Ecosystem (codemotion Edition)

Introduction to the Hadoop ecosystem

Page 2: Introduction to the Hadoop Ecosystem (codemotion Edition)

About me

Page 3: Introduction to the Hadoop Ecosystem (codemotion Edition)

About us

Page 4: Introduction to the Hadoop Ecosystem (codemotion Edition)

Why Hadoop?

Page 5: Introduction to the Hadoop Ecosystem (codemotion Edition)

Why Hadoop?

Page 6: Introduction to the Hadoop Ecosystem (codemotion Edition)

Why Hadoop?

Page 7: Introduction to the Hadoop Ecosystem (codemotion Edition)

Why Hadoop?

Page 8: Introduction to the Hadoop Ecosystem (codemotion Edition)

Why Hadoop?

Page 9: Introduction to the Hadoop Ecosystem (codemotion Edition)

Why Hadoop?

Page 10: Introduction to the Hadoop Ecosystem (codemotion Edition)

Why Hadoop?

Page 11: Introduction to the Hadoop Ecosystem (codemotion Edition)

How to scale data?

[Diagram labels: w1, w2, w3; r1, r2, r3]

Page 12: Introduction to the Hadoop Ecosystem (codemotion Edition)

But…

Page 13: Introduction to the Hadoop Ecosystem (codemotion Edition)

But…

Page 14: Introduction to the Hadoop Ecosystem (codemotion Edition)

What is Hadoop?

Page 15: Introduction to the Hadoop Ecosystem (codemotion Edition)

What is Hadoop?

Page 16: Introduction to the Hadoop Ecosystem (codemotion Edition)

What is Hadoop?

Page 17: Introduction to the Hadoop Ecosystem (codemotion Edition)

What is Hadoop?

Page 18: Introduction to the Hadoop Ecosystem (codemotion Edition)

The Hadoop App Store

HDFS MapReduce HCatalog Pig Hive HBase Ambari Avro Cassandra Chukwa Intel Sync
Flume Hana Hypertable Impala Mahout Nutch Oozie Sqoop Scribe Tez Vertica Whirr
ZooKeeper Cloudera Hortonworks MapR EMC IBM Talend TeraData Pivotal Informatica
Microsoft Pentaho Jasper Kognitio Tableau Splunk Platfora Rack Karma Actuate
MicroStrategy

Page 19: Introduction to the Hadoop Ecosystem (codemotion Edition)

Data Storage

Page 20: Introduction to the Hadoop Ecosystem (codemotion Edition)

Data Storage

Page 21: Introduction to the Hadoop Ecosystem (codemotion Edition)

Hadoop Distributed File System

Page 22: Introduction to the Hadoop Ecosystem (codemotion Edition)

Hadoop Distributed File System

Page 23: Introduction to the Hadoop Ecosystem (codemotion Edition)

HDFS Architecture

Page 24: Introduction to the Hadoop Ecosystem (codemotion Edition)

Data Processing

Page 25: Introduction to the Hadoop Ecosystem (codemotion Edition)

Data Processing

Page 26: Introduction to the Hadoop Ecosystem (codemotion Edition)

MapReduce

Page 27: Introduction to the Hadoop Ecosystem (codemotion Edition)

Typical large-data problem

Page 28: Introduction to the Hadoop Ecosystem (codemotion Edition)

MapReduce Flow

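Input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)

Map: (a, 1) (b, 2) (c, 3) (c, 6) (a, 3) (c, 2) (b, 7) (c, 8)

Combine: (a, 1) (b, 2) (c, 9) (a, 3) (c, 2) (b, 7) (c, 8)

Shuffle and sort: a: [1, 3]  b: [2, 7]  c: [2, 8, 9]

Reduce: (a, 4) (b, 9) (c, 19)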
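The numbers in this flow (map output a 1, b 2, c 3, c 6, a 3, c 2, b 7, c 8, ending in a 4, b 9, c 19) can be traced in plain Java without a cluster. This is only an in-memory sketch: the class name, the `pair` helper, and the assignment of pairs to four mappers are assumptions made for illustration, and summing is assumed as the combine and reduce function because it matches the totals shown.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class MapReduceFlowSketch {

    // Hypothetical helper: builds one (key, value) pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<String, Integer>(key, value);
    }

    // Combine: sum values per key within a single mapper's output.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> p : mapperOutput) {
            combined.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return combined;
    }

    // Shuffle and sort: collect each key's values across all mappers, keys sorted.
    static SortedMap<String, List<Integer>> shuffle(List<Map<String, Integer>> combinedOutputs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : combinedOutputs) {
            output.forEach((key, value) ->
                    grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(value));
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key.
    static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, Integer> totals = new TreeMap<>();
        grouped.forEach((key, values) ->
                totals.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    static SortedMap<String, Integer> run() {
        // Assumed split of the map output across four mappers.
        List<List<Map.Entry<String, Integer>>> mapOutputs = Arrays.asList(
                Arrays.asList(pair("a", 1), pair("b", 2)),
                Arrays.asList(pair("c", 3), pair("c", 6)),
                Arrays.asList(pair("a", 3), pair("c", 2)),
                Arrays.asList(pair("b", 7), pair("c", 8)));
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            combined.add(combine(output));
        }
        return reduce(shuffle(combined));
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints {a=4, b=9, c=19}
    }
}
```

The combine step is optional for the result: it only shrinks what each mapper sends over the network (here c 3 and c 6 become c 9 before the shuffle).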

Page 29: Introduction to the Hadoop Ecosystem (codemotion Edition)

Jobs & Tasks

Page 30: Introduction to the Hadoop Ecosystem (codemotion Edition)

Combined Hadoop Architecture

Page 31: Introduction to the Hadoop Ecosystem (codemotion Edition)

Word Count Mapper in Java

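import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (word, 1) for every whitespace-separated token of an input line.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text(); // reused for every token

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}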

Page 32: Introduction to the Hadoop Ecosystem (codemotion Edition)

Word Count Reducer in Java

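import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums all counts emitted for one word and writes (word, total).
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        int sum = 0;
        while (values.hasNext())
        {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}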
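To see what the mapper and reducer compute together without a Hadoop installation, the same logic can be run in memory. This is a rough sketch, not the Hadoop job itself: the class `LocalWordCount`, its method names, and the sample lines are made up for illustration, and the shuffle and reduce steps are collapsed into a single `merge` call on a sorted map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class LocalWordCount {

    // Map step: split a line on whitespace, as the mapper does with StringTokenizer.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            words.add(tokenizer.nextToken());
        }
        return words;
    }

    // Shuffle + reduce step: add 1 per occurrence, grouped by word (sorted by key).
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : tokenize(line)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        List<String> lines = Arrays.asList("to be or not to be", "to do");
        System.out.println(count(lines)); // prints {be=2, do=1, not=1, or=1, to=3}
    }
}
```

Like the mapper above, this treats tokens case-sensitively and keeps punctuation attached to words.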

Page 33: Introduction to the Hadoop Ecosystem (codemotion Edition)

Scripting for Hadoop

Page 34: Introduction to the Hadoop Ecosystem (codemotion Edition)

Scripting for Hadoop

Page 35: Introduction to the Hadoop Ecosystem (codemotion Edition)

Apache Pig

Page 36: Introduction to the Hadoop Ecosystem (codemotion Edition)

Pig in the Hadoop ecosystem

Hadoop Distributed File System

Distributed Programming Framework

Metadata Management

Scripting

Page 37: Introduction to the Hadoop Ecosystem (codemotion Edition)

Pig Latin

users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';
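For comparison, the same filter, join, group, count, order and limit pipeline can be written with Java streams over in-memory lists. This is only a single-machine sketch under stated assumptions: the `User` and `Page` classes, the `topSites` and `demo` methods, and the sample data are hypothetical, and ties in the click count are broken by URL purely to make the output deterministic (the Pig script does not specify a tie-break).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class TopSitesSketch {

    static final class User {
        final String name;
        final int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    static final class Page {
        final String user;
        final String url;
        Page(String user, String url) { this.user = user; this.url = url; }
    }

    static List<Map.Entry<String, Long>> topSites(List<User> users, List<Page> pages, int limit) {
        // FILTER users BY age; keep how often each name occurs so the join
        // below counts one row per matching (user, page) pair.
        Map<String, Long> eligible = users.stream()
                .filter(u -> u.age >= 18 && u.age <= 50)
                .collect(Collectors.groupingBy(u -> u.name, Collectors.counting()));

        // JOIN on user name, GROUP BY url, COUNT the joined rows.
        Map<String, Long> clicks = pages.stream()
                .filter(p -> eligible.containsKey(p.user))
                .collect(Collectors.groupingBy(p -> p.url,
                        Collectors.summingLong(p -> eligible.get(p.user))));

        // ORDER BY clicks DESC (ties by url, an added assumption), then LIMIT.
        return clicks.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .collect(Collectors.toList());
    }

    static List<Map.Entry<String, Long>> demo() {
        // Hypothetical sample data.
        List<User> users = Arrays.asList(
                new User("alice", 25), new User("bob", 17),
                new User("carol", 40), new User("dave", 60));
        List<Page> pages = Arrays.asList(
                new Page("alice", "a.com"), new Page("alice", "b.com"),
                new Page("bob", "a.com"), new Page("carol", "a.com"),
                new Page("dave", "b.com"), new Page("carol", "c.com"));
        return topSites(users, pages, 2);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a.com=2, b.com=1]
    }
}
```

Unlike the Pig script, this version holds all data in one JVM; expressing the same steps as distributed MapReduce jobs in Java would take several chained jobs.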

Page 38: Introduction to the Hadoop Ecosystem (codemotion Edition)

Pig Execution Plan

Page 39: Introduction to the Hadoop Ecosystem (codemotion Edition)

Try that with Java…

Page 40: Introduction to the Hadoop Ecosystem (codemotion Edition)

SQL for Hadoop

Page 41: Introduction to the Hadoop Ecosystem (codemotion Edition)

SQL for Hadoop

Page 42: Introduction to the Hadoop Ecosystem (codemotion Edition)

Apache Hive

Page 43: Introduction to the Hadoop Ecosystem (codemotion Edition)

Hive in the Hadoop ecosystem

Hadoop Distributed File System

Distributed Programming Framework

Metadata Management

Scripting

Query

Page 44: Introduction to the Hadoop Ecosystem (codemotion Edition)

Hive Architecture

Page 45: Introduction to the Hadoop Ecosystem (codemotion Edition)

Hive Example

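-- Assumes comma-separated input files, as in the Pig example.
-- `user` is backquoted because it is a reserved word in newer Hive versions.
CREATE TABLE users(name STRING, age INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE pages(`user` STRING, url STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
-- ORDER BY gives a total order; SORT BY only orders rows within each reducer.
SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.`user`)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
ORDER BY clicks DESC
LIMIT 10;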

Page 46: Introduction to the Hadoop Ecosystem (codemotion Edition)

Bringing it all together…

Page 47: Introduction to the Hadoop Ecosystem (codemotion Edition)

Online Advertising

Page 48: Introduction to the Hadoop Ecosystem (codemotion Edition)

Getting started…

Page 49: Introduction to the Hadoop Ecosystem (codemotion Edition)

Hortonworks Sandbox

Page 50: Introduction to the Hadoop Ecosystem (codemotion Edition)

Hadoop Training

Page 51: Introduction to the Hadoop Ecosystem (codemotion Edition)

The end…or the beginning?