Hadoop Ecosystem - Ran Silberman, December 2014


Page 1: Hadoop ecosystem

Hadoop

Ecosystem

Ran Silberman, December 2014

Page 2: Hadoop ecosystem

What types of ecosystems exist?

● Systems that are based on MapReduce

● Systems that replace MapReduce

● Complementary databases

● Utilities

● See complete list here

Page 3: Hadoop ecosystem

Systems based

on MapReduce

Page 4: Hadoop ecosystem

Hive

● Part of the Apache project

● General SQL-like syntax for querying HDFS or other

large databases

● Each SQL statement is translated to one or more

MapReduce jobs (in some cases none)

● Supports pluggable Mappers, Reducers, and SerDes

(Serializers/Deserializers)

● Pro: convenient for analysts who use SQL

Page 5: Hadoop ecosystem

Hive Architecture

Page 6: Hadoop ecosystem

Hive Usage

Start a Hive shell:

$ hive

Create a Hive table:

hive> CREATE TABLE tikal (id BIGINT, name STRING, startdate TIMESTAMP, email STRING);

Show all tables:

hive> SHOW TABLES;

Add a new column to the table:

hive> ALTER TABLE tikal ADD COLUMNS (description STRING);

Load an HDFS data file into the table:

hive> LOAD DATA INPATH '/home/hduser/tikal_users' OVERWRITE INTO TABLE tikal;

Query employees that have worked more than a year:

hive> SELECT name FROM tikal WHERE (unix_timestamp() - startdate > 365 * 24 * 60 * 60);

Page 7: Hadoop ecosystem

Pig

● Part of the Apache project

● A programming language (Pig Latin) that is compiled into

one or more MapReduce jobs

● Supports User Defined Functions (UDFs)

● Pro: more convenient to write than raw MapReduce

Page 8: Hadoop ecosystem

Pig Usage

Start a Pig shell (grunt is the Pig Latin shell prompt):

$ pig
grunt>

Load an HDFS data file:

grunt> employees = LOAD 'hdfs://hostname:54310/home/hduser/tikal_users'
       AS (id, name, startdate, email, description);

Dump the data to the console:

grunt> DUMP employees;

Query the data:

grunt> employees_more_than_1_year = FILTER employees BY (float)rating > 1.0;
grunt> DUMP employees_more_than_1_year;

Store the query result to a new file:

grunt> STORE employees_more_than_1_year INTO '/home/hduser/employees_more_than_1_year';

Page 9: Hadoop ecosystem

Cascading

● An infrastructure with an API that is compiled into one or

more MapReduce jobs

● Provides a graphical view of the MapReduce job workflow

● Ways to tweak settings and improve performance of the

workflow

● Pros:

o Hides the MapReduce API and joins jobs together

o Graphical view and performance tuning

Page 10: Hadoop ecosystem

MapReduce workflow

● MapReduce framework operates exclusively on

Key/Value pairs

● There are three phases in the workflow:

o map

o combine

o reduce

(input) <k1, v1> =>

map => <k2, v2> =>

combine => <k2, v2> =>

reduce => <k3, v3> (output)

Page 11: Hadoop ecosystem

WordCount in the MapReduce Java API

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

public static class TokenizerMapper

extends Mapper<Object, Text, Text, IntWritable>{

private final static IntWritable one = new IntWritable(1);

private Text word = new Text();

public void map(Object key, Text value, Context context

) throws IOException, InterruptedException {

StringTokenizer itr = new StringTokenizer(value.toString());

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

context.write(word, one);

}

}

}

Page 12: Hadoop ecosystem

WordCount in the MapReduce Java API Cont.

public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWritable> {

private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,

Context context

) throws IOException, InterruptedException {

int sum = 0;

for (IntWritable val : values) {

sum += val.get();

}

result.set(sum);

context.write(key, result);

}

}

Page 13: Hadoop ecosystem

WordCount in the MapReduce Java API Cont.

public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();

Job job = Job.getInstance(conf, "word count");

job.setJarByClass(WordCount.class);

job.setMapperClass(TokenizerMapper.class);

job.setCombinerClass(IntSumReducer.class);

job.setReducerClass(IntSumReducer.class);

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job, new Path(args[1]));

System.exit(job.waitForCompletion(true) ? 0 : 1);

}

}

Page 14: Hadoop ecosystem

MapReduce workflow example.

Let’s consider two text files:

$ bin/hdfs dfs -cat /user/joe/wordcount/input/file01

Hello World Bye World

$ bin/hdfs dfs -cat /user/joe/wordcount/input/file02

Hello Hadoop Goodbye Hadoop

Page 15: Hadoop ecosystem

Mapper code

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
  }
}

Page 16: Hadoop ecosystem

Mapper output

For two files there will be two mappers.

For the given sample input the first map emits:

< Hello, 1>

< World, 1>

< Bye, 1>

< World, 1>

The second map emits:

< Hello, 1>

< Hadoop, 1>

< Goodbye, 1>

< Hadoop, 1>

Page 17: Hadoop ecosystem

Set Combiner

We defined a combiner in the code:

job.setCombinerClass(IntSumReducer.class);

Page 18: Hadoop ecosystem

Combiner output

Output of each map is passed through the local combiner

for local aggregation, after being sorted on the keys.

The output of the first map:

< Bye, 1>

< Hello, 1>

< World, 2>

The output of the second map:

< Goodbye, 1>

< Hadoop, 2>

< Hello, 1>

Page 19: Hadoop ecosystem

Reducer code

public void reduce(Text key, Iterable<IntWritable> values,

Context context

) throws IOException, InterruptedException {

int sum = 0;

for (IntWritable val : values) {

sum += val.get();

}

result.set(sum);

context.write(key, result);

}


Page 20: Hadoop ecosystem

Reducer output

The reducer sums up the values

The output of the job is:

< Bye, 1>

< Goodbye, 1>

< Hadoop, 2>

< Hello, 2>

< World, 2>

Page 21: Hadoop ecosystem

The Cascading core components

● Tap (Data resource)

o Source (Data input)

o Sink (Data output)

● Pipe (data stream)

● Filter (Data operation)

● Flow (assembly of Taps and Pipes)

Page 22: Hadoop ecosystem

WordCount in Cascading

Visualization

source (Document Collection)

sink (Word Count)

pipes (Tokenize, Count)

Page 23: Hadoop ecosystem

WordCount in Cascading Cont.

// define source and sink Taps.

Scheme sourceScheme = new TextLine( new Fields( "line" ) );

Tap source = new Hfs( sourceScheme, inputPath );

Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );

Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

// the 'head' of the pipe assembly

Pipe assembly = new Pipe( "wordcount" );

// For each input Tuple

// parse out each word into a new Tuple with the field name "word"

// regular expressions are optional in Cascading

String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";

Function function = new RegexGenerator( new Fields( "word" ), regex );

assembly = new Each( assembly, new Fields( "line" ), function );

// group the Tuple stream by the "word" value

assembly = new GroupBy( assembly, new Fields( "word" ) );

Page 24: Hadoop ecosystem

WordCount in Cascading Cont.

// For every Tuple group

// count the number of occurrences of "word" and store result in

// a field named "count"

Aggregator count = new Count( new Fields( "count" ) );

assembly = new Every( assembly, count );

// initialize app properties, tell Hadoop which jar file to use

Properties properties = new Properties();

FlowConnector.setApplicationJarClass( properties, Main.class );

// plan a new Flow from the assembly using the source and sink Taps

// with the above properties

FlowConnector flowConnector = new FlowConnector( properties );

Flow flow = flowConnector.connect( "word-count", source, sink, assembly );

// execute the flow, block until complete

flow.complete();

Page 25: Hadoop ecosystem

Diagram of Cascading Flow

Page 26: Hadoop ecosystem

Scalding

● Extension to Cascading

● The programming language is Scala instead of Java

● Good for functional programming paradigms in data

applications

● Pro: code can be very compact!

Page 27: Hadoop ecosystem

WordCount in Scalding

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {

TypedPipe.from(TextLine(args("input")))

.flatMap { line => line.split("""\s+""") }

.groupBy { word => word }

.size

.write(TypedTsv(args("output")))

}

Page 28: Hadoop ecosystem

Summingbird

● An open-source project from Twitter.

● An API that is compiled to Scalding and to Storm

topologies.

● Can be written in Java or Scala

● Pro: useful when you want a Lambda Architecture and a

single codebase that runs on both Hadoop and Storm.

Page 29: Hadoop ecosystem

WordCount in Summingbird

def wordCount[P <: Platform[P]]

(source: Producer[P, String], store: P#Store[String, Long]) =

source.flatMap { sentence =>

toWords(sentence).map(_ -> 1L)

}.sumByKey(store)

Page 30: Hadoop ecosystem

Systems that

replace MapReduce

Page 31: Hadoop ecosystem

Spark

● Part of the Apache project

● Replaces MapReduce with its own engine that works

much faster without compromising consistency

● Architecture is not based on MapReduce but rather on two

concepts: RDD (Resilient Distributed Dataset) and DAG

(Directed Acyclic Graph)

● Pros:

o Works much faster than MapReduce (see the minimal Java

sketch below)

o Fast-growing community
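To make the contrast with MapReduce concrete, here is a minimal WordCount sketch using Spark's Java API (a sketch only, assuming a Spark 1.x dependency and Java 8; the input and output paths come from the command line, as in the MapReduce example):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("word count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Read the input file into an RDD of lines
    JavaRDD<String> lines = sc.textFile(args[0]);

    // Split lines into words, pair each word with 1, and sum the counts per word
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split(" ")))
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);

    counts.saveAsTextFile(args[1]);
    sc.stop();
  }
}

The whole job is one chained expression; Spark's DAG scheduler plans the stages that the MapReduce version expresses as explicit Mapper, Combiner and Reducer classes.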

Page 32: Hadoop ecosystem

Impala

● Open Source from Cloudera

● Used for Interactive queries with SQL syntax

● Replaces MapReduce with its own Impala Server

● Pro: Can get much faster response time for SQL over

HDFS than Hive or Pig.

Page 33: Hadoop ecosystem

Impala benchmark

Note: in this benchmark, Impala is running over Parquet!

Page 34: Hadoop ecosystem

Impala replaces MapReduce

Page 35: Hadoop ecosystem

Impala architecture

● Impala architecture was inspired by Google Dremel

● MapReduce is great for functional programming, but not

efficient for SQL.

● Impala replaces MapReduce with a distributed query

engine that is optimized for fast queries.

Page 36: Hadoop ecosystem

Dremel architecture

Dremel: Interactive Analysis of Web-Scale Datasets

Page 37: Hadoop ecosystem

Impala architecture

Page 38: Hadoop ecosystem

Presto, Drill, Tez

● Several more alternatives:

o Presto by Facebook

o Apache Drill pushed by MapR

o Apache Tez pushed by Hortonworks

● All are alternatives to Impala and do more or less the

same: provide faster response time for queries over

HDFS.

● Each of the above claims to have very fast results.

● Be careful with the benchmarks they publish: to get better

results they use indexed or columnar data rather than sequential

files in HDFS (e.g., ORC files, Parquet, HBase).

Page 39: Hadoop ecosystem

Complementary

Databases

Page 40: Hadoop ecosystem

HBase

● Apache project

● NoSQL cluster database that can grow linearly

● Can store billions of rows X millions of columns

● Storage is based on HDFS

● Native Java client API, plus MapReduce integration

● Pros:

o Strongly consistent reads/writes

o Good for high-speed counter aggregations (a basic Java

client sketch follows below)
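A minimal sketch of the HBase Java client (assuming an HBase 0.9x-era client on the classpath; the table name, column family, row key, and values are hypothetical and the table is assumed to already exist):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // "employees" table with column family "info" is assumed to exist
    HTable table = new HTable(conf, "employees");

    // Write one cell
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ran"));
    table.put(put);

    // Read it back
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

    table.close();
  }
}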

Page 41: Hadoop ecosystem

Parquet

● Apache (incubator) project. Initiated by Twitter &

Cloudera

● Columnar File Format - write one column at a time

● Integrated with Hadoop ecosystem (MapReduce, Hive)

● Supports Avro, Thrift and Protocol Buffers

● Pro: keeps I/O to a minimum by reading from disk only

the data required for the query (see the write sketch below)
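As an illustration only, here is a minimal sketch that writes Avro records into a Parquet file using the parquet-avro module (assuming a recent org.apache.parquet:parquet-avro dependency; the Employee schema and output path are hypothetical):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetWriteExample {
  public static void main(String[] args) throws Exception {
    // An Avro schema describing the records to be stored column by column
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"id\",\"type\":\"long\"}]}");

    // Open a Parquet writer that uses the Avro schema for the file layout
    ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("employees.parquet"))
        .withSchema(schema)
        .build();

    GenericRecord employee = new GenericData.Record(schema);
    employee.put("name", "Ran");
    employee.put("id", 1L);
    writer.write(employee);

    writer.close();
  }
}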

Page 42: Hadoop ecosystem

Columnar format (Parquet)

Page 43: Hadoop ecosystem

Advantages of Columnar formats

● Better compression, as data is more homogeneous.

● I/O will be reduced as we can efficiently scan only a

subset of the columns while reading the data.

● When storing data of the same type in each column,

we can use encodings better suited to the modern

processors’ pipeline by making instruction branching

more predictable.

Page 44: Hadoop ecosystem

Utilities

Page 45: Hadoop ecosystem

Flume

● Originally developed at Cloudera; now an Apache project

● Used to collect files from distributed systems and send

them to a central repository

● Designed for integration with HDFS but can write to

other file systems

● Supports listening to TCP and UDP sockets

● Main Use Case: collect distributed logs to HDFS

Page 46: Hadoop ecosystem

Avro

● An Apache project

● Data serialization by schema

● Supports rich data structures, defined in a JSON-like syntax

● Supports schema evolution

● Integrated with the Hadoop I/O API

● Similar to Thrift and Protocol Buffers (a generic-record sketch

follows below)
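A minimal sketch of schema-driven serialization with Avro's generic Java API (the Employee schema and the local file name here are hypothetical, for illustration only):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // Schema defined in JSON-like syntax
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"id\",\"type\":\"long\"}]}");

    // Build a record that conforms to the schema
    GenericRecord employee = new GenericData.Record(schema);
    employee.put("name", "Ran");
    employee.put("id", 1L);

    // Serialize the record to an Avro container file
    File file = new File("employee.avro");
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, file);
    writer.append(employee);
    writer.close();

    // Deserialize it back using the same schema
    DataFileReader<GenericRecord> reader =
        new DataFileReader<>(file, new GenericDatumReader<GenericRecord>(schema));
    while (reader.hasNext()) {
      System.out.println(reader.next());
    }
    reader.close();
  }
}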

Page 47: Hadoop ecosystem

Oozie

● An Apache project

● Workflow Scheduler for Hadoop jobs

● Very close integration with the Hadoop API (a client-API sketch

follows below)
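As an illustration of that integration, a minimal sketch that submits a workflow through the Oozie Java client API (assuming an oozie-client dependency; the server URL, application path, and property values are hypothetical):

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
  public static void main(String[] args) throws Exception {
    // Point the client at the Oozie server
    OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

    // Job properties: where the workflow.xml lives and values it references
    Properties conf = client.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/hduser/wordcount-wf");
    conf.setProperty("nameNode", "hdfs://namenode:8020");
    conf.setProperty("jobTracker", "jobtracker:8032");

    // Submit and start the workflow, then poll its status
    String jobId = client.run(conf);
    WorkflowJob job = client.getJobInfo(jobId);
    System.out.println("Workflow " + jobId + " status: " + job.getStatus());
  }
}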

Page 48: Hadoop ecosystem

Mesos

● Apache project

● Cluster manager that abstracts resources

● Integrated with Hadoop to allocate resources

● Scalable to 10,000 nodes

● Supports physical machines, VMs, and Docker containers

● Multi-resource scheduler (memory, CPU, disk, ports)

● Web UI for viewing cluster status