Hadoop Ecosystem

Ran Silberman Dec. 2014

What types of systems exist in the ecosystem?

● Systems that are based on MapReduce
● Systems that replace MapReduce
● Complementary databases
● Utilities
● See complete list here

Systems based on MapReduce

Hive

● Part of the Apache project
● General SQL-like syntax for querying HDFS or other large databases
● Each SQL statement is translated to one or more MapReduce jobs (in some cases none)
● Supports pluggable Mappers, Reducers and SerDes (Serializer/Deserializer)
● Pro: convenient for analysts who use SQL

Hive Architecture

Hive Usage

Start a Hive shell:
$ hive

Create a Hive table:
hive> CREATE TABLE tikal (id BIGINT, name STRING, startdate TIMESTAMP, email STRING);

Show all tables:
hive> SHOW TABLES;

Add a new column to the table:
hive> ALTER TABLE tikal ADD COLUMNS (description STRING);

Load an HDFS data file into the table:
hive> LOAD DATA INPATH '/home/hduser/tikal_users' OVERWRITE INTO TABLE tikal;

Query employees that have worked more than a year:
hive> SELECT name FROM tikal WHERE (unix_timestamp() - unix_timestamp(startdate)) > 365 * 24 * 60 * 60;

Pig

● Part of the Apache project
● A programming language that is compiled into one or more MapReduce jobs
● Supports User Defined Functions
● Pro: more convenient to write than pure MapReduce

Pig Usage

Start a Pig shell (grunt is the Pig Latin shell prompt):
$ pig
grunt>

Load an HDFS data file:
grunt> employees = LOAD 'hdfs://hostname:54310/home/hduser/tikal_users'
       AS (id, name, startdate, email, description);

Dump the data to the console:
grunt> DUMP employees;

Query the data (employees that have worked more than a year):
grunt> employees_more_than_1_year = FILTER employees
       BY (ToUnixTime(CurrentTime()) - (long)startdate) > 365L * 24 * 60 * 60;
grunt> DUMP employees_more_than_1_year;

Store the query result in a new file:
grunt> STORE employees_more_than_1_year INTO '/home/hduser/employees_more_than_1_year';

Cascading

● An infrastructure with an API that is compiled to one or more MapReduce jobs
● Provides a graphical view of the MapReduce job workflow
● Ways to tweak settings and improve performance of the workflow
● Pros:
  ○ Hides the MapReduce API and chains jobs together
  ○ Graphical view and performance tuning

MapReduce workflow

● The MapReduce framework operates exclusively on key/value pairs
● There are three phases in the workflow:
  ○ map
  ○ combine
  ○ reduce

(input) <k1, v1> => map => <k2, v2> => combine => <k2, v2> => reduce => <k3, v3> (output)

WordCount in MapReduce Java API

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // emit <word, 1> for every token in the input line
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

WordCount in MapReduce Java API, cont.

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {

    private IntWritable result = new IntWritable();

    // sum all the counts emitted for the same word
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

WordCount in MapReduce Java API, cont.

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

MapReduce workflow example.

Let’s consider two text files:

$ bin/hdfs dfs -cat /user/joe/wordcount/input/file01

Hello World Bye World

$ bin/hdfs dfs -cat /user/joe/wordcount/input/file02

Hello Hadoop Goodbye Hadoop

Mapper code

public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
  }
}

Mapper output

For two files there will be two mappers.

For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>

The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

Set Combiner

We defined a combiner in the code:

job.setCombinerClass(IntSumReducer.class);

Combiner output

The output of each map is passed through the local combiner for local aggregation, after being sorted on the keys.

The output of the first map:
< Bye, 1>
< Hello, 1>
< World, 2>

The output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>

Reducer code

public void reduce(Text key, Iterable<IntWritable> values,
                   Context context
                   ) throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  result.set(sum);
  context.write(key, result);
}

Reducer output

The reducer sums up the values. The output of the job is:

< Bye, 1>

< Goodbye, 1>

< Hadoop, 2>

< Hello, 2>

< World, 2>

The Cascading core components

● Tap (Data resource)
  ○ Source (Data input)
  ○ Sink (Data output)
● Pipe (Data stream)
● Filter (Data operation)
● Flow (assembly of Taps and Pipes)

WordCount in Cascading Visualization

Flow diagram: source (Document Collection) → pipes (Tokenize, Count) → sink (Word Count)

WordCount in Cascading, cont.

// define source and sink Taps
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

// the 'head' of the pipe assembly
Pipe assembly = new Pipe( "wordcount" );

// for each input Tuple, parse out each word into a new Tuple with the
// field name "word"; regular expressions are optional in Cascading
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );

// group the Tuple stream by the "word" value
assembly = new GroupBy( assembly, new Fields( "word" ) );

WordCount in Cascading, cont.

// for every Tuple group, count the number of occurrences of "word"
// and store the result in a field named "count"
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );

// initialize app properties, tell Hadoop which jar file to use
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );

// plan a new Flow from the assembly using the source and sink Taps
// with the above properties
FlowConnector flowConnector = new FlowConnector( properties );
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );

// execute the flow, block until complete
flow.complete();

Diagram of Cascading Flow

Scalding

● Extension to Cascading
● Programming language is Scala instead of Java
● Good for functional programming paradigms in data applications
● Pro: code can be very compact!

WordCount in Scalding

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => line.split("""\s+""") }
    .groupBy { word => word }
    .size
    .write(TypedTsv(args("output")))
}

Summingbird

● An open source project from Twitter
● An API that is compiled to Scalding and to Storm topologies
● Can be written in Java or Scala
● Pro: useful when you follow the Lambda Architecture and want to write one code base that runs on both Hadoop and Storm

WordCount in Summingbird

def wordCount[P <: Platform[P]]
    (source: Producer[P, String], store: P#Store[String, Long]) =
  source.flatMap { sentence =>
    toWords(sentence).map(_ -> 1L)
  }.sumByKey(store)

Systems that replace MapReduce

Spark

● Part of the Apache project
● Replaces MapReduce with its own engine that works much faster without compromising consistency
● Architecture is not based on MapReduce but on two concepts: RDD (Resilient Distributed Dataset) and DAG (Directed Acyclic Graph)
● Pros:
  ○ Works much faster than MapReduce
  ○ Fast-growing community
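To make the RDD and DAG concepts concrete, here is a minimal word-count sketch against the Spark Java API of that era (Spark 1.x with Java 8 lambdas, where flatMap returns an Iterable); the input and output paths come from the command line, and the class name is made up for illustration.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark word count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // an RDD backed by a text file; nothing is read yet, Spark only
    // records the lineage of transformations (the DAG)
    JavaRDD<String> lines = sc.textFile(args[0]);

    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")))  // Spark 1.x flatMap returns an Iterable
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);

    // the action triggers execution of the whole DAG
    counts.saveAsTextFile(args[1]);
    sc.stop();
  }
}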

Impala

● Open source from Cloudera
● Used for interactive queries with SQL syntax
● Replaces MapReduce with its own Impala Server
● Pro: can get much faster response time for SQL over HDFS than Hive or Pig

Impala benchmark

Note: in this benchmark Impala runs over Parquet!

Impala replaces MapReduce

Impala architecture

● The Impala architecture was inspired by Google Dremel
● MapReduce is great for functional programming, but not efficient for SQL
● Impala replaces MapReduce with a distributed query engine that is optimized for fast queries
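Since Impala speaks the HiveServer2 protocol, one common way to query it from Java is through the HiveServer2 JDBC driver. A minimal sketch, assuming an unsecured cluster, Impala's default JDBC port 21050, a hypothetical impalad-host, and the tikal table from the Hive slides:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2-compatible JDBC driver; 21050 is Impala's default JDBC port
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://impalad-host:21050/;auth=noSasl");    // hypothetical host
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT name FROM tikal " +
             "WHERE unix_timestamp() - unix_timestamp(startdate) > 365 * 24 * 60 * 60")) {
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
    }
  }
}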

Impala architecture

Presto, Drill, Tez

● Several more alternatives:
  ○ Presto by Facebook
  ○ Apache Drill, pushed by MapR
  ○ Apache Tez, pushed by Hortonworks
● All are alternatives to Impala and do more or less the same: provide faster response time for queries over HDFS
● Each of the above claims to have very fast results
● Be careful with the benchmarks they publish: to get better results they use columnar or indexed formats rather than sequential files in HDFS (e.g., ORC files, Parquet, HBase)

Complementary Databases

HBase

● Apache project
● NoSQL cluster database that can grow linearly
● Can store billions of rows X millions of columns
● Storage is based on HDFS
● API integrates with MapReduce
● Pros:
  ○ Strongly consistent reads/writes
  ○ Good for high-speed counter aggregations
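A minimal client sketch, assuming the HBase 1.x Java client API; the pageviews table, cf column family, and row key are hypothetical names used only for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCounterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("pageviews"))) {  // hypothetical table

      // strongly consistent single-row write
      Put put = new Put(Bytes.toBytes("page#/home"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("title"), Bytes.toBytes("Home"));
      table.put(put);

      // atomic counter increment (the "high-speed counter" use case)
      table.incrementColumnValue(Bytes.toBytes("page#/home"),
          Bytes.toBytes("cf"), Bytes.toBytes("hits"), 1L);

      // strongly consistent read of the same row
      Result row = table.get(new Get(Bytes.toBytes("page#/home")));
      long hits = Bytes.toLong(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("hits")));
      System.out.println("hits = " + hits);
    }
  }
}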

Parquet

● Apache (incubator) project, initiated by Twitter & Cloudera
● Columnar file format - writes one column at a time
● Integrated with the Hadoop ecosystem (MapReduce, Hive)
● Supports Avro, Thrift and Protocol Buffers
● Pro: keeps I/O to a minimum by reading from disk only the data required for the query

Columnar format (Parquet)

Advantages of Columnar formats

● Better compression, as the data is more homogeneous.
● I/O is reduced, as we can efficiently scan only a subset of the columns while reading the data.
● When storing data of the same type in each column, we can use encodings better suited to the modern processors' pipeline by making instruction branching more predictable.
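As a sketch of how such a file is written and read from Java, assuming the parquet-avro binding of the time (pre-Apache parquet.avro package names); the Employee schema and the file path are made up for illustration:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetReader;
import parquet.avro.AvroParquetWriter;

public class ParquetExample {
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"startdate\",\"type\":\"long\"}]}";

  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    Path file = new Path("/tmp/employees.parquet");   // hypothetical path

    // records are laid out column by column under the hood
    AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<GenericRecord>(file, schema);
    GenericRecord rec = new GenericData.Record(schema);
    rec.put("name", "dana");
    rec.put("startdate", 1400000000L);
    writer.write(rec);
    writer.close();

    // read back; a query engine could read only the columns it needs
    AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
    GenericRecord read;
    while ((read = reader.read()) != null) {
      System.out.println(read.get("name"));
    }
    reader.close();
  }
}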

Utilities

Flume

● Cloudera product
● Used to collect files from distributed systems and send them to a central repository
● Designed for integration with HDFS but can write to other file systems
● Supports listening to TCP and UDP sockets
● Main use case: collect distributed logs into HDFS

Avro

● An Apache project
● Data serialization by schema
● Supports rich data structures, defined in a JSON-like syntax
● Supports schema evolution
● Integrated with the Hadoop I/O API
● Similar to Thrift and Protocol Buffers
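A minimal serialization round trip in Java with Avro's generic API; the User schema and its fields are made up for illustration:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    // build a record that conforms to the schema
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "dana");
    user.put("email", "dana@example.com");

    // serialize to compact binary using the schema
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
    encoder.flush();

    // deserialize with the same (or an evolved, compatible) schema
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord back = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    System.out.println(back.get("name") + " / " + back.get("email"));
  }
}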

Oozie

● An Apache project
● Workflow scheduler for Hadoop jobs
● Very close integration with the Hadoop API

Mesos

● Apache project
● Cluster manager that abstracts resources
● Integrated with Hadoop to allocate resources
● Scalable to 10,000 nodes
● Supports physical machines, VMs, Docker
● Multi-resource scheduler (memory, CPU, disk, ports)
● Web UI for viewing cluster status
