2"2$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
The"Hadoop"Ecosystem"Chapter"2.1"
2"3$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! What$other$projects$exist$around$core$Hadoop?$
! When$to$use$HBase?$
! How$does$Spark$compare$to$MapReduce?$
! What$is$the$differences$between$Hive,$Pig,$and$Impala?$
! How$is$Flume$typically$deployed?$
$
The"Hadoop"Ecosystem"
2"4$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Chapter"Topics"
The$Hadoop$Ecosystem$
!! IntroducLon$
!! Data"Storage:"HBase"
!! Data"IntegraMon:"Flume"and"Sqoop"
!! Data"Processing:"Spark"
!! Data"Analysis:"Hive,"Pig,"and"Impala"
!!Workflow"Engine:"Oozie"
!!Machine"Learning:"Mahout"
2"5$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Hadoop"Distributed"File"System"
MapReduce"
Hive" Pig"Impala"Sqoop"
The"Hadoop"Ecosystem"(1)"
Oozie" …"Flume"HBase"
Hadoop"
Ecosystem"
Hadoop"Core"
Components"
CDH"
2"6$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Hive" Pig"Impala"Sqoop"
$
! Next,$a$discussion$of$the$key$Hadoop$ecosystem$components$
The"Hadoop"Ecosystem"(2)"
Oozie" …"Flume"HBase"
Hadoop"
Ecosystem"
2"7$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Ecosystem$projects$may$be$– Built"on"HDFS"and"MapReduce"
– Built"on"just"HDFS"– Designed"to"integrate"with"or"support"Hadoop"
! Most$are$Apache$projects$or$Apache$Incubator$projects$– Some"others"are"not"managed"by"the"Apache"So]ware"FoundaMon"
– These"are"o]en"hosted"on"GitHub"or"a"similar"repository"
! Following$is$an$introducLon$to$some$of$the$most$significant$projects$
The"Hadoop"Ecosystem"(3)"
2"9$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! HBase$is$the$Hadoop$database$
! A$‘NoSQL’$datastore$
! Can$store$massive$amounts$of$data$– Petabytes+"
! High$write$throughput$– Scales"to"hundreds"of"thousands"of"inserts"per"second"
! Handles$sparse$data$well$– No"wasted"spaces"for"empty"columns"in"a"row"
! Limited$access$model$– OpMmized"for"lookup"of"a"row"by"key"rather"than"full"queries"
– No"transacMons:"single"row"operaMons"only"– Only"one"column"(the"‘row"key’)"is"indexed"
HBase"
2"10$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
HBase"vs"TradiMonal"RDBMSs"
RDBMS$ HBase$Data$layout$ Row/oriented" Column/oriented"
TransacLons$ Yes" Single"row"only"
Query$language$ SQL" get/put/scan"(or"use"Hive"
or"Impala)"
Security$ AuthenMcaMon/AuthorizaMon" Kerberos"
Indexes$ Any"column" Row/key"only"
Max$data$size$ TBs" PB+"
Read/write$throughput$(queries$per$second)$
Thousands" Millions"
2"11$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Use$plain$HDFS$if…$– You"only"append"to"your"dataset""(no"random"write)"
– You"usually"read"the"whole"dataset"(no"random"read)"
! Use$HBase$if…$– You"need"random"write"and/or"read"
– You"do"thousands"of"operaMons"per"second""on"TB+"of"data"
! Use$an$RDBMS$if…$– Your"data"fits"on"one"big"node"– You"need"full"transacMon"support"– You"need"real/Mme"query"capabiliMes"
When"To"Use"HBase"
2"13$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! What$is$Flume?$– A"service"to"move"large"amounts"of"data"in"real"Mme"
– Example:"storing"log"files"in"HDFS"
! Flume$imports$data$into$HDFS$as$it$is$generated$– Instead"of"batch/processing"it"later"– For"example,"log"files"from"a"Web"server"
! Flume$is$– Distributed"– Reliable"and"available"– Horizontally"scalable""– Extensible"
Flume:"Real/Mme"Data"Import"
2"14$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Flume:"High/Level"Overview"
Agent"" Agent" Agent"
Agent" Agent$
Agent(s)$
Agent"
compress$encrypt$
• Pre/process"data"before"storing"• "e.g.,"transform,"scrub,"enrich"
• Store"in"any"format"
• Text,"compressed,"binary,"or"
custom"sink"
• Collect"data"as"it"is"produced"• Files,"syslogs,"stdout"or"custom"source"
"Agent""
• Process"in"place""• e.g.,"encrypt,"compress"
• Write"in"parallel"
• Scalable"throughput"
HDFS"
2"15$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Sqoop$transfers$data$between$RDBMSs$and$HDFS$– Does"this"very"efficiently"via"a"Map/only"MapReduce"job"
– Supports"JDBC,"ODBC,"and"several"specific"databases"– “Sqoop”"="“SQL"to"Hadoop”"
Sqoop:"Exchanging"Data"With"RDBMSs"
HDFS"
Sqoop"
"""
"
RDBMS"
2"16$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Custom$connectors$for$– MySQL"
– Postgres"– Netezza"– Teradata"– Oracle"(partnered"with"Quest"So]ware)"
! Not$open$source,$but$free$to$use$
Sqoop"Custom"Connectors"
2"18$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Apache$Spark$is$a$fast,$general$engine$for$large"scale$$data$processing$on$a$cluster$
! Originally$developed$UC$Berkeley’s$AMPLab$
! Open$source$Apache$project$
! Provides$several$benefits$over$MapReduce$– Faster"– Be>er"suited"for"iteraMve"algorithms"
– Can"hold"intermediate"data"in"RAM,"resulMng"in"much"be>er"
performance"
– Easier"API"– Supports"Python,"Scala,"Java"
– Supports"real/Mme"streaming"data"processing"
Apache"Spark"
2"19$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! MapReduce$– Widely"used,"huge"investment"already"made"
– Supports"and"supported"by"many"complementary"tools"
– Mature,"well/tested"
! Spark$– Flexible"– Elegant""– Fast"– Supports"real/Mme"streaming"data"processing"
! Over$Lme,$Spark$is$expected$to$supplant$MapReduce$as$the$general$processing$framework$used$by$most$organizaLons$
Spark"vs"Hadoop"MapReduce"
2"21$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! The$moLvaLon:$MapReduce$is$powerful$$but$hard$to$master$
! The$soluLon:$Hive$and$Pig$$– Languages"for"querying"and"manipulaMng"data"
– Leverage"exisMng"skillsets"– Data"analysts"who"use"SQL"– Programmers"who"use"scripMng"languages""
– Open"source"Apache"projects"– Hive"iniMally"developed"at"Facebook"– Pig"IniMally"developed"at"Yahoo!"
! Interpreter$runs$on$a$client$machine$– Turns"queries"into"MapReduce"jobs"
– Submits"jobs"to"the"cluster"
Hive"and"Pig:"High"Level"Data"Languages"
2"22$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Hive"
! What$is$Hive?$– HiveQL:"An"SQL/like"interface"to"Hadoop"
$
SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid
2"23$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Pig"
! What$is$Pig?$– Pig$LaLn:"A"dataflow"language"for"transforming"large"data"sets"
purchases = LOAD "/user/dave/purchases" AS (itemID, price, storeID, purchaserID);
bigticket = FILTER purchases BY price > 10000; ...
2"24$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Hive"vs."Pig"
Hive$ Pig$
Language$ HiveQL"(SQL/like)" Pig"LaMn"(dataflow"
language)"
Schema$ Table"definiMons"stored"in"
a"metastore"
Schema"opMonally"defined"
at"runMme"
ProgrammaLc$access$ JDBC,"ODBC" PigServer"(Java"API)"
JDBC:"Java"Database"ConnecMvity"
ODBC:"Open"Database"ConnecMvity"
2"25$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! High"performance$SQL$engine$for$vast$amounts$of$data$– Similar"query"language"to"HiveQL""
– 10"to"50+"Mmes"faster"than"Hive,"Pig,"or"MapReduce"
! Impala$runs$on$Hadoop$clusters$– Data"stored"in"HDFS"– Does"not"use"MapReduce"
! Developed$by$Cloudera$– 100%"open"source,"released"under"the"Apache"so]ware"license"
Impala:"High"Performance"Queries"
2"26$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Use$Impala$when…$– You"need"near"real/Mme"responses"to"ad"hoc"queries"
– You"have"structured"data"with"a"defined"schema"
! Use$Hive$or$Pig$when…$– You"need"support"for"custom"file"types,"complex"data"types,"or"external"
funcMons"
! Use$Pig$when…$– You"have"developers"experienced"with"wriMng"scripts"– Your"data"is"unstructured/mulM/structured"
! Use$Hive$When…$– You"have"analysts"familiar"with"SQL"
– You"are"integraMng"with"BI"or"reporMng"tools"via"ODBC/JDBC"
Which"to"Choose?"
2"30$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Mahout$is$a$Machine$Learning$library$wrigen$in$Java$$
! Used$for$– CollaboraMve"filtering"(recommendaMons)"
– Clustering"(finding"naturally"occurring"“groupings”"in"data)"– ClassificaMon"(determining"whether"new"data"fits"a"category)"
! Why$use$Hadoop$for$Machine$Learning?$– “It’s"not"who"has"the"best"algorithms"that"wins."It’s"who"has"the"most"
data.”"
Mahout"
2"31$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Hadoop$Ecosystem$– Many"projects"built"on,"and"supporMng,"Hadoop"
– Several"will"be"covered"in"detail"later"in"the"course"
Key"Points"
2"32$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
The$following$offer$more$informaLon$on$topics$discussed$in$this$chapter
! HBase$was$inspired$by$Google’s$“BigTable”$paper$presented$at$OSDI$in$2006$– http://research.google.com/archive/bigtable-osdi06.pdf
! An$interesLng$link$comparing$NoSQL$databases$– http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html
! AMPLab$– https://amplab.cs.berkeley.edu
! Databricks$– http://databricks.com
Bibliography"
2"33$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
The$following$offer$more$informaLon$on$topics$discussed$in$this$chapter
! Spark$– http://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html
! Dremel$is$a$distributed$system$for$interacLve$ad"hoc$queries$that$was$created$by$Google$ – http://research.google.com/pubs/archive/36632.pdf
! Impala,$developed$by$Cloudera,$supports$the$same$inner,$outer,$and$semi"joins$that$Hive$does – http://tiny.cloudera.com/dac15b
Bibliography""(cont’d)"
2"34$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Managing"Your"Hadoop"SoluMon"Chapter"2.2"
2"35$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! What$is$the$typical$architecture$of$a$data$center$with$Hadoop?$
! What$are$typical$hardware$requirements$for$a$Hadoop$cluster?$
Managing"Your"Hadoop"SoluMon"
2"36$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Chapter"Topics"
Managing$Your$Hadoop$SoluLon$
!! Hadoop$in$the$Data$Center$
!! Cluster"Hardware"
2"37$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
A"Typical"Data"Center"With"Hadoop"
2"38$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Technology"Strengths"and"Weaknesses"
Technology$ Strengths$ Drawbacks$
Hadoop" •! Scales"to"petabytes"•! Import/export"tools"
•! Flexible"structure"•! Commodity"hardware"
•! Query"speed"•! No"transacMon"support"
RelaMonal"
Databases"
•! Complex"transacMons"
•! 1000s"of"queries/second"•! SQL"
•! Data"must"fit"into"rows"and"columns"
•! Schema"is"costly"to"change"
•! Full"table"scans"are"slow"
Data"warehouses" •! ReporMng"capabiliMes"•! Up"to"100s"of"terabytes"
•! Dimensions"require"pre/materializaMon"
File"servers"(SAN/
NAS)"
•! Serving"individual"files"•! Write"caches"
•! Cost"•! Reading"large"porMons"of"the"data"saturates"the"network"
Backup"systems"
(tape)"
•! Cheap" •! Expensive"to"retrieve"the"data"
SAN:"Storage"Area"Network"
NAS:"Network/A>ached"Storage"
2"39$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Example"Data"Flow"
Web"server"logs"
Orders"
Site"Content/
RecommendaMons"
Hadoop$
ETL"
RecommendaMons"
HDFS" Enterprise"
Data""
Warehouse"
Sqoop"
(nightly)"
Flume""
(real"Mme)"
Sqoop"
(nightly)"
Sqoop"
HBase"
2"40$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Chapter"Topics"
Managing$Your$Hadoop$SoluLon$
!! Hadoop"in"the"Data"Center"
!! Cluster$Hardware$
2"41$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Cluster"Hardware"
Name"
Node"
JobTracker"or"
Resource"
Manager"
Slave"
Nodes"
Master""
Nodes"Client"
Master""
Nodes"
2"42$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Processors$– Mid/grade"processors"(e.g.,"2"x"6/core"2.9"GHz)"
! Memory$– 48/96GB"RAM"
! Network$– 1Gb"Ethernet"(mid/range)"
– 10Gb"Ethernet"(high/end)"! Disk$Drives$
– 6"x"2TB"drives"per"machine"(mid/range)"
– 12"x"3TB"drives"per"machine"(high/end)"
– Non/RAID""
Slave"Nodes:"Recommended"ConfiguraMons"(1)"
RAID:"Redundant"Array"of"Inexpensive"Disks"
2"43$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Switch$– Dedicated"switching"infrastructure"required"because"Hadoop"can"saturate"the"network""
– “All"nodes"talking"to"all"nodes”"! Cost$
– Per"slave"node"cost"should"be"around"$4,000"to"$10,000"(2014"esMmate)"
Slave"Nodes:"Recommended"ConfiguraMons"(2)"
2"44$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Four$master$nodes$– NameNode"(acMve"and"standby)"
– JobTracker"(MRv1)"or"ResourceManager"(MRv2)"(acMve"and"standby)"
– Some"installaMons"just"use"a"single"JobTracker"or"Resource"
Manager"
! Each$typically$runs$on$a$separate$machine$
! Master$node$machines$are$high"quality$servers$– Carrier/class"hardware"– Dual"power"supplies,"Ethernet"cards"– RAIDed"hard"drives"– 24GB"RAM"for"clusters"of"20"nodes"or"less"
– 48GB"RAM"for"clusters"of"up"to"300"nodes"
– 96GB"RAM"for"larger"clusters"
Master"Nodes"Are"More"Important"
2"45$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Basing$your$cluster$growth$on$storage$capacity$is$omen$a$good$method$to$use$
! Example:$– Data"grows"by"approximately"3TB"per"week/40TB"per"quarter"
– Hadoop"replicates"3"Mmes"="120TB"
– Extra"space"required"for"temporary"data"while"running"jobs"(~30%)"="
160TB"
– Assuming"machines"with"12"x"3TB"hard"drives"
– 4/5"new"machines"per"quarter"
– Two"years"of"data"="1.3PB"• Requires"approximately"36"machines"
Capacity"Planning"(1)"
2"46$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! New$nodes$are$automaLcally$used$by$Hadoop$
! Many$clusters$start$small$(less$than$10$nodes)$and$scale$up$as$data$and$processing$demands$grow$
! Hadoop$clusters$can$grow$to$thousands$of$nodes$
Capacity"Planning"(2)"
2"47$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
The$following$offer$more$informaLon$on$topics$discussed$in$this$chapter
! Slave$and$master$node$configuraLon$recommendaLons:$Hadoop$OperaLons,$first$ediLon$(1e),$by$Eric$Sammer,$p.$46$and$50.$
Bibliography"
2"48$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
IntroducMon"to"MapReduce"Chapter"2.3"
2"49$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Concepts$behind$MapReduce$
! How$does$data$flow$through$MapReduce$stages$
! Typical$uses$of$Mappers$
! Typical$uses$of$Reducers$
IntroducMon"to"MapReduce"
2"50$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Chapter"Topics"
IntroducLon$to$MapReduce$
!!MapReduce$Overview$
!! Example:"WordCount"
!!Mappers"
!! Reducers"
2"51$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! AutomaLc$parallelizaLon$and$distribuLon$
! Fault"tolerance$
! A$clean$abstracLon$for$programmers$– MapReduce"programs"are"usually"wri>en"in"Java"
– Can"be"wri>en"in"any"language"using"Hadoop&Streaming"– All"of"Hadoop"is"wri>en"in"Java"
! MapReduce$abstracts$all$the$‘housekeeping’$away$from$the$developer$– Developer"can"simply"concentrate"on"wriMng"the"Map"and"Reduce"
funcMons"
Review"/"Features"of"MapReduce"
2"52$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Review"–"Key"MapReduce"Stages"
! The$Mapper$– Each"Map"task"(typically)"operates"on"a"single"HDFS"
block"
– Map"tasks"(usually)"run"on"the"node"where"the"
block"is"stored"
! Shuffle$and$Sort$– Sorts"and"consolidates"intermediate"data"from"all"
mappers"
– Happens"a]er"all"Map"tasks"are"complete"and"
before"Reduce"tasks"start"
! The$Reducer$– Operates"on"shuffled/sorted"intermediate"data"
(Map"task"output)"
– Produces"final"output"
Map"
Reduce"
Shuffle""
and"Sort"
2"53$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
The"MapReduce"Flow"
Input"Split"1"
Input"
File(s)"Input"Split"2" Input"Split"3"
Input"Format"
2"54$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
The"MapReduce"Flow"
Record"Reader"
Input"Split"1"
Mapper"
Input"
File(s)"
Record"Reader"
Input"Split"2"
Mapper"
Record"Reader"
Input"Split"3"
Mapper"
Input"Format"
2"55$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
The"MapReduce"Flow"
Record"Reader"
ParMMoner"
Input"Split"1"
Shuffle"and"Sort"
Mapper"
Input"
File(s)"
Record"Reader"
ParMMoner"
Input"Split"2"
Mapper"
Record"Reader"
ParMMoner"
Input"Split"3"
Mapper"
Input"Format"
2"56$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
The"MapReduce"Flow"
Record"Reader"
ParMMoner"
Input"Split"1"
Shuffle"and"Sort"
Mapper"
Reducer"
Input"
File(s)"
Reducer"
Record"Reader"
ParMMoner"
Input"Split"2"
Mapper"
Record"Reader"
ParMMoner"
Input"Split"3"
Mapper"
Input"Format"
2"57$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
The"MapReduce"Flow"
Record"Reader"
ParMMoner"
Input"Split"1"
Shuffle"and"Sort"
Mapper"
Reducer"
Output"Format"
Output"File"
Input"
File(s)"
Reducer"
Output"Format"
Output"File"
Record"Reader"
ParMMoner"
Input"Split"2"
Mapper"
Record"Reader"
ParMMoner"
Input"Split"3"
Mapper"
Input"Format"
2"58$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
The"MapReduce"Flow"
Record"Reader"
ParMMoner"
Input"Split"1"
Shuffle"and"Sort"
Mapper"
Reducer"
Output"Format"
Output"File"
Input"
File(s)"
Reducer"
Output"Format"
Output"File"
Record"Reader"
ParMMoner"
Input"Split"2"
Mapper"
Record"Reader"
ParMMoner"
Input"Split"3"
Mapper"
Input"Format"
2"59$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Chapter"Topics"
IntroducLon$to$MapReduce$
!!MapReduce"Overview"
!! Review:$WordCount$Example$
!!Mappers"
!! Reducers"
2"60$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Example:"Word"Count"
the cat sat on the mat the aardvark sat on the sofa "
Input"Data"
Result"
aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4
Map" Reduce"
2"61$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Example:"The"WordCount"Mapper"(1)"
the cat sat on the mat the aardvark sat on the sofa …
Input"Data"(HDFS"file)"
0 the cat sat on the mat
23 the aardvark sat on the sofa
52 …
… …
Record"Reader"
Mapper"
2"62$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Mapper"the 1
cat 1
sat 1
on 1
the 1
mat 1
Example:"The"WordCount"Mapper"(2)"
the cat sat on the mat the aardvark sat on the sofa …
map()"
map()"
the 1
aardvark 1
sat 1
on 1
the 1
sofa 1
Input"Data"(HDFS"file)"
0 the cat sat on the mat
23 the aardvark sat on the sofa
52 …
… …
Record"Reader"
2"63$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Example:"WordCount"Shuffle"and"Sort"
drove 1
elephant 1
mahout 1
the 1,1
the 1
cat 1
sat 1
on 1
the 1
mat 1
the 1
aardvark 1
sat 1
on 1
the 1
sofa 1
the 1
mahout 1
drove 1
the 1
elephant 1
Mapper"
Mapper"
aardvark 1
cat 1
mat 1
on 1,1
sat 1,1
sofa 1
the 1,1,1,1
Node"1"
Node"2"
aardvark 1
cat 1
mat 1
elephant 1
mahout 1
sat 1,1
drove 1
on 1,1
sofa 1
the 1,1,1,1,1,1
2"64$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Example:"SumReducer"(1)"
"
Reducer"0"
"
Reducer"1"
"
Reducer"2"
"
part-r-00002
"
part-r-00001
"
part-r-00000
aardvark 1
cat 1
mat 1
drove 1
on 2
sofa 1
the 6
elephant 1
mahout 1
sat 2
Final"Output"
aardvark 1
cat 1
mat 1
elephant 1
mahout 1
sat 1,1
drove 1
on 1,1
sofa 1
the 1,1,1,1,1,1
2"65$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
drove 1
on 1,1
sofa 1
the 1,1,1,1,1,1
"
Example:"SumReducer"(2)"
Reducer"2"
reduce()"
reduce()"
reduce()"
reduce()"
drove 1
on 2
sofa 1
the 6
Final"Output"
part-r-00002 HDFS"File"
2"66$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Chapter"Topics"
IntroducLon$to$MapReduce$
!!MapReduce"Overview"
!! Example:"WordCount"
!!Mappers$
!! Reducers"
2"67$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! The$Mapper$$– Input:"key/value"pair"– Output:"A"list"of"zero"or"more"key"value"pairs"
MapReduce:"The"Mapper"(1)"
map(in_key, in_value) " (inter_key, inter_value) list
intermediate key 1 value 1
intermediate key 2 value 2
intermediate key 3 value 3
… …
input key
input value map"
2"68$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! The$Mapper$may$use$or$completely$ignore$the$input$key$– For"example,"a"standard"pa>ern"is"to"read"one"line"of"a"file"at"a"Mme"
– The"key"is"the"byte"offset"into"the"file"at"which"the"line"starts"– The"value"is"the"contents"of"the"line"itself"– Typically"the"key"is"considered"irrelevant"
! If$the$Mapper$writes$anything$out,$the$output$must$be$in$the$form$of$$key/value$pairs$
MapReduce:"The"Mapper"(2)"
the 1
aardvark 1
sat 1
on 1
the 1
sofa 1
23 the aardvark sat on the sofa map"
2"69$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Turn$input$into$upper$case$(pseudo"code):$
Example"Mapper:"Upper"Case"Mapper"
let map(k, v) = emit(k.toUpper(), v.toUpper())
mahout an elephant driver
map()"bugaboo an object of fear
or alarm
bumbershoot umbrella
MAHOUT AN ELEPHANT DRIVER
BUGABOO AN OBJECT OF FEAR OR ALARM
BUMBERSHOOT UMBRELLA
map()"
map()"
2"70$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Output$each$input$character$separately$(pseudo"code):$
Example"Mapper:"‘Explode’"Mapper"
let map(k, v) = foreach char c in v: emit (k, c)
map()"
pi 3.14
pi 3
pi .
pi 1
pi 4
145 kale
145 k
145 a
145 l
145 e
map()"
2"71$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Only$output$key/value$pairs$where$the$input$value$is$a$prime$number$(pseudo"code):$
Example"Mapper:"‘Filter’"Mapper"
let map(k, v) = if (isPrime(v)) then emit(k, v)
map()"
pi 3.14
foo 13
map()"
48 7 48 7 map()"
5 12
foo 13
map()"
2"72$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! The$key$output$by$the$Mapper$does$not$need$to$be$idenLcal$to$the$input$key$
! Example:$output$the$word$length$as$the$key$(pseudo"code):$
Example"Mapper:"Changing"Keyspaces"
let map(k, v) = emit(v.length(), v)
002 aim map()"
001 hadoop map()"
003 ridiculous map()"
3 aim
6 hadoop
10 ridiculous
2"73$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Emit$the$key,value$pair$(pseudo"code):$
Example"Mapper:"IdenMty"Mapper"
let map(k, v) = emit(k,v)
mahout an elephant driver
map()"bugaboo an object of fear
or alarm
bumbershoot umbrella
map()"
map()"
mahout an elephant driver
bugaboo an object of fear or alarm
bumbershoot umbrella
2"74$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Chapter"Topics"
IntroducLon$to$MapReduce$
!!MapReduce"Overview"
!! Example:"WordCount"
!!Mappers"
!! Reducers$
2"75$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Amer$the$Map$phase$is$over,$all$intermediate$values$for$a$given$intermediate$key$are$grouped$together$
! Each$key$and$value$list$is$passed$to$a$Reducer$– All"values"for"a"parMcular"intermediate"key"go"to"the"same"Reducer"
– The"intermediate"keys/value"lists"are"passed"in"sorted"key"order"
Shuffle"and"Sort"
html 891
gif 1231
jpg 3992
html 788
gif 3997
html 344
gif 1231
3997
jpg 3992
html
344
891
788
gif 1231
3997
jpg 3992
html
344
891
788
Reducer"
Reducer"
2"76$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! The$Reducer$outputs$zero$or$more$final$key/value$pairs$– In"pracMce,"usually"emits"a"single"key/value"pair"for"each"input"key"
– These"are"wri>en"to"HDFS"
The"Reducer"
gif 1231
3997
html
344
891
788
reduce()"
html 1498
reduce(inter_key, [v1, v2, …]) " (result_key, result_value)
gif 2614
reduce()"
2"77$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Add$up$all$the$values$associated$with$each$intermediate$key$(pseudo"code):$
Example"Reducer:"Sum"Reducer"
let reduce(k, vals) = sum = 0 foreach int i in vals: sum += i emit(k, sum)
SKU0021
34
8
19
the
1
1
1
1
SKU0021 61
the 4 reduce()"
reduce()"
2"78$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Find$the$mean$of$all$the$values$associated$with$each$intermediate$key$(pseudo"code):$
Example"Reducer:"Average"Reducer"
let reduce(k, vals) = sum = 0; counter = 0; foreach int i in vals: sum += i; counter += 1; emit(k, sum/counter)
SKU0021
34
8
19
the
1
1
1
1
SKU0021 20.33
the 1 reduce()"
reduce()"
2"79$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! The$IdenLty$Reducer$is$very$common$(pseudo"code):$
Example"Reducer:"IdenMty"Reducer"
let reduce(k, vals) = foreach v in vals: emit(k, v)
28
2
2
7
bow
a knot with two loops and two loose ends
a weapon for shooting arrows
a bending of the head or body in respect
bow a knot with two loops and two loose ends
bow a weapon for shooting arrows
bow a bending of the head or body in respect
28 2
28 2
28 7
reduce()"
reduce()"
2"80$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! A$MapReduce$program$has$two$major$developer"created$components:$a$Mapper$and$a$Reducer$
! Mappers$map$input$data$to$intermediate$key/value$pairs$– O]en"parse,"filter,"or"transform"the"data"
! Reducers$process$Mapper$output$into$final$key/value$pairs$– O]en"aggregate"data"using"staMsMcal"funcMons"
Key"Points"
2"81$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Hadoop"Clusters"Chapter"2.4"
2"82$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Components$of$a$Hadoop$cluster$
! How$do$Hadoop$jobs$and$tasks$run$on$a$cluster$
! How$does$a$job’s$data$flow$in$a$Hadoop$cluster$
Hadoop"Clusters"
2"83$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Chapter"Topics"
Hadoop$Clusters$
!! Hadoop$Cluster$Overview$
!! Hadoop"Jobs"and"Tasks"
2"84$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Amer$tesLng$a$Hadoop$soluLon$on$a$small$sample$dataset,$the$soluLon$can$be$run$on$a$Hadoop$cluster$to$process$all$data$
! Clusters$are$typically$installed$and$maintained$by$a$system$administrator$
! Developers$benefit$by$understanding$how$the$components$of$a$Hadoop$cluster$work$together$
Installing"A"Hadoop"Cluster"(1)"
2"85$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! Difficult$– Download,"install,"and"integrate"individual"Hadoop"components"
directly"from"Apache"
! Easier:$CDH$– Cloudera’s"DistribuMon"for"Apache"Hadoop"– Vanilla"Hadoop"plus"many"patches,"backports,"bug"fixes"
– Includes"many"other"components"from"the"Hadoop"ecosystem"
"
! Easiest:$Cloudera$Manager$– Wizard/based"UI"to"install,"configure"and"manage"a"Hadoop"
cluster"
– Included"with"Cloudera"Standard"(free)"or"Cloudera"Enterprise"
Installing"A"Hadoop"Cluster"(2)"
2"86$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
MapReduce"
Master"
Nodes"
Hadoop"Cluster"Terminology"
Slave"Node"
HDFS""
Master"
Nodes"
! A$Hadoop$cluster$is$a$group$of$computers$working$together$– Usually"runs"HDFS"and"MapReduce""
! A$node$is$an$individual$computer$in$the$cluster$– Master"nodes"manage"distribuMon"of"work"and"data"to"slave"nodes"
! A$daemon$is$a$program$running$on$a$node$– Each"performs"different"funcMons"in"the"cluster"
Slave"Node"
Slave"Node"
Slave"Node"
…""
2"87$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! HDFS$daemons$– NameNode"–"holds"the"metadata"for"HDFS""
– Typically"two"on"a"producMon"cluster:"one"acMve,"one"standby"– DataNode"–"holds"the"actual"HDFS"data"
– One"per"slave"node"
Hadoop"Daemons:"HDFS"
DataNode"
Name"Node""(AcMve"+"
Standby)"
DataNode"
DataNode"
DataNode"
…""
2"88$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! MapReduce$v1$(“MRv1”$or$“Classic$MapReduce”)$– Uses"a"JobTracker/TaskTracker"architecture"– One"JobTracker"per"cluster"–"limits"cluster"size"to"about"4000"nodes"
– Slots"on"slave"nodes"designated"for"Map"or"Reduce"tasks"
! MapReduce$v2$(“MRv2”)$– Built"on"top"of"YARN"(Yet"Another"Resource"NegoMator)"– Uses"ResourceManager/NodeManager"architecture"
– Increases"scalability"of"cluster"– Node"resources"can"be"used"for"any"type"of"task"
– Improves"cluster"uMlizaMon"
– Support"for"non/MR"jobs"
MapReduce"v1"and"v2"(1)"
2"89$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
! MRv1$daemons$– JobTracker"–"one"per"cluster"
– Manages"MapReduce"jobs,"distributes"individual"tasks"to"
TaskTrackers"
– TaskTracker"–"one"per"slave"node"– Starts"and"monitors"individual"Map"and"Reduce"tasks"
Hadoop"Daemons:"MapReduce"v1"
JobTracker"(AcMve"+"
Standby)"
TaskTracker"
TaskTracker"
TaskTracker"
…""
2"90$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
• Manage"data"storage"
• Hold"metadata"
Basic"Cluster"ConfiguraMon:"HDFS"
DataNode" DataNode"DataNode"DataNode"Slave"
Nodes"
HDFS""
Master"
Nodes"
Name"
Node"(AcMve"
+"Standby)"
2"91$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."
Basic Cluster Configuration: HDFS + MapReduce v1

[Diagram: A client submits an MR job to the MapReduce Master Nodes, where the JobTracker (Active + Standby) manages MR jobs and distributes tasks to the slave nodes. The HDFS Master Nodes run the NameNode (Active + Standby), which manages data storage and holds metadata. Each slave node runs a TaskTracker, executing Map and Reduce tasks, alongside a DataNode.]
Hadoop Daemons: MapReduce v2

! MRv2 daemons
  – ResourceManager – one per cluster
    – Starts ApplicationMasters, allocates resources on slave nodes
  – ApplicationMaster – one per job
    – Requests resources, manages individual Map and Reduce tasks
  – NodeManager – one per slave node
    – Manages resources on individual slave nodes
  – JobHistoryServer – one per cluster
    – Archives jobs' metrics and metadata

[Diagram: ResourceManager (Active + Standby) and JobHistoryServer connected to NodeManagers, one per slave node; one NodeManager also hosts an ApplicationMaster]
Basic Cluster Configuration: HDFS

[Diagram: HDFS Master Nodes run the NameNode (Active + Standby), which holds metadata; Slave Nodes run DataNodes, which manage data storage]

Note: This slide is the same as the earlier HDFS slide; there is no change in the HDFS design for YARN/MapReduce v2, because HDFS is the storage side of Hadoop. YARN/MapReduce v2 implements the compute side of Hadoop.
Basic Cluster Configuration: HDFS + MapReduce v2

[Diagram: A client submits an MR job to the MapReduce Master Nodes, where the ResourceManager (Active + Standby) allocates resources and launches ApplicationMasters. The HDFS Master Nodes run the NameNode (Active + Standby), which manages data storage and holds metadata. Each slave node runs a NodeManager alongside a DataNode; one NodeManager hosts the MapReduce ApplicationMaster, which manages the job's Map and Reduce tasks.]
Chapter Topics

Hadoop Clusters
!! Hadoop Cluster Overview
!! Hadoop Jobs and Tasks
Review – MapReduce Terminology

! A job is a 'full program'
  – A complete execution of Mappers and Reducers over a dataset
! A task is the execution of a single Mapper or Reducer over a slice of data
! A task attempt is a particular instance of an attempt to execute a task
  – There will be at least as many task attempts as there are tasks
  – If a task attempt fails, another will be started by the JobTracker or ApplicationMaster
  – Speculative execution (covered later) can also result in more task attempts than completed tasks
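The terminology above can be illustrated with a toy, in-process word count. This is a sketch of the concepts only, not Hadoop code: each `map_task` call stands in for one Map task over one input split, each `reduce_task` call for one Reduce task over one key, and `run_job` for a complete job.

```python
from collections import defaultdict

def map_task(split):
    # One Map task: emit (word, 1) pairs for its slice of the data
    return [(word, 1) for line in split for word in line.split()]

def reduce_task(key, values):
    # One Reduce task invocation: aggregate all values for a single key
    return (key, sum(values))

def run_job(splits):
    # A "job" is the full program: all Map tasks, then all Reduce tasks
    intermediate = []
    for split in splits:          # one Map task per input split
        intermediate.extend(map_task(split))
    groups = defaultdict(list)
    for key, value in intermediate:  # group intermediate data by key
        groups[key].append(value)
    return dict(reduce_task(k, v) for k, v in groups.items())

counts = run_job([["the cat sat"], ["the dog sat"]])
print(counts)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

Task attempts are not modeled here; in a real cluster each `map_task` call could be retried (or speculatively duplicated) as multiple attempts of the same task.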
Submitting a Job

[Diagram: The client submits a job – a .jar file plus XML configuration – to the MR Master Node]
A MapReduce v1 Cluster

[Diagram: JobTracker(s) and NameNode(s) sit above the slave nodes; each slave node runs a TaskTracker alongside a DataNode]
Running a Job on a MapReduce v1 Cluster (1)

[Diagram: The client uploads the file mydata to HDFS; it is stored as Block1 and Block2 on DataNodes across the slave nodes]

$ hadoop fs -put mydata
Running a Job on a MapReduce v1 Cluster (2)

[Diagram: The client submits the job to the JobTracker, which assigns Map Task 1 and Map Task 2 to TaskTrackers on the nodes holding Block1 and Block2 of mydata, and Reduce Task 1 and Reduce Task 2 to other TaskTrackers]
A MapReduce v2 Cluster

[Diagram: ResourceManager(s), NameNode(s), and the JobHistoryServer sit above the slave nodes; each slave node runs a NodeManager alongside a DataNode]
Running a Job on a MapReduce v2 Cluster (1)

[Diagram: The client submits the job to the ResourceManager, which launches a MapReduce ApplicationMaster in a NodeManager on one of the slave nodes; the file mydata is stored in HDFS as Block1 and Block2]
Running a Job on a MapReduce v2 Cluster (2)

[Diagram: The MapReduce ApplicationMaster starts Map Task 1 and Map Task 2 on the nodes holding Block1 and Block2 of mydata, and Reduce Task 1 and Reduce Task 2 on other nodes]
Job Data: Mapper Data Locality

When possible, Map tasks run on a node where the block of data to be processed is stored locally. Otherwise, the Map task must transfer the data across the network before processing it.

[Diagram: Map Task 1 and Map Task 2 running on the nodes where Block1 and Block2 are stored in HDFS]
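A minimal sketch of the data-local placement idea, not Hadoop's actual scheduler: the `block_locations` map and node names below are hypothetical, and real Hadoop weighs node-local, rack-local, and off-rack placements.

```python
# Hypothetical map of which nodes hold a replica of each block
block_locations = {
    "Block1": {"node1", "node2"},
    "Block2": {"node2", "node3"},
}

def schedule_map_task(block, nodes_with_free_slots):
    # Data-local placement: prefer a free node that stores the block
    local = block_locations[block] & nodes_with_free_slots
    if local:
        return sorted(local)[0], "local read"
    # Otherwise any free node will do, but the block must be
    # transferred across the network before processing
    return sorted(nodes_with_free_slots)[0], "network read"

print(schedule_map_task("Block1", {"node2", "node3"}))  # ('node2', 'local read')
print(schedule_map_task("Block1", {"node3"}))           # ('node3', 'network read')
```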
Job Data: Intermediate Data

Map task intermediate data is stored on the local disk (not HDFS).

[Diagram: Map Task 1 and Map Task 2 read Block1 and Block2 from HDFS and write their intermediate data to local disk]
Job Data: Shuffle and Sort

Intermediate data is transferred across the network to the Reducers. There is no concept of data locality for Reducers. Reducers write their output to HDFS.

[Diagram: Intermediate data from Map Task 1 and Map Task 2 flows across the network to Reduce Task 1 and Reduce Task 2, which write their output to HDFS]
Is Shuffle and Sort a Bottleneck?

! It appears that the shuffle and sort phase would be a bottleneck
  – The reduce method in the Reducers cannot start until all Mappers have finished
! In practice, Hadoop starts transferring data from Mappers to Reducers as soon as each Mapper finishes its work
  – This avoids a huge burst of data transfer starting as soon as the last Mapper finishes
  – The reduce method still does not start until all intermediate data has been transferred and sorted
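The shuffle and sort step can be sketched as follows. This is a toy in-process model, not Hadoop internals: all intermediate (key, value) pairs are merged and sorted by key, and each resulting group is the input to one reduce call, which is why no reduce can run until every Mapper's output has arrived.

```python
from itertools import groupby

mapper1_output = [("sat", 1), ("the", 1), ("cat", 1)]
mapper2_output = [("the", 1), ("dog", 1), ("sat", 1)]

def shuffle_and_sort(*mapper_outputs):
    # Merge the intermediate data from all Mappers and sort it by key
    merged = sorted(pair for output in mapper_outputs for pair in output)
    # Group values by key: each group feeds one reduce() invocation
    return [(key, [v for _, v in group])
            for key, group in groupby(merged, key=lambda p: p[0])]

for key, values in shuffle_and_sort(mapper1_output, mapper2_output):
    print(key, sum(values))  # cat 1, dog 1, sat 2, the 2
```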
Is a Slow Mapper a Bottleneck?

! It is possible for one Map task to run more slowly than the others
  – Perhaps due to faulty hardware, or just a very slow machine
! It would appear that this would create a bottleneck
  – The reduce method in the Reducer cannot start until every Mapper has finished
! Hadoop uses speculative execution to mitigate this
  – If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data
    – This is a new task attempt for the same task
  – The results of whichever Mapper finishes first will be used
  – Hadoop will kill off the Mapper that is still running
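The speculative-execution logic above can be sketched in a few lines. This is an illustrative model only: the timings, threshold, and naming scheme are invented for the example, and real Hadoop's straggler detection is based on observed task progress rather than known run times.

```python
def run_with_speculation(task_id, attempt_times, slow_threshold=2.0, peer_avg=1.0):
    # Launch the first attempt for this task
    attempts = {f"{task_id}_attempt_0": attempt_times[0]}
    # If it looks much slower than its peers, start a backup attempt
    # on another machine, operating on the same data
    if attempt_times[0] > slow_threshold * peer_avg and len(attempt_times) > 1:
        attempts[f"{task_id}_attempt_1"] = attempt_times[1]
    # The first attempt to finish wins; any other attempt is killed
    winner = min(attempts, key=attempts.get)
    killed = [a for a in attempts if a != winner]
    return winner, killed

# A healthy task: one attempt, no speculation
print(run_with_speculation("task_1", [0.9]))        # ('task_1_attempt_0', [])
# A straggler: the backup attempt finishes first and the original is killed
print(run_with_speculation("task_2", [30.0, 1.1]))  # ('task_2_attempt_1', ['task_2_attempt_0'])
```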
Creating and Running a MapReduce Job

! Write the Mapper and Reducer classes
! Write a Driver class that configures the job and submits it to the cluster
  – Driver classes are covered in the "Writing MapReduce" chapter
! Compile the Mapper, Reducer, and Driver classes
! Create a jar file containing the Mapper, Reducer, and Driver classes
! Run the hadoop jar command to submit the job to the Hadoop cluster

$ javac -classpath `hadoop classpath` MyMapper.java MyReducer.java MyDriver.java
$ jar cf MyMR.jar MyMapper.class MyReducer.class MyDriver.class
$ hadoop jar MyMR.jar MyDriver in_file out_dir
Bibliography

The following offer more information on topics discussed in this chapter:
! Full Cloudera documentation is available at
  – http://cloudera.com/
! See Hadoop: The Definitive Guide, third edition (TDG 3e) for details on job submission – Chapter 6, "How MapReduce Works", starting on page 189.