MapReduce Design Patterns
CMSC 491 Hadoop-Based Distributed Computing
Spring 2016, Adam Shook
Agenda
• Summarization Patterns
• Filtering Patterns
• Data Organization Patterns
• Join Patterns
• Metapatterns
• I/O Patterns
• Bloom Filters
SUMMARIZATION PATTERNS
Numerical Summarizations, Inverted Index, Counting with Counters
Overview
• Top-down summarization of large data sets
• Most straightforward patterns
• Calculate aggregates over an entire data set or over groups
• Build indexes
Numerical Summarizations
• Group records together by a field or set of fields and calculate a numerical aggregate per group
• Build histograms or calculate statistics from numerical values
Known Uses
• Word Count
• Record Count
• Min/Max/Count
• Average/Median/Standard Deviation
Structure
Performance
• Performs well, especially when a combiner is used
• Need to be concerned about data skew from the key (a few keys receiving most of the records)
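A combiner only helps when partial results can be merged in any order. Min, max, and count qualify directly; an average does not, so a common approach is to carry a (sum, count) pair and divide at the end. The following is a minimal plain-Java sketch of that idea with no Hadoop dependency. The class and method names are hypothetical and chosen only for illustration.

```java
/** Plain-Java sketch: a combiner-safe average carried as a (sum, count) pair. */
final class AverageCombinerSketch {

    /** Partial aggregate that can be merged in any order. */
    static final class SumCount {
        final double sum;
        final long count;

        SumCount(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        /** Merging two partials is associative, so a combiner may apply it early. */
        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        double average() {
            return sum / count;
        }
    }

    /** Stands in for a combiner running over the values of one input split. */
    static SumCount combine(double... values) {
        SumCount acc = new SumCount(0, 0);
        for (double v : values) {
            acc = acc.merge(new SumCount(v, 1));
        }
        return acc;
    }

    public static void main(String[] args) {
        SumCount split1 = combine(1, 2, 3);
        SumCount split2 = combine(10);
        // The "reducer" merges the partials and divides once at the end.
        System.out.println(split1.merge(split2).average());
    }
}
```

Averaging the two per-split averages instead (2.0 and 10.0) would give 6.0 rather than the true mean of 4.0, which is why the pair is carried through the combiner.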
Example
• Discover the first time a StackOverflow user posted, the last time a user posted, and the number of posts in between
• User ID, Min Date, Max Date, Count
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.io.Writable;

// Custom Writable holding the earliest date, latest date, and post count for one user
public class MinMaxCountTuple implements Writable {
    private Date min = new Date();
    private Date max = new Date();
    private long count = 0;
    private final static SimpleDateFormat frmt =
        new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");

    public Date getMin() { return min; }
    public void setMin(Date min) { this.min = min; }
    public Date getMax() { return max; }
    public void setMax(Date max) { this.max = max; }
    public long getCount() { return count; }
    public void setCount(long count) { this.count = count; }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields are read in the same order that write() emits them
        min = new Date(in.readLong());
        max = new Date(in.readLong());
        count = in.readLong();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(min.getTime());
        out.writeLong(max.getTime());
        out.writeLong(count);
    }

    @Override
    public String toString() {
        return frmt.format(min) + "\t" + frmt.format(max) + "\t" + count;
    }
}
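// Nested in the driver class. Needs java.io.IOException, java.text.ParseException, java.util.Map,
// org.apache.hadoop.io.Text, and org.apache.hadoop.mapreduce.Mapper.
public static class MinMaxCountMapper extends Mapper<Object, Text, Text, MinMaxCountTuple> {
    private Text outUserId = new Text();
    private MinMaxCountTuple outTuple = new MinMaxCountTuple();
    private final static SimpleDateFormat frmt =
        new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // xmlToMap is a helper (not shown) that parses one XML row into a map
        Map<String, String> parsed = xmlToMap(value.toString());
        String strDate = parsed.get("CreationDate");
        String userId = parsed.get("UserId");
        if (strDate == null || userId == null) {
            return; // skip rows missing either attribute
        }
        Date creationDate;
        try {
            creationDate = frmt.parse(strDate);
        } catch (ParseException e) {
            return; // skip rows with a malformed date
        }
        // A single post is its own min and max, with a count of 1
        outTuple.setMin(creationDate);
        outTuple.setMax(creationDate);
        outTuple.setCount(1);
        outUserId.set(userId);
        context.write(outUserId, outTuple);
    }
}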
// Nested in the driver class. Needs java.io.IOException, org.apache.hadoop.io.Text,
// and org.apache.hadoop.mapreduce.Reducer.
public static class MinMaxCountReducer extends Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple> {
    private MinMaxCountTuple result = new MinMaxCountTuple();

    @Override
    public void reduce(Text key, Iterable<MinMaxCountTuple> values, Context context)
            throws IOException, InterruptedException {
        result.setMin(null);
        result.setMax(null);
        long sum = 0; // long, to match getCount() and avoid int overflow
        for (MinMaxCountTuple val : values) {
            if (result.getMin() == null || val.getMin().compareTo(result.getMin()) < 0) {
                result.setMin(val.getMin());
            }
            if (result.getMax() == null || val.getMax().compareTo(result.getMax()) > 0) {
                result.setMax(val.getMax());
            }
            sum += val.getCount();
        }
        result.setCount(sum);
        context.write(key, result);
    }
}
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: MinMaxCountDriver <in> <out>");
        System.exit(2);
    }
    // Job.getInstance replaces the deprecated Job constructor
    Job job = Job.getInstance(conf, "Comment Date Min Max Count");
    job.setJarByClass(MinMaxCountDriver.class);

    job.setMapperClass(MinMaxCountMapper.class);
    // The reducer doubles as the combiner because min, max, and sum are associative
    job.setCombinerClass(MinMaxCountReducer.class);
    job.setReducerClass(MinMaxCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(MinMaxCountTuple.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Inverted Index
• Generate an index from a data set to enable fast searches or data enrichment
• Building an index takes time, but can greatly reduce the amount of time to search for something
• Output can be ingested into a key/value store
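The index-building step described above can be sketched in plain Java, with no Hadoop dependency. The "map" step emits (term, document ID) pairs and the "reduce" step groups the IDs under each term. Whitespace tokenization, the class name, and the method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java sketch of the inverted index pattern. */
final class InvertedIndexSketch {

    /** "Map" step: emit one (term, docId) pair per whitespace-separated term. */
    static List<String[]> map(String docId, String text) {
        List<String[]> pairs = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                pairs.add(new String[] {term, docId});
            }
        }
        return pairs;
    }

    /** "Shuffle and reduce" step: collect the sorted, de-duplicated doc IDs per term. */
    static Map<String, TreeSet<String>> reduce(List<String[]> pairs) {
        Map<String, TreeSet<String>> index = new TreeMap<>();
        for (String[] pair : pairs) {
            index.computeIfAbsent(pair[0], k -> new TreeSet<>()).add(pair[1]);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.addAll(map("doc1", "hadoop map reduce"));
        pairs.addAll(map("doc2", "hadoop bloom filter"));
        System.out.println(reduce(pairs));
    }
}
```

Each entry of the resulting map is a term with its list of document IDs, which is the shape a key/value store would ingest.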
Structure
Counting with Counters
• Use the MapReduce framework's counter utility to calculate a global sum entirely on the map side, producing no output
• Small number of counters only!!
Known Uses
• Count number of records
• Count a small number of unique field instances
• Sum fields of data together
Structure
FILTERING PATTERNS
Filtering, Bloom Filtering, Top Ten, Distinct
Filtering
• Discard records that are not of interest
• Create subsets of your big data sets that you want to further analyze
Known Uses
• Closer view of the data
• Tracking a thread of events
• Distributed grep
• Data cleansing
• Simple random sampling
Structure
Bloom Filtering
• Keep records that are a member of a large predefined set of values
• Inherent possibility of false positives
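To make the false-positive behavior concrete, here is a minimal Bloom filter sketch in plain Java. A "no" answer is always correct, while a "yes" answer only means the value is possibly in the set. The bit count, the number of hashes, and the double-hashing scheme built on `String.hashCode()` are illustrative assumptions, not tuned choices, and the class is hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Minimal Bloom filter sketch; sizing and hash scheme are illustrative assumptions. */
final class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Derive the i-th bit position from two base hashes (double hashing). */
    private int position(String value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | 1; // odd step; a simple second hash for illustration
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(value, i));
        }
    }

    /** false means definitely absent; true means possibly present. */
    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }

    /** The "map" step of the pattern: keep only keys that pass the filter. */
    static List<String> filter(List<String> keys, BloomFilterSketch watched) {
        List<String> kept = new ArrayList<>();
        for (String key : keys) {
            if (watched.mightContain(key)) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BloomFilterSketch watched = new BloomFilterSketch(1024, 3);
        watched.add("alice");
        watched.add("bob");
        System.out.println(filter(List.of("alice", "carol", "bob"), watched));
    }
}
```

Members of the watched set are never dropped. A non-member may occasionally slip through, which is why the pattern suits pre-filtering ahead of an exact membership test.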
Known Uses
• Removing most of the non-watched values
• Pre-filtering a data set prior to an expensive membership test
Structure
Top Ten
• Retrieve a relatively small number of top K records based on a ranking scheme
• Find the outliers or most interesting records
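One common way to realize top K is for every mapper to keep only its local top K and for a single reducer to take the top K of those candidates. The plain-Java sketch below assumes integer scores and uses a size-bounded min-heap. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java sketch of the top-K pattern using a bounded min-heap. */
final class TopKSketch {

    /** Keep the k largest values seen; the heap never grows beyond k entries. */
    static List<Integer> topK(Iterable<Integer> values, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int v : values) {
            heap.offer(v);
            if (heap.size() > k) {
                heap.poll(); // evict the current smallest candidate
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        // Each "mapper" reduces its split to a local top 3.
        List<Integer> local1 = topK(Arrays.asList(5, 1, 9, 3), 3);
        List<Integer> local2 = topK(Arrays.asList(7, 8, 2), 3);

        // A single "reducer" sees at most k values per mapper.
        List<Integer> candidates = new ArrayList<>(local1);
        candidates.addAll(local2);
        System.out.println(topK(candidates, 3));
    }
}
```

Because each mapper forwards at most K records, the single reducer stays cheap as long as K is small.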
Known Uses
• Outlier analysis
• Selecting interesting data
• Catchy dashboards
Structure
Distinct
• Remove duplicate entries of your data, either full records or a subset of fields
• That fourth V nobody talks about that much
Known Uses
• Deduplicate data
• Get distinct values
• Protect from inner join explosion
Structure
DATA ORGANIZATION PATTERNS
Structured to Hierarchical, Partitioning, Binning, Total Order Sorting, Shuffling
Structured to Hierarchical
• Transform row-based data to a hierarchical format
• Reformatting RDBMS data to a more conducive structure
Known Uses
• Pre-joining data
• Prepare data for HBase or MongoDB
Structure
Partitioning
• Partition records into smaller data sets
• Enables faster future query times due to partition pruning
Known Uses
• Partition pruning by continuous value
• Partition pruning by category
• Sharding
Structure
Binning
• File records into one or more categories
  – Similar to partitioning, but the implementation is different
• Can be used to solve similar problems to Partitioning
Known Uses
• Pruning for follow-on analytics
• Categorizing data
Structure
Total Order Sorting
• Sort your data set in parallel
• Difficult to apply the “divide and conquer” technique of MapReduce
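One way around the difficulty is range partitioning: pick split points so that every key sent to reducer i is smaller than every key sent to reducer i+1, let each reducer sort its own range, and concatenate the outputs in reducer order. The plain-Java sketch below assumes integer keys and distinct, pre-sorted split points supplied by the caller. In practice the split points would come from sampling the data, and Hadoop provides a `TotalOrderPartitioner` for this step. The class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java sketch of total order sorting via range partitioning. */
final class RangePartitionSketch {

    /** Partition index for a key, given distinct split points in ascending order. */
    static int partition(int key, int[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    /** Route each value to its range, sort each range alone, then concatenate. */
    static int[] totalOrderSort(int[] data, int[] splitPoints) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) {
            parts.add(new ArrayList<>());
        }
        for (int v : data) {
            parts.get(partition(v, splitPoints)).add(v);
        }
        int[] out = new int[data.length];
        int n = 0;
        for (List<Integer> part : parts) {
            Collections.sort(part); // each "reducer" sorts only its own range
            for (int v : part) {
                out[n++] = v;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] sorted = totalOrderSort(new int[] {42, 7, 19, 10, 33, 1, 20}, new int[] {10, 20});
        System.out.println(Arrays.toString(sorted));
    }
}
```

Badly chosen split points leave the result correct but unbalanced, with one range holding most of the data.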
Known Uses
• Sorting
Structure
Shuffling
• Set of records that you want to completely randomize
• Instill some anonymity or create some repeatable random sampling
Known Uses
• Anonymize the order of the data set
• Repeatable random sampling once the data is shuffled
Structure
JOIN PATTERNS
Join Refresher, Reduce-Side Join w/ and w/o Bloom Filter, Replicated Join, Composite Join, Cartesian Product
Join Refresher
• A join is an operation that combines records from two or more data sets based on a field or set of fields, known as a foreign key
• Let's go over the different types of joins before talking about how to do it in MapReduce
A Tale of Two Tables
Inner Join
Left Outer Join
Right Outer Join
Full Outer Join
Antijoin
Cartesian Product
How to implement?
• Reduce-Side Join w/ and w/o Bloom Filter
• Replicated Join
• Composite Join
• Cartesian Product stands alone
Reduce-Side Join
• Two or more data sets are joined in the reduce phase
• Covers all join types we have discussed
  – Exception: Mr. Cartesian
• All data is sent over the network
  – If applicable, filter using a Bloom filter
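The mechanics can be sketched in plain Java: each "mapper" tags its records with the data set they came from, the shuffle groups them by join key, and the "reducer" pairs the two sides for each key. The sketch shows an inner join only. The tags 'A' and 'B', the tab-separated output, and the class and method names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of a reduce-side inner join. */
final class ReduceSideJoinSketch {

    /** "Map" output value: a record plus a tag naming its source data set. */
    static final class Tagged {
        final char source;
        final String record;

        Tagged(char source, String record) {
            this.source = source;
            this.record = record;
        }
    }

    /** Stand-in for the shuffle: group tagged records from both inputs by join key. */
    static Map<String, List<Tagged>> shuffle(List<String[]> a, List<String[]> b) {
        Map<String, List<Tagged>> grouped = new TreeMap<>();
        for (String[] rec : a) {
            grouped.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(new Tagged('A', rec[1]));
        }
        for (String[] rec : b) {
            grouped.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(new Tagged('B', rec[1]));
        }
        return grouped;
    }

    /** "Reduce" for one key: pair every A record with every B record. */
    static List<String> reduceInner(String key, List<Tagged> values) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Tagged t : values) {
            (t.source == 'A' ? left : right).add(t.record);
        }
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(key + "\t" + l + "\t" + r);
            }
        }
        return out;
    }

    static List<String> innerJoin(List<String[]> a, List<String[]> b) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : shuffle(a, b).entrySet()) {
            out.addAll(reduceInner(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(new String[] {"1", "alice"}, new String[] {"2", "bob"});
        List<String[]> comments = List.of(new String[] {"1", "hi"}, new String[] {"1", "bye"},
                new String[] {"3", "orphan"});
        innerJoin(users, comments).forEach(System.out::println);
    }
}
```

An outer join or antijoin would differ only in what the reducer emits when one side is empty. Every record reaches the shuffle whether or not it finds a match, which is the network cost that a Bloom filter on the map side can reduce.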
Structure
Performance
• Need to be concerned about data skew
• 2 PB joined on 2 PB means 4 PB of network traffic
Replicated Join
• Inner and Left Outer Joins
• Removes need to shuffle any data to the reduce phase
• Very useful, but requires one large data set and the remaining data sets to be able to fit into the memory of each map task
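The idea can be sketched in plain Java as a hash lookup: the small data set is held in an in-memory map, and each record of the large data set is joined as it streams past, so no shuffle is needed. The method signature, the "null" placeholder for unmatched left-outer rows, and the class name are assumptions for illustration. In a Hadoop job the small data set would typically be loaded into memory once per map task before any records are processed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of a replicated (map-side, in-memory) join. */
final class ReplicatedJoinSketch {

    /**
     * Join each large-side record {key, value} against the in-memory small table.
     * With leftOuter set, unmatched large-side records are kept with a "null" filler.
     */
    static List<String> join(Map<String, String> smallTable, List<String[]> largeRecords,
                             boolean leftOuter) {
        List<String> out = new ArrayList<>();
        for (String[] rec : largeRecords) {
            String match = smallTable.get(rec[0]);
            if (match != null) {
                out.add(rec[0] + "\t" + rec[1] + "\t" + match);
            } else if (leftOuter) {
                out.add(rec[0] + "\t" + rec[1] + "\tnull");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("1", "alice");
        List<String[]> comments = List.of(new String[] {"1", "hi"}, new String[] {"3", "orphan"});
        join(users, comments, true).forEach(System.out::println);
    }
}
```

The large data set must be the "left" side: a right or full outer join would need to know which small-table keys were never matched by any map task, and no single map task can know that.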
Structure
Performance
• Fastest type of join
• Map-only
• Limited based on how much data you can safely store inside the JVM
• Need to be concerned about growing data sets
• Could optionally use a Bloom filter
Composite Join
• Leverages built-in Hadoop utilities to join the data
• Requires the data to be already organized and prepared in a specific way
• Really only useful if you have one large data set that you are using a lot
Data Structure
Structure
Performance
• Good performance; the join operation is done on the map side
• Requires the data to have the same number of partitions, partitioned in the same way, and each partition must be sorted
Cartesian Product
• Pair up and compare every single record with every other record in a data set
• Allows relationships between many different data sets to be uncovered at a fine-grain level
Known Uses
• Document or image comparisons
• Math stuff or something
Structure
Performance
• Massive data explosion!
• Can use many map slots for a long time
• Effectively creates a data set of size O(n²)
  – Need to make sure your cluster can fit what you are doing
METAPATTERNS
Job Chaining, Chain Folding, Job Merging
Job Chaining
• One job is often not enough
• Need a combination of the patterns discussed to do your workflow
• Sequential vs. Parallel
Methodologies
• In the Driver
• In a Bash run script
• With the JobControl utility
Chain Folding
• Each record can be submitted to multiple mappers, then a reducer, then a mapper
• Reduces the amount of data movement in the pipeline
Structure
Methodologies
• Just do it
• ChainMapper/ChainReducer
Job Merging
• Merge unrelated jobs together into the same pipeline
Structure
Methodologies
• Tag map output records
• Use MultipleOutputs
I/O PATTERNS
Generating Data, External Source Output, External Source Input, Partition Pruning
Customizing I/O
• Unstructured and semi-structured data often calls for a custom input format to be developed
Generating Data
• Generate lots of data in parallel from nothing
• Random or representative big data sets for you to test your analytics with
Known Uses
• Benchmarking your new cluster
• Making more data to represent a sample you were given
Structure
External Source Output
• You want to write MapReduce output to some non-native location
• Direct loading into a system instead of using HDFS as a staging area
Known Uses
• Write directly out to some non-HDFS solution
  – Key/Value Store
  – RDBMS
  – In-Memory Store
• Many of these are already written
Structure
External Source Input
• You want to load data in parallel from some other source
• Hook other systems into the MapReduce framework
Known Uses
• Skip the staging area and load directly into MapReduce
• Key/Value store
• RDBMS
• In-Memory store
Structure
Partition Pruning
• Abstract away how the data is stored to load what data is needed based on the query
Known Uses
• Discard unneeded files based on the query
• Abstract data storage from query, allowing for powerful middleware to be built
Structure
References
• “MapReduce Design Patterns” – O’Reilly 2012
• www.github.com/adamjshook/mapreducepatterns
• http://en.wikipedia.org/wiki/Bloom_filter