
MapReduce Design Patterns

CMSC 491 Hadoop-Based Distributed Computing

Spring 2016, Adam Shook

Agenda

•  Summarization Patterns
•  Filtering Patterns
•  Data Organization Patterns
•  Join Patterns
•  Metapatterns
•  I/O Patterns
•  Bloom Filters

SUMMARIZATION PATTERNS

Numerical Summarizations, Inverted Index, Counting with Counters

Overview

•  Top-down summarization of large data sets
•  Most straightforward patterns
•  Calculate aggregates over the entire data set or over groups
•  Build indexes

Numerical Summarizations

•  Group records together by a field or set of fields and calculate a numerical aggregate per group
•  Build histograms or calculate statistics from numerical values

Known Uses

•  Word Count
•  Record Count
•  Min/Max/Count
•  Average/Median/Standard Deviation

Structure

Performance

•  Performs well, especially when a combiner is used
•  Need to be concerned about data skew from the key

Example

•  Discover the first time a StackOverflow user posted, the last time a user posted, and the number of posts in between
•  Output: User ID, Min Date, Max Date, Count

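import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.io.Writable;

// Custom Writable carrying the earliest date, latest date, and post count for one user.
public class MinMaxCountTuple implements Writable {
    private Date min = new Date();
    private Date max = new Date();
    private long count = 0;
    private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");

    public Date getMin() { return min; }
    public void setMin(Date min) { this.min = min; }
    public Date getMax() { return max; }
    public void setMax(Date max) { this.max = max; }
    public long getCount() { return count; }
    public void setCount(long count) { this.count = count; }

    // Deserialize the fields in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        min = new Date(in.readLong());
        max = new Date(in.readLong());
        count = in.readLong();
    }

    // Serialize both dates as epoch milliseconds, followed by the count.
    public void write(DataOutput out) throws IOException {
        out.writeLong(min.getTime());
        out.writeLong(max.getTime());
        out.writeLong(count);
    }

    public String toString() {
        return frmt.format(min) + "\t" + frmt.format(max) + "\t" + count;
    }
}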

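// Needs java.io.IOException, java.text.ParseException, java.text.SimpleDateFormat, java.util.Date,
// java.util.Map, org.apache.hadoop.io.Text, and org.apache.hadoop.mapreduce.Mapper.
public static class MinMaxCountMapper extends Mapper<Object, Text, Text, MinMaxCountTuple> {
    private Text outUserId = new Text();
    private MinMaxCountTuple outTuple = new MinMaxCountTuple();
    private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // xmlToMap is a helper (not shown) that parses one XML record into attribute/value pairs.
        Map<String, String> parsed = xmlToMap(value.toString());
        String strDate = parsed.get("CreationDate");
        String userId = parsed.get("UserId");
        if (strDate == null || userId == null) {
            return; // skip records missing either field
        }
        Date creationDate;
        try {
            creationDate = frmt.parse(strDate);
        } catch (ParseException e) {
            return; // skip records whose date cannot be parsed
        }
        // A single post is both the earliest and the latest seen so far, with a count of 1.
        outTuple.setMin(creationDate);
        outTuple.setMax(creationDate);
        outTuple.setCount(1);
        outUserId.set(userId);
        context.write(outUserId, outTuple);
    }
}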

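// Needs java.io.IOException, org.apache.hadoop.io.Text, and org.apache.hadoop.mapreduce.Reducer.
public static class MinMaxCountReducer extends Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple> {
    private MinMaxCountTuple result = new MinMaxCountTuple();

    @Override
    public void reduce(Text key, Iterable<MinMaxCountTuple> values, Context context)
            throws IOException, InterruptedException {
        // Reset the reused output object for this key.
        result.setMin(null);
        result.setMax(null);
        result.setCount(0);
        long sum = 0; // long, to match MinMaxCountTuple.getCount()
        for (MinMaxCountTuple val : values) {
            if (result.getMin() == null || val.getMin().compareTo(result.getMin()) < 0) {
                result.setMin(val.getMin());
            }
            if (result.getMax() == null || val.getMax().compareTo(result.getMax()) > 0) {
                result.setMax(val.getMax());
            }
            // Add the partial counts rather than counting values, so the class also works as a combiner.
            sum += val.getCount();
        }
        result.setCount(sum);
        context.write(key, result);
    }
}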

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: MinMaxCountDriver <in> <out>");
        System.exit(2);
    }
    // Job.getInstance replaces the deprecated Job constructor.
    Job job = Job.getInstance(conf, "Comment Date Min Max Count");
    job.setJarByClass(MinMaxCountDriver.class);

    // Min, max, and sum are associative and commutative, so the reducer doubles as the combiner.
    job.setMapperClass(MinMaxCountMapper.class);
    job.setCombinerClass(MinMaxCountReducer.class);
    job.setReducerClass(MinMaxCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(MinMaxCountTuple.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Inverted Index

•  Generate an index from a data set to enable fast searches or data enrichment
•  Building an index takes time, but can greatly reduce the amount of time to search for something
•  Output can be ingested into a key/value store
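The data flow of an inverted index can be sketched without a cluster. The following is a minimal plain-Java simulation, not Hadoop code: the class name `InvertedIndexSketch`, the whitespace tokenization, and the document IDs are all assumptions made for illustration. The "map" step emits one (term, document ID) pair per term, and the "reduce" step groups the pairs by term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java simulation of the inverted index pattern (hypothetical names, no Hadoop dependency). */
final class InvertedIndexSketch {

    /** "Map" step: emit one (term, docId) pair per whitespace-separated term. */
    static List<String[]> map(String docId, String text) {
        List<String[]> pairs = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                pairs.add(new String[] {term, docId});
            }
        }
        return pairs;
    }

    /** "Shuffle" and "reduce" steps: group pairs by term, keeping sorted, distinct doc IDs. */
    static Map<String, TreeSet<String>> reduce(List<String[]> pairs) {
        Map<String, TreeSet<String>> index = new TreeMap<>();
        for (String[] pair : pairs) {
            index.computeIfAbsent(pair[0], k -> new TreeSet<>()).add(pair[1]);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.addAll(map("doc1", "hadoop map reduce"));
        pairs.addAll(map("doc2", "map side join"));
        System.out.println(reduce(pairs));
    }
}
```

In a real job the grouping is done by the framework's shuffle, and the reducer only concatenates the document IDs it receives for each term.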

Structure

Counting with Counters

•  Use the MapReduce framework’s counter utility to calculate a global sum entirely on the map side, producing no output
•  Small number of counters only!!

Known Uses

•  Count number of records
•  Count a small number of unique field instances
•  Sum fields of data together

Structure

FILTERING PATTERNS

Filtering, Bloom Filtering, Top Ten, Distinct

Filtering

•  Discard records that are not of interest
•  Create subsets of your big data sets that you want to further analyze

Known Uses

•  Closer view of the data
•  Tracking a thread of events
•  Distributed grep
•  Data cleansing
•  Simple random sampling
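Two of the uses above, distributed grep and simple random sampling, come down to a per-record keep-or-discard decision. The sketch below shows that decision in plain Java; the class name `FilterSketch`, its method names, and the fixed seed are assumptions for illustration, and in a real job the same test would sit inside a map-only mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Plain-Java sketch of two filtering predicates (hypothetical names, no Hadoop dependency). */
final class FilterSketch {

    /** Distributed grep: keep records in which the regular expression matches somewhere. */
    static List<String> grep(List<String> records, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return records.stream()
                .filter(r -> pattern.matcher(r).find())
                .collect(Collectors.toList());
    }

    /** Simple random sampling: keep each record independently with the given probability. */
    static List<String> sample(List<String> records, double fraction, long seed) {
        Random random = new Random(seed);
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (random.nextDouble() < fraction) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> logs = List.of("error: disk full", "ok", "ERROR timeout");
        System.out.println(grep(logs, "(?i)error"));
        System.out.println(sample(logs, 0.5, 42L));
    }
}
```

Because each record is judged on its own, no reduce phase is needed and the output keeps roughly the requested fraction of the input.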

Structure

Bloom Filtering

•  Keep records that are a member of a large predefined set of values
•  Inherent possibility of false positives

Known Uses

•  Removing most of the non-watched values
•  Pre-filtering a data set prior to an expensive membership test
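To make the false-positive behavior concrete, here is a minimal Bloom filter in plain Java. It is a sketch under stated assumptions: the class `BloomFilterSketch`, the bit-array size, and the double-hashing scheme built on `String.hashCode()` are illustrative choices, not the implementation used in the course material.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Minimal Bloom filter sketch (hypothetical class; hashing scheme chosen for brevity). */
final class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** i-th bit position for a value, derived from two base hashes (double hashing). */
    private int position(String value, int i) {
        int h1 = value.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // second hash, forced odd
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Training step: set every bit position of a member of the predefined set. */
    void add(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(value, i));
        }
    }

    /** False means definitely absent; true means present or a false positive. */
    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }

    /** Filtering step: keep only records that pass the membership test. */
    List<String> filter(List<String> records) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (mightContain(record)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BloomFilterSketch watched = new BloomFilterSketch(1024, 3);
        watched.add("alice");
        watched.add("bob");
        System.out.println(watched.filter(List.of("alice", "carol", "bob")));
    }
}
```

Every member of the trained set always passes the filter; a non-member may occasionally pass too, which is why the pattern suits pre-filtering ahead of an exact membership test.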

Structure

Top Ten

•  Retrieve a relatively small number of top K records based on a ranking scheme
•  Find the outliers or most interesting records

Known Uses

•  Outlier analysis
•  Selecting interesting data
•  Catchy dashboards
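A common way to realize this pattern is for each mapper to keep only its local top K and for a single reducer to merge those candidates. The plain-Java sketch below simulates that under assumptions: `TopKSketch` and its methods are hypothetical names, and records are reduced to integer scores.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java sketch of the top-K pattern (hypothetical names, no Hadoop dependency). */
final class TopKSketch {

    /** Keep the k largest values using a min-heap that never holds more than k items. */
    static List<Integer> topK(Iterable<Integer> values, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int value : values) {
            heap.add(value);
            if (heap.size() > k) {
                heap.poll(); // evict the smallest candidate
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    /** Each "mapper" emits its local top k; one "reducer" merges the candidates. */
    static List<Integer> topKAcrossSplits(List<List<Integer>> splits, int k) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(topK(split, k));
        }
        return topK(candidates, k);
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = List.of(List.of(5, 1, 9), List.of(7, 3), List.of(8, 2, 6));
        System.out.println(topKAcrossSplits(splits, 3));
    }
}
```

The reducer sees at most K records per mapper, which keeps the single-reducer bottleneck small as long as K stays small.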

Structure

Distinct

•  Remove duplicate entries of your data, either full records or a subset of fields
•  That fourth V (veracity) nobody talks about that much

Known Uses

•  Deduplicate data
•  Get distinct values
•  Protect from inner join explosion

Structure

DATA ORGANIZATION PATTERNS

Structured to Hierarchical, Partitioning, Binning, Total Order Sorting, Shuffling

Structured to Hierarchical

•  Transform row-based data to a hierarchical format
•  Reformatting RDBMS data to a more conducive structure

Known Uses

•  Pre-joining data
•  Prepare data for HBase or MongoDB

Structure

Partitioning

•  Partition records into smaller data sets
•  Enables faster future query times due to partition pruning

Known Uses

•  Partition pruning by continuous value
•  Partition pruning by category
•  Sharding
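The core of the partitioning pattern is a function that maps each record to a partition number. The sketch below partitions dates by year in plain Java; `YearPartitionSketch`, the `minYear` parameter, and the one-partition-per-year layout are assumptions for illustration, and in Hadoop this logic would live in a custom `Partitioner`.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of partitioning by year (hypothetical names, no Hadoop dependency). */
final class YearPartitionSketch {

    /** Partition index for a date: its offset from the first year covered. */
    static int getPartition(LocalDate date, int minYear, int numPartitions) {
        int partition = date.getYear() - minYear;
        if (partition < 0 || partition >= numPartitions) {
            throw new IllegalArgumentException("Year out of range: " + date.getYear());
        }
        return partition;
    }

    /** Route every record to its partition, as the shuffle would route it to a reducer. */
    static Map<Integer, List<LocalDate>> partition(List<LocalDate> dates, int minYear, int numPartitions) {
        Map<Integer, List<LocalDate>> partitions = new TreeMap<>();
        for (LocalDate date : dates) {
            partitions.computeIfAbsent(getPartition(date, minYear, numPartitions),
                    k -> new ArrayList<>()).add(date);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<LocalDate> dates = List.of(
                LocalDate.of(2008, 9, 1), LocalDate.of(2010, 5, 1), LocalDate.of(2010, 12, 31));
        System.out.println(partition(dates, 2008, 4));
    }
}
```

A later query that only needs one year can then read a single partition and skip the rest, which is the pruning benefit named above.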

Structure

Binning

•  File records into one or more categories
   – Similar to partitioning, but the implementation is different
•  Can be used to solve similar problems to Partitioning

Known Uses

•  Pruning for follow-on analytics
•  Categorizing data

Structure

Total Order Sorting

•  Sort your data set in parallel
•  Difficult to apply the “divide and conquer” technique of MapReduce

Known Uses

•  Sorting
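One way around the divide-and-conquer difficulty is range partitioning: pick split points, send each key to the range it falls in, sort each range independently, and concatenate the ranges in order. The plain-Java sketch below illustrates that idea; `TotalOrderSortSketch` is a hypothetical class, and the split points are passed in by hand, whereas a real job would typically derive them by sampling the input.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Plain-Java sketch of total order sorting by range partitioning (hypothetical names). */
final class TotalOrderSortSketch {

    /** Range index for a key: the number of (ascending) split points that are <= key. */
    static int partitionFor(int key, int[] splitPoints) {
        int partition = 0;
        while (partition < splitPoints.length && key >= splitPoints[partition]) {
            partition++;
        }
        return partition;
    }

    /** Partition by range, sort each range on its own, then concatenate in range order. */
    static List<Integer> sort(List<Integer> input, int[] splitPoints) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int key : input) {
            partitions.get(partitionFor(key, splitPoints)).add(key);
        }
        List<Integer> output = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            Collections.sort(partition); // each "reducer" sorts only its own range
            output.addAll(partition);
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(sort(List.of(42, 7, 19, 88, 3, 55), new int[] {10, 50}));
    }
}
```

Well-chosen split points give each range about the same number of keys; poorly chosen ones leave one sorter with most of the work.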

Structure

Structure

Shuffling

•  Set of records that you want to completely randomize
•  Instill some anonymity or create some repeatable random sampling

Known Uses

•  Anonymize the order of the data set
•  Repeatable random sampling after shuffling
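Shuffling can be expressed as "assign every record a random key, then let the sort order by that key". The plain-Java sketch below simulates this; `ShuffleSketch` and the explicit `seed` parameter are assumptions added so that the result is repeatable.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Plain-Java sketch of the shuffling pattern (hypothetical names, no Hadoop dependency). */
final class ShuffleSketch {

    /** Randomize record order by sorting on a random key; the same seed gives the same order. */
    static <T> List<T> shuffle(List<T> records, long seed) {
        Random random = new Random(seed);
        // "Map" step: tag each record with a random key.
        List<Map.Entry<Long, T>> keyed = new ArrayList<>();
        for (T record : records) {
            keyed.add(new AbstractMap.SimpleEntry<>(random.nextLong(), record));
        }
        // "Shuffle and sort" step: ordering by the random key scrambles the records.
        keyed.sort(Map.Entry.comparingByKey());
        // "Reduce" step: drop the key and emit the record.
        List<T> shuffled = new ArrayList<>();
        for (Map.Entry<Long, T> entry : keyed) {
            shuffled.add(entry.getValue());
        }
        return shuffled;
    }

    public static void main(String[] args) {
        System.out.println(shuffle(List.of("a", "b", "c", "d", "e"), 42L));
    }
}
```

Taking the first n records of the shuffled output then yields a random sample that can be reproduced by reusing the seed.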

Structure

JOIN PATTERNS

Join Refresher, Reduce-Side Join w/ and w/o Bloom Filter, Replicated Join, Composite Join, Cartesian Product

Join Refresher

•  A join is an operation that combines records from two or more data sets based on a field or set of fields, known as a foreign key
•  Let’s go over the different types of joins before talking about how to do it in MapReduce

A Tale of Two Tables

Inner Join

Left Outer Join

Right Outer Join

Full Outer Join

Antijoin

Cartesian Product

How to implement?

•  Reduce-Side Join w/ and w/o Bloom Filter
•  Replicated Join
•  Composite Join
•  Cartesian Product stands alone

Reduce-Side Join

•  Two or more data sets are joined in the reduce phase
•  Covers all join types we have discussed
   – Exception: Mr. Cartesian
•  All data is sent over the network
   – If applicable, filter using a Bloom filter

Structure

Performance

•  Need to be concerned about data skew
•  2 PB joined on 2 PB means 4 PB of network traffic
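The mechanics of a reduce-side join can be simulated in plain Java: tag each record with its source, group by the join key, and pair up the two sides per key. In the sketch below the class `ReduceSideJoinSketch`, the `A`/`B` tags, and the two-column record layout are assumptions for illustration, and only the inner join is shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of a reduce-side inner join (hypothetical names, no Hadoop dependency). */
final class ReduceSideJoinSketch {

    /** Each record is {joinKey, value}; the result holds one line per matching pair. */
    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        // "Map" and "shuffle" steps: tag every value with its source and group by join key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] record : left) {
            grouped.computeIfAbsent(record[0], k -> new ArrayList<>()).add("A" + record[1]);
        }
        for (String[] record : right) {
            grouped.computeIfAbsent(record[0], k -> new ArrayList<>()).add("B" + record[1]);
        }
        // "Reduce" step: split each group by tag, then emit every left/right combination.
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            List<String> fromLeft = new ArrayList<>();
            List<String> fromRight = new ArrayList<>();
            for (String tagged : group.getValue()) {
                (tagged.charAt(0) == 'A' ? fromLeft : fromRight).add(tagged.substring(1));
            }
            for (String l : fromLeft) {
                for (String r : fromRight) {
                    joined.add(group.getKey() + "\t" + l + "\t" + r);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(new String[] {"1", "alice"}, new String[] {"2", "bob"});
        List<String[]> comments = List.of(new String[] {"1", "hi"}, new String[] {"3", "orphan"});
        System.out.println(innerJoin(users, comments));
    }
}
```

Every record from both inputs passes through the grouping step, which is where the network cost noted above comes from; an outer join would differ only in emitting the unmatched side with an empty partner.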

Replicated Join

•  Inner and Left Outer Joins
•  Removes the need to shuffle any data to the reduce phase
•  Very useful, but requires one large data set and the remaining data sets to be able to fit into the memory of each map task

Structure

Performance

•  Fastest type of join
•  Map-only
•  Limited based on how much data you can safely store inside the JVM
•  Need to be concerned about growing data sets
•  Could optionally use a Bloom filter
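A replicated join amounts to a hash lookup inside each map task. The plain-Java sketch below shows the idea with the small data set held in a `Map`; the class `ReplicatedJoinSketch`, the `leftOuter` flag, and the record layout are assumptions for illustration, and how the small data set reaches each task is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of a map-only replicated join (hypothetical names, no Hadoop dependency). */
final class ReplicatedJoinSketch {

    /**
     * Join each {joinKey, value} record of the large data set against the in-memory small one.
     * With leftOuter set, unmatched large records are kept with an empty right side.
     */
    static List<String> join(Map<String, String> small, List<String[]> large, boolean leftOuter) {
        List<String> joined = new ArrayList<>();
        for (String[] record : large) {
            String match = small.get(record[0]); // in-memory lookup, no shuffle involved
            if (match != null) {
                joined.add(record[0] + "\t" + record[1] + "\t" + match);
            } else if (leftOuter) {
                joined.add(record[0] + "\t" + record[1] + "\t");
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("1", "alice");
        List<String[]> comments = List.of(new String[] {"1", "hi"}, new String[] {"3", "orphan"});
        System.out.println(join(users, comments, false));
        System.out.println(join(users, comments, true));
    }
}
```

Only the large data set can be the "left" side here, which is why the pattern covers inner and left outer joins but not right or full outer joins.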

Composite Join

•  Leverages built-in Hadoop utilities to join the data
•  Requires the data to be already organized and prepared in a specific way
•  Really only useful if you have one large data set that you are using a lot

Data Structure

Structure

Performance

•  Good performance; the join operation is done on the map side
•  Requires the data sets to have the same number of partitions, partitioned in the same way, and each partition must be sorted

Cartesian Product

•  Pair up and compare every single record with every other record in a data set
•  Allows relationships between many different data sets to be uncovered at a fine-grain level

Known Uses

•  Document or image comparisons
•  Math stuff or something

Structure

Performance

•  Massive data explosion!
•  Can use many map slots for a long time
•  Effectively creates a data set of size O(n²)
   – Need to make sure your cluster can fit what you are doing

METAPATTERNS

Job Chaining, Chain Folding, Job Merging

Job Chaining

•  One job is often not enough
•  Need a combination of the patterns discussed to do your workflow
•  Sequential vs Parallel

Methodologies

•  In the Driver
•  In a Bash run script
•  With the JobControl utility

Chain Folding

•  Each record can be submitted to multiple mappers, then a reducer, then a mapper
•  Reduces the amount of data movement in the pipeline

Structure

Structure

Methodologies

•  Just do it
•  ChainMapper/ChainReducer

Job Merging

•  Merge unrelated jobs together into the same pipeline

Structure

Methodologies

•  Tag map output records
•  Use MultipleOutputs

I/O PATTERNS

Generating Data, External Source Output, External Source Input, Partition Pruning

Customizing I/O

•  Unstructured and semi-structured data often calls for a custom input format to be developed

Generating Data

•  Generate lots of data in parallel from nothing
•  Random or representative big data sets for you to test your analytics with

Known Uses

•  Benchmarking your new cluster
•  Making more data to represent a sample you were given

Structure

External Source Output

•  You want to write MapReduce output to some non-native location
•  Direct loading into a system instead of using HDFS as a staging area

Known Uses

•  Write directly out to some non-HDFS solution
   – Key/Value Store
   – RDBMS
   – In-Memory Store
•  Many of these are already written

Structure

External Source Input

•  You want to load data in parallel from some other source
•  Hook other systems into the MapReduce framework

Known Uses

•  Skip the staging area and load directly into MapReduce
•  Key/Value store
•  RDBMS
•  In-Memory store

Structure

Partition Pruning

•  Abstract away how the data is stored to load what data is needed based on the query

Known Uses

•  Discard unneeded files based on the query
•  Abstract data storage from query, allowing for powerful middleware to be built

Structure

References

•  “MapReduce Design Patterns”, O’Reilly, 2012
•  www.github.com/adamjshook/mapreducepatterns
•  http://en.wikipedia.org/wiki/Bloom_filter
