MapReduce Design Patterns CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook

Page 1: MapReduce Design Patterns - Inspiring Innovation

MapReduce Design Patterns

CMSC 491 Hadoop-Based Distributed Computing

Spring 2016 Adam Shook

Page 2: MapReduce Design Patterns - Inspiring Innovation

Agenda

•  Summarization Patterns
•  Filtering Patterns
•  Data Organization Patterns
•  Join Patterns
•  Metapatterns
•  I/O Patterns
•  Bloom Filters

Page 3: MapReduce Design Patterns - Inspiring Innovation

SUMMARIZATION PATTERNS

Numerical Summarizations, Inverted Index, Counting with Counters

Page 4: MapReduce Design Patterns - Inspiring Innovation

Overview

•  Top-down summarization of large data sets
•  Most straightforward patterns
•  Calculate aggregates over entire data set or groups
•  Build indexes

Page 5: MapReduce Design Patterns - Inspiring Innovation

Numerical Summarizations

•  Group records together by a field or set of fields and calculate a numerical aggregate per group
•  Build histograms or calculate statistics from numerical values

Page 6: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Word Count
•  Record Count
•  Min/Max/Count
•  Average/Median/Standard Deviation

Page 7: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 8: MapReduce Design Patterns - Inspiring Innovation

Performance

•  Perform well, especially when a combiner is used
•  Need to be concerned about data skew from the key
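
To make the combiner point concrete, here is a minimal plain-Java sketch that simulates two map tasks counting words. It does not use the Hadoop API; the names `CombinerSketch`, `mapAndCombine`, and `reduce` are made up for illustration. In a real job the framework decides when, and whether, a combiner runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory simulation; not the Hadoop API.
class CombinerSketch {

    // "Map + combine": each map task pre-aggregates its own split,
    // so it ships one (word, count) pair per distinct word.
    static Map<String, Long> mapAndCombine(List<String> split) {
        Map<String, Long> local = new HashMap<>();
        for (String line : split) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    local.merge(word, 1L, Long::sum);
                }
            }
        }
        return local;
    }

    // "Reduce": sum the partial counts coming from every map task.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, n) -> totals.merge(word, n, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> partials = new ArrayList<>();
        partials.add(mapAndCombine(Arrays.asList("a b a", "a c")));
        partials.add(mapAndCombine(Arrays.asList("b b a")));
        System.out.println(reduce(partials));
    }
}
```

Without the local aggregation, the first split would send five single-word records across the network instead of three partial counts. A heavily skewed key still lands on one reducer either way.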

Page 9: MapReduce Design Patterns - Inspiring Innovation

Example

•  Discover the first time a Stack Overflow user posted, the last time a user posted, and the number of posts in between
•  User ID, Min Date, Max Date, Count

Page 10: MapReduce Design Patterns - Inspiring Innovation

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.io.Writable;

// Custom value type carrying the three aggregates for one user.
public class MinMaxCountTuple implements Writable {
    private Date min = new Date();
    private Date max = new Date();
    private long count = 0;
    private final static SimpleDateFormat frmt = new SimpleDateFormat(
            "yyyy-MM-dd'T'HH:mm:ss.SSS");

    public Date getMin() { return min; }
    public void setMin(Date min) { this.min = min; }
    public Date getMax() { return max; }
    public void setMax(Date max) { this.max = max; }
    public long getCount() { return count; }
    public void setCount(long count) { this.count = count; }

    // Deserialize the fields in the same order write() emits them.
    public void readFields(DataInput in) throws IOException {
        min = new Date(in.readLong());
        max = new Date(in.readLong());
        count = in.readLong();
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(min.getTime());
        out.writeLong(max.getTime());
        out.writeLong(count);
    }

    public String toString() {
        return frmt.format(min) + "\t" + frmt.format(max) + "\t" + count;
    }
}

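Page 11: MapReduce Design Patterns - Inspiring Innovation

public static class MinMaxCountMapper extends Mapper<Object, Text, Text, MinMaxCountTuple> {

    private Text outUserId = new Text();
    private MinMaxCountTuple outTuple = new MinMaxCountTuple();

    private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");

    // xmlToMap is a helper (not shown on the slide) that parses one XML row
    // into attribute/value pairs. java.text.ParseException must be imported.
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = xmlToMap(value.toString());
        String strDate = parsed.get("CreationDate");
        String userId = parsed.get("UserId");
        if (strDate == null || userId == null) {
            return; // skip records missing either field
        }
        Date creationDate;
        try {
            creationDate = frmt.parse(strDate);
        } catch (ParseException e) {
            return; // skip records with an unparseable date
        }
        // A single post is both the earliest and the latest seen so far.
        outTuple.setMin(creationDate);
        outTuple.setMax(creationDate);
        outTuple.setCount(1);
        outUserId.set(userId);
        context.write(outUserId, outTuple);
    }
}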

Page 12: MapReduce Design Patterns - Inspiring Innovation

public static class MinMaxCountReducer extends Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple> {

    private MinMaxCountTuple result = new MinMaxCountTuple();

    @Override
    public void reduce(Text key, Iterable<MinMaxCountTuple> values, Context context)
            throws IOException, InterruptedException {
        result.setMin(null);
        result.setMax(null);
        result.setCount(0);
        long sum = 0; // long, to match MinMaxCountTuple's count field
        for (MinMaxCountTuple val : values) {
            // Keep the earliest date seen for this user.
            if (result.getMin() == null || val.getMin().compareTo(result.getMin()) < 0) {
                result.setMin(val.getMin());
            }
            // Keep the latest date seen for this user.
            if (result.getMax() == null || val.getMax().compareTo(result.getMax()) > 0) {
                result.setMax(val.getMax());
            }
            sum += val.getCount();
        }
        result.setCount(sum);
        context.write(key, result);
    }
}

Page 13: MapReduce Design Patterns - Inspiring Innovation
Page 14: MapReduce Design Patterns - Inspiring Innovation

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: MinMaxCountDriver <in> <out>");
        System.exit(2);
    }
    // Job.getInstance replaces the deprecated Job constructor.
    Job job = Job.getInstance(conf, "Comment Date Min Max Count");
    job.setJarByClass(MinMaxCountDriver.class);

    job.setMapperClass(MinMaxCountMapper.class);
    // The reducer doubles as the combiner: min, max, and summed counts
    // can all be computed from partial results.
    job.setCombinerClass(MinMaxCountReducer.class);
    job.setReducerClass(MinMaxCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(MinMaxCountTuple.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Page 15: MapReduce Design Patterns - Inspiring Innovation

Inverted Index

•  Generate an index from a data set to enable fast searches or data enrichment
•  Building an index takes time, but can greatly reduce the amount of time to search for something
•  Output can be ingested into key/value store
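
A small plain-Java sketch of the inverted index idea follows. It is an in-memory illustration only, not Hadoop code, and the class name `InvertedIndexSketch` and the whitespace/punctuation tokenizer are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory sketch; not the Hadoop API.
class InvertedIndexSketch {

    // Input: document ID -> text. Output: term -> sorted set of document IDs.
    static Map<String, TreeSet<String>> build(Map<String, String> docs) {
        Map<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // "Map" step: emit (term, docId) for every term in the document.
            for (String term : doc.getValue().toLowerCase().split("\\W+")) {
                if (term.isEmpty()) {
                    continue;
                }
                // "Reduce" step: collect all document IDs seen for a term.
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "Hadoop runs MapReduce jobs");
        docs.put("d2", "MapReduce design patterns");
        System.out.println(build(docs));
    }
}
```

Each entry of the resulting map is already in key/value form, which is why the output lends itself to loading into a key/value store.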

Page 16: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 17: MapReduce Design Patterns - Inspiring Innovation

Counting with Counters

•  Use MapReduce framework’s counter utility to calculate global sum entirely on the map side, producing no output
•  Small number of counters only!!

Page 18: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Count number of records
•  Count a small number of unique field instances
•  Sum fields of data together

Page 19: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 20: MapReduce Design Patterns - Inspiring Innovation

FILTERING PATTERNS

Filtering, Bloom Filtering, Top Ten, Distinct

Page 21: MapReduce Design Patterns - Inspiring Innovation

Filtering

•  Discard records that are not of interest
•  Create subsets of your big data sets that you want to further analyze

Page 22: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Closer view of the data
•  Tracking a thread of events
•  Distributed grep
•  Data cleansing
•  Simple random sampling

Page 23: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 24: MapReduce Design Patterns - Inspiring Innovation

Bloom Filtering

•  Keep records that are a member of a large predefined set of values
•  Inherent possibility of false positives

Page 25: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Removing most of the non-watched values
•  Pre-filtering a data set prior to expensive membership test
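
The sketch below shows, under simplifying assumptions, how a Bloom filter can answer "probably in the set" or "definitely not in the set". The class `BloomFilterSketch` and its simplistic double-hashing scheme built on `String.hashCode()` are illustrative choices, not what a production job would necessarily use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical minimal Bloom filter, for illustration only.
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th bit position from two cheap hashes (double hashing).
    private int position(String value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String value) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(value, i));
        }
    }

    // false = definitely absent; true = probably present (false positives possible).
    boolean mightContain(String value) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }

    // Map-only filtering: keep records that pass the membership test.
    static List<String> filter(List<String> records, BloomFilterSketch watched) {
        List<String> kept = new ArrayList<>();
        for (String r : records) {
            if (watched.mightContain(r)) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BloomFilterSketch watched = new BloomFilterSketch(1024, 3);
        watched.add("alice");
        watched.add("bob");
        System.out.println(filter(Arrays.asList("alice", "carol", "bob"), watched));
    }
}
```

Every watched value is always kept; an unwatched value may occasionally slip through, which is why the pattern suits pre-filtering ahead of an exact but expensive membership test.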

Page 26: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 27: MapReduce Design Patterns - Inspiring Innovation

Top Ten

•  Retrieve a relatively small number of top K records based on a ranking scheme
•  Find the outliers or most interesting records

Page 28: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Outlier analysis
•  Selecting interesting data
•  Catchy dashboards
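
One common way to realize a top-K job is for every map task to emit only its local top K and for a single reducer to pick the global top K from those candidates. The plain-Java sketch below assumes that approach and ranks plain integers; `TopKSketch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical in-memory sketch; not the Hadoop API.
class TopKSketch {

    // Keep the k largest values seen so far in a min-heap.
    static List<Integer> topK(Iterable<Integer> values, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int v : values) {
            heap.add(v);
            if (heap.size() > k) {
                heap.poll(); // drop the current smallest
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    // Each "mapper" emits its local top k; one "reducer" takes the top k of those.
    static List<Integer> topKAcrossSplits(List<List<Integer>> splits, int k) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(topK(split, k));
        }
        return topK(candidates, k);
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(5, 1, 9, 3),
                Arrays.asList(7, 8, 2));
        System.out.println(topKAcrossSplits(splits, 3));
    }
}
```

The reducer sees at most K records per map task, which is why the pattern only makes sense for a relatively small K.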

Page 29: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 30: MapReduce Design Patterns - Inspiring Innovation

Distinct

•  Remove duplicate entries of your data, either full records or a subset of fields
•  That fourth V nobody talks about that much

Page 31: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Deduplicate data
•  Get distinct values
•  Protect from inner join explosion

Page 32: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 33: MapReduce Design Patterns - Inspiring Innovation

DATA ORGANIZATION PATTERNS

Structured to Hierarchical, Partitioning, Binning, Total Order Sorting, Shuffling

Page 34: MapReduce Design Patterns - Inspiring Innovation

Structured to Hierarchical

•  Transform row-based data to a hierarchical format
•  Reformatting RDBMS data to a more conducive structure

Page 35: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Pre-joining data
•  Prepare data for HBase or MongoDB

Page 36: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 37: MapReduce Design Patterns - Inspiring Innovation

Partitioning

•  Partition records into smaller data sets
•  Enables faster future query times due to partition pruning

Page 38: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Partition pruning by continuous value
•  Partition pruning by category
•  Sharding
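
As an illustration of partitioning by a continuous value, the sketch below routes records to one partition per year. The record layout (an ISO date at the start of each line), the `minYear` parameter, and the class name `PartitionSketch` are assumptions for the example; in Hadoop this logic would typically live in a custom partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch; not the Hadoop API.
class PartitionSketch {

    // Partition function: records are assumed to start with "yyyy-MM-dd".
    static int partitionFor(String record, int minYear) {
        int year = Integer.parseInt(record.substring(0, 4));
        return year - minYear;
    }

    // Group records by partition number, as the shuffle would.
    static Map<Integer, List<String>> partition(List<String> records, int minYear) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String r : records) {
            partitions.computeIfAbsent(partitionFor(r, minYear), p -> new ArrayList<>()).add(r);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("2010-01-05,a", "2011-07-09,b", "2010-12-31,c");
        System.out.println(partition(records, 2010));
    }
}
```

A later query that only needs 2011 data can then read partition 1 and skip the rest, which is the pruning benefit the slide refers to.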

Page 39: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 40: MapReduce Design Patterns - Inspiring Innovation

Binning

•  File records into one or more categories
    –  Similar to partitioning, but the implementation is different
•  Can be used to solve similar problems to Partitioning

Page 41: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Pruning for follow-on analytics
•  Categorizing data

Page 42: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 43: MapReduce Design Patterns - Inspiring Innovation

Total Order Sorting

•  Sort your data set in parallel
•  Difficult to apply the “divide and conquer” technique of MapReduce

Page 44: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Sorting
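
A common way around the divide-and-conquer difficulty is to sample the data for split points first, then route each record to a key range so that the independently sorted ranges concatenate into one sorted result. The sketch below assumes that two-phase approach with integer keys; `TotalOrderSketch` and its method names are hypothetical, and the sample is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory sketch; not the Hadoop API.
class TotalOrderSketch {

    // Analyze phase: pick (partitions - 1) split points from a sorted sample.
    static List<Integer> splitPoints(List<Integer> sample, int partitions) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<Integer> points = new ArrayList<>();
        for (int i = 1; i < partitions; i++) {
            points.add(sorted.get(i * sorted.size() / partitions));
        }
        return points;
    }

    // Order phase: route each value to its range, sort ranges independently, concatenate.
    static List<Integer> sort(List<Integer> data, List<Integer> points) {
        List<List<Integer>> ranges = new ArrayList<>();
        for (int i = 0; i <= points.size(); i++) {
            ranges.add(new ArrayList<>());
        }
        for (int v : data) {
            int p = 0;
            while (p < points.size() && v >= points.get(p)) {
                p++;
            }
            ranges.get(p).add(v);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> range : ranges) {
            Collections.sort(range); // each "reducer" sorts only its own range
            out.addAll(range);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(9, 3, 7, 1, 8, 2, 6);
        System.out.println(sort(data, splitPoints(data, 3)));
    }
}
```

Because every value in one range is smaller than every value in the next, no merge step is needed after the per-range sorts.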

Page 45: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 46: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 47: MapReduce Design Patterns - Inspiring Innovation

Shuffling

•  Set of records that you want to completely randomize
•  Instill some anonymity or create some repeatable random sampling

Page 48: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Anonymize the order of the data set
•  Repeatable random sampling after shuffling
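
One way to picture the shuffling pattern: the mapper emits each record under a random key, the framework sorts by key, and the reducer writes out the values only. The sketch below simulates that in plain Java; the fixed `seed` parameter is an assumption added here to make the single-process demo repeatable, and `ShuffleSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical in-memory sketch; not the Hadoop API.
class ShuffleSketch {

    static List<String> shuffle(List<String> records, long seed) {
        Random random = new Random(seed);
        // "Map": emit (random key, record). The TreeMap stands in for the sort by key.
        TreeMap<Long, List<String>> byKey = new TreeMap<>();
        for (String r : records) {
            byKey.computeIfAbsent(random.nextLong(), k -> new ArrayList<>()).add(r);
        }
        // "Reduce": drop the random keys and emit the records in key order.
        List<String> out = new ArrayList<>();
        for (List<String> group : byKey.values()) {
            out.addAll(group);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shuffle(Arrays.asList("a", "b", "c", "d", "e"), 42L));
    }
}
```

Once the order is randomized, taking the first N records of the output gives a simple random sample.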

Page 49: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 50: MapReduce Design Patterns - Inspiring Innovation

JOIN PATTERNS

Join Refresher, Reduce-Side Join w/ and w/o Bloom Filter, Replicated Join, Composite Join, Cartesian Product

Page 51: MapReduce Design Patterns - Inspiring Innovation

Join Refresher

•  A join is an operation that combines records from two or more data sets based on a field or set of fields, known as a foreign key
•  Let’s go over the different types of joins before talking about how to do it in MapReduce

Page 52: MapReduce Design Patterns - Inspiring Innovation

A Tale of Two Tables

Page 53: MapReduce Design Patterns - Inspiring Innovation

Inner Join

Page 54: MapReduce Design Patterns - Inspiring Innovation

Left Outer Join

Page 55: MapReduce Design Patterns - Inspiring Innovation

Right Outer Join

Page 56: MapReduce Design Patterns - Inspiring Innovation

Full Outer Join

Page 57: MapReduce Design Patterns - Inspiring Innovation

Antijoin

Page 58: MapReduce Design Patterns - Inspiring Innovation

Cartesian Product

Page 59: MapReduce Design Patterns - Inspiring Innovation

How to implement?

•  Reduce-Side Join w/ and w/o Bloom Filter
•  Replicated Join
•  Composite Join
•  Cartesian Product stands alone

Page 60: MapReduce Design Patterns - Inspiring Innovation

Reduce Side Join

•  Two or more data sets are joined in the reduce phase
•  Covers all join types we have discussed
    –  Exception: Mr. Cartesian
•  All data is sent over the network
    –  If applicable, filter using Bloom filter

Page 61: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 62: MapReduce Design Patterns - Inspiring Innovation

Performance

•  Need to be concerned about data skew
•  2 PB joined on 2 PB means 4 PB of network traffic
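
The reduce-side join can be sketched in plain Java as follows: records from both inputs are grouped by the foreign key, and the pairing happens per key. The two maps stand in for the source tag a real mapper would attach; the record layout (`{key, payload}` arrays) and the name `ReduceSideJoinSketch` are assumptions, and only the inner join is shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch; not the Hadoop API.
class ReduceSideJoinSketch {

    // Each record is {joinKey, payload}.
    static List<String> innerJoin(List<String[]> users, List<String[]> comments) {
        // "Map + shuffle": group each source's records by join key.
        Map<String, List<String>> left = new TreeMap<>();
        Map<String, List<String>> right = new TreeMap<>();
        for (String[] u : users) {
            left.computeIfAbsent(u[0], k -> new ArrayList<>()).add(u[1]);
        }
        for (String[] c : comments) {
            right.computeIfAbsent(c[0], k -> new ArrayList<>()).add(c[1]);
        }
        // "Reduce": for each key present on both sides, emit every pairing.
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : left.entrySet()) {
            List<String> matches = right.get(e.getKey());
            if (matches == null) {
                continue; // inner join: drop keys without a match
            }
            for (String l : e.getValue()) {
                for (String r : matches) {
                    joined.add(e.getKey() + "\t" + l + "\t" + r);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(new String[]{"1", "alice"}, new String[]{"2", "bob"});
        List<String[]> comments = Arrays.asList(
                new String[]{"1", "hi"}, new String[]{"1", "bye"}, new String[]{"3", "orphan"});
        System.out.println(innerJoin(users, comments));
    }
}
```

Changing what happens when `matches` is null (emit the left record with a null right side, for instance) is how the same structure would cover the outer join variants.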

Page 63: MapReduce Design Patterns - Inspiring Innovation

Replicated Join

•  Inner and Left Outer Joins
•  Removes need to shuffle any data to the reduce phase
•  Very useful, but requires one large data set and the remaining data sets to be able to fit into memory of each map task

Page 64: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 65: MapReduce Design Patterns - Inspiring Innovation

Performance

•  Fastest type of join
•  Map-only
•  Limited based on how much data you can safely store inside JVM
•  Need to be concerned about growing data sets
•  Could optionally use a Bloom filter
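
The replicated join boils down to a hash lookup: the small data set is held in memory by every map task while the large one streams past. The sketch below illustrates this in plain Java; `ReplicatedJoinSketch`, the `{key, payload}` record layout, and the literal "null" marker for unmatched rows are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch; not the Hadoop API.
class ReplicatedJoinSketch {

    // large: streamed records {joinKey, payload}; small: in-memory lookup table.
    static List<String> join(List<String[]> large, Map<String, String> small, boolean leftOuter) {
        List<String> out = new ArrayList<>();
        for (String[] rec : large) { // map-only: one pass, no shuffle
            String match = small.get(rec[0]);
            if (match != null) {
                out.add(rec[0] + "\t" + rec[1] + "\t" + match);
            } else if (leftOuter) {
                out.add(rec[0] + "\t" + rec[1] + "\tnull");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("1", "alice");
        List<String[]> comments = Arrays.asList(new String[]{"1", "hi"}, new String[]{"2", "yo"});
        System.out.println(join(comments, users, true));
    }
}
```

Only the large side can be the "left" side here, because a map task never sees the whole large data set and so cannot tell which small-table rows went unmatched.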

Page 66: MapReduce Design Patterns - Inspiring Innovation

Composite Join

•  Leverages built-in Hadoop utilities to join the data
•  Requires the data to be already organized and prepared in a specific way
•  Really only useful if you have one large data set that you are using a lot

Page 67: MapReduce Design Patterns - Inspiring Innovation

Data Structure

Page 68: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 69: MapReduce Design Patterns - Inspiring Innovation

Performance

•  Good performance, join operation is done on the map side
•  Requires the data to have the same number of partitions, partitioned in the same way, and each partition must be sorted
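
The sorting and partitioning requirements exist because matching partitions can then be joined with a single merge pass. The sketch below shows such a merge over one pair of partitions in plain Java. It is an illustration of the idea rather than of Hadoop's built-in join utilities, and it additionally assumes unique keys on each side; `CompositeJoinSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory sketch; not the Hadoop API.
class CompositeJoinSketch {

    // Both inputs: records {joinKey, payload}, sorted by key, keys unique per side.
    static List<String> mergeJoin(List<String[]> a, List<String[]> b) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i)[0].compareTo(b.get(j)[0]);
            if (cmp < 0) {
                i++; // key only on the left side so far
            } else if (cmp > 0) {
                j++; // key only on the right side so far
            } else {
                out.add(a.get(i)[0] + "\t" + a.get(i)[1] + "\t" + b.get(j)[1]);
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> a = Arrays.asList(
                new String[]{"1", "x"}, new String[]{"2", "y"}, new String[]{"4", "z"});
        List<String[]> b = Arrays.asList(
                new String[]{"2", "p"}, new String[]{"3", "q"}, new String[]{"4", "r"});
        System.out.println(mergeJoin(a, b));
    }
}
```

Each input is read once and nothing is buffered or shuffled, which is where the good map-side performance comes from.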

Page 70: MapReduce Design Patterns - Inspiring Innovation

Cartesian Product

•  Pair up and compare every single record with every other record in a data set
•  Allows relationships between many different data sets to be uncovered at a fine-grain level

Page 71: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Document or image comparisons
•  Math stuff or something

Page 72: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 73: MapReduce Design Patterns - Inspiring Innovation

Performance

•  Massive data explosion!
•  Can use many map slots for a long time
•  Effectively creates a data set size O(n²)
    –  Need to make sure your cluster can fit what you are doing
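
The quadratic growth is easy to see in a few lines of plain Java. The sketch below pairs every record of one list with every record of another; `CartesianSketch` is a hypothetical name and the comma-joined output format is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory sketch; not the Hadoop API.
class CartesianSketch {

    // Emit every (left, right) pairing: left.size() * right.size() records.
    static List<String> product(List<String> left, List<String> right) {
        List<String> pairs = new ArrayList<>(left.size() * right.size());
        for (String l : left) {
            for (String r : right) {
                pairs.add(l + "," + r);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a", "b", "c", "d");
        // Pairing a data set with itself yields n * n records.
        System.out.println(product(docs, docs).size());
    }
}
```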

Page 74: MapReduce Design Patterns - Inspiring Innovation

METAPATTERNS

Job Chaining, Chain Folding, Job Merging

Page 75: MapReduce Design Patterns - Inspiring Innovation

Job Chaining

•  One job is often not enough
•  Need a combination of patterns discussed to do your workflow
•  Sequential vs Parallel

Page 76: MapReduce Design Patterns - Inspiring Innovation

Methodologies

•  In the Driver
•  In a Bash run script
•  With the JobControl utility

Page 77: MapReduce Design Patterns - Inspiring Innovation

Chain Folding

•  Each record can be submitted to multiple mappers, then a reducer, then a mapper
•  Reduces amount of data movement in the pipeline

Page 78: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 79: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 80: MapReduce Design Patterns - Inspiring Innovation

Methodologies

•  Just do it
•  ChainMapper/ChainReducer

Page 81: MapReduce Design Patterns - Inspiring Innovation

Job Merging

•  Merge unrelated jobs together into the same pipeline

Page 82: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 83: MapReduce Design Patterns - Inspiring Innovation

Methodologies

•  Tag map output records
•  Use MultipleOutputs

Page 84: MapReduce Design Patterns - Inspiring Innovation

I/O PATTERNS

Generating Data, External Source Output, External Source Input, Partition Pruning

Page 85: MapReduce Design Patterns - Inspiring Innovation

Customizing I/O

•  Unstructured and semi-structured data often calls for a custom input format to be developed

Page 86: MapReduce Design Patterns - Inspiring Innovation

Generating Data

•  Generate lots of data in parallel from nothing
•  Random or representative big data sets for you to test your analytics with

Page 87: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Benchmarking your new cluster
•  Making more data to represent a sample you were given

Page 88: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 89: MapReduce Design Patterns - Inspiring Innovation

External Source Output

•  You want to write MapReduce output to some non-native location
•  Direct loading into a system instead of using HDFS as a staging area

Page 90: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Write directly out to some non-HDFS solution
    –  Key/Value Store
    –  RDBMS
    –  In-Memory Store
•  Many of these are already written

Page 91: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 92: MapReduce Design Patterns - Inspiring Innovation

External Source Input

•  You want to load data in parallel from some other source
•  Hook other systems into the MapReduce framework

Page 93: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Skip the staging area and load directly into MapReduce
•  Key/Value store
•  RDBMS
•  In-Memory store

Page 94: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 95: MapReduce Design Patterns - Inspiring Innovation

Partition Pruning

•  Abstract away how the data is stored to load what data is needed based on the query

Page 96: MapReduce Design Patterns - Inspiring Innovation

Known Uses

•  Discard unneeded files based on the query
•  Abstract data storage from query, allowing for powerful middleware to be built

Page 97: MapReduce Design Patterns - Inspiring Innovation

Structure

Page 98: MapReduce Design Patterns - Inspiring Innovation

References

•  “MapReduce Design Patterns” – O’Reilly 2012
•  www.github.com/adamjshook/mapreducepatterns
•  http://en.wikipedia.org/wiki/Bloom_filter