BI Over Petabytes: Meet Apache Mahout

Industrial Strength Machine Learning
April 2009

http://lucene.apache.org/mahout/

4/22/09 jeff@windwardsolutions.com



BI and ML

•  Business Intelligence
  – OLAP
  – Analytics
  – Data mining
  – Performance analysis
  – Text mining
  – Predictive analysis

•  Machine Learning
  – Classification
  – Clustering
  – Regression
  – Collaborative filtering
  – Evolutionary algorithms


What is Machine Learning?

•  “Machine learning is the subfield of artificial intelligence that is concerned with the design and development of algorithms that allow computers to improve their performance over time…” (http://en.wikipedia.org/wiki/Machine_learning)

•  Types of ML algorithms
  – Supervised: Using labeled training data, create a function that predicts output for unseen inputs
  – Unsupervised: Using unlabeled data, create a function that can predict output
  – Semi-supervised: Uses labeled and unlabeled data


One Common ML Example

Google.com

Text Clustering


Another Common Example

Amazon.com

Collaborative Filtering


Where ML is Used Today

•  Internet search clustering
•  Knowledge management systems
•  Social network mapping
•  Taxonomy transformations
•  Marketing analytics
•  Recommendation systems
•  Log analysis & event filtering
•  SPAM filtering, fraud detection


Current Situation

•  Vast amounts of data are now available via the Internet

•  Platforms now exist to run computations over large datasets (MapReduce, Hadoop, Dryad)

•  Sophisticated analytics are needed to turn data into information people can use

•  Active Machine Learning research community and research/proprietary implementations of ML algorithms

•  The world needs scalable implementations of ML under open license - ASF


History of Mahout

•  Summer 2007
  – Developers needed scalable ML
  – Mailing list formed

•  Community formed
  – Apache contributors
  – Academia & industry
  – Lots of initial interest

•  Mahout project formed under Apache Lucene
  – January 25, 2008
  – Mahout 0.1 release April, 2009


Who We Are (so far)

Grant Ingersoll
Karl Wettin
Isabel Drost
Ted Dunning
Jeff Eastman
Dawid Weiss
Otis Gospodnetic
Erik Hatcher
Sean Owen
Ozgur Yilmazel


Release 0.1 Code Base

•  Matrix & Vector library
  – Memory resident sparse & dense implementations

•  Classification
  – Naïve Bayes, Complementary Naïve Bayes

•  Clustering
  – Canopy
  – K-Means, fuzzy K-Means
  – Mean Shift
  – Dirichlet Process

•  Collaborative Filtering
  – Taste

•  Evolutionary Algorithms
  – Watchmaker

•  Utilities
  – Distance Measures
  – Parameters

Highly scalable, parallel implementations on the Apache Hadoop platform


Examples: Clustering

•  Canopy
  – Single pass (fast approximation) assigns every point to a single cluster
  – Inputs: DistanceMeasure, T1, T2 canopy values

•  Mean Shift
  – Iterative process converges on modes of density distribution
  – Inputs: DistanceMeasure, T1, T2 values, convergence criteria

•  K-Means
  – Iterative process converges on a single, ‘best’ assignment of points to clusters
  – Inputs: DistanceMeasure, initial clusters, convergence criteria

•  Fuzzy K-Means
  – Like K-Means but uses probability density function to weight all points against all clusters

•  Dirichlet Process
  – Bayesian: incorporates prior domain knowledge as a mixture of models
  – Iterative process converges on multiple, ‘most likely’ answers
  – Inputs:
    •  Number of models, number of iterations to perform
    •  Model (parameters, observations, probability density function)
    •  ModelDistribution (prior, posterior sampling)
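As a rough illustration of the Canopy bullet above, here is a minimal single-pass sketch in plain Java. It is not Mahout's implementation: the class and method names (CanopySketch, buildCanopies, nearestCanopy) are hypothetical, Euclidean distance stands in for the DistanceMeasure input, and the rule that T1 must exceed T2 is an assumption taken from the usual description of canopy clustering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CanopySketch {

    /** A canopy: its center plus every point that fell within T1 of it. */
    static final class Canopy {
        final double[] center;
        final List<double[]> members = new ArrayList<>();
        Canopy(double[] center) { this.center = center; }
    }

    /** Euclidean distance; stands in for the slide's DistanceMeasure input. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Single pass over the points. A point within t1 of a center joins that
     * canopy; a point that is not within t2 of any center starts a new canopy.
     */
    static List<Canopy> buildCanopies(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("t1 must be greater than t2");
        }
        List<Canopy> canopies = new ArrayList<>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (Canopy c : canopies) {
                double d = distance(p, c.center);
                if (d < t1) c.members.add(p);      // loosely covered by this canopy
                if (d < t2) stronglyBound = true;  // too close to seed a new canopy
            }
            if (!stronglyBound) {
                Canopy c = new Canopy(p);
                c.members.add(p);
                canopies.add(c);
            }
        }
        return canopies;
    }

    /** Assigns a point to exactly one cluster: the canopy with the nearest center. */
    static int nearestCanopy(double[] p, List<Canopy> canopies) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < canopies.size(); i++) {
            double d = distance(p, canopies.get(i).center);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[] {0, 0}, new double[] {0.5, 0},
            new double[] {10, 10}, new double[] {10.5, 10});
        List<Canopy> canopies = buildCanopies(points, 3.0, 1.0);
        for (double[] p : points) {
            System.out.println(Arrays.toString(p) + " -> canopy " + nearestCanopy(p, canopies));
        }
    }
}
```

T2 controls how many canopies appear (a smaller T2 yields more centers), while T1 controls how much neighbouring canopies overlap. The final nearest-center step is what gives every point a single cluster.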


Sample Data

Canopy Clusters

Mean Shift Clusters

K-Means Clusters

Fuzzy K-Means Clusters

Dirichlet Process Clusters

Sample Data (Again)

Apache Hadoop

•  Uses clusters of (5-10,000) general purpose Linux boxes
•  HDFS supports redundant file storage and streaming access in the face of predictable hardware failures
•  Map/Reduce API simplifies programming of algorithms that operate over vast datasets
•  HBase offers Google BigTable style of schema-less, temporal database
•  Pig offers higher level language for manipulating very large datasets that reduces the need for M/R programming
•  ZooKeeper is a highly available and reliable coordination system used to synchronize state between applications
•  Hive is a data warehouse infrastructure that provides data summarization, ad hoc querying and analysis of datasets

http://hadoop.apache.org


The Hadoop Iceberg

Storage Replication
Process Scheduling
Failure Handling
Map/Reduce Code
Data Movement
Disk Management
Network Management
Monitoring

(http://hadoop.apache.org)


Reference Dirichlet Implementation

    private void iterate(int iteration, DirichletState<Observation> state) {

      // create new posterior models
      Model<Observation>[] newModels = modelFactory.sampleFromPosterior(state.getModels());

      // iterate over the samples, assigning each to a model
      for (Observation x : sampleData) {
        // compute normalized vector of probabilities that x is described by each model
        Vector pi = normalizedProbabilities(state, x);
        // then pick one cluster by sampling a Multinomial distribution based upon them
        // see: http://en.wikipedia.org/wiki/Multinomial_distribution
        int k = UncommonDistributions.rMultinom(pi);
        // ask the selected model to observe the datum
        newModels[k].observe(x);
      }

      // update the state from the new models
      state.update(newModels);
    }
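The `rMultinom(pi)` call above picks one model index at random, in proportion to the entries of `pi`. The sketch below shows one common way to do that (walking the cumulative sum until it passes a uniform random number). It is an illustration only, not Mahout's `UncommonDistributions` code; the names MultinomialSketch, sampleIndex, and normalize are hypothetical, and plain `double[]` arrays stand in for Mahout's `Vector`.

```java
import java.util.Arrays;
import java.util.Random;

final class MultinomialSketch {

    /** Scales non-negative weights so that they sum to 1. */
    static double[] normalize(double[] weights) {
        double total = 0.0;
        for (double w : weights) total += w;
        if (total <= 0.0) {
            throw new IllegalArgumentException("weights must have a positive sum");
        }
        double[] pi = new double[weights.length];
        for (int i = 0; i < weights.length; i++) pi[i] = weights[i] / total;
        return pi;
    }

    /** Returns index k with probability pi[k]; pi is assumed to sum to 1. */
    static int sampleIndex(double[] pi, Random rng) {
        double u = rng.nextDouble();   // uniform in [0, 1)
        double cumulative = 0.0;
        for (int k = 0; k < pi.length; k++) {
            cumulative += pi[k];
            if (u < cumulative) return k;
        }
        // Reached only when rounding leaves the sum slightly below 1.
        return pi.length - 1;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] pi = normalize(new double[] {1.0, 3.0, 6.0});
        int[] counts = new int[pi.length];
        for (int i = 0; i < 10000; i++) counts[sampleIndex(pi, rng)]++;
        // Counts should come out roughly in the ratio 1 : 3 : 6.
        System.out.println(Arrays.toString(counts));
    }
}
```

Sampling an index, rather than always taking the most probable model, is what lets the iteration explore several plausible assignments instead of locking onto one.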


Dirichlet Mapper on Hadoop

    public void map(WritableComparable<?> key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // read the next sample point
      Vector sample = DenseVector.decodeFormat(value.toString());
      // compute a vector of probabilities that sample is described by each model
      Vector pi = normalizedProbabilities(state, sample);
      // then pick one model by sampling a Multinomial distribution based upon them
      // see: http://en.wikipedia.org/wiki/Multinomial_distribution
      int k = UncommonDistributions.rMultinom(pi);
      // output value with key of selected model
      output.collect(new Text(String.valueOf(k)), value);
    }


Map/Reduce Jobs Use Local Data


Dirichlet Reducer on Hadoop

    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // load the model for this set of values
      int k = Integer.parseInt(key.toString());
      Model<Vector> model = newModels[k];
      while (values.hasNext()) {
        Vector v = DenseVector.decodeFormat(values.next().toString());
        // ask the selected model to observe the datum
        model.observe(v);
      }
      // compute & set new model parameters based upon the observations
      model.computeParameters();
      state.clusters.get(k).setModel(model);
      // output the cluster state for the next iteration
      output.collect(key, new Text(state.clusters.get(k).asFormatString()));
    }
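To see how the mapper and reducer fit together without a Hadoop cluster, the map, group-by-key, reduce flow can be imitated in plain Java. This sketch makes two loud simplifications: each "model" is just a one-dimensional mean, and the map step picks the nearest mean deterministically instead of sampling from a multinomial as the slide's mapper does. That makes it closer to a single K-Means-style step than to the Dirichlet process. All names (MapReduceFlowSketch, map, reduce, iterate) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceFlowSketch {

    /** Map step: emit the key (model index) whose mean is closest to the sample. */
    static int map(double sample, double[] means) {
        int best = 0;
        for (int k = 1; k < means.length; k++) {
            if (Math.abs(sample - means[k]) < Math.abs(sample - means[best])) {
                best = k;
            }
        }
        return best;
    }

    /** Reduce step: all samples that share a key produce that model's new parameter. */
    static double reduce(List<Double> values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.size();
    }

    /** One iteration: map every sample, group by key (the shuffle), reduce each group. */
    static double[] iterate(double[] samples, double[] means) {
        Map<Integer, List<Double>> groups = new TreeMap<>();
        for (double s : samples) {
            groups.computeIfAbsent(map(s, means), key -> new ArrayList<>()).add(s);
        }
        double[] updated = means.clone();   // a model that saw no samples keeps its old mean
        for (Map.Entry<Integer, List<Double>> e : groups.entrySet()) {
            updated[e.getKey()] = reduce(e.getValue());
        }
        return updated;
    }

    public static void main(String[] args) {
        double[] samples = {1.0, 2.0, 3.0, 10.0, 12.0};
        double[] means = {0.0, 8.0};
        System.out.println(Arrays.toString(iterate(samples, means)));
    }
}
```

The property this shares with the Hadoop version is that each `map` call looks at one sample in isolation and each `reduce` call looks at one key's values in isolation, which is what lets both steps run in parallel across machines.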


Conclusion

•  This is just the beginning
•  High demand for scalable machine learning

•  Contributors are needed who have
  – Interest, enthusiasm & programming ability
  – Test driven development skills
  – Comfort with the scary math (or bravery)
  – Interest and/or proficiency with Hadoop
  – Some large data sets you want to analyze

http://lucene.apache.org/mahout/
