Machine Learning over Petabytes
Apache Mahout
Jeff Eastman
May 2009
http://lucene.apache.org/mahout/
5/11/09
jeff@windwardsolutions.com

gotocon.com/dl/jaoo-brisbane-2009/slides/JeffEastman... · 2009-05-12


What is Machine Learning?

•  “Machine learning is the subfield of artificial intelligence that is concerned with the design and development of algorithms that allow computers to improve their performance over time…” (http://en.wikipedia.org/wiki/Machine_learning)
•  Types of ML algorithms
    –  Supervised: Using labeled training data, create a function that predicts output for unseen inputs
    –  Unsupervised: Using unlabeled data, create a function that can predict output
    –  Semi-supervised: Uses labeled and unlabeled data


Some ML Algorithms

–  Classification
    •  Classifying observed data into multiple categories
–  Clustering
    •  Grouping similar observations into clusters that are “similar”
–  Regression
    •  Developing functional models that describe observed data
–  Collaborative filtering
    •  Filtering for patterns in information involving multiple agents
–  Dimension reduction
    •  Reducing multidimensional data sets to fewer dimensions
–  Evolutionary algorithms
    •  Survival of the fittest among evolving model populations


One Common ML Example

Google.com

Text Clustering


Another Common Example

Amazon.com

Collaborative Filtering


Where ML is Used Today

•  Internet search
•  Social network mapping
•  Business intelligence
•  Bioinformatics
•  Sensor data analysis
•  Recommendation systems
•  Log analysis & event filtering
•  SPAM filtering, fraud detection


Current Situation

•  Vast amounts of data are now readily available
•  Platforms now exist to run computations over large datasets (MapReduce, Hadoop, Dryad, Kdb)
•  Sophisticated analytics are needed to turn data into information people can use
•  Active Machine Learning research community and many research/proprietary implementations of ML algorithms
•  The world needs scalable implementations of ML under open license - ASF


History of Mahout

•  Summer 2007
    –  Developers needed scalable ML
    –  Mailing list formed
•  Community formed
    –  Apache contributors
    –  Academia & industry
    –  Lots of initial interest
•  Mahout project formed under Apache Lucene
    –  January 25, 2008
    –  Mahout 0.1 release April, 2009


Release 0.1 Code Base

•  Matrix & Vector library
    –  Memory resident sparse & dense implementations
•  Classification
    –  Naïve Bayes
    –  Complementary Naïve Bayes
•  Clustering
    –  Canopy
    –  K-Means, fuzzy K-Means
    –  Mean Shift
    –  Dirichlet Process
•  Collaborative Filtering
    –  Taste
•  Evolutionary Algorithms
    –  Watchmaker
•  Utilities
    –  Distance Measures
    –  Parameters

Highly scalable, parallel implementations on the Apache Hadoop platform


Examples: Clustering

•  Canopy
    –  Single pass (fast approximation) assigns every point to a single cluster
    –  Inputs: Distance Measure, T1, T2 canopy values
•  Mean Shift
    –  Iterative process converges on modes of density distribution
    –  Inputs: Distance Measure, T1, T2 values, convergence criteria
•  K-Means
    –  Iterative process converges on a single, ‘best’ assignment of points to clusters
    –  Inputs: Distance Measure, initial clusters, convergence criteria
•  Fuzzy K-Means
    –  Like K-Means but uses probability density function to weight all points against all clusters
•  Dirichlet Process
    –  Bayesian: incorporates prior domain knowledge as a mixture of models
    –  Iterative process converges on multiple, ‘most likely’ answers
    –  Inputs:
        •  Number of models, number of iterations to perform
        •  Model (parameters, observations, probability density function)
        •  Model Distribution (prior, posterior sampling)
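As one illustration of the algorithms listed above, here is a minimal in-memory sketch of a canopy pass. It is not Mahout's implementation: the class and method names are hypothetical, Euclidean distance stands in for the pluggable distance measure, and the rule that T1 must be larger than T2 is the usual canopy convention, assumed here. A point joins every canopy whose center lies within T1, and it founds a new canopy unless some existing center lies within T2. A final nearest-center lookup gives each point a single cluster.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative canopy sketch; names are hypothetical, not Mahout's API. */
final class CanopySketch {

  /** A canopy: the point that founded it plus every point seen within t1 of it. */
  static final class Canopy {
    final double[] center;
    final List<double[]> members = new ArrayList<>();

    Canopy(double[] center) {
      this.center = center;
      members.add(center);
    }
  }

  /** Euclidean distance, standing in for a pluggable distance measure. */
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /** Single pass over the points, building canopies as it goes. */
  static List<Canopy> build(List<double[]> points, double t1, double t2) {
    if (t1 <= t2) {
      throw new IllegalArgumentException("expected t1 > t2");
    }
    List<Canopy> canopies = new ArrayList<>();
    for (double[] point : points) {
      boolean covered = false;
      for (Canopy canopy : canopies) {
        double d = distance(point, canopy.center);
        if (d < t1) {
          canopy.members.add(point); // loosely bound: counts as a member
        }
        if (d < t2) {
          covered = true; // tightly bound: must not found a new canopy
        }
      }
      if (!covered) {
        canopies.add(new Canopy(point));
      }
    }
    return canopies;
  }

  /** Index of the canopy whose center is closest to the point. */
  static int nearest(List<Canopy> canopies, double[] point) {
    int best = -1;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int i = 0; i < canopies.size(); i++) {
      double d = distance(point, canopies.get(i).center);
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<double[]> points = new ArrayList<>();
    points.add(new double[] {0.0, 0.0});
    points.add(new double[] {0.5, 0.0});
    points.add(new double[] {10.0, 10.0});
    points.add(new double[] {10.5, 10.0});
    List<Canopy> canopies = build(points, 3.0, 1.0);
    System.out.println("canopies found: " + canopies.size());
  }
}
```

Because the pass never revisits a point, the result depends on input order; that is the price of the fast approximation.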

Sample Data

Canopy Clusters

Mean Shift Clusters

K-Means Clusters
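A minimal sketch of the K-Means loop behind a picture like this one, assuming squared Euclidean distance as the distance measure and "no center moved more than epsilon" as the convergence criterion. The class, method and parameter names are hypothetical, and this is a plain in-memory version rather than Mahout's Hadoop job. Each iteration has an assignment step (find the nearest center) and an averaging step (recompute each center), which is what makes the algorithm a natural fit for a map phase followed by a reduce phase.

```java
/** Illustrative K-Means sketch; names are hypothetical, not Mahout's API. */
final class KMeansSketch {

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Index of the center closest to the point. */
  static int nearest(double[][] centers, double[] point) {
    int best = 0;
    for (int c = 1; c < centers.length; c++) {
      if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
        best = c;
      }
    }
    return best;
  }

  /** Iterates until no center moves more than epsilon, or maxIterations is reached. */
  static double[][] cluster(double[][] points, double[][] initialCenters,
                            double epsilon, int maxIterations) {
    int k = initialCenters.length;
    int dim = points[0].length;
    double[][] centers = new double[k][];
    for (int c = 0; c < k; c++) {
      centers[c] = initialCenters[c].clone(); // leave the caller's array untouched
    }
    for (int iteration = 0; iteration < maxIterations; iteration++) {
      // Assignment step: accumulate each point into its nearest center.
      double[][] sums = new double[k][dim];
      int[] counts = new int[k];
      for (double[] point : points) {
        int c = nearest(centers, point);
        counts[c]++;
        for (int d = 0; d < dim; d++) {
          sums[c][d] += point[d];
        }
      }
      // Averaging step: move each non-empty center to the mean of its points.
      double maxShift = 0.0;
      for (int c = 0; c < k; c++) {
        if (counts[c] == 0) {
          continue; // an empty cluster keeps its previous center
        }
        double[] updated = new double[dim];
        for (int d = 0; d < dim; d++) {
          updated[d] = sums[c][d] / counts[c];
        }
        maxShift = Math.max(maxShift, Math.sqrt(squaredDistance(centers[c], updated)));
        centers[c] = updated;
      }
      if (maxShift <= epsilon) {
        break;
      }
    }
    return centers;
  }

  public static void main(String[] args) {
    double[][] points = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
    double[][] initial = {{0, 0}, {10, 0}};
    double[][] centers = cluster(points, initial, 1e-9, 20);
    System.out.println(java.util.Arrays.deepToString(centers));
  }
}
```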

Fuzzy K-Means Clusters
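Fuzzy K-Means weights every point against every cluster instead of picking one. A common way to compute those weights is the standard fuzzy c-means membership formula, u_c = 1 / Σ_j (d_c / d_j)^(2/(m-1)), where d_c is the distance from the point to center c and m > 1 is a fuzziness exponent. The sketch below uses that formula as an assumption; it is not taken from the slides or from Mahout's code, and all names are hypothetical.

```java
/** Illustrative fuzzy membership sketch; names and formula choice are assumptions. */
final class FuzzyMembershipSketch {

  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /** Returns one weight per center; the weights are non-negative and sum to 1. */
  static double[] memberships(double[] point, double[][] centers, double m) {
    if (m <= 1.0) {
      throw new IllegalArgumentException("fuzziness m must be greater than 1");
    }
    double[] dist = new double[centers.length];
    for (int c = 0; c < centers.length; c++) {
      dist[c] = distance(point, centers[c]);
      if (dist[c] == 0.0) {
        // The point sits exactly on a center: give that center full membership.
        double[] exact = new double[centers.length];
        exact[c] = 1.0;
        return exact;
      }
    }
    double exponent = 2.0 / (m - 1.0);
    double[] u = new double[centers.length];
    for (int c = 0; c < centers.length; c++) {
      double denominator = 0.0;
      for (int j = 0; j < centers.length; j++) {
        denominator += Math.pow(dist[c] / dist[j], exponent);
      }
      u[c] = 1.0 / denominator;
    }
    return u;
  }

  public static void main(String[] args) {
    double[][] centers = {{0.0, 0.0}, {4.0, 0.0}};
    double[] u = memberships(new double[] {1.0, 0.0}, centers, 2.0);
    System.out.println(java.util.Arrays.toString(u));
  }
}
```

Values of m close to 1 push the weights toward a hard K-Means assignment, while larger values spread each point more evenly across clusters.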

Dirichlet Process Clusters

Sample Data (Again)


Um, Large Dataset Clustering?

•  Statistical sampling yields accurate clusters
•  Highly-dimensional measures don’t compute

•  You actually want to cluster all the data
•  You are more interested in the outliers
•  Your data is already in Hadoop


Apache Hadoop

•  Uses clusters of (1-10,000) general purpose Linux boxes
•  HDFS supports redundant file storage and streaming access in the face of predictable hardware failures
•  Map/Reduce API simplifies programming of algorithms that operate over vast datasets
•  Hbase offers Google BigTable style of schema-less, temporal database
•  PIG offers higher level language for manipulating very large datasets that reduces the need for M/R programming
•  Zookeeper is a highly available and reliable coordination system used to synchronize state between applications
•  Hive is a data warehouse infrastructure that provides data summarization, ad hoc querying and analysis of datasets

http://hadoop.apache.org
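To make the Map/Reduce programming model concrete, the sketch below simulates its three phases (map, group by key, reduce) in plain Java on a word-count example. It deliberately does not use Hadoop's classes: everything runs in one process, and the class and method names are hypothetical. On a real cluster the framework performs the grouping step and distributes the map and reduce calls across machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of the map/reduce model; not the Hadoop API. */
final class MapReduceSketch {

  /** Map phase: one input record in, zero or more (key, value) pairs out. */
  static List<Map.Entry<String, Integer>> map(String line) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String word : line.toLowerCase().split("\\s+")) {
      if (!word.isEmpty()) {
        pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
      }
    }
    return pairs;
  }

  /** Reduce phase: all values that share a key are combined into one result. */
  static int reduce(String key, List<Integer> values) {
    int total = 0;
    for (int value : values) {
      total += value;
    }
    return total;
  }

  /** Runs map, groups the emitted pairs by key, then runs reduce per key. */
  static Map<String, Integer> run(List<String> lines) {
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (String line : lines) {
      for (Map.Entry<String, Integer> pair : map(line)) {
        grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
      }
    }
    Map<String, Integer> result = new TreeMap<>();
    for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
      result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(run(Arrays.asList("to be or", "not to be")));
  }
}
```

The programmer writes only `map` and `reduce`; everything inside `run` is what the platform takes over.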


The Hadoop Iceberg

Figure labels: Map/Reduce Code, Storage Replication, Process Scheduling, Failure Handling, Data Movement, Disk Management, Network Management, Monitoring

(http://hadoop.apache.org)


Reference Dirichlet Implementation

    private void iterate(int iteration, DirichletState<Observation> state) {

      // create new posterior models
      Model<Observation>[] newModels = modelFactory.sampleFromPosterior(state.getModels());

      // iterate over the samples, assigning each to a model
      for (Observation x : sampleData) {
        // compute normalized vector of probabilities that x is described by each model
        Vector pi = normalizedProbabilities(state, x);
        // then pick one cluster by sampling a Multinomial distribution based upon them
        // see: http://en.wikipedia.org/wiki/Multinomial_distribution
        int k = UncommonDistributions.rMultinom(pi);
        // ask the selected model to observe the datum
        newModels[k].observe(x);
      }

      // update the state from the new models
      state.update(newModels);
    }
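The two helper steps in the loop above, normalizing per-model weights into a probability vector and then drawing one model index from it, can be sketched as follows. This is one plausible way to implement them, not the code behind `normalizedProbabilities` or `UncommonDistributions.rMultinom`; the class and method names are hypothetical and plain `double[]` arrays stand in for the `Vector` type.

```java
import java.util.Random;

/** Illustrative sampling sketch; not Mahout's UncommonDistributions. */
final class MultinomialSketch {

  /** Scales non-negative weights so that they sum to 1. */
  static double[] normalize(double[] weights) {
    double total = 0.0;
    for (double w : weights) {
      total += w;
    }
    if (total <= 0.0) {
      throw new IllegalArgumentException("weights must have a positive sum");
    }
    double[] pi = new double[weights.length];
    for (int i = 0; i < weights.length; i++) {
      pi[i] = weights[i] / total;
    }
    return pi;
  }

  /** Draws index k with probability pi[k] by walking the cumulative sum. */
  static int sampleIndex(double[] pi, Random random) {
    double u = random.nextDouble(); // uniform in [0, 1)
    double cumulative = 0.0;
    for (int k = 0; k < pi.length; k++) {
      cumulative += pi[k];
      if (u < cumulative) {
        return k;
      }
    }
    // Rounding can leave the cumulative sum slightly below 1.
    return pi.length - 1;
  }

  public static void main(String[] args) {
    double[] pi = normalize(new double[] {1.0, 3.0});
    Random random = new Random(42L);
    int[] counts = new int[pi.length];
    for (int i = 0; i < 10_000; i++) {
      counts[sampleIndex(pi, random)]++;
    }
    System.out.println("draws per model: " + counts[0] + ", " + counts[1]);
  }
}
```

Sampling an index, rather than always taking the most probable model, is what lets the process keep exploring alternative assignments from one iteration to the next.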


Dirichlet Mapper on Hadoop

    public void map(WritableComparable<?> key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // read the next sample point
      Vector sample = DenseVector.decodeFormat(value.toString());
      // compute a vector of probabilities that sample is described by each model
      Vector pi = normalizedProbabilities(state, sample);
      // then pick one model by sampling a Multinomial distribution based upon them
      // see: http://en.wikipedia.org/wiki/Multinomial_distribution
      int k = UncommonDistributions.rMultinom(pi);
      // output value with key of selected model
      output.collect(new Text(String.valueOf(k)), value);
    }


Map/Reduce Jobs Use Local Data


Dirichlet Reducer on Hadoop

    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // load the model for this set of values
      int k = Integer.parseInt(key.toString());
      Model<Vector> model = newModels[k];
      while (values.hasNext()) {
        Vector v = DenseVector.decodeFormat(values.next().toString());
        // ask the selected model to observe the datum
        model.observe(v);
      }
      // compute & set new model parameters based upon the observations
      model.computeParameters();
      state.clusters.get(k).setModel(model);
      // output the cluster state for the next iteration
      output.collect(key, new Text(state.clusters.get(k).asFormatString()));
    }
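The reducer relies on a model that can `observe` points and then `computeParameters`. As a rough illustration of that contract, here is a hypothetical model whose only parameter is the mean of the points it has observed. It is a stand-in written for this example, not one of Mahout's `Model` implementations, and it uses plain `double[]` arrays instead of `Vector`.

```java
/** Hypothetical running-mean model illustrating observe / computeParameters. */
final class MeanModelSketch {
  private final double[] sum;
  private final double[] mean;
  private int count;

  MeanModelSketch(int dimensions) {
    this.sum = new double[dimensions];
    this.mean = new double[dimensions];
  }

  /** Accumulates one datum; cheap enough to call once per input record. */
  void observe(double[] x) {
    for (int i = 0; i < sum.length; i++) {
      sum[i] += x[i];
    }
    count++;
  }

  /** Turns the accumulated observations into parameters and resets the accumulators. */
  void computeParameters() {
    if (count == 0) {
      return; // nothing observed: keep the previous parameters
    }
    for (int i = 0; i < sum.length; i++) {
      mean[i] = sum[i] / count;
      sum[i] = 0.0;
    }
    count = 0;
  }

  /** Returns a copy of the current mean. */
  double[] mean() {
    return mean.clone();
  }

  public static void main(String[] args) {
    MeanModelSketch model = new MeanModelSketch(2);
    model.observe(new double[] {1.0, 2.0});
    model.observe(new double[] {3.0, 6.0});
    model.computeParameters();
    System.out.println(java.util.Arrays.toString(model.mean()));
  }
}
```

Splitting the work this way keeps `observe` a simple accumulation, so a reducer can stream through its values once and compute parameters only at the end.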


Conclusion

•  This is just the beginning
•  High demand for scalable machine learning
•  Contributors are needed who have
    –  Interest, enthusiasm & programming ability
    –  Test driven development skills
    –  Comfort with the scary math (or bravery)
    –  Interest and/or proficiency with Hadoop
    –  Some large data sets you want to analyze

http://lucene.apache.org/mahout/
