Machine Learning over Petabytes
Apache Mahout
Jeff Eastman
May 2009
http://lucene.apache.org/mahout/
5/11/09 [email protected]
What is Machine Learning?
• “Machine learning is the subfield of artificial intelligence that is concerned with the design and development of algorithms that allow computers to improve their performance over time…” (http://en.wikipedia.org/wiki/Machine_learning)
• Types of ML algorithms
  – Supervised: Using labeled training data, create a function that predicts output for unseen inputs
  – Unsupervised: Using unlabeled data, create a function that can predict output
  – Semi-supervised: Uses labeled and unlabeled data
Some ML Algorithms
– Classification
  • Classifying observed data into multiple categories
– Clustering
  • Grouping similar observations into clusters that are “similar”
– Regression
  • Developing functional models that describe observed data
– Collaborative filtering
  • Filtering for patterns in information involving multiple agents
– Dimension reduction
  • Reducing multidimensional datasets to fewer dimensions
– Evolutionary algorithms
  • Survival of the fittest among evolving model populations
Where ML is Used Today
• Internet search
• Social network mapping
• Business intelligence
• Bioinformatics
• Sensor data analysis
• Recommendation systems
• Log analysis & event filtering
• SPAM filtering, fraud detection
Current Situation
• Vast amounts of data are now readily available
• Platforms now exist to run computations over large datasets (MapReduce, Hadoop, Dryad, Kdb)
• Sophisticated analytics are needed to turn data into information people can use
• Active Machine Learning research community and many research/proprietary implementations of ML algorithms
• The world needs scalable implementations of ML under open license - ASF
History of Mahout
• Summer 2007
  – Developers needed scalable ML
  – Mailing list formed
• Community formed
  – Apache contributors
  – Academia & industry
  – Lots of initial interest
• Mahout project formed under Apache Lucene
  – January 25, 2008
  – Mahout 0.1 release April, 2009
Release 0.1 Code Base
• Matrix & Vector library
  – Memory resident sparse & dense implementations
• Classification
  – Naïve Bayes
  – Complementary Naïve Bayes
• Clustering
  – Canopy
  – K-Means, fuzzy K-Means
  – Mean Shift
  – Dirichlet Process
• Collaborative Filtering
  – Taste
• Evolutionary Algorithms
  – Watchmaker
• Utilities
  – Distance Measures
  – Parameters
Highly scalable, parallel implementations on the Apache Hadoop platform
Examples: Clustering
• Canopy
  – Single pass (fast approximation) assigns every point to a single cluster
  – Inputs: Distance Measure, T1, T2 canopy values
• Mean Shift
  – Iterative process converges on modes of density distribution
  – Inputs: Distance Measure, T1, T2 values, convergence criteria
• K-Means
  – Iterative process converges on a single, ‘best’ assignment of points to clusters
  – Inputs: Distance Measure, initial clusters, convergence criteria
• Fuzzy K-Means
  – Like K-Means but uses probability density function to weight all points against all clusters
• Dirichlet Process
  – Bayesian: incorporates prior domain knowledge as a mixture of models
  – Iterative process converges on multiple, ‘most likely’ answers
  – Inputs:
    • Number of models, number of iterations to perform
    • Model (parameters, observations, probability density function)
    • Model Distribution (prior, posterior sampling)
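To make the Canopy inputs concrete, here is a minimal in-memory sketch of the classic single-pass canopy procedure. It is an illustration under stated assumptions, not Mahout's implementation: the class `CanopySketch`, its method names, and the use of Euclidean distance in place of a pluggable distance measure are all hypothetical. In this formulation a point lying between T2 and T1 of a center can appear in more than one canopy; a final nearest-center step (not shown) would be needed to give every point exactly one cluster.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative sketch only; names are hypothetical and this is not Mahout code. */
final class CanopySketch {

    /** Euclidean distance, standing in for a pluggable distance measure. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Builds canopies in a single pass over the remaining points.
     * Element 0 of each canopy is its center. Requires t1 > t2.
     */
    static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("expected t1 > t2");
        }
        List<double[]> remaining = new ArrayList<>(points);
        List<List<double[]>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // The next unclaimed point founds a new canopy.
            double[] center = remaining.remove(0);
            List<double[]> canopy = new ArrayList<>();
            canopy.add(center);
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = distance(center, p);
                if (d < t1) {
                    canopy.add(p);   // loosely bound: member of this canopy
                }
                if (d < t2) {
                    it.remove();     // tightly bound: may not found another canopy
                }
            }
            result.add(canopy);
        }
        return result;
    }
}
```

Larger T2 values produce fewer canopies, while T1 controls how much neighboring canopies overlap.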
Sample Data
Canopy Clusters
Mean Shift Clusters
K-Means Clusters
Fuzzy K-Means Clusters
Dirichlet Process Clusters
Sample Data (Again)
Um, Large Dataset Clustering?
• Statistical sampling yields accurate clusters
• Highly-dimensional measures don’t compute
• You actually want to cluster all the data
• You are more interested in the outliers
• Your data is already in Hadoop
Apache Hadoop
• Uses clusters of (1-10,000) general purpose Linux boxes
• HDFS supports redundant file storage and streaming access in the face of predictable hardware failures
• Map/Reduce API simplifies programming of algorithms that operate over vast datasets
• HBase offers Google BigTable style of schema-less, temporal database
• Pig offers higher level language for manipulating very large datasets that reduces the need for M/R programming
• ZooKeeper is a highly available and reliable coordination system used to synchronize state between applications
• Hive is a data warehouse infrastructure that provides data summarization, ad hoc querying and analysis of datasets
http://hadoop.apache.org
The Hadoop Iceberg
[Figure: iceberg diagram labeled Storage Replication, Process Scheduling, Failure Handling, Map/Reduce Code, Data Movement, Disk Management, Network Management, Monitoring (http://hadoop.apache.org)]
Reference Dirichlet Implementation
    private void iterate(int iteration, DirichletState<Observation> state) {
      // create new posterior models
      Model<Observation>[] newModels = modelFactory.sampleFromPosterior(state.getModels());
      // iterate over the samples, assigning each to a model
      for (Observation x : sampleData) {
        // compute normalized vector of probabilities that x is described by each model
        Vector pi = normalizedProbabilities(state, x);
        // then pick one cluster by sampling a Multinomial distribution based upon them
        // see: http://en.wikipedia.org/wiki/Multinomial_distribution
        int k = UncommonDistributions.rMultinom(pi);
        // ask the selected model to observe the datum
        newModels[k].observe(x);
      }
      // update the state from the new models
      state.update(newModels);
    }
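The step that picks one model by sampling a multinomial distribution can be sketched as a cumulative-sum draw over the normalized probability vector. This is a minimal sketch, not the body of `UncommonDistributions.rMultinom`: the class `MultinomialSketch`, the method `sample`, and the use of a plain `double[]` instead of a `Vector` are assumptions made for illustration.

```java
import java.util.Random;

/** Illustrative sketch only; names are hypothetical and this is not Mahout code. */
final class MultinomialSketch {

    /**
     * Returns index k with probability pi[k].
     * Assumes pi is non-negative and sums to 1 (up to rounding).
     */
    static int sample(double[] pi, Random rng) {
        double u = rng.nextDouble();   // uniform in [0, 1)
        double cumulative = 0.0;
        for (int k = 0; k < pi.length; k++) {
            cumulative += pi[k];
            if (u < cumulative) {
                return k;
            }
        }
        // Rounding can leave the cumulative sum slightly below 1.
        return pi.length - 1;
    }
}
```

For example, `MultinomialSketch.sample(new double[] {0.2, 0.8}, new Random())` returns 1 about four times out of five.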
Dirichlet Mapper on Hadoop
    public void map(WritableComparable<?> key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // read the next sample point
      Vector sample = DenseVector.decodeFormat(value.toString());
      // compute a vector of probabilities that sample is described by each model
      Vector pi = normalizedProbabilities(state, sample);
      // then pick one model by sampling a Multinomial distribution based upon them
      // see: http://en.wikipedia.org/wiki/Multinomial_distribution
      int k = UncommonDistributions.rMultinom(pi);
      // output value with key of selected model
      output.collect(new Text(String.valueOf(k)), value);
    }
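The slides do not show how the probability vector is computed. One plausible reading, sketched here as an assumption rather than as the project's code, weights each model's probability density at the sample by that model's mixture proportion and then normalizes so the entries sum to 1. The class `NormalizedProbabilitiesSketch`, the 1-D normal models, and every parameter name below are hypothetical.

```java
/** Illustrative sketch only; names are hypothetical and this is not Mahout code. */
final class NormalizedProbabilitiesSketch {

    /** Density of a 1-D normal distribution with the given mean and standard deviation. */
    static double normalPdf(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    /**
     * Returns pi where pi[k] is proportional to mixture[k] * pdf_k(x),
     * scaled so that the entries sum to 1.
     */
    static double[] normalizedProbabilities(double x, double[] means, double[] sds,
                                            double[] mixture) {
        double[] pi = new double[means.length];
        double total = 0.0;
        for (int k = 0; k < means.length; k++) {
            pi[k] = mixture[k] * normalPdf(x, means[k], sds[k]);
            total += pi[k];
        }
        for (int k = 0; k < pi.length; k++) {
            // If every density underflowed to zero, fall back to a uniform vector.
            pi[k] = total > 0.0 ? pi[k] / total : 1.0 / pi.length;
        }
        return pi;
    }
}
```

A sample close to one model's mean gets most of its probability mass on that model, which is what makes the subsequent multinomial draw favor it.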
Map/Reduce Jobs Use Local Data
Dirichlet Reducer on Hadoop
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // load the model for this set of values
      Integer k = new Integer(key.toString());
      Model<Vector> model = newModels[k];
      while (values.hasNext()) {
        Vector v = DenseVector.decodeFormat(values.next().toString());
        // ask the selected model to observe the datum
        model.observe(v);
      }
      // compute & set new model parameters based upon the observations
      model.computeParameters();
      state.clusters.get(k).setModel(model);
      // output the cluster state for the next iteration
      output.collect(key, new Text(state.clusters.get(k).asFormatString()));
    }
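The division of labor between mapper and reducer can be imitated in memory without Hadoop: group every sample under the model index the map step emitted for it, then recompute a parameter per group. The sketch below is a simplified stand-in under stated assumptions. The class `ShuffleSketch` and its methods are hypothetical, samples are plain `double[]` arrays, and the only "model parameter" recomputed is the mean of the observed samples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; names are hypothetical and this is not Mahout or Hadoop code. */
final class ShuffleSketch {

    /** "Shuffle": collect each sample under the key (model index) chosen for it. */
    static Map<Integer, List<double[]>> groupByKey(int[] keys, double[][] samples) {
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (int i = 0; i < samples.length; i++) {
            groups.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(samples[i]);
        }
        return groups;
    }

    /** "Reduce": recompute one parameter (the mean) from a non-empty list of samples. */
    static double[] mean(List<double[]> observed) {
        double[] sum = new double[observed.get(0).length];
        for (double[] v : observed) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += v[i];
            }
        }
        for (int i = 0; i < sum.length; i++) {
            sum[i] /= observed.size();
        }
        return sum;
    }
}
```

On a real cluster the grouping is done by the framework between the map and reduce phases, so each reduce call sees only the values for a single key.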
Conclusion
• This is just the beginning
• High demand for scalable machine learning
• Contributors are needed who have
  – Interest, enthusiasm & programming ability
  – Test driven development skills
  – Comfort with the scary math (or bravery)
  – Interest and/or proficiency with Hadoop
  – Some large datasets you want to analyze
http://lucene.apache.org/mahout/