Agenda
We will discuss how Automated Machine Learning (AutoML) and Pentaho, together, can help customers save time in the process of creating a model and deploying this model into production.
• Business Case for Automated Machine Learning (AutoML) and Pentaho;
• High-level overview of Automated Machine Learning (AutoML);
• Demonstrations (Pentaho + AutoML).
Business Case for AutoML and Pentaho
• Finding the correct machine learning algorithm is not an easy task.
• You need to find a balance between the time you would need to spend and the time you can actually spend on the ML problem.
• To create a good model you need to know the problem and its variables and instances very well, prepare the data, do feature engineering, and test different algorithms.
• Some data scientists will also say you need to add a little bit of MAGIC :).
• And, of course, in most cases, a lot of computing power.
What is Automated Machine Learning (AutoML)?
“Machine learning is very successful, but its successes crucially rely on human machine learning experts, who select appropriate ML architectures (deep learning architectures or more traditional ML workflows) and their hyperparameters. As the complexity of these tasks is often beyond non-experts, the rapid growth of machine learning applications has created a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. We call the resulting research area that targets progressive automation of machine learning AutoML.”
https://sites.google.com/site/automl2016/
Why Automated Machine Learning (AutoML)?
• The demand for machine learning experts has outpaced the supply. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts and experts alike.
• AutoML software can be used for automating a large part of the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit.
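To make "training and tuning many models within a user-specified time limit" concrete, here is a minimal sketch in plain Python. It is not how any particular AutoML tool is implemented: the toy data, the threshold "model", and the function names are all assumptions made for illustration, and the score is computed on the training data itself, whereas a real tool would use cross-validation or a hold-out set.

```python
import random
import time

# Toy (feature, label) pairs: hypothetical values for illustration only.
DATA = [(0.10, 0), (0.35, 0), (0.40, 0), (0.50, 0),
        (0.55, 1), (0.70, 1), (0.80, 1), (0.90, 1)]


def accuracy_of_threshold(threshold):
    """Score one candidate 'model': predict 1 when the feature exceeds the threshold."""
    correct = sum((x > threshold) == bool(y) for x, y in DATA)
    return correct / len(DATA)


def candidate_thresholds(seed=0):
    """Endless stream of random candidate configurations."""
    rng = random.Random(seed)
    while True:
        yield rng.random()


def time_limited_search(candidates, evaluate, time_limit_secs):
    """Evaluate candidates until the time budget is spent; keep the best one."""
    deadline = time.monotonic() + time_limit_secs
    best_config, best_score, tried = None, float("-inf"), 0
    for config in candidates:
        if time.monotonic() >= deadline:
            break
        score = evaluate(config)
        tried += 1
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score, tried


best, score, tried = time_limited_search(
    candidate_thresholds(), accuracy_of_threshold, time_limit_secs=0.2)
print(f"tried {tried} candidates, best threshold {best:.3f}, accuracy {score:.2f}")
```

A larger time limit lets the loop try more candidates, which is the intuition behind the time budget that AutoML tools expose.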
What is NOT Automated Machine Learning (AutoML)?
• AutoML is not automated data science;
• AutoML will not replace data scientists;
– All the methods of automated machine learning are developed to support data scientists, not to replace them.
– AutoML is meant to free data scientists from the burden of repetitive and time-consuming tasks (e.g., machine learning pipeline design and hyperparameter optimization) so they can better spend their time on tasks that are much more difficult to automate.
AutoML Tools
• AutoWeka (Open Source)
– http://www.cs.ubc.ca/labs/beta/Projects/autoweka/
• H2O.ai AutoML (Open Source)
– https://www.h2o.ai/
• TPOT (Open Source)
– https://github.com/rhiever/tpot
• AutoSklearn (Open Source)
– https://github.com/automl/auto-sklearn
– http://automl.github.io/auto-sklearn/stable/
• machineJS (Open Source)
– https://github.com/ClimbsRocks/machineJS
CRISP-DM
http://www.pentaho.com/blog/4-steps-machine-learning-pentaho
[Figure: CRISP-DM cycle with the phases Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment arranged around the Data.]
Use Case: AutoML + Pentaho
• Our users have a well-defined ML problem and the initial version of the dataset (train and test).
• Unfortunately, they haven’t created an ML model yet.
• Also, they have no idea how to create it.
• And they want us to help them create it as soon as possible using only open source tools.
The Journey
• If you embark on this journey, you can be stuck on this problem forever…
…or you can find quick ways to do it in a specified time.
• Customers can then spend enough time later to improve their current model.
• The next steps will be:
– Hire a data scientist or a team of data scientists;
– Hire a domain expert in that problem.
Our Goal
• In this specific scenario, our goal will be to help them start the process of creating a dummy model using AutoML.
Create Your First ML Model
1. Define the problem;
2. Analyze and prepare the data;
3. Select algorithms (start simple);
4. Run and evaluate the algorithms;
5. Improve the results with focused experiments;
6. Finalize results with fine tuning.
Sample Dataset
• More data is better, but more data means more complexity.
• More data means more time that you will have to spend on your problem.
• Why not create a sample dataset?!
– Create 1 to 20 datasets to test your problem and create your models;
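One possible way to create such sample datasets is to draw a few random subsets from the full data. The sketch below uses only the Python standard library; the generated CSV content, the 10% fraction, and the function names are assumptions for illustration, not part of the original demo.

```python
import csv
import io
import random


def read_rows(csv_text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def make_samples(rows, n_samples, fraction, seed=42):
    """Draw n_samples random subsets, each without replacement within the subset."""
    rng = random.Random(seed)
    size = max(1, round(len(rows) * fraction))
    return [rng.sample(rows, size) for _ in range(n_samples)]


# Hypothetical full dataset: 100 rows with an id and a binary label.
full_csv = "id,label\n" + "\n".join(f"{i},{i % 2}" for i in range(100)) + "\n"
rows = read_rows(full_csv)

samples = make_samples(rows, n_samples=5, fraction=0.1)
for i, sample in enumerate(samples, start=1):
    print(f"sample {i}: {len(sample)} rows")
```

Fixing the seed keeps the samples reproducible, so a model built on "sample 3" today can be rebuilt on the same rows later.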
Demo AutoML + Pentaho
• This presentation aims to demo how AutoML open source tools and Pentaho, together, can help customers save time in the process of creating a model and deploying this model into production.
The Power of PDI
• PDI (Pentaho Data Integration) will help data scientists and data engineers with data onboarding, data preparation, data blending, model orchestration (model and predict), and saving and visualizing the data.
Data Onboarding, Data Preparation and Data Blending
• Below we can see a Data Preparation Process using PDI (Pentaho Data Integration);
• ML dataset output: ARFF file (Weka file), CSV (Python, R and Apache Spark MLlib) and Hadoop Output to save the text file to the Data Lake;
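In the demo these outputs are produced by PDI steps. To show what the two file formats contain, here is a small Python sketch that serializes the same rows as ARFF (for Weka) and as CSV. The relation name, attribute names, and rows are made-up illustrative values, and `to_arff` / `to_csv` are hypothetical helpers, not PDI or Weka functions.

```python
import csv
import io


def to_arff(relation, attributes, rows):
    """Build ARFF text. attributes is a list of (name, kind) pairs, where kind is
    'numeric' or a list of allowed nominal values."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
        else:
            lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


def to_csv(attributes, rows):
    """Build CSV text with a header row taken from the attribute names."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([name for name, _ in attributes])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical toy dataset for illustration only.
attributes = [("temperature", "numeric"), ("humidity", "numeric"), ("play", ["yes", "no"])]
rows = [(85, 85, "no"), (80, 90, "no"), (83, 86, "yes")]

arff_text = to_arff("weather", attributes, rows)
csv_text = to_csv(attributes, rows)
print(arff_text)
print(csv_text)
```

The ARFF header carries the attribute types that Weka needs, while the CSV keeps only names and values, which is why the same data is written in both formats for different tools.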
Demo Agenda
What we will cover in the demo:
• Data Preparation with PDI;
• Model creation using an AutoML tool;
• Model Deployment with PDI;
Summary
What we covered today:
• Business Case for Automated Machine Learning (AutoML) and Pentaho;
• High-level overview of Automated Machine Learning (AutoML);
• Demonstrations (Pentaho + AutoML).
Next Steps
Want to learn more?
• Talk to me during PentahoWorld 2017 or send me an e-mail: caio.moreno@HitachiVantara.com;
• Meet-the-Experts:
– https://www.pentahoworld.com/meet-the-experts
Top Prediction Algorithms
• According to Dataiku, the top prediction algorithms are the ones explained in the image on the right side.
• This image also summarizes the advantages and disadvantages of each algorithm.
Source: https://blog.dataiku.com/machine-learning-explained-algorithms-are-your-friend
Algorithms
The Rexer Analytics Data Science Survey* gives us a good idea of which algorithms have been used over the years.
* Special thanks to Mark Hall (Pentaho) for sharing this document with me. Document available at: http://www.rexeranalytics.com/data-science-survey.html
Core Algorithms
Source: http://www.rexeranalytics.com/files/Rexer_Data_Science_Survey_Highlights_Apr-2016.pdf
Tools
• The huge number of tools increases the complexity.
Source: http://www.rexeranalytics.com/files/Rexer_Data_Science_Survey_Highlights_Apr-2016.pdf
AutoWeka
• AutoWeka
– provides automatic selection of models and hyperparameters for WEKA.
– http://www.cs.ubc.ca/labs/beta/Projects/autoweka/
• Open datasets for AutoWeka
– http://www.cs.ubc.ca/labs/beta/Projects/autoweka/datasets/
AutoSklearn
• AutoWeka inspired the authors of AutoSklearn;
• AutoSklearn
– auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.
– https://github.com/automl/auto-sklearn
– http://automl.github.io/auto-sklearn/stable/
Types of ML Problems with AutoML
• The types of Machine Learning problems that we can solve using AutoWeka and AutoSklearn are Classification, Regression and Clustering:
– Classification and Regression are already supported in Auto-sklearn & Auto-WEKA.
– For clustering, you can use them as long as you have an objective function to optimize.
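As an example of what an "objective function to optimize" could look like for clustering, the sketch below computes the within-cluster sum of squared distances (lower is better). This particular objective is an assumption chosen for illustration; the slides do not say which objective to use, and the points and labels are made-up values.

```python
def within_cluster_sse(points, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dims = range(len(members[0]))
        centroid = [sum(m[d] for m in members) / len(members) for d in dims]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in dims)
    return total


# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good_labels = [0, 0, 1, 1]   # groups nearby points together
bad_labels = [0, 1, 0, 1]    # mixes the two groups

print(within_cluster_sse(points, good_labels))  # 4.0
print(within_cluster_sse(points, bad_labels))   # 100.0
```

Note that this score always improves as the number of clusters grows, so in practice the number of clusters is usually fixed or the objective is replaced by one that penalizes extra clusters.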
Automated by TPOT
• TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
https://github.com/rhiever/tpot
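A minimal sketch of what a TPOT run can look like, assuming the classic TPOT 0.x API that was current when these slides were written (newer releases may use different parameter names). The digits dataset, the small `generations` and `population_size` values, and the function name `run_tpot_demo` are illustrative choices, not part of the original demo. The imports sit inside the function so the snippet can be loaded even where TPOT is not installed.

```python
def run_tpot_demo(generations=5, population_size=20):
    """Search for a pipeline on the scikit-learn digits data and return test accuracy."""
    # Imported lazily: requires `pip install tpot` (assumes the 0.x API).
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from tpot import TPOTClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.75, random_state=42)

    # Small search budget so the demo finishes quickly; larger values explore more pipelines.
    tpot = TPOTClassifier(generations=generations,
                          population_size=population_size,
                          verbosity=2,
                          random_state=42)
    tpot.fit(X_train, y_train)
    accuracy = tpot.score(X_test, y_test)

    # Write the best pipeline found as a standalone Python script.
    tpot.export("tpot_digits_pipeline.py")
    return accuracy
```

Calling `run_tpot_demo()` starts the search; the exported script contains plain scikit-learn code that can be reviewed and versioned like any other model code.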
Installing AutoWeka
• To install AutoWeka, go to Weka Package Manager > search for Auto-WEKA and click the “Install” button.
Installing TPOT
• Command to install TPOT
– $ pip install tpot
• Learn more:
– http://rhiever.github.io/tpot/installing/
Installing AutoSklearn on Ubuntu
• Use the documentation below to help you:
– http://automl.github.io/auto-sklearn/stable/
• Run these commands in the Ubuntu terminal:
– $ conda install gcc swig
– $ curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
– $ sudo apt-get install build-essential swig
– $ pip install -U auto-sklearn
AutoSklearn Error on Ubuntu
• Error reported on June 14th, 2017. Solution sent on the same day.
• Check the GitHub link below to find the solution: https://github.com/automl/auto-sklearn/issues/308
Installing H2O.ai
• To install H2O.ai AutoML, visit the websites:
– https://blog.h2o.ai/2017/06/automatic-machine-learning/
– https://www.h2o.ai/
Using AutoWeka
• timeLimit: the time, in minutes, that AutoWeka may spend running to find the best option.
– More time generally gives better results
Testing AutoSklearn
• Open Spyder and test the code below:
Source code: http://automl.github.io/auto-sklearn/stable/
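The code shown on the slide is not included in this text. The sketch below follows the style of the basic classification example in the auto-sklearn documentation; the time limits and the function name `run_autosklearn_demo` are assumptions added here. Imports are placed inside the function so the snippet loads even on a machine without auto-sklearn.

```python
def run_autosklearn_demo(time_limit_secs=120):
    """Fit auto-sklearn on the digits data and return the test accuracy."""
    # Imported lazily: requires auto-sklearn and scikit-learn to be installed.
    import autosklearn.classification
    import sklearn.datasets
    import sklearn.metrics
    import sklearn.model_selection

    X, y = sklearn.datasets.load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X, y, random_state=1)

    # Overall search budget, plus a cap on how long any single model may train.
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=time_limit_secs,
        per_run_time_limit=time_limit_secs // 4,
    )
    automl.fit(X_train, y_train)

    y_hat = automl.predict(X_test)
    return sklearn.metrics.accuracy_score(y_test, y_hat)
```

In Spyder, run the file and then call `print("Accuracy score", run_autosklearn_demo())` from the console.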
Testing H2O.ai AutoML
To test H2O AutoML, you need to install version 3.11.0.3888 or later: http://h2o-release.s3.amazonaws.com/h2o/rel-vapnik/1/index.html
https://github.com/caiomsouza/machine-learning-orchestration/blob/master/AutoML/src/r/h2o-automl/H20_AutoML_Example.R
aml <- h2o.automl(x = x, y = y, training_frame = train, leaderboard_frame = test, max_runtime_secs = 30)
# View the AutoML Leaderboard
lb <- aml@leaderboard
lb
Demo AutoML (AutoWeka) + Pentaho
• Using AutoWeka from the Weka User Interface we created a first “dummy” model in 15 minutes.
• AutoWeka will output the best model created in the time specified; this model can then be used to predict new values.
AutoWeka output