
1.7. Gaussian Processes

Gaussian Processes (GP) are a generic supervised learning method designed to solve regression and probabilistic classification problems.

The advantages of Gaussian processes are:

- The prediction interpolates the observations (at least for regular kernels).
- The prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals and decide based on those if one should refit (online fitting, adaptive fitting) the prediction in some region of interest.
- Versatile: different kernels can be specified. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of Gaussian processes include:

- They are not sparse, i.e., they use the whole samples/features information to perform the prediction.
- They lose efficiency in high-dimensional spaces, namely when the number of features exceeds a few dozen.

1.7.1. Gaussian Process Regression (GPR)

The GaussianProcessRegressor implements Gaussian processes (GP) for regression purposes. For this, the prior of the GP needs to be specified. The prior mean is assumed to be constant and zero (for normalize_y=False) or the training data's mean (for normalize_y=True). The prior's covariance is specified by passing a kernel object. The hyperparameters of the kernel are optimized during fitting of GaussianProcessRegressor by maximizing the log-marginal-likelihood (LML) based on the passed optimizer. As the LML may have multiple local optima, the optimizer can be started repeatedly by specifying n_restarts_optimizer. The first run is always conducted starting from the initial hyperparameter values of the kernel; subsequent runs are conducted from hyperparameter values that have been chosen randomly from the range of allowed values. If the initial hyperparameters should be kept fixed, None can be passed as optimizer.

The noise level in the targets can be specified by passing it via the parameter alpha, either globally as a scalar or per datapoint. Note that a moderate noise level can also be helpful for dealing with numeric issues during fitting as it is effectively implemented as Tikhonov regularization, i.e., by adding it to the diagonal of the kernel matrix. An alternative to specifying the noise level explicitly is to include a WhiteKernel component into the kernel, which can estimate the global noise level from the data (see example below).

The implementation is based on Algorithm 2.1 of [RW2006]. In addition to the API of standard scikit-learn estimators, GaussianProcessRegressor:

- allows prediction without prior fitting (based on the GP prior)
- provides an additional method sample_y(X), which evaluates samples drawn from the GPR (prior or posterior) at given inputs
- exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of selecting hyperparameters, e.g., via Markov chain Monte Carlo.
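A minimal sketch of this workflow; the toy data, kernel choice and hyperparameter values below are illustrative assumptions, not taken from the text:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 5, 20)[:, np.newaxis]
    y = np.sin(X).ravel() + rng.normal(0, 0.1, X.shape[0])

    # alpha adds noise to the diagonal of the kernel matrix (Tikhonov regularization).
    gpr = GaussianProcessRegressor(kernel=1.0 * RBF(length_scale=1.0),
                                   alpha=0.1, n_restarts_optimizer=5)

    # Prediction is possible even before fitting; it is then based on the GP prior.
    y_prior_mean, y_prior_std = gpr.predict(X, return_std=True)

    gpr.fit(X, y)
    y_mean, y_std = gpr.predict(X, return_std=True)  # probabilistic prediction
    samples = gpr.sample_y(X, n_samples=3)           # draws from the posterior
    lml = gpr.log_marginal_likelihood(gpr.kernel_.theta)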

1.7.2. GPR examples

1.7.2.1. GPR with noise-level estimation

This example illustrates that GPR with a sum-kernel including a WhiteKernel can estimate the noise level of data. An illustration of the log-marginal-likelihood (LML) landscape shows that there exist two local maxima of LML.


The first corresponds to a model with a high noise level and a large length scale, which explains all variations in the data by noise.

The second one has a smaller noise level and shorter length scale, which explains most of the variation by the noise-free functional relationship. The second model has a higher likelihood; however, depending on the initial value for the hyperparameters, the gradient-based optimization might also converge to the high-noise solution. It is thus important to repeat the optimization several times for different initializations.
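A condensed sketch of such a sum-kernel setup; the toy data generation below is an illustrative assumption:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 5, 25)[:, np.newaxis]
    y = 0.5 * np.sin(3 * X).ravel() + rng.normal(0, 0.5, X.shape[0])

    # The WhiteKernel component lets GPR estimate the noise level itself,
    # so no explicit alpha is needed here.
    kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)

    # Restarting the optimizer reduces the risk of ending up in the
    # high-noise local maximum of the LML discussed above.
    gpr = GaussianProcessRegressor(kernel=kernel, alpha=0.0,
                                   n_restarts_optimizer=9).fit(X, y)
    print(gpr.kernel_)  # optimized kernel, including the estimated noise level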


1.7.2.2. Comparison of GPR and Kernel Ridge Regression

Both kernel ridge regression (KRR) and GPR learn a target function by employing internally the "kernel trick". KRR learns a linear function in the space induced by the respective kernel which corresponds to a non-linear function in the original space. The linear function in the kernel space is chosen based on the mean-squared error loss with ridge regularization. GPR uses the kernel to define the covariance of a prior distribution over the target functions and uses the observed training data to define a likelihood function. Based on Bayes theorem, a (Gaussian) posterior distribution over target functions is defined, whose mean is used for prediction.

A major difference is that GPR can choose the kernel's hyperparameters based on gradient-ascent on the marginal likelihood function while KRR needs to perform a grid search on a cross-validated loss function (mean-squared error loss). A further difference is that GPR learns a generative, probabilistic model of the target function and can thus provide meaningful confidence intervals and posterior samples along with the predictions while KRR only provides predictions.

The following figure illustrates both methods on an artificial dataset, which consists of a sinusoidal target function and strong noise. The figure compares the learned model of KRR and GPR based on an ExpSineSquared kernel, which is suited for learning periodic functions. The kernel's hyperparameters control the smoothness (length_scale) and periodicity of the kernel (periodicity). Moreover, the noise level of the data is learned explicitly by GPR by an additional WhiteKernel component in the kernel and by the regularization parameter alpha of KRR.
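A sketch of the corresponding setup; the sinusoidal toy data and the grid of candidate hyperparameters are assumptions for illustration:

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.model_selection import GridSearchCV
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import ExpSineSquared, WhiteKernel

    rng = np.random.RandomState(0)
    X = 15 * rng.rand(100, 1)
    y = np.sin(X).ravel() + 3 * (0.5 - rng.rand(X.shape[0]))  # strong noise

    # KRR: hyperparameters (and the noise via alpha) chosen by grid search.
    param_grid = {"alpha": [1e0, 1e-1, 1e-2, 1e-3],
                  "kernel": [ExpSineSquared(length_scale=l, periodicity=p)
                             for l in np.logspace(-2, 2, 5)
                             for p in np.logspace(0, 2, 5)]}
    krr = GridSearchCV(KernelRidge(), param_grid=param_grid).fit(X, y)

    # GPR: hyperparameters (including the noise level via WhiteKernel) chosen
    # by gradient ascent on the log-marginal-likelihood.
    kernel = 1.0 * ExpSineSquared(length_scale=1.0, periodicity=3.0) \
        + WhiteKernel(noise_level=1.0)
    gpr = GaussianProcessRegressor(kernel=kernel).fit(X, y)

    y_krr = krr.predict(X)
    y_gpr, y_std = gpr.predict(X, return_std=True)  # GPR also provides uncertainty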


The figure shows that both methods learn reasonable models of the target function. GPR correctly identifies the periodicity of the function to be roughly 2π (6.28), while KRR chooses the doubled periodicity. Besides that, GPR provides reasonable confidence bounds on the prediction which are not available for KRR. A major difference between the two methods is the time required for fitting and predicting: while fitting KRR is fast in principle, the grid search for hyperparameter optimization scales exponentially with the number of hyperparameters ("curse of dimensionality"). The gradient-based optimization of the parameters in GPR does not suffer from this exponential scaling and is thus considerably faster on this example with a 3-dimensional hyperparameter space. The time for predicting is similar; however, generating the variance of the predictive distribution of GPR takes considerably longer than just predicting the mean.

1.7.2.3. GPR on Mauna Loa CO2 data

This example is based on Section 5.4.3 of [RW2006]. It illustrates an example of complex kernel engineering and hyperparameter optimization using gradient ascent on the log-marginal-likelihood. The data consists of the monthly average atmospheric CO2 concentrations (in parts per million by volume (ppmv)) collected at the Mauna Loa Observatory in Hawaii, between 1958 and 1997. The objective is to model the CO2 concentration as a function of the time t.

The kernel is composed of several terms that are responsible for explaining different properties of the signal:

- A long-term, smooth rising trend is to be explained by an RBF kernel. The RBF kernel with a large length-scale enforces this component to be smooth; it is not enforced that the trend is rising, which leaves this choice to the GP. The specific length-scale and the amplitude are free hyperparameters.
- A seasonal component, which is to be explained by the periodic ExpSineSquared kernel with a fixed periodicity of 1 year. The length-scale of this periodic component, controlling its smoothness, is a free parameter. In order to allow decaying away from exact periodicity, the product with an RBF kernel is taken. The length-scale of this RBF component controls the decay time and is a further free parameter.
- Smaller, medium-term irregularities are to be explained by a RationalQuadratic kernel component, whose length-scale and alpha parameter, which determines the diffuseness of the length-scales, are to be determined. According to [RW2006], these irregularities can better be explained by a RationalQuadratic than an RBF kernel component, probably because it can accommodate several length-scales.
- A "noise" term, consisting of an RBF kernel contribution, which shall explain the correlated noise components such as local weather phenomena, and a WhiteKernel contribution for the white noise. The relative amplitudes and the RBF's length scale are further free parameters.
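A sketch of how such a composite kernel could be assembled in code; the initial hyperparameter values below are placeholders (they are optimized during fitting), not the values used in the original example:

    from sklearn.gaussian_process.kernels import (RBF, ExpSineSquared,
                                                  RationalQuadratic, WhiteKernel)

    # Long-term, smooth rising trend.
    long_term = 50.0**2 * RBF(length_scale=50.0)
    # Seasonal component: periodic kernel with a fixed period of 1 year,
    # multiplied by an RBF to allow decay away from exact periodicity.
    seasonal = 2.0**2 * RBF(length_scale=100.0) \
        * ExpSineSquared(length_scale=1.0, periodicity=1.0,
                         periodicity_bounds="fixed")
    # Smaller, medium-term irregularities.
    medium_term = 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.0)
    # Correlated short-range noise plus white noise.
    noise = 0.1**2 * RBF(length_scale=0.1) + WhiteKernel(noise_level=0.01)

    kernel = long_term + seasonal + medium_term + noise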

Maximizing the log-marginal-likelihood after subtracting the target's mean yields the following kernel with an LML of -83.214:

34.4**2 * RBF(length_scale=41.8)
+ 3.27**2 * RBF(length_scale=180) * ExpSineSquared(length_scale=1.44, periodicity=1)
+ 0.446**2 * RationalQuadratic(alpha=17.7, length_scale=0.957)
+ 0.197**2 * RBF(length_scale=0.138)
+ WhiteKernel(noise_level=0.0336)

Thus, most of the target signal (34.4 ppm) is explained by a long-term rising trend (length-scale 41.8 years). The periodic component has an amplitude of 3.27 ppm, a decay time of 180 years and a length-scale of 1.44. The long decay time indicates that we have a locally very close to periodic seasonal component. The correlated noise has an amplitude of 0.197 ppm with a length scale of 0.138 years and a white-noise contribution of 0.197 ppm. Thus, the overall noise level is very small, indicating that the data can be very well explained by the model. The figure also shows that the model makes very confident predictions until around 2015.


1.7.3. Gaussian Process Classification (GPC)

The GaussianProcessClassifier implements Gaussian processes (GP) for classification purposes, more specifically for probabilistic classification, where test predictions take the form of class probabilities. GaussianProcessClassifier places a GP prior on a latent function f, which is then squashed through a link function to obtain the probabilistic classification. The latent function f is a so-called nuisance function, whose values are not observed and are not relevant by themselves. Its purpose is to allow a convenient formulation of the model, and f is removed (integrated out) during prediction. GaussianProcessClassifier implements the logistic link function, for which the integral cannot be computed analytically but is easily approximated in the binary case.

In contrast to the regression setting, the posterior of the latent function f is not Gaussian even for a GP prior since a Gaussian likelihood is inappropriate for discrete class labels. Rather, a non-Gaussian likelihood corresponding to the logistic link function (logit) is used. GaussianProcessClassifier approximates the non-Gaussian posterior with a Gaussian based on the Laplace approximation. More details can be found in Chapter 3 of [RW2006].

The GP prior mean is assumed to be zero. The prior's covariance is specified by passing a kernel object. The hyperparameters of the kernel are optimized during fitting of GaussianProcessClassifier by maximizing the log-marginal-likelihood (LML) based on the passed optimizer. As the LML may have multiple local optima, the optimizer can be started repeatedly by specifying n_restarts_optimizer. The first run is always conducted starting from the initial hyperparameter values of the kernel; subsequent runs are conducted from hyperparameter values that have been chosen randomly from the range of allowed values. If the initial hyperparameters should be kept fixed, None can be passed as optimizer.

GaussianProcessClassifier supports multi-class classification by performing either one-versus-rest or one-versus-one based training and prediction. In one-versus-rest, one binary Gaussian process classifier is fitted for each class, which is trained to separate this class from the rest. In "one_vs_one", one binary Gaussian process classifier is fitted for each pair of classes, which is trained to separate these two classes. The predictions of these binary predictors are combined into multi-class predictions. See the section on multi-class classification for more details.

In the case of Gaussian process classification, "one_vs_one" might be computationally cheaper since it has to solve many problems involving only a subset of the whole training set rather than fewer problems on the whole dataset. Since Gaussian process classification scales cubically with the size of the dataset, this might be considerably faster. However, note that "one_vs_one" does not support predicting probability estimates but only plain predictions. Moreover, note that GaussianProcessClassifier does not (yet) implement a true multi-class Laplace approximation internally, but as discussed above is based on solving several binary classification tasks internally, which are combined using one-versus-rest or one-versus-one.
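A minimal sketch of multi-class GPC; the iris data and RBF kernel are illustrative choices, not part of the text:

    from sklearn.datasets import load_iris
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF

    X, y = load_iris(return_X_y=True)

    # One binary classifier per class (default strategy); "one_vs_one" is the
    # alternative but does not support predict_proba.
    gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0),
                                    multi_class="one_vs_rest").fit(X, y)
    proba = gpc.predict_proba(X)  # class probabilities
    labels = gpc.predict(X)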

1.7.4. GPC examples

1.7.4.1. Probabilistic predictions with GPC


This example illustrates the predicted probability of GPC for an RBF kernel with different choices of the hyperparameters. The first figure shows the predicted probability of GPC with arbitrarily chosen hyperparameters and with the hyperparameters corresponding to the maximum log-marginal-likelihood (LML).

While the hyperparameters chosen by optimizing LML have a considerably larger LML, they perform slightly worse according to the log-loss on test data. The figure shows that this is because they exhibit a steep change of the class probabilities at the class boundaries (which is good) but have predicted probabilities close to 0.5 far away from the class boundaries (which is bad). This undesirable effect is caused by the Laplace approximation used internally by GPC.

The second figure shows the log-marginal-likelihood for different choices of the kernel's hyperparameters, highlighting the two choices of the hyperparameters used in the first figure by black dots.

1.7.4.2. Illustration of GPC on the XOR dataset

This example illustrates GPC on XOR data. Compared are a stationary, isotropic kernel (RBF) and a non-stationary kernel (DotProduct). On this particular dataset, the DotProduct kernel obtains considerably better results because the class boundaries are linear and coincide with the coordinate axes. In practice, however, stationary kernels such as RBF often obtain better results.


1.7.4.3. Gaussian process classification (GPC) on iris dataset

This example illustrates the predicted probability of GPC for an isotropic and an anisotropic RBF kernel on a two-dimensional version of the iris dataset. This illustrates the applicability of GPC to non-binary classification. The anisotropic RBF kernel obtains slightly higher log-marginal-likelihood by assigning different length-scales to the two feature dimensions.

1.7.5. Kernels for Gaussian Processes

Kernels (also called "covariance functions" in the context of GPs) are a crucial ingredient of GPs which determine the shape of prior and posterior of the GP. They encode the assumptions on the function being learned by defining the "similarity" of two datapoints combined with the assumption that similar datapoints should have similar target values. Two categories of kernels can be distinguished: stationary kernels depend only on the distance of two datapoints and not on their absolute values, and are thus invariant to translations in the input space, while non-stationary kernels depend also on the specific values of the datapoints. Stationary kernels can further be subdivided into isotropic and anisotropic kernels, where isotropic kernels are also invariant to rotations in the input space. For more details, we refer to Chapter 4 of [RW2006].

1.7.5.1. Gaussian Process Kernel API


The main usage of a Kernel is to compute the GP's covariance between datapoints. For this, the method __call__ of the kernel can be called. This method can either be used to compute the "auto-covariance" of all pairs of datapoints in a 2d array X, or the "cross-covariance" of all combinations of datapoints of a 2d array X with datapoints in a 2d array Y. The following identity holds true for all kernels k (except for the WhiteKernel): k(X) == k(X, Y=X).

If only the diagonal of the auto-covariance is being used, the method diag() of a kernel can be called, which is more computationally efficient than the equivalent call to __call__: np.diag(k(X, X)) == k.diag(X).

Kernels are parameterized by a vector θ of hyperparameters. These hyperparameters can for instance control length-scales or periodicity of a kernel (see below). All kernels support computing analytic gradients of the kernel's auto-covariance with respect to θ via setting eval_gradient=True in the __call__ method. This gradient is used by the Gaussian process (both regressor and classifier) in computing the gradient of the log-marginal-likelihood, which in turn is used to determine the value of θ which maximizes the log-marginal-likelihood, via gradient ascent. For each hyperparameter, the initial value and the bounds need to be specified when creating an instance of the kernel. The current value of θ can be retrieved and set via the property theta of the kernel object. Moreover, the bounds of the hyperparameters can be accessed via the property bounds of the kernel. Note that both properties (theta and bounds) return log-transformed values of the internally used values since those are typically more amenable to gradient-based optimization. The specification of each hyperparameter is stored in the form of an instance of Hyperparameter in the respective kernel. Note that a kernel using a hyperparameter with name "x" must have the attributes self.x and self.x_bounds.
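A small sketch of these calls; the RBF kernel and toy inputs are arbitrary:

    import numpy as np
    from sklearn.gaussian_process.kernels import RBF

    X = np.array([[1.0], [3.0], [5.0]])
    Y = np.array([[2.0], [4.0]])

    k = RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e2))

    K_auto = k(X)      # auto-covariance; identical to k(X, Y=X)
    K_cross = k(X, Y)  # cross-covariance between X and Y
    assert np.allclose(np.diag(k(X, X)), k.diag(X))

    # Gradient of the auto-covariance w.r.t. the (log-transformed) hyperparameters.
    K, K_gradient = k(X, eval_gradient=True)

    print(k.theta, k.bounds)  # both log-transformed
    k.theta = np.log([0.5])   # hyperparameters can also be set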

The abstract base class for all kernels is Kernel. Kernel implements an interface similar to that of Estimator, providing the methods get_params(), set_params(), and clone(). This also allows setting kernel values via meta-estimators such as Pipeline or GridSearchCV. Note that due to the nested structure of kernels (obtained by applying kernel operators, see below), the names of kernel parameters might become relatively complicated. In general, for a binary kernel operator, parameters of the left operand are prefixed with k1__ and parameters of the right operand with k2__. An additional convenience method is clone_with_theta(theta), which returns a cloned version of the kernel but with the hyperparameters set to theta. An illustrative example is shown below.

All Gaussian process kernels are interoperable with sklearn.metrics.pairwise and vice versa: instances of subclasses of Kernel can be passed as metric to pairwise_kernels from sklearn.metrics.pairwise. Moreover, kernel functions from pairwise can be used as GP kernels by using the wrapper class PairwiseKernel. The only caveat is that the gradient of the hyperparameters is not analytic but numeric and all those kernels support only isotropic distances. The parameter gamma is considered to be a hyperparameter and may be optimized. The other kernel parameters are set directly at initialization and are kept fixed.
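A sketch of both directions of this interoperability; the data and the choice of the "rbf" metric are illustrative:

    import numpy as np
    from sklearn.metrics.pairwise import pairwise_kernels
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, PairwiseKernel

    X = np.arange(5, dtype=float)[:, np.newaxis]
    y = np.sin(X).ravel()

    # A GP kernel instance used as the metric of pairwise_kernels.
    K = pairwise_kernels(X, metric=RBF(length_scale=1.0))

    # A pairwise kernel function used as a GP kernel via PairwiseKernel;
    # only gamma is treated as a hyperparameter and optimized (numeric gradient).
    gpr = GaussianProcessRegressor(kernel=PairwiseKernel(gamma=1.0, metric="rbf"))
    gpr.fit(X, y)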

1.7.5.2. Basic kernels

The ConstantKernel kernel can be used as part of a Product kernel where it scales the magnitude of the other factor (kernel) or as part of a Sum kernel, where it modifies the mean of the Gaussian process. It depends on a parameter constant_value. It is defined as:

k(x_i, x_j) = constant_value for all x_i, x_j

The main use case of the WhiteKernel kernel is as part of a sum-kernel where it explains the noise component of the signal. Tuning its parameter noise_level corresponds to estimating the noise level. It is defined as:

k(x_i, x_j) = noise_level if x_i == x_j, else 0

The following example illustrates a kernel combined via operators, its nested parameter names, and the log-transformed theta and bounds properties:

>>> from sklearn.gaussian_process.kernels import ConstantKernel, RBF
>>> kernel = ConstantKernel(constant_value=1.0, constant_value_bounds=(0.0, 10.0)) * RBF(length_scale=0.5, length_scale_bounds=(0.0, 10.0)) + RBF(length_scale=2.0, length_scale_bounds=(0.0, 10.0))
>>> for hyperparameter in kernel.hyperparameters: print(hyperparameter)
Hyperparameter(name='k1__k1__constant_value', value_type='numeric', bounds=array([[ 0., 10.]]), n_elements=1, fixed=False)
Hyperparameter(name='k1__k2__length_scale', value_type='numeric', bounds=array([[ 0., 10.]]), n_elements=1, fixed=False)
Hyperparameter(name='k2__length_scale', value_type='numeric', bounds=array([[ 0., 10.]]), n_elements=1, fixed=False)
>>> params = kernel.get_params()
>>> for key in sorted(params): print("%s : %s" % (key, params[key]))
k1 : 1**2 * RBF(length_scale=0.5)
k1__k1 : 1**2
k1__k1__constant_value : 1.0
k1__k1__constant_value_bounds : (0.0, 10.0)
k1__k2 : RBF(length_scale=0.5)
k1__k2__length_scale : 0.5
k1__k2__length_scale_bounds : (0.0, 10.0)
k2 : RBF(length_scale=2)
k2__length_scale : 2.0
k2__length_scale_bounds : (0.0, 10.0)
>>> print(kernel.theta)  # Note: log-transformed
[ 0.         -0.69314718  0.69314718]
>>> print(kernel.bounds)  # Note: log-transformed
[[      -inf  2.30258509]
 [      -inf  2.30258509]
 [      -inf  2.30258509]]



1.7.5.3. Kernel operators

Kernel operators take one or two base kernels and combine them into a new kernel. The Sum kernel takes two kernels k1 and k2 and combines them via k_sum(X, Y) = k1(X, Y) + k2(X, Y). The Product kernel takes two kernels k1 and k2 and combines them via k_product(X, Y) = k1(X, Y) * k2(X, Y). The Exponentiation kernel takes one base kernel and a scalar parameter p and combines them via k_exp(X, Y) = k(X, Y)^p.
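A short sketch of the three operators; the base kernels are arbitrary:

    from sklearn.gaussian_process.kernels import (RBF, ConstantKernel,
                                                  Sum, Product, Exponentiation)

    k1 = ConstantKernel(constant_value=2.0)
    k2 = RBF(length_scale=1.5)

    sum_kernel = k1 + k2       # equivalent to Sum(k1, k2)
    product_kernel = k1 * k2   # equivalent to Product(k1, k2)
    exp_kernel = k2 ** 2       # equivalent to Exponentiation(k2, exponent=2)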

1.7.5.4. Radial-basis function (RBF) kernel

The RBF kernel is a stationary kernel. It is also known as the "squared exponential" kernel. It is parameterized by a length-scale parameter l > 0, which can either be a scalar (isotropic variant of the kernel) or a vector with the same number of dimensions as the inputs x (anisotropic variant of the kernel). The kernel is given by:

k(x_i, x_j) = exp(-d(x_i, x_j)^2 / (2 * l^2))

where d(·, ·) is the Euclidean distance.
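A sketch of the isotropic and anisotropic variants; the length-scale values are arbitrary:

    from sklearn.gaussian_process.kernels import RBF

    isotropic = RBF(length_scale=1.0)           # one length-scale for all dimensions
    anisotropic = RBF(length_scale=[1.0, 5.0])  # one length-scale per input dimension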

This kernel is infinitely differentiable, which implies that GPs with this kernel as covariance function have mean squared derivatives of all orders, and are thus very smooth. The prior and posterior of a GP resulting from an RBF kernel are shown in the following figure:

1.7.5.5. Matérn kernel

The Matern kernel is a stationary kernel and a generalization of the RBF kernel. It has an additional parameter ν which controls the smoothness of the resulting function. It is parameterized by a length-scale parameter l > 0, which can either be a scalar (isotropic variant of the kernel) or a vector with the same number of dimensions as the inputs x (anisotropic variant of the kernel). The kernel is given by:

k(x_i, x_j) = (2^(1-ν) / Γ(ν)) * (sqrt(2ν) * d(x_i, x_j) / l)^ν * K_ν(sqrt(2ν) * d(x_i, x_j) / l)

where d(·, ·) is the Euclidean distance, K_ν(·) is a modified Bessel function and Γ(·) is the gamma function.


As ν → ∞, the Matérn kernel converges to the RBF kernel. When ν = 1/2, the Matérn kernel becomes identical to the absolute exponential kernel, i.e.,

k(x_i, x_j) = exp(-d(x_i, x_j) / l)

In particular, ν = 3/2:

k(x_i, x_j) = (1 + sqrt(3) * d(x_i, x_j) / l) * exp(-sqrt(3) * d(x_i, x_j) / l)

and ν = 5/2:

k(x_i, x_j) = (1 + sqrt(5) * d(x_i, x_j) / l + 5 * d(x_i, x_j)^2 / (3 * l^2)) * exp(-sqrt(5) * d(x_i, x_j) / l)

are popular choices for learning functions that are not infinitely differentiable (as assumed by the RBF kernel) but at least once (ν = 3/2) or twice differentiable (ν = 5/2).
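A sketch constructing these two popular variants:

    from sklearn.gaussian_process.kernels import Matern

    matern_once_diff = Matern(length_scale=1.0, nu=1.5)   # once differentiable
    matern_twice_diff = Matern(length_scale=1.0, nu=2.5)  # twice differentiable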

The flexibility of controlling the smoothness of the learned function via ν allows adapting to the properties of the true underlying functional relation. The prior and posterior of a GP resulting from a Matérn kernel are shown in the following figure:


See [RW2006], pp. 84 for further details regarding the different variants of the Matérn kernel.

1.7.5.6. Rational quadratic kernel

The RationalQuadratic kernel can be seen as a scale mixture (an infinite sum) of RBF kernels with different characteristic length-scales. It is parameterized by a length-scale parameter l > 0 and a scale mixture parameter α > 0. Only the isotropic variant where l is a scalar is supported at the moment. The kernel is given by:

k(x_i, x_j) = (1 + d(x_i, x_j)^2 / (2 * α * l^2))^(-α)

The prior and posterior of a GP resulting from a RationalQuadratic kernel are shown in the following figure:

1.7.5.7. Exp-Sine-Squared kernel

The ExpSineSquared kernel allows modeling periodic functions. It is parameterized by a length-scale parameter l > 0 and a periodicity parameter p > 0. Only the isotropic variant where l is a scalar is supported at the moment. The kernel is given by:

k(x_i, x_j) = exp(-2 * (sin(π * d(x_i, x_j) / p) / l)^2)

The prior and posterior of a GP resulting from an ExpSineSquared kernel are shown in the following figure:


1.7.5.8. Dot-Product kernel

The DotProduct kernel is non-stationary and can be obtained from linear regression by putting N(0, 1) priors on the coefficients of x_d (d = 1, ..., D) and a prior of N(0, σ_0^2) on the bias. The DotProduct kernel is invariant to a rotation of the coordinates about the origin, but not to translations. It is parameterized by a parameter σ_0^2. For σ_0^2 = 0, the kernel is called the homogeneous linear kernel; otherwise it is inhomogeneous. The kernel is given by:

k(x_i, x_j) = σ_0^2 + x_i · x_j

The DotProduct kernel is commonly combined with exponentiation. An example with exponent 2 is shown in the following figure:
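A sketch of this combination; the sigma_0 value is arbitrary:

    from sklearn.gaussian_process.kernels import DotProduct

    # Quadratic kernel obtained by exponentiating the DotProduct kernel.
    kernel = DotProduct(sigma_0=1.0) ** 2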