Deep Learning and Statistics: Connections
Padhraic Smyth
Chancellor's Professor, Departments of Computer Science and Statistics
University of California, Irvine
[email protected]
Padhraic Smyth: Monash University, July 2019
AI Research 10 to 20 Years Ago
• Logic and Automated Reasoning
• Knowledge Representation
• Machine Learning
• Natural Language Processing
• Speech Recognition
• Computer Vision
• Game Playing
• Search Algorithms
• Robotics
AI Research in 2019
• Logic and Automated Reasoning
• Knowledge Representation
• Natural Language Processing
• Speech Recognition
• Computer Vision
• Game Playing
• Search Algorithms
• Robotics
• Deep Machine Learning
ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al., 2015)
Figure from Kevin Murphy, Google, 2016
Deep neural networks
Deep Networks for Detecting Skin Cancer
From Esteva et al., Nature, 2017
Microsoft/IBM Benchmarks for Speech Recognition
Source: https://www.economist.com/node/21710907/sites/all/modules/custom/ec_essay
2016 IEEE Conference on Acoustics, Speech, and Signal Processing
From Kodish-Wachs et al., AMIA Symposium, 2018
Pedestrian Detection: Algorithms and Humans
Algorithms vs. Human Annotators
From Zhang et al., CVPR 2016
A Perspective on Deep Learning
• Deep learning (DL) research:
– High-visibility successes in vision, speech, text, game-playing
– Funding agencies, companies, students, and the public are in awe of deep learning
– Companies are driving a lot of the interest
– Highly empirical; little guidance from theory
– Few links (to date) to statistics or statistical thinking
• Academic research can play a key role
– Computer science, statistics, mathematics, etc.
– Provide guidance: where does DL work well? And not so well?
• Objective empirical analyses
• Development of principles and theory
– Provide balance to the "hype"
Outline of Today's Talk
• Key ideas in deep learning
• Links to statistical thinking
• Limitations of current deep learning
• Opportunities for new ideas and directions
Predictive Modeling
f = black box: inputs x → prediction of target y; parameters θ
The goal is to learn a model from training data to predict y values.
Machine learning: emphasis on predictions of y. Statistics: emphasis on models and parameters.
Training Data: D = {(x_i, y_i)}, i = 1, ..., N
Model: y_i ≈ f(x_i; θ)
– x_i: d-dimensional input vector
– y_i: target value
– f: functional form of the model
– θ: p-dimensional parameter vector (unknown)
Loss: ℓ(y_i, f(x_i; θ)), comparing the ideal target y_i with the model's prediction f(x_i; θ)
The Three Components of Predictive Modeling
1. Prediction Model f: What functional form should we choose for f?
2. Loss Function: How do we compare f's predictions to the true y?
3. Optimization: Given f and a loss function, how can we learn f's parameters?
Examples of Prediction Models

Linear Regression:
f(x; θ) = θ0 + θ1 x1 + θ2 x2 + ... + θd xd = Σ_{j=0}^{d} θj xj = θᵀx

Logistic Regression:
f(x; θ) = P(y = 1 | x; θ) = 1 / (1 + e^(−z)),  z = θᵀx
[Plot: the logistic function 1/(1 + e^(−z)) as a function of z]
Logistic Regression as a Simple Neural Network
Inputs x1, x2, x3, plus a constant +1 (bias) input. Each "edge" in the network has an associated weight or parameter θj.
f(x; θ) = P(y = 1 | x; θ) = 1 / (1 + e^(−z)),  z = θᵀx
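Viewed this way, logistic regression is a one-unit network and fits in a few lines of code. A minimal sketch (the weights θ below are made-up values for a 3-input network, not from the slides):

```python
import math

def sigmoid(z):
    # logistic activation: maps any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def logistic_predict(theta, x):
    # z = theta^T x, with theta[0] acting on the constant "+1" bias input
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)

# Hypothetical weights: [bias, w1, w2, w3]
theta = [0.5, 1.0, -2.0, 0.25]
p = logistic_predict(theta, [1.0, 0.5, 2.0])  # P(y = 1 | x), here sigmoid(1.0)
```

The bias weight θ0 plays the role of the constant "+1" input in the network diagram.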
A Neural Network with One Hidden Layer (from the 1990s)
Inputs x1, x2, x3, +1 → Hidden Layer → Output f(x; θ)
Here the model learns 3 different logistic functions, each one a "hidden unit", and then combines the outputs of the 3 to make a prediction. More complex than the logistic function, with many more parameters.
Deep Learning: Models with More Hidden Layers
Use this idea to recursively build "deep models" with multiple hidden layers:
Inputs x1, x2, x3, +1 → Hidden Layer 1 → Hidden Layer 2 → Output f(x; θ)
Very flexible, highly non-linear functions. Can have different types of non-linearities, skip layers, etc.
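Stacking such layers gives the forward pass of a deep network. A sketch with logistic units throughout (all weights below are hypothetical; real networks use other non-linearities too):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(weights, biases, inputs):
    # one fully connected layer: each unit applies a logistic GLM to the inputs
    return [sigmoid(b + sum(w * x for w, x in zip(ws, inputs)))
            for ws, b in zip(weights, biases)]

def forward(x, layers):
    # layers: list of (weight_matrix, bias_vector) pairs, applied in order
    h = x
    for W, b in layers:
        h = layer(W, b, h)
    return h

# Hypothetical weights: 3 inputs -> 3 hidden -> 2 hidden -> 1 output
net = [
    ([[0.1, -0.2, 0.3], [0.4, 0.0, -0.1], [-0.3, 0.2, 0.1]], [0.0, 0.1, -0.1]),
    ([[0.5, -0.5, 0.2], [0.1, 0.3, -0.4]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
out = forward([1.0, 2.0, 3.0], net)  # a single output value in (0, 1)
```

Each hidden layer's output becomes the next layer's input, which is exactly the recursive construction on the slide.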
Example of a Network for Digit Classification
Figure from http://parse.ele.tue.nl/
Input pixels, no feature extraction. Each output is an estimate of a class probability P(c = k | x), implemented via a multinomial logistic function. Mathematically the network is just a differentiable function... but a very complicated one.
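The multinomial logistic (softmax) output mentioned above can be sketched in a few lines (the input scores are made-up values):

```python
import math

def softmax(scores):
    # multinomial logistic: turns K real-valued scores into class probabilities
    m = max(scores)                         # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # estimates of P(c = k | x) for 3 classes
```

The probabilities sum to one, and a larger score always yields a larger probability.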
Deep network architecture for the GoogLeNet image recognition network: 27 layers, millions of parameters. Pixel inputs → Output.
A Brief History of Neural Networks...
• The Perceptron Era: 1950s and 60s
– Great optimism with perceptrons (linear models)...
– ...until Minsky, 1969: perceptrons have limited representation power
– Hard problems require hidden layers... but there was no training algorithm
• The Backpropagation Era: late 1980s to mid-90s
– Invention of backpropagation: training of models with hidden layers
– Wild enthusiasm (in the US at least)... conferences, funding, etc.
– Mid-1990s: enthusiasm dies out; training deep NNs is hard
• The Deep Learning Era: 2010-present
– Third wave of neural network enthusiasm
– What happened since the mid-90s?
• Now practical to train deep networks
• Much larger datasets and greater computational power
• Fast optimization techniques + other good engineering tricks
Figure adapted from Efron and Hastie, Computer Age Statistical Inference, 2016: a deep network viewed as feature extraction followed by a logistic model.
Rectified Linear Unit (ReLU) activations have connections with linear splines (e.g., Eckle and Schmidt-Hieber, 2018).
Machine Learning before Deep Models
Figure from Marc'Aurelio Ranzato
Deep Convolutional Network
Figure from Peemen et al., 2012
Key point: end-to-end differentiability allows features (convolutional filters) to be learned, removing the need for hand-crafted feature extraction. (Dense word embeddings play the same role for text.)
Convolutional Filters for Image Data
Figure from Marc'Aurelio Ranzato
Key idea: learn such filters in a discriminative fashion.
Examples of Learned Spatial Filters in Pixel Space
Generalized Linear Models (e.g., logistic)
Recursive GLMs
Define a latent feature via a GLM, then define a GLM on that latent feature.
[Mohamed, 2015; Tran et al., 2018]
Building Neural Nets from Recursive GLMs
In neural-network terminology, the pieces of the recursive GLM become the hidden unit / neuron, the hidden layer, the activation function, and the weight matrix.
Deep Neural Networks
... in effect doing regression with learned features or basis functions: the network's layers act as a feature extractor, computing a new representation of the features, which feeds a statistical model at the output.
Deep Neural Network Representations
• It may be useful to view deep networks as trainable feature extractors with statistical models as "back-ends"
• This view encourages the mixing of deterministic DNN representations ("embeddings") with conventional statistical models
• Examples
– Deep survival models
– Neural point process models
– Recurrent network models for time-series
Recurrent Networks and State-Space Models
RNN structure is similar to that of state-space models in statistics, e.g., Kalman filters, hidden Markov models, and so on.
RNN: no distributional assumptions on state variables → more flexibility.
State-space approach: better characterization of uncertainty.
The Three Components of Predictive Modeling
1. Prediction Model f: What functional form should we choose for f?
2. Loss Function: How do we compare f's predictions to y?
3. Optimization: Given f and a loss function, how can we learn f's parameters?
Loss: ℓ(y_i, f(x_i; θ)), comparing the ideal target y_i with the model's prediction f(x_i; θ)

Example: Squared Error  ℓ = (y_i − f(x_i; θ))²

Example: Log Loss  ℓ = log [1 / P(y_i | x_i; θ)]
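The two example losses translate directly into code. A minimal sketch (the target values and probabilities below are illustrative):

```python
import math

def squared_error(y, f_x):
    # l = (y_i - f(x_i; theta))^2
    return (y - f_x) ** 2

def log_loss(y, p):
    # p is the model's estimate of P(y = 1 | x); l = log [1 / P(y_i | x_i)]
    p_of_y = p if y == 1 else 1.0 - p
    return math.log(1.0 / p_of_y)

se = squared_error(3.0, 2.5)  # 0.25
ll = log_loss(1, 0.8)         # log(1 / 0.8)
```

Log loss punishes confident mistakes heavily: log_loss(1, 0.01) is far larger than log_loss(1, 0.8).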
Empirical Risk
L(θ) = Σ_{i=1}^{N} ℓ(y_i, f(x_i; θ))
a sum over the training data points; f is the functional form of the model and θ its p-dimensional parameter vector (unknown).
The focus is on getting point estimates of θ by minimization of risk. For simple models the loss is convex and optimization can be straightforward; for deep network models the loss is non-convex and difficult to optimize.
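The empirical risk sum can be written directly in code. A minimal sketch (the 1-d linear model and data points are illustrative, not from the slides):

```python
def empirical_risk(theta, data, loss, f):
    # L(theta) = sum over the N training points of loss(y_i, f(x_i; theta))
    return sum(loss(y, f(x, theta)) for x, y in data)

# Illustrative 1-d linear model f(x; theta) = theta * x with squared-error loss
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0)]
f = lambda x, theta: theta * x
loss = lambda y, pred: (y - pred) ** 2
risk_at_2 = empirical_risk(2.0, data, loss, f)  # near-perfect fit, small risk
```

Learning then amounts to searching for the θ that makes this sum as small as possible.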
The Three Components of Predictive Modeling
1. Prediction Model f: What functional form should we choose for f?
2. Loss Function: How do we compare f's predictions to y?
3. Optimization: Given f and a loss function, how can we learn f's parameters?
Gradient Descent
θ^(k+1) = θ^(k) − γ ∇L(θ^(k))
– θ^(k+1): updated p-dimensional parameter vector
– θ^(k): current parameter vector
– γ: scalar learning rate (how far we move)
– ∇L: vector gradient (the direction we move)
Simple gradient methods are the "workhorse" of machine learning. Newton (2nd-order) methods are rarely used: they require inversion of the p × p Hessian matrix, O(p³).
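The update rule above is a one-line loop. A minimal sketch on a toy convex loss (the loss L(θ) = (θ − 3)² and the step size are illustrative):

```python
def grad_descent(grad, theta0, gamma, steps):
    # theta^(k+1) = theta^(k) - gamma * grad L(theta^(k))
    theta = theta0
    for _ in range(steps):
        theta = theta - gamma * grad(theta)
    return theta

# Toy convex loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3);
# the minimizer is theta = 3.
theta_hat = grad_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0, gamma=0.1, steps=100)
```

On a convex loss like this one the iterates contract geometrically toward the minimizer; on the non-convex losses of deep networks the same update only finds a local solution.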
Full Gradient:
∇L(θ) = Σ_{i=1}^{N} ∇L_i(θ)

Stochastic Gradient:
∇L(θ) ≈ (N/m) Σ_{j=1}^{m} ∇L_j(θ)
an approximation of the full gradient from a random sample of m data points (a "mini-batch")

Intuition: for m << N, we can make many fast, noisy updates. Can lead to sublinear convergence for large N.
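The mini-batch approximation is easy to sketch. A toy example under illustrative choices (scalar data, squared-error loss, constant step size; real SGD implementations vary):

```python
import random

def sgd(grad_i, data, theta0, gamma, batch_size, steps, seed=0):
    # Each step approximates the full gradient with a random mini-batch:
    # grad L(theta) ~= (N / m) * sum over the mini-batch of grad L_j(theta)
    rng = random.Random(seed)
    n, theta = len(data), theta0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        g = (n / batch_size) * sum(grad_i(theta, y) for y in batch)
        theta = theta - gamma * g
    return theta

# Toy example: per-point loss L_i = (y_i - theta)^2 for scalar data,
# whose empirical-risk minimizer is the sample mean (here 3.0).
data = [1.0, 2.0, 3.0, 4.0, 5.0]
theta_hat = sgd(lambda t, y: 2.0 * (t - y), data, theta0=0.0,
                gamma=0.005, batch_size=2, steps=2000)
```

With a constant step size the iterates hover in a noisy neighborhood of the minimizer, which is the "many fast noisy updates" intuition from the slide; decaying the step size, as in Robbins-Monro, removes the residual noise.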
Stochastic Gradient in 2-d Parameter Space
[Figure: gradient steps vs. stochastic gradient steps in a 2-d parameter space.]
Empirically works very well on large datasets, with some theoretical support: an application of the Robbins-Monro (1951) stochastic approximation method. Useful for statistical model fitting in general (not just for deep learning), e.g., Wang et al., 2015; Chen et al., 2016.
CONNECTIONS TO STATISTICS
The Three Components of Predictive Modeling
Model + Loss Function + Optimization Method
– Model: the functional form of f
– Loss function: how we measure the quality of the model's predictions
– Optimization method: the algorithm that finds the parameters that minimize empirical risk
Deep learning was presented as an optimization problem. Where is statistics lurking?
Empirical Risk Minimization
Empirical loss with regularization:
L(θ) = Σ_{i=1}^{N} ℓ(y_i, f(x_i; θ)) + λ R(θ)
Find the parameters that minimize empirical risk on the training data. This directly corresponds to maximizing likelihood:
– Squared error loss → Gaussian likelihood for regression
– Log loss → binomial/multinomial likelihood for classification
Implication: an implicit conditional independence assumption over the data.
Empirical Risk Minimization with Regularization
L(θ) = Σ_{i=1}^{N} ℓ(y_i, f(x_i; θ)) + λ R(θ)
– λ: strength of regularization
– R(θ): regularization on the parameters
The regularization term can be interpreted as (minus) a log prior:
– R_L2(θ) = Σ_j θ_j²  (Gaussian prior)
– R_L1(θ) = Σ_j |θ_j|  (Laplacian prior)
In addition, DL techniques such as dropout can be interpreted as a form of prior, generalizing to a broad "dropout family" (see Bhadra et al., arXiv 2019, and Nalisnick et al., ICML 2019).
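The two penalties and the regularized risk are a direct translation of the formulas above. A minimal sketch (the 1-parameter model, data, and λ are illustrative):

```python
def l2_penalty(theta):
    # R_L2(theta) = sum_j theta_j^2  (minus log of a Gaussian prior, up to constants)
    return sum(t * t for t in theta)

def l1_penalty(theta):
    # R_L1(theta) = sum_j |theta_j|  (minus log of a Laplacian prior, up to constants)
    return sum(abs(t) for t in theta)

def regularized_risk(theta, data, loss, f, lam, penalty):
    # L(theta) = sum_i loss(y_i, f(x_i; theta)) + lam * R(theta)
    return sum(loss(y, f(x, theta)) for x, y in data) + lam * penalty(theta)

# Illustrative ridge-style risk for a 1-parameter linear model f(x; theta) = theta * x
data = [(1.0, 2.0), (2.0, 4.0)]
risk = regularized_risk([2.0], data, lambda y, p: (y - p) ** 2,
                        lambda x, th: th[0] * x, lam=0.1, penalty=l2_penalty)
```

Here the data fit is perfect, so the entire risk comes from the penalty term, which is exactly the shrinkage pressure the (log) prior exerts on θ.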
Looks like a deterministic problem?
The expected loss is minimized by setting f(x; θ) to E[y|x], at every x.
Conclusion: the optimization problem is really a statistical estimation problem.
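A quick numerical check of the claim that E[y|x] minimizes expected squared error: among constant predictions, the sample mean beats every other constant (the data values are made up):

```python
def avg_squared_error(c, ys):
    # average squared-error loss of the constant prediction f(x) = c
    return sum((y - c) ** 2 for y in ys) / len(ys)

ys = [1.0, 2.0, 6.0, 7.0]
mean = sum(ys) / len(ys)  # 4.0, the empirical estimate of E[y]
errors = {c: avg_squared_error(c, ys) for c in [2.0, 3.0, mean, 5.0, 6.0]}
```

The same argument applied conditionally at each x is why the squared-error-optimal regression function is E[y|x].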
The Bias-Variance Tradeoff
Expected future error = Model Bias² (approximation error) + Model Variance (estimation error) + Intrinsic Uncertainty (lower bound)
Note: the decomposition above is optimistic: it assumes future data comes from the same distribution as the training data.
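The decomposition can be verified by simulation. A minimal sketch (the true mean, noise level, and the constant-model setup are all illustrative assumptions, not from the slides):

```python
import random

def bias_variance_sim(n_train, n_reps, seed=0):
    # True relationship: y = 2 + Gaussian noise; fitted model: the sample mean.
    rng = random.Random(seed)
    true_mean, noise_sd = 2.0, 1.0
    fits = []
    for _ in range(n_reps):
        ys = [true_mean + rng.gauss(0.0, noise_sd) for _ in range(n_train)]
        fits.append(sum(ys) / n_train)        # model fitted on this training set
    avg_fit = sum(fits) / n_reps
    bias_sq = (avg_fit - true_mean) ** 2      # (model bias)^2
    variance = sum((f - avg_fit) ** 2 for f in fits) / n_reps  # model variance
    intrinsic = noise_sd ** 2                 # irreducible uncertainty
    return bias_sq, variance, intrinsic

bias_sq, variance, intrinsic = bias_variance_sim(n_train=10, n_reps=5000)
# variance should be close to noise_sd^2 / n_train = 0.1, and bias_sq close to 0
```

The sample mean is unbiased, so here all of the reducible error is variance; a constrained model would trade some variance for bias.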
Unexpected Bias-Variance Trends with DNNs
From Neal, Mittal, et al., arXiv, 2019
Class Probabilities
For both MSE and log-loss, the optimal prediction at any x is E[y|x]. For K-ary classification, y is a K-dimensional indicator vector, and
E[y_k | x] = 1 · P(y_k = 1 | x) + 0 · P(y_k = 0 | x) = P(y_k = 1 | x)
i.e., the optimal predictor for class k is the probability of that class. So deep networks will produce estimates of class probabilities... in theory, given enough data and assuming no local minima. (Note: this is a property of the loss function, not of deep networks.)
Example of Test-Bed Data: CIFAR Image Classification
• An example of a widely used dataset in deep learning research
– Up to 100 classes
– 50,000 images for training
– 10,000 images for test
• Studies on generalization, optimization, etc., often use this dataset
Deep Networks are often Miscalibrated (CIFAR data)
Predicted as Tiger with P(y|x) = 0.99
Predicted as Television with P(y|x) = 0.99
Network of Depth 5 vs. Network of Depth 110
Figure from Guo et al., ICML 2017
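One standard way to quantify this kind of miscalibration is the expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence with its accuracy. A minimal sketch (the confidences and labels below are made up, not from the paper):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # Bin predictions by confidence; ECE is the weighted average of
    # |accuracy - average confidence| across non-empty bins.
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, c))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        acc = sum(c for _, c in b) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

# Calibrated toy model: 0.8-confidence predictions that are right 80% of the time
ece_good = expected_calibration_error([0.8] * 10, [1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
# Overconfident toy model (like the depth-110 network): 0.99 confidence, 50% right
ece_bad = expected_calibration_error([0.99] * 10, [1, 0] * 5)
```

A well-calibrated model has ECE near zero; the overconfident one pays the full gap between its stated confidence and its actual accuracy.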
Expected loss with respect to P(x)... for the training data
[Figure: simulated regression data. Left panel: observed data and the true E[y|x] function. Right panel: the model's prediction for E[y|x] and its 95% confidence intervals, plotted against the true E[y|x] function.]
What will happen when we extrapolate beyond P(x)?
Generalizing from 100m Olympic Winning Times
From Tatem et al., Nature, 2004 (see also the response letters at http://faculty.washington.edu/kenrice/natureletter.pdf)
Figure from Kevin Murphy, Google, 2016
Deep neural networks: how well do these models extrapolate to new types of images?
Accuracy of ImageNet Classifiers on New ImageNet Data
From Recht et al., ICML 2019
A Deep Neural Network for Image Recognition
Images used for Training vs. New Images
From Nguyen, Yosinski, Clune, CVPR 2015
[Figure: a classifier's decision boundary in a 2-d feature space of age vs. monthly income.]
Poor extrapolation for test points like the one shown, far from the training data.
Non-Robustness in Deep Image Classification
Figure from Engstrom et al., ICML 2019
External versus Internal Validation
From Zech et al., PLOS Medicine, 2018
AUCs on test data from hospitals used in model training ("internal") versus test data from a hospital not used in model training ("external")
Bayesian Assessment of Black-Box Models
(New work in the Smyth/Steyvers group at UC Irvine)
• Scenario
– A black-box prediction model (e.g., a neural network) has been trained, its parameters are fixed, and we can only query the model
– We wish to evaluate its performance (accuracy, calibration, precision, etc.) online in a new environment
Results with deep networks on CIFAR image classification: N = 100 queries, N = 500 queries, N = 10,000 queries
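The slides do not give the estimator, but a natural Bayesian treatment of black-box accuracy is a conjugate Beta-Binomial model: with a Beta(1, 1) prior, observing k correct answers in N queries yields a Beta(1 + k, 1 + N − k) posterior over the model's accuracy. A sketch under that assumption:

```python
import math

def accuracy_posterior(n_correct, n_queries, a0=1.0, b0=1.0):
    # Conjugate update: Beta(a0, b0) prior on accuracy, Binomial likelihood.
    a = a0 + n_correct
    b = b0 + (n_queries - n_correct)
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

# Posterior uncertainty shrinks as the number of queries grows
# (cf. the N = 100 vs N = 10,000 query panels):
m100, s100 = accuracy_posterior(80, 100)
m10k, s10k = accuracy_posterior(8000, 10000)
```

The posterior standard deviation quantifies how much the new environment's accuracy estimate should still be trusted after a given number of queries.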
Bayesian Assessment of Accuracy and Calibration
(Ongoing work in the Smyth/Steyvers group at UC Irvine)
Results on the CIFAR image classification dataset
Bayesian Assessment via Ranking and Active Learning
(Ongoing work in the Smyth/Steyvers group at UC Irvine)
[Figure: Bayesian ranking by predicted class. Most accurate classes: palm tree, wardrobe, motorcycle, sunflower, keyboard. Least accurate: lizard, seal, otter, shrew, boy.]
Bayesian active learning: identifying the class with the least accurate predictions and the class with the least calibrated predictions.
THE OVERFITTING QUESTION
Lack of Overfitting of Deep Networks on CIFAR-10
From Poggio et al., 2018: Theory of Deep Learning III: the non-overfitting puzzle
More parameters than data
Lack of Overfitting: Different Networks, Different Data
From Neyshabur et al., 2018: Towards understanding the role of over-parametrization in generalization of neural networks
CIFAR-10 Data, SVHN Data, MNIST Data
Overfitting in the DL Literature
• Standard bias-variance theory seems not to apply
– DL models can interpolate the data (zero training error) but still generalize well on test data
• Training error (or loss) often tends to be much lower than test error
– This is traditionally an indicator of overfitting... but not here
• Various emerging conjectures and theories
– e.g., minimum-norm interpolators generalize well in the overparametrized regime (see Belkin et al. (2018, 2019), Hastie et al. (2019))
... but this is very much still an open problem
The "Double Descent" Theory
Belkin et al., Reconciling modern machine learning and the bias-variance tradeoff, 2018
"Double Descent" on MNIST Data
Belkin et al., Reconciling modern machine learning and the bias-variance tradeoff, 2018
Recent work from statistics confirms these theories: Hastie et al., arXiv, 2019
CONCLUDING COMMENTS
Cautionary Notes about Deep Learning
• Very large amounts of labeled data needed (for classification problems)
• Extrapolation properties are unpredictable
• Model building and optimization can be complex (significant human effort)
• Interpretability and explanation are difficult
• Reliance on empirical "folk wisdom" rather than principles and theory
Questions Worth Asking for AI Applications
1. Is machine learning an appropriate approach?
2. If so, is deep learning the best approach?
3. How do we build models that generalize well to new situations?
Sculley et al., NIPS 2015 Conference
Concluding Comments
• Deep learning has achieved impressive results in pattern recognition
– Particularly useful with high-dimensional signals (images, speech, text)
• Many foundational ideas are grounded in statistics (including other topics we did not discuss: fairness, adversarial/robust learning, reinforcement learning, ...)
• However, deep learning has blind spots
– e.g., reported empirical accuracies may be optimistic
• As deep machine learning is applied more broadly, we need
– Robust principles and theory to guide model-building
– Objective diagnosis and evaluation methods for practitioners
THANK YOU FOR LISTENING. QUESTIONS?
Additional Reading
• Efron, Bradley, and Trevor Hastie. Computer Age Statistical Inference. Cambridge University Press, 2016. (Chapter 18: Neural Networks and Deep Learning.)
• Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. Cambridge: MIT Press, 2016.
• Jordan, Michael I., and Tom M. Mitchell. Machine learning: trends, perspectives, and prospects. Science 349.6245 (2015): 255-260.
• Taddy, Matt. The Technological Elements of Artificial Intelligence. No. w24301. National Bureau of Economic Research, 2018.
• Brynjolfsson, Erik, and Tom Mitchell. What can machine learning do? Workforce implications. Science 358.6370 (2017): 1530-1534.
• Breiman, L. (2001). Statistical modeling: the two cultures. Statistical Science, 16(3), 199-231.