Linear Regression & Gradient Descent
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution.
Regression
Given:
– Data $X = \{x^{(1)}, \ldots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$
– Corresponding labels $y = \{y^{(1)}, \ldots, y^{(n)}\}$ where $y^{(i)} \in \mathbb{R}$

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020, with linear and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013)]
Linear Regression
• Hypothesis:
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$$
(assume $x_0 = 1$, so the bias $\theta_0$ folds into the sum)
• Fit model by minimizing sum of squared errors

[Figures: example fits to a 1-D dataset; courtesy of Greg Shakhnarovich]
Least Squares Linear Regression
• Cost Function
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$
• Fit by solving $\min_\theta J(\theta)$
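A minimal numpy sketch of these definitions (the function names `add_bias`, `h`, and `cost` are my own, not from the deck):

```python
import numpy as np

def add_bias(X):
    """Prepend the constant feature x_0 = 1 to each example (row) of X."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def h(theta, X):
    """Hypothesis h_theta(x) = sum_j theta_j x_j, evaluated for every row of X."""
    return X @ theta

def cost(theta, X, y):
    """Least squares cost J(theta) = 1/(2n) * sum_i (h_theta(x_i) - y_i)^2."""
    n = X.shape[0]
    r = h(theta, X) - y
    return (r @ r) / (2 * n)

# Tiny check: a perfect fit gives J(theta) = 0.
X = add_bias(np.array([[1.0], [2.0], [3.0]]))
print(cost(np.array([0.5, 2.0]), X, np.array([2.5, 4.5, 6.5])))  # 0.0
```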
Intuition Behind Cost Function
Left: for fixed $\theta$, $h_\theta(x)$ is a function of $x$. Right: $J$ is a function of the parameters.

[Series of figures: candidate lines $h_\theta(x)$ through the training data on the left, with the corresponding values of $J(\theta_0, \theta_1)$ on the right, showing that better-fitting lines give lower cost. Slides by Andrew Ng]
Basic Search Procedure
• Choose initial value for $\theta$
• Until we reach a minimum:
– Choose a new value for $\theta$ to reduce $J(\theta)$

[Series of figures: surface plot of $J(\theta_0, \theta_1)$ with a search path descending step by step toward a minimum. Figures by Andrew Ng]

Since the least squares objective function is convex, we don't need to worry about local minima in linear regression.
Gradient Descent
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
where $\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$

[Figure: $J(\theta)$ as a bowl-shaped curve over $\theta$, with gradient steps moving downhill]
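A generic sketch of this loop (illustrative: `grad` is any function returning $\partial J / \partial \theta$, and the toy objective below is mine, not yet the regression cost):

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.05, tol=1e-8, max_iters=100_000):
    """theta <- theta - alpha * grad(theta), applied to all components at once,
    which is exactly the 'simultaneous update' the slide calls for."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        step = alpha * grad(theta)
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta

# Toy objective J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
print(gradient_descent(lambda t: 2 * (t - 3.0), np.array([0.0])))  # -> ~[3.]
```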
Gradient Descent
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

For linear regression:
$$\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 \\
&= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)} \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
\end{aligned}$$
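The final line of the derivation, vectorized over all $j$ at once (a sketch; assumes the design matrix X already includes the $x_0 = 1$ column):

```python
import numpy as np

def grad_J(theta, X, y):
    """dJ/dtheta_j = (1/n) * sum_i (h_theta(x_i) - y_i) * x_ij.
    X.T @ (residuals) computes this sum for every j simultaneously."""
    n = X.shape[0]
    return X.T @ (X @ theta - y) / n
```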
Gradient Descent for Linear Regression
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$
• To achieve the simultaneous update: at the start of each GD iteration, compute $h_\theta\!\left(x^{(i)}\right)$, and use this stored value in the update step loop
• Assume convergence when $\|\theta_{\text{new}} - \theta_{\text{old}}\|_2 < \epsilon$, where the L2 norm is
$$\|v\|_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \ldots + v_{|v|}^2}$$
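Putting the update rule, the stored predictions, and the convergence test together; a sketch under the same assumptions as above (bias column included, names mine):

```python
import numpy as np

def fit_linear_regression(X, y, alpha=0.05, eps=1e-8, max_iters=100_000):
    """Gradient descent for linear regression; X is n x (d+1) with x_0 = 1."""
    theta = np.zeros(X.shape[1])
    n = X.shape[0]
    for _ in range(max_iters):
        preds = X @ theta                    # h_theta(x_i), computed once per iteration
        grad = X.T @ (preds - y) / n         # every theta_j update reuses the stored preds
        theta_new = theta - alpha * grad     # simultaneous update for j = 0..d
        if np.linalg.norm(theta_new - theta) < eps:  # ||theta_new - theta_old||_2 < eps
            return theta_new
        theta = theta_new
    return theta

# Usage on synthetic data generated from y = 2 + 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 2 + 3 * x + 0.05 * rng.normal(size=100)
X = np.column_stack([np.ones_like(x), x])
print(fit_linear_regression(X, y))  # approximately [2, 3]
```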
Gradient Descent
Left: for fixed $\theta$, $h_\theta(x)$ is a function of $x$. Right: $J$ is a function of the parameters.

[Series of figures: starting from $h(x) = -900 - 0.1x$, each gradient descent step moves the parameters downhill on the $J(\theta_0, \theta_1)$ contour plot while the fitted line on the left improves. Slides by Andrew Ng]
Choosing α
• α too small: slow convergence
• α too large: increasing value for $J(\theta)$
– May overshoot the minimum
– May fail to converge
– May even diverge

To see if gradient descent is working, print out $J(\theta)$ each iteration
• The value should decrease at each iteration
• If it doesn't, adjust α
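A quick version of that diagnostic, printing $J(\theta)$ for a too-small, a reasonable, and a too-large α (the data and the specific rates are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=50)
y = 1 + 4 * x + 0.1 * rng.normal(size=50)
X = np.column_stack([np.ones_like(x), x])
n = len(y)

for alpha in (0.001, 0.1, 1.5):  # too small / reasonable / too large
    theta = np.zeros(2)
    print(f"alpha = {alpha}")
    for it in range(5):
        J = np.sum((X @ theta - y) ** 2) / (2 * n)
        print(f"  iter {it}: J(theta) = {J:.4f}")  # should decrease each iteration
        theta = theta - alpha * (X.T @ (X @ theta - y) / n)
# With alpha = 1.5 the printed J(theta) grows: the step overshoots and diverges.
```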
Extending Linear Regression to More Complex Models
• The inputs X for linear regression can be:
– Original quantitative inputs
– Transformations of quantitative inputs
• e.g., log, exp, square root, square, etc.
– Polynomial transformation
• example: y = b0 + b1 × x + b2 × x² + b3 × x³
– Basis expansions
– Dummy coding of categorical inputs
– Interactions between variables
• example: x3 = x1 × x2

This allows use of linear regression techniques to fit non-linear datasets; see the sketch below.
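For instance, a design matrix combining several of the transformations listed above might be built like this (a sketch; the helper name and the particular transforms chosen are illustrative):

```python
import numpy as np

def expand_features(x1, x2):
    """Build a design matrix from two raw inputs using transformed features.
    The model stays linear in theta even though the features are nonlinear in x."""
    return np.column_stack([
        np.ones_like(x1),  # bias term x_0 = 1
        x1, x2,            # original quantitative inputs
        np.log(x1),        # transformation of a quantitative input (needs x1 > 0)
        x2 ** 2,           # square transform
        x1 * x2,           # interaction between variables: x3 = x1 * x2
    ])

print(expand_features(np.array([1.0, 2.0, 3.0]), np.array([0.5, 1.5, 2.5])))
```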
Linear Basis Function Models
• Generally,
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$$
where $\phi_j(x)$ is a basis function
• Typically, $\phi_0(x) = 1$ so that $\theta_0$ acts as a bias
• In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$

Based on slide by Christopher Bishop (PRML)
Linear Basis Function Models
• Polynomial basis functions: $\phi_j(x) = x^j$
– These are global; a small change in x affects all basis functions
• Gaussian basis functions: $\phi_j(x) = \exp\!\left( -\dfrac{(x - \mu_j)^2}{2s^2} \right)$
– These are local; a small change in x only affects nearby basis functions. $\mu_j$ and s control location and scale (width).
• Sigmoidal basis functions: $\phi_j(x) = \sigma\!\left( \dfrac{x - \mu_j}{s} \right)$, where $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$
– These are also local; a small change in x only affects nearby basis functions. $\mu_j$ and s control location and scale (slope).

Based on slides by Christopher Bishop (PRML)
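A sketch of the Gaussian and sigmoidal cases, fit with ordinary least squares rather than gradient descent; the grid of $\mu_j$ values and the width s = 0.1 are illustrative choices:

```python
import numpy as np

def gaussian_basis(x, mus, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant phi_0 = 1 column."""
    phi = np.exp(-((x[:, None] - mus[None, :]) ** 2) / (2 * s**2))
    return np.hstack([np.ones((len(x), 1)), phi])

def sigmoid_basis(x, mus, s):
    """phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a))."""
    phi = 1.0 / (1.0 + np.exp(-(x[:, None] - mus[None, :]) / s))
    return np.hstack([np.ones((len(x), 1)), phi])

# Fit a sine wave with 9 Gaussian bumps: still linear least squares in theta.
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)
Phi = gaussian_basis(x, mus=np.linspace(0, 1, 9), s=0.1)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.max(np.abs(Phi @ theta - y)))  # small residual: the local bumps track the curve
```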
Example of Fitting a Polynomial Curve with a Linear Model
$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$$
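Concretely, the fit just solves least squares on the design matrix $[1, x, x^2, \ldots, x^p]$; a sketch with illustrative data:

```python
import numpy as np

# Noisy samples of a cubic: the polynomial fit is still a *linear* model,
# because y is linear in the unknown coefficients theta_j.
rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 30)
y = 0.5 - 2.0 * x + 1.5 * x**3 + 0.01 * rng.normal(size=30)

p = 3
X = np.vander(x, N=p + 1, increasing=True)  # columns x^0, x^1, ..., x^p
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # approximately [0.5, -2.0, 0.0, 1.5]
```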
Quality of Fit
Overfitting:
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ...but fails to generalize to new examples

[Figure: three fits of Price vs. Size — underfitting (high bias), correct fit, and overfitting (high variance). Based on example by Andrew Ng]
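One way to see the $J(\theta) \approx 0$ symptom numerically: as the polynomial degree grows, training error keeps falling while held-out error eventually rises (a sketch; the data and the degrees compared are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = rng.uniform(-1, 1, size=30)
y_train = np.sin(3 * x_train) + 0.2 * rng.normal(size=30)
x_test = rng.uniform(-1, 1, size=200)
y_test = np.sin(3 * x_test) + 0.2 * rng.normal(size=200)

for p in (1, 4, 15):  # underfit / reasonable / overfit
    theta, *_ = np.linalg.lstsq(
        np.vander(x_train, p + 1, increasing=True), y_train, rcond=None)
    mse_train = np.mean((np.vander(x_train, p + 1, increasing=True) @ theta - y_train) ** 2)
    mse_test = np.mean((np.vander(x_test, p + 1, increasing=True) @ theta - y_test) ** 2)
    print(f"degree {p:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
# Expect: train MSE shrinks as the degree grows, test MSE is worst at degree 15.
```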
Regularization
• A method for automatically controlling the complexity of the learned hypothesis
• Idea: penalize large values of $\theta_j$
– Can incorporate the penalty into the cost function
– Works well when we have a lot of features, each of which contributes a bit to predicting the label
• Can also address overfitting by eliminating features (either manually or via model selection)
Regularization
• Linear regression objective function
$$J(\theta) = \underbrace{\frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2}_{\text{model fit to data}} + \underbrace{\lambda \sum_{j=1}^{d} \theta_j^2}_{\text{regularization}}$$
– $\lambda$ is the regularization parameter ($\lambda \geq 0$)
– No regularization on $\theta_0$!
• Equivalently, absorbing a factor of 2 into $\lambda$ (this form gives a cleaner gradient and is used on the following slides):
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
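The λ/2 form of the objective in numpy (a sketch; note the penalty deliberately skips `theta[0]`):

```python
import numpy as np

def cost_regularized(theta, X, y, lam):
    """J(theta) = 1/(2n) sum_i (h_theta(x_i) - y_i)^2 + (lam/2) sum_{j>=1} theta_j^2."""
    n = X.shape[0]
    r = X @ theta - y
    fit = (r @ r) / (2 * n)                       # model fit to data
    penalty = (lam / 2) * np.sum(theta[1:] ** 2)  # regularization; no penalty on theta_0
    return fit + penalty
```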
Understanding Regularization
• Note that
$$\sum_{j=1}^{d} \theta_j^2 = \|\theta_{1:d}\|_2^2$$
– This is the (squared) magnitude of the feature coefficient vector!
• We can also think of this as:
$$\sum_{j=1}^{d} (\theta_j - 0)^2 = \|\theta_{1:d} - \vec{0}\|_2^2$$
• L2 regularization pulls the coefficients toward 0
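The identity checked numerically (values illustrative):

```python
import numpy as np

theta = np.array([5.0, 1.0, -2.0, 3.0])  # theta_0 = 5; theta_{1:d} = [1, -2, 3]
print(np.sum(theta[1:] ** 2))                        # 14.0
print(np.linalg.norm(theta[1:]) ** 2)                # ~14.0 -- squared L2 norm, same thing
print(np.linalg.norm(theta[1:] - np.zeros(3)) ** 2)  # ~14.0 -- squared distance from the zero vector
```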
Understanding Regularization
• What happens if we set $\lambda$ to be huge (e.g., $10^{10}$)?
– The penalty dominates, so $\theta_j \approx 0$ for all $j \geq 1$ and the fit degenerates to the flat line $h(x) = \theta_0$, underfitting the data

[Figure: Price vs. Size with the fitted curve flattened to a horizontal line. Based on example by Andrew Ng]
Regularized Linear Regression
• Cost Function
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• Fit by solving $\min_\theta J(\theta)$
• Gradient update (using $\frac{\partial}{\partial \theta_0} J(\theta)$ and $\frac{\partial}{\partial \theta_j} J(\theta)$; the $\lambda\theta_j$ term is the regularization):
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$$
Regularized Linear Regression
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$$
• We can rewrite the gradient step for $\theta_j$ as:
$$\theta_j \leftarrow \theta_j (1 - \alpha\lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$
so each step first shrinks $\theta_j$ by the factor $(1 - \alpha\lambda)$ and then applies the usual unregularized update.
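A sketch of regularized linear regression using exactly this rewritten step (the shrink-then-update form; function name and data are mine):

```python
import numpy as np

def fit_ridge_gd(X, y, lam, alpha=0.05, eps=1e-8, max_iters=100_000):
    """Gradient descent with L2 regularization, written as
    theta_j <- theta_j * (1 - alpha*lam) - alpha * (1/n) sum_i (h - y) x_ij."""
    n, d1 = X.shape
    theta = np.zeros(d1)
    shrink = np.full(d1, 1.0 - alpha * lam)
    shrink[0] = 1.0  # no regularization on theta_0
    for _ in range(max_iters):
        grad_fit = X.T @ (X @ theta - y) / n
        theta_new = shrink * theta - alpha * grad_fit  # shrink, then usual update
        if np.linalg.norm(theta_new - theta) < eps:
            return theta_new
        theta = theta_new
    return theta

# lam = 0 recovers plain least squares; lam > 0 pulls theta_1 toward 0.
rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=100)
y = 2 + 3 * x + 0.05 * rng.normal(size=100)
X = np.column_stack([np.ones_like(x), x])
print(fit_ridge_gd(X, y, lam=0.0))
print(fit_ridge_gd(X, y, lam=0.1))
```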