Machine Learning and Data Mining I
DS 4400, Alina Oprea, Associate Professor, CCIS, Northeastern University, September 25, 2018

Page 1:

DS 4400

Alina Oprea

Associate Professor, CCIS

Northeastern University

September 25, 2018

Machine Learning and Data Mining I

Page 2:

Review

• The solution for simple and multiple linear regression can be computed in closed form
– Matrix inversion is computationally intensive

• Gradient descent is an efficient algorithm for optimization and training LR
– The most widely used algorithm in ML!

– We derived GD update rule for simple and multiple LR

– There are many practical issues with GD

• Regularization is a general method to reduce model complexity and avoid overfitting
– Add a penalty to the loss function

– Ridge and Lasso regression

Page 3:

Simple LR

$\hat{y} = \theta_0 + \theta_1 x$

$h_\theta(x) = \theta_0 + \theta_1 x$

Slope: $\theta_1 = \Delta y / \Delta x$. Intercept: $\theta_0$.

Residual: the difference $h_\theta(x^{(i)}) - y^{(i)}$ at a training point $(x^{(i)}, y^{(i)})$

$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2$

Page 4:

Simple LR – Closed Form

• Dataset $x^{(i)} \in \mathbb{R}$, $y^{(i)} \in \mathbb{R}$, with hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$

• Loss: $J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \big( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \big)^2$

• Setting the partial derivatives to zero:

$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} \big( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \big) = 0$

$\frac{\partial J(\theta)}{\partial \theta_1} = \frac{2}{n} \sum_{i=1}^{n} x^{(i)} \big( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \big) = 0$

• Closed-form solution minimizing the loss, with $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y^{(i)}$:

– $\theta_0 = \bar{y} - \theta_1 \bar{x}$

– $\theta_1 = \frac{\sum_i (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_i (x^{(i)} - \bar{x})^2}$
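As a concrete illustration, the closed-form formulas above can be implemented directly. This is a minimal sketch (not from the slides); the function name and the toy dataset, generated from $y = 1 + 2x$, are assumptions:

```python
# Closed-form simple linear regression; implements the two slide formulas:
# theta1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),
# theta0 = y_bar - theta1 * x_bar.
def simple_lr_fit(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    theta1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
              / sum((x - x_bar) ** 2 for x in xs))
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

# Points on the line y = 1 + 2x, so the fit recovers theta0 = 1, theta1 = 2.
theta0, theta1 = simple_lr_fit([1, 2, 3, 4], [3, 5, 7, 9])
```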

Page 5:

Multiple LR – Closed Form

• Dataset: $x^{(i)} \in \mathbb{R}^d$, $y^{(i)} \in \mathbb{R}$

• Hypothesis: $h_\theta(x) = \theta^T x$

• Loss / cost: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big( \theta^T x^{(i)} - y^{(i)} \big)^2$

• Closed-form solution
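The closed-form solution referred to above is the normal equation, $\theta = (X^T X)^{-1} X^T y$. The sketch below is an illustration, not the slide's code: it writes the system out for $d = 1$ plus a bias column, so the $2 \times 2$ inverse can be computed explicitly; names and data are assumptions:

```python
# Normal equation theta = (X^T X)^{-1} X^T y, for design-matrix rows [1, x].
# A = X^T X and b = X^T y are formed entry by entry, then A is inverted
# via the 2x2 cofactor formula.
def normal_equation_fit(xs, ys):
    n = len(xs)
    a00, a01, a11 = n, sum(xs), sum(x * x for x in xs)
    b0, b1 = sum(ys), sum(x * y for x, y in zip(xs, ys))
    det = a00 * a11 - a01 * a01
    theta0 = (a11 * b0 - a01 * b1) / det
    theta1 = (a00 * b1 - a01 * b0) / det
    return theta0, theta1

theta0, theta1 = normal_equation_fit([1, 2, 3], [3, 5, 7])  # data on y = 1 + 2x
```

For larger $d$, the same computation is what a linear-algebra routine performs on the full matrices; forming the inverse explicitly is what makes the closed form computationally intensive.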

Page 6:

GD for Simple Linear Regression

• $J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \big( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \big)^2$

• $\frac{\partial J(\theta)}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} \big( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \big)$

• $\frac{\partial J(\theta)}{\partial \theta_1} = \frac{2}{n} \sum_{i=1}^{n} \big( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \big)\, x^{(i)}$

Update rules for Gradient Descent
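Plugging the gradients above into the generic update rule $\theta_j \leftarrow \theta_j - \alpha \,\partial J(\theta) / \partial \theta_j$ gives a runnable sketch; the learning rate, iteration count, and toy data are illustrative assumptions, not values from the slides:

```python
def gd_simple_lr(xs, ys, alpha=0.05, iters=5000):
    # Gradient descent on J(theta) = (1/n) sum (theta0 + theta1*x - y)^2,
    # using the two partial derivatives from the slide above.
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        g0 = (2 / n) * sum(theta0 + theta1 * x - y for x, y in zip(xs, ys))
        g1 = (2 / n) * sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys))
        theta0 -= alpha * g0
        theta1 -= alpha * g1
    return theta0, theta1

# Same toy line y = 1 + 2x; GD converges close to the closed-form answer.
theta0, theta1 = gd_simple_lr([1, 2, 3, 4], [3, 5, 7, 9])
```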

Page 7:

GD for Multiple Linear Regression

• $J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \big( \theta^T x^{(i)} - y^{(i)} \big)^2$

• $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{2}{n} \sum_{i=1}^{n} \big( \theta^T x^{(i)} - y^{(i)} \big)\, x_j^{(i)}$

Update rules for Gradient Descent

Page 8:

Gradient Descent vs Closed Form

Gradient Descent

Closed form

Page 9:

Outline

• Classification

– Linear classifiers

– Online perceptron and batch perceptron

• Instance learners

– kNN

• Evaluation of classification algorithms

– Metrics (accuracy, precision, recall)

• Cross validation

Page 10:

Supervised learning

Given labeled training data $\{x^{(i)}, y^{(i)}\}$, for $i = 1, \dots, n$, learn a function $\hat{f}$ such that $\hat{f}(x^{(i)}) \approx y^{(i)}$.

Page 11:

Page 12:

Binary or discrete

Given $x^{(1)}, \dots, x^{(n)}$ and $y^{(1)}, \dots, y^{(n)}$, with $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)} \in \{-1, 1\}$, learn $f$ such that $f(x^{(i)}) = y^{(i)}$.

Page 13:

Example 1: Classifying spam email

Content-related features
• Use of certain words
• Word frequencies
• Language
• Sentence

Structural features
• Sender IP address
• IP blacklist
• DNS information
• Email server
• URL links (non-matching)

Binary classification: SPAM or HAM

Page 14:

Example 2: Handwritten Digit Recognition

Multi-class classification

Page 15:

Example 3: Image classification

Multi-class classification

Page 16:

Supervised Learning Process

Training (data is typically labeled):
Data → Pre-processing (normalization, standardization) → Feature extraction (feature selection) → Learning model (classification or regression)

Testing:
New data (unlabeled) → Learning model → Predictions
• Classification: Malicious / Benign
• Regression: Risk score

Page 17:

History of Perceptrons

They are the basic building blocks for Deep Neural Networks

Page 18:

Linear classifiers

A linear classifier has $d+1$ parameters $(\theta_0, \theta_1, \dots, \theta_d)$; its hyperplane divides the space into 2 half-spaces.

Page 19:

Linear classifiers

All the points x on the hyperplane satisfy: 𝜃𝑇𝑥 = 0

Page 20:

Example: Spam

$\theta^T x = \sum_{i=0}^{d} \theta_i x_i$

Classify as spam if $\sum_i \theta_i x_i > 0$.

Page 21:

The Perceptron

Page 22:

The Perceptron

Page 23:

Geometric interpretation

(Figure: the weight vector $\theta_t$ and the updated vector $\theta_{t+1}$ after a perceptron mistake.)

Page 24:

Online Perceptron

For each example $(x^{(i)}, y^{(i)})$ in turn, for up to $T$ passes: if $y^{(i)} \theta^T x^{(i)} \le 0$ (a mistake), update $\theta \leftarrow \theta + y^{(i)} x^{(i)}$.
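A minimal sketch of the online perceptron described above; it assumes inputs carry a leading 1 so $\theta_0$ acts as the bias, and labels in $\{-1, +1\}$. The function name and toy data are made up:

```python
def perceptron_train(X, Y, epochs=10):
    # Online perceptron: on each mistake (y * theta.x <= 0), theta += y * x.
    # X rows include a leading 1 so theta[0] acts as the bias term.
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            if y * sum(t * xi for t, xi in zip(theta, x)) <= 0:
                theta = [t + y * xi for t, xi in zip(theta, x)]
    return theta

# Linearly separable toy data: the label is the sign of the second coordinate.
X = [[1, 2], [1, 3], [1, -2], [1, -3]]
Y = [1, 1, -1, -1]
theta = perceptron_train(X, Y)
```

After training, every example satisfies $y^{(i)} \theta^T x^{(i)} > 0$, i.e., the learned hyperplane separates the data.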

Page 25:

Batch Perceptron

Guaranteed to find a separating hyperplane if the data is linearly separable

Each pass updates $\theta$ using all misclassified examples, i.e., those with $y^{(i)} \theta^T x^{(i)} \le 0$.

Page 26:

Perceptron Limitations

• Depends on the starting point

• It can take many steps to converge

• Perceptron can overfit

– Moves the decision boundary for every example

Which of these is optimal?

Page 27:

Improving the Perceptron

Page 28:

• For linearly separable data, one can prove bounds on the perceptron's error (depending on how well separated the data is)

Page 29:

$h_\theta(x) = f(\theta^T x)$

• Properties
– $(\theta_0, \theta_1, \dots, \theta_d)$ = model parameters
– Perceptron is a special case with $f = \mathrm{sign}$
– Linear regression can be used as a classifier with $f(x) = x$: if $h(x) > 0.5$, output 1; otherwise output -1

• Pros
– Very compact model (size $d$)
– Perceptron is fast

• Cons
– Does not work for data that is not linearly separable

(Figure: decision boundary $h(x) = 0$, with $h(x) < 0$ on one side and $h(x) > 0$ on the other.)

Page 30:

Outline

• Classification

– Linear classifiers

– Online perceptron and batch perceptron

• Instance learners

– kNN

• Evaluation of classification algorithms

– Metrics (accuracy, precision, recall)

• Cross validation

Page 31:

Page 32:

K-Nearest-Neighbours for multi-class classification

Vote among multiple classes

Page 33:

Vector distances

Vector norms: a norm $\|x\|$ is informally a measure of the "length" of a vector.
– Common norms: L1, L2 (Euclidean)

A norm can be used as a distance between vectors $x$ and $y$:
• $\|x - y\|_p$
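As a small illustration (not from the slides), the $L_p$ distance between two vectors can be computed directly; the function name is an assumption:

```python
def lp_distance(x, y, p):
    # ||x - y||_p = (sum_i |x_i - y_i|^p)^(1/p); p = 1 gives L1, p = 2 gives L2.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

d1 = lp_distance([0, 0], [3, 4], p=1)  # L1 (Manhattan) distance: 7.0
d2 = lp_distance([0, 0], [3, 4], p=2)  # L2 (Euclidean) distance: 5.0
```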

Page 34:

Distance norms

Page 35:

kNN

• Algorithm (to classify a point $x$)
– Find the $k$ nearest points to $x$ (according to the distance metric)
– Perform majority voting to predict the class of $x$

• Properties
– Does not learn any model in training!
– Instance learner (needs all the data at testing time)
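The two algorithm steps above can be sketched as follows; the helper names, the choice of squared Euclidean distance, and the toy data are assumptions:

```python
from collections import Counter

def knn_predict(train, x, k):
    # train: list of (point, label) pairs. Step 1: find the k nearest
    # points to x (squared Euclidean distance); step 2: majority vote.
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    neighbors = sorted(train, key=lambda pl: sq_dist(pl[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two well-separated clusters; a query near each cluster gets that label.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
label = knn_predict(train, (0.5, 0.5), k=3)
```

Note that nothing is fit at "training" time: the entire `train` list is consulted at every prediction, which is exactly the instance-learner property above.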

Page 36:

How to choose k (hyper-parameter)?

Overfitting!

Page 37:

How to choose k (hyper-parameter)?

Page 38:

How to choose k (hyper-parameter)?

Page 39:

Cross-validation

For each candidate value of $k$ (e.g., $K = 1, 3, 7$):
• Split the training data into 10 folds; for each split $j$, train on train set $j$ and compute the error on validation set $j$
• Average the 10 validation errors to obtain the score for that $k$

Choose the $k$ with the lowest average error.
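A hedged sketch of the procedure above: for each candidate $k$, average the validation error across folds. The fold layout and names are assumptions, and `predict` stands in for any classifier, such as the kNN predictor:

```python
def cv_error(data, k, n_folds, predict):
    # data: list of (x, y) pairs; predict(train, x, k) -> predicted label.
    # Returns the average validation error over n_folds contiguous splits.
    fold_size = len(data) // n_folds
    mistakes = 0
    for f in range(n_folds):
        val = data[f * fold_size:(f + 1) * fold_size]
        train = data[:f * fold_size] + data[(f + 1) * fold_size:]
        mistakes += sum(predict(train, x, k) != y for x, y in val)
    return mistakes / (n_folds * fold_size)

# Model selection: best_k = min(ks, key=lambda k: cv_error(data, k, 10, predict))
```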

Page 40:

Review

• Classification is a supervised learning problem

– Prediction is binary or multi-class

• Examples

– Linear classifiers (perceptron)

– Instance learners (kNN)

• ML methodology includes cross-validation for parameter selection and estimation of model error

– K-fold CV or LOOCV

Page 41:

Acknowledgements

• Slides made using resources from:

– Andrew Ng

– Eric Eaton

– David Sontag

• Thanks!