CSE217 INTRODUCTION TO DATA SCIENCE Spring 2019 Marion Neumann LECTURE 4: REGRESSION


CSE217 INTRODUCTION TO DATA SCIENCE

Spring 2019 Marion Neumann

LECTURE 4: REGRESSION


RECAP: DATA SCIENCE


…solving problems with data…

[Diagram: problem (scientific or business problem) · data · collect & understand data · clean & format data · use data to create solution]

…which step is most exciting?

Machine Learning


RECAP: ML

• data: anything you can measure or record

• model: specification of a (mathematical) relationship between different variables

• evaluation: how well does the model work?


…creating and using models that learn from data…


RECAP: ML WORKFLOW

• Training phase, test phase, and evaluation phase

→ turn to your neighbor

• by taking turns, explain what happens in the
  • training phase
  • test phase
  • evaluation phase

• carefully define what kinds of data are used in each phase


[Diagram labels: data, output, program; data, output; ground truth, performance measure]


PROPERTY SALES DATA

Goal: predict how much my house is worth

• features (input variables)
  size (in sq. ft): ☐ numeric ☐ categorical ☐ binary
  neighborhood: ☐ numeric ☐ categorical ☐ binary
  # bed rooms: ☐ numeric ☐ categorical ☐ binary
  # bath rooms: ☐ numeric ☐ categorical ☐ binary
  pool: ☐ numeric ☐ categorical ☐ binary
  age (in years): ☐ numeric ☐ categorical ☐ binary
  renovated: ☐ numeric ☐ categorical ☐ binary

• house price = target variable
  ☐ numeric ☐ categorical ☐ binary


How can this data help?


PREDICTING HOUSE PRICES

• target (house price) is a real number


How much is my house worth?

Look at Zillow!


LINEAR REGRESSION MODEL


TRAINING: MINIMIZE ERROR


PDSH p391

Linear Regression

math & statistics
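
A minimal sketch of what "minimize error" can mean with a single feature, assuming the usual squared-error objective of simple linear regression: pick the intercept and slope that make the sum of squared differences between predicted and observed prices as small as possible. The names `fit_line` and `sum_squared_error` and the four houses are made up for illustration; this is plain Python, not the scikit-learn code from PDSH.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y ≈ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_error(xs, ys, intercept, slope):
    """Training error: sum of squared prediction errors."""
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical training data: size in sq. ft and price in thousands of dollars.
sizes = [1000, 1500, 2000, 2500]
prices = [210, 290, 410, 490]

intercept, slope = fit_line(sizes, prices)
best_error = sum_squared_error(sizes, prices, intercept, slope)
# Any other line, e.g. one with a 10% steeper slope, has a larger training error.
worse_error = sum_squared_error(sizes, prices, intercept, slope * 1.1)
print(intercept, slope, best_error, worse_error)
```

The closed form is used here only because it fits on a slide; libraries such as scikit-learn solve the same problem for many features at once.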


PREDICTION: USE MODEL


PDSH p391

Linear Regression
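
Once the parameters are learned, prediction is just evaluating the linear function on a new input. A small sketch, assuming a model with one weight per feature plus a bias; the weights, the bias, and the example house below are invented for illustration and are not values from the lecture.

```python
def predict(weights, bias, features):
    """Linear model: bias + sum of weight * feature value."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical learned parameters (price in thousands of dollars).
bias = 50.0
weights = [0.1, 20.0, 15.0]  # per sq. ft, per bedroom, renovated (0 or 1)

# Hypothetical new house: 1800 sq. ft, 3 bedrooms, renovated.
my_house = [1800, 3, 1]
estimate = predict(weights, bias, my_house)
print(estimate)  # about 305 (thousand dollars)
```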


HOW ABOUT MORE COMPLEX MODELS?


PDSH p393

Linear Regression

Error on training set: linear model >> quadratic >> 6th-order polynomial

← error is zero!

Is the model with zero (training) error the best?
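
The question can be explored with a small experiment. The sketch below is an illustration under stated assumptions, not the PDSH example: the seven training points are made up (a straight line plus alternating noise), the quadratic model is skipped for brevity, and the degree-6 polynomial is built by Lagrange interpolation so that it passes through every training point exactly.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between two equal-length lists."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Polynomial of degree len(xs) - 1 through all points (Lagrange form)."""
    def model(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return model


# Made-up data: y = 2x + 1 with alternating noise of +0.5 / -0.5.
train_x = [0, 1, 2, 3, 4, 5, 6]
train_y = [1.5, 2.5, 5.5, 6.5, 9.5, 10.5, 13.5]
# Held-out points on the underlying line y = 2x + 1.
test_x = [0.5, 2.5, 5.5]
test_y = [2.0, 6.0, 12.0]

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

line_train = rmse([line(x) for x in train_x], train_y)
poly_train = rmse([poly(x) for x in train_x], train_y)
line_test = rmse([line(x) for x in test_x], test_y)
poly_test = rmse([poly(x) for x in test_x], test_y)

print(f"training RMSE: line {line_train:.3f}, degree-6 polynomial {poly_train:.3f}")
print(f"test RMSE:     line {line_test:.3f}, degree-6 polynomial {poly_test:.3f}")
```

On this toy data the polynomial reproduces every training point, yet it does worse than the straight line on the held-out points, because it has also fitted the noise.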


EVALUATION FOR REGRESSION

• Training Error vs. Test Error

• Error measures:
  • RMSE: root mean squared error
  • MAE: mean absolute error

$\mathrm{RMSE}(\hat{y}, y) = \sqrt{\frac{1}{n} \sum_{i} (\hat{y}_i - y_i)^2}$

$\mathrm{MAE}(\hat{y}, y) = \frac{1}{n} \sum_{i} |\hat{y}_i - y_i|$

$\hat{y} = f(X)$: predictions for test data
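
Both error measures are short enough to write by hand. A sketch in plain Python, assuming predictions and ground-truth targets arrive as equal-length lists; the three test prices are made up.

```python
import math


def rmse(y_hat, y):
    """Root mean squared error: square, average, then take the root."""
    n = len(y)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_hat, y)) / n)


def mae(y_hat, y):
    """Mean absolute error: average size of the errors."""
    n = len(y)
    return sum(abs(p - t) for p, t in zip(y_hat, y)) / n


# Hypothetical test targets and model predictions (thousands of dollars).
y_test = [300, 250, 400]
y_hat = [310, 250, 370]

print(rmse(y_hat, y_test))  # roughly 18.26
print(mae(y_hat, y_test))   # roughly 13.33
```

The errors here are 10, 0, and -30. RMSE comes out larger than MAE because squaring gives the single large error more weight.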


MACHINE LEARNING WORKFLOW

• Training Phase, Test Phase, Evaluation Phase
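
The three phases can be sketched end to end in a few lines. This is one possible illustration, assuming a single-feature linear model and MAE as the performance measure; the data set, the every-fourth-example split, and the name `train_model` are hypothetical.

```python
# Hypothetical labeled data: (size in sq. ft, price in thousands of dollars).
data = [(1000, 205), (1200, 238), (1500, 310), (1800, 355),
        (2000, 402), (2200, 437), (2500, 505), (3000, 596)]

# Hold out every fourth example as test data; the rest is training data.
test_set = data[3::4]
train_set = [d for i, d in enumerate(data) if i % 4 != 3]


def train_model(examples):
    """Training phase: learn intercept and slope from inputs AND targets."""
    xs = [x for x, _ in examples]
    ys = [y for _, y in examples]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Training phase: uses training inputs and their known prices.
intercept, slope = train_model(train_set)

# Test phase: the model sees only the test inputs and produces outputs.
predictions = [intercept + slope * x for x, _ in test_set]

# Evaluation phase: compare outputs with the ground truth (here via MAE).
test_mae = sum(abs(p - y) for p, (_, y) in zip(predictions, test_set)) / len(test_set)
print(f"test MAE: {test_mae:.1f} thousand dollars")
```

The test targets are never shown to `train_model`; they are used only in the evaluation step.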


SUMMARY & READING

• Learning from Data requires a lot of math!

• Regression models are used to predict real valued targets.

• We need a test set to evaluate how well our model generalizes.


• DSFS
  • Ch11: ML (p142-144)
  • Ch14: Simple Linear Regression (p173-176)

• PDSH Ch5: ML – Linear Regression (p390-394)

• LINEAR REGRESSION BY HAND

https://www.wired.com/2011/01/linear-regression-by-hand/

SciKitLearn

understand the model

use the model in practice