12a. Regression Analysis, Part 1 CSCI N207 Data Analysis Using Spreadsheet Lingma Acheson linglu@iupui.edu Department of Computer and Information Science, IUPUI


Page 1:

12a. Regression Analysis, Part 1

CSCI N207 Data Analysis Using Spreadsheet

Lingma Acheson

linglu@iupui.edu

Department of Computer and Information Science, IUPUI

Page 2:

Multivariate Analysis - Correlation

Student   Reading Aptitude   Reading Hours
1         20                 5
2         5                  1
3         5                  2
4         35                 7
5         30                 8
6         35                 8
7         10                 3
8         5                  2
9         15                 5
10        40                 9

Scatter chart with a trend line:

[Figure: scatter chart "Reading Aptitude and Reading Hours" with a trend line; axes: Aptitude (0 to 45) and Hours (0 to 10)]

Page 3:

Multivariate Analysis - Correlation

Scatter chart with a trend line:

[Figure: scatter chart "Reading Aptitude and Reading Hours" with a trend line; axes: Aptitude (0 to 45) and Hours (0 to 10)]

• With a trend line, can we roughly estimate a person's reading aptitude if they read 6 hours a week? If so, what is the estimate?

Student   Reading Aptitude   Reading Hours
1         20                 5
2         5                  1
3         5                  2
4         35                 7
5         30                 8
6         35                 8
7         10                 3
8         5                  2
9         15                 5
10        40                 9
11        25                 6
12        33                 7.8
13        46                 10

Page 4:

Regression and Prediction

• Regression refers to a mathematical method for determining the best equation to reproduce a data set.

• Linear regression is a regression method that applies a straight line (linear model) for analysis.

• How do we generate a formula for a line that we can use to predict a data value without having to use a chart?

• We use regression analysis to …
  – … predict new X and Y values
  – … aid our understanding of data behavior
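
As a rough sketch of how such a formula can be generated, the Python below fits a straight line to the ten student records from the Reading Aptitude / Reading Hours table. It assumes ordinary least squares; the slides do not name the fitting method, although that is what a spreadsheet linear trend line normally uses. The helper name `fit_line` and the variable names are illustrative only.

```python
# Ten (aptitude, hours) records from the Reading Aptitude / Reading Hours table.
aptitude = [20, 5, 5, 35, 30, 35, 10, 5, 15, 40]
hours = [5, 1, 2, 7, 8, 8, 3, 2, 5, 9]


def fit_line(xs, ys):
    """Return (m, b) of the least-squares line Y = mX + b (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how X and Y vary together, divided by how much X varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Assumption: X = aptitude and Y = hours, matching the chart axes.
m, b = fit_line(aptitude, hours)
print(f"Hours = {m:.4f} * Aptitude + {b:.4f}")

# Estimate the aptitude that corresponds to 6 reading hours by solving for X.
aptitude_at_6_hours = (6 - b) / m
print(f"Estimated aptitude at 6 hours: {aptitude_at_6_hours:.1f}")
```

Under these assumptions the estimate for 6 reading hours comes out at roughly 25, which is the kind of value one would read off the trend line by eye.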

Page 5:

Reviewing the Linear Equation

• The equation for a line is:

$$Y = mX + b$$

  – Y: dependent variable
  – X: independent variable
  – m: slope
  – b: y-intercept

Page 6:

Slope and y-intercept

[Figure: chart with X on the horizontal axis (0 to 25) and Y on the vertical axis (0 to 12), showing three lines:]

Y = 0.4X + 2

Y = 0.8X + 4

Y = 0X + 5
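
To see how the slope and y-intercept shape each of the three lines, the short Python sketch below evaluates them at a few X values. The function name `line` and the sample X values are illustrative choices, not part of the slides.

```python
def line(x, m, b):
    """Evaluate the linear equation Y = mX + b."""
    return m * x + b


# (slope, y-intercept) pairs for the three lines on the chart.
lines = {
    "Y = 0.4X + 2": (0.4, 2),
    "Y = 0.8X + 4": (0.8, 4),
    "Y = 0X + 5": (0, 5),
}

sample_xs = (0, 5, 10)  # illustrative X values
table = {}
for label, (m, b) in lines.items():
    table[label] = [round(line(x, m, b), 2) for x in sample_xs]
    print(label, table[label])
```

At X = 0 each line sits at its y-intercept, and every step of 5 in X changes Y by 5 times the slope, so the line with slope 0 stays flat at 5.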

Page 7:

m and b

• m, the slope, is a ratio, defined as:

$$m = \frac{\text{rise}}{\text{run}} \quad \text{or as} \quad m = \frac{\Delta Y}{\Delta X}$$

• ∆: change of

Page 8:

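Example – Determining Slope

Data Point   Value
X1           1
Y1           2.4
X2           20
Y2           10

$$
\begin{aligned}
m &= \frac{\Delta Y}{\Delta X} \\
m &= \frac{Y_2 - Y_1}{X_2 - X_1} \\
m &= \frac{10 - 2.4}{20 - 1} \\
m &= \frac{7.6}{19} = 0.4
\end{aligned}
$$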
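
The slope calculation in this example can be sketched in Python as follows. The function name `slope` and the guard against a vertical line are additions for illustration; the slides only show the arithmetic.

```python
def slope(x1, y1, x2, y2):
    """Slope m = (Y2 - Y1) / (X2 - X1), i.e. rise over run."""
    if x2 == x1:
        # A vertical line has no defined slope (division by zero).
        raise ValueError("X1 and X2 must differ")
    return (y2 - y1) / (x2 - x1)


# Data points from the example: (X1, Y1) = (1, 2.4), (X2, Y2) = (20, 10).
m = slope(1, 2.4, 20, 10)
print(round(m, 4))  # rounded to hide floating-point noise
```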

Page 9:

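Example of Determining Y-Intercept

• X1=1, Y1=2.4, X2=20, Y2=10, m=0.4

Example 1:

$$
\begin{aligned}
Y &= mX + b \\
Y_1 &= mX_1 + b \\
b &= Y_1 - mX_1 \\
b &= 2.4 - 0.4 \times 1 \\
b &= 2.4 - 0.4 = 2
\end{aligned}
$$

Example 2:

$$
\begin{aligned}
Y &= mX + b \\
Y_2 &= mX_2 + b \\
b &= Y_2 - mX_2 \\
b &= 10 - 0.4 \times 20 \\
b &= 10 - 8 = 2
\end{aligned}
$$

Equation: Y = 0.4X + 2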
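
As a quick check that either data point gives the same y-intercept, here is a minimal Python sketch. The function name `intercept` is illustrative, and `math.isclose` is used because floating-point subtraction may not return exactly 2.

```python
import math


def intercept(x, y, m):
    """Solve Y = mX + b for b, given one point (x, y) on the line."""
    return y - m * x


m = 0.4
b1 = intercept(1, 2.4, m)   # using (X1, Y1)
b2 = intercept(20, 10, m)   # using (X2, Y2)

# Both points lie on the same line, so both give the same intercept.
print(round(b1, 4), round(b2, 4))
print(math.isclose(b1, b2))
```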

Page 10:

Practice

• Find the equation for the line below. p1(5,1), p2(10,3)

[Figure: chart "Reading Aptitude and Reading Hours" showing a line labeled f(x) = 0.4x − 1; horizontal axis 4 to 11, vertical axis 0 to 3.5]
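
One way to work the practice problem, following the same two steps as the slope and y-intercept examples, is sketched below. Treating p1 as (X_1, Y_1) and p2 as (X_2, Y_2) is a labeling assumption; the result agrees with the line label on the chart.

```latex
\begin{aligned}
m &= \frac{Y_2 - Y_1}{X_2 - X_1} = \frac{3 - 1}{10 - 5} = \frac{2}{5} = 0.4 \\
b &= Y_1 - mX_1 = 1 - 0.4 \times 5 = 1 - 2 = -1 \\
Y &= 0.4X - 1
\end{aligned}
```

Using p2 instead gives the same intercept: b = 3 − 0.4 × 10 = −1.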