CDS 130 - 003 Fall, 2010 Computing for Scientists Data Analysis (Nov. 11, 2010 – Nov. 23, 2010) Jie Zhang Copyright ©


CDS 130 - 003 Fall, 2010

Computing for Scientists

Data Analysis (Nov. 11, 2010 – Nov. 23, 2010)

Jie Zhang Copyright ©


Where are We?

1. Introduction

2. Computer Fundamentals

3. Measurements

4. Scientific Simulation – Project 1

5. Visualization

6. Data Analysis

7. Ethics

8. Communication – Project 2 with 10-minute in-class presentation


Data Analysis

Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.

--http://en.wikipedia.org/wiki/Data_analysis


Example: my research on sunspots

• Coordinates
• Areas
• Fluxes (positive, negative, unsigned)

Number of fragments: analogous to the number of sunspots.


Example: my research on sunspots

Butterfly Diagram: automated MDI AR from 1996 to 2008


Your Project

Analyze global temperature data. What can you do with the data?

http://www.globalwarmingart.com/images/a/aa/Annual_Average_Temperature_Map.jpg


Content

1. Statistical Measures

2. Histogram method

3. Regression method


Matlab Overview of Data Analysis Tools

Watch video:

http://www.mathworks.com/demos/matlab/analyzing-data-overview-matlab-video-demonstration.html


Statistical Measures


Statistical Measures

For a given distribution, whether the data are 1-D or 2-D, one can find:

• Minimum

• Maximum

• Mean

• Variance

• Standard Deviation


Minimum & Maximum

Example: a=[1,2,3;4,5,6;7,8,9]

>a=[1,2,3;4,5,6;7,8,9]
a =
   1   2   3
   4   5   6
   7   8   9

>min(a)
ans =
   1   2   3

Why return 3 different numbers, not the minimum value of 1?

Answer: the internal “min” function returns the minimum value of each column.

>a(:) %what is this?

“a(:)” converts the 2-D 3x3 array “a” into 1-D data.

>min(a(:))
ans = 1
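The column-wise behavior of “min” can be checked directly. The following is a minimal sketch using the same array; the variable names (col_min, row_min, all_min, flat) are illustrative only and do not come from the slides.

```matlab
a = [1,2,3; 4,5,6; 7,8,9];

col_min = min(a);          % minimum of each column
row_min = min(a, [], 2);   % minimum of each row (work along dimension 2)
all_min = min(a(:));       % minimum over all elements
flat    = a(:);            % 9x1 column, elements taken column by column
```

Because “a(:)” walks down the columns, flat starts with 1, 4, 7 rather than 1, 2, 3. The same calls work with “max”.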


Minimum & Maximum

Example: a=[1,2,3;4,5,6;7,8,9]

>a=[1,2,3;4,5,6;7,8,9]
a =
   1   2   3
   4   5   6
   7   8   9

>max(a)
ans =
   7   8   9

>max(a(:))
ans = 9

“max” is a Matlab internal function


Mean

The mean is the average value of a distribution:

$$\text{mean} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$$


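Mean

Example: a=[1,2,3;4,5,6;7,8,9]

>a=[1,2,3;4,5,6;7,8,9]

%find the sum of values of the data array
%“total” and “avg” are used as variable names so that the internal
%functions “sum” and “mean” are not overwritten
>total=0
>for i=[1:9]
>total = total + a(i);
>end

>avg=total/9
avg = 5

>mean(a(:))
ans = 5

“mean” is a Matlab internal function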
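The loop can also be written without explicit iteration. This is a minimal sketch on the same array; the names N, avg, and col_avg are illustrative, chosen so that no internal function name is reused as a variable.

```matlab
a = [1,2,3; 4,5,6; 7,8,9];

N       = numel(a);          % number of elements in the array
avg     = sum(a(:)) / N;     % mean of all elements
col_avg = mean(a);           % like min and max, mean works column by column
```

As with “min”, calling “mean(a)” on a 2-D array gives one value per column, so “a(:)” is needed to average over the whole array.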


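Variance and Standard Deviation

Both are measures of how spread out a distribution is. In other words, they are measures of variability.

Variance is the average squared deviation of each number from the mean (the sum is divided by N − 1 rather than N):

$$\text{variance} = \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N - 1}$$

where $\mu$ is the mean of the data.

Standard deviation is the square root of variance:

$$\text{Standard Deviation} = \sqrt{\text{variance}} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N - 1}}$$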


Variance and Standard Deviation

Calculate variance and standard deviation by ourselves

Example: a=[1,2,3;4,5,6;7,8,9]

>a=[1,2,3;4,5,6;7,8,9]
>avg=mean(a(:)) %mean of the data

%find the sum of squared deviations from the mean
>variance=0
>for i=[1:9]
>variance = variance + power((a(i)-avg),2);
>end

>variance=variance/8 %variance: divide by N-1
variance = 7.5000

>stdev=sqrt(variance) %standard deviation
stdev = 2.7386
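The same calculation can be sketched without a loop. The names x, dev, variance, and stdev are illustrative; “stdev” is used instead of “std” so that the internal function is not overwritten.

```matlab
a = [1,2,3; 4,5,6; 7,8,9];

x        = a(:);                      % flatten to 1-D
N        = numel(x);
dev      = x - mean(x);               % deviation of each value from the mean
variance = sum(dev.^2) / (N - 1);     % sum of squared deviations over N-1
stdev    = sqrt(variance);            % standard deviation
```

Here the squared deviations add up to 60, so the variance is 60/8 = 7.5, matching the loop version.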


Variance and Standard Deviation

Using internal Matlab functions:

“var”: for variance

“std”: for standard deviation

Example: a=[1,2,3;4,5,6;7,8,9]

>a=[1,2,3;4,5,6;7,8,9]

>var(a(:))
ans = 7.5000

>std(a(:))
ans = 2.7386


Statistical Measures

Group Exercise:

What are the statistical measures of the following 1-D data array:

a=[1.2, 0.5, 0.9, 2.4, 1.8, 3.1, 2.2, 1.0]


Input Data in Matlab

“temperature.dat”

% Hourly temperature predicted in Fairfax VA (22030)
% on Nov. 10, 2010 (Wednesday) for 24 hours
% from 1 AM to midnight
1 45
2 44
3 43
4 42
5 41
6 40
7 40
8 43
9 46
10 49
11 53
12 55
13 56
14 57
15 57
16 56
17 53
18 49
19 46
20 44
21 43
22 42
23 41
24 41


Input Data in Matlab

Watch Online Video:

http://www.mathworks.com/demos/matlab/importing-data-from-files-matlab-video-tutorial.html


Input Data in Matlab

“dlmread”: internal Matlab file input function

• Read ASCII-delimited file of numeric data into matrix

• Syntax: M = dlmread(filename, delimiter, R, C)

• delimiter: e.g., comma ',', space ' ', colon ':'

• R, C: specify the row and column where the upper left corner of the data lies in the file (both count from zero, so R=3 skips the three header lines of “temperature.dat”)

>data=dlmread('temperature.dat',' ',3,0)
>plot(data(:,1),'*')
>plot(data(:,2),'*')
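To try “dlmread” without downloading anything, one can write a small file first and read it back. This is only a sketch: the file name demo_temperature.dat is hypothetical, the three header lines are placeholders, and the three data rows are the first three rows of “temperature.dat”.

```matlab
% Write a small test file: three header lines, then "hour temperature" rows
fname = 'demo_temperature.dat';          % hypothetical file name
fid = fopen(fname, 'w');
fprintf(fid, '%% header line 1\n%% header line 2\n%% header line 3\n');
fprintf(fid, '%d %d\n', [1 45; 2 44; 3 43]');   % transpose so rows print in order
fclose(fid);

% R = 3 skips the three header lines; C = 0 starts at the first column
data = dlmread(fname, ' ', 3, 0);
hour = data(:,1);                        % first column: hour of the day
temp = data(:,2);                        % second column: temperature

delete(fname);                           % remove the test file again
```

Once the data are in an array, the statistical measures from the previous slides apply directly, e.g. min(temp), max(temp), mean(temp).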


Statistical Measures

Group Exercise:

1. Download the temperature data from: “http://solar.gmu.edu/teaching/2010_CDS130/notes/data_analysis/temperature.dat”

2. Store the data into your working directory

3. Read the data into Matlab/Octave using “dlmread.m”

4. Calculate the statistical measures of the hourly temperature data


Histogram


Histogram

A histogram is a summary graph showing the frequency distribution of data over various data ranges.

• Bin: the data range over which data points are counted, also called a “group”

• Frequency: the number of data points in each bin

• A histogram is a kind of bar plot, since each data bin is represented by one bar in the graph


Histogram: Example

Histogram of scores of Prof. Zhang’s 2007 ASTRONOMY 111 – 003 Class. What do you learn?

Bin size: 10


Histogram: Example

Raw data: “ASTR111_003_2007_grad.txt” in text file

ASTR111-003, Astronomy, 2007, Instructor: Prof. Jie Zhang
ID Grade
10000001 12
10000002 81
10000003 80
10000004 60
10000005 74
10000006 19
……
10000128 68
10000129 50
10000130 64
10000131 83
10000132 86


Histogram: Example

Read in the raw data and count the frequency

• Obtain the data

%column 1: student ID; column 2: student grade
a=dlmread('ASTR111_003_2007_grad.txt','',2,0);

%obtain the grade from the second column
grad=a(:,2);

• Count the frequency in each data bin

%count the frequency
freq=[0,0,0,0,0,0,0,0,0,0]; %initialize the frequency distribution
for i=[1:132]
  %bin 1: grade from 0 to 10
  if ((grad(i) >= 0) && (grad(i) < 10)), freq(1)=freq(1)+1; end
  if ((grad(i) >= 10) && (grad(i) < 20)), freq(2)=freq(2)+1; end
  if ((grad(i) >= 20) && (grad(i) < 30)), freq(3)=freq(3)+1; end
  if ((grad(i) >= 30) && (grad(i) < 40)), freq(4)=freq(4)+1; end
  if ((grad(i) >= 40) && (grad(i) < 50)), freq(5)=freq(5)+1; end
  if ((grad(i) >= 50) && (grad(i) < 60)), freq(6)=freq(6)+1; end
  if ((grad(i) >= 60) && (grad(i) < 70)), freq(7)=freq(7)+1; end
  if ((grad(i) >= 70) && (grad(i) < 80)), freq(8)=freq(8)+1; end
  if ((grad(i) >= 80) && (grad(i) < 90)), freq(9)=freq(9)+1; end
  %bin 10 also includes a grade of exactly 100
  if ((grad(i) >= 90) && (grad(i) <= 100)), freq(10)=freq(10)+1; end
end
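The ten “if” lines can be replaced by computing the bin number directly from the grade. This is a sketch under the assumption that every grade lies between 0 and 100; the sample array holds the first six grades shown in the raw data plus one hypothetical grade of 100 to exercise the last bin.

```matlab
% First six grades from the raw data, plus a hypothetical 100
grad = [12; 81; 80; 60; 74; 19; 100];

freq = zeros(1, 10);                 % one counter per bin of width 10
for i = 1:numel(grad)
    k = floor(grad(i) / 10) + 1;     % 0-9 -> bin 1, 10-19 -> bin 2, ...
    k = min(k, 10);                  % a grade of exactly 100 goes in bin 10
    freq(k) = freq(k) + 1;
end
```

The counts always add up to the number of grades, which is a quick check that no data point was missed or counted twice.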


Histogram: Example

Frequency Table

Bin       Frequency
0-10      0
10-20     4
20-30     3
30-40     4
40-50     5
50-60     14
60-70     31
70-80     33
80-90     33
90-100    5


“hist”

“hist” is a Matlab internal function

%grad: the grade data
>hist(grad) %default: divide data into 10 bins

>hist(grad, 20) %specify the number of bins, e.g., 20 bins

>bin=[0:10:100] %specify the bin size and data range of counting
>hist(grad, bin) %bin here is a 1-D data array that indicates the centers of the bins

>bin=[5:10:95] %centers at [5, 15, 25, ...]
>hist(grad, bin) %start points of bars [0, 10, 20, ...]
                 %end points of bars [10, 20, 30, ...]

>a=hist(grad) %“a”: return the values of the frequency distribution
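When “hist” is called with output arguments, it returns the counts instead of drawing the plot. A minimal sketch, using six grades taken from the raw-data slide (chosen so that none falls exactly on a bin boundary); the names centers, counts, and xout are illustrative.

```matlab
grad    = [12, 81, 74, 19, 68, 86];   % six grades from the raw-data slide
centers = 5:10:95;                    % bin centers 5, 15, ..., 95

% counts(k) is the number of grades in the bin centered at xout(k)
[counts, xout] = hist(grad, centers);
```

The returned values can then be plotted with bar(xout, counts), which should give the same picture as hist(grad, centers).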


Histogram: example

Group activity: show histogram of random numbers

(1) Generate 10000 random numbers from 0 to 1
(2) Plot histogram with 10 bins
(3) How many numbers are between [0.4, 0.5]?
(4) Plot histogram with 50 bins
(5) Run “hist(rand(10,1),10)” multiple times
(6) Run “hist(rand(10000,1),10)” multiple times
(7) Does the histogram make sense regarding the distribution of the random numbers?


Histogram: Example

Histogram of Fairfax Temperature data.

How do we interpret the data?

>temp=dlmread('temperature.dat','',3,1) %obtain the temperature data

>hist(temp)


Matlab “if” statement

Syntax: if expression, statements, end

• “expression”: consists of variables joined by relational operators, and returns a logical value

• Logical value: 1 – true; 0 – false

• If “true”, the statements are executed

• If “false”, the statements are not executed

• Relational operators: “<”, “>”, “==”, “<=”, “>=”

• Logical operators: “&&” – and; “||” – or


Matlab “if” statement

>3>2
ans = 1

>3<2
ans = 0

>3==3
ans = 1

>3==2
ans = 0

>score = 74

>if (score >= 70), freq=1, end

>if (score >=70 && score <80), freq=1, end
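When several ranges have to be tested in turn, the “if” statement can be extended with “elseif” and “else” branches. The following is a small sketch; the letter-grade cutoffs are hypothetical and only serve to illustrate the syntax.

```matlab
score = 74;

% Branches are tested from top to bottom; only the first true one runs
if (score >= 90)
    grade = 'A';
elseif (score >= 80)
    grade = 'B';
elseif (score >= 70)
    grade = 'C';
else
    grade = 'below C';
end
```

Because the branches are tested in order, “score >= 70” is only reached when the score is below 80, so no “&&” is needed here.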


Regression


Regression

Regression, or correlation, refers to the data analysis method used to find the relationship that might exist between two quantities.

For example: one measures the height and weight for a group of people.

What are the data obtained? X, Y

What kind of useful analysis can you do with the data?

What can you do with the methods you’ve learned already?

What else can you do?


Example

http://www.upa.pdx.edu/IOA/newsom/pa551/lectur19.htm


The Objectives

• The input data are a pair of quantities
  • x=[x1,x2,x3,x4,x5…]
  • y=[y1,y2,y3,y4,y5…]

• To quantify the relation between the two quantities

• Objective one is to find the regression line that best fits the data -> the equation of the line

• Objective two is to determine how well the line fits the data -> the correlation coefficient R


Exp: Income and Education

Fifteen people are surveyed. The following table shows the data.

Participant   Income ($)   Education (years)
1             125000       19
2             100000       20
3             90000        16
4             75000        16
5             100000       18
6             29000        12
7             35000        14
8             24000        12
9             50000        16
10            60000        17
11            30000        13
12            60000        15
13            56000        14
14            35000        12
15            90000        18


Exp: Income and Education

Load the data

>pwd % path of my current file folder
ans = C:\Octave\3.2.4_gcc-4.4.0\bin

>dir % directory content
>cd('~/matlab') %change my working directory
                % “~” points to my default home directory

>data = dlmread('education_income.dat','',3,0) % load data from computer hard drive to memory

>data %make sure that you have the right data
>size(data) %find the dimension and size of data
ans = 15 3

%parse the data to x array and y array
>x=data(:,3) % education, all rows and 3rd column
>y=data(:,2) %income, all rows and 2nd column


Exp: Income and Education

Make an appropriate scatter plot with labels

>plot(x,y,'*') %okay but not good. Some points on the axis

>axis([10,22,10000,150000]) %change the plotting range

>xlabel('Education') %okay. But font size is small
>xlabel('Education','FontSize',20) %change to a larger font size

>ylabel('Income','FontSize',20)

>title('Income Education Analysis')

>print -dpng income_no_fit.png


Exp: Income and Education


Mathematic Principle

The least squares principle is used in the fitting.

We fit the scattered data points to a linear equation:

$$y = ax + b$$

a: slope
b: y-intercept

Minimize the following “sum square” parameter to obtain the fitting coefficients “a” and “b”:

$$SS = \sum_{i=1}^{N} \left( y_{fit}(i) - y_{obs}(i) \right)^2$$

$$y_{fit}(i) = a\,x(i) + b$$

$y_{obs}(i)$: from the input data
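Setting the derivatives of SS with respect to “a” and “b” to zero gives closed-form expressions for both coefficients. The sketch below applies them to the income and education data from the earlier table; the variable names (xm, ym, a, b, SS) are illustrative and not taken from the slides.

```matlab
% Data from the income/education table
x = [19 20 16 16 18 12 14 12 16 17 13 15 14 12 18];            % education (years)
y = [125 100 90 75 100 29 35 24 50 60 30 60 56 35 90] * 1000;  % income ($)

xm = mean(x);
ym = mean(y);

% Least-squares slope and y-intercept of the line y = a*x + b
a = sum((x - xm) .* (y - ym)) / sum((x - xm).^2);
b = ym - a * xm;

% Value of the minimized "sum square" parameter
SS = sum((a*x + b - y).^2);
```

The slope comes out near 10889 and the intercept near -104490, which can be compared against the result of polyfit(x,y,1).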


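Matlab: “polyfit”

“polyfit.m”: internal Matlab function that fits a polynomial curve to a pair of data arrays

>p=polyfit(x,y,1)

$$p(x) = p_0 + p_1 x + p_2 x^2 + \cdots + p_n x^n$$

n: degree of fitting
n = 1: linear fitting
n = 2: quadratic fitting
n = 3: cubic fitting

Note that “polyfit” returns the coefficients in descending powers of x, so for a linear fit p(1) is the slope and p(2) is the y-intercept.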
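A quick way to see the coefficient order is to fit data generated from a known polynomial. This is a sketch with hypothetical, noise-free data; “polyval” is the companion internal function that evaluates a polynomial from its coefficient array.

```matlab
x = 0:5;
y = 2*x.^2 - 3*x + 1;       % hypothetical data from a known quadratic

p     = polyfit(x, y, 2);   % n = 2: quadratic fitting, highest power first
y_fit = polyval(p, x);      % evaluate the fitted polynomial at x
```

Since the data contain no noise, p should be very close to [2 -3 1] and y_fit should reproduce y up to rounding error.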


Linear Regression

>p = polyfit(x,y,1) %x: x data
                    %y: y data
                    %1: degree of fitting, n=1 for linear
                    %p: fitting coefficients in descending power

p = 1.0889e+004 -1.0449e+005

%p(1)=10889: the slope, the “a” we are looking for
%p(2)=-104490: the y-intercept, the “b” we are looking for

It means that the linear fitting function is:

$$y = ax + b = 10889\,x - 104490$$


Linear Regression: plot

>p(1) % verify the data
>p(2) % verify the data

>x_fit=[10:24] %define the x values of the fitted line
>y_fit=p(1)*x_fit+p(2) %obtain the fitted y-values

>plot(x_fit,y_fit) %plot the line

%add the data points
>hold on
>plot(x,y,'*')


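Correlation Coefficient

The correlation coefficient, the R value, characterizes how well the data can be fitted by a linear function.

R = 0: no correlation at all
R = 1.0: perfect correlation

$$R = \frac{ss_{xy}}{\sqrt{ss_{xx}\,ss_{yy}}}$$

$$ss_{xx} = \sum_i (x_i - \bar{x})^2$$

$$ss_{yy} = \sum_i (y_i - \bar{y})^2$$

$$ss_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$$

where $\bar{x}$ and $\bar{y}$ are the means of x and y.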
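The formula can be evaluated step by step. The sketch below uses the income and education data from the earlier table; the variable names mirror the symbols in the formula and are otherwise illustrative.

```matlab
% Data from the income/education table
x = [19 20 16 16 18 12 14 12 16 17 13 15 14 12 18];            % education (years)
y = [125 100 90 75 100 29 35 24 50 60 30 60 56 35 90] * 1000;  % income ($)

ss_xx = sum((x - mean(x)).^2);
ss_yy = sum((y - mean(y)).^2);
ss_xy = sum((x - mean(x)) .* (y - mean(y)));

R = ss_xy / sqrt(ss_xx * ss_yy);   % correlation coefficient
```

The result is about 0.909. Note that, depending on the Matlab or Octave version, corrcoef(x,y) may return a 2x2 matrix rather than a single number; in that case the off-diagonal elements hold R.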


Correlation Coefficient

“corrcoef.m”: the internal function for the correlation coefficient

>R=corrcoef(x,y)
R = 0.90891


Group Exercise

Perform a linear regression analysis on the temperature-latitude data: find the linear equation and R, and plot the line along with the data points.

Latitude (degree)   Temperature (Celsius)
0                   32
10                  26
20                  20
30                  12
40                  5
50                  -1
60                  -9
70                  -14
80                  -20
90                  -25


Group Exercise

2-D data array. Generate a random 2-D array of size 180 X 90

Find the average value along row 20

Find the average value along column 40

Find the value at row 50, column 30


The End