Descriptive statistics
A means of organizing and summarizing observations
An overview of the general features of a data set
For asking questions!!
2
T1BS
Descriptive Statistics
Data Presentation
  Types of numerical data
  Tables
  Graphs
Numerical Summary Measures
  Measures of central tendency
  Measures of dispersion
  Grouped data

Ex1: sexual dysfunction 1
Ex1: sexual dysfunction 2 (X1, Y1)
Ex1: sexual dysfunction 3 (X1, Y1)
Ex1: sexual dysfunction 4 (X2, Y1)
Ex1: sexual dysfunction 5 (Y1, Y2)
Table ruling conventions:
  Table header: long thick rule
  Table header: long thin rule
  Between groups: long thin rule
  Within groups: short thin rule
  Table footer: long thin rule

Measurement
"All science is measurement" (Helmholtz, German physiologist)
Good measurement
  Great scientific breakthroughs are always about improvements in measurement
  To do everything well is to measure it well
Definition
  Definition of variables
  Definition of denominator and numerator

Types of data measurements
Nominal scale (=, ≠): race, gender, etc. Also called categorical data or qualitative scale
Ordinal scale (=, ≠, >, <): e.g. your satisfaction score about YMU
Numerical scale:
  Interval scale (=, ≠, >, <, +, −)
    Temperature: zero does not mean absence of the quantity being measured
  Ratio scale (=, ≠, >, <, +, −, ×, ÷)
    Weight: zero means absence of the quantity being measured

Data structure
ID   Y1    Y2   X1   X2   …
1    110   1    1    20
2    140   1    1    25
3    124   2    0    30
4    110   0    0    21
5    100   2    1    19
6    95    0    1    23
7    90    1    0    21

Tables
Frequency distribution
  Categorical data
  Discrete or continuous data
    Data breakdown: distinct, non-overlapping intervals
    Number of intervals: improve the summary or lose information?
A general rule for table presentation: self-explanatory
  Clearly labeled: table title, columns, and units of measurement specified

Example: cholesterol levels
Table 2.7 Absolute and relative frequencies of serum cholesterol for 2294 US males, 1976-1980

               Ages 25-34                 Ages 55-64
Cholesterol    Number    Relative         Number    Relative
(mg/dL)        of men    frequency (%)    of men    frequency (%)
80-119           13        1.2               5        0.4
120-159         150       14.1              48        3.9
160-199         442       41.4             265       21.6
200-239         299       28.0             458       37.3
240-279         115       10.8             281       22.9
280-319          34        3.2             128       10.4
320-359           9        0.8              35        2.9
360-399           5        0.5               7        0.6
Total          1067      100.0            1227      100.0
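
The relative frequency column of Table 2.7 is each count divided by its column total. A minimal sketch of that arithmetic, using the counts from the table (the function and variable names are illustrative, not from the course material):

```python
# Counts of men per serum cholesterol interval (mg/dL), copied from Table 2.7.
intervals = ["80-119", "120-159", "160-199", "200-239",
             "240-279", "280-319", "320-359", "360-399"]
counts_25_34 = [13, 150, 442, 299, 115, 34, 9, 5]
counts_55_64 = [5, 48, 265, 458, 281, 128, 35, 7]


def relative_frequencies(counts):
    """Return each count as a percentage of the column total, to 1 decimal."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]


rel_25_34 = relative_frequencies(counts_25_34)
rel_55_64 = relative_frequencies(counts_55_64)

# Print the two relative frequency columns side by side.
for label, young, old in zip(intervals, rel_25_34, rel_55_64):
    print(f"{label:>8}  {young:5.1f}  {old:5.1f}")
```

Relative frequencies make the two age groups comparable even though the group sizes differ (1067 vs 1227 men).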

Examples: cholesterol levels
Table 2.8 Relative and cumulative relative frequencies of serum cholesterol for 2294 US males, 1976-1980

               Ages 25-34                      Ages 55-64
Cholesterol    Relative        Cumulative      Relative        Cumulative
(mg/dL)        frequency (%)   relative        frequency (%)   relative
                               frequency (%)                   frequency (%)
80-119           1.2             1.2             0.4             0.4
120-159         14.1            15.3             3.9             4.3
160-199         41.4            56.7            21.6            25.9
200-239         28.0            84.7            37.3            63.2
240-279         10.8            95.5            22.9            86.1
280-319          3.2            98.7            10.4            96.6
320-359          0.8            99.5             2.9            99.4
360-399          0.5           100.0             0.6           100.0
Total          100.0                           100.0
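
The cumulative columns of Table 2.8 can be reproduced from the raw counts of Table 2.7. The sketch below (names are illustrative) accumulates counts before converting to percentages. Summing the already rounded percentages would drift slightly: for ages 55-64, 86.1 + 10.4 gives 96.5, while the table reports 96.6.

```python
from itertools import accumulate

# Counts of men per cholesterol interval, from Table 2.7.
counts_25_34 = [13, 150, 442, 299, 115, 34, 9, 5]
counts_55_64 = [5, 48, 265, 458, 281, 128, 35, 7]


def cumulative_relative(counts):
    """Cumulative relative frequency (%) per interval, rounded to 1 decimal."""
    total = sum(counts)
    # accumulate() yields the running totals 13, 163, 605, ...
    return [round(100 * running / total, 1) for running in accumulate(counts)]


cum_25_34 = cumulative_relative(counts_25_34)
cum_55_64 = cumulative_relative(counts_55_64)
print(cum_25_34)
print(cum_55_64)
```

Reading across the table, 56.7% of the younger men but only 25.9% of the older men have cholesterol below 200 mg/dL.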

Graphs: Bar Charts
For nominal or ordinal data
Composition
  Horizontal axis: categories
  Vertical bar: frequency or relative frequency of each category
  Bars are separated so as not to imply continuity
Ex2: Head-turning asymmetry

Types of Numerical Data
Nominal data (the magnitude of the values does not matter)
  Sex (dichotomous, binary data): male=1; female=0
  Blood type: O=1; A=2; B=3; AB=4
Ordinal data (order matters; the magnitude of the values still does not matter)
  Level of severity: fatal=1; severe=2; moderate=3; minor=4
  Oncology Group's classification of patient performance status:
    Status  Definition
    0       Fully active
    1       Restricted in physically strenuous activity
    2       Ambulatory and capable of all self-care
    3       Capable of only limited self-care
    4       Completely disabled

Trend of reported HIV/AIDS cases in Taiwan, 1984-2009
More than 50% of newly reported cases are people who have already developed AIDS: 喀飛, standing director of the Hotline (熱線), argued with CDC officials over this on 2009/11/30
[Figure: Trend of HIV infection among Taiwanese nationals, 1984 to December 2009 (by date of diagnosis). Horizontal axis: year (CE), 1984-2009; vertical axis: number of people, 0-4000; series: infected persons, AIDS cases]

Example: Kaposi's sarcoma in AIDS patients
First 2560 AIDS patients reported to CDC, USA
Data structure: presence of sarcoma (0: absent; 1: present)
Analysis methods:
  Excel: PivotTable
  NCSS: frequency table (discrete variable)

Bar chart: Cigarette consumption
Cigarette consumption per person 18 years or older, USA, 1900-1990
Excel: column chart (the variable name for YEAR must be removed)
NCSS: percentile plot (grouping variable: YEAR)
[Figure: bar chart; horizontal axis: year, 1900-1990 in 10-year steps; vertical axis: number, 0-5000]

Graphs: Histograms
For discrete or continuous data
Composition
  Horizontal axis: the true limits of the intervals
  Vertical bar: frequency or relative frequency of each interval
  True limit: the midpoint between adjacent intervals, e.g. 119.5 mg/dL between 80-119 and 120-159
Frequency is represented by bar area, NOT bar height
Absolute and relative frequency histograms have the same shape
With unequal interval widths, bar height must vary so that bar area remains in proper proportion to frequency
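
To see why area rather than height carries the frequency, consider a hypothetical regrouping of the ages 25-34 cholesterol counts in which the three highest intervals (280-319, 320-359, 360-399) are merged into one wide interval. The merge is an assumption made only for illustration. The sketch divides each count by its interval width so that bar area stays equal to the count:

```python
# (lower true limit, upper true limit, number of men); the last bin is the
# hypothetical merge of 280-319, 320-359 and 360-399 (34 + 9 + 5 = 48 men).
bins = [
    (79.5, 119.5, 13),
    (119.5, 159.5, 150),
    (159.5, 199.5, 442),
    (199.5, 239.5, 299),
    (239.5, 279.5, 115),
    (279.5, 399.5, 48),
]


def bar_heights(bins):
    """Height = count / width, so that height * width (the area) is the count."""
    return [count / (upper - lower) for lower, upper, count in bins]


heights = bar_heights(bins)
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, bins)]

for (lower, upper, count), h in zip(bins, heights):
    print(f"{lower:6.1f}-{upper:6.1f}  count={count:4d}  height={h:.3f} men per mg/dL")
```

Drawing the merged bin with height 48 would make it look almost half as tall as the 240-279 bar (115 men), although it spreads far fewer men per mg/dL over a range three times as wide.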

Example: Cholesterol levels
Analysis methods:
  Excel: line chart
  NCSS: frequency table
    Discrete variable: cholesterol (080-119, 120-159, …)
    Frequency variable: No_ages_25_34
[Figure: horizontal axis: cholesterol intervals from 80-119 to 360-399; vertical axis: relative frequency, 0.0% to 50.0%; series: Relative freq1, Relative freq2]

Frequency Polygons
Similar to a histogram
  Points are placed at the center of each interval
  The points are connected by straight lines
Frequency representation: same as the histogram
  As the number of observations increases and the interval widths decrease, the polygon and the histogram become indistinguishable
Relative and cumulative frequency polygons
  Percentiles for describing the shape of a distribution
  Symmetric distribution
  Skewed distribution

One-Way Scatter Plots
For discrete or continuous data
Single horizontal axis
  Displays the relative position of each data point
Information and interpretation
  No information is lost
  Hard to read if many data points lie close together
[Figure: One-way scatter plot: crude death rates for 50 states and DC, USA, 1992; extremes: 391.8 (Alaska), 1214.9 (DC)]

Box Plots
A single axis
Similar to a one-way scatter plot, but displays a summary of the data
  Percentiles and quartiles (25th and 75th percentiles)
  The lower and upper sides of the box: 25th and 75th percentiles
Whisker lines
  Adjacent values: the most extreme observations that lie no more than 1.5 times the height of the box beyond either quartile
Outliers
  Data points outside the whisker lines, represented by circles

Box Plots, cont.
[Figure: box plot of crude death rates for 50 states and DC, USA, 1992; vertical axis: rate, 200.0 to 1400.0]
  25th percentile: 772
  50th percentile: 872
  75th percentile: 933
  1.5 × box height: 1.5 × (933.3 − 772.0) = 242
  Lower whisker: 772 − 242 = 530
  Upper whisker: 933 + 242 = 1175
  Outliers: Alaska (391), DC (1214)
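
The whisker arithmetic on this slide can be checked in a few lines. The quartiles and the two extreme rates are the values quoted on the slides; the variable names are illustrative. Strictly, the whiskers are drawn to the adjacent values (the most extreme observations inside these limits), which would need the full data set, so the sketch only computes the limits and flags the two quoted extremes:

```python
# Quartiles of the crude death rates, as quoted on the slide.
q1, q3 = 772.0, 933.3

box_height = q3 - q1               # interquartile range
step = 1.5 * box_height            # about 242
lower_limit = q1 - step            # about 530
upper_limit = q3 + step            # about 1175

# The two extreme observations named on the slides.
rates = {"Alaska": 391.8, "DC": 1214.9}
outliers = [name for name, rate in rates.items()
            if rate < lower_limit or rate > upper_limit]

print(round(step), round(lower_limit), round(upper_limit))
print(outliers)
```

Both Alaska and DC fall outside the limits, which is why the plot shows them as separate circles.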

Two-Way Scatter Plots
For 2 different continuous variables
Example: FEV1 and FVC (forced vital capacity)
[Figure: Two-way scatter plot: FVC vs FEV1 for 19 asthmatics; horizontal axis: FEV1 (L), 1 to 6; vertical axis: FVC (L), 1 to 5]

Line graphs
[Figure: Line graph: reported rates of malaria by year, USA, 1940-1989; horizontal axis: year, 1940 to 1990; vertical axis: reported rates per 100,000, logarithmic scale from .01 to 100; annotations: relapse of Korean veterans, returning of Vietnam veterans, immigration]
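
Numerical Summary Measures
Measures of Central Tendency
Mean
  Population: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i = \sum_i x_i P(x_i)$, or $\mu = \int x f(x)\,dx$ for a continuous variable
  Sample: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median: 50th percentile of a set of measurements
  n odd: the $\left(\frac{n+1}{2}\right)$th ordered observation
  n even: the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th ordered observations
  For ordinal and discrete/continuous data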
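
As a worked example of the mean and of the odd/even median rule, the sketch below uses the seven Y1 values from the data structure slide. The helper `median_by_rank` is illustrative; the standard library's `statistics.median` is printed alongside as a cross-check.

```python
import statistics


def median_by_rank(values):
    """Median via the rank rule: middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]          # the ((n+1)/2)th observation
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


y1 = [110, 140, 124, 110, 100, 95, 90]            # Y1 column, n = 7 (odd)
mean_y1 = sum(y1) / len(y1)
median_odd = median_by_rank(y1)
median_even = median_by_rank(y1[:4])              # first four values, n = 4 (even)

print(round(mean_y1, 2), median_odd, median_even)
print(statistics.median(y1), statistics.median(y1[:4]))
```

The single large value 140 pulls the mean (about 109.9) only slightly here, but in more skewed data the median is the more robust of the two.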

Measure of central tendency, cont.
Mode
[Figure panels]
  A: unimodal
  B: bimodal
  C: right-skewed
  D: left-skewed
  E: two distributions with identical means, medians, and modes

Measure of Dispersion
Range: Max − Min
Interquartile range: 75th − 25th percentile; the middle 50% of the observations
Variance
  Population: $\sigma^2 = \frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N} = E(x^2) - [E(x)]^2$
  Sample: $S^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$
Coefficient of variation: SD/mean
  Why use C.V.?
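
A short sketch of the sample variance, SD, and coefficient of variation, again using the Y1 values from the data structure slide (an illustrative choice). It computes the variance by the definitional and by the computational formula to show they agree. The C.V. is unit-free, which is one common answer to "Why use C.V.?": it allows variability to be compared between variables measured in different units or on very different scales.

```python
import math

y1 = [110, 140, 124, 110, 100, 95, 90]
n = len(y1)
mean = sum(y1) / n

# Definitional formula: sum of squared deviations from the mean over n - 1.
var_definitional = sum((x - mean) ** 2 for x in y1) / (n - 1)

# Computational formula: (sum of squares - n * mean^2) over n - 1.
var_computational = (sum(x * x for x in y1) - n * mean ** 2) / (n - 1)

sd = math.sqrt(var_definitional)
cv = sd / mean                     # coefficient of variation, unit-free

print(round(var_definitional, 2), round(sd, 2), round(cv, 3))
```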

Summarizing the distribution of values
Empirical Rule
  $\mu \pm 1\sigma$ covers 68.27% of the values
  $\mu \pm 2\sigma$ covers 95.45% of the values
  $z_{1-\alpha/2} = 1.96$ for $\alpha = 0.05$
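
The percentages in the empirical rule follow from the normal distribution: the probability that a normal value lies within k standard deviations of the mean is erf(k/√2). A minimal check using only the standard library (the function name is illustrative):

```python
import math


def coverage(k):
    """P(|Z| < k) for a standard normal Z."""
    return math.erf(k / math.sqrt(2))


for k in (1, 1.96, 2):
    print(f"within {k} SD: {100 * coverage(k):.2f}%")
```

This is why 1.96, rather than 2, is the multiplier used for an exact 95% interval.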
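
Grouped Data (the formulas are not important)
Grouped mean: $\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$
  $f_i$: the frequency and $m_i$: the midpoint of the $i$th interval
Grouped variance: $S^2 = \frac{\sum_{i=1}^{k}(m_i-\bar{x})^2 f_i}{\left[\sum_{i=1}^{k} f_i\right] - 1}$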
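
As an illustration of the grouped formulas, the sketch below applies them to the ages 25-34 cholesterol counts from Table 2.7. Taking the midpoint of each interval as the average of its printed limits (for example 99.5 for 80-119) is an assumption of this sketch.

```python
import math

# (lower limit, upper limit, frequency) for ages 25-34, from Table 2.7.
groups = [(80, 119, 13), (120, 159, 150), (160, 199, 442), (200, 239, 299),
          (240, 279, 115), (280, 319, 34), (320, 359, 9), (360, 399, 5)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in groups]   # m_i
freqs = [f for _, _, f in groups]                      # f_i
n = sum(freqs)

grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
grouped_var = sum((m - grouped_mean) ** 2 * f
                  for m, f in zip(midpoints, freqs)) / (n - 1)
grouped_sd = math.sqrt(grouped_var)

print(round(grouped_mean, 1), round(grouped_sd, 1))
```

The grouped results are approximations: every observation in an interval is treated as if it sat at the midpoint.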

Probability
Review of probability concepts
Central Limit Theorem

Outlines
Probability
  Law of Probability
  Independent & Exclusive
  Bayes' theorem
Probability Density Function
  Normal density function
  Characteristics of N Distribution
  Standard normal distribution

Outlines (cont.)
Population Parameters & Sample Statistics
  Parameters and Statistics
  Sampling methods
  Sampling distribution
  Central Limit Theorem
  Standard Error & Standard Deviation

Probability: key terms
Definition: an estimate of the likelihood of an event occurring
  Expressed as a fraction, a proportion, or a percent
Outcome (simple event): a single possible result of a random experiment
Event: any collection of outcomes
Sample Space (S): the set of all possible outcomes

Law of Probability
The Law of Large Numbers
  1. $\frac{f}{n} \to p(A)$ as $n \to \infty$
  2. $P(A) = \lim_{n\to\infty} \frac{\text{Frequency}}{n}$
Law of Probability
  1. P(S) = 1
  2. 0 ≤ P(Ei) ≤ 1
  3. ∑P(Ei) = 1, summed over all outcomes Ei in S
  4. P(non_A) = P(A^C) = 1 − P(A)
  5. For any events A and B:
     P(A&B) = P(A)P(B|A)
     P(A or B) = P(A) + P(B) − P(A&B)
     P(B) = P(B|A)P(A) + P(B|A^C)P(A^C)
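
These laws can be verified by enumeration on a small sample space. The sketch below uses one roll of a fair die, with A = "even" and B = "greater than 3". The example and the helper names are illustrative choices, not part of the slides; exact fractions avoid rounding issues.

```python
from fractions import Fraction

sample_space = frozenset({1, 2, 3, 4, 5, 6})   # one roll of a fair die


def prob(event):
    """P(event) when all outcomes in the sample space are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


def cond(b, a):
    """Conditional probability P(b | a)."""
    return prob(a & b) / prob(a)


A = frozenset({2, 4, 6})        # even
B = frozenset({4, 5, 6})        # greater than 3
not_A = sample_space - A

checks = {
    "P(S) = 1": prob(sample_space) == 1,
    "P(A^C) = 1 - P(A)": prob(not_A) == 1 - prob(A),
    "P(A&B) = P(A)P(B|A)": prob(A & B) == prob(A) * cond(B, A),
    "P(A or B) = P(A)+P(B)-P(A&B)": prob(A | B) == prob(A) + prob(B) - prob(A & B),
    "P(B) = P(B|A)P(A)+P(B|A^C)P(A^C)":
        prob(B) == cond(B, A) * prob(A) + cond(B, not_A) * prob(not_A),
}
for rule, holds in checks.items():
    print(rule, holds)
```

Here P(A&B) = 1/3 is not equal to P(A)P(B) = 1/4, so A and B are neither independent nor mutually exclusive.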

Independent & Exclusive
If events A and B are independent, then
  P(A∩B) = P(A)P(B)
  The product of the marginal probabilities equals the joint probability.
If A and B are mutually exclusive events, then
  P(A∩B) = 0
  P(A or B) = P(A) + P(B)
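
Bayes' theorem
Law of total probability
  $P(A) = \sum_j P(B_j)\,P(A \mid B_j)$
  [Figure: event A overlapping a partition of the sample space into B1, B2, B3, B4, B5]
Conditional law of total probability
  $P(A \mid C) = \sum_j P(B_j \mid C)\,P(A \mid B_j \cap C)$
Bayes' Theorem
  $P(B_i \mid A) = \frac{P(B_i \cap A)}{P(A)} = \frac{P(B_i)\,P(A \mid B_i)}{\sum_j P(B_j)\,P(A \mid B_j)}$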
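
A small numerical sketch of Bayes' theorem, with the law of total probability in the denominator. The scenario and all numbers are hypothetical (a disease with 1% prevalence, a test that is positive in 90% of diseased and 5% of healthy people); they are chosen only to exercise the formula.

```python
from fractions import Fraction


def posterior(priors, likelihoods, i):
    """P(B_i | A), given priors P(B_j) and likelihoods P(A | B_j)."""
    # Law of total probability: P(A) = sum_j P(B_j) P(A | B_j)
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    return priors[i] * likelihoods[i] / evidence


# Hypothetical numbers: B_1 = diseased, B_2 = healthy, A = positive test.
priors = [Fraction(1, 100), Fraction(99, 100)]
likelihoods = [Fraction(9, 10), Fraction(1, 20)]

p_disease_given_positive = posterior(priors, likelihoods, 0)
print(p_disease_given_positive, float(p_disease_given_positive))
```

Under these assumed numbers the posterior is 2/13, about 15%: even after a positive result the disease remains unlikely, because healthy people vastly outnumber diseased ones.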

Probability Density Function & Normal Distribution
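
Probability Density Function
Area under probability distribution
For continuous random variables:
  $f(x) \ge 0$ for all $x$
  $\int_{-\infty}^{\infty} f(x)\,dx = 1$
  $P(a \le x \le b) = \int_a^b f(x)\,dx$
[Figure: density curve f(x) with the area between a and b shaded]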
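
Normal density function
$x \sim N(\mu, \sigma^2)$
$Z = \frac{x-\mu}{\sigma}$
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}Z^2\right)$
[Figure: normal curve centered at μ, with a point of inflection one σ on either side of μ]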

Characteristics of N Distribution
Area under curve = 1
Symmetric about the mean
mean = median = mode
Points of inflection: μ ± σ

Standard normal distribution
Z transformation and Z value:
  $x \sim N(\mu, \sigma^2)$
  $Z = \frac{x-\mu}{\sigma} \sim N(0, 1)$
  (Observed − Expected) in terms of UNITS of SD

Population Parameters & Sample Statistics

Parameters and Statistics (Tab. 4-5, p72)
                     Parameter symbol   Statistic symbol
Mean                 μ                  X̄
Standard deviation   σ                  SD
Variance             σ²                 S²
Correlation          ρ                  r
Proportion           π                  p

Sampling methods
Simple random sampling
Stratified random sampling
Cluster sampling
Systematic sampling
Multi-stage sampling
Convenience sampling

Sampling distribution
Sampling distribution
  Occurs in REPEATED SAMPLING
  Distribution of the values of X̄ over all possible samples
  Uses sample statistics to make inferences about population parameters
Quiz: If we want to select 6 students from a class of 46 students, how many possible samples would we have?
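
One way to approach the quiz, assuming the 6 students are drawn without replacement and the order of selection does not matter, is to count combinations: C(46, 6) = 46! / (6! · 40!). A one-line check with the standard library (`math.comb`, available from Python 3.8):

```python
import math

class_size, sample_size = 46, 6

# Number of distinct unordered samples of 6 students out of 46.
n_samples = math.comb(class_size, sample_size)
print(n_samples)   # 9366819
```

With more than nine million possible samples, the sampling distribution of X̄ is the distribution of the sample mean across all of them, even though in practice only one sample is drawn.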

Central Limit Theorem
The beauty of CLT: Easy to calculate V
The ugliness of CLT: Hard to explain p
Standard Error: SD of the means
Review
  $x \sim N(\mu, \sigma^2)$
  For large n, $\bar{X}_n \sim N(\mu, \sigma^2/n)$
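
The statement "standard error = SD of the means" can be illustrated by simulation. The sketch below repeatedly draws samples of size n = 30 from an exponential distribution with mean 1 and SD 1 and compares the SD of the sample means with σ/√n. The distribution (deliberately skewed), sample size, number of repetitions, and seed are all arbitrary choices for this illustration.

```python
import math
import random
import statistics

random.seed(0)          # fixed seed so that the run is repeatable

n = 30                  # size of each sample
reps = 5000             # number of repeated samples
sigma = 1.0             # SD of an exponential distribution with rate 1

# Draw `reps` samples and keep the mean of each one.
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

empirical_se = statistics.stdev(means)      # SD of the sample means
theoretical_se = sigma / math.sqrt(n)       # sigma / sqrt(n)

print(round(statistics.fmean(means), 3))
print(round(empirical_se, 3), round(theoretical_se, 3))
```

The means cluster around the population mean 1 with a spread close to σ/√n ≈ 0.183, although the individual observations are far from normally distributed.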

History of Central Limit Theorem
De Moivre (1733): proposed the earliest version of the CLT, based on the Bernoulli distribution
Laplace (1749-1827): observed that measurement errors tend to follow a normal distribution, and extended the theorem to arbitrary values of p
  The CLT was originally called the "Law of frequency of error"
Galton (Natural Inheritance, 1889):
  The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason.
Liapounoff (1901): gave the first reasonably complete proof of the CLT
Lindeberg JW and Levy P: gave rigorous proofs of the CLT in the 1920s
  Proof for a random sample from an arbitrary distribution
Note: Einstein proposed the special theory of relativity in 1905

Standard Error & SD
SE:
  The standard error of the mean is the SD of the means in a sampling distribution.
  It tells us how much variability can be expected among means in future samples.
SD:
  The standard deviation is based on measurement of individuals.
  It tells us how much variability can be expected among individuals.
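
Homework
NCSS software practice: textbook exercise 3-2
  Reproduce textbook Table 3-23 (p58), adding a column for the standard deviation
  Students will be called on at random to do this in class next week!!
  File name: Gebel
  Research question: Gebel (1997) studied heart rate variation in 580 patients to assess possible dysfunction of the cardiovascular autonomic nervous system.
  Deep breathing increases heart rate variation, but the size of the change decreases with age
Please explain the Central Limit Theorem
  Students will be called on at random to answer in class next week!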