Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
17/08/59
1
Basic Concepts of Medical Statistics
Section for Clinical EpidemiologyFaculty of Medicine Ramathibodi Hospital
Mahidol University
Outline
• Types of data• Descriptive statistics• Inferential statistics & Hypothesis testing• Statistical analysis for categorical data• Statistical analysis for continuous data• Statistical software
17/08/59
2
Types of Data
Sample of Case Report Form from a clinical study
Perspect Clin Res. 2014 Oct‐Dec; 5(4): 159–166.
17/08/59
3
Sample of Data obtained from clinical study
Types of Data
Types of Data
Categorical (Qualitative)
Nominal Ordinal
Numerical (Quantitative)
Discrete Continuous
17/08/59
4
Categorical data
• Categories1. Nominal
• Without ‘order’• Blood group – A / B / AB / O• Sex – Male / Female• Cancer status – Have cancer / Do not have cancer
• Special case of 2 choices: binary/dichotomous data
2. Ordinal• With ‘order’ – we can see which is bigger/smaller
• Cancer staging – I / II / III / IV• Degree of disease – Mild / Moderate / Severe
Numerical data
1. Discrete• Stepping, usually without fractions
• Number of positive lymph nodes after surgery• Number of heart beats per minute
2. Continuous• Age (years)• Cholesterol level (mg/dL)
17/08/59
5
Importance of types of data
• Types of data are important in several aspects of choosing the right statistical analysis
• Summarizing• Estimation• Hypothesis testing
Post test for types of data
Data Types of data
Weight (kg)
History of smoking (Y/N)
Serum potassium level
Length of stay in hospital (day)
Patient status (dead/alive)
17/08/59
6
Post test for types of data
Data Types of data
Weight (kg)
History of smoking (Y/N)
Serum potassium level
Length of stay in hospital (day)
Patient status (dead/alive)
Numerical
Categorical (dichotomous)
Numerical
Numerical
Categorical (dichotomous)
Descriptive Statistics
17/08/59
7
Descriptive statistics
• “Summaries of data”• Methods for describing a set of data by using graphs or summary measures
• Provide an ‘overview’ of general features of a set of data
Descriptive statistics
Summarizing Data
Categorical Data
‐ Frequency (n)
‐ Percentage (%)
Numerical Data
‐Measures of central tendency
‐Measures of dispersion
17/08/59
8
Summarizing Categorical Data
• Ex1. Summarizing of sex among 70 patients
Sex Frequency Percentage
Male 56 80%
Female 14 20%
Total 70 100%
Summarizing Categorical Data
• Ex2. Summarizing of staging of cancers
Stages of cancers Frequency Percentage
I 120 15%
II 320 40%
III 160 20%
IV 200 25%
Total 800 100.0%
17/08/59
9
Summarizing Numerical Data
Summarizing Numerical
Data
Measures of Central Tendency
Mean Median
Measures of Dispersion
Standard Deviation
Range
17/08/59
10
Measures of Central Tendency
• Describe where the data are.
1. Mean2. Median
Measures of Dispersion
• Describe the spreading of data
1. Standard Deviation (SD)
2. Range, Interquartile Range (IQR)
17/08/59
11
Distribution of Data
• Normal• Central tendency: use mean• Dispersion: use standard deviation
• Non-normal distribution• Central tendency: use median• Dispersion: use range
Checking for normal distribution
There are many ways of checking for distribution of the data, such as:
• Histogram visualization• Comparison of mean and median• Comparison of mean and standard deviation
17/08/59
12
Histogram
Comparison between mean and median
• Mean 62.4• Median 30.5
• Distribution of CD4 data is skewed to the right (mean > median)
17/08/59
13
Compare mean and standard deviation
Mean SD Mean-2SD, Mean+2SD
Height (cm) 164.7 7.8 149.1, 180.3
CD4 count 62.4 74.4 -86.4, 211.2
Summarizing Numerical Data
Mean (SD)
Age (year) 49.6 (14.3)
Weight (kg) 95.6 (21.7)
Height (cm) 171.5 (9.2)
BMI 32.5 (7.1)
CD4 62.4 (74.4)
17/08/59
14
Inferential Statistics
17/08/59
15
Inferential Statistics
• Process of drawing conclusions about a population on the basis of the information contained in a sample taken from that population.
Population Parameters
Sample Statistics
Usefulness of Inferential Statistics
• Estimation of Population Parameter from Sample Statistics• Test hypothesis of Population from Sample
17/08/59
16
31
Inferential Statistics
Parameter Estimation
Point estimate
Range estimate
Hypothesis Testing
Single pop.
Two pop.
More than two pop.
Parameter estimation
• Estimation of population parameter based from sample statistics• For example:
• Estimation for continuous data• Estimation of the mean age of all patients who have breast cancer in Thailand.
• Estimation for categorical data• Estimation of the prevalence of patients with diabetes mellitus (DM) in the Thai
population.
17/08/59
17
Example: Estimation
• Researcher would like to estimate the prevalence of hypertension in the Thai population.
• A random sample of 350 was taken from the general population. Forty-two out of 350 people in the sample had hypertension.
• Therefore, the point estimate of prevalence = 42/350 = 0.12 = 12%
Confidence interval
• The main purpose of confidence intervals is to indicate the precision of the point estimates.
17/08/59
18
Confidence interval
• Prevalence of HT in 350 patients is 12% with 95% CI of 9% to 15%. • This means that the true prevalence of HT of population lies somewhere in range 47 to 62 years.
Confidence Interval
Sample Size Mean age 95% CI
30 54 15 .. 80
500 55 50 .. 58
17/08/59
19
37
Parameter estimation
Categorical data
Population proportion
Sample proportion
Continuous data
Population mean
Sample mean
Example: Prevalence of Disease
38
• Researcher would like to estimate the prevalence of hypertension in the Thai population.
• A random sample of 350 was taken from the general population. Forty-two out of 350 people in the sample had hypertension.
• Point estimate from sample proportion = 42/350 = 12%
17/08/59
20
Solution:
a) The point estimate of the proportion is 42/350=0.12 or 12%
b) The Z value for a 95% CI is 1.96.
c) The SE of sample proportion is
SE = /
SE = . . / .
. . . to . . .
That is from 0.09 to 0.15 or 9% to 15%.
95% CI for the proportion is
Point estimate
Conclusion:
• Therefore, we can be 95% confident from this sample that the true prevalence of hypertension in Thailand lies somewhere in the range from 9% to 15%.
40
17/08/59
21
41
Parameter estimation
Categorical data
Population proportion
Sample proportion
Continuous data
Population mean
Sample mean
42
-Researcher would like to estimate the mean age of patients with DM in Bangkok.-A random sample of 750 patients with DM was taken and the mean age of the patients in the sample was estimated.-The mean age of the patients in the sample was 54.42 with a standard deviation of 12.93.
Example:
17/08/59
22
43
Solution:
a) The point estimate of the mean age is 54.42. b) For a 95% CI, the value of t with 749 df is 1.647.
c) The SE of sample mean is
SE = / = . / = 0.47
Therefore, a 95% CI for the mean age is
54.42-(1.647x0.47) to 54.42+(1.647x0.47),
that is from 53.65 to 55.19.
44
Therefore, we can be 95% confident from this sample that the true mean age of patients with DM in Bangkok lies somewhere in the range 53.65 to 55.19 years.
Conclusion:
17/08/59
23
45
Recommendation
If the primary objective is to estimate the population characteristics, the point estimates should be reported with their confidence intervals to indicate their precision.
46
Inferential Statistics
Parameter Estimation
Point estimate
Range estimate
Hypothesis Testing
Single pop.
Two pop.
More than two pop.
17/08/59
24
47
According to a survey study in Thailand in 2000, the proportion of people with chronic kidney disease (CKD) is 0.091 or 9.1%. Researchers want to test if the proportion of people with CKD in this year is the same as the proportion of people with CKD in 2000 which is 0.091.
Hypothesis for single population
48
It is known that calcium supplement is associated with increased bone mineral density (BMD) in the adult population. Researchers wanted to test if the means of BMD between postmenopausal women who received and who did not receive calcium supplement are different.
Hypothesis for two populations
17/08/59
25
49
Hypothesis testing
The purpose of hypothesis testing is to make a decision about a population by examining a sample from that population.
50
Hypothesis testing
Types of hypotheses Types of error Steps for hypothesis testing
Figure 9 Flow chart for hypothesis testing
17/08/59
26
51
Types of hypotheses
- Null hypothesis (H0) :It is a statement that a population parameter is assumed to be true or there is no difference between groups.
- Alternative hypothesis (Ha):It is a statement that is opposed to the null hypothesis.
52
Null hypothesisHo: μBMD(calcium+) = μBMD(calcium-)
Alternative hypothesisHa: μBMD(calcium+) ≠ μBMD(calcium-)
For example:
17/08/59
27
53
Hypothesis testing
Types of hypotheses Types of error Steps for hypothesis testing
Figure 10 Flow chart for hypothesis testing
54
Types of Errors
There are two possible types of errors that may occur in hypothesis testing:
- Type I error- Type II error
17/08/59
28
55
Type I error () is the probability of deciding that the means of BMD between women who receive and who did not receive calcium supplement are different, when in fact in the population, the means of BMD between these two groups are not different.
Null hypothesisHo: μBMD(calcium+) = μBMD(calcium-)
Alternative hypothesisHa: μBMD(calcium+) ≠ μBMD(calcium-)
56
Type II error (β) is the probability of deciding that the means of BMD between women who receive and who did not receive calcium supplement are not different, when in fact in the population, the means of BMD between these two groups are different.
Null hypothesisHo: μBMD(calcium+) = μBMD(calcium-)
Alternative hypothesisHa: μBMD(calcium+) ≠ μBMD(calcium-)
17/08/59
29
57
Ho: μBMD(calcium+) = μBMD(calcium-)
Ha: μBMD(calcium+) ≠ μBMD(calcium-)
Statistical Decision In PopulationBased on Sample Ho True Ho False-----------------------------------------------------------Reject H0 1-
(Type I error) (Power of test)
Do not reject H0 1- (Confidence) (Type II error)
-----------------------------------------------------------
58
-----------------------------------------------------------Actual condition of population
Test result Without disease With disease-----------------------------------------------------------Positive 1-
(False positive) (True positive/sensitivity)
Negative 1- (True negative/ (False negative)
specificity)-----------------------------------------------------------
Diagnostic test
17/08/59
30
59
Hypothesis testing
Types of hypotheses Types of error Steps for hypothesis testing
Figure 11 Flow chart for hypothesis testing
Steps for Hypothesis testing
Step 1 Generate null and alternative hypotheses
Null hypothesisHo: μBMD(calcium+) = μBMD(calcium-)
Alternative hypothesisHa: μBMD(calcium+) ≠ μBMD(calcium-)
60
17/08/59
31
Steps for Hypothesis testing
Step 2 Determine the significance level
The commonly used value of the significance level is 0.05. Usually this value does not exceed 0.10.
61
Steps for Hypothesis testing
Step 3 Select appropriate statistics
Depends on : • Objective of analysis• Type of data• Number of groups of samples• Dependent or independent groups of samples
62
17/08/59
32
Steps for Hypothesis testing
Step 4 Calculate the test statistic
The general form of the test statistic can be expressed as:
63
Steps for Hypothesis testing
Step 5 Convert the test statistic to p value
The p value is the probability of obtaining our observed data (or more extreme data) when the null hypothesis is true.
64
17/08/59
33
Steps for Hypothesis testing
Step 6 Draw a conclusion
65
If the p value is less than or equal to the significance level, we reject the null hypothesis.
If the p value is greater than the significance level, we do not reject the null hypothesis.
66
For example:
Null hypothesisHo: μBMD(calcium+) = μBMD(calcium-)
Alternative hypothesisHa: μBMD(calcium+) ≠ μBMD(calcium-)
Significance level P value Decision
0.05 0.30
0.05 0.01
Fail to reject
Reject
17/08/59
34
67
Hypothesis testing for two pop.
Paired samples
Independent samples
68
Paired Samples
Paired samples arise in a number of ways:When the same subjects are observed two times, usually at different times or in different circumstances.For example, we wanted to compare the SBP before and after use of oral contraceptive (OC) in pre-menopausal women.
17/08/59
35
69
Independent SamplesIndependent samples arise when each set of observations is made on different groups of subjects. Examples for two independent samples:
- We want to compare the systolic blood pressure between men and women.
- We want to compare the cholesterol levels between the groups of patients with and without chronic kidney disease (CKD).
70
Statistical analysis for numerical data
Normal distribution Non-normal distribution
Parametric methods Nonparametric methods
17/08/59
36
71
Distribution Parameter Statistical test
Normal
Non-normal
Statistical test for hypothesis testing of paired samples
Mean
Median
Paired t-test
Wilcoxon matched signed-rank test
72
Researchers wanted to test if the mean weights of HIV patients before and after 12 weeks of receiving an antiretroviral therapy regimen are different.
Class example I:
From a study of Monosuthi and et al.
17/08/59
37
Selection an appropriate statistical test
73
Question Answer
No. of samples 2 samples
Characteristic of samples Paired samples
Distribution of data Normal distribution
Parameter of interest Mean
Statistical test Paired t-test
74
Draw a conclusion
Conclusion, the mean weights of HIV patients before and after receiving antiretroviral therapy are not equal.
The p value is less than 0.001.
H0: μbefore=μafter
HA: μbefore≠μafter
So we reject the null hypothesis.
17/08/59
38
75
Distribution Parameter Statistical test
Normal
Non-normal
Statistical test for hypothesis testing of two independent groups
Mean
Median
Student t-test
Mann-Whiney test
76
Figure 5 Flow chart of hypothesis testing for independent means by using student t-test
Student t-test
Equal variances
t-test with equal variances
Unequal variances
t-test with unequal variances
17/08/59
39
77
Equal variances
Unequal variances
78
A preliminary test for difference of variances should be done, before hypothesis testing for two independent population means, is carried out. The hypothesis testing for difference of variances between two population is based upon the F test or variance ratio test.
Recommendation
17/08/59
40
79
Class example II:
Researchers wanted to test if the means/medians of weights of HIV patients who received NVP, and HIV patients who received EFV, are different.
From a study of Monosuthi and et al.
80
Question Answer
No. of samples 2 samples
Characteristic of samples Independent samples
Distribution of data Normal distribution
Parameter of interest Mean
Statistical test Student t-test
Selection an appropriate statistical test
17/08/59
41
Two types of t-test
• Equal variance• Non-equal variance
• To check for variance between groups, we usually use variance ratio test
82
Draw a conclusion
Conclusion, the variances of weights of patients between the NVP and the EFV groups are equal.
The p value for this example is 0.095.
Therefore, we cannot reject the null hypothesis.
H0: σ2NVP=σ2
EFV
HA: σ2NVP≠σ2
EFV
17/08/59
42
83
Select an appropriate statistical test
Question Answer
No. of samples 2 samples
Characteristic of samples Independent samples
Distribution of data Normal distribution
Statistical test Student t-test
Variances between 2 groups Equal
Statistical test Student t-test with equal variance
84
Draw a conclusion
Conclusion, the mean weight of the NVP group is equal to the mean weight of the EFV group.
The p value in this example is 0.42.
So, we cannot reject the null hypothesis.
H0: μNVP=μEFV
HA: μNVP≠μEFV
17/08/59
43
85
Distribution Parameter Condition Statistical test
Normal
Non-normal
Summary of hypothesis testing for two groups
Mean
Mean
Median
Median
Paired
Paired
Independent
Independent
Paired t-test
Student t-test
Wilcoxon matched signed-rank test
Mann-Whitney test
86
Summary of selection the appropriate statistical test
Types of data: categorical or numerical data
For numerical data: Distribution of data: normal or non-normal
Number groups of samples: one, two, …
For two or more groups:Characteristics: dependent or independent
17/08/59
44
87
Hypothesis testing
Continuous data
Single pop.
Two pop.
Categorical data
Single pop.
Two pop.
Figure 1 Flow chart for hypothesis testing
88
Figure 2 Flow chart of statistical analysis for categorical data in two populations
Statistical analysis for categorical data in two pop.
Hypothesis testing
Independent samples
Paired samples
Estimation of strength of association
Independent samples
Paired samples
17/08/59
45
89
Statistical analysis
For independent samples, the Chi-square test is conducted to examine the association between two categorical variables.H0: Two categorical variables are independent.
orH0: The proportions of the interested event between two independent groups are not different.
90
Independent Samples- A case-control study was conducted to look at the effects of traditional medicine used on the risk of hip fracture. - Researchers wanted to assess the association between traditional medicine used and the risk of osteoporotic hip fracture.
17/08/59
46
Pearson's chi-squared test (χ2)
• ∑
• Oi = Observed frequency• Ei = Expected (theoretical) frequency• n = Number of cells in the table
92
Table 1. The observed frequencies of the 2x2 table for assessing the association between traditional medicine used and the risk of hip fracture.
Traditional medicine users
Hip fracture Total
Yes No
Yes 20 8 28
No 208 216 424
Total 228 224 452
17/08/59
47
Traditional medicine users
Hip fracture Total
Yes No
Yes 28
No 424
Total 228 224 452
228452
28452
• Expected Probability of being in
this cell =
• Expected Frequency of this cell =
452
94
Steps for Hypothesis Testing
1. Generate the null hypothesis as follows:
H0: Traditional medicine used is not associated with the risk of hip fracture.
or H0: Proportions of traditional medicine used between two groups of patients, with and without hip fracture are not different.
17/08/59
48
95
Steps for Hypothesis Testing
2. Generate the alternative hypothesis as follows:
Ha: Traditional medicine used is associated with the risk of hip fracture.
or Ha: Proportions of traditional medicine used between two groups of patients, with and without hip fracture are different.
96
Steps for Hypothesis Testing
3. Set the level of significance
4. Select an appropriate statistical test
We follow the standard convention and set the level of significance to 0.05.
We are testing a hypothesis about the association between two categorical variables, and the samples are independent, so a statistical test for this case is the Chi-square test.
17/08/59
49
97
. tabulate tredmed hip, chi2 column exact expected
+--------------------+| Key ||--------------------|| frequency || expected frequency || column percentage |+--------------------+
traditiona | hip fracturel medicine | no yes | Total-----------+----------------------+---------- yes | 8 20 | 28
| 13.9 14.1 | 28.0 | 3.57 8.77 | 6.19
-----------+----------------------+---------- no | 216 208 | 424
| 210.1 213.9 | 424.0 | 96.43 91.23 | 93.81
-----------+----------------------+----------Total | 224 228 | 452
| 224.0 228.0 | 452.0 | 100.00 100.00 | 100.00
Pearson chi2(1) = 5.2588 Pr = 0.022
Result of the chi-square test:
98
Steps for Hypothesis Testing5. Draw a conclusion
The p value is equal to 0.02, thus, we reject the null hypothesis and conclude that traditional medicine used is significantly associated with the risk of hip fracture. In other words, the proportions of traditional medicine used between two groups of patients, with and without hip fracture are different.
17/08/59
50
99
Test for independence – small samples
• The Chi-square test is not an appropriate method of test for independence if the expected value is less than 5 for more than 20% of the total cells.• The Fisher’s exact test is an alternative method of test for independence when the requirements of the Chi-square test are not met.
100
The expected frequency of ever used OC and have cervical cancer (n11) can be calculated as:
OC used Cervical cancer Total
Yes No
Ever used 68 150 218
Never used 254 875 1129
Total 322 1025 1347
17/08/59
51
101
The expected frequency of ever used OC and did not have cervical cancer (n12) can be calculated as:
OC used Cervical cancer Total
Yes No
Ever used 68 150 218
Never used 254 875 1129
Total 322 1025 1347
102
HRT Hip fracture Total
Yes No
Yes 1 2 3
No 213 214 427
Total 214 216 430
Table 2. The observed frequencies of the 2x2 table for assessing the association between HRT and the risk of hip fracture.
17/08/59
52
103
HRT Hip fracture Total
Yes No
Yes 1.5 1.5 3
No 212.5 214.5 427
Total 214 216 430
Table 3. The expected frequencies of the 2x2 table for assessing the association between HRT and the risk of hip fracture.
104
%Expected freq. less than 5 Statistical test
Greater than 20% Fisher’s exact test
Less than or equal to 20% Chi-square test
Statistical tests for hypothesis testing about the association between two categorical variables
17/08/59
53
105
Figure 3 Flow chart of statistical analysis for categorical data in two populations
Statistical analysis for categorical data in two pop.
Hypothesis testing
Independent samples
Paired samples
Estimation of strength of association
Independent samples
Paired samples
106
Estimation of the strength of association
Risk ratio (RR) Odds ratio (OR)
Figure 4 Types of the strength of association
17/08/59
54
107
Risk ratio (RR)
- It is a ratio of the proportions of disease in exposed versus non-exposed groups.- It is often used in a cohort study and randomized controlled trial.
108
From the 2x2 contingency table, the risk ratio can be calculated as:
Exp Non-Exp Total
Case a b a+b
Control c d c+d
Total a+c b+d a+b+c+d
17/08/59
55
109
VL<400 copies Treatment Total
EFV NVP
Yes 7 1 8
No 17 29 46
Total 24 30 54
The relative risk can be calculated as:
110
The 95%CI can be calculated as:
95%CI=1.15, 66.32
VL<400 copies Treatment Total
EFV NVP
Yes 7 1 8
No 17 29 46
Total 24 30 54
17/08/59
56
111
The RR is equal to 8.75 with 95% CI 1.15 to 66.32. It means that patients who received EFV had 8.75 times the virologicalsuccess as compared with patients who received NVP with a 95% confidence that it would lie between 1.15 to 66.32 times.
Draw a conclusion
112
Odds ratio (OR)
It is often used in a case-control study and cross-sectionalstudy.
odds ratio is a ratio of the odds of having disease in exposed group versus the odds of having disease in non-exposed groups.
17/08/59
57
113
Exp Non-Exp Total
Case a b a+b
Control c d c+d
Total a+c b+d a+b+c+d
Therefore, the odds ratio can be calculated as:
114
Table 4. 2x2 contingency table for estimation of the odds ratio of the hip fracture and traditional medicine used
Hip fracture Traditional medicine Total
Yes No
Yes 20 208 228
No 8 216 224
Total 28 424 452
17/08/59
58
115
The odds ratio can be calculated as:
The 95% CI can be calculated as:
116
The OR is equal to 2.60 with 95% CI 1.06 to 6.96. It means that hip fracture is 2.60 times as likely to occur among patients who used traditional medicine than patients who did not use traditional medicine with a 95% confidence that it would lie between 1.06 to 6.96 times.
Draw a conclusion
17/08/59
59
Statistical Software
Name Website Price Features Ease of use
Note
SPSS http://www.ibm.com/software/analytics/spss/
$$$$$ ++++ ++++ Need to purchase separate modules for complicated analyses (such as Survival Analysis)Available from MU (http://softwaredownload.mahidol/)
Stata http://www.stata.com/
$$$$ ++++ +++ Ramathibodi access (CEB server)
R http://www.r‐project.org/
(Free) +++ + R‐commander is nice add on
SAS http://www.sas.com/ $$$$$ ++++ 0 Need programming skill
Excel ‐ $$ + ++++ Not really a statistical software, but can do many statistical calculations
Stata
17/08/59
60
SPSS
17/08/59
61
Statistical Model/Regression, Logistic Regression
• Statistical Model =