12 Data Analysis R_eng


  • 7/24/2019 12 Data Analysis R_eng

    1/31

    1

    Data analysis using R: Lesson1

    You need to write commands after the prompt (>).

    If the prompt cannot be seen, press Esc key to reset.

    If you want to repeat the commands which you used, press PgUp key.

    R commands are highlighted in yellow.

    Simple calculation

log₁₀ 10

    log10(10)

logₑ 10

    log(10)

    3*5

    4/5

    3*5;4/5;3-7

quotient of 119 ÷ 13

    119%/%13

remainder of 119 ÷ 13

    119%%13

take an arithmetic mean of 145, 287, 210, 189, 204

    y


median of the consecutive integers 10, 11, 12, 13, 14, 15, 16

    median(y2)

standard deviation of the consecutive integers 10, 11, 12, 13, 14, 15, 16

    sd(y2)
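The command defining y2 does not survive in the text; a minimal sketch, assuming y2 holds the consecutive integers 10 to 16 named above:

```r
# y2 is assumed to be the consecutive integers 10 to 16
y2 <- 10:16
mean(y2)    # arithmetic mean
median(y2)  # median of y2 is 13
sd(y2)      # standard deviation
```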

    How to read your data set (import)

copy & paste: from an Excel file

    dat
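The import command is cut off in the source; a common approach on Windows, shown here as an assumption (copy the cell range in Excel, then read the clipboard into a data frame named dat):

```r
# read tab-separated data copied from Excel via the Windows clipboard;
# 'dat' is the data-frame name used later in the lesson
dat <- read.delim("clipboard")
attach(dat)   # so that variables such as age and sbp can be used directly
```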


2) To see the variable list

    summary(dat)

3) To draw a histogram of age

    hist(age)

4) Box-whisker plot of age

    boxplot(age)

5) Scatter plot of systolic blood pressure and diastolic blood pressure

    plot(sbp, dbp)

6) Contingency table of sex (male) and alcohol drinking habit (alc)

    table(male, alc)


The variable list of the tsunagi data set

    variable name:

    age : age (year)

male : sex (0=female, 1=male)

alc : alcohol drinking habit (0=no, 1=yes)

dur_smk : duration of smoking (year)

hgt : height

wgt : weight

grip_r : grip strength for right hand

grip_l : grip strength for left hand

sbp : systolic blood pressure

dbp : diastolic blood pressure

hb : hemoglobin level

wbc : white cell count

platelet : platelet count

GOT : aspartate aminotransferase (AST)

GPT : alanine aminotransferase (ALT)

gGTP : γ-glutamyl transpeptidase

tp : total protein level

alb : albumin

agratio : ratio of albumin to globulin

chl : total cholesterol

hdl : HDL cholesterol

tgl : triglyceride

HbA1C : HbA1C

cr_hd : arm cramps (0=no, 1=yes)

cr_ft : calf cramps (0=no, 1=yes)

cancer : cancer history (0=no, 1=yes)


    Data analysis using R: Lesson2

    Binomial distribution

1) Suppose that you toss a coin ten times and get nine heads and one tail. Is this a coincidence?

    Null hypothesis

    Alternative hypothesis

    pbinom(1,10,0.5)

    (number of getting tail, number of trials, expected probability)

    You will obtain P value using one-sided test.

In this statistical test, you get the probability of obtaining one tail or less in 10 trials under the assumption that the null hypothesis is true.

Statistical significance: 0.05 is conventionally used as the significance level.
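The pbinom() command above gives a one-sided P value; a short sketch of the same question done both ways, using base R's binom.test() for the exact test (the two-sided version is an extension beyond what the lesson shows):

```r
# one-sided: probability of 1 tail or fewer in 10 fair tosses
pbinom(1, 10, 0.5)        # 11/1024, about 0.0107

# exact binomial test (two-sided by default)
binom.test(1, 10, p = 0.5)
```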

    2. Normal distribution

1) Let's obtain 5,000 random numbers from the standard normal distribution.

    x
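The command after "x" is truncated in the source; a minimal sketch of what it presumably was:

```r
# 5000 random numbers from the standard normal distribution
x <- rnorm(5000)
hist(x)          # should look bell-shaped
mean(x); sd(x)   # close to 0 and 1, respectively
```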


    Case-control study

    Import the data set of tsunagi

Suppose that cases are subjects with a history of cancer and controls are subjects without a cancer history.

    We want to examine the case-control difference in the distributions of other factors.

    1) Comparison of the distribution of HDL between cases and controls

    by(hdl, cancer, mean)

    unpaired t test (Student t-test)

t.test(hdl~cancer, var.equal=TRUE)   (unpaired; the variances of the two samples are equal)

    Welch t test

    t.test(hdl~cancer)

(unpaired; the variances of the two samples are NOT equal)

    By the way, is HDL normally distributed?

    Parametric test: a statistical test that depends on assumption(s) about the distribution of

    the data. (distribution of the data is defined by parameter(s))

What should we do if your data do not satisfy the assumption?

    i)

    Data transformation, e.g., log transformation

    ii)

Use a non-parametric test

    lhdl
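The lhdl line is cut off; a sketch of options (i) and (ii) above, assuming hdl and cancer are available from the attached tsunagi data set:

```r
# (i) log transformation, then a parametric test
lhdl <- log(hdl)
t.test(lhdl ~ cancer, var.equal = TRUE)

# (ii) non-parametric test on the original scale
wilcox.test(hdl ~ cancer)
```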


Wilcoxon rank sum test (= Mann-Whitney U test)

    wilcox.test(cases,controls)

    3) Comparison of the distribution of categorical variables

    Use the data of rheumatic patients in the previous page.

(Pearson's) chi-squared test

    chisq.test(matrix(c(73,50,27,50),nr=2))

(nr stands for the number of rows.)

chisq.test(matrix(c(73,50,27,50),nc=2))   (nc stands for the number of columns.)

    If you want to check your contingency table,

    matrix(c(73,50,27,50),nr=2)

    5) Comparison of the distribution of categorical variables with small sample size.

History of rheumatism in parents

                       Yes  No

------------------------------

Rheumatic patients      5    1

Patients' siblings      3    3

Fisher's exact test

    fisher.test(matrix(c(5,1,3,3),nr=2))


    Data analysis using R: Lesson 3

    Data set: tsunagi

    1. Check the data distribution

Draw a histogram of systolic blood pressure (sbp)

hist(sbp)

    qqnorm(sbp)

to check the normality of the sbp distribution. If the plots line up on a line, the data are normally distributed.

    Check the normality using statistical tests

    After transformation, check the histogram of sbp and its normality.

    *log-transformation

    lsbp
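The lsbp command is truncated; a sketch of the transformation and the re-check, with shapiro.test() as one hedged choice of normality test:

```r
lsbp <- log(sbp)     # log-transformed systolic blood pressure
hist(lsbp)
qqnorm(lsbp)
shapiro.test(lsbp)   # Shapiro-Wilk normality test
```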


    2. Non-parametric test for matched data set

    wilcox.test(grip_l,grip_r,paired=T)

    Wilcoxon signed rank test

ANOVA (analysis of variance)

    Suppose that you want to compare the mean among three (or more) groups. How do you test?

    1)

Let's create a new variable for smoking status, smk_grp, from the variable dur_smk. The new variable has three categories: non-smokers, smokers for 1-19 years, and smokers for 20 years or more.

    smk_grp
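The smk_grp command is cut off; a sketch using cut(), with the 0/1/2 category coding as an assumption:

```r
# 0 = non-smoker, 1 = smoked 1-19 years, 2 = smoked 20+ years (coding assumed)
smk_grp <- cut(dur_smk, breaks = c(-Inf, 0, 19, Inf), labels = c(0, 1, 2))
table(smk_grp)   # check the group sizes
```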


    6) Non-parametric test for the mean comparison among three or more groups

    kruskal.test(sbp~smk_grp)

    Kruskal-Wallis test

    Post-hoc testing of ANOVAs (Multiple comparison)

If there is a statistically significant difference among the three groups, you may ask "Which one differs significantly from the others?" In this situation, it is NOT appropriate to repeat t-tests between pairs of groups (A vs B, B vs C, A vs C).

    There are many tests for multiple comparison.

Tukey's HSD: for normally distributed data

Holm's method: a modified Bonferroni test, also applicable to non-parametric tests

Bonferroni correction: a conservative method with lower statistical power

Scheffé's method: lower statistical power

Williams' method: best when you have a control group and there is a trend among the comparison groups

Dunnett's method: multiple comparison with a control group

NOTICE! One method is NOT recommended: Duncan's method.

    Bonferroni correction

    pairwise.t.test(sbp, smk_grp, p.adjust.method="bonferroni")

    You will obtain the following output.

    -----------------------------------------------------------------

    Pairwise comparisons using t tests with non-pooled SD

    data: sbp and smk_grp

  0 1

1 1 -

2 1 1

    P value adjustment method: bonferroni

    -----------------------------------------------------------------

The values in the table are P values after Bonferroni correction. Since all P values are 1, there is no statistically significant difference in any comparison.


Holm's method

    pairwise.t.test(sbp, smk_grp, p.adjust.method="holm")

    or

    pairwise.t.test(sbp, smk_grp)

Tukey's HSD

    smki
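The command is cut off after "smki"; TukeyHSD() in base R requires an aov fit with a factor term, so a hedged sketch:

```r
# Tukey's HSD needs an aov object; smk_grp must be treated as a factor
fit <- aov(sbp ~ factor(smk_grp))
TukeyHSD(fit)
```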


Data analysis using R: Lesson 4

    Analysis of Variance (ANOVA)

    One-way ANOVA

one variable is used to divide study subjects into groups, as in the example in Lesson 3.

    Non-matched data

For example, you may want to test the performance of a machine. Under different conditions (temperature), you repeated the performance test ten times using the same machine, and obtained the following results. (We ignore the effect of the machine's fatigue.)

Table 1. Test results by factor A (temperature)

    A1 A2 A3 A4

    -20 0 20 40

    1 63 64 59 78

    2 58 63 62 82

    3 60 63 60 85

    4 59 61 64 80

5 61 59 65 83

6 60 65 71 81

    7 57 61 65 79

    8 62 64 68 80

    9 50 62 74 76

    10 61 70 63 83

    If you want to input the data manually, you need to run the following commands.

    A1
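The manual-input commands are truncated after "A1"; a sketch that rebuilds Table 1 as the long-format data frame Dat1 printed below (the vector values are copied from the table; the data-frame layout is an assumption consistent with that printout):

```r
# one vector per temperature condition, values taken from Table 1
A1 <- c(63, 58, 60, 59, 61, 60, 57, 62, 50, 61)
A2 <- c(64, 63, 63, 61, 59, 65, 61, 64, 62, 70)
A3 <- c(59, 62, 60, 64, 65, 71, 65, 68, 74, 63)
A4 <- c(78, 82, 85, 80, 83, 81, 79, 80, 76, 83)

# long format: one row per measurement, with the condition as a grouping column
Dat1 <- data.frame(A = rep(c("A1", "A2", "A3", "A4"), each = 10),
                   y = c(A1, A2, A3, A4))
```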


    A y

    1 A1 63

    2 A1 58

    3 A1 60

    4 A1 59

    .

    .

    37 A4 79

    38 A4 80

    39 A4 76

    40 A4 83

boxplot(y~A, data=Dat1, col="lightblue")   (to draw the box-whisker plot)

    summary(aov(y~A, data=Dat1)) or

    oneway.test(y~A, data=Dat1, var.equal=TRUE)

    Matched data

Suppose that you want to test the performance of a machine. Under different conditions (temperature), you repeated the performance test ten times using ten machines (of the same model), and obtained the following results.

    Table 2

    Factor A

    A1 A2 A3 A4

    Machines No -20 0 20 40

    No.1 63 64 59 78

    No.2 58 63 62 82

    No.3 60 63 60 85

    No.4 59 61 64 80

    No.5 61 59 65 83

    No.6 60 65 71 81

    No.7 57 61 65 79

    No.8 62 64 68 80

    No.9 50 62 74 76

    No.10 61 70 63 83
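The commands for the matched analysis are lost to truncation; assuming a long-format data frame Dat2 with columns y, A (temperature condition) and No (machine number, as a factor), the output below presumably came from something like:

```r
# two-factor ANOVA treating machine number (No) as a blocking factor
summary(aov(y ~ A + No, data = Dat2))
```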


Df  Sum Sq Mean Sq F value   Pr(>F)

    A 3 2681.47 893.82 62.2353 2.972e-12 ***

    No 9 77.72 8.64 0.6013 0.7846

    Residuals 27 387.78 14.36

    ---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This result indicates that the measurement is significantly related to factor A (temperature). On the other hand, there is no significant difference in the performance of the ten machines (P=0.7846).

    Two-way ANOVA

Suppose that you want to test the performance of a machine under different conditions of temperature and humidity. You repeated the test five times for each combination, and obtained the following results.


Table 3. Results by factor A (temperature) and factor B (humidity)

    Factor A (temp.)

    A1 A2 A3 A4

    -20 0 20 40

    Factor B

    (humidity)

    B1

    < 50%

    63 64 63 68

    58 63 62 72

    60 63 67 80

    59 61 64 70

    61 59 65 75

B2

≥ 50%

Df Sum Sq Mean Sq F value   Pr(>F)

    A 3 402.67 134.22 10.3250 6.588e-05 ***

    B 1 0.03 0.03 0.0019 0.965294

    A:B 3 203.07 67.69 5.2071 0.004833 **

    Residuals 32 416.00 13.00

    ---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ------------------------------------------------------------------
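The command producing this two-way table is truncated in the source; assuming a data frame Dat3 with factors A and B and outcome y, it was presumably:

```r
# two-way ANOVA with the interaction term A:B
summary(aov(y ~ A * B, data = Dat3))   # equivalent to y ~ A + B + A:B
```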

The interaction term is A:B. Since the interaction term is statistically significant, we can say there is an interaction between A and B for the performance. To understand this association, let's draw a graph.

    attach(Dat3)

    interaction.plot(B,A,y)


3. Cohort study (using Tsunagi data set)

    To create a new variable named smk: smokers and never-smokers.

    smk
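The smk command is cut off; a sketch assuming "smoker" means dur_smk greater than 0:

```r
# 1 = ever-smoker, 0 = never-smoker (definition assumed from the text)
smk <- ifelse(dur_smk > 0, 1, 0)
table(smk)   # check the group sizes
```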


6. n × m table: chi-squared test

    A B C

    Smokers 50 20 30

    Non-smokers 50 80 70

    all 100 100 100

1) To examine the difference in the proportion of smokers among the three groups (A, B, and C):
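The command here is truncated; a sketch entering the table above (smokers / non-smokers by group) as a 2 × 3 matrix, in the same style as the earlier chisq.test examples:

```r
# rows: smokers, non-smokers; columns: groups A, B, C
# (matrix() fills column by column)
chisq.test(matrix(c(50, 50, 20, 80, 30, 70), nr = 2))
```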


    Data analysis using R: Lesson 5

    McNemar test

Let's assume that there are 100 cases with a disease and 100 sex- and age-matched controls.

    We are going to compare the proportion of smokers between cases and controls. (In other words,

    we want to examine the association between smoking and the risk of disease.)

    Since sex- and age-matched controls were selected, you may want to keep the 100 pairs (matched

    data set) in the statistical analysis.

                        Cases

                 Non-smoker  Smoker

Controls

Non-smoker         20 (a)    40 (b)

Smoker             10 (c)    30 (d)

The 20 pairs in cell (a) and the 30 pairs in cell (d) cannot contribute to the analysis, since there is no difference in smoking status between the case and control within those pairs. Thus, only cells (b) and (c) contribute to the analysis of the association between smoking status and the risk of disease.

    Under the null hypothesis, number of pairs should be equal between cells (b) and (c), and we

    apply McNemar test for this analysis.

χ² test: the degree of freedom is always 1

    The command for McNemar test is as follows: prop.test(numerator, denominator)

    prop.test(40,50)

What happens if you replace the numerator from cell (b) to cell (c)?

    prop.test(10,50)
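prop.test() above tests whether the 50 discordant pairs split 50:50 between cells (b) and (c); base R also provides mcnemar.test(), shown here as a hedged equivalent applied to the full 2 × 2 pair table:

```r
# pair counts from the table: a=20, b=40, c=10, d=30
tab <- matrix(c(20, 10, 40, 30), nr = 2)
mcnemar.test(tab)   # uses only the discordant cells (b) and (c)
```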


3. Correlation

Using the Tsunagi data set, let's see the association between systolic and diastolic blood pressures.

    scatter plot

    plot(sbp,dbp)

    Do you see any association between systolic and diastolic blood pressures?

Correlation coefficient (parametric method)

    cor.test(sbp,dbp)

This is called (Pearson's) correlation coefficient, a measure of association that

    indicates the degree to which two variables have a linear relationship.

    The coefficient, represented by the letter r, can vary between +1 and -1; when r=+1, there

    is a perfect positive linear relationship in which one variable varies directly with the other.

    Non-parametric method

Spearman's rank correlation

    cor.test(sbp, dbp, method="spearman")

Kendall's tau

    cor.test(sbp, dbp, method="kendall")

    4. Regression analysis

Let's see the association between age and systolic blood pressure.

    plot(age,sbp)

    Systolic blood pressure tends to increase with age. Can we predict systolic blood pressure

    by age using a statistical model?

    univariate regression model

    glm(sbp~age)

This model can be applied under the assumption that the dependent variable is normally distributed.


    Call: glm(formula = sbp ~ age)

    Coefficients:

    (Intercept) age

    92.0454 0.5236

    Degrees of Freedom: 1308 Total (i.e. Null); 1307 Residual

    (1 observation deleted due to missingness)

    Null Deviance: 700600

    Residual Deviance: 635700 AIC: 11820

    AIC: Akaike's Information Criterion (evaluation of statistical model)

    summary(glm(sbp~age))

    Call:

    glm(formula = sbp ~ age)

    Deviance Residuals:

    Min 1Q Median 3Q Max

    -49.837 -15.031 -2.512 12.865 88.257

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 92.04541 2.87076 32.06


    Draw the scatter plot and a line obtained from the regression model

    plot(age,sbp)

    abline(glm(sbp~age))

    plotting the residuals for each subject

    rslt
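The rslt commands are truncated; a sketch of saving the fitted model and plotting its residuals:

```r
rslt <- glm(sbp ~ age)
plot(resid(rslt))        # one residual per subject
abline(h = 0, lty = 2)   # dashed reference line at zero
```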


    Data analysis using R: Lesson 6

    Regression analysis using Tsunagi data set

    categorical variable as an explanatory variable

    In the previous model, we used age as an explanatory variable. How about categorical variables?

Can we use a categorical variable as an explanatory variable? The answer is yes! Let's see an example using sex as an explanatory variable.

    summary(glm(sbp~male))

The output will be as follows:

    Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 122.5988 0.8095 151.45 < 2e-16 ***

male 4.8212 1.3103 3.68 0.000243 ***

Using this regression model, systolic blood pressure can be expressed as:

systolic blood pressure = 122.5988 + 4.8212 × male

    Please notice that male is coded as 1 and female is coded as 0 in Tsunagi data.

    Thus, the mean systolic blood pressure in females is 122.5988, and that in males is 127.42

    (=122.5988 + 4.8212).

Let's confirm these mean values by another method.

    by(sbp,male,mean)

The P value of the regression analysis above indicates that the association between systolic blood pressure and sex is statistically significant. In other words, there is a significant difference in the mean systolic blood pressure between males and females.

    Actually, you would obtain the same P value by t test assuming the same variance between males

    and females.

    t.test(sbp~male, var.equal=TRUE)

    * The above command is different from that mentioned in Lesson 2.

Without the option var.equal=TRUE, you will obtain the result of the t-test assuming different variances between males and females.

    t.test(sbp~male)


In the case of sex, male is coded as 1 and female is coded as 0. What happens if we do not use numbers to code categorical variables?

    The following command is to create a new variable, smk.

    smk
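The smk command here is cut off; a sketch using character labels instead of numbers (the labels themselves are an assumption):

```r
# character coding instead of 0/1
smk <- ifelse(dur_smk > 0, "smoker", "never")

# R treats the factor's first level (alphabetically, "never") as the reference
summary(glm(sbp ~ factor(smk)))
```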


    4) Multivariate regression analysis (multiple regression model)

According to the previous analyses, systolic blood pressure is related not only to age but also to sex. In this situation, the mean age might differ between males and females. Let's see the difference in the mean age between males and females.

    t.test(age~male, var.equal=T)

    Two Sample t-test

    data: age by male

    t = -1.6976, df = 1307, p-value = 0.08982

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-2.7994433 0.2020984

    sample estimates:

    mean in group 0 mean in group 1

    61.41533 62.71400

    There is a marginally significant difference in the mean age between males and females.

Multiple regression analysis provides a statistical model using two or more covariates.

    The following command is to estimate the model using sex and age as covariates.

    summary(glm(sbp~male+age))

    glm(formula = sbp ~ male + age)

    Min 1Q Median 3Q Max

    -52.284 -14.934 -2.683 12.583 89.900

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 90.90451 2.88096 31.554 < 2e-16 ***

    male 4.11746 1.25124 3.291 0.00103 **

    age 0.51660 0.04519 11.431 < 2e-16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for gaussian family taken to be 482.7276)

    Null deviance: 700607 on 1308 degrees of freedom

    Residual deviance: 630442 on 1306 degrees of freedom


    (1 observation deleted due to missingness)

    AIC: 11809

    Number of Fisher Scoring iterations: 2

The coefficient of sex indicates the sex difference in systolic blood pressure after adjusting for the effect of age. Since the P value for sex is less than 0.05, the sex difference in systolic blood pressure is statistically significant even after adjusting for the effect of age.

    5) Multicollinearity

Multicollinearity is a statistical phenomenon in which two or more predictor (explanatory) variables in a multiple regression model are highly correlated. In this situation, the coefficient estimates may change erratically in response to small changes in the model or the data.

Multicollinearity does not reduce the predictive power or reliability of the model as a whole; it only affects calculations regarding individual predictors. That is, a multiple regression

    model with correlated predictors can indicate how well the entire bundle of predictors predicts

    the outcome variable, but it may not give valid results about any individual predictor, or about

    which predictors are redundant with others.

For example, multicollinearity arises when weight, height, and BMI are all used in one regression model, since BMI is calculated from weight and height.

    How to avoid multicollinearity?

Examine the associations among the explanatory variables by calculating correlation coefficients. If you find multicollinearity, pick one of the correlated variables as the covariate.
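The check described above can be sketched as follows (the choice of hgt and wgt is only an illustration using two tsunagi variables that are plausibly correlated):

```r
# pairwise correlations among candidate predictors,
# ignoring rows with missing values
cor(cbind(hgt, wgt), use = "complete.obs")
```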

    Analysis of covariance: ANCOVA

    effect modification, synergistic effect

    In tsunagi data set, the association between age and systolic blood pressure may be different

    between males and females. How can we show that?

Let's see the associations between age and sbp for males and females, separately.

    plot(age, sbp, pch=as.integer(male))

(plotting symbols distinguish males from females)

    abline(lm1
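The abline(lm1 line is truncated; a sketch of fitting and drawing the sex-specific regression lines (lm1 for females matches the formula shown in the summary output; lm2 for males is the symmetric assumption):

```r
# separate regressions for females (male==0) and males (male==1)
lm1 <- glm(sbp[male == 0] ~ age[male == 0])
lm2 <- glm(sbp[male == 1] ~ age[male == 1])
abline(lm1)            # line for females
abline(lm2, lty = 2)   # line for males (dashed)
```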


    summary(lm1)

    Call:

    glm(formula = sbp[male == 0] ~ age[male == 0])

    Deviance Residuals:

    Min 1Q Median 3Q Max

    -51.034 -14.539 -2.127 12.897 87.461

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 76.74885 3.78235 20.29


    summary(glm(sbp~male+age+male*age))

    or

    summary(glm(sbp~male*age))

    This regression model can be expressed as follows,

sbp = a + b(male) + c(age) + d(male × age)

    For females, male=0, then

sbp = a + c(age)

    For males, male=1, then

sbp = a + b + (c + d)(age)

    Thus, the difference in the slope between males and females is d.

H0: d = 0, HA: d ≠ 0

    If there is a statistically significant difference between males and females, we should report

    the results of regression analysis for males and females, separately.


    Data analysis using R: Lesson 7

In this session, we use the kasari data set.

    Kasari data set

MMS : Mini Mental State (a score of cognitive function)

r_gr : right hand grip strength

l_gr : left hand grip strength

hg : mercury level in the hair

male : sex (1=male, 0=female)

age : age

sbp : systolic blood pressure

dbp : diastolic blood pressure

cr_hd : cramp of the upper limb (0=no, 1=yes)

cr_ft : calf cramp (0=no, 1=yes)

    A case-control study was conducted. Cases had a history of calf cramp, and controls did not.

    Please try to answer the following questions by yourself.

    How many cases and controls are there in this data set?

    table(cr_ft)

Check the distribution of mercury level in the hair by drawing a graph.

    hist(hg)

    Check the normality of mercury distribution in the hair using an appropriate method. If the

    mercury level is not normally distributed, conduct a logarithmic transformation, and then,

    check the normality, again.

    shapiro.test(hg)

    lhg
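The lhg line is cut off; a sketch of the transformation and re-check, plus one hedged way to create the dichotomized variable hgg used on the next page (the median cut-point is an assumption):

```r
lhg <- log(hg)        # log-transformed hair mercury level
shapiro.test(lhg)     # re-check normality after transformation
hist(lhg)

# hgg: high/low mercury group; a median split is an assumed cut-point
hgg <- ifelse(hg > median(hg, na.rm = TRUE), 1, 0)
```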


    table(cr_ft,hgg)

    Conduct a chi-squared test to check the association between mercury level in the hair and

    the presence of calf cramp. Furthermore, conduct an appropriate t-test to compare the mean

    of mercury level in the hair between cases and controls after logarithmic transformation.

    chisq.test(cr_ft,hgg)

    t.test(lhg~cr_ft)

    According to these results, there might be an association between mercury level in the hair and

the presence of calf cramp. Let's examine this association taking into account the effects of other factors.

    logistic regression analysis

In this statistical model, the dependent variable is cr_ft (the presence of calf cramp: yes=1, no=0).

    Univariate logistic regression analysis

    rst
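The rst command is truncated; a sketch of a univariate logistic model for calf cramp (using the log-transformed mercury level as the exposure is an assumption consistent with the preceding analyses):

```r
# logistic regression: outcome cr_ft (1=yes), exposure lhg
rst <- glm(cr_ft ~ lhg, family = binomial)
summary(rst)
exp(coef(rst))   # coefficients as odds ratios
```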


    wilcox.test(lhg~male)

There might be a sex difference in the mercury level in the hair. Let's see the association between the mercury level in the hair and the calf cramp risk after adjusting for the effect of sex.

    rst3
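The final rst3 command is cut off; a sketch of the sex-adjusted model described just above (the model form is an assumption):

```r
# mercury-cramp association adjusted for sex
rst3 <- glm(cr_ft ~ lhg + male, family = binomial)
summary(rst3)
exp(coef(rst3))   # adjusted odds ratios
```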