STATA intro

DESCRIPTION

Harvard

Page 1: STATA intro

Tutorial: Life Tables in Stata

Life tables list the death rates experienced by a population over a given period of time. They have many practical uses. For example, insurance companies use them to determine premiums and annuities; the government uses them to plan for social security.

Life tables are easy to compute in Stata through the use of the ltable command. To begin, download the lifetable.dta data set from the course website and open it in Stata. This data set was generated from one of the first life tables recorded, dating back to the late ___th century.

What is the mean life span? What is the median?

. summ age, detail

(Get the mean, and the median from the value in the 50% row of the percentile column.)

What does the histogram of age at death look like? Is it symmetric?

Graphics > Histogram > Select age as variable

Command: histogram age

Symmetric after an initial peak in death times around age ___.

Use the ltable command to generate a life table.

a) What is the chance of surviving from birth until age ___?

Command: ltable age

But if we use this command, all the intervals are of length ___, which isn't very helpful. So we will use the interval option. We want intervals of length ___.

Command: ltable age, interval(___) (start at ___, end at ___, in steps of ___). SEE BELOW.

b) What is the proportion of individuals alive on their ___th birthday who die before their ___th birthday?

People alive at age ___ = value in ___ = ___ people. People who died at age ___ = ___.

Therefore, proportion = ___.

c) What is the chance that a ___-year-old will survive ___ years?

How many people were alive at age ___? = value in row ___

How many people were alive at age ___? = value in row ___

Proportion = ___

d) What is the chance that a ___-year-old will survive to age ___?

For a) above, find the number of people alive at age ___ (the ___ row). In this case, the value was ___.

Hence our value for survival until age ___ = ___. Alternatively, the value in the survival column at the age ___ row is ___ (answer for a).

d) Alive at age ___ = ___; alive at age ___ = ___; proportion = ___.
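
The commands in this tutorial can be collected into a short do-file. This is only a sketch: the 10-year interval width is an assumed value chosen for illustration, and the code assumes that lifetable.dta sits in the current working directory with age at death stored in the variable age.

```stata
* Sketch only: the 10-year width is an assumption, not the handout's value
use lifetable.dta, clear

summarize age, detail       // mean, plus the median in the 50% percentile row
histogram age               // distribution of age at death

ltable age                  // life table with the default intervals
ltable age, intervals(10)   // same table grouped into 10-year intervals
```

Whatever interval width is used, the chance that someone alive at age a survives to age b is the number alive at age b divided by the number alive at age a, both read from the life table.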

Example: Probability of hypertension at baseline

In the Framingham data set, 3,004 of the participants did not have hypertension at baseline and 1,430 did have hypertension at baseline. Using this information, what is the probability that a randomly selected participant in the Framingham study had hypertension at baseline?

What is the probability that this participant did not have hypertension at baseline? Are these events mutually exclusive, exhaustive, neither, or both?

What is the probability that three randomly selected participants all do not have hypertension at baseline?

Raj Dasgupta
P(H) = # with hyp / # total = 1430/4434 = 0.32

Hc = H complement (= individual did not have hyp)
P(Hc) = 1 - P(H) = 1 - 0.32 = 0.68

Both: complementary events cannot happen at the same time (mutually exclusive), and they are exhaustive since P(H) + P(Hc) = 1.

P(Hc) = P(one individual does not have hyp) = 0.68
P(all 3 do not have hyp) = P(first does not have hyp).P(second does not have hyp).P(third does not have hyp) = P(Hc)^3 = 0.68^3 = 0.31, treating the three selections as independent.
Suppose we again randomly select two participants from this population. What is the probability that both participants have hypertension at baseline, given that at least one of the participants had hypertension?

A = both have hyp
B = at least 1 has hyp
We need to find P(A|B).
P(A|B) = P(A and B) / P(B) = P(A) / P(B) = prob that both have hyp / prob that at least 1 has hyp
Both have hyp = P(H).P(H)
At least 1 has hyp = P(H) + P(H) - P(H).P(H)
P(H) = 0.32
Therefore, P(A|B) = P(H).P(H) / [P(H) + P(H) - P(H).P(H)] = 0.1024 / 0.5376 = 0.19
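
The hand calculations in this example can be checked with Stata's display command. The sketch below uses only the counts quoted in the example (1,430 of 4,434 participants with hypertension) and assumes, as the annotations do, that participants are selected independently; the scalar names are illustrative.

```stata
* Counts from the handout: 1,430 of 4,434 participants had hypertension
scalar pH = 1430/4434

display "P(H)                 = " %5.3f (pH)
display "P(Hc)                = " %5.3f (1 - pH)
display "P(all 3 without hyp) = " %5.3f ((1 - pH)^3)

* Two participants: P(both | at least one) = P(both) / P(at least one)
scalar pBoth     = pH^2
scalar pAtLeast1 = 2*pH - pH^2
display "P(both | at least 1) = " %5.3f (pBoth/pAtLeast1)
```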
Example: Relationship between hypertension and CHD using probability laws

We examine the relationship between hypertension and CHD at baseline in the Framingham study population using the concepts of probability learned this week.

a) What is the probability that a Framingham participant has hypertension or CHD at baseline?

b) Are these two events independent? Would you expect these events to be independent?

c) What is the probability that a participant has CHD at baseline? What is the probability that a participant has CHD at baseline, given that he/she has hypertension?

To find the counts: tab prevhyp1 prevchd1

H = hyp, C = CHD
P(H or C) = P(H) + P(C) - P(H and C) = 1430/4434 + 194/4434 - 113/4434 = 0.34

Guessing, we would not expect them to be independent. For independence, P(H and C) = P(H).P(C). Here P(H and C) = 113/4434 = 0.025, while P(H).P(C) = 1430/4434 x 194/4434 = 0.014. They are not equal, and hence the events are not independent.

P(C) = 194/4434 = 0.044
We want to find P(C|H) = P(C and H) / P(H) = (113/4434) / (1430/4434) = 0.08, which is more than P(C). Given that you have hyp, you have a higher probability of having CHD than with no information about hypertension.

. tab prevhyp1 prevchd1

 Prevalent |  Prevalent coronary
hypertensi | heart disease, exam 1
on, exam 1 |        No        Yes |     Total
-----------+----------------------+----------
        No |     2,923         81 |     3,004
       Yes |     1,317        113 |     1,430
-----------+----------------------+----------
     Total |     4,240        194 |     4,434
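
Each probability in parts a) to c) can be recomputed from the four counts in the table. The following is a minimal sketch; the scalar names are made up for illustration and the counts are copied from the tab output.

```stata
* Counts read off the tab prevhyp1 prevchd1 output
scalar n   = 4434     // all participants
scalar nH  = 1430     // hypertension
scalar nC  = 194      // CHD
scalar nHC = 113      // hypertension and CHD

display "P(H or C)  = " %5.3f ((nH + nC - nHC)/n)
display "P(H and C) = " %5.3f (nHC/n)
display "P(H)*P(C)  = " %5.3f ((nH/n)*(nC/n))
display "P(C)       = " %5.3f (nC/n)
display "P(C | H)   = " %5.3f (nHC/nH)
```

With the data set open, tabulate prevhyp1 prevchd1, row cell should report the same conditional and joint proportions as percentages.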
Tutorial: ROC Curves in Stata

ROC curves illustrate the inherent tradeoff between sensitivity and specificity. We examine ROC curves in the context of risk prediction.

Consider the following scenario: you are responsible for telling a patient that they are at high or low risk for CHD, given some baseline prognostic factors. Using the Framingham data set, you can predict the probability that an individual gets CHD given their baseline prognostic factors.

Constructing an ROC curve to evaluate a risk prediction model:

1. Using systolic blood pressure, number of cigarettes smoked per day, total cholesterol, sex, and BMI at baseline, predict the probability that each individual in the Framingham data set had CHD. Call this probability p.

2. As in the diagnostic testing setting, select a cutoff probability c to distinguish high and low risk patients. If p < c, the patient is low risk. If p ≥ c, the patient is high risk.

3. Classify all patients in the data set as high risk or low risk using the cutoff c.

4. Calculate P(high risk | CHD) = sensitivity. Calculate P(high risk | no CHD) = 1 - specificity.

Steps 1-4 are beyond the scope of this module. These values are provided for you in the data set roc.dta. Open the dataset roc.dta in Stata.

5. For the various values of c, plot the false positive rate versus sensitivity. Connect the lines to generate your ROC curve.

Consider the following questions:

How do the sensitivity and specificity change as the cutoff increases from 0 to 1?

What value of c would you choose in distinguishing high risk versus low risk patients? Why?

ROC = graph of false positive rate vs sensitivity

Graphics > Two Way Graph

Looking at the graph, you want to balance sensitivity and specificity: keep the false positive rate low with high sensitivity. It is better to tell a patient that he/she is high risk and be wrong than to tell the patient that he/she is low risk and be wrong.

Table: Points on ROC curve for risk prediction model

Cutoff (c)   Sensitivity   Specificity   False Positive (1 - Specificity)
1            0             1.0000        0.0000
0.7901       0.0051        1.0000        0.0000
0.7152       0.0071        0.9988        0.0012
0.6592       0.0111        0.9966        0.0034
0.6055       0.0547        0.9931        0.0069
0.5695       0.0993        0.9875        0.0125
0.5055       0.1682        0.9654        0.0346
0.4595       0.2381        0.9480        0.0520
0.4049       0.3506        0.9221        0.0779
0.3555       0.4205        0.8794        0.1206
0.3029       0.5228        0.7997        0.2003
0.2545       0.6383        0.6963        0.3037
0.2031       0.7285        0.5723        0.4277
0.1559       0.8379        0.4184        0.5816
0.1044       0.9119        0.2408        0.7592
0.0571       0.9899        0.0551        0.9449
0            1.0000        0             1
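
Step 5 can also be done from the command line instead of the Graphics menu. Because the variable names inside roc.dta are not shown in this handout, the sketch below types the table's points in directly; the variable names cutoff, sens, spec and fpr are chosen here for illustration.

```stata
* Points copied from the table (cutoff, sensitivity, specificity)
clear
input cutoff sens spec
1      0      1.0000
0.7901 0.0051 1.0000
0.7152 0.0071 0.9988
0.6592 0.0111 0.9966
0.6055 0.0547 0.9931
0.5695 0.0993 0.9875
0.5055 0.1682 0.9654
0.4595 0.2381 0.9480
0.4049 0.3506 0.9221
0.3555 0.4205 0.8794
0.3029 0.5228 0.7997
0.2545 0.6383 0.6963
0.2031 0.7285 0.5723
0.1559 0.8379 0.4184
0.1044 0.9119 0.2408
0.0571 0.9899 0.0551
0      1.0000 0
end

generate fpr = 1 - spec      // false positive rate

* Connect the points in order of increasing false positive rate
twoway line sens fpr, sort ytitle("Sensitivity") xtitle("False positive rate") title("ROC Curve")
```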

[Plot: ROC curve for risk prediction model. Axis labels in the original: Specificity (vertical), False positive rate (horizontal). Title: ROC Curve.]

Tutorial: More on ROC curves and complicated graphs in Stata

We construct a new, simpler risk prediction model using only systolic blood pressure, diastolic blood pressure, and age as our prognostic factors. We compare this risk prediction model to the model in the previous tutorial.

Open the dataset roc.dta on the course website.

We construct a plot that includes:

the ROC curve for the first model from the previous tutorial with many prognostic factors (called Model 1),

the second model with only sex and blood pressure (Model 2), and

a reference line representing arbitrary classification as high or low risk.

Overlaying lines in Stata is relatively easy within the Twoway graph window. Using the ROC plot, consider the following questions:

Model 1 outperforms Model 2. How can you tell this from the ROC curve?

Which model would you recommend?

Later in the course, we learn how to fit the model to obtain the predicted risks. With new biomarkers and genetic risk factors popping up all the time, risk prediction is a hot topic in statistics right now, and ROC curves are used frequently.

Plot fpr vs sensitivity for Model 1, overlaid with Model 2. Here you have to create 2 plots in the Graphics > Twoway dialogue box.

Table: Points on ROC curve for Model 2

Cutoff (c)   Sensitivity   Specificity   False Positive (1 - Specificity)
1            0.0000        1.0000        0.0000
0.7901       0.0273        0.9997        0.0003
0.7152       0.0298        0.9991        0.0009
0.6592       0.0324        0.9975        0.0025
0.6055       0.0434        0.9942        0.0058
0.5695       0.0792        0.9804        0.0196
0.5055       0.1014        0.9699        0.0301
0.4595       0.2002        0.9460        0.0540
0.4049       0.2666        0.9064        0.0936
0.3555       0.4370        0.8224        0.1776
0.3029       0.5860        0.7006        0.2994
0.2545       0.6959        0.6046        0.3954
0.2031       0.7641        0.4736        0.5264
0.1559       0.8739        0.3371        0.6629
0.1044       0.9838        0.0865        0.9135
0.0571       1.0000        0.0000        1.0000
0            1.0000        0.0000        1.0000
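
The overlaid plot can be produced in one twoway command with three layers. This sketch again types in the tabulated points rather than relying on roc.dta; the variable names sens1, fpr1, sens2 and fpr2 are hypothetical, and the /// line continuations mean it should be run from a do-file.

```stata
* Sensitivity and false positive rate for both models, by cutoff
clear
input cutoff sens1 fpr1 sens2 fpr2
1      0      0      0      0
0.7901 0.0051 0.0000 0.0273 0.0003
0.7152 0.0071 0.0012 0.0298 0.0009
0.6592 0.0111 0.0034 0.0324 0.0025
0.6055 0.0547 0.0069 0.0434 0.0058
0.5695 0.0993 0.0125 0.0792 0.0196
0.5055 0.1682 0.0346 0.1014 0.0301
0.4595 0.2381 0.0520 0.2002 0.0540
0.4049 0.3506 0.0779 0.2666 0.0936
0.3555 0.4205 0.1206 0.4370 0.1776
0.3029 0.5228 0.2003 0.5860 0.2994
0.2545 0.6383 0.3037 0.6959 0.3954
0.2031 0.7285 0.4277 0.7641 0.5264
0.1559 0.8379 0.5816 0.8739 0.6629
0.1044 0.9119 0.7592 0.9838 0.9135
0.0571 0.9899 0.9449 1.0000 1.0000
0      1.0000 1      1.0000 1.0000
end

* Two ROC curves plus the 45-degree reference line
twoway (line sens1 fpr1, sort)                         ///
       (line sens2 fpr2, sort lpattern(dash))          ///
       (function y = x, range(0 1) lpattern(dot)),     ///
       ytitle("Sensitivity") xtitle("False positive rate") ///
       legend(order(1 "Model 1" 2 "Model 2" 3 "Reference line")) ///
       title("ROC Curve")
```

A curve that sits further above the reference line at a given false positive rate achieves higher sensitivity for the same specificity.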

[Plot: ROC curve for Models 1 and 2 with reference line. Axes: Sensitivity (vertical), False positive rate (horizontal). Legend: Model 1, Model 2. Title: ROC Curve.]

Example: Sensitivity, Specificity, PPV, NPV, and Bayes Theorem

The World Health Organization conducts surveys in countries to declare neonatal tetanus (NT) elimination. To diagnose NT deaths in rural locations, women are interviewed using the oral autopsy method.

Notation:
D+: woman had a live infant who died of neonatal tetanus
D-: woman had a live infant who did not die of NT
T+: the oral autopsy concluded that an NT death occurred
T-: the oral autopsy concluded that an NT death did not occur

Using data from Kenya, the sensitivity of the oral autopsy method is 0.90; the specificity was found to be 0.79. Suppose 0.1% of the women surveyed had an infant die of neonatal tetanus.

a) What is the probability that the oral autopsy method declares a neonatal tetanus death when the woman had an infant die of neonatal tetanus?

b) What is the probability that the oral autopsy method does not declare a neonatal tetanus death when the woman did not have an infant die of neonatal tetanus? What is this value called?

For more information, see http://www.who.int/immunization_monitoring/diseases/MNTE_initiative/en/index.html

Snow R, Armstrong JRM, Forster D, et al. Childhood deaths in Africa: Uses and limitations of verbal autopsies. Lancet.

This is actually just sensitivity: P(T+|D+) = 0.90

This is just specificity: P(T-|D-) = 0.79

c) What is the probability that a woman had an infant die of neonatal tetanus, given that the oral autopsy method declared a neonatal tetanus death? What is this value called?

d) What is the probability that a woman did not have an infant die of neonatal tetanus when the oral autopsy method does not declare a neonatal tetanus death? What is this value called?

e) What are the implications of parts c and d for the neonatal tetanus survey?

P(D+|T+) = Positive Predictive Value
= P(T+|D+).P(D+) / [P(T+|D+).P(D+) + P(T+|D-).P(D-)]
= (0.9 x 0.001) / [0.9 x 0.001 + (1 - 0.79) x (1 - 0.001)]
= 0.004
Note: P(D+) = prevalence = 0.001

P(D-|T-) = Negative Predictive Value
= P(T-|D-).P(D-) / [P(T-|D-).P(D-) + P(T-|D+).P(D+)]
= [0.79 x (1 - 0.001)] / [0.79 x (1 - 0.001) + (1 - 0.90) x 0.001]
= 0.999 = NPV

PPV = 0.004, NPV = 0.999. Therefore, very low PPV, very high NPV. Even though sensitivity and specificity are reasonably high, this disease is so rare in this population that without perfect specificity we will have very low PPV. So, the vast majority of individuals declared to have died due to neonatal… did not actually die of this disease.
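
The two Bayes' theorem calculations can be reproduced with a few scalars, which also makes it easy to see how PPV responds to a different prevalence. The scalar names below are illustrative; the three input values are the ones used in the annotations above.

```stata
* Values from the example: sensitivity 0.90, specificity 0.79, prevalence 0.001
scalar sens = 0.90
scalar spec = 0.79
scalar prev = 0.001

* Bayes' theorem for the two predictive values
scalar ppv = sens*prev / (sens*prev + (1 - spec)*(1 - prev))
scalar npv = spec*(1 - prev) / (spec*(1 - prev) + (1 - sens)*prev)

display "PPV = " %6.4f (ppv)
display "NPV = " %6.4f (npv)
```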
Tutorial: Binomial distribution in Stata

Using Stata to calculate binomial probabilities

Suppose X is a random variable that follows a binomial distribution; thus X represents the

number of successes out of n trials with success probability p.

binomialp(n,k,p) returns the probability of observing floor(k) successes

in floor(n) trials when the probability of a success on one trial is p.

binomial(n,k,p) returns the probability of observing floor(k) or fewer successes

in floor(n) trials when the probability of a success on one trial is p.

binomialtail(n,k,p) returns the probability of observing floor(k) or more successes

in floor(n) trials when the probability of a success on one trial is p.

Example: Uzbeki Flour Fortification Program

In 2003, a flour fortification program was implemented in Uzbekistan to attempt to lower the rates of anemia among women of reproductive age. Before the program was implemented, the prevalence of anemia was 60%. In 2007, four years after implementing the fortification program, suppose 100 women of reproductive age were randomly selected to provide blood samples to test for anemia. Let X be the random variable denoting how many of the 100 women were anemic.

Suppose that the prevalence of anemia in Uzbekistan did not change between 2003 and 2007.

1. Would the binomial distribution provide an appropriate model?

B binary outcome

I independent because women were randomly selected

N sample size is fixed

S same p

2. What is the expected number of women with anemia?

µ = n × p = 100 × 0.6 = 60

3. In a random sample of women in Uzbekistan, what is the typical departure of the number of

women with anemia from this mean number?

$\mathrm{sd}(X) = \sqrt{\mathrm{var}(X)} = \sqrt{n\,p\,(1-p)} = \sqrt{100 \times 0.6 \times 0.4} = \sqrt{24} = 4.9$

IMPORTANT
Conditions for a binomial distribution
100 * .6 = 60
Standard Deviation for a Binomial Distribution

4. What is the probability that exactly 60 women are anemic? (use the formula)

$P(X = k) = \binom{n}{k}\, p^{k} (1-p)^{n-k}$, so $P(X = 60) = \binom{100}{60}\, 0.6^{60}\, 0.4^{40} = 0.081$

. di comb(100, 60)*0.6^60*0.4^40

.08121914

5. What is the probability that exactly 50 women are anemic?

. di binomialp(100, 50, 0.6)

.01033751

6. What is the probability that at least 50 women are anemic?

. di binomialtail(100, 50, 0.6)

.98323831

Alternatively, we could use the binomial command to calculate this probability, since P(X ≥ 50) = 1 - P(X ≤ 49).

. di binomial(100, 49, 0.6)

.01676169

. di 1 - binomial(100, 49, 0.6)

.98323831

7. Now, assume that the prevalence of anemia actually dropped after implementation of the

program, and the prevalence of anemia was 40% in 2007. Now, what is the probability that

at least 50 women are anemic?

. di binomialtail(100, 50, 0.4)

.0270992

Note that under the assumption of no change in prevalence between 2003 and 2007, the probability that at least fifty women had anemia was very high. If the prevalence of anemia dropped to 40%, the probability that at least 50 women were anemic was then very low. So, if we collected data on 100 women and observed fewer than 50 cases of anemia, this would suggest that anemia prevalence dropped over time!
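
As a quick cross-check of the answers in this tutorial, the mean, the standard deviation, and the agreement between the formula and the built-in function can all be displayed in one short run. The sketch uses only functions already introduced above.

```stata
* n = 100 women, p = 0.6 under the assumption of no change in prevalence
display 100*0.6                       // expected number anemic
display sqrt(100*0.6*0.4)             // standard deviation

* Formula and built-in function should agree for P(X = 60)
display comb(100, 60)*0.6^60*0.4^40
display binomialp(100, 60, 0.6)

* P(X >= 50) under each assumed prevalence
display binomialtail(100, 50, 0.6)
display binomialtail(100, 50, 0.4)
```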

Tutorial: Poisson distribution in Stata

Using Stata to calculate Poisson probabilities

Suppose X is a random variable that follows a Poisson distribution; X is a count of breast cancer cases.

When X ∼ Poisson(m),

poissonp(m,k) returns the probability of observing floor(k) successes

poisson(m,k) returns the probability of observing floor(k) or fewer successes

poissontail(m,k) returns the probability of observing floor(k) or more successes

Example: Ecological Cancer Study

In the United States, the National Cancer Institute (NCI) tracks cancer incidence through the Surveillance Epidemiology and End Results (SEER) database. At various SEER sites, incident cases of cancer, cancer type, and location are tracked. Using data from SEER, epidemiologists can monitor patterns in disease risk and find factors, such as socioeconomic status, that are correlated with disease.

For instance, Los Angeles County is divided into 2,056 census tracts in the 2000 census. Using the SEER database, we can estimate the number of expected breast cancer cases in each census tract, based on breast cancer incidence rates in California and the age distribution within each tract (see standardization lectures). Then, we can compare the number of observed cases in each census tract to the expected, to determine if census tracts have more cases of cancer than expected. We can then try to correlate excess breast cancer cases with other area-level variables, in an ecological study.

Below, we have data on breast cancer incidence for the African-American female population in a census tract in LA County. We choose to model the observed number of breast cancer cases in a census tract using the Poisson distribution, with mean equal to the expected number of breast cancer cases in the census tract.

Age group   Observed   Population   Cancer rate (per 1,000 p-y)   Expected
15-24       0          188          0.008                         0.001
25-34       0          163          0.200                         0.033
35-44       0          216          0.875                         0.189
45-54       0          157          1.868                         0.293
55-64       0          137          2.633                         0.361
65-74       0          151          3.165                         0.478
75-84       0          121          3.452                         0.418
84+         0          57           3.313                         0.189
Total       0          1,190        1.648                         1.962

Table 1: Census tract 1

1. What is the expected number of women with breast cancer in the census tract 1?

1.962

2. What is the typical departure of the number of women with breast cancer from this mean number?

$\mathrm{sd}(X) = \sqrt{\mathrm{var}(X)} = \sqrt{\mu} = \sqrt{1.962} = 1.400714$

3. Does the Poisson distribution provide an appropriate model?

Count data, so Poisson distribution seems reasonable. Difficult to assess any more information about model fit without data on many census tracts.

4. What is the probability that exactly 0 women develop breast cancer in census tract 1? (use the formula)

$\dfrac{e^{-1.962}\, 1.962^{0}}{0!} = e^{-1.962} = 0.1406$

Consider another census tract, with a similar total African-American female population to the previous, but with 5 observed breast cancer cases.

Age group   Observed   Population   Cancer rate (per 1,000 p-y)   Expected
15-24       0          187          0.008                         0.001
25-34       0          187          0.200                         0.037
35-44       1          218          0.875                         0.191
45-54       0          193          1.868                         0.361
55-64       1          175          2.633                         0.461
65-74       1          141          3.165                         0.446
75-84       2          66           3.452                         0.228
84+         0          17           3.313                         0.056
Total       5          1,184        1.504                         1.781

Table 2: Census tract 2

5. What is the probability that exactly 5 women have breast cancer in census tract 2?

. di poissonp(1.781, 5)

.02515706

6. What is the probability that at least 5 women have breast cancer in census tract 2?

. di poissontail(1.781, 5)

.03504886

Alternatively, we could use the poisson command to calculate this probability, since P(X ≥ 5) = 1 - P(X ≤ 4).

. di 1 - poisson(1.781, 4)

.03504886

Takeaway: Census tracts 1 and 2 have similar population sizes and consequently similar expected breast cancer case counts. However, in census tract 1, we observe no cases; in census tract 2, we observe 5 cases. Using the Poisson distribution, we can calculate the probability of observing case counts as extreme as 0 or 5 in these tracts.

Remember that there are about 2,000 total tracts, so we expect to see some extreme observations. We could also incorporate ecological covariates into our analysis, such as median household income or land-use data, to try to explain some of the differences between observed and expected breast cancer rates.
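
The point about extreme observations can be made concrete with a rough calculation. The sketch below first confirms the formula answer for tract 1, then multiplies the tail probability for tract 2 by the number of tracts. That last step assumes, unrealistically, that every one of the 2,056 tracts has the same expected count of 1.781; it is an illustration of scale, not an analysis.

```stata
* Census tract 1: formula versus built-in function for P(X = 0)
display exp(-1.962)
display poissonp(1.962, 0)

* Census tract 2: P(X >= 5) when the mean is 1.781
display poissontail(1.781, 5)

* Rough number of tracts expected to show 5 or more cases by chance alone,
* ASSUMING every tract had mean 1.781 (an illustrative simplification)
display 2056*poissontail(1.781, 5)
```

Under that simplification the last line comes to roughly 72 tracts, so a single tract with 5 cases is not by itself surprising.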

Tutorial: Normal distribution in Stata

Using Stata to calculate Normal probabilities

Suppose Z is a standard normal random variable. When Z ∼ Normal(0, 1),

normal(z) returns the cumulative standard normal distribution

normalden(z) returns the standard normal density

Example: Ozone Designation Following the Clean Air Act Amendments of 1997

From 2001-2003, the Environmental Protection Agency (EPA) monitored ozone levels at monitors across the United States. One criterion for ozone was that the ozone levels (defined as the average fourth highest daily maximum ozone over the three year period) could not exceed 80 ppb. Regulatory actions were taken if the ozone levels exceeded this threshold.

Among monitors in the Southeast, the average ozone level was 45.2 ppb, with standard deviation 6.3 ppb. Ozone levels are usually modeled using the normal distribution. We assume that this distribution is reasonable in our application.

Define X as ozone level at a monitor. $X \sim N(45.2,\ 39.7)$, or, equivalently, $X \sim N(45.2,\ 6.3^2)$.

1. What is the expected ozone level at a randomly sampled monitor?

45.2 ppb

2. What is the typical departure of ozone levels from this mean?

6.3 ppb

3. Why do you think Stata named the normal density function normalden, rather than normalp, which would seemingly be more consistent with the binomial and Poisson commands?

The normal distribution is continuous, and therefore normalden does not return a probability, but rather a density function.

4. Why do you think Stata only calculates probabilities with respect to the standard normal, or N(0,1), distribution?

I don’t know the answer to this. Seems pretty inconvenient.

5. What is the probability that a randomly selected monitor has ozone levels exceeding 80 ppb?

First, standardize:

$P\left(\dfrac{X - 45.2}{6.3} > \dfrac{80 - 45.2}{6.3}\right) = P(Z > 5.524)$

. di 1 - normal(5.524)

1.657e-08

IMPORTANT: normal(z) gives P(Z < z), i.e. the less-than (lower tail) probability.

6. Provide an interpretation of the following command:

. di normalden(0)

.39894228

0.399 is the value of the normal density function at 0. It has no interpretation in terms of probability.
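
Since normal() works on the standard normal scale, the standardization step can be written directly inside the function call. The following sketch repeats the 80 ppb calculation that way; the 50 ppb threshold in the second command is a made-up value added only to show a less extreme case.

```stata
* P(X > 80) for X ~ N(45.2, 6.3^2), standardizing inside the call
display 1 - normal((80 - 45.2)/6.3)

* Same idea for a hypothetical, less extreme threshold of 50 ppb
display 1 - normal((50 - 45.2)/6.3)

* Density of X at its mean: rescale the standard normal density by 1/sd
display normalden(0)/6.3
```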

Example Problem: HIV prevalence in South Africa

According to UNAIDS, HIV prevalence in South Africa was 17.8% among adults ___ to ___ years old in ___. Assume this prevalence estimate is accurate today, and we randomly sample 500 individuals in South Africa. Suppose X is the number of HIV positive individuals in the sample.

Model X using the binomial distribution.

How many individuals do we expect to be HIV positive in the sample?

E(X) = np = 500 × 0.178 = 89

What is the standard deviation of the number of HIV positive individuals in the sample?

$\mathrm{sd}(X) = \sqrt{np(1-p)} = \sqrt{500 \times 0.178 \times 0.822} = 8.55$

What is the probability of observing more than 100 HIV positive individuals?

. di 1 - binomial(500, 100, 0.178)

.09089616

. di binomialtail(500, 101, 0.178)

.09089616

What is the probability of observing between 85 and 95 HIV positive individuals?

. di binomial(500, 95, 0.178) - binomial(500, 84, 0.178)

.47533949

. di binomialtail(500, 85, 0.178) - binomialtail(500, 96, 0.178)

.47533949

Now, model X using the normal distribution instead.

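What is E(X)?

E(X) = np = 89

What is sd(X)?

$\mathrm{sd}(X) = \sqrt{np(1-p)} = 8.55$

What is the probability of observing more than 100 HIV positive individuals?

$P(X > 100) = P\left(Z > \dfrac{100 - 89}{8.55}\right) = P(Z > 1.286)$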

IMPORTANT: binomialtail() probabilities are >= (i.e., inclusive). Hence, in the binomialtail calculation for 85 to 95 we used 96 instead of 95.

. di 1-normal(1.286)

.09922153

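What is the probability of observing between 85 and 95 HIV positive individuals?

$P(85 \le X \le 95) = P(X \le 95) - P(X \le 85) = P\left(Z \le \dfrac{95 - 89}{8.55}\right) - P\left(Z \le \dfrac{85 - 89}{8.55}\right) = P(Z \le 0.702) - P(Z \le -0.468)$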

. di normal(0.702) - normal(-0.468)

.43876812

Do the normal and binomial models give similar results?

What is the probability of observing more than 100 HIV positive individuals?

Binomial: 0.091

Normal: 0.099

What is the probability of observing between 85 and 95 HIV positive individuals?

Binomial: 0.475

Normal: 0.439

Yes, they give similar results. The approximation is better “in the tails” (i.e., for calculating the probability of observing more than 100 HIV+ individuals) than in the center of the distribution (between 85 and 95 HIV+).

http://www.unaids.org/en/regionscountries/countries/southafrica/
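
The binomial and normal answers can be placed side by side in a single run. This sketch uses n = 500 and p = 0.178, the values that appear in the Stata commands above; the scalar names are illustrative, and no continuity correction is applied, matching the handout's approach.

```stata
scalar n  = 500
scalar p  = 0.178
scalar mu = n*p                    // mean of X
scalar sd = sqrt(n*p*(1 - p))      // standard deviation of X

display mu
display sd

* P(X > 100): exact binomial, then normal approximation
display 1 - binomial(500, 100, 0.178)
display 1 - normal((100 - mu)/sd)

* P(85 <= X <= 95): exact binomial, then normal approximation
display binomial(500, 95, 0.178) - binomial(500, 84, 0.178)
display normal((95 - mu)/sd) - normal((85 - mu)/sd)
```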

Tutorial: Central Limit Theorem in Stata

We examine BMI at baseline using the Framingham cohort as our reference population.

Specifically, we can think of the Framingham population as ‘the population of interest’ and

consider sampling from this population to examine how statistics behave in samples from a

population where we know about everyone.

1. Calculate the mean µ and standard deviation σ of BMI in the Framingham dataset at baseline.

. summarize bmi1

µ = 25.8 and σ = 4.1.

2. Take a sample of size 20 from the Framingham dataset. Calculate a sample mean BMI at baseline, $\bar{x}_1$. Then take a second sample from the same population and calculate the sample mean, $\bar{x}_2$. Would you expect $\bar{x}_1$ and $\bar{x}_2$ to be exactly the same? Why or why not?

use "fhs.dta", clear

drop if bmi1 == .

keep bmi1

preserve

sample 20, count

mean bmi1

restore

preserve

sample 20, count

mean bmi1

We don’t expect $\bar{x}_1$ and $\bar{x}_2$ to be exactly the same, because the mean has some stochastic variability.

3. Repeat this exercise, but with a sample size of 100. Are $\bar{x}_1$ and $\bar{x}_2$ closer together than those from the samples of size 20? Are $\bar{x}_1$ and $\bar{x}_2$ always going to be closer together using a sample size of 100 versus 20?

restore

preserve

sample 100, count

mean bmi1

restore

preserve

sample 100, count

mean bmi1

In my sample, the values of the sample mean are closer with the larger sample. This will

usually, but not always, be true.

4. Compare histograms of BMI at baseline and prevalent MI at baseline. Would the central

limit theorem apply to the binary indicator prevalent MI at baseline?

[Figure: two histograms on a density scale. Left: Body mass index, exam 1. Right: Prevalent myocardial infarction, exam 1.]

Yes, but the more skewed a distribution is, the larger sample size we need to collect before

the CLT “kicks in”.
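
Drawing two samples by hand shows that sample means vary; repeating the draw many times shows the shape of that variation. The sketch below is one possible way to do this with postfile, not part of the original exercise. It assumes fhs.dta is in the working directory, and the seed, the 500 repetitions, and the names sim, means and xbar are arbitrary choices.

```stata
use "fhs.dta", clear
drop if bmi1 == .
keep bmi1

set seed 12345                       // arbitrary seed, for reproducibility
tempname sim
tempfile means
postfile `sim' xbar using `means'

* Draw 500 samples of size 20 and store each sample mean
forvalues i = 1/500 {
    preserve
    quietly sample 20, count
    quietly summarize bmi1
    post `sim' (r(mean))
    restore
}
postclose `sim'

* The stored means approximate the sampling distribution of the mean
use `means', clear
summarize xbar
histogram xbar, normal
```

If the central limit theorem is doing its job, the histogram of xbar should look roughly normal, with a standard deviation close to 4.1/sqrt(20).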

Tutorial: Confidence and Predictive Intervals in Stata

1. Let X denote a random variable that represents BMI at baseline for the Framingham

cohort. Assume that X is normally distributed. What is the mean of X? The standard

deviation?

. summarize bmi1

2. Construct a 95% predictive interval for X. Pick a random observation from the dataset.

Does your interval contain the BMI for the randomly selected observation?

95% predictive interval for X is defined as $\mu \pm 1.96\,\sigma$.

3. Suppose we now draw repeated samples of size 100 from the Framingham cohort. What is a 95% predictive interval for $\bar{X}$?

95% predictive interval for $\bar{X}$ is defined as $\mu \pm 1.96\,\sigma/\sqrt{n}$.

4. Take a sample of size 100 from the Framingham dataset. Does your predictive interval for $\bar{X}$ contain the mean from the 100 person subsample?

. sample 100, count

. sum bmi1

5. Construct a 95% confidence interval for the mean BMI in this sample. Does the 95%

confidence interval contain the mean BMI for the entire cohort?

A 95% CI for µ is defined as $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$.
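
The two predictive intervals differ only in the divisor sqrt(n), which a short calculation makes visible. The values µ = 25.8 and σ = 4.1 used here are the cohort values reported in the Central Limit Theorem tutorial; your own summarize output may differ in later decimal places.

```stata
* Cohort values: mu = 25.8, sigma = 4.1

* 95% predictive interval for a single BMI value X
display 25.8 - 1.96*4.1
display 25.8 + 1.96*4.1

* 95% predictive interval for the mean of a sample of size 100
display 25.8 - 1.96*4.1/sqrt(100)
display 25.8 + 1.96*4.1/sqrt(100)
```

The confidence interval in question 5 has the same half-width as the second interval but is centered at the sample mean rather than at µ.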

Tutorial: Confidence intervals with the t-distribution in Stata

Suppose t is a random variable that follows a t-distribution with n degrees of freedom.

tden(n,t) returns the probability density function of Student's t distribution

ttail(n,t) returns the reverse cumulative (upper tail or survivor) Student's t distribution

invttail(n,p) returns the inverse reverse cumulative (upper tail or survivor) Student's t distribution

Note that if ttail(n,t)= p, then invttail(n,p) = t.

Stata will calculate confidence intervals for you:

Calculator: cii n mean sd, level(95)

Function: ci varlist, level(95)

There is no Stata function for calculating confidence intervals for normally distributed data

when the standard deviation is known, since this scenario doesn’t really happen in practice.

1. Calculate the mean and standard deviation of BMI at baseline.

. summarize bmi1

2. Take a sample of size 20 from the Framingham cohort. Calculate the mean and

standard deviation of BMI at baseline in the subsample (I use set seed 2, if you want

to get the same sample as me). We are interested in making inference about BMI at

baseline in the total Framingham cohort using only the sample of size 20.

. set seed 2

. drop if bmi1 == .

. sample 20, count

. sum bmi1

3. Assume that the standard deviation is known (and equal to the standard deviation in the Framingham cohort). Construct a 95% confidence interval for the mean BMI in your subsample. Note that if normal(z) = p, then invnormal(p) = z.

95% CI: $\bar{x} \pm z_{0.975}\,\sigma/\sqrt{n}$

. di 25.0 - invnormal(0.975)*4.1/sqrt(20)

. di 25.0 + invnormal(0.975)*4.1/sqrt(20)

4. Use invttail to construct a 95% confidence interval for the mean BMI in your subsample

by hand, now assuming that the sample standard deviation is unknown.

. di 25.0 - invttail(19, 0.025)*3.2/sqrt(20)

. di 25.0 + invttail(19, 0.025)*3.2/sqrt(20)

5. Use cii to construct a 95% confidence interval for the mean BMI in your subsample.

. cii 20 25.0 3.2

6. Use ci to construct a 95% confidence interval for the mean BMI in your subsample.

. ci bmi1
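
It can help to look at the two critical values on their own before comparing the intervals. This sketch uses the sample size of 20 and the illustrative mean and standard deviations (25.0, 4.1 and 3.2) from the commands above; your own subsample will give different numbers.

```stata
* Critical values: standard normal versus t with 19 degrees of freedom
display invnormal(0.975)
display invttail(19, 0.025)

* Half-widths of the two 95% intervals, using the numbers shown above
display invnormal(0.975)*4.1/sqrt(20)
display invttail(19, 0.025)*3.2/sqrt(20)
```

The t critical value (about 2.09) is larger than the normal one (about 1.96), so for the same standard deviation the t interval is wider; here the t interval ends up narrower only because the sample standard deviation happens to be smaller than 4.1.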

Tutorial: Hypothesis testing in Stata

In adults over 15 years of age, a resting heart rate around 80bpm is usually considered average. Using a subset of the Framingham cohort, we are going to attempt to make inference about heart rate among “healthy young” adults.

Specifically, we restrict our analysis to adults with the following characteristics at baseline: non-smoker, younger than 40, BMI less than 25, diastolic blood pressure less than 80, and systolic blood pressure less than 120. There are 61 participants who meet our criteria. We hypothesize that heart rate at the follow up exam in 1962 would be lower than 80bpm, the resting heart rate for adults with average health.

We are making the somewhat strong assumption that these Framingham participants are generalizable to the broader population of healthy young adults (this assumption is necessary if we want to make inference about heart rate in healthy young adults). Use the dataset on this webpage to answer the following questions:

1. Make a histogram of heart rate at exam 2. Is the normality assumption reasonable?

histogram heartrte2

histogram heartrte2 if heartrte2 < 200

2. You are interested in whether the mean heart rate at exam 2 among healthy young adults is equal to 80bpm. Perform a hypothesis test at the α = 0.05 level.

(a) What test are you using? One-sample t-test

(b) State your null and alternative hypothesis.

$H_0: \mu = 80$, $H_A: \mu \neq 80$

(c) Perform the hypothesis test.

Hypothesis testing in Stata: To examine options for t-tests in Stata, type db ttest.

Or, using the dropdown menu, explore the options in Summaries, tables, and tests/Classical tests of hypothesis/.

. ttest heartrte2 == 80

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
heartr~2 |      61    76.55738    2.800032    21.86895    70.95648    82.15827
------------------------------------------------------------------------------
    mean = mean(heartrte2)                                        t =  -1.2295
Ho: mean = 80                                    degrees of freedom =       60

    Ha: mean < 80               Ha: mean != 80                 Ha: mean > 80
 Pr(T < t) = 0.1118         Pr(|T| > |t|) = 0.2237          Pr(T > t) = 0.8882

. ttesti 61 76.557 21.869 80

One-sample t test
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      61      76.557    2.800039      21.869    70.95609    82.15791
------------------------------------------------------------------------------
    mean = mean(x)                                                t =  -1.2296
Ho: mean = 80                                    degrees of freedom =       60

    Ha: mean < 80               Ha: mean != 80                 Ha: mean > 80
 Pr(T < t) = 0.1118         Pr(|T| > |t|) = 0.2236          Pr(T > t) = 0.8882

What are:

i. your test statistic: t = -1.23

ii. the distribution of your test statistic under the null hypothesis: $t \sim t_{60}$

iii. the p-value: 0.2236

iv. your decision: Fail to reject the null hypothesis.

v. your interpretation: We do not have enough evidence to suggest that the heart rate is different from 80 in healthy young adults at follow up.

3. As a diligent statistician, you decide to investigate the issue of the outlier in your dataset. List the information for the outlier.

. list if heartrte2 > 200

4. Repeat the hypothesis test, excluding this observation. What do you find?

. ttest heartrte2 == 80 if heartrte2 < 200

5. As the statistician, what results should you present in your analysis?


Example: Atherosclerosis and Physical Activity

Oxidation of components of LDL cholesterol (the bad cholesterol) can result in atherosclerosis, or hardening of the arteries. Elosua et al. (2003) examine the impact of a 16-week physical activity program on LDL resistance to oxidation in 17 healthy young adults. After completing the program, the average maximum oxidation rate in the study participants, $\bar{x}$, was 8.2 µmol/min/g, and the sample standard deviation of the maximum oxidation rate was s = 2.5 µmol/min/g. Assume that the oxidation rate is normally distributed.

• What is the distribution of $\bar{x}$?

$$\frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{16}$$

• Suppose the average maximum oxidation rate in healthy young adults who did not complete the program was µ0 = 11.3 µmol/min/g and the standard deviation was σ = 2.3. Define $\bar{x}_0$ as the sample mean maximum oxidation rate from a sample of size 17 from this population. Construct a 99% predictive interval for $\bar{x}_0$. Is $\bar{x}$ in this interval?

. di 11.3 - invnormal(0.995)*2.3/sqrt(17)

. di 11.3 + invnormal(0.995)*2.3/sqrt(17)
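The two display commands evaluate µ0 ± z(0.995) σ/√n. As a cross-check, here is a small Python sketch of the same arithmetic, using the standard library's normal quantile function in place of Stata's invnormal().

```python
import math
from statistics import NormalDist

mu0, sigma, n = 11.3, 2.3, 17
x_bar = 8.2                             # observed mean among program participants

z = NormalDist().inv_cdf(0.995)         # plays the role of invnormal(0.995)
half_width = z * sigma / math.sqrt(n)
lower, upper = mu0 - half_width, mu0 + half_width

inside = lower <= x_bar <= upper
print(f"99% predictive interval: ({lower:.3f}, {upper:.3f}); x_bar inside: {inside}")
```

The observed mean of 8.2 falls below the lower limit, which is what the question asks you to notice.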

• Construct a 99% confidence interval for µ.

. cii 17 8.2 2.5, level(99)

• If you constructed the 99% confidence interval for µ assuming that the standard deviation was known and equal to σ = 2.3, would your confidence interval be wider or narrower? Will this result always be true?

Standard deviation known: $\bar{x} \pm Z_{0.99}\,\sigma/\sqrt{n}$

Standard deviation unknown: $\bar{x} \pm t_{0.99,16}\,s/\sqrt{n}$

• Let µ denote the mean maximum oxidation rate in young adults who participate in the program. Test the hypothesis that µ = µ0 against the alternative that µ ≠ µ0 at the α = 0.01 level. What do you conclude?

H0: µ = µ0, HA: µ ≠ µ0

. ttesti 17 8.2 2.5 11.3, level(99)


Using a one-sample t-test, we obtain a test statistic of -5.11, which follows a t-distribution with 16 degrees of freedom under the null hypothesis, corresponding to a p-value of 0.0001. We reject the null hypothesis at the α = 0.01 level and conclude that the data suggest that the 16-week physical activity program lowers the maximum oxidation rate in healthy young individuals.

Elosua R., Molina L., Fito M., Arquer A., Sanchez-Quesada J.L., Covas M.I., Ordonez-Llanos J., Marrugat J. (2003). Response of oxidative stress biomarkers to a 16-week aerobic physical activity program, and to acute physical activity, in healthy young men and women. Atherosclerosis 167(2), 327-334.


Two Sample t-tests in Stata

Example: In the Framingham cohort, we want to examine the distribution of heart rate at exams 1 and 2. Specifically, we wish to test whether there is a difference in mean heart rate between exam 1 and exam 2. Additionally, we are interested in whether the mean heart rate differs between men and women at exam 2. We sample 100 people from the Framingham cohort. For this example, use the dataset heartrate.dta on this webpage, which contains the random sample of 100 participants.

Hypothesis testing with paired data in Stata:

. ttest heartrte1 == heartrte2

Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
heartr~1 |     100       75.03    1.290247    12.90247    72.46987    77.59013
heartr~2 |     100       76.17    1.293031    12.93031    73.60435    78.73565
---------+--------------------------------------------------------------------
    diff |     100       -1.14    1.344125    13.44125   -3.807035    1.527035
------------------------------------------------------------------------------
     mean(diff) = mean(heartrte1 - heartrte2)                     t =  -0.8481
 Ho: mean(diff) = 0                              degrees of freedom =       99

 Ha: mean(diff) < 0           Ha: mean(diff) != 0           Ha: mean(diff) > 0
 Pr(T < t) = 0.1992         Pr(|T| > |t|) = 0.3984          Pr(T > t) = 0.8008

. gen hdiff = heartrte2 - heartrte1

. ttest hdiff== 0

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   hdiff |     100        1.14    1.344125    13.44125   -1.527035    3.807035
------------------------------------------------------------------------------
    mean = mean(hdiff)                                            t =   0.8481
Ho: mean = 0                                     degrees of freedom =       99

    Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
 Pr(T < t) = 0.8008         Pr(|T| > |t|) = 0.3984          Pr(T > t) = 0.1992

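The commands ttest heartrte2 == heartrte1 and ttest hdiff == 0 lead to the same test. The paired test shown above was run as ttest heartrte1 == heartrte2, so its mean difference and t statistic have the opposite sign and the one-sided p-values are swapped; the two-sided p-value (0.3984) is identical.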

This command can be found through the following drop-down menus: Statistics / Summaries, tables, and tests / Classical tests of hypotheses / Mean-comparison test, paired data.


Hypothesis testing with unpaired data and equal variances in Stata:

. ttest heartrte2, by(sex1)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |      39    76.82051    2.042025    12.75244    72.68665    80.95438
  Female |      61     75.7541    1.681246    13.13095    72.39111    79.11709
---------+--------------------------------------------------------------------
combined |     100       76.17    1.293031    12.93031    73.60435    78.73565
---------+--------------------------------------------------------------------
    diff |            1.066414    2.662326               -4.216884    6.349713
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =   0.4006
Ho: diff = 0                                     degrees of freedom =       98

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.6552         Pr(|T| > |t|) = 0.6896          Pr(T > t) = 0.3448
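To see where the standard error of the difference comes from, the pooled-variance calculation can be reproduced from the group summaries. This is a Python sketch of the textbook equal-variance formula, using only the numbers printed in the output above.

```python
import math

# Group summaries copied from the Stata output above
n_m, mean_m, sd_m = 39, 76.82051, 12.75244   # men
n_f, mean_f, sd_f = 61, 75.7541, 13.13095    # women

df = n_m + n_f - 2
pooled_var = ((n_m - 1) * sd_m**2 + (n_f - 1) * sd_f**2) / df   # pooled variance
std_err = math.sqrt(pooled_var * (1 / n_m + 1 / n_f))           # SE of the difference
t_stat = (mean_m - mean_f) / std_err

print(f"SE = {std_err:.6f}, t = {t_stat:.4f}, df = {df}")
```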

Hypothesis testing with unpaired data and unequal variances in Stata:

. ttest heartrte2, by(sex1) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |      39    76.82051    2.042025    12.75244    72.68665    80.95438
  Female |      61     75.7541    1.681246    13.13095    72.39111    79.11709
---------+--------------------------------------------------------------------
combined |     100       76.17    1.293031    12.93031    73.60435    78.73565
---------+--------------------------------------------------------------------
    diff |            1.066414    2.645081               -4.194674    6.327503
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =   0.4032
Ho: diff = 0                     Satterthwaite's degrees of freedom =  82.8637

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.6561         Pr(|T| > |t|) = 0.6879          Pr(T > t) = 0.3439
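With unequal variances, the standard error no longer pools the two groups, and the degrees of freedom come from Satterthwaite's approximation. The Python sketch below recomputes both from the per-group standard errors in the output above; it illustrates the standard formula and is not a reproduction of Stata's internal code.

```python
import math

# Per-group standard errors of the mean, copied from the Stata output above
n_m, se_m = 39, 2.042025    # men
n_f, se_f = 61, 1.681246    # women
diff = 1.066414             # mean(Male) - mean(Female)

var_sum = se_m**2 + se_f**2
std_err = math.sqrt(var_sum)                   # unpooled SE of the difference
t_stat = diff / std_err

# Satterthwaite's approximate degrees of freedom
df = var_sum**2 / (se_m**4 / (n_m - 1) + se_f**4 / (n_f - 1))

print(f"SE = {std_err:.6f}, t = {t_stat:.4f}, df = {df:.4f}")
```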

This command can be found through the following drop-down menus: Statistics / Summaries, tables, and tests / Classical tests of hypotheses / Two-group mean-comparison test.

Instead of the data structure above, suppose that, in your dataset, you have heart rate for men in one variable/column and heart rate for women in another variable/column (instead of our situation where we have heart rate in one variable and sex as another variable). How do you perform a t-test then? Use the command ttest heartrtew == heartrtem, unpaired unequal, where heartrtew is the heart rate variable for women and heartrtem is the heart rate variable for men. It is important to use the option unpaired. If you do not use this option, Stata will perform a paired t-test. You may also choose to leave out the unequal option if you wish to assume equal variances.


The following 4 lines of code transform the data to the situation where we have heart rate for men in one variable (heartrtem) and heart rate for women in another variable (heartrtew). It is not necessary to memorize or understand this portion of code. It is simply included for completeness. The fifth line of code runs the two-sample t-test.

. gen id = _n

. reshape wide heartrte2, i(id) j(sex1)

. rename heartrte21 heartrtem

. rename heartrte22 heartrtew

. ttest heartrtew == heartrtem, unpaired unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
heartr~w |      61     75.7541    1.681246    13.13095    72.39111    79.11709
heartr~m |      39    76.82051    2.042025    12.75244    72.68665    80.95438
---------+--------------------------------------------------------------------
combined |     100       76.17    1.293031    12.93031    73.60435    78.73565
---------+--------------------------------------------------------------------
    diff |           -1.066414    2.645081               -6.327503    4.194674
------------------------------------------------------------------------------
    diff = mean(heartrtew) - mean(heartrtem)                      t =  -0.4032
Ho: diff = 0                     Satterthwaite's degrees of freedom =  82.8637

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.3439         Pr(|T| > |t|) = 0.6879          Pr(T > t) = 0.6561

This command can be found through the following drop-down menus: Statistics / Summaries, tables, and tests / Classical tests of hypotheses / Two-sample mean-comparison test.

Exercises

1. Calculate the sample mean and sample standard deviation of heart rate at exam 1 and exam 2 in the Framingham cohort.

2. Are these data dependent or independent?

3. Generate a new variable for the difference in heart rate between exam 1 and exam 2. Make a histogram of this new variable.

4. Perform a hypothesis test at the α = 0.05 level.

(a) What test are you using?

(b) State your null and alternative hypothesis.

(c) Perform the hypothesis test. What are:

i. your test statistic,
ii. the degrees of freedom,
iii. the p-value,
iv. your decision, and
v. your interpretation?


Now, assume that you are interested in whether the mean heart rate differs between men and women at exam 2.

5. Are these data dependent or independent?

6. Calculate the sample mean and sample standard deviation of heart rate at exam 2 for men and women.

7. Perform a hypothesis test at the α = 0.05 level, assuming unequal variances.

(a) What test are you using?

(b) State your null and alternative hypothesis.

(c) Perform the hypothesis test. What are:

i. your test statistic,
ii. the degrees of freedom,
iii. the p-value,
iv. your decision, and
v. your interpretation?

8. Given the 95% confidence intervals, would you expect the hypothesis test to be significant?


Power and Sample Size in Stata


sampsi - Sample size and power for means and proportions

Power

sampsi 18.4 20.4, sd1(2.8) n1(20) onesample

Sample Size

sampsi 18.4 20.4, sd1(2.8) power(.90) onesample

The notation changes slightly for two-sample or one-sided tests. Type db sampsi to see all options available within the sampsi command or select from the drop-down menus: Statistics / Power and sample size / Tests of means and proportions.
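The two one-sample calls above can be approximated by hand with the usual normal-approximation formulas. The Python sketch below assumes that this approximation is what drives the calculation (an assumption about sampsi, not something stated in this tutorial) and ignores the negligible probability in the opposite tail.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf    # standard normal quantile function
phi = NormalDist().cdf      # standard normal CDF

mu0, mu_a, sd, alpha = 18.4, 20.4, 2.8, 0.05
delta = abs(mu_a - mu0)

# Power of the two-sided one-sample test with n = 20
n = 20
power = phi(delta * math.sqrt(n) / sd - z(1 - alpha / 2))

# Smallest sample size that gives 90% power
target_power = 0.90
n_required = math.ceil(((z(1 - alpha / 2) + z(target_power)) * sd / delta) ** 2)

print(f"power with n = 20: {power:.4f}; n for 90% power: {n_required}")
```

Comparing these numbers with the sampsi output is a quick way to confirm that the options were specified as intended.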

Example: Suppose we aim to implement a new physical activity program among school-aged children between 6 and 11 years old at high risk for obesity. We define high-risk children as those children who do less than 2 hours of physical activity per week. According to Ogden et al. (2012), mean BMI among children 6-11 years old in the United States was 18.4 between 2009 and 2010, with standard deviation 2.8. Before implementing this program, we want to perform a baseline survey to evaluate the state of the obesity epidemic among the high-risk children. We plan to design the survey to test whether the mean BMI in the high-risk children is equal to the mean BMI among 6-11 year olds in the United States at the α = 0.05 level. To design the study, assume the standard deviation of BMI is equal in the general population and the high-risk children.

Ogden C.L., Carroll M.D., Kit B.K., and Flegal K.M. (2012). Prevalence of Obesity and Trends in Body Mass Index Among US Children and Adolescents, 1999-2010. JAMA: The Journal of the American Medical Association 307(5), 483-490.

1. State the null and alternative hypothesis for the test above.

H0 : µ = 18.4

HA : µ ≠ 18.4

2. Fill in the table below:


Sample Size    µA      Power
100            19.4
200            18.9
10,000         18.4
               20.4    0.9
               19.4    0.8
               19.4    0.9

Now, suppose we powered our study for the one-sided test that the mean BMI is equal to 18.4 versus the alternative that the mean is higher in the high-risk children. Repeat the calculations above and compare to the two-sided calculations.

Power: sampsi 18.4 20.4, sd1(2.8) n1(20) onesample onesided

Sample Size: sampsi 18.4 20.4, sd1(2.8) power(.90) onesample onesided

1. State the null and alternative hypothesis for the test above.

H0 : µ = 18.4

HA : µ > 18.4

2. Fill in the table below:

Sample Size    µA      Power
100            19.4
200            18.9
10,000         18.4
               20.4    0.9
               19.4    0.8
               19.4    0.9

Suppose we also wanted to investigate whether the BMI among high-risk children differed between boys and girls. Let us assume that the standard deviations of BMI among high-risk children are both equal to 2.8.

Power: sampsi 18.4 20.4, sd1(2.8) sd2(2.8) n1(20) n2(20)

Sample Size: sampsi 18.4 20.4, sd1(2.8) sd2(2.8) power(.90)
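For two independent groups of equal size, the variance of the difference in means doubles, which changes both formulas. The Python sketch below again assumes the plain normal approximation (an assumption about what sampsi computes for means) with a common standard deviation of 2.8.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf
phi = NormalDist().cdf

mu_1, mu_2, sd, alpha = 18.4, 20.4, 2.8, 0.05
delta = abs(mu_2 - mu_1)

# Power of the two-sided two-sample test with 20 children per group
n_per_group = 20
se_diff = sd * math.sqrt(2 / n_per_group)
power = phi(delta / se_diff - z(1 - alpha / 2))

# Children needed per group for 90% power
target_power = 0.90
n_required = math.ceil(2 * ((z(1 - alpha / 2) + z(target_power)) * sd / delta) ** 2)

print(f"power with 20 per group: {power:.4f}; n per group for 90% power: {n_required}")
```

The required sample size per group is about twice the one-sample requirement for the same difference and standard deviation.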

1. State the null and alternative hypothesis for the test above.

H0 : µB = µG

HA : µB ≠ µG

2. Let µB and µG denote the mean BMI in boys and girls, respectively; let nB and nG denote the sample size required for boys and girls. Fill in the table below:

nB    nG    µB      µG      Power
20    20    20.4    18.4
20    20    19.4    18.4
            20.4    19.4    0.9
            22.4    18.4    0.8


Tutorial: ANOVA in Stata

In this example, we will use data from the California Health Interview Survey (CHIS). From their website (http://www.chis.ucla.edu): CHIS is the nation’s largest state health survey. Conducted every two years on a wide range of health topics, CHIS data gives a detailed picture of the health and health care needs of California’s large and diverse population. CHIS is conducted by the UCLA Center for Health Policy Research in collaboration with many public agencies and private organizations.

In 2009, CHIS surveyed more than 47,000 adults, more than 12,000 teens and children and more than 49,000 households. We will use a sample of 500 adults for this lab (CHISANOVA.dta). Suppose we are interested in the relationship between number of hours worked (per week) and health, as measured by BMI. Would we expect those who worked longer hours to be healthier than those who worked shorter hours, or vice versa? Number of hours worked per week is divided into 5 categories: 0-10, 10-25, 25-35, 35-45, 45+.

1. How many people are in each category?

2. We now wish to run an ANOVA. Are the assumptions for ANOVA met?

3. What are the null and alternative hypotheses for this test?

4. Perform the hypothesis test at the α = 0.05 level.

Conduct a oneway ANOVA in Stata using the oneway command:

. oneway bmi work_cat, tabulate

| Summary of bmi

work_cat | Mean Std. Dev. Freq.

------------+------------------------------------

0-10 | 26.431579 5.9410147 38

10-25 | 26.429189 5.7075504 74

25-35 | 24.3495 4.1477871 60

35-45 | 27.128351 5.647101 188

45+ | 27.854928 6.1797228 140

------------+------------------------------------

Total | 26.8419 5.7540637 500

Analysis of Variance

Source SS df MS F Prob > F

------------------------------------------------------------------------

Between groups 550.823688 4 137.705922 4.27 0.0021

Within groups 15970.6916 495 32.2640234

------------------------------------------------------------------------

Total 16521.5153 499 33.1092491

Bartlett’s test for equal variances: chi2(4) = 11.7543 Prob>chi2 = 0.019
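The F statistic in the ANOVA table is the ratio of the between-group and within-group mean squares. A short Python sketch, using only the sums of squares and degrees of freedom printed above, makes the arithmetic explicit.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_between, df_between = 550.823688, 4      # k - 1 = 5 groups - 1
ss_within, df_within = 15970.6916, 495      # N - k = 500 - 5

ms_between = ss_between / df_between        # mean square between groups
ms_within = ss_within / df_within           # mean square within groups
f_stat = ms_between / ms_within

print(f"MSB = {ms_between:.4f}, MSW = {ms_within:.4f}, F = {f_stat:.2f}")
```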


You may also use the following drop-down menus to access the oneway command: Statistics / Linear models and related / ANOVA/MANOVA / One-way ANOVA.

What are:

(a) your test statistic,
(b) the degrees of freedom,
(c) the p-value,
(d) your decision, and
(e) your interpretation?

5. We have rejected the null hypothesis, thus we have evidence that at least one pair of means is not equal. Perform all possible pairwise comparisons using the Bonferroni correction.

6. Which pairs of means are significantly different?
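The size of the Bonferroni adjustment depends only on the number of pairwise comparisons. The Python sketch below counts the pairs among the five work categories and divides the overall α = 0.05 by that count; it illustrates the correction itself rather than any particular Stata option.

```python
from itertools import combinations

groups = ["0-10", "10-25", "25-35", "35-45", "45+"]
alpha = 0.05

pairs = list(combinations(groups, 2))     # every unordered pair of categories
alpha_per_test = alpha / len(pairs)       # Bonferroni-adjusted significance level

print(f"{len(pairs)} comparisons, each tested at alpha = {alpha_per_test:.3f}")
```

Equivalently, each unadjusted pairwise p-value can be multiplied by the number of comparisons (capped at 1) and compared with 0.05.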

7. A colleague of yours, who has the same dataset, calculates the means for each work category. After looking at these means he takes the group with the largest mean (45+) and the group with the smallest mean (25-35) and performs a t-test (without a Bonferroni correction). He tells you that since he only did one test, he does not need to correct for multiple comparisons and that his method is valid. Do you agree? Why or why not?


Tutorial: Methods for one-sample proportion inference

In this tutorial, we learn about Stata commands for one-sample proportion inference.

Confidence intervals: ci and cii: calculate binomial confidence intervals

Hypothesis Tests: bitest and bitesti: exact binomial one-sample proportion hypothesis test; prtest and prtesti: large-sample one-sample proportion hypothesis test

Recall that the extra “i” at the end of a Stata command name denotes that the command is “immediate” and does not use the data in memory.

Exercises

1. Estimate the proportion of California residents who visit the doctor at least once in the previous year, denoted p.

. tabulate doctor

2. Construct a confidence interval for p using three different methods. Can we use the normal approximation to the binomial distribution? How do the widths of these three CI’s compare?

. ci doctor, binomial

. ci doctor, binomial wald

. ci doctor, binomial wilson

Exact: never has lower than expected coverage, but is sometimes too conservative.

Wald: large sample, bad coverage, easy to calculate, flexible.

Wilson: large sample, good coverage, less flexible.
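To make the difference between the two large-sample intervals concrete, here is a Python sketch of the textbook Wald and Wilson formulas. The counts (402 visits out of 500 respondents) are hypothetical and chosen only for illustration; they are not taken from the CHIS output.

```python
import math
from statistics import NormalDist

def wald_ci(x, n, level=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_ci(x, n, level=0.95):
    """Wilson (score) interval, centred between p_hat and 1/2."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p = x / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

x, n = 402, 500          # hypothetical counts, for illustration only
wald = wald_ci(x, n)
wilson = wilson_ci(x, n)
print("Wald:   (%.4f, %.4f)" % wald)
print("Wilson: (%.4f, %.4f)" % wilson)
```

With a sample this large and a proportion away from 0 and 1, the two intervals are close; they diverge for small n or extreme proportions, which is where the Wald interval's coverage suffers.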

3. Using the […]% confidence level, is there evidence in the data that less than 80% of the population visits the doctor once per year? Repeat this analysis, stratifying by above/below poverty groups.

. bysort poverty: ci doctor, binomial

4. Let’s formalize question 3 using a hypothesis test. Let p denote the proportion of California residents below the federal poverty level who visited the doctor at least once in the past year. Test the hypothesis that p = 0.8 versus the alternative that p < 0.8 at the α = […] level.

(a) First, use the exact binomial test. What is the p-value?

. bitest doctor == 0.8 if poverty == 1

(b) Next, use the normal approximation to the binomial distribution.

. prtest doctor == 0.8 if poverty == 1

Is the normal approximation appropriate?

np > 5 and n(1 - p) > 5

Therefore, the normal approximation to the binomial is appropriate.

What is the value of your test statistic?

Z = […]

What is the distribution of your test statistic under the null hypothesis?

Z ~ N(0, 1)

What is the p-value of your test?

p = […]

Do you reject or not reject the null hypothesis?

We reject the null hypothesis.

What do you conclude?

We conclude that there is evidence in the data that p is less than 0.8.

5. Given that you got different results using the exact and large-sample hypothesis tests, what would you do if you were writing a paper?

There are no meaningful differences between a p-value of […] and […]; try to include confidence intervals in practice, as p-values don’t tell you anything about the magnitude of an effect.
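The exact and large-sample tests can disagree near the significance threshold, and a small numerical example shows how. The Python sketch below uses hypothetical counts (35 of 50 respondents with a doctor visit), not the CHIS data, and computes the lower-tail p-value for H0: p = 0.8 both ways.

```python
import math
from statistics import NormalDist

x, n, p0 = 35, 50, 0.8     # hypothetical counts, for illustration only

# Large-sample test: Z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_hat = x / n
z_stat = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_large_sample = NormalDist().cdf(z_stat)            # lower tail, HA: p < p0

# Exact binomial test: P(X <= x) when X ~ Binomial(n, p0)
p_exact = sum(math.comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(x + 1))

print(f"Z = {z_stat:.4f}, large-sample p = {p_large_sample:.4f}, exact p = {p_exact:.4f}")
```

In this made-up example the two p-values land on either side of 0.05, which is the kind of situation the last exercise asks you to think about.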


Two Sample Proportion Tests in Stata

Before delving into two-way associations using contingency (two-by-two) tables, we first examine the structure of the two-sample test of proportions, using the normal approximation to the binomial.

Exercises:

1. How might we define a test statistic for comparing two proportions? Specifically, we would like to test the hypothesis that H0: p1 = p0 versus the alternative that p1 ≠ p0 at the α = 0.05 level. How does this test compare to the two-sample mean test for normally distributed data from last week?

Recall the two-sample t-test for equal variances:

Assume $X_1 \sim N(\mu_1, \sigma^2)$, and the sample mean of multiple realizations of $X_1$ is $\bar{x}_1$ and the sample standard deviation is $s_1$; and $X_2 \sim N(\mu_2, \sigma^2)$, and the sample mean of multiple realizations of $X_2$ is $\bar{x}_2$ and the sample standard deviation is $s_2$.

To test H0: µ1 = µ2 vs. HA: µ1 ≠ µ2, our test statistic for the two-sample t-test with equal variances was:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \overset{H_0}{\sim} t_{n_1+n_2-2}$$

Remember: the variance is independent of the mean for normally distributed data. For binomial data, the variance is a function of the mean.

For binomial data:

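• Assume $X_1 \sim \text{Binomial}(n_1, p_1)$ and $X_0 \sim \text{Binomial}(n_0, p_0)$.

• Define $\hat{p}_1 = X_1/n_1$ and $\hat{p}_0 = X_0/n_0$.

• Using the Central Limit Theorem, we know that $\hat{p}_1 \sim N(p_1, p_1(1 - p_1)/n_1)$ and $\hat{p}_0 \sim N(p_0, p_0(1 - p_0)/n_0)$, approximately.

• Under the null hypothesis that $p_1 = p_0$, $\hat{p}_1 - \hat{p}_0 \sim N(0, V)$, where $V = \hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_0}\right)$ and $\hat{p} = \frac{X_1 + X_0}{n_1 + n_0}$.

• Therefore, a natural test statistic for testing H0: p1 = p0, HA: p1 ≠ p0 is:

$$\frac{\hat{p}_1 - \hat{p}_0}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_0}\right)}} \overset{H_0}{\sim} N(0, 1)$$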

For binomial data, the structure of the test statistic is similar to the two-sample t-test with equal variances, because, under the null, the variances are equal in both groups.
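The pooled two-sample statistic derived above is easy to compute directly. The following Python sketch uses hypothetical counts (35 of 50 in group 1, 367 of 450 in group 0) purely for illustration; the function name is also made up for this example.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x0, n0):
    """Pooled two-sample z statistic and two-sided p-value for H0: p1 = p0."""
    p1_hat, p0_hat = x1 / n1, x0 / n0
    p_pooled = (x1 + x0) / (n1 + n0)                       # pooled estimate under H0
    v = p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n0)      # variance V under H0
    z = (p1_hat - p0_hat) / math.sqrt(v)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts, for illustration only
z_stat, p_value = two_proportion_z(x1=35, n1=50, x0=367, n0=450)
print(f"Z = {z_stat:.4f}, two-sided p = {p_value:.4f}")
```

Note that the same pooled proportion appears in both variance terms, which mirrors the pooled standard deviation in the equal-variance t-test.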


2. Let p1/p0 denote the proportion of CA residents below/above the federal poverty level who visited the doctor at least once in the past year. Test the hypothesis that p1 = p0 versus the alternative that p1 ≠ p0 at the α = 0.05 level. What do you conclude? Report a 95% CI along with your results.

• What test are you using? Is normality reasonable?

tabulate doctor

Check that $n_1\hat{p} > 5$, $n_1(1 - \hat{p}) > 5$, $n_0\hat{p} > 5$, $n_0(1 - \hat{p}) > 5$, where $\hat{p} = 0.804$.

Two-group proportion test in Stata

. prtest doctor, by(poverty)

• What is the value of your test statistic? Z = 2.3

• What is the distribution of your test statistic? Z ~ N(0, 1)

• What is the p-value of your test? p = 0.024

• Do you reject or not reject the null hypothesis? Reject H0

• What do you conclude? There is evidence in the data that individuals in CA who are below poverty are less likely to go to the doctor.

3. Based on these data, you decide to conduct an intervention among those below the poverty line. You randomize individuals to intervention or no intervention. Suppose you power your study to detect a 15% risk difference with 90% power, assuming the proportion in the control group would equal the estimated proportion among those below poverty (70%) in this study. What sample size would you need, with equal numbers of individuals per arm, if you plan to conduct your test at the α = 0.05 level?

. sampsi 0.7 0.85, power(0.9) alpha(0.05)
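The order of magnitude of the answer can be checked with the standard normal-approximation formula for comparing two proportions. The Python sketch below computes the per-arm sample size with and without a Fleiss-style continuity correction; treating that correction as what sampsi applies by default is an assumption made here, so compare the numbers with the actual Stata output.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf

p1, p2, alpha, power = 0.70, 0.85, 0.05, 0.90
delta = abs(p2 - p1)
p_bar = (p1 + p2) / 2

# Per-arm sample size without continuity correction
n_raw = (z(1 - alpha / 2) * math.sqrt(2 * p_bar * (1 - p_bar))
         + z(power) * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
n_uncorrected = math.ceil(n_raw)

# Per-arm sample size with a Fleiss-style continuity correction (assumed default)
n_corrected = math.ceil(n_raw / 4 * (1 + math.sqrt(1 + 4 / (n_raw * delta))) ** 2)

print(f"per arm: {n_uncorrected} (uncorrected), {n_corrected} (continuity-corrected)")
```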


Tutorial: Contingency Tables

A well-known statistician once said, “A PhD student could write an entire dissertation on two-by-two tables only.” Continuing our health disparities research, we now consider the odds ratio and the Pearson Chi-square test.

Exercises

1. Using data from the […] respondents of the CHIS survey, construct a 2x2 table comparing poverty level versus past doctor visit. Display the row frequencies and the expected cell counts.

. tabulate poverty doctor, row expected

2. Construct the odds ratio and corresponding […]% confidence interval (CI) for visiting the doctor in the past 12 months for those above and below the poverty line.

. gen nopov = 1-pov

. cs doctor nopov, or woolf

Notice that the Woolf option is used, denoting that we want standard errors calculated using the formula presented in class.
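The Woolf method builds the confidence interval on the log odds ratio scale, with standard error sqrt(1/a + 1/b + 1/c + 1/d). The Python sketch below applies that formula to a hypothetical 2x2 table; the counts are invented for illustration and are not the CHIS results.

```python
import math
from statistics import NormalDist

# Hypothetical 2x2 counts, for illustration only
a, b = 367, 83    # above poverty: visited doctor, did not visit
c, d = 35, 15     # below poverty: visited doctor, did not visit

odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)    # Woolf standard error

z = NormalDist().inv_cdf(0.975)                         # 95% confidence level
ci_lower = math.exp(math.log(odds_ratio) - z * se_log_or)
ci_upper = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"OR = {odds_ratio:.3f}, 95% CI ({ci_lower:.3f}, {ci_upper:.3f})")
```

Because the interval is symmetric on the log scale, it is asymmetric around the odds ratio itself.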

3. Conduct a Pearson’s chi-square test to examine the association between poverty and prior doctor visit.

. cs doctor poverty, or woolf

Or use tabulate…

. tabulate poverty doctor, expected chi2

Note that tabulate extends nicely to RxC tables…

. tabulate racecat doctor, expected chi2
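The expected and chi2 options report quantities that are simple to compute by hand: each expected count is (row total x column total) / N, and the statistic sums (observed - expected)^2 / expected over the cells. Here is a Python sketch on a hypothetical 2x2 table (invented counts, not the CHIS data).

```python
import math
from statistics import NormalDist

# Hypothetical counts: rows = above / below poverty, columns = visited / did not visit
table = [[367, 83],
         [35, 15]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
total = sum(row_totals)

# Expected counts under independence
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))

# With 1 degree of freedom, chi2 is the square of a standard normal variable
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))

print(f"chi2(1) = {chi2:.4f}, p = {p_value:.4f}")
```

For a 2x2 table, this statistic equals the square of the pooled two-sample proportion Z statistic, so the two tests give the same p-value.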

What are the null and alternative hypotheses?

Null: no association between above/below poverty line and whether an individual visited the doctor in the past 12 months. Alternative: there is an association.

Null: OR = 1. Alternative: OR ≠ 1.

Are the expected cell counts sufficiently large?

All expected cell counts are greater than […].

What is the value and distribution of the test statistic under the null hypothesis?

$X^2 = $ […], $X^2 \sim \chi^2_1$

What is the p-value?

p = […]

Do you reject the null hypothesis? What is your conclusion?

We reject the null hypothesis and conclude that there is evidence in the data that the odds of visiting the doctor in the past 12 months are higher in those who are above the poverty line.

For the above and below poverty groups, compare the following:

• CI for the odds ratio
• CI for the difference in two proportions (from the previous tutorial)
• the p-value from the two-sample proportion test (from the previous tutorial)
• the p-value from the Pearson Chi-square test

(a) Do you get the same general conclusion with each test?

Difference in proportions between above and below poverty groups: […] with […]% CI […]

Odds ratio for above and below groups: […] with […]% CI […]

Pearson Chi-square: p = […]

Two-sample proportion test: p = […]

(b) Which test do you find most useful?

It’s always good to show both a p-value and a confidence interval. Note that you can’t show a confidence interval for the risk difference if you have a case-control study.


Tutorial: Inference for Paired Data using McNemar’s Test, Part […]

Consider the following study from Dekkers et al. that compared two different screening tests for determining adrenal insufficiency. Adrenal insufficiency is a condition in which the adrenal glands do not produce adequate amounts of certain hormones. The screening test involves measuring a patient’s cortisol response after administration of an intravenous bolus of adrenocorticotropic hormone (ACTH). Currently, two doses of ACTH are used for diagnostic purposes in patients with suspected adrenal insufficiency: […] µg and […] µg (Dekkers et al.). There is an ongoing debate about which dose should be used for the initial assessment of adrenal function (Dekkers et al.). The goal of this study was to compare the cortisol response of the […] µg and […] µg ACTH test among patients with suspected adrenal insufficiency. Patients with cortisol concentrations of […] nmol/l after ACTH stimulation (considered normal cortisol response) were classified as not having adrenal insufficiency. This was a retrospective cohort study whereby patients who received both the […] µg and […] µg ACTH test between January […] and December […] were included for analysis. The data can be found in the AI.dta data set.

Source: Dekkers OM, Timmermans JM, Smit JW, Romijn JA, Pereira AM. Comparison of the cortisol responses to testing with two doses of ACTH in patients with suspected adrenal insufficiency. Eur J Endocrinol. […] Jan […].

1. Since this is paired data, we decide to use McNemar’s test. State the null and alternative hypothesis for McNemar’s test.

Null: The proportion of patients classified as having adrenal insufficiency using the […] µg test is the same as the proportion of patients classified as having adrenal insufficiency using the […] µg test.

Alternative: Those proportions are not equal.

Is this the same as testing that the proportion of patients classified as not having adrenal insufficiency using the […] µg test is the same as the proportion of patients classified as not having adrenal insufficiency using the […] µg test?

2. Use the tabulate command to summarize the data.

. tabulate one two

           |          two
       one |  Abnormal     Normal |     Total
-----------+----------------------+----------
  Abnormal |        42         19 |        61
    Normal |        14        132 |       146
-----------+----------------------+----------
     Total |        56        151 |       207

(a) How many discordant pairs are there?


3. Carry out McNemar’s test in Stata at the α = […] significance level.

. mcc one two

                 | Controls               |
Cases            |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
         Exposed |       132          14  |        146
       Unexposed |        19          42  |         61
-----------------+------------------------+------------
           Total |       151          56  |        207

McNemar's chi2(1) =      0.76    Prob > chi2 = 0.3841
Exact McNemar significance probability       = 0.4869

Proportion with factor
        Cases       .705314
        Controls    .7294686     [95% Conf. Interval]
                   ---------     --------------------
        difference -.0241546     -.0832778   .0349687
        ratio       .9668874      .8962794   1.043058
        rel. diff. -.0892857     -.2991256   .1205541

        odds ratio  .7368421      .3418529   1.550025   (exact)

(a) What is the test statistic? Null distribution? P-value?

The test statistic is 0.76. The null distribution of the test statistic is chi-squared with 1 degree of freedom. The p-value is 0.3841. Note there is an exact test version of McNemar’s test based on the binomial distribution, leading to a p-value of 0.49, which was the p-value reported in the paper.

(b) What is your conclusion?

Since our p-value is greater than […], we fail to reject the null hypothesis. Thus we have no evidence that the proportion of patients classified as having adrenal insufficiency using the […] µg test is different from the proportion of patients classified as having adrenal insufficiency using the […] µg test.
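Both p-values in the mcc output depend only on the two discordant cells (14 and 19). The Python sketch below recomputes the large-sample statistic (b - c)^2 / (b + c) and the exact binomial version from those counts; it is a hand check, not Stata's implementation.

```python
import math
from statistics import NormalDist

# Discordant pairs copied from the mcc output above
b, c = 14, 19
n_discordant = b + c

# Large-sample McNemar statistic, chi-squared with 1 degree of freedom under H0
chi2 = (b - c) ** 2 / n_discordant
p_large_sample = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))

# Exact version: under H0 the smaller discordant count is Binomial(b + c, 1/2)
lower_tail = sum(math.comb(n_discordant, k) for k in range(min(b, c) + 1)) / 2 ** n_discordant
p_exact = min(1.0, 2 * lower_tail)

print(f"chi2(1) = {chi2:.2f}, large-sample p = {p_large_sample:.4f}, exact p = {p_exact:.4f}")
```

The 174 concordant pairs (132 + 42) do not enter either calculation.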


Tutorial: Inference for Matched Data using McNemar’s Test

To incorporate more individual information into our analysis, we match individuals who were below the poverty line to an individual who was above the poverty line based on age, urban vs. rural location, race, and gender. Note that we could incorporate more covariates to improve the matches. We conduct McNemar’s test to examine the relationship between poverty and doctor visits among matched pairs. Open the dataset chis_matched.dta.

. mcc doctor_0 doctor_1

State the null and alternative hypothesis for McNemar’s test.

Null: there is no association between poverty and visiting the doctor in the past 12 months.

Alternative: there is an association between poverty and visiting the doctor in the past 12 months.

A subtle side note: we are now generalizing to a slightly different population. Because of the way we implemented our matching scheme, we are no longer making inference about all California residents. Rather, we are making inference with respect to the population with a covariate pattern (age, race, location, and gender) similar to the population below the poverty level.

How many pairs contribute to the test statistic?

Only […] discordant pairs contribute to the test statistic.

Due to the small sample size (number of discordant pairs less than […]), the normal approximation is dubious in this instance. There is an exact test based on the binomial distribution, which does not rely on large-sample approximations.

Using a large-sample test, what is the test statistic? Null distribution? P-value? Compare to the exact test.

$\chi^2 = $ […], $\chi^2 \sim \chi^2_1$

p = […]

Note that the more conservative exact test results in a p-value of […], similar to the large-sample result.

What is the odds ratio? Compare to the OR from the non-matched analysis.

From the non-matched analysis, the OR was […] with […]% CI […].


Comparing the results of the McNemar’s test to the Pearson Chi-square test, consider the following question: Even though we have decreased the sample size, do we gain power by matching? Which test provides stronger evidence that poverty impacts whether or not someone goes to the doctor each year?

Because we are comparing doctor visits among similar individuals, we gain some power by matching.


Survey Data Analysis in Stata

Example

Real-world, publicly available survey data is often very complex (see the DHS example). Consequently, we will contrive an example for this tutorial, estimating p, the prevalence of a disease, say malaria, in a hypothetical country, called “Inventia”.

Country profile:

Province    Population size    Number of districts
1           225,000            50
2           150,000            42
3           100,000            32
4           25,000             23
Total       500,000            146

In Inventia, the climate differs between provinces; for instance, province 4 is more arid and at a higher altitude than the rest of the country. Consequently, the prevalence of malaria p differs between provinces. Also, access to malaria prevention is not consistent across the country, and so p may also vary somewhat between districts. (For instance, urban populations may have more resources to prevent malaria, and thus a lower prevalence.) The true prevalence of malaria in Inventia is 13.1%.

Today, we review how to analyze data from several different survey designs:

• Simple Random Sampling - We randomly sample 1,000 people from Inventia.

• Stratified Sampling - We randomly sample 250 people from each of the 4 provinces of Inventia.

• Cluster Sampling - We randomly sample 25 districts from Inventia and randomly sample 40 people within each district.

• Stratified Cluster Sampling - For each of the 4 provinces, we randomly sample 5 districts. Within each of these 20 districts, we randomly sample 50 people.


Analyzing Survey Data in Stata

In order to analyze survey data in Stata, you must first svyset your data. This command tells Stata what survey design was used to obtain the data. This includes specification of survey weights, the finite population correction(s), and levels of clustering and stratification.

Once Stata has this information, it incorporates the specified design elements into its calculations. You can then use the survey estimation procedures in Stata. For example, svy: mean varname, svy: proportion varname, svy: regress ....

Before analyzing your survey data, you need to be able to answer the following questions:

1. What is the design of my survey?

2. Am I using a finite population correction? At which stage of the design?

3. What are the survey weights used in the design?

Once you know these things, you can start analyzing your data in Stata.


1 Simple Random Sampling

Design: We randomly sample 1,000 people from the entire country of Inventia.

Notation:

• N is the total population size

• n is the number of individuals sampled from the population without replacement

In our case, n = 1,000 and N = 500,000.

Finite Population Correction: $1 - f = 1 - \frac{n}{N}$

Survey Weights: $w_i = P(\text{individual } i \text{ is included in the survey})^{-1} = \frac{N}{n}$
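To make the two formulas concrete, here is a minimal sketch that plugs in the numbers for this design. The scalar names (n_srs, N_pop, and so on) are illustrative only and do not appear in srs.dta.

```stata
* Worked arithmetic for the SRS design: n = 1,000 sampled out of N = 500,000
scalar n_srs  = 1000
scalar N_pop  = 500000

* Sampling fraction f = n/N and finite population correction 1 - f
scalar f_srs   = scalar(n_srs)/scalar(N_pop)
scalar fpc_srs = 1 - scalar(f_srs)

* Survey weight N/n: the number of people each sampled person represents
scalar w_srs = scalar(N_pop)/scalar(n_srs)

display "sampling fraction f = " scalar(f_srs)
display "1 - f               = " scalar(fpc_srs)
display "survey weight N/n   = " scalar(w_srs)
```

Because only a small share of the population is sampled, 1 - f is close to 1 here.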

Exercise: Estimate the prevalence of malaria in Inventia.

use "srs.dta", clear

generate weight_srs = pop_size/1000

* Note: fpc is the sampling fraction n/N, which does not match the definition 1 - f above; svyset's fpc() option accepts a sampling rate.

generate fpc = 1000/pop_size

svyset id [pweight=weight_srs], fpc(fpc)

svy: proportion malaria

svyset id [pweight=weight_srs]

svy: proportion malaria

estat effects, deff

proportion malaria

Under simple random sampling (SRS), when will proportion malaria and svy: proportion malaria give you the same results? Why?

Why does it not matter much if you use the finite population correction in this example?

Exercise: Estimate the prevalence of malaria in each of the four provinces.

svy, sub(if province==1): proportion malaria

svy, sub(if province==2): proportion malaria

svy, sub(if province==3): proportion malaria

svy, sub(if province==4): proportion malaria

Is there evidence of province-level variation in malaria prevalence?


2 Stratified Sampling

Design: We randomly sample 250 people from each of the 4 provinces of Inventia.

Notation:

• N is the total population size

• Nj is the population in province j, j = 1, 2, 3, 4

• nj individuals are sampled from province j

The important design question in stratified sampling is how to choose the sample size within each stratum. In our case, N1 = 225,000, N2 = 150,000, N3 = 100,000 and N4 = 25,000, and nj = 250 for each j.

Finite Population Correction: $1 - f_j = 1 - \frac{n_j}{N_j}$

Survey Weights: $w_{ij} = P(\text{individual } i \text{ in stratum } j \text{ is in the survey})^{-1} = \frac{N_j}{n_j}$
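As a rough illustration of these formulas, the sketch below builds a four-row table of the province sizes given above and computes the weight and sampling rate for each stratum. It is a stand-alone toy dataset, not stratified.dta; the variable names weight and samp_rate are assumptions made for this example.

```stata
* One row per province, using the population sizes from the country profile
clear
input province prov_size
1 225000
2 150000
3 100000
4  25000
end

* Survey weight N_j/n_j and sampling rate n_j/N_j, with n_j = 250 in every province
generate double weight    = prov_size/250
generate double samp_rate = 250/prov_size

list, noobs
```

The weights differ by province because the same number of people is sampled from provinces of very different sizes.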

Exercise: Estimate the prevalence of malaria in Inventia.

use "stratified.dta", clear

proportion malaria

proportion malaria, over(province)

generate weight_stratified = prov_size/250

generate fpc_stratified = 1/weight_stratified

svyset id [pweight=weight_stratified], strata(province) fpc(fpc_stratified)

svydescribe weight

svy: proportion malaria

estat effects, deff

Exercise: Why is our estimate of p too low when we do not specify the survey design?


3 Cluster Sampling

Design: We randomly sample 25 districts (clusters) from Inventia; within each district, we randomly sample 40 people.

Notation:

• N is the total population size

• Nk is the population size in district k, k = 1, ..., 146

• nI out of NI total districts are sampled for inclusion in the survey (primary sampling unit)

• nk individuals in district k are selected for inclusion in the survey (secondary sampling unit)

In our survey, nI = 25, NI = 146, nk = 40, and Nk is the population size in district k.

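Finite Population Correction:

Stage I: $1 - f_I = 1 - \frac{n_I}{N_I}$

Stage II: $1 - f_k = 1 - \frac{n_k}{N_k}$

Survey Weights:

$w_{ik} = P(\text{individual } i \text{ in cluster } k \text{ is in the survey})^{-1}$

$= [P(\text{cluster } k \text{ selected}) \times P(\text{individual } i \text{ in cluster } k \text{ selected} \mid \text{cluster } k \text{ selected})]^{-1}$

$= \frac{N_I}{n_I} \times \frac{N_k}{n_k}$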
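A small numerical sketch of the two-stage weight may help. The district size of 3,000 used here is a hypothetical value chosen only for the arithmetic; the real sizes are in the districtsize variable of cluster.dta.

```stata
* Stage I: probability that a given district is among the 25 sampled out of 146
scalar p_district = 25/146

* Stage II: probability that a person is sampled, given the district was chosen
* (district size of 3,000 is a made-up value for illustration)
scalar p_person = 40/3000

* Survey weight = inverse of the overall inclusion probability
scalar w_cluster = 1/(scalar(p_district)*scalar(p_person))

display "inclusion probability = " scalar(p_district)*scalar(p_person)
display "survey weight         = " scalar(w_cluster)
```

This is the same quantity the exercise computes as (fpc1*fpc2)^-1, evaluated for one district.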

Exercise: Estimate the prevalence of malaria in Inventia, using only the first stage finite population correction.

use "cluster.dta", clear

generate fpc1 = 25/146

generate fpc2 = 40/districtsize

generate weight_cluster = (fpc1*fpc2)^-1

svyset district [pweight=weight_cluster], fpc(fpc1) || id, fpc(fpc2)

svy: proportion malaria

estat effects, deff


4 Stratified Cluster Sampling

We could combine stratified, cluster and simple random sampling all into one design!

Design: For each of the 4 provinces, we randomly sample 5 districts. Within each of the 20 districts, we randomly sample 50 people.

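Survey weights: As an example, for province 2:

$P(\text{person } i \text{ in district } j \text{ in province 2 in survey})$

$= P(\text{district } j \text{ in survey} \mid \text{province 2}) \times P(\text{person } i \text{ in survey} \mid \text{district } j)$

$= \frac{5}{42} \times \frac{50}{\text{districtsize}_j}$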

Finite population correction:

Stage I: $\frac{\#\text{ sampled districts}}{\text{total } \#\text{ districts in the province}}$

Stage II: $\frac{\#\text{ sampled per district}}{\text{district population}} = \frac{50}{\text{districtsize}_j}$ for district $j$.
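The sketch below tabulates the stage I sampling rate for each province using the district counts from the country profile, and then forms a weight for a hypothetical district of 2,500 people. The toy dataset and the value 2,500 are assumptions for illustration; stratifiedcluster.dta supplies the real ndistrict and districtsize values.

```stata
* One row per province with its number of districts
clear
input province ndistrict
1 50
2 42
3 32
4 23
end

* Stage I: 5 districts sampled per province
generate double fpc1 = 5/ndistrict

* Stage II: 50 people sampled per district (district size 2,500 is made up)
generate double fpc2 = 50/2500

* Weight = inverse of the product of the two sampling rates
generate double weight = 1/(fpc1*fpc2)

list, noobs
```

For province 2 the weight works out to (42/5) x (2500/50), matching the formula above with districtsize set to 2,500.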

Exercise: Estimate the prevalence of malaria in Inventia.

use "stratifiedcluster.dta", clear

generate fpc1 = 5/ndistrict

generate fpc2 = 50/districtsize

generate weight_stratcluster = (fpc1*fpc2)^-1

svyset district [pweight=weight_stratcluster], fpc(fpc1) strata(province) || id, fpc(fpc2)

svy: proportion malaria

estat effects, deff





Tutorial: Non-response bias in surveys

Non-response is a huge issue in many surveys (Groves and Peytcheva, 2008). Survey non-response leads to significant bias if response is correlated with the survey indicators of interest. We use a simple example from the Framingham study to illustrate this concept.

Source: Groves, R.M. and Peytcheva, E. (2008). The impact of nonresponse rates on nonresponse bias. Public Opinion Quarterly, 72(2): 167-89. (I found a free draft via Google.)

Example:

• Suppose blood samples from the participants at baseline got lost; rather than measure everyone in the population again, the study investigators decided to try to estimate the baseline prevalence of high cholesterol (cholesterol > 240 mg/dL). They randomly sampled 400 individuals and asked them to return to the study center to have their cholesterol measured again, knowing that not all 400 would return for the re-test.

• The willingness of a participant to revisit the lab was correlated with the frailty of the individual, sex, and prior knowledge of high cholesterol. With a lot of missing data, we would expect to obtain biased estimates of high cholesterol prevalence.

• Prevalence of high cholesterol at baseline was 43.1% in the Framingham cohort.

We consider three different scenarios:

A. Low response rate

B. Moderate response rate

C. High response rate

Exercise: Calculate the prevalence of high cholesterol for each of the three response rate settings, as well as for the complete sample of 400 individuals.

proportion highchol

proportion highcholA highcholB highcholC

proportion highcholA

proportion highcholB

proportion highcholC

As suspected, bias increases with the amount of missingness.

We have baseline covariate data from the Framingham study. We can estimate the probability that a sampled individual returns to have his/her cholesterol tested again as a function of these covariates.

If we knew these probabilities exactly, we could obtain an unbiased estimate of high cholesterol prevalence at baseline. In this example, we do have these probabilities (pA, pB, and pC in the dataset).

Exercise: Calculate the prevalence of high cholesterol for each of the three response rate settings using the survey weights, and compare to the complete-case data.

gen wA = 1/pA

gen wB = 1/pB

gen wC = 1/pC

proportion highchol

proportion highcholA [pweight=wA]

proportion highcholB [pweight=wB]

proportion highcholC [pweight=wC]

Here we recovered unbiased estimates. However, in practice, we will never know pA, pB, and pC exactly. Many methods have been developed to address survey non-response, including multiple imputation and weighting for non-response. Maximizing the response rate is always the best policy.
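The logic of weighting by the inverse response probability can be seen in a tiny made-up dataset. In this sketch, eight people are sampled, half have high cholesterol, and those with high cholesterol respond with probability 0.5 while everyone else always responds. The variable names p, responded, and w are hypothetical and are not the Framingham variables.

```stata
* Toy data: 4 people with high cholesterol (response probability 0.5, so 2 respond)
* and 4 without (response probability 1, so all respond)
clear
input highchol p responded
1 0.5 1
1 0.5 1
1 0.5 0
1 0.5 0
0 1   1
0 1   1
0 1   1
0 1   1
end

* Inverse-probability weight
generate double w = 1/p

* Prevalence in the full sample (the target)
summarize highchol

* Complete-case prevalence among respondents only (too low here)
summarize highchol if responded

* Weighted prevalence among respondents (matches the full-sample value in this toy)
mean highchol if responded [pweight = w]
```

Each responding person with high cholesterol stands in for two sampled people, which restores the balance that non-response removed.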


Tutorial: Correlation in Stata

The World Bank (http://data.worldbank.org) is a great source of free public data on trends in health and economics around the world. In this example, we use public data from the World Bank to examine trends in immunization coverage for measles and DPT over time in low income countries. Open the data set WorldBank.dta.

Calculate the pairwise correlations between measles vaccination coverage, DPT vaccination coverage, and time.

pwcorr measles dpt year

Make a scatterplot including both measles and DPT immunization coverage on the plot. Does the plot explain the results above?

twoway (scatter dpt year) (scatter measles year)

Yes, there is a very strong positive relationship between time and immunization coverage. Further, it seems evident that trends in scaling up in immunization were similar for measles and DPT.

Test whether there is a linear relationship between time and measles vaccination coverage. What are the null and alternative hypotheses? What is your conclusion?

pwcorr measles year, sig

Statistics > Summaries, Tables, and Tests > Summaries and Descriptive Statistics > Pairwise Correlations
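For readers who want to see where the significance level reported by pwcorr, sig comes from, the following sketch recomputes it by hand from the t statistic t = r*sqrt(n-2)/sqrt(1-r^2). It uses Stata's built-in auto dataset rather than WorldBank.dta, and the scalar names are chosen for this example only.

```stata
* Use a dataset that ships with Stata so the example is self-contained
sysuse auto, clear

* Pearson correlation with its two-sided significance level
pwcorr price mpg, sig

* Recompute the p-value from the correlation and the sample size
quietly correlate price mpg
scalar rho_hat = r(rho)
scalar n_obs   = r(N)
scalar t_stat  = scalar(rho_hat)*sqrt(scalar(n_obs) - 2)/sqrt(1 - scalar(rho_hat)^2)
scalar p_val   = 2*ttail(scalar(n_obs) - 2, abs(scalar(t_stat)))

display "r = " scalar(rho_hat) "   t = " scalar(t_stat) "   p = " scalar(p_val)
```

The null hypothesis being tested is that the population correlation is zero; the hand-computed p-value should agree with the one printed by pwcorr.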

Test for a monotonic relationship between time and measles vaccination coverage. What are the null and alternative hypotheses? What is your conclusion?

spearman measles year

Statistics > Summaries, tables, and tests > Nonparametric tests of hypotheses > Spearman's rank correlation
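Spearman's rank correlation can be understood as a Pearson correlation computed on ranks. The sketch below checks this on Stata's built-in auto dataset (not WorldBank.dta); the variable names rank_price and rank_mpg are made up for the illustration.

```stata
sysuse auto, clear

* Spearman's rank correlation as reported by Stata
spearman price mpg
scalar rho_s = r(rho)

* Replace each variable by its ranks (ties receive average ranks)
egen rank_price = rank(price)
egen rank_mpg   = rank(mpg)

* Pearson correlation of the ranks
quietly correlate rank_price rank_mpg
display "Spearman rho = " scalar(rho_s) "   Pearson on ranks = " r(rho)
```

Because only the ordering of the values is used, the statistic picks up any monotonic relationship, not just a linear one.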

Why do you think the correlations are so high in this example? Should you always have such high aspirations regarding the magnitude of your correlation coefficients when analyzing public health data?

Source: Created from World Bank: World Development Indicators and Global Development Finance. Vaccination coverage from WHO and UNICEF.


Tutorial: Non-Parametric Tests in Stata

The Sign Test and Wilcoxon Signed-Rank Test

Consider the following table taken from Whitley and Ball (2002) showing central venous oxygen saturation in 10 patients at admission and 6 hours after admission to an intensive care unit (ICU).

Table 1: Central Venous Oxygen Saturation (%)

Subject   At admission   6 hours after admission to ICU
1         39.7           52.9
2         59.1           56.7
3         56.1           61.9
4         57.7           71.4
5         60.6           67.7
6         37.8           50.0
7         58.2           60.7
8         33.6           51.3
9         56.0           59.5
10        65.3           59.8

E. Whitley and J. Ball. Statistics review 6: Nonparametric methods. Crit Care. 2002; 6(6): 509–513.

It is hypothesized that after 6 hours in the ICU central venous oxygen saturation should increase. The authors are interested in whether the apparent increase in central venous oxygen saturation is likely to reflect a genuine effect of admission and treatment or whether it is simply due to chance.

The data are located in the CVOS.dta data set. In this example we want to know whether there is a difference in central venous oxygen saturation at admission compared to 6 hours after admission to the ICU. That is, we want to know whether 6 hours in the ICU has an effect on central venous oxygen saturation.

1. Are the data independent or dependent? What parametric and nonparametric tests are available for this type of data?

Dependent: We measure central venous oxygen saturation at admission and 6 hours after admission on the same subject.

Parametric test: paired t-test

Non-parametric tests: sign test, Wilcoxon signed-rank test

2. What type of statistical test is most appropriate for this data and why?

We should probably use a non-parametric test since we have a small sample size. Furthermore, we have no information to suggest that the differences are normally distributed. You could also make a histogram of the differences to inspect normality.
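If CVOS.dta is not at hand, the ten pairs from Table 1 can be typed in directly. This sketch assumes the variable names t0 and t6 (the names that appear in the signtest and signrank output later in this tutorial) and adds a hypothetical subject identifier; it then computes the paired differences and draws the histogram suggested above.

```stata
* Enter the Table 1 data by hand: saturation at admission (t0) and after 6 hours (t6)
clear
input subject t0 t6
 1 39.7 52.9
 2 59.1 56.7
 3 56.1 61.9
 4 57.7 71.4
 5 60.6 67.7
 6 37.8 50.0
 7 58.2 60.7
 8 33.6 51.3
 9 56.0 59.5
10 65.3 59.8
end

* Paired differences (6 hours minus admission)
generate diff = t6 - t0

* How many patients improved?
count if diff > 0

* Inspect the shape of the differences
histogram diff, frequency
```

With only 10 differences the histogram is a coarse check at best, which is part of the reason for preferring a non-parametric test here.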

3. Suppose we decide to use the sign test. What are the null and alternative hypotheses?

The null hypothesis is that the median of the differences is equal to zero. The alternative is that the median of the differences is not equal to zero.


4. Perform a sign test in Stata at alpha = 0.05. What is the value of your test statistic? Your p-value? Your decision? Your interpretation?

You may use the following drop-down menus to access the signtest command: Statistics / Summaries, tables, and tests / Nonparametric tests of hypotheses / Test equality of matched pairs.

    . signtest t6=t0

    Sign test

            sign |    observed    expected
    -------------+------------------------
        positive |           8           5
        negative |           2           5
            zero |           0           0
    -------------+------------------------
             all |          10          10

    One-sided tests:
      Ho: median of t6 - t0 = 0 vs.
      Ha: median of t6 - t0 > 0
          Pr(#positive >= 8) =
             Binomial(n = 10, x >= 8, p = 0.5) =  0.0547

      Ho: median of t6 - t0 = 0 vs.
      Ha: median of t6 - t0 < 0
          Pr(#negative >= 2) =
             Binomial(n = 10, x >= 2, p = 0.5) =  0.9893

    Two-sided test:
      Ho: median of t6 - t0 = 0 vs.
      Ha: median of t6 - t0 != 0
          Pr(#positive >= 8 or #negative >= 8) =
             min(1, 2*Binomial(n = 10, x >= 8, p = 0.5)) =  0.1094

Our test statistic is D = 8, since we have eight plus signs. Stata uses the binomial distribution to generate the p-value. Our two-sided p-value is 0.1094. Thus, we fail to reject the null hypothesis: using the sign test, we do not find evidence that median central venous oxygen saturation is different at admission and 6 hours after admission to the ICU.
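The binomial p-values in the sign test output can be reproduced with Stata's binomialtail() function, which returns the probability of k or more successes in n trials. This is only a sketch of the arithmetic; the scalar names are made up for the example.

```stata
* Under Ho each of the 10 differences is positive with probability 0.5.
* One-sided p-value: probability of 8 or more positive signs out of 10
scalar p_upper = binomialtail(10, 8, 0.5)

* Two-sided p-value: double the one-sided value, capped at 1
scalar p_two = min(1, 2*scalar(p_upper))

display "one-sided p = " %6.4f scalar(p_upper)
display "two-sided p = " %6.4f scalar(p_two)
```

The two displayed values should agree with the 0.0547 and 0.1094 reported by signtest.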

5. Suppose that instead of conducting the sign test we conduct the Wilcoxon signed-rank test. Which test has more power? Why?

The signed-rank test has more power, since it incorporates the magnitude of the differences via the ranks.

6. State the null and alternative hypothesis for the Wilcoxon signed-rank test.


The null hypothesis is that the median of the differences is equal to zero. The alternative is that the median of the differences is not equal to zero.

7. Perform a signed-rank test in Stata at the alpha = 0.05 level. What is the value of your test statistic? Your p-value? Your decision? Your interpretation? You may use the following drop-down menus to access the signrank command: Statistics / Summaries, tables, and tests / Nonparametric tests of hypotheses / Wilcoxon matched-pairs signed-rank test

. signrank t6=t0

Wilcoxon signed-rank test

        sign |      obs   sum ranks    expected
-------------+---------------------------------
    positive |        8          50        27.5
    negative |        2           5        27.5
        zero |        0           0           0
-------------+---------------------------------
         all |       10          55          55

unadjusted variance       96.25
adjustment for ties        0.00
adjustment for zeros       0.00
                     ----------
adjusted variance         96.25

Ho: t6 = t0
             z =   2.293
    Prob > |z| =   0.0218

Our test statistic is z = 2.293. The p-value is 0.0218. Therefore, we reject the null hypothesis. Thus, we have evidence that median central venous oxygen saturation is different at admission and 6 hours after admission to the ICU. It appears that central venous oxygen saturation is higher after 6 hours in the ICU.
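The z statistic reported by signrank follows from the rank sums in the table. As a rough check, assuming the usual normal approximation with no ties or zeros (as in this output), the pieces can be recomputed from n = 10 and the positive rank sum of 50:

```stata
* expected sum of positive ranks under H0: n(n+1)/4 = 27.5
display 10*11/4

* variance of the rank sum under H0: n(n+1)(2n+1)/24 = 96.25
display 10*11*21/24

* z statistic: (observed - expected)/sqrt(variance), about 2.293
display (50 - 27.5)/sqrt(96.25)

* two-sided p-value from the standard normal, about 0.0218
display 2*(1 - normal(2.293))
```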


Tutorial: Non-Parametric Tests in Stata
The Wilcoxon Rank Sum Test

In this tutorial, we will use data from the Digitalis Investigation Group (DIG). Please read the provided data documentation before continuing with this tutorial (see DIG_Documentation.pdf). We will replicate one of the analyses from the New England Journal of Medicine paper (see NEJM_DIG.pdf):

Garg R, Gorlin R, Smith T, Yusuf S, for the Digitalis Investigation Group. The effect of digoxin on mortality and morbidity in patients with heart failure. N Engl J Med 1997.

In this trial, patients were randomized to either digoxin or placebo. The Wilcoxon rank sum test was used to determine if there were any differences between groups in the number of hospitalizations. The data are located in the dig.dta data set.

1. Examine the distribution of number of hospitalizations by treatment group. Are they similar? Are they symmetric?

The two distributions are very similar. They are not symmetric, but rather right skewed.

[Figure: histograms of number of hospitalizations (x-axis) against density (y-axis), graphed by treatment group (placebo, treatment).]

2. Does the rank sum test require any assumptions?

Yes. The two samples must be independent, and the distributions should have the same general shape.

3. What is the null hypothesis for the rank sum test? What is the alternative?

The null hypothesis is that the median numbers of hospitalizations for the two treatment groups are identical. Thus, the alternative is that the median numbers of hospitalizations for the two treatment groups are not identical. Since we assume the two distributions have the same general shape, a difference of the medians would imply that the two distributions have the same shape but are shifted in location.

4. Perform a rank sum test in Stata with alpha = 0.05. What is your test statistic? Your p-value? Your decision? Your interpretation? You may use the following drop-down menus to access the ranksum command: Statistics / Summaries, tables, and tests / Nonparametric tests of hypotheses / Wilcoxon rank-sum test

. ranksum nhosp, by(trtmt)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

       trtmt |      obs    rank sum    expected
-------------+---------------------------------
           0 |     3403    11767615    11571902
           1 |     3397    11355786    11551499
-------------+---------------------------------
    combined |     6800    23123400    23123400

unadjusted variance    6.552e+09
adjustment for ties   -3.811e+08
                      ----------
adjusted variance      6.171e+09

Ho: nhosp(trtmt==0) = nhosp(trtmt==1)
             z =   2.491
    Prob > |z| =   0.0127

Our test statistic is z = 2.491. The p-value is 0.0127. Note that this was the p-value reported in the paper. We reject the null hypothesis. Thus, we conclude that we have evidence that the median number of hospitalizations differs by treatment group. In fact, we have evidence that there are significantly more hospitalizations in the placebo group.
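The expected rank sum and the z statistic can be reproduced from the group sizes in the table. This is a sketch that takes the tie-adjusted variance (6.171e+09) directly from the output rather than recomputing the tie correction, which would need the data.

```stata
* expected rank sum for the placebo group under H0: n1*(N + 1)/2
display 3403*(6800 + 1)/2

* unadjusted variance of the rank sum: n1*n2*(N + 1)/12, about 6.552e+09
display 3403*3397*(6800 + 1)/12

* z statistic using the tie-adjusted variance from the output, about 2.491
display (11767615 - 3403*(6800 + 1)/2)/sqrt(6.171e+09)

* two-sided p-value, about 0.0127
display 2*(1 - normal(2.491))
```

The observed rank sum for placebo (trtmt = 0) lies above its expectation, which is why the conclusion points to more hospitalizations in the placebo group.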


Tutorial: Simple Linear Regression

Open the dataset hospitaldata.dta.

Exercises:

1. Calculate the Pearson correlation for the percent of patients who say their nurse always communicated well (nursealways) and the percent of patients who would always recommend the hospital (recommendyes).

pwcorr recommendyes nursealways, sig

These two variables are correlated. However, simple linear regression gives us a more intuitive measure of the relationship between the two variables. Specifically, we can state: "For a one percent increase in the percent of patients who say their nurse always communicated well, we would, on average, expect to see a corresponding increase of B% of patients who would always recommend the hospital." Here B is determined by fitting an appropriate linear regression model.

2. Now that you have established that these variables are correlated, you decide to fit a linear regression model to assess the relationship between recommendyes and nursealways. State your model.

$Y_i$ = percent of patients who always recommend the hospital
$X_i$ = percent of patients who say that the nurse always communicates well

$Y_i = \alpha + \beta X_i + \varepsilon_i$

where $\varepsilon_i \sim N(0, \sigma^2)$. Equivalently, we could write:

$\mu_{y_i} = E(Y_i \mid X_i) = \alpha + \beta X_i$

where $Y_i \sim N(\mu_{y_i}, \sigma^2)$.

The goal is to estimate $\alpha$ and $\beta$ and to obtain measures of uncertainty for them. We use the method of least squares for estimation.

3. Construct a scatter plot with nursealways on the x-axis and recommendyes on the y-axis. Use the scatterplot to evaluate the assumptions of simple linear regression.

twoway (scatter recommendyes nursealways)

Assumptions:

• Independent observations

• $Y \mid X$ is normally distributed

• Homoscedasticity (constant variance)

• Linearity

4. Fit the linear regression model. Provide estimates, confidence intervals, and interpretations of the regression coefficients $\alpha$ and $\beta$.

. regress recommendyes nursealways

Source | SS df MS Number of obs = 3570

-------------+------------------------------ F( 1, 3568) = 2723.72

Model | 144368.851 1 144368.851 Prob > F = 0.0000

Residual | 189118.972 3568 53.0041962 R-squared = 0.4329

-------------+------------------------------ Adj R-squared = 0.4327

Total | 333487.823 3569 93.4401297 Root MSE = 7.2804

------------------------------------------------------------------------------

recommendyes | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

nursealways | 1.159487 .0222169 52.19 0.000 1.115928 1.203046

_cons | -19.21559 1.712829 -11.22 0.000 -22.57381 -15.85737

------------------------------------------------------------------------------

Our fitted regression line is:

$Y_i = -19.2 + 1.16 X_i + \varepsilon_i$

where $\varepsilon_i \sim N(0, 7.3^2)$.

Confidence intervals for $\alpha$ and $\beta$, respectively, are (-22.57, -15.86) and (1.12, 1.20).

For a 1% increase in patients reporting that their nurse always communicated well, the corresponding average increase in the percent of patients who would always recommend the hospital is 1.16%.

$\alpha$ is the mean value of the response $Y_i$ when $X_i = 0$, and for this example it has no meaningful interpretation. (However, it is necessary for constructing the regression line and making subsequent predictions.)

5. Test the hypothesis that $H_0: \beta = 0$ versus the alternative that $H_A: \beta \neq 0$.

We find that $\hat{\beta} = 1.16$, $se(\hat{\beta}) = 0.02$, and $t = 52.2$. Under $H_0$, $t = \hat{\beta}/se(\hat{\beta}) \sim t_{3570-2}$, and our p-value < 0.0001. Therefore, we reject the null hypothesis and conclude that the percent of patients who say a nurse always communicates well is positively correlated with the percent of patients who would always recommend a hospital.
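The t statistic and its p-value can be recomputed from the coefficient table. A small sketch, using only the numbers printed in the regress output (coefficient 1.159487, standard error .0222169, 3568 residual degrees of freedom):

```stata
* t statistic: estimate divided by its standard error, about 52.19
display 1.159487/.0222169

* two-sided p-value from the t distribution with 3568 df (effectively zero)
display 2*ttail(3568, 52.19)
```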

6. What is the value of $R^2$? Interpret this quantity.

0.433

43% of the variability among the observed values of recommendyes is explained by the linear relationship with nursealways.
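$R^2$ is the model sum of squares divided by the total sum of squares, so it can be recovered from the ANOVA part of the regress output. The sketch below uses the printed sums of squares; the last line relies on the fact that, in simple linear regression, the square root of $R^2$ equals the absolute value of the Pearson correlation.

```stata
* R-squared = Model SS / Total SS, about 0.4329
display 144368.851/333487.823

* equivalently, 1 - Residual SS / Total SS
display 1 - 189118.972/333487.823

* should match |r| from pwcorr in exercise 1
display sqrt(144368.851/333487.823)
```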

7. Examine a residual plot. Using $R^2$ and the plot, does the model appear to fit well? (Are there any outliers?)

rvfplot

rvpplot nursealways

We don’t see any strong trends or outliers in the residual plots.

8. Using the regression line, predict the expected percent of patients who always recommend the hospital when the percent of patients reporting that nurses always communicate well is 80%. Construct the corresponding 95% confidence interval.

Denote $\hat{Y}_{80}$ as the predicted average percent of patients who always recommend a hospital among hospitals where 80% of patients report that nurses always communicate well.

$\hat{Y}_{80} = -19.2 + 1.16 \times 80 = 73.6$

. lincom _cons + 80*nursealways

( 1) 80*nursealways + _cons = 0

------------------------------------------------------------------------------

recommendyes | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

(1) | 73.54339 .1399631 525.45 0.000 73.26897 73.8178

------------------------------------------------------------------------------

A 95% confidence interval for Y 80 is (73.2690, 73.8178).

9. For a new hospital with 80% of patients reporting that nurses always communicate well, predict the percent of patients who will always recommend the hospital. Construct the corresponding 95% confidence interval.

Denote $\hat{Y}_{80}$ as the predicted percent of patients who always recommend the hospital in a new hospital where 80% of patients report that nurses always communicate well.

$\hat{Y}_{80} = 73.54339$. To find a confidence interval, we need to account for the additional uncertainty associated with predicting a new outcome.

$se(\hat{Y}_{80}) = \sqrt{\widehat{var}(\hat{Y}_{80}) + \hat{\sigma}^2} = \sqrt{0.1399631^2 + 7.2804^2} = 7.281745$

. di 73.54339 - invttail(3568, 0.025)*7.281745

59.266589


. di 73.54339 + invttail(3568, 0.025)*7.281745

87.820191

So, a 95% confidence interval for $\hat{Y}_{80}$ (here a prediction interval, since it concerns a new hospital) is $73.54339 \pm t_{3568, 0.975} \times 7.281745 = (59.27, 87.82)$.
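The two standard errors used in exercises 8 and 9 can also be obtained from predict instead of by hand. The following is a sketch, not part of the original handout: it assumes hospitaldata.dta is in the working directory, and the new variable names (yhat, se_mean, se_new, lo_new, hi_new) are made up for illustration.

```stata
* Sketch only: assumes hospitaldata.dta is in the working directory
use hospitaldata, clear
regress recommendyes nursealways

predict yhat, xb          // fitted values
predict se_mean, stdp     // standard error of the estimated mean (exercise 8)
predict se_new, stdf      // standard error of the forecast for a new hospital (exercise 9)

* 95% limits for a new hospital, using the residual degrees of freedom
generate lo_new = yhat - invttail(e(df_r), 0.025)*se_new
generate hi_new = yhat + invttail(e(df_r), 0.025)*se_new

* rows appear only if some hospital has nursealways exactly equal to 80
summarize yhat se_mean se_new lo_new hi_new if nursealways == 80
```

The stdf standard error combines the uncertainty in the fitted mean with the residual variance, which is the same calculation as the square-root formula above.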


Indicator Variables and Regression

Suppose a hospital is trying to set a benchmark goal of having patients report that nurses always communicate well at least 75% of the time. We now define a nurse communication indicator variable and use simple linear regression to further examine the relationship between nurse communication and the percentage of patients always recommending the hospital.

Open the dataset hospitaldata.dta.

Exercises:

1. Generate a new variable, highnurse, that equals 1 if a hospital had nursealways ≥ 75%, and equals 0 if nursealways < 75%.

gen highnurse = .

replace highnurse = 1 if nursealways >= 75 & nursealways <= 100

replace highnurse = 0 if nursealways < 75
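The upper bound `nursealways <= 100` in the commands above matters because Stata treats missing values as larger than any number, so a bare `nursealways >= 75` would code missing hospitals as 1. A one-line alternative is sketched below on a tiny made-up dataset (the four values are illustrative, not from hospitaldata.dta):

```stata
* toy data: three hospitals with a value and one with a missing value
clear
input nursealways
60
75
90
.
end

* the logical expression gives 1 or 0; the if qualifier leaves missing as missing
generate highnurse = (nursealways >= 75) if !missing(nursealways)
list
```

In this toy example highnurse is 0, 1, 1, and missing for the four rows.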

2. State your model and evaluate the model assumptions.

$Y_i$ = percent of patients who always recommend the hospital
$D_i$ = 1 if at least 75% of patients at the hospital report that nurses always communicate well, and 0 otherwise

$Y_i = \alpha + \beta D_i + \varepsilon_i$

where $\varepsilon_i \sim N(0, \sigma^2)$.

The model is identical to a one-way ANOVA; therefore, the assumptions we make are the same. When we only have two groups, the assumptions are identical to those of the t-test with equal variances.

3. Fit the model.

xi: regress recommendyes i.highnurse

or

regress recommendyes highnurse

Source | SS df MS Number of obs = 3570

-------------+------------------------------ F( 1, 3568) = 1004.37


Model | 73254.0735 1 73254.0735 Prob > F = 0.0000

Residual | 260233.749 3568 72.9354678 R-squared = 0.2197

-------------+------------------------------ Adj R-squared = 0.2194

Total | 333487.823 3569 93.4401297 Root MSE = 8.5402

------------------------------------------------------------------------------

recommendyes | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

highnurse | 9.980834 .3149346 31.69 0.000 9.363364 10.5983

_cons | 62.86486 .2653319 236.93 0.000 62.34465 63.38508

------------------------------------------------------------------------------

So, our fitted model is $Y_i = 62.9 + 10.0 D_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, 8.5^2)$.

4. Interpret the coefficients.

$\hat{\alpha} = 62.9$ estimates $E(Y_i \mid D_i = 0)$. The average percent of patients who always recommend a hospital when less than 75% of patients say nurses always communicated well is 62.9%.

$\hat{\beta} = 10.0$ estimates $E(Y_i \mid D_i = 1) - E(Y_i \mid D_i = 0)$. Comparing hospitals where at least 75% of patients say nurses always communicated well with those where less than 75% of patients report that nurses always communicate well, the average difference in the percent of patients who always recommend a hospital was 10%.

$\hat{\alpha} + \hat{\beta} = 62.86 + 9.98 = 72.8$ estimates $E(Y_i \mid D_i = 1)$. The average percent of patients who always recommend a hospital when at least 75% of patients say nurses always communicated well is 72.8%.

5. Test the null hypothesis that there is no difference in the average percent of patients who always recommend a hospital between hospitals with less than and at least 75% of patients reporting that nurses always communicate well.

We test $H_0: \beta = 0$ versus $H_A: \beta \neq 0$ using a two-sided test with significance level $\alpha = 0.05$.

We find that $\hat{\beta} = 10.0$, $se(\hat{\beta}) = 0.3$, and $t = 31.7$. Under $H_0$, $t \sim t_{n-2}$, and p < 0.0001. We conclude that the average percent of patients who always recommend a hospital is greater when at least 75% of patients report that nurses always communicate well.

6. Compare the results of the test above to a two-sample t-test with equal variances.

. ttest recommendyes, by(highnurse)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]


---------+--------------------------------------------------------------------

0 | 1036 62.86486 .272132 8.759099 62.33087 63.39886

1 | 2534 72.8457 .1678457 8.449162 72.51657 73.17483

---------+--------------------------------------------------------------------

combined | 3570 69.9493 .1617829 9.666443 69.6321 70.2665

---------+--------------------------------------------------------------------

diff | -9.980834 .3149346 -10.5983 -9.363364

------------------------------------------------------------------------------

diff = mean(0) - mean(1) t = -31.6918

Ho: diff = 0 degrees of freedom = 3568

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0

Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000

You should notice some striking similarities!
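The similarities can be made concrete without the dataset by feeding the group summaries from the ttest table into the immediate form of the command. This is a sketch under the assumption that the printed means and standard deviations are accurate to the digits shown:

```stata
* two-sample t test from summary statistics: n, mean, sd for each group
ttesti 1036 62.86486 8.759099 2534 72.8457 8.449162

* the squared t statistic equals the regression F statistic, about 1004.37
display (-31.6918)^2
```

The difference in means is the regression slope with the sign flipped (mean(0) - mean(1) rather than mean(1) - mean(0)), the standard error .3149346 is the same in both outputs, and the group 0 mean equals the regression intercept.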


Multiple Linear Regression

Now, the hospital aims to assess the impact of nurse communication and hospital noise level on the percentage of patients who would always recommend the hospital.

Fit a linear regression model with recommendyes as the outcome and nursealways and quietalways as the covariates.

1. Make a scatter plot of quietalways versus recommendyes.

twoway (scatter recommendyes quietalways)

While the relationship appears linear, note that we cannot assess any of the assumptions of multiple linear regression using this plot.

2. State your model.

$Y_i$ = percent of patients who always recommend the hospital
$X_{1i}$ = percent of patients in a hospital who say that the nurse always communicates well
$X_{2i}$ = percent of patients who report that the hospital is always quiet

$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$

where $\varepsilon_i \sim N(0, \sigma^2)$. Equivalently, we could write:

$\mu_{y_i} = E(Y_i \mid X_{1i}, X_{2i}) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i}$

where $Y_i \sim N(\mu_{y_i}, \sigma^2)$.

3. Fit the model.

regress recommendyes nursealways quietalways

Source | SS df MS Number of obs = 3570

-------------+------------------------------ F( 2, 3567) = 1363.40

Model | 144484.252 2 72242.126 Prob > F = 0.0000

Residual | 189003.571 3567 52.9867033 R-squared = 0.4333

-------------+------------------------------ Adj R-squared = 0.4329

Total | 333487.823 3569 93.4401297 Root MSE = 7.2792

------------------------------------------------------------------------------

recommendyes | Coef. Std. Err. t P>|t| [95% Conf. Interval]


-------------+----------------------------------------------------------------

nursealways | 1.133725 .0282517 40.13 0.000 1.078334 1.189116

quietalways | .0229694 .0155642 1.48 0.140 -.0075463 .053485

_cons | -18.58225 1.7655 -10.53 0.000 -22.04374 -15.12075

------------------------------------------------------------------------------

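Our fitted model is $Y_i = -18.58 + 1.13 X_{1i} + 0.02 X_{2i} + \varepsilon_i$, where $\varepsilon_i \sim N(0, 7.28^2)$.

4. Evaluate the model assumptions.

The adjusted $R^2$ is 0.43 (compared to 0.43 from the simple linear regression model with only nursealways), so adding quietalways explains essentially no additional variability.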

rvfplot

rvpplot nursealways

rvpplot quietalways

5. Interpret the coefficients.

We estimate $\hat{\beta}_1 = 1.13$, with 95% confidence interval (1.08, 1.19). For a one percent increase in the patients who say that the nurses always communicate well, we see on average a 1.13 percent increase in the percent of patients who would always recommend the hospital, when the percent of patients who say the hospital is always quiet is fixed (does not vary).

We estimate $\hat{\beta}_2 = 0.02$, with 95% confidence interval (-0.01, 0.05). For a one percent increase in the patients who say the hospital is always quiet, we see on average a 0.02 percent increase in the percent of patients who would always recommend the hospital, fixing the percent of patients who say their nurse always communicates well.

We estimate that $\hat{\alpha} = -18.58$. $\alpha$ is the value of $E(Y_i)$ when $X_{1i}$ and $X_{2i}$ are set to 0. In our dataset, the covariates never drop below 48% and 30% respectively, and therefore $\alpha$ does not have a meaningful interpretation for this study.

6. Suppose we consider a new hospital, where the percentage of patients who say nurses always communicate well is 90% and the percentage of those who say the hospital is always quiet is 70%. What is the expected percent of patients who would always recommend this hospital?

$E(Y_i \mid X_{1i} = 90, X_{2i} = 70) = -18.58 + 1.13 \times 90 + 0.02 \times 70 = 84.5\%$, using the rounded coefficients. With the unrounded estimates from the output, the prediction is about 85.1%.
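Rounding the coefficients before multiplying by values as large as 90 and 70 shifts the prediction noticeably. The sketch below compares the two calculations using the estimates printed in the regress output; it needs no dataset.

```stata
* prediction with the unrounded coefficients from the output, about 85.06
display -18.58225 + 1.133725*90 + .0229694*70

* prediction with coefficients rounded to two decimals, 84.52
display -18.58 + 1.13*90 + 0.02*70
```

With the model still in memory, `lincom _cons + 90*nursealways + 70*quietalways` would give the unrounded prediction together with a confidence interval, in the same way lincom was used in the simple linear regression tutorial.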

7. Using the regression results above, perform the following three hypothesis tests at the 0.05 level of significance.

• $H_0: \beta_1 = 0$, $H_A: \beta_1 \neq 0$

$\hat{\beta}_1 = 1.13$, $se(\hat{\beta}_1) = 0.03$, $t = 40.1$. Under $H_0$, $t \sim t_{3570-2-1}$, and p < 0.0001. We reject $H_0$ and conclude that an increase in the percent of patients who say nurses always communicate well is associated with an increase in the percent of patients who always recommend the hospital, fixing the percent of patients who say the hospital is always quiet.

• $H_0: \beta_2 = 0$, $H_A: \beta_2 \neq 0$

$\hat{\beta}_2 = 0.02$, $se(\hat{\beta}_2) = 0.02$, $t = 1.5$. Under $H_0$, $t \sim t_{3570-2-1}$, and p = 0.14. We fail to reject $H_0$ and conclude that we do not have evidence in the data that the percent of patients who say the hospital is always quiet is correlated with the percent of patients who always recommend the hospital, fixing the percent of patients who say that the nurses always communicate well.

• $H_0: \beta_1 = \beta_2 = 0$, $H_A$: at least one of $\beta_1, \beta_2 \neq 0$

. test nursealways quietalways

( 1) nursealways = 0

( 2) quietalways = 0

F( 2, 3567) = 1363.40

Prob > F = 0.0000

Our F-statistic equals 1363.4. Under $H_0$, $F \sim F_{2,3567}$, and p < 0.0001. We reject $H_0$ and conclude that at least one of $\beta_1$ or $\beta_2$ is non-zero.
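The joint F statistic is the model mean square divided by the residual mean square, so it can be recomputed from the ANOVA table of the regress output. A minimal sketch using the printed sums of squares and degrees of freedom:

```stata
* F = (Model SS / 2) / (Residual SS / 3567), about 1363.40
display (144484.252/2)/(189003.571/3567)

* upper-tail probability of F(2, 3567) at the observed value (effectively zero)
display Ftail(2, 3567, 1363.40)
```

This is the same F statistic printed in the header of the regress output, since the test command here tests all slope coefficients at once.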

8. Do we observe any collinearity between $X_{1i}$ and $X_{2i}$? How does this impact the result?

twoway (scatter nursealways quietalways)

Yes, the covariates are collinear. We would likely see an association between $X_{2i}$ and $Y_i$ if $X_{1i}$ was excluded from the model.
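Beyond the scatter plot, collinearity can be quantified. The commands below are a sketch that assumes hospitaldata.dta is in the working directory; they are one possible way to check the claim and are not part of the original exercise.

```stata
* Sketch only: assumes hospitaldata.dta is in the working directory
use hospitaldata, clear

* pairwise correlation between the two covariates
correlate nursealways quietalways

* variance inflation factors after the two-covariate model
regress recommendyes nursealways quietalways
estat vif

* quietalways on its own, to see whether an association appears
* once nursealways is left out
regress recommendyes quietalways
```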


More Multiple Linear Regression

For those interested in delving a bit deeper into the world of linear regression, a few additional examples are included below. In the first example, you can work through a multiple linear regression model with one binary covariate and one continuous covariate. In the second example, we add an interaction between these covariates to examine effect modification/interaction between covariates in the context of linear regression. It is important to think about how the interpretation of the regression coefficients changes in the presence of an interaction term.

Example 1:

Fit a linear regression model with recommendyes as the outcome and highnurse and quietalways as the covariates.

1. Make a scatterplot with quietalways on the x-axis and recommendyes on the y-axis. Stratify by highnurse when you are plotting, so that you can distinguish between hospitals with highnurse = 0 and highnurse = 1. Overlay a linear prediction line for highnurse = 0 and highnurse = 1.

Via the dropdown menus, go to Graphics > Two-way graph. Within the two-way window, you will need to create four different plots: two scatter plots (go to Basic plots > Scatter and then fill in Y and X variables) and two linear prediction lines (go to Fit plots > Linear Prediction and then fill in Y and X variables). Or, via command line:

twoway (scatter recommendyes quietalways if highnurse==1) ///
       (scatter recommendyes quietalways if highnurse==0) ///
       (lfit recommendyes quietalways if highnurse==1) ///
       (lfit recommendyes quietalways if highnurse==0)

2. State your model.

$Y_i$ = percent of patients who always recommend the hospital
$X_{1i}$ = percent of patients in a hospital who say that the hospital is always quiet
$D_i$ = 1 if at least 75% of patients at the hospital report that nurses always communicate well, and 0 otherwise

$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 D_i + \varepsilon_i$

where $\varepsilon_i \sim N(0, \sigma^2)$.

3. Fit the model.


regress recommendyes highnurse quietalways

4. Evaluate the model assumptions.

Suggestions: Check the residual plots to look for outliers and heteroskedasticity. Do the residuals look approximately normal? Patterns in the residual plot could suggest that your model for the mean of the outcome is misspecified (linearity is violated).

5. Interpret the coefficients.

• $\alpha$ - the average percent of patients who always recommend a hospital when highnurse is 0 and quietalways is 0. $\alpha$ does not have a meaningful interpretation for this study since quietalways never drops to 0.

• $\beta_1$ - the average increase in the percent of patients who always recommend a hospital for a one percent increase in quietalways, for a given value of highnurse.

• $\beta_2$ - the average increase in the percent of patients who always recommend a hospital for hospitals with highnurse = 1 compared to hospitals with highnurse = 0, fixing quietalways.

• $\alpha + 80\beta_1 + \beta_2$ - the average percent of patients who recommend a hospital with highnurse = 1 and quietalways = 80.

• $\alpha + 80\beta_1$ - the average percent of patients who recommend a hospital with highnurse = 0 and quietalways = 80.


Example 2 - Multiple Linear Regression with an Interaction

Now, we examine whether there is an interaction between highnurse and quietalways on recommendyes. Equivalently, we look for evidence of effect modification of the relationship between quietalways and recommendyes by highnurse.

1. Check out the scatter plot from the previous example. Does the plot suggest that an interaction term might improve the model?

Yes. We can look for evidence of effect modification by comparing the slopes of the overlaid lines in the scatter plot. Because the slopes appear to differ by highnurse, there is evidence of effect modification.

2. State your model.

$Y_i$ = percent of patients who always recommend the hospital
$X_{1i}$ = percent of patients in a hospital who say that the hospital is always quiet
$D_i$ = 1 if at least 75% of patients at the hospital report that nurses always communicate well, and 0 otherwise

$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 D_i + \beta_3 X_{1i} D_i + \varepsilon_i$

where $\varepsilon_i \sim N(0, \sigma^2)$.

3. Fit the model.

. xi: regress recommendyes i.highnurse*quietalways

4. Evaluate the model assumptions.

You can use the same approach as the previous question.

5. Interpret the coefficients.

• $\alpha$ - the average percent of patients who always recommend a hospital when highnurse is 0 and quietalways is 0. $\alpha$ does not have a meaningful interpretation for this study since quietalways never drops to 0.

• $\beta_1$ - the average increase in the percent of patients who always recommend a hospital for a one percent increase in quietalways, when highnurse = 0.

• $\beta_1 + \beta_3$ - the average increase in the percent of patients who always recommend a hospital for a one percent increase in quietalways, when highnurse = 1.

• $\beta_2$ - the average increase in the percent of patients who always recommend a hospital for hospitals with highnurse = 1 compared to hospitals with highnurse = 0, when quietalways equals 0. $\beta_2$ does not have a meaningful interpretation in this analysis. Note that we could have centered the covariate quietalways around its mean, so that the coefficient would be more interpretable.

• $\beta_2 + 70\beta_3$ - the average increase in the percent of patients who always recommend a hospital for hospitals with highnurse = 1 compared to hospitals with highnurse = 0, when quietalways equals 70.

• $\alpha + 80\beta_1 + \beta_2 + 80\beta_3$ - the average percent of patients who recommend a hospital with highnurse = 1 and quietalways = 80.

• $\alpha + 80\beta_1$ - the average percent of patients who recommend a hospital with highnurse = 0 and quietalways = 80.
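The combinations of coefficients listed above can be estimated directly with lincom. The sketch below is not from the original handout: it assumes hospitaldata.dta is available, uses factor-variable notation (Stata 11 or later) in place of the xi: prefix, and the variable name quiet_c is made up for the centered covariate.

```stata
* Sketch only: assumes hospitaldata.dta is in the working directory
use hospitaldata, clear
generate highnurse = (nursealways >= 75) if !missing(nursealways)

* main effects plus interaction, written with factor variables
regress recommendyes i.highnurse##c.quietalways

* slope of quietalways when highnurse = 1 (beta_1 + beta_3)
lincom quietalways + 1.highnurse#c.quietalways

* difference between highnurse groups at quietalways = 70 (beta_2 + 70*beta_3)
lincom 1.highnurse + 70*1.highnurse#c.quietalways

* centering quietalways makes the highnurse coefficient the group
* difference at the average value of quietalways
summarize quietalways, meanonly
generate quiet_c = quietalways - r(mean)
regress recommendyes i.highnurse##c.quiet_c
```

Each lincom call reports the estimate with a standard error and 95% confidence interval, which is harder to get by adding coefficients by hand because the estimates are correlated.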


Simple Logistic Regression

Think back to Week 7, when we used the sample from the California Health Indicator Survey (CHIS) to examine the relationship between poverty and visiting the doctor within the past 12 months. This week, we use logistic regression to examine this relationship. Open the chis healthdisparities.dta dataset.

Fit a logistic regression model with visiting the doctor in the past 12 months as the outcome and the poverty indicator as your covariate.

1. List the assumptions for performing logistic regression.

We assume the responses are Bernoulli, and we assume linearity in the parameters on the logit scale.

2. State your model.

Define $Y_i = 1$ if individual i visited the doctor in the last 12 months, 0 otherwise. Define $X_i = 1$ if the individual is above the poverty line, 0 otherwise. Then, our model is $Y_i \sim \text{Bernoulli}(p_i)$, where

$\text{logit}(p_i) = \alpha + \beta X_i$

3. Fit the model.

. logit doctor nopov

Iteration 0: log likelihood = -247.4035

Iteration 1: log likelihood = -245.14765

Iteration 2: log likelihood = -245.08244

Iteration 3: log likelihood = -245.08242

Logistic regression Number of obs = 500

LR chi2(1) = 4.64

Prob > chi2 = 0.0312

Log likelihood = -245.08242 Pseudo R2 = 0.0094

------------------------------------------------------------------------------

doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

nopov | .6713351 .3013476 2.23 0.026 .0807047 1.261965

_cons | .83975 .2745156 3.06 0.002 .3017093 1.377791

------------------------------------------------------------------------------

The fitted regression model is $\text{logit}(\hat{p}_i) = 0.840 + 0.671 X_i$.

Margin annotations (Raj Dasgupta) on the steps above:

• i.e., we have a binary (0 or 1) outcome and observations are independent.

• This is our logistic regression model. It is essentially modelling the mean: here $p_i$ is the mean, just as in linear regression, where $\mu_{y_i \mid x_i} = \alpha + \beta x_i$ and $y_i \sim N(\mu_{y_i \mid x_i}, \sigma^2)$.

• gen nopov = 1 - poverty (defines the nopov covariate)

• Menu path: Statistics > Binary Outcomes > Logistic Regression

4. Interpret the coefficients.

• $\alpha$ = log(odds of visiting the doctor when $X_i = 0$)

• $\beta$ = log(odds ratio of visiting the doctor for no poverty versus poverty) = log(odds of visiting the doctor when $X_i = 1$) - log(odds of visiting the doctor when $X_i = 0$)

• $\alpha + \beta$ = log(odds of visiting the doctor when $X_i = 1$)

5. Provide an OR and a 95% confidence interval.

Hard way: $\exp(\hat\beta) = 1.957$ with 95% CI (exp(0.0807047), exp(1.261965)) = (1.084, 3.532).
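The "hard way" is nothing more than exponentiating the coefficient and its confidence limits. The sketch below uses Python purely as a calculator (it is not part of the Stata workflow in this tutorial, and the function name `odds_ratio_ci` is hypothetical); the coefficient and standard error are copied from the logit output above.

```python
import math

def odds_ratio_ci(beta, se, z_crit=1.959964):
    """Return (OR, lower, upper) from a logit coefficient using a Wald interval."""
    lower = beta - z_crit * se   # lower limit on the log odds scale
    upper = beta + z_crit * se   # upper limit on the log odds scale
    return math.exp(beta), math.exp(lower), math.exp(upper)

# nopov coefficient and standard error, copied from the logit output
or_hat, or_lo, or_hi = odds_ratio_ci(0.6713351, 0.3013476)
print(f"OR = {or_hat:.3f}, 95% CI = ({or_lo:.3f}, {or_hi:.3f})")
```

The interval is built on the log odds scale and only then exponentiated, which is why it is not symmetric around the odds ratio.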

Easy way:

. lincom nopov, eform

( 1) [doctor]nopov = 0

------------------------------------------------------------------------------

doctor | exp(b) Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

(1) | 1.956848 .5896914 2.23 0.026 1.084051 3.532357

------------------------------------------------------------------------------

Another easy way:

. logistic doctor nopov

Logistic regression Number of obs = 500

LR chi2(1) = 4.64

Prob > chi2 = 0.0312

Log likelihood = -245.08242 Pseudo R2 = 0.0094

------------------------------------------------------------------------------

doctor | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

nopov | 1.956848 .5896914 2.23 0.026 1.084051 3.532357

_cons | 2.315788 .63572 3.06 0.002 1.352168 3.96613

------------------------------------------------------------------------------

6. What is the probability of visiting the doctor in the past 12 months for those above poverty? below poverty?

. predict phat

(option pr assumed; Pr(doctor))

Below poverty: 0.6984126
Above poverty: 0.819222
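With a single binary covariate, predict can only return two distinct values, and both can be recovered by hand from the fitted coefficients with the inverse logit. A minimal sketch, assuming the coefficients printed in the logit output (Python is used here only as a calculator, and the helper `invlogit` is written out rather than taken from any library):

```python
import math

def invlogit(x):
    """Inverse logit: convert a log odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-x))

alpha_hat = 0.83975    # _cons from the logit output
beta_hat = 0.6713351   # nopov from the logit output

p_below = invlogit(alpha_hat)             # X_i = 0: below the poverty line
p_above = invlogit(alpha_hat + beta_hat)  # X_i = 1: above the poverty line
print(f"below poverty: {p_below:.4f}, above poverty: {p_above:.4f}")
```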

7. Test the hypothesis that $H_0: \beta = 0$ versus $H_A: \beta \neq 0$ at the 0.05 level of significance.

$\hat\beta = 0.6713351$, $\text{se}(\hat\beta) = 0.3013476$, $Z = 2.23$.

Under $H_0$, $Z \sim N(0, 1)$, and p = 0.026. We reject $H_0$ and conclude being above the poverty level is associated with higher odds of visiting the doctor within the past 12 months.

Note that the 95% CI for $\beta$ excludes 0 and the 95% CI for the odds ratio excludes 1, leading to the same conclusion (as will always be the case).

Annotations (Raj Dasgupta):
• log(pi / (1 - pi)) = a + b xi, where log(pi / (1 - pi)) is the log odds.
• As you can see from the model above, when xi = 0 the log odds reduce to a.
• beta from above = 0.67, hence e^beta = e^0.67 = 1.96.
• Using logistic instead of logit gives us the odds ratio. Menu: Statistics > Binary Outcomes > Logistic Regression (Reporting Odds Ratio). You can also use: lincom nopov, or
• predict command: creates a new variable called phat which contains the predicted probability of going to the doctor for every individual in the dataset.
• table nopov, phat (get the values from the column names).
• Item 7: you can simply check the p-value in the row for nopov; if it is significant, it means that beta is non-zero.
• Alternatively: if you are looking at the odds ratio and the 95% CI does not include 1, you can also state that beta is significant. Note that for the odds ratio you have to test against 1 and not 0.
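The Wald test in item 7 can be checked by hand: divide the coefficient by its standard error and look up the two-sided tail of the standard normal. The sketch below is an illustration in Python rather than Stata, and `wald_test` is a hypothetical helper name.

```python
import math

def wald_test(beta, se):
    """Return the Wald z statistic and its two-sided standard normal p-value."""
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

# nopov coefficient and standard error, copied from the logit output
z, p = wald_test(0.6713351, 0.3013476)
print(f"Z = {z:.2f}, p = {p:.3f}")
```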

8. Compare your results to the 2 × 2 table analysis from week 7.

Yes, our results match up to the contingency table analysis, as they should! The beauty of logistic regression is in its flexibility, as we see next.


Multiple Logistic Regression

Now, we expand the regression model, adding in more covariates. Add gender to your model.

1. First, assume no effect modification by gender. State the model.

Define $Y_i = 1$ if individual $i$ visited the doctor in the last 12 months, 0 otherwise; $X_{1i} = 1$ if the individual is above the poverty line, 0 otherwise; $X_{2i} = 1$ if female, 0 if male. Then, our model is $Y_i \sim \text{Bernoulli}(p_i)$, where

$\text{logit}(p_i) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i}$

2. Fit the model.

. logit doctor nopov female

Iteration 0: log likelihood = -247.4035

Iteration 1: log likelihood = -229.36247

Iteration 2: log likelihood = -228.56747

Iteration 3: log likelihood = -228.56462

Iteration 4: log likelihood = -228.56462

Logistic regression Number of obs = 500

LR chi2(2) = 37.68

Prob > chi2 = 0.0000

Log likelihood = -228.56462 Pseudo R2 = 0.0761

------------------------------------------------------------------------------

doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

nopov | .997763 .3245721 3.07 0.002 .3616134 1.633913

female | 1.384033 .2549714 5.43 0.000 .8842978 1.883767

_cons | -.0321554 .3246122 -0.10 0.921 -.6683837 .6040729

------------------------------------------------------------------------------

The fitted regression model is $\text{logit}(p_i) = -0.032 + 0.998 X_{1i} + 1.384 X_{2i}$.

3. Is there evidence of effect modification by gender?

Now, we fit the model

$\text{logit}(p_i) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i}$

and test whether $\beta_3 = 0$.

. xi: logit doctor i.nopov*female

i.nopov _Inopov_0-1 (naturally coded; _Inopov_0 omitted)

i.nopov*female _InopXfemal_# (coded as above)

Iteration 0: log likelihood = -247.4035

Iteration 1: log likelihood = -229.89318

Iteration 2: log likelihood = -228.56544

Iteration 3: log likelihood = -228.55916

Iteration 4: log likelihood = -228.55916

Logistic regression Number of obs = 500

LR chi2(3) = 37.69

Prob > chi2 = 0.0000

Log likelihood = -228.55916 Pseudo R2 = 0.0762

-------------------------------------------------------------------------------

doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]

--------------+----------------------------------------------------------------

_Inopov_1 | .9619012 .472267 2.04 0.042 .036275 1.887528

female | 1.329136 .5835434 2.28 0.023 .1854119 2.47286

_InopXfemal_1 | .0678287 .6489728 0.10 0.917 -1.204135 1.339792

_cons | -3.76e-15 .4472136 -0.00 1.000 -.8765225 .8765225

-------------------------------------------------------------------------------

Annotations (Raj Dasgupta):
• State your model: yi = 1 if visited doctor, yi = 0 otherwise; x1i = 1 if above poverty level, x1i = 0 if below poverty level; x2i = 1 if female, x2i = 0 if male; yi ~ Bernoulli(pi). We model pi on a logit scale.
• gen female = gender - 1 (in the dataset gender = 1 for males, so with female = gender - 1 all males have female = 0, which is intuitive).
• table gender female
• The model without an interaction does not account for gender-specific odds ratios; for that we need the interaction term. b3 gives us the gender-specific odds ratio.
• Different method for the interaction: gen interaction = nopov * female, then logit doctor nopov female interaction

There is no evidence of effect modification by gender.
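To see what the interaction term would mean if it were taken at face value, the gender-specific odds ratios can be computed from the interaction model: among males (female = 0) the log odds ratio for poverty status is the _Inopov_1 coefficient, and among females it is that coefficient plus the interaction. A minimal sketch in Python (used only as a calculator; the variable names are illustrative):

```python
import math

b1 = 0.9619012  # _Inopov_1: log OR for nopov when female = 0
b3 = 0.0678287  # _InopXfemal_1: extra log OR for nopov when female = 1

or_male = math.exp(b1)          # OR of visiting the doctor, males
or_female = math.exp(b1 + b3)   # OR of visiting the doctor, females
print(f"OR (males) = {or_male:.2f}, OR (females) = {or_female:.2f}")
```

The two odds ratios are close to each other, which is consistent with the large p-value (0.917) on the interaction term.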

4. Is there evidence of confounding by gender?

Without gender: $\hat\beta_1 = 0.671$
With gender: $\hat\beta_1 = 0.998$
Yes, there is evidence of confounding by gender.
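The size of the shift in the poverty coefficient can be expressed as a relative change, and as the corresponding odds ratios. This is only an illustrative calculation in Python using the two coefficients from the outputs above; the tutorial does not state a formal threshold for "how much change counts", so none is applied here.

```python
import math

b1_crude = 0.6713351     # nopov coefficient, model without gender
b1_adjusted = 0.997763   # nopov coefficient, model with gender

rel_change = (b1_adjusted - b1_crude) / b1_crude
or_crude = math.exp(b1_crude)
or_adjusted = math.exp(b1_adjusted)
print(f"relative change in coefficient: {rel_change:.1%}")
print(f"OR without gender: {or_crude:.2f}, OR with gender: {or_adjusted:.2f}")
```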

Annotations (Raj Dasgupta):
• b3 has a p-value of 0.917, which means b3 is not significant, and there is no evidence that the OR of visiting the doctor varies by gender. Therefore there is no evidence of effect modification.
• There is a big change in the value of b1 for nopov when we add the female term, which shows that gender is having an effect on the probability of visiting the doctor.
• You can also check logit nopov female: the coefficient of female is -0.75, which shows that females are more likely to be in poverty, so gender is associated with poverty level. And for logit doctor female, the coefficient of female is 1.25, meaning that females are much more likely to go to the doctor (note that female = 1 for females and 0 for males).

Logistic Regression with a Continuous Covariate

As in the previous tutorial, we fit a model to examine the relationship between visiting the doctor in the past 12 months and whether an individual is above or below the federal poverty level, conditional on gender. We fit a logistic regression model with doctor as the outcome, and with nopov and female as covariates. But now we add a continuous covariate, age, to the model!

Open the chis healthdisparities.dta dataset.

1. Assume that, conditional on poverty status and gender, the probability of visiting the doctor varies linearly on the logit scale with age. State your model.

Define $Y_i = 1$ if individual $i$ visited the doctor in the last 12 months, 0 otherwise; $X_{1i} = 1$ if the individual is above the poverty line, 0 otherwise; $X_{2i} = 1$ if female, 0 if male; and $X_{3i}$ = age in years. Then, our model is $Y_i \sim \text{Bernoulli}(p_i)$, where

$\text{logit}(p_i) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i}$

2. Fit the model.

. logit doctor nopov female age

Iteration 0: log likelihood = -247.4035

Iteration 1: log likelihood = -226.31928

Iteration 2: log likelihood = -225.22574

Iteration 3: log likelihood = -225.2222

Iteration 4: log likelihood = -225.2222

Logistic regression Number of obs = 500

LR chi2(3) = 44.36

Prob > chi2 = 0.0000

Log likelihood = -225.2222 Pseudo R2 = 0.0897

------------------------------------------------------------------------------

doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

nopov | .9882978 .3271762 3.02 0.003 .3470442 1.629551

female | 1.334568 .2568062 5.20 0.000 .8312367 1.837899

age | .0187776 .0074311 2.53 0.012 .0042129 .0333423

_cons | -.8066067 .4469253 -1.80 0.071 -1.682564 .0693507

------------------------------------------------------------------------------

The fitted model is $\text{logit}(p_i) = -0.807 + 0.988 X_{1i} + 1.335 X_{2i} + 0.019 X_{3i}$.

3. Is there evidence that age is a confounder of the doctor-poverty relationship? Would you expect age to be a confounder?

With gender only: $\hat\beta_1 = 0.998$
With age and gender: $\hat\beta_1 = 0.988$
No, there is no evidence of confounding by age.

Annotations (Raj Dasgupta):
• State your model: yi = 1 if visited doctor, yi = 0 otherwise; x1i = 1 if above poverty level, x1i = 0 if below poverty level; x2i = 1 if female, x2i = 0 if male; x3i = age (continuous).
• b1 is the coefficient for nopov.

4. Interpret the odds ratio.

. lincom nopov, eform

( 1) [doctor]nopov = 0

------------------------------------------------------------------------------

doctor | exp(b) Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

(1) | 2.686657 .8790104 3.02 0.003 1.414879 5.101586

------------------------------------------------------------------------------

Conditioning on age and gender, the odds of visiting the doctor are 2.69 times as high (with 95% CI 1.41, 5.10) in those above the poverty line, compared to those below the poverty line.
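The same exponentiation applies to the continuous covariate, except that the odds ratio is per unit of age, so a larger contrast is obtained by multiplying the coefficient before exponentiating. The sketch below is a Python illustration using the coefficients from the logit output; the 10-year contrast is an arbitrary choice made here for illustration and does not come from the tutorial.

```python
import math

b_nopov = 0.9882978  # nopov coefficient from the logit output
b_age = 0.0187776    # age coefficient (per year) from the logit output

or_nopov = math.exp(b_nopov)            # above vs below poverty
or_age_1yr = math.exp(b_age)            # per 1-year increase in age
or_age_10yr = math.exp(10 * b_age)      # per 10-year increase (illustrative)
print(f"OR nopov = {or_nopov:.3f}")
print(f"OR per year of age = {or_age_1yr:.3f}, per 10 years = {or_age_10yr:.3f}")
```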

5. Test for an association between poverty and visiting the doctor in the past 12 months, conditioning on age and gender, at the 0.05 level of significance.

We test $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$.

$\hat\beta_1 = .988$, $\text{se}(\hat\beta_1) = .327$, $Z = 3.02$.

Under $H_0$, $Z \sim N(0, 1)$, and p = 0.003. (Note: the 95% CI for $\beta_1$ excludes 0 and the 95% CI for the OR subsequently excludes 1.)

We reject $H_0$ and conclude that there is evidence in the data that being above the poverty line is associated with a higher likelihood of visiting the doctor in the past 12 months, conditioning on age and gender.

6. Predict the probability of visiting the doctor for everyone in your dataset.

predict phat

7. What is the predicted probability of visiting the doctor for a 45 year old woman above the poverty level? below the poverty level?

. lincom _cons + age*45 + female + nopov

( 1) [doctor]nopov + [doctor]female + 45*[doctor]age + [doctor]_cons = 0

------------------------------------------------------------------------------

doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

(1) | 2.361251 .2244562 10.52 0.000 1.921325 2.801177

------------------------------------------------------------------------------

. lincom _cons + age*45 + female + nopov*0

( 1) [doctor]female + 45*[doctor]age + [doctor]_cons = 0

------------------------------------------------------------------------------

doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

(1) | 1.372953 .3101936 4.43 0.000 .7649849 1.980921

------------------------------------------------------------------------------

Annotations (Raj Dasgupta):
• Item 4: or use: lincom nopov, or
• Item 5: b1 is the coefficient of nopov, which also has a p-value < 0.05 and is hence significant.
• Item 7: lincom _cons + nopov*1 + female*1 + age*45 is for a female above poverty. So logit(phat_i) = 2.36, and to find phat we have to take the inverse logit of 2.36.
• Item 7: the second lincom is for a female below poverty.

. di invlogit(2.361251 )

.91382437

. di invlogit(1.372953)

.79785684

Above the poverty line: 91.4%
Below the poverty line: 79.8%
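The two lincom calls followed by invlogit amount to plugging covariate values into the fitted linear predictor and transforming the result. A minimal Python sketch of that arithmetic, assuming the coefficients printed in the logit output (the function name `predicted_probability` is hypothetical):

```python
import math

# Coefficients copied from the logit doctor nopov female age output
coef = {"_cons": -0.8066067, "nopov": 0.9882978, "female": 1.334568, "age": 0.0187776}

def predicted_probability(nopov, female, age):
    """Evaluate the linear predictor, then apply the inverse logit."""
    xb = (coef["_cons"] + coef["nopov"] * nopov
          + coef["female"] * female + coef["age"] * age)
    return 1.0 / (1.0 + math.exp(-xb))

p_above = predicted_probability(nopov=1, female=1, age=45)
p_below = predicted_probability(nopov=0, female=1, age=45)
print(f"above poverty: {p_above:.3f}, below poverty: {p_below:.3f}")
```

Unlike lincom, this calculation gives only the point estimates; it does not reproduce the standard errors or confidence intervals.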

Annotations (Raj Dasgupta):
• Take the inverse logit to find the predicted probability. So, for above poverty, the probability of visiting the doctor = 91%, and for below poverty, the probability of visiting the doctor = 80%.

Recap + Model Fit

Open the chis healthdisparities.dta dataset.

1. After fitting and considering several models, what are our conclusions about the relationship between poverty and visiting the doctor in the past 12 months?

Those below the poverty line appear less likely to visit the doctor in the past 12 months.

2. Compare the fit of these models.

There are several options for assessing the fit of a logistic regression model. We don’t have time to look at all of them (if you are interested, look at Hosmer-Lemeshow and deviance). But, to relate back to week 3, let’s look at the ROC curve.

Fit the logistic regression model with doctor as the outcome and nopov, female, and age as covariates.

We choose a cut-off c and construct a classification table:

              Yi = 1     Yi = 0
pi > c        Correct    False +
pi <= c       False -    Correct

For example, when c = 0.8:

. estat classification, cutoff(0.8)

Logistic model for doctor

-------- True --------

Classified | D ~D | Total

-----------+--------------------------+-----------

+ | 243 25 | 268

- | 159 73 | 232

-----------+--------------------------+-----------

Total | 402 98 | 500

Classified + if predicted Pr(D) >= .8

True D defined as doctor != 0

--------------------------------------------------

Sensitivity Pr( +| D) 60.45%

Specificity Pr( -|~D) 74.49%

Positive predictive value Pr( D| +) 90.67%

Negative predictive value Pr(~D| -) 31.47%

--------------------------------------------------

False + rate for true ~D Pr( +|~D) 25.51%

False - rate for true D Pr( -| D) 39.55%

False + rate for classified + Pr(~D| +) 9.33%

False - rate for classified - Pr( D| -) 68.53%

--------------------------------------------------

Correctly classified 63.20%

--------------------------------------------------

Annotations (Raj Dasgupta):
• We can't do residual plots like we did for linear regression and hence need to use different methods.
• logit doctor nopov female age. In Stata: Statistics > Binary Outcomes > PostEstimation > Goodness of Fit after logit ... (then check "Report various summary stats" in the reporting options window and change the Positive Outcome Threshold to 0.8 instead of 0.5).
• Here we used 0.8 since the probability of visiting the doctor is very high. In practice, you should try different cutoff values (0.5, etc.).
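Every summary statistic in the classification output follows from the four cell counts of the 2 × 2 table at the chosen cut-off. The sketch below recomputes them in Python from the counts shown for c = 0.8; it is an illustration of the definitions, not a replacement for estat classification.

```python
# Cell counts copied from the classification table at cutoff 0.8
tp, fp = 243, 25   # classified +: true D, true ~D
fn, tn = 159, 73   # classified -: true D, true ~D

sensitivity = tp / (tp + fn)              # Pr(+ | D)
specificity = tn / (tn + fp)              # Pr(- | ~D)
ppv = tp / (tp + fp)                      # Pr(D | +)
npv = tn / (tn + fn)                      # Pr(~D | -)
accuracy = (tp + tn) / (tp + fp + fn + tn)

for name, value in [("Sensitivity", sensitivity), ("Specificity", specificity),
                    ("PPV", ppv), ("NPV", npv), ("Correctly classified", accuracy)]:
    print(f"{name}: {value:.2%}")
```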

To get the full ROC curve (and the area under the ROC curve), try lroc.

Plot the ROC curve for the three models above to visualize the improved classification of the more complex models. We could likely add more covariates to further improve the discriminatory ability of the model.

Annotations (Raj Dasgupta):
• Statistics > Binary Outcomes > PostEstimation > ROC Curve. The steeper the curve above the diagonal line, the better the ROC curve; you could then add more covariates and see which model gives you the best ROC curve.

PH207X Fall 2012 Survey Data Demo Page 1 of 11

Objectives for Survey Results Module 1 – Basic Statistics

• Number of respondents at baseline and follow-up
• Number of participants in longitudinal dataset
• Differences between baseline and longitudinal dataset

I. Number of respondents at baseline and follow-up

a. Baseline survey – 9157 respondents
b. Follow-up survey – 3700 respondents

II. Number of participants in longitudinal dataset

a. 596 participants provided unique identifiers in both the baseline and follow-up survey

III. Differences between baseline and longitudinal dataset

The table below compares those who responded to the baseline survey with those who were included in the longitudinal dataset, for selected variables which we will be using later in the demo.

                      Baseline (n=9157)    Longitudinal (n=596)
Sex
  female              3984 (44%)           282 (47%)
  male                4521 (49%)           310 (52%)
  missing             652 (7%)             4 (0.7%)
Age                   32±10.1              33.78±10.3
Computer
  Mac                 1429 (16%)           84 (14%)
  PC                  7080 (77%)           509 (85%)
  missing             648 (7%)             3 (0.5%)
Aptitude
  math                3773 (41%)           287 (48%)
  verbal              4735 (52%)           305 (51%)
  missing             649 (7%)             4 (0.7%)
Your Handedness
  righty              7582 (83%)           531 (89%)
  lefty               547 (6%)             44 (7%)
  ambidextrous        351 (4%)             18 (3%)

Annotations (Raj Dasgupta):
• Since the exposure and the outcome variables were measured at the same time, we can use a cross-sectional study.
• With a cross-sectional study there is no follow-up time involved, so we cannot use a rate ratio. We can calculate a risk ratio or an odds ratio. The risk ratio is better to use because it is easier to interpret; the odds ratio may not be similar to the risk ratio if the outcome is not rare. Therefore, it is preferable to present the risk ratio. We'll just view the

Objectives for Survey Results Module 2 – Factors Related to Mac vs PC Use

Choose a study design to examine the association between math and verbal aptitude and Mac/PC use.

Calculate the appropriate measure of association comparing math versus verbal aptitude and Mac/PC use.

Construct your own analysis to study the association between handedness and Mac/PC use.

I. Choose a study design

a. Exposure: Math and Verbal aptitude
b. Outcome: Mac/PC Use
c. Study design: Cross-sectional study

II. Calculate the appropriate measure of association comparing the math and verbal aptitude and Mac/PC use.

a. Measure of association – Risk Ratio or Odds Ratio
b. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   ii. Case variable: macpc
   iii. Exposed variable: aptitude
   iv. On the options tab, check box for “Report odds ratio”
   v. Submit
c. Command Window Syntax: cs macpc aptitude, or

                 | aptitude               |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |        39          45  |         84
        Noncases |       248         258  |        506
-----------------+------------------------+------------
           Total |       287         303  |        590
                 |                        |
            Risk |  .1358885    .1485149  |   .1423729
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |        -.0126263       |   -.0689729    .0437202
      Risk ratio |         .9149826       |    .6150247    1.361235
 Prev. frac. ex. |         .0850174       |   -.3612351    .3849753
 Prev. frac. pop |         .0413559       |
      Odds ratio |         .9016129       |    .5690484    1.428626 (Cornfield)
                 +-------------------------------------------------
                               chi2(1) =     0.19  Pr>chi2 = 0.6609

People with stronger math abilities were about 9% less likely to use a Mac compared to people with stronger verbal abilities. The confidence interval for our risk ratio was 0.62 to 1.36, which includes 1, so this difference is not statistically significant.
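Both point estimates reported by cs come straight from the four cell counts. The following Python sketch recomputes them from the table above; the function name and argument order are chosen here for illustration only.

```python
def risk_and_odds_ratio(a, b, c, d):
    """a, b = exposed and unexposed cases; c, d = exposed and unexposed noncases."""
    risk_exposed = a / (a + c)
    risk_unexposed = b / (b + d)
    risk_ratio = risk_exposed / risk_unexposed
    odds_ratio = (a * d) / (b * c)
    return risk_ratio, odds_ratio

# Counts from the cs macpc aptitude table: cases 39/45, noncases 248/258
rr, odds_ratio = risk_and_odds_ratio(39, 45, 248, 258)
print(f"risk ratio = {rr:.4f}, odds ratio = {odds_ratio:.4f}")
```

With an outcome prevalence of roughly 14%, the two measures are close but not identical, with the odds ratio slightly further from 1.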


III. Construct your own analysis to study the association between Mac/PC use and handedness.

a. Measure of association – Risk Ratio or Odds Ratio
b. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   ii. Case variable: macpc
   iii. Exposed variable: lefty
   iv. On the options tab, check box for “Report odds ratio”
   v. Submit
c. Command Window Syntax: cs macpc lefty, or

                 | lefty                  |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |         6          77  |         83
        Noncases |        38         452  |        490
-----------------+------------------------+------------
           Total |        44         529  |        573
                 |                        |
            Risk |  .1363636    .1455577  |   .1448517
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |         -.009194       |   -.1149534    .0965653
      Risk ratio |         .9368359       |    .4330182    2.026847
 Prev. frac. ex. |         .0631641       |   -1.026847    .5669818
 Prev. frac. pop |         .0048503       |
      Odds ratio |         .9268626       |    .3889418    2.214373 (Cornfield)
                 +-------------------------------------------------
                               chi2(1) =     0.03  Pr>chi2 = 0.8678

The risk ratio for this study was 0.94 and the odds ratio was 0.93. In this sample, people who are left-handed were slightly less likely to use a Mac compared to people who are right-handed, but both confidence intervals include 1, so there is no evidence of an association.

Annotations (Raj Dasgupta): Explanation for Survey Results Module 3

Exposure came first (tea or coffee), then the outcome came later (sleep difficulty). Therefore this is a cohort study.

The next question is what the appropriate measure of association is. In a cohort study, we can use data on the number of exposed and unexposed cases and non-cases to calculate a risk ratio, or we can collect information about exposed and unexposed person-time in our cases and non-cases and calculate a rate ratio.

As you may recall from way back in the beginning of the course, we use rates when we're concerned about competing risks and loss to follow-up, in studies where people are followed for many, many years and we're worried about not being able to observe all the outcomes because of these issues. But in this study, we have information on tea and coffee consumption and sleep quality that night, so we're not concerned about competing risks and loss to follow-up.

Also, we often use rates when we want to understand the timing of how long it takes for the exposure to result in an outcome. But here we're not asking how long it takes for tea and coffee consumption to cause a change in sleep difficulties. We just want to know whether they have sleep difficulties. Yes or no. Case or no case. Therefore, the appropriate measure of association here is a risk ratio.

Our exposure variable is going to be caff2hrb4, which equals one if the person said that he or she did drink tea or coffee within two hours of going to bed and equals zero if he or she did not report drinking tea or coffee in the two hours before bed. The variable equals missing if the person did not answer the question. Our outcome variable is sleepdiff, which equals one if the participant reported sleep difficulties that night and zero if the person did not report

Objectives for Survey Results Module 3 – Risk Factors for Sleep Difficulties

Choose a study design to examine the association between tea/coffee consumption before bed and sleep difficulties.

Calculate the appropriate measure of association comparing tea/coffee consumption before bed and sleep difficulties.

Consider confounding and effect modification by sex.

Consider confounding and effect modification by age.

Construct your own analysis to study the association between handedness and sleep difficulties. Consider confounding and effect modification by sex.

I. Choose a study design to examine the association between tea/coffee consumption before bed and sleep difficulties.

a. Study design: Cohort study
b. Exposure: Tea and coffee consumption two hours before bed
c. Outcome: Sleep difficulties that night

II. Calculate the appropriate measure of association comparing tea/coffee consumption before bed and sleep difficulties.

a. Measure of association: Risk ratio
b. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   ii. Case variable: sleepdiff
   iii. Exposed variable: caff2hrb4
   iv. Submit.
c. Command Window Syntax: cs sleepdiff caff2hrb4

                 | caff2hrb4              |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |        19          81  |        100
        Noncases |       102         389  |        491
-----------------+------------------------+------------
           Total |       121         470  |        591
                 |                        |
            Risk |  .1570248    .1723404  |   .1692047
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |        -.0153156       |   -.0885836    .0579524
      Risk ratio |         .9111315       |    .5763827    1.440294
 Prev. frac. ex. |         .0888685       |   -.4402942    .4236173
 Prev. frac. pop |         .0181947       |
                 +-------------------------------------------------
                               chi2(1) =     0.16  Pr>chi2 = 0.6886

Those who drank tea or coffee before bed had 0.91 times the risk of sleep difficulties compared to those who did not drink tea or coffee.
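The confidence interval for the risk ratio can be reproduced with the usual large-sample interval on the log scale. The sketch below applies that textbook formula in Python; that it matches the interval printed by cs to several decimals is an observation about these numbers, not a statement about how Stata computes it internally, and `risk_ratio_ci` is a hypothetical helper name.

```python
import math

def risk_ratio_ci(a, n1, b, n0, z_crit=1.959964):
    """a of n1 exposed are cases; b of n0 unexposed are cases."""
    rr = (a / n1) / (b / n0)
    # Standard error of log(RR) from the delta method
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / b - 1 / n0)
    lower = math.exp(math.log(rr) - z_crit * se_log_rr)
    upper = math.exp(math.log(rr) + z_crit * se_log_rr)
    return rr, lower, upper

# Counts from the cs sleepdiff caff2hrb4 table: 19 of 121 exposed, 81 of 470 unexposed
rr, rr_lo, rr_hi = risk_ratio_ci(19, 121, 81, 470)
print(f"RR = {rr:.3f}, 95% CI = ({rr_lo:.3f}, {rr_hi:.3f})")
```

Because the interval comfortably includes 1, the data give no evidence of an association between tea or coffee before bed and sleep difficulties, in line with the chi-square p-value of 0.69.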


III. Consider confounding and effect modification by sex.

a. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   ii. Case variable: sleepdiff
   iii. Exposed variable: caff2hrb4
   iv. Go to the “Options” tab; click the box next to “stratify on variables”; use the dropdown menu to select “male”. Note: Under “Within-stratum weights” the button next to “Use Mantel-Haenszel” should be automatically selected.
   v. Submit.
b. Command Window Syntax: cs sleepdiff caff2hrb4, by(male)

            male |       RR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
              no |   .6689266    .3338178     1.34044     9.448399
             yes |   1.241935    .6697457    2.302969     7.068404
-----------------+-------------------------------------------------
           Crude |   .9111315    .5763827    1.440294
    M-H combined |   .9141471    .5777317    1.446458
-------------------------------------------------------------------
Test of homogeneity (M-H)      chi2(1) =     1.721  Pr>chi2 = 0.1895

The crude effect estimate is 0.911 while the Mantel-Haenszel adjusted risk ratio is 0.914. Since the crude and adjusted risk ratios are so similar, we can conclude that there is not strong confounding by sex in our study. Although our risk ratios for males and females seem different, there is no evidence of statistically significant effect modification by sex.
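The Mantel-Haenszel combined risk ratio is a weighted average of the stratum-specific risk ratios, using the M-H weights printed in the last column. Since the stratum cell counts are not shown in the output, the Python sketch below works only from the printed risk ratios and weights; `mh_combined` is a hypothetical helper name.

```python
def mh_combined(strata):
    """strata: list of (stratum risk ratio, M-H weight) pairs."""
    weighted_sum = sum(rr * w for rr, w in strata)
    total_weight = sum(w for _, w in strata)
    return weighted_sum / total_weight

# Stratum RRs and M-H weights from cs sleepdiff caff2hrb4, by(male)
by_sex = [(0.6689266, 9.448399), (1.241935, 7.068404)]
rr_mh = mh_combined(by_sex)
print(f"M-H combined RR = {rr_mh:.4f}")
```

The same function applied to the strata of any other by() variable should return the corresponding "M-H combined" row, up to rounding of the printed inputs.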

Annotations (Raj Dasgupta):
• The p-value of the test of homogeneity is > 0.05 and hence not significant: no evidence of effect modification.

IV. Consider confounding and effect modification by age.

a. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   ii. Case variable: sleepdiff
   iii. Exposed variable: caff2hrb4
   iv. Go to the “Options” tab; click the box next to “stratify on variables”; use the dropdown menu to select “agecat”. Note: Under “Within-stratum weights” the button next to “Use Mantel-Haenszel” should be automatically selected.
   v. Submit.
b. Command Window Syntax: cs sleepdiff caff2hrb4, by(agecat)

          agecat |       RR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
   18-29 yrs old |   1.202553     .654413    2.209817        7.208
   30-39 yrs old |   .5083612    .1870783    1.381406     6.040404
   40-49 yrs old |   .8452381    .2120323    3.369427     1.976471
    >=50 yrs old |   1.305556    .3431283    4.967457     1.309091
-----------------+-------------------------------------------------
           Crude |   .9111315    .5763827    1.440294
    M-H combined |   .9143836    .5795498    1.442667
-------------------------------------------------------------------
Test of homogeneity (M-H)      chi2(3) =     2.389  Pr>chi2 = 0.4957

The crude risk ratio is 0.911 while the Mantel-Haenszel adjusted risk ratio is 0.914. Since the crude and adjusted risk ratios are so similar, there is not strong confounding by age category in our study. Despite the differences in the risk ratios by age category, there is no evidence of statistically significant effect modification by age category.

V. Construct your own analysis to study the association between handedness and sleep difficulties. Consider confounding and effect modification by sex.

a. Dropdown:

   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   ii. Case variable: sleepdiff
   iii. Exposed variable: lefty
   iv. Submit
b. Command Window Syntax: cs sleepdiff lefty

                 | lefty                  |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |         8          89  |         97
        Noncases |        36         440  |        476
-----------------+------------------------+------------
           Total |        44         529  |        573
                 |                        |
            Risk |  .1818182     .168242  |   .1692845
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |         .0135762       |   -.1047616     .131914
      Risk ratio |         1.080695       |    .5614644    2.080098
 Attr. frac. ex. |         .0746692       |   -.7810567    .5192533
 Attr. frac. pop |         .0061583       |
                 +-------------------------------------------------
                               chi2(1) =     0.05  Pr>chi2 = 0.8175

People who are left-handed have a slightly higher (1.08-fold) risk of sleep difficulties compared to people who are right-handed.


c. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   ii. Case variable: sleepdiff
   iii. Exposed variable: lefty
   iv. Go to the “Options” tab; click the box next to “stratify on variables”; use the dropdown menu to select “male”. Note: Under “Within-stratum weights” the button next to “Use Mantel-Haenszel” should be automatically selected.
   v. Submit.
d. Command Window Syntax: cs sleepdiff lefty, by(male)

            male |       RR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
          female |   .9338374    .3691631    2.362241     3.918519
            male |   1.333333     .531464    3.345058          2.8
-----------------+-------------------------------------------------
           Crude |   1.080695    .5614644    2.080098
    M-H combined |   1.100331    .5725662    2.114564
-------------------------------------------------------------------
Test of homogeneity (M-H)      chi2(1) =     0.288  Pr>chi2 = 0.5918

Our results stratified by gender show slightly different results among males and females. Left-handed males have 1.33 times the risk of sleep difficulties compared to right-handed males, while left-handed females have 0.93 times the risk of sleep difficulties compared to right-handed females. The test of homogeneity (p = 0.59) gives no evidence of effect modification by sex, and the crude (1.08) and Mantel-Haenszel adjusted (1.10) risk ratios are similar, so there is no strong confounding by sex.


Objectives for Survey Results Module 4 – Risk Factors for Left and Right Handedness

Choose a study design to examine the association between mother’s age at birth of PH207x participant and handedness of the participant.

Calculate the appropriate measure of association comparing the mother’s age among those who are left-handed to those who are right-handed.

Construct your own analysis to study the association between having a left-handed parent and child’s handedness.

I. Choose a study design

a. Exposure: Mother’s age at birth of PH207x participant
b. Outcome: Handedness of PH207x participant
c. Study design: Case-control

II. Calculate the appropriate measure of association comparing the mother’s age among those who are left-handed to those who are right-handed.

a. Measure of association – Odds ratio
b. Calculating the odds ratio in Stata
c. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Case control odds ratio.
   ii. Case variable: lefty
   iii. Exposed variable: momagecat
   iv. Submit
d. Command Window Syntax: cc lefty momagecat

                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |         9          35  |         44       0.2045
        Controls |        57         474  |        531       0.1073
-----------------+------------------------+------------------------
           Total |        66         509  |        575       0.1148
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         2.138346       |    .8576737    4.829077 (exact)
 Attr. frac. ex. |         .5323488       |   -.1659446    .7929211 (exact)
 Attr. frac. pop |         .1088895       |
                 +-------------------------------------------------
                               chi2(1) =     3.78  Pr>chi2 = 0.0519

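Mothers who were 35 years of age or older at the time of their child’s birth have 2.14 times the odds of having a left-handed child compared to mothers who were younger than 35 at the time of their child’s birth. We are 95% confident that the true odds ratio lies between 0.86 and 4.83. Because this interval includes 1, the association is not statistically significant at the 0.05 level, which agrees with the p-value of 0.0519.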
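The point estimates and the chi-squared test in this output can be reproduced by hand from the four cell counts. The following is a minimal Python sketch of that arithmetic, not Stata's own code: the function name is illustrative, and it covers only the point estimates and the uncorrected chi-squared test, not the exact confidence limits.

```python
import math

def case_control_summary(a, b, c, d):
    """a, b = exposed and unexposed cases; c, d = exposed and unexposed controls."""
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)                     # cross-product ratio
    attr_frac_exposed = (odds_ratio - 1) / odds_ratio  # attributable fraction, exposed
    attr_frac_pop = attr_frac_exposed * a / (a + b)    # scaled by proportion of cases exposed
    # Pearson chi-squared statistic for a 2x2 table (no continuity correction)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, attr_frac_exposed, attr_frac_pop, chi2, p_value

# Cell counts from the cc lefty momagecat table
result = case_control_summary(9, 35, 57, 474)
print([round(x, 4) for x in result])
```

The odds ratio is simply (9 × 474) / (35 × 57), and the other quantities follow from it and from the table margins.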

Note: Whenever we do a case-control study, we use the odds ratio as our measure of association.
Exposed = mother's age >= 35; Unexposed = mother's age < 35; Cases = left-handed; Controls = right-handed

III. Construct your own analysis to study the association between having at least one left-handed parent and child’s handedness.

. cc lefty parentlefty
                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |         6          38  |         44       0.1364
        Controls |        42         487  |        529       0.0794
-----------------+------------------------+------------------------
           Total |        48         525  |        573       0.0838
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         1.830827       |     .597326    4.708769 (exact)
 Attr. frac. ex. |         .4537988       |   -.6741276    .7876302 (exact)
 Attr. frac. pop |         .0618817       |
                 +-------------------------------------------------
                               chi2(1) =     1.72  Pr>chi2 = 0.1900

People with at least one left-handed parent have 1.83 times the odds of being left-handed compared to those without a left-handed parent.
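To get a feel for the width of the interval around this odds ratio, one can compute a large-sample approximation by hand. The sketch below uses the Woolf (log odds ratio) method, which is an assumption made here for illustration: it is not the exact method that cc reports, so the limits will differ somewhat from the (exact) limits in the output. The function name is hypothetical.

```python
import math

def woolf_ci(a, b, c, d, z=1.959964):
    """Approximate 95% CI for the odds ratio on the log scale (Woolf method)."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Cell counts from the cc lefty parentlefty table
or_hat, lower, upper = woolf_ci(6, 38, 42, 487)
print(round(or_hat, 3), round(lower, 2), round(upper, 2))
```

Like the exact interval, the approximate interval includes 1, which is consistent with the p-value of 0.1900: these data do not provide evidence of an association at the 0.05 level.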

Conclusions

The appropriate measure of association depends on the type of exposure and outcome of interest, the type of data available and the study design used to obtain the data.

In survey studies, one must always be concerned about issues of selection bias and generalizability.


Data Dictionary for Survey Dataset

Variable | Description | Values
---------+-------------+-------
board | In the past two weeks how often did you use the chat room for this course to post a question | "0", "2-3 times", "4 or more times", "Never", "Once"
male | Sex | 0 no (female); 1 yes (male); . missing
degree | Highest level of education | 1 pre-college / university degree; 2 bachelor degree; 3 masters degree; 4 doctoral degree
precollege | Highest level of education is Pre-College/University Degree | 0 no; 1 yes; . missing
masters | Highest level of education is Masters Degree | 0 no; 1 yes; . missing
doctorate | Highest level of education is Doctoral Degree | 0 no; 1 yes; . missing
macpc | Which type of computer do you use most of the time? | 0 pc; 1 mac; . missing
aptitude | Which is stronger, your math aptitude or your verbal aptitude? | 0 verbal; 1 math; . missing
caff2hrb4 | Did you drink coffee or tea within two hours of bedtime yesterday? | 0 no; 1 yes; . missing
sleepdiff | Did you have trouble sleeping last night? | 0 no; 1 yes; . missing
shower | Do you face the shower head? | 0 no; 1 yes; . missing
longhair | Do you consider your hair to be long? | 0 no; 1 yes; . missing


facialhair | If you are a man, do you have a beard, mustache, or goatee? | 0 no; 1 yes; . missing
agecat | Age of participant | 1 18-29 yrs old; 2 30-39 yrs old; 3 40-49 yrs old; 4 >=50 yrs old
momagecat | How old was your mother at your birth? | 0 <35 yrs old; 1 >=35 yrs old
lefty | Are you left-handed? | 0 righty; 1 lefty; . missing
dadlefty | Is your father left-handed? | 0 righty; 1 lefty; . missing
momlefty | Is your mother left-handed? | 0 righty; 1 lefty; . missing
parentlefty | Is one (or both) of your parents left-handed? | 0 No left-handed parents; 1 Left-handed parent; . missing
allhourscat | On average, how many hours per week did you spend on all aspects of this course? | 0 0-7 hours; 1 >=8 hours
hwkhourscat | On average, how many hours per week did you spend working on the homework assignments for this course? | 0 0-2 hours; 1 >=3 hours
comphrscat | For how many hours did you use your computer last night? | 0 0-1 hours; 1 >=2 hours


Tutorial: Survival Analysis in Stata

In this tutorial, we use data from the Digitalis Investigation Group (DIG). Recall that the DIG trial was a randomized, double-blind, multicenter trial designed to examine the safety and efficacy of Digoxin in treating patients with congestive heart failure. In this trial, patients were randomized to either Digoxin or placebo. The log-rank test was used to compare overall mortality between the two groups.

To begin, open the dig.dta data set. Before we can do any analyses, we must first tell Stata that we are working with survival data (analogous to how we had to svyset our data to tell Stata that we were working with survey data). You can do this using the stset command. The command for this dataset is stset deathday, failure(death==1). This command tells Stata that our time-to-death variable is deathday, and that a value of 1 for the death variable means the person died, while any other value (in this case 0) means the person was censored. For survival data, we need at least two variables: 1) a variable for the time to the event and 2) a variable indicating whether the observation is censored.

1. Graph the Kaplan-Meier estimates of the survival curves for each treatment group.

. sts graph, by(trtmt)


Note: you can also list the values of the survival function using the sts list, by(trtmt) command.
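To see what the survival function listed by sts list represents, here is a minimal Python sketch of the Kaplan-Meier product-limit estimate. It is an illustration only: the five observations are made up (they are not DIG data), the function name is hypothetical, and, as with stset, each observation has a time and an indicator that is 1 for a death and 0 for a censored observation.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death."""
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1 - deaths / at_risk  # multiply by the conditional survival at t
            curve.append((t, surv))
        at_risk -= removed  # deaths and censored observations leave the risk set
    return curve

# Hypothetical data: days of follow-up and death indicator (1 = died, 0 = censored)
toy_times = [5, 8, 8, 12, 15]
toy_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(t, round(s, 3), round(1 - s, 3))  # time, S(t), and 1 - S(t)
```

Censored observations do not cause a drop in the curve, but they do shrink the number at risk for later times. The last column, 1 - S(t), is the cumulative failure probability.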

2. In the New England Journal of Medicine paper (see handout NEJM_DIG), the authors plotted 1 – S(t) in Figure 1. Graph 1 – S(t) for each treatment group.

. sts graph, failure by(trtmt)

[Figure: Kaplan-Meier survival estimates by treatment group (trtmt = 0 and trtmt = 1). x-axis: analysis time, 0 to 2000; y-axis: 0.00 to 1.00.]


3. Conduct a log-rank test at the 0.05 level of significance to test the hypothesis that the survival distribution is the same in the two treatment groups. Use the following command:

. sts test trtmt, logrank

a. What are your null and alternative hypotheses? The null hypothesis is that the two groups have the same distribution of survival times. The alternative is that they do not.

b. What is the value of your test statistic? 0.00

c. What distribution does your test statistic have under the null hypothesis? Chi-squared distribution with 1 degree of freedom

d. What is your p-value? 0.9616. Note: using all 6800 observations yields a p-value of 0.8013. This is the p-value the authors reported in Figure 1.

[Figure: Kaplan-Meier failure estimates by treatment group (trtmt = 0 and trtmt = 1). x-axis: analysis time, 0 to 2000; y-axis: 0.00 to 1.00.]


e. What is your conclusion?

Since our p-value is greater than 0.05, we fail to reject the null hypothesis. Thus, we conclude that we do not have evidence that the distribution of survival times is different between the Digoxin group and the placebo group.
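The steps of the log-rank test in question 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a replacement for sts test: the data are hypothetical (not DIG data), the function name is made up, and the sketch assumes exactly two groups coded 0 and 1. At each death time it compares the observed deaths in group 1 with the number expected if both groups shared the same survival distribution, then refers the resulting statistic to a chi-squared distribution with 1 degree of freedom.

```python
import math

def logrank(times, events, groups):
    """Two-group log-rank test; returns (chi2, p_value)."""
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    obs1 = exp1 = var = 0.0
    for t in death_times:
        n = sum(1 for u in times if u >= t)                  # at risk, both groups
        n1 = sum(1 for u, g in zip(times, groups) if u >= t and g == 1)
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        d1 = sum(1 for u, e, g in zip(times, events, groups)
                 if u == t and e == 1 and g == 1)
        obs1 += d1
        exp1 += d * n1 / n                                   # expected deaths in group 1
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = (obs1 - exp1) ** 2 / var
    p_value = math.erfc(math.sqrt(chi2 / 2))                 # chi-squared, 1 df
    return chi2, p_value

# Hypothetical example 1: identical survival experience in both groups
same = logrank([1, 2, 3, 1, 2, 3], [1] * 6, [0, 0, 0, 1, 1, 1])
# Hypothetical example 2: every death in group 0 precedes every death in group 1
different = logrank([1, 2, 3, 4, 5, 6], [1] * 6, [0, 0, 0, 1, 1, 1])
print(same, different)
```

In the first example the observed and expected deaths agree, so the statistic is 0 and the p-value is 1, which mirrors the situation above where a statistic near 0.00 gives a p-value near 1. In the second example the statistic exceeds 3.84, so the p-value falls below 0.05.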