Optimizing Stata for Analysis of Large Data Sets


Joseph Canner, MHS
Eric Schneider, PhD

Johns Hopkins University

Stata Conference
New Orleans, LA

July 19, 2013

Background

• Programmer/Statistician: 20 years experience with SAS

• Took new job and started using Stata in January 2013

• Reviewed many do-files from predecessors and colleagues in order to learn Stata and understand new job

Caveats

• Large data sets: irrelevant if you don’t use large data sets and/or if you don’t have a system that has sufficient memory to analyze large data sets

• Coding practices: these are examples from real users, but not necessarily trained programmers or Stata experts

Benchmark Testing

• NIS 2010 Core (unless noted otherwise)
• 7,800,441 observations
• 155 variables
• 5.6 GB memory
• 25 ICD-9 diagnosis codes (DX1-DX25)
• 15 ICD-9 procedure codes (PR1-PR15)

Benchmark Testing

• Testing code:

    timer clear 1
    timer on 1
    …Code to be tested…
    timer off 1
    timer list 1

• Groups of tests were always run at the same time to eliminate differences in server, memory, and usage conditions
    – 24-core CPU, 256 GB RAM (50% load), Windows 2008
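The timing pattern above can be sketched on a small scale as follows. This is only an illustration, not the benchmark from the talk: the built-in auto data set (expanded to 740,000 observations) stands in for NIS, and the variable names and price values are hypothetical.

```stata
* Illustrative timing harness; auto.dta is a stand-in for the NIS data
clear all
sysuse auto, clear
expand 10000                         // 74 observations -> 740,000

timer clear                          // reset all timers

timer on 1
gen byte flag_or = (price==3299 | price==3667 | price==3748)
timer off 1

timer on 2
gen byte flag_inlist = inlist(price, 3299, 3667, 3748)
timer off 2

assert flag_or == flag_inlist        // the two codings must agree
timer list                           // elapsed seconds are left in r(t1) and r(t2)
```

Giving each option its own timer number and running them back to back in one do-file keeps the comparison under the same machine conditions.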

Test #1: Coding ICD-9 variables

• Option 1:

    gen FOREACH=0
    forvalues x = 1/15 {
        foreach value in "7359" "741" "9955" "640" {
            replace FOREACH=1 if PR`x'=="`value'"
        }
    }

• Time=27.6 sec

Test #1: Coding ICD-9 variables

• Option 2:

    gen IFOR=0
    forvalues x = 1/15 {
        replace IFOR=1 if PR`x'=="7359" | PR`x'=="741" | PR`x'=="9955" | PR`x'=="640"
    }

• Time=13.2 sec (about half the time!)

Test #1: Coding ICD-9 variables

• Option 3:

    gen INLIST=0
    forvalues x = 1/15 {
        replace INLIST=1 if inlist(PR`x',"7359","741","9955","640")
    }

• Time=9.6 sec (a little better than Option 2, and easier to write and read)

Test #1a: Coding single ICD-9 variables: inlist() vs. recode

• Option 1:

    gen INLIST1=0
    replace INLIST1=1 if inlist(PR1,"7359","741","9955","640","9904","8154","7569","3893")

• Time=1.2 sec

Test #1a: Coding single ICD-9 variables: inlist() vs. recode

• Option 2a:

    destring PR1, gen(tempPR1) ignore("incvl")
    recode tempPR1 (7359 741 9955 640 9904 8154 7569 3893 = 1) ///
        (else=0), gen(RECODE)
    drop tempPR1

• Time=118.1 sec (Ouch! Much of the time is devoted to the destring command)

Test #1a: Coding single ICD-9 variables: inlist() vs. recode

• Option 2b (use real() instead of destring):

    gen tempPR1=real(PR1)
    recode tempPR1 (7359 741 9955 640 9904 8154 7569 3893 = 1) ///
        (else=0), gen(RECODE)
    drop tempPR1

• Time=26.0 sec (much better than destring, but still much slower than inlist())

Test #1b: Coding single ICD-9 variables when there are ranges

• Option 1:

    split ECODE1, gen(nECODE) parse(E)
    destring nECODE2, gen(iECODE1)
    drop nECODE2
    recode iECODE1 (9200/9209 956 966 986 974 = 1)…
        (8800/8869 888 9570/9579 9681 9870 =2)
        (9220/9223 9228 9229 9550/9554 9650/9654 9794 9850/9854 970=3)
        (8100/8199 9585 9685 9885=4), gen(mech1)
    recode mech1 (5/10000=5)

• Time=142.6 sec (Again, split and destring take the bulk of the time here.)

Test #1b: Coding single ICD-9 variables when there are ranges

• Option 2:

    gen iECODE1=real(substr(ECODE1,2,4))
    recode iECODE1 (9200/9209 956 966 986 974 =1)… () () ()…, gen(mech2)
    recode mech2 (5/10000=5)

• Time=68.7 sec; better, but…

Test #1b: Coding single ICD-9 variables when there are ranges

• Option 3:

    gen mech3=.
    replace mech3=1 if (ECODE1>="E9200" & ECODE1<="E9209") | ///
        inlist(ECODE1,"E956","E966","E986","E974")
    …
    replace mech3=5 if mech3==. & substr(ECODE1,1,1)=="E"

• Time=5.74 sec (a little harder to write, but much faster!)

Test #1b: Coding single ICD-9 variables when there are ranges

• Option 4:

    gen mech4=.
    replace mech4=1 if inrange(ECODE1,"E9200","E9209") | ///
        inlist(ECODE1,"E956","E966","E986","E974")
    …
    replace mech4=5 if mech4==. & substr(ECODE1,1,1)=="E"

• Time=5.32 sec (a little faster still, and much easier to write)
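As a small check that Options 3 and 4 are interchangeable, the sketch below applies both range tests to a handful of made-up E-codes (the toy data are an assumption, not taken from NIS). String comparisons in Stata are character by character, so "E920" sorts before "E9200" and falls outside the range under either form.

```stata
* Toy data: hypothetical E-codes, including a blank and a 3-digit code
clear
input str5 ECODE1
"E9203"
"E956"
"E920"
"E8120"
""
end

* Option 3 style: explicit string comparisons
gen mech3 = .
replace mech3 = 1 if (ECODE1>="E9200" & ECODE1<="E9209") | ///
    inlist(ECODE1,"E956","E966","E986","E974")

* Option 4 style: inrange() also accepts string arguments
gen mech4 = .
replace mech4 = 1 if inrange(ECODE1,"E9200","E9209") | ///
    inlist(ECODE1,"E956","E966","E986","E974")

assert mech3 == mech4                // identical coding, missing values included
list, noobs
```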

Test #1: Coding ICD-9 Variables: Conclusions

• Using inlist() reduces the time required to recode ICD-9 variables by 65% when searching 15 variables for 4 target codes.

• Performance improves to 80% for 8 codes, and continues to improve slightly thereafter, with a maximum improvement of 92%. (Note: inlist() limit is 10 string codes or 255 numeric codes)

• In order to “stress” the test, the codes used in the test are the most popular, but the results are the same for any set of codes.

Test #1: Coding ICD-9 Variables: Conclusions (cont’d)

• Using recode is much slower than inlist() for lists of single ICD-9 codes, in large part because of the need to convert from string to numeric

• Using recode for ranges is also much slower than replace/if, for the same reason; inrange() also helps with readability

• Can use real() instead of destring, and substr() instead of split
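When the list of target string codes is longer than a single inlist() call accepts, one possible workaround is to chain several inlist() calls with |. The sketch below uses two toy procedure variables and made-up non-target codes ("0331", "8622"); it is an illustration only and was not part of the benchmarks.

```stata
* Toy data: two hypothetical procedure-code variables
clear
input str4 PR1 str4 PR2
"7359" ""
"0331" "3893"
"0331" "8622"
end

gen byte target = 0
forvalues x = 1/2 {
    * split the code list across two inlist() calls joined by |
    replace target = 1 if inlist(PR`x',"7359","741","9955","640","9904") | ///
        inlist(PR`x',"8154","7569","3893")
}

assert target == 1 in 1/2            // rows 1 and 2 contain a target code
assert target == 0 in 3              // row 3 does not
list, noobs
```

The /// line continuation works in do-files; typed interactively, the condition would need to be on one line.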

Test #2: Recoding continuous variables

• Option 1:

    gen AGE1=.
    replace AGE1=1 if AGE>=0 & AGE<=9
    replace AGE1=2 if AGE>=10 & AGE<=19
    …
    replace AGE1=10 if AGE>=90 & AGE<=120

• Time=6.6 sec

Test #2: Recoding continuous variables

• Option 2:

    gen AGE2=recode(AGE,9,19,29,39,49,59,69,79,89,120)

• Time=0.66 sec (exactly one-tenth of the time, and easier to write and read)

• Caution: with truly continuous variables, take care that you are cutting at the right place
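A toy illustration of this caution, using made-up ages: the recode() function assigns each value the first listed cut point that is greater than or equal to it, so a non-integer age such as 9.5 lands in the 19 bin, whereas the integer-bounded replace statements of Option 1 leave it missing. Note also that recode() stores the cut point itself (9, 19, …) rather than a category number (1, 2, …).

```stata
* Toy data: hypothetical ages, including one non-integer value
clear
input float AGE
0
9
9.5
10
19
end

* Option 1 style: ranges with integer bounds
gen AGE1 = .
replace AGE1 = 1 if AGE>=0  & AGE<=9
replace AGE1 = 2 if AGE>=10 & AGE<=19

* Option 2 style: recode() function with the same upper bounds
gen AGE2 = recode(AGE,9,19)

list AGE AGE1 AGE2, noobs
assert AGE1 == . & AGE2 == 19 if AGE == 9.5   // the two options disagree here
assert AGE2 == 9 if AGE <= 9                  // result is the cut point, not 1
```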

Test #2: Recoding continuous variables

• Option 3:

    recode AGE (0/9=1) (10/19=2) (20/29=3) (30/39=4) (40/49=5) ///
        (50/59=6) (60/69=7) (70/79=8) (80/89=9) (90/120=10), gen(AGE3)

• Time=46.3 sec (Ouch!) and harder to write

• May be useful for instances where ranges are not mutually exclusive (i.e., can’t use the recode() function)

Test #3: Reordering Values

• Option 1:

    gen sex_new=sex
    replace sex_new=0 if sex_new==3
    replace sex_new=5 if sex_new==2
    replace sex_new=4 if sex_new==1
    replace sex_new=1 if sex_new==5
    replace sex_new=2 if sex_new==4

• Time=2.0 sec; very cumbersome and hard to follow

Test #3: Reordering Values

• Option 2:

    recode sex (3=0) (1=2) (2=1), gen(sex_new1)

• Time=15.0 sec (Ouch!), but easier to write and MUCH easier to read

• Can also use recode to do things like:

    (3 4 = 0)   // 3 and 4 are recoded to 0
    (3/5 = 0)   // 3, 4, and 5 are recoded to 0

Test #3: Reordering Values

• Option 3:

    gen sex_new=sex
    replace sex_new=0 if sex==3
    replace sex_new=1 if sex==2
    replace sex_new=2 if sex==1

• Time=1.4 sec (Faster than Option #1 by 40% and not too hard to read/write)

Test #4: Destringing Numeric Values (e.g., NSQIP age)

• Option 1 (Variation of Test #3 Option #1):

    encode age, gen(age_new)
    replace age_new=180 if age_new==1
    …
    replace age_new=900 if age_new==73
    replace age_new=18 if age_new==180
    …
    replace age_new=90 if age_new==900

• Time=25.8 sec (NSQIP 2011; n=442,149)

• Always need to do “tab age_new, nolabel” because the value labels created by encode no longer match the recoded values

Test #4: Destringing Numeric Values (e.g., NSQIP age)

• Option 2:

    destring age, gen(age_new1) ignore("+")

• Time=6.3 sec (NSQIP 2011; n=442,149); four times faster!

• Caution: make sure it is clear that 89=89+

Test #4a: Removing Characters from ID Numbers (e.g., XXX-XX-XXXX)

• Option 1:

    destring SSN, ignore("-") gen(newSSN1)

• Time=33.0 sec

Test #4a: Removing Characters from ID Numbers (e.g., XXX-XX-XXXX)

• Option 2:

    gen long newSSN2=real(subinstr(SSN,"-","",.))

• Time=1.7 sec; almost 20 times faster!

• Only useful if there are a few characters to get rid of.
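A minimal sketch of Option 2 on made-up ID strings (the values are invented for illustration). It also shows two behaviors worth checking before relying on this approach: real() returns missing for anything that is not a valid number once the hyphens are removed, and leading zeros are lost when the ID is stored as a number.

```stata
* Toy data: hypothetical ID strings
clear
input str11 SSN
"123-45-6789"
"001-23-4567"
"unknown"
end

* Strip every "-" (the final . means all occurrences), then convert to numeric
gen long newSSN2 = real(subinstr(SSN,"-","",.))

list, noobs
assert newSSN2 == 123456789 in 1
assert newSSN2 == 1234567 in 2       // leading zeros are not preserved
assert missing(newSSN2) in 3         // non-numeric text becomes missing
```

A nine-digit ID fits in a long; longer identifiers would need double storage or should stay as strings.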

Future Tests

• Confirm results for 10 years of NIS (about 80 million observations, nearly 50 GB RAM)

• Other Stata commands where there are multiple ways to do the same thing…any ideas?

• Other programming practices found reviewing code written by colleagues and students

Implications

• With 10 years of NIS, could save…
    – 3 minutes per ICD-9 recode
    – 1 minute per continuous variable categorization
    – 6 seconds per variable reorder
    – A lot more if you used recode

• It all adds up!

• Might make it less onerous to run recoding and cleaning programs more often instead of saving new copies of the dataset

• Easier to read programs
