STATA Boot Camp Day 3: Advanced Data Manipulation + Summaries

Nate Mercaldo

Sarah Fletcher Mercaldo

Andrew Wiese

Objectives

• Advanced Data Management/Manipulation
– missing data
– operators
– a few cool functions to create new variables

• Multivariate Statistical Summaries
– generate multi-way tables (counts, means, other summaries)
– generate multivariate figures (+ modifying figure aesthetics)

• Reproducible Research (do files)

Setup

• Start log file!

• Load birth weight data (birthweight_v2.dta or birthweight_v2_11.dta – if not using Stata 14)

• Variables
– bwt, low = child’s birth weight, indicator of low birth weight
– age, race, smoke, height, weight = mother’s age, race, smoking status, height and weight

Exercise 1

• Generate bmi = weight (lb) / height (in)^2 * 703

• Summarize bmi
– min, max, mean (sd), median (IQR)

• Create a histogram of bmi
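One possible way to work through this exercise is sketched below. It is not the instructors' solution: the five-row toy dataset is made up so the block runs on its own, and only the variable names `height`, `weight`, and `bmi` come from the slides. With the real data, `use birthweight_v2.dta, clear` would replace the `input` block.

```stata
* Toy stand-in for the birth weight file (values are invented)
clear
input height weight
64 130
66 150
62 110
68 182
65 141
end

* BMI from pounds and inches: weight / height^2 * 703
generate bmi = weight / height^2 * 703

* min, max, mean (sd), median (IQR) in one compact table
tabstat bmi, statistics(min max mean sd median iqr)

* Histogram of bmi with counts on the y-axis
histogram bmi, frequency
```

`tabstat` is used here because it reports the IQR directly; `summarize bmi, detail` would give the quartiles instead.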

Exercise 1

    variable |       min       max      mean        sd       p50       iqr
-------------+------------------------------------------------------------
         bmi |  .0007453  1716.137  106.1197  360.3435  22.39452  3.201514
--------------------------------------------------------------------------

Huh?

• Anything odd about the generated summaries?

• Typically we receive data sets that have NOT been “cleaned”

. summarize height weight bmi

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      height |        189    590.7354    2229.642         61       9999
      weight |        189    656.9788    2213.965        106       9999
         bmi |        189    106.1197    360.3435   .0007453   1716.137

Missing Values

• Stata codes missing values as a .

• Researchers code missing values as all sorts of things (e.g., -99, -9, 9, 99, NA, ?)

• Guesses on how missing values were coded in the lbw file?

• How can we replace these values with .?

Aside: Operators

• Arithmetic: + (addition), - (subtraction), * (multiplication), / (division), ^ (power)

• Relational: > (greater than), < (less than), >= (greater than or equal), <= (less than or equal), == (equal), != (not equal)

• Logical: & (and), | (or; the pipe character, above the Enter key), and ! (not)
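A few throwaway `display` lines can make the operators concrete. The numbers are arbitrary; the point is that Stata evaluates a true expression to 1 and a false one to 0.

```stata
* Arithmetic: multiplication binds tighter than addition
display 7 + 2 * 3
* -> 13

* Power
display 2^3
* -> 8

* Relational: true is 1, false is 0
display 5 >= 3
* -> 1
display 5 == 3
* -> 0

* Logical: and, or, not
display (5 > 3) & (2 > 4)
* -> 0
display (5 > 3) | (2 > 4)
* -> 1
display !(5 > 3)
* -> 0
```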

Missing Values

• How can we replace these values with .?

• Use relational operators in logical expressions (an if qualifier)!

generate height2 = height if height < 9999

generate weight2 = weight if weight < 9999

replace height = . if height == 9999

replace weight = . if weight == 9999
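As an alternative sketch, Stata's `mvdecode` command recodes a placeholder to missing for several variables at once. The three-row dataset below is invented for illustration. The second half shows a common trap: Stata treats missing as larger than any number, so a `>` comparison silently includes missing values unless they are excluded.

```stata
* Toy data with 9999 standing in for "missing" (values are invented)
clear
input height weight
64 130
9999 150
62 9999
end

* Recode 9999 to Stata's system missing value (.) in both variables
mvdecode height weight, mv(9999)

* Missing sorts above every number, so this counts the missing height
count if height > 65
* -> 1

* Guard with !missing() to exclude it
count if height > 65 & !missing(height)
* -> 0
```

This is also why the slide's `if height < 9999` works for the recode: a `<` comparison can never pick up a missing value.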

Missing Values

• Recompute bmi and summarize

    variable |       min       max      mean        sd       p50       iqr
-------------+------------------------------------------------------------
         bmi |  16.87565  28.52808  22.50763  2.037052  22.39452  2.842493
--------------------------------------------------------------------------

Other uses of relational / logical operators

• Restrict data to a specific group
– list if age <= 15 | age >= 45 (list “at risk” mothers)
– drop if smoke == 1 (remove smokers, BEWARE!)

• Generating new variables
– See Exercise 2

• Compute summaries for a specific group
– summarize bmi if smoke==0

• Used A LOT when creating custom tables/figures

Exercise 2

• Categorize BMI: generate a new variable called overweight
– overweight: 25 <= bmi <= 30 (note – max bmi = 29 in this dataset)

• Summarize birth weight by overweight variable
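A hedged sketch of one way to build the indicator follows; the toy values are invented and this is not necessarily the instructors' solution. Note that the mathematical shorthand 25 <= bmi <= 30 cannot be typed into Stata as is: Stata would read it as (25 <= bmi) <= 30, which is true for every observation.

```stata
* Toy data, including one missing bmi (values are invented)
clear
input bwt bmi
2500 22.1
3100 26.4
2900 .
3300 28.0
2700 19.5
end

* Spell out both bounds; the if qualifier keeps overweight missing
* wherever bmi is missing instead of setting it to 0
generate overweight = (bmi >= 25 & bmi <= 30) if !missing(bmi)

* Equivalent built-in for a closed interval:
* generate overweight = inrange(bmi, 25, 30) if !missing(bmi)

* Birth weight summaries by overweight status
tabstat bwt, statistics(min max mean sd median iqr) by(overweight)
```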

overweight |       min       max      mean        sd       p50       iqr
-----------+------------------------------------------------------------
         0 |       709      4990  2964.701  698.1659      2977    1035.5
         1 |      1330      3941    2903.8  882.4416      2977      1723
-----------+------------------------------------------------------------
     Total |       709      4990  2959.598  712.6656      2977      1063
------------------------------------------------------------------------

Functions to Generate New Variables

• Data > Create or change data > Create new variable (extended)

• Categorize continuous variables
egen bmicat = cut(bmi), at(0,18.5,25,30,100) icodes

• Group variables
egen racesmoke = group(race smoke)

• Create indicator/dummy variables
quietly tabulate bmicat, generate(bmicat_)

table bmicat
table racesmoke race smoke
table bmi bmicat
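To see what these three functions produce, the sketch below runs them on a five-row toy dataset (values invented) and lists the results. The commands themselves are the ones from the slide.

```stata
* Toy data (values are invented)
clear
input bmi race smoke
17.9 1 0
22.4 1 1
26.3 2 0
28.5 3 1
24.9 3 0
end

* cut(): bin bmi at the breakpoints; icodes stores 0, 1, 2, ...
* Intervals are closed on the left, e.g. 18.5 <= bmi < 25 gets code 1
egen bmicat = cut(bmi), at(0,18.5,25,30,100) icodes

* group(): one integer code per observed race/smoke combination
egen racesmoke = group(race smoke)

* tabulate, generate(): one 0/1 indicator per observed bmicat level
quietly tabulate bmicat, generate(bmicat_)

list bmi bmicat bmicat_* race smoke racesmoke
```

In this toy data only three BMI categories occur, so `tabulate` creates `bmicat_1` to `bmicat_3`; `group()` likewise numbers only the combinations that are actually present.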

Multivariate Summaries

• Day 2 – we looked at a lot of univariate (or marginal) summaries

• Generally we are more interested in multivariate summaries, say identifying factors associated with low birth weight infants.

• Using operators to compute summaries (by hand) can be tedious – it would be helpful to have Stata do all the heavy lifting (e.g., the egen cut() function).

Multivariate Tabular Summaries

• Possible factors associated with low birth weight infants
– age, smoke, bmi (bmicat)

• How can we summarize these variables by low?
– Continuous: age, bmi [range, mean, sd, quantiles]
– Categorical: smoking status and bmicat (frequencies/proportions)

• Statistics > Summaries, tables, and tests >
– Summary and descriptive statistics
– Other tables > Compact table of summary statistics

Multivariate Tabular Summaries

• Compact table of summary stats, Options (wide table)

tabstat age smoke bmi, statistics(mean sd median iqr) by(low) longstub

low       stats |       age     smoke       bmi
----------------+------------------------------
0          mean |  23.66154  .3384615  22.49244
             sd |  5.584522  .4750169  2.044822
            p50 |        23         0  22.29633
            iqr |         9         1  2.746094
----------------+------------------------------
1          mean |  22.30508  .5084746  22.54381
             sd |  4.511496  .5042195  2.038627
            p50 |        22         1  22.48364
            iqr |         6         1  2.973585
----------------+------------------------------

Multivariate Tabular Summaries

• Suppose we are interested in testing whether there is an association between smoking and low birth weight
– Statistics > Summaries, tables and tests > Frequency tables > Two-way tables with measures of association

. tabulate smoke low, chi2

           |          low
     smoke |         0          1 |     Total
-----------+----------------------+----------
         0 |        86         29 |       115
         1 |        44         30 |        74
-----------+----------------------+----------
     Total |       130         59 |       189

          Pearson chi2(1) =   4.9237   Pr = 0.026

– Statistics > Summaries, tables and tests > Classical tests of hypotheses

Exercise 3

• Compute the associations (tables and χ2) between smoking and low birth weight by race (hint: command from Day 1?)

-> race = 1

           |          low
     smoke |         0          1 |     Total
-----------+----------------------+----------
         0 |        40          4 |        44
         1 |        33         19 |        52
-----------+----------------------+----------
     Total |        73         23 |        96

          Pearson chi2(1) =   9.8556   Pr = 0.002
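The "-> race = 1" header is what Stata prints when a command is run under the `by` prefix, so one plausible reading of the hint is `bysort`. That is an assumption, since the Day 1 material is not shown here. The sketch below uses simulated data (not the birth weight file) purely so that it runs on its own.

```stata
* Simulated stand-in data; the probabilities are arbitrary
clear
set seed 12345
set obs 120
generate race  = 1 + mod(_n, 3)
generate smoke = runiform() < 0.4
generate low   = runiform() < 0.3

* One smoke-by-low table and chi-squared test per race group
bysort race: tabulate smoke low, chi2
```

`bysort` sorts by race first and then repeats the `tabulate` command within each group.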

Multivariate Graphics

www.ats.ucla.edu/Stat/stata/library/GraphExamples/default.htm

[Figure: box plot and scatterplot examples, each shown in default and customized form]

Examples

Stata can make lots of plots – but that does not mean you should!

http://www.surveydesign.com.au/tipsgraphs.html

Multivariate Plots

• Type of plot depends on the TYPES of variables
– Categorical/Categorical
  • Tables
– Categorical/Continuous
  • Box plots, histograms
– Continuous/Continuous
  • Scatter/bubble plots

Multivariate Plots: Categorical / Continuous

– Box Plots
  • Graphics > Box plot > main variable = continuous, Categories Tab > Group 1 = categorical
  • graph box bwt, over(smoke)

– Histograms
  • Graphics > Histogram > main variable = continuous, By Tab > categorical
  • histogram bwt, frequency bin(10) by(smoke)

Multivariate Plots: Continuous / Continuous

– Scatter plots (bubble plots are scatter plots with varying point sizes)
  • Graphics > Twoway graph > Create > Basic Plots > Scatter: Y variable = continuous, X variable = continuous
  • twoway scatter bwt age, sort

– Other add-ons: lowess smoothers
  • Graphics > Twoway graph > Create > Advanced Plots > Lowess Line: Y variable = continuous, X variable = continuous
  • twoway (scatter bwt age, sort) (lowess bwt age)

Exercise 4

• Summarize the birth weight by smoking status and race
– Create a boxplot of birth weight by smoking status
– Create a boxplot of birth weight by race
– Create a boxplot of birth weight by smoking status AND race

• Summarize maternal age and birth weight (as a group)
– Create a scatter plot of age by birth weight
– Add smoothers by smoking status (red: smoke=1, black: smoke=0)
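One way the requested plots could be produced is sketched below. The data are simulated (the coefficients are arbitrary, not estimates from the birth weight file) so that the block is self-contained; the graph commands themselves only combine options already shown on the earlier slides.

```stata
* Simulated stand-in data; effect sizes are invented
clear
set seed 2024
set obs 150
generate race  = 1 + mod(_n, 3)
generate smoke = runiform() < 0.4
generate age   = 15 + floor(30 * runiform())
generate bwt   = 3000 - 250 * smoke + 10 * (age - 25) + rnormal(0, 600)

* Box plots: by smoking status, by race, and by both
graph box bwt, over(smoke)
graph box bwt, over(race)
graph box bwt, over(smoke) over(race)

* Scatter plot with one lowess smoother per smoking group
* (/// continues a command across lines in a do file)
twoway (scatter bwt age) ///
       (lowess bwt age if smoke == 1, lcolor(red)) ///
       (lowess bwt age if smoke == 0, lcolor(black)), ///
       legend(order(2 "smoke = 1" 3 "smoke = 0"))
```

With two `over()` options, the first one listed (smoke) varies within each level of the second (race).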

Can we improve the aesthetics of these plots?

Improving Aesthetics of a Plot

Plots are composed of points/symbols, lines, text, labels, legends, …

Stata defaults are fine for preliminary analyses or for homework, but modifications are needed for publications (or to reflect personal style).

Examples follow of how to:
- Add/modify text: titles, x-/y-axes, legends, …
- Modify plotting symbols: color, size, symbol, …
- Modify plotting lines: color, width, type, …
- Modify colors: histograms, box plots, …

Modifying Aesthetics: Text

• Birth weight by maternal race and smoking status

Modifying Aesthetics: Text - CODE

• Birth weight by maternal race and smoking status

graph box bwt, over(racesmoke)

label define rslab 1 "White: NS" 2 "White: S" 3 "Black: NS" 4 "Black: S" 5 "Other: NS" 6 "Other: S"

label values racesmoke rslab

graph box bwt, over(racesmoke) ytitle("Birth Weight") title("Infant Birth Weight by Maternal Race and Smoking Status") subtitle("subtitle") caption("caption") note("note")

Modifying Aesthetics: Symbols

• Birth weight by maternal age and smoking status

http://www.stata.com/manuals13/g-3marker_options.pdf

Modifying Aesthetics: Symbols - CODE

• Birth weight by maternal age and smoking status

twoway scatter bwt age

twoway (scatter bwt age if smoke==0, mcolor(black) msize(small) msymbol(diamond)) (scatter bwt age if smoke==1, mcolor(red) msize(large)), legend(order(1 "Non-Smoker" 2 "Smoker"))

Note: it is NOT recommended to use all options simultaneously!

Modifying Aesthetics: Lines

• Birth weight by maternal age and smoking status

http://www.stata.com/manuals13/g-3line_options.pdf

Modifying Aesthetics: Lines - CODE

• Birth weight by maternal age and smoking status

twoway (scatter bwt age) (lowess bwt age)

twoway (scatter bwt age) (lowess bwt age if smoke==0, lcolor(black) lwidth(thin) lpattern(dash)) (lowess bwt age if smoke==1, lcolor(red) lwidth(thick))

Note: it is NOT recommended to use all options simultaneously!

(see .do file for code) http://www.stata.com/manuals13/g-3marker_options.pdf

Modifying Aesthetics: Colors

• Birth weight by maternal age and smoking status

Modifying Aesthetics: Colors - CODE

• Birth weight by maternal age and smoking status

histogram bwt

histogram bwt, bin(35) frequency fcolor(sandb) lcolor(lavender) lwidth(thick)

label define slab 1 "Smoker" 0 "Non-Smoker"

label values smoke slab

graph box bwt, over(smoke)

graph box bwt, over(smoke) box(1, fcolor(chocolate) lcolor(pink))

graph box bwt, over(smoke) scheme(s2mono)

(see .do file for code) http://www.stata.com/manuals13/g-3marker_options.pdf

Reproducible Research

• Do file

– What is a do file?
  File that contains all code (w/ comments)

– Benefits of a do file?
  Record of all data manipulations
  Record of everything you do to generate an analysis (summary, figure)

– How do do files differ from log files?

• What if I told you that height was in cm and not inches? How long would it take you to redo all the analysis from today?
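A minimal do-file sketch illustrates the point. The file names and the scalar `height_to_in` are hypothetical, and the three-row toy dataset stands in for the real file so the block runs on its own. If height turned out to be in centimeters, only one line would change before rerunning everything.

```stata
* day3_sketch.do (hypothetical file name)
capture log close
log using day3_sketch.log, replace text

* With the real data this would be: use birthweight_v2.dta, clear
clear
input height weight
64 130
66 150
9999 9999
end

* Single place to fix the units: set to 1/2.54 if height is in cm
scalar height_to_in = 1

* Clean the placeholder values, then derive and summarize bmi
mvdecode height weight, mv(9999)
generate bmi = weight / (height * height_to_in)^2 * 703
tabstat bmi, statistics(min max mean sd median iqr)

log close
```

The do file records the commands; the log file it opens records the commands together with their output.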