Research Methods Lecture 3: More STATA. Ian Walker, Room S2.109, i.walker@warwick.ac.uk, 02475 23054. Slides available at: http://www2.warwick.ac.uk/fac/soc/economics/pg/modules/rm/notes/iw_lectures/


Research Methods
Lecture 3

More STATA

Ian Walker
Room S2.109

i.walker@warwick.ac.uk 02475 23054

Slides available at:

http://www2.warwick.ac.uk/fac/soc/economics/pg/modules/rm/notes/iw_lectures/


Housekeeping announcement


Stat-Transfer
• Use STAT-TRANSFER to convert data.
• Click on the Stat-Transfer shortcut (Stat Tran 6.lnk)
• Stat-transfer is “point and click”.
• Just tell it the file name and format
• and the format you want it in.
• Click “transfer”.


Stat Transfer options
• Useful options for creating a manageable dataset from a large one:
  – Keep or drop variables
  – Change variable format
    • E.g. float to integer
  – Select observations
    • E.g. “where (income + benefits)/famsize < 4500”
• Can be used for reading a large STATA dataset and writing a smaller one
• Avoids doing this in STATA itself


Customising STATA
• profile.do runs automatically when STATA starts
• Edit it to include commands you want to invoke every time

. set mem 200m
. log using justincase.log, replace

• Define preferences for STATA’s look and feel
  – Click on Prefs in menu
    • Colours, graph scheme, etc.
    • Save window positioning


Merging data - 1
• file1 has id x1 x2 x3, file2 has id x4 x5 x6.
• You can merge using a “key” variable that is in BOTH files (id)
• But you need to sort both files by the key first.

. use file1
. sort id                  sorts file1 according to id variable
. save, replace
. use file2
. sort id                  sorts file2 according to id variable
. merge id using file1     matches rows on id (without the key, rows are matched by position)
. drop if _merge~=3        drops obs not found in both files
. save file3


Merging data - 2

• For each row (id), all vars in file1 are added to the corresponding row of file2 (if there is one).
• merge creates a new variable, _merge
  – which =1 for obs only in the data in memory (file2 above), =2 for those only in the using file (file1 above), and =3 for those in both.
• So the syntax above drops those obs that don’t have data in both files
  – and saves the result containing x1-x6 in file3
• Use append to add more obs on the same vars.
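The slides use the older merge syntax, which needs both files sorted by hand. As a hedged sketch of the same steps in STATA 11 or later, where `merge 1:1` sorts internally and checks that id is unique in both files, something like the following should work. The data values and the temporary file are made up for illustration; only the variable names follow the slides.

```stata
* Illustrative data only: file1 is built here as a temporary file
clear
input id x1 x2 x3
1 10 11 12
2 20 21 22
3 30 31 32
end
tempfile file1
save `file1'

* file2 (the "master" data, held in memory)
clear
input id x4 x5 x6
2 1 2 3
3 4 5 6
4 7 8 9
end

* 1:1 asserts that id uniquely identifies rows in both files
merge 1:1 id using `file1'
tabulate _merge            // 1 = master only, 2 = using only, 3 = matched
keep if _merge == 3        // keep ids present in both files
drop _merge
list
```

Run it from a do-file in one go, since the temporary file name lives in a local macro.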


Collapsing data (use with care)

• collapse converts the data in memory into a dataset of means (or sums, medians, etc.)
• This is useful when you want to provide summary info at a higher level of aggregation
  – For example, suppose a dataset contains data on individuals, say their region and whether unemployed (u/e)
  – To find the average u/e rate in each region:

. collapse unemp, by(region)

leaves 1 obs for each region, holding the mean u/e rate.
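Because collapse overwrites the data in memory, one cautious pattern is to wrap it in preserve and restore. The sketch below assumes a made-up individual-level dataset with a region code and a 0/1 unemployment dummy; the new variable names urate and n are hypothetical.

```stata
* Hypothetical individual-level data: region code and unemployment dummy
clear
input region unemp
1 0
1 1
1 0
1 1
2 0
2 0
2 0
2 1
end

preserve                   // keep a copy of the individual-level data
collapse (mean) urate = unemp (count) n = unemp, by(region)
list                       // one row per region: mean u/e rate and cell size
restore                    // bring the individual-level data back
```

If the collapsed data are what you want to keep, save them to a new file before the restore.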


Reshaping files
• Data may be “long” but thin
  – Eg each record is a household member
  – But there are few vars - say wage and hours
• Data may be “wide” but short
  – each record is a household and has lots of vars
  – (eg w1 w2 w3 hours1 hours2 hours3)

. reshape long inc ue, i(id) j(year)      wide to long
. reshape wide inc ue, i(id) j(year)      back to wide

• Handy for merging data together and for panel data
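To make the reshape commands concrete, here is a small sketch with invented data. It assumes the wide file holds inc and ue for two years as inc2001, inc2002, ue2001 and ue2002, so that inc and ue are the stubs and the year suffix becomes the new variable named in j().

```stata
* Invented wide data: one row per id, one column per variable-year
clear
input id inc2001 inc2002 ue2001 ue2002
1 100 110 0 0
2 200 190 0 1
end

reshape long inc ue, i(id) j(year)    // now one row per id-year
list, sepby(id)

reshape wide inc ue, i(id) j(year)    // back to one row per id
list
```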


Syntax to remember
>=   means “greater or equal”
&    means “and”
|    means “or”
=    means “set equal to”
==   means “is it equal to?”
~=   means “not equal” (or use !=)
.    means missing value
For example

. keep if x1>= 1 & x1<=3 | x1==7 & x1 ~= .

. gen x = log(y)

. reg y x if z == 1 & y != .
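Two details of these operators are easy to trip over: & is evaluated before |, and a missing value counts as larger than any number. The following sketch, on a made-up variable x1, checks both; the names keep_a and keep_b are just for illustration.

```stata
* Made-up data including a missing value
clear
input x1
0
2
7
.
end

* & binds more tightly than |, so these two conditions are the same
gen byte keep_a = x1 >= 1 & x1 <= 3 | x1 == 7 & x1 != .
gen byte keep_b = (x1 >= 1 & x1 <= 3) | (x1 == 7 & x1 != .)
assert keep_a == keep_b

* Missing sorts above every number, so x1 >= 1 is true for "."
count if x1 >= 1
count if x1 >= 1 & !missing(x1)
```

Adding the brackets anyway costs nothing and makes the intended grouping obvious to the next reader.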


Using STATA as a calculator
• display command
  – . dis 22/7
  – . disp log(250)
  – . di exp(3.6)
  – . di chiprob(2,6.45)     (i.e. 2 df, deviance 6.45; chi2tail(2,6.45) in current STATA)
    • returns 0.0398 (i.e. it’s significant at the 5% level)
  – . display _N
    • Returns the sample size (_N is the number of the last obs)
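As a sketch of the same calculator commands in current STATA, assuming the auto dataset that ships with STATA for the _N example. For 2 degrees of freedom the upper-tail chi-squared probability equals exp(-x/2), which gives a quick check on the result.

```stata
display 22/7
display ln(250)                 // log() and ln() are both the natural log
display exp(3.6)
display chi2tail(2, 6.45)       // upper-tail chi-squared p-value, 2 df
display exp(-6.45/2)            // same number: closed form for 2 df

sysuse auto, clear
display _N                      // number of observations in memory
```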


Using the data editor
• Open a datafile (eg auto.dta)
• Click on the icon
• Or type . edit
• You can edit datapoints!
• Or just browse the datafile


STATA’s editor
• STATA has an editor that allows you to create do files
  – Enter cmds, 1 per line
  – Save the commands in a “do” file
  – Highlight commands and click the button with page (or page with text) and down arrow to “run” (or “do”) commands.


Saving output
• Output scrolls off the screen, so it is best to open a “log file” (and close it when done).
• Click on file, log, begin.
• Or type

. log using myoutput
  Then type some commands here and then
. log close

• log command allows replace and append
• Default is a .smcl file extension (open it with view)
• It doesn’t save graphs
  – Copy graphs (use cut and paste or use menus)


Saving commands
• You might prefer your own extension, say, .log
  – then you get an ASCII file that anything can edit
  – you can translate files to and from smcl format
    • click on file, log, translate and fill in the dialog box
• Logging your output is a good way of developing a .do file
  – since it saves the commands as well as output
• Or you can just log the commands
  – type . cmdlog using xxx
• You can turn logging off and back on
  – . log off then . log on when ready to resume
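Put together, a logging session might look like the sketch below. The file name myoutput.log is a placeholder, the text option asks for a plain-text log rather than .smcl, and the auto dataset that ships with STATA stands in for real data.

```stata
capture log close                        // close any log left open earlier
log using myoutput.log, text replace     // plain-text log, overwrite if present

sysuse auto, clear
summarize price mpg                      // this output is logged

log off                                  // pause logging
describe                                 // not written to the log
log on                                   // resume logging

log close
```

The capture prefix on the first line simply suppresses the error STATA would otherwise raise when no log is open.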


Useful tips for .do files

. #delimit ;               /* makes ; the end of line char */
. use mydata, clear ;
. set more off ;
. set mem 200m ;
. set matsize 200 ;
. log using xxxx.log, replace ;
. ...
. log close ;
. exit, clear ;


Handling string variables
encode
• Use encode when the original var is a character var (eg gender is "m" or "f")
• encode command does not produce dummy variables, it just assigns numbers to each group defined by the character variable.
• In this example, gender was the original character var and sex is the new numeric var:

. encode gender, gen(sex)

• decode does the opposite
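A small sketch of encode and decode on invented data. encode numbers the groups in alphabetical order of the strings and attaches the strings as value labels, so a 0/1 dummy still has to be created separately; the names female and gender2 are hypothetical.

```stata
* Invented string data
clear
input str1 gender
"m"
"f"
"f"
"m"
end

encode gender, gen(sex)          // numeric codes, alphabetical: f = 1, m = 2
tabulate sex                     // shows the labels
tabulate sex, nolabel            // shows the underlying numbers

gen byte female = (sex == 1)     // a dummy is a separate step
decode sex, gen(gender2)         // back to a string variable
```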


Extended generate (.egen)

egen
• Useful when you need a new variable that is the mean, median, etc. of another variable
  – for all observations or for groups of observations.
• Also useful when you need to simply number groups of observations based on some classification variables.

• Great when you have panel data


.egen examples

. egen sumvar1 = sum(var1)
  creates sumvar1 as the sum of all values of var1 (the function is called total() in newer STATA)

. egen meanvar1 = mean(var1), by(var3)
  creates meanvar1 as the mean of var1 within each group of var3

. egen counter = count(id), by(company)
  creates counter as the number of obs with nonmissing id within each company

. egen groupid = group(month year)
  assigns a number to each month/year group
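The same four egen functions can be tried on the auto dataset that ships with STATA. This is only a sketch; the new variable names are hypothetical, and total() is used in place of the older sum().

```stata
sysuse auto, clear

egen total_price = total(price)                // one number, repeated on every row
egen mean_price  = mean(price), by(foreign)    // group mean, repeated within group
egen n_rep       = count(rep78), by(foreign)   // nonmissing rep78 in each group
egen grp         = group(foreign rep78)        // 1, 2, 3, ... per combination

list foreign rep78 mean_price n_rep grp in 1/5
```

Note that group() leaves grp missing wherever one of the classifying variables is missing.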


Saving typing - 1
macro
• For defining lists of vars (globally or locally).

. local macvar x1 x2 x3 x4 x5 x6
. reg y1 `macvar'
. reg y2 `macvar'
. reg y3 `macvar'
. reg y4 `macvar'

• macvar becomes the string "x1 x2 x3 x4 x5 x6"
• Pay careful attention to the different types of quotation marks: a left quote (`) before the macro name and an apostrophe (') after it
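A hedged sketch of local and global macros using the auto dataset that ships with STATA. The macro names controls and controls2 are invented, and the foreach loop is one way to avoid retyping the regression for each outcome.

```stata
sysuse auto, clear

* A local macro exists only while this do-file (or program) runs
local controls weight length foreign
foreach y in price mpg {
    regress `y' `controls'
}

* A global macro persists for the session and is referenced with $
global controls2 weight length
regress price $controls2
```

Locals are usually the safer choice, since globals can silently clash with macros set elsewhere in the session.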


Saving typing - 2
for
• Performs same command on several vars.
• It can use several types of variable lists

. for var1-var25: replace @=. if @==99
  replaces 99 by missings

. for var*: replace @=. if @==99
  replaces 99 by missings

. for 1-3, ltype(numeric): gen q@=0
  creates q1=0, q2=0, q3=0

. for a b c, ltype(any): gen str2 @="x"
  creates a=x, b=x, c=x
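The for syntax above comes from early STATA releases; current versions do this sort of looping with foreach and forvalues. As a sketch of the same four tasks on made-up data (three variables instead of twenty-five):

```stata
* Made-up data with 99 used as a missing-value code
clear
input var1 var2 var3
1 99 3
99 5 6
end

* Replace 99 by missing in each variable
foreach v of varlist var1-var3 {
    replace `v' = . if `v' == 99
}
* mvdecode does the same recoding in one line: mvdecode var1-var3, mv(99)

* Create q1, q2, q3, all equal to 0
forvalues i = 1/3 {
    gen q`i' = 0
}

* Create string variables a, b, c, all equal to "x"
foreach s in a b c {
    gen str2 `s' = "x"
}
```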


Regression models - I

• Linear regression and related models when the outcome variable is continuous
  – OLS, 2SLS, 3SLS, IV, quantile reg, Box-Cox …
• Binary outcome data
  – the outcome variable is 0 or 1 (or y/n)
    • probit, logit, nested logit...
• Multiple outcome data
  – the outcome variable is 1, 2, ...,
    • conditional logit, ordered probit


Regression models - II
• Count data
  – the outcome variable is 0, 1, 2, ..., occurrences
    • Poisson regression, negative binomial
• Choice models
  – multinomial choice
  – A, B or C
    • Multinomial logit, Random utility model, unordered probit, nested logit, ...etc
• Selection models
  – Truncated, censored
    • Tobit, Heckman selection models;
    • linear regression or probit with selection


Regression models - III
• STATA supports several special data types.
• Once type is defined special commands work
• Time series
  – Estimate ARIMA, and ARCH models
  – Estimators for autocorrelation and heteroscedasticity
  – Estimate MA and other smoothers
  – Tests for auto, het, unit roots - h, d, LM, Q, ADF, P-P …..
  – TS graphs


Special data types: survey

• With non-random sampling, unweighted OLS can give misleading estimates and standard errors

• STATA can handle non-random survey data
  – see the “svy***” commands
  – Example (stratified sample of medical cases):

. webuse nhanes2f, clear

. svyset psuid [pweight=finalwgt], strata(stratid)

. svy: reg zinc age age2 weight female black orace rural

. reg zinc age age2 weight female black orace rural


Number of strata   =        31                 Number of obs      =       9189
Number of PSUs     =        62                 Population size    =  1.042e+08
                                               Design df          =         31
                                               F(   7,     25)    =      62.50
                                               Prob > F           =     0.0000
                                               R-squared          =     0.0698
------------------------------------------------------------------------------
             |             Linearized
        zinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.1701161   .0844192    -2.02   0.053    -.3422901     .002058
        age2 |   .0008744   .0008655     1.01   0.320    -.0008907    .0026396
      weight |   .0535225   .0139115     3.85   0.001     .0251499    .0818951
      female |  -6.134161   .4403625   -13.93   0.000    -7.032286   -5.236035
       black |  -2.881813   1.075958    -2.68   0.012    -5.076244    -.687381
       orace |  -4.118051   1.621121    -2.54   0.016    -7.424349   -.8117528
       rural |  -.5386327   .6171836    -0.87   0.390    -1.797387    .7201216
       _cons |   92.47495   2.228263    41.50   0.000     87.93038    97.01952
------------------------------------------------------------------------------

. regress zinc age age2 weight female black orace rural

      Source |       SS       df       MS              Number of obs =    9189
-------------+------------------------------           F(  7,  9181) =   79.72
       Model |  110417.827     7  15773.9753           Prob > F      =  0.0000
    Residual |   1816535.3  9181   197.85811           R-squared     =  0.0573
-------------+------------------------------           Adj R-squared =  0.0566
       Total |  1926953.13  9188  209.724982           Root MSE      =  14.066
------------------------------------------------------------------------------
        zinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   -.090298   .0638452    -1.41   0.157    -.2154488    .0348528
        age2 |  -.0000324   .0006788    -0.05   0.962    -.0013631    .0012983
      weight |   .0606481   .0105986     5.72   0.000     .0398725    .0814237
      female |  -5.021949   .3194705   -15.72   0.000    -5.648182   -4.395716
       black |  -2.311753   .5073536    -4.56   0.000    -3.306279   -1.317227
       orace |  -3.390879   1.060981    -3.20   0.001    -5.470637   -1.311121
       rural |  -.0966462   .3098948    -0.31   0.755    -.7041089    .5108166
       _cons |   89.49465   1.477528    60.57   0.000     86.59836    92.39093


Special data types: duration

• Survival time data
  – See the “st***” commands

. stset failtime      /* sets the var that defines duration */

• Estimates a wide variety of models to explain duration


Weibull example ….
• ST regression supports Weibull, Cox PH and other options

. streg load bearings, distribution(weibull)

• After streg you can plot the estimated cumulative hazard with

. stcurve, cumhaz

  (use stcurve, hazard for the hazard itself)
• STATA allows functions to be plotted by specifying the function:
  – E.g. Weibull “hazard” model –
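One way to plot a function directly is twoway function. The sketch below draws the Weibull hazard h(t) = p·λ·t^(p-1) for three shape values; λ = 1 and the values of p are chosen purely for illustration and are not estimates from the streg model above. The graph name weibull_hazard is also just a label.

```stata
* Weibull hazard h(t) = p*lambda*t^(p-1), with lambda = 1 (illustrative values)
* The range starts at 0.1 because the p = 0.5 hazard is unbounded at t = 0
twoway (function y = 0.5*x^(0.5-1), range(0.1 5))  ///
       (function y = 1.0*x^(1.0-1), range(0.1 5))  ///
       (function y = 2.0*x^(2.0-1), range(0.1 5)), ///
       legend(order(1 "p = 0.5" 2 "p = 1" 3 "p = 2")) ///
       xtitle("t") ytitle("hazard h(t)") ///
       name(weibull_hazard, replace)
```

The three curves show the familiar pattern: a falling hazard for p below 1, a constant hazard at p = 1, and a rising hazard for p above 1.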


Special data types: Panel data

• STATA can handle “panel” data easily
  – see the “xt***” commands
• Common commands are

. xtdes       Describe pattern of xt data
. xtsum       Summarize xt data
. xttab       Tabulate xt data
. xtline      Line plots with xt data
. xtreg       Fixed and random effects


Panel data
• An xt dataset looks like this:

  pid   yr_visit    fev   age   sex   height   smokes
----------------------------------------------------------
 1071     1991     1.21    25     1       69        0
 1071     1992     1.52    26     1       69        0
 1071     1993     1.32    28     1       68        0
 1072     1991     1.33    18     1       71        1
 1072     1992     1.18    20     1       71        1
 1072     1993     1.19    21     1       71        0

• xt*** cmds need vars that identify the person and the “wave”:

. iis pid
. tis yr_visit

• Or use the tsset command

. tsset pid yr_visit, yearly


Panel regression

• Once STATA has been told how to read the data it can perform regressions quite quickly:

. xtreg y x, fe

. xtreg y x, re
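To see the two estimators side by side, the sketch below simulates a small balanced panel in which x is correlated with the person effect u, so the true slope of 2 should be recovered by fixed effects but not by random effects. Everything here (the seed, sample sizes, variable names) is invented for illustration, and xtset is the STATA 10 and later way of declaring the panel structure.

```stata
* Simulated panel: 50 people observed for 4 years each
clear
set seed 12345
set obs 50
gen pid = _n
gen u = rnormal()                    // unobserved person effect
expand 4                             // 4 rows per person
bysort pid: gen year = 2000 + _n
gen x = rnormal() + 0.5*u            // x is correlated with u
gen y = 1 + 2*x + u + rnormal()      // true slope on x is 2

xtset pid year                       // declare person and wave
xtdescribe
xtreg y x, re                        // random effects: assumes x uncorrelated with u
xtreg y x, fe                        // fixed effects: uses within-person variation only
```

Comparing the two slope estimates gives a feel for what a Hausman-type comparison is picking up.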


Further advice
• See Stephen Jenkins’ excellent course on duration modelling in STATA
• Steve Pudney’s excellent panel course
  – Beware his example dataset is 30mb+
• To get up and running
  – Just have a go - you won’t break it!
  – Try some of the commands in this lecture
• To start to get proficient
  – Sign up for netcourses