STATA Training Session 2

Sun Li, Centre for Academic Computing

[email protected]

Statistical Analysis in STATA

Outline

Resources And Books

Data Description And Simple Inference

Group Comparison And Correlation

General Linear Regression

Logistic Model

Binary Logistic Model

Ordinal Logistic Model

Multinomial Logistic Model

Resources And Books

CAC Computing Resources for STATA users

Windows:

STATA/SE version 10.0

10-user network perpetual license

Installation guide (http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA-Software Questions.aspx)

Linux CAC Beowulf Cluster:

STATA/SE version 10.0

Unlimited users

About CAC Beowulf Cluster:

(http://research2.smu.edu.sg/CAC/HPC/Wiki/MAIN.aspx)

New features in STATA 10.0 (http://www.stata.com/stata10)

Website resources:

The STATA website: http://www.stata.com

The STATA Journal (reviewed papers, regular columns, user-written software): http://www.stata-journal.com/

STATA FAQ: http://www.stata.com/support/faqs

STATA User Support: http://www.stata.com/support

Books: http://www.stata.com/bookstore/

CAC STATA support:

Website:

http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA.aspx

Contact:

For statistical consultation: Sun Li: [email protected]

For software installation: TAN SuhWen: [email protected]

Additional recommended readings:

Regression Models for Categorical Dependent Variables Using Stata, 2nd Edition, J. Scott Long and Jeremy Freese

Logistic Regression with Stata, Xiao Chen, Phil Ender, Michael Mitchell & Christine Wells, UCLA

Statistics with Stata (Updated for Version 9), Lawrence C. Hamilton

Data Analysis Using Stata, Ulrich Kohler and Frauke Kreuter

Download training slides, data and syntax:

http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/Training%20Slides%20and%20Syntax.aspx

Data Description & Simple Inference

Description of Data

Name: ibmff.dta

Variables:

Variable name    Variable information
permno           CRSP Permanent Number
date             Numeric date
ret              Holding Period Return
retx             Return without dividends
mktrf            Excess return on the market
smb              Small-minus-big return
hml              High-minus-low return
rf               Risk-free return rate
umd              Momentum factor


Convert to STATA date format

gen year=int(date/10000)

gen month=int((date-year*10000)/100)

gen day=date-year*10000-month*100

gen newdate=mdy(month, day, year)

format newdate %td

list date newdate year month day in 1
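As a quick check of the arithmetic above, the sketch below applies the same steps to a single made-up date, 19980331. The one-row dataset is an illustrative assumption and is not taken from ibmff.dta. The date is stored as long here because a float cannot hold every eight-digit integer exactly.

```stata
* Illustrative one-row dataset (assumed value, not from ibmff.dta)
clear
set obs 1
gen long date = 19980331                    // yyyymmdd stored as an integer

gen year  = int(date/10000)                 // 1998
gen month = int((date - year*10000)/100)    // 3
gen day   = date - year*10000 - month*100   // 31
gen newdate = mdy(month, day, year)         // Stata elapsed date
format newdate %td

list date newdate year month day in 1

* the pieces should match the digits of the original number
assert year == 1998 & month == 3 & day == 31
assert newdate == mdy(3, 31, 1998)
```

If date in your own file is stored as float, checking a few rows with list as above is a cheap way to confirm that no day was shifted by rounding.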


Distribution of Variables

pnorm ret

swilk ret mktrf smb hml rf

[Figure: standardized normal probability plot of ret. Vertical axis: Normal F[(ret-m)/s]; horizontal axis: Empirical P[i] = i/(N+1).]

pnorm: standardized normal probability plot.

swilk: Shapiro-Wilk normality test; the null hypothesis is that the data are normally distributed.


None of the listed variables appears to be normally distributed. Because ret is the variable of interest, we remove its skewness with the zero-skewness log transform lnskew0, then run swilk again.

lnskew0 lnret=ret

swilk lnret ret

For lnret, the p-value is greater than 0.05 (the conventional 5% significance level), so we do not reject the hypothesis that the data are normally distributed.
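The same transform-then-retest sequence can be tried on a dataset that ships with Stata. The sketch below assumes the built-in auto data and its variable mpg as a stand-in for ret; the names lnmpg and mpg are not part of the training data.

```stata
* Built-in example data stand in for ibmff.dta (assumption)
sysuse auto, clear

swilk mpg                       // normality test on the raw variable

lnskew0 lnmpg = mpg             // creates ln(mpg - k), with k chosen so that skewness is about zero
summarize lnmpg, detail
display "skewness after transform = " r(skewness)

swilk lnmpg                     // re-test normality on the transformed variable
display "Shapiro-Wilk p-value = " r(p)
```

Zero skewness does not guarantee normality, which is why the transformed variable is tested again rather than assumed to pass.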

Group Comparison & Correlation

Question: Does the average holding period return for 1998 differ significantly from the average return of all the other years?

Generate dummy variables

tab year, gen(dumyear)


tabstat lnret, stat(n mean sd p25 p50 p75) by(dumyear1)

graph box lnret, by(dumyear1) box(1, bfcolor(blue))

[Figure: box plots of lnret, labelled ln(ret+.7527907), by year == 1998 (0 and 1).]


sdtest lnret, by(dumyear1)

ttest lnret, by(dumyear1)

sdtest: tests the equality of variances.

ttest: performs one-sample and independent-samples t tests.
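The two commands are usually read together: if sdtest suggests the group variances differ, the unequal option of ttest drops the equal-variance assumption. A minimal sketch, assuming the built-in auto data with mpg and foreign standing in for lnret and dumyear1:

```stata
* Built-in example data stand in for ibmff.dta (assumption)
sysuse auto, clear

sdtest mpg, by(foreign)                 // H0: equal variances in the two groups
display "variance-ratio test, two-sided p = " r(p)

ttest mpg, by(foreign)                  // assumes equal variances
ttest mpg, by(foreign) unequal          // does not assume equal variances
```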


Question: Do the average holding period returns of the individual years differ significantly from each other? If so, which years differ?

oneway lnret year

oneway lnret year, tabulate bonferroni

tabulate: lists the average return for each year.

bonferroni: performs multiple comparisons between groups with Bonferroni-adjusted p-values.
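The adjustment behind the bonferroni option can be written out as follows. This is a sketch of the standard Bonferroni correction; the symbols k (number of groups), m (number of pairwise comparisons) and p_ij (unadjusted p-value for groups i and j) are introduced here for illustration.

```latex
m = \frac{k(k-1)}{2}, \qquad
p^{\mathrm{adj}}_{ij} = \min\left(1,\; m \, p_{ij}\right)
```

For example, with k = 4 groups there are m = 6 pairwise comparisons, so an unadjusted p-value of 0.02 becomes 0.12 after adjustment.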

To detect correlations between the holding period return and the other factors:

graph matrix ret mktrf smb hml rf, half

spearman ret mktrf smb hml rf, stats(rho p) print(.05) bonferroni


[Figure: scatterplot matrix (lower half) of ret, excess return on the market, small-minus-big return, high-minus-low return, and risk-free return rate (one-month Treasury bill rate).]


Exercise 1

1. Tabulate the average risk-free return rate by year.

2. Use help to look up the command ranksum (Mann-Whitney U test).

3. Test whether the average risk-free return rate in 2005 is significantly different from that in 2006 using the Mann-Whitney U test. (Hint: generate a dummy variable first.)

4. Use help to look up the command correlate (Pearson's correlation).

5. Identify the correlations between the factors of interest in 2006 using listwise and pairwise Pearson's correlation respectively.


General Form of Model

Data: ibmff.dta

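$$Y = X\beta + \varepsilon$$

$X$ is the $n \times (p+1)$ matrix of explanatory variables.

$Y$ is the $n \times 1$ vector of responses.

$$\hat{\beta} = (X'X)^{-1}X'Y$$

$\hat{\beta}$ is the vector of least squares estimates of the regression coefficients.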
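The least squares formula can be checked against regress directly in Mata. The sketch below assumes the built-in auto data, with price as the response and mpg and weight as explanatory variables; these names are illustrative and not part of ibmff.dta.

```stata
* Built-in example data stand in for ibmff.dta (assumption)
sysuse auto, clear
regress price mpg weight

mata:
y = st_data(., "price")                                // n x 1 vector of responses
X = st_data(., ("mpg", "weight")), J(rows(y), 1, 1)    // n x (p+1) matrix, last column is the constant
b = invsym(cross(X, X)) * cross(X, y)                  // (X'X)^(-1) X'y
b'                                                     // print as a row: mpg, weight, constant
end

matrix list e(b)                                       // coefficients reported by regress, same order
```

The two printed rows should agree up to rounding, since regress solves the same least squares problem.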


Step 1: Examine data

graph matrix ret mktrf smb hml rf umd

[Figure: scatterplot matrix of ret, excess return on the market, small-minus-big return, high-minus-low return, risk-free return rate (one-month Treasury bill rate), and momentum factor.]


Step 2: Perform Linear Regression

regress ret mktrf smb hml rf umd year

regress: to perform linear regression


sw regress ret mktrf smb hml rf umd year, pe(0.05)

sw: to perform stepwise regression.

pe(0.05): specifies the significance level of the F test for adding a term to the model; terms with a p-value less than 0.05 are included.
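Stepwise selection can also be written with the stepwise prefix, which makes the direction of the search explicit. A minimal sketch, assuming the built-in auto data; the variable list is illustrative only.

```stata
* Built-in example data stand in for ibmff.dta (assumption)
sysuse auto, clear

* forward selection: a term enters when its p-value is below 0.05
stepwise, pe(0.05): regress price mpg weight length headroom

* backward elimination: a term is removed when its p-value is 0.10 or above
stepwise, pr(0.10): regress price mpg weight length headroom
```

The two searches need not end at the same model, so it is worth comparing the selected terms before settling on one.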


Step 3: Post-estimation Statistics

vif //variance inflation factors

rvfplot //plot residuals against fitted values

predict fit //store fitted values

predict sdres, rstandard //store standardized residuals

pnorm sdres //normal probability plot of standardized residuals

twoway scatter sdres fit //plot standardized residuals against fitted values

predict cook, cooksd //store Cook's distance statistics

list year ret cook if cook>4/108 //list observations whose Cook's distance exceeds the suggested cut-off point (4/n)
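The cut-off 4/n does not have to be typed by hand: after regress, the sample size is available as e(N). A small sketch, assuming the built-in auto data in place of ibmff.dta:

```stata
* Built-in example data stand in for ibmff.dta (assumption)
sysuse auto, clear
regress price mpg weight

predict cook, cooksd                                      // Cook's distance for each observation

* 4/e(N) is the 4/n cut-off using the estimation sample size;
* !missing() keeps observations with no Cook's distance out of the list
list make price cook if cook > 4/e(N) & !missing(cook)
count if cook > 4/e(N) & !missing(cook)
```

Using e(N) keeps the cut-off correct if the regression is rerun on a different number of observations.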

[Figure: rvfplot output, Residuals against Fitted values.]

[Figure: pnorm sdres output. Vertical axis: Normal F[(sdres-m)/s]; horizontal axis: Empirical P[i] = i/(N+1).]


Exercise 2

1. Repeat the analysis described in this section after removing the possible outliers identified by Cook's distance.

2. After finishing Q1, repeat the analysis but treat the variable year as categorical.

Hint: use the command

xi: sw regress ret mktrf smb hml rf umd i.year, pe(0.05)

Logistic Model

Binary logistic model: dichotomous response outcomes

e.g. presence or absence of an event

Ordinal logistic model: ordinal response variable with more than two ordered categories

e.g. a 5-point Likert scale

Multinomial logistic model: nominal response variable with more than two categories

e.g. different types of programs in school

Binary Logistic Regression


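Binary responses $Y$ are typically coded as 1 for the event of interest and 0 for the opposite event.

$$\pi_i = E(y_i \mid x_i)$$

$$\mathrm{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}$$

$$\pi_i = \frac{\exp(\beta' x_i)}{1 + \exp(\beta' x_i)}$$

$\exp(\beta_k)$ is the odds ratio for $y = 1$ when $x_k$ increases by one unit and all other covariates remain the same.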
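The link between coefficients and odds ratios can be seen by fitting the same model with logit and logistic. The sketch below assumes the built-in auto data, with foreign as the binary response; the variable names are illustrative.

```stata
* Built-in example data stand in for bankloan.csv (assumption)
sysuse auto, clear

logit foreign mpg weight                        // reports coefficients
display "odds ratio for mpg = " exp(_b[mpg])    // exponentiated coefficient

logistic foreign mpg weight                     // reports odds ratios for the same model
```

The value displayed after logit should match the odds ratio for mpg in the logistic table.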


Description of Data

Goal: identify persons with a high chance of defaulting on a bank loan. We have 700 records from a bank database (bankloan.csv).

Variable name    Variable information
age              Age in years
ed               Level of education (1 = didn't complete high school; 2 = high school degree; 3 = college degree; 4 = undergraduate; 5 = postgraduate)
employ           Years with current employer
address          Years at current address
income           Household income in thousands
debtinc          Debt to income ratio (*100)
creddebt         Credit card debt in thousands
othdebt          Other debts in thousands
default          Previously defaulted (1 = Yes; 0 = No)


Step 1: Import and examine data

insheet using bankloan.csv

d

browse

codebook default


tabstat age employ address income debtinc creddebt othdebt, by(default)

table ed, c(mean income mean age mean debtinc mean creddebt mean othdebt) by(default)

Step 2: Construct logistic model

logistic default age ed employ income address

estimates store model1

logistic default age ed employ income address debtinc creddebt othdebt

lrtest model1 .

sw logit default age address employ income debtinc creddebt othdebt, pe(0.05)

logistic: reports odds ratios.

logit: reports coefficients.

estimates store: saves the current estimation results, including the log likelihood, under a name.

lrtest: reports the p-value of the likelihood-ratio test.


Step 3: Post-estimation statistics

predict prob

predict resi, rstandard

hist resi

estat gof

estat gof: goodness-of-fit test.

[Figure: histogram of resi. Vertical axis: Density; horizontal axis: standardized Pearson residual.]


estat classification

The output gives a summary of correct predictions, a summary of incorrect predictions, and the overall success rate. These are calculated using 50% as the cut-off point for positive predictions.
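The 50% cut-off is only a default and can be changed with the cutoff() option. A minimal sketch, assuming the built-in auto data and an arbitrary alternative cut-off of 0.3 chosen purely for illustration:

```stata
* Built-in example data stand in for bankloan.csv (assumption)
sysuse auto, clear
logistic foreign mpg weight

estat classification                  // default cut-off of 0.5
estat classification, cutoff(0.3)     // classify as positive when the predicted probability is 0.3 or more
```

Lowering the cut-off typically raises sensitivity at the cost of specificity, which matters when missing a default is costlier than a false alarm.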


gen z=_b[debtinc]*debtinc+_b[employ]*employ+_b[creddebt]*creddebt+_b[address]*address

line prob z, sort

[Figure: line plot of Pr(default) against z.]
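The index z built above leaves out the constant, which shifts the curve horizontally but does not change its shape. When the constant is included, the predicted probability equals invlogit(z) exactly. The sketch below checks this on the built-in auto data; the dataset and variable names are assumptions for illustration.

```stata
* Built-in example data stand in for bankloan.csv (assumption)
sysuse auto, clear
logit foreign mpg weight

predict double prob                                              // predicted probability from the model
gen double z = _b[_cons] + _b[mpg]*mpg + _b[weight]*weight       // linear index including the constant
gen double prob2 = invlogit(z)                                   // exp(z)/(1 + exp(z))

assert reldif(prob, prob2) < 1e-8                                // the two agree up to rounding
line prob z, sort
```

The same check works for any fitted logit model, and predict with the xb option returns the index without typing the coefficients by hand.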


gen empcat=employ>5

logit default address empcat debtinc creddebt

postgr3 debtinc, by(empcat) //you need to install postgr3 package

[Figure: predicted values (yhat_) against debtinc, one line each for empcat == 0 and empcat == 1.]

postgr3: graphs the predicted values, holding all other variables constant at specified values (the default is the mean).

The marginal impact is higher for people with short service than for those with long service in their current company.


Exercise 3

1. Explore the use of the commands lroc and lsens to diagnose the model, and interpret the results.

lroc: graphs the ROC curve and calculates the area under the curve.

lsens: graphs sensitivity and specificity versus probability cutoff.

2. Predict the probability of default on a bank loan for a person with a debt/income ratio of 22.7, 2 years with the current employer, 16 years living at the current address, and 1.21 thousand in credit card debt.



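Ordinal Logistic Regression

Ordered responses with k+1 categories can be formulated as a threshold model with k cumulative logits.

$$\mathrm{Logit}(p_1) = \log\left(\frac{p_1}{1-p_1}\right) = \beta_{10} + \beta' x$$

$$\mathrm{Logit}(p_1+p_2) = \log\left(\frac{p_1+p_2}{1-(p_1+p_2)}\right) = \beta_{20} + \beta' x$$

$$\vdots$$

$$\mathrm{Logit}(p_1+p_2+\dots+p_k) = \log\left(\frac{p_1+p_2+\dots+p_k}{1-(p_1+p_2+\dots+p_k)}\right) = \beta_{k0} + \beta' x$$

and $p_1 + p_2 + \dots + p_k + p_{k+1} = 1$.

$\exp(\beta_k)$ represents the odds ratio for $y \le s$, for any $s$, when $x_k$ increases by one unit and all other covariates remain the same (here the subscript of $\beta_k$ and $x_k$ indexes a covariate).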


Construct model

recode income (min/20=1 "


xi: ologit inccat age i.ed employ debtinc, or

listcoef, help

ologit: to perform ordered logistic regression.

listcoef: to obtain odds ratios and the change in odds for a standard-deviation change in the variable.


xi: omodel logit inccat age i.ed employ debtinc

brant, detail

Test the parallel regression assumption (proportional odds assumption):

omodel: performs a likelihood-ratio test.

brant: performs the Brant test.


prtab employ //predicted probabilities for each of the values of the variable specified

prvalue, x(_Ied_2=1) //predicted probabilities for selected values of variables

prvalue, x(_Ied_2=1 age=28 employ=3 debtinc=10)


Multinomial Logistic Regression

xi: mlogit inccat age i.ed employ debtinc


listcoef

fitstat

prtab _Ied_2


predict p1 p2 p3 p4 p5

summarize p1 p2 p3 p4 p5

sort employ

twoway connect p1 p5 employ, msym(i i)

[Figure: Pr(inccat==1) and Pr(inccat==5) against employ.]

Logistic Model

Exercise 4

1. Try to construct probit models.
