Sun Li, Centre for Academic Computing
STATA Training Session 2
Statistical Analysis in STATA
Outline
Resources And Books
Data Description And Simple Inference
Group Comparison And Correlation
General Linear Regression
Logistic Model
Binary Logistic Model
Ordinal Logistic Model
Multinomial Logistic Model
Resources And Books
CAC Computing Resources for STATA users
Windows:
STATA/SE version 10.0
10-user network perpetual license
Installation guide
(http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA-Software Questions.aspx)
Linux CAC Beowulf Cluster:
STATA/SE version 10.0
Unlimited users
About CAC Beowulf Cluster:
(http://research2.smu.edu.sg/CAC/HPC/Wiki/MAIN.aspx)
New features in STATA 10.0 (http://www.stata.com/stata10)
Website resources:
The STATA website: http://www.stata.com
The STATA Journal (reviewed papers, regular columns, user-written software): http://www.stata-journal.com/
STATA FAQ: http://www.stata.com/support/faqs
STATA User Support: http://www.stata.com/support
Books: http://www.stata.com/bookstore/
CAC STATA support:
Website:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA.aspx
Contact:
For statistical consultation: Sun Li: [email protected]
For software installation: TAN SuhWen: [email protected]
Additional recommended readings:
Regression Models for Categorical Dependent Variables Using
Stata, 2nd Edition, J. Scott Long and Jeremy Freese
Logistic Regression with Stata, Xiao Chen, Phil Ender, Michael
Mitchell & Christine Wells, UCLA
Statistics with Stata (Updated for Version 9), Lawrence C.
Hamilton
Data Analysis Using Stata, Ulrich Kohler and Frauke Kreuter
Download training slides, data and syntax:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/Training%20Slides%20and%20Syntax.aspx
Data Description & Simple Inference
Description of Data
Name: ibmff.dta
Variables:
Variable name   Variable information
permno          CRSP Permanent Number
date            Numeric date (yyyymmdd)
ret             Holding Period Return
retx            Return without dividends
mktrf           Excess return on market
smb             Small-minus-big return
hml             High-minus-low return
rf              Risk-free return rate
umd             Momentum factor
Convert to STATA date format
gen year=int(date/10000)
gen month=int((date-year*10000)/100)
gen day=date-year*10000-month*100
gen newdate=mdy(month, day, year)
format newdate %td
list date newdate year month day in 1
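The conversion can be checked on a small toy dataset. The sketch below is illustrative only: the two date values are made up, and newdate2 is a hypothetical variable name. It repeats the arithmetic approach and compares it with an alternative that turns the number into a string and parses it with date(), assuming a STATA version in which date() accepts the "YMD" mask.

```stata
* Toy data: two dates stored as yyyymmdd numbers (illustrative values)
clear
input long date
19980102
20061229
end

* Arithmetic approach, as above
gen year  = int(date/10000)
gen month = int((date - year*10000)/100)
gen day   = date - year*10000 - month*100
gen newdate = mdy(month, day, year)

* Alternative: convert the number to a string, then parse it
gen newdate2 = date(string(date, "%8.0f"), "YMD")

format newdate newdate2 %td
list date newdate newdate2
```

Both approaches should give the same STATA date for every row.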
Distribution of Variables
pnorm ret
swilk ret mktrf smb hml rf
[Figure: standardized normal probability plot of ret. x-axis: Empirical P[i] = i/(N+1); y-axis: Normal F[(ret-m)/s].]
pnorm: standardized normal probability plot.
swilk: Shapiro-Wilk normality test; the null hypothesis is that the data are normally distributed.
None of the listed variables appears to be normally distributed. Because ret is the variable
of interest, we reduce its skewness with the zero-skewness log transform lnskew0, then run
swilk to test it again.
lnskew0 lnret=ret
swilk lnret ret
For lnret, the p-value is greater than 0.05 (a 5% significance level, i.e. 95% confidence), so we do not reject the hypothesis that the data are normally distributed.
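A minimal sketch of what lnskew0 does, using the auto dataset that ships with STATA instead of ibmff.dta (the variables mpg and lnmpg are therefore assumptions, not part of this session's data). lnskew0 creates ln(x - k) and chooses k so that the skewness of the new variable is approximately zero; the chosen k is returned in r(gamma).

```stata
* Example on the shipped auto dataset (not the session data)
sysuse auto, clear

* Create lnmpg = ln(mpg - k), with k chosen to give near-zero skewness
lnskew0 lnmpg = mpg
display "k = " r(gamma)

* Check the skewness of the transformed variable
summarize lnmpg, detail
display "skewness after transform = " r(skewness)

* Compare normality tests before and after the transform
swilk mpg lnmpg
```

This also explains a label such as ln(ret+.7527907) on later graphs: it is the transform that lnskew0 selected for ret.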
Group Comparison & Correlation
Question: Does the average holding period return in 1998 differ significantly from the
average return across all other years?
Generate dummy variables
tab year, gen(dumyear)
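To see how tab with gen() names the dummies, here is a small sketch on the shipped auto dataset (an assumption; rep78 and the stub rep are not part of the session data). One 0/1 variable is created per distinct value, numbered in ascending order of the values, so the first dummy corresponds to the smallest value.

```stata
* Example on the shipped auto dataset (not the session data)
sysuse auto, clear

* One dummy per distinct value of rep78: rep1, rep2, ..., rep5
tabulate rep78, generate(rep)

* The variable labels show which value each dummy stands for
describe rep1-rep5
```

By the same rule, dumyear1 marks the earliest year in the data.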
tabstat lnret, stat(n mean sd p25 p50 p75) by(dumyear1)
graph box lnret, by(dumyear1) box(1, bfcolor(blue))
[Figure: box plots of lnret, ln(ret+.7527907), by dumyear1 (0, 1). Graphs by year==1998.]
sdtest lnret, by(dumyear1)
ttest lnret, by(dumyear1)
sdtest: tests the equality of variances.
ttest: performs one-sample and independent-samples t-tests.
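A short sketch of how the two tests fit together, on the shipped auto dataset (an assumption; mpg and foreign stand in for lnret and dumyear1). The two-sample ttest assumes equal variances by default, so if sdtest rejects equality of variances, the unequal option is a reasonable alternative.

```stata
* Example on the shipped auto dataset (not the session data)
sysuse auto, clear

* Step 1: test whether the two groups have equal variances
sdtest mpg, by(foreign)
display "p-value for equal variances = " r(p)

* Step 2a: two-sample t-test assuming equal variances (the default)
ttest mpg, by(foreign)

* Step 2b: two-sample t-test allowing unequal variances
ttest mpg, by(foreign) unequal
display "p-value (unequal variances) = " r(p)
```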
Question: Do the average holding period returns differ significantly across years? If so,
which years differ from each other?
oneway lnret year
oneway lnret year, tabulate bonferroni
tabulate: lists the average return for each year.
bonferroni: performs multiple comparisons between groups with adjusted p-values.
To detect correlations between returns and the other factors:
graph matrix ret mktrf smb hml rf, half
spearman ret mktrf smb hml rf, stats(rho p) print(.05) bonferroni
[Figure: scatterplot matrix (lower half) of ret, excess return on the market, small-minus-big return, high-minus-low return, and risk-free return rate (one month treasury bill rate).]
Exercise 1
1. Tabulate the average risk-free return rate by year.
2. Use help to look up the command ranksum (Mann-Whitney U-test).
3. Test whether the average risk-free return rate in 2005 is significantly different from
that in 2006 using the Mann-Whitney U-test.
(hint: generate a dummy variable first)
4. Use help to look up the command correlate (Pearson's correlation).
5. Identify correlations between the factors of interest in 2006 using listwise and
pairwise Pearson's correlation respectively.
General Linear Regression
General Form of Model
Data: ibmff.dta
$Y = X\beta + \varepsilon$
$Y$ is the $n \times 1$ vector of responses.
$X$ is the $n \times (p+1)$ matrix of explanatory variables.
$\hat{\beta} = (X'X)^{-1}X'Y$: least-squares estimates of the regression coefficients.
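The least-squares formula can be reproduced with STATA's matrix commands. The sketch below uses the shipped auto dataset (an assumption; mpg, weight and length are not part of ibmff.dta) and hypothetical matrix names XX, yX, b_reg and b_ols. matrix accum builds X'X and matrix vecaccum builds Y'X, both with the constant added as the last column.

```stata
* Example on the shipped auto dataset (not the session data)
sysuse auto, clear

* Coefficients from regress
regress mpg weight length
matrix b_reg = e(b)

* X'X, with the constant appended as the last column
matrix accum XX = weight length

* Y'X as a row vector (the first variable listed is Y)
matrix vecaccum yX = mpg weight length

* (Y'X)(X'X)^-1 is the transpose of (X'X)^-1 X'Y
matrix b_ols = yX * invsym(XX)

matrix list b_reg
matrix list b_ols
```

The two listed vectors should agree up to rounding error.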
Step 1: Examine data
graph matrix ret mktrf smb hml rf umd
[Figure: scatterplot matrix of ret, excess return on the market, small-minus-big return, high-minus-low return, risk-free return rate (one month treasury bill rate), and momentum factor.]
Step 2: Perform Linear Regression
regress ret mktrf smb hml rf umd year
regress: to perform linear regression
sw regress ret mktrf smb hml rf umd year, pe(0.05)
sw: to perform stepwise regression
pe(0.05): to specify the significance level of the F-test for addition to the model; terms with a p-value less than 0.05 will be included.
Step 3: Post-estimation Statistics
vif //variance inflation factors
rvfplot //plot residuals against fitted values
predict fit //store fitted values
predict sdres, rstandard //store standardized residuals
pnorm sdres //normal probability plot of standardized residuals
twoway scatter sdres fit //plot standardized residuals against fitted values
predict cook, cooksd //store Cook's distance statistics
list year ret cook if cook>4/108 //list observations whose Cook's distance is above the suggested cut-off point (4/n)
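A variant of the Cook's distance check that does not hard-code the sample size, sketched on the shipped auto dataset (an assumption; the variables and the local macro name cutoff are illustrative). After regress, e(N) holds the number of observations used, so the 4/n cut-off can be computed from the fitted model.

```stata
* Example on the shipped auto dataset (not the session data)
sysuse auto, clear
regress mpg weight length

* Store Cook's distance for each observation
predict double cook, cooksd

* Suggested cut-off 4/n, with n taken from the estimation results
local cutoff = 4/e(N)
display "cut-off = " `cutoff'

* List and count the observations above the cut-off
list make mpg cook if cook > `cutoff' & !missing(cook)
count if cook > `cutoff' & !missing(cook)
```

The !missing(cook) condition matters because STATA treats missing values as larger than any number.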
[Figure: residuals versus fitted values (rvfplot).]
[Figure: standardized normal probability plot of sdres. x-axis: Empirical P[i] = i/(N+1); y-axis: Normal F[(sdres-m)/s].]
Exercise 2
1. Repeat the analysis described in this section after removing the possible
outliers identified by Cook's distance.
2. After finishing Q1, repeat the analysis but treat the variable year as
categorical.
hint: use the command
xi: sw regress ret mktrf smb hml rf umd i.year, pe(0.05)
Logistic Model
Binary logistic model: dichotomous response outcomes
e.g.: presence or absence of an event
Ordinal logistic model: ordinal response variable with more than two ordered categories
e.g.: a 5-point Likert scale
Multinomial logistic model: nominal response variable with more than two categories
e.g.: different types of programs in school
Binary Logistic Regression
General Form of Model
$\pi_i = E(y_i \mid x_i)$
$\mathrm{logit}(\pi_i) = \log(\pi_i/(1-\pi_i)) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}$
$\pi_i = \exp(x_i'\beta)/(1+\exp(x_i'\beta))$
$\exp(\beta_k)$ is the odds ratio for $y = 1$ when $x_k$ increases by one unit and all other
covariates remain the same.
Binary responses $Y$ are typically coded as 1 for the event of interest, and 0 for the
opposite event.
Description of Data
Goal: identify people with a high chance of defaulting on a bank loan. We have
700 records from a bank database (bankloan.csv).
Variable name   Variable information
age             Age in years
ed              Level of education
                1 = didn't complete high school; 2 = high school degree;
                3 = college degree; 4 = undergraduate; 5 = postgraduate
employ          Years with current employer
address         Years at current address
income          Household income in thousands
debtinc         Debt to income ratio (*100)
creddebt        Credit card debt in thousands
othdebt         Other debts in thousands
default         Previously defaulted (1 = Yes; 0 = No)
Step 1: Import and examine data
insheet using bankloan.csv
d
browse
codebook default
tabstat age employ address income debtinc creddebt othdebt, by(default)
table ed, c(mean income mean age mean debtinc mean creddebt mean othdebt) by(default)
Step 2: Construct logistic model
logistic default age ed employ income address
estimates store model1
logistic default age ed employ income address debtinc creddebt othdebt
lrtest model1 .
sw logit default age address employ income debtinc creddebt othdebt, pe(0.05)
logistic: reports odds ratios.
logit: reports parameter coefficients.
estimates store: saves the current estimation results, including the log likelihood, under a name.
lrtest: reports the p-value of the likelihood-ratio test; the "." refers to the most recent estimation results.
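The store-then-compare pattern for a likelihood-ratio test can be sketched on the shipped auto dataset (an assumption; foreign, weight and mpg are not part of bankloan.csv, and small is a hypothetical name for the stored results). The smaller model must be nested in the larger one and fitted on the same observations.

```stata
* Example on the shipped auto dataset (not the session data)
sysuse auto, clear

* Restricted model, stored under the name "small"
logit foreign weight
estimates store small

* Full model: adds one covariate
logit foreign weight mpg

* Likelihood-ratio test of the stored model against the current one
lrtest small .
display "LR chi2 = " r(chi2) ", df = " r(df) ", p = " r(p)
```

A small p-value suggests that the added covariate improves the fit.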
Step 3: Post-estimation statistics
predict prob
predict resi, rstandard
hist resi
estat gof
estat gof: goodness-of-fit test
[Figure: histogram of resi. x-axis: standardized Pearson residual; y-axis: Density.]
estat classification
estat classification: summarizes the correct predictions, the incorrect predictions, and
the overall success rate. These are calculated using 50% as the cut-off point for
positive predictions.
gen z=_b[debtinc]*debtinc+_b[employ]*employ+_b[creddebt]*creddebt+_b[address]*address
line prob z, sort
[Figure: Pr(default) plotted against z.]
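The S-shaped curve is the inverse logit of the linear predictor. A minimal sketch on the shipped auto dataset (an assumption; the variables and the names xb and prob2 are illustrative) lets predict build the linear predictor, constant included, and checks that applying invlogit() to it reproduces the predicted probability.

```stata
* Example on the shipped auto dataset (not the session data)
sysuse auto, clear
logit foreign weight mpg

* Predicted probability Pr(foreign = 1), the default after logit
predict double prob

* Linear predictor x'b, including the constant
predict double xb, xb

* Probability recomputed as exp(xb)/(1 + exp(xb))
gen double prob2 = invlogit(xb)

* The probability traces the logistic curve when plotted against xb
line prob xb, sort
```

Using predict with the xb option avoids typing each _b[] term by hand and keeps the constant in the index.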
gen empcat=employ>5
logit default address empcat debtinc creddebt
postgr3 debtinc, by(empcat) //you need to install postgr3 package
[Figure: predicted probabilities (yhat_) against debtinc, for empcat == 0 and empcat == 1.]
postgr3: graphs the predicted values, holding all other variables constant at
specified values (default is the mean).
The marginal impact of debtinc is higher for people with short service than
for those with long service in their current company.
Exercise 3
1. Explore the commands lroc and lsens to diagnose the model, and interpret the
results.
lroc: graphs the ROC curve and calculates the area under the curve.
lsens: graphs sensitivity and specificity versus probability cutoff.
2. Predict the probability of default on a bank loan for a person with a
debt/income ratio of 22.7, 2 years with the current employer, 16 years at the
current address, and 1.21 thousand in credit card debt.
Ordinal Logistic Model
General Form of Model
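$\mathrm{Logit}(p_1) = \log\dfrac{p_1}{1-p_1} = \beta_{10} + \beta' x$
$\mathrm{Logit}(p_1+p_2) = \log\dfrac{p_1+p_2}{1-(p_1+p_2)} = \beta_{20} + \beta' x$
$\vdots$
$\mathrm{Logit}(p_1+p_2+\dots+p_k) = \log\dfrac{p_1+p_2+\dots+p_k}{1-(p_1+p_2+\dots+p_k)} = \beta_{k0} + \beta' x$
and $p_1 + p_2 + \dots + p_k + p_{k+1} = 1$.
$\exp(\beta_j)$ represents the odds ratio of $y \le s$ for any $s$ when $x_j$ increases by one unit and all
other covariates remain the same.
Ordered responses with $k+1$ categories can be formulated as a threshold model.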
Construct model
recode income (min/20=1 "
xi: ologit inccat age i.ed employ debtinc, or
listcoef, help
ologit: to perform ordered logistic regression.
listcoef: to obtain odds ratios and the change in odds for a standard-deviation change in each variable.
xi: omodel logit inccat age i.ed employ debtinc
brant, detail
Test of the parallel regression assumption (proportional odds assumption):
omodel: to perform a likelihood-ratio test.
brant: to perform the Brant test.
prtab employ //predicted probabilities for each of the values of the variable specified
prvalue, x(_Ied_2=1) //predicted probabilities for selected values of variables
prvalue, x(_Ied_2=1 age=28 employ=3 debtinc=10)
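Predicted category probabilities can also be obtained with the built-in predict command, without add-on packages. The sketch below uses the shipped auto dataset (an assumption; rep78, mpg and weight stand in for inccat and its covariates, and p1 to p5 and psum are hypothetical names). It stores one probability per outcome category and confirms that they sum to one for each observation.

```stata
* Example on the shipped auto dataset (not the session data)
sysuse auto, clear

* Ordered logit with a five-category response
ologit rep78 mpg weight

* One predicted probability per category, in ascending category order
predict double p1 p2 p3 p4 p5, pr

* The category probabilities should sum to 1 in every row
egen double psum = rowtotal(p1 p2 p3 p4 p5)
summarize p1 p2 p3 p4 p5 psum
```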
Multinomial Logistic Model
xi: mlogit inccat age i.ed employ debtinc
listcoef
fitstat
prtab _Ied_2
predict p1 p2 p3 p4 p5
summarize p1 p2 p3 p4 p5
sort employ
twoway connect p1 p5 employ, msym(i i)
[Figure: Pr(inccat==1) and Pr(inccat==5) plotted against employ.]
Logistic Model
Exercise 4
1. Try to construct probit models.
End