Survey Design and Analysis
Torben Schubert, December 12th, 2012, CIRCLE, Lund
NORSI course on ‘Survey of Quantitative Research’
Survey Design
◦ Cluster analysis
◦ Latent factors
Hypothesis testing using Community Innovation Survey data
◦ Limited dependent variables
◦ Application using STATA
Outline
Survey Design
Yesterday you had an introduction to linear regression analysis.
OLS is one of the most powerful tools for testing hypotheses.
But hypothesis testing is not the only task in quantitative empirical research.
Sometimes we might not even have a clear idea about the structures in the data set, and we may find it difficult to develop sensible hypotheses.
Introduction
Sometimes we encounter measurement problems that make it difficult to discern what the theoretical meaning of a variable or a set of variables actually is.
What can we do then?
Introduction
Good empirical research should follow these steps:
◦ Build a theory about a certain phenomenon (e.g. by literature review, or by squeezing your brain)
◦ Delineate expectations about empirical relationships (often called hypotheses)
◦ Collect the data that are necessary to measure your relationships
◦ Use a sensible technique to determine whether your hypotheses hold.
The ideal way to good results
This ideal process is often obstructed:
◦ We might get access to a rich dataset that we have not compiled ourselves and which we therefore do not fully understand.
◦ We might have a complex measurement construct in mind, but we are not sure whether our variables really measure it.
Problems
If you are unsure about the information contained in your dataset, do not underestimate the power of descriptive statistics.
Means by groups or correlations can greatly improve your understanding of the data.
Take your time to investigate an unknown dataset.
Some suggestions
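A minimal sketch of such a first look at an unknown dataset. It assumes the auto example data that ships with STATA (the same data used in the cluster example later on); the choice of variables is purely illustrative.

```stata
* Load the example data shipped with Stata
sysuse auto, clear

* Overview of variables, value ranges and missing values
describe
codebook rep78, compact
summarize price mpg weight length

* Means by group: compare domestic and foreign cars
tabstat price mpg weight, by(foreign) statistics(mean sd n)

* Pairwise correlations with significance levels
pwcorr price mpg weight length, sig
```

Group means and correlations of this kind often reveal coding conventions, outliers and missing-value patterns before any model is estimated.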
Cluster analysis
What is a cluster? Loosely defined:
Data can be considered clustered if
◦ observations belonging to the same cluster are alike.
◦ observations belonging to different clusters differ.
Cluster analysis
Cluster analysis assumes that observations (e.g. firms) belong to a given number of different clusters that are inherently different from each other.
Technically, you search for multivariate similarity between observations given a set of characteristics.
E.g. you could think that firms differ by age, size, and innovativeness.
Cluster analysis
A clustering method then sorts the firms that are most similar to each other into a given number of clusters.
Many techniques exist, but most of the common ones are rather descriptive and leave many arbitrary choices to the researcher:
◦ Which variables to include?
◦ How many clusters to go for?
◦ Which method to use?
Cluster analysis
Cluster Analysis
[Figure: four scatter plots of x[,2] against x[,1]]
and not all data are clustered…
An example in STATA based on the auto data set. The command structure is
cluster subcommand varlist, options
Type the following:
sysuse auto
cluster wardslinkage rep78 length price if !missing(rep78) & !missing(length) & !missing(price), measure(correlation)
cluster dendrogram
Cluster analysis
The dendrogram looks like this and shows at which tolerance level observations and subgroups are merged into clusters.
The number of clusters is arbitrary, but 3 is probably not a bad choice.
Cluster analysis
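One way to make the choice of the number of clusters a little less arbitrary is a stopping rule. The following is a sketch only, not the analysis from the slides: it assumes standardized variables and the default dissimilarity measure instead of measure(correlation), and the names z_* and wardz are hypothetical labels chosen for illustration.

```stata
sysuse auto, clear
* Keep complete cases only
keep if !missing(rep78, length, price)

* Standardize so that price does not dominate the distance measure
foreach v of varlist rep78 length price {
    egen double z_`v' = std(`v')
}

* Ward's linkage on the standardized variables; "wardz" is just a label
cluster wardslinkage z_rep78 z_length z_price, name(wardz)

* Calinski-Harabasz pseudo-F for 2 to 6 groups (larger = more distinct)
cluster stop wardz, rule(calinski) groups(2/6)

* Cut the tree at three groups and inspect the group sizes
cluster generate g3 = groups(3), name(wardz)
tabulate g3
```

The stopping rule is still only a descriptive aid; different variables, measures or linkage methods can suggest a different number of groups.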
[Figure: Dendrogram for _clus_1 cluster analysis. Vertical axis: correlation similarity measure (.996 to 1); horizontal axis: observation numbers]
Then type
cluster generate cutvar = groups(3)
in order to generate a grouping variable. To generate summary statistics by groups type
bysort cutvar: sum rep78 length price if !missing(cutvar)
Cluster analysis
Cluster analysis
-> cutvar = 1
Variable    Obs        Mean    Std. Dev.     Min      Max
rep78        24    3.041667    1.082636        1        5
length       24    189.4583    16.66284      163      221
price        24      4210.5    516.8955     3291     5379

-> cutvar = 2
Variable    Obs        Mean    Std. Dev.     Min      Max
rep78        30    3.633333    .9278575        2        5
length       30       181.6    25.47155      142      222
price        30    5319.367    874.2762     3895     7827

-> cutvar = 3
Variable    Obs        Mean    Std. Dev.     Min      Max
rep78        15    3.533333    .8338094        2        5
length       15       199.8    21.74922      156      233
price        15    10896.27    2667.294     6850    15906
Cluster analysis is a nice data mining tool that is useful when you have no idea of what is going on.
◦ I would not recommend using it in a scientific paper, because of its exploratory character.
◦ It might assist you in earlier stages of research.
Note that there are statistically more advanced methods in other packages such as R (keyword: model-based clustering).
Cluster analysis
Latent factors
Theory is often framed in terms of unmeasurable concepts.
This happens often in management research, sociology, and psychology.
Suppose you hypothesize that teacher quality increases student performance.
How to measure teacher quality?
You might consider asking a battery of questions about a set of quality dimensions (Is the teacher well prepared? Does the teacher react to students' questions? ...)
Latent factors
The first question you ask is whether there really is a unidimensional thing called teacher quality.
You can use factor analysis for this.
For any given set of variables, factor analysis determines the underlying (latent) constructs.
Type in the following:
use http://www.ats.ucla.edu/stat/stata/output/m255, clear
factor item13-item24, ipf factor(3)
Factor analysis
General rule: use as many factors as there are eigenvalues greater than one.
In this case there is only one: good news!
Factor analysis
Factor analysis/correlation                  Number of obs    = 1365
Method: iterated principal factors           Retained factors = 3
Rotation: (unrotated)                        Number of params = 33

Factor      Eigenvalue   Difference   Proportion   Cumulative
Factor1        5.85150      5.04464       0.8336       0.8336
Factor2        0.80687      0.44540       0.1149       0.9485
Factor3        0.36146      0.23001       0.0515       1.0000
Factor4        0.13146      0.07619       0.0187       1.0187
Factor5        0.05527      0.02362       0.0079       1.0266
Factor6        0.03164      0.02946       0.0045       1.0311
Factor7        0.00218      0.00658       0.0003       1.0314
Factor8       -0.00440      0.01466      -0.0006       1.0308
Factor9       -0.01906      0.02688      -0.0027       1.0281
Factor10      -0.04594      0.01440      -0.0065       1.0215
Factor11      -0.06035      0.03050      -0.0086       1.0129
Factor12      -0.09084            .      -0.0129       1.0000
LR test: independent vs. saturated: chi2(66) = 8683.10 Prob>chi2 = 0.0000

Factor loadings (pattern matrix) and unique variances
Variable    Factor1    Factor2    Factor3    Uniqueness
item13       0.7134    -0.3987     0.0923        0.3236
item14       0.7032    -0.3391     0.0978        0.3810
item15       0.7212    -0.2450     0.1057        0.4086
item16       0.6478    -0.1890     0.1114        0.5322
item17       0.7831    -0.0734     0.0667        0.3770
item18       0.7395     0.3448     0.1129        0.3216
item19       0.6165     0.4159     0.1551        0.4228
item20       0.5501     0.2392     0.0932        0.6315
item21       0.7317     0.1168     0.0007        0.4509
item22       0.6128     0.2609    -0.0228        0.5559
item23       0.8194    -0.0262    -0.3454        0.2086
item24       0.6952     0.0183    -0.3873        0.3665
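As a reading aid for the loadings table, the uniqueness of an item is one minus the sum of its squared loadings on the retained factors. The worked check below uses the reported values for item13; the symbols $\lambda_{jm}$ (loading of item $j$ on factor $m$) and $u_j$ (uniqueness) are notation introduced here for illustration.

```latex
\begin{align*}
u_j &= 1 - \sum_{m=1}^{3} \lambda_{jm}^2 \\
u_{13} &= 1 - \left(0.7134^2 + (-0.3987)^2 + 0.0923^2\right) \\
       &\approx 1 - (0.5089 + 0.1590 + 0.0085) \approx 0.3236
\end{align*}
```

This reproduces the Uniqueness column for item13: roughly two thirds of the variance of that item is shared with the three retained factors, most of it with Factor1.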
Another commonly used measure is Cronbach's Alpha, a reliability coefficient that increases with the number of items and with the average correlation (or covariance) between them.
This should be large (at least 0.65). Type in
alpha item13-item24
Cronbach‘s Alpha
Test scale = mean(unstandardized items)
Average interitem covariance: .386608
Number of items in the scale: 12
Scale reliability coefficient: 0.9125
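For orientation, the textbook formula behind the reported coefficient for unstandardized items is sketched below. The symbols are introduced here for illustration: $k$ is the number of items, $\bar{c}$ the average inter-item covariance and $\bar{v}$ the average item variance. The value of $\bar{v}$ is not shown in the output; the figure given is only what the formula implies from the three reported numbers.

```latex
\begin{align*}
\alpha &= \frac{k\,\bar{c}}{\bar{v} + (k-1)\,\bar{c}} \\
\bar{v} &= \frac{k\,\bar{c}}{\alpha} - (k-1)\,\bar{c}
        = \frac{12 \times 0.386608}{0.9125} - 11 \times 0.386608
        \approx 0.83
\end{align*}
```

The formula shows why alpha is not simply an average correlation: for a fixed $\bar{c}$ and $\bar{v}$ it rises as more items are added to the scale.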
Hypothesis testing using Community
Innovation Survey data
Community Innovation Survey: harmonized survey of innovation behavior in the European Union and Norway
Repeated cross-section data with a lot of information about innovation inputs, outputs, firm characteristics, markets, …
We can analyse these data with the tools we were equipped with yesterday:
◦ T-tests about differences in means
◦ OLS to test more complicated hypotheses
But many variables do not easily lend themselves to OLS because of their nature…
Introduction
Limited Dependent Variables
Limited dependent variables (LDV)
◦ Types of LDV
◦ Implications for OLS
Estimation Methods
◦ Maximum Likelihood Estimation
◦ The need for marginal effects
◦ Probit and Logit Models
◦ Multinomial Models
◦ Count data
◦ Tobit Models
Overview
What do we estimate by regression?
Suppose we have the regression equation:
$y = x\beta + u$
We are typically interested in the coefficients/parameters $\beta$.
But what is their meaning?
A commonly heard suggestion:
◦ They measure how the explained variable changes when the explaining variables change by one unit…
Introductory Reminder
This is imprecise. But why? Look at the formula again:
$y = x\beta + u$
The error term $u$ obstructs this direct relationship between the explained variable on the one hand and the coefficients and explaining variables on the other.
Introductory Reminder
We solve that by focusing on expectations:
$E(y \mid x) = x\beta$
The coefficient now has the following meaning:
$\beta_k = \frac{\partial E(y \mid x)}{\partial x_k}$
A coefficient measures how the expected value of the explained variable changes when the corresponding explaining variable changes by one unit.
Introductory Reminder
Some Theory
Basic definition: An LDV is any dependent (also: explained, left-hand-side) variable in a regression that cannot take on every value on the real axis.
Examples
◦ Indicator variables: e.g. employed (y/n)
◦ Count variables: # patents
◦ Strictly positive variables: amount of alcohol consumed per week
◦ Multinomial response variables: preferred leisure time activities (bowling, reading, meeting friends)
LDV - Types
Suppose we intend to explain the employment status of persons.
A convenient way of coding is 1: employed and 0: unemployed.
Technically we could run a linear regression of the following form:
$empl = x\beta + u$
yielding estimates $\hat{\beta}$.
LDV – Implications for OLS
But consider the estimated expectation of $empl$:
$\hat{E}(empl \mid x) = x\hat{\beta}$
Since $\hat{\beta}$ is fixed and there are no restrictions on $x$, the predicted values may well lie outside the theoretical boundaries of 0 and 1.
This is an implication of the linearity of OLS.
LDV – Implications for OLS
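The boundary problem can be inspected directly. The sketch below is an illustration under assumptions: it uses the indicator foreign from the auto example data as a stand-in for an employment indicator, and the variable names p_ols and p_probit are hypothetical.

```stata
sysuse auto, clear

* Linear probability model: OLS on a 0/1 dependent variable
regress foreign mpg weight
predict double p_ols, xb

* Probit: predicted probabilities are bounded by construction
probit foreign mpg weight
predict double p_probit, pr

* Compare the range of the two sets of predictions
summarize p_ols p_probit

* Number of OLS predictions outside the interval [0, 1]
count if p_ols < 0 | p_ols > 1
```

Whether any OLS predictions fall outside the unit interval depends on the data; the probit predictions cannot.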
We impose a linear model with no restrictions on an expected value that should be bounded between 0 and 1.
LDV – Implications for OLS
We need to find a non-linear model for the expected value.
Suppose you want to explain income, but the data are censored at an upper threshold (e.g. 100,000 € per month and above).
What happens, if you use OLS dropping the highest category (truncation) or replacing the censored value with 100,000 (censoring)?
LDV – Implications for OLS
Obviously, there is a downward bias in this case.
OLS yields inconsistent results.
LDV – Implications for OLS
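The size of the bias can be explored with simulated data. This is a sketch with made-up numbers: the true slope of 2, the threshold of 2 and all variable names are assumptions chosen only for illustration.

```stata
clear
set seed 12345
set obs 5000

* Hypothetical latent outcome with a true slope of 2
generate x     = rnormal()
generate ystar = 1 + 2*x + rnormal()

* Censoring: values above the threshold are recorded as the threshold
generate ycens = min(ystar, 2)

* OLS on the censored variable
regress ycens x

* OLS after dropping the observations at or above the threshold (truncation)
regress ystar x if ystar < 2

* Tobit takes the upper limit into account
tobit ycens x, ul(2)
```

Comparing the three slope estimates with the true value of 2 shows how both shortcuts pull the OLS estimate towards zero, while the Tobit model (discussed later) targets the latent relationship.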
OLS doesn't work in these situations.
Common practice therefore:
◦ Confirm that the explained variable is not an LDV (e.g. profits), or at least roughly not an LDV (e.g. the height of a person)
◦ If the variable is an LDV in some sense, use other methods that implement appropriate non-linear models for the expected value.
What are these methods?
Estimation methods: ML
Fortunately, the Maximum Likelihood approach offers a flexible solution to a large class of such problems (developed by Fisher in the early 20th century).
It follows several steps:
◦ Choose an appropriate statistical model for your data.
◦ Based on this model, express the likelihood of observing your sample as a function of the parameters.
◦ Maximize this likelihood over the parameters. The solution to this problem is the ML estimate.
Estimation methods: ML
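As a minimal worked illustration of these three steps (not taken from the slides), assume an independent sample of $n$ indicator variables $y_i$ with unknown success probability $p$, of which $k$ are equal to one:

```latex
\begin{align*}
\text{Model:} \quad & P(y_i = 1) = p, \qquad P(y_i = 0) = 1 - p \\
\text{Likelihood:} \quad & L(p) = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i} = p^{k}(1-p)^{n-k} \\
\text{Log-likelihood:} \quad & l(p) = k \log p + (n-k)\log(1-p) \\
\text{Maximization:} \quad & \frac{k}{p} - \frac{n-k}{1-p} = 0
  \;\Rightarrow\; \hat{p}_{ML} = \frac{k}{n}
\end{align*}
```

In this simple case the maximum can be found by hand and equals the sample share of ones. For models such as probit no closed-form solution exists, so the maximization is done numerically.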
What about the size of the effects?
We are always interested in how the dependent variable changes when one of the independent variables changes:
$\frac{\partial E(y \mid x)}{\partial x_j}$
Unfortunately, because the expected value is now non-linear, the coefficients are no longer identical to the marginal effects.
Marginal effects and meaning of coefficients
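In the Probit Model, for example, we can show that the marginal effect is:
$\frac{\partial E(y \mid x)}{\partial x_j} = \frac{\partial \Phi(x\beta)}{\partial x_j} = \phi(x\beta)\,\beta_j$, where $\phi(x\beta) > 0$
Marginal effects and meaning of coefficients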
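A short numerical illustration of why the probit coefficient and the marginal effect differ. It assumes the standard probit notation, with $\phi$ the standard normal density and $x\beta$ the linear index; the three index values are arbitrary examples.

```latex
\begin{align*}
\phi(z) &= \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \\
x\beta = 0: \quad & \phi(0)\,\beta_j \approx 0.399\,\beta_j \\
x\beta = 1: \quad & \phi(1)\,\beta_j \approx 0.242\,\beta_j \\
x\beta = 2: \quad & \phi(2)\,\beta_j \approx 0.054\,\beta_j
\end{align*}
```

The same coefficient therefore translates into very different changes in the probability depending on where the observation sits, but the sign of the effect always equals the sign of $\beta_j$ because the density is positive.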
Implications:
◦ In the Probit model the coefficient does not coincide with the marginal effect.
◦ Nonetheless, it gives the correct direction. This holds for many ML methods but not for all.
Always, and I seriously mean always, report marginal effects instead of raw coefficients when using ML. (STATA can do that easily.)
Marginal effects and meaning of coefficients
Practice in STATA
Whenever we encounter an indicator variable (0/1) as the dependent variable, we should think of a correct probability model.
Examples:
◦ Unemployed vs. employed
◦ Non-patenting company vs. patenting company
◦ …
Several models are usable, but the most common are:
◦ Logit model and probit model
◦ Practically, there is no large difference between the two when we focus on marginal effects
The Probit and the Logit Model
It is easy to invoke them in STATA using the probit or logit command:
probit depvar indepvars, options
logit depvar indepvars, options
For example, if you have a patent indicator pat, the innovation expenditures innoexp and the size of the company empl, the command looks like this:
probit pat innoexp empl
The Probit and the Logit Model
The marginal effects are computed using the following command directly after a probit/logit regression:
mfx, predict(p)
Observe that this command always refers to the last regression.
The Probit and the Logit Model
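A runnable sketch of the probit/logit workflow. It assumes the auto example data, with the indicator foreign standing in for a patent indicator, and it adds the margins command (available from Stata 11) as an alternative to mfx; none of this is part of the original exercise.

```stata
sysuse auto, clear

* Probit for a 0/1 outcome (foreign vs. domestic car)
probit foreign mpg weight

* Marginal effects evaluated at the sample means
mfx, predict(p)

* The same type of marginal effect using margins
margins, dydx(*) atmeans

* Logit on the same specification for comparison
logit foreign mpg weight
margins, dydx(*) atmeans
```

The raw probit and logit coefficients are on different scales, so the comparison between the two models is best made on the marginal effects.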
Suppose there are many buying alternatives for a product (e.g. Android smartphone, iPhone, Windows smartphone) and you would like to know how customers' characteristics affect their buying decision.
In this case, there are 4 categories:
◦ no SP
◦ Android SP
◦ iPhone
◦ Windows SP
Multinomial models
This differs from probit/logit because there are more than two categories.
Two widely used models:
◦ Multinomial logit
◦ Multinomial probit
Here there is a difference: multinomial probit is more flexible, but its computation is usually not feasible with more than four or five categories.
Multinomial models
The STATA commands are mprobit and mlogit:
mprobit depvar indepvars, options
mlogit depvar indepvars, options
For example, if you have a variable sp giving consumer-level data on SP choice, inc being the income, and age the age, the command would be
mprobit sp inc age
Multinomial models
Note: coefficients and marginal effects do not even necessarily have the same direction.
You must calculate marginal effects using (we have four categories, each has its own marginal effects)
mfx, predict(p outcome(1))
mfx, predict(p outcome(2))
mfx, predict(p outcome(3))
mfx, predict(p outcome(4))
Note: If the data are ordered (e.g. Likert scale) you can use ordered probit (oprobit with the same syntax).
Multinomial models
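A sketch of the multinomial workflow on a real dataset. It assumes the health insurance example data sysdsn1 from the Stata manuals, which webuse downloads from the Stata website, with the three-category outcome insure taking the place of the smartphone choice; it also uses margins instead of mfx.

```stata
* Health insurance choice data from the Stata manuals (needs internet access)
webuse sysdsn1, clear
tabulate insure

* Multinomial logit with three outcome categories
mlogit insure age male nonwhite

* One set of marginal effects per outcome category
forvalues k = 1/3 {
    margins, dydx(age male nonwhite) predict(outcome(`k'))
}
```

For each explanatory variable, the marginal effects across the categories sum to zero, since the predicted probabilities must always add up to one.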
When the data take on integer values that have a clear numeric meaning (e.g. patents), you should use count data models.
What is the difference to a Likert scale?
Sensible methods are
◦ Poisson regression
◦ Negative binomial regression
Negative binomial is much more flexible without imposing a considerable computational penalty.
Count data
The STATA command is
nbreg depvar indepvars, options
Suppose the number of patents is stored in numpat, size is given by empl and innovation expenditures by innoexp; the command could be
nbreg numpat innoexp empl
For marginal effects it is enough to type
mfx
Count data
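The difference between the Poisson and the negative binomial model can be tried out on simulated data. Everything in this sketch is hypothetical: the data generating process, the coefficient values and the variable lnempl (log employment) are assumptions made for illustration.

```stata
clear
set seed 2012
set obs 2000

* Hypothetical firm characteristics
generate lnempl  = rnormal(4, 1)
generate innoexp = runiform() * 10

* Gamma heterogeneity with mean 1 makes the patent counts overdispersed
generate nu     = rgamma(2, 0.5)
generate numpat = rpoisson(nu * exp(-2 + 0.15*innoexp + 0.3*lnempl))

* Poisson ignores the overdispersion, nbreg estimates it (alpha)
poisson numpat innoexp lnempl
nbreg numpat innoexp lnempl

* Average marginal effects on the expected number of patents
margins, dydx(*)
```

The nbreg output includes a likelihood-ratio test of alpha = 0, which indicates whether the extra flexibility over the Poisson model is needed.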
Censored or quasi-censored dependent variables are those that
◦ are principally continuous
◦ cannot take on all values on the real axis
◦ have mass points at their censoring limit
Examples:
◦ Innovation expenditures
◦ Turnover
◦ R&D intensity (R&D expenditures divided by turnover)
Data can be single- or double-censored
Censored data
The correct model to use is the Tobit model, which can be invoked by the following command:
tobit depvar indepvars, ll() ul()
where the options ll() and ul() set the lower and upper censoring limits, respectively.
Suppose you want to explain the share of employees with tertiary education in % (shtert) by the size (empl) of the company; you could use
tobit shtert empl, ll(0) ul(100)
Censored data
The marginal effects follow using
mfx, predict(e(0,100))
In the case of data that are censored only at zero and otherwise unrestricted, you would technically want to write something like this:
mfx, predict(e(0,infinity))
Censored data
But STATA does not know infinity as a number. You could simply use a number larger than your sample maximum.
Or, more elegantly, you can use the following sequence (suppose that your dependent variable is turnover):
summarize turnover
local maxturn = r(max)
mfx, predict(e(0,`maxturn'))
Censored data
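A runnable sketch of a one-sided Tobit model. It assumes the auto example data with mileage artificially censored from below at 17 mpg, and it uses margins rather than mfx; in the e(a,b) prediction a missing value "." in place of a limit is read as infinity, which offers an alternative to the sample-maximum workaround.

```stata
sysuse auto, clear
generate wgt = weight / 1000

* Artificially censor mileage from below at 17 mpg
replace mpg = 17 if mpg < 17

* Tobit with a lower limit only
tobit mpg wgt, ll(17)

* Marginal effect on E(mpg | mpg > 17); "." stands for +infinity
margins, dydx(wgt) predict(e(17,.))
```

The censoring point of 17 is arbitrary here and serves only to create a mass point at the lower limit.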
ML methods are computationally intensive and are purely numerical methods.
Sometimes the standard algorithm does not converge (mprobit is not unlikely to produce this outcome).
What to do then?
◦ Under no circumstances report results when convergence was not achieved.
◦ You can try the difficult option.
◦ You can use maximize options.
◦ You can provide different starting values.
But nothing is guaranteed to help.
A word of caution
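The options mentioned above can be sketched as follows. This is an illustration on the auto example data, where convergence is not actually a problem; the starting values in the matrix b0 are hypothetical.

```stata
sysuse auto, clear

* The difficult option and a cap on the number of iterations
probit foreign mpg weight, difficult iterate(100)

* e(converged) is 1 if the optimizer reported convergence, 0 otherwise
display "converged: " e(converged)
assert e(converged) == 1

* A different optimization algorithm
probit foreign mpg weight, technique(bfgs) iterate(200)

* Hypothetical starting values passed via from()
matrix b0 = (0, 0, 0)
probit foreign mpg weight, from(b0, copy)
```

Checking e(converged) before storing or exporting results is a simple safeguard against reporting estimates from a failed optimization.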
◦ Check whether your variable is an LDV. If not, use OLS.
◦ If yes, determine the type of LDV characteristic.
◦ Choose an appropriate model (there are many more than those discussed today).
◦ If you use LDV methods, it is safest to report marginal effects instead of raw coefficients.
Summary
Back-Up
Sounds complicated, and in fact it can be so.
But an easy example is the Probit model for indicator explained variables.
Suppose there is an unobservable (latent) variable $y^*$ taking on any value and the observable indicator $y$ taking on only a value of 0 or 1.
Estimation methods: ML
Both are linked as follows:
$y^* = x\beta + u, \quad u \sim N(0,1), \quad y = 1[y^* > 0]$
As always, we would like to estimate the expected value of $y$:
$E(y \mid x) = 1 \cdot P(y = 1 \mid x) + 0 \cdot P(y = 0 \mid x) = P(y = 1 \mid x)$
Estimation methods: ML
Both are linked as follows:
$y^* = x\beta + u, \quad u \sim N(0,1), \quad y = 1[y^* > 0]$
In order to form the likelihood function we have to find the probabilities that $y$ equals zero and one as functions of the parameters:
$P(y = 1 \mid x) = P(y^* > 0 \mid x) = P(u > -x\beta) = P(u < x\beta) = \Phi(x\beta)$
The probability that the indicator is zero is simply
$P(y = 0 \mid x) = 1 - P(y = 1 \mid x) = 1 - \Phi(x\beta)$
Estimation methods: ML
The probability of observing a generic observation then is:
$\Phi(x_i\beta)^{y_i}\,(1 - \Phi(x_i\beta))^{1 - y_i}$
Because of independence between the observations, the likelihood giving the probability of observing the whole sample is:
$L(\beta) = \prod_{i=1}^{n} \Phi(x_i\beta)^{y_i}\,(1 - \Phi(x_i\beta))^{1 - y_i}$
Or, for computational reasons, in log form:
$l(\beta) = \sum_{i=1}^{n} \left[ y_i \log \Phi(x_i\beta) + (1 - y_i) \log(1 - \Phi(x_i\beta)) \right]$
Estimation methods: ML
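The log-likelihood formula can be checked against STATA's own output. The sketch below assumes the auto example data, with foreign as the indicator variable; the variable names xb and lli are hypothetical.

```stata
sysuse auto, clear
probit foreign mpg weight

* Linear index x_i*b for every observation
predict double xb, xb

* Contribution of each observation to the log-likelihood
generate double lli = foreign*ln(normal(xb)) + (1 - foreign)*ln(1 - normal(xb))

* Sum of the contributions versus the log-likelihood reported by probit
quietly summarize lli
display "by hand: " r(sum) "   probit: " e(ll)
```

The two numbers should agree up to numerical precision, since probit maximizes exactly this sum over the coefficients.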