Is Random Assignment Passé?
Walter Nicholson Department of Economics
Amherst College
Version of November 14, 2001
Paper prepared for International Methodology Conference “From Theory to Practice” sponsored by Evaluation and Data Development, Human Resources Development, Canada. Ottawa, November 16, 2001.
Estimates from designs featuring random assignment have long been considered by researchers
to be the “gold standard” in social policy evaluation. The statistical advantage of such designs
(that they control for unobserved differences between treated and untreated individuals) in
combination with the ease with which such results can be explained to policymakers has made
random assignment the benchmark against which other studies are measured (see, for example,
Lalonde, 1986). Despite this apparent success a number of challenges have recently been made
to the random assignment hegemony. Specifically, some authors have pointed out that, for some
applications, the conceptual superiority of random assignment is not a foregone conclusion
(Heckman and Smith, 1995). Others have pointed to practical and ethical problems in the
implementation of random assignment designs1. Finally, a recent challenge to random
assignment has been the development of a number of statistical methodologies that, it is claimed,
yield estimates that are as good as those obtained through random assignment without the
implementation problems that random assignment poses (see, for example, Dehejia and Wahba,
1999).
The purpose of this paper is to provide a critical assessment of these attacks on random
assignment. It concludes that in many cases the objections to random assignment are overstated.
In other cases we show that, although the objections to random assignment have some validity,
these objections apply with even greater force to other statistical approaches to the evaluation
problem. A final section of the paper restates the case for random assignment and suggests some
ways in which research using such designs might be enhanced.
1 Program staff are more likely to raise ethical issues connected with random assignment than are researchers. For example, a survey of JTPA training centers found that more than half cited ethical and public relations concerns about participating in a random assignment evaluation (Doolittle and Traeger, 1990).
A. The Simple Case for Random Assignment
The case for random assignment is very simple – it is that this is the only approach to social
policy evaluation that assures statistical independence between the treatment being offered and
other determinants of outcomes. This point can be illustrated with a simple linear2 structural
model which assumes that the outcome of interest for a particular individual (Yi – which might
be taken to be annual earnings) depends on both observed (X1i) and unobserved (X2i)
characteristics3 of a population, on a treatment variable (Ti – which may be either binary or
continuous), and on a purely random error term (Ui – which is assumed to be independent of the
other variables) according to the equation:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 T_i + U_i. \qquad [1]$$
Random assignment ensures that Ti is statistically independent of all of the other variables in the
model. The “treatment effect” (β3) can therefore be consistently estimated by least squares or
other procedures4. The key independence here is between T and the unobservable variables in
the model (X2). No other approach to evaluation can guarantee that independence. In the
absence of such independence, potential correlations between T and X2 may well result in
inconsistent estimates because, by hypothesis, it is impossible to control completely for these
unobservable variables. Hence, estimates will be confounded by unobservable influences that
affect both treatments received and outcomes. That is, as a rule, they will be subject to
“selectivity biases” of unknown magnitudes and directions.
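The contrast between random assignment and self-selection can be illustrated with a small simulation. This is only a sketch under stated assumptions: the coefficient values, the selection rule, and the function name `difference_in_means` are hypothetical choices made for illustration, and the observed characteristics X1 are omitted for brevity. Under random assignment the simple difference in mean outcomes is a consistent estimate of β3, so no regression is needed here.

```python
import random
import statistics


def difference_in_means(n, beta3, randomized, seed=0):
    """Simulate a simplified equation [1]; return mean(Y | T=1) - mean(Y | T=0)."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        x2 = rng.gauss(0.0, 1.0)  # unobserved characteristic (hypothetical)
        if randomized:
            # Coin flip: treatment is independent of x2 by construction.
            t = 1 if rng.random() < 0.5 else 0
        else:
            # Self-selection: people with high x2 are more likely to enroll.
            t = 1 if x2 + rng.gauss(0.0, 1.0) > 0.0 else 0
        # Illustrative coefficients (assumed): beta0 = 1, beta2 = 2.
        y = 1.0 + 2.0 * x2 + beta3 * t + rng.gauss(0.0, 1.0)
        (treated if t else control).append(y)
    return statistics.fmean(treated) - statistics.fmean(control)


if __name__ == "__main__":
    print("random assignment:", round(difference_in_means(20000, 1.0, True), 2))
    print("self-selection:   ", round(difference_in_means(20000, 1.0, False), 2))
```

With a true effect of 1.0, the randomized contrast should fall close to 1.0, while the self-selected contrast should be far larger because it also picks up the effect of the unobserved variable.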
2 Of course, an equation that is linear in parameters need not be linear in variables, so this model includes many possible non-linearities. Determination of the precise functional form may not be straightforward, however (see the discussion of matching methods later in this paper).
3 Such characteristics may be “unobserved” either because the analyst has chosen not to measure them or because the variables themselves are innately difficult to measure. A brief discussion on the potential importance of more extensive data collection is provided in the concluding section of this paper.
In addition to this unique advantage, estimates from random assignment evaluations have a
number of secondary advantages. First, because the methodology is easily explained and is
familiar from other sciences, policymakers may have greater confidence in them than they would
in those based on more complex calculations. Second, because the underlying methodology is
considered valid, variations among experimental findings (say, across sites) can be viewed as
arising from “real” differences that might be informative (about differences in program
implementation and operation or about interactions with the economic environment, say) rather
than from statistical artifacts. Finally, adoption of random assignment offers at least the
possibility of using structurally-oriented treatment specifications so that complex response
surfaces can be estimated in ways that permit estimates to be made for newly developed program
options (see Conlisk and Watts, 1977 and the discussion in the conclusion to this paper).
These advantages of random assignment have not gone unchallenged, however. The challenges
can be roughly categorized into three general groupings: (1) Methodological; (2) Ethical and
Cost; and (3) Availability of better alternatives. In the following sections we examine each of
these in turn.
4 Later we discuss why β3 cannot always be interpreted as the effect of “the treatment on the treated”.
B. Methodological Challenges to Random Assignment
Methodological challenges to random assignment focus primarily on claims that the estimates of
the treatment parameter (β3) in Equation [1] do not actually measure what they purport to
measure. Here we look at four such claims.
1. Faulty Randomization
Although attempts by clients or program staff to undermine randomization explicitly
(through purposeful creaming of participants, for example) are relatively rare, the correct
implementation of random assignment in social experiments is not so simple as it may at first
appear. Some of the most complex design questions involve the seemingly simple issue of
“when to flip the coin”. Of course, no matter when randomization occurs in the program entry
process Equation [1] can be used to obtain a consistent estimate of β3, but the interpretation of
this estimate as a program “impact” may be subject to considerable ambiguity. For example,
Heckman and Smith (1995) illustrate how cost considerations drove researchers to adopt a less
than optimal placement of random assignment in the JTPA evaluation. By placing random
assignment too long before the actual initiation of specific programs, the evaluation experienced
large-scale attrition among those individuals assigned to the training treatments. It was then
problematic whether estimates of β3 actually reflect the “impact of training”5. Similar problems
occurred in randomizing the prepaid “HMO” treatment in the Health Insurance Experiment
(Manning et al., 1987). There, it proved very difficult for the researchers to implement random
assignment at a stage where it was known that both experimental and control individuals would
have been willing to participate in a prepaid medical plan – a decision that may have depended
importantly on the individual’s health status. In the Canadian context, the design of the very
successful Self-Sufficiency Project (Michalopoulos, et al. 2000) required that all participants
volunteer for the study before randomization was implemented. Whether this procedure
replicates the operation of an earnings supplement in an on-going program is open to question.
The lesson of these experiences is, of course, that considerable care must be taken in the design
of random assignment evaluations. Practical problems in implementation may indeed undermine
the validity of the results. But this problem is not unique to random assignment evaluations. All
statistical estimates of program impacts must define which individuals are program
“participants”. There will always be some arbitrariness in this definition6. And, each potential
definition may introduce its own unique selectivity biases.
5 The standard solution to this problem is to redefine the observed treatment effect as measuring the impact of “an offer of training”. An alternative approach is to use random assignment as an instrumental variable to predict actual participation in training. In this case, the estimate of β3 can be interpreted as the effect of the “treatment on the treated”. See Bloom et al., 1997.
6 For example, a focus only on those who “complete” a particular treatment would clearly yield a biased sample of all individuals who spent any time in it. But a participation definition that includes everyone with any time in a program will also be a biased sample of those who actually received the intended intervention. Similar problems arise in defining members of the comparison sample – especially in deciding on an artificial starting date to be applied to that sample (for a discussion in the context of evaluating employment and training programs in Nova Scotia see Nicholson, 2000).
2. Substitution Bias
A related set of problems occurs in random assignment evaluations in defining precisely what
“treatment” was received by members of the control group. If persons assigned to the control
group choose to enroll in programs similar to those offered to individuals in the treatment
category, or if treated individuals displace those in the comparison group in obtaining new jobs,
the interpretation of β3 requires some care. Although the parameter still measures the difference
between those treated and those not treated it cannot be viewed as an unbiased estimate of the
“impact of the treatment on the treated”. That is, because the control group does not actually
receive a “null” treatment, simple experimental-control differences will not measure the pure
impact of the program. Heckman, Lalonde, and Smith (1999) explore the statistical issues that
surround this problem. In some cases, of course, policymakers may not be specifically interested
in a “pure” impact estimate, but may prefer information about an “incremental” estimate that
measures how program participants fare relative to individuals who experience a baseline set of
services. Hence, all evaluations must ultimately ask whether the impact estimate obtained is the
one of most direct policy relevance.
Again, although the implications of these observations for understanding impact estimates based
on random assignment should be clearly recognized, it should be pointed out that other statistical
methods are not immune to these problems either. In most respects, there are no meaningful
differences between problems in defining the “treatment” to which the comparison group has
been exposed in experimental and non-experimental evaluations.
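A small worked case may help show why the experimental contrast needs care when controls obtain similar services. The notation here is an assumption of this sketch rather than the paper's: suppose the effect β3 in equation [1] is the same for everyone, a fraction p_E of the experimental group and a fraction p_C of the control group actually receive the services (or substitutes with the same effect), and random assignment balances all other determinants of outcomes across the two groups. Then:

```latex
\begin{align*}
E[Y_i \mid \text{assigned to treatment}] - E[Y_i \mid \text{assigned to control}]
  &= \beta_3 \,(p_E - p_C), \\
\beta_3
  &= \frac{E[Y_i \mid \text{assigned to treatment}] - E[Y_i \mid \text{assigned to control}]}{p_E - p_C}
  \qquad \text{if } p_E \neq p_C .
\end{align*}
```

Only when p_E = 1 and p_C = 0 does the simple experimental-control difference equal β3 itself. Otherwise it measures an incremental impact that is scaled down by the difference in service receipt, under the assumptions stated above.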
3. Experimental Artifacts
A primary purpose of social policy evaluation is to allow the analyst to extrapolate from the
results of an experiment or demonstration to derive estimates of the impact of a fully
implemented program. If the experimental treatment, T, does not accurately replicate how
individuals would view a fully implemented program, experimental estimates cannot be
used directly for this purpose. Experimental artifacts include: (1) Reactions by either
experimental or comparison group members to the data collection aspects of an evaluation
(Hawthorne effects); (2) Reactions related to the limited duration of experiments which may not
be replicated in a permanent program; (3) Focusing experimental evaluations on client groups
that differ from those who would be served in an on-going program; and (4) Devoting more
resources to experimental treatments than would be the case under a fully implemented program.
Any of these effects could in principle severely damage the validity of experimental estimates in
making projections of the effects of full implementation.
Because experimental artifacts do not exist in evaluations that use program-generated data, these
are indeed problems unique to random-assignment or other demonstration-type evaluations.
Such problems are, however, well known and analysts have developed a number of innovative
ways for coping with them. For example, a recent job search demonstration in Maryland used a
separate experimental cell in order to evaluate Hawthorne effects (Johnson, et al., 1998).
Similarly, an early and influential paper by Metcalf (1973) shows how the biases from limited
duration experiments can be extrapolated to long term impacts by using precisely formulated
structural models. And many of the random assignment evaluations in unemployment
compensation have taken great pains to ensure that experimental treatments closely follow the
procedures that would be used in on-going programs (Robins and Spiegelman, 2001).
4. Heterogeneous Impacts
The evaluation model specified in equation [1] implicitly assumes that there is a single treatment
effect observed for all participants. Of course, most researchers recognize that estimates of β3
can only be regarded as measuring mean impacts and that other parameters of the distribution of
impacts (such as the standard deviation of impacts, the median impact, impacts at various
quantiles of the distribution, or impacts at the margin for program expansions 7) may also be of
interest to policymakers. Perhaps the most frequently used method for presenting such
information is to look at impacts on subgroups. In general, randomization assures that such
estimates will also be consistent. Because of sample size limitations, especially for narrowly
defined populations, most random-assignment experiments have paid relatively little attention to
subgroup results, however.
An alternative approach to the study of impact heterogeneity is to model the distribution of
impacts directly. One approach is through the use of variants of the random coefficients model
(Swamy, 1971; Greene, 1997). Heckman, Lalonde, and Smith (1999) review a few other
approaches. They also make the important point that if program participation or attrition is
dependent on individuals’ perceptions of the likely impact of the program on them, the
consistency of most statistical procedures (including random assignment) may be called into
question.
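The role of impact heterogeneity can be written compactly. The notation below is an assumption of this sketch, not taken from the paper: the treatment is binary, each individual has an impact β3i that deviates from the mean impact by v_i, and T_i is independent of X1i, X2i, and U_i.

```latex
\begin{align*}
Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_{3i} T_i + U_i,
  \qquad \beta_{3i} = \bar{\beta}_3 + v_i, \quad E[v_i] = 0, \\
E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0]
  &= \bar{\beta}_3 + E[v_i \mid T_i = 1].
\end{align*}
```

If treatment received is also independent of v_i, the last term is zero and the contrast recovers the mean impact. If participation or attrition depends on individuals' anticipated impacts, E[v_i | T_i = 1] need not be zero, which is one way to read the concern about consistency raised above.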
It is important again to note that many of these concerns with impact heterogeneity apply with
equal force to evaluation methods that do not employ random assignment. Examining all of the
ways in which impacts may differ across participants is simply impossible in most
circumstances. Seeking too fine a disaggregation of program impacts may also undermine the
policy rationale for learning about these impacts in the first place.
C. Ethical and Cost Challenges
Implementation of random assignment experiments in social policy often presents ethical
dilemmas. Perhaps the most frequent is that the experimental need to assign some individuals to
a “null” control treatment conflicts with basic societal norms of universal access to programs that
are assumed8 to be beneficial. Sometimes funding limitations can be used as an argument for
service denial – if there are only so many slots available in a program one might as well assign
them randomly. But this rationale often does not sit well with program staff who strongly
believe that resource limitations make the desirability of focusing services on the most needy
even more pressing. Hence, alternative approaches to solving the universality dilemma seem
worth pursuing.
One approach that has been frequently employed in many recent experimental evaluations is to
use only treatments that represent service enhancements over “standard” levels (this is the
approach, for example, in the on-going Self Sufficiency Project in Canada which provides
earnings supplements to former welfare recipients who find full time employment – see
Michalopoulos, et al., 2000). Denial of such enhancements to control group members is not
7 This latter notion of impact is sometimes referred to as the “local average treatment effect” (LATE) for such groups. For a discussion see Heckman, Lalonde, and Smith, 1999.
8 Notice that there is an important ambiguity here. Because program impacts are not known (they are to be determined in the evaluation), service denial may not in fact be harmful.
viewed as conflicting with the universality mandate. The disadvantage of this approach is, of
course, that it permits an evaluation only of the enhancements, not of the standard level of
service. It would require strong structural assumptions indeed to infer the impact of standard
services only from observed reactions to the enhancements. Because many policy initiatives are
incremental this loss of generality may not be a major disadvantage, however.
Some economists have suggested monetary compensation as a way of mitigating the low service
levels provided to control cases in random assignment designs. One of the few examples of
applying this principle in practice was in the Health Insurance Experiment (Manning et al. 1987)
where all participants received a lump sum payment that ensured that even those in high
coinsurance cells were not made financially worse-off by participation in the experiment.
Although no employment or training intervention has sought to buy individuals out of their
universal eligibility, there seems no obvious ethical reason (other than cost) why this could not
be done. Of course, using such a treatment would have to be carefully evaluated to ensure that
payment of the lump sum participation bonus did not interact with the treatment effect to be
estimated.
Random assignment experiments in social policy are costly in financial terms and may also
impose costs on agencies delivering the services. The financial costs of experimentation in on-
going programs consist mainly of research-specific costs because the majority of services would
have been delivered in any case. Still, the administrative costs associated with the incremental
features of random assignment experiments can be significant, amounting to perhaps 30 percent
of the total research budget. For experiments that include separately budgeted treatments, costs
can be quite high. To replicate the three largest of the random assignment experiments
conducted in the United States during the 1970’s (the Seattle-Denver Income Maintenance
Experiment, the Health Insurance Experiment, and the National Supported Work Demonstration)
would easily cost more than $200 million each in today’s dollars (Greenberg and Shroder, 1997).
Is the information gathered from these experiments worth such costs? A comparison of
experimental costs to the costs of the on-going programs on which they focus is perhaps overly
optimistic in this regard. Surely, it might be argued, spending less than one percent of program
costs on experimental evaluations must provide information that can improve program
operations by at least that amount. Such a conclusion is not obvious, however, because, in order
for the information developed in social experiments to have value, that information must in some
way change policymakers’ decisions.9 Although it seems clear that some narrowly focused
experiments have generated information that changed policy10 it is much less clear whether
larger scale experiments have had that effect. For example, probably the two most important
empirical findings from the random assignment experiments of the 1970’s were: (1) That the
substitution effects induced by implicit tax rates on income support payments were relatively
low, at least for males (Burtless, 1986; Hum and Simpson, 1993); and (2) That individuals did
indeed respond to coinsurance rates for medical care (Manning et al. 1987). It seems probable
that both of these findings had some influence in future policy debate over welfare reform and
expansions of government-provided health insurance, respectively. But estimating the extent to
which this information produced “better” policy would be a monumental task.
9 That is, the losses from adoption of non-optimal policies must be reduced as a result of the information gathered.
10 For example, some of the experiments that focused on the enforcement of continuing eligibility rules for unemployment insurance claimants had important effects on how job search provisions were enforced.
D. The Challenge of New Statistical Methodologies
Many of the disadvantages of random assignment experiments could be avoided if it were
possible to estimate impacts directly from existing data on program participants. That is, if
statistical methodologies could be developed to obtain consistent estimates of β3 directly from
program data on equation [1] by adopting procedures that substitute for random assignment,
much of the rationale for controlled experiments would disappear. Over the past twenty years
major advances have been made along two different lines of approach to this problem: (1)
Instrumental variable estimation; and (2) Matching procedures. Neither of these has yet been
shown to obviate the need for random assignment, however. So long as outcomes are
determined by unobservable factors, the validity of such procedures can never be assessed with
certainty.
1. Instrumental Variable Estimation
Instrumental variable (IV) estimation procedures depend crucially on the existence of a
measurable variable that is correlated with program participation and is statistically independent
of untreated outcomes. Inclusion of this variable in the analysis of equation [1] then permits a
separation of the program participation decision from outcome determination and, in principle,
provides a consistent estimate of β3. Perhaps the most famous method for accomplishing this
procedure was developed by Heckman (1979). In that procedure the instrumental variable (or
possibly several such variables) is first used to identify the program participation relationship 11
and then estimates from this relationship are used to obtain selectivity adjusted estimates of
equation [1].
11 In principle this relationship might be identified because of its non-linearity even in the absence of a suitable instrument. In practice identification by this alternative approach yields unreliable estimates.
The primary shortcoming of these procedures is the absence of believable instruments. Most
measurable variables that affect program participation also affect untreated outcomes. In this
case IV estimates of β3 will be very sensitive to exactly how the procedure is employed. The
resulting instability imparts a large degree of subjectivity into which estimates are reported and
how statistical significance is assessed. Because impact estimates derived by IV procedures are
also difficult to explain to policymakers, their influence in social policy evaluation in the United
States has, to date, been rather minimal. The concluding section to this paper discusses how
some of these difficulties with IV estimation might be ameliorated through special uses of
random assignment methods.
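The sensitivity of IV estimates to the choice of instrument can be seen in a standard large-sample expression. As an assumption of this sketch, let Z_i denote the candidate instrument, let the estimator be the simple ratio of sample covariances (without further controls), and let Z_i be uncorrelated with the error term U_i in equation [1]. Substituting equation [1] for Y_i gives:

```latex
\begin{align*}
\hat{\beta}_3^{IV}
  = \frac{\widehat{\mathrm{Cov}}(Z_i, Y_i)}{\widehat{\mathrm{Cov}}(Z_i, T_i)}
  \;\xrightarrow{\;p\;}\;
  \beta_3 + \frac{\beta_1 \,\mathrm{Cov}(Z_i, X_{1i}) + \beta_2 \,\mathrm{Cov}(Z_i, X_{2i})}{\mathrm{Cov}(Z_i, T_i)} .
\end{align*}
```

The term in X1 can in principle be removed by controlling for the observed characteristics, but the term in X2 cannot. Any correlation between the instrument and the unobserved determinants of outcomes is divided by Cov(Z, T), so a weak relationship between the instrument and participation magnifies the inconsistency.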
2. Matching Procedures
Matching procedures pay little attention to the unmeasurable variables (X2) in equation [1] on the
implicit belief that a close enough matching on measured variables (X1) will ameliorate
selectivity problems 12. Early approaches to matching used multidimensional cells to draw
samples of participants and non-participants that closely matched along all the chosen
dimensions. Often these procedures foundered because of dimensionality problems. Exact
matching on a large number of variables proved intractable and the choices involved in reducing
the dimensionality of the problem often proved rather arbitrary. In any event, the matching
procedures did not control for unmeasured determinants of program participation and often this
resulted in estimates that may have been inconsistent.
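The limitation of matching on measured variables can be illustrated with a small simulation of cell matching on a single binary observed characteristic. This is a sketch under stated assumptions: the data-generating process, the coefficient values, and the function name `cell_matching_estimate` are hypothetical. Treated and untreated outcomes are compared within each X1 cell and the cell differences are averaged using the treated counts as weights.

```python
import random
import statistics


def cell_matching_estimate(n, beta3, select_on_unobserved, seed=0):
    """Exact matching on a binary observed x1; returns the weighted impact estimate."""
    rng = random.Random(seed)
    # cells[x1][t] holds outcomes for each (observed cell, treatment status) pair.
    cells = {0: {0: [], 1: []}, 1: {0: [], 1: []}}
    for _ in range(n):
        x1 = 1 if rng.random() < 0.5 else 0  # observed characteristic
        x2 = rng.gauss(0.0, 1.0)             # unobserved characteristic
        index = 1.0 * x1 - 0.5 + rng.gauss(0.0, 1.0)
        if select_on_unobserved:
            index += x2  # participation also depends on the unobservable
        t = 1 if index > 0.0 else 0
        # Illustrative coefficients (assumed): beta0 = 1, beta1 = 1.5, beta2 = 2.
        y = 1.0 + 1.5 * x1 + 2.0 * x2 + beta3 * t + rng.gauss(0.0, 1.0)
        cells[x1][t].append(y)
    total_treated = sum(len(cells[x][1]) for x in cells)
    estimate = 0.0
    for x in cells:
        diff = statistics.fmean(cells[x][1]) - statistics.fmean(cells[x][0])
        estimate += diff * len(cells[x][1]) / total_treated
    return estimate


if __name__ == "__main__":
    print("selection on x1 only:  ", round(cell_matching_estimate(100000, 1.0, False), 2))
    print("selection on x1 and x2:", round(cell_matching_estimate(100000, 1.0, True), 2))
```

When participation depends only on the matched characteristic, the estimate should be close to the true effect of 1.0. When participation also depends on the unobserved characteristic, the matched estimate should remain well above it, however exact the match on X1.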
A more recent approach to matching adopts the propensity score procedures developed by
Rosenbaum and Rubin (1983). These procedures match participants and non-participants
12 In principle matching on X1 may prove preferable to estimating equation [1] by least squares because the latter procedure imposes a specific structural form on the data whereas the former does not. Matching can also illustrate the “support” problem – that there are significant non-overlaps in the characteristics of participants and non-participants that are often obscured when OLS is applied uncritically.
according to their estimated likelihood of participating in the program of interest. Because
such matching takes place over only one dimension13, it is easier to implement than more
complex matching on many characteristics and it may pose fewer support problems14. Some
initial research on these procedures suggested that they perform rather well in that they were able
to reproduce closely some estimates based on random assignment (Dehejia and Wahba, 1998).
A recent reanalysis of these results suggests that this correspondence may be an artifact of the
specific sample used, however (Smith and Todd, 2000). Regardless of how these uncertainties
about the performance of propensity score matching on this specific data set (taken from the JTPA
evaluation) are resolved, the fact remains that this matching procedure does nothing to ensure
that unmeasured variables will not continue to impart selectivity biases into estimates of β3. It is
impossible to prove that such biases would be absent in all cases. Only with random assignment benchmarks
can the validity of the procedure be accurately assessed.
E. Conclusion – Still the Gold Standard
Hence, the overall conclusion of these brief remarks is that random assignment remains the gold
standard for social policy evaluation. Most conceptual objections to the approach apply equally
well to any other approach to evaluation. Although ethical and cost considerations may be
important in specific applications, in many others these are not insurmountable given the policy
interest in knowing program impacts. And the statistical alternatives to random assignment have
so far proven to fall far short of general acceptability (although they may work well in some
applications). In the remainder of this paper I illustrate a few of the ways that random
assignment methodology might be improved and made more general.
13 In practice, however, a variety of specifications for the propensity score equation are often tested using multidimensional matching as a test.
14 In this context such problems arise when there is little overlap between the estimated propensity scores for participant and non-participant groups.
1. Reconsider the advantages of structural modeling
Most recent random assignment experiments have utilized a “black box” approach in which the
treatment is conceptualized as a single binary variable. This is in contrast to the earlier
generation of random assignment experiments that were designed based on rather strong
structural models. Although some authors (Burtless, 1995) have claimed that the black box
experiments have ultimately had a greater impact on policymaking, I believe that case remains
unproven. As stated previously, probably the most lasting contributions to general knowledge
about the values of economic parameters were provided by the income maintenance and health
insurance experiments -- I imagine these results will continue to be used in a wide variety of
policy contexts long after the simpler experiments have been forgotten. Heckman, Lalonde, and
Smith (1999) make the case accurately:
Samples generated under the new model for social experiments [black box experiments] produce evidence that does not accumulate in the same way as evidence accumulated under the old model, because there is no common basis for comparing the “treatment effects” from one experiment to those from another....it is difficult to estimate policy-invariant structural parameters that can be used to evaluate a wide variety of programs never previously implemented. (p. 2084)
Designing random assignment experiments based on structural models also has advantages in
terms of sample allocation decisions and addressing experimental artifacts. Of course, it may be
the case that economic theory is not well-enough developed to specify clear structural models
that represent decisions that are of special interest to policy makers. A particular need in this
regard is the development of more carefully specified models of the process of human capital
accumulation. The goal of such models would be to identify the key parameters that influence
whether job-training programs pay off so that these could be the focus of experimental
estimation. Devising such models is no easy task. But it is unlikely that knowledge on “what
works” will advance much until this is done.
2. Explore innovative ways of defining randomly assigned program enhancements
Ethical considerations constrain most random assignment experiments that focus on existing
programs to use program enhancements as treatments. There are two ways in which this
constraint can be made less severe in terms of the information generated. First, to the extent that
the enhancements can be tied to the “base level” treatment through a structural model, it may be
possible to extrapolate “backwards” to learn something about the impact of that base level.
Second, and more likely, random assignment of enhancements may encourage participation in
the program. Hence, the randomly assigned enhancement can act as an instrumental variable in
estimating the impact of the treatment itself15. For example, random assignment of child-care
vouchers in a training program for young mothers might encourage them to get training. Using
voucher eligibility as a first stage predictor of program participation may circumvent some of the
identification problems typically encountered in instrumental variable estimation procedures.
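The idea of using a randomly assigned enhancement as an instrument can be sketched with a simulated encouragement design. Everything specific here is an assumption made for illustration: the voucher raises participation but has no direct effect on the outcome, the treatment effect is constant, and the function name `encouragement_design` and the coefficient values are hypothetical. The estimator is the ratio of the voucher's effect on outcomes to its effect on participation.

```python
import random
import statistics


def encouragement_design(n, beta3, seed=0):
    """Return (ratio estimate using the random voucher, naive participant contrast)."""
    rng = random.Random(seed)
    y_by_z = {0: [], 1: []}  # outcomes grouped by voucher assignment
    t_by_z = {0: [], 1: []}  # participation grouped by voucher assignment
    y_by_t = {0: [], 1: []}  # outcomes grouped by actual participation
    for _ in range(n):
        x2 = rng.gauss(0.0, 1.0)            # unobserved motivation
        z = 1 if rng.random() < 0.5 else 0  # randomly assigned voucher
        # Participation depends on the unobservable and on the voucher.
        t = 1 if x2 + 1.0 * z + rng.gauss(0.0, 1.0) > 0.5 else 0
        # The voucher affects the outcome only through participation (assumed).
        y = 1.0 + 2.0 * x2 + beta3 * t + rng.gauss(0.0, 1.0)
        y_by_z[z].append(y)
        t_by_z[z].append(t)
        y_by_t[t].append(y)
    outcome_gap = statistics.fmean(y_by_z[1]) - statistics.fmean(y_by_z[0])
    takeup_gap = statistics.fmean(t_by_z[1]) - statistics.fmean(t_by_z[0])
    naive = statistics.fmean(y_by_t[1]) - statistics.fmean(y_by_t[0])
    return outcome_gap / takeup_gap, naive


if __name__ == "__main__":
    iv, naive = encouragement_design(100000, 1.0)
    print("voucher-based estimate:", round(iv, 2))
    print("naive participant gap: ", round(naive, 2))
```

Under these assumptions the voucher-based ratio should fall near the true effect of 1.0, while the naive comparison of participants with non-participants should be inflated by selection on the unobserved variable. The ratio is noisier than a direct experimental contrast because it divides by the difference in participation rates.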
The use of monetary side payments as a program “enhancement” has been uncommon in social
experimentation. Reasons for this probably relate both to cost and to fears that public knowledge
of such payments may bring claims of giving money away. But there are good reasons why such
payments should be reconsidered. Most important, availability of cash as part of a treatment
may make it feasible to implement treatments that would otherwise be ruled out by ethical
15 The use of random assignment to generate instrumental variables provides a more robust procedure than would, for example, the collection of more information because selectivity could remain a problem with such additional information. Still, additional data might mitigate inconsistencies in IV estimation. Two particularly promising areas in which additional data collection might be considered are: (1) Measurement of information and psychological attributes of clients that might predict program participation; and (2) Measurement of characteristics of the program entry process that affect participation (for an initial attempt at this process in Nova Scotia see Nicholson, 2000).
concerns (for example, denials or restrictions of “universal” services). Surely economists
should feel comfortable with such treatments and perhaps they can convince others of their
usefulness16.
3. Devote additional resources to formal aspects of random assignment design.
Designs of most recent random assignment experiments have stressed operational aspects of
randomization and how it can best be coordinated with on-going program functions. Because the
implementation of random assignment can have important influences on how experimental
estimates are to be interpreted, such a focus does provide valuable information on how to avoid
randomization biases. However, the focus on randomization of simple “black box” treatments
has drawn attention away from the aspects of formal experimental design that provided
many of the early insights from social experiments. These include important topics such as
treatment definition, response surface specification, optimal sample allocation, and developing a
statistical methodology appropriate to such designs. Devoting additional resources to the design
phases of random assignment evaluations might lead to advances in these areas similar to those
that were made in the 1970s. Adapting advances in experimental design from other research
areas such as public health or engineering might be especially promising in this regard. Clearly,
after a gap of twenty-five years, now may be a good time to revisit basic methodological issues
in the design of random assignment social experiments.
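To make the notion of optimal sample allocation concrete, consider a deliberately simplified two-group case (this is not the Conlisk-Watts response surface model, only a textbook analogue). Suppose a fixed budget must be split between treatment and control observations with different per-person costs, and the goal is to minimize the variance of the difference in means. The standard result is that each group's sample size should be proportional to its outcome standard deviation divided by the square root of its unit cost. The budget, costs, and function names below are hypothetical.

```python
from math import sqrt


def optimal_allocation(budget, cost_t, cost_c, sd_t=1.0, sd_c=1.0):
    """Sample sizes minimizing Var(mean_t - mean_c) subject to a fixed budget.

    Minimizes sd_t**2 / n_t + sd_c**2 / n_c subject to
    cost_t * n_t + cost_c * n_c = budget (sizes are not rounded).
    """
    denom = sd_t * sqrt(cost_t) + sd_c * sqrt(cost_c)
    n_t = budget * (sd_t / sqrt(cost_t)) / denom
    n_c = budget * (sd_c / sqrt(cost_c)) / denom
    return n_t, n_c


def variance_of_difference(n_t, n_c, sd_t=1.0, sd_c=1.0):
    """Variance of the difference between two independent sample means."""
    return sd_t ** 2 / n_t + sd_c ** 2 / n_c


if __name__ == "__main__":
    # Hypothetical figures: treatment observations cost four times as much.
    budget, cost_t, cost_c = 100_000, 400, 100
    n_t, n_c = optimal_allocation(budget, cost_t, cost_c)
    n_equal = budget / (cost_t + cost_c)   # equal-sized groups, same budget
    print("optimal sizes:", round(n_t, 1), round(n_c, 1))
    print("variance, optimal split:", variance_of_difference(n_t, n_c))
    print("variance, equal split:  ", variance_of_difference(n_equal, n_equal))
```

In this illustration the cheaper control group receives twice as many observations as the treatment group, which yields a lower variance than equal group sizes purchased with the same budget.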
4. Increase the availability of experimental data.
Although some social experiments have made serious efforts to make their data available to
other researchers, this has not always been the case. Often issues of confidentiality, costs of
preparing public use data sets, or the simple desire of researchers to keep their data to themselves
have resulted in very limited availability. From a scientific point of view this is clearly an
undesirable state of affairs. Reanalysis of the data from a specific experiment can often turn up
unexpected results or suggest alternative ways to proceed. The ability to compare estimates
across experiments using common data definitions can often yield important insights about the
causes of impact differences. More generally, the public availability of data can help to ensure
that the results from experiments can contribute to the overall incremental accumulation of
knowledge. For these reasons, most future random assignment evaluations should contain
explicit funding for the creation of public use data sets. Resources for further examinations of
those data sets and for conducting pooled analyses of several experimental data sets should also
be available.
16 Of course, as pointed out earlier, one must be careful with the use of income supplements to understand how these may interact with the treatment parameters of interest.
References
Bloom, H.S., L.L. Orr, S.H. Bell, G. Cave, F. Doolittle, W. Lin, and J.M. Bos. 1997. “The Benefits and Costs of JTPA Title II-A Programs.” Journal of Human Resources, Summer, 32(3), pp. 549-576.
Burtless, G. 1995. “The Case for Randomized Field Trials in Economic and Policy Research.” Journal of Economic Perspectives, Spring, 9, pp. 63-84.
Conlisk, J., and H.W. Watts. 1977. “A Model for Optimizing Designs for Estimating Response Surfaces.” In H.W. Watts and A. Rees (Editors), The New Jersey Income Maintenance Experiment, Volume III. New York: Academic Press. Pp. 430-440.
Dehejia, R., and S. Wahba. 1998. “Propensity Score Matching Methods for Non-experimental Causal Studies.” National Bureau of Economic Research (Cambridge, MA) Working Paper No. 6829.
Doolittle, F., and L. Traeger. 1990. Implementing the National JTPA Study. New York: Manpower Demonstration Research Corporation.
Greenberg, D., and M. Shroder. 1997. The Digest of Social Experiments, 2nd edition. Washington, DC: The Urban Institute Press.
Greene, W.H. 1997. Econometric Analysis, third edition. Upper Saddle River, New Jersey: Prentice-Hall.
Heckman, J. 1979. “Sample Selection Bias as a Specification Error.” Econometrica, 47, pp. 153-161.
Heckman, J., R. LaLonde, and J. Smith. 1999. “The Economics and Econometrics of Active Labor Market Programs.” In Orley Ashenfelter and David Card, eds., Handbook of Labor Economics, Vol. 3A. Amsterdam: North-Holland, pp. 1865-2097.
Heckman, J.J., and J.A. Smith. 1995. “Assessing the Case for Social Experiments.” Journal of Economic Perspectives, Spring, 9, pp. 85-110.
Hum, D., and W. Simpson. 1993. “Economic Response to a Guaranteed Annual Income: Experience from Canada and the United States.” Journal of Labor Economics, 11(1, Part 2), pp. S263-S296.
Johnson, T.R., D.H. Klepinger, J.M. Joesch, and J.M. Benus. 1998. “Evaluation of the Maryland Unemployment Insurance Work Search Demonstration.” U.S. Department of Labor, Unemployment Insurance Occasional Paper 98-2.
Manning, W.G., J.P. Newhouse, N. Duan, E.B. Keeler, and A. Leibowitz. 1987. “Health Insurance and the Demand for Medical Care: Evidence from a Randomized Experiment.” American Economic Review, 77(3), June, pp. 251-277.
Metcalf, C. 1973. “Making Inferences from Controlled Income Maintenance Experiments.” American Economic Review, June, pp. 478-483.
Michalopoulos, C., D. Card, L.A. Gennetian, K. Harknett, and P.K. Robins. 2000. The Self-Sufficiency Project at 36 Months: Effects of a Financial Work Incentive on Employment and Income. Social Research and Demonstration Corporation.
Nicholson, W. 2000. “Assessing the Feasibility of Measuring Medium Term Net Impacts of the EBSM Program in Nova Scotia.” Working Paper prepared for HRDC, March.
Robins, P.K., and R.G. Spiegelman. 2001. Reemployment Bonuses in the Unemployment Insurance System: Evidence from Three Field Experiments. Kalamazoo, MI: W.E. Upjohn Institute.
Rosenbaum, P., and D. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika, April, 70(1), pp. 41-55.
Smith, J., and P. Todd. 2001. “Reconciling Conflicting Evidence on Performance of Propensity Score Matching Methods.” American Economic Review Papers and Proceedings, May, 91(2), pp. 112-118.
Swamy, P. 1971. Statistical Inference in Random Coefficient Regression Models. New York: Springer-Verlag.