
Untangling Selection Effects in Studies of Coercion

Eugene Gholz Assistant Professor

Lyndon B. Johnson School of Public Affairs University of Texas

[email protected]

Daryl Press Associate Professor

Department of Political Science University of Pennsylvania

[email protected]

Abstract: For more than a decade, scholars have recognized that studies of coercion are plagued by selection effects. Analyses that fail to account for the strategic decisions that lead countries to initiate (or not) crises will yield biased results. Unfortunately, recent attempts to improve research design to account for selection effects are flawed. We use a formal model to demonstrate that selection effects create complex, non-monotonic relationships between key parameters (e.g., defender interests and power) and observable crisis outcomes. Whether the real connections between these variables and deterrence are positive, negative, or non-monotonic, scholars will observe complex non-monotonic relationships in datasets of crisis dynamics, greatly complicating empirical analyses of coercion. We describe a better set of research approaches – both quantitative and qualitative – that scholars can use to mitigate the problems of selection effects as they study coercion. We provide two short case studies to illustrate how the recommendations for qualitative research can be carried out.

DRAFT – DO NOT CITE

For years scholars of international politics have labored to test theories of

coercion. The main body of this research uses data on crisis outcomes to determine

whether any of a host of variables – such as formal alliances, domestic political

institutions, public statements by leaders, military force balances – affect the odds of

successful deterrence or compellence.1 The goal is to identify factors that affect the

likelihood of wars (that is, to understand deterrence), to discover what makes countries

accede to adversaries' demands (that is, to understand compellence), and to understand

when countries issue explicit threats or resort to brinkmanship to pursue their foreign

policy goals (that is, to understand crisis initiation).

Selection effects greatly complicate efforts to test theories of coercion, because

they make the easily observable cases of crisis interaction a non-representative sample of

countries. The problem arises because many potential challengers are deterred before

they initiate a crisis, and many defenders cede the issue at hand rather than take a stand

they are likely to abandon later. Datasets of crises, in which by definition both

challengers and defenders have forgone opportunities to concede, over-represent the most

motivated challengers and defenders.2 The implication for international relations

scholarship is profound: the statistical relationships that are observable in the data on

crisis outcomes may differ significantly from the actual relationships that exist between

key variables and successful coercion.3

In this paper we demonstrate that several apparently plausible research strategies

for studying coercion despite the selection effects do not work. James Fearon proposes

reversing the sign of the expected empirical relationships; Paul Huth and Todd Allee

1 For a useful review, see Huth 1999. 2 Fearon 1992. 3 Smith 1999.


recommend using an off-the-shelf selection estimator like the Heckman probit.4 We use a

formal model of crisis dynamics to show the selection effect more precisely than previous

work. The selection effect is considerably more complex than scholars have assumed:

under some conditions, selection effects strongly influence the types of countries that

appear in observable datasets of crisis interactions, but under other conditions, the

selection effect is relatively weak. Our model reveals how changes in parameter values

affect both the real rate of successful coercion and the rate that scholars can observe in

datasets of crisis outcomes. Overall, the selection effect should lead scholars to observe a

non-monotonic relationship between key independent variables and deterrence success,

whether the underlying real relationship is positive, negative, or non-monotonic.

The results of our model should not be confused with previous work on non-

monotonic relationships in crisis dynamics.5 For example, according to some models of

crisis dynamics, the relationship between the balance of power and the probability of war

should be non-monotonic, because the identity of “challengers” and “defenders” is

endogenous. As weak defenders become more powerful, war becomes less likely until

the “defender” becomes so strong that it might actually choose to start a war, in essence

becoming a challenger. Models that incorporate this dynamic, though, address the real

relationship between power and the probability of war, and their authors assume in their

attempts to empirically test them that the real relationship maps in a straightforward way

to the relationships that we can observe in datasets on crisis outcomes. The implication

of our model is that this assumption is not correct, hence their attempts to empirically test

their models are suspect. Whatever real relationship exists between the balance of power

4 Fearon 1994; Huth and Allee 2004. 5 Bueno de Mesquita, Morrow, and Zorick 1997.


or signals of interest and successful coercion – whether linear or not, monotonic or not –

the observable relationship is complex and non-monotonic.6

The problems that we identify with the current approach to studying coercion are

not quibbles. They raise fundamental questions about a decade of empirical scholarship

on coercion. Variables that actually have a powerful effect on the real rate of deterrence

success, or the real rate of success at compellence, may have been overlooked. Even

worse, mistakes in interpreting selection effects and the predictions of models of coercion

can lead to dangerous inferences that are exactly the opposite of the truth.

We offer suggestions to improve research design in studies of coercion using both

quantitative and qualitative methods. One promising route is for scholars to use

statistical estimators that are derived directly from the payoffs and structure of a crisis

game.7 Custom estimators like this would serve as alternatives to the off-the-shelf

estimators like probit and logit that are widely used in the international relations

literature. This approach directly captures the effects of strategic interaction (and hence

internalizes the selection effect) while maintaining the traditional advantages of statistical

research. The downside of this research design is that it places high demands on data

quality and on scholars' confidence in their precise model specification.

6 Bueno de Mesquita, Morrow, and Zorick 1997 use a quadratic term in a logit model to test their model's prediction of a non-monotonic relationship between the balance of power and the probability of war. Signorino 1999 argues that their choice of logit (shared by many other scholars in their studies of crises) is not appropriate for datasets that are generated through strategic interactions. However, Signorino's tests do not help us to understand whether logit fails due to the underlying strategic relationship, due to the complex selection effect, or both. Signorino 1999 shows that there is a problem with off-the-shelf estimators, but he cannot explain the source of the problem. This paper shows the crisis dynamics much more clearly. Our suggestions for improving research design complement Signorino's suggestions. 7 Lewis and Schultz 2003. See also Signorino 1999; Smith 1999.


A second promising approach using large-n datasets is for scholars to replace

crisis outcomes with crisis initiation as their dependent variable.8 This research design

mitigates problems caused by selection effects, and the relationship between key

independent variables and the decision to issue a challenge should at least be monotonic,

but this approach has its own limitations: it allows scholars to answer only some of the

important questions about coercion, and it does not solve all selection effects problems.

Although neither of these two quantitative approaches is perfect, each can be used to

strengthen scholars' understanding of coercion.

The third promising research approach is almost universally overlooked as a

method for mitigating selection effects: case studies. The central problem in studying

coercion results from the indeterminate relationship between actual rates of successful

coercion and observable patterns of crisis outcomes. To avoid that problem, scholars can

study the process of decision making rather than the outcomes of crises. Scholars can use

archival material to directly observe how leaders assess the seriousness of a given threat.

What did the leaders discuss? Did they debate the significance of the adversary’s

domestic political institutions, note the existence or absence of formal alliances, and

believe the adversary’s public statements? Or did other factors weigh more heavily

during their deliberations? Of course, selection effects will still lurk in the background

of these case studies: crises will involve a non-representative sample of countries. But

the selection effect will have a less pernicious impact because inferences are not being

8 See, for example, Leeds 2003.


drawn from the frequency of coercive success but from the types of information that

leaders focus on, discuss, and debate during crises.9

The remainder of this paper is divided into four sections. The first section

describes the canonical model of military deterrence crises and explains why it is

susceptible to selection effects. The second section shows that the implications of the

multi-stage model for crisis outcomes are more complex than past work has revealed and

details the problems that the complexity poses for empirical studies. The third section

demonstrates how case studies can mitigate the problems of selection effects. The

conclusion emphasizes the importance of the full specification of the formal model, the

necessity of carefully linking the design of statistical analyses to the formal model, and

the value of careful case studies for better tests of theories of coercion.

Selection Effects in Studies of Deterrence

Scholars usually model deterrence interactions as occurring in two stages: general

deterrence and immediate deterrence. General deterrence takes place during non-crisis

periods when one country (a challenger) considers threatening a second country (the

protégé); typically a third country in the model (the defender) may come to the protégé’s

aid.10 General deterrence succeeds—invisibly—when prospective challengers decide not

to threaten the protégé; if a threat is issued, general deterrence has failed. Once a protégé

9 Case studies have been criticized specifically for their vulnerability to selection effects: researchers may choose non-representative or biased cases (King, Keohane, and Verba 1994). These critics are correct, and case study researchers must choose their cases carefully. But this researcher-induced selection bias is separate from the selection effect introduced by the strategic behavior of countries (Collier, Mahoney, and Seawright 2004). Properly chosen case studies can dramatically mitigate the impact of the selection effects that plague datasets on crises. 10 Deterrence theorists distinguish between direct and extended deterrence. Direct deterrence refers to efforts to prevent attacks on oneself; extended deterrence is an effort to prevent attacks on others. The text describes extended deterrence situations, but this argument applies to direct deterrence, too. In those cases, the issue over which the challenger threatens can be considered the "protégé."


is threatened, the defender must decide whether to respond by issuing a deterrent warning

or by quietly conceding. If the defender promises to come to the protégé's aid, an

immediate deterrence crisis begins. A subsequent attack by the challenger would

constitute a failure of immediate deterrence; if the challenger instead refrains from

attacking (backs down from its earlier threat), then immediate deterrence has succeeded.11

Rational deterrence theory suggests that a powerful defender with a strong interest

in a given protégé should be more effective at deterring attacks. Early quantitative

analyses of deterrence crises applied that theory.12 In their empirical tests, scholars

expected to find a positive correlation between immediate deterrence success (i.e., a

challenger's decision to back down during a crisis) and variables that reflect the

defender's capabilities and interest in the protégé. We call this expected positive

relationship the "traditional prediction" of deterrence theory.

The straightforward approach for testing theories of deterrence was challenged by

scholars, notably including James Fearon, who realized that selection effects distort the

easily observable data on crisis dynamics.13 To make valid inferences, scholars must

consider the choices that challengers and defenders make before crises begin. Other

things being equal, countries are more likely to threaten protégés if potential defenders

are too weak to resist effectively or are likely to give in without a fight. Only highly

11 Some critics of the model (e.g., Lebow and Stein 1989 and 1990) question whether the absence of a threat necessarily reflects general deterrence success and whether a challenger’s decision to back down during a crisis means that immediate deterrence necessarily succeeded. Sometimes countries—even those that have issued threats—have no interest in attacking, irrespective of any deterrence calculations. These critics raise important points. In this paper, however, we limit ourselves to critiquing the logic behind the research design of many empirical studies of coercion. For that purpose, we assume that investigators can correctly define the boundaries of each observation in their datasets. Later in this paper, we argue that careful case studies can avoid the problems of selection effects; they can also reduce the danger of misidentifying a challenger’s lack of interest as deterrence. 12 See, for example, Huth and Russett 1984 and 1990; Huth, Gelpi, and Bennett 1993. 13 The seminal works are Fearon 1992, 1994, and 2002.


motivated challengers (i.e., those who would rather fight than accept the status quo) will

threaten protégés in the sphere of powerful and credible defenders. In datasets of crises,

therefore, challenger motivation will be correlated with defender credibility and power.

The implication of this argument is both counter-intuitive and profound: the defenders

who look most fearsome will usually fail at deterrence during crises. Credible and

capable defenders might succeed at general deterrence, but scholars only observe

immediate deterrence in datasets of crisis outcomes. Consequently, when scholars

estimate the relationship between variables that reflect the defender's capabilities and

credibility and immediate deterrence success, they should expect to find a negative

correlation. We call this the "selection effects prediction" of deterrence theory.14

Insert Figure 1

The selection effects argument is clearer when it is illustrated formally. The

model in figure 1 depicts a deterrence crisis in four stages and describes the payoffs that

the challenger and defender receive for each outcome.15 First a potential challenger

decides whether to threaten a protégé or simply accept the status quo; the value of the

14 Fearon 1994 expects a negative correlation between immediate deterrence success and variables that reflect defender interest but a positive correlation with variables that reflect defender power. This distinction is based on the assumption that although challengers update their assessments of defender interest during crises, they do not learn about defender power. We disagree; challengers can learn about both defender interest and power from crisis behavior. Therefore, the most internally consistent version of the “selection effects prediction” would treat interests and power similarly: expecting both variables to be inversely related to immediate deterrence success in crises. We use the latter version of the selection effects prediction, but our model results in the next section contradict both versions. 15 The model is based on Fearon 1992 with modifications explained in the text and the appendix. This four-stage model allows for uncertainty about both the challenger and the defender: the power and / or interests of both actors can be modeled as private information, so that each must act based on his predictions about what his adversary will do. Other scholars (e.g., Schultz 1999; Lewis and Schultz 2003) have used three-stage models or even a two-stage model (Signorino and Yilmaz 2003) because they are simpler to solve mathematically yet still demonstrate the importance of strategic behavior for empirical studies of coercion. However, these simpler models eliminate the possibility of defender bluffs and therefore diverge from reality. That choice substantively affects the calculations of forward-looking challengers in the model. The four-stage model is the simplest that captures challengers' and defenders' simultaneous incentives to misrepresent their motivations and capabilities (the real-world situation).

[Figure 1: A Four Stage Deterrence Encounter. Game tree in which the challenger (C) and the defender (D) move alternately: C chooses Threaten or Don't Threaten; D chooses Mobilize or Don't Mobilize; C chooses Attack or Don't Attack; D chooses Fight or Don't Fight. Outcomes and payoffs (challenger, defender): Status Quo (0, 0); Def. Acquiesces (AC, -AD); Chal. Backs Down (-RC, RD); Def. Backs Down (AC+RC, -AD-RD); War ((1-p)*(AC) - FC, (1-p)*(-AD) - FD). C = Challenger; D = Defender. Note: Payoffs are described in text.]


status quo is normalized to zero for both countries. If the challenger decides to threaten

and the defender concedes (chooses “not mobilize”), the challenger seizes the protégé

and gains AC; the defender loses AD. AC and AD represent the challenger's and defender's

levels of interest in the protégé. If, on the other hand, the defender mobilizes, the

challenger must decide whether to back down or attack. Backing down, however, is not

free. The challenger would pay an audience cost of RC (a payoff of –RC) if he were to retreat from

his threat; the defender would enjoy a foreign policy victory, receiving RD. If the

challenger does attack, the defender has a final choice to make: he can back down, or he

can carry through on his deterrent threat and fight to defend the protégé. Backing down

entails the loss of the protégé and also an audience cost (-AD-RD), while the challenger

gets AC+RC for seizing the protégé and defeating the defender. The final outcome arises

if the defender decides to fight: both the challenger and defender receive their expected

value for war, which is a function of the probability that the defender will win (p), the

value of the protégé, and the cost of fighting (FD for the defender and FC for the

challenger).16 For the defender, the expected value for war reduces to (1-p)(-AD) – FD;17

the challenger receives (1-p)(AC) – FC for fighting.
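
The war payoffs can be written out as expected values. The defender's line restates the expansion given in the footnote to this passage; the challenger's line is the analogous expansion and assumes, as the expression in the text implies, that the challenger gains nothing beyond the status quo when the defender wins. The labels U_D and U_C are notation introduced here for illustration only.

```latex
\begin{align*}
U_D(\text{war}) &= p \cdot 0 + (1-p)(-A_D) - F_D = -(1-p)A_D - F_D \\
U_C(\text{war}) &= p \cdot 0 + (1-p)A_C - F_C = (1-p)A_C - F_C
\end{align*}
```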

16 The original formulation in Fearon 1992 uses a composite "value for war" parameter rather than separating the power-related variables (probability of winning the war and the cost of fighting) from the interest variables (AC and AD). This conflation of power and interests makes it difficult to follow the mechanisms by which real-world independent variables affect outcomes. For example, researchers interested in the effect of democracy on international relations have suggested that democracy might make countries more sensitive to the costs of war (reducing the value for war by increasing F in our model) and that democracies might be more likely to win their wars (increasing the value for war by changing p in our model). A composite “value for war” parameter complicates efforts to test these theories. Furthermore, each country’s value for war presumably is directly related to the value that it assigns to the protégé, yet the composite "value for war" specification does not take that into account. Finally, using the value for war as the outcome payoff at the bottom of the tree hides a relationship between the challenger's payoffs and the defender's payoffs: the probability that the challenger will win a war is just one minus the probability that the defender will win the war (not accounting for ties), meaning that the payoffs at the bottom of the tree should be correlated. Fearon 1997 revises the payoff for war along the lines used here, but other recent efforts to model crisis interaction have continued to use the less transparent formulation (Schultz 1999; Lewis and Schultz 2003).


With complete information, the game tree in Figure 1 has only three possible

outcomes: the status quo, defender acquiesces, and war. If a defender is unwilling to

fight (and the challenger knows it), the challenger will always threaten, and the defender

will always acquiesce (i.e., "not mobilize"). If, on the other hand, a defender values the

protégé highly enough or is powerful enough to have a high expected value for war, both

the challenger and the defender will know that the defender will fight if the challenger

attacks. A challenger then would have only two options: either 1) accept the status quo

or 2) “threaten,” “attack,” and fight a war over the protégé. Under complete information,

therefore, immediate deterrence never succeeds. Immediate deterrence can only succeed

when a challenger either bluffs or probes – impossible strategies with complete

information.18
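
The complete-information argument can be checked by backward induction on the Figure 1 payoffs. The following is a minimal sketch, not the authors' appendix solution: the function name `solve_crisis` and the tie-breaking rule (a player takes the more aggressive branch only when it is strictly better) are assumptions made for illustration, and all parameters are taken to be positive.

```python
def solve_crisis(a_c, a_d, r_c, r_d, f_c, f_d, p):
    """Solve the four-stage game by backward induction under complete information.

    Returns (outcome_label, (challenger_payoff, defender_payoff)).
    Assumption: ties are broken in favor of the less aggressive move.
    """
    # Terminal payoffs as (challenger, defender), following Figure 1.
    status_quo = ("status quo", (0.0, 0.0))
    acquiesce = ("defender acquiesces", (a_c, -a_d))
    chal_backs_down = ("challenger backs down", (-r_c, r_d))
    def_backs_down = ("defender backs down", (a_c + r_c, -a_d - r_d))
    war = ("war", ((1 - p) * a_c - f_c, (1 - p) * (-a_d) - f_d))

    # Stage 4: the defender fights only if war beats backing down.
    after_attack = war if war[1][1] > def_backs_down[1][1] else def_backs_down
    # Stage 3: the challenger attacks only if that beats backing down.
    after_mobilize = (
        after_attack if after_attack[1][0] > chal_backs_down[1][0] else chal_backs_down
    )
    # Stage 2: the defender mobilizes only if that beats conceding the protege.
    after_threat = (
        after_mobilize if after_mobilize[1][1] > acquiesce[1][1] else acquiesce
    )
    # Stage 1: the challenger threatens only if that beats the status quo.
    return after_threat if after_threat[1][0] > status_quo[1][0] else status_quo


# A defender with very high fighting costs concedes to any threat.
weak_defender = solve_crisis(a_c=1, a_d=1, r_c=0.5, r_d=0.5, f_c=0.2, f_d=2.0, p=0.5)
# A resolute defender facing a highly motivated challenger ends in war.
both_resolute = solve_crisis(a_c=4, a_d=4, r_c=0.5, r_d=0.5, f_c=0.2, f_d=0.2, p=0.5)
# A resolute defender facing an unmotivated challenger is never threatened.
deterred = solve_crisis(a_c=1, a_d=4, r_c=0.5, r_d=0.5, f_c=2.0, f_d=0.2, p=0.9)
```

Under these assumptions the two "backs down" outcomes are never reached on the equilibrium path: a challenger who expects to back down would not threaten, and a defender who expects to back down would not mobilize. That is the sense in which immediate deterrence cannot succeed with complete information.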

Immediate deterrence is possible, however, if there is incomplete information.

Challengers and defenders may not know their relative military capabilities (p) or the cost

of a war (FC and FD). More often in studying crisis behavior, scholars have focused on

incomplete information about interests: a country can never be sure about the value its

adversary places on a given protégé.19 For example, not knowing the true level of the

defender's interest creates an incentive for unmotivated challengers to initiate crises as a

way to gain information—all the while knowing that they will back down if the defender

mobilizes. The challenger's goal in issuing the threat is to find out if the defender cares

17 The defender’s payoff for war is [p*(0) + (1-p)*(-AD)] – FD, which reduces to the expression in the text. 18 A bluff is a threat that a country knows it will not carry out if an adversary issues a counter-threat. A probe is a threat whose execution depends on the intensity of the adversary's response. 19 Huth 1999. Many empirical articles try to estimate the importance of various signals of interest. For example, does signing a formal alliance treaty increase the defender's credibility, hence increasing deterrence? Do public statements by leaders (like President Kennedy's famous "Ich bin ein Berliner") increase credibility and deterrence? Do tripwire deployments of troops that are too small to affect the probable outcome of a war send a strong signal of defender interest? These and other independent variables (e.g., measures of a defender's intrinsic interest in a protégé) are all presumed to correlate with a challenger's estimate of the defender's level of interest in a protégé.


enough about the protégé to be willing to pay the cost of mobilizing; if the defender is

willing to pay that cost, then the challenger can update and increase its assessment of the

probability that the defender also cares enough about the protégé to be willing to fight for

it.20 In the model with uncertainty, immediate deterrence succeeds when a challenger

who is simply probing or bluffing encounters a defender who mobilizes and is relatively

likely to be willing to fight.

The formal model described above can incorporate the assumption of incomplete

information, thereby demonstrating the selection effect. If a challenger does not know

how much the defender values the protégé (i.e., the actual value of AD), it must use its

best estimate of the defender's level of interest, K, to calculate the expected value of a

bluff or probe.21 When K is big, the expected value of a bluff is small because the

defender appears relatively likely to mobilize and fight.22 Therefore in datasets of crises,

a credible defender (e.g., high K) is unlikely to be paired with a bluffer; a less credible

defender could face either a bluffer or a highly motivated challenger. Because the rate of

immediate deterrence is determined by the ratio of bluffers to committed challengers,

20 It is likely that challengers and defenders also gain information about the power variables during a crisis. For example, a challenger might learn how much of its military the defender was willing to deploy to the protégé country, how smoothly the defender's troops were mobilized, and how many of the defender's allies were willing to mobilize, too. All of that information would allow the challenger to update its assessment of the probability that it would win a war over the protégé. The model could be readily recast to make p rather than AD and AC the incomplete information parameter. 21 Many of the tools of statecraft available to defenders in extended deterrence situations are ways of signaling their level of interest, thereby affecting K. For example, signing a mutual defense pact presumably would increase the value of K. 22 Challengers only benefit by bluffing when the defender chooses “not mobilize.” But as K increases, the challenger believes that “fight” becomes more attractive to the defender relative to surrendering the protégé. The derivative of the payoff for “fight” with respect to AD is greater than the derivative of the payoffs of both “don’t mobilize” and “don’t fight” (that is, p-1 > -1).


actions that increase K should correlate negatively with the observed rate of immediate

deterrence success – the selection effects prediction.23

In sum, recent scholarship on deterrence, increasingly attuned to selection effects,

argues that previous analyses systematically misinterpreted their data on deterrence. But

the studies that model selection effects offer good news: if we consider the strategic

behavior that led countries into crises, we can correct our interpretation of statistical tests

of deterrence theory. Specifically, those attributes of a crisis that correlate with

immediate deterrence failure should be emulated by potential defenders, because they are

successfully screening out all but the highly motivated (undeterrable) challengers before

crises even begin. In other words, the results that the early studies of deterrence

produced should be reversed; the signs predicted for coefficients in deterrence theory

regressions should be "flipped." This simple correction allegedly helps us to understand

the true relationship between various independent variables and deterrence success.

A More Complete View of the Multi-Stage Model

By recognizing the danger of selection effects in data on deterrence, scholars have

identified a critical flaw in early studies. Unfortunately, scholars have drawn the wrong

empirical predictions from the multi-stage model of deterrence, leading to incorrect

interpretations of data on crisis outcomes. The signs of the coefficients of estimated

relationships between independent variables and immediate deterrence success may be

misunderstood, and for many samples, the estimates will be biased toward zero.

23 The reason that unmotivated challengers usually do not threaten a highly credible defender – that is, the reason for the selection effect – is that there are costs associated with backing down during a crisis (RC and RD). If prospective attackers faced no costs from backing down, then there would be no selection effect. On audience costs, see Schultz 2001; Fearon 1997.


The central problem is that the traditional predictions — i.e., that greater defender

power, interest in the protégé, and credibility make immediate deterrence success more

likely — and the selection effects predictions — i.e., the exact reverse — are both

sometimes correct.24 In other words, actions that strengthen general deterrence (reduce

the number of challenges) will sometimes cause the observed probability of immediate

deterrence successes to increase, and other times the same actions will cause the rate of

immediate deterrence success to decline. Scholars will find it very difficult to determine

a priori which situation applies for a given sample — or whether each situation applies

for a subset of the data. The result is that deterrence theory makes no determinate

predictions about patterns of immediate deterrence success in scholars' datasets, and

scholars cannot test specific hypotheses about coercion (e.g., whether local military force

advantages bolster deterrence) by drawing straightforward inferences from patterns of

crisis outcomes.

A detailed look at the effects of an increase in defender credibility shows its two

countervailing effects on the likelihood of immediate deterrence. According to the logic

of the selection effects prediction, it reduces the frequency of immediate deterrence

success by reducing the expected value of bluffing and therefore the pool of bluffers who

decide to issue threats. At the same time, though, an increase in defender credibility also

reduces the expected value of attacking after the defender mobilizes, because it seems

more likely to the challenger that the defender will fight. As a result, some challengers

that might have been willing to attack against a less-credible defender instead will only

24 In his excellent article on the democratic peace, Schultz 1999 notes in passing a non-monotonic relationship between key variables in his model and war likelihood. Schultz’s model is not intended to capture the complete dynamics of deterrence crises (e.g., for simplicity he omits the stage in which defenders can bluff); however, his results are generally consistent with our finding. See also Lewis and Schultz 2003.


probe, and if they face defenders who actually mobilize, those challengers will back

down. Against a less credible defender, they would have been undeterrable, but the

increase in defender credibility turned them into examples of immediate deterrence

success. In sum, increasing defender credibility both reduces and increases the number

of bluffers in the observable dataset.

The net effect of changes in credibility on immediate deterrence success depends

on the relative magnitude of the two effects. If, for example, challenger audience costs

(RC) are very big, then the pool of bluffers should shrink rapidly when defender

credibility rises; in this case the selection effect prediction is correct, and increasing

defender credibility will correlate with less immediate deterrence. But if audience costs

are small, the pool of challengers who actually plan to attack may shrink more quickly

than the pool of bluffers, in which case increases in defender credibility will lead to more

immediate deterrence success. The other parameters in the game tree – i.e., the costs of

fighting, the probability of defender victory, and the baseline level of challenger and

defender interest in the protégé – similarly affect the responses of bluffers and committed

attackers to an increase in defender credibility, changing the relative composition of the

observed pool of challengers in a dataset. The net effect on the predicted correlation

between defender credibility and immediate deterrence success is ambiguous.

Insert figure 2

[Figure 2: Indifference Points between Challenger Strategies. A horizontal axis shows AC from -∞ to ∞, with the points J-α, AC1, AC2, and J+α marked. Challenger strategy by region: Don’t Threaten (below AC1); Threaten/Don’t attack (between AC1 and AC2); Threaten/attack (above AC2). Challenger type by region: status quo seeker; bluffer; motivated attacker.]

Figure 2 demonstrates the countervailing effects graphically. The line depicts the

range of potential values for AC, the challenger's interest in the protégé, from the lowest

possible interest at the left to the highest at the right. We assume that the defender's pre-

crisis estimate of the challenger's interest (J) lies in the center of the range of possible

values and that the defender's uncertainty about his estimate (α) correctly delimits the

width of the interval of possible levels of challenger interest.25 Two particular points are

indicated in the figure: AC1 is defined as the value for AC at which a challenger is

indifferent between adopting the strategies “not threaten” and “threaten/not attack”.26

The actual value of AC1 can be calculated in terms of the other payoffs in the Figure 1

game tree (see the appendix). Similarly, AC2 is the value of AC at which a challenger is

indifferent between the strategies of “threaten/not attack” and “threaten/attack.” In the

figure, the probability of immediate deterrence success is the ratio of the distance

between AC1 and AC2 to the distance between AC1 and J+α.
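
Written out, the ratio just described takes the following form. This is only a sketch: it assumes, as the geometry of the figure implies, that challenger interest is spread evenly over the interval from J-α to J+α and that both indifference points fall inside that interval. The symbol Pr(IDS) is shorthand introduced for this illustration, not notation taken from the appendix.

```latex
\[
\Pr(\mathrm{IDS}) \;=\; \frac{A_{C2} - A_{C1}}{(J + \alpha) - A_{C1}},
\qquad J - \alpha \;\le\; A_{C1} \;\le\; A_{C2} \;\le\; J + \alpha .
\]
```

The numerator measures the challengers who threaten but will not attack (the bluffers), and the denominator measures all challengers who threaten.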

An action taken by a defender prior to a crisis that increases its credibility (e.g.,

something that increases K) has two effects on potential challengers' calculations. First,

because the action makes the defender appear more likely to mobilize, the incentive for a

challenger to bluff declines. AC1, therefore, moves to the right.27 This shift of AC1 is why

the selection effects literature argues that credible defenders deter most “bluffers” from

issuing a threat. But an increase in defender credibility also means that the defender is

more likely to fight for the protégé rather than choose “not fight” after a challenger

attacks. Therefore, only a highly motivated challenger will actually attack when facing a

credible defender. In Figure 2, AC2 moves to the right as defender credibility increases.

25 In this model α is public knowledge. We use α to reflect both the defender’s uncertainty about the challenger’s actual interest in the protégé and the challenger’s uncertainty about the defender’s true level of interest.
26 A challenger would choose the strategy “threaten/not attack” in the hope that the defender would not mobilize.
27 One way to think of this is that as defender credibility increases, the marginal bluffer decides that bluffing is not worth it; i.e., as defender credibility increases, it takes a greater value of AC to make a challenger indifferent between “not threatening” and bluffing (“threaten/not attack”).


Unless one knows the relative distance that AC1 and AC2 shift as defender

credibility increases, one cannot determine the net effect on the probability of immediate

deterrence success. If AC1 shifts more quickly than AC2, the proportion of bluffers in

crises will drop, and successful immediate deterrence will become less common. This is

the selection effects prediction. But if AC2 shifts more quickly, rising credibility will

increase the likelihood of immediate deterrence success, as suggested by the traditional

prediction.
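
A small numerical sketch may make the ambiguity concrete. It computes the Figure 2 ratio before and after both indifference points shift to the right. All numbers are hypothetical values chosen for illustration, the function name `ids_rate` is our own, and the upper bound J+α is assumed to stay fixed while credibility rises.

```python
def ids_rate(ac1, ac2, upper):
    """Share of threatening challengers who are bluffers: (AC2 - AC1) / (upper - AC1)."""
    if not ac1 <= ac2 <= upper or ac1 == upper:
        raise ValueError("expected AC1 <= AC2 <= upper and AC1 < upper")
    return (ac2 - ac1) / (upper - ac1)


# Hypothetical values chosen only for illustration (not taken from the appendix).
J, ALPHA = 10.0, 5.0
UPPER = J + ALPHA            # highest possible challenger interest, J + alpha
AC1, AC2 = 8.0, 12.0         # indifference points before credibility rises

baseline = ids_rate(AC1, AC2, UPPER)

# Case A: AC1 shifts right much faster than AC2 (bluffers exit quickly).
case_a = ids_rate(AC1 + 2.0, AC2 + 0.5, UPPER)

# Case B: AC2 shifts right faster than AC1 (would-be attackers become bluffers).
case_b = ids_rate(AC1 + 0.5, AC2 + 2.0, UPPER)

# Case C: both points shift by the same amount.
case_c = ids_rate(AC1 + 1.0, AC2 + 1.0, UPPER)

print(f"baseline IDS = {baseline:.3f}")
print(f"case A IDS   = {case_a:.3f}")
print(f"case B IDS   = {case_b:.3f}")
print(f"case C IDS   = {case_c:.3f}")
```

With these numbers the rate falls in case A and rises in case B. Case C suggests that, under this simple ratio, equal shifts still raise the rate because the pool of threatening challengers shrinks; AC1 would have to move sufficiently faster than AC2 for the selection effects prediction to hold.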

The problem for scholars who study deterrence is that the net effect of increases

in credibility on deterrence outcomes depends on the precise values of the other

parameters in the game tree. The appendix demonstrates the complexity of these

relationships. For a wide range of values for the magnitude of audience costs, costs of

fighting, probability of defender victory in a war, and uncertainty in the adversaries'

predictions of each other's level of interest in the protégé, we can choose values of the

other variables such that an increase in defender credibility will either increase or

decrease the probability of immediate deterrence success. Without very precise

measurements of all of these variables, scholars cannot know whether deterrence theory

predicts a positive or negative correlation between actions that signal greater defender

interest in its protégé and immediate deterrence success.

Insert figures 3

[Figure 3: Complex Relationships between power (p), interests (K), and immediate deterrence success (IDS)]

Figure 3 shows the rate of immediate deterrence success as a function of the

defender’s power (p) and its apparent interest in the protégé (K) under a range of

circumstances. Panel 1 shows the relationship between K and IDS for a range of

parameter values that might describe “typical” cases.28 Panel 2 shows the relationship

between p and IDS for the same sets of parameters. Panel 3 illustrates the relationship

between K and IDS in a defense dominant world, meaning that the expected costs of

fighting are greater for the challenger than the defender. Panel 4 presents the relationship

between p and IDS in a rapacious world: with these values, bluffing is rampant because

the cost of backing down is low, and war is common because the cost of fighting is much

smaller than the potential spoils of victory. In all four panels, the relationship between

the variable of interest and IDS is non-monotonic and quite complex.29

There are three key points to take from these graphs. First, an increase in K or p

can result in either a substantial increase or decrease in the rate of immediate deterrence

success. Therefore, steps that a defender takes that successfully signal its interest in a

protégé or its ability to successfully defend a protégé could generate either higher or

lower rates of observable deterrence success (IDS). Second, for some parameter values

(e.g., FC=9 in Panel 1 and RC=5 in Panel 3), the relationship between the defender’s

apparent interest, K, and immediate deterrence success is essentially flat for wide ranges

of parameter values, meaning that even when deterrence is working (i.e., challengers are

threatening and attacking less often than they would have at lower levels of K), no

evidence of this successful deterrence will appear in data on crisis outcomes. Finally, the

relationship between K and IDS is a function of the size of K; as K varies, the

28 We consider these values to be “typical” because audience costs are smaller than the costs of fighting (except for the FC=3 line), and because the costs of fighting are smaller than the value of the protégé (except for the lowest values of K in Panel 1). Scholars may disagree about what constitutes typical values, but the general shapes of these lines appear with a range of parameter values.
29 Most of the curves end before reaching the right limit of the graph (in Panel 3 the curves end at values of K that range from 18.0 to 18.8, though this is difficult to see). This occurs because for some parameter values there are no crises. For example, if for a given set of parameters everyone knows that even the most interested defender is unwilling to fight (AD2>K+α), then all challengers will threaten but no defenders will mobilize, so there will be no crises.


relationship between K and IDS changes.30 Similar results can be seen in the panels

relating IDS to p.

These graphs illustrate a serious problem for analyses that attempt to draw

inferences about theories of coercion by observing immediate deterrence outcomes. For

example, studies that regress immediate deterrence success on either indicators of a

defender’s interest in a protégé or indicators of a defender’s power will not produce

meaningful results.31 If a study assumes that the selection effects prediction is correct but

inadvertently samples cases in which the actual relationship between K (or p) and

immediate deterrence success is positive, the analysis will tend to reject theories that are

correct and possibly confirm those that are wrong. If the sample comprises observations

in which the relationship between K or p and IDS is relatively flat, then variables that

have great significance as causes of K or p—and hence great significance for coercion—

will appear to be irrelevant. And if the study happens to examine a sample that includes

both “positive correlation” and “negative correlation” cases (for example, cases that cross

over a local maximum of the probability of immediate deterrence success), the estimated

coefficient relating defender interest or power to immediate deterrence success will be

biased towards zero. These latter two cases will be statistically indistinguishable from

30 If the relationship between K and IDS were always concave down, scholars could execute a weak test of various theories of deterrence using a quadratic specification in a regression equation: the squared term on the relationship between K and IDS should never have a positive coefficient. Unfortunately, for some parameter values (e.g., the left part of the FC=3 curve in Panel 1), the relationship is concave up. We thank Bear Braumoeller for discussion of this point.
31 Signorino 1999 also shows that strategic interaction between countries can cause problems for attempts to estimate crisis models, specifically for studies using logit (and probit). Signorino's paper draws on a version of the Bueno de Mesquita et al. model: the true relationship between power and the probability of war is non-monotonic because of the endogeneity of the identity of “challengers.” His results are based on a complete information game (with uncertainty about crisis outcomes generated because each country is assumed to make errors in its strategic choices at a publicly known rate). While his article does an excellent job of showing examples of the problems with using off-the-shelf estimators in the presence of strategic interaction, the assumption of perfect information is not realistic (see also Lewis and Schultz 2003). The incomplete information model developed in this article better captures actual crisis dynamics.


cases in which the independent variables genuinely have no relationship to immediate

deterrence success. These results support neither the “traditional prediction” of

deterrence theory nor the “selection effects prediction.”32
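
The attenuation problem can be illustrated with a toy calculation. The sketch below fits a one-variable least-squares line to a hypothetical inverted-U relationship between K and IDS; the functional form in `toy_ids` and every number in it are assumptions made purely for illustration and are not drawn from the model in the appendix.

```python
def ols_slope(xs, ys):
    """Slope from a one-variable least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def toy_ids(k):
    """Hypothetical inverted-U relationship with a peak at K = 10 (illustration only)."""
    return 0.8 - 0.005 * (k - 10.0) ** 2


ks = [float(k) for k in range(2, 19)]      # K = 2, 3, ..., 18, symmetric around the peak
ids = [toy_ids(k) for k in ks]

# Split the sample at the local maximum.
left = [(k, y) for k, y in zip(ks, ids) if k <= 10]
right = [(k, y) for k, y in zip(ks, ids) if k >= 10]

slope_all = ols_slope(ks, ids)             # sample that straddles the peak
slope_left = ols_slope(*zip(*left))        # "positive correlation" cases only
slope_right = ols_slope(*zip(*right))      # "negative correlation" cases only

print(f"full sample slope: {slope_all:+.4f}")
print(f"left of peak:      {slope_left:+.4f}")
print(f"right of peak:     {slope_right:+.4f}")
```

In this toy sample K strongly shapes IDS on both sides of the peak, yet the pooled slope is essentially zero, which is the pattern a researcher could mistake for irrelevance.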

The addition of control variables to off-the-shelf quantitative estimators cannot

solve the selection effects problems. Controlling for the value of the other parameters in

the model (e.g., FC, FD, etc.) merely accounts for the effects of those parameters on IDS,

not for their effects on the shape of the relationship between K and IDS or p and IDS.

Adding control variables would assume that there is a single, "true" relationship between

the study variables and IDS and that each observation, once the effects of the control

variables are factored out, would contribute additional information about those true

relationships. Unfortunately that assumption is not warranted: there is no single function

that relates K or p to IDS. In effect, large datasets on crises almost certainly encompass

multiple causal relationships between K, p, and IDS, and consequently the observations

cannot simply be pooled.33 Even in a dataset with observations drawn randomly from all

values of K, p, and the other parameters, estimators that compute an "average"

relationship between the independent variables of interest and IDS, hoping to "wash out"

the effects of the various relationships between intervening variables like K and IDS,

would not yield meaningful results.
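
A minimal sketch of why additive controls fall short, assuming two hypothetical regimes that differ only in the cost of fighting (FC) and in which the K-IDS slope has opposite signs. The regime values and slopes are invented for illustration. With a single dummy control, regressing IDS on K and the dummy is equivalent to regressing regime-demeaned IDS on regime-demeaned K, which is what the code does.

```python
def ols_slope(xs, ys):
    """Slope from a one-variable least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def demean(values):
    """Subtract the group mean, which is what an additive dummy control does."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical data: the same K values observed in two FC regimes.
ks = [float(k) for k in range(5, 16)]
low_fc = [0.20 + 0.03 * k for k in ks]     # IDS rises with K in this regime
high_fc = [0.90 - 0.03 * k for k in ks]    # IDS falls with K in this regime

# Pool the two regimes after removing each regime's mean (the additive control).
k_within = demean(ks) + demean(ks)
ids_within = demean(low_fc) + demean(high_fc)

slope_low = ols_slope(ks, low_fc)
slope_high = ols_slope(ks, high_fc)
slope_controlled = ols_slope(k_within, ids_within)

print(f"low-FC regime slope:          {slope_low:+.3f}")
print(f"high-FC regime slope:         {slope_high:+.3f}")
print(f"pooled slope with FC control: {slope_controlled:+.3f}")
```

The control absorbs the difference in average IDS between the regimes but leaves the opposing slopes to cancel, so the pooled coefficient on K describes neither regime.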

Scholars have also been tempted to try to mitigate the selection effects problem

by using sophisticated two-stage estimators proposed by Heckman and others.34 A recent

32 Note that the results in Figure 3 also contradict Fearon’s version of the selection effects prediction, which suggests a monotonic negative relationship between IDS and interest variables (K) and a monotonic positive relationship between IDS and power variables (p).
33 Collier and Mahoney 1996.
34 Huth and Allee 2002; Smith 1996; Nooruddin 2002.


article on research design explicitly endorses this trend.35 These analyses separately

estimate relationships among variables during two stages of a crisis: they first study the

decision to initiate a crisis and then study behavior during a crisis conditional on the prior

decision to initiate a crisis in the first stage. The models are based on the hypothesis that

the error term in the first estimate is correlated with the error term of the second estimate.

Accounting for correlation in error terms is important, but it does not address the key

problem in research design described in this article: even if there were no correlation in

the error terms, we would still not know the right functional form to estimate at either

stage.36 The relationships at each stage are non-linear, and their shapes depend on all of

the model parameters. Simply using a selection model estimator does not help us to

determine whether an increase in defender credibility should be expected to deter more potential

challengers at the first stage (crisis initiation) or at the second stage (crisis behavior). In

other words, putting aside the problems with the error terms, scholars do not know a

priori what coefficients to expect and what functional form to look for in their statistical

analyses of crisis behavior.

In sum, Fearon’s insight about selection effects made a substantial contribution to

scholars’ understanding of pitfalls in studies of deterrence. Unfortunately the hurdles that

stand before scholars are even higher than Fearon and others realized: both the traditional

and the selection effects predictions about the relationship between defender credibility

and immediate deterrence success should obtain in datasets of crisis outcomes.

35 Huth and Allee 2004.
36 In a similar vein, Smith 1999 also argues that at least two separate problems (censoring and interdependence of observations) plague datasets on crises. Accounting for correlation of error terms at best would solve one of the problems.


Furthermore, simple solutions – such as using off-the-shelf selection estimators – do not

solve this problem.

Mitigating the Problems of Selection Effects with Case Studies

Given that selection effects make it difficult to make straightforward predictions

about crisis outcomes, what should scholars do to study coercion empirically? One

approach is to apply better statistical methods to datasets of crisis outcomes or to new

datasets tailored to account for the selection effects. We will address some of these

possibilities in the next section. A promising alternative is to study the process of

decision-making rather than the outcomes of crises. This section builds on the general

observation of Collier, Brady, and Seawright that studies of the causal process gain their

inferential leverage in a different way than studies of "dataset observations."37

Examining the decision-making process allows researchers to (1) directly observe

the variables that scholars typically measure indirectly (such as decision-makers’

estimates of their adversaries' power, interests, and credibility), and (2) directly observe

which tools of statecraft influenced those estimates. The key point is that most

quantitative studies of coercion use patterns of crisis outcomes to draw inferences about

how credible, powerful, or committed each country appeared to its adversary (and hence

about the relative effectiveness of the steps each country took to signal those attributes);

these inferences are dubious because of the non-monotonic relationships explained in the

previous section. By studying the decision-making process, however, scholars can avoid

reliance on inferences about what a given rate of IDS implies about a country’s

37 Collier, Brady, and Seawright 2004.


credibility, power, or interests. Studying the decision-making process therefore avoids

the most serious problems posed by selection effects for studies of coercion.38

Many theories of coercion can be tested using evidence about the decision-making

process. For example, scholars believe that leaders' assessments of their adversaries'

credibility have an important effect on coercion, so scholars would like to know what

influences decision-makers' assessments. Specifically, a credible defender is one that the

challenger believes is likely to fight rather than back down if confronted with a choice at

the fourth node of the game tree in Figure 1. Credibility then depends on model

parameters like K, p, and FD.39 Ideally scholars could study the effects of various tools of

statecraft on K, p, FD, and the other model parameters, and they could then learn about

both the causes of credibility and crisis dynamics.

In the empirical record, decision-makers do not speak in the language of the

model, but they frequently estimate their adversary's overall credibility and other

variables. Scholars can translate these estimates into the variables that are important for

the theories that they want to test, and they can also examine what evidence decision-

makers used as they made their assessments. Did they consider alliances, their

adversaries' past behavior, the balance of military capabilities, the personality traits of

specific leaders, or other factors? Scholars can read the internal memos and the

transcripts from closed-door meetings to "listen in" on the secret deliberations. Armed

with direct measures of credibility gleaned from examining the decision-making process,

38 Although decision-making processes can be studied using either quantitative or qualitative techniques, we highlight case study research designs because they have been overlooked as an approach to avoid selection effects. To be clear, scholars have frequently used case studies to study crisis decision-making and deterrence, but most of the past qualitative studies, like their quantitative counterparts, draw key conclusions from crisis outcomes and are therefore vulnerable to the problems introduced by selection effects. See, for example, George and Smoke 1974.
39 Credibility is the challenger's estimate of the probability that the defender will fight. We give a mathematical expression for this probability, labeled as y, in the appendix.


scholars can learn a good deal of what they would like to know about coercion without

disentangling the complex relationships to crisis outcomes shown in the formal model.

In the following paragraphs we illustrate this research method. We introduce two

theories of credibility and provide a short background on two cases in which those

theories can be tested. We describe the ideal evidence that we might find in a case, and

we compare it to the actual data we gathered by studying the decision-making process.

Our goal here is not to conduct a conclusive test of theories of credibility; rather, the

point is about research design. A study using the type of data described below would

avoid the most serious problems associated with selection effects, and it is possible to

carry out such a study.40

Testing Theories of Credibility using the Appeasement Crises

A powerful conventional wisdom, which we call Past Actions theory, posits that

leaders assess their adversaries’ credibility by evaluating the adversaries' histories of

keeping or breaking commitments. A competing theory, Current Calculus theory, holds

that leaders pay little attention to their adversaries’ past behavior; instead, they consider

credible those threats that an adversary has the power to carry out at reasonable cost

compared to the value of the issue at stake.41

The series of crises in which Germany faced Britain and France before World

War II can be used to evaluate these theories. During the confrontations over

Czechoslovakia (1938) and Poland (1939), German leaders debated whether Britain and

France would follow through on their promises to defend Germany’s intended victims.

40 Case studies suffer from other problems, such as the difficulty of generalizing the results from a small number of cases. We discuss these limitations below.
41 For a more detailed description of these theories and a more complete effort to test them, see AUTHOR 2005.


Because these crises were preceded by years of British and French vacillation, the Past

Actions theory predicts that German leaders would doubt the Allies’ threats.42 Minutes

from meetings, memos, and transcripts of secret deliberations should reveal German

leaders predicting another Allied withdrawal (for example, statements like: “The British

and French won’t defend Czechoslovakia.”). Furthermore, the documents should reveal

German leaders explaining their estimates of Allied credibility by referring to prior acts

of appeasement by the British and French (e.g., “The Allies won’t oppose us; they backed

down last time we faced them.”).

Current Calculus theory makes quite different predictions.43 Hitler thought that

German military power outmatched the Allies during both crises, but his military

commanders believed that Germany was outgunned until the crisis over Poland. Current

Calculus theory, therefore, predicts that Hitler would dismiss Allied threats during both

crises; the German military should view the Allies as credible in 1938 but less credible

the following year. And, according to the theory, the debates among German leaders

should have focused on the balance of capabilities rather than the Allies’ past actions

(e.g., “The French won’t fight us over Poland; the French Army is too weak.”).

The documents from Nazi Germany show that leaders’ private statements can be

used to track their assessments of enemy credibility. In meeting after meeting, Hitler

insisted that the Allies would back down if Germany attacked Czechoslovakia or Poland.

For example, in a key 1937 meeting he argued that despite Allied promises, their

intervention in a war over Czechoslovakia “was hardly probable;” he asserted that the

42 The British and French took no significant action when Germany repeatedly violated the Versailles Treaty by exceeding its permitted military size, reinstating conscription, militarizing the Rhineland region of Germany, and seizing Austria.
43 The interests at stake for the Allies and Germany were roughly equivalent in these two crises, so we focus the predictions of the Current Calculus theory on assessments of relative power. AUTHOR 2005.


Allies had “written off the Czechs.”44 Hitler later assured his Foreign Minister that the

Allies “would definitely not move” to defend the Czechs.45 Hitler remained skeptical

about Allied credibility in the months leading to the Poland crisis; he repeatedly insisted

that the Allies would never fight for Poland.46

Germany’s military commanders disagreed with Hitler’s assessments during the

Czech crisis. Despite years of appeasement, they were confident that the Allies would

fight for Czechoslovakia. The German War Minister, the Army Commander in Chief,

and the Army Chief of Staff all argued vehemently against Hitler’s disparaging view of

Allied credibility.47 And when a large group of Germany’s senior military commanders

gathered in August, 1938, to discuss the military situation, they were nearly unanimous

that the Allies would defend Czechoslovakia if Germany attacked.48 The debates in

Germany show that it was not until 1939—when the balance of power finally swung in

Germany’s favor—that the military leadership finally shared Hitler’s low assessment of

British and French credibility.

The statements by German leaders permit a straightforward congruence test of

theories of credibility. Specifically, the Past Actions theory predicts that German leaders

should doubt the Allies’ promises to fight for Czechoslovakia and Poland; the theory

passes the congruence test that draws on Hitler's assessments of Allied credibility, but it

fails the congruence test based on the military's evaluation of the situation. The Current

Calculus theory performs better. It predicts that Hitler, who was confident in German

44 For the minutes of this meeting see Documents on German Foreign Policy [henceforth DGFP] 1949, 35.
45 See the collection of German documents edited by Michaelis, Schraepler and Scheel 1959, esp. p. 266.
46 DGFP 1983, 552-55 and 200-206.
47 The arguments by the War Minister and Army Commander in Chief are recorded in DGFP 1949, 38. The memos that record the German Army Chief of Staff’s views of Allied credibility are reproduced in Müller 1980, esp. 502, 521-28.
48 Michaelis, Schraepler and Scheel 1979, 253-56.


military power throughout the period, should have dismissed Allied threats in both crises

– and he did. German military commanders believed that Germany was outgunned until

the Poland crisis; as expected by the theory, they believed that the Allies would fight for

Czechoslovakia but not for Poland. The most important point from the standpoint of

research design is that these congruence tests are not undermined by the selection effect,

because the evaluation of the theories does not depend on the relationship between

estimates of credibility and the outcomes of the crises.
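
The logic of the congruence test can be laid out as a simple tally. The sketch below codes each theory's prediction and the assessment recorded in the documents as "low" (the Allies are expected to back down) or "high" (the Allies are expected to fight). The two-value coding is our simplification of the discussion above and is meant only to illustrate the procedure.

```python
# Each case is one actor's assessment of Allied credibility in one crisis.
CASES = [("Hitler", 1938), ("Hitler", 1939), ("military", 1938), ("military", 1939)]

# Predictions as described in the text, reduced to a two-value coding (an assumption).
PREDICTIONS = {
    "Past Actions": {
        ("Hitler", 1938): "low", ("Hitler", 1939): "low",
        ("military", 1938): "low", ("military", 1939): "low",
    },
    "Current Calculus": {
        ("Hitler", 1938): "low", ("Hitler", 1939): "low",
        ("military", 1938): "high", ("military", 1939): "low",
    },
}

# Assessments reported in the German documents, using the same coding.
OBSERVED = {
    ("Hitler", 1938): "low", ("Hitler", 1939): "low",
    ("military", 1938): "high", ("military", 1939): "low",
}


def congruence(theory):
    """Return the cases in which a theory's prediction matches the observed assessment."""
    predicted = PREDICTIONS[theory]
    return [case for case in CASES if predicted[case] == OBSERVED[case]]


for theory in PREDICTIONS:
    matches = congruence(theory)
    print(f"{theory}: {len(matches)} of {len(CASES)} assessments match")
```

Nothing in the tally refers to how either crisis ended, which is why the comparison is insulated from the selection effect.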

These congruence tests can be bolstered with the addition of causal-process

observations:49 evidence about the reasoning that Hitler and his military commanders

used as they debated Allied credibility. The best evidence would be statements that

directly state both the leaders' point estimates of the likelihood that their adversaries

would carry through on their threats and the reasoning that led them to their estimates.

Although decision-makers rarely speak with such clarity, their discussions frequently

reveal what they feel is salient as they consider and debate their options. In the German

case, Hitler and his generals repeatedly argued about Allied credibility, and their

discussions focused almost entirely on the balance of power. For example, in the months

leading to the Czechoslovakian crisis, Hitler explained why he believed the British and

French would not fight: the Empire was a drain on British resources, and their army was

too small to fight Germany; the French Army had obsolete weapons; the Italians would

be powerful German allies; and German fortifications could repulse an attack along the

Franco-German border.50 German commanders disagreed with Hitler’s assessment, but

they, too, focused on the balance of power – specifically, the traditional German problem

49 Collier, Brady, and Seawright 2004.
50 Michaelis, Schraepler and Scheel 1979, XX-XX.


of fighting a two-front war. They argued that the Allies would fight because Germany

would be vulnerable while its army was busy fighting the Czechs.51 Hitler and his

generals reached different conclusions about Allied credibility, but they reasoned

according to the balance of power and not the Allies’ past actions.

One well-known piece of evidence appears to support the Past Actions theory, but

the historical record reveals that it is an outlier. On August 22, 1939, Hitler derided

Allied threats to defend Poland: “Our enemies are worms. I saw them in Munich,”

referring to the Allies’ appeasement the previous year at the Munich Conference.52 But

Hitler actually presented seven arguments to explain why he doubted Allied credibility

during the August 22 meeting; all seven were about the military balance, and the

"worms" comment was simply a brief aside from his main train of thought.53

German deliberations before World War II show how congruence tests and

causal-process tests can be mutually reinforcing. Congruence tests are vulnerable to

spurious correlation, but evidence about causal processes can check that the congruence

is the result of the transmission mechanism suggested by the theory.54 Similarly, causal-

process observations suffer from leaders’ frequent failure to carefully articulate their

reasoning and the possibility that decision-makers may omit their real reasons from

discussions. But congruence tests mitigate these dangers by comparing the stated or

implied reasoning with the point estimates of the values of key variables. As we have

51 Documents 44 and 46 in Müller 1980; DGFP 1949, 38; Michaelis, Schraepler and Scheel 1979, 253-56.
52 DGFP 1983, 204.
53 Hitler argued that the British army was small; its Empire was crumbling; the French army was weak; German fortifications were powerful; supplies from Eastern Europe would circumvent a blockade; the German economy was strong; and the Soviet Union would ally with Germany (DGFP 1983, 554-555).
54 Process tracing can also help verify that the variables have been scored properly in the congruence test. See George and Bennett 2005.


shown in the pre-World War II cases, these tests can be used together to draw robust

conclusions about decision-making processes and theories of coercion.

The key point for research design is that there is ample evidence in German

documents to assess theories of credibility. As it turns out, the evidence overwhelmingly

supports the predictions of Current Calculus theory.55 As with most empirical research,

the evidence from the case is not perfect, and not all of the evidence points in exactly the

same direction (e.g., the “worms” quote). But a case study of the 1930s crises that

directly measures credibility – rather than trying to draw inference about credibility by

observing crisis outcomes – avoids the pernicious impact of the selection effect, and a

clear preponderance of the evidence allows us to draw a conclusion about which theory

of credibility is more accurate.

Using the Cuban Missile Crisis

In 1962 the United States discovered Soviet nuclear-armed missiles in Cuba. As

U.S. leaders considered a range of military options to remove the missiles, a key question

they considered was the likely Soviet response: would Khrushchev back down as he had

repeatedly in recent crises over Berlin? Or would he stand firm and retaliate by striking

U.S. allies in Europe?

The missile crisis erupted soon after a series of Soviet bluffs over Berlin, so Past

Actions theory predicts that U.S. leaders would expect Khrushchev to back down again.

They should explain their views by referring to Khrushchev’s previous retreats. Current

Calculus theory, on the other hand, predicts that U.S. leaders would find the Soviet Union

55 It is striking that Hitler did not frequently argue that Allied appeasement revealed their unwillingness to fight. This would have been a smart rhetorical argument to support his preference for war even if he did not believe it. Yet almost all of his arguments assessed Allied credibility on the basis of the military balance.


more credible than in the crises over Berlin. In previous crises, the U.S. could respond to

Soviet threats with the threat of escalation, because of the American capability to launch

a disarming nuclear first strike against the U.S.S.R. But U.S. nuclear superiority had

melted away by 1962, so Khrushchev could be more resolute in this crisis, refuse to

concede to American pressure, and retaliate in Europe if the United States used force. If

Current Calculus theory is correct, U.S. leaders should have expected Khrushchev to

stand firm regardless of American mobilization or threats to the island. The American

decision-makers should have explained their assessments by referring to the unfavorable

shift in the balance of power.

The statements of senior Kennedy Administration officials provide ample

evidence about Soviet credibility. Four years of Khrushchev’s bluffs over Berlin had no

effect on his—or the Soviet Union’s—credibility. Kennedy’s advisors were divided

about the best course of action during the crisis, but they were virtually unanimous that

Khrushchev would not back down. For example, the president predicted, “If we attack

Cuban missiles…it gives them a clear line to take Berlin.”56 Similarly, U.S. officials

were convinced that neither an ultimatum to Khrushchev to remove the missiles nor a

blockade of Cuba would cause the Soviets to back down. Even advocates of the

ultimatum strategy called the prospects of Khrushchev backing down “illusory” (142).57

And advocates of a blockade agreed that their preferred strategy was unlikely to work.

Secretary of Defense McNamara warned, “I never have thought we’d get [the missiles]

out of Cuba” through a blockade (417). The President favored a blockade but agreed

56 May and Zelikow 1997, 175. All subsequent quotes in this paragraph are from May and Zelikow, with page references in the text. 57 They believed that issuing an ultimatum would strengthen the U.S. diplomatic position in the crisis, which would be helpful if the crisis later escalated to war.


with McNamara: “We’re not going to get [the missiles] out with the quarantine.” The

only way to do that was to “trade them out” or “go in and get them out ourselves” (464).

None of Kennedy’s senior advisors believed that Khrushchev would buckle to U.S.

pressure; Khrushchev’s credibility was high. The evidence from this congruence test is

consistent with the Current Calculus theory but not with the Past Actions theory.

The reasoning data from this crisis is less conclusive than the congruence test:

U.S. officials explained the logic behind their assessments less frequently than German

leaders in the 1930s did.58 What is striking, however, is that although the senior members

of the U.S. government worried incessantly about how backing down might affect U.S.

credibility—and how U.S. weakness over Cuba would harm future U.S. credibility over

Berlin—we found no instances during the deliberations in which U.S. leaders asked

themselves what previous Soviet withdrawals over Berlin revealed about Soviet

credibility. However, U.S. leaders did discuss the strategic nuclear stalemate and how it

narrowed their options toward Cuba.59 In sum, the evidence in this case study provides

moderate support for the Current Calculus theory—principally from the congruence

test—and flatly contradicts the Past Actions theory. The more important point is that

evidence to test theories of credibility through means that avoid the problem of selection

effects is abundant in the Cuban missile crisis case.

Case Studies, Coercion, and Theory Testing

Three clarifications about case studies are necessary. First, no matter what case

study researchers do, selection effects will still lurk in the background. Not all countries

58 Perhaps the U.S. officials made few efforts to explain the reasoning behind their views on Soviet credibility because there was broad agreement that Khrushchev was credible. In the 1930s crises, there were powerful disagreements, so people were forced to explain and advocate their views. 59 CITE.


are equally likely to get into crises, and therefore the outcomes of the crises will not show

the effect of independent variables on deterrence or compellence success.60 But selection

effects should not significantly affect the process by which leaders assess their

adversaries' credibility, as long as highly motivated challengers and defenders do not

reason differently than less motivated ones.61

Second, case studies are not without their own substantial methodological

problems. Most notably, generalizing the results of a few cases to a larger population is

hazardous. This problem is real, but it need not be fatal. Scholars have offered

approaches to mitigate this weakness in case studies.62 No research design is flawless.

From the perspective of avoiding the selection effects explained in this article, though,

properly executed case studies have much to offer.

Finally, careful archival work requires substantial access to a country’s most

sensitive documents. Some countries make these documents available after a few

decades, but the sample of countries that do so is not representative. A few democracies

make their government documents available, and the conquest of Nazi Germany opened

the decisions of one dictatorship to scrutiny. But those cases may be idiosyncratic.

Nonetheless, the case study method mitigates the selection effects problem that

confounds quantitative analyses of coercion. Case studies can, data permitting, provide

high quality information about how leaders actually make decisions about deterrence.

60 Furthermore, leaders may be more likely to back down when they predict that audience costs will be low, further complicating efforts to interpret patterns in crisis outcomes (Schultz 2001; Sartori 2002). 61 If the leaders of aggressive countries reason differently than leaders of average countries, this method may reveal more about the aggressive leaders. Even so, learning how aggressive leaders evaluate credibility may be particularly valuable for understanding war initiation, evaluating theories of deterrence, and offering foreign policy prescriptions. 62 For two differing approaches, see Van Evera 1997 and [author] 2005.


Where to Go From Here?

Formal theory, statistical analyses and case studies complement each other in

developing and testing theories of coercion. Studies using formal methods – including

Fearon's seminal work and continuing through the model presented in this article – have

helped disentangle the complex relationships among the key variables that determine

crisis dynamics. Neither statistical analyses nor case studies can draw reasonable

conclusions about crisis behavior without considering the effects of strategic interaction

on observable data – a task best accomplished formally.

Statistical analyses can also contribute greatly to the understanding of coercion,

but only if scholars account for selection effects. Two new lines of work in international

relations are promising in this regard. In one approach, scholars are beginning to use

custom statistical estimators that are derived directly from the payoffs and structure of the

crisis game they are modeling rather than using off-the-shelf estimators like probit and

logit. Because of the complexity of the relationships that scholars should expect to find

in datasets of crisis outcomes, they cannot adapt the off-the-shelf estimators (e.g., by

adding control variables or polynomial terms) to yield valid inferences. By using custom

estimators – which mathematically reflect the actual interactions of the key parameters in

a given game tree – scholars directly capture the strategic interaction in the model, hence

their analyses internalize the selection effect.63

Scholars who are working to implement this research design have recognized its

limitations.64 Most important, this method requires great faith in the precise structure of

the game tree. If the game does not conform to reality (that is, if it leaves out

63 Lewis and Schultz 2003. 64 We thank Jeff Lewis for helpful discussions about these points.


substantively important crisis dynamics), then the estimated coefficients will not

necessarily be closer to reality than the flawed estimates from off-the-shelf statistical

techniques. And because the estimator itself assumes that the game tree really generated

the observed data, the estimating procedure cannot be used to test how well the game tree

matches reality. In effect, scholars can either assume that they have the right structure of

the model and then estimate the effects of explanatory variables of interest (e.g.,

alliances, speeches, and troop deployments), or scholars can assume that they know the

true effects of the explanatory variables and use the estimates to test the structure of the

game. They cannot do both at the same time. Furthermore, this approach to the

statistical analysis of coercion requires the creation of new datasets, and some of the new

outcome data in particular are conceptually difficult to code. Because the estimator gets

its inferential leverage from the percentage of interactions that end in different outcomes

– including the percentage of potential crises that end in the "status quo" outcome –

scholars need to be able to decide when a particular "status quo" period begins and ends,

even if the preceding and following periods are also "status quo" periods. Coding "too

many" or "too few" episodes of status quo is likely to substantially bias the results of the

estimator.

A second promising approach using large-n datasets is for scholars to study crisis

initiation rather than crisis outcomes.65 By studying the decision to initiate (or not) a

crisis, scholars avoid the most serious problem associated with selection effects: the non-

monotonic relationship between key parameters (such as defender interests and power)

and immediate deterrence success. Specifically, there is no reason to believe that

65 See, for example, Leeds 2003.


increasing defender credibility, signals of defender interest, or defender power would

ever increase the probability that a challenger would threaten a protégé.66 Because the

relationships between the model parameters and the probability of a challenge are

monotonic, scholars can at least look for statistically significant correlates with the rate of

challenges.

But even this promising approach has its drawbacks. Selection effects still

complicate analyses that focus on crisis initiation. For example, studies that search for a

relationship between the presence (or type) of alliances and the likelihood of a military

challenge must account for the possibility that a country’s willingness to extend an

alliance is partly a function of the likelihood that the potential ally will be attacked.67

Furthermore, while this research design is a promising approach for answering some

important questions in international relations (e.g., when are threats issued?), it cannot

answer other critical questions: When do challenges lead to war? When do coercive

threats cause targets to concede? How much new information about their adversaries do

decision-makers learn during crises? To answer these questions scholars will need to

untangle the interactions between challengers and defenders at each stage of their

interaction – not just at the first decision in the game tree.68
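The alliance example in this discussion can be illustrated with a small simulation. The sketch below is not drawn from the model in this article: the uniform threat level, the rule that alliances form with probability one minus the threat, and the function name `challenge_rates` are all illustrative assumptions. Alliances are given no deterrent effect at all, yet allied states are challenged less often simply because alliances are extended where threats are low.

```python
import random


def challenge_rates(n=20000, seed=0):
    """Return (challenge rate with alliance, challenge rate without alliance).

    Hypothetical data-generating process (an assumption for illustration):
    each potential target has a threat level drawn uniformly from [0, 1];
    an alliance forms with probability 1 - threat; a challenge occurs with
    probability equal to threat, regardless of whether an alliance exists.
    """
    rng = random.Random(seed)
    # allied? -> [number of cases, number of challenges]
    counts = {True: [0, 0], False: [0, 0]}
    for _ in range(n):
        threat = rng.random()
        allied = rng.random() > threat      # alliances are likelier when threat is low
        challenged = rng.random() < threat  # the alliance has no causal effect here
        counts[allied][0] += 1
        counts[allied][1] += challenged
    allied_rate = counts[True][1] / counts[True][0]
    unallied_rate = counts[False][1] / counts[False][0]
    return allied_rate, unallied_rate


if __name__ == "__main__":
    allied_rate, unallied_rate = challenge_rates()
    print(f"challenge rate with alliance:    {allied_rate:.3f}")
    print(f"challenge rate without alliance: {unallied_rate:.3f}")
```

Under these assumptions the expected challenge rates are one third for allied states and two thirds for unallied states, so a naive comparison would credit alliances with a large deterrent effect that does not exist in the simulated data.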

66 All of our runs of the Maple code for the model described in this article have yielded monotonic relationships between the model parameters and the probability of challenges, supporting the underlying assumption of the approach adopted by Leeds and others. 67 If alliances are more likely when threats are low (because the expected cost of extending the alliance is small), then studies will find a negative correlation between the presence of an alliance and the likelihood of a challenge, even if alliances do nothing to deter challengers. On the other hand, it is also possible that likely targets of aggression will try harder to find allies (and will grant more concessions in exchange for an alliance) than countries that face smaller risks. Untangling these conflicting selection effects is necessary before the effect of alliances on the probability of challenges can be accurately estimated. 68 There is another problem with this approach: statistical estimators generally assume that a single relationship links each independent variable with the dependent variable. However, the shape of the relationship between key independent variables (e.g., defender interest and power) and the probability of a challenge varies widely: sometimes the relationship between the probability of a challenge and, say, K is nearly linear, and sometimes it curves sharply; sometimes it is relatively flat, and sometimes it is steep.


Overall, neither of these two quantitative approaches is perfect, but each can be

used to strengthen scholars' understanding of coercion. "Straightforward" tests that seek

to draw inference from crisis outcomes, however, are dangerous.

Case studies can also contribute substantially to our understanding of decision-

making during foreign policy crises. Governments produce enormous paper trails when

they make decisions. Scholars should fully avail themselves of this evidence to directly

test how leaders reason as they consider whether to initiate a crisis, how they reason as

they debate the wisdom of responding to an adversary’s threats, and how they struggle

with the tough decisions they face during crises. Case studies – like formal methods and

statistical analyses – have their limitations, but they also have unique advantages: for

example, they allow scholars to directly measure variables (e.g., assessments of

credibility, assessments of power) that quantitative studies must measure indirectly by

observing behaviors that imperfectly reflect these variables. The field under-appreciates

the value of case studies in developing and testing theories of coercion, so scholars do not

aggressively exploit the archival data that documents actual episodes of crisis decision-

making. The presumption that scholars should draw inference from crisis outcomes also

infects the existing case study literature on crisis decision-making, and as long as it does,

case studies will suffer from selection effects as much as statistical studies. The key is to

use case studies (and statistical tests) in the appropriate way, considering the selection

effects revealed in the formal model of crisis interactions.

Graphs of these results are available from the authors on request. These varied relationships cause especially serious inferential problems if countries are more likely to use certain policy tools to coerce adversaries in given circumstances (e.g., perhaps they are more likely to move military forces when their interests are high, or they might be more likely to make speeches when their power is low). Inferences drawn from straightforward regression analysis in these circumstances will be biased.


The broader point about research methods is that all the methods that scholars of

international politics have at their disposal are imperfect. Careful, systematic work—by

scholars using any of these research methods—can improve our understanding of

international relations. Each approach has its flaws, and each can offer something to

cover the blind spots of the others. Too often, calls for methodological diversity by

scholars of international relations are based solely on the virtues of broadmindedness and

collegiality. But there is an even stronger case to be made: progress in the field is more

likely if formal theorists, quantitative analysts, and case study researchers all use their

imperfect tools to poke at different sides of the same difficult problems. The real reason

to encourage methodological diversity in international relations is not that all beliefs,

styles, and research methods are necessarily created equal; it is that, given the complexity of

our subject matter and the limits on experimental research, we need careful formal

models, statistical analyses, and process tracing case studies to draw valid inferences.


Appendix

The game tree in Figure 1 illustrates a deterrence crisis between a challenger and

defender. In this model the challenger and the defender each have three potential

strategies. The challenger can choose between (1) “not threaten”; (2) “threaten, not

attack”; and (3) “threaten, attack.” Similarly, the defender has three analogous

strategies: (1) “not mobilize”; (2) “mobilize, not fight”; and (3) “mobilize, fight.”

All the variables are assumed to be common knowledge except each country’s

value for the protégé, AC and AD, which are private information. AC and AD can take any

value, positive or negative.69

Challengers and defenders each estimate their adversary's interest level in the

protégé in terms of two variables. K is the challenger’s expected value for AD, and α is a

measure of the challenger’s uncertainty about AD such that the actual value of AD lies

along the continuum [K-α, K+α]. Similarly, J is the defender's expected value for AC,

and the challenger’s actual value for his interest in the protégé lies on the continuum [J-α,

J+α].70 J, K, and α are all common knowledge.71

69 Positive values express the extent to which a challenger would like to conquer the protégé and the extent to which the defender values the protégé's independence. A negative value for AC suggests that a challenger would prefer not to conquer a particular protégé. Many countries are unappealing targets of conquest – for example, because the cost of administration exceeds the advantages of ownership. Similarly, a negative value for AD suggests that a prospective defender prefers that the challenger conquer a protégé. Perhaps the protégé has significant value and the prospective challenger is an ally, or perhaps the protégé has negative value and the defender hopes the challenger will become stuck in a quagmire. 70 In this formulation, the challenger and defender have the same amount of uncertainty about the other’s value for the protégé. This simplification is probably not accurate; different countries may be easier or harder to “read.” For example, the interests of democracies may be easier to estimate (smaller α) because of the greater transparency of their political process; alternatively, the α of a country might shrink the longer a given leader remains in office, if a significant amount of uncertainty about “interests” is about particular views of specific leaders. Future research could allow α to vary between the challenger and defender to test how changes in α affect bargaining outcomes. 71 This model is based on Fearon 1992, but it differs in two important ways. First, we model the level of interest as the private information variable, while Fearon modeled the "value for war" as the private information variable. As discussed in section 2, the "value for war" is a problematic parameter because it


To determine the relationship between the challenger's estimate of the defender's

interest in the protégé (K) and immediate deterrence success (IDS), we identify the

parameter values that induce challengers to choose “threaten, not attack."72 The range of

challengers that choose this strategy is bounded on the left by the level of AC at which the

challenger would be indifferent between threatening and accepting the status quo

(indicated by AC1 in Figure 2). The range is bounded on the right by the level of AC at

which the challenger would be indifferent between backing down after issuing a threat

and attacking the protégé (indicated by AC2 in Figure 2).73 For a given set of parameters,

the probability that a challenger will back down in a crisis – that is, the probability of IDS

– can be expressed as

$$\frac{A_{C2} - A_{C1}}{(J + \alpha) - A_{C1}}.$$

conflates power and interest; the “value for war” is a function of both the likelihood of winning (power) and the importance of the issue at stake (interest). Models that use a "value for war" parameter are therefore difficult to interpret. Furthermore, models designed to study immediate deterrence that use “value for war” as the private information parameter (specifically including Fearon 1992) face another problem. These models evaluate the probability of immediate deterrence by estimating the proportion of challengers who choose each of three possible strategies: “don’t threaten”, “threaten, not attack”, and “threaten, attack.” However the indifference point between the first two strategies is not a function of the challenger’s value for war. Fearon 1997 revises the formula for the payoffs for fighting a war along the lines we use here, but that article does not explore the implications of this change for crisis dynamics or for empirical studies of coercion. Second, instead of modeling the range of uncertainty in the private information variables' values as running from zero up to a maximum value for war, we add a separate parameter of the model to bound the range of uncertainty (α). This change is important, because in Fearon's original formulation there is a mechanical correlation between the unknown variable’s expected value and the level of uncertainty about its value. This raises a problem because (a) cases with high uncertainty are the ones in which there is room to bluff profitably; (b) in Fearon’s model these cases only arise when there is a high "value for war;" and (c) bluffing is unlikely to succeed when countries have a high value for war because adversaries are likely to choose the "attack" and "fight" branches of the tree. Fearon’s model, therefore, understates both the opportunity for bluffing and, consequently, the frequency of successful immediate deterrence. 
We can learn much more about bluffing behavior and immediate deterrence success by separating the level of interest in the protégé from the level of uncertainty in the model. Schultz 1999 uses a similar formulation to the one in our model in order to separate the level of uncertainty from the expected value for war, but that study still suffers from the problems of using a composite “value for war” parameter rather than a level of interest parameter. 72 “Threaten, not attack” appears to be a negative payoff strategy, but challengers sometimes choose it because they hope that the defender will not mobilize, giving the challenger the protégé for free. 73 For some values of J, K, α, p, RC, RD, FC, and FD, AC2 will be less than AC1. In those situations, immediate deterrence is never possible because any challenger who is willing to threaten is also willing to fight.
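The expression for the probability of IDS can be written as a small function. This is a minimal sketch rather than the authors' Maple code: the function name, the argument names, and the cap at the top of the challenger's type range are assumptions added for illustration. The zero return value corresponds to the case in which AC2 is less than AC1, so that any challenger willing to threaten is also willing to fight.

```python
def ids_probability(a_c1: float, a_c2: float, j: float, alpha: float) -> float:
    """Share of threatening challengers that back down (immediate deterrence success).

    a_c1: challenger value at which threatening and the status quo are equally attractive
    a_c2: challenger value at which backing down and attacking are equally attractive
    j, alpha: center and half-width of the range of possible challenger values
    Assumes a_c1 lies inside the range [j - alpha, j + alpha).
    """
    upper = j + alpha  # highest possible challenger value for the protege
    if a_c1 >= upper:
        raise ValueError("no challenger type is willing to threaten")
    if a_c2 <= a_c1:
        # Every challenger willing to threaten is also willing to fight.
        return 0.0
    # Challengers between a_c1 and a_c2 threaten but do not attack.
    # Capping a_c2 at `upper` is an added assumption to keep the result in [0, 1].
    return (min(a_c2, upper) - a_c1) / (upper - a_c1)


if __name__ == "__main__":
    print(ids_probability(a_c1=1.0, a_c2=2.0, j=2.0, alpha=1.0))
```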


AC1, the indifference point for the “threat/not threat” decision, is influenced by the

challenger's predictions of the likelihood that the defender will choose “mobilize” if the

challenger threatens and “fight” if the challenger attacks the protégé. AC2, the

indifference point for the challenger’s decision whether or not to attack, is influenced by

the additional information that the challenger has gleaned from the defender’s decision to

mobilize, which updated the challenger about the defender's level of interest in the

protégé.74 Mathematically, at the moment when the challenger decides whether or not to

threaten, it estimates the probability that the defender will mobilize as

$$q = \frac{(K + \alpha) - A_{D1}}{2\alpha}.$$

At the challenger's second decision node, whether to attack or not given that the defender

has already mobilized, the challenger estimates the probability that the defender will fight

as

$$y = \frac{(K + \alpha) - A_{D2}}{(K + \alpha) - A_{D1}}.$$
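The challenger's two estimates can be computed directly from K, α, and the defender's two indifference points. The following is a hedged sketch with hypothetical function and argument names; it assumes that AD is uniformly distributed on [K-α, K+α] and that the indifference points fall inside that range, which is what the two ratios imply.

```python
def prob_mobilize(k: float, alpha: float, a_d1: float) -> float:
    """q: challenger's estimate that the defender mobilizes, i.e. that A_D > A_D1."""
    return ((k + alpha) - a_d1) / (2 * alpha)


def prob_fight_given_mobilize(k: float, alpha: float, a_d1: float, a_d2: float) -> float:
    """y: challenger's updated estimate that a defender who has mobilized will fight."""
    return ((k + alpha) - a_d2) / ((k + alpha) - a_d1)


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the article.
    q = prob_mobilize(k=5.0, alpha=2.0, a_d1=4.0)
    y = prob_fight_given_mobilize(k=5.0, alpha=2.0, a_d1=4.0, a_d2=5.5)
    print(q, y, q * y)
```

The product of q and y is the challenger's prior probability that the defender both mobilizes and fights, which shows how observing mobilization raises the challenger's estimate of the defender's willingness to fight.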

AD2 can be readily expressed in terms of parameters whose values are public

knowledge. The defender's expected value for fighting is $[p(0) + (1 - p)(-A_D)] - F_D$, and

its payoff for backing down after mobilizing is just $-A_D - R_D$. So the defender is

indifferent when

$$A_{D2} = \frac{F_D - R_D}{p}.$$
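The indifference condition follows from equating the two defender payoffs just stated. A short derivation, assuming only that p is greater than zero:

```latex
\begin{align*}
(1 - p)(-A_D) - F_D &= -A_D - R_D \\
A_D - (1 - p)A_D &= F_D - R_D \\
p\,A_D &= F_D - R_D \\
A_{D2} &= \frac{F_D - R_D}{p}
\end{align*}
```

Defenders whose value for the protégé exceeds this threshold prefer fighting to backing down after mobilizing.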

The other three variables (AC1, AC2, and AD1) are more complicated, because each

depends on the other two. For example, AC1 is a function of whether or not the defender

will mobilize (the likelihood that AD is greater than AD1), but the defender's mobilization

decision partly depends on the defender’s estimate of the likelihood that the challenger

74 Defenders with low values for AD will cede the protégé without mobilizing. Those that mobilize either value the protégé highly and intend to fight, or they do not value the protégé sufficiently to fight but mobilize in the hope of exposing a challenger bluff (and thereby preserving the protégé and gaining an additional audience benefit by embarrassing the challenger).


will attack,

$$x = \frac{(J + \alpha) - A_{C2}}{(J + \alpha) - A_{C1}}.$$

We can write three simultaneous equations for AC1, AC2, and AD1 as follows:

(1)

$$-R_C = y(1 - p)A_{C2} - F_C + (1 - y)(A_{C2} + R_C)$$

(2)

$$-A_{D1} = \begin{cases} x(-A_{D1} - R_D) + (1 - x)R_D & \text{if } A_{D1} < (F_D - R_D)/p \\ x[(1 - p)(-A_{D1}) - F_D] + (1 - x)R_D & \text{if } A_{D1} > (F_D - R_D)/p \end{cases}$$

(3)

$$0 = q(-R_C) + (1 - q)A_{C1}$$

These nonlinear equations do not have a simple, analytical solution, but they can

be solved numerically. We used Maple 9 mathematical software to study the relationship

between K and IDS.75

75 The worksheets containing the Maple code are available on request.
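For readers who want to reproduce the numerical approach outside Maple, one possible starting point is a function that returns the residuals of equations (1) through (3) for a candidate solution, which can then be passed to any general-purpose root-finder. This is a sketch under stated assumptions, not the authors' worksheet: the `Params` container, the function name, and the clamping of y to the unit interval are hypothetical, and equation (1) is coded as it is printed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    """Publicly known parameters of the model (container name is hypothetical)."""
    j: float
    k: float
    alpha: float
    p: float
    r_c: float
    r_d: float
    f_c: float
    f_d: float


def residuals(a_c1: float, a_c2: float, a_d1: float, prm: Params):
    """Return (e1, e2, e3); all three are zero at a solution of equations (1)-(3)."""
    top_c = prm.j + prm.alpha            # upper end of the challenger's type range
    top_d = prm.k + prm.alpha            # upper end of the defender's type range
    a_d2 = (prm.f_d - prm.r_d) / prm.p   # defender's fight / back-down threshold

    q = (top_d - a_d1) / (2 * prm.alpha)  # probability the defender mobilizes
    y = (top_d - a_d2) / (top_d - a_d1)   # probability of fighting, given mobilization
    y = min(1.0, max(0.0, y))             # clamp to [0, 1] (an added assumption)
    x = (top_c - a_c2) / (top_c - a_c1)   # probability of attack, given a threat

    # Equation (1): challenger indifferent between backing down and attacking.
    e1 = y * (1 - prm.p) * a_c2 - prm.f_c + (1 - y) * (a_c2 + prm.r_c) + prm.r_c

    # Equation (2): defender indifferent between ceding the protege and mobilizing.
    if a_d1 < a_d2:
        rhs = x * (-a_d1 - prm.r_d) + (1 - x) * prm.r_d
    else:
        rhs = x * ((1 - prm.p) * (-a_d1) - prm.f_d) + (1 - x) * prm.r_d
    e2 = rhs + a_d1

    # Equation (3): challenger indifferent between the status quo and threatening.
    e3 = q * (-prm.r_c) + (1 - q) * a_c1
    return e1, e2, e3


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the article.
    prm = Params(j=5.0, k=5.0, alpha=4.0, p=0.5, r_c=1.0, r_d=1.0, f_c=4.5, f_d=4.0)
    print(residuals(a_c1=3.0, a_c2=4.0, a_d1=3.0, prm=prm))
```

In the usage example the candidate point satisfies equations (1) and (3) but not equation (2), so a root-finder would still need to adjust the guess before all three residuals vanish.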


References

Bueno de Mesquita, Bruce, James D. Morrow, and Ethan R. Zorick. 1997. Capabilities, Perceptions, and Escalation. American Political Science Review 91 (1): 15-27.

Brady, Henry E., and David Collier, eds. 2004. Rethinking Social Inquiry: Diverse Tools, Shared Standards. Lanham, MD: Rowman & Littlefield Publishers.

Collier, David, and James Mahoney. 1996. Insights and Pitfalls: Selection Bias in Qualitative Research. World Politics 49 (1): 56-91.

Collier, David, Henry E. Brady, and Jason Seawright. 2004. Sources of Leverage in Causal Inference: Toward an Alternative View of Methodology. In Rethinking Social Inquiry: Diverse Tools, Shared Standards, edited by Henry Brady and David Collier, 229-66. Lanham, MD: Rowman & Littlefield Publishers.

Collier, David, James Mahoney, and Jason Seawright. 2004. Claiming Too Much: Warnings about Selection Bias. In Rethinking Social Inquiry: Diverse Tools, Shared Standards, edited by Henry Brady and David Collier, 85-102. Lanham, MD: Rowman & Littlefield Publishers.

Danilovic, Vesna. 2001a. Conceptual and Selection Bias Issues in Deterrence. Journal of Conflict Resolution 45 (1): 97-125.

Danilovic, Vesna. 2001b. The Sources of Threat Credibility in Extended Deterrence. Journal of Conflict Resolution 45 (3): 341-69.

Documents on German Foreign Policy, 1918-1945. 1949. Series D. Vol. 2. Washington, D.C.: U.S. Government Printing Office.

Documents on German Foreign Policy, 1918-1945. 1983. Series D. Vol. 7. Washington, D.C.: U.S. Government Printing Office.

Fearon, James D. 1992. Threats to Use Force: Costly Signals and Bargaining in International Crises. Ph.D. diss., University of California, Berkeley.

Fearon, James D. 1994. Signaling Versus the Balance of Power and Interests: An Empirical Test of a Crisis Bargaining Model. Journal of Conflict Resolution 38 (X): 236-69.

Fearon, James D. 1997. Signaling Foreign Policy Interests: Tying Hands versus Sinking Costs. Journal of Conflict Resolution 41 (1): 68-90.

Fearon, James D. 2002. Selection Effects and Deterrence. International Interactions 28 (1): 5-29.


George, Alexander, and Andrew Bennett. 2005. Case Studies and Theory Development in the Social Sciences. Cambridge, Mass.: Massachusetts Institute of Technology.

George, Alexander, and Richard Smoke. 1974. Deterrence in American Foreign Policy: Theory and Practice. New York: Columbia University Press.

Huth, Paul K. 1999. Deterrence and International Conflict: Empirical Findings and Theoretical Debates. Annual Review of Political Science 2 (X): 25-48.

Huth, Paul K., and Todd L. Allee. 2002. Domestic Political Accountability and the Escalation and Settlement of Interstate Disputes. Journal of Conflict Resolution 46 (6): 754-90.

Huth, Paul, and Todd Allee. 2004. Research Design in Testing Theories of International Conflict. In Models, Numbers, and Cases: Methods for Studying International Relations, edited by Detlef F. Sprinz and Yael Wolinsky-Nahmias, 193-223. Ann Arbor, Mich.: University of Michigan Press.

Huth, Paul, Christopher Gelpi, and D. Scott Bennett. 1993. The Escalation of Great Power Militarized Disputes: Testing Rational Deterrence Theory and Structural Realism. American Political Science Review 87 (3): XX-XX.

Huth, Paul, and Bruce Russett. 1984. What Makes Deterrence Work? Cases from 1900 to 1980. World Politics 36 (4): 496-526.

Huth, Paul, and Bruce Russett. 1990. Testing Deterrence Theory: Rigor Makes a Difference. World Politics 42 (4): 466-501.

King, Gary, Robert O. Keohane, and Sidney Verba. 1994. Designing Social Inquiry. Princeton, N.J.: Princeton University Press.

Lebow, Richard N., and Janice G. Stein. 1989. I Think, Therefore I Deter. World Politics 41 (2): 208-24.

Lebow, Richard N., and Janice G. Stein. 1990. Deterrence: The Elusive Dependent Variable. World Politics 42 (3): 336-69.

Leeds, Brett Ashley. 2003. Do Alliances Deter Aggression? The Influence of Military Alliances on the Initiation of Militarized Interstate Disputes. American Journal of Political Science 47 (3): 427-39.

Lewis, Jeffrey B., and Kenneth A. Schultz. 2003. Revealing Preferences: Empirical Estimation of a Crisis Bargaining Game with Incomplete Information. Political Analysis 11 (4): 345-67.

May, Ernest, and Philip Zelikow, eds. 1997. The Kennedy Tapes: Inside the White House During the Cuban Missile Crisis. Cambridge, Mass.: Harvard University Press.


Michaelis, Herbert, and Ernst Schraepler, eds. 1979. Ursachen und Folgen: Vom Deutschen Zusammenbruch 1918 und 1945 bis zur staatlichen Neuordnung Deutschlands in der Gegenwart [Causes and Consequences: From the German collapse in 1918 and 1945 to the reorganization of the present German state]. Vol. 12. Berlin: Dokumenten-Verlag Dr. Herbert Wendler.

Müller, Klaus-Jürgen. 1980. General Ludwig Beck: Studien und Dokumente zur politisch-militärischen Vorstellungswelt und Tätigkeit des Generalstabschefs des deutschen Heeres, 1933-1938 [General Ludwig Beck: Studies and documents on the political-military worldview and actions of the chief of the general staff of the German army, 1933-1938]. Boppard am Rhein: Harald Boldt Verlag.

Nooruddin, Irfan. 2002. Modeling Selection Bias in Studies of Sanctions Efficacy. International Interactions 28 (1): 59-75.

Partell, Peter J., and Glenn Palmer. 1999. Audience Costs and Interstate Crises: An Empirical Assessment of Fearon's Model of Dispute Outcomes. International Studies Quarterly 43 (X): 389-405.

Sartori, Anne E. 2002. The Might of the Pen: A Reputational Theory of Communication in International Disputes. International Organization 56 (1): 123-51.

Schultz, Kenneth A. 1998. Domestic Opposition and Signaling in International Crises. American Political Science Review 92 (4): 829-44.

Schultz, Kenneth A. 1999. Do Democratic Institutions Constrain or Inform? Contrasting Two Institutional Perspectives on Democracy and War. International Organization 53 (2): 233-66.

Schultz, Kenneth A. 2001. Looking for Audience Costs. Journal of Conflict Resolution 45 (1): 32-60.

Signorino, Curtis S. 1999. Strategic Interaction and the Statistical Analysis of International Conflict. American Political Science Review 93 (2): 279-97.

Signorino, Curtis S., and Kuzey Yilmaz. 2003. Strategic Misspecification in Regression Models. American Journal of Political Science 47 (3): 551-66.

Smith, Alastair. 1996. To Intervene or Not to Intervene: A Biased Decision. Journal of Conflict Resolution 40 (1): 16-40.

Smith, Alastair. 1999. Testing Theories of Strategic Choice: The Example of Crisis Escalation. American Journal of Political Science 43 (4): 1254-1283.


Sprinz, Detlef F., and Yael Wolinsky-Nahmias, eds. 2004. Models, Numbers, and Cases: Methods for Studying International Relations. Ann Arbor, Mich.: University of Michigan Press.

Van Evera, Stephen. 1997. Guide to Methods for Students of Political Science. Ithaca, N.Y.: Cornell University Press.