Upload
amika7sethi
View
1.511
Download
3
Tags:
Embed Size (px)
DESCRIPTION
Delphi method
Citation preview
International Journal of Forecasting 15 (1999) 353–375www.elsevier.com/ locate / ijforecast
The Delphi technique as a forecasting tool: issues and analysis
a , b*Gene Rowe , George WrightaInstitute of Food Research, Norwich Research Park, Colney, Norwich NR4 7UA, UK
bStrathclyde Graduate Business School, Strathclyde University, 199 Cathedral Street, Glasgow G4 0QU, UK
Abstract
This paper systematically reviews empirical studies looking at the effectiveness of the Delphi technique, and provides acritique of this research. Findings suggest that Delphi groups outperform statistical groups (by 12 studies to two with two‘ties’) and standard interacting groups (by five studies to one with two ‘ties’), although there is no consistent evidence thatthe technique outperforms other structured group procedures. However, important differences exist between the typicallaboratory version of the technique and the original concept of Delphi, which make generalisations about ‘Delphi’ per sedifficult. These differences derive from a lack of control of important group, task, and technique characteristics (such as therelative level of panellist expertise and the nature of feedback used). Indeed, there are theoretical and empirical reasons tobelieve that a Delphi conducted according to ‘ideal’ specifications might perform better than the standard laboratoryinterpretations. It is concluded that a different focus of research is required to answer questions on Delphi effectiveness,focusing on an analysis of the process of judgment change within nominal groups. 1999 Elsevier Science B.V. All rightsreserved.
Keywords: Delphi; Interacting groups; Statistical group; Structured groups
1. Introduction been largely inappropriate, such that our knowledgeabout the potential of Delphi is still poor. In essence,
Since its design at the RAND Corporation over 40 this paper relates a critique of the methodology ofyears ago, the Delphi technique has become a widely evaluation research and suggests that we will acquireused tool for measuring and aiding forecasting and no great knowledge of the potential benefits ofdecision making in a variety of disciplines. But what Delphi until we adopt a different methodologicaldo we really understand about the technique and its approach, which is subsequently detailed.workings, and indeed, how is it being employed? Inthis paper we adopt the perspective of Delphi as ajudgment or forecasting or decision-aiding tool, and 2. The nature of Delphiwe review the studies that have attempted to evaluateit. These studies are actually sparse, and, we will The Delphi technique has been comprehensivelyargue, their attempts at evaluating the technique have reviewed elsewhere (e.g., Hill & Fowles, 1975;
´Linstone & Turoff, 1975; Lock, 1987; Parente &´Anderson-Parente, 1987; Stewart, 1987; Rowe,*Corresponding author. Tel.: 144-1603-255-125.
E-mail address: [email protected] (G. Rowe) Wright & Bolger, 1991), and so we will present a
0169-2070/99/$ – see front matter 1999 Elsevier Science B.V. All rights reserved.PI I : S0169-2070( 99 )00018-7
354 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375
brief review only. The Delphi technique was de- invalid criteria (such as the status of an idea’sveloped during the 1950s by workers at the RAND proponent). Furthermore, with the iteration of theCorporation while involved on a U.S. Air Force questionnaire over a number of rounds, the indi-sponsored project. The aim of the project was the viduals are given the opportunity to change theirapplication of expert opinion to the selection – from opinions and judgments without fear of losing face inthe point of view of a Soviet strategic planner – of the eyes of the (anonymous) others in the group.an optimal U.S. industrial target system, with a Between each questionnaire iteration, controlledcorresponding estimation of the number of atomic feedback is provided through which the group mem-bombs required to reduce munitions output by a bers are informed of the opinions of their anonymousprescribed amount. More generally, the technique is colleagues. Often feedback is presented as a simpleseen as a procedure to ‘‘obtain the most reliable statistical summary of the group response, usuallyconsensus of opinion of a group of experts . . . by a comprising a mean or median value, such as theseries of intensive questionnaires interspersed with average ‘group’ estimate of the date by when ancontrolled opinion feedback’’ (Dalkey & Helmer, event is forecast to occur. Occasionally, additional1963, p. 458). In particular, the structure of the information may also be provided, such as argumentstechnique is intended to allow access to the positive from individuals whose judgments fall outside cer-attributes of interacting groups (knowledge from a tain pre-specified limits. In this manner, feedbackvariety of sources, creative synthesis, etc.), while comprises the opinions and judgments of all grouppre-empting their negative aspects (attributable to members and not just the most vocal. At the end ofsocial, personal and political conflicts, etc.). From a the polling of participants (i.e., after several roundspractical perspective, the method allows input from a of questionnaire iteration), the group judgment islarger number of participants than could feasibly be taken as the statistical average (mean/median) of theincluded in a group or committee meeting, and from panellists’ estimates on the final round. The finalmembers who are geographically dispersed. judgment may thus be seen as an equal weighting of
Delphi is not a procedure intended to challenge the members of a staticized group.statistical or model-based procedures, against which The above four characteristics are necessary defin-human judgment is generally shown to be inferior: it ing attributes of a Delphi procedure, although thereis intended for use in judgment and forecasting are numerous ways in which they may be applied.situations in which pure model-based statistical The first round of the classical Delphi proceduremethods are not practical or possible because of the (Martino, 1983) is unstructured, allowing the in-lack of appropriate historical /economic / technical dividual experts relatively free scope to identify, anddata, and thus where some form of human judg- elaborate on, those issues they see as important.mental input is necessary (e.g., Wright, Lawrence & These individual factors are then consolidated into aCollopy, 1996). Such input needs to be used as single set by the monitor team, who produce aefficiently as possible, and for this purpose the structured questionnaire from which the views, opin-Delphi technique might serve a role. ions and judgments of the Delphi panellists may be
Four key features may be regarded as necessary elicited in a quantitative manner on subsequentfor defining a procedure as a ‘Delphi’. These are: rounds. After each of these rounds, responses areanonymity, iteration, controlled feedback, and the analysed and statistically summarised (usually intostatistical aggregation of group response. Anonymity medians plus upper and lower quartiles), which areis achieved through the use of questionnaires. By then presented to the panellists for further considera-allowing the individual group members the oppor- tion. Hence, from the third round onwards, panelliststunity to express their opinions and judgments are given the opportunity to alter prior estimates onprivately, undue social pressures – as from dominant the basis of the provided feedback. Furthermore, ifor dogmatic individuals, or from a majority – should panellists’ assessments fall outside the upper orbe avoided. Ideally, this should allow the individual lower quartiles they may be asked to give reasonsgroup members to consider each idea on the basis of why they believe their selections are correct againstmerit alone, rather than on the basis of potentially the majority opinion. This procedure continues until
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375 355
a certain stability in panellists’ responses is achieved. of controlled experimentation or academic inves-The forecast or assessment for each item in the tigation that has taken place . . . (and that) asidequestionnaire is typically represented by the median from some RAND studies by Dalkey, most ‘evalua-on the final round. tions’ of the technique have been secondary efforts
An important point to note here is that variations associated with some application which was thefrom the above Delphi ideal do exist (Linstone, primary interest’’ (p. 11). Indeed, the relative paucity1978; Martino, 1983). Most commonly, round one is of evaluation studies noted by Linstone and Turoff isstructured in order to make the application of the still evident. Although a number of empirical exami-procedure simpler for the monitor team and panel- nations have subsequently been conducted, the bulklists; the number of rounds is variable, though of Delphi references still concern applications ratherseldom goes beyond one or two iterations (during than evaluations. That is, there appears to be awhich time most change in panellists’ responses widespread assumption that Delphi is a useful instru-generally occurs); and often, panellists may be asked ment that may be used to measure some kind of truthfor just a single statistic – such as the date by when and it is mainly represented in the literature in thisan event has a 50% likelihood of occurring – rather vein. Consideration of what the evaluation studiesthan for multiple figures or dates representing de- actually tell us about Delphi is the focus of thisgrees of confidence or likelihood (e.g., the 10% and paper.90% likelihood dates), or for written justifications ofextreme opinions or judgments. These simplificationsare particularly common in laboratory studies and 4. Evaluative studies of Delphihave important consequences for the generalisabilityof research findings. We attempted to gather together details of all
published (English-language) studies involvingevaluation of the Delphi technique. There were a
3. The study of Delphi number of types of studies that we decided not toinclude in our analysis. Unpublished PhD theses,
Since the 1950s, use of Delphi has spread from its technical reports (e.g., of the RAND Corporation),origins in the defence community in the U.S.A. to a and conference papers were excluded because theirwide variety of areas in numerous countries. Its quality is less assured than peer-reviewed journalapplications have extended from the prediction of articles and (arguably) book chapters. It may also belong-range trends in science and technology to argued that if the studies reported in these excludedapplications in policy formation and decision mak- formats were significant and of sufficient quality thening. An examination of recent literature, for example, they would have appeared in conventional publishedreveals how widespread is the use of Delphi, with form.applications in areas as diverse as the health care We searched through nine computer databases:industry (Hudak, Brooke, Finstuen & Riley, 1993), ABI Inform Global, Applied Science and Technolo-marketing (Lunsford & Fussell, 1993), education gy Index, ERIC, Transport, Econlit, General Science(Olshfski & Joseph, 1991), information systems Index, INSPEC, Sociofile, and Psychlit. As search(Neiderman, Brancheau & Wetherbe, 1991), and terms we used: ‘Delphi and Evaluation’, ‘Delphi andtransportation and engineering (Saito & Sinha, Experiment’, and ‘Delphi and Accuracy’. Because1991). our interest is in experimental evaluations of Delphi,
Linstone and Turoff (1975) characterised the we excluded any ‘hits’ that simply reported Delphigrowth of interest in Delphi as from non-profitmak- as used in some application (e.g., to ascertain experting organisations to government, industry and, final- views on a particular topic), or that omitted keyly, to academe. This progression caused them some details of method or results because Delphi per seconcern, for they went on to suggest that the was not the focus of interest (e.g., Brockhoff, 1984).‘explosive growth’ in Delphi applications may have Several other studies were excluded because weappeared ‘‘ . . . incompatible with the limited amount believe they yielded no substantial findings, or
356 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375
because they were of high complexity with uncertain authors whether there were any evaluative studiesfindings. For example, a paper by Van Dijk (1990) that we had overlooked. The eight replies caused useffectively used six different Delphi-like techniques to make two minor changes to our tables. None ofthat differed in terms of the order in which each of the replies noted evaluative studies of which we werethree forms of panellist interaction were used in each unaware.of three ‘Delphi’ rounds, resulting in a highly Details of the Delphi evaluative studies are sum-complex, difficult to encode and, indeed, difficult to marised in Tables 1 and 2. Table 1 describes theinterpret study. characteristics of each experimental scenario, includ-
The final type of studies that we excluded from ing the nature of the task and the way in whichour search ‘hits’ were those that considered the Delphi was constituted. Table 2 reports the findingsuniversal validity or reliability of Delphi without of the studies plus ‘additional comments’ (which alsoreference to control conditions or other techniques. concern details in Table 1). Ideally, the two tablesFor example, the study of Ono and Wedermeyer should be joined to form one comprehensive table,(1994) reported that a Delphi panel produced fore- but space constraints forced the division.casts that were over 50% accurate, and used this as We have tried to make the tables as comprehen-‘evidence’ that the technique is somehow a valid sive as possible without obscuring the main findingspredictor of the future. However, the validity of the and methodological features by descriptions of minortechnique, in this sense, will depend as much on the or coincidental results and details, or by repetition ofnature of the panellists and the task as on the common information. For example, a number oftechnique itself, and it is not sensible to suggest that Delphi studies used post-task questionnaires to assessthe percentage accuracy achieved here is in any way panellists’ reactions to the technique, with individualgeneralisable or fundamental (i.e., to suggest that questions ranging from how ‘satisfied’ subjects feltusing Delphi will generally result in 50% correct about the technique, to how ‘enjoyable’ they found itpredictions). The important questions – not asked in (e.g., Van de Ven & Delbecq, 1974; Scheibe et al.,this study – are whether the application of the 1975; Rohrbaugh, 1979; Boje & Murnighan, 1982).technique helped to improve the forecasting of the Clearly, to include every analysis or significantindividual panellists, and how effective the technique correlation involving such measures would result inwas relative to other possible forecasting procedures. an unwieldy mass of data, much of which has littleSimilarly, papers by Felsenthal and Fuchs (1976), theoretical importance and is unlikely to be repli-Dagenais (1978), and Kastein, Jacobs, van der Hell, cated or considered in future studies. We have alsoLuttik and Touw-Otten (1993) – which reported data been constrained by lack of space: so, while we haveon the reliability of Delphi-like groups – are not attempted to indicate how aspects such as theconsidered further, for they simply demonstrate that dependent variables were measured (Table 2), wea degree of reliability is possible using the technique. have generally been forced to give only brief de-
The database search produced 19 papers relevant scriptions in place of the complex scoring formulaeto our present concern. The references in these used. In other cases where descriptions seem overlypapers drew our attention to a further eight studies, brief, the reader is referred to the main text whereyielding 27 studies in all. Roughly following the matters are explained more fully.procedure of Armstrong and Lusk (1987), we sent Regarding Table 1, the studies have been classi-out information to the authors of these papers. We fied according to their aims and objectives as ‘Appli-identified the current addresses of the first-named cation’, ‘Process’ or ‘Technique-comparison’ type.authors of 25 studies by using the ISI Social Citation Application studies use Delphi in order to gaugeIndex database. We produced five prototype tables expert opinion on a particular topic, but considersummarising the methods and findings of these issues on the process or quality of the technique as aevaluative papers (discussed shortly), which we secondary concern. Process studies focus on aspectsasked the authors to consider and comment upon, of the internal features of Delphi, such as the role ofparticularly with regard to our coding and interpreta- feedback, the impact of the size of Delphi groups,tion of each author’s own paper. We also asked the and so on. Technique-comparison studies evaluate
G.
Row
e,G
.W
right/
InternationalJournal
ofF
orecasting15
(1999)353
–375357
Table 1aSummary of the methodological features of Delphi in experimental studies
Study Study Delphi Rounds Nature of Delphi Nature of Task Incentives
type group size feedback subjects offered?
Dalkey and Helmer (1963) Application 7 5 Individual estimates Professionals: e.g., economists, Hypothetical event: number of No
(process) (bombing schedules) systems analysts, electronic bombs to reduce munitions
engineer output by stated amount
Dalkey, Brown and Cochran (1970) Process 15–20 2 Unclear Students: no expertise on task Almanac questions (20 per No
group, 160 in total)
Jolson and Rossow (1971) Process 11 and 14 3 Medians (normalised Professionals: computing corporation Forecast (demand for classes: No
probability medians) staff, and naval personnel one question)
Almanac (two questions)
Gustafson, Shukla, Delbecq and Walster (1973) Technique 4 2 Individual estimates Students: no expertise on task Almanac (eight questions requiring No
comparison (likelihood ratios) probability estimates in the
form of likelihood ratios)
Best (1974) Process 14 2 Medians, range of estimates, Professionals: faculty members Almanac (two questions requiring No
distributions of expertise of college numerical estimates of known
ratings, and, for one group, quantity), and Hypothetical event
reasons (one probability estimation)
Van de Ven and Delbecq (1974) Technique 7 2 Pooled ideas of group Students (appropriate expertise), Idea generation (one item: defining No
comparison members Professionals in education job description of student
dormitory counsellors)
Scheibe, Skutsch and Schofer (1975) Process Unspecified 5 Means, frequency Students (uncertain expertise Goals Delphi (evaluating No
(possibly 21) distribution, reasons (from on task) hypothetical transport facility
Ps distant from mean) alternatives via rating goals)
Mulgrave and Ducanis (1975) Process 98 3 Medians, interquartile Students enrolled in Educational Almanac (10 general, and 10 No
range, own responses to Psychology class (most of related to teaching) and Hypo-
previous round whom were school teachers) thetical (18 on what values
in U.S. will be, and then should be)
358 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375
Tab
le1.
Con
tinue
d
Stud
ySt
udy
Del
phi
Rou
nds
Nat
ure
ofD
elph
iN
atur
eof
Task
Ince
ntiv
es
type
grou
psi
zefe
edba
cksu
bjec
tsof
fere
d?
Bro
ckho
ff(1
975)
Tech
niqu
e5,
7,9,
115
Med
ians
,rea
sons
(of
Prof
essi
onal
s:ba
nkin
gA
lman
acan
dFo
reca
stin
gY
es
com
paris
on,
thos
ew
ithre
spon
ses
(8–1
0ite
ms
onfin
ance
,ban
king
,
Proc
ess
outs
ide
ofqu
artil
es)
stoc
kqu
otat
ions
,for
eign
trade
,
for
each
type
)
Min
er(1
979)
Tech
niqu
e4
Up
to7
Not
spec
ified
Stud
ents
:fr
omdi
vers
ePr
oble
mSo
lvin
g(ro
le-p
layi
ngN
o
com
paris
onun
derg
rad
popu
latio
nsex
erci
seas
sign
ing
wor
kers
to
suba
ssem
bly
posi
tions
)
Roh
rbau
gh(1
979)
Tech
niqu
e5–
73
Med
ians
,ran
ges
Stud
ents
(app
ropr
iate
expe
rtise
)Fo
reca
stin
g(c
olle
gepe
rfor
man
ceN
o
com
paris
onre
grad
epo
int
aver
age
of
40fr
eshm
en)
Fisc
her
(198
1)Te
chni
que
32
Indi
vidu
ales
timat
esSt
uden
ts(a
ppro
pria
teex
perti
se)
Giv
ing
prob
abili
tyof
grad
epo
int
Yes
com
paris
on(s
ubje
ctiv
epr
obab
ility
aver
ages
of10
fres
hmen
falli
ng
dist
ribut
ions
)in
four
rang
es
Boj
ean
dM
urni
ghan
(198
2)Te
chni
que
3,7,
113
Indi
vidu
ales
timat
es,
Stud
ents
(no
expe
rtise
onta
sk)
Alm
anac
(two
ques
tions
Jupi
ter
and
No
com
paris
onon
ere
ason
from
each
PD
olla
rs)
and
Subj
ectiv
eLi
kelih
ood
(pro
cess
)(tw
oqu
estio
ns:
Wei
ght
and
Hei
ght)
Spin
elli
(198
3)A
pplic
atio
n20
4M
odes
(via
hist
ogra
m),
Prof
essi
onal
s:ed
ucat
ion
Fore
cast
ing
(like
lihoo
dof
34N
o
(pro
cess
)(2
4in
itial
ly)
dist
ribut
ions
,rea
sons
educ
atio
n-re
late
dev
ents
occu
rrin
g
in20
year
s,on
1–6
scal
e)
´La
rrec
hean
dM
oinp
our
(198
3)Te
chni
que
54
Mea
ns,l
owes
tan
dM
BA
stud
ents
(with
som
eFo
reca
stin
g(m
arke
tsh
are
ofN
o
com
paris
onhi
ghes
tes
timat
es,
expe
rtise
inbu
sine
ssar
eapr
oduc
tbas
edon
eigh
tcom
mun
icat
ions
reas
ons
ofta
sk)
plan
s;hi
stor
ical
data
prov
ided
)
Rig
gs(1
983)
Tech
niqu
e4–
52
Mea
ns(o
fpo
int
spre
ads)
Stud
ents
Fore
cast
ing
(poi
ntsp
read
sof
two
No
com
paris
onco
llege
foot
ball
gam
esta
king
plac
e4
wee
ksin
futu
re)
´Pa
rent
eet
al.(
1984
)Pr
oces
s80
2Pe
rcen
tage
ofgr
oup
who
Stud
ents
Fore
cast
ing
(30
scen
ario
son
No
pred
icte
dev
ent
tooc
cur,
econ
omic
,pol
itica
lan
d
med
ian
time
scal
em
ilita
ryev
ents)
Erff
mey
eran
dLa
ne(1
984)
Tech
niqu
e4
6In
divi
dual
estim
ates
Stud
ents
(no
expe
rtise
onta
sk)
Hyp
othe
tical
even
t(M
oon
Surv
ival
No
com
paris
on(ra
nks)
,rea
sons
Prob
lem
,inv
olvi
ngra
nkin
g15
item
sfo
rsu
rviv
alva
lue)
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375 359
Erff
mey
er,E
rffm
eyer
and
Lane
(198
6)Pr
oces
s4
6In
divi
dual
estim
ates
Stud
ents
(no
expe
rtise
onta
sk)
Hyp
othe
tical
even
t(M
oon
Surv
ival
No
(rank
s),r
easo
nsPr
oble
m,i
nvol
ving
rank
ing
15
item
sfo
rsu
rviv
alva
lue)
Die
tz(1
987)
Proc
ess
83
Mea
n,m
edia
n,up
per
and
Stud
ents
:st
ate
gove
rnm
ent
Fore
cast
ing
(pro
porti
onof
No
low
erqu
antit
ies,
plus
anal
ysts
orem
ploy
edin
citiz
ens
votin
gye
son
ballo
t
just
ifica
tions
onr3
publ
icse
ctor
firm
sin
Cal
iforn
iael
ectio
n)
Snie
zek
(198
9)Te
chni
que
5Va
ried
Med
ians
Stud
ents
(no
expe
rtise
onta
sk)
Fore
cast
ing
(of
dolla
ram
ount
ofY
es
com
paris
on#
5ne
xtm
onth
’ssa
les
atca
mpu
sst
ore
refo
urca
tego
ries
ofite
ms)
Snie
zek
(199
0)Te
chni
que
5,
6M
edia
nsSt
uden
ts(n
oex
perti
seon
task
)Fo
reca
stin
g(fi
veite
ms
conc
erne
dY
es
com
paris
onw
ithec
onom
icva
riabl
es:
fore
cast
s
wer
efo
r3
mon
ths
inth
efu
ture
)
Tayl
or,P
ease
and
Rei
d(1
990)
Proc
ess
593
Indi
vidu
alre
spon
ses
(i.e.
Elem
enta
rysc
hool
Polic
y(id
entif
ying
the
qual
ities
ofN
o
chos
ench
arac
teris
tics)
teac
hers
effe
ctiv
ein
serv
ice
prog
ram
:
dete
rmin
ing
top
seve
nch
arac
teris
tics)
Leap
e,Fr
esho
ur,Y
ntem
aan
dH
siao
(199
2)Te
chni
que
19,1
12
and
3D
istri
butio
nof
Surg
eons
Polic
y(ju
dged
intra
serv
ice
wor
kN
o
com
paris
onnu
mer
ical
eval
uatio
nsre
quire
dto
perf
orm
each
of55
ofea
chof
the
55ite
ms
serv
ices
,via
mag
nitu
dees
timat
ion)
Gow
anan
dM
cNic
hols
(199
3)Pr
oces
s15
inr1
3R
espo
nse
freq
uenc
yor
Ban
klo
anof
ficer
san
dH
ypot
hetic
alev
ent
(eva
luat
ion
ofN
o
(app
licat
ion)
regr
essi
onw
eigh
tsor
smal
lbu
sine
ssco
nsul
tant
slik
elih
ood
ofa
com
pany
’s
indu
ced
(if-th
en)
rule
sec
onom
icvi
abili
tyin
futu
re)
Hor
nsby
,Sm
ithan
dG
upta
(199
4)Te
chni
que
53
Stat
istic
alav
erag
eof
Stud
ents
(no
parti
cula
rex
perti
sePo
licy
(eva
luat
ing
four
jobs
usin
gth
eN
o
com
paris
onpo
ints
for
each
fact
orbu
tso
me
task
train
ing)
nine
-fac
tor
Fact
orEv
alua
tion
Syst
em)
Row
ean
dW
right
(199
6)Pr
oces
s5
2M
edia
ns,i
ndiv
idua
lSt
uden
ts(n
opa
rticu
lar
expe
rtise
Fore
cast
ing
(of
15po
litic
alN
o
num
eric
alre
spon
ses,
thou
ghite
ms
sele
cted
tobe
and
econ
omic
even
tsfo
r
reas
ons
perti
nent
toth
em)
2w
eeks
into
futu
re)
a*A
naly
sis
sign
ifica
ntat
the
P,
0.05
leve
l;N
S,an
alys
isno
tsig
nific
anta
tthe
P,
0.05
leve
l(tre
ndon
ly);
ND
A,n
ost
atis
tical
anal
ysis
ofth
eno
ted
rela
tions
hip
was
repo
rted;
.,g
reat
er/b
ette
rth
an;
5,n
o(s
tatis
tical
lysi
gnifi
cant
)
diff
eren
cebe
twee
n;D
,Del
phi;
R,fi
rst
roun
dav
erag
eof
Del
phi
polls
;S,
stat
iciz
edgr
oup
appr
oach
(ave
rage
ofno
n-in
tera
ctin
gse
tof
indi
vidu
als)
.Whe
nus
ed,t
his
indi
cate
sth
atse
para
test
atic
ized
grou
psw
ere
used
inst
ead
of(o
rin
addi
tion
to)
the
aver
age
ofth
ein
divi
dual
pane
llist
sfr
omth
efir
stro
und
Del
phi
polls
(i.e.
,ins
tead
of/i
nad
ditio
nto
‘R’)
;I,
inte
ract
ing
grou
p;N
GT,
Nom
inal
Gro
upTe
chni
que;
r,‘r
ound
’;P,
‘pan
ellis
t’;N
A,n
otap
plic
able
.
360 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375
Tab
le2
aR
esul
tsof
the
expe
rim
enta
lD
elph
ist
udie
s
Stud
yIn
depe
nden
tDe
pend
ent
Resu
ltsof
techn
ique
Othe
rres
ults,
Addi
tiona
lcom
men
ts
varia
bles
varia
bles
com
paris
ons
e.g.r
epr
oces
ses
b,c
(and
levels
)(an
dm
easu
res)
Dalk
eyan
dHe
lmer
(196
3)Ro
unds
Conv
erge
nce
(diff
eren
cein
NAIn
crea
sed
conv
erge
nce
(cons
ensu
s)A
com
plex
desig
nwh
ere
quan
titati
vefe
edba
ck
estim
ates
over
roun
ds)
over
roun
ds(N
DA)
from
pane
llists
was
only
supp
lied
inro
und
4,
and
from
only
two
othe
rpan
ellist
s
Dalk
eyet
al.(1
970)
Roun
dsAc
cura
cy(lo
garit
hmof
med
iangr
oup
Accu
racy
:D.
R(N
DA)
High
self-
rated
know
ledge
abili
tywa
sLi
ttle
statis
tical
analy
sisre
porte
d.
estim
atedi
vide
dby
true
answ
er)po
sitiv
elyre
lated
toac
cura
cy(N
DA)
Littl
ede
tailo
nex
perim
ental
meth
od,e
.g.n
ocle
ar
Self-
rated
know
ledge
(1–5
scale
)de
scrip
tion
ofro
und
2pr
oced
ure,
orfe
edba
ck
Jolso
nan
dRo
ssow
(197
1)Ex
perti
se(ex
perts
vs.n
on-e
xper
ts)Ac
cura
cy(n
umer
icale
stim
ates
onth
eAc
cura
cy:D
.R
(expe
rts)(
NDA)
Incr
ease
dco
nsen
sus
over
roun
ds*.
Often
repo
rted
study
.Use
da
small
sam
ple,
Roun
dstw
oalm
anac
ques
tions
only
).Ac
cura
cy:D
,R
(non
-exp
erts)
(NDA
)Cl
aimed
relat
ions
hip
betw
een
high
r1ac
cura
cyso
little
statis
tical
analy
siswa
spo
ssib
le
Cons
ensu
s(d
istrib
utio
nof
Pre
spon
ses)
and
low
varia
nce
over
roun
ds(‘s
igni
fican
t’)
Gusta
fson
etal.
(197
3)Te
chni
ques
(NGT
,I,D
,S)
Accu
racy
(abso
lute
mea
npe
rcen
tage
Accu
racy
:NGT
.I.
S.D
(*fo
rtec
hniq
ues)
NAIn
form
ation
give
nto
allpa
nelli
ststo
cont
rol
Roun
dser
ror).
Accu
racy
:D,
R(N
S)fo
rdiff
eren
ces
inkn
owled
ge.T
his
appr
oach
Cons
ensu
s(v
arian
ceof
estim
ates
Cons
ensu
s:NG
T5D
5I5
Sse
ems
atod
dsto
ratio
nale
ofus
ing
Delp
hiwi
th
acro
ssgr
oups
)he
terog
enou
sgr
oups
Best
(197
4)Fe
edba
ck(st
atisti
cvs
.rea
sons
)Ac
cura
cy(ab
solu
tede
viati
onof
Accu
racy
:D.
R*Hi
ghse
lf-ra
tedex
perti
sepo
sitiv
elyre
lated
Self-
rated
expe
rtise
was
used
asbo
tha
Self-
rated
expe
rtise
(hig
hvs
.low
)as
sess
men
tsfro
mtru
eva
lue).
Accu
racy
:D(1
reas
ons).
D(no
reas
ons)*
toac
cura
cy,a
ndsu
b-gr
oups
ofse
lf-ra
tedde
pend
enta
ndin
depe
nden
tvar
iable
Roun
dsSe
lf-ra
tedex
perti
se(1
–11
scale
)(fo
rone
oftw
oite
ms)
expe
rtsm
ore
accu
rate
than
non-
expe
rts*
Van
deVe
nan
dDe
lbec
q(1
974)
Tech
niqu
es(N
GT,I
,D)
Idea
quan
tity
(num
bero
funi
que
idea
s).Nu
mbe
rofi
deas
:D.
I*;N
GT.
D(N
S)NA
Two
open
-end
edqu
estio
nswe
reals
oas
ked
Satis
facti
on(fi
veite
ms
rated
on1–
5sc
ales).
Satis
facti
on:N
GT.
D*,I
*ab
outl
ikes
and
disli
kes
conc
erni
ngth
e
Effe
ctive
ness
(com
posit
em
easu
reof
idea
Effe
ctive
ness
:NGT
*,D*
.I
proc
edur
es
quan
tity
and
satis
facti
on,e
quall
ywe
ight
ed)
Sche
ibe
etal.
(197
5)Ra
ting
scale
s(ra
nkin
g,ra
ting-
Confi
denc
e(as
sess
edby
seve
nite
mNA
Littl
eev
iden
ceof
corre
latio
nbe
twee
nre
spon
sePo
orde
sign
mea
ntno
thin
gco
uld
beco
nclu
ded
scale
sm
ethod
,and
pair
com
paris
ons)
ques
tionn
aire)
chan
gean
dlo
wco
nfide
nce,
exce
ptin
r2(*
).ab
outc
onsis
tency
ofus
ing
diffe
rent
ratin
gsc
ales.
Roun
dsCh
ange
(five
aspe
cts,e
.g.f
rom
r1to
r2)
False
feed
back
initi
ally
drew
resp
onse
sto
it(N
DA)
Miss
ing
info
rmati
onon
desig
nde
tails
Mul
grav
ean
dDu
cani
s(1
975)
Expe
rtise
(e.g.
,on
item
sab
out
Chan
ge(%
chan
ging
answ
ers)
NAHi
ghdo
gmati
smre
lated
tohi
ghch
ange
over
Abr
iefstu
dy,w
ithlit
tledi
scus
sion
ofth
e
teach
ing
Psco
nsid
ered
expe
rt,Do
gmati
sm(o
nRo
keac
hDo
gmati
smro
unds
*,m
oste
vide
nton
item
son
which
the
unex
pecte
d/co
unter
-intu
itive
resu
lts
onge
nera
litem
sth
eyno
n-ex
pert)
Scale
)m
ostd
ogm
atic
were
relat
ively
inex
pert
Roun
ds
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375 361
Broc
khof
f(19
75)
Tech
niqu
es(D
,I)
Accu
racy
(abso
lute
erro
r)Ac
cura
cy:D
.I(
alman
acite
ms)
(NDA
)No
clear
resu
ltson
influ
ence
ofgr
oup
size.
Sign
ifica
ntdi
ffere
nces
inac
cura
cyof
grou
ps
Grou
psiz
e(5,
7,9,
11,D
;4,6
,8,1
0,I)
Self-
rated
expe
rtise
(1–5
scale
)Ac
cura
cy:I
.D
(fore
cast
item
s)(N
DA)
Self-
rated
expe
rtise
notr
elated
toac
cura
cyof
diffe
rent
sizes
were
due
todi
ffere
nces
in
Item
type
(fore
casti
ngvs
.alm
anac
)Ac
cura
cy:D
.R
(NS)
initi
alac
cura
cyof
grou
psat
roun
d1
Roun
ds(A
ccur
acy
incr
ease
dto
r3;t
hen
decr
ease
d)
Min
er(1
979)
Tech
niqu
es(N
GT,D
,PCL
)Ac
cura
cy/Q
ualit
y(in
dex
from
0.0
to1.
0)Ac
cura
cy:P
CL.
NGT.
D(N
S)NA
Effe
ctive
ness
was
deter
min
edby
the
prod
uct
Acce
ptan
ce(1
1sta
temen
tq’n
aire)
Acce
ptan
ce:D
5NG
T5PC
Lof
quali
ty(ac
cura
cy)a
ndac
cept
ance
Effe
ctive
ness
(pro
duct
ofot
heri
ndex
es)
Effe
ctive
ness
:PCD
L.D*
,NGT
*
Rohr
baug
h(1
979)
Tech
niqu
es(D
,SJ)
Accu
racy
(corre
latio
nwi
thso
lutio
n)Ac
cura
cy:D
5SJ
Quali
tyof
decis
ions
ofin
divi
duals
incr
ease
don
lyDe
lphi
grou
psin
struc
tedto
redu
ced
exist
ing
Post-
grou
pag
reem
ent(
diffe
renc
ebe
twee
nPo
st-gr
oup
agre
emen
t:SJ
.D
(‘sig
nific
ant’)
sligh
tlyaf
terth
egr
oup
Delp
hipr
oces
sdi
ffere
nces
over
roun
ds.A
gree
men
tcon
cern
s
grou
pan
din
divi
dual
post-
grou
ppo
licy)
Satis
facti
on:S
J.D
(‘sig
nific
ant’)
*de
gree
towh
ichpa
nelli
stsag
reed
with
each
othe
r
Satis
facti
on(0
–10
scale
)af
tergr
oup
proc
ess
had
ende
d(p
ost-t
ask
ratin
g)
Fisc
her(
1981
)Te
chni
ques
(NGT
,D,I
,S)
Accu
racy
(acco
rdin
gto
trunc
ated
Accu
racy
:D5
NGT5
I5S
NA
loga
rithm
icsc
orin
gru
le)(P
roce
dure
sM
ainEf
fect:
P.
0.20
)
Boje
and
Mur
nigh
an(1
982)
Tech
niqu
es(N
GT,D
,Iter
ation
)Ac
cura
cy(d
eviat
ion
ofgp
mea
nfro
mAc
cura
cy:R
.D,
NGT
Nore
latio
nshi
pbe
twee
ngr
oup
size
and
Com
pare
dDe
lphi
toan
NGT
proc
edur
ean
dto
Grou
psiz
e(3
,7,1
1)co
rrect
answ
er)Ac
cura
cy:I
terati
onpr
oced
ure.
Rac
cura
cyor
confi
denc
e.an
itera
tion
(no
feed
back
)pro
cedu
re
Roun
dsCo
nfide
nce
(1–7
scale
)(P
roce
dure
s3Tr
ials
was
signi
fican
t)*Co
nfide
nce
inDe
lphi
incr
ease
dov
erro
unds
*
Satis
facti
on(re
spon
ses
on10
item
q’na
ire)
Spin
elli(
1983
)Ro
unds
Conv
erge
nce
(e.g.
,cha
nge
inNA
Noco
nver
genc
eov
erro
unds
Noqu
oted
statis
tical
analy
sis.
inter
quar
tile
rang
e)In
r3an
dr4
,Ps
enco
urag
edto
conv
erge
towa
rds
majo
rity
opin
ion!
´La
rrech
ean
dM
oinp
our(
1983
)Te
chni
ques
(D,I
)Ac
cura
cy(M
ean
Abso
lute
Perc
entag
eEr
ror)
Accu
racy
:D.
R*Ex
perti
seas
sess
edby
self-
ratin
gha
dno
signi
fican
tA
mea
sure
ofco
nfide
ncew
astak
enas
syno
nym
ous
Roun
dsEx
perti
se(v
ia(a)
Confi
denc
ein
estim
ates,
Accu
racy
:D.
I*va
lidity
,but
did
when
asse
ssed
byan
with
expe
rtise
.Evi
denc
eth
atco
mpo
sing
(b)e
xter
nalm
easu
re)Ac
cura
cy:R
5I
exter
nalm
easu
re.*
Aggr
egate
oflat
ter‘e
xper
ts’De
lphi
grou
psof
the
best
(exter
nally
-rated
)
incr
ease
dac
cura
cyov
erot
herD
elphi
grou
ps*
expe
rtsm
ayim
prov
eju
dgem
ent
Rigg
s(1
983)
Tech
niqu
es(D
,I)
Accu
racy
(abso
lute
diffe
renc
ebe
twee
nAc
cura
cy:D
.I*
Mor
eac
cura
tepr
edict
ions
were
prov
ided
fort
heCo
mpa
rison
with
first
roun
dno
trep
orted
.
Info
rmati
onon
task
(hig
hvs
.low
)pr
edict
edan
dac
tual
poin
tspr
eads
)hi
ghin
form
ation
gam
e.*De
lphi
was
mor
eSt
anda
rdin
form
ation
onth
efo
urtea
ms
inth
e
accu
rate
inbo
thhi
ghan
dlo
win
form
ation
case
stw
oga
mes
was
prov
ided
´Pa
rent
e,An
derso
n,Fe
edba
ck(y
esvs
.no)
Accu
racy
(IFan
even
twou
ldoc
cur;
Accu
racy
:D.
R(W
HEN
even
tocc
urs)
Deco
mpo
sitio
nof
Delp
hise
emed
tosh
owTh
ree
relat
edstu
dies
repo
rted:
data
here
onex
3.
Mye
rsan
dO’
Brien
(198
4)Ite
ratio
n(y
esvs
.no)
and
erro
rin
days
reW
HEN
anev
ent
Accu
racy
:D5
R(IF
even
tocc
urs)
impr
ovem
enti
nac
cura
cyre
lated
toth
eite
ratio
nSu
gges
tion
that
feed
back
supe
rfluo
usin
Delp
hi,
woul
doc
cur)
com
pone
nt,n
otfe
edba
ck(W
HEN
item)
howe
ver,
feed
back
used
here
was
supe
rficia
l
Erffm
eyer
and
Lane
(198
4)Te
chni
ques
(NGT
,D,I
,Gro
upQu
ality
(com
paris
onof
rank
ings
to‘c
orre
ctQu
ality
:D.
NGT*
,I*,
Guid
edgp
*NA
Acce
ptan
cebe
ars
simila
rity
toth
epo
st-tas
k
proc
edur
ewi
thgu
ideli
nes
rank
s’as
signe
dby
NASA
expe
rts)
Quali
ty:D
.R*
,also
note:
I.NG
T*co
nsen
sus
(‘agr
eem
ent’)
mea
sure
of
onap
prop
riate
beha
vior
)Ac
cept
ance
(deg
ree
ofag
reem
entb
etwee
nAc
cept
ance
:Gui
ded
gppr
oced
ure.
D*Ro
hrba
ugh
(197
9)
Roun
dsan
indi
vidu
al’sr
anki
ngan
dfin
algr
pra
nkin
g)
362 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375
Tab
le2.
Con
tinue
d
Stud
yIn
depe
nden
tDe
pend
ent
Resu
ltsof
techn
ique
Othe
rres
ults,
Addi
tiona
lcom
men
ts
varia
bles
varia
bles
com
paris
ons
e.g.r
epr
oces
ses
b,c
(and
levels
)(an
dm
easu
res)
Erffm
eyer
etal.
(198
6)Ro
unds
Quali
ty(co
mpa
rison
ofra
nkin
gsto
‘cor
rect
Quali
ty:D
.R*
Quali
tyin
crea
sed
upto
the
four
thro
und,
Isth
isan
analy
sisof
data
from
the
study
of
rank
s’as
signe
dby
NASA
expe
rts)
butn
otth
erea
fter:
‘stab
ility
’was
achi
eved
Erffm
eyer
and
Lane
(198
4)?
Dietz
(198
7)W
eight
ing
(usin
gco
nfide
nce
vs.n
o)Ac
cura
cy(d
iffer
ence
betw
een
fore
cast
Accu
racy
D.
R(N
S)W
eight
edfo
reca
stsles
sac
cura
teth
anno
n-Ve
rysm
allstu
dywi
thon
lyon
egr
oup
ofeig
ht
Sum
mar
ym
easu
re(m
edian
,mea
n,va
lue
and
actu
alpe
rcen
tage)
weig
hted
ones
,tho
ugh
diffe
renc
esm
all.
pane
llists
,and
henc
eit
isno
tsur
prisi
ngth
at
Ham
pel,
biwe
ight
,trim
med
mea
n)Co
nfide
nce
(1–5
scale
)Di
ffere
nces
insu
mm
ary
meth
ods
insig
nific
ant.
statis
ticall
ysig
nific
antr
esul
tswe
reno
tach
ieved
Roun
dsLe
astc
onfid
entn
otm
ostl
ikely
toch
ange
opin
ions
Sniez
ek(1
989)
Tech
niqu
es(D
,I,D
ictato
r,Di
alecti
c)Ac
cura
cy(ab
solu
tepe
rcen
tage
erro
r)Ac
cura
cy:D
ictato
r,D
.I(
NDA)
High
confi
denc
ean
dac
cura
cywe
reco
rrelat
edin
Influ
ence
mea
sure
conc
erne
dex
tentt
owh
ichan
(repe
ated-
mea
sure
sde
sign)
Confi
denc
e(1
–7sc
ale)
Accu
racy
:D.
R(N
S)De
lphi
*.In
fluen
cein
Delp
hiwa
sco
rrelat
edin
divi
dual’
ses
timate
sdr
ewth
egr
oup
judg
men
t.
Roun
dsIn
fluen
ce(d
iffer
ence
betw
een
indi
vidu
al’s
with
confi
denc
e*an
dac
cura
cy*
Tim
e-se
ries
data
supp
lied.
judg
men
tand
r1gr
oup
mea
n)
Sniez
ek(1
990)
Tech
niqu
es(D
,I,S
,Bes
tAc
cura
cy(e.
g.,m
ean
fore
cast
erro
r)Ac
cura
cy:D
.BM
(one
offiv
eitem
s)(P
,0.
10)
Confi
denc
em
easu
res
were
notc
orre
lated
toAl
lind
ivid
uals
were
give
nco
mm
onin
form
ation
Mem
ber(
BM))
Confi
denc
e(d
iffer
ence
betw
een
uppe
rand
Accu
racy
:S.
D(o
neof
five
item
s)(P
,0.
10)
accu
racy
inDe
lphi
grou
ps(ti
me-
serie
sda
ta)pr
iort
oth
etas
k.
Item
diffi
culty
(mea
n%
erro
rs)lo
werl
imits
ofPs
50%
confi
denc
ein
terva
ls)No
.sig
.diff
eren
ces
inot
herc
ompa
rison
sIte
mdi
fficu
ltym
easu
red
post
task
Tayl
oret
al.(1
990)
None
Surv
ivab
ility
(deg
reet
owh
ichr1
cont
ribut
ion
NANo
relat
ions
hip
betw
een
dem
ogra
phic
Aban
donm
enta
ndSu
rviv
abili
tyre
late
to
ofea
chP
isre
taine
din
later
roun
ds)
char
acter
istics
ofpa
nelli
sts(g
ende
r,ed
ucati
on,
influ
ence
and
opin
ion
chan
ge
Aban
donm
ent(
degr
eeof
Psab
ando
ning
r1in
teres
t,an
def
fecti
vene
ssan
d
cont
ribut
ions
infa
vour
ofth
ose
ofot
hers)
surv
ivab
ility
/aba
ndon
men
t)
Dem
ogra
phic
facto
rs(fo
urty
pes)
Leap
eet
al.(1
992)
Tech
niqu
es(D
,sin
gle
roun
dm
ailAg
reem
ent(
com
pare
dm
ean
ofad
juste
dAg
reem
ent:
surv
ey.
D.
Mod
ified
DNA
Agre
emen
twas
mea
sure
dby
com
parin
gpa
nel
surv
ey,m
oder
ated
Dwi
thpa
nel
ratin
gsto
that
gain
edin
natio
nals
urve
y)(M
odifi
edD
inclu
ded
pane
ldisc
ussio
n)es
timate
sto
thos
efro
ma
natio
nals
urve
y.No
ta
disc
ussio
n)tru
em
easu
reof
accu
racy
Gowa
nan
dM
cNich
ols
(199
3)Fe
edba
ck(st
atisti
cal,
regr
essio
nCo
nsen
sus
(pair
wise
subj
ecta
gree
men
tNA
If-th
enru
lefe
edba
ckga
vegr
eater
cons
ensu
sAc
cura
cywa
sm
easu
red,
butn
otre
porte
din
mod
el,if-
then
rule
feed
back
)af
terea
chro
und
ofde
cisio
ns)
than
statis
tical
orre
gres
sion
mod
elfe
edba
ck*
this
pape
r
Horn
sby
etal.
(199
4)Te
chni
ques
(NGT
,D,C
onse
nsus
)De
gree
ofop
inio
nch
ange
(from
aver
aged
Opin
ions
chan
ged
from
initi
alev
aluati
ons
NANo
mea
sure
ofac
cura
cywa
spo
ssib
le.
indi
vidu
alev
aluati
ons
tofin
algp
evalu
ation
)du
ring
NGT*
and
cons
ensu
s*,b
utno
tDelp
hi.
Cons
ensu
sin
volv
esdi
scus
sion
until
allgr
oup
Satis
facti
on(1
2ite
mq’
naire
:1–5
scale
)Sa
tisfa
ction
:D.
cons
ensu
s*,N
GT*
mem
bers
acce
ptfin
alde
cisio
n
Rowe
and
Wrig
ht(1
996)
Feed
back
(stati
stica
l,re
ason
s,Ac
cura
cy(M
ean
Abso
lute
Perc
entag
eEr
ror)
Accu
racy
:D(re
as),
D(sta
rt),I
terate
.R
(all*
)Be
stob
jectiv
eex
perts
chan
ged
least
inth
etw
oAl
thou
ghsu
bjec
tsch
ange
dth
eirfo
reca
stsles
sin
itera
tion
–wi
thou
tfee
dbac
k)Op
inio
nch
ange
(inter
mso
fz-sc
orem
easu
re)Ac
cura
cy:D
(reas
).D(
start)
.Ite
ratio
n(N
S)fe
edba
ckco
nditi
ons*
.th
ere
ason
sco
nditi
on,w
hen
they
did
soth
e
(repe
ated
mea
sure
sde
sign)
Confi
denc
e(ea
chite
m,1
–7sc
ale)
Chan
ge:I
terati
on.
D(sta
rt),D
(reas
)*Hi
ghse
lf-ra
tedex
perts
were
mos
tacc
urate
onr1
*ne
wfo
reca
stswe
rem
ore
accu
rate.
*Th
istre
nd
Roun
dsSe
lf-ra
tedex
perti
se(o
nem
easu
re,1
–7sc
ale)
butn
ore
latio
nshi
pto
chan
geva
riabl
e.di
dno
tocc
urin
the
othe
rcon
ditio
ns,s
ugge
sting
Objec
tive
expe
rtise
ofPs
(via
MAP
ES)
Confi
denc
einc
reas
edov
erro
unds
inall
cond
ition
s*re
ason
sfe
edba
ckhe
lped
disc
rimin
ality
ofch
ange
aSe
eTa
ble
1fo
rexp
lanati
onof
abbr
eviat
ions
.b
Fort
henu
mbe
rofr
ound
s(i.
e.lev
elsof
inde
pend
entv
ariab
le‘R
ound
’),se
eTa
ble
1.c
Inm
any
ofth
estu
dies
,anu
mbe
rofd
iffer
entq
uesti
ons
are
pres
ented
toth
esu
bjec
tsso
that
‘item
s’m
aybe
cons
ider
edto
bean
inde
pend
entv
ariab
le.Us
ually
,how
ever
,the
reis
nosp
ecifi
cra
tiona
lebe
hind
analy
sing
resp
onse
sto
the
diffe
rent
item
s,so
wedo
notc
onsid
er‘it
ems’
anI.V
.her
e.
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375 363
Delphi in comparison to external benchmarks, that is, trend of reduced variance is so typical that thewith respect to other techniques or means of ag- phenomenon of increased ‘consensus’, per se, nogregating judgment (for example, interacting longer appears to be an issue of experimentalgroups). interest.
In Table 2, the main findings have been coded Where some controversy does exist, however, is inaccording to how they were presented by their whether a reduction in variance over rounds reflectsauthors. In some cases, claims have been made that true consensus (reasoned acceptance of a position).an effect or relationship was found even though there Delphi has, after all, been advocated as a method ofwas no backing statistical analysis, or though a reducing group pressures to conform (e.g., Martino,statistical test was conducted but standard signifi- 1983), and both increased consensus and increasedcance levels were not reached (because of, for conformity will be manifest as a convergence ofexample, small sample sizes, e.g. Jolson & Rossow, panellists’ estimates over rounds (i.e., these factors1971). It has been argued that there are problems in are confounded). It would seem in the literature thatusing and interpreting ‘P’ figures in experiments and reduced variance has been interpreted according tothat effect sizes may be more useful statistics, the position on Delphi held by the particular author /particularly when comparing the results of indepen- s, with proponents of Delphi arguing that resultsdent studies (e.g., Rosenthal, 1978; Cohen, 1994). In demonstrate consensus, while critics have arguedmany Delphi studies, however, effect sizes are not that the ‘consensus’ is often only ‘apparent’, and thatreported in detail. Thus, Table 2 has been annotated the convergence of responses is mainly attributableto indicate whether claimed results were significant to other social-psychological factors leading to con-at the P 5 0.05 level (*); whether statistical tests formity (e.g., Sackman, 1975; Bardecki, 1984;were conducted but proved non-significant (NS); or Stewart, 1987). Clearly, if panellists are being drawnwhether there was no direct statistical analysis of the towards a central value for reasons other than aresult (NDA). Readers are invited to read whatever genuine acceptance of the rationale behind thatinterpretation they wish into claims of the statistical position, then inefficient process-loss factors are stillsignificance of findings. present in the technique.
Alternative measures of consensus have beentaken, such as ‘post-group consensus’. This concerns
5. Findings the extent to which individuals – after the Delphiprocess has been completed – individually agree
In this section, we consider the results obtained by with the final group aggregate, their own final roundthe evaluative studies as summarised in Table 2, to estimates, or the estimates of other panellists. Rohr-which the reader is referred. We will return to the baugh (1979), for example, compared individuals’details in Table 1 in our subsequent critique. post-group responses to their aggregate group re-
sponses, and seemed to show that reduction in5.1. Consensus ‘disagreement’ in Delphi groups was significantly
less than the reduction achieved with an alternativeOne of the aims of using Delphi is to achieve technique (Social Judgment Analysis). Furthermore,
greater consensus amongst panellists. Empirically, he found that there was little increase in agreementconsensus has been determined by measuring the in the Delphi groups. This latter finding seems tovariance in responses of Delphi panellists over suggest that panellists were simply altering theirrounds, with a reduction in variance being taken to estimates in order to conform to the group withoutindicate that greater consensus has been achieved. actually changing their opinions (i.e., implying con-Results from empirical studies seem to suggest that formity rather than genuine consensus).variance reduction is typical, although claims tend to Erffmeyer and Lane (1984) correlated post-groupbe simply reported unanalysed (e.g., Dalkey & individual responses to group scores and found thereHelmer, 1963), rather than supported by analysis to be significantly more ‘acceptance’ (i.e., signifi-(though see Jolson & Rossow, 1971). Indeed, the cantly higher correlations between these measures) in
364 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375
an alternative structured group technique than in a ing a final round Delphi aggregate to that of the firstDelphi procedure, although there were no differences round is thus, effectively, a within-subjects com-in acceptance between the Delphi groups and a parison of techniques (Delphi versus staticizedvariety of other group techniques (see Table 2). group); when comparison occurs between Delphi andUnfortunately, no analysis was reported on the a separate staticized group then it is a between-difference between the correlations of the post-group subjects one. Clearly, the former comparison isindividual responses and the first and final round preferable, given that it controls for the highlygroup aggregate responses. variable influence of subjects. Although the com-
An alternative slant on this issue has been pro- parison of round averages should be possible invided by Bardecki (1984), who reported that – in a every study considering Delphi accuracy/quality, astudy not fully described – respondents with more number of evaluative studies have omitted to reportextreme views were more likely to drop out of a round differences (e.g., Fischer, 1981; Riggs, 1983).Delphi procedure than those with more moderate Comparisons of relative accuracy of Delphi panelsviews (i.e., nearer to the group average). This with first round aggregates and staticized groups aresuggests that consensus may be due – at least in part reported in Table 3.– to attrition. Further empirical work is needed to Evidence for Delphi effectiveness is equivocal, butdetermine the extent to which the convergence of results generally support its advantage over firstthose who do not (or cannot) drop out of a Delphi round/staticized group aggregates by a tally of 12procedure are due to either true consensus or to studies to two. Five studies have reported significantconformity pressures. increases in accuracy over Delphi rounds (Best,
´1974; Larreche & Moinpour, 1983; Erffmeyer &Lane, 1984; Erffmeyer et al., 1986; Rowe & Wright,
5.2. Increased accuracy1996), although the two papers of Erffmeyer et al.may be reports of separate analyses on the same data
Of main concern to the majority of researchers is(this is not clear). Seven more studies have produced
the ability of Delphi to lead to judgments that arequalified support for Delphi: in five cases, Delphi is
more accurate than (a) initial, pre-procedure aggre-found to be better than statistical or first round
gates (equivalent to equal-weighted staticizedaggregates more often than not, or to a degree that
groups), and (b) judgments derived from alternativedoes not reach statistical significance (e.g., Dalkey et
group procedures. In order to ensure that accuracyal., 1970; Brockhoff, 1975; Rohrbaugh, 1979; Dietz,
can be rapidly assessed, problems used in Delphi1987; Sniezek, 1989), and in two others it is shown
studies have tended to involve either short-rangeto be better under certain conditions and not others
forecasting tasks, or tasks requiring the estimation of´(i.e., in Parente et al., 1984, Delphi accuracy in-
almanac items whose quantitative values are alreadycreases over rounds for predicting ‘when’ an event
known to the experimenters and about which sub-might occur, but not ‘if’ it will occur; in Jolson &
jects are presumed capable of making educatedRossow, 1971, it increases for panels comprising
guesses. In evaluative studies, the long-range fore-‘experts’, but not for ‘non-experts’).
casting and policy formation items (etc.) that areIn contrast, two studies found no substantial
typical in Delphi applications are rarely used.difference in accuracy between Delphi and staticized
Table 2 reports the results of the evaluation ofgroups (Fischer, 1981; Sniezek, 1990), while two
Delphi with regards to the various benchmarks,others suggested Delphi accuracy was worse. Gustaf-
while Tables 3–5 collate and summarise the com-son et al. (1973) found that Delphi groups were less
parisons according to specific benchmarks.accurate than both their first round aggregates (forseven out of eight items) and than independent
5.2.1. Delphi versus staticized groups staticized groups (for six out of eight items), whileThe average estimate of Delphi panellists on the Boje and Murnighan (1982) found that Delphi panels
first round – prior to iteration or feedback – is became less accurate over rounds for three out ofequivalent to that from a staticized group. Compar- four items.
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375 365
Table 3aComparisons of Delphi to staticized (round 1) groups
Study Criteria Trend of relationship Additional comments(moderating variable)
Dalkey et al. (1970) Accuracy D.R (NDA) 81 groups became moreaccurate, 61 became less
Jolson and Rossow (1971) Accuracy D.R (experts) (NDA) For both items in each caseAccuracy D,R (‘non experts’) (NDA)
Gustafson et al. (1973) Accuracy D,S (NDA) D worse than S on six of eightD,R (NDA) items, and than R on seven of
eight items
Best (1974) Accuracy D.R* For each of two items
Brockhoff (1975) Accuracy D.R (almanac items) Accuracy generally increasedD.R (forecast items) to r3, then started to decrease
Rohrbaugh (1979) Accuracy D.R (NDA) No direct comparison made, buttrend implied in results
Fischer (1981) Accuracy D5S Mean group scores werevirtually indistinguishable
Boje and Murnighan (1982) Accuracy R.D* Accuracy in D decreased overrounds for three out of four items
´Larreche and Moinpour (1983) Accuracy D.R*
´Parente et al. (1984) Accuracy D.R (WHEN items) (NDA) The findings noted here are theAccuracy D5R (IF items) (NDA) trends as noted in the paper
Erffmeyer and Lane (1984) Quality D.R* Of four procedures, D seemedto improve most over rounds
Erffmeyer et al. (1986) Quality D.R* Most improvement in earlyrounds
Dietz (1987) Accuracy D.R Error of D panels decreasedfrom r1 to r2 to r3
Sniezek (1989) Accuracy D.R Small improvement over roundsbut not significant
Sniezek (1990) Accuracy S.D (one of five items) No reported comparison ofaccuracy change over D rounds
Rowe and Wright (1996) Accuracy D (reasons f’back).R*Accuracy D (stats f’back).R*
a See Table 1 for explanation of abbreviations.
5.2.2. Delphi versus interacting groups interacting groups have been compared are presentedAnother important comparative benchmark for in Table 4.
Delphi performance is provided by interacting Research supports the relative efficacy of Delphigroups. Indeed, aspects of the design of Delphi are over interacting groups by a score of five studies tointended to pre-empt the kinds of social /psychologi- one with two ties, and with one study showingcal /political difficulties that have been found to task-specific support for both techniques. Support forhinder effective communication and behaviour in Delphi comes from Van de Ven and Delbecq (1974),
´groups. The results of studies in which Delphi and Riggs (1983), Larreche and Moinpour (1983),
366 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375
Table 4aComparisons of Delphi to interacting groups
Study Criteria Trend of relationship Additional comments(moderating variable)
Gustafson et al. (1973) Accuracy I.D (NDA) D worse than I on five ofeight items
Van de Ven and Delbecq (1974) Number of ideas D.I*
Brockhoff (1975) Accuracy D.I (almanac items) (NDA) Comparisons were between II.D (forecast items) (NDA) and r3 of D
Fischer (1981) Accuracy D5I Mean group scores wereindistinguishable
´Larreche and Moinpour (1983) Accuracy D.I*
Riggs (1983) Accuracy D.I* Result was significant for bothhigh and low information tasks
Erffmeyer and Lane (1984) Quality D.I*
Sniezek (1989) Accuracy D.I (NDA) Small sample sizes
Sniezek (1990) Accuracy D5I No difference at P 5 0.10 levelon comparisons on the five items
a See Table 1 for explanation of abbreviations.
Erffmeyer and Lane (1984), and Sniezek (1989). needs to be done quickly, while Delphi would be aptFischer (1981) and Sniezek (1990) found no dis- when experts cannot meet physically; but whichtinguishable differences in accuracy between the two technique is preferable when both are options?approaches, while Gustafson et al. (1973) found a Results of Delphi–NGT comparisons do not fullysmall advantage for interacting groups. Brockhoff answer this question. Although there is some evi-(1975) seemed to show that the nature of the task is dence that NGT groups make more accurate judg-important, with Delphi being more accurate than ments than Delphi groups (Gustafson et al., 1973;interacting groups for almanac items, but the reverse Van de Ven & Delbecq, 1974), other studies havebeing the case for forecasting items. found no notable differences in accuracy/quality
between them (Miner, 1979; Fischer, 1981; Boje &5.2.3. Delphi versus other procedures Murnighan, 1982), while one study has shown
Although evidence suggests that Delphi does Delphi superiority (Erffmeyer & Lane, 1984).generally lead to improved judgments over staticized Other studies have compared Delphi to: groups ingroups and unstructured interacting groups, it is which members were required to argue both for andclearly of interest to see how Delphi performs in against their individual judgments (the ‘Dialectic’comparison to groups using other structured pro- procedure – Sniezek, 1989); groups whose judg-cedures. Table 5 presents the findings of studies ments were derived from a single, group-selectedwhich have attempted such comparisons. individual (the ‘Dictator’ or ‘Best Member’ strategy
A number of studies have compared Delphi to the – Sniezek, 1989, 1990); groups that received rulesNominal Group Technique or NGT (also known as on interaction (Erffmeyer & Lane, 1984); groupsthe ‘estimate-talk-estimate’ procedure). NGT uses whose information exchange was structured accord-the basic Delphi structure, but uses it in face-to-face ing to Social Judgment Analysis (Rohrbaugh, 1979);meetings that allow discussion between rounds. and groups following a Problem Centred LeadershipClearly, there are practical situations where one (PCL) approach (Miner, 1979). The only studies thattechnique may be viable and not the other, for found any substantial differences between Delphiexample NGT would seem appropriate when a job and the comparison procedure /s are those of
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375 367
Table 5aComparisons of Delphi to other structured group procedures
Study Criteria Trend of relationship Additional comments(moderating variable)
Gustafson et al. (1973) Accuracy NGT.D* NGT-like technique was more accuratethan D on each of eight items
Van de Ven and Delbecq (1974) [ of ideas NGT.D NGT gps generated, on average,12% more unique ideas than D gps
Miner (1979) Quality PCL.NGT.D PCL was more effective than D(P , 0.01) (see Table 2)
Rohrbaugh (1979) Accuracy D5SJ No P statistic reported due tonature of study/analysis
Fischer (1981) Accuracy D5NGT Mean group scores were similar,but NGT very slightly better than D
Boje and Murnighan (1982) Accuracy D5NGT On the four items, D was slightly moreaccurate than NGT gps on final r
Erffmeyer and Lane (1984) Quality D.NGT* Guided gps followed guidelinesD.Guided Group* on resolving conflict
Sniezek (1989) Accuracy Dictator5D5Dialectic Dialectic gps argued for and against(NDA) positions. Dictator gps chose one
member to make gp decision
Sniezek (1990) Accuracy D.BM (one BM5Dictator (above): gp choosesof five items) one member to make decision
a See Table 1 for explanation of abbreviations.
Erffmeyer and Lane (1984), which found Delphi to poor backgrounds in the social sciences and lackedbe better than groups that were given instructions on acquaintance with appropriate research methodolo-resolving conflict, and Miner (1979), which found gies (e.g., Sackman, 1975).that the PCL approach (which involves instructing Over the past few decades, empirical evaluationsgroup leaders in appropriate group-directing skills) to of the Delphi technique have taken on a morebe significantly more ‘effective’ than Delphi. systematic and scientific nature, typified by the
controlled comparison of Delphi-like procedureswith other group and individual methods for obtain-
6. A critique of technique-comparison studies ing judgments (i.e., technique comparison). Althoughthe methodology of these studies has provoked a
Much of the criticism of early Delphi studies degree of support (e.g., Stewart, 1987), we havecentred on their ‘sloppy execution’ (e.g., Stewart, queried the relevance of the majority of this research1987). Among specific criticisms were claims that to the general question of Delphi efficacy (e.g., RoweDelphi questionnaires were poorly worded and am- et al., 1991), pointing out that most of such studiesbiguous (e.g., Hill & Fowles, 1975) and that the have used versions of Delphi somewhat removedanalysis of responses was often superficial (Linstone, from the ‘classical’ archetype of the technique1975). Reasons given for the poor conduct of early (consider Table 1).studies ranged from the technique’s ‘apparent sim- To begin with, the majority of studies have usedplicity’ encouraging people without the requisite structured first rounds in which event statements –skills to use it (Linstone & Turoff, 1975), to devised by experimenters – are simply presented tosuggestions that the early Delphi researchers had panellists for assessment, with no opportunity for
368 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375
them to indicate the issues they believe to be of possess similar knowledge (indeed, certain studiesgreatest importance on the topic (via an unstructured have even tried to standardise knowledge prior to theround), and thus militating against the construction Delphi manipulation; e.g., Riggs, 1983). Indeed, theof coherent task scenarios. The concern of research- few laboratory studies that have used non-studentsers to produce simple experimental designs is com- have tended to use professionals from a singlemendable, but sometimes the consequence is over- domain (e.g., Brockhoff, 1975; Leape et al., 1992;simplification and designs / tasks /measures of uncer- Ono & Wedermeyer, 1994). This point is important:tain relevance. Indeed, because of a desire to verify if varied information is not available to be shared,rapidly the accuracy of judgments, a number of then what could possibly be the benefit of anyDelphi studies have used almanac questions involv- group-like aggregation over and above a statisticaling, for example, estimating the diameter of the aggregation procedure?planet Jupiter (e.g., as in Gustafson et al., 1973; There is some evidence that panels composed ofMulgrave & Ducanis, 1975; Boje & Murnighan, relative experts tend to benefit from a Delphi pro-1982). Unfortunately, it is difficult to imagine how cedure to a greater extent than comparative aggre-Delphi might conceivably benefit a group of un- gates of novices (e.g., Jolson & Rossow, 1971;
´knowledgeable subjects – who are largely guessing Larreche & Moinpour, 1983), suggesting that typicalthe order of magnitude of some quantitatively large laboratory studies may underestimate the value ofand unfamiliar statistic, such as the tonnage of a Delphi. Indeed, the relevance of the expertise ofcertain material shipped from New York in a certain members of a staticized group, and the extent toyear – in achieving a reasoned consideration of which their judgments are shared /correlated, haveothers’ knowledge, and thus of improving the ac- been shown to be important factors related tocuracy of their estimates. However, even when the accuracy in the theoretical model of Hogarth (1978),items are sensible (e.g., of short-term forecasting which has been empirically supported by a numbertype), their pertinence must also depend on the of studies (e.g., Einhorn, Hogarth & Klempner, 1977;relevance of the expertise of the panellists: ‘sensible’ Ashton, 1986).questions are only sensible if they relate to the The absolute and relative expertise of a set ofdomain of knowledge of the specific panellists. That individuals is also liable to interact with the effec-is, student subjects are unlikely to respond to, for tiveness of different group techniques – not onlyexample, a task involving the forecast of economic statistical groups and Delphi, but also interactingvariables, in the same way as economics experts – groups and techniques such as the NGT. Because thenot only because of lesser knowledge on the topic, nature of such relationships are unknown, it is likelybut because of differences in aspects such as their that, for example, though a Delphi-like proceduremotivation (see Bolger & Wright, 1994, for discus- might prove more effective than an NGT approach atsion of such issues). enhancing judgment with a particular set of in-
Delphi, however, was ostensibly designed for use dividuals, a panel possessing different ‘expertise’with experts, particularly in cases where the variety attributes might show the reverse trend. The un-of relevant factors (economic, technical, etc.) ensures comfortable conclusion from this line of argument isthat individual panellists have only limited knowl- that it is difficult to draw grand conclusions on theedge and might benefit from communicating with relative efficacy of Delphi from technique-compari-others possessing different information (e.g., Helmer, son studies that have made no clear attempt at1975). This use of diverse experts is, however, rarely specifying or describing the characteristics of theirfound in laboratory situations: for the sake of panels and the relevant features of the task. Further-simplification, most studies employ homogenous more, any particular demonstration of Delphi effica-samples of undergraduate or graduate students (see cy cannot be taken as an indicator of the moreTable 1), who are required to make judgments about general validity of the technique, since the demon-scenarios in which they can by no means be consid- strated effect will be partly dependent on the specificered expert, and about which they are liable to characteristics of the panel and the task.
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375 369
A second major difference between laboratory tion and reductionism in order to study highlyDelphis and the Delphi ideal lies in the nature of the complex phenomena seems to be at the root of thefeedback presented to panellists (see Table 1). The problem, such that experimental Delphis have tendedfeedback recommended in the ‘classical’ Delphi to use artificial tasks (that may be easily validated),comprises medians or distributions plus arguments student subjects, and simple feedback in the place offrom panellists whose estimates fall outside the meaningful and coherent tasks, experts /profession-upper and lower quartiles (e.g., Martino, 1983). In als, and complex feedback. But simplification can gothe majority of experimental studies, however, feed- too far, and in the case of much of the Delphiback usually comprises only medians or means (e.g., research it would appear to have done so.Jolson & Rossow, 1971; Riggs, 1983; Sniezek, 1989, In order that research is conducted more fruitfully1990; Hornsby et al., 1994), or a listing of the in the future, two major recommendations wouldindividual panellists’ estimates or response distribu- appear warranted from the above discussion. Thetions (e.g., Dalkey & Helmer, 1963; Gustafson et al., first is that a more precise definition of Delphi is1973; Fischer, 1981; Taylor, Pease & Reid, 1990; needed in order to prevent the technique from beingLeape et al., 1992; Ono & Wedermeyer, 1994), or misrepresented in the laboratory. Issues that par-some combination of these (e.g., Rohrbaugh, 1979). ticularly need clarification include: the type ofAlthough feedback of reasons or rationales behind feedback that constitutes Delphi feedback; theindividuals’ estimates has sometimes taken place criteria that should be met for convergence of
´(e.g., Boje & Murnighan, 1982; Larreche & Moin- opinion to signal the cessation of polling; the selec-pour, 1983; Erffmeyer & Lane, 1984; Rowe & tion procedures that should be used to determineWright, 1996) this form of feedback is rare. This number and type of panellists, and so on. Althoughvariability of feedback, however, must be considered suggestions on these issues do exist in the literature,a crucial issue, since it is through the medium of the fact that they have largely been ignored byfeedback that information is conveyed from one researchers suggests that they have not been ex-panellist to the next, and by limiting feedback one pressed as explicitly or as strongly as they mightmust also limit the scope for improving panellist have been, and that the importance of definitionalaggregate accuracy. The issue of feedback will be precision has been underestimated by all.considered more fully in the next section: suffice it to The second recommendation is that a much greatersay here that, once more, it is apparent that sim- understanding is required of the factors that mightplified laboratory versions of Delphi tend to differ influence Delphi effectiveness, in order to preventsignificantly from the ideal in such a way that they the technique’s potential utility from being under-may limit the effectiveness of the technique. estimated through accurate representations used in
From the discussion so far it has been suggested unsympathetic experimental scenarios. It would seemthat the majority of the recent studies that have likely that such research would also help in clarify-attempted to address Delphi effectiveness have dis- ing precisely what those aspects of Delphi adminis-covered little of general consequence about Delphi. tration are that need more specific definition. ForThat research has yielded so little of substance would example, if ‘number of panellists’ was empiricallysuggest that there are definitional problems with demonstrated to have no influence on the effective-Delphi that militate against standard administrations ness of variations of Delphi, then it would be clearof the procedure (e.g., Hill & Fowles, 1975). Fur- that no particular comment or qualifier would bethermore, the lack of concern of experimenters in required on this topic in the technique definition.following even the fairly loose definitional guidelinesthat do exist suggests that there is a general lack ofappreciation of the importance of factors like the 7. Findings of process studiesnature of feedback in contingently influencing thesuccess of procedures like Delphi. The requirement Technique-comparison studies ask the question:of empirical social science research to use simplifica- ‘does Delphi work?’, and yet they use technique
370 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375
forms that differ from one study to the next and that 7.1. The role of feedbackoften differ from the ideal of Delphi. Process studiesask: ‘what is it about Delphi that makes it work?’, Feedback is the means by which information isand this involves both asking and answering the passed between panellists so that individual judg-question: ‘what is Delphi?’ ment may be improved and debiasing may occur.
We believe that the emphasis of research should But there are many questions about feedback thatbe shifted from technique-comparison studies to need to be answered. For example, which individualsprocess studies. The latter should focus on the way change their judgments in response to Delphi feed-in which an initial statistical group is transformed by back – the least confident, the most dogmatic, thethe Delphi process into a final round statistical least expert? What is it about feedback that encour-group. To be more precise, the transformation of ages change? Do different types of feedback causestatistical group characteristics must come about different types of people to change? And so on.through changes in the nature of the panellists, which Although studies such as those of Dalkey andare reflected by changes over rounds in their judg- Helmer (1963) and Dalkey et al. (1970) wouldments (and participation), which are manifest as appear to have examined the role of feedback, this ischanges in the individual accuracy of the ‘group’ not the case, since feedback is confounded by themembers, their judgmental intercorrelations, and unknown role of ‘iteration’, and since only singleindeed – when panellists drop out over rounds – feedback formats were used in each study. Whateven their group size (as per Hogarth, 1978). studies like those of Scheibe et al. (1975) do reveal,
One conceptualisation of the problem is to view however, is the powerful influence of feedback: injudgment change over rounds as coming about their case, by demonstrating the pull that even falsethrough the interaction of internal factors related to feedback can have on panellists. Evidence from mostthe individual (such as personality styles and traits), Delphi studies shows that convergence towards theand external factors related to technique design and ‘group’ average is typical (see section on Consen-the general environment (such as the influence of sus).feedback type and the nature of the judgment task). However, more systematic efforts at assessing theFrom this perspective, the role of research should be role of feedback have been attempted by other
´to identify which factors are most important in authors. Parente et al. (1984), in a series of experi-explaining how and why an individual changes his / ments, used student subjects to forecast ‘if’ andher judgment, and which are related to change in the ‘when’ a number of newsworthy events would occur.desired direction (i.e., towards more accurate judg- In their third study they attempted to separate out thement). Only by understanding such relationships will iteration and feedback components by decomposingit be possible to explain the contingent differences in Delphi. They appeared to demonstrate that, althoughjudgment-enhancing utility of the various available neither iterated polling nor consensus feedback hadprocedures which include Delphi. Importantly from any apparent effect upon group ‘if’ accuracy, itera-this perspective – and unlike in technique-compari- tion alone resulted in increased accuracy for ‘when’son studies – analysis should be conducted at the a predicted event would occur – while feedbacklevel of the individual panellist, rather than at the alone did not. Similarly, Boje and Murnighan (1982)level of the nominal group, since the non-interacting found decreased accuracy over rounds for a ‘stan-nature of Delphi ensures that individual responses dard’ Delphi procedure, while a control procedureand changes are central and measurable and (unlike that simply entailed the iteration of the questionnaireinteracting groups) the ultimate group response is (with no feedback) resulted in increased accuracyeffectively ‘related to the sum of the parts’ and is not (perhaps because of the opportunity for individualsholistically ‘greater than’ these (e.g., Rowe et al., to further reflect on the problem). However, as noted1991). The results of those few studies that we earlier, Delphi studies have tended to involve only
´consider to be of the process type are collated in superficial types of feedback. In Parente et al.subsequent sections. (1984), the feedback comprised modes and medians,
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375 371
while in the study of Boje and Murnighan (1982), as prescribed in the ‘classical’ definition, might leadindividuals’ justifications for their estimates were to improved Delphi performance, this is not to sayused as feedback along with the estimates themselves that ‘proper’ Delphi feedback is liable to be the best(no median or other average statistic was used). type – merely that it is liable to lead to trends in
If different types of feedback differentially in- judgment change that will differ from those in manyfluence individuals to change their judgments over evaluative studies of ‘Delphi’. Indeed, given therounds, then comparisons across the various Delphi limited nature of recommended feedback in theversions become difficult. Unfortunately, there are a classic procedure, the question remains as to hownumber of reasons – empirical, intuitive and theoret- effective even that type will be, given that theical – to suggest that such differential effects do majority of individual panellists are allowed so littleexist. In Rowe and Wright (1996) we compared input (e.g., those within the upper and lower quartiles‘iteration’, ‘statistical’ feedback (means and me- are not required to justify their estimates, evendians) and ‘reasons’ feedback (with no averages) and though they might be made for different – and evenfound that the greatest degree of improvement in mutually incompatible – reasons).accuracy over rounds occurred in the ‘reasons’condition. Furthermore, we found that, although 7.2. The nature of panellistssubjects were less inclined to change their forecastswhen receiving ‘reasons’ feedback than the other A number of studies have considered the role oftypes, when they did change their forecasts they Delphi panellists and how their attributes relate totended to become more accurate – which was not the criteria such as the effectiveness (e.g., accuracy) ofcase in the ‘iteration’ and ‘statistical’ conditions. the procedure. One of the main attributes of panel-
That feedback of a more profound nature can lists is their ‘expertise’ or ‘knowledgeability’. Asimprove the performance of Delphi groups has also discussed earlier, Delphi is intended for use bybeen shown by Best (1974). He found that, for one disparate experts, and yet most empirical studiesof two task items, a Delphi group that was given have used inexpert (often student) panels. Intuitively,feedback of reasons in addition to a median and the use of experts makes sense, but what doesrange of estimates was significantly more accurate research show? A number of ‘process’ studies havethan a Delphi group that was simply provided with considered self-rated expertise, essentially to deter-the latter information. Gowan and McNichols (1993) mine whether self-ratings might be useful for select-considered the relative influences of three types of ing panels. Perhaps unsurprisingly – given theDelphi feedback: statistical, regression model, and number of ways in which self-ratings may be taken –if-then rules. They could not measure the accuracy of results have been equivocal, with a number ofjudgments (the task involved considering the econ- studies suggesting that self-assessment is a validomic viability of companies on the basis of financial procedure (e.g., Dalkey et al., 1970; Best, 1974;ratios), but they did find that (if-then) rule feedback Rowe & Wright, 1996), and others suggesting that it
´induced a significantly greater degree of ‘consensus’ is not (Brockhoff, 1975; Larreche & Moinpour,than the other types. 1983). But these results say more about the utility of
That different feedback types are differentially a particular expertise measure than the role ofrelated to factors such as accuracy, change, and experts in Delphi.consensus should be of no surprise. Indeed, in the Jolson and Rossow (1971), however, found thatfield of social psychology there has been much accuracy increased over rounds for expert groups butresearch on judgment, conformity, and opinion not for inexpert ones (though they used only twochange in interacting groups, and how these relate to groups making judgments on only two problems).the type of information exchanged (e.g., see Deutsch Similarly, Riggs (1983) found that panels were more& Gerard, 1955; Myers, 1978; Isenberg, 1986), but accurate in predicting the result of a college footballsuch concerns seem not to have infiltrated the Delphi game on which they had more information (i.e., weredomain. Although the use of ‘reasons’ in feedback, more ‘expert’) than one on which they had less
372 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375
(although a more interesting comparison might have of quality is generally inappropriate (Armstrong,been of the relative improvement in the two cases). 1985, reviews studies on ‘confidence’ more broadly).Rowe and Wright (1996) found that the most accur- A more interesting conceptualisation of confidence isate Delphi panellists on first rounds changed their as a potential predictor of panellists’ propensity toestimates less over rounds than those who were change their estimates in the face of feedback.initially less accurate (and hence who were, arguab- Scheibe et al. (1975) found a positive relationshiply, less ‘expert’). Importantly, this result appears to between these factors (confidence and change), butsupport the Theory of Errors (see, for example, Rowe and Wright (1996) found no evidence for this.
´ ´Parente & Anderson-Parente, 1987), in which ac- We therefore have no consistent evidence that initialcuracy is improved over rounds as a consequence of confidence explains judgment change over Delphithe panel experts ‘holding-out’, while the less-expert rounds.panellists ‘swing’ towards the group average. The Two studies have considered the impacts of autility of expertise has been reviewed elsewhere, number of personality factors on the Delphi process.with evidence suggesting that there is an interaction Taylor et al. (1990) found no relationship betweenbetween expertise and the nature of the task, so that four demographic factors (e.g., gender and education)expertise is only helpful up to a certain level for and the ‘effectiveness’ of Delphi, or whether panel-forecasting tasks, but of greater importance for lists dropped out. Mulgrave and Ducanis (1975)estimation tasks (e.g., Welty, 1974; Armstrong, considered panellist dogmatism, finding that the most1985). More controlled experiments are required to dogmatic panellists changed judgments the most overexamine how expertise interacts with aspects of the rounds – although the authors had no explanation forDelphi technique, and how it relates to accuracy this rather counter-intuitive result and reported noimprovement over rounds. statistical analysis.
Panellist confidence has been studied from a The impact of the number of panellists has beennumber of perspectives. One conceptualisation of considered by Brockhoff (1975) (who used groups ofconfidence is as an outcome measure. For example, five, seven, nine, and 11) and Boje and MurnighanSniezek (1992) has pointed out that panel confidence (1982) (using groups of three, seven, and 11).may be the only available measure of the quality of a Neither of these studies found a consistent relation-decision (e.g., since one cannot determine the accura- ship between panel size and effectiveness criteria.cy of forecasts a priori), and from this sense it isimportant that ‘confidence’ in some way correlates toother quality measures. Although Rowe and Wright 8. Conclusion(1996) found that both confidence and accuracyincreased over rounds, they found no clear relation- This paper reviews research conducted on theship between the accuracy and confidence of the Delphi technique. In general, accuracy tends toindividual panellists. Conversely, Boje and Mur- increase over Delphi rounds, and hence tends to benighan (1982) found that while confidence increased greater than in comparative staticized groups, whileover rounds, accuracy decreased. Sniezek has also Delphi panels also tend to be more accurate thancompared panellist confidence to accuracy with unstructured interacting groups. The technique hascontradictory results (evidence for a positive rela- shown no clear advantages over other structuredtionship in Sniezek, 1989; but no evidence for such a procedures.relationship in Sniezek, 1990). Dietz (1987) attempt- Various difficulties exist in research of this tech-ed to weight panellist estimates according to how nique-comparison type, however. Our main concernconfident they were, but found that such a weighting is with the sheer variety of technique formats thatprocess gave less accurate results than a standard have been used as representative of Delphi, varyingequal-weighting one. from the technique ‘ideal’ (and from each other) on
From these inconsistent results in Delphi studies, aspects such as the type of feedback used and thewe conclude that the use of confidence as a measure nature of the panellists. If one uses a technique
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375 373
Brockhoff, K. (1975). The performance of forecasting groups informat that varies from the ‘ideal’ on a factor that iscomputer dialogue and face to face discussions. In: Linstone,shown to influence the performance of the technique,H., & Turoff, M. (Eds.), The Delphi method: techniques and
then one is essentially studying a different technique. applications, Addison-Wesley, London.One tentative conclusion from this is that the rec- Brockhoff, K. (1984). Forecasting quality and information. Jour-ommended manner of comprising Delphi groups (as nal of Forecasting 3, 417–428.
Cohen, J. (1994). The Earth is round ( p , .05). Americancommonly used in real-world applications, e.g. Mar-Psychologist December, 997–1003.tino, 1983) may well lead to greater enhancement of
Dagenais, F. (1978). The reliability and convergence of the Delphiaccuracy/quality than might be expected to arisetechnique. The Journal of General Psychology 98, 307–308.
from the typical laboratory panel. In this case, it is Dalkey, N. C., Brown, B., & Cochran, S. W. (1970). The Delphipossible that potential exists for Delphi (conducted method III: use of self-ratings to improve group estimates.
Technological Forecasting 1, 283–291.properly) to produce results far superior to those thatDalkey, N. C., & Helmer, O. (1963). An experimental applicationhave been demonstrated by research.
of the Delphi method to the use of experts. ManagementIndeed, the critique of technique-comparisonScience 9, 458–467.
studies is arguably pertinent to wider research Deutsch, M., & Gerard, H. B. (1955). A study of normative andconcerned with determining which is the best of the informational social influences upon individual judgment.
Journal of Abnormal and Social Psychology 51, 629–636.various potential judgment-enhancing techniques. WeDietz, T. (1987). Methods for analyzing data from Delphi panels:believe that the focus of research needs to shift from
some evidence from a forecasting study. Technological Fore-comparisons of vague techniques, to studies of thecasting and Social Change 31, 79–85.
processes related to judgment change and accuracy Einhorn, H. J., Hogarth, R. M., & Klempner, E. (1977). Quality ofwithin groups. We suggest there should be more group judgment. Psychological Bulletin 84, 158–172.research on the role of feedback in Delphi, and how Erffmeyer, R. C., Erffmeyer, E. S., & Lane, I. M. (1986). The
Delphi technique: an empirical evaluation of the optimalaspects of the task, the measures, and the panellistsnumber of rounds. Group and Organization Studies 11(1),interact to determine how first round Delphi groups120–128.
are transformed to final round groups. We need toErffmeyer, R. C., & Lane, I. M. (1984). Quality and acceptance of
understand the underlying processes of techniques an evaluative task: the effects of four group decision-makingbefore we can hope to determine their contingent formats. Group and Organization Studies 9(4), 509–529.
Felsenthal, D. S., & Fuchs, E. (1976). Experimental evaluation ofutilities.five designs of redundant organizational systems. Administra-tive Science Quarterly 21, 474–488.
Fischer, G. W. (1981). When oracles fail – a comparison of fourReferences procedures for aggregating subjective probability forecasts.
Organizational Behavior and Human Performance 28, 96–110.Gowan, J. A., & McNichols, C. W. (1993). The efforts ofArmstrong, J. S. (1985). Long range forecasting: from crystal ball
alternative forms of knowledge representation on decisionto computer, 2nd ed., Wiley, New York.making consensus. International Journal of Man–MachineArmstrong, J. S., & Lusk, E. J. (1987). Return postage in mailStudies 38, 489–507.surveys: a meta-analysis. Public Opinion Quarterly 51(2),
Gustafson, D. H., Shukla, R. K., Delbecq, A., & Walster, G. W.233–248.(1973). A comparison study of differences in subjectiveAshton, R. H. (1986). Combining the judgments of experts: howlikelihood estimates made by individuals, interacting groups,many and which ones? Organizational Behaviour and HumanDelphi groups and nominal groups. Organizational BehaviorDecision Processes 38, 405–414.and Human Performance 9, 280–291.Bardecki, M. J. (1984). Participants’ response to the Delphi
Helmer, O. (1975). Foreward. In: Linstone, H., & Turoff, M.method: an attitudinal perspective. Technological Forecasting(Eds.), The Delphi method: techniques and applications, Ad-and Social Change 25, 281–292.dison-Wesley, London.Best, R. J. (1974). An experiment in Delphi estimation in
Hill, K. Q., & Fowles, J. (1975). The methodological worth of themarketing decision making. Journal of Marketing Research 11,Delphi forecasting technique. Technological Forecasting and448–452.Social Change 7, 179–192.Bolger, F., & Wright, G. (1994). Assessing the quality of expert
Hogarth, R. M. (1978). A note on aggregating opinions. Organi-judgment: issues and analysis. Decision Support Systems 11,zational Behaviour and Human Performance 21, 40–46.1–24.
Hornsby, J. S., Smith, B. N., & Gupta, J. N. D. (1994). TheBoje, D. M., & Murnighan, J. K. (1982). Group confidenceimpact of decision-making methodology on job evaluationpressures in iterative decisions. Management Science 28,outcomes. Group and Organization Management 19, 112–128.1187–1196.
374 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375
´ ´Hudak, R. P., Brooke, P. P., Finstuen, K., & Riley, P. (1993). Parente, F. J., & Anderson-Parente, J. K. (1987). Delphi inquiryHealth care administration in the year 2000: practitioners’ systems. In: Wright, G., & Ayton, P. (Eds.), Judgmentalviews of future issues and job requirements. Hospital and forecasting, Wiley, Chichester.Health Services Administration 38(2), 181–195. Riggs, W. E. (1983). The Delphi method: an experimental
evaluation. Technological Forecasting and Social Change 23,Isenberg, D. J. (1986). Group polarization: a critical review and89–94.meta-analysis. Journal of Personality and Social Psychology
50, 1141–1151. Rohrbaugh, J. (1979). Improving the quality of group judgment:social judgment analysis and the Delphi technique. Organiza-Jolson, M. A., & Rossow, G. (1971). The Delphi process intional Behavior and Human Performance 24, 73–92.marketing decision making. Journal of Marketing Research 8,
Rosenthal, R. (1978). Combining results of independent studies.443–448.Psychological Bulletin 85(1), 185–193.Kastein, M. R., Jacobs, M., Van der Hell, R. H., Luttik, K., &
Rowe, G., & Wright, G. (1996). The impact of task characteristicsTouw-Otten, F. W. M. M. (1993). Delphi, the issue ofon the performance of structured group forecasting techniques.reliability: a qualitative Delphi study in primary health care inInternational Journal of Forecasting 12, 73–89.the Netherlands. Technological Forecasting and Social Change
Rowe, G., Wright, G., & Bolger, F. (1991). The Delphi technique:44, 315–323.a re-evaluation of research and theory. Technological Forecast-´Larreche, J. C., & Moinpour, R. (1983). Managerial judgment ining and Social Change 39(3), 235–251.marketing: the concept of expertise. Journal of Marketing
Sackman, H. (1975). Delphi critique, Lexington Books, Lex-Research 20, 110–121.ington, MA.Leape, L. L., Freshour, M. A., Yntema, D., & Hsiao, W. (1992).
Saito, M., & Sinha, K. (1991). Delphi study on bridge conditionSmall group judgment methods for determining resource basedrating and effects of improvements. Journal of Transportrelative values. Medical Care 30, NS28–NS39.Engineering 117, 320–334.Linstone, H. A. (1975). Eight basic pitfalls: a checklist. In:
Scheibe, M., Skutsch, M., & Schofer, J. (1975). Experiments inLinstone, H., & Turoff, M. (Eds.), The Delphi method:Delphi methodology. In: Linstone, H., & Turoff, M. (Eds.),techniques and applications, Addison-Wesley, London.The Delphi method: techniques and applications, Addison-Linstone, H. A. (1978). The Delphi technique. In: Fowles, J.Wesley, London.(Ed.), Handbook of futures research, Greenwood Press, Wes-
Sniezek, J. A. (1989). An examination of group process intport, CT.judgmental forecasting. International Journal of Forecasting 5,Linstone, H. A., & Turoff, M. (1975). The Delphi method:171–178.techniques and applications, Addison-Wesley, London.
Sniezek, J. A. (1990). A comparison of techniques for judgmentalLock, A. (1987). Integrating group judgments in subjectiveforecasting by groups with common information. Group andforecasts. In: Wright, G., & Ayton, P. (Eds.), JudgmentalOrganization Studies 15(1), 5–19.forecasting, Wiley, Chichester.
Sniezek, J. A. (1992). Groups under uncertainty: an examinationLunsford, D. A., & Fussell, B. C. (1993). Marketing businessof confidence in group decision making. Organizational Be-services in central Europe. Journal of Services Marketing 7(1),havior and Human Decision Processes 52(1), 124–155.13–21.
Spinelli, T. (1983). The Delphi decision-making process. JournalMartino, J. (1983). Technological forecasting for decision making,of Psychology 113, 73–80.2nd ed., American Elsevier, New York.
Stewart, T. R. (1987). The Delphi technique and judgmentalMiner, F. C. (1979). A comparative analysis of three diverseforecasting. Climatic Change 11, 97–113.group decision making approaches. Academy of Management
Taylor, R. G., Pease, J., & Reid, W. M. (1990). A study ofJournal 22(1), 81–93.survivability and abandonment of contributions in a chain ofMulgrave, N. W., & Ducanis, A. J. (1975). Propensity to changeDelphi rounds. Psychology: A Journal of Human Behavior 27,responses in a Delphi round as a function of dogmatism. In:1–6.Linstone, H., & Turoff, M. (Eds.), The Delphi method:
Van de Ven, A. H., & Delbecq, A. L. (1974). The effectiveness oftechniques and applications, Addison-Wesley, London.nominal, Delphi, and interacting group decision making pro-Myers, D. G. (1978). Polarizing effects of social comparisons.cesses. Academy of Management Journal 17(4), 605–621.Journal of Experimental Social Psychology 14, 554–563.
Van Dijk, J. A. G. M. (1990). Delphi questionnaires versusNeiderman, F., Brancheau, J. C., & Wetherbe, J. C. (1991).individual and group interviews: a comparison case. Tech-Information systems management issues for the 1990s. MISnological Forecasting and Social Change 37, 293–304.Quarterly 15(4), 474–500.
Welty, G. (1974). The necessity, sufficiency and desirability ofOlshfski, D., & Joseph, A. (1991). Assessing training needs ofexperts as value forecasters. In: Leinfellner, W., & Kohler, E.executives using the Delphi technique. Public Productivity and(Eds.), Developments in the methodology of social science,Management Review 14(3), 297–301.Reidel, Boston.Ono, R., & Wedermeyer, D. J. (1994). Assessing the validity of
Wright, G., Lawrence, M. J., & Collopy, F. (1996). The role andthe Delphi technique. Futures 26, 289–304.validity of judgment in forecasting. International Journal of´Parente, F. J., Anderson, J. K., Myers, P., & O’Brien, T. (1984).Forecasting 12, 1–8.An examination of factors contributing to Delphi accuracy.
Journal of Forecasting 3(2), 173–182.
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 –375 375
Biographies: Gene ROWE is an experimental psychologist who George WRIGHT is a psychologist with an interest in thegained his PhD from the Bristol Business School at the University judgmental aspects of forecasting and decision making. He isof the West of England (UWE). After some years at UWE, then at Editor of the Journal of Behavioral Decision Making and anthe University of Surrey, he is now at The Institute of Food Associate Editor of the International Journal of Forecasting andResearch (IFR), Norwich. His research interests have ranged from the Journal of Forecasting. He has published in such journals asexpert systems and group decision support, to judgment and Management Science, Current Anthropology, Journal of Directdecision making more generally. Lately, he has been involved in Marketing, Memory and Cognition, and the International Journalresearch on risk perception and public participation mechanisms in of Information Management.risk assessment and management.