
© The Author 2013. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. All rights reserved. For permissions, please e-mail: [email protected]

A PROPENSITY-ADJUSTED INTERVIEWER PERFORMANCE INDICATOR

BRADY T. WEST*
ROBERT M. GROVES

Abstract  Fair evaluation of interviewers based on their cooperation rates is complicated in most surveys. Depending on the mode(s) of data collection and the methods used to assign sampled cases to interviewers, different interviewers often work cases that are more or less difficult to interview. Further, in centralized telephone facilities, interviewers are given cases affected by the results of earlier attempts by other interviewers. This paper proposes and evaluates an interviewer performance indicator that attempts to repair these weaknesses and can be used in all data collection modes involving interviewers. For each contact made by an interviewer, the deviation between the outcome of that contact (1 = successful interview, 0 = other) and the predicted probability of a successful interview for that contact (according to auxiliary information and survey paradata) is recorded. Each interviewer then receives a score that is the mean of these deviations across all contacts. This performance indicator gives larger credit to the interviewer who obtains success on difficult cases versus relatively easy cases. The indicator also gives smaller penalties to failures on very difficult cases. We analyze face-to-face and telephone contacts from three different surveys to illustrate the computation of this indicator and its properties. We find that calling-history paradata are the strongest predictors of obtaining interviews in both modes (especially for face-to-face contacts), and produce propensity-adjusted performance indicators that more effectively distinguish among interviewers.

Brady T. West is a Research Assistant Professor in the Survey Methodology Program (SMP), Survey Research Center (SRC), Institute for Social Research (ISR), University of Michigan, Ann Arbor, MI, USA. Robert M. Groves is the Provost at Georgetown University, Washington, DC, USA. Work on this paper by Robert Groves was completed while he was Director of the Survey Research Center at the University of Michigan–Ann Arbor. The authors thank James Wagner, Richard Curtin, Grant Benson, and Joe Matuzak for access to and assistance with the Survey of Consumer Attitudes data, the National Center for Health Statistics for access to the National Survey of Family Growth data [NCHS Contract 200-2000-07001], and their colleagues at the Institute for Employment Research (IAB) in Nuremberg, Germany (especially Dr. Mark Trappmann and Daniela Hochfellner) for access to and assistance with the PASS data (made possible through the IAB Research Data Center established at the University of Michigan–Ann Arbor). The authors also thank the National Science Foundation for grant support to pursue this idea [SES0207435 to R.M.G., Principal Investigator]. The authors do not have any conflicts of interest. *Address correspondence to Brady T. West, Survey Methodology Program, P.O. Box 1248, 426 Thompson Street, Ann Arbor, MI 48106, USA; e-mail: [email protected].

Public Opinion Quarterly, Vol. 77, No. 1, Spring 2013, pp. 352–374

doi:10.1093/poq/nft002  Advance Access publication 11 April 2013



Introduction

Interviewer-level cooperation rates are often used as performance metrics for interviewers (e.g., Groves and Couper 1998; Kennickell 2006; Fowler 2008; Tarnai and Moore 2008). In area probability sample surveys, interpreting such rates is complicated. Interviewers whose sample assignments lie in large urban areas tend to obtain lower cooperation rates than those working in small rural communities (Groves and Couper 1998). These differences are believed to arise more because of target population differences than interviewer performance differences (e.g., Singer, Frankel, and Glassman 1983; Wiggins, Longford, and O'Muircheartaigh 1992; Morton-Williams 1993; O'Muircheartaigh and Campanelli 1999; Snijkers, Hox, and de Leeuw 1999; Hox and de Leeuw 2002; but see Hox, de Leeuw, and Kreft 1991). In contrast, interviewers working the same shift in centralized telephone interviewing facilities share near-random subsamples of cases. However, even in centralized telephone surveys, there remain three complications of interpreting interviewer-level cooperation rates: (1) interviewers working different shifts tend to contact people with different likelihoods of cooperation; (2) interviewer response rates are affected by the contributions of other interviewers' prior interactions with those cases; and (3) interviewers assigned refusal conversion cases have a greater burden to obtain a given response rate. Given that cases in telephone surveys are often assigned to any interviewer ready for the next dialing, evaluating performance based on each interaction with a sample member (rather than the final outcomes of cases) is an attractive alternative.

It would therefore be useful for the survey practitioner to have a metric for comparing the effectiveness of interviewers that adjusts for the difficulty of specific contacts with sampled cases (where "contacts" refer to face-to-face or telephone interactions with a sample householder). Unfortunately, most metrics used for evaluating interviewer performance fail to recognize the difficulty of assigned cases (e.g., cooperation rates, calls and completes per hour, measures of data quality, performance relative to other interviewers, etc.; see Barioux [1952] or Tarnai and Moore [2008]). Two recently developed metrics have attempted to address this problem, and we aim to extend these initial efforts in this paper.

Durand (2005) developed a scoring system for interviewer performance in telephone surveys that conceptualized performance as a function of completed interviews, refusals, and interviewer time required to achieve those outcomes. For each interviewer within each calling shift, Durand's Net Contribution to Performance Index (NCPI) awards: (1) one point for a completed interview from a never-reached number; (2) two points for a converted refusal; and (3) one point for a completed interview from a previous appointment. It also deducts one point for a refusal from a never-reached number.



The sum of these points for all calls made during a shift is then divided by the maximum number of completed interviews during the shift (defined as the shift duration divided by the average length of a completed interview in that shift) to compute the NCPI. The author acknowledges that the assignment of these points, while clearly ordered correctly relative to one another, is somewhat arbitrary. Further, the NCPI combines success in gaining participation with speed of gaining that participation. We seek a measure that reflects success at gaining participation while accounting for the difficulty of assignments, without reflecting the speed of the work.
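Restated compactly (this is our paraphrase of the definition above, not Durand's own notation), the index for an interviewer within a calling shift is

    NCPI = (sum of points over all calls made in the shift) / (shift duration / average length of a completed interview in the shift),

so that the denominator approximates the maximum number of interviews the shift could in principle have produced.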

In other work, Laflamme and St. Jean (2011) defined a weighted interviewer performance (WIP) measure for CATI surveys as a function of three factors: the current state of the case being worked (e.g., a contacted case without any prior refusals), the number of calls previously made to a case, and the time of day at which the call was made (day or evening). Interviews completed with cases falling into cells with lower cooperation rates receive more weight in the calculation of the WIP measure for an interviewer; see Laflamme and St. Jean (2011, pp. 6–7) for computational details. These authors note that each of the three factors affects the likelihood of gaining participation when calling a given case. This indicator therefore partially incorporates the difficulty of cases currently being worked, but is limited by using cells formed by only three factors to define the difficulty of a case.

We propose a metric for evaluating interviewers that focuses on contacts with sample cases and incorporates the projected difficulty of those cases based on response propensity modeling. This extends the ideas proposed by Durand (2005) and Laflamme and St. Jean (2011) to both face-to-face and telephone data collection modes, and allows for differences in expected response propensity between cases based on a number of predictor variables. For example, an interviewer successfully obtaining cooperation from a more difficult previous refusal (with a lower expected response propensity, based on several other predictor variables) would be given a larger score for the contact. Although response propensity modeling provides a more general measure of the difficulty of active cases in any given survey, its value depends on the predictive power of the models. The next section describes our proposed method, including the response propensity model and the use of informative survey paradata.

The Propensity-Adjusted Interviewer Performance (PAIP) Score

Central to our proposed method is the ability to estimate a strong model of response propensity, where a contact-level response indicator (1 = successful interview, 0 = other outcome) is regressed on a rich set of survey paradata and auxiliary variables that are (1) available for all sample contacts; and (2) predictive of cooperation with the survey request at the next contact. With appropriate survey paradata (Couper 1998; Couper and Lyberg 2005; Kreuter and Casas-Cordero 2010), an interviewer performance indicator might be considered that measures the interviewer's average performance in obtaining interviews over all contacts made with sampled households. The use of paradata, or "data collection process" data, to estimate response propensities and possibly repair nonresponse bias has received a considerable amount of recent research attention (see West [2011] for a review). In this paper, we use selected paradata, in addition to other auxiliary variables available for both respondents and nonrespondents, to estimate response propensities for active cases at subsequent contacts.


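As a concrete illustration, the following minimal sketch fits a contact-level response propensity model of this kind in Python (we fit the actual models in our examples with PROC LOGISTIC in SAS; the file and variable names below are hypothetical placeholders for a survey's own paradata):

    import pandas as pd
    import statsmodels.api as sm

    # One row per contact with a sampled case; "interview" equals 1 if the
    # contact produced a completed interview and 0 otherwise (hypothetical
    # file and column names).
    contacts = pd.read_csv("contact_history.csv")

    predictors = [
        "n_prior_calls",               # calling-history paradata
        "days_since_last_contact",
        "soft_appointment_last",       # last contact produced a soft appointment
        "hard_appointment_last",       # last contact produced a hard appointment
        "any_prior_resistance",        # resistance expressed at a prior contact
        "interviewer_experience_yrs",  # interviewer-level auxiliary variable
    ]

    # Logistic model for the probability of a completed interview at the next
    # contact, given what has been recorded about the case to that point.
    X = sm.add_constant(contacts[predictors])
    fit = sm.Logit(contacts["interview"], X).fit(disp=False)

    # Predicted probability p_jc for case j at contact c, stored with the
    # paradata record of the contact.
    contacts["p_jc"] = fit.predict(X)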

Consider the following ingredients for computing a propensity-adjusted interviewer performance indicator for a given interviewer:

a) For each active case, estimate the probability that the next contact will generate a positive outcome; this probability is stored in the paradata record of the case, and is called p_jc (the probability of a successful outcome for case j at contact c).

b) The interviewer who contacts case j at the next call, based on the outcome of that contact, O_jc, is assigned a "deviation" score of d_jc = O_jc – p_jc, where O_jc = 1 for a completed interview and 0 otherwise.1

c) The interviewer works a total of n cases during the data collection period, making a total of C_j contacts with a given case j; the deviation scores from b) are then averaged over all contacts with all cases, with the result being

    $\bar{d} = \frac{\sum_{j=1}^{n} \sum_{c=1}^{C_j} d_{jc}}{\sum_{j=1}^{n} C_j}$,

or a propensity-adjusted interviewer performance (PAIP) score based on contacts only.
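Computationally, the score is then a single grouped mean over the contact-level deviations. Continuing the sketch above (the same hypothetical data frame, plus an interviewer identifier on each contact record):

    # Deviation between the observed outcome O_jc and the pre-contact
    # prediction p_jc.
    contacts["d_jc"] = contacts["interview"] - contacts["p_jc"]

    # PAIP score: mean deviation over all contacts made by each interviewer.
    # Positive scores mean more interviews than expected, given the difficulty
    # of the contacts the interviewer actually worked.
    paip_scores = contacts.groupby("interviewer_id")["d_jc"].mean()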

This performance indicator gives a larger credit to the interviewer who obtains success on very difficult cases versus relatively easy cases, and a larger penalty for failure on very easy cases versus relatively difficult cases. We note that response probabilities at subsequent contacts from part a) need to be estimated using the response propensity models described earlier. The appropriate classification of cases as "easy" or "difficult" therefore relies on the model used to estimate response propensity.

1. We acknowledge that establishment of an appointment for a future interview might be considered a "successful" outcome of a contact. The methodology behind computing the PAIP score is flexible enough to allow survey managers to choose an arbitrary contact outcome value for an appointment, such as 0.5 (as opposed to 1 or 0), when computing the PAIP scores. However, appointments do not always guarantee interviews at future contacts, especially in telephone surveys (as we will demonstrate in our examples), and we are interested in the ideal product of the survey process: a completed interview. Establishing appointments will make cases "easier" in terms of response propensity, which we also demonstrate in our examples, so we include establishment of appointments at prior contacts in the response propensity models that we use to illustrate the computation of PAIP scores.



Examples of Computing and Evaluating PAIP Scores

We present three illustrations of the PAIP score: one using face-to-face contacts with households in an area probability sample, where interviewers are resident in primary sampling areas; one using contacts from several years of a monthly random-digit-dial (RDD) telephone survey conducted in a centralized CATI facility; and one using both CAPI and CATI contacts from a mixed-mode survey. Evaluating the utility of our proposed PAIP score requires estimation of the difficulty of cases assigned to an interviewer. We estimate case difficulty as the mean predicted response propensity for all final contacts made by an interviewer, because these predictions incorporate all of the paradata and other auxiliary information that has been recorded for a case from all prior calling efforts. Our evaluation explores the correlations of mean estimated difficulty measures for the cases assigned to interviewers, the cooperation rates of interviewers, and our proposed PAIP scores. In addition, the paradata and call duration2 data necessary to compute the evaluation metrics proposed by Durand (2005) and Laflamme and St. Jean (2011) were available for each year of the monthly RDD telephone survey, and we use these data to compare the computed PAIP scores to these two alternative metrics.3
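This interviewer-level evaluation can be sketched in a few lines, continuing the hypothetical data frame from the earlier sketches (the final-contact flag is likewise hypothetical):

    # Restrict to each case's final contact, whose predicted propensity
    # reflects the full recorded calling history for the case.
    finals = contacts[contacts["is_final_contact"] == 1]

    by_interviewer = finals.groupby("interviewer_id").agg(
        mean_difficulty=("p_jc", "mean"),        # mean predicted propensity
        cooperation_rate=("interview", "mean"),  # interviews per final contact
    ).join(paip_scores.rename("paip"))

    # Pairwise correlations among difficulty, cooperation rate, and PAIP score.
    print(by_interviewer.corr())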

EXAMPLE 1: CYCLE 7 OF THE U.S. NATIONAL SURVEY OF FAMILY GROWTH (NSFG)

The National Survey of Family Growth (NSFG) is an ongoing area probability sample survey of U.S. household members 15–44 years of age, with a 60–80-minute face-to-face interview (using CAPI and ACASI) about sexual and fertility experiences, partnering, and family formation events (Groves et al. 2009; Lepkowski et al. 2010). In initial screening interviews, NSFG interviewers complete a roster of a sampled household, and a single respondent between the ages of 15 and 44 is then randomly selected from the roster for the main interview, according to a probability sampling scheme that is designed to meet target sample sizes in specified demographic subgroups (Lepkowski et al. 2010). The main interviews are then conducted either immediately after the screening interview or at a later appointed date, and in English or Spanish, depending on the respondent's preferences. This example examines the performance of 91 interviewers working in 83 different primary sampling areas (PSUs) over the first 10 quarters (i.e., 30 months) of data collection for the survey (June 2006 to December 2008). Using the AAPOR RR4 calculation (AAPOR 2008), with individual outcomes adjusted by base sampling weights, the NSFG achieved overall quarterly weighted response rates ranging from 70 to 80 percent across demographic subgroups during these 10 quarters. Given that most NSFG interviewers work in a single PSU, we attempted to include several PSU-level predictor variables that might explain variation in case difficulty among interviewers. We analyze visits in which contact was established with someone at the sampled housing unit, after completion of an initial screening interview designed to select an eligible person between the ages of 15 and 44 for the main NSFG interview.

2. Some of the Survey of Consumer Attitudes (SCA) call duration data may have been subject to measurement error. Across the four SCA years analyzed in this study, between 21.84 and 42.99 percent of the data in fields measuring computer-recorded call durations were missing. The missing call-duration values were "imputed" based on the time elapsed between subsequent calls made by the same interviewer on the same day (which may have, for example, included breaks). We note that the collection of the SCA calling-history data was not designed with computation of the NCPI and WIP metrics in mind.
3. SAS code for computing the two alternative metrics is available upon request from the authors. As indicated earlier, these competing metrics cannot be computed for face-to-face surveys, and the CATI contacts from the mixed-mode survey did not have the necessary call-duration data available.



The NSFG uses a responsive design framework (Groves and Heeringa 2006), and accordingly collects a very rich set of paradata on sampled area segments within PSUs (e.g., interviewer safety concerns about the area), housing units (e.g., physical access impediments), calls (e.g., number of calls attempted), contacts (e.g., respondents asking questions during the contact), and interviewer observations (e.g., evidence of presence of young children) to enhance design decisions during data collection (Groves et al. 2009, pp. 23–24). We use these paradata, along with interviewer experience (in years) and selected Census 2000 information for the PSUs, to build the response propensity model required for computation of the PAIP scores. Our data set included 42,200 contacts, and we used PROC LOGISTIC in SAS to fit a discrete-time hazard model (see Allison [2010], Chapter 7, for details) and predict the probability of a successful main interview at the next contact. Sample cases had as many records in the data set as contacts were made, until a main interview was completed or the data collection was finished for a given quarter. The dependent variable was equal to 1 for contacts with successful main interviews and 0 for contacts with other outcomes. We included all of the auxiliary variables indicated in table 1, as our goal was to maximize the fit of the model. Table 1 presents the estimated coefficients for this model (for additional detail on the predictor variables, see Lepkowski et al. [2010] and table A3 in the online appendix).
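The discrete-time hazard setup amounts to a contact-level analysis file of the kind sketched below (in Python rather than the SAS we actually used, and with hypothetical field names): each case contributes one record per contact, and stops contributing once a main interview is obtained.

    import pandas as pd

    # Build the contact-level (person-period) file from raw call records.
    calls = pd.read_csv("call_records.csv")             # one row per call attempt
    records = calls[calls["contact_made"] == 1].copy()  # keep contacts only

    records = records.sort_values(["case_id", "call_datetime"])
    records["interview"] = (records["outcome_code"] == "COMPLETED_MAIN").astype(int)

    # Drop contacts occurring after a case's interview was completed, so each
    # case contributes contacts only up to its terminal event (or until the
    # end of the data collection period).
    already_done = records.groupby("case_id")["interview"].transform(
        lambda s: s.cummax().shift(fill_value=0)
    )
    records = records[already_done == 0]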

The model has a pseudo-R2 value of 0.337. Most of the significant relationships are in the expected directions: For example, interviewers with more experience have higher probabilities of obtaining an interview given contact; cases receiving more calls or presenting resistance have lower probabilities; cases with previous appointments have higher probabilities; and cases in larger areas with more urban features tend to have lower probabilities. The coefficients for several interviewer observations and other variables collected from the screening interview are also statistically non-zero, but these variables are not making a substantial contribution to the model; their exclusion yields a pseudo-R2 value of 0.333 (see the discussion section for more details). Predicted probabilities of successful interviews at each contact, based on the estimated coefficients from this model, were used to compute the PAIP scores for each interviewer.


Table 1. Estimated Discrete-Time Hazard Model Coefficients (and Standard Errors) for Predictors of Main Interview Completion at the Next Contact: National Survey of Family Growth (NSFG) Cycle 7, and Survey of Consumer Attitudes (SCA), 2003–2006

Intercept: NSFG –1.551** (0.055); SCA –2.297** (0.639)
Interviewer experience (years): NSFG 0.022** (0.004)

Paradata: calling history
Number of contacts with resistance (SCA): SCA –0.324** (0.036)
Number of contacts with no resistance (SCA): SCA 0.022 (0.034)
Number of non-contacts (SCA): SCA 0.096** (0.034)
Prior contact established: NSFG 0.539** (0.051); SCA 0.106** (0.031)
Sample person statements at last contact (NSFG): NSFG –0.285** (0.085)
Number of prior contacts (NSFG): NSFG 0.012 (0.025)
Number of prior calls: NSFG –0.039** (0.004); SCA –0.086* (0.034)
Last contact produced other result: SCA 0.189** (0.033)
Last contact produced soft appointment: NSFG 1.135** (0.072); SCA 0.586** (0.037)
Last contact produced hard appointment: NSFG 2.529** (0.040); SCA 1.338** (0.049)
Last contact had max resistance: NSFG –0.517* (0.215); SCA –0.430** (0.055)
Days since last contact: NSFG –0.018** (0.002); SCA –0.049** (0.004)
Days left in month (SCA): SCA –0.006** (0.001)
Prior resistance at any contact: NSFG –0.286 (0.198)
Sample person questions at last contact (NSFG): NSFG –0.119 (0.067)
Sample person statements, any prior contact (NSFG): NSFG 0.030 (0.084)
Sample person questions, any prior contact (NSFG): NSFG –0.072 (0.063)

Interviewer observations (NSFG only)
Children in the household: 0.037 (0.026)
All residential area: –0.022 (0.026)
Multi-unit structure: 0.084** (0.030)
Evidence of non-English speakers: 0.256** (0.095)
Evidence of Spanish speakers: –0.371** (0.097)
Neighborhood safety concerns: 0.103** (0.031)
Housing unit access impediments: –0.165** (0.042)
Person likely in sexual relationship: –0.134** (0.033)

Information from screening interview (NSFG only)
Single-person household: 0.335** (0.043)
Teenage sample person: 0.080* (0.036)
Female sample person: 0.037 (0.025)
African-American sample person: 0.005 (0.029)
Spanish-language screener: –0.209** (0.052)

County-level Census / MSG variables
Urban (self-representing) primary sampling area: NSFG –0.061 (0.036)
Occupied housing units (in thousands, Census 2000): NSFG –0.001** (0.001)
% Housing units in segments with > 10% Black pop.: NSFG 0.003** (0.001)
% Housing units in segments with > 10% Hispanic pop.: NSFG 0.003** (0.001)
% Housing units in segments with > 10% Black and > 10% Hispanic pop.: NSFG 0.007** (0.002)
% Graduated from college: SCA 0.012** (0.003)
% with race = White: SCA 0.007** (0.001)
Median home value in thousands: SCA –0.001** (< 0.001)
Northeast Census region (vs. West): SCA –0.171** (0.041)
Midwest Census region (vs. West): SCA 0.098** (0.037)
South Census region (vs. West): SCA –0.145** (0.037)
Nielsen County A–top 21 metro areas (vs. D): SCA –0.133* (0.053)
Nielsen County B–metro areas with more than 85,000 households (vs. D): SCA –0.090 (0.049)
Nielsen County C–more than 20,000 households (vs. D): SCA –0.075 (0.045)
% with income $0K–$10K: SCA –0.002 (0.004)
% with income > $100K: SCA –0.007 (0.005)
% with Hispanic race: SCA –0.001 (0.001)
Central city of an MSA (vs. not in MSA): SCA –0.020 (0.046)
Inside county containing central city (vs. not in MSA): SCA –0.006 (0.046)
Inside suburban county of MSA (vs. not in MSA): SCA 0.008 (0.046)
In MSA with no central city (vs. not in MSA): SCA –0.128 (0.099)
% renting or other not owning: SCA –0.001 (0.001)
Median years of education: SCA 0.054 (0.046)
Median household income in thousands: SCA –0.002 (0.004)
Households per 100 square feet: SCA –0.050 (0.049)
% Telephone numbers listed: SCA 0.001 (0.001)
% with age 0–17: SCA 0.003 (0.005)
% with age 65+: SCA 0.001 (0.003)

Number of contacts: NSFG 42,200; SCA 84,858
Nagelkerke pseudo-R2: NSFG 0.337; SCA 0.074

Note.—See table A3 in the online appendix for more detailed descriptions of these variables.
*p < 0.05; **p < 0.01



Table 2 sorts all 42,200 contacts in the NSFG data set by quintiles of the estimated probabilities of obtaining a main interview. The second column of table 2 shows a close fit between the predicted response propensities and actual contact outcomes; the propensity model does a good job of discriminating contacts that are difficult for interviewers from those that are easy. The third column of table 2 presents standard deviations of the quintile-specific interviewer-level cooperation rates for all final contacts falling into each quintile. These results show that final contacts having a very high propensity of an interview are successfully handled by almost all interviewers, while final contacts with a very low propensity result in more variance among the interviewers in cooperation rates. In short, "easy" cases (based on the estimated propensity model in table 1) generated uniformly high cooperation rates across all interviewers, and "difficult" cases produced large interviewer variation in cooperation rates. These results show that the predicted response propensities incorporating all available information for a case provide a strong proxy of case difficulty in the NSFG, which is important for the success of the PAIP methodology when evaluating interviewers.
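A sketch of this diagnostic, reusing the hypothetical contact-level data frame from the earlier sketches (pd.qcut may need duplicates="drop" when many low predicted propensities are tied, the situation described for a reduced model in footnote 8):

    import pandas as pd

    # Quintiles of the predicted propensity of an interview at each contact.
    contacts["quintile"] = pd.qcut(
        contacts["p_jc"], 5, labels=False, duplicates="drop"
    ) + 1

    # Observed interview rate versus mean predicted propensity per quintile.
    fit_check = contacts.groupby("quintile").agg(
        pct_interview=("interview", "mean"),
        mean_propensity=("p_jc", "mean"),
        n_contacts=("interview", "size"),
    )

    # Spread of interviewer-level cooperation rates on final contacts, by quintile.
    finals = contacts[contacts["is_final_contact"] == 1]
    coop = finals.groupby(["quintile", "interviewer_id"])["interview"].mean()
    sd_by_quintile = coop.groupby(level="quintile").std()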




Finally, we expect the interviewer-level cooperation rates (based on final outcomes with their assigned cases) to be positively correlated with the mean predicted response propensity of cases assigned to the interviewer (also based on final contacts). That is, the more traditional way of evaluating interviewers favors those interviewers who are assigned easier cases. The estimated correlation of the interviewer-level cooperation rate and the mean predicted response propensity of cases assigned to the interviewer was r = 0.461 (p < 0.001), showing that interviewers with easier assignments do indeed tend to have higher cooperation rates. The estimated correlation of the mean predicted response propensity for an interviewer and the PAIP score computed for the interviewer was r = –0.494 (p < 0.001), suggesting a very different story: Interviewers working more difficult cases tended to have higher PAIP scores (meaning that success with difficult cases is being rewarded), whereas interviewers working easy cases tended to have lower PAIP scores (meaning that failures with easy cases are being penalized). The interviewer-level cooperation rate and the PAIP score were positively correlated but not strongly (r = 0.367), suggesting that the PAIP scores and the cooperation rates have only about 13.5 percent of their variance in common. These results provide support for the PAIP score as a distinct measure of interviewer performance.

Table 2. Percentages of NSFG Contacts Resulting in a Completed Interview and Standard Deviations of Cooperation Rates for Final Contacts across Interviewers, by Quintile of Predicted Response Propensity

Columns: quintile of the contact's predicted propensity of an interview; percentage of contacts resulting in an interview / mean predicted response propensity (n = 42,200 contacts); SD of final contact cooperation rates for interviewers within the quintile (number of interviewers); number of contacts falling into quintile (number of final contacts).

1 (Low)  | 12.6% / 0.124 | 0.282 (n = 88) | 8,437 (2,011)
2        | 15.9% / 0.177 | 0.269 (n = 87) | 8,443 (1,565)
3        | 20.2% / 0.205 | 0.228 (n = 87) | 8,440 (1,915)
4        | 32.4% / 0.300 | 0.089 (n = 88) | 8,440 (2,911)
5 (High) | 78.2% / 0.787 | 0.014 (n = 86) | 8,440 (6,641)


EXAMPLE 2: THE MONTHLY U.S. SURVEYS OF CONSUMERS

The Surveys of Consumers (abbreviated here as SCA) are monthly RDD household telephone surveys interviewing randomly selected U.S. adults about their attitudes toward the economy and their personal financial condition (see http://www.sca.isr.umich.edu for more details). Each month, about 300 new interviews are taken (in a separate part of the design, second-wave interviews are taken with those first interviewed six months earlier). A list-assisted RDD design provides an equal probability sample of all telephone households, and when sampled households are initially contacted, interviewers compile a list of all household members 18 and older. From this list, a specific household member is randomly selected using probability-based "respondent selection tables" assigned to each household (see http://www.sca.isr.umich.edu for additional sampling details), and subsequent interviews are conducted in English or Spanish. The SCA analyses in this example are based on all calls resulting in household contact from 2003 to 2006, using only the first-wave RDD sample cases; only landline telephone numbers were sampled during these years. The response rate for the first-wave cases ranged from 43.2 to 47.8 percent (RR2; AAPOR 2008) during this period. As is the case with most centralized telephone surveys, interviewers working on the survey share the sample cases, with calling directed by sample administration software. The RDD sample design and the sharing of cases are expected to result in a more uniform distribution of difficult cases across interviewers working the same shifts.

In general, less paradata and other auxiliary information are available for predicting response propensity in telephone surveys like the SCA,4 given the absence of visual observations from the interviewers and relatively sparse frame information. We do consider a set of collected paradata that documents previous calling efforts and contact history (see table 1), generated using the Blaise software for CATI applications (Westat 2004). We also merged auxiliary information (purchased from the commercial vendor Marketing Systems Group) into the contact history data set, by first determining the ZIP code(s) associated with each telephone number's exchange, and then linking aggregated data for that ZIP code (or the area defined by the first three digits of the ZIP code, if there were missing data for the ZIP code; see Johnson et al. [2006] for a similar example). Table A3 in the online appendix provides additional details on the predictor variables analyzed in table 1. A discrete-time hazard model was fitted to all SCA contacts from 2003 to 2006 (84,858 contacts in total) using these paradata and the other auxiliary variables (see table 1 for the full list) as predictors of a binary indicator for a completed interview given contact.

4. In addition, we could link interviewer experience to the SCA contacts only with the consent of interviewers, and only about 25 percent of interviewers consented to this process. Consequently, we could not test the relationship of interviewer experience with response propensity without experiencing a large drop in the number of contacts.



Table 1 presents estimates of the coefficients in the discrete-time hazard model for the SCA contact data, alongside estimates for the NSFG. Apparent in this table is the lack of available paradata for building this response propensity model relative to the NSFG (the pseudo-R2 value is 0.074). Also of note is the fact that the paradata tended to be much stronger predictors of interview completion than the other auxiliary variables. When removing all of the additional auxiliary variables from the full model presented in table 1, the pseudo-R2 value dropped from 0.074 to 0.052, suggesting that the additional auxiliary information was not contributing substantial predictive power to the model. Only higher education, white race, lower median home values, and Census region appear to be strong area-level predictors of response propensity, among all the additional auxiliary variables considered.

Table 3 repeats the evaluation performed for the NSFG model by SCA year, to examine the stability of the results across the four years. As with the NSFG analysis, the percentage of contacts yielding an interview increased by propensity quintile in a nearly monotonic fashion (in each SCA year), suggesting that the model can predict the likelihood of an interview in a useful fashion despite the low pseudo-R2 value.

Across the four SCA years, the relationship between the response propensity quintile and the standard deviation of the interviewer-level cooperation rates (based on final contacts) is consistently much weaker than seen in table 2 for the NSFG, and there could be several reasons for this finding. First, in centralized telephone facilities, interviewers are given a more uniform mix of sample cases. This also means that individual interviewer cooperation rates are more uniform than in area probability surveys, where cooperation rates vary because of differences across primary areas. Second, unlike the NSFG, a much smaller proportion of final contacts in the SCA were with cases in the highest estimated propensity quintile, and a much higher proportion of final contacts were with cases in the lowest quintile. This would have the effect of reducing the standard deviation in the lowest quintile and increasing the standard deviation in the highest quintile (relative to the NSFG), assuming that sums of squared deviations from the mean in each quintile were similar to those found in the NSFG.

The mean predicted response propensities and the cooperation rates for the interviewers in each year (both based on final contacts) were moderately correlated (with correlations ranging between 0.41 and 0.61 across the four years). That is, as with the NSFG, interviewers in the SCA who disproportionately call "easy" cases achieve higher cooperation rates. Correlations of the mean predicted response propensities and the computed PAIP scores ranged from –0.11 to 0.25 across the four years, with no correlations significant at p < 0.05. These negligible correlations suggest that working more difficult cases in telephone surveys does not necessarily result in higher PAIP scores, but this result could be a function of the poorer predictive ability of the response propensity model relative to the NSFG.5 The correlations between interviewer cooperation rates and the computed PAIP scores ranged between 0.38 and 0.63 across the four years. These moderate correlations once again suggest that the two indicators are measuring somewhat different phenomena, with at most 40 percent of their variance shared in common.



Table 3. Percentages of SCA Contacts Resulting in a Completed Interview and Standard Deviations of Cooperation Rates for Final Contacts across Interviewers, by Survey Year and Quintile of Predicted Response Propensity

Columns: quintile of the contact's predicted propensity of an interview; percentage of contacts resulting in an interview / mean predicted response propensity; SD of final contact cooperation rates for interviewers within the quintile (number of interviewers); number of contacts falling into quintile (number of final contacts).

2003 (18,336 contacts)
1 (Low)  | 8.4% / 0.072  | 0.2622 (n = 54) | 2,875 (967)
2        | 11.5% / 0.116 | 0.2592 (n = 59) | 3,917 (1,061)
3        | 15.5% / 0.156 | 0.2774 (n = 53) | 3,867 (1,015)
4        | 19.9% / 0.195 | 0.2212 (n = 60) | 3,832 (1,020)
5 (High) | 26.7% / 0.271 | 0.2242 (n = 63) | 3,845 (1,253)

2004 (20,955 contacts)
1 (Low)  | 7.8% / 0.069  | 0.2263 (n = 45) | 3,777 (1,421)
2        | 11.4% / 0.116 | 0.2081 (n = 49) | 4,202 (1,205)
3        | 16.2% / 0.157 | 0.2146 (n = 52) | 4,450 (1,222)
4        | 20.2% / 0.195 | 0.2129 (n = 50) | 4,497 (1,364)
5 (High) | 29.4% / 0.272 | 0.2306 (n = 50) | 4,029 (1,528)

2005 (22,530 contacts)
1 (Low)  | 6.4% / 0.066  | 0.1788 (n = 33) | 4,711 (1,655)
2        | 10.6% / 0.116 | 0.2627 (n = 42) | 4,258 (1,080)
3        | 14.7% / 0.157 | 0.2320 (n = 45) | 4,333 (1,012)
4        | 19.3% / 0.195 | 0.2607 (n = 48) | 4,427 (1,153)
5 (High) | 27.3% / 0.280 | 0.1777 (n = 52) | 4,801 (1,604)

2006 (23,037 contacts)
1 (Low)  | 6.5% / 0.064  | 0.1955 (n = 36) | 5,609 (1,983)
2        | 10.6% / 0.116 | 0.2788 (n = 42) | 4,595 (1,208)
3        | 16.2% / 0.156 | 0.2170 (n = 39) | 4,321 (1,118)
4        | 18.0% / 0.195 | 0.2002 (n = 39) | 4,216 (1,094)
5 (High) | 29.0% / 0.285 | 0.2049 (n = 45) | 4,296 (1,523)

5. In the NSFG, the mean predicted response propensities for interviewers, as a measure of average case difficulty, ranged from 0.115 to 0.693; in the SCA, the range of mean predicted response propensities across all four years was 0.084 to 0.358. The SCA correlations are therefore based on a much more limited range of case difficulty, which would generally be the case when comparing telephone surveys (lower response rates) and face-to-face surveys (higher response rates).


Finally, we examined the correlations of the PAIP scores with the NCPI and WIP measures6 for each SCA year. Table 4 presents the results of these analyses.

We note a general tendency of the PAIP scores and the WIP measures to have a moderate positive correlation, although the correlation in 2003 was relatively small. This finding is not entirely surprising, given that the WIP measure adjusts for some of the same indicators of case difficulty (based on calling history) that proved significant in computing the PAIP scores. We remind readers that the PAIP score provides the more general advantage of allowing as many predictors of response propensity as possible (both continuous and categorical), as opposed to the three discrete factors used for computing the WIP measures, resulting in a more sensitive measure of case difficulty (and likely the moderate positive correlations in this example). The PAIP scores and the NCPI scores tended not to be correlated in each of the four SCA years, and appear to be measuring different aspects of interviewer performance. This is also not surprising, given that the NCPI measure incorporates the speed at which the work was completed, unlike the PAIP and WIP measures. To summarize these results, the PAIP scores allow for a more general adjustment for case difficulty and do not consider the speed at which interviewers complete their assignments, making them distinct from the WIP and NCPI measures.

6. The specific categories of each factor (case status, number of calls made to case, and time of day of call) defining the productivity cells required for the WIP measure can be selected by the survey manager. For this particular application, we defined three case-status categories at the time of a contact (never contacted, contacted with previous refusal, contacted with no previous refusal), five categories for number of calls (0, 1–2, 3–5, 6–10, 11+), and three categories for time of day based on three general SCA calling shifts (weekdays 10 a.m.–6 p.m., weekdays 6 p.m.–midnight, and weekends), resulting in 45 total cells for assessing productivity. The NCPI measures for each year were computed based on the three aforementioned SCA calling shifts, and cumulative sums of the daily numerator and denominator values were used to define the NCPI measure (i.e., each interviewer had one NCPI score for the entire year). The SAS code used to compute both sets of measures is available from the authors upon request.

Table 4. Correlations of PAIP Scores with Alternative Measures of Interviewer Performance Incorporating Case Difficulty, SCA 2003–2006

Columns: year (number of interviewers); Pearson correlation (r) of the PAIP score with the NCPI measure (Durand 2005); Pearson correlation (r) of the PAIP score with the WIP measure (Laflamme and St. Jean 2011).

2003 (69) | 0.082  | 0.010
2004 (56) | –0.078 | 0.457**
2005 (61) | –0.087 | 0.627**
2006 (50) | 0.325* | 0.744**

Note.—PAIP score ranges: –0.210 to 0.291 (2003), –0.194 to 0.547 (2004), –0.213 to 0.722 (2005), –0.304 to 0.703 (2006). NCPI measure ranges: 0.000 to 0.150 (2003), 0.000 to 0.128 (2004), 0.000 to 0.169 (2005), 0.000 to 0.207 (2006). WIP measure ranges: 0.000 to 0.650 (2003), 0.000 to 0.905 (2004), 0.000 to 0.998 (2005), 0.000 to 1.000 (2006).
*p < 0.05; **p < 0.01



EXAMPLE 3: THE GERMAN LABOR MARKET AND SOCIAL SECURITY (PASS) PANEL SURVEY

PASS is a panel survey conducted by the Institute for Employment Research (IAB) in Nuremberg, Germany, and uses both CAPI and CATI to collect labor market, household income, and unemployment benefit receipt data from a nationally representative sample of the German population (covering more than 12,000 households annually). Starting in July 2006, two samples of roughly equal size (about 6,000 households each) had data collected from them annually: a sample of households receiving unemployment benefits (as recorded in registers of the German Federal Employment Agency), or the UB sample, and a sample of households from the general German population with low-income households oversampled, or the GP sample. The UB sample is refreshed each year by a sample of new entries to the UB population, and the GP sample was refreshed in 2010 (the fifth wave of PASS) with a cluster (municipality) sample from the German population register. Response rates (using the RR1 calculation; see AAPOR [2008]) at the household level in waves 4 and 5 ranged from 73.3 to 74.6 percent for panel households, and from 25.1 to 31.5 percent for the refreshment samples (Jesske and Quandt 2011, 2012). An overview of the PASS design can be found in Trappmann et al. (2010).

We analyze CAPI and CATI contacts with the heads of households from the two most recent waves (waves 4 and 5) of the PASS survey separately in this example. Because a given case may have been contacted multiple times in multiple modes by different interviewers, we control for indicators of previous contacts in the other mode when analyzing the contacts specific to each mode and computing PAIP scores for the CAPI or CATI interviewers. In addition to these mixed-mode indicators, we also consider as predictors of response propensity the wave of data collection, the state in Germany where a household is located, a variety of calling-history paradata, and detailed commercial information (from the consumer marketing organization Microm) that describes features of the households and their neighborhoods. Table 5 presents estimates of the coefficients in the discrete-time hazard models7 for each set of PASS contacts.

The results in table 5 are largely consistent with the NSFG and SCA results. The calling-history paradata are the strongest predictors of response propensity, and although certain types of areas may have significantly higher or lower response propensity (especially for CAPI contacts), the utility of the geographic identifiers and the neighborhood information is fairly limited. When removing the German state identifiers and the Microm information from each model, the pseudo-R2 values dropped to 0.235 and 0.112 for CAPI and CATI contacts, respectively, suggesting that this additional auxiliary information adds little predictive power to the models. We also once again see that the model for the in-person CAPI contacts has more predictive power than the model for the CATI contacts.

7. Much like the SCA, information on interviewer experience was available only for a limited (less than 50 percent) subsample of PASS interviewers, as this information was provided in a voluntary interviewer survey. Given the importance of interviewer experience shown in the NSFG analysis, future surveys computing PAIP scores should strongly consider linking data on interviewing experience for all interviewers into the contact-history data sets used to fit the response propensity models.



Table 6 evaluates the propensity model fits for the CAPI and CATI contacts separately, and the results are mostly consistent with those seen for the NSFG and the SCA. The models have good fits to the observed outcomes, and unlike for the SCA, the standard deviations of the interviewer cooperation rates for final CATI contacts decrease as a function of the response propensity quintile (more like the NSFG). The same standard deviations indicate a curious curvilinear relationship for the CAPI contacts, where the most difficult cases and the least difficult cases result in the least variance among interviewers in final contact cooperation rates. These slightly different findings could be due to the mixed-mode nature of the PASS survey or general cultural differences between Germany and the United States.

The correlations of mean predicted response propensities at final contacts and cooperation rates at final contacts for CAPI interviewers (n = 321) and CATI interviewers (n = 214), respectively, were 0.749 and 0.807, indicating strong agreement of cooperation rates with case difficulty in PASS (as expected). The correlations of the PAIP scores with the mean predicted response propensities for CAPI and CATI interviewers were –0.254 (p < 0.001) and 0.110 (p = 0.109), respectively, mirroring the NSFG and SCA results. The PAIP score rewards CAPI interviewers achieving success with more difficult cases in PASS, but this does not necessarily hold for CATI interviewers (presumably due to the lower predictive power of the response propensity model and the lower variation in response propensity for CATI contacts in PASS, much like the SCA). Finally, the correlations of the PAIP scores with cooperation rates at final contacts for CAPI and CATI interviewers were 0.356 and 0.324, respectively, once again indicating that the PAIP score has only about 13 percent of its variance in common with the more commonly used cooperation rate.

Discussion

Cooperation rates obtained by interviewers, based on their assigned cases, are commonly used to evaluate interviewer performance in surveys. If interviewers were assigned sample cases at random, using their cooperation rates to evaluate them would be a fair process. However, interviewers assigned more difficult cases than others are disadvantaged by such a process. The propensity-adjusted interviewer performance indicator proposed in this paper attempts to rectify this inequity by evaluating interviewers based on how their achievement exceeds what was expected for a given contact, given the attributes and prior calling outcomes for an assigned case. Unlike other interviewer evaluation measures that have been proposed in the literature to account for case difficulty, our general methodology can be applied to evaluate interviewers in face-to-face or telephone surveys, and uses a continuous measure of case difficulty based on a response propensity model. Our evaluations of the proposed methodology have shown that it provides a unique measure of interviewer performance relative to other more commonly used evaluation measures, especially in face-to-face surveys.



Many survey organizations already collect the data that we used to model response propensity for nonresponse adjustment purposes. Given an information system that can collect and update information on these auxiliary variables and contact outcomes on a daily basis, these models can be estimated each day from call-record data (organized into a data set of contacts) with standard statistical software capable of fitting logistic regression models. We found that response propensity models fitted to contact history data collected in face-to-face household surveys (the NSFG, and CAPI contacts from PASS) had significantly better fits than models fitted to contact history data collected in telephone surveys (the SCA, and CATI contacts from PASS). Although the in-person survey format enables the collection of a much richer set of paradata, including interviewer observations, we found that these additional observations (including features measured during screening interviews) did not provide a substantial contribution to the overall predictive power of the models.
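Wrapped around the earlier sketches, the daily estimation cycle just described might look like the following (a sketch only; the data frame and column names remain the hypothetical ones used above):

    import pandas as pd
    import statsmodels.api as sm

    def daily_paip_update(contacts: pd.DataFrame, predictors: list) -> pd.Series:
        """Refit the response propensity model on all contacts recorded to
        date and return the current PAIP score for each interviewer."""
        X = sm.add_constant(contacts[predictors])
        fit = sm.Logit(contacts["interview"], X).fit(disp=False)
        p_jc = fit.predict(X)                # updated predictions for all contacts
        d_jc = contacts["interview"] - p_jc  # contact-level deviation scores
        return d_jc.groupby(contacts["interviewer_id"]).mean()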

Cross-survey comparisons: As a check on how robust our findings were, given the additional predictors included in the NSFG model, we attempted to refit the NSFG model using only those predictors from table 1 that were also available (in one form or another) in the SCA contact-history data set (e.g., the previous appointment indicators, counts of previous calls and contacts, and indicators of previous resistance). Table A1 in the online appendix shows the estimated coefficients in this model, and table A2 replicates the NSFG analysis8 presented in table 2. These tables show that the pseudo-R2 value for the substantially reduced NSFG model hardly changed (0.329), and that the predictive power of the resulting model is still quite good. These findings reinforce the need for both in-person and telephone surveys computing PAIP scores to collect paradata that are strongly predictive of response propensity, as the eight calling-history variables collected in both the NSFG and the SCA were the strongest predictors of eventual survey outcomes.

8. Interestingly, the use of only eight calling-history variables in the reduced NSFG response propensity model failed to do a good job of discriminating between contacts with lower response propensities. Attempts to compute quintiles for model evaluation revealed that the 20th and 40th percentiles tended to be identical, due to a large number of equal predicted response propensities at the lower end of the distribution. While this generally is not problematic for computation of the PAIP scores (at the beginning of a data collection, when not a lot of paradata are available, cases will have similar predicted response propensities), this is the reason for the presentation of tertiles for predicted response propensities based on the reduced model in table A2.


Table 5. Estimated Discrete-Time Hazard Model Coefficients (and Standard Errors) for Predictors of Interview Completion at the Next Contact: PASS Waves 4 and 5, CAPI and CATI Contacts Separately

                                                      CAPI contacts        CATI contacts
Predictor variables                                  Estimate     SE      Estimate     SE
Intercept                                            –2.112**   0.166
Wave 5 (relative to wave 4)                           0.144**   0.039    –0.082**   0.025
Paradata: calling history
  Any previous contact in CATI                       –0.262**   0.067
  Any previous contact in CAPI                                            –0.153**   0.053
  Number of prior calls                              –0.016**   0.002    –0.023**   0.001
  Number of prior contacts                           –0.041**   0.014    –0.099**   0.006
  Days since last contact                            –0.032**   0.001    –0.016**   0.001
  Refusal at last contact                            –1.023**   0.185     0.515**   0.133
  Other result (e.g., appointment) at last contact    1.422**   0.084     1.157**   0.070
  Any prior refusal                                  –0.097     0.100    –0.287**   0.062
  Any prior contact                                   1.331**   0.092     0.004     0.074
Sample status (relative to GP sample, wave 1)
  New UB sample, wave 1                               0.138**   0.047     0.374**   0.031
  New UB entries,9 wave 2                             0.059     0.085    –0.025     0.049
  New UB entries, wave 3                              0.074     0.085     0.092*    0.042
  New UB entries, wave 4                              0.098     0.058     0.313**   0.073
  New GP sample, wave 5                              –0.782**   0.069    –0.736**   0.101
  New UB sample, wave 5                              –0.819**   0.056    –1.078**   0.063
  New UB entries, wave 5                             –0.845**   0.065    –1.050**   0.081
German state (relative to Schleswig-Holstein)
  Hamburg                                            –0.069     0.133     0.069     0.108
  Niedersachsen                                       0.004     0.098     0.045     0.072
  Bremen                                             –0.537**   0.164    –0.084     0.135
  Nordrhein-Westfalen                                 0.014     0.093     0.034     0.066
  Hessen                                             –0.123     0.111    –0.051     0.076
  Rheinland-Pfalz                                     0.209     0.128    –0.061     0.084
  Baden-Wurttemberg                                  –0.399**   0.105    –0.009     0.073
  Bayern                                             –0.090     0.099     0.095     0.072
  Saarland                                            0.186     0.159    –0.205     0.128
  Berlin                                             –0.273*    0.116     0.167*    0.077
  Brandenburg                                        –0.013     0.111     0.203*    0.083
  Mecklenburg-Vorpommern                              0.596**   0.140     0.118     0.096
  Sachsen                                             0.071     0.112     0.286**   0.081
  Sachsen-Anhalt                                     –0.095     0.116     0.171*    0.084
  Thuringen                                          –0.373**   0.119     0.219*    0.089
Microm variables
  Average household purchasing power in euros,
    for street segment of household                  <0.001    <0.001    <0.001    <0.001
  Number of privately-owned households in HUA10      –0.002     0.002    –0.001     0.001
  Average age of homeowners in HUA                    0.034**   0.009     0.055**   0.007
  Strength of need for anonymity in HUA               0.002     0.009     0.012     0.007
  Proportion of foreigners in HUA                    –0.004     0.007     0.002     0.006
  Socio-economic status of HUA                       –0.034**   0.011     0.014     0.008
  Total of households moving per 1,000 in HUA        –0.009     0.009     0.008     0.007
Number of contacts                                   30,520               54,779
Nagelkerke pseudo-R2                                  0.240                0.116

*p < 0.05; **p < 0.01

9. "New UB entries" refers to a new sample (in a particular wave) of households that are new recipients of unemployment benefits and were not receiving unemployment benefits in any previous waves (and therefore had no chance of being sampled). "New UB sample" refers to a new refresher sample of unemployment benefit recipients that may have been eligible to be sampled at previous waves. Note that the first refresher samples were selected in wave 5.

10. A housing unit aggregate (HUA) is defined by Microm (for data protection purposes) as a collection of six adjacent single-family homes, or an entire building for multiunit housing facilities. These Microm variables are all assigned codes from 1 (lowest) to 9 (highest), using an ordinal scoring system.
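As a reading aid for Table 5 (and for the pseudo-R2 values quoted in the text), the Nagelkerke pseudo-R2 has the standard definition

\[
R^{2}_{N} \;=\; \frac{1 - \left\{ L(0) / L(\hat{\beta}) \right\}^{2/n}}{1 - L(0)^{2/n}},
\]

where \(L(0)\) is the likelihood of an intercept-only model, \(L(\hat{\beta})\) is the likelihood of the fitted model, and \(n\) is the number of contacts; we state the conventional formula here only for reference.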


If the same calling-history paradata were the strongest predictors in the response propensity models that we fitted to both the in-person and telephone contacts, why did the PAIP score do a better job of distinguishing interviewer performance as a function of case difficulty in the NSFG and PASS-CAPI cases? Examining Tables 1 and 5, it would appear that the relationships of these paradata with the cooperation outcomes in the SCA and PASS-CATI models were consistently smaller in magnitude. For example, establishment of a "hard" appointment with an SCA case increased the odds of a successful interview at the next contact by 281 percent, holding the other predictors constant. In the NSFG, the relative increase in these odds given establishment of a "hard" appointment was 1,154 percent. These differences in the effects of the same predictors across the two data collection modes, which were also noted for the PASS survey, are likely due to inherent differences between the modes. Honoring an appointment for an interview is likely easier when an interviewer has introduced themselves in person and made the effort to visit a person's house, establishing more of a relationship with the potential respondent. In telephone surveys, it is much easier to ignore previous commitments to appointments and simply not answer the phone at the time of the appointment (especially if multiple interviewers have made contact with a household).
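To see how these percentages map back to the logit scale (the coefficients below are back-calculated from the quoted percentages, not new estimates), a coefficient \(\beta\) implies a percent change in the odds of

\[
100\,\bigl(e^{\beta} - 1\bigr),
\]

so the SCA effect of +281 percent corresponds to an odds ratio of roughly \(e^{\beta} \approx 3.81\) (\(\beta \approx 1.34\)), while the NSFG effect of +1,154 percent corresponds to \(e^{\beta} \approx 12.54\) (\(\beta \approx 2.53\)).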



Table 6. Percentages of PASS Contacts (Waves 4 and 5) Resulting in a Completed Interview and Standard Deviations of Cooperation Rates for Final Contacts across Interviewers, by Mode of Contact (CAPI or CATI) and Quintile of Predicted Response Propensity

Columns: quintile of the contact's predicted propensity of an interview; percentage of contacts resulting in an interview / mean predicted response propensity; SD of final contact cooperation rates for interviewers within the quintile (number of interviewers); number of contacts falling into the quintile (number of final contacts).

CAPI (30,520 contacts)
  1 (Low)     4.7% / 0.041    0.1619 (n = 278)     6,104 (3,657)
  2           6.9% / 0.074    0.2658 (n = 282)     6,104 (2,890)
  3          13.0% / 0.131    0.3135 (n = 260)     6,104 (1,925)
  4          36.4% / 0.375    0.2638 (n = 286)     6,104 (3,629)
  5 (High)   62.2% / 0.610    0.2170 (n = 247)     6,104 (4,095)

CATI (54,779 contacts)
  1 (Low)     3.4% / 0.027    0.2394 (n = 163)    10,955 (3,645)
  2           8.3% / 0.085    0.2660 (n = 185)    10,957 (3,213)
  3          14.9% / 0.156    0.2043 (n = 199)    10,955 (2,642)
  4          21.1% / 0.229    0.1717 (n = 205)    10,957 (2,846)
  5 (High)   37.6% / 0.357    0.0785 (n = 208)    10,955 (4,344)


When combined with the reduced variation in response indicators in general in telephone surveys (due to lower response rates), these mode differences likely contributed to the poorer fits of the SCA and PASS-CATI models and the reduced effectiveness of the PAIP scores in these cases. Despite this reduced predictive power, the PAIP scores still captured unique information about interviewer performance relative to the simple cooperation rate. Telephone survey managers can identify interviewers with extreme PAIP scores based on the best possible response propensity model to see if anything unusual is occurring given an interviewer's current workload. Because linked auxiliary information from consumer marketing organizations was not found to substantially improve the fits of the SCA and PASS-CATI models, survey methodologists need to continue to research strong predictors of response propensity in telephone surveys to improve these models (and thus the performance of the PAIP scores).
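One simple way to operationalize this kind of screening (the z-score form and the |z| > 2 cutoff below are illustrative choices on our part, not thresholds recommended by the analyses above) is sketched here:

# Illustrative screen for extreme PAIP scores; hypothetical values.
import pandas as pd

paip = pd.Series(
    [0.02, 0.01, -0.01, 0.00, 0.03, -0.02, 0.01, -0.35],
    index=[f"int{i:02d}" for i in range(1, 9)],
)

z = (paip - paip.mean()) / paip.std()
flagged = paip[z.abs() > 2]     # candidates for supervisor review
print(flagged)                  # here: int08, a large negative score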

Conclusions

The evaluation methodology proposed in this paper measures one aspect of interviewer performance: the ability to obtain cooperation given that contact has been established with a sampled individual. Interviewers need to be evaluated along multiple dimensions (Steinkamp 1964), and an additional skill that could be measured using the proposed methodology is the ability to successfully make contact at a given call attempt. The methods used to compute the PAIP score in this paper could easily be adapted to this situation: the response propensity models would be fitted to a data set of calling attempts, rather than contacts, where the dependent variable would be an indicator of whether contact was established. PAIP scores based on calling attempts would provide an evaluation tool that recognizes whether the cases assigned to an interviewer are more or less difficult to contact, and how successful the interviewer has been at making contact given these expectations. Collectively, PAIP scores that are computed in line with the examples in this paper and scores that measure success at establishing contact (adjusted for the expected difficulty of establishing contact) could be used to more fully evaluate interviewer performance.
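A minimal sketch of this adaptation (hypothetical column names and synthetic data; the modeling machinery is unchanged, only the unit of analysis and the outcome differ) follows:

# Contact-propensity version: fit the model to call attempts, with an
# indicator of whether contact was established as the outcome.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 2000
calls = pd.DataFrame({
    "interviewer_id": rng.integers(1, 41, n),
    "prior_calls": rng.poisson(2, n),
    "evening_call": rng.integers(0, 2, n),   # timing of the attempt
})
true_logit = -0.5 - 0.1 * calls["prior_calls"] + 0.7 * calls["evening_call"]
calls["contact"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

model = smf.logit("contact ~ prior_calls + evening_call", data=calls).fit(disp=False)

# Contact-based score: mean (outcome - predicted contact propensity)
# across each interviewer's call attempts.
calls["deviation"] = calls["contact"] - model.predict(calls)
contact_score = calls.groupby("interviewer_id")["deviation"].mean()
print(contact_score.head())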


Finally, we note that transparency with field staff will be essential for the evaluation of interviewers based on PAIP scores to be effective and accepted in practice. Interviewers may question the importance of collecting paradata and recording call histories, but this paper has demonstrated the importance of these process data in predicting response propensity in both CAPI and CATI data collection modes. If interviewers know that they will (in part) be evaluated based on their performance with respect to the difficulty of their assigned cases, and that the collection of these paradata will help provide the best possible picture of how difficult their cases really are, they may come to understand the importance of collecting high-quality paradata. All interviewers should be made clearly aware of how the PAIP score is computed, and how it is designed to account for the fact that some interviewers may be assigned more difficult workloads than others. This aspect of the PAIP score should appeal to interviewers who are accustomed to being evaluated based on cooperation rates and other related measures, and who accordingly may get discouraged when they are assigned cases that are more difficult than those seen in previous data collections.

Supplementary Data

Supplementary data are freely available online at http://poq.oxfordjournals.org/.

References

Allison, Paul. 2010. Survival Analysis Using the SAS System. 2nd ed. Cary, NC: SAS Publishing.

American Association for Public Opinion Research. 2008. Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys. 5th ed. Lenexa, KS: AAPOR.

Barioux, Maxwell. 1952. "A Method for the Selection, Training, and Evaluation of Interviewers." Public Opinion Quarterly 16:128–30.

Couper, Mick P. 1998. "Measuring Survey Quality in a CASIC Environment." Paper presented at the Joint Statistical Meetings of the American Statistical Association, Dallas.

Couper, Mick P., and Lars Lyberg. 2005. "The Use of Paradata in Survey Research." Proceedings of the 55th Session of the International Statistical Institute.

Durand, Claire. 2005. "Measuring Interviewer Performance in Telephone Surveys." Quality and Quantity 39:763–78.

Fowler, Floyd, Jr. 2008. Survey Research Methods. 4th ed. Thousand Oaks, CA: Sage Publications.

Groves, Robert M., and Mick P. Couper. 1998. Nonresponse in Household Interview Surveys. New York: Wiley.

Groves, Robert M., and Steven G. Heeringa. 2006. "Responsive Design for Household Surveys: Tools for Actively Controlling Survey Errors and Costs." Journal of the Royal Statistical Society Series A 169(3):439–57.

Groves, Robert M., William D. Mosher, James M. Lepkowski, Nicole G. Kirgis, and National Center for Health Care Statistics. 2009. "Planning and Development of the Continuous National Survey of Family Growth." Vital Health Statistics 1(48).

Hox, Joop J., and Edith D. de Leeuw. 2002. "The Influence of Interviewers' Attitude and Behavior on Household Survey Response: An International Comparison." In Survey Nonresponse, edited by R. M. Groves, D. A. Dillman, J. L. Eltinge, and R. J. A. Little, pp. 103–20. New York: Wiley.

Hox, Joop J., Edith D. de Leeuw, and Ita G. G. Kreft. 1991. "The Effect of Interviewer and Respondent Characteristics on the Quality of Survey Data: A Multilevel Model." In Measurement Errors in Surveys, edited by P. P. Biemer, R. M. Groves, L. E. Lyberg, N. A. Mathiowetz, and S. Sudman, pp. 439–61. New York: Wiley.

Jesske, Birgit, and Sylvia Quandt. 2011. "Methodenbericht Panel Arbeitsmarkt und soziale Sicherung PASS * 4. Erhebungswelle – 2010 (Haupterhebung)." FDZ Methodenreport, available at http://fdz.iab.de/342/section.aspx/Publikation/k110418a06.

———. 2012. "Methodenbericht Panel Arbeitsmarkt und soziale Sicherung PASS * 5. Erhebungswelle – 2011 (Haupterhebung)." FDZ Methodenreport, available at http://fdz.iab.de/342/section.aspx/Publikation/k121214302.

Johnson, Timothy P., Young Ik Cho, Richard T. Campbell, and Allyson L. Holbrook. 2006. "Using Community-Level Correlates to Evaluate Nonresponse Effects in a Telephone Survey." Public Opinion Quarterly 70:704–19.

Kennickell, Arthur B. 2006. "Who's Asking? Interviewers, Their Incentives, and Data Quality in Field Surveys." Survey of Consumer Finances Working Paper, available at http://www.federalreserve.gov/pubs/oss/oss2/scfindex.html.

Kreuter, Frauke, and Carolina Casas-Cordero. 2010. "Paradata." Section II.4 in Building on Progress: Expanding the Research Infrastructure for the Social, Economic, and Behavioral Sciences, vol. 1, edited by the German Data Forum (RatSWD). Opladen, Germany, and Farmington Hills, MI: Budrich UniPress Ltd.

Laflamme, François, and Helene St. Jean. 2011. "Proposed Indicators to Assess Interviewer Performance in CATI Surveys." Proceedings of the Survey Research Methods Section of the American Statistical Association, Joint Statistical Meetings, 118–28.

Lepkowski, James M., William D. Mosher, Karen E. Davis, Robert M. Groves, John Van Hoewyk, and National Center for Health Statistics. 2010. "The 2006–2010 National Survey of Family Growth: Sample Design and Analysis of a Continuous Survey." Vital and Health Statistics 2(150), June.

Morton-Williams, Jean. 1993. Interviewer Approaches. Aldershot, UK: Dartmouth Publishing Company.

O'Muircheartaigh, Colm A., and Pamela Campanelli. 1999. "A Multilevel Exploration of the Role of Interviewers in Survey Nonresponse." Journal of the Royal Statistical Society Series A 162(3):437–46.

Singer, Eleanor, Martin R. Frankel, and Marc B. Glassman. 1983. "The Effect of Interviewer Characteristics and Expectations on Response." Public Opinion Quarterly 47:68–83.

Snijkers, Ger, Joop J. Hox, and Edith D. de Leeuw. 1999. "Interviewers' Tactics for Fighting Survey Nonresponse." Journal of Official Statistics 15:185–98.

Steinkamp, Stanley W. 1964. "The Identification of Effective Interviewers." Journal of the American Statistical Association 59:1165–74.

Tarnai, John, and Danna L. Moore. 2008. "Measuring and Improving Telephone Interviewer Performance and Productivity." Chapter 17 in Advances in Telephone Survey Methodology, edited by J. M. Lepkowski, C. Tucker, J. M. Brick, E. D. de Leeuw, L. Japec, P. J. Lavrakas, M. W. Link, and R. L. Sangster. Hoboken, NJ: John Wiley & Sons.

Trappmann, Mark, Stefanie Gundert, Claudia Wenzig, and Daniel Gebhardt. 2010. "PASS: A Household Panel Survey for Research on Unemployment and Poverty." Schmollers Jahrbuch: Zeitschrift für Wirtschafts- und Sozialwissenschaften 130:609–22.

West, Brady T. 2011. "Paradata in Survey Research: Examples, Utility, Quality, and Future Directions." Survey Practice, August, available at http://www.surveypractice.org.

Westat. 2004. Blaise CATI Guide. Available at http://www.westat.com/Westat/pdf/blaise/BlaiseCatiGuide.pdf.

Wiggins, Richard D., Nicholas T. Longford, and Colm A. O'Muircheartaigh. 1992. "A Variance Components Approach to Interviewer Effects." In Survey and Statistical Computing, edited by A. Westlake, R. Banks, C. Payne, and T. Orchard, pp. 243–54. Amsterdam: North-Holland.
