DIAGNOSING MEASUREMENT EQUIVALENCE IN CROSS-NATIONAL RESEARCH

Michael R. Mullen* Florida Atlantic University

Abstract. Many researchers have pointed out that it is necessary to ensure measurement equivalence in cross-national comparative research. Three aspects of measurement equivalence (translation, metric, and calibration equivalence) are necessary to establish the cross-national reliability and validity of items used to measure theoretical constructs. This paper discusses these issues and proposes two recently developed empirical techniques, Multiple Group LISREL and Optimal Scaling, for use in diagnosing cross-national measurement equivalence. These techniques are illustrated by reanalysis of a pioneering U.S. and Japanese study. The two techniques yield convergent results, indicating measurement equivalence for some, but not all, ordinal-level items under consideration. The findings demonstrate that the proposed methods are useful diagnostic tools for exploring measurement equivalence. Several suggestions for reducing the likelihood of problems with measurement equivalence and a number of methods for dealing with items where lack of equivalence persists are also discussed.

INTRODUCTION

An important purpose of research in more than one country is to find similarities and differences among and between a priori known populations. Cross-national comparative research is done in accounting [Reitsberger and Daniel 1990], organization behavior [Harpaz 1990], strategic management [Capon et al. 1987], and marketing [Agarwal 1993], among other fields. The results of these comparisons can have far-reaching implications for international business theory and practice [van Raaij 1978; Sekaran and Martin 1982; Sekaran 1983; Davis, Douglas and Silk 1981; Kotabe and Murray 1990].

Yet there are many well-known problems with conducting cross-national research for comparative purposes (e.g., Green and White [1976]; Boddewyn [1981]; Douglas and Craig [1983]; Albaum and Peterson [1984]; Mullen, Milne and Didow [1995]). Among other issues, the equivalence of constructs, samples and measurement must be addressed in order to ensure cross-national comparability [Mayer 1978; Sekaran 1983; Douglas and Craig 1983; Parameswaran and Yaprak 1987; Bhalla and Lin 1987; Aulakh and Kotabe 1993]. This paper focuses on measurement equivalence.

*Michael Mullen's (Ph.D., University of North Carolina-Chapel Hill) other research interests are international trade's effect on economic development and international marketing. The author wishes to thank Gerald Albaum, Nicholas Didow, Patricia Doney, George Milne, Gregory Osland, William Perreault, Saeed Samiee, and participants in David Ricks' International Business Development research seminar at the University of South Carolina, the AMA Summer Educators Conference session on International Marketing Research, my seminar on International Business Research Issues at Michigan State University, and the Reviewers for helpful comments and suggestions on earlier versions of this paper. I also thank Arne L. Kalleberg and James R. Lincoln for making their data set available for this study.

Received: February 1992; Revised: June 1992, July 1994, January 1995; Accepted: January 1995.

Palgrave Macmillan Journals is collaborating with JSTOR to digitize, preserve, and extend access to Journal of International Business Studies (www.jstor.org).

JOURNAL OF INTERNATIONAL BUSINESS STUDIES, THIRD QUARTER 1995

A fundamental, unresolved issue with multinational research is whether similarities or differences are in fact real [Barksdale and McTier-Anderson 1982]. If results are different than expected (that is, statistical significance is not achieved, items do not load in factor analysis as expected, or reliability assessment is low), researchers (e.g., Adler, Campbell and Laurent [1989]) often question whether measurement problems inherent in international research have attenuated the results,¹ that is, whether the results are measurement and scaling artifacts or true cultural differences. Cross-national researchers must tackle the hard issues of measurement equivalence in order to reduce the threats to measuring reliability and validity (e.g., Adler et al. [1989]; Albaum and Peterson [1984]; Davis, Douglas and Silk [1983]; Nason [1989]; Aulakh and Kotabe [1993]). For instance, Aulakh and Kotabe [1993] recently noted that a major reason for lack of attention to equivalence issues is the insufficiency of existing techniques, and this "methodology issue is one area in need of immediate attention to make international research more rigorous" [Aulakh and Kotabe 1993, p. 24].

The purpose of this paper is to introduce two empirical diagnostic techniques for evaluating measurement equivalence in cross-national research. These recently developed techniques are adapted to address cross-national measurement issues to help researchers know whether similarities or differences in their data are in fact real. The first is a psychometric data analysis technique, alternating least squares optimal scaling (Optimal Scaling), that is useful for exploring scalar equivalence. It allows estimation and comparison, item by item, of the underlying metrics of ordinal measures across countries. The second, analysis of multiple group structural equation measurement models (Multiple Group LISREL), is proposed as a general technique for exploring whether the same measurement model is operating in multiple groups. The empirical methods presented in this paper, taken together with existing approaches, will allow researchers to make a stronger case for the equivalence of their measures.

The balance of the paper is organized as follows. First, measurement equivalence is discussed in some detail, including current techniques for diagnosing measurement equivalence. Next, two new approaches for diagnosing measurement equivalence, Optimal Scaling and Multiple Group LISREL, are introduced. Relevant measurement issues and theory are briefly covered as introductions to both techniques. Third, Optimal Scaling and Multiple Group LISREL are demonstrated with data from a survey of Japanese and U.S. workers, and results from both methods are compared. The discussion section addresses the implications of the findings, limitations of the proposed methods, and recommendations. It also presents several preventative approaches to help reduce problems with measure inequivalence and suggests possible remedies if problems remain. The conclusion section assesses the contribution and points out directions for future research.

MEASUREMENT EQUIVALENCE

Measurement equivalence addresses the question of whether the same models hold across different populations. Douglas and Craig [1983] divide measurement equivalence into three overlapping areas: calibration, translation and metric equivalence.

Calibration and Translation Equivalence

Both calibration and translation equivalence seek to assure that measurement instruments mean the same thing after translation. Translation equivalence also implies that the same questionnaire items measure the same latent constructs in different populations.² For instance, we need to know whether significant differences in parameter estimates of the relationship between or among latent variables may be due to assuming that the variables' measures are scaled the same in both groups. "The situation is analogous to that of studying food weight and its cost in two different countries without considering the differences in currencies and the units of weight. We would expect to find different coefficients relating costs to weight even if the slopes were equivalent after the proper conversion of the measures" [Bollen 1989, p. 356]. In other words, we need to know whether a construct can be measured by the same questionnaire items in different countries (i.e., translation equivalence) and whether the units of measure are the same in different countries (i.e., calibration equivalence).

To establish calibration equivalence, we must correctly convert measures so they are comparable. For example, if temperature is measured with Celsius in one country and Fahrenheit in another, questionnaire items regarding temperature should be carefully converted when the instrument is translated. To establish translation equivalence, researchers should have the original instrument translated to a target language by a bilingual person and back-translated to the original language by another bilingual person. Any discrepancies indicate problems with translation equivalence [Brislin, Lonner and Thorndike 1973; Berry 1980] and must be resolved before proceeding. Once the data have been gathered, it is also necessary to assure that the same items measure the same constructs. Buss and Royce [1975] among others (e.g., Irvine and Carroll [1980]) have argued that it is necessary to establish invariance of factor structures across groups using factor matching procedures before quantitative comparisons are justified. Douglas and Craig [1983] also suggest examining factor patterns for similarity and note that others (e.g., Anderson, Engledow and Becker [1980]) have computed coefficients of factor congruence.
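As an illustration of what a coefficient of factor congruence looks like in practice, the sketch below computes the commonly used Tucker form of the coefficient for one factor. The text does not say which formula the cited authors used, so the choice of Tucker's coefficient is an assumption, and the loadings are hypothetical values invented for the example.

```python
import math


def congruence_coefficient(loadings_a, loadings_b):
    """Tucker-style congruence between two loading vectors for the same factor.

    Assumption: both vectors list loadings for the same items in the same order.
    """
    if len(loadings_a) != len(loadings_b):
        raise ValueError("loading vectors must cover the same items")
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm = math.sqrt(sum(a * a for a in loadings_a) * sum(b * b for b in loadings_b))
    return cross / norm


# Hypothetical loadings for four items in two national samples.
us_loadings = [0.72, 0.65, 0.80, 0.58]
japan_loadings = [0.70, 0.61, 0.77, 0.30]

phi = congruence_coefficient(us_loadings, japan_loadings)
print(f"congruence = {phi:.3f}")
```

Values near 1 indicate proportional loading patterns. The coefficient gives no statistical test, which is one reason the multiple group approach discussed later may be preferred.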

Metric Equivalence

The psychometric properties of data from multiple groups must exhibit the same coherence or structure in order for metric equivalence to exist [Berry 1980]. That is, subjects must respond to the measurement scales in the same way. There are two threats to metric equivalence in cross-national research: inconsistent scoring across populations and scalar inequivalence [Douglas and Craig 1983]. First, respondents in some countries may not be as familiar with various scaling and scoring formats or research methodologies as those in other countries. This may result in inconsistent scoring which poses a threat to the reliability of the measurements [Davis et al. 1981; Douglas and Craig 1983; Parameswaran and Yaprak 1987; Bhalla and Lin 1987]. The second threat to metric equivalence relates to scalar equivalence or response set bias [Prezeworski and Teune 1970; Cunningham, Cunningham and Green 1977; England and Harpaz 1983]. The question is whether the scores obtained from respondents in different countries have the same meaning and interpretation [Douglas and Craig 1983]. The scores may differ due to cultural characteristics, such as social desirability, acquiescence, evasiveness, or humility, all of which may influence respondents' scoring [Vijier and Poortinga 1982]. For example, Lee and Green [1991] question the metric equivalence of their seven-point scales across Korean and U.S. samples because Koreans, who tend to avoid extremes, prefer responding around the mid-points on the scales. Differences in scalar equivalence/response set bias may add systematic error to measurements, threatening their validity for cross-national comparison. An equation relating these issues may be helpful. A basic equation of test theory is

$$x_i = T_i + s_i + e_i,$$

where $x_i$ is the observed variable, $T_i$ is the true score, $s_i$ is systematic variance uncorrelated with the true score, and $e_i$ is the random error term.³ Reliability depends on how much variation in observed scores is attributable to random error ($e_i$) and does not imply validity [Churchill 1979]. For example, repeated measures of weight on a scale may give a reliable but not valid measure if the scale always overstates weight by ten pounds. The overstatement of weight is systematic variance ($s_i$) unrelated to the true score ($T_i$) which makes the scale invalid [Bollen 1989]. The first threat to metric equivalence relates to reliability and random error ($e_i$). For instance, if the subjects in a country are unfamiliar with a scale and are confused by its application, they may respond to it inconsistently, resulting in random error ($e_i$) which threatens the reliability of the scale. The second threat, scalar inequivalence, relates to validity and $s_i$. If the subjects in one country systematically differ in the way they respond to the scale of an item, this systematic variance ($s_i$), unrelated to the true score ($T_i$), would be a threat to the cross-national validity of that scale. It should be noted that measurement equivalence is a necessary but not sufficient condition for establishing the reliability and validity of measures used in cross-national comparative research, because other issues such as construct and sample equivalence must also be addressed.
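The weighing-scale example can be made concrete with a small simulation. This is only a sketch under stated assumptions: the true weights, the noise level, and the seed are invented, and the systematic component is taken to be a constant ten-pound overstatement as in the text.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def weigh(true_scores, bias, noise_sd, rng):
    """Observed score x = T + s + e with constant s and Gaussian e."""
    return [t + bias + rng.gauss(0.0, noise_sd) for t in true_scores]


rng = random.Random(42)                      # fixed seed, arbitrary choice
true_weights = [rng.uniform(120.0, 220.0) for _ in range(2000)]

first = weigh(true_weights, bias=10.0, noise_sd=1.0, rng=rng)
second = weigh(true_weights, bias=10.0, noise_sd=1.0, rng=rng)

test_retest = pearson(first, second)         # high: the scale is consistent
mean_bias = statistics.fmean(first) - statistics.fmean(true_weights)

print(f"test-retest correlation: {test_retest:.4f}")
print(f"average overstatement:   {mean_bias:.2f} pounds")
```

The two weighings agree closely with each other, yet both miss the true weights by about the same amount, which is the sense in which a measure can be reliable without being valid.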

Current Diagnostic Approaches

The international business literature suggests that researchers demonstrate (1) factor structure invariance to assess translation equivalence and (2) equal measurement reliabilities and scalar equivalence to assess metric equivalence. Table 1 summarizes the techniques for diagnosing measurement equivalence. Current methods for comparing factor structures, reliabilities, and scalar equivalence are lacking in several important ways. Those available for exploring translation equivalence include visual examination of factor patterns for similarity [Douglas and Craig 1983] and testing for factor structure congruence [Douglas and Craig 1983; Buss and Royce 1975; Irvine and Carroll 1980], but they are seldom used and have limited value. Visual examination is not precise and a coefficient of congruence is, for various reasons [Gorsuch 1974: pp. 253-54], not a desirable way to determine factor similarity across groups [Buss and Royce 1975].

TABLE 1
Diagnosing Measurement Equivalence

I. Calibration equivalence
  1. Independently check conversions of measurement units

II. Translation equivalence
  1. Translate/back-translate, rework exceptions [Brislin, Lonner & Thorndike 1973]
  2. Visual examination of factor patterns for similarity [Douglas & Craig 1983]
  3. Establish factor structure invariance [Buss & Royce 1975; Irvine & Carroll 1980; Douglas & Craig 1983]
  4. Use Multiple Group LISREL to test measurement models for common form and invariance of factor loadings ($\Lambda_x$) [Bollen 1989; Joreskog & Sorbom 1989]

III. Metric equivalence

  A. Consistent Scoring
    1. Compare reliabilities [Davis, Douglas & Silk 1981; Parameswaran & Yaprak 1987]
    2. Test for equality of measurement error variances ($\Theta_\delta$) with Multiple Group LISREL [Bollen 1989; Joreskog & Sorbom 1989]

  B. Scalar Equivalence
    1. Multiple methods of measurement [Douglas & Craig 1983]
    2. Explore response set bias with Profile Analysis [Morris & Pavett 1992]
    3. Examine underlying metrics for parallel spacing between item categories by Optimal Scaling data
    4. Test for equality of measurement error variances ($\Theta_\delta$) with Multiple Group LISREL [Bollen 1989; Joreskog & Sorbom 1989]


Davis et al. [1981] and Parameswaran and Yaprak [1987] present empirical approaches for diagnosing cross-national differences in reliability. Their research demonstrates that the same scales may have different reliability in different countries, which may confound comparison of results because unreliability may attenuate or accentuate parameter estimates and statistical tests. Parameswaran and Yaprak [1987] emphasize that comparison of findings should only be made after establishing consistent reliability across markets, lest inaccurate conclusions be drawn. They recommend comparing Cronbach's alpha across groups for each measurement item, but point out that there are no statistical tests for doing so and leave this important issue to future research.
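A minimal sketch of the reliability comparison follows, assuming the standard formula for Cronbach's alpha applied to a multi-item scale within each group. The response data are hypothetical, and, as the text notes, the comparison of the two alphas is descriptive only.

```python
import statistics


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [statistics.variance([row[j] for row in rows]) for j in range(k)]
    total_variance = statistics.variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1.0 - sum(item_variances) / total_variance)


# Hypothetical three-item responses from two national samples.
group_a = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [3, 3, 2], [1, 2, 1], [4, 4, 5]]
group_b = [[4, 2, 5], [2, 4, 3], [5, 3, 1], [3, 5, 2], [1, 2, 4], [4, 1, 3]]

alpha_a = cronbach_alpha(group_a)
alpha_b = cronbach_alpha(group_b)
print(f"alpha, group A: {alpha_a:.3f}")
print(f"alpha, group B: {alpha_b:.3f}")
```

A large gap between the two values would suggest inconsistent scoring in one group, but this sketch offers no significance test for the difference.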

Currently, there are two approaches for examining scalar equivalence or response set bias. Multiple methods of measurement [Campbell and Stanley 1966; Berry 1980; Brislin 1980] assess metric equivalence [Douglas and Craig 1983] but are time consuming and expensive [Boddewyn 1981]. This approach is certainly appropriate but is rarely used because of the financial burden and other difficulties of comparative research. Morris and Pavett [1992] suggest Profile Analysis as an additional method for analyzing response set bias. With this approach, the means for each measure of a construct are graphed and connected by lines (see Figure 1). Profile Analysis tests whether the lines in Figure 1 are parallel [Johnson and Wichern 1982; Bray and Maxwell 1985]. Morris and Pavett [1992] argue that a lack of parallelism "suggests that the differences between the two data sets are not caused by systematic response bias." While Profile Analysis may give researchers some insights into response set bias, it does not indicate whether the differences in means between groups are caused by real differences in the variables or by response set bias. Even if the lines in the graph are parallel, the differences in means may not all be due to response bias. Unfortunately, there has been no diagnostic method available to assess cross-national differences in scalar equivalence with confidence.

FIGURE 1
Profile Analysis: "Loyalty" Construct for Two Groups

[Line graph of item means for the items ANYJOB, NOTLYAL, PROUD, STYHERE, VALUES, and WILLWK (horizontal axis: Item), with one line for the American group and one for the Japanese group.]

The data are from the empirical example later in the paper (see Table 3).
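The descriptive core of Profile Analysis can be sketched as follows: compute each group's mean profile and compare the item-to-item changes. The function names and the data are hypothetical, and the sketch stops short of the formal test of parallelism described by Johnson and Wichern.

```python
import statistics


def mean_profile(rows):
    """Mean of each item (column) across respondents (rows)."""
    return [statistics.fmean(col) for col in zip(*rows)]


def segment_changes(profile):
    """Change in mean from each item to the next along the profile."""
    return [b - a for a, b in zip(profile, profile[1:])]


def parallelism_gaps(rows_a, rows_b):
    """Differences between the two groups' segment changes; all zeros means parallel."""
    changes_a = segment_changes(mean_profile(rows_a))
    changes_b = segment_changes(mean_profile(rows_b))
    return [a - b for a, b in zip(changes_a, changes_b)]


# Hypothetical four-item responses; group_b is group_a shifted up by one point.
group_a = [[2, 3, 4, 3], [3, 3, 5, 4], [1, 2, 4, 2], [2, 4, 5, 3]]
group_b = [[score + 1 for score in row] for row in group_a]

gaps = parallelism_gaps(group_a, group_b)
print("gaps between profiles:", [round(g, 6) for g in gaps])
```

Here the profiles are parallel by construction, yet nothing in the data says whether the one-point shift reflects a real difference or a response style, which is the limitation the text points out.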

These limitations in the international research methods literature suggest the need for additional empirical techniques to explore threats to measurement equivalence. The development of a theoretical science is virtually impossible without accurate and reliable measurement of variables [Churchill 1979; Kerlinger 1986]. This paper highlights these issues and fills this gap in the literature.

OPTIMAL SCALING

Relevant Measurement Theory

Business researchers often use ordered categorical (ordinal) scales rather than continuous (interval) scales to measure latent variables, primarily for ease of use by respondents [Perreault and Young 1980; Srinivasan and Basu 1989]. Likert-type scales are the most popular in cross-national research [Albaum and Peterson 1984]; (for recent examples, see Seringhaus, Rolf and Botschen [1991] and Gray, Radebaugh and Roberts [1990]). Although widely used and accepted in U.S. research, Likert scales "may not elicit the same manner of response in other markets" [Onkvisit and Shaw 1989: p. 344]. Prezeworski and Teune [1970] note that cross-national comparisons of means on perceptual measures are hazardous because they are very sensitive to cultural response bias. As Adler [1983] points out, differences in means may be uninterpretable unless measured on equivalent scales.

Most of the statistical techniques in international business research assume interval or higher level measurement. For instance, to compare means, interval or higher level data are preferred [Kerlinger 1986; Lodge 1981]. Strictly speaking, the median should be used as a measure of central tendency with ordinal-level measures [Kerlinger 1986; Churchill 1987] because intervals in such data are not always equal [Jones 1960; Martilla and Carvey 1975; Perreault and Young 1980; Young 1981; Didow, Keller, Barksdale, and Franke 1985; Bechtel and Ofir 1988]. In practice, multiple-item composites provide higher quality measures. A substantial body of literature supporting this practice addresses the reliability and validity of composite measures (e.g., Churchill [1979]; Peter [1979, 1981]). In domestic research, treating ordinal measures as if they were interval scales is generally not a problem. Srinivasan and Basu [1989] show that even if the distances between the scale intervals for a given item are not equal, the resultant loss of information is minimal if the scales are well constructed. However, this assumes that the structure of the underlying metric is equivalent among respondents. The underlying metrics indicate how the subjects, as a group, respond to the scales. The question raised by Srinivasan and Basu's work is whether there is metric equivalence across a priori known populations. If the structure of the underlying metric for an ordinal scale is the same between groups, the current practice of assuming interval measurement for comparative analysis should pose no problem. If the underlying metric differs between populations, however, that difference adds systematic variance ($s_i$) to the observed variable ($x_i$), threatening its cross-national validity. To establish metric equivalence, we must demonstrate that the rank orders and distances between scale values are maintained across national populations and that the measurement error is symmetrical.

Optimal Scaling with PRINCIPALS

Alternating least-squares optimal scaling (Optimal Scaling) examines the underlying metrics of data gathered from different a priori known populations. Optimal Scaling procedures are a family of psychometric techniques developed for the quantitative analysis of qualitative, metric, or mixed metric data [Young, Takane and de Leeuw 1978]. Perreault and Young [1980] introduced Optimal Scaling to business researchers. They describe it as a procedure by which "consistent with a set of measurement restrictions, nominal or ordinal valued variables in the data are transformed to interval scale values" (p. 2).⁴

PRINCIPALS,⁵ a data analysis procedure that is a member of the Optimal Scaling family, is proposed as a technique to examine metric equivalence. It is a general extension of principal component analysis for use with nonmetric or mixed metric data [Young et al. 1978]. Optimal Scaling with PRINCIPALS can be thought of as linearizing all the bivariate scatter plots of the items in a principal components analysis wherein the loading matrix is predetermined by the conceptual model guiding the research. That model is reflected by the set of items being used to measure the latent variable. Given this conceptual model and the initial measurement characteristics (i.e., rank order must be maintained if the data are ordinal), a fitting function iteratively rescales the data to reduce the "stress" between the conceptual model and the scale values, while holding the mean constant. Conditional least-squares estimates of the principal component model parameters are made and then held constant while the measures are optimally rescaled. These newly estimated scale values are then held constant while new conditional parameter estimates for the principal component model are estimated. This iterative procedure results in the best possible interval-level numerical estimate of the underlying metric, given the measurement model and the measure characteristics.
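The rescaling half of one such iteration can be sketched for a single ordinal item. The sketch assumes the model estimates for each respondent are already available from the other half of the iteration, quantifies each category by the mean of those estimates, and then enforces rank order with monotone (pool-adjacent-violators) regression. It is an illustration of the general idea, not the PRINCIPALS program itself, and it omits the normalization that holds the mean constant.

```python
from collections import defaultdict


def pool_adjacent_violators(values, weights):
    """Weighted least-squares fit of a non-decreasing sequence to `values`."""
    blocks = []  # each block: [mean, weight, number of categories pooled]
    for value, weight in zip(values, weights):
        blocks.append([value, weight, 1])
        # Merge backwards while the order restriction is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


def rescale_ordinal_item(responses, model_scores):
    """Map each ordinal category to a scale value that preserves rank order."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, score in zip(responses, model_scores):
        sums[category] += score
        counts[category] += 1
    categories = sorted(sums)
    means = [sums[c] / counts[c] for c in categories]
    weights = [counts[c] for c in categories]
    return dict(zip(categories, pool_adjacent_violators(means, weights)))


# Hypothetical responses on a three-category item and current model estimates.
responses = [1, 1, 2, 2, 3, 3]
model_scores = [-1.0, -0.8, 0.5, 0.3, 0.1, 0.3]
print(rescale_ordinal_item(responses, model_scores))
```

In this toy case the raw category means for categories 2 and 3 are out of order, so the monotone step ties them at a common value rather than letting the rescaled item reverse its ranks.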

It is worth noting that the mean for the rescaled data is the same as that of the raw data. These best estimates of the underlying metrics are dependent on the mean, as opposed to the mean being dependent on the estimated metric. In other words, optimally scaling the data does not address the question of whether the mean is "correct" in any sense; the procedure only estimates the underlying metric given the measurement model and the data characteristics supplied by the researcher. Thus, the rescaled values allow the researcher to address the narrow question of metric equivalence but not absolute value or "truth."

For optimal scaling with PRINCIPALS, the data are partitioned into mutually exclusive and exhaustive subsets made up of the separate populations from which the data are collected. Separate optimal scalings or transformations of the data reveal the structure of the respective underlying metrics in each subset. If the rank orders and distances between scale values are the same across subsets, then scalar equivalence exists. In this way, the comparability, or lack thereof, of measures from different populations can be directly examined.
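Once each subset has its own rescaled category values, the comparison described above amounts to checking rank order and the spacing between adjacent categories. The sketch below does this for one five-category item; the scale values for both groups are hypothetical, and the summary statistic (largest absolute gap in spacing) is an illustrative choice rather than a statistic proposed in the paper.

```python
def category_steps(scaled_values):
    """Distances between adjacent categories, taken in category order."""
    ordered = [scaled_values[c] for c in sorted(scaled_values)]
    return [b - a for a, b in zip(ordered, ordered[1:])]


def largest_step_gap(values_a, values_b):
    """Largest absolute difference in adjacent-category spacing between groups."""
    steps_a, steps_b = category_steps(values_a), category_steps(values_b)
    if len(steps_a) != len(steps_b):
        raise ValueError("groups must use the same number of categories")
    return max(abs(a - b) for a, b in zip(steps_a, steps_b))


# Hypothetical optimally scaled values for one item in each national subset.
us_scaled = {1: 1.0, 2: 2.1, 3: 2.9, 4: 4.0, 5: 5.0}
japan_scaled = {1: 1.0, 2: 1.4, 3: 3.0, 4: 4.6, 5: 5.0}

print("U.S. steps:    ", [round(s, 2) for s in category_steps(us_scaled)])
print("Japanese steps:", [round(s, 2) for s in category_steps(japan_scaled)])
print("largest gap:   ", round(largest_step_gap(us_scaled, japan_scaled), 2))
```

Roughly equal steps in one group alongside compressed end categories in the other would be the kind of pattern that raises doubts about scalar equivalence for that item.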

MULTIPLE GROUP LISREL

Recent developments in structural equation modeling make it possible to analyze and compare the measurement models from multiple groups at the same time [Bollen 1989; Joreskog and Sorbom 1989]. 'Multiple Group LISREL' is appropriate for comparing measurement models from mutually exclusive groups of respondents that are clearly differentiated [Joreskog and Sorbom 1989]. It is a theoretically appealing way to explore whether the respondents relate observed measures to latent constructs the same way in different populations. In structural equation terms (following Joreskog and Sorbom [1989]), the measurement models of a latent variable for a set of groups are defined by the parameter matrices $\Lambda_x^{(g)}$, $\Theta_\delta^{(g)}$, and $\Phi^{(g)}$, where the superscript $(g)$ refers to the $g$-th group, $g = 1, 2, \ldots, G$. With a structural equation measurement model of the form $x = \Lambda_x \xi + \delta$, the measurement properties are the same in all groups, if:

$$\Lambda_x^{(1)} = \Lambda_x^{(2)} = \cdots = \Lambda_x^{(G)}$$

$$\Theta_\delta^{(1)} = \Theta_\delta^{(2)} = \cdots = \Theta_\delta^{(G)}$$

The first line of equalities requires that the $\Lambda_x$ matrices are invariant across both groups. This is a test of the invariance of factor loading and helps diagnose translation equivalence. The second line of equalities asserts that the $\Theta_\delta$ matrices are invariant across both groups. Referring back to the equation from test theory, we have

$$x = T + s + e.$$

According to Bollen [1989], the structural equation measurement model can be written as

$$x = \Lambda_x \xi + s + e,$$

where $s$ is a vector of systematic variance components unrelated to $\xi$ and $e$, $e$ is a vector of errors of measurement, and $\delta = s + e$. Therefore, the test of the invariance of $\Theta_\delta$ across groups is a joint test of the equality of random error variance ($e$) and systematic "error" variance ($s$).⁶ The test of equality of random errors across groups addresses the reliability issue of inconsistent scoring, while the test of equality of systematic "error" variance addresses the validity issue of scalar equivalence. If these measurement invariances hold, then any remaining differences between groups are manifest by differences in the distributions of the latent variables [Joreskog and Sorbom 1989].
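The sense in which this is a joint test can be written out in one step. The derivation below assumes, as the description of $s$ as unrelated to $e$ suggests, that the two components are uncorrelated; the symbols $\Theta_s$ and $\Theta_e$ for their covariance matrices are introduced here for illustration and do not appear in the cited sources.

$$\delta = s + e, \quad \mathrm{Cov}(s, e) = 0 \;\;\Longrightarrow\;\; \Theta_\delta = \mathrm{Cov}(\delta) = \Theta_s + \Theta_e$$

$$\Theta_\delta^{(1)} = \Theta_\delta^{(2)} \;\;\Longleftrightarrow\;\; \Theta_s^{(1)} + \Theta_e^{(1)} = \Theta_s^{(2)} + \Theta_e^{(2)}$$

Under this assumption the restriction is placed on the sum of the two components in each group, so the systematic and random parts are evaluated together rather than one at a time.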

The overall hypothesis for measurement equivalence is that the measurement models are invariant across groups. There is a hierarchy of invariance in such models with two overlapping dimensions: model form and similarity in parameter estimates [Bollen 1989]. Bollen [1989] and Joreskog and Sorbom [1989] recommend developing a hierarchy of tests for use with Multiple Group LISREL depending on the purpose of the analysis.

For comparing measurement models, testing the assumption of invariance of model form (Hform) is the least demanding test [Bollen 1989]. Hform tests whether the number of factors is the same for all groups but does not imply any equality constraints across parameters [Joreskog and Sorbom 1989]. Measurement models have the same form if the model for each group has the same Ax(9), 06(g), and P(g) matrices with the same location of fixed, free and constrained parameters and the same dimensions. Since Hform is the least restrictive [Bollen 1989], it makes little sense to move to the more demanding hypotheses if model fit is poor. Assuming Hform holds, the next hypothesis (HAx) tests whether the coefficients (A) linking the latent to the observed variables are the same in both groups. Assuming Hfor;i and HAS, the last hypothesis (HAxo8) tests the equality of measurement error variances (a) across groups [Bollen 1989]. Table 2 summarizes this hierarchy of hypotheses

TABLE 2 Multiple Group LISREL: Hypotheses Testing Hierarchy for Measurement Equivalence

$H_{form}$: same form - there are two correlated common factors in both groups with a factor pattern of the form:

$$
\Lambda_x =
\begin{bmatrix}
* & 0 \\
* & 0 \\
* & 0 \\
0 & * \\
0 & * \\
0 & * \\
0 & * \\
0 & * \\
0 & *
\end{bmatrix}
$$

$H_{\Lambda_x}$: $\Lambda_x^{(1)} = \Lambda_x^{(2)}$

$H_{\Lambda_x\Theta_\delta}$: $\Lambda_x^{(1)} = \Lambda_x^{(2)}$, $\Theta_\delta^{(1)} = \Theta_\delta^{(2)}$

where superscript (g) refers to the g-th group of mutually exclusive individuals from different countries.

Note: adapted from Joreskog & Sorbom [1989: p. 259] and Bollen [1989: p. 360]

for testing measurement equivalence.7 $H_{form}$ is depicted for two constructs measured with three and six indicators in two groups. $H_{\Lambda_x}$ and $H_{\Lambda_x\Theta_\delta}$ can be tested together or separately, depending on the researcher's interest.

The sample covariance matrices, from each mutually exclusive group, are analyzed simultaneously. Given ordinal level measures such as Likert scales, polychoric correlation matrices and their asymptotic variances and covariances appear to be the most consistent estimators of population correlations [Bollen 1989; Joreskog and Sorbom 1988] and should be used with the weighted least squares fit function (FWLS) [Bollen 1989]; for further details, see Bollen [1989] and Joreskog and Sorbom [1989]. Chi-square difference tests can be used to assess the relative fit of more restricted models because they are nested in less constrained models in the hypotheses testing hierarchy [Bollen 1989].
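As a compact reminder of what the hierarchy constrains, the confirmatory factor model behind these hypotheses can be sketched in common LISREL notation. The symbols $\xi$ (latent variables), $\delta$ (measurement errors) and $\Sigma$ (implied covariance matrix), and the subscripts r and u for the more and less restricted models, are notation assumed here for illustration rather than taken from the text above.

```latex
% Measurement model and implied covariance matrix in group g = 1, 2
x^{(g)} = \Lambda_x^{(g)} \xi^{(g)} + \delta^{(g)}, \qquad
\Sigma^{(g)} = \Lambda_x^{(g)} \Phi^{(g)} \Lambda_x^{(g)\prime} + \Theta_\delta^{(g)}

% Chi-square difference test for a restricted model (r) nested in a
% less restricted model (u)
\chi^2_{diff} = \chi^2_{r} - \chi^2_{u}, \qquad
df_{diff} = df_{r} - df_{u}
```

Each step down the hierarchy adds equality constraints across groups, so the restricted model can never fit better than the one it is nested in; the difference test asks whether the loss of fit is larger than chance would suggest.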

AN EXAMPLE

The Data

To demonstrate these new approaches to assessing measurement equivalence, data are used from a large survey of U.S. and Japanese workers by Lincoln and Kalleberg [1985]. The study measured the concepts of satisfaction and organizational commitment (loyalty) from a job or work-related perspective. The three satisfaction items use three-category scales, and the six organizational commitment items use five-category Likert scales. Lincoln and Kalleberg compared the samples on many dimensions and found them to be reasonably comparable in terms of participating firms and employees. The Japanese (n=3,735) and U.S. (n=4,567) samples are each randomly subdivided into mutually exclusive subsamples of 400 respondents each for this example.
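The random subdivision into mutually exclusive subsamples can be sketched as follows. This is a minimal illustration, not the procedure actually used for the study: the function name `split_into_subsamples`, the fixed seed, and the choice to drop respondents left over after the last full subsample are all assumptions.

```python
import random


def split_into_subsamples(n_respondents, size=400, seed=0):
    """Randomly partition respondent indices into disjoint subsamples.

    Respondents left over after the last full subsample are dropped
    (an assumption; the text does not say how leftovers were handled).
    """
    rng = random.Random(seed)
    indices = list(range(n_respondents))
    rng.shuffle(indices)
    n_full = n_respondents // size
    return [indices[i * size:(i + 1) * size] for i in range(n_full)]


# Sample sizes reported in the text.
japan_subsamples = split_into_subsamples(3735)
usa_subsamples = split_into_subsamples(4567)
print(len(japan_subsamples), len(usa_subsamples))
```

Because each index appears in at most one subsample, analyses run on different subsamples do not share respondents.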

Optimal Scaling Analysis and Results

The PRINCIPALS analysis was done with recently available software developed at the SAS Institute [1988],8 using Kruskal's [1964] secondary least-squares monotonic transformation as recommended by Perreault and Young [1980] and Young [1981] for categorical, ordinal-level data. For each subsample, the satisfaction and loyalty scales were analyzed separately with the PRINCIPALS algorithm, resulting in independent best estimates of the underlying interval metric for each response category of each item. The process for conducting the Optimal Scaling procedure developed in this paper is outlined in Table 3.
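The monotone step in this kind of transformation can be illustrated with a small sketch. The code below is not the PRINCIPALS algorithm or the SAS implementation; it shows only a weighted least-squares monotone (nondecreasing) fit over ordered response categories, in which every observation in a category receives the same value. The function name `monotone_fit` and the category scores and counts are hypothetical, chosen for illustration.

```python
def monotone_fit(category_means, category_counts):
    """Weighted least-squares nondecreasing fit via pool-adjacent-violators.

    category_means: provisional score for each ordered response category.
    category_counts: number of respondents in each category (the weights).
    Returns one fitted value per category, in category order.
    """
    blocks = []  # each block: [pooled mean, total weight, categories pooled]
    for value, weight in zip(category_means, category_counts):
        blocks.append([float(value), float(weight), 1])
        # Pool neighbouring blocks while they violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, k2 = blocks.pop()
            m1, w1, k1 = blocks.pop()
            pooled = (m1 * w1 + m2 * w2) / (w1 + w2)
            blocks.append([pooled, w1 + w2, k1 + k2])
    fitted = []
    for pooled_mean, _, n_categories in blocks:
        fitted.extend([pooled_mean] * n_categories)
    return fitted


# Hypothetical provisional scores for a five-category item. The first three
# categories are out of order, so the fit pools them into a single value.
means = [2.9, 2.4, 2.6, 3.2, 6.0]
counts = [10, 10, 10, 40, 30]
print(monotone_fit(means, counts))
```

When neighbouring categories are pooled in this way they receive an identical scaled value, which is one way several raw categories can end up sharing the same optimally scaled value.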

Typical optimally scaled values (OSVs) for the U.S. and Japanese samples are shown in Table 4. The raw, or original, ordinal categories are in the left column. The transformed, or optimally scaled, values are listed under the acronym for each item. The OSVs for each item can now be compared for each raw category. The pattern exhibited by the OSVs for the NOTLOYAL item in

TABLE 3 Guide to Optimal Scaling Procedure

1st. Specify your measurement model. You need to know which items measure each construct. In this example, we have two constructs, job satisfaction and loyalty, with three and six items respectively (see Table 3).

2nd. Separate the data by country into mutually exclusive data sets. We divided the Japanese and American subjects into separate data sets so they can be examined independently.

3rd. Optimally scale the data separately for each latent variable and each country. There are three items measuring satisfaction. PRINCIPALS is used to optimally scale response categories for those three items for Japanese and American data sets separately. Then the six items measuring the loyalty construct are optimally scaled separately for each country.

4th. Examine the results. Compare the Japanese optimally scaled values (OSVs) to the U.S. OSVs for each item. We find it useful to graph the transformed scales (OSVs) from both countries on the vertical axis and the raw scores on the horizontal axis. A separate graph is helpful for each item. This can be done manually, with a spreadsheet, or with SAS. If these lines are parallel (see graphs of job satisfaction items in Figure 2), then the response patterns are similar and exhibit scalar equivalence. If the lines are not parallel (see graphs of loyalty items in Figure 3), then the response patterns appear different between the countries. That is, the underlying metrics are not the same and scalar equivalence does not appear to exist.

TABLE 4 Optimally Scaled Data

Loyalty

U.S.A., N = 400

Raw   Tanyjob   Tnotloyal   Tproud   Tstyhere   Tvalues   Twillwk
1     2.29      2.65        2.77     1.66       2.16      2.70
2     2.29      2.65        2.77     1.66       2.16      2.70
3     2.58      2.65        2.93     3.02       2.73      2.95
4     3.49      3.19        3.62     3.35       3.69      3.76
5     6.10      6.00        5.51     6.06       6.45      5.28

Japan, N = 400

Raw   Tanyjob   Tnotloyal   Tproud   Tstyhere   Tvalues   Twillwk
1      .74       .83         .36      .60        .53       .62
2     2.12      2.24        2.07     2.40       2.31      2.01
3     3.01      2.84        3.20     3.14       3.05      3.24
4     4.11      4.06        3.97     3.74       3.87      3.92
5     4.49      4.94        4.80     4.53       3.87      4.88

Job Satisfaction

      U.S.A., N = 400                   Japan, N = 400
Raw   Tjbchoic   Tjbrec   Tsizeup      Tjbchoic   Tjbrec   Tsizeup
0      .30        .09     -.21          .10        .20     -.01
1      .85        .92     1.11          .79        .80     1.02
2     2.02       2.01     1.89         2.11       2.18     1.88

FIGURE 2 Graphic Analysis of Metric Equivalence - Satisfaction Items

[Figure not reproduced. Panels: JBCHOICE, JOBREC, SIZEUP. Horizontal axis: raw scores; vertical axis: optimally scaled values. Separate lines for the American and Japanese samples.]

FIGURE 3 Graphic Analysis of Metric Equivalence - Loyalty Items

[Figure not reproduced. One panel per loyalty item; legible panel titles include STYHERE, VALUES and WILLWK. Horizontal axis: raw scores; vertical axis: optimally scaled values. Separate lines for the American and Japanese samples.]

the U.S. sample provides useful insight. The raw response categories of 1, 2 and 3 all rescaled to the OSV of 2.65. The "strongly disagree," "disagree" and "neither agree or disagree" response categories all collapsed into the same OSV. The U.S. subjects apparently do not discriminate among these response categories.

Graphic analyses of metric equivalence by item are presented in Figures 2 and 3. The results indicate that the responses on the loyalty items do not exhibit scalar equivalence across the two nations and are not directly comparable. The responses to the satisfaction items, however, exhibit similar underlying metrics and are therefore metrically equivalent.
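The visual comparison can be supplemented with a rough numerical summary. The sketch below takes the optimally scaled values from Table 4 for two items and computes, for each raw category, the vertical gap between the U.S. and Japanese lines; if the lines were parallel, the gap would be roughly constant. The helpers `category_gaps` and `gap_spread` are hypothetical illustrations, not part of the procedure in Table 3, and the spread is a descriptive number rather than a statistical test.

```python
# Optimally scaled values from Table 4, listed by raw response category.
OSV = {
    "JBCHOICE": {"us": [0.30, 0.85, 2.02],
                 "japan": [0.10, 0.79, 2.11]},
    "NOTLOYAL": {"us": [2.65, 2.65, 2.65, 3.19, 6.00],
                 "japan": [0.83, 2.24, 2.84, 4.06, 4.94]},
}


def category_gaps(us_values, japan_values):
    """Vertical distance between the two lines at each raw category."""
    return [u - j for u, j in zip(us_values, japan_values)]


def gap_spread(us_values, japan_values):
    """Range of the gaps; near zero when the two lines are parallel."""
    gaps = category_gaps(us_values, japan_values)
    return max(gaps) - min(gaps)


for item, values in OSV.items():
    spread = gap_spread(values["us"], values["japan"])
    print(f"{item}: gap spread = {spread:.2f}")
```

For these two items the spread is much smaller for the satisfaction item than for the loyalty item, in line with the graphic analysis. How small a spread counts as parallel remains a matter of judgment, and the two scales differ in their number of categories.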

Multiple Group LISREL Analysis and Results

The inputs to Multiple Group LISREL were polychoric correlation matrices and their asymptotic variances and covariances formed from separate Japanese and U.S. data sets with PRELIS software [Joreskog and Sorbom 1988]. The Multiple Group analysis was done with LISREL VII software [Joreskog and Sorbom 1989] using the FWLS fitting function. Observed variables were declared ordinal, with a maximum of five categories in the scales. The hierarchy of hypotheses (see Table 2) was the basis for analysis.9 The first hypothesis, $H_{form}$, was tested with both constructs simultaneously because it tests whether there are two correlated common factors with the pattern of fixed and free parameters shown in Table 2. Table 5a shows the results. The resulting $\chi^2$ = 75.76 with 52 degrees of freedom (df) is significant at the .02 level. While the statistical significance raises concerns about the fit of the model to the data, the $\chi^2$ statistic is a poor measure of model fit (e.g., see Fornell [1983]; Bollen [1989]). The $\chi^2$ appears reasonable given a sample size of 800, since $\chi^2$ equals N - 1 times the minimized fit function. Several indices of model fit also indicate a reasonable fit (i.e., Bentler and Bonett's [1980] $\Delta_1$ = .98, Bollen's [1986] $\rho_1$ = .97, and McDonald and Marsh's [1990] RNI = .99). The results provide evidence supporting the hypotheses of equivalent model form for both groups for both constructs.
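For readers less familiar with the indices cited above, their usual definitions are sketched below. The subscripts b (baseline or null model) and m (maintained model) and the symbol $F_{min}$ (minimized fit function) are notation assumed here, not taken from this paper. The baseline $\chi^2$ for the example is not reported, so the indices cannot be recomputed from the published figures.

```latex
% Test statistic as a function of sample size
\chi^2 = (N - 1)\, F_{min}

% Bentler and Bonett's normed fit index
\Delta_1 = \frac{\chi^2_b - \chi^2_m}{\chi^2_b}

% Bollen's index, which adjusts for degrees of freedom
\rho_1 = \frac{\chi^2_b / df_b - \chi^2_m / df_m}{\chi^2_b / df_b}

% McDonald and Marsh's relative noncentrality index
RNI = \frac{(\chi^2_b - df_b) - (\chi^2_m - df_m)}{\chi^2_b - df_b}
```

Because $\chi^2$ grows with N for a given value of the fit function, a statistically significant $\chi^2$ in a sample of 800 can coexist with index values close to 1.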

The second hypothesis, $H_{\Lambda_x}$, tests whether factor loadings are invariant simultaneously for both latent variables across Japanese and U.S. data. The $\chi^2$ = 137.86 with 59 degrees of freedom (df) has a probability of .00, indicating poor fit of this model to the data. A $\chi^2$ difference test was used to test this hypothesis because the hierarchy of hypotheses contains nested models. The $\chi^2_{diff}$ = 61 with 7 df has a probability of .00, which is statistically significant ($\alpha$ = .05). The assumption that the $\Lambda$s are invariant for job satisfaction and loyalty for the Japanese and U.S. data is not supported, and $H_{\Lambda_x}$ is rejected. Because both constructs were included, however, this analysis does not tell us whether the measurement inequivalence is for one or both constructs. Therefore, the measurement models for job satisfaction and loyalty need to be examined separately.

TABLE 5 Results: Diagnosing Measurement Equivalence with Multiple Group LISREL

a. Job satisfaction and loyalty simultaneously

Hypotheses                      $\chi^2$   prob.   df    $\chi^2_{diff}$   df    prob.
$H_{form}$                      75.76      .02     52
$H_{\Lambda_x}$                 137.86     .00     59    61.33             7     .00

b. Job satisfaction and loyalty separately

Hypotheses                      $\chi^2$   prob.   df    $\chi^2_{diff}$   df    prob.
$H_{form}$
  Job Satisfaction              1.59       .21     1
  Loyalty                       9.53       .48     10
$H_{\Lambda_x}$
  Job Satisfaction              4.31       .12     2     2.72              1     .10
  Loyalty                       141.78     .00     29    131.25            19    .00
$H_{\Lambda_x\Theta_\delta}$
  Job Satisfaction              6.06       .30     5     1.75              3     .63

The above test of $H_{form}$ demonstrated common factor structure for both constructs across both groups. Next, $H_{form}$ was analyzed for each construct separately to set a benchmark for $\chi^2$ difference tests of subsequent hypotheses. The results are shown in Table 5b. For job satisfaction, the $\chi^2$ (1.59, 1 df) is insignificant (p=.21), and model fit is excellent (Bentler and Bonett's [1980] $\Delta_1$ = .99, Bollen's [1986] $\rho_1$ = .99, and McDonald and Marsh's [1990] RNI = .99). For loyalty, the $\chi^2$ (9.53, 10 df) is also insignificant (p=.48) and model fit is also excellent (Bentler and Bonett's [1980] $\Delta_1$ = .99, Bollen's [1986] $\rho_1$ = .98, and McDonald and Marsh's [1990] RNI = 1.00). The insignificant $\chi^2$s and indices of model fit for each construct corroborate the earlier result supporting $H_{form}$ and provide a basis for evaluating subsequent hypotheses.

For $H_{\Lambda_x}$, the $\chi^2$ for job satisfaction is 4.31 (2 df) which is insignificant, demonstrating good model fit. The $\chi^2_{diff}$ (2.72, 1 df) is not significant, providing support for the hypothesis of invariant factor loadings. For loyalty, however, the $\chi^2$ (141.78, 29 df) is significant, indicating poor fit of the model to the data. The $\chi^2_{diff}$ (131, 19 df) is highly significant, providing evidence that the invariance constraints on the $\Lambda$s across groups have a deleterious effect on model fit. $H_{\Lambda_x}$ must be rejected for the loyalty construct, indicating that the same measurement model does not hold for loyalty in the Japanese and U.S. data sets.

Therefore, the third hypothesis, $H_{\Lambda_x\Theta_\delta}$, is tested for job satisfaction only. The $\chi^2$ (6.06, 5 df) is insignificant and signifies an excellent fit of the model to the data from both groups. The $\chi^2_{diff}$ (1.75, 3 df) is insignificant, providing evidence that $H_{\Lambda_x\Theta_\delta}$ holds for the job satisfaction construct. These findings provide strong evidence that the measurement model for job satisfaction is the same in these two samples.
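The chi-square difference tests for job satisfaction can be retraced from the values in Table 5b. The sketch below uses only the Python standard library; the helper names `chi_square_sf` and `chi_square_difference` are hypothetical, and the tail probability is computed with the finite series for integer degrees of freedom given in Abramowitz and Stegun (26.4.4 and 26.4.5) rather than with LISREL.

```python
import math


def chi_square_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square with integer df."""
    if x <= 0:
        return 1.0
    chi = math.sqrt(x)
    # Standard normal density evaluated at chi.
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= x / (2.0 * r)
            total += term
        return math.sqrt(2.0 * math.pi) * density * total
    term, total = chi, 0.0
    for r in range(1, (df + 1) // 2):
        total += term
        term *= x / (2.0 * r + 1.0)
    return math.erfc(chi / math.sqrt(2.0)) + 2.0 * density * total


def chi_square_difference(chi2_restricted, df_restricted, chi2_free, df_free):
    """Difference test for a restricted model nested in a freer model."""
    diff = chi2_restricted - chi2_free
    df_diff = df_restricted - df_free
    return diff, df_diff, chi_square_sf(diff, df_diff)


# Job satisfaction rows of Table 5b: (chi-square, df) for each hypothesis.
h_form = (1.59, 1)
h_lambda = (4.31, 2)
h_lambda_theta = (6.06, 5)

for label, restricted, free in [
    ("loadings vs. form", h_lambda, h_form),
    ("loadings and errors vs. loadings", h_lambda_theta, h_lambda),
]:
    diff, df_diff, p = chi_square_difference(*restricted, *free)
    print(f"{label}: diff = {diff:.2f}, df = {df_diff}, p = {p:.2f}")
```

Both differences (2.72 with 1 df and 1.75 with 3 df) give tail probabilities that agree with the .10 and .63 reported in Table 5b.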

The results indicate that measurement equivalence exists for the job satisfaction construct but not for loyalty in the Japanese and U.S. data sets. The findings are parallel to those obtained from Optimal Scaling. The latter indicated that job satisfaction was metrically equivalent across the samples, while loyalty was not.

DISCUSSION

Summary of Findings

The empirical example produced convergent evidence of measurement equivalence for one construct, job satisfaction, but not for loyalty. The findings of the first analysis demonstrate the usefulness of Optimal Scaling for exploring scalar equivalence as it allows researchers to look at the underlying metrics of the same items across populations. The findings from the second analysis show that Multiple Group LISREL is useful for determining whether the same measurement model is operating in different populations. The questionnaire items for the satisfaction latent variable exhibited scalar equivalence with Optimal Scaling and had the same overall measurement model with Multiple Group LISREL. Therefore, the means and other measures of association calculated from these items would be comparable, assuming that other issues relevant to cross-national research were adequately addressed. In contrast, the items making up the loyalty construct were shown by Optimal Scaling to have very different structures in underlying metrics for the Japanese and U.S. respondents. Similarly, Multiple Group LISREL indicated that the same measurement model was not operating in the data from both populations. In sum, the loyalty items are not scalar equivalent and do not exhibit parallel measurement models. As such, they are not reliable and valid measures for cross-cultural comparisons between these populations.

Appropriate Use and Limitations

Both Optimal Scaling and Multiple Group LISREL require multiple measures of constructs in order to diagnose measurement equivalence, but this limitation is not as serious as it might first appear. "Hard" variables, such as demographics, do not pose much risk in terms of reliability [Davis et al. 1981] and can be measured with single items. Measurement equivalence is a more serious problem with "soft" or conceptual constructs, and we know that these latent variables should be measured with multiple items whenever possible [Jacob 1978; Churchill 1979], which makes the proposed methods suitable for use with well-designed research. Unfortunately, researchers using single-item measures of latent variables still must address measurement equivalence using traditional approaches, such as test/retest [Davis et al. 1981] or multiple methods of measurement [Campbell and Stanley 1966; Berry 1980; Brislin 1980].

An important limitation to Optimal Scaling is that there are no statistical tests to determine whether optimally scaled values (OSVs) are equivalent. While the application of this method is straightforward, the interpretation of results depends on the judgment and experience of researchers. The heuristic method recommended is that researchers graph the results (see Figures 2 and 3) by item. If the plots of the underlying metrics from the different populations are parallel, then the measures exhibit scalar equivalence. It is worth noting that the distance between the lines is not important; rather the spacing should remain somewhat constant (i.e., the lines should not cross or veer apart). For stable results, samples should be greater than twenty-five observations and fifty or more observations are preferred [Didow and Mullen 1992].

The data requirements for Multiple Group LISREL can be a limitation. FWLS is computationally intensive. Sample size requirements may be larger for convergence with FWLS than with other fit functions [Bollen 1989], as little is known about small or medium sample analysis. In general, LISREL sample size requirements depend on the number of parameters to be estimated (for discussions, see Bagozzi [1980]; Bollen [1989]). If a small sample results in convergence problems, it may be necessary to diagnose measurement equivalence one construct at a time in order to estimate fewer parameters simultaneously. However, if items are measured using scales with five or more points, then Pearson correlations and the FML fit function may produce satisfactory results with less computational difficulty and smaller samples [Bollen 1989]. Missing data may also create a problem with LISREL. Pairwise deletion of cases with missing values results in unstable parameter estimates and standard errors. Listwise deletion provides reliable results only if the data are missing at random; otherwise, the results might be biased.

The methods presented here allow researchers to discover and demonstrate measurement equivalence where it exists. This positive finding allows researchers to eliminate an important rival hypothesis for their results - namely that the results are measurement artifacts. Yet, the limitations of these methods should be acknowledged. Neither previous work (such as Davis et al. [1981]; Parameswaran and Yaprak [1987]) nor this study offers a remedy which facilitates direct comparison of means when threats to reliability and validity remain. Furthermore, even if researchers can demonstrate measurement equivalence, cross-national comparisons will be reliable and valid only if other issues discussed in the literature such as construct and sample equivalence have been adequately addressed.

Recommendations for Diagnosis

Rather than some combination of other techniques, Multiple Group LISREL is, in general, the preferred approach for diagnosing measurement equivalence. It offers a more comprehensive diagnosis of measurement equivalence than the other methods and there are statistical tests of hypotheses. Nonetheless, LISREL requires larger samples than Optimal Scaling or the other methods currently used. This is especially problematic for pre-tests. To diagnose measurement equivalence with small samples, a combination of the other techniques is appropriate (see Table 1). Translation equivalence can be evaluated by comparing factor structures with factor matching procedures or by visually checking the pattern for apparent invariance. Cronbach's alpha can be calculated for each group and compared to check for differences in the consistency of scoring, and the data can be Optimally Scaled to diagnose scalar equivalence. These steps can help researchers at the pre-test stage identify questionnaire items that appear comparable and those that do not so that items can be added to or dropped from the instrument before the final study. For the final analysis of measurement equivalence, researchers should use Multiple Group LISREL when possible.
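The pre-test comparison of Cronbach's alpha across groups can be sketched as follows. The function `cronbach_alpha` implements the usual formula from item variances and the variance of the summed scale; the two small response matrices are hypothetical and serve only to show the mechanics.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(responses[0])
    item_variances = [
        variance([row[j] for row in responses]) for j in range(n_items)
    ]
    total_variance = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical pre-test responses to a three-item scale in two countries.
country_a = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 4]]
country_b = [[1, 4, 2], [2, 1, 5], [3, 5, 1], [4, 2, 4], [5, 3, 3]]

print(f"alpha, country A: {cronbach_alpha(country_a):.2f}")
print(f"alpha, country B: {cronbach_alpha(country_b):.2f}")
```

A marked difference in alpha between groups would suggest that the items are scored less consistently in one country, which is the signal this step is meant to pick up.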

Before the Data Are Gathered

Whenever latent variables are studied in more than one country or culture, measurement equivalence is a potential problem. There are several ways to minimize inequivalence and data unreliability. Extra attention can be given to describing and anchoring the scales [Zhang and Dadzie 1991]. Examples may also help subjects respond to the scales in the same way. Sekaran [1983] advises adopting the same data collection procedures in each country, and Sekaran and Martin [1982] recommend using the same researcher. Bhalla and Lin [1987] suggest generating questionnaires with two or three times the required number of items and then using the traditional method of purification, deletion of unreliable items. An alternative to Likert scales is magnitude scaling, which provides ratio data [Lodge 1981]. It allows researchers to avoid some of the problematic assumptions required with ordinal data and to use powerful statistical techniques with more reliability and confidence.

After the Data Are Gathered

There are several proposals for dealing with data when metric equivalence is still a concern. Cunningham et al. [1977] recommend ipsatization,10 but they note that "in order to apply the ipsative measure, the researcher must have an a priori reason to suspect response set differences between the different samples" [1977, p. 383]. Optimal scaling can help researchers investigate response set bias before application of the ipsative measure. Douglas and Craig [1983] also recommend standardizing or normalizing the data to achieve some degree of equivalence. These approaches are useful in that they remove scaling factors from the measurements so that the relationships among the variables in one country may be compared with the relationships in another country. Yet, these remedies make direct mean difference comparisons meaningless. Kotabe, Duhan, Smith, and Wilson [1991] use a variation of this approach. They standardize the data by respondent rather than by variable, which produces a relative mean for each population on each item.11 These standardized means are used for relative, rather than absolute, comparison of the variables. Bhalla and Lin [1987] suggest adjusting scores from different countries with different conversion rate coefficients based on experience. For instance, West Germans are less likely to overstate their purchase intentions than Italians and Spaniards. Therefore, intention to purchase scores for Italians and Spaniards should be adjusted downward relative to Germans. If researchers discover that their measures do not have similar reliabilities cross-nationally, Davis et al. [1981] recommend (and demonstrate) making adjustments for differences in reliability.
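Standardizing by respondent, as described above and in note 11, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of Kotabe et al.: the function names are hypothetical, the population standard deviation is used within each row, respondents with no variation across items are set to zero, and the data are invented.

```python
from statistics import mean, pstdev


def standardize_by_respondent(matrix):
    """Standardize each row (respondent) to mean 0 and unit standard deviation."""
    standardized = []
    for row in matrix:
        row_mean, row_sd = mean(row), pstdev(row)
        if row_sd == 0:
            # Identical answers to every item carry no relative information;
            # such rows are set to zero here (an assumption).
            standardized.append([0.0] * len(row))
        else:
            standardized.append([(v - row_mean) / row_sd for v in row])
    return standardized


def relative_item_means(matrix):
    """Average the row-standardized scores by item (column)."""
    z = standardize_by_respondent(matrix)
    return [mean(column) for column in zip(*z)]


# Hypothetical data: country B answers one scale point higher on every item.
country_a = [[1, 3, 5], [2, 3, 5], [1, 2, 4]]
country_b = [[value + 1 for value in row] for row in country_a]

print(relative_item_means(country_a))
print(relative_item_means(country_b))
```

A uniform upward shift in responding leaves the relative means unchanged, which is why the approach supports relative comparisons of the variables but not direct comparisons of absolute means.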

CONCLUSION

Future Research

An interesting issue raised by the empirical example stems from the finding that the three-category scales (job satisfaction) were more equivalent than the five-category scales (loyalty), even though the latter would seem preferable from an item design perspective.12 Do scales with fewer categories pose fewer equivalence problems? Is it more difficult to systematically bias scales with fewer categories because choices are constrained? If so, it would create a trade-off between designing items to capture variance more accurately and designing items less susceptible to response bias. This merits more research. The practice suggested by Bhalla and Lin [1987] of adjusting scores from different countries based on experience is helpful only if the evidence is available. Their work is based on proprietary data. Therefore, meta-analyses of national and/or cultural effects could be extremely helpful in identifying both systematic and random error components.

Developing statistical tests to interpret the results from Optimal Scaling also would be an important contribution to comparative research (for one approach, see Mullen, Milne and Didow [forthcoming 1995]) as would the development of a process to rescale data gathered from several cultures so they would be directly comparable. Another important contribution would be to compare and evaluate the different methods proposed for dealing with data that have metric equivalence problems. For instance, data with known psychometric properties could be generated so that standardization (by variable and by respondent), normalization, and ipsatization could be compared and contrasted. In addition, the extension of the cross-national literature on equivalence to other areas of comparative investigation (such as husbands and wives, or blacks, whites and hispanics) might make an important contribution.

CONTRIBUTION

International business researchers often seek to compare findings from different countries, and potential lack of measurement equivalence threatens the reliability and validity of the results. This paper introduced Optimal Scaling and Multiple Group LISREL to help diagnose measurement equivalence. The findings from the empirical example demonstrated that the proposed methods are useful diagnostic tools. Future studies will gain from their use because they are much less time consuming and expensive than traditional approaches such as test/retest and multiple methods of measurement. Also discussed were several possible ways to reduce the likelihood of problems with measurement equivalence and to deal with those problems when they persist.

NOTES

1. Measurement error can also lead to enhanced correlations under certain circumstances, although few researchers question such a problem.

2. Translation and construct equivalence are closely related, but it is useful to distinguish a concept from its operational measures.

3. Some researchers (e.g., Bollen [1989]) consider si to be part of ri while others (e.g., Churchill [1979]) consider it to be part of ei.

4. For an excellent overview of the methodological and theoretical foundations of Optimal Scaling, see Young [1981].

5. Didow, Perreault and Williamson [1983] and Didow et al. [1985] demonstrated the use of PRINCIPALS. The first study used it in the context of evaluating mixed metric and nonmetric data and "don't know" responses. The second study used it to improve the measurement quality of data by substituting rescaled values in lieu of raw values. For instance, Cronbach's alpha is shown to be monotonically higher when optimally scaled values are substituted for original response categories.

6. Earlier, s was described as systematic variance unrelated to true score. This is the same systematic variance but is referred to as systematic "error" variance because it is included in the measurement error variance, S.

7. Bollen [1989] and Joreskog and Sorbom [1989] note that this process can be continued to test whether the covariances of the latent variables are the same across groups as well. This last strong test (i.e., that the phi's are equal) is a highly restrictive hypothesis. If the covariances of the latent variables are equal, the results indicate that the same model operates in both groups [Bollen 1989]. This last test, however, moves beyond strictly measurement issues to possible differences between or among populations in the theoretical constructs of interest, so it is not included in the hierarchy of tests for diagnosing measurement equivalence.

8. See SAS Technical Report P-179 for use with SAS Version 6.03 or later [SAS Institute, Inc. 1988]. The SAS codes for this example are available from the author.

9. Typical PRELIS and LISREL codes from this example are available from the author.

10. The mathematical definition of ipsative data is "any score matrix which has the property that the sum of the scores over attributes for each of the entities is a constant" [Clemans 1956]. Ipsatization is a complex process. First, the primary data matrix must be standardized by column. Then the standard score matrix is ipsatized. The ipsative scores are then transformed into standard scores by columns.

11. The data matrix is standardized by row instead of column. In this way, the average score of each subject across items is zero. The resulting columns of data matrix are averaged for each population to obtain a relative mean for the variables.

12. The first two important areas for future research were pointed out by a reviewer.

REFERENCES

Adler, Nancy J. 1983. A typology of management studies involving culture. Journal of International Business Studies, 14(3): 29-47.

Adler, Nancy J., Nigel Campbell & Andre Laurent. 1989. In search of appropriate methodology: From outside the PRC looking in. Journal of International Business Studies, 20(1): 61-74.

Agarwal, Sanjeev. 1993. Influence of formalization on role stress, organizational commitment, and work alienation of salespersons: A cross-national comparative study. Journal of International Business Studies, 24(4): 715-39.

Albaum, Gerald & Robert A. Peterson. 1984. Empirical research in international marketing, 1976-1982. Journal of International Business Studies, 15(1): 161-73.

Anderson, Ronald D., Jack L. Engledow & Helmut Becker. 1980. Advertising attitudes in West Germany and the U.S.A.: An analysis over age and time. Journal of International Business Research, 9(3): 27-38 (cited in Douglas & Craig [1983]).

Aulakh, Preet S. & Masaaki Kotabe. 1993. An assessment of theoretical and methodological developments in international marketing: 1980-1990. Journal of International Marketing, 1(2): 5-28.

Bagozzi, Richard P. 1980. Causal models in marketing. New York: Wiley.

Barksdale, Hiram C. & L. McTier-Anderson. 1982. Comparative marketing: A review of the literature. Journal of Macromarketing, 2(1): 57-62.

Bechtel, G. G. & C. Ofir. 1988. Aggregate item response analysis. Psychometrika, 53(March): 93-107.

Bentler, Peter M. & D. G. Bonett. 1980. Significance tests and goodness-of-fit in the analysis of covariance structures. Psychological Bulletin, 88: 588-600.

Berry, J. W. 1980. Introduction to methodology. In H. C. Triandis & W. Lonner, editors, Handbook of cross-cultural psychology. Methodology, Vol. 2, 1-28. Boston: Allyn and Bacon.

Bhalla, Gaurav & Lynn Y. S. Lin. 1987. Cross-cultural marketing research: A discussion of equivalence issues and measurement strategies. Psychology & Marketing, 4(4): 275-85.

Boddewyn, Jean J. 1981. Comparative marketing: The first twenty-five years. Journal of International Business Studies, 12(1): 61-79.

Bollen, Kenneth A. 1986. Sample size and Bentler and Bonett's normed fit index. Psychometrika, 51: 375-77.

Bollen, Kenneth A. 1989. Structural equations with latent variables. New York: John Wiley.

Bray, James H. & Scott E. Maxwell. 1985. Multivariate analysis of variance. Newbury Park, Calif.: Sage Publications.

Brislin, Richard W. 1980. Translation and content analysis of oral and written materials. In H. C. Triandis & W. Lonner, editors, Handbook of cross-cultural psychology. Methodology, Vol. 2, 389-444. Boston: Allyn and Bacon.

Brislin, Richard W., Walter J. Lonner & Robert M. Thorndike. 1973. Cross-cultural research methods. New York: John Wiley.

Buss, Allan R. & Joseph R. Royce. 1975. Detecting cross-cultural commonalities and differences: Intergroup factor analysis. Psychological Bulletin, 82(1): 128-36.

Campbell, Donald T. & J. C. Stanley. 1966. Experimental and quasi-experimental design for research. Chicago: Rand McNally.

Capon, Noel, Chris Christodoulou, John U. Farley & James M. Hulbert. 1987. A comparative analysis of the strategy and structure of United States and Australian corporations. Journal of International Business Studies, 18(1): 51-74.

Churchill, Gilbert A. 1979. A paradigm for developing better measures of marketing constructs. Journal of Marketing Research, 16(February): 64-73.

Churchill, Gilbert A. 1987 (fourth edition). Marketing research, methodological foundations. Ft. Worth, Tex.: The Dryden Press.

Clemans, William V. 1956. An analytical and empirical examination of some properties of ipsative measures. Psychometric Monographs, 14 (referenced in Cunningham, Cunningham & Green [1977]).

Cunningham, William, Isabella C. M. Cunningham & Robert T. Green. 1977. The ipsative process to reduce response set bias. Public Opinion Quarterly, 41(Fall): 379-94.

Davis, Harry L., Susan P. Douglas & Alvin J. Silk. 1981. Measure unreliability: A hidden threat to cross-national marketing research? Journal of Marketing, 45(2): 98-109.

Didow, Nicholas & Michael R. Mullen. 1992. Sample size effects on the stability of optimal scaling estimates - A preliminary investigation. In R. P. Leone & V. Kumar, editors, Enhancing knowledge development in marketing, Vol. 3, 290-95. (American Marketing Association Educator's Proceedings).

Didow, Nicholas M., Kevin Lane Keller, Hiram C. Barksdale, Jr. & George R. Franke. 1985. Improving measure quality by alternating least squares optimal scaling. Journal of Marketing Research, 22(February): 30-40.

Didow, Nicholas M., William D. Perreault Jr. & Nicholas C. Williamson. 1983. A cross-sectional optimal scaling analysis of the index of consumer sentiment. Journal of Consumer Research, 10(December): 339-47.

Douglas, Susan P. & C. Samuel Craig. 1983. International marketing research. Englewood Cliffs, N.J.: Prentice Hall.

England, George W. & Itzhak Harpaz. 1983. Some methodological and analytic considerations in cross-national comparative research. Journal of International Business Studies, 14(3): 49-59.

Fornell, Claes. 1983. Issues in the application of covariance structure analysis: A comment. Journal of Consumer Research, 9(March): 443-48.

Gorsuch, R. L. 1974. Factor analysis. Toronto: W. B. Saunders (cited in Buss and Royce [1975]).

Gray, Sidney J., Lee H. Radebaugh & Clare B. Roberts. 1990. International perceptions of cost constraints on voluntary information disclosures: A comparative study of U.K. and U.S. multinationals. Journal of International Business Studies, 21(4): 597-622.

Green, Robert T. & Philip D. White. 1976. Methodological considerations in cross-national consumer research. Journal of International Business Studies, 7(3): 81-87.

Harpaz, Itzhak. 1990. The importance of work goals: An inter-national perspective. Journal of International Business Studies, 21(1): 75-93.

Irvine, Sid H. & William K. Carroll. 1980. Testing and assessment across cultures: Issues in methodology and theory. In H. C. Triandis & W. Lonner, editors, Handbook of cross-cultural psychology: Methodology, 181-244. Boston: Allyn and Bacon.

Jacoby, Jacob. 1978. Consumer research: A state of the art review. Journal of Marketing, 42(April): 87-96.

Johnson, Richard A. & Dean W. Wichern. 1982 (second edition). Applied multivariate statistical analysis. Englewood Cliffs, N.J.: Prentice Hall.

Jones, L. V. 1960. Some invariant findings under the method of successive intervals. In H. Gulliksen & S. Messick, editors, Psychological scaling: Theory and applications. New York: John Wiley.

Jöreskog, Karl G. & Dag Sörbom. 1989. LISREL VII user's reference guide. Mooresville, Ind.: Scientific Software.

Jöreskog, Karl G. & Dag Sörbom. 1988. PRELIS: A preprocessor for LISREL. Mooresville, Ind.: Scientific Software.

Kerlinger, Fred N. 1986. Foundations of behavioral research. Fort Worth, Tex.: Holt, Rinehart and Winston.

Kotabe, Masaaki, Dale F. Duhan, David K. Smith & R. Dale Wilson. 1991. The perceived veracity of PIMS strategy principles in Japan: An empirical inquiry. Journal of Marketing, 55(1): 26-41.

Kotabe, Masaaki & Janet Y. Murray. 1990. Linking product and process innovations and modes of international sourcing in global competition: A case of foreign multinational firms. Journal of International Business Studies, 21(3): 383-425.

Kruskal, J. B. 1964. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1): 1-27.

Lee, Chol & Robert T. Green. 1991. Cross-cultural examination of the Fishbein behavioral intentions model. Journal of International Business Studies, 22(2): 289-305.

Lincoln, James R. & Arne L. Kalleberg. 1985. Work organization and workforce commitment: A study of plants and employees in the U.S. and Japan. American Sociological Review, 50(December): 738-60.

Lodge, Milton. 1981. Magnitude scaling, quantitative measurement of opinions. Beverly Hills, Calif.: Sage Publications.

Martilla, John A. & Davis W. Carvey. 1975. Four subtle sins in marketing research. Journal of Marketing, 39(January): 8-15.

Mayer, Charles S. 1978. Multinational marketing research: The magnifying glass of methodological problems. European Research, 6(March): 77-83.

McDonald, Roderick P. & Herbert W. Marsh. 1990. Choosing a multivariate model: Noncentrality and goodness of fit. Psychological Bulletin, 2(March): 247-55.

Morris, Tom & Cynthia M. Pavett. 1992. Management style and productivity in two cultures. Journal of International Business Studies, 23(1): 169-79.

Mullen, Michael R., George Milne & Nicholas Didow. 1995 (forthcoming). Determining cross-cultural metric equivalence in survey research: A statistical test. In T. Madsen, editor, Advances in international marketing.

Mullen, Michael R., George Milne & Patricia Doney. 1995. An international marketing application of outlier analysis for structural equations: A methodological note. Journal of International Marketing, 3(1): 45-62.

Nason, Robert W. 1989. From the editor: Building block. Journal of Macromarketing, 9(1): 3-4.

Onkvisit, Sak & John J. Shaw. 1989. International marketing: Analysis and strategy. Columbus, Ohio: Merrill.

Parameswaran, Ravi & Attila Yaprak. 1987. A cross-national comparison of consumer research measures. Journal of International Business Studies, 18(1): 35-49.

Perreault, William D., Jr. & Forrest W. Young. 1980. Alternating least squares optimal scaling: Analysis of nonmetric data in marketing research. Journal of Marketing Research, 17(February): 1-13.

Peter, J. Paul. 1979. Reliability: A review of psychometric basics and recent marketing practices. Journal of Marketing Research, 16(February): 6-17.

Peter, J. Paul. 1981. Construct validity: A review of basic issues and marketing practices. Journal of Marketing Research, May, 133-45.

Przeworski, Adam & Henry Teune. 1970. The logic of comparative social inquiry. London: Cambridge University Press.

Reitsberger, Wolf D. & Shirley J. Daniel. 1990. Japan vs. Silicon Valley: Quality-cost trade-off philosophies. Journal of International Business Studies, 21(2): 289-300.

SAS Institute Inc. 1988. SAS technical report P-179: Additional SAS/STAT procedures, release 6.03. Cary, N.C.: SAS Institute.

Sekaran, Uma. 1983. Methodological and theoretical issues and advancements in cross-cultural research. Journal of International Business Studies, 14(2): 61-73.

Sekaran, Uma & Harry J. Martin. 1982. An examination of the psychometric properties of some commonly researched variables in two cultures. Journal of International Business Studies, 13(1): 51-66.

Seringhaus, F.H. Rolf & Guenther Botschen. 1991. Cross-national comparison of export promotion services: The views of Canadian and Austrian companies. Journal of International Business Studies, 22(1): 115-33.

Srinivasan, Venkat & Amiya K. Basu. 1989. The metric quality of ordered categorical data. Marketing Science, 8(3): 205-230.

van Raaij, Fred W. 1978. Cross-cultural research methodology as a case of construct validity. In H. K. Hunt, editor, Association for consumer research, 5: 693-701.

Vijver, Van de & Y. H. Poortinga. 1982. Cross-cultural generalization and universality. Journal of Cross-Cultural Psychology, 13: 387-408.

Young, Forrest W. 1981. Quantitative analysis of qualitative data. Psychometrika, 46(4): 357-88.

Young, Forrest W., Yoshio Takane & Jan de Leeuw. 1978. The principal components of mixed measurement data: An alternating least squares method with optimal scaling features. Psychometrika, 43(June): 279-82.

Zhang, Li & Kofi Q. Dadzie. 1991. Developing more meaningful and accurate measurement models in global marketing research: An empirical illustration. In M. Gilly, editor, Enhancing knowledge development in marketing, American Marketing Association Educators' Proceedings, 610-11.