

J Clin Epidemiol Vol. 45, No. 7, pp. 785-790, 1992 0895-4356/92 $5.00 + 0.00 Printed in Great Britain. All rights reserved Copyright © 1992 Pergamon Press Ltd

SOURCES OF INTEROBSERVER VARIATION IN HISTOPATHOLOGICAL GRADING OF CERVICAL DYSPLASIA

HENRICA C. W. DE VET,1* PAUL G. KNIPSCHILD,1 HUBERT J. A. SCHOUTEN,1 JOHAN KOUDSTAAL,2 WIE-SIEN KWEE,3 DIRK WILLEBRAND,4 FERD STURMANS1 and JAN WILLEM ARENDS5

1Department of Epidemiology and Biostatistics, University of Limburg, Maastricht, 2Department of Pathology, De Wever Hospital, Heerlen, 3Department of Pathology, St Laurentius Hospital, Roermond, 4Department of Pathology, Municipal Public Health Services, Haarlem and 5Department of Pathology, University of Limburg, Maastricht, The Netherlands

(Received 26 November 1991)

Abstract: The present study aimed to assess where the interobserver variation in grading cervical dysplasia stems from. Four experienced pathologists examined 93 histological slides, after they agreed on which morphological characteristics should be considered relevant for grading. They scored 6 morphological characteristics for each slide and assigned it to a degree of dysplasia.

Compared to a previous study, the interobserver variation showed a statistically significant improvement: the weighted group kappa value increased from 0.55 to 0.69. For the scores of the individual characteristics considerable interobserver variation was observed: weighted group kappa values ranged from 0.28 to 0.49. The pathologists slightly differed in which characteristics they considered most important for their grading. The agreement on the degree of dysplasia turned out to be better than the agreement on the morphological characteristics on which this diagnosis is based. In the discussion, a few explanations for this paradoxical finding are put forward.

Cervix dysplasia Histology Concordance

INTRODUCTION

Several studies have pointed at the interobserver variation in the interpretation of cervical biopsy specimens by pathologists [1-4]. We became interested in the sources of this variation in histological grading when we also found considerable disagreement when four pathologists graded a sample of 106 biopsy specimens [5]. Knowledge about the origin might help to reduce variation between pathologists, thereby improving the quality of the diagnosis.

*All correspondence should be addressed to: Henrica C.W. de Vet, Department of Epidemiology and Biostatistics, University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands [Tel: 43-882222].

Therefore we decided to study the sources of variation in histological grading.

Various explanations for the observed disagreement can be put forward. Grading is hampered by the arbitrary division into distinct categories of a continuously progressing process without natural and sharply defined borders. Moreover, the criteria for grading are equivocal. With regard to histological grading several morphological characteristics are taken into account, related to cell differentiation, cytonuclear features and mitotic activity. The ultimate diagnosis depends on the emphasis which is put on each of these characteristics in grading. Thus, the extent of interobserver variation may depend firstly on which characteristics are considered to be important for grading, secondly on the variability in the observation of these characteristics and thirdly on the variability in the grading of these observations into the various categories of cervical dysplasia.

In the present study we first reached consensus on which characteristics are relevant for grading. The influence of this consensus can be inferred from the improvement in interobserver agreement compared with the previous study. Furthermore, we tried to distinguish whether the variation stems from the differences in the observation of the relevant morphological characteristics, or from the grading of these observations into the arbitrary categories of dysplasia. This latter source of variation may arise when pathologists differ in their gradings, although they had similar scores on the morphological characteristics. This may occur either because they have a tendency to assign systematically to a less or more severe grade, because of random variation in the assignments of the grades, or because they attach different weights to the various characteristics.

Thus, the emphasis in this paper is more on the identification of the source(s) of interobserver variation than, directly, on the improvement of the grading of cervical dysplasia.

METHODS

The same four pathologists of the former study [5], experienced in histopathological grading of cervical dysplasia, participated. For this new study, they first reached consensus about which morphological characteristics are relevant to consider for the grading of cervical dysplasia. They decided to score (1) location of immature cells, (2) hyperchromasia, (3) nucleus/cytoplasm ratio, (4) polymorphous nuclei, (5) location of mitotic activity, and (6) appearance of mitotic activity. No agreement was pursued about the ranking on importance of the various characteristics. We used 93 biopsy specimens instead of the original 106 used in the former study, because the slides of one of the four hospitals participating in the previous study were not available for the present assessment. Using the same set of slides enabled a valid comparison with the interobserver variation found in the previous study. For that purpose, we recalculated the kappa-values of the previous study for the 93 slides involved in the present study. The pathologists assigned a score to each of the six morphological characteristics on an ordinal scale of 2 or 3 categories, and thereafter graded the slides into an ordinal scale of 5 categories: no dysplasia, mild dysplasia, moderate dysplasia, severe dysplasia, carcinoma in situ. The scoring form is presented in Table 1. Later on, after the pathologists had scored the slides, but before they had seen the results, they were asked to rank the six morphological characteristics according to the weight given in their grading.

Table 1. Scoring list for each biopsy specimen

Characteristics                 Categories                  Score

Differentiation
  Location of immature cells    Within 1/3 of the layer
                                Within 2/3 of the layer
                                Through the whole layer

Cytonuclear characteristics
  Hyperchromasia                Absent
                                Moderately present
                                Severely present
  Nucleus/cytoplasm ratio       Normal
                                Moderately abnormal
                                Severely abnormal
  Polymorphous nuclei           Absent
                                Moderately present
                                Severely present

Mitotic activity
  Location                      Within 1/3 of the layer
                                Within 2/3 of the layer
                                Through the whole layer
  Appearance                    Pathological
                                Not pathological

Degree of dysplasia             No dysplasia
                                Mild dysplasia              2
                                Moderate dysplasia          3
                                Severe dysplasia            4
                                Carcinoma in situ           5

Statistical analysis

The interobserver variation for the scores on each morphological characteristic and the assigned degree of dysplasia was assessed by calculation of kappa statistics [6,7]. Unweighted and weighted group kappa values were calculated, the latter by using quadratic disagreement weights [8]. Weighted kappa values are based on the idea that if two observers differ by more than one category, then their disagreement should be given more weight than if they differ by only one category. In the group kappa coefficient, the average observed agreement is compared to the average chance agreement, with the average taken over all pairs of observers and over all slides [8]. The standard jackknife technique was used to compute 95% confidence intervals (95% CI) [8].
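The group kappa and its jackknife interval can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program used in the study: every observer is assumed to have rated every slide, categories are coded 0 to k-1, the quadratic disagreement weights are expressed as agreement weights 1 - (i - j)^2 / (k - 1)^2, observer pairs are pooled by a plain average, and the interval uses leave-one-slide-out pseudo-values with a normal approximation. All function names and the toy ratings are hypothetical.

```python
import math
from itertools import combinations


def quadratic_weight(i, j, k):
    """Agreement weight: 1 on the diagonal, decreasing with squared distance."""
    return 1.0 - (i - j) ** 2 / (k - 1) ** 2


def identity_weight(i, j, k):
    """Unweighted kappa: only exact agreement counts."""
    return 1.0 if i == j else 0.0


def group_kappa(ratings, k, weight=quadratic_weight):
    """ratings[s][o] is the category (0..k-1) observer o gave to slide s."""
    n_slides = len(ratings)
    n_obs = len(ratings[0])
    # Marginal category proportions per observer, needed for chance agreement.
    marginals = [
        [sum(row[o] == c for row in ratings) / n_slides for c in range(k)]
        for o in range(n_obs)
    ]
    pairs = list(combinations(range(n_obs), 2))
    observed = expected = 0.0
    for a, b in pairs:
        observed += sum(weight(row[a], row[b], k) for row in ratings) / n_slides
        expected += sum(
            marginals[a][i] * marginals[b][j] * weight(i, j, k)
            for i in range(k)
            for j in range(k)
        )
    # Average observed and chance agreement over all pairs of observers.
    observed /= len(pairs)
    expected /= len(pairs)
    return (observed - expected) / (1.0 - expected)


def jackknife_ci(ratings, k, weight=quadratic_weight, z=1.96):
    """Interval from leave-one-slide-out pseudo-values (normal approximation)."""
    n = len(ratings)
    full = group_kappa(ratings, k, weight)
    pseudo = [
        n * full - (n - 1) * group_kappa(ratings[:s] + ratings[s + 1:], k, weight)
        for s in range(n)
    ]
    estimate = sum(pseudo) / n
    variance = sum((p - estimate) ** 2 for p in pseudo) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return estimate - half_width, estimate + half_width


# Made-up ratings for illustration only: 6 slides, 3 observers, 3 categories.
toy = [[0, 0, 1], [1, 1, 1], [2, 2, 2], [0, 1, 0], [2, 1, 2], [1, 1, 2]]
print(group_kappa(toy, 3, identity_weight), group_kappa(toy, 3))
print(jackknife_ci(toy, 3))
```

With only two categories the off-diagonal quadratic weight is zero, so `quadratic_weight` and `identity_weight` give the same kappa.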

We used a graphical presentation to see whether the pathologists showed a tendency to assign systematically to less or more severe grades.


In order to assess the weight given by each pathologist to the various characteristics, gamma coefficients were calculated [9] as a measure of association between each morphological characteristic and the assigned degree of dysplasia. This coefficient of correlation is adequate to compare ranks on ordinal variables with unequal numbers of classes. It is defined as gamma = (C - D)/(C + D), where C and D are the numbers of concordant and discordant pairs of slides. In our situation, a pair of slides is concordant if one slide has a higher score of a morphological characteristic and a higher degree of dysplasia than the other slide. A pair of slides is discordant if one slide has a higher score of the morphological characteristic but a lower degree of dysplasia than the other slide. Gamma equals unity if there are no discordant pairs of slides.
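The definition gamma = (C - D)/(C + D) translates directly into a pair count. The sketch below is illustrative only; the function name and the toy scores are hypothetical, and pairs tied on either variable are simply left out of both counts, as the definition implies.

```python
def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D) over all pairs of slides; tied pairs are ignored."""
    concordant = discordant = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                concordant += 1   # same ordering on both variables
            elif product < 0:
                discordant += 1   # opposite ordering
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Made-up data: characteristic score and assigned grade for four slides.
scores = [1, 2, 3, 3]
grades = [1, 2, 2, 4]
print(goodman_kruskal_gamma(scores, grades))  # 1.0: four concordant pairs, none discordant
```

Because ties are dropped, gamma can reach unity even when a three-class characteristic is compared with the five-class degree of dysplasia.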

RESULTS

Out of the total number of 372 assessments (93 slides × 4 pathologists), 370 diagnoses were noted, as pathologist B felt unable to assign two biopsy specimens to one of the five categories.

The unweighted and weighted group kappa values for the agreement on degree of dysplasia were 0.32 (CI: 0.25-0.39) and 0.69 (CI: 0.61-0.77) respectively. In our previous study, without discussion about which morphological characteristics were considered relevant for grading, the unweighted and weighted group kappa values amounted to 0.27 (CI: 0.19-0.35) and 0.55 (CI: 0.44-0.66) respectively. Thus, improvement in interobserver agreement was especially seen for the weighted group kappa (p = 0.003). Kappa values for pairs of pathologists showed no outliers.

Table 2 shows the frequency distributions of the scores of the six morphological characteristics for the four pathologists. It appeared that the frequency distributions of the scores for all characteristics were quite different for the various pathologists.

Table 2. Frequency distribution of the scores for the various characteristics for the four pathologists (A, B, C, D)

Location of immature cells       Within 1/3    Within 2/3    Whole layer
  A                              37%           51%           13%
  B                              42%           51%           [illegible]
  C                              [illegible: 9% 0% 9;;]
  D                              38%           57%           5%

Hyperchromasia                   Absent        Moderately    Severely
                                               present       present
  A                              37%           53%           11%
  B                              47%           45%           9%
  C                              9%            71%           20%
  D                              13%           63%           24%

Nucleus/cytoplasm ratio          Normal        Moderately    Severely
                                               abnormal      abnormal
  A                              14%           75%           11%
  B                              27%           61%           12%
  C                              9%            78%           13%
  D                              4%            65%           31%

Polymorphous nuclei              Absent        Moderately    Severely
                                               present       present
  A                              9%            81%           11%
  B                              25%           59%           16%
  C                              8%            73%           19%
  D                              5%            62%           32%

Location of mitotic activity     Within 1/3    Within 2/3    Whole layer
  A                              56%           40%           4%
  B                              63%           27%           10%
  C                              52%           35%           13%
  D                              37%           49%           14%

Appearance of mitotic activity   Pathological  Not pathological
  A                              86%           14%
  B                              66%           34%
  C                              97%           3%
  D                              90%           10%

Degree of dysplasia              No      Mild    Moderate    Severe    Carcinoma in situ
  A                              2%      35%     38%         24%       1%
  B                              7%      30%     28%         26%       9%
  C                              8%      26%     44%         19%       3%
  D                              4%      27%     39%         27%       3%

For all individual characteristics unweighted and weighted group kappa values are presented in Table 3. The unweighted and weighted kappa values for the two-class characteristic "appearance of mitotic activity" are equal, as discrepancy over more than one category is not possible. It turned out that for each characteristic separately there was much dissension. "Location of mitotic activity" showed the highest kappa values (0.34 unweighted, 0.49 weighted).

Table 3. Unweighted and weighted kappa values and their 95% CI for the various characteristics and the degree of dysplasia

Characteristic                   Unweighted kappa    Weighted kappa      Number of classes

Location of immature cells       0.21 (0.14-0.28)    0.32 (0.24-0.40)    3
Hyperchromasia                   0.23 (0.13-0.33)    0.40 (0.29-0.51)    3
Nucleus/cytoplasm ratio          0.24 (0.13-0.35)    0.38 (0.37-0.49)    3
Polymorphous nuclei              0.19 (0.12-0.26)    0.34 (0.26-0.42)    3
Location of mitotic activity     0.34 (0.25-0.43)    0.49 (0.37-0.61)    3
Appearance of mitotic activity   0.28 (0.15-0.41)    0.28 (0.15-0.41)    2

Degree of dysplasia              0.32 (0.25-0.39)    0.69 (0.61-0.77)    5

In order to approach the third source of variation, interpreting the observed characteristics and grading them into the five-class ordinal scale of degree of dysplasia, we first compared the sum of the scores on the six characteristics with the assigned diagnosis of cervical dysplasia for each pathologist. The results are depicted in Fig. 1. No serious systematic differences between the pathologists were observed, in assigning to less or more severe grades of dysplasia, in case of the same summed score. However, there was some overlap in degree of dysplasia for various summed scores. This overlap may be partly explained by the fact that this comparison between the summed scores and assigned degree of dysplasia neglects the different weights that pathologists attached to the various characteristics. By comparing the relation between each morphological characteristic with the degree of dysplasia, it can be determined which characteristics were strongly associated with the ultimate diagnosis of each pathologist. Table 4 presents in the left column gamma coefficients of correlation for each characteristic with the assigned degree of dysplasia and in the right column the rankings of the characteristics reported by the pathologists afterwards (1 indicates the most important and 6 the least important characteristic). It appeared that most characteristics, except "pathological appearance of the mitotic activity", correlated quite well with the assigned degree of dysplasia. As the gamma coefficients of some characteristics did not vary much (most of them were larger than 0.90), we could not expect the pathologists to rank them, on paper, in exactly the same order. The characteristics which showed a correlation lower than 0.90 were ordered as less important (ranking 5 or 6) by the pathologists five out of six times. "Pathological appearance of mitotic activity" showed the weakest correlation with the degree of dysplasia in the scores of pathologists A, C and D, and was also listed in the last place by two of these pathologists. Pathologist B listed the "pathological appearance of mitotic activity" in third place of importance, corresponding with a strong correlation between this characteristic and the assigned degree of dysplasia in the data of pathologist B.

[Fig. 1: four panels, one each for Pathologists A, B, C and D; within each panel, rows ND, LD, MD, SD and CIS plotted against the summed score (axis ticks at 4, 8, 12).]

Fig. 1. Frequency distributions of summed scores by the various degrees of dysplasia for each pathologist (ND = no dysplasia, LD = light dysplasia, MD = moderate dysplasia, SD = severe dysplasia, CIS = carcinoma in situ).

Table 4. The gamma correlation coefficients for the association of the six characteristics with the assigned degree of dysplasia, compared to the rankings of characteristics in order of importance

                                   Gamma coefficient    Order of importance

Pathologist A
  Location of immature cells       0.998                1
  Location of mitotic activity     0.981                2
  Nucleus/cytoplasm ratio          0.944                4
  Polymorphous nuclei              0.911                6
  Hyperchromasia                   0.834                5
  Appearance of mitotic activity   0.669                3

Pathologist B
  Location of immature cells       0.994                1
  Appearance of mitotic activity   0.986                3
  Location of mitotic activity     0.933                2
  Nucleus/cytoplasm ratio          0.918                4
  Polymorphous nuclei              0.838                5
  Hyperchromasia                   0.801                6

Pathologist C
  Location of immature cells       1.000                1
  Hyperchromasia                   0.983                4
  Location of mitotic activity     0.976                2
  Nucleus/cytoplasm ratio          0.957                5
  Polymorphous nuclei              0.919                3
  Appearance of mitotic activity   0.801                6

Pathologist D
  Polymorphous nuclei              0.997                4
  Nucleus/cytoplasm ratio          0.994                1
  Hyperchromasia                   0.982                2
  Location of immature cells       0.971                5
  Location of mitotic activity     0.970                3
  Appearance of mitotic activity   0.527                6

DISCUSSION

This study examined sources of variation in histopathological grading of cervical dysplasia. It appeared that consensus on which morphological characteristics are relevant to consider for grading, being the only difference between the previous and the present study, already increased the agreement between pathologists. The fact that the improvement was especially seen in the weighted kappa values and hardly in the unweighted kappa values means that it did not help so much in distinguishing mild from moderate dysplasia, or moderate from severe dysplasia, but that differences over more than one category occurred less often.

Apart from the variation due to the choice of relevant characteristics to consider, the observation of these morphological characteristics and grading of the observations into the five categories of dysplasia are possible sources of variation.

It turned out that considerable disagreement was observed in the scoring of the six morphological characteristics. This means that the pathologists disagreed on, for example, when one classifies the nucleus/cytoplasm ratio as disturbed, when one judges the mitotic activity as pathological, or when polymorphous nuclei are absent. Apparently, the criteria used for grading are equivocal.

The weights attached to the various morphological characteristics were inferred from the correlation (expressed by the gamma coefficients) between these characteristics and the degree of dysplasia. All characteristics but one showed a strong correlation with the degree of dysplasia. The pathologists differed to some extent in which characteristics they considered most important. However, these characteristics were highly interrelated (data not presented), because they are all indicators of the degree of dysplasia. For that reason, the variation in weights attached to the various characteristics seems to have no major influence on the ultimate diagnosis of degree of dysplasia.


In conclusion, the variation in the observation of the relevant characteristics seemed to contribute more to the interobserver variation than the weighting of these characteristics for the grading into categories of dysplasia.

It seems illogical that the agreement on the grade of dysplasia was fairly good, while the agreement in scores on the individual morphological characteristics on which the degree of dysplasia is based was disappointing. Part of it may be due to random errors in the scores on the characteristics, which are averaged out in the assignment of the degree of dysplasia. But there are probably more important explanations for this paradoxical finding:

- it is possible that besides the criteria on which we reached consensus in this study, there are other considerations or unspoken criteria involved in the grading of cervical dysplasia.

- it is also conceivable that the pathologists first judge the degree of dysplasia and then score the characteristics in accordance with this diagnosis.

The first explanation refers to a completely rational approach to clinical judgment. Based on criteria for the presence of several morphological considerations, the degree of dysplasia is determined.

The second explanation refers to a method of pattern recognition, using the morphological characteristics as support for it. The sequence of the reasoning is the other way around.

It is impossible to design a study with independent observation of the degree of dysplasia and the individual characteristics. Of course, the pathologists will bring the observations of the characteristics and the degree of dysplasia into accordance. But the sequence of observation and reasoning can be questioned.

Therefore our study does not prove whether the first or the second explanation is true. The statistically significant improvement achieved by only reaching consensus on the relevant characteristics supports the first explanation. However, the finding that the agreement on the degree of dysplasia is better than the agreement on the characteristics on which it is based favors the second explanation.

Irrespective of which explanation holds, it is evident that more consensus in either definition of criteria or pattern recognition can only be achieved by joint sessions behind a microscope. We are currently organizing these sessions.

REFERENCES

1. Pieters WJLM. The atypical mitosis as a characteristic in classifying squamous lesions of the uterine cervix. Dissertation, University of Groningen; 1987: 68-82.
2. Robertson AJ, Anderson JM, Swanson Beck J et al. Observer variability in histopathological reporting of cervical biopsy specimens. J Clin Pathol 1989; 42: 231-238.
3. Ringsted J, Amtrup F, Asklund C et al. Reliability of histopathological diagnosis of squamous epithelial changes of the uterine cervix. Acta Path Microbiol Scand Sect A 1978; 86: 273-278.
4. Ismail SM, Colclough AB, Dinnen JS et al. Observer variation in histopathological diagnosis and grading of cervical intraepithelial neoplasia. Br Med J 1989; 298: 707-710.
5. Vet HCW de, Knipschild PG, Schouten HJA et al. Interobserver variation in histopathological grading of cervical dysplasia. J Clin Epidemiol 1990; 43: 1395-1398.
6. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960; 20: 37-46.
7. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968; 70: 213-230.
8. Schouten HJA. Nominal scale agreement among observers. Psychometrika 1986; 51: 453-466.
9. Goodman LA, Kruskal WH. Measurement of association for cross classifications. J Am Stat Assoc 1954; 49: 732-764.