Volume-based Response Evaluation with Consensual Lesion Selection: A Pilot Study by Using Cloud Solutions and Comparison to RECIST 1.1

Estanislao Oubel, PhD, Eric Bonnard, MD, Naoko Sueoka-Aragane, MD, PhD, Naomi Kobayashi, MD, Colette Charbonnier, PhD, Junta Yamamichi, MSc, MPH, Hideaki Mizobe, MSc, Shinya Kimura, MD, PhD
Rationale and Objectives: Lesion volume is considered a promising alternative to Response Evaluation Criteria in Solid Tumors (RECIST) for making tumor measurements more accurate and consistent, which would enable earlier detection of temporal changes. In this article, we report the results of a pilot study evaluating the effects of consensual lesion selection on volume-based response (VBR) assessments.
Materials and Methods: Eleven patients with lung computed tomography scans acquired at three time points were selected from the Reference Image Database to Evaluate Response to therapy in lung cancer (RIDER) and proprietary databases. Images were analyzed according to RECIST 1.1 and VBR criteria by three readers working in different geographic locations. Cloud solutions were used to connect the readers and carry out a consensus process on the selection of lesions used for computing response. Because there are no currently accepted thresholds for computing VBR, we applied a set of thresholds based on measurement variability (−35% and +55%). The benefit of this consensus was measured in terms of multiobserver agreement by using the Fleiss kappa (kfleiss) and corresponding standard errors (SE).
Results: VBR after consensual selection of target lesions yielded kfleiss = 0.85 (SE = 0.091), which increased to 0.95 (SE = 0.092) if an additional consensus on new lesions was added. As a reference, the agreement when applying RECIST without consensus was kfleiss = 0.72 (SE = 0.088). These differences were statistically significant according to a z-test.
Conclusions: An agreement on the selection of lesions reduces inter-reader variability when computing VBR. Cloud solutions proved to be an interesting and feasible strategy for standardizing response evaluations, reducing variability, and increasing the consistency of results in multicenter clinical trials.
Key Words: Clinical trials; RECIST; cloud computing; consensus; lesion volume; biomarkers; volume thresholds.
© AUR, 2015
According to recent statistics of the World Health Organization, cancer is the leading cause of death worldwide, accounting for 8.2 million deaths in 2012 (1). Lung cancer accounted for 1.6 million deaths (19.4%), which makes it the leading cause of cancer death, ahead of liver (9.1%) and stomach (8.8%) cancers. Computed tomography (CT) is currently the standard imaging modality for assessing the response to treatment in patients with solid tumors. In
general, the quantification of the response is performed by using Response Evaluation Criteria in Solid Tumors (RECIST) (2,3). In summary, this standard establishes how lesions are to be measured and provides a set of thresholds to classify the response as partial response, stable disease, or progressive disease. In the context of this article, the term RECIST refers to its revised version (1.1) (3).
Even though RECIST criteria have been broadly adopted in clinical trials, they present some drawbacks that are a source of measurement variability. This is problematic because consistency in the production of trial results is necessary for comparison purposes (4). For example, lesion size is measured as the longest axial diameter, which is not a robust measure for complex lesions and creates problems of accuracy and precision (5). To cope with this drawback, the use of volume is currently being considered as a promising way to make tumor measurements more accurate and consistent, which would enable earlier detection of temporal changes (6–9). The benefit of using volume as a biomarker has already been reported in the literature (5,10); however, the use of volume also presents some
drawbacks, such as the long segmentation time for large lesions (in particular for thin-slice acquisitions) and the lack of accepted thresholds for computing response. Regarding this last point, it is worth mentioning the intensive efforts of the Quantitative Imaging Biomarkers Alliance (11) to better understand volumetric biomarkers and their sources of variability. In the context of this article, the term volume-based response (VBR) refers to the response estimated exactly as established by RECIST, except for the use of lesion volumes and volume-specific thresholds instead of diameters. The term VBR was preferred to names like 3D-RECIST to avoid confusion with the application of RECIST with 3D-extended thresholds, and also because there is currently no formal definition of a volumetric version of RECIST.
Another reported source of variability of RECIST is the difference in the target lesions (TLs) selected for computing response (12,13). TLs are selected on the basis of their size and their suitability for reproducible measurements (6). Although it is relatively easy to measure the lesion size required for inclusion, it is more difficult to assess measurement repeatability by visual inspection only. For example, the edges of irregular or infiltrating lesions are often difficult to define and, in some cases, even impossible to measure (14). Another commonly found obstacle is the presence of peritumoral fibrosis, which is difficult to distinguish from tumor spread and adds further uncertainty to measurements. An interesting discussion of the limitations of RECIST can be found in (15).
To cope with the problems of variability mentioned before,
regulatory authorities have been recommending the use of an
Independent Central Review (ICR). The original purpose of
ICR as recommended by the Food and Drug Administration
(FDA) was to eliminate the bias associated with local evaluation
(LE). This is thought to be relevant for studies in which the
investigator knows the patients enrolled in different trial
arms. Another advantage of ICR is the possibility of standard-
izing evaluations because all patients are evaluated by a
restricted number of reviewers (typically one reviewer and
one adjudicator) using the same software solutions. However,
the advantages of ICR over LE have been questioned in the literature (16–18), and some alternatives to ICR, such as the use of audit tools, are currently being considered (19,20). The main criticism is that ICR does not remove bias completely and, in some cases, may introduce bias by itself (eg, by informative censoring). Finally, the implementation and management of the ICR process are costly and burdensome for sites and sponsors.
Several analyses, notably the one by the Pharmaceutical
Research and Manufacturers Association Working Group
(21), show that no systematic bias is introduced by LE. The same conclusions have been presented recently by the FDA, which proposes the use of ICR only as an audit tool to detect evaluation bias in LE assessments (19). However, in an LE
approach, there is an intrinsic variability among sites because
evaluation protocols cannot be guaranteed to be the same or
equivalent. Therefore, a standardization of the evaluation
could make results from different sites more consistent.
Cloud computing offers an opportunity for reducing the
variability in the application of RECIST in a LE approach
because it can facilitate the interaction between different
stakeholders and provide common tools for image analysis
(which reduces the variability coming from the use of
different measurement systems). The application of cloud
computing in the context of clinical trials has become
possible, thanks to the availability of high-capacity networks,
low-cost computers, and storage services. The huge amount
of data generated by medical imaging acquisition systems
makes cloud computing an interesting alternative for process-
ing, storing, and sharing images. In addition, cloud services free customers (eg, hospitals) from the responsibility of installing
and maintaining hardware and basic computational services
because these tasks are performed by cloud providers.
In this article, we have evaluated a cloud-based system to perform a consensus on the selection of lesions used for computing VBR. As suggested by previous publications (12,13), this is expected to reduce the inter-reader variability in LE approaches. Even though this article focuses on the specific problem of lesion selection, the proposed workflow can be extended to control other sources of variability. The final objective is to make the application of RECIST more consistent among sites. Several contributions are provided in this article. First, we show the feasibility of applying cloud computing in the context of multicenter clinical trials. To do this, we propose a novel cloud-based workflow enabling a more consistent application of RECIST among participant sites. Second, we applied this workflow to investigate the potential benefits of a lesion consensus on the response. This is only one specific application among others that illustrates how the system could be used and the advantages it provides. Third, we compare the proposed approach with the current standard and LE. Finally, we analyze the impact on RECIST compliance, which is expected to be higher after a consensus process.
MATERIALS AND METHODS
Data Set
Eleven patients (mean age, 62 years; 6 men and 5 women) presenting solid lesions were retrospectively selected from RIDER (22) (6 patients) and Median Technologies' proprietary databases (5 patients). The selection criteria were the type of lesion (solid), location (lung), number of acquired time points (at least baseline, 3, and 6 months), and homogeneity of image acquisition parameters (eg, filter type). Chest CT scans with contrast were performed by using General Electric LightSpeed and Siemens Sensation scanners. Image acquisition was performed at 120 kVp and 354 ± 103 mAs. The slice thickness was ≤2.5 mm (1.93 ± 0.63 mm), and the in-plane resolution was ≤0.78 mm (0.71 ± 0.06 mm). A filtered back projection reconstruction method was used in all cases, with B60f (Siemens) and Lung (GE) convolution kernels.
Lesions were selected and measured by two oncologists with 27 and 12 years of experience (the on-site readers) and a radiologist with 5 years of experience (the independent reviewer [IR]). The oncologists are specialized in clinical trials and perform RECIST evaluations in clinical routine. Figures 1–4 show the distribution of the selected lesions according to different characteristics.

Figure 1. Distribution of lesions according to their longest axis diameter (D).
Figure 2. Distribution of lesions according to location. LLL, left lower lobe; LUL, left upper lobe; MED, mediastinum; RLL, right lower lobe; RML, right middle lobe; RUL, right upper lobe.
Figure 3. Distribution of lesions according to surroundings. JP, juxta-pleural; JV, juxta-vascular; PF, perifissural; SL, solitary lesion.
Figure 4. Distribution of lesions according to shape.
Cloud-based Setup System
Figure 5 shows the cloud solution that was set up and the different stakeholders involved in the reading process, which can be described in general terms as follows:
1) The Data Managers (DMs) perform a quality control of the images and, if the quality control test is successful, the images are transferred to the data center, where the processing required before analysis is performed (image identification, image reconstruction, lung mask extraction, and so forth). Once the processing is completed, the DMs make the images available to the IR and readers for analysis.
2) The reader retrieves the images of a particular patient from the data center and performs lesion segmentation and response assessment according to RECIST and VBR. The analysis is saved to the data center to be reviewed by the IR.
3) The reader and IR exchange opinions about the selected TLs, and the reader is allowed to modify the selection if the feedback provided is considered relevant. This process, named "consensus," is coordinated by the DMs, who are in charge of keeping records of changes in evaluations.
4) Once an agreement is reached or the maximum number of iterations (set to four in this study) is exhausted, the final response is calculated. For clarity and readability, it is sometimes necessary to emphasize under which conditions (with or without consensus) the response was calculated. Therefore, we use a superscript asterisk to distinguish between responses computed with and without consensus: RECIST* and VBR* refer to responses calculated after consensus; RECIST and VBR refer to the response before consensus, or to the criteria in a general manner. A schematic sketch of this consensus loop is given below.

Figure 5. Architecture of the cloud solution, showing the different stakeholders involved in the reviewing process. UK, United Kingdom.
The stakeholders were based in different geographic locations, which allowed the solution to be evaluated for potential use in multicountry clinical trials. The DMs were based in Sophia Antipolis (France), the readers in Saga (Japan) and Glasgow (United Kingdom), the IR in Nice (France), and the data center (Canon IT Solutions Inc.) in Tokyo (Japan). A more detailed workflow of the reading process is shown in Figure 6.
Volume Measurement
The lesion volume was computed from segmentations performed by using Lesion Management Solutions (Median Technologies, Sophia Antipolis, France). If necessary, the initial results provided by the semiautomatic segmentation were corrected by using interactive segmentation tools implementing shape-based interpolation methods (23). To remove normal lung tissue from the lesion, a final thresholding step is automatically applied after each correction performed by the user.
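As an illustration of the volume computation itself, lesion volume follows directly from the voxel count of a segmentation mask and the voxel spacing. The sketch below is a minimal example assuming a hypothetical binary mask; it is not the actual implementation of Lesion Management Solutions.

```r
# Minimal sketch: lesion volume from a binary segmentation mask.
# `mask` is a hypothetical 3D logical array; the spacing values follow the
# acquisition parameters reported in the Data Set section (mm).
mask <- array(runif(64^3) > 0.99, dim = c(64, 64, 64))
spacing <- c(0.71, 0.71, 1.93)                # in-plane spacing and slice thickness

voxel_volume <- prod(spacing)                 # mm^3 per voxel
lesion_volume <- sum(mask) * voxel_volume     # segmented volume in mm^3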
Response Agreement
Target Lesions. We have evaluated the impact of the consensus on the inter-reader response agreement by comparing TL-based responses before and after consensus. As explained before, the response without consensus is always assessed before the IR's feedback and is potentially modified afterward. Therefore, both responses are available, and a comparison of agreements can be carried out. We have quantified the inter-reader agreement by using the Fleiss kappa (kfleiss) statistic (24), which measures the agreement between multiple observers assigning categorical ratings to a number of statistical units.
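As a minimal illustration of this computation, the irr package used in the Data Analysis section exposes the Fleiss kappa through kappam.fleiss; the ratings below are hypothetical.

```r
# Sketch: Fleiss kappa for three readers rating the same responses.
# Rows are statistical units (patient/time point); columns are readers.
library(irr)

ratings <- data.frame(
  reader1 = c("PR", "SD", "PD", "SD", "PR", "PD"),
  reader2 = c("PR", "SD", "PD", "PD", "PR", "PD"),
  reader3 = c("PR", "SD", "PD", "SD", "SD", "PD")
)

k <- kappam.fleiss(ratings)
k$value  # kfleiss point estimate; k$statistic and k$p.value are also available
```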
The inter-reader agreement was also computed for the tumor burden (sum of lesion volumes) and its change over time. The objective of this analysis was to evaluate the impact of the consensus on the measure itself, independently of thresholds. This is important because thresholds could hide differences in the continuous values, and detecting such differences is an interesting result even if no changes are observed in the categorical data. This agreement was quantified by using reproducibility coefficient values (25), which can be computed for multiple observers. A Shapiro-Wilk test of normality (26) was performed, and a logarithmic transformation was applied to the data for normalization purposes in case of null hypothesis rejection.
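A minimal sketch of this procedure follows, assuming hypothetical SLV data and one common formulation of the reproducibility coefficient (2.77 times the within-subject standard deviation); the exact formulation in (25) may differ in detail.

```r
# Sketch: normality check, log transformation, and a reproducibility
# coefficient for multiple observers. `slv` is hypothetical: rows are
# patient/time points, columns are the three readers (sum of lesion volumes).
set.seed(1)
slv <- matrix(rexp(33 * 3, rate = 1e-4), ncol = 3)

if (shapiro.test(as.vector(slv))$p.value < 0.05) {
  slv <- log(slv)  # transform in case of null hypothesis rejection
}

within_var <- mean(apply(slv, 1, var))  # within-unit (between-reader) variance
rc <- 2.77 * sqrt(within_var)           # ~95% limit for between-reader differences
```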
New Lesions. The consensus process was performed on TLs only. However, the impact of new lesions (NLs) on the response is strong because the appearance of at least one NL automatically changes the response status to progressive disease according to RECIST, independently of changes in TLs. Therefore, the effect of a potential consensus on NLs deserves special consideration. To analyze this, we simulated a consensus on NLs by assigning to all readers any NL identified by at least one reader. For example, if reader #1 identifies an NL at a given time point of a given patient, but this lesion was missed by both reader #2 and the IR, then this NL is also taken into account for computing the responses of reader #2 and the IR. The Fleiss kappa was also used to measure the inter-reader agreement.
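This simulated consensus amounts to a union across readers, as in the sketch below (hypothetical data): any NL flagged by at least one reader is propagated to all readers before responses are recomputed.

```r
# Sketch: propagate new-lesion findings to all readers.
# Rows are patient/time points; columns are readers (TRUE = NL found).
nl <- matrix(c(TRUE,  FALSE, FALSE,
               FALSE, FALSE, FALSE,
               FALSE, TRUE,  TRUE), ncol = 3, byrow = TRUE)

any_nl <- apply(nl, 1, any)                       # union across readers
nl_consensus <- matrix(any_nl, nrow = nrow(nl), ncol = ncol(nl))
# Every unit with at least one detected NL now counts as progressive
# disease for all readers, regardless of TL changes.
```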
Figure 6. Unified Modeling Language (UML) diagram of the reading process. The sequence fragment inside the loop corresponds to the consensus process aiming to apply RECIST uniformly. IR, independent reviewer.
Volume Thresholds. Currently, there are no accepted thresholds for assessing response based on volume changes. One could simply apply RECIST thresholds after extrapolation to three dimensions but, according to Mozley et al. (10), this is known to work well only in situations where the tumor morphology is simple and cannot be applied in a general manner. Therefore, we preferred to apply thresholds empirically estimated from measurements of interobserver variability, established at +55% for progression and −35% for response (27). The rationale behind this set of thresholds is that actual differences in lesion size can be hidden by the variability of the measure; for a difference in size to be measured significantly, it must necessarily be larger than this variability. This way of estimating thresholds is similar to the approach followed by Mozley et al. (10); the main difference is the application of the Geary-Hinkley transformation to model response evaluation.
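Applied to a single reference volume, the rule reduces to the sketch below. This is a simplified illustration that ignores new lesions and complete response; the threshold values are those of (27).

```r
# Sketch: VBR category from a volume change, using the empirical
# thresholds (+55% for progression, -35% for response).
vbr_category <- function(volume, reference) {
  change <- 100 * (volume - reference) / reference
  if (change >= 55)  return("PD")  # progressive disease
  if (change <= -35) return("PR")  # partial response
  "SD"                             # stable disease
}

vbr_category(16000, 10000)  # +60% -> "PD"
vbr_category(6000, 10000)   # -40% -> "PR"
```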
Data Analysis. All statistical analyses were performed with the
R software (28). The Fleiss kappa was computed by using the
irr package (29), and the Shapiro-Wilk test of normality was
computed by using the stats package. A z-test on original
and bootstrapped data was applied to assess the statistical sig-
nificance of differences in kfleiss.
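For two independent kappa estimates with known standard errors, such a z-test reduces to the sketch below; the paired and bootstrapped variants used in this study follow the same idea but account for the ratings being shared between comparisons.

```r
# Sketch: one-sided z-test for the difference between two Fleiss kappas,
# assuming independent estimates (a simplification of the tests used here).
kappa_z_test <- function(k1, se1, k2, se2) {
  z <- (k1 - k2) / sqrt(se1^2 + se2^2)
  p <- 1 - pnorm(z)                  # H1: k1 > k2
  c(z = z, p = p)
}

kappa_z_test(0.85, 0.091, 0.72, 0.088)  # VBR* vs RECIST point estimates
```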
TABLE 1. Fleiss Kappa and Standard Errors (Between Parentheses) According to Different Types of Consensus and Criteria

Consensus on TL | Consensus on NL | TL-Only Response | RECIST           | VBR
—               | —               | X                | 0.72 (0.808)/SuA | 0.76 (0.090)/SuA
—               | —               | —                | 0.72 (0.090)/SuA | 0.76 (0.090)/SuA
X               | —               | X                | 0.72 (0.088)/SuA | 0.85 (0.091)/APA
X               | —               | —                | 0.72 (0.088)/SuA | 0.85 (0.092)/APA
X               | X               | —                | 0.81 (0.089)/APA | 0.95 (0.093)/APA

APA, almost perfect agreement; NL, new lesion; RECIST, Response Evaluation Criteria in Solid Tumors; SuA, substantial agreement; TL, target lesion; VBR, volume-based response.
Consensus on TL, consensus on NL, and TL-only response are marked with "X" when applied and "—" when not. The interpretation of values according to Landis and Koch (30) (slight/fair/moderate/substantial/almost perfect agreement) is appended to each value. The highest value is 0.95.
TABLE 2. Reproducibility Coefficient for ln(SLV), ln(ΔSLV_BL), and ln(ΔSLV_NADIR) for Consensual and Nonconsensual Selection of Target Lesions

Consensus | ln(SLV) | ln(ΔSLV_BL) | ln(ΔSLV_NADIR)
No        | 2.081   | 1.167       | 0.915
Yes       | 0.593   | 0.678       | 0.677

SLV, sum of lesion volumes; ΔSLV_BL, change in SLV with respect to baseline (BL); ΔSLV_NADIR, change in SLV with respect to nadir.
TABLE 3. Contingency Table of Responses Provided by RECIST and VBR*

               RECIST
VBR*  | PR | SD | PD | Total
PR    | 14 | 15 |  0 | 29
SD    |  0 |  8 |  2 | 10
PD    |  2 |  2 | 23 | 27
Total | 16 | 25 | 25 | 66

PD, progressive disease; PR, partial response; RECIST, Response Evaluation Criteria in Solid Tumors; SD, stable disease; VBR, volume-based response.
VBR* was computed after consensus on target lesions and new lesions, whereas RECIST was calculated in the classical way (without consensus).
RESULTS
Response Agreement
Table 1 shows the inter-reader agreement measured by the Fleiss kappa (24) for different types of consensus and response criteria. The table shows that the use of volume provides higher levels of agreement than RECIST. It also shows that, unlike one-dimensional measurements, VBR does benefit from the consensus on TLs (RECIST and RECIST* provide similar results). A paired-sample z-test for the mean of paired differences between VBR* and RECIST rejected the null hypothesis in favor of the alternative hypothesis of higher agreement for VBR* (z-value = 1.77; P value = .038). This means that the difference found between both approaches is statistically significant. A z-test was also applied to compare VBR* and VBR (ie, to measure the effects of the consensus only). In this case, the agreement with consensus was significantly higher than without consensus at a 10% level (z-value = 1.46; P value = .072). Finally, when comparing RECIST versus VBR, differences were not statistically significant in favor of volume (z-value = 0.31; P value = .378).

We have also performed z-tests on 1000 bootstrapped combinations of the original ratings. In this case, the z-test rejected the null hypotheses with P value <.01 for all three comparisons performed before.
Table 2 shows reproducibility coefficient values (25) for the sum of lesion volumes (SLV), its change with respect to baseline (ΔSLV_BL), and its change with respect to nadir (ΔSLV_NADIR). A Shapiro-Wilk test of normality showed that the distributions of these parameters were not normal. To reduce these deviations from normality, the values were transformed by using the logarithmic function. The success of this correction was verified by a new application of the normality test, which showed that the null hypothesis could no longer be rejected. Table 2 shows that the reproducibility coefficient values decrease after consensus (ie, the inter-reader agreement increases), which is consistent with the results presented in Table 1. These results provide further evidence of the reduction in variability when there is an agreement on the lesions to be used for response assessment.
Comparison Between VBR and RECIST
Sometimes, the response agreement may be meaningless. For example, if thresholds are too high, only one type of response and 100% agreement are obtained; however, this result is meaningless because the set of thresholds is not discriminative. As RECIST is a criterion already validated and broadly accepted by the scientific community, a new response criterion is expected to provide similar results for a large population (but not exactly the same, for the new criterion to be interesting). The kappa value between RECIST and VBR* on TLs and NLs was equal to 0.53 (SE = 0.152), which corresponds to a moderate agreement according to the classification by Landis and Koch (30).
Besides the kappa value itself, it is interesting to analyze the contingency table used for its calculation. Table 3 shows that the contingency matrix presents some asymmetries deserving further analysis. Although the lack of ground truth precludes an analysis of accuracy, a comparison between methods can be performed. When rows and columns of the contingency matrix are ordered by decreasing order of response, the matrix becomes close to upper triangular, which means that VBR* presents a positive bias with respect to RECIST. In other words, for each category of response provided by VBR*, the response provided by RECIST is equal or inferior. For example, when a partial response is provided by VBR*, RECIST provides either partial response or stable disease; when VBR* is stable disease, RECIST says stable disease or progressive disease. Of course, this does not hold for the last row because the matrix is not strictly upper triangular.

TABLE 4. Response Agreement (as a Percent of Matches) According to the Number of Lesions Used for Computing Response

1L  | 2L  | 4L   | DIFF(1-2) | DIFF(2-4)
93% | 75% | 100% | 0%        | 83%

1L/2L/4L, 1/2/4 lesion(s) chosen by both raters; DIFF(1-2), different numbers of lesions chosen, varying between 1 and 2; DIFF(2-4), different numbers of lesions chosen, varying between 2 and 4. The best value is 100%.
Number of Lesions
Some articles in the literature (31,32) suggest that the number of lesions is important when computing response. More specifically, the use of a larger number of lesions has been shown to reduce interobserver variability. We have analyzed this aspect for our data by using volume measurements before consensus to assess whether this is actually a source of variability. This is interesting because the number of selected lesions is a parameter of lesion selection that could potentially be controlled during the consensus process, which would be an interesting extension of the workflow presented in this article. For example, if there is a difference in the number of selected lesions between readers, and if an influence of such differences on the variability is proven, the IR could ask one of the readers to select additional lesions to match the number of lesions selected by the second reader.

We have compared the responses between readers by taking two readers at a time. The results presented in Table 4 confirm that the number of lesions used for computing response is important for reducing variability. In cases where four lesions were chosen, 100% response agreement was achieved. This supports the claims of Moskowitz et al. (31), who propose a minimum of five lesions to avoid response misclassification. These results are also in full agreement with Darkeh et al. (32), who discourage the use of fewer than four lesions because inter-rater discrepancies may be introduced.
RECIST Compliance
Before consensus, we found 12% of response evaluations to be noncompliant with RECIST, whereas after consensus all evaluations were RECIST-compliant. The following RECIST nonconformities were found:

• Number of lesions per organ higher than two. For example, one reader selected three pulmonary TLs, but the maximum number of lesions per organ established by RECIST is two.
• Lesion location mistakes. For example, pulmonary lesions labeled as mediastinal.
• Measurability problems. For example, one of the readers chose lesions with ill-defined contours next to the pulmonary artery, which could not be measured accurately.

Noncompliant RECIST evaluations (and sources of deviation) have already been reported in the literature (33), and these results are therefore consistent with previously published findings.
DISCUSSION
Table 1 suggests that the RECIST-based response agreement is independent of the consensus on TLs. Indeed, the choice of the same TLs does not guarantee a lower variability. This situation may seem paradoxical, but the following example provides an explanation. Consider the lesion shown in Figure 7. This lesion is quite irregular, and the measured diameters are very different, which leads to high response variability even if all readers choose only this lesion as a target. On the other hand, the same Table 1 shows that a consensus on NLs improves the agreement, which is probably due to the strong effect of this type of lesion on the response.

Table 1 also shows that the consensus on TLs has an impact on the VBR agreement, whereas no differences were observed for RECIST. These differences may be explained by the higher reproducibility of volume measurements with respect to diameters (5,6,10). The inter-reader agreement for VBR* was significantly higher than for RECIST and VBR. This is in agreement with results published recently by Zhao et al. (13) and Kuhl et al. (12).
We have found a moderate agreement between VBR* and RECIST. However, this agreement is not a metric of performance because RECIST is not a gold standard for response. In fact, if a new response index provided exactly the same results as RECIST (k = 1), it would be completely useless. On the other hand, k = −1 is not desirable either, because it would mean that the new index says the opposite of RECIST; this in turn would mean that the proposed response index presents major drawbacks, which would preclude its use in clinical trials. Therefore, k = 0.53 seems to be a good compromise because it implies neither perfect agreement, nor completely opposed results, nor completely random differences.
Without consensus, the selection of lesions was noncompliant with RECIST requirements in 12% of cases. The existence of noncompliance has already been reported by Skougaard et al. (33), who found 46% of inter-reader discrepancies associated with nonconformities in the application of RECIST. In this study, 100% of evaluations were RECIST-compliant after consensus. This is an important aspect of the workflow explained in the section on the cloud-based setup system for quality control purposes.

Figure 7. Differences in the assessment of the longest axial diameter (LAD) between readers. The figure shows the slice containing the LAD (blue line) and the short-axis diameter (red line). (a) Reader #1, (b) Reader #2, and (c) independent reviewer.
In this article, we have applied cloud-based solutions to achieve a consensus on the selection of lesions. However, the proposed framework can be applied to the standardization of clinical trials in a broader sense. For example, lesion segmentation could be performed only on specific images of the whole set corresponding to a given time point. Another example is the use of similar visualization settings (zoom, window level, and so forth) for all readers. The number of lesions used for computing response could also be controlled because, as observed in the section on the number of lesions, this is a source of variability. These standardizations could potentially reduce the variability of measurements and can easily be included as part of the proposed workflow.
One limitation of this study is the low number of patients in the data set. This is the result of applying multiple image selection criteria when creating the data set, as described in the section on the data set. The idea was to remove the maximum number of sources of variability related to image characteristics in order to focus only on the effects of lesion selection. It is important to take into account that the statistical analyses were performed on 66 responses (three time points per patient, evaluated by three readers), which is an acceptable number of statistical units for such analyses. For the purposes of a pilot study like this one, the number of patients included was convenient because it avoided the complexity associated with managing large data sets and allowed the workflow to be tested promptly.
The consensus process necessarily increases the reviewing time. However, in some clinical centers, this additional time is spent anyway in discussions between radiologists and oncologists to agree on lesion selection. Another aspect to take into account is that, in general, physicians follow different and more time-consuming protocols when evaluating patients in the context of a clinical trial (use of specific software, different measurements, and filling of case report forms); therefore, the time invested in performing a consensus seems compatible with clinical trial protocols. Finally, this additional time may reduce the intersite variability, which in turn may eventually improve the quality of the generated statistical data.
In a real clinical context, several sites must be managed. The use of cloud solutions reduces the installation complexity because, unlike an LE approach, it is not necessary to perform specific installations at each site. However, to manage the consensus, additional DMs and IRs might be required. Clouds also simplify the workflow and the communication between stakeholders. For example, reports can be signed electronically by the adjudicator and transmitted immediately to the sponsor, without the need to exchange reports in paper format. Another example is the update of the sponsor's database: once the radiologist finishes an evaluation, the site can connect directly to the sponsor's database and update it with the results for the corresponding patient. In this way, data are immediately available for analysis. The same idea is applicable to images: once a scan is available for a specific patient, it can be imported into the system through a computer terminal and then transmitted to the sponsor to be included in a central database. This removes the need to ship images on physical media such as digital versatile disks (DVDs) or hard drives.
One of the main concerns with cloud solutions is confidentiality. To keep data confidential, the cloud infrastructure and data were centrally hosted in a data center managed by a vendor guaranteeing security and confidentiality. The data center provides early warning systems to identify potential attacks. These types of infrastructures are generally more secure than hospital infrastructures. In addition, our cloud service requires strong authentication to identify users and provides an audit trail to record the actions performed by authorized users. Our system is compliant with regulations such as 21 CFR Part 11 (FDA, electronic records and electronic signatures) and the International Conference on Harmonisation Good Clinical Practice guidelines, to guarantee the application of security, confidentiality, and safety best practices. Finally, the communication between the client and the server hosted by the data center is encrypted over the Secure Sockets Layer/Transport Layer Security protocol. We are aware that cloud security is a major concern because data are transferred across the network, and we therefore use state-of-the-art technology to make data transfer as safe as possible. We believe that, with the use of such technology, data are probably as well protected in a cloud infrastructure as in a local one.
CONCLUSIONS
In this article, we have set up a cloud-based system aimed at standardizing response evaluations in multicenter clinical trials. The use of cloud solutions proved to be an interesting and feasible strategy for standardizing response evaluations and reducing variability. The use of VBR increased the inter-reader agreement compared to RECIST and therefore seems to be a promising alternative for evaluating the response to treatment. For VBR, the consensus improved both the inter-reader response agreement and the RECIST compliance, and it is therefore an interesting phase to be considered in a clinical trial protocol. Finally, the consensus on NLs was shown to play an important role because of the strong impact of NLs on the global response, and it therefore deserves special attention as a topic for future research.
REFERENCES
1. Globocan 2012: Estimated incidence, mortality, and prevalence world-
wide in 2012. International Agency for Research on Cancer. Available
from: http://globocan.iarc.fr; 2012.
2. Therasse P, Arbuck SG, Eisenhauer EA, et al. New guidelines to evaluate
the response to treatment in solid tumors. J Natl Cancer Inst 2000 Feb;
92(3):205–216. Available from: http://jnci.oxfordjournals.org/content/92/
3/205.abstract.
3. Eisenhauer EA, Therasse P, Bogaerts J, et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur J Cancer 2009 Jan;45(2):228–247. Available from: http://www.ncbi.nlm.nih.gov/pubmed/19097774.
4. Erasmus JJ, Gladish GW, Broemeling L, et al. Interobserver and intraobserver variability in measurement of non–small-cell carcinoma lung lesions: implications for assessment of tumor response. J Clin Oncol 2003;21(13):2574–2582. Available from: http://jco.ascopubs.org/content/21/13/2574.
5. Suzuki C, Jacobsson H, Hatschek T, et al. Radiologic measurements of tumor response to treatment: practical approaches and limitations. Radiographics 2008;28(2):329–344. Available from: http://www.ncbi.nlm.nih.gov/pubmed/18349443.
6. Nishino M, Jagannathan JP, Ramaiya NH, et al. Revised RECIST guideline version 1.1: what oncologists want to know and what radiologists need to know. AJR Am J Roentgenol 2010 Aug;195(2):281–289. Available from: http://www.ajronline.org/doi/full/10.2214/AJR.09.4110.
7. Buckler AJ, Mozley PD, Schwartz L, et al. Volumetric CT in lung cancer: an
example for the qualification of imaging as a biomarker. Acad Radiol 2010
Jan;17(1):107–115. Available from: http://www.ncbi.nlm.nih.gov/pubmed/
19969254.
8. Buckler AJ, Mulshine JL, Gottlieb R, et al. The use of volumetric CT as
an imaging biomarker in lung cancer. Acad Radiol 2010 Jan 1;17(1):
100–106. Available from: http://www.academicradiology.org/article/
S1076-6332(09)00574-1/abstract.
9. Seyal AR, Parekh K, Velichko YS, et al. Tumor growth kinetics versus RECIST to assess response to locoregional therapy in breast cancer liver metastases. Acad Radiol 2014 May 12. Available from: http://www.ncbi.nlm.nih.gov/pubmed/24833565.
10. Mozley PD, Bendtsen C, Zhao B, et al. Measurement of tumor volumes im-
proves RECIST-based response assessments in advanced lung cancer.
Transl Oncol 2012 Feb;5(1):19–25. Available from: http://www.ncbi.nlm.
nih.gov/pubmed/22348172.
11. Quantitative Imaging Biomarkers Alliance [Internet]. Available from:
https://www.rsna.org/QIBA.aspx
12. Kuhl CK, Barabasch A, Dirrichs T, et al. Target lesion selection as a source of variability of response classification by RECIST 1.1. 2013 ASCO Annual Meeting. J Clin Oncol 2013;31(15 suppl): abstr 11077.
13. Zhao B, Lee SM, Lee H-J, et al. Variability in assessing treatment response: metastatic colorectal cancer as a paradigm. Clin Cancer Res 2014 Jul 1;20(13):3560–3568. Available from: http://www.ncbi.nlm.nih.gov/pubmed/24780294.
14. Padhani AR, Ollivier L. The RECIST criteria: implications for diagnostic ra-
diologists. Br J Radiol 2001 Nov 1;74(887):983–986. Available from: http://
bjr.birjournals.org/content/74/887/983.full.
15. Schuetze SM, Baker LH, Benjamin RS, et al. Selection of response criteria
for clinical trials of sarcoma treatment. Oncologist 2008 Jan 1;13(Suppl 2):
32–40. Available from: http://theoncologist.alphamedpress.org/content/
13/suppl_2/32.full.
16. Dodd LE, Korn EL, Freidlin B, et al. Blinded independent central
review of progression-free survival in phase III clinical trials: impor-
tant design element or unnecessary expense? J Clin Oncol 2008
Aug 1;26(22):3791–3796. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2654812&tool=pmcentrez&rendertype=abstract.
17. Shamsi K, Patt RH. Onsite image evaluations and independent image
blinded reads: close cousins or distant relatives? J Clin Oncol 2009 Apr
20;27(12):2103–2104; author reply 2104–2105. Available from: http://jco.ascopubs.org/content/27/12/2103.
18. Ford R, Schwartz L, Dancey J, et al. Lessons learned from independent
central review. Eur J Cancer 2009 Jan;45(2):268–274. Available from:
http://www.ncbi.nlm.nih.gov/pubmed/19101138.
19. Zhang JJ, Chen H, He K, et al. Evaluation of blinded independent central
review of tumor progression in oncology clinical trials: a meta-analysis.
Ther Innov Regul Sci 2012 Sep 7;47(2):167–174. Available from: http://
dij.sagepub.com/content/47/2/167.short.
20. Dodd LE, Korn EL, Freidlin B, et al. An audit strategy for progression-free
survival. Biometrics 2011 Sep;67(3):1092–1099. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3160504&tool=pmcentrez&rendertype=abstract.
21. Amit O, Mannino F, Stone AM, et al. Blinded independent central review of
progression in cancer clinical trials: results from a meta-analysis. Eur J
Cancer 2011 Aug;47(12):1772–1778. Available from: http://www.ncbi.
nlm.nih.gov/pubmed/21429737.
22. Armato SG, Meyer CR, McNitt-Gray MF, et al. The Reference Image Database to Evaluate Response to therapy in lung cancer (RIDER) project: a resource for the development of change-analysis software. Clin Pharmacol Ther 2008 Oct;84(4):448–456. Available from: http://www.ncbi.nlm.nih.gov/pubmed/18754000.
23. Herman GT, Zheng J, Bucholtz CA. Shape-based interpolation. IEEE Comput Graph Appl 1992;12.
24. Fleiss JL. Measuring nominal scale agreement among many raters. Psy-
chol Bull 1971;76(5):378–382. Available from: http://content.apa.org/
journals/bul/76/5/378.
25. Barnhart HX, Haber MJ, Lin LI. An overview on assessing agreement with
continuous measurements. J Biopharm Stat 2007 Jan;17(4):529–569.
Available from: http://www.ncbi.nlm.nih.gov/pubmed/17613641.
26. Royston JP. An extension of Shapiro and Wilk's W test for normality to large samples. Appl Stat 1982;31:115–124.
27. Beaumont H, Labatte J-M, Souchet S, et al. Determination of thresholds
for the assessment of response to therapy relying on volume-based tumor
burden in pulmonary CT images. ASCO Meet Abstr 2013;31(15_suppl):
e18508. Available from: http://meeting.ascopubs.org/cgi/content/abstract/
31/15_suppl/e18508.
28. R Development Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing; 2013. Available from: http://www.r-project.org.
29. Gamer M. Package irr. Available from: http://cran.r-project.org/web/
packages/irr/; 2012.
30. Landis JR, Koch GG. The measurement of observer agreement for cate-
gorical data. Biometrics 1977 Mar;33(1):159–174. Available from: http://
www.ncbi.nlm.nih.gov/pubmed/843571.
31. Moskowitz CS, Jia X, Schwartz LH, et al. A simulation study to evaluate the
impact of the number of lesions measured on response assessment. Eur
J Cancer 2009 Jan;45(2):300–310. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2652848&tool=pmcentrez&rendertype=abstract.
32. Darkeh MHSE, Suzuki C, Torkzad MR. The minimum number of target lesions that need to be measured to be representative of the total number of target lesions (according to RECIST). Br J Radiol 2009 Aug;82(980):681–686. Available from: http://www.ncbi.nlm.nih.gov/pubmed/19366735.
33. Skougaard K, McCullagh MJD, Nielsen D, et al. Observer variability in a
phase II trial—assessing consistency in RECIST application. Acta Oncol
2012 Jul;51(6):774–780. Available from: http://www.ncbi.nlm.nih.gov/
pubmed/22432439.