ARTICLE IN PRESS — JID: COMPNW [m3Gdc; July 22, 2015; 16:6]
Computer Networks xxx (2015) xxx–xxx
Contents lists available at ScienceDirect
Computer Networks
journal homepage: www.elsevier.com/locate/comnet
On the impact of knowledge extraction and aggregation on
crowdsourced annotation of visual artworks
Jasper Oosterman a,∗, Jie Yang a, Alessandro Bozzon a, Lora Aroyo b, Geert-Jan Houben a
a Delft University of Technology, P.O. Box 5, 2600 AA, Delft, The Netherlands
b VU University, De Boelelaan 1081a, 1081 HV, Amsterdam, The Netherlands
Article info

Article history:
Received 31 October 2014
Revised 14 June 2015
Accepted 3 July 2015
Available online xxx

Keywords:
Crowdsourcing
Cultural heritage
Artwork annotation

Abstract
Cultural heritage institutions increasingly provide online access to their collections. To make collections of visual artworks suitable for access and retrieval, the represented visual objects (e.g. plants or animals) need detailed and thorough annotations. Crowdsourcing has proven a viable tool to cater for the pitfalls of automatic annotation techniques. However, differently from traditional photographic image annotation, the artwork annotation task requires workers to possess the knowledge and skills needed to identify and recognise the occurrences of visual classes. The extent to which crowdsourcing can be effectively applied to artwork annotation is still an open research question. Based on a real-life case study from the Rijksmuseum Amsterdam, this paper investigates the performance of a crowd of workers drawn from the CrowdFlower platform. Our contributions include a detailed analysis of crowd annotations based on two annotation configurations, and a comparison of these crowd annotations with those from trusted annotators. In this study we apply a novel method for the automatic aggregation of local (i.e. bounding box) annotations, and we study how different knowledge extraction and aggregation configurations affect the identification and recognition aspects of artwork annotation. Our work sheds new light on the process of crowdsourcing artwork annotations, and shows that techniques that are effective for photographic image annotation cannot be straightforwardly applied to artwork annotation, thus paving the way for new research in the area.
© 2015 Published by Elsevier B.V.
1. Introduction

Visual artwork¹ annotation recently emerged as an important multidisciplinary discipline, fuelled by the growing needs of cultural heritage institutions. Galleries, Libraries, Archives, and Museums (GLAMs) have the mission of ensuring that the art produced by mankind is properly preserved, described, catalogued, and made accessible to the public. To unlock the value of their artwork collections, GLAMs must enable and facilitate browsing and retrieval for a broad yet unforeseen variety of users, having an unknown variety of needs. To this end, textual annotations are used to describe the instances of classes of visual objects, e.g. objects, plants, animals and human body parts, represented in the artworks. Image annotation is a notoriously hard problem for computers to solve but, thanks to recent progress in computer vision techniques [2,3], it is now possible to correctly identify

∗ Corresponding author. Tel.: +31152786346. E-mail addresses: j.e.g.oosterman@tudelft.nl, jasper.oosterman@gmail.com (J. Oosterman).
¹ A visual artwork is an artistic expression represented on a flat surface (e.g., canvas or sheet of paper) in the form of a painting, printing or drawing [1].
http://dx.doi.org/10.1016/j.comnet.2015.07.008
1389-1286/© 2015 Published by Elsevier B.V.
Please cite this article as: J. Oosterman et al., On the impact of knowledge extraction and aggregation on crowdsourced anno-
tation of visual artworks, Computer Networks (2015), http://dx.doi.org/10.1016/j.comnet.2015.07.008
the presence of several visual classes in photographic images. Alas, such techniques cannot be applied in cultural heritage collections due to the unique nature of visual artworks [1]. Therefore, GLAMs employ professionals, mostly art historians, to analyse artworks and create annotations about the occurrence of visual classes of interest; but the quality and extent of their annotation work is subject to temporal, monetary, and knowledge limitations.

Crowdsourcing has emerged as a viable solution to complement (or substitute) computer vision algorithms for many "difficult" visual analysis tasks, including the annotation of visual content [4–7]. Crowdsourced artwork annotation is a representative example of a Crowdsourced Knowledge Creation (CKC) task [8], i.e. a class of crowdsourcing tasks where workers are requested to stress their high-level cognitive abilities (e.g., knowledge synthesis, data interpretation), and draw from their experience or education, in order to solve problems for which a unique, factual solution might not exist. In the case of artwork annotation, a crowd worker must possess the knowledge and skills required to: (1) understand the abstract, symbolic, or allegorical interpretation of the reality depicted in the artwork, to identify the occurrences of visual classes; and (2) recognise the type of such visual classes, describing them with an expressive text.

Crowdsourcing of artwork annotation is still an open research challenge. Domain-specific experts are hard and expensive to recruit. Knowledgeable contributors (e.g. pro-amateurs and enthusiasts) might be present in anonymous human computation marketplaces, but must be located, engaged and motivated.

The identification and recognition of visual classes are aspects of artwork annotation that can be influenced by the CKC process design. The knowledge extraction step is of great importance, as it requires work interfaces that guide, but do not constrain, the high-level cognitive and memory processes of contributors. This is in contrast to traditional "computational" crowdsourcing tasks, where a well-defined work interface guarantees execution efficiency and consistency. Also, the aggregation of knowledge from individual workers must account for the broad diversity of opinions and interpretations that a crowd knowledge elicitation task might imply. This work studies the crowdsourcing of artwork annotation, and addresses the following research questions:
RQ1: Can non-professional annotators from crowdsourcing platforms provide high quality artwork annotations?
RQ2: To what extent can the extraction and aggregation steps of a crowdsourced knowledge creation process influence the identification and recognition aspects of visual artwork annotation?

To answer these questions, we partnered with the Rijksmuseum Amsterdam², and set up an extensive evaluation campaign aimed at testing the performance of workers from human computation platforms when asked to identify and recognise occurrences of visual object classes in artworks. We assembled a collection of 80 Rijksmuseum prints, and focused on the "flowers" class, to target an area of expertise that is likely present in a general population. Three trusted assessors created a reference annotation ground-truth to assess the number, type, and location of flowers depicted in the dataset³. To test the effect of the CKC's knowledge extraction step, we evaluated two annotation configurations: an Artwork–centric configuration, where textual annotations about visual objects are specified for the whole artwork; and a Class–centric configuration, where occurrences of visual objects are identified using bounding boxes with distinct textual annotations.

The experiment was performed on the CrowdFlower human computation marketplace, and involved a crowd of 235 workers. For each knowledge extraction configuration, we tested the impact of different aggregation methods on the identification and recognition performance. We analyse the quality of annotations provided by crowd workers to study the richness of their vocabulary, and its overlap with annotations created by domain experts.

The main contribution of this paper is a study of how extraction and aggregation methods affect the annotation quality of visual artworks in an Artwork–centric and a Class–centric configuration. To enable our study we created a novel algorithm for aggregating annotations in the Class–centric configuration.

Results confirm the unique nature of the artwork annotation problem, showing how crowdsourcing techniques that are effective for photographic image annotation cannot be straightforwardly applied. The high percentage of workers who dropped out during recruitment testifies to the challenges related to the identification of visual objects, even when as simple as flowers. The experiments highlight the impact that the CKC process can have on the identification and recognition quality: an Artwork–centric configuration enhances recognition aspects, and comes with a richer annotation vocabulary; on the other hand, a Class–centric configuration guarantees better identification performance, but poorer recognition and vocabulary.

The remainder of the paper is structured as follows. In Section 2 we present the related work. Section 3 details and exemplifies the complexities of visual object identification and recognition in artwork annotation. Next, Section 4 describes the design and execution of our evaluation. Sections 5 and 6 present and discuss experimental results. Section 7 concludes and sets the scene for future work.

² http://rijksmuseum.nl. The Rijksmuseum Amsterdam is the largest and most prestigious museum in the Netherlands.
³ The dataset and the results of this study are available for download from http://bit.ly/CN-SI-artworks.

2. Related work

Recent literature [1,9] shows how, in contrast to photographic images, which carefully represent the real world, artworks provide less and typically inconsistent visual information (texture, colour, depth, etc.); this, together with the lack of sufficiently large training sets and the presence of a sizeable number of visual classes to be recognised, are among the main causes for ineffective automatic artwork annotation algorithms. Hybrid image annotation methods [10,11] emerged as a promising solution to reduce costs and error rate by complementing automatic techniques with crowdsourced annotations. Inspired by these works, our paper focuses on
crowdsourcing visual object occurrence annotations, a form of annotation where strict quality control and effective aggregation techniques are needed to cater for the natural difficulties related to the drawing of geometric coordinates to bound image objects [12].

Previous works investigated how crowds can support the artwork annotation process [13–15]. For instance, "The Steve project" [13] studied crowd tagging of collections from more than 12 USA-based museums and compared crowd and professional taggers. The authors found crowd annotators, drawn from museum attendees, to use a different vocabulary than professional ones, but that such annotations were effective to improve the retrieval of the artworks. In [15], crowds without prior domain knowledge were engaged to annotate prints, while gaming mechanisms stimulated them to learn about the domain. Based on the gaming mechanism introduced by Luis von Ahn's popular ESP-game [16], several games with a purpose [17,18] have been proposed to collect artwork metadata. Differently from previous works, which exploited domain-knowledgeable volunteers or museum attendees, our approach explicitly focuses on crowds drawn from human computation platforms, i.e. anonymous individuals for which no assumption can be made about their familiarity with artworks or their domain knowledge.

Recent studies [4,19,20] compared the performance of expert and human annotators from human computation platforms. All studies agreed on the potential, scalability and reduced costs of crowds compared to experts, but also mentioned that additional actions, such as repetition and worker qualification, are needed to obtain high quality annotations. Standard aggregation techniques for crowd results include removing results failing qualification tests and subsequently using majority voting to combine the results [4].

Most studies, however, focused on tasks that required workers to have only basic skills and common knowledge. There is demand for more complex tasks, for example requiring creativity [21]. Only recently, several works advocated for specialised crowdsourcing techniques for knowledge creation tasks, such as nichesourcing [22] or community sourcing [23,24]. For instance, in the context of domain-specific ontology and taxonomy creation, Noy et al. and Chilton et al. [25,26] found a crowd performance of around 80% correctness which, although being lower than that of domain experts, was very promising and above all scalable.

The results described in this paper are rooted in our previous work on crowdsourced knowledge creation. The early work [27] defined the problem of artwork annotation in the context of the Rijksmuseum Amsterdam in the Netherlands. The subsequent works [8,28] introduced an experimental methodology, and reported preliminary results mainly focused on Artwork–centric knowledge extraction. To the best of our knowledge, our work is the first one that systematically studies the performance of Artwork–centric and Class–centric knowledge extraction techniques in human computation platforms, and assesses their identification and location performance with respect to a high-quality ground truth.

3. Challenges in artwork annotation

This section elaborates on the major challenges related to the annotation of visual objects in artworks. We describe the process currently employed at the Rijksmuseum Amsterdam, highlighting typical annotation requirements, and exemplifying how the compliance to such requirements is hindered by the nature of visual artworks.

3.1. The need for professional annotation of artworks

The Rijksmuseum has a collection of over 1 million artworks, 700,000 of which are prints, that the museum wants to make accessible for online consumption. The museum currently conducts the following digitization process. First, a high quality digital representation is created. Then, a team of 6 professional, in-house annotators describes the artwork. During discussions with the museum's curator of the online collection, the museum's interest regarding artwork annotation was stated: descriptions of both the art-historical aspects (such as creator, material and date of creation) and the depicted visual objects, such as depicted persons, buildings, flora and fauna. People using or studying their collection often try to answer questions regarding a visual object class X such as: "How many prints depict X?"; "Which are the types of X most commonly represented in a given artistic period?"; or "How are X depicted in different genres and periods?".

To enable these retrieval scenarios, annotations must possess the following 2 properties: (1) coverage, i.e. all instances of the visual object classes represented in the artwork should be identified, (possibly) located, and annotated; and (2) expressiveness, i.e. visual objects should be recognised and annotated with texts that serve a broad spectrum of knowledge levels; this is to allow both common and expert users to access the collection by using the most familiar language.

Professional annotators are given 25 min to retrieve an artwork from storage, analyse it, find relevant information and publications online and in their library, and describe it by entering annotations and references in collection management software⁴. Twenty minutes are devoted to the description of art-historical aspects, while only 5 min are allocated to visual object classes. The latter involves two steps: the identification of instances of a given class, and subsequently, their recognition and association with a representative label.

The flower annotation case study. The size and importance of the Rijksmuseum make it an exemplary case of a cultural heritage institution striving for high-quality annotation of digital (digitized) collections.

Let us consider the print in Fig. 1, drawn from the 700K prints collection. According to the requirements defined before, professional annotators from the Rijksmuseum must identify all the instances of depicted flowers, and describe them with their common and scientific names. The print depicts a woman holding flowers sitting in a flower-decorated cart pulled by dogs; the print contains 71 flower instances, distributed in at least 8 areas (highlighted with black bounding boxes), not considering the flowers decorating the cart or the decorations resembling flowers on the cart's wheels.

While focusing on prints and flower annotations, we argue this use case to be representative of a broader class of

⁴ The allocated time is constrained by budgetary considerations and cannot be significantly increased.
Fig. 1. "Scent" by Petrus Cool.

artwork annotation problem: on the one hand, it reflects a typical annotation subject, i.e. a non-photographic representation of (a possibly symbolic) reality containing many, diverse visual objects. On the other hand, "flower" is an example of a common visual object class, which is not related to historical aspects, and for which frequency and specificity are a major bottleneck in the manual annotation process. The next sections describe two main challenges that affect the creation of qualitative annotations.

3.2. Challenges in visual object identification

Visual artworks often provide an abstract, symbolic, or allegorical interpretation of reality. In such a context, the identification of all visual object occurrences is a very time-consuming and error-prone task, complicated by: (1) the lack of colours or details; (2) the abstract or stylised representation of the visual class occurrence; (3) the size, density, or composition of the depicted visual objects; and (4) subjective or personal interpretations. In such conditions, in order to identify all the occurrences of visual objects, an annotator must show both commitment to the annotation task (to account for the potentially high number of occurrences), and some degree of experience in the art domain, to be able to correctly infer the content of visual artworks.

3.3. Challenges in visual object recognition

To correctly recognise visual objects, and describe them with expressive text, domain-specific expertise is often required. Let us consider the domain of flowers: arguably, everyone is exposed, to some extent, to knowledge about flowers: in the mind of the writers, it is difficult to imagine someone not being able to recognise the red flower in Fig. 2 as a rose. However, going beyond such a shallow description requires domain-specific knowledge. Would the reader be able to tell a Rosa canina from a Rosa multiflora?⁵ And which terminology would the reader be able to use?

In our case study, the Rijksmuseum is interested in annotating each flower instance at different levels of specificity, according to the flowers (formally: plants) taxonomy.

Fig. 2. An extract of the flower taxonomy for box 8 of Fig. 1.

It ranges from the top element Kingdom, for example Plantae (all plants), to the most specific element Species, e.g. Hyacinthus orientalis, one plant. Fig. 2 depicts part of the flower description taxonomy. Above the species level is genus, which describes multiple flower species, e.g. the Hyacinthus genus. Above genus is family, which describes multiple flower genera, e.g., the Asparagaceae family. Annotations could use both the common, e.g. Dutch hyacinth, or the botanical, e.g. Hyacinthus orientalis, flower name.

4. Experimental design

Our goal is to study the annotation coverage and accuracy of non-professional annotators drawn from a crowd of anonymous workers, and to analyse their performance under different knowledge creation process configurations. To this end, we instrumented an extensive evaluation campaign, discussed in this section.

Section 4.1 describes the experimental dataset; Section 4.2 introduces the two annotation configurations subject of our study; Section 4.3 provides details about the setup and quality control mechanism of the experiments performed on the CrowdFlower platform; Section 4.4 describes the procedure employed for the normalisation of crowd annotations; Section 4.5 describes the gathering of annotations from domain experts; finally, Section 4.6 introduces the aggregation methods and evaluation metrics used for quantitative assessment.

4.1. Experimental dataset

In collaboration with personnel of the Rijksmuseum Amsterdam, we selected 80 prints containing at least one flower instance. We then instrumented a ground-truth creation task, aimed at defining, for each print, the exact number and location of each contained flower instance.

We recruited 3 trusted annotators. They were selected from the SEALINCMedia project⁶ staff; we considered individuals familiar with the targeted collections and with the technology used for image annotation, namely the

⁵ Both Rosa canina and Rosa multiflora belong to the rosa genus and, through the eyes of a non-expert, they look pretty similar.
⁶ https://sealincmedia.wordpress.com. SEALINCMedia is part of the Dutch national program COMMIT.
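The taxonomy levels used for flower recognition in Section 3.3 (Kingdom down to Species, each with a botanical and possibly a common name) can be sketched as a small data structure. The ranks and the hyacinth example come from the text; the class and field names below are our own illustration, not the museum's data model:

```python
from dataclasses import dataclass
from typing import Optional

# Taxonomic ranks from most general to most specific, as in Section 3.3.
RANKS = ["kingdom", "family", "genus", "species"]

@dataclass
class FlowerLabel:
    """An annotation label at a given level of specificity (illustrative)."""
    rank: str                     # one of RANKS
    botanical: str                # botanical name, e.g. "Hyacinthus orientalis"
    common: Optional[str] = None  # common name, e.g. "Dutch hyacinth"

    def specificity(self) -> int:
        """Higher value means a more specific annotation."""
        return RANKS.index(self.rank)

# The running example from the paper: the Dutch hyacinth.
hyacinth = [
    FlowerLabel("kingdom", "Plantae"),
    FlowerLabel("family", "Asparagaceae"),
    FlowerLabel("genus", "Hyacinthus"),
    FlowerLabel("species", "Hyacinthus orientalis", common="Dutch hyacinth"),
]

most_specific = max(hyacinth, key=FlowerLabel.specificity)
print(most_specific.botanical)  # Hyacinthus orientalis
```

A retrieval system built on such labels can then match a query at any rank, serving both common users ("Dutch hyacinth") and experts ("Hyacinthus orientalis").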
[Fig. 3: box plots per print difficulty class (Easy, Average, Hard): (a) Number of flowers (log scale); (b) Number of types; (c) Avg. bounding box size (log scale); (d) Avg. overlapped bounding box ratio.]
Fig. 3. Distributions of the number of flowers and types and bounding box statistics per print difficulty class based on the ground-truth data created by the
trusted annotators.
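The per-print statistics summarised in Fig. 3 (number of boxes, average box size, average overlap) can be derived directly from a set of bounding boxes. A minimal sketch, assuming axis-aligned boxes stored as (x, y, width, height) tuples — the representation is our assumption, not the paper's:

```python
from itertools import combinations

def _intersects(a, b):
    """True if two axis-aligned (x, y, w, h) boxes overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def box_stats(boxes):
    """Per-print bounding-box statistics in the spirit of Fig. 3.

    Returns (number of boxes, average box area, average number of
    other boxes each box overlaps with).
    """
    n = len(boxes)
    avg_area = sum(w * h for _, _, w, h in boxes) / n
    # Count overlapping pairs; each pair contributes to two boxes.
    pairs = sum(1 for a, b in combinations(boxes, 2) if _intersects(a, b))
    return n, avg_area, 2 * pairs / n

# Two overlapping boxes plus one disjoint box.
n, area, ov = box_stats([(0, 0, 2, 2), (1, 1, 2, 2), (10, 10, 1, 1)])
print(n, area)  # 3 3.0
```

Computed over the 80 ground-truth prints, such statistics yield the distributions plotted per difficulty class in Fig. 3.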
Annotorious library.⁷ Annotorious allows users to draw (square) bounding boxes on images. Each trusted annotator was instructed to identify, in each print, all flowers conforming to the following definition:

Definition. A flower is considered to be the flowering part of a plant with petals and distinguishable from leaves. A branch can have multiple flowers (but each of those flowers has the same name). A flower bud counts as a flower.

For each print, they were asked to independently: (1) draw a bounding box around each flower instance; and (2) count the number of different flower types depicted in the print. To guarantee maximum correctness and precision, annotators were given no time limits, and they were allowed to stop and resume at any time.

Upon completion of the independent annotation sessions, we gathered the results and created a new version of each print featuring the bounding boxes from each annotator. We then organised three deliberation sessions (in total 8 h, spread over 3 days) where annotators were asked to discuss their work. For each print in the dataset, they needed to agree on a unique set of bounding boxes, and on a unique number of flower types. For each bounding box, the annotators were asked to also agree on its location and size; they were asked to redraw each bounding box as accurately as possible, so as to guarantee that each flower occurrence was fully, but minimally, contained by it. Trusted annotators were also asked to provide comments and remarks about the issues they faced during the annotation process, including properties of the annotated prints, such as the presence of little flowers, low contrast between flowers and leaves, and flower orientation and overlap, that could have hindered the recognition activity.

These comments were used to support an additional discussion session aimed at classifying each print in the dataset according to the difficulty encountered in its annotation. We identified two difficulty dimensions, related to identification and recognition, which lead to three difficulty classes: easy, which identifies prints where flower instances are easy to identify and recognise; the class features 20 prints; average, which identifies prints where identification is easy, but recognition is hard — 29 prints; and hard, which contains prints where both identification and recognition are hard to perform — 31 prints. Notably, no print in the dataset contained flower instances that were easy to recognise but hard to identify.

This carefully crafted annotation work led to the creation of 1047 bounding boxes. On average, each of the 80 prints contains 13 flower instances (σ = 15, Max = 71) and 3 flower types (σ = 3, Max = 16). Figs. 3a–d depict the distribution, for each print in the dataset, of: (1) the number of identified flower instances; (2) the number of unique flower types; (3) the average size of the annotation bounding boxes; and (4) the average number of overlapping bounding boxes. Prints in the average difficulty class (easy to identify, hard to recognise) have, on average, the least number of flowers and overlapping bounding boxes. The annotation difficulty is strictly related to the average size of the bounding boxes. The number of flower types is similarly distributed across difficulty classes.

4.2. Experimental configurations

We consider two experimental configurations, each testing a different knowledge extraction modality. In the Artwork–centric configuration, crowd workers define annotations on the artwork as a whole; in contrast, the Class–centric configuration requires annotations to be defined for each visual object instance, thus including its exact location in the artwork. In both configurations, annotators are asked to identify and recognise the occurrences of objects in a given visual class. Then, for each configuration, we assess how different aggregation methods impact the quality of the resulting annotations.

Artwork–centric knowledge extraction. In this configuration, we ask annotators to analyse the artwork as a whole. Identification is performed by counting the occurrences of the visual objects of interest, and the number of object class types therein represented. Recognition also occurs globally, by asking the worker to provide labels for each distinct type of object identified in the print. The Artwork–centric

⁷ http://annotorious.github.io/. Annotorious is a JavaScript image annotation library developed in the context of the Europeana project. Europeana is a large European content aggregator for Cultural Heritage and strives to make Cultural Heritage more accessible.
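The baseline aggregation strategy recalled in Section 2 (discard the work of annotators who fail the qualification tests, then combine the remaining judgments by majority voting) can be sketched as follows. This is a generic illustration of that baseline, not the novel Class–centric aggregation algorithm introduced in this paper; the function and input shape are our own assumptions:

```python
from collections import Counter

def aggregate_labels(judgments):
    """Baseline crowd aggregation: filter out workers who failed the
    qualification test, then majority-vote on the remaining labels.

    `judgments` is a list of (label, passed_qualification) pairs,
    an input shape assumed here for illustration. Returns the
    winning label, or None if no valid judgment remains.
    """
    valid = [label for label, passed in judgments if passed]
    if not valid:
        return None
    # Majority voting: the most frequent label wins.
    return Counter(valid).most_common(1)[0][0]

votes = [("rose", True), ("tulip", True), ("rose", True), ("daisy", False)]
print(aggregate_labels(votes))  # rose
```

Such voting works for global, per-artwork labels; aggregating Class–centric annotations additionally requires reconciling workers' bounding boxes before their labels can be compared, which is what motivates the dedicated aggregation method studied here.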
-
393
.394
e395
396
-397
-398
-399
.400
-401
402
l403
-404
r405
s406
o407
e408
o409
o410
r411
412
-413
e414
y415
t416
r417
418
-419
s420
-421
e422
-423
424
425
426
427
428
h
-
f 429
r 430
r 431
432
e 433
434
o 435
436
437
- 438
. 439
- 440
e 441
s 442
- 443
e 444
e 445
- 446
- 447
- 448
e 449
450
- 451
- 452
s 453
o 454
455
s 456
e 457
Fig. 4. Artwork–centric annotation user interface. Static image left and an
notation fields right.
annotation configuration has as advantage the simplicity and
speed by which an annotator can perform the assigned task
The main drawback, on the other hand, is that workers ar
not allowed to provide information about where and how (in
terms of size, clarity, etc.) in the artwork a given object in
stance is represented. Likewise, labels are not directly associ
ated with occurrences of the visual object class, thus provid
ing no insights about their importance or role in the artwork
Fig. 4 depicts the user interface used in our evaluation for Art
work–centric annotation. The print to annotate is presented
on the left-hand side. The worker can open the image at ful
screen by clicking on it. The right hand side lists the anno
tation fields. Workers are requested to indicate the numbe
of flower instances, and the number of distinct flower type
they identify in the print. Three text fields are available t
provide up to three labels describing the flower types in th
print. To avoid biases in the knowledge extraction process, n
auto-suggestions functionality was provided: workers had t
manually specify flower names, both in their botanical o
common form.
To account for unidentified flower types, workers can re
port a print to contain only flower occurrences of imaginativ
nature (fantasy); or simply indicate their inability to specif
any suitable label (unable). In both cases, an additional tex
box is prompted to collect comments about the reasons fo
their judgment.
Class–centric knowledge extraction. This configuration improves on the Artwork–centric one by providing workers with the ability to pinpoint the occurrences of the sought visual object in the artwork. Fig. 5 depicts the user interface used to collect Class–centric annotations: using the Annotorious library, workers were asked to draw bounding boxes directly on the image. Labels are specified for each identified occurrence, thus allowing for more fine-grained and localised annotations. A bounding box is supposed to fully contain the identified flower occurrence. Like in the Artwork–centric user interface, the right-hand side provides a set of annotation fields that allow workers to report the number of occurrences of the objects of interest, and the number of distinct flower types for which a distinct label has been provided. An additional field is automatically filled with the number of bounding boxes drawn for the current print. Such fields are defined for quality control purposes, but are also used to further study the behaviour of crowd annotators.

Fig. 5. Class–centric annotation user interface. Interactive image on which bounding boxes are drawn left. Annotation fields appear per drawn bounding box.

Please cite this article as: J. Oosterman et al., On the impact of knowledge extraction and aggregation on crowdsourced annotation of visual artworks, Computer Networks (2015), http://dx.doi.org/10.1016/j.comnet.2015.07.008
4.3. Crowdsourced artwork annotation task design

The experiment was performed on CrowdFlower8, a human computation platform that recruits workers worldwide. We set up two annotation jobs, one for each annotation configuration. Each task required the annotation of five prints, and was rewarded with €0.18. Compensation had been defined after several pilot experiments, used to test-drive the annotation interfaces and to receive feedback on the monetary reward (which resulted in a 3 out of 5 score). The moderate sum was set so as to attract intrinsically motivated people while still giving some monetary appreciation. Workers were recruited from the Level 1 Contributors class of CrowdFlower9 to exclude workers known by CrowdFlower to have low quality.

Quality control mechanisms. Each annotation task was introduced by a description page, listing instructions about the requested input, an explanation of the user interface, examples of target annotations, and the same flower definition given to trusted annotators.

Each worker had to first perform a qualification task, as a necessary pre-condition to participate in the rest of the
8 http://crowdflower.com
9 Level 1 contributors are "high performance contributors who account for 60% of monthly judgments and maintain a high level of accuracy across a basket of CrowdFlower jobs."
J. Oosterman et al. / Computer Networks xxx (2015) xxx–xxx 7
ARTICLE IN PRESSJID: COMPNW [m3Gdc;July 22, 2015;16:6]
experiment. In the qualification task, workers were asked to annotate 5 prints, sampled from the ones belonging to the easy difficulty class and having an unambiguous number of flowers and flower types, as determined by the trusted annotators. Each task paid €0.18 upon completion. Workers were evaluated against the number of flowers and number of types defined in the ground truth; we allowed an off-by-one error to account for small counting errors. Workers that failed to correctly annotate more than one print in the qualification task were blocked from performing further tasks.
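The acceptance rule above can be sketched as follows; `passes_qualification` and its one-failure budget are illustrative names for the logic described in the text, not the authors' implementation.

```python
def count_ok(reported: int, truth: int, tolerance: int = 1) -> bool:
    # A reported count is accepted if it is within +/-1 of the ground truth.
    return abs(reported - truth) <= tolerance

def passes_qualification(answers, ground_truth) -> bool:
    # answers / ground_truth: {print_id: (num_flowers, num_types)}.
    # A worker fails if more than one qualification print is wrong.
    failures = 0
    for pid, (gt_flowers, gt_types) in ground_truth.items():
        rep_flowers, rep_types = answers[pid]
        if not (count_ok(rep_flowers, gt_flowers)
                and count_ok(rep_types, gt_types)):
            failures += 1
    return failures <= 1
```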
Successful workers were allowed to continue with other annotation tasks; to avoid learning bias, prints used in the qualification tasks were not shown twice to the same worker. For the same reason, workers that performed tasks in the Artwork–centric job were not allowed to perform tasks in the Class–centric job, and vice-versa. As an additional quality control mechanism, one hidden control question, a so-called honeypot, was added to each annotation task. Each control question contained a print with an unambiguous number of flowers and flower types, as determined by the trusted annotators. Workers were paid regardless of their performance on the control question. Prints were shown to workers in random order. Each worker could annotate every remaining print in the set, but only once. We designed the
task such that each print could be annotated by at most 10 annotators. If the overall accuracy of a worker across all executed tasks on control questions dropped below 60%, the annotations from that worker were discarded.

4.4. Annotation labels pre-processing

Labels produced underwent a pre-processing step aimed at providing a uniform label dictionary and, thus, supporting the subsequent analysis. Due to the employment of a worldwide workforce, labels were assumed to be provided in multiple languages, and to include spelling errors. To achieve maximum accuracy, each flower label has been manually mapped to a Wikipedia10 page using the search functionality on that site. If no corresponding page could be found, we used the label plus the word wiki on a regular search engine to find wikis in other languages. Subsequently we used the Show this page in other languages functionality of Wikipedia to revert back to the English form. DBPedia is a semantic knowledge base, based on information from Wikipedia; almost every page on the English Wikipedia has a counterpart page on DBPedia with the same content. For example, the DBPedia page about the Californian Rose is http://dbpedia.org/page/Rosa_californica. Thanks to the link with DBPedia, it has been possible to reconcile synonyms, and to automatically associate each flower label with the Genus, Species, or Family level. Each label was also classified as a common name or as a botanical name.

10 Although Wikipedia has information on many topics, the flower domain is very well represented, having most of the flower taxonomy present.

4.5. Label quality assessment

As the prior expertise of crowd workers is unknown, we assess the quality of the labels they provide. For this purpose, we sought botanical experts through our social and work environments, but also looked for candidates in our geographical surroundings. Initially we only contacted members of a garden-historical society, but unfortunately those members did not have time available. We then expanded our inquiries to the Wageningen University of Agriculture in the Netherlands and to friends, family and acquaintances who were known to be working with, or passionate about, flowers. Our efforts resulted in three plant-related researchers from Wageningen University, a former forest ranger and a practitioner in the flower domain, who volunteered their time to annotate our dataset collection with flower labels. We trusted their self-assessment of their capabilities after showing sample images from our print dataset. We let these domain experts annotate the images, and test the crowd labels with respect to the ones produced by these annotators with known domain expertise.

Two domain experts annotated all (80) prints. The other three annotated 24, 13 and 3 prints, respectively. In total, 186 flower labels were provided. The majority of all the labels provided by experts (80%) were defined at the genus level. The botanical form was used in 37% of the labels. In 57 annotation tasks (28%) at least one expert reported being unable to name any of the flowers in the print. Notably, for 12 prints (11 hard and 1 average) no expert was able to provide any label. The remaining 68 prints received on average 2.7 flower labels (σ = 1.45, Max = 7) each. The most frequently used labels have been Rose (20% of all labels), Lily (17%), Tulip (6%), Carnation (5%) and Iris (5%).

On average, experts respectively provided 2.40 (σ = 1.60), 2.40 (σ = 0.73), and 1.64 (σ = 1.47) unique labels for easy, average, and hard prints. The experts' vocabulary featured 59 distinct flower names. By reconciling botanical and common names of flowers, such as Helianthus and Sunflower, the number of unique labels decreases to 52. Fifty-six percent of the experts' vocabulary is expressed in botanical form. Looking at the distribution of unique labels, most were expressed at the genus specificity level (66% versus 34% at the species level).

4.6. Evaluation and aggregation

In this subsection we define the metrics used to assess the quality of individual annotations and define our aggregation methods for multiple annotations. Section 4.6.3 describes our novel bounding box aggregation algorithm.

4.6.1. Evaluation of individual annotations

We assess crowd workers' contribution quality by comparing their annotation results with the ground-truth created by trusted annotators. In the Artwork–centric configuration, we compare the number of flowers and the number of flower types. In the Class–centric configuration, for which we have the bounding box information, we define 3 additional metrics related to the precision of the bounding boxes.

Matching: given a ground truth bounding box gb, a bounding box b created by a worker is defined as matching if it overlaps with gb. If b matches multiple gb's, we take the closest one (measured by the L1 distance between the bounding box vectors) as the matched ground truth bounding box.
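The matching rule (overlap, ties broken by the L1 distance between box vectors) can be sketched as follows, assuming boxes are given as (x1, y1, x2, y2) tuples with x1 < x2 and y1 < y2; the function names are our own.

```python
def overlaps(a, b) -> bool:
    # Two axis-aligned boxes overlap iff they intersect on both axes.
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def l1_distance(a, b) -> float:
    # L1 distance between the two 4-dimensional box vectors.
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def match(worker_box, ground_truth_boxes):
    # Among all overlapping ground-truth boxes, pick the closest one;
    # return None when the worker box matches nothing.
    candidates = [gb for gb in ground_truth_boxes if overlaps(worker_box, gb)]
    if not candidates:
        return None
    return min(candidates, key=lambda gb: l1_distance(worker_box, gb))
```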
Centroid distance: If a worker bounding box b matches a ground truth bounding box gb, we define its centroid distance as

cDist = sqrt( ((gb_cX − b_cX) / gb_w)^2 + ((gb_cY − b_cY) / gb_h)^2 )

where gb_w, gb_h denote the width and height of the ground truth bounding box, and b_w, b_h denote the width and height of the matched bounding box. Intuitively, cDist measures the distance between the centres of the two bounding boxes; to account for differences in bounding box size, cDist is calculated by normalising the coordinate distances by the ground truth bounding box width and height.

Area overlap: If a worker bounding box b matches a ground truth bounding box gb, we define its area overlap as

aOverlap = (gb_a ∩ b_a) / (gb_a ∪ b_a)

where gb_a, b_a denote the areas of the ground truth bounding box and the worker bounding box. Note that for aOverlap we simplify the intersection/union of the geometric areas of the two bounding boxes using set operators. The value of aOverlap varies in the range (0, 1], in which 1 means that gb and b perfectly cover each other.

Notice that cDist and aOverlap measure different properties: a worker bounding box close to a ground truth bounding box (i.e., with a small value of cDist) might not necessarily cover a large portion of the area of that ground truth bounding box (i.e., a large value of aOverlap), and vice versa.
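Both metrics are straightforward to compute for axis-aligned boxes. A minimal sketch, under the same (x1, y1, x2, y2) convention as above (function names are our own):

```python
import math

def centroid(box):
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def c_dist(gb, b) -> float:
    # Centre distance, normalised by the ground-truth width and height.
    gw, gh = gb[2] - gb[0], gb[3] - gb[1]
    (gx, gy), (bx, by) = centroid(gb), centroid(b)
    return math.sqrt(((gx - bx) / gw) ** 2 + ((gy - by) / gh) ** 2)

def a_overlap(gb, b) -> float:
    # Intersection area over union area of the two rectangles.
    ix = max(0, min(gb[2], b[2]) - max(gb[0], b[0]))
    iy = max(0, min(gb[3], b[3]) - max(gb[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(gb) + area(b) - inter
    return inter / union
```

Two identical boxes give `c_dist` of 0 and `a_overlap` of 1, the perfect-cover case described above.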
4.6.2. Artwork–centric annotation aggregation

As is common in crowdsourcing, the works of several contributors are aggregated in order to produce a single "crowd truth".

Identification. The number of flowers and the number of flower types identified in a print are numerical variables. We therefore adopt traditional numeric functions as aggregation operators: (1) the median of the figures provided by each worker for a given print, which guarantees robustness to outliers; and (2) the maximum value provided by one of the annotators, which, on the other hand, can be seen as a conservative measure. We decided to experiment with the maximum function to account for the under-specification tendency of workers described in the previous section.

The ratios of erroneous number of flowers (ξ_flower) and number of types (ξ_types) reported for a print are defined as follows:

ξ_flower = |#FL_at − #FL_gt| / #FL_gt ;  ξ_types = |#TY_at − #TY_gt| / #TY_gt    (1)

where #FL_at and #TY_at are, respectively, the average number of flowers and the average number of flower types provided by workers, and #FL_gt and #TY_gt are the number of flowers and the number of flower types in the ground truth.

Recognition. We aggregate the labels provided by the crowd for each artwork using three different methods: (1) the union of all crowd labels; (2) labels specified by at least 2 crowd workers; and (3) labels specified by the majority of workers annotating a print.
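A minimal sketch of the two Identification operators and the error ratio of Eq. (1); here the ratio is computed from the aggregated count, and the function names are our own:

```python
from statistics import median

def aggregate(counts, method="median"):
    # Median is robust to outliers; maximum counters the workers'
    # tendency to under-specify the number of instances.
    return median(counts) if method == "median" else max(counts)

def error_ratio(aggregated: float, ground_truth: int) -> float:
    # xi = |#FL_at - #FL_gt| / #FL_gt (same form for flower types).
    return abs(aggregated - ground_truth) / ground_truth
```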
Fig. 6. An example of aggregating multiple individual bounding boxes: (a) individual, (b) aggregated.
4.6.3. Class–centric annotation aggregation

Identification. For this configuration we assess, as in the Artwork–centric one, both the number of flowers and the number of types using the median and maximum. Given the local nature of Class–centric annotations, we also address the problem of bounding box aggregation. This task is not trivial, as it is hindered by several factors, e.g.: (1) the usage of different pointing devices, e.g. mouse or touch screens, which affects the drawing action; (2) workers might not be able to correctly identify an entity instance; and (3) the presence of malicious or poorly motivated workers, who might incorrectly (e.g. by drawing random or very big bounding boxes) or partially perform the task at hand. To cater for such problems, we propose a novel method to aggregate the Class–centric annotations produced by multiple workers, each providing multiple bounding boxes for the same image, so that bounding boxes can be aggregated together to obtain a correct, high-quality identification. We note that a similar method [29] has been used for aggregating bounding boxes from multiple sources. However, that method assumes each image has only one object, i.e. each worker provides only one bounding box per image. See Fig. 6 for an example of the application of a bounding box aggregation method.

Fig. 7 summarises the steps that compose our method for bounding box merging. The bounding boxes defined for a print are first pre-processed to remove low quality ones, i.e. bounding boxes having an area bigger than 3σ (3 standard deviations) above the mean value of all the bounding boxes specified for the same image. The threshold of 3σ is chosen to retain as many bounding boxes as possible that could contribute to the aggregation. Such a heuristic is justified by the empirical observation that some users draw very large bounding boxes that contain multiple flowers, in contrast with other, more committed and precise users.

Then, the method requires the identification of all groups of bounding boxes in a print. Such a step can be modelled as a clustering task, where the goal is to find clusters of bounding boxes such that all bounding boxes within the same cluster target the same visual object occurrence, and bounding boxes of different clusters target different occurrences. We test the performance of three clustering techniques: (1) simple geometric clustering, where bounding boxes are defined as belonging to the same cluster if they overlap (also partially); (2) k-means; and (3) Gaussian Mixture Model (GMM) [30]. For the two unsupervised clustering methods, bounding box B_i is represented by the coordinates of its upper-left point p and its bottom-right point q, i.e., B = (p_x, p_y, q_x, q_y)^T.
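The pipeline can be sketched with the simple geometric clustering variant; the greedy connected-component grouping below is an illustrative stand-in for the clustering step, and the per-coordinate median plays the role of the representative-box derivation (boxes as (x1, y1, x2, y2) tuples, names our own).

```python
from statistics import mean, median, stdev

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def overlaps(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def aggregate_boxes(boxes):
    # Step 1: drop boxes whose area exceeds the mean by more than 3 sigma.
    if len(boxes) > 2:  # need a few samples to estimate the spread
        m, s = mean(map(area, boxes)), stdev(map(area, boxes))
        boxes = [b for b in boxes if area(b) <= m + 3 * s]
    # Step 2: group boxes into connected components of mutual overlap;
    # a box overlapping two existing clusters merges them.
    clusters = []
    for b in boxes:
        hit = [c for c in clusters if any(overlaps(b, o) for o in c)]
        merged = [b] + [o for c in hit for o in c]
        clusters = [c for c in clusters if c not in hit] + [merged]
    # Step 3: reduce each cluster to its per-coordinate median box.
    return [tuple(median(b[i] for b in c) for i in range(4))
            for c in clusters]
```

For the k-means/GMM variants, the number of clusters K could analogously be set to the maximum or median per-worker box count, as discussed next.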
Fig. 7. Steps of our Class–centric annotation aggregation method.

Note that for the unsupervised clustering methods, we need to explicitly set the number of clusters K. Due to the wide range of flower occurrences in all images, K is set differently for each print according to the annotations from workers. Different estimation strategies could be instantiated according to this principle. We compare: (1) the maximum number of bounding boxes drawn by one of the print annotators; (2) the median number among the workers; and (3) a value automatically obtained according to a model selection criterion such as the Bayesian information criterion (BIC) [31].

This clustering step is then followed by the derivation, for each cluster of bounding boxes, of a single, representative bounding box. Similarly as for the setting of the number of clusters K, we compare the performance of several aggregation methods, which, for each of the four bounding box vertexes, select (1) the maximum or (2) the median value among the ones of the bounding boxes in the cluster.

Recognition. To enable comparison, bounding box labels are managed as global artwork labels, and assessed using the same metrics as in the Artwork–centric configuration.

5. Results

This section discusses the outcomes of the experiments described in Section 4. For each configuration, we assess multiple annotation aggregation methods, and we analyse the resulting identification and recognition performance of crowd workers.

5.1. Artwork–centric knowledge extraction evaluation

A total of 151 workers started a qualification task, out of which 67 (44%) failed. Out of the remaining 84 workers, 40 (26%) decided to stop at the qualification task. The remaining 44 workers performed a total of 475 annotation tasks. Each worker annotated an average of 10.8 prints (σ = 7.6). Despite the high variance, each print was sufficiently annotated by an average of 5.9 workers (σ = 1.4).

5.1.1. Identification

For each annotation task, we analyse the number of flowers and flower types indicated by crowd workers. Table 1 describes the under-/over-specification of these values with respect to the ground-truth, broken down according to the difficulty class of the corresponding prints. Regardless of the difficulty class, workers tend to report fewer flowers than the ones actually contained in the print; hard prints, which on average contain more flowers, are also the ones for which this under-specification effect is more evident. A similar result is observable for hard prints for the reported number of flower types. Crowd workers also under-specified for easy and average prints, but the effect is less evident, especially for the number of types. Note that this trend is highly correlated with the number of flower instances in the print, but less correlated with the size of such instances (which is smaller on average than in easy prints).

5.1.2. Recognition

Workers provided a total of 461 flower name labels. Almost all labels were provided in English, except for some labels provided in Italian, Spanish and Dutch. All prints received at least one flower label from at least one worker. In 69 annotation tasks (14%) at least one crowd worker reported being unable to name any flowers in the print. Figs. 8 and 9 depict the distribution of workers' labels in the 3 print difficulty classes. Fantasy labels were equally distributed. Prints where multiple workers were unable are distributed as follows: easy (1), average (7), and hard (11). As expected, easy prints received on average more unique labels than both average and hard. We attribute the higher number of unique labels for hard prints to the higher number of depicted flower types.

Fig. 10a depicts the distribution of label specificity for prints belonging to the three difficulty classes. Genus labels and common names are consistently the most used by crowd workers (respectively 85% and 77% of all annotations). Family names are rarer, but mainly specified in easy prints. Workers' vocabulary included 74 distinct flower labels (58 after reconciliation of the botanical and common versions of the same flower name). On average, respectively 4.30 (σ = 2.90), 2.13 (σ = 1.22), and 4.03 (σ = 2.06) unique labels were provided for easy, average, and hard prints. Twenty-eight percent of flower labels were defined at species level, 67% at genus, and 5% at family level. Sixty-one percent of workers' vocabulary is expressed in botanical form. The labels most frequently used by crowd workers are Rose (33% of all labels), Lily (15%), Tulip (7%), Sunflower (6%) and Carnation (5%). Fig. 10b depicts the vocabulary size per worker. Four workers, who provided only unable and fantasy labels, are not reported. On average each worker has a vocabulary size of 5.4 (σ = 3.5, Max = 16) distinct labels. The vocabulary size is strongly correlated (c = 0.75, p ≪ 0.005) with the amount of labels a worker provided. To account for this we calculate a worker's label diversity using Shannon entropy, which is also shown in Fig. 10b (indicated by a line). Users who provide more labels also use such labels more often.

5.1.3. Aggregation

Table 2 compares the results, in terms of error ratio and broken down according to the difficulty class of the corresponding prints, obtained after applying the median and maximum aggregation functions. Performance is better (i.e. higher identification precision) for prints in the easy and average than in the hard difficulty class. On the other hand, workers perform worse on the identification of the number of flowers in the easy class than in the average class. Such a result is mostly due to 5 prints in the easy class, for which there is an error rate higher than 50%. Manually inspecting the annotations for these prints we observed that
Table 1
Average difference and under-/over-specification of number of flowers and types by workers in Artwork–centric annotation.

         Number of flowers                       Number of types
         Under  Equal  Above  Avg. diff.         Under  Equal  Above  Avg. diff.
Easy     80%    20%    0%     −5.35 ± 7.93       40%    40%    20%    −0.94 ± 1.76
Average  72%    21%    7%     −0.90 ± 0.96       24%    52%    24%    −0.11 ± 1.09
Hard     97%    0%     3%     −9.66 ± 11.14      71%    23%    6%     −1.52 ± 1.88

Table 2
Error rate for number of flowers and types per class for different aggregation methods in Artwork–centric annotations.

         # Flowers error ratio                   # Types error ratio
Method   Easy         Average      Hard          Easy         Average      Hard
Median   0.34 ± 0.30  0.14 ± 0.20  0.45 ± 0.25   0.11 ± 0.17  0.11 ± 0.23  0.30 ± 0.27
Maximum  0.11 ± 0.20  0.16 ± 0.25  0.41 ± 0.50   0.27 ± 0.54  1.27 ± 3.75  0.22 ± 0.25

Fig. 8. Percentage of workers that specified at least 1 (a) fantasy, (b) label, (c) unable across different print classes in Artwork–centric annotation.

Fig. 9. Number of (unique) labels on prints across different classes in Artwork–centric annotation: (a) number of labels, (b) number of unique labels.

Fig. 10. Analysis of labels in Artwork–centric annotations: (a) specificity of used labels per print class, (b) vocabulary size (points) and label diversity (line) per worker.

such errors are attributable to badly followed instructions (contrary to our flower definition, workers did not count flower buds) or to a wrong interpretation of a flower composition (some workers counted each individual flower of the plant, a hyacinth, while others counted them as a single flower).

5.1.4. Interpretation of the results

Workers tended to under-specify the number of flowers and flower types, especially for prints in the hard category. We attribute this result to the fatigue effect [32] that might occur when the number of visual class instances increases; we interpret the propensity of workers to conservatively estimate the number as an indication of genuine effort.

Despite no requirements from our side, a surprisingly large proportion of the labels (23% in total, and 61% of their vocabulary) used the botanical form of flower names. This result hints at two possible explanations: (1) the annotators knew the botanical name; or (2) annotators actively looked up the flower in (Web) knowledge bases to retrieve a suitable name. We interpret the result as a sign of knowledgeable and intrinsically motivated workers, who stood out from the crowd despite the task complexity and moderate reward.

No aggregation function consistently provided better performance. The maximum aggregation method performs better for the identification of the number of flowers. This result can be intuitively explained as follows: even when workers tend to under-specify the number of object instances, a single good performer suffices for a good-quality identification. On the other hand, the median has better performance in the identification of the number of flower types. Again, such a result can be motivated by the behaviour of crowd workers who,
Table 3
Average difference and under-/over-specification of number of flowers and types by workers in Class–centric annotations.

         Number of flowers                       Number of types
         Under  Equal  Above  Avg. diff.         Under  Equal  Above  Avg. diff.
Easy     75%    15%    10%    −3.70 ± 4.45       45%    20%    35%    −1.29 ± 2.33
Average  72%    21%    7%     −0.75 ± 0.86       45%    31%    24%    −0.26 ± 0.69
Hard     87%    3%     10%    −11.53 ± 14.83     71%    6%     23%    −1.63 ± 2.14

Table 4
Quality and number of bounding boxes compared to ground truth annotations in Class–centric annotations.

         Quality of workers                          Number of bounding boxes
         Matching ratio  cDist        aOverlap       Under  Equal  Above  Avg. diff.
Easy     0.70 ± 0.30     0.15 ± 0.14  0.23 ± 0.19    80%    10%    10%    −3.92 ± 4.54
Average  0.76 ± 0.29     0.25 ± 0.30  0.49 ± 0.20    72%    28%    0%     −1.20 ± 1.31
Hard     0.43 ± 0.28     0.33 ± 0.45  0.59 ± 0.23    100%   0%     0%     −12.59 ± 15.13
on average, were more often correct. In such a condition, using the majority of worker annotations is a better approach, which can compensate to some extent for the presence of outliers.
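The three label-aggregation rules of Section 4.6.2 (union, at-least-2, majority) can be sketched as follows; the names are illustrative, not the authors' implementation.

```python
from collections import Counter

def aggregate_labels(per_worker_labels, method="majority"):
    # per_worker_labels: one collection of labels per worker for a print.
    # Count each label once per worker, then apply the chosen rule.
    counts = Counter(l for labels in per_worker_labels for l in set(labels))
    if method == "union":
        return set(counts)
    if method == "at_least_2":
        return {l for l, c in counts.items() if c >= 2}
    # majority: more than half of the annotating workers gave the label
    return {l for l, c in counts.items() if c > len(per_worker_labels) / 2}
```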
5.2. Class–centric knowledge extraction evaluation

Eighty-four workers started the qualification tasks, out of which 21 (25%) failed. Of the remaining 63 workers, 17 (27%) decided to stop at the qualification task. The remaining 46 workers created 552 annotations. All labels were provided in English. On average each worker annotated 12.0 prints (σ = 17.1). Despite the high variance, each print was sufficiently annotated by, on average, 6.9 workers (σ = 1.37). A total of 3442 bounding boxes were drawn.

5.2.1. Identification

Table 3 reports the identification performance in terms of the number of flowers and the number of types for Class–centric annotations. Workers tend to define fewer flowers and fewer flower types than the ones actually contained in the prints, especially for the hard ones. On the other hand, average prints, which contain the least number of flowers and types, are the ones where workers perform best. Table 4 shows the quality of the created bounding boxes across different print classes. Matching ratio in this table denotes the ratio of matched bounding boxes out of the ones defined in the ground truth for the considered prints.

The better matching ratio is obtained for average prints, while hard prints have significantly worse workers' bounding box quality. Considering the reference ground-truth, this result suggests a correlation with the number of flowers in a print, as prints with fewer flowers feature a better matching ratio. cDist and aOverlap are, however, measures which are more influenced by the sizes of the ground truth bounding boxes: intuitively, it is easier for workers to more accurately position boxes for large flowers, although their overlap is not necessarily better (i.e. it is more difficult for workers to draw accurate bounding boxes).

In the right part of Table 4 we report the performance, in terms of under-/over-specification of the number of flowers, which can be achieved by counting the number of bounding boxes specified by workers. When comparing these results with the ones in Table 3, we observe how workers often (in 22% of the executed tasks) draw fewer bounding boxes than the ones they specify in the dedicated text field of the user interface for the same annotated print. Only in a handful of tasks (0.7%) did they draw more bounding boxes.

5.2.2. Recognition

Workers drew 3442 bounding boxes, of which 1583 (46%) contained a label that could be mapped to a DBPedia resource. 616 (18%) were annotated as fantasy, while for 1149 bounding boxes (33%) workers indicated they were unable to name the flower. The ratio of unable annotations is similarly distributed across print difficulty classes: respectively 34% in easy prints, 31% in average prints, and 37% in hard prints. The size of bounding boxes has no effect on the ability of workers to provide a label for a flower. The same applies to the number of flowers or flower types in the print.

Fig. 11a shows the specificity of the provided labels per print difficulty category. Both family names (the most generic) and botanical forms of species are rarely used. The large majority of labels, for all three difficulty classes, uses the genus specificity in common name form. For prints in the average difficulty class fewer species labels are used than in the other classes.

In total 1566 labels corresponding to flower names were provided. Workers featured a vocabulary of 45 distinct flower labels (43 after reconciliation of the botanical and common versions of the same flower name). On average, respectively 4.05 (σ = 3.78), 2.03 (σ = 1.12), and 2.97 (σ = 2.18) unique labels were provided for easy, average, and hard prints.

Crowd workers specified 24% of the labels at the species level, 69% at the genus level, and 7% at the family level of the flower taxonomy. Thirty-six percent of their vocabulary is expressed in botanical form. The labels most frequently used by workers are Rose (43%), Tulip (14%), Lily (11%), Daisy (6%) and Sunflower (4%). The botanical form is used in 6% of the labels.

5.2.3. Aggregation

Similarly to the analysis performed for the Artwork–centric configuration, in Table 5 we first report the
ledge extraction and aggregation on crowdsourced anno-
rg/10.1016/j.comnet.2015.07.008
Fig. 11. Analysis of labels in Class–centric annotations: (a) specificity of used labels per print class; (b) vocabulary size (points) and label diversity (line) per worker.
Table 5
Error rate for number of flowers and types per class for different aggregation methods in Class–centric annotation.

           Flowers error ratio                        Types error ratio
Method     Easy          Average       Hard           Easy          Average       Hard
Median     0.25 ± 0.29   0.12 ± 0.17   0.48 ± 0.27    0.17 ± 0.28   0.13 ± 0.23   0.40 ± 0.31
Maximum    0.12 ± 0.22   0.18 ± 0.24   0.46 ± 0.70    0.57 ± 0.87   1.11 ± 4.04   0.72 ± 0.75
Table 6
Error rates of different methods for setting the number of clusters K.

            Error rate                                  Error sign
Method      Easy          Average       Hard            Under   Equal   Above
Max         0.15 ± 0.28   0.11 ± 0.19   0.34 ± 0.26     32%     39%     29%
Median      0.23 ± 0.28   0.12 ± 0.16   0.49 ± 0.26     68%     30%     2%
BIC         0.96 ± 0.82   1.26 ± 0.97   0.59 ± 0.42     45%     5%      50%
performance of the different aggregation methods on the number of flowers and the number of flower types explicitly specified by workers. Median always outperforms maximum for the aggregation of the number of types. For the number of flowers, however, maximum outperforms median for both easy and hard prints, while having worse performance for average prints.

Next, we compare the performance of the three K estimation techniques per print difficulty class for the clustering step of our novel algorithm presented in Section 4.6.3. To this end, we define the error rate

ξ = |#BBs − #FLgt| / #FLgt

which measures the normalised difference between the K determined by a configuration and the ground truth. Table 6 reports the resulting performance. Simpler methods substantially outperform BIC. This result suggests that, for the purpose of estimating the number of clusters, the number of bounding boxes given by the workers is a less noisy signal than the overall coordinates of all input bounding boxes. Maximum outperforms median in both the easy and hard difficulty classes, while being only slightly better for average prints. The result can be justified by the fact that the average category has the lowest number of flowers. By comparing the figures in Table 6 and Table 5, we observe how using the maximum value is best both for aggregating the reported number of flowers and bounding boxes; however, for average and hard prints, estimating the number of flowers in a print using bounding boxes leads to more precise results.
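As a minimal illustration of the two aggregation functions and the error rate ξ defined above, the following sketch computes ξ for a print with 8 ground-truth flowers. The worker counts and the `aggregate_count` helper are hypothetical, not the paper's implementation:

```python
import statistics

def aggregate_count(counts, method="max"):
    """Aggregate per-worker flower counts into a single estimate K."""
    if method == "max":
        # counter workers' under-specification: take the largest reported count
        return max(counts)
    return round(statistics.median(counts))

def error_rate(k, n_ground_truth):
    """Normalised difference between the estimated K and the ground truth."""
    return abs(k - n_ground_truth) / n_ground_truth

# Hypothetical worker counts for a print containing 8 flowers
counts = [5, 6, 8, 4, 7]
print(error_rate(aggregate_count(counts, "max"), 8))     # 0.0
print(error_rate(aggregate_count(counts, "median"), 8))  # 0.25
```

Under-counting by most workers makes the median underestimate K, which is consistent with maximum winning on the number of flowers in Table 6.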
Table 7
Performance of different algorithm configurations.
Configuration Matching ratio cDist aOverlap
Geometric_median 0.46 ± 0.12 0.31 ± 0.20 0.45 ± 0.16
k-means_median 0.82 ± 0.22 0.28 ± 0.20 0.48 ± 0.20
GMM_median 0.81 ± 0.22 0.28 ± 0.20 0.49 ± 0.16
Geometric_max 0.48 ± 0.29 0.58 ± 0.51 0.28 ± 0.18
k-means_max 0.83 ± 0.21 0.49 ± 0.92 0.38 ± 0.18
GMM_max 0.83 ± 0.21 0.48 ± 0.92 0.39 ± 0.18
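The measures compared in Table 7 can be made concrete with a short sketch. The paper does not give the exact formulas, so the definitions below (plain Euclidean centroid distance and an IoU-style area overlap) are assumptions for illustration, not the authors' implementation:

```python
def centroid(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def centroid_distance(a, b):
    """Euclidean distance between the centres of two boxes (cDist-like)."""
    (ax, ay), (bx, by) = centroid(a), centroid(b)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def area_overlap(a, b):
    """Intersection area over union area (aOverlap-like, IoU)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)

gt = (10, 10, 30, 30)    # hypothetical ground truth box (x1, y1, x2, y2)
agg = (12, 11, 31, 30)   # hypothetical aggregated worker box
print(centroid_distance(gt, agg))  # ~1.581
print(area_overlap(gt, agg))       # ~0.816
```

With such definitions, a larger representative box can still "match" the ground truth while scoring a worse overlap, which mirrors the median-versus-maximum trade-off described below.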
Being the best aggregation method, we select the value returned by maximum as an input for the next steps. Table 7 compares the performance, in terms of matching ratio, cDist and aOverlap, of the different algorithm configurations that can be obtained by varying clustering techniques and strategies for picking the representative.

For all clustering techniques, median performs significantly better than maximum for both cDist and aOverlap, but with a slightly worse matching ratio. Such a result can be explained by the fact that larger representative bounding boxes are more likely to match with ground truth bounding boxes, but will also feature bigger and less accurate areas. Among the tested clustering techniques, the ones based on unsupervised learning clearly outperform the simple geometric clustering, while there is no significant difference between k-means and GMM.

To conclude, we investigate how the performance of GMM and k-means varies according to the print difficulty class, and report the results in Fig. 12 (red for k-means and blue for GMM). It can be observed how GMM is less effective for easy prints,
Fig. 12. kmeans_median (red) and GMM_median (blue) performance on images of different difficulty classes: (a) matching ratio; (b) centroid distance; (c) area overlapping.
while more effective for hard prints. Technically, the result could be explained by the nature of GMM, a more flexible version of k-means that models the variance and covariance of the bounding box coordinate vectors. The coordinates of the ending point when drawing a bounding box for a small flower are more influenced by the starting point than for a large flower, which could make GMM perform better than the simpler k-means clustering on hard prints; on the other hand, being less robust than k-means in this respect, GMM performs worse on easy prints, where the effect of small pointing errors is more evident. We note, however, that none of the performance differences in the three print classes are statistically significant.
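The cluster-then-pick-a-representative scheme can be sketched with a stdlib-only k-means over box centres and a coordinate-wise median representative. This is a hedged sketch: the paper's actual algorithm (Section 4.6.3) may differ in initialisation, distance handling, and the GMM variant, and all data here is hypothetical:

```python
import statistics

def kmeans_assign(points, k, iters=20):
    """Minimal Lloyd's k-means over 2D box centres; returns a cluster
    index per point. Centroids are seeded with the first k points
    (a deterministic choice for this sketch only)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        for j, p in enumerate(points):
            assign[j] = min(
                range(k),
                key=lambda ci: sum((a - b) ** 2 for a, b in zip(p, centroids[ci])),
            )
        for c in range(k):
            members = [p for j, p in enumerate(points) if assign[j] == c]
            if members:
                centroids[c] = [statistics.fmean(dim) for dim in zip(*members)]
    return assign

def representative(boxes):
    """Consensus box: coordinate-wise median of (x1, y1, x2, y2)."""
    return tuple(statistics.median(dim) for dim in zip(*boxes))

# Hypothetical worker boxes around two flowers; K = 2, e.g. the maximum
# of the workers' reported flower counts
boxes = [(10, 10, 30, 30), (12, 11, 29, 31), (11, 9, 31, 30),
         (60, 60, 90, 95), (58, 62, 91, 94)]
centres = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]
assign = kmeans_assign(centres, 2)
merged = [representative([b for b, a in zip(boxes, assign) if a == c])
          for c in range(2)]
# merged[0] == (11, 10, 30, 30); merged[1] == (59.0, 61.0, 90.5, 94.5)
```

Swapping the k-means step for a Gaussian mixture (which additionally models per-cluster covariance) gives the GMM variant compared above.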
5.2.4. Interpretation of the results

In this configuration, workers were asked both to count the number of flowers in the print and to draw bounding boxes around them (see Section 4); a counter in the user interface reported the number of currently drawn bounding boxes. Despite the presence of a self-defined anchor, workers often drew fewer bounding boxes than the number of identified flowers. The result can be interpreted in two ways. On the one hand, as there was a direct relation between the number and size of flowers in prints, workers could have simply ignored very small instances; on the other hand, due to fatigue, workers may have simply stopped annotating even if they knew that more flowers existed in the print. An analysis performed over several annotation tasks suggests that both explanations are correct to some extent. The result hints at the importance of intrinsically motivated workers for Class–centric artwork annotation, as a high number of instances to annotate might be discouraging.

Annotation specificity hints at a similar conclusion: 36% of the annotations, but only 6% of their vocabulary, are expressed in botanical form. This suggests less knowledgeable workers, but also workers less available for online research (e.g. in knowledge bases), possibly due to the additional effort needed to draw bounding boxes, which diminishes their motivation to extend their domain knowledge.

Our method for bounding box aggregation shows promising results. The significantly improved aggregation results (up to 45% improvement in matching ratio, 20% in bounding box distance, and 30% in area overlap) achieved by the GMM and k-means algorithms demonstrate the need for sophisticated aggregation techniques. Due to workers generally under-specifying the number of flowers and flower types in a print, the maximum aggregation function is better suited for the estimation of the number of flower instances which, in turn, provides better identification (matching ratio) performance. In contrast, the location of bounding boxes (centre and area) can be better achieved using a consensus-based function such as median.

These results demonstrate how, in this context, and contrary to traditional image annotation techniques, better outcomes can be achieved by selectively using methods not based on crowd consensus. This is supported by recent work on crowd disagreement [33].

6. Discussion

This section elaborates on the results reported in Section 5 with respect to the research questions defined in the introduction.

6.1. Research question 1

In this section we provide answers to RQ1: Can non-professional annotators from crowdsourcing platforms provide high quality artwork annotations?

6.1.1. Identification

Section 5.1 and Section 5.2 provide quantitative evidence of the effectiveness of crowd workers in identifying and locating flower instances. Despite the moderate reward and the demanding nature of the annotation tasks, our experiments attracted a considerable number of skilled and motivated workers; they often matched the identification performance of our trusted annotators, especially on prints of easy and average difficulty (matching rates of 89% and 84%, respectively); hard prints led to worse, but still satisfactory figures (a matching rate of 59%). Even with the Class–centric configuration, which was more labour intensive, workers achieved good location accuracy (bounding box matching rates of 70%, 76% and 43% for easy, average and hard prints, respectively). These results are indicative of the willingness and ability of crowds to identify visual classes in artworks. Indeed, the identification of elements in images is a task that is familiar to crowd workers and requires little to no domain knowledge. On the other hand, artworks often present additional complexities with respect to photographic images (e.g. abstract, symbolic, or allegoric interpretations): the results of our experiments are in line with the outcomes
Table 8
Comparison of number of common and new labels, split by print difficulty class and aggregation methods, in Artwork–centric
annotations.
Matching label New label
Method Easy Average Hard Easy Average Hard
All 1.30 ± 0.80 0.79 ± 0.50 1.00 ± 0.79 3.00 ± 2.88 1.25 ± 2.35 3.45 ± 2.35
Lab.freq. ≥ 2 0.95 ± 0.60 0.68 ± 0.55 0.65 ± 0.67 0.65 ± 1.09 0.29 ± 0.46 0.65 ± 0.81
Majority 0.70 ± 0.66 0.64 ± 0.56 0.40 ± 0.50 0.15 ± 0.37 0.18 ± 0.39 0.25 ± 0.55
Table 9
Label comparison per print difficulty class for different aggregation methods in visual class annotation in Class–centric annotation.

                 Matching label                             New label
Method           Easy          Average       Hard          Easy          Average       Hard
All              1.21 ± 1.24   0.79 ± 0.57   1.05 ± 1.00   2.84 ± 2.98   1.18 ± 1.06   2.35 ± 2.39
Label freq. ≥ 2  1.11 ± 1.24   0.64 ± 0.62   0.80 ± 0.89   1.79 ± 2.04   0.89 ± 0.88   1.70 ± 1.42
Majority         0.95 ± 1.18   0.64 ± 0.62   0.80 ± 0.89   1.53 ± 1.58   0.89 ± 0.83   1.45 ± 1.19
of other recent studies [25,26], and provide additional evidence on the suitability of crowdsourcing as an accurate tool for cultural heritage content annotation.

6.1.2. Recognition

In terms of recognition performance, workers consistently provide a high number of labels and show a rich vocabulary. In Table 8 and Table 9 we compare, for each experimental configuration, the number of unique labels which are provided by both the domain experts and the crowd (Matching label) and the number of unique labels provided by the crowd but not by the experts (New label).

The vocabulary size of crowd workers (58 and 43 for the Artwork–centric and Class–centric configurations, respectively, compared to 52) is comparable to the one of the experts, and, for both configurations, we observed how workers often used labels with lower specificity (genus and family) in the flower classification taxonomy. This result suggests familiarity with the domain-specific vocabulary. On the other hand, it can also be interpreted as an indicator of the potential information needs of crowd workers, who prefer using laymen terminology to describe botanical entities: workers were allowed to look up flower names online, but they deliberately chose to specify common names at a lower level of specificity.

The previous interpretation is supported by the following observation. In both configurations (Artwork–centric and Class–centric), and even using the most conservative label aggregation policy (i.e. using all crowd labels), experts and crowd workers share a limited vocabulary: 34%, 28%, 34% (Artwork–centric) and 43%, 44%, 52% (Class–centric), respectively, for easy, average, and hard prints. The size of the shared vocabulary only slightly decreases (on average) with stricter aggregation conditions. On the other hand, the decrease is more evident in terms of new labels (right part of Table 8 and Table 9): the number of new labels introduced by crowd workers is relatively consistent across aggregation methods and print difficulty.
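The three label aggregation policies appearing in Tables 8 and 9 can be sketched as follows. The policy names and their exact threshold semantics are inferred from the table rows, so this is an assumption rather than the paper's implementation:

```python
from collections import Counter

def aggregate_labels(worker_labels, policy="all"):
    """Aggregate per-worker label sets under three policies; the policy
    names mirror the rows of Tables 8 and 9 (semantics inferred)."""
    counts = Counter(label for labels in worker_labels for label in set(labels))
    n_workers = len(worker_labels)
    if policy == "all":
        return set(counts)                               # every crowd label
    if policy == "freq2":
        return {l for l, c in counts.items() if c >= 2}  # label frequency >= 2
    if policy == "majority":
        return {l for l, c in counts.items() if c > n_workers / 2}
    raise ValueError(policy)

# Hypothetical labels from three workers for one print
workers = [{"rose", "tulip"}, {"rose"}, {"rose", "lily"}]
# "all" keeps rose, tulip and lily; "freq2" and "majority" keep only rose
```

Stricter policies prune rare labels first, which is why the New label columns shrink faster than the Matching label columns as the policy tightens.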
Both crowd workers and experts showed a similar tendency to have low agreement on their labels. This is an interesting phenomenon, also observed in other knowledge-intensive content annotation use cases (e.g. medical [33]). The result suggests the need for more articulated annotation
campaigns, possibly organised in workflows [34] that interleave automatic and human operations. While this is the subject of future work, we can envision the following annotation flow: crowd labels are used to instrument a Web retrieval step, where images of flowers associated with the provided label are collected. Such images can then be used in another crowdsourcing task as a comparison term, to visually verify the similarity of the labelled flower instance with real-world examples.

6.2. Research question 2

In this section we provide answers to RQ2: To what extent can the extraction and aggregation steps of a crowdsourced knowledge creation process influence the identification and recognition aspects of visual artwork annotation?

6.2.1. Extraction

The experimental evaluation shows that the adoption of different knowledge extraction interfaces has a relevant effect on the identification and recognition performance. Such an effect is not uniformly distributed across print annotation difficulties, but it allows for interesting considerations.

By comparing the figures of Table 1 (Artwork–centric) and Table 3 (Class–centric) we observe how the presence of the bounding box functionality, which should push workers to a more precise identification of flower instances, does not result in a significant difference (less than 10%) in the distribution of under- and over-specification of the number of flowers. However, we observe that using drawn bounding boxes as a way to count the number of flowers does lead to significantly better results (see Tables 2, 5 and 6). This is especially observable with more difficult prints: a 21% and 24% decrease of the error rate for average and hard prints, respectively, using the optimal aggregation method. On the other hand, the identification performance concerning the number of flower types shows a different trend, as workers in Artwork–centric annotation achieve a lower error than in Class–centric annotation. These results suggest that the identification aspect could benefit from an annotation interface that benefits from both global (Artwork–centric) and local
(Class–centric) knowledge extraction interfaces, to be used according to the difficulty of the print to annotate.

Different observations can be made for the recognition aspect. The presence of a bounding box forces workers into providing a label for each identified visual class instance. Despite the availability of a "Don't Know" option for labels, workers often provided the same annotation even for flowers having clear visual differences. Moreover, by comparing Fig. 10 with Fig. 11, we observe that the labels provided in the Class–centric configuration are more frequently of a generic nature (genus and family) compared to the Artwork–centric configuration. A higher proportion of labels is also specified using the common name. The vocabulary size of workers in the Class–centric configuration is also smaller. These results suggest that, while a Class–centric configuration can help obtaining higher-granularity annotations, the overhead required for drawing bounding boxes might penalise the recognition capabilities of workers. A possible explanation of this result can be attributed to the well-known fatigue effect [32], which often occurs with repetitive tasks. Due to fatigue, workers could be led to provide wrong labels, thus introducing noise. On the other hand, the cost of providing a "Don't Know" annotation was a single click, considerably lower than typing a word or copying and pasting it. We were not able to find a comprehensive explanation for the different recognition performance with the Class–centric configuration. We rely on future work to obtain a better understanding of the impact of the cognitive load of an annotation interface on the quality of the retrieved annotations.

6.2.2. Aggregation

The experiments show that the aggregation step can impact the quality of the visual class identification results. The maximum and median functions find different optimal applications. The former allowed a better estimation of the number of flowers, while the latter was more suited to estimate the number of flower types contained in the print.

Once more, we can explain these results as a consequence of a fatigue effect [32] that emerged with workers. This was most apparent for images with many small instances in the Class–centric configuration. Some workers drew fewer bounding boxes than the number of flowers they had counted and reported. Countermeasures to the fatigue effect are also studied in the field of crowdsourcing, for example in [35], which studied the effect of inserting micro-breaks during tasks. Our results show that the adoption of a different aggregation function, e.g. maximum, can provide satisfactory results without the need for additional task execution time and, consequently, cost.

Altogether, the results discussed in this sub-section clearly indicate how artwork annotation demands different aggregation techniques with respect to photographic image annotation: consensus-based knowledge aggregation techniques [4,16,17] need to be supported by other methods, to counter some of the additional visual complexity of artworks.

7. Conclusion

In this paper we report the results of an evaluation, conducted in collaboration with the Rijksmuseum Amsterdam, and aimed at studying how different knowledge extraction and aggregation configurations affect the identification and recognition aspects of artwork annotation. We instrumented two knowledge extraction configurations: an Artwork–centric design, where textual annotations about visual objects are specified for the whole artwork; and a Class–centric design, where occurrences of visual objects are identified using bounding boxes with distinct textual annotations. To support the Class–centric design, we proposed a novel bounding-box aggregation algorithm. Then, we experimented with different annotation aggregation methods, and tested their impact on identification and recognition performance.

We engaged 235 workers from a crowdsourcing platform, and asked them to annotate 80 Rijksmuseum prints of varying annotation difficulty: easy, where both identification and recognition of flower instances are easy; average, where identification is easy and recognition is difficult; and hard, where both identification and recognition are difficult.

Both knowledge extraction configurations (Artwork–centric and Class–centric) resulted in satisfactory identification performance (Sections 5.1.1, 5.2.1 and 6.1.1). For tasks of easy and average difficulty, crowd workers can achieve an identification performance comparable to trusted annotators. However, as print difficulty increases, the performance of crowd workers lowers considerably.

In terms of recognition performance (Sections 5.1.2, 5.2.2, and 6.1.2), we observed that the crowd provided a rich vocabulary with little overlap with respect to domain experts, regardless of the knowledge extraction configuration. In the Class–centric configuration more workers provide a single label than in the Artwork–centric configuration. These results suggest that, while a Class–centric configuration can help obtaining higher-granularity annotations, the work overhead required for drawing bounding boxes might penalise the recognition capabilities of workers.

Crowd workers consistently provide labels at varying levels of specificity (species, genus, family) and show their familiarity with domain-specific (i.e. botanical) names. The crowd provides more distinct labels per image than domain experts. This suggests an opportunity for the creation of annotation sets that are complementary to the ones of experts, and that can accommodate a broader variety of information needs. On the other hand, the generally low agreement calls for further studies, in order to better assess the quality of each annotation.

Our experiments with different annotation aggregation functions (Sections 5.1.3, 5.2.3 and 6.2.2) show performance diversities. In terms of identification performance, the maximum function outperforms the median, countering the workers' tendency to under-specify the number of identified instances in complex artworks. On the other hand, aggregating the location of bounding boxes (centre and area) can be better achieved using a consensus-based function such as median.

These results show how: (1) the adoption of different knowledge extraction configurations and aggregation methods influences both the identification and recognition performance; (2) artwork annotation demands different aggregation techniques than the ones used for photographic image annotation (e.g. image-centric annotations and majority voting).
While promising, these results were obtained by studying a single knowledge domain. Further investigations are needed in order to assess the impact of the crowdsourced knowledge creation process in areas of knowledge that are less common in the general population (e.g. annotation of birds, castles, etc.). As part of future work we also plan to test the impact of other steps of the Crowd Knowledge Creation process, e.g. the discovery of expert workers from the crowd to dynamically assign annotation tasks to the most suited performer, also proposed as a challenge in "The Future of Crowd Work" [34]. Other future work includes the assessment of our novel algorithm for aggregating bounding boxes in other contexts and on other datasets.
Acknowledgements

This publication was supported by the Dutch national program COMMIT. We would like to thank the members of the SEALINCMedia worktable and in particular the Rijksmuseum Amsterdam for their support.
References

[1] G. Carneiro, Artistic image analysis using graph-based learning approaches, IEEE Trans. Image Process. 22 (8) (2013) 3168–3178. http://dx.doi.org/10.1109/TIP.2013.2260167.
[2] Q. Le, Building high-level features using large scale unsupervised learning, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 8595–8598. http://dx.doi.org/10.1109/ICASSP.2013.6639343.
[3] Q. Wang, K. Boyer, Feature learning by multidimensional scaling and its applications in object recognition, in: 2013 26th Conference on Graphics, Patterns and Images (SIBGRAPI), 2013, pp. 8–15. http://dx.doi.org/10.1109/SIBGRAPI.2013.11.
[4] S. Nowak, S. Rüger, How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation, in: Proceedings of the International Conference on Multimedia Information Retrieval, MIR '10, ACM, New York, NY, USA, 2010, pp. 557–566. http://doi.acm.org/10.1145/1743384.1743478.
[5] T. Yan, V. Kumar, D. Ganesan, Crowdsearch: exploiting crowds for accurate real-time image search on mobile phones, in: Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services, MobiSys '10, ACM, New York, NY, USA, 2010, pp. 77–90. http://doi.acm.org/10.1145/1814433.1814443.
[6] N. Sawant, J. Li, J.Z. Wang, Automatic image semantic interpretation using social action and tagging data, Multimedia Tools Appl. 51 (1) (2011) 213–246. http://dx.doi.org/10.1007/s11042-010-0650-8.
[7] S. Park, G. Mohammadi, R. Artstein, L.-P. Morency, Crowdsourcing micro-level multimedia annotations: the challenges of evaluation and interface, in: Proceedings of the ACM Multimedia 2012 Workshop on Crowdsourcing for Multimedia, CrowdMM '12, ACM, New York, NY, USA, 2012, pp. 29–34. http://doi.acm.org/10.1145/2390803.2390816.
[8] J. Oosterman, et al., Crowd versus experts: nichesourcing for knowledge intensive tasks in cultural heritage, in: Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web, 2014, pp. 567–568. http://dx.doi.org/10.1145/2567948.2576960.
[9] E.L. Spratt, A. Elgammal, Computational beauty: aesthetic judgment at the intersection of art and science, CoRR abs/1410.2488 (2014). http://arxiv.org/abs/1410.2488.
[10] J. Whitehill, T.-f. Wu, J. Bergsma, J.R. Movellan, P.L. Ruvolo, Whose vote should count more: optimal integration of labels from labelers of unknown expertise, in: NIPS'09, 2009, pp. 2035–2043.
[11] P. Welinder, S. Branson, P. Perona, S.J. Belongie, The multidimensional wisdom of crowds, in: Advances in Neural Information Processing Systems, 2010, pp. 2424–2432.
[12] H. Su, J. Deng, L. Fei-Fei, Crowdsourcing annotations for visual object detection, in: Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[13] J. Trant, B. Wyman, Investigating social tagging and folksonomy in art museums with steve.museum, in: Proceedings of the WWW'06 Collaborative Web Tagging Workshop, 2006.
[14] A. Ellis, D. Gluckman, A. Cooper, A. Greg, Your paintings: a nation's oil paintings go online, tagged by the public, Museums and the Web (2012).
[15] M.C. Traub, J. van Ossenbruggen, J. He, L. Hardman, Measuring the effectiveness of gamesourcing expert oil painting annotations, in: M. de Rijke, T. Kenter, A. de Vries, C. Zhai, F. de Jong, K. Radinsky, K. Hofmann (Eds.), Advances in Information Retrieval, Lecture Notes in Computer Science, 8416, Springer International Publishing, 2014, pp. 112–123. http://dx.doi.org/10.1007/978-3-319-06028-6_10.
[16] L. von Ahn, L. Dabbish, Labeling images with a computer game, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '04, ACM, New York, NY, USA, 2004, pp. 319–326. http://doi.acm.org/10.1145/985692.985733.
[17] R. Gligorov, L.B. Baltussen, J. van Ossenbruggen, L. Aroyo, M. Brinkerink, J. Oomen, A. van Ees, Towards integration of end-user tags with professional annotations, in: Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, 2010.
[18] C. Wieser, F. Bry, A. Bérard, R. Lagrange, Artigo: building an artwork search engine with games and higher-order latent semantic analysis, in: First AAAI Conference on Human Computation and Crowdsourcing, 2013.
[19] A. Kittur, E.H. Chi, B. Suh, Crowdsourcing user studies with Mechanical Turk, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '08, ACM, New York, NY, USA, 2008, pp. 453–456. http://doi.acm.org/10.1145/1357054.1357127.
[20] R. Snow, B. O'Connor, D. Jurafsky, A.Y. Ng, Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, Association for Computational Linguistics, Stroudsburg, PA, USA, 2008, pp. 254–263. http://dl.acm.org/citation.cfm?id=1613715.1613751.
[21] A. Kittur, Crowdsourcing, collaboration and creativity, ACM Crossroads 17 (2) (2010) 22–26.
[22] V. Boer, M. Hildebrand, L. Aroyo, P. Leenheer, C. Dijkshoorn, B. Tesfa, G. Schreiber, Nichesourcing: harnessing the power of crowds of experts, in: Knowledge Engineering and Knowledge Management, Lecture Notes in Computer Science, 7603, Springer Berlin Heidelberg, 2012, pp. 16–20. http://doi.acm.org/10.1007/978-3-642-33876-2_3.
[23] K. Heimerl, B. Gawalt, K. Chen, T. Parikh, B. Hartmann, Communitysourcing: engaging local crowds to perform expert work via physical kiosks, in: Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems, ACM, 2012, pp. 1539–1548.
[24] A. Bozzon, M. Brambilla, S. Ceri, M. Silvestri, G. Vesci, Choosing the right crowd: expert finding in social networks, in: Proceedings of the 16th International Conference on Extending Database Technology, EDBT '13, ACM, New York, NY, USA, 2013, pp. 637–648. http://doi.acm.org/10.1145/2452376.2452451.
[25] N.F. Noy, J. Mortensen, M.A. Musen, P.R. Alexander, Mechanical Turk as an ontology engineer?: using microtasks as a component of an ontology-engineering workflow, in: Proceedings of the 5th Annual ACM Web Science Conference, WebSci '13, ACM, New York, NY, USA, 2013, pp. 262–271. http://doi.acm.org/10.1145/2464464.2464482.
[26] L.B. Chilton, G. Little, D. Edge, D.S. Weld, J.A. Landay, Cascade: crowdsourcing taxonomy creation, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '13, ACM, New York, NY, USA, 2013, pp. 1999–2008. http://doi.acm.org/10.1145/2470654.2466265.
[27] C. Dijkshoorn, J. Oosterman, L. Aroyo, G. Houben, Personalization in crowd-driven annotation for cultural heritage collections, in: Workshop and Poster Proceedings of the 20th Conference on User Modeling, Adaptation, and Personalization, Montreal, Canada, July 16–20, 2012, 2012. http://ceur-ws.org/Vol-872/patch2012_paper_3.pdf.
[28] J. Oosterman, A. Nottamkandath, C. Dijkshoorn, A. Bozzon, G. Houben, L. Aroyo, Crowdsourcing knowledge-intensive tasks in cultural heritage, in: ACM Web Science Conference, 2014, pp. 267–268. http://doi.acm.org/10.1145/2615569.2615644.
[29] P. Welinder, P. Perona, Online crowdsourcing: rating annotators and obtaining cost-effective labels, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2010, pp. 25–32. http://doi.acm.org/10.1109/CVPRW.2010.5543189.
[30] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[31] G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6 (2) (1978) 461–464. http://dx.doi.org/10.1214/aos/1176344136.
[32] R.A. Henning, S. Sauter, G. Salvendy, E.F.J. Krieg, Microbrea…
length, performance, and stress in a data entry task, Ergonomic32 (7) (1989) 855–864 pMID: 2806221. http://dx.doi.org/10.1080
00140138908966848.
Please cite this article as: J. Oosterman et al., On the impact of knowledge extraction and aggregation on crowdsourced annotation of visual artworks, Computer Networks (2015), http://dx.doi.org/10.1016/j.comnet.2015.07.008
[33] O. Inel, K. Khamkham, T. Cristea, A. Dumitrache, A. Rutjes, J. van der Ploeg, L. Romaszko, L. Aroyo, R. Sips, CrowdTruth: machine-human computation framework for harnessing disagreement in gathering annotated data, in: The Semantic Web - ISWC 2014, 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014, Proceedings, Part II, 2014, pp. 486–504. http://dx.doi.org/10.1007/978-3-319-11915-1_31.
[34] A. Kittur, J.V. Nickerson, M. Bernstein, E. Gerber, A. Shaw, J. Zimmerman, M. Lease, J. Horton, The future of crowd work, in: Proceedings of the 2013 Conference on Computer Supported Cooperative Work, CSCW '13, ACM, New York, NY, USA, 2013, pp. 1301–1318. http://dx.doi.org/10.1145/2441776.2441923.
[35] J.M. Rzeszotarski, E.H. Chi, P. Paritosh, P. Dai, Inserting micro-breaks into crowdsourcing workflows, in: Human Computation and Crowdsourcing: Works in Progress and Demonstration Abstracts, An Adjunct to the Proceedings of the First AAAI Conference on Human Computation and Crowdsourcing, November 7-9, 2013, Palm Springs, CA, USA, 2013. http://www.aaai.org/ocs/index.php/HCOMP/HCOMP13/paper/view/7544.
Born in Amsterdam, the Netherlands, Jasper Oosterman studied Computer Science at the Delft University of Technology. His MSc thesis, titled 'Finding Experts in a Corporate Environment', concluded his MSc in 2011. After his master's, Jasper started a PhD position supervised by Alessandro Bozzon in the Web Information Systems research group led by Geert-Jan Houben. The topic of his thesis is the maximization of crowd-generated knowledge, and he plans to finish in 2015.
Jie Yang is a PhD student at the Web Information Systems Group of the Faculty of Engineering, Mathematics and Computer Science (EEMCS/EWI), Delft University of Technology (TU Delft). Jie's research focuses on crowdsourcing for knowledge-intensive tasks.
Dr. Alessandro Bozzon is an Assistant Professor at the Delft University of Technology; he completed his PhD at Politecnico di Milano, Italy, in 2009. His current research interests are in the fields of Web data management; human computation, crowdsourcing, and games with a purpose; and information retrieval, with applications to crowdsourced knowledge creation, urban computing, and smart enterprise workforce.
Lora Aroyo is associate professor in Intelligent Information Systems at the Web and Media group of the Department of Computer Science, VU University Amsterdam, the Netherlands. Her specialities are all explicitly 'user-centred': crowdsourcing, user and context modelling, recommender systems, and user-centred design for semantic systems and e-learning. In the recent past Aroyo was scientific coordinator of the EU Integrated Project NoTube, on the integration of Web and TV data with the help of semantics. Aroyo also coordinated the NWO project CHIP: Cultural Heritage Information Personalization. This project was awarded a Semantic Web Challenge prize.
Geert-Jan Houben is full professor of Web Information Systems at the Software & Computer Technology department at TU Delft. His main research interests are in Web Engineering, Web Science, and User Modeling, Adaptation and Personalization. He is managing editor of the Journal of Web Engineering and an editorial board member for the Journal of Web Science, the International Journal of Web Science, and ACM Transactions on the Web. He is scientific director of Delft Data Science, TU Delft's coordinating initiative in Data Science, leads TU Delft's research program on Open & Online Education, and is principal investigator in AMS, the Amsterdam Institute for Advanced Metropolitan Solutions.