Transfer Learning in Joint Neural Embeddings for Cooking Recipes and Food Images

Submitted in partial fulfillment for the degree of Master of Science

Mariel van Staveren
11773952

Master Information Studies
Data Science
Faculty of Science
University of Amsterdam

Date of defence: 2018-06-22

Internal Supervisor: Dr Thomas Mensink, UvA, FNWI, [email protected]
External Supervisor: Dr Vladimir Nedović, Flavourspace, [email protected]
Contents

Abstract
1 Introduction
2 Related Work
2.1 Transfer Learning
2.2 Nutritional value
3 The joint neural embedding model
3.1 Representation of ingredients
3.2 Representation of cooking instructions
3.3 Representation of images
3.4 Joint neural embedding
4 The recipe collections
4.1 Recipe1M (R1M)
4.2 Jamie Oliver (JO)
4.3 Allerhande (AH)
5 Performance of the pre-trained model
5.1 Pre-processing JO and AH
5.2 Preparing the test-sets
5.3 Intra-collection retrieval
5.4 Inter-collection retrieval
6 Experiments
6.1 Fine-tune pre-trained model on Jamie Oliver
6.2 Fine-tune pre-trained model on Allerhande
6.3 Nutritional value as a feature
7 Fine-Tuning versus Training from scratch
8 Conclusions
8.1 Acknowledgements
References
9 Appendix
Transfer Learning in Joint Neural Embeddings for Cooking Recipes and Food Images
ABSTRACT

The research focus of this paper is two-fold. First, we address the performance of the joint neural embedding model for cooking recipes and food images [2] over different recipe collections. The model is trained on a large recipe collection that contains many user-submitted recipes and food images. Performance on professional recipes is therefore expected to be low. To enhance the usability of the model, our aim is to produce one model that performs well on both professional and amateur recipes via transfer learning. Second, given the increased interest in and access to nutritional value information, we assess the benefit of adding nutritional value information as a new feature in the model. Two small professional recipe collections are used in this project: the Jamie Oliver (JO) and the Dutch Allerhande (AH) collection. Transfer learning is deployed to increase the model's performance on these professional recipe collections via multiple fine-tuning methods. The results suggest that the JO collection is too small to achieve an increase in performance through fine-tuning. We found that the best method to increase the model's performance on the AH collection is to fine-tune the pre-trained model on the translated AH collection, using the pre-trained text-representation models. Interestingly, this method results in increased performance on amateur recipes as well. This means that the benefits of transfer learning are not restricted to the target task (i.e. professional recipes), but serve the base task (i.e. amateur recipes) as well. Finally, qualitative and quantitative experiments show that the model's performance can be increased by adding nutritional value as a new feature.
1 INTRODUCTION

People's lives are increasingly intertwined with the World Wide Web, including one of our most fundamental needs: food. A large corpus of cooking recipes and food images is currently available on the Web. The largest publicly available recipe collection has been constructed by Salvador et al. [2]. This collection, referred to as the Recipe1M (R1M) collection, contains over 1 million cooking recipes and 800k food images. Using this collection, Salvador et al. created a joint neural embedding model to embed recipes and food images in a common high-dimensional vector space. The model is trained to minimize the distance in space between a recipe and its corresponding food image. The model yields impressive results on image-recipe and recipe-image retrieval tasks. Based on this model, an application could be developed where, for example, users can input a picture of a delicious lunch and receive a corresponding recipe so they can recreate the dish.
Master Thesis, June 2018, UvA © 2018

However, the R1M collection contains many user-submitted recipes and food images. Amateur recipes generally differ from professional recipes in the type and number of ingredients and the style of the instructions. Additionally, amateur food images differ from professional images in aspects such as composition, resolution, lighting, clarity, and color distribution. Consequently, the joint neural embedding model's performance on professional recipe collections is expected to be low. This limits the model's usability in the sense that application developers have to restrict the possible scope of inputs and outputs to amateur recipes and food images. This means that users can only correctly retrieve recipes and food images from fellow amateur cooks, while they may actually be interested in suggestions from a professional chef.
Transfer learning refers to improved learning of a target task through the transfer of knowledge from a base task [14]. Amateur and professional recipes can be considered as two separate tasks. In this case, the joint neural embedding model, pre-trained on the amateur recipes and food images of the R1M collection, serves as the base model. This pre-trained model's knowledge of amateur recipes can be exploited to train a new model that learns to embed professional recipes and food images. However, our aim is to produce one model that performs well on both professional and amateur recipes. Therefore, our approach diverges from traditional transfer learning in the sense that we aim for improved learning of a target task (i.e. professional recipes) without compromising the performance on the base task (i.e. amateur recipes).
In this project, two small professional recipe collections are used: the Jamie Oliver (JO) collection and the Allerhande (AH) collection. The AH collection has two interesting properties: the recipes are in Dutch instead of English, and the recipes contain nutritional value information. First, we assess how well the model performs on these professional recipe collections compared to amateur recipes from the R1M collection. Next, transfer learning is applied by fine-tuning the pre-trained model on the professional recipe collections. Additionally, we want to assess the benefit of adding nutritional value information as a new feature in the model. Consequently, the research questions posed in this work are as follows:
Question 1: Performance of pre-trained model. Does the joint neural embedding model, pre-trained on amateur recipes and food images, perform equally well on professional and amateur recipe collections? To answer this question, the pre-trained model is tested on image-recipe and recipe-image retrieval tasks for the R1M, JO, and AH recipe collections separately. This is referred to as intra-collection retrieval, because the query item and the retrieved items originate from the same collection. The pre-trained model is also tested on inter-collection retrieval. For example, recipes from the JO collection are retrieved for a query image from the R1M collection. Low performance on inter-collection retrieval would indicate a discrepancy between amateur and professional recipes.
Question 2: Fine-tune pre-trained model on Jamie Oliver. What is the best fine-tuning method to enhance the pre-trained model's performance on the JO collection? To answer this question, multiple fine-tuning methods are applied. Learning is deemed to be correct when no over- or under-fitting is apparent. All methods that result in correct learning are further evaluated by testing the new models on the intra- and inter-collection retrieval tasks. The goal is to increase the model's performance on the JO collection without compromising the performance on the R1M collection.
Question 3: Fine-tune pre-trained model on Allerhande. What is the best method to enhance the pre-trained model's performance on a recipe collection that is in Dutch instead of English (i.e. the AH collection)? Retrieval performance of the pre-trained model is used as a baseline. Multiple methods are applied to increase the model's performance on the AH collection. The goal is to increase the model's performance on the AH collection without compromising the performance on the R1M collection.
Question 4: Nutritional value as a feature. Does adding nutritional value as a feature increase the model's performance on the AH collection? Retrieval performance of the best-performing model of the previous section is used as a baseline. The AH collection contains nutritional value information for each recipe. A qualitative assessment is designed to investigate whether nutritional value has any meaningful discriminative power. Next, nutritional value is incorporated as a feature in the model. Retrieval performance of the new model is compared to the baseline.
Overview of thesis. The Related Work section reviews relevant academic work. Next, the joint neural embedding model and the text-representation models are described. The next section contains information on the content of the three recipe collections (R1M, JO, and AH). The fifth section ("Performance of the pre-trained model") describes the experiments that are used to test the model's performance on all three recipe collections, and reviews the results. The Experiments section first explains the experiments that are designed to increase the model's performance on the JO and AH recipe collections through transfer learning. Additionally, the last experiment assesses the added value of utilizing nutritional value information as a feature. The next section ("Fine-Tuning versus Training from scratch") discusses an additional observation concerning large performance differences between training from scratch and fine-tuning methods. Finally, the outcomes are summarized in the Conclusions section.
2 RELATED WORK

2.1 Transfer Learning

Transfer learning is a method where a trained model is used as a starting point for another model on a related task [15]. Transfer learning is a popular method in deep learning because using a pre-trained model saves time and computational resources [17].
In research by [16], a pre-trained deep convolutional neural network (CNN) was fine-tuned on medical images to perform tasks such as classification, detection, and segmentation. Such a pre-trained CNN is typically trained on a large set of labeled natural images. They compared the fine-tuned CNN model to a CNN model that was trained from scratch. The results showed that the fine-tuned model outperformed the model trained from scratch. Importantly, they analyzed how the size of the training set influenced the performance of both models. A reduced training-set size led to a larger decrease in performance for the model trained from scratch than for the fine-tuned model. This means that the fine-tuned CNN is more robust to training-set size.
The fine-tuning approach used by [16] is similar to ours. In our approach, the joint neural embedding model by Salvador et al. [2] is fine-tuned on small professional recipe collections. Our approach diverges from the approach used by [16] because we aim to preserve the model's performance on the base task (i.e. amateur recipes), while increasing the model's performance on the target task (i.e. professional recipes). This project contributes to the field of transfer learning by testing the feasibility of this approach within the joint neural embedding model.
2.2 Nutritional value

People increasingly take nutritional value into account when making food choices. Recent research has focused on algorithmic nutritional estimation from text [10] or images [11]. Interestingly, research by [9] showed that simple models outperform human raters on nutritional estimation tasks. This means that nutritional value information can be obtained for all recipes, even for recipes that do not explicitly contain nutritional value information. Due to the increased interest in and easy access to nutritional value information, incorporating it into image-recipe embeddings is a meaningful contribution.
3 THE JOINT NEURAL EMBEDDING MODEL

Recipes consist of three features: the ingredients, the cooking instructions, and the food image. The representations of these features are discussed first. Next, the joint neural embedding model is described.
3.1 Representation of ingredients

Ingredient names are extracted from the ingredient text. For example, "olive oil" is extracted from "2 tbsp of olive oil". Each ingredient is represented by a word2vec representation [3]. The skip-gram word2vec model represents each word as a vector. Two vectors are close in vector space when the corresponding words appear in similar contexts. The word2vec model has been pre-trained on the cooking instructions of the R1M collection, and returns vectors with a dimensionality of 300. The pre-trained word2vec model has been made publicly available by Salvador et al. [2].
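The extraction-and-lookup step can be sketched as follows. This is a minimal illustrative sketch, not the thesis code: the `FILLER` stop-word list and the tiny `WORD2VEC` table (4-dimensional toy vectors instead of 300) are assumptions for demonstration only.

```python
import re

# Toy stand-in for the pre-trained word2vec table; the real model
# maps each known word to a 300-dimensional vector.
WORD2VEC = {
    "olive_oil": [0.1, 0.3, -0.2, 0.7],
    "garlic":    [0.2, 0.1,  0.5, -0.1],
}

# Units and filler words to strip from the ingredient text (assumed list).
FILLER = {"tbsp", "tsp", "cup", "cups", "of", "g", "kg", "ml"}

def extract_ingredient_name(ingredient_text):
    """Extract the ingredient name, e.g. "olive oil" from "2 tbsp of olive oil"."""
    tokens = re.findall(r"[a-zA-Z]+", ingredient_text.lower())
    name_tokens = [t for t in tokens if t not in FILLER]
    return "_".join(name_tokens)  # multi-word names joined for vocabulary lookup

def ingredient_vector(ingredient_text):
    """Look up the word2vec vector; None if the ingredient is out of vocabulary."""
    return WORD2VEC.get(extract_ingredient_name(ingredient_text))

print(extract_ingredient_name("2 tbsp of olive oil"))  # olive_oil
```

The out-of-vocabulary case matters later: recipes without any known ingredients are excluded when preparing the test-sets (section 5.2).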
3.2 Representation of cooking instructions

Cooking instructions are represented through a two-stage LSTM method. An LSTM is a recurrent neural network that can learn long-term word dependencies. LSTMs are suitable for language modeling because the probability of a word sequence can be modeled. In the first stage, a sequence-to-sequence LSTM model is applied to each single cooking instruction to obtain a so-called skip-instructions vector representation [5]. This first LSTM model is referred to as the skip-instructions model, and it has been trained on the R1M collection. The second stage of the two-stage LSTM method is integrated in the joint neural embedding model, and is discussed in section 3.4.

The word2vec model and the skip-instructions model together are referred to as the text-representation models.
3.3 Representation of images

All food images are resized and center-cropped to 256 × 256 pixels. The images are represented by adopting the deep convolutional neural network (CNN) ResNet-50 [6]. The ResNet-50 model is integrated in the joint neural embedding model, and is discussed in section 3.4.
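The resize-and-crop step can be sketched in a few lines of numpy. This is a sketch under assumptions: nearest-neighbour interpolation is used for brevity (the actual pipeline's interpolation method is not specified here), and images are height × width × channel arrays.

```python
import numpy as np

def resize_shorter_side(img, target=256):
    """Nearest-neighbour resize so the shorter side equals `target`."""
    h, w = img.shape[:2]
    scale = target / min(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Map each output pixel back to its nearest source pixel.
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    return img[rows][:, cols]

def center_crop(img, size=256):
    """Crop a size x size patch from the center of the image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

img = np.zeros((562, 646, 3), dtype=np.uint8)   # average R1M image size (Table 1)
out = center_crop(resize_shorter_side(img))
print(out.shape)  # (256, 256, 3)
```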
3.4 Joint neural embedding

The joint neural embedding model is implemented in Torch7 [13]. The model is visualized in Figure 1 (adapted from Salvador et al. [2]). It contains two encoders: one for ingredients, and one for cooking instructions. The ingredients-encoder combines the word2vec vectors of all ingredients through the use of a bidirectional LSTM model. The instructions-encoder forms the second stage of the two-stage LSTM method discussed in section 3.2. This second LSTM model represents all skip-instructions vectors of a recipe as one vector. The encoder outputs are concatenated to obtain the recipe representation. The recipe representation is embedded into the joint neural embedding space. As discussed before, the image representations are obtained through the ResNet-50 model. The ResNet-50 model is incorporated into the joint neural embedding model by removing the final softmax classification layer and projecting the image representation into the embedding space through a linear transformation. The joint neural embedding model is trained to learn transformations that minimize the distance in space between a recipe and its corresponding image.
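The flow from encoder outputs to a shared embedding can be sketched in plain numpy. This is a schematic sketch, not the Torch7 implementation: the dimensionalities and the random stand-in projection matrices are illustrative assumptions, and training (learning the projections) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the model's actual dimensionalities).
D_INGR, D_INSTR, D_IMG, D_EMB = 8, 8, 16, 10

# Random stand-ins for the learned linear projections into the joint space.
W_recipe = rng.normal(size=(D_INGR + D_INSTR, D_EMB))
W_image  = rng.normal(size=(D_IMG, D_EMB))

def embed_recipe(ingr_encoding, instr_encoding):
    """Concatenate the two encoder outputs and project into the joint space."""
    recipe = np.concatenate([ingr_encoding, instr_encoding])
    return recipe @ W_recipe

def embed_image(image_features):
    """Project image features (softmax layer removed) into the joint space."""
    return image_features @ W_image

def cosine_distance(a, b):
    """Training minimizes this distance for matching recipe-image pairs."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

r = embed_recipe(rng.normal(size=D_INGR), rng.normal(size=D_INSTR))
v = embed_image(rng.normal(size=D_IMG))
print(cosine_distance(r, v))  # some value in [0, 2]
```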
4 THE RECIPE COLLECTIONS

Three recipe collections are used in this project: R1M, JO, and AH. An overview of collection characteristics is depicted in Table 1. Complete example recipes from each recipe collection are added to the Appendix (see Figures 11, 12, and 13).

Figure 2: Examples of food images from the Recipe1M collection
4.1 Recipe1M (R1M)

The Recipe1M collection has been made publicly available by Salvador et al. [2]. This dataset was collected by scraping over two dozen cooking websites, extracting and cleaning relevant text from the raw HTML, and downloading associated images. The features that are stored for each recipe are: ID, title, instructions, ingredient names, partition (i.e. train, test, or validation), the URL, and the names of the images that the recipe is associated with. Examples of R1M food images are shown in Figure 2.
4.2 Jamie Oliver (JO)

The Jamie Oliver (JO) collection was scraped from the website jamieoliver.com. Compared to the R1M collection, the JO collection is much smaller, and on average contains more ingredients and cooking instructions (see Table 1). Food images in the JO collection are of higher quality (with respect to composition, resolution, etc.) than the food images from the R1M collection (see Figure 3).

Figure 3: Examples of food images from the Jamie Oliver collection
4.3 Allerhande (AH)

The Allerhande (AH) collection was scraped from the website allerhande.nl. The AH collection is much smaller than the R1M collection, but larger than the JO collection. Compared to the R1M collection, food image quality is high (with respect to composition, resolution, etc.) (see Figure 4). Additionally, food images in the AH collection are much wider and larger than food images from the other two collections.

Figure 4: Examples of food images from the Allerhande collection
Figure 1: Overview of the joint neural embedding model. Figure adapted from Salvador et al. [2].
                                       Recipe1M (R1M)              Jamie Oliver (JO)   Allerhande (AH)
Website of origin                      Various well-known recipe   jamieoliver.com     allerhande.nl
                                       collections (e.g. food.com,
                                       kraftrecipes.com,
                                       allrecipes.com,
                                       tastykitchen.com)
Language                               English                     English             Dutch
Total number of recipes                1,029,720                   1097                13179
Train | Test | Val                     n/a | 3480 | n/a            571 | 142 | 77      8645 | 2463 | 1233
Average number of ingredients          9.3 ± 4.3                   11.7 ± 5.6          7.6 ± 2.2
Average number of instructions         10.4 ± 6.9                  15.6 ± 8.2          11.8 ± 4.9
Average instruction length (in words)  60.2 ± 36.8                 92.3 ± 66.6         51.8 ± 33.6
Average image size (height × width)    562 × 646                   689 × 513           1600 × 550

Table 1: Overview of collection characteristics. The total number of recipes includes recipes that are removed by pre-processing.
5 PERFORMANCE OF THE PRE-TRAINED MODEL

In this section, we assess the performance of the pre-trained model on all three recipe collections.
5.1 Pre-processing JO and AH

After scraping the JO and AH datasets from their corresponding websites of origin, relevant text is extracted from the raw HTML. The text is cleaned by removing excessive whitespace, HTML entities, and non-ASCII characters (method adopted from Salvador et al. [2]). Next, all recipes are assigned a unique 10-digit hexadecimal ID. The recipes are segmented into training, test, and validation sets (ratio: 0.7, 0.2, 0.1, respectively).
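These three pre-processing steps can be sketched with the standard library. This is a sketch, not the thesis pipeline: the function names and the order of the cleaning operations are assumptions.

```python
import html
import random
import re
import secrets

def clean_text(raw):
    """Clean extracted recipe text: unescape HTML entities, drop non-ASCII
    characters, and collapse excessive whitespace (order of steps assumed)."""
    text = html.unescape(raw)
    text = text.encode("ascii", errors="ignore").decode("ascii")
    return re.sub(r"\s+", " ", text).strip()

def assign_id():
    """Assign a unique 10-digit hexadecimal recipe ID."""
    return secrets.token_hex(5)  # 5 random bytes -> 10 hex digits

def split_recipes(recipes, seed=42):
    """Segment recipes into train/test/validation sets (0.7 / 0.2 / 0.1)."""
    recipes = recipes[:]
    random.Random(seed).shuffle(recipes)
    n = len(recipes)
    n_train, n_test = int(0.7 * n), int(0.2 * n)
    return (recipes[:n_train],
            recipes[n_train:n_train + n_test],
            recipes[n_train + n_test:])

print(clean_text("2  tbsp\tolive&amp;oil \u2013 chopped"))  # 2 tbsp olive&oil chopped
```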
5.2 Preparing the test-sets

Test-sets are prepared for the R1M, JO, and AH collections. Since the original R1M collection is very large, a subset of the R1M test-recipes suffices. The sizes of the test-sets are depicted in Table 1. To be able to apply the pre-trained model to the AH collection (originally in Dutch), the AH test-recipes are translated into English through Google Translate. The JO test-set consists of all JO test-recipes.

                 im2recipe                      recipe2im
                 MedR   R@1    R@5    R@10     MedR   R@1    R@5    R@10
R1M              5.75   0.229  0.495  0.621    6.9    0.217  0.462  0.587
JO               9.95   0.129  0.383  0.514    14.1   0.098  0.292  0.438
(Translated) AH  18.55  0.093  0.287  0.389    21.8   0.059  0.206  0.342

Table 2: Performance of the pre-trained model on the R1M, JO, and AH collections
For each of the three test-sets, recipe representations are extracted315
using the text-representation models as discussed in section 3.316
The JO test-set is small because any recipes that contain more317
than 20 instructions or ingredients are excluded. Recipes that do318
not contain any known ingredients (i.e. ingredients that are in the319
vocabulary of the word2vec model) are excluded as well. From320
the JO collection, 307 recipes were excluded, often because of the321
number of instructions exceeding 20.322
Finally, the pre-trained joint neural embedding model is applied323
to all three test-sets. For each recipe, the model returns two vectors324
that represent the recipe and the corresponding image in embedding325
space. These vector representations are used in the subsequent326
retrieval experiments.327
5.3 Intra-collection retrieval

The pre-trained model is tested, for each test-set, on two retrieval tasks: the im2recipe and the recipe2im task. In the im2recipe task, recipes are retrieved for a query food image. In the recipe2im task, food images are retrieved for a query recipe. The im2recipe task is performed by randomly selecting a subset of 100 test recipes and their corresponding images. Each recipe and food image is represented by a vector in the embedding space. The similarity of two vectors is determined by their cosine similarity according to the equation:

cos(x, y) = (x · y) / (||x|| ||y||)    (1)

For each image in the subset, all recipes are ranked on the basis of their cosine similarity to the image. The rank signifies the position of the ground-truth recipe in the list of ranked recipes. When all images in the subset have been queried, the median rank (MedR) and the recall rates at top 1, 5, and 10 (R@1, R@5, and R@10) are calculated (adopted from Salvador et al. [2]). This experiment is repeated 10 times. Mean performances are reported. The recipe2im task is evaluated in the same manner.
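The ranking and metric computation described above can be sketched as follows. This is a minimal numpy sketch under one assumption: row i of `recipes` is the ground-truth match for row i of `images`.

```python
import numpy as np

def retrieval_metrics(images, recipes):
    """Compute MedR and R@K for im2recipe retrieval.

    `images` and `recipes` are (N, d) arrays of embedding vectors;
    row i of `recipes` is the ground truth for image i.
    """
    # Normalize rows so dot products equal cosine similarities (Eq. 1).
    images = images / np.linalg.norm(images, axis=1, keepdims=True)
    recipes = recipes / np.linalg.norm(recipes, axis=1, keepdims=True)
    sims = images @ recipes.T                       # (N, N) cosine similarities
    # For each query image, rank all recipes by similarity (1 = best).
    order = np.argsort(-sims, axis=1, kind="stable")
    ranks = np.array([np.where(order[i] == i)[0][0] + 1
                      for i in range(len(images))])
    medr = float(np.median(ranks))
    recall = {k: float(np.mean(ranks <= k)) for k in (1, 5, 10)}
    return medr, recall

# Tiny self-retrieval example: each image is closest to its own recipe.
emb = np.eye(12)
medr, recall = retrieval_metrics(emb, emb)
print(medr, recall[1])  # 1.0 1.0
```

In the experiments this would be run on a random subset of 100 recipe-image pairs and repeated 10 times, averaging the results; the recipe2im direction simply swaps the two arguments.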
Mean performances are displayed in Table 2. As expected, performance on the R1M test-set is higher than on the JO and AH test-sets, for both retrieval tasks and all performance measures. Interestingly, performance on the JO test-set is much higher than on the (translated) AH test-set. This signifies that the model's performance is collection-specific. This collection-specificity most likely depends on how similar the specific collection is to the R1M collection with respect to image and recipe features and the co-occurrence of these features. The results imply that the R1M collection is more similar to the JO collection than to the AH collection. Another possibility is that the low performance on the AH test-set is due to translation errors. Overall, the results suggest that the pre-trained joint neural embedding model does not perform equally well on professional and amateur recipes.

Figure 5: Inter-collection (R1M & JO) ranking results. These plots show the sorted reported ranks (on the y-axis) for 10 randomly chosen query items (on the x-axis). When no relevant recipe was found, the rank was set to 16.

Figure 6: Inter-collection (R1M & AH) ranking results. These plots show the sorted reported ranks (on the y-axis) for 10 randomly chosen query items (on the x-axis). When no relevant recipe was found, the rank was set to 16.

Figure 7: Example of ranking and subsequent relevance judgment. In this example, JO recipes are retrieved for an R1M image query (i.e., im2recipe). Only the first six recipes (excluding cooking instructions) are shown due to limited space. The recipe that has been judged as "relevant" is encircled by the green box. In this case, rank = 1.
5.4 Inter-collection retrieval

Inter-collection retrieval refers to retrieving items from collection A for a query item from collection B. This experiment is designed to assess the model's ability to directly match items from different recipe collections. The R1M collection is matched with both the AH and JO collections. The experiment is performed in both directions (i.e. from R1M to JO/AH, and from JO/AH to R1M). The method is explained by walking through an example where recipes from the JO collection are retrieved for food image queries from the R1M collection (i.e. im2recipe).
The JO test-set does not contain the ground-truth recipe that belongs to the R1M query image. Therefore, the relevance of the retrieved recipes has to be determined qualitatively. A recipe is deemed relevant to the query image if it describes a similar dish-type, with similar ingredients.

First, one R1M query image is randomly selected from the R1M test-set. For any R1M query image, there is a possibility that the JO test-set by chance does not contain any relevant items. To diminish the probability of this happening, a recipe-subset of 130 (instead of 100) recipes is randomly selected from the JO test-set. The size of this subset is limited by the size of the JO test-set. Similar to the intra-collection experiment, all JO test-recipes are ranked on the basis of their cosine similarity to the R1M query image.
Finally, the first 15 retrieved recipes are manually inspected, and the rank of the first relevant recipe is reported. When no relevant recipe is found, the rank is set to 16. This experiment is repeated 10 times. An example is shown in Figure 7.

The reported ranks for the R1M and JO combination are sorted and shown in Figure 5. In the im2recipe task, six out of the ten queries resulted in a relevant recipe in the top 15. Performance is lower for the recipe2im task, which corresponds to the results of intra-collection retrieval (see Table 2). The reported ranks for the R1M and AH combination are shown in Figure 6. In the im2recipe task, only three out of the ten queries resulted in a relevant recipe in the top 15. Overall, these results suggest that the pre-trained model's ability to directly match items from amateur and professional collections is limited. This emphasizes the discrepancy between amateur and professional recipes.
6 EXPERIMENTS

6.1 Fine-tune pre-trained model on Jamie Oliver

This section describes the experiments that have been designed to answer the second research question: What is the best method to enhance the pre-trained model's performance on the JO collection?
Method  Text-representation models  Fixed weights
1       Trained on R1M              No
2       Trained on R1M              Yes
3       Trained on JO               No
4       Trained on JO               Yes

Table 3: Fine-tuning methods for the Jamie Oliver collection
Four different fine-tuning methods are proposed. The approaches differ in weight fixation and the specific text-representation models that are used. These variables are described below. An overview of all four approaches is shown in Table 3.
Preparing the dataset. To maximize the number of training recipes, the JO recipe collection is re-segmented into a training and validation set (ratio: 0.9 and 0.1, respectively). The number of instructions and ingredients is limited to 20, to prevent recipes from being excluded. The training-set contains 985 recipes, and the validation-set contains 110 recipes.
Text-representation models. As discussed before, the word2vec and the skip-instructions model are together referred to as the text-representation models. These models are used to extract recipe representations when preparing the dataset for training and testing. There are two possibilities: either the pre-trained text-representation models are used (i.e. trained on the R1M collection), or the text-representation models are completely re-trained on the JO collection.
Weight fixation. Weight fixation refers to freezing model parameters during training. This can be used to restrict the learning to a specific part of the model. The amount of weight fixation depends on which text-representation models are used. If the pre-trained text-representation models are used, either all model parameters are fine-tuned (i.e. no weights are fixed) or only the parameters of the last two layers are fine-tuned. These are the layers that project the recipe and image representations onto the embedding space. If the text-representation models are trained on the JO collection, fine-tuning only the last two layers is insufficient, because the ingredients and instructions encoders have to be adjusted to incorporate the new text-representation models. Therefore, the layers representing the two encoders are fine-tuned in addition to the last two layers.
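Weight fixation can be sketched framework-agnostically as skipping the gradient update for frozen parameters. The parameter names below are assumptions for illustration; in a real framework this corresponds to disabling gradient computation for the frozen layers.

```python
import numpy as np

# Assumed parameter names for illustration; the real model has many more.
params = {
    "ingredients_encoder": np.ones(3),
    "instructions_encoder": np.ones(3),
    "recipe_projection": np.ones(3),   # one of the "last two layers"
    "image_projection": np.ones(3),    # the other projection layer
}

# Fine-tune only the projection layers; freeze (fix) everything else.
frozen = {"ingredients_encoder", "instructions_encoder"}

def sgd_step(params, grads, lr=0.01):
    """Apply one gradient step, skipping frozen (weight-fixed) parameters."""
    for name, g in grads.items():
        if name not in frozen:
            params[name] -= lr * g
    return params

grads = {name: np.full(3, 2.0) for name in params}
sgd_step(params, grads)
print(params["ingredients_encoder"], params["recipe_projection"])
```

After the step, the frozen encoder weights are unchanged while the projection layers have moved, which is exactly the restriction described above.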
The loss curves for each fine-tuning method are shown in Figure 8. For all plots, the values of the hyper-parameters are fixed to allow for comparison. All plots show a decreasing training loss yet unchanging validation loss. This suggests that the model is not learning any new trends. Transferability of features is limited when the distance between the base task (i.e. R1M) and target task (i.e. JO) is large [18]. However, this is an unlikely explanation given the relatively small performance difference of the pre-trained model on the R1M and JO recipe collections (see Table 2). The unchanging validation loss could be an indication of over-fitting. This means that instead of learning to match JO recipes to JO food images, the model "memorizes" the correct recipe-image mappings from the JO training-set. This implies that the JO training set is too small and the model too complex. The unchanging validation loss could also be an indication of an imbalance between the training and validation sets. In that case, the model correctly learns the underlying trends in the training set, but fails to perform well on the validation set due to the differences between the sets. Adjusting the hyper-parameters or the amount of weight fixation in any of the fine-tuning methods did not improve the results.

Figure 8: Training (blue) and validation (red) loss curves. For all plots, the hyper-parameters are fixed: batch size = 15, learning rate = 0.00008, number of iterations = 15,000, running time = 3 hours.
Given that none of the four methods resulted in correct learning, no model evaluation is performed. These results suggest that it is difficult to increase the pre-trained model's performance on the JO collection. This might be due to the small size of the JO collection and the high complexity of the joint neural embedding model. The model's performance on the JO collection can possibly be increased by either increasing the size of the JO collection, or decreasing the complexity of the model.
6.2 Fine-tune pre-trained model on Allerhande

This section describes the experiments that have been designed to answer the third research question: What is the best method to enhance the pre-trained model's performance on a recipe collection that is in Dutch instead of English (i.e. the AH collection)? Six different fine-tuning methods are proposed. In addition to weight fixation and the specific text-representation models, a new variable is introduced: language. The joint neural embedding model is fine-tuned either on the original Dutch AH recipe collection, or on the AH recipe collection that has been translated to English. An overview of all six methods is depicted in Table 4.

Method   Language   Text representation models   Fixed weights
1        Dutch      Trained on Dutch AH          No
2        Dutch      Trained on Dutch AH          Yes
3        English    Trained on R1M               No
4        English    Trained on R1M               Yes
5        English    Trained on English AH        No
6        English    Trained on English AH        Yes

Table 4: Fine-tuning methods for Allerhande collection

           AH                R1M
           im2rec   rec2im   im2rec   rec2im
Baseline   36.0     39.55    5.75     6.9
1          7.55     7.45     50.7     49.1
2          6.05     5.8      48.25    46.05
3          3.5      3.4      3.25     3.4
4          5.15     5.35     3.35     3.45
5          6.35     6.55     21.1     22.7
6          13.95    14.9     29.15    34.15

Table 5: Performance (MedR) on retrieval tasks for each fine-tuning method, on both R1M and AH. The best method (method 3) is depicted in bold.
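The "fixed weights" variable in Table 4 refers to weight fixation: during fine-tuning, frozen parameters are excluded from gradient updates while the remaining parameters continue to learn. The sketch below illustrates this mechanism only; it is not the thesis code, and the parameter names, shapes, and optimizer are hypothetical (the original model was implemented in Torch7 [13]).

```python
import numpy as np

# Illustrative sketch of weight fixation during fine-tuning:
# parameters in the `frozen` set are skipped by the update step.
# Names and shapes are hypothetical.
rng = np.random.default_rng(0)
params = {
    "text_encoder": rng.standard_normal((4, 4)),      # pre-trained, frozen
    "joint_projection": rng.standard_normal((4, 2)),  # fine-tuned
}
frozen = {"text_encoder"}  # methods with "Fixed weights = Yes"

def sgd_step(params, grads, lr=0.01):
    """One SGD update that leaves frozen parameters untouched."""
    return {
        name: w if name in frozen else w - lr * grads[name]
        for name, w in params.items()
    }

grads = {name: np.ones_like(w) for name, w in params.items()}
before = {name: w.copy() for name, w in params.items()}
params = sgd_step(params, grads)
```

In a deep-learning framework the same effect is usually achieved by disabling gradient computation for the frozen layers rather than masking the update manually.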
The baseline is the performance of the pre-trained model on the intra-collection retrieval experiment for both the R1M and Dutch AH collections (see Table 5). Only the median rank (MedR) measures are reported for clarity. As expected, baseline performance for the Dutch AH collection is low. This is because the text-representation models have been trained on the English R1M dataset; the dictionary of the word2vec representation model therefore does not contain any Dutch words.
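For reference, the metrics reported here can be computed directly from the rank that each query's true match receives: MedR is the median of these ranks (lower is better), and R@K is the fraction of queries whose true match appears in the top K. The sketch below uses a made-up list of ranks, not values from the thesis.

```python
import numpy as np

# Retrieval metrics from a list of ranks (1 = best match retrieved first).
# The rank values here are illustrative only.
ranks = np.array([1, 3, 2, 8, 15, 1, 4, 2, 30, 5])

med_r = float(np.median(ranks))                           # median rank
recall_at = {k: float(np.mean(ranks <= k)) for k in (1, 5, 10)}  # R@K
```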
The optimized training and validation loss curves and hyper-parameters are shown in the Appendix (Figure 14 and Table 7, respectively). The evaluation results are depicted in Table 5. The third method results in both the highest performance (for AH and R1M separately) and the smallest performance difference (between AH and R1M). In this method, the pre-trained model was fine-tuned on the English AH collection, using the pre-trained text-representation models.
An interesting observation is that fine-tuning the pre-trained model on the English AH collection increases performance on the R1M collection as well. This indicates that the AH and R1M collections share a certain pattern that the pre-trained model did not sufficiently detect when training on the R1M collection. This corresponds to the assumption that, in transfer learning, the factors that explain the variations in one setting are needed to capture the variations in the other setting [8]. In this case, factors that explain the variations in the AH collection are used to capture the variations in the R1M collection, and vice versa. These results suggest that the benefit of transfer learning is not restricted to one direction (i.e. from R1M to AH), but can manifest itself bi-directionally (i.e. from R1M to AH and vice versa).

Figure 9: Inter-collection ranking results for method 3. These plots show the sorted reported ranks (y-axis) for 10 randomly chosen query items (x-axis).
The third method has also been tested on inter-collection retrieval. The reported ranks are shown in Figure 9. These ranks are generally lower than the reported ranks using the pre-trained model (see Figure 6). This means that fine-tuning the model on the English AH collection increased the model's ability to directly match items from the R1M and AH collections. Overall, the best method to enhance the model's performance on the AH collection is the third method, where the pre-trained model is fine-tuned on the translated AH dataset, using the pre-trained text-representation models.
Figure 10: First ranking based on nutritional value. The query recipe (on the left) and the first five retrieved recipes (on the right), ranked on the basis of Euclidean distance between the normalized nutritional value vectors.

                         im2recipe                          recipe2im
                         MedR    R@1     R@5      R@10      MedR    R@1     R@5     R@10
Excl. nutritional value  3.29    0.299   0.624    0.757     3.255   0.303   0.629   0.755
Incl. nutritional value  3.05    0.312   0.648    0.773     3.13    0.309   0.646   0.773
t-value                  2.560   -2.172  -3.448   -2.574    1.234   -0.977  -2.388  -2.882
p-value                  0.011*  0.031*  0.0006*  0.010*    0.218   0.329   0.017*  0.004*

Table 6: Effect of the nutritional value feature on performance on the (translated) AH collection
6.3 Nutritional value as a feature

This section describes the experiments that have been designed to answer the fourth research question: Does adding nutritional value as a feature increase the model's performance on the AH collection?

Pre-processing nutritional information. The AH collection contains information on nutritional value for each recipe. There are six nutritional categories: fat, protein, fibers, energy, sodium, and carbohydrates. The values of each category are normalized across all recipes to bring all values into a range between 0 and 1 (i.e. unity-based normalization), following the equation:

X_{norm} = (X - X_{min}) / (X_{max} - X_{min})    (2)

For each recipe, all six nutritional values are stored in one vector.
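To make Equation (2) concrete, the sketch below applies unity-based normalization per category across a handful of recipes. The nutritional values and the use of NumPy are illustrative assumptions, not the thesis code.

```python
import numpy as np

# Unity-based (min-max) normalization of Equation (2), applied per
# nutritional category (column) across all recipes (rows). Values are
# made up; columns: fat, protein, fibers, energy, sodium, carbohydrates.
nutrition = np.array([
    [10.0,  5.0, 1.0, 200.0, 0.2, 30.0],
    [30.0, 20.0, 4.0, 600.0, 1.0, 80.0],
    [20.0,  5.0, 2.5, 400.0, 0.6, 55.0],
])

x_min = nutrition.min(axis=0)
x_max = nutrition.max(axis=0)
normalized = (nutrition - x_min) / (x_max - x_min)  # each column now in [0, 1]
```

Normalizing per category keeps, say, energy (hundreds of kcal) from dominating sodium (fractions of a gram) in the distance computations that follow.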
Qualitative assessment of discriminative power. In this experiment, a query recipe is randomly selected. Next, all other recipes are ranked on the basis of the Euclidean distance between the vectors that represent nutritional value. The top-5 retrieved recipes are inspected manually. This is repeated three times. Nutritional value information is deemed to have meaningful discriminative power if the top-5 retrieved recipes are of a similar dish-type as the query recipe (e.g. all desserts).
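The ranking procedure described above can be sketched in a few lines. The vectors below are random placeholders for the normalized nutritional vectors; the procedure (Euclidean distance, then top-5 by ascending distance) follows the text.

```python
import numpy as np

# Sketch of the top-5 retrieval by nutritional similarity.
# The 6-d vectors are random stand-ins for normalized nutrition vectors.
rng = np.random.default_rng(42)
vectors = rng.random((100, 6))   # 100 recipes x 6 nutritional categories
query_idx = 0                    # randomly selected query recipe

dists = np.linalg.norm(vectors - vectors[query_idx], axis=1)
dists[query_idx] = np.inf        # exclude the query itself from retrieval
top5 = np.argsort(dists)[:5]     # indices of the 5 nearest recipes
```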
One of the rankings is shown in Figure 10. This figure shows that the top-5 retrieved recipes are of a similar dish-type as the query recipe: the retrieved recipes and the query recipe are all sweet desserts. These recipes all contain relatively large amounts of sugar (i.e. carbohydrates) and energy. The other two rankings reveal a similar pattern, and are shown in Figures 15 and 16 in the Appendix. These results show that nutritional value has meaningful discriminative power, in the sense that it can be used to distinguish between different types of dishes.
Incorporation of nutritional feature into model. The 6-dimensional nutritional value vector is incorporated into the joint neural embedding model through a single linear layer with 6 input nodes. This linear layer represents the nutritional encoder within the joint neural embedding model. The encoder returns a 4-dimensional vector that is concatenated to the recipe representation. The new model is fine-tuned on the translated AH collection (including the nutritional value vector representations), using the pre-trained text-representation models. In the baseline model, the nutritional value feature is excluded ("Excl. nutritional value" in Table 6).
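A framework-agnostic sketch of this encoder is given below: a linear map from the 6-d nutritional vector to a 4-d code, concatenated to the recipe representation. The weights are random, the 16-d recipe representation is a stand-in (the thesis' recipe embedding is higher-dimensional), and in a deep-learning framework the map would be a learnable linear layer rather than fixed NumPy arrays.

```python
import numpy as np

# Sketch of the nutritional encoder: linear layer 6 -> 4, whose output
# is concatenated to the recipe representation. Dimensions other than
# 6 (input) and 4 (output) are illustrative.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))   # stand-in for the layer's learnable weights
b = np.zeros(4)                   # stand-in for the layer's learnable bias

def nutrition_encoder(x):
    """Map a 6-d nutritional vector to a 4-d code."""
    return W @ x + b

recipe_repr = rng.standard_normal(16)   # stand-in recipe representation
nutrition = rng.random(6)               # normalized nutritional vector
augmented = np.concatenate([recipe_repr, nutrition_encoder(nutrition)])
```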
The new model and the baseline model are evaluated using the intra-collection retrieval experiment. The im2recipe and recipe2im retrieval tasks are repeated 100 times to increase statistical power. Two-sided independent t-tests are performed to test the difference in performance for all performance measures (i.e. MedR, R@1, R@5, R@10). The results are shown in Table 6. All p-values below 0.05 are taken to signify a significant difference, and are denoted by an asterisk. For the im2recipe task, all performance measures are significantly different from the baseline. For the recipe2im task, only R@5 and R@10 are significantly different. These results suggest that nutritional value contributes new information to the recipe representation, in addition to the ingredients and cooking instructions, and increases the model's performance on the AH collection.
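The test statistic underlying Table 6 can be sketched as follows: a pooled-variance independent two-sample t statistic over the 100 repetitions with and without the nutritional feature. The samples below are synthetic stand-ins; a p-value would come from the t distribution with n1 + n2 - 2 degrees of freedom (e.g. via scipy.stats), which is omitted here.

```python
import numpy as np

# Pooled-variance (Student's) independent two-sample t statistic, as used
# for the comparisons in Table 6. The MedR samples below are synthetic.
def t_statistic(a, b):
    """Two-sided independent two-sample t statistic (equal variances)."""
    n1, n2 = len(a), len(b)
    var_pooled = ((n1 - 1) * np.var(a, ddof=1)
                  + (n2 - 1) * np.var(b, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(var_pooled * (1 / n1 + 1 / n2))

rng = np.random.default_rng(1)
medr_excl = 3.3 + 0.2 * rng.standard_normal(100)  # synthetic baseline MedR
medr_incl = 3.1 + 0.2 * rng.standard_normal(100)  # synthetic MedR with nutrition
t = t_statistic(medr_excl, medr_incl)
```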
7 FINE-TUNING VERSUS TRAINING FROM SCRATCH

In this project, transfer learning has been exploited by fine-tuning the pre-trained joint neural embedding model on professional recipe collections. We also tried training the joint neural embedding model from scratch on the JO and AH collections. To increase the probability of success, we experimented with the model's complexity. Model complexity is related to the number of learnable parameters in the model. Decreasing model complexity can be beneficial for training, especially when using a relatively small training set. The complexity of the joint neural embedding model can be decreased by, for example, decreasing the dimensionality of the embedding space. Irrespective of model complexity or hyper-parameter settings, training on the JO or AH collections did not result in any learning. This corresponds to the findings of [16], where fine-tuning outperformed training from scratch.
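To see why shrinking the embedding space reduces complexity: both modalities are projected into the shared space by linear layers, so the parameter count of those final layers grows linearly with the embedding dimension. The input feature sizes below are illustrative assumptions, not the thesis' exact values.

```python
# Sketch: parameter count of the two linear projections into the joint
# embedding space. Input dimensions (recipe and image features) are
# illustrative; only the linear scaling in embed_dim matters here.
def projection_params(recipe_dim, image_dim, embed_dim):
    """Total parameters (weights + biases) of two linear projections."""
    recipe_proj = recipe_dim * embed_dim + embed_dim
    image_proj = image_dim * embed_dim + embed_dim
    return recipe_proj + image_proj

full = projection_params(recipe_dim=2048, image_dim=2048, embed_dim=1024)
reduced = projection_params(recipe_dim=2048, image_dim=2048, embed_dim=256)
# Quartering the embedding dimension quarters these parameters.
```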
The failure to train the model from scratch is most likely due to the small training set sizes of the JO and AH collections. Training deep neural networks requires a large amount of training data [17]. Even though training from scratch did not work, fine-tuning the pre-trained model on the AH collection resulted in an increase of performance for both the R1M and AH collections. These results have two implications: (1) the model's learning of the AH collection greatly benefited from transfer learning; and (2) learning the target task (i.e. AH) can even improve performance on the base task (i.e. R1M). This project demonstrates the large advantage of transfer learning via fine-tuning over training from scratch.
8 CONCLUSIONS

In this paper we focused on the performance of the joint neural embedding model for amateur and professional recipes, and on the benefit of utilizing nutritional value information within this model. We showed that the pre-trained model does not perform equally well on amateur and professional recipes. As expected, performance is higher for amateur than for professional recipes. Fine-tuning the model on the Jamie Oliver collection did not work, probably due to the small size of the JO collection. This inference is supported by the fact that fine-tuning did work for the larger AH collection. The best method to enhance the pre-trained model's performance on the AH collection is to fine-tune the pre-trained model on the translated AH collection, using the pre-trained text-representation models. Surprisingly, this method resulted in an increase of performance for both the AH and the R1M collections. This suggests that the benefit of transfer learning is not restricted to the target task (i.e. professional recipes), but also serves the base task (i.e. amateur recipes). Finally, we found that nutritional value has meaningful discriminative power, in the sense that it can be used to distinguish between different types of dishes. We showed that adding nutritional value as a feature through a simple linear encoder increases the model's performance on the AH collection.
8.1 Acknowledgements

I want to thank my two supervisors, Thomas Mensink and Vladimir Nedović, for their enthusiasm and good advice.
REFERENCES

[1] P. Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10):78-87, 2012.
[2] A. Salvador, N. Hynes, Y. Aytar, J. Marin, F. Ofli, I. Weber, and A. Torralba. Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[4] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, 3294-3302, 2015.
[5] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104-3112, 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778, 2016.
[7] H. Bal, D. Epema, C. de Laat, R. van Nieuwpoort, J. Romein, F. Seinstra, C. Snoek, and H. Wijshoff. A medium-scale distributed system for computer science research: Infrastructure for the long term. IEEE Computer, 49(5):54-63, May 2016.
[8] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, http://www.deeplearningbook.org, 2016.
[9] M. Rokicki, C. Trattner, and E. Herder. The impact of recipe features, social cues and demographics on estimating the healthiness of online recipes. 2018.
[10] T. Kusmierczyk and K. Nørvåg. Online food recipe title semantics: Combining nutrient facts and topics. In Proc. of CIKM, 2013-2016, 2016.
[11] M. Chokr and S. Elbassuoni. Calories prediction from food images. In AAAI, 4664-4669, 2017.
[12] Z. Zheng, L. Zheng, M. Garrett, Y. Yang, and Y. D. Shen. Dual-path convolutional image-text embedding. arXiv preprint arXiv:1711.05535, 2017.
[13] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[14] L. Torrey and J. Shavlik. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, 242-264, 2010.
[15] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.
[16] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging, 35(5):1299-1312, 2016.
[17] D. Erhan, P. A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent. The difficulty of training deep architectures and the effect of unsupervised pre-training. In Artificial Intelligence and Statistics, 153-160, 2009.
[18] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, 3320-3328, 2014.
9 APPENDIX
Figure 11: Example recipes of the Recipe1M collection
Figure 12: Example recipes of the Jamie Oliver collection
Figure 13: Example recipes of the Allerhande collection
Approach   Learning rate   Number of iterations   Running time (hours)   Snapshot at iteration
1          0.00011         30,000                 7                      21,500
2          0.0001          25,500                 6                      21,500
3          0.000015        30,000                 7                      21,500
4          0.00013         45,000                 10                     33,000
5          0.00007         25,500                 6                      21,500
6          0.00002         18,000                 4                      17,000

Table 7: Hyper-parameter values during training
Figure 14: Training (blue) and validation (red) loss curves; fine-tuning pre-trained model on AH collection
Figure 15: Second ranking based on nutritional value. The query recipe (on the left) and the first five retrieved recipes (on the right), ranked on the basis of Euclidean distance between the normalized nutritional value vectors.
Figure 16: Third ranking based on nutritional value. The query recipe (on the left) and the first five retrieved recipes (on the right), ranked on the basis of Euclidean distance between the normalized nutritional value vectors.