Transfer Learning in Joint Neural Embeddings for Cooking Recipes and Food Images

Submitted in partial fulfillment for the degree of Master of Science

Mariel van Staveren
11773952

Master Information Studies
Data Science
Faculty of Science
University of Amsterdam

Date of defence: 2018-06-22

Internal Supervisor: Dr Thomas Mensink, UvA, FNWI, [email protected]
External Supervisor: Dr Vladimir Nedović, Flavourspace, [email protected]
Contents

Abstract
1 Introduction
2 Related Work
2.1 Transfer Learning
2.2 Nutritional value
3 The joint neural embedding model
3.1 Representation of ingredients
3.2 Representation of cooking instructions
3.3 Representation of images
3.4 Joint neural embedding
4 The recipe collections
4.1 Recipe1M (R1M)
4.2 Jamie Oliver (JO)
4.3 Allerhande (AH)
5 Performance of the pre-trained model
5.1 Pre-processing JO and AH
5.2 Preparing the test-sets
5.3 Intra-collection retrieval
5.4 Inter-collection retrieval
6 Experiments
6.1 Fine-tune pre-trained model on Jamie Oliver
6.2 Fine-tune pre-trained model on Allerhande
6.3 Nutritional value as a feature
7 Fine-Tuning versus Training from scratch
8 Conclusions
8.1 Acknowledgements
References
9 Appendix
Transfer Learning in Joint Neural Embeddings for Cooking Recipes and Food Images
ABSTRACT

The research focus of this paper is two-fold. First, we address the performance of the joint neural embedding model for cooking recipes and food images [2] over different recipe collections. The model is trained on a large recipe collection that contains many user-submitted recipes and food images. Performance on professional recipes is therefore expected to be low. To enhance the usability of the model, our aim is to produce one model that performs well on both professional and amateur recipes via transfer learning. Second, given the increased interest in and access to nutritional value information, we assess the benefit of adding nutritional value information as a new feature in the model. Two small professional recipe collections are used in this project: the Jamie Oliver (JO) and the Dutch Allerhande (AH) collection. Transfer learning is deployed to increase the model's performance on these professional recipe collections via multiple fine-tuning methods. The results suggest that the JO collection is too small to achieve an increase in performance through fine-tuning. We found that the best method to increase the model's performance on the AH collection is to fine-tune the pre-trained model on the translated AH collection, using the pre-trained text-representation models. Interestingly, this method results in increased performance on amateur recipes as well. This means that the benefits of transfer learning are not restricted to the target task (i.e. professional recipes), but serve the base task (i.e. amateur recipes) as well. Finally, qualitative and quantitative experiments show that the model's performance can be increased by adding nutritional value as a new feature.
1 INTRODUCTION

People's lives are increasingly intertwined with the World Wide Web, including one of our most fundamental needs: food. A large corpus of cooking recipes and food images is currently available on the Web. The largest publicly available recipe collection has been constructed by Salvador et al. [2]. This collection, referred to as the Recipe1M (R1M) collection, contains over 1 million cooking recipes and 800k food images. Using this collection, Salvador et al. created a joint neural embedding model to embed recipes and food images in a common high-dimensional vector space. The model is trained to minimize the distance in space between a recipe and its corresponding food image. The model yields impressive results on image-recipe and recipe-image retrieval tasks. Based on this model, an application could be developed where, for example, users can input a picture of a delicious lunch and receive a corresponding recipe so they can recreate the dish.
Master Thesis, June 2018, UvA © 2018

However, the R1M collection contains many user-submitted recipes and food images. Amateur recipes generally differ from professional recipes in the type and number of ingredients and the style of the instructions. Additionally, amateur food images differ from professional images in aspects such as composition, resolution, lighting, clarity, and color distribution. Consequently, the joint neural embedding model's performance on professional recipe collections is expected to be low. This limits the model's usability in the sense that application developers have to restrict the possible scope of inputs and outputs to amateur recipes and food images. This means that users can only correctly retrieve recipes and food images from fellow amateur cooks, while they may actually be interested in suggestions from a professional chef.
Transfer learning refers to improved learning of a target task through the transfer of knowledge from a base task [14]. Amateur and professional recipes can be considered as two separate tasks. In this case, the joint neural embedding model, pre-trained on the amateur recipes and food images of the R1M collection, serves as the base model. This pre-trained model's knowledge of amateur recipes can be exploited to train a new model that learns to embed professional recipes and food images. However, our aim is to produce one model that performs well on both professional and amateur recipes. Therefore, our approach diverges from traditional transfer learning in the sense that we aim for improved learning of a target task (i.e. professional recipes) without compromising the performance on the base task (i.e. amateur recipes).
In this project, two small professional recipe collections are used: the Jamie Oliver (JO) collection and the Allerhande (AH) collection. The AH collection has two interesting properties: the recipes are in Dutch instead of English, and the recipes contain nutritional value information. First, we assess how well the model performs on these professional recipe collections compared to amateur recipes from the R1M collection. Next, transfer learning is applied by fine-tuning the pre-trained model on the professional recipe collections. Additionally, we want to assess the benefit of adding nutritional value information as a new feature in the model. Consequently, the research questions posed in this work are as follows:
Question 1: Performance of pre-trained model. Does the joint neural embedding model, pre-trained on amateur recipes and food images, perform equally well on professional and amateur recipe collections? To answer this question, the pre-trained model is tested on image-recipe and recipe-image retrieval tasks for the R1M, JO, and AH recipe collections separately. This is referred to as intra-collection retrieval, because the query item and the retrieved items originate from the same collection. The pre-trained model is also tested on inter-collection retrieval. For example, recipes from the JO collection are retrieved for a query image from the R1M collection. Low performance on inter-collection retrieval would indicate a discrepancy between amateur and professional recipes.
Question 2: Fine-tune pre-trained model on Jamie Oliver. What is the best fine-tuning method to enhance the pre-trained model's performance on the JO collection? To answer this question, multiple fine-tuning methods are applied. Learning is deemed to be correct when no over- or under-fitting is apparent. All methods that result in correct learning are further evaluated by testing the new models on the intra- and inter-collection retrieval tasks. The goal is to increase the model's performance on the JO collection without compromising the performance on the R1M collection.
Question 3: Fine-tune pre-trained model on Allerhande. What is the best method to enhance the pre-trained model's performance on a recipe collection that is in Dutch instead of English (i.e. the AH collection)? Retrieval performance of the pre-trained model is used as a baseline. Multiple methods are applied to increase the model's performance on the AH collection. The goal is to increase the model's performance on the AH collection without compromising the performance on the R1M collection.
Question 4: Nutritional value as a feature. Does adding nutritional value as a feature increase the model's performance on the AH collection? Retrieval performance of the best-performing model of the previous section is used as a baseline. The AH collection contains nutritional value information for each recipe. A qualitative assessment is designed to investigate whether nutritional value has any meaningful discriminative power. Next, nutritional value is incorporated as a feature in the model. Retrieval performance of the new model is compared to the baseline.
Overview of thesis. The Related Work section reviews relevant academic work. Next, the joint neural embedding model and the text-representation models are described. The next section contains information on the content of the three recipe collections (R1M, JO, and AH). The fifth section ("Performance of the pre-trained model") describes the experiments that are used to test the model's performance on all three recipe collections, and reviews the results. The Experiments section first explains the experiments that are designed to increase the model's performance on the JO and AH recipe collections through transfer learning. Additionally, the last experiment assesses the added value of utilizing nutritional value information as a feature. The next section ("Fine-Tuning versus Training from scratch") discusses an additional observation concerning large performance differences between training from scratch and fine-tuning methods. Finally, the outcomes are summarized in the Conclusions section.
2 RELATED WORK

2.1 Transfer Learning

Transfer learning is a method where a trained model is used as a starting point for another model on a related task [15]. Transfer learning is a popular method in deep learning because using a pre-trained model saves time and computational resources [17].
In research by [16], a pre-trained deep convolutional neural network (CNN) was fine-tuned on medical images to perform tasks such as classification, detection, and segmentation. Such a pre-trained CNN is typically trained on a large set of labeled natural images. They compared the fine-tuned CNN model to a CNN model that was trained from scratch. The results showed that the fine-tuned model outperformed the model trained from scratch. Importantly, they analyzed how the size of the training set influenced the performance of both models. A reduced training-set size led to a larger decrease in performance for the model trained from scratch than for the fine-tuned model. This means that the fine-tuned CNN is more robust to training-set size.
The fine-tuning approach used by [16] is similar to ours. In our approach, the joint neural embedding model by Salvador et al. [2] is fine-tuned on small professional recipe collections. Our approach diverges from the approach used by [16] because we aim to preserve the model's performance on the base task (i.e. amateur recipes), while increasing the model's performance on the target task (i.e. professional recipes). This project contributes to the field of transfer learning by testing the feasibility of this approach within the joint neural embedding model.
2.2 Nutritional value

People increasingly take nutritional value into account when making food choices. Recent research has focused on algorithmic nutritional estimation from text [10] or images [11]. Interestingly, research by [9] showed that simple models outperform human raters on nutritional estimation tasks. This means that nutritional value information can be obtained for all recipes, even for recipes that do not explicitly contain nutritional value information. Due to the increased interest in and easy access to nutritional value information, incorporating it into image-recipe embeddings is a meaningful contribution.
3 THE JOINT NEURAL EMBEDDING MODEL

Recipes consist of three features: the ingredients, the cooking instructions, and the food image. The representations of these features are discussed first. Next, the joint neural embedding model is described.
3.1 Representation of ingredients

Ingredient names are extracted from the ingredient text. For example, "olive oil" is extracted from "2 tbsp of olive oil". Each ingredient is represented by a word2vec representation [3]. The skip-gram word2vec model represents each word as a vector. Two vectors are close in vector space when the corresponding words appear in similar contexts. The word2vec model has been pre-trained on the cooking instructions of the R1M collection, and returns vectors with a dimensionality of 300. The pre-trained word2vec model has been made publicly available by Salvador et al. [2].
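The extraction-and-lookup step can be sketched as follows. This is a minimal illustrative sketch, not the thesis code: the `FILLER` stop-word list and the tiny `WORD2VEC` table (4-dimensional toy vectors instead of 300) are assumptions for demonstration only.

```python
import re

# Toy stand-in for the pre-trained word2vec table; the real model
# maps each known word to a 300-dimensional vector.
WORD2VEC = {
    "olive_oil": [0.1, 0.3, -0.2, 0.7],
    "garlic":    [0.2, 0.1,  0.5, -0.1],
}

# Units and filler words to strip from the ingredient text (assumed list).
FILLER = {"tbsp", "tsp", "cup", "cups", "of", "g", "kg", "ml"}

def extract_ingredient_name(ingredient_text):
    """Extract the ingredient name, e.g. "olive oil" from "2 tbsp of olive oil"."""
    tokens = re.findall(r"[a-zA-Z]+", ingredient_text.lower())
    name_tokens = [t for t in tokens if t not in FILLER]
    return "_".join(name_tokens)  # multi-word names joined for vocabulary lookup

def ingredient_vector(ingredient_text):
    """Look up the word2vec vector; None if the ingredient is out of vocabulary."""
    return WORD2VEC.get(extract_ingredient_name(ingredient_text))

print(extract_ingredient_name("2 tbsp of olive oil"))  # olive_oil
```

The out-of-vocabulary case matters later: recipes without any known ingredients are excluded when preparing the test-sets (section 5.2).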
3.2 Representation of cooking instructions

Cooking instructions are represented through a two-stage LSTM method. An LSTM is a recurrent neural network that can learn long-term word dependencies. LSTMs are suitable for language modeling because the probability of a word sequence can be modeled. In the first stage, a sequence-to-sequence LSTM model is applied to each single cooking instruction to obtain a so-called skip-instructions vector representation [5]. This first LSTM model is referred to as the skip-instructions model, and it has been trained on the R1M collection. The second stage of the two-stage LSTM method is integrated in the joint neural embedding model, and is discussed in section 3.4.

The word2vec model and the skip-instructions model together are referred to as the text-representation models.
3.3 Representation of images

All food images are resized and center-cropped to 256 × 256 pixels. The images are represented by adopting the deep convolutional neural network (CNN) ResNet-50 [6]. The ResNet-50 model is integrated in the joint neural embedding model, and is discussed in section 3.4.
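The resize-and-crop step can be sketched in a few lines of numpy. This is a sketch under assumptions: nearest-neighbour interpolation is used for brevity (the actual pipeline's interpolation method is not specified here), and images are height × width × channel arrays.

```python
import numpy as np

def resize_shorter_side(img, target=256):
    """Nearest-neighbour resize so the shorter side equals `target`."""
    h, w = img.shape[:2]
    scale = target / min(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Map each output pixel back to its nearest source pixel.
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    return img[rows][:, cols]

def center_crop(img, size=256):
    """Crop a size x size patch from the center of the image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

img = np.zeros((562, 646, 3), dtype=np.uint8)   # average R1M image size (Table 1)
out = center_crop(resize_shorter_side(img))
print(out.shape)  # (256, 256, 3)
```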
3.4 Joint neural embedding

The joint neural embedding model is implemented in Torch7 [13]. The model is visualized in Figure 1 (adapted from Salvador et al. [2]). It contains two encoders: one for ingredients, and one for cooking instructions. The ingredients-encoder combines the word2vec vectors of all ingredients through the use of a bidirectional LSTM model. The instructions-encoder forms the second stage of the two-stage LSTM method discussed in section 3.2. This second LSTM model represents all skip-instructions vectors of a recipe as one vector. The encoder outputs are concatenated to obtain the recipe representation. The recipe representation is embedded into the joint neural embedding space. As discussed before, the image representations are obtained through the ResNet-50 model. The ResNet-50 model is incorporated into the joint neural embedding model by removing the final softmax classification layer and projecting the image representation into the embedding space through a linear transformation. The joint neural embedding model is trained to learn transformations that minimize the distance in space between a recipe and its corresponding image.
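The flow from encoder outputs to a shared embedding can be sketched in plain numpy. This is a schematic sketch, not the Torch7 implementation: the dimensionalities and the random stand-in projection matrices are illustrative assumptions, and training (learning the projections) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the model's actual dimensionalities).
D_INGR, D_INSTR, D_IMG, D_EMB = 8, 8, 16, 10

# Random stand-ins for the learned linear projections into the joint space.
W_recipe = rng.normal(size=(D_INGR + D_INSTR, D_EMB))
W_image  = rng.normal(size=(D_IMG, D_EMB))

def embed_recipe(ingr_encoding, instr_encoding):
    """Concatenate the two encoder outputs and project into the joint space."""
    recipe = np.concatenate([ingr_encoding, instr_encoding])
    return recipe @ W_recipe

def embed_image(image_features):
    """Project image features (softmax layer removed) into the joint space."""
    return image_features @ W_image

def cosine_distance(a, b):
    """Training minimizes this distance for matching recipe-image pairs."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

r = embed_recipe(rng.normal(size=D_INGR), rng.normal(size=D_INSTR))
v = embed_image(rng.normal(size=D_IMG))
print(cosine_distance(r, v))  # some value in [0, 2]
```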
4 THE RECIPE COLLECTIONS

Three recipe collections are used in this project: R1M, JO, and AH. An overview of collection characteristics is depicted in Table 1. Complete example recipes from each recipe collection are added to the Appendix (see Figures 11, 12, and 13).

Figure 2: Examples of food images from the Recipe1M collection
4.1 Recipe1M (R1M)

The Recipe1M collection has been made publicly available by Salvador et al. [2]. This dataset was collected by scraping over two dozen cooking websites, extracting and cleaning relevant text from the raw HTML, and downloading associated images. The features that are stored for each recipe are: ID, title, instructions, ingredient names, partition (i.e. train, test, or validation), the URL, and the names of the images that the recipe is associated with. Examples of R1M food images are shown in Figure 2.
4.2 Jamie Oliver (JO)

The Jamie Oliver (JO) collection was scraped from the website jamieoliver.com. Compared to the R1M collection, the JO collection is much smaller, and on average contains more ingredients and cooking instructions (see Table 1). Food images in the JO collection are of higher quality (with respect to composition, resolution, etc.) than the food images from the R1M collection (see Figure 3).

Figure 3: Examples of food images from the Jamie Oliver collection
4.3 Allerhande (AH)

The Allerhande (AH) collection was scraped from the website allerhande.nl. The AH collection is much smaller than the R1M collection, but larger than the JO collection. Compared to the R1M collection, food image quality is high (with respect to composition, resolution, etc.) (see Figure 4). Additionally, food images in the AH collection are much wider and larger than food images from the other two collections.

Figure 4: Examples of food images from the Allerhande collection
Figure 1: Overview of the joint neural embedding model. Figure adapted from Salvador et al. [2].
                                       Recipe1M (R1M)              Jamie Oliver (JO)   Allerhande (AH)
Website of origin                      Various well-known recipe   jamieoliver.com     allerhande.nl
                                       collections (e.g. food.com,
                                       kraftrecipes.com,
                                       allrecipes.com,
                                       tastykitchen.com)
Language                               English                     English             Dutch
Total number of recipes                1,029,720                   1097                13179
Train | Test | Val                     n/a | 3480 | n/a            571 | 142 | 77      8645 | 2463 | 1233
Average number of ingredients          9.3 ± 4.3                   11.7 ± 5.6          7.6 ± 2.2
Average number of instructions         10.4 ± 6.9                  15.6 ± 8.2          11.8 ± 4.9
Average instruction length (in words)  60.2 ± 36.8                 92.3 ± 66.6         51.8 ± 33.6
Average image size (height × width)    562 × 646                   689 × 513           1600 × 550

Table 1: Overview of collection characteristics. The total number of recipes includes recipes that are removed by pre-processing.
5 PERFORMANCE OF THE PRE-TRAINED MODEL

In this section, we assess the performance of the pre-trained model on all three recipe collections.
5.1 Pre-processing JO and AH

After scraping the JO and AH datasets from their corresponding websites of origin, relevant text is extracted from the raw HTML. The text is cleaned by removing excessive whitespace, HTML entities, and non-ASCII characters (method adopted from Salvador et al. [2]). Next, all recipes are assigned a unique 10-digit hexadecimal ID. The recipes are segmented into training, test, and validation sets (ratio: 0.7, 0.2, 0.1, respectively).
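These three pre-processing steps can be sketched with the standard library. This is a sketch, not the thesis pipeline: the function names and the order of the cleaning operations are assumptions.

```python
import html
import random
import re
import secrets

def clean_text(raw):
    """Clean extracted recipe text: unescape HTML entities, drop non-ASCII
    characters, and collapse excessive whitespace (order of steps assumed)."""
    text = html.unescape(raw)
    text = text.encode("ascii", errors="ignore").decode("ascii")
    return re.sub(r"\s+", " ", text).strip()

def assign_id():
    """Assign a unique 10-digit hexadecimal recipe ID."""
    return secrets.token_hex(5)  # 5 random bytes -> 10 hex digits

def split_recipes(recipes, seed=42):
    """Segment recipes into train/test/validation sets (0.7 / 0.2 / 0.1)."""
    recipes = recipes[:]
    random.Random(seed).shuffle(recipes)
    n = len(recipes)
    n_train, n_test = int(0.7 * n), int(0.2 * n)
    return (recipes[:n_train],
            recipes[n_train:n_train + n_test],
            recipes[n_train + n_test:])

print(clean_text("2  tbsp\tolive&amp;oil \u2013 chopped"))  # 2 tbsp olive&oil chopped
```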
5.2 Preparing the test-sets

Test-sets are prepared for the R1M, JO, and AH collections. Since the original R1M collection is very large, a subset of the R1M test-recipes suffices. The sizes of the test-sets are depicted in Table 1. To be able to apply the pre-trained model to the AH collection (originally in Dutch), the AH test-recipes are translated into English through Google Translate. The JO test-set consists of all JO test-recipes.

                 im2recipe                      recipe2im
                 MedR   R@1    R@5    R@10     MedR   R@1    R@5    R@10
R1M              5.75   0.229  0.495  0.621    6.9    0.217  0.462  0.587
JO               9.95   0.129  0.383  0.514    14.1   0.098  0.292  0.438
(Translated) AH  18.55  0.093  0.287  0.389    21.8   0.059  0.206  0.342

Table 2: Performance of the pre-trained model on the R1M, JO, and AH collections
For each of the three test-sets, recipe representations are extracted315
using the text-representation models as discussed in section 3.316
The JO test-set is small because any recipes that contain more317
than 20 instructions or ingredients are excluded. Recipes that do318
not contain any known ingredients (i.e. ingredients that are in the319
vocabulary of the word2vec model) are excluded as well. From320
the JO collection, 307 recipes were excluded, often because of the321
number of instructions exceeding 20.322
Finally, the pre-trained joint neural embedding model is applied323
to all three test-sets. For each recipe, the model returns two vectors324
that represent the recipe and the corresponding image in embedding325
space. These vector representations are used in the subsequent326
retrieval experiments.327
5.3 Intra-collection retrieval

The pre-trained model is tested, for each test-set, on two retrieval tasks: the im2recipe and the recipe2im task. In the im2recipe task, recipes are retrieved for a query food image. In the recipe2im task, food images are retrieved for a query recipe. The im2recipe task is performed by randomly selecting a subset of 100 test recipes and their corresponding images. Each recipe and food image is represented by a vector in the embedding space. The similarity of two vectors is determined by their cosine similarity according to the equation:

cos(x, y) = (x · y) / (||x|| ||y||)    (1)

For each image in the subset, all recipes are ranked on the basis of their cosine similarity to the image. The rank signifies the position of the ground-truth recipe in the list of ranked recipes. When all images in the subset have been queried, the median rank (MedR) and the recall rates at top 1, 5, and 10 (R@1, R@5, and R@10) are calculated (adopted from Salvador et al. [2]). This experiment is repeated 10 times. Mean performances are reported. The recipe2im task is evaluated in the same manner.
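The ranking and metric computation described above can be sketched as follows. This is a minimal numpy sketch under one assumption: row i of `recipes` is the ground-truth match for row i of `images`.

```python
import numpy as np

def retrieval_metrics(images, recipes):
    """Compute MedR and R@K for im2recipe retrieval.

    `images` and `recipes` are (N, d) arrays of embedding vectors;
    row i of `recipes` is the ground truth for image i.
    """
    # Normalize rows so dot products equal cosine similarities (Eq. 1).
    images = images / np.linalg.norm(images, axis=1, keepdims=True)
    recipes = recipes / np.linalg.norm(recipes, axis=1, keepdims=True)
    sims = images @ recipes.T                       # (N, N) cosine similarities
    # For each query image, rank all recipes by similarity (1 = best).
    order = np.argsort(-sims, axis=1, kind="stable")
    ranks = np.array([np.where(order[i] == i)[0][0] + 1
                      for i in range(len(images))])
    medr = float(np.median(ranks))
    recall = {k: float(np.mean(ranks <= k)) for k in (1, 5, 10)}
    return medr, recall

# Tiny self-retrieval example: each image is closest to its own recipe.
emb = np.eye(12)
medr, recall = retrieval_metrics(emb, emb)
print(medr, recall[1])  # 1.0 1.0
```

In the experiments this would be run on a random subset of 100 recipe-image pairs and repeated 10 times, averaging the results; the recipe2im direction simply swaps the two arguments.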
Mean performances are displayed in Table 2. As expected, performance on the R1M test-set is higher than on the JO and AH test-sets, for both retrieval tasks and all performance measures. Interestingly, performance on the JO test-set is much higher than on the (translated) AH test-set. This signifies that the model's performance is collection-specific. This collection-specificity most likely depends on how similar the specific collection is to the R1M collection with respect to image and recipe features and the co-occurrence of these features. The results imply that the R1M collection is more similar to the JO collection than to the AH collection. Another possibility is that the low performance on the AH test-set is due to translation errors. Overall, the results suggest that the pre-trained joint neural embedding model does not perform equally well on professional and amateur recipes.

Figure 5: Inter-collection (R1M & JO) ranking results. These plots show the sorted reported ranks (on the y-axis) for 10 randomly chosen query items (on the x-axis). When no relevant recipe was found, the rank was set to 16.

Figure 6: Inter-collection (R1M & AH) ranking results. These plots show the sorted reported ranks (on the y-axis) for 10 randomly chosen query items (on the x-axis). When no relevant recipe was found, the rank was set to 16.

Figure 7: Example of ranking and subsequent relevance judgment. In this example, JO recipes are retrieved for an R1M image query (i.e., im2recipe). Only the first six recipes (excluding cooking instructions) are shown due to limited space. The recipe that has been judged as "relevant" is encircled by the green box. In this case, rank = 1.
5.4 Inter-collection retrieval

Inter-collection retrieval refers to retrieving items from collection A for a query item from collection B. This experiment is designed to assess the model's ability to directly match items from different recipe collections. The R1M collection is matched with both the AH and JO collections. The experiment is performed in both directions (i.e. from R1M to JO/AH, and from JO/AH to R1M). The method is explained by walking through an example where recipes from the JO collection are retrieved for food image queries from the R1M collection (i.e. im2recipe).
The JO test-set does not contain the ground-truth recipe that belongs to the R1M query image. Therefore, the relevance of the retrieved recipes has to be determined qualitatively. A recipe is deemed relevant to the query image if it describes a similar dish-type, with similar ingredients.

First, one R1M query image is randomly selected from the R1M test-set. For any R1M query image, there is a possibility that the JO test-set by chance does not contain any relevant items. To diminish the probability of this happening, a recipe-subset of 130 (instead of 100) recipes is randomly selected from the JO test-set. The size of this subset is limited by the size of the JO test-set. Similar to the intra-collection experiment, all JO test-recipes are ranked on the basis of their cosine similarity to the R1M query image.
Finally, the first 15 retrieved recipes are manually inspected, and the rank of the first relevant recipe is reported. When no relevant recipe is found, the rank is set to 16. This experiment is repeated 10 times. An example is shown in Figure 7.

The reported ranks for the R1M and JO combination are sorted and shown in Figure 5. In the im2recipe task, six out of the ten queries resulted in a relevant recipe in the top 15. Performance is lower for the recipe2im task, which corresponds to the results of intra-collection retrieval (see Table 2). The reported ranks for the R1M and AH combination are shown in Figure 6. In the im2recipe task, only three out of the ten queries resulted in a relevant recipe in the top 15. Overall, these results suggest that the pre-trained model's ability to directly match items from amateur and professional collections is limited. This emphasizes the discrepancy between amateur and professional recipes.
6 EXPERIMENTS

6.1 Fine-tune pre-trained model on Jamie Oliver

This section describes the experiments that have been designed to answer the second research question: What is the best method to enhance the pre-trained model's performance on the JO collection?
Method  Text-representation models  Fixed weights
1       Trained on R1M              No
2       Trained on R1M              Yes
3       Trained on JO               No
4       Trained on JO               Yes

Table 3: Fine-tuning methods for the Jamie Oliver collection
Four different fine-tuning methods are proposed. The approaches differ in weight fixation and the specific text-representation models that are used. These variables are described below. An overview of all four approaches is shown in Table 3.
Preparing the dataset. To maximize the number of training recipes, the JO recipe collection is re-segmented into a training and validation set (ratio: 0.9 and 0.1, respectively). The number of instructions and ingredients is limited to 20, to prevent recipes from being excluded. The training-set contains 985 recipes, and the validation-set contains 110 recipes.
Text-representation models. As discussed before, the word2vec and the skip-instructions model are together referred to as the text-representation models. These models are used to extract recipe representations when preparing the dataset for training and testing. There are two possibilities: either the pre-trained text-representation models are used (i.e. trained on the R1M collection), or the text-representation models are completely re-trained on the JO collection.
Weight fixation. Weight fixation refers to freezing model parameters during training. This can be used to restrict the learning to a specific part of the model. The amount of weight fixation depends on which text-representation models are used. If the pre-trained text-representation models are used, either all model parameters are fine-tuned (i.e. no weights are fixed) or only the parameters of the last two layers are fine-tuned. These are the layers that project the recipe and image representations onto the embedding space. If the text-representation models are trained on the JO collection, fine-tuning only the last two layers is insufficient, because the ingredients and instructions encoders have to be adjusted to incorporate the new text-representation models. Therefore, the layers representing the two encoders are fine-tuned in addition to the last two layers.
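Weight fixation can be sketched framework-agnostically as skipping the gradient update for frozen parameters. The parameter names below are assumptions for illustration; in a real framework this corresponds to disabling gradient computation for the frozen layers.

```python
import numpy as np

# Assumed parameter names for illustration; the real model has many more.
params = {
    "ingredients_encoder": np.ones(3),
    "instructions_encoder": np.ones(3),
    "recipe_projection": np.ones(3),   # one of the "last two layers"
    "image_projection": np.ones(3),    # the other projection layer
}

# Fine-tune only the projection layers; freeze (fix) everything else.
frozen = {"ingredients_encoder", "instructions_encoder"}

def sgd_step(params, grads, lr=0.01):
    """Apply one gradient step, skipping frozen (weight-fixed) parameters."""
    for name, g in grads.items():
        if name not in frozen:
            params[name] -= lr * g
    return params

grads = {name: np.full(3, 2.0) for name in params}
sgd_step(params, grads)
print(params["ingredients_encoder"], params["recipe_projection"])
```

After the step, the frozen encoder weights are unchanged while the projection layers have moved, which is exactly the restriction described above.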
The loss curves for each fine-tuning method are shown in Figure 8. For all plots, the values of the hyper-parameters are fixed to allow for comparison. All plots show a decreasing training loss yet unchanging validation loss. This suggests that the model is not learning any new trends. Transferability of features is limited when the distance between the base task (i.e. R1M) and target task (i.e. JO) is large [18]. However, this is an unlikely explanation given the relatively small performance difference of the pre-trained model on the R1M and JO recipe collections (see Table 2). The unchanging validation loss could be an indication of over-fitting. This means that instead of learning to match JO recipes to JO food images, the model "memorizes" the correct recipe-image mappings from the JO training-set. This implies that the JO training set is too small and the model too complex. The unchanging validation loss could also be an indication of an imbalance between the training and validation sets. In that case, the model correctly learns the underlying trends in the training set, but fails to perform well on the validation set due to the differences between the sets. Adjusting the hyper-parameters or the amount of weight fixation in any of the fine-tuning methods did not improve the results.

Figure 8: Training (blue) and validation (red) loss curves. For all plots, the hyper-parameters are fixed: batch size = 15, learning rate = 0.00008, number of iterations = 15,000, running time = 3 hours.
Given that none of the four methods resulted in correct learning, no model evaluation is performed. These results suggest that it is difficult to increase the pre-trained model's performance on the JO collection. This might be due to the small size of the JO collection and the high complexity of the joint neural embedding model. The model's performance on the JO collection can possibly be increased by either increasing the size of the JO collection, or decreasing the complexity of the model.
6.2 Fine-tune pre-trained model on Allerhande

This section describes the experiments that have been designed to answer the third research question: What is the best method to enhance the pre-trained model's performance on a recipe collection that is in Dutch instead of English (i.e. the AH collection)? Six different fine-tuning methods are proposed. In addition to weight fixation and the specific text-representation models, a new variable is introduced: language. The joint neural embedding model is fine-tuned either on the original Dutch AH recipe collection, or on the AH recipe collection that has been translated to English. An overview of all six methods is depicted in Table 4.

Method   Language   Text representation models   Fixed weights
1        Dutch      Trained on Dutch AH          No
2        Dutch      Trained on Dutch AH          Yes
3        English    Trained on R1M               No
4        English    Trained on R1M               Yes
5        English    Trained on English AH        No
6        English    Trained on English AH        Yes

Table 4: Fine-tuning methods for Allerhande collection

           AH                R1M
           im2rec   rec2im   im2rec   rec2im
Baseline   36.0     39.55    5.75     6.9
1          7.55     7.45     50.7     49.1
2          6.05     5.8      48.25    46.05
3          3.5      3.4      3.25     3.4
4          5.15     5.35     3.35     3.45
5          6.35     6.55     21.1     22.7
6          13.95    14.9     29.15    34.15

Table 5: Performance (MedR) on retrieval tasks for each fine-tuning method, on both R1M and AH. The best method (method 3) is depicted in bold.
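The "fixed weights" variable in Table 4 refers to weight fixation: during fine-tuning, frozen parameters are excluded from gradient updates while the remaining parameters continue to learn. The sketch below illustrates this mechanism only; it is not the thesis code, and the parameter names, shapes, and optimizer are hypothetical (the original model was implemented in Torch7 [13]).

```python
import numpy as np

# Illustrative sketch of weight fixation during fine-tuning:
# parameters in the `frozen` set are skipped by the update step.
# Names and shapes are hypothetical.
rng = np.random.default_rng(0)
params = {
    "text_encoder": rng.standard_normal((4, 4)),      # pre-trained, frozen
    "joint_projection": rng.standard_normal((4, 2)),  # fine-tuned
}
frozen = {"text_encoder"}  # methods with "Fixed weights = Yes"

def sgd_step(params, grads, lr=0.01):
    """One SGD update that leaves frozen parameters untouched."""
    return {
        name: w if name in frozen else w - lr * grads[name]
        for name, w in params.items()
    }

grads = {name: np.ones_like(w) for name, w in params.items()}
before = {name: w.copy() for name, w in params.items()}
params = sgd_step(params, grads)
```

In a deep-learning framework the same effect is usually achieved by disabling gradient computation for the frozen layers rather than masking the update manually.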
The baseline is the performance of the pre-trained model on the intra-collection retrieval experiment for both the R1M and Dutch AH collections (see Table 5). Only the median rank (MedR) measures are reported for clarity. As expected, baseline performance for the Dutch AH collection is low. This is because the text-representation models have been trained on the English R1M dataset; the dictionary of the word2vec representation model therefore does not contain any Dutch words.
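For reference, the metrics reported here can be computed directly from the rank that each query's true match receives: MedR is the median of these ranks (lower is better), and R@K is the fraction of queries whose true match appears in the top K. The sketch below uses a made-up list of ranks, not values from the thesis.

```python
import numpy as np

# Retrieval metrics from a list of ranks (1 = best match retrieved first).
# The rank values here are illustrative only.
ranks = np.array([1, 3, 2, 8, 15, 1, 4, 2, 30, 5])

med_r = float(np.median(ranks))                           # median rank
recall_at = {k: float(np.mean(ranks <= k)) for k in (1, 5, 10)}  # R@K
```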
The optimized training and validation loss curves and hyper-parameters are shown in the Appendix (Figure 14 and Table 7, respectively). The evaluation results are depicted in Table 5. The third method results in both the highest performance (for AH and R1M separately) and the smallest performance difference (between AH and R1M). In this method, the pre-trained model was fine-tuned on the English AH collection, using the pre-trained text-representation models.
An interesting observation is that fine-tuning the pre-trained model on the English AH collection increases performance on the R1M collection as well. This indicates that the AH and R1M collections share a certain pattern that the pre-trained model did not sufficiently detect when training on the R1M collection. This corresponds to the assumption that, in transfer learning, the factors that explain the variations in one setting are needed to capture the variations in the other setting [8]. In this case, factors that explain the variations in the AH collection are used to capture the variations in the R1M collection, and vice versa. These results suggest that the benefit of transfer learning is not restricted to one direction (i.e. from R1M to AH), but can manifest itself bi-directionally (i.e. from R1M to AH and vice versa).

Figure 9: Inter-collection ranking results for method 3. These plots show the sorted reported ranks (y-axis) for 10 randomly chosen query items (x-axis).
The third method has also been tested on inter-collection retrieval. The reported ranks are shown in Figure 9. These ranks are generally lower than the reported ranks using the pre-trained model (see Figure 6). This means that fine-tuning the model on the English AH collection increased the model's ability to directly match items from the R1M and AH collections. Overall, the best method to enhance the model's performance on the AH collection is the third method, where the pre-trained model is fine-tuned on the translated AH dataset, using the pre-trained text-representation models.
Figure 10: First ranking based on nutritional value. The query recipe (on the left) and the first five retrieved recipes (on the right), ranked on the basis of Euclidean distance between the normalized nutritional value vectors.

                         im2recipe                          recipe2im
                         MedR    R@1     R@5      R@10      MedR    R@1     R@5     R@10
Excl. nutritional value  3.29    0.299   0.624    0.757     3.255   0.303   0.629   0.755
Incl. nutritional value  3.05    0.312   0.648    0.773     3.13    0.309   0.646   0.773
t-value                  2.560   -2.172  -3.448   -2.574    1.234   -0.977  -2.388  -2.882
p-value                  0.011*  0.031*  0.0006*  0.010*    0.218   0.329   0.017*  0.004*

Table 6: Effect of the nutritional value feature on performance on the (translated) AH collection
6.3 Nutritional value as a feature

This section describes the experiments that have been designed to answer the fourth research question: Does adding nutritional value as a feature increase the model's performance on the AH collection?

Pre-processing nutritional information. The AH collection contains information on nutritional value for each recipe. There are six nutritional categories: fat, protein, fibers, energy, sodium, and carbohydrates. The values of each category are normalized across all recipes to bring all values into a range between 0 and 1 (i.e. unity-based normalization), following the equation:

X_{norm} = (X - X_{min}) / (X_{max} - X_{min})    (2)

For each recipe, all six nutritional values are stored in one vector.
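To make Equation (2) concrete, the sketch below applies unity-based normalization per category across a handful of recipes. The nutritional values and the use of NumPy are illustrative assumptions, not the thesis code.

```python
import numpy as np

# Unity-based (min-max) normalization of Equation (2), applied per
# nutritional category (column) across all recipes (rows). Values are
# made up; columns: fat, protein, fibers, energy, sodium, carbohydrates.
nutrition = np.array([
    [10.0,  5.0, 1.0, 200.0, 0.2, 30.0],
    [30.0, 20.0, 4.0, 600.0, 1.0, 80.0],
    [20.0,  5.0, 2.5, 400.0, 0.6, 55.0],
])

x_min = nutrition.min(axis=0)
x_max = nutrition.max(axis=0)
normalized = (nutrition - x_min) / (x_max - x_min)  # each column now in [0, 1]
```

Normalizing per category keeps, say, energy (hundreds of kcal) from dominating sodium (fractions of a gram) in the distance computations that follow.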
Qualitative assessment of discriminative power. In this experiment, a query recipe is randomly selected. Next, all other recipes are ranked on the basis of the Euclidean distance between the vectors that represent nutritional value. The top-5 retrieved recipes are inspected manually. This is repeated three times. Nutritional value information is deemed to have meaningful discriminative power if the top-5 retrieved recipes are of a similar dish-type as the query recipe (e.g. all desserts).
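The ranking procedure described above can be sketched in a few lines. The vectors below are random placeholders for the normalized nutritional vectors; the procedure (Euclidean distance, then top-5 by ascending distance) follows the text.

```python
import numpy as np

# Sketch of the top-5 retrieval by nutritional similarity.
# The 6-d vectors are random stand-ins for normalized nutrition vectors.
rng = np.random.default_rng(42)
vectors = rng.random((100, 6))   # 100 recipes x 6 nutritional categories
query_idx = 0                    # randomly selected query recipe

dists = np.linalg.norm(vectors - vectors[query_idx], axis=1)
dists[query_idx] = np.inf        # exclude the query itself from retrieval
top5 = np.argsort(dists)[:5]     # indices of the 5 nearest recipes
```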
One of the rankings is shown in Figure 10. This figure shows that the top-5 retrieved recipes are of a similar dish-type as the query recipe: the retrieved recipes and the query recipe are all sweet desserts. These recipes all contain relatively large amounts of sugar (i.e. carbohydrates) and energy. The other two rankings reveal a similar pattern, and are shown in Figures 15 and 16 in the Appendix. These results show that nutritional value has meaningful discriminative power, in the sense that it can be used to distinguish between different types of dishes.
Incorporation of nutritional feature into model. The 6-dimensional nutritional value vector is incorporated into the joint neural embedding model through a single linear layer with 6 input nodes. This linear layer represents the nutritional encoder within the joint neural embedding model. The encoder returns a 4-dimensional vector that is concatenated to the recipe representation. The new model is fine-tuned on the translated AH collection (including the nutritional value vector representations), using the pre-trained text-representation models. In the baseline model, the nutritional value feature is excluded ("Excl. nutritional value" in Table 6).
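A framework-agnostic sketch of this encoder is given below: a linear map from the 6-d nutritional vector to a 4-d code, concatenated to the recipe representation. The weights are random, the 16-d recipe representation is a stand-in (the thesis' recipe embedding is higher-dimensional), and in a deep-learning framework the map would be a learnable linear layer rather than fixed NumPy arrays.

```python
import numpy as np

# Sketch of the nutritional encoder: linear layer 6 -> 4, whose output
# is concatenated to the recipe representation. Dimensions other than
# 6 (input) and 4 (output) are illustrative.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))   # stand-in for the layer's learnable weights
b = np.zeros(4)                   # stand-in for the layer's learnable bias

def nutrition_encoder(x):
    """Map a 6-d nutritional vector to a 4-d code."""
    return W @ x + b

recipe_repr = rng.standard_normal(16)   # stand-in recipe representation
nutrition = rng.random(6)               # normalized nutritional vector
augmented = np.concatenate([recipe_repr, nutrition_encoder(nutrition)])
```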
The new model and the baseline model are evaluated using the intra-collection retrieval experiment. The im2recipe and recipe2im retrieval tasks are repeated 100 times to increase statistical power. Two-sided independent t-tests are performed to test the difference in performance for all performance measures (i.e. MedR, R@1, R@5, R@10). The results are shown in Table 6. All p-values below 0.05 are taken to signify a significant difference, and are denoted by an asterisk. For the im2recipe task, all performance measures are significantly different from the baseline. For the recipe2im task, only R@5 and R@10 are significantly different. These results suggest that nutritional value contributes new information to the recipe representation, in addition to the ingredients and cooking instructions, and increases the model's performance on the AH collection.
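The test statistic underlying Table 6 can be sketched as follows: a pooled-variance independent two-sample t statistic over the 100 repetitions with and without the nutritional feature. The samples below are synthetic stand-ins; a p-value would come from the t distribution with n1 + n2 - 2 degrees of freedom (e.g. via scipy.stats), which is omitted here.

```python
import numpy as np

# Pooled-variance (Student's) independent two-sample t statistic, as used
# for the comparisons in Table 6. The MedR samples below are synthetic.
def t_statistic(a, b):
    """Two-sided independent two-sample t statistic (equal variances)."""
    n1, n2 = len(a), len(b)
    var_pooled = ((n1 - 1) * np.var(a, ddof=1)
                  + (n2 - 1) * np.var(b, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(var_pooled * (1 / n1 + 1 / n2))

rng = np.random.default_rng(1)
medr_excl = 3.3 + 0.2 * rng.standard_normal(100)  # synthetic baseline MedR
medr_incl = 3.1 + 0.2 * rng.standard_normal(100)  # synthetic MedR with nutrition
t = t_statistic(medr_excl, medr_incl)
```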
7 FINE-TUNING VERSUS TRAINING FROM SCRATCH

In this project, transfer learning has been exploited by fine-tuning the pre-trained joint neural embedding model on professional recipe collections. We also tried training the joint neural embedding model from scratch on the JO and AH collections. To increase the probability of success, we experimented with the model's complexity. Model complexity is related to the number of learnable parameters in the model. Decreasing model complexity can be beneficial for training, especially when using a relatively small training set. The complexity of the joint neural embedding model can be decreased by, for example, decreasing the dimensionality of the embedding space. Irrespective of model complexity or hyper-parameter settings, training on the JO or AH collections did not result in any learning. This corresponds to the findings of [16], where fine-tuning outperformed training from scratch.
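To see why shrinking the embedding space reduces complexity: both modalities are projected into the shared space by linear layers, so the parameter count of those final layers grows linearly with the embedding dimension. The input feature sizes below are illustrative assumptions, not the thesis' exact values.

```python
# Sketch: parameter count of the two linear projections into the joint
# embedding space. Input dimensions (recipe and image features) are
# illustrative; only the linear scaling in embed_dim matters here.
def projection_params(recipe_dim, image_dim, embed_dim):
    """Total parameters (weights + biases) of two linear projections."""
    recipe_proj = recipe_dim * embed_dim + embed_dim
    image_proj = image_dim * embed_dim + embed_dim
    return recipe_proj + image_proj

full = projection_params(recipe_dim=2048, image_dim=2048, embed_dim=1024)
reduced = projection_params(recipe_dim=2048, image_dim=2048, embed_dim=256)
# Quartering the embedding dimension quarters these parameters.
```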
The failure to train the model from scratch is most likely due to the small training set sizes of the JO and AH collections. Training deep neural networks requires a large amount of training data [17]. Even though training from scratch did not work, fine-tuning the pre-trained model on the AH collection resulted in an increase of performance for both the R1M and AH collections. These results have two implications: (1) the model's learning of the AH collection greatly benefited from transfer learning; and (2) learning the target task (i.e. AH) can even improve performance on the base task (i.e. R1M). This project demonstrates the large advantage of transfer learning via fine-tuning over training from scratch.
8 CONCLUSIONS

In this paper we focused on the performance of the joint neural embedding model for amateur and professional recipes, and on the benefit of utilizing nutritional value information within this model. We showed that the pre-trained model does not perform equally well on amateur and professional recipes. As expected, performance is higher for amateur than for professional recipes. Fine-tuning the model on the Jamie Oliver collection did not work, probably due to the small size of the JO collection. This inference is supported by the fact that fine-tuning did work for the larger AH collection. The best method to enhance the pre-trained model's performance on the AH collection is to fine-tune the pre-trained model on the translated AH collection, using the pre-trained text-representation models. Surprisingly, this method resulted in an increase of performance for both the AH and the R1M collections. This suggests that the benefit of transfer learning is not restricted to the target task (i.e. professional recipes), but also serves the base task (i.e. amateur recipes). Finally, we found that nutritional value has meaningful discriminative power, in the sense that it can be used to distinguish between different types of dishes. We showed that adding nutritional value as a feature through a simple linear encoder increases the model's performance on the AH collection.
8.1 Acknowledgements

I want to thank my two supervisors, Thomas Mensink and Vladimir Nedović, for their enthusiasm and good advice.
REFERENCES

[1] P. Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10):78-87, 2012.
[2] A. Salvador, N. Hynes, Y. Aytar, J. Marin, F. Ofli, I. Weber, and A. Torralba. Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[4] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, 3294-3302, 2015.
[5] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104-3112, 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778, 2016.
[7] H. Bal, D. Epema, C. de Laat, R. van Nieuwpoort, J. Romein, F. Seinstra, C. Snoek, and H. Wijshoff. A medium-scale distributed system for computer science research: Infrastructure for the long term. IEEE Computer, 49(5):54-63, May 2016.
[8] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, http://www.deeplearningbook.org, 2016.
[9] M. Rokicki, C. Trattner, and E. Herder. The impact of recipe features, social cues and demographics on estimating the healthiness of online recipes. 2018.
[10] T. Kusmierczyk and K. Nørvåg. Online food recipe title semantics: Combining nutrient facts and topics. In Proc. of CIKM, 2013-2016, 2016.
[11] M. Chokr and S. Elbassuoni. Calories prediction from food images. In AAAI, 4664-4669, 2017.
[12] Z. Zheng, L. Zheng, M. Garrett, Y. Yang, and Y. D. Shen. Dual-path convolutional image-text embedding. arXiv preprint arXiv:1711.05535, 2017.
[13] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[14] L. Torrey and J. Shavlik. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, 242-264, 2010.
[15] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.
[16] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging, 35(5):1299-1312, 2016.
[17] D. Erhan, P. A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent. The difficulty of training deep architectures and the effect of unsupervised pre-training. In Artificial Intelligence and Statistics, 153-160, 2009.
[18] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, 3320-3328, 2014.
9 APPENDIX
Figure 11: Example recipes of the Recipe1M collection
Figure 12: Example recipes of the Jamie Oliver collection
Figure 13: Example recipes of the Allerhande collection
Approach   Learning rate   Number of iterations   Running time (hours)   Snapshot at iteration
1          0.00011         30,000                 7                      21,500
2          0.0001          25,500                 6                      21,500
3          0.000015        30,000                 7                      21,500
4          0.00013         45,000                 10                     33,000
5          0.00007         25,500                 6                      21,500
6          0.00002         18,000                 4                      17,000

Table 7: Hyper-parameter values during training
Figure 14: Training (blue) and validation (red) loss curves; fine-tuning pre-trained model on AH collection
Figure 15: Second ranking based on nutritional value. The query recipe (on the left) and the first five retrieved recipes (on the right), ranked on the basis of Euclidean distance between the normalized nutritional value vectors.
Figure 16: Third ranking based on nutritional value. The query recipe (on the left) and the first five retrieved recipes (on the right), ranked on the basis of Euclidean distance between the normalized nutritional value vectors.