Deep Learning for Food Analysis
Petia Radeva, www.cvc.uab.es/~petia
Computer Vision at UB (CVUB), Universitat de Barcelona & Medical Imaging Laboratory, Computer Vision Center



Index
Motivation

Learning and Deep learning

Deep learning for food analysis

Lifelogging

AMiTANS16, Albena, 26 of June, 2016

Metabolic diseases and health

4.2 million people die of chronic diseases in Europe (diabetes or cancer) linked to lack of physical activity and unhealthy diet.
Physical activity can increase lifespan by 1.5-3.7 years.
Obesity is a chronic disease associated with huge economic, social and personal costs. It is a risk factor for cancers, cardiovascular and metabolic disorders, and a leading cause of premature mortality worldwide.

Health and medical care
Today, 88% of U.S. healthcare dollars are spent on medical care: access to physicians, hospitals, procedures, drugs, etc.

However, medical care only accounts for approximately 10% of a person's health.

Approximately half the decline in U.S. deaths from coronary heart disease from 1980 through 2000 may be attributable to reductions in major risk factors (systolic blood pressure, smoking, physical inactivity).

Health and medical care
Recent data show evidence of stagnation that may be explained by the increases in obesity and diabetes prevalence.
Healthcare resources and dollars must now be dedicated to improving lifestyle and behavior.

Why food analysis?
Today, measuring physical activity is not a problem.

But what about food and nutrition? Nutritional health apps are based on food diaries.

Two main questions
What do we eat?
Automatic food recognition vs. food diaries

And how do we eat?
Automatic eating pattern extraction: when, where, how, how long, with whom, in which context?

Lifelogging

Index
Motivation

Learning and Deep learning

Deep learning for food analysis

Lifelogging

Why Learn?
Machine learning consists of:
Developing models, methods and algorithms to make computers learn, i.e., take decisions.
Training from large amounts of example data.
Learning is used when:
Humans are unable to explain their expertise (speech recognition)
Human expertise does not exist (navigating on Mars)
The solution changes in time (routing on a computer network)
The solution needs to be adapted to particular cases (user biometrics)
Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
Example in retail, from customer transactions to consumer behavior: people who bought "The Da Vinci Code" also bought "The Five People You Meet in Heaven" (www.amazon.com).
The goal is to build a model that is a good and useful approximation to the data.

Growth of Machine Learning
This trend is accelerating due to:
Big data and data science are today a reality
Improved data capture, networking, faster computers
New sensors / IO devices / Internet of Things
Software too complex to write by hand
Demand for self-customization to the user
The difficulty of extracting knowledge from human experts (the failure of expert systems in the 1980s)
Improved machine learning algorithms

Face recognition: pose, lighting, occlusion (glasses, beard), make-up, hair style
Character recognition: different handwriting styles
Speech recognition: temporal dependency; use of a dictionary or the syntax of the language
Sensor fusion: combine multiple modalities, e.g., visual (lip image) and acoustic, for speech
Medical diagnosis: from symptoms to illnesses
Web advertising: predict whether a user clicks on an ad on the Internet

Deep learning everywhere

Deep learning applications

Other methods also use unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors. The network is then trained further by supervised back-propagation to classify labeled data. The deep model of Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine to model each new layer of higher-level features. Each new layer guarantees an increase in the lower bound of the log-likelihood of the data, thus improving the model, if trained properly. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations.[8] Hinton reports that his models are effective feature extractors over high-dimensional, structured data.[9]

Natural Language Processing, used heavily for language conversion in chat rooms or for processing text from human speech.
Optical Character Recognition, the scanning of images; it has been gaining traction lately to read an image, extract text from it, and correlate the text with the objects found in the image.
Speech recognition applications like Siri or Cortana need no introduction.
Artificial intelligence in robots, to automate at least a small subset of the tasks a human can do; we want them to be a little smarter.
Drug discovery through medical-imaging-based diagnosis using deep learning; it is in its early stages now (see Butterfly Network for the work they are doing).
CRM needs for companies are growing day by day; hundreds of thousands of companies around the globe, from small to big, want to know their potential customers, and deep learning has provided some outstanding results (see companies like RelateIQ, whose product has seen astounding success using machine learning in this area).

Formalization of learning
Consider: training examples D = {z1, z2, …, zn}, with the zi sampled from an unknown process P(Z); a model f; and a loss functional L(f,Z) that returns a real-valued scalar.

Minimize the expected value of L(f,Z) under the unknown generating process P(Z).
Supervised learning: each example is an (input, target) pair: Z = (X,Y).

Classification: Y is a finite integer (e.g., a symbol) corresponding to a class index, and we often take as loss function the negative conditional log-likelihood, with the interpretation that fi(X) estimates P(Y=i|X):
L(f,(X,Y)) = -log fY(X), where fi(X) ≥ 0 and Σi fi(X) = 1.
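This loss can be sketched in a few lines of Python; realizing fi(X) as a softmax over raw class scores is an assumption here (any model producing a normalized distribution over classes would fit the definition):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def nll_loss(scores, y):
    # L(f,(X,Y)) = -log f_Y(X), with f_i(X) >= 0 and sum_i f_i(X) = 1.
    return -np.log(softmax(scores)[y])

scores = np.array([2.0, 1.0, 0.1])  # raw class scores for one example
low = nll_loss(scores, 0)           # small: class 0 has the top score
high = nll_loss(scores, 2)          # larger: class 2 scores poorly
```

The better the model's estimated probability of the true class, the smaller the loss.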


Classification/Recognition
Is this an urban or rural area?
Input: x; Output: y ∈ {-1, +1} (binary classification)

Which city is this?
Output: y ∈ {1, 2, …, C} (multi-class classification)
From: M. Pawan Kumar

Object Detection and Segmentation
Where is the object in the image?
Output: y ⊆ Pixels

What is the semantic class of each pixel?
Output: y ∈ {1, 2, …, C}^|Pixels| (e.g., car, road, grass, tree, sky)
From: M. Pawan Kumar

A Simplified View of the Pipeline
Input x → extract features Φ(x) → compute scores f(Φ(x), y) → prediction y(f) = argmax_y f(Φ(x), y); learn f.
From: M. Pawan Kumar

Learning Objective
Data distribution P(x,y)
f* = argmin_f E_{P(x,y)} Error(y(f), y)
Prediction: y(f); ground truth: y; Error is a measure of prediction quality (error, loss). The distribution is unknown; the expectation is over the data distribution.
From: M. Pawan Kumar

Learning Objective
Training data {(xi, yi), i = 1, 2, …, n}
f* = argmin_f E_{P(x,y)} Error(y(f), y)
Ground truth y; Error measures prediction quality; the expectation is over the data distribution.
From: M. Pawan Kumar

Learning Objective
Training data {(xi, yi), i = 1, 2, …, n}
f* = argmin_f Σi Error(yi(f), yi)
With finite samples, the expectation is taken over the empirical distribution.
From: M. Pawan Kumar

The problem of image classification

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Dual representation of images as points/vectors

Each image of M rows by N columns by C channels (C = 3 for color images) can be considered as a vector/point in R^(M×N×C), and vice versa; e.g., a 32×32×3 image corresponds to a 3072-D vector.
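The correspondence is just a reshape; a minimal numpy sketch for the 32×32×3 case:

```python
import numpy as np

image = np.random.rand(32, 32, 3)  # M=32 rows, N=32 columns, C=3 channels
vector = image.reshape(-1)         # point in R^(32*32*3) = R^3072
assert vector.shape == (3072,)

# And back again: the mapping is invertible ("vice versa").
restored = vector.reshape(32, 32, 3)
assert np.array_equal(image, restored)
```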

Linear Classifier and key classification components
Given two classes, how do we learn a hyperplane that separates them?

To find the hyperplane we need to specify:
Score function
Loss function
Optimization

Interpreting a linear classifier

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

General learning pipeline

Training consists of constructing the prediction model f from a training set.
From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

The problem of image classification

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Parametric approach: linear classifier

Score function
From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Loss function/optimization

The score function
From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Image classification

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Loss function and optimisation
Question: if you were to assign a single number to how unhappy you are with these scores, what would you do?

Question: given the score and the loss function, how do we find the parameters W?

Interpreting a linear classifier

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson (W has size 10×3072)

Why is a CNN a deep learning model?

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

where fi = Σj wij · xj (inputs x1 … xn, outputs f1 … fm, weights w11, w12, …, w1n, …)
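The equation fi = Σj wij·xj is a single matrix-vector product; a minimal sketch (the dimensions m = 4 and n = 10 are arbitrary illustrations):

```python
import numpy as np

def fully_connected(W, x):
    # f_i = sum_j w_ij * x_j for all output units i at once.
    return W @ x

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 10))  # m = 4 outputs, n = 10 inputs
x = rng.standard_normal(10)
f = fully_connected(W, x)         # shape (4,)

# Element-wise check against the summation form above.
assert np.isclose(f[0], sum(W[0, j] * x[j] for j in range(10)))
```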

Activation functions of NN

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Setting the number of layers and their size

Neurons arranged into fully-connected layers

Bigger = better (but might have to regularize more strongly).

How many parameters to learn?
From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Why is a CNN a neural network?

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Architecture of neural networks

Modern CNNs: ~10 million neurons
Human visual cortex: ~5 billion neurons

Activation functions of NN

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Exponential linear units (ELU): all the benefits of ReLU, does not die, outputs closer to zero mean, but the computation requires exp().
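For comparison, both activations in a few lines (α = 1 is an assumed default for the ELU):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # Identity for x > 0; a saturating exponential for x <= 0, so outputs
    # are closer to zero mean and the unit never dies completely.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
r = relu(x)  # negatives clipped to exactly 0
e = elu(x)   # negatives squashed toward -alpha instead of 0
```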

What is a Convolutional Neural Network?

Convolutional and Max-pooling layer

Convolutional layer
Max-pool layer
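A minimal numpy sketch of both layer types, assuming a single channel, a stride-1 "valid" convolution, and non-overlapping 2×2 pooling:

```python
import numpy as np

def conv2d(image, kernel):
    # Valid convolution (really cross-correlation, as in CNN practice).
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    # Non-overlapping max-pooling: keep the strongest response per window.
    H, W = feature_map.shape
    Hc, Wc = H // size * size, W // size * size
    return feature_map[:Hc, :Wc].reshape(
        Hc // size, size, Wc // size, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)
edge = np.array([[1.0, -1.0]])  # a tiny horizontal-edge filter
fmap = conv2d(img, edge)        # shape (6, 5)
pooled = max_pool(fmap)         # shape (3, 2)
```

In a real CNN the kernel weights are learned rather than hand-set, and there are many kernels and channels per layer.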

How does the CNN work?


Example architecture

The trick is to train the weights such that, when the network sees a picture of a truck, the last layer will say "truck".

Training a CNN

Training a CNN consists of learning all of its parameters: the convolutional filter matrices and the weights of the fully connected layers.

Several million parameters!

Learned convolutional filters


Neural network training

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Using the chain rule, optimize the parameters W of the neural network by gradient descent and backpropagation. Optimization involves training several million parameters!
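The same chain-rule loop can be shown end to end on a tiny one-layer softmax classifier; real CNNs repeat exactly this update over millions of weights (the toy dataset, learning rate, and step count below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # 100 examples, 5 features
y = (X[:, 0] > 0).astype(int)       # illustrative binary labels
W = np.zeros((5, 2))                # all parameters of this tiny model

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

lr = 0.5
for step in range(200):
    P = softmax_rows(X @ W)         # forward pass: class probabilities
    loss = -np.log(P[np.arange(len(y)), y]).mean()
    # Backward pass (chain rule): gradient of the softmax + NLL loss.
    G = P.copy()
    G[np.arange(len(y)), y] -= 1.0
    W -= lr * (X.T @ G) / len(y)    # gradient-descent update

accuracy = (softmax_rows(X @ W).argmax(axis=1) == y).mean()
```

A CNN differs only in how the forward pass is computed; backpropagation pushes the same loss gradient through every layer.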

Monitoring loss and accuracy

Looks linear? Learning rate too low.
Decreases too slowly? Learning rate too high.
Too noisy? Increase the batch size.
Big gap between training and validation accuracy? You are overfitting: increase regularization!

Transfer learning

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Imagenet


1001 benefits of CNNs
Transfer learning: fine-tuning for object recognition
Replace and retrain the classifier on top of the ConvNet
Fine-tune the weights of the pre-trained network by continuing the backpropagation
Feature extraction by CNN
Object detection
Object segmentation

Image similarity and matching by CNN

Convolutional Neural Networks (4096 features)
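The "replace and retrain the classifier on top" recipe can be sketched without a deep learning framework: treat a frozen body as a fixed feature extractor and train only a new linear head. The random "pretrained" projection below stands in for a real ConvNet (e.g., one producing the 4096-D features above), and the task data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "pretrained" body: a fixed random projection standing in for a
# real ConvNet feature extractor (illustrative assumption).
W_body = rng.standard_normal((64, 16))

def extract_features(X):
    # Frozen forward pass: the body's weights are never updated.
    return np.maximum(0.0, X @ W_body)

# Illustrative new-task data: labels linearly separable in feature space.
X = rng.standard_normal((200, 64))
F = extract_features(X)              # features computed once, then cached
v = rng.standard_normal(16)
y = (F @ v > 0).astype(int)

# Replace the classifier on top: train only the new head.
W_head = np.zeros((16, 2))
lr = 0.1
for _ in range(300):
    S = F @ W_head
    e = np.exp(S - S.max(axis=1, keepdims=True))
    P = e / e.sum(axis=1, keepdims=True)
    G = P.copy()
    G[np.arange(len(y)), y] -= 1.0
    W_head -= lr * (F.T @ G) / len(y)   # only the head changes

accuracy = ((F @ W_head).argmax(axis=1) == y).mean()
```

Full fine-tuning would additionally let gradients flow into the body; the structure of the loop stays the same.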

ConvNets are everywhere

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson


Index
Motivation

Learning and Deep learning

Deep learning for food analysis

Lifelogging

Automatic food analysis
Can we automatically recognize food? The goal is to detect every instance of a dish, in all of its variants, shapes and positions, in a large number of images.

The main problems that arise are:
Complexity and variability of the data.
Huge amounts of data to analyse.

Automatic Food Analysis
Food detection
Food recognition
Food environment recognition
Eating pattern extraction

Food datasets

Food256: 25,600 images (100 images/class); 256 classes
Food101: 101,000 images (1,000 images/class); 101 classes
Food101+FoodCAT: 146,392 images (101,000 + 45,392); 131 classes
EgocentricFood: 5,038 images; 9 classes

Food localization and recognition

General scheme of our food localization and recognition proposal

Food localization

Examples of localization and recognition on UECFood256 (top) and EgocentricFood (bottom). Ground truth is shown in green and our method in blue.

Pipeline: image input → foodness map extraction (Food Detection CNN) → food type recognition (Food Recognition CNN), e.g., apple, strawberry.

Food recognition results: Top-1 74.7%, Top-5 91.6%
State of the art (Bossard, 2014): Top-1 56.4%

Demo


Food environment classification
Bakery, Banquet hall, Bar, Butcher shop, Cafeteria, Ice cream parlor, Kitchen, Kitchenette, Market, Pantry, Picnic area, Restaurant, Restaurant kitchen, Restaurant patio, Supermarket, Candy store, Coffee shop, Dinette, Dining room, Food court, Galley

Classification results:
0.92: food-related vs. non-food-related
0.68: 22 classes of food-related categories

Index
Motivation

Learning and Deep learning

Deep learning for food analysis

Lifelogging

Wearable cameras and the life-logging trend

Shipments of wearable computing devices worldwide by category from 2013 to 2015 (in millions)

Life-logging data
What we have:

Wealth of life-logging data
We propose an energy-based approach for motion-based event segmentation of life-logging sequences of low temporal resolution. The segmentation is achieved by integrating different kinds of image features and classifiers into a graph-cut framework to ensure consistent sequence treatment.

A complete dataset of a day captured with SenseCam contains more than 4,100 images. The choice of device depends on: 1) where it is worn: a camera hung on the neck is considered more unobtrusive for the user; and 2) its temporal resolution: a camera with a low fps captures less motion information, but requires less data to process. We chose SenseCam or Narrative: cameras hung on the neck or pinned on the clothes that capture 2-4 fps.

100,000 images per month; 1 TB in 3 years: the hell of life-logging data.

Visual Life-logging data

Events to be extracted from life-logging images

The camera captures up to 2,000 images per day, around 100,000 images per month. Applying computer vision algorithms, we are able to extract the diary of the person:

Activities he/she has done
Interactions he/she has participated in
Events he/she has taken part in
Duties he/she has performed
Environments and places he/she has visited, etc.

Towards healthy habits
Towards visualizing summarized lifestyle data to ease the management of the user's healthy habits (sedentary lifestyle, nutritional activity, etc.).

Conclusions
Healthy habits: one of the main health concerns for people, society, and governments

Deep learning: a technology that is here to stay

A new technological trend with huge power

Especially useful for food recognition and analysis

Lifelogging: an underexplored technology that hides big potential to help people monitor and describe their behaviour and thus improve their lifestyle.

Thank you for your attention!


Deep learning applications

Medical applications: there are tremendous advances in robotic surgery relying on extremely sensitive tactile equipment. However, if a doctor can advise a robot to "move a fraction of a millimeter to the left of the clavicle", they could potentially gain more control by directing the robot via fully understood voice control.
Automotive: we are already seeing self-driving cars; deep learning will possibly be integrated into automated driving systems to detect and interpret sights and sounds that might be beyond the capacity of humans.
Military: drones are particularly well suited to deep learning.
Surveillance: here too drones will play a role, but the idea of computers that are able to sense and interpret with a human-like degree of accuracy will change the way in which surveillance is done.