
ECE 517 Reinforcement Learning in Artificial Intelligence

Lecture 21: Deep Machine Learning

Dr. Itamar Arel

College of Engineering
Department of Electrical Engineering and Computer Science

The University of Tennessee
Fall 2010

November 8, 2010


RL and General AI

RL seems like a good AI framework

Some pieces are missing:
- Long/short-term memory: what is the optimal value (or cost-to-go) function to be used?
- How do we treat multi-dimensional reward signals?
- How do we deal with high-dimensional inputs (observations)?
- How to generalize to address a near-infinite state space?
- How long will it take to train such a system?

If we want to use hardware – how do we go about doing it?
- Storage capacity – human brain ~10^14 synapses (i.e. weights)
- Processing power – ~10^11 neurons
- Communications – fully or partially connected architectures
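To put the hardware question in perspective, here is a rough back-of-envelope sketch in Python; the synapse and neuron counts come from the slide, while the bytes-per-weight and update-rate figures are illustrative assumptions.

```python
# Rough, illustrative sizing of a brain-scale system in hardware.
# Only the synapse/neuron counts come from the slide; bytes-per-weight and the
# update rate are assumptions made purely for this back-of-envelope estimate.
NUM_SYNAPSES = 1e14        # ~10^14 synapses, treated as weights
NUM_NEURONS = 1e11         # ~10^11 neurons
BYTES_PER_WEIGHT = 4       # assume 32-bit storage per weight
UPDATES_PER_SECOND = 100   # assume each synapse is touched ~100 times/sec

storage_tb = NUM_SYNAPSES * BYTES_PER_WEIGHT / 1e12
ops_per_sec = NUM_SYNAPSES * UPDATES_PER_SECOND   # one multiply-accumulate per touch

print(f"weight storage : ~{storage_tb:,.0f} TB")
print(f"synaptic ops   : ~{ops_per_sec:.1e} per second")
print(f"weights/neuron : ~{NUM_SYNAPSES / NUM_NEURONS:,.0f}")
```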


Why Deep Learning?

Mimicking the way the brain represents information is a key challenge

- Deals efficiently with high dimensionality
- Handles multi-modal data fusion
- Captures temporal dependencies spanning large scales
- Incremental knowledge acquisition

The challenge with high-dimensionality:
- Real-world problems
- Curse of dimensionality (Bellman)
- Spatial and temporal dependencies
- How to represent key features?


Main application: classification

Hard (unsolved) problem due to …
- High-dimensional data
- Distortions (noise, rotation, displacement, perspective, lighting conditions, etc.)
- Partial observability

Mainstream approach …


[Pipeline diagram: ROI detection → Feature extraction → Classification, with dimensionality dropping from roughly 10^6 at the raw input to ~10^4 after ROI detection and ~10^2 at the classifier]
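To make the staged dimensionality reduction concrete, here is a toy Python sketch of such a pipeline. The stage implementations (a fixed crop, a random projection, nearest-centroid classification) are stand-ins chosen purely for illustration, not the methods a real system would use.

```python
import numpy as np

rng = np.random.default_rng(0)

def detect_roi(image):
    """Stage 1: ROI detection (toy stand-in: crop a fixed 100x100 window)."""
    return image[400:500, 400:500].ravel()            # ~10^6 pixels -> 10^4 values

def extract_features(roi, projection):
    """Stage 2: feature extraction (toy stand-in: a fixed random projection)."""
    return projection @ roi                            # 10^4 values -> 10^2 features

def classify(features, centroids):
    """Stage 3: classification (toy stand-in: nearest centroid)."""
    distances = np.linalg.norm(centroids - features, axis=1)
    return int(np.argmin(distances))

image = rng.random((1000, 1000))                       # ~10^6-dimensional raw input
projection = rng.standard_normal((100, 10_000)) / 100  # maps 10^4 -> 10^2
centroids = rng.standard_normal((10, 100))             # 10 hypothetical classes

roi = detect_roi(image)
features = extract_features(roi, projection)
print("predicted class:", classify(features, centroids))
```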


The power of hierarchical representation

Core idea: partition high-dimensional data into small patches, model them, and discover dependencies between them

Decomposes the problem

Suggests a trade-off: more scope ↔ less detail

Key ideas:
- Basic cortical circuit
- Massively parallel architecture
- Discovers structure based on regularities in the observations
- Multi-modal
Goal: situation/state inference
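A minimal numpy sketch of the core idea of partitioning a high-dimensional input into small patches that can each be modeled separately; the 64x64 input and 8x8 patch size are arbitrary choices made for illustration.

```python
import numpy as np

def extract_patches(image, patch=8):
    """Split a 2-D array into non-overlapping patch x patch blocks (flattened)."""
    h, w = image.shape
    patches = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            patches.append(image[r:r + patch, c:c + patch].ravel())
    return np.array(patches)

rng = np.random.default_rng(0)
image = rng.random((64, 64))        # a 4096-dimensional observation ...
patches = extract_patches(image)    # ... becomes 64 separate 64-dimensional patches
print(patches.shape)                # (64, 64)

# Each low-dimensional patch can now be modeled on its own, and a higher level
# can model the dependencies *between* the patch models.
```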


The power of hierarchical representation (cont.)

Hypothesis: the brain represents information using a hierarchical architecture that comprises basic cortical circuits

Effective way of dealing with large-scale POMDPs

- DL – state inference
- RL – for decision making under uncertainty

Suggests a semi-supervised learning framework:
- Unsupervised – learns structure of natural data
- Supervised – mapping states to classes
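A toy end-to-end sketch of that split: an unsupervised stage discovers structure in unlabeled data (a crude k-means stands in for state inference), and a small supervised stage maps the discovered states to class labels using only a handful of labeled examples. The data, the clustering choice, and the label counts are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled data generated from three hidden "causes" (structure to discover).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.concatenate([c + rng.normal(0, 0.5, (200, 2)) for c in centers])

# Unsupervised stage: learn structure of the data (crude k-means = "states").
k = 3
states = X[rng.choice(len(X), k, replace=False)]
for _ in range(20):
    assign = np.argmin(((X[:, None, :] - states) ** 2).sum(-1), axis=1)
    states = np.array([X[assign == j].mean(0) if np.any(assign == j) else states[j]
                       for j in range(k)])

# Supervised stage: map states to classes using only 15 labeled examples.
labeled = rng.choice(len(X), 15, replace=False)
labels = labeled // 200                      # ground-truth class of the labeled points
state_to_class = {}
for j in range(k):
    votes = [int(l) for i, l in zip(labeled, labels) if assign[i] == j]
    state_to_class[j] = max(set(votes), key=votes.count) if votes else 0

predictions = np.array([state_to_class[s] for s in assign])
print("accuracy:", (predictions == np.arange(len(X)) // 200).mean())
```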


The Deep Learning Theory

Basic idea is to decompose the large image into smaller images that can each be modeled

The hierarchy is one of abstraction:
- Higher levels of the state represent more abstract notions
- The higher the layer, the more scope it encompasses and the less detail it offers
- Multi-scale spatial-temporal context representation
- Lower levels interpret or control limited domains of experience, or sensory systems

Connections from the higher-level states predispose some selected transitions in the lower-level state machines


Inspiration: Role of Cerebral Cortex

The cerebral cortex (aka neocortex), made up of four lobes, is involved in many complex cognitive functions including: memory, attention, perceptual awareness, “thinking”, language and consciousness

The cortex is the primary brain subsystem responsible for learning …

- Rich in neurons (>80% in human brain)
- It is the one embedding the hierarchical auto-associative memory architecture
- Receives sensory information from many different sensory organs, e.g. eyes, ears, etc., and processes the information
- Areas that receive that particular information are called sensory areas


Deep Machine Learning – general framework

The lower layers predict short-term sequences

As you go higher in the hierarchy – “less accuracy, broader perspective”

Analogy to a general commanding an army, or a poem being recited
“Surprise” sequences should propagate up to the appropriate layer


DL for Invariant Pattern Recognition

Initial focus on the visual cortex
- Offers invariant visual pattern recognition in the visual cortex
- Recognizing objects despite different scaling, rotations and translations is something humans perform without conscious effort
- Lighting conditions, various noises (additive, multiplicative)

- Currently difficult for machine learning to achieve
The approach taken is that geometric invariance is linked to motion

When we look at an object, the patterns on our retina change a lot while the object (cause) remains the same

Thus, learning persistent patterns on the retina would correspond to learning objects in the visual world

Associating patterns with their causes corresponds to invariant pattern recognition
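A toy sketch of the "invariance from motion" idea: patterns that persistently follow one another in time get grouped under the same cause, so different retinal appearances of the same object end up with a single label. The symbolic frame sequence and the greedy co-occurrence grouping below are illustrative stand-ins, not the cortical algorithm itself.

```python
from collections import defaultdict

# Toy "retinal" sequence: two objects, each seen under several transformations.
# Consecutive frames usually show the same object (temporal persistence).
frames = ["A_left", "A_center", "A_right", "A_center",   # object A translating
          "B_small", "B_large", "B_small",                # object B scaling
          "A_right", "A_left"]

# Count how often two patterns occur in consecutive frames.
cooc = defaultdict(int)
for a, b in zip(frames, frames[1:]):
    cooc[frozenset((a, b))] += 1

# Greedy grouping: patterns that follow each other most persistently share a cause.
cause, next_id = {}, 0
for pair, _count in sorted(cooc.items(), key=lambda kv: -kv[1]):
    known = [cause[p] for p in pair if p in cause]
    if known:
        group = known[0]
    else:
        group, next_id = next_id, next_id + 1
    for p in pair:
        cause.setdefault(p, group)

print(cause)   # all A_* patterns share one cause id, all B_* patterns another
```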


DL for Invariant Pattern Recognition (cont.)

Each level in the system hierarchy has several modules that model cortical regions
- A module can have several children and one parent, thus modules are arranged in a tree structure
- The bottom-most level is called level 1, and the level number increases as you go up in the hierarchy
- Inputs go directly to the modules at level 1
- The level 1 modules have small receptive fields compared to the size of the total image, i.e., these modules receive their inputs from a small patch of the visual field
- Several such level 1 modules tile the visual field, possibly with overlap


General System Architecture

Thus a level 2 module covers more of the visual field compared to a level 1 module. However, a level 2 module gets its information only through a level 1 module
- This pattern is repeated in the hierarchy
- Receptive field sizes increase as one goes up the hierarchy
- The module at the root of the tree covers the entire visual field, by pooling inputs from its child modules
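A small Python sketch of the receptive-field bookkeeping this implies, assuming (purely for illustration) a 64x64 visual field, 8x8 level 1 receptive fields, and each parent pooling a 2x2 block of child modules.

```python
# Receptive-field bookkeeping for a module tree over a toy visual field.
# Assumptions (mine, for illustration): 64x64 field, 8x8 level 1 patches,
# each parent pools a 2x2 block of children, until a single root remains.
field = 64
patch = 8
fanin = 2   # each parent pools a fanin x fanin block of children

level = 1
modules_per_side = field // patch      # 8x8 grid of level 1 modules
rf = patch                             # receptive-field edge length at level 1
while modules_per_side >= 1:
    print(f"level {level}: {modules_per_side ** 2:3d} modules, "
          f"receptive field {rf}x{rf}")
    if modules_per_side == 1:
        break                          # the root covers the entire visual field
    modules_per_side //= fanin
    rf *= fanin
    level += 1
```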


Learning Framework

Let Xn(1) and Xn(2) denote the sequence of inputs to modules 1 and 2
Learning occurs in three phases:
- First, each module learns the most likely sequences of its inputs
- Second, each module passes an index of its most-likely observed input sequence
- Third, each module learns the most frequent “coincidences” of indices originating from the lower-layer modules
Next …
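A toy Python sketch of the three phases for two child modules and one parent, using plain frequency counting over fixed-length windows as a stand-in for whatever sequence model each module actually maintains; the input streams, window length, and alphabet sizes are illustrative.

```python
from collections import Counter

def windows(seq, n=3):
    """All length-n subsequences of seq."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# Toy input streams for two level 1 modules (e.g. two image patches over time).
x1 = list("abcabcabdabc")
x2 = list("xyzxyzxwzxyz")

# Phase 1: each module learns the most likely sequences of its own inputs.
alphabet_1 = [s for s, _ in Counter(windows(x1)).most_common(2)]
alphabet_2 = [s for s, _ in Counter(windows(x2)).most_common(2)]

# Phase 2: each module reports, at every step, the index of its most likely
# observed sequence (an index into its own alphabet, or None if unrecognised).
def emit_indices(seq, alphabet):
    return [alphabet.index(w) if w in alphabet else None for w in windows(seq)]

idx1 = emit_indices(x1, alphabet_1)
idx2 = emit_indices(x2, alphabet_2)

# Phase 3: the parent learns the most frequent "coincidences" of the indices
# arriving from its child modules.
coincidences = Counter((a, b) for a, b in zip(idx1, idx2)
                       if a is not None and b is not None)
print("most frequent coincidences:", coincidences.most_common(3))
```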


Contextual Embedding (if exists)

Feedback loop from layer 2 back to layer 1 (its children)
This feedback provides contextual inference (from higher layers)
This stage is initiated once the level 2 module has formed its alphabet, Yk
Lower-layer nodes eventually learn the CPD matrix P(X(1)|Y)
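A minimal sketch of how a lower-layer node could estimate the CPD matrix P(X(1)|Y) by counting co-occurrences of its own quantized input symbols with the alphabet symbol Y fed back from level 2; the symbol streams and alphabet sizes below are made up for illustration.

```python
import numpy as np

# Toy data: at each time step the child observes a quantized input symbol
# x in {0,..,3} while the parent feeds back its current alphabet symbol y in {0,1}.
x_obs = [0, 1, 0, 2, 3, 2, 0, 1, 3, 2, 0, 1]
y_fb  = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0]

num_x, num_y = 4, 2
counts = np.zeros((num_x, num_y))
for x, y in zip(x_obs, y_fb):
    counts[x, y] += 1

# Conditional probability table: column y holds P(X(1) = x | Y = y).
cpd = counts / counts.sum(axis=0, keepdims=True)
print(cpd)
```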


Bayesian Network Obtained

Bottom-layer random variables correspond to quantizations on input patterns
The r.v. at the intermediate layers represent object parts that move together persistently
R.v. at the top layer correspond to objects


Learning algorithm (cont.)

After the system has learned (seen many examples) and obtained the CPD at each layer, we seek

z* = argmax_z P(z | I)

where I is the image observed.

A Bayesian Belief Propagation method is typically used to determine the above, based on a hierarchy of beliefs

Drawbacks of current schemes:
- No “natural” spatiotemporal information representation
- Layer-by-layer training is needed
- Not modality independent (most current schemes are limited to image data sets)
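A minimal numerical sketch of that inference step: given a prior P(z) over top-level causes and likelihoods P(I|z) (values invented here), pick the cause maximizing the posterior P(z|I). In the actual scheme the posterior would be produced by belief propagation over the whole hierarchy rather than by a single application of Bayes' rule.

```python
import numpy as np

# Made-up priors over top-level causes z and likelihoods of the observed image I.
prior = np.array([0.5, 0.3, 0.2])          # P(z)
likelihood = np.array([0.01, 0.20, 0.05])  # P(I | z), as delivered by lower layers

posterior = prior * likelihood
posterior /= posterior.sum()               # P(z | I) via Bayes' rule

z_star = int(np.argmax(posterior))
print("P(z|I) =", np.round(posterior, 3), "-> z* =", z_star)
```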


Alternative Explanations for Biological Phenomena

Physiological experiments found that neurons sometimes respond to illusory contours in a Kanizsa figure

In other words, a neuron responds to a contour that does not exist in its receptive field

Possible interpretation: activity of a neuron represents the probability that a contour should be present

Originates from its own state and the state information of higher-level neurons

DL-based explanation for this phenomenon
- Contrary to current hypotheses that assume “signal subtraction” occurs for some reason