Embodied Machines

• The Grounding (binding) Problem
  – Real cognizers form multiple associations between concepts
    • Affordances - how an object is interacted with
    • Frames - background structure against which a concept is understood
      -- sometimes highly complex (educational system, family relationships)
    • Emotions - witnessing an event or seeing an object conjures up emotional states
    • Mental simulation - comprehending language may trigger imagistic modeling of an event based on experience

Embodied Machines

– Mouse
  • Mammal, small, furry, grey to brown, long whiskers, cats like to play with them and then eat them, they’re used in experiments, ladies stand on chairs when they’re around, they squeak, they’re prolific breeders, they’re sold live as snake food, they’re one kind of rodent, they look a lot like rats, they are sometimes pets, they like to run on a wheel…
– Play
  • The opposite of work, it’s fun, kids do it, scheduled in during grade school, you play games, you play with words, …

Embodied Machines

– Approaches to meaning construction
  • NLP
    – Text/speech is considered comprehended when parsed syntactically, and when word meanings have been assigned
    – Meaning is pre-determined by humans in some way
  • Embodied approach
    – World has no structure until body begins to interact in it
      » Need goals & sensorimotor system
    – Experience --> meaning
    – Words map onto meaning

Embodied Machines

– Steels’ Talking Heads
  • Simple robots
    – Auditory & visual systems
    – Motivating goal = language game
  • Simple environment
    – 2-dimensional world containing objects
  • Robots determine their own categories for objects
  • Robots determine their own labels for categories
  • Robots and environment are real physical entities

Embodied Machines

– Cangelosi & Parisi
  • Virtual agents, virtual world
  • A kind of embodied learning
    – Agents have physical location, orientation, and movement capabilities within their environment
    – Agents consume mushrooms, which affects their energy status
    – Agents (collectively) have a motivating task --> increase fitness of species
    – They sense perceptual characteristics, not mushrooms --> they learn which characteristics describe edible vs. poisonous mushrooms
    – Agents (collectively) learn to categorize and label mushrooms

Embodied Machines

– CELL (Deb Roy)
  • Cross-channel Early Lexical Learning
  • Models embodied language learning using input that approximates the input to human infants
  • Instantiated in a robot body with microphone/camera
  • CELL learns to form word-meaning correspondences from raw (unsegmented) audio and visual input

Embodied Machines

– First Task
  • Segmentation
    – Audio stream parsed into segments
    – Video stream parsed into objects
    – Segmentation process produces a channel of ‘words’ and a channel of shapes
– Second Task
  • Build a lexicon by identifying frequently co-occurring pairs of audio & visual segments
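The second task can be sketched as simple pair counting. This is a minimal illustration, not CELL's actual implementation: it assumes segmentation has already produced symbolic labels, and the function name `count_cooccurrences`, the episode structure, and the labels (`"ball_shape"`, `"cat_shape"`) are all hypothetical.

```python
from collections import Counter
from itertools import product

def count_cooccurrences(episodes):
    """Count how often each (audio segment, visual segment) pair appears together.

    `episodes` is a list of (audio_segments, visual_segments) tuples, one per
    observation. This input format is an assumption made for illustration.
    """
    pair_counts = Counter()
    for audio_segments, visual_segments in episodes:
        # Use sets so that a pair is counted at most once per episode.
        for pair in product(set(audio_segments), set(visual_segments)):
            pair_counts[pair] += 1
    return pair_counts

# Hypothetical, already-segmented input (not real data).
episodes = [
    (["throw", "the", "ball"], ["ball_shape"]),
    (["the", "cat"], ["cat_shape"]),
    (["the", "ball"], ["ball_shape", "cat_shape"]),
]

pair_counts = count_cooccurrences(episodes)
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Raw frequency alone ranks ‘the’ as highly as ‘ball’ here, which is why a plain count is not enough to decide which pairs belong in the lexicon.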

Embodied Machines

• Illustrative example (not from actual data)
• Imagine an utterance:

“…don’t throw the ball at the cat…”

Uttered in a scene containing these identified objects (noise present)

Embodied Machines

• Objects not necessarily identified in same order as named in utterance
• Time delays between utterance and object recognition highly likely

…throw the ball at the cat

Embodied Machines

– Short term memory (STM) – look at a temporal window surrounding each word

– Aim is to go back or forward far enough in time to have the word and its referent in the same window

[Figure: Short term memory]

Embodied Machines

– Window marches through data stream collecting segmented objects and words for possible mapping

[Figure: Short term memory]
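The short term memory window can be sketched as follows. This is a rough illustration under stated assumptions, not the CELL algorithm itself: words and objects are assumed to arrive as timestamped labels, and the function name `stm_windows`, the timestamps, and the window width in seconds are hypothetical.

```python
def stm_windows(words, objects, window=2.0):
    """Pair each word with every object seen within `window` seconds of it.

    `words` and `objects` are lists of (label, time_in_seconds) tuples.
    The window looks both backward and forward in time, so a word can be
    paired with an object recognized before or after it was spoken.
    """
    candidates = []
    for word, t_word in words:
        for obj, t_obj in objects:
            if abs(t_obj - t_word) <= window:
                candidates.append((word, obj))
    return candidates

# Hypothetical timestamps: the cat is recognized early, the ball late.
words = [("throw", 0.0), ("the", 0.3), ("ball", 0.6),
         ("at", 0.9), ("the", 1.1), ("cat", 1.4)]
objects = [("cat_shape", 0.2), ("ball_shape", 3.0)]

pairs = stm_windows(words, objects, window=2.0)
wider_pairs = stm_windows(words, objects, window=3.0)
```

With a 2-second window, ‘ball’ never shares a window with the late-recognized ball shape; widening the window to 3 seconds recovers that pairing, at the cost of collecting more spurious candidates.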

Embodied Machines

• Audio and visual segments that have a high degree of mutual information are likely semantically linked and should be saved in long term memory (LTM)

Words     Objects (… …)     Unique occurrences
Ball      5                 57
Cat       6                 100
The       40  50            90,000
unique    59  116

Embodied Machines

• Mutual information

MI = P(a & b) / (P(a) P(b)), computed from counts as co-occurrence(a & b) / (occurrence(a) * occurrence(b))

MI(‘the’ & [object]) = 40 / (90,000 * 59) = 0.0000075

MI(‘cat’ & [object]) = 40 / (100 * 59) = 0.0068

Words like ‘the’ are promiscuous. They co-occur with so many categories that they lack predictive power.
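The arithmetic on this slide can be reproduced with a few lines of code. This is a minimal sketch of the count-based ratio given above, using the slide's own counts; the function name `mi_score` is hypothetical, and the ratio is the slide's simplified score rather than a full information-theoretic mutual information.

```python
def mi_score(cooccurrence, occurrence_a, occurrence_b):
    """Count-based score from the slide: co-occurrence(a & b) / (occurrence(a) * occurrence(b))."""
    return cooccurrence / (occurrence_a * occurrence_b)

# Counts taken from the slide: 'the' occurs 90,000 times, 'cat' 100 times,
# the object occurs 59 times, and each word co-occurs with it 40 times.
the_score = mi_score(40, 90_000, 59)
cat_score = mi_score(40, 100, 59)

print(f"'the' & object: {the_score:.7f}")
print(f"'cat' & object: {cat_score:.4f}")
print(f"ratio cat/the:  {cat_score / the_score:.0f}")
```

Both words co-occur with the object equally often, yet the score for ‘cat’ is about 900 times higher, because the frequency of ‘the’ in the denominator penalizes it.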

Embodied Machines

• Two implementations of CELL
  – Robot
  – Learning from observing infant/caregiver interaction

Embodied Machines

• Robot
  – Input: spoken utterances and images of objects acquired from a video camera mounted on the robot
  – Experimenter places objects in front of the robot and describes them
  – Acquisition of lexicon
    • Robot gathers visual information about the environment while listening to speech (discovers high-MI pairs)
  – Speech generation
    • Search for objects in the environment, then describe them
  – Speech understanding (maps word to object)

Embodied Machines

• Learning from infant-caregiver interaction
  – Infants played with 7 classes of objects
    • Balls, shoes, keys, toy cars, trucks, dogs, horses
    • Caregiver/infant interaction was natural
  – CELL attempted to build up a lexicon from observing these interactions
    • Segmentation accuracy (do segment boundaries correspond to word boundaries?)
    • Word discovery (do segments correspond to single words?)
    • Semantic accuracy (if a word is segmented properly, is it properly mapped to an object?)

Embodied Machines

• Segmentation accuracy – 28% (compared to 7% for acoustic-only model)
• Word discovery – 72% of segmented items were single words (compared to 31% for acoustic-only model)
• Semantic accuracy – 57% of hypothesized lexical candidates were both valid words and linked to semantically relevant visual categories
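One way to picture how the three measures relate is to score a list of hand-labelled lexical candidates. This is a hypothetical sketch only: the `Candidate` fields, the `evaluate` function, and the four example candidates are assumptions made for illustration, and they do not reproduce the scoring procedure or the data behind the percentages above.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    boundaries_match: bool    # segment boundaries line up with word boundaries
    single_word: bool         # segment corresponds to a single word
    relevant_category: bool   # linked to a semantically relevant visual category

def evaluate(candidates):
    """Return each measure as a fraction of all candidates."""
    n = len(candidates)
    return {
        "segmentation_accuracy": sum(c.boundaries_match for c in candidates) / n,
        "word_discovery": sum(c.single_word for c in candidates) / n,
        # A candidate counts as semantically accurate only if it is both
        # a valid word and linked to a relevant visual category.
        "semantic_accuracy": sum(
            c.single_word and c.relevant_category for c in candidates
        ) / n,
    }

# Hypothetical hand-labelled candidates (not real data).
candidates = [
    Candidate(True, True, True),
    Candidate(False, True, False),
    Candidate(False, False, False),
    Candidate(False, True, True),
]
scores = evaluate(candidates)
```

Word discovery can exceed segmentation accuracy, as it does in the reported results, because a segment may capture a single word without matching its boundaries exactly.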
