thesis - MIT Media Labdkroy/papers/pdf/phd_thesis... · 2009. 8. 24. · /* Consider all pairs of...


Figure labels: Utterance-Context Pair. Utterance: linguistic unit 1, linguistic unit 2, ..., linguistic unit M. Context: semantic category 1, semantic category 2, ..., semantic category N.

Figure labels: Sensors; utterance; context; Lexicon. The lexicon panels show linguistic unit prototypes and semantic category prototypes, and linguistic units and semantic categories.

Figure labels (stages over time): Input Sensor Signals; Feature Extraction (Linguistic Channels, Contextual Channels); Event Detection (Linguistic Events (L-events), Semantic Events (S-events)); Co-occurrence filter; Short Term Memory (STM): Linguistic-Semantic Events (LS-events); Recurrence filter; Mid Term Memory (MTM): Lexical Candidates; Mutual Information filter; Long Term Memory (LTM): Lexical Items.

Figure labels: sensors; input signals; feature analyzers; time.

Figure labels: linguistic channels; linguistic event detector; L-events; time. Contextual channels; semantic event detector; S-events; time.

An L-event or S-event is composed of multiple channels (channel 1, channel 2, channel 3).

Event Segmenter: the event is divided into an array along channel and time segment boundaries (segment 1, segment 2, ..., segment n).

An event divided along time segments and channels.

Some potential subevents.

Figure labels: linguistic events (L-events) and semantic events (S-events) over time.

Co-occurring L-events and S-events are paired to form LS-events.

Short term memory (STM) contains recent LS-events; old LS-events are forgotten.
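The pairing and forgetting described in these captions can be sketched as a bounded queue. This is a minimal illustration, not the thesis implementation: the `Event` class, the `overlaps` test (temporal overlap as the criterion for co-occurrence), and the fixed `capacity` are all assumptions introduced here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical event record: a time span plus arbitrary content."""
    start: float
    end: float
    payload: object = None


def overlaps(a: Event, b: Event) -> bool:
    """Assumed co-occurrence test: the two time spans intersect."""
    return a.start < b.end and b.start < a.end


class ShortTermMemory:
    """Keeps only the most recent LS-events; older ones fall off the front."""

    def __init__(self, capacity: int) -> None:
        self.events = deque(maxlen=capacity)

    def observe(self, l_event: Event, s_event: Event) -> bool:
        """Pair the events into an LS-event if they co-occur."""
        if overlaps(l_event, s_event):
            self.events.append((l_event, s_event))
            return True
        return False
```

With `capacity=2`, storing a third LS-event silently discards the oldest one, which mirrors the "old LS-events forgotten" label.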

/* Consider all pairs of LS-events in short term memory */
for each pair of LS-events in STM, LSi and LSj {
    set Lmatch = FALSE
    set Smatch = FALSE

    /* Compare each pair of L-subevents in LSi and LSj */
    for each L-subevent in LSi, Li {
        for each L-subevent in LSj, Lj {
            if dL(Li, Lj) < tL then set Lmatch = TRUE
        }
    }

    /* Compare each pair of S-subevents in LSi and LSj */
    for each S-subevent in LSi, Si {
        for each S-subevent in LSj, Sj {
            if dS(Si, Sj) < tS then set Smatch = TRUE
        }
    }

    /* Check for matches of L-subevents and co-occurring S-subevents */
    if Lmatch = TRUE and Smatch = TRUE then recurrent match found
}
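The pseudocode above translates fairly directly into Python. The sketch below is one possible rendering under stated assumptions: each LS-event is represented as a pair `(L-subevents, S-subevents)`, the distance functions `d_l` and `d_s` and thresholds `t_l` and `t_s` are supplied by the caller, and the function name `recurrent_pairs` is hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# Assumed representation: an LS-event is (L-subevents, S-subevents).
LSEvent = Tuple[Sequence, Sequence]


def any_within(xs: Sequence, ys: Sequence,
               dist: Callable[[object, object], float],
               threshold: float) -> bool:
    """True if some pair (x, y) lies closer than the threshold."""
    return any(dist(x, y) < threshold for x in xs for y in ys)


def recurrent_pairs(stm: Sequence[LSEvent],
                    d_l: Callable, t_l: float,
                    d_s: Callable, t_s: float) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of LS-events with both an L-match and an S-match."""
    matches = []
    for (i, (l_i, s_i)), (j, (l_j, s_j)) in combinations(enumerate(stm), 2):
        l_match = any_within(l_i, l_j, d_l, t_l)
        s_match = any_within(s_i, s_j, d_s, t_s)
        if l_match and s_match:
            matches.append((i, j))
    return matches
```

For example, with scalar subevents and absolute difference as both distances, two LS-events match only if some L-subevents are close and some S-subevents are close as well.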

Short Term Memory (STM) holding LS-events. Filled regions indicate recurrent L-subevents and S-subevents.

Lexical Candidate: linguistic unit prototype ( ), semantic prototype ( )

Mid Term Memory (MTM)

Linguistic feature space: L-prototype ( ), L-radius ( ), L-unit = {L-prototype, L-radius}

Contextual feature space: S-prototype ( ), S-radius ( ), S-category = {S-prototype, S-radius}

Figure panels: scatter plots of points labeled a to z. Panel labels: medium, large, small.

I(S;L) = 0.013 bits

I(S;L) = 0.29 bits

I(S;L) = 0.0 bits
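To make the I(S;L) values concrete, the following sketch computes mutual information in bits between two binary variables from a 2x2 table of co-occurrence counts. It rests on an assumption not spelled out in this excerpt: that L and S are binary indicators of whether an observation falls inside an L-unit and inside an S-category, respectively. The function name and the count layout are hypothetical.

```python
import math
from typing import Sequence


def mutual_information_bits(counts: Sequence[Sequence[float]]) -> float:
    """Mutual information I(S;L) in bits.

    counts[l][s] is the number of observations with L-indicator l (0 or 1)
    and S-indicator s (0 or 1). This layout is an assumption of the sketch.
    """
    total = sum(sum(row) for row in counts)
    if total == 0:
        return 0.0
    p_l = [sum(row) / total for row in counts]                      # marginal of L
    p_s = [sum(row[s] for row in counts) / total for s in range(2)]  # marginal of S
    mi = 0.0
    for l in range(2):
        for s in range(2):
            p = counts[l][s] / total
            if p > 0:  # terms with zero probability contribute nothing
                mi += p * math.log2(p / (p_l[l] * p_s[s]))
    return mi
```

Independent indicators give 0 bits, and perfectly coupled, equally likely indicators give 1 bit, so values such as 0.0, 0.013, and 0.29 bits sit at the low end of that range.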

Figure labels: Mid Term Memory (MTM); Mutual Information Filter; Lexical Item.

Figure labels: Sensors; Feature analysis; L-event detection; Event segmentation; L-event; LTM; L-subevent matches L-unit in lexical item; S-category of recognized L-unit.

Figure labels: Sensors; Feature analysis; S-event detection; Event segmentation; S-event; LTM; S-subevent matches S-category in lexical item; L-unit of recognized S-category.

Figure labels: Sensors; Feature analysis; Event detection; Event segmentation; STM; LS-prototype hypotheses; Lexical search; matched hypotheses explained away; Recurrence filter; MTM; Mutual information filter; LTM.

Lexical item i and lexical item j (each an L-unit and an S-category): L-units and S-categories overlap.

Matching lexical items are clustered to form a conglomerate lexical item.

Lexical item i and lexical item j: S-prototype i matches S-category j.

Lexical item i and lexical item j: L-prototype j matches L-unit i.

Figure labels: L-unit; S-category; Linguistic Units; Semantic Categories.

Figure labels: Environment; Feedback; Goals; Action selection; LTM; lexical item confidence adjustment; MTM mutual information filter; threshold adjustment.

Microphone; Camera

Phoneme analysis; Object shape analysis; Object color analysis

Linguistic channel: phoneme probabilities

Contextual channels: object shape & color

Spoken utterance detection; Object detection

L-events: spoken utterances

S-events: object view-sets

L-event unpacking; S-event unpacking

L-subevents: speech segments

S-subevents: shape / color view-sets

Co-occurrence filter

Short Term Memory

LS-events: {spoken utterance, object view-set}

Recurrence filter

Mid Term Memory

Lexical candidates: {spoken word prototype, color/shape prototype}

Mutual information filter

Long Term Memory

Lexical items: {spoken word model, color/shape category}

Figure labels: CCD camera; color image; foreground segmentation; foreground bitmap; connected regions analysis; object mask; masked color image; mask-edge spatial derivative analysis.

Context Channel 1: Shape

Context Channel 2: Color

Figure panels: Original RGB Image; Object mask; Shape histogram (axes: relative angle, normalized distance); Color Histogram (axes: normalized green, normalized red).

DOF 1: Base rotation

DOF 2: Base elevation

DOF 3: Neck rotation

DOF 4: Neck elevation

DOF 5: Object turntable rotation

Color CCD Camera

Figure labels: RASTA-PLP spectral analysis; Recurrent Neural Network; time delay; 12 units; 176 units; 176 units; 40 units.

Linguistic channel: phoneme probabilities

Phoneme labels: aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z

state = 1; count_2 = 0; count_3 = 0; count_4 = 0
UTTERANCE_START_DELAY = 50ms; UTTERANCE_END_DELAY = 300ms

for each RNN output vector, l(t) {

    state 1: SILENCE
        if SIL != 1 {
            utteranceStartIndex = t
            state = 2
        } else {
            state = 1
        }

    state 2: POSSIBLE_START_OF_UTTERANCE
        count_2 = count_2 + 1
        if SIL = 1 {
            count_2 = 0
            state = 1
        } else if (count_2 > UTTERANCE_START_DELAY) {
            state = 3
        }

    state 3: UTTERANCE
        if SIL = 1 {
            state = 4
        } else {
            count_3 = count_3 + 1
            state = 3
        }

    state 4: POSSIBLE_END_OF_UTTERANCE
        count_4 = count_4 + 1
        if SIL != 1 {
            count_3 = count_3 + count_4
            count_4 = 0
            state = 3
        } else if count_4 > UTTERANCE_END_DELAY {
            utteranceEndIndex = t - count_4 - 1
            ProcessUtterance(utteranceStartIndex, utteranceEndIndex)
            count_2 = 0
            count_3 = 0
            count_4 = 0
            state = 1
        }
}
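The four-state detector above can be sketched in Python as follows. Several details are assumptions of this sketch rather than facts from the source: the input is a per-frame boolean `is_silence` sequence standing in for the SIL test on each RNN output vector, the two delays are given in frames rather than milliseconds, detected utterances are returned as a list instead of being passed to `ProcessUtterance`, and `count_3` is omitted because it never affects a transition.

```python
from typing import Iterable, List, Tuple

SILENCE, POSSIBLE_START, UTTERANCE, POSSIBLE_END = 1, 2, 3, 4


def detect_utterances(is_silence: Iterable[bool],
                      start_delay: int,
                      end_delay: int) -> List[Tuple[int, int]]:
    """Return (start_index, end_index) frame pairs for each detected utterance."""
    state = SILENCE
    count_2 = 0  # frames spent in POSSIBLE_START
    count_4 = 0  # frames spent in POSSIBLE_END
    start = 0
    found = []
    for t, sil in enumerate(is_silence):
        if state == SILENCE:
            if not sil:
                start = t
                state = POSSIBLE_START
        elif state == POSSIBLE_START:
            count_2 += 1
            if sil:  # too short to be speech: discard
                count_2 = 0
                state = SILENCE
            elif count_2 > start_delay:
                state = UTTERANCE
        elif state == UTTERANCE:
            if sil:
                state = POSSIBLE_END
        else:  # POSSIBLE_END
            count_4 += 1
            if not sil:  # short pause inside the utterance: keep going
                count_4 = 0
                state = UTTERANCE
            elif count_4 > end_delay:
                found.append((start, t - count_4 - 1))
                count_2 = 0
                count_4 = 0
                state = SILENCE
    return found
```

The two delays act as hysteresis: a brief non-silent blip never becomes an utterance, and a brief pause never ends one.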

Figure labels: utterance start; utterance end; null; null; a; b; silence.

Figure labels: RNN output phoneme probabilities (aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z); Hidden Markov Model; Viterbi algorithm; Most likely phoneme sequence: / b aa l /
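As a rough illustration of the decoding step named in this figure, the sketch below runs a textbook Viterbi search over per-frame state probabilities and collapses the resulting state path into a phoneme sequence. It is a generic formulation, not the thesis's model: the three-state left-to-right topology, the transition and initial probabilities, and the function names are all assumptions chosen for the example.

```python
import math
from itertools import groupby
from typing import List, Sequence


def safe_log(p: float) -> float:
    """Natural log that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]],
            initial: Sequence[float]) -> List[int]:
    """Most likely state path; emissions[t][k] is the score of state k at frame t."""
    n = len(initial)
    score = [safe_log(initial[k]) + safe_log(emissions[0][k]) for k in range(n)]
    back = []  # back[t][k]: best predecessor of state k at frame t + 1
    for frame in emissions[1:]:
        prev, score, pointers = score, [], []
        for k in range(n):
            best_j = max(range(n),
                         key=lambda j: prev[j] + safe_log(transitions[j][k]))
            pointers.append(best_j)
            score.append(prev[best_j] + safe_log(transitions[best_j][k])
                         + safe_log(frame[k]))
        back.append(pointers)
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def decode_phonemes(path: Sequence[int], labels: Sequence[str]) -> List[str]:
    """Collapse runs of the same state into a single phoneme label."""
    return [labels[state] for state, _ in groupby(path)]
```

With three states labeled `b`, `aa`, and `l` and frames that favor each in turn, the collapsed path reads `b aa l`, matching the form of the sequence shown in the figure.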

Figure panels: "yeah" and "dog". Axes in each panel: L-radius, S-radius, Mutual Information.

Three histogram panels. Horizontal axis: Distance between view-sets (0 to 40). Vertical axis: Histogram bin occupancy (normalized).


Bar chart (vertical axis 0 to 40): CELL 28%; Acoustic Recurrency 7%.

Bar chart (vertical axis 0 to 100): CELL 72%; Acoustic Recurrency 31%.

Bar chart (vertical axis 0 to 80): CELL 57%; Acoustic Recurrency 13%.

Figure labels: CELL; Spoken commands; Task semantics; User-dependent acoustic & semantic model.

Scene 1: User points to three colors in the rainbow and names them (lexical acquisition)

Scene 2: User selects a part from the "Tree of Life" by pointing to the part

Scene 3: Part is colored by speech using one of the three lexical items learned in Scene 1

Scene 4: User must select position for new body part using gesture, confirm with speech

Scene 5: A successfully placed part

Scene 6: After two more cycles of Scenes 2-5, the mate is complete and Toco looks on in new-found love
