Recognition of Hands-Free Speech and Hand Pointing Action for Conversational TV

Yasuo Ariki, Tetsuya Takiguchi and Atsushi Sako

Department of Computer and System Engineering, Kobe University

PURPOSE

A conversational TV that viewers can ask for information about TV content.

  • Multi-modal interaction through hands-free speech recognition and hand pointing recognition.
  • Multimedia analysis of broadcast content.
  • Context awareness to understand the user's intention.

CONVERSATIONAL TV

Example user requests:

  • “What is the meaning of the word BROAD-BAND?”
  • “What is that?”
  • “Show me the goal.”
  • “Show me the latest event.”
  • “Who is he?”

Front end processing:

  • Modality recognition: hands-free speech, speaker direction, hand pointing action

Back end processing:

  • Metadata extraction: news, drama, soccer, baseball

System components: multimedia database, Internet, analysis, retrieval, integration, transform, presentation.

HANDS-FREE SPEECH RECOGNITION

  • Observed signal
  • Beamforming
  • Hands-free speech input
  • Speaker direction estimation
  • User utterance section detection
  • Acoustic model adaptation
  • Speech recognition
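The beamforming and speaker direction estimation steps listed above can be illustrated with a delay-and-sum beamformer. This is only a minimal sketch: the poster does not say which beamforming method is used, and the uniform linear microphone array, the integer sample delays, and the names `delay_and_sum`, `estimate_lag` and `lag_to_angle_deg` are all assumptions made for illustration.

```python
import math


def delay_and_sum(channels, delays):
    """Average the channels after advancing channel m by delays[m] samples."""
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = t + d
            if 0 <= idx < n:  # samples shifted outside the buffer count as zero
                acc += ch[idx]
        out.append(acc / len(channels))
    return out


def energy(signal):
    return sum(x * x for x in signal)


def estimate_lag(channels, max_lag):
    """Scan candidate inter-microphone lags (assumed uniform linear array)
    and return the lag whose steered beam has the highest output energy."""
    best_lag, best_energy = 0, -1.0
    for lag in range(-max_lag, max_lag + 1):
        delays = [m * lag for m in range(len(channels))]
        e = energy(delay_and_sum(channels, delays))
        if e > best_energy:
            best_lag, best_energy = lag, e
    return best_lag


def lag_to_angle_deg(lag, sample_rate_hz, mic_spacing_m, speed_of_sound=343.0):
    """Convert an inter-microphone lag in samples to an arrival angle
    relative to broadside. All parameter values are hypothetical."""
    s = speed_of_sound * lag / (sample_rate_hz * mic_spacing_m)
    s = max(-1.0, min(1.0, s))  # guard against rounding outside asin's domain
    return math.degrees(math.asin(s))
```

Steering the array toward the estimated lag with `delay_and_sum` then yields the enhanced single-channel signal that later stages such as utterance detection and recognition would consume.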

CONTEXT AWARENESS

  • User recognition: who, how many
  • Watching history, personal profile, user context
  • Speaking style: enquiry, conversation, monologue
  • Hands-free utterance section
  • Speech/noise GMM
  • Emotion recognition: satisfaction
  • Video editing
  • Preferred content retrieval
  • Summary and explanation
  • Context awareness: pronoun, abbreviation
  • User analysis
  • Contents analysis, editing, retrieval
  • Recognition of intention
  • Action recognition: finger mouse
  • Recognition of requirement
  • Recognition of profile
  • Recognition of mind
  • Speech recognition
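The "Speech/Noise GMM" and "utterance section" labels above suggest that utterance sections are found by comparing a speech model with a noise model frame by frame. The following is a rough sketch of that idea, not the authors' implementation: the one-dimensional log-energy feature, the mixture parameters in `SPEECH_GMM` and `NOISE_GMM`, and the `min_frames` smoothing rule are hypothetical values chosen for illustration. A real system would train the mixtures on spectral features.

```python
import math

# Hypothetical 1-D mixtures over frame log-energy: (weight, mean, variance).
SPEECH_GMM = [(0.6, -2.0, 1.0), (0.4, -0.5, 0.5)]
NOISE_GMM = [(0.7, -8.0, 1.0), (0.3, -6.0, 1.5)]


def log_gauss(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def gmm_loglik(x, components):
    """Log-likelihood of x under a mixture, computed with log-sum-exp."""
    terms = [math.log(w) + log_gauss(x, m, v) for w, m, v in components]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def frame_log_energy(samples, frame_len):
    """Split the signal into non-overlapping frames and return log mean energy."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        e = sum(x * x for x in frame) / frame_len
        feats.append(math.log(e + 1e-10))  # small floor avoids log(0)
    return feats


def detect_utterances(feats, min_frames=3):
    """Return (start, end) frame index pairs, end exclusive, for runs of at
    least min_frames frames where the speech model beats the noise model."""
    sections, start = [], None
    for i, x in enumerate(list(feats) + [None]):  # None flushes the last run
        is_speech = (x is not None
                     and gmm_loglik(x, SPEECH_GMM) > gmm_loglik(x, NOISE_GMM))
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if i - start >= min_frames:
                sections.append((start, i))
            start = None
    return sections
```

The minimum run length is one simple way to ignore short noise bursts; other smoothing schemes would serve equally well.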

HAND POINTING RECOGNITION

  • Skin color region extraction and noise reduction from camera images
  • Estimation of the two-dimensional coordinates of the fingertip and head in the images
  • Estimation of the three-dimensional coordinates of the fingertip and head in the real world by camera calibration
  • Estimation of the pointed coordinate on the screen
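The last step, estimating the pointed coordinate on the screen, can be sketched as a ray-plane intersection once the three-dimensional head and fingertip positions are known. The sketch below assumes that the pointing direction is the ray from the head through the fingertip and that coordinates are expressed in a frame where the screen lies in the plane z = 0 with the viewer at z > 0. The poster does not spell out either convention, and the function name `pointed_screen_coordinate` is hypothetical.

```python
def pointed_screen_coordinate(head, fingertip):
    """Intersect the ray from head through fingertip with the plane z = 0.

    head and fingertip are (x, y, z) tuples in metres in an assumed
    screen-aligned frame. Returns (x, y) on the screen plane, or None if
    the ray is parallel to the screen or points away from it.
    """
    hx, hy, hz = head
    fx, fy, fz = fingertip
    dz = fz - hz
    if dz == 0:
        return None  # ray parallel to the screen plane
    t = -hz / dz     # ray parameter where z reaches 0
    if t <= 0:
        return None  # intersection lies behind the head
    return (hx + t * (fx - hx), hy + t * (fy - hy))


# Usage: head 2 m from the screen, fingertip 0.5 m in front of the head.
point = pointed_screen_coordinate((0.0, 0.0, 2.0), (0.1, -0.05, 1.5))
```

Mapping the returned metric coordinate to a pixel position would additionally require the physical screen size and origin, which are not given here.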

DEMONSTRATIONS

DEMONSTRATION 1

Function: You can tell your television what you want to watch.

Example: The same utterance, such as “Show me the grand slam”, retrieves different videos depending on whether Mr. Matsui or Mr. Ichiro is on.

  • Mr. Matsui is on: grand slam of Mr. Matsui
  • Mr. Ichiro is on: grand slam of Mr. Ichiro

Method:

  • Face extraction and recognition
  • User speech recognition
  • Estimation of the user's requirement (what, of whom)
  • Presentation of explanation video
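The first demonstration combines two recognition results: the person identified by face recognition and the event named in the user's utterance. A minimal sketch of that combination follows. The index structure, the file names and the function `resolve_request` are hypothetical and only illustrate how the same utterance can map to different videos.

```python
def resolve_request(utterance, person, video_index):
    """Pick a video from the spoken event and the recognized person.

    video_index maps (event, person) pairs to a video identifier.
    Longer event names are tried first so that the most specific one wins.
    """
    text = utterance.lower()
    events = sorted({event for event, _ in video_index}, key=len, reverse=True)
    for event in events:
        if event in text:
            return video_index.get((event, person))
    return None


# Hypothetical index; the identifiers are placeholders, not real files.
VIDEO_INDEX = {
    ("grand slam", "Matsui"): "matsui_grand_slam",
    ("grand slam", "Ichiro"): "ichiro_grand_slam",
}
```

With this index, the request “Show me the grand slam” resolves to a different entry depending on which person the face recognizer reports.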

DEMONSTRATION 2

Function: You can watch sports in your preferred style.

Example: Requests such as “Show me more individual plays” or “Show me the previous goal” are answered with the scenes that contain the corresponding events. The television understands even a spoken request such as “Show me the previous goal”.

Method:

  • Video generation by digital camera work
  • Event recognition
  • User speech recognition
  • Estimation of what the user wants
  • Presentation of the corresponding video
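A request such as “Show me the previous goal” has to be resolved against the events recognized so far. One simple way to do this, assuming event recognition yields a time-sorted list of (time, kind) pairs, is to search backwards from the current playback time. The data layout and the function name `previous_event` are assumptions made for this sketch.

```python
from bisect import bisect_left


def previous_event(events, kind, now):
    """Return the latest event of the given kind strictly before `now`.

    events is a list of (time_sec, kind) tuples sorted by time.
    Returns None when no such event exists.
    """
    times = [t for t, _ in events]
    i = bisect_left(times, now) - 1  # last index with time < now
    while i >= 0:
        if events[i][1] == kind:
            return events[i]
        i -= 1
    return None


# Hypothetical event log for one match.
EVENTS = [(310.0, "goal"), (905.5, "free kick"), (1620.0, "goal"), (2400.0, "corner")]
```

The returned time stamp could then be used to cue the corresponding scene for presentation.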

DEMONSTRATION 3

Function: You can ask your television about things you do not know.

Example: When the user asks “What is it?”, “it” indicates the last important word.

Speech recognition result: “The Tokyo District Court pointed out that Nippon Broadcasting's issuance of new share warrants is grossly unfair and issued a decision prohibiting the issuance.”

Method:

  • Announcer speech recognition
  • Important word extraction (TF-IDF)
  • User speech recognition
  • Estimation of the important word
  • Presentation of explanation video (scenes with the important words)

The television can understand even a request as short as “What is it?”
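The important word extraction step can be sketched with a plain TF-IDF score, followed by a backward scan that picks the most recent word whose score passes a threshold. This is an illustrative sketch only: the tokenized input, the particular IDF variant `log(N / (1 + df))`, the threshold value and the function names are assumptions, and the poster does not describe how the authors set them.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Score each word of doc_tokens against a background corpus.

    corpus is a list of token collections. The IDF variant used here,
    log(N / (1 + df)), is zero or negative for very common words.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log(n_docs / (1 + df))
        scores[word] = (count / len(doc_tokens)) * idf
    return scores


def last_important_word(recent_tokens, corpus, threshold):
    """Return the most recently spoken word whose TF-IDF score reaches
    the threshold, or None if no word qualifies."""
    scores = tf_idf(recent_tokens, corpus)
    for word in reversed(recent_tokens):
        if scores[word] >= threshold:
            return word
    return None


# Hypothetical background corpus and recognized announcer speech.
CORPUS = [
    {"the", "weather", "today", "is", "fine"},
    {"the", "game", "ended", "today"},
    {"the", "court", "opened", "today"},
    {"the", "market", "closed"},
]
RECENT = ["the", "court", "banned", "the", "warrants", "today"]
```

When the user then asks “What is it?”, the word returned by `last_important_word` would serve as the query for retrieving an explanation video.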