Transcript
Page 1: Recognition of Hands-Free Speech and Hand Pointing Action for Conversational TV

PURPOSE

Yasuo Ariki, Tetsuya Takiguchi and Atsushi Sako

Department of Computer and System Engineering, Kobe University

A conversational TV that users can ask for information about TV content.

HANDS-FREE SPEECH RECOGNITION

Multi-modal interaction with hands-free speech recognition and hand pointing recognition.
Multimedia analysis of broadcast content.
Context awareness to understand user intention.

What is the meaning of the word BROAD-BAND?

What is that?

Multimedia database

Internet

Presentation

Transform

Integration

Analysis

Retrieval

Retrieval

Presentation

Modality recognition: hands-free speech, speaker direction, hand pointing action

Front end processing

Metadata extraction: news, drama, soccer, baseball

Back end processing

show me the goal

show me the latest event

who is he?

CONVERSATIONAL TV

Observed signal

Beam forming

Hands-free speech input

Speaker direction estimation

User utterance section detection

Acoustic model adaptation

Speech recognition
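The poster names speaker direction estimation and beam forming but gives no details. As a minimal sketch only, assuming a two-microphone array, integer sample delays and the classic delay-and-sum method (none of which the poster confirms), the two steps could look like this. The function names and the default speed of sound are illustrative assumptions.

```python
import math


def estimate_delay(ref, sig, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation,
    i.e. how many samples `sig` trails `ref`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(ref):
            m = n + lag
            if 0 <= m < len(sig):
                score += r * sig[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def direction_from_delay(lag, sample_rate, mic_distance, speed_of_sound=343.0):
    """Convert an inter-microphone delay into an arrival angle in radians
    (0 means the speaker is straight ahead of the array)."""
    ratio = speed_of_sound * lag / (sample_rate * mic_distance)
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding outside [-1, 1]
    return math.asin(ratio)


def delay_and_sum(channels, delays):
    """Advance each channel by its delay and average the aligned samples."""
    out = []
    for n in range(len(channels[0])):
        total, count = 0.0, 0
        for channel, delay in zip(channels, delays):
            m = n + delay
            if 0 <= m < len(channel):
                total += channel[m]
                count += 1
        out.append(total / count if count else 0.0)
    return out


# Toy example: the second microphone hears the same pulse 3 samples later.
mic_a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
lag = estimate_delay(mic_a, mic_b, max_lag=5)
enhanced = delay_and_sum([mic_a, mic_b], [0, lag])
```

Averaging the aligned channels reinforces sound from the estimated direction while uncorrelated noise from other directions partly cancels, which is why the direction estimate is computed first.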

User recognition: who, how many

Watching history: personal profile, user context

Speaking style: enquiry, conversation, monologue

Hands-free utterance section

  Speech/Noise GMM
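The utterance section labels above pair detection with a speech/noise GMM. The following is a hedged sketch of that idea, assuming diagonal-covariance Gaussian mixture models and a frame-by-frame likelihood comparison; the feature (a one-dimensional log-energy value) and all model parameters are invented for illustration and are not taken from the poster.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(log_p)
    top = max(log_terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def detect_utterance_sections(frames, speech_gmm, noise_gmm):
    """Return (start, end) frame index pairs, end exclusive, where the
    speech model is more likely than the noise model."""
    sections, start = [], None
    for i, x in enumerate(frames):
        is_speech = (gmm_log_likelihood(x, *speech_gmm)
                     > gmm_log_likelihood(x, *noise_gmm))
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(frames)))
    return sections


# Hypothetical single-component models: (weights, means, variances).
speech_gmm = ([1.0], [[5.0]], [[1.0]])
noise_gmm = ([1.0], [[0.0]], [[1.0]])
frames = [[0.1], [0.2], [4.8], [5.3], [5.1], [0.0]]
sections = detect_utterance_sections(frames, speech_gmm, noise_gmm)
```

A practical detector would normally smooth the per-frame decisions (for example with a minimum section length) before passing the sections to the recognizer.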

Emotion recognition: satisfaction

Video editing

Preferred content retrieval

Summary and explanation

Context awareness: pronouns, abbreviations

User analysis

Contents analysis, editing, retrieval

Recognition of intention

  Action recognition: finger mouse

Recognition of requirement

Recognition of profile

Recognition of mind

  Speech recognition

CONTEXT AWARENESS

Skin color region extraction and noise reduction from camera images

Estimation of the two-dimensional coordinates of the fingertip and head in the images

Estimation of the three-dimensional coordinates of the fingertip and head in the real world by camera calibration

Estimation of the pointed coordinate on the screen
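The final step above maps three-dimensional head and fingertip positions to a point on the screen. One common way to do this, assumed here rather than stated on the poster, is to extend the line from the head through the fingertip until it meets the screen plane. In this sketch the screen is described by a hypothetical origin corner and two unit vectors along its horizontal and vertical edges.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def pointed_screen_coordinate(head, finger, origin, u_axis, v_axis):
    """Intersect the head-to-fingertip ray with the screen plane.

    Returns (u, v) measured from `origin` along the screen axes, or None
    when the ray is parallel to the screen or points away from it.
    """
    normal = cross(u_axis, v_axis)
    direction = sub(finger, head)
    denom = dot(normal, direction)
    if abs(denom) < 1e-12:
        return None  # ray parallel to the screen plane
    t = dot(normal, sub(origin, head)) / denom
    if t <= 0:
        return None  # screen is behind the user
    hit = tuple(h + t * d for h, d in zip(head, direction))
    offset = sub(hit, origin)
    return (dot(offset, u_axis), dot(offset, v_axis))


# Toy setup in metres: screen in the z = 0 plane, user 3 m in front of it.
point = pointed_screen_coordinate(
    head=(0.5, 0.5, 3.0),
    finger=(0.6, 0.5, 2.5),
    origin=(0.0, 0.0, 0.0),
    u_axis=(1.0, 0.0, 0.0),
    v_axis=(0.0, 1.0, 0.0),
)
```

Dividing the returned metric coordinates by the physical screen size and multiplying by the pixel resolution would give a cursor position, which is how a "finger mouse" could be driven.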

HAND POINTING RECOGNITION

DEMONSTRATION 1, DEMONSTRATION 2 and DEMONSTRATION 3

Function: You can tell your television what you want to watch.

Example

Method: face extraction and recognition; user speech recognition; estimation of what the user wants and about whom; presentation of the explanation video.

The same utterance, such as “show me grand slam”, yields different videos depending on who is on screen.

Mr. Matsui is on.

Mr. Ichiro is on.

Grand slam of Mr. Matsui

Grand slam of Mr. Ichiro

Show me grand slam

Function: You can watch sports in your preferred style.

Example

Method: video generation by digital camera work; event recognition; user speech recognition; estimation of what the user wants; presentation of the corresponding video.

The television understands even a spoken request such as “show me the previous goal”.

Scenes with events

Show me individual plays more

Show me the previous goal
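Resolving a request like “show me the previous goal” requires linking the recognized words to the events already detected in the match. The sketch below assumes that event recognition produces a time-stamped list of typed events and that "previous" means the latest matching event before the current playback position; the `Event` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time_sec: float  # position of the event in the broadcast
    kind: str        # event label, e.g. "goal" or "foul"


def previous_event(events, kind, now_sec):
    """Return the most recent event of the given kind before now_sec,
    or None if no such event has occurred yet."""
    candidates = [e for e in events if e.kind == kind and e.time_sec < now_sec]
    return max(candidates, key=lambda e: e.time_sec) if candidates else None


# Toy event list for one match.
events = [
    Event(120.0, "goal"),
    Event(300.0, "foul"),
    Event(1500.0, "goal"),
    Event(2400.0, "goal"),
]
target = previous_event(events, "goal", now_sec=2000.0)
```

The returned event's time stamp could then be used to cue the corresponding scene for playback.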

Function: You can ask your television about things you do not know.

Example

What is it?

“It” refers to the most recent important word.

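The Tokyo District Court pointed out that the issuance of stock acquisition rights by Nippon Broadcasting was grossly unfair and issued a decision prohibiting the issuance.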

Speech recognition result

Method: announcer speech recognition; important word extraction (TF-IDF); user speech recognition; estimation of the important word; presentation of the explanation video.

Scenes with the important words

The television understands even a vague spoken question such as “what is it?”
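To illustrate how “it” might be tied to the last important word, here is a minimal sketch that scores the words of the latest recognized announcer sentence with TF-IDF and returns the last word whose score reaches a threshold. The smoothed IDF formula, the threshold and the English toy tokens are assumptions; a real system would work on Japanese recognition output segmented into words.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Score each word of doc_tokens by term frequency times a smoothed
    inverse document frequency computed over `corpus`."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df))
        scores[word] = (count / len(doc_tokens)) * idf
    return scores


def last_important_word(doc_tokens, corpus, threshold):
    """Return the last word in the sentence whose TF-IDF score is at
    least `threshold`, or None if no word qualifies."""
    scores = tf_idf(doc_tokens, corpus)
    for word in reversed(doc_tokens):
        if scores[word] >= threshold:
            return word
    return None


# Toy corpus: the first document is the latest announcer sentence.
corpus = [
    ["the", "court", "banned", "the", "warrant", "today"],
    ["the", "team", "banned", "today"],
    ["the", "weather", "today"],
]
referent = last_important_word(corpus[0], corpus, threshold=0.1)
```

Words that occur in every document, such as "the" and "today" here, receive a score of zero, so the pronoun skips them and lands on the rarer term.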
