A Study on the Video Scene Retrieving System with a Speech Recognizer

2013. 5. 14

Yoshika OSAWA

Kohno Lab.

Outline

1. Introduction

2. Aim of Study

3. Composition of System

i. Voice Divide Section

ii. Speech Recognize Section

iii. Scene Retrieve Section

4. Evaluation Experiment

5. Conclusion

1. Introduction

• A variety of video data are being generated, stored, and accessed with advances in the Internet.

• An efficient technique is needed to find a video scene quickly in this data.

1. Introduction

• Multimedia Annotations

o Nagao (2001)

1. Introduction

• A Subtitling System for Broadcast Programs with a Speech Recognizer

o Ando et al. (2001)

1. Introduction

• Extract the voice track from the video.

• Advantages of voice:

o Easy to convert into text.

o Simple association with scenes.

Apply speech recognition to scene retrieval.

2. Aim of Study

Implement a scene retrieving system, then verify its accuracy and check its operation.

Generate annotations automatically with speech recognition.

3. Composition of System

1. Select a video

2. Voice Divide Section

3. Speech Recognize Section

4. Input a keyword

5. Scene Retrieve Section

6. Output the result

i. Voice Divide Section

• Focus on the amplitude

o Use the signal while its amplitude exceeds the threshold value.

o Reject segments that are too short, because they cannot be recognized.

o Derive the thresholds by experiment.

Axis        Threshold
Amplitude   10 [%]
Time        1000 [ms]
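The amplitude-based splitting above can be sketched as follows. This is a minimal illustration, not the system's actual code. It assumes that the 10 % threshold is relative to the peak amplitude of the recording and that 1000 ms is the minimum length of a kept segment; the slide does not state either. The function name `split_voiced` and the 10 ms frame length are hypothetical.

```python
def split_voiced(samples, rate, amp_ratio=0.10, min_ms=1000, frame_ms=10):
    """Return (start, end) sample indices of segments louder than the threshold.

    Assumptions (not from the slides): the threshold is amp_ratio times the
    peak amplitude, and segments shorter than min_ms are rejected.
    """
    frame_len = max(1, rate * frame_ms // 1000)
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return []
    threshold = amp_ratio * peak
    min_len = rate * min_ms // 1000
    segments, start = [], None
    for pos in range(0, len(samples), frame_len):
        frame = samples[pos:pos + frame_len]
        # A frame counts as loud when its peak exceeds the threshold.
        loud = max(abs(s) for s in frame) > threshold
        if loud and start is None:
            start = pos
        elif not loud and start is not None:
            if pos - start >= min_len:
                segments.append((start, pos))
            start = None
    # Close a segment that runs to the end of the recording.
    if start is not None and len(samples) - start >= min_len:
        segments.append((start, len(samples)))
    return segments
```

Working on short frames rather than single samples keeps a segment from being cut every time the waveform passes through zero.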

ii. Speech Recognize Section

(1) Pre-Processing Unit

• Digitization

o Sampling frequency: 16 kHz

o Quantization: 16 bit

• Noise reduction

o Additive noise: subtract the noise estimated from silent intervals

o Multiplicative noise: subtract in the log domain

[Figure: microphone characteristics of the SM57]
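The two noise-reduction ideas can be illustrated on per-frame spectra. This is a rough sketch under assumptions: the slide does not name the exact methods, so the additive case is shown as a plain spectral subtraction of the average silence spectrum, and the multiplicative case as subtraction of the time-average of the log spectrum. Both function names are hypothetical.

```python
def subtract_additive(power_frames, silence_frames, floor=0.0):
    """Subtract the mean power spectrum of silent frames from every frame."""
    n_bins = len(silence_frames[0])
    noise = [sum(f[k] for f in silence_frames) / len(silence_frames)
             for k in range(n_bins)]
    # Clamp at the floor so that no bin ends up with negative power.
    return [[max(f[k] - noise[k], floor) for k in range(n_bins)]
            for f in power_frames]


def subtract_multiplicative(log_frames):
    """Remove a constant channel gain by subtracting the per-bin log mean.

    A fixed gain (for example a microphone response) multiplies the spectrum,
    so it becomes a constant offset in the log domain.
    """
    n_bins = len(log_frames[0])
    mean = [sum(f[k] for f in log_frames) / len(log_frames)
            for k in range(n_bins)]
    return [[f[k] - mean[k] for k in range(n_bins)] for f in log_frames]
```

With the second function, the same utterance recorded through two different fixed gains yields the same normalized log spectrum.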

(2) Feature Extraction Unit

Resonant frequencies are effective as feature values

• Resolution of human hearing

o Higher sensitivity at lower frequencies

• Filter that matches human hearing

Mel-frequency

• Inverse Fourier transform on the Mel-frequency axis

o New axis: cepstrum

o Separates the voice pitch from the resonant frequencies

• MFCC (Mel-Frequency Cepstral Coefficients)

o Information about vowels

• ΔMFCC

o Information about consonants

• Feature vector

o (Average power, MFCC, ΔMFCC)
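Two pieces of this unit are easy to show in isolation: the Mel scale and the Δ (delta) feature. The sketch below assumes one widely used Mel formula, m = 2595 log10(1 + f/700), and a simple central difference for the delta; the slides do not say which variants the system uses. All function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to Mel (one common definition, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_max_hz):
    """Center frequencies (Hz) of filters spaced evenly on the Mel axis."""
    top = hz_to_mel(f_max_hz)
    return [mel_to_hz(top * (i + 1) / (n_filters + 1)) for i in range(n_filters)]


def delta(frames):
    """Central difference of each coefficient over time.

    Edge frames reuse themselves as the missing neighbor.
    """
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev, nxt = frames[max(t - 1, 0)], frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out
```

For a 16 kHz sampling frequency, `mel_center_frequencies(n, 8000)` places the filters densely at low frequencies and sparsely at high frequencies, which mirrors the higher sensitivity of hearing at low frequencies.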

(3) Identification Unit

From Bayes' theorem
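The formula on this slide did not survive extraction. The decision rule usually derived from Bayes' theorem in speech recognition is given here as a reference; the symbols are assumptions, with W standing for a candidate word sequence and X for the observed feature vectors.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} P(X \mid W)\, P(W)
```

The last step holds because P(X) does not depend on W. In this form, P(X | W) is the acoustic model and P(W) is the language model.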

(3) Identification Unit

Speech waveform: observable

Character information: not directly observable

Estimate the character information from the waveform by using HMMs (Hidden Markov Models)

Maximum likelihood calculation: Viterbi algorithm

Machine learning: Baum-Welch algorithm
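A minimal sketch of the Viterbi algorithm on a toy discrete HMM may help to show what the maximum likelihood calculation does. The states, symbols, and probabilities below are hypothetical and chosen only for illustration; a real recognizer works on continuous feature vectors with much larger models.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state path and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               state that precedes s on that path)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev = best[-1]
        col = {}
        for s in states:
            p, r = max((prev[r][0] * trans_p[r][s], r) for r in states)
            col[s] = (p * emit_p[s][o], r)
        best.append(col)
    # Backtrack from the most likely final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for col in reversed(best[1:]):
        path.append(col[path[-1]][1])
    path.reverse()
    return path, prob


# Hypothetical two-state model with three quantized observation symbols.
states = ("a", "i")
start_p = {"a": 0.6, "i": 0.4}
trans_p = {"a": {"a": 0.7, "i": 0.3}, "i": {"a": 0.4, "i": 0.6}}
emit_p = {"a": {"x": 0.5, "y": 0.4, "z": 0.1},
          "i": {"x": 0.1, "y": 0.3, "z": 0.6}}
path, prob = viterbi(["x", "y", "z"], states, start_p, trans_p, emit_p)
```

Here the hidden states play the role of the character information and the symbols play the role of the observed waveform features. The Baum-Welch algorithm, not shown, would estimate `start_p`, `trans_p`, and `emit_p` from training data.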

iii. Scene Retrieve Section

• Matching the keyword against the recognized text

1. Input a keyword

2. Match the keyword by string searching

3. Extract the scene in which the keyword was spoken

4. Output a thumbnail
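The matching steps above can be sketched with a plain substring search over time-stamped transcripts. The `Segment` layout and the function name are hypothetical; the slides do not describe how the recognized text is stored, and thumbnail generation is left out.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_ms: int  # where the voice segment begins in the video
    end_ms: int    # where it ends
    text: str      # text produced by the speech recognizer


def retrieve_scenes(segments, keyword):
    """Return the segments whose recognized text contains the keyword."""
    if not keyword:
        return []
    return [seg for seg in segments if keyword in seg.text]


# Hypothetical annotations for one video.
segments = [
    Segment(0, 4200, "good evening here is the news"),
    Segment(5100, 9800, "the weather will be fine tomorrow"),
    Segment(10400, 15000, "in sports news today"),
]
hits = retrieve_scenes(segments, "news")
seek_points = [seg.start_ms for seg in hits]
```

Each hit carries the start time of its segment, which is what a player needs in order to seek to the scene.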

4. Evaluation Experiment

1. Compare the recognition result with the words actually heard

2. Calculate the recognition rate

3. Evaluate it for each number of characters

Sample data

Video    NHK news
Time     3 minutes
Number   30 videos
Words    457 words
Engine   Julius

The overall average recognition rate is 68%.

[Figure: recognition rate by word length, horizontal axis 1 to 6; bar labels recovered from the chart: 67%, 73%, 46%, 45%, 40%]
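The rate calculation can be sketched as below. It assumes that a word counts as recognized only when the output matches the spoken word exactly, which the slides do not state. The function name and the sample pairs are hypothetical and are not the experiment's data.

```python
from collections import defaultdict


def recognition_rates(results):
    """Compute recognition rates from (spoken, recognized) word pairs.

    Returns (rate per number of characters, overall rate).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for spoken, recognized in results:
        n = len(spoken)
        totals[n] += 1
        hits[n] += int(spoken == recognized)
    if not totals:
        return {}, 0.0
    by_length = {n: hits[n] / totals[n] for n in sorted(totals)}
    overall = sum(hits.values()) / sum(totals.values())
    return by_length, overall


# Hypothetical pairs, for illustration only.
pairs = [("ab", "ab"), ("ab", "ac"), ("abc", "abc"), ("abcd", "abcd")]
by_length, overall = recognition_rates(pairs)
```

Grouping by `len(spoken)` gives one rate per number of characters, which is the breakdown the chart reports.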

4. Evaluation Experiment

• Verify the correspondence between the keyword and the seek destination

o Select a thumbnail and play from that scene

o Check whether the keyword was spoken

4. Evaluation Experiment

• The recognition rate decreases as the number of characters increases.

• The retrieved scenes correspond to the keyword.

• Recognition errors occur in weak consonant parts

o The Voice Divide Section needs improvement

o Recognition accuracy must also be improved

5. Conclusion

• A system for watching video efficiently

o Uses speech recognition

o Makes annotations automatically

• Future work

o Adopt the zero-crossing count in the Voice Divide Section

o Take in the latest speech recognition technology

o Incorporate image recognition
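As a note on the first future-work item: the zero-crossing count is often used next to amplitude because unvoiced consonants tend to have low amplitude but many sign changes. A minimal sketch of the count for one frame follows; how it would be combined with the amplitude threshold is not stated in the slides, and the function name is hypothetical.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples in a frame.

    A sample of exactly zero is treated as non-negative.
    """
    signs = [s >= 0 for s in frame]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)
```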

Thank you for your attention!
