Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery Chia-Hung Yeh Signal and Image Processing Institute Department of Electrical Engineering University of Southern California


Page 1: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Chia-Hung Yeh

Signal and Image Processing Institute, Department of Electrical Engineering

University of Southern California

Page 2: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Vision

Page 3: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Guidelines

Motivation

Introduction

Overview of visual and audio content

Video abstraction

Multimodal information concept

Knowledge discovery via video mining

Our previous work

Conclusion and future work

Page 4: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Motivation

Amazing growth in the amount of digital video data in recent years

Develop tools to classify, retrieve, and abstract video content

Develop tools for summarization and abstraction

Bridge the gap between low-level features and high-level semantic content

Letting machines understand video is important and challenging

Page 5: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Why, What and How

Why video content analysis?
– Modern multimedia technologies have led to huge digital video collections, but efficient access to video content is still in its infancy because of its bulky data volume and unstructured data format.

What is video content analysis?
– Video content analysis analyzes the video content and attempts to automatically understand the embedded video semantics as humans do.

How to do video content analysis?

Page 6: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Overview of Visual Content

Structured analysis
– Extract hierarchical video structure

[Figure: hierarchical video structure (frame, shot, scene, event/tempo, linked by grouping) with a gap to the semantic level, shown alongside the analogous structure of a text document (words, sentences, key sentences; "segmented into", "grouped into").]

Page 7: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Overview of Audio Content

Continuous in the time domain, unlike visual content

Multiple sound sources exist in a sound track, like many objects in a single frame

It is tough to separate audio content and give a suitable description

Framework in MPEG-7: silence, timbre, waveform, spectral, harmonic, and fundamental frequency

Some special features for music and speech

Page 8: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Content-Based Video Indexing

Process of attaching content based labels to video shots

Essential for content-based classification and retrieval

Some required techniques– Shot detection

– Key frame selection

– Object segmentation and recognition

– Visual/audio feature extraction

– Speech recognition, video text, VOCR

Page 9: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Content-Based Video Classification

Segment & classify videos into meaningful categories

Classify videos based on predefined topics

Multimodal concept

– Visual features

– Audio features

– Metadata features

Domain-specific knowledge

Page 10: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Query (Retrieval Methods)

Simple visual feature query

Feature combination query

Query by example (QBE)
– Retrieve video that is similar to the example

Localized feature query
– Example: retrieve video with a car moving toward the right

Object relationship query

Concept query (query by keyword)

Metadata
– Time, date, etc.

Page 11: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

The Ways to Browse a Video

Playback faster
– Audio time scale modification: time saving factor 1.5 to 2.5
– 15% - 20% time reduction by removing and shortening pauses

Storyboard– Composed of representative still frames (Keyframes)

Moving storyboard– Display keyframes while synchronized with the original audio track

Highlight– Pre-defined special event (example: sport and news)

Skimming– Extract short video clips to build a much shorter video

Page 12: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Timeline of Related Technique Development

[Figure: timeline of related technique development, running from basic tools through low-level features development to high-level semantic concepts. Techniques shown: digital image processing, digital signal processing, text recognition, speech recognition, audio processing, image retrieval, video retrieval, video browsing, video abstraction, video summarization, video skimming.]

Page 13: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Image Retrieval and Video Browsing

Query by Image Content (QBIC), IBM, 1995– Complex multi-feature and multi-object queries

Video browsing
– Quickly and efficiently discover the information

– Browsing and searching usually complement each other

– Visual content browsing is easier than audio content browsing

– Achieved by static storyboard, dynamic video clips, fast forward

Representative work– Gary Marchionini, University of Maryland

– S.-F. Chang, Columbia University

Page 14: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Video Abstraction

Video summarization and video skimming
– Belong to video abstraction and differ from video browsing

– Automatically retrieve a collection of the most significant and most representative segments

Required techniques– Shot detection, scene generation

– Motion analysis

– Face recognition

– Audio segmentation

– Text detection

– Music detection

Page 15: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Video Abstraction

A video abstract
– A sequence of still or moving images that preserves the essential content of the original video while being much shorter than the original

Applications
– Automated authoring of web content
• Web news
• Web seminar

– Consumer domain applications
• Analyzing, filtering, and browsing

Page 16: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Video Summarization (I)

A collection of salient frames that represent the underlying content

Most related work focuses on ways to extract still frames

Categorize into three classes– Frame-based

• Randomly or uniformly select

– Shot-based• Keyframe

– Feature-based• Motion, color and so on

Page 17: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Video Summarization (II)

Representative work– Y. Taniguchi, (1995)

• Frame-based scheme

• Simple, but may not be representative because shot lengths are not uniform

– H.-J. Zhang, Microsoft Research China (1997)
• Keyframe based on color histogram

– Gong and Liu, NEC Laboratories America (2003)
• SVD (Singular Value Decomposition)

• Capture temporal and spatial characteristics

– Tseng, Lin and J. R. Smith, IBM T. J. Watson Research Center (2002)
• Video summarization scheme for pervasive mobile devices

Page 18: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Video Skimming

A good skim is much like a movie trailer

A synopsis of the entire video

Representative work
– M. Smith and T. Kanade, Carnegie Mellon University (1995)
• Audio and image characterization

– S. Pfeiffer, University of Mannheim (1996)
• VAbstract system
• Detection of special events such as dialogs, explosions, and text occurrences

– H. Sundaram and S.-F. Chang, Columbia University (2001)
• A semantic skimming system
• Visual complexity for human understanding
• Film syntax

Page 19: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Video Skimming – Application

Video content transcoding– Content-based live sport video filtering

Page 20: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Video Shot Structure

A shot, a cinematic term, is the smallest addressable video unit (the building block). A shot contains a set of continuously recorded frames.

Two types of shot boundaries:
– Camera break: abrupt content change between neighboring frames. Usually corresponds to an editing cut

– Gradual transition: smooth content change over a set of consecutive frames. Usually caused by special effects

Shot detection is usually the first step towards video content analysis
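A camera break can be illustrated by comparing the gray-level histograms of neighboring frames. The sketch below is only a minimal illustration under stated assumptions, not the detector used in this work: frames are assumed to be flat lists of 8-bit gray levels, and the names `histogram` and `detect_cuts`, the bin count, and the threshold are hypothetical.

```python
def histogram(frame, bins=16):
    """Normalized gray-level histogram of a frame given as a flat list of 0..255 values."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // 256] += 1
    total = float(len(frame))
    return [c / total for c in counts]


def detect_cuts(frames, threshold=0.5):
    """Return indices i where an abrupt change occurs between frame i-1 and frame i."""
    if not frames:
        return []
    cuts = []
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        # L1 distance between two normalized histograms lies in [0, 2].
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:
            cuts.append(i)
        prev = cur
    return cuts
```

A single threshold of this kind targets camera breaks only. A gradual transition spreads the change over many frames, so each frame-to-frame difference may stay below the threshold.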

Page 21: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Scene Characteristics

A scene is a semantic concept that refers to a relatively complete video paragraph with coherent semantic meaning. It is subjectively defined.

Shots within a movie scene have the following 3 features
– Visual similarity
• Since a scene can only be developed within certain spatial and temporal localities, directors have to repeat some essential shots to convey parallelism and continuity of activities, due to the sequential nature of film making

– Audio similarity
• Similar background noises
• Speech from the same person has similar acoustic characteristics

– Time locality
• Visually similar shots should also be temporally close to each other if they belong to the same scene

Page 22: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Basic Audio Features

Energy
– Silence or pause detection

Zero crossing rate (ZCR)
– The frequency with which the audio signal amplitude passes through the zero value in a given time

Energy centroid
– Speech range: 100 Hz to 7 kHz
– Music range: 16 Hz to 16000 Hz

Band periodicity
– Harmonic sounds
– Music: High frequency components are integer multiples of the lowest one
– Speech: Pitch

MFCC (Mel-Frequency Cepstral Coefficients)
– 13 linearly-spaced filters
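The two simplest features above, energy and zero crossing rate, can be sketched per audio frame as follows. This is a minimal sketch assuming each frame is a list of samples in the range -1 to 1; the function names and the silence threshold are hypothetical and would need tuning on real data.

```python
def short_time_energy(frame):
    """Mean squared amplitude of one audio frame (a list of samples)."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = 0
    for a, b in zip(frame, frame[1:]):
        if (a >= 0) != (b >= 0):
            crossings += 1
    return crossings / (len(frame) - 1)


def is_silence(frame, energy_threshold=1e-4):
    """Flag a frame as silence or pause when its energy is below the threshold."""
    return short_time_energy(frame) < energy_threshold
```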

Page 23: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Multimodal Information Concept

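[Figure: multimodal information concept. Video data passes through multimodal content segmentation and multimodality fusion/integration to produce semantic units; the diagram labels who, when, how, where, what, and the relations between them.]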

Page 24: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Multimodal Framework for Video Content Interpretation

Application to automatic TV program abstraction

Allow users to request topic-level programs

Integrate multiple modalities: visual, audio, and text information

Multi-level concepts
– Low: low-level features
– Mid: object detection, event modeling
– High: classification result of semantic content

Probabilistic model: using a Bayesian network for classification (causal relationship, domain knowledge)

Page 25: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Probabilistic Model – Data Fusion

[Figure: probabilistic model for data fusion within a constrained domain. Input data (video data split into visual, audio, and metadata information) feeds low-level features (V_feature 1..n, A_feature 1..n, M_feature 1..n), middle-level detectors (V_detector 1..m, A_detector 1..m, M_detector 1..m), and fusion nodes (Fusion 1 to 3) that yield the high-level semantic concepts 1 to 3.]

Page 26: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

How to Work with the Framework

Preprocessing
– Video segmentation (shot detection) and key frame selection
– VOCR, speech recognition

Feature extraction
– Visual features based on key frames
• Color, texture, shape, sketch, etc.

– Motion features
• Camera operation: panning, tilting, zooming, tracking, booming, dollying
• Motion trajectories (moving objects)
• Object abstraction, recognition

– Audio features
• Average energy, bandwidth, pitch, mel-frequency cepstral coefficients, etc.

– Textual features (transcript)
• Knowledge tree with many keyword categories: politics, entertainment, stock, art, war, etc.
• Word spotting, vote histogram

Building and training the Bayesian network
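To make the training step concrete, the sketch below uses the simplest possible Bayesian network: a naive Bayes model in which the topic is the only parent of each binary mid-level detector output. This is a swapped-in simplification for illustration, not the network described in these slides; the detector names, labels, and data are all hypothetical.

```python
import math
from collections import defaultdict


def train(samples):
    """samples: list of (label, {detector_name: 0 or 1}). Returns priors and conditionals."""
    label_counts = defaultdict(int)
    fired = defaultdict(lambda: defaultdict(int))
    detectors = set()
    for label, outputs in samples:
        label_counts[label] += 1
        for name, value in outputs.items():
            detectors.add(name)
            fired[label][name] += value
    total = len(samples)
    priors = {c: n / total for c, n in label_counts.items()}
    # Laplace smoothing avoids zero probabilities for unseen combinations.
    cond = {c: {d: (fired[c][d] + 1) / (label_counts[c] + 2) for d in detectors}
            for c in label_counts}
    return priors, cond


def classify(outputs, priors, cond):
    """Return the label with the highest posterior log-probability."""
    best_label, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for d, p in cond[c].items():
            fired_now = outputs.get(d, 0)
            score += math.log(p if fired_now else 1.0 - p)
        if score > best_score:
            best_label, best_score = c, score
    return best_label
```

A full Bayesian network would also model dependencies among detectors and fusion nodes; the naive structure is chosen here only because it can be trained by counting.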

Page 27: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Challenging Points

Preprocessing is significant in the framework.– Accuracy of key-frame selection

– Accuracy of speech recognition & VOCR

Good feature extraction is important for the performance of classification.

Modeling semantic video objects and events

How to integrate multiple modalities still needs to be well considered

Page 28: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Knowledge Discovery via Video Mining

Objectives
– Find the hidden links between isolated news, events, etc.
– Find the general trend of an event's development
– Predict possible future events
– Discover abnormal events

Required technologies
– Domain-specific knowledge model
– Mining association rules, sequential patterns, and correlations
– Effective and fast classification and clustering

Challenges
– Model build-up in special knowledge domains
– Integration of semantic mining and feature-based mining
– Effective and scalable classification and clustering algorithms
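Association rule mining rests on two standard measures, support and confidence. The sketch below computes both over a set of news stories, each represented as a set of concept labels. The labels and the story data in the usage are hypothetical, and a practical miner would add a candidate-generation step such as Apriori on top of these measures.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    s_a = support(transactions, antecedent)
    if s_a == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / s_a
```

For example, with four stories labeled {election, speech}, {election, speech, protest}, {election}, and {weather}, the rule election -> speech has support 0.5 and confidence 2/3.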

Page 29: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Video Mining Issues

Frequent/sequential pattern discovery
– Fast and scalable algorithms for mining frequent, sequential, and structured patterns and for correlation analysis
– Similarity of rule/event search/measurement

Efficient and fast classification and clustering algorithms
– Constraint-based classification and clustering algorithms
– Spatiotemporal data mining algorithms
– Stream data mining (classification and clustering) algorithms

Surprise/outlier discovery and measurement
– Detection of outliers based on similarity and trend analysis
– Detection of outliers and surprising events based on stream data mining algorithms

Multidimensional data mining for trend prediction

Page 30: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Framework of Video Mining

[Figure: framework of video mining. Blocks shown: multimedia data, video content analysis, a mining engine (feature mining, frequent mining, sequential mining, exception mining, move mining), a specific domain, and the resulting knowledge.]

Page 31: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Our Previous Work

TV commercial detection
– Visual/audio information processing

Cinema rules
– Intensity mapping

Tempo analysis in digital video (professional video)
– Audio tempo
– Motion tempo

Home video processing (non-professional)
– Quality enhancement (bad shot detection)
– Music and video matching

Page 32: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Commercial Detection

First step to do any TV program content management

Monitor broadcast
– Government
– Advertisement company

Commercial features– Delimiting black frame (not available in some countries)

– High cut frequency and short shot interval (important feature)

– Still images

– Special editing styles and effects

– Text and logo

Page 33: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Commercial Detection

Visual information processing– Black frame detection

– Shot detection & its statistic analysis

– Still image detection

– Text-region detection

– Edge change rate detection

Audio information processing

– Volume control

– Silence

Page 34: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Commercial Detection

[Figure: structure of a TV program. Normal program segments (with station logo) alternate with commercial spots, with black frames marked between them.]

Page 35: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Shot Detection & Its Statistic Analysis

[Figure: two panels over frame index 0 to 10000. Top panel: shot boundary detection. Bottom panel: statistic analysis (mean and variance), with the commercial block and the commercial start point marked.]
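The statistic analysis can be sketched as a sliding window over shot lengths: commercials have a high cut frequency, so a run of consecutive short shots is a candidate commercial block. This is a minimal sketch under assumed parameters; the window size, the length limit, and the function names are hypothetical, and the slides do not state the thresholds actually used.

```python
def shot_lengths(cut_frames, total_frames):
    """Lengths (in frames) of the shots delimited by sorted cut positions."""
    bounds = [0] + list(cut_frames) + [total_frames]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def mean_and_variance(values):
    """Population mean and variance of a list of numbers."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v


def commercial_candidates(cut_frames, total_frames, window=5, max_mean_length=60):
    """Return (start_frame, end_frame) for each run of `window` shots with a short mean length."""
    bounds = [0] + list(cut_frames) + [total_frames]
    lengths = shot_lengths(cut_frames, total_frames)
    candidates = []
    for i in range(len(lengths) - window + 1):
        m, _ = mean_and_variance(lengths[i:i + window])
        if m <= max_mean_length:
            candidates.append((bounds[i], bounds[i + window]))
    return candidates
```

Overlapping candidates would normally be merged into one block, and the other cues on the previous slides (black frames, still images, text regions, silence) would be combined with this one before a final decision.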

Page 36: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Still Image Detection

Still image
– A video clip is composed of a sequence of images

– Find a set of consecutive images that have little change over a period of time

Difficulty
– Even though a video clip appears still, the difference between two consecutive images is seldom zero

– It is tough to measure the moving part (human eyes are sensitive to motion)

Main idea
– Quantify motion in each image to detect still images
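One way to quantify motion is the fraction of pixels that change noticeably between consecutive frames; a run of frames with a very low fraction is then reported as a still segment. The sketch below assumes frames are flat lists of gray levels, and its thresholds and function names are hypothetical rather than taken from this work.

```python
def motion_score(prev, cur, pixel_threshold=10):
    """Fraction of pixels whose gray level changed by more than pixel_threshold."""
    changed = sum(1 for a, b in zip(prev, cur) if abs(a - b) > pixel_threshold)
    return changed / len(cur)


def still_segments(frames, motion_threshold=0.02, min_length=3):
    """Return (first, last) frame index pairs of runs with almost no motion."""
    segments = []
    start = None
    for i in range(1, len(frames)):
        if motion_score(frames[i - 1], frames[i]) <= motion_threshold:
            if start is None:
                start = i - 1
        else:
            if start is not None and i - start >= min_length:
                segments.append((start, i - 1))
            start = None
    if start is not None and len(frames) - start >= min_length:
        segments.append((start, len(frames) - 1))
    return segments
```

The per-pixel threshold absorbs small differences such as noise, which is why consecutive frames that are not exactly identical can still count as still.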

Page 37: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Still Image Detection

[Figure: still image detection result over frame index 0 to 10000 (vertical axis 0 to 1), with really still images and erroneous detections marked.]

Page 38: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Tempo Analysis and Cinema Rules

Bruce Block, The Visual Story: Seeing the Structure of Film, TV, and New Media
– Relationship between story structure and visual structure
• Their intensity maps are correlated

– Principle of contrast and affinity
• The greater the contrast in a visual component, the more the visual intensity or dynamic increases

[Figure: story intensity versus time, with exposition, conflict, climax, and resolution marked.]

Page 39: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Cinema Rules

Every feature film has a well designed story structure, which contains the beginning (exposition), the middle (conflict), and the end (resolution)

[Figure: story intensity (0 to 100) versus time length of the story in minutes (0 to 120), with the segments EX, CO, CX, and R marked.]

EX: exposition gives the facts needed to begin the story
CO: conflict contains rising actions or conflict
CX: climax
R: resolution ends the story

Page 40: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Cinema Rules

Scene
– A simple theme in a scene
– Each scene is composed of a setup part, a progressing part, and a resolution part
– Final film is just a way to present this theme
• Dialog
• Close-up view

A story unit
– An example of a scene
• The main actors drove the main actress from the train station back home

– A simple action
• Met at the train station -> On the road -> Another main actor joined them -> Arrive home

Page 41: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Audio Tempo

Music tempo

Definition in music
– Note

– Meter: a longer period that contains many beats. For example, we can count ONE-two-three, ONE-two-three

– Tempo (pace/beat period)
• It is often indicated at the beginning. For example, the rate should be 100 quarter notes per minute (we clap 100 times per minute)

Page 42: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Audio Tempo

Speech tempo
– Emotion detection

– Segmental durations
• Syllable or phoneme

Audio tempo
– Short time pace
• Short-term memory

– The number of sound events per unit of time
• The more events, the faster it seems to go

– Onset
• A new note or a new syllable

Page 43: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Audio Tempo

Diagram of audio tempo analysis

[Figure: block diagram of audio tempo analysis. Blocks shown: input audio, frequency filterbank (L/H band splits with downsampling by 2), envelope extractors, downsampling, differentiators, shot boundary, and the resulting tempo.]

Page 44: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Audio Tempo

Frequency filterbank
– Perceptual frequency
– Critical bands
• Wavelet-packet
• Multirate system

Envelope extractor
– Rectify
– Filtering: 50 ms half-Hamming window

Differentiator
– First-order difference
– Half-wave rectified

[Figure: input signal and detected onsets]
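The envelope extractor and differentiator stages can be sketched for a single band as follows. This is a simplified illustration that skips the filterbank and downsampling stages; the function names, the peak-picking rule, and the threshold are assumptions, not details given in the slides.

```python
import math


def half_hamming(length):
    """Decaying half of a Hamming window, normalized to unit sum."""
    if length < 2:
        return [1.0]
    full = 2 * length - 1
    window = [0.54 - 0.46 * math.cos(2 * math.pi * (length - 1 + k) / (full - 1))
              for k in range(length)]
    total = sum(window)
    return [w / total for w in window]


def envelope(signal, window):
    """Full-wave rectify, then smooth by causal convolution with the window."""
    rectified = [abs(s) for s in signal]
    env = []
    for n in range(len(rectified)):
        acc = 0.0
        for k, w in enumerate(window):
            if n - k < 0:
                break
            acc += w * rectified[n - k]
        env.append(acc)
    return env


def onset_strength(env):
    """First-order difference followed by half-wave rectification."""
    return [0.0] + [max(0.0, b - a) for a, b in zip(env, env[1:])]


def detect_onsets(signal, sample_rate, window_ms=50.0, threshold=0.1):
    """Return sample indices where the onset strength peaks above the threshold."""
    length = max(2, int(sample_rate * window_ms / 1000.0))
    strength = onset_strength(envelope(signal, half_hamming(length)))
    onsets = []
    for n in range(1, len(strength) - 1):
        if (strength[n] > threshold and strength[n] >= strength[n - 1]
                and strength[n] > strength[n + 1]):
            onsets.append(n)
    return onsets
```

Counting the detected onsets per unit of time then gives the audio tempo in the sense defined earlier: the more sound events, the faster the pace.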

Page 45: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Audio Tempo

Boundary of story units
– Local minima of audio tempo

Post signal processing
– Helps to get local minima
– Three steps
• Lowpass filtering
• Morphological operation
– Minmax
– Close operation
• Detect local minima
– Detected valleys

[Figure: post processing for audio tempo analysis]
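The three post-processing steps can be sketched on a one-dimensional tempo curve as below. The moving average stands in for the lowpass filter and the window widths are hypothetical; the closing operation is the standard grayscale one (sliding maximum followed by sliding minimum), which fills valleys narrower than the window so that only wide valleys survive as story unit boundaries.

```python
def moving_average(values, width=3):
    """Simple lowpass filter: mean over a centered window, shortened at the edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def _sliding(values, width, op):
    half = width // 2
    return [op(values[max(0, i - half):i + half + 1]) for i in range(len(values))]


def closing(values, width=3):
    """Grayscale closing: dilation (sliding max) followed by erosion (sliding min)."""
    return _sliding(_sliding(values, width, max), width, min)


def local_minima(values):
    """Indices of local minima; a flat valley is reported at its middle index."""
    minima = []
    n = len(values)
    i = 1
    while i < n - 1:
        if values[i] < values[i - 1]:
            j = i
            while j < n - 1 and values[j + 1] == values[i]:
                j += 1
            if j < n - 1 and values[j + 1] > values[i]:
                minima.append((i + j) // 2)
            i = j + 1
        else:
            i += 1
    return minima


def story_boundaries(tempo, smooth_width=3, close_width=3):
    """Lowpass filter, close, then detect valleys of the tempo curve."""
    return local_minima(closing(moving_average(tempo, smooth_width), close_width))
```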

Page 46: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Motion Analysis

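The variance of motion vector

$$\mu(n) = \frac{1}{N}\sum_{i=1}^{N} W(i)\,M(n-i)$$

$$\sigma^2(n) = \frac{1}{N}\sum_{i=1}^{N} W(i)\,[M(n-i) - \mu(n)]^2$$

– Where $W(n)$ is a window, $M(n)$ is the average length of motion vectors for each shot, $n$ is the shot index, $N$ is the window length, and $\mu(n)$ is the windowed mean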
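A windowed variance of this kind can be computed per shot as sketched below. The sketch assumes the mean is mu(n) = (1/N) * sum over i = 1..N of W(i) * M(n - i) and the variance is (1/N) * sum over i = 1..N of W(i) * (M(n - i) - mu(n))^2, where N is the window length; this exact form, the function names, and the rectangular window in the usage are assumptions for illustration.

```python
def windowed_mean(M, W, n):
    """mu(n) = (1/N) * sum_{i=1..N} W(i) * M(n - i), with W(i) stored at W[i - 1]."""
    N = len(W)
    return sum(W[i - 1] * M[n - i] for i in range(1, N + 1)) / N


def windowed_variance(M, W, n):
    """sigma^2(n) = (1/N) * sum_{i=1..N} W(i) * (M(n - i) - mu(n))^2."""
    N = len(W)
    mu = windowed_mean(M, W, n)
    return sum(W[i - 1] * (M[n - i] - mu) ** 2 for i in range(1, N + 1)) / N


def motion_variance_curve(M, W):
    """Variance for every shot index n that has N preceding shots available."""
    N = len(W)
    return [windowed_variance(M, W, n) for n in range(N, len(M))]
```

With M = [2, 2, 2, 2, 8, 2] and a rectangular window W = [1, 1, 1], the curve stays at zero until the window covers the shot with motion 8, where the variance jumps.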

Page 47: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Motion Analysis

Boundary of story units
– Transition edges

Post processing
– Morphological operation
• Median
• Maxmin
• Minmax

– Gradient

– Detect edges

[Figure: post processing for visual tempo]

Page 48: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Skimming Video

Test data– Legends of The Fall

• Beginning 26 minutes

• MPEG format

– 352*240 pixels

– 44.1 kHz

Page 49: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Home Video Processing

Home video characteristics– Fragmental

– Sound may not be very important

– Bad shots• Stabilization

• Focus

• Lighting

Shooting tips

1 Shoot lots of short scenes (5 ~ 10 seconds)

2 Use zoom in/out to take exposition shots or emphasize something

3 Zoom or pan slowly

4 Get a lot of face shots

5 Keep a steady hand

6 Make sure your subject is well lit

Page 50: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Bad Shots

Shaky
– Drive
– Walk

Vibration of the camera motion across successive frames

Page 51: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Bad Shots

Poor lighting
– Too dark/bright
– Too much variance
• Diaphragm

Lighting problem
– Average of luminance
• Highest 1/3 pixels and lowest 1/3 pixels

• Negative feedback
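The luminance test can be sketched by averaging the brightest third and the darkest third of the pixels in a frame. The decision rule below (a frame is too dark when even its brightest third is dark, and too bright when even its darkest third is bright) and both limits are assumptions made for illustration; the slides do not give the actual rule or thresholds.

```python
def third_means(luma):
    """Mean luminance of the darkest third and of the brightest third of the pixels."""
    ordered = sorted(luma)
    k = max(1, len(ordered) // 3)
    low = sum(ordered[:k]) / k
    high = sum(ordered[-k:]) / k
    return low, high


def lighting_problem(luma, dark_limit=50, bright_limit=205):
    """Return "too dark", "too bright", or None for a frame of 0..255 luminance values."""
    low, high = third_means(luma)
    if high < dark_limit:
        return "too dark"
    if low > bright_limit:
        return "too bright"
    return None
```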

Page 52: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Bad Shots

Blur– Motion blur

– Out-of-focus blur

– Foggy blur

Page 53: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Music and Video Matching

Shot detection

Remove bad shots

Match music tempo
– Shot length
– Motion activity

[Figure: processing flow. Shot detection, then remove bad shots, then choose shots according to music tempo.]

Page 54: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Authoring Scheme

Match music tempo– High tempo

• Small segment length

– Transition time

• High motion activity

[Figure: authoring scheme. Music tempo of the input music and visual tempo are plotted against time, with the selected video clips (Clip 1 to Clip 7) placed along the timeline.]

Page 55: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Experimental Results

Test data
– Input music: 5.5-minute piece, Canon
– Input video clips:

• Activities of babies of 0 ~ 3 years old

• Man-made bad shots

• Average clip length is about 20 seconds

• Total length is 50 minutes

Page 56: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Well-Known Research in Video Content Analysis Field

Well-known universities
– Digital Video Multimedia laboratory (DVMM), Columbia University
– MIT Media laboratory
– Informedia Digital Video Understanding, Carnegie Mellon University
– Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
– Signal and Image Processing Institute, University of Southern California
– Department of Electrical Engineering, Princeton University
– Language and media processing laboratory, University of Maryland

Page 57: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Well-Known Research in Video Content Analysis Field

Well-known R&D laboratory– IBM T. J. Watson research center

– IBM Almaden research center

– Intel corporation

– Sharp Laboratory of America (SLA)

– Microsoft research laboratory

– Microsoft research China

– Hewlett-Packard research laboratory

– AT&T Bell laboratory

– InterVideo

– Pinnacle

Page 58: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Conclusion

Introduction of several basic concepts

Basic processing and low-level feature extraction

Semantic video modeling and indexing

Multimodal framework for topic classification of video

Knowledge discovery via video mining

Our research results

Discussion of challenging problems

Page 59: Content-Based Video Analysis based on Audiovisual Features for Knowledge Discovery

Questions

Thank You