VIDEO ANNOTATION
Market Trends
• Broadband doubling over the next 3-5 years
• Video-enabled devices are emerging rapidly
• Emergence of a mass internet audience
• Mainstream media moving to the Web
What do we search for in video?
- a movie scene where Titanic hits the iceberg
- the CNN business news report from 1999
- everything related to Claudia Schiffer
- a “western” movie
- all home videos taken in Paris
- all scores of a soccer game
- movies that I like
- …
Video search requires efficient annotation of video content. To some extent this can be done automatically.
Video conceptual hierarchy
Video (Type, Cast, Producer…)
→ Episode (Global Content, Intent…)
→ Video Shot (Camera motion, angle)
→ Frame (Objects and their spatial relation)
– Frames
– Shots
– Shot boundaries
– Scenes
– Audio
Low Level: Visual Features
• Color
• Texture
• Shape
• Motion
• Shot Boundaries

Mid Level: Semantic Content
• People/Objects
• Location
• Actions
• Time

High Level: Semantic Content
• Concept
• Event
• Episode/Story
Video objects
Video retrieval and annotation
Syntactical level
• Retrieval: text-based; visual and audio content-based (limited success)
• Annotation: automatic shot segmentation; from shot sequences: edit effects; from shot key-frames: color, texture, shape, face…; from shot sequences: motion info

Semantical level
• Retrieval: text-based
• Annotation:
– text tags from manual annotation (subjectivity of human perception)
– from video metadata: author, date, …; from shot, scene, and sequence tags
– text tags from automatic annotation (limited success) from visual and audio information (scene and event segmentation), from textual labels in consecutive frames, from audio/speech transcription
[Figure: a news report on a timeline t. Low-level analysis partitions it into Shot 1, Shot 2, … Shot 7, …; mid/high-level analysis groups shots into larger units such as a movie episode, a sport-event highlight, or a topic segment of a documentary.]
• Automatic annotation requires automatic partitioning of video into syntactic segments (low-level analysis) and detection of entities and their spatio-temporal relationships (mid/high-level analysis)
Automatic annotation
Video processing chain

[Figure: syntactic level — the video sequence undergoes feature extraction (color & texture analysis, spatio-temporal analysis, motion analysis), spatial segmentation, temporal segmentation, motion detection, and shot boundary detection. Semantic level — object tracking, classification, and motion verification produce shot summaries; event modeling and event detection yield events.]
• Low Level: extract color, texture, and motion features; detect shot boundaries (domain- and event-independent)
• Mid Level: determine the object class (e.g. sky, grass, tree, rock, animal); generate shot descriptors (domain-dependent)
• High Level: use shot descriptors and a domain-specific inference process to detect events and then episodes (event-specific)
Video processing framework
Syntactic features
(timeline: 1970–2010)

Texture: Autocorrelation, Wavelet Transforms, Gabor Filters
Shape: Edge Detectors, Moment Invariants, Finite Element Methods, Shape from Motion
Color: Color Moments, Color Histograms, Color Autocorrelograms
Segmentation: Scene Segmentation, Shot Detection
OCR: Modeling, Successful OCR deployments
Face: Face Detection algorithms, Neural Networks, EigenFaces
ASR: Acoustic analysis, HMMs, N-grams, CSR, LVCSR
NIST Video TREC starts; Media IR systems
Web Media Search
Color
• Robust to background; independent of size and orientation
– Color Histogram [Swain & Ballard]
– Color Moments
– Color Sets: map RGB color space to Hue-Saturation-Value and quantize [Smith, Chang]
– Color Layout: local color features obtained by dividing the image into regions
– Color Autocorrelograms
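As an illustration of the histogram features above, here is a minimal sketch (Python/NumPy; the choice of 8 bins per channel is arbitrary) of a quantized RGB histogram compared with Swain & Ballard's histogram intersection:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize each RGB channel into `bins` levels and count pixels
    per 3-D color bin; normalize so the histogram sums to 1."""
    q = (image.astype(np.uint32) * bins) // 256           # per-channel bin index
    codes = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Swain & Ballard intersection: 1.0 for identical normalized histograms."""
    return np.minimum(h1, h2).sum()

# Two synthetic single-color images: identical colors match perfectly,
# disjoint colors score (near) zero.
red = np.zeros((16, 16, 3), dtype=np.uint8);  red[..., 0] = 200
blue = np.zeros((16, 16, 3), dtype=np.uint8); blue[..., 2] = 200
same = histogram_intersection(color_histogram(red), color_histogram(red))
diff = histogram_intersection(color_histogram(red), color_histogram(blue))
```

Because the histogram ignores pixel positions, the measure is robust to orientation and, after normalization, to size, as the slide notes.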
Texture
• One of the earliest image features [Haralick et al., 70s]
– Co-occurrence matrix: orientation and distance on gray-scale pixels; contrast, inverse difference moment, and entropy [Gotlieb & Kreyszig]
– Human visual texture properties: coarseness, contrast, directionality, line-likeness, regularity and roughness [Tamura et al.]
– Wavelet Transforms [90s]: [Smith & Chang] extracted mean and variance from wavelet subbands
– Gabor Filters
– …
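The co-occurrence-matrix features can be sketched as follows (plain NumPy; a single horizontal offset and 8 gray levels are simplifying assumptions — Haralick's definition covers multiple orientations and distances):

```python
import numpy as np

def cooccurrence(gray, levels=8):
    """Normalized gray-level co-occurrence matrix for the (0, 1) offset,
    i.e. each pixel paired with its right-hand neighbor."""
    q = (gray.astype(np.uint32) * levels) // 256   # quantize to `levels` gray levels
    glcm = np.zeros((levels, levels))
    np.add.at(glcm, (q[:, :-1].ravel(), q[:, 1:].ravel()), 1)
    return glcm / glcm.sum()

def contrast(glcm):
    """Haralick contrast: large when neighboring pixels differ strongly."""
    i, j = np.indices(glcm.shape)
    return ((i - j) ** 2 * glcm).sum()

def entropy(glcm):
    """Haralick entropy: large for irregular textures."""
    p = glcm[glcm > 0]
    return -(p * np.log2(p)).sum()

flat = np.full((8, 8), 128, dtype=np.uint8)                  # uniform patch
checker = np.where(np.indices((8, 8)).sum(axis=0) % 2 == 0,  # checkerboard
                   0, 255).astype(np.uint8)
```

A uniform patch gives zero contrast and entropy; a checkerboard, where every neighbor pair jumps between the extreme gray levels, gives the maximum contrast for this quantization.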
Region Segmentation
• Partition the image into regions
– Strong segmentation: object segmentation, which is difficult
– Weak segmentation: region segmentation based on some homogeneity criterion

Scene Segmentation
• Shot detection, scene detection
– Look for changes in color, texture, brightness
– Context-based scene segmentation applied to certain categories such as broadcast news
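A minimal sketch of color-change-based shot (cut) detection, using the L1 distance between gray-level histograms of consecutive frames (the bin count and threshold are illustrative assumptions; gradual transitions such as fades need more elaborate methods):

```python
import numpy as np

def detect_cuts(frames, bins=16, threshold=0.5):
    """Report a cut between consecutive frames whose normalized
    gray-level histograms differ by more than `threshold`
    (L1 distance, which ranges over [0, 2])."""
    hists = []
    for f in frames:
        q = (f.astype(np.uint32) * bins) // 256
        h = np.bincount(q.ravel(), minlength=bins).astype(np.float64)
        hists.append(h / h.sum())
    return [i for i in range(1, len(hists))
            if np.abs(hists[i] - hists[i - 1]).sum() > threshold]

# Synthetic clip: three dark frames followed by three bright frames,
# so the only cut is between frame 2 and frame 3.
clip = [np.full((8, 8), 10, np.uint8)] * 3 + [np.full((8, 8), 200, np.uint8)] * 3
```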
Face
• Face detection is highly reliable; face recognition for video is still a challenging problem.
– Viola and Jones face detection
– Neural Networks [Rowley]
– Wavelet-based histograms of facial features
– EigenFaces: extract eigenvectors and use them as the feature space
OCR
• OCR is a fairly successful technology: accurate, especially with good matching vocabularies.

ASR
• Automatic speech recognition is fairly accurate for medium-to-large-vocabulary broadcast-type data, and a large number of speech vendors are available. Still an open problem for free conversational speech in noisy conditions.
Shape
• Outer-boundary-based vs. region-based
– Fourier descriptors
– Moment invariants
– Finite Element Method (stiffness matrix encodes how each point is connected to the others; eigenvectors of the matrix)
– Wavelet transforms, which leverage multiresolution [Chuang & Kao]
– Chamfer matching for comparing two shapes (linear dimension rather than area)
– Well-known edge detection algorithms
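The Fourier-descriptor idea can be sketched like this (the normalization steps shown are one common way to obtain translation, scale, rotation, and starting-point invariance):

```python
import numpy as np

def fourier_descriptors(contour, k=8):
    """Shape signature from the first k Fourier coefficients of the
    boundary, treated as the complex sequence x + iy."""
    z = contour[:, 0] + 1j * contour[:, 1]
    coeffs = np.fft.fft(z)
    coeffs[0] = 0.0                 # drop DC term  -> translation invariance
    mags = np.abs(coeffs)           # drop phase    -> rotation / start-point invariance
    return mags[1:k + 1] / mags[1]  # divide by 1st harmonic -> scale invariance

# A circle and a scaled, translated copy yield the same descriptors.
t = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
moved = 3.0 * circle + 5.0
```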
• Context is useful to recognize a scene
• Scenes are composed of objects and events
• Use proto-concepts to describe context
• Use machine learning to link context to concepts that define objects and events
Semantic features

[Figure: an image is segmented into regions (sky, water, grass, road, …); the resulting "bag of regions" is compared against proto-concepts, yielding a proto-concept similarity distribution for the image.]
Semantic segments

• Semantic segment
– concatenation of consecutive video shots that are related to each other regarding their semantic content
– parts of a video where content coherence (i.e. continuity of the semantic content from one shot to another) remains high
– detection of semantic segment boundaries can be made by measuring the coherence of the semantic content along neighboring video shots: segment boundaries are places of low content coherence
• Semantic segments are suitable for video genres characterized by a clear sequential content structure:
– broadcast news
– movies (episode-based scheme)
– documentaries
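A sketch of coherence-based boundary detection over per-shot feature vectors (the window size, threshold, and use of cosine similarity are illustrative choices, not a specific published method):

```python
import numpy as np

def segment_boundaries(shot_features, window=2, threshold=0.5):
    """Place a semantic-segment boundary before shot i when the mean
    cosine similarity (content coherence) between the `window` shots on
    each side of the gap drops below `threshold`."""
    f = np.asarray(shot_features, dtype=np.float64)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)   # unit-normalize each shot
    boundaries = []
    for i in range(window, len(f) - window + 1):
        coherence = (f[i - window:i] @ f[i:i + window].T).mean()
        if coherence < threshold:
            boundaries.append(i)
    return boundaries

# Four "studio" shots followed by four "field" shots: the only place of
# low coherence is the gap before shot 4.
shots = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
```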
• Four ways of performing high-level analysis
– Video categorization into genres
– Video partitioning into semantic segments
– Semantic segment extraction
– Video summarization/abstraction
• Feature-based algorithmic solutions: different features and algorithms for different analysis processes
• Some model assumptions are required.
High-level video analysis is essential for automatic semantic annotation of video.
Bridging the Semantic Gap

[Figure: three-step pipeline. Step 1: syntactic analysis of the raw video produces basic statistics. Step 2: derivation of style attributes. Step 3: genre mapping of the style attributes for genre recognition.]
Step 1: Syntactic analysis

• Syntactic analysis is based on simple statistics for the sequence of RGB frames
– Color statistics
• basis for cut detection
• color histogram
• standard deviation
– Audio statistics
• record basic audio frequency and amplitude statistics
– Motion detection
• motion energy: total amount of motion in a scene (block-wise difference of color histograms)
• motion vector: distinguish camera motion from object motion
– Object segmentation
• moving object: same speed, same direction; use the motion vector field; subtract the camera motion → pure object motion
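The motion-energy statistic (block-wise difference of histograms) can be sketched for gray frames as follows; the block size and bin count are illustrative:

```python
import numpy as np

def motion_energy(prev, curr, block=8, bins=8):
    """Total motion between two gray frames: summed L1 distance of
    per-block gray-level histograms. Zero means no change; a hard cut
    or heavy motion gives a large value."""
    energy = 0.0
    for y in range(0, prev.shape[0] - block + 1, block):
        for x in range(0, prev.shape[1] - block + 1, block):
            hp = np.bincount(((prev[y:y + block, x:x + block].astype(np.uint32)
                               * bins) // 256).ravel(), minlength=bins)
            hc = np.bincount(((curr[y:y + block, x:x + block].astype(np.uint32)
                               * bins) // 256).ravel(), minlength=bins)
            energy += float(np.abs(hp - hc).sum())
    return energy

dark = np.zeros((16, 16), dtype=np.uint8)
bright = np.full((16, 16), 255, dtype=np.uint8)
```

Because the histograms are computed per block, the measure responds to localized object motion as well as to global changes.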
Step 2: Derivation of style attributes

• Assign semantics to the scene
• Scene length and transition
– important style attribute
– scene separator: only hard cuts (>95%)
– other transitions: artistic style element
• Camera motion
– panning, tilting, zooming
– identify the motion vector direction with the highest frequency (normal distribution)
– classification of panning: error rate < 10%
• Object recognition
– simple objects or patterns in a well-defined environment
– face recognition
– TV channel: predefined patterns stored in a database; logo recognition algorithm run on every 4th frame
[Example: “KBS” logo in bottom right corner]
Step 3: Genre mapping

• Semantics of audio
– Amplitude
• speech: characteristic frequency spectrum
• music: beat
• noise: amplitude > threshold
• silence: amplitude < threshold
– Frequency
• distinguish between speaker and noise (>95%)
• human speech has a limited frequency spectrum
• Newscast
– characteristic pattern of speaker and non-speaker phases
– logo
• Car race
– scene length much shorter than in tennis
– camera motion, object motion
– noise at high amplitude
– easy to distinguish from tennis and soccer
• Tennis
– a very good example of the discriminating power of audio
– bouncing of the ball: singular peak
– speaker phase
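The amplitude rules (silence below a threshold, noise above one) can be sketched as a window labeller; the thresholds and window length below are invented for illustration, and a real system would refine the middle class with frequency analysis:

```python
import numpy as np

def label_windows(samples, rate=16000, win=0.05, silence=0.01, noise=0.5):
    """Label each `win`-second window of a mono signal (values in
    [-1, 1]) by RMS amplitude: 'silence' below `silence`, 'noise'
    above `noise`, otherwise 'speech/music'."""
    n = int(rate * win)
    labels = []
    for i in range(0, len(samples) - n + 1, n):
        rms = np.sqrt(np.mean(np.square(samples[i:i + n].astype(np.float64))))
        if rms < silence:
            labels.append('silence')
        elif rms > noise:
            labels.append('noise')
        else:
            labels.append('speech/music')
    return labels

t = np.arange(800) / 16000.0               # one 0.05-second window
quiet = np.zeros(800)
loud = 0.9 * np.sin(2 * np.pi * 440 * t)   # RMS ≈ 0.64
mid = 0.2 * np.sin(2 * np.pi * 440 * t)    # RMS ≈ 0.14
```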
[Figure: a Video Feature Extractor derives Video-Style Attributes from the input video; a Video Classifier matches them against the Video-Style Attributes of pre-specified genres (News, Action movie, Tennis match, Drama, …) to output the genre.]
• Commercials
– separated by up to 5 monochrome frames
– fade-in scene transition
• Cartoon
– scene length greater than in other genres
– less camera motion
– zero amplitude: no background noise
• Result
– no single attribute is sufficient to uniquely identify a genre
– a style profile is needed
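The style-profile idea can be sketched as nearest-profile matching; the attribute choices and profile numbers below are invented for illustration (real profiles would be learned from labeled video):

```python
import numpy as np

# Hypothetical style profiles: [mean scene length (s), camera-motion
# rate, mean audio amplitude].
PROFILES = {
    'newscast': np.array([8.0, 0.1, 0.4]),
    'car race': np.array([2.0, 0.9, 0.9]),
    'cartoon':  np.array([12.0, 0.2, 0.0]),
}

def classify(style_attributes):
    """Return the genre whose stored profile is nearest (Euclidean
    distance) to the measured style-attribute vector."""
    v = np.asarray(style_attributes, dtype=np.float64)
    return min(PROFILES, key=lambda g: np.linalg.norm(PROFILES[g] - v))
```

Matching the whole vector, rather than thresholding any single attribute, reflects the slide's conclusion that no single attribute suffices.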
Video Search: Video TREC

• 1st TREC (2001)
– Mean Average Precision @100 items (MAP) 0.033 (in the general category)
– Transcript only was better than transcript + image aspects + ASR
– There were also different types of test: known items versus general. Known-item queries had at least one result.
• 2nd TREC (2002)
– Interactive runs to rephrase queries
– Again, text based only on ASR was the best-performing system: MAP 0.137
– The leading system employed multiple sub-systems: TF-IDF variants (MAP 0.093); the other was boolean with query expansion (MAP 0.101)
– OCR was not particularly applicable; phonetic ASR did not help
• 3rd TREC (2003)
– Data changed radically (broadcast news added: CNN, ABC, C-SPAN)
– Baseline ASR + CC: MAP 0.155
– ASR + CC + VOCR + image similarity + Person-X retrieval: MAP 0.218

* “Successful Approaches in the TREC Video Retrieval Evaluations”, Alexander Hauptmann, Michael Christel, ACM Multimedia 2004.
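Mean Average Precision, the metric quoted in these results, can be computed as follows (a standard formulation, not code from the cited paper):

```python
def average_precision(ranked, relevant):
    """AP for one query: sum of precision@k over the ranks k at which a
    relevant item appears, divided by the number of relevant items."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean of AP over (ranked_list, relevant_set) query pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, a ranking ['a', 'b', 'c'] against relevant set {'a', 'c'} scores (1/1 + 2/3) / 2 = 5/6.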
39 concepts: Aircraft, Animal, Boat, Building, Bus, Car, Chart, Corporate leader, Court, Crowd, Desert, Entertainment, Explosion, Face, Map, Meeting, Military, Mountain, Natural disaster, Office, Outdoor, Flag (USA), Government leader, People, People marching, Police/security, Walking, Water body, Weather news, Sky, Truck, Urban, Vegetation, Vehicle, Violence, Sports, Studio, Prisoner, Screen
Digitization and compression: Hardware and software necessary to convert the video information into digital compressed format
Cataloguing: process of extracting meaningful story units from the raw video data and building the corresponding indexes
Digital video archive: repository of digitized, compressed video data
Visual summaries: representation of video contents in a concise, typically hierarchical, way
Indexes: pointers to video segments or story units
User interface: friendly, visually rich interface that allows the user to interactively query the database, browse the results, and view the selected video clips
Query / search engine: responsible for searching the database according to the parameters provided by the user