
Bridge Semantic Gap: A Large Scale

Concept Ontology for Multimedia

(LSCOM)

Guo-Jun Qi

Beckman Institute

University of Illinois at Urbana-Champaign

LSCOM (Large Scale Concept

Ontology for Multimedia)

A broadcast news video dataset

200+ news videos/ 170 hours

61,901 shots

Language

◦ English/Arabic/Chinese

Why broadcast News ontology?

Critical mass of users, content providers,

applications

Good content availability (TRECVID LDC

FBIS)

Share Large set of core concepts with

other domains

LSCOM Provides

Richly annotated video content for accomplishing required access and analysis functions over massive amount of video content

Large scale useful well-defined semantic lexicon

◦ More than 3000 concepts

◦ 374 annotated concepts

◦ Bridging semantic gap from low-level features to high-level concepts

A LSCOM concept

000 - Parade

Concept ID: 000

Name: Parade

Definition: Multiple units of marchers,

devices, bands, banners or Music.

Labeled: Yes
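A concept entry like the one above can be held in a small record type. The following is only a sketch: the class name `Concept` and its field names are hypothetical choices for illustration, not part of any LSCOM release format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """One lexicon entry (field names are illustrative assumptions)."""
    concept_id: str   # kept as a string so leading zeros such as "000" survive
    name: str
    definition: str
    labeled: bool     # True if the concept has been annotated


def heading(concept):
    """Format the entry the way the slide title does: '<id> - <name>'."""
    return f"{concept.concept_id} - {concept.name}"


parade = Concept(
    concept_id="000",
    name="Parade",
    definition="Multiple units of marchers, devices, bands, banners or Music.",
    labeled=True,
)
print(heading(parade))
```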

LSCOM Hierarchy

http://www.lscom.org/ontology/index.html

Thing

.Individual

..Dangerous_Thing

...Dangerous_Situation

....Emergency_Incident

.....Disaster_Event

......Natural_Disaster

....Natural_Hazard

.....Avalanche

.....Earthquake

.....Mudslide

.....Natural_Disaster

.....Tornado

...Dangerous_Tangible_Thing

....Cutting_Device
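The dot-indented listing above encodes parent/child links by depth. As a rough sketch, it can be parsed into a parent map and queried for ancestors. The function names are hypothetical, and the listing in the code is only an excerpt of the slide. Because Natural_Disaster appears under two different parents, the result is a graph rather than a strict tree.

```python
from collections import defaultdict

# Excerpt of the listing above; one leading dot per level of depth.
LISTING = """\
Thing
.Individual
..Dangerous_Thing
...Dangerous_Situation
....Emergency_Incident
.....Disaster_Event
......Natural_Disaster
....Natural_Hazard
.....Earthquake
.....Natural_Disaster
.....Tornado
...Dangerous_Tangible_Thing
....Cutting_Device
"""


def parse_dotted(listing):
    """Return {concept: set of direct parents} from a dot-indented listing."""
    parents = defaultdict(set)
    stack = []  # stack[d] holds the most recent concept seen at depth d
    for raw in listing.splitlines():
        line = raw.strip()
        if not line:
            continue
        name = line.lstrip(".")
        depth = len(line) - len(name)
        del stack[depth:]          # drop everything at this depth or deeper
        if stack:
            parents[name].add(stack[-1])
        stack.append(name)
    return parents


def ancestors(parents, name):
    """All concepts reachable by following parent links upward."""
    seen, todo = set(), [name]
    while todo:
        for parent in parents.get(todo.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return seen


parents = parse_dotted(LISTING)
print(sorted(parents["Natural_Disaster"]))
print(sorted(ancestors(parents, "Tornado")))
```

A shot annotated with Tornado is then implicitly an instance of every concept in `ancestors(parents, "Tornado")`, which is one way to read the hierarchy as a logical constraint.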


Definition: What is an ontology?

(Wikipedia) An ontology is a formal representation of knowledge as a set of concepts

within a domain and the relationships

between those concepts. It is used to

reason about the properties of that

domain, and may be used to describe the

domain.

Ontology

Represents the visual knowledge base in a structured way

◦ Graph structure

◦ Tree (hierarchy) structure

Images/videos can be effectively learned

and retrieved by the coherence between

concepts

◦ Logical coherence

◦ Statistical coherence
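One possible reading of "statistical coherence" is how often two concepts co-occur in annotated shots. The sketch below estimates a conditional frequency from per-shot label sets. The function name and the toy annotations are made up for illustration and are not taken from the LSCOM annotations.

```python
def conditional_frequency(annotations, given, target):
    """Estimate P(target present | given present) from per-shot label sets.

    Returns None when no shot contains `given`, since the ratio is undefined.
    """
    with_given = [labels for labels in annotations if given in labels]
    if not with_given:
        return None
    return sum(target in labels for labels in with_given) / len(with_given)


# Toy annotations (invented): one set of concept names per shot.
shots = [
    {"Mountain", "Sky"},
    {"Mountain", "Castle", "Sky"},
    {"Studio", "Anchor"},
    {"Mountain", "Sky"},
]

print(conditional_frequency(shots, "Mountain", "Sky"))
print(conditional_frequency(shots, "Mountain", "Castle"))
```

A high conditional frequency suggests that detecting one concept is evidence for the other, which is complementary to the hard implications given by the hierarchy.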

An Ontology Hierarchy: Military

Vehicle

An example from Wikipedia

Ontology Tree for LSCOM

A Light Scale Concept Ontology for Multimedia Understanding (LSCOM-Lite)

The aim is to partition the semantic space using only a few concepts (39 concepts).

Selection Criteria

◦ Semantic Coverage

The light concept set should cover as many of the semantic concepts in news videos as possible.

◦ Compactness

These concepts should not overlap semantically.

◦ Modelability

These concepts could be modeled with a smaller semantic gap.

Selected concept dimensions

Divide the semantic space into a multi-dimensional space, where the dimensions are nearly orthogonal to one another

◦ Program Category

◦ Setting/Scene/Site

◦ People

◦ Objects

◦ Activities

◦ Events

◦ Graphics

Histogram of LSCOM-Lite

Concepts

Some example keyframes

Applications

Application I: Conceptual Fusion (most

basic – early fusion)

Application II: Cross-Category

Classification (inter-class relation)

Application III: Event Dynamics in Concept Space

Application I: Conceptual Fusion

[Figure: block diagram with Video, Visual Features, Classifier, and outputs Concept 1, Concept 2, Concept 3, …, Concept n]

LSCOM 374 Models

374 LIBSVM models

◦ http://www.ee.columbia.edu/ln/dvmm/columbia374/

◦ Features used (MPEG-7 descriptors)

Color Moments

Edge Histogram

Wavelet Texture

◦ LIBSVM: a library for support vector machines, available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/
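The idea of one model per concept, all applied to the same visual feature vector, can be sketched as follows. This is not the LIBSVM API and not the Columbia374 models: plain linear decision functions stand in for trained SVMs, and the weights, biases, feature values, and function names are invented for illustration.

```python
def decision_value(weights, bias, features):
    """Linear decision function w . x + b, a stand-in for a trained SVM."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def concept_scores(models, features):
    """Apply every per-concept model to one shared feature vector."""
    return {
        name: decision_value(weights, bias, features)
        for name, (weights, bias) in models.items()
    }


def detected(scores, threshold=0.0):
    """Concepts whose decision value exceeds the threshold, sorted by name."""
    return sorted(name for name, score in scores.items() if score > threshold)


# Hypothetical models: concept name -> (weights, bias).
models = {
    "Parade": ([1.0, -0.5, 0.0], -0.2),
    "Studio": ([-1.0, 0.5, 0.5], 0.1),
}
features = [0.8, 0.2, 0.1]   # toy 3-dimensional feature vector for one keyframe

scores = concept_scores(models, features)
print(detected(scores))
```

With 374 models the `scores` dictionary becomes a 374-dimensional description of the keyframe in concept space, which later stages can use in place of the raw visual features.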


Application II: cross-category

classification with concept transfer

G.-J. Qi et al. Towards Cross-Category

Knowledge Propagation for Learning

Visual Concepts, in CVPR 2011

Instance-Level Concept Correlation

[Figure: +1/-1 labels for the concepts Mountain and Castle, and a transfer function over the four joint cases: Mountain and Castle, Mountain only, Castle only, none of them]

Model Concept Relations

Automatically construct ontology in

a data-driven manner

Application III: Event Dynamics in Concept Space

Event Detection with Concept

Dynamics

W. Jiang et al., Semantic event detection

based on visual concept prediction, ICME,

Germany, 2008.

Open Problems

Cross-Dataset Gap

◦ Generalize from the LSCOM dataset to other datasets (e.g., non-news video datasets)

Cross-Domain Gap

◦ Can the text scripts associated with news videos help information extraction for visual concepts?

Automatic ontology construction

◦ Task dependent vs. task independent

◦ Data driven vs. prior knowledge (e.g., WordNet)

◦ Incorporate prior human knowledge (logical relations, etc.)

TRECVID Competition

Task 1: High-Level Feature Extraction

◦ Input: subshot

◦ Output: detection results for 39 LSCOM-Lite

concepts in the subshot

High-Level Feature Extraction

Each concept assumed to be binary

(absent or present) in each subshot

Submission: Find subshots that contain a

certain concept, rank them by the

detection confidence score, and submit

the top 2000.

Evaluation: NIST evaluated 20 medium-frequency concepts out of the 39, using a 50% random sample of the submission pools
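The submission rule above (rank subshots by detection confidence, keep the top 2000) amounts to a sort and a cut. A minimal sketch, assuming confidences are available as a mapping from subshot ID to score. The function name and the IDs are hypothetical, and this does not reproduce the official TRECVID submission file format.

```python
def ranked_submission(confidences, limit=2000):
    """Return subshot IDs sorted by descending confidence, cut to `limit`.

    Ties are broken by subshot ID so the ranking is deterministic.
    """
    ranked = sorted(confidences.items(), key=lambda item: (-item[1], item[0]))
    return [shot_id for shot_id, _ in ranked[:limit]]


# Toy confidences for one concept (invented values).
confidences = {"shot_a": 0.2, "shot_b": 0.9, "shot_c": 0.5}
top_two = ranked_submission(confidences, limit=2)
print(top_two)
```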

20 Evaluated Concepts

Evaluation Metric: Average Precision

Relevant subshots should be ranked

higher than the irrelevant ones.

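R is the total number of relevant images, R_j is the number of relevant images among the top j images, I_j indicates whether the jth image is relevant (I_j = 1) or not (I_j = 0), and N is the number of ranked images.

\[
\text{Average Precision} = \frac{1}{R} \sum_{j=1}^{N} \frac{R_j}{j} \, I_j
\]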
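Average precision can be computed directly from a ranked list of relevance flags, reading the definitions above as AP = (1/R) * sum over j of (R_j / j) * I_j with I_j = 1 for a relevant item. The sketch below follows that reading. The function name and the optional `total_relevant` argument (for the case where some relevant items were not returned at all) are illustrative choices, not the official evaluation tool.

```python
def average_precision(relevance, total_relevant=None):
    """Average precision of a ranked list.

    relevance      -- list of 0/1 flags I_j in ranked order (1 = relevant)
    total_relevant -- R; defaults to the number of relevant items in the list
    """
    R = sum(relevance) if total_relevant is None else total_relevant
    if R == 0:
        return 0.0
    hits = 0          # running R_j: relevant items seen in the top j
    total = 0.0
    for j, I_j in enumerate(relevance, start=1):
        if I_j:
            hits += 1
            total += hits / j   # precision at rank j, counted only at hits
    return total / R


ranked_flags = [1, 0, 1, 0]
print(average_precision(ranked_flags))                    # (1/1 + 2/3) / 2
print(average_precision(ranked_flags, total_relevant=3))  # one relevant item missed
```

Moving a relevant item higher in the list raises the score, which matches the requirement that relevant subshots be ranked above irrelevant ones.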

Results

TRECVID Competition

Task II: Video Search

◦ Input: 24 text-based topics

◦ Output: relevant subshots in the database

Topics to search


Three Types of Search Systems

Results: Automatic Runs

Results: Manual Runs

Results: Interactive Runs

Machine Problem 7: Shot Boundary

Detection in Videos

Goals

Detect the abrupt content changes

between consecutive frames.

◦ Scene changes

◦ Scene cuts

Steps

Step 1: Measuring the change of content

between video frames

◦ Visual/Acoustic measurements

Step 2: Compare the content distance

between successive frames. If the

distance is larger than a certain threshold,

then a shot boundary may exist.
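The two steps above can be sketched as a distance between per-frame feature vectors followed by a threshold test. The L1 distance and the threshold value used here are assumptions for illustration; the assignment leaves both choices open.

```python
def frame_distance(a, b):
    """L1 distance between two per-frame feature vectors (an assumed choice)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def detect_boundaries(features, threshold):
    """Indices i where the change from frame i-1 to frame i exceeds threshold."""
    return [
        i
        for i in range(1, len(features))
        if frame_distance(features[i - 1], features[i]) > threshold
    ]


# Toy 2-dimensional features for four frames; the content flips at frame 2.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
boundaries = detect_boundaries(frames, threshold=0.5)
print(boundaries)
```

A boundary at index i means the cut lies between frame i-1 and frame i. In practice the threshold is tuned against the manually labeled boundaries.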

Measuring Content based on Visual

Information

256-dimensional Color Histogram

◦ In RGB space, normalize r, g, b to [0,1]

◦ Color space: 8×8 histogram over the normalized red (nr) and normalized green (ng) axes

Color Histograms

Divide each image into four parts and compute an 8×8 histogram for each part, giving 4 × 64 = 256-dimensional features in total.
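A possible implementation of the 256-dimensional feature is sketched below. Two points are assumptions rather than statements from the slides: the four parts are taken to be the quadrants of a 2×2 grid, and the normalized colors are taken to be nr = r/(r+g+b) and ng = g/(r+g+b). Dividing each channel by 255 is another plausible reading of "normalize to [0,1]". The image is a plain list of rows of (r, g, b) tuples, and all function names are hypothetical.

```python
BINS = 8  # bins per axis, so each part yields an 8 x 8 = 64-bin histogram


def bin_index(value):
    """Map a value in [0, 1] to a bin 0..7 (1.0 falls in the last bin)."""
    return min(int(value * BINS), BINS - 1)


def part_histogram(pixels):
    """Normalized 64-bin histogram over (nr, ng) for a list of RGB pixels."""
    hist = [0.0] * (BINS * BINS)
    for r, g, b in pixels:
        total = r + g + b
        # Assumed normalization; black pixels are sent to the (0, 0) bin.
        nr, ng = (r / total, g / total) if total else (0.0, 0.0)
        hist[bin_index(nr) * BINS + bin_index(ng)] += 1.0
    count = len(pixels)
    return [h / count for h in hist] if count else hist


def frame_feature(image):
    """Concatenate the histograms of the four quadrants: 4 * 64 = 256 values."""
    height, width = len(image), len(image[0])
    feature = []
    for rows in (range(0, height // 2), range(height // 2, height)):
        for cols in (range(0, width // 2), range(width // 2, width)):
            feature += part_histogram([image[y][x] for y in rows for x in cols])
    return feature


# Tiny 2 x 2 test image: red, green / blue, black (one pixel per quadrant).
image = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (0, 0, 0)]]
feature = frame_feature(image)
print(len(feature))
```

Normalizing each part's histogram to sum to 1 keeps the feature independent of image size, so frame-to-frame distances stay comparable.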

Acoustic Features

12 cepstral coefficients

Energy (sum of squares of the raw signal samples)

Zero crossing rates (ZCR)

ZCR = sum(|sign(S(2:N))-sign(S(1:N-1))|)

Hint: normalize the energy so that it does not dominate the distances computed between successive frames
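The energy and zero-crossing features can be sketched as below. The ZCR function mirrors the expression given above, so each sign flip contributes 2 to the sum. Dividing the energies by their maximum is only one possible way to follow the normalization hint, and all function names are hypothetical. Cepstral coefficients are not covered here.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def energy(samples):
    """Sum of squares of the raw samples in one audio frame."""
    return sum(s * s for s in samples)


def zero_crossing_rate(samples):
    """sum(|sign(S(2:N)) - sign(S(1:N-1))|): each sign flip adds 2."""
    return sum(abs(sign(b) - sign(a)) for a, b in zip(samples, samples[1:]))


def normalized_energies(frames):
    """Scale per-frame energies into [0, 1] by the largest one (an assumption)."""
    values = [energy(frame) for frame in frames]
    peak = max(values, default=0.0)
    return [v / peak if peak else 0.0 for v in values]


# Two toy audio frames of four samples each.
audio_frames = [[1, -1, 1, -1], [2, 2, 0, 0]]
print(zero_crossing_rate(audio_frames[0]))
print(normalized_energies(audio_frames))
```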

Datasets

Two videos, each a little over one minute long

Manually label the shot boundaries

What to submit

Source code

Report

◦ Compare the shot boundary detection results returned by your algorithm with the manually labeled boundaries

◦ Compare

◦ Explain your choice of threshold

◦ Explain the differences between the acoustic-based and visual-based detection results
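One way to compare detected boundaries with the manual labels in the report is precision and recall, allowing a small frame tolerance. This is a suggestion rather than a required metric. The function name, the greedy matching rule, and the example frame numbers are all illustrative.

```python
def boundary_precision_recall(detected, labeled, tolerance=0):
    """Greedy one-to-one matching of detected to labeled boundaries.

    A detection counts as correct if an unmatched labeled boundary lies
    within `tolerance` frames of it. Returns (precision, recall).
    """
    unmatched = sorted(labeled)
    hits = 0
    for d in sorted(detected):
        for truth in unmatched:
            if abs(d - truth) <= tolerance:
                unmatched.remove(truth)   # each label can be matched only once
                hits += 1
                break
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(labeled) if labeled else 0.0
    return precision, recall


# Invented frame indices for illustration.
precision, recall = boundary_precision_recall(
    detected=[30, 75, 120], labeled=[31, 120, 200], tolerance=2
)
print(precision, recall)
```

Running the same comparison for the visual and the acoustic detector, and for several thresholds, gives concrete numbers to support the explanations asked for above.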

Where and when to submit

Email to [email protected]

Due: May 2nd


Thanks! Q&A