Bridge Semantic Gap: A Large Scale
Concept Ontology for Multimedia
(LSCOM)
Guo-Jun Qi
Beckman Institute
University of Illinois at Urbana-Champaign
LSCOM (Large Scale Concept
Ontology for Multimedia)
A broadcast news video dataset
200+ news videos/ 170 hours
61,901 shots
Language
◦ English/Arabic/Chinese
Why broadcast News ontology?
Critical mass of users, content providers,
applications
Good content availability (TRECVID LDC
FBIS)
Shares a large set of core concepts with
other domains
LSCOM Provides
Richly annotated video content for accomplishing required access and analysis functions over massive amounts of video content
Large scale useful well-defined semantic lexicon
◦ More than 3000 concepts
◦ 374 annotated concepts
◦ Bridging semantic gap from low-level features to high-level concepts
An LSCOM concept
000 - Parade
Concept ID: 000
Name: Parade
Definition: Multiple units of marchers,
devices, bands, banners or Music.
Labeled: Yes
LSCOM Hierarchy
http://www.lscom.org/ontology/index.html
Thing
.Individual
..Dangerous_Thing
...Dangerous_Situation
....Emergency_Incident
.....Disaster_Event
......Natural_Disaster
....Natural_Hazard
.....Avalance
.....Earthquake
.....Mudslide
.....Natural_Disaster
.....Tornado
...Dangerous_Tangible_Thing
....Cutting_Device
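The dot-indented listing above encodes depth by the number of leading dots. As a minimal sketch (the function name `parse_dotted_hierarchy` and the excerpt in `LISTING` are illustrative choices, not part of LSCOM), it can be turned into parent-child links as follows:

```python
# Excerpt of the listing above; each leading dot is one level of depth.
LISTING = """Thing
.Individual
..Dangerous_Thing
...Dangerous_Situation
....Emergency_Incident
.....Disaster_Event
......Natural_Disaster
....Natural_Hazard
.....Earthquake
.....Natural_Disaster
...Dangerous_Tangible_Thing
....Cutting_Device""".splitlines()


def parse_dotted_hierarchy(lines):
    """Return (parent, child) pairs from a dot-indented listing."""
    path = []    # path[d] is the most recent concept seen at depth d
    links = []   # a list of pairs, because a concept name may repeat
    for line in lines:
        name = line.lstrip(".")
        depth = len(line) - len(name)
        del path[depth:]             # forget siblings and deeper concepts
        if path:
            links.append((path[-1], name))
        path.append(name)
    return links


links = parse_dotted_hierarchy(LISTING)
parents_of_nd = sorted(p for p, c in links if c == "Natural_Disaster")
```

In this excerpt `Natural_Disaster` ends up with two parents (`Disaster_Event` and `Natural_Hazard`), so the links form a graph rather than a strict tree.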
Definition: What is an ontology?
(Wikipedia) An ontology is a formal representation
of knowledge as a set of concepts
within a domain and the relationships
between those concepts. It is used to
reason about the properties of that
domain, and may be used to describe the
domain.
Ontology
Represents the visual knowledge base in a
structured way
◦ Graph structure
◦ Tree (hierarchy) structure
Images/videos can be effectively learned
and retrieved by exploiting the coherence between
concepts
◦ Logical coherence
◦ Statistical coherence
An Ontology Hierarchy: Military
Vehicle
An example from Wikipedia
Ontology Tree for LSCOM
A Light Scale Concept Ontology for
Multimedia Understanding
(LSCOM-Lite)
The aim is to partition the semantic space using
a few concepts (39 concepts).
Selection Criteria
◦ Semantic Coverage
The light concept set should cover as many semantic concepts in news videos as possible.
◦ Compactness
These concepts should not overlap semantically.
◦ Modelability
These concepts can be modeled with a smaller
semantic gap.
Selected concept dimensions
Divide the semantic space into a multi-dimensional space, where each dimension is nearly orthogonal to the others
◦ Program Category
◦ Setting/Scene/Site
◦ People
◦ Objects
◦ Activities
◦ Events
◦ Graphics
Histogram of LSCOM-Lite
Concepts
Some example keyframes
Applications
Application I: Conceptual Fusion (most
basic – early fusion)
Application II: Cross-Category
Classification (inter-class relation)
Application III: Event Dynamics in Concept
Space
Application I: Conceptual Fusion
[Diagram: Video → Visual Features → Classifier → Concept 1, Concept 2, Concept 3, …, Concept n]
LSCOM 374 Models
374 LIBSVM models
◦ http://www.ee.columbia.edu/ln/dvmm/columbia374/
◦ Features used (MPEG-7 descriptors)
Color Moments
Edge Histogram
Wavelet Texture
◦ LIBSVM: a library for support vector
machines, available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
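As a rough sketch of the idea behind conceptual fusion, each concept detector maps the visual features of a shot to a confidence score, and the scores together form a concept-space representation of the shot. The linear scorers, weights, and names below (`concept_scores`, `DETECTORS`) are hypothetical stand-ins for the trained SVM models; they do not use the LIBSVM API.

```python
import math


def concept_scores(features, detectors):
    """Map one visual feature vector to a confidence score per concept."""
    scores = {}
    for name, (weights, bias) in detectors.items():
        margin = sum(w * x for w, x in zip(weights, features)) + bias
        scores[name] = 1.0 / (1.0 + math.exp(-margin))  # squash to (0, 1)
    return scores


# Hypothetical toy detectors: (weights, bias) over a 3-dimensional feature.
DETECTORS = {
    "Parade": ([1.0, 0.0, -1.0], 0.0),
    "Earthquake": ([0.0, 2.0, 0.0], -1.0),
}

scores = concept_scores([0.5, 0.5, 0.5], DETECTORS)
# Fixed concept order, so every shot yields a comparable vector.
concept_vector = [scores[name] for name in sorted(scores)]
```

With 374 detectors instead of two, `concept_vector` would have 374 entries per shot.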
Application II: cross-category
classification with concept transfer
G.-J. Qi et al. Towards Cross-Category
Knowledge Propagation for Learning
Visual Concepts, in CVPR 2011
Instance-Level Concept Correlation
[Figure: instances labeled +1 / -1 for the concepts Mountain and Castle]
Transfer Function
[Figure: label configurations: Mountain and Castle; Mountain; Castle; none of them]
Model Concept Relations
Automatically construct ontology in
a data-driven manner
Application III: Event Dynamics
in Concept Space
Event Detection with Concept
Dynamics
W. Jiang et al, Semantic event detection
based on visual concept prediction, ICME,
Germany, 2008.
Open Problems
Cross-Dataset Gap
◦ Generalize from the LSCOM dataset to other datasets (e.g.,
non-news video datasets)
Cross-Domain Gap
◦ Can the text scripts associated with news videos
help information extraction for visual concepts?
Automatic ontology construction
◦ Task-dependent vs. task-independent
◦ Data-driven vs. prior knowledge (e.g., WordNet)
◦ Incorporate prior human knowledge (logical relations,
etc.)
TRECVID Competition
Task 1: High-Level Feature Extraction
◦ Input: subshot
◦ Output: detection results for 39 LSCOM-Lite
concepts in the subshot
High-Level Feature Extraction
Each concept is assumed to be binary
(absent or present) in each subshot
Submission: Find subshots that contain a
certain concept, rank them by the
detection confidence score, and submit
the top 2000.
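The submission rule above amounts to a sort and a cut-off. A minimal sketch, assuming the detector output is a mapping from subshot ID to confidence score (the function name and the ID format are illustrative):

```python
def build_submission(confidences, max_results=2000):
    """Return subshot IDs ranked by descending detection confidence."""
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    return [shot_id for shot_id, _ in ranked[:max_results]]


# Toy example with a cut-off of 2 instead of 2000.
toy = {"shot_a": 0.1, "shot_b": 0.9, "shot_c": 0.5}
top = build_submission(toy, max_results=2)
```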
Evaluation: NIST evaluated 20 medium-frequency
concepts out of the 39, using a
50% random sample of the submission pools
20 Evaluated Concepts
Evaluation Metric: Average Precision
Relevant subshots should be ranked
higher than the irrelevant ones.
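R is the number of relevant images in total,
Rj is the number of relevant images in the top
j images, and Ij indicates whether the jth image is
relevant (Ij = 1) or not (Ij = 0). N is the number of ranked images.

$$\text{Average Precision} = \frac{1}{R} \sum_{j=1}^{N} \frac{R_j}{j} \, I_j$$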
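As a minimal sketch of this metric, reading Ij as 1 when the jth ranked image is relevant and 0 otherwise (the usual convention), average precision sums the precision Rj / j at every rank j that holds a relevant image and divides the sum by R. The optional `total_relevant` argument is an illustrative addition for the case where some relevant images fall outside the submitted list.

```python
def average_precision(relevance, total_relevant=None):
    """relevance: 0/1 flags I_j in ranked order, best-ranked first."""
    R = sum(relevance) if total_relevant is None else total_relevant
    if R == 0:
        return 0.0
    hits = 0      # running count R_j of relevant items in the top j
    total = 0.0
    for j, I_j in enumerate(relevance, start=1):
        hits += I_j
        total += (hits / j) * I_j   # precision at j, counted only at hits
    return total / R


ap_mixed = average_precision([1, 0, 1, 0])    # (1/1 + 2/3) / 2
ap_perfect = average_precision([1, 1, 0, 0])  # relevant items ranked first
```

Moving a relevant image higher in the ranking never lowers the score, which matches the requirement that relevant subshots be ranked above irrelevant ones.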
Results
TRECVID Competition
Task 2: Video Search
◦ Input: 24 text-based topics
◦ Output: relevant subshots in the database
Topics to search
Three Types of Search Systems
Results: Automatic Runs
Results: Manual Runs
Results: Interactive Runs
Machine Problem 7: Shot Boundary
Detection in Videos
Goals
Detect the abrupt content changes
between consecutive frames.
◦ Scene changes
◦ Scene cuts
Steps
Step 1: Measuring the change of content
between video frames
◦ Visual/Acoustic measurements
Step 2: Compare the content distance
between successive frames. If the
distance is larger than a certain threshold,
then a shot boundary may exist.
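The two steps can be sketched as follows, assuming each frame has already been reduced to a feature vector. The Euclidean distance and the function names are illustrative choices; the threshold itself has to be tuned on the data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def detect_boundaries(frame_features, threshold, distance=euclidean):
    """Return indices i where a boundary may lie between frames i-1 and i."""
    boundaries = []
    for i in range(1, len(frame_features)):
        if distance(frame_features[i - 1], frame_features[i]) > threshold:
            boundaries.append(i)
    return boundaries


# Toy sequence: frames 0-1 look alike, frames 2-3 look alike.
toy_frames = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
cuts = detect_boundaries(toy_frames, threshold=1.0)
```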
Measuring Content based on Visual
Information
256-dimensional Color Histogram
◦ In RGB space, normalize r, g, b to [0,1]
◦ Color space: normalized coordinates (nr, ng)
◦ 8×8 histogram over (nr, ng)
Color Histograms
Divide each image into four parts; each
part has an 8×8 histogram, giving 256-dimensional
features in total.
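A minimal sketch of this feature follows, under two assumptions that the slides do not spell out: nr and ng are taken to be the normalized chromaticities nr = r / (r + g + b) and ng = g / (r + g + b), and the four parts are taken to be the quadrants of a 2×2 grid. Function names are illustrative.

```python
def chromaticity_histogram(pixels, bins=8):
    """bins x bins histogram over (nr, ng), flattened, summing to 1."""
    hist = [0.0] * (bins * bins)
    for r, g, b in pixels:
        total = r + g + b
        if total == 0:
            nr = ng = 1.0 / 3.0   # assumption: treat black as neutral
        else:
            nr, ng = r / total, g / total
        i = min(int(nr * bins), bins - 1)   # nr == 1.0 goes in the last bin
        j = min(int(ng * bins), bins - 1)
        hist[i * bins + j] += 1.0
    n = len(pixels)
    return [h / n for h in hist] if n else hist


def frame_feature(image, bins=8):
    """image: rows of (r, g, b) tuples. Returns 4 * bins * bins values."""
    h, w = len(image), len(image[0])
    feature = []
    for rows in (range(0, h // 2), range(h // 2, h)):
        for cols in (range(0, w // 2), range(w // 2, w)):
            block = [image[y][x] for y in rows for x in cols]
            feature.extend(chromaticity_histogram(block, bins))
    return feature


# 2x2 toy image: one pixel per quadrant.
toy_image = [[(255, 0, 0), (0, 255, 0)],
             [(0, 0, 255), (0, 0, 0)]]
feat = frame_feature(toy_image)
```

Because nr and ng are ratios, the result does not depend on whether r, g, b are first scaled to [0,1].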
Acoustic Features
12 cepstral coefficients
Energy (sum of squares of the raw signal samples)
Zero crossing rates (ZCR)
ZCR = sum(|sign(S(2:N))-sign(S(1:N-1))|)
Hint: normalize the energy so that it does not
dominate the distance computed
between successive frames
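The energy and ZCR definitions translate directly from the MATLAB-style notation above. The sketch below also includes z-score scaling as one possible way to follow the normalization hint; that choice, and the function names, are assumptions rather than a prescribed method.

```python
def sign(x):
    """Return -1, 0, or 1, matching MATLAB's sign()."""
    return (x > 0) - (x < 0)


def energy(samples):
    """Sum of squares of the raw signal samples."""
    return sum(s * s for s in samples)


def zero_crossing_rate(samples):
    """sum(|sign(S(2:N)) - sign(S(1:N-1))|) over one audio frame."""
    return sum(abs(sign(b) - sign(a)) for a, b in zip(samples, samples[1:]))


def zscore(values):
    """Scale one feature track to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std if std > 0 else 0.0 for v in values]


zcr = zero_crossing_rate([1, -1, 1, -1])
e = energy([1, -2, 3])
scaled = zscore([1.0, 3.0])
```

With this formula, a change from positive to negative (or back) contributes 2 to the sum, so the value is twice the number of such crossings.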
Datasets
Two videos, each a little over one minute long
Manually label the shot boundaries
What to submit
Source code
Report
◦ Compare the shot boundary detection results
returned by your algorithm with the manually
labeled boundaries
◦ Compare
◦ Explain your choice of threshold
◦ Explain the differences between the
acoustic-based and visual-based detection results
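One common way to compare detected boundaries with manual labels is precision and recall with a small frame tolerance. This is a sketch under that assumption; the slides do not prescribe a metric, and the function name and greedy matching rule are illustrative.

```python
def match_boundaries(detected, labeled, tolerance=0):
    """Return (precision, recall) of detected vs. labeled frame indices.

    Each labeled boundary is matched to the first unused detection
    that lies within `tolerance` frames of it.
    """
    unused = sorted(detected)
    hits = 0
    for truth in sorted(labeled):
        for d in unused:
            if abs(d - truth) <= tolerance:
                unused.remove(d)   # each detection may be used only once
                hits += 1
                break
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(labeled) if labeled else 0.0
    return precision, recall


precision, recall = match_boundaries([10, 52, 90], [12, 50], tolerance=2)
```

Running the same comparison for the visual-based and acoustic-based detectors gives numbers that can be set side by side in the report.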
Thanks! Q&A