Content-based Image and Video Retrieval
Lecture, Summer Semester 2010
Content-based Image and Video Analysis
-High-Level Feature Detection-
Hazım Kemal Ekenel, [email protected]
Rainer Stiefelhagen, [email protected]
CV-HCI Research Group: http://cvhci.ira.uka.de/
31.05.2010
Last week
Introduction to digital video processing
Shot boundary detection
A shot is a sequence of frames captured by a single camera in a single continuous action.
A case study: TV Genre Classification
Estimating the type of the TV program using visual
+ aural + cognitive + structural cues
This week
Concept Detection/High-level feature
detection
Data collection & annotation
Concept lexicon
Evaluations
Sample systems: IBM, MediaMill, Columbia University
Visual features
Learning approaches
Fusion techniques
Issues & challenges
Video Retrieval
Search using content
Example I:
IBM IMARS Multimedia Analysis and Retrieval System:
http://mp7.watson.ibm.com/?ibm-download-now=View+demo
Video Retrieval
Example II:
University of Amsterdam MediaMill
http://www.science.uva.nl/research/mediamill/demo/index.php
Video Retrieval
Example III:
Uni-Karlsruhe, CVHCI research group
Levels of image/video retrieval
Level 1: Based on color, texture, shape features
Images are compared based on low-level features,
no semantics involved
Much research has been done; this is a feasible task
Level 2: Bring semantic meanings into the search
E.g., identifying human beings, horses, trees, beaches
Requires retrieval techniques of level 1
Level 3: Retrieval with abstract and subjective attributes
Find pictures of a particular birthday celebration
Find a picture of a happy beautiful woman
Requires retrieval techniques of level 2 and very complex logic
What is a concept/high-level feature?
A concept (high-level feature) is a semantic label, such as Parade or Handshaking, whose presence or absence in a video shot is to be detected automatically.
Challenges
Changes in
• View angle,
• Scale,
• Color,
• Shape …
Large-Scale Concept Ontology
for Multimedia (LSCOM)
• Collaborative activity of three critical communities to create a user-driven concept ontology for the analysis of video:
• Users (analysts, broadcasters)
• Ontology experts
• Technical researchers, algorithm designers & system developers
Sample Concepts
000 – Parade
Definition: Multiple units of marchers, devices, bands, banners, or music.
001 – Exiting_Car
Definition: A car exiting from somewhere, such as a
highway, building, or parking lot.
002 – Handshaking
Definition: Two people shaking hands. Does not include
hugging or holding hands.
003 – Running
Definition: One or more people running.
004 – Airplane_Crash
Definition: Airplane crash site.
Sample Concepts
006 – Demonstration_Or_Protest
Definition: One or more people protesting. May or may not
have banners or signs.
007 – People_Crying
Definition: One or more people with visible tears.
008 – Airplane_Takeoff
Definition: Airplane heading down the runway for takeoff (may have already left the runway and be ascending).
009 – Airplane_Landing
Definition: Airplane descending or decelerating after making
contact with runway.
010 – Helicopter_Hovering
Definition: Helicopter in the air. May be moving or staying in
place.
Evaluation: TRECVID
High-level Feature Detection Task
Promote progress in content-based analysis, detection, and retrieval in large amounts of digital video
Combine multiple errorful sources of evidence
Achieve greater effectiveness, speed, and
usability
Confront systems with unfiltered data and
realistic tasks
Measure systems against human abilities
Evolution of TRECVID
TV2007 vs. TV2008 vs. TV2009 datasets
TV 2009: selection of 10 new features
Participants suggested features that include:
Parts of natural scenes.
Child.
Sports.
Non-speech audio component.
People and objects in action.
Frequency in consumer video.
NIST basic selection criteria, features have to:
Be moderately frequent
Have a clear definition
Be of use in searching
Not overlap with previously used topics/features
20 features evaluated
The 10 marked with “*” are a subset of those tested in 2008
Frequency of hits varies by
feature (TRECVID 2009)
High-level feature detection task (1)
Goal: Build a benchmark collection for visual concept detection methods
Secondary goals:
encourage generic (scalable) methods for detector
development
semantic annotation is important for
search/browsing
Video data collection:
News magazine, science news, news reports,
documentaries, educational programming and
archival video
High-level feature detection task (2)
NIST evaluated a set of features using a 50% random sample of the submission pools (inferred AP)
Four training types were allowed:
A: Systems trained only on common TRECVID development collection data, or (formerly type B) trained only on common development collection data but not (just) on its common annotation
C: Systems not of type A
a, c: Same as A and C, but with no training data specific to any Sound and Vision data
Evaluation
Each feature assumed to be binary: absent or present
for each master reference shot
Task: Find shots that contain a certain feature, rank them according to a confidence measure, and submit the top 2000
NIST pooled and judged top results from all
submissions
Evaluated performance effectiveness by calculating
the inferred average precision of each feature result
Compared runs in terms of mean inferred average
precision across the feature results.
Problem definition
Given an n-dimensional feature vector xi, extracted from shot i,
the aim is to obtain a measure that indicates whether semantic concept wj is present in shot i:
p(wj | xi)
Various visual feature extraction methods can be used to obtain xi
Several supervised machine learning approaches can be used to learn the relation between wj and xi
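As a minimal sketch of this formulation, assuming scikit-learn and randomly generated stand-in feature vectors (any supervised probabilistic classifier could take the place of logistic regression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one n-dimensional feature vector x_i per shot
# and a binary label saying whether concept w_j is present in that shot.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))    # 200 shots, 64-dim feature vectors
y_train = rng.integers(0, 2, size=200)  # 1 = concept present, 0 = absent

# Any supervised probabilistic classifier can model p(w_j | x_i);
# logistic regression is one simple choice.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

x_test = rng.normal(size=(1, 64))              # feature vector of a new shot
p_present = model.predict_proba(x_test)[0, 1]  # estimate of p(w_j | x_i)
print(f"p(w_j | x_i) = {p_present:.3f}")
```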
Concept Detection Process
Precision / Recall
Recall: percentage of all relevant documents that are retrieved
Precision: percentage of retrieved documents that are relevant
F/F1 measure: the harmonic mean of precision and recall,
F1 = 2 · Precision · Recall / (Precision + Recall)
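A small self-contained sketch of these three measures (the function name and the toy document IDs are illustrative, not from any system described here):

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall, and F1 for one set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Toy example: 3 of 4 retrieved documents are relevant; 5 are relevant in total.
print(precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[1, 2, 3, 8, 9]))
# -> (0.75, 0.6, 0.666...)
```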
Precision vs. Recall
Precision-recall curves: a worked example
A query returns 9 ranked matches; 5 documents are relevant overall, and they appear at ranks 1, 3, 5, 7, and 9. Walking down the ranked list, each newly found relevant item yields one point of the precision-recall curve:
After rank 1: Recall = 1 / 5, Precision = 1 / 1
After rank 3: Recall = 2 / 5, Precision = 2 / 3
After rank 5: Recall = 3 / 5, Precision = 3 / 5
After rank 7: Recall = 4 / 5, Precision = 4 / 7
After rank 9: Recall = 5 / 5, Precision = 5 / 9
[Plot: precision (y-axis) vs. recall (x-axis) for these five points]
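The curve above can be reproduced with a short sketch (the document IDs 1 to 9 stand in for the ranked matches; relevant items sit at ranks 1, 3, 5, 7, and 9 as in the example):

```python
def precision_recall_curve(ranked, relevant):
    """One (recall, precision) point per relevant item found in the ranking."""
    relevant = set(relevant)
    points, hits = [], 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

ranked = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # result list, best match first
relevant = [1, 3, 5, 7, 9]             # ground truth: 5 relevant documents
for recall, precision in precision_recall_curve(ranked, relevant):
    print(f"Recall = {recall:.2f}, Precision = {precision:.2f}")
```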
Inferred average precision (infAP)
Developed by Emine Yilmaz and Javed A. Aslam at
Northeastern University*
Estimates average precision well using a small
sample of judgments from the usual submission
pools
Experiments on TRECVID 2005, 2006, and 2007 feature submissions confirmed the quality of the estimate in terms of actual scores and system ranking
* J. A. Aslam, V. Pavlu, and E. Yilmaz, "A Statistical Method for System Evaluation Using Incomplete Judgments," Proceedings of the 29th ACM SIGIR Conference, Seattle, 2006.
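For reference, exact average precision, the quantity that infAP estimates from incomplete judgments, can be computed as below; note this sketch computes the exact value and is not the Yilmaz/Aslam sampling estimator itself:

```python
def average_precision(ranked, relevant):
    """Exact AP: mean of the precision values at each relevant item's rank."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Ranked list from the precision/recall slides (relevant at ranks 1, 3, 5, 7, 9):
print(average_precision([1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 3, 5, 7, 9]))
# -> (1/1 + 2/3 + 3/5 + 4/7 + 5/9) / 5 ≈ 0.679
```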
Top 10 runs in HLF (TRECVID 2009)
Trends in HLF task (2008)
The HLF task is accepted as an important building block for search
More interest in category B and C submissions (using web data, e.g. Flickr images, YouTube)
Hardly any feature-specific approaches
Large variety in classifier architectures and choices of feature representations
Using salient/SIFT points as features becomes more and more popular!
TRECVID 2009 Observations
Focus on robustness, merging many different
representations
Comparing fusion strategies
Efficiency improvements (e.g. GPU implementations)
Analysis of more than one keyframe per shot
Audio analysis
Using temporal context information
Analyzing motion information
Automatic extraction of Flickr training data
State of the art
[Pipeline diagram]
Training: labeled examples → low-level feature extraction → supervised learner
Testing: feature measurement → classification
Output: concept probabilities, e.g. Outdoor: probability 0.95, Airplane: probability 0.7
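As an illustration of the low-level feature extraction box, here is a minimal sketch that turns a keyframe into a global color histogram feature vector. The function and the random test frame are assumptions for illustration; real systems often use richer features such as the salient/SIFT points mentioned earlier.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Global color histogram of an RGB keyframe (H x W x 3 uint8 array).

    Concatenates one normalized histogram per color channel into a single
    low-level feature vector of length 3 * bins.
    """
    channels = []
    for c in range(3):
        hist, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        channels.append(hist / hist.sum())
    return np.concatenate(channels)

# Hypothetical keyframe: a random 120 x 160 RGB image.
frame = np.random.default_rng(0).integers(0, 256, size=(120, 160, 3),
                                          dtype=np.uint8)
x_i = color_histogram(frame)  # the feature vector fed to the supervised learner
print(x_i.shape)              # (24,)
```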
Points to Consider
High-level feature learning consists of training a system from sets of positive and negative examples.
The system's performance depends heavily on the implementation choices and details. It also strongly depends on the SIZE and QUALITY of the TRAINING EXAMPLES.
While it is quite easy and cheap to get large amounts of raw data, it is usually very costly to have them annotated.
Annotations
Collaborative Annotation Effort
Sequential annotation interface
Parallel annotation interface
Keyframes
A set of frames that best represent the visual
content of the scene
Many techniques are available for keyframe extraction (the temporal-change strategy is sketched below):
Take the first, middle, or last frame
Temporal change
Clustering
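A minimal sketch of the temporal-change strategy mentioned above (the threshold, bin count, and function name are illustrative assumptions):

```python
import numpy as np

def keyframes_by_temporal_change(frames, threshold=0.3, bins=16):
    """Select keyframe indices where the global histogram changes strongly.

    frames: iterable of RGB frames (H x W x 3 uint8 arrays). A frame becomes
    a keyframe when the L1 distance between its normalized pixel-value
    histogram and that of the last keyframe exceeds the threshold.
    """
    keyframes, last_hist = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / hist.sum()
        if last_hist is None or np.abs(hist - last_hist).sum() > threshold:
            keyframes.append(i)   # the first frame is always a keyframe
            last_hist = hist
    return keyframes
```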
Active Learning
Use an existing system and heuristics for selecting the samples to annotate → requires a classification score.
Annotate first, or only, the samples that are expected to be the most informative for system training → various strategies.
Get the same performance with fewer annotations and/or better performance with the same annotation count
Content-based image and video retrieval 38
Active Learning Strategies
Random sampling
Uncertainty sampling: choose the most uncertain samples;
samples whose probability is closest to 0.5 are selected
Relevance sampling: choose the most probable positive samples;
samples whose probability is closest to 1.0 are selected
Choose the samples farthest from the already evaluated ones
Combinations of these, e.g. choose samples that are both among the most probable and among the farthest from the already evaluated ones (see the sketch below)
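A minimal sketch of the first three strategies, assuming a list of classifier scores p(wj | xi) for the unlabeled samples (the function name and scores are illustrative):

```python
import numpy as np

def select_for_annotation(probs, n, strategy="uncertainty"):
    """Rank unlabeled samples for annotation by an active-learning strategy.

    probs: classifier scores p(concept present | x) for the unlabeled samples.
    Returns the indices of the n samples to annotate next.
    """
    probs = np.asarray(probs)
    if strategy == "uncertainty":      # scores closest to 0.5
        order = np.argsort(np.abs(probs - 0.5))
    elif strategy == "relevance":      # scores closest to 1.0
        order = np.argsort(-probs)
    else:                              # random sampling baseline
        order = np.random.default_rng(0).permutation(len(probs))
    return order[:n]

scores = [0.05, 0.48, 0.91, 0.52, 0.73]   # hypothetical p(w_j | x_i) values
print(select_for_annotation(scores, 2, "uncertainty"))  # -> [1 3]
print(select_for_annotation(scores, 2, "relevance"))    # -> [2 4]
```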
Evaluated Strategies
Random sampling: baseline
Relevance sampling is the best one when a small fraction
(less than 15%) of the dataset is annotated.
Uncertainty sampling is the best one when a medium to
large fraction (15% or more) of the dataset is annotated.
[Plot: MAP vs. amount of used data]
Active Learning Conclusions
The maximum performance is reached when 12 to 15% of
the whole dataset is annotated (for 36K samples).
The optimal fraction to annotate depends upon the size of
the training set: it roughly varies with the square root of
the training set size (25 to 30% for 9K samples).
Random sampling is better than linear scan.
Simulated active learning can improve system
performance even on fully annotated training sets.
Uncertainty sampling is more “precision oriented”.
Relevance sampling is more “recall oriented”.