VIDEO ANNOTATION
Market Trends
• Broadband doubling over the next 3-5 years
• Video-enabled devices are emerging rapidly
• Emergence of a mass internet audience
• Mainstream media moving to the Web
What do we search for in video?
- a movie scene where Titanic hits the iceberg
- the CNN business news report from 1999
- everything related to Claudia Schiffer
- a “western” movie
- all home videos taken in Paris
- all scores of a soccer game
- movies that I like
- …
Video search requires efficient annotation of video content. To some extent this can be done automatically.
Video conceptual hierarchy
Video (Type, Cast, Producer…)
→ Episode (Global Content, Intent…)
→ Video Shot (Camera motion, angle)
→ Frame (Objects and their spatial relation)
– Frames
– Shots
– Shot boundaries
– Scenes
– Audio
Low Level: Visual Features
• Color
• Texture
• Shape
• Motion
• Shot Boundaries

Mid Level: Semantic Content
• People/Objects
• Location
• Actions
• Time

High Level: Semantic Content
• Concept
• Event
• Episode/Story
Video objects
Video retrieval and annotation
Syntactical level
• Retrieval: text-based; visual and audio content-based (limited success)
• Annotation: automatic shot segmentation; from shot sequences: edit effects; from shot key-frames: color, texture, shape, face…; from shot sequences: motion info

Semantical level
• Retrieval: text-based
• Annotation:
– text tags from manual annotation (subjectivity of human perception)
– from video metadata: author, date, …; from shot, scene, and sequence tags
– text tags from automatic annotation (limited success) from visual and audio information (scene and event segmentation), from textual labels in consecutive frames, from audio/speech transcription
[Figure: a news report on a timeline t. Low-level analysis partitions it into Shot 1, Shot 2, … Shot 7, …; mid/high-level analysis groups shots into larger units such as a movie episode, a sport-event highlight, or a topic segment of a documentary.]
• Automatic annotation requires automatic partitioning of video into syntactic segments (low-level analysis) and detection of entities and their spatio-temporal relationships (mid/high-level analysis)
Automatic annotation
Video processing chain

[Figure: syntactic level — the video sequence undergoes feature extraction (color & texture analysis, spatio-temporal analysis, motion analysis), spatial segmentation, temporal segmentation, motion detection, and shot boundary detection. Semantic level — object tracking, classification, and motion verification produce shot summaries; event modeling and event detection yield events.]
• Low Level: extract color, texture, and motion features; detect shot boundaries (domain- and event-independent)
• Mid Level: determine the object class (e.g. sky, grass, tree, rock, animal); generate shot descriptors (domain-dependent)
• High Level: use shot descriptors and a domain-specific inference process to detect events and then episodes (event-specific)
Video processing framework
Syntactic features
(timeline: 1970–2010)

Texture: Autocorrelation, Wavelet Transforms, Gabor Filters
Shape: Edge Detectors, Moment Invariants, Finite Element Methods, Shape from Motion
Color: Color Moments, Color Histograms, Color Autocorrelograms
Segmentation: Scene Segmentation, Shot Detection
OCR: Modeling, Successful OCR deployments
Face: Face Detection algorithms, Neural Networks, EigenFaces
ASR: Acoustic analysis, HMMs, N-grams, CSR, LVCSR
NIST Video TREC starts; Media IR systems
Web Media Search
Color
• Robust to background; independent of size and orientation
– Color Histogram [Swain & Ballard]
– Color Moments
– Color Sets: map RGB color space to Hue-Saturation-Value and quantize [Smith, Chang]
– Color Layout: local color features obtained by dividing the image into regions
– Color Autocorrelograms
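As an illustration of the histogram features above, here is a minimal sketch (Python/NumPy; the choice of 8 bins per channel is arbitrary) of a quantized RGB histogram compared with Swain & Ballard's histogram intersection:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize each RGB channel into `bins` levels and count pixels
    per 3-D color bin; normalize so the histogram sums to 1."""
    q = (image.astype(np.uint32) * bins) // 256           # per-channel bin index
    codes = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Swain & Ballard intersection: 1.0 for identical normalized histograms."""
    return np.minimum(h1, h2).sum()

# Two synthetic single-color images: identical colors match perfectly,
# disjoint colors score (near) zero.
red = np.zeros((16, 16, 3), dtype=np.uint8);  red[..., 0] = 200
blue = np.zeros((16, 16, 3), dtype=np.uint8); blue[..., 2] = 200
same = histogram_intersection(color_histogram(red), color_histogram(red))
diff = histogram_intersection(color_histogram(red), color_histogram(blue))
```

Because the histogram ignores pixel positions, the measure is robust to orientation and, after normalization, to size, as the slide notes.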
Texture
• One of the earliest image features [Haralick et al., 70s]
– Co-occurrence matrix: orientation and distance on gray-scale pixels; contrast, inverse difference moment, and entropy [Gotlieb & Kreyszig]
– Human visual texture properties: coarseness, contrast, directionality, line-likeness, regularity and roughness [Tamura et al.]
– Wavelet Transforms [90s]: [Smith & Chang] extracted mean and variance from wavelet subbands
– Gabor Filters
– …
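The co-occurrence-matrix features can be sketched as follows (plain NumPy; a single horizontal offset and 8 gray levels are simplifying assumptions — Haralick's definition covers multiple orientations and distances):

```python
import numpy as np

def cooccurrence(gray, levels=8):
    """Normalized gray-level co-occurrence matrix for the (0, 1) offset,
    i.e. each pixel paired with its right-hand neighbor."""
    q = (gray.astype(np.uint32) * levels) // 256   # quantize to `levels` gray levels
    glcm = np.zeros((levels, levels))
    np.add.at(glcm, (q[:, :-1].ravel(), q[:, 1:].ravel()), 1)
    return glcm / glcm.sum()

def contrast(glcm):
    """Haralick contrast: large when neighboring pixels differ strongly."""
    i, j = np.indices(glcm.shape)
    return ((i - j) ** 2 * glcm).sum()

def entropy(glcm):
    """Haralick entropy: large for irregular textures."""
    p = glcm[glcm > 0]
    return -(p * np.log2(p)).sum()

flat = np.full((8, 8), 128, dtype=np.uint8)                  # uniform patch
checker = np.where(np.indices((8, 8)).sum(axis=0) % 2 == 0,  # checkerboard
                   0, 255).astype(np.uint8)
```

A uniform patch gives zero contrast and entropy; a checkerboard, where every neighbor pair jumps between the extreme gray levels, gives the maximum contrast for this quantization.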
Region Segmentation
• Partition the image into regions
– Strong segmentation: object segmentation, which is difficult
– Weak segmentation: region segmentation based on some homogeneity criterion

Scene Segmentation
• Shot detection, scene detection
– Look for changes in color, texture, brightness
– Context-based scene segmentation applied to certain categories such as broadcast news
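A minimal sketch of color-change-based shot (cut) detection, using the L1 distance between gray-level histograms of consecutive frames (the bin count and threshold are illustrative assumptions; gradual transitions such as fades need more elaborate methods):

```python
import numpy as np

def detect_cuts(frames, bins=16, threshold=0.5):
    """Report a cut between consecutive frames whose normalized
    gray-level histograms differ by more than `threshold`
    (L1 distance, which ranges over [0, 2])."""
    hists = []
    for f in frames:
        q = (f.astype(np.uint32) * bins) // 256
        h = np.bincount(q.ravel(), minlength=bins).astype(np.float64)
        hists.append(h / h.sum())
    return [i for i in range(1, len(hists))
            if np.abs(hists[i] - hists[i - 1]).sum() > threshold]

# Synthetic clip: three dark frames followed by three bright frames,
# so the only cut is between frame 2 and frame 3.
clip = [np.full((8, 8), 10, np.uint8)] * 3 + [np.full((8, 8), 200, np.uint8)] * 3
```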
Face
• Face detection is highly reliable; face recognition for video is still a challenging problem.
– Viola and Jones face detection
– Neural Networks [Rowley]
– Wavelet-based histograms of facial features
– EigenFaces: extract eigenvectors and use them as the feature space
OCR
• OCR is a fairly successful technology: accurate, especially with good matching vocabularies.

ASR
• Automatic speech recognition is fairly accurate for medium-to-large-vocabulary broadcast-type data, and a large number of speech vendors are available. Still an open problem for free conversational speech in noisy conditions.
Shape
• Outer-boundary-based vs. region-based
– Fourier descriptors
– Moment invariants
– Finite Element Method (stiffness matrix encodes how each point is connected to the others; eigenvectors of the matrix)
– Wavelet transforms, which leverage multiresolution [Chuang & Kao]
– Chamfer matching for comparing two shapes (linear dimension rather than area)
– Well-known edge detection algorithms
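The Fourier-descriptor idea can be sketched like this (the normalization steps shown are one common way to obtain translation, scale, rotation, and starting-point invariance):

```python
import numpy as np

def fourier_descriptors(contour, k=8):
    """Shape signature from the first k Fourier coefficients of the
    boundary, treated as the complex sequence x + iy."""
    z = contour[:, 0] + 1j * contour[:, 1]
    coeffs = np.fft.fft(z)
    coeffs[0] = 0.0                 # drop DC term  -> translation invariance
    mags = np.abs(coeffs)           # drop phase    -> rotation / start-point invariance
    return mags[1:k + 1] / mags[1]  # divide by 1st harmonic -> scale invariance

# A circle and a scaled, translated copy yield the same descriptors.
t = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
moved = 3.0 * circle + 5.0
```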
• Context is useful to recognize a scene
• Scenes are composed of objects and events
• Use proto-concepts to describe context
• Use machine learning to link context to concepts that define objects and events
Semantic features

[Figure: an image is segmented into regions (sky, water, grass, road, …); the resulting "bag of regions" is compared against proto-concepts, yielding a proto-concept similarity distribution for the image.]
Semantic segments

• Semantic segment
– concatenation of consecutive video shots that are related to each other regarding their semantic content
– parts of a video where content coherence (i.e. continuity of the semantic content from one shot to another) remains high
– detection of semantic segment boundaries can be made by measuring the coherence of the semantic content along neighboring video shots: segment boundaries are places of low content coherence
• Semantic segments are suitable for video genres characterized by a clear sequential content structure:
– broadcast news
– movies (episode-based scheme)
– documentaries
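A sketch of coherence-based boundary detection over per-shot feature vectors (the window size, threshold, and use of cosine similarity are illustrative choices, not a specific published method):

```python
import numpy as np

def segment_boundaries(shot_features, window=2, threshold=0.5):
    """Place a semantic-segment boundary before shot i when the mean
    cosine similarity (content coherence) between the `window` shots on
    each side of the gap drops below `threshold`."""
    f = np.asarray(shot_features, dtype=np.float64)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)   # unit-normalize each shot
    boundaries = []
    for i in range(window, len(f) - window + 1):
        coherence = (f[i - window:i] @ f[i:i + window].T).mean()
        if coherence < threshold:
            boundaries.append(i)
    return boundaries

# Four "studio" shots followed by four "field" shots: the only place of
# low coherence is the gap before shot 4.
shots = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
```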
• Four ways of performing high-level analysis
– Video categorization into genres
– Video partitioning into semantic segments
– Semantic segment extraction
– Video summarization/abstraction
• Feature-based algorithmic solutions: different features and algorithms for different analysis processes
• Some model assumptions are required.
High-level video analysis is essential for automatic semantic annotation of video.
Bridging the Semantic Gap

[Figure: three-step pipeline. Step 1: syntactic analysis of the raw video produces basic statistics. Step 2: derivation of style attributes. Step 3: genre mapping of the style attributes for genre recognition.]
Step 1: Syntactic analysis

• Syntactic analysis is based on simple statistics for the sequence of RGB frames
– Color statistics
• basis for cut detection
• color histogram
• standard deviation
– Audio statistics
• record basic audio frequency and amplitude statistics
– Motion detection
• motion energy: total amount of motion in a scene (block-wise difference of color histograms)
• motion vector: distinguish camera motion from object motion
– Object segmentation
• moving object: same speed, same direction; use the motion vector field; subtract the camera motion → pure object motion
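The motion-energy statistic (block-wise difference of histograms) can be sketched for gray frames as follows; the block size and bin count are illustrative:

```python
import numpy as np

def motion_energy(prev, curr, block=8, bins=8):
    """Total motion between two gray frames: summed L1 distance of
    per-block gray-level histograms. Zero means no change; a hard cut
    or heavy motion gives a large value."""
    energy = 0.0
    for y in range(0, prev.shape[0] - block + 1, block):
        for x in range(0, prev.shape[1] - block + 1, block):
            hp = np.bincount(((prev[y:y + block, x:x + block].astype(np.uint32)
                               * bins) // 256).ravel(), minlength=bins)
            hc = np.bincount(((curr[y:y + block, x:x + block].astype(np.uint32)
                               * bins) // 256).ravel(), minlength=bins)
            energy += float(np.abs(hp - hc).sum())
    return energy

dark = np.zeros((16, 16), dtype=np.uint8)
bright = np.full((16, 16), 255, dtype=np.uint8)
```

Because the histograms are computed per block, the measure responds to localized object motion as well as to global changes.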
Step 2: Derivation of style attributes

• Assign semantics to the scene
• Scene length and transition
– important style attribute
– scene separator: only hard cuts (>95%)
– other transitions: artistic style element
• Camera motion
– panning, tilting, zooming
– identify the motion vector direction with the highest frequency (normal distribution)
– classification of panning: error rate < 10%
• Object recognition
– simple objects or patterns in a well-defined environment
– face recognition
– TV channel: predefined patterns stored in a database; logo recognition algorithm run on every 4th frame
[Example: “KBS” logo in bottom right corner]
Step 3: Genre mapping

• Semantics of audio
– Amplitude
• speech: characteristic frequency spectrum
• music: beat
• noise: amplitude > threshold
• silence: amplitude < threshold
– Frequency
• distinguish between speaker and noise (>95%)
• human speech has a limited frequency spectrum
• Newscast
– characteristic pattern of speaker and non-speaker phases
– logo
• Car race
– scene length much shorter than in tennis
– camera motion, object motion
– noise at high amplitude
– easy to distinguish from tennis and soccer
• Tennis
– a very good example of the discriminating power of audio
– bouncing of the ball: singular peak
– speaker phase
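The amplitude rules (silence below a threshold, noise above one) can be sketched as a window labeller; the thresholds and window length below are invented for illustration, and a real system would refine the middle class with frequency analysis:

```python
import numpy as np

def label_windows(samples, rate=16000, win=0.05, silence=0.01, noise=0.5):
    """Label each `win`-second window of a mono signal (values in
    [-1, 1]) by RMS amplitude: 'silence' below `silence`, 'noise'
    above `noise`, otherwise 'speech/music'."""
    n = int(rate * win)
    labels = []
    for i in range(0, len(samples) - n + 1, n):
        rms = np.sqrt(np.mean(np.square(samples[i:i + n].astype(np.float64))))
        if rms < silence:
            labels.append('silence')
        elif rms > noise:
            labels.append('noise')
        else:
            labels.append('speech/music')
    return labels

t = np.arange(800) / 16000.0               # one 0.05-second window
quiet = np.zeros(800)
loud = 0.9 * np.sin(2 * np.pi * 440 * t)   # RMS ≈ 0.64
mid = 0.2 * np.sin(2 * np.pi * 440 * t)    # RMS ≈ 0.14
```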
[Figure: a Video Feature Extractor derives Video-Style Attributes from the input video; a Video Classifier matches them against the Video-Style Attributes of pre-specified genres (News, Action movie, Tennis match, Drama, …) to output the genre.]
• Commercials
– separated by up to 5 monochrome frames
– fade-in scene transition
• Cartoon
– scene length greater than in other genres
– less camera motion
– zero amplitude: no background noise
• Result
– no single attribute is sufficient to uniquely identify a genre
– a style profile is needed
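The style-profile idea can be sketched as nearest-profile matching; the attribute choices and profile numbers below are invented for illustration (real profiles would be learned from labeled video):

```python
import numpy as np

# Hypothetical style profiles: [mean scene length (s), camera-motion
# rate, mean audio amplitude].
PROFILES = {
    'newscast': np.array([8.0, 0.1, 0.4]),
    'car race': np.array([2.0, 0.9, 0.9]),
    'cartoon':  np.array([12.0, 0.2, 0.0]),
}

def classify(style_attributes):
    """Return the genre whose stored profile is nearest (Euclidean
    distance) to the measured style-attribute vector."""
    v = np.asarray(style_attributes, dtype=np.float64)
    return min(PROFILES, key=lambda g: np.linalg.norm(PROFILES[g] - v))
```

Matching the whole vector, rather than thresholding any single attribute, reflects the slide's conclusion that no single attribute suffices.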
Video Search: Video TREC

• 1st TREC (2001)
– Mean Average Precision @100 items (MAP) 0.033 (in the general category)
– Transcript only was better than transcript + image aspects + ASR
– There were also different types of test: known items versus general. Known-item queries had at least one result.
• 2nd TREC (2002)
– Interactive runs to rephrase queries
– Again, text based only on ASR was the best-performing system: MAP 0.137
– The leading system employed multiple sub-systems: TF-IDF variants (MAP 0.093); the other was boolean with query expansion (MAP 0.101)
– OCR was not particularly applicable; phonetic ASR did not help
• 3rd TREC (2003)
– Data changed radically (broadcast news added: CNN, ABC, C-SPAN)
– Baseline ASR + CC: MAP 0.155
– ASR + CC + VOCR + image similarity + Person-X retrieval: MAP 0.218

* “Successful Approaches in the TREC Video Retrieval Evaluations”, Alexander Hauptmann, Michael Christel, ACM Multimedia 2004.
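Mean Average Precision, the metric quoted in these results, can be computed as follows (a standard formulation, not code from the cited paper):

```python
def average_precision(ranked, relevant):
    """AP for one query: sum of precision@k over the ranks k at which a
    relevant item appears, divided by the number of relevant items."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean of AP over (ranked_list, relevant_set) query pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, a ranking ['a', 'b', 'c'] against relevant set {'a', 'c'} scores (1/1 + 2/3) / 2 = 5/6.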
39 concepts: Aircraft, Animal, Boat, Building, Bus, Car, Chart, Corporate leader, Court, Crowd, Desert, Entertainment, Explosion, Face, Map, Meeting, Military, Mountain, Natural disaster, Office, Outdoor, Flag (USA), Government leader, People, People marching, Police/security, Walking, Water body, Weather news, Sky, Truck, Urban, Vegetation, Vehicle, Violence, Sports, Studio, Prisoner, Screen
Digitization and compression: Hardware and software necessary to convert the video information into digital compressed format
Cataloguing: process of extracting meaningful story units from the raw video data and building the corresponding indexes
Digital video archive: repository of digitized, compressed video data
Visual summaries: representation of video contents in a concise, typically hierarchical, way
Indexes: pointers to video segments or story units
User interface: friendly, visually rich interface that allows the user to interactively query the database, browse the results, and view the selected video clips
Query / search engine: responsible for searching the database according to the parameters provided by the user