DMASM 2011 1
Paul Over
TRECVID Project Leader
Information Access Division
National Institute of Standards and Technology
Gaithersburg, MD, USA
TRECVID: Promoting Research Via Community Technology Evaluations
http://trecvid.nist.gov
2
What is TRECVID?
Workshop series (2001 – present), http://trecvid.nist.gov
• to promote research/progress in content-based video analysis/exploitation
Foundation for large-scale laboratory testing
Forum for the
• exchange of research ideas
• discussion of research methodology: what works, what doesn’t, and why
Focus: content-based approaches to
• retrieval/detection/summarization/segmentation/…
Aims for realistic system tasks and test collections
• unfiltered data
• focus on relatively high-level functionality (e.g., interactive search)
• measurement against human abilities
Provides data, tasks, and uniform, appropriate scoring procedures
3
TRECVID Philosophy
TRECVID is a modern example of the Cranfield tradition
• Laboratory system evaluation based on test collections
Emphasis on advancing the state of the art from evaluation results
• TRECVID’s primary aim is not competitive product benchmarking
• experimental workshop: sometimes experiments fail!
Laboratory experiments (vs., e.g., observational studies)
• sacrifice operational realism and broad scope of conclusions
• for control and information about causality: what works and why
• results tend to be narrow, at best indicative, not final
• evidence grows as approaches prove themselves repeatedly, as part of various systems, against various test data, over years
4
TRECVID Yearly Cycle
Post-workshop experiments, final papers
Results Evaluation
TRECVID Workshop
Results analysis and workshop paper/presentation preparation (~400 authors/year)
System building & experimentation; Community contributions (shots, training data, ASR, MT, etc.)
Search topic, ground truth development
Task definitions complete
Call for Participation
Data Procurement
5
TRECVID’s Evolution
[Figure: timeline chart covering … 2003 to 2011, in three panels]
Data (hours, axis 0 to 2000): English TV news; TV news; BBC rushes; Sound & Vision (S&V); airport surveillance; Internet Archive Creative Commons (IACC); HAVIC. Legend: new development or test data as added.
Participating teams (axis 0 to 120): applied vs. finished
Tasks:
• Shot boundaries
• Ad hoc search
• Features/semantic indexing
• Stories
• Camera motion
• BBC rushes
• Summaries
• Copy detection
• Surveillance events
• Known-item search
• Instance search pilot
• Multimedia event detection (MED) pilot
6
TRECVID 2010 Tasks and Data
Internet Archive – Creative Commons (IACC) [video, title, keywords, description]
Sound and Vision [video]
Airport surveillance [video]
HAVIC – Internet multimedia [video]
Known-item search from text-only query
Instance search from multiple frames with bounding boxes
Surveillance event detection
Multimedia event detection
Semantic indexing (automatic assignment of ~150 tags)
Content-based copy detection
TRECVID @ NIST 7
TV2010 Finishers
Groups finished   Task code   Task name
22                CCD         Copy detection
11                SED         Surveillance event detection
39                SIN         Semantic indexing
15                KIS         Known-item search
5                 MED         Multimedia event detection pilot
15                INS         Instance search pilot
Unique finishing teams [pie chart]: regions Asia, Europe, North America, South America, Africa; slice values 32, 27, 12, 1, 1 (value-to-region pairing not shown in the text)
8
Support: National Institute of Standards and Technology (NIST), Intelligence Advanced Research Projects Activity (IARPA), Department of Homeland Security (DHS)
Contributors:
Brewster Kahle (Internet Archive's founder) and R. Manmatha (U. Mass, Amherst) suggested in December 2008 that TRECVID take another look at the resources of the Archive.
Cara Binder and Raj Kumar @ archive.org helped explain how to query and download automatically from the Internet Archive.
Georges Quénot with Franck Thollard, Andy Tseng, and Bahjat Safadi from LIG and Stéphane Ayache from LIF shared coordination of the semantic indexing task and organized additional judging with support from the Quaero program.
Georges Quénot and Stéphane Ayache again organized a collaborative annotation of 130 features.
Shin'ichi Satoh at NII along with Alan Smeaton and Brian Boyle at DCU arranged for the mirroring of the video data.
Colum Foley and Kevin McGuinness (DCU) helped segment the instance search topic examples and set up the oracle at DCU for interactive systems in the known-item search task.
The LIMSI Spoken Language Processing Group and VexSys Research provided ASR for the IACC.1 videos.
Laurent Joyeux (INRIA-Rocquencourt) updated the copy detection query generation code.
Matthijs Douze from INRIA-LEAR volunteered a camcorder simulator to automate the camcording transformation for the copy detection task.
Emine Yilmaz (Microsoft Research) and Evangelos Kanoulas (U. Sheffield) updated their xinfAP code (sample_eval.pl) to estimate additional values and made it available.
9
Some impacts …
Continuing improvement in feature detection (automatic tagging) in the University of Amsterdam’s MediaMill system
• Performance on 36 features roughly doubled, 2006 -> 2009
• Within domain (train and test): MAP 0.22 -> 0.41
• Cross domain: MAP 0.13 -> 0.27
Bibliometric study of TRECVID’s scholarly impact, 2003 - 2009 (Dublin City University & University College Dublin)
• 2073 peer-reviewed journal/conference papers
2010 RTI International economic impact study of TREC/TRECVID
• “…for every $1 that NIST and its partners invested in TREC[/TRECVID], at least $3.35 to $5.07 in benefits accrued to IR [Information Retrieval] researchers”
10
TRECVID search types so far
TRECVID search has modeled a user looking for video shots for reuse
• of people, objects, locations, events
• not just information (e.g., video of X, not video of someone talking about X)
• independent of original intent, saliency, etc.
in video of various sorts (without metadata other than file names):
• multilingual broadcast news (Arabic, Chinese, English)
• Dutch “edutainment”, cultural, news magazine, historical shows
using queries containing:
• text only
• text + image/video examples
• image/video examples only
in two modes:
• fully automatic
• human-in-the-loop search
11
Panofsky/Shatford mode/facet matrix **
Facet | Specific (Iconographic) | Generic (Pre-iconographic) | Abstract (Iconological)
Who | Individually named person, group, thing | Kind of person, thing | Mythical, fictitious being
What | Individually named event, action | Kind of event, action, condition | Emotion, abstraction
Where | Individually named geographical location | Kind of place, geographical, architectural | Place symbolized
When | Linear time: date or period | Cyclical time: season, time of day | Emotion, abstraction symbolized by time
** From Enser, Peter G. B. and Sandom, Chriss J. Retrieval of Archival Moving Imagery – CBIR Outside the Frame. CIVR2002. LNCS 2383 pp. 206-214.
12
24 Topics from TRECVID 2009
• Find shots of a road taken from a moving vehicle through the front window.
• Find shots of a crowd of people, outdoors, filling more than half of the frame area.
• Find shots with a view of one or more tall buildings (more than 4 stories) and the top story visible.
• Find shots of a person talking on a telephone.
• Find shots of a close-up of a hand, writing, drawing, coloring, or painting.
• Find shots of exactly two people sitting at a table.
• Find shots of one or more people, each walking up one or more steps.
• Find shots of one or more dogs, walking, running, or jumping.
• Find shots of a person talking behind a microphone.
• Find shots of a building entrance.
• Find shots of people shaking hands.
• Find shots of a microscope.
• Find shots of two or more people, each singing and/or playing a musical instrument.
• Find shots of a person pointing.
• Find shots of a person playing a piano.
• Find shots of a street scene at night.
• Find shots of printed, typed, or handwritten text, filling more than half of the frame area.
• Find shots of something burning with flames visible.
• Find shots of one or more people, each at a table or desk with a computer visible.
• Find shots of an airplane or helicopter on the ground, seen from outside.
• Find shots of one or more people, each sitting in a chair, talking.
• Find shots of one or more ships or boats, in the water.
• Find shots of a train in motion, seen from outside.
• Find shots with the camera zooming in on a person's face.
13
Drilling down in the search landscape
Dimensions: data types; searcher abilities, needs, preferences, history
Example searches:
• Documentary producer searches TV archive for reusable shots of Berlin in the 1920s
• Student searches Web for new music video
• Your mother searches home videos for shots of daughter playing with family pet
• Voter looks for video of candidate X at recent town hall meeting
• Intelligence analyst searches multilingual open source video for background info on location X
• Security personnel search surveillance video archive for suspicious behavior
• Fan searches for favorite TV show episode
• 10-year-old looks for video of tigers for school report
• Doctor searches echocardiogram videos for instances like example
• You want something to make you laugh
Disciplines involved (TRECVID):
• Human-computer interaction: human visual capabilities, expert vs. novice, text/image/concept querying, visualization, …
• Information retrieval: indexing, query typing, concept selection, weighting, ranking, pos/neg relevance feedback, metadata, …
• Machine vision: segmentation, keypoints, SIFT, classifier fusion, face recognition, …
• Machine learning: SVM, GMM, graphical models, boosting, …
• Metrology: metrics, data, task definition, ground truth, significance, …
• …
14
Finding meaning in text (words) versus images (pixels)
Hurricane Andrew which hit the Florida coast south of Miami in late August 1992 was at the time the most expensive disaster in US history. Andrew's damage in Florida cost the insurance industry about $8 billion. There were fifteen deaths, severe property damage, 1.2 million homes were left without electricity, and in Dade county alone 250,000 were left homeless.
15
One image/video – many different (changing) views of content
Creator’s keywords: “stupid sister”
www.archive.org/details/StupidSister
Possible content keywords, tags: women, pigeons, plaza, buildings, outdoors, daytime, running, falling, clapping, …
16
One person/thing/location – many different (changing) appearances
17
Can multimedia features serve as “words”?
Low-level
– Color
– Texture
– Shape
High-level
– 449 annotated LSCOM features
– 39 LSCOM-Lite
– TRECVID 2009: Classroom, Chair, Infant, Traffic intersection, Doorway, Airplane-flying, Person-playing-a-musical-instrument, Bus, Person-playing-soccer, Cityscape, Person-riding-a-bicycle, Telephone, Person-eating, Demonstration-Or-Protest, Hand, People-dancing, Nighttime, Boat-Ship, Female-human-face-closeup, Singing
Text from
– speech
– video OCR
18
LSCOM feature sample
000 - Parade; 001 - Exiting_Car; 002 - Handshaking; 003 - Running; 004 - Airplane_Crash; 005 - Earthquake; 006 - Demonstration_Or_Protest; 007 - People_Crying; 008 - Airplane_Takeoff; 009 - Airplane_Landing; 010 - Helicopter_Hovering; 011 - Golf; 012 - Walking; 013 - Singing; 014 - Baseball; 015 - Basketball; 016 - Football; 017 - Soccer; 018 - Tennis; 019 - Speaking_To_Camera; 020 - Riot; 021 - Natural_Disasters; 022 - Tornado; 023 - Ice_Skating; 024 - Snow; 025 - Flood; 026 - Skiing; 027 - Talking; 028 - Dancing;
029 - Car_Crash; 030 - Funeral; 031 - Gymnastics; 032 - Rocket_Launching; 033 - Cheering; 034 - Greeting; 035 - Throwing; 036 - Shooting; 037 - Address_Or_Speech; 038 - Bomber_Bombing; 039 - Celebration_Or_Party; 040 - Airport; 041 - Barn; 042 - Castle; 043 - College; 044 - Courthouse; 045 - Fire_Station; 046 - Gas_Station; 047 - Grain_Elevator; 048 - Greenhouse; 049 - Hangar; 050 - Hospital; 051 - Hotel; 052 - House_Of_Worship; 053 - Police_Station; 054 - Power_Plant; 055 - Processing_Plant; 056 - School; 057 - Shopping_Mall;
058 - Stadium; 059 - Supermarket; 060 - Airport_Or_Airfield; 061 - Aqueduct; 062 - Avalanche; 063 - River_Bank; 064 - Aircraft_Cabin
. . .
810 - Still_Image_Composition_May_Include_Text; 811 - Stock_Exchange; 812 - Stockyard; 813 - Storage_Tanks; 814 - Store_Outside; 815 - Street_Signs; 816 - Street_Vendor; 817 - Students_Schoolkids; 818 - Suitcases; 819 - Surgeons; 820 - Sword; 821 - Synagogue; 822 - Tailor; 823 - Tanneries; 824 - Taxi_Driver; 825 - Teacher; 826 - Team_Organized_Group; 827 - Technicians; 828 - Teenagers; 829 - Temples;
830 - Terrorist; 831 - Text_Only_Artificial_Bkgd; 832 - Thatched_Roof_Buildings; 833 - Theater; 834 - Toddlers; 835 - Town_Halls; 836 - Town_Squares; 837 - Townhouse; 838 - Tractor; 839 - Traffic_Cop; 840 - Train_Station; 841 - Tribal_Chief; 842 - Twilight; 843 - Uav; 844 - Vacationer_Tourist; 845 - Vandal; 846 - Veterinarian; 847 - Viaducts; 848 - Vineyards; 849 - Voter; 850 - Waiter_Waitress; 851 - Water_Mains; 852 - Windmill; 853 - Wooden_Buildings; 854 - Worker_Laborer
http://www.lscom.org
19
Simulation study suggests ….
“… ‘concept-based’ video retrieval with fewer than 5000 concepts, detected with minimal accuracy of 10% mean average precision is likely to provide high accuracy results, comparable to text retrieval on the web, in a typical broadcast news collection.” *
?
* Alexander Hauptmann, Rong Yan, Wei-Hao Lin, Michael Christel, and Howard Wactlar. Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study With Broadcast News. IEEE Transactions on Multimedia. Vol. 9, No. 5. August 2007 pp. 958-966.
20
A generic TRECVID search system (based on Snoek and Worring 2008 ** )
** Cees G. M. Snoek and Marcel Worring. Concept-Based Video Retrieval. in Foundations and Trends in Information Retrieval Vol. 2, No. 4 (2008) 215-322
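[Figure: block diagram of a generic search system; recoverable component labels listed below]
Indexing side: Shot-segmented video; Basic Concept Detection; Feature Fusion; Classifier Fusion; Modeling Relations; Best of Selection; Database
Search side: SEARCHER; Information need; Query requests; Query Methods; Query Prediction; Query results combination; Visualization; Learning from the searcher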
21
Innovative search interfaces …
http://www-nlpir.nist.gov/projects/tvpubs/tv9.slides/mediamill1.slides.pdf
U. Amsterdam MediaMill
22
Some results
Keyframes from top 20 clips returned by a system to query for “shots of person seated at computer”
23
Variation in Average Precision by topic
Dogs walking …
Printed, typed … text …
Closeup of hand writing …
Crowds of people (270), Building entrance (278), People at desk with computer (287) each had an automatic max better than the interactive max
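Average precision (AP) is the per-topic measure on this slide, and MAP is its mean over topics. As a rough illustration only, a minimal Python sketch of non-interpolated AP and MAP follows. It is not the official TRECVID scoring code; the function names, the shot identifiers, and the dictionary layout of runs and judgments are assumptions made for this example.

```python
from typing import Dict, Sequence, Set


def average_precision(ranked_shots: Sequence[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision for one topic.

    ranked_shots: shot IDs in the order the system returned them
                  (assumed to contain no duplicates).
    relevant:     shot IDs judged relevant for the topic.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant shot occurs.
            precision_sum += hits / rank
    # Relevant shots that were never returned contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, Sequence[str]], judgments: Dict[str, Set[str]]
) -> float:
    """Mean of per-topic AP over all judged topics (hypothetical data layout)."""
    if not judgments:
        return 0.0
    total = sum(
        average_precision(runs.get(topic, []), rel)
        for topic, rel in judgments.items()
    )
    return total / len(judgments)
```

For example, a ranking `["s3", "s7", "s1", "s9"]` with relevant shots `{"s3", "s1", "s5"}` has hits at ranks 1 and 3, so AP = (1/1 + 2/3) / 3. The unreturned relevant shot `s5` lowers the score, which is one reason AP varies so much from topic to topic.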
24
Observations, questions …
One solution will not fit all. Investigations/discussion of video search must be related to the searcher’s specific needs/capabilities/history and to the kinds of data being searched.
The enormous and growing amounts of video require extremely large-scale approaches to video exploitation. Much of it has little or no metadata describing the content in any detail.
TRECVID participants have explored some automatic approaches to tagging and use of those tags in automatic and interactive search systems on a couple of sorts of video. Much has been learned, and some results may already be useful, but most of the territory is still unexplored.
25
Observations, questions …
Within the focus of TRECVID experiments …
Multiple information sources (text, audio, video), each errorful, can yield better results when combined than used alone…
A human in the loop in search still makes an enormous difference.
Text from speech via automatic speech recognition (ASR) is a powerful source of information but:
• Its usefulness varies by video genre
• Not everything/one in a video is talked about, “in the news”
• Audible mentions are often offset in time from visibility
• Not all languages have good ASR
Machine learning approaches to tagging
• yield seemingly useful results against large amounts of data when training data is sufficient and similar to the test data
• but will they work well enough to be useful on highly heterogeneous video?
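The point about combining errorful sources can be illustrated with a simple late-fusion sketch: each source scores shots independently, the scores are rescaled to a common range, and a weighted sum produces the final ranking. This is one generic scheme, not the method of any particular TRECVID system; the source names, weights, and scores below are made up for illustration.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one source's scores to [0, 1] so sources are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # A source that gives every shot the same score carries no ranking signal.
        return {shot: 0.0 for shot in scores}
    return {shot: (s - lo) / (hi - lo) for shot, s in scores.items()}


def late_fusion(
    sources: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized per-source scores, best shot first.

    A shot missing from a source simply gets no contribution from it.
    """
    fused: Dict[str, float] = {}
    for name, scores in sources.items():
        weight = weights.get(name, 0.0)
        for shot, score in min_max_normalize(scores).items():
            fused[shot] = fused.get(shot, 0.0) + weight * score
    # Sort by descending fused score; break ties by shot ID for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores from two sources for the same query.
example_sources = {
    "asr_text": {"shot1": 2.0, "shot2": 0.5, "shot3": 0.0},
    "visual": {"shot1": 0.2, "shot2": 0.9, "shot4": 0.6},
}
example_weights = {"asr_text": 0.5, "visual": 0.5}
example_ranking = late_fusion(example_sources, example_weights)
```

In this toy case neither source alone puts `shot2` far ahead, but it is the only shot with support from both, so it ranks first after fusion. In practice the weights would need tuning per genre, since (as noted above) the usefulness of ASR varies.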
26
Observations, questions …
Within the focus of TRECVID experiments …
A hierarchy of automatically derived features can help bridge the gap between pixels and meaning and can assist search, but problems abound:
• What is the right set of features for a given application?
• Given a query, how do you automatically decide which specific features to use?
• Creating quality training data, even with active learning, is very expensive
Searchers (experts and non-experts) will use more than text queries if available: concepts, visual similarity, temporal browsing, positive and negative relevance feedback, … http://www.videolympics.org
Processing video using a sample of more than one frame per shot yields better results but quickly pushes common hardware configurations to their limits
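To make the question of choosing features for a given query concrete, here is a deliberately naive baseline: match query words against concept names and keep the best-overlapping concepts. It is an illustrative assumption, not a technique taken from these slides; real systems typically go well beyond lexical overlap. The stopword list, function names, and scoring rule are all hypothetical, while the concept names come from the TRECVID 2009 feature list.

```python
import re
from typing import List, Set, Tuple

# Small, hand-picked stopword list (an assumption for this sketch).
STOPWORDS: Set[str] = {
    "find", "shots", "of", "a", "an", "the", "on", "or", "and",
    "with", "in", "at", "each", "one", "more",
}


def content_words(text: str) -> Set[str]:
    """Lowercase alphabetic tokens with stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def select_concepts(
    query: str, concept_names: List[str], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Rank concepts by the fraction of their name words found in the query."""
    query_words = content_words(query)
    scored = []
    for name in concept_names:
        name_words = content_words(name)
        if not name_words:
            continue
        overlap = len(query_words & name_words)
        if overlap:
            scored.append((name, overlap / len(name_words)))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# A few concept names from the TRECVID 2009 feature list.
EXAMPLE_CONCEPTS = [
    "Classroom",
    "Telephone",
    "Person-playing-a-musical-instrument",
    "Person-riding-a-bicycle",
    "Singing",
    "Boat-Ship",
]
example_selection = select_concepts(
    "Find shots of a person talking on a telephone.", EXAMPLE_CONCEPTS
)
```

The sketch selects `Telephone` first, which is sensible, but it also returns `Person-riding-a-bicycle` only because the word "person" matches. That failure shows why automatic feature selection remains an open problem rather than a string-matching exercise.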
27
Observations, questions …
Within the focus of TRECVID experiments …
TRECVID has only just started looking at combining automatically derived and manually provided evidence in search
• Systems have been using externally annotated video (e.g., Flickr) but results are not conclusive
• Internet Archive video will provide titles, keywords, descriptions
• Where in the Panofsky hierarchy are the donors’ descriptions? If very personal, does that mean less useful for other people?
Need observational studies of real searching of various sorts using current functionality and identifying unmet needs
Need more access for researchers to much more multimedia data of varying kinds, mixtures, with and without human annotation
28
Observations, questions …
Time to take some of the ideas developed in the laboratory out for small-scale testing with real users with real needs and real video collections?