TRECVID: Promoting Research Via Community Technology Evaluations

Paul Over
TRECVID Project Leader, Information Access Division
National Institute of Standards and Technology, Gaithersburg, MD, USA

DMASM 2011

http://trecvid.nist.gov

What is TRECVID?

Workshop series (2001 – present)
• to promote research/progress in content-based video analysis/exploitation

Foundation for large-scale laboratory testing

Forum for the
• exchange of research ideas
• discussion of research methodology – what works, what doesn’t, and why

Focus: content-based approaches to
• retrieval/detection/summarization/segmentation/…

Aims for realistic system tasks and test collections
• unfiltered data
• focus on relatively high-level functionality (e.g., interactive search)
• measurement against human abilities

Provides data, tasks, and uniform, appropriate scoring procedures

TRECVID Philosophy

TRECVID is a modern example of the Cranfield tradition
• Laboratory system evaluation based on test collections

Emphasis on advancing the state of the art from evaluation results
• TRECVID’s primary aim is not competitive product benchmarking
• experimental workshop: sometimes experiments fail!

Laboratory experiments (vs., e.g., observational studies)
• sacrifice operational realism and broad scope of conclusions
• for control and information about causality – what works and why
• results tend to be narrow, at best indicative, not final
• evidence grows as approaches prove themselves repeatedly, as part of various systems, against various test data, over years

TRECVID Yearly Cycle

Post-workshop experiments, final papers

Results Evaluation

TRECVID Workshop

Results analysis and workshop paper/presentation preparation ~400 authors /year

System building & experimentation; Community contributions (shots, training data, ASR, MT, etc.)

Search topic, ground truth development

Task definitions complete

Call for Participation

Data Procurement

TRECVID’s Evolution

[Figure: timeline chart covering 2003–2011, with three panels.]

Tasks: Shot boundaries; Ad hoc search; Features/semantic indexing; Stories; Camera motion; BBC rushes; Summaries; Copy detection; Surveillance events; Known-item search; Instance search pilot; Multimedia event detection (MED) pilot

Data (hours, axis 0–2000), new development or test data as added: English TV news; TV news; BBC rushes; Sound & Vision (S&V); Airport surveillance; Internet Archive Creative Commons (IACC); HAVIC

Participating teams (axis 0–120): Applied vs. Finished

TRECVID 2010 Tasks and Data

Internet Archive – Creative Commons (IACC) [video, title, keywords, description]

Sound and Vision [video]

Airport surveillance [video]

HAVIC - Internet multimedia [video]

Known-item search from text-only query

Instance search from multiple frames with bounding boxes

Surveillance event detection

Multimedia event detection

Semantic indexing (automatic assignment of ~150 tags)

Content-based copy detection


TV2010 Finishers

Groups finished | Task code | Task name
22 | CCD | Copy detection
11 | SED | Surveillance event detection
39 | SIN | Semantic indexing
15 | KIS | Known-item search
5 | MED | Multimedia event detection pilot
15 | INS | Instance search pilot

[Figure: pie chart of unique finishing teams by region. Values shown: 32, 27, 12, 1, 1. Legend: Asia, Europe, North America, South America, Africa.]

Support: National Institute of Standards and Technology (NIST), Intelligence Advanced Research Projects Activity (IARPA), Department of Homeland Security (DHS)

Contributors:

Brewster Kahle (Internet Archive's founder) and R. Manmatha (U. Mass, Amherst) suggested in December of 2008 that TRECVID take another look at the resources of the Archive.

Cara Binder and Raj Kumar @ archive.org helped explain how to query and download automatically from the Internet Archive.

Georges Quénot with Franck Thollard, Andy Tseng, Bahjat Safadi from LIG and Stéphane Ayache from LIF shared coordination of the semantic indexing task and organized additional judging with support from the Quaero program.

Georges Quénot and Stéphane Ayache again organized a collaborative annotation of 130 features.

Shin'ichi Satoh at NII along with Alan Smeaton and Brian Boyle at DCU arranged for the mirroring of the video data.

Colum Foley and Kevin McGuinness (DCU) helped segment the instance search topic examples and set up the oracle at DCU for interactive systems in the known-item search task.

The LIMSI Spoken Language Processing Group and VexSys Research provided ASR for the IACC.1 videos.

Laurent Joyeux (INRIA-Rocquencourt) updated the copy detection query generation code.

Matthijs Douze from INRIA-LEAR volunteered a camcorder simulator to automate the camcording transformation for the copy detection task.

Emine Yilmaz (Microsoft Research) and Evangelos Kanoulas (U. Sheffield) updated their xinfAP code (sample_eval.pl) to estimate additional values and made it available.

Some impacts …

Continuing improvement in feature detection (automatic tagging) in the University of Amsterdam’s MediaMill system
• Performance on 36 features roughly doubled: 2006 -> 2009
• Within domain (train and test): MAP 0.22 -> 0.41
• Cross domain: MAP 0.13 -> 0.27

Bibliometric study of TRECVID’s scholarly impact, 2003 - 2009 (Dublin City University & University College Dublin)
• 2073 peer-reviewed journal/conference papers

2010 RTI International economic impact study of TREC/TRECVID
“… for every $1 that NIST and its partners invested in TREC[/TRECVID], at least $3.35 to $5.07 in benefits accrued to IR [Information Retrieval] researchers”

TRECVID search types so far

TRECVID search has modeled a user looking for video shots for reuse
• of people, objects, locations, events
• not just information (e.g., video of X, not video of someone talking about X)
• independent of original intent, saliency, etc.

in video of various sorts (without metadata other than file names):
• multilingual broadcast news (Arabic, Chinese, English)
• Dutch “edutainment”, cultural, news magazine, historical shows

using queries containing:
• text only
• text + image/video examples
• image/video examples only

in two modes:
• fully automatic
• human-in-the-loop search

Panofsky/Shatford mode/facet matrix **

Facet | Specific (Iconographic) | Generic (Pre-iconographic) | Abstract (Iconological)
Who | Individually named person, group, thing | Kind of person, thing | Mythical, fictitious being
What | Individually named event, action | Kind of event, action, condition | Emotion, abstraction
Where | Individually named geographical location | Kind of place, geographical, architectural | Place symbolized
When | Linear time: date or period | Cyclical time: season, time of day | Emotion, abstraction symbolized by time

** From Enser, Peter G. B. and Sandom, Chriss J. Retrieval of Archival Moving Imagery – CBIR Outside the Frame. CIVR 2002. LNCS 2383, pp. 206-214.

24 Topics from TRECVID 2009

• Find shots of a road taken from a moving vehicle through the front window.
• Find shots of a crowd of people, outdoors, filling more than half of the frame area.
• Find shots with a view of one or more tall buildings (more than 4 stories) and the top story visible.
• Find shots of a person talking on a telephone.
• Find shots of a close-up of a hand, writing, drawing, coloring, or painting.
• Find shots of exactly two people sitting at a table.
• Find shots of one or more people, each walking up one or more steps.
• Find shots of one or more dogs, walking, running, or jumping.
• Find shots of a person talking behind a microphone.
• Find shots of a building entrance.
• Find shots of people shaking hands.
• Find shots of a microscope.
• Find shots of two or more people, each singing and/or playing a musical instrument.
• Find shots of a person pointing.
• Find shots of a person playing a piano.
• Find shots of a street scene at night.
• Find shots of printed, typed, or handwritten text, filling more than half of the frame area.
• Find shots of something burning with flames visible.
• Find shots of one or more people, each at a table or desk with a computer visible.
• Find shots of an airplane or helicopter on the ground, seen from outside.
• Find shots of one or more people, each sitting in a chair, talking.
• Find shots of one or more ships or boats, in the water.
• Find shots of a train in motion, seen from outside.
• Find shots with the camera zooming in on a person's face.

Drilling down in the search landscape

Data types

Searcher abilities, needs, preferences, history

Example searches:
• Documentary producer searches TV archive for reusable shots of Berlin in the 1920s
• Student searches Web for new music video
• Your mother searches home videos for shots of daughter playing with family pet
• Voter looks for video of candidate X at recent town hall meeting
• Intelligence analyst searches multilingual open source video for background info on location X
• Security personnel search surveillance video archive for suspicious behavior
• Fan searches for favorite TV show episode
• 10-year-old looks for video of tigers for school report
• Doctor searches echocardiogram videos for instances like example
• You want something to make you laugh

TRECVID draws on:
• Human-computer interaction: human visual capabilities, expert vs novice, text/image/concept querying, visualization, …
• Information retrieval: indexing, query typing, concept selection, weighting, ranking, pos/neg relevance feedback, metadata, …
• Machine vision: segmentation, keypoints, SIFT, classifier fusion, face recognition, …
• Machine learning: SVM, GMM, graphical models, boosting, …
• Metrology: metrics, data, task definition, ground truth, significance, …
• …

Finding meaning in text (words) versus images (pixels)

Hurricane Andrew which hit the Florida coast south of Miami in late August 1992 was at the time the most expensive disaster in US history. Andrew's damage in Florida cost the insurance industry about $8 billion. There were fifteen deaths, severe property damage, 1.2 million homes were left without electricity, and in Dade county alone 250,000 were left homeless.


One image/video – many different (changing) views of content

Creator’s keywords: “stupid sister”

Possible content keywords, tags: women, pigeons, plaza, buildings, outdoors, daytime, running, falling, clapping, …

http://www.archive.org/details/StupidSister

One person/thing/location – many different (changing) appearances


Can multimedia features serve as “words”?

Low-level
– Color
– Texture
– Shape

High-level
– 449 annotated LSCOM features
– 39 LSCOM-Lite
– TRECVID 2009: Classroom, Chair, Infant, Traffic intersection, Doorway, Airplane-flying, Person-playing-a-musical-instrument, Bus, Person-playing-soccer, Cityscape, Person-riding-a-bicycle, Telephone, Person-eating, Demonstration-Or-Protest, Hand, People-dancing, Nighttime, Boat-Ship, Female-human-face-closeup, Singing

Text from
– speech
– video OCR

LSCOM feature sample

000 - Parade
001 - Exiting_Car
002 - Handshaking
003 - Running
004 - Airplane_Crash
005 - Earthquake
006 - Demonstration_Or_Protest
007 - People_Crying
008 - Airplane_Takeoff
009 - Airplane_Landing
010 - Helicopter_Hovering
011 - Golf
012 - Walking
013 - Singing
014 - Baseball
015 - Basketball
016 - Football
017 - Soccer
018 - Tennis
019 - Speaking_To_Camera
020 - Riot
021 - Natural_Disasters
022 - Tornado
023 - Ice_Skating
024 - Snow
025 - Flood
026 - Skiing
027 - Talking
028 - Dancing
029 - Car_Crash
030 - Funeral
031 - Gymnastics
032 - Rocket_Launching
033 - Cheering
034 - Greeting
035 - Throwing
036 - Shooting
037 - Address_Or_Speech
038 - Bomber_Bombing
039 - Celebration_Or_Party
040 - Airport
041 - Barn
042 - Castle
043 - College
044 - Courthouse
045 - Fire_Station
046 - Gas_Station
047 - Grain_Elevator
048 - Greenhouse
049 - Hangar
050 - Hospital
051 - Hotel
052 - House_Of_Worship
053 - Police_Station
054 - Power_Plant
055 - Processing_Plant
056 - School
057 - Shopping_Mall
058 - Stadium
059 - Supermarket
060 - Airport_Or_Airfield
061 - Aqueduct
062 - Avalanche
063 - River_Bank
064 - Aircraft_Cabin
. . .
810 - Still_Image_Composition_May_Include_Text
811 - Stock_Exchange
812 - Stockyard
813 - Storage_Tanks
814 - Store_Outside
815 - Street_Signs
816 - Street_Vendor
817 - Students_Schoolkids
818 - Suitcases
819 - Surgeons
820 - Sword
821 - Synagogue
822 - Tailor
823 - Tanneries
824 - Taxi_Driver
825 - Teacher
826 - Team_Organized_Group
827 - Technicians
828 - Teenagers
829 - Temples
830 - Terrorist
831 - Text_Only_Artificial_Bkgd
832 - Thatched_Roof_Buildings
833 - Theater
834 - Toddlers
835 - Town_Halls
836 - Town_Squares
837 - Townhouse
838 - Tractor
839 - Traffic_Cop
840 - Train_Station
841 - Tribal_Chief
842 - Twilight
843 - Uav
844 - Vacationer_Tourist
845 - Vandal
846 - Veterinarian
847 - Viaducts
848 - Vineyards
849 - Voter
850 - Waiter_Waitress
851 - Water_Mains
852 - Windmill
853 - Wooden_Buildings
854 - Worker_Laborer

http://www.lscom.org

Simulation study suggests ….

“… ‘concept-based’ video retrieval with fewer than 5000 concepts, detected with minimal accuracy of 10% mean average precision is likely to provide high accuracy results, comparable to text retrieval on the web, in a typical broadcast news collection.” *

?

* Alexander Hauptmann, Rong Yan, Wei-Hao Lin, Michael Christel, and Howard Wactlar. Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study With Broadcast News. IEEE Transactions on Multimedia, Vol. 9, No. 5, August 2007, pp. 958-966.

A generic TRECVID search system (based on Snoek and Worring 2008 ** )

** Cees G. M. Snoek and Marcel Worring. Concept-Based Video Retrieval. in Foundations and Trends in Information Retrieval Vol. 2, No. 4 (2008) 215-322

[Figure: system diagram. Labeled components, as far as they can be recovered from the slide:]

Indexing: Shot-segmented video; Basic Concept Detection; Feature Fusion; Classifier Fusion; Modeling Relations; Best of Selection; Database

Search: SEARCHER; Information need; Query requests; Query Methods; Query Prediction; Query results combination; Visualization; Learning from the searcher

Innovative search interfaces …

U. Amsterdam MediaMill

http://www-nlpir.nist.gov/projects/tvpubs/tv9.slides/mediamill1.slides.pdf

Some results

Keyframes from top 20 clips returned by a system to query for “shots of person seated at computer”

Variation in Average Precision by topic

[Figure: average precision by topic. Labeled topics include “Dogs walking …”, “Printed, typed … text …”, and “Closeup of hand writing …”.]

Crowds of people (270), Building entrance (278), and People at desk with computer (287) each had an automatic max better than the interactive max.
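The per-topic scores on this slide are average precision values. As a minimal sketch, assuming the standard non-interpolated definition with binary relevance judgments and a ranked list without duplicate shots, average precision for one topic could be computed as follows. The function names and shot IDs are made up for illustration; this is not TRECVID's own scoring code.

```python
def average_precision(ranked_shots, relevant_shots):
    """Non-interpolated average precision for a single topic.

    ranked_shots: shot IDs in the order a system returned them (no duplicates).
    relevant_shots: shot IDs judged relevant for the topic.
    """
    relevant = set(relevant_shots)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            # Precision at the rank where this relevant shot was found.
            precision_sum += hits / rank
    # Relevant shots that were never returned contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of per-topic AP; `runs` is a list of (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical example: 2 of 3 relevant shots found, at ranks 1 and 3.
ranked = ["s3", "s7", "s1", "s9", "s4"]
relevant = {"s3", "s1", "s8"}
ap = average_precision(ranked, relevant)  # (1/1 + 2/3) / 3
```

Because the sum is divided by the number of relevant shots rather than the number retrieved, a topic with many relevant shots that a system misses scores low even if its top results are all correct, which is one reason scores vary so much from topic to topic.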


Observations, questions …

One solution will not fit all. Investigations/discussion of video search must be related to the searcher’s specific needs/capabilities/history and to the kinds of data being searched.

The enormous and growing amounts of video require extremely large-scale approaches to video exploitation. Much of it has little or no metadata describing the content in any detail.

TRECVID participants have explored some automatic approaches to tagging and use of those tags in automatic and interactive search systems on a couple of sorts of video. Much has been learned, some results may already be useful, but most of the territory is still unexplored.

Observations, questions …

Within the focus of TRECVID experiments …

Multiple information sources (text, audio, video), each errorful, can yield better results when combined than used alone…

A human in the loop in search still makes an enormous difference.

Text from speech via automatic speech recognition (ASR) is a powerful source of information but:
• Its usefulness varies by video genre
• Not everything/one in a video is talked about, “in the news”
• Audible mentions are often offset in time from visibility
• Not all languages have good ASR

Machine learning approaches to tagging
• yield seemingly useful results against large amounts of data when training data is sufficient and similar to the test data
• but will they work well enough to be useful on highly heterogeneous video?
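The first observation on this slide, that several errorful sources can do better combined than alone, is often realized as score-level ("late") fusion. The sketch below is one simple variant, assumed here for illustration only: min-max normalization per source followed by a weighted sum. The source names, weights, and scores are hypothetical, each source is assumed to score at least one shot, and no particular TRECVID system is being described.

```python
def min_max_normalize(scores):
    """Rescale one source's scores to [0, 1] so sources are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {shot: 0.0 for shot in scores}
    return {shot: (s - lo) / (hi - lo) for shot, s in scores.items()}


def late_fusion(source_scores, weights):
    """Weighted sum of normalized per-source scores; returns ranked shot IDs.

    A shot missing from a source simply gets no contribution from it.
    """
    fused = {}
    for name, scores in source_scores.items():
        weight = weights[name]
        for shot, s in min_max_normalize(scores).items():
            fused[shot] = fused.get(shot, 0.0) + weight * s
    # Highest fused score first; ties broken by shot ID for determinism.
    return sorted(fused, key=lambda shot: (-fused[shot], shot))


# Hypothetical scores from an ASR-text matcher and a visual concept detector.
sources = {
    "asr_text": {"s1": 2.0, "s2": 0.5, "s3": 0.0},
    "visual_concepts": {"s1": 0.2, "s2": 0.9, "s4": 0.6},
}
weights = {"asr_text": 0.5, "visual_concepts": 0.5}
ranking = late_fusion(sources, weights)
```

In this toy example the shot ranked first by the fused score (s2) is not the top shot of the ASR source, which illustrates how a second, independent source can reorder results that one source alone would get wrong.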


Observations, questions …

Within the focus of TRECVID experiments …

A hierarchy of automatically derived features can help bridge the gap between pixels and meaning and can assist search - but problems abound:

What is the right set of features for a given application?

Given a query, how do you automatically decide which specific features to use?

Creating quality training data, even with active learning, is very expensive.

Searchers (experts and non-experts) will use more than text queries if available: concepts, visual similarity, temporal browsing, positive and negative relevance feedback, … (http://www.videolympics.org)

Processing video using a sample of more than one frame per shot yields better results but quickly pushes common hardware configurations to their limits.
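To make the question "given a query, how do you automatically decide which specific features to use?" concrete, here is a deliberately naive baseline: pick the concept detectors whose names share a word with the query. This is an assumption made for illustration, not a method attributed to any TRECVID participant. The stopword list and function names are hypothetical, and the concept names are taken from the feature list shown earlier in the deck.

```python
import re

# Small, hypothetical stopword list; a real system would need a better one.
STOPWORDS = frozenset({"a", "of", "or", "the", "find", "shots", "on", "in"})


def tokens(text):
    """Lowercased alphabetic tokens, so 'Boat-Ship' -> {'boat', 'ship'}."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_concepts(query, concept_names):
    """Return concept names sharing at least one non-stopword with the query,
    ordered by overlap size (largest first), then alphabetically."""
    query_terms = tokens(query) - STOPWORDS
    scored = []
    for name in concept_names:
        overlap = len(query_terms & (tokens(name) - STOPWORDS))
        if overlap:
            scored.append((overlap, name))
    return [name for _, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


concepts = [
    "Telephone",
    "Boat-Ship",
    "Person-playing-a-musical-instrument",
    "Nighttime",
    "Cityscape",
]
selected = select_concepts(
    "Find shots of a person talking on a telephone", concepts
)
```

On this example the baseline returns "Person-playing-a-musical-instrument" alongside "Telephone", because both share a word with the query. That spurious match is a small illustration of why concept selection is listed here as an open problem.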

Within the focus of TRECVID experiments …

TRECVID has only just started looking at combining automatically derived and manually provided evidence in search

Systems have been using externally annotated video (e.g. Flickr) but results are not conclusive

Internet Archive video will provide titles, keywords, descriptions

Where in the Panofsky hierarchy are the donors’ descriptions? If very personal, does that mean less useful for other people?

Need observational studies of real searching of various sorts using current functionality and identifying unmet needs

Need more access for researchers to much more multimedia data of varying kinds, mixtures, with and without human annotation


Time to take some of the ideas developed in the laboratory out for small-scale testing with real users with real needs and real video collections?
