Text REtrieval Conference (TREC)
Overview of TREC 2006
Ellen Voorhees
Sponsored by: NIST, DTO
Text REtrieval Conference (TREC)
TREC 2006 Program Committee
Ellen Voorhees (chair), James Allan, Chris Buckley, Gord Cormack, Sue Dumais,
Donna Harman, Bill Hersh, David Lewis, John Prager, Steve Robertson,
Mark Sanderson, Ian Soboroff, Karen Sparck Jones, Richard Tong, Ross Wilkinson
Text REtrieval Conference (TREC)
TREC 2006 Track Coordinators
Blog: Ounis, de Rijke, Macdonald, Mishne, Soboroff
Enterprise: Nick Craswell, Arjen de Vries, Ian Soboroff
Genomics: William Hersh
Legal: Jason Baron, David Lewis, Doug Oard
Question Answering: Hoa Dang, Jimmy Lin, Diane Kelly
Spam: Gordon Cormack
Terabyte: Stefan Büttcher, Charles Clarke, Ian Soboroff
Text REtrieval Conference (TREC)
Arizona State U. Inst. for Infocomm Research Queensland U. of Technology U. & Hospitals of Geneva
Australian Nat. U. & CSIRO ITC-irst Ricoh Software Research Ctr U. of Illinois Chicago (2)
Beijing U. Posts & Telecom Jozef Stefan Institute RMIT U. U. Illinois Urbana Champaign
Carnegie Mellon U. Kyoto U. Robert Gordon U. U. of Iowa
Case Western Reserve U. Language Computer Corp. (2) Saarland U. U. Karlsruhe & CMU
Chinese Acad. Sciences (2) LexiClone Sabir Research, Inc. U. of Limerick
The Chinese U. of Hong Kong LowLands Team Shanghai Jiao Tong U. U. MD Baltimore Cnty & APL
City U. of London Macquarie U. Stan Tomlinson U. of Maryland
CL Research Massachusetts Inst. Tech. SUNY Buffalo U. of Massachusetts
Concordia U. (2) Massey U. Technion-Israel Inst. Tech. The U. of Melbourne
Coveo Solutions Inc. Max-Planck Inst. Comp. Sci. Tokyo Inst. of Technology U. degli Studi di Milano
CRM114 The MITRE Corp. TrulyIntelligent Technologies U. of Missouri Kansas City
Dalhousie U. National Inst of Informatics Tsinghua U. U. de Neuchatel
DaLian U. of Technology National Lib. of Medicine Tufts U. U. of Pisa
Dublin City U. National Security Agency UCHSC at Fitzsimons U. of Pittsburgh
Ecole Mines de Saint-Etienne National Taiwan U. U. of Alaska Fairbanks U. Rome “La Sapienza”
Erasmus, TNO, & U. Twente National U. of Singapore U. of Albany U. of Sheffield
Fidelis Assis NEC Labs America, Inc. U. of Amsterdam (2) U. of Strathclyde
Fudan U. (2) Northeastern U. U. of Arkansas Little Rock U. of Tokyo
Harbin Inst. of Technology The Open U. U. of Calif., Berkeley U. Ulster & St. Petersburg
Humboldt U. & Strato AG Oregon Health & Sci. U. U. of Calif., Santa Cruz U. of Washington
Hummingbird Peking U. U. Edinburgh U. of Waterloo
IBM Research Haifa Polytechnic U. U. of Glasgow U. Wisconsin
IBM Watson Research Purdue U. & CMU U. of Guelph Weill Med. College, Cornell
Illinois Inst. of Technology Queen Mary U. of London U. of Hannover Berlin York U.
Indiana U.
TREC 2006 Participants
Text REtrieval Conference (TREC)
TREC Goals
• To increase research in information retrieval based on large-scale collections
• To provide an open forum for exchange of research ideas to increase communication among academia, industry, and government
• To facilitate technology transfer between research labs and commercial products
• To improve evaluation methodologies and measures for information retrieval
• To create a series of test collections covering different aspects of information retrieval
Text REtrieval Conference (TREC)
The TREC Tracks, 1992–2006
• Static text: Ad Hoc, Robust
• Streamed text: Routing, Filtering
• Human-in-the-loop: Interactive, HARD
• Beyond just English: Spanish, Chinese, X→{X,Y,Z} (cross-language)
• Beyond text: OCR, Speech, Video
• Web searching, size: VLC, Web, Terabyte, Enterprise
• Answers, not docs: Q&A, Novelty
• Retrieval in a domain: Genomics, Legal
• Personal documents: Spam, Blog
Text REtrieval Conference (TREC)
Common Terminology
• “Document” broadly interpreted, for example:
  – email message in the enterprise and spam tracks
  – blog posting plus comments in the blog track
• Classical IR tasks
  – ad hoc search: collection known; new queries
  – routing/filtering: standing queries; streaming document set
  – known-item search: find a partially remembered specific document
  – question answering
Text REtrieval Conference (TREC)
TREC 2006 Themes
• Explore broader information contexts
  – different document genres
    • newswire (QA); web (terabyte)
    • blogs; email (enterprise, spam); corporate repositories (legal, enterprise); scientific reports (genomics, legal)
  – different tasks
    • ad hoc (terabyte, enterprise discussion, legal, genomics); known-item (terabyte); classification (spam)
    • specific responses (QA, genomics, enterprise expert); opinion finding (blog)
Text REtrieval Conference (TREC)
TREC 2006 Themes
• Construct evaluation methodologies
  – fair comparisons on massive data sets (terabyte, legal)
  – quality of a specific response (QA, genomics)
  – evaluation strategies balancing realism and privacy protection (spam, enterprise)
  – distributed efficiency benchmarking (terabyte)
Text REtrieval Conference (TREC)
Terabyte Track
• Motivations
  – investigate evaluation methodology for collections substantially larger than existing TREC collections
  – provide a test collection for exploring system issues related to size
• Tasks
  – traditional ad hoc retrieval task
  – named page finding task
  – efficiency task: systems required to report various timing and resource statistics
Text REtrieval Conference (TREC)
Terabyte Collection
• Documents
  – ~25,000,000 web documents (426 GB)
  – spidered in early 2004 from the .gov domain
  – includes text from PDF, Word, etc. files
• Topics
  – 50 new information-seeking topics for ad hoc
  – 181 named page queries contributed by participants
  – 100,000 queries from web logs for efficiency
• Judgments
  – new pooling methodology in ad hoc to address bias
  – duplicate detection using DECO
Text REtrieval Conference (TREC)
Ad Hoc Task
• Emphasis on exploring pooling strategies
  – task guidelines strongly encouraged manual runs to help the pools
    • [unspecified] prize offered for the most unique relevant documents
  – judgment budget allocated among 3 pools (a toy pooling sketch follows the list)
    • traditional pools to depth 50; used for scoring runs by trec_eval
    • traditional pools starting at rank 400; not used for scoring
    • sample of documents per topic such that the number judged is topic-dependent; used for scoring runs by inferred AP [Yilmaz & Aslam]
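The depth-50 pools above amount to taking the union of each run's top-ranked documents per topic. A minimal sketch, assuming runs are held in memory as dicts of ranked doc-id lists; the `build_pool` helper and the toy run data are illustrative, not NIST's actual tooling:

```python
# Minimal sketch of traditional depth-k pooling: the judgment pool for a topic
# is the union of the top-k documents from every submitted run.

def build_pool(runs, topic, depth=50):
    """runs: dict mapping run_id -> dict mapping topic -> ranked list of doc ids."""
    pool = set()
    for ranking in (run[topic] for run in runs.values()):
        pool.update(ranking[:depth])
    return pool

# Hypothetical usage: two toy runs over one topic, pooled to depth 2.
runs = {
    "runA": {"801": ["d3", "d7", "d1"]},
    "runB": {"801": ["d7", "d9", "d2"]},
}
print(sorted(build_pool(runs, "801", depth=2)))  # ['d3', 'd7', 'd9']
```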
Text REtrieval Conference (TREC)
Ad Hoc Task Results
Automatic, title-only runs; 50 new topics
[Figure: bpref, MAP, and inferred AP (infAP) scores for each of the top runs; y-axis from 0 to 0.5.]
Text REtrieval Conference (TREC)
Named Page Finding Results
[Figure: MRR and percentage of topics with no named page found for the top runs: indri06Nsdp, uogTB06MP, CoveoNPRun2, THUNPNOSTOP, humTN06dp1, MU06TBn5, zetnpft, uwmtFnpsRR1, UAmsT06n3SUM, ictb0603, TWTB06NP02.]
Text REtrieval Conference (TREC)
Efficiency Task
• Goal: compare efficiency/scalability despite different hardware
• 100,000 queries vetted for hits in corpus and seeded with queries from other tasks
• queries divided into 4 streams; queries within a stream had to be processed serially, while queries from different streams could be interleaved arbitrarily
• participants also reported efficiency measures for running an open-source retrieval system with known characteristics, for normalization across hardware
• P(20) or MRR measured for seeded queries
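P(20) and MRR for the seeded queries follow the standard definitions. A minimal sketch, assuming a ranking is a list of doc ids and `relevant` is the judged set; all names and data here are illustrative:

```python
# Toy effectiveness measures for the seeded efficiency queries.

def precision_at(ranking, relevant, k=20):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant document, or 0 if none is retrieved."""
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

ranking = ["d4", "d9", "d1"]
relevant = {"d9"}
print(precision_at(ranking, relevant, k=20))  # 0.05
print(reciprocal_rank(ranking, relevant))     # 0.5
```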
Text REtrieval Conference (TREC)
Efficiency Task Results
[Figure: P(20) and query latency in milliseconds (log scale) for the fastest and best runs from MU, CWI, RMIT, MPI, UWMT, P6, HUM, and THU; labelled latencies range from 13 ms to 4630 ms.]
Text REtrieval Conference (TREC)
Question Answering Track
• Goal: return answers, not document lists
• Tasks
  – main task: question series that define a target; answer the series’ factoid and list questions, plus return other information about the target not covered by the previous questions
  – complex interactive question answering (ciQA)
• AQUAINT document collection is the source of answers for all tasks
  – 3 GB of text; approx. 1,033,000 newswire articles
Text REtrieval Conference (TREC)
Question Series
• 75 series in the test set, with 6–9 questions per series
  – targets: 19 People, 19 Organizations, 18 Things, 19 Events
  – 403 total factoid questions, 89 total list questions, 75 total “other” questions
• Example series: target 185, “Iditarod race”
  185.1  FACT  In what city does the Iditarod start?
  185.2  FACT  In what city does the Iditarod end?
  185.3  FACT  In what month is it held?
  185.4  FACT  Who is the founder of the Iditarod?
  185.5  LIST  Name people who have run the Iditarod.
  185.6  FACT  How many miles long is the Iditarod?
  185.7  FACT  What is the record time in which the Iditarod was won?
  185.8  LIST  Which companies have sponsored the Iditarod?
  185.9  OTHER
Text REtrieval Conference (TREC)
Main task
• Similar to the previous 2 years, but:
  – questions required to be answered within a particular time frame
    • present tense → most recent answer in the corpus
    • past tense → implicit default reference to the timeframe of the series
  – factoid questions relatively less important in the final score
  – conscious attempt to make questions somewhat more difficult (e.g., requiring temporal processing)
Text REtrieval Conference (TREC)
Series Score
• Score a series using an equally weighted average of its components (a toy sketch follows):
    Score = (FactoidScore + ListScore + OtherScore) / 3
• Component score is the mean of the scores for questions of that type in the given series
• FactoidScore: average accuracy; an individual question has a score of 1 or 0
• ListScore: average F measure score; recall & precision of the response computed against the entire set of known answers
• OtherScore: F(β=3) score for the series’ Other question, calculated using “nugget” evaluation
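A minimal sketch of the equally weighted series score, assuming the three component scores have already been produced by their own evaluations; the `f_beta` helper and the example numbers are illustrative, not official values:

```python
# Toy series score: equal weighting of factoid, list, and "other" components.

def f_beta(precision, recall, beta=1.0):
    """General F measure; beta > 1 weights recall more heavily."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def series_score(factoid_accuracy, list_f, other_f3):
    # Equal weighting of the three components, per the slide.
    return (factoid_accuracy + list_f + other_f3) / 3.0

# Hypothetical series: 5 of 7 factoids correct, mean list F = 0.3, other F(3) = 0.25.
print(round(series_score(5 / 7, 0.3, 0.25), 3))  # 0.421
```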
Text REtrieval Conference (TREC)
Series Task Results
[Figure: average series score for the top runs (lccPA06, LCCFerret, cuhkqaepisto, ed06qar1, FDUQAT15A, NUSCHUAQA3, QACTIS06A, ILQUA1, QASCU1, Roma2006run3, InsunQA06), broken down into factoid, list, and other components.]
Text REtrieval Conference (TREC)
Pyramid Nugget Scoring
• Variant of nugget scoring for the ‘Other’ questions
  – suggested by Lin and Demner-Fushman [2006]
  – the vital/okay distinction made by a single assessor is a major contributor to the instability of the evaluation; thus have multiple judges “vote” on nuggets, and use the percentage voting for a nugget as that nugget’s weight
  – each QA assessor is given the list of nuggets created by the primary assessor for a series and asked simply to mark each one vital, okay, or not a nugget
  – the pyramid of votes does not represent any single user, but it does increase stability, and average series scores are highly correlated with single-assessor scores (a toy scoring sketch follows)
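A toy sketch of pyramid-weighted nugget scoring under the assumptions stated on the slide: a nugget's weight is the fraction of assessors who called it vital, and precision keeps the usual 100-character length allowance per matched nugget. The function name, data layout, and allowance constant are illustrative, not the official evaluation code:

```python
# Toy pyramid-weighted nugget F(beta=3) score.
# votes: nugget id -> fraction of assessors who marked the nugget vital.
# matched: set of nugget ids the system response contains.
# response_length: total character length of the response.

ALLOWANCE = 100  # characters granted per matched nugget (illustrative constant)

def pyramid_f3(votes, matched, response_length):
    total_weight = sum(votes.values())
    recall = sum(votes[n] for n in matched) / total_weight if total_weight else 0.0
    allowed = ALLOWANCE * len(matched)
    precision = 1.0 if response_length <= allowed else allowed / response_length
    beta2 = 9.0  # beta = 3 emphasizes recall
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

votes = {"n1": 1.0, "n2": 0.6, "n3": 0.2}
print(round(pyramid_f3(votes, matched={"n1", "n2"}, response_length=350), 3))  # 0.842
```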
Text REtrieval Conference (TREC)
Single vs. Pyramid Nugget Scores
[Figure: scatter plot of pyramid average F(3) score against primary-assessor average F(3) score; both axes run from 0 to 0.3.]
Text REtrieval Conference (TREC)
Complex Interactive QA
• Goals
  – investigate richer user contexts within QA
  – have (limited) actual interaction with the user
• Task inspired by the TREC 2005 relationship QA task and the HARD track
  – “essay” questions
  – interaction forms allowed participants to solicit information from the assessor (surrogate user)
Text REtrieval Conference (TREC)
Complex Questions
• Questions taken from the relationship types identified in the AQUAINT pilot
  – each question formed from a relationship template
  – each topic also included a narrative giving more details
What evidence is there for transport of [goods] from [entity] to [entity]?
What [financial relationship] exists between [entity] and [entity]?
What [organizational ties] exist between [entity] and [entity]?
What [familial ties] exist between [entity] and [entity]?
What [common interests] exist between [entity] and [entity]?
What influence/effect does [entity] have on/in [entity]?
What is the position of [entity] with respect to [issue]?
Is there evidence to support the involvement of [entity] in [event/entity]?
Text REtrieval Conference (TREC)
ciQA Protocol
• Perform baseline runs
• Receive interaction form responses
  – interaction forms are web forms that ask the assessor for more information
  – content is the participant’s choice, but the assessor spends at most 3 minutes per topic per form
• Perform additional (non-baseline) runs exploiting the additional information
Text REtrieval Conference (TREC)
Relationship Task Results: Pyramid F(3) Scores
[Figure: scatter plot of final score against initial (baseline) score, both axes 0 to 0.4, with automatic and manual runs plotted separately.]
Text REtrieval Conference (TREC)
Genomics Track
• Track motivation: explore information use within a specific domain
  – focus on a person experienced in the domain
• 2006 task
  – single task with elements of both IR and IE
  – an instance-finding, QA-style task where the unit of retrieval is a (sub-)paragraph
Text REtrieval Conference (TREC)
Genomics Track Task
• Documents
  – full-text journal articles provided through Highwire Press
  – associated metadata (e.g., MEDLINE record) available
  – 162,259 articles from 49 journals; about 12.3 GB of HTML
• Topics
  – 28 well-formed questions derived from topics used in 2005 (two questions discarded because they had no relevant passages)
  – topics based on 4 generic topic-type templates and instantiated from real user requests, e.g.,
    “What is the role of DRD4 in alcoholism?”
    “How do HMG and HMGB1 interact in hepatitis?”
• System response
  – ranked list of up to 1000 passages (pieces of paragraphs)
  – each passage is a contribution to the answer
Text REtrieval Conference (TREC)
Task Evaluation
• Relevance judging
  – pools built from standardized paragraphs by mapping each retrieved passage to its unique standard paragraph
  – judged by domain experts using 3-way judgments: not / possibly / definitely relevant
  – assessor marked a contiguous span in the paragraph as the answer
  – answers assigned a set of MeSH terms
Text REtrieval Conference (TREC)
Task Evaluation
• Scoring at three levels (a character-overlap sketch for the passage measure follows)
  – document
    • standard ad hoc retrieval task (MAP)
    • a document is relevant iff it contains a relevant passage
    • collapse the system ranking so each document appears just once
  – aspect
    • use MeSH term sets to define aspects
    • a retrieved passage that overlaps a marked answer is assigned the aspect(s) of that answer
  – passage
    • calculate the fraction of the answers (in characters) contained in retrieved passages
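The passage-level measure reduces to character overlap between gold answer spans and retrieved passages. A simplified sketch, assuming spans are (doc_id, start, end) character offsets and ignoring how the official measure folds this overlap into a rank-based average; names and data are illustrative:

```python
# Toy character-overlap fraction: what share of gold answer characters are
# covered by the retrieved passages? Assumes retrieved passages do not overlap
# one another, so covered characters are not double counted.

def covered_fraction(gold_spans, retrieved_spans):
    total = 0
    covered = 0
    for doc, g_start, g_end in gold_spans:
        total += g_end - g_start
        for d, r_start, r_end in retrieved_spans:
            if d == doc:
                covered += max(0, min(g_end, r_end) - max(g_start, r_start))
    return covered / total if total else 0.0

gold = [("doc1", 100, 200)]               # 100 answer characters
retrieved = [("doc1", 150, 400)]          # overlaps the last 50 of them
print(covered_fraction(gold, retrieved))  # 0.5
```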
Text REtrieval Conference (TREC)
Genomics Track Results
[Figure: document, aspect, and passage scores for the top runs: UICGenRun1, UICGenRun3, UICGenRun2, NLMInter, THU1, THU3, THU2, iitx1, UIUCinter2, PCPsgRescore, PCPClean, PCPsgAspect, UIUCinter.]
Text REtrieval Conference (TREC)
Enterprise Track
• Goal: investigate enterprise search, i.e., searching the data of an organization to complete some task
  – find-an-expert task
  – ad hoc search for email messages that discuss reasons pro/con for a particular past decision
• Document set
  – crawl of the W3C public web as of June 2004
  – ~300,000 documents divided into five major areas: discussion lists (email), code, web, people, other
Text REtrieval Conference (TREC)
Search-for-Experts Task
• Motivation: exploit corporate data to determine who the experts on a given topic are
• Task
  – given a topic area, return a ranked list of people; also return supporting documents
  – people are represented by an id from a canonical list of people in the corpus
  – evaluate as a standard ad hoc retrieval task where a person is relevant if he/she is indeed an expert [and expertise is documented] (a toy R-Precision sketch follows)
  – topics and judgments constructed by participants
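The results slide reports R-Precision, so here is a minimal sketch of scoring one expert-search topic that way, assuming a ranked list of candidate ids and a judged set of true experts; the identifiers are made up:

```python
# Toy R-Precision for an expert ranking: precision at rank R, where R is the
# number of judged experts for the topic.

def r_precision(ranked_people, experts):
    r = len(experts)
    if r == 0:
        return 0.0
    return sum(1 for p in ranked_people[:r] if p in experts) / r

ranked_people = ["candidate-0042", "candidate-0007", "candidate-0113"]
experts = {"candidate-0007", "candidate-0113"}
print(r_precision(ranked_people, experts))  # 0.5
```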
Text REtrieval Conference (TREC)
Expert Search Topic
<title>W3C translation policy
<description>I want to find people in charge or knowledgeable about W3C translation policies and procedures
<narrative>I am interested in translating some W3C document into my native language. I want to find experts on W3C translation policies and procedures. Experts are those persons at W3C who are in charge of the translation issues or know about the procedures and/or the legal issues. I do not consider other translators experts.
Text REtrieval Conference (TREC)
Search-for-Experts
[Figure: R-Precision of the top expert-search runs, scored both with no support required and with support required.]
Text REtrieval Conference (TREC)
Discussion Search
• Retrieve (email) documents containing arguments pro/con about a decision
  – discussion-list portion of the collection only
  – topics created at NIST based on categories produced from last year’s track
  – messages judged not relevant, on-topic, or containing an argument
    • the assessor marked whether an argument was pro, con, or both, but this was not part of the system’s task
Text REtrieval Conference (TREC)
Discussion Email Search
[Figure: per-run effectiveness curves (both axes 0 to 1) for the top discussion-search runs, including THUDSTHDPFSM, srcbds5, DUTDS3, UAmsPOSBase, york06ed03, UMaTDMixThr, IIISRUN, and IBM06JAQ.]
Text REtrieval Conference (TREC)
Blog Track
• New track in 2006
  – explore information access in the blogosphere
• One shared task
  – find opinions about a given target
  – traditional ad hoc retrieval evaluation, similar to discussion search in the enterprise track
• Document set
  – set of blogs collected December 2005 – February 2006 and distributed by the University of Glasgow
  – a document is a permalink: a blog post plus all its comments
  – ~3.2 million permalinks; some other information also collected
Text REtrieval Conference (TREC)
Topics & Judgments
• 50 topics
  – created at NIST by backfitting queries from blog search engine logs
  – the title was the submitted query; a NIST assessor created the rest of the topic to fit
• Judgments
  – judgments distinguished posts containing opinions from posts that were simply on-topic; polarity was also marked
  – systems were not required to give polarity, but did have to distinguish opinionated posts from merely on-topic posts
  – assessors could skip a pooled document if the post was likely to be offensive, but never exercised that option
Text REtrieval Conference (TREC)
Sample Topics
<title> “March of the Penguins”
<description> Provide opinion of the film documentary “March of the Penguins”.
<narrative> Relevant documents should include opinions concerning the film documentary “March of the Penguins”. Articles or comments about penguins outside the context of this film documentary are not relevant.
<title> larry summers
<description> Find opinions on Harvard President Larry Summers’ comments on gender differences in aptitude for mathematics and science.
<narrative> Statements of opinion on Summers’ comments are relevant. Quotations of Summers without comment or references to Summers’ statements without discussion of their content are not relevant. Opinions on innate gender differences without reference to Summers’ statements are not relevant.
Text REtrieval Conference (TREC)
Opinion Results
Automatic, Title-Only Runs
[Figure: per-run effectiveness curves (both axes 0 to 1) for the top opinion-finding runs: woqln2, ParTiDesDmt2, uicst, UAmsB06All, uscsauto, pisaB1Des, UABas11, UALR06a260r2.]
Text REtrieval Conference (TREC)
Legal Track
• Goal: contribute to the understanding of ways in which automated methods can be used in the legal discovery of electronic records
• Timely track, in that changes to the US Federal Rules of Civil Procedure regarding electronic records take effect December 1, 2006
Text REtrieval Conference (TREC)
Legal Track Collection
• Document set
  – almost 7 million documents made public through the tobacco Master Settlement Agreement
  – a snapshot of the University of California Library’s Legacy Tobacco Document Library, produced by the IIT CDIP project [IIT CDIP Test Collection, version 1.0]
  – documents are XML records including metadata and a text field consisting of OCR output
  – wide variety of document types, including scientific reports, memos, email, budgets, …
Text REtrieval Conference (TREC)
Legal Track Collection
• Topic set
  – modeled after actual legal discovery practice and created by lawyers
  – five (hypothetical) complaints, with several requests to produce (topics) per complaint
  – each topic also included a negotiated Boolean query statement that participants could use as they saw fit
  – 46 topics distributed; 39 used in scoring
Text REtrieval Conference (TREC)
Legal Track Collection
• Relevance judgments
  – pools built from participants’ submitted runs (depth λ=100 for one run per participant and λ=10 for the remaining runs) …
  – plus the set of documents retrieved by a manual run created by a document-set expert who targeted unique relevant documents (approx. 100 docs/topic) …
  – plus a stratified sample of a baseline Boolean run (up to 200 documents/topic)
  – judgments made by law professionals
Text REtrieval Conference (TREC)
Ranked Retrieval Results
[Figure: recall-precision curves for the top ranked-retrieval runs: humL06t, SabLeg06aa1, york06la01, UMKCB2, UmdComb, NUSCHUA2.]
Text REtrieval Conference (TREC)
R-Precision Results
[Figure: per-topic R-precision for the baseline run and the sableg06ao1 and HumL06t runs.]
Text REtrieval Conference (TREC)
Spam Track
• Motivation
  – assess the quality of an email spam filter in actual usage
  – lay the groundwork for other tasks with sensitive data
• How to get an appropriate corpus?
  – true mail streams have privacy issues
  – simulated/cleansed mail streams introduce artifacts that affect filter performance
  – track solution: create a software jig that applies a given filter to a given message stream and evaluates performance based on the judgments; i.e., have participants send their filters to the data
Text REtrieval Conference (TREC)
Spam Tasks
• 4 email streams [ham; spam; total]
  – public English [12,910; 24,912; 37,822]
  – public Chinese [21,766; 42,854; 64,620]
  – MrX2 (private) [9,039; 40,135; 49,174]
  – SB2 (private) [9,274; 2,695; 11,969]
• 3 tasks (a toy sketch of the feedback loop follows)
  – immediate feedback filtering
  – delayed feedback filtering
  – active learning
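A rough sketch of the immediate-feedback loop the jig runs, assuming a filter exposes classify() and train() methods. This interface and the toy KeywordFilter are purely illustrative; the track's actual jig drives real filters through a command-line protocol, and delayed feedback simply postpones the training calls:

```python
# Toy immediate-feedback loop: the filter scores each message before the gold
# judgment is revealed, then is told the truth and may learn from it.

class KeywordFilter:
    """Toy filter: spamminess = number of blacklisted words (illustration only)."""
    def __init__(self):
        self.blacklist = {"lottery", "winner"}

    def classify(self, raw):
        return sum(word in raw.lower() for word in self.blacklist)

    def train(self, raw, is_spam):
        pass  # a real filter would update its model from the gold judgment

def run_immediate_feedback(filter_, messages):
    """messages: iterable of (raw_message, is_spam) pairs in delivery order."""
    scores, labels = [], []
    for raw, is_spam in messages:
        scores.append(filter_.classify(raw))  # filter judges the message first
        labels.append(is_spam)
        filter_.train(raw, is_spam)           # gold judgment arrives immediately
    return scores, labels

stream = [("Lottery winner, claim now", True), ("Meeting moved to 3pm", False)]
print(run_immediate_feedback(KeywordFilter(), stream))  # ([2, 0], [True, False])
```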
Text REtrieval Conference (TREC)
Evaluation
• Ham misclassification rate (hm%)
• Spam misclassification rate (sm%)
• ROC curve
  – assumes the filter computes a “spamminess” score
  – use the score to compute sm% as a function of hm%
  – the area under the ROC curve is a measure of filter effectiveness
  – use 1 minus the area, expressed as a percentage, to report filter ineffectiveness: (1-ROCA)% (a toy computation follows)
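A toy computation of (1-ROCA)%, assuming each message has a spamminess score and a gold label; the pairwise-comparison form of the ROC area is used here for brevity rather than the track's official tooling:

```python
# Toy (1-ROCA)%: 1 minus the area under the ROC curve, as a percentage.
# The AUC is computed as the fraction of (spam, ham) pairs where the spam
# message receives the higher spamminess score; ties count as half.

def one_minus_roca_percent(scores, labels):
    """scores: list of floats; labels: parallel list, True for spam."""
    spam = [s for s, l in zip(scores, labels) if l]
    ham = [s for s, l in zip(scores, labels) if not l]
    wins = 0.0
    for s in spam:
        for h in ham:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    auc = wins / (len(spam) * len(ham))
    return (1.0 - auc) * 100.0

scores = [0.9, 0.8, 0.4, 0.1]
labels = [True, True, False, False]          # perfectly separated toy data
print(one_minus_roca_percent(scores, labels))  # 0.0
```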
Text REtrieval Conference (TREC)
Immediate vs. Delayed Feedback
SB2 Corpus
[Figure: curves of % spam misclassification versus % ham misclassification, both on logit scales, for the SB2 corpus; the left panel shows immediate feedback and the right panel delayed feedback, for runs from the ofl, ijs, CRM, hub, tuf, tam, hit, bpt, and dal groups.]
Text REtrieval Conference (TREC)
Active Learning
[Figure: (1-ROCA)% as a function of the number of training messages (100 to 10,000) on the SB2 corpus, for runs oflA1b2, hubA1b2, ijsA1b2, bptA2b2, and hitA1b2.]
Text REtrieval Conference (TREC)
Future
• TREC expected to continue into 2007
• TREC 2007 tracks
  – all tracks but terabyte continuing
  – adding a “million query” track: the goal is to test the hypothesis that a test collection built from very many, very incompletely judged topics is a better tool than a traditional TREC collection
  – start planning for an eventual Desktop search track
    • goal is to test the efficacy of search algorithms for the desktop
    • privacy/realism considerations are severe barriers to the traditional methodology
Text REtrieval Conference (TREC)
Track Planning Workshops
• Thursday, 4:00–5:30pm
  – Blog (LR B), Desktop (LR A), Genomics (Green), Legal (LR E), Terabyte (LR D)
• Friday, 10:50am–12:20pm
  – Enterprise (LR B), Million Query (LR A), QA (Green), Spam (LR D)