Of Search and Semantics
Patrick Pantel
NSF Symposium on Semantic Knowledge Discovery, Organization and Use
November 15, 2008
Vannevar Bush proposes to build a body of knowledge for all mankind: Memex
Bush: "technical difficulties of all sorts have been ignored."
Memex, in the form of a desk, would instantly bring files and material on any subject to the operator’s fingertips.
Semantics captured by an associative trail, personal comments, and side trails
Gerald Salton, father of modern search: introduces concepts such as the vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), and relevancy feedback
1990: the first search engine, Archie, from McGill, matches keyword queries against a database of Web filenames
Yahoo! is founded around a taxonomical organization of the Web… manually
Key innovator: Advanced search features + first SE to allow natural language queries
Key innovator: Semantic models for ranking paid search results
Semantics and Query Logs: Technologies developed/applied during the DARPA Tipster program and TREC make it commercial: spelling correction, query reformulations, also try, …
Future of search lies in a deep understanding and matching of information request behind user queries
Natural language questions answered by editors
Semantic repositories and user-annotated content grow rapidly
Semantic search engines emerge
Search Assist Technology
Smart Snippets
Tapas Kanungo et al.
• Aggregate star rating
• Example review as summary
• Current prices at merchant sites
• Images, maps, specs, …
Opportunities:
• Marriage of information extraction, content analysis, and query intent modeling
• User experience design
• Key Technologies
• IE: Entity detection and salience; attribute detection
• CA: Text classification, aboutness, information fusion
• QIM: Entity detection, intent/task understanding
Task Modeling
Ana-Maria Popescu
• What is the user trying to do?
– Find product reviews of the Nikon D300?
– Buy a Nikon D300?
– Find support for her camera?
• UED: enriching the search experience
– How can we make use of this knowledge?
• Key Technologies
– Entity detection
– Document classification
– Intent modeling / detection
Research Problem: Intent Synonymy
Is battery life a synonym of image quality?
How can we automatically discover intent synonyms?
Review Intent
Transactional Intent
Intent Synonymy
Intent Discovery and Chaining via Clustering
Key Assumption: one intent per session. A searcher's intent remains the same within a search session.
Methodology very similar to constructing a dictionary of distributionally similar words
Intent Synonyms (columns: Review, Price, Support; one row per line)
best thin where can I get cheap common problems
rate black Friday sale fault codes
compare kodak vs. canon christmas sales installing new
small portable cheap easy use
high zoom buy now pay later operating instructions
rate best dell discount won't heat
battery life great deals best schematics
user comments overstock.com keeps shutting
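The slide compares the methodology to building a dictionary of distributionally similar words. A minimal sketch of that idea follows, assuming that phrases co-occurring within the same session share an intent. The sessions below are invented for illustration (they only reuse phrases from the table), and the helper names are hypothetical; this is not the authors' implementation.

```python
from collections import Counter, defaultdict
from itertools import permutations
from math import sqrt


def cooccurrence_vectors(sessions):
    """Map each phrase to counts of the other phrases seen in the same session."""
    vectors = defaultdict(Counter)
    for session in sessions:
        # One intent per session (assumed): every pair of distinct phrases
        # in a session is evidence that the two phrases share an intent.
        for a, b in permutations(set(session), 2):
            vectors[a][b] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical sessions, invented for this sketch.
sessions = [
    ["battery life", "user comments", "compare"],
    ["compare", "user comments", "rate"],
    ["battery life", "rate", "compare"],
    ["great deals", "buy now pay later", "black friday sale"],
    ["black friday sale", "great deals", "christmas sales"],
]

vectors = cooccurrence_vectors(sessions)
review_sim = cosine(vectors["battery life"], vectors["rate"])
cross_sim = cosine(vectors["battery life"], vectors["great deals"])
print(f"battery life ~ rate: {review_sim:.3f}")
print(f"battery life ~ great deals: {cross_sim:.3f}")
```

In this toy data the two review-style phrases share session neighbors and score high, while a review phrase and a price phrase share none and score zero. Clustering phrases by such scores would be one way to group candidate intent synonyms.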
Road Ahead
Key Applications and Enabling Technologies:
• Entity Detection
• Attribute Mining
• Intent Synonymy
• Distributional Similarity
• Entailment
• Anaphora Resolution
• Content Analysis
• Text Classification
Dynamic Similarity Modeling
Eric Crestan and Vishnu Vyas
SeeLEx: Seed List Expansion
• SeeLEx, developed at Yahoo!, is a weakly supervised system for mining entity lists based on distributional similarity
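The slides do not describe SeeLEx's internals, so the following is only a generic sketch of seed list expansion by distributional similarity: candidate terms are ranked by cosine similarity to the centroid of the seeds' context vectors. The context counts and the function name `expand` are hypothetical and chosen for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def expand(seeds, context_vectors, top_k=3):
    """Rank non-seed terms by similarity to the summed seed vectors."""
    centroid = Counter()
    for seed in seeds:
        centroid.update(context_vectors[seed])
    scored = [
        (cosine(centroid, vector), term)
        for term, vector in context_vectors.items()
        if term not in seeds
    ]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_k]]


# Hypothetical context counts (how often each term appears with each context word).
context_vectors = {
    "Honda": Counter({"drive": 5, "dealer": 3, "engine": 4}),
    "Ford": Counter({"drive": 4, "dealer": 4, "engine": 3}),
    "Nissan": Counter({"drive": 5, "dealer": 2, "engine": 3}),
    "Toyota": Counter({"drive": 3, "dealer": 3, "engine": 2}),
    "Jaguar": Counter({"drive": 2, "dealer": 1, "engine": 1, "prey": 2, "habitat": 2}),
    "puma": Counter({"prey": 4, "habitat": 3, "hunt": 3}),
    "Clinton": Counter({"president": 5, "election": 4}),
}

expansion = expand(["Honda", "Ford"], context_vectors)
print(expansion)
```

With car-make seeds, the unambiguous car makes rank first and the ambiguous term Jaguar ranks below them because part of its context mass comes from animal-related contexts.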
[Figure: example terms Jaguar, Honda, Peugeot, Mercedes, Ford]
[Figure: the example terms shown among related terms: Nissan, Opel, Volkswagen, Porsche, Mazda, Fiat, Hyundai, Lexus, Toyota, Renault, Saab, Subaru; cheetah, lion, puma, caracal; Clinton, Carter, Eisenhower]
What is a caracal? endangered, carnivore, fast, hunted, …
Questions Asked and Conclusions
• What is the effect of corpus size on expansion accuracy? Significant boost in performance
• What is the effect of corpus quality on expansion accuracy? Significant boost in performance
• Does seed quality matter? Great variance in performance depending on seed set composition
• How many seeds are on average optimal for expansion accuracy, and are more seeds better than fewer? Somewhat surprising: 5-20 seeds are generally sufficient; more seeds gain little but don't hurt
Gold Standard Sets
50 sets extracted from Wikipedia (2007/12)
• Average of 208 instances
• Minimum of 11
• Maximum of 1116
• Total of 10,377 instances
Sets: Archbishops of Canterbury, Astronomers, Australian Airlines, Australian A-League Football Teams, Australian Cities, Australian Prime Ministers, Best Actress Academy Award Winners, Biology Disciplines, Bottled Water Brands, Boxing Weight Classes, California Counties, Canadian Stadiums, Canadian Universities, Charitable Foundations, Classical Pianists, Cocktails, Cognitive Scientists, Composers, Countries, Electronic Companies, Elements, English Cities, English Poets, English Premier Football Clubs, First Ladies, Formula One Drivers, French Artists, Greek Gods, Greek Islands, Irish Theatres, Italian Regions, Japanese Martial Arts, Japanese Prefectures, Male Tennis Players, Maryland Counties, New Zealand Songwriters, NHL Hockey Teams, North American Mountain Ranges, Presidents of Argentina, Rivers in England, Roman Emperors, Russian Authors, Spanish Provinces, Stars, Superheroes, Texas Counties, U.S. Army Generals, U.S. Federal Departments, U.S. Internet Companies
Corpora
Table: Corpora used to build our expansion models.
CORPORA      UNIQUE SENTENCES (MILLIONS)   TOKENS (MILLIONS)   UNIQUE WORDS (MILLIONS)
k100         5,201                         217,940             542
k020†        1,040                         43,588              108
k004†        208                           8,717               22
Wikipedia    30                            721                 34
†Estimated from k100 statistics.
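The daggered rows appear to be the k100 figures divided by 5 and by 25; that reading is an assumption, based on the later takeaways that compare k100 with corpora 1/5th and 1/25th its size. The arithmetic can be checked with a short sketch (dictionary keys and the helper name are hypothetical):

```python
# k100 statistics from the table, in millions.
k100 = {"unique_sentences": 5201, "tokens": 217940, "unique_words": 542}
wikipedia_tokens = 721  # millions, from the table


def scale_down(stats, divisor):
    """Estimate statistics of a subsample by dividing every count (assumed linear scaling)."""
    return {name: value / divisor for name, value in stats.items()}


k020_estimate = scale_down(k100, 5)
k004_estimate = scale_down(k100, 25)

print({name: round(value, 1) for name, value in k020_estimate.items()})
print({name: round(value, 1) for name, value in k004_estimate.items()})

# Token ratio between the estimated k020 corpus and Wikipedia.
token_ratio = k020_estimate["tokens"] / wikipedia_tokens
print(f"k020 / Wikipedia tokens: {token_ratio:.1f}")
```

The scaled values agree with the table up to rounding. The token ratio of roughly 60 between k020 and Wikipedia may be what the "60 times its size" comparison refers to, though the slides do not say which web corpus is meant.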
The Effect of: Corpus Size and Corpus Quality
System and Corpora Analysis (Precision vs. Recall)
[Figure: precision-recall curves for full.k100, full.k020, full.k004, and full.wikipedia; axes: Precision, Recall]
Takeaway: Corpus Size Matters
• k100 yields 13% higher R-precision than 1/5th its size
• k100 yields 53% higher R-precision than 1/25th its size
Takeaway: Corpus Quality Matters
• Wikipedia yields similar performance to a web corpus 60 times its size
Opportunity: Model a more natural SeeLEx usage
• What is the performance of the system on typical sets searched by editors?
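The takeaways are stated in R-precision, which the slides do not define. Assuming the standard information retrieval definition (precision over the top R ranked items, where R is the size of the gold standard set), a minimal sketch looks like this; the ranked list and gold set are invented for illustration.

```python
def r_precision(ranked, gold):
    """Precision over the top R items of a ranking, where R = len(gold)."""
    r = len(gold)
    if r == 0:
        return 0.0
    top_r = ranked[:r]
    return sum(1 for item in top_r if item in gold) / r


# Hypothetical gold standard set and system ranking.
gold = {"Honda", "Ford", "Nissan", "Toyota"}
ranked = ["Nissan", "puma", "Toyota", "Honda", "Ford"]

score = r_precision(ranked, gold)
print(score)
```

Here R is 4, and three of the top four ranked items are in the gold set, so the score is 0.75. Because the cutoff adapts to each set's size, the measure is comparable across gold standard sets of very different sizes.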
k100: List Effect (Precision vs. Recall)
[Figure: precision-recall plot; axes: Precision, Recall]
Opportunity: Bucket sets to find predictable behaviors
• Our gold standard sets vary significantly in their expansion performance
• Look into predictability of open vs. closed class sets, large vs. small class sets, and types of sets such as locations, people, …
The Effect of: Seed Selection
k100: Seed Selection Effect (Precision vs. Recall)
[Figure: precision-recall curves; legend: full.k100 s010; s010.t01 through s010.t30; axes: Precision, Recall]
Takeaway: Seed set composition greatly affects performance
• Best performing seed set had 42% higher precision and 39% higher recall than the worst performing seed set
Opportunity: Reject seeds and/or ask for more
• We are investigating ways to automatically detect which seed elements are better than others in order to reduce the impact of the seed selection effect
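The legend labels s010.t01 through s010.t30 suggest 30 trials of 10 seeds each; that reading is an assumption, as is the random sampling used below. A minimal sketch of drawing such trial seed sets from a gold standard set, with a hypothetical function name and an illustrative gold set built from car makes on the earlier slide:

```python
import random


def draw_seed_trials(gold, seed_size=10, trials=30, rng_seed=0):
    """Draw `trials` random seed sets of `seed_size` items each from a gold set."""
    rng = random.Random(rng_seed)  # fixed seed so the trials are reproducible
    pool = sorted(gold)  # sort first so sampling does not depend on set ordering
    return {
        f"s{seed_size:03d}.t{trial:02d}": rng.sample(pool, seed_size)
        for trial in range(1, trials + 1)
    }


# Illustrative gold set (not one of the 50 gold standard sets).
gold = {
    "Honda", "Ford", "Nissan", "Toyota", "Mazda", "Fiat",
    "Opel", "Saab", "Lexus", "Renault", "Porsche", "Hyundai",
}

seed_trials = draw_seed_trials(gold)
print(len(seed_trials), "trials;", "first trial:", seed_trials["s010.t01"])
```

Running the same expansion and evaluation on each trial would expose the spread between the best and worst performing seed sets that the takeaway describes.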
The Effect of: Seed Size
Rate of New Correct Expansions vs. Seed Size
[Figure: rate of new correct expansions (0 to 3) plotted against seed size (0 to 200)]
Takeaway: Small numbers of seeds are sufficient to fully saturate the distributional similarity model
• After 10-20 seeds, more seeds yield few new correct instances in the expansion
Seed Size vs. % of Errors
[Figure: % of errors (0 to 1) plotted against seed size (0 to 200)]
Takeaway: Although bad seeds may be less desirable in a seed set than others, adding them does not seem to degrade performance
• Percentage of errors does not increase with increased seed set size