Translating Dialects in Search:
Mapping between Specialized Languages of Discourse
and Documentary Languages
Vivien Petras
UC Berkeley School of Information
Overcoming the Language Problem in Search
How can someone searching for violins be made aware that there are also fiddles (and vice versa)?
• The Language Problem in Information Retrieval
• Dialects & Contexts
• The Search Term Recommender
• 4 Research Questions
• Exploratory Web Interface
Outline
“how to obtain the right information for the right user
at the right time” (Chu, 2003)
Decision Process under Uncertainty
Information Retrieval
• Searching for the needle in the haystack
• Which needle in which haystack?
• How to express the needle and the haystack
Language Problem in Information Retrieval
Decision Process under Uncertainty
[Diagram: Searcher → Concept Space → Question → Search Statement; Author → Concept Space → Text → Document; Search Statement and Document: Match!]
• Mapping between searcher and IR system
• Mapping between author and IR system
• Mapping between search statement and document
Language Mapping
IR = Language Mapping Exercise
[Diagram: Searcher → Concept Space → Question → Search Statement; Search Statement and Document: Match!]
Information Retrieval
A search statement needs to describe the:
• searcher’s question (information need)
• documents that are relevant to a searcher’s question
In Linguistics:
unlimited semiosis
In Information Science:
Inter-indexer inconsistency (20-60%)
The Language Problem
How to alleviate language ambiguity?
Ludwig Wittgenstein:
• Language games
• Language regions
Language is disambiguated within contexts and specialized dialects.
Dialects and Contexts
How to alleviate language ambiguity for search term selection?
Support search term selection:
• Within the dialect of a specialized community
• In context
• Using the language of documents (for term matching)
Dialects and Contexts
[Diagram: a Search Statement goes to the Search Term Recommender, which is built from an Information Collection divided into Specialties and answers "Did you mean…" with a list of Specialty Terms]
Search Term Recommender
• Divide information collection by specialty
• Association between
– specialty terms
– documentary terms (subject metadata)
• Recommend highly associated terms
The Search Term Recommender Methodology
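The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the talk: the toy records, the function names `build_recommender` and `recommend`, and the use of a plain conditional frequency P(descriptor | term) as the association score are all assumptions, since the slides do not say which association measure was used.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (specialty, free-text title, subject descriptors).
RECORDS = [
    ("music", "violin concerto recordings", ["Violins"]),
    ("music", "fiddle tunes and violin technique", ["Violins", "Folk music"]),
    ("music", "fiddle festivals", ["Folk music"]),
    ("biology", "fiddle leaf fig care", ["House plants"]),
]


def build_recommender(records):
    """Count term/descriptor co-occurrences separately for each specialty."""
    term_counts = defaultdict(Counter)  # specialty -> term -> document count
    pair_counts = defaultdict(Counter)  # specialty -> (term, descriptor) -> count
    for specialty, text, descriptors in records:
        for term in set(text.lower().split()):
            term_counts[specialty][term] += 1
            for descriptor in descriptors:
                pair_counts[specialty][(term, descriptor)] += 1
    return term_counts, pair_counts


def recommend(model, specialty, term, k=3):
    """Return up to k descriptors ranked by the assumed score P(descriptor | term)."""
    term_counts, pair_counts = model
    n_term = term_counts[specialty][term]
    if n_term == 0:
        return []  # term never seen in this specialty
    scores = {
        descriptor: count / n_term
        for (t, descriptor), count in pair_counts[specialty].items()
        if t == term
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


model = build_recommender(RECORDS)
print(recommend(model, "music", "fiddle"))
print(recommend(model, "biology", "fiddle"))
```

The same search term yields different suggestions depending on the specialty it is asked in, which is the point of dividing the collection before computing associations.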
• Term selection support (query expansion & reformulation)
• Automatic classification
• Terminology mapping
The Search Term Recommender: Applications
1. How can specialties & specialty dialects be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?
Tested on 2 bibliographic collections:
• Inspec
• Medline (Ohsumed collection)
The Search Term Recommender - Questions
• Physics, Electrical and Electronic Engineering, Computers and Control
• Document: author, title, source, publication year, abstract, Inspec thesaurus descriptors, Inspec classification codes
• Test collection:
– Number of documents: 427,340
– Descriptors per document: 6.99
Inspec
• Biomedicine and Health
• Document: author, title, source, publication year, publication type, abstract, MeSH headings
• Test collection:
– Number of documents: 168,463
– MeSH headings per document: 3.11
Medline Ohsumed Collection
• Domain terminology
• Publication source
• Bibliometric analysis
• Social network analysis
• Subject-specific classification
Determine specialty documents in the collection:
Inspec test collection
• by top-level categories in the Inspec classification
• 3 specialties: Physics, Electrical & Electronic Engineering, Computers & Control
Ohsumed test collection
• by journals grouped by subject
• 33 specialties
Identification of Specialties in an Information Collection
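Splitting a collection by top-level classification category might look like the sketch below. It assumes each record carries classification codes whose first character names the top-level section; the letter-to-specialty mapping, the sample codes, and the choice to place a multi-section document in every matching specialty are illustrative assumptions, not details taken from the talk.

```python
from collections import defaultdict

# Assumed mapping from the first character of a classification code to a specialty.
TOP_LEVEL = {
    "A": "Physics",
    "B": "Electrical & Electronic Engineering",
    "C": "Computers & Control",
}


def split_by_specialty(documents):
    """Group document ids by the specialties of their classification codes.

    A document with codes in several top-level sections lands in each of them.
    """
    groups = defaultdict(set)
    for doc_id, codes in documents:
        for code in codes:
            specialty = TOP_LEVEL.get(code[:1])
            if specialty is not None:
                groups[specialty].add(doc_id)
    return {name: sorted(ids) for name, ids in groups.items()}


# Hypothetical records: (document id, classification codes).
docs = [
    ("d1", ["A9710", "A9840"]),
    ("d2", ["B1210", "C5120"]),
    ("d3", ["C6130"]),
]
groups = split_by_specialty(docs)
print(groups)
```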
Differences in specialty dialects (specialty term overlap)
Differences in documentary languages (subject metadata term overlap)
Differences in search term recommender suggestions (term suggestion overlap)
Differences in Language
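One way to quantify such overlap among three specialty vocabularies is to compute the share of the combined vocabulary that falls into each region of a three-set Venn diagram. The sketch below is an assumption about how this could be done, not the procedure used in the study; the function name `venn_shares` and the toy vocabularies are hypothetical.

```python
def venn_shares(a, b, c):
    """Share of the combined vocabulary in each of the 7 regions of a 3-set Venn diagram."""
    union = a | b | c
    if not union:
        raise ValueError("at least one vocabulary must be non-empty")
    regions = {
        "A only": a - b - c,
        "B only": b - a - c,
        "C only": c - a - b,
        "A&B only": (a & b) - c,
        "A&C only": (a & c) - b,
        "B&C only": (b & c) - a,
        "A&B&C": a & b & c,
    }
    return {name: len(terms) / len(union) for name, terms in regions.items()}


# Hypothetical toy vocabularies for three specialties.
physics = {"laser", "field", "model", "noise"}
electrical = {"circuit", "field", "model", "noise"}
computers = {"software", "network", "model", "noise"}

shares = venn_shares(physics, electrical, computers)
for region, share in shares.items():
    print(f"{region}: {share:.0%}")
```

Because the seven regions partition the combined vocabulary, the shares sum to 1; the share in the central region is the fraction of terms common to all three dialects.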
Inspec Dialects (specialty term overlap)
[Venn diagram of specialty term overlap among Physics, Electrical Engineering, and Computers; region percentages: 20%, 7%, 13%, 13%, 4%, 33%, 13%]
Terms analyzed: 60,601
Subject metadata term overlap: 87%
Suggested term overlap: 30%
Ohsumed Dialects (specialty term overlap)
[Venn diagram of specialty term overlap among Communicable Diseases, Gynecology, and Orthopedics; region percentages: 13%, 29%, 8%, 19%, 2%, 21%, 7%]
Terms analyzed: 11,663
Subject metadata term overlap: 32%
Suggested term overlap: 30%
Comparison: specialty vs. general term suggestions
Automatic classification
Title: “A search for clusters of protostars in Orion cloud cores”
Automatic Classification
Originally assigned terms:
1. Infrared sources (astronomical)
2. Interstellar molecular clouds
3. Pre-main-sequence stars
4. Star associations
Specialty Search Term Recommender:
1. Clouds
2. Clusters of galaxies
3. Interstellar molecular clouds
4. Star clusters
5. Pre-main-sequence stars
General Search Term Recommender:
1. Search problems
2. Clouds
3. Atomic clusters
4. Clusters of galaxies
5. Interstellar molecular clouds
Recall (hit rate): Specialty STR 2/4 = 0.5; General STR 1/4 = 0.25
Precision (accuracy): Specialty STR 2/5 = 0.4; General STR 1/5 = 0.2
Evaluation
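The recall and precision figures for the Orion example can be reproduced with a short calculation. The term lists are those shown on the slide; the function name `recall_precision` is an illustrative choice.

```python
def recall_precision(assigned, suggested):
    """Recall = hits / assigned terms; precision = hits / suggested terms."""
    hits = len(set(assigned) & set(suggested))
    return hits / len(assigned), hits / len(suggested)


assigned = [
    "Infrared sources (astronomical)",
    "Interstellar molecular clouds",
    "Pre-main-sequence stars",
    "Star associations",
]
specialty_str = [
    "Clouds",
    "Clusters of galaxies",
    "Interstellar molecular clouds",
    "Star clusters",
    "Pre-main-sequence stars",
]
general_str = [
    "Search problems",
    "Clouds",
    "Atomic clusters",
    "Clusters of galaxies",
    "Interstellar molecular clouds",
]

print(recall_precision(assigned, specialty_str))  # (0.5, 0.4)
print(recall_precision(assigned, general_str))    # (0.25, 0.2)
```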
Performance of the STR: Inspec
[Precision-recall plot: Inspec specialties and general STRs; x-axis: Recall, y-axis: Precision; series: Individual Specialty STRs, General STR]
Test documents: 42,735
Specialties: 3
First 3 suggested: Recall 13.6%, Precision 11.2%
Performance of the STR: Ohsumed
[Precision-recall plot: Ohsumed specialties and general STR; x-axis: Recall, y-axis: Precision; series: Individual Specialty STRs, General STR]
Test documents: 18,733
Specialties: 33
First 3 suggested: Recall 26%, Precision 25.6%
• Language differences
• Collection sizes for training
Specificity of Specialties
Identifying subspecialties by classification hierarchy
– e.g. Computers & Control -- Computer Hardware -- Circuits & Devices
Specificity of Specialties - Inspec
Four levels of specificity in the Inspec collection
[Precision-recall plot; x-axis: Recall, y-axis: Precision; series: Sub-sub specialty STR, Sub-specialty STR, Specialty STR, General STR]
Test documents: 2,425
Specialties: 3
Identifying subspecialties by journal within subject
– e.g. Orthopedics -- Clinical Orthopaedics & Related Research journal
Specificity of Specialties - Ohsumed
Three levels of specificity in the Ohsumed collection
[Precision-recall plot; x-axis: Recall, y-axis: Precision; series: Journal STR, Specialty STR, General STR]
Test documents: 745
Specialties: 3
Inspec
http://metadata.sims.berkeley.edu/str/inspec/inspec.html
Ohsumed
http://metadata.sims.berkeley.edu/str/ohsumed/ohsumed.html
Exploratory Web Interfaces
1. How can specialties be identified in an information collection?
– Inspec: subject-specific classification
– Ohsumed: journal specialty area
2. Do specialty dialects really differ?
– Inspec specialties: term overlap 50%, suggestions overlap 30%
– Ohsumed specialties: term overlap 30%, suggestions overlap 30%
3. Is performance improved when focusing on specialty dialects?
– Inspec specialties: 10% improvement over general STR
– Ohsumed specialties: 25% improvement over general STR
4. How specific should specialties be?
– Depends on language differences and collection size
Summary
Overcoming the Language Problem in Search
Search Term Recommender:
See also:
FIDDLES
50% Discount!
Thank you!