1
Information Retrieval
2
What is IR?
IR is concerned with the representation, storage, organization, and accessing of information items. [Salton]
Information includes text, audio, images, ….
For simplicity, we consider text: text information retrieval.
[Figure: the user issues requests in some language to an information space and receives documents in return]
3
The role of databases
Databases hold specific data items
Organization is explicit
Keys relate items to each other
Queries are constrained, but effective in retrieving the data that is there
Databases generally respond to specific queries with specific results
Searching for items not anticipated by the designers can be difficult
4
Information vs. Information sources
User needs information
Distinguish data, information, knowledge
Information sources may be:
Very well organized, indexed, controlled
Totally unorganized, uncharacterized, uncontrolled
Something in between
Connect the two in a way that matches information needs to information available.
5
The Web
Extreme opposite of a database
No organization, no overall structure, no index or key to the content
Searching and browsing are supported, but generally are not complete. (You will not know if you got every good response to your request. You may be able to tell that you got the response that meets your need, but may not know if you got the best response available.)
Each HTML page is considered as a document
6
Information Retrieval vs. Data Retrieval
Data retrieval consists of determining which documents of the collection contain the keywords in the user query.
Information retrieval should “interpret” the contents of the documents in the collection and retrieve all the documents that are relevant to the user query, while retrieving as few nonrelevant documents as possible.
7
A General text-information retrieval model
An information retrieval model is a quadruple <D, Q, F, R(qi, dj)> where
D is a set composed of logical views (or representations) for the documents in the collection
Q is a set composed of logical views (or representations) for the user information needs called “queries”
F is a framework for modeling document representations, queries and their relationships
R(qi, dj) is a ranking function which associates a real number with a query qi in Q and a document representation dj in D (a similarity measure which maps a query to the documents most similar to it)
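As a toy illustration of the quadruple, documents and queries can both be represented as term sets, with plain term overlap playing the role of R. All names here are illustrative assumptions, not from the slides:

```python
def represent(text):
    """Logical view of a document or query: its set of lowercase terms."""
    return set(text.lower().split())

def rank(query_terms, doc_terms):
    """R(q, d): a real-valued score -- here, simple term overlap."""
    return len(query_terms & doc_terms)

docs = ["information retrieval systems", "database query processing"]
D = [represent(d) for d in docs]        # the set D of document views
q = represent("information retrieval")  # one query view from Q
scores = [rank(q, d) for d in D]        # R applied to each (q, dj) pair
```

Any real model would use richer representations and a graded ranking function; the point is only the shape of the quadruple.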
8
Retrieval models
Probabilistic IR (Bayesian, Naïve Bayes): compute the probability of relevance of a document to a given query
Statistical IR (vector space, concept space)
Machine-learning-based techniques (extracting knowledge or identifying patterns): symbolic learning (ID3)
Neural networks
Evolution-based algorithms (for adapting the matching function F)
The effectiveness of an IR system depends on the ability of the document representation to capture the “meaning” of the documents with respect to the users’ needs
9
Text retrieval Overall Architecture
[Figure: on the user side, users pose queries (Q); within framework F, a matching algorithm (R) compares them against the document representation (D) built from the documents in the information space; retrieved documents flow back to the users, and relevance feedback refines the queries]
10
Preparing queries and documents
Convert file format
Text segmentation
Term extraction
Stemming, eliminating stop words
Term weighting
Phrase construction
Storing indexed documents
Similar stages apply to queries, but instead of being stored the query passes to the matching algorithm
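The preparation stages above can be sketched as a toy pipeline. The stop-word list and suffix rules below are illustrative stand-ins, not a standard stemmer:

```python
import re

# Toy stop-word list -- a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def extract_terms(text):
    """Term extraction: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(term):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ies", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def prepare(text):
    """Extract terms, drop stop words, stem the rest."""
    return [stem(t) for t in extract_terms(text) if t not in STOP_WORDS]

terms = prepare("The indexing of documents is a stage in retrieval")
```

A production pipeline would add the remaining stages (term weighting, phrase construction, storage) on top of this skeleton.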
11
The Retrieval Process
[Figure: the user’s need passes through the user interface and text operations to form a logical view; query operations produce the query, searching consults the index (an inverted file maintained by the indexing and DB manager modules over the text database), ranking turns the retrieved docs into ranked docs, and the user’s feedback loops back into the query operations]
12
Vector space
1960s: introduction of the vector space model (Salton, Cornell Univ., SMART system)
Dj = (wj1, wj2, …, wjt); if the kth term does not occur in the document then wjk = 0
Q = (wq1, wq2, …, wqt)
Sparse term–document matrix (documents D1 … DN as rows, terms T1 … Tt as columns):

      T1   T2   …   Tt
D1   w11  w12  …  w1t
D2   w21  w22  …  w2t
⋮
DN   wN1  wN2  …  wNt

Queries are laid out the same way, as rows of a query–term matrix:

      T1   T2   …   Tt
Q1   w11  w12  …  w1t
Q2   w21  w22  …  w2t
⋮
QN   wN1  wN2  …  wNt
13
Term Weighting

Wij = Lij · Gi · Nj  (local weight × global weight × document normalization)

Local weight Lij (0 if fij = 0, else):
Binary: 1
TF: fij
Log: 1 + log fij
LOGN: (1 + log fij) / (1 + log aj)
ATF1: 0.5 + 0.5 · fij / xj

Global weight Gi:
None: 1
IDF: log(N / ni)
Entropy: 1 + Σj (pij · log pij) / log N, where pij = fij / Fi
IDFB: Fi / ni
IDFP: log((N − ni) / ni)

Normalization Nj:
None: 1
Cosine: 1 / √(Σi (Gi · Lij)²)
PUQN (pivoted unique normalization): 1 / ((1 − slope) · pivot + slope · lj)

Here fij is the frequency of term i in document j, Fi the total frequency of term i in the collection, ni the number of documents containing term i, N the number of documents, aj the average term frequency in document j, xj the maximum term frequency in document j, and lj the number of unique terms in document j.
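A hedged sketch of one instantiation of Wij = Lij · Gi · Nj, using TF as the local factor, IDF as the global factor, and cosine normalization; function names are illustrative:

```python
import math

def weight_matrix(tf):
    """tf[i][j] = f_ij (terms x docs); returns TF-IDF with cosine norm."""
    n_terms, n_docs = len(tf), len(tf[0])
    # n_i: number of documents containing term i
    df = [sum(1 for j in range(n_docs) if tf[i][j] > 0) for i in range(n_terms)]
    # G_i = IDF = log(N / n_i)
    idf = [math.log(n_docs / df[i]) if df[i] else 0.0 for i in range(n_terms)]
    # L_ij * G_i
    w = [[tf[i][j] * idf[i] for j in range(n_docs)] for i in range(n_terms)]
    # N_j: cosine-normalize each document column
    for j in range(n_docs):
        norm = math.sqrt(sum(w[i][j] ** 2 for i in range(n_terms)))
        if norm:
            for i in range(n_terms):
                w[i][j] /= norm
    return w

W = weight_matrix([[2, 0], [1, 1]])  # term 0 occurs only in doc 0
```

Swapping in the other local/global/normalization choices from the table changes only the three factors, not the overall structure.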
14
Vector space graphical representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and Q plotted as vectors in the 3-D term space spanned by T1, T2, T3]

Is D1 or D2 more similar to Q?
How to measure the degree of similarity? Distance? Angle? Projection?
15
Similarity Measure (Matching Function) F
Similarity between documents Di and query Q can be computed as the inner vector product:
where dik is the weight of Term k in document i and qk is the weight of Term k in the query
sim(Di, Q) = Σk=1..t (dik · qk)
Binary: D = 1, 1, 1, 0, 1, 1, 0
Q = 1, 0 , 1, 0, 0, 1, 1
sim(D, Q) = 3
(the seven dimensions correspond to the terms retrieval, database, architecture, computer, text, management, information)
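The binary inner-product match above can be checked directly; a minimal sketch:

```python
def inner_product(d, q):
    """sim(D, Q) = sum over k of d_k * q_k."""
    return sum(dk * qk for dk, qk in zip(d, q))

# The binary example from the slide: 3 terms occur in both D and Q.
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
score = inner_product(D, Q)
```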
16
Cosine Similarity measure
D1 = 2T1 + 3T2 + 5T3   CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3   CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3
CosSim(dj, q) = (dj · q) / (|dj| · |q|) = Σi=1..t (wij · wiq) / √(Σi=1..t wij² · Σi=1..t wiq²)
[Figure: the same 3-D plot of D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3 and Q = 0T1 + 0T2 + 2T3 in the term space T1, T2, T3]
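The two cosine values above can be reproduced in a few lines; a sketch with illustrative names:

```python
import math

def cos_sim(d, q):
    """Cosine of the angle between document and query vectors."""
    dot = sum(di * qi for di, qi in zip(d, q))
    return dot / math.sqrt(sum(di * di for di in d) * sum(qi * qi for qi in q))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
s1 = cos_sim(D1, Q)  # ~0.81
s2 = cos_sim(D2, Q)  # ~0.13
```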
17
Inner product:  sim(di, q) = Σk=1..t (dik · qk)

Cosine:  sim(di, q) = Σk=1..t (dik · qk) / √(Σk=1..t dik² · Σk=1..t qk²)

Jaccard:  sim(di, q) = Σk=1..t (dik · qk) / (Σk=1..t dik² + Σk=1..t qk² − Σk=1..t (dik · qk))

(In the binary case di and q can be read as sets of keywords; in the weighted case they are vectors.)
Similarity Measures
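The Jaccard variant on the same worked example (the extended Jaccard form for weighted vectors); a sketch:

```python
def inner(d, q):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(d, q))

def jaccard(d, q):
    """Extended Jaccard: d.q / (d.d + q.q - d.q)."""
    return inner(d, q) / (inner(d, d) + inner(q, q) - inner(d, q))

D1, Q = [2, 3, 5], [0, 0, 2]
j = jaccard(D1, Q)  # 10 / (38 + 4 - 10)
```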
18
Comments on Vector Space Models
Simple, mathematically based approach.
Considers both local (tf) and global (idf) word occurrence frequencies.
Provides partial matching and ranked results.
Tends to work quite well in practice despite obvious weaknesses.
Allows efficient implementation for large document collections.
19
Problems with Vector Space
There is no real theoretical basis for the assumption of a term space; it is more for visualization than anything with a real basis
Most similarity measures work about the same regardless of model
Terms are not really orthogonal dimensions; terms are not independent of all other terms
20
Semantic IR
Different vocabularies for users and authors (or indexers)
Polysemy problem (words having multiple meaning)
Synonymy problem (multiple words having the same meaning)
Using a dictionary of synonyms and polysemes inside the IR system
Latent Semantic Indexing (LSI): using singular value decomposition (SVD)
Identifying the correlation between terms by means of singular values (e.g. “car” and “auto” co-occurring with “gasoline”, … in different docs)
SVD provides a solution to this, and in doing so captures the information in the original array
Reduces the size of the matrix to operate on (deals with the non-sparse parts)
Places similar elements closer to each other
Allows the reconstruction of the original matrix, with some loss of precision
21
Information Retrieval Systems
Information retrieval (IR) systems use a simpler data model than database systems: information is organized as a collection of documents
Documents are unstructured, with no schema
Information retrieval locates relevant documents on the basis of user input such as keywords or example documents, e.g., find documents containing the words “database systems”
Can be used even on textual descriptions provided with non-textual data such as images
IR on Web documents has become extremely important, e.g. Google, AltaVista, …
22
Information Retrieval Systems (Cont.)
Differences from database systems:
IR systems don’t deal with transactional updates (including concurrency control and recovery)
Database systems deal with structured data, with schemas that define the data organization
IR systems deal with some querying issues not generally addressed by database systems:
Approximate searching by keywords
Ranking of retrieved answers by estimated degree of relevance
23
Query Modification Process
F: accepts relevance judgement from the user and produces as output sets of relevant and nonrelevant documents
G: implements the feedback formula (for rewriting the original query)
[Figure: the original query Q enters the retrieval process, which produces ranked output; the user’s relevancy judgement feeds F, whose sets of relevant and nonrelevant documents feed G, which emits the reformulated query Q’]
24
The Effect of Relevance Feedback
[Figure: documents in term space, with relevant and nonrelevant documents marked. The original query retrieved five documents of mixed relevance; the reformulated query has moved toward the cluster of relevant documents]
25
The Basic Idea of Query Modification
Terms that occur in relevant documents are added to the original query vectors, or the weight of such terms is increased by an appropriate factor in constructing the new query statements
Terms occurring in nonrelevant documents are deleted from the original query statements, or the weight of such terms is appropriately reduced
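One common instantiation of this idea is a Rocchio-style update. The sketch below uses dict-based term vectors and illustrative weights alpha, beta, gamma; all names are assumptions, not from the slides:

```python
def modify_query(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Boost terms from relevant docs, penalize terms from nonrelevant
    docs, and drop terms whose weight falls to zero or below."""
    terms = set(q)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = alpha * q.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        if w > 0:
            new_q[t] = w
    return new_q

new_q = modify_query({"ir": 1.0},
                     relevant=[{"ir": 1.0, "ranking": 1.0}],
                     nonrelevant=[{"sql": 1.0}])
```

Here “ranking” enters the query because it occurs in a relevant document, while “sql” stays out because its weight goes negative.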
26
Keyword Search
In full text retrieval, all the words in each document are considered to be keywords. We use the word term to refer to the words in a document
Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not
Ands are implicit, even if not explicitly specified
Ranking of documents on the basis of estimated relevance to a query is critical. Relevance ranking is based on factors such as:
Term frequency: frequency of occurrence of a query keyword in the document
Inverse document frequency: how many documents the query keyword occurs in (fewer gives more importance to the keyword)
Hyperlinks to documents: more links to a document means the document is more important
27
Relevance Ranking Using Terms
TF-IDF (term frequency / inverse document frequency) ranking:
Let n(d) = number of terms in the document d
n(d, t) = number of occurrences of term t in the document d
Then relevance of a document d to a term t
The log factor is to avoid excessive weightage to frequent terms
And relevance of document to query Q
r(d, t) = log(1 + n(d, t) / n(d))

r(d, Q) = Σt∈Q r(d, t) / n(t)
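The two ranking formulas transcribe directly into code; function and parameter names here are illustrative:

```python
import math

def r_term(n_d, n_dt):
    """r(d, t) = log(1 + n(d, t) / n(d))."""
    return math.log(1 + n_dt / n_d)

def r_query(doc_len, counts, query_nt):
    """r(d, Q) = sum over t in Q of r(d, t) / n(t).

    counts:   term -> n(d, t) for this document
    query_nt: query term -> n(t), the number of documents containing t
    """
    return sum(r_term(doc_len, counts.get(t, 0)) / nt
               for t, nt in query_nt.items())
```

The log dampens frequent terms; dividing by n(t) plays the inverse-document-frequency role.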
28
Relevance Ranking Using Terms (Cont.)
Most systems add to the above model:
Words that occur in the title, author list, section headings, etc. are given greater importance
Words whose first occurrence is late in the document are given lower importance
Very common words such as “a”, “an”, “the”, “it”, etc. are eliminated (called stop words)
Proximity: if keywords in the query occur close together in the document, the document has higher importance than if they occur far apart
Documents are returned in decreasing order of relevance score; usually only the top few documents are returned, not all
29
Relevance Using Hyperlinks
When using keyword queries on the Web, the number of documents is enormous (many billions)
The number of documents relevant to a query can be enormous if only term frequencies are taken into account
Most of the time people are looking for pages from popular sites
Idea: use the popularity of a Web site (e.g. how many people visit it) to rank site pages that match given keywords
Problem: hard to find the actual popularity of a site
Solution: next slide
30
Relevance Using Hyperlinks (Cont.)

Solution: use the number of hyperlinks to a site as a measure of the popularity or prestige of the site
Count only one hyperlink from each site (why?)
The popularity measure is for the site, not for individual pages
Most hyperlinks are to the root of a site
Site-popularity computation is cheaper than page-popularity computation
Refinements: when computing prestige based on links to a site, give more weightage to links from sites that themselves have higher prestige
The definition is circular
Set up and solve a system of simultaneous linear equations
Above idea is basis of the Google PageRank ranking mechanism
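The circular prestige definition can also be solved iteratively. Below is a PageRank-style power-iteration sketch on a toy link graph; the damping factor 0.85 is the conventional choice, not something stated in the slides:

```python
def pagerank(links, iters=50, d=0.85):
    """links: page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = d * rank[p] / len(outs)  # rank flows along out-links
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
pr = pagerank(links)
```

Repeated multiplication converges to the same fixed point the linear-equation formulation would give.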
31
Relevance Using Hyperlinks (Cont.)
Connections to social networking theories that ranked prestige of people
E.g. the president of the US has high prestige since many people know him
Someone known by multiple prestigious people has high prestige
Hub and authority based ranking:
A hub is a page that stores links to many pages (on a topic)
An authority is a page that contains actual information on a topic
Each page gets a hub prestige based on prestige of authorities that it points to
Each page gets an authority prestige based on prestige of hubs that point to it
Again, prestige definitions are cyclic, and can be obtained by solving linear equations
Use authority prestige when ranking answers to a query
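A HITS-style sketch of the cyclic hub/authority definition, again solved iteratively on a toy graph; names are illustrative:

```python
import math

def hits(links, iters=50):
    """links: page -> list of pages it links to; returns (hub, auth) scores."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority score: sum of hub scores of pages pointing at p
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs)
                for p in pages}
        # hub score: sum of authority scores of pages p points to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        for scores in (auth, hub):          # normalize each round
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

links = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(links)
```

a1, pointed to by both hubs, ends up the stronger authority; h1, pointing at both authorities, ends up the stronger hub.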
32
Similarity Based Retrieval
Similarity-based retrieval: retrieve documents similar to a given document. Similarity may be defined on the basis of common words
E.g. find the k terms in A with highest r(d, t) and use these terms to find the relevance of other documents; each of the terms carries a weight of r(d, t)
Similarity can be used to refine the answer set to a keyword query: the user selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these
33
Synonyms and Homonyms
Synonyms: e.g. document “motorcycle repair”, query “motorcycle maintenance”
Need to realize that “maintenance” and “repair” are synonyms
The system can extend the query as “motorcycle and (repair or maintenance)”
Homonyms: e.g. “object” has different meanings as noun and verb
Can disambiguate meanings (to some extent) from the context
Extending queries automatically using synonyms can be problematic
Need to understand the intended meaning in order to infer synonyms, or verify synonyms with the user
Synonyms may have other meanings as well
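Synonym-based query extension can be sketched with a toy thesaurus; the synonym table below is a stand-in for a real dictionary:

```python
# Toy thesaurus -- a real system would consult a curated resource.
SYNONYMS = {"repair": {"maintenance"}, "maintenance": {"repair"}}

def expand(query_terms):
    """Each term becomes an OR-group of the term plus its synonyms;
    the groups are implicitly AND-ed together."""
    return [sorted({t} | SYNONYMS.get(t, set())) for t in query_terms]

groups = expand(["motorcycle", "repair"])
```

This reproduces the slide’s example: “motorcycle and (maintenance or repair)”.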
34
Indexing of Documents
An inverted index maps each keyword Ki to the set Si of documents that contain the keyword
Documents are identified by identifiers
The inverted index may record:
Keyword locations within the document, to allow proximity-based ranking
Counts of the number of occurrences of the keyword, to compute TF
and operation: finds documents that contain all of K1, K2, …, Kn
Intersection: S1 ∩ S2 ∩ … ∩ Sn
or operation: finds documents that contain at least one of K1, K2, …, Kn
Union: S1 ∪ S2 ∪ … ∪ Sn
Each Si is kept sorted to allow efficient intersection/union by merging
“not” can also be efficiently implemented by merging of sorted lists
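A minimal inverted-index sketch with the merge-style intersection used for “and”; all names are illustrative:

```python
def build_index(docs):
    """keyword -> sorted list of document ids (postings list)."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index  # postings stay sorted: doc ids are appended in order

def intersect(p1, p2):
    """Merge two sorted postings lists, keeping common doc ids."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

docs = ["database systems", "information retrieval systems", "database tuning"]
idx = build_index(docs)
both = intersect(idx["database"], idx["systems"])  # "database and systems"
```

Union (and hence “or” and “not”) can be implemented by the same kind of merge over the sorted lists.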
35
Measuring Retrieval Effectiveness

IR systems save space by using index structures that support only approximate retrieval. This may result in:
false negative (false drop): some relevant documents may not be retrieved
false positive: some irrelevant documents may be retrieved
For many applications a good index should not permit any false drops, but may permit a few false positives.
Relevant performance metrics:
Precision: what percentage of the retrieved documents are relevant to the query
Recall: what percentage of the documents relevant to the query were retrieved
37
Why is System Evaluation Needed?
There are many retrieval systems on the market, which one is the best?
When the system is in operation, is the performance satisfactory? Does it deviate from the expectation?
To fine tune a query to obtain the best result (for a particular set of documents and application)
To determine the effects of changes made to an existing system (system A versus system B)
Efficiency: speed
Effectiveness: how good is the result?
38
Difficulties in Evaluating IR Systems
Effectiveness is related to relevancy of items retrieved
Relevancy is not a binary evaluation but a continuous function
Relevancy, from a human judgement standpoint, is:
subjective: depends upon a specific user’s judgement
situational: relates to the user’s requirement
cognitive: depends on human perception and behavior
temporal: changes over time
39
recall = number of relevant documents retrieved / total number of relevant documents

precision = number of relevant documents retrieved / total number of documents retrieved
Retrieval Effectiveness - Precision and Recall
[Figure: within the entire document collection, the set of retrieved documents and the set of relevant documents overlap]

                 relevant                      irrelevant
retrieved        retrieved & relevant          retrieved & irrelevant
not retrieved    not retrieved but relevant    not retrieved & irrelevant
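Precision and recall fall out directly from the set view above; a sketch with toy data:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5}
p = precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
r = recall(retrieved, relevant)     # 2 of 3 relevant were retrieved
```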
40
Precision and Recall
Precision evaluates the correlation of the query to the database; an indirect measure of the completeness of the indexing algorithm
Recall: the ability of the search to find all of the relevant items in the database
Among the three numbers, only two are always available:
total number of items retrieved
number of relevant items retrieved
The total number of relevant items is usually not available
Unfortunately, precision and recall affect each other in opposite directions! Given a system:
Broadening a query will increase recall but lower precision
Increasing the number of documents returned has the same effect
41
Total Number of Relevant Items
Problem: which documents are actually relevant, and which are not
Usual solution: human judges
Create a corpus of documents and queries, with humans deciding which documents are relevant to which queries
In an uncontrolled environment (e.g., the web), it is unknown
Two possible approaches to get estimates:
Sampling across the database and performing relevance judgement on the returned items
Applying different retrieval algorithms to the same database for the same query; the aggregate of relevant items is taken as the total number of relevant documents in the collection
42
Relationship between Recall and Precision
[Figure: recall (y-axis, 0–1) plotted against precision (x-axis, 0–1). The ideal sits at the top right; the high-recall corner returns most of the relevant documents but includes many junk ones; the high-precision corner returns mostly relevant documents but misses many relevant ones]
43
Fallout Rate
Problems with precision and recall:
A query on “Hong Kong” will return mostly relevant documents, but that doesn’t tell you how good or how bad the system is! (What is the chance that a randomly picked document is relevant to the query?)
The number of irrelevant documents in the collection is not taken into account
Recall is undefined when there is no relevant document in the collection
Precision is undefined when no document is retrieved
fallout = number of nonrelevant items retrieved / total number of nonrelevant items in the collection
A good system should have high recall and low fallout
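The fallout formula as a set computation; a sketch with toy data:

```python
def fallout(retrieved, relevant, collection):
    """Fraction of the collection's nonrelevant items that were
    nevertheless retrieved."""
    nonrelevant = collection - relevant
    return len(retrieved & nonrelevant) / len(nonrelevant)

collection = set(range(10))
relevant = {0, 1}
retrieved = {0, 2, 3}
f = fallout(retrieved, relevant, collection)  # 2 of 8 nonrelevant retrieved
```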
44
Fallout (cont)
Fallout can be viewed as the inverse of recall
It is very unlikely to have a 0/0 situation, since the number of nonrelevant items in a collection can safely be assumed to be non-zero
Fallout is the probability that a nonrelevant item is retrieved (recall is the probability that a relevant item is retrieved)
Among three measures, precision, recall and fallout, fallout is least sensitive to the accuracy of the search process
A good system should have high recall and low fallout
45
Computation of Recall and Precision

Suppose the total no. of relevant docs = 5

 n   doc #   relevant   recall   precision
 1   588        x        0.2       1.00
 2   589        x        0.4       1.00
 3   576                 0.4       0.67
 4   590        x        0.6       0.75
 5   986                 0.6       0.60
 6   592        x        0.8       0.67
 7   984                 0.8       0.57
 8   988                 0.8       0.50
 9   578                 0.8       0.44
10   985                 0.8       0.40
11   103                 0.8       0.36
12   591                 0.8       0.33
13   772        x        1.0       0.38
14   990                 1.0       0.36

R = 1/5 = 0.2; P = 1/1 = 1
R = 2/5 = 0.4; P = 2/2 = 1
R = 2/5 = 0.4; P = 2/3 = 0.67
R = 5/5 = 1; P = 5/13 = 0.38
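The running recall/precision columns can be recomputed from the ranked relevance marks (the x’s in the table, with 5 relevant documents in total); a sketch:

```python
def recall_precision_at_ranks(ranked_relevant, total_relevant):
    """For each rank n, return (recall, precision) over the top n docs."""
    rows, found = [], 0
    for n, is_rel in enumerate(ranked_relevant, start=1):
        found += is_rel
        rows.append((found / total_relevant, found / n))
    return rows

# 1 = relevant (an "x" in the table), 0 = nonrelevant, top to bottom.
marks = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
rows = recall_precision_at_ranks(marks, total_relevant=5)
```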
46
Computation of Recall and Precision

[Figure: the recall/precision pairs from the table plotted as a curve, precision (y-axis, 0.2–1.0) vs. recall (x-axis, 0.2–1.0), with points labelled by rank n = 1, 2, 3, …, 13, 14; precision falls as recall rises, with jumps at ranks where a relevant document appears]
47
Compare Two or More Systems

Computing recall and precision values for two or more systems
Superimposing the results in the same graph
The curve closest to the upper right-hand corner of the graph indicates the best performance
TREC (Text REtrieval Conference) benchmark

[Figure: precision (y-axis, 0–1) vs. recall (x-axis, 0.1–1.0) curves for two systems, labelled “Stem” and “Thesaurus”]
48
Web Crawling
Web crawlers are programs that locate and gather information on the Web
Recursively follow hyperlinks present in known documents to find other documents, starting from a seed set of documents
Fetched documents are handed over to an indexing system
They can be discarded after indexing, or stored as a cached copy
Crawling the entire Web would take a very large amount of time
Search engines typically cover only a part of the Web, not all of it, and take months to perform a single crawl
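A breadth-first crawl can be sketched over an in-memory link graph standing in for real HTTP fetching; all names are illustrative:

```python
from collections import deque

def crawl(link_graph, seeds, limit=100):
    """BFS from a seed set; returns pages in the order they were 'fetched'."""
    seen, frontier, order = set(seeds), deque(seeds), []
    while frontier and len(order) < limit:
        url = frontier.popleft()
        order.append(url)                    # fetch; hand over to the indexer
        for out in link_graph.get(url, []):  # links found in the page
            if out not in seen:
                seen.add(out)
                frontier.append(out)
    return order

graph = {"seed": ["a", "b"], "a": ["c", "seed"], "b": ["c"]}
visited = crawl(graph, ["seed"])
```

A real crawler replaces the graph lookup with an HTTP fetch plus link extraction, persists the frontier in a database, and runs many such loops in parallel, as the next slide describes.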
49
Web Crawling (Cont.)
Crawling is done by multiple processes on multiple machines, running in parallel
The set of links to be crawled is stored in a database
New links found in crawled pages are added to this set, to be crawled later
The indexing process also runs on multiple machines
It creates a new copy of the index instead of modifying the old index
The old index is used to answer queries
After a crawl is “completed”, the new index becomes the “old” index
Multiple machines are used to answer queries
Indices may be kept in memory
Queries may be routed to different machines for load balancing
50
Browsing
Storing related documents together in a library facilitates browsing; users can see not only the requested document but also related ones.
Browsing is facilitated by classification system that organizes logically related documents together.
Organization is hierarchical: classification hierarchy
51
A Classification Hierarchy For A Library System
52
Classification DAG
Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important.
The classification hierarchy is thus a Directed Acyclic Graph (DAG)
53
A Classification DAG For A Library Information Retrieval System
54
Web Directories
A Web directory is just a classification directory on Web pages, e.g. Yahoo! Directory, the Open Directory project
Issues:
What should the directory hierarchy be?
Given a document, which nodes of the directory are categories relevant to the document
Often done manually
Classification of documents into a hierarchy may be done based on term similarity
55
Computational Creativity is a small sub-field of artificial intelligence
Its focus is the study and support, through computational methods, of behaviour which, in humans, would be deemed “creative”
Ranging from intelligent digital libraries to systems which create music, art, scientific theories, etc.
What is Computational Creativity?
56
Work in CCC falls into various categories:
literary forensics (Dr Peter Smith / Dr Gea De Jong)
computer-based musicology (Dr Geraint Wiggins / Tim Crawford / David Lewis / Michael Gale)
intelligent digital signal & score processing (Dr Michael Casey / Dr Geraint Wiggins / Dr Darrell Conklin / Dave Meredith / Miguel Ferrand)
computational music cognition (Dr Geraint Wiggins / Dr Andrés Melo / Dave Meredith / Marcus Pearce / Miguel Ferrand)
intelligent composition and performance systems (Dr Geraint Wiggins / Dr Darrell Conklin / Dr John Drever / Marcus Pearce / Tak-Shing Chan / Prof Simon Emmerson / Prof Denis Smalley)
formal models of creative systems (Dr Geraint Wiggins)
Work in CCC
57
Content-Based Information Retrieval
Driven by large volumes of multimedia.
Search terabytes of sound and images by similarity.
International standardisation makes it work globally (like the WWW).
58
MPEG-7 International Standard
ISO/IEC/JTC-1/SC29/WG11 [MPEG]
ISO-15938 2001 Part 4 (Audio) [MPEG-7]
Multimedia Content Description Interface
59
Audio Information Retrieval
MPEG-7 Database: a pre-indexed collection of sounds
60

Audio Information Retrieval

[Figure, repeated across slides 60–64 with incremental annotations: an Audio Query passes through Extract (feature extraction from audio) and Segment (partitioning of audio into chunks); Match finds similar chunks of audio in the pre-indexed MPEG-7 Database; the Result List (a sound, a scene, or a list of sounds) feeds a Creativity Support Application, which collects and relates results for the user]
65
Arabic IR
66
Creating the Database
Method 1: Without Morphology
Index the text based on the form of the word
Method 2: With Morphology
Index the text based on the stem of the word
67
Retrieval Systems
Monolingual retrieval system: Arabic query; returns Arabic documents
Cross-lingual retrieval system (Arabic translingual system): English query, translated using an online dictionary with human selection of terms; returns Arabic documents and translations
68
Monolingual retrieval system
[Figure: Arabic query → morphology → retrieve text → display]
69
Monolingual retrieval system

[Screenshot: enter query, select data source, search]
70
Monolingual retrieval system

[Screenshot: list of documents and top document returned]
71
Arabic Translingual System
[Figure: an English query passes through Translate to become an Arabic query, which goes through morphology to retrieve text, then display/translate]
72
Arabic Translingual System

[Screenshot: type the query in English, select the Translate option]
73
Arabic Translingual System

[Screenshot: double-click on any word to see its dictionary entry]
74
Arabic Translingual System

[Screenshot: click on the translation button for a gisting translation]
75
Translingual System
Include syntax in the translation system
Expand the bi-directional dictionaries
Improve the onomasticon
Perform automatic disambiguation of translated queries in the cross-language system (necessary for TREC-9); using an ontology?
Participate in TREC
76
Distributed IR
[Figure: an information need is sent to Engine 1, Engine 2, Engine 3, Engine 4, …, Engine n, and the results must be merged]

Common scenarios:
• Multiple partitions, single service
• Independent engines, single organization
• Independent engines, affiliated organizations
• Independent engines, unaffiliated organizations

Defining dimensions:
• Cooperative vs. uncooperative engines
• Centralized vs. decentralized solutions