The Development of a Search Engine & Comparison According to Algorithms
20032017 Sung-soo Kim
The final report
Contents
- Topic
- Development environment
- Procedure
- Retrieval system design
- Comparing performance
- Conclusion
- Future work
- Reference
Topic
Design an information retrieval system to compare the performance of retrieval approaches such as vector modeling, Boolean retrieval, and natural-language queries.
Development environment
- OS: Red Hat Linux
- System: Pentium 2.4G, Windows XP
- Language: C with the gcc compiler
- Interface: executed from the console command line
Procedure
I. Extracting the position of the text information from the raw files.
II. Extracting the keywords (index terms) from the text.
III. Making the index file.
IV. Gathering and sorting the index files.
V. Getting information about the index.
VI. Boolean retrieval.
VII. Natural language retrieval using the vector model.
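The keyword extraction, index building, and sorting steps above can be illustrated with a minimal C sketch. It is not the report's actual code: the names `IndexEntry`, `add_term`, and `build_index`, the fixed limits, and the rule that a keyword is any run of alphabetic characters are all assumptions made for illustration, and the index is kept in memory rather than written to a file.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TERMS 256   /* hypothetical limit for this sketch */
#define MAX_TERM_LEN 32 /* longer words are truncated */

/* One index entry: a term and its frequency (tf) in the document. */
typedef struct {
    char term[MAX_TERM_LEN];
    int tf;
} IndexEntry;

/* Comparison callback for qsort: alphabetical order by term. */
static int compare_entries(const void *a, const void *b)
{
    return strcmp(((const IndexEntry *)a)->term,
                  ((const IndexEntry *)b)->term);
}

/* Record one occurrence of term; returns the updated entry count. */
static int add_term(IndexEntry *entries, int count, const char *term)
{
    for (int i = 0; i < count; i++) {
        if (strcmp(entries[i].term, term) == 0) {
            entries[i].tf++;
            return count;
        }
    }
    if (count < MAX_TERMS) { /* silently drop terms beyond the limit */
        strcpy(entries[count].term, term);
        entries[count].tf = 1;
        count++;
    }
    return count;
}

/*
 * Split text into lowercase alphabetic words, count each word, and
 * sort the entries by term. Returns the number of distinct terms.
 */
int build_index(const char *text, IndexEntry *entries)
{
    assert(text != NULL && entries != NULL);
    char word[MAX_TERM_LEN];
    int len = 0, count = 0;

    for (const char *p = text;; p++) {
        if (*p != '\0' && isalpha((unsigned char)*p)) {
            if (len < MAX_TERM_LEN - 1)
                word[len++] = (char)tolower((unsigned char)*p);
        } else {
            if (len > 0) { /* a word just ended: add it to the index */
                word[len] = '\0';
                count = add_term(entries, count, word);
                len = 0;
            }
            if (*p == '\0')
                break;
        }
    }
    qsort(entries, (size_t)count, sizeof entries[0], compare_entries);
    return count;
}
```

The linear search in `add_term` keeps the sketch short; a real indexer over many documents would more likely use a hash table or sort and merge per-document index files, as step IV suggests.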
Retrieval system design (1)
[Diagram: the information retrieval system is divided into Index (extracting the index, storing data), Retrieval (Boolean retrieval, natural language retrieval), User Interface (query, displaying the results), and Assess performance (assess).]
Retrieval system design (2)
Comparing performance (1)
$$\mathrm{SIM}(D_i, D_j) = \sum_{k=1}^{n} W_{ik} W_{jk}$$
- Where the weights W_ik are simple frequency counts.
- The problem with this simple measure is that it is not normalized to account for variances in the length of documents.
  - This might be corrected by dividing each frequency count by the length of the document.
  - It may also be corrected by dividing each frequency count by the maximum frequency count for the document.
- Additional normalization is often performed to force all similarity values to the range between 0 and 1.
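As a rough illustration of the inner-product similarity and the two frequency normalizations described on this slide, here is a small C sketch. The function names are hypothetical, the "length of the document" is assumed to be the sum of its frequency counts, and the further normalization to the range 0 to 1 is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Inner product of two weight vectors: the sum over k of wi[k] * wj[k]. */
double sim_inner_product(const double *wi, const double *wj, size_t n)
{
    assert(wi != NULL && wj != NULL);
    double sum = 0.0;
    for (size_t k = 0; k < n; k++)
        sum += wi[k] * wj[k];
    return sum;
}

/*
 * Divide each frequency count by the document length, in place.
 * Assumption: the length is the sum of all counts in the vector.
 */
void normalize_by_length(double *w, size_t n)
{
    assert(w != NULL);
    double total = 0.0;
    for (size_t k = 0; k < n; k++)
        total += w[k];
    if (total > 0.0)
        for (size_t k = 0; k < n; k++)
            w[k] /= total;
}

/* Divide each frequency count by the largest count, in place. */
void normalize_by_max(double *w, size_t n)
{
    assert(w != NULL);
    double max = 0.0;
    for (size_t k = 0; k < n; k++)
        if (w[k] > max)
            max = w[k];
    if (max > 0.0)
        for (size_t k = 0; k < n; k++)
            w[k] /= max;
}
```

With either normalization applied to both vectors first, a long document no longer scores higher merely because its raw counts are larger.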
Comparing performance (2)
Comparing performance (3)
Comparing performance (4)
But we used the following, different equations:
- Similarity:
  $$\mathrm{SIM}(D_i, D_j) = \sum_{k=1}^{n} W_{ik} W_{jk}$$
- Weighted value for an index term in a document: [equation for W_di garbled in extraction; the surviving terms are tf, dl, avgdl, N, n, a logarithm, and the constants 0.3 and 1.0]
- Weighted value for the query: [equation garbled in extraction; the surviving term is qtf]
Executing the system (indexing)
Executing the system (boolean)
Executing the system (natural_query)
Conclusion
Boolean:
- Easy for the user to compose queries and for the computer to process them.
- Cannot sort the documents by similarity for ranking.
- Only finds the documents that exactly match the user's query.
Vector:
- Calculates the similarity between the query and each document's index.
- Can retrieve the documents that satisfy a similarity threshold defined by the user.
Future work
- Both boolean and natural_query have their limits, because they are based on structural concepts (string matching).
- Recently, new approaches have been developed that are not structural but semantic: the so-called semantic web.
Reference
- Lee, J.H. (1995), Combining Multiple Evidence from Different Properties of Weighting Schemes, ACM SIGIR Conference on Research and Development in Information Retrieval.
- Harman, D. (1993), Overview of the 1st Text Retrieval Conference, Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
- http://blue.skhu.ac.kr/~mckim/Lecture/IR/Note/hwork.html