The Development of a Search Engine & Comparison According to Algorithms
20032017 Sung-soo Kim
The final report


Topic: Design an information retrieval system to compare the performance of retrieval methods such as vector modeling, Boolean, and natural-language query.


Contents
- Topic
- Development environment
- Procedure
- Retrieval system design
- Comparing performance
- Conclusion
- Future work
- Reference

Topic
Design an information retrieval system to compare the performance of retrieval methods such as vector modeling, Boolean, and natural-language query.

Development environment
- OS: Red Hat Linux
- System: Pentium 2.4G, Windows XP
- Language: C with the gcc compiler
- Interface: runs from the console command line

Procedure
I. Extract the position of the text information from the raw files.
II. Extract the keywords (index terms) from the text.
III. Make the index file.
IV. Gather and sort the index files.
V. Get the index information.
VI. Boolean retrieval.
VII. Natural-language retrieval using the vector model.
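Steps II to IV above can be sketched in C as follows. This is a minimal sketch rather than the report's actual code: the `Posting` layout, the `MAX_TERM` limit, the function names, and the rule that a keyword is any run of letters or digits are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TERM 32 /* assumed maximum keyword length, including the NUL */

/* One index entry: a keyword and the document it came from (hypothetical layout). */
typedef struct {
    char term[MAX_TERM];
    int doc_id;
} Posting;

/* Step II: split text into lowercase alphanumeric keywords and append one
 * posting per keyword to out. Keywords longer than MAX_TERM - 1 characters
 * are truncated; keywords that do not fit in capacity are dropped.
 * Returns the new number of postings. */
size_t extract_postings(const char *text, int doc_id,
                        Posting *out, size_t count, size_t capacity)
{
    assert(text != NULL && out != NULL);
    size_t len = 0;
    for (const char *p = text; ; ++p) {
        unsigned char c = (unsigned char)*p;
        if (isalnum(c)) {
            if (len < MAX_TERM - 1 && count < capacity)
                out[count].term[len++] = (char)tolower(c);
        } else {
            if (len > 0 && count < capacity) {
                out[count].term[len] = '\0';
                out[count].doc_id = doc_id;
                ++count;
            }
            len = 0;
            if (c == '\0')
                break;
        }
    }
    return count;
}

/* Order postings by keyword, then by document id. */
static int compare_postings(const void *a, const void *b)
{
    const Posting *pa = a;
    const Posting *pb = b;
    int c = strcmp(pa->term, pb->term);
    if (c != 0)
        return c;
    return (pa->doc_id > pb->doc_id) - (pa->doc_id < pb->doc_id);
}

/* Step IV: sort the gathered postings so equal keywords become adjacent. */
void sort_postings(Posting *postings, size_t count)
{
    qsort(postings, count, sizeof postings[0], compare_postings);
}
```

Calling `extract_postings` once per document and then `sort_postings` groups identical keywords together, which is the property a later lookup step (step V) would rely on.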

Retrieval system design (1)
Information retrieval system
- Index: extracting index, store data
- Retrieval: Boolean retrieval, natural-language retrieval
- User interface: query, display the results
- Assess performance: assess

Retrieval system design (2)

Comparing performance (1)

$$\mathrm{SIM}(D_i, D_j) = \sum_{k=1}^{n} W_{ik} \, W_{jk}$$

- Here the weights W_ik are simple frequency counts.
- The problem with this simple measure is that it is not normalized to account for variation in the length of documents.
  - This might be corrected by dividing each frequency count by the length of the document.
  - It may also be corrected by dividing each frequency count by the maximum frequency count for the document.
- Additional normalization is often performed to force all similarity values into the range between 0 and 1.
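The inner-product measure and the two length corrections can be sketched in C as below. The function names are hypothetical, "length of the document" is assumed to mean the sum of its frequency counts, and the Dice coefficient is used as one possible normalization into the range 0 to 1; the slides do not say which normalization the report applied.

```c
#include <assert.h>
#include <stddef.h>

/* Unnormalized similarity: the sum over k of W_ik * W_jk. */
double inner_product(const double *wi, const double *wj, size_t n)
{
    assert(wi != NULL && wj != NULL);
    double sum = 0.0;
    for (size_t k = 0; k < n; ++k)
        sum += wi[k] * wj[k];
    return sum;
}

/* Correction 1: divide each count by the document length, taken here
 * (an assumption) to be the sum of all counts in the document. */
void normalize_by_length(double *w, size_t n)
{
    assert(w != NULL);
    double length = 0.0;
    for (size_t k = 0; k < n; ++k)
        length += w[k];
    if (length > 0.0)
        for (size_t k = 0; k < n; ++k)
            w[k] /= length;
}

/* Correction 2: divide each count by the largest count in the document. */
void normalize_by_max(double *w, size_t n)
{
    assert(w != NULL);
    double max = 0.0;
    for (size_t k = 0; k < n; ++k)
        if (w[k] > max)
            max = w[k];
    if (max > 0.0)
        for (size_t k = 0; k < n; ++k)
            w[k] /= max;
}

/* Dice coefficient: 2 * (Di . Dj) / (Di . Di + Dj . Dj).
 * For non-negative weights the result lies between 0 and 1. */
double dice_similarity(const double *wi, const double *wj, size_t n)
{
    double denom = inner_product(wi, wi, n) + inner_product(wj, wj, n);
    return denom > 0.0 ? 2.0 * inner_product(wi, wj, n) / denom : 0.0;
}
```

With raw counts, a long document scores higher simply because its counts are larger; either correction removes that bias before the vectors are compared.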

Comparing performance (2)

Comparing performance (3)

Comparing performance (4)

However, we used the following different equations:
- Similarity:

$$\mathrm{SIM}(D_i, D_j) = \sum_{k=1}^{n} W_{ik} \, W_{jk}$$

- Weighted value for an index term in a document: [Equation: W_di, with the terms tf, dl, avgdl, N, n, a logarithm, and the constants 0.3 and 1.0]
- Weighted value for the query: [Equation: query weight, given in terms of qtf]

Executing the system (indexing)

Executing the system (boolean)

Executing the system (natural_query)

Conclusion
Boolean:
- Easy for the user to compose and for the computer to process.
- Cannot sort documents by similarity for ranking.
- Only finds documents that exactly match the user's query.
Vector:
- Calculates the similarity between the query and each document's index.
- Can retrieve the documents that satisfy a similarity threshold defined by the user.
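The contrast drawn above can be illustrated with a short C sketch. It assumes that each keyword maps to an ascending list of document ids and that similarity scores have already been computed; the names `boolean_and`, `Scored`, and `rank_above_threshold` are hypothetical and not taken from the report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Boolean AND: intersect two ascending lists of document ids.
 * out must have room for min(na, nb) ids. Returns the number of matches.
 * A document either matches or not; there is no score to rank by. */
size_t boolean_and(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            ++i;
        } else if (a[i] > b[j]) {
            ++j;
        } else {
            out[n++] = a[i];
            ++i;
            ++j;
        }
    }
    return n;
}

/* A document id paired with its similarity to the query. */
typedef struct {
    int doc_id;
    double score;
} Scored;

/* Order scored documents from highest to lowest score. */
static int compare_scores_desc(const void *a, const void *b)
{
    const Scored *sa = a;
    const Scored *sb = b;
    return (sb->score > sa->score) - (sb->score < sa->score);
}

/* Vector retrieval: keep the documents whose score reaches the user's
 * threshold, compact them to the front of docs, and rank them by score.
 * Returns the number of documents kept. */
size_t rank_above_threshold(Scored *docs, size_t count, double threshold)
{
    assert(docs != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < count; ++i)
        if (docs[i].score >= threshold)
            docs[kept++] = docs[i];
    qsort(docs, kept, sizeof docs[0], compare_scores_desc);
    return kept;
}
```

The Boolean routine returns an unordered yes-or-no result set, while the vector routine returns a ranked list whose size depends on the threshold the user chooses.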

Future work
- Both boolean and natural_query retrieval are limited in the relevance of their results, because they are based on structural concepts (streaming match).
- Recently, new approaches have been developed that are semantic rather than structural: the so-called semantic web.

Reference
- Lee, J. H. (1995), Combining Multiple Evidence from Different Properties of Weighting Schemes, ACM SIGIR Conference on Research and Development in Information Retrieval.
- Harman, D. (1993), Overview of the 1st Text Retrieval Conference, Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
- http://blue.skhu.ac.kr/~mckim/Lecture/IR/Note/hwork.html