31
A Full-Text Search Algorithm for Long Queries Alexey Kolosoff, Michael Bogatyrev 1 Tula State University Faculty of Cybernetics Laboratory of Information Systems

A Full-Text Search Algorithm for Long Queries

  • Upload
    patch

  • View
    41

  • Download
    0

Embed Size (px)

DESCRIPTION

Laboratory of Information Systems. Tula State University Faculty of Cybernetics. A Full-Text Search Algorithm for Long Queries. Alexey Kolosoff, Michael Bogatyrev. Table of Contents. Problem statement Suggested algorithm Queries processing Documents ranking Experimental results. - PowerPoint PPT Presentation

Citation preview

Page 1: A Full-Text Search Algorithm for Long Queries

1

A Full-Text Search Algorithmfor Long Queries

Alexey Kolosoff, Michael Bogatyrev

Tula State UniversityFaculty of CyberneticsLaboratory of

Information Systems

Page 2: A Full-Text Search Algorithm for Long Queries

2

Table of ContentsProblem statementSuggested algorithmQueries processingDocuments rankingExperimental results

Page 3: A Full-Text Search Algorithm for Long Queries

3

Problem statementSuggested algorithmQueries processingDocuments rankingExperimental results

Page 4: A Full-Text Search Algorithm for Long Queries

4

EnvironmentA question-answer portal is considered.

Answers are produced by technical support persons.

Several existing databases may contain the needed answer for a question.

The task is to decrease the workload of the support team.

Page 5: A Full-Text Search Algorithm for Long Queries

5

Workflow

Customer

Web form Support

Customer

Web form Search Support

Search doesn’t help

Before:

After:

Search helps

Page 6: A Full-Text Search Algorithm for Long Queries

6

Data Processing

Q&A database

Forums

E-mail

Web form

Documents database

(help, FAQ, etc.)

Support team

Search system

Customers’ questions(natural language text)

Links to documents

Input Output

Page 7: A Full-Text Search Algorithm for Long Queries

7

Page 8: A Full-Text Search Algorithm for Long Queries

8

Using Message Subject for Search - Results

Page 9: A Full-Text Search Algorithm for Long Queries

9

Why it’s not a Typical Web Search

Queries consist of multiple sentences, instead of several keywords

The number of documents is not very big (tens of thousands)

Indexed documents consider a single subject or several related subjects

Page 10: A Full-Text Search Algorithm for Long Queries

10

Problem statementSuggested algorithmQueries processingDocuments rankingExperimental results

Page 11: A Full-Text Search Algorithm for Long Queries

11

The Suggested Algorithm

Build CG for input text, filter out unrelated

words

Get concepts mentioned in

the text (context matrix)

Get documents with the same concepts (filter out irrelevant documents)

Rank documents

Page 12: A Full-Text Search Algorithm for Long Queries

12

Advantages of the AlgorithmWords and sentences filtering allows excluding

words and phrases which possibly do not affect the meaning of the text. The task of text search decreases to phrases search.

Using concepts for articles filtering decreases the impact of polysemy on search results.

Getting articles with specific concepts is expected to be faster than searching for articles with specific keywords in the entire corpus.

Page 13: A Full-Text Search Algorithm for Long Queries

13

Problem statementSuggested algorithmQueries processingDocuments rankingExperimental results

Page 14: A Full-Text Search Algorithm for Long Queries

14

Queries Processing

Noise words filtering

Phrases detection

Word forms expansion (with lesser weight)

Synonyms expansion (with lesser weight)

Page 15: A Full-Text Search Algorithm for Long Queries

15

Phrases Detection – Punctuation Marks (During Indexing)

Examples:

issue-tracking tools => [N, N + 0.25]...the issue, but{stop word} tracking

changes... => [N, N + 3]Object.Method() =>

Object[N], Method[N + 0.25]…some object. Method A shows… =>

object[N], Method[N + 15]

Page 16: A Full-Text Search Algorithm for Long Queries

16

Phrases Detection - Semantics

Despite possible errors, users tend to use correct word combinations for technical details description.

A conceptual graph build from a question’s text allows filtering out unrelated words and word combinations which are not grammatically correct.

Page 17: A Full-Text Search Algorithm for Long Queries

Begin

Input text

Recognize language

Parsing sentences

For all sentences

Parsing sent. elements

For all sent.

elements

Normal word forms and

morphologyLanguage?

Concepts & relations in

Russian

Concepts & relations in

English

Output CG

End

EnglishRussian

1

2

3

4

5

6

7

89

9 10

11

12

Morphological analysis

• word formation paradigms from Russian & English languages• using dictionaries Semantic analysis

• semantic role labelling• using templates

How Conceptual Graphs are Built

Page 18: A Full-Text Search Algorithm for Long Queries

18

Sample Query“Hi there!

i have a script test with a bunch of checkpoints, but when it hits a checkpoint cannot be verified, the execution of the script stops and any tests after the failed checkpoint do not get executed. Thank's in advance.Randy”

Page 19: A Full-Text Search Algorithm for Long Queries

19

A Conceptual Graph Fragment

Page 20: A Full-Text Search Algorithm for Long Queries

20

Filtered out Sentences

Page 21: A Full-Text Search Algorithm for Long Queries

21

Parsing results

Phrases used as an input for full-text search.

Each phrase has its own weight.

As a result, the task of searching for a given text can be reduced to searching for a number of phrases. This task can be solved via the suggested algorithm.

Page 22: A Full-Text Search Algorithm for Long Queries

22

Problem statementSuggested algorithmQueries processingDocuments rankingExperimental results

Page 23: A Full-Text Search Algorithm for Long Queries

23

Document ModelVector model is used to represent an indexed document or a query:

The native methods of the control =>[the (0.333), native(0.166), methods(0.166), of (0.166), control(0.166)]

[0, 5] [1] [2] [3] [4]

Page 24: A Full-Text Search Algorithm for Long Queries

24

The Sought-for Phrase is Present in an Indexed Document if…

...(distance between each words pair) < M, where M is the artificial word position increment value for sentence breaks.

Sample query: AJAX applications testing

AJAX web applications are, indeed, difficult for testing. => total words distance: 7

No AJAX applications. Testing desktop applications is another task. => total words distance: 16

Page 25: A Full-Text Search Algorithm for Long Queries

25

Phrases Relevance

N

wdpRdqR i jipjphrase

pi

),(),(

Arithmetical mean for each phrase (pi) detected in a query (q). Where wpi – the weight of the phrase in the query,

Rp – document’s relevance for pi calculated via the following formula:

kkp

n

jipi

it

dpR,

22

),(

where – the number of words in pi,

– total words distance in a document (dj), calculated for each occurrence of pi.

itn

kpi ,

Page 26: A Full-Text Search Algorithm for Long Queries

26

Resulting Relevance

)),(),( ( jphrasefieldj dqRWdqR k

where Rphrase – phrases relevance,

Wfield – indexed field’s weight

Page 27: A Full-Text Search Algorithm for Long Queries

27

Problem statementSuggested algorithmQueries processingDocuments rankingExperimental results

Page 28: A Full-Text Search Algorithm for Long Queries

28

Performance Measurement Formula

p

i

ip i

relrelDCG

2 21 log

where reli – assessor-defined relevance [0..2],i – the result’s order number,p = 10 – the number of considered results

Page 29: A Full-Text Search Algorithm for Long Queries

29

Experimental ResultsThe average quality of «top 10» search results(discounted cumulative gain), max=10,51

Number of queries

5 10 15 20 25 300

1

2

3

4

5

6

7

8

New AlgorithmSQL Server iFTSGoogle

Page 30: A Full-Text Search Algorithm for Long Queries

30

Conclusions

It is necessary to perform phrase search when finding an answer in an automated way.

Conceptual graphs allow detecting phrases in natural language queries.

Storing conceptual graphs instead of document vectors as indexes and using the graphs directly for relevance calculation can be an interesting approach, which will be examined in a future work.

Page 31: A Full-Text Search Algorithm for Long Queries

31

Thank you! Any questions?