Indexing the Albanian Language by Andri Xhitoni, May 2015

Indexing the Albanian Language

The problem: Have search functionality work on a website that's in Albanian.

Intricacies of search

Many think of search as a straightforward process:

“in go search terms, out come results”

It’s not that simple...

Words take on many forms.

Words may have different meanings based on context.

Some words have no real semantic value and must be ignored (stop words).

How do the big guys do it?

No searching through raw content.

Search through optimized versions of the raw content (indexing).

Basic indexing process

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'

Normalize the characters (transliteration) and remove punctuation:

alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought alice `without pictures or conversation?'

Remove stop words.

Transform each remaining word to its "basic version" (stemming):

alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' think alice `without pictures or conversation?'

Store the indexed content alongside the original.
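The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the talk: the names `normalize` and `index_text` are hypothetical, and the stop-word set and stem table are tiny hand-made assumptions chosen only to fit the Alice sentence.

```python
import string
import unicodedata

# Hypothetical, deliberately tiny word lists, for illustration only.
STOP_WORDS = {"the", "a", "of", "to", "and", "was", "by", "her", "on", "it", "in"}
STEMS = {"beginning": "begin", "sitting": "sit", "thought": "think", "pictures": "picture"}

# Map every ASCII punctuation character to a space.
PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text):
    """Lowercase, strip diacritics, and replace ASCII punctuation with spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_marks.translate(PUNCTUATION_TO_SPACE)


def index_text(text):
    """Return the original text together with its indexed form."""
    words = [w for w in normalize(text).split() if w not in STOP_WORDS]
    stemmed = [STEMS.get(w, w) for w in words]  # unknown words are kept as they are
    return {"original": text, "indexed": " ".join(stemmed)}


record = index_text("Alice was beginning to get very tired of sitting by her sister on the bank,")
print(record["indexed"])
```

Keeping the original next to the indexed form means the search runs against the indexed text while the reader is still shown the untouched original.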

Performing the search

the book Alice’s sister was reading

Perform the same indexing on the search terms:

the book alice’s sister was reading

Search for the indexed search terms in the indexed content.

Rank results according to number of occurrences, closeness of terms, and position in the indexed text.

the book alice’s sister was reading (occurrence counts shown for the search terms: 2, 2, 1, 1)
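As a rough sketch of the ranking step, the function below scores each indexed document only by how often the indexed search terms occur in it. Closeness of terms and position in the text are left out to keep it short. The function name `rank` and the two sample documents are assumptions made for this illustration.

```python
from collections import Counter


def rank(query_terms, documents):
    """Score each indexed document by how often the query terms occur in it.

    `documents` maps a document id to its indexed text (a space-separated string).
    Documents without any matching term are left out of the result.
    """
    scores = {}
    for doc_id, indexed_text in documents.items():
        counts = Counter(indexed_text.split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical indexed documents and already-indexed search terms.
documents = {
    "alice": "alice begin get tired sit sister bank book sister read",
    "rabbit": "rabbit book watch",
    "empty": "nothing relevant here",
}
print(rank(["book", "alice", "sister", "read"], documents))
```

A fuller version would add a bonus when the matched terms sit close together or appear early in the indexed text.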

Add the Albanian language on top of the problem

No known "stop words" list

Non-trivial stemming process

High irregularity in word formation

Vast number of forms for each single word

Just a taste of the complexity

Nouns: 6 cases x 2 numbers (singular, plural) x 2 definiteness values (definite, indefinite)

~24 word forms

Verbs: 3 unique word-forming moods (of 6) x 4 unique word-forming tenses (of 8) x 2 voices (active, passive) x 6 conjugative forms

~70 word forms

Looking for solutions

Ideally:

A list of stop words

A (huge) list of all possible word forms for all words in Albanian, linked to their stem form.

Sources:

The Dictionary

highly comprehensive

only base word forms

The Internet

not too comprehensive

many word forms

potential errors

Hybrid source

a probability-based model picking (hopefully) the best from both sources

Data mining: Stop words

Get as many texts in Albanian as possible (the more diverse, the better).

Transliterate the texts.

Keep a running count of the occurrences of each word.

Sort the list by occurrence count (highest first).

Stop words will float to the top.

Manually white-list obvious false positives.
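The counting procedure above might look like the following sketch. It assumes that transliteration only needs to map the two Albanian letters with diacritics (ë, ç) to plain ASCII. The function names, the sample sentences, and the white-list argument are hypothetical and not taken from the linked repository.

```python
import re
from collections import Counter

# Assumption: only the two Albanian letters with diacritics need mapping.
TRANSLITERATION = str.maketrans({"ë": "e", "ç": "c"})


def transliterate(text):
    """Lowercase the text and replace ë and ç with e and c."""
    return text.lower().translate(TRANSLITERATION)


def stop_word_candidates(texts, top_n, white_list=frozenset()):
    """Return the top_n most frequent words that are not on the white list."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", transliterate(text)))
    # most_common() sorts by occurrence count, highest first.
    ranked = [word for word, _ in counts.most_common() if word not in white_list]
    return ranked[:top_n]


# Two made-up sample sentences; a real run would use a large, diverse corpus.
texts = [
    "Unë jam në shtëpi dhe ti je në shkollë",
    "Libri është në tavolinë dhe në çantë",
]
print(stop_word_candidates(texts, top_n=2))
```

The white list is where manually reviewed false positives go: frequent words that still carry meaning and should stay searchable.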

Data mining: Stemming

Reverse each word from the collected list.

Sort the list alphabetically (effectively sorting by suffixes).

Find the most frequently occurring suffixes of 2, 3 and 4 letters.

Manually look for false positives and put them in a white list.
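The suffix-mining steps can be sketched as follows. Reversing and sorting places words that share an ending next to each other, so each run of neighbours can simply be counted. The function name `common_suffixes`, its parameters, and the sample word list are assumptions made for this illustration.

```python
from collections import Counter
from itertools import groupby


def common_suffixes(words, lengths=(2, 3, 4), top_n=5, white_list=frozenset()):
    """Return (suffix, count) pairs for the most frequent suffixes.

    Words are reversed and sorted so that words sharing a suffix become
    neighbours; each run of neighbours is then counted.
    """
    reversed_words = sorted(word[::-1] for word in set(words))
    counts = Counter()
    for n in lengths:
        # Only consider words longer than the suffix, so a stem remains.
        candidates = [w for w in reversed_words if len(w) > n]
        for reversed_suffix, group in groupby(candidates, key=lambda w: w[:n]):
            counts[reversed_suffix[::-1]] = sum(1 for _ in group)
    ranked = [(s, c) for s, c in counts.most_common() if s not in white_list]
    return ranked[:top_n]


# A made-up, already transliterated word list.
words = ["librit", "qytetit", "shtetit", "punes", "shtepise", "rruges"]
print(common_suffixes(words, top_n=3))
```

The white list here plays the same role as in the stop-word step: endings that are frequent but are really part of the stem can be excluded by hand.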

The (basic) indexing algorithm

Transliterate the input text.

Find and remove all stop words.

Go through each word and remove the found suffixes (largest to smallest).

https://github.com/andrixh/index-albanian
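A minimal Python sketch of these three steps is shown below. It is not the code from the linked repository. The stop-word set, suffix list and white list are short hypothetical placeholders for the lists produced by the data-mining steps, and the `min_stem` guard is an added assumption that keeps very short words from being stripped.

```python
import re

# All of the following are hypothetical placeholders for the mined lists.
TRANSLITERATION = str.maketrans({"ë": "e", "ç": "c"})
STOP_WORDS = {"ne", "dhe", "e", "te", "i"}
SUFFIXES = ["it", "in", "ve", "ave"]
WHITE_LIST = {"limit"}  # words that only look like they carry a suffix


def stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def index_albanian(text):
    """Transliterate, drop stop words, then stem each remaining word."""
    words = re.findall(r"[a-z]+", text.lower().translate(TRANSLITERATION))
    kept = [w for w in words if w not in STOP_WORDS]
    return [w if w in WHITE_LIST else stem(w) for w in kept]


print(index_albanian("Historia e qytetit dhe e librave"))
```

Trying the longest suffix first matters: with the lists above, "librave" loses "ave" rather than only "ve", so it ends up with the same stem as other forms of the word.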

Thank you!

https://github.com/andrixh/index-albanian