Indexing the Albanian Language
The problem: Have search functionality work on a website that's in Albanian.
Intricacies of search
Many think of search as a straightforward process:
“in go search terms, out come results”
It’s not that simple...
Words take on many forms.
Words may have different meanings based on context.
Some words have no real semantic value and must be ignored (stop words).
How do the big guys do it?
No searching through raw content
Search through optimized versions of the raw content (indexing)
Basic indexing process
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'
Normalize the characters (transliteration) and remove punctuation
alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought alice `without pictures or conversation?'
Remove stop words
alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought alice `without pictures or conversation?'
Transform each remaining word to its "basic version" (stemming)
alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' think alice `without pictures or conversation?'
Store the indexed content alongside the original
alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' think alice `without pictures or conversation?'
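The four steps above can be sketched in a few lines of Python. This is only an illustration: the stop-word set and the suffix list are made-up placeholders, the toy `stem` function is not a real stemmer, and the names `normalize`, `index_text` and `make_record` are hypothetical.

```python
import string
import unicodedata

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "to", "and", "was", "by", "her", "on", "in", "it", "is"}

# Toy suffix list standing in for a real stemmer.
SUFFIXES = ("ing", "ed", "s")


def normalize(text: str) -> str:
    """Lowercase, strip diacritics, and replace ASCII punctuation with spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(" " if ch in string.punctuation else ch for ch in no_marks)


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_text(text: str) -> list[str]:
    """Normalize, drop stop words, then stem what is left."""
    words = normalize(text).split()
    return [stem(w) for w in words if w not in STOP_WORDS]


def make_record(text: str) -> dict:
    """Store the indexed content alongside the original."""
    return {"original": text, "indexed": index_text(text)}
```

The crude suffix stripping produces stems such as `beginn` and `sitt`. That is acceptable for matching as long as the search terms go through exactly the same function.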
Performing the search
the book alice’s sister was reading
Perform the same indexing on the search terms
Search for the indexed search terms in the indexed content
alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' think alice `without pictures or conversation?'
the book alice’s sister was reading
Rank results according to number of occurrences, closeness of terms, and position in the indexed text
alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' think alice `without pictures or conversation?'
the book alice’s sister was reading
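A minimal sketch of the search step, assuming the query and the documents are indexed with the same function. It ranks only by number of occurrences; closeness of terms and position are left out to keep it short. The stop-word set and the names `index_text`, `score` and `search` are hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "was", "a", "of"}


def index_text(text):
    """Lowercase, turn non-alphanumeric characters into spaces, drop stop words."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def score(query, document):
    """Count how often the indexed query terms occur in the indexed document."""
    terms = set(index_text(query))
    counts = Counter(index_text(document))
    return sum(counts[t] for t in terms)


def search(query, documents):
    """Return the matching documents, highest occurrence count first."""
    scored = [(score(query, doc), doc) for doc in documents]
    hits = [(s, d) for s, d in scored if s > 0]
    return [d for s, d in sorted(hits, key=lambda pair: pair[0], reverse=True)]
```

A real ranking function would combine this count with the other two signals, for example by weighting matches that sit close together or early in the text.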
Add the Albanian language on top of the problem
No known "stop words" list
Non-trivial stemming process
High irregularity in word formation
Vast number of forms for each single word
Just a taste of the complexity
Nouns: 6 cases x 2 numbers (singular, plural) x 2 definiteness values (definite, indefinite)
~24 word forms
Verbs: 3 unique word-forming moods (of 6) x 4 unique word-forming tenses (of 8) x 2 voices (active, passive) x 6 conjugative forms
~70 word forms
Looking for solutions
Ideally:
A list of stop words
A (huge) list of all possible word forms for all words in Albanian, linked to their stem form.
Looking for solutions
Sources:
The Dictionary: highly comprehensive, only base word forms
The Internet: not too comprehensive, many word forms, potential errors
Hybrid source
a probability-based model picking (hopefully) the best from both sources
Data mining: Stop words
Get as many texts in Albanian as possible (the more diverse, the better)
Transliterate the texts
Keep a running count of the occurrence for each word
Sort the list by occurrence count (highest first).
Stop words will float to the top.
Manually white-list obvious false positives
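The counting approach above might look roughly like this in Python. The three sample sentences are a tiny made-up corpus, far too small for real mining, and the names `transliterate`, `count_words` and `stop_word_candidates` are hypothetical.

```python
import unicodedata
from collections import Counter

# Tiny made-up corpus; a real run would use many diverse texts.
SAMPLE_TEXTS = [
    "Unë jam në shtëpi.",
    "Ti je në shkollë.",
    "Ai është në punë dhe në shtëpi.",
]


def transliterate(text):
    """Lowercase and strip diacritics (ë -> e, ç -> c)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_words(texts):
    """Keep a running count of the occurrences of each word."""
    counts = Counter()
    for text in texts:
        cleaned = "".join(ch if ch.isalpha() else " " for ch in transliterate(text))
        counts.update(cleaned.split())
    return counts


def stop_word_candidates(texts, top_n, white_list=frozenset()):
    """Most frequent words first, skipping manually white-listed false positives."""
    ranked = [word for word, _ in count_words(texts).most_common()]
    return [word for word in ranked if word not in white_list][:top_n]
```

How many of the top entries to treat as stop words (`top_n`) is a judgment call that depends on the corpus; the white list is where the manual review step feeds back in.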
Data mining: Stemming
Invert (reverse the letters of) each word from the collected list
Sort the list alphabetically (effectively sorting by suffixes)
Find highest occurring suffixes of 2, 3 and 4 letters
Manually look for false positives and put them in a white list
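As a rough illustration of the suffix mining described above, the sketch below reverses the words to sort them by ending and counts the 2, 3 and 4 letter suffixes. The word list in the usage note is a handful of transliterated forms chosen for the example, and the names `sort_by_suffix`, `suffix_counts` and `top_suffixes` are hypothetical.

```python
from collections import Counter


def sort_by_suffix(words):
    """Reverse each word, sort, and reverse back so similar endings are adjacent."""
    return [w[::-1] for w in sorted(w[::-1] for w in words)]


def suffix_counts(words, lengths=(2, 3, 4)):
    """Count in how many distinct words each 2, 3 and 4 letter suffix occurs."""
    counts = Counter()
    for word in sorted(set(words)):
        for n in lengths:
            if len(word) > n:  # never treat a whole word as a suffix
                counts[word[-n:]] += 1
    return counts


def top_suffixes(words, top_n, white_list=frozenset()):
    """Highest occurring suffixes first, skipping white-listed false positives."""
    ranked = suffix_counts(words).most_common()
    return [suffix for suffix, _ in ranked if suffix not in white_list][:top_n]
```

For example, with `["librat", "librave", "qytetet", "shkollat", "shkollave"]` the suffixes `at`, `ve` and `ave` each occur in two words and come out on top. Counting directly on word endings gives the same result as scanning the reversed, sorted list; the sorted view is mainly useful for the manual review.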
The (basic) indexing algorithm
Transliterate the input text
Find and remove all stop words
Go through each word and remove the found suffixes (largest to smallest)
https://github.com/andrixh/index-albanian
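The three steps of the basic algorithm can be sketched as follows. This is not the code from the repository above, only an illustration under assumptions: the stop-word and suffix lists are short hypothetical placeholders for the mined lists, and `min_stem` is an added guard that is not part of the slides.

```python
import unicodedata

# Hypothetical lists for illustration; real ones would come from the data mining.
STOP_WORDS = {"ne", "dhe", "e", "te", "nga"}
SUFFIXES = ["ave", "at", "ve", "et"]


def transliterate(text):
    """Lowercase and strip diacritics (ë -> e, ç -> c)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_suffix(word, suffixes=SUFFIXES, min_stem=3):
    """Remove the longest matching suffix, trying the largest suffixes first."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def index_albanian(text):
    """Transliterate, remove stop words, then strip suffixes from each word."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in transliterate(text))
    return [strip_suffix(w) for w in cleaned.split() if w not in STOP_WORDS]
```

Trying the largest suffixes first matters: `librave` loses `ave` and becomes `libr`, the same stem that `librat` yields, whereas trying `ve` first would leave `libra`.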