University of Tehran
FuFaIR: a Fuzzy Farsi Information Retrieval System
Amir Nayyeri, School of Electrical and Computer Engineering, University of Tehran
Farhad Oroumchian, University of Wollongong in Dubai
AICCSA06
Overview
- Persian Language
- Related Work
  - Fuzzy IR
  - Farsi IR
- FuFaIR Explanation
- Experimental Results
- Conclusion and Future Work
Persian Language
- Spoken in several countries (Iran, Afghanistan, Tajikistan, …)
- The language has evolved over the years and has been influenced by many languages
- Contains foreign words from many languages such as Arabic, Turkish, French and English
- In some cases these words still follow the grammatical rules of their original languages, for example:
  "Maktab" مكتب (singular), "MAKATEB" مكاتب (plural)
- In some cases these words can use the grammatical rules of both languages, e.g.:
  "Khabar" خبر (singular), "AKHBAR" اخبار (Arabic plural), "KHABAR-HA" خبرها (Persian plural)
- Morphological analyzers for this language need to deal with many forms of words
Information Retrieval and Natural Language Processing for Persian (Farsi)
- The Faculty of Engineering of the University of Tehran started working on the processing of Persian about 7 years ago.
- For the past 3 years, this has been a joint cooperation between the University of Tehran (UT) and the University of Wollongong in Dubai (UOWD).
- Since then, several thousand experiments on the processing and retrieval of Persian text have been performed.
Test Collections
1. Qavanin Collection
   - Documents: Iranian law collection
   - 177089 passages
   - 41 queries and relevance judgments
2. Hamshahri Collection
   - Documents: 300 MB of news from the Hamshahri newspaper
3. Part of Speech Tagging Collection
   - A tag set of 40 tags
   - 2590000+ tagged words
Natural Language Processing
- Investigating automatic part-of-speech tagging based on machine learning approaches:
  - Probabilistic (Hidden Markov Model)
  - Rule based
  - Entropy based
  - Neural networks
- The best so far has reached 96% accuracy.
Information Retrieval Experiments
All major retrieval models of English text retrieval, and their combinations, have been tested:
- Fuzzy Logic: MMM, Paice
- Vector Space
- Probabilistic: BM25
- N-Grams: N=2, N=3, N=4
- Combinational
With many different term weighting schemes.
Name      Weighting
tf.idf    tf*log(N/n) / (sqrt(Σ tf²) * sqrt(Σ qtf²))
lnc.ltc   (1+log(tf))*(1+log(qtf))*log((1+N)/n) / (sqrt(Σ tf²) * sqrt(Σ qtf²))
nxx.bpx   (0.5+0.5*tf/max tf)+log((N-n)/n)
tfc.nfc   tf*log(N/n)*(0.5+0.5*qtf/max qtf)*log(N/n) / (sqrt(Σ tf²) * sqrt(Σ qtf²))
tfc.nfx1  tf*log(N/n)*(0.5+0.5*qtf/max qtf)*log(N/n) / sqrt(Σ (tf*log(N/n))²)
tfc.nfx2  tf*log(N/n)*(0.5+0.5*qtf/max qtf)*log(N/n) / sqrt(Σ tf²)
Lnu.ltu   ((1+log(tf))*(1+log(qtf))*log((1+N)/n)) / ((1+log(average tf)) * ((1-s) + s * N.U.W / average N.U.W)²)
List of weights that produced the best results
No  System           No  System           No  System
    Fuzzy Logic          Fuzzy Logic          Vector Space
1   paice-tf.idf     11  mmm-tf.idf       20  2gram-Lnu.ltu
2   paice-lnc.ltc    12  mmm-lnc.ltc      21  2gram-tfc.nfx
3   paice-Lnu.ltu    13  mmm-Lnu.ltu      22  2gram-lnc.ltc
4   paice-nxx.bpx    14  mmm-nxx.bpx      23  3gram-Lnu.ltu
5   paice-tfc.nfx1   15  mmm-tfc.nfx1     24  3gram-tfc.nfx
6   paice-tfc.nfc    16  mmm-tfc.nfc      25  3gram-lnc.ltc
    Probabilistic        Vector Space
7   BM25             17  vector-Lnu.ltu   26  4gram-Lnu.ltu
8   2gram-BM25       18  vector-tfc.nfx2  27  4gram-tfc.nfx
9   3gram-BM25       19  vector-lnc.ltc   28  4gram-lnc.ltc
10  4gram-BM25
The context of the current work
- Improving the quality of Persian retrieval
- Improving IR systems that use Fuzzy Logic as their retrieval model
Related Work – Fuzzy IR
- Fuzzy logic has been used in IR since the early days.
- But only a few fuzzy approaches could show superiority over classical approaches such as vector space.
- This has also been confirmed for the Persian language.
- The current work has been mostly inspired by one of them:
  D.E. Losada, F.D. Hermida, A. Bugarin, S. Barro. Experiments on using fuzzy quantified sentences in adhoc retrieval. ACM Symposium on Applied Computing, 2004.
Mixed Min & Max – MMM
Calculates the degree of membership of a document to the fuzzy set of the terms in the query as below.
OR query: (قيموميت يا حضانت) (Guardian OR God Parent)
  Q_or = (A1 OR A2 OR A3 OR …)
  SIM(Q_or, D) = C_or1 * max(dA1, dA2, …) + C_or2 * min(dA1, dA2, …)
AND query: (ثبت و املاك) (Registration AND Properties)
  Q_and = (A1 AND A2 AND A3 AND …)
  SIM(Q_and, D) = C_and1 * min(dA1, dA2, …) + C_and2 * max(dA1, dA2, …)
C_and, C_or: softness coefficients
  C_and1 in [0.5, 0.8], C_and2 = 1 - C_and1
  C_or1 > 0.2, C_or2 = 1 - C_or1
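A minimal Python sketch of the MMM similarity for a flat AND or OR query may make the two formulas concrete. The function name `mmm_similarity` and the default coefficients (0.65 and 0.6) are illustrative assumptions chosen inside the ranges stated on the slide; they are not values reported by the authors.

```python
def mmm_similarity(degrees, operator, c_and1=0.65, c_or1=0.6):
    """Mixed Min and Max similarity of one document to a flat query.

    degrees  -- membership degrees dA1, dA2, ... of the document, each in [0, 1]
    operator -- "and" or "or"
    c_and1, c_or1 -- softness coefficients (hypothetical defaults, picked
                     inside the ranges [0.5, 0.8] and > 0.2 from the slide)
    """
    if not degrees:
        raise ValueError("at least one term degree is required")
    lowest, highest = min(degrees), max(degrees)
    if operator == "and":
        # C_and1 * min + C_and2 * max, with C_and2 = 1 - C_and1
        return c_and1 * lowest + (1.0 - c_and1) * highest
    if operator == "or":
        # C_or1 * max + C_or2 * min, with C_or2 = 1 - C_or1
        return c_or1 * highest + (1.0 - c_or1) * lowest
    raise ValueError("operator must be 'and' or 'or'")


if __name__ == "__main__":
    doc_degrees = [0.2, 0.9]
    print(mmm_similarity(doc_degrees, "and"))
    print(mmm_similarity(doc_degrees, "or"))
```

With the coefficients set to 1.0 the function reduces to the plain fuzzy MIN (for AND) and MAX (for OR); the softness coefficients let the other terms contribute as well.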
Paice Model
Calculates the degree of membership of a document to the fuzzy set of terms in the query as below:
AND query: (ثبت و املاك) (Registration AND Properties)
  Q_and = (A1 AND A2 AND A3 AND …)
OR query: (قيموميت يا حضانت) (Guardian OR God Parent)
  Q_or = (A1 OR A2 OR A3 OR …)
$$SIM(Q, D) = \frac{\sum_{i=1}^{n} r^{i-1} \, td_i}{\sum_{i=1}^{n} r^{i-1}}$$
r = 1.0 for AND queries (td_i in ascending order)
r = 0.7 for OR queries (td_i in descending order)
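The Paice similarity is a weighted average of the sorted term degrees, with weights r^(i-1). The sketch below assumes that reading of the slide's formula; the function name `paice_similarity` is hypothetical, and the r values are the ones given on the slide.

```python
def paice_similarity(degrees, operator):
    """Paice-model similarity of one document to a flat AND or OR query.

    degrees  -- term membership degrees td_1 ... td_n of the document
    operator -- "and" (r = 1.0, ascending order) or
                "or"  (r = 0.7, descending order)
    """
    if not degrees:
        raise ValueError("at least one term degree is required")
    if operator == "and":
        r, ordered = 1.0, sorted(degrees)
    elif operator == "or":
        r, ordered = 0.7, sorted(degrees, reverse=True)
    else:
        raise ValueError("operator must be 'and' or 'or'")
    # Weight of the i-th sorted degree is r^(i-1); enumerate starts at 0.
    weights = [r ** i for i in range(len(ordered))]
    weighted_sum = sum(w * d for w, d in zip(weights, ordered))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    print(paice_similarity([0.2, 0.9], "and"))
    print(paice_similarity([0.2, 0.9], "or"))
```

Note that with r = 1.0 all weights are equal, so the AND case is the plain mean of the degrees and the sort order has no effect on the result.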
Comparison of Fuzzy Systems
[Figure: precision (0.00 to 1.00) at document cut-offs P@5, P@10, P@15 and P@20 for paice-tf.idf, paice-lnc.ltc, paice-Lnu.ltu, paice-nxx.bpx, paice-tfc.nfx1, paice-tfc.nfc, mmm-tf.idf, mmm-lnc.ltc, mmm-Lnu.ltu, mmm-nxx.bpx, mmm-tfc.nfx1 and mmm-tfc.nfc. Experiments on Qavanin Collection.]
Probabilistic Systems (BM25)
[Figure: BM25 Final Results. Precision (0 to 1) at 5, 10, 15 and 20 documents for BM25, BM25 2-gram, BM25 3-gram and BM25 4-gram. Experiments on Qavanin Collection.]
Comparison of Vector Space Systems With BM25
[Figure: precision (0 to 1) against document cut-off, with the data table below. Experiments on Qavanin Collection.]
System           P@5   P@10  P@15  P@20
vector-Lnu.ltu   0.91  0.83  0.76  0.74
vector-tfc.nfx2  0.66  0.62  0.54  0.59
vector-lnc.ltc   0.63  0.60  0.58  0.55
BM25             0.77  0.71  0.68  0.66
Comparison of Best Vector Space With Best N-grams
[Figure: precision (0 to 1) at P@5, P@10, P@15 and P@20 for vector-Lnu.ltu, 3gram-Lnu.ltu, 4gram-Lnu.ltu, paice-nxx.bpx and 4gram-BM25. Experiments on Qavanin Collection.]
FuFaIR
- The query is considered as a fuzzy set of relevant documents in the database
- The documents are sent to the client sorted by their degree of membership to the query's fuzzy set
- The larger the value of µ_i, the more relevant the document is to the query
FuFaIR (Cont.)
- Each term is assigned a membership degree to a document based on the importance of that term for representing the document's content.
- The membership degree can be computed with classical IR parameters such as tf/idf
- The input query is considered as an algebraic sentence whose elements are:
  - Terms
  - Fuzzy operators such as AND, OR, and NOT
- Applying the operators to the terms yields the final fuzzy set
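As a rough illustration of treating the query as an algebraic sentence, the sketch below evaluates a nested query over per-document term degrees and ranks the documents by the result. It assumes the textbook operators (MIN for AND, MAX for OR, 1 - x for NOT); the slides do not state which operator interpretation FuFaIR uses, and the tuple-based query format and the names `evaluate` and `rank` are hypothetical.

```python
def evaluate(query, term_degrees):
    """Degree of membership of one document in the query's fuzzy set.

    query        -- a term (str) or a tuple ("AND" | "OR" | "NOT", operand, ...)
    term_degrees -- dict mapping each term to its degree for this document;
                    a term that is absent from the document counts as 0.0
    """
    if isinstance(query, str):
        return term_degrees.get(query, 0.0)
    operator, *operands = query
    values = [evaluate(operand, term_degrees) for operand in operands]
    if operator == "AND":
        return min(values)   # assumed interpretation: fuzzy intersection
    if operator == "OR":
        return max(values)   # assumed interpretation: fuzzy union
    if operator == "NOT":
        if len(values) != 1:
            raise ValueError("NOT takes exactly one operand")
        return 1.0 - values[0]   # assumed interpretation: complement
    raise ValueError(f"unknown operator: {operator!r}")


def rank(query, documents):
    """Return (doc_id, degree) pairs sorted by decreasing degree."""
    scored = [(doc_id, evaluate(query, degrees))
              for doc_id, degrees in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "d1": {"iran": 0.8, "usa": 0.5},
        "d2": {"iran": 0.3, "usa": 0.9, "music": 0.4},
    }
    print(rank(("AND", "iran", "usa"), docs))
```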
FuFaIR (Cont.)
The membership degree of a document to an individual term is defined as follows in our method:
$$\mu_t(d) = \frac{f_{t,d}}{\max_{t_k} f_{t_k,d}} \cdot \frac{idf(t)}{\max_{t_k} idf(t_k)}$$
f_{t,d} = frequency of term t in document d
idf(t) = inverse document frequency of term t
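The sketch below computes a term-to-document membership degree as normalised term frequency multiplied by normalised idf, which is how the slide's definition appears to read. Two points are assumptions rather than statements from the slides: the frequency maximum is taken over the terms of the document, and the idf maximum is taken over the whole vocabulary. The function name `membership_degrees` is hypothetical.

```python
def membership_degrees(term_freqs, idf):
    """Membership degree of one document for each of its terms.

    term_freqs -- dict: term -> f_{t,d}, frequency of the term in the document
    idf        -- dict: term -> idf(t) for every term in the vocabulary
    Returns a dict: term -> degree in [0, 1].
    """
    if not term_freqs:
        return {}
    max_freq = max(term_freqs.values())   # assumed: max over terms of this document
    max_idf = max(idf.values())           # assumed: max over the whole vocabulary
    return {
        term: (freq / max_freq) * (idf[term] / max_idf)
        for term, freq in term_freqs.items()
    }


if __name__ == "__main__":
    freqs = {"concert": 4, "music": 2}
    idf_values = {"concert": 1.0, "music": 2.0, "doping": 4.0}
    print(membership_degrees(freqs, idf_values))
```

Both factors lie in [0, 1], so the product is a valid fuzzy membership degree; a term reaches 1.0 only if it is both the most frequent term in the document and the rarest term in the collection.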
Overview
- Persian Language
- Related Work
  - Fuzzy IR
  - Farsi IR
- Fuzzy Logic Overview
- FuFaIR Explanation
- Experimental Results
- Conclusion and Future Work
Experimental Results
Parameters:
- The Hamshahri corpus has been used
- Total size of the collection: 300+ MB
- Indexing has been performed after stop word elimination
- No stemming has been applied
- 30 queries have been used for these experiments
- Precision has been computed for the top 20 retrieved documents.
Experimental Results (Cont.)
Some sample queries:
- The Bidel music group concert: کنسرت گروه موسيقي بيدل
- Iran AND USA relations: روابط ايران و امريکا
- Economic benefit of Iran's agriculture: سود اقتصادي کشاورزي ايران
- The punishment of doping in swimming: مجازات دوپينگ در شنا
- Cancer treatment methods: روشهاي درمان سرطان
- Classic music in Iran: موسيقي کلاسيک در ايران
Experimental Results (Cont.)
- As a benchmark, the best Persian retrieval model so far has been selected: the Vector Space model with the Lnu.ltu weighting scheme.
- The pivot and slope parameters have been set to 13.36 and 0.75, respectively.
  - The effectiveness of these values has been shown by previous work (see paper).
- To calculate the performance of each run, the precision at the 5, 10, 15 and 20 document cut-offs has been calculated and averaged over all 30 queries.
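The evaluation measure described here, precision at fixed document cut-offs averaged over queries, can be sketched as follows. The function names and the toy document identifiers are illustrative assumptions; the sketch does not reproduce any of the reported results.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_cutoffs(runs, cutoffs=(5, 10, 15, 20)):
    """Average precision@k over all queries, for each cut-off k.

    runs -- list of (ranked_ids, relevant_ids) pairs, one pair per query
    """
    if not runs:
        raise ValueError("at least one query run is required")
    return {
        k: sum(precision_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
        for k in cutoffs
    }


if __name__ == "__main__":
    ranking = [f"d{i}" for i in range(1, 21)]    # toy ranking d1 ... d20
    relevant = {"d1", "d3", "d6", "d11"}          # toy relevance judgments
    print(mean_precision_at_cutoffs([(ranking, relevant)]))
```

Dividing by k rather than by the number of documents actually returned means that a run returning fewer than k documents is penalised, which is the usual convention for P@k.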
Experimental Results (Cont.)
Comparison Results:
Conclusion & Future Work
Conclusion
- Main contribution of this paper: design, implementation and testing of FuFaIR, a fuzzy retrieval system for the Persian language.
- Fuzzy quantifiers are also added to the original model to provide more flexibility.
- In comparison with Vector Space, FuFaIR shows significantly better performance.
Future Work
- Testing different interpretations of the fuzzy operators on the Persian corpora
- Examining the true value and contribution of a Persian stemmer in retrieval
Questions?
Conception of Fuzzy Logic
- Many decision-making and problem-solving tasks are too complex to be defined precisely
- However, people succeed by using imprecise knowledge
- Fuzzy logic resembles human reasoning in its use of approximate information and uncertainty to generate decisions.
Natural Language
- Consider:
  - Joe is tall -- what is tall?
  - Joe is very tall -- how does this differ from tall?
- Natural language (like most other activities in life and indeed the universe) is not easily translated into the absolute terms of 0 ("false") and 1 ("true").
Fuzzy Logic
- An approach to uncertainty that combines real values [0…1] and logic operations
- Fuzzy logic is based on the ideas of fuzzy set theory and fuzzy set membership often found in natural (e.g., spoken) language.
Example: "Young"
- Example:
  - Ann is 28, 0.8 in set "Young"
  - Bob is 35, 0.1 in set "Young"
  - Charlie is 23, 1.0 in set "Young"
- Unlike statistics and probabilities, the degree does not describe the probability that the item is in the set, but instead describes to what extent the item is in the set.
Membership function of fuzzy logic
[Figure: degree of membership (DOM, from 0 to 1, with 0.5 marked) against Age (25, 40, 55) for the fuzzy values Young, Middle and Old.]
Fuzzy values have associated degrees of membership in the set.
Benefits of fuzzy logic
- You want the value to switch gradually as Young becomes Middle and Middle becomes Old. This is the idea of fuzzy logic.
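One way to picture the gradual switch is with piecewise-linear membership functions. The sketch below is only an assumption about the shapes: it uses the ages 25, 40 and 55 as breakpoints and linear ramps between them, so its values need not match the example degrees quoted for "Young" elsewhere in the slides. The function names are hypothetical.

```python
def young(age):
    """Assumed shape: fully Young up to 25, falling linearly to 0 at 40."""
    if age <= 25:
        return 1.0
    if age >= 40:
        return 0.0
    return (40 - age) / 15


def middle(age):
    """Assumed shape: triangle rising from 25, peaking at 40, falling to 55."""
    if age <= 25 or age >= 55:
        return 0.0
    if age <= 40:
        return (age - 25) / 15
    return (55 - age) / 15


def old(age):
    """Assumed shape: 0 up to 40, rising linearly to fully Old at 55."""
    if age <= 40:
        return 0.0
    if age >= 55:
        return 1.0
    return (age - 40) / 15


if __name__ == "__main__":
    for age in (20, 30, 32.5, 40, 47.5, 60):
        print(age, young(age), middle(age), old(age))
```

Under these assumed shapes, Young and Middle cross at a degree of 0.5 at age 32.5, and Middle and Old cross at 0.5 at age 47.5, so membership shifts smoothly from one set to the next instead of jumping at a threshold.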
Fuzzy Set Operations
- Fuzzy OR (∪): the union of two fuzzy sets is the maximum (MAX) of each element from the two sets.
- E.g.
  - A = {1.0, 0.20, 0.75}
  - B = {0.2, 0.45, 0.50}
  - A ∪ B = {MAX(1.0, 0.2), MAX(0.20, 0.45), MAX(0.75, 0.50)} = {1.0, 0.45, 0.75}
Fuzzy Set Operations
- Fuzzy AND (∩): the intersection of two fuzzy sets is just the MIN of each element from the two sets.
- E.g.
  - A ∩ B = {MIN(1.0, 0.2), MIN(0.20, 0.45), MIN(0.75, 0.50)} = {0.2, 0.20, 0.50}
Fuzzy Set Operations
- The complement of a fuzzy variable with DOM x is (1-x).
- Complement: the complement of a fuzzy set is composed of the complements of all its elements.
- Example.
  - A^c = {1 - 1.0, 1 - 0.2, 1 - 0.75} = {0.0, 0.8, 0.25}
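The three element-wise operations can be written in a few lines of Python. The sketch below represents a fuzzy set as a list of membership degrees in a fixed element order, which is an illustrative choice, and uses the sets A = {1.0, 0.20, 0.75} and B = {0.2, 0.45, 0.50} from the examples in these slides.

```python
def fuzzy_or(a, b):
    """Union: element-wise MAX of two equally long membership lists."""
    if len(a) != len(b):
        raise ValueError("fuzzy sets must have the same number of elements")
    return [max(x, y) for x, y in zip(a, b)]


def fuzzy_and(a, b):
    """Intersection: element-wise MIN of two equally long membership lists."""
    if len(a) != len(b):
        raise ValueError("fuzzy sets must have the same number of elements")
    return [min(x, y) for x, y in zip(a, b)]


def fuzzy_not(a):
    """Complement: 1 - x for every membership degree x."""
    return [1.0 - x for x in a]


if __name__ == "__main__":
    A = [1.0, 0.20, 0.75]
    B = [0.2, 0.45, 0.50]
    print(fuzzy_or(A, B))    # union of A and B
    print(fuzzy_and(A, B))   # intersection of A and B
    print(fuzzy_not(A))      # complement of A
```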