Multi-Lingual LSA with Serbian and Croatian: An Investigative Case Study
Colin Layfield, Dragan Ivanovic, Joel Azzopardi
3rd International KEYSTONE Conference
Layfield/Draganovic/Azzopardi IKC 2017

Problem
Outline: The Problem · The Approach · The Results · Conclusions
Want to create search functionality for dissertations in various Balkan region languages:
• Serbian
• Croatian
• Montenegrin
• Bosnian

Multi-Lingual Search
Multi-Lingual Search Problem: Special Case:
• Existence of these four “political” languages
• In practice, very similar

Multi-Lingual Search
Other approaches: roughly 2 categories
• Query and document translation
  § Parallel corpora
  § Dictionary
  § Machine Translation
(e.g. Dhavachelvan, Pothula (2011), Dwivedi, Chandra (2016), Sharma, Morwal (2015))

Multi-Lingual LSA
Requires a parallel corpus
[Diagram: documents in Language A and Language B are joined into parallel documents (A + B)]

LSA
LSA transforms the term-by-document matrix using Singular Value Decomposition:
$A = T S D^{T}$
[Diagram: parallel documents (A + B) are mapped into the semantic space]

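A minimal NumPy sketch of the decomposition on this slide, assuming a tiny invented term-by-document count matrix. The variable names, the toy numbers and the rank `k` are illustrative assumptions and are not taken from the talk. It truncates the SVD to `k` dimensions and folds a document into the reduced space with the standard LSA fold-in formula (document vector times `T_k`, divided by the singular values).

```python
import numpy as np

# Toy term-by-document count matrix A (rows = terms, columns = documents).
# The numbers are invented purely for illustration.
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
    [1.0, 0.0, 0.0, 1.0],
])

# Thin SVD: A = T @ diag(s) @ Dt, where Dt is the transpose of D.
T, s, Dt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of retained dimensions (an arbitrary choice for this sketch)
T_k, s_k, Dt_k = T[:, :k], s[:k], Dt[:k, :]

# Rank-k approximation of A.
A_k = T_k @ np.diag(s_k) @ Dt_k


def fold_in(doc_vector):
    """Project a term-count vector into the k-dimensional semantic space."""
    return doc_vector @ T_k / s_k


# Folding in a training column reproduces that document's coordinates in D_k.
doc0 = fold_in(A[:, 0])
```

Unseen documents (for example, test documents in either language) can be passed to `fold_in` in the same way, as long as their vectors use the same term ordering as the rows of `A`.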
Previous Work
Dumais et al. (1997) experimented with ML-LSA:
• Parallel corpora (Canadian Parliamentary Q&A)
• 2,482 documents, both French & English
• 1,500 randomly selected for training
• Remaining 982 used as test set for “mate retrieval”
[Diagram: semantic space built from the 1,500 training docs; the English (E) and French (F) versions of the 982 test docs are compared]

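The mate retrieval test can be sketched as follows. This is an illustrative plain-Python version, assuming the test documents have already been projected into the semantic space; the function names and the 2-dimensional vectors are invented for the example and are not from Dumais et al. or from this talk. A query counts as a hit when its most similar candidate is the other-language version of the same document.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mate_retrieval_accuracy(queries, candidates):
    """Fraction of queries whose most similar candidate is their own mate.

    queries[i] and candidates[i] are assumed to be the vectors of the two
    language versions of the same document.
    """
    hits = 0
    for i, q in enumerate(queries):
        sims = [cosine(q, c) for c in candidates]
        best = max(range(len(candidates)), key=lambda j: sims[j])
        if best == i:
            hits += 1
    return hits / len(queries)


# Invented 2-d vectors standing in for projected test documents.
english = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
french = [[0.9, 0.1], [0.2, 0.8], [0.9, 0.2]]
accuracy = mate_retrieval_accuracy(english, french)
```

In this toy data the third English vector is slightly closer to the second French vector than to its own mate, so two of the three queries retrieve their mate.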
Our Approach
Repeat the previous experiment with and without a parallel semantic space.
Hypothesis: given the similarity of Croatian and Serbian, can we use a semantic space composed of documents from each language, rather than concatenated documents from a parallel set of documents?

The Data
We used the SETIMES dataset (Tyers & Alperen 2010)
• Collection of parallel news articles in the Balkan languages taken from http://www.setimes.com
• 9 languages overall
• Data consisted of:
  § 170,466 aligned sentences
  § 29,391 news articles

Method
Two sets of experiments run with the dataset:
1. Perform mate-retrieval test using classic ML-LSA
2. Perform mate-retrieval test where each document in the semantic space consists of only a single language

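The difference between the two setups lies in how the training documents for the semantic space are assembled. The sketch below is one possible reading, not the authors' code: the sentence pairs are invented, tokenisation is naive whitespace splitting, and keeping both language versions as separate documents in the single-language variant is an assumption, since the slides do not say how those documents were selected.

```python
from collections import Counter

# Invented toy "parallel corpus": (Croatian, Serbian) sentence pairs.
parallel_pairs = [
    ("vlada je usvojila proračun", "vlada je usvojila budžet"),
    ("predsjednik je posjetio sveučilište", "predsednik je posetio univerzitet"),
]


def term_counts(text):
    """Bag-of-words term counts using naive whitespace tokenisation."""
    return Counter(text.lower().split())


def combined_space_docs(pairs):
    """Classic ML-LSA: one training document per pair, both languages joined."""
    return [term_counts(hr + " " + sr) for hr, sr in pairs]


def single_language_docs(pairs):
    """Variant: every training document holds one language only.

    Assumption: both language versions are kept, as separate documents.
    """
    docs = []
    for hr, sr in pairs:
        docs.append(term_counts(hr))
        docs.append(term_counts(sr))
    return docs


combined = combined_space_docs(parallel_pairs)
separate = single_language_docs(parallel_pairs)
```

In the combined setup, words shared by both languages (such as "vlada") are counted twice per document and language-specific words co-occur in the same document; in the single-language setup only the shared vocabulary links the two languages.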
Method
Additional features:
• Each test run 10 times
• Document number varied for semantic space
  § 1,000, 2,500 and 5,000
• 5,000 documents chosen from those outside of semantic space for testing
• For each run documents randomly chosen to create semantic space
• Tried with TF-IDF and Log-Entropy weighting

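For reference, the two weighting schemes can be sketched in plain Python. The slides do not state which exact variants were used, so the formulas here are common textbook forms and should be read as assumptions: TF-IDF as `tf * log(N / df)`, and Log-Entropy as a local weight `log(1 + tf)` multiplied by a global weight `1 + sum(p * log p) / log N`, where `p` is the share of a term's occurrences that fall in each document.

```python
import math


def tf_idf(matrix):
    """Weight a term-by-document count matrix (list of rows) with tf * log(N / df)."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for tf in row if tf > 0)  # documents containing the term
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([tf * idf for tf in row])
    return weighted


def log_entropy(matrix):
    """Weight with log(1 + tf) locally and 1 + sum(p * log p) / log N globally.

    Assumes the matrix has more than one document column.
    """
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        gf = sum(row)  # total occurrences of the term in the collection
        entropy = 0.0
        for tf in row:
            if tf > 0:
                p = tf / gf
                entropy += p * math.log(p)
        global_weight = 1.0 + entropy / math.log(n_docs)
        weighted.append([global_weight * math.log(1.0 + tf) for tf in row])
    return weighted


counts = [
    [1, 1, 1, 1],  # term spread evenly over all documents
    [4, 0, 0, 0],  # term concentrated in a single document
]
tfidf_weights = tf_idf(counts)
le_weights = log_entropy(counts)
```

Both schemes give the evenly spread term a weight of (approximately) zero and keep the concentrated term, which is the behaviour that makes them useful before applying the SVD.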
Results
ML-LSA Results Cro->Srb (columns: number of documents in the semantic space)
Series        1000   2500   5000
L-E Comb      99.1   98.8   98.3
L-E           90.2   86.7   83.7
TF-IDF Comb   98.1   97.7   97.5
TF-IDF        82.3   79     75.3

Results
ML-LSA Results Cro->Srb (Folding in with straight TF) (columns: number of documents in the semantic space)
Series        1000   2500   5000
L-E Comb      99.6   99.3   99
L-E           94.2   91.2   89.6
TF-IDF Comb   99     98.8   98.5
TF-IDF        87.2   80.1   76.3

Results
ML-LSA Results Top 5 Cro (columns: number of documents in the semantic space)
Series        1000   2500   5000
L-E Comb      99.6   99.1   98.7
L-E           94.5   91.4   88.6
TF-IDF Comb   98.9   98.2   97.9
TF-IDF        89.2   86.3   83.2

Results
Average Same Similarity Values (columns: number of documents in the semantic space)
Series        1000    2500    5000
L-E Comb      0.947   0.954   0.957
L-E           0.8     0.813   0.809
TF-IDF Comb   0.951   0.956   0.959
TF-IDF        0.777   0.797   0.799

Results
Average Best Croatian Similarity (columns: number of documents in the semantic space)
Series        1000    2500    5000
L-E Comb      0.561   0.635   0.662
L-E           0.595   0.662   0.686
TF-IDF Comb   0.613   0.686   0.72
TF-IDF        0.64    0.705   0.737

Results
Best Croatian and Same Similarity Scores (columns: number of documents in the semantic space)
Series            1000    2500    5000
BCS-L-E Comb      0.561   0.635   0.662
BCS-L-E           0.595   0.662   0.686
BCS-TF-IDF Comb   0.613   0.686   0.72
BCS-TF-IDF        0.64    0.705   0.737
SS-L-E Comb       0.947   0.954   0.957
SS-L-E            0.8     0.813   0.809
SS-TF-IDF Comb    0.951   0.956   0.959
SS-TF-IDF         0.777   0.797   0.799

Conclusions
Promising results but…
• Not as good as expected
• Some surprising features
  § Increase in similarity values with increase in semantic space size
  § Decrease in mate matching results due to above
  § Behavior of folding in documents using straight TF

Future Work
1. Experiment using sentence level semantic space
2. Use mixed language semantic space of non-parallel documents
3. Detect most important terms from query and use those
4. Investigate folding into weighted semantic space using TF (and others?)

Questions? Comments? Suggestions?

References
Dhavachelvan, P., & Pothula, S. (2011). A Review on the Cross and Multilingual Information Retrieval. International Journal of Web & Semantic Technology, 2(4), 115–124.
Dumais, S. T., Letsche, T. A., Littman, M. L., & Landauer, T. K. (1997). Automatic Cross-Language Retrieval Using Latent Semantic Indexing. AAAI Technical Report SS-97-05, 18–24.
Dwivedi, S., & Chandra, G. (2016). A Survey on Cross Language Information Retrieval. International Journal on Cybernetics & Informatics, 5(1), 127–142.
Sharma, M., & Morwal, S. (2015). A Survey on Cross Language Information Retrieval. International Journal of Advanced Research in Computer and Communication Engineering, 4(2), 384–387. http://doi.org/10.17148/IJARCCE.2015.4287
Tyers, F. M., & Alperen, M. S. (2010). South-East European Times: A parallel corpus of Balkan languages. In Proceedings of the Workshop on Exploitation of Multilingual Resources and Tools for Central and (South) Eastern European Languages, LREC 2010 (pp. 49–53). Retrieved from http://xixona.dlsi.ua.es/~fran/publications/lrec2010.pdf
Young, P. G. (1994). Cross-Language Information Retrieval Using Latent Semantic Indexing. University of Tennessee, Knoxville.