
Dragan Ivanovic

3rd International KEYSTONE Conference

Multi-Lingual LSA with Serbian and Croatian: An Investigative Case Study

Colin Layfield, Joel Azzopardi

Layfield/Draganovic/Azzopardi IKC 2017

Problem

We want to create search functionality for dissertations in various Balkan-region languages:

• Serbian
• Croatian
• Montenegrin
• Bosnian


Multi-Lingual Search

Multi-lingual search problem, special case:

• Existence of these four “political” languages
• In practice, very similar


Multi-Lingual Search

Other approaches fall into roughly 2 categories:

• Query and document translation
  • Parallel corpora
  • Dictionary
  • Machine translation

(e.g. Dhavachelvan & Pothula (2011), Dwivedi & Chandra (2016), Sharma & Morwal (2015))


Multi-Lingual LSA

Requires a parallel corpus.

[Diagram: a document in Language A and its counterpart in Language B are joined (A, B -> AB) to form the parallel documents A + B.]


LSA

LSA transforms the term-by-document matrix using Singular Value Decomposition:

$A = T S D^{T}$

[Diagram: the parallel documents A + B are transformed into the semantic space.]
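The decomposition above can be sketched with NumPy. This is only an illustration under stated assumptions: the toy count matrix and the choice of k = 2 dimensions are made up and are not values from the experiments.

```python
import numpy as np

# Toy term-by-document matrix A (rows: terms, columns: documents).
# The counts are invented for illustration only.
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0],
    [1.0, 0.0, 0.0, 2.0],
])

# Singular Value Decomposition: A = T S D^T.
# NumPy returns the singular values as a vector s and D^T as a matrix.
T, s, Dt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values to obtain the reduced semantic space.
k = 2  # assumed value, chosen only to keep the example small
T_k, S_k, Dt_k = T[:, :k], np.diag(s[:k]), Dt[:k, :]

# Rank-k approximation of the original matrix.
A_k = T_k @ S_k @ Dt_k

# Coordinates of each document in the semantic space (one row per document).
doc_vectors = (S_k @ Dt_k).T
```

Documents are then compared through `doc_vectors` rather than through raw term counts.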


Previous Work

Dumais et al. (1997) experimented with ML-LSA:

• Parallel corpora (Canadian Parliamentary Q&A)
• 2,482 documents, both French and English
• 1,500 randomly selected for training
• Remaining 982 used as test set for “mate retrieval”

[Diagram: the semantic space is built from the 1,500 training documents; the English (E) and French (F) versions of the 982 test documents are then compared.]
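A mate-retrieval test of this kind can be sketched as follows. The function name, the `top_k` parameter and the toy vectors are assumptions for illustration; the sketch presumes that row i of each array holds the semantic-space vector of the same document in the two languages.

```python
import numpy as np

def mate_retrieval_accuracy(query_vecs, target_vecs, top_k=1):
    """Fraction of queries whose mate (the target with the same row index)
    is among the top_k most cosine-similar targets."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    sims = q @ t.T  # sims[i, j] = cosine similarity of query i and target j
    # Indices of the top_k most similar targets for every query.
    top = np.argsort(-sims, axis=1)[:, :top_k]
    mates = np.arange(len(q))[:, None]
    return float(np.mean(np.any(top == mates, axis=1)))

# Toy vectors (invented): row i of each array is the same document
# in the two languages, already projected into the semantic space.
english = np.array([[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]])
french = np.array([[0.9, 0.2], [0.0, 1.0], [0.9, 1.1]])

accuracy = mate_retrieval_accuracy(english, french)
```

Setting `top_k` above 1 relaxes the test to "the mate appears among the k best matches".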


Our Approach

Repeat the previous experiment, with and without a parallel semantic space.

Hypothesis: given the similarity of Croatian and Serbian, can we use a semantic space composed of documents from each language, rather than concatenated documents from a parallel set of documents?


The Data

We used the SETIMES dataset (Tyers & Alperen 2010):

• Collection of parallel news articles in the Balkan languages taken from http://www.setimes.com
• 9 languages overall
• Data consisted of:
  • 170,466 aligned sentences
  • 29,391 news articles


Method

Two sets of experiments were run with the dataset:

1. Perform mate-retrieval test using classic ML-LSA
2. Perform mate-retrieval test where each document in the semantic space consists of only a single language
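The difference between the two setups lies in how the training documents for the semantic space are assembled. The sketch below is a guess at that step, not the authors' code: the helper names and the toy sentence pairs are hypothetical, and the second function assumes that both language versions of each article enter the space as separate documents.

```python
# Hypothetical aligned pairs: (Croatian text, Serbian text) of the same item.
# The sentences are invented toy examples, not taken from SETIMES.
pairs = [
    ("vlak stiže na kolodvor", "voz stiže na stanicu"),
    ("kruh je svjež", "hleb je svež"),
]

def combined_space_docs(pairs):
    """Experiment 1 (classic ML-LSA): one training document per aligned
    pair, formed by concatenating the two language versions."""
    return [f"{cro} {srb}" for cro, srb in pairs]

def single_language_space_docs(pairs):
    """Experiment 2: every training document holds one language only.
    Assumption: both versions of each pair are added as separate documents."""
    docs = []
    for cro, srb in pairs:
        docs.append(cro)
        docs.append(srb)
    return docs

combined = combined_space_docs(pairs)
single = single_language_space_docs(pairs)
```

In the first setup, terms from both languages co-occur inside one document; in the second, any link between the languages has to come from vocabulary the two share.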


Method

Additional features:

• Each test was run 10 times
• The number of documents in the semantic space was varied: 1,000, 2,500 and 5,000
• 5,000 documents were chosen from those outside the semantic space for testing
• For each run, documents were randomly chosen to create the semantic space
• Tried with TF-IDF and Log-Entropy weighting
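The two weighting schemes named above can be sketched in their common textbook forms. The exact variants used in the experiments are not stated on the slide, so the formulas below (natural logarithms, idf = log(n / df)) are assumptions, as are the function names and toy counts.

```python
import numpy as np

def log_entropy(tf):
    """Log-entropy weighting of a term-by-document count matrix
    (rows: terms, columns: documents). Needs at least 2 documents."""
    tf = np.asarray(tf, dtype=float)
    n_docs = tf.shape[1]
    gf = tf.sum(axis=1, keepdims=True)  # global frequency of each term
    p = np.divide(tf, gf, out=np.zeros_like(tf), where=gf > 0)
    # p * log(p), treating 0 * log(0) as 0.
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    # Global weight: 1 for a term concentrated in one document,
    # 0 for a term spread evenly over all documents.
    global_weight = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log(n_docs)
    return np.log1p(tf) * global_weight

def tf_idf(tf):
    """TF-IDF weighting with idf = log(n_docs / document frequency)."""
    tf = np.asarray(tf, dtype=float)
    n_docs = tf.shape[1]
    df = np.count_nonzero(tf, axis=1)[:, None]  # documents containing the term
    return tf * np.log(n_docs / np.maximum(df, 1))

# Toy counts (invented): term 0 is spread evenly, term 1 is concentrated.
counts = np.array([
    [1, 1, 1, 1],
    [4, 0, 0, 0],
    [2, 1, 0, 0],
])
weighted_le = log_entropy(counts)
weighted_tfidf = tf_idf(counts)
```

Both schemes drive the evenly spread term to zero weight and keep the concentrated term, which is the behaviour that makes them useful before the SVD step.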


Results


Results

[Chart: ML-LSA Results Cro->Srb; x-axis: semantic space size (1000, 2500, 5000 documents); y-axis: 70 to 100]

Series         1000    2500    5000
L-E Comb       99.1    98.8    98.3
L-E            90.2    86.7    83.7
TF-IDF Comb    98.1    97.7    97.5
TF-IDF         82.3    79      75.3


Results

[Chart: ML-LSA Results Cro->Srb (Folding in with straight TF); x-axis: semantic space size (1000, 2500, 5000 documents); y-axis: 70 to 100; four data series, one per row, in chart order]

1000    2500    5000
99.6    99.3    99
94.2    91.2    89.6
99      98.8    98.5
87.2    80.1    76.3
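Folding in projects a document that was not used to build the semantic space into that space. The sketch below uses the standard fold-in formula, $\hat{d} = d^{T} T_k S_k^{-1}$, and reads "straight TF" as raw term counts; both the reading and the toy data are assumptions, not details confirmed by the slide.

```python
import numpy as np

def fold_in(term_vector, T_k, s_k):
    """Project a term vector into an existing LSA space:
    d_hat = d^T T_k S_k^{-1} (s_k holds the k largest singular values)."""
    return (np.asarray(term_vector, dtype=float) @ T_k) / s_k

# Toy term-by-document matrix used to build the space (invented counts).
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0],
    [1.0, 0.0, 0.0, 2.0],
])
T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed number of dimensions
T_k, s_k = T[:, :k], s[:k]

# Sanity check: folding in a training document recovers its row of D_k.
recovered = fold_in(A[:, 0], T_k, s_k)

# A new document given as raw term counts ("straight TF").
new_doc_tf = np.array([1.0, 0.0, 2.0, 0.0, 1.0])
new_doc_vec = fold_in(new_doc_tf, T_k, s_k)
```

If the space itself was built from weighted counts, folding in raw counts mixes two scales, which is one possible reason the behaviour of this variant is worth a separate look.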



Results

[Chart: ML-LSA Results Top 5 Cro; x-axis: semantic space size (1000, 2500, 5000 documents); y-axis: 70 to 100; four data series, one per row, in chart order]

1000    2500    5000
99.6    99.1    98.7
94.5    91.4    88.6
98.9    98.2    97.9
89.2    86.3    83.2



Results


Results

[Chart: Average Same Similarity Values; x-axis: semantic space size (1000, 2500, 5000 documents); y-axis: 0.7 to 1]

Series         1000     2500     5000
L-E Comb       0.947    0.954    0.957
L-E            0.8      0.813    0.809
TF-IDF Comb    0.951    0.956    0.959
TF-IDF         0.777    0.797    0.799



Results

[Chart: Average Best Croatian Similarity; x-axis: semantic space size (1000, 2500, 5000 documents); y-axis: 0.5 to 0.8]

Series         1000     2500     5000
L-E Comb       0.561    0.635    0.662
L-E            0.595    0.662    0.686
TF-IDF Comb    0.613    0.686    0.72
TF-IDF         0.64     0.705    0.737



Results

[Chart: Best Croatian and Same Similarity Scores; x-axis: semantic space size (1000, 2500, 5000 documents); y-axis: 0.5 to 1]

Series             1000     2500     5000
BCS-L-E Comb       0.561    0.635    0.662
BCS-L-E            0.595    0.662    0.686
BCS-TF-IDF Comb    0.613    0.686    0.72
BCS-TF-IDF         0.64     0.705    0.737
SS-L-E Comb        0.947    0.954    0.957
SS-L-E             0.8      0.813    0.809
SS-TF-IDF Comb     0.951    0.956    0.959
SS-TF-IDF          0.777    0.797    0.799


Conclusions

Promising results, but:

• Not as good as expected
• Some surprising features
  • Increase in similarity values with increase in semantic space size
  • Decrease in mate matching results due to the above
  • Behavior of folding in documents using straight TF


Future Work

1. Experiment using sentence-level semantic space
2. Use mixed-language semantic space of non-parallel documents
3. Detect most important terms from query and use those
4. Investigate folding into weighted semantic space using TF (and others?)

Questions? Comments? Suggestions?


References

Dhavachelvan, P., & Pothula, S. (2011). A Review on the Cross and Multilingual Information Retrieval. International Journal of Web & Semantic Technology, 2(4), 115–124.

Dumais, S. T., Letsche, T. A., Littman, M. L., & Landauer, T. K. (1997). Automatic Cross-Language Retrieval Using Latent Semantic Indexing. AAAI Technical Report SS-97-05, 18–24.

Dwivedi, S., & Chandra, G. (2016). A Survey on Cross Language Information Retrieval. International Journal on Cybernetics & Informatics, 5(1), 127–142. http://doi.org/10.17148/IJARCCE.2015.4287

Sharma, M., & Morwal, S. (2015). A Survey on Cross Language Information Retrieval. International Journal of Advanced Research in Computer Andf, 4(2), 384–387. http://doi.org/10.17148/IJARCCE.2015.4287

Tyers, F. M., & Alperen, M. S. (2010). South-East European Times: A parallel corpus of Balkan languages. In Proceedings of the Workshop on Exploitation of Multilingual Resources and Tools for Central and (South) Eastern European Languages, LREC 2010 (pp. 49–53). Retrieved from http://xixona.dlsi.ua.es/~fran/publications/lrec2010.pdf

Young, P. G. (1994). Cross-Language Information Retrieval Using Latent Semantic Indexing. University of Knoxville, Tennessee.