
Cross Language Concept Mining


{Motaz.Saad, David.Langlois, Kamel.Smaili}@loria.fr

1. OVERVIEW

Journalist Review System (JRS)

Objective: Build a Journalist Review System (JRS) that enables media trackers (journalists) to collect multilingual comparable articles concerning a given topic, and to perform the following:
• Explore and review opinions.
• Automatically detect the split of public opinion (e.g., for vs. against an issue or person).
• Identify and review more detailed opinions (joy, sadness, anger, ...).

Requirements:
• Comparable corpora for training and testing.
• Comparability Measure (CM): to compare multilingual articles.
• Sentiment-Based Comparability Measure (SCM): to compare the opinions of comparable articles.

2. COMPARABLE CORPORA

• Sources: the Wikipedia encyclopedia and the Euronews website.
• Aligning Wikipedia articles
⇒ Use interlanguage links
⇒ [[ar:مطر]] [[de:Regen]] [[es:Lluvia]] [[fr:Pluie]] [[en:Rain]]
• Aligning Euronews articles
⇒ Parse the HTML links of each English article and fetch the corresponding Arabic and French articles.
• Corpora information: publicly available at http://sf.net/projects/crlcl/

                          AFEWC                      eNews
                          English  French  Arabic    English  French  Arabic
Articles                  40290    40290   40290     34442    34442   34442
Sentences                 4.8M     2.7M    1.2M      744K     746K    622K
Avg #sentences/article    119      69      30        21       21      17
Avg #words/article        2266     1435    548       198      200     161
Words                     91.3M    57.8M   22M       6.8M     6.9M    5.5M
Vocabulary                2.8M     1.9M    1.5M      232K     256K    373K
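The interlanguage-link alignment described above could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the regular expression, the function names `interlanguage_titles` and `align`, and the sample wikitext (including the Arabic title) are assumptions made for this example.

```python
import re

# Matches interlanguage links such as [[fr:Pluie]]: a 2-3 letter language
# code, a colon, then the title of the article in that language.
INTERLANG_LINK = re.compile(r"\[\[([a-z]{2,3}):([^\]|]+)\]\]")


def interlanguage_titles(wikitext, languages=("ar", "fr")):
    """Return {language_code: title} for the requested languages."""
    titles = {}
    for code, title in INTERLANG_LINK.findall(wikitext):
        if code in languages and code not in titles:
            titles[code] = title.strip()
    return titles


def align(english_title, wikitext, languages=("ar", "fr")):
    """Return an aligned record, or None if any target language is missing."""
    titles = interlanguage_titles(wikitext, languages)
    if any(code not in titles for code in languages):
        return None
    return {"en": english_title, **titles}


# Hypothetical wikitext for the English article "Rain".
sample = "Rain is liquid water. [[de:Regen]] [[es:Lluvia]] [[fr:Pluie]] [[ar:مطر]]"
record = align("Rain", sample)
```

Dropping articles that lack a link in any target language is one way to end up with the same number of articles per language, as in the table above; whether the corpora were filtered exactly this way is not stated on the poster.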

3. COMPARABILITY MEASURE (CM)

• CM is based on the cosine similarity between comparable articles.
• Word weights are either binary or word frequencies.
• Cosine similarity performs better for CM.

[Figure: recall at R1, R5 and R10 for binCM and cosineCM. Data labels: 0.36, 0.81, 1 and 0.49, 0.86, 1.]
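To make the two weighting schemes concrete, here is a minimal sketch of cosine similarity over bag-of-words vectors with either binary or frequency weights. The poster does not say how articles in different languages are brought into a shared vocabulary; mapping words through a bilingual dictionary is an assumption here, and the toy dictionary, the toy documents and all function names are hypothetical.

```python
import math
from collections import Counter


def weights(tokens, binary=False):
    """Bag-of-words weights: 1 per distinct word if binary, else counts."""
    counts = Counter(tokens)
    if binary:
        return {word: 1 for word in counts}
    return dict(counts)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def translate(tokens, dictionary):
    """Map source words into the target language; drop uncovered words."""
    return [dictionary[t] for t in tokens if t in dictionary]


# Hypothetical toy data, for illustration only.
fr_to_en = {"pluie": "rain", "eau": "water", "nuage": "cloud"}
french = ["pluie", "pluie", "eau", "nuage"]
english = ["rain", "rain", "rain", "water"]

mapped = translate(french, fr_to_en)
cm_frequency = cosine(weights(mapped), weights(english))
cm_binary = cosine(weights(mapped, binary=True), weights(english, binary=True))
```

With frequency weights, repeated words such as "rain" contribute more to the score, whereas binary weights only record whether a word is shared.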

4. SENTIMENT BASED COMPARABILITY MEASURE (SCM)

\[
\mathrm{scm}(c) = \left| \frac{\sum_{C(S_x)=c} P(S_x \mid c)}{N_x} - \frac{\sum_{C(S_y)=c} P(S_y \mid c)}{N_y} \right|
\]
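As a worked illustration of the formula, the sketch below computes scm(c) for one pair of documents. The poster does not define the symbols, so the reading used here is an assumption: S_x and S_y are the sentences of the two documents, C(S) is the class a sentiment classifier assigns to a sentence, P(S|c) is the probability it reports for that class, and N_x and N_y are the total numbers of sentences in each document. The classifier outputs below are hypothetical.

```python
def class_average(sentences, target_class):
    """Sum P(S|c) over sentences labelled c, divided by the sentence count N."""
    if not sentences:
        return 0.0
    total = sum(prob for label, prob in sentences if label == target_class)
    return total / len(sentences)


def scm(source_sentences, target_sentences, target_class):
    """Absolute difference of the two per-document class averages."""
    return abs(class_average(source_sentences, target_class)
               - class_average(target_sentences, target_class))


# Hypothetical classifier output: (predicted class, probability of that class).
english_doc = [("subjective", 0.9), ("objective", 0.8), ("subjective", 0.7)]
arabic_doc = [("objective", 0.6), ("subjective", 0.8)]

score = scm(english_doc, arabic_doc, "subjective")
```

Under this reading, a score near 0 means the two documents carry a similar share of class c, and larger values mean they diverge.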

5. SCM RESULTS

Corpora        scm(ō)   scm(o)   scm(p̄)   scm(p)
parallel-p2
  AFP          0.02     0.02     0.1      0.12
  ANN          0.05     0.06     0.1      0.1
  ASB          0.07     0.1      0.12     0.14
  TED          0.06     0.06     0.08     0.07
  UN           0.05     0.02     0.07     0.08
Comparable
  ENews        0.07     0.15     0.11     0.15
  AFEWC        0.11     0.19     0.11     0.16

ō = subjective, o = objective, p̄ = negative, p = positive

AFP: Agence France-Presse, ANN: Annahar newspaper, ASB: Assabah newspaper, TED: talks from ted.com, UN: United Nations resolutions.

- Comparing CM results for parallel and comparable corpora ⇒ CM can capture comparability.
- Comparable articles do not have the same opinions ⇒ they vary in their objectivity and positivity.

6. MORPHOLOGICAL ANALYSIS

Root كتب (katab): to write / écrire
  maktab    مكتب    office     bureau
  kitab     كتاب    book       livre
  maktaba   مكتبة   library    bibliothèque

Root طير (tair): to fly / voler
  ta-iar    طيار    pilot      pilote
  matar     مطار    airport    aéroport
  ta-ira    طائرة   airplane   avion
  ta-ir     طائر    bird       oiseau

• Stemming and lemmatization for English and French.
• Rooting and light stemming for Arabic:
⇒ Light stemming removes suffixes and prefixes.
⇒ Rooting removes suffixes and prefixes and reduces the word to its root.
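The difference between light stemming and rooting can be seen in a small sketch of the former. This is only an illustration under stated assumptions: the affix lists are short, hand-picked examples rather than the lists used by the authors, and `light_stem` is a hypothetical helper, not a function from any particular library.

```python
# Illustrative affix lists only; real light stemmers use longer, tuned lists.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")


def light_stem(word, min_length=3):
    """Strip at most one prefix and one suffix, keeping >= min_length letters."""
    for prefix in PREFIXES:  # longer prefixes are listed first
        if word.startswith(prefix) and len(word) - len(prefix) >= min_length:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_length:
            word = word[:-len(suffix)]
            break
    return word


library_stem = light_stem("المكتبة")    # "the library"
airplane_stem = light_stem("والطائرة")  # "and the airplane"
book_stem = light_stem("كتاب")          # "book", no affix to strip
```

In this sketch the definite article and the final ة are stripped, but the results still differ from the three-letter roots كتب and طير; getting to those is what rooting adds on top of affix removal.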

7. COVERAGE RATE OF THE BILINGUAL DICTIONARY

[Figure: bar chart of coverage rates, on an axis from 0% to 100%]
morphAr-lemma       57%
morphAr-stemEn      50%
root-lemma          40%
root-stemEn         39%
lightStem-lemma     41%
lightStem-stemEn    41%
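One plausible way to compute such a coverage rate is sketched below. The poster does not define the measure, so this assumes it is the share of distinct corpus words whose normalized form has an entry in the bilingual dictionary; `coverage_rate`, the crude `strip_plural` normalizer and the toy English data are all hypothetical stand-ins for the real analyzers and dictionary.

```python
def coverage_rate(vocabulary, dictionary_entries, normalize=lambda w: w):
    """Fraction of distinct corpus words whose normalized form is in the dictionary."""
    words = set(vocabulary)
    if not words:
        return 0.0
    entries = {normalize(entry) for entry in dictionary_entries}
    covered = sum(1 for word in words if normalize(word) in entries)
    return covered / len(words)


# Hypothetical toy data: a crude plural stripper stands in for a real stemmer.
def strip_plural(word):
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


corpus_vocabulary = ["books", "book", "library", "pilots", "airport"]
dictionary = ["book", "pilot", "bird"]

raw = coverage_rate(corpus_vocabulary, dictionary)
stemmed = coverage_rate(corpus_vocabulary, dictionary, normalize=strip_plural)
```

Passing a different `normalize` function for each configuration would allow the same routine to compare morphological analysis, rooting and light stemming on equal terms.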

8. FUTURE WORK

• Develop a multilingual document representation model based on Latent Semantic Indexing to enhance CM.

• Extend SCM by improving sentiment detection and by reviewing more detailed sentiments, i.e., emotions expressed in words (joy, anger, pleasure, ...). This will be done by exploiting annotated lexicons and semantic networks.

• Develop an interface for journalists to review comparable articles.

9. REFERENCES

• Saad, M.; Langlois, D. & Smaili, K. (2013), Comparing Multilingual Comparable Articles Based On Opinions, in 'Proceedings of the Sixth Workshop on Building and Using Comparable Corpora', Association for Computational Linguistics, Sofia, Bulgaria, pp. 105-111.

• Saad, M.; Langlois, D. & Smaili, K. (2013), Extracting Comparable Articles from Wikipedia and Measuring Their Comparabilities, in '5th International Conference on Corpus Linguistics', University of Alicante, Spain.