Multilingual Synchronization Eun-kyung Kim 2011-02-10

Multilingual Synchronization

Eun-kyung Kim 2011-02-10

Introduction

• Wikipedia
  – Supports over 270 languages
  – Allows cross-lingual navigation via inter-language links
  – The quantity of data differs between languages
• Goal
  – Synchronize multilingual Wikipedia data to fill the gaps between languages and to acquire integrated knowledge

Methodology (base)

• Hypothesis
  – If X is a key fact in L1, then X’ should be a key fact in L2
    • where X’ is the term corresponding to X in the other language
  – Assumption
    » Inter-language links accurately connect two pages about the same entity or concept in different languages
• Key facts come from structured data such as:
  – Infobox
  – Category
  – Hyperlink text (rather than normal text)
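The hypothesis can be illustrated with a minimal sketch. It assumes key facts have already been extracted as sets of terms per language and that inter-language links are available as a dictionary. The names `missing_key_facts`, `facts_en`, `facts_ko`, and `en_to_ko`, and the toy data, are hypothetical and do not come from the slides.

```python
def missing_key_facts(facts_l1, facts_l2, interlang):
    """Return the L2 terms X' whose counterpart X is a key fact in L1
    but which are not yet key facts in L2.

    interlang maps an L1 term to its L2 counterpart (a hypothetical
    stand-in for Wikipedia inter-language links).
    """
    candidates = set()
    for x in facts_l1:
        x_prime = interlang.get(x)  # None when no inter-language link exists
        if x_prime is not None and x_prime not in facts_l2:
            candidates.add(x_prime)
    return candidates


# Toy data: key facts of one page in English (L1) and Korean (L2).
facts_en = {"Racket", "Shuttlecock", "Olympic Games"}
facts_ko = {"라켓"}
en_to_ko = {"Racket": "라켓", "Shuttlecock": "셔틀콕"}

print(missing_key_facts(facts_en, facts_ko, en_to_ko))
```

Terms without an inter-language link ("Olympic Games" in the toy data) are skipped, because the hypothesis only applies where a corresponding term X’ is known.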

Methodology

• Basic methodology
  – Infobox synchronization between English & Korean
    • Duplicate resolution & conflict resolution
• Comments from the committee (Ph.D. proposal)
  – Improve the multilingualism
  – Do not ignore multiple viewpoints
  – Hard to evaluate
• Extended methodology
  – Sync target selection in 5 languages
  – Key-fact synchronization covering not only Infobox but also LinkText
  – Filling in missing information according to each language's background knowledge and characteristics
    • focusing on how to add new information from other-language resources



Example of Infobox Synchronization

• Drawback of Infobox
  – Sometimes meaningless for synchronization
• Solution
  – Add link information to synchronize

[Figure: Infobox from Arthritis]

Links on the Web

• Links
  – navigate to a web page with more detailed information
  – point to previously published web pages with similar or related content
• Understanding the influence of each link can substantially benefit many applications
  – e.g., multilingual sync

[Figure: example link nodes: 메뚜기 (grasshopper), 메뚜기목 (Orthoptera), 귀뚜라미 (cricket), 베짱이 (katydid), 방아깨비 (long-headed grasshopper), 풀무치 (migratory locust), 벼메뚜기 (rice grasshopper), 여치 (bush cricket), 해충 (pest), 농업 (agriculture), 예멘 (Yemen), 사우디아라비아 (Saudi Arabia)]

Multilingual Synchronization Process

[Figure: process flow]
• Input: Wikipedia Data L1, Wikipedia Data L2, …, Wikipedia Data LN
• Preprocessing (Target Page Selection)
• Extracting Links
• Modeling on influence links (L1, L2, …, LN)
• Finding missing links according to the model
• Translating links into target languages to sync
• Computing similarity between existing and new
• Unifying synchronized data
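Several of the steps above can be sketched end to end. This is a simplified illustration under stated assumptions, not the method from the slides: links are pulled from raw wikitext with a regular expression, the influence model is skipped (every link is treated equally), Jaccard similarity stands in for the unspecified similarity measure, and the names `extract_links`, `jaccard`, `sync_links`, and the toy texts are hypothetical.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# captures only the target title. Namespace links (File:, Category:)
# are not filtered out in this sketch.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_links(wikitext):
    """Return the set of link targets found in a piece of wikitext."""
    return {m.group(1).strip() for m in WIKILINK.finditer(wikitext)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def sync_links(source_text, target_text, interlang):
    """Translate source links, find those missing in the target, and unify."""
    source_links = extract_links(source_text)
    target_links = extract_links(target_text)
    # Translate only the links that have an inter-language counterpart.
    translated = {interlang[t] for t in source_links if t in interlang}
    missing = translated - target_links
    return {
        "similarity": jaccard(translated, target_links),
        "missing": missing,
        "unified": target_links | missing,
    }


# Toy example (hypothetical texts and link map).
en_text = "Badminton uses a [[Racket]] and a [[Shuttlecock|birdie]]. See [[Olympic Games]]."
ko_text = "배드민턴은 [[라켓]]을 사용한다."
en_to_ko = {"Racket": "라켓", "Shuttlecock": "셔틀콕", "Olympic Games": "올림픽"}

result = sync_links(en_text, ko_text, en_to_ko)
print(result["missing"], round(result["similarity"], 2))
```

A real pipeline would replace the equal treatment of links with the influence model, so that only links judged important in the source language are proposed for the target.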


Preprocessing: Selecting Target

• Source languages (5)
  – English, Spanish, French, Chinese (a subset of the UN official languages), and Korean
• Extracting target pages that form a complete graph (clique) of inter-language links
  – Assumption:
    • Pages found in all 5 languages are key pages and the targets to sync
• Enforcing consistency of a link path
  – If a path from X(L1) to X’(L2) is found once, its inverse path (X’, X) is automatically added to the output

[Figure: example clique: en:Badminton, es:Bádminton, fr:Badminton, zh:羽毛球, ko:배드민턴]
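The target selection described above (inverse-path enforcement followed by clique extraction) can be sketched as follows. This is one possible reading, assuming each page is represented as a hypothetical `(language, title)` pair and inter-language links as directed pairs of pages; the function names and the "Example" page are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

LANGS = ("en", "es", "fr", "zh", "ko")


def symmetric_closure(links):
    """Add the inverse (X', X) of every inter-language link (X, X')."""
    closed = set(links)
    closed.update((b, a) for a, b in links)
    return closed


def find_cliques(links, langs=LANGS):
    """Return groups with one page per language that are pairwise linked."""
    neighbors = defaultdict(set)
    for a, b in symmetric_closure(links):
        neighbors[a].add(b)

    cliques = []
    # Start from pages in the first language to report each group once.
    for page in sorted(p for p in neighbors if p[0] == langs[0]):
        group = {page} | neighbors[page]
        one_per_lang = sorted(lang for lang, _ in group) == sorted(langs)
        pairwise = all(b in neighbors[a] for a in group for b in group if a != b)
        if one_per_lang and pairwise:
            cliques.append(group)
    return cliques


# Toy data: Badminton is linked across all five languages (one direction
# only, so the inverse links come from the closure); "Example" is not.
badminton = [("en", "Badminton"), ("es", "Bádminton"), ("fr", "Badminton"),
             ("zh", "羽毛球"), ("ko", "배드민턴")]
links = set(combinations(badminton, 2))
links.add((("en", "Example"), ("ko", "예시")))

cliques = find_cliques(links)
print(len(cliques))
```

Only the Badminton group is selected, since the "Example" page exists in just two of the five languages.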

Preprocessing: Selected Pages

• Total: 42,077 pages
  – Example: page-length comparison
    • Badminton (배드민턴)
      – en (52,098) > fr (26,508) > ko (22,960) > zh (19,050) > es (17,594)
    • Suncheon,_Jeollanam-do (순천시_(전라남도))
      – ko (20,816) > en (8,910) > zh (1,688) > es (1,600) > fr (1,503)

[Chart: Page Length Winning Cnt: English 62%, Spanish 11%, French 19%, Chinese 5%, Korean 3%]
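A short sketch of how such a winning count could be computed, assuming "winning" means that a language holds the longest version of a page (an interpretation, not stated on the slide). The function name `winning_counts` is hypothetical; the page lengths are the two examples given above.

```python
from collections import Counter


def winning_counts(page_lengths):
    """Count, per language, how many pages have their longest version there."""
    wins = Counter()
    for lengths in page_lengths.values():
        winner = max(lengths, key=lengths.get)  # language with the longest page
        wins[winner] += 1
    return wins


page_lengths = {
    "Badminton": {"en": 52098, "fr": 26508, "ko": 22960, "zh": 19050, "es": 17594},
    "Suncheon,_Jeollanam-do": {"ko": 20816, "en": 8910, "zh": 1688, "es": 1600, "fr": 1503},
}

wins = winning_counts(page_lengths)
total = sum(wins.values())
for lang, count in wins.most_common():
    print(f"{lang}: {count} ({100 * count / total:.0f}%)")
```

Run over all 42,077 selected pages, the same tally would yield per-language shares like those in the chart.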

Modeling on influence links

• Example of links in multiple-language Wikipedias
  – Different Wikipedias have different viewpoints and different concerns (fig)
  – Over time, users add some links and delete others
• We need to know the permutation distance of links in each language's Wikipedia (ongoing)
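The slides do not say which permutation distance is intended. As one possible choice, the sketch below uses the Kendall tau distance, which counts the pairs of links that two orderings rank in opposite order. It assumes the links of both language versions have already been mapped to shared identifiers and restricted to the links they have in common; the function name and the toy orderings are hypothetical.

```python
from itertools import combinations


def kendall_tau_distance(order_a, order_b):
    """Number of item pairs that the two orderings rank in opposite order."""
    if len(set(order_a)) != len(order_a) or len(set(order_b)) != len(order_b):
        raise ValueError("orderings must not contain duplicates")
    if set(order_a) != set(order_b):
        raise ValueError("orderings must contain the same items")
    rank_b = {item: i for i, item in enumerate(order_b)}
    # combinations() yields (x, y) with x before y in order_a; the pair is
    # discordant when order_b places x after y.
    return sum(1 for x, y in combinations(order_a, 2) if rank_b[x] > rank_b[y])


# Toy orderings of the same four links in two language versions.
links_l1 = ["Racket", "Shuttlecock", "Olympic Games", "Net"]
links_l2 = ["Shuttlecock", "Racket", "Net", "Olympic Games"]

print(kendall_tau_distance(links_l1, links_l2))
```

A distance of 0 means both versions order the shared links identically; the maximum, n(n-1)/2 for n links, means one ordering is the reverse of the other.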

Evaluation Plan

• Compare how much useful information is filled in from other-language resources
  – Links of the featured article in L1 vs. unified links from M-Sync in L2, …, LN
• Compare how much relevant information is filled in from other-language resources
  – NGD (normalized Google distance)
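For reference, the normalized Google distance between terms x and y is usually defined as NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f gives the number of pages containing the term(s) and N is the total number of indexed pages. A minimal sketch follows; the function name `ngd` and the hit counts in the example are made up for illustration, and how the counts would be obtained is left open.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google distance from page counts.

    fx, fy: number of pages containing x and y respectively
    fxy:    number of pages containing both x and y
    n:      total number of indexed pages
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur on at least one page")
    if fxy <= 0:
        return math.inf  # terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Toy counts (hypothetical): smaller values mean more closely related terms.
print(ngd(1000, 100, 10, 10**6))
print(ngd(1000, 1000, 1000, 10**6))
```

In this evaluation, a newly added link whose NGD to the page topic is small would count as relevant information.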

Task & Schedule

• Target conference:
  – Web Intelligence 2011 (3/4, 3/11)
• Task
  – Modeling on influence links to synchronize
    • Link category analysis for each language
      – Using Wikipedia link information
      – Using Wikipedia templates and categories (CAT2ISA)
    • Link evolution analysis for each language
      – Using Wikipedia edit history
  – Making an evaluation dataset