Multilingual Synchronization


Eun-kyung Kim, 2011-02-10

Introduction

• Wikipedia
  – Supports over 270 languages
  – Allows cross-lingual navigation through inter-language links
  – The quantity of data differs across languages
• Goal
  – Synchronize multilingual Wikipedia data to fill the gaps between languages and to acquire integrated knowledge

Methodology (base)

• Hypothesis
  – If X is a key fact in L1, then X’ should be a key fact in L2
    • where X’ is the term corresponding to X in the other language
  – Assumption
    » Inter-language links accurately connect two pages about the same entity or concept in different languages
• Key facts come from structured data such as:
  – Infobox
  – Category
  – Hyperlink text (rather than normal text)
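As an illustration of the three structured sources listed above, here is a minimal sketch that pulls infobox fields, categories, and hyperlink text out of raw wikitext. The regular expressions, the `extract_key_facts` name, and the sample page are assumptions made for this example; real pages contain nested templates and need a proper wikitext parser.

```python
import re

# Simplified patterns (assumption): [[target]] or [[target|text]] links,
# and one "| key = value" infobox field per line.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
FIELD_RE = re.compile(
    r"^[ \t]*\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE
)


def extract_key_facts(wikitext, category_prefix="Category:"):
    """Collect infobox fields, categories, and (target, text) link pairs."""
    categories, links = [], []
    for target, label in LINK_RE.findall(wikitext):
        target = target.strip()
        if target.startswith(category_prefix):
            categories.append(target[len(category_prefix):])
        else:
            # The visible hyperlink text falls back to the target title.
            links.append((target, label or target))
    infobox = dict(FIELD_RE.findall(wikitext))
    return {"infobox": infobox, "categories": categories, "links": links}


# Hypothetical sample page, not copied from Wikipedia.
SAMPLE = """{{Infobox sport
| name = Badminton
| equipment = Shuttlecock
}}
'''Badminton''' is a [[racquet sport]] played with a [[shuttlecock|shuttle]].
[[Category:Badminton]]
"""

facts = extract_key_facts(SAMPLE)
```

Keeping the link text next to the link target matters here because the text, not only the target, is treated as a key fact.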

Methodology

• Basic methodology
  – Infobox synchronization between English and Korean
    • Duplicate resolution and conflict resolution
• Comments from the committee (Ph.D. proposal)
  – Improve the multilingualism
  – Do not ignore multiple viewpoints
  – Hard to evaluate
• Extended methodology
  – Sync target selection in 5 languages
  – Key-fact synchronization covering not only Infobox but also LinkText
  – Filling in missing information according to each language's background knowledge and characteristics
    • focusing on how to add new information from other-language resources
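The basic infobox synchronization step can be sketched as follows. This is only one possible reading of "duplicate resolving & conflict resolving": equal values are kept once, and differing values are recorded side by side instead of being overwritten, in line with the committee's comment about multiple viewpoints. The `sync_infobox` function, the glossary-based `translate` helper, and all field values are hypothetical.

```python
def sync_infobox(src, dst, translate):
    """Merge translated source fields into a copy of the destination infobox.

    Returns (merged, conflicts). A field missing from dst is filled in,
    an identical field is a duplicate and kept once, and a differing field
    is reported as a conflict while the destination value is kept.
    """
    merged = dict(dst)
    conflicts = {}
    for key, value in src.items():
        t_key, t_value = translate(key), translate(value)
        if t_key not in merged:
            merged[t_key] = t_value                      # fill the gap
        elif merged[t_key] != t_value:
            conflicts[t_key] = (merged[t_key], t_value)  # keep both viewpoints
        # otherwise: duplicate, nothing to add
    return merged, conflicts


# Toy Korean-to-English glossary and infoboxes (not real Wikipedia data).
GLOSSARY = {
    "이름": "name", "배드민턴": "Badminton",
    "장비": "equipment", "셔틀콕": "Shuttlecock",
    "종류": "type", "라켓 스포츠": "Racquet sport",
}


def translate(term):
    return GLOSSARY.get(term, term)


ko_infobox = {"이름": "배드민턴", "장비": "셔틀콕", "종류": "라켓 스포츠"}
en_infobox = {"name": "Badminton", "type": "Net sport"}

merged, conflicts = sync_infobox(ko_infobox, en_infobox, translate)
```

In this toy run the English infobox gains the `equipment` field, `name` is recognized as a duplicate, and `type` is reported as a conflict with both values preserved.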


Example of Infobox Synchronization

• Drawback of Infobox
  – Sometimes meaningless for synchronization
• Solution
  – Add link information to the synchronization

[Figure: Infobox from Arthritis]

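Links on the Web

• Links
  – navigate to a web page with more detailed information
  – point to previously published web pages with similar or related content
• Understanding the influence of each link can substantially benefit many applications
  – e.g., multilingual sync

[Figure: example links in Korean Wikipedia: 메뚜기 (grasshopper), 메뚜기목 (Orthoptera), 귀뚜라미 (cricket), 베짱이 (katydid), 방아깨비 (long-headed grasshopper), 풀무치 (migratory locust), 농업 (agriculture), 예멘 (Yemen), 사우디아라비아 (Saudi Arabia), 해충 (pest), 여치 (bush cricket), 벼메뚜기 (rice grasshopper)]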

Multilingual Synchronization Process

[Diagram: Wikipedia data in L1, L2, …, LN flows through the following steps]

• Preprocessing (target page selection)
• Extracting links
• Modeling on influence links (for L1, L2, …, LN)
• Finding missing links according to the model
• Translating links into target languages to sync
• Computing similarity between existing and new links
• Unifying synchronized data
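The later steps of the process (finding missing links, translating them, and unifying) can be sketched roughly as below. This is a simplified illustration that skips the influence model and the similarity computation: a link counts as missing whenever another language has it and its translation is absent from the target page. The data structures, the `find_missing_links` name, and the toy link sets are assumptions.

```python
def find_missing_links(links_by_lang, interlang, target):
    """Return {translated title: languages that have it} for links the target lacks.

    links_by_lang: {lang: set of link titles on the page in that language}
    interlang:     {(lang, title): {other_lang: title}} inter-language link table
    """
    existing = links_by_lang[target]
    missing = {}
    for lang, links in links_by_lang.items():
        if lang == target:
            continue
        for title in links:
            translated = interlang.get((lang, title), {}).get(target)
            if translated is None or translated in existing:
                continue  # no inter-language link, or already present
            missing.setdefault(translated, set()).add(lang)
    return missing


# Toy data (hypothetical link sets for one page in two languages).
links_by_lang = {
    "en": {"Shuttlecock", "Racket", "Olympic Games"},
    "ko": {"셔틀콕"},
}
interlang = {
    ("en", "Shuttlecock"): {"ko": "셔틀콕"},
    ("en", "Olympic Games"): {"ko": "올림픽"},
    # "Racket" has no Korean counterpart in this toy table.
}

missing = find_missing_links(links_by_lang, interlang, target="ko")
unified = links_by_lang["ko"] | set(missing)
```

Recording which languages contributed each candidate leaves room for a later ranking step, for example preferring links that several languages agree on.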


Preprocessing: Selecting Target

• Source languages (5)
  – English, Spanish, French, Chinese, Korean
  – [Callout on the language list: a subset of UN official languages]
• Extracting target pages that form a complete graph (clique) of inter-language links
  – Assumption:
    • Pages found in all 5 languages are key pages and the targets to sync
• Enforcing consistency of a link path
  – If a path from X (L1) to X’ (L2) is found once, its inverse path (X’, X) is automatically added to the output

[Figure: example clique of inter-language links: en:Badminton, es:Bádminton, fr:Badminton, zh:羽毛球, ko:배드민턴]
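The target selection described above can be sketched as follows, assuming the inter-language links are available as directed pairs of `(language, title)` nodes. The inverse of every link is added first, and a group is kept only if it contains exactly one page per language and every page links to every other. The function name and the second, incomplete example group are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

LANGS = ("en", "es", "fr", "zh", "ko")


def select_cliques(interlang_links, langs=LANGS):
    """Return groups of pages that are fully inter-linked across all languages.

    interlang_links: iterable of ((lang, title), (lang, title)) directed links.
    """
    neighbors = defaultdict(set)
    for src, dst in interlang_links:
        neighbors[src].add(dst)
        neighbors[dst].add(src)  # consistency: add the inverse path as well

    cliques = set()
    for page, linked in list(neighbors.items()):
        group = linked | {page}
        # Require exactly one page per source language.
        if len(group) != len(langs) or {lang for lang, _ in group} != set(langs):
            continue
        # Complete graph: every member must link to every other member.
        if all(neighbors.get(p, set()) >= group - {p} for p in group):
            cliques.add(frozenset(group))
    return cliques


badminton = [
    ("en", "Badminton"), ("es", "Bádminton"), ("fr", "Badminton"),
    ("zh", "羽毛球"), ("ko", "배드민턴"),
]
# Each pair is linked in one direction only; the inverse is added automatically.
links = list(combinations(badminton, 2))
# A hypothetical page that exists in only two languages is not selected.
links.append((("en", "Example page"), ("ko", "예시 문서")))

cliques = select_cliques(links)
```

With this input only the Badminton group is returned, since the two-language page cannot form a five-language clique.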

Preprocessing: Selected Pages

• Total: 42,077 pages
  – Example: page-length comparison
    • Badminton (배드민턴)
      – en (52,098) > fr (26,508) > ko (22,960) > zh (19,050) > es (17,594)
    • Suncheon,_Jeollanam-do (순천시_(전라남도))
      – ko (20,816) > en (8,910) > zh (1,688) > es (1,600) > fr (1,503)

[Chart: Page Length Winning Count: English 62%, French 19%, Spanish 11%, Chinese 5%, Korean 3%]
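The chart can be reproduced in miniature with the two pages quoted on the slide. The sketch below assumes that "winning count" means the number of selected pages for which a given language has the longest version; the slide does not define it explicitly, and the function name is made up for the example.

```python
from collections import Counter

# Page lengths quoted on the slide for two of the 42,077 selected pages.
PAGE_LENGTHS = {
    "Badminton": {"en": 52098, "fr": 26508, "ko": 22960, "zh": 19050, "es": 17594},
    "Suncheon,_Jeollanam-do": {"ko": 20816, "en": 8910, "zh": 1688, "es": 1600, "fr": 1503},
}


def winning_counts(page_lengths):
    """Count, per language, how many pages it has the longest version of."""
    return Counter(max(lengths, key=lengths.get) for lengths in page_lengths.values())


wins = winning_counts(PAGE_LENGTHS)
total = sum(wins.values())
shares = {lang: count / total for lang, count in wins.items()}
```

On these two pages English and Korean win once each; running the same count over all selected pages would give shares comparable to the chart.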

Modeling on influence links

• Example of links in multiple-language Wikipedias
  – Different Wikipedias have different viewpoints and different concerns (fig)
  – Some links are newly added and others are deleted by users over time
• We need to know the permutation distance of links in each language's Wikipedia (ongoing)
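The slides do not say which permutation distance is intended. As one common candidate, the sketch below computes the Kendall tau distance, that is, the number of item pairs that two orderings rank in opposite order. The link orderings are hypothetical and assume both lists contain the same links after translation.

```python
from itertools import combinations


def kendall_tau_distance(order_a, order_b):
    """Count the item pairs that the two orderings rank in opposite order."""
    if len(set(order_a)) != len(order_a) or len(order_a) != len(order_b) \
            or set(order_a) != set(order_b):
        raise ValueError("both orderings must contain the same distinct items")
    rank_b = {item: i for i, item in enumerate(order_b)}
    # combinations() yields (x, y) with x before y in order_a.
    return sum(1 for x, y in combinations(order_a, 2) if rank_b[x] > rank_b[y])


# Hypothetical link orderings of the same page in two language editions.
links_l1 = ["Shuttlecock", "Racket", "Olympic Games", "Net"]
links_l2 = ["Racket", "Shuttlecock", "Net", "Olympic Games"]

distance = kendall_tau_distance(links_l1, links_l2)
```

A distance of 0 means the two editions order the links identically, and the maximum for n links is n(n-1)/2 when one ordering is the reverse of the other.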

Evaluation Plan

• Compare how much useful information can be filled in from other-language resources
  – Links of the Featured article in L1 vs. unified links from M-Sync in L2, …, LN
• Compare how much relevant information can be filled in from other-language resources
  – NGD (normalized Google distance)
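For reference, the normalized Google distance between two terms x and y is (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f counts the pages containing the term(s) and N is the total number of indexed pages. A minimal sketch, assuming the hit counts have already been obtained from some search index (the counts below are made-up numbers, and N is assumed to be larger than either count):

```python
import math


def ngd(f_x, f_y, f_xy, n):
    """Normalized Google distance computed from page counts.

    f_x, f_y: number of pages containing each term
    f_xy:     number of pages containing both terms
    n:        total number of indexed pages (normalizing constant)
    """
    if min(f_x, f_y, f_xy) <= 0:
        return math.inf  # the terms never occur, or never co-occur
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(n) - min(log_x, log_y))


# Made-up counts: terms that always co-occur vs. terms that rarely co-occur.
always_together = ngd(1000, 1000, 1000, 10**6)
rarely_together = ngd(1000, 1000, 10, 10**6)
```

Smaller values indicate more closely related terms, so a candidate link with a low NGD to the page title would count as relevant under this measure.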

Task & Schedule

• Target conference:
  – Web Intelligence 2011 (3/4, 3/11)
• Task
  – Modeling on influence links to synchronize
    • Link category analysis for each language
      – Using Wikipedia link information
      – Using Wikipedia templates and categories (CAT2ISA)
    • Link evolution analysis for each language
      – Using Wikipedia edit history
    • Making an evaluation dataset
