Multilingual Synchronization focusing on Wikipedia Eun-kyung Kim 2011-02-25

Introduction

• Wikipedia: multilingual encyclopedia
  – Supports over 270 languages
    • English, German, Spanish, French, Chinese, Arabic, …
    • Allows cross-lingual navigation with inter-language links
  – Inter-language links: hyperlinks from any page in one Wikipedia language edition to one or more nearly equivalent or exactly equivalent pages in other Wikipedia language editions
  – Different quantity of data in each language
    • Other language editions of Wikipedia often suffer from a lack of information compared to the English version
      – Multilingual statistics as of Feb. 2011
        » English: 3.5 million articles (most dominant)
        » French: 1 million articles (3rd)
        » Korean: 156,290 articles (22nd)

Goal of M-Sync

• Multilingual Synchronization
  – Synchronizing the contents of Wikipedia across multiple languages
    • Linking contents among multiple languages
    • Combining them into a synthesis
  – The Wikipedia editions in different languages
    • can offer more precise and detailed information based on different intentions/backgrounds/cultures
    • can fill the gaps between languages and yield integrated knowledge

Two types of M-Sync

• Factual synchronization
  – Filling in missing information
• Cultural synchronization
  – Improving unknown information

Factual Synchronization

Approach: Factual Synchronization

• Hypothesis
  – X is a key fact in L1 → X’ should be a key fact in L2
    • where X’ is the term corresponding to X in a different language
      – Assumption
        » Inter-language links accurately connect two pages about the same entity or concept in different languages
• Key facts in Wikipedia come from structured data such as:
  – Infobox

Infobox

• An infobox
  – is a fixed-format table
  – presents a summary of some unifying aspect that the articles share and improves navigation to other interrelated articles
  – contains facts and statistics

Example of Infobox Asymmetry (Absence)

Factual Synchronization

• Infobox Synchronization
  – Similar template structures
    • Template alignment
    • Contents translation
      – Wikipedia dictionary-based
      – Google Translation API-based
• Application
  – Seed data for article generation
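The alignment-then-translation idea above can be sketched roughly as follows. This is a minimal illustration, not the M-Sync implementation: it assumes each infobox is already parsed into a dict, that `attr_alignment` (a hand-built attribute mapping between two templates) and `value_dictionary` (a title dictionary such as one derived from inter-language links) are available, and all names and sample values are hypothetical.

```python
def sync_infobox(source, target, attr_alignment, value_dictionary):
    """Return a copy of `target` with missing attributes filled from `source`.

    Hypothetical helper: attributes are aligned through `attr_alignment`,
    and values are translated by dictionary lookup when an entry exists.
    """
    merged = dict(target)
    for src_attr, src_value in source.items():
        tgt_attr = attr_alignment.get(src_attr)
        if tgt_attr is None or tgt_attr in merged:
            continue  # unaligned attribute, or the target already has a value
        # Fall back to the untranslated value when the dictionary has no entry.
        merged[tgt_attr] = value_dictionary.get(src_value, src_value)
    return merged


# Made-up sample data for illustration only.
en_infobox = {"name": "Seoul", "country": "South Korea", "population": "1000"}
ko_infobox = {"이름": "서울"}
attr_alignment = {"name": "이름", "country": "국가", "population": "인구"}
value_dictionary = {"Seoul": "서울", "South Korea": "대한민국"}

synced = sync_infobox(en_infobox, ko_infobox, attr_alignment, value_dictionary)
```

Values already present in the target infobox are left untouched, so this sketch fills absences only and does not attempt to resolve conflicting values.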

Limitations of Infobox Synchronization

• Missing key properties
  – E.g., symptoms of a disease
• Infobox conflicts
  – How to select the target
    • Infobox A in L1 vs. Infobox B in L2 → A, B, or C?
• User dissatisfaction
  – Wikipedia is not a parallel corpus
  – Cultural differences must not be ignored
    • Most articles in different languages are independently created by different users and independently maintained by different communities

Cultural Synchronization

Distributions on Multilingual Overlaps

[Figure: bar chart with a count axis from 0 to 800,000 over the language overlaps EN + FR + ES + RU + ZH + KO + AR; a second version of the slide marks a region labeled "Culturally Unique Contents".]

Cultural Synchronization

• Synthesizing missing information according to each language's background knowledge and characteristics
  – focusing on how to add new information from other language resources

Approach: Cultural Synchronization

• Hypothesis
  – X has a topic model M1 in L1 → X’ may have a topic model M2 in L2
    • where X’ is the term corresponding to X in a different language
      – Assumption
        » Inter-language links accurately connect two pages about the same entity or concept in different languages
    • where M1 and M2 have different topic distributions according to their topical intentions
• Topic Model
  – A type of statistical model for discovering the abstract "topics" that occur in a collection of documents

Topic Model on Links

• Latent Dirichlet allocation (LDA)
  – The most common topic model currently in use
    • Each document may be viewed as a mixture of various topics
      – If the observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics
        » The topic distribution is assumed to have a Dirichlet prior
      – Specifically, links from a word w to a document d depend directly on how frequent the topic of w is in d
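The generative story described above, with links standing in for words, can be sketched as follows. This is only an illustration of the LDA generative process (it does not perform inference), and the two topics, their link probabilities, and the Dirichlet parameter `alpha` are made-up assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_link_probs, alpha, n_links, rng):
    """Generate (topic, link) pairs following the LDA generative process.

    topic_link_probs: one dict per topic, mapping link -> probability.
    """
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    topic_ids = list(range(len(topic_link_probs)))
    document = []
    for _ in range(n_links):
        # Each link is attributed to exactly one topic drawn from theta.
        z = rng.choices(topic_ids, weights=theta, k=1)[0]
        links = list(topic_link_probs[z])
        weights = [topic_link_probs[z][link] for link in links]
        document.append((z, rng.choices(links, weights=weights, k=1)[0]))
    return theta, document


# Hypothetical topics over link targets (illustrative values only).
topics = [
    {"Orthoptera": 0.5, "Cricket": 0.3, "Katydid": 0.2},  # a "biology" topic
    {"Pest": 0.6, "Agriculture": 0.4},                    # an "agriculture" topic
]
theta, document = generate_document(topics, alpha=[0.5, 0.5], n_links=10,
                                    rng=random.Random(0))
```

Fitting a topic model runs this process in reverse: given only the observed links of many documents, it estimates the per-document mixtures and the per-topic link distributions.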

Links on the Web

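• Links
  – navigate to a web page with more detailed information
  – point to previously published web pages with similar or related content
• Understanding the influence of each link can substantially benefit many applications
  – e.g., multilingual sync

[Figure: link graph of Korean-language pages around 메뚜기 (grasshopper): 메뚜기목 (Orthoptera), 귀뚜라미 (cricket), 베짱이 (katydid), 방아깨비 (long-headed grasshopper), 풀무치 (migratory locust), 농업 (agriculture), 예멘 (Yemen), 사우디아라비아 (Saudi Arabia), 해충 (pest), 여치 (bush cricket), 벼메뚜기 (rice grasshopper).]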

Link types of Wikipedia

• Internal links to other pages in the wiki
  – Syntax usage: [[Main Page]]
• External links to other websites
• Interwiki links to other websites registered to the wiki in advance
  – Unlike internal links, interwiki links do not use page existence detection
  – Syntax usage: [[wikipedia:Sunflower]]
• Interlanguage links to other websites registered as other language versions of the wiki
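As a rough sketch of how these four link types could be told apart in raw wikitext, the following uses two regular expressions. It assumes that interlanguage links appear in the page source with a language-code prefix (e.g. `[[ko:...]]`) and that the sets of language codes and interwiki prefixes are supplied by the caller; the function name and sample text are hypothetical, and namespace prefixes such as categories or files are simply treated as internal links here.

```python
import re

# [[Target]] or [[Target|label]]: capture the target only.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
# [http://... optional label]: a single bracket not preceded by another one.
EXTERNAL = re.compile(r"(?<!\[)\[(https?://[^\s\]]+)[^\]]*\]")


def classify_links(wikitext, language_codes, interwiki_prefixes):
    """Group the links found in `wikitext` by link type."""
    result = {"internal": [], "interwiki": [], "interlanguage": [], "external": []}
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        prefix, sep, rest = target.partition(":")
        key = prefix.strip().lower()
        if sep and key in language_codes:
            result["interlanguage"].append((key, rest.strip()))
        elif sep and key in interwiki_prefixes:
            result["interwiki"].append((key, rest.strip()))
        else:
            result["internal"].append(target)
    result["external"] = EXTERNAL.findall(wikitext)
    return result


sample = ("See [[Main Page]] and [[wikipedia:Sunflower]]. "
          "[http://example.org Example site] [[ko:메뚜기]] [[Locust|locusts]]")
links = classify_links(sample, language_codes={"ko", "fr", "zh"},
                       interwiki_prefixes={"wikipedia"})
```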

Multilingual Synchronization Process

[Figure: process flow]
• Input: Wikipedia Data L1, Wikipedia Data L2, …, Wikipedia Data LN
• Preprocessing (target page selection)
• Extracting links
• Modeling on influence links (L1, L2, …, LN)
• Finding missing links according to the model
• Translating links into target languages to sync
• Computing similarity between existing and new
• Unifying synchronized data
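The later steps of the process (finding missing links, translating them, and comparing existing with new content) can be sketched in a much simplified form. The sketch below skips the topic modeling step entirely and uses a plain set difference; it assumes link sets per page and an inter-language title mapping are already available, and the Jaccard index is only one possible similarity measure, chosen here for illustration. All names and data are hypothetical.

```python
def find_missing_links(source_links, target_links, interlanguage_map):
    """Translate source links and keep those absent from the target page.

    Returns (missing, untranslatable): translated links the target lacks,
    and source links with no inter-language counterpart.
    """
    missing, untranslatable = [], []
    for link in sorted(source_links):
        translated = interlanguage_map.get(link)
        if translated is None:
            untranslatable.append(link)
        elif translated not in target_links:
            missing.append(translated)
    return missing, untranslatable


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up link sets for an English page and its Korean counterpart.
en_links = {"Orthoptera", "Cricket", "Agriculture", "Swarm"}
ko_links = {"메뚜기목", "귀뚜라미"}
en_to_ko = {"Orthoptera": "메뚜기목", "Cricket": "귀뚜라미", "Agriculture": "농업"}

missing, untranslatable = find_missing_links(en_links, ko_links, en_to_ko)
unified = ko_links | set(missing)          # synchronized link set
similarity = jaccard(ko_links, unified)    # existing vs. synchronized
```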

Modeling on links

• Example of links in multiple language editions of Wikipedia
  – Different Wikipedias have different viewpoints and different concerns (fig.)
  – Over time, some links are newly added and others are deleted by users

How many topics on a document?
Example: Section Headings of “Autism”

• Korean: 역사; 증상; 사회적 성장; 의사 소통; 바깥 고리
• English: Characteristics; Social development; Communication; Repetitive behavior; Other symptoms; Classification; Causes; Mechanism; Pathophysiology; Neuropsychology; Screening; Diagnosis; Management; Prognosis; Epidemiology; History; References; External links
• French: Présentation générale; Notion de spectre autistique; Catégorisation des troubles liés à l'autisme; L'autisme infantile; Le syndrome de Rett; Le syndrome d'Asperger; Épidémiologie; Par pays; En France; Au Maroc; Dépistage et diagnostic; Traitement; Pathologies associées; Histoire de la notion; Théorisation de l'autisme; L'approche psychanalytique; Théorie de l'esprit; Origine, test de ''Sally et Anne''; Remise en cause et évolution du concept; Désordre du traitement temporo-spatial des informations sensorielles; Recherche sur les causes (étiologie); La théorie de l'origine vaccinale; La théorie de l'intoxication aux métaux lourds; Anomalies cérébrales et défauts du placenta; Causes génétiques; Aire de perception de la voix; Voir aussi; Articles connexes; Bibliographie; Généraliste; Témoignages, biographie; Littérature; Vidéo et cinéma; Liens externes; Références
• Chinese: 定義; 特徵; 社交發展; 感官系統; 溝通的困難; 病因; 自閉症與超常智商的聯繫; 世界自閉症日; 治疗; 社会关注; 相关作品; 电影; 參見; 參考資料; 外部連結

How many topics on a document?
Example: Translated Section Headings of “Autism”

• Korean: 역사; 증상; 사회적 성장; 의사 소통; 바깥 고리
• English (translated into Korean): 특성; 사회 개발; 통신; 반복적인 행동; 기타 증상; 분류; 원인; 기구; Pathophysiology; Neuropsychology; 전형; 진단; 관리; 예지; 역학; 역사; 참고 문헌; 외부 링크
• French (translated into Korean): 개요; 자폐증 스펙트럼의 의미; 장애의 분류가 자폐증과 관련; 유아의 자폐증; Rett 증후군; 아스퍼거 증후군; 역학; 문학; 비디오 및 필름; 국가별; 프랑스에서; 모로코에서; 심사 및 진단; 치료; 관련 질병; 개념의 역사; 자폐증의 Theorization; psychoanalytic 접근; 마음의 이론; 원래는 앤과 test''Sally''and; 도전과 개념을 변화; Temporomandibular 장애 치료 공간 감각 정보; 연구 원인 ( 병인 )에; 예방 접종 뒤에 이론; 중금속 중독의 이론; 뇌 이상과 태반의 결함; 유전적인 원인; 음성 인식 분야; 참고; 관련 기사; 서지; 일반; 증거 전기; 문학; 비디오 및 필름; 외부 링크; 참고 문헌
• Chinese (translated into Korean): 정의; 특징; 사회 개발; 관능 시스템; 의사 소통의 어려움; 원인; 특별한 지능 지수와 자폐증 링크; 세계 자폐증의 날; 치료; 사회 관심사; 관련 작품; 영화; 보기; 외부 링크

Technical Process Review

LDA-based Topic Model

• LDA
  – Document → Topics → Words (links)
    • Links: semantic key terms of a document
    • No word boundary detection required
  – Extract all links from target pages in 5 languages
    • Link extractor in Python: http://swrc.kaist.ac.kr/msync/
    • Link information can be extracted from the Wikipedia dump database
  – How many topics are selected?
    • According to sections
      – Section: a page can and should be divided into sections, using the section heading syntax
    • Section heading extraction in shell: http://swrc.kaist.ac.kr/msync/
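Choosing the number of topics from the section structure could look roughly like the sketch below. It is not the tool hosted at the URL above; it assumes the standard `== Heading ==` wikitext syntax and, as one possible reading of "according to sections", counts only top-level (level-2) headings. The function name and sample text are hypothetical.

```python
import re

# A heading line: the same run of 2-6 "=" on both sides of the title.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$", re.MULTILINE)


def extract_section_headings(wikitext):
    """Return (level, title) pairs for every section heading, in page order."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING.finditer(wikitext)]


sample = (
    "Intro text.\n"
    "== Characteristics ==\n"
    "=== Social development ===\n"
    "=== Communication ===\n"
    "== Causes ==\n"
    "Body.\n"
)
headings = extract_section_headings(sample)
# Assumption: one topic per top-level section.
n_topics = sum(1 for level, _ in headings if level == 2)
```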

Advanced LDA-based Topic Model

• We generate a topic model with the out-going hypertext of a document.
• We generate a topic model with the in-coming hypertext of a document.

[Figure: a node labeled "Document" with linked pages AAA, BBB, CCC, DDD, EEE, FFF, ZZZ; a second version of the slide adds pages 111, 222, 333.]

A specific model method is required! (Novelty)
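Collecting both kinds of hypertext for a page can be sketched as below: out-going links are read off the page itself, while in-coming links require inverting the link map of the whole collection. This is a minimal illustration under the assumption that an out-going link map has already been extracted; the function names and page labels are hypothetical.

```python
from collections import defaultdict


def invert_links(outgoing):
    """Build the in-coming link map from an out-going one.

    outgoing: dict mapping page -> set of pages it links to.
    """
    incoming = defaultdict(set)
    for source, targets in outgoing.items():
        for target in targets:
            incoming[target].add(source)
    return dict(incoming)


def link_document(page, outgoing, incoming):
    """Return the two bags of links that could feed a topic model for `page`."""
    return {
        "out": sorted(outgoing.get(page, ())),
        "in": sorted(incoming.get(page, ())),
    }


# Made-up link map using placeholder page labels.
outgoing = {
    "Document": {"AAA", "BBB"},
    "111": {"Document"},
    "222": {"Document", "AAA"},
}
incoming = invert_links(outgoing)
doc_links = link_document("Document", outgoing, incoming)
```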

Example: Out-going Hypertext

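[Figure: out-going links of the Korean page 메뚜기 (grasshopper): 메뚜기목 (Orthoptera), 귀뚜라미 (cricket), 방아깨비 (long-headed grasshopper), 풀무치 (migratory locust), 농업 (agriculture), 예멘 (Yemen), 사우디아라비아 (Saudi Arabia), 여치 (bush cricket), 벼메뚜기 (rice grasshopper), 곰팡이 (fungus), 아프리카 (Africa), 중동 (Middle East), 천적 (natural enemy), 거미 (spider), 사마귀 (mantis), 때까치 (shrike), 개구리 (frog), 구약성서 (Old Testament), 야훼 (Yahweh), 출애굽기 (Book of Exodus), 베짱이 (katydid), 해충 (pest).]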

Example: In-coming Hypertext

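[Figure: in-coming links of the Korean page 메뚜기 (grasshopper): 메뚜기과 (Acrididae), 여치 (bush cricket), 타임 보칸 (Time Bokan), 진드기 (tick), 초원 (grassland), 최진실 (Choi Jin-sil), 코뿔새과 (Bucerotidae), 신사임당 (Shin Saimdang), 백악기 (Cretaceous), 땅돼지 (aardvark), 메뚜기아목 (Caelifera), 메뚜기목 (Orthoptera), 가면라이더_OOO (Kamen Rider OOO), 타임_크라이시스_시리즈의_등장인물 (characters of the Time Crisis series), 민족 무용 (folk dance), 딱다기, 호랑이 (tiger), 탄문, 콩고_민주_공화국_요리 (cuisine of the Democratic Republic of the Congo), 애벌레 (larva), 프레리도그 (prairie dog), 땅늑대 (aardwolf), 벼 (rice), 아스테카문명 (Aztec civilization), 유재석 (Yoo Jae-suk), 무한도전 (Infinite Challenge).]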

Contributions

• To show the diverse topic distributions of related entities in several language editions of Wikipedia, depending on different topical intentions

• To make Wikipedia pages more shareable among multilingual users, weighted by their culturally biased interests

• To provide seed data (seed keywords) for completing articles in a multilingual manner, or for guiding users in generating new articles in Wikipedia