Extending Wordnet Bahasa with External Resources

Lim Lian Tze (1) and Tang Enya Kong (2)

(1) KDU College Penang, Malaysia (liantze@gmail.com)
(2) Linton University College, Malaysia (enyagkong1@gmail.com)

WordNet Bahasa Hackathon/Workshop

Topics

1. How it all started
2. Utilising Interlingual Links in Wikipedia Articles
3. Utilising Wikidata API
4. Possible Next Steps

How it all started

Aligning Bilingual Dictionary to Princeton WordNet

UTMK/USM linguists; Lim and Hussein (2006)

Kamus Inggeris-Melayu Dewan (KIMD):
dot n. small round spot, titik; (appearing in large numbers on dress, leaf, etc.) bintik

KIMD senses (manually) aligned to WordNet 1.6 senses:
kimd (dot, n, 1, [small round spot; (appearing in large numbers on dress, leaf, etc.)], ⟨titik, bintik⟩)
wordnet (110025218, 'dot', n, 1, [a very small circular shape], ...)

Malay WordNet synset:
(titik, bintik, [a very small circular shape], ...)
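The alignment on this slide can be read as a small data transformation: a KIMD sense and a WordNet sense that were aligned by hand yield one Malay synset. The sketch below only illustrates that reading. The class names `KimdSense`, `WordNetSense` and `MalaySynset` and the function `project` are hypothetical and not taken from the original project; the values are the ones shown on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KimdSense:
    """One sense of a KIMD entry (hypothetical structure)."""
    headword: str
    pos: str
    sense_no: int
    gloss: str
    malay: tuple  # Malay equivalents listed by the dictionary


@dataclass(frozen=True)
class WordNetSense:
    """One Princeton WordNet sense (hypothetical structure)."""
    synset_id: str
    lemma: str
    pos: str
    sense_no: int
    gloss: str


@dataclass(frozen=True)
class MalaySynset:
    synset_id: str
    lemmas: tuple
    gloss: str


def project(kimd, wn):
    """Build a Malay synset from a manually aligned (KIMD, WordNet) pair."""
    if (kimd.headword, kimd.pos) != (wn.lemma, wn.pos):
        raise ValueError("aligned senses must share headword and part of speech")
    # The Malay synset reuses the WordNet synset id and gloss,
    # and takes its lemmas from the dictionary's Malay equivalents.
    return MalaySynset(wn.synset_id, kimd.malay, wn.gloss)


kimd_dot = KimdSense(
    "dot", "n", 1,
    "small round spot; (appearing in large numbers on dress, leaf, etc.)",
    ("titik", "bintik"),
)
wn_dot = WordNetSense("110025218", "dot", "n", 1, "a very small circular shape")
malay_dot = project(kimd_dot, wn_dot)
print(malay_dot)
```

Reusing the Princeton synset identifier is an assumption of this sketch; it keeps the Malay synset linked to the English one so that relations can later be carried over.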


Malay WordNet Prototype

Nouns: 12,429 synsets
hypernymy, hyponymy
holonymy, meronymy (part-of, member-of, substance-of)

Verbs: 5,805 synsets
hypernymy, troponymy
cause
entailment


Screenshots

Utilising Interlingual Links in Wikipedia Articles

Wikipedia Article Dumps

<page>
  <title>Marikh</title>
  <text>
    {{Infobox Planet ...}}
    [[en:Mars]]
    [[es:Marte (planeta)]]
  </text>
</page>
<page>
  <title>Laut Kaspia</title>
  <text>
    [[Kategori:Tasik di Eropah|Kaspia]]
    [[Kategori:Tasik di Rusia|Kaspia]]
    [[Kategori:Tasik di Asia|Kaspia]]
    [[en:Caspian Sea]]
    [[es:Mar Caspio]]
  </text>
</page>


Categories and Multilingual Translations

[[Kategori:Tasik di Eropah|Kaspia]]
(Category: European Lakes → Caspia)

[[es:Mar Caspio]]
Spanish (es) translation = Mar Caspio
(Multilingual dictionary)

Spanish Wikipedia article about the Caspian Sea can be accessed at http://es.wikipedia.org/wiki/Mar_Caspio
(Bilingual/multilingual comparable corpora)
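A minimal sketch of pulling these two kinds of links out of an article's wikitext with regular expressions. The function names `extract_links` and `article_url` are illustrative only. The patterns assume that language codes are short lowercase prefixes and that the category namespace is spelled `Kategori`, as in the example above; a real dump has further namespaces and edge cases that this does not handle.

```python
import re

# Interlanguage links look like [[es:Mar Caspio]].
LANG_LINK = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)*):([^\]|]+)\]\]")
# Category links look like [[Kategori:Tasik di Eropah|Kaspia]];
# the part after "|" is optional and is ignored here.
CATEGORY = re.compile(r"\[\[Kategori:([^\]|]+)(?:\|[^\]]*)?\]\]")


def extract_links(wikitext):
    """Return (translations, categories) found in one article's wikitext."""
    translations = {lang: title.strip() for lang, title in LANG_LINK.findall(wikitext)}
    categories = [name.strip() for name in CATEGORY.findall(wikitext)]
    return translations, categories


def article_url(lang, title):
    """Address of the article in another language edition (spaces become underscores)."""
    return "http://{}.wikipedia.org/wiki/{}".format(lang, title.replace(" ", "_"))


sample = (
    "[[Kategori:Tasik di Eropah|Kaspia]]"
    "[[Kategori:Tasik di Rusia|Kaspia]]"
    "[[Kategori:Tasik di Asia|Kaspia]]"
    "[[en:Caspian Sea]][[es:Mar Caspio]]"
)
translations, categories = extract_links(sample)
print(translations)
print(categories)
print(article_url("es", translations["es"]))
```

The translations dictionary acts as the multilingual dictionary entry for the article, and the generated address points to the comparable article in the other language.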


Adding Entries to WordNet Bahasa

1. For the title of each Indonesian and Malaysian Wikipedia article, look up its corresponding English title in Princeton WordNet.

2. If only one synset is found, map the Indonesian/Malaysian title to it.

3. If multiple synsets are found, compare the hypernym chain of each synset to the semantic type and categories of the Wikipedia article. The first synset whose hypernym chain contains the semantic type or one of the categories is chosen as the synset to be mapped to.
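The three steps can be sketched as follows. This is not the authors' code. To stay self-contained it uses a tiny hand-made stand-in for WordNet: `LEMMA_INDEX`, `HYPERNYMS` and `LEMMAS_OF` are toy tables, and identifiers such as `mars.planet` are made up for illustration. How the semantic type and categories were matched against the hypernym chain in the original work is not stated on the slide; a case-insensitive exact match on lemma names is assumed here.

```python
# Toy stand-in for Princeton WordNet (all identifiers are hypothetical).
LEMMA_INDEX = {
    "mars": ["mars.god", "mars.planet"],
    "caspian sea": ["caspian_sea.lake"],
}
HYPERNYMS = {
    "mars.god": "roman_deity",
    "roman_deity": "deity",
    "mars.planet": "terrestrial_planet",
    "terrestrial_planet": "planet",
    "caspian_sea.lake": "lake",
}
LEMMAS_OF = {
    "roman_deity": {"roman deity"},
    "deity": {"deity", "god"},
    "terrestrial_planet": {"terrestrial planet"},
    "planet": {"planet"},
    "lake": {"lake"},
}


def hypernym_chain(synset, hypernyms):
    """Follow hypernym links upward and return the ancestors in order."""
    chain, seen = [], {synset}
    current = hypernyms.get(synset)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = hypernyms.get(current)
    return chain


def choose_synset(english_title, semantic_type, categories,
                  lemma_index, hypernyms, lemmas_of):
    """Steps 1-3: look up the English title, then disambiguate if needed."""
    candidates = lemma_index.get(english_title.lower(), [])
    if len(candidates) == 1:  # step 2: unambiguous
        return candidates[0]
    cues = {c.lower() for c in categories}
    if semantic_type:
        cues.add(semantic_type.lower())
    for synset in candidates:  # step 3: first synset whose chain matches a cue
        for ancestor in hypernym_chain(synset, hypernyms):
            if cues & lemmas_of.get(ancestor, set()):
                return synset
    return None  # no synset found, or none could be confirmed


# (local title, English title, semantic type, categories)
articles = [
    ("Marikh", "Mars", "planet", []),
    ("Laut Kaspia", "Caspian Sea", None, ["European Lakes"]),
]
mappings = {}
for local, english, sem_type, cats in articles:
    synset = choose_synset(english, sem_type, cats, LEMMA_INDEX, HYPERNYMS, LEMMAS_OF)
    if synset is not None:
        mappings[local] = synset
print(mappings)
```

Here 'Marikh' is mapped to the planet sense because 'planet' appears in that synset's hypernym chain, while 'Laut Kaspia' is mapped directly because only one candidate exists.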


Adding Entries to WordNet Bahasa (cont'd)

8,480 new mappings

3,725 new synsets

732 new Malay entries (i.e. used in both Malaysian and Indonesian)

2,109 new Malaysian entries

5,473 new Indonesian entries

Utilising Wikidata API

The Wikidata Project

Central storage for the structured data of Wikimedia projects, including Wikipedia, Wikivoyage, Wikisource, and others

All interlingual links will be moved to Wikidata eventually

Hence some articles in Wikipedia dumps are 'missing' these links

Wikidata HTTP API: http://www.wikidata.org/w/api.php


Example: Retrieving Multilingual Translations

http://www.wikidata.org/w/api.php?action=wbgetentities&sites=mswiki&titles=serangan%20jantung&normalize&format=xml&props=datatype|labels|descriptions|aliases&languages=ms|id|en

XML response (other formats are possible):

<?xml version="1.0"?>
<api success="1">
  <normalized>
    <n from="serangan jantung" to="Penginfarkan miokardium" />
  </normalized>
  <entities>
    <entity id="Q12152" type="item">
      <labels>
        <label language="ms" value="Penginfarkan miokardium" />
        <label language="id" value="Serangan jantung" />
        <label language="en" value="heart attack" />
      </labels>
      <descriptions>
        <description language="en" value="interruption of blood supply to a part of the heart" />
      </descriptions>
      <aliases>
        <alias language="en" value="acute myocardial infarction" />
        <alias language="en" value="AMI" />
        <alias language="en" value="MI" />
        <alias language="en" value="myocardial infarction" />
        <alias language="ms" value="Serangan jantung" />
      </aliases>
    </entity>
  </entities>
</api>
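A minimal sketch of building such a request and reading the labels and aliases out of a reply. It runs offline on an abbreviated copy of the response shown on the slides and makes no network call. The helper names `build_query` and `terms_by_language` are illustrative, and XML is requested so that the reply has the shape shown above. The live API's output may differ in detail from this 2014 example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

API = "http://www.wikidata.org/w/api.php"


def build_query(site, title, languages, fmt="xml"):
    """Compose a wbgetentities request for one article title on one wiki."""
    params = {
        "action": "wbgetentities",
        "sites": site,
        "titles": title,
        "normalize": "",
        "format": fmt,
        "props": "datatype|labels|descriptions|aliases",
        "languages": "|".join(languages),
    }
    return API + "?" + urlencode(params)


def terms_by_language(xml_text):
    """Map entity id -> language -> list of labels followed by aliases."""
    root = ET.fromstring(xml_text)
    result = {}
    for entity in root.iter("entity"):
        terms = {}
        for node in list(entity.iter("label")) + list(entity.iter("alias")):
            terms.setdefault(node.get("language"), []).append(node.get("value"))
        result[entity.get("id")] = terms
    return result


# Abbreviated copy of the response on the slides.
SAMPLE = """<api success="1"><entities><entity id="Q12152" type="item">
<labels>
  <label language="ms" value="Penginfarkan miokardium" />
  <label language="id" value="Serangan jantung" />
  <label language="en" value="heart attack" />
</labels>
<aliases>
  <alias language="en" value="myocardial infarction" />
  <alias language="ms" value="Serangan jantung" />
</aliases>
</entity></entities></api>"""

url = build_query("mswiki", "serangan jantung", ["ms", "id", "en"])
terms = terms_by_language(SAMPLE)
print(url)
print(terms["Q12152"])
```

Grouping labels and aliases per language gives, for each entity, candidate Malaysian and Indonesian lemmas together with the English terms that can be looked up in Princeton WordNet.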

Mappings to WordNet synsets done, but not yet checked and/or officially added

Cursory glance: lots of identical lexicalisation to English

Possible Next Steps

Extending Specific Hierarchies

Cultural-specific concepts, e.g. clothing items, food and dishes, etc.

E.g. Malay Wikipedia category 'Masakan Malaysia' (Malaysian dishes):
ayam golek, buah keras, acar timun, ikan bakar, mi rebus, apam, ...

Some items (e.g. 'Ayam penyet') are in neither the Indonesian nor the Malay Wikipedia, but are listed in the English Wikipedia

How can we tell if the title of a Wikipedia article is a foreign-language word?


Kamus Dewan

The main Malay monolingual dictionary in Malaysia

Published by Dewan Bahasa dan Pustaka

Currently annotating contents with TEI to give structure (as part of another project)

Allows for easier, more targeted searches

(Still in progress; expected completion Nov 2014)

Not open-source, but can be used for research by arrangement, e.g. for some quick additions to WordNet Bahasa


Lexical Items

Derived words as subentries of root word: very rich

Lots of MWEs, including peribahasa (idioms): usually fixed word order and little morphosyntactic process

No POS

Do something based on definition text?


Extending Specific Hierarchies

Search by definitions (start with the simple ones)

Definition = 'sj (masakan | makanan) ...' (A dish ...):
bamiyah, besengek, caca, dalca, gudeg, ...

Other possibilities: dances, articles of clothing, musical instruments, ...

(KD definition texts themselves cannot be used in WordNet Bahasa: copyright issues)

(Mine from Wikipedia? CC-BY-SA/GFDL)

'Flat' hierarchy for starters
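A minimal sketch of the definition search, assuming the dictionary is available as a headword-to-definition table. The two dish definitions below are placeholders written for this example, not text from Kamus Dewan, and 'sj' is assumed to abbreviate 'sejenis' ('a kind of'). The names `find_by_definition` and `flat_hierarchy` are illustrative.

```python
import re

# Pattern from the slide: definitions that begin "sj masakan ..." or "sj makanan ...".
DISH = re.compile(r"^sj\s+(masakan|makanan)\b")

# Placeholder entries: the dish definitions are NOT quoted from Kamus Dewan.
entries = {
    "gudeg": "sj masakan (placeholder definition)",
    "dalca": "sj makanan (placeholder definition)",
    "batang": "penjodoh bilangan bagi benda yang panjang-panjang",
}


def find_by_definition(entries, pattern):
    """Return the headwords whose definition matches the pattern, sorted."""
    return sorted(hw for hw, definition in entries.items() if pattern.search(definition))


def flat_hierarchy(hypernym, headwords):
    """Attach every headword directly under one hypernym (a 'flat' hierarchy)."""
    return {hypernym: list(headwords)}


dishes = find_by_definition(entries, DISH)
print(flat_hierarchy("dish", dishes))
```

Only the headwords are carried over into the hierarchy; the definition texts are used for matching and then discarded, in line with the copyright note above.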


Names of Flora & Fauna

KD contains a huge number of binomial names (Latin names) for flora & fauna

Match up Malay names with English translations via Latin names

Some might not yet be in Princeton WordNet

Many may not even have English equivalent names
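Matching via Latin names amounts to a join on the binomial name. The sketch below uses two small hand-made tables. Whether Kamus Dewan lists these particular species is an assumption, and the English table deliberately omits one species to show the unmatched case. The function names are illustrative.

```python
def normalise(latin):
    """Lower-case a binomial name and collapse its whitespace before comparing."""
    return " ".join(latin.lower().split())


def match_via_latin(malay_by_latin, english_by_latin):
    """Pair Malay and English names that share a Latin name.

    Returns (matched, unmatched): a Malay -> English dict, and the list of
    Malay names for which no English name was found.
    """
    english = {normalise(k): v for k, v in english_by_latin.items()}
    matched, unmatched = {}, []
    for latin, malay in malay_by_latin.items():
        hit = english.get(normalise(latin))
        if hit is None:
            unmatched.append(malay)
        else:
            matched[malay] = hit
    return matched, unmatched


# Toy tables for illustration only.
malay_by_latin = {
    "Panthera tigris": "harimau",
    "Garcinia mangostana": "manggis",
    "Durio zibethinus": "durian",
}
english_by_latin = {
    "panthera  tigris": "tiger",
    "Garcinia mangostana": "mangosteen",
}

matched, unmatched = match_via_latin(malay_by_latin, english_by_latin)
print(matched)
print(unmatched)
```

Names that end up in the unmatched list are the ones that may be missing from Princeton WordNet or may have no English equivalent at all.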


Penjodoh Bilangan (Classifiers)

KD indicates classifiers:

batang: penjodoh bilangan bagi benda yang panjang-panjang (classifier for longish things)

But how to create/project relations between the 'batang' synset and 'long things'? And exceptions?

biji: penjodoh bilangan (bagi benda kecil dll) (for small things), but 'sebiji meriam' (a cannon)


Named Entities

Collect/extract gazetteer lists of named entities in Malay

Existing gazetteer lists: location names, Wikipedia categories, ...

Rule-based: list of prefix/suffix

Corpus-based: news articles


Further Processing

More advanced processing (including discovering relations) may (should!) be possible in future

Study the definition text patterns in Kamus Dewan

Greater availability of Malay corpus


Thank You

qatlho'
Danke
谢谢
Grazie
Спасибо
ขอบคุณ
Merçi
Gracias
Obrigado
Ευχαριστώ
감사합니다
Terima kasih
Thank you
ありがとう
Tapadh leibh
Go raibh maith agaibh
Xin cảm ơn



  • How it all startedhellip
  • Utilising Interlingual Links in Wikipedia Articles
  • Utilising Wikidata API
  • Possible Next Steps

| How it all started

Aligning Bilingual Dictionary to Princeton WordNetUTMKUSM linguists Lim and Hussein (2006)

Kamus Inggeris-Melayu Dewan (KIMD)dot n small round spot titik (appearing in large numbers on dress leaf etc)bintik

KIMD senses (manually) aligned to WordNet 16 senseskimd (dot n 1 [small round spot (appearing in large numbers on dressleaf etc)] ⟨titik bintik⟩)wordnet (110025218 lsquodotrsquo n 1 [a very small circular shape] )

Malay WordNet synset(titik bintik [a very small circular shape] )

Lim and Tang | WordNet Bahasa HackathonWorkshop 4 25

| How it all started

Malay WordNet Prototype

Nouns12429 synsets

hypernymyhyponymyholonymymeronymy

part-ofmember-ofsubstance-of

Verbs5805 synsets

hypernymytroponymy

cause

entailment

Lim and Tang | WordNet Bahasa HackathonWorkshop 5 25

| How it all started

Screenshots

Lim and Tang | WordNet Bahasa HackathonWorkshop 6 25

| Utilising Interlingual Links in Wikipedia Articles

Topics

1 How it all started

2 Utilising Interlingual Links in Wikipedia Articles

3 Utilising Wikidata API

4 Possible Next Steps

Lim and Tang | WordNet Bahasa HackathonWorkshop 7 25

| Utilising Interlingual Links in Wikipedia Articles

Wikipedia Article Dumps

ltpagegtlttitlegtMarikhlttitlegtlttextgt

Infobox Planet[[enMars]][[esMarte (planeta)]]

lttextgtltpagegtltpagegt

lttitlegtLaut Kaspialttitlegtlttextgt

[[KategoriTasik di Eropah|Kaspia]][[KategoriTasik di Rusia|Kaspia]][[KategoriTasik di Asia|Kaspia]][[enCaspian Sea]][[esMar Caspio]]

lttextgtltpagegtLim and Tang | WordNet Bahasa HackathonWorkshop 8 25

| Utilising Interlingual Links in Wikipedia Articles

Categories and Multilingual Translations

[[KategoriTasik di Eropah|Kaspia]](Category European Lakes rarr Caspia)

[[esMar Caspio]]Spanish (es) translation = Mar Caspio(Multilingual dictionary)

Spanish Wikipedia article about the Caspian Sea can be accessed athttpeswikipediaorgwikiMarCaspio(Bilingualmultilingual comparable corpora)

Lim and Tang | WordNet Bahasa HackathonWorkshop 9 25

| Utilising Interlingual Links in Wikipedia Articles

Adding Entries to WordNet Bahasa

1 For title for each Indonesian and Malaysian Wikipedia article look upits corresponding English title in Princeton WordNet

2 If only one synset is found map the IndonesianMalaysian title to it

3 If multiple synsets are found compare the hypernyms chain of eachsynset to the semantic type and categories of the Wikipedia articleThe first synset whose hypernym chain contains the semantic type orone of the categories is chosen as the synset to be mapped to

Lim and Tang | WordNet Bahasa HackathonWorkshop 10 25

| Utilising Interlingual Links in Wikipedia Articles

Adding Entries to WordNet Bahasa (contrsquod)

8480 new mappings

3725 new synsets

732 new Malay entries (ie used in both Malaysian and Indonesian)

2109 new Malaysian entries

5473 new Indonesian entries

Lim and Tang | WordNet Bahasa HackathonWorkshop 11 25

| Utilising Wikidata API

Topics

1 How it all started

2 Utilising Interlingual Links in Wikipedia Articles

3 Utilising Wikidata API

4 Possible Next Steps

Lim and Tang | WordNet Bahasa HackathonWorkshop 12 25

| Utilising Wikidata API

The Wikidata Project

Central storage for the structured data of Wikimedia projectsincluding Wikipedia Wikivoyage Wikisource and others

All interlingual links will be moved to Wikidata eventually

Hence some articles Wikipedia dumps are lsquomissingrsquo these links

Wikidata HTTP API httpwwwwikidataorgwapiphp

Lim and Tang | WordNet Bahasa HackathonWorkshop 13 25

| Utilising Wikidata API

Example Retrieving Multilingual Translations

httpwwwwikidataorgwapiphpaction=wbgetentitiesampsites=mswikiamptitles=

serangan20jantungampnormalizeampformat=jsonampprops=datatype|labels|

descriptions|aliasesamplanguages=ms|id|en

XML response (other formats are possible):

<?xml version="1.0"?>
<api success="1">
  <normalized>
    <n from="serangan jantung" to="Penginfarkan miokardium" />
  </normalized>
  <entities>
    <entity id="Q12152" type="item">
      <labels>
        <label language="ms" value="Penginfarkan miokardium" />
        <label language="id" value="Serangan jantung" />
        <label language="en" value="heart attack" />
      </labels>
      <descriptions>
        <description language="en" value="interruption of blood supply to a part of the heart" />
      </descriptions>
      <aliases>
        <alias language="en" value="acute myocardial infarction" />
        <alias language="en" value="AMI" />
        <alias language="en" value="MI" />
        <alias language="en" value="myocardial infarction" />
        <alias language="ms" value="Serangan jantung" />
      </aliases>
    </entity>
  </entities>
</api>

Mappings to WordNet synsets done, but not yet checked and/or officially added

Cursory glance: lots of lexicalisations identical to English
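
A minimal sketch of how such a request could be built and the XML reply read, using only the Python standard library. The helper names (`build_request_url`, `extract_terms`) and the abridged `SAMPLE_RESPONSE` are illustrative assumptions; the parameters mirror the example request above, and no network call is made here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

API_ENDPOINT = "http://www.wikidata.org/w/api.php"


def build_request_url(title, site="mswiki", languages=("ms", "id", "en")):
    """Build a wbgetentities request URL for one article title."""
    params = {
        "action": "wbgetentities",
        "sites": site,
        "titles": title,
        "normalize": "",  # flag parameter: being present is what enables it
        "format": "xml",
        "props": "datatype|labels|descriptions|aliases",
        "languages": "|".join(languages),
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_terms(xml_text):
    """Map entity id -> {language: [label, alias, ...]} from an XML reply."""
    root = ET.fromstring(xml_text)
    result = {}
    for entity in root.iter("entity"):
        terms = {}
        for tag in ("label", "alias"):  # labels first, then aliases
            for node in entity.iter(tag):
                terms.setdefault(node.get("language"), []).append(node.get("value"))
        result[entity.get("id")] = terms
    return result


# Abridged copy of the response shown above (XML declaration omitted).
SAMPLE_RESPONSE = """
<api success="1">
  <entities>
    <entity id="Q12152" type="item">
      <labels>
        <label language="ms" value="Penginfarkan miokardium" />
        <label language="id" value="Serangan jantung" />
        <label language="en" value="heart attack" />
      </labels>
      <aliases>
        <alias language="en" value="myocardial infarction" />
        <alias language="ms" value="Serangan jantung" />
      </aliases>
    </entity>
  </entities>
</api>
"""

url = build_request_url("serangan jantung")
terms = extract_terms(SAMPLE_RESPONSE)
print(url)
print(terms["Q12152"]["ms"])
```

Collecting labels and aliases per language in one list gives a set of candidate lexicalisations for each language that can then be mapped to a synset via the English terms.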

Possible Next Steps

Extending Specific Hierarchies

Culture-specific concepts, e.g. clothing items, food and dishes, etc.

E.g. Malay Wikipedia category ‘Masakan Malaysia’ (Malaysian dishes): ayam golek, buah keras, acar timun, ikan bakar, mi rebus, apam

Some items (e.g. ‘Ayam penyet’) aren’t in the Indonesian or Malay Wikipedia, but are listed in the English Wikipedia

How can we tell if the title of a Wikipedia article is a foreign-language word?

Kamus Dewan

The main Malay monolingual dictionary in Malaysia

Published by Dewan Bahasa dan Pustaka

Currently annotating contents with TEI to give structure (as part of another project)

Allows for easier, more targeted searches

(Still in progress - expected completion Nov 2014)

Not open-source, but can be used for research by arrangement

E.g. can be used for some quick additions to WordNet Bahasa

Lexical Items

Derived words as subentries of root word - very rich

Lots of MWEs, including peribahasa (idioms); usually fixed word order and little morphosyntactic process

No POS

Do something based on definition text

Extending Specific Hierarchies

Search by definitions (start with the simple ones)

Definition = ‘sj (masakan | makanan) …’ (‘A dish …’): bamiyah, besengek, caca, dalca, gudeg

Other possibilities: dances, articles of clothing, musical instruments

(KD definition texts themselves cannot be used in WordNet Bahasa - copyright issues)

(Mine from Wikipedia: CC-BY-SA/GFDL)

‘Flat’ hierarchy for starters
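
Searching by definition pattern could be sketched as below. The entries in `TOY_ENTRIES` are placeholders rather than actual Kamus Dewan text (apart from the ‘batang’ gloss quoted elsewhere in these slides), and the regular expression assumes that the definitions of interest begin with ‘sj’ (taken here to abbreviate sejenis, ‘a kind of’) followed by masakan or makanan.

```python
import re

# Placeholder entries (headword -> definition); not real Kamus Dewan text.
TOY_ENTRIES = {
    "gudeg": "sj masakan drpd nangka muda",
    "dalca": "sj makanan berkuah",
    "batang": "penjodoh bilangan bagi benda yang panjang-panjang",
    "gamelan": "sj alat muzik",
}

# 'sj' followed by either 'masakan' or 'makanan' at the start of the definition.
DISH_PATTERN = re.compile(r"^\s*sj\s+(masakan|makanan)\b", re.IGNORECASE)


def find_by_definition(entries, pattern):
    """Return the headwords whose definition text matches `pattern`, sorted."""
    return sorted(word for word, gloss in entries.items() if pattern.search(gloss))


dishes = find_by_definition(TOY_ENTRIES, DISH_PATTERN)

# 'Flat' hierarchy: every match becomes a direct hyponym of one parent concept.
flat_hierarchy = {"dish": dishes}
print(flat_hierarchy)  # {'dish': ['dalca', 'gudeg']}
```

Only the headwords are kept; the definition text is used for matching and then discarded, in line with the copyright restriction noted above.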

Names of Flora & Fauna

KD contains a huge number of binomial names (Latin names) for flora & fauna

Match up Malay names with English translations via Latin names

Some might not yet be in Princeton WordNet

Many may not even have English equivalent names
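
Matching via Latin names amounts to a join on the binomial name. The sketch below uses two tiny hand-made tables; `MALAY_TO_LATIN` and `LATIN_TO_ENGLISH` are hypothetical and are not extracted from Kamus Dewan or Princeton WordNet. Names are normalised before comparison on the assumption that capitalisation and spacing may differ between sources.

```python
# Hypothetical toy tables for illustration only.
MALAY_TO_LATIN = {
    "harimau": "Panthera tigris",
    "durian": "Durio zibethinus",
    "pokok getah": "Hevea  brasiliensis",  # stray double space on purpose
    "tenuk": "Tapirus indicus",
}
LATIN_TO_ENGLISH = {
    "panthera tigris": ["tiger"],
    "durio zibethinus": ["durian"],
    "hevea brasiliensis": ["rubber tree"],
}


def normalise(binomial):
    """Lower-case a Latin name and collapse runs of whitespace."""
    return " ".join(binomial.lower().split())


def align_by_latin_name(malay_to_latin, latin_to_english):
    """Join Malay names to English names through their shared Latin name."""
    matched, unmatched = {}, []
    for malay, latin in malay_to_latin.items():
        english = latin_to_english.get(normalise(latin))
        if english:
            matched[malay] = english
        else:
            unmatched.append(malay)  # no English entry found for this species
    return matched, unmatched


matched, unmatched = align_by_latin_name(MALAY_TO_LATIN, LATIN_TO_ENGLISH)
print(matched)
print(unmatched)  # ['tenuk']
```

The `unmatched` list corresponds to the last two points on the slide: species that are missing from the English side, or that have no English name at all, and so need manual attention.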

Penjodoh Bilangan (Classifiers)

KD indicates classifiers

batang: penjodoh bilangan bagi benda yang panjang-panjang (classifier for longish things)

But how to create/project relations between the ‘batang’ synset and ‘long things’?

And exceptions?

biji: penjodoh bilangan (bagi benda kecil dll.) (for small things etc.), but ‘sebiji meriam’ (a cannon)

Named Entities

Collect/extract gazetteer lists of named entities in Malay

Existing gazetteer lists: location names, Wikipedia categories

Rule-based: list of prefixes/suffixes

Corpus-based: news articles
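
The rule-based option could start from something as simple as the sketch below. The prefix list is an assumption chosen for illustration (words that commonly head Malay place names, such as Kampung, Bukit, Sungai), not a list taken from the slides, and a usable gazetteer would need far more rules and manual checking.

```python
import re

# Assumed, illustrative list of words that often head Malay location names.
LOCATION_PREFIXES = ("Kampung", "Bukit", "Sungai", "Kuala", "Pulau", "Tanjung")

# A prefix followed by one or more capitalised words, e.g. "Kuala Lumpur".
LOCATION_PATTERN = re.compile(
    r"\b(?:%s)(?:\s+[A-Z][a-z]+)+" % "|".join(LOCATION_PREFIXES)
)


def extract_locations(text):
    """Return candidate location names found in `text`, in order, without repeats."""
    seen = []
    for match in LOCATION_PATTERN.finditer(text):
        name = match.group(0)
        if name not in seen:
            seen.append(name)
    return seen


sample = "Banjir melanda Kuala Lumpur dan Kampung Baru, berhampiran Sungai Klang."
locations = extract_locations(sample)
print(locations)  # ['Kuala Lumpur', 'Kampung Baru', 'Sungai Klang']
```

Run over news articles, the same pattern would also serve the corpus-based route, with the extracted candidates checked against existing gazetteer lists.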

Further Processing

More advanced processing (including discovering relations) may (should) be possible in future

Study the definition text patterns in Kamus Dewan

Greater availability of Malay corpora

Thank You

qatlho’

Danke

谢谢

Grazie

Спасибо

ขอบคุณ

Merci

Gracias

Obrigado

Ευχαριστώ

감사합니다

Terima kasih

Thank you

ありがとう

Tapadh leibh

Go raibh maith agaibh

Xin cảm ơn

  • How it all startedhellip
  • Utilising Interlingual Links in Wikipedia Articles
  • Utilising Wikidata API
  • Possible Next Steps

| Utilising Interlingual Links in Wikipedia Articles

Wikipedia Article Dumps

ltpagegtlttitlegtMarikhlttitlegtlttextgt

Infobox Planet[[enMars]][[esMarte (planeta)]]

lttextgtltpagegtltpagegt

lttitlegtLaut Kaspialttitlegtlttextgt

[[KategoriTasik di Eropah|Kaspia]][[KategoriTasik di Rusia|Kaspia]][[KategoriTasik di Asia|Kaspia]][[enCaspian Sea]][[esMar Caspio]]

lttextgtltpagegtLim and Tang | WordNet Bahasa HackathonWorkshop 8 25

| Utilising Interlingual Links in Wikipedia Articles

Categories and Multilingual Translations

[[KategoriTasik di Eropah|Kaspia]](Category European Lakes rarr Caspia)

[[esMar Caspio]]Spanish (es) translation = Mar Caspio(Multilingual dictionary)

Spanish Wikipedia article about the Caspian Sea can be accessed athttpeswikipediaorgwikiMarCaspio(Bilingualmultilingual comparable corpora)

Lim and Tang | WordNet Bahasa HackathonWorkshop 9 25

| Utilising Interlingual Links in Wikipedia Articles

Adding Entries to WordNet Bahasa

1 For title for each Indonesian and Malaysian Wikipedia article look upits corresponding English title in Princeton WordNet

2 If only one synset is found map the IndonesianMalaysian title to it

3 If multiple synsets are found compare the hypernyms chain of eachsynset to the semantic type and categories of the Wikipedia articleThe first synset whose hypernym chain contains the semantic type orone of the categories is chosen as the synset to be mapped to

Lim and Tang | WordNet Bahasa HackathonWorkshop 10 25

| Utilising Interlingual Links in Wikipedia Articles

Adding Entries to WordNet Bahasa (contrsquod)

8480 new mappings

3725 new synsets

732 new Malay entries (ie used in both Malaysian and Indonesian)

2109 new Malaysian entries

5473 new Indonesian entries

Lim and Tang | WordNet Bahasa HackathonWorkshop 11 25

| Utilising Wikidata API

Topics

1 How it all started

2 Utilising Interlingual Links in Wikipedia Articles

3 Utilising Wikidata API

4 Possible Next Steps

Lim and Tang | WordNet Bahasa HackathonWorkshop 12 25

| Utilising Wikidata API

The Wikidata Project

Central storage for the structured data of Wikimedia projectsincluding Wikipedia Wikivoyage Wikisource and others

All interlingual links will be moved to Wikidata eventually

Hence some articles Wikipedia dumps are lsquomissingrsquo these links

Wikidata HTTP API httpwwwwikidataorgwapiphp

Lim and Tang | WordNet Bahasa HackathonWorkshop 13 25

| Utilising Wikidata API

Example Retrieving Multilingual Translations

httpwwwwikidataorgwapiphpaction=wbgetentitiesampsites=mswikiamptitles=

serangan20jantungampnormalizeampformat=jsonampprops=datatype|labels|

descriptions|aliasesamplanguages=ms|id|en

XML response (Other formats are possible)

ltxml version=10gtltapi success=1gt

ltnormalizedgtltn from=serangan jantung to=Penginfarkan miokardium gt

ltnormalizedgtltentitiesgt

ltentity id=Q12152 type=itemgtltlabelsgt

ltlabel language=ms value=Penginfarkan miokardium gtltlabel language=id value=Serangan jantung gtltlabel language=en value=heart attack gt

ltlabelsgtltdescriptionsgt

Lim and Tang | WordNet Bahasa HackathonWorkshop 14 25

| Utilising Wikidata API

Example Retrieving Multilingual Translations (contrsquod)

ltdescription language=en value=interruption of bloodsupply to a part of the heart gt

ltdescriptionsgtltaliasesgt

ltalias language=en value=acute myocardial infarction gtltalias language=en value=AMI gtltalias language=en value=MI gtltalias language=en value=myocardial infarction gtltalias language=ms value=Serangan jantung gt

ltaliasesgtltentitygt

ltentitiesgtltapigt

Mappings to WordNet synsets done but not yet checked andoroicially added

Cursory glance lots of identical lexicalisation to English

Lim and Tang | WordNet Bahasa HackathonWorkshop 15 25

| Possible Next Steps

Extending Specific Hierarchies

Culture-specific concepts, e.g. clothing items, food and dishes, etc. E.g. Malay Wikipedia category ‘Masakan Malaysia’ (Malaysian dishes):

ayam golek, buah keras, acar timun, ikan bakar, mi rebus, apam

Some items (e.g. ‘Ayam penyet’) aren’t in the Indonesian or Malay Wikipedia, but are listed in English Wikipedia

How can we tell if the title of a Wikipedia article is a foreign-language word?

Kamus Dewan

The main Malay monolingual dictionary in Malaysia

Published by Dewan Bahasa dan Pustaka

Currently annotating contents with TEI to give structure (as part of another project)

Allows for easier, more targeted searches

(Still in progress – expected completion Nov 2014)

Not open-source, but can be used for research by arrangement, e.g. can be used for some quick additions to WordNet Bahasa

Lexical Items

Derived words as subentries of root word – very rich

Lots of MWEs, including peribahasa (idioms): usually fixed word order and little morphosyntactic process

No POS

Do something based on definition text

Extending Specific Hierarchies

Search by definitions (start with the simple ones)

Definition = ‘sj (masakan | makanan) ...’ (A dish ...): bamiyah, besengek, caca, dalca, gudeg

Other possibilities: dances, articles of clothing, musical instruments

(KD definition texts themselves cannot be used in WordNet Bahasa – copyright issues)

(Mine from Wikipedia, CC-BY-SA/GFDL)

‘Flat’ hierarchy for starters
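A definition search of this kind could be sketched in Python as below. Several things are assumptions made for illustration: ‘sj’ is read as an abbreviation of ‘sejenis’ (‘a kind of’), the function name `find_by_definition` is made up, and the definition strings for the dishes are invented placeholders rather than Kamus Dewan text. Only the headwords and the ‘batang’ definition come from the slides.

```python
import re

# Accept a definition that opens with "sj masakan" or "sj makanan".
DISH_RE = re.compile(r"^sj\s+(masakan|makanan)\b")

# Placeholder entries: headword -> definition text (dish definitions invented).
entries = {
    "bamiyah": "sj masakan ...",
    "gudeg": "sj makanan ...",
    "batang": "penjodoh bilangan bagi benda yang panjang-panjang",
}


def find_by_definition(entries, pattern):
    """Return the headwords whose definition matches the pattern, sorted."""
    return sorted(head for head, text in entries.items() if pattern.search(text))


dishes = find_by_definition(entries, DISH_RE)

# 'Flat' hierarchy: every hit becomes a direct hyponym of one parent concept.
flat_hierarchy = {"dish": dishes}
print(flat_hierarchy)
```

Only the headwords are carried into the result, which keeps the copyrighted definition text out of the derived data.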

Names of Flora & Fauna

KD contains a huge number of binomial names (Latin names) for flora & fauna

Match up Malay names with English translations via Latin names

Some might not yet be in Princeton WordNet

Many may not even have English equivalent names
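Matching via Latin names amounts to a join on the binomial name. The sketch below assumes two lookup tables that do not exist in this form: `malay_to_latin` stands in for names extracted from KD and `latin_to_english` for an English resource indexed by binomial name. The three sample species are illustrative and were not taken from either source.

```python
# Illustrative data only: neither table comes from Kamus Dewan or WordNet.
malay_to_latin = {
    "harimau": "Panthera tigris",
    "bunga raya": "Hibiscus  rosa-sinensis",  # stray double space on purpose
    "petai": "Parkia speciosa",
}
latin_to_english = {
    "panthera tigris": ["tiger"],
    "hibiscus rosa-sinensis": ["Chinese hibiscus"],
}


def normalise(latin_name):
    """Lower-case and collapse whitespace so spelling variants still join."""
    return " ".join(latin_name.lower().split())


def align(malay_to_latin, latin_to_english):
    """Join Malay names to English names through the shared Latin name."""
    matched, unmatched = {}, []
    for malay, latin in malay_to_latin.items():
        english = latin_to_english.get(normalise(latin))
        if english:
            matched[malay] = english
        else:
            unmatched.append(malay)  # candidate for a new synset
    return matched, unmatched


matched, unmatched = align(malay_to_latin, latin_to_english)
print(matched)
print(unmatched)
```

The unmatched list covers the last two points on the slide: names whose species is missing from the English side are kept for manual review instead of being dropped.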

Penjodoh Bilangan (Classifiers)

KD indicates classifiers

batang: penjodoh bilangan bagi benda yang panjang-panjang (classifier for longish things)

But how to create/project relations between the ‘batang’ synset and ‘long things’? And exceptions?

biji: penjodoh bilangan (bagi benda kecil dll) (for small things), but ‘sebiji meriam’ (a cannon)

Named Entities

Collect/extract gazetteer lists of named entities in Malay

Existing gazetteer lists: location names, Wikipedia categories

Rule-based: list of prefix/suffix

Corpus-based: news articles

Further Processing

More advanced processing (including discovering relations) may (should) be possible in future

Study the definition text patterns in Kamus Dewan

Greater availability of Malay corpus

Thank You

qatlho’, Danke, 谢谢, Grazie, Спасибо, ขอบคุณ, Merci, Gracias, Obrigado, Ευχαριστώ, 감사합니다, Terima kasih, Thank you, ありがとう, Tapadh leibh, Go raibh maith agaibh, Xin cảm ơn

| Utilising Interlingual Links in Wikipedia Articles

Categories and Multilingual Translations

[[Kategori:Tasik di Eropah|Kaspia]] (Category: European Lakes → Caspia)

[[es:Mar Caspio]] Spanish (es) translation = Mar Caspio (multilingual dictionary)

Spanish Wikipedia article about the Caspian Sea can be accessed at http://es.wikipedia.org/wiki/Mar_Caspio (bilingual/multilingual comparable corpora)
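Both kinds of link can be pulled out of raw wikitext with regular expressions. The following is a minimal Python sketch, not the authors' extraction code: the function names `extract_links` and `article_url` and the sample wikitext are assumptions made for illustration, and the category prefix is hard-coded to the Malay ‘Kategori:’.

```python
import re

# Category links look like [[Kategori:Name|sort key]]; the sort key is optional.
CATEGORY_RE = re.compile(r"\[\[Kategori:([^\]|]+)(?:\|([^\]]*))?\]\]")
# Interlanguage links look like [[xx:Title]], where xx is a language code.
INTERLANG_RE = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)*):([^\]|]+)\]\]")


def extract_links(wikitext):
    """Return (categories, translations) found in a page's wikitext."""
    categories = [m.group(1).strip() for m in CATEGORY_RE.finditer(wikitext)]
    translations = {m.group(1): m.group(2).strip()
                    for m in INTERLANG_RE.finditer(wikitext)}
    return categories, translations


def article_url(lang, title):
    """URL of the article in another language edition (spaces become '_')."""
    return "http://{}.wikipedia.org/wiki/{}".format(lang, title.replace(" ", "_"))


# Hypothetical fragment of a Malay Wikipedia article on the Caspian Sea.
sample = "[[Kategori:Tasik di Eropah|Kaspia]]\n[[es:Mar Caspio]]\n[[en:Caspian Sea]]"
categories, translations = extract_links(sample)
print(categories)
print(translations)
print(article_url("es", translations["es"]))
```

The translations dictionary gives the multilingual dictionary entries directly, and the generated URLs point at the comparable articles in the other languages.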

Adding Entries to WordNet Bahasa

1 For the title of each Indonesian and Malaysian Wikipedia article, look up its corresponding English title in Princeton WordNet

2 If only one synset is found, map the Indonesian/Malaysian title to it

3 If multiple synsets are found, compare the hypernym chain of each synset to the semantic type and categories of the Wikipedia article. The first synset whose hypernym chain contains the semantic type or one of the categories is chosen as the synset to be mapped to
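The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a tiny hand-made dictionary stands in for Princeton WordNet, the synset ids in it are toy labels, and semantic types and categories are compared as lower-cased strings against the lemmas in the hypernym chain.

```python
# Toy stand-in for Princeton WordNet: synset id -> (lemmas, hypernym id or None).
TOY_WORDNET = {
    "entity.n.01": (["entity"], None),
    "animal.n.01": (["animal"], "entity.n.01"),
    "device.n.01": (["device"], "entity.n.01"),
    "mouse.n.01": (["mouse"], "animal.n.01"),
    "mouse.n.04": (["mouse", "computer mouse"], "device.n.01"),
    "tiger.n.02": (["tiger"], "animal.n.01"),
}


def synsets_for(lemma):
    """All synset ids whose lemma list contains the given English word."""
    return [sid for sid, (lemmas, _) in TOY_WORDNET.items() if lemma in lemmas]


def hypernym_chain(sid):
    """Lemmas of every ancestor of a synset, nearest ancestor first."""
    chain = []
    parent = TOY_WORDNET[sid][1]
    while parent is not None:
        chain.extend(TOY_WORDNET[parent][0])
        parent = TOY_WORDNET[parent][1]
    return chain


def choose_synset(english_title, semantic_type, categories):
    """Return the synset id to map to, or None if no decision can be made."""
    candidates = synsets_for(english_title.lower())  # step 1
    if len(candidates) == 1:                         # step 2
        return candidates[0]
    wanted = {c.lower() for c in categories}
    if semantic_type:
        wanted.add(semantic_type.lower())
    for sid in candidates:                           # step 3
        if wanted & set(hypernym_chain(sid)):
            return sid
    return None


print(choose_synset("Tiger", "animal", []))
print(choose_synset("Mouse", "device", ["hardware"]))
print(choose_synset("Unicorn", "animal", []))
```

Returning None for titles with no candidate, or with several candidates and no matching hypernym, leaves those cases for manual checking.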

Adding Entries to WordNet Bahasa (cont’d)

8480 new mappings

3725 new synsets

732 new Malay entries (i.e. used in both Malaysian and Indonesian)

2109 new Malaysian entries

5473 new Indonesian entries

| Utilising Wikidata API

The Wikidata Project

Central storage for the structured data of Wikimedia projectsincluding Wikipedia Wikivoyage Wikisource and others

All interlingual links will be moved to Wikidata eventually

Hence some articles Wikipedia dumps are lsquomissingrsquo these links

Wikidata HTTP API httpwwwwikidataorgwapiphp

Lim and Tang | WordNet Bahasa HackathonWorkshop 13 25

| Utilising Wikidata API

Example Retrieving Multilingual Translations

httpwwwwikidataorgwapiphpaction=wbgetentitiesampsites=mswikiamptitles=

serangan20jantungampnormalizeampformat=jsonampprops=datatype|labels|

descriptions|aliasesamplanguages=ms|id|en

XML response (Other formats are possible)

ltxml version=10gtltapi success=1gt

ltnormalizedgtltn from=serangan jantung to=Penginfarkan miokardium gt

ltnormalizedgtltentitiesgt

ltentity id=Q12152 type=itemgtltlabelsgt

ltlabel language=ms value=Penginfarkan miokardium gtltlabel language=id value=Serangan jantung gtltlabel language=en value=heart attack gt

ltlabelsgtltdescriptionsgt

Lim and Tang | WordNet Bahasa HackathonWorkshop 14 25

| Utilising Wikidata API

Example Retrieving Multilingual Translations (contrsquod)

ltdescription language=en value=interruption of bloodsupply to a part of the heart gt

ltdescriptionsgtltaliasesgt

ltalias language=en value=acute myocardial infarction gtltalias language=en value=AMI gtltalias language=en value=MI gtltalias language=en value=myocardial infarction gtltalias language=ms value=Serangan jantung gt

ltaliasesgtltentitygt

ltentitiesgtltapigt

Mappings to WordNet synsets done but not yet checked andoroicially added

Cursory glance lots of identical lexicalisation to English

Lim and Tang | WordNet Bahasa HackathonWorkshop 15 25

| Possible Next Steps

Topics

1 How it all started

2 Utilising Interlingual Links in Wikipedia Articles

3 Utilising Wikidata API

4 Possible Next Steps

Lim and Tang | WordNet Bahasa HackathonWorkshop 16 25

| Possible Next Steps

Extending Specific Hierarchies

Cultural-specific concepts eg clothing items food and dishes etc Eg Malay Wikipedia category lsquoMasakan Malaysiarsquo (Malaysian dishes)

ayam golek buah keras acar timun ikan bakar mi rebus apam

Some items (eg lsquoAyam penyetrsquo) arenrsquot in Indonesian nor MalayWikipedia but are listed in English Wikipedia

How can we tell if the title of an Wikipedia article is a foreign languageword

Lim and Tang | WordNet Bahasa HackathonWorkshop 17 25

| Possible Next Steps

Kamus Dewan

The main Malay monolingual dictionary in Malaysia

Published by Dewan Bahasa dan Pustaka

Currently annotating contents with TEI to give structure (as part ofanother project)

Allows for easier more targeted searches

(Still in progress ndash expected completion Nov 2014)

Not open-source but can be used for research by arrangementeg can be used for some quick additions to WordNet Bahasa


Lexical Items

Derived words as subentries of root word – very rich

Lots of MWEs, including peribahasa (idioms): usually fixed word order and little morphosyntactic processing

No POS

Do something based on definition text?


Extending Specific Hierarchies: Search by definitions (start with the simple ones)

Definition = 'sj (masakan | makanan) ...' (A dish ...): bamiyah, besengek, caca, dalca, gudeg

Other possibilities: dances, articles of clothing, musical instruments

(KD definition texts themselves cannot be used in WordNet Bahasa – copyright issues)

(Mine from Wikipedia: CC-BY-SA/GFDL)

'Flat' hierarchy for starters
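
The definition search can be sketched as a regular-expression filter over (headword, definition) pairs. Everything here is an assumption for illustration: `sj` is taken to abbreviate 'sejenis' (a kind of), the function name `find_hyponyms` is hypothetical, and the sample definitions are invented placeholders rather than text quoted from Kamus Dewan (the last one is taken from the classifier slide).

```python
import re

# Definitions that start with "sj masakan" or "sj makanan".
PATTERN = re.compile(r"^sj\s+(masakan|makanan)\b")


def find_hyponyms(entries):
    """Return (headword, genus word) pairs for definitions matching the pattern."""
    hits = []
    for headword, definition in entries:
        match = PATTERN.match(definition.strip().lower())
        if match:
            hits.append((headword, match.group(1)))
    return hits


# Invented placeholder definitions, NOT quoted from the dictionary.
ENTRIES = [
    ("gudeg", "sj masakan drpd nangka muda"),
    ("dalca", "sj makanan berkuah"),
    ("batang", "penjodoh bilangan bagi benda yang panjang-panjang"),
]

print(find_hyponyms(ENTRIES))
```

Each hit yields a flat 'headword is a kind of genus word' pair, matching the 'flat hierarchy for starters' idea. Only the headword and the relation are kept, not the definition text.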


Names of Flora & Fauna

KD contains a huge number of binomial names (Latin names) for flora & fauna

Match up Malay names with English translations via Latin names

Some might not yet be in Princeton WordNet

Many may not even have English equivalent names
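
Matching via Latin names amounts to a join on the binomial name. The sketch below uses a handful of well-known species as toy data. The function name `align_by_latin_name` and both tables are assumptions for illustration: in practice the Malay side would come from the dictionary and the English side from a resource such as Princeton WordNet or Wikidata.

```python
def align_by_latin_name(malay_entries, english_by_latin):
    """Join Malay names to English names on a case-insensitive Latin name."""
    index = {latin.lower(): name for latin, name in english_by_latin.items()}
    aligned, unmatched = [], []
    for malay_name, latin in malay_entries:
        english = index.get(latin.lower())
        if english is None:
            unmatched.append((malay_name, latin))
        else:
            aligned.append((malay_name, latin, english))
    return aligned, unmatched


# Toy data: (Malay name, Latin name) pairs and a small Latin-to-English table.
MALAY = [
    ("durian", "Durio zibethinus"),
    ("manggis", "Garcinia mangostana"),
    ("rambutan", "Nephelium lappaceum"),
    ("petai", "Parkia speciosa"),
]
ENGLISH = {
    "Durio zibethinus": "durian",
    "Garcinia mangostana": "mangosteen",
    "Nephelium lappaceum": "rambutan",
}

aligned, unmatched = align_by_latin_name(MALAY, ENGLISH)
```

Entries left in `unmatched` (here 'petai', which is simply absent from the toy table) are the candidates that may be missing from the English resource or may lack an English name altogether.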


Penjodoh Bilangan (Classifiers)

KD indicates classifiers

batang: penjodoh bilangan bagi benda yang panjang-panjang (classifier for longish things)

But how to create/project relations between the 'batang' synset and 'long things'?

And exceptions:

biji: penjodoh bilangan (bagi benda kecil dll) (for small things), but 'sebiji meriam' (a cannon)


Named Entities

Collect/extract gazetteer lists of named entities in Malay

Existing gazetteer lists: location names, Wikipedia categories

Rule-based: list of prefix/suffix

Corpus-based: news articles
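
The rule-based option could start from a short list of place-name prefixes. The sketch below is only an illustration of that idea: the prefix list is a small hand-picked sample of common Malay place-name words, and the function name `extract_location_candidates` is hypothetical.

```python
import re

# Small hand-picked sample of place-name prefixes (an assumption, not a full list).
LOCATION_PREFIXES = ("Kampung", "Bukit", "Sungai", "Pulau", "Taman", "Jalan")

# A prefix followed by one or more capitalised words.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(LOCATION_PREFIXES) + r")(?:\s+[A-Z][\w'-]*)+"
)


def extract_location_candidates(text):
    """Return candidate location names in order of first appearance."""
    seen = []
    for match in _PATTERN.finditer(text):
        name = match.group(0)
        if name not in seen:
            seen.append(name)
    return seen


text = "Mereka tinggal di Kampung Baru dekat Sungai Pinang, bukan di Pulau Pinang."
print(extract_location_candidates(text))
```

The output is a list of candidates only. Running the same pattern over news articles would combine the rule-based and corpus-based options, with manual filtering afterwards.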


Further Processing

More advanced processing (including discovering relations) may (should) be possible in future

Study the definition text patterns in Kamus Dewan

Greater availability of Malay corpora


Thank You

qatlho'
Danke
谢谢
Grazie
Спасибо
ขอบคุณ
Merci
Gracias
Obrigado
Ευχαριστώ
감사합니다
Terima kasih
Thank you
ありがとう
Tapadh leibh
Go raibh maith agaibh
Xin cảm ơn












