Some thoughts on how to address commercially unprofitable languages and language pairs

Vincent BERMENT (*), Saturday August 23rd 2014

(*) Lecturer at INaLCO, Paris (Lao Section); Associate Researcher at GETALP, Grenoble; Director of Taranis, Paris

Taranis Software

Master plan

My motivations: to contribute to the reduction of the digital divide

1. Some experiments highlighting the gains achieved with appropriate methods

• Integrate with generic environments

• Reuse software code whenever possible

• Recycle existing dictionaries

2. Looking for an efficient Machine Translation solution

• Looking for a framework able to address any language pair

• Ariane-G5 and Heloise

3. Current activities around Heloise

• Reducing the development costs

• Cooperative development projects with skills transfer

Perspectives

My motivation: contribute to reducing the digital divide

• Since 1996: Work on Lao NLP, starting at a time when it was difficult even to type Lao text

• Microsoft Word add-in: LaoWord (intuitive input, “legacy font” and Unicode interworking, sorting, smart syllabic selections, dictionaries, phonetic transcriptions, miscellaneous functions)

• Unicode Virtual Keyboard: LaoUniKey (intuitive input; system hook)

• Online Lao-French and French-Lao dictionary: LaoDict and word for word translator: LaoTrans

• French-Lao Reinhorn-Berment dictionary: 1729 pages, ~40,000 entries (~60,000 word senses), ~15,000 examples, http://www.you-feng.com/dictionnaire_fran_lao_reinhorn.php

• Lao Software Web site: http://www.laosoftware.com

• Since 2001: Work on NLP for under-resourced languages, especially in the GMS area

• PhD thesis “Methods to computerize under-resourced languages and groups of languages” (2004)

• Extension of LaoWord to other languages & writing systems (Khmer, Tham, Bengali): GMSWord

• Word segmenter: Motor (Burmese, Khmer, Lao, Thai, Tibetan)

• GMS Software Web site: http://www.gmsware.org

→ LaoWord, LaoUniKey and GMSWord are Free and Open Source Software (FOSS)

My motivation: contribute to reducing the digital divide (continued)

• Since 2004: Work on developing open source Machine Translation systems & teams

• Integration and strengthening of a “stream” launched by Bernard Vauquois: work on under-resourced language pairs, often linguistically distant, with the aim of creating autonomous qualified teams (which he did in Malaysia and Thailand, at least)

• Different from other efforts such as the work done by Carnegie Mellon University, or those funded by DARPA (USA) or ODA (Japan)

• Similar to some projects such as PAN-Asia, funded by IDRC and driven by Sarmad Hussain (Pakistan)

• Development of an MT framework able to address under-resourced languages: Heloise

• Based on tools and methods that have proved their ability to address pairs of languages with big differences: GÉTA’s Ariane-G5 and methodology

• Open source reusable lingware tools and components already exist: English, French, Russian (prototypes), Malay, Thai, Arabic, Portuguese… Recent development of an advanced large-coverage morphological analyzer for German (J.-P. Guilbaud)

• Lingwarium Web site: http://www.lingwarium.org

• Taranis Web site (under construction): http://www.taranis-software.com

1.1 – Integrate with generic environments

1 – Some methods to reduce costs

Integrate with generic environments: a Word add-in for Lao

Integrate with generic environments: time saved for LaoWord

• LaoWord: WLL (Word add-in) embedded in the MS Word environment

→ Lao functions: intuitive input, “legacy font” text import, sorting, smart syllabic selections, dictionaries, phonetic transcriptions, miscellaneous functions

• ~2,500 working hours “only”, against ~25,000 (?) for developing a complete but simple word processor → GAIN ~90 % (and more with maintenance!)

[Diagram (IGS/SS): the generic part of the software (e.g. MS Word) and the specific part of the software (e.g. LaoWord).]

1.2 – Reuse software code whenever possible

1 – Some methods to reduce costs

Many specific writing systems remain under-computerized

• Yuon

• Tham Lao

• Tai Lü and Khün (Tham Lanna)

• Example of “ book” (Bailan)

The Nidana Sutta written in Tham Lao script

Gatha (magical formula) written in Tham Lao script

We defined a Tham Lao keyboard (QWERTY mapping)

The Tham Lao writing system is used in Laos for Buddhist texts

No Unicode encoding, no font, no existing keyboard layout standard…

→ Define an interim solution so that monks can type in Tham Lao

We defined a Tham Lao input sequence

• The Tham script is used in Laos for writing Buddhist texts (the word “Tham” comes from “Dhamma”), in either Lao or Pali

• Characters can be written in line or in a modified form (subscript…)

• In the word “[Tham glyphs]” (indri, from indriya: sense, faculty), the [glyph] (r) is transformed into [glyph] and the [glyph] (d) is transformed into [glyph] by prefixing the character with “-” (so the sequence is: <-rn-di)

→ Specific software derived from LaoWord: GMSWord (MS Word add-in for the languages of the Greater Mekong Region: Khmer, Lao, Tham…)

Reuse software code: adapting LaoWord to the Tham script

The LaoWord text input

[Diagram: LaoWord text input. Elements shown: DLL + hook (input), DLL + hook (output), an abstract character set for Lao, and one table per font: LAO FRANCE, FONT LAO 1, FONT LAO 2, FONT LAO 3, DUANG JAN.]

• Generic part: 150 hours

• Font tables: 100 hours (1 font) + ≈ 3 hours per font
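The idea of one abstract character set with one table per font can be sketched as follows. This is a minimal illustration, assuming each font table is a plain mapping from font-specific codes to abstract character names; the table contents and the names `FONT_TABLES`, `convert` and `add_font` are hypothetical and are not taken from LaoWord.

```python
# Hypothetical font tables: font-specific code -> abstract character name.
# The real LaoWord tables are not reproduced in the slides.
FONT_TABLES = {
    "font_a": {"d": "KO", "0": "KHO_SUNG", "k": "SARA_AA"},
    "font_b": {"A": "KO", "B": "KHO_SUNG", "C": "SARA_AA"},
}


def to_abstract(text, font):
    """Decode font-specific codes into abstract characters."""
    table = FONT_TABLES[font]
    return [table[ch] for ch in text]


def from_abstract(chars, font):
    """Encode abstract characters with the codes of the given font."""
    reverse = {abstract: code for code, abstract in FONT_TABLES[font].items()}
    return "".join(reverse[a] for a in chars)


def convert(text, source_font, target_font):
    """Convert between two fonts by going through the abstract set."""
    return from_abstract(to_abstract(text, source_font), target_font)


def add_font(name, table):
    """Supporting a new font only requires one more table."""
    FONT_TABLES[name] = dict(table)


print(convert("dk0", "font_a", "font_b"))  # ACB
```

Under this assumption the generic code is written once and each further font costs only the time needed to fill in its table, which matches the shape of the hour counts above.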

Adaptation of the LaoWord text input to the Tham script

[Diagram: “GMSWord” text input for Tham (GMSWord also contains Khmer and Lao). Elements shown: DLL + hook (input), DLL + hook (output), an abstract character set for Tham, and the Tham script tables.]

• Generic part: 8 hours (reuse)

• Font table: 10 hours (1 font)

• This excludes the time spent defining the input sequence and the keyboards and developing a font for Tham, as well as the time needed to build GMSWord itself (listbox…).

Time saving

Cost the first time (Lao) | Subsequent cost (Tham) | Gain
250 h | 18 h | 93 %

NOTA:

• Another adaptation of LaoWord was done in 2002 for Bengali (directly in LaoWord → BanglaWord). It took about the same time.

• Commercial interest for a very low marginal effort: 3-5 million speakers in Laos → 200-300 million speakers in Bangladesh

1.2b (2nd example) – Reuse software code whenever possible

1 – Some methods to reduce costs

Segmenting texts without separators

• Example of unsegmented text in Khmer, Lao and Thai

• Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing.

• [Khmer version of the same passage; the Khmer text was lost in extraction]

• ຕອນຂ້ອຍມີອາຍຸພຽງ ໖ ປີ ຄັ້ງນຶ່ງຂ້ອຍໄດ້ເຫັນຮູບສວຍງາມໃນປື້ມກ່ຽວກັບປ່າດົງດິບທີ່ເອີ້ນວ່າ « ປະຫວັດຈິງ ». ມັນໄດ້ມີງູເຫຼືອມກຳລັງກືນສັດປ່າ. ນີ້ຄືຮູບລອກມາ.

• เมื่อตอนฉันอายุได้ ๖ ขวบ ฉันได้เห็นรูปภาพจับใจรูปหนึ่งในหนังสือเกี่ยวกับป่าดงดิบชื่อว่า "ประวัติชีวิตธรรมชาติ" รูปนั้นเป็นรูปงูเหลือมกำลังกลืนสัตว์ป่า นี่คือรูปลอกของภาพนั้น

The Little Prince

Antoine de Saint Exupéry

Segmenting texts without separators

• The same text after being segmented

• Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing.

• [Segmented Khmer version of the same passage; the Khmer text was lost in extraction]

• ຕອນ-ຂ້ອຍ-ມີ-ອາຍຸ-ພຽງ-໖-ປີ-ຄັ້ງນຶ່ງ-ຂ້ອຍ-ໄດ້-ເຫັນ-ຮູບ-ສວຍງາມ-ໃນ-ປື້ມ-ກ່ຽວກັບ-ປ່າດົງດິບ-ທີ່-ເອີ້ນວ່າ-«-ປະຫວັດ-ຈິງ-».-ມັນ-ໄດ້-ມີ-ງູເຫຼືອມ-ກຳລັງ-ກືນ-ສັດປ່າ-.-ນີ້-ຄື-ຮູບ-ລອກ-ມາ-.

• เมื่อ-ตอน-ฉัน-อายุ-ได้-๖-ขวบ-ฉัน-ได้-เห็น-รูปภาพ-จับใจ-รูป-หนึ่ง-ใน-หนังสือ-เกี่ยวกับ-ป่าดงดิบ-ชื่อ-ว่า-"-ประวัติ-ชีวิต-ธรรมชาติ-"-รูป-นั้น-เป็น-รูป-งูเหลือม-กำลัง-กลืน-สัตว์ป่า-นี่-คือ-รูป-ลอก-ของ-ภาพ-นั้น

The Little Prince

Antoine de Saint Exupéry

Segmentation: “Motor”, an online generic segmenter

• Maximum-length matching algorithm

• List of words per language

• Some language-specific heuristics

• ~500 hours for the initial software

• 10 extra hours for adding a new language (except the word list)

→ GAIN = 98 %

http://www.taranis-software.com/Heloise/Segmentation
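Maximum-length matching can be sketched in a few lines. The code below is a generic greedy longest-match segmenter written for illustration, not Motor's actual code: the function name `segment`, the toy word lists and the single-character fallback for unknown text are assumptions, and the language-specific heuristics mentioned above are not modelled.

```python
def segment(text, lexicon):
    """Greedy longest-match (maximum-length matching) segmentation.

    At each position, take the longest word of the lexicon that starts
    there; if none matches, emit a single character and move on.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy word list (hypothetical); a real system uses one large list per language.
THAI_WORDS = {"ฉัน", "เห็น", "รูป", "รูปภาพ", "ใน", "หนังสือ"}
print("-".join(segment("ฉันเห็นรูปภาพในหนังสือ", THAI_WORDS)))

# The same function works for any language once a word list is supplied.
print(segment("thecatsat", {"the", "cat", "cats", "at", "sat"}))
```

The second call returns `['the', 'cats', 'at']` rather than "the cat sat": greedy matching prefers the longer word "cats", which is the kind of case where per-language heuristics are needed.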

1.3 – Recycle existing dictionaries

1 – Some methods to reduce costs

Recycling the French-Lao Reinhorn-Berment dictionary

• ~40,000 entries (~60,000 word-senses)

• ~15,000 examples

• Published in 2013

• At least 20,000 hours since 1970

http://www.you-feng.com/dictionnaire_fran_lao_reinhorn.php

• ~200 hours for cleaning, parsing + the Web site → GAIN = 99 %

• Currently being used for a Machine Translation project

Recycling the French-Lao Reinhorn-Berment dictionary

http://laosoftware.com/HeloiseTest/OutilProfs/

Summary of the gains achieved

Reused item (hours) | Specific development (hours) | Gain
Microsoft Word: 25,000 | LaoWord WLL: 2,500 | 90.00%
LaoWord WLL (text input part): 250 | ThamWord / BanglaWord: 18 | 92.80%
Motor (segmenter): 500 | Per added language: 10 | 98.00%
French-Lao dictionary (MS Word): 20,000 | Cleaning, parsing + Web site: 200 | 99.00%

→ Applying an appropriate method is essential
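The gains in this summary follow from one ratio: the share of the reused effort that did not have to be spent again. A minimal sketch, using only the hour counts quoted in the slides (the function name `gain` and the labels are illustrative):

```python
def gain(reused_hours, specific_hours):
    """Fraction of the reused effort saved by the specific development."""
    return 1 - specific_hours / reused_hours


# (label, hours for the reused item, hours for the specific development)
CASES = [
    ("Microsoft Word -> LaoWord WLL", 25_000, 2_500),
    ("LaoWord text input -> ThamWord / BanglaWord", 250, 18),
    ("Motor -> one added language", 500, 10),
    ("French-Lao dictionary -> cleaning, parsing, Web site", 20_000, 200),
]

for label, reused, specific in CASES:
    print(f"{label}: {gain(reused, specific):.1%}")
```

The 93 % quoted for Tham is the rounded form of the 92.8 % in the summary.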

2.1 – Looking for a framework able to address any pair of languages

(including economic concerns)

2 – MT for any language pair

Situation of the ≳ 6000 languages in the world (source: http://www.ethnologue.com)

Speakers | Languages | Category
More than 1G | 1 | Chinese
100M to 1G | 8 | Hypercentral (English) and supercentral languages
10M to 100M | ≈ 65 | Major central languages
1M to 10M | ≈ 250 | Secondary central languages
100,000 to 1,000,000 | ≈ 1000 | Peripheral languages I (precarious existence)
10,000 to 100,000 | ≈ 1600 | Peripheral languages II (existence immediately threatened)
Fewer than 10,000 | ≈ 3400 | Peripheral languages III (dying languages)

• N > 10M native speakers: ≈ 5 billion persons (70%), ≈ 74 languages (1%)

• 1M < N < 10M speakers: ≈ 1 billion persons (15%), ≈ 250 languages (4%)

• N < 1M speakers: ≈ 1 billion persons (15%), ≈ 6000 languages (95%)

Situation in Machine Translation (MT)

[Chart: translation quality (poor to excellent) against languages, ranked from many speakers and excellent existing linguistic resources to few speakers and few linguistic resources; ticks at 9, 74, 324 and ≳ 6000 languages. Example shown: the leaders of the 2 main approaches in MT, Systran (expert MT) and Google (empirical MT).]

Living languages having > 1M speakers (324 languages of more than 1 million speakers)

• ≈ 20% of the languages having > 1M speakers are covered: 80 languages at Google, among which ~70 have > 1M speakers, i.e. ≳ 1% of the ≈ 6000+ languages in the world

• ≈ 80% of the languages having > 1M speakers are not covered: ≳ 250 of the ≳ 320 languages having > 1M speakers, i.e. ≈ 4% of the world's languages

Living languages having < 1M speakers

• ≈ 0.1% of these languages (6 languages) are in Google Translate: Basque, Welsh, Irish, Icelandic, Maltese, Maori

• Some languages with several hundred thousand speakers can be profitable

How could Machine Translation address any language?

[Chart: translation quality (poor to excellent) against languages, ranked from many speakers and excellent existing linguistic resources to few speakers and few linguistic resources; ticks at 9, 74, 324 and ≳ 6000 languages. Labels: Systran, Google, “324 languages of more than 1 million speakers”; a blue area is highlighted.]

The survival of these languages depends on satisfying the need for MT in this blue area!

How many languages will be addressed by the private sector?

• Is it possible to provide fair MT services to the 6000 languages of the planet?

• Private companies have to take into account the Return On Investment (ROI)

→ Languages spoken by few people generally present a low ROI

• The cost of a system increases with the scarcity of the resources (parallel corpora, dictionaries…)

→ Languages spoken by few people generally have scarce resources

Double braking effect for the under-resourced languages:
→ Less profitable
→ More expensive
And finally poor-quality translations, with no guarantee of the sense

How many languages will be addressed by the private sector?

• Probably fewer than 350 languages if one assumes an ROI limit around 1 million speakers (324 such languages, 70 already done)

• Probably more than the 172 official languages (national or regional (*))

→ Private sector limit: ~200-300 / 6000 = 3-5 % of the world's languages

(*) : http://fr.wikipedia.org/wiki/Liste_des_langues_officielles

Growth in the number of languages at Google

[Chart: number of languages (vertical axis, 0 to 90) by quarter, from October 2007 to October 2013.]

• 10/2007-10/2009 (2 years): + 40 languages → a mean of 20 languages per year

• 10/2009-10/2011 (2 years): + 12 languages → a mean of 6 languages per year

• 10/2011-10/2013 (2 years): + 24 languages → a mean of 12 languages per year

• Overall average: ~10 languages per year

• And after?

• 324 languages of more than 1M speakers → 30 years at this rate! (10 languages/year)

• Christian Boitet claims that 600 languages will have MT by ≲ 2025 → +50 languages/year!
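The orders of magnitude above can be checked with a short calculation. This is a rough linear extrapolation, assuming 80 languages covered in 2014 (the Google figure quoted earlier in the talk) and constant yearly rates; the function names are illustrative.

```python
# Languages added per two-year period, as listed on the slide.
ADDED = {"2007-2009": 40, "2009-2011": 12, "2011-2013": 24}


def years_to_cover(languages, per_year):
    """Years needed to cover a number of languages at a constant rate."""
    return languages / per_year


def required_rate(target, current, years):
    """Languages to add per year to reach a target within a number of years."""
    return (target - current) / years


observed = sum(ADDED.values()) / 6           # mean over the 6 observed years
print(round(observed, 1))                    # 12.7
print(years_to_cover(324, 10))               # 32.4
print(round(required_rate(600, 80, 2025 - 2014), 1))  # 47.3
```

The raw mean over the three periods (about 12.7 per year) is a little above the rounded figure of ~10 per year used on the slide; either way, covering 324 languages takes decades at that pace, while reaching 600 languages by 2025 needs close to 50 new languages per year.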

Number of languages: Where is the limit?

• Cost of the linguistic resources for under-resourced languages (from Boitet 2008 (*))

• Empirical approach (Google…)

• Corpus: 200k to 800k translated pages → 100 to 400 man-years → 10 to 40 M€

• Expert approach (Systran…)

• Dictionaries: 50k to 500k entries → 15 to 150 man-years → 1.5 to 15 M€

• Grammars: 25 man-years → 2.5 M€

The expert approach is cheaper but still much too expensive for most languages

Would there be ways to dramatically reduce these costs?

Cost of a system without initial resources:
→ 10 M€ for a system with an empirical approach (mostly statistical)
→ 4 M€ for a system with an expert approach (mostly rule-based)

(*) : http://www-clips.imag.fr/geta/christian.boitet/pages_personnelles/zArticles_sur_la_TAO_pdf/TALN-08-ArchiTA.080218.v7-final.pdf
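The cost figures above are consistent with a flat rate of about 100 k€ per man-year (100 man-years for 10 M€). A minimal sketch under that assumption; the constant and function names are illustrative and do not come from Boitet (2008):

```python
COST_PER_MAN_YEAR_KEUR = 100  # implied by the slide: 100 man-years -> 10 M€


def empirical_cost_meur(corpus_man_years):
    """Cost in M€ of building the parallel corpus for an empirical system."""
    return corpus_man_years * COST_PER_MAN_YEAR_KEUR / 1000


def expert_cost_meur(dictionary_man_years, grammar_man_years=25):
    """Cost in M€ of dictionaries plus grammars for an expert system."""
    total = dictionary_man_years + grammar_man_years
    return total * COST_PER_MAN_YEAR_KEUR / 1000


print(empirical_cost_meur(100), empirical_cost_meur(400))  # 10.0 40.0
print(expert_cost_meur(15), expert_cost_meur(150))         # 4.0 17.5
```

The lower bounds reproduce the two headline figures: 10 M€ for the empirical approach and 4 M€ (1.5 M€ of dictionaries plus 2.5 M€ of grammars) for the expert approach.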

Promising approaches for under-resourced languages

• Empirical approaches (Google, Microsoft, ASEAN MT, Asia Online…) (*)

• They have proved able to deliver systems very quickly (though sometimes rough ones) when parallel texts are available

• Expert approaches (Systran, Apertium…) (*)

• They can provide quality solutions, are cheaper when large quantities of parallel texts are missing, and definitely present an intrinsic linguistic interest

• Possibility to use a pivot to reduce the cost when translating between many languages:

• Semantic (IF)

• Central language (double transfer)

• English (UNL, double transfer)

(*) : These names refer to existing solutions for Southeast Asian languages
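Why a pivot reduces cost when many languages are involved can be seen from a standard counting argument, which is not spelled out on the slide: direct translation needs one transfer per ordered pair of languages, whereas a pivot needs only one module into the pivot and one out of it per language. A minimal sketch (function names are illustrative):

```python
def direct_transfer_modules(n):
    """One transfer module per ordered pair of distinct languages."""
    return n * (n - 1)


def pivot_modules(n):
    """One module into the pivot and one out of it for each language."""
    return 2 * n


for n in (3, 10, 74, 324):
    print(n, direct_transfer_modules(n), pivot_modules(n))
```

For 10 languages this gives 90 direct transfers against 20 pivot modules. When the pivot is itself one of the languages (a central language or English), the count drops to 2 × (n − 1).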

Best practices for under-resourced languages

• Open solutions (Apertium “spirit”) allow the populations to develop their own MT services thus no longer depending on majors such as Google or Microsoft

• Collaborative / networking work can bring the interested persons together (diasporas, foreign languages schools, volunteers…)

• Projects involving many languages (Apertium, ASEAN MT) can provide significant synergies

• Derive MT systems from existing ones for a genetically or “gravitationally” close language: languages spoken by <1M persons are often close to one of the 324 languages spoken by >1M

• Reuse and pool lexical resources → FOSS lexical resources enriched collaboratively

• When using an expert approach, give preference to a solution in which analyses and generations are purely monolingual

MT Technology able to address under-resourced languages

• Objective = Make possible the production of quality MT systems:

• For any pair of languages, thus potentially with big differences between the languages

• To be developed in a reasonable time by any group of people willing to do it themselves

• Able to guarantee the sense of the translations

• The candidate solution should:

• Allow developers to work together over a network (online tools)

• Be well-proven and easy to use (documentation, open reusable systems, tools…)

• Is there an existing solution able to fulfill these constraints?

• Apertium is limited by its choice of methods and tools to “weak problems of translation”

• GÉTA’s Ariane-G5 and methodology fulfill all these requirements, except the online availability

• MT systems between rather different languages: deep level

• Adapted to collaborative work and to “small” languages

• Specialized Languages for Linguistic Programming (SLLPs): easy programming

• Able to guarantee the sense of the translations: expert approach (vs. empirical)

→ Develop an online framework “lingware-compatible” with Ariane-G5: Heloise

2.2 – Ariane-G5 and Heloise

2 – MT for any language pair

Existing Ariane-compatible Open Source lingware

• Prototypes

• Russian-French: RU5-FR5

• French-English: FR3-AN3

• English-French: ANG-FRA

• Demonstrators

• English-French: BEX-FEX

• French-English: FEX-BEX

• French-English (DGT)

• Portuguese-English

• Russian-French (LIDIA)

• German-French (LIDIA)

• English-French (LIDIA)

• UNL-Chinese: WNL-HN3

• UNL-French: UNL-FR5

• French-UNL: FR6-UNL

• German-French

• English-Malay: ANG-MAL

• English-Thai: IN4-TH4

• English-(Chinese, Japanese, Arabic)

• Chinese-(English, French, German, Russian, Japanese)

• Steps or groups of isolated steps

• Portuguese analysis: AMPOR+ASPOR

• German analysis: AMALX...

• Japanese morph. analysis (Annick Laurent’s PhD thesis)

This lingware is available under BSD licence (GÉTALP Group, LIG Laboratory, Grenoble, France)

Sizing examples for two existing prototypes

• French-English FR3-AN3 (typology = aircraft manuals)

• MA phase: 16403 lexical units, 221 rules

• SA phase: 795 rules

• LT phase: 16027 lexical units

• ST phase: 188 rules

• SG phase: 311 rules

• MG phase: 21601 lexical units, 27 rules

• Russian-French RU5-FR5 (typology = signaletic bulletins)

• MA phase: 7750 lexical units, 264 rules

• SA phase: 295 rules

• LT phase: 7974 lexical units

• ST phase: 134 rules

• SG phase: 128 rules

• MG phase: 5576 lexical units, 19 rules

Nota: MA, SA, LT, ST, SG, MG = the six mandatory phases of Ariane-G5 (described later in the presentation)

Ariane-G5: an MT system generator

Dictionaries and grammars are written in high-level programming languages. They are compiled to produce the MT systems.

→ Specialized Languages for Linguistic Programming:

• ATEF

• ROBRA

• EXPANS

• SYGMOR

• TRACOMPL

[Diagram: the lingware (dictionaries and grammars) is compiled by Ariane-G5 into an MT system, which turns the source text into the output text.]

The 3 steps of a system developed under Ariane-G5

A unique internal data structure: “decorated” trees

[Diagram: String → morphological analysis (ATEF; dictionary, grammar) → Tree → structural analysis (ROBRA; grammar) → Tree → lexical transfer (EXPANS; dictionary) → Tree → structural transfer (ROBRA; grammar) → Tree → structural generation (ROBRA; grammar) → Tree → morphological generation (SYGMOR; dictionary, grammar) → String. The phases are grouped into three steps: ANALYSIS, TRANSFER and GENERATION.]
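The chain of phases in the diagram can be pictured as an ordered list of phase records run by a small driver. The sketch below is only an illustration: the `Phase` record, the `check_chain` and `run_pipeline` functions and the placeholder phase behaviours are hypothetical and are not Ariane-G5 or Heloise code. Only the phase names, the specialized languages and the string/tree interfaces are taken from the slides.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Phase:
    code: str       # phase name used in the slides (MA, SA, LT, ST, SG, MG)
    step: str       # ANALYSIS, TRANSFER or GENERATION
    language: str   # specialized language the phase is written in
    consumes: str   # "string" or "tree"
    produces: str   # "string" or "tree"
    run: Callable[[Any], Any]  # placeholder behaviour, not real lingware


def to_tree(text: str) -> dict:
    # Stand-in for morphological analysis: wrap the text in a one-node tree.
    return {"label": "ROOT", "text": text, "children": []}


def keep_tree(tree: dict) -> dict:
    # Stand-in for every tree-to-tree phase: return the tree unchanged.
    return tree


def to_string(tree: dict) -> str:
    # Stand-in for morphological generation: read the text back from the tree.
    return tree["text"]


PHASES: List[Phase] = [
    Phase("MA", "ANALYSIS", "ATEF", "string", "tree", to_tree),
    Phase("SA", "ANALYSIS", "ROBRA", "tree", "tree", keep_tree),
    Phase("LT", "TRANSFER", "EXPANS", "tree", "tree", keep_tree),
    Phase("ST", "TRANSFER", "ROBRA", "tree", "tree", keep_tree),
    Phase("SG", "GENERATION", "ROBRA", "tree", "tree", keep_tree),
    Phase("MG", "GENERATION", "SYGMOR", "tree", "string", to_string),
]


def check_chain(phases: List[Phase]) -> None:
    """Check that each phase consumes what the previous phase produces."""
    for previous, following in zip(phases, phases[1:]):
        if previous.produces != following.consumes:
            raise ValueError(f"{previous.code} -> {following.code}: interface mismatch")


def run_pipeline(text: str, phases: List[Phase] = PHASES) -> str:
    """Run the phases in order, passing each result to the next phase."""
    check_chain(phases)
    data: Any = text
    for phase in phases:
        data = phase.run(data)
    return data


# The placeholders change nothing, so the input sentence comes back unchanged.
print(run_pipeline("Cette personne a démonté le meuble."))
```

Keeping the interfaces (`consumes`, `produces`) explicit in each record is what lets the driver reject a chain whose phases do not fit together before any text is processed.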

The 3 steps of a system developed under Ariane-G5

Optional EXPANS phases

[Diagram: the chain from String to String, made of morphological analysis (ATEF), structural analysis (ROBRA), lexical transfer (EXPANS), structural transfer (ROBRA), structural generation (ROBRA) and morphological generation (SYGMOR), grouped into ANALYSIS, TRANSFER and GENERATION. Optional phases marked OPT, each driven by its own dictionary or dictionaries, can be added to the chain.]

OPT: Optional EXPANS phase

The interface structures are explicitly defined

[Diagram: the chain String → morphological analysis (ATEF) → Tree → structural analysis (ROBRA) → Tree → lexical transfer (EXPANS) → Tree → structural transfer (ROBRA) → Tree → structural generation (ROBRA) → Tree → morphological generation (SYGMOR) → String, grouped into ANALYSIS, TRANSFER and GENERATION. The trees carry the labels m-structure, g-structure, source interface structure and target interface structure; the labels a-structure and s-structure also appear.]

Bernard Vauquois’ linguistic methodology

The interface structures contain three levels

• Syntagmatic level (bracketing)

• Syntactic and logico-semantic levels (via the decorations)

Example (in French):

« Cette personne a démonté le meuble. »

Syntagmatic analysis level

PROPOSITION

    GROUPE NOMINAL: "CE" "PERSONNE-N"

    NOYAU VERBAL: "AVOIR-V" "DE!1MONTER-V"

    GROUPE NOMINAL: "LE" "MEUBLE-N"

The geometry of the analysis tree reflects the syntagmatic level

« Cette personne a démonté le meuble. »

The syntactic and logico-semantic analysis levels are derivable from the geometry and decorations of the interface structure

DE!MONTER-V

    SUJET: PERSONNE-N

        DESIGNATEUR: CETTE

    AUXILIAIRE: AVOIR-V

    OBJET: MEUBLE-N

        DESIGNATEUR: LE

Syntactic analysis level

« Cette personne a démonté le meuble. »
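A decorated tree can be pictured as nodes that carry both children (the geometry) and a set of attribute values (the decorations). The sketch below is a minimal illustration, not the Ariane-G5 data structure: the `Node` class, the `function` decoration key and the helper functions are assumptions, and underscores replace the spaces in the node labels. The bracketing is read from the geometry alone, and the syntactic functions are read from the decorations.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    label: str
    decorations: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def bracketing(node: Node) -> str:
    """Syntagmatic level: derived from the geometry of the tree only."""
    if not node.children:
        return node.label
    inner = " ".join(bracketing(child) for child in node.children)
    return f"[{node.label} {inner}]"


def syntactic_functions(node: Node) -> List[Tuple[str, str]]:
    """Syntactic level: (label, function) pairs read from the decorations."""
    found = []
    if "function" in node.decorations:
        found.append((node.label, node.decorations["function"]))
    for child in node.children:
        found.extend(syntactic_functions(child))
    return found


# « Cette personne a démonté le meuble. »
tree = Node("PROPOSITION", children=[
    Node("GROUPE_NOMINAL", children=[
        Node("CE", {"function": "DESIGNATEUR"}),
        Node("PERSONNE-N", {"function": "SUJET"}),
    ]),
    Node("NOYAU_VERBAL", children=[
        Node("AVOIR-V", {"function": "AUXILIAIRE"}),
        Node("DE!1MONTER-V"),  # governor of the clause: no function in this sketch
    ]),
    Node("GROUPE_NOMINAL", children=[
        Node("LE", {"function": "DESIGNATEUR"}),
        Node("MEUBLE-N", {"function": "OBJET"}),
    ]),
])

print(bracketing(tree))
print(syntactic_functions(tree))
```

Storing the functions as decorations rather than as extra tree nodes means a single tree can serve several levels of description at once.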

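The deep level will be the same for the three paraphrases (active, passive and nominalised forms):

• Cette personne a démonté le meuble. (“This person dismantled the piece of furniture.”)

• Le meuble a été démonté par cette personne. (“The piece of furniture was dismantled by this person.”)

• Le démontage du meuble par cette personne. (“The dismantling of the piece of furniture by this person.”)

PREDICAT: 'de!1monter-v'

ARGUMENT 0: 'personne-n'

ARGUMENT 1: 'meuble-n'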

Logico-semantic analysis level

« Cette personne a démonté le meuble. »
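The idea that the three paraphrases share one deep structure can be sketched as a normalization step. This is a hand-built illustration under stated assumptions, not an analyzer: the surface functions are assigned manually, SUJET and OBJET come from the slides, while AGENT, COMPLEMENT, the `ARGUMENT_OF` table and the `deep_level` function are hypothetical names introduced here.

```python
from typing import Dict

# Surface syntactic functions, assigned by hand for each paraphrase.
PARAPHRASES: Dict[str, Dict[str, str]] = {
    "active": {"SUJET": "personne-n", "OBJET": "meuble-n"},
    "passive": {"SUJET": "meuble-n", "AGENT": "personne-n"},
    "nominalised": {"COMPLEMENT": "meuble-n", "AGENT": "personne-n"},
}

# For each form, which surface function fills which logical argument.
ARGUMENT_OF: Dict[str, Dict[str, str]] = {
    "active": {"SUJET": "ARGUMENT 0", "OBJET": "ARGUMENT 1"},
    "passive": {"AGENT": "ARGUMENT 0", "SUJET": "ARGUMENT 1"},
    "nominalised": {"AGENT": "ARGUMENT 0", "COMPLEMENT": "ARGUMENT 1"},
}


def deep_level(form: str, functions: Dict[str, str],
               predicate: str = "de!1monter-v") -> Dict[str, str]:
    """Map surface functions onto the predicate-argument structure."""
    deep = {"PREDICAT": predicate}
    for function, lexical_unit in functions.items():
        deep[ARGUMENT_OF[form][function]] = lexical_unit
    return deep


deep_structures = {
    form: deep_level(form, functions) for form, functions in PARAPHRASES.items()
}

# All three forms yield the same predicate-argument structure.
for form, deep in deep_structures.items():
    print(form, deep)
```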

Multilingualism in Ariane-G5

Analysis and generation steps are monolingual (independent)

(SL: source language; TL: target language)

[Diagram: source texts in SL1 and SL2 go through SL1 analysis and SL2 analysis; target texts in TL1 and TL2 come out of TL1 generation and TL2 generation. Each analysis is linked to each generation by a direct transfer: SL1 → TL1, SL1 → TL2, SL2 → TL1 and SL2 → TL2.]

N^2-N transfers (N = number of languages)

Multilingualism in Ariane-G5

Analysis and generation steps are monolingual (independent)

(SL: source language; I: interface structure; TL: target language)

[Diagram: SL1 analysis and SL2 analysis each feed a transfer into the interface structure (SL1 → I, SL2 → I); the I → TL1 and I → TL2 transfers feed TL1 generation and TL2 generation.]

The number of transfers is reduced from (N^2-N) to 2x(N-1) by using a double transfer or a pivot
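The two formulas from the slides, N^2-N for direct transfers and 2x(N-1) for the double-transfer or pivot design, can be compared with a few lines of code. This is a small sketch: the function names are assumptions, and the language list is only an example drawn from languages mentioned in this presentation.

```python
from itertools import permutations
from typing import List, Tuple


def direct_transfers(languages: List[str]) -> List[Tuple[str, str]]:
    """Every ordered (source, target) pair needs its own transfer."""
    return list(permutations(languages, 2))


def direct_transfer_count(n: int) -> int:
    # Formula from the slides: N^2 - N
    return n * n - n


def pivot_transfer_count(n: int) -> int:
    # Formula from the slides: 2 x (N - 1)
    return 2 * (n - 1)


languages = ["English", "French", "Khmer", "Lao", "Thai"]
assert len(direct_transfers(languages)) == direct_transfer_count(len(languages))

for n in (2, 3, 5, 8):
    print(n, direct_transfer_count(n), pivot_transfer_count(n))
```

For the 8 languages of the “Little Prince” work, the formulas give 56 direct transfers against 14 with a pivot, which is the kind of saving that matters when each transfer must be written by hand.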

The Heloise framework

Editing / compiling a grammar

The Heloise framework

Analyzing the traces of a translation

3 – Current activities around Heloise

3.1 – Reducing the development complexity

Reduce the development complexity

• In order to:

• Make the development of a system accessible to non-specialist groups,

• Reduce the development times,

• One has to:

• Simplify the development of dictionaries,

• Simplify the development of grammars,

• Explain the methodology in an educational manner with illustrative examples,

• Provide an interchange collaborative platform.

Simplify the development of dictionaries

• Examples of dictionary programming language (entries of a morphological analyzer):

EMBALLAGE ==NPL1 (FT01387 ,EMBALLER-V ).

EMBARQU ==AIM (FT01390 ,EMBARQUER-V ).

→ Use a lexical database for the monolingual lexical information (easier to maintain)

→ Use a tool to generate the morphological analyzer (ATEF) and synthesis (SYGMOR)

→ Possibly use dedicated tools to manage (create, delete, modify) the lexical data

• Populate the lexical database with all freely available lexical resources (personal, WordNet…)

• Manage the lexical information by layer (Sylviane Chappuy’s “layer cake” or “millefeuille”)

→ Morphological, derivational, syntactic, semantic and logic information
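Generating dictionary entries from a lexical database, as suggested above, could look like the following sketch. It rests on assumptions: the `LexicalRecord` fields and their names are hypothetical interpretations of the two entries shown on this slide, and the output layout simply imitates those entries rather than following an ATEF specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LexicalRecord:
    # Field names are assumptions made for this sketch.
    stem: str            # e.g. EMBALLAGE
    morph_format: str    # e.g. NPL1
    synt_format: str     # e.g. FT01387
    lexical_unit: str    # e.g. EMBALLER-V


def atef_like_entry(record: LexicalRecord) -> str:
    """Format one record in the layout of the entries shown on the slide."""
    return (f"{record.stem} =={record.morph_format} "
            f"({record.synt_format} ,{record.lexical_unit} ).")


def generate_dictionary(records: List[LexicalRecord]) -> str:
    """Produce one entry per record, sorted by stem for stable output."""
    return "\n".join(atef_like_entry(r) for r in sorted(records, key=lambda r: r.stem))


# A two-record stand-in for the lexical database.
database = [
    LexicalRecord("EMBARQU", "AIM", "FT01390", "EMBARQUER-V"),
    LexicalRecord("EMBALLAGE", "NPL1", "FT01387", "EMBALLER-V"),
]

print(generate_dictionary(database))
```

With this split, contributors edit structured records and never touch the entry syntax, which is the maintenance gain the slide points to.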

Simplify the development of dictionaries

Simplify the development of grammars

• Specify the grammars with the “static grammars” formalism

• Correspondence between a string and a tree (String-Tree Correspondence Grammars / STCG)

• Derive the analyzers and the syntheses from the “boards” of a static grammar

• Intensive work on static grammars in France and in Malaysia

• Sylviane Chappuy

• Christian Boitet

• Zaharin Yusoff

• Tang Enya Kong

• Amélie Bosc

• …

Principle of a STCG board

Explain the methodology in an educational manner

• Many methodological documents already exist

• Audience = non-specialists

→ This methodology needs to be presented differently

→ Step-by-step approach

→ Many illustrative and executable examples

• Guillaume de Malézieux’s approach:

• Provide a detailed example in several language pairs for the “Little Prince”

• This tale by Saint-Exupéry has been widely translated (270 languages)

• Current work on 8 languages: Croatian, English, French, Hindi, Indonesian, Khmer, Lao and Thai

• Newcomers wishing to develop their own system can become familiar with the methodology by writing the dictionaries and grammars for the “Little Prince” in their own language, by mimesis (imitating the existing examples)

Several sentences of the “Little Prince” in 8 languages

• Croatian: Kad mi je bilo šest godina vidio sam jednu veličanstvenu sliku u nekoj knjizi o prašumi koja se zvala Istinite priče Slika je predstavljala zmijskog cara kako guta neku zvijer Evo tog crteža

• English: Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing.

• French: Lorsque j’avais six ans, j’ai vu, une fois, une magnifique image, dans un livre sur la Forêt Vierge qui s’appelait « Histoires Vécues ». Ça représentait un serpent boa qui avalait un fauve. Voilà la copie du dessin.

• Hindi: मैं कोई छः साल का रहा होऊंगा। एक किताब मिल गई जंगल की सच्ची कहानियां। उस में एक अद्भुत तस्वीर थी। अजगर एक जंगली जानवर को निगल रहा था। यह रही वह तस्वीर।

• Khmer: �ព� ព� ខ�� � យ� យ� ���ម�យ �� � �ន �ព� ម�យ ម�យ ខ�� � �ន �ន �ន �ឃ�ញ ��ប�ព ��ប�ព �� ��� � �!�! ក��ង �$%&�' ម�យ ម�យ �ន ច�នង�ជ�ង * * « ក��ង ��,ង ព-. / » 1�� ន-2យ ព� ព� 3�ព $4�ក 5 ��ប�ព ��ប�ព�67 ប89 ញ ព� ព� ព$: *; ន: ក�ព�ង ��ប $.< �ម=គ ម�យ ម�យ 5?ង ��� ��� �ន7 គ@ A ��ប�ព ��ប�ព ច��ង 3ន គ�ន���67 5

• Lao: ຕອນຂອ້ຍມອີາຍພຸຽງ ໖ ປີ ຄ ັງ້ນຶ່ງຂອ້ຍໄດເ້ຫັນຮບູສວຍງາມໃນປື້ມກຽ່ວກບັປາ່ດງົດບິທີ່ເອີນ້ວາ່ « ປະຫວດັ ຈງິ ». ມນັໄດມ້ ີງເູຫຼືອມກາໍລງັກນືສດັປາ່. ນີຄ້ຮືບູລອກມາ.

• Indonesian: waktu berusia enam tahun , aku pernah melihat gambar luar biasa dalam buku tentang hutan perawan yang berjudul Kisah-kisah nyata . gambar itu memperlihatkan seekor boa pembelit sedang menelan binatang buas inilah salinan

• Thai: เมื่อ ตอน ฉัน อาย ุได ๖ ขวบ ฉัน ได เห็น รูปภาพ จับใจ รปู หนึ่ง ใน หนังสือ เกีย่วกับ ปาดงดิบ ชือ่ วา " ประวตัิ ชวีิต ธรรมชาติ " รูป นั้น เปน รปู งูเหลือม กําลัง กลืน สัตวปา นี่ คือ รปู ลอก ของ ภาพ นั้น

One needs an interchange collaborative platform

• Groups of people scattered across the planet need a good collaborative platform so that they can work together comfortably and efficiently, as if they were co-located

• Repositories for:

• Generic documents (methodology, linguistics, programming…)

• Project documents

• Available linguistic modules

• Chat and email services

• Wiki

• News

• …

An interchange collaborative platform: lingwarium.org

3 – Current activities around Heloise

3.2 – Development projects with skills transfer

Cooperative development projects with skills transfer

• First, address the major languages that are not covered, or are poorly covered, by the current systems

• Supercentral languages ↔ peripheral languages (e.g.: English ↔ Lao, English ↔ Khmer…)

• Agriculture, tourism, information websites...

• Languages of geographically close populations (e.g.: Vietnamese ↔ Lao, Lao ↔ Khmer…)

• Administrations (customs...), regional integration...

• Then go deeper inside less-resourced languages, working by mimesis (gravitational approach)

• English ↔ Tai dam, Lao ↔ Tai dam…

Cooperative development projects with skills transfer

• Ongoing collaborative projects

• Developing a new version of the User Interface of Heloise (V. Berment)

• Tuning a step by step illustrated design methodology based on the “Little Prince” (G. de Malézieux + the contributors for Hindi, Indonesian, Thai…)

• Improving the definition and developing tools for the static grammars (A. Bosc, S. Chappuy, C. Boitet, V. Berment…)

• Documenting and improving the French-English and English-French prototypes

• Developing a Lao-French prototype (D. Bouquetvichit,)

• Developing a Khmer-French prototype (G. de Malézieux)

• Developing a morphological analyzer for the German language (J.-P. Guilbaud)

• Developing a morphological analyzer for the Lithuanian language (J. Kapočiūtė-Dzikienė)

• Developing a morphological analyzer for the Quechua language (M. Duran, J. Sitko)

• …

Cooperative development projects with skills transfer

• Pending proposals:

• FraCroTA: LIG Grenoble, FER Zagreb

→ A number of tasks including a speech MT system for French ↔ Croatian

• GETALCO (UNESCO): Taranis Paris, LIG Grenoble, INaLCO Paris

→ A number of tasks including the development of a system for an under-resourced language and the development of new tools to reduce the development times

• Proposals seeking funding:

• WHISTLE: LIG Grenoble, INaLCO Paris, Kasetsart University, Srinakharinwirot University

→ A number of tasks including MT systems for: French ↔ Lao, French ↔ Thai, English ↔ Lao, English ↔ Thai + possibly Korean.

• GMSTech: LIG Grenoble, Taranis Paris + others

→ Common NLP tools for the Greater Mekong Subregion including word processors and dictionaries based on GMSWord, LaoUniKey, LaoTrans… and leading to the creation of a regional Machine Translation tool based on UNL

Perspectives

YOU ARE WELCOME!