Taranis Software
1
Some thoughts on how to address commercially unprofitable languages and language pairs
Vincent BERMENT (*), Saturday, August 23rd, 2014
(*) Lecturer at INaLCO, Paris (Lao Section) – Associated Researcher at GETALP, Grenoble – Director of Taranis, Paris
Master plan
My motivations: to contribute to the reduction of the digital divide
1. Some experiments highlighting the gains achieved with appropriate methods
• Integrate with generic environments
• Reuse software code whenever possible
• Recycle existing dictionaries
2. Looking for an efficient Machine Translation solution
• Looking for a framework able to address any language pair
• Ariane-G5 and Heloise
3. Current activities around Heloise
• Reducing the development costs
• Cooperative development projects with skills transfer
Perspectives
My motivation: contribute to reducing the digital divide
• Since 1996: Work on Lao NLP, starting at a time when it was difficult even to type text
• Microsoft Word add-in: LaoWord (intuitive input, “legacy font” and Unicode interworking, sorting, smart syllabic selections, dictionaries, phonetic transcriptions, miscellaneous functions)
• Unicode Virtual Keyboard: LaoUniKey (intuitive input; system hook)
• Online Lao-French and French-Lao dictionary: LaoDict and word for word translator: LaoTrans
• French-Lao Reinhorn-Berment dictionary: 1729 pages, ~40,000 entries (~60,000 word senses), ~15,000 examples, http://www.you-feng.com/dictionnaire_fran_lao_reinhorn.php
• Lao Software Web site: http://www.laosoftware.com
• Since 2001: Work on NLP for under-resourced languages, especially in the GMS area
• PhD thesis “Methods to computerize under-resourced languages and groups of languages” (2004)
• Extension of LaoWord to other languages & writing systems (Khmer, Tham, Bengali): GMSWord
• Word segmenter: Motor (Burmese, Khmer, Lao, Thai, Tibetan)
• GMS Software Web site: http://www.gmsware.org
→ LaoWord, LaoUniKey and GMSWord are Free and Open Source Software (FOSS)
My motivation: contribute to reducing the digital divide
• Since 2004: Work on developing open source Machine Translation systems & teams
• Integration and strengthening of a “stream” launched by Bernard Vauquois: Work on under-resourced language pairs, often linguistically distant, with the aim of creating autonomous, qualified teams (as he did in Malaysia and Thailand, at least)
• Different from other efforts such as the work done by Carnegie Mellon University, or those funded by DARPA (USA) or ODA (Japan)
• Similar to some projects such as PAN-Asia funded by IDRC and driven by Sarmad Hussain (Pakistan)
• Development of a MT framework able to address under-resourced languages: Heloise
• Based on tools and methods that proved their ability to address pairs of languages with big differences: GÉTA’s Ariane-G5 and methodology
• Open source reusable lingware tools and components already exist: English, French, Russian (prototypes), Malay, Thai, Arabic, Portuguese… Recent development of an advanced large-coverage morphological analyzer for German (J.-P. Guilbaud)
• Lingwarium Web site: http://www.lingwarium.org
• Taranis Web site (under construction): http://www.taranis-software.com
1 – Some methods to reduce costs
1.1 – Integrate with generic environments
Integrate with generic environments: a Word add-in for Lao
Integrate with generic environments: time saved for LaoWord
• LaoWord: WLL (Word add-in) embedded in the MS Word environment
→ Lao functions: intuitive input, “legacy font” text import, sorting, smart syllabic selections, dictionaries, phonetic transcriptions, miscellaneous functions
• ~2,500 working hours “only”, against ~25,000 (?) for developing a complete but simple word processor → GAIN ~90 % (and more with maintenance!)
[Diagram: the generic part of the software (e.g. MS Word) and the specific part of the software (e.g. LaoWord), with the label IGS/SS]
1 – Some methods to reduce costs
1.2 – Reuse software code whenever possible
Many specific writing systems remain under-computerized
• Yuon
• Tham Lao
• Tai Lü and Khün (Tham Lanna)
• Example of “ book” (Bailan)
The Nidana Sutta written in Tham Lao script
Gatha (magical formula) written in Tham Lao script
We defined a Tham Lao keyboard (QWERTY mapping)
The Tham Lao writing system is used in Laos for the Buddhist texts
No Unicode encoding, no font, no existing keyboard layout standard…
→ Define an interim solution so the monks can type in Tham Lao
We defined a Tham Lao input sequence
• Tham script is used in Laos for writing Buddhist texts (the word “Tham” comes from “Dhamma”), either in Lao or in Pali languages
• Characters can be written in line or with modifications (subscript…)
• In the word indri (from indriya: sense, faculty), the (r) and the (d) each take a modified form, obtained by prefixing the character with “-” (so the input sequence is: <-rn-di)
→ Specific software derived from LaoWord: GMSWord (MS Word add-in for the languages of the Greater Mekong Subregion: Khmer, Lao, Tham…)
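The “-” prefix convention of the input sequence can be illustrated with a small parser. This is only a sketch under assumptions: the function name `parse_input_sequence` is hypothetical, and the code only records which keys carry the prefix. It does not map keys to actual Tham glyphs, which depend on the keyboard and font tables.

```python
from typing import List, Tuple

MODIFIER = "-"  # prefix announcing that the next character takes its modified form


def parse_input_sequence(keys: str) -> List[Tuple[str, bool]]:
    """Split a typed sequence into (key, is_modified) pairs."""
    result: List[Tuple[str, bool]] = []
    pending = False
    for key in keys:
        if key == MODIFIER and not pending:
            pending = True  # remember the prefix and wait for the character
            continue
        result.append((key, pending))
        pending = False
    return result


# The sequence quoted on the slide for "indri".
print(parse_input_sequence("<-rn-di"))
```

In this sketch a trailing “-” with no following character is simply ignored; a real input method would have to decide how to handle it.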
Reuse software code: adapting LaoWord to the Tham script
The LaoWord text input
[Diagram: the LaoWord text input combines a DLL + HOOK for input and a DLL + HOOK for output around an abstract character set for Lao, with one table per font: LAO FRANCE, FONT LAO 1, FONT LAO 2, FONT LAO 3, DUANG JAN]
• 150 hours (generic)
• 100 hours (1 font) + ≈ 3 hours per font
Adaptation of the LaoWord text input to the Tham script
[Diagram: the “GMSWord” text input for Tham uses a DLL + HOOK for input and a DLL + HOOK for output around an abstract character set for Tham, with THAM SCRIPT tables (GMSWord also contains Khmer and Lao)]
• 8 hours (reuse)
• 10 hours (1 font)
• This excludes the time for defining the input sequence and the keyboards and for developing a font for Tham, as well as the time to build GMSWord (listbox…).
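The table-driven design (one abstract character set, one table per font) can be sketched as follows. This is an illustration under assumptions, not the LaoWord code: the font names `font_a` and `font_b`, their code values and the abstract names are hypothetical; only the two Unicode code points are real Lao letters.

```python
from typing import Dict, List

# Hypothetical font tables: legacy code -> abstract character name.
# A real table would list every glyph of the legacy font.
FONT_TABLES: Dict[str, Dict[int, str]] = {
    "font_a": {0x61: "KO", 0x62: "KHO_SUNG"},
    "font_b": {0x41: "KO", 0x42: "KHO_SUNG"},
}

# Abstract character set -> Unicode (two real Lao code points as examples).
ABSTRACT_TO_UNICODE: Dict[str, str] = {
    "KO": "\u0e81",        # LAO LETTER KO
    "KHO_SUNG": "\u0e82",  # LAO LETTER KHO SUNG
}


def to_abstract(codes: List[int], font: str) -> List[str]:
    """Decode legacy font codes into abstract characters."""
    table = FONT_TABLES[font]
    return [table[code] for code in codes]


def to_font(abstract: List[str], font: str) -> List[int]:
    """Encode abstract characters for another legacy font."""
    reverse = {name: code for code, name in FONT_TABLES[font].items()}
    return [reverse[name] for name in abstract]


def to_unicode(abstract: List[str]) -> str:
    """Encode abstract characters as Unicode text."""
    return "".join(ABSTRACT_TO_UNICODE[name] for name in abstract)


legacy_text = [0x61, 0x62]                 # text typed with font_a
abstract = to_abstract(legacy_text, "font_a")
print(to_font(abstract, "font_b"))         # same text re-encoded for font_b
print(to_unicode(abstract))                # same text in Unicode
```

With this layout the generic code is written once, and supporting another font or script amounts to adding one table, which is consistent with the small per-font and per-script costs quoted on the slides.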
Time saving
Cost the first time (Lao) | Subsequent cost (Tham) | Gain
250 h | 18 h | 93 %
Note:
• Another adaptation of LaoWord was done in 2002 for Bengali (directly in LaoWord → BanglaWord). It took about the same time.
• Commercial interest for a very low marginal effort: 3-5 million speakers in Laos → 200-300 million speakers in Bangladesh
1 – Some methods to reduce costs
1.2b (2nd example) – Reuse software code whenever possible
Segmenting texts without separators
• Example of unsegmented text in Khmer, Lao and Thai
• Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing.
• [Khmer version of the same passage: not recoverable from the extracted text]
• ຕອນຂ້ອຍມີອາຍຸພຽງ ໖ ປີ ຄັ້ງນຶ່ງຂ້ອຍໄດ້ເຫັນຮູບສວຍງາມໃນປື້ມກ່ຽວກັບປ່າດົງດິບທີ່ເອີ້ນວ່າ « ປະຫວັດຈິງ ». ມັນໄດ້ມີງູເຫຼືອມກຳລັງກືນສັດປ່າ. ນີ້ຄືຮູບລອກມາ.
• เมื่อตอนฉันอายุได้ ๖ ขวบ ฉันได้เห็นรูปภาพจับใจรูปหนึ่งในหนังสือเกี่ยวกับป่าดงดิบชื่อว่า "ประวัติชีวิตธรรมชาติ" รูปนั้นเป็นรูปงูเหลือมกำลังกลืนสัตว์ป่า นี่คือรูปลอกของภาพนั้น
The Little Prince
Antoine de Saint Exupéry
Segmenting texts without separators
• The same text after being segmented
• Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing.
• [Segmented Khmer version of the same passage: not recoverable from the extracted text]
• ຕອນ-ຂ້ອຍ-ມີ-ອາຍຸ-ພຽງ-໖-ປີ-ຄັ້ງນຶ່ງ-ຂ້ອຍ-ໄດ້-ເຫັນ-ຮູບ-ສວຍງາມ-ໃນ-ປື້ມ-ກ່ຽວກັບ-ປ່າດົງດິບ-ທີ່-ເອີ້ນວ່າ-«-ປະຫວັດ-ຈິງ-».-ມັນ-ໄດ້-ມີ-ງູເຫຼືອມ-ກຳລັງ-ກືນ-ສັດປ່າ-.-ນີ້-ຄື-ຮູບ-ລອກ-ມາ-.
• เมื่อ-ตอน-ฉัน-อายุ-ได้-๖-ขวบ-ฉัน-ได้-เห็น-รูปภาพ-จับใจ-รูป-หนึ่ง-ใน-หนังสือ-เกี่ยวกับ-ป่าดงดิบ-ชื่อ-ว่า-"-ประวัติ-ชีวิต-ธรรมชาติ-"-รูป-นั้น-เป็น-รูป-งูเหลือม-กำลัง-กลืน-สัตว์ป่า-นี่-คือ-รูป-ลอก-ของ-ภาพ-นั้น
The Little Prince
Antoine de Saint Exupéry
Segmentation: “Motor”, an online generic segmenter
• Maximum-length matching algorithm
• List of words per language
• Some language-specific heuristics
• ~500 hours for the initial software
• 10 extra hours for adding a new language (except the word list)
→ GAIN = 98 %
http://www.taranis-software.com/Heloise/Segmentation
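The maximum-length matching principle can be sketched in a few lines. This is a minimal illustration, not Motor's code: the function name `segment` and the tiny Thai word list are assumptions, and the language-specific heuristics mentioned above are left out.

```python
from typing import List, Set


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy left-to-right maximum-length matching against a word list."""
    longest = max((len(word) for word in lexicon), default=1)
    words: List[str] = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shorter and shorter ones.
        for size in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in lexicon:
                break
        else:
            candidate = text[pos]  # unknown character: emit it on its own
        words.append(candidate)
        pos += len(candidate)
    return words


# Toy word list taken from the Thai sample sentence of the slides.
thai_lexicon = {"ฉัน", "ได้", "เห็น", "รูป", "รูปภาพ"}
print("-".join(segment("ฉันได้เห็นรูปภาพ", thai_lexicon)))
```

Because the word list is a parameter, adding a language in this sketch only requires a new list, which mirrors the low marginal cost reported for Motor.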
1 – Some methods to reduce costs
1.3 – Recycle existing dictionaries
Recycling the French-Lao Reinhorn-Berment dictionary
• ~40,000 entries (~60,000 word-senses)
• ~15,000 examples
• Published in 2013
• At least 20,000 hours since 1970
http://www.you-feng.com/dictionnaire_fran_lao_reinhorn.php
• ~200 hours for cleaning, parsing + the Web site → GAIN = 99 %
• Currently being used for a Machine Translation project
Recycling the French-Lao Reinhorn-Berment dictionary
http://laosoftware.com/HeloiseTest/OutilProfs/
Summary of the gains achieved
Reused item (hours) | Specific development (hours) | Gain
Microsoft Word (25,000) | LaoWord WLL (2,500) | 90.00 %
LaoWord WLL, text input part (250) | ThamWord / BanglaWord (18) | 92.80 %
Motor (segmenter) (500) | Per added language (10) | 98.00 %
French-Lao dictionary, MS Word (20,000) | Cleaning, parsing + Web site (200) | 99.00 %
→ Applying an appropriate method is essential
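The gains in this summary follow from one ratio: the share of the reused effort that did not have to be spent again. A minimal sketch, with the hour figures taken from the slides and the function name `gain` chosen here for illustration:

```python
def gain(reused_hours: float, specific_hours: float) -> float:
    """Share of the reused effort that was saved, in percent."""
    return round(100 * (1 - specific_hours / reused_hours), 1)


# (reused item hours, specific development hours) as given on the slides.
cases = {
    "LaoWord WLL on Microsoft Word": (25_000, 2_500),
    "ThamWord / BanglaWord on LaoWord text input": (250, 18),
    "Motor, per added language": (500, 10),
    "French-Lao dictionary recycling": (20_000, 200),
}

for name, (reused, specific) in cases.items():
    print(f"{name}: {gain(reused, specific)} %")
```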
2 – MT for any language pair
2.1 – Looking for a framework able to address any pair of languages (including economic concerns)
Situation of the ≳ 6000 languages in the world (source: http://www.ethnologue.com)
Number of speakers | Number of languages | Category
More than 1G | 1 | Chinese
Between 100M and 1G | 8 | Hypercentral (English) and supercentral languages
Between 10M and 100M | ≈ 65 | Major central languages
Between 1M and 10M | ≈ 250 | Secondary central languages
Between 100,000 and 1,000,000 | ≈ 1000 | Peripheral languages I (precarious existence)
Between 10,000 and 100,000 | ≈ 1600 | Peripheral languages II (existence immediately threatened)
Fewer than 10,000 | ≈ 3400 | Peripheral languages III (dying languages)
Summary:
• N > 10 M native speakers: ≈ 74 languages (1%), ≈ 5 billion persons (70%)
• 1 M speakers < N < 10 M speakers: ≈ 250 languages (4%), ≈ 1 billion persons (15%)
• N < 1 M speakers: ≈ 6000 languages (95%), ≈ 1 billion persons (15%)
Situation in Machine Translation (MT)
[Chart: translation quality (poor to excellent) plotted against the languages, ordered from many speakers and excellent existing linguistic resources to few speakers and few linguistic resources, with marks at 9, 74, 324 and ≳ 6000 languages. Example shown: the leaders of the 2 main approaches in MT, Systran (expert MT) and Google (empirical MT).]
• Living languages having > 1M speakers (324 languages of more than 1 million speakers)
• ≈ 20% of the languages having > 1M speakers are covered: 80 languages at Google, among which ~70 having > 1M speakers (≳ 1% of the ≈ 6000+ languages in the world)
• ≈ 80% of the languages having > 1M speakers are not covered: ≳ 250 of the ≳ 320 languages having > 1M speakers (≈ 4% of the world languages)
• Living languages having < 1M speakers
• ≈ 0.1% of these languages (6 languages) are in Google Translate: Basque, Welsh, Irish, Icelandic, Maltese, Maori
• Some languages of several 100k speakers can be profitable
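How could Machine Translation address any language?
[Chart: translation quality (poor to excellent) plotted against the languages, ordered from many speakers and excellent existing linguistic resources to few speakers and few linguistic resources, with marks at 9, 74, 324 and ≳ 6000 languages. Systran is marked on the chart, the 324 languages of more than 1 million speakers are labelled, and a blue area is highlighted.]
The survival of these languages depends on satisfying the MT needs in this blue area!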
How many languages will be addressed by the private sector?
• Is it possible to provide fair MT services to the 6000 languages of the planet?
• Private companies have to take the return on investment (ROI) into account
→ Languages spoken by few people generally present a low ROI
• The cost of a system increases with the scarcity of the resources (parallel corpora, dictionaries…)
→ Languages spoken by few people generally have scarce resources
Double braking effect for the under-resourced languages:
→ Less profitable
→ More expensive
And finally poor-quality translations, with no guarantee of the sense
How many languages will be addressed by the private sector?
• How many languages will be addressed by the private sector?
• Probably fewer than 350 languages if one assumes an ROI limit around 1 million speakers (324 such languages, 70 already done)
• Probably more than the 172 official languages (national or regional (*))
→ Private sector limit: ~200-300 / 6000 = 3-5 % of the world languages
(*) : http://fr.wikipedia.org/wiki/Liste_des_langues_officielles
Evolution of the number of languages at Google
[Chart: number of languages at Google (axis from 0 to 90), by quarter from October 2007 to October 2013]
• 10/2007-10/2009 (2 years): + 40 languages → a mean of 20 languages per year
• 10/2009-10/2011 (2 years): + 12 languages → a mean of 6 languages per year
• 10/2011-10/2013 (2 years): + 24 languages → a mean of 12 languages per year
• Average: ~10 languages per year
• And after?
• 324 languages of more than 1M speakers → 30 years at this rate! (10 languages/year)
• Christian Boitet claims that 600 languages will have MT by ≲ 2025 → +50 languages/year!
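The arithmetic behind these rates can be checked with a short sketch. The period figures come from the slide; the function names are illustrative, and the starting point used for the last computation (about 80 languages covered in 2013) is an assumption taken from the figures quoted elsewhere in these slides.

```python
from typing import List, Tuple

# (period, languages added, years) as read from the slide.
PERIODS: List[Tuple[str, int, int]] = [
    ("10/2007-10/2009", 40, 2),
    ("10/2009-10/2011", 12, 2),
    ("10/2011-10/2013", 24, 2),
]


def yearly_rates(periods: List[Tuple[str, int, int]]) -> List[float]:
    """Mean number of languages added per year in each period."""
    return [added / years for _, added, years in periods]


def years_to_cover(languages: int, rate_per_year: float) -> float:
    """Years needed to cover a number of languages at a constant rate."""
    return languages / rate_per_year


def required_rate(target: int, current: int, years: int) -> float:
    """Languages per year needed to go from current to target."""
    return (target - current) / years


print(yearly_rates(PERIODS))
print(years_to_cover(324, 10))              # order of magnitude of "30 years"
print(required_rate(600, 80, 2025 - 2013))  # order of magnitude of "+50 per year"
```

The exact results differ slightly from the rounded figures on the slide, which are orders of magnitude rather than precise estimates.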
Number of languages: Where is the limit?
• Cost of the linguistic resources for under-resourced languages (from Boitet 2008 (*))
• Empirical approach (Google…)
• Corpus: 200k to 800k translated pages → 100 to 400 man-years → 10 to 40 M€
• Expert approach (Systran…)
• Dictionaries: 50k to 500k entries → 15 to 150 man-years → 1.5 to 15 M€
• Grammars: 25 man-years → 2.5 M€
The expert approach is cheaper but still much too expensive for most languages
Would there be ways to dramatically reduce these costs?
Cost of a system without initial resources:
→ 10 M€ for a system with an empirical approach (mostly statistical)
→ 4 M€ for a system with an expert approach (mostly rule-based)
(*) : http://www-clips.imag.fr/geta/christian.boitet/pages_personnelles/zArticles_sur_la_TAO_pdf/TALN-08-ArchiTA.080218.v7-final.pdf
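These figures are consistent with a single unit cost of about 100 k€ per man-year, which is implied by the slide (100 man-years → 10 M€) rather than stated. A minimal sketch under that assumption, with illustrative names:

```python
from typing import Tuple

# Assumption implied by the slide: 100 man-years cost 10 M EUR.
COST_PER_MAN_YEAR_KEUR = 100


def cost_keur(man_years: int) -> int:
    """Cost in thousands of euros for a given effort."""
    return man_years * COST_PER_MAN_YEAR_KEUR


def cost_range_keur(low_man_years: int, high_man_years: int) -> Tuple[int, int]:
    """Low and high cost bounds in thousands of euros."""
    return cost_keur(low_man_years), cost_keur(high_man_years)


GRAMMARS_MAN_YEARS = 25

empirical = cost_range_keur(100, 400)  # corpus only
expert = cost_range_keur(15 + GRAMMARS_MAN_YEARS, 150 + GRAMMARS_MAN_YEARS)

print("empirical approach (k EUR):", empirical)
print("expert approach (k EUR):", expert)
```

The low bounds reproduce the 10 M€ and 4 M€ quoted for a system without initial resources.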
Promising approaches for under-resourced languages
• Empirical approaches (Google, Microsoft, ASEAN MT, Asia Online…) (*)
• They have shown that systems (sometimes rough ones) can be developed very quickly when parallel texts are available
• Expert approaches (Systran, Apertium…) (*)
• They can provide quality solutions, are cheaper when large quantities of parallel texts are missing, and definitely present an intrinsic linguistic interest
• Possibility to use a pivot to reduce the cost when translating between many languages:
• Semantic (IF)
• Central language (double transfer)
• English (UNL, double transfer)
(*) : These names refer to existing solutions for Southeast Asian languages
Best practices for under-resourced languages
• Open solutions (Apertium “spirit”) allow populations to develop their own MT services, so that they no longer depend on major players such as Google or Microsoft
• Collaborative / networking work can bring the interested persons together (diasporas, foreign languages schools, volunteers…)
• Projects involving many languages (Apertium, ASEAN MT) can provide significant synergies
• Derive MT systems from existing ones for a genetically or “gravitationally” close language: languages spoken by <1M persons are often close to one of the 324 languages spoken by >1M
• Reuse and pool lexical resources → FOSS lexical resources enriched collaboratively
• When using an expert approach, give preference to a solution in which analyses and generations are purely monolingual
MT Technology able to address under-resourced languages
• Objective = Make possible the production of quality MT systems:
• For any pair of languages, thus potentially with big differences between the languages
• To be developed in a reasonable time by any group of people willing to do it themselves
• Able to guarantee the sense of the translations
• The candidate solution should:
• Allow developers to work together over a network (online tools)
• Be well-proven and easy to use (documentation, open reusable systems, tools…)
• Is there an existing solution able to fulfill these constraints?
• Apertium is limited by its choice of methods and tools to “weak problems of translation”
• GÉTA’s Ariane-G5 and methodology fulfill all these requirements, except the online availability
• MT systems between rather different languages: deep level
• Adapted to collaborative work and to “small” languages
• Specialized Languages for Linguistic Programming (SLLPs): easy programming
• Able to guarantee the sense of the translations: expert approach (vs. empirical)
→ Develop an online framework “lingware-compatible” with Ariane-G5: Heloise
2 – MT for any language pair
2.2 – Ariane-G5 and Heloise
Existing Ariane-compatible Open Source lingware
• Prototypes
• Russian-French: RU5-FR5
• French-English: FR3-AN3
• English-French: ANG-FRA
• Demonstrators
• English-French: BEX-FEX
• French-English: FEX-BEX
• French-English (DGT)
• Portuguese-English
• Russian-French (LIDIA)
• German-French (LIDIA)
• English-French (LIDIA)
• UNL-Chinese: WNL-HN3
• UNL-French: UNL-FR5
• French-UNL: FR6-UNL
• German-French
• English-Malay: ANG-MAL
• English-Thai: IN4-TH4
• English-(Chinese, Japanese, Arabic)
• Chinese-(English, French, German, Russian, Japanese)
• Steps or groups of isolated steps
• Portuguese analysis: AMPOR+ASPOR
• German analysis: AMALX...
• Japanese morph. analysis (Annick Laurent’s PhD thesis)
This lingware is available under BSD licence (GÉTALP Group, LIG Laboratory, Grenoble, France)
Sizing examples for two existing prototypes
• French-English FR3-AN3 (typology = aircraft manuals)
• MA phase: 16403 lexical units, 221 rules
• SA phase: 795 rules
• LT phase: 16027 lexical units
• ST phase: 188 rules
• SG phase: 311 rules
• MG phase: 21601 lexical units, 27 rules
• Russian-French RU5-FR5 (typology = signaletic bulletins)
• MA phase: 7750 lexical units, 264 rules
• SA phase: 295 rules
• LT phase: 7974 lexical units
• ST phase: 134 rules
• SG phase: 128 rules
• MG phase: 5576 lexical units, 19 rules
Note: MA, SA, LT, ST, SG, MG = the six mandatory phases of Ariane-G5 (described later in the presentation)
Ariane-G5: an MT system generator
Dictionaries and grammars are written in high-level programming languages
They are compiled to produce the MT systems
→ Specialized Languages for Linguistic Programming
• ATEF
• ROBRA
• EXPANS
• SYGMOR
• TRACOMPL
[Diagram: Ariane-G5 compiles the lingware (dictionaries and grammars) into an MT system, which turns the source text into the output text]
The 3 steps of a system developed under Ariane-G5
A unique internal data structure: “decorated” trees
ANALYSIS
• Morphological analysis (ATEF): string → tree (dictionary + grammar)
• Structural analysis (ROBRA): tree → tree (grammar)
TRANSFER
• Lexical transfer (EXPANS): tree → tree (dictionary)
• Structural transfer (ROBRA): tree → tree (grammar)
GENERATION
• Structural generation (ROBRA): tree → tree (grammar)
• Morphological generation (SYGMOR): tree → string (dictionary + grammar)
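The chain of six phases over a single tree structure can be sketched as plain function composition. This is a toy illustration under assumptions, not Ariane-G5 or Heloise code: the `Node` class, the function names and the two-word bilingual dictionary are hypothetical, and the structural phases are left as placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A decorated tree node: a label, its decorations and its children."""
    label: str
    decorations: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def morphological_analysis(text: str) -> Node:        # MA: string -> tree
    leaves = [Node(word.lower()) for word in text.split()]
    return Node("ROOT", children=leaves)


def structural_analysis(tree: Node) -> Node:           # SA: tree -> tree
    tree.decorations["analysed"] = "yes"                # placeholder decoration
    return tree


def lexical_transfer(tree: Node, bilingual: Dict[str, str]) -> Node:  # LT
    for leaf in tree.children:
        leaf.label = bilingual.get(leaf.label, leaf.label)
    return tree


def structural_transfer(tree: Node) -> Node:           # ST: placeholder
    return tree


def structural_generation(tree: Node) -> Node:         # SG: placeholder
    return tree


def morphological_generation(tree: Node) -> str:       # MG: tree -> string
    return " ".join(leaf.label for leaf in tree.children)


def translate(text: str, bilingual: Dict[str, str]) -> str:
    """Run the six phases in order: MA, SA, LT, ST, SG, MG."""
    tree = morphological_analysis(text)
    tree = structural_analysis(tree)
    tree = lexical_transfer(tree, bilingual)
    tree = structural_transfer(tree)
    tree = structural_generation(tree)
    return morphological_generation(tree)


print(translate("le meuble", {"le": "the", "meuble": "furniture"}))
```

Only the first and last phases touch strings; every phase in between takes a tree and returns a tree, which is what lets analysis, transfer and generation be developed separately.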
The 3 steps of a system developed under Ariane-G5
Optional EXPANS phases
[Diagram: the chain of six phases from string to string, morphological analysis (ATEF), structural analysis (ROBRA), lexical transfer (EXPANS), structural transfer (ROBRA), structural generation (ROBRA) and morphological generation (SYGMOR), with their dictionaries and grammars, and with optional phases marked OPT inserted in the chain, each with its own dictionary]
OPT: Optional EXPANS phase
The interface structures are explicitly defined
[Diagram: the same chain of phases, from morphological analysis (ATEF) to morphological generation (SYGMOR), with the structure exchanged at each step named explicitly: strings at both ends and trees in between, including an m-structure and a g-structure, plus the source and target interface structures (labelled a-structure and s-structure in the figure)]
Bernard Vauquois’ linguistic methodology
The interface structures contain three levels
• Syntagmatic level (bracketing)
• Syntactic and logico-semantic levels, via the decorations
Example (in French):
« Cette personne a démonté le meuble. » (“This person dismantled the piece of furniture.”)
Syntagmatic analysis level
PROPOSITION
  GROUPE NOMINAL: "CE", "PERSONNE-N"
  NOYAU VERBAL: "AVOIR-V", "DE!1MONTER-V"
  GROUPE NOMINAL: "LE", "MEUBLE-N"
The geometry of the analysis tree reflects the syntagmatic level
« Cette personne a démonté le meuble. »
The syntactic and logico-semantic analysis levels are derivable from the geometry and decorations of the interface structure
Syntactic analysis level
DE!MONTER-V
  SUJET: PERSONNE-N
    DESIGNATEUR: CETTE
  AUXILIAIRE: AVOIR-V
  OBJET: MEUBLE-N
    DESIGNATEUR: LE
« Cette personne a démonté le meuble. »
The deep level will be the same for the three paraphrases (active, passive and nominalised forms):
→ Cette personne a démonté le meuble.
→ Le meuble a été démonté par cette personne.
→ Le démontage du meuble par cette personne.
Logico-semantic analysis level
PREDICAT: 'de!1monter-v'
  ARGUMENT 0: 'personne-n'
  ARGUMENT 1: 'meuble-n'
« Cette personne a démonté le meuble. »
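How one decorated tree can carry the three levels at once may be easier to see in code. The sketch below is an assumption-laden illustration, not the Ariane-G5 data model: the `Node` class and the decoration names `SF` (syntactic function), `RL` (logical relation) and `head` are hypothetical. The geometry gives the syntagmatic level, and the decorations give the other two.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str                                            # syntagmatic class or lexical unit
    decorations: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


# « Cette personne a démonté le meuble. »
tree = Node("PROPOSITION", children=[
    Node("GROUPE NOMINAL", {"SF": "SUJET", "RL": "ARGUMENT 0"}, [
        Node("CE", {"SF": "DESIGNATEUR"}),
        Node("PERSONNE-N", {"head": "yes"}),
    ]),
    Node("NOYAU VERBAL", {"RL": "PREDICAT"}, [
        Node("AVOIR-V", {"SF": "AUXILIAIRE"}),
        Node("DE!1MONTER-V", {"head": "yes"}),
    ]),
    Node("GROUPE NOMINAL", {"SF": "OBJET", "RL": "ARGUMENT 1"}, [
        Node("LE", {"SF": "DESIGNATEUR"}),
        Node("MEUBLE-N", {"head": "yes"}),
    ]),
])


def leaves(node: Node) -> List[str]:
    """Lexical units in surface order (read off the geometry)."""
    if not node.children:
        return [node.label]
    return [label for child in node.children for label in leaves(child)]


def logical_level(root: Node) -> Dict[str, str]:
    """Predicate-argument structure (read off the decorations)."""
    result: Dict[str, str] = {}
    for group in root.children:
        relation = group.decorations.get("RL")
        head = next(c for c in group.children if c.decorations.get("head") == "yes")
        if relation:
            result[relation] = head.label
    return result


print(leaves(tree))
print(logical_level(tree))
```

A passive or nominalised paraphrase would have a different geometry but, in this representation, the same `logical_level` result.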
Multilingualism in Ariane-G5
Analysis and generation steps are monolingual (independent)
(SL: Source language) (TL: Target language)
• Analysis: source text in SL1 language → SL1 Analysis; source text in SL2 language → SL2 Analysis
• Transfer: SL1 → TL1, SL1 → TL2, SL2 → TL1, SL2 → TL2
• Generation: TL1 Generation → target text in TL1 language; TL2 Generation → target text in TL2 language
N^2-N transfers (N = number of languages)
Multilingualism in Ariane-G5
Analysis and generation steps are monolingual (independent)
(SL: Source language) (I: Interface structure) (TL: Target language)
• Analysis: source text in SL1 language → SL1 Analysis; source text in SL2 language → SL2 Analysis
• Transfer to the interface structure: SL1 → I, SL2 → I
• Transfer from the interface structure: I → TL1, I → TL2
• Generation: TL1 Generation → target text in TL1 language; TL2 Generation → target text in TL2 language
The number of transfers is reduced from (N^2-N) to 2x(N-1) by using a double transfer or a pivot
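The two formulas can be compared for a few values of N with a short sketch (function names are illustrative):

```python
def direct_transfers(n: int) -> int:
    """One transfer per ordered pair of distinct languages: N^2 - N."""
    return n * n - n


def pivot_transfers(n: int) -> int:
    """One transfer to and one from the pivot for each other language: 2 x (N - 1)."""
    return 2 * (n - 1)


for n in (2, 4, 8, 20):
    print(n, direct_transfers(n), pivot_transfers(n))
```

For the 8 languages of the “Little Prince” work, this gives 56 direct transfers against 14 with a pivot, and the gap grows quadratically with N.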
The Heloise framework
Editing / compiling a grammar
The Heloise framework
Analyzing the traces of a translation
3 – Current activities around Heloise
3.1 – Reducing the development complexity
Reduce the development complexity
• In order to:
• Make the development of a system accessible to non-specialist groups,
• Reduce the development times,
• One has to:
• Simplify the development of dictionaries,
• Simplify the development of grammars,
• Explain the methodology in an educational manner with illustrative examples,
• Provide a collaborative exchange platform.
Simplify the development of dictionaries
• Examples of dictionary programming language (entries of a morphological analyzer):
EMBALLAGE ==NPL1 (FT01387 ,EMBALLER-V ).
EMBARQU ==AIM (FT01390 ,EMBARQUER-V ).
→ Use a lexical database for the monolingual lexical information (easier to maintain)
→ Use a tool to generate the morphological analyzer (ATEF) and the morphological generator (SYGMOR)
→ Possibly use dedicated tools to manage (create, delete, modify) the lexical data
• Populate the lexical database with all freely available lexical resources (personal, WordNet…)
• Manage the lexical information by layer (Sylviane Chappuy’s “layer cake” or “millefeuille”)
→ Morphological, derivational, syntactic, semantic and logical information
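Generating analyzer entries from a lexical database can be sketched as a simple formatter. This is a hypothetical illustration: the record fields and their names are assumptions read off the two sample entries (a stem, a first format code, a second format code and a lexical unit), and the output layout simply copies those samples. It is not the actual tool nor a specification of the ATEF dictionary syntax.

```python
from typing import Iterable, List, NamedTuple


class LexicalRecord(NamedTuple):
    """One row of a hypothetical monolingual lexical database."""
    stem: str
    first_format: str    # e.g. NPL1, AIM in the sample entries
    second_format: str   # e.g. FT01387 in the sample entries
    lexical_unit: str    # e.g. EMBALLER-V


def to_dictionary_line(record: LexicalRecord) -> str:
    """Render a record with the layout of the sample entries."""
    return (f"{record.stem} =={record.first_format} "
            f"({record.second_format} ,{record.lexical_unit} ).")


def export(records: Iterable[LexicalRecord]) -> List[str]:
    """Render all records, sorted by stem for stable output."""
    return [to_dictionary_line(r) for r in sorted(records, key=lambda r: r.stem)]


database = [
    LexicalRecord("EMBARQU", "AIM", "FT01390", "EMBARQUER-V"),
    LexicalRecord("EMBALLAGE", "NPL1", "FT01387", "EMBALLER-V"),
]

for line in export(database):
    print(line)
```

Keeping the lexical data in records and generating the entries means a correction is made once in the database rather than in each hand-written dictionary file.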
Simplify the development of grammars
• Specify the grammars with the “static grammars” formalism
• Correspondence between a string and a tree (String-Tree Correspondence Grammars / STCG)
• Derive the analyzers and the generators from the “boards” of a static grammar
• Intensive work on static grammars in France and in Malaysia
• Sylviane Chappuy
• Christian Boitet
• Zaharin Yusoff
• Tang Enya Kong
• Amélie Bosc
• …
Principle of a STCG board
Explain the methodology in an educational manner
• Many methodological documents already exist
• Audience = non-specialists
→ This methodology needs to be presented differently
→ Step-by-step approach
→ Many illustrative and executable examples
• Guillaume de Malézieux’s approach:
• Provide a detailed example in several language pairs for the “Little Prince”
• This tale by Saint-Exupéry has been widely translated (270 languages)
• Current work on 8 languages: Croatian, English, French, Hindi, Indonesian, Khmer, Lao and Thai
• Newcomers willing to develop their own system can become familiar with the method by writing the dictionaries and grammars for the “Little Prince” in their own language, by imitation (mimesis)
Several sentences of the “Little Prince” in 8 languages
• Croatian: Kad mi je bilo šest godina vidio sam jednu veličanstvenu sliku u nekoj knjizi o prašumi koja se zvala Istinite priče Slika je predstavljala zmijskog cara kako guta neku zvijer Evo tog crteža
• English: Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing.
• French: Lorsque j’avais six ans, j’ai vu, une fois, une magnifique image, dans un livre sur la Forêt Vierge qui s’appelait « Histoires Vécues ». Ça représentait un serpent boa qui avalait un fauve. Voilà la copie du dessin.
• Hindi: मैं कोई छः साल का रहा होऊंगा। एक किताब मिल गई जंगल की सच्ची कहानियां। उस में एक अद्भुत तस्वीर थी। अजगर एक जंगली जानवर को निगल रहा था। यह रही वह तस्वीर।
• Khmer: [text not recoverable from the extracted text]
• Lao: ຕອນຂ້ອຍມີອາຍຸພຽງ ໖ ປີ ຄັ້ງນຶ່ງຂ້ອຍໄດ້ເຫັນຮູບສວຍງາມໃນປື້ມກ່ຽວກັບປ່າດົງດິບທີ່ເອີ້ນວ່າ « ປະຫວັດ ຈິງ ». ມັນໄດ້ມີງູເຫຼືອມກຳລັງກືນສັດປ່າ. ນີ້ຄືຮູບລອກມາ.
• Indonesian: waktu berusia enam tahun , aku pernah melihat gambar luar biasa dalam buku tentang hutan perawan yang berjudul Kisah-kisah nyata . gambar itu memperlihatkan seekor boa pembelit sedang menelan binatang buas inilah salinan
• Thai: เมื่อ ตอน ฉัน อายุ ได้ ๖ ขวบ ฉัน ได้ เห็น รูปภาพ จับใจ รูป หนึ่ง ใน หนังสือ เกี่ยวกับ ป่าดงดิบ ชื่อ ว่า " ประวัติ ชีวิต ธรรมชาติ " รูป นั้น เป็น รูป งูเหลือม กำลัง กลืน สัตว์ป่า นี่ คือ รูป ลอก ของ ภาพ นั้น
One needs a collaborative exchange platform
• Groups of people scattered across the planet need a good collaborative platform so that they can work together comfortably and efficiently, as if they were co-located
• Repositories for:
• Generic documents (methodology, linguistics, programming…)
• Project documents
• Available linguistic modules
• Chat and email services
• Wiki
• News
• …
A collaborative exchange platform: lingwarium.org
3 – Current activities around Heloise
3.2 – Development projects with skills transfer
Cooperative development projects with skills transfer
• First, address the major languages that are poorly covered, or not covered at all, by the current systems
• Supercentral languages ↔ peripheral languages (e.g.: English ↔ Lao, English ↔ Khmer…)
• Agriculture, tourism, information websites...
• Languages of geographically close populations (e.g.: Vietnamese ↔ Lao, Lao ↔ Khmer…)
• Administrations (customs...), regional integration...
• Then go deeper inside less-resourced languages, working by mimesis (gravitational approach)
• English ↔ Tai Dam, Lao ↔ Tai Dam…
Cooperative development projects with skills transfer
• Ongoing collaborative projects
• Developing a new version of the User Interface of Heloise (V. Berment)
• Tuning a step by step illustrated design methodology based on the “Little Prince” (G. de Malézieux + the contributors for Hindi, Indonesian, Thai…)
• Improving the definition and developing tools for the static grammars (A. Bosc, S. Chappuy, C. Boitet, V. Berment…)
• Documenting and improving the French-English and English-French prototypes
• Developing a Lao-French prototype (D. Bouquetvichit)
• Developing a Khmer-French prototype (G. de Malézieux)
• Developing a morphological analyzer for the German language (J.-P. Guilbaud)
• Developing a morphological analyzer for the Lithuanian language (J. Kapočiūtė-Dzikienė)
• Developing a morphological analyzer for the Quechua language (M. Duran, J. Sitko)
• …
Cooperative development projects with skills transfer
• Pending proposals:
• FraCroTA: LIG Grenoble, FER Zagreb
→ A number of tasks including a speech MT system for French ↔ Croatian
• GETALCO (UNESCO): Taranis Paris, LIG Grenoble, INaLCO Paris
→ A number of tasks including the development of a system for an under-resourced language and the development of new tools to reduce development times
• Proposals seeking funding:
• WHISTLE: LIG Grenoble, INaLCO Paris, Kasetsart University, Srinakharinwirot University
→ A number of tasks including MT systems for: French ↔ Lao, French ↔ Thai, English ↔ Lao, English ↔ Thai + possibly Korean.
• GMSTech: LIG Grenoble, Taranis Paris + others
→ Common NLP tools for the Greater Mekong Subregion, including word processors and dictionaries based on GMSWord, LaoUniKey, LaoTrans…, leading to the creation of a regional Machine Translation tool based on UNL
Perspectives
YOU ARE WELCOME!