Upload
lamkhanh
View
218
Download
5
Embed Size (px)
Barriers to Localisa-ona case study of Nepal
Pat Hall
2
Wikipedia2009 July 1
Sri LankaMaldives
1,169.5
20.20.3
50.0
China1,333.2
India
Pakistan
BangladeshMyanmar
Afghanistan
Iran
BhutanNepal29.3 0.7167.5
162.2
74.2
28.2
major migrantcommunities
worldwide
28th Sept 2009 LRC 2009
brief background to Nepal
popula-ons –2009 es-mated millions
28th Sept 2009 LRC 2009 3
language diversity in South Asia
Pushtu
English pervasive- only 5%
Hindi
Tibetan
Urdu
Bangla
Sinhala
Tamil
Kashmiri
Punjabi DzongkaNepaliSindhi
Gujurati
Telugu
Malayalam
Kannada
Marathi
Assamese
Oriya
INDIAOver 500 languages 22 mother-tongues with > 1 million speakers 114 mother-tongues with > 10,000 speakers22 official in India
Tibeto-Burmese
Dravidian
Indo-European
Indo-European
scholarshipSanskrit
Farsi
large diaspora
Austro-Asiatic (Munda)
28th Sept 2009 LRC 2009 4
linguis-c history and poli-cs
• Gorkhali (later Nepali) official language since 1768– mother tongue of ruling elite from Gorkha
– 1951-‐1990 ‘one na-on, one culture, one language’
• around 100 other languages– Nepali now almost universally spoken – 98%
• and in neighbouring regions
– 5 other languages with mature wriVen tradi-ons
– 60 completely unwriVen
• 1990 enabled other mother-‐tongue educa-on– the irony of English
• 2006 interim cons-tu-on enabled use of other languages
• 2008 manda-ng mother-‐tongue teaching
5
Wri-ng through the computer
• all South Asian writing derived from ancientBrahmi– abugida (alphabetic with implicit vowel)– largely phonetic - but phonology changes with time– multiple forms (conjuncts)
– maybe only 50 languages across South Asia are written– around 50% literacy
• new writing systems for unwritten languages– by linguists, by missionaries (SIL)
Tamil Kannada
Hindi (Devanagari)
28th Sept 2009 LRC 2009
an old mechanical typewriterdirect connec-on between key and type
6
7
keyboard
internal codes
communications/applications
font tablecode -> dotmap
early computersbased on typewriters
dot matrixprinter
electrical signals“codes”
ASCIIISO 646
ASCII60-70 Roman letters5-10 control7+1 bitsISO646
EBCDIC8 bitsmore
characters
8
indic “hack font” developerskeyboards, codes, fonts dependent
internal codes
communications/applications
font tablecode -> bitmap
1. designyour keyboard
2. acceptwhatever internalcode arises
3. use tool tocreate fonts
disasterNO communication
one-to-onecorrespondence
keyboard
charconceptuallinkage
128 Roman/control128 Indic characters8 bit BYTE
9
encodings• ISCII
– IIT Kanpur, then CDAC Pune
– leVers, not glyphs• like Arabic• needs renderer
– based on Brahmi view
– single code table with language switch to render “same”leVer appropriately
– sequence of characters as spoken
• Unicode– based on ISCII but separate tables for each script
• s-ll controversial– new Tamil standard
hardware GIST cardthen software (DLL)cost prohibitive
28th Sept 2009 LRC 2009 10
keyboard
internal Unicodes
communications/applications
font tablecode -> bitmap
modern computersseparate issues
laser or inkjetprinter
key mappingkey -> code
the essential
writing system
letters andcharacters
not graphics
28th Sept 2009 LRC 2009 11
Nepali language into Computers
• Desk top publishing with PCs– hack fonts to give some representa-on of the wri-ng
• 128 places not enough for everything• implicitly defined an internal code
– fonts copied widely and then changed slightly
– result – inability to exchange data
– 1997 aVempts to standardise in context of emerging Unicode• impressed by ISCII, but not interna-onal
– some fonts also developed for other languages
• today, accept Unicode Devanagari for Nepali– 1997 IDRC and later UNDP UNESCO funded keyboard drivers and fonts
• can use Indian open type fonts, though not liked
– s-ll no agreed keyboard layout for Nepal
two key issues in technology projects
Research versus applica-on• doing ICT4D projects requires new understanding
• what understanding?– ACM paper claims must deliver computer science research
Technology transfer versus knowledge transfer
• do we do it for our beneficiaries?• or teach them how to do it for themselves?
Could academic objec-ves cloud development objec-ves?
28th Sept 2009 LRC 2009 12
28th Sept 2009 LRC 2009 13
Nepalinux at MPP funded by IDRC
• 1997, 2000 produced Nepali Unicode-‐driver add on to Windows
• 2004 joined PAN localisa-on project
• Debian Linux with GNOME desktop, KDE desktop, OpenOffice, ...– release 1 December 2005
– release 1.1 October 2006
• FOSS cri-cally important here
• then PDAs, mobile phones, OCR
• in parallel Microsoi produced LIP for Nepali– launched November 2005
what is involved in localisa-on?
• transla-on of all text in screens, menus, and help– available in separate ‘resource’ files
– agree terminology in na-onal commiVees
– store past transla-ons in ‘transla-on memory’
• develop spell checker– need standard spellings
– morphology
• other capabili-es that might be needed– code converters = hack fonts to unicode
– develop fonts
– text-‐to-‐speech
– OCR
– is a grammar checker needed?
28th Sept 2009 LRC 2009 √14
need to research into languages
but not into technology
28th Sept 2009 LRC 2009 15
NeLRaLEC –funded by the EUOpenU, LancasterU, GoterborgU, ELRA, MPP, TribhuvanU
for Nepali, even though not (yet) endangered
• Nepali Na-onal Corpus– text (5 M words, 100K parallel) and speech (4hrs+130K words)
• Nepali dic-onary – aiming at 100,000 words– word list and entries for most frequent
• linguis-c tools
• fonts for Nepali– and Maithili
• speech genera-on
• trials in schools and universi-es– localised schools management soiware
• computa-onal and corpus linguis-cs course in University
ensures sustainability
renamed Bhasha Sanchar
28th Sept 2009 LRC 2009 16
Nepali Na-onal Corpus – what people actually write
advice from Lancaster University
• wanted material from 1991/2 to match corpora in US and UK– not much material then, just aier end of repressive Panchayat era
• wanted in fixed genres with strong western bias– very liVle science and technology, 1 science fic-on, no westerns (or Kung Fu)
• CORE – aimed a 1 million words– collected 500 documents, had them typed
– some got lost in civil disturbances and lack of records
• General – opportunis-c, at least 4 million words– already digi-sed but in hack fonts, needed conversion
– when to stop?• English dic-onaries used 200 million words or more
– eventually stopped at 13 million words, good enough will do
28th Sept 2009 LRC 2009 17
Nepali Corpus-‐based Dic-onary
first ever in South Asia
• needed soiware to store entries as produced– mul-ple lexicographers
– could not find suitable exis-ng system
• developed own system to meet needs of Nepali and linguists– 2 soiware engineers in exploratory development
• developed entries using Oxford’s Xiara concordance system– linguists s-ll learning, each did things differently– at around 20,000 entries realised quality problem
• started again, reusing earlier entries or producing new ones– only reached 8,000, okay for an on-‐line dic-onary, but not in print
• expected linguists to con-nue aier project– but didn’t because not paid
28th Sept 2009 LRC 2009 18
Nepali Font Development• hiring font developers
– original person went to UK to do masters
– recruited two graphic designers, arranged training
• short training from Reading in Kathmandu, open to public– some hazards from civil disturbances
• font development process– draw many examples of characters needed, including ligatures
– scan into computer, add rules of combina-on
– test and improve constantly
• eventually agreed needed further training in UK– two months in Reading
• outputs– font for Nepali in two versions
– font for Maithili Mithilaaksha style of wri-ng.
28th Sept 2009 LRC 2009 19
Nepali TTS speech genera-on
based on Fes-val/Festvox concatena-ve synthesis
• expected help from UK, Roger Tucker and Ksenia Shalapova– Ksenia refused to travel to Nepal to give training and guidance
• had to help our 2 soiware engineers in other ways– spent 1 month in Hyderabad with Kishore and Rajeev Sangal
• recorded voices– selected words containing all 1764 diphones, chose 1,200 sentences
– had to research speech – eg “Schwa dele-on”
• developed TTS through several versions, adding prosody– reasonable quality, judged reasonably “natural”
• wanted to make screen reader for visually disabled and illiterates– latest Linux supported this, so in Nepalinux and Ubuntu
– Fes-val not on Windows, could not find genera-on engine
evalua-on of use of Nepali soiware
Nepalinux and MS LIP launched late 2005• 1000 CDs distributed, 500 aVended demos
– heard not being used, interes-ng novelty
• Conducted surveys to find out– what was really going on, and why?
– grounded theory, analysed with NVIVO
– done by Ganesh Ghimire and Maria Newton
what we found outITID journal, Vol 5, Issue 1 -‐ SPRING 2009
– normal social processes were at work
• should we move to other 100 languages of Nepal?
28th Sept 2009 20LRC 2009
1. first impressions of computers
Problem: hardware not localised– cabinets labeled in English
– keyboards not marked in local script
Example: delivered computers to schools– could only give keyboard layout charts
• not well produced
– techies claimed phone-c keyboards easy• but nobody spoke or typed English!• ironical – claim systems for non-‐speakers of English.
Solu-on: get keyboards marked for Nepali– keyboards in Thailand are marked in Thai!
28th Sept 2009 21LRC 2009
2. transla-on qualityProblem: technical terms not understood
– Nepali terms agreed by commiVee
Example:– “I have used all its func-on because I am a writer. Some of the words like
“radditokari” and “anuprayog” seem to be the unusual ones. They look likethey have been directly borrowed from Sanskrit and that make Nepali evenmore difficult than English” – banking officer
– “It’s good for people who are trained in Nepali that don’t have exposure toEnglish. For people like us who have already started to use one system itbecomes difficult to switch over” – linguist from Kathmandu
Solu-on:– don’t translate, English terminology oien bisarre, maybe transliterate– listen to users, standardise terminology
– will they get used to it anyway?
28th Sept 2009 22LRC 2009
3. teacher/trainer iner-a
• Problem: trainer not familiar with the localised soiware
• Example: at Sankhu telecentre– language of instruc-on -‐ Nepali
– started teaching Nepali interfaces
– switched back to English interfaces
• Solu-on:– train the trainers
• in soiware as well as hardware
– don’t give op-on of English interface
28th Sept 2009 23LRC 2009
4. cri-cal mass of users
Problem: want to get help from others
Examples:– Nepal’s na-onal library
• one user, only used to catalogue Nepali books
– government officer taught himself
Solu-on: need to create cri-cal mass,– train complete organisa-on
– restrict opportunity to use English• example – telephone, video tape formats
Sociology– “social interac-on” Brock and Durlauf
– “social embeddedness”
28th Sept 2009 24LRC 2009
5. language shii and social mobility
Problem: people change to dominant language– see economic advantage in English or Nepali or ...
Example:– “Yeah I liked, but when we used Nepali windows that -me I feel we are
going to forget English language” – telecentre social mobilizer
– “I have daughter and I will not ask my daughter to use the Nepaliinterface because I want my daughter to be good in English” – computeracademic
Solu-on: accept that for some languages– not worth localising the OS, but enable content
– suppor-ng the language may reduce the shii
Sociology– “sanskri-sa-on” caste mobility -‐ Srinivas
28th Sept 2009 25LRC 2009
6. soiware eco-‐system
Problem: soiware works with other soiware– cannot use Unicode for informa-on exchange!
– other soiware is not Unicode compliant, use hack fonts
– but can do Desk Top Publishing
Example: journalists– “But it has font problem. In publica-on houses mainly they use
pree-, kan-pur so it’s not worthy for nepali compu-ng” – journalist
Solu-on: raise an Open Source project to produce compliantpublishing soiware.
28th Sept 2009 26LRC 2009
7. support all communi-es
Problem: language communi-es very small– commercial development not viable
Example: Lohrung Rai in Nepal• subject of socio-‐linguist study by Jens Allwood, Yogendra Yadava,
and Bhim Regmi
– 1,207 ‘mother-‐tongue’ speakers
– language not yet wriVen, want Roman system.
Solu-on:– localise for local lingua franca
• create wri-ng close to that of lingua franca
– only enable content in local language
28th Sept 2009 27LRC 2009
8. cost of entry for new language
Problem: Transla-on cost can be significant
Example: Gnome interface for Linux– 40,000 ‘strings’ = 500,000 words approx
– grows by 5 to 10% a distribu-on
– at 1,500 words per day, this takes 300 days• 30 days for a new release
Solu-on: avoid manual transla-on– machine transla-on?
– be smart technically
– language genera-on from model of soiware
28th Sept 2009 28LRC 2009
Conclusions
localiza-on must be TOTAL– all hardware including keyboards
– all soiware in use by a community
– transla-ons sensi-ve to language poli-cs
• otherwise it will rejected
entry cost must be minimal
• cheap and easy to localize for a new language– base interac-on by language genera-on
– part of s/w development process
– some-mes only enable content
next step – harmonised wri-ng and encoding
• then machine (assisted) transla-on28th Sept 2009 29LRC 2009