Localization and Language Technology Standards

Kavi Narayana Murthy, University of Hyderabad

ELITEX - 2007, New Delhi, 10-11 January 2007

Kavi Narayana Murthy UoH


Outline
- Character Encoding Standards
- Fonts, Glyphs, Mapping Standards
- OS/Browser Support, Drivers
- Transliteration, Romanization
- Translation, Linguistic Resources
- Speech and OCR Technologies
- Enforcement


Goals
- Functionality
  - Whatever we can do with English, we must be able to do with our own languages and scripts with equal ease
- Inter-operability, Platform Independence
  - All applications must work seamlessly on all hardware and software platforms
- Language and Script Independence
  - Multi-lingual, Multi-Script Support


Standards
- Even a poor standard is better than no standard
- Standards save us a lot in the long run
- Commercial forces promoting non-standard, proprietary, secret systems must not be allowed to succeed
- Let us not say “Let the Market Decide”!!!


Character Encoding Standards
- ISCII and Unicode
- ISCII is a BIS Standard, Unicode is not
- Unicode (for Indian scripts) is based on ISCII
- In some sense, Unicode is a step in the backward direction
- Let us understand ISCII first


Language and Script
- Do not confuse one for the other
- Many-to-Many
- Script is neither language nor font
- Script and SuperScript
- Phonetic Basis
- Common SuperScript for all Indian Languages (ILs)
- Script Grammar


Language and Script
- Sanskrit is written in Devanagari, Telugu, Kannada, Bangla etc. scripts
- Devanagari is used for writing Sanskrit, Hindi, Marathi, etc.
- English words are often written (transliterated) in local language scripts


Phonetic Basis
- Words: Meanings, Sounds, Written Symbols
- Meanings are supreme but difficult to quantify and encode
- Sounds are the next best
  - A ‘ka’ sound is a ‘ka’ sound, whatever be the language – Hence ‘Universal’
- No need for ‘Spellings’
- What we write is what we speak - directly


Orthography
- Written symbols correspond with phonemes – basic sound units
- Minor variations in sounds (allophones, co-articulation effects etc.) are not depicted in orthography
  - t: Mountain, tea, truck, spilt, little
- Special Symbols not to be confused with basic Characters


What is a Character?
- Indian Languages:
  - No ‘alphabet’, no letters, no spellings
  - Phoneme-based
  - Units are syllable-like: called ‘akshara’-s
  - akshara-s very large in number
    - Corpus studies not sufficient
  - Made up of vowels, consonants etc.
  - Not all sequences valid


Script Grammar
- A Grammar for Scripts
- Allows all valid sequences, only valid sequences
- No need to code all possible akshara-s
- Script grammar must be part of standards: ISCII includes one. Unicode?
- Script Grammar to be enforced by software
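A script grammar of this kind can be sketched as a small finite-state machine. The version below is a deliberately simplified illustration, not the grammar defined in ISCII: it assumes Devanagari text in Unicode, ignores the nukta and other special signs, and the state names and the `is_valid` function are invented for this example.

```python
def classify(ch):
    """Map a Devanagari character to a coarse category (simplified)."""
    cp = ord(ch)
    if 0x0901 <= cp <= 0x0903:
        return "M"  # modifier: chandrabindu, anusvara, visarga
    if 0x0905 <= cp <= 0x0914:
        return "V"  # independent vowel
    if 0x0915 <= cp <= 0x0939:
        return "C"  # consonant
    if 0x093E <= cp <= 0x094C:
        return "S"  # dependent vowel sign (matra)
    if cp == 0x094D:
        return "H"  # halant (virama)
    return None     # anything else is outside this toy grammar


# (state, category) -> next state; a missing entry means "invalid sequence".
TRANSITIONS = {
    ("start", "C"): "cons", ("start", "V"): "vowel",
    ("cons", "C"): "cons", ("cons", "V"): "vowel", ("cons", "H"): "halant",
    ("cons", "S"): "vowel", ("cons", "M"): "mod",
    ("halant", "C"): "cons",
    ("vowel", "C"): "cons", ("vowel", "V"): "vowel", ("vowel", "M"): "mod",
    ("mod", "C"): "cons", ("mod", "V"): "vowel",
}


def is_valid(word):
    """Return True if every character follows a permitted transition."""
    state = "start"
    for ch in word:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:
            return False
    return state != "start"  # the empty string is not a word


print(is_valid("कमल"))   # three consonants, each with the inherent vowel
print(is_valid("ाक"))    # begins with a vowel sign
```

Because the table lists only permitted transitions, the same structure could drive a keyboard driver or editor that refuses invalid input, which is the kind of enforcement the slide asks for.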


SuperScript
- ILs: 10 Scripts with a nearly common sound system – all derived from the ancient ‘braahmi’ script
  - => SuperScript
  - Super Set of all Phonemes
- Common encoding: ISCII
- Extendable to all languages of the world


ISCII: (BIS – 1991: IS 13194)
- 128 codes more than sufficient
- Uses the upper half of the 8-bit code space (128-255); the ASCII half (0-127) is untouched – allows mixing with English
- SuperScript: Transliteration built-in
- Long Standing: ISCII 1988, 1991
- Well thought out and well designed
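The practical effect of leaving the ASCII half untouched can be shown with a short sketch: a program can separate English from Indian-language text by looking at the high bit of each byte alone. The function name `split_runs` is an assumption made for this example, and the three high-bit bytes in the sample are arbitrary values, not claimed to spell a real word.

```python
def split_runs(data: bytes):
    """Split bytes into (kind, run) pairs.

    kind is 'ascii' for bytes 0x00-0x7F and 'iscii' for bytes 0x80-0xFF.
    """
    runs = []
    for b in data:
        kind = "iscii" if b >= 0x80 else "ascii"
        if runs and runs[-1][0] == kind:
            runs[-1][1].append(b)          # extend the current run
        else:
            runs.append((kind, bytearray([b])))  # start a new run
    return [(kind, bytes(buf)) for kind, buf in runs]


# English text with three arbitrary upper-half bytes embedded in it.
sample = b"Hello " + bytes([0xB3, 0xCC, 0xD1]) + b" world"
for kind, run in split_runs(sample):
    print(kind, run)
```

No escape sequences or mode switches are needed for mixed text, which is the design point the slide makes.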


Why did ISCII fail to catch on? (BIS – 1991: IS 13194)
- Silent on Character-to-Font mapping
  - A complex many-to-many mapping
  - Fonts not standardized, fonts not available
- Not registered, no OS/Browser Support
- Rationale not explained
- Not publicized, not enforced


History
- Proprietary, non-standard, secret font-based encoding schemes
- Promoted by commercial companies
- Near Zero Inter-operability
- Ad-hoc ISCII-to-font mapping schemes
- Mapping schemes not made public
- To be made Illegal and Punishable
- Put India back by at least a decade!


Improving ISCII
- Register - To get OS/Browser Support
- Remove encoding of allophones, allographs
- Script Grammar: a finite-state machine (FSM) is enough, a context-free grammar (CFG) is not needed
- Include Rationale, explanatory notes
- Remove Attribute/Extension codes
- Standardize ISCII-to-Font Mapping Scheme
- Promote, Enforce


Character-to-Font Mapping
- Complex scripts – not linear
- Glyphs: shape units convenient for rendering
- Poor correspondence with sound units
- Many-to-Many mappings
  - Glyph selection, scaling, positioning
- No Glyph Encoding Standard


From Character to Font
- Must be provably complete and 100% consistent
- Current systems are all ad-hoc – neither complete nor consistent
- Finite State Transducers: Necessary and Sufficient
- Without restricting Creativity and Flexibility
- Simple, Efficient, Re-Usable
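To make the transducer idea concrete, here is a toy sketch of one well-known non-linear case: in Devanagari the short-i vowel sign is stored after its consonant but drawn to its left. The transducer's only state is "at most one pending consonant", so it is finite-state. The glyph names (`g:` plus the character) and the function `to_glyphs` are hypothetical, and real fonts need many more rules (conjuncts, positioning) than this shows.

```python
I_MATRA = "\u093f"  # DEVANAGARI VOWEL SIGN I, rendered before its consonant


def is_consonant(ch):
    """True for the basic Devanagari consonants KA..HA."""
    return "\u0915" <= ch <= "\u0939"


def to_glyphs(text):
    """Map a character sequence to a glyph-name sequence in visual order."""
    out = []
    held = None  # transducer state: a consonant waiting for a possible i-matra
    for ch in text:
        if held is not None and ch == I_MATRA:
            # Reorder: emit the vowel-sign glyph first, then the consonant.
            out += ["g:" + I_MATRA, "g:" + held]
            held = None
            continue
        if held is not None:
            out.append("g:" + held)  # no reordering needed, flush it
            held = None
        if is_consonant(ch):
            held = ch
        else:
            out.append("g:" + ch)
    if held is not None:
        out.append("g:" + held)
    return out


print(to_glyphs("कि"))  # stored as consonant + sign, drawn as sign + consonant
print(to_glyphs("का"))  # the aa sign stays to the right, no reordering
```

Since every input symbol has a defined transition from every state, completeness of such a mapping can be checked mechanically, which is harder to do for ad-hoc lookup tables.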


Encoding Standards: Unicode
- For Language/Script/SuperScript?
  - CJK. Why not for ILs?
- Script Grammar?
- Character-to-Font: relegated to font level font effects
- ISCII-88 Based, Has Errors
- Once added, cannot be deleted!


ISCII or Unicode?
- Unicode:
  - To be with the World, to know and be known
  - ‘Correcting’ Mistakes, Improving Standards
  - Support (OS, Fonts, etc.), Education, Training
  - Converting Legacy Data – A Huge Task
- ISCII-to-Unicode is not trivial
- Ignore BIS Standard and embrace what is not yet ‘standardized’?
- Why not co-exist? – Internal and External Views


Keyboard Layouts, Drivers
- Several de-facto standards and many variations in use
- To select a few and standardize
  - So called Roman Phonetic
    - Typing ILs through English!
    - OK for oldies, not for future!
  - INSCRIPT: ISCII Standard, Good for newcomers
- To strictly enforce Script Grammar


Document Encoding Standards
- Plain Text: pure ISCII/UNICODE
  - Mono-lingual Plain Text?
- Annotated Text (Ex. Word Processors)
- XML Style, Open, Readable formats to be encouraged
- Proprietary, secret, non-standard encodings must be discouraged
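As an illustration of what an open, readable, XML-style annotation of multilingual text might look like, the sketch below tags each segment with its language using the standard `xml:lang` attribute. The element names `doc` and `seg` are invented for this example and are not taken from any standard mentioned in the talk.

```python
import xml.etree.ElementTree as ET

# Qualified name of the standard xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

segments = [
    ("en", "Welcome"),
    ("hi", "स्वागत"),
    ("te", "స్వాగతం"),
]

doc = ET.Element("doc")  # hypothetical root element
for lang, text in segments:
    seg = ET.SubElement(doc, "seg")  # hypothetical segment element
    seg.set(XML_LANG, lang)
    seg.text = text

# Serialize as UTF-8 so any tool, on any platform, can read it back.
xml_bytes = ET.tostring(doc, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

The language of every segment is explicit in the file itself, so a reader does not have to guess it from a font name or a proprietary header.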


Transliteration
- Widely used, part of our Tradition
  - Sanskrit texts in local scripts
  - English, Hindi, Urdu words in local scripts
  - Music Compositions
- Automatic in ISCII. Unicode?
- Quality of transliteration
  - To and From English?
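On the "Unicode?" question, one partial answer is that the Unicode blocks for the major Indian scripts are 128 code points wide and follow a parallel, ISCII-derived layout, so a rough transliteration can be obtained by shifting code points from one block to another. The sketch below assumes only that layout; the function name `shift_script` is made up here, and the result is approximate, since some positions are unassigned in some scripts and shared punctuation such as the danda would be shifted wrongly.

```python
# Start of each script's 128-code-point Unicode block.
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
}
BLOCK_SIZE = 0x80


def shift_script(text, src, dst):
    """Move characters of the src block to the same offset in the dst block.

    Characters outside the src block (ASCII, spaces, digits) pass through.
    """
    lo = SCRIPT_BASE[src]
    delta = SCRIPT_BASE[dst] - lo
    return "".join(
        chr(ord(c) + delta) if lo <= ord(c) < lo + BLOCK_SIZE else c
        for c in text
    )


word = "कमल"
print(shift_script(word, "devanagari", "telugu"))
print(shift_script(word, "devanagari", "kannada"))
```

In ISCII the same byte sequence is simply displayed in another script, so no conversion step exists at all; here the shift has to be applied explicitly and its gaps handled case by case.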


Romanization
- Need:
  - Where there is no support for local languages
    - English dailies, posters, advertisements etc.
    - Lack of support: OS/Browser/Fonts etc.
  - Where users prefer Roman
- A variety of ad-hoc schemes in use
  - iTRANS, RTS, W-X, etc.
- Standards badly wanted


Romanization
- Multi-dimensional optimization problem
  - Case Mix-up
    - 26 Letters not sufficient
    - 52 nearly sufficient
    - Not always supported
  - Storage space, Ease of Typing, Aesthetics
  - Scientific/Logical Design/Naturalness
    - English-like – for the oldies: a, ee, oo, a, oa ???
    - Futuristic: aa/ii/uu/ee/oo


Romanization
- Clashes: a+u/au, k+h/kh, s’
- Two way conversion, cyclic check
- Ex. Long Vowels (+ advantage, – drawback):
  - a: – clashes with colon
  - diacritic – not supported
  - IPA – not understood, – not supported
  - A + single char., + saves space, – ugly, – difficult to type, – case mix-up
  - aa + logical (like ee), + easy to type


Romanization: An Example
a aa i ii u uu R RR e ee ai o oo au M H
k kh g gh n~
c ch j jh n`
T TH D DH N
t th d dh n
p ph b bh m
y r l v s’ S s h L
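The clashes and the "cyclic check" mentioned earlier in the talk can be tried out on this symbol inventory. The sketch below is only an illustration: it assumes a greedy longest-match reader, replaces the typographic apostrophe in s’ with an ASCII one, and uses made-up function names. A symbol sequence passes the check if writing it out and reading it back gives the same sequence.

```python
# Symbol inventory copied from the example scheme above
# (ASCII ' stands in for the typographic apostrophe: an assumption).
SYMBOLS = (
    "a aa i ii u uu R RR e ee ai o oo au M H "
    "k kh g gh n~ c ch j jh n` "
    "T TH D DH N t th d dh n "
    "p ph b bh m y r l v s' S s h L"
).split()

# Try longer symbols first so that "kh" is not read as "k" + "h".
_BY_LENGTH = sorted(SYMBOLS, key=len, reverse=True)


def tokenize(text):
    """Greedy longest-match split of a romanized string into symbols."""
    tokens, i = [], 0
    while i < len(text):
        for sym in _BY_LENGTH:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            raise ValueError(f"no symbol matches at position {i}: {text[i:]!r}")
    return tokens


def survives_round_trip(symbols):
    """Cyclic check: symbols -> romanized string -> symbols."""
    return tokenize("".join(symbols)) == list(symbols)


print(len(SYMBOLS))                                          # size of the inventory
print(survives_round_trip(["k", "a", "m", "a", "l", "a"]))   # unambiguous
print(survives_round_trip(["a", "u"]))                       # clashes with "au"
print(survives_round_trip(["k", "h"]))                       # clashes with "kh"
```

The inventory is larger than the 26 lower-case letters, which is why the scheme relies on digraphs and upper-case letters, and the digraphs are exactly where the two failing checks come from. A standard would need an explicit separator or a rule for such sequences.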


Translation
- Create Material Afresh
- Translate by Hand
- Automatic/Machine Translation
- Machine Aided Translation
- English – Local Language Translation
- Local – Local Language Translation


Translation
- Resource Intensive
  - Manpower, Time, Cost
- Quality/Uniformity
  - Standards, Bench-Mark Data, Testing and Evaluation Procedures
- Dictionaries, Terminology Databases
- Pan-Indian Terms/Sanskritize/Localize


Linguistic Resources
- Dictionaries – General, Domain Specific
- Terminological Databases
- Thesauri, WordNets, Ontologies
- Morphological Analyzers, Generators
- Spell/Grammar/Style Checkers
- Annotated Text and Speech Corpora


India: Future is in Speech
- One Billion People, A Sixth of the World
- More than 150 Languages, 22 Recognized
- 95% not comfortable with English
- Computers, Current, Connectivity
- Info Revolution benefits: Majority Deprived
- 10 M Computers, 100 M Phones
- Future is in Speech


Speech
- Natural
- Easy, Fast
- Hands-Free
- No need to Learn
  - Technology
  - Language
- Available to all


Text and Speech
- Speech is Natural
- Reading/Writing is learnt, Artificial
- Some never learn – Illiterates
- Oral Tradition
- Speech is more permanent than Text!
- “I did not steal that ring of gold”
- Trust Yourself!


Speech Technologies
- Speech Recognition: Speech to Text
- Speech Synthesis: Text to Speech
- Speaker Recognition, Verification, ID
- Speech Coding/Decoding, Compression
- Slow down, Speed up
- Speech as Evidence


Applications
- Telephone Dialing
- Form Filling
- Dictation Machine
- Command and Control
- Voice enabled Web
- OCR+WP+TTS
- MT: Cross-Lingual IR, S2S


OCR
- OCR in Local Scripts Needed
  - To digitize and save legacy data
  - To compile/process/edit/refine data
- For Printed Texts/Manuscripts
- Old Data
  - deterioration of paper
  - old type fonts, problems of typesetting


Multi-Modal Interfaces

To Reach out to 1 Billion People, we must get the best of many worlds:
- Speech Recognition and Synthesis
- Graphics and iconic Interfaces
- OCR Technologies
- Translation, CLIR
- Camera, Gestures, Touch Screen


Balance Between
- Backward Compatibility and Future-Proof Designs
- Quick Fix Solutions and Long Haul
- One Standard or Several?
- Economics and Business Sense versus Social Responsibilities
- Acceptance versus Enforcement


The 3 Most Important Things
1. Develop/Refine/Update Standards
   - Detailed Documentation
   - Including rationale, issues, evaluation, etc.
2. Education and Training
3. Enforcement
   - Make use of non-standard methods illegal and punishable under law

Technical Workshops for detailing

Thank You!

Visit www.LanguageTechnologies.ac.in
