National Informatics Centre, Government of India

“Localization & Language Technology Standards”

Kewal Krishan

Technical Director & Member Secretary

e-Governance Standards Working Group on Localisation & Language Technology (eGS-WG-LLT)

E-Governance Standards

The Government of India has launched the National e-Governance Action Plan (NeGP) to support the growth of e-governance within the country. The Plan envisages creating the right environment to implement G2G, G2B, G2E and G2C services.

Many developments initiated by various government agencies are seemingly done in isolation. Different development platforms are used, and applications built on different platforms are seldom interoperable, so it is difficult to integrate them even though many have similar features and functionality. Added to this is the fact that no single agency is responsible for framing enforceable e-governance standards and processes that all developers must adhere to.

In view of the strategic and contemporary importance of standards for e-Governance, the Department of Information Technology has constituted an apex body to oversee the process of bringing out e-governance standards. The following five areas have been identified:

1. Network and Information Security
2. Metadata and Data Standards for Application Domains
3. Quality and Documentation
4. Localization and Language Technology Standards
5. Technical Standards and E-Governance Architecture

Language diversity in India

• India is a multi-lingual and multi-script country with
  - Over 500 languages;
  - 216 mother tongues with more than 10,000 dialects in use;
  - 22 constitutionally recognized languages

• Only 6% of the Indian population speaks English

• Operating systems as well as software tools and application packages are in English

• Roadmap: promote and achieve e-Governance through the language in which the common man can speak, transact, understand, generate content and communicate

E-Governance

• Language plays a vital role in successful implementation of e-Governance

• Need for user-friendly interfaces.

• Especially targeted towards the rural environment

Need of Standards

• Interoperability and Information Sharing between systems.

• Interoperability between systems supplied by different vendors.

• “Platform-Independent Modelling” approach.

• Increased Adaptability & Flexibility.

Standardization

• Storage standards
• Font standards
• Inputting standards
• Transliteration / Roman equivalent
• Sorting order / sequence for Indian languages
• OCR standards
• Standards for website and e-mail
• Local search engine standards
• Availability of all constitutionally recognised Indian languages in all operating systems
• Strategy for conversion of data from ISCII to UNICODE

Make all Government services accessible to citizens through Common Services Centres in their own language.

SCA – Service Centre Agencies

VLE – Village Level Entrepreneur

NLSA – National Level Service Agency

Scheduled Languages in descending order of strength - 2001 Census

No.  Language / Script                                      % Population
1.   Hindi / Devanagari                                     40.22%
2.   Bengali / Bangla                                       8.30%
3.   Telugu / Telugu                                        7.87%
4.   Marathi / Devanagari                                   7.45%
5.   Tamil / Tamil                                          6.32%
6.   Urdu / Perso-Arabic                                    5.18%
7.   Gujarati / Gujarati                                    4.85%
8.   Kannada / Kannada                                      3.91%
9.   Malayalam / Malayalam, Malayalam-2                     3.62%
10.  Oriya / Oriya                                          3.35%
11.  Punjabi / Gurumukhi, Shahmukhi                         2.79%
12.  Assamese / Bangla (Modified)                           1.56%
13.  Sindhi / Devanagari, Gujarati, Roman, Perso-Arabic     0.25%
14.  Nepali / Devanagari                                    0.25%
15.  Konkani / Devanagari, Kannada, Roman                   0.21%
16.  Manipuri / Bangla, Manipuri-new                        0.15%
17.  Kashmiri / Perso-Arabic                                0.01%
18.  Sanskrit / Devanagari                                  0.01%
19.  Bodo / Devanagari, Bangla (Modified)                   0.23%
20.  Dogri / Devanagari                                     0.1%
21.  Santhali / Devanagari, Ol Chiki                        0.56%
22.  Maithili / Meetei-Mayek                                2.8%

Areas & Issues identified

1. OS Support
   i) Locales and Sorting
   ii) User Interface
   iii) Searching
   iv) Rendering on PC in Application/Browser: Display, Layout
   v) Character Encoding: Unicode, ISCII
   vi) Inputting Methods
       - Keyboard Layouts: Typewriter, Inscript, Phonetic
       - Online Handwriting Recognition
       - Text OCR
       - Speech to Text
2. Content Creation
   - Editors for Desktop & Web (FrontPage, Quanta Plus, Dreamweaver, Rational Site Developer, W3Cindia.in compliant certified markup etc.), ODF and related standards
   - Browser Support
3. Resources and Tools
   - Processing Resources: Spell Checker
   - Language Resources: Dictionaries, Ontology, Glossary, Lexicon, Thesaurus
   - Annotated Corpora: Text & Speech
   - Machine Translation
   - Transliteration (Baraha software, freeware; International Components for Unicode, ICU)
   - Database Support: Data Storage & Retrieval
4. Search Engine Support (Google, Yahoo, MSN, Lucene, Raftar, Khoj etc.)
5. Localized Applications
   - Interoperability between Platforms and Technologies

So far we have organized ten brainstorming sessions (in the Tamil, Telugu, Malayalam, Kannada, Marathi, Gujarati, Oriya, Assamese, Bangla and Hindi languages) at the following locations:

1. Chennai - 20th Feb., 2006
2. Mumbai - 13th April, 2006
3. Trivandrum - 15th May, 2006
4. Kolkata - 2nd June, 2006
5. Bhubaneswar - 19th June, 2006
6. Guwahati - 26th June, 2006
7. Ahmedabad - 11th July, 2006
8. Bangalore - 08th August, 2006
9. Hyderabad - 20th September, 2006
10. Chandigarh - 29th September, 2006

First Working Group Meeting held at NIC, Delhi - 25th July, 2006

Technology Providers' meeting held at NIC, New Delhi - 15th November, 2006

Second Working Group Meeting held at NIC, Delhi - 09th January, 2007

My colleague Shri M.D. Kulkarni, Director, C-DAC, and I prepared the draft Roadmap for Localization & Language Technology Standards. It was later modified on the basis of feedback received from:

a) members who attended the brainstorming sessions and the First and Second Working Group meetings;

b) technology providers.

1. OS Support

1.1 OS Support under Windows, Linux, MAC OS

Current Issues/Status:
• In Windows 2000/XP, support is available for 13/22 constitutionally recognized Indian languages (Bangla, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Punjabi, Sanskrit, Tamil, Telugu and Urdu).
• Windows Vista added support for Oriya and Assamese, giving 15/22.
• In Red Hat Linux, support is available for 9/22 (Bangla, Gujarati, Hindi, Marathi, Oriya, Punjabi, Malayalam, Tamil and Telugu).
• In MAC OS X, support is available for 4/22 (Gujarati, Hindi, Punjabi and Tamil).
• Not supported in Windows: Bodo, Dogri, Kashmiri, Maithili, Manipuri, Sindhi, Santhali.
• Not supported in Red Hat Linux: Assamese, Bodo, Dogri, Kannada, Konkani, Kashmiri, Maithili, Manipuri, Nepali, Sindhi, Santhali, Sanskrit, Urdu.
• Not supported in MAC OS X: 18/22 constitutionally recognized languages.

Destination Desired:
• Support for all 22 constitutionally recognised languages must be available in all operating systems.
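
The support counts on this slide can be cross-checked as simple set differences. The sketch below is only an illustration: the language sets are copied from the slide (using its spellings, e.g. "Bangla"), while the names `SCHEDULED`, `SUPPORTED` and `unsupported` are hypothetical and not part of any NIC tool.

```python
# The 22 constitutionally recognised languages, spelled as on the slides.
SCHEDULED = {
    "Assamese", "Bangla", "Bodo", "Dogri", "Gujarati", "Hindi", "Kannada",
    "Kashmiri", "Konkani", "Maithili", "Malayalam", "Manipuri", "Marathi",
    "Nepali", "Oriya", "Punjabi", "Sanskrit", "Santhali", "Sindhi", "Tamil",
    "Telugu", "Urdu",
}

# Languages listed as supported under Windows 2000/XP on the slide.
WINDOWS_XP = {
    "Bangla", "Gujarati", "Hindi", "Kannada", "Konkani", "Malayalam",
    "Marathi", "Nepali", "Punjabi", "Sanskrit", "Tamil", "Telugu", "Urdu",
}

SUPPORTED = {
    "Windows 2000/XP": WINDOWS_XP,
    # Vista is described as adding Oriya and Assamese.
    "Windows Vista": WINDOWS_XP | {"Oriya", "Assamese"},
    "Red Hat Linux": {
        "Bangla", "Gujarati", "Hindi", "Marathi", "Oriya", "Punjabi",
        "Malayalam", "Tamil", "Telugu",
    },
    "MAC OS X": {"Gujarati", "Hindi", "Punjabi", "Tamil"},
}


def unsupported(platform):
    """Return the scheduled languages that the platform does not cover."""
    return sorted(SCHEDULED - SUPPORTED[platform])


if __name__ == "__main__":
    for platform, langs in SUPPORTED.items():
        missing = unsupported(platform)
        print(f"{platform}: {len(langs)}/22 supported, {len(missing)} missing")
```

Run this way, the Vista and Red Hat results reproduce the "not supported" lists given on the slide, which suggests the 15/22 and 9/22 figures are internally consistent.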

1.2 Locales Data

Current Issues/Status:
• Available for 12/22 languages under Windows
• Available for 9/22 languages under Linux
• Presently locale data is insufficient and does not accommodate Indian culture-specific requirements.

Destination Desired:
• The Unicode Common Locale Data Repository (CLDR) is a public, open source locale database. MCIT is a member of the Unicode Consortium; CDAC and State authorities have access to it. This should be used to address additional needs. http://www.unicode.org/cldr/
• Nepali language locale data is available for Nepal as per their requirements.

1.3 Sorting

Current Issues/Status:
• Sort order for official Indian languages is available as part of the Common Locale Data Repository (CLDR).

Destination Desired:
• Should be owned and managed by the various state and national authorities.

1.4 Encoding

Current Issues/Status:
• Unicode characters are almost complete according to the respective language requirements.

Destination Desired:
1. Constant interaction with Unicode for proper representation of Indian languages.
2. There has to be standards enforcement at State level.

1.5 Inputting Mechanism
a) Keyboard Layouts
b) Speech to Text
c) Handwriting Recognition, Text OCR etc.

Present Status:
a) Keyboard Layouts: any inputting method can be used in a Unicode-enabled OS.
   - INSCRIPT keyboard layout is available at OS level.
b) Speech to Text: Shrutlekhan-Rajbhasha is available for the Hindi language.
c) Handwriting Recognition & OCR: technology under development.

Destination Desired:
• Typewriter keyboard and State-level language-specific requirements (KGP keyboard layout for Kannada, TAM99 keyboard layout for Tamil) should be supported at operating system level.

Note: Output of any user-specific keyboard layout must conform to the current version of Unicode.
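
One way to test the note that keyboard output must conform to Unicode is a small validation routine. The sketch below is a rough approximation under stated assumptions: it treats "conforming" Hindi input as text that is already in Normalization Form C and whose non-space characters all fall in the Devanagari block (U+0900 to U+097F). The function names are hypothetical, and a real conformance check would cover far more than this.

```python
import unicodedata

# Devanagari block of the Unicode Standard.
DEVANAGARI_FIRST = 0x0900
DEVANAGARI_LAST = 0x097F


def in_devanagari_block(ch):
    """True if the character lies in the Devanagari block."""
    return DEVANAGARI_FIRST <= ord(ch) <= DEVANAGARI_LAST


def looks_like_unicode_hindi(text):
    """Approximate check on keyboard-layout output (an assumption, not a standard).

    The text must be non-empty, already NFC-normalised, and consist only of
    whitespace and Devanagari-block characters.
    """
    if not text:
        return False
    if unicodedata.normalize("NFC", text) != text:
        return False
    return all(ch.isspace() or in_devanagari_block(ch) for ch in text)


if __name__ == "__main__":
    samples = {
        "Unicode Devanagari": "\u0939\u093f\u0928\u094d\u0926\u0940",  # the word "Hindi" in Devanagari
        "Latin letters": "Hindi",
        "precomposed QA (not NFC)": "\u0958",
    }
    for label, sample in samples.items():
        print(label, looks_like_unicode_hindi(sample))
```

The third sample shows why normalisation matters: the precomposed letter U+0958 is excluded from composition, so its NFC form is the two-character sequence U+0915 U+093C, and layouts that emit different forms would otherwise produce text that compares unequal.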

1.6 Rendering of fonts in Application as well as in Browser

Current Issues/Status:
• For rendering OpenType fonts, a rasterisation engine needs to be built into the OS.

Destination Desired:
• A rasterisation engine is not implemented for all 22 scheduled Indian languages; OS developers need to implement it.
• Collation standards need to be defined.

1.7 Searching

Current Issues/Status:
• Character-level search is available.
• Contextual and intelligent search engines are not available.

Destination Desired:
• Intelligent search engines need to be developed.
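
To make "character-level search" concrete, the sketch below does a plain substring match over Unicode text, with one refinement that is an assumption of this example rather than something the slide specifies: both query and documents are normalised to NFC first, so that canonically equivalent spellings match. The function names and sample strings are hypothetical.

```python
import unicodedata


def normalise(text):
    """Bring text to Normalization Form C before comparing."""
    return unicodedata.normalize("NFC", text)


def character_level_search(query, documents):
    """Return the indexes of documents that contain the query as a substring."""
    needle = normalise(query)
    return [i for i, doc in enumerate(documents) if needle in normalise(doc)]


if __name__ == "__main__":
    documents = [
        "\u0958\u0932\u092e",  # word stored with precomposed U+0958
        "\u0915\u0932\u092e",  # a different word, without nukta
    ]
    query = "\u0915\u093c\u0932\u092e"  # same word typed as KA + NUKTA

    # A raw substring test misses the first document; the normalised one finds it.
    print([i for i, doc in enumerate(documents) if query in doc])
    print(character_level_search(query, documents))
```

This is still purely character matching. It knows nothing about word forms, synonyms or context, which is the gap the slide describes when it asks for intelligent search engines.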

2. Content Creation

2.1 Content Creation Editors for Desktop & Web
W3C: Markup languages
- Standard Generalized Markup Language (SGML), old standard
- Hypertext Markup Language (HTML)
- Extensible Markup Language (XML)
- Extensible Hypertext Markup Language (XHTML)
- XLIFF: XML Localization Interchange File Format
- TEI: Text Encoding Initiative

Current Issues/Status:
• Many tools are now available for content creation in Indian languages.

Destination Desired:
• Adoption of W3C specifications on every government website.

2.2 Browser Support

Current Issues/Status:
• IE6, IE7, Firefox, Netscape etc. support Indian languages.

3. Resources & Tools

3.1 Processing Resources: Spell Checker

Current Issues/Status:
• Available in most Indian languages, but need improvement. The available spell checkers work only in specific applications (i.e. they will not plug in to other applications).

Destination Desired:
• A plug-in spell checker is required that works in all general-purpose application software.

3.2 Language Resources: Dictionaries, Glossary, Lexicon, Thesaurus, WordNet, Corpora (Text & Speech)
- ISO: TermBase eXchange (TBX)
- ISO: Terminology Markup Framework (TMF)
- ISO: Lexical Resource Markup Framework (LRMF)
- EAGLES/ISLE: CES, Corpus Encoding Standards
- EAGLES/ISLE: XCES, XML-based Corpus Encoding Standards
- EAGLES: MATE, Multilingual Annotation Tools Engineering

Current Issues/Status:
• No language resources conforming to standards are available.

Destination Desired:
• A national initiative is needed for the development of linguistic resources.

3.3 Machine Translation

Current Issues/Status:
Research and development in MT has been underway at several organizations in India.

i) English to Indian language MT systems
ii) Indian language to Indian language MT systems
iii) English has been the language of choice in the foreign language category among the MT R&D community in India. Efforts are being made to build MT systems for the English-Hindi language pair. Research is being done on MT systems for Hindi and other Indian languages.
iv) University of Hyderabad is working on an English-Kannada MT system using the Universal Clause Structure Grammar (UCSG). This is essentially a transfer-based approach, and will be used in all Government circulars.
v) Some other organizations are also working in this area:
   - IIT Kanpur (using Anglabharati approach)
   - IIT Mumbai (using Universal Networking Language, UNL, approach)
   - Super Infosoft Pvt (developed Anuvadak system)
   - IBM, Gurgaon (using statistical approach)
   - IIIT-Hyderabad and University of Hyderabad (developed Anusaraka, a Language Accessor)
vi) CDAC, AAI Group has developed the MT system MANTRA-RAJBHASHA (English to Hindi MT system) for the Administrative, Finance, Agriculture and Small Scale Industry domains.

Destination Desired:
• The interfaces for these technologies should be general purpose and not platform-specific.

3.4 Transliteration

Current Issues/Status:
• Transliteration is available in only a few Indian languages.

3.5 Database Support: Data Storage & Retrieval

Current Issues/Status:
• If the database is Unicode compliant, there is no issue with regard to storage and retrieval.
• Most databases (SQL Server, Oracle, MySQL, DB2 etc.) support Unicode.
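
The storage and retrieval point can be illustrated with a short round-trip test. This sketch uses SQLite through Python's standard `sqlite3` module purely as a stand-in, since it needs no server; the slide itself names SQL Server, Oracle, MySQL and DB2, whose setup and Unicode column types differ. The table name `notices` and the function `round_trip` are hypothetical.

```python
import sqlite3


def round_trip(texts):
    """Store each string in an in-memory table and read them back in order."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE notices (id INTEGER PRIMARY KEY, body TEXT)")
        # Parameter binding passes the Unicode strings through unchanged.
        conn.executemany(
            "INSERT INTO notices (body) VALUES (?)",
            [(text,) for text in texts],
        )
        rows = conn.execute("SELECT body FROM notices ORDER BY id").fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()


if __name__ == "__main__":
    samples = [
        "\u0939\u093f\u0928\u094d\u0926\u0940",        # Devanagari sample
        "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd",              # Tamil sample
        "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41",        # Telugu sample
    ]
    print(round_trip(samples) == samples)
```

A round trip like this only shows that code points survive storage. Sorting and searching inside the database depend on the collation configured for it, which ties back to the locale and collation items above.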

4. Search Engine Supporting Indian Languages (Google, Yahoo etc.)
W3C

Current Issues/Status:
• Presently character-level search is available in all major search engines.

5. Localized Applications

Open Office: works on operating systems that have language support, such as Windows XP and Linux.

Current Issues/Status:
1. C-DAC GIST, Pune has already taken up the localization of Bharateeya Open Office for all 22 scheduled Indian languages.
2. Localized versions for Tamil, Hindi and Telugu have been released.
3. Localized versions for Kannada, Punjabi, Urdu, Oriya, Assamese, Bengali, Malayalam and Gujarati are ready and awaiting release.
4. Localization for the remaining languages is in progress.

Main Points highlighted :

1. Strategy for conversion of data from ISCII and other formats to UNICODE.

2. Long term goal for manpower development in language technology.

3. Release of free tools and toolkits for software developers to develop portals, databases etc. in Indian languages.

4. There should be a strategy for transparent Interoperability.

5. Setu-dev as double byte ttf fonts may be a standard that will work with all OS’s, all browsers and all word processors.

6. Stop leakage of Govt. money in non standard and proprietary items.

7. Chalk out a strategy for conversion of existing corpora

8. Adoption of W3C standards for developing websites in Indian Languages.

9. Adoption of Open Document format for G2C interaction.

Thank You
