Mahesh D. Kulkarni, Group Coordinator, C-DAC GIST. 4th August 2006. Venue: Hotel Radisson, Noida

Presentation on Digital Library: “Issues and Solutions”
Nurturing Living Languages
© C-DAC
Social and economic growth is catalyzed by the presence of the Internet.
Development of the Internet has been mainly in English.
Domain names use only the 26 unaccented Latin letters, the 10 digits (0-9), the hyphen and the dot.
For the proliferation and preservation of heritage and culture, and for content creation in multiple languages, it is essential to have domain names in multilingual scripts.
Background
An application (such as a browser) converts the name to ASCII Compatible Encoding (ACE): www.xn--3b7vcv67.com
Registry entry: xn--3b7vcv67.com (ASCII characters)
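The conversion described above can be sketched with Python's built-in "idna" codec, which follows the IDNA 2003 rules. This is a minimal illustration only: the helper names to_ace and from_ace are assumptions, not part of the original presentation.

```python
def to_ace(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII Compatible Encoding."""
    # The "idna" codec applies NamePrep and Punycode label by label.
    return domain.encode("idna").decode("ascii")


def from_ace(domain: str) -> str:
    """Convert an ACE domain name back to Unicode for display."""
    return domain.encode("ascii").decode("idna")
```

The browser would call something like to_ace before the DNS lookup, and from_ace when showing the name to the user; the registry only ever stores the ASCII form.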
Background
xn--e2br9czb

xn--m1be
India is among the most linguistically diverse countries in the world.
It has 4 major language families, at least 35 different languages and around 2000 dialects.
The languages belong to the Indo-Aryan (ca. 74%), Dravidian (ca. 24%), Austro-Asiatic (Munda) (ca. 1.2%) or Tibeto-Burman (ca. 0.6%) families. Some of the languages of the Himalayas are still unclassified.
India has 22 scheduled languages, and English continues to be the “associate additional official language”.
The following scripts will be most needed: Assamese, Bangla, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, Telugu, Urdu.
Devanagari – Hindi, Marathi, Konkani, Rajasthani, Sindhi, Nepali, Dogri, Santhali, etc.
Thus the Devanagari code page can support all languages that use that script.
Solution :
Though the contents would reveal the language used, it would be ideal if a special attribute code to indicate language is inserted.
Konkani is written in Roman, Devanagari, Malayalam and Kannada.
Sindhi is written in Gurmukhi (Punjabi), Arabi (Perso-Arabic), Devanagari, Gujarati and also Roman.
Sindhi has adopted the Perso-Arabic script for representing the language. In the case of Konkani, Devanagari is used as the official script.
Hence it is proposed that the same formula be used when attributing IDNs.
However, nothing stops a client from wanting his IDN in all the scripts. This can be catered for efficiently by providing a broad-based transliteration facility which would transliterate a name from one Indian script to another.
Thus a Konkani domain name in Devanagari could be transliterated into Kannada, Malayalam and Roman.
Solution:
The best solution to this is by way of linguistic or political consensus
One language :: many scripts
The solution :
A tool for transliteration from one Indian script to another can be easily deployed.
The transliterated data could be presented to the client, who could verify the transliteration. If it meets with his approval, the IDN could be registered in all possible scripts.
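A minimal sketch of such a transliteration tool follows. It assumes the names are in Unicode and relies on the fact that the Unicode blocks for the major Indian scripts share one layout inherited from ISCII, so a letter sits at the same offset in every block. The names BLOCK_START, transliterate and all_scripts are illustrative assumptions.

```python
# Start of each script's Unicode block. The Indic blocks share one layout,
# inherited from ISCII, so a letter keeps the same offset in every block.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def transliterate(text: str, source: str, target: str) -> str:
    """Shift every character of the source script into the target script."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            out.append(chr(cp - src + dst))
        else:
            out.append(ch)  # leave anything outside the source block untouched
    return "".join(out)


def all_scripts(text: str, source: str) -> dict:
    """Candidate spellings in every other script, for the client to verify."""
    return {t: transliterate(text, source, t) for t in BLOCK_START if t != source}
```

The offset mapping is only a first approximation: some positions are unassigned in some scripts (Tamil, for instance, has no aspirated consonants), and Roman or Perso-Arabic output would need a proper mapping table. This is one reason the client's verification step matters.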
ACE, i.e. ASCII Compatible Encoding.
This is intimately tied to NamePrep (RFC 3491) and Punycode (RFC 3492), as well as to StringPrep (RFC 3454).
An IDN string is first prepared by NamePrep and then passed to Punycode, which produces the ACE form that is stored as 7-bit ASCII data.
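The two stages can be sketched separately using Python's standard library, which exposes a NamePrep helper in encodings.idna and a "punycode" codec. The function name label_to_ace is an assumption for illustration, and the sketch handles a single label, not a full dotted name.

```python
from encodings import idna


def label_to_ace(label: str) -> str:
    """Run one label through NamePrep and Punycode, step by step."""
    # Step 1: NamePrep (RFC 3491), a StringPrep profile (RFC 3454):
    # case folding, NFKC normalisation and prohibited-character checks.
    prepared = idna.nameprep(label)
    # Step 2: a label that is already plain ASCII needs no encoding.
    if prepared.isascii():
        return prepared
    # Step 3: Punycode (RFC 3492) plus the "xn--" ACE prefix.
    return "xn--" + prepared.encode("punycode").decode("ascii")
```

Keeping the stages apart makes it easier to see where an additional language-specific filter could be slotted in before the Punycode step.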
We would like to make a case for the use of ISCII 91 as a parallel code for Brahmi based scripts.
ISCII deploys the same encoding for all Brahmi based scripts.
The advantage of this is obvious. Storage in ISCII will allow the IDN system to transliterate a name on the fly into any Indic script. It can thereby ensure, at the Punycode level itself, that a name allotted in one script is automatically allotted in the other scripts to the same owner. This does away with name squatting across Indic scripts, which would otherwise be a regular feature of IDN allocation.
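The idea of a script-independent stored form can be illustrated as follows. This sketch does not use real ISCII-91 bytes: it emulates ISCII's common layout by reducing each Unicode Indic character to its offset inside its own script block. The Registry class and all names in it are hypothetical.

```python
INDIC_FIRST, INDIC_LAST = 0x0900, 0x0D7F  # Devanagari through Malayalam blocks
BLOCK_SIZE = 0x80


def script_neutral_key(label: str) -> tuple:
    """Reduce each Indic letter to its offset inside its own script block."""
    key = []
    for ch in label:
        cp = ord(ch)
        if INDIC_FIRST <= cp <= INDIC_LAST:
            key.append((cp - INDIC_FIRST) % BLOCK_SIZE)
        else:
            key.append(ch)  # non-Indic characters are kept unchanged
    return tuple(key)


class Registry:
    """Toy registry: one owner per script-neutral key."""

    def __init__(self):
        self._owners = {}

    def register(self, label: str, owner: str) -> bool:
        key = script_neutral_key(label)
        current = self._owners.setdefault(key, owner)
        return current == owner  # False: someone else holds a transliteration
```

With such a key, a name registered in Devanagari is implicitly held in Gujarati, Kannada and the other scripts as well, so a second applicant asking for the transliterated form is refused.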
Alternate mechanism
IDN & THE PROBLEM OF ALLOTTING NAMES
The IDN server which will attribute the domain names is to be automated and hence it is of vital interest that a mechanism of checks and counter-checks be set up to ensure the highest level of security.
Two major issues are at stake. These issues are mainly specific to Indian scripts and the complex nature of their visual rendering.
PROBLEM 1: DOUBLETS
The first is the need to ensure that doublets are avoided. Doublets are IDNs which are nearly alike, either as homophones or as close homographs. Thus spelling Maharashtra as:

can lead to identity confusion. Since all three spellings are different, the server would treat all of the names as valid IDNs, whereas the original client would not want his IDN to be misused.
PROBLEM 2: SECURITY ISSUES
More serious is the willful use of such tactics to perpetrate fraud: a user is misled into believing that he has logged on to a bona fide site and is thus persuaded to divulge information such as his credit card number.
UNDERLYING THESE PROBLEMS AND ISSUES ARE THREE MAJOR POTENTIAL SECURITY HOLES
HOMOPHONES AND HOMOGRAPHS
SPELLING VARIANTS
SPELLING ERRORS
Each of these will be studied for its relevance to ensuring maximal security.
Homophones and Homographs
These are aural and visual look-alikes and, given the phonetic nature of Indian scripts, are a potential source of confusion.
A typology of these has been established:
VISUAL LOOK-ALIKES
AURAL LOOK-ALIKES
Devanagari
The first ligature is a half da + full dha; the second is a half dha followed by a full da. To an average reader of Hindi, the two forms look practically alike and lead to confusion.
A similar situation arises in the case of Gujarati

The first is ka+la. The second is ka+halanta+la.
AMBIGUITIES ARISING OUT OF POSSIBLE UNICODE VARIANTS.
This is best seen in the case of nukta characters. These can be generated in two different ways:

In each pair, the first is a single character whereas the second is made up of two characters: the consonant followed by the dot, or nukta, character. To the naked eye the two look alike, whereas for the machine they would be two different IDNs.
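The two encodings of a nukta character can be brought to a single form with standard Unicode normalisation. The sketch below uses Python's unicodedata module; the helper name canonical and the choice of the letter QA as the sample are assumptions for illustration.

```python
import unicodedata


def canonical(label: str) -> str:
    """Bring precomposed and decomposed nukta forms to one representation."""
    return unicodedata.normalize("NFC", label)


precomposed = "\u0958"       # DEVANAGARI LETTER QA, a single code point
decomposed = "\u0915\u093c"  # LETTER KA followed by SIGN NUKTA, two code points

same_before = precomposed == decomposed                        # raw strings differ
same_after = canonical(precomposed) == canonical(decomposed)   # equal once normalised
```

Comparing names only after normalisation means the two visually identical spellings can no longer be registered as separate IDNs.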
SIMILAR LOOKING CHARACTERS WITHIN THE SAME CODE-PAGE:
Within a code page, two characters can look practically alike and create ambiguity. This is especially so when the font on the client machine is not of high quality; given the usual character size (normally 10 point), confusion can easily result. Some examples are given below:
Devanagari
IDENTICAL CHARACTERS IN UNICODE
This is the case with a glyph shared by Urdu and Sindhi. Character U+06A9 is the letter /keheh/ in Urdu, whereas the same symbol in Sindhi represents /kheheh/. Since both fall within the same code page, aural disambiguation is impossible except by recourse to the language used.
Aural Look-Alikes: Homophones
Indian Languages being phonetic in nature, aural representation is a major issue.
These arise mainly from the fact that Indian languages are generally typed as they are spoken. Very often they arise out of:
spelling variants, and/or
the ignorance of the user as to the correct spelling of the word.
A large number of sub-types of problems can emerge from such Homophonic representations
Aural Look-Alikes: Homophones-1
Confusion between the two nasal modifiers (wherever such nasal modifiers exist).
Hindi Gujarati
Confusion between two or more similar sounding consonants (normally dental vs. retroflex sibilants and laterals):
Marathi Gujarati
Confusion arising out of short and long vowels:
Tamil: Gujarati Hindi
Absence or presence of a halanta.
This is a source of errors even among educated speakers of the language. Proper names tend to be written at times with or without the halanta.
Thus the name Shirke in Marathi can be written in the following two ways of which the first is correct, the second not normatively valid but could be accepted:

Confusion arising out of the use of the rakar+ “u” matra instead of the vowel form:
vs.
A remote source of error would be the use of the Visarga or a vowel lengthener to modify an IDN. The Visarga is mainly used in Sanskrit and very rarely in the New Indo-Aryan languages. However, an IDN with or without the Visarga could create ambiguity.

Insertion of a zero width character (ZWJ/ZWNJ) within the name string:
‍
The first has no non-joiner, the second has a non-joiner. Visually both look alike and can lead to confusion.
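Several of the homophone sub-types listed above can be trapped by folding each name to a comparison key before checking for doublets. The sketch below covers Devanagari only, and the folding table is a hypothetical illustration, not the rule set drawn up by the language experts; halanta presence and the rakar forms are not handled.

```python
# Hypothetical folding table for Devanagari (code point -> replacement).
HOMOPHONE_FOLD = {
    0x0901: 0x0902,  # candrabindu -> anusvara (nasal modifiers)
    0x0908: 0x0907,  # letter II -> letter I (long vs. short vowel)
    0x090A: 0x0909,  # letter UU -> letter U
    0x0940: 0x093F,  # vowel sign II -> vowel sign I
    0x0942: 0x0941,  # vowel sign UU -> vowel sign U
    0x0937: 0x0936,  # retroflex SSA -> SHA (similar sibilants)
    0x0933: 0x0932,  # retroflex LLA -> LA (similar laterals)
    0x0903: None,    # visarga dropped
    0x200C: None,    # zero width non-joiner dropped
    0x200D: None,    # zero width joiner dropped
}


def homophone_key(label: str) -> str:
    """Fold a Devanagari label so that homophonic spellings compare equal."""
    return label.translate(HOMOPHONE_FOLD)
```

Two applications whose keys match would be flagged as potential doublets and referred to the allotment rules, not registered independently.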
SUB-TYPE II: SPELLING VARIANTS
This is best seen in the case of Hindi where a nasal modifier can substitute for a corresponding half nasal consonant.
The word Hindi itself can be written in either of two ways:

Obviously two IDNs based on these spelling variants should not be allowed; both must be resolved to the same norm.
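A filter that resolves the two spellings to one norm might look like the sketch below. It assumes, for illustration only, that the anusvara spelling is taken as the norm, and it replaces a half nasal consonant only when the following consonant belongs to the same articulation class. The names HOMORGANIC and to_anusvara_norm are hypothetical.

```python
import re

ANUSVARA = "\u0902"
HALANTA = "\u094d"

# Nasal consonant -> consonants of the same articulation class (Devanagari).
HOMORGANIC = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # nga before ka kha ga gha
    "\u091e": "\u091a\u091b\u091c\u091d",  # nya before ca cha ja jha
    "\u0923": "\u091f\u0920\u0921\u0922",  # nna before tta ttha dda ddha
    "\u0928": "\u0924\u0925\u0926\u0927",  # na before ta tha da dha
    "\u092e": "\u092a\u092b\u092c\u092d",  # ma before pa pha ba bha
}

# One alternative per nasal: nasal + halanta, followed by a matching consonant.
_PATTERN = re.compile(
    "|".join(
        f"{nasal}{HALANTA}(?=[{following}])"
        for nasal, following in HOMORGANIC.items()
    )
)


def to_anusvara_norm(word: str) -> str:
    """Rewrite half nasal consonants as anusvara where the two are equivalent."""
    return _PATTERN.sub(ANUSVARA, word)
```

Both spellings of the word Hindi then reduce to the same string, so only one IDN can be allotted for them.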
A similar situation exists in Marathi in the use of the timba vs. the /e/ vowel modifier. The first is used in colloquial Marathi in special environments, whereas the second is the literary form. A filter which normalizes the two would have to be written.

SUB-TYPE III: SPELLING ERRORS
These, whether conscious or unconscious, could create homographic doublets and need to be detected to ensure that the client does not have a spurious IDN competing with his real IDN. Misspellings of words and inversions can all lead to IDN doublets.
A good example is words in Hindi which have Urdu roots and which admit spellings both without halanta (Urdu norm) and with halanta (Hindi aural norm).
Proposed Recommendations
An action plan has been proposed for ensuring maximum security in allotment of IDN’s in Indian scripts.
This is in shape of recommendations arising out of discussions.
The recommendations are both specific and generic in nature.
Level 2 Government bodies and Institutions (Bank, insurance, healthcare, etc)
Level 3 Corporate and NGO’s
Level 4 All other users.
Proposed Recommendations: GENERIC STRATEGIES-2
The implementation should be tested in TESTBED mode and IDN’s should be allotted in a phased manner:
Level 1 (highest security) and Level 2 (Government bodies and institutions) should be permitted to register in the test bed mode. This will also have the advantage of automatically blocking all demands by “spoofers” and “hackers” to squat on such names.
Names belonging to Levels 1 and 2 should be automatically denied to all other users.
At this stage the automated software for providing variants based on visual and homophonic identities should be put in place.
Proposed Recommendations: GENERIC STRATEGIES-2
Subsequently Level 3, i.e. corporates and NGOs, should be allowed to register. The software, which will generate all possible variants of their names as per the rules of the language, can be offered to them. If they so desire, they can register all these variants or leave them open, after being explicitly warned that leaving them open could lead to spoofing.
Level 4 can be integrated at the end
Phased allotment of IDN’s will eradicate to a large extent spoofing and phishing and ensure maximal security.
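The variant-generating software mentioned above can be sketched as a combination of interchangeable characters. The groups below are a small hypothetical sample for Devanagari; the real lists would come from the language-specific rules, and the names CONFUSABLE_GROUPS and variants are assumptions.

```python
from itertools import product

# Hypothetical groups of interchangeable Devanagari signs, for illustration.
CONFUSABLE_GROUPS = [
    "\u093f\u0940",  # short i / long ii vowel signs
    "\u0941\u0942",  # short u / long uu vowel signs
    "\u0902\u0901",  # anusvara / candrabindu
]

# Map every listed character to the whole group it belongs to.
_ALTERNATIVES = {ch: group for group in CONFUSABLE_GROUPS for ch in group}


def variants(name: str) -> set:
    """Every spelling obtained by swapping characters within their group."""
    choices = [_ALTERNATIVES.get(ch, ch) for ch in name]
    return {"".join(combo) for combo in product(*choices)}
```

The number of variants grows multiplicatively with each ambiguous character, so a production tool would probably need to cap or rank the list before presenting it to the registrant.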
1. The code pages of two scripts should not be mixed.
2. As far as possible, numbers (digits) should not be used unless they acquire a linguistic value, such as 365, 24/7, etc. Domain names are not like mail applications, where the name can be followed by a digit.
3. Punctuation marks should be avoided as far as possible. These can also result in confusion, as in the case of the eyelash repha in Marathi:
- ‍
4. Although under ideal circumstances, correct spelling would be the norm, the first instance of a name registered even if it is incorrect would be deemed as registered and all further variants including the correct one, generated out by the software would be reserved or permitted as per the wish of the sanctioning authority.
Proposed Recommendations: SPECIFIC ISSUES-2
5. The whole process should be automated by means of software which ensures, to the highest possible degree, that the “security holes” are not exploited.
Given that there would be a large number of applications and that manual processing would not be possible and if possible would result in inordinate delays, automation is a pre-requisite.
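Part of this automation is a simple screening of each requested label against the recommendations on script mixing, digits and punctuation. The sketch below is an illustration under stated assumptions: the whitelist of meaningful numbers, the exemption of the hyphen and the function names are all hypothetical.

```python
import unicodedata

INDIC_FIRST, INDIC_LAST = 0x0900, 0x0D7F  # Devanagari through Malayalam blocks
BLOCK_SIZE = 0x80
ALLOWED_NUMBERS = {"365", "247"}  # hypothetical list of meaningful numbers


def scripts_used(label: str) -> set:
    """Identify the scripts in a label (Indic block start, or 'latin')."""
    scripts = set()
    for ch in label:
        cp = ord(ch)
        if INDIC_FIRST <= cp <= INDIC_LAST:
            scripts.add(INDIC_FIRST + (cp - INDIC_FIRST) // BLOCK_SIZE * BLOCK_SIZE)
        elif ch.isascii() and ch.isalpha():
            scripts.add("latin")
    return scripts


def check_label(label: str) -> list:
    """Return the list of recommendations that the label violates."""
    problems = []
    if len(scripts_used(label)) > 1:
        problems.append("mixed scripts")
    digits = "".join(ch for ch in label if unicodedata.category(ch) == "Nd")
    if digits and digits not in ALLOWED_NUMBERS:
        problems.append("digits")
    # The hyphen is exempted here because plain domain names allow it.
    if any(unicodedata.category(ch).startswith("P") and ch != "-" for ch in label):
        problems.append("punctuation")
    return problems
```

A label that returns an empty list would pass on to the variant and doublet checks; anything else would be rejected or referred for manual review.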
Identification of potential zones: Potential zones for ensuring security were identified.
These are:
List of potential spelling variants
List of potential zones of error in terms of misspellings and which are not trapped by the variants list.
Explanatory documents and Templates for each of the desired data were provided by CDAC GIST to the concerned
The templates gave examples for each type of requirements in the sample template below:
C-DAC, Pune has been entrusted with the creation of data for three languages: Hindi, Marathi and Urdu.
As per agreement, expert committees have been appointed for all three languages. The experts are professors and people working in the publishing industry, since they have the linguistic skills and know-how to investigate and create the required data.
A translation of the three-letter extension of the names has also been provided. To ensure across-the-board intelligibility, this is in Sanskrit.
In the slides that follow, samples of the work accomplished in each of the languages are detailed.
Report-1
1) EDU
2) GOV
3) IN
6) MIL -
7) RES
13) MED
14) AGRI
Report-1: Marathi
In the case of Marathi, a committee headed by Shri Phadake who has books on “shuddha-lekhan” to his credit has been appointed.
Work has commenced on all the three areas:
Variants list
Spelling Variants
Erroneous Spellings
A large number of rules have been generated, as has data on spelling variants and misspellings.
Report-2: Hindi
A similar exercise has been carried out for Hindi. Sample files are provided below. Over 100 different rule variants have been identified.
Report-3: Urdu
Under the able guidance of Prof Yunus Fahmi, spelling variants, misspellings and variant lists are being created.
Some sample files for the variant list and spelling variants are appended.