WELCOME. Mahesh D. Kulkarni, Group Coordinator, C-DAC GIST. 4th August 2006. Venue: Hotel Radisson, Noida. Indian Language Domain Name Registration: “Issues and Solutions”. Background. Social and economic growth is catalyzed by the presence of the Internet.
xn--m1be
India has one of the greatest linguistic diversities in the world:
4 major language families, at least 35 different languages and
around 2000 dialects.
Languages belong to the Indo-Aryan (ca. 74%), the Dravidian (ca.
24%), the Austro-Asiatic (Munda) (ca. 1.2%) or the Tibeto-Burman (ca.
0.6%) families. Some of the languages of the Himalayas are still
unclassified.
India has 22 scheduled languages, and English continues to be the
“associate additional official language”.
The following scripts will be most needed: Assamese, Bangla,
Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil,
Telugu, Urdu.
Nurturing Living Languages
Devanagari – Hindi, Marathi, Konkani, Rajasthani, Sindhi, Nepali,
Dogri, Santhali, etc.
Thus the code page Devanagari can support all languages using that
particular script.
Solution :
Though the contents would reveal the language used, it would be
ideal if a special attribute code to indicate language is
inserted.
Konkani is written in Roman, Devanagari, Malayalam and
Kannada.
Sindhi is written in Gurmukhi (Punjabi), Arabi (Perso-Arabic),
Devanagari, Gujarati and also Roman.
Sindhi speakers have adopted the Perso-Arabic script for representing
their language. In the case of Konkani, Devanagari is used as the
official script.
Hence it is proposed that the same formula be used when allotting
IDNs.
However, nothing stops a client from wanting his IDN in all the
scripts, and this can be catered for efficiently by providing a
broad-based transliteration facility which would transliterate a name
from one Indian script to another.
Thus a Konkani domain name in Devanagari could be transliterated
into Kannada, Malayalam and Roman.
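The script-to-script transliteration described here can be sketched as follows. This is a minimal sketch, not the proposed tool. It assumes that the Unicode blocks of the major Indic scripts share a common ISCII-derived layout, so that a character can be moved to another script by changing its block base. The names BLOCK_START and transliterate are hypothetical, chosen only for illustration. Roman and Perso-Arabic fall outside this approach, and a character with no counterpart in the target script is rejected rather than guessed.

```python
import unicodedata

# Start of each 128-code-point Unicode block (illustrative subset).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def transliterate(name, src, dst):
    """Shift every character of the source block to the same slot in dst."""
    src_base, dst_base = BLOCK_START[src], BLOCK_START[dst]
    out = []
    for ch in name:
        cp = ord(ch)
        if src_base <= cp < src_base + 0x80:
            candidate = chr(cp - src_base + dst_base)
            # An unnamed code point means the target script has no such letter.
            if not unicodedata.name(candidate, ""):
                raise ValueError(
                    f"U+{cp:04X} has no counterpart in the {dst} block"
                )
            out.append(candidate)
        else:
            # Digits, hyphens and other characters pass through unchanged.
            out.append(ch)
    return "".join(out)


# Devanagari ka + halanta + la moved into the Kannada block.
kannada = transliterate("\u0915\u094D\u0932", "devanagari", "kannada")
print([f"U+{ord(c):04X}" for c in kannada])
```

The result would still have to be shown to the client for approval, since a purely positional mapping cannot capture script-specific spelling conventions.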
Solution:
The best solution to this is by way of linguistic or political
consensus.
One language :: many scripts
The solution :
A tool for transliteration from one Indian script to another can be
easily deployed.
The transliterated data could be presented to the client who could
verify the transliteration and see if it meets his approval and if
so, the IDN could be registered in all possible scripts.
ACE, i.e. ASCII-Compatible Encoding.
This is intimately tied to Nameprep (RFC 3491) and Punycode (RFC 3492)
as well as to Stringprep (RFC 3454).
An IDN string is first prepared with Nameprep and then converted by
Punycode into an ASCII-only (7-bit) label carrying the prefix
“xn--”; it is in this form that the name is stored.
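The conversion between a Unicode label and its ACE form can be tried out with Python's built-in "idna" codec, which follows the 2003 generation of the IDNA specifications (Nameprep followed by Punycode). This is only a sketch: the helper names to_ace and from_ace and the sample label are assumptions made for illustration, and a production registry may well follow a different IDNA profile.

```python
def to_ace(label):
    """Return the ASCII-compatible form of a single Unicode label."""
    return label.encode("idna").decode("ascii")


def from_ace(ace_label):
    """Return the Unicode form of a single ACE label."""
    return ace_label.encode("ascii").decode("idna")


# Illustrative label: the Devanagari letters bha, aa-matra, ra, ta.
label = "\u092D\u093E\u0930\u0924"
ace = to_ace(label)

print(ace)                     # an ASCII label beginning with "xn--"
print(from_ace(ace) == label)  # the conversion is reversible
```

A label that is already plain ASCII passes through unchanged, which is what allows IDNs and conventional names to coexist in the same DNS.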
We would like to make a case for the use of ISCII 91 as a parallel
code for Brahmi based scripts.
ISCII deploys the same encoding for all Brahmi based scripts.
The advantage of this is obvious: storage in ISCII would allow the
IDN system to transliterate a name on the fly into any Indic script
and thereby ensure, at the PunyCode level itself, that a name
allotted in one script is also automatically allotted in the other
scripts to the same owner. This would do away with name squatting
across Indic scripts, which would otherwise be a regular feature of
IDN allocation in Indic scripts.
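The idea of a single script-independent code can be illustrated with a registry key that ignores which Indic script a name was typed in. This sketch does not use ISCII 91 itself. As a stand-in, it assumes that the position of a character inside its Unicode block plays the role of the shared ISCII code. The names script_neutral_key and is_taken are hypothetical.

```python
# Devanagari (U+0900) through Malayalam (U+0D7F): nine 128-slot blocks.
INDIC_FIRST, INDIC_LAST = 0x0900, 0x0D7F


def script_neutral_key(name):
    """Reduce each Indic character to its slot within its script block."""
    key = []
    for ch in name:
        cp = ord(ch)
        if INDIC_FIRST <= cp <= INDIC_LAST:
            # Same slot in every block: the script itself is discarded.
            key.append(("indic", cp & 0x7F))
        else:
            key.append(("other", cp))
    return tuple(key)


def is_taken(name, registry):
    """True if the name, in any Indic script, is already allotted."""
    return script_neutral_key(name) in registry


# A name allotted in Devanagari blocks the same name in Kannada.
registry = {script_neutral_key("\u0915\u094D\u0932")}
print(is_taken("\u0C95\u0CCD\u0CB2", registry))
```

Keying the registry this way allots a name in every script at once, at the cost of treating characters that merely share a slot as equivalent.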
Alternate mechanism
IDN & THE PROBLEM OF ALLOTTING NAMES
The IDN server which will attribute the domain names is to be
automated and hence it is of vital interest that a mechanism of
checks and counter-checks be set up to ensure the highest level of
security.
Two major issues are at stake. These issues are mainly specific to
Indian scripts and the complex nature of their visual
rendering.
PROBLEM 1: DOUBLETS
The first is the need to ensure that doublets are avoided. Doublets
are IDNs which are nearly alike, either as homophones or as close
homographs. Thus, spelling Maharashtra as:
The first is ka+la. The second is ka+halanta+la.
Homophones and Homographs
AMBIGUITIES ARISING OUT OF POSSIBLE UNICODE VARIANTS.
This can best be seen in the case of Nukta characters. These can be
generated in two different ways:
In each pair, the first is a single character whereas the second is
made up of two characters: the consonant followed by the dot or
nukta character. To the naked eye the two look alike, whereas for
the machine these would be two different IDNs.
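The two encodings of a nukta character can be reconciled by Unicode normalization before names are compared. The sketch below uses Python's standard unicodedata module. The function name canonical_label and the choice of NFC are assumptions for illustration, and the sample characters (Devanagari qa and nnna) are chosen here rather than taken from the slide.

```python
import unicodedata


def canonical_label(label):
    """Bring precomposed and consonant+nukta spellings to one form."""
    return unicodedata.normalize("NFC", label)


precomposed = "\u0958"     # DEVANAGARI LETTER QA, a single code point
sequence = "\u0915\u093C"  # DEVANAGARI LETTER KA followed by SIGN NUKTA

# The raw strings differ, so a naive registry would allot both names.
print(precomposed == sequence)

# After normalization the two spellings compare equal.
print(canonical_label(precomposed) == canonical_label(sequence))
```

Comparing only normalized labels means the registry sees one name where the eye sees one name.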
SIMILAR LOOKING CHARACTERS WITHIN THE SAME CODE-PAGE:
Within a code-page two characters can look practically alike and
create ambiguity. This is especially the case when the font enabled
on the client machine is not of high quality; given the small size
of the characters (normally 10 point), this can lead to confusion.
Some examples are given below:
Devanagari
IDENTICAL CHARACTERS IN UNICODE
This is the case with a glyph shared by Urdu and Sindhi. Character
U+06A9 is the letter /keheh/ in Urdu, whereas the same symbol in
Sindhi represents /kheheh/. Since both fall within the same
code-page, aural disambiguation is impossible except by recourse to
the language used.
Aural Look-Alikes: Homophones
Indian languages being phonetic in nature, aural representation is
a major issue.
Such problems mainly arise from the fact that Indian languages are
generally typed as they are spoken. Very often they arise out
of
spelling variants and/or
the ignorance of the user as to the correct spelling of the
word.
A large number of sub-types of problems can emerge from such
homophonic representations.
Aural Look-Alikes: Homophones-1
Confusion between the two nasal modifiers (wherever such nasal
modifiers exist).
Hindi Gujarati
Confusion between two or more similar sounding consonants (normally
dental vs. retroflex sibilants and laterals):
Marathi Gujarati
Confusion arising out of short and long vowels:
Tamil: Gujarati Hindi
Absence or presence of a halanta.
This is a source of errors even among educated speakers of the
language. Proper names tend to be written at times with or without
the halanta.
Thus the name Shirke in Marathi can be written in the following two
ways of which the first is correct, the second not normatively
valid but could be accepted:
Confusion arising out of the use of the rakar + “u” matra instead of
the vowel form:
vs.
Aural Look-Alikes: Homophones-3
A remote source of error would be the use of the Visarga or vowel
lengthener to modify an IDN. The Visarga is mainly used in Sanskrit
and very rarely in the Neo Indo-Aryan languages. However, an IDN with
or without the Visarga could create ambiguity.
Aural Look-Alikes: Homophones-4
Insertion of a zero width character (ZWJ/ZWNJ) within the name
string:
The first has no non-joiner, the second has a non-joiner. Visually
both look alike and can lead to confusion.
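One way to keep invisible joiners from producing look-alike names is to detect or remove them before comparison. This is a minimal sketch with hypothetical helper names. Whether a registry should silently strip the joiners or reject the name outright is a policy decision that this example does not settle, since in some scripts a joiner changes how a conjunct is rendered.

```python
# ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER.
ZERO_WIDTH = {"\u200C", "\u200D"}


def has_joiners(label):
    """True if the label contains a ZWJ or ZWNJ anywhere."""
    return any(ch in ZERO_WIDTH for ch in label)


def strip_joiners(label):
    """Return the label with every ZWJ and ZWNJ removed."""
    return "".join(ch for ch in label if ch not in ZERO_WIDTH)


plain = "\u0915\u094D\u0932"              # ka + halanta + la
with_zwnj = "\u0915\u094D\u200C\u0932"    # same, with a non-joiner inserted

print(has_joiners(plain), has_joiners(with_zwnj))
print(strip_joiners(with_zwnj) == plain)
```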
Sub-Type 2: SPELLING ERRORS
SUB-TYPE II Spelling Variants
This is best seen in the case of Hindi where a nasal modifier can
substitute for a corresponding half nasal consonant.
The word Hindi itself can be written in either of two ways:
Obviously, two IDNs based on these spelling variants should not be
allowed; both must be resolved to the same norm.
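Resolving such variants to one norm can be sketched as a small filter. The example assumes that the two spellings of Hindi meant here are the form with the nasal modifier (anusvara) and the form with half na before da; the slide's own examples are not reproduced in this text. It also assumes, arbitrarily, that the anusvara spelling is taken as the norm. The names VARGAS and fold_half_nasals are hypothetical, and only Devanagari is handled.

```python
ANUSVARA = "\u0902"
VIRAMA = "\u094D"  # the halanta

# (first consonant, class nasal) of each of the five Devanagari vargas.
VARGAS = [
    (0x0915, 0x0919),  # ka .. nga
    (0x091A, 0x091E),  # ca .. nya
    (0x091F, 0x0923),  # tta .. nna
    (0x0924, 0x0928),  # ta .. na
    (0x092A, 0x092E),  # pa .. ma
]


def _varga_of(ch):
    """Return (varga index or None, whether ch is that varga's nasal)."""
    cp = ord(ch)
    for index, (first, nasal) in enumerate(VARGAS):
        if first <= cp <= nasal:
            return index, cp == nasal
    return None, False


def fold_half_nasals(word):
    """Replace nasal + halanta before a same-class stop with anusvara."""
    out = []
    i = 0
    while i < len(word):
        varga, is_nasal = _varga_of(word[i])
        if is_nasal and i + 2 < len(word) and word[i + 1] == VIRAMA:
            next_varga, next_is_nasal = _varga_of(word[i + 2])
            if next_varga == varga and not next_is_nasal:
                out.append(ANUSVARA)
                i += 2  # skip the nasal and its halanta
                continue
        out.append(word[i])
        i += 1
    return "".join(out)


half_na = "\u0939\u093F\u0928\u094D\u0926\u0940"  # ha, i, na, halanta, da, ii
anusvara = "\u0939\u093F\u0902\u0926\u0940"       # ha, i, anusvara, da, ii
print(fold_half_nasals(half_na) == fold_half_nasals(anusvara))
```

Doubled nasals are deliberately left alone, since only a half nasal before a stop of its own class is interchangeable with the anusvara.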
A similar situation exists in Marathi in the use of (timba) vs. /e/
vowel modifier. The first is used in colloquial Marathi under
special environments whereas the second is the literary form. A
filter which would normalize the two would have to be
written.