46
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 209 Chapter 9 South and Southeast Asian Scripts 9 The following scripts are encoded in this group: Devanagari Bengali Gurmukhi Gujarati Oriya Tamil Telugu Kannada Malayalam Sinhala Thai Lao Tibetan Myanmar Khmer The scripts of South and Southeast Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letter- forms. With minor historical exceptions, they are written from left to right; many use no interword spacing but use spaces or marks between phrases. They are all essentially alpha- betic scripts in which most symbols stand for a consonant plus an inherent vowel (usually the sound “ah”). Word-initial vowels in many of these scripts have special symbols, but word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. Most of the scripts of South and Southeast Asia, from the Himalayas in the north to Sri Lanka in the south, from Pakistan in the west to the eastern-most islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the Asoka edicts of the third century , were written in two scripts, Kharosthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an

Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

Chapter 9

South and Southeast Asian Scripts 9

The following scripts are encoded in this group:

• Devanagari

• Bengali

• Gurmukhi

• Gujarati

• Oriya

• Tamil

• Telugu

• Kannada

• Malayalam

• Sinhala

• Thai

• Lao

• Tibetan

• Myanmar

• Khmer

The scripts of South and Southeast Asia share so many common features that a side-by-sidecomparison of a few will often reveal structural similarities even in the modern letter-forms. With minor historical exceptions, they are written from left to right; many use nointerword spacing but use spaces or marks between phrases. They are all essentially alpha-betic scripts in which most symbols stand for a consonant plus an inherent vowel (usuallythe sound “ah”). Word-initial vowels in many of these scripts have special symbols, butword-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of theaffected consonant. Absence of the inherent vowel, when that occurs, is frequently markedwith a special sign.

Most of the scripts of South and Southeast Asia, from the Himalayas in the north to SriLanka in the south, from Pakistan in the west to the eastern-most islands of Indonesia, arederived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the Asokaedicts of the third century , were written in two scripts, Kharosthi and Brahmi. Theseare both ultimately of Semitic origin, probably deriving from Aramaic, which was an

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 209

Page 2: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts

important administrative language of the Middle East at that time. Kharosthi, written fromright to left, was supplanted by Brahmi and its derivatives. The descendents of Brahmispread with myriad changes throughout the subcontinent and outlying islands. There aresaid to be some 200 different scripts deriving from it. By the eleventh century, the modernscript known as Devanagari was in ascendancy in India proper as the major script of San-skrit literature and the archetypal representative of the northern branch of the script fam-ily. This northern branch includes such modern scripts as Myanmar (Burmese),Gurmukhi, and Tibetan; the southern branch includes scripts such as Malayalam.

The major scripts of India proper, including Devanagari, are all encoded according to acommon plan, so that comparable characters are in the same order and relative location.This structural arrangement, which facilitates transliteration to some degree, is based uponthe Indian national standard (ISCII) encoding for these scripts. Sinhala, Myanmar, andKhmer have virama-based models that do not map to ISCII. Thai and Lao share a differentmodel, based on the Thai Industrial Standard encoding for Thai, which uses visual order-ing and is not laid out compatibly to ISCII. Tibetan stands apart from all of these models,reflecting its somewhat different structure and usage. Many of the character names in thisgroup of scripts represent the same sounds, and naming conventions are similar across therange.

210 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 3: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.1 Devanagari

9.1 Devanagari

Devanagari: U+0900–U+097FThe Devanagari script is used for writing classical Sanskrit and its modern historical deriv-ative, Hindi. Extensions to Devanagari are used to write other related languages of India(such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to writethe following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattis-garhi, Garhwali, Gondi (Betul, Chhindwara, and Mandla dialects), Harauti, Ho, Jaipuri,Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,Palpa, and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script, and theSoutheast Asian scripts (Thai, Lao, Khmer, and Myanmar), are historically connected withthe Devanagari script as descendants of the ancient Brahmi script. The entire family ofscripts shares a large number of structural features.

The principles of the Indic scripts are covered in some detail in this introduction to theDevanagari script. The remaining introductions to the Indic scripts are abbreviated buthighlight any differences from Devanagari where appropriate.

Standards. The Devanagari block of the Unicode Standard is based on ISCII-1988 (IndianStandard Code for Information Interchange). The ISCII standard of 1988 differs from andis an update of earlier ISCII standards issued in 1983 and 1986.

The Unicode Standard encodes Devanagari characters in the same relative position as thosecoded in positions A0–F416 in the ISCII-1988 standard. The same character code layout isfollowed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Guja-rati, Oriya, Tamil, Telugu, Kannada, and Malayalam. This parallel code layout emphasizesthe structural similarities of the Brahmi scripts and follows the stated intention of theIndian coding standards to enable one-to-one mappings between analogous coding posi-tions in different scripts in the family. Sinhala, Thai, Lao, Khmer, and Myanmar depart to agreater extent from the Devanagari structural pattern, so the Unicode Standard does notattempt to provide any direct mappings for these scripts to the Devanagari order.

In November 1991, at the time The Unicode Standard, Version 1.0, was published, theBureau of Indian Standards published a new version of ISCII in Indian Standard (IS)13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow thelayout of the current version of ISCII. Nevertheless, the Unicode Standard remains a super-set of the ISCII-1991 repertoire except for a number of new Vedic extension charactersdefined in IS 13194:1991 Annex G—Extended Character Set for Vedic. Modern, non-Vedictexts encoded with ISCII-1991 may be automatically converted to Unicode code values andback to their original encoding without loss of information.

Encoding Principles. The writing systems that employ Devanagari and other Indic scriptsconstitute a cross between syllabic writing systems and phonemic writing systems (alpha-bets). The effective unit of these writing systems is the orthographic syllable, consisting of aconsonant and vowel (CV ) core and, optionally, one or more preceding consonants, witha canonical structure of ((C )C )CV. The orthographic syllable need not correspondexactly with a phonological syllable, especially when a consonant cluster is involved, butthe writing system is built on phonological principles and tends to correspond quite closelyto pronunciation.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 211

Page 4: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.1 Devanagari South and Southeast Asian Scripts

The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devana-gari script. These pieces consist of three distinct character types: consonant letters, inde-pendent vowels, and dependent vowel signs. In a text sequence, these characters are storedin logical (phonetic) order.

Principles of the Script

Rendering Devanagari Characters. Devanagari characters, like characters from manyother scripts, can combine or change shape depending on their context. A character’sappearance is affected by its ordering with respect to other characters, the font used to ren-der the character, and the application or system environment. These variables can cause theappearance of Devanagari characters to differ from their nominal glyphs (used in the codecharts).

Additionally, a few Devanagari characters cause a change in the order of the displayed char-acters. This reordering is not commonly seen in non-Indic scripts and occurs indepen-dently of any bidirectional character reordering that might be required.

Consonant Letters. Each consonant letter represents a single consonantal sound but alsohas the peculiarity of having an inherent vowel, generally the short vowel /a/ in Devanagariand the other Indic scripts. Thus U+0915 represents not just /k/but also /ka/. In the presence of a dependent vowel, however, the inherent vowel associatedwith a consonant letter is overridden by the dependent vowel.

Consonant letters may also be rendered as half-forms, which are presentation forms used todepict the initial consonant in consonant clusters. These half-forms do not have an inher-ent vowel. Their rendered forms in Devanagari often resemble the full consonant but aremissing the vertical stem, which marks a syllabic core. (The stem glyph is graphically andhistorically related to the sign denoting the inherent /a/ vowel.)

Some Devanagari consonant letters have alternative presentation forms whose choicedepends upon neighboring consonants. This variability is especially notable for U+0930 , which has numerous different forms, both as the initial elementand as the final element of a consonant cluster. Only the nominal forms, rather than thecontextual alternatives, are depicted in the code chart.

The traditional Sanskrit/Devanagari alphabetic encoding order for consonants followsarticulatory phonetic principles, starting with velar consonants and moving forward tobilabial consonants, followed by liquids and then fricatives. ISCII and the Unicode Stan-dard both observe this traditional order.

Independent Vowel Letters. The independent vowels in Devanagari are letters that standon their own. The writing system treats independent vowels as orthographic CV syllablesin which the consonant is null. The independent vowel letters are used to write syllablesthat start with a vowel.

Dependent Vowel Signs (Matras). The dependent vowels serve as the common manner ofwriting noninherent vowels and are generally referred to as vowel signs, or as matras in San-skrit. The dependent vowels do not stand alone; rather, they are visibly depicted in combi-nation with a base letterform. A single consonant, or a consonant cluster, may have adependent vowel applied to it to indicate the vowel quality of the syllable, when it is differ-ent from the inherent vowel. Explicit appearance of a dependent vowel in a syllable over-rides the inherent vowel of a single consonant letter.

The greatest variation among different Indic scripts is found in the way that the dependentvowels are applied to base letterforms. Devanagari has a collection of nonspacing depen-dent vowel signs that may appear above or below a consonant letter, as well as spacingdependent vowel signs that may occur to the right or to the left of a consonant letter or

212 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 5: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.1 Devanagari

consonant cluster. Other Indic scripts generally have one or more of these forms, but whatis a nonspacing mark in one script may be a spacing mark in another. Also, some of theIndic scripts have single dependent vowels that are indicated by two or more glyph compo-nents—and those glyph components may surround a consonant letter both to the left andright or may occur both above and below it.

The Devanagari script has only one character denoting a left-side dependent vowel sign:U+093F . Other Indic scripts either have no such vowel signs(Telugu and Kannada) or include as many as three of these signs (Bengali, Tamil, andMalayalam).

A one-to-one correspondence exists between the independent vowels and the dependentvowel signs. Independent vowels are sometimes represented by a sequence consisting of theindependent form of the vowel /a/ followed by a dependent vowel sign. For example,Figure 9-1 illustrates this relationship (see the notation formally described in the “Rules forRendering” later in this section).

The combination of the independent form of the default vowel /a/ (in the Devanagariscript, U+0905 ) with a dependent vowel sign may be viewed as analternative spelling of the phonetic information normally represented by an isolated inde-pendent vowel form. However, these two representations should not be considered equiva-lent for the purposes of rendering. Higher-level text processes may choose to consider thesealternative spellings equivalent in terms of information content, but such an equivalence isnot stipulated by this standard.

Virama. Devanagari and other Indic scripts employ a sign known as the virama, halant, orvowel omission sign. A virama sign (for example, U+094D )nominally serves to cancel (or kill) the inherent vowel of the consonant to which it isapplied. The virama functions as a combining character, with its shape varying from scriptto script. When a consonant has lost its inherent vowel by the application of virama, it isknown as a dead consonant; in contrast, a live consonant is one that retains its inherentvowel or is written with an explicit dependent vowel sign. In the Unicode Standard, a deadconsonant is defined as a sequence consisting of a consonant letter followed by a virama.The default rendering for a dead consonant is to position the virama as a combining markbound to the consonant letterform.

For example, if Cn denotes the nominal form of consonant C, and Cd denotes the deadconsonant form, then a dead consonant is encoded as shown in Figure 9-2.

Consonant Conjuncts. The Indic scripts are noted for a large number of consonant con-junct forms that serve as orthographic abbreviations (ligatures) of two or more adjacent

Figure 9-1. Dependent Versus Independent Vowels

+� �

+� �

��→

��→

An Ivs+ → Ivs + An

In

An Uvs+ → An + Uvs Un

/a/ + Dependent Vowel Independent Vowel

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 213

Page 6: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.1 Devanagari South and Southeast Asian Scripts

letterforms. This abbreviation takes place only in the context of a consonant cluster. Anorthographic consonant cluster is defined as a sequence of characters that represents one ormore dead consonants (denoted Cd) followed by a normal, live consonant letter (denotedCl) or an independent vowel letter.

Under normal circumstances, a consonant cluster is depicted with a conjunct glyph if sucha glyph is available in the current font(s). In the absence of a conjunct glyph, the one ormore dead consonants that form part of the cluster are depicted using half-form glyphs. Inthe absence of half-form glyphs, the dead consonants are depicted using the nominal con-sonant forms combined with visible virama signs (see Figure 9-3).

A number of types of conjunct formations appear in these examples: (1) a half-form of GAin its combination with the full form of DHA; (2) a vertical conjunct K.KA; (3) a fullyligated conjunct K.SSHA, in which the components are no longer distinct; and (4) a rareconjunct formed with an independent vowel letter, in this case the vowel letter RI (alsoknown as vocalic r). Note that in example (4) in Figure 9-3, the dead consonant RAd isdepicted with the nonspacing combining mark RAsup (repha).

A well-designed Indic script font may contain hundreds of conjunct glyphs, but they arenot encoded as Unicode characters because they are the result of ligation of distinct letters.Indic script rendering software must be able to map appropriate combinations of charac-ters in context to the appropriate conjunct glyphs in fonts.

When an independent vowel appears as the terminal element of a consonant cluster, as inexample (4) in Figure 9-3, the independent vowel should not be depicted as a dependentvowel sign, but as an independent vowel letterform.

Explicit Virama. Normally a virama character serves to create dead consonants that are, inturn, combined with subsequent consonants to form conjuncts. This behavior usuallyresults in a virama sign not being depicted visually. Occasionally, however, this defaultbehavior is not desired when a dead consonant should be excluded from conjunct forma-tion, in which case the virama sign is visibly rendered. To accomplish this goal, the UnicodeStandard adopts the convention of placing the character U+200C -

immediately after the encoded dead consonant that is to be excluded from conjunct

Figure 9-2. Dead Consonants

Figure 9-3. Conjunct Formations

TAn + VIRAMAn → TAd

� ��+ → �

� + →�

KAd + SSHAl → K.SSHAn

� + →� �

GAd + DHAl → GAh + DHAn

� + →� � ��KAd + KAl → K.KAn

� + →� � �

RAd + RIn → RIn + RAsup

₍₎

₍₎

₍₎

₍₎

214 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 7: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.1 Devanagari

formation. In this case, the virama sign is always depicted as appropriate for the consonantto which it is attached.

For example, in Figure 9-4, the use of - prevents the default forma-tion of the conjunct form (K.SSHAn).

Explicit Half-Consonants. When a dead consonant participates in forming a conjunct, thedead consonant form is often absorbed into the conjunct form, such that it is no longer dis-tinctly visible. In other contexts, however, the dead consonant may remain visible as a half-consonant form. In general, a half-consonant form is distinguished from the nominalconsonant form by the loss of its inherent vowel stem, a vertical stem appearing to the rightside of the consonant form. In other cases, the vertical stem remains but some part of itsright-side geometry is missing.

In certain cases, it is desirable to prevent a dead consonant from assuming full conjunctformation yet still not appear with an explicit virama. In these cases, the half-form of theconsonant is used. To explicitly encode a half-consonant form, the Unicode Standardadopts the convention of placing the character U+200D immediatelyafter the encoded dead consonant. The denotes a nonvisible letter thatpresents linking or cursive joining behavior on either side (that is, to the previous or fol-lowing letter). Therefore, in the present context, the may be consid-ered to present a context to which a preceding dead consonant may join so as to create thehalf-form of the consonant.

For example, if Ch denotes the half-form glyph of consonant C, then a half-consonantform is encoded as shown in Figure 9-5.

• In the absence of the , this sequence would normally pro-duce the full conjunct form (K.SSHAn).

This encoding of half-consonant forms also applies in the absence of a base letterform.That is, this technique may also be used to encode independent half-forms, as shown inFigure 9-6.

Consonant Forms. In summary, each consonant may be encoded such that it denotes a liveconsonant, a dead consonant that may be absorbed into a conjunct, or the half-form of adead consonant (see Figure 9-7).

Figure 9-4. Preventing Conjunct Forms

Figure 9-5. Half-Consonants

��

KAd + ZWNJ + SSHAl → KAd + SSHAn

���� →+ +ZNJW

��

KAd + ZWJ + SSHAl → KAh + SSHAn

��� →+ +ZJW

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 215

Page 8: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.1 Devanagari South and Southeast Asian Scripts

Rendering

Rules for Rendering. The following provides more formal and detailed rules for minimalrendering of Devanagari as part of a plain text sequence. It describes the mapping betweenUnicode characters and the glyphs in a Devanagari font. It also describes the combiningand ordering of those glyphs.

These rules provide minimal requirements for legibly rendering interchanged Devanagaritext. As with any script, a more complex procedure can add rendering characteristics,depending on the font and application.

It is important to emphasize that in a font that is capable of rendering Devanagari, the set ofglyphs is greater than the number of Devanagari Unicode characters.

Notation. In the next set of rules, the following notation applies:

Cn Nominal glyph form of consonant C as it appears in the codecharts.

Cl A live consonant, depicted identically to Cn.

Cd Glyph depicting the dead consonant form of consonant C.

Ch Glyph depicting the half-consonant form of consonant C.

Ln Nominal glyph form of a conjunct ligature consisting of two ormore component consonants. A conjunct ligature composed oftwo consonants X and Y is also denoted X.Yn.

RAsup A nonspacing combining mark glyph form of the U+0930 positioned above or attached to theupper part of a base glyph form. This form is also known asrepha.

RAsub A nonspacing combining mark glyph form of the U+0930 positioned below or attached to thelower part of a base glyph form.

Vvs Glyph depicting the dependent vowel sign form of a vowel V.

Figure 9-6. Independent Half-Forms

Figure 9-7. Consonant Forms

��

GAd + ZWJ → GAh

�→+ ZJW

� ��+ → �

� �+ → �+ ZJW

� �→ KAl

KAd

KAh

216 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 9: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.1 Devanagari

VIRAMAn The nominal glyph form nonspacing combining mark depict-ing U+094D .

• A virama character is not always depicted; when it is depicted, it adopts thisnonspacing mark form.

Dead Consonant Rule. The following rule logically precedes the application of any otherrule to form a dead consonant. Once formed, a dead consonant may be subject to otherrules described next.

R1 When a consonant Cn precedes a VIRAMAn , it is considered to be a dead conso-nant Cd . A consonant Cn that does not precede VIRAMAn is considered to be a liveconsonant Cl .

Consonant RA Rules. The character U+0930 takes one of a num-ber of visual forms depending on its context in a consonant cluster. By default, this letter isdepicted with its nominal glyph form (as shown in the code charts). In two contexts, it isdepicted using a nonspacing glyph form that combines with a base letterform.

R2 If the dead consonant RAd precedes either a consonant or an independent vowel,then it is replaced by the superscript nonspacing mark RAsup , which is positioned sothat it applies to the logically subsequent element in the memory representation.

R3 If the superscript mark RAsup is to be applied to a dead consonant and that deadconsonant is combined with another consonant to form a conjunct ligature, then themark is positioned so that it applies to the conjunct ligature form as a whole.

R4 If the superscript mark RAsup is to be applied to a dead consonant that is subse-quently replaced by its half-consonant form, then the mark is positioned so that itapplies to the form that serves as the base of the consonant cluster.

TAn + VIRAMAn → TAd

� ��+ → �

����

���

RAd + KAl → KAl + RAsup

�+ → +

��

RAd + RAd → RAd + RAsup

+ → + ��

1 2 12

�→

DisplayedOutput

���

RAd + JAd + NYAl → J.NYAn + RAsup

+ → + + ��� �→

DisplayedOutput

���

RAd + GAd + GHAl → GAh + GHAl + RAsup

+ → + + ��� �+ ��→

DisplayedOutput

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 217

Page 10: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.1 Devanagari South and Southeast Asian Scripts

R5 In conformance with the ISCII standard, the half-consonant form RRAh is repre-sented as eyelash-RA. This form of RA is commonly used in writing Marathi.

R5a For compatibility with The Unicode Standard, Version 2.0, if the dead consonantRAd precedes , then the half -consonant form RAh , depicted aseyelash-RA, is used instead of RAsup .

R6 Except for the dead consonant RAd , when a dead consonant Cd precedes the liveconsonant RAl , then Cd is replaced with its nominal form Cn , and RA is replacedby the subscript nonspacing mark RAsub , which is positioned so that it appliesto Cn .

R7 For certain consonants, the mark RAsub may graphically combine with the conso-nant to form a conjunct ligature form. These combinations, such as the one shownhere, are further addressed by the ligature rules described shortly.

R8 If a dead consonant (other than RAd ) precedes RAd , then the substitution of RAfor RAsub is performed as described above; however, the VIRAMA that formedRAd remains so as to form a dead consonant conjunct form.

A dead consonant conjunct form that contains an absorbed RAd may subsequentlycombine to form a multipart conjunct form.

��

RRAn + VIRAMAn → RRAh

+ →��

��

RAd + ZWJ → RAh

→+ ZJW �

���

THAd + RAl → THAn + RAsub

�+ → + � ��→

DisplayedOutput

���

PHAd + RAl → PHAn + RAsub

�+ → + � �→

DisplayedOutput

�� ���

TAd + RAd → TAn + RAsub + VIRAMAn → T.RAd

+ → + � →�+ ��

�� � �

T.RAd + YAl → T.R.YAn

+ → �

218 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 11: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.1 Devanagari

Modifier Mark Rules. In addition to vowel signs, three other types of combining marksmay be applied to a component of an orthographic syllable or to the syllable as a whole:nukta, bindus, and svaras.

R9 The nukta sign, which modifies a consonant form, is placed immediately after theconsonant in the memory representation and is attached to that consonant in ren-dering. If the consonant represents a dead consonant, then NUKTA should precedeVIRAMA in the memory representation.

R10 The other modifying marks, bindus and svaras, apply to the orthographic syllable asa whole and should follow (in the memory representation) all other characters thatconstitute the syllable. In particular, the bindus should follow any vowel signs, andthe svaras should come last. The relative placement of these marks is horizontalrather than vertical; the horizontal rendering order may vary according to typo-graphic concerns.

Ligature Rules. Subsequent to the application of the rules just described, a set of rules gov-erning ligature formation apply. The precise application of these rules depends on theavailability of glyphs in the current font(s) being used to display the text.

R11 If a dead consonant immediately precedes another dead consonant or a live conso-nant, then the first dead consonant may join the subsequent element to form a two-part conjunct ligature form.

R12 A conjunct ligature form can itself behave as a dead consonant and enter into further,more complex ligatures.

A conjunct ligature form can also produce a half-form.

� ��

KAn + NUKTAn + VIRAMAn → QAd

+ → ��+ �

KAn + AAvs + CANDRABINDUn

+ → � !+ !

"��

JAd + NYAl → J.NYAn

→+ ��� #

TTAd + TTHAl → TT.TTHAn

→+ �

$�%+ ��%&%�%→ $�%+ �%→ 'SAd + TAd + RAn → SAd + T.R.An → S.T.RAn

(�K.SSHAd + YAl → K.SSHh + YAn

→+ ��

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 219

Page 12: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.1 Devanagari South and Southeast Asian Scripts

R13 If a nominal consonant or conjunct ligature form precedes RAsub as a result of theapplication of rule R2, then the consonant or ligature form may join with RAsub toform a multipart conjunct ligature (see rule R2 for more information).

R14 In some cases, other combining marks will also combine with a base consonant,either attaching at a nonstandard location or changing shape. In minimal renderingthere are only two cases, RAl with Uvs or UUvs .

Memory Representation and Rendering Order. The order for storage of plain text inDevanagari and all other Indic scripts generally follows phonetic order; that is, a CV sylla-ble with a dependent vowel is always encoded as a consonant letter C followed by a vowelsign V in the memory representation. This order is employed by the ISCII standard andcorresponds with both the phonetic and keying order of textual data (see Figure 9-8).

Because Devanagari and other Indic scripts have some dependent vowels that must bedepicted to the left side of their consonant letter, the software that renders the Indic scriptsmust be able to reorder elements in mapping from the logical (character) store to the pre-sentational (glyph) rendering. For example, if Cn denotes the nominal form of consonantC, and Vvs denotes a left-side dependent vowel sign form of vowel V, then a reordering ofglyphs with respect to encoded characters occurs as just shown.

R15 When the dependent vowel Ivs is used to override the inherent vowel of a syllable, it isalways written to the extreme left of the orthographic syllable. If the orthographic syl-lable contains a consonant cluster, then this vowel is always depicted to the left of thatcluster. For example:

Sample Half-Forms. Table 9-1 shows examples of half-consonant forms that are com-monly used with the Devanagari script. These forms are glyphs, not characters. They maybe encoded explicitly using as shown; in normal conjunct formation,they may be used spontaneously to depict a dead consonant in combination with subse-quent consonant forms.

Figure 9-8. Rendering Order

KAn + RAsub → K.RAn

+ � )→ �

PHAn + RAsub → PH.RAn

+ � �→

RAl + Uvs → RUn

+ * +→ �

RAl + UUvs → RUUn

+ , -→

KAn + Ivs → Ivs + KAn

� ���+ →

Character Order Glyph Order

���

TAd + RAl + Ivs → T.RAn + Ivs → Ivs + T.RAd

+ → + →� ����+

220 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 13: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.1 Devanagari

Sample Ligatures. Table 9-2 shows examples of conjunct ligature forms that are commonlyused with the Devanagari script. These forms are glyphs, not characters. Not every writingsystem that employs this script uses all of these forms; in particular, many of these formsare used only in writing Sanskrit texts. Furthermore, individual fonts may provide fewer ormore ligature forms than are depicted here.

Table 9-1. Sample Half-Forms

� � . /0 1 2 3� � � 4� 5 6 78 9 : ;� < = >? � @� A B CD E F G� H I JK L � M� N $ O

Table 9-2. Sample Ligatures

� � � " � #� � P � � Q� � ) R � S� � R R TU � V R W XU 0 Y � � ZU � [ � � �U � \ . . ]� � ^ � � �� � � I � _` � a b = c` ` d b � e` � f b B g

� ZJW � Z

JW

� ZJW � Z

JW

� ZJW � Z

JW

� ZJW � Z

JW

� ZJW � Z

JW

� ZJW � Z

JW

� ZJW � � Z

JW

� ZJW � Z

JW

� ZJW � Z

JW

� ZJW � Z

JW

� ZJW � Z

JW

� ZJW � Z

JW

� �� �� �� �� �� �� �� �� �� �� �� �� �

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 221

Page 14: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.1 Devanagari South and Southeast Asian Scripts

Sample Half-Ligature Forms. In addition to half-form glyphs of individual consonants,half-forms are also used to depict conjunct ligature forms. A sample of such forms is shownin Table 9-3. These forms are glyphs, not characters. They may be encoded explicitly using as shown; in normal conjunct formation, they may be used spontane-ously to depict a conjunct ligature in combination with subsequent consonant forms.

Combining Marks. Devanagari and other Indic scripts have a number of combining marksthat could be considered diacritic. One class of these marks, known as bindus, is repre-sented by U+0901 and U+0902 -

. These marks indicate nasalization or final nasal closure of a syllable. U+093C is a true diacritic. It is used to extend the basic set of consonantletters by modifying them (with a subscript dot in Devanagari) to create new letters.U+0951..U+0954 are a set of combining marks used in transcription of Sanskrit texts.

Digits. Each Indic script has a distinct set of digits appropriate to that script. These digitsmay or may not be used in ordinary text in that script. European digits have displaced theIndic script forms in modern usage in many of the scripts. Some Indic scripts—notablyTamil—lack a distinct digit for zero.

Punctuation and Symbols. U+0964 is similar to a full stop. Corre-sponding forms occur in many other Indic scripts. U+0965

marks the end of a verse in traditional texts.

Many modern languages written in the Devanagari script intersperse punctuation derivedfrom the Latin script. Thus U+002C and U+002E are freely used inwriting Hindi, and the danda is usually restricted to more traditional texts.

Encoding Structure. The Unicode Standard organizes the nine principal Indic scripts inblocks of 128 encoding points each. The first six columns in each script are isomorphicwith the ISCII-1988 encoding, except that the last 11 positions (U+0955..U+095F in

` 6 h b F i` : j b k` = l � +` � m � -` F n $ � '" " o

Table 9-3. Sample Half-Ligature Forms

� � (� �� � p� � �I � q

Table 9-2. Sample Ligatures (Continued)

� �� r%� s� t� ��

� � ZJW

� � ZJW

� � ZJW

� � ZJW

� � ZJW

222 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 15: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.1 Devanagari

Devanagari, for example), which are unassigned or undefined in ISCII-1988, are used inthe Unicode encoding.

The seventh column in each of these scripts, along with the last 11 positions in the sixthcolumn, represent additional character assignments in the Unicode Standard that arematched across all nine scripts. For example, positions U+xx66..U+xx6F and U+xxE6..U+xxEF code the Indic script digits for each script.

The eighth column for each script is reserved for script-specific additions that do not cor-respond from one Indic script to the next.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 223

Page 16: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.2 Bengali South and Southeast Asian Scripts

9.2 Bengali

Bengali: U+0980–U+09FFThe Bengali script is a North Indian script closely related to Devanagari. It is used to writethe Bengali language primarily in West Bengal state (India) and in the nation of Ban-gladesh. It is also used to write Assamese in Assam (India) and a number of other minoritylanguages (Daphla, Garo, Hallam, Khasi, Manipuri, Mizo, Munda, Naga, Rian, and Santali)in northeastern India.

Two-Part Vowel Signs. The Bengali script, along with a number of other Indic scripts,makes use of two-part vowel signs; in these vowels one-half of the vowel is placed on eachside of a consonant letter or cluster—for example, U+09CB andU+09CC . The vowel signs are coded in each case in the position inthe charts isomorphic with the corresponding vowel in Devanagari. Hence U+09CC -

is isomorphic with U+094C . To providecompatibility with existing implementations of the scripts that use two-part vowel signs,the Unicode Standard explicitly encodes the right half of these vowel signs; for example,U+09D7 represents the right-half glyph component ofU+09CC .

Special Characters. U+09F2..U+09F9 are a series of Bengali additions for writing currencyand fractions.

Rendering Behavior. For rendering of the Bengali script, see the rules for rendering inSection 9.1, Devanagari.

224 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 17: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.3 Gurmukhi

9.3 Gurmukhi

Gurmukhi: U+0A00–U+0A7FThe Gurmukhi script is a North Indian script historically derived from an older scriptcalled Lahnda. It is quite closely related to Devanagari structurally. Gurmukhi is used towrite the Punjabi language in the Punjab in India.

Rendering Behavior. For rendering of the Gurmukhi script, see the rules for rendering inSection 9.1, Devanagari.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 225

Page 18: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.4 Gujarati South and Southeast Asian Scripts

9.4 Gujarati

Gujarati: U+0A80–U+0AFFThe Gujarati script is a North Indian script closely related to Devanagari. It is most obvi-ously distinguished from Devanagari by not having a horizontal bar for its letterforms, acharacteristic of the older Kaithi script to which Gujarati is related. The Gujarati script isused to write the Gujarati language of the Gujarat state in India.

Rendering Behavior. For rendering of the Gujarati script, see the rules for rendering inSection 9.1, Devanagari.

226 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 19: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.5 Oriya

9.5 Oriya

Oriya: U+0B00–U+0B7FThe Oriya script is a North Indian script structurally similar to Devanagari, but with semi-circular lines at the top of most letters instead of the straight horizontal bars of Devanagari.The actual shapes of the letters, particularly for vowel signs, show similarities to Tamil. TheOriya script is used to write the Oriya language, of Orissa state, India, as well as minoritylanguages such as Khondi and Santali.

Special Characters. U+0B57 is provided as an encoding for theright side of the surroundrant vowel U+0B4C .

Rendering Behavior. For rendering of the Oriya script, see the rules for rendering inSection 9.1, Devanagari.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 227

Page 20: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.6 Tamil South and Southeast Asian Scripts

9.6 Tamil

Tamil: U+0B80–U+0BFFThe Tamil script is a South Indian script. South Indian scripts are structurally related to theNorth Indian scripts, but they are used to write the Dravidian languages of southern Indiaand of Sri Lanka, which are genetically unrelated to the North Indian languages such asHindi, Bengali, and Gujarati. The shapes of letters in the South Indian scripts are generallyquite distinct from the shapes of letters in Devanagari and its related scripts. This differenceis partly a result of the fact that the South Indian scripts were originally carved with needleson palm leaves, a technology that apparently favored rounded letter shapes rather thansquare, blocklike shapes.

The Tamil script is used to write the Tamil language of Tamil Nadu state in India as well asminority languages such as Badaga. Tamil is also used in Sri Lanka, Singapore, and parts ofMalaysia. This script has fewer consonants than the other Indic scripts. It also lacks con-junct consonant forms. Instead of conjunct consonant forms, the virama (U+0BCD) isnormally fully depicted in Tamil text.

Naming Conventions for Mid Vowels. The Unicode character encoding for Tamil uses adistinct set of naming conventions for mid vowels in the South Indian (Dravidian) scripts.These conventions are illustrated by U+0B8E and U+0B8F

, to be contrasted with the isomorphic positions in Devanagari: U+090E

and U+090F . The Dravidian languages have a reg-ular length distinction in the mid vowels that is not reflected in normal Devanagari.U+090E is an addition to Devanagari to enable transcriptionof the Dravidian short vowel forms. The naming conventions are chosen to best reflect theactual nature of the vowels in question in the Dravidian scripts, as well as in Devanagariand the other North Indian scripts.

Special Characters. U+0BD7 is provided as an encoding for theright side of the surroundrant (or two-part) vowel U+0BCC .

Rendering of Tamil Script. The South Indic scripts function in much the same way asDevanagari, with the additional feature of two-part vowels. As in the Devanagari example,the words “ ” and “ ” will be omitted where this deletiondoes not cause ambiguity.

It is important to emphasize that in a font that is capable of renderingTamil, the set of glyphs is greater than the number of Tamil characters.

Table 9-4 is a summary of the Tamil letters.

228 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 21: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.6 Tamil

Table 9-4. Tamil Letter Summary

Independent Versus Dependent Vowels. As with Devanagari, the dependent vowel signsare not equivalent to a sequence of virama + independent vowel. For example:

As in the case of Devanagari, a consonant cluster is any sequence of one or more consonantsseparated by viramas, possibly terminated with a virama.

Two-Part Vowels. Certain Indic vowels consist of two discontiguous elements. As in othercases of discontiguous elements, two sequences of Unicode values can be used to expressequivalent spellings. This representation is similar to the case of letters such as “â”, whichcan be spelled either with “a” followed by a nonspacing “ ��” or with a single Unicode char-acter “â”.

Note that the � in the third example is not U+0BB3 ; it is U+0BD7 .

If the precomposed forms are used in the memory representation instead of the separatecharacters, then a similar transformation occurs in the rendering process. The precom-posed form on the left is transformed into the two separate forms equivalent to those onthe right, which are then subject to vowel reordering, as shown below. Thus in rendering:

� � � � � � � � �KA NGA CA JA NYA TTA NNA TA NA NNNA PA

� � � � � � � � � �MA YA RA RRA LA LLA LLLA VA SSA SA HA

� � � � � � � � � ! "A AA I II U UU E EE AI O OO AU

# $ �% �&|' �( |) *� +� ,� *�# +�# *��A AA I II U UU E EE AI O OO AU

�-- �VIRAMA AU

LENGTH

� + $ ≠ � + �--� + �

*�# (0BCA).≈ *� + # (0BC6 + 0BBE)

+�# (0BCB) ≈ +� + # (0BC7 + 0BBE)

*�� (0BCC) ≈ *� + � (0BC7 + 0BD7)

*�# → *� + #+�# → +� + #*�� → *� + �

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 229

Page 22: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.6 Tamil South and Southeast Asian Scripts

Vowel Reordering. As shown in Table 9-5, the following vowels are always reordered infront of the previous consonant cluster, similar to the rendering behavior of the -

:

The same effect occurs with the results of vowel splitting (see Table 9-6).

In both cases, the ordering of the elements is unambiguous: the consonant (cluster) occursfirst in the memory representation. The vowel " also has two discontinuous parts andcan be composed using the .

Ligatures. The following examples illustrate the range of ligatures available in Tamil. Thesechanges take place after vowel reordering and vowel splitting. Unlike Devanagari, Tamilincludes very few conjunct consonants; most ligatures are located between a vowel and aneighboring consonant.

1. Conjunct consonants.

As with Devanagari, vowel reordering occurs around conjunct consonants. Forexample:

Table 9-5. Vowel Reordering

Memory Representation Display

� *� → *�� +� → +�� ,� → ,�

Table 9-6. Vowel Splitting and Reordering

Memory Representation Display

� *�# → *�#� *� # → *�#� +�# → +�#� +� # → +�#� *�� → *��� *� � → *��

*� (0BC6) +� (0BC7) ,� (0BC8)

� + + � → /�--

� + + � + *� + # → */#�--

230 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 23: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.6 Tamil

2. The vowel # optionally ligates with �,.�, or � on its left:

Because this process takes place after reordering and splitting, the following lig-atures may also occur:

3. The vowel signs $ and �% form ligatures with � on their left.

These vowels often change shape or position slightly to link up with the appro-priate shape of the consonant on their left:

4. The vowel signs �&-'.and �(-) typically change form or ligate (see Table 9-7).

Table 9-7. Ligating Vowel Signs

x x + �& |' x + �( |) x x + �& |' x + �( |)� 0 1 � 2 3� 4 5 6 7� 8 9 � : ;� < <= � > ?� @ A � B B=� C C= � D D= E E= � F G

�. + # → H� + # → I� + # → J

Separate Vowels Precomposed Vowels

�.+ *� + # → *H � +*�# → *H� + +� + # → +H � + +�#→ +H� + *� + # → *I � + *�#→ *I� + +� + # → +I � + +�#→ +I� + *�+ # → *J � + *�#→ *J� + +� + # → +J � + +�#→ +J

� + $ → K� + �% → L

�.+ $ → M� + �% → N

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 231

Page 24: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.6 Tamil South and Southeast Asian Scripts

To the right of �, �, �, �, or /, these forms have a spacing form(see Figure 9-9).

5. The vowel sign ,� changes to O�to the left of �, �, �, or �.

Remember that this change takes place after the vowel reordering. In the firstexample, the vowel ,� follows � in the memory representation. Aftervowel reordering, it is on the left of �, and thus changes form. The com-plete process is

6. The consonant � changes shape to #.

This change occurs when the #. form of.�.U+0BB0 wouldnot be confused with the nominal form # of U+0BBE

(for example, when – is combined with�--, $., or�% ).

P P= � Q Q� R R= � S T

Figure 9-9. Spacing Forms of Vowels

Table 9-7. Ligating Vowel Signs (Continued)

x x + �& |' x + �( |) x x + �& |' x + �( |)G

�.+ �& → �'� + �( → �)

+ � → O� + � → O� + � → O�

,� + � → O�,�,�,�

� + ,� → ,� + � → O�

� + → # + � + $ → # + $� + �%→ # + �%

�-- �--

232 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 25: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.7 Telugu

9.7 Telugu

Telugu: U+0C00–U+0C7FThe Telugu script is a South Indian script used to write the Telugu language of AndhraPradesh state in India, as well as minority languages such as Gondi (Adilabad and Koi dia-lects) and Lambadi.

Rendering Behavior. For rendering of the Telugu script, see the rules for rendering inSection 9.6, Tamil. Take note that, unlike Tamil, the Telugu script writes conjunct conso-nants with subscripted letters. There are also numerous consonant letters with contextualshape changes when used in conjuncts. Some vowel signs also change their shape in speci-fied combinations.

Special Characters. U+0C55 is provided as an encoding for the sec-ond element of the vowel U+0C47 . U+0C56

is provided as an encoding for the second element of the surroundrant vowelU+0C48 . The length marks are both nonspacing characters.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 233

Page 26: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.8 Kannada South and Southeast Asian Scripts

9.8 Kannada

Kannada: U+0C80–U+0CFFThe Kannada script is a South Indian script used to write the Kannada (or Kanarese) lan-guage of Karnataka state, as well as minority languages such as Tulu. It is very closelyrelated to the Telugu script both with regard to the shapes of the letters and in the way thatconjunct consonants behave.

Special Characters. U+0CD5 is provided as an encoding for theright side of the two-part vowel U+0CC7 . U+0CD6

is provided as an encoding for the right side of the two-part vowel U+0CC8 . The Kannada two-part vowels actually consist of a nonspacingelement above the consonant letter and one or more spacing letters to the right of the con-sonant letter.

Kannada Letter LLLA. U+0CDE is actually an obsolete Kannada let-ter that is transliterated in Dravidian scholarship as an “r” marked with the diacriticU+0324 . This form ought to have been given the letter name“”, rather than “”, so the Unicode name is simply a mistake. The letter in questionhas not been actively used in Kannada since the end of the tenth century. Collations shouldtreat U+0CDE as following U+0CB3 .

234 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 27: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.9 Malayalam

9.9 Malayalam

Malayalam: U+0D00–U+0D7FThe Malayalam script is a South Indian script used to write the Malayalam language of Ker-ala state.

The shapes of Malayalam letters closely resemble those of Tamil. Malayalam, however, has avery full and complex set of conjunct consonant forms.

Special Characters. U+0D57 is provided as an encoding forthe right side of the two-part vowel U+0D4C .

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 235

Page 28: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.10 Sinhala South and Southeast Asian Scripts

9.10 Sinhala

Sinhala: U+0D80–U+0DFFThe Sinhala script, also known as Sinhalese, is used to write the Sinhala language, themajority language of Sri Lanka, formerly Ceylon. It is also used to write the Pali and San-skrit languages. The script is a descendant of Brahmi and resembles the scripts of SouthIndia in form and structure.

Sinhala differs from other languages of the region in that it has a series of prenasalized stopsthat are distinguished from the combination of a nasal followed by a stop. In other words,both forms occur and are written differently—for example: ��[U+0D85 U+0DAC] /a.Nda/ “sound” versus ���� [U+0D85 U+0DAB U+0DCA U+0DA9] /a.n.da/ “egg”. Inaddition, Sinhala has separate distinct signs for both a short and long low front vowelsounding similar to the initial vowel of the English word “apple,” usually represented inIPA as U+00E6 æ (ash). The independent forms of these vowels are encoded at U+0D87and U+0D88; the corresponding dependent forms are U+0DD0 and U+0DD1.

Because of these extra letters, the encoding for Sinhala does not precisely follow the patternestablished for the other Indic scripts (for example, Devanagari), but does use the samegeneral structure, making use of phonetic order, matra reordering, and use of the virama(U+0DCA -,) to indicate conjunct consonant clusters. Sinhaladoes not use half-forms in the Devanagari manner, but does use many ligatures.

Other Letters for Tamil. The Sinhala script may also be used to write Tamil. In this case,some additional combinations may be required. Some letters, such as U+0DBB

and U+0DB1 , may be modified by adding the equiva-lent of a nukta. There is, however, no nukta presently encoded in the Sinhala block.

Historical Symbols. Neither the Sinhala numerals nor the punctuation sign Kunddaliya(U+0DF4) is in general use today, having been replaced by Western digits and Western-style punctuation. The Kunddaliya was formerly used as a full stop; it is included for schol-arly use. The Sinhala numerals are not presently encoded.

236 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 29: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.11 Thai

9.11 Thai

Thai: U+0E00–U+0E7FThe Thai script is used to write Thai and other Southeast Asian languages, such as Kuy,Lavna, and Pali. It is a member of the Indic family of scripts descended from Brahmi. Thaimodifies the original Brahmi letter shapes and extends the number of letters to accommo-date features of the Thai language, including tone marks derived from superscript digits.On the other hand, Thai script lacks the conjunct consonant mechanism and independentvowel letters found in most other Bhrami-derived scripts. As in all scripts of this family, thepredominant writing direction is left to right.

The Lao script is closely related to Thai, and the encoding principles described in this sec-tion apply to the Lao encoding as well.

Standards. Thai layout in the Unicode Standard is based on the Thai Industrial Standard620-2529, and its updated version 620-2533.

Encoding Principles. In common with the Indic scripts, each Thai letter is a consonantpossessing an inherent vowel sound. Thai letters further feature inherent tones. The inher-ent vowel and tone can be modified by means of vowel signs and tone marks attached to thebase consonant letter. Some of the vowel signs and all of the tone marks are rendered in thescript as diacritics attached above or below the base consonant. These combining signs andmarks are encoded after the modified consonant in the memory representation.

Most of the Thai vowel signs are rendered by full letter-sized in-line glyphs placed eitherbefore (that is, to the left of) or after (to the right of) or around (on both sides of) theglyph for the base consonant letter. In the Thai encoding, the letter-sized glyphs that areplaced before (left of) the base consonant letter, in full or partial representation of a vowelsign, are in fact encoded as separate characters that are typed and stored before the baseconsonant character. This encoding for left-side Thai vowel sign glyphs (and similarly inLao) differs from the conventions for all other Indic scripts, which uniformly encode allvowels after the base consonant. The difference is necessitated by the encoding practicecommonly employed with Thai character data as represented by the Thai Industrial Stan-dard.

Thai Punctuation. Thai uses a variety of punctuation marks particular to this script.U+0E4F is the Thai bullet, used to mark items in lists, orappearing at the beginning of a verse, sentence, paragraph, or other textual segment.U+0E46 is used to mark repetition of preceding letters.U+0E2F is used to indicate ellision or abbreviation of letters;it is itself viewed as a kind of letter, however, and is used with considerable frequencybecause of its appearance in such words as the Thai name for Bangkok. Paiyannoi is alsoused in combination (U+0E2F U+0E25 U+0E2F) to create a construct called paiyanyai,which means “et cetera, and so forth.” The Thai paiyanyai is comparable to the isomorphicanalogue in the Khmer script: U+17D8 .

U+0E5A is used to mark the end of a long segment of text.It can be combined with a following U+0E30 to mark a larger seg-ment of text; typically this usage can be seen at the end of a verse in poetry. U+0E5B

marks the end of a chapter or document, where it always follows theangkhankhu + sara a combination. The Thai angkhankhu and its combination with sara ato mark breaks in text have analogues in many other Brahmi-derived scripts. For example,they are closely related to U+17D4 and U+17D5

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 237

Page 30: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.11 Thai South and Southeast Asian Scripts

, which are themselves ultimately related to the danda and double danda ofDevanagari.

Thai words are not separated by spaces. Text is laid out with spaces introduced at text seg-ments where Western typography would typically make use of commas or periods. How-ever, Latin-based punctuation such as comma, period, and colon are also used in text,particularly in conjunction with Latin letters, or in formatting numbers, addresses, and soforth. If word boundary indications are desired—for example, for the use of automatic linelayout algorithms—the character U+200B should be used to placeinvisible marks for such breaks. The can grow to have a visible widthwhen justified.

Thai Transcription of Pali and Sanskrit. The Thai script is frequently used to write Paliand Sanskrit. When so used, consonant clusters are represented by the explicit use ofU+0E3A (virama) to mark the removal of the inherent vowel.There is no conjoining behavior, unlike other Indic scripts. U+0E4D

is the Pali nigghahita and Sanskrit anusvara. U+0E30 isthe Sanskrit visarga. U+0E24 and U+0E26 arevocalic /r/ and /l/, with U+0E45 used to indicate theirlengthening.

238 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 31: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.12 Lao

9.12 Lao

Lao: U+0E80–U+0EFFThe Lao language and script are closely related to Thai. The Unicode Standard encodes theLao script in the same relative order as Thai.

A few additional letters in Lao have no match in Thai:

U+0EBB

U+0EBC

U+0EBD

The preceding two semivowel signs are the last remnants of the system of subscript medials,which in Myanmar also includes original “rw”. Myanmar and Khmer include a full set ofsubscript consonant forms used for conjuncts. Thai no longer uses any of these forms; Laohas just the two.

There are also two ligatures in the Unicode character encoding for Lao: U+0EDC

and U+0EDD . They correspond to sequences of [h] plus [n] or [h] plus [m]without ligating. Their function in Lao is to provide versions of the [n] and [m] consonantswith a different inherent tonal implication.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 239

Page 32: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.13 Tibetan South and Southeast Asian Scripts

9.13 Tibetan

Tibetan: U+0F00–U+0FBFThe Tibetan script is used for writing Tibetan in several countries and regions throughoutthe Himalayas. Aside from Tibet itself, the script is used in Ladakh, Nepal, and northernareas of India bordering Tibet where large Tibetan-speaking populations now reside. TheTibetan script is also used in Bhutan to write Dzongkha, the official language of that coun-try. Tibetan is also used as the language of philosophy and liturgy by Buddhist traditionsspread from Tibet into the Mongolian cultural area that encompasses Mongolia, Buriatia,Kalmykia, and Tuva.

The Tibetan scripting and grammatical systems were originally defined together in thesixth century by royal decree when the Tibetan King Songtsen Gampo sent 16 men to Indiato study Indian languages. One of those men, Thumi Sambhota, is credited with creatingthe Tibetan writing system upon his return, having studied various Indic scripts and gram-mars. The king’s primary purpose was to bring Buddhism from India to Tibet. The newscript system was therefore designed with compatibility extensions for Indic (principallySanskrit) transliteration so that Buddhist texts could be properly represented. Because ofthis origin, over the last 1,500 years the Tibetan script has also been widely used to repre-sent Indic words, a number of which have been adopted into the Tibetan language retain-ing their original spelling.

A note on Latin transliteration: Tibetan spelling is traditional, and does not generallyreflect modern pronunciation. Throughout this section, Tibetan words are represented initalics when transcribed as spoken, followed at first occurrence by a parenthetical translit-eration; in these transliterations, presence of the tsek (tsheg) character is expressed with ahyphen.

Thumi Sambhota’s original grammar treatise defined two script styles. The first, calleduchen (dbu-can, “with head”), is a formal “inscriptional capitals” style said to be based onan old form of Devanagri. It is the script used in Tibetan xylograph books and the one usedin the coding tables. The second style, called u-mey (dbu-med, or “headless”), is more cur-sive and said to be based on the Wartu script. Numerous styles of u-mey have evolved sincethen, including both formal calligraphic styles used in manuscripts and running handwrit-ing styles. All Tibetan scripts follow the same lettering rules, though there is a slight differ-ence in the way that certain compound stacks are formed in uchen and u-mey.

General Principles of the Tibetan Script. Tibetan grammar divides letters into consonantsand vowels. There are 30 consonants, and each consonant is represented by a discrete writ-ten character. There are five vowel sounds, only four of which are represented by writtenmarks. The four vowels that are explicitly represented in writing are each represented witha single mark that is applied above or below a consonant to indicate the application of thatvowel to that consonant. The absence of one of the four marks implies that the first vowelsound (like a short “ah” in English) is present and is not modified to one of the four otherpossibilities. Three of the four marks are written above the consonants; one is writtenbelow.

Each word in Tibetan has a base or root consonant. The base consonant can be written sin-gly or it can have other consonants added above or below it to make a vertically “stacked”letter. Tibetan grammar contains a very complete set of rules regarding letter gender, andthese rules dictate which letters can be written in adjacent positions. The rules thereforedictate which combinations of consonants can be joined to make stacks. Any combinationnot allowed by the gender rules does not occur in native Tibetan words. However, when

240 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 33: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.13 Tibetan

transcribing other languages (for example, Sanskrit, Chinese) into Tibetan, these rules donot operate. In certain instances, other than transliteration, any consonant may be com-bined with any other subjoined consonant. Implementations should therefore be preparedto accept and display any combinations.

The model adopted to encode the Tibetan lettering set described above contains the follow-ing groups of items: Tibetan consonants, vowels, numerals, punctuation, ornamental signsand marks, and Tibetan-transliterated Sanskrit consonants and vowels. Each of these willbe described below.

Both in this description and in Tibetan, the terms “subjoined” (-btags) and “head” (-mgo)are used in different senses. In the structural sense, they indicate specific slots defined innative Tibetan orthography. In spatial terms, they refer to the position in the stack; any-thing in topmost position is “head,” anything not in topmost position is “subjoined.”Unless explicitly qualified, the terms “subjoined” and “head” are used here in their spatialsense. For example, in a conjunct like “rka” the letter in the root slot is “KA”, but, because itis not the topmost letter of the stack, it is expressed with a subjoined character code, while“RA”, which is structurally in the head slot, is expressed with a nominal character code. Onthe other hand, in a conjunct “kra” in which the root slot is also occupied with “KA”, the“KA” is encoded with a nominal character code, because it is in the topmost position in thestack.

The Tibetan script has its own system of formatting and details of that system relevant tothe characters encoded in this standard are explained herein. However, an increasing num-ber of publications in Tibetan do not strictly adhere to this original formatting system. Thischange is due to the partial move from publishing on long, horizontal, loose-leaf folios, topublishing in vertically oriented, bound books. The Tibetan script also has a punctuationset designed to meet needs quite different from the punctuation that has evolved for West-ern scripts. With the appearance of Tibetan newspapers, magazines, school textbooks, andWestern style reference books in the last 20 or 30 years, Tibetans have begun using thingslike columns, indented blocks of text, Western-style headings, and footnotes. Some West-ern punctuation marks, including brackets, parentheses, and quotation marks, are alsobecoming commonplace in these kinds of publication. With the introduction of moresophisticated electronic publishing systems, there is also a renaissance in the publication ofvoluminous religious and philosophical works in the traditional horizontal, loose-leaf for-mat—many set in digital typefaces closely conforming to the proportions of traditionalhand-lettered text.

Consonants. The system devised to encode the Tibetan system of writing consonants inboth single and stacked forms as described above is as follows:

All of the consonants are encoded a first time from U+0F40 through U+0F69. There are thebasic Tibetan consonants and, in addition, six compound consonants used to represent theIndic consonants gha, jha, d.ha, dha, bha, and ksh.a. These codes are used to representoccurrences of either a stand-alone consonant or a consonant in the head position of a ver-tical stack. Glyphs generated from these codes will always sit in the normal position startingat and dropping down from the design baseline. All of the consonants are then encoded asecond time. These second encodings from U+0F90 through U+0FB9 represent conso-nants in subjoined stack position.

To represent a single consonant in a text stream, one of the first, “nominal,” set of codes isplaced. To represent a stack of consonants in the text stream, a “nominal” consonant codeis followed directly by one or more of the subjoined consonant codes. The stack so formedcontinues for as long as subjoined consonant codes are contiguously placed.

This encoding method was chosen over an alternative method that would have involved avirama-based encoding, like Devanagari. There were two main reasons for this choice.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 241

Page 34: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.13 Tibetan South and Southeast Asian Scripts

First, the virama is not normally used in the Tibetan writing system to create letter combi-nations. (There is a virama in the Tibetan script, but only because of the need to representDevanagari; it is called “srog-med” and encoded at U+0F84. The virama is never used inwriting Tibetan words and can be, but is almost never, used as a substitute for stacking inwriting Sanskrit mantras in the Tibetan script.) Second, there is a prevalence of stacking innative Tibetan, and the model chosen specifically results in decreased data storage require-ments. Furthermore, in languages other than Tibetan, there are many cases where stacksoccur that do not occur in Tibetan-language texts; it is thus imperative to have a model thatallows for any consonant to be stacked with any subjoined consonant(s). Thus a model forstack building was chosen that follows the Tibetan approach to creating letter combina-tions, but is not limited to a specific set of the possible combinations.

Vowels. Each of the four basic Tibetan vowel marks mentioned above is coded as a separateentity. They are U+0F72, U+0F74, U+0F7A, and U+0F7C. For compatibility, a set of sev-eral compound vowels for Sanskrit transcription is also provided in the other codepointsbetween U+0F71 and U+0F7D. Most Tibetan users do not view these compound vowels assingle characters, and their use is limited to Sanskrit words. It is acceptable for users toenter these compounds as a series of simpler elements and have software render themappropriately. Canonical equivalences are specified for all except U+0F77 and U+0F79. Allvowel signs are nonspacing marks above or below a stack of consonants, sometimes onboth sides.

A stand-alone consonant or a stack of consonants can have a vowel sign applied to it. Inaccordance with the rules of Tibetan writing, a code for a vowel sign applied to a consonantshould always be placed after the bare consonant or the stack of consonants formed by themethod just described.

All of the symbols and punctuation marks have straightforward encodings. Further infor-mation about many of them is included below.

Coding Order. In general, the correct coding order for a stream of text will be the same asthe order in which Tibetans spell and in which the characters of the text would be writtenby hand. For example, the correct coding order for the most complex Tibetan stack wouldbe:

head position consonant

first subjoined consonant

... (intermediate subjoined consonants if any)

last subjoined consonant

subjoined vowel a-chung (U+0F71)

standard or compound vowel sign, or virama

Where used, the character U+0F39 - occurs immediately after theconsonant it modifies.

Allographical Considerations. When consonants are combined to form a stack, one ofthem retains the status of being the principal consonant in the stack. The principal conso-nant always retains its stand-alone form. However, consonants placed in the “head” and“subjoined” positions to the main consonant sometimes retain their stand-alone form andsometimes are given a new, special form. Because of this fact, certain of the consonants aregiven a further, special encoding treatment. The affected consonants are “wa” (U+0F5D),“ya” (U+0F61), and “ra” (U+0F62).

Head Position “ra”. When the consonant “ra” is written in the “head” position (ra-mgo) atthe top of a stack in the normal Tibetan-defined lettering set, the shape of the consonantcan change. It can either be a full-form shape or the full-form shape but with the bottomstroke removed (looking like a short-stemmed letter “T”). This requirement of “ra” in the

242 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 35: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.13 Tibetan

head position where the glyph representing it can change shape is correctly coded by usingthe stand-alone “ra” consonant (U+0F62) followed by the appropriate subjoined conso-nant(s). For example, in the normal Tibetan ra-mgo combinations, the “ra” in the headposition is mostly written as the half-ra but in the case of “ra + sub-joined nya” must bewritten as the full-form “ra”. Thus the normal Tibetan ra-mgo combinations are correctlyencoded with the normal “ra” consonant (U+0F62) because it can change shape asrequired. It is the responsibility of the font developer to provide the correct glyphs for rep-resenting the characters where the “ra” in the head position will change shape—for exam-ple, as in “ra + subjoined nya”.

Full-Form “ra” in Head Position. Some instances of “ra” in the head position require thatthe consonant be represented as a full-form “ra” that never changes. This is not standardusage for the Tibetan language itself, but occurs in transliteration and transcription. Inthese cases, only the code U+0F6A - should be used. This“ra” will always be represented as a full-form “ra consonant” and will never change shapeto the form where the lower stroke has been cut off. For example, the letter combination“ra + ya” when appearing in transliterated Sanskrit works is correctly written with a full-form “ra” followed by either a modified subjoined “ya” form or a full-form subjoined “ya”form. Note that the fixed-form “ra” should be used only in combinations where “ra” wouldnormally transform into a short form but the user specifically wants to prevent thatchange. For example, the combination “ra + subjoined nya” never requires the use of fixed-form “ra”, because “ra” normally retains its full glyph form over “nya”. It is the responsibil-ity of the font developer to provide the appropriate glyphs to represent the encodings.

Subjoined Position “wa”, “ya”, and “ra”. All three of these consonants can be written insubjoined position to the main consonant according to normal Tibetan grammar. In thisposition, all of them change to a new shape. The “wa” consonant when written in sub-joined position is not a full “wa” letter any longer but is literally the bottom-right corner ofthe “wa” letter cut off and appended below. For that reason it is called a wazur (wa-zur, or“corner of a wa”) or less frequently, but just as validly, wa-ta (wa-btags) to indicate that it isa subjoined “wa”. The consonants “ya” and “ra” when in the subjoined position are calledya-ta (ya-btags) and ra-ta (ra-btags), respectively. To encode these subjoined consonantsthat follow the rules of normal Tibetan grammar, the shape-changed, subjoined formsU+0F5D , U+0F61 , and U+0F62

should be used.

All three of these subjoined consonants also have full-form non-shape-changing counter-parts for the needs of transliterated and transcribed text. For this purpose, the full sub-joined consonants that do not change shape (encoded at U+0FBA, U+0FBB, and U+0FBC,respectively) are used where necessary. The combinations of “ra + ya” are a good examplebecause they include instances of “ra” taking a short (ya-btags) form and “ra” taking a full-form subjoined “ya”.

U+0FB0 - (a-chung) should be used only in the very rarecases where a full-sized subjoined a-chung letter is required. The small vowel lengtheninga-chung encoded as U+0F71 is far more frequently used inTibetan text, and it is therefore recommended that implementations treat this characterrather than U+0FB0 as the normal subjoined a-chung.

Line-Breaking Considerations. Tibetan text separates units called natively “tsheg-bar”, aninexact translation of which is “syllable.” Tsheg-bar is literally the unit of text between tseks,and is generally a consonant cluster with all of its prefixes, suffixes, and vowel signs. It is nota “syllable” in the English sense.

Tibetan script has two break characters only. The primary break character is the standardinterword tsek (“tsheg”), which is encoded at U+0F0B. The other break character is thespace. Space or tsek characters in a stream of Tibetan text are not always break characters

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 243

Page 36: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.13 Tibetan South and Southeast Asian Scripts

and so need proper contextual handling. Issues surrounding these two potential breakcharacters will now be discussed.

The primary delimiter character in Tibetan text is the tsek (U+0F0B -

). In general, automatic line-breaking processes may break after any occur-rence of this tsek, except where it follows a U+0F44 (with or withouta vowel sign) and precedes a shay (U+0F0D), or where Tibetan grammatical rules do notpermit a break. (Normally, tsek is not written before shay except after “nga”. This type oftsek-after-nga is called “nga-phye-tsheg”, and may be expressed by U+0F0B, or by the spe-cial character U+0F0C, a nonbreaking form of tsek.) The Unicode names for these twotypes of tsek are misnomers, retained for compatibility. The standard tsek U+0F0B is always required to be a potentially breaking char-acter, whereas the “nga-phye-tsheg” is always required to be a nonbreaking tsek. U+0F0C is specifically not a “delimiter” and is not for gen-eral use.

There are no other break characters in Tibetan text. Unlike English, Tibetan has no systemfor hyphenating or otherwise breaking a word within the group of letters making up theword. Tibetan text formatting does not allow text to be broken within a word.

Whitespace appears in Tibetan text, although it should be represented by U+00A0 -

instead of U+0020 . Tibetan text breaks lines after tsek instead of atwhitespace.

Complete Tibetan text formatting is best handled by a formatter in the application and notjust by the code stream. If the interword and nonbreaking tseks are properly employed asbreaking and nonbreaking characters, respectively, and if all spaces are nonbreaking spaces,then any application will still wrap lines correctly on that basis, even though the breaksmight be sometimes inelegant.

Tibetan Punctuation. The punctuation apparatus of Tibetan is relatively limited. Theprinciple punctuation characters are the tsek already mentioned, the shay (transliterated“shad”), which is a vertical stroke used to mark the end of a section of text, the space usedsparingly as a space, and two of several variant forms of the shay that are used in specializedsituations requiring a shay. There are also several other marks and signs but they are spar-ingly used.

The shay at U+0F0D marks the end of a piece of text called “tshig-grub”. The mode ofmarking bears no commonality with English phrases or sentences and should not bedescribed as a delimiter of phrases. In Tibetan grammatical terms, a shay is used to markthe end of an expression (“brjod.pa”) and a complete expression. Two shays are used at theend of whole topics (“don.tshan”). Because some writers use the double shay with a differ-ent spacing than would be obtained by coding two adjacent occurrences of U+0F0D, thedouble shay has been coded at U+0F0E with the intent that it would have a larger spacingbetween component shays than if two shays were simply written together. However, mostwriters do not use an unusual spacing between the double shay, so the application shouldallow the user to write two U+0F0D codes one after the other. Additionally, font designerswill have to decide whether to implement these shays with a larger than normal gap.

The U+0F11 rin-chen-pung-shay (rin-chen-spung-shad) is a variant shay used in a specific“new-line” situation. Its use was not defined in the original grammars but Tibetan tradi-tion gives it a highly defined use. The drul-shay (“sbrul-shad”) is likewise not defined bythe original grammars but Tibetan tradition gives it a highly defined use. It is used for sep-arating sections of meaning that are equivalent to topics (“don.tshan”) and subtopics. Adrul-shay is usually surrounded on both sides by the equivalent of about three spaces(though there is no rule specified). Hard spaces will be needed for these because the

244 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 37: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.13 Tibetan

drul-shay should not appear at the beginning of a new line and the whole structure of spac-ing-plus-shay should not be broken up, if possible.

Tibetan texts use a yig-go (“head mark,” yig-mgo) to indicate the beginning of the front ofa folio, there being no other certain way, in the loose-leaf style of traditional Tibetan books,to tell which is the front of a page. The head mark can and does vary from text to text; thereare many different ways to write it. The common type of head mark has been provided forwith U+0F04 and its extension U+0F05 . An initial mark yig-go can be written alone or combinedwith as many as three closing marks following it. When the initial mark is written in com-bination with one or more closing marks, the individual parts of the whole must stay inproper registration with each other to appear authentic. Therefore, it is strongly recom-mended that font developers create precomposed ligature glyphs to represent the variouscombinations of these two characters. The less common head marks mainly appear inNyingmapa and Bonpo literature. Three of these head marks have been provided for withU+0F01, U+0F02, and U+0F03; however, many others have not been encoded. Font devel-opers will have to deal with the fact that many types of head marks in use in this literaturehave not been encoded, cannot be represented by a replacement that has been encoded, andwill be required by some users.

Two characters, U+0F3C and U+0F3D

, are paired punctuation, typically used together forming a roof over oneor more digits or words. In this case, kerning or special ligatures may be required for properrendering. The right ang khang may also be used much as a single closing parenthesis isused in forming lists; again, special kerning may be required for proper rendering. Themarks U+0F3E and U+0F3F are pairedsigns used to combine with digits; special glyphs or compositional metrics are required fortheir use.

A set of frequently occurring astrological and religious signs specific to Tibetan is encodedbetween U+0FBE and U+0FCF.

U+0F34 means “et cetera” or “and so on” and is used after the first few tsek-bar of a recur-ring phrase. U+0FBE (often three times) indicates a refrain.

U+0F36 and U+0FBF are used to indicate where text should be inserted within other textor as references to footnotes or marginal notes.

Other Characters. The Wheel of Dharma, which occurs sometimes in Tibetan texts, isencoded in the Miscellaneous Symbols block at U+2638.

Left-facing and right-facing swastika symbols are likewise used. They are found among theChinese ideographs at U+534D (“yung-drung-chi-khor”) and U+5350 (“yung-drung-nang-khor”).

The marks U+0F35 and U+0F37

conceptually attach to a tsek-bar rather than to an individual characterand function more like attributes than characters—like underlining to mark or emphasizetext. In Tibetan interspersed commentaries, they may be used to tag the tsek-bar belongingto the root text that is being commented on. The same thing is often accomplished by set-ting the tsek-bar belonging to the root text in large type and the commentary in small type.Correct placement of these glyphs may be problematic. If they are treated as normal com-bining marks, they can be entered into the text following the vowel signs in a stack; if used,their presence will need to be accounted for by searching algorithms, and so forth.

Tibetan Half-Numbers. The half-number forms (U+0F2A..U+0F33) are peculiar toTibetan, though other scripts (for example, Bengali) have similar fractional concepts. Thevalue of each half-number is 0.5 less than the number within which it appears. These forms

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 245

Page 38: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.13 Tibetan South and Southeast Asian Scripts

are used only in some traditional contexts and appear as the last digit of a multidigit num-ber. The sequence of digits “U+0F24 U+0F2C” represents the number 42.5 or forty-twoand one-half.

Tibetan Transliteration and Transcription of Other Languages. Tibetan traditions are inplace for transliterating other languages. Most commonly, Sanskrit has been the languagebeing transliterated, though Chinese has become more common in modern times. Addi-tionally, Mongolian has a transliterated form. There are even some conventions for translit-erating English. One feature of Tibetan script/grammar is that it allows for totally accuratetransliteration of Sanskrit. The basic Tibetan letterforms and punctuation marks containmost of what is needed, although a few extra things are required. With these additions,Sanskrit can be transliterated perfectly into Tibetan, and the Tibetan transliteration can berendered backward perfectly into Sanskrit with no ambiguities or difficulties.

The six Sanskrit retroflex letters are interleaved among the other consonants.

The compound Sanskrit consonants are not included in normal Tibetan. They could bemade using the method described earlier for Tibetan stacked consonants, generally by sub-joining “ha”. However, to maintain consistency in transliterated texts and for ease in trans-mission and searching, it is recommended that implementations of Sanskrit in the Tibetanscript use the precomposed forms of aspirated letters (and U+0F69 “ka + reversed sha”)whenever possible, rather than implementing these consonants as completely decomposedstacks. Note that implementations must ensure that decomposed stacks and precomposedforms are interpreted equivalently (see Section 3.6, Decomposition). The compound conso-nants are explicitly coded as follows: U+0F93 , U+0F9D , U+0FA2 , U+0FA7 , U+0FAC , andU+0FB9 .

The vowel signs of Sanskrit not included in Tibetan are encoded with other vowel signsbetween U+0F70 and U+0F7D. U+0F7F (nam chay) is thevisarga, and U+0F7E (ngaro) is the anusvara. See Section 9.1,Devanagari, for more information on these two characters.

The characters encoded in the range U+0F88..U+0F8B are used in transliterated text andare most commonly found in Kalachakra literature.

When the Tibetan script is used to transliterate Sanskrit, consonants are sometimes stackedin ways that are not allowed in native Tibetan stacks. Even complex forms of this stackingbehavior are catered for properly by the method described earlier for coding Tibetan stacks.

Other Signs. U+0F09 is a list enumerator used at the startof administrative letters in Bhutan, as is the petition honorific U+0F0A

- .

U+0F3A and U+0F3B arepaired punctuation marks (brackets).

The sign U+0F39 - (tsa-’phru, which is a lenition mark) is theornamental flaglike mark that is an integral part of the three consonants U+0F59

, U+0F5A , and U+0F5B . Althoughthose consonants are not decomposable, this mark has been abstracted and may by itself beapplied to “pha” and other consonants to make new letters for use in transliteration andtranscription of other languages. For example, in modern literary Tibetan, it is one of theways used to transcribe the Chinese “fa” and “va” sounds not represented by the normalTibetan consonants. Tsa-’phru is also used to represent tsa, tsha, or dza in abbreviations.

Traditional Text Formatting and Line Justification. Native Tibetan texts (“pecha”) arewritten and printed using a justification system that is, strictly speaking, right-ragged but

246 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 39: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.13 Tibetan

with an attempt to right justify. Each page has a margin. That margin is usually demarcatedwith visible border lines required of a pecha. In modern times, as Tibetan text is producedin Western-style books, the margin lines may be dropped and an invisible margin used.When writing the text within the margins, an attempt is made to have the lines of text jus-tified up to the right margin. To do so, writers keep an eye on the overall line length as theyfill lines with text and try manually to justify to the right margin. Even then, there is often agap at the right margin that cannot be filled. If the gap is short, it will be left as is and theline will be said to be justified enough, even though by machine-justification standards theline is not truly flush on the right. If the gap is large, the intervening space will be filled withas many tseks as are required to justify the line. Again, the justification is not done perfectlyin the way that English text might be perfectly right-justified; as long as the last tsek is moreor less at the right margin, that will do. The net result is that of a right-justified, blocklikelook to the text, but the actual lines are always a little right-ragged.

Justifying tseks are nearly always used to pad the end of a line when the preceding characteris a tsek—in other words, when the end of a line arrives in the middle of tshig-grub (see theprevious definition under “Tibetan Punctuation”). However, it is unusual for a line thatends at the end of tshig-grub to have justifying tseks added to the shay at the end of thetshig-grub. That is,

xxxx| |

is not usually padded like this (though it is allowable):

xxxx| |.....

In this case, instead of justifying the line with tseks, the space between shays is enlargedand/or the whitespace following the final shay is usually left as is. Padding is never appliedfollowing an actual space character. For example, given the existence of a space after a shay,the following may not be written with the padding,

xxxxxxx| ......

because the final shay should have a space after it, and padding is never applied after spaces.The same applies where the final consonant of a tshig-grub that ends a line is a “ka” or “ga”.In that case, the ending shay is dropped but a space is still required after the consonant andthat space must not be padded. For example, the following is not acceptable:

xxxxxga ......

Tibetan text has two rules regarding the formatting of text at the beginning of a new line.There are severe constraints on which characters can start a new line, and the rule is tradi-tionally stated as follows: a shay of any description may never start a new line. Nothing butactual words of text can start a new line, the only exception being a go-yig (yig-mgo) at thehead of a front page or a da-tshe (zla-tshe, meaning “crescent moon”—for example,U+0F05) or one of its variations, which is effectively an “in-line” go-yig (yig-mgo), on anyother line. One of two or three ornamental shays is also commonly used in short pieces ofprose in place of the more formal da-tshe. This rule also means that a space may not start anew line in the flow of text. If there is a major break in a text, a new line might be indented.

A syllable (tsheg-bar) that comes at the end of a tshig-grub and that starts a new line musthave the shay that would normally follow it replaced by a rin-chen-spung-shad (U+0F11).(The reason for this rule is that the presence of the rin-chen-spung-shad makes the end oftshig-grub more visible and hence makes the text easier to read.)

In verse, the second shay following the first rin-chen-spung-shad is also replaced some-times with a rin-chen-spung-shad, though the practice is formally incorrect. It is a writer’strick done to make a particular scribing of a text more elegant. It is moderately populardevice but does breaks the rule. Not only is rin-chen-spung-shad used as the replacement

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 247

Page 40: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.13 Tibetan South and Southeast Asian Scripts

for the shay but a whole class of “ornamental shays” are used for the same purpose. All arescribal variants on a rin-chen-spung-shad, which is correctly written with three dots above it.

Tibetan Shorthand Abbreviations (bskungs-yig) and Limitations of the Encoding. A con-sonant functioning as the word-base (ming-gzhi) is allowed to take only one vowel signaccording to Tibetan grammar. The Tibetan shorthand writing technique called bskungs-yig does allow one or more words to be contracted into a single, very unusual combinationof consonants and vowels. This construction frequently entails the application of morethan one vowel sign to a single consonant or stack, and the composition of the stacks them-selves can break the rules of normal Tibetan grammar. For this reason, vowel signs dosometimes interact typographically, which accounts for their particular combining classes(see Section 4.2, Combining Classes—Normative).

The Unicode Standard accounts for plain text compounds of Tibetan that contain at mostone base consonant, any number of subjoined consonants, followed by any number ofvowel signs. This coverage constitues the vast majority of Tibetan text. Rarely, stacks areseen that contain more than one such consonant-vowel combination in a vertical arrange-ment. These stacks are highly unusual and are considered beyond the scope of plain textrendering. They may be handled by higher-level mechanisms.

248 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 41: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.14 Myanmar

9.14 Myanmar

Myanmar: U+1000–U+109FThe Myanmar script is used to write Burmese, the majority language of Myanmar (for-merly called Burma). Variations and extensions of the script are used to write other lan-guages of the region, such as Shan and Mon, as well as Pali and Sanskrit. The Myanmarscript was formerly known as the Burmese script, but the term “Myanmar” is now pre-ferred.

The Myanmar writing system derives from a Brahmi-related script borrowed from SouthIndia in about the eighth century for the Mon language. The first inscription in the Myan-mar script dates from the eleventh century, using an alphabet almost identical to that of theMon inscriptions. Aside from rounding of the originally square characters, this script hasremained largely unchanged to the present. It is said that the rounder forms were devel-oped to permit writing on palm leaves without tearing the writing surface of the leaf.

Because of its Brahmi origins, the Myanmar script shares the structural features of its Indicrelatives: consonant symbols include an inherent “a” vowel; various signs are attached to aconsonant to indicate a different vowel; ligatures and conjuncts are used to indicate conso-nant clusters; and the overall writing direction is left to right. Thus, despite great differ-ences in appearance and detail, the Myanmar script follows the same basic principles as, forexample, Devanagari.

Standards. There is not yet an official national standard for the encoding of Myanmar/Burmese. The current encoding was prepared with the consultation of experts from theMyanmar Information Technology Standardization Committee (MITSC) in Yangon(Rangoon). The MITSC, formed by the government in 1997, consists of experts from theMyanmar Computer Scientists’ Association, Myanmar Language Commission, and Myan-mar Historical Commission.

Encoding Principles. As with Indic scripts, the Myanmar encoding represents only thebasic underlying characters; multiple glyphs and rendering transformations are required toassemble the final visual form for each syllable. Even some single characters, such asU+102C , may assume variant forms depending on the othercharacters with which they combine. Conversely, characters or combinations that mayappear visually identical in some fonts, such as U+101D and U+1040 , are distinguished by their underlying encoding.

Composite Characters. As is the case in Extended Latin and many other scripts, someMyanmar letters or signs may be analyzed as composites of two or more other characters,and are not encoded separately. The following is an example of a Myanmar letter repre-sented by a combining character sequence:

myanmar vowel sign o = U+1031 + U+102C

Encoding Subranges. The basic consonants, independent vowels, and dependent vowelsigns required for writing the Myanmar language are encoded at the beginning of theMyanmar range. Extensions of each of these categories for use in writing other languages,such as Pali and Sanskrit, are appended at the end of the range. In between these two sets liethe script-specific signs, punctuation, and digits.

Conjunct and Medial Consonants. As in other Indic-derived scripts, conjunction of twoconsonant letters is indicated by the insertion of a virama U+1039

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 249

Page 42: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.14 Myanmar South and Southeast Asian Scripts

between them; it causes ligation or other rendered combination of the consonants,although the virama itself is not rendered visibly.

The conjunct form of U+1004 is rendered as a superscript signcalled kinzi. Kinzi is encoded in logical order as a conjunct consonant before the syllable towhich the mark applies, such as the Devanagari ra. (See Section 9.1, Devanagari, Rule R2.)For example, kinzi applied to U+1000 would be written via the fol-lowing sequence:

U+1004

U+1039

U+1000

The Myanmar script traditionally distinguishes a set of subscript “medial” consonants:forms of , , , and that are considered to be modifiers of the syllable’s vowel. Inthe Myanmar encoding, the medial consonants are treated as conjuncts; that is, they arecoded using the virama. So, for example, the word “kywe” (“to drop off”) would be writtenvia the following sequence:

U+1000

U+1039

U+101A

U+1039

U+101D

U+1031

Explicit Virama. The virama U+1039 also participates in somecommon constructions where it appears as a visible sign, commonly termed killer. In thisusage where it appears as a visible diacritic, U+1039 is followed by a U+200C

-, as with Devanagari (see Figure 9-7).

Signs After Consonants. Dependent vowels and other signs (except kinzi, as noted above)are encoded in logical order after the consonant to which they apply, regardless of wherethe glyph for the sign happens to be rendered relative to the glyph for the consonant. Inparticular, U+1031 is encoded after its consonant (as in the previ-ous example), although in visual presentation it is reordered to appear before (to the left of)the consonant form.

Spacing. Myanmar does not use any whitespace between words. If word boundary indica-tions are desired—for example, for the use of automatic line layout algorithms—the char-acter U+200B should be used to place invisible marks for such breaks.The can grow to have a visible width when justified.

250 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 43: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.15 Khmer

9.15 Khmer

Khmer: U+1780–U+17FFKhmer, also known as Cambodian, is the official language of Kampuchea. Mutually intelli-gible dialects are also spoken in northeastern Thailand and the Mekong Delta region ofVietnam. Although Khmer is not an Indo-European language, it has borrowed muchvocabulary from Sanskrit and Pali, and religious texts in those languages have been translit-erated as well as translated into Khmer.

The Khmer script, called a’saa kmae (“Khmer letters”), is descended from the Brahmiscript of South India, as are Thai, Lao, Myanmar, Old Mon, and others. The exact sourceshave not been determined, but there is a great similarity between the earliest inscriptions inthe region and the Pallawa script of the Coromandel coast of India. Modern Khmer has twobasic styles of script: the a’saa criang (“slanted script”) and the a’saa muul (“round script”).There is no fundamental structural difference between the two. The slanted script (“stand-ing” variant) is chosen as representative in Chapter 14, Code Charts.

Structurally, the Khmer script stays close to its southern Brahmi origins; in modern terms,it shares features with both Thai and Myanmar. Consonant letters bear an inherent vowelsound, with additional signs placed before, above, below, and after the consonants to indi-cate vowels other than the inherent one. Consonant clusters are represented by conjuncts,where the first consonant of the cluster maintains its full form and succeeding consonantsare written as subscripts. The overall writing direction is left to right.

The Khmer language has a richer set of vowels than the languages for which the ancestralscript was used, but it has a smaller set of consonant sounds. The Khmer script takesadvantage of this situation by assigning different symbols to the same consonant but withdifferent inherent vowels. This usage is analogous to the situation in Thai, where differentconsonant symbols have the same sound but encode different tones. The Khmer consonantletters are organized into two series or registers, whose inherent vowels are nominally “a”and “o”. Two “shifter” signs convert a consonant from one series to the other. The depen-dent vowel signs then do not have a single phonetic value, but rather are interpreted in thecontext of the consonant to which they are attached.

Encoding Principles. As with other related scripts, the Khmer encoding represents only thebasic underlying characters; multiple glyphs and rendering transformations are required toassemble the final visual form for each syllable. Even some single characters, such asU+1789 , may assume variant forms depending on the other characterswith which they combine. Conversely, characters or combinations that may appear visuallyidentical in some fonts, such as U+17A2 and U+17A3 -

, are distinguished by their underlying encoding.

Independent Vowels. In Khmer, as in most related scripts (but in contrast to Thai), inde-pendent vowels have their own letterforms.

Conjunct Consonants. U+17D2 plays the conjunct formation role ofIndic virama, killing the vowel of its preceding consonant, and indicating that the followingconsonant should be treated as a subscript. Sign coeng should not be confused withU+17D1 , which has a name similar to virama but an unrelated func-tion—indicating that the base character is part of the previous word.

Note that a subjoined U+179A is typed normally—that is, after the pre-ceding consonant and sign coeng, even though the glyph for the subjoint letter ro is ren-dered before (to the left of) the full-size consonant. U+17CC

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 251

Page 44: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

9.15 Khmer South and Southeast Asian Scripts

historically corresponds to the Devanagari repha (that is, an initial /r/), but it has lost thisfunction in Khmer and instead is treated as a simple diacritical mark rather than as part ofa conjunct.

Consonant Shifters. The marks U+17C9 and U+17CA

are used to shift the base consonant between registers. In the presence ofother superscript glyphs, both of these signs may be rendered via the same glyph shape asthat of U+17BB . Selection of the proper rendering form is left to thedisplay software.

Dependent Vowel Signs. Structurally, the Khmer dependent vowel signs are close to thoseof Thai, but the encoding principle is different. Khmer follows the general model forBrahmi-derived scripts, in which each dependent vowel sign is represented by a single codethat occurs after the code for the base consonant. This principle reflects the fact that avowel sign has its own identity, regardless of the number and placement of glyph fragmentsthat may be used to render the sign, and regardless of its contextual phonetic interpreta-tion. Each Khmer consonant either expresses its implicit vowel or is followed by a singledependent vowel sign character.

Ordering of Syllable Components. The standard order of components in consonantal syl-lables as expressed in BNF is

C ( SC C )* { S } { V } { O }

where

C is a consonant

SC is a sign coeng (= virama)

S is a consonant shift

V is a dependent vowel sign

O is any other sign

For example, the word /khnyom/, “I”, would represented by the following sequence:

U+1781

U+17D2

U+1789

U+17B9

U+17C6

Except for the presence of consonant shifters, this structure is analogous to Devanagari (seeSection 9.1, Devanagari).

Special Character Usage. Although U+17A3 and U+17A2 have identical glyphs, only U+17A2 should be used for modern text.U+17A3 and U+17A4 are solely for Pali/Sanskrit transliteration.

U+1789 may have two different subscript forms. Furthermore, theglyph part of the base letter U+1789, which normally occurs under the baseline, is omittedwhenever a subscript is conjoined. U+17B1

bears a close relationship to U+17B2 . Thevery commonly used word meaning “give” should be spelled U+17B2 U+17D2 U+1799,with rendering software optionally replacing the glyph for U+17B1 with the glyph forU+17B2. U+17D3 is an exceedingly rare sign used in historiclunar dates; it should not be confused with U+17C6 , which carries

252 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard

Page 45: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

South and Southeast Asian Scripts 9.15 Khmer

an “m” sound and occurs frequently in modern Khmer text. U+17C6 -

, U+17C7 , and U+17C8 are signsthat occur in many permutations with vowels; they are not vowels themselves.

Spacing. Khmer does not use any whitespace between words. If word boundary indicationsare desired—for example, for the use of automatic line layout algorithms—the characterU+200B should be used to place invisible marks for such breaks. The can grow to have a visible width when justified.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 253

Page 46: Brahmi Home - Chapter 9 South and Southeast …brahmi.sourceforge.net/docs/unicode-ch09.pdfthe structural similarities of the Brahmi scripts and follows the stated intention of the

This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor-tium and published by Addison-Wesley. The material has been modified slightly for this online edi-tion, however the PDF files have not been modified to reflect the corrections found on the Updatesand Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recentversions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/).

Many of the designations used by manufacturers and sellers to distinguish their products are claimedas trademarks. Where those designations appear in this book, and Addison-Wesley was aware of atrademark claim, the designations have been printed in initial capital letters. However, not all wordsin initial capital letters are trademark designations.

The authors and publisher have taken care in preparation of this book, but make no expressed orimplied warranty of any kind and assume no responsibility for errors or omissions. No liability isassumed for incidental or consequential damages in connection with or arising out of the use of theinformation or programs contained herein.

The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims aremade as to fitness for any particular purpose. No warranties of any kind are expressed or implied. Therecipient agrees to determine applicability of information provided.

Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi andpublished by Taishukan Shoten.

ISBN 0-201-61633-5

Copyright © 1991-2000 by Unicode, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, ortransmitted in any form or by any means, electronic, mechanical, photocopying, recording or other-wise, without the prior written permission of the publisher or Unicode, Inc.

This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset usingFrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout.The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies andorganizations supplied fonts:

Apple Computer, Inc.Atelier Fluxus VirusBeijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc.IBM CorporationMonotype Typography, Inc.Microsoft CorporationPeking University Founder Group CorporationProduction First Software

Additional fonts were supplied by individuals as listed in the Acknowledgments.

The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc.The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions.

All other company and product names are trademarks or registered trademarks of the company ormanufacturer, respectively.

The publisher offers discounts on this book when ordered in quantity for special sales. For moreinformation please contact:

Corporate, Government, and Special SalesAddison Wesley Longman, Inc.One Jacob WayReading, Massachusetts 01867

Visit A-W on the Web: http://www.awl.com/cseng/

First printing, January 2000.