Computers and the Thai Language
Hugh Thaweesak Koanantakool, National Science and Technology Development Agency
Theppitak Karoonboonyanan, Thai Linux Working Group
Chai Wutiwiwatchai, National Electronics and Computer Technology Center
This article explains the history of Thai language development for computers, examining such factors as the language, script, and writing system, among others. The article also analyzes characteristics of Thai characters and I/O methods, and addresses key issues involved in Thai text processing. Finally, the article reports on language processing research and provides detailed information on Thai language resources.
Thai is the official language of Thailand. The Thai script system has been used for the Thai, Pali, and Sanskrit languages in Buddhist texts all over the country. Standard Thai is used in all schools in Thailand, and most dialects of Thai use the same script.
Thai is the language of 65 million people, and has a number of regional dialects, such as Northeastern Thai (or Isan; 15 million people), Northern Thai (or Kam Meuang or Lanna; 6 million people), Southern Thai (5 million people), Khorat Thai (400,000 people), and many more variations (http://en.wikipedia.org/wiki/Thai_language). The Thai language is considered a member of the family of Tai languages, which are used in many parts of the Indochina subregion, including India, southern China, northern Myanmar, Laos, Thailand, Cambodia, and North Vietnam.
The Thai script of today has a history going back about 700 years, with gradual changes in the script’s shape and writing system over the years. The script was originally derived from the Khmer script in the sixth century. It is generally thought that the Khmer script developed from the Pallava script of India [1].
Thai is written left-to-right, without spaces between words. Each character has only one form; that is, there is no notion of uppercase and lowercase characters. Some vowels are written before and after the main consonant. Certain vowels, all tone marks, and diacritics are written above and below the main character.
Pronunciation of Thai words does not change with their usage, as each word has a fixed tone. Changing the tone of a syllable may lead to a totally different meaning. Thai verbs do not change form with tense, gender, or number, as verbs in European languages do. Instead, additional words convey tense, gender, and singular or plural. Basic Thai words are typically monosyllabic. Contemporary Thai makes extensive use of adapted Pali, Sanskrit, English, and Chinese words embedded in day-to-day vocabulary. Some words have been in use long enough that people have forgotten that they originated from other languages. At present, most new words are created from English.
A Thai word is typically formed by the combination of one or more consonants, one vowel, one tone mark, and one or more final consonants to make one syllable. Certain words may be polysyllabic and therefore may consist of many characters in combination. Because of the limited number of characters (see Figure 1), the writing system can be implemented on typewriters by mapping each symbol directly to a key. The typing sequence is exactly the same as the writing sequence.
46 IEEE Annals of the History of Computing 1058-6180/09/$25.00 © 2009 IEEE. Published by the IEEE Computer Society
Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.
The computerization of Thai script for the Thai language preserves the input sequence, and assigns one character in storage to correspond to one input keystroke. This concept is simple to understand, and all language processing algorithms are designed to suit this arrangement, despite the fact that some Thai vowels consist of up to three characters (from the vowel group) in a word.
Early IT standards in Thailand defined the code points for computer handling, keyboard layout, and I/O method. Subsequent research projects created many useful functions such as word break, text-to-speech, optical character recognition (OCR), voice recognition, machine translation, and search algorithms, which we describe elsewhere in this article.
Thai language on typewriters and computers
Printing of the Thai language was first accomplished at Serampore in 1819: according to professor Michael Winship [2], a Catechism of Religion had been prepared by the American Baptist missionary Ann Hasseltine Judson for distribution to a small community of Thai prisoners of war in Burma. By 1836, Dan Beach Bradley, a missionary physician, had started printing in Thailand using a press donated by the American Board of Commissioners for Foreign Missions. At the time, publications were intended primarily for spreading Christianity in Thailand [3].
Thai-language typewriter localization was first shown to the Thai public in 1891, when Edwin McFarland, in cooperation with Smith Premier, a US typewriter company, introduced the first Thai language typewriter. In 1898, son George McFarland opened a typewriter shop in Bangkok. Those first typewriters had a moving carriage without any shift mechanism [4]. Today’s typewriter keyboard has five rows, with standard shift keys.
The use of computers with the Thai language started in the late 1960s when IBM introduced card-punch machines and line printers with Thai character capability. In doing so, 8-bit EBCDIC code assignments were made for Thai characters [5]. Because of the printer’s mechanical limitations, each line of Thai text required the equivalent of four-pass printing to reproduce the multilevel nature of the Thai writing system. In the 1970s, Univac, Control Data, and Wang introduced their computers using the Thai language. In 1979, some companies offered CRT terminals, with a 12-lines-per-screen capability, which could display Thai characters. Interactive computing came to Thailand in 1983 when a Thai inventor modified the Hercules Graphic Card (HGC) circuitry and made it capable of displaying 25 lines per screen in text mode. The full display capability of the Thai language has been adopted widely since then.
The computer industry in Thailand in the 1980s was very much vertically integrated: that is, hardware manufacturers always supported their own version of localized operating systems and applications. Every company used different Thai character codes as a result of competition and corporate strategy to retain their customers. Application developers, however, suffered from the codes’ incompatibility and from differences in the system interfaces. By 1984, there were at least 20 different character coding conventions used in Thailand [6]. This situation prompted the Thai Industrial Standards Institute (TISI) to establish a standards committee to draft the national standard code for computers. The work, completed in 1986, is known as TIS 620-2529, the Standard for Thai Character Codes for Computers B.E. [Buddhist Era] 2529 [7]. Basically, two designs of the code points were adopted in the standard: the IBM-EBCDIC and the extended ASCII (8-bit plane). In 1990, the TIS 620 standard was enhanced with a clearer explanation but without any change to the code points, and this standard became TIS 620-2533. The technical committees of TISI subsequently issued a number of IT standards for Thailand (see Table 1).
Character group    Count
Consonants         44
Vowels             18
Tone marks         4
Diacritics         5
Numerals           10
Other symbols      6
Total              87
Figure 1. Thai characters.
January–March 2009 47
It took only one year after the standard’s announcement for every computer company to be fully compliant with the TIS 620 standard code. Everyone’s data then became interchangeable.
In 1990, the Thai API Consortium, a group of Thailand software developers led by Thaweesak Koanantakool, drafted a common specification for computer handling of the Thai I/O method. The work was funded by the National Electronics and Computer Technology Center (NECTEC) and was published in 1991 [8] as the WTT 2.0 specification [9]. The specification was widely used by industry, including IBM, Digital Equipment, Hewlett-Packard, and, later, Microsoft. WTT 2.0 was proposed to TISI as a draft national standard and, seven years later, became TIS 1566-2541 (1998). At the same time, many more interesting natural language processing (NLP) projects with useful results and solutions found their way into industrial use. In the past two decades, Thailand has achieved several milestones for advanced computer processing with the Thai language; we report on most of them in this article.
Thai alphabets
Encoding of Thai characters into an octet is simple, since only 87 characters are needed to represent the language. In the code space between 0x80 and 0xFF, TIS 620 occupies only the code space between 0xA1 and 0xFB. The ASCII-compatible version of TIS 620 was interoperable with any ASCII-based computer with true 8-bit storage per character. The characters were placed in the code table in the “most acceptable” collating sequence. The TIS 620 sequence was subsequently registered with the Universal Character Set defined by the international standard ISO/IEC 10646, Universal Multiple-Octet Coded Character Set, in the region U+0E00...U+0E7F, and also as the ISO/IEC 8859-11 standard for 8-bit (single byte) coded graphic character sets, Latin/Thai alphabet. Figure 2 shows the code assignment for Thai characters and their names in TIS 620 and Unicode.
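Because the TIS 620 sequence was carried over into the Unicode Thai block in the same order, the two encodings differ by a fixed offset. The sketch below illustrates that relationship; the helper name tis620_to_unicode is hypothetical, and the gap at 0xDB to 0xDE is an assumption taken from the published code charts rather than from this article.

```python
# Sketch: map one TIS 620 byte to its Unicode code point by a fixed offset.
# tis620_to_unicode is a hypothetical helper, not part of any standard API.

def tis620_to_unicode(byte: int) -> int:
    """Return the Unicode code point for a single TIS 620 byte."""
    if byte < 0x80:
        return byte  # the ASCII half is unchanged
    # Assumption: 0xDB-0xDE and 0xFC-0xFF are unassigned in TIS 620.
    if 0xA1 <= byte <= 0xDA or 0xDF <= byte <= 0xFB:
        return 0x0E00 + (byte - 0xA0)
    raise ValueError(f"byte 0x{byte:02X} is not assigned in TIS 620")


first_thai = tis620_to_unicode(0xA1)  # first position of the Thai range
last_thai = tis620_to_unicode(0xFB)   # last position of the Thai range

# Python's standard library ships a "tis_620" codec that does the same job.
decoded = bytes([0xA1, 0xD2]).decode("tis_620")
```

In practice the built-in codec is the better choice; the explicit function only makes the offset visible.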
Consonants
Thai consonants can be classified as plosives (stops), non-plosives, sibilants, and voiced “h.” In the plosive table in Figure 3, each column is also grouped by voiced/unvoiced properties. In the last group, อ is classified as a zero consonant; it is used to carry vowels, which cannot stand alone. Figure 3 shows the consonants, classifications, and the associated International Phonetic Alphabet (created by the International Phonetic Association, IPA) and International Alphabet of Sanskrit Transliteration (IAST) representations.
Vowels
The 18 symbols for vowels, together with three consonants (ย, ว, and อ), are used in combination to create the 32 vowels of Thai. The 18 symbols are listed in Figure 4, with the associated Unicode values. Figure 5 shows the 32 vowels that these 21 symbols form.
Standard number         Standard title
TIS 620-2529 (1986)     Standard for Thai Character Codes for Computers
TIS 820-2531 (1988)     Layout of Thai Character Keys on Computer Keyboards
TIS 620-2533 (1990)     Standard for Thai Character Codes for Computers (Enhanced)
TIS 988-2533 (1990)     Recommendation for Thai Combined Character Codes and Symbols for Line Graphics for Dot-Matrix Printers
TIS 1074-2535 (1992)    Standard for 6-Bit Teletype Codes
TIS 1075-2535 (1992)    Standard for Conversion Between Computer Codes and 6-Bit Teletype Codes
TIS 1099-2535 (1992)    Standard for Province Identification Codes for Data Interchange
TIS 1111-2535 (1992)    Standard for Representation of Dates and Times
TIS 820-2538 (1995)     Layout of Thai Character Keys on Computer Keyboards (Enhanced)
TIS 1566-2541 (1998)    Thai Input/Output Methods for Computers
Table 1. Thai industrial standards on information technology. The four digits at the end of the standard number (e.g., 2529) are the year of issue according to the Buddhist Era.
Source: http://www.nectec.or.th/it-standards/.
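The relationship between the Buddhist Era suffix and the Common Era year in Table 1 is a fixed offset of 543 years (2529 corresponds to 1986). A minimal sketch, assuming standard numbers written as "TIS nnn-yyyy"; the function name tis_issue_year is hypothetical:

```python
import re

BE_OFFSET = 543  # Buddhist Era year minus 543 gives the Common Era year


def tis_issue_year(standard_number: str) -> int:
    """Extract the B.E. suffix of a TIS number and convert it to C.E."""
    match = re.fullmatch(r"TIS (\d+)-(\d{4})", standard_number)
    if match is None:
        raise ValueError(f"unrecognized standard number: {standard_number!r}")
    return int(match.group(2)) - BE_OFFSET


year_620 = tis_issue_year("TIS 620-2529")
year_1566 = tis_issue_year("TIS 1566-2541")
```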
Thai input method and keyboards
Thai is written with the combining marks stacked above or below the base consonant, like diacritics in European languages. However, although the concepts are quite similar, the implementations are significantly different.
Thai input method requirements
First, there are too many possible combinations of base consonants and combining marks in Thai to be enumerated like Latin accents in the ISO/IEC 8859 series standard for 8-bit character encoding. Therefore, base characters and combining characters are encoded separately, rather than precombined.
Second, Thai combining marks are classified into upper or lower vowels, tone marks, and other diacritics. Moreover, a Thai base consonant can be combined with up to two combining marks, that is, zero or one upper or lower vowel and zero or one tone mark or
[Figure 2 code chart not reproduced: rows 0 through F; columns TIS 620 0xAx through 0xFx, aligned with Unicode U+0E0x through U+0E7x.]
Figure 2. Standard codes for Thai characters in TIS 620 and Unicode, with their names as defined by the WTT 2.0 specification.
[Figure 3 table not reproduced: plosives by class (velar, palatal, retroflex, dental, labial) in columns for unvoiced unaspirated, unvoiced aspirated, voiced unaspirated, voiced aspirated, and nasal, with a tone row (middle, high, low, low, low); followed by non-plosives (semi-vowels), sibilants, voiced “h,” and the zero consonant (never a vowel).]
Figure 3. Classification of Thai consonants. Each cell consists of the character, International Phonetic Alphabet, and the International Alphabet of Sanskrit Transliteration (IAST) representations in square brackets.
diacritic. The upper/lower vowel, if present, is always attached to the consonant before the tone/diacritic. These conditions must be governed by some rules to limit the number and order of the combining characters to be put after the base consonant.
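The combining rule described above (a base consonant, then at most one upper or lower vowel, then at most one tone mark or diacritic) can be sketched as a regular expression. This is a simplified illustration, not the WTT 2.0 state table: the character ranges are taken from the Unicode Thai block, and the grouping of marks into just two classes is an assumption made for brevity.

```python
import re

# Simplified cell syntax: consonant, optional upper/lower vowel,
# optional tone mark or diacritic. Ranges are Unicode Thai block values.
CELL = re.compile(
    r"[\u0e01-\u0e2e]"                   # base consonant
    r"[\u0e31\u0e34-\u0e39]?"            # zero or one upper/lower vowel
    r"[\u0e48-\u0e4b\u0e3a\u0e47\u0e4c-\u0e4e]?"  # zero or one tone/diacritic
)


def is_valid_cell(cell: str) -> bool:
    """Return True if the string is one well-formed cell under this sketch."""
    return CELL.fullmatch(cell) is not None


KO_KAI, SARA_I, MAI_EK, MAI_THO = "\u0e01", "\u0e34", "\u0e48", "\u0e49"

vowel_then_tone = is_valid_cell(KO_KAI + SARA_I + MAI_EK)  # correct order
tone_then_vowel = is_valid_cell(KO_KAI + MAI_EK + SARA_I)  # wrong order
two_tones = is_valid_cell(KO_KAI + MAI_EK + MAI_THO)       # too many marks
```

The real specification distinguishes more classes (see the legend of Figure 7) and forbids some pairs that this pattern accepts.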
As a third difference, Thai users are familiar with typing the combining characters after the base consonant, in contrast to many European input methods in which users type the diacritic before the base letter to compose an accent. The typing sequence and the internal storage of a Thai string, therefore, are totally consistent. Moreover, a typing sequence is almost exactly the same as a handwriting sequence, with the exception of only one character, sara am, a composite vowel that consists of two symbols sharing one code point.
Typewriter keyboards
The keyboard of the first Thai typewriter made by Smith Premier in 1891 consisted of 12 keys in seven rows (Figure 6a shows the layout). When the typewriter industry started using the shift key, the Thai layout was also adapted to the typical QWERTY arrangement. The characters that normally appear above and below the base characters are assigned to individual keys, and they appear above or below the previous position without moving the carriage. This concept is called dead keys.
The two main differences between Thai and English typewriters are that there are two different characters on the same key, and that there are eight dead keys. All the dead keys are grouped in the center of the Ketmanee keyboard layout (see Figure 6b), the classic layout named after its inventor.
In 1966, Sarit Pattajoti [11], an engineer at the Royal Irrigation Department, redesigned the layout to improve the mechanical coordination of human fingers. Statistically, the new layout would eliminate the imbalance of the Ketmanee layout (30% distributed to the left hand and 70% to the right hand), and distribute more of the workload to the index and middle fingers. The Pattajoti layout (see Figure 6c) became the official standard. Manufacturers were encouraged to make them in quantity for government offices. However, this “official” standard was abandoned in 1971 because its marginal speed improvement was not significant enough for people to migrate from the de facto Ketmanee.
In 1988, TISI issued the standard layout for the computer keyboard, TIS 820-2531 (1988) (see Figure 6d). The TISI technical committee adopted the Ketmanee layout with only one change. In 1993, the standard was enhanced to utilize the additional three keys available on computer keyboards. Figure 6e shows the present standard layout as defined by TIS 820-2536 (1993). Four rarely used characters defined in the TIS 620 standard, however, are not on the keyboard; they are entered in the specific program that uses them, typically through a GUI or, in the pre-Microsoft Windows era, via direct hexadecimal code in DOS.
[Figure 5 chart not reproduced: Thai and IPA forms arranged by front/back, unrounded/rounded, short/long, and close, close-mid, open-mid, and open; panels (a) 18 monophthongs, (b) 9 diphthongs, and (c) 5 semi-vowels.]
Figure 5. Thirty-two vowels in Thai: 18 monophthongs, 9 diphthongs, and 5 semi-vowels.
Vowel name          Unicode   Type   Position relative to the main consonant
sara a              U+0E30    FV1    follow
mai han-akat        U+0E31    AV2    above
sara aa             U+0E32    FV1    follow
sara am             U+0E33    FV1    follow
sara i              U+0E34    AV1    above
sara ii             U+0E35    AV3    above
sara ue             U+0E36    AV2    above
sara uee            U+0E37    AV3    above
sara u              U+0E38    BV1    below
sara uu             U+0E39    BV2    below
sara e              U+0E40    LV     leading vowel
sara ae             U+0E41    LV     leading vowel
sara o              U+0E42    LV     leading vowel
sara ai maimuan     U+0E43    LV     leading vowel
sara ai maimalai    U+0E44    LV     leading vowel
ru                  U+0E24    FV3    follow
lu                  U+0E26    FV3    follow
lakkhangyao         U+0E45    FV2    follow
Figure 4. Eighteen vowel symbols in Thai.
WTT 2.0 canonical order
The Thai I/O method specifications described in WTT 2.0 are based on the TIS 620 standard code [8]. The same concept was later extended to the Thai code page in Unicode. TIS 620 defines a code point for each of the Thai symbols, irrespective of their positions when rendered. In the same manner as Unicode normalization, the WTT 2.0 specification defines the canonical order of Thai character strings. It differs, however, in the implementation details and in terms of syntactic strictness. WTT 2.0 compliance requires that certain input-sequence rules be met, and most (if not all) input syntactic errors are eliminated at the time of data entry (see Figure 7).
First, WTT 2.0 requires strings to be stored and transmitted in canonical order, not to be normalized at processing time. Therefore, in WTT 2.0, canonical and noncanonical strings mean different things and can yield different results in rendering, sorting, searching, and so on. This helps simplify the string handling routines to some degree, leaving the task of ensuring the order at a single point, the input method.
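For contrast, the following sketch shows Unicode normalization reordering a Thai sequence at processing time, using Python's unicodedata module. The example pair (a below vowel and a tone mark) is chosen because both carry nonzero canonical combining classes in the Unicode Character Database; under WTT 2.0 the out-of-order sequence would instead be rejected or corrected at input.

```python
import unicodedata

KO_KAI = "\u0e01"  # consonant
SARA_U = "\u0e38"  # below vowel
MAI_EK = "\u0e48"  # tone mark

canonical = KO_KAI + SARA_U + MAI_EK     # vowel stored before tone
noncanonical = KO_KAI + MAI_EK + SARA_U  # tone stored before vowel

# Canonical combining classes decide the order after normalization.
class_of_vowel = unicodedata.combining(SARA_U)
class_of_tone = unicodedata.combining(MAI_EK)

# NFC sorts adjacent marks by combining class, so the two strings
# become equal only after normalization, not as stored.
normalized = unicodedata.normalize("NFC", noncanonical)
```

The two stored strings compare unequal byte for byte, which is the situation WTT 2.0 avoids by fixing the order at the input method.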
Second, WTT 2.0 is concerned with the syntactic correctness of the input string. It prevents multiple tone marks from combining in one cell, whereas Unicode imposes no such restriction. Moreover, WTT 2.0 defines three levels of syntactic strictness for the input method. Level 0 (Passthrough) does not filter at all. Level 1 (BasicCheck) just ensures that the input sequence complies with canonical order and can be displayed gracefully in the output method. Level 2 (Strict) is stricter, filtering out most problematic sequences. Therefore, the WTT 2.0 input conditioning helps make the storage order of Thai text cleaner than would the raw Unicode concept. The WTT 2.0 input method produces strings that form a proper subset of those allowed by the Unicode specification, and thus it does not negatively affect Unicode-compliant programs. WTT 2.0 eliminates problematic strings that might otherwise fail string matching or that might confuse the output method.
Technical characteristics of Thai input method
Technical requirements of the Thai TIS 1566-2541 (WTT 2.0) input method differ from input methods of other languages. The input method
• does not need pre-edit-and-commit stages (as used in some CJK [Chinese, Japanese, and Korean] input methods),
Figure 6. Evolution of Thai typewriter keyboard layout: from top, (a) Smith Premier layout, ca. 1890, (b) Ketmanee layout (no date), and computer keyboard layouts: (c) Pattajoti layout (1966), (d) TIS 820-2531 (1988), and (e) TIS 820-2536 (1993). (Source: Koanantakool [10])
• does not use the composing method (as used in Europe), and
• is not a straight key-to-character mapping (as used in ordinary English input method).
Rather, the input method requires the following:
• one-to-one key-to-character mapping (irrespective of the width and position of the character),
• ability to retrieve context character from application input buffer (to verify validity), and
• (optional) write access to application input buffer.
The key-to-character mapping is straightforward. Thailand has an industrial standard for a keyboard map, totally compatible with the traditional (i.e., mechanical) typewriter.
The ability to retrieve context characters from the application input buffer is intended to validate the key events, according to the state machine. Note that other solutions are not adequate, such as these:
• using pre-edit string and committing validated characters as a chunk,
• using composing method to commit strings cell-by-cell, and
• remembering previously typed characters in the input context memory.
These solutions can serve only for the inputting of new text but cannot handle the editing of existing text, especially when the first key to be input is a combining character.
The last requirement of write access to the application input buffer is for input sequence correction enhancement.
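The need for context retrieval can be illustrated with a small sketch in which a key event is validated against the character just before the cursor in an existing buffer. The function accept_key and the two-class split of combining marks are hypothetical simplifications, not the WTT 2.0 state machine; a real input method would obtain the context character from the application through its toolkit's API.

```python
# Combining marks, grouped coarsely (an assumption for this sketch).
UPPER_LOWER_VOWELS = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39")
TONES_AND_DIACRITICS = set(
    "\u0e48\u0e49\u0e4a\u0e4b\u0e3a\u0e47\u0e4c\u0e4d\u0e4e"
)
COMBINING = UPPER_LOWER_VOWELS | TONES_AND_DIACRITICS


def is_consonant(ch: str) -> bool:
    return "\u0e01" <= ch <= "\u0e2e"


def accept_key(buffer: str, cursor: int, key: str) -> bool:
    """Accept or reject `key` using the character before the cursor."""
    if key not in COMBINING:
        return True  # spacing characters are always accepted here
    if cursor == 0:
        return False  # nothing to attach the mark to
    context = buffer[cursor - 1]
    if is_consonant(context):
        return True
    # A tone mark or diacritic may also follow an upper/lower vowel.
    return key in TONES_AND_DIACRITICS and context in UPPER_LOWER_VOWELS


existing = "\u0e01\u0e32"  # ko kai + sara aa, already in the document
MAI_EK = "\u0e48"

insert_after_consonant = accept_key(existing, 1, MAI_EK)
insert_after_sara_aa = accept_key(existing, 2, MAI_EK)
insert_at_start = accept_key(existing, 0, MAI_EK)
```

Because the decision depends on text that was typed earlier, an input method that only remembers its own recent keystrokes could not make it when the user moves the cursor into existing text.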
Implementations
Known implementations of WTT-based Thai input methods include the following:
• Thai Language Environment for Solaris
• Microsoft DOS 6.0 and Windows (all versions)
• X Window: XIM (embedded in Xlib)
• GTK+:
  - Thai-Lao IM module (stock GTK+ IM module)
  - gtk-im-libthai (third party, based on libthai)
• SCIM:
  - scim-thai (third party, based on libthai)
  - scim-m17n (third party, by ETL, Japan)
Making TIS 620 a part of international standards
The early 1990s witnessed different opinions concerning the encoding of the Thai block in the making of the draft Basic Multilingual Plane (BMP), the code space between 0x0000 and 0xFFFF in Unicode. Trin Tantsetthi, a Thai computer architecture and software expert, developed proposals to relevant standards bodies that were responsible for the internationalization of Thai character codes [12]. In order to have a Thai language character set for the ISO 8859 series, the TIS 620 standard was registered under ISO 2375 by the ECMA (European Computer Manufacturers Association) as ISO-IR-166. The process encountered many obstacles. According to Tantsetthi, the first proposal was to eliminate nonspacing characters, which were nonoptional atomic parts of the script, by precomposing characters into a rectangular bounding box of glyphs. This proposal was later dropped because there was an insufficient number of code positions to accommodate all precomposed glyphs. Once the industry had learned about this limitation, the long-awaited ISO/IEC 8859-11 Latin/Thai standard was approved. TIS 620 was formally registered by the Internet Assigned Numbers
[Figure 7 state diagram not reproduced. Labels in the diagram: Forward, Dead, and the character classes listed in the legend below.]
NON = non-Thai printable
CONS = Thai consonants
LV = leading vowels
FV1 = ordinary following vowels
FV2 = dependent following vowel (LAKKHANGYAO)
FV3 = special following vowels (RU, LU)
BV1 = short below vowel (SARA U)
BV2 = long below vowel (SARA UU)
BD = below diacritic (PHINTHU)
TONE = tone marks
AD1 = above diacritic 1 (THANTHAKHAT, NIKHAHIT)
AD2 = above diacritic 2 (MAITAIKHU)
AD3 = above diacritic 3 (YAMAKKAN)
AV1 = above vowel 1 (SARA I)
AV2 = above vowel 2 (MAI HAN-AKAT, SARA UE)
AV3 = above vowel 3 (SARA II, SARA UEE)
Figure 7. WTT 2.0 Level 1 I/O state machine.
Authority (IANA) as a legitimate encoding for the Internet in 1998 [12].
Thailand also encountered another problem in defining Thai in the ISO 10646 standard, as some nonnative linguists had strongly proposed the use of phonetic order for Thai, following Hindi-based scripts. Although this scheme allows words to be normalized into a “proper form,” making word parsing a bit easier, the proposal could not address the following problems:
• it could not handle language exceptions rigorously;
• compounded vowels were not addressed, either in code-point assignments or in the input method; and
• no backward compatibility was provided.
We opted for the traditional input method (i.e., visual, instead of phonetic, order) because Thai localization had begun in 1968, and by 1990 the Thai computer industry had equipped itself with libraries and tools to use visual order without a problem. That is basically how the Thai block in BMP/Unicode developed. In 1998, IANA (Internet Assigned Numbers Authority) adopted TIS 620 as a legitimate character set on the Internet. The visual order was formally announced as a national standard in TIS 1566-2541 (1998).
Computer display, printing, and Thai word processors
Like most languages, modern Thai text display on computers has evolved from traditional printing technologies. Mechanical typewriter design allocated four vertical zones for placing Thai characters: one for base characters, one for combining characters below the baseline, and the other two for combining characters above the base character. Characters are strictly classified on the basis of their assigned positions. This has propagated to the text-mode computer display grid. The WTT 2.0 specification was designed to accommodate exactly this.
Thai typography has a more refined scheme than that of typewriters. The topmost combining characters can be shifted down slightly when the combining character below them is missing, to eliminate the empty gap. In addition, the combining characters can be slightly biased to the left or right of base characters that have long ascenders (stems), to deliver aesthetic print results.
The bias technique was necessary for hot-metal typesetting because all the combined symbols had to be molded individually as one piece, and there were about 28 combinations to be implemented. Another set of 28 types precombined with ascenders was also required. In phototypesetting or computer typesetting, these precombination techniques were no longer a concern, but the same concept is useful to improve the speed of dot matrix printing. Printing a line of Thai text on an impact printer normally requires four passes, one for each of the four vertical zones. By precombining the top characters (two levels) before printing, only three passes are required, which cuts the printing time per line by 25%. To make all dot matrix printers interoperable, TISI issued TIS 988-2533, the standard code for combined characters in dot matrix printers, in 1990.
Thai WTT 2.0-based rendering engines for text-mode display usually handle Thai text in two steps. The text string is first clustered into cells, which are then shaped by proper placements of the combining characters. Printing with an impact printer, on the other hand, requires a conversion of one line of text into three or four strings whose lengths are equal to the line width, with each string corresponding to one pass of printing. The fastest dot matrix printer manages to print a Thai line with only two passes; typical printing requires three passes.
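The conversion of one line into per-pass strings can be sketched as follows. The assignment of each combining mark to a vertical zone is an assumption made for illustration (it is not taken from TIS 988-2533), and split_into_passes is a hypothetical helper; it keeps all four zones separate and does not model the precombination that reduces four passes to three.

```python
# Assumed zone membership for combining marks (illustrative only).
BELOW = set("\u0e38\u0e39\u0e3a")
ABOVE = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e47\u0e4d")
TOP = set("\u0e48\u0e49\u0e4a\u0e4b\u0e4c\u0e4e")
ZONES = ("top", "above", "base", "below")


def split_into_passes(text: str, width: int) -> dict:
    """Spread a line over four strings, each exactly `width` columns."""
    rows = {zone: [" "] * width for zone in ZONES}
    column = -1
    for ch in text:
        if ch in TOP:
            zone = "top"
        elif ch in ABOVE:
            zone = "above"
        elif ch in BELOW:
            zone = "below"
        else:
            zone = "base"
            column += 1  # only spacing characters advance the carriage
        if not 0 <= column < width:
            raise ValueError("line too long or starts with a combining mark")
        rows[zone][column] = ch
    return {zone: "".join(cells) for zone, cells in rows.items()}


# ko kai + sara i + mai ek + ngo ngu
passes = split_into_passes("\u0e01\u0e34\u0e48\u0e07", width=4)
needed = [zone for zone in ZONES if passes[zone].strip()]
```

Rows that come out blank need no pass at all, which is one way a printer driver could skip work on lines without below-base marks.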
A number of DOS-based Thai-language I/O subsystems were developed at Kasetsart University, Chulalongkorn University, and Thammasat University. Yuen Poovorawan and his team at Kasetsart University created the first Thai word processor, called Thai Easy Writer. He also developed the Thai Kernel System as a resident program for DOS to handle the Thai language [13]. The most popular DOS-based Thai word processor was released in 1989. According to professor Wanchai Rivepiboon of Chulalongkorn, CU Writer version 1.1 was first released to the public in April 1989. It was a tremendous success because of the demand for quality printing of Thai documents. The Faculty of Engineering of Chulalongkorn University continued to support and improve the program for about five years until it was superseded in 1994 by commercial word processing software running on the Windows platform [14].
Cell clustering
Cell clustering shares the same state table with the input method. Composable sequences are tokenized into cells. Rejected sequences are separated, with an optional dotted circle as the base character so that they are easily caught by the human eye, as Figure 8 illustrates.
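A minimal clustering sketch follows, assuming a simplified cell pattern (consonant, optional upper/lower vowel, optional tone mark or diacritic) in place of the full WTT 2.0 state table. The function name cluster is hypothetical; U+25CC is the Unicode dotted circle.

```python
import re

DOTTED_CIRCLE = "\u25cc"

# A composable cell, or else any single character on its own.
CELL = re.compile(
    r"[\u0e01-\u0e2e][\u0e31\u0e34-\u0e39]?[\u0e3a\u0e47-\u0e4e]?|.",
    re.DOTALL,
)

# Thai combining marks in the Unicode Thai block.
COMBINING = {
    chr(c) for c in (0x0E31, *range(0x0E34, 0x0E3B), *range(0x0E47, 0x0E4F))
}


def cluster(text: str) -> list:
    """Tokenize text into cells; stray marks get a dotted-circle base."""
    cells = []
    for match in CELL.finditer(text):
        token = match.group()
        if token in COMBINING:  # a mark that could not join a cell
            token = DOTTED_CIRCLE + token
        cells.append(token)
    return cells


well_formed = cluster("\u0e01\u0e34\u0e48\u0e07")  # ko kai, sara i, mai ek, ngo ngu
with_reject = cluster("\u0e01\u0e48\u0e49")        # two tone marks on one base
```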
Shaping
For typewriter-style displays, the tokenized cells are rendered in a perfect rectangular bounding box, and all combining marks are placed at fixed vertical positions. For modern desktop publishing in Thai, however, beginning in 1989, combining marks have been typographically adjusted in the following ways:
• The topmost combining mark without a mark below it is shifted down to eliminate the gap between it and the base character.
• Combining marks that are combined with a base character with a long ascender (stem) are biased slightly to the left to avoid overlapping.
• Consonants with removable descenders (yo ying and tho than) have the descender removed when combined with a lower vowel or diacritic.
To accommodate these adjustments, typesetters have used Private Use Area (PUA) glyphs. PUA glyphs are still widely used in most fonts today. OpenType may eventually make their use obsolete with Thai, although not in the near term.
PUA glyphs solutions
Operating system vendors assigned separate, predefined PUA code points to shifted glyphs. Font creators prepared the required glyphs, and the rendering engine knew how to use them. Unfortunately, Microsoft and Apple defined their own PUA code points differently, which meant fonts could not be used across Windows and Mac platforms.
To make matters worse, Adobe did not support any PUA glyphs for Thai. This situation caused developers to create some common hacks with automatic filters to encode text with PUA glyphs so that such applications would render properly. This created another mess: when users of such filters transmitted messages in PUA, they were unreadable by anyone using another platform.
Thai users have long suffered from this lack of interoperability. The availability of PUA glyphs, however, is important enough in Thai typography to live on.
Pango, a free software rendering engine, supports both sets of PUA glyphs, Windows and Apple, by detecting them before use. Open standards and open source software have significantly helped the local Thai industry to find useful typographical solutions.
OpenType solutions
With OpenType technology, all shaping details can be moved into fonts. All that the rendering engine needs to do is call the appropriate features for the language.
Glyph processing features in OpenType fonts are represented in two forms: glyph substitution (GSUB) and glyph positioning (GPOS). GSUB contains substitution rules, while GPOS describes anchor points for attaching combining marks to base glyphs or to other combining marks.
According to Microsoft’s specification for Thai OpenType fonts,15 several features are required, as explained next.
Character Composition/Decomposition (“ccmp”). This must contain GSUB rules to
1. select the lower variation for topmost marks in the absence of an upper vowel (see Figure 9, left),
2. remove the descender for the consonants yo ying and tho than when combined with below-base characters (see Figure 9, middle), and
3. decompose the composite vowel sara am into the nikhahit sign and sara aa vowel, and rearrange them with the combined tone mark if one is present (see Figure 9, right).
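The third rule can be illustrated at the character level. The following is a minimal sketch, assuming the rule works on Unicode characters rather than on font glyph IDs as a real GSUB table does; the function name `decompose_sara_am` is hypothetical. It replaces sara am (U+0E33) with nikhahit (U+0E4D) plus sara aa (U+0E32), and moves the nikhahit in front of a tone mark that immediately precedes the vowel, so the tone mark can stack above the nikhahit.

```python
SARA_AM = "\u0e33"
NIKHAHIT = "\u0e4d"
SARA_AA = "\u0e32"
# mai ek, mai tho, mai tri, mai chattawa
TONE_MARKS = {"\u0e48", "\u0e49", "\u0e4a", "\u0e4b"}


def decompose_sara_am(text):
    """Split each SARA AM into NIKHAHIT + SARA AA, reordering around a tone mark."""
    out = []
    for ch in text:
        if ch != SARA_AM:
            out.append(ch)
        elif out and out[-1] in TONE_MARKS:
            # consonant + tone + sara am  ->  consonant + nikhahit + tone + sara aa
            tone = out.pop()
            out.extend([NIKHAHIT, tone, SARA_AA])
        else:
            out.extend([NIKHAHIT, SARA_AA])
    return "".join(out)


# Print the code points so the reordering is visible.
print([f"U+{ord(c):04X}" for c in decompose_sara_am("\u0e19\u0e49\u0e33")])
```

A font would express the same idea as substitution lookups over glyphs; the sketch only shows the ordering that the rule is meant to produce.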
Figure 8. WTT 2.0 clustering. The dotted circles indicate a rejected sequence.
Figure 9. Glyph substitutions for shaping: upper/lower variation selection (left), descender removal (middle), and SARA AM decomposition (right).
Mark to Base Positioning (“mark”). This is composed of GPOS base anchors in the base characters and mark anchors in the combining marks. This feature places marks above or below the base characters (see Figure 10).
Mark to Mark Positioning (“mkmk”). This is composed of GPOS base anchors in the base marks and mark anchors in the combining marks. This feature places topmost marks on upper vowels.
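Anchor attachment comes down to simple vector arithmetic: the mark is drawn at whatever origin makes its mark anchor coincide with the anchor on the glyph it attaches to. The sketch below is illustrative only; the `Point` type, the `attach` helper, and all coordinates are hypothetical values in font units, not data from any real font.

```python
from typing import NamedTuple


class Point(NamedTuple):
    x: int
    y: int


def attach(base_origin, base_anchor, mark_anchor):
    """Return the origin for the mark glyph so its anchor lands on the base anchor."""
    return Point(
        base_origin.x + base_anchor.x - mark_anchor.x,
        base_origin.y + base_anchor.y - mark_anchor.y,
    )


# Hypothetical anchor data, each relative to its own glyph origin.
consonant_origin = Point(0, 0)
consonant_top = Point(320, 650)   # base anchor on the consonant
vowel_attach = Point(-50, 650)    # mark anchor on the upper vowel
vowel_top = Point(-50, 900)       # base anchor on the upper vowel
tone_attach = Point(-40, 650)     # mark anchor on the tone mark

# "mark": upper vowel onto the consonant.
vowel_origin = attach(consonant_origin, consonant_top, vowel_attach)
# "mkmk": tone mark onto the upper vowel that was just placed.
tone_origin = attach(vowel_origin, vowel_top, tone_attach)
print(vowel_origin, tone_origin)
```

Because the tone mark is positioned relative to the vowel rather than to the consonant, it follows the vowel wherever the first attachment puts it.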
Thai cultural conventions
The most basic kind of cultural convention is the Posix locale for the standard C library. According to Posix, a number of C functions are defined to be locale-dependent, such as those for date and time formatting and string collation. Many other programming languages also rely on these functions.
The Posix locale specification has been further extended in ISO/IEC TR 14652, Specification Method for Cultural Conventions, by adding more categories, which are now supported by more-modern software.
Theppitak Karoonboonyanan16 has gathered information on Thai cultural conventions and documented his creation of the Thai locale for the GNU C library. Most Posix categories were defined first and extended later, when ISO/IEC TR 14652 was supported.
Handling of Thai words
According to the Royal Institute of Thailand’s Royal Institute Dictionary (http://en.wikipedia.org/wiki/The_Royal_Institute_of_Thailand#The_Royal_Institute_Dictionary), Thai strings can be collated by comparison from left to right, with two exceptions:
• Leading vowels are less significant than the initial consonant of the syllable and must be compared after the consonant.
• Tone marks, thanthakhat, and maitaikhu are ignored unless all other parts are equal.
No syllable structure or word boundary analysis is required. Thai words are sorted alphabetically, not phonetically. Figure 11 shows an example of a sorted list of Thai words.
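The two exceptions can be sketched as a two-level sort key. This is a simplified illustration, not one of the published algorithms: it assumes that Unicode code point order is an acceptable stand-in for proper collation weights, and the name `thai_sort_key` is hypothetical. The first level is the string with each leading vowel swapped behind the following character and with tone marks, thanthakhat, and maitaikhu dropped; the second level holds the dropped marks and only breaks ties.

```python
# Leading vowels: sara e, sara ae, sara o, sara ai maimuan, sara ai maimalai.
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")
# Maitaikhu, the four tone marks, and thanthakhat.
SECOND_LEVEL_MARKS = set("\u0e47\u0e48\u0e49\u0e4a\u0e4b\u0e4c")


def thai_sort_key(word):
    """Return (first-level string, second-level marks) for sorting."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            # Swap the leading vowel with the character right after it.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    first = "".join(c for c in chars if c not in SECOND_LEVEL_MARKS)
    second = "".join(c for c in chars if c in SECOND_LEVEL_MARKS)
    return (first, second)


words = ["เขา", "ไก่", "กา", "แก", "ขา", "เกม"]
print(sorted(words, key=thai_sort_key))
```

A plain code point sort of the same list would push every word that starts with a leading vowel to the end; with the key above, each such word is filed under its initial consonant instead.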
String collation solutions
The first known solution for Thai string collation was proposed in 1969 by D. Londe and Udom Warotamasikkhadit,17 and later referenced by Vichit Lorchirachoonkul18 in 1979. Londe and Warotamasikkhadit proposed an algorithm for converting Thai strings into a left-to-right comparable form.
In 1992, Samphan Khamthaidee (a pen name used by Samphan Raruenrom)19 published an article in a computer hobbyist journal describing an algorithm for comparing two Thai strings on the fly. The algorithm was crafted so that the strings are scanned from left to right in a single pass, and the difference is detected as early as possible. In 1997, Karoonboonyanan20 extended Khamthaidee’s algorithm to cover punctuation marks by generalizing it to accommodate extra levels of character weights. Subsequently, details about the order of Thai punctuation marks have been summarized in the literature.21,22
International standards
String collation is based primarily on the Posix standard library functions strcoll() and strxfrm(). Both are based on the LC_COLLATE locale setting. The Posix locale was later extended as ISO/IEC TR 14652, Specification Method for Cultural Conventions. For LC_COLLATE in particular, a Common Template Table for collating strings encoded in ISO/IEC 10646, Universal Multiple-Octet Coded Character Set, in general has been defined as ISO/IEC 14651, International String Ordering and Comparison. Every locale can customize the table to fit local needs. Unicode also defines UTS #10 (the Unicode Collation Algorithm) in parallel with the ISO/IEC 14651 standard.
Figure 10. Mark positioning with GPOS.
Figure 11. Example of a sorted list of Thai words.
Both standards supported multiple-level character weights from the outset, which was important for handling Thai tone marks and diacritics. So, all the specification needed at that time was the proper definition of weights. The reordering of the leading vowels, however, was controversial for a while, as we will explain.
Defending the non-Indic style of Thai Unicode encoding had given the Unicode committee sufficient information to address Thai in UTS #10 from the beginning. The reordering was defined as a separate preprocessing stage. It took time, however, to convince the ISO/IEC 14651 committee to include this in the Common Template Table (CTT) as a set of predefined contraction forms, as the standard had no preprocessing concept. Instead, the reordering requirement was first described in Annex C of DIS 14651:200023 without being realized in the CTT, before the actual amendment was implemented in 2003.24
Leading-vowel rearrangement issue
One common misinterpretation of the leading-vowel rearrangement by nonnative linguists, frequently encountered at conferences and meetings, was that Thai string collation required linguistic analysis as a consequence of the Thai visual encoding scheme, as opposed to the phonetic encoding scheme of other Indic scripts. This was simply wrong.
A major ambiguity problem would result if one tried to do linguistic analysis. For example, the Thai string เพลา can mean two different words. One is a two-syllable word phe-la (meaning “time”), and the other is a one-syllable word phlao (meaning “axle” or “abate”). In trying to preprocess the string by rearranging the leading vowel เ with the syllable-initial sound, one will find two possible rearrangements: พเลา for the former case, and พลเา for the latter. This makes it impossible to determine a single weight for the string. This is another reason why Thailand completely rejected phonetic encoding in its standards.
Thai string collation is in fact as simple as its encoding scheme. The leading vowel only needs to be swapped with its immediately succeeding character, and nothing more.
Word and syllable boundary
Typical Thai writing has no space between words and syllables. We use space only to separate sentences. The lack of a word or syllable boundary marker was a major problem for early Thai language processing applications and has been researched since the 1980s. Early works focused on syllable segmentation due to its potential for being modeled in a rule-based fashion.25,26 Word segmentation, on the other hand, requires a dictionary. Longest or maximal matching combined with a well-designed dictionary was the first practical approach.27 Over time, researchers have proposed more advanced algorithms. A major contribution was the first open-source software for word segmentation, Smart Word Analyzer for THai (SWATH), which NECTEC developed.28 It is based on a statistical part-of-speech n-gram model. In 2002, Wirote Aroonmanakun constructed a word segmentation engine using two-pass processing: syllable segmentation followed by syllable merging based on collocations.29 These fundamental tools have been applied in various language processing tasks such as line breaking and word boundary detection in word processing and publishing software.
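Longest matching is easy to sketch. The version below is a generic greedy implementation, not the code of any system cited here; the function name `longest_match_segment` and the four-word toy dictionary are assumptions made for illustration. At each position it takes the longest dictionary word that fits and emits a single character when nothing matches.

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            words.append(text[i])
            i += 1
    return words


toy_dictionary = {"ตา", "ตาก", "กลม", "ลม"}
print(longest_match_segment("ตากลม", toy_dictionary))  # -> ['ตาก', 'ลม']
```

The string ตากลม can be read either as ตา + กลม ("round eyes") or as ตาก + ลม ("exposed to the wind"). The greedy rule always picks the second reading, which is one reason why later work moved to statistical and collocation-based methods.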
Soundex
Soundex is a phonetic algorithm for indexing words by sound. It has been widely used in Internet search engines for flexible searching by the same or a similar sound without regard to the written form. An original algorithm, proposed by Vichit Lorchirachoonkul,30 was based on a set of rules used to segment Thai character strings into syllables, simplify the syllables, and encode them in a five-character code. Another general procedure is to find pronunciations of a given keyword and match them to indexed words in the search database. Standard sound representations are defined by the International Phonetic Association (IPA) and the Speech Assessment Methods Phonetic Alphabet (SAMPA).31,32 The Thai Royal Institute has defined a standard sound-based romanization of Thai words. Figure 12 demonstrates these standard sound representations.
Figure 12. Examples of Thai sound representations.
Romanization: sawatdi, khopkhun, lakon
SAMPA: sa_2 wat_2 di:_1, khOp_2 khun_1, la:_1 kOn_2
Automatic letter-to-sound conversion is an important tool that enables automatic soundex search. One of the first complete tools for letter-to-sound conversion was based on a probabilistic generalized left-to-right (GLR) parser trained on a pronunciation dictionary.33
Advanced human–computer I/O
Several forms of human–computer I/O (OCR, handwriting recognition, text-to-speech and speech synthesis, and automatic speech recognition) present different challenges with respect to the Thai language. This section reviews major contributions to these advanced technologies.
OCR and handwriting recognition
Thai OCR is not trivial, as a bounding box may contain up to three symbols stacked together, and the written (or printed) sentences do not have spaces within them. In addition, modern Thai texts often mix in English words. The challenges to OCR developers have been tremendous.
The history of Thai OCR spans well over 20 years. Pipat Hiranvanichakorn et al. and Chom Kimpan et al. were two of the original groups of researchers in this area.34,35 Their subsequent works primarily concerned recognition of individual (isolated) printed characters, that is, without connections among the characters. A number of commercial products have been launched since the 1990s. For example, ArnThai, developed by NECTEC, was one of the first complete OCR software applications that helped users convert scanned pictures of documents to editable documents, with over 90% conversion accuracy under restricted scanning conditions.36
The technology trend has been toward the use of machine learning, such as artificial neural networks. Although OCR technology has reached a commercial level, its mass utilization has not yet been achieved. Problems that cause recognition errors include connected characters, which often appear in dirty documents; poorly scanned documents; and documents with rarely used fonts. Research on Thai OCR is ongoing: for example, on introducing postprocessing to recover from recognition errors.
Thai handwriting recognition was explored about a decade after OCR. Academic research also started with isolated-character recognition.37 Currently applicable systems still mostly focus on isolated characters with a neural network as a basic classifier, like those of OCR. Some systems have been adopted commercially in mobile devices.38 In this simple but accurate task, characters are grouped into Thai letters, English letters, and digits. Character shapes are modified to enlarge the differences between some characters or to simplify the writing.
Text-to-speech and speech synthesis
Research on Thai speech synthesis, chiefly for academic purposes, began in the 1980s. Early work was reported by Sudaporn Luksaneeyanawin.39 Text-to-speech synthesis (TTS) has resulted in several practical products produced around 1990, for example, CUTalk by Luksaneeyanawin and Vaja by NECTEC.40 The major efforts required for Thai TTS are text processing, where a given text is segmented, normalized, and converted to sound representations; prosody prediction, including estimating unit durations, filled pauses, and fundamental frequencies; and synthesis, including the speech database and synthesizing algorithms.
The technology for speech synthesis has been incorporated by adopting available TTS engines in applications such as the IBM Home Page Reader,41 screen readers, and telephone voice response services. Limitations on speech quality and intelligibility inhibit widespread use. The desire to provide TTS applications on portable devices is increasing, and therefore researchers will have to optimize the size of the speech database and text processing dictionary. Thai TTS has been used widely, however, by persons with visual disabilities in Thailand.42
Automatic speech recognition
Development of Thai automatic speech recognition (ASR) began with isolated word recognition (in which speakers utter only a short command, not a sentence) in the mid-1990s, primarily at Chulalongkorn University.43 A few years ago, the research expanded to applications of continuous speech in limited tasks: for example, email access and telebanking. Several attempts were made on large vocabulary continuous speech recognition (LVCSR) for Thai; Table 2 summarizes the results. The lack of a very large speech database has slowed the advance of Thai speech recognition; however, NECTEC is one of the organizations that has continuously developed Thai speech recognition resources available for research, such as the LOTUS corpus.44
Search and Semantic Web
As more companies and organizations in Thailand, both public and private, adopt digital solutions in place of their original paper-based workflows, the need for search technology becomes inevitable. Most of the available software tools for developing information retrieval (IR) systems have not been specifically designed to work with Thai. Consequently, there are two possible approaches for indexing Thai texts: the suffix array and the inverted file index. The former approach treats text as a string of characters and records character positions; the latter approach adopts the conventional word-based paradigm applied for Latin-based languages. Word segmentation is therefore necessary in the latter approach. The stemming process is, however, inapplicable since Thai words are considered noninflectional.
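The suffix array approach can be sketched in a few lines, which also shows why it needs no word segmentation. This is a naive illustration rather than a production index: the function names are hypothetical, and the construction simply sorts all suffixes, which is far too slow for a large corpus. Every occurrence of a query is found by binary search over the sorted suffix positions.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, suffix_array, pattern):
    """Return every character position where pattern occurs, in ascending order."""
    n = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + n]

    # Lower bound: first suffix whose n-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose n-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "ตากลมตากแดด"  # unsegmented text; no word boundaries are needed
index = build_suffix_array(text)
print(find_all(text, index, "ตาก"))  # -> [0, 5]
```

An inverted file would instead map each word to the documents that contain it, so the text would first have to pass through a word segmenter.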
An early publication describing full-text search for Thai was published in 1999 by Pradit Mitrapiyanuruk et al.47 In 2000, Surapan Meknavin, a co-author of that publication, turned his research toward founding an Internet service business called SiamGuru (http://www.siamguru.com), which could be recognized as the first Thai-specific search engine. Today, the research trend in Thai-language search engines is moving toward natural-language and semantic searches for implementing question-answering (QA) systems.
The Semantic Web research initiatives in Thailand can be classified into four major categories:
• information extraction (IE) and ontology learning,
• metadata/ontology standards and tools,
• knowledge representation (KR) and inferences, and
• applications and services.
A wide spectrum of techniques has been researched and developed since the early 2000s. For example, a Thai translation of the Dublin Core metadata element set48 was completed by Science and Technology Knowledge Services (STKS), which also promotes its use in Thailand, and the XML Declarative Description (XDD)49 language was proposed by the Asian Institute of Technology to extend the capabilities of XML and RDF to support rule-based reasoning.
Language resources and industrial standard lists
One of the most important language resources is a dictionary. So Sethaputra is considered the first Thai-English dictionary made into a commercial electronic dictionary product. LEXiTRON is another electronic dictionary with a long history.50 The first version of LEXiTRON was launched by NECTEC in both standalone and Web-based systems in 1995. This first version contained approximately 13,000 Thai words and 9,000 English words. Through user collaboration, the number of entries has more than doubled in the current version, LEXiTRON 2.2. Table 3 lists other available dictionaries. Other language resources necessary for language processing are text and speech corpora. Orchid and LOTUS, shown in Table 4, are considered two of the first official language corpora available publicly for Thai language research.

Table 2. State-of-the-art Thai large vocabulary continuous speech recognition (LVCSR) performance. The organization that provided the research results is listed after each “Task.”
Task | Vocabulary size (words) | Perplexity | Word error rate (%)
Newspaper reading (Carnegie Mellon University)45 | ~7,400 | 140 | 14.0
Newspaper reading (Tokyo Institute of Technology)46 | 5,000 | 140 | 11.6

Table 3. Available Thai dictionaries.
Dictionary | Type | Source | Availability
Royal Institute Dictionary | Monolingual with pronunciation | http://rirs3.royin.go.th/dictionary.asp | Web application
NECTEC LEXiTRON | Bilingual (En-Th) with pronunciation | http://lexitron.nectec.or.th/ | Publicly available
KDictThai | Bilingual (En-Th) | http://kdictthai.sourceforge.net/ | Publicly available
Saikam Dictionary | Bilingual (Jp-Th) | http://saikam.nii.ac.jp/ | Web application
Longdo Dictionary | Multilingual (En-Th, Th-En, De-Th, Th-De, Jp-Th, Th-Jp, Fr-Th, Th-Fr) | http://dict.longdo.com/ | Web, gadget, widget
Size (words) column, values in source order: 133,524; >600,000; 37,018; 35,000 Thai; 53,000 English; 33,582
Concluding remarks
The history of the Thai language on computers dates back to about 1965, but local research involvement began around 1975, when low-cost microcomputers became available. Since the first standard Thai code for computers was developed in 1986, computer processing speed has improved from 8-bit, 8-MHz to 32-bit, 3.6-GHz (1,800 times), and the RAM size for an entry-level computer has increased from 256 Kbytes to 1 Gbyte (3,900 times). NLP in real time is becoming a reality. Research opportunities are open to any students and researchers who can afford these inexpensive computers. Thailand’s I18N (internationalization) and L10N (localization) processes owe much to open standards and many open source projects. It is anticipated that openness in the international IT community will enable the Thai language to be more usable on any computer in the future.
With more IT users, improvement in human-computer interaction for the Thai language will become more desirable. Because computers are also becoming indispensable for the elderly, persons with disabilities, and those who are illiterate, many new developments will be possible. Speech I/O, handwriting input, sign languages, mobile-phone text entry, voice-command systems, speech-to-speech translation, voice search, and so on are the most likely efforts to be developed and commercialized.
Acknowledgments
The authors would like to thank Trin Tantsetthi and Yuen Poovorawan for their personal communications, and Choochart Haruechaiyasak and Marut Buranarach from NECTEC for their initial survey on search engines and Semantic Web technology.
References and Notes
1. P. Soonthornpoct, From Freedom to Hell: A History of Foreign Interventions in Cambodian Politics and Wars, Vantage Press, 2006.
2. M. Winship, “The Printing Press as an Agent of Change? Early missionary printing in Thailand”; http://www.historycooperative.org/journals/cp/vol-08/no-02/tales/.
3. NECTEC National Font Committee, The Time Line of Thai Printing, Thai National Font Book (in Thai), National Electronics and Computer Technology Center (NECTEC), Bangkok, 2001.
4. Antique Phonograph and Gramophone Thai Society, “The Origins of Thai Typewriter”; http://www.talkingmachine.org/smithpremierorigin.html.
5. K. Hensch, “IBM History of Far Eastern Languages in Computing, Part 1: Requirements and Initial Phonetic Product Solutions in the 1960s,” IEEE Annals of the History of Computing, vol. 27, no. 1, 2005, pp. 17-26.
6. T. Koanantakool, “Compilation of all Thai Character Codes for Computers,” Microcomputer Magazine, vol. 1, no. 9, Aug. 1984.
7. National Electronics and Computer Technology Center, Thai Information Technology Standards; http://www.nectec.or.th/it-standards/.
8. T. Koanantakool and the Thai API Consortium, Computer Gub Pasa Thai [Computers and the Thai Language], NECTEC, Oct. 1991 (in Thai).
9. The WTT specification was published in the book Computers and the Thai Language in 1991. WTT, a project code name, is a Thai acronym of “Wing Took Thee,” meaning “I run everywhere.”
10. T. Koanantakool, “The Keyboard Layouts and Input Method of the Thai Language,” Proc. Symp. Natural Language Processing, Chulalongkorn Univ., 1993.
11. S. Pattajoti, The Evolution of the Typewriter, National Research Council, The Office of the Prime Minister, 4 Nov. 1966 (in Thai).
Table 4. Thai text and speech resources.
Corpus (Organization) | Purpose | Type
Orchid (NECTEC)51, http://www.hlt.nectec.or.th/orchid/ | Annotated 568,316 words of Thai junior encyclopedias and NECTEC technical papers | Text
LOTUS (NECTEC)44 | Well-designed speech utterances for 5,000-word dictation systems | Speech
12. T. Tantsetthi, “Thai Software Alert #1: Legitimate Thai Charset on the Internet”; http://software.thai.net/alerts/tis-620/index.html.
13. P. Suriyawong, Y. Poovarawan, and C. Wongchaisuwat, “A Development of Thai Kernel System,” Proc. Regional Workshop on Computer Processing of Asian Languages (CPAL), 26-28 Sept. 1989.
14. T. Koanantakool, “A Brief History of ICT in Thailand”; http://www.bangkokpost.com/20th_database/07Feb2007_data00.php.
15. Microsoft, “Thai OpenType Specification: Creating and Supporting OpenType fonts for the Thai Script”; http://www.microsoft.com/typography/otfntdev/thaiot/default.htm.
16. T. Karoonboonyanan, “Thai Locale”; http://linux.thai.net/~thep/th-locale/.
17. D. Londe and U. Warotamasikkhadit, “Computerized Alphabetization of Thai,” tech. memo TM-BA-1000/000/01, System Development Corp., 1969.
18. V. Lorchirachoonkul, Thai Sorting Algorithm, Thai Algorithm: Design, Analysis and Implementation, tech. report, NIDA and Electrical Generation Authority of Thailand, Bangkok, 1979.
19. S. Khamthaidee, “Thai Style Algorithm: Sorting Thai,” Computer Rev., vol. 91, Mar. 1992.
20. T. Karoonboonyanan, “Thai Sorting Algorithms”; http://linux.thai.net/~thep/tsort.html.
21. T. Karoonboonyanan, S. Raruenrom, and P. Boonma, “Thai-English Bilingual Sorting”; http://linux.thai.net/~thep/blsort_utf.html.
22. T. Karoonboonyanan, S. Raruenrom, and P. Boonma, “Sorting Thai Words with Punctuation Marks,” NECTEC J., vol. 14, Jan.–Feb. 1997, NECTEC, 1997.
23. A. LaBonte, ed., ISO/IEC DIS 14651: International String Ordering and Comparison – Method for Comparing Character Strings and Description of the Common Template Tailorable Ordering, ISO/IEC JTC1/SC22/WG20 N731, 2000.
24. K. Karlsson, Ordering Rules for Thai and Lao, ISO/IEC JTC1/SC22/WG20 N1077, 2003.
25. Y. Thairatananond, “Towards the Design of a Thai Text Syllable Analyzer,” master’s thesis, Asian Inst. of Technology, Pathumthani, 1981.
26. S. Charnyapornpong, “A Thai Syllable Separation Algorithm,” master’s thesis, Asian Inst. of Technology, Pathumthani, 1983.
27. R. Varakulsiripunth et al., “Word Segmentation in Thai Sentence by Longest Word Mapping,” Papers on Natural Language Processing: Multilingual Machine Translation and Related Topics (1987–1994), NECTEC, 1989.
28. S. Meknavin, P. Charoenpornsawat, and B. Kijsirikul, “Feature-Based Thai Word Segmentation,” Proc. Natural Language Processing Pacific Rim Symp. (NLPRS 97), 1997, pp. 41-48.
29. W. Aroonmanakun, “Collocation and Thai Word Segmentation,” Proc. Joint Int’l Conf. Symp. Natural Language Processing (SNLP) and Oriental Chapter of the Int’l Committee for the Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA), 2002, pp. 68-75.
30. V. Lorchirachoonkul, “A Thai Soundex System,” Information Processing and Management, vol. 18, no. 5, 1982, pp. 243-255.
31. K. Tingsabadh and A.S. Abramson, Illustrations of the IPA: Thai, Handbook of the International Phonetic Association, Cambridge Univ. Press, 1999.
32. J.C. Wells, “SAMPA for Thai”; http://www.phon.ucl.ac.uk/home/sampa/thai.htm.
33. P. Tarsaku, V. Sornlertlamvanich, and R. Thongprasirt, “Thai Grapheme-to-Phoneme Using Probabilistic GLR Parser,” Proc. European Conf. Speech Communication and Technology (EuroSpeech), ISCA, 2001, pp. 1057-1060.
34. P. Hiranvanichakorn, T. Agui, and M. Nkajima, “A Recognition of Thai Characters,” Trans. IECE Japan, vol. E 54, no. 12, 1982.
35. C. Kimpan et al., “Thai Character Pattern Recognition Preliminary,” Proc. Conf. L-05, College of Science and Technology, Nihon Univ., 1980.
36. NECTEC, “Open TLE”; http://www.opentle.org/.
37. V. Tanrudee, “Thai Character Handwriting Recognition System,” master’s thesis, Chulalongkorn Univ., 1989.
38. G Softbiz Co., Ltd., “Thai-G”; http://www.thai-g.com/.
39. S. Luksaneeyanawin, “A Thai Text-to-Speech System,” Proc. Regional Workshop Computer Processing of Asian Languages (CPAL), 1989, pp. 305-315.
40. P. Mittrapiyanurak et al., “Issues in Thai Text-to-Speech Synthesis: The NECTEC Approach,” Proc. NECTEC Ann. Conf., NECTEC, 2000, pp. 483-495.
41. IBM, “IBM Launches IBM Home Page Reader V3.02 Thai, A Thai-speaking Web Browser – The 12th Language of the World,” press release, 12 Feb. 2003.
42. Thailand Association of the Blind, “Software for the Blind”; http://www.tab.or.th/.
43. T. Pathumthan, “Thai Speech Recognition Using Syllable Units,” master’s thesis, Chulalongkorn Univ., 1987 (in Thai).
44. S. Kasuriya et al., “Thai Speech Corpus for Speech Recognition,” Proc. Int’l Conf. Int’l Committee for the Coordination and Standardization of Speech Databases and Assessments (O-COCOSDA), 2003, pp. 105-111.
45. S. Suebvisai et al., “Thai Automatic Speech Recognition,” Proc. IEEE Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP), IEEE Press, 2005, pp. 857-860.
46. I. Thienlikit, C. Wutiwiwatchai, and S. Furui, “Language Model Construction for Thai LVCSR,” Reports of the Meeting of the Acoustic Soc. of Japan, ASJ, 2004, pp. 131-132.
47. P. Mitrapiyanuruk et al., “A Development of Full-Text Search Engine for Large Scale Thai Text Database,” Proc. Nat’l Science and Technology Development Agency (NSTDA), annual meeting, 1999, pp. 246-257 (in Thai).
48. Science and Technology Knowledge Service (STKS), “Dublin Core Metadata Element Set (Thai translation)”; http://dublin.stks.or.th/.
49. V. Wuwongse et al., “XML Declarative Description: A Language for the Semantic Web,” IEEE Intelligent Systems, vol. 16, no. 3, 2001, pp. 54-65.
50. NECTEC, “LEXiTRON Online Dictionary”; http://lexitron.nectec.or.th/.
51. V. Sornlertlamvanich, T. Charoenporn, and H. Isahara, ORCHID: Thai part-of-speech tagged corpus, tech. report TR-NECTEC-1997-001, ISBN 9747576-98-8, NECTEC, 1997.
Hugh Thaweesak Koanantakool is a vice president of the National Science and Technology Development Agency. He received a PhD in digital communications from the Imperial College of Science and Technology, London University. His major works are in IT standards of Thailand, the establishment of the Internet in Thailand, SchoolNet Thailand, and e-Government. Koanantakool served on Thailand’s National IT Committee and on the parliamentary commissions on the Electronic Transactions Act of 2001 and the Computer-related Offences Act in 2007. Contact him at htk@nectec.or.th.
Chai Wutiwiwatchai received a PhD from Tokyo Institute of Technology in 2004. Currently he is the director of the Human Language Technology Laboratory in NECTEC, Thailand. His research interests include speech processing, natural language processing, and human–machine interaction. Contact him at chai.wutiwiwatchai@nectec.or.th.
Theppitak Karoonboonyanan received a B.Eng. (honors) in computer engineering from Chulalongkorn University in 1993. His interests are software engineering and free and open source software philosophy. He has contributed to the GNOME, Debian, and other projects for Thai supporting infrastructure, and has maintained upstream projects for Thai resources. Contact him at thep@linux.thai.net.
For further information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.