Multicode: a truly multilingual approach to text encoding

0018-9162/97/$10.00 1997 IEEE April 1997 37

Multicode: A TrulyMultilingualApproach to TextEncoding

priate for use in different countries haveincreased demand for a standard character setfor use with many different languages.Currently, the ASCII character set1 is theworld’s most widely accepted and used stan-dard character set for computers, operatingsystems, compilers, and e-mail systems.However, while ASCII encoding adequatelyrepresents English text, it does not addressthe problem of handling text in other lan-guages.

ASCII is a 7-bit code and defines only 128characters. When used with an 8-bit characterformat, the 128 characters that ASCII wouldnot use could be used as extensions to ASCII.These extensions would be used to definecharacters of different languages. For exam-ple, the ISO 8859 standard2 defines a Latinextension (that supports many European lan-guages), as well as Cyrillic, Arabic, Greek,Hebrew, and other language extensions.

To handle documents that mix English with

a second language, you must use the ASCII extensionfor the second language. However, this approach pre-sents two key problems:

• The 128 characters represented by the extensionsare used for every language besides English.Therefore, mixing more than one language withEnglish presents a problem: You cannot tellwhich language they refer to in a piece of text.Traditionally, programs used awkward escapesequences to shift the meaning of bits.

• The 8-bit extension cannot accommodate lan-guages with more than 128 characters, such asJapanese, which would require a 16-bit extension.

Major computer companies and organizationsdeveloped the Unicode character set standard toaddress these problems. However, the 16-bit Unicodesystem, which uses one character set for all languages,creates some new problems. I developed Multicode,which uses multiple character sets for multiple lan-guages, to address Unicode’s principal drawbacks.

The Multicode system is new, so it needs more studyand more exposure to the computer industry before

Cyb

ers

qu

are

Unicode was designed to extend ASCII

for encoding text in different languages,

but it still has several important

drawbacks. Multicode overcomes

those drawbacks.

Muhammad F. MudawwarAmerican University in Cairo

he Internet’s explosive growth and the recentinterest in building international software appro-

.

38 Computer

it can be considered for standardization. However, itshows great promise as a system that can effectivelyencode text in many languages.

UNICODE STANDARDThe Unicode3,4 standard is a fixed-width scheme for

encoding written characters, including alphabeticcharacters, ideographic characters, and symbols. Thestandard is modeled on the ASCII character set butuses 16-bit encoding to support full multilingual text.Unicode requires no escape sequence or control codeto specify a character. Figure 1 illustrates the differ-ence between ASCII text and Unicode text. (All char-acter codes in this article appear in hexadecimalnotation.)

The Unicode project began in 1988, and its primarygoal is to remedy two serious problems common tomost multilingual computer programs:

• They overload the font mechanism when encod-ing characters.

• They use multiple, inconsistent character codesbecause of conflicting national and industry char-acter standards.

Major computer companies and organizationsincorporated the Unicode Consortium in January1991 to promote the Unicode standard as an interna-tional encoding system for information interchange, toaid in the standard’s implementation, and to maintainquality control over future revisions.

Design principlesThe Unicode standard’s designers envisioned a sim-

ple, comprehensive uniform encoding system that couldmeet the needs of technical and multilingual computing,as well as high-quality typesetting and desktop pub-lishing. The designers tried to achieve a balance betweenconsistency, simplicity, efficiency, and compatibility

with existing encoding standards.3 Unicode’s design wasbased on the following principles:

• 16-bit characters: All Unicode character codesare 16 bits wide, which provides more code spacethan the 8-bit ASCII system.

• Full encoding: The full 16-bit code space can beused to represent up to 65,536 characters.Unicode Standard Version 2.0 represents 38,885characters.

• Encoding of characters, not glyphs: The Unicodestandard encodes characters (the smallest com-ponent of a written language that has a semanticvalue, not glyphs (the shape of a character whenit is rendered or displayed).

• Semantics: Characters have well-defined seman-tics. Unicode includes character property tablesthat can be used with parsing, sorting, and otheralgorithms that require semantic knowledgeabout code points.

• Plain text: The Unicode standard encodes plaintext, rather than fancy (or rich) text, whichincludes such information as font, size, color, andhypertext links.

• Logical order: For all scripts, Unicode text is storedin memory in the same order in which the text istyped, which affects text rendering in some bidi-rectional scripting systems.

• Unification: Because of the limited 16-bit codespace, the Unicode standard avoids duplicateencoding of characters. It uses the same code forthe same character, regardless of the languagethat is being represented or the way the languageuses the character. So, the letter “a” has the samecode whether you are encoding text in English,Spanish, or French.

• Dynamic composition: The Unicode standardpermits the dynamic composition of accentedforms (those letters having diacritical marks).Dynamic composition allows us to form a newcharacter on the fly by combining a characterwith a diacritical mark.

• Equivalent sequence: Some text elements can alsobe encoded as static precomposed forms—more convenient, compact representations. TheUnicode standard provides a way to map to theequivalent dynamic composition, so programscan check for equivalent character sequences.

• Convertibility: The standard provides accurateconvertibility between Unicode and other widelyaccepted encoding standards.

Unicode’s drawbacksAlthough Unicode can remedy some problems expe-

rienced by multilingual computer programs, it alsohas several drawbacks.

Figure 1. ComparingASCII and Unicode.ASCII, which wasdesigned primarily torepresent English,uses 7-bit charactersto represent letters,symbols, numbers,and so on. Unicode,which was designedto represent a varietyof languages, uses16-bit characters torepresent letters,symbols, andideographic characters.

ASCII text Unicode text

A

S

C

I

I

t

e

x

t

.

41

53

43

49

49

20

74

65

78

74

2E

A

S

C

I

I

α

⇒

0041

0053

0043

0049

0049

0639

0631

0628

064A

03B1

21D2

.

Efficiency. Most documents are written entirely inone language, and most languages can be encodedusing an 8-bit system. Using a 16-bit encoding schemeuses twice the storage capacity and incurs twice thetransmission delay. Although compressing Unicode-encoded files or strings would alleviate this problem,it would have a substantial overhead if done fre-quently and is thus impractical.

Language orientation. To ensure simple characterrepresentation, the Unicode standard is not (and wasnot) intended to be language-oriented. No informa-tion about the language that is being used in a par-ticular piece of text is encoded in a Unicode file.

This can be a problem because in many cases, thesame letter exists in different languages. For example,the Latin letter “a” (Unicode 0041) exists in English,Spanish, and French. The Arabic letter “alef”(Unicode 0627) exists in Arabic, Farsi, and Urdu. Thesame is true for the Chinese letter “zi,” Japanese let-ter “ji,” and Korean letter “ja.” The same charactercode (Unicode 5B57) represents these letters.

Different languages use letters or symbols in differ-ent ways, and this affects multilingual text process-ing. You need to know which languages are being usedto perform many kinds of character data processing,such as searching, sorting, spell checking, and gram-mar checking. However, Unicode encoding does nottell you which languages are being used in text files.

A programmer might be able to determine the lan-guage of a Unicode text heuristically, based on its con-tent. However, this requires additional work, and theaccuracy of such an effort depends on the effective-ness of the heuristic used.

Encoding. The unification principle necessary toUnicode results in a poor encoding of characterblocks. For example, the Latin 1 supplement (0080-00FF) mixes over a dozen major European languages.Similarly, the Arabic character block (0600-06FF)encodes characters for Arabic, Persian, Urdu, Pashto,Sindhi, and Kurdish. The same can be said about theunified Chinese, Japanese, and Korean (CJK) ideo-graphs. This causes difficulties in designing applica-tions for a single language, such as those for naturallanguage processing. It is much easier to design suchapplications if the characters in the underlying blockare restricted to one language. This argues for anencoding system that uses separate character blocksfor different languages.

Compatibility with ASCII. Unicode is not truly com-patible with ASCII. Unicode files are sequences of 16-bit characters, while ASCII files are sequences of 8-bitcharacters. Unicode’s encoding scheme does not rec-ognize ASCII files and requires that you convert themto Unicode format by adding a null byte to the begin-ning of each ASCII code.

Although conversion from ASCII to Unicode is

straightforward, to exchange plain Unicode files overthe Internet, you would have to use universal charac-ter transformation formats (UTF-7 and UTF-8) witheach file.3,4 It is not clear whether systems that adoptthe Unicode standard should convert all ASCII files toUnicode and convert their low-level I/O routines sothey can handle Unicode’s 16-bit characters.

MULTICODEIn response to the recognized shortcomings of

Unicode, I developed Multicode in early 1996.Multicode’s most important feature is its use of mul-

tiple character sets. Multicode is not an extension toASCII but rather a collection of character sets. Mostcharacter sets are designed for specific languages, buta few can be designed to be language-independent,representing arrows, mathematical symbols, and soon. Thus, unlike Unicode, Multicode is language-ori-ented. Each character set is designed to be self-suffi-cient and, like ASCII, includes all necessary controlcharacters, punctuation, and special symbols.

Character setsInstead of unifying all languages into a 16-bit char-

acter set, Multicode defines 8-bit character sets formost languages and 16-bit character sets for some oth-ers, such as Chinese, Japanese, and Korean. There canbe a total of 256 character sets, enough to representall the world’s official written languages. (There areabout 185 independent countries, and some share thesame written language.) Each character set has aunique code; the default set, ASCII, has code 0.

A considerable overlap may exist between two char-acter sets, but each set will have some unique charac-ters and symbols and a unique ordering of characters.For example, Latin languages share many commonletters, but each has some unique letters and a uniqueordering. The same can be said of Arabic, Farsi, andUrdu, and about East Asian languages. It is also pos-sible to use more than one character set for a language,in cases where a different version of the same languageis used in different countries or regions.

Multicode attempts to define a unique character setfor each written language, rather than unifying char-acters across languages as Unicode does. It is also pos-sible to incorporate more than one character setstandard in a given language, in case different coun-tries or regions use these sets.

Switching between character setsMulticode provides a simple and uniform technique

for switching between character sets, which is essen-tial for handling text in multiple languages. A switchcharacter triggers switching.

For 8-bit character sets, Multicode reserves the lastcharacter (code FF) as the switch character. To switch

You need toknow whichlanguagesare beingused to performmany kindsof characterdataprocessing,such assearching,sorting, spell checking,and grammarchecking.

April 1997 39

.

40 Computer

from one 8-bit character set to another 8- or 16-bitset, you would use the switch character and then a sec-ond byte, which designates the second character set’scode. For example, if you want to switch from French(suggested code 01) to Hindi (code 50) you would usecharacter FF and then 50.

For 16-bit character sets, Multicode reserves the last256 characters (codes FF00 to FFFF) as switch char-acters. The first byte of the 2-byte switch character isalways FF, while the second byte designates the codeof the new character set. Figure 2 shows how to switchbetween character sets in Multicode. In Files 1 and 2,the switches from the 8-bit character sets were accom-plished with two 8-bit characters. In File 3, the switchfrom 16-bit Japanese was accomplished with one 16-bit character.

Addressing Unicode’s drawbacksMulticode is designed to overcome Unicode’s major

drawbacks.Efficiency. Because Multicode uses 8-bit and 16-bit

character sets, it can use 8-bit characters for 8-bit doc-uments. It will thus use only half the storage capac-ity handling such documents that the 16-bit Unicodesystem would use.

Language orientation. Multicode is designed to belanguage-oriented. Each character set is designed fora particular language and includes all the necessarycontrol characters and special symbols. Control char-acters are similar to those used in ASCII, which elimi-nates the need to switch character sets for control. Forexample, Multicode incorporates the newline charac-ter—a control character at the end of each line in anASCII file—in the character sets of other languages.

Unlike Unicode, Multicode provides language infor-mation directly via the switch characters in Multicodetext. This information can be important to language-dependent applications.

Meanwhile, because character sets are language-oriented and small, character and string operationsare well-defined and can be implemented easily.

Compatibility with ASCII. Multicode recognizes andis truly compatible with ASCII. You do not even haveto use a switch character to begin an ASCII file, sinceASCII’s character set code (00) is Multicode’s defaultcode.

Multicode also recognizes and accommodates theUnicode standard by reserving its last 16-bit charac-ter set (FF) for Unicode. In Multicode, you begin aplain Unicode file with the “switch to Unicode” char-acter (FFFF).

Scripting systemsIn Multicode, various languages or character sets

can share a common scripting system. A scripting sys-tem designates the rules for rendering characters,including their display, order, and format.6 Scriptingsystems differ due to the direction in which the text iswritten (English is written left-to-right; Arabic, right-to-left), alignment, character representation, andwhether a character changes its form depending on itsposition relative to other characters. One scripting sys-tem can serve several languages. For example, theRoman scripting system can serve ASCII, French,German, and other character sets.

A scripting system is responsible for interpreting acharacter sequence and handling font files. Charactersand glyphs can have one-to-one, one-to-many, ormany-to-one relationships. The Roman scripting sys-tem is relatively simple because character codes havea one-to-one mapping with glyph codes (used to accessa glyph in a font file). The Arabic scripting system iscomplex because it uses all three relationships. Forexample, an Arabic letter can change its form accord-ing to its relative position within a word; such a letterrequires a one-to-many mapping. Ligatures, whichcombine two or more characters into a single glyph,require a many-to-one mapping.

Two characters in different character sets may sharethe same form (or glyph) but have different codes. Forexample, the French “a” may be designed to have a dif-

Figure 2. Switching between character sets in Multicode. File 1 is an Arabic text file. Tostart in Arabic, a switch character (FF) is followed by the code of the Arabic character set(suggested code 30). File 2 mixes ASCII and French characters. Since this file does notstart with a switch character, the character set code defaults to 00 (the code for ASCII),an 8-bit character set. Following the ASCII characters is a switch (FF) to French(suggested code 01), another 8-bit character set. File 3 mixes Japanese and ASCII char-acters. It starts with a switch (FF) to Japanese (suggested code E1), continues with 16-bitJapanese characters, switches to ASCII (FF00), and finishes with 8-bit ASCII characters.

File 1

FF

30

5E

37

20

A7

64

5B

Switch toArabic

8-bitArabicchars

File 2

4A

3B

FF

01

65

7C

8-bitASCIIchars

Switch toFrench

8-bitFrenchchars

File 3

FF

E1

02

5D

12

FF

FF

00

41

72

Switch toJapanese

16-bitJapanese

chars

Switch toASCII

8-bitASCIIchars

.

ferent character code than the ASCII ‘‘a.” To rendertext and handle fonts properly, I suggest the followingtwo alternatives for designing a scripting system.

Fonts corresponding to character sets. A font filecould correspond to a unique character set. Youwould thus have different font files for the ASCII andFrench character sets. Glyph codes would then haveto correspond directly to character codes. Thus, if thecharacter code for the French “a” is different thanthat of the ASCII “a,” the glyph codes should also bedifferent. When switching character sets, the scriptingsystem would have to switch font files as well.

A one-to-one relationship between a font file andcharacter set simplifies the scripting system imple-mentation. It also keeps font files small because theyneed not include glyphs for characters that are notpart of a particular language’s character set. However,this approach requires a large number of overlappingfont files, particularly for scripting systems that sup-port multiple languages.

Unified fonts. This approach unifies all charactersets supported by a scripting system into a charactertable with no duplicate characters. (The unified char-acter table would be part of the scripting system, soit would not be a concern to application program-mers.) Thus, the Latin “a” would represent the ASCII“a,” the Spanish “a,” the French “a,” and so on. Youwould define a mapping function for each characterset supported by a scripting system. This would map

a character in a language onto its corresponding uni-fied character. In this case, glyph codes would haveto correspond directly to unified characters. Hence,all character sets supported by a scripting systemcould use the same font file.

This approach would decrease the number of fontfiles. However, each font file would be larger, and ascripting system would have to define mapping func-tions for each language’s character set. This wouldcreate a smaller overhead than the previous approach.

Implementing MulticodeImplementing Multicode is modular and incre-

mental. A Multicode-based system need not supportall scripting systems simultaneously. Some scriptingsystems could be added as system extensions imple-mented in a shared library.

Figure 3 illustrates how a windowing system canhandle Multicode text. The Multicode text managermanages all the scripting systems supported by oradded to the windowing system. It interprets Multi-code text by directing characters to the appropriatescripting system for the language in use. The scriptingsystem converts character codes into glyph codes. Insome cases, it may combine several characters into asingle glyph code or assign a character a different glyphcode based on context. It then obtains the appropriateglyph from a font file and sends the glyph to the draw-ing manager for rendering.

April 1997 41

PixelsMulticode

text Multicodetext

manager

Fonts

Romanscripting system

8-bit chars Glyphs

Fonts

Arabicscripting system

8-bit chars Glyphs

Fonts

Japanesescripting system

16-bit chars Glyphs

Dra

win

g m

anag

er

Figure 3. Handling Multicode text. Multicode text is initially received by the Multicode text manager, which directs the text tothe scripting system for the language in use. The scripting system converts the character code sequence into a glyph sequence.It then obtains the glyphs from a font file and sends them to the drawing manager for rendering.

.

42 Computer

When the Multicode text manager does not supporta language’s scripting system, the characters still willnot be misinterpreted. When text cannot be renderedor displayed, the Multicode text manager can displaya special glyph for every character that cannot be ren-dered or can display nothing but reports this situationto the user.

PROGRAMMING ISSUESBecause all Unicode characters have a uniform 16-

bit width, they can be represented in a programminglanguage by a single character type. In fact, theUnicode standard suggests the type unichar to rep-resent Unicode characters. We can define unichar asunsigned short or as wchar_t in C or C++. TheJava programming language5 defines a 16-bit chartype to represent Unicode characters.

Because of this uniformity, it seems that program-ming with Unicode should be simpler than pro-gramming with Multicode. If displaying characters isthe only operation performed on characters, it wouldindeed be simpler to use a single character type, anadvantage in using Unicode. However, it is very diffi-cult to use Unicode for many other types of stringoperations.

Disadvantages of UnicodeAny programming language’s character type is an

abstract data type—it represents a set of values and aset of operations on these values. To work consistently,the set of operations would have to be meaningful anduniform across all values of an abstract data type. Thiswould be the case when a character type representsonly one language, as is the case with ASCII. However,this would not be the case with Unicode becauseUnicode’s character type is meant to represent manylanguages, and character and string operations are notmeaningful and uniform across languages. Some char-acter and string operations apply to one language butare meaningless to others. Some character and stringoperations behave differently across languages.

For example, the use of upper or lower case is mean-ingful in languages that use the Latin alphabet but notin many other languages. Similarly, changing the dia-critical marks of letters is meaningful to Arabic butnot to ASCII. Meanwhile, some languages that use asimilar alphabet order the letters differently. InUnicode, the Arabic letter “ha” (Unicode 0647) comesbefore the Arabic letter “waw” (Unicode 0648), whichis correct in Arabic but not Farsi. Similarly, Swedishtreats “ä” (letter “a” with a dieresis, Unicode 00E4)as an individual letter, placing it after “z” in the alpha-bet. However, German places it either like “ae” orafter “a.” Unicode string processing cannot handlethese problems without information about the lan-guage in use.

Programming with MulticodeIn Multicode, programs define a string of charac-

ters for each character set. This makes sense becausecharacter and string operations are language-specificand vary from one character set to another. Thus,instead of having one char (or unichar) type in aprogramming language, we would define a family ofcharacter types, calling them ascii, french, ara-bic, chinese, and so on. These character types neednot be built in or predefined by the programming lan-guage. Instead, we can define them in a standardlibrary that supports the programming language.

Strings are arrays of either 8- or 16-bit characters;we cannot mix the two. Because a program defines astring over exactly one character set, a string shouldnot contain a switch character. Data types interpretsuch characters as invalid values. Thus, it is not pos-sible to mix characters of two data types in one stringin a programming language.

Character and string operations. Defining a charac-ter data type for exactly one character set simplifiesthe implementation of these operations. Each char-acter type defines a set of operations meaningful tothe corresponding language. We can also define someoperations for more than one character type. Theseoperations are overloaded and may behave differentlyfor different character types. It is also possible todesign an algorithm for use by several data types. Forexample, a string comparison algorithm designed forthe ascii character type should also work withoutmodification for the french character type, providedthe French character set is properly ordered.

Writing to a file. Writing a character or a string toan output file is straightforward. The programminglanguage I/O library introduces a switch characterevery time the system encounters a character or stringthat belongs to a new character set. Clearly, the writeprocedure should be overloaded to handle many char-acter types.

Reading from a file. Reading characters from a fileoccurs at two levels. At the lowest level, an operatingsystem treats a Multicode file as a sequence of bytes,and a read system call specifies the number of bytesto be read. The operating system associates no mean-ing to these bytes.

The programming-language level uses a read pro-cedure to read characters. The read procedure is over-loaded to handle characters of different types. Thetype of the actual character parameter should matchthe character set code of the read characters.Otherwise, no character is read. The program con-tinues reading until it encounters a switch character,but does not read the switch character.

When this happens, the program determines thecharacter set code of the characters that follow. Acharset function performs this task, returning the

Bothapproachescan coexist—Multicode forprogrammingease and Unicode to supportunified fonts.

.

character set code of the next character to be read. Ifthe next character is a switch character, charsetskips that character and returns a new character setcode. Otherwise, it returns the current code withoutchanging the file position.

Multicode addresses many of Unicode’s draw-backs and should have considerable appeal toprogrammers who work with text in a variety

of languages. Its future, however, depends on thecomputer industry’s acceptance. Multicode can rep-resent Unicode files because it reserves a character setfor Unicode. Converting Multicode to Unicode is alsostraightforward (although the opposite is not). Thus, both approaches can coexist—Multicode forprogramming ease and Unicode to support unifiedfonts. ❖

References

1. ANSI X3.4: Coded Character Set—7-Bit AmericanNational Standard Code for Information Interchange,Am. Nat’l Standards Inst., New York, 1986.

2. ISO 8859: Information Processing—8-Bit Single-Byte-Coded Graphic Character Sets, Int’l Organization forStandardization, Geneva, 1987.

3. Unicode Consortium, The Unicode Standard, Version2.0, Addison-Wesley, Reading, Mass., 1996.

4. Unicode Consortium home page, http://www.uni-code.org.

5. D. Flanagan, Java in a Nutshell, O’Reilly & Associates,Sebastopol, Calif., 1996.

6. Apple Computer, Inside Macintosh: Text, Addison-Wesley, Reading, Mass., 1993.

Muhammad F. Mudawwar is an assistant professorin the Computer Science Department at the AmericanUniversity in Cairo. His research interests include par-allel programming language design and implementa-tion, multilingual systems, and shared memorymultiprocessors. Mudawwar received a BS in electri-cal engineering from the American University, Beirut,Lebanon, and an MS and a PhD in computer engi-neering from Syracuse University. He is a member ofthe IEEE and the ACM.

Contact Mudawwar at [email protected].

April 1997 43

How to Reach ComputerWritersWe welcome submissions. For detailed information,write for a Contributors’ Guide ([email protected]) or visit our Web site:http://computer.org/pubs/computer/computer.htm.

Letters to the EditorPlease provide an e-mail address or daytime phone with your letter.

Computer Letters10662 Los Vaqueros CircleLos Alamitos, CA 90720

fax (714) [email protected]

On the WebVisit our Web site at http://computer.org forinformation about joining and getting involvedwith the Computer Society and Computer.

Magazine Change of AddressSend change-of-address requests for magazinesubscriptions to [email protected] sure to specify Computer.

Membership Change of AddressSend change-of-address requests for the membershipdirectory to [email protected].

Missing or Damaged CopiesIf you are missing an issue or received a damaged copy, contact [email protected].

ReprintsWe sell reprints of articles. For price information or toorder, send a query to [email protected] or a faxto (714) 821-4010.

Reprint PermissionTo obtain permission to reprint an article, contact William Hagen, IEEE Copyrights and TrademarksManager, at [email protected].

.

Documents

Multicode: a truly multilingual approach to text encoding