CHARACTER ENCODING: How do computers deal with multiple languages?

8/14/2019


Part 2

CHARACTER ENCODING:

How do computers deal with multiple languages?

by Tze Wei



Content

Basic Computing Knowledge

Binary, Decimal and Hexadecimal Numbers

Unicode Character Set

Character Encoding

Language Input Software

Fonts

Glyphs



Data Communication

In order for computers to understand each other, they have to speak and understand the same language.

In computing terms, they must share the same encoding (speaking) and decoding (understanding) protocol.



Every time we press a key on a keyboard, it generates a sequence of high and low voltages which represent binary numbers.

These sequences of data are saved in memory or transmitted to another computer via a network.

In order for the recipient to understand (decode) what the sender said (encoded), both of them must have the same understanding (encoding) of what that string of binary numbers means.



Numeral Systems

Computer data is represented in binary numbers (base-2 numeral system), as opposed to the decimal numbers (base-10 numeral system) we use in daily life.

Decimal Numbers   Binary Numbers   Hexadecimal Numbers
0                 0                0
1                 1                1
2                 10               2
3                 11               3
4                 100              4
5                 101              5
6                 110              6
7                 111              7
8                 1000             8
9                 1001             9
10                1010             A
11                1011             B
12                1100             C
13                1101             D
14                1110             E
15                1111             F
16                10000            10
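The table above can be reproduced with a few lines of Python. This is only a small sketch; the helper name `to_bases` is made up for illustration and is not part of the original slides.

```python
def to_bases(n):
    """Return n written in decimal, binary and hexadecimal, as strings."""
    return str(n), format(n, "b"), format(n, "X")


# Print the same rows as the table: 0 to 16 in the three numeral systems.
print("Decimal  Binary  Hexadecimal")
for n in range(17):
    decimal, binary, hexadecimal = to_bases(n)
    print(f"{decimal:>7}  {binary:>6}  {hexadecimal:>11}")

# int() with an explicit base converts in the other direction.
assert int("1010", 2) == 10
assert int("A", 16) == 10
```

The same value is stored in every case; only the way it is written down changes.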



Common Character Sets

ASCII (American Standard Code for Information Interchange)

- originally based on the English language; it encodes 128 characters

- numbers 0-9, letters a-z and A-Z, some basic punctuation symbols, some control codes

- all stored in 7 binary digits (bits)

Keys        Binary Representation   Decimal Number
A           1000001                 65
B           1000010                 66
C           1000011                 67
!           0100001                 33
?           0111111                 63
$           0100100                 36
Backspace   0001000                 8
Escape      0011011                 27
Delete      1111111                 127
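As a rough illustration, the rows above can be checked with Python's built-in `ord`, which returns the number assigned to a character. The helper name `ascii_row` is an assumption made for this sketch.

```python
def ascii_row(ch):
    """Return (7-bit binary string, decimal number) for one ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "07b"), code


# The last three entries are the control codes Backspace, Escape and Delete.
for key in ["A", "B", "C", "!", "?", "$", "\b", "\x1b", "\x7f"]:
    bits, code = ascii_row(key)
    print(repr(key), bits, code)
```

Every value fits in 7 bits, which is why the binary column never needs more than seven digits.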




Most early computers stored data in 8-bit bytes.

With an 8-bit byte, not only is it possible to store every possible ASCII character, but there is also one whole bit to spare.

Byte = the smallest addressable unit of memory in many computer architectures

Because bytes have room for eight bits, many people had their own ideas about what should go in the space from 10000000₂ (128₁₀) to 11111111₂ (255₁₀).

For example, on some American PCs the character code 10000010₂ (130₁₀) would display as é, but on computers in Israel it was the Hebrew letter Gimel (ג), so when Americans sent their résumés to Israel they arrived as rגsumגs.
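The résumé mix-up can be reproduced in Python. The slide does not name the character sets involved; this sketch assumes the American PC used code page 437 and the Israeli one used code page 862, both of which ship with Python's standard codecs.

```python
# Assumption: the sender's PC uses code page 437, the recipient's uses 862.
SENDER_CODE_PAGE = "cp437"
RECIPIENT_CODE_PAGE = "cp862"

text = "r\u00e9sum\u00e9s"                 # "résumés"
raw = text.encode(SENDER_CODE_PAGE)        # each é becomes the single byte 130
print(list(raw))

# The bytes travel unchanged, but the recipient reads them with another table.
print(raw.decode(SENDER_CODE_PAGE))        # what the sender meant
print(raw.decode(RECIPIENT_CODE_PAGE))     # what the recipient sees
```

Nothing is corrupted in transit. The same byte 130 simply means é under one table and Gimel under the other.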




Unicode

- A group of ambitious people came up with the idea of creating a single character set that includes every reasonable writing system in the world, covering 110,181 characters from the world's alphabets, ideograph sets, and symbol collections.

- Scripts such as Amharic and Tamil, and even old scripts that are no longer in common use, such as Baybayin (the old Filipino writing system) and Chữ Nôm (the old Vietnamese characters), are assigned numeric codes (code points) to prevent confusion between computers.



The code assigned to a specific character in the Unicode Standard is called a code point.

The binary number for a character can be very long.

The Chinese character 𤭢 is represented by the binary number 100100101101100010₂ (150370₁₀).

Note: To make the code point more concise, it is written in the format U+ followed by the hexadecimal number. Thus, 𤭢 is U+24B62.
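A quick way to check the three notations of the same code point is sketched below in Python. The helper name `code_point_label` is hypothetical; `ord` and `format` are built in.

```python
def code_point_label(ch):
    """Write a character's code point in the U+hexadecimal notation."""
    return f"U+{ord(ch):04X}"


ch = "\U00024B62"             # the Chinese character used as the example here
print(ord(ch))                # the code point as a decimal number
print(format(ord(ch), "b"))   # the same number in binary
print(code_point_label(ch))   # the concise U+ notation
```

By convention the hexadecimal part is padded to at least four digits, so the letter A is written U+0041.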



UTF-8 Encoding

The code point has to be encoded and segmented into several 8-bit bytes in order to be stored in computer memory, transmitted across communication networks, and deciphered correctly by other computers.

UTF-8 is an encoding method widely used on the internet and increasingly used as the default character encoding in operating systems, programming languages, and software applications.

First Code Point   Last Code Point   No. of Bits   No. of Bytes Required   1st Byte   2nd Byte   3rd Byte   4th Byte   5th Byte   6th Byte
U+0000             U+007F            7 bits        1                       0xxxxxxx
U+0080             U+07FF            11 bits       2                       110xxxxx   10xxxxxx
U+0800             U+FFFF            16 bits       3                       1110xxxx   10xxxxxx   10xxxxxx
U+10000            U+1FFFFF          21 bits       4                       11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
U+200000           U+3FFFFFF         26 bits       5                       111110xx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx
U+4000000          U+7FFFFFFF        31 bits       6                       1111110x   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx

Note: The 5-byte and 6-byte rows belong to the original UTF-8 design. The current definition of UTF-8 (RFC 3629) stops at U+10FFFF, so a character never takes more than 4 bytes.
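The first step of the encoder, choosing how many bytes a code point needs, can be sketched as a range check. This is a simplified illustration: the function name `utf8_length` is made up, it covers only the 1- to 4-byte forms allowed by the current UTF-8 definition (up to U+10FFFF), and it ignores the reserved surrogate range.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 needs for a code point (1 to 4)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the range of current UTF-8")
    if code_point <= 0x7F:
        return 1   # 0xxxxxxx
    if code_point <= 0x7FF:
        return 2   # 110xxxxx 10xxxxxx
    if code_point <= 0xFFFF:
        return 3   # 1110xxxx 10xxxxxx 10xxxxxx
    return 4       # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


# Compare with Python's own encoder for a few sample code points.
for cp in (0x41, 0x627, 0x807D, 0x24B62):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X} needs {utf8_length(cp)} byte(s)")
```

Plain ASCII text therefore costs one byte per character in UTF-8, exactly as it did before Unicode.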



UTF-8 Encoding

To encode the Chinese character 𤭢, which is represented by the binary number 100100101101100010, the encoder performs the following steps:

1. Since the code point is between U+10000 and U+1FFFFF, it takes 4 bytes to encode.

No. of Bits   No. of Bytes Required   1st Byte   2nd Byte   3rd Byte   4th Byte
21 bits       4                       11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

2. Three leading zeros are added in front of 100100101101100010 to make it 000100100101101100010, so that it fills all 21 x positions.

3. The character now consists of 4 bytes (32 bits), ready to be saved or transmitted to another computer:

11110000 10100100 10101101 10100010

Note: This lengthy binary number can be written concisely in hexadecimal: F0 A4 AD A2
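The encoding steps for this character can be followed literally in Python by filling the x positions of the 4-byte pattern. The names `TEMPLATE_4_BYTES` and `fill_template` are hypothetical, and a real encoder would use bit operations rather than strings; the string version is used here only because it mirrors the slide.

```python
TEMPLATE_4_BYTES = "11110xxx10xxxxxx10xxxxxx10xxxxxx"


def fill_template(code_point, template):
    """Pad the code point with leading zeros and pour its bits into the x slots."""
    slots = template.count("x")                    # 21 slots in the 4-byte pattern
    bits = format(code_point, "b").zfill(slots)    # add the leading zeros
    if len(bits) > slots:
        raise ValueError("code point does not fit this pattern")
    payload = iter(bits)
    filled = "".join(next(payload) if c == "x" else c for c in template)
    return int(filled, 2).to_bytes(len(template) // 8, "big")


encoded = fill_template(0x24B62, TEMPLATE_4_BYTES)
print(format(int.from_bytes(encoded, "big"), "032b"))
print(encoded.hex(" ").upper())

# Python's built-in encoder produces the same four bytes.
assert encoded == "\U00024B62".encode("utf-8")
```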



Decoding

When the recipient receives the 32 bits of data, 11110000 10100100 10101101 10100010, the recipient's decoder removes the marker bits (the 11110 at the start of the first byte and the 10 at the start of each following byte, shown in purple on the original slide). What remains is the 21-bit number 000100100101101100010, that is, the original binary number 100100101101100010.

It is now ready to be opened with a computer program equipped with a font which can render the binary number as a picture (better known as a glyph in typography).

There are other encoding methods, such as UTF-16 and UTF-32, to suit different types of computer architecture.

Why do computers need encoding and decoding?

- So that the receiver can make sense of the seemingly random signal. For example, it knows that a new 4-byte character is being received when it detects a byte beginning with 11110.
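The decoding side can be sketched in the same spirit: look at the leading bits of the first byte to learn how long the character is, then strip the marker bits and join what is left. `decode_one` is a made-up name, and this simplified version skips checks a production decoder must perform (overlong forms, surrogates, values above U+10FFFF).

```python
def decode_one(data):
    """Decode the first UTF-8 character in data; return (code_point, bytes_used)."""
    first = data[0]
    if first >> 7 == 0b0:            # 0xxxxxxx: a single-byte (ASCII) character
        return first, 1
    if first >> 5 == 0b110:          # 110xxxxx: start of a 2-byte character
        length, value = 2, first & 0b00011111
    elif first >> 4 == 0b1110:       # 1110xxxx: start of a 3-byte character
        length, value = 3, first & 0b00001111
    elif first >> 3 == 0b11110:      # 11110xxx: start of a 4-byte character
        length, value = 4, first & 0b00000111
    else:
        raise ValueError("not a valid leading byte")
    if len(data) < length:
        raise ValueError("truncated character")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:        # every following byte must start with 10
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0b00111111)
    return value, length


code_point, used = decode_one(bytes.fromhex("F0A4ADA2"))
print(f"U+{code_point:04X}", used)
```

Because continuation bytes always start with 10, a decoder that joins a stream midway can skip forward to the next leading byte and resynchronise.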




Typing the English language is relatively straightforward in computing. The keyboard generates the binary number 1000001₂ (65₁₀ or 41₁₆) when A is pressed.

To type non-English languages, the computer needs language input software that converts a 7-bit ASCII binary number to a Unicode binary number.

To type the Arabic letter Alif (ا), we essentially press the H key. The keyboard generates the binary number 1001000₂ (72₁₀ or 48₁₆). The language input software then converts 1001000₂ (72₁₀ or 48₁₆) to 11000100111₂ (1575₁₀ or 627₁₆).
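In its simplest form, the job of the language input software is a table lookup. The sketch below assumes a hypothetical one-entry layout table, `ARABIC_LAYOUT`; real input software covers the whole keyboard and is considerably more involved.

```python
# Hypothetical one-entry layout table: code from the keyboard -> Unicode code point.
ARABIC_LAYOUT = {0x48: 0x0627}   # the H key produces Alif


def input_method(key_code, layout):
    """Translate a key code to a code point; unmapped keys pass through unchanged."""
    return layout.get(key_code, key_code)


code_point = input_method(0x48, ARABIC_LAYOUT)
print(format(0x48, "b"), "->", format(code_point, "b"))
print(f"U+{code_point:04X}")
```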




To encode the Arabic letter Alif (ا), which is represented by the 11-bit binary number 11000100111, the encoder performs the following steps:

1. Since the code point is between U+0080 and U+07FF, it takes 2 bytes to encode.

No. of Bits   No. of Bytes Required   1st Byte   2nd Byte
11 bits       2                       110xxxxx   10xxxxxx

2. The 11-bit binary number fills all 11 x positions, so no leading zeros are needed.

3. The character now consists of 2 bytes (16 bits), ready to be saved or transmitted to another computer:

11011000 10100111

Note: This lengthy binary number can be written concisely in hexadecimal: D8 A7
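The two-byte case is short enough to write with bit operations, which is closer to how an encoder actually works. `encode_two_bytes` is a name invented for this sketch and it handles only the U+0080 to U+07FF range.

```python
def encode_two_bytes(code_point):
    """Encode a code point from U+0080 to U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch handles only the two-byte range")
    first = 0b11000000 | (code_point >> 6)            # 110 + the top 5 bits
    second = 0b10000000 | (code_point & 0b00111111)   # 10 + the low 6 bits
    return bytes([first, second])


alif = encode_two_bytes(0x627)
print(format(alif[0], "08b"), format(alif[1], "08b"))
print(alif.hex(" ").upper())

# Python's built-in encoder agrees.
assert alif == "\u0627".encode("utf-8")
```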



Font

A font is a file that maps strings of binary data to designated pictorial glyphs to be shown on a computer screen.

The most common font types are:

1. OpenType Fonts

2. TrueType Fonts

3. PostScript Fonts

Fonts can be developed with software such as Fontlab, Adobe FDK, RoboFont, Glyphs and DTL Font Master.



Fonts are files kept in Universal Type Client or Font Book (on a Mac) and in the Fonts folder in Windows.



Font vs. Glyph

A font file contains a collection of glyphs (pictures), each assigned a number.



A glyph is the design of a character, a symbol or even an object.



Which Font is Better?

Well-developed fonts usually have many glyphs and are therefore able to support many languages.

Less-developed fonts have fewer glyphs and are therefore less versatile in coping with different languages.

[Figure: glyph samples from MHeiHK-S and Arial Unicode MS]
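One way to picture glyph coverage is to treat a font as a table from code points to glyphs and ask which characters of a text have no entry. The two toy fonts and the function `missing_characters` below are invented for illustration; they do not describe the real contents of any font named on the slide.

```python
# Toy fonts: code point -> glyph name. Real fonts hold thousands of entries.
SMALL_FONT = {0x0041: "glyph_A", 0x0048: "glyph_H"}
LARGE_FONT = {
    0x0041: "glyph_A",
    0x0048: "glyph_H",
    0x0627: "glyph_alif",
    0x807D: "glyph_807D",
}


def missing_characters(text, font):
    """Return the characters of text for which the font has no glyph."""
    return [ch for ch in text if ord(ch) not in font]


sample = "A\u0627\u807d"   # a Latin letter, Arabic Alif and a Chinese character
print(missing_characters(sample, SMALL_FONT))
print(missing_characters(sample, LARGE_FONT))
```

A character with no glyph in the chosen font is typically drawn as an empty box or borrowed from a fallback font.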



This is the usual process of encoding a non-ASCII character.

Computer A: Keyboard -> Language Input Software -> UTF-8 Encoder

Data Processing

Computer B: UTF-8 Decoder -> Unicode Font



Input Software-induced Disconcordance

Some language input developers prefer to use U+807C (a rare character) over U+807D (the more common one), e.g. Microsoft Pinyin New Experience Input Style.

Computer A: Keyboard -> Language Input Software -> UTF-8 Encoder

Computer B: UTF-8 Decoder -> Unicode Font

The code point passed along is 807C₁₆ or 807D₁₆, depending on the input software.
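The practical consequence is that two characters which look nearly identical are different data. A small Python sketch, using only the two code points named on this slide:

```python
rare = "\u807c"     # U+807C, the rare variant
common = "\u807d"   # U+807D, the common variant

# The UTF-8 bytes differ only in the last byte, but they do differ.
print(rare.encode("utf-8").hex(" ").upper())
print(common.encode("utf-8").hex(" ").upper())

# A document typed with the rare variant is not found by a search
# for the common one, even though the two look almost the same.
document = "text typed with " + rare
print(common in document)
```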



Font-induced Disconcordance

This is the usual computing process to type the Arabic character Alif.

Computer A: Keyboard (48₁₆ ASCII) -> Language Input Software (627₁₆ Unicode) -> UTF-8 Encoder (D8A7₁₆)

Computer B: (D8A7₁₆) -> UTF-8 Decoder (627₁₆ Unicode) -> Unicode Font



However, some font developers skip the language input software and UTF-8 encoding by creating non-Unicode fonts, e.g. Kruti Dev 010, MHeiHK-S.

Keyboard (48₁₆ ASCII) -> Non-Unicode Font



Non-Unicode Fonts

Some font developers create fonts that assign characters to code points that have already been taken by other characters.



These fonts are called non-Unicode fonts.

Text typed with these fonts cannot be displayed correctly with other fonts.
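The problem can be sketched with two toy font tables. Both tables and the `render` function are hypothetical and only model the idea: with a non-Unicode font, the stored data is still the ASCII code of the key that was pressed, and only that one font draws it as something else.

```python
# Toy fonts: code point -> description of the glyph that gets drawn.
UNICODE_FONT = {0x48: "Latin H", 0x627: "Arabic Alif"}
NON_UNICODE_FONT = {0x48: "Arabic Alif"}   # reuses the code point that belongs to H


def render(text, font):
    """Return the glyph each character would be drawn with."""
    return [font.get(ord(ch), "missing glyph") for ch in text]


stored = "H"   # what is actually saved when "Alif" is typed with the non-Unicode font
print(render(stored, NON_UNICODE_FONT))   # looks right, but only in this font
print(render(stored, UNICODE_FONT))       # any other font shows the letter H
```

Search, sorting and copy-and-paste all operate on the stored code points, so they treat this text as Latin letters regardless of how it looks.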
