CHARACTER ENCODING: How do computers deal with multiple languages?

  • 8/14/2019 CHARACTER ENCODING: How do computers deal with multiple language?

    Part 2

    CHARACTER ENCODING: How do computers deal with multiple languages?

    by Tze Wei [email protected]

    Content

    Basic Computing Knowledge

    Binary, Decimal and Hexadecimal Numbers

    Unicode Character Set

    Character Encoding

    Language Input Software

    Fonts

    Glyphs

    Data Communication

    In order for computers to understand each other, they have to speak and understand the same language.

    In computing terms, they must share the same encoding (speaking) and decoding (understanding) protocol.

    Data Communication

    Every time we press a key on a keyboard, it generates a sequence of high and low voltages which resemble binary numbers.

    These sequences of data are saved in memory or transmitted to another computer via a network.

    In order for the recipient to understand (decode) what the sender was speaking (encode), both of them must have the same understanding (encoding) of what that string of binary numbers means.

    Numeral Systems

    Computer data is represented in binary numbers (base-2 numeral system), as opposed to the decimal numbers (base-10 numeral system) we use in daily life.

    Decimal   Binary   Hexadecimal
    0         0        0
    1         1        1
    2         10       2
    3         11       3
    4         100      4
    5         101      5
    6         110      6
    7         111      7
    8         1000     8
    9         1001     9
    10        1010     A
    11        1011     B
    12        1100     C
    13        1101     D
    14        1110     E
    15        1111     F
    16        10000    10
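The rows of the table above can be reproduced with Python's built-in number formatting. This is only a small sketch; the helper names `to_binary` and `to_hex` are made up for illustration.

```python
def to_binary(n: int) -> str:
    """Return n written in base 2, without any prefix."""
    return format(n, "b")


def to_hex(n: int) -> str:
    """Return n written in base 16 with uppercase digits, without any prefix."""
    return format(n, "X")


# Print a few rows in the same column order as the table: decimal, binary, hexadecimal.
for n in (2, 10, 15, 16):
    print(n, to_binary(n), to_hex(n))

# int() with an explicit base converts in the other direction.
sixteen_from_binary = int("10000", 2)
sixteen_from_hex = int("10", 16)
```

Both `sixteen_from_binary` and `sixteen_from_hex` hold the decimal value 16, matching the last row of the table.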

    Common Character Sets

    ASCII (American Standard Code for Information Interchange)

    - originally based on the English alphabet; encodes 128 characters

    - digits 0-9, letters a-z and A-Z, some basic punctuation symbols, some control codes

    - all stored in 7 binary digits (bits)

    Key         Binary Representation   Decimal Number
    A           1000001                 65
    B           1000010                 66
    C           1000011                 67
    !           0100001                 33
    ?           0111111                 63
    $           0100100                 36
    Backspace   0001000                 8
    Escape      0011011                 27
    Delete      1111111                 127
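As a quick check of the table above, Python's `ord()` returns the numeric code of a character, which for these characters is its ASCII code. The helper `ascii_row` and the `CONTROL_KEYS` dictionary are illustrative names, not part of the slides.

```python
def ascii_row(ch: str) -> tuple[str, int]:
    """Return (7-bit binary string, decimal code) for one ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not an ASCII character")
    return format(code, "07b"), code


# Printable keys from the table.
for key in ("A", "B", "C", "!", "?", "$"):
    print(key, *ascii_row(key))

# Control codes have no printable form, so they are written as escapes here.
CONTROL_KEYS = {"Backspace": "\b", "Escape": "\x1b", "Delete": "\x7f"}
for name, ch in CONTROL_KEYS.items():
    print(name, *ascii_row(ch))
```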

    Common Character Sets

    Most early computers stored data in 8-bit bytes.

    With an 8-bit byte, not only is it possible to store every possible ASCII character, but there is also one whole bit to spare.

    Byte = the smallest addressable unit of memory in many computer architectures

    Because bytes have room for eight bits, many people had their own ideas of what should go in the space from 10000000₂ (or 128₁₀) to 11111111₂ (or 255₁₀).

    For example, on some American PCs the character code 10000010₂ (or 130₁₀) would display as é, but on computers in Israel it was the Hebrew letter Gimel (ג), so when Americans sent their résumés to Israel they arrived as rגsumגs.
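The résumé mix-up can be reproduced with two legacy code pages that ship with Python. The slides do not name the code pages involved; this sketch assumes code page 437 for the American PC and code page 862 for the Hebrew one, both of which assign a different character to byte 130. The function name `reinterpret` is made up for illustration.

```python
def reinterpret(text: str, sender_encoding: str, receiver_encoding: str) -> str:
    """Encode text the sender's way, then decode the same bytes the receiver's way."""
    raw = text.encode(sender_encoding)
    return raw.decode(receiver_encoding)


# Byte 130 (0x82) means a different character in each code page.
byte_130 = bytes([130])
as_american = byte_130.decode("cp437")   # e with acute accent
as_hebrew = byte_130.decode("cp862")     # Hebrew letter Gimel

# The same bytes, read with the wrong code page, turn into a different word.
received = reinterpret("résumés", "cp437", "cp862")

# ascii() keeps the output readable on terminals that cannot print Hebrew.
print(ascii(as_american), ascii(as_hebrew), ascii(received))
```

Only the byte values above 127 change meaning; the plain ASCII letters r, s, u and m survive because both code pages agree on the first 128 codes.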

    Common Character Sets

    Unicode

    - A group of ambitious people came up with the idea of creating a single character set that included every reasonable writing system in the world, covering 110,181 characters from the world's alphabets, ideograph sets, and symbol collections.

    - Characters from scripts such as Amharic and Tamil, and even old characters which are not commonly used anymore, such as Baybayin (the old Filipino writing system) and Chữ Nôm (the old Vietnamese characters), are assigned binary codes (known as code points) to prevent confusion between computers.

    Unicode

    The code assigned to a specific character in the Unicode Standard is called a code point.

    The binary number for a character can be very long.

    The Chinese character 𤭢 is represented by the binary number 100100101101100010 (or 150370₁₀).

    Note: To make the code point more concise, it is written in the format U+ followed by the hexadecimal number. Thus, this character is U+24B62.
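The three notations for this code point (decimal, binary, and the U+ form) can be checked with a few lines of Python. The helper name `code_point_label` is illustrative; the padding to at least four hexadecimal digits follows the usual way code points are written, as in U+0041.

```python
def code_point_label(ch: str) -> str:
    """Return the U+ notation for a single character, padded to at least 4 hex digits."""
    return f"U+{ord(ch):04X}"


ch = "\U00024B62"          # the Chinese character used in the slides
number = ord(ch)           # the code point as an ordinary integer
binary = format(number, "b")

print(number, binary, code_point_label(ch))
```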

    UTF-8 Encoding

    The string of numbers has to be encoded and segmented into several 8-bit bytes in order to be stored in computer memory, transmitted across communication networks, and deciphered correctly by other computers.

    UTF-8 is an encoding method widely used on the internet and increasingly used as the default character encoding in operating systems, programming languages, and software applications.

    First        Last          No. of    No. of Bytes
    Code Point   Code Point    Bits      Required       1st Byte   2nd Byte   3rd Byte   4th Byte   5th Byte   6th Byte
    U+0000       U+007F        7 bits    1              0xxxxxxx
    U+0080       U+07FF        11 bits   2              110xxxxx   10xxxxxx
    U+0800       U+FFFF        16 bits   3              1110xxxx   10xxxxxx   10xxxxxx
    U+10000      U+1FFFFF      21 bits   4              11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
    U+200000     U+3FFFFFF     26 bits   5              111110xx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx
    U+4000000    U+7FFFFFFF    31 bits   6              1111110x   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx

    Note: The 5-byte and 6-byte forms belong to the original UTF-8 design. Since RFC 3629 (2003), UTF-8 is limited to code points up to U+10FFFF, so a character never needs more than 4 bytes.
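The range column of the table translates directly into a lookup of how many bytes a code point needs. The sketch below assumes the modern upper limit of U+10FFFF rather than the 5-byte and 6-byte rows, because that is what Python's built-in encoder accepts; `utf8_length` is a made-up name.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of UTF-8 bytes needed for a code point (up to U+10FFFF)."""
    if code_point < 0:
        raise ValueError("code points are non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("beyond the range of modern UTF-8")


# Compare the table lookup with what Python's own encoder produces.
for cp in (0x41, 0x627, 0x20AC, 0x24B62):
    print(f"U+{cp:04X}", utf8_length(cp), len(chr(cp).encode("utf-8")))
```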

    UTF-8 Encoding

    To encode the Chinese character 𤭢, which is represented by the binary number 100100101101100010, the encoder performs the following steps:

    1. Since the code point lies between U+10000 and U+1FFFFF, it takes 4 bytes to encode.

    2. Three leading zeros are added in front of 100100101101100010 to make it 000100100101101100010 (21 bits), so that it fills every x placeholder.

    3. The character is now made up of 4 bytes (32 bits), ready to be saved and transmitted to another computer:

    11110000 10100100 10101101 10100010

    Note: This lengthy binary number can be written concisely in hexadecimal: F0 A4 AD A2

    No. of Bits   No. of Bytes Required   1st Byte   2nd Byte   3rd Byte   4th Byte
    21 bits       4                       11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
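The three steps above can be followed literally with string operations on the bits. This is a teaching sketch rather than how production encoders are written; `encode_four_bytes` is a made-up name, and the upper bound of U+10FFFF is an assumption taken from modern UTF-8 so that the result can be compared with Python's built-in encoder.

```python
def encode_four_bytes(code_point: int) -> bytes:
    """Encode a code point from the 4-byte range by filling the x placeholders."""
    # Step 1: confirm the code point belongs to the 4-byte row.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the 4-byte range")
    # Step 2: pad with leading zeros to exactly 21 bits.
    bits = format(code_point, "021b")
    # Step 3: split into 3 + 6 + 6 + 6 bits and attach the fixed prefixes.
    chunks = [bits[0:3], bits[3:9], bits[9:15], bits[15:21]]
    prefixes = ["11110", "10", "10", "10"]
    return bytes(int(prefix + chunk, 2) for prefix, chunk in zip(prefixes, chunks))


encoded = encode_four_bytes(0x24B62)
print(encoded.hex().upper())
```

For U+24B62 the result is the same four bytes the slide lists, F0 A4 AD A2, and it matches `"\U00024B62".encode("utf-8")`.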

    Decoding

    When the recipient receives the 32-bit string 11110000 10100100 10101101 10100010, the recipient's decoder removes the marker bits (the leading 11110 of the first byte and the leading 10 of each following byte, shown in purple on the original slide) and recovers the 21-bit binary number 000100100101101100010, which is the original 100100101101100010 with its three leading zeros.

    It is now ready to be opened by a computer program equipped with a font that can render the 21-bit binary number as a picture (better known as a glyph in typography).

    There are other encoding methods, such as UTF-16 and UTF-32, to suit different types of computer architectures.

    Why do computers need encoding and decoding?

    - So that the receiver can make sense of the seemingly random signal. For example, it knows that a new 4-byte character is being received when it detects a byte starting with 11110.
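The decoding described above is the encoder run backwards: check the marker bits, drop them, and join what is left. The sketch below handles only the 4-byte case from the example; `decode_four_bytes` is a made-up name.

```python
def decode_four_bytes(data: bytes) -> int:
    """Strip the UTF-8 marker bits from a 4-byte sequence and return the code point."""
    if len(data) != 4:
        raise ValueError("expected exactly 4 bytes")
    bits = [format(byte, "08b") for byte in data]
    # The first byte must start with 11110, every following byte with 10.
    if not bits[0].startswith("11110"):
        raise ValueError("first byte does not start a 4-byte character")
    if any(not b.startswith("10") for b in bits[1:]):
        raise ValueError("continuation byte does not start with 10")
    # Keep the 3 + 6 + 6 + 6 payload bits and read them as one 21-bit number.
    payload = bits[0][5:] + "".join(b[2:] for b in bits[1:])
    return int(payload, 2)


code_point = decode_four_bytes(bytes.fromhex("F0A4ADA2"))
print(f"U+{code_point:X}")
```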

    Language Input Software

    Typing the English language is relatively straightforward in computing. The keyboard generates the binary number 1000001₂ (or 65₁₀ or 41₁₆) when A is pressed.

    To type non-English languages, the computer needs language input software that converts a 7-bit ASCII binary number into a Unicode binary number.

    To type the Arabic Alif (ا), we press the H key. The keyboard generates the binary number 1001000₂ (or 72₁₀ or 48₁₆). The language input software then converts 1001000₂ (or 72₁₀ or 48₁₆) to 11000100111₂ (or 1575₁₀ or 627₁₆).
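At its simplest, the conversion performed by language input software can be pictured as a lookup table from key codes to Unicode code points. The sketch below is a deliberately tiny model with a single entry taken from the slide; the names `ARABIC_LAYOUT` and `translate_key` are hypothetical, and real input software handles far more than one key.

```python
# Hypothetical layout table: key code from the slide -> Unicode code point.
ARABIC_LAYOUT = {
    0x48: 0x0627,  # the H key produces Arabic Alif
}


def translate_key(key_code: int) -> str:
    """Return the character for a key code, passing unmapped codes through unchanged."""
    return chr(ARABIC_LAYOUT.get(key_code, key_code))


alif = translate_key(0x48)
print(ord(alif), format(ord(alif), "b"), format(ord(alif), "X"))
```

The printed values are the decimal, binary and hexadecimal forms of the code point that replaces the key code 48₁₆.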

    Language Input Software

    To encode the Arabic Alif (ا), which is represented by the 11-bit binary number 11000100111, the encoder performs the following steps:

    1. Since the code point lies between U+0080 and U+07FF, it takes 2 bytes to encode.

    2. The 11-bit binary number fills every x placeholder.

    3. The character is now made up of 2 bytes (16 bits), ready to be saved and transmitted to another computer:

    11011000 10100111

    Note: This lengthy binary number can be written concisely in hexadecimal: D8 A7

    No. of Bits   No. of Bytes Required   1st Byte   2nd Byte
    11 bits       2                       110xxxxx   10xxxxxx
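The same two-byte layout can be filled with bit shifts and masks, which is closer to how encoders are usually written than string slicing. This is a sketch for the 2-byte row only; `encode_two_bytes` is a made-up name.

```python
def encode_two_bytes(code_point: int) -> bytes:
    """Encode a code point from U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the 2-byte range")
    # Upper 5 bits go after the 110 prefix, lower 6 bits after the 10 prefix.
    first = 0b11000000 | (code_point >> 6)
    second = 0b10000000 | (code_point & 0b00111111)
    return bytes([first, second])


alif_bytes = encode_two_bytes(0x0627)
print(alif_bytes.hex().upper())
```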

    Font

    A font is a file that maps strings of binary data to designated pictorial glyphs to be shown on a computer screen.

    The most common font types are:

    1. OpenType Fonts

    2. TrueType Fonts

    3. PostScript Fonts

    Fonts can be developed with software such as FontLab, Adobe FDK, RoboFont, Glyphs, and DTL FontMaster.

    Font

    Fonts are files kept in Universal Type Client or Font Book on a Mac, and in the Fonts folder in Windows.

    Font vs. Glyph

    A font file contains a collection of glyphs (pictures), each assigned a number.

    Font vs. Glyph

    A glyph is the design of a character, a symbol or even an object.

    Which Font is Better?

    Well-developed fonts usually have many glyphs and are therefore able to support many languages.

    Less-developed fonts have fewer glyphs and are therefore less versatile in coping with different languages.

    Fonts compared on the slide: MHeiHK-S, Arial Unicode MS

    This is the usual process of encoding a non-ASCII character.

    Keyboard -> Computer A [Language Input Software -> UTF-8 Encoder] -> Data Processing -> Computer B [UTF-8 Decoder -> Unicode Font]

    Input Software-induced Disconcordance

    Some language input software, e.g. Microsoft Pinyin New Experience Input Style, outputs U+807C (a rare character) in place of U+807D (the more common one).

    Keyboard -> Computer A [Language Input Software -> UTF-8 Encoder] -> Computer B [UTF-8 Decoder -> Unicode Font]

    Code point produced by the language input software: 807C₁₆ or 807D₁₆
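To a computer, the two code points are simply different numbers, however similar the characters look to a reader. The short sketch below shows that they encode to different bytes and that a search for one does not find the other; the variable names and the sample string are illustrative only.

```python
rare = "\u807C"     # the rare character
common = "\u807D"   # the more common character

# The two code points differ by one, and so does the last UTF-8 byte.
print(rare.encode("utf-8").hex().upper())
print(common.encode("utf-8").hex().upper())

# A document typed with the rare form is not matched by a search for the common form.
document = "abc" + rare + "def"
found = common in document
print(found)
```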

    Font-induced Disconcordance

    This is the usual computing process to type the Arabic character Alif.

    Computer A: Keyboard (48₁₆, ASCII) -> Language Input Software (627₁₆, Unicode) -> UTF-8 Encoder (D8A7₁₆)

    Computer B: UTF-8 Decoder (D8A7₁₆ in, 627₁₆ Unicode out) -> Unicode Font

    However, some font developers skip the language input software and UTF-8 encoding by creating non-Unicode fonts, e.g. Kruti Dev 010, MHeiHK-S.

    Keyboard (48₁₆, ASCII) -> Non-Unicode Font

    Non-Unicode Fonts

    Some font developers create fonts that assign characters to code points that have already been taken by other characters.

    Non-Unicode Fonts

    These fonts are called non-Unicode fonts.

    Text typed with these fonts cannot be read correctly when displayed in other fonts.
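The problem can be modelled with two toy glyph tables. Everything in this sketch is hypothetical: the table contents, the glyph labels and the function name `render` are invented for illustration and do not describe any real font file.

```python
# Hypothetical glyph tables keyed by the stored code.
UNICODE_FONT = {
    0x48: "glyph for Latin H",
    0x0627: "glyph for Arabic Alif",
}
NON_UNICODE_FONT = {
    0x48: "glyph for Arabic Alif",  # reuses the code already taken by H
}


def render(codes: list[int], font: dict[int, str]) -> list[str]:
    """Look up each stored code in a font's glyph table."""
    return [font.get(code, "missing glyph") for code in codes]


# Text typed with the non-Unicode font is stored as the plain key code.
stored = [0x48]
print(render(stored, NON_UNICODE_FONT))
print(render(stored, UNICODE_FONT))
```

The stored data never changes; only the picture chosen for it does, which is why the text looks right in the non-Unicode font and wrong everywhere else.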