Upload
jeffery-hudson
View
217
Download
5
Embed Size (px)
Citation preview
Chapter 3Data RepresentationText Characters
2
Representing Text
To represent a text document in digital form, we need to be able to represent every possible character that may appear.
There is a finite number of characters to represent, so the general approach is to list them all and assign each a binary string.
3
Representing Text
A character set is a list of characters and the codes used to represent each one.
In 1960, a survey revealed 60 different characters sets in use. At IBM alone there were 9 different sets.
By agreeing to use ONE particular character set, computer manufacturers have made the processing of text data easier.
4
The ASCII Character Set
ASCII stands for American Standard Code for Information Interchange.
The ASCII character set originally used seven bits to represent each character, allowing for 128 unique characters.
Wikipedia has an excellent entry on ASCII.
5
The ASCII Character Set (7 bit)
6
The ASCII Character Set
Notice the organisation of the ASCII table. The table divides in half according to the MSB.
Letters are all in the second half so all codes for alphabetic characters start with 1. This second half of the table divides in half again according to
the next bit: UPPERCASE letters start 10. lowercase letters start 11.
The first half of the table also divides in half according to the next bit: Control characters start 00. Numerals and punctuation start 01.
7
The ASCII Character Set
Note that control characters (the first 32 in the ASCII character set) do not have simple character representations that you could print to the screen.
Many, however, perform actions with which you are familiar.
Some have there own keys, others need to be constructed.
8
The ASCII Character SetControl CharactersControl sequences are created by holding the Ctrl key
(control) and pressing a letter.
This has the effect of subtracting 64 from the ASCII value of the letter pressed.
For example: ‘M’ has ASCII value 77
(1001101 in binary), Ctrl-M has ASCII value 13
(0001101 in binary).
Alternately, we can see this as “masking bit 6.”
9
The ASCII Character SetCommon Control Characters
Hex Binary Decimal Name Function
00 0000000 0 NUL Null
07 0000111 7 BEL Bell
08 0001000 8 BS Backspace
09 0001001 9 HT Horizontal Tab
0A 0001010 10 LF Line Feed
0B 0001011 11 VT Vertical Tab
0C 0001100 12 FF Form Feed
0D 0001101 13 CR Carriage Return
0E 0001110 14 SO Shift Out
0F 0001111 15 SI Shift In
1B 0011011 29 ESC Escape
10
The ASCII Character Set
Coding letters in ASCII is easy.
Let’s look at ‘j’ as an example:
Since ‘j’ is a letter, its code starts with a 1.
Since it’s lowercase, the next bit is also a 1.
Since it’s the tenth letter of the alphabet the rest of the code is 01010.
The complete ASCII code for ‘j’ is 1101010.
11
The ASCII Character Set
ASCII evolved so that eight bits are used. The 7-bit codes were simply prefixed with
another bit, giving another natural doubling. The original 7-bit codes were padded with 0.
So the code for ‘j’ became 01101010. 128 new characters were added.
The codes for this alternate character set start with 1.
12
The Unicode Character Set
Even the extended version of the ASCII character set is not enough for international use.
The Unicode character set uses 16 bits per character. The Unicode character set can represent 216, or over 65 thousand characters.
Unicode was designed to be a superset of ASCII. That is, the first 256 characters in the Unicode character set correspond exactly to the extended ASCII character set.
13
Examples of Unicode Characters
Figure 3.6 A few characters in the Unicode character set