Electronic Text
• ASCII — American Standard Code for Information Interchange
• EBCDIC (IBM mainframes, not standard)
• Extended ASCII (8-bit, not standard)
– DOS Extended ASCII
– Windows Extended ASCII
– Macintosh Extended ASCII
• Unicode (standard in progress; originally 16-bit, now a 21-bit code space)
The ASCII Code

     0x   1x   2x   3x   4x   5x   6x   7x
x0   NUL  DLE  SP   0    @    P    `    p
x1   SOH  DC1  !    1    A    Q    a    q
x2   STX  DC2  "    2    B    R    b    r
x3   ETX  DC3  #    3    C    S    c    s
x4   EOT  DC4  $    4    D    T    d    t
x5   ENQ  NAK  %    5    E    U    e    u
x6   ACK  SYN  &    6    F    V    f    v
x7   BEL  ETB  '    7    G    W    g    w
x8   BS   CAN  (    8    H    X    h    x
x9   HT   EM   )    9    I    Y    i    y
xA   LF   SUB  *    :    J    Z    j    z
xB   VT   ESC  +    ;    K    [    k    {
xC   FF   FS   ,    <    L    \    l    |
xD   CR   GS   -    =    M    ]    m    }
xE   SO   RS   .    >    N    ^    n    ~
xF   SI   US   /    ?    O    _    o    DEL

(Columns give the high hex digit, rows the low one; SP = blank/space. E.g. 'A' = 0x41 = 65.)
An Example Text

T   h   i   s       i   s       a   n       e   x   a   m   p   l   e
84  104 105 115 32  105 115 32  97  110 32  101 120 97  109 112 108 101
Note that each ASCII character corresponds to a number, including spaces, carriage returns, etc. Everything must be represented somehow, otherwise the computer couldn’t do anything with it.
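The character-to-number mapping can be inspected directly in most languages; a minimal Python sketch using the built-in ord and chr:

```python
# Convert a string to its ASCII codes and back.
text = "This is an example"

codes = [ord(c) for c in text]             # character -> number
print(codes)                               # [84, 104, 105, 115, 32, ...]

restored = "".join(chr(n) for n in codes)  # number -> character
assert restored == text

# The space character is a code like any other:
print(ord(" "))                            # 32
```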
Representation in Memory

[figure: the string "an example" stored in consecutive memory cells as the byte values 97 110 32 101 120 97 109 112 108 101, together with their binary representations]
Features of ASCII
• 7-bit fixed-length code – all codes have the same number of bits
• Sorting: A precedes B, B precedes C, etc.
• Caps + 32 = lower case ('A' + space = 'a', since space is code 32)
• Word divisions, etc. must be parsed
ASCII is very widespread and almost universally supported.
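The caps-plus-32 property and code-order sorting can be checked directly (a short Python sketch; str.lower exists, but the arithmetic below uses only the code values):

```python
# Uppercase and lowercase letters differ by exactly 32 in ASCII,
# and 32 is also the code of the space character.
for upper in "ABCXYZ":
    lower = chr(ord(upper) + ord(" "))  # 'A' + space = 'a'
    assert lower == upper.lower()

# Sorting follows code order: 'A' (65) precedes 'B' (66), etc.
assert sorted("CAB") == ["A", "B", "C"]
```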
Variable-Length Codes
• Some symbols (e.g. letters) have shorter codes than others
– E.g. Morse code: e = dot, j = dot-dash-dash-dash
– Use frequency of symbols to assign code lengths
• Why? Space efficiency – compression tools such as gzip and zip use variable-length codes (based on words)
Requirements
Starting and ending points of symbols must be clear.
A (simplistic) example: four symbols must be encoded:

0   10   110   1110

– All symbols end with a zero
– Any zero ends a symbol
– Any one continues a symbol
– Average number of bits per symbol = 2.5 when all four symbols are equally likely
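The decoding rule ("any zero ends a symbol") takes only a few lines of Python; the symbol names A–D below are made up for illustration:

```python
# Codes: A=0, B=10, C=110, D=1110 — every codeword ends at its first 0.
CODES = {"0": "A", "10": "B", "110": "C", "1110": "D"}

def decode(bits):
    """Split a bit string at each 0 and look up the codewords."""
    symbols, word = [], ""
    for bit in bits:
        word += bit
        if bit == "0":              # a zero always ends a symbol
            symbols.append(CODES[word])
            word = ""
    return "".join(symbols)

print(decode("0101100"))  # A, B, C, A -> "ABCA"
```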
Example
• 12 symbols
– digits 0–9
– decimal point and space (end of number)

[figure: code tree assigning the following prefix-free codes]

0 → 00      1 → 010     2 → 0110    3 → 01110
4 → 011110  5 → 011111  6 → 10      7 → 110
8 → 1110    9 → 11110   _ → 111110  . → 111111
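The table above lends itself to a direct encoder (a Python sketch using exactly these codes):

```python
# Prefix-free codes for the 12 symbols: digits, space (_), decimal point.
CODES = {"0": "00", "1": "010", "2": "0110", "3": "01110",
         "4": "011110", "5": "011111", "6": "10", "7": "110",
         "8": "1110", "9": "11110", "_": "111110", ".": "111111"}

def encode(number):
    """Encode a number string symbol by symbol."""
    return "".join(CODES[ch] for ch in number)

# Symbols with short codes cost few bits; rare symbols cost more.
print(encode("67"))    # "10" + "110" = "10110"
print(encode("3.6"))   # "01110" + "111111" + "10"
```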
Efficient Coding
Huffman coding (used in gzip):
1. Count the number of times each symbol occurs.
2. Start with the two least frequent symbols:
a) combine them using a tree
b) put 0 on one branch, 1 on the other
c) combine their counts and treat them as a single symbol
3. Continue combining in the same way until every symbol is assigned a place in the tree.
4. Read the codes from the top of the tree down to each symbol.
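The four steps above can be sketched with Python's heapq; this is a minimal illustration, and tie-breaking details differ between real tools:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table for the symbols in `text`."""
    counts = Counter(text)                        # step 1: count symbols
    # Heap entries: (count, tiebreak, {symbol: code-so-far}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)           # step 2: the two least
        n2, i, c2 = heapq.heappop(heap)           #         frequent entries
        # Steps 2a-2c: prepend 0 on one branch, 1 on the other, merge counts.
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, i, merged))  # step 3: repeat
    return heap[0][2]                             # step 4: read off the codes

codes = huffman_codes("abracadabra")
# 'a' occurs most often (5 of 11 symbols), so it gets the shortest code:
assert len(codes["a"]) == 1
```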
Information Theory
• Mathematical theory of communication
– How many bits in an efficient variable-length encoding?
– How much information is in a chunk of data?
– How can the capacity of an information medium be measured?
• Probabilistic model of information
– "Noisy channel" model
– Less frequent ≈ more surprising ≈ more informative
• Measures information using the notion of entropy
Noisy Channel
[figure: a bit (0 or 1) leaves the Source and arrives at the Destination; arrows show both correct reception (0→0, 1→1) and error paths (0→1, 1→0)]

We measure the probability of each possible path (correct reception and errors).
Entropy
• The entropy of a symbol is calculated from its probability of occurrence.
Number of bits required: h_s = – log2 p_s
• Average entropy: H(p) = – sum( p_i log2 p_i )
• Related to variance
• Measured in bits (log2)
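Both formulas can be checked numerically (a Python sketch using math.log2):

```python
import math

def surprisal(p):
    """Bits needed for a symbol with probability p: h = -log2(p)."""
    return -math.log2(p)

def entropy(probs):
    """Average entropy H(p) = -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A rare symbol carries more information than a common one:
print(surprisal(0.5))       # 1.0 bit
print(surprisal(0.125))     # 3.0 bits

# Four equally likely symbols need 2 bits on average:
print(entropy([0.25] * 4))  # 2.0
```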
Base 2 Logarithms
2^(log2 x) = x; e.g. log2 2 = 1, log2 4 = 2, log2 8 = 3, etc.

Often we round log2 up to the nearest whole number, which gives the minimum number of bits (e.g. 12 symbols need ceil(log2 12) = 4 bits in a fixed-length code).
Unicode
• Administered by the Unicode Consortium
• Assigns a unique code to every written symbol (21 bits: 2,097,152 codes)
– UTF-32: four-byte fixed-length code
– UTF-16: two- or four-byte variable-length code
– UTF-8: one- to four-byte variable-length code
• In UTF-8: ASCII block (one byte) + Basic Multilingual Plane (2–3 bytes) + supplementary planes (4 bytes)
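The variable byte lengths are easy to observe in Python, where str.encode produces UTF-8 by default:

```python
# UTF-8 byte lengths grow with the code point.
samples = {
    "A": 1,   # ASCII block: one byte (identical to the ASCII code)
    "é": 2,   # Latin-1 supplement: two bytes
    "€": 3,   # rest of the Basic Multilingual Plane: up to three bytes
    "𝄞": 4,   # supplementary planes: four bytes
}
for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    assert len(encoded) == expected
    print(ch, hex(ord(ch)), list(encoded))

# UTF-32 (big-endian) is fixed-length: always four bytes per character.
assert len("A".encode("utf-32-be")) == 4
```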