GCSE CS 4 AQA –Huffman – www.gcsecs.org Published by paullong.net
© paullong.net 2016 Page 1 of 16 by Paul Long
Huffman coding and trees
Huffman coding is another method for lossless compression. It reduces the number
of bits needed to store data. It is based on the number of times that each data
item (character for text or pixel for image) is repeated. The data items that occur
most frequently will be stored using fewer bits.
Before looking at how Huffman coding works, let’s look at how to calculate the
number of bits needed to store a simple text file.
Each character in an ASCII text file uses 7 bits, but most text files use extended
ASCII which is 8 bits per character.
Therefore, to calculate the file size of an ASCII text file, count the number of
characters, including spaces, and that is the number of bytes. Then multiply by 8
to give the number of bits.
ASCII file size (bytes) = number of characters
ASCII file size (bits) = number of characters x 8
Example – ASCII file size
How many bits and bytes are in this sentence?
8 x spaces, 1 x ? and 36 letters = 45 bytes. Multiply by 8 for 360 bits.
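The calculation above can be sketched in Python. This is a minimal illustration that assumes every character is stored in 8 bits (extended ASCII); the function name ascii_size is chosen here for the example and is not part of the original worksheet.

```python
def ascii_size(text):
    """Return (bytes, bits) for text stored at 8 bits per character."""
    size_bytes = len(text)       # one byte per character, spaces included
    size_bits = size_bytes * 8   # 8 bits in every byte
    return size_bytes, size_bits


print(ascii_size("How many bits and bytes are in this sentence?"))  # (45, 360)
```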
Huffman coding uses a method of identifying which characters occur most
frequently. The most frequent character is assigned the fewest bits for
storage.
Using a Huffman Tree
A Huffman Tree is used to identify the bit pattern that should be used for each
character in a file. This is best explained using an example.
Example – Huffman Tree
This is a Huffman Tree for the word “abracadabra”. There are 5 letters in this
word – a, b, c, d and r.
Each circle is called a node.
The frequency of each letter (number of times each letter appears in the word)
can be seen on the nodes above each number:
a:5 r:2 b:2 c:1 d:1
The nodes above are the totals of adding up each pair of nodes beneath them. For
example, for c and d, 1 + 1 = 2 and so 2 appears in the node above. The node at
the top indicates the total number of characters in the phrase.
The Huffman Tree can now be used to identify the bit pattern to be used for each
character. This is done by inserting a 0 (zero) on every left hand branch and a 1
(one) on every right hand branch. The bit pattern for each letter is calculated by
following the branches and writing down the 0s and 1s passed to get to each
character.
Example – calculating the bit pattern
To get to the letter a, only a single 0 (zero) is passed from the top and so the
bit pattern for a is 0 (zero).
To get to the letter r, you have to pass a 1 and then a 0 so the bit pattern for
r is 10.
To get to the letter c, you have to pass a 1, then another 1, then another 1 and
then a 0 so the bit pattern for c is 1110.
Here are all the bit patterns for each character:
a: 0 r: 10 b: 110 c: 1110 d: 1111
Notice how the shortest bit pattern is used for the highest frequency character (a).
The Huffman Coding can now be calculated by replacing each character in the
file with its bit pattern.
Example – Huffman Coding
Each character of abracadabra is represented as follows:
a: 0 r: 10 b: 110 c: 1110 d: 1111
Therefore, the Huffman Coding for the word will be:
a b r a c a d a b r a
0 110 10 0 1110 0 1111 0 110 10 0
This is written out as:
01101001110011110110100
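Replacing each character with its bit pattern can be sketched in Python as follows. The dictionary holds the bit patterns given for abracadabra; the names codes and encode are illustrative choices, not part of the worksheet.

```python
# Bit patterns read from the Huffman Tree for "abracadabra"
codes = {"a": "0", "r": "10", "b": "110", "c": "1110", "d": "1111"}


def encode(text, codes):
    """Swap every character for its bit pattern and join the results."""
    return "".join(codes[character] for character in text)


bits = encode("abracadabra", codes)
print(bits)       # 01101001110011110110100
print(len(bits))  # 23
```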
The Huffman Coding is now used to store the data. This will require fewer bits to
store the data. The number of bits needed can be calculated by adding up the
number of 1s and 0s in the bit pattern. This can then be compared with the
number of bits needed to store the data using ASCII to show how much storage
could be saved using the Huffman method of compression.
Example – Huffman compression savings
Using Huffman Coding, abracadabra is stored as:
01101001110011110110100
Count up the number of 1s and 0s to calculate the number of bits required to
store. Total = 23 bits.
Compare this with the space needed to store in ASCII:
11 characters = 11 bytes
11 x 8 bits = 88 bits.
The saving is calculated as 88 – 23 = 65 bits saved.
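The same comparison can be written as a short Python sketch. It assumes 8 bits per character for ASCII, as used throughout this worksheet; the function name saving is only an illustrative choice.

```python
def saving(number_of_characters, huffman_bits):
    """Bits saved by Huffman coding compared with 8-bit ASCII."""
    ascii_bits = number_of_characters * 8
    return ascii_bits - huffman_bits


print(saving(11, 23))  # 65 bits saved for abracadabra
```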
Example – using a Huffman Tree
This is a Huffman Tree for “poppy pop”.
The coding for p will be 1 (move right once).
The coding for o will be 00 (left, then left).
The coding for space will be 010 (left, right, left).
The coding for y will be 011 (left, right, right).
Therefore, the Huffman Coding for the phrase will be:
p o p p y space p o p
1 00 1 1 011 010 1 00 1
This is written out as:
100110110101001
This uses a total of 15 bits.
Using ASCII, 9 characters of 8 bits each would be needed making a total of 72 bits.
There is a compression saving of 72 – 15 = 57 bits.
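Following the branches to find each bit pattern can also be sketched in Python. The tree for “poppy pop” is written here as nested pairs of (left, right), which is an assumption made for this example rather than a format used in the worksheet; a plain string stands for a character at the end of a branch.

```python
def bit_patterns(node, prefix=""):
    """Walk the tree, adding 0 for each left branch and 1 for each right."""
    if isinstance(node, str):          # reached a character
        return {node: prefix}
    left, right = node
    patterns = bit_patterns(left, prefix + "0")
    patterns.update(bit_patterns(right, prefix + "1"))
    return patterns


# (left, right) pairs: p is one step right; o, space and y hang off the left
tree = (("o", (" ", "y")), "p")
codes = bit_patterns(tree)
print(codes)  # {'o': '00', ' ': '010', 'y': '011', 'p': '1'}

encoded = "".join(codes[character] for character in "poppy pop")
print(encoded)  # 100110110101001
```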
Activity
1) The following Huffman tree has been created for “SHE SELLS SEA SHELLS”
a) Identify the binary code to be used for each character.
b) Write out the Huffman binary encoding for “SHE SELLS SEA SHELLS”.
c) Calculate how many bits would be required to store the sentence using
ASCII.
2) The following Huffman tree has been created for “EDDIE EDITED IT”
Created using http://huffman.ooz.ie
a) Write out the Huffman binary encoding for this sentence.
b) Calculate how many bits are needed to store this data using Huffman.
c) Calculate how many bits are needed to store this data using ASCII.
3) The following Huffman tree has been created for “WHICH WITCH IS WHICH”
a) Write out the Huffman binary encoding for this sentence.
b) Calculate how many bits are needed to store this data using Huffman.
c) Calculate how many bits are needed to store this data using ASCII.
4) The following Huffman tree has been created for “STUPID SUPERSTITION”
a) Identify the binary code to be used for each character.
b) Write out the Huffman binary encoding for this tree.
c) Calculate how many bits are needed to store this data using Huffman.
d) Calculate how many bits are needed to store this data using ASCII.
5) The following Huffman tree has been created for “GOOD BLOOD, BAD BLOOD”
a) Identify the binary code to be used for each character.
b) Write out the Huffman binary encoding for this tree.
c) Calculate how many bits are needed to store this data using Huffman.
d) Calculate how many bits are needed to store this data using ASCII.
Creating a Huffman Tree
Note: you do not need to be able to create a Huffman Tree in an exam, so this is a
bit of extension work. You may find it helpful to understand how the Trees
are created.
Creating a Huffman Tree is best understood with a video explanation.
Video
Watch Text Compression with Huffman Coding by Barry Brown on YouTube.
Work through this animation.
Example – creating a Huffman Tree 1
Peter Piper picked a peck of pickled peppers
To start with, note that the total characters including spaces is 44 which in normal
ASCII encoding would require 8 bits per character making a total of 352 bits.
For the Huffman code, count the frequency of each letter:
Space (Δ) x 7 P x 2 p x 7 e x 8 t x 1 r x 3
i x 3 c x 3 k x 3 d x 2 a x 1 o x 1
f x 1 l x 1 s x 1
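A frequency count like this is easy to get wrong by hand, so it can be cross-checked with a short Python sketch. It uses Counter from Python's standard library and treats capital P and lower case p as different characters, as the count above does; the variable names are illustrative.

```python
from collections import Counter

phrase = "Peter Piper picked a peck of pickled peppers"
frequencies = Counter(phrase)  # counts every character, including spaces

for character, times in sorted(frequencies.items()):
    label = "Δ" if character == " " else character  # Δ marks a space
    print(label, "x", times)

print(sum(frequencies.values()))  # 44
```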
Add the total frequencies to check they add up to 44 because mistakes are very
easy to make. Now put these into ascending order along the bottom of a page:
a1 f1 l1 o1 s1 t1 P2 d2 c3 i3 k3 r3 p7 Δ7 e8
These are all now known as nodes in the Huffman Tree that is to be created. Start
by looking for the two nodes with the lowest frequencies. There are 6 to choose
from (a, f, l, o, s, t), so start with the left hand pair. Combine these to make a new
node with a number that represents the total frequency of characters within those
nodes (1 + 1 = 2):
af2
a1 f1 l1 o1 s1 t1 P2 d2 c3 i3 k3 r3 p7 Δ7 e8
Repeat this for the other four nodes with a frequency of 1:
af2 lo2 st2
a1 f1 l1 o1 s1 t1 P2 d2 c3 i3 k3 r3 p7 Δ7 e8
Repeat this again for the two nodes with the lowest frequency. There are 5 to
choose from (P, d, af, lo, st). Notice how we have to take account of the new
nodes that have been created. Therefore, join together from the left hand side
fitting in P with af, then d with lo (2 + 2 = 4):
Paf4 dlo4
P2 af2 d2 lo2 st2
a1 f1 l1 o1 s1 t1 c3 i3 k3 r3 p7 Δ7 e8
Next move on to the next pair of nodes with the lowest frequency. This will be st
and one of c, i, k and r. Choose the left most of these to pair with st, which will be
c (2 + 3 = 5), and similarly pair i with k (3 + 3 = 6).
Paf4 dlo4 stc5
P2 af2 d2 lo2 st2 c3 ik6
a1 f1 l1 o1 s1 t1 i3 k3 r3 p7 Δ7 e8
The next lowest frequency pair will be r (3) and Paf (4) making a total of 7. The
other two lowest frequencies are dlo (4) and stc (5) making a total of 9.
Remember the lowest frequency moves to the left hand side of each pair of
branches.
rPaf7 dlostc9
r3 Paf4 dlo4 stc5
P2 af2 d2 lo2 st2 c3 ik6
a1 f1 l1 o1 s1 t1 i3 k3 p7 Δ7 e8
The next 2 lowest frequency nodes are ik (6) and p (7) making a total of 13. The
other two lowest frequencies are rPaf (7) and space/Δ (7) making a total of 14.
The top of each set of branches must be kept in ascending order of frequency so
rPafΔ moves to the right hand side because it has the largest frequency.
rPafΔ14
dlostc9 rPaf7 Δ7
dlo4 stc5 ikp13 r3 Paf4
d2 lo2 st2 c3 ik6 p7 P2 af2
l1 o1 s1 t1 i3 k3 a1 f1 e8
The next 2 lowest frequency nodes are e (8) and dlostc (9) making a total of 17.
e8 is moved to the left hand side of dlostc9 because it has a smaller frequency
and edlostc is moved to the right hand side because 17 is now the largest
frequency.
rPafΔ14 edlostc17
rPaf7 Δ7 e8 dlostc9
ikp13 r3 Paf4 dlo4 stc5
ik6 p7 P2 af2 d2 lo2 st2 c3
i3 k3 a1 f1 l1 o1 s1 t1
The next 2 lowest frequency nodes are ikp (13) and rPafΔ (14) making a total of 27.
ikprPafΔ goes on the right hand side because it has the higher frequency.
ikprPafΔ27
edlostc17 ikp13 rPafΔ14
e8 dlostc9 ik6 p7 rPaf7 Δ7
dlo4 stc5 i3 k3 r3 Paf4
d2 lo2 st2 c3 P2 af2
l1 o1 s1 t1 a1 f1
Finally, the last 2 nodes are edlostc (17) and ikprPafΔ (27) with a total of 44,
which matches the number of characters, so the tree is consistent so far.
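The repeated step of joining the two lowest frequency nodes can be sketched in Python. This is a minimal illustration using the standard library's heapq and Counter, not the method of the tools mentioned in this worksheet. When frequencies tie it may pick a different pair from the one chosen above, so the tree can have a different shape, but the top node still totals the number of characters. The name build_tree and the (left, right) pair layout are assumptions made for the example.

```python
import heapq
from collections import Counter
from itertools import count


def build_tree(text):
    """Keep joining the two lowest-frequency nodes until one node is left."""
    order = count()  # unique number so equal frequencies never compare nodes
    heap = [(frequency, next(order), character)
            for character, frequency in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        low1, _, left = heapq.heappop(heap)    # lowest frequency node
        low2, _, right = heapq.heappop(heap)   # next lowest frequency node
        heapq.heappush(heap, (low1 + low2, next(order), (left, right)))
    total, _, tree = heap[0]
    return total, tree


total, tree = build_tree("Peter Piper picked a peck of pickled peppers")
print(total)  # 44, the frequency on the top node
```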
To complete the Huffman Tree, add a 0 on all the left hand branches and a 1 on
all the right hand branches.
44
  0 edlostc17
    0 e8
    1 dlostc9
      0 dlo4
        0 d2
        1 lo2
          0 l1 (01010)
          1 o1
      1 stc5
        0 st2
          0 s1
          1 t1 (01101)
        1 c3
  1 ikprPafΔ27
    0 ikp13
      0 ik6
        0 i3
        1 k3
      1 p7
    1 rPafΔ14
      0 rPaf7
        0 r3
        1 Paf4
          0 P2
          1 af2
            0 a1
            1 f1 (110111)
      1 Δ7
Now you can encode the characters into binary. Read from the top the 0s and 1s
down to each character. For example, follow from 44 to edlostc which is 0 and
then to e is another 0 so 00. Some are shown on the diagram above.
e (8) = 00 k (3) = 1001 d (2) = 0100 P (2) = 11010 t (1) = 01101
a (1) = 110110 o (1) = 01011 f (1) = 110111 l (1) = 01010 s (1) = 01100
r (3) = 1100 i (3) = 1000 c (3) = 0111 Δ (7) = 111 p (7) = 101
Notice how the most frequent characters have the shortest bit patterns.
“Peter Piper picked a peck of pickled peppers” can now be represented in binary:
11010 00 01101 00 1100 111 11010 1000 101 00 1100 111 101 1000 0111 1001 00 0100
111 110110 111 101 00 0111 1001 111 01011 110111 111 101 1000 0111 1001 01010 00
0100 111 101 00 101 101 00 1100 01100
Count up the number of 1s and 0s to calculate the number of bits required to
store. Total = 156.
Compare this with the space needed to store in ASCII:
44 characters x 8 bits = 352 bits.
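Counting this many 1s and 0s by hand is error prone, so the totals can be cross-checked with a short Python sketch. It uses the bit patterns read from the tree, with f taken as 110111 (right, right, left, right, right, right) and a space character standing in for Δ. The variable names are illustrative.

```python
# Bit patterns read from the Huffman Tree (" " is the space shown as Δ)
codes = {
    "e": "00", "k": "1001", "d": "0100", "P": "11010", "t": "01101",
    "a": "110110", "o": "01011", "f": "110111", "l": "01010", "s": "01100",
    "r": "1100", "i": "1000", "c": "0111", " ": "111", "p": "101",
}

phrase = "Peter Piper picked a peck of pickled peppers"
encoded = "".join(codes[character] for character in phrase)

huffman_bits = len(encoded)
ascii_bits = len(phrase) * 8
print(huffman_bits, ascii_bits, ascii_bits - huffman_bits)  # 156 352 196
```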
That was hard work! Huffman coding is not easy. The example above was as hard
as it should get and in an exam you will not be expected to create a Huffman
tree. Here is a simpler example below that might be easier to understand. It uses
a different method which does not keep putting in the combination of letters, but
instead just puts in the numbers for each node:
Example – creating a Huffman Tree 2
Create a Huffman code tree for “the big bug bit the little beetle”
No capital letters are used and this method moves the original sequence about in
order to build the tree. The tool used for creating this tree is
http://www.algorasim.com/ and it puts characters into ascending order rather
than descending order.
Start by identifying the frequency of each letter:
u:1 g:2 h:2 i:3 l:3 b:4 e:6 space:6 t:6
Now combine the lowest frequency pair (1 and 2 = total 3):
Now combine the new lowest frequency pair of 2 and 3, moving the branches
along to the right to keep an ascending order of frequency:
Now combine the pair of 3s (l and ug) to make a total of 6:
Now combine 4 and 5 (total 9) and move to the right hand side:
Do the same for a pair of 6s (e and lug) to make a total of 12:
The last characters left are space (6) and t (6) with a total of 12:
Now combine 9 and 12 to make a total of 21:
Finally, 12 and 21 make a total of 33 – check this is the same as the total number of
characters. Remember to put 0s on left hand side of each branch and 1s on right
hand side of each branch:
The Huffman encoding is:
Space = 00 t = 01 b =100 h = 1010 g = 11111
u = 11110 e = 110 i = 1011 l = 1110
The phrase “the big bug bit the little beetle” can now be represented in binary as:
01 1010 110 00 100 1011 11111 00 100 11110 11111 00 100 1011 01 00 01 1010 110 00
1110 1011 01 01 1110 110 00 100 110 110 01 1110 110
Count up the number of 1s and 0s to calculate the number of bits required to
store. Total = 101
Compare this with the space needed to store in ASCII:
33 characters x 8 bits = 264 bits.
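Because Huffman coding is lossless, the original phrase can be rebuilt exactly from the bits. The Python sketch below encodes the phrase and then decodes it again by reading bits until they match a complete bit pattern. This works because no bit pattern in the table is the start of another one. The function names encode and decode are illustrative choices, not part of the worksheet.

```python
codes = {" ": "00", "t": "01", "b": "100", "h": "1010", "g": "11111",
         "u": "11110", "e": "110", "i": "1011", "l": "1110"}


def encode(text):
    """Replace each character with its bit pattern."""
    return "".join(codes[character] for character in text)


def decode(bits):
    """Read bits one at a time until they match a complete bit pattern."""
    lookup = {pattern: character for character, pattern in codes.items()}
    text, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:      # a whole character has been read
            text.append(lookup[current])
            current = ""
    return "".join(text)


phrase = "the big bug bit the little beetle"
bits = encode(phrase)
print(len(bits))               # 101
print(decode(bits) == phrase)  # True
```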
Extension activity
1) Create a Huffman tree for each of:
a) A CLEAN CREAM CAN
b) WOULD A WOODCHUCK CHUCK WOOD?
c) FOUR FINE FRESH FISH FOR YOU
2) For each of the above phrases, write out the Huffman binary encoding.
3) For each of the above phrases, calculate the number of bits required for
storage using:
i) ASCII
ii) The Huffman Tree
Questions
1) Contrast lossy and lossless compression. [2]
2) Give 2 reasons for compressing data. [2]
3) Identify two methods of compression encoding. [2]