GCSE CS 4 AQA –Huffman – www.gcsecs.org Published by paullong.net

© paullong.net 2016 Page 1 of 16 by Paul Long

Huffman coding and trees

Huffman coding is another method for lossless compression. It reduces the number of bits needed to store data. It is based on the number of times that each data item (character for text or pixel for image) is repeated. The data items that occur most frequently are stored using the fewest bits.

Before looking at how Huffman coding works, let’s look at how to calculate the

number of bits needed to store a simple text file.

Each character in an ASCII text file uses 7 bits, but most text files use extended

ASCII which is 8 bits per character.

Therefore, to calculate the file size of an ASCII text file, count the number of

characters, including spaces, and that is the number of bytes. Then multiply by 8

to give the number of bits.

ASCII file size (bytes) = number of characters

ASCII file size (bits) = number of characters x 8

Example – ASCII file size

How many bits and bytes are in this sentence?

8 x spaces, 1 x ? and 36 letters = 45 bytes. Multiply by 8 for 360 bits.
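The file size calculation can be checked with a few lines of Python. This is a minimal sketch that assumes one 8-bit byte per character (extended ASCII); the function name `ascii_size` is made up for illustration.

```python
def ascii_size(text):
    """Return (bytes, bits) for text stored at one 8-bit byte per character."""
    size_bytes = len(text)       # every character counts, including spaces
    size_bits = size_bytes * 8   # 8 bits in each byte
    return size_bytes, size_bits


sentence = "How many bits and bytes are in this sentence?"
size_bytes, size_bits = ascii_size(sentence)
print(size_bytes, "bytes,", size_bits, "bits")
```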

Huffman coding identifies which characters occur most frequently. The most frequent character is assigned the fewest bits for storage.

Using a Huffman Tree

A Huffman Tree is used to identify the bit pattern that should be used for each

character in a file. This is best explained using an example.

Example – Huffman Tree

This is a Huffman Tree for the word “abracadabra”. There are 5 letters in this word – a, b, c, d and r.

Each circle is called a node.

The frequency of each letter (the number of times each letter appears in the word) is shown on its node:

a:5 r:2 b:2 c:1 d:1

The nodes above are the totals of adding up each pair of nodes beneath them. For example, for c and d, 1 + 1 = 2 and so 2 appears in the node above. The node at the top indicates the total number of characters in the word.

The Huffman Tree can now be used to identify the bit pattern to be used for each character. This is done by inserting a 0 (zero) on every left hand branch and a 1 (one) on every right hand branch. The bit pattern for each letter is calculated by following the branches and writing down the 0s and 1s passed to get to each character.
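The node totals in the abracadabra tree can be reproduced with a short sketch. The nested-tuple layout and the function name `total` are assumptions made for illustration; the shape of the tree is inferred from the letter frequencies and bit patterns used in this worksheet.

```python
# A leaf is (letter, frequency); any other node is a (left, right) pair.
tree = (("a", 5), (("r", 2), (("b", 2), (("c", 1), ("d", 1)))))


def total(node):
    """Return the number shown on a node: a leaf's frequency, or the sum of its two children."""
    if isinstance(node[0], str):      # leaf such as ("c", 1)
        return node[1]
    left, right = node
    return total(left) + total(right)


c_and_d = (("c", 1), ("d", 1))
print(total(c_and_d))   # the node above c and d
print(total(tree))      # the node at the top
```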

Example – calculating the bit pattern

To get to the letter a, only a single 0 (zero) is passed from the top and so the bit pattern for a is 0 (zero).

To get to the letter r, you have to pass a 1 and then a 0 so the bit pattern for r is 10.

To get to the letter c, you have to pass a 1, then another 1, then another 1 and then a 0 so the bit pattern for c is 1110.

Here are all the bit patterns for each character:

a: 0 r: 10 b: 110 c: 1110 d: 1111

Notice how the shortest bit pattern is used for the highest frequency character (a).

The Huffman Coding can now be calculated by replacing each character in the file with its bit pattern.

Example – Huffman Coding

Each character of abracadabra is represented as follows:

a: 0 r: 10 b: 110 c: 1110 d: 1111

Therefore, the Huffman Coding for the word will be:

a  b    r   a  c     a  d     a  b    r   a
0  110  10  0  1110  0  1111  0  110  10  0

This is written out as:

01101001110011110110100
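Reading the bit patterns off the tree and then encoding the word can be sketched in Python. The nested-tuple tree and the function name `codes_from_tree` are assumptions for illustration; the tree shape is inferred from the bit patterns listed for abracadabra.

```python
# A leaf is a single letter; any other node is a (left, right) pair.
tree = ("a", ("r", ("b", ("c", "d"))))


def codes_from_tree(node, prefix=""):
    """Walk the tree, adding 0 for each left branch and 1 for each right branch."""
    if isinstance(node, str):                 # reached a letter
        return {node: prefix}
    left, right = node
    table = codes_from_tree(left, prefix + "0")
    table.update(codes_from_tree(right, prefix + "1"))
    return table


codes = codes_from_tree(tree)
encoded = "".join(codes[letter] for letter in "abracadabra")
print(codes)
print(encoded)
```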

The Huffman Coding is now used to store the data. This will require fewer bits than storing the data in ASCII. The number of bits needed can be calculated by counting the 1s and 0s in the bit pattern. This can then be compared with the number of bits needed to store the data using ASCII to show how much storage could be saved using the Huffman method of compression.

Example – Huffman compression savings

Using Huffman Coding, abracadabra is stored as:

01101001110011110110100

Count up the number of 1s and 0s to calculate the number of bits required to

store. Total = 23 bits.

Compare this with the space needed to store in ASCII:

11 characters = 11 bytes

11 x 8 bits = 88 bits.

The saving is calculated as 88 – 23 = 65 bits saved.
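The same comparison can be written as a small function. This is only a sketch: `bits_saved` is a made-up name, and it assumes ASCII storage at 8 bits per character.

```python
def bits_saved(text, huffman_bit_string):
    """Return how many bits Huffman coding saves compared with 8-bit ASCII."""
    ascii_bits = len(text) * 8               # 8 bits for every character
    huffman_bits = len(huffman_bit_string)   # one bit for every 1 or 0
    return ascii_bits - huffman_bits


saving = bits_saved("abracadabra", "01101001110011110110100")
print(saving, "bits saved")
```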

Example – using a Huffman Tree

This is a Huffman Tree for “poppy pop”.

The coding for p will be 1 (move right once).

The coding for o will be 00 (left, then left).

The coding for space will be 010 (left, right, left).

The coding for y will be 011 (left, right, right).

Therefore, the Huffman Coding for the phrase will be:

p  o   p  p  y    space  p  o   p
1  00  1  1  011  010    1  00  1

This is written out as:

100110110101001

This uses a total of 15 bits.

Using ASCII, 9 characters of 8 bits each would be needed, making a total of 72 bits.

There is a compression saving of 72 – 15 = 57 bits.
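The number of bits can also be worked out without writing the whole pattern, by multiplying each character's frequency by the length of its code. A minimal sketch using the codes for “poppy pop”; the variable names are illustrative only.

```python
from collections import Counter

codes = {"p": "1", "o": "00", " ": "010", "y": "011"}
phrase = "poppy pop"

frequency = Counter(phrase)   # how many times each character appears

# Each character costs (frequency x code length) bits.
huffman_bits = sum(frequency[ch] * len(codes[ch]) for ch in frequency)
ascii_bits = len(phrase) * 8

print(huffman_bits, ascii_bits, ascii_bits - huffman_bits)
```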

Activity

1) The following Huffman tree has been created for “SHE SELLS SEA SHELLS”

a) Identify the binary code to be used for each character.

b) Write out the Huffman binary encoding for “SHE SELLS SEA SHELLS”.

c) Calculate how many bits would be required to store the sentence using

ASCII.

2) The following Huffman tree has been created for “EDDIE EDITED IT”

Created using http://huffman.ooz.ie

a) Write out the Huffman binary encoding for this sentence.

b) Calculate how many bits are needed to store this data using Huffman.

c) Calculate how many bits are needed to store this data using ASCII.

3) The following Huffman tree has been created for “WHICH WITCH IS WHICH”

a) Write out the Huffman binary encoding for this sentence.

b) Calculate how many bits are needed to store this data using Huffman.

c) Calculate how many bits are needed to store this data using ASCII.

4) The following Huffman tree has been created for “STUPID SUPERSTITION”

a) Identify the binary code to be used for each character.

b) Write out the Huffman binary encoding for this tree.

c) Calculate how many bits are needed to store this data using Huffman.

d) Calculate how many bits are needed to store this data using ASCII.

5) The following Huffman tree has been created for “GOOD BLOOD, BAD BLOOD”

a) Identify the binary code to be used for each character.

b) Write out the Huffman binary encoding for this tree.

c) Calculate how many bits are needed to store this data using Huffman.

d) Calculate how many bits are needed to store this data using ASCII.

Creating a Huffman Tree

Note: you do not need to be able to create a Huffman Tree in an exam, so this is a

bit of extension work. You may find it helpful to understand how the Trees

are created.

Creating a Huffman Tree is best understood with a video explanation.

Video

Watch Text Compression with Huffman Coding by Barry Brown on YouTube.

Work through this animation.

Example – creating a Huffman Tree 1

Peter Piper picked a peck of pickled peppers

To start with, note that the total characters including spaces is 44 which in normal

ASCII encoding would require 8 bits per character making a total of 352 bits.

For the Huffman code, count the frequency of each letter:

Space (Δ) x 7 P x 2 p x 7 e x 8 t x 1 r x 3

i x 3 c x 3 k x 3 d x 2 a x 1 o x 1

f x 1 l x 1 s x 1

Add the total frequencies to check they add up to 44 because mistakes are very

easy to make. Now put these into ascending order along the bottom of a page:

a1 f1 l1 o1 s1 t1 P2 d2 c3 i3 k3 r3 p7 Δ7 e8
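Counting by hand is error-prone, so the frequencies can be checked with Python's `collections.Counter`. This is a sketch only; characters with the same frequency may come out in a different order from the row above.

```python
from collections import Counter

phrase = "Peter Piper picked a peck of pickled peppers"
frequency = Counter(phrase)   # capital P and lower-case p are counted separately

# The frequencies must add up to the number of characters.
assert sum(frequency.values()) == len(phrase)

# Ascending order of frequency, ready to be used as the bottom row of nodes.
nodes = sorted(frequency.items(), key=lambda item: item[1])
print(nodes)
```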

These are all now known as nodes in the Huffman Tree that is to be created. Start

by looking for the two nodes with the lowest frequencies. There are 6 to choose

from (a, f, l, o, s, t), so start with the left hand pair. Combine these to make a new

node with a number that represents the total frequency of characters within those

nodes (1 + 1 = 2):

af2

a1 f1 l1 o1 s1 t1 P2 d2 c3 i3 k3 r3 p7 Δ7 e8

Repeat this for the other four nodes with a frequency of 1:

af2 lo2 st2

a1 f1 l1 o1 s1 t1 P2 d2 c3 i3 k3 r3 p7 Δ7 e8

Repeat this again for the two nodes with the lowest frequency. There are 5 to

choose from (P, d, af, lo, st). Notice how we have to take account of the new

nodes that have been created. Therefore, join together from the left hand side

fitting in P with af, then d with lo (2 + 2 = 4):

Paf4 dlo4

P2 af2 d2 lo2 st2

a1 f1 l1 o1 s1 t1 c3 i3 k3 r3 p7 Δ7 e8

Next, combine the next pair of nodes with the lowest frequency. This will be st and one of c, i, k and r. Choose the leftmost of these to pair with st, which will be c (2 + 3 = 5), and similarly pair i with k (3 + 3 = 6).

Paf4 dlo4 stc5

P2 af2 d2 lo2 st2 c3 ik6

a1 f1 l1 o1 s1 t1 i3 k3 r3 p7 Δ7 e8

The next lowest frequency pair will be r (3) and Paf (4) making a total of 7. The

other two lowest frequencies are dlo (4) and stc (5) making a total of 9.

Remember the lowest frequency moves to the left hand side of each pair of

branches.

rPaf7 dlostc9

r3 Paf4 dlo4 stc5

P2 af2 d2 lo2 st2 c3 ik6

a1 f1 l1 o1 s1 t1 i3 k3 p7 Δ7 e8

The next 2 lowest frequency nodes are ik (6) and p (7) making a total of 13. The other two lowest frequencies are rPaf (7) and space/Δ (7) making a total of 14. The top of each set of branches must be kept in ascending order of frequency so rPafΔ moves to the right hand side because it is the largest frequency.

rPafΔ14

dlostc9 rPaf7 Δ7

dlo4 stc5 ikp13 r3 Paf4

d2 lo2 st2 c3 ik6 p7 P2 af2

l1 o1 s1 t1 i3 k3 a1 f1 e8

The next 2 lowest frequency nodes are e (8) and dlostc (9) making a total of 17. e8 is moved to the left hand side of dlostc9 because it has a smaller frequency and edlostc is moved to the right hand side because 17 is now the largest frequency.

rPafΔ14 edlostc17

rPaf7 Δ7 e8 dlostc9

ikp13 r3 Paf4 dlo4 stc5

ik6 p7 P2 af2 d2 lo2 st2 c3

i3 k3 a1 f1 l1 o1 s1 t1

The next 2 lowest frequency nodes are ikp (13) and rPafΔ (14) making a total of 27. ikprPafΔ goes on the right hand side because it is the higher frequency.

ikprPafΔ27

edlostc17 ikp13 rPafΔ14

e8 dlostc9 ik6 p7 rPaf7 Δ7

dlo4 stc5 i3 k3 r3 Paf4

d2 lo2 st2 c3 P2 af2

l1 o1 s1 t1 a1 f1

Finally, the last 2 nodes are edlostc (17) and ikprPafΔ (27) with a total of 44, which matches the number of characters, so every character has been included in the tree.

To complete the Huffman Tree, add a 0 on all the left hand branches and a 1 on

all the right hand branches.

44

0 1

edlostc17 ikprPafΔ27

0 1 0 1

e8 dlostc9 ikp13 rPafΔ14

0 1 0 1 0 1

dlo4 stc5 ik6 p7 rPaf7 Δ7

0 1 0 1 0 1 0 1

d2 lo2 st2 c3 i3 k3 r3 Paf4

0 1 0 1 0 1

l1 o1 s1 t1 P2 af2

0 1

a1 f1

Bit patterns shown on the diagram: l = 01010, t = 01101, f = 110111

Now you can encode the characters into binary. Read from the top the 0s and 1s

down to each character. For example, follow from 44 to edlostc which is 0 and

then to e is another 0 so 00. Some are shown on the diagram above.

e (8) = 00 k (3) = 1001 d (2) = 0100 P (2) = 11010 t (1) = 01101

a (1) = 110110 o (1) = 01011 f (1) = 110111 l (1) = 01010 s (1) = 01100

r (3) = 1100 i (3) = 1000 c (3) = 0111 Δ (7) = 111 p (7) = 101

Notice how the most frequent characters have the shortest bit patterns.

“Peter Piper picked a peck of pickled peppers” can now be represented in binary:

11010 00 01101 00 1100 111 11010 1000 101 00 1100 111 101 1000 0111 1001 00 0100 111

110110 111 101 00 0111 1001 111 01011 110111 111 101 1000 0111 1001 01010 00 0100 111

101 00 101 101 00 1100 01100

Count up the number of 1s and 0s to calculate the number of bits required to store. Total = 156.

Compare this with the space needed to store in ASCII:

44 characters x 8 bits = 352 bits.

That was hard work! Huffman coding is not easy. The example above was as hard

as it should get and in an exam you will not be expected to create a Huffman

tree. Here is a simpler example below that might be easier to understand. It uses

a different method which does not keep putting in the combination of letters, but

instead just puts in the numbers for each node:

Example – creating a Huffman Tree 2

Create a Huffman code tree for “the big bug bit the little beetle”

No capital letters are used and this method moves the original sequence about in

order to build the tree. The tool used for creating this tree is

http://www.algorasim.com/ and it puts characters into ascending order rather

than descending order.

Start by identifying the frequency of each letter:

Now combine the lowest frequency pair (1 and 2 = total 3):

Now combine the new lowest frequency pair of 2 and 3, moving the branches

along to the right to keep an ascending order of frequency:

Now combine the pair of 3s (l and ug) to make a total of 6:

Now combine 4 and 5 (total 9) and move to the right hand side:

Do the same for a pair of 6s (e and lug) to make a total of 12:

The last characters left are space (6) and t (6) with a total of 12:

Now combine 9 and 12 to make a total of 21:

Finally, 12 and 21 make a total of 33 – check this is the same as the total number of

characters. Remember to put 0s on left hand side of each branch and 1s on right

hand side of each branch:
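The numbers-only method can be followed in code by tracking just the node totals. This is a minimal sketch using `heapq`; the function name `merge_totals` is made up for illustration. A useful check is that the merge totals add up to the length of the encoded phrase, because each merge adds one bit to every character beneath the new node.

```python
import heapq
from collections import Counter


def merge_totals(frequencies):
    """Repeatedly combine the two lowest numbers and return each new total in order."""
    heap = list(frequencies)
    heapq.heapify(heap)
    totals = []
    while len(heap) > 1:
        combined = heapq.heappop(heap) + heapq.heappop(heap)
        totals.append(combined)
        heapq.heappush(heap, combined)
    return totals


phrase = "the big bug bit the little beetle"
totals = merge_totals(Counter(phrase).values())
print(totals)        # the new node total created at each step
print(sum(totals))   # number of bits in the Huffman encoding
```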

The Huffman encoding is:

Space = 00 t = 01 b =100 h = 1010 g = 11111

u = 11110 e = 110 i = 1011 l = 1110

The phrase “the big bug bit the little beetle” can now be represented in binary as:

01 1010 110 00 100 1011 11111 00 100 11110 11111 00 100 1011 01 00 01 1010 110 00

1110 1011 01 01 1110 110 00 100 110 110 01 1110 110

Count up the number of 1s and 0s to calculate the number of bits required to

store. Total = 101

Compare this with the space needed to store in ASCII:

33 characters x 8 bits = 264 bits.
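The encoding and the bit counts can be checked by substituting each character with its code. A short sketch using the table above; the variable names are illustrative only.

```python
codes = {" ": "00", "t": "01", "b": "100", "h": "1010", "g": "11111",
         "u": "11110", "e": "110", "i": "1011", "l": "1110"}
phrase = "the big bug bit the little beetle"

# One code per character, separated by spaces so it can be compared with the rows above.
grouped = " ".join(codes[ch] for ch in phrase)
huffman_bits = len(grouped.replace(" ", ""))   # count only the 1s and 0s
ascii_bits = len(phrase) * 8

print(grouped)
print(huffman_bits, "bits using Huffman,", ascii_bits, "bits using ASCII")
```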

Extension activity

1) Create a Huffman tree for each of:

a) A CLEAN CREAM CAN

b) WOULD A WOODCHUCK CHUCK WOOD?

c) FOUR FINE FRESH FISH FOR YOU

2) For each of the above phrases, write out the Huffman binary encoding.

3) For each of the above phrases, calculate the number of bits required for storage using:

i) ASCII

ii) The Huffman Tree

Questions

1) Contrast lossy and lossless compression. [2]

2) Give 2 reasons for compressing data. [2]

3) Identify two methods of compression encoding. [2]