59
CompSci 100E 31.1 Text Compression Input: String S Output: String S Shorter S can be reconstructed from S

Text Compression

  • Upload
    tavita

  • View
    43

  • Download
    0

Embed Size (px)

DESCRIPTION

Text Compression. Input: String S Output: String S  Shorter S can be reconstructed from S . Text Compression. … - PowerPoint PPT Presentation

Citation preview

Page 1: Text Compression

CompSci 100E 31.1

Text Compression

Input: String S Output: String S

Shorter S can be reconstructed from S

Page 2: Text Compression

CompSci 100E 31.2

Text Compression … He said to his friend, "If the British march

By land or sea from the town to-night,Hang a lantern aloft in the belfry archOf the North Church tower as a signal light,--One if by land, and two if by sea;And I on the opposite shore will be,Ready to ride and spread the alarmThrough every Middlesex village and farm,For the country folk to be up and to arm." …

Henry Wadsworth Longfellow (1807-1882)

By land 1

By sea 2

Page 3: Text Compression

CompSci 100E 31.3

Text Compression: Examples

Symbol

ASCII Fixedlength

Var.length

a 01100001

000 000

b 01100010

001 11

c 01100011

010 01

d 01100100

011 001

e 01100101

100 10

“abcde” in the different formatsASCII: 01100001011000100110001101100100…Fixed: 000001010011100

Var: 000110100110

0

00

0

0

00

1

1 1

1

a b c d e

a d

bc e0

0

0

0 1

1

1

1

EncodingsASCII: 8 bits/characterUnicode: 16 bits/character

Unicode adds 8 more 0’s to left of ASCII

Page 4: Text Compression

CompSci 100E 31.4

Huffman coding: go go gophers

Encoding uses tree: 0 left/1 right How many bits? 37!! Savings? Worth it?

ASCII(7bits) 3 bits Huffmang 103 1100111 3 000 00o 111 1101111 3 001 01p 112 1110000 1 010 1100h 104 1101000 1 011 1101e 101 1100101 1 100 1110r 114 1110010 1 101 1111s 115 1110011 1 110 100sp. 32 1000000 2 111 101

3

s1

*2

2

p1

h1

2

e1

r1

4

g

3

o

3

6

3

2

p1

h1

2

e1

r1

4

s1

*2

7

g3

o3

6

13

2

p1

h1

2

e1

r1

Page 5: Text Compression

CompSci 100E 31.5

Huffman Coding

D.A Huffman in early 1950’s Before compressing data, analyze the input

stream Represent data using variable length codes Variable length codes though Prefix codes

Each letter is assigned a codeword Codeword is for a given letter is produced by

traversing the Huffman tree Property: No codeword produced is the prefix of

another Letters appearing frequently have short codewords,

while those that appear rarely have longer ones Huffman coding is optimal per-character

coding method

Page 6: Text Compression

CompSci 100E 31.6

Building a Huffman tree

Begin with a forest of single-node trees (leaves) Each node/tree/leaf is weighted with character count Node stores two values: character and count There are n nodes in forest, n is size of alphabet?

Repeat until there is only one node left: root of tree Remove two minimally weighted trees from forest Create new tree with minimal trees as children,

o New tree root's weight: sum of children (character ignored)

Does this process terminate? How do we get minimal trees? Remove minimal trees, hmmm……

Page 7: Text Compression

CompSci 100E 31.7

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11 6

I

5

E

5

N

1

C

1

F

1

P

2

U

2

R

2

L

2

D

2

G

3

T

3

O

3

B

3

A

4

M

4

S

Page 8: Text Compression

CompSci 100E 31.8

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11 6

I

5

E

5

N

1

C

1

F

1

P

2

U

2

R

2

L

2

D

2

G

3

T

3

O

3

B

3

A

4

M

4

S

2

2

Page 9: Text Compression

CompSci 100E 31.9

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11 6

I

5

E

5

N

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

T

3

O

3

B

3

A

4

M

4

S

2

2

3

3

Page 10: Text Compression

CompSci 100E 31.10

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11 6

I

5

E

5

N

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

T

3

O

3

B

3

A

4

M

4

S

2

2

3

3

4

4

Page 11: Text Compression

CompSci 100E 31.11

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11 6

I

5

E

5

N

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

T

3

O

3

B

3

A

4

M

4

S

2

2

3

3

4

4

4

4

Page 12: Text Compression

CompSci 100E 31.12

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11 6

I

5

E

5

N

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S

2

3

3

4

4

4

4

5

5

Page 13: Text Compression

CompSci 100E 31.13

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11 6

I

5

E

5

N

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S

2 3

4

4

4

4

5

5

6

6

Page 14: Text Compression

CompSci 100E 31.14

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11 6

I

5

E

5

N

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S

2 3

4

4

4

4

5

5

6

6

6

6

Page 15: Text Compression

CompSci 100E 31.15

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11 6

I

5

E

5

N

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

4

4

4

4

5

5

6

6

6

8

8

6

Page 16: Text Compression

CompSci 100E 31.16

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11 6

I

5

E

5

N

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

44

5

5

6

6

6

8

8

6

8

8

Page 17: Text Compression

CompSci 100E 31.17

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11 6

I

5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

44

5

5

6

6

6

8

8

6

8

8

10

10

Page 18: Text Compression

CompSci 100E 31.18

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11 6

I

5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

6

6

8

8

6

8

8

10

10

11

11

Page 19: Text Compression

CompSci 100E 31.19

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

8

6

8

8

10

10

11

11

12

12

Page 20: Text Compression

CompSci 100E 31.20

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

10

11

11

12

1216

Page 21: Text Compression

CompSci 100E 31.21

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

121621

Page 22: Text Compression

CompSci 100E 31.22

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

23

162123

Page 23: Text Compression

CompSci 100E 31.23

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

2337

Page 24: Text Compression

CompSci 100E 31.24

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

60

Page 25: Text Compression

CompSci 100E 31.25

Encoding

1. Count occurrence of all occurring characters O( )

2. Build priority queue O( )

3. Build Huffman tree O( )

4. Create Table of codes from tree O( )

5. Write Huffman tree and coded data to file O( )

N

A

A log A

A log A

N

Page 26: Text Compression

CompSci 100E 31.26

Properties of Huffman coding Want to minimize weighted path length L(T)of

tree T

wi is the weight or count of each codeword i

di is the leaf corresponding to codeword i

How do we calculate character (codeword) frequencies?

Huffman coding creates pretty full bushy trees? When would it produce a “bad” tree?

How do we produce coded compressed data from input efficiently?

)(

)(TLeafi

iiwdTL

Page 27: Text Compression

CompSci 100E 31.27

Writing code out to file

How do we go from characters to encodings? Build Huffman tree Root-to-leaf path generates encoding

Need way of writing bits out to file Platform dependent? Complicated to write bits and read in same ordering

See BitInputStream and BitOutputStream classes Depend on each other, bit ordering preserved

How do we know bits come from compressed file? Store a magic number

Page 28: Text Compression

CompSci 100E 31.28

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

01100000100001001101

Page 29: Text Compression

CompSci 100E 31.29

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

1100000100001001101

Page 30: Text Compression

CompSci 100E 31.30

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

100000100001001101

Page 31: Text Compression

CompSci 100E 31.31

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

00000100001001101

Page 32: Text Compression

CompSci 100E 31.32

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

0000100001001101

G

Page 33: Text Compression

CompSci 100E 31.33

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

000100001001101

G

Page 34: Text Compression

CompSci 100E 31.34

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

00100001001101

G

Page 35: Text Compression

CompSci 100E 31.35

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

0100001001101

G

Page 36: Text Compression

CompSci 100E 31.36

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

100001001101

G

Page 37: Text Compression

CompSci 100E 31.37

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

00001001101

GO

Page 38: Text Compression

CompSci 100E 31.38

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

0001001101

GO

Page 39: Text Compression

CompSci 100E 31.39

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

001001101

GO

Page 40: Text Compression

CompSci 100E 31.40

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

01001101

GO

Page 41: Text Compression

CompSci 100E 31.41

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

1001101

GO

Page 42: Text Compression

CompSci 100E 31.42

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

001101

GOO

Page 43: Text Compression

CompSci 100E 31.43

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

01101

GOO

Page 44: Text Compression

CompSci 100E 31.44

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

1101

GOO

Page 45: Text Compression

CompSci 100E 31.45

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

101

GOO

Page 46: Text Compression

CompSci 100E 31.46

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

01

GOO

Page 47: Text Compression

CompSci 100E 31.47

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

1

GOOD

Page 48: Text Compression

CompSci 100E 31.48

Decoding a message

11

6

I5

N

5

E

1

F

1

C

1

P

2

U

2

R

2

L

2

D

2

G

3

O

3

T

3

B

3

A

4

M

4

S2 3

445

68

6

8

16

10

21

11

12

2337

60

01100000100001001101

GOOD

Page 49: Text Compression

CompSci 100E 31.49

Decoding

1. Read in tree data O( )

2. Decode bit string with tree O( )

Page 50: Text Compression

CompSci 100E 31.50

Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”

“10101101001000101001110011100000”

Page 51: Text Compression

CompSci 100E 31.51

Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”

“10101101001000101001110011100000”

Page 52: Text Compression

CompSci 100E 31.52

Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”

“10101101001000101001110011100000”

Page 53: Text Compression

CompSci 100E 31.53

Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”

“10101101001000101001110011100000”

Page 54: Text Compression

CompSci 100E 31.54

Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”

“10101101001000101001110011100000”

Page 55: Text Compression

CompSci 100E 31.55

Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”

“10101101001000101001110011100000”

Page 56: Text Compression

CompSci 100E 31.56

Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”

“10101101001000101001110011100000”

Page 57: Text Compression

CompSci 100E 31.57

Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”

“10101101001000101001110011100000”

Page 58: Text Compression

CompSci 100E 31.58

Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”

“10101101001000101001110011100000”

Page 59: Text Compression

CompSci 100E 31.59

Other methods

Adaptive Huffman coding Lempel-Ziv algorithms

Build the coding table on the fly while reading document

Coding table changes dynamically Protocol between encoder and decoder so that

everyone is always using the right coding scheme

Works well in practice (compress, gzip, etc.) More complicated methods

Burrows-Wheeler (bunzip2) PPM statistical methods