A JPEG Chip for Image Compression and Decompression



44 Sun and Lee

of a single chip which requires no additional wirings

in the circuit board. We have also removed data redundancies occurring in [5] and reduced the size of

the Huffman tables with a modified RAM architec-

ture. The chip was implemented with VLSI CAD tools.

Firstly, VHDL (VHSIC Hardware Description Lan-

guage) codes were written to describe the architecture

and behavior of the chip. Each block of the chip was

defined and simulated. Then the functionality of the

design was verified with field programmable gate ar-

rays (FPGAs) on circuit boards. Finally, a single chip

was implemented using the standard cell design approach with the 0.6 µm triple-metal process. The chip can compress and decompress CCIR601 images, with a resolution of 720 × 480 pixels, at a rate of 30 images per second without any restriction on the compression ratio. The chip contains 411,745 transistors, with a chip size of 6.6 × 6.9 mm².

A digital image can be regarded as a two-

dimensional array of pixels. For gray-level images,

each pixel is a value between 0 and 255 (i.e., repre-

sented by 8 bits). For colored images, each pixel con-

tains three values, e.g., RGB or YUV, with each value

lying between 0 and 255. In JPEG, values of the same

category are processed separately from values of other

categories. For example, Y values of a colored image

are processed separately from U values and V values. For convenience, we treat each pixel as a value represented by 8 bits in the rest of the paper.

The rest of the paper is organized as follows. In

Section 2, we provide a brief introduction of the JPEG

baseline system. Section 3 describes the JPEG hard-

ware implementation. Major modules, DCT/IDCT,

quantizer/de-quantizer, zig-zag, and Huffman codec,

are presented in order. In Section 4, we describe the

RAM architecture of the Huffman tables. Section 5 de-

scribes the implementation in FPGAs and shows some

experimental results. The implementation in ASIC is

described in Section 6 and our conclusion is given in

Section 7.

Figure 1. JPEG compression system. (Blocks: image data, offset (−128), 8 × 8 DCT, quantization, zig-zag, Huffman coder, compressed data.)

2. Overview of JPEG Baseline System

Figure 1 shows the major components of the JPEG

compression system. The decompression system is es-

sentially the same with data flowing in the opposite di-

rection and with each function replaced by its inverse.

Four operations are involved in the compression process: DCT (Discrete Cosine Transform), quantization, zig-zag, and Huffman coding. The signed pixel data of a picture are grouped into 8 × 8 blocks and each block is transformed by DCT into 8 × 8 = 64 values called DCT coefficients. The upper-left corner of an 8 × 8 block of DCT coefficients is the DC coefficient and the other 63 values are AC coefficients. The 64 coefficients are then quantized using corresponding values from a quantization table. After quantization, the 64 quantized coefficients are converted into a one-dimensional sequence by the zig-zag operation. Finally, the coefficients are encoded by Huffman coding and codewords are obtained by looking up DC and AC tables. For decompression, the Huffman decoder decodes the given compressed data. Then the data are organized into 8 × 8 two-dimensional blocks. After dequantization the data in each block are transformed to a set of 8 × 8 pixel values by the Inverse DCT (IDCT).

DCT transforms a picture in the spatial domain into another in the frequency domain. As mentioned, each pixel in an original image is assumed to represent a value between 0 and 255. Before DCT, a level shift is done by subtracting 128 from each pixel, making a pixel value range from −128 to 127. Then the image is partitioned into 8 × 8 blocks and these blocks are processed one by one, left to right and top to bottom. The DCT for a block is defined as follows:

\[ S(u, v) = \frac{C(u)\,C(v)}{4} \sum_{x=0}^{7} \sum_{y=0}^{7} s(x, y) \cos\frac{(2x + 1)u\pi}{16} \cos\frac{(2y + 1)v\pi}{16} \qquad (1) \]

where s(x, y) is the pixel value, S(u, v) is the DCT coefficient, and

\[ C(k) = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{when } k = 0; \\ 1 & \text{when } k \neq 0. \end{cases} \qquad (2) \]

When a block is processed by DCT, high-frequency coefficients appear at the lower-right corner of the block while low-frequency coefficients appear at the upper-left corner.
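As an illustration of Eqs. (1) and (2), the following is a minimal floating-point sketch of the 8 × 8 DCT evaluated directly from the definition. It is not the hardware method of this paper; the function names `c` and `dct2d` and the uniform test block are assumptions made for the example.

```python
import math

def c(k):
    # Scale factor C(k) of Eq. (2)
    return 1 / math.sqrt(2) if k == 0 else 1.0

def dct2d(block):
    """Direct evaluation of Eq. (1); block[x][y] holds level-shifted pixels."""
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            acc = 0.0
            for x in range(8):
                for y in range(8):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / 16)
                            * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = c(u) * c(v) / 4 * acc
    return out

# A uniform block (pixel value 200, level-shifted by 128) has only a DC term.
flat = [[200 - 128] * 8 for _ in range(8)]
coeffs = dct2d(flat)
```

For this uniform block all energy falls into the upper-left (DC) coefficient and every AC coefficient is zero up to rounding error, which matches the low-frequency/high-frequency layout described above.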

For quantization, the DCT coefficients obtained from the DCT module are divided by the values defined in the quantization table, which contains 8 × 8 entries, i.e.,

\[ S_q(u, v) = \frac{S(u, v)}{q(u, v)} = S(u, v) \times \frac{1}{q(u, v)} \qquad (3) \]

where S_q(u, v) is the quantized coefficient of S(u, v) and 1/q(u, v) is the corresponding quantizing value stored at position (u, v) of the quantization table. If the divisor is large, the bit-rate will be low but the quality of the reconstructed image will be bad; and vice versa. Therefore, the user may select from different quantization tables to achieve a desired trade-off between bit-rate and quality of reconstruction.
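Eq. (3) replaces a division by a multiplication with a stored reciprocal. The sketch below shows one way this could look in integer arithmetic. The fixed-point width `FRAC_BITS`, the round-to-nearest step, and the helper names are assumptions for illustration; the paper does not state the number format used in the chip.

```python
FRAC_BITS = 12  # assumed number of fractional bits for the stored reciprocal

def reciprocal(q):
    # Table entry 1/q(u, v) as a fixed-point integer
    return round((1 << FRAC_BITS) / q)

def quantize(coeff, recip):
    # Multiply by the reciprocal, add one half, and drop the fractional bits
    return (coeff * recip + (1 << (FRAC_BITS - 1))) >> FRAC_BITS

print(quantize(576, reciprocal(16)))   # 576 / 16
print(quantize(-100, reciprocal(10)))  # -100 / 10
```

With a larger table entry q the quantized values shrink toward zero, which is the bit-rate versus quality trade-off described above.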

For a block with a small variation of pixel values, the high-frequency coefficients obtained by DCT tend to be small. Furthermore, the quantization operation performed previously intentionally makes the high-frequency DCT coefficients smaller; namely, coefficients at lower frequencies are divided by smaller integers while higher ones are divided by larger integers. Therefore, the lower triangular part of a coefficient block tends to contain many zero entries. The zig-zag operation, shown in Fig. 2, places the DCT coefficients of a block in a sequence from low to high frequencies. As a result, long sections of successive zeros are more likely to occur in the tail of the sequence, which is good for Huffman coding.

Huffman coding is a variable-length coding method.

Its idea is to use fewer bits to represent a symbol which

appears more frequently and more bits to represent a

symbol which appears less frequently. Huffman coder

receives the sequences obtained from the zig-zag mod-

ule and generates one block of codewords for each such

sequence. A block of codewords consists of one DC

codeword and one or more AC codewords, as shown

in Fig. 3. Note that an end-of-block (EOB) mark is

inserted at the end of each block. In this way, a block

of 8 × 8 pixel values is turned into a block of codewords and the effect of compression is thus achieved. The Huffman decoder does the reverse of the Huffman coder, i.e., it receives blocks of codewords and generates blocks of corresponding DCT coefficients.

 0   1   5   6  14  15  27  28
 2   4   7  13  16  26  29  42
 3   8  12  17  25  30  41  43
 9  11  18  24  31  40  44  53
10  19  23  32  39  45  52  54
20  22  33  38  46  51  55  60
21  34  37  47  50  56  59  61
35  36  48  49  57  58  62  63

Figure 2. Sequence obtained by zig-zag.
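The zig-zag order of Fig. 2 can be generated by walking the anti-diagonals of the block in alternating directions. The following is a small software sketch of that idea; the names `zigzag_rank` and `zigzag_scan` are hypothetical and the code is unrelated to the RAM-based hardware described later.

```python
def zigzag_rank(n=8):
    """rank[r][c] is the position of entry (r, c) in the zig-zag sequence."""
    rank = [[0] * n for _ in range(n)]
    k = 0
    for d in range(2 * n - 1):                      # anti-diagonal r + c = d
        rows = range(max(0, d - n + 1), min(d, n - 1) + 1)
        # Even diagonals run from bottom-left to top-right, odd ones the other way
        for r in (reversed(rows) if d % 2 == 0 else rows):
            rank[r][d - r] = k
            k += 1
    return rank

def zigzag_scan(block):
    """Return the entries of a square block as a one-dimensional zig-zag sequence."""
    n = len(block)
    rank = zigzag_rank(n)
    seq = [0] * (n * n)
    for r in range(n):
        for c in range(n):
            seq[rank[r][c]] = block[r][c]
    return seq

for row in zigzag_rank():
    print(" ".join("%2d" % v for v in row))
```

Printing `zigzag_rank()` gives the index table of the block, and `zigzag_scan` applied to a quantized block places the low-frequency entries first so that the zeros collect at the tail.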

3. JPEG Implementation

The JPEG system is partitioned into four modules: DCT/IDCT, quantizer/dequantizer, zig-zag, and Huffman codec. The DCT coefficients of a block of pixel values are obtained by the cascade of two one-dimensional (1-D) DCT processors. The quantizer/dequantizer is implemented by the radix-4 modified Booth's algorithm. The zig-zag operation is performed by a dual-buffering mechanism. Finally, the Huffman codec employs an efficient architecture for Huffman decoding.

3.1. DCT/IDCT

The DCT of Eq. (1) can be rewritten as

\[ S(u, v) = \frac{C(u)}{2} \sum_{x=0}^{7} \left[ \frac{C(v)}{2} \sum_{y=0}^{7} s(x, y) \cos\frac{(2y + 1)v\pi}{16} \right] \cos\frac{(2x + 1)u\pi}{16} \qquad (4) \]

which can be performed by applying the cascade of two 1-D DCT processors in vertical and horizontal directions [9, 10]. The 1-D DCT is defined as follows:

\[ S(w) = \frac{C(w)}{2} \sum_{t=0}^{7} s(t) \cos\frac{(2t + 1)w\pi}{16} \qquad (5) \]

where C(w) has the same value as given in Eq. (2). Equation (5) can be written in the following matrix form:

\[ \begin{bmatrix} S(0) \\ S(2) \\ S(4) \\ S(6) \end{bmatrix} = \frac{1}{2} \begin{bmatrix} A & A & A & A \\ B & C & -C & -B \\ A & -A & -A & A \\ C & -B & B & -C \end{bmatrix} \begin{bmatrix} s(0) + s(7) \\ s(1) + s(6) \\ s(2) + s(5) \\ s(3) + s(4) \end{bmatrix} \qquad (6) \]

\[ \begin{bmatrix} S(1) \\ S(3) \\ S(5) \\ S(7) \end{bmatrix} = \frac{1}{2} \begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix} \begin{bmatrix} s(0) - s(7) \\ s(1) - s(6) \\ s(2) - s(5) \\ s(3) - s(4) \end{bmatrix} \qquad (7) \]

where A = cos(π/4), B = cos(π/8), C = sin(π/8), D = cos(π/16), E = cos(3π/16), F = sin(3π/16), and G = sin(π/16).

Figure 3. Codewords obtained from Huffman coder. (Each block consists of a DC codeword followed by AC codewords and an EOB mark; each codeword is a Huffman code followed by an amplitude.)

Figure 4. Architecture of the DCT/IDCT module. (Data in, 1-D DCT/IDCT, transposition memory, 1-D DCT/IDCT, data out.)
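The even/odd decomposition can be checked numerically against the 1-D definition of Eq. (5). The sketch below is only such a check in floating point, not a model of the hardware. It applies the factor 1/2 of Eq. (5) to both matrix products and takes the last entry of the odd matrix as −D, which is what Eq. (5) gives when expanded for S(7); the function names and the sample input are assumptions.

```python
import math

A = math.cos(math.pi / 4)
B, C = math.cos(math.pi / 8), math.sin(math.pi / 8)
D, E = math.cos(math.pi / 16), math.cos(3 * math.pi / 16)
F, G = math.sin(3 * math.pi / 16), math.sin(math.pi / 16)

EVEN = [[A, A, A, A], [B, C, -C, -B], [A, -A, -A, A], [C, -B, B, -C]]
ODD = [[D, E, F, G], [E, -G, -D, -F], [F, -D, G, E], [G, -F, E, -D]]

def dct1d_direct(s):
    """Eq. (5) evaluated term by term."""
    out = []
    for w in range(8):
        cw = 1 / math.sqrt(2) if w == 0 else 1.0
        out.append(cw / 2 * sum(s[t] * math.cos((2 * t + 1) * w * math.pi / 16)
                                for t in range(8)))
    return out

def dct1d_matrix(s):
    """Sums and differences first, then two 4 x 4 matrix products."""
    plus = [s[i] + s[7 - i] for i in range(4)]
    minus = [s[i] - s[7 - i] for i in range(4)]
    out = [0.0] * 8
    for r in range(4):
        out[2 * r] = 0.5 * sum(m * p for m, p in zip(EVEN[r], plus))
        out[2 * r + 1] = 0.5 * sum(m * p for m, p in zip(ODD[r], minus))
    return out

sample = [-76, -73, -67, -62, -58, -67, -64, -55]  # arbitrary level-shifted pixels
```

Comparing `dct1d_direct(sample)` with `dct1d_matrix(sample)` shows that the two agree to rounding error, while the matrix form needs only half as many multiplications per output.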

Figure 4 shows the architecture of the DCT module. An 8 × 8 block is processed column by column by the first 1-D DCT processor. The intermediate results are placed in the transposition memory. Then the DCT values are obtained row by row by the second 1-D DCT processor. The architecture of the 1-D DCT processors, implemented in distributed arithmetic [11], is shown in Fig. 5. The first stage, FREG, in a 1-D DCT consists of 8 parallel-in-serial-out registers. The eight pixels in a column of an 8 × 8 block are fed into the registers, and shifted out serially by 2 bits at a time to the next stage, called ADDSUB, which is responsible for the additions and subtractions appearing in the right-hand sides of Eqs. (6) and (7). The next stage, called RAC (ROMs and Accumulators) and shown in Fig. 6, consists of eight ROMs and eight corresponding accumulators. Note that all the combinations of the constants and pixel values are stored in ROMs. RAC receives eight pixel values and obtains eight DCT coefficients by looking up ROM tables. The second ADDSUB acts as a butterfly connection between RAC and BREG which consists of 8 serial-in-parallel-out registers.

Figure 5. Architecture of 1-D DCT/IDCT. (Stages: FREG, ADDSUB and butterfly, RAC, ADDSUB and butterfly, BREG; 12-bit data in and data out, 2-bit serial links between stages.)

Figure 6. Architecture of RAC. (Coefficient ROMs, CSA, adder, and register with a 2-bit right shift.)

For IDCT, the first ADDSUB acts as a butterfly

connection between FREG and RAC, and the second

ADDSUB is responsible for the required additions and

subtractions.
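The ROM-and-accumulator idea behind distributed arithmetic can be sketched in a few lines. This is a simplified illustration, not the RAC of Fig. 6: it handles one bit per step instead of two, uses integer coefficients, and the names `build_rom` and `da_inner_product` are hypothetical.

```python
def build_rom(coeffs):
    """ROM[a] holds the sum of the coefficients whose address bit in a is 1."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (a >> i) & 1)
            for a in range(1 << n)]

def da_inner_product(rom, inputs, bits):
    """Compute sum(coeffs[i] * inputs[i]) for two's-complement inputs of `bits` bits."""
    acc = 0
    for b in range(bits):
        addr = 0
        for i, x in enumerate(inputs):
            addr |= ((x >> b) & 1) << i      # bit b of every input forms the ROM address
        term = rom[addr] * (1 << b)
        # The sign bit of a two's-complement number has weight -2^(bits-1)
        acc += -term if b == bits - 1 else term
    return acc

coeffs = [3, -5, 7, 2]
rom = build_rom(coeffs)
inputs = [10, -3, -128, 127]
print(da_inner_product(rom, inputs, 8))
```

No multiplier is needed: every step is one ROM lookup followed by a shifted addition, which is why the constants of Eqs. (6) and (7) can be folded into ROM contents.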

3.2. Quantizer/Dequantizer

We adopted the radix-4 modified Booth's algorithm [12] to speed up the multiplication operation involved in quantization and dequantization. The algorithm, as shown in Table 1, takes care of two bits at a time. Note that in this table, DCT coefficients, denoted by x's, are multipliers and quantizing values, Q, are multiplicands. The architecture of the quantizer is shown in Fig. 7, which includes five adders and five pipeline registers. In the first clock, a DCT coefficient of 12 bits and the corresponding quantizing value from the quantization table are latched in the pipeline registers of stage 1. In the second clock, six values are selected from the set {+0, +Q, −Q, +2Q, −2Q} by x1x0x−1, x3x2x1, x5x4x3, x7x6x5, x9x8x7, and x11x10x9, respectively, and are passed through a set of shift-and-add structures. The results are then stored in the pipeline registers of stage 2. In the third clock, values from the pipeline registers of stage 2 are processed in a similar manner and the results are stored in the pipeline registers of stage 3. After one more shift-and-add operation, the quantized coefficient is stored in the pipeline register of stage 4.

For dequantization, the input to Fig. 7 is a quantized coefficient and the table is replaced with the dequantization table. Note that each element of the dequantization table is the inverse of the corresponding element of the quantization table.

Table 1. The radix-4 modified Booth's algorithm.

x_i   x_{i−1}   x_{i−2}   Operation   Comments
0     0         0         +0          String of zeros
0     1         0         +Q          A single 1
1     0         0         −2Q         Beginning of 1's
1     1         0         −Q          Beginning of 1's
0     0         1         +Q          End of 1's
0     1         1         +2Q         End of 1's
1     0         1         −Q          A single 0
1     1         1         +0          A string of 1's

Figure 7. Architecture of the quantizer. (A 12-bit DCT coefficient and the value Q from the quantization table are latched in stage 1; six selectors of 0, +Q, −Q, +2Q, −2Q driven by the bit triplets feed shift-and-add stages 2 to 4, producing a 12-bit quantized coefficient.)
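The recoding of Table 1 can be exercised in software. The sketch below multiplies a 12-bit signed multiplier by an integer multiplicand using the six overlapping bit triplets named in the text; it sums the partial products sequentially rather than in a pipelined adder tree, and the names `BOOTH` and `booth_multiply` are assumptions.

```python
# (x_i, x_{i-1}, x_{i-2}) -> multiple of Q, copied from Table 1
BOOTH = {
    (0, 0, 0): 0, (0, 1, 0): 1, (1, 0, 0): -2, (1, 1, 0): -1,
    (0, 0, 1): 1, (0, 1, 1): 2, (1, 0, 1): -1, (1, 1, 1): 0,
}

def booth_multiply(x, q, bits=12):
    """Multiply the signed `bits`-bit multiplier x by the multiplicand q."""
    def bit(k):
        # x_{-1} is defined as 0; Python's >> keeps two's-complement bits for x < 0
        return 0 if k < 0 else (x >> k) & 1

    total = 0
    for i in range(1, bits, 2):                 # triplets x1x0x-1, x3x2x1, ..., x11x10x9
        digit = BOOTH[(bit(i), bit(i - 1), bit(i - 2))]
        total += (digit * q) * (1 << (i - 1))   # each triplet has weight 4^((i-1)/2)
    return total

print(booth_multiply(-100, 37))
```

Only six partial products are needed for a 12-bit multiplier, and each is one of 0, ±Q, or ±2Q, so it can be produced by a selector and a one-bit shift.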

3.3. Zig-Zag

The zig-zag module consists of two RAMs and an ad-

dress control, as shown in Fig. 8. The two RAMs,

RAM1 and RAM2, work in double-buffering mode.

When the content of RAM1 is read out in the zig-

zag order, RAM2 is being loaded with another block

of 64 DCT coefficients. Then RAM1 and RAM2 are

switched, namely, RAM2 is read out and RAM1 is

loaded. The switching of the two RAMs is controlled

by a multiplexer.

3.4. Huffman Codec

DC coefficients and AC coefficients are encoded separately. DC coefficients are not encoded directly. Instead, the difference of the DC coefficient of the present block from that of the previous block is used for encoding. To obtain AC codewords for a sequence, the 63 AC coefficients are interpreted as runs of zeros, each of which ends with a nonzero coefficient. Then each run of zeros and its following nonzero coefficient are used for encoding. A DC codeword is derived from two parts, size and amplitude, and each AC codeword is derived from run-length/size and amplitude. Amplitude indicates the difference of the underlying DC coefficient from the previous DC coefficient, or the nonzero AC coefficient following a run of zeros. Size indicates the number of bits required for representing the amplitude in one's complement form. The relationship between size and amplitude is shown in Table 2. Note that for DC coding, size can be up to 11 bits, while for AC coding, up to 10 bits. Run-length indicates the number of zeros in a run of zeros. Table 3 shows all the possible combinations allowed for run-length (in horizontal direction) and size (in vertical direction) for AC coding. Note that when the run-length of a run is greater than 15, two or more codes are required for this run. The ZRL mark in Table 3 indicates a run of 15 zeros followed by a zero AC coefficient (i.e., 16 consecutive zeros), and the EOB mark is used to end a block, as mentioned earlier.

Figure 8. Architecture of the zig-zag module. (RAM1 and RAM2, each 64 × 12, with read/write address control and an output multiplexer.)

When size, amplitude, and run-length are available, we are ready for coding. To obtain the codeword for a DC difference, D, we use the size of D to obtain a Huffman code, CODE1, by looking up the DC Huffman table, as shown in Table 4(a). Let CODE2 be the amplitude of D represented in one's complement form. Then the codeword for D is the concatenation of CODE1 and CODE2. For example, let D be 6. The size for 6 is 3. Then CODE1 is 100, obtained from Table 4(a), and CODE2 is 110. Therefore, the codeword for D is 100110. For the encoding of a run of zeros followed by a nonzero AC coefficient, A, we use the run-length/size combination to obtain a Huffman code, CODE1, by looking up the AC Huffman table, a small part of which is shown in Table 4(b). Let CODE2 be the amplitude of A represented in one's complement form. Then the codeword for this run of zeros and the following AC coefficient is the concatenation of CODE1 and CODE2. For example, suppose we want to find the codeword for the sequence 0003 with A = 3. The run-length is 3 and the size of A is 2. Then CODE1 is 111110111, obtained from Table 4(b), and CODE2 is 11. Therefore, the codeword for this sequence is 11111011111.

For decoding, each block of codewords is converted back to a block of DCT coefficients by looking up the Huffman tables. The decoding process is totally the reverse of the coding process and the description of it is omitted here.

Table 2. Correspondence between size and amplitude.

Size   Amplitude
0      0
1      −1, 1
2      −3, −2, 2, 3
3      −7 ∼ −4, 4 ∼ 7
4      −15 ∼ −8, 8 ∼ 15
5      −31 ∼ −16, 16 ∼ 31
6      −63 ∼ −32, 32 ∼ 63
7      −127 ∼ −64, 64 ∼ 127
8      −255 ∼ −128, 128 ∼ 255
9      −511 ∼ −256, 256 ∼ 511
10     −1023 ∼ −512, 512 ∼ 1023
11     −2047 ∼ −1024, 1024 ∼ 2047

Table 3. Possible combinations of run-length and size for AC coding.

Size \ Run-length   0     1     2     ···   9     10    11    ···   14    15
0                   EOB   N/A   N/A   ···   N/A   N/A   N/A   ···   N/A   ZRL
1                   01    11    21    ···   91    A1    B1    ···   E1    F1
2                   02    12    22    ···   92    A2    B2    ···   E2    F2
3                   03    13    23    ···   93    A3    B3    ···   E3    F3
4                   04    14    24    ···   94    A4    B4    ···   E4    F4
5                   05    15    25    ···   95    A5    B5    ···   E5    F5
6                   06    16    26    ···   96    A6    B6    ···   E6    F6
7                   07    17    27    ···   97    A7    B7    ···   E7    F7
8                   08    18    28    ···   98    A8    B8    ···   E8    F8
9                   09    19    29    ···   99    A9    B9    ···   E9    F9
10                  0A    1A    2A    ···   9A    AA    BA    ···   EA    FA

Table 4. Huffman tables: (a) DC Huffman table; (b) Part of AC Huffman table.

(a)                       (b)
Size   Huffman code       Run-length/Size   Huffman code
0      00                 ···               ···
1      010                3/1               111010
2      011                3/2               111110111
3      100                3/3               111111110101
4      101                3/4               1111111110001111
5      110                3/5               1111111110010000
6      1110               3/6               1111111110010001
7      11110              3/7               1111111110010010
8      111110             3/8               1111111110010011
9      1111110            3/9               1111111110010100
10     11111110           3/10              1111111110010101
11     111111110          ···               ···
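The two worked examples above (D = 6 and the sequence 0003) can be reproduced with a short sketch. The DC table is taken from Table 4(a); only two AC entries from Table 4(b) are included, and the helper names are hypothetical.

```python
DC_CODES = {0: "00", 1: "010", 2: "011", 3: "100", 4: "101", 5: "110",
            6: "1110", 7: "11110", 8: "111110", 9: "1111110",
            10: "11111110", 11: "111111110"}
AC_CODES = {(3, 1): "111010", (3, 2): "111110111"}   # fragment of Table 4(b)

def size_of(v):
    # Number of bits needed for |v|, as in Table 2
    return abs(v).bit_length()

def amplitude_bits(v):
    """Amplitude in one's complement form, `size` bits wide."""
    n = size_of(v)
    if n == 0:
        return ""
    if v < 0:
        v += (1 << n) - 1        # one's complement of |v| within n bits
    return format(v, "0{}b".format(n))

def dc_codeword(diff):
    return DC_CODES[size_of(diff)] + amplitude_bits(diff)

def ac_codeword(run, coeff):
    return AC_CODES[(run, size_of(coeff))] + amplitude_bits(coeff)

print(dc_codeword(6))      # CODE1 = 100, CODE2 = 110
print(ac_codeword(3, 3))   # CODE1 = 111110111, CODE2 = 11
```

Negative amplitudes use the bitwise complement of the magnitude, so −6 becomes 001 where +6 is 110, and the leading bit of CODE2 tells the decoder the sign.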

The overall architecture of the Huffman codec module is shown in Fig. 9. In this figure, zero-run detector, size detector, combiner, and barrel shift-out are involved in the coding process, while barrel shift-in, leading ones detector, and reconstructed data generator are involved in the decoding process. The two components, address MUX and Huffman tables, are involved in both encoding and decoding.

Figure 9. The Huffman codec module.

3.4.1. Encoding Path. The zero-run detector is used

to count the number of successive zeros in a section of

an input block. It is equipped with a zero-run counter.

The entries of a block are fed into the zero-run detec-

tor one by one. If the input entry is zero, the zero-run

counter increments by one. If the input entry is non-

zero, it is sent out to the size detector and the zero-run

counter is reset to zero. The size detector determines

the size of the input value.
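The behavior of the zero-run detector and size detector can be sketched as follows. The symbol format, the function name `run_length_symbols`, and the choice to emit EOB only when trailing zeros remain (as in the JPEG standard) are assumptions of this sketch rather than a description of the circuit.

```python
def run_length_symbols(ac):
    """Turn the 63 AC coefficients (zig-zag order) into run/size/amplitude symbols."""
    symbols, run = [], 0
    for v in ac:
        if v == 0:
            run += 1                 # zero-run counter increments
            continue
        while run > 15:              # a run longer than 15 needs ZRL marks
            symbols.append("ZRL")
            run -= 16
        symbols.append((run, abs(v).bit_length(), v))   # (run-length, size, amplitude)
        run = 0                      # counter is reset after a nonzero entry
    if run > 0:
        symbols.append("EOB")        # remaining zeros are covered by end-of-block
    return symbols

ac = [5, 0, 0, 0, 3] + [0] * 58
print(run_length_symbols(ac))
```

Each tuple supplies the run-length and size used to address the Huffman tables, while the amplitude is appended afterwards in the combiner.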

For coding DC coefficients, address MUX outputs size to form the address for the Huffman tables. However, for coding AC coefficients, address MUX outputs both run-length and size to form the address for the Huffman tables. A detailed description about addressing for encoding will be given in the next section. The output from the Huffman tables is a unique Huffman code which is concatenated with amplitude in the combiner to form a codeword for each input.

Obviously, codewords obtained are variable in length. However, the width of the data bus is fixed. The





Figure 11. Architecture of the barrel shift-in module. (Two 32-bit registers feed a barrel shifter under control of a 5-bit length signal.)

ones, and group 9 having at least 9 leading ones. Therefore, 2^3 = 8 RAM locations are needed to decode the codewords in each of the 10 groups, requiring a total of 80 locations for a DC decoding table. Similarly, the codewords for the AC coefficients are divided into 10 groups. The maximum possible tail length of the codewords is 8. Since the first bit of the tail is always 0, up to 7 additional trailing bits are required to decode the codewords within each group. Hence 2^7 = 128 RAM locations are needed to decode the codewords in each of the 10 groups, requiring a total of 1280 locations for an AC decoding table. Since each entry is 12 bits wide and there are two sets of Huffman tables, 2720 × 12 = 32640 RAM bits are required for the decoding tables.

The encoder was designed to utilize the hardware required for the decoder. By grouping the codewords by code length, the encoder tables can be reduced to 12 bits to fit within the decoding tables. Instead of looking up the codeword directly, the length is used to determine the first code value for that group. Adding the first code value and the offset of the codeword within the group produces the desired codeword. With this approach, only an additional RAM of size 56 × 16 = 896 bits is required. Therefore, a total of 32640 + 896 = 33536 RAM bits are required for the Huffman codec.
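The storage figures quoted above for the method of [5] follow from simple counting, which the short check below reproduces. The variable names are illustrative only; every number comes from the text.

```python
GROUPS = 10          # codeword groups by number of leading ones
SETS = 2             # two sets of Huffman tables
ENTRY_BITS = 12      # width of each decoding entry in [5]

dc_locations = GROUPS * 2 ** 3      # 3 trailing bits per DC group
ac_locations = GROUPS * 2 ** 7      # 7 trailing bits per AC group
decode_entries = SETS * (dc_locations + ac_locations)
decode_bits = decode_entries * ENTRY_BITS
encode_bits = 56 * 16               # additional encoder RAM
total_bits = decode_bits + encode_bits

print(decode_entries, decode_bits, total_bits)
```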

4.2. Our Improvement

As in [5], we treat a Huffman code as the concate-

nation of two parts, ONES and CBITS, which denote

the leading ones and the remaining bits (excluding the

leading zero), respectively, of the code. We also di-

vide AC codes and DC codes, respectively, into ten

groups. Therefore, like [5], we need 2720 entries for

decoding tables. However, in Ruetz’s method, each

entry is 12 bits wide. We propose an improvement

to reduce the width of each entry as follows. Each

of the 2720 entries contains two fields, RUN-LEN

and SIZE, only. Therefore, each entry is 8 bits wide.

The code length C-LEN required for decoding is derived by looking up another small table with each entry being 4 bits wide. Furthermore, encoding

is done by taking advantage of the decoding tables

without additional storage. As a result, we can save

about 10 K bits more than Ruetz’s method for Huffman

tables.

We use two RAMs, RAM3 and RAM4, to implement the Huffman tables, as shown in Fig. 12. RAM3 has 2720 entries with 8 bits in each entry, and RAM4 has 376 entries with 4 bits in each entry. For decoding, an address for RAM3 contains three variable fields: Group, SEVEN/THREE, and N, with 4 bits, 7/3 bits, and one bit, respectively. The Group field indicates the group of the leading ones of the underlying Huffman code. The SEVEN/THREE field contains the 7/3 bits behind the zero that follows the leading ones. This works because Huffman coding ensures that none of the codes is a prefix of another code. Of course, multiple entries in RAM3 may store the same content due to this way of addressing. The N field indicates which set of the tables is used. Each entry in RAM3 contains the information about run-length and size, indicated as RUN-LEN and SIZE, respectively. However, we need to provide the barrel shifter of the barrel shift-in module with the length of CBITS so that the exact number of bits can be shifted. The length of CBITS can be obtained from the code length field, C-LEN, of RAM4 (in fact, C-LEN is one less than the code length, as explained later). When we have RUN-LEN and SIZE from RAM3, we can get the length of CBITS from RAM4 (the length of CBITS is equal to the difference of C-LEN and the number of leading ones of the codeword).

Figure 12. Addressing of RAM tables. (Address and data formats of RAM3, 2720 × 8, and RAM4, 376 × 4, for encoding and decoding.)

Figure 13. Architecture of the Huffman decoder. (Next Code register, leading ones detector, address MUX, RAM3 with 2720 × 8 words, RAM4 with 376 × 4 words, and barrel shifters for the leading ones and CBITS.)

We also use RAM3 and RAM4 for encoding. From Table 4, it is clear that 12 entries and 11 × 16 = 176 entries are required for DC and AC encoding, respectively. Therefore, 188 × 2 = 376 entries in total are needed for two sets of tables. The way of addressing for encoding is shown in the upper part of Fig. 12. An address for RAM3 and RAM4 contains three variable fields: RUN-LEN, SIZE, and N. An entry in RAM3 contains EIGHT, the lower-order eight bits of a Huffman code, and an entry in RAM4 contains C-LEN, one less than the length of a Huffman code. For example, consider the Huffman code 111010 for 3/1 in Table 4(b). RAM3 will include an entry of 00111010 and RAM4 will include an entry of 0101 (i.e., 5) at address 00000010011N for this Huffman code. Consider another example of the Huffman code 1111111110010000 for 3/5 in Table 4(b). RAM3 will include an entry of 10010000 and RAM4 will include an entry of 1111 (i.e., 15) at address 00000110101N for this Huffman code. The total number of RAM bits required is thus 2720 × 8 + 376 × 4 = 23264 bits, which is 10272 bits fewer than that required in Ruetz's method.

Figure 14. The JPEG system implemented with FPGAs: (a) Components involved in the system; (b) Connecting the system to a PC.
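The EIGHT and C-LEN fields can be illustrated with the two example codes above. The sketch below splits a code into the two stored fields and rebuilds it again. The rebuilding step assumes that every bit above the lower eight is a one, which holds for the codes quoted in this section but is an assumption of the sketch, not a statement of how the chip restores the code; `pack` and `unpack` are hypothetical names.

```python
def pack(code):
    """Split a Huffman code (bit string) into the stored fields (EIGHT, C-LEN)."""
    eight = int(code[-8:], 2)      # lower-order eight bits, zero-extended if shorter
    c_len = len(code) - 1          # one less than the code length
    return eight, c_len

def unpack(eight, c_len):
    """Rebuild the code, assuming all bits above the lower eight are ones."""
    length = c_len + 1
    if length <= 8:
        return format(eight, "0{}b".format(length))
    return "1" * (length - 8) + format(eight, "08b")

for code in ("111010", "1111111110010000"):
    eight, c_len = pack(code)
    print(format(eight, "08b"), format(c_len, "04b"), unpack(eight, c_len) == code)

# Storage check for the two RAMs, using the numbers quoted in the text
assert 2720 * 8 + 376 * 4 == 23264
assert 33536 - 23264 == 10272
```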



The Huffman decoder based on these RAM architectures is shown in Fig. 13. In this figure, a sequence of 16 input bits is loaded into the Next Code register through several barrel shifters. We have the barrel shifters operate in parallel with the other parts of the decoder. When the output of the Next Code register passes through the leading ones detector, the barrel shifter in the shift-in module shifts in the input bit stream. When the data pass through the address multiplexer and RAM3, the barrel shifter shifts in a number of input bits with the size equal to the number of leading ones. When the data pass through RAM4, the barrel shifter shifts in a number of input bits with the size equal to the length of the CBITS of the code.

5. Emulation with FPGAs

VHDL was adopted as the high-level language for

implementing the JPEG baseline system. VHDL

codes were written to describe the architecture and

behavior of each component. After the function of the

design had been tested successfully with the VHDL

functional simulator, the design was synthesized

by Synopsys with the Altera FLEX 10K FPGA

technology. The whole system was partitioned and fit into two FLEX 10K FPGAs, an EPF10K100 and an EPF10K70. Placement, routing, and programming of the FPGAs were done by Altera MAX+PLUS II.

Figure 15. LENA: (a) original image; (b) reconstructed image.

5.1. Construction of Circuit Boards

The DCT/IDCT module and interface circuits are

placed in the EPF10K100 FPGA. The transposition

RAM of the DCT module is fit into the embedded

RAM architecture of the FPGA. The quantization, zig-

zag, and codec modules are placed in the EPF10K70

FPGA. The RAMs of quantization and zig-zag mod-

ules are fit into the embedded RAM architecture of the

FPGA. Huffman tables are implemented in an external

static RAM.

The FPGAs are mounted on two circuit boards, as shown in Fig. 14(a), and are connected together by two 50-bit flat cables and an 8-bit download cable. The two flat cables form the path for the signals flowing between the two FPGAs. The download cable is connected to the parallel port of a PC and programming of the 2 cascaded FPGAs is done by the PC, as shown in Fig. 14(b). The whole system communicates with the outside world via the interface circuits.

The estimated propagation delays of the two circuit boards are 189.7 ns and 218 ns, respectively. The clock rate of the ISA bus on a PC is about 8.2 MHz. We divide the clock on the ISA bus by 2 and apply it to our circuit boards. The system works correctly at the clock rate of 4.1 MHz.

5.2. Experimental Results

A program written in the C language is used to monitor the status of the FPGAs and read results from and write data into the circuit boards. The subroutine “import( )” reads data from the specified I/O address and “outport( )” writes data to the specified I/O address. When encoding, an image file of raw data is opened for feeding pixel data into the circuit boards and another file is opened for storing the resulting codewords coming from the circuit boards. When decoding, a file of codewords is fed into the circuit boards and another file is opened for storing the resulting pixel values.

Figure 16. PEPPERS: (a) original image; (b) reconstructed image.

Figure 17. BABOON: (a) original image; (b) reconstructed image.

Figure 18. Floor plan of the JPEG chip.

Some standard testing images are used as benchmarks to test the functionality of our design. Figure 15 is a 512 × 512 image with three color components in RGB. The image is first compressed and then decompressed. The bit rate is 0.055 bits/pixel and the SNR is about 36.7 dB. SNR (Signal to Noise Ratio) is defined as follows:

\[ \mathrm{SNR} = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} \bar{s}(i, j)^2}{\sum_{i=1}^{M} \sum_{j=1}^{N} [s(i, j) - \bar{s}(i, j)]^2} \qquad (8) \]

where M and N denote the dimensions of x and y, respectively, of the image, s(i, j) denotes pixel values of the original image, and s̄(i, j) denotes pixel values of the reconstructed image. Obviously, a large SNR indicates a small distortion of the reconstructed image from the original image. A low bit-rate means a high compression ratio and thus a small transmission bandwidth is required. Figure 16 is a color image in RGB format. The bit rate is 0.058 bits/pixel and the SNR is 32.9 dB. Figure 17 is also a color image. The bit rate and the SNR are 0.127 bits/pixel and 27.8 dB, respectively.
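For reference, Eq. (8) can be computed as in the sketch below. The conversion to decibels as 10·log10 of the ratio is an assumption, since the text reports SNR values in dB without giving the conversion; the function name `snr_db` and the tiny test images are illustrative.

```python
import math

def snr_db(original, reconstructed):
    """SNR of Eq. (8), expressed in decibels (assumed 10*log10 of the ratio)."""
    signal = sum(r * r for row in reconstructed for r in row)
    noise = sum((o - r) ** 2
                for row_o, row_r in zip(original, reconstructed)
                for o, r in zip(row_o, row_r))
    # noise == 0 (a lossless reconstruction) would make the ratio undefined
    return 10 * math.log10(signal / noise)

original = [[11, 11], [11, 11]]
reconstructed = [[10, 10], [10, 10]]
print(snr_db(original, reconstructed))
```

In this toy case the signal energy is 400 and the error energy is 4, so the ratio is 100.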

6. Chip Implementation

A single chip for the JPEG baseline system is constructed by standard cells with the 0.6 µm triple-metal process. The design is synthesized using Synopsys with the Compass 0.6 µm standard cell library. Placement and routing are done by Cadence.

Table 5. Comparison of JPEG chips.

Author                        Size (mm²)                        Complexity             RAM Memory   Clk Rate   Power Consumed
Sun and Lee                   45.54 (0.6 µm)                    411,745 transistors    23.264 K     27 MHz     1000 mW
Ruetz et al. [5]              Two chips: 62.4 and 96.04 (1 µm)  N/A                    33.536 K     30 MHz     N/A
Kovac and Ranganathan [13]    168 (1 µm)                        N/A                    N/A          100 MHz    N/A
Okada et al. [14]             54 (0.6 µm)                       70 K (Gate Count)      17 K         18 MHz     400 mW
Asada et al. [15]             125 (0.5 µm)                      200 K (Gate Count)     38 K         17.5       500 mW
Hunter et al. [16]            14 (0.35 µm)                      50 K (Gate Count)      11.5 K       30 MHz     840 mW

The core is divided into 4 parts roughly according to

the size of the four major modules: DCT/IDCT, quan-

tization, zig-zag, and Huffman codec. Each RAM used

is then placed within the respective part and around the

core properly. The floor plan of the chip is shown in

Fig. 18.

Clock trees and the constraints on placing stan-

dard cells are added before routing. Finally, we got

the layout as shown in Fig. 19. DRC (design rule

check), ERC (electrical rule check), and LVS (lay-

out versus schematic) are performed to check the

correctness of the layout. Post-layout timing analy-

sis is done using TimeMill. Power analysis is done by

PowerMill.




Figure 19. Layout of the JPEG chip.

Simulation results have shown that the chip can op-

erate in real time at any compression ratio. A pixel or a

codeword can be processed within a single clock cycle

and the maximal working frequency achieved is about

27 MHz. The chip contains 411,745 transistors, with a die size of 6.6 × 6.9 mm². Power dissipation is about

1 watt at the maximal working frequency. A com-

parison between our chip and other JPEG designs [5,

13–16] is given in Table 5. Note that the work of [5]

contains two chips. DCT and zig-zag are contained in

a chip with 62.4 mm², and coding and quantization are contained in another chip with 96.04 mm². Some of the

chips, e.g., [16], are only for JPEG encoders. Appar-

ently, our chip is highly competitive with other chips.

Note that our focus has been on the development of a single-chip JPEG codec in a campus lab with limited resources, rather than on the application-specific standard product (ASSP) design approach adopted for commercially available chips [16].
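The real-time figure quoted above can be sanity-checked with a little arithmetic. The sketch below is not from the paper: it uses the frame size, frame rate, clock rate, and die dimensions stated in the text, and additionally assumes 4:2:2 chroma subsampling (two 8-bit samples per pixel on average), which this section does not state. All names are illustrative.

```python
# Rough throughput check for the real-time claim, using figures from the text.
WIDTH, HEIGHT = 720, 480        # CCIR601 frame size in pixels
FRAMES_PER_SECOND = 30
CLOCK_HZ = 27_000_000           # maximal working frequency, one sample per cycle

# Assumption (not stated in this section): 4:2:2 chroma subsampling, i.e. on
# average one luma and one chroma sample per pixel.
SAMPLES_PER_PIXEL_422 = 2


def required_sample_rate(samples_per_pixel: int) -> int:
    """Samples per second that must be processed to keep up with the video."""
    return WIDTH * HEIGHT * FRAMES_PER_SECOND * samples_per_pixel


luma_only = required_sample_rate(1)
ycbcr_422 = required_sample_rate(SAMPLES_PER_PIXEL_422)

# Die area from the quoted dimensions, and the combined area of the
# two-chip set of [5] as listed in Table 5.
die_area_mm2 = 6.6 * 6.9
two_chip_area_mm2 = 62.4 + 96.04

print(f"luma only : {luma_only / 1e6:.3f} Msamples/s")
print(f"4:2:2     : {ycbcr_422 / 1e6:.3f} Msamples/s")
print(f"clock     : {CLOCK_HZ / 1e6:.3f} MHz")
print(f"keeps up at one sample per cycle: {ycbcr_422 <= CLOCK_HZ}")
print(f"die area  : {die_area_mm2:.2f} mm^2 vs {two_chip_area_mm2:.2f} mm^2 for [5]")
```

Under these assumptions the required sample rate stays below the 27 MHz clock, which is consistent with processing one pixel per clock cycle in real time.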

7. Conclusion

We have presented the design and implementation of a single chip for the JPEG baseline system. The chip is mainly composed of four modules: DCT/IDCT, quantizer/dequantizer, zig-zag, and Huffman codec. The chip was designed with modern VLSI CAD tools. VHDL was used to describe the architecture and behavior of each module. Then the functionality of the design was verified with field programmable gate arrays (FPGAs) on circuit boards. Finally, a single chip was implemented using the standard cell design approach with the 0.6 µ triple-metal process. The resulting chip contains 411,745 transistors, with a die size of 6.6 × 6.9 mm². The chip can operate in real time at any compression ratio.

Acknowledgments

The authors would like to thank the anonymous referees for their constructive comments and suggestions. This work was supported by the National Science Council under the grants NSC-87-2213-E-110-012 and NSC-88-2218-E-110-008. A preliminary version of this paper appeared in Proceedings of the 2000 Asia Pacific Conference on Multimedia Technology and Applications, Kaohsiung, Taiwan, December 2000.

References

1. “JPEG digital compression and coding of continuous-tone still

image,” Technical Report Draft ISO 10918, 1991.

2. “Coding of moving pictures and associated audio,” Committee

Draft of Standard ISO11172: ISO/MPEG/90/176, 1990.

3. “Video codec for audio visual services at p × 64 k bits per

second,” CCITT Recommendation H.261, 1990.

4. K. Sayood, Introduction to Data Compression. San Francisco,

CA: Morgan Kaufmann, 2000.

5. P.A. Ruetz, P. Tong, D. Luthi, and P.H. Ang, “A Video-Rate JPEG Chip Set,” Journal of VLSI Signal Processing, vol. 5, 1993, pp. 29–38.

6. H. Park and V.K. Prasanna, “Area Efficient VLSI Architectures for Huffman Coding,” IEEE Transactions on Circuits and Systems, 1993, pp. 568–575.

7. A. Mukherjee, N. Ranganathan, J.W. Flieder, and T. Acharya,

“MARVLE: A VLSI Chip for Data Compression Using Tree-

Based Codes,” IEEE Transactions on Very Large Scale Integra-

tion Systems, 1993, pp. 203–213.

8. Y.S. Lee, J.J. Jong, T.S. Perng, L.C. Hsu, M.Y. Jaw, and C.Y. Lee, “A Memory-Based Architecture for Very-High-Throughput Variable Length Codec Design,” IEEE International Symposium on Circuits and Systems, June 1997, pp. 2096–2099.

9. M. Sun, T. Chen, and A.M. Gottlieb, “VLSI Implementation of

a 16 × 16 Discrete Cosine Transform,” IEEE Transactions on

Circuits and Systems, vol. 36, 1989, pp. 610–617.

10. S.-F. Hsiao and J.-M. Tseng, “Parallel, Pipelined and Folded

Architectures for Computation of 1-D and 2-D DCT in Image

and Video Coding,” Journal of VLSI Signal Processing, vol. 28,

no. 3, 2001, pp. 205–220.

11. S.A. White, “Applications of Distributed Arithmetic to Digital

Signal Processing: A Tutorial Review,” IEEE ASSP Magazine,

June 1989, pp. 4–17.

12. I. Koren, Computer Arithmetic Algorithms, Prentice-Hall International Editions, 1993.

13. M. Kovac and N. Ranganathan, “JAGUAR: A Fully Pipelined

VLSI Architecture for JPEG Image Compression Standard,”

Proceedings of the IEEE, vol. 83, no. 2, 1995, pp. 247–258.

14. S. Okada, Y. Matsuda, T. Watanabe, and K. Kondo, “A Single

Chip Motion JPEG Codec LSI,” IEEE Transactions on Consumer Electronics, vol. 43, no. 3, 1997, pp. 418–422.

15. S.K. Asada, H. Ohtsubo, T. Fujihira, and T. Imaide, “Development of a Low-Power MPEG1/JPEG Encode/Decode IC,” IEEE Transactions on Consumer Electronics, vol. 43, no. 3, 1997, pp. 639–644.

16. J.K. Hunter, J.V. McCanny, A. Simpson, Y. Hu, and J.G. Doherty, “JPEG Encoder System-on-a-Chip Demonstrator,” in Proceedings of the Thirty-Third Asilomar Conference, vol. 1, 1999, pp. 762–766.

Sun-Hsien Sun was born on October 11, 1971 in Taipei, Taiwan. He received bachelor's and master's degrees in electrical engineering from National Sun Yat-Sen University (NSYSU), Kaohsiung, Taiwan, ROC, in 1994 and 1998, respectively. His main interests include VLSI design for signal processing and hardware description languages.

Shie-Jue Lee was born in Kin-Men, ROC on August 15, 1955. He received the B.S.E.E. and M.S.E.E. degrees in 1977 and 1979, respectively, from National Taiwan University, and the Ph.D. degree from the Department of Computer Science at the University of North Carolina, Chapel Hill, USA, in 1990. Dr. Lee joined the faculty of the Department of Electrical Engineering at National Sun Yat-Sen University, Taiwan, in 1983, and has been a professor in the department since 1994. His research interests include machine intelligence, multimedia communications, and chip design.

Dr. Lee served as the acting director and the director of the

Telecommunication Development and Research Center of National

Sun Yat-Sen University during 1997–2000, and the director of the

Southern Telecommunications Research Center, National Science

Council, Taiwan, in 1998–1999. He is now professor and chairman

of the Electrical Engineering Department, National Sun Yat-Sen University.

Dr. Lee is a member of IEEE, IEICE, Association for Automated

Reasoning, Chinese Fuzzy Systems Association, Institute of Infor-

mation and Computing Machinery, and Taiwanese Association of

Artificial Intelligence.