8/7/2019 A JPEG Chip for Image Compression and Decompression
http://slidepdf.com/reader/full/a-jpeg-chip-for-image-compression-and-decompression 1/18
44 Sun and Lee
of a single chip which requires no additional wirings
in the circuit board. We have also removed data redundancies occurring in [5] and reduced the size of
the Huffman tables with a modified RAM architec-
ture. The chip was implemented with VLSI CAD tools.
Firstly, VHDL (VHSIC Hardware Description Lan-
guage) codes were written to describe the architecture
and behavior of the chip. Each block of the chip was
defined and simulated. Then the functionality of the
design was verified with field programmable gate ar-
rays (FPGAs) on circuit boards. Finally, a single chip
was implemented using the standard cell design approach with the 0.6 µm triple-metal process. The chip
can compress and decompress CCIR601 images, with
resolution of 720 × 480 pixels, at a rate of 30 images per second without any restriction on the compression
ratio. The chip contains 411,745 transistors, with a chip
size of 6.6 × 6.9 mm².
A digital image can be regarded as a two-
dimensional array of pixels. For gray-level images,
each pixel is a value between 0 and 255 (i.e., repre-
sented by 8 bits). For colored images, each pixel con-
tains three values, e.g., RGB or YUV, with each value
lying between 0 and 255. In JPEG, values of the same
category are processed separately from values of other
categories. For example, Y values of a colored image
are processed separately from U values and V values. For convenience, we treat each pixel as a value represented
by 8 bits in the rest of the paper.
The rest of the paper is organized as follows. In
Section 2, we provide a brief introduction of the JPEG
baseline system. Section 3 describes the JPEG hard-
ware implementation. Major modules, DCT/IDCT,
quantizer/de-quantizer, zig-zag, and Huffman codec,
are presented in order. In Section 4, we describe the
RAM architecture of the Huffman tables. Section 5 de-
scribes the implementation in FPGAs and shows some
experimental results. The implementation in ASIC is
described in Section 6 and our conclusion is given in
Section 7.
Figure 1. JPEG compression system. (Image data → offset (−128) → 8 × 8 DCT → quantization → zig-zag → Huffman coder → compressed data.)
2. Overview of JPEG Baseline System
Figure 1 shows the major components of the JPEG
compression system. The decompression system is es-
sentially the same with data flowing in the opposite di-
rection and with each function replaced by its inverse.
Four operations are involved in the compression pro-
cess: DCT (Discrete Cosine Transform), quantization,
zig-zag, and Huffman coding. The signed pixel data of
a picture are grouped into 8 × 8 blocks and each block
is transformed by DCT into 8 × 8 = 64 values called
DCT coefficients. The upper-left corner in an 8 × 8 block
of the DCT coefficients is the DC coefficient and the
other 63 values are AC coefficients. The 64 coefficients
are then quantized using corresponding values from a quantization table. After quantization, the 64 quantized
coefficients are converted into a one-dimensional
sequence by the zig-zag operation. Finally, the coefficients
are encoded by Huffman coding and codewords
are obtained by looking up DC and AC tables. For decompression,
the Huffman decoder decodes the given
compressed data. Then the data are organized into 8×8
two-dimensional blocks. After dequantization the data
in each block are transformed to a set of 8 × 8 pixel
values by the Inverse DCT (IDCT).
The DCT transforms a picture in the spatial domain into another
in the frequency domain. As mentioned, each pixel in an original image is assumed to represent a value
between 0 and 255. Before DCT, level shift is done by
subtracting 128 from each pixel, making a pixel value
range from −128 to 127. Then the image is partitioned
into 8 × 8 blocks and these blocks are processed one
by one left to right and top to bottom. The DCT for a
block is defined as follows:
$$S(u, v) = \frac{C(u)\,C(v)}{4} \sum_{x=0}^{7} \sum_{y=0}^{7} s(x, y) \cos\frac{(2x + 1)u\pi}{16} \cos\frac{(2y + 1)v\pi}{16} \tag{1}$$
where s(x, y) is the pixel value, S(u, v) is the DCT coefficient, and
$$C(k) = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{when } k = 0; \\ 1 & \text{when } k \neq 0. \end{cases} \tag{2}$$
When a block is processed by DCT, high-frequency
coefficients appear at the lower-right corner of the block
while low-frequency coefficients appear at the upper-left
corner.
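As an illustration of Eqs. (1) and (2), the following is a minimal software sketch, not the hardware data path of the chip. The function names `c`, `level_shift`, and `dct_2d` are hypothetical, and the block is assumed to be indexed as `block[x][y]`.

```python
import math

def c(k):
    """Normalisation factor C(k) of Eq. (2)."""
    return 1 / math.sqrt(2) if k == 0 else 1.0

def level_shift(block):
    """Subtract 128 so that pixel values 0..255 map to -128..127."""
    return [[p - 128 for p in row] for row in block]

def dct_2d(block):
    """Direct evaluation of Eq. (1) on an 8 x 8 block of level-shifted pixels."""
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            acc = 0.0
            for x in range(8):
                for y in range(8):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / 16)
                            * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = c(u) * c(v) / 4 * acc
    return out

flat = [[200] * 8 for _ in range(8)]      # a uniform block has no AC energy
coeffs = dct_2d(level_shift(flat))
print(round(coeffs[0][0], 6))             # DC term, 8 * (200 - 128) up to rounding error
```

For a uniform block only the DC coefficient is non-zero, which matches the statement that low-frequency energy gathers at the upper-left corner.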
For quantization, the DCT coefficients obtained from
the DCT module are divided by the values defined in
the quantization table, which contains 8 × 8 entries, i.e.,
$$S_q(u, v) = \frac{S(u, v)}{q(u, v)} = S(u, v) \times \frac{1}{q(u, v)} \tag{3}$$
where $S_q(u, v)$ is the quantized coefficient of S(u, v)
and $1/q(u, v)$ is the corresponding quantizing value stored
at position (u, v) of the quantization table. If the divisor
is large, the bit-rate will be low but the quality
of the reconstructed image will be bad; and vice versa.
Therefore, the user may select from different quan-
tization tables to achieve a desired trade-off between
bit-rate and quality of reconstruction.
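The trade-off described by Eq. (3) can be sketched in software as follows. This is a sketch under stated assumptions: rounding to the nearest integer is assumed (the text does not specify the rounding rule), and the table `q_table` is a hypothetical example, not one of the tables used in the chip.

```python
def quantize(coeffs, recip_table):
    """Eq. (3): multiply each coefficient by the stored reciprocal 1/q(u, v).
    Rounding to the nearest integer is an assumption of this sketch."""
    return [[round(coeffs[u][v] * recip_table[u][v]) for v in range(8)]
            for u in range(8)]

def dequantize(qcoeffs, q_table):
    """Inverse step: multiply each quantized coefficient by q(u, v)."""
    return [[qcoeffs[u][v] * q_table[u][v] for v in range(8)] for u in range(8)]

# Hypothetical table whose divisor grows with the frequency index u + v.
q_table = [[8 + 4 * (u + v) for v in range(8)] for u in range(8)]
recip_table = [[1 / q for q in row] for row in q_table]

# Hypothetical coefficients that decay towards high frequencies.
coeffs = [[100.0 / (1 + u + v) for v in range(8)] for u in range(8)]
quantized = quantize(coeffs, recip_table)
restored = dequantize(quantized, q_table)
```

With rounding to nearest, the reconstruction error of each coefficient is bounded by half of its divisor, so larger divisors give a lower bit-rate at the price of a larger error.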
For a block with a small variation of pixel values, the
high-frequency coefficients obtained by DCT tend to
be small. Furthermore, the quantization operation performed
previously intentionally makes the high-frequency DCT
coefficients smaller: coefficients
at lower frequencies are divided by smaller integers
while those at higher frequencies are divided by larger integers. Therefore,
the lower-right triangular part of a coefficient block
tends to contain many zero entries. The zig-zag operation,
shown in Fig. 2, places the DCT coefficients of a
block in a sequence from low to high frequencies. As a
result, long sections of successive zeros are more likely
to occur in the tail of the sequence, which is good for
Huffman coding.
Huffman coding is a variable-length coding method.
Its idea is to use fewer bits to represent a symbol which
appears more frequently and more bits to represent a
symbol which appears less frequently. The Huffman coder
receives the sequences obtained from the zig-zag module
and generates one block of codewords for each such
sequence. A block of codewords consists of one DC
codeword and one or more AC codewords, as shown
in Fig. 3. Note that an end-of-block (EOB) mark is
inserted at the end of each block. In this way, a block
of 8 × 8 pixel values is turned into a block of codewords
and the effect of compression is thus achieved. The Huffman
decoder does the reverse of the Huffman coder, i.e., it
receives blocks of codewords and generates blocks of
corresponding DCT coefficients.
 0  1  5  6 14 15 27 28
 2  4  7 13 16 26 29 42
 3  8 12 17 25 30 41 43
 9 11 18 24 31 40 44 53
10 19 23 32 39 45 52 54
20 22 33 38 46 51 55 60
21 34 37 47 50 56 59 61
35 36 48 49 57 58 62 63
Figure 2. Sequence obtained by zig-zag.
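The zig-zag sequence of Fig. 2 can be generated by walking the anti-diagonals of the block in alternating directions. The sketch below is a software illustration only; the names `zigzag_index_table` and `zigzag_scan` are hypothetical.

```python
def zigzag_index_table(n=8):
    """Return an n x n table whose entry [r][c] is the zig-zag position of (r, c)."""
    table = [[0] * n for _ in range(n)]
    pos = 0
    for d in range(2 * n - 1):                     # d = r + c selects an anti-diagonal
        rows = range(max(0, d - n + 1), min(d, n - 1) + 1)
        # Even diagonals run from bottom-left to top-right, odd ones the other way.
        for r in (reversed(rows) if d % 2 == 0 else rows):
            table[r][d - r] = pos
            pos += 1
    return table

def zigzag_scan(block):
    """Reorder a square block into a one-dimensional zig-zag sequence."""
    n = len(block)
    table = zigzag_index_table(n)
    out = [0] * (n * n)
    for r in range(n):
        for c in range(n):
            out[table[r][c]] = block[r][c]
    return out

for row in zigzag_index_table():
    print(" ".join("%2d" % v for v in row))
```

Scanning a quantized block with `zigzag_scan` places the low-frequency coefficients first, so the zeros of the lower-right part cluster at the tail.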
3. JPEG Implementation
The JPEG system is partitioned into four modules:
DCT/IDCT, quantizer/dequantizer, zig-zag, and
Huffman codec. The DCT coefficients of a block
of pixel values are obtained by the cascade of two
one-dimensional (1-D) DCT processors. The quantizer/dequantizer
is implemented by the radix-4 modified
Booth’s algorithm. The zig-zag operation is performed
by a dual-buffering mechanism. Finally, the
Huffman codec employs an efficient architecture for
Huffman decoding.
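The cascade of two 1-D DCT passes with a transposition in between can be sketched in software as follows. This is an illustration under assumptions: the helper names are hypothetical, floating-point arithmetic replaces the fixed-point hardware, and the order of the two passes (rows first here) is chosen for convenience since the result does not depend on it.

```python
import math

def dct_1d(s):
    """1-D DCT of eight samples, with C(0) = 1/sqrt(2) and C(w) = 1 otherwise."""
    out = []
    for w in range(8):
        cw = 1 / math.sqrt(2) if w == 0 else 1.0
        out.append(cw / 2 * sum(s[t] * math.cos((2 * t + 1) * w * math.pi / 16)
                                for t in range(8)))
    return out

def transpose(m):
    """Software stand-in for the transposition memory."""
    return [list(col) for col in zip(*m)]

def dct_2d_row_column(block):
    """2-D DCT as two 1-D passes; block is indexed as block[x][y]."""
    first = [dct_1d(row) for row in block]               # transform along y
    second = [dct_1d(row) for row in transpose(first)]   # transform along x
    return transpose(second)                             # result indexed as [u][v]
```

Each pass needs only eight-point transforms, which is why two identical 1-D processors and one transposition memory suffice for the 2-D transform.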
3.1. DCT/IDCT
The DCT of Eq. (1) can be rewritten as
$$S(u, v) = \frac{C(u)}{2} \sum_{x=0}^{7} \left[ \frac{C(v)}{2} \sum_{y=0}^{7} s(x, y) \cos\frac{(2y + 1)v\pi}{16} \right] \cos\frac{(2x + 1)u\pi}{16} \tag{4}$$
Figure 3. Codewords obtained from Huffman coder.
which can be performed by applying the cascade of
two 1-D DCT processors in vertical and horizontal di-
rections [9, 10]. The 1-D DCT is defined as follows:
$$S(w) = \frac{C(w)}{2} \sum_{t=0}^{7} s(t) \cos\frac{(2t + 1)w\pi}{16} \tag{5}$$
where C(w) has the same value as given in Eq. (2).
Equation (5) can be written in the following matrix
form:
$$\begin{bmatrix} S(0) \\ S(2) \\ S(4) \\ S(6) \end{bmatrix} = \frac{1}{2} \begin{bmatrix} A & A & A & A \\ B & C & -C & -B \\ A & -A & -A & A \\ C & -B & B & -C \end{bmatrix} \begin{bmatrix} s(0) + s(7) \\ s(1) + s(6) \\ s(2) + s(5) \\ s(3) + s(4) \end{bmatrix} \tag{6}$$
$$\begin{bmatrix} S(1) \\ S(3) \\ S(5) \\ S(7) \end{bmatrix} = \frac{1}{2} \begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix} \begin{bmatrix} s(0) - s(7) \\ s(1) - s(6) \\ s(2) - s(5) \\ s(3) - s(4) \end{bmatrix} \tag{7}$$
where A = cos(π/4), B = cos(π/8), C = sin(π/8), D = cos(π/16),
E = cos(3π/16), F = sin(3π/16), and G = sin(π/16).
Figure 4. Architecture of the DCT/IDCT module.
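The even/odd decomposition of Eqs. (6) and (7) can be checked numerically against the direct definition of Eq. (5). The sketch below is a verification aid only; the function names are hypothetical, and the common factor 1/2 is taken from the C(w)/2 term of Eq. (5).

```python
import math

A = math.cos(math.pi / 4)
B, C = math.cos(math.pi / 8), math.sin(math.pi / 8)
D, G = math.cos(math.pi / 16), math.sin(math.pi / 16)
E, F = math.cos(3 * math.pi / 16), math.sin(3 * math.pi / 16)

EVEN = [[A, A, A, A], [B, C, -C, -B], [A, -A, -A, A], [C, -B, B, -C]]   # Eq. (6)
ODD = [[D, E, F, G], [E, -G, -D, -F], [F, -D, G, E], [G, -F, E, -D]]    # Eq. (7)

def dct_1d_butterfly(s):
    """1-D DCT via sums and differences of mirrored samples."""
    sums = [s[t] + s[7 - t] for t in range(4)]
    diffs = [s[t] - s[7 - t] for t in range(4)]
    out = [0.0] * 8
    for k in range(4):
        out[2 * k] = 0.5 * sum(m * a for m, a in zip(EVEN[k], sums))
        out[2 * k + 1] = 0.5 * sum(m * d for m, d in zip(ODD[k], diffs))
    return out

def dct_1d_direct(s):
    """Direct evaluation of Eq. (5)."""
    out = []
    for w in range(8):
        cw = 1 / math.sqrt(2) if w == 0 else 1.0
        out.append(cw / 2 * sum(s[t] * math.cos((2 * t + 1) * w * math.pi / 16)
                                for t in range(8)))
    return out

samples = [12, -7, 33, 101, -128, 5, 64, -19]
print(max(abs(a - b) for a, b in zip(dct_1d_butterfly(samples), dct_1d_direct(samples))))
```

The decomposition replaces one 8 × 8 matrix product by two 4 × 4 products, which halves the number of multiplications per 1-D transform.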
Figure 4 shows the architecture of the DCT module.
An 8 × 8 block is processed column by column by the
first 1-D DCT processor. The intermediate results are placed in the transposition memory. Then the DCT values
are obtained row by row by the second 1-D DCT
processor. The architecture of the 1-D DCT processors,
implemented in distributed arithmetic [11], is shown in
Fig. 5. The first stage, FREG, in a 1-D DCT consists
of 8 parallel-in-serial-out registers. The eight pixels in
Figure 5. Architecture of 1-D DCT/IDCT.
a column of an 8 × 8 block are fed into the registers,
and shifted out serially by 2 bits at a time to the next
stage, called ADDSUB, which is responsible for the
additions and subtractions appearing in the right-hand
sides of Eqs. (6) and (7). The next stage, called RAC
(ROMs and Accumulators) and shown in Fig. 6, consists of eight ROMs and eight corresponding accumulators.
Note that all the combinations of the constants
and pixel values are stored in ROMs. RAC receives
eight pixel values and obtains eight DCT coefficients by
looking up ROM tables. The second ADDSUB acts as
a butterfly connection between RAC and BREG which
consists of 8 serial-in-parallel-out registers.
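The ROM-and-accumulator idea behind distributed arithmetic can be sketched as follows. This is a simplified illustration, not the RAC of the chip: it consumes one bit plane per step instead of two, uses integer (fixed-point) coefficients, and the names `build_rom` and `da_dot` are hypothetical.

```python
def build_rom(coeffs):
    """ROM[p] holds the sum of the coefficients selected by bit pattern p."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (p >> k) & 1)
            for p in range(1 << n)]

def da_dot(rom, inputs, width):
    """Dot product of the ROM's coefficients with signed inputs of `width` bits,
    computed bit-serially with one ROM lookup per bit plane."""
    mask = (1 << width) - 1
    words = [x & mask for x in inputs]            # two's-complement encoding
    acc = 0
    for b in range(width):
        pattern = 0
        for k, w in enumerate(words):
            pattern |= ((w >> b) & 1) << k        # gather bit b of every input
        term = rom[pattern] << b
        # In two's complement the most significant bit plane has negative weight.
        acc += -term if b == width - 1 else term
    return acc

coeffs = [181, 237, 98, -50]                      # hypothetical fixed-point constants
rom = build_rom(coeffs)
inputs = [5, -7, 100, -128]
print(da_dot(rom, inputs, 8), sum(c * x for c, x in zip(coeffs, inputs)))
```

No multiplier is needed: every step is a table lookup, a shift, and an addition, which is the reason the constants are stored in ROMs in precomputed combinations.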
Figure 6. Architecture of RAC.
For IDCT, the first ADDSUB acts as a butterfly
connection between FREG and RAC, and the second
ADDSUB is responsible for the required additions and
subtractions.
3.2. Quantizer/Dequantizer
We adopted the radix-4 modified Booth’s algo-
rithm [12] to speed up the multiplication operation
involved in quantization and dequantization. The al-
gorithm, as shown in Table 1, takes care of two bits
at a time. Note that in this table, DCT coefficients,
denoted by x’s, are multipliers and quantizing values,
Q, are multiplicands. The architecture of the quantizer
is shown in Fig. 7 which includes five adders and five
Table 1. The radix-4 modified Booth’s algorithm.
x_i   x_{i−1}   x_{i−2}   Operation   Comments
0     0         0         +0          String of zeros
0     1         0         +Q          A single 1
1     0         0         −2Q         Beginning of 1’s
1     1         0         −Q          Beginning of 1’s
0     0         1         +Q          End of 1’s
0     1         1         +2Q         End of 1’s
1     0         1         −Q          A single 0
1     1         1         +0          A string of 1’s
Figure 7. Architecture of the quantizer.
pipeline registers. In the first clock, a DCT coefficient of
12 bits and the corresponding quantizing value from the
quantization table are latched in the pipeline registers
of stage 1. In the second clock, six values are selected
from the set {+0, +Q, −Q, +2Q, −2Q} by x1x0x−1,
x3x2x1, x5x4x3, x7x6x5, x9x8x7, and x11x10x9, respectively,
and are passed through a set of shift-and-add structures.
The results are then stored in the pipeline registers of
stage 2. In the third clock, values from the pipeline registers
of stage 2 are processed in a similar manner and
the results are stored in the pipeline registers of stage 3.
After one more shift-and-add operation, the quantized
coefficient is stored in the pipeline register of stage 4.
For dequantization, the input to Fig. 7 is a quantized
coefficient and the table is replaced with the dequantization
table. Note that each element of the dequantization
table is the inverse of the corresponding element
of the quantization table.
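The recoding of Table 1 can be sketched in software as follows. This is an illustration under assumptions: Q is taken as an integer for simplicity (in the chip it is a stored quantizing value), the six partial products are summed sequentially rather than in pipelined shift-and-add stages, and the name `booth_multiply` is hypothetical.

```python
# Booth recoding of Table 1: (x[i+1], x[i], x[i-1]) -> multiple of Q.
BOOTH = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}

def booth_multiply(x, q, width=12):
    """Multiply a signed `width`-bit multiplier x by q, two bits of x at a time."""
    assert width % 2 == 0
    assert -(1 << (width - 1)) <= x < (1 << (width - 1))
    bits = x & ((1 << width) - 1)                 # two's-complement bit pattern

    def bit(i):
        return 0 if i < 0 else (bits >> i) & 1    # x[-1] is defined as 0

    product = 0
    for i in range(0, width, 2):                  # six triplets for a 12-bit value
        digit = BOOTH[(bit(i + 1), bit(i), bit(i - 1))]
        product += (digit * q) << i               # partial product weighted by 2**i
    return product

print(booth_multiply(-1234, 57), -1234 * 57)
```

For a 12-bit coefficient the loop visits exactly the six triplets x1x0x−1 through x11x10x9, so only six partial products from the set {0, ±Q, ±2Q} have to be added.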
3.3. Zig-Zag
The zig-zag module consists of two RAMs and an ad-
dress control, as shown in Fig. 8. The two RAMs,
RAM1 and RAM2, work in double-buffering mode.
When the content of RAM1 is read out in the zig-
zag order, RAM2 is being loaded with another block
of 64 DCT coefficients. Then RAM1 and RAM2 are
switched, namely, RAM2 is read out and RAM1 is
loaded. The switching of the two RAMs is controlled
by a multiplexer.
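The double-buffering scheme can be modelled in software as follows. This is a behavioural sketch only: reading and loading happen one after the other here, whereas the hardware performs them concurrently, and the class and method names are hypothetical.

```python
# Raster addresses (row * 8 + column) listed in zig-zag reading order.
ZIGZAG = sorted(
    range(64),
    key=lambda a: (a // 8 + a % 8,
                   a // 8 if (a // 8 + a % 8) % 2 else a % 8),
)

class ZigZagBuffer:
    """Two 64-word RAMs: one is loaded while the other is read in zig-zag order."""

    def __init__(self):
        self.rams = [[0] * 64, [0] * 64]
        self.write_sel = 0                        # index of the RAM being loaded

    def swap_in(self, block):
        """Load a block (64 values in raster order) and return the previously
        loaded block in zig-zag order."""
        loading = self.rams[self.write_sel]
        draining = self.rams[1 - self.write_sel]
        out = [draining[addr] for addr in ZIGZAG]   # read side
        loading[:] = block                          # write side
        self.write_sel = 1 - self.write_sel         # the multiplexer toggles
        return out

buf = ZigZagBuffer()
first = list(range(64))
buf.swap_in(first)                     # nothing useful to read yet
print(buf.swap_in([0] * 64)[:6])       # start of the first block in zig-zag order
```

Because one RAM is always free for loading, the DCT side never has to wait for the zig-zag read-out of the previous block.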
3.4. Huffman Codec
DC coefficients and AC coefficients are encoded separately.
DC coefficients are not encoded directly. Instead, the difference of the DC coefficient of the present
block from that of the previous block is used for encoding.
To obtain AC codewords for a sequence, the 63
AC coefficients are interpreted as runs of zeros, each
of which ends with a non-zero coefficient. Then each
run of zeros and its following nonzero coefficient are
used for encoding. A DC codeword is derived from two
parts, size and amplitude, and each AC codeword is derived
from run-length/size and amplitude. Amplitude
indicates the difference of the underlying DC coefficient
from the previous DC coefficient, or the nonzero
Figure 8. Architecture of the zig-zag module.
AC coefficient following a run of zeros. Size indicates
the number of bits required for representing the amplitude
in one’s complement form. The relationship between
size and amplitude is shown in Table 2. Note
that for DC coding, size can be of up to 11 bits, while
for AC coding, up to 10 bits. Run-length indicates the
number of zeros in a run of zeros. Table 3 shows all
the possible combinations allowed for run-length (in
horizontal direction) and size (in vertical direction) for
AC coding. Note that when the run-length of a run is
greater than 15, two or more codes are required for this run.
The ZRL mark in Table 3 indicates a run of 15 zeros followed by a zero AC coefficient (i.e., 16 consecutive
zeros), and the EOB mark is used to end a block, as
mentioned earlier.
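The interpretation of the AC coefficients as run-length/size symbols, including ZRL and EOB, can be sketched as follows. The names `size_of` and `ac_symbols` are hypothetical, and symbols are represented as (run-length, size, value) tuples for illustration. Following the text, the sketch appends EOB to every block.

```python
ZRL = (15, 0, 0)     # fifteen zeros followed by a zero coefficient
EOB = (0, 0, 0)      # end of block

def size_of(value):
    """Size category of Table 2: number of bits needed for |value|."""
    return abs(value).bit_length()

def ac_symbols(ac):
    """Turn the zig-zag ordered AC coefficients of a block into symbols."""
    symbols, run = [], 0
    for value in ac:
        if value == 0:
            run += 1
            continue
        while run > 15:              # a run longer than 15 needs ZRL marks
            symbols.append(ZRL)
            run -= 16
        symbols.append((run, size_of(value), value))
        run = 0
    symbols.append(EOB)              # trailing zeros are absorbed by EOB
    return symbols

print(ac_symbols([5, 0, 0, -3] + [0] * 59))
```

A long tail of zeros costs only the single EOB symbol, which is where most of the compression of a sparsely populated block comes from.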
When size, amplitude, and run-length are available,
we are ready for coding. To obtain the codeword for
Table 2. Correspondence between size and amplitude.
Size   Amplitude
0      0
1      −1, 1
2      −3, −2, 2, 3
3      −7 ∼ −4, 4 ∼ 7
4      −15 ∼ −8, 8 ∼ 15
5      −31 ∼ −16, 16 ∼ 31
6      −63 ∼ −32, 32 ∼ 63
7      −127 ∼ −64, 64 ∼ 127
8      −255 ∼ −128, 128 ∼ 255
9      −511 ∼ −256, 256 ∼ 511
10     −1023 ∼ −512, 512 ∼ 1023
11     −2047 ∼ −1024, 1024 ∼ 2047
a DC difference, D, we use the size of D to obtain a
Huffman code, CODE1, by looking up the DC Huffman
table, as shown in Table 4(a). Let CODE2 be
the amplitude of D represented in one’s complement
form. Then the codeword for D is the concatenation of
CODE1 and CODE2. For example, let D be 6. The size
for 6 is 3. Then CODE1 is 100, obtained from Table 4(a),
and CODE2 is 110. Therefore, the codeword for D is
100110. For the encoding of a run of zeros followed by a
nonzero AC coefficient, A, we use the run-length/size
combination to obtain a Huffman code, CODE1, by
looking up the AC Huffman table, a small part of which is shown in Table 4(b). Let CODE2 be the amplitude
of A represented in one’s complement form. Then the
codeword for this run of zeros and the following AC coefficient
is the concatenation of CODE1 and CODE2.
For example, suppose we want to find the codeword
Table 3. Possible combinations of run-length and size for AC coding.
Size \ Run-length   0     1     2     ···   9     10    11    ···   14    15
0                   EOB   N/A   N/A   ···   N/A   N/A   N/A   ···   N/A   ZRL
1                   01    11    21    ···   91    A1    B1    ···   E1    F1
2                   02    12    22    ···   92    A2    B2    ···   E2    F2
3                   03    13    23    ···   93    A3    B3    ···   E3    F3
4                   04    14    24    ···   94    A4    B4    ···   E4    F4
5                   05    15    25    ···   95    A5    B5    ···   E5    F5
6                   06    16    26    ···   96    A6    B6    ···   E6    F6
7                   07    17    27    ···   97    A7    B7    ···   E7    F7
8                   08    18    28    ···   98    A8    B8    ···   E8    F8
9                   09    19    29    ···   99    A9    B9    ···   E9    F9
10                  0A    1A    2A    ···   9A    AA    BA    ···   EA    FA
Table 4. Huffman tables: (a) DC Huffman table; (b) Part of AC Huffman table.
(a)                          (b)
Size   Huffman code          Run-length/Size   Huffman code
0      00                    · · ·             · · ·
1      010                   3/1               111010
2      011                   3/2               111110111
3      100                   3/3               111111110101
4      101                   3/4               1111111110001111
5      110                   3/5               1111111110010000
6      1110                  3/6               1111111110010001
7      11110                 3/7               1111111110010010
8      111110                3/8               1111111110010011
9      1111110               3/9               1111111110010100
10     11111110              3/10              1111111110010101
11     111111110             · · ·             · · ·
for the sequence 0003 with A = 3. The run-length is 3
and the size of A is 2. Then CODE1 is 111110111 ob-
tained from Table 4(b), and CODE2 is 11. Therefore,
the codeword for this sequence is 11111011111.
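The two worked examples above can be reproduced with a short sketch. Only the entries of Table 4 needed for the examples are included, and the function names are hypothetical.

```python
DC_CODES = {0: "00", 1: "010", 2: "011", 3: "100", 4: "101", 5: "110"}  # Table 4(a), excerpt
AC_CODES = {(3, 1): "111010", (3, 2): "111110111"}                      # Table 4(b), excerpt

def size_of(value):
    """Size category of Table 2."""
    return abs(value).bit_length()

def amplitude_bits(value):
    """CODE2: the amplitude in `size` bits; negative values use one's complement."""
    size = size_of(value)
    if size == 0:
        return ""
    if value < 0:
        value = ~(-value) & ((1 << size) - 1)     # invert the bits of |value|
    return format(value, "0{}b".format(size))

def dc_codeword(diff):
    """CODE1 (from the size of the DC difference) followed by CODE2."""
    return DC_CODES[size_of(diff)] + amplitude_bits(diff)

def ac_codeword(run, value):
    """CODE1 (from run-length/size) followed by CODE2."""
    return AC_CODES[(run, size_of(value))] + amplitude_bits(value)

print(dc_codeword(6))        # D = 6
print(ac_codeword(3, 3))     # sequence 0003
```

The first call yields 100110 and the second 11111011111, in agreement with the examples in the text.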
For decoding, each block of codewords is converted
back to a block of DCT coefficients by looking up the
Huffman tables. The decoding process is exactly the reverse of the coding process and its description
is omitted here.
Figure 9. The Huffman codec module.
The overall architecture of the Huffman codec module
is shown in Fig. 9. In this figure, zero-run detector, size detector, combiner, and barrel shift-out are involved
in the coding process, while barrel shift-in, leading
ones detector, and reconstructed data generator are
involved in the decoding process. The two components,
address MUX and Huffman tables, are involved in both
encoding and decoding.
3.4.1. Encoding Path. The zero-run detector is used
to count the number of successive zeros in a section of
an input block. It is equipped with a zero-run counter.
The entries of a block are fed into the zero-run detec-
tor one by one. If the input entry is zero, the zero-run
counter increments by one. If the input entry is non-
zero, it is sent out to the size detector and the zero-run
counter is reset to zero. The size detector determines
the size of the input value.
For coding DC coefficients, address MUX outputs
size to form the address for the Huffman tables. However,
for coding AC coefficients, address MUX outputs
both run-length and size to form the address for the
Huffman tables. A detailed description about addressing
for encoding will be given in the next section. The
output from the Huffman tables is a unique Huffman
code which is concatenated with amplitude in combiner
to form a codeword for each input.
Obviously, codewords obtained are variable in
length. However, the width of the data bus is fixed. The
Figure 11. Architecture of the barrel shift-in module.
ones, and group 9 having at least 9 leading ones. Therefore,
2^3 = 8 RAM locations are needed to decode the
codewords in each of the 10 groups, requiring a total
of 80 locations for a DC decoding table. Similarly,
the codewords for the AC coefficients are divided into
10 groups. The maximum possible tail length of the
codewords is 8. Since the first bit of the tail is always
0, up to 7 additional trailing bits are required to decode
the codewords within each group. Hence 2^7 = 128
RAM locations are needed to decode the codewords in
each of the 10 groups, requiring a total of 1280 locations
for an AC decoding table. Since each entry is
12 bits wide and there are two sets of Huffman tables,
2 × (80 + 1280) = 2720 entries, i.e., 2720 × 12 = 32640 RAM bits, are required for the
decoding tables.
The encoder was designed to utilize the hardware re-
quired for the decoder. By grouping the codewords by
code length, the encoder tables can be reduced to 12 bits
to fit within the decoding tables. Instead of looking
up the codeword directly, the length is used to determine the first code value for that group. Adding the
first code value and the offset of the codeword within
the group produces the desired codeword. With this approach,
only an additional RAM of size 56 × 16 = 896 bits
is required. Therefore, a total of 32640 + 896 = 33536
RAM bits are required for the Huffman codec.
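The RAM budget quoted above for the method of [5] can be re-derived with a few lines of arithmetic. The variable names are hypothetical; all numbers are taken from the text.

```python
GROUPS = 10                       # groups of leading ones per table
ENTRY_BITS = 12                   # width of one decoding-table entry
TABLE_SETS = 2                    # two sets of Huffman tables

dc_entries = GROUPS * 2 ** 3      # 3 trailing bits per DC group
ac_entries = GROUPS * 2 ** 7      # 7 trailing bits per AC group
decode_entries = TABLE_SETS * (dc_entries + ac_entries)
decode_bits = decode_entries * ENTRY_BITS
encode_extra_bits = 56 * 16       # additional RAM used for encoding
total_bits = decode_bits + encode_extra_bits

print(dc_entries, ac_entries, decode_entries, decode_bits, total_bits)
```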
4.2. Our Improvement
As in [5], we treat a Huffman code as the concate-
nation of two parts, ONES and CBITS, which denote
the leading ones and the remaining bits (excluding the
leading zero), respectively, of the code. We also di-
vide AC codes and DC codes, respectively, into ten
groups. Therefore, like [5], we need 2720 entries for
decoding tables. However, in Ruetz’s method, each
entry is 12 bits wide. We propose an improvement
to reduce the width of each entry as follows. Each
of the 2720 entries contains two fields, RUN-LEN
and SIZE, only. Therefore, each entry is 8 bits wide.
The code length C-LEN required for decoding is derived
by looking up another small table with
each entry being 4 bits wide. Furthermore, encoding
is done by taking advantage of the decoding tables
without additional storage. As a result, we save
about 10 K bits compared with Ruetz’s method for the Huffman
tables.
We use two RAMs, RAM3 and RAM4, to implement
the Huffman tables, as shown in Fig. 12. RAM3
has 2720 entries with 8 bits in each entry, and RAM4
has 376 entries with 4 bits in each entry. For decoding, an address for RAM3 contains three variable
fields: Group, SEVEN/THREE, and N, with 4 bits,
7/3 bits, and one bit, respectively. The Group field
indicates the group of the leading ones of the underlying
Huffman code. The SEVEN/THREE field
contains the 7/3 bits behind the zero that follows
the leading ones. This works because Huffman coding
ensures that none of the codes is a prefix of another
code. Of course, multiple entries in RAM3 may
store the same content due to this way of addressing.
The N field indicates which set of the tables is
Figure 12. Addressing of RAM tables.
used. Each entry in RAM3 contains the information
about run-length and size, indicated as RUN-LEN and SIZE, respectively. However, we need to provide the
barrel shifter of the barrel shift-in module with the
length of CBITS so that the exact number of bits
can be shifted. The length of CBITS can be obtained
from the code length field, C-LEN, of RAM4 (In
fact, C-LEN is one less than the code length, as ex-
plained later). When we have RUN-LEN and SIZE
from RAM3, we can get the length of CBITS from
RAM4 (The length of CBITS is equal to the difference
of C-LEN and the number of leading ones of the
codeword).
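The relation between ONES, CBITS, and C-LEN can be illustrated with the codes of Table 4. The sketch below is an illustration only; the function names are hypothetical and Huffman codes are represented as bit strings.

```python
def split_code(code):
    """Split a Huffman code into the number of leading ones and CBITS."""
    ones = len(code) - len(code.lstrip("1"))
    cbits = code[ones + 1:]          # skip the zero that ends the leading ones
    return ones, cbits

def cbits_length(c_len, ones):
    """Length of CBITS from C-LEN (one less than the code length)."""
    return c_len - ones

for code in ("111010", "1111111110010000"):
    ones, cbits = split_code(code)
    c_len = len(code) - 1
    print(code, ones, cbits, cbits_length(c_len, ones))
```

For 111010 the three leading ones and the following zero leave two bits of CBITS, which equals C-LEN = 5 minus 3, so the barrel shifter can be told exactly how many bits to consume.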
Figure 13. Architecture of the Huffman decoder.
We also use RAM3 and RAM4 for encoding. From
Table 4, it is clear that 12 entries and 11 × 16 = 176 entries are required for DC and AC encoding, respectively.
Therefore, 188 × 2 = 376 entries in total
are needed for two sets of tables. The way of addressing
for encoding is shown in the upper-part of
Fig. 12. An address for RAM3 and RAM4 contains
three variable fields: RUN-LEN, SIZE, and N. An en-
try in RAM3 contains EIGHT, the lower-order eight
bits of a Huffman code, and an entry in RAM4 con-
tains C-LEN, one less than the length of a Huffman
code. For example, consider the Huffman code 111010
for 3/1 in Table 4(b). RAM3 will include an entry of
00111010 and RAM4 will include an entry of 0101
(i.e., 5) at address 00000010011N for this Huffman code. Consider another example of the Huffman code
1111111110010000 for 3/5 in Table 4(b). RAM3 will
include an entry of 10010000 and RAM4 will include
an entry of 1111 (i.e., 15) at address 00000110101N
for this Huffman code. The total number of RAM bits required is therefore 2720 × 8 + 376 × 4 = 23264
bits, which is 10272 bits fewer than that required in
Ruetz’s method.
Figure 14. The JPEG system implemented with FPGAs: (a) Components involved in the system; (b) Connecting the system to a PC.
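The encoding entries of the two examples and the RAM totals can be reproduced as follows. This is a sketch under stated assumptions: the function names are hypothetical, and `rebuild` assumes that all bits above the lowest eight of a long code are ones. That assumption holds for the codes of Table 4, but the text does not state how the full code is recovered from EIGHT and C-LEN.

```python
def encode_entry(code):
    """Return (EIGHT, C-LEN) for a Huffman code given as a bit string."""
    eight = code[-8:].rjust(8, "0")           # lower-order eight bits, zero padded
    c_len = format(len(code) - 1, "04b")      # one less than the code length
    return eight, c_len

def rebuild(eight, c_len):
    """Recover the code, assuming the bits above the lowest eight are all ones."""
    length = int(c_len, 2) + 1
    if length <= 8:
        return eight[-length:]
    return "1" * (length - 8) + eight

print(encode_entry("111010"))
print(encode_entry("1111111110010000"))

proposed_bits = 2720 * 8 + 376 * 4            # RAM3 plus RAM4
saving = 33536 - proposed_bits                # relative to the total quoted for [5]
print(proposed_bits, saving)
```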
The Huffman decoder based on these RAM architectures
is shown in Fig. 13. In this figure, a sequence of 16 input bits is loaded into the Next Code register
through several barrel shifters. We have the barrel
shifters operate in parallel with the other parts of the
decoder. When the output of the Next Code register
passes through the leading ones detector, the barrel
shifter in the shift-in module shifts in the input bit
stream. When the data pass through the address
multiplexer and RAM3, the barrel shifter shifts in a
number of input bits with the size equal to the number
of leading ones. When the data pass through RAM4,
the barrel shifter shifts in a number of input bits
with the size equal to the length of the CBITS of the
code.
5. Emulation with FPGAs
VHDL was adopted as the high-level language for
implementing the JPEG baseline system. VHDL
codes were written to describe the architecture and
behavior of each component. After the function of the
design had been tested successfully with the VHDL
functional simulator, the design was synthesized
by Synopsys with the Altera FLEX 10K FPGA
technology. The whole system was partitioned and fit into two FLEX 10K FPGAs, an EPF10K100
and an EPF10K70. Placement, routing, and pro-
gramming of the FPGAs were done by ALTERA
Maxplus II.
Figure 15. LENA: (a) original image; (b) reconstructed image.
5.1. Construction of Circuit Boards
The DCT/IDCT module and interface circuits are
placed in the EPF10K100 FPGA. The transposition
RAM of the DCT module is fit into the embedded
RAM architecture of the FPGA. The quantization, zig-
zag, and codec modules are placed in the EPF10K70
FPGA. The RAMs of quantization and zig-zag mod-
ules are fit into the embedded RAM architecture of the
FPGA. Huffman tables are implemented in an external
static RAM.
The FPGAs are mounted on two circuit boards, as
shown in Fig. 14(a), and are connected together by two
50-bit flat cables and an 8-bit download cable. The two
flat cables form the path for the signals flowing between the two FPGAs. The download cable is connected to the
parallel port of a PC and programming of the 2 cascaded
FPGAs is done by the PC, as shown in Fig. 14(b). The
whole system communicates with the outside world via
the interface circuits.
The estimated propagation delays of the two circuit
boards are 189.7 ns and 218 ns, respectively. The clock
rate of the ISA bus on a PC is about 8.2 MHz. We divide
the clock on the ISA bus by 2 and apply it to our circuit
boards. The system works correctly at the clock rate of
4.1 MHz.
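The choice of the divided clock can be checked with simple arithmetic. The sketch assumes, as a simplification, that the clock period only has to exceed the longer of the two estimated propagation delays; the variable names are hypothetical.

```python
delays_ns = (189.7, 218.0)                  # estimated delays of the two boards
f_max_mhz = 1e3 / max(delays_ns)            # highest clock rate the slower board allows
isa_clock_mhz = 8.2
board_clock_mhz = isa_clock_mhz / 2         # ISA clock divided by 2

print(round(f_max_mhz, 2), board_clock_mhz, board_clock_mhz < f_max_mhz)
```

Under this assumption the undivided ISA clock would be too fast for the boards, while half of it leaves some margin.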
5.2. Experimental Results
A program written in the C language is used to monitor
the status of the FPGAs, read results from the circuit boards,
and write data into them. The subroutine “import( )”
reads data from the specified I/O address and
“outport( )” writes data to the specified I/O address.
When encoding, an image file of raw data is opened
for feeding pixel data into the circuit boards and another
file is opened for storing the resulting codewords
coming from the circuit boards. When decoding, a
file of codewords is fed into the circuit boards and another file is opened for storing the resulting pixel
values.
Some standard test images are used as benchmarks
to test the functionality of our design. Figure 15
is a 512 × 512 image with three color components in
RGB. The image is first compressed and then decompressed.
The bit rate is 0.055 bits/pixel and the SNR is
about 36.7 dB. SNR (signal-to-noise ratio) is defined
as follows:
$$\mathrm{SNR} = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} \bar{s}(i, j)^2}{\sum_{i=1}^{M} \sum_{j=1}^{N} [s(i, j) - \bar{s}(i, j)]^2} \tag{8}$$
where M and N denote the dimensions of the image in the x and y
directions, respectively, s(i, j) denotes pixel values
of the original image, and s̄(i, j) denotes pixel values
of the reconstructed image. Obviously, a large SNR
indicates a small distortion of the reconstructed image
from the original image. A low bit-rate means a
high compression ratio and thus a small transmission
bandwidth is required. Figure 16 is a color image in
RGB format. The bit rate is 0.058 bits/pixel and the
SNR is 32.9 dB. Figure 17 is also a color image. The
bit rate and the SNR are 0.127 bits/pixel and 27.8 dB,
respectively.
Figure 16. PEPPERS: (a) original image; (b) reconstructed image.
Figure 17. BABOON: (a) original image; (b) reconstructed image.
Figure 18. Floor plan of the JPEG chip.
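The ratio of Eq. (8) can be computed as sketched below. Since the results are reported in dB, the sketch also converts the ratio with 10 log10; this conversion is an assumption, as Eq. (8) gives only the ratio. The function names are hypothetical and images are plain nested lists.

```python
import math

def snr_ratio(original, reconstructed):
    """Eq. (8): energy of the reconstructed image over the energy of the error."""
    signal = sum(r * r for row in reconstructed for r in row)
    noise = sum((o - r) ** 2
                for orow, rrow in zip(original, reconstructed)
                for o, r in zip(orow, rrow))
    return signal / noise            # undefined when the reconstruction is exact

def snr_db(original, reconstructed):
    """The ratio of Eq. (8) expressed in decibels (assumed conversion)."""
    return 10 * math.log10(snr_ratio(original, reconstructed))

original = [[10, 10], [10, 10]]
reconstructed = [[10, 10], [10, 9]]
print(snr_ratio(original, reconstructed), round(snr_db(original, reconstructed), 2))
```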
6. Chip Implementation
A single chip for the JPEG baseline system is constructed
by standard cells with the 0.6 µm triple-metal
process. The design is synthesized using Synopsys with
the Compass 0.6 µm standard cell library. Placement and
routing are done by Cadence.
Table 5. Comparison of JPEG chips.
Author                        Size (mm², process)                 Complexity             RAM memory   Clock rate   Power consumed
Sun and Lee                   45.54 (0.6 µm)                      411,745 transistors    23.264 K     27 MHz       1000 mW
Ruetz et al. [5]              Two chips: 62.4 and 96.04 (1 µm)    N/A                    33.536 K     30 MHz       N/A
Kovac and Ranganathan [13]    168 (1 µm)                          N/A                    N/A          100 MHz      N/A
Okada et al. [14]             54 (0.6 µm)                         70 K (gate count)      17 K         18 MHz       400 mW
Asada et al. [15]             125 (0.5 µm)                        200 K (gate count)     38 K         17.5 MHz     500 mW
Hunter et al. [16]            14 (0.35 µm)                        50 K (gate count)      11.5 K       30 MHz       840 mW
The core is divided into four parts, sized roughly in proportion to
the four major modules: DCT/IDCT, quantization,
zig-zag, and Huffman codec. Each RAM is then placed
within its respective part and suitably positioned around the
core. The floor plan of the chip is shown in
Fig. 18.
Clock trees and the constraints on placing standard
cells are added before routing. The resulting
layout is shown in Fig. 19. DRC (design rule
check), ERC (electrical rule check), and LVS (layout
versus schematic) are performed to check the
correctness of the layout. Post-layout timing analysis
is done using TimeMill, and power analysis using
PowerMill.
Figure 19. Layout of the JPEG chip.
Simulation results have shown that the chip can operate
in real time at any compression ratio. A pixel or a
codeword can be processed within a single clock cycle,
and the maximal working frequency achieved is about
27 MHz. The chip contains 411,745 transistors, with a
die size of 6.6 × 6.9 mm². Power dissipation is about
1 watt at the maximal working frequency. A comparison
between our chip and other JPEG designs [5,
13–16] is given in Table 5. Note that the work of [5]
comprises two chips: DCT and zig-zag are contained in
a chip of 62.4 mm², and coding and quantization are
contained in another chip of 96.04 mm². Some of the
chips, e.g., [16], are JPEG encoders only. Our chip
appears highly competitive with the other chips.
Note that our focus has been on the development of
a single-chip JPEG codec in a campus lab with limited
resources, rather than on the application-specific standard
product (ASSP) chip design approach adopted for commercially
available chips [16].
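The real-time claim can be sanity-checked with simple arithmetic. The sketch below assumes CCIR601 frames of 720 × 480 pixels at 30 frames per second (the figures given in the introduction) and one sample processed per clock cycle at 27 MHz. The chroma sampling formats and the helper names are illustrative assumptions; this section does not state which format the chip handles.

```python
CLOCK_HZ = 27_000_000              # maximal working frequency reported above
WIDTH, HEIGHT, FPS = 720, 480, 30  # CCIR601 frame size and rate

# Average samples per pixel for some hypothetical input formats (assumptions).
SAMPLES_PER_PIXEL = {
    "single component": 1.0,
    "4:2:0": 1.5,
    "4:2:2": 2.0,
    "4:4:4": 3.0,
}


def required_sample_rate(samples_per_pixel, width=WIDTH, height=HEIGHT, fps=FPS):
    """Samples per second that must be processed to keep up with the video."""
    return width * height * fps * samples_per_pixel


def fits_clock(samples_per_pixel, clock_hz=CLOCK_HZ):
    """True if one sample per clock cycle is enough at the given clock rate."""
    return required_sample_rate(samples_per_pixel) <= clock_hz


if __name__ == "__main__":
    for name, spp in SAMPLES_PER_PIXEL.items():
        rate = required_sample_rate(spp)
        print(f"{name}: {rate / 1e6:.3f} Msamples/s, fits: {fits_clock(spp)}")
    # The die area follows from the stated dimensions and matches Table 5.
    print(f"die area: {6.6 * 6.9:.2f} mm^2")
```

Under these assumptions a single component needs about 10.4 Msamples/s and 4:2:2 video about 20.7 Msamples/s, both below 27 MHz, while full 4:4:4 sampling would need about 31.1 Msamples/s.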
7. Conclusion
We have presented the design and implementation
of a single chip for the JPEG baseline system. The
chip is mainly composed of four modules: DCT/IDCT,
quantizer/dequantizer, zig-zag, and Huffman codec.
The chip was designed with modern VLSI CAD tools.
VHDL was used to describe the architecture and behavior
of each module. Then the functionality of the design
was verified with field programmable gate arrays
(FPGAs) on circuit boards. Finally, a single chip was
implemented using the standard cell design approach
with the 0.6 µ triple-metal process. The resulting chip
contains 411,745 transistors, with a die size of
6.6 × 6.9 mm². The chip can operate in real time at
any compression ratio.
Acknowledgments
The authors would like to thank the anonymous referees
for their constructive comments and suggestions.
This work was supported by the National Science
Council under the grants NSC-87-2213-E-110-012 and
NSC-88-2218-E-110-008. A preliminary version of
this paper appeared in Proceedings of the 2000 Asia
Pacific Conference on Multimedia Technology and Applications,
Kaohsiung, Taiwan, December 2000.
References
1. “JPEG digital compression and coding of continuous-tone still
image,” Technical Report Draft ISO 10918, 1991.
2. “Coding of moving pictures and associated audio,” Committee
Draft of Standard ISO11172: ISO/MPEG/90/176, 1990.
3. “Video codec for audio visual services at p × 64 k bits per
second,” CCITT Recommendation H.261, 1990.
4. K. Sayood, Introduction to Data Compression. San Francisco,
CA: Morgan Kaufmann, 2000.
5. P.A. Ruetz, P. Tong, D. Luthi, and P.H. Ang, “A Video-Rate
JPEG Chip Set,” Journal of VLSI Signal Processing, vol. 5, 1993,
pp. 29–38.
6. H. Park and V.K. Prasanna, “Area Efficient VLSI Architectures for
Huffman Coding,” IEEE Transactions on Circuits and Systems,
1993, pp. 568–575.
7. A. Mukherjee, N. Ranganathan, J.W. Flieder, and T. Acharya,
“MARVLE: A VLSI Chip for Data Compression Using Tree-
Based Codes,” IEEE Transactions on Very Large Scale Integra-
tion Systems, 1993, pp. 203–213.
8. Y.S. Lee, J.J. Jong, T.S. Perng, L.C. Hsu, M.Y. Jaw, and C.Y.
Lee, “A Memory-Based Architecture for Very-High-Throughput
Variable Length Codec Design,” IEEE International Symposium
on Circuits and Systems, June 1997, pp. 2096–2099.
9. M. Sun, T. Chen, and A.M. Gottlieb, “VLSI Implementation of
a 16 × 16 Discrete Cosine Transform,” IEEE Transactions on
Circuits and Systems, vol. 36, 1989, pp. 610–617.
10. S.-F. Hsiao and J.-M. Tseng, “Parallel, Pipelined and Folded
Architectures for Computation of 1-D and 2-D DCT in Image
and Video Coding,” Journal of VLSI Signal Processing, vol. 28,
no. 3, 2001, pp. 205–220.
11. S.A. White, “Applications of Distributed Arithmetic to Digital
Signal Processing: A Tutorial Review,” IEEE ASSP Magazine,
June 1989, pp. 4–17.
12. I. Koren, Computer Arithmetic Algorithm, Prentice-Hall Inter-
national Editions, 1993.
13. M. Kovac and N. Ranganathan, “JAGUAR: A Fully Pipelined
VLSI Architecture for JPEG Image Compression Standard,”
Proceedings of the IEEE, vol. 83, no. 2, 1995, pp. 247–258.
14. S. Okada, Y. Matsuda, T. Watanabe, and K. Kondo, “A Single
Chip Motion JPEG Codec LSI,” IEEE Transactions on Consumer
Electronics, vol. 43, no. 3, 1997, pp. 418–422.
15. S.K. Asada, H. Ohtsubo, T. Fujihira, and T. Imaide, “Development
of a Low-Power MPEG1/JPEG Encode/Decode IC,” IEEE
Transactions on Consumer Electronics, vol. 43, no. 3, 1997,
pp. 639–644.
16. J.K. Hunter, J.V. McCanny, A. Simpson, Y. Hu, and J.G. Doherty,
“JPEG Encoder System-on-a-Chip Demonstrator,” in Proceedings
of the Thirty-Third Asilomar Conference, vol. 1, 1999,
pp. 762–766.
Sun-Hsien Sun was born on October 11, 1971 in Taipei, Taiwan.
He received bachelor's and master's degrees in electrical engineering
from National Sun Yat-Sen University (NSYSU), Kaohsiung,
Taiwan, ROC, in 1994 and 1998, respectively. His main interests
include VLSI design for signal processing and hardware description
languages.
Shie-Jue Lee was born in Kin-Men, ROC on August 15, 1955. He
received the B.S.E.E. and M.S.E.E. degrees in 1977 and 1979,
respectively, from National Taiwan University, and the Ph.D. degree
from the Department of Computer Science at the University of North
Carolina, Chapel Hill, USA, in 1990. Dr. Lee joined the faculty of the
Department of Electrical Engineering at National Sun Yat-Sen University,
Taiwan, in 1983, and has been a professor in the department
since 1994. His research interests include machine intelligence,
multimedia communications, and chip design.
Dr. Lee served as the acting director and the director of the
Telecommunication Development and Research Center of National
Sun Yat-Sen University during 1997–2000, and the director of the
Southern Telecommunications Research Center, National Science
Council, Taiwan, in 1998–1999. He is now professor and chairman
of the Electrical Engineering Department, National Sun Yat-Sen University.
Dr. Lee is a member of IEEE, IEICE, Association for Automated
Reasoning, Chinese Fuzzy Systems Association, Institute of Infor-
mation and Computing Machinery, and Taiwanese Association of
Artificial Intelligence.