04 32 bit loss less comp.doc


www.1000projects.com
www.fullinterview.com
www.chetanasprojects.com

    CHAPTER 1


    Introduction to VLSI

    1.1 Historical perspective:

The electronics industry has achieved phenomenal growth over the last two decades, mainly due to rapid advances in integration technologies and large-scale systems design - in short, due to the advent of VLSI. The number of applications of integrated circuits in high-performance computing, telecommunications, and consumer electronics has been rising steadily, and at a very fast pace. Typically, the required computational power (in other words, the intelligence) of these applications is the driving force for the fast development of this field. The current leading-edge technologies (such as low bit-rate video and cellular communications) already provide end-users with a certain amount of processing power and portability. This trend is expected to continue, with very important implications for VLSI and systems design.

One of the most important characteristics of information services is their increasing need for very high processing power and bandwidth (in order to handle real-time video, for example). The other important characteristic is that information services tend to become more and more personalized (as opposed to collective services such as broadcasting), which means that the devices must be more intelligent to answer individual demands, and at the same time must be portable to allow more flexibility and mobility. As more and more complex functions are required in various data processing and telecommunications devices, the need to integrate these functions in a small system/package is also increasing. The level of integration, as measured by the number of logic gates in a monolithic chip, has been steadily rising for almost three decades, mainly due to rapid progress in processing and interconnect technology.


1.2 Advantages of ICs:

The most important message here is that the logic complexity per chip has been (and still is) increasing exponentially. The monolithic integration of a large number of functions on a single chip usually provides:

- Less area/volume and therefore compactness
- Less power consumption
- Fewer testing requirements at the system level
- Higher reliability, mainly due to improved on-chip interconnects
- Higher speed, due to significantly reduced interconnection length
- Significant cost savings

1.3 Levels of ICs:

Digital circuits are constructed with integrated circuits. An integrated circuit (IC) is a small silicon semiconductor crystal, called a chip, containing the electronic components for the digital gates. The various gates are interconnected inside the chip to form the circuit.


Digital ICs are categorized according to their circuit complexity, as measured by the number of logic gates in a single package:

- Small Scale Integration (SSI)
- Medium Scale Integration (MSI)
- Large Scale Integration (LSI)
- Very Large Scale Integration (VLSI)

1.4 Classification of ICs by device count:


Nomenclature | Active Device Count | Functions                               | Technology
SSI          | 1-100               | Gates, op-amps, many linear applications | Bipolar
MSI          | 100-1,000           | Registers, filters, etc.                 | Bipolar (TTL, ECL)
LSI          | 1,000-10,000        | Microprocessors                          | MOS: NMOS, PMOS
VLSI         | 100,000-1,000,000   | Memories, computers, signal processors   | CMOS

Very Large Scale Integration:

A VLSI chip is a microelectronic chip with millions of logical components (billions of physical components) integrated (embedded) on a single IC. The feature size (physical dimension) of a component placed on a VLSI chip is measured in microns. A CMOS IC fabricated with Very Deep Sub-Micron (VDSM) technology has a feature size of 0.09 micron or less.

1.5 VLSI Design Flow:

1. Design Specification:

The first step in the high-level design flow is the design specification process, which involves specifying the behavior expected of the final design. The specification describes the expected function and behavior of the design using textual descriptions and graphic elements.

    2. Behavioral Description:


A behavioral description is created to analyze the functionality and algorithm of the design; the description is then framed, and its performance and compliance to standards are verified.

VLSI Design Flow:

Design Specification
-> Behavioral Description
-> RTL Description (VHDL)
-> Functional Verification & Testing
-> Logic Synthesis
-> Gate-Level Netlist
-> Logic Verification & Testing
-> Floor Planning, Automatic Place & Route
-> Physical Layout


3. RTL Description (VHDL):

Once the algorithm is scrutinized, the code is written keeping in mind the functionality and its ability to be synthesized. The RTL description can be written at the gate, dataflow, or behavioral level. A standard VHDL simulator can be used to read the RTL description and verify the correctness of the design.
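As a small illustration (entity and signal names are hypothetical, not taken from this project), a dataflow-style RTL description of a 2-to-1 multiplexer might look like:

```vhdl
-- Hypothetical dataflow-style RTL sketch of a 2-to-1 multiplexer
library ieee;
use ieee.std_logic_1164.all;

entity mux2 is
  port (
    a, b : in  std_logic;  -- data inputs
    sel  : in  std_logic;  -- select line
    y    : out std_logic   -- output
  );
end entity mux2;

architecture rtl of mux2 is
begin
  -- concurrent conditional signal assignment (dataflow style)
  y <= a when sel = '0' else b;
end architecture rtl;
```

The same function could equally be written at the behavioral level with a process, or structurally out of gate instances; all three forms are legal RTL inputs to a simulator.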

4. Functional Verification & Testing:

The VHDL simulator reads the VHDL description, compiles it into an internal format, and then executes the compiled format using test vectors. If compilation reports any syntax errors, they have to be removed and the design recompiled. After analyzing the results of the simulation, stimulus for the design has to be added; this may be a file of input stimulus or a file of output stimulus created with a waveform editor, and the resulting output waveforms are observed to test the functionality of the design.
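A minimal testbench sketch showing the kind of stimulus-and-observe loop described above (the instantiated entity `mux2` and its ports are hypothetical, assumed from the earlier illustration, not from this project):

```vhdl
-- Hypothetical testbench sketch: apply vectors, wait, check outputs
library ieee;
use ieee.std_logic_1164.all;

entity tb is
end entity tb;

architecture sim of tb is
  signal a, b, sel, y : std_logic;
begin
  -- instantiate the design under test (entity name is an assumption)
  dut : entity work.mux2 port map (a => a, b => b, sel => sel, y => y);

  -- apply test vectors and check the observed output
  stimulus : process
  begin
    a <= '1'; b <= '0'; sel <= '0';
    wait for 10 ns;
    assert y = '1' report "mux selected wrong input" severity error;
    sel <= '1';
    wait for 10 ns;
    assert y = '0' report "mux selected wrong input" severity error;
    wait;  -- suspend forever: end of simulation
  end process stimulus;
end architecture sim;
```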

5. Logic Synthesis:

Once the code is validated, VHDL synthesis tools are used to implement the design. The goal of the VHDL synthesis step is to create a design that implements the required functionality and meets the constraints provided. The logic synthesis tool converts the given RTL code into an optimized gate-level netlist.


6. Gate Level:

A gate-level netlist is a description of the design (circuit) in terms of gates and the connections between them. The gate-level netlist is the input to the automatic place-and-route tool.

7. Logic Verification & Testing:

The VHDL synthesis tool reports syntax and synthesis errors as well as warnings, including any mismatches it finds between the RTL simulation results and the output netlist simulation results. If the design is error-free, the next step is to map it to the target device.

8. Floor Planning, Automatic Placing and Routing:

Place-and-route tools take the design netlist and implement the design on the target technology device.

9. Physical Layout:

In this step, each component or primitive from the netlist is placed on the target device according to the design architecture. The signals from one module to another are then connected to form the physical layout.

1.6 INTRODUCTION TO VHDL

1.6.1 What is VHDL?

VHDL stands for VHSIC Hardware Description Language, where VHSIC stands for Very High Speed Integrated Circuit. As the name implies, VHDL is a language for describing the behavior of digital hardware: it is another way of describing what outputs of a digital circuit are desired when it is given certain inputs. The critical difference between VHDL and other design descriptions is that VHDL can be readily interpreted by software, enabling the computer to accomplish much of your design work for you.

As the size and complexity of digital systems increase, more computer-aided design tools are introduced into the hardware design process. The early paper-and-pencil design methods have given way to sophisticated design entry, verification, and automatic hardware generation tools. The newest


addition to this design methodology is the hardware description language (HDL). Although the concept of HDLs is not new, their widespread use in digital system design is no more than a decade old. Based on HDLs, new digital system CAD tools have been developed and are now being utilized by hardware designers.

1.6.2 VHDL History:

In 1980 the US government launched the Very High Speed Integrated Circuit (VHSIC) project to enhance the electronic design process, technology, and procurement, spawning the development of many advanced integrated circuit (IC) process technologies. This was followed by the arrival of the VHSIC Hardware Description Language (VHDL).

1.6.3 Why Use VHDL?

There are many reasons why it makes good design sense to use VHDL:

1. Portability:

Technology changes so quickly in the digital industry that designs built from discrete digital devices require constant rework to remain current. VHDL is designed to be device-independent: if you describe your circuit in VHDL, as opposed to designing it with discrete devices, changing hardware becomes a (relatively) trivial process.

2. Flexibility:

Most working engineers can recall a situation where they felt frustrated with a customer, supervisor, or team member because the design specification they were working with was constantly changing. Sometimes these changes can't be helped. Design work is usually focused on creating small, easily maintainable components and then integrating them into a larger device. On larger projects, different teams of engineers each design separate parts of the project at the same time, which can mean that if one component changes, all of the components must change, even those being worked on by other teams. Suppose you were told to design a simple counter that set an output bit after it had counted to 100, and the software engineer on the project then discovered that the entire design could be radically simplified if your counter


could count down from 300 instead of up to 100. If you had implemented your design in discrete circuits, you would have to start over from scratch; if you had designed it in VHDL, all you would have to do is change your code.

1.6.4 VHDL Features:

General features:

VHDL can be used for design documentation, high-level design, simulation, synthesis, and testing of hardware, and as a driver for a physical design tool.

1. Concurrency:

In VHDL, transfer statements, descriptions of components, and instantiations of gates or logical units can all be executed such that, in the end, they appear to have been executed simultaneously.
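A short sketch of this concurrency (the half-adder entity and its signal names are illustrative, not part of this project): the two assignments below are concurrent statements, so their textual order does not matter.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity half_adder is
  port (a, b : in std_logic; sum, carry : out std_logic);
end entity half_adder;

architecture concurrent_demo of half_adder is
begin
  -- Both assignments are concurrent statements: each re-evaluates
  -- whenever a signal on its right-hand side changes, so their
  -- textual order relative to each other is irrelevant.
  sum   <= a xor b;
  carry <= a and b;
end architecture concurrent_demo;
```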

2. Support for design hierarchy:

In VHDL, the operation of a system can be specified based on its functionality, or it can be specified structurally in terms of its smaller subcomponents.

3. Library support:

User- and system-defined primitives and descriptions reside in the library system. VHDL provides a mechanism for accessing various libraries, and different designers can access these libraries.

4. Sequential statements:

VHDL provides mechanisms for executing sequential statements, which offer an easy method for modeling hardware components based on their functionality. Sequential or procedural capability is only for convenience; the overall structure of the VHDL language remains highly concurrent.
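For example (a hypothetical D flip-flop, not from this project), the statements inside a process execute sequentially, while the process as a whole still runs concurrently with the rest of the design:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity dff is
  port (clk, d : in std_logic; q : out std_logic);
end entity dff;

architecture behav of dff is
begin
  -- Inside the process, statements execute in order (sequentially);
  -- the process itself is a single concurrent statement.
  process (clk)
  begin
    if rising_edge(clk) then
      q <= d;
    end if;
  end process;
end architecture behav;
```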

    5. Type declaration and usage:

    www.1000projects.com

    www.fullinterview.comwww.chetanasprojects.com

    http://www.1000projects.com/http://www.fullinterview.com/http://www.1000projects.com/http://www.fullinterview.com/http://www.1000projects.com/http://www.fullinterview.com/http://www.1000projects.com/http://www.fullinterview.com/
  • 7/27/2019 04 32 bit loss less comp.doc

    11/111

    www.1000projects.com

    www.fullinterview.comwww.chetanasprojects.com

VHDL is not limited to just bit or boolean types; it also supports integer, floating-point, enumerated, and user-defined types. In addition, VHDL allows array-type declarations and composite-type definitions.
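A sketch of such declarations gathered in a package (all names here are illustrative assumptions):

```vhdl
-- Hypothetical package showing enumerated, array, and composite types
package example_types is
  -- enumerated type, e.g. for a state machine
  type state_t is (idle, load, compress, flush);

  -- constrained array type (a small register file)
  type word_array_t is array (0 to 7) of bit_vector(31 downto 0);

  -- composite (record) type
  type pixel_t is record
    red, green, blue : integer range 0 to 255;
  end record;
end package example_types;
```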

6. Use of subprograms:

VHDL allows the use of functions and procedures, which can be used in type conversions, logic unit definitions, operator redefinitions, new operation definitions, and other applications.

7. Timing control:

VHDL allows the designer to schedule values to signals and delay the actual assignment of values until a later time. It also allows the use of any number of explicitly defined clock signals, and it provides features for edge detection, delay specification, setup and hold time specification, pulse width checking, and setting various time constraints.

8. Structural specification:

VHDL allows the designer to describe a generic 1-bit design and use it when describing multibit regular structures in one or more dimensions.
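One common way to do this is a generate statement that replicates a 1-bit component across N bits. In the sketch below, `full_adder_1bit` is an assumed, previously described 1-bit component; all names are hypothetical:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch: replicating a 1-bit design across N bits with a generate
-- statement (full_adder_1bit is an assumed 1-bit component).
entity ripple_adder is
  generic (N : positive := 8);
  port (
    a, b : in  std_logic_vector(N-1 downto 0);
    cin  : in  std_logic;
    sum  : out std_logic_vector(N-1 downto 0);
    cout : out std_logic
  );
end entity ripple_adder;

architecture structural of ripple_adder is
  signal carry : std_logic_vector(N downto 0);
begin
  carry(0) <= cin;
  -- instantiate one 1-bit full adder per bit position
  gen_bits : for i in 0 to N-1 generate
    bit_i : entity work.full_adder_1bit
      port map (a => a(i), b => b(i), cin => carry(i),
                sum => sum(i), cout => carry(i+1));
  end generate gen_bits;
  cout <= carry(N);
end architecture structural;
```

Changing the generic N resizes the whole structure without touching the 1-bit description.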

    1.7 Advantages of VHDL:

    VHDL offers the following advantages for digital design:

1. Standard:

VHDL is an IEEE standard. Like any standard (such as the X Window graphics standard, bus communication interface standards, or high-level programming languages), it reduces confusion and makes interfaces between tools, companies, and products easier. Any development that follows the standard has a better chance of lasting longer and less chance of becoming obsolete due to incompatibility with others.

2. Government support:

VHDL is a result of the VHSIC program; hence, it is clear that the US government supports the VHDL standard for electronic procurement. The Department of Defense (DoD) requires contractors to supply VHDL for all Application Specific Integrated Circuit (ASIC) designs.


3. Industry support:

With the advent of more powerful and efficient VHDL tools has come the growing support of the electronics industry. Companies use VHDL tools not only for defense contracts but also for their commercial designs.

4. Portability:

The same VHDL code can be simulated and used in many design tools and at different stages of the design process. This reduces dependency on a single set of design tools whose limited capability may not be competitive in later markets. The VHDL standard also makes transferring design data much easier than the design database of a proprietary design tool does.

5. Modeling capability:

VHDL was developed to model all levels of design, from electronic boxes to transistors. VHDL can accommodate behavioral constructs and mathematical routines that describe complex models, such as queuing networks and analog circuits. It allows multiple architectures to be associated with the same design during various stages of the design process. VHDL can describe everything from low-level transistors up to very large systems.

6. Reusability:

Certain common designs can be described, verified, and modified slightly in VHDL for future use. This eliminates reading and marking changes on schematic pages, which is time consuming and subject to error. For example, parameterized multiplier VHDL code can easily be reused by changing the width parameters, so that the same VHDL code can do either 16-by-16 or 12-by-8 multiplication.
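A sketch of the kind of parameterized multiplier described above (entity and generic names are hypothetical): changing the generics reuses the same code for 16-by-16, 12-by-8, or any other width.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical parameterized multiplier: widths set by generics
entity mult is
  generic (WIDTH_A : positive := 16;
           WIDTH_B : positive := 16);
  port (
    a : in  unsigned(WIDTH_A-1 downto 0);
    b : in  unsigned(WIDTH_B-1 downto 0);
    p : out unsigned(WIDTH_A+WIDTH_B-1 downto 0)
  );
end entity mult;

architecture rtl of mult is
begin
  -- numeric_std "*" yields a product of width WIDTH_A + WIDTH_B
  p <= a * b;
end architecture rtl;
```

Instantiating it with `generic map (WIDTH_A => 12, WIDTH_B => 8)` gives the 12-by-8 variant from the same source.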

7. Technology and foundry independence:

The functionality and behavior of the design can be described with VHDL and verified, making it foundry- and technology-independent. This frees the designer to proceed without having to wait for the foundry and technology to be selected.

    8. Documentation:


The design and its documentation can be kept in a single place by embedding the documentation in the code. Combining comments with the code that actually dictates what the design should do reduces the ambiguity between specification and implementation.

9. New design methodology:

Using VHDL and synthesis creates a new methodology that increases design productivity, shortens the design cycle, and lowers costs. It amounts to a revolution comparable to that introduced by the automatic semi-custom layout synthesis tools of the last few years. Synthesis, in the domain of digital design, is a process of translation and optimization. For example, layout synthesis is a process of taking a design netlist and translating it into a form of data that facilitates placement and routing, optimizing timing and/or chip size. Logic synthesis, on the other hand, is the process of taking a form of input (VHDL), translating it into another form (Boolean equations, specific to the synthesis tool), and then optimizing in terms of propagation delay and/or area. After the VHDL code is translated into an internal form, the optimization process can be performed based on constraints such as speed, area, and power.


CHAPTER 2


    INTRODUCTION TO LOSSLESS COMPRESSION

    2.1. Objective

With the increase in silicon densities, it is becoming feasible for multiple compression systems to be implemented in parallel on a single chip. A 32-bit system with distributed memory architecture is based on having multiple data compression and decompression engines working independently on different data at the same time, with the data stored in memory distributed to each processor. The objective of the project is to design a lossless parallel data compression system that operates at high speed to achieve a high compression rate. By using a parallel architecture of compressors, the data compression rates are significantly improved, and the inherent scalability of the parallel architecture becomes possible. The main parts of the system are the two XMatchPro-based data compressors in parallel and the control blocks providing control signals for the data compressors, allowing appropriate control of the routing of data into and from the system. Each data compressor can process four bytes of data into and from a block of data every clock cycle. The data entering the system needs to be clocked in at a rate of 4n bytes every clock cycle, where n is the number of compressors in the system. This ensures that adequate data is present for all compressors to process


    rather than being in an idle state.

2.2. Goal of the Thesis

To achieve higher compression rates using a 32-bit compression/decompression architecture with the least increase in latency.

2.3. LITERATURE SURVEY

    2.3.1. Compression Techniques

At present there is an insatiable demand for ever-greater bandwidth in communication networks and for ever-greater storage capacity in computer systems. This has led to the need for efficient compression techniques. Compression is the process that is required either to reduce the volume of information to be transmitted (text, fax, and images) or to reduce the bandwidth that is required for its transmission (speech, audio, and video). The compression technique is first applied to the source information prior to its transmission. Compression algorithms can be classified into two types, namely:

- Lossless Compression
- Lossy Compression

2.3.1.1. Lossless Compression

In a lossless compression algorithm, the aim is to reduce the amount of source information to be transmitted in such a way that, when the compressed information is decompressed, there is no loss of information. Lossless compression is therefore said to be reversible; that is, data is not altered or lost in the process of compression or decompression, and decompression generates an exact replica of the original object. The various lossless compression techniques are:

- Packbits encoding
- CCITT Group 3 1D
- CCITT Group 3 2D
- Lempel-Ziv-Welch (LZW)
- Huffman


- Arithmetic

Example applications of lossless compression are transferring data over a network as a text file (since, in such applications, it is normally imperative that no part of the source information is lost during either the compression or decompression operations), file storage systems (tapes, hard disk drives, solid-state storage, file servers), and communication networks (LAN, WAN, wireless).

2.3.1.2. Lossy Compression

The aim of lossy compression algorithms is normally not to reproduce an exact copy of the source information after decompression, but rather a version of it that is perceived by the recipient as a true copy.

The lossy compression algorithms are:

- JPEG (Joint Photographic Experts Group)
- MPEG (Moving Picture Experts Group)
- CCITT H.261 (Px64)

Example applications of lossy compression are the transfer of digitized images and of audio and video streams. In such cases, the sensitivity of the human eye or ear is such that any fine details that may be missing from the original source signal after decompression are not detectable.

2.3.1.3. Text Compression

There are three different types of text (unformatted, formatted, and hypertext), and all are represented as strings of characters selected from a defined set. The compression algorithm associated with text must be lossless, since the loss of just a single character could modify the meaning of a complete string. Text compression is restricted to the use of entropy encoding and, in practice, statistical encoding methods. There are two types of statistical encoding methods which are used with text: one which uses single characters as the basis for deriving an optimum set of code words, and the


other which uses variable-length strings of characters. Two examples of the former are the Huffman and arithmetic coding algorithms; an example of the latter is the Lempel-Ziv (LZ) algorithm.

The majority of work on hardware approaches to lossless parallel data compression has used an adapted form of the dictionary-based Lempel-Ziv algorithm, in which a large number of simple processing elements are arranged in a systolic array [1], [2], [3], [4].

2.3.2. Previous Work on Lossless Compression Methods

A second Lempel-Ziv method used a content addressable memory (CAM) capable of performing a complete dictionary search in one clock cycle [5], [6], [7]. The search for the most common string in the dictionary (normally the most computationally expensive operation in the Lempel-Ziv algorithm) can be performed by the CAM in a single clock cycle, while the systolic array method uses a much slower deep-pipelining technique to implement its dictionary search. However, compared to the CAM solution, the systolic array method has advantages in terms of reduced hardware costs and lower power consumption, which may be more important criteria in some situations than faster dictionary searching. In [8], the authors show that hardware main-memory data compression is both feasible and worthwhile; they also describe the design and implementation of a novel compression method, the XMatchPro algorithm, and exhibit the substantial impact such memory compression has on overall system performance. The adaptation of compression code for parallel implementation is investigated by Jiang and Jones [9], who recommended the use of a processing array arranged in a tree-like structure. Although compression can be implemented in this manner, implementing the decompressor's search and decode stages in parallel hardware would greatly increase the complexity of the design, and it is likely that these aspects would need to be implemented sequentially. An FPGA implementation of a parallel binary arithmetic coding architecture that is able to process 8 bits per clock cycle, compared to the standard 1 bit per cycle, is described by Stefo et al. [10]. Although little research has been performed on architectures involving several independent compression units working in a concurrent, cooperative manner, IBM has introduced the MXT


    chip [11], which has four independent compression engines operating on a shared memory area.

    The four Lempel-Ziv compression engines are used to provide data throughput sufficient for

    memory compression in computer servers. Adaptation of software compression algorithms to make use

of multiple CPU systems was demonstrated in the research of Penhorn [12] and of Simpson and Sabharwal

    [13]. Penhorn used two CPUs to compress data using a technique based on the Lempel-Ziv

    algorithm and showed that useful compression rate improvements can be achieved, but only at

the cost of increasing the learning time for the dictionary. Simpson and Sabharwal described the

software implementation of a compression system for a multiprocessor system based on the

parallel architecture developed by Gonzalez-Smith and Storer [14].

    2.3.2.1. Statistical Methods

Statistical modeling in a lossless data compression system is based on assigning values to

events according to their probability: the more probable an event, the higher its value. The accuracy

    with which this frequency distribution reflects reality determines the efficiency of the model. In

    Markov modeling, predictions are done based on the symbols that precede the current symbol.

Statistical methods in hardware are restricted to simple higher-order modeling using binary

alphabets, which limits speed, or to simple multisymbol alphabets using zeroth-order models, which

limit compression. Binary alphabets limit speed because only a few bits (typically a single bit)

    are processed in each cycle while zeroth order models limit compression because they can only

    provide an inexact representation of the statistical properties of the data source.
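To make the zeroth-order limitation concrete, the sketch below computes the ideal coded size of a block under a zeroth-order (context-free) model. The function and data are illustrative, not part of any cited design: a context-free model charges a full bit per symbol for alternating data even though every symbol is perfectly predictable from the one before it, which is exactly the compression loss described above.

```python
from collections import Counter
from math import log2

def zeroth_order_cost(data: bytes) -> float:
    """Ideal coded size in bits under a zeroth-order model, which uses
    per-symbol frequencies only and ignores all preceding context."""
    counts = Counter(data)
    total = len(data)
    return sum(-n * log2(n / total) for n in counts.values())

# A zeroth-order model sees only p(a) = p(b) = 0.5, i.e. 1 bit/symbol,
# even though each symbol is perfectly predictable from the previous one.
print(zeroth_order_cost(b"abababababab"))  # 12.0
```

A first-order (Markov) model of the same data would assign each symbol a conditional probability near 1 and code it in almost zero bits.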

    2.3.2.2. Dictionary Methods

    Dictionary Methods try to replace a symbol or group of symbols by a dictionary location code.

Some dictionary-based techniques use simple uniform binary codes to process the information

    supplied. Both software and hardware based dictionary models achieve good throughput and

competitive compression.

The UNIX utility compress uses the Lempel-Ziv-2 (LZ2) algorithm, and the data

compression Lempel-Ziv (DCLZ) family of compressors, initially invented by Hewlett-Packard [16]

and currently being developed by AHA [17], [18], also use LZ2 derivatives. Bunton


    and Borriello present another LZ2 implementation in [19] that improves on the Data

    Compression Lempel-Ziv method. It uses a tag attached to each dictionary location to identify which

    node should be eliminated once the dictionary becomes full.
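As a minimal illustration of the dictionary principle behind these LZ2-style compressors (a toy LZ78-flavoured sketch, not the compress, DCLZ, or [19] implementations), the coder below replaces repeated phrases with dictionary location codes and simply stops growing the dictionary once it is full:

```python
def lz78_compress(data: str, max_entries: int = 4096):
    """Toy LZ78-style coder: emit (dictionary index, next character) pairs.
    Real LZ2/DCLZ hardware differs in coding details and dictionary upkeep."""
    dictionary = {"": 0}                  # phrase -> dictionary location
    out, phrase = [], ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch                  # keep extending the current match
        else:
            out.append((dictionary[phrase], ch))
            if len(dictionary) < max_entries:
                dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        out.append((dictionary[phrase], ""))  # flush a trailing match
    return out

print(lz78_compress("abababa"))  # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a')]
```

The tag-based eviction scheme of Bunton and Borriello addresses precisely the question this sketch dodges: which entry to discard once `max_entries` is reached.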

    2.4. XMatchPro Based System

    The Lossless data compression system is a derivative of the XMatchPro Algorithm which

    originates from previous research of the authors [15] and advances in FPGA technology. The

    flexibility provided by using this technology is of great interest since the chip can be adapted to the

    requirements of a particular application easily. The drawbacks of some of the previous methods are

overcome by using the XMatchPro algorithm in the design. The objective is then to obtain better

    compression ratios and still maintain a high throughput so that the compression/decompression

    processes do not slow the original system down.


CHAPTER 3


    FUNCTIONS OF LOSSLESS COMPRESSION

    3.1. BASICS OF COMMUNICATION

    A sender can compress data before transmitting it and a receiver can decompress the data after

    receiving it, thus effectively increasing the data rate of the communication channel. Lossless data

    compression is the process of encoding a body of data into a smaller body of data that can at a

    later time be uniquely decoded back to the original data.

    Lossless compression removes redundant information from the data while they are being

    transmitted or before they are stored in memory, and lossless decompression reintroduces the

    redundant information to recover fully the original data. In the same way, the data is

compressed before it is stored and decompressed when it is retrieved, thus increasing the effective capacity of the storage device.
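The round trip described above can be demonstrated with any lossless codec; in the sketch below, Python's zlib stands in for the hardware compressor, and the assertion checks the defining property that decompression recovers the original data exactly:

```python
import zlib

# Redundant data compresses well; decompression restores it bit for bit,
# so the channel or memory effectively carries more user data.
original = b"the quick brown fox " * 100
compressed = zlib.compress(original)

assert zlib.decompress(compressed) == original   # lossless: exact recovery
print(len(original), "->", len(compressed))      # 2000 -> far fewer bytes
```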

    3.2. Proposed Method


In [1], the author discusses a parallel algorithm that can be implemented for

high-speed data compression. The author gives the basic idea of how data compression is

carried out using the Lempel-Ziv algorithm and how it could be altered to introduce parallelism into the

algorithm. The author describes the Lempel-Ziv algorithm as a very efficient universal data

    compression technique, based upon an incremental parsing technique, which maintains codebooks

    of parsed phrases at the transmitter and at the receiver. An important feature of the algorithm is

that it is not necessary to determine a model of the source that generates the data. According to the

    author, in an attempt to increase the speed of the algorithm on general-purpose processors, the

    algorithm has been parallelised to run on two processors.

    3.3. Background

The author explains a novel architecture for a high-performance lossless data compressor

that is organized around a selectively shiftable content addressable memory which permits full

matching; the processor offers very high performance with good compression of computer-based

    data. The author also gives details about the operation, architecture and performance of the

    Data Compression Techniques. He also introduces the XMatchPro lossless data compressor. In [3],

the authors discuss parallelism in data compression techniques and

explain a parallel architecture for high-speed data compression. In this paper, the authors

present data compression as an essential component of high-speed data communication and

storage. In [4], the authors discuss the various methods of data compression, their

techniques and drawbacks, and propose a new methodology for high-speed parallel lossless

data compression. The authors describe the research and hardware implementation of a high-performance

parallel multi-compressor chip able to meet the intensive data processing

demands of highly concurrent systems. The authors also investigate the performance of

alternative input and output routing strategies; results for realistic data sets demonstrate that the design

of parallel compression devices involves important trade-offs that affect compression performance,

latency and throughput. The compression ratio achieved by the proposed universal code uniformly

approaches the lower bounds on the compression ratios attainable by block-to-variable codes and

variable-to-block codes designed to match a completely specified source.


    3.4. Usage of XMatchPro Algorithm

    The Lossless Parallel Data Compression system designed uses the XMatchPro Algorithm.

    The XMatchPro algorithm uses a fixed-width dictionary of previously seen data and attempts to

    match the current data element with a match in the dictionary. It works by taking a 4-byte word and

    trying to match or partially match this word with past data. This past data is stored in a dictionary,

    which is constructed from a content addressable memory. As each entry is 4 bytes wide, several

types of matches are possible. If none of the bytes match any data present in the dictionary,

the tuple is transmitted with an additional miss bit. If the bytes are matched, then the match location

and match type are coded and transmitted, and the matched entry is then moved to the front of the dictionary.

    The dictionary is maintained using a move to front strategy whereby a new tuple is placed at the front

    of the dictionary while the rest move down one position. When the dictionary becomes full the

    tuple placed in the last position is discarded leaving space for a new one.

    The coding function for a match is required to code several fields as follows: A zero followed

    by:

1). Match location: the binary code associated with the matching location.

2). Match type: indicates which bytes of the incoming tuple have matched.

3). Literals: the characters that did not match, transmitted in literal form.

    A description of the XMatchPro algorithm in pseudo-code is given in the figure below.

clear the dictionary;
set the next free location (NFL) to 0;
DO
{
    read in a tuple T from the data stream;
    search the dictionary for tuple T;
    IF (full or partial hit)
    {
        determine the best match location ML and match type MT;
        output 0;
        output the codes for ML and MT;
        output any required literal characters of T;
    }
    ELSE
    {
        output 1;
        output tuple T;
    }
    IF (full hit)
    {
        move dictionary entries 0 to ML-1 down by one location;
    }
    ELSE
    {
        move all dictionary entries down by one location;
        increment NFL (if dictionary is not full);
    }
    copy tuple T to dictionary location 0;
}
WHILE (more data is to be compressed);

Fig.3.2. Pseudo Code for XMatchPro Algorithm
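The pseudo-code above can be sketched in software as follows. This is an illustrative Python model, not the hardware implementation: the output is a list of symbolic codes rather than a packed bit-stream, the two-byte partial-match threshold follows Section 4.2, and the "best match" here simply maximises matched bytes as a stand-in for the shortest-output-bits criterion.

```python
def xmatchpro_compress(tuples, dict_size=16):
    """Software model of the pseudo-code: symbolic codes, not a packed
    bit-stream. A hit needs at least two matching bytes (a partial match)."""
    dictionary = []                       # location 0 is the dictionary front
    out = []
    for t in tuples:                      # each t is a 4-byte tuple, e.g. b"ABCD"
        best_loc, best_bytes = None, 0
        for loc, entry in enumerate(dictionary):
            same = sum(a == b for a, b in zip(t, entry))
            if same >= 2 and same > best_bytes:
                best_loc, best_bytes = loc, same
        if best_loc is not None:          # full or partial hit
            match_type = tuple(a == b for a, b in zip(t, dictionary[best_loc]))
            literals = bytes(a for a, m in zip(t, match_type) if not m)
            out.append((0, best_loc, match_type, literals))
            if best_bytes == 4:           # full hit: rotate matched entry to front
                dictionary.insert(0, dictionary.pop(best_loc))
                continue
        else:                             # miss: 1 followed by the literal tuple
            out.append((1, bytes(t)))
        dictionary.insert(0, bytes(t))    # move-to-front insertion
        if len(dictionary) > dict_size:
            dictionary.pop()              # discard the last entry when full
    return out

codes = xmatchpro_compress([b"ABCD", b"ABCD", b"ABXY"])
print(codes)
```

Running the example yields a miss for the first tuple, a full match at location 0 for the second, and a partial match (two literal bytes `XY`) for the third.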

    With the increase in silicon densities, it is becoming feasible for multiple XMatchPros to

    be implemented in parallel onto a single chip. A parallel system with distributed memory

    architecture is based on having multiple data compression and decompression engines working

    independently on different data at the same time. This data is stored in memory distributed to each

    processor. There are several approaches in which data can be routed to and from the compressors that

    will affect the speed, compression and complexity of the system. Lossless compression removes

    redundant information from the data while they are transmitted or before they are stored in memory.

    Lossless decompression reintroduces the redundant information to recover fully the original data.

    There are two important contributions made by the current parallel compression &

    decompression work, namely, improved compression rates and the inherent scalability. Significant


    improvements in data compression rates have been achieved by sharing the computational

    requirement between compressors without significantly compromising the contribution made by

    individual compressors. The scalability feature permits future bandwidth or storage demands to be

    met by adding additional compression engines.

    3.4.1. The XMatchPro based Compression system

    Previous research on the lossless XMatchPro data compressor has been on optimising

    and implementing the XMatchPro algorithm for speed, complexity and compression in hardware.

    The XMatchPro algorithm uses a fixed width dictionary of previously seen data and attempts to

    match the current data element with a match in the dictionary. It works by taking a 4-byte word and

    trying to match this word with past data. This past data is stored in a dictionary, which is constructed

    from a content addressable memory.

Initially, all the entries in the dictionary are empty, and 4 bytes are added to the front of the

dictionary, while the rest move one position down, if a full match has not occurred. The larger the

dictionary, the greater the number of address bits needed to identify each memory location, reducing

compression performance. Since the number of bits needed to code each location address is a

function of the current dictionary size, greater compression is obtained in comparison to the case where

a fixed-size dictionary uses fixed address codes for a partially full dictionary.

    In the parallel XMatchPro system, the data stream to be compressed enters the

    compression system, which is then partitioned and routed to the compressors. For parallel

    compression systems, it is important to ensure all compressors are supplied with sufficient data by

    managing the supply so that neither stall conditions nor data overflow occurs.

    3.4.2. The Main Component- Content Addressable Memory

    Dictionary based schemes copy repetitive or redundant data into a lookup table (such as

    CAM) and output the dictionary address as a code to replace the data. The compression architecture is

    based around a block of CAM to realize the dictionary. This is necessary since the search

operation must be done in parallel in all the entries in the dictionary to allow high and data-independent

throughput.

    Fig.3.3. Conceptual view of CAM

    The number of bits in a CAM word is usually large, with existing implementations

ranging from 36 to 144 bits. A typical CAM employs a table size ranging from a few

hundred entries to 32K entries, corresponding to an address space ranging from 7 to 15 bits. The

    length of the CAM varies with three possible values of 16, 32 or 64 tuples trading complexity for

    compression.

The number of tuples present in the dictionary has an important effect on compression. In principle,

    the larger the dictionary the higher the probability of having a match and improving

compression. On the other hand, a bigger dictionary uses more bits to code its locations, degrading

compression when processing small data blocks that only use a fraction of the dictionary length

available. The width of the CAM is fixed at 4 bytes/word. Content addressable memory (CAM)

    compares input search data against a table of stored data, and returns the address of the matching

    data. CAMs have a single clock cycle throughput making them faster than other hardware and

    software-based search systems.

The input to the system is the search word, which is broadcast onto the search lines to the table

of stored data. Each stored word has a matchline that indicates whether the search word and the

stored word are identical (the match case) or different (a mismatch case, or miss). The matchlines

are fed to an encoder that generates a binary match location corresponding to the matchline that is

in the match state. An encoder is used in systems where only a single match is expected. The overall

function of a CAM is to take a search word and return the matching memory location.
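This behaviour can be modelled in a few lines. The sketch below is a hypothetical software model, not a hardware description: a real CAM compares all stored words in the same clock cycle, whereas the loop here only emulates the priority encoder that picks the lowest matching address.

```python
def cam_search(cam_words, search_word):
    """Behavioural model of one CAM lookup: every stored word is compared
    against the search word at once; a priority encoder then returns the
    lowest matching address, or None on a miss."""
    matchlines = [word == search_word for word in cam_words]  # parallel compare
    for address, hit in enumerate(matchlines):                # priority encode
        if hit:
            return address
    return None

cam = [b"\x00\x00\x00\x00", b"ABCD", b"EFGH", b"ABCD"]
print(cam_search(cam, b"EFGH"))  # 2
print(cam_search(cam, b"ZZZZ"))  # None
```

Note that the duplicate `b"ABCD"` at address 3 can never win: the lower address always has priority, which is the property the next section relies on for dictionary initialization.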

    3.4.2.1. Managing Dictionary entries

    Since the initialization of a compression CAM sets all words to zero, a possible input

word formed by zeros will generate multiple full matches in different locations. The XMatchPro

compression system simply selects the full match closest to the top. This operational mode initializes

    the dictionary to a state where all the words with location address bigger than zero are declared

    invalid without the need for extra logic. The reason is that location x can never generate a match until

    the data contents of location x-1 are different from zero because locations closer to the top have

higher priority when generating matches. Also, to increase dictionary efficiency, only one dictionary

position contains repeated information and, in the best case, all the dictionary positions contain

different data.


    CHAPTER 4


    XMATCHPRO LOSSLESS COMPRESSION SYSTEM

4.1. DESIGN METHODOLOGY

The XMatchPro algorithm is efficient at compressing the small blocks of data necessary

    with cache and page based memory hierarchies found in computer systems. It is suitable for high

    performance hardware implementation. The XMatchPro hardware achieves a throughput 2-3

times greater than other high-performance hardware implementations. The core component of the

    system is the XMatchPro based Compression / Decompression system. The XMatchPro is a high-

    speed lossless dictionary based data compressor. The XMatchPro algorithm works by taking an

incoming four-byte tuple of data and attempting to fully or partially match the tuple with the

    past data.

    4.2. FUNCTIONAL DESCRIPTION

    The XMatchPro algorithm maintains a dictionary of data previously seen and attempts to

    match the current data element with an entry in the dictionary, replacing it with a shorter code

    referencing the match location. Data elements that do not produce a match are transmitted in full

    (literally) prefixed by a single bit. Each data element is exactly 4 bytes in width and is referred to

as a tuple. This feature gives a guaranteed input data rate during compression and thus also guaranteed

    data rates during decompression, irrespective of the data mix. Also the 4-byte tuple size gives an

    inherently higher throughput than other algorithms, which tend to operate on a byte stream.


The dictionary is maintained using a move-to-front strategy, whereby the current tuple is

    placed at the front of the dictionary and the other tuples move down by one location as

    necessary to make space. The move to front strategy aims to exploit locality in the input data. If the

    dictionary becomes full, the tuple occupying the last location is simply discarded.

    A full match occurs when all characters in the incoming tuple fully match a Dictionary

entry. A partial match occurs when at least two of the characters in the incoming tuple

    match exactly with a dictionary entry, with the characters that do not match being transmitted

    literally.

    The use of partial matching improves the compression ratio when compared with

allowing only 4-byte matches, but still maintains high throughput. If neither a full nor

    partial match occurs, then a miss is registered and a single miss bit of 1 is transmitted followed by the

    tuple itself in literal form. The only exception to this is the first tuple in any compression operation,

    which will always generate a miss as the dictionary begins in an empty state. In this case no miss bit is

    required to prefix the tuple.

    At the beginning of each compression operation, the dictionary size is reset to zero. The

    dictionary then grows by one location for each incoming tuple being placed at the front of the

    dictionary and all other entries in the dictionary moving down by one location. A full match

    does not grow the dictionary, but the move-to-front rule is still applied. This growth of the

    dictionary means that code words are short during the early stages of compressing a block. Because the

    XMatchPro algorithm allows partial matches, a decision must be made about which of the

    locations provides the best overall match, with the selection criteria being the shortest possible

    number of output bits.

4.3. Parallel XMatchPro Compression

The input router of the system divides the data to be processed, and the output router concatenates

the results to form the output of the parallel compression system. The data split by the input

router are sent to each of the XMatchPro compression engines, where the

data is compressed and then sent to the output router, which merges the compressed data and sends it out as

the compressed output.


    For multiple compression systems, it is important to ensure all compressors are supplied

    with sufficient data by managing the supply so that neither stall conditions nor data overflow occurs.

    There are several approaches in which data can be routed in and out of the compressors. The

basic method for input routing used in this project is to take an input twice the width of a single

XMatchPro compressor: the lower 32 bits are given to Compressor 0 and the higher 32 bits are

given to Compressor 1. The same method is used for output routing, with additional output

pins assigned for both Compressor 0 and Compressor 1.

    4.4. DATA FLOW FOR PARALLEL XMATCHPRO COMPRESSOR

The figure below shows graphically the general concept of this approach. The data

stream to be compressed enters the compression system, which is then partitioned and routed to

    the compressors. Appropriate methods for routing the data are discussed below, but to achieve

    good compression performance, it is important that the partitioning mechanism supplies the

compressors with sufficient data to keep them active for as great a proportion as possible of the time that

the stream is entering the system.

    As the compressors operate independently, each producing its own compressed data

    stream, a mechanism is required to merge these streams in such a way that subsequent

    decompression can be performed correctly. Also, subsequent decompression needs to be capable of

    operating in an appropriate parallel fashion, otherwise, a disparity in compression and decompression


    speeds will occur, reducing overall throughput.

The data flow for the parallel compression system is given in Figure 3 below.

4.5. INPUT ROUTING

As per the algorithm, XMatchPro can process four bytes of data per clock cycle; to ensure

    that all are busy, data must enter the system at a rate of 4n bytes per clock cycle, where n is the

number of compressors in the system. This can be achieved by two methods:

    1. Interleaved input method

    2. Blocked Input method

4.5.1. INTERLEAVED INPUT METHOD

In the interleaved input approach, the router divides the input data into 4-byte-wide

data streams that are fed into the compressors. This is illustrated in the figure below for two

    compressors, but the technique can be extended to supply data to any required number of

    compressors.


[Figure: the input router (IR) deals incoming tuples alternately to two XMatchPro compressors, sending tuples 1, 3, 5, 7 to one and tuples 2, 4, 6, 8 to the other.]

Fig.4.3. Interleaved Input Routing

    The interleaved method avoids the need for input buffering as data are continuously fed

    to the compressors and acted upon immediately on arrival. This minimization of latency is an

    important advantage of the approach.
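The interleaved routing rule can be sketched in software as follows. This is an illustrative model only (the function name and byte-oriented framing are assumptions); it deals successive 4-byte tuples round-robin to n compressors, matching the tuple numbering of Fig.4.3:

```python
def interleave_route(stream: bytes, n_compressors: int = 2):
    """Split the input into 4-byte tuples and deal them round-robin to
    the compressors, as in the interleaved input method."""
    lanes = [bytearray() for _ in range(n_compressors)]
    tuples = [stream[i:i + 4] for i in range(0, len(stream), 4)]
    for k, t in enumerate(tuples):
        lanes[k % n_compressors] += t     # tuple k goes to compressor k mod n
    return [bytes(lane) for lane in lanes]

lanes = interleave_route(bytes(range(32)), 2)
print(lanes[0].hex())  # tuples 1, 3, 5, 7 of the original stream
```

Because each lane receives a tuple every n cycles, no input buffering is required, which is the latency advantage noted above.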

    4.5.2. BLOCKED INPUT METHOD

    In the blocked input approach, a fixed length block of data is sent from the incoming data

    stream to each of the compressors in turn, as shown in the following figure. In this scheme, the

    data has to arrive at the dedicated memory of the compressor at a rate slower than it can be processed,

    thereby allowing the memory to be filled with data.


    To minimize the latency introduced in blocked mode, compressors need to start

    processing data as it arrives. It is also important to ensure that sufficient data are available for the

    compressor to work on while data are being routed to the other compressors, as no new data can be

    added to the dedicated memory until this process has been completed.

    4.5.3. PROPOSED INPUT ROUTING

In this project, the blocked input routing method is used for inputting data to the compression

system, as it is more advantageous than the interleaved input approach. The advantage of

this method is that the complexity in designing and coding is reduced, which helps in achieving a

superior compression ratio. At the same time, the number of input pins increases, as another

set of pins is assigned for the second XMatchPro compressor. The input data size for one

XMatchPro compressor is 32 bits, so another 32 bits are required for the second XMatchPro

compressor. To achieve this, the parallel compressor's input is defined as 64 bits: the lower-order

32 bits are sent to one XMatchPro compressor and the higher-order 32 bits are sent to the second

XMatchPro compressor. Both XMatchPro compressors are thus supplied with data simultaneously,

which increases the speed of compression.
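The 64-bit split described above amounts to a simple bit-level operation, sketched here in illustrative Python (not the hardware description itself):

```python
def split_input(word64: int):
    """Route a 64-bit input word to two compressors: the low 32 bits to
    Compressor 0 and the high 32 bits to Compressor 1."""
    low32 = word64 & 0xFFFFFFFF           # routed to XMatchPro compressor 0
    high32 = (word64 >> 32) & 0xFFFFFFFF  # routed to XMatchPro compressor 1
    return low32, high32

low, high = split_input(0x1122334455667788)
print(hex(low), hex(high))  # 0x55667788 0x11223344
```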

    4.6. OUTPUT ROUTING

    The lengths of the compressed data output blocks from an array of parallel compressors

will generally not be constant due to the variability of redundancy in the data. Since, in

decompression, the system would not know the data boundaries of each block, these data cannot be sent

directly to the output bus, and additional manipulation is needed in order to guarantee that the original

    data can be recovered.

This is achieved by three methods, namely:

    1. Single Compressed Block

    2. Multiple Compressed Block

    3. Interleaved Compressed Block

    4.6.1. SINGLE COMPRESSED BLOCK

    In this method, it is assumed that the data enters the system using the blocked mode

technique and that the compressed data are collected in the compressors' output buffers. The

    buffer outputs are routed in strict order of the compressor number and a boundary tag that


    contains information on the block length is added so as to precede the data. As the tag will enter the

    decompression system, first, it will know the length of the compressed data input belonging to any

    given decompression engine. The introduction of tags is detrimental to the compression ratio, but this

    effect diminishes as the block length is increased, as the overhead of one tag per block of compressed

    data is largely constant.

    One of the drawbacks of this approach is that the data output may contain idle time.

    This arises since a whole block of data needs to be compressed before the appropriate tag

    values can be determined and, so, a compressor may still be compressing its data when router becomes

    available.
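The tag-and-recover scheme can be sketched in a few lines of Python (a behavioral illustration only; the 2-byte length tag assumed here stands in for the hardware tag format):

```python
def tag_block(compressed: bytes) -> bytes:
    """Prefix a compressed block with a fixed-size length tag so the
    decompression system can find the block boundary."""
    return len(compressed).to_bytes(2, "big") + compressed

def untag_stream(stream: bytes) -> list[bytes]:
    """Recover the per-compressor blocks from a tagged output stream."""
    blocks, i = [], 0
    while i < len(stream):
        n = int.from_bytes(stream[i:i + 2], "big")   # read the boundary tag
        blocks.append(stream[i + 2:i + 2 + n])       # then n bytes of data
        i += 2 + n
    return blocks
```

The fixed per-block tag is the overhead referred to above: it is constant per block, so its relative cost shrinks as the block length grows.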

    4.6.2. MULTIPLE COMPRESSED BLOCK

Figure 2.7 illustrates the format of an output data stream containing multiple blocks. This is similar to the single-block scheme, but instead of sending each compressor's block as soon as that compressor finishes, all compressors must finish compressing their blocks before the data are sent. In this technique, the tag provides information on the length of the compressed data to enable correct decompression. As all compressors need to have completed their operations before an output can be produced, this approach has a greater latency than the single compressed block case, but, as fewer tags are needed, the effect on the compression ratio is reduced. The combined tag is shorter than the sum of the individual tags because the output bus granularity is of fixed width. Output tags are sized in accordance with the output bus width in order to simplify the routing architecture and decoding operations, even though fewer bits would suffice to determine the block size boundaries.


    4.6.3. INTERLEAVED COMPRESSED BLOCK

The figure illustrates the interleaved approach for routing multiple compressed blocks of data to the output stream. Instead of waiting for a whole block to be compressed, a predefined fixed length of compressed data is always sent to the output. If a compressor has not yet completed its operations, the system must wait until that block of data has been produced.

    There are two benefits of this approach compared with the previously discussed methods.

    First, there is a reduction in latency since data can be sent to the output before the whole block is

    compressed. Second, since no boundary tags are required, the compression ratio is improved.

At the end of a compression sequence, the interleaved approach needs to add dummy data to the output stream. On receipt of the stop signal, output routing continues until all compressors have completed operations on their input blocks. The final interleaved block from each compressor is likely to contain insufficient data to fill the required fixed output length, so dummy data are added as required in order to maintain the interleave length.
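The round-robin interleaving with dummy padding can be modeled as follows (a Python sketch; the 4-byte interleave length and zero-byte padding are assumptions for illustration, not values from the hardware design):

```python
def interleave(outputs: list[bytes], chunk: int = 4) -> bytes:
    """Round-robin fixed-length chunks from each compressor's output,
    padding the final short chunk with dummy bytes so the interleave
    length is always maintained."""
    padded = []
    for out in outputs:
        if len(out) % chunk:                       # final chunk is short:
            out += b"\x00" * (chunk - len(out) % chunk)  # add dummy data
        padded.append(out)
    rounds = max(len(p) // chunk for p in padded)
    stream = b""
    for r in range(rounds):
        for p in padded:
            piece = p[r * chunk:(r + 1) * chunk]
            # a compressor that has already run out of data still
            # contributes a dummy chunk to keep the interleave regular
            stream += piece if piece else b"\x00" * chunk
    return stream
```

Because the chunks are emitted in a fixed rotation, no boundary tags are needed to separate the compressors' data, which is why this method has no detrimental effect on the compression ratio.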


    4.6.4. PROPOSED OUTPUT ROUTING

In this project, the interleaved technique was selected as the output routing method, as it imparts no overhead to maintain compressed data boundaries and so has no detrimental effect on the compression ratio. A further advantage is that the complexity in design and coding is reduced. At the same time, the number of output pins increases, as another set of pins is assigned to the second XMatchPro compressor. The output of one 32-bit compressor is either 7 bits (when a match is found) or 33 bits (when no match is found), so another set of 7-bit and 33-bit outputs is required for the second compressor. Accordingly, the parallel compressor is designed with two 7-bit as well as two 33-bit output ports. Both compressors thus deliver their outputs simultaneously, and the output data are transmitted via the external bus.

    4.7. IMPLEMENTATION OF XMATCHPRO BASED COMPRESSOR

The block diagram gives the details of the components of a single 32-bit compressor. There are three components, namely the COMPARATOR, the ARRAY, and the CAM COMPARATOR. The comparator compares two 32-bit data words and sets its output bit to 1 for equal and 0 for unequal. The CAM comparator searches the CAM dictionary entries for a full match of the given input data. The reason for choosing a full match is to obtain a prototype of the high-throughput XMatchPro compressor with reduced complexity and high performance.

If a full match occurs, the match-hit signal is generated and the corresponding match location is given as output by the CAM comparator. If no full match occurs, the data word presented at the input at that time is passed to the output.
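The match/miss behavior just described can be modeled in a few lines (a Python sketch of the simplified full-match policy, not the VHDL implementation; the class name is illustrative, and replacement of dictionary entries once the CAM is full is ignored here):

```python
class XMatchModel:
    """Behavioral model of the full-match compressor step: search the
    64-entry dictionary; on a hit emit (1, location), on a miss store
    the word in the next free location and emit (0, word)."""
    def __init__(self, size: int = 64):
        self.cam: list[int] = []   # dictionary contents, in fill order
        self.size = size

    def compress_word(self, word: int):
        if word in self.cam:                     # CAM full match
            return (1, self.cam.index(word))     # match hit + location
        if len(self.cam) < self.size:            # store in next free slot
            self.cam.append(word)
        return (0, word)                         # miss: emit literal word
```

A hit therefore costs 1 + 6 = 7 bits (match bit plus a 6-bit location for 64 entries), while a miss costs 1 + 32 = 33 bits, which matches the output widths stated for the 32-bit compressor.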

The array consists of 64 locations of 32 bits each (64×32). It stores unmatched incoming data; when a new data word arrives, it is compared with all the data stored in this array. If a match occurs, the corresponding match location is sent as output; otherwise the incoming data word is stored in the next free location of the array and is itself sent as output. The last component, the CAM comparator, outputs the match location of the CAM dictionary when a match has occurred, using the match information it receives from the comparator.

If the output of the comparator goes high for any input, a match has been found, and the corresponding address is retrieved and sent as output along with one bit indicating the match. If no match occurs, the incoming data word is stored in the array and sent as the output. These are the functions of the three components of the compressor. The hardware descriptions of these modules are written in VHDL. VHDL stands for VHSIC (Very High Speed Integrated Circuit) Hardware Description Language. It can be used to model a digital system at many levels of abstraction, ranging from the algorithmic level to the gate level.

    The VHDL language can be regarded as an integrated amalgamation of the following

    languages:

    o Sequential language

    o Concurrent language

    o Net-list language

    o Timing specifications

    o Waveform generation language.

The language therefore has constructs for expressing the concurrent or sequential behavior of a digital system, with or without timing. It also allows a system to be modeled as an interconnection of components, and test waveforms can be generated using the same constructs. The language defines not only the syntax but also very clear simulation semantics for each language construct, so models written in VHDL can be verified using a VHDL simulator. VHDL is event-driven, to allow efficient simulation of hardware: computations are performed only when some data has changed (an event has occurred).


    CHAPTER 5


    DESIGN OF PARALLEL LOSSLESS

    COMPRESSION/DECOMPRESSION SYSTEM

    5.1. DESIGN OF COMPRESSOR / DECOMPRESSOR

The block diagram [Fig.12] gives the details of the components of a single 32-bit compressor/decompressor. The same design approach is used for the 64-bit compression/decompression system, which is used essentially for comparison of the compression rates achieved by the 64-bit lossless parallel high-speed data compression system. There are three components, namely the COMPRESSOR, the DECOMPRESSOR, and the CONTROL. The compressor has the following components: COMPARATOR, ARRAY, and CAM COMPARATOR. The comparator compares two 32-bit data words and sets its output bit to 1 for equal and 0 for unequal.

The array consists of 64 locations of 32 bits each (64×32). It stores unmatched incoming data; when the next new data word arrives, it is compared with all the data stored in this array. If the incoming data word matches any of the stored entries, the comparator generates a match signal and sends it to the CAM comparator. The last component, the CAM comparator, feeds the incoming data word and the stored array entries, one by one, to the comparator. If the output of the comparator goes high for any input, a match has been found, and the corresponding address (match location) is retrieved and sent as output along with one bit indicating the match. If no match is found, the incoming data word is stored in the array and sent as output. These are the functions of the three components of the XMatchPro-based compressor.

The decompressor has the following components: an array and a processing unit. The array has the same function and length as the array used in the compressor. The processing unit checks the incoming match-hit bit: if it is 0, the data word is not present in the array, so the unit stores it in the array; if it is 1, the data word is present in the array, so the unit fetches it from the array using the address input and sends it to the data output.
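A corresponding software sketch of the processing unit (Python, illustrative only; the class name is an assumption and array-full handling is omitted):

```python
class DeXMatchModel:
    """Behavioral model of the decompressor's processing unit: a match
    hit of 1 reads the word back from the array at the given location;
    a hit of 0 stores the literal word and passes it through."""
    def __init__(self):
        self.array: list[int] = []   # mirrors the compressor's dictionary

    def decompress_word(self, hit: int, value: int) -> int:
        if hit:                      # value is the match location
            return self.array[value]
        self.array.append(value)     # value is the literal data word
        return value
```

Because the decompressor fills its array in exactly the same order as the compressor filled its dictionary, the match locations received in the compressed stream always point at the correct word.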


    Fig.5.1. Block Diagram of 32 bit Compressor/Decompressor


The control unit has an input bit called C/D (Compression/Decompression), which indicates whether compression or decompression is to be performed: when its value is 0 compression is started, and when its value is 1 decompression is done.

    5.2. DESIGN OF 64 BIT SINGLE COMPRESSION/DECOMPRESSION SYSTEM

The 64-bit single compression/decompression system is implemented to compare its compression rate and area with those of the parallel compression/decompression system, which gives higher throughput. The design and functionality of the 64-bit single compression system are the same as those of the 32-bit compressor/decompressor discussed above, except that the input is widened from 32 to 64 bits and, to accommodate more data in the CAM dictionary, the array size is increased from 64×32 to 128×64. The match location is now given by 7 bits for the fixed 128 locations of the memory (2^7 = 128).

    In the Compression system, the comparator compares the incoming 64 bit data with data

    entries that are previously stored in the memory. If any of the dictionary entries matches with the

    incoming data, then a match signal is generated to provide the corresponding match location as

    output along with match signal. If no match occurs, then the incoming data is stored in the dictionary

    entry and the data is given as output of the compressor.

The decompression system accordingly receives either the 64-bit data word (if no match occurred) or a 1-bit match signal and a 7-bit match location, which is processed by the 128×64 array in the decompressor to give the data held at the match location as output. The block diagram of the 64-bit compression/decompression system is given below.


    Fig.5.2. Block Diagram of 64 bit Compression / Decompression system

    5.3. PARALLEL COMPRESSION / DECOMPRESSION SYSTEM


    5.3.1 DESIGN OF PARALLEL COMPRESSION SYSTEM

The block diagram gives the details of the components of the parallel compression system. Here the compressor is instantiated twice, once for each of the two parallel engines, so the numbers of input and output pins are twice those of the single compressor. The components of each instantiated compressor are the same as those of the 32-bit compressor, namely the COMPARATOR, the ARRAY, and the CAM COMPARATOR.

The comparator compares two 32-bit data words and sets its output bit to 1 for equal and 0 for unequal. The array consists of 64 locations of 32 bits each. It stores unmatched incoming data; when a new data word arrives, it is compared with all the data stored in this array for a match. If no match occurs, the incoming data word is stored in the next free location of the array. The last component, the CAM comparator, feeds the incoming data word and the stored array entries, one by one, to the comparator.


If the output of the comparator goes high for any input, a match has been found, and the corresponding address is retrieved and sent as output along with one bit indicating the match. If no match is found, the incoming data word is stored in the array and sent as output. These are the functions of the three components of the 32-bit compressor.

    5.3.2 DESIGN OF PARALLEL DECOMPRESSION SYSTEM

The parallel decompression system is likewise implemented by concatenating the outputs of the two compressors in the parallel architecture and giving those data as input to the parallel decompression system, which comprises two of the 32-bit decompressors discussed above for the single compression system. The 32-bit decompressor has two components: an array and a processing unit. The array has the same function and length as the array used in the compressor. The processing unit checks the incoming match-hit bit: if it is 0, the data word is not present in the array, so the unit stores it in the array; if it is 1, the data word is present in the array, so the unit fetches it from the array using the address input (match location) and sends it to the data output.
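Putting the pieces together, the parallel path (split the 64-bit word, compress and decompress each half with its own independent engine, recombine) can be sketched as a software round trip (Python model under the same simplifying assumptions as before: full-match only, no dictionary-full handling, illustrative names):

```python
MASK32 = (1 << 32) - 1

def make_compressor():
    """One 32-bit engine: returns a compress step over its own dictionary."""
    cam: list[int] = []
    def compress(w):
        if w in cam:
            return (1, cam.index(w))   # hit: match bit + location
        cam.append(w)
        return (0, w)                  # miss: match bit + literal word
    return compress

def make_decompressor():
    """One 32-bit decompressor: rebuilds the same dictionary on the fly."""
    arr: list[int] = []
    def decompress(hit, v):
        if hit:
            return arr[v]
        arr.append(v)
        return v
    return decompress

def parallel_roundtrip(words64):
    """Compress and decompress a stream of 64-bit words using two
    independent 32-bit engines on the low and high halves."""
    c_lo, c_hi = make_compressor(), make_compressor()
    d_lo, d_hi = make_decompressor(), make_decompressor()
    out = []
    for w in words64:
        lo, hi = w & MASK32, (w >> 32) & MASK32
        rlo, rhi = d_lo(*c_lo(lo)), d_hi(*c_hi(hi))
        out.append((rhi << 32) | rlo)
    return out
```

Each half-stream is self-contained: the low-word dictionary and the high-word dictionary never interact, which is what allows the two engines to run simultaneously in hardware.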

    5.4. SIMULATION RESULTS

The design, coded in VHDL, is simulated using ModelSim from Mentor Graphics. The obtained waveforms are as follows.

    Fig.5.4.Comparator


    Fig.5.5. Cam Comparator


    Fig.5.6.Content Addressable Memory


    Fig.5.7. 32-bit Single Compression Top Module

    Fig.5.8. 32-bit Single Compression Top Module Decimal inputs


    Fig.5.9. 64-bit Single Compression System -Top module

    Fig.5.10. 64-bit Single Compression System -Test bench Waveform


    Fig.5.11. 32-bit Single Decompression Top Module

    Fig.5.12. 32-bit Single Decompression- Test bench Waveform


    Fig.5.13. Parallel Compression System - 64-bit input Top module

    Fig.5.14. Parallel Compression System - 64-bit input Test bench


    5.5. RTL SCHEMATIC

The RTL schematics for the VHDL code are generated using Xilinx Project Navigator 8.1i.

    Fig.5.15. 32 bit Single Compression System

    Fig.5.16. 32 bit Single Compression System


    Fig.5.17. 64 bit Single Compression System

    Fig.5.18. RTL Schematic for 64 bit Single Compression System


    Fig5.19. 64 bit Parallel Compression System

    Fig.5.20. RTL Schematic for 64 bit Parallel Compression System

    5.6. Xilinx Synthesis Results for Target Device xc2v1500bg575-6


    5.6.1. 32-bit Single Compression System

===============================================================
* Synthesis Options Summary *
===============================================================
---- Source Parameters
Input File Name                  : "xmatchpro.prj"
Input Format                     : mixed
Ignore Synthesis Constraint File : NO
---- Target Parameters
Output File Name                 : "xmatchpro"
Output Format                    : NGC
Target Device                    : xc2v1500-6-bg575
===============================================================
* HDL Compilation *
===============================================================
Compiling vhdl file "E:/proj/xilinx/s_comp32/s_comp32/comparator.vhd" in Library work.
Architecture arch_comp of Entity comparator is up to date.
Compiling vhdl file "E:/proj/xilinx/s_comp32/s_comp32/camcomp.vhd" in Library work.
Architecture arch_cam64 of Entity camcomp is up to date.
Compiling vhdl file "E:/proj/xilinx/s_comp32/s_comp32/cam.vhd" in Library work.
Architecture arch_cam of Entity cam is up to date.
Compiling vhdl file "E:/proj/xilinx/s_comp32/s_comp32/xmatchpro.vhd" in Library work.
Architecture arch_xmatch of Entity xmatchpro is up to date.

Table 5.1. 32-bit Single Compression System - HDL Synthesis Report

Macro Statistics                    No.
# ROMs                              64
  4x1-bit ROM                       64
# Adders/Subtractors                1
  32-bit adder                      1
# Registers                         68
  1-bit register                    1
  32-bit register                   66
  6-bit register                    1
# Latches                           2
  1-bit latch                       1
  6-bit latch                       1
# Comparators                       64
  32-bit comparator equal           64


5.6.2. 64-bit Single Compression System
===============================================================
* Synthesis Options Summary *
===============================================================
---- Source Parameters
Input File Name                  : "xmatchpro.prj"
Input Format                     : mixed
Ignore Synthesis Constraint File : NO
---- Target Parameters
Output File Name                 : "xmatchpro"
Output Format                    : NGC
Target Device                    : xc2v1500-6-bg575
===============================================================
* HDL Compilation *
===============================================================
Compiling vhdl file "E:/proj/xilinx/s_comp64/s_comp64/comparator.vhd" in Library work.
Architecture arch_comp of Entity comparator is up to date.
Compiling vhdl file "E:/proj/xilinx/s_comp64/s_comp64/camcomp.vhd" in Library work.
Architecture arch_cam64 of Entity camcomp is up to date.
Compiling vhdl file "E:/proj/xilinx/s_comp64/s_comp64/cam.vhd" in Library work.
Architecture arch_cam of Entity cam is up to date.
Compiling vhdl file "E:/proj/xilinx/s_comp64/s_comp64/xmatchpro.vhd" in Library work.
Architecture arch_xmatchpro of Entity xmatchpro is up to date.

    Table 5.2. 64-bit Single Compression System - HDL Synthesis Report

Macro Statistics                    Nos.
# ROMs                              128
  4x1-bit ROM                       128
# Adders/Subtractors                1
  32-bit adder                      1
# Registers                         132
  1-bit register                    1
  32-bit register                   1
  64-bit register                   129
  7-bit register                    1
# Latches                           2
  1-bit latch                       1
  7-bit latch                       1
# Comparators                       128
  64-bit comparator equal           128


5.6.4. 64-bit Parallel Decompression System
===============================================================
* Synthesis Options Summary *
===============================================================
---- Source Parameters
Input File Name                  : "LL_decomp.prj"
Input Format                     : mixed
Ignore Synthesis Constraint File : NO
---- Target Parameters
Output File Name                 : "LL_decomp"
Output Format                    : NGC
Target Device                    : xc2v1500-6-bg575
===============================================================
* HDL Compilation *
===============================================================
Compiling vhdl file "E:/proj/xilinx/dual_decomp/dual_decomp/de_xmatchpro.vhd" in Library work.
Architecture arch_de_camcomparator of Entity de_xmatchpro is up to date.
Compiling vhdl file "E:/proj/xilinx/dual_decomp/dual_decomp/LL_decomp.vhd" in Library work.
Architecture arch_dualdecomp of Entity ll_decomp is up to date.

    Table 5.4. 64-bit Parallel Decompression System - HDL Synthesis Report

Macro Statistics                    Nos.
# Adders/Subtractors                2
  32-bit adder                      2
# Latches                           130
  32-bit latch                      130
# Multiplexers                      2
  32-bit 64-to-1 multiplexer        2


CHAPTER 6

ANALYSIS OF RESULTS

    6.1. Device Utilization of Various Modules

    Table 6.1.

    Compression Device Utilization Summary for Selected Device: xc2v1500bg575-6

Module                   32-bit Single Compression   64-bit Single Compression   64-bit Parallel Compression
Number of Slices         1756 out of 7680 (22%)      6819 out of 7680 (88%)      3560 out of 7680 (46%)
Number of Slice
Flip Flops               2064 out of 15360 (13%)     8168 out of 15360 (53%)     4206 out of 15360 (27%)
Number of 4-input LUTs   1368 out of 15360 (8%)      4776 out of 15360 (31%)     2930 out of 15360 (19%)
Number of bonded IOBs    74 out of 392 (18%)         139 out of 392 (35%)        145 out of 392 (36%)
IOB Flip Flops           39                          72                          78
Number of GCLKs          2 out of 16 (12%)           2 out of 16 (12%)           2 out of 16 (12%)


    6.2. CADENCE RTL Compiler Reports

The hardware designs are compiled with the Cadence RTL Compiler, and the results are as follows:

    6.2.1. 32-bit Single Compression System

    6.2.1.1. Area Report

============================================================
Generated by:         Encounter(r) RTL Compiler v06.10-p003_1
Generated on:         Apr 17 2007 08:42:56 PM
Module:               scomp_32
Technology libraries: typical 1.3
                      tpz973gtc 230
                      ram_128x16A 0.0
                      ram_256x16A 0.0
                      rom_512x16A 0.0
                      pllclk 4.3
Operating conditions: typical (balanced_tree)
Wireload mode:        segmented
============================================================
Instance   Cells   Cell Area   Net Area   Wireload
----------------------------------------------------------------------
scomp_32   5393    116863      0          TSMC32K_Conservative (S)

6.2.1.2. Power Report
============================================================
Generated by:         Encounter(r) RTL Compiler v06.10-p003_1
Generated on:         Apr 17 2007 08:43:13 PM
Module:               scomp_32
Technology libraries: typical 1.3
                      tpz973gtc 230
                      ram_128x16A 0.0
                      ram_256x16A 0.0
                      rom_512x16A 0.0
                      pllclk 4.3
Operating conditions: typical (balanced_tree)
Wireload mode:        segmented
============================================================
                    Leakage     Internal      Net Switching
Instance   Cells    Power(nW)   Power(nW)     Power(nW)       Power(nW)
-----------------------------------------------------------------------
scomp_32   5393     4.255       5832894.166   2001783.940     7834678.105

6.2.1.3. Timing Report
============================================================
Generated by:         Encounter(r) RTL Compiler v06.10-p003_1
Generated on:         Apr 17 2007 08:42:21 PM
Module:               scomp_32
Technology libraries: typical 1.3
                      tpz973gtc 230
                      ram_128x16A 0.0
                      ram_256x16A 0.0
                      rom_512x16A 0.0
                      pllclk 4.3
Operating conditions: typical (balanced_tree)
Wireload mode:        segmented
============================================================
Pin                         Type     Fanout   Load   Slew   Delay   Arrival
                                              (fF)   (ps)   (ps)    (ps)
----------------------------------------------------------------------
(clock clk)                 launch                                  0 R
u3 Wr_addr_reg_reg[31]/CK   setup    0                      +365    7451 R
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(clock clk)