
Performance Enhancement of Video Compression Algorithms with SIMD

Report Submitted By: Shamik Valia

Saket Jamkar

1 Introduction

Video compression algorithms have a large number of applications, ranging from video conferencing to video on demand to video phones. Video compression standards (such as MPEG-1, 2, 4, 7) and teleconferencing standards (such as H.2XX) are vital in these and other multimedia applications, whose performance is critical given the high data rates that are common for video. The timing constraints at such data rates can be challenging even for custom video codecs and overwhelming for some state-of-the-art superscalar processors. Performing these operations in real time is not easy on most platforms if image resolutions of acceptable quality are desired.

The algorithms, however, consist of repetitive and regular operations by nature, which

could benefit greatly from the use of some architectures that are better able to perform

such repetitive tasks efficiently. In recent years, general purpose microprocessors have

also been endowed with functional units capable of Single Instruction Stream Multiple

Data Stream (SIMD) operation. This project studies the speedup achievable for the most critical parts of these algorithms using the Streaming SIMD Extensions 2 (SSE2) of the Intel Pentium 4 processor. We also discuss schemes for improving the DCT algorithm on the SSE2 architecture.

2 Basic steps used in Video Compression

The video compression algorithm used in numerous standards (such as MPEG-1, MPEG-2 and H.263) usually consists of the following steps:

1. Motion Estimation

2. Motion Compensation and Image Subtraction

3. Discrete Cosine Transform

4. Quantization

5. Run Length Encoding

6. Entropy Coding – Huffman Coding

We examine each of these steps in greater detail in this section.

2.1 Motion Estimation

Motion estimation is the process of calculating motion vectors by finding, in the future frame, the blocks that match blocks in the current frame. Motion estimation helps in detecting temporal redundancy. Various search algorithms have been devised for estimating motion. The basic assumption underlying these algorithms is that only translational motion can be compensated for. Rotational motion and zooming cannot be estimated by block-based search algorithms. Motion estimation is known to be the most crucial and computationally intensive process in the video compression algorithm.

Figure 2.1.1 Search region (-p, p)

Since most video streams have a frame rate of 15 to 30 frames per second, objects rarely move far between two successive frames. Therefore most search algorithms look for the matching block in the neighborhood of the current block's position in the next frame. The region in which the matching block is searched for is called the search region. The search region around a block is shown in Figure 2.1.1.

The choice of p depends on the type of broadcast to be sent. For fast-moving video such as sports events, a higher value of p such as 16 or 32 may be used. For broadcasts with less motion, such as a news telecast, a smaller value of p such as 4 or 8 may be used.

Figure 2.1.2 Motion vector and best match

The task of the search algorithm is to find, in the next frame, the best match for a block in the current frame. A typical block size is 8x8 or 16x16 pixels. The quality of a match is measured by the Mean Absolute Error (MAE) between the blocks: the average absolute pixel-wise difference between the reference block in the current frame and the candidate match in the next frame. The smaller the MAE, the better the match. The displacement of the block with the minimum MAE is taken as the motion vector.

The formula for MAE is given by:

$$\mathrm{MAE}(i,j) = \frac{1}{MN} \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} \left| C(x+k,\, y+l) - R(x+i+k,\, y+j+l) \right|$$

Next we explain the two search algorithms that were used in this project.

2.1.1 Full Search

Full search is an exhaustive search algorithm. Full search is the simplest method to find

the motion vector for each block; in it the MAE(i,j) is found at each point in the search

region. Thus a search for the match block is made in the complete (-p, +p) range in the

future frames for every block of the current frame.

For each motion vector, there are $(2p)^2$ search locations. At each search location (i,j) we compare N x M pixels. Each pixel comparison requires three operations, namely a subtraction, an absolute value calculation and an addition. We ignore the cost of accessing the pixels C(x + k, y + l) and R(x + i + k, y + j + l). Thus the total complexity per block is $(2p)^2 \times MN \times 3$ operations. For a picture resolution of I x J and a picture rate of F pictures/second, the overall complexity is $(IJF/MN) \times (2p)^2 \times MN \times 3$ operations per second.
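As an illustration of the procedure above, the following is a minimal C sketch of full search. It is not the project's own code (which appears in the appendix). The names `block_sad` and `full_search`, the row-major frame layout and the frame-boundary check are assumptions made for this example. It compares the sum of absolute differences rather than the MAE, since dividing by MN does not change which candidate is smallest, and it scans displacements from -p to +p inclusive, which gives (2p+1)^2 candidates.

```c
#include <limits.h>
#include <stdlib.h>

/* Sum of absolute differences between the M x N block of `cur` at (x, y)
   and the block of `ref` displaced by (i, j). Both frames are row-major
   with `width` pixels per row. Dividing the result by M*N gives the MAE. */
long block_sad(const unsigned char *cur, const unsigned char *ref,
               int width, int x, int y, int i, int j, int M, int N)
{
    long sum = 0;
    for (int l = 0; l < N; l++)
        for (int k = 0; k < M; k++)
            sum += abs((int)cur[(y + l) * width + (x + k)] -
                       (int)ref[(y + j + l) * width + (x + i + k)]);
    return sum;
}

/* Exhaustive search over every displacement (i, j) in [-p, p] x [-p, p].
   The displacement with the smallest SAD is returned as the motion vector. */
void full_search(const unsigned char *cur, const unsigned char *ref,
                 int width, int height, int x, int y,
                 int M, int N, int p, int *best_i, int *best_j)
{
    long best = LONG_MAX;
    *best_i = 0;
    *best_j = 0;
    for (int j = -p; j <= p; j++) {
        for (int i = -p; i <= p; i++) {
            /* Skip candidates that fall outside the reference frame. */
            if (x + i < 0 || y + j < 0 ||
                x + i + M > width || y + j + N > height)
                continue;
            long d = block_sad(cur, ref, width, x, y, i, j, M, N);
            if (d < best) {
                best = d;
                *best_i = i;
                *best_j = j;
            }
        }
    }
}
```

To put the complexity formula in perspective: for a 176 x 144 frame at 30 pictures/second with p = 8, it gives 176 x 144 x 30 x 256 x 3, or roughly 5.8 x 10^8 operations per second.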

This makes full search computationally very intensive: its CPU time is the highest of all the algorithms. At the same time its accuracy is also the highest, since the best match for every block in the current frame is always found. Full search is therefore the benchmark against which the quality of other search algorithms, measured by CPU time and accuracy, is compared. There is a trade-off between the efficiency of an algorithm and the quality of the prediction image. Many algorithms have been developed with this trade-off in mind.

2.1.2 Three Step Search

Three step search became very popular because of its simplicity and its robust, near-optimal performance. It searches for the best motion vector in a coarse-to-fine search pattern. The algorithm may be described as follows (refer to Figure 2.1.3):

Step 1: An initial step size is picked. Eight points at a distance of the step size around the centre point are picked for comparison.

Step 2: The centre is moved to the point with the minimum distortion, and the step size is halved. Steps 1 and 2 are repeated until the step size becomes smaller than 1. A particular path for the convergence of this algorithm is shown below:

[Figure legend: points chosen for the first stage, the second stage and the third stage]

Figure 2.1.3 Example path for convergence of Three Step Search
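The two steps above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names `tss_cost` and `three_step_search`, the square n x n block, the row-major frame layout and the treatment of out-of-frame candidates (they are given the worst possible cost) are all choices made for this example. With an initial step of 4 the loop runs for steps 4, 2 and 1, hence the name.

```c
#include <limits.h>
#include <stdlib.h>

/* SAD between the n x n block of `cur` at (x, y) and the block of `ref`
   at (x + i, y + j). Returns LONG_MAX if the candidate leaves the frame. */
long tss_cost(const unsigned char *cur, const unsigned char *ref,
              int width, int height, int x, int y, int i, int j, int n)
{
    if (x + i < 0 || y + j < 0 || x + i + n > width || y + j + n > height)
        return LONG_MAX;
    long sum = 0;
    for (int l = 0; l < n; l++)
        for (int k = 0; k < n; k++)
            sum += abs((int)cur[(y + l) * width + x + k] -
                       (int)ref[(y + j + l) * width + x + i + k]);
    return sum;
}

/* Coarse-to-fine search: test the 8 neighbours of the current centre at
   distance `step`, move the centre to the best point found, halve the
   step, and repeat until the step drops below 1. */
void three_step_search(const unsigned char *cur, const unsigned char *ref,
                       int width, int height, int x, int y, int n,
                       int step, int *mv_i, int *mv_j)
{
    int ci = 0, cj = 0; /* displacement of the current centre */
    long best = tss_cost(cur, ref, width, height, x, y, 0, 0, n);

    for (; step >= 1; step /= 2) {
        int bi = ci, bj = cj;
        for (int dj = -step; dj <= step; dj += step) {
            for (int di = -step; di <= step; di += step) {
                if (di == 0 && dj == 0)
                    continue; /* the centre was already evaluated */
                long d = tss_cost(cur, ref, width, height, x, y,
                                  ci + di, cj + dj, n);
                if (d < best) {
                    best = d;
                    bi = ci + di;
                    bj = cj + dj;
                }
            }
        }
        ci = bi;
        cj = bj;
    }
    *mv_i = ci;
    *mv_j = cj;
}
```

With a starting step of 4, at most 1 + 3 x 8 = 25 candidates are evaluated per block, compared with every position of the search region in full search. The price is that the search can settle on a local minimum when the error surface is not smooth.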

2.2 Motion Compensation and Image Subtraction

The process of Motion Estimation and Motion Compensation is similar to DPCM. The

idea is to reduce the bandwidth required for the video by sending only the difference

frames instead of the actual frames.

The motion vectors produced during Motion Estimation are utilized in the Motion

Compensation process in order to produce the predicted image in the encoder just like it

would be produced in the Decoder. The two images (current frame and the motion

compensated frame) are now subtracted and the difference is sent to the receiver along

with the motion vectors. Thus the decoder can produce the exact copy of the future frame

by first motion compensating the current frame using the motion vectors and then adding

the difference image.
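The encoder and decoder sides of this exchange can be sketched for a single block as follows. This is an illustrative sketch only: the function names, the row-major layout and the sign convention (current block minus motion-compensated reference block) are assumptions made for the example.

```c
/* Encoder side: form the difference between the n x n block of `cur` at
   (x, y) and the block of `ref` displaced by the motion vector (u, v).
   The residual needs a signed type because differences can be negative. */
void residual_block(const unsigned char *cur, const unsigned char *ref,
                    int width, int x, int y, int u, int v, int n,
                    short *residual)
{
    for (int l = 0; l < n; l++)
        for (int k = 0; k < n; k++)
            residual[l * n + k] =
                (short)(cur[(y + l) * width + x + k] -
                        ref[(y + v + l) * width + x + u + k]);
}

/* Decoder side: motion compensate the reference block with (u, v) and
   add the residual back, which restores the original block exactly
   (before any lossy coding of the residual). */
void reconstruct_block(unsigned char *out, const unsigned char *ref,
                       int width, int x, int y, int u, int v, int n,
                       const short *residual)
{
    for (int l = 0; l < n; l++)
        for (int k = 0; k < n; k++)
            out[(y + l) * width + x + k] =
                (unsigned char)(ref[(y + v + l) * width + x + u + k] +
                                residual[l * n + k]);
}
```

When the motion vector is a good match the residual values cluster around zero, which is what makes the difference image cheaper to code than the frame itself.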

The block diagram of the Encoder is given below in Figure 2.2.1 in order to illustrate the

idea.

Fig. 2.2.1 Block Diagram of Video Encoder

2.3 Discrete Cosine Transform (DCT)

DCT based image coding is the basis for almost all the image and video compression

standards. Discrete Cosine Transform is a derivative of the Discrete Fourier Transform

(DFT), which is encountered very commonly in Digital Signal Processing.

The fundamental operation performed by DCT is to transform the space domain

representation of an image to a spatial frequency domain (known as DCT domain). The

formula for DCT is given below:

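$$Y(k,l) = \frac{C(k)\,C(l)}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} X_{ij} \cos\frac{(2i+1)k\pi}{16} \cos\frac{(2j+1)l\pi}{16}$$

$$C(k) = \begin{cases} 1/\sqrt{2} & \text{if } k = 0 \\ 1 & \text{otherwise} \end{cases}$$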

[Figure 2.2.1 block labels: Frame (n) I(x,y,t); Frame (n+1) I(x,y,t+1); Motion Estimation producing (u,v); Motion Compensation; E(x,y,t) = I(x,y,t) - I(x-u,y-v,t+1); DCT coding]

The DCT transformation can be viewed as the process of finding for each waveform, the

corresponding weight Y(k,l) so that the sum of 64 waveforms scaled by the

corresponding weights Y(k,l) yields the reconstructed version of the original 8 x 8 block.
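A direct evaluation of the DCT formula can be sketched in C as follows. This is a deliberately naive illustration, not the project's DCT code: the names `COS16`, `cos16` and `dct8x8` are assumptions for this example, and the cosines are taken from a small table of cos(n*pi/16) values so that the sketch needs no math library. It performs 64 multiply-accumulates for each of the 64 outputs, which is exactly the kind of regular inner loop that later sections try to speed up.

```c
/* cos(n*pi/16) for n = 0..8; all other angles follow by symmetry. */
static const double COS16[9] = {
    1.0,
    0.98078528040323044, 0.92387953251128676, 0.83146961230254524,
    0.70710678118654752, 0.55557023301960222, 0.38268343236508977,
    0.19509032201612827, 0.0
};

/* Returns cos(n*pi/16) for any n >= 0 using the table above. */
static double cos16(int n)
{
    n %= 32;                /* the period 2*pi equals 32 steps of pi/16 */
    if (n > 16)
        n = 32 - n;         /* cos(2*pi - t) = cos(t) */
    return (n <= 8) ? COS16[n] : -COS16[16 - n]; /* cos(pi - t) = -cos(t) */
}

/* Forward 8 x 8 DCT evaluated directly from the defining formula. */
void dct8x8(double X[8][8], double Y[8][8])
{
    for (int k = 0; k < 8; k++) {
        for (int l = 0; l < 8; l++) {
            /* C(0) = 1/sqrt(2), which is cos(pi/4) = COS16[4]. */
            double ck = (k == 0) ? COS16[4] : 1.0;
            double cl = (l == 0) ? COS16[4] : 1.0;
            double sum = 0.0;
            for (int i = 0; i < 8; i++)
                for (int j = 0; j < 8; j++)
                    sum += X[i][j] * cos16((2 * i + 1) * k)
                                   * cos16((2 * j + 1) * l);
            Y[k][l] = ck * cl / 4.0 * sum;
        }
    }
}
```

For a flat block in which every pixel is 100, only Y(0,0) is non-zero and equals (1/2)(1/4)(64)(100) = 800, a small demonstration of how the DCT packs the energy of a smooth block into few coefficients.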

Energy compaction of DCT is among the highest, second only to the Karhunen-Loeve Transform. This means that most of the information is concentrated in a few coefficients and can therefore be compressed to a very high degree, which is why DCT is commonly used. DCT also reduces the block artifacts present in many other transforms, owing to its favorable periodic nature.

DCT, in principle, is a lossless process. However, due to the finite word-lengths in a

microprocessor, there is some loss of information due to rounding and truncating of

calculated DCT values. This loss of information is irreversible.

2.4 Quantization

The human eye is relatively insensitive to the high frequency content in an image. Removing these spatial frequencies therefore leads to little perceptible loss in image quality. This is the basic principle behind quantization. The spatial frequency content of the image is obtained with the DCT operation; quantization then removes or coarsens the high frequency content.

The JPEG standard recommends standard quantization tables, which are used to de-emphasize higher frequencies in the DCT image. Quantization is a lossy process: some data is lost during quantization, and this loss of information is irreversible.
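The effect of a quantization table can be sketched as follows. This is an illustrative sketch with assumed names (`quantize`, `dequantize`) and round-to-nearest division; the table values used in any example are hypothetical and are not the tables recommended by the JPEG standard.

```c
/* Quantize 64 DCT coefficients: divide each by its table entry and round
   to the nearest integer. Larger table entries discard more precision. */
void quantize(const int coeff[64], const int qtable[64], int level[64])
{
    for (int n = 0; n < 64; n++) {
        int q = qtable[n], c = coeff[n];
        /* Round to nearest, treating negative values symmetrically. */
        level[n] = (c >= 0) ? (c + q / 2) / q : -((-c + q / 2) / q);
    }
}

/* Decoder side: multiply back. The rounding error is not recoverable. */
void dequantize(const int level[64], const int qtable[64], int coeff[64])
{
    for (int n = 0; n < 64; n++)
        coeff[n] = level[n] * qtable[n];
}
```

With a step of 16, for example, a coefficient of 800 survives the round trip unchanged, 7 becomes 0 and -9 becomes -16. Small high-frequency coefficients collapse to zero, which is what produces the long runs of zeroes exploited by the next stage.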

2.5 Run Length Encoding (RLE)

Run-length encoding is the next stage of the compression process. It encodes the runs of

zeroes. If pixel values are correlated to their neighbors, then there will be sequences of

the same value. Instead of coding all the repeat values, just encode the first value and

then give the run length of the sequence. Intuitively, one can understand how RLE can

help in achieving compression. Suppose the data is 00000…0(ten times). Now instead of

writing ten zeroes one can send only 0-10, which could be taken to mean that a zero

occurs 10 times. This is how compression is achieved in Run-Length Encoding. Runs of

zeroes are encoded in a 16 bit or 8 bit format.

A higher compression can be achieved in Run-Length encoding if we somehow obtain

longer strings of zeroes. This is achieved by performing RLE in a zigzag manner on a

block. In the DCT image the higher frequency content is always found towards the lower

right hand corner of the DCT image while the lowest frequencies are in the upper left

hand corner of the image. During quantization the higher frequencies are reduced to zero

and therefore the values in the lower right hand side are mostly zero. Therefore by

performing RLE in a zig-zag manner, we try to obtain runs of zeroes out of the lower

right hand side of the DCT domain representation.
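The zigzag scan followed by run-length encoding can be sketched in C as follows. The pair format used here, (number of preceding zeroes, non-zero value) with a single (0, 0) pair standing for the trailing run of zeroes, is an assumption borrowed from JPEG-style coders and is not necessarily the 16 bit or 8 bit format used in this project. The function names are likewise illustrative.

```c
/* Fill order[0..63] with the row-major indices of an 8 x 8 block visited
   in zigzag order, from the top-left (lowest frequency) corner to the
   bottom-right (highest frequency) corner. */
void zigzag_order(int order[64])
{
    int n = 0;
    for (int s = 0; s < 15; s++) {         /* s = row + col: one diagonal */
        int lo = (s < 8) ? 0 : s - 7;      /* smallest valid row */
        int hi = (s < 8) ? s : 7;          /* largest valid row */
        if (s % 2 == 0)                    /* even diagonals run upwards */
            for (int r = hi; r >= lo; r--)
                order[n++] = r * 8 + (s - r);
        else                               /* odd diagonals run downwards */
            for (int r = lo; r <= hi; r++)
                order[n++] = r * 8 + (s - r);
    }
}

/* Run-length encode a quantized block scanned in zigzag order.
   Each output pair is (zeroes skipped, non-zero value). If the block
   ends in zeroes, one (0, 0) pair marks the end of the block.
   Returns the number of pairs written (at most 64). */
int rle_encode(const int block[64], const int order[64],
               int runs[64], int values[64])
{
    int n = 0, run = 0;
    for (int t = 0; t < 64; t++) {
        int v = block[order[t]];
        if (v == 0) {
            run++;
            continue;
        }
        runs[n] = run;
        values[n] = v;
        n++;
        run = 0;
    }
    if (run > 0) {          /* trailing zeroes collapse into one marker */
        runs[n] = 0;
        values[n] = 0;
        n++;
    }
    return n;
}
```

A block whose only non-zero entries are 50 at position (0,0), 3 at (0,1) and -2 at (2,0) is encoded as just four pairs: (0,50), (0,3), (1,-2) and the end marker (0,0), instead of 64 separate values.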

2.6 Huffman Encoding

Huffman encoding is a form of entropy encoding and is based on Shannon's information theory. The fundamental idea behind Huffman encoding is that symbols which occur more frequently should be represented by fewer bits, while those occurring less frequently should be represented by more bits. This scheme is similar to the one used in Morse code.

Shannon proved that the entropy of the message gives the minimum average code length achievable by any code for sending that message.

Given n symbols S1 to Sn with probabilities of occurrence P1 to Pn in a certain message, the entropy of the message is given by

$$\text{Entropy} = \sum_{i=1}^{n} P_i \log_2 (1/P_i)$$

Huffman encoding attempts to minimize the average number of bits per symbol and to get a value close to the entropy.

Example:

We describe the algorithm for Huffman encoding with the help of an example. Consider 4 symbols with probabilities as shown in the "Probability" column of the table.

Symbol   Probability   Iteration 1   Iteration 2   Length in bits
S1       0.4 (1)       0.4 (1)       0.6 (0)       1
S2       0.3 (00)      0.3 (00)      0.4 (1)       2
S3       0.2 (010)     0.3 (01)                    3
S4       0.1 (011)                                 3

Table 2.6.1 Huffman encoding method

Step 1: Sort the probabilities and arrange them in descending order, as shown in the column marked "Probability".

Step 2: Add the probabilities of the last two symbols and place the sum in the next column, re-sorting the values.

Step 3: Repeat steps 1 and 2 until only two symbols remain.

Step 4: Assign bit 0 to the upper symbol and 1 to the lower symbol (or vice versa, but the same convention must then be followed throughout the process).

Step 5: Trace each probability back to where it came from in the previous column and append a 0 or 1 according to the convention chosen above.

Step 6: Follow this procedure back to the first column to obtain the variable length code.
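The merging procedure in the steps above determines how long each code word is. The sketch below computes only those code lengths, by repeatedly merging the two least probable groups and adding one bit to every symbol inside them. It is an illustration under stated assumptions: the function name `huffman_lengths`, the use of integer weights (probabilities scaled by 10) and the limit of 16 symbols are choices made for this example.

```c
/* Compute Huffman code lengths for n symbols (n <= 16).
   weight[s] is the relative frequency of symbol s; length[s] receives
   the number of bits in its code word. */
void huffman_lengths(const int weight[], int n, int length[])
{
    int w[16];      /* total weight of each live group */
    int alive[16];  /* 1 while the group has not been merged away */
    int group[16];  /* group that symbol s currently belongs to */

    for (int s = 0; s < n; s++) {
        w[s] = weight[s];
        alive[s] = 1;
        group[s] = s;
        length[s] = 0;
    }

    for (int merges = 0; merges < n - 1; merges++) {
        int a = -1, b = -1; /* lightest and second lightest live groups */
        for (int g = 0; g < n; g++) {
            if (!alive[g])
                continue;
            if (a < 0 || w[g] < w[a]) {
                b = a;
                a = g;
            } else if (b < 0 || w[g] < w[b]) {
                b = g;
            }
        }
        /* Every symbol in the two merged groups gains one code bit. */
        for (int s = 0; s < n; s++) {
            if (group[s] == a || group[s] == b) {
                length[s]++;
                group[s] = a;
            }
        }
        w[a] += w[b];
        alive[b] = 0;
    }
}
```

For the example in Table 2.6.1 (weights 4, 3, 2, 1) this yields lengths 1, 2, 3 and 3, matching the last column of the table. The average code length is 0.4(1) + 0.3(2) + 0.2(3) + 0.1(3) = 1.9 bits per symbol, slightly above the entropy of about 1.85 bits.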

Huffman encoding isn’t implemented in this manner in the JPEG image compression

standard. The standard code tables for Huffman coding are defined in the standard for the

DC values and the non-DC values as well. These values are looked up both for encoding

and decoding. Huffman code is a prefix code and hence it can be uniquely decoded.

The other alternative methods for entropy coding or source coding are Shannon-Fano

encoding, and arithmetic coding. Arithmetic coding has even been adopted in the JPEG

2000 standard for entropy coding after IBM agreed to release its patent on this technique

for the JPEG 2000 standard.

3 Key Bottlenecks

After analysing the video compression algorithms and surveying the literature on improving their performance, we identified two algorithms, Motion Estimation and DCT, that are the most resource intensive and account for a very high proportion of the time spent in video compression. Motion Estimation applies highly repetitive methods to the whole image. DCT is essentially a matrix multiplication loop that must be performed on every block of 8 x 8 or 16 x 16 pixels in the image. As image resolution increases the problem becomes worse, since the number of loop iterations grows and more computational resources are required.

However, there is no data dependence between the various data elements used in these algorithms. It is therefore possible to improve the performance of these programs by exploiting the parallelism inherent in media algorithms and processing different data points in parallel to obtain higher throughput. A SIMD architecture exploits this parallelism by using a wider datapath and performing the same operation on several data points (in our case pixels) at once.

4 SIMD Architecture

Usually, processors process one data element in one instruction, a processing style called

Single Instruction Single Data, or SISD. In contrast, processors having the SIMD

capability process more than one data element in one instruction. The Single Instruction

Stream Multiple Data Stream (SIMD) Architectures perform operations on many

elements in a lockstep fashion. The same instruction is performed on different data

elements computed by different functional units.

Intel's MMX/SSE/SSE2, AMD's 3DNow! and PowerPC's AltiVec ISA extensions are testimony to the benefits of adding SIMD support to traditional superscalar processors.

4.1 Intel’s Streaming SIMD Extensions

SIMD extensions for the IA-32 ISA began with the Multimedia Extensions (MMX), introduced in 1997 for the Pentium processor. The 64-bit MMX datapath, with subword-parallel ALUs for bytes, words and doublewords, enhanced performance on multimedia benchmarks. However, these instructions were limited in that only integer data types could be handled. Also, since the MMX instructions used the floating-point registers, it was very hard to intermingle floating-point and MMX instructions.

Streaming SIMD Extensions (SSE), introduced with the Pentium III, added 70 new instructions to the IA-32 ISA, extending MMX. The biggest winners from the new

instructions were applications that handled 3D or streaming media, since identical operations on multiple pieces of data were now handled in parallel. AMD was not idle during this time, and introduced 3DNow! to the world. This much catchier-sounding set offered capabilities similar to those made possible by SSE, but was incompatible with it.

The SSE2 technology of the Pentium 4 introduced new Single Instruction Multiple Data (SIMD) double-precision floating-point instructions and new SIMD integer instructions into the IA-32 Intel architecture. The 128-bit SIMD integer extensions are a full superset of the 64-bit integer SIMD instructions, with additional instructions to support more integer data types, conversion between integer and floating-point data types, and efficient operations between the caches and system memory. These instructions provide a means to accelerate operations typical of 3D graphics, real-time physics, spatial (3D) audio, video encoding/decoding, encryption, and scientific applications.

4.1.1 SSE Vs MMX

MMX and SSE, both of which are extensions to existing architectures, share the concept

of SIMD, but they differ in the data types they handle, and in the way they are supported

in the processor.

MMX instructions are SIMD for integers, while SSE instructions are SIMD for single-

precision floating-point numbers also. MMX instructions operate on two 32-bit integers

simultaneously, while SSE instructions operate on four 32-bit floats simultaneously.

A major difference between MMX and SSE is that no new registers were defined for

MMX, while eight new registers have been defined for SSE. Each of the registers for

SSE is 128 bits long and can hold four single-precision floating-point numbers (each

being 32 bits long). The arrangement of the floating-point numbers in the new data type

handled by SSE is illustrated in Figure 4.1.

Figure 4.1: Arrangement of numbers in the new data type.

The immediate question is: Where did the registers for MMX come from? The MMX

registers were allocated out of the floating-point registers of the floating-point unit. A

floating-point register is 80 bits long, of which 64 bits were used for an MMX register. A

limitation of this architecture is that an application cannot execute MMX instructions and

perform floating-point operations simultaneously. Additionally, a large number of

processor clock cycles are needed to change the state of executing MMX instructions to

the state of executing floating-point operations and vice versa. SSE does not have such a

restriction. Separate registers have been defined for SSE. Hence, applications can execute

SIMD integer (MMX) and SIMD floating-point (SSE) instructions simultaneously.

Applications can also execute non-SIMD floating-point and SIMD floating-point

instructions simultaneously.

The arrangement of the registers in MMX and SSE is illustrated in Figure 4.2. Figure

4.2(a) illustrates the mutually exclusive floating-point and MMX registers, while Figure

4.2(b) illustrates the SSE registers.

Figure 4.2: Registers in MMX and SSE.

MMX and SSE have one more similarity: Both have eight registers. MMX registers are

named mm0 through mm7, while SSE registers are named xmm0 through xmm7. For the

purpose of our experiment we make use of the SSE2 extensions.

4.2 SSE2 Coding Techniques

There is limited compiler support available for the SIMD ISA extension. As a result to

make use of the rich features provided by this extension we need to go through different

programming techniques. One can use one of the following techniques to code programs

with SSE2.

a) Assembly level programming

b) Intrinsics

c) Vector Class Library

The advantage of using Intrinsics and the Vector Class Library is that they free the programmer from managing registers while making the code easier to maintain and modularize. The compiler optimizes instruction scheduling and register allocation, and hence the executable runs faster.

Each computation and data manipulation assembly instruction has a corresponding intrinsic that implements it directly. The intrinsics in SSE2 carry suffixes that indicate the datatype operated on by the instruction.

- p, pd, ps indicate a packed, packed double-precision or packed single-precision floating-point operation

- s, sd, ss indicate a scalar, scalar double-precision or scalar single-precision floating-point operation

- i, si, su, pi, pu, epi, epu indicate an integer operation: a 64-bit signed or unsigned integer, or packed (p) and 128-bit extended packed (ep) signed or unsigned integers with elements of 8, 16, 32 or 64 bits.

To use the SSE2 intrinsics, the header emmintrin.h must be included (xmmintrin.h declares only the original SSE intrinsics). We chose the Intrinsics style of coding for the Motion Estimation algorithms, and we chose Intel's C++ compiler over Microsoft's Visual Studio to compile them. For most parts we used normal C code constructs. However, where we could exploit parallelism with SIMD, we used SSE2 intrinsics to indicate this to the compiler.

4.3 Motion Estimation

We perform motion estimation with full search and three step search for both the 16x16 and 8x8 block sizes and compare performance. The complete C code for all the programs is provided in the appendix. Here we present the intrinsic optimizations made to incorporate the SSE2 features.

4.3.1 Code Snippets

Blockdiff is the most computationally intensive function in the program, and it can use the SSE2 features to improve its performance. We change the code using intrinsics to employ the SSE2 datapath. The snippet below shows the blockdiff function.

    int blockdiff(int x1, int x2, int y11, int y22)
    {
        unsigned char block1[16], block2[16];
        int i1, j1, k1, ch, offset1, offset2;
        int diff1[16][16], totaldiff = 0;
        FILE *fp1, *fp2;
        __m128i *b1, *b2, m1;
        union mmx
        {
            __m128i m;
            short int x[8];
        } m;
        ….
        ….
        ….
            // type casting pointers.
            b1 = (__m128i*)block1;
            b2 = (__m128i*)block2;

            // SAD for 16 bytes.
            m1 = _mm_sad_epu8(*b1, *b2);
            m.m = m1;

            // _mm_sad_epu8 leaves one partial sum in the low 16 bits of
            // each 64-bit half of the result, i.e. in x[0] and x[4].
            totaldiff = totaldiff + m.x[0] + m.x[4];
        }
    }

Figure 4.3 Blockdiff function to process 16 x 16 blocks.

Figure 4.3 above shows the SSE2 code for the blockdiff() function, which finds the difference between two blocks located at (x1, y11) and (x2, y22).

The top part shows the declarations inside the function, while the bottom part shows the calculation of the difference using the SSE2 intrinsic. We define a union called mmx, whose member m can be accessed either as the datatype "__m128i" or as an array of 8 short integers. An __m128i register holds 16 8-bit integer values.

The block1 and block2 arrays contain 16 8-bit pixel values from the image. Their addresses are typecast to __m128i pointers and stored in b1 and b2. Next the _mm_sad_epu8() intrinsic computes the sum of absolute differences of these 16 pairs of values directly, producing two partial sums (one for each 64-bit half of the register) in m1. Totaldiff accumulates the difference from previous iterations and this one.
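For comparison, here is a self-contained sketch of the same idea that can be compiled on its own. It is not the project's blockdiff function: the names `sad16` and `sad16x16` and the `stride` parameter are assumptions for this example. It uses unaligned loads instead of pointer casts, so the pixel rows need no 16-byte alignment, and it falls back to plain C when the compiler does not define `__SSE2__`.

```c
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Sum of absolute differences over one row of 16 pixels. */
unsigned int sad16(const unsigned char *a, const unsigned char *b)
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PSADBW: one 16-bit partial sum in the low bits of each 64-bit half. */
    __m128i s = _mm_sad_epu8(va, vb);
    /* Add the low half and the high half (shifted down by 8 bytes). */
    return (unsigned int)_mm_cvtsi128_si32(s) +
           (unsigned int)_mm_cvtsi128_si32(_mm_srli_si128(s, 8));
#else
    /* Portable fallback for targets without SSE2. */
    unsigned int sum = 0;
    for (int i = 0; i < 16; i++)
        sum += (a[i] > b[i]) ? (unsigned int)(a[i] - b[i])
                             : (unsigned int)(b[i] - a[i]);
    return sum;
#endif
}

/* SAD of a 16 x 16 block; `stride` is the number of bytes between the
   starts of consecutive rows in both images. */
unsigned int sad16x16(const unsigned char *a, const unsigned char *b,
                      int stride)
{
    unsigned int total = 0;
    for (int row = 0; row < 16; row++)
        total += sad16(a + row * stride, b + row * stride);
    return total;
}
```

One intrinsic call replaces 16 subtractions, 16 absolute values and 15 additions, which is where most of the speedup in the motion estimation loop comes from.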

    int blockdiff(int x1, int x2, int y11, int y22)
    {
        unsigned char block1[16], block2[16];
        int i1, j1, k1, ch, offset1, offset2;
        int diff1[16][16], totaldiff = 0;
        FILE *fp1, *fp2;
        __m64 *b1, *b2, m1;
        union mmx
        {
            __m64 m;
            int x[2];
        } m;
        ….
        ….
        ….
        // type casting pointers.
        b1 = (__m64*)block1;
        b2 = (__m64*)block2;

        // SAD for 8 bytes.
        m1 = _m_psadbw(*b1, *b2);
        m.m = m1;
        totaldiff = totaldiff + m.x[0];
    }

Figure 4.4 Snippet from blockdiff function for 8 x 8 blocks.

Figure 4.4 above shows the blockdiff function code for taking differences between 8 x 8

blocks in a similar manner to the one above.

The operations performed are similar but the datatype and intrinsic used are different.

The datatype used is the __m64 type, which consists of 8 8-bit values. The intrinsic used

to calculate the sum of differences is the _m_psadbw() operation.

Totaldiff will again contain the accumulated difference from all the iterations.

4.3.2 Results

                       Without SSE   With SSE
Full Search, 16 x 16   4 secs        1 sec
Full Search, 8 x 8     23 secs       6 secs
Three Step, 16 x 16    3 secs        1 sec
Three Step, 8 x 8      12 secs       3 secs

Table 1 Timing information for the various programs.

Figure 4.4.1 Frames 4 and 5 of the news.qcif video-stream

Figure 4.4.2 Motion Compensated Frame produced from Frame 4 of the news.qcif video-stream with block size of 8 x 8 and 16 x 16 respectively

Figure 4.4.3 Part of frame (4)

Figure 4.4.4 Part of frame (5)

Figure 4.4.5 Part of Predicted frame (5) with block size of 8 x 8

Figure 4.4.6 Part of Predicted frame (5) with block size of 16x16

We draw the following conclusions from the results given above. The speedup with SSE is a factor of 3 to 4 for all four programs. The 8 x 8 block programs for both algorithms take longer to execute than the 16 x 16 block programs. The reason is that the loop overhead goes up, even though the number of additions and subtractions to be performed is the same. From the images we see that the 8 x 8 blocks do a better job of matching than the 16 x 16 blocks. The predicted images after motion compensation show that 8 x 8 blocks are better suited to tracking the movement of smaller image regions with these algorithms.

4.4 DCT

Our second candidate algorithm which is highly computationally intensive is the DCT. It

is essentially matrix multiplication of an image block by a DCT constant multiplication

matrix. This structure can also make use of the SIMD architecture to improve

performance. For the following section we present a few suggestions that would improve

the performance of the DCT algorithm. However, we don’t go into the performance

comparisons of the algorithms due to lack of the capability to add additional functionality

to the compiler tool. Below is a code snippet for the DCT algorithm.

    /* CosBlock, the 8 x 8 matrix of DCT cosine constants, is assumed to be
       defined elsewhere in the program. */
    void DCT(int InBlock[][8], int OutBlock[][8])
    {
        int TempBlock[8][8], CosTrans[8][8];

        /* CosTrans = CosBlock^T; it must be computed before it is used. */
        Transpose(CosBlock, CosTrans);

        /* TempBlock = InBlock * CosBlock^T */
        MatrixMult(InBlock, CosTrans, TempBlock);

        /* OutBlock = CosBlock * TempBlock */
        MatrixMult(CosBlock, TempBlock, OutBlock);
    }

Figure 4.4.1

This DCT code could be further enhanced using the SIMD support provided by Intel's SSE2. To illustrate this, let us consider a 4 x 4 matrix multiplication using traditional methods and then using the SIMD architecture.

Figure 4.4.2: Matrix multiplication for computation of a single element. Parts (a), (b), (c) and (d) show the various steps for obtaining a single result of the matrix multiplication.

The traditional method requires that the row and column elements of the two matrices being multiplied be accessed one at a time and a MAC operation performed on each pair. For a 4 x 4 multiplication this requires 64 sequential operations, each accessing the elements from memory and performing a multiply-accumulate. With the SIMD architecture, however, only 16 operations are required. The illustration is given below.

Figure 4.4.3: Matrix multiplication for four elements. Parts (a), (b), (c) and (d) show the computation on the SIMD platform.

Essentially, because of the wide datapath of the SSE architecture, it is possible to concurrently perform operations that are independent of each other. Partial products for 4 different elements of the matrix are therefore computed in parallel, improving performance at the cost of extra hardware.

We believe this will improve the performance of the DCT code by up to 4 times over the original sequential DCT code.
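The 64-versus-16 operation count can be made concrete with the following sketch of a 4 x 4 matrix multiplication. It is an illustration only, with several assumptions: it uses single-precision floats and the packed float intrinsics rather than the integer data of the DCT function above, the name `matmul4x4` and the flat row-major layout are chosen for the example, and a scalar fallback is used when `__SSE2__` is not defined.

```c
#if defined(__SSE2__)
#include <emmintrin.h> /* also provides the SSE packed float intrinsics */
#endif

/* C = A * B for 4 x 4 row-major float matrices. */
void matmul4x4(const float A[16], const float B[16], float C[16])
{
#if defined(__SSE2__)
    /* 16 SIMD multiply-adds: each one produces the partial products
       for four elements of a result row at once. */
    for (int r = 0; r < 4; r++) {
        __m128 acc = _mm_setzero_ps();
        for (int k = 0; k < 4; k++) {
            __m128 a = _mm_set1_ps(A[r * 4 + k]); /* broadcast A[r][k] */
            __m128 b = _mm_loadu_ps(&B[k * 4]);   /* row k of B */
            acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
        }
        _mm_storeu_ps(&C[r * 4], acc);
    }
#else
    /* Scalar fallback: 64 sequential multiply-accumulates. */
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++)
                sum += A[r * 4 + k] * B[k * 4 + c];
            C[r * 4 + c] = sum;
        }
    }
#endif
}
```

The SIMD path avoids the column accesses of the traditional method altogether: it only ever reads whole rows of B, which suits the 128-bit loads. The factor of 4 is an upper bound, since loads, broadcasts and stores are not reduced by the same ratio.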

4.4.2 Specialized hardware support for DCT on the SIMD architecture

Using the SIMD architecture of SSE2 to implement the 8-point 1-D DCT does improve performance over a simple C implementation of the 1-D DCT. A specialized accelerator for DCT incorporated into the SIMD unit would improve performance further. The motivation of this study is therefore to examine the trade-off between the cost in hardware and the performance improvement obtained by using a dedicated accelerator for the 1-D DCT. The choice of DCT accelerator was driven largely by its ability to scale with the SIMD architecture. We therefore implement the DCT with distributed arithmetic, which scales easily with the SIMD architecture.

4.4.2.1 DCT Implementation on Hardware

The 2-D DCT has been recognized as one of the most cost-effective techniques among the various transform coding schemes for image compression. The DCT is an orthogonal transform, and the N x N 2-D DCT is defined as follows:

$$X(u,v) = \frac{2}{N}\, C(u)\, C(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x(i,j) \cos\frac{(2i+1)u\pi}{2N} \cos\frac{(2j+1)v\pi}{2N}$$

where x(i,j) (i,j = 0, 1, 2, ..., N-1) is the pixel data, X(u,v) (u,v = 0, 1, 2, ..., N-1) is the transformed coefficient, and $C(0) = 1/\sqrt{2}$, $C(u) = C(v) = 1$ if $u, v \neq 0$.

The 2-D DCT unit comprises two 1-D DCT units and a transpose operation. The 2-D DCT is separated into two 1-D DCTs by the row-column decomposition technique. The input data are fed into the first DCT unit, where the 1-D DCT is calculated in row order. The intermediate data are then transposed. Finally, the transposed data are fed to the second 1-D DCT unit and processed in column order.

The recursive fast DCT algorithm is used to calculate the eight point DCT as shown

A = cos ,

where xi (i = 0, 1, 2, ..., 7) is the pixel data and Xu (u = 0, 1, 2, ..., 7) is the transformed coefficient. Owing to this algorithm, the number of multiplications for the DCT is halved.

Figure 4.2.2.1: DAP structure for the SSE2 extension

Figure 4.2.2.1 shows the block diagram of the 1-D DCT processing unit on the SSE2 platform. The preprocessed sums and differences can be obtained with the parallel addition and subtraction instructions of SSE2. The multiplier-accumulator in the DCT core processor has been designed with distributed arithmetic. With distributed arithmetic, the parallel multipliers can be eliminated from the core processor and the amount of hardware is greatly reduced. Furthermore, very high speed operation can be achieved because the critical path runs through an adder instead of a multiplier. Here we illustrate the principle of distributed arithmetic (DA). Assume the input vector is presented in N-bit two's complement code as follows:

The multiply-accumulate in the normal way can be written as the following equation:

$$y = \sum_{k=1}^{K} a_k x_k$$

where ak (k = 1, 2, 3, ..., K) is the multiply coefficient.

[Figure 4.2.2.1 labels: inputs x0+x7, x1+x6, x2+x5, x3+x4, x0-x7, x1-x6, x2-x5, x3-x4; two ROMs, an adder and a register R with 0.5 and 0.25 scaling on 16-bit paths; DAP units producing X2, X4, X6, X1, X3, X5, X7]

Based on distributed arithmetic, y can be calculated as follows

The multiply operation is implemented with a ROM that stores the precalculated partial products. Therefore, the hardware of the multiply accumulation based on the DA includes ROM and an adder that accumulates the partial products read from ROM.

In multiply-accumulate operations based on distributed arithmetic, precalculated partial products are read out from ROMs and accumulated bit-wise from the LSBs to the MSBs. To double the processing speed, two partial products for adjacent bits can be read from separate ROMs at the same time. This method of calculation can be written in the following equation

In this case the two adjacent bits are processed simultaneously. Thus, two ROMs are required to supply a pair of partial products for the higher and lower bits in every cycle, and both have two banks for the two modes of DCT. The ROM size itself was reduced by a factor of $2^4$ by using the fast algorithm in conjunction with the DA scheme. With the present configuration of the DAP (Distributed Arithmetic Processor) structure we would require 16 ROMs of 16 x 16 bits each. Each of the DAPs completes its operation in 8 cycles. Hence the entire DCT is calculated in 8 cycles, because all the DAPs work in parallel on the SIMD architecture of SSE2.
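The principle of replacing multipliers by ROM lookups can be modelled in software as follows. This is a simplified sketch under explicit assumptions and not a model of the DAP described above: it uses K = 4 integer coefficients, 8-bit two's complement integer inputs (in which the sign bit carries the weight -2^(N-1)), and it processes one bit per step rather than two adjacent bits. The names `da_build_rom` and `da_inner_product` are illustrative.

```c
#define DA_K 4 /* number of terms in the inner product */
#define DA_N 8 /* input word length in bits (two's complement) */

/* Precompute the ROM contents: entry `addr` holds the sum of those
   coefficients whose bit is set in `addr`. */
void da_build_rom(const int coeff[DA_K], int rom[1 << DA_K])
{
    for (int addr = 0; addr < (1 << DA_K); addr++) {
        rom[addr] = 0;
        for (int k = 0; k < DA_K; k++)
            if (addr & (1 << k))
                rom[addr] += coeff[k];
    }
}

/* Compute sum over k of coeff[k] * x[k] using only ROM lookups and
   weighted additions, one bit position per step. */
long da_inner_product(const int rom[1 << DA_K], const signed char x[DA_K])
{
    long y = 0;
    for (int n = 0; n < DA_N; n++) {
        /* Gather bit n of every input word into a ROM address. */
        int addr = 0;
        for (int k = 0; k < DA_K; k++)
            addr |= (((unsigned char)x[k] >> n) & 1) << k;

        /* Weight the partial product by 2^n; hardware does this with a
           shift in the accumulator instead of a multiplication. */
        long partial = (long)rom[addr] * (1L << n);

        if (n == DA_N - 1)
            y -= partial; /* the sign bit has negative weight */
        else
            y += partial;
    }
    return y;
}
```

With coefficients {3, -5, 7, 2} and inputs {10, -20, 127, -128}, the eight lookups accumulate to 763, the same value as the direct sum 30 + 100 + 889 - 256. The number of steps depends on the word length N rather than on the number of multiplications, which is why reading two adjacent bits per cycle halves the latency.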

4.4.2.2 Implementation and Results

The DAP structure was implemented in Verilog and synthesized using the gflx-p library. The control flow for the DAP is as follows. The DAP takes a total of 8 cycles to complete. For all the fractional binary bits we use the shift-add method: essentially, we shift the operand right by 2 and add it every cycle, as shown in the DAP figure. The shift-add-complement method is used when the DAP is processing the integer part of the binary fixed-point number. This can be further understood by looking at the code in the appendix.

Synthesis Results

Clock period achieved: 1.8 ns

Area results:
Combinational Area: 6660.396
Non-Combinational Area: 1128.06
Interconnect Net Area: 1211.4841
Total Cell Area: 7788.476
Total Area: 8999.942

5 Conclusion

SIMD extensions to superscalar architectures have helped improve the performance of general-purpose processors on media applications. We took the motion estimation algorithm and optimized it to use the SIMD architecture offered by modern processors. The wide datapath gave a considerable performance improvement. We also explored improving the performance of the DCT, both with the existing ISA and with dedicated hardware for the DCT implementation. The performance analysis of this ISA extension and of its hardware trade-offs remains future work.

APPENDIX

A.1 Program for the Full Search Motion Estimation Algorithm with simple C for 16 x 16 image blocks

//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT FULL SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>

int blockdiff(int, int, int, int);
int x1, x2, y11, y2, i, j, k, mindiff[12][10], p = 8;
int far diff[12][10][257];
int sort[257], temp, point, l, motionx, motiony, col, row, x, y;
long peldiff = 0;
float amad = 0.0;
time_t first, second;

void main()
{
    int i1, j1, k1;
    FILE *fpold, *fpnew;

    first = time(NULL);

    for(j=0; j<9; j++)
    {
        for(i=0; i<11; i++)
        {
            x1 = 16*i;   //START OF BLOCKS IN REFERENCE IMAGE
            y11 = 16*j;

            for(row = 0; row < 2*p; row++)
            {
                for(col = 0; col < 2*p; col++)
                {
                    x2 = x1 - p + col;
                    y2 = y11 - p + row;
                    diff[i][j][(16*row) + col] = blockdiff(x1, x2, y11, y2);
                }
            }

            //------------------------------------------------------------------------
            //SUBPROGRAM FOR SORTING THE DIFFERENCES
            //------------------------------------------------------------------------
            for(k=0; k<256; k++)
            {
                sort[k] = diff[i][j][k];
                // printf("\tk%d diff%d", k, diff[i][j][k]);
            }

            //BUBBLE SORT IN DESCENDING ORDER: THE MINIMUM ENDS UP IN sort[255]
            for(k=0; k<256; k++)
            {
                for(l=0; l<255; l++)   //l+1 MUST STAY INSIDE THE 256 VALUES
                {
                    if(sort[l] < sort[l+1])
                    {
                        temp = sort[l];
                        sort[l] = sort[l+1];
                        sort[l+1] = temp;
                    }
                }
            }
            mindiff[i][j] = sort[255];

            // printf("\nmindiff=%d", mindiff[i][j]);
            // printf("\nsort=%d", sort[255]);
            // getch();

            //FIND THE SEARCH POSITION THAT GAVE THE MINIMUM
            for(k=0; k<256; k++)
            {
                if(diff[i][j][k] == sort[255])
                {
                    l = k;
                }
            }
            x = l % 16;
            y = (l - x)/16;

            motionx = x - 8;
            motiony = y - 8;

            // printf("\t %d %d %d %d", x1, y11, motionx , motiony);
            // printf("\n\t%d %d",(j*11)+i, mindiff[i][j]);

            //----------------------------------------------------------------------
            // SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE
            //----------------------------------------------------------------------
            /*
            fpold = fopen("C:\\ECE734\\f4.raw","rb");
            fpnew = fopen("C:\\ECE734\\f5recon.raw","r+b");

            if(motionx != 0 || motiony != 0)
            {
                //SKIP PIXELS UPTO INITIAL POINT
                fseek(fpold,(176 * y11) + x1, 0);
                //SKIP PIXELS USING MOTION VECTOR FOR THE NEW IMAGE
                fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);

                //COPY REQUIRED PIXELS FROM BLOCK 1
                for(i1 = 0; i1<16; i1++)
                {
                    //FOR LOOP FOR WRITING REQUIRED PIXELS IN NEW IMAGE
                    for(j1 = 0; j1<16; j1++)
                    {
                        point = fgetc(fpold);
                        fputc(point, fpnew);
                    }
                    fseek(fpold,160,1);
                    fseek(fpnew,160,1);
                }
            }

            fclose(fpnew);
            fclose(fpold);
            */
            amad = amad + sqrt(motionx*motionx + motiony*motiony);
            peldiff = peldiff + mindiff[i][j];
        }
    }

    amad = amad/99;
    printf("\nAMAD = %f", amad);
    printf("\nPixel Difference %ld", peldiff/99);
    second = time(NULL);
    printf("\nDifference in time %ld", second - first);
    getch();
}

//--------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//--------------------------------------------------------------------------
int blockdiff(int x1, int x2, int y11, int y2)
{
    int block1[16][16], block2[16][16], i1, j1, k1, ch;
    int diff1[16][16], totaldiff = 0;
    FILE *fp1, *fp2;

    //DISCARDING LOCATIONS NEAR THE BOTTOM AND RIGHT END OF FRAME
    if(x2 < 0 || y2 < 0 || x2 > 160 || y2 > 128)
    {
        totaldiff = 10000;
    }
    else
    {
        fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
        fp2 = fopen("C:\\ECE734\\f5.raw", "rb");

        if(fp1 == NULL || fp2 == NULL)
        {
            printf("File cannot be opened");
            getch();
            exit(0);
        }

        for(i1=0; i1<16; i1++)
        {
            for(j1=0; j1<16; j1++)
            {
                diff1[i1][j1] = 0;
                block1[i1][j1] = 0;
                block2[i1][j1] = 0;
            }
        }

        //SKIP PIXELS UPTO INITIAL POINT
        for(i1=0; i1<(176 * y11) + x1; i1++)
            ch = fgetc(fp1);

        for(i1 = 0; i1<16; i1++)
        {
            for(j1 = 0; j1<16; j1++)
            {
                block1[i1][j1] = fgetc(fp1);
            }
            //SKIP PIXELS FROM SAME LINE NOT NEEDED
            for(k1=0; k1<160; k1++)
            {
                ch = fgetc(fp1);
            }
        }

        //BLOCK COPIED FROM SECOND FRAME
        //SKIP PIXELS UPTO INITIAL POINT
        for(i1=0; i1<(176*y2) + x2; i1++)
            ch = fgetc(fp2);

        //COPY REQUIRED PIXELS FROM BLOCK 2
        for(i1 = 0; i1<16; i1++)
        {
            for(j1 = 0; j1<16; j1++)
            {
                block2[i1][j1] = fgetc(fp2);
            }
            //SKIP PIXELS FROM SAME LINE NOT NEEDED
            for(k1=0; k1<160; k1++)
            {
                ch = fgetc(fp2);
            }
        }

        for(i1=0; i1<16; i1++)
        {
            for(j1=0; j1<16; j1++)
            {
                diff1[i1][j1] = block2[i1][j1] - block1[i1][j1];
                diff1[i1][j1] = abs(diff1[i1][j1]);
                totaldiff = totaldiff + diff1[i1][j1];
            }
        }

        fclose(fp1);
        fclose(fp2);
    } // else loop end
    return(totaldiff);
}

A.2 Full Search Program with SSE 2 intrinsics for 16 x 16 blocks

//-------------------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT FULL SEARCH WITH SSE2
//-------------------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
#include<xmmintrin.h>
#include<sse2mmx.h>
//#include <iostream.h>
#include <windows.h>    // must be included before mmsystem.h
#include <mmsystem.h>   // timeGetTime()

int blockdiff(int, int, int, int);
int x1, x2, y11, y22, i, j, k, mindiff[12][10], p = 8;
int diff[12][10][257];
int sort[257], temp, point, l, motionx, motiony, col, row, x, y;
long peldiff = 0;
float amad;
clock_t first, second;
FILE *in1, *in2;

void main()
{
    int i1, j1, k1;
    FILE *fpold, *fpnew;
    int diff1, diff, diffx, diffy;
    DWORD start, finish, duration;

    amad = 0.0;

    //first = clock();
    start = timeGetTime();
    printf("START TIME!! %ld\n", start);

    //j and i move the reference blocks
    for(j=0; j<9; j++)          //j<9   9*16 = 144
    {
        for(i=0; i<11; i++)     //i<11  16*11 = 176
        {
            x1 = 16*i;   //START OF BLOCKS IN REFERENCE IMAGE
            y11 = 16*j;

            //reset the best match for every reference block
            diff = 1000000;
            diffx = x1;
            diffy = y11;

            for(row = 0; row<2*p; row++)
            {
                for(col = 0; col<2*p; col++)
                {
                    x2 = x1 - p + col;
                    y22 = y11 - p + row;
                    diff1 = blockdiff(x1, x2, y11, y22);
                    if (diff > diff1)
                    {
                        diff = diff1;
                        diffx = x2;
                        diffy = y22;
                    }
                    //else discard diff1
                }
            }

            motionx = diffx - x1;
            motiony = diffy - y11;

            //----------------------------------------------------------------------
            // SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE
            //----------------------------------------------------------------------
            fpold = fopen("C:\\ECE734\\f4.raw","rb");
            fpnew = fopen("C:\\ECE734\\f5recon.raw","r+b");

            if(motionx != 0 || motiony != 0)
            {
                //SKIP PIXELS UPTO INITIAL POINT
                fseek(fpold,(176 * y11) + x1, 0);
                //SKIP PIXELS USING MOTION VECTOR FOR THE NEW IMAGE
                fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);

                for(i1 = 0; i1<16; i1++)
                {
                    //FOR LOOP FOR WRITING REQUIRED PIXELS IN NEW IMAGE
                    for(j1 = 0; j1<16; j1++)
                    {
                        point = fgetc(fpold);
                        fputc(point, fpnew);
                    }
                    //move to the start of the block on the next row, as in A.1
                    fseek(fpold,160,1);
                    fseek(fpnew,160,1);
                }
            }
            fclose(fpnew);
            fclose(fpold);

            // amad = 99.0f ;// 1.0;((float)motionx* motionx) + ((float)motiony*motiony);
            peldiff = peldiff + diff;
        }
    }

    //amad = amad/99;
    // printf("\nAMAD = %f",amad);
    printf("\nPixel Difference %ld", peldiff/99);
    second = clock();
    printf("\nDifference in time %ld", second);
    getch();
}


int blockdiff(int x1, int x2, int y11, int y22)
{
    unsigned char block1[16], block2[16];
    int i1, j1, k1, ch, offset1, offset2;
    int diff1[16][16], totaldiff = 0;
    FILE *fp1, *fp2;
    __m128i *b1, *b2, m1;
    union mmx
    {
        __m128i m;
        short int x[8];
    } m;

    if(x2 < 0 || y22 < 0 || x2 > 160 || y22 > 128)
    {
        totaldiff = 10000;
    }
    else
    {
        fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
        fp2 = fopen("C:\\ECE734\\f5.raw", "rb");

        if(fp1 == NULL || fp2 == NULL)
        {
            printf("File cannot be opened");
            getch();
            exit(0);
        }

        //skip to the initial point and get the result.
        //i1 is used here: the global i is the block counter in main()
        for(i1 = 0; i1<16; i1++)
        {
            offset1 = (176*(y11+i1) + x1);
            offset2 = (176*(y22+i1) + x2);

            fseek(fp1,offset1,SEEK_SET);
            fread(block1,1,16,fp1);
            //for(i=0;i<16;i++)
            //printf("%c \n",block1[i]);

            fseek(fp2,offset2,SEEK_SET);
            fread(block2,1,16,fp2);

            // type casting pointers.
            b1 = (__m128i*)block1;
            b2 = (__m128i*)block2;

            //SAD for 16 bytes. The row buffers are not guaranteed to be
            //16-byte aligned, so unaligned loads are used.
            m1 = _mm_sad_epu8(_mm_loadu_si128(b1), _mm_loadu_si128(b2));
            m.m = m1;
            //one partial sum sits in the low word of each 64-bit half
            totaldiff = totaldiff + m.x[0] + m.x[4];
        }

        fclose(fp1);
        fclose(fp2);
    } // else loop end
    return(totaldiff);
}
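For comparison, the 16 x 16 sum of absolute differences can also be sketched on frames that are already held in memory, which avoids reopening the files for every candidate position. This is an illustrative sketch, not the program used for the measurements: the names `sad16x16`, `sad16x16_scalar`, `sad_selfcheck` and `FRAME_W`, the in-memory frame layout, and the scalar fallback when the compiler does not define `__SSE2__` are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

#define FRAME_W 176     /* frame width in pixels, as in the listings above */

/* Scalar reference: SAD of two 16x16 blocks with the given row stride. */
static int sad16x16_scalar(const uint8_t *cur, const uint8_t *ref, int stride) {
    int total = 0;
    for (int r = 0; r < 16; r++)
        for (int c = 0; c < 16; c++)
            total += abs(cur[r * stride + c] - ref[r * stride + c]);
    return total;
}

/* SSE2 version: one _mm_sad_epu8 per 16-pixel row. */
static int sad16x16(const uint8_t *cur, const uint8_t *ref, int stride) {
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int r = 0; r < 16; r++) {
        /* Unaligned loads: block positions are arbitrary pixel offsets. */
        __m128i a = _mm_loadu_si128((const __m128i *)(cur + r * stride));
        __m128i b = _mm_loadu_si128((const __m128i *)(ref + r * stride));
        /* Two partial sums, one in each 64-bit half of the result. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
    }
    /* Add the low half and the high half. */
    return _mm_cvtsi128_si32(acc) +
           _mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
#else
    return sad16x16_scalar(cur, ref, stride);
#endif
}

/* Returns 1 when both versions agree on pseudo-random test frames. */
int sad_selfcheck(void) {
    static uint8_t cur[FRAME_W * 32], ref[FRAME_W * 32];
    uint32_t seed = 12345u;
    for (int n = 0; n < FRAME_W * 32; n++) {
        seed = seed * 1103515245u + 12345u;
        cur[n] = (uint8_t)(seed >> 16);
        seed = seed * 1103515245u + 12345u;
        ref[n] = (uint8_t)(seed >> 16);
    }
    /* Deliberately unaligned block positions. */
    const uint8_t *a = cur + 3 * FRAME_W + 5;
    const uint8_t *b = ref + 7 * FRAME_W + 9;
    return sad16x16(a, b, FRAME_W) == sad16x16_scalar(a, b, FRAME_W);
}
```

Accumulating the `_mm_sad_epu8` results in a vector register and reducing once at the end keeps the inner loop free of scalar extraction.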

A.3 Three Step Search Program with simple C for 16 x 16 image blocks

//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT 3 STEP SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>

int blockdiff(int, int, int, int);
int x1, x2, y11, y22, diff[12][10][10], i, j, p = 16, k, mindiff[12][10];
int sort[10], temp, l, motionx, motiony;
double amad = 0.0, dist = 0.0;   // "long float" is not standard C
long peldiff = 0;
int locx, locy, point;
time_t first, second;

void main(){

FILE *fpold, *fpnew;int i1,j1,k1,ch,c;

first = time(NULL);

for(j=0; j<9 ;j++){

for(i=0; i<11 ;i++){


locx = x1;locy = y11;

p = 16;

while(p >= 1)
{
    //ALGORITHM FOR 3 STEP SEARCH
    //FOR POINT NO. 0
    x2 = locx - p;
    y22 = locy - p;
    diff[i][j][0] = blockdiff(x1, x2, y11, y22);

    //FOR POINT NO. 1
    x2 = locx;
    y22 = locy - p;
    diff[i][j][1] = blockdiff(x1, x2, y11, y22);

    //FOR POINT NO. 2
    x2 = locx + p;
    y22 = locy - p;
    diff[i][j][2] = blockdiff(x1, x2, y11, y22);

    //FOR POINT NO. 3
    x2 = locx - p;
    y22 = locy;
    diff[i][j][3] = blockdiff(x1, x2, y11, y22);

    //FOR POINT NO. 4
    x2 = locx;
    y22 = locy;
    diff[i][j][4] = blockdiff(x1, x2, y11, y22);

    //FOR POINT NO. 5
    x2 = locx + p;
    y22 = locy;
    diff[i][j][5] = blockdiff(x1, x2, y11, y22);

    //FOR POINT NO. 6
    x2 = locx - p;
    y22 = locy + p;
    diff[i][j][6] = blockdiff(x1, x2, y11, y22);

    //FOR POINT NO. 7
    x2 = locx;
    y22 = locy + p;
    diff[i][j][7] = blockdiff(x1, x2, y11, y22);

    //FOR POINT NO. 8
    x2 = locx + p;
    y22 = locy + p;
    diff[i][j][8] = blockdiff(x1, x2, y11, y22);


//-----------------------------------------------------------------------

for(k=0; k<9 ; k++){

sort[k] = diff[i][j][k];// printf("\nk=%d diff=%d", k, diff[i][j][k]);

}

for(k=0; k<9 ;k++){

for(l=0; l<9 ; l++){


temp = sort[l];sort[l] = sort[l+1];

sort[l+1] = temp;}

}}mindiff[i][j] = sort[8];

// printf("\nmindiff=%d", mindiff[i][j]);

for(k=0; k<9; k++){


l = k;}

}

if(l==0){

locx = locx - p;locy = locy - p;

}

if(l==1){

locx = locx;locy = locy - p;

}

if(l==2){

locx = locx + p;locy = locy - p;

}

if(l==3){

locx = locx - p;locy = locy;

}

if(l==4){

locx = locx;locy = locy;

}

if(l==5){

locx = locx + p;locy = locy;

}

if(l==6){

locx = locx - p;locy = locy + p;

}

if(l==7){

locx = locx;locy = locy + p;

}

if(l==8){

locx = locx + p;locy = locy + p;

}

p = p/2;} //while loop end.

motionx = locx - x1;motiony = locy - y11;

// printf("\n\tx1 %d y11 %d \n \tlocx %d locy %d \n\tmotion vector x=%d y=%d",x1,y11, locx, locy, motionx , motiony);// printf("\n\t%d %d",(j*11)+i, mindiff[i][j]);// getch();

dist = sqrt( motionx*motionx + motiony*motiony);amad = amad + dist;peldiff = peldiff + mindiff[i][j];

//*//----------------------------------------------------------------------// SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE//----------------------------------------------------------------------



//SKIP PIXELS UPTO INITIAL POINTfseek(fpold,(176 * y11) + x1, 0);fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);//SKIP

PIXELS USING MOTION VECTOR FOR THE NEW IMAGE


for(i1 = 0; i1<16; i1++){



}


}}

fclose(fpnew);fclose(fpold);//*/}

}
second = time(NULL);
printf("Time taken %ld", second - first);   //TO FIND CPU TIME
printf("\nAMAD=%lf", amad/99);
printf("\nPixel Difference %ld", peldiff/99);
getch();

}






else{


if(fp2 == NULL){

printf("File 2 cannot be opened");getch();exit(0);

}

if(fp1 == NULL){


}

for(i1=0; i1<16 ; i1++){

for(j1=0; j1<16 ;j1++){


}}



for(i1 = 0; i1<16; i1++){

for(j1 =0; j1<16; j1++) {block1[i1][j1] = fgetc(fp1); }


{ch = fgetc(fp1);}

}


ch = fgetc(fp2);


for(j1 =0; j1<16; j1++){ block2[i1][j1] = fgetc(fp2);}for(k1=0; k1< 160; k1++) //SKIP PIXELS FROM SAME LINE NOT

NEEDED{ch = fgetc(fp2);}

}

for(i1=0; i1<16 ; i1++){

for(j1=0; j1<16 ;j1++){


}}

fclose(fp1);fclose(fp2);}// else loop endreturn(totaldiff);

}

A.4 Three Step Search Program with SSE 2 intrinsics for 16 x 16 blocks

//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT 3 STEP SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
#include<xmmintrin.h>
#include<sse2mmx.h>


void main(){


first = time(NULL);

for(j=0; j<9 ;j++){

for(i=0; i<11 ;i++){



p = 16;

while(p >= 1)
{
    //ALGORITHM FOR 3 STEP SEARCH
    //FOR POINT NO. 0
    x2 = locx - p;
    y22 = locy - p;
    diff[i][j][0] = blockdiff(x1, x2, y11, y22);










//-----------------------------------------------------------------------

for(k=0; k<9 ; k++){


}

for(k=0; k<9 ;k++){

for(l=0; l<9 ; l++){



}}



for(k=0; k<9; k++){


l = k;}

}

if(l==0){


}

if(l==1)

{locx = locx;locy = locy - p;

}

if(l==2){


}

if(l==3){


}

if(l==4){


}

if(l==5){


}

if(l==6){


}

if(l==7){


}

if(l==8){


}





//*







for(i1 = 0; i1<16; i1++){



}


}}


}
second = time(NULL);
printf("Time taken %ld", second - first);   //TO FIND CPU TIME
printf("\nAMAD=%lf", amad/99);
printf("\nPixel Difference %ld", peldiff/99);
getch();

}


int blockdiff(int x1, int x2, int y11, int y22)
{
    unsigned char block1[16], block2[16];
    int i1, j1, k1, ch, offset1, offset2;
    int diff1[16][16], totaldiff = 0;
    FILE *fp1, *fp2;
    __m128i *b1, *b2, m1;
    union mmx
    {
        __m128i m;
        short int x[8];
    } m;

    if(x2 < 0 || y22 < 0 || x2 > 160 || y22 > 128)
    {
        totaldiff = 10000;
    }
    else
    {
        fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
        fp2 = fopen("C:\\ECE734\\f5.raw", "rb");

        if(fp1 == NULL || fp2 == NULL)
        {
            printf("File cannot be opened");
            getch();
            exit(0);
        }

        //skip to the initial point and get the result.
        //i1 is used here: the global i is the block counter in main()
        for(i1 = 0; i1<16; i1++)
        {
            offset1 = (176*(y11+i1) + x1);
            offset2 = (176*(y22+i1) + x2);

            fseek(fp1,offset1,SEEK_SET);
            fread(block1,1,16,fp1);
            //for(i=0;i<16;i++)
            //printf("%c \n",block1[i]);

            fseek(fp2,offset2,SEEK_SET);
            fread(block2,1,16,fp2);

            // type casting pointers.
            b1 = (__m128i*)block1;
            b2 = (__m128i*)block2;

            //SAD for 16 bytes. The row buffers are not guaranteed to be
            //16-byte aligned, so unaligned loads are used.
            m1 = _mm_sad_epu8(_mm_loadu_si128(b1), _mm_loadu_si128(b2));
            m.m = m1;
            //one partial sum sits in the low word of each 64-bit half
            totaldiff = totaldiff + m.x[0] + m.x[4];
        }

        fclose(fp1);
        fclose(fp2);
    } // else loop end
    return(totaldiff);
}

A.5 Full Search with simple C for 8 x 8 blocks

//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT FULL SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>

int blockdiff(int, int, int, int);
int x1, x2, y11, y22, i, j, k, mindiff[22][18], p = 4;
//int diff[22][18][256];
int sort[257], temp, point, l, motionx, motiony, col, row, x, y;
long peldiff = 0;
float amad = 0.0;
time_t first, second;
int diff1, diff, diffx, diffy;

void main(){

int i1,j1,k1;FILE *fpold,*fpnew;

first = time(NULL);

for(j=0; j<18 ;j++){

for(i=0; i<22 ;i++){


diff = 1000000;

//defines search spacefor(row = 0; row < 2*p ; row++){

//printf("\n");for(col = 0; col < 2*p; col++)

{x2 = x1 - p + col;y22 = y11 - p + row;diff1 = blockdiff(x1, x2, y11, y22);if(diff > diff1){

diff = diff1;diffx = x2;diffy = y22;

} //printf("diff %d ",diff[i][j][(16*row) + col]); //if(diff[i][j][(16*row) + col] == 0)

//{printf(" i %d j %d row %d col %d",i,j,row,col); //getchar(); }}

//}//------------------------------------------------------------------------

//SUBPROGARM FOR SORTING THE DIFFERENCES//----------------------------------------------------------------------- /*

for(k=0; k<256 ; k++){

sort[k] = diff[i][j][k];// printf("\tk%d diff%d", k, diff[i][j][k]);

}

for(k=0; k<256 ;k++){

for(l=0; l<256 ; l++){



}}



// printf("\nsort=%d", sort[255]);// getch();

for(k=0; k<256; k++){


l = k;}

}x = l % 16;y = (l - x)/16;

*/motionx = diffx -x1;motiony = diffy -y11;

// printf("\t %d %d %d %d", x1, y11, motionx , motiony);// printf("\n\t%d %d",(j*11)+i, mindiff[i][j]);







for(i1 = 0; i1<8; i1++){


{point = fgetc(fpold);

fputc(point, fpnew);}


}}

fclose(fpnew);fclose(fpold); //amad = amad + sqrt(motionx*motionx + motiony*motiony);peldiff = peldiff + mindiff[i][j];}

}

amad = amad/99;printf("\nAMAD = %f", amad);printf("\nPixel Difference %ld", peldiff/99);second=time(NULL);printf("\nDifference in time %ld", second - first);getch();

}



int block1[8][8], block2[8][8],i1,j1,k1,ch;int diff1[8][8];register int totaldiff = 0;FILE *fp1,*fp2;



else{




}

for(i1=0; i1<8 ; i1++){

for(j1=0; j1<8 ;j1++){


}}



for(i1 = 0; i1<8; i1++){

for(j1 =0; j1<8; j1++) {

block1[i1][j1] = fgetc(fp1); }


{ch = fgetc(fp1);}

}


ch = fgetc(fp2);




}

for(i1=0; i1<8 ; i1++){

for(j1=0; j1<8 ;j1++){


}}


}// else loop end return(totaldiff);

}

A.7 Three Step Search with simple C for 8 x 8 blocks

//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT 3 STEP SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>


void main(){


//clrscr();first = time(NULL);

for(j=0; j<18 ;j++){

for(i=0; i<22 ;i++){


locx = x1;

locy = y11;

p = 8;











//-----------------------------------------------------------------------

for(k=0; k<9 ; k++){


}

for(k=0; k<9 ;k++){

for(l=0; l<9 ; l++){



}}



for(k=0; k<9; k++){


l = k;

}}

if(l==0){


}

if(l==1){


}

if(l==2){


}

if(l==3){


}

if(l==4){


}

if(l==5){


}

if(l==6){


}

if(l==7){


}

if(l==8){


}





///*//----------------------------------------------------------------------// SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE//----------------------------------------------------------------------



//SKIP PIXELS UPTO INITIAL POINTfseek(fpold,(176 * y11) + x1, 0);

fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);//SKIP PIXELS USING MOTION VECTOR FOR THE NEW IMAGE


for(i1 = 0; i1<8 ;i1++){



}fseek(fpold,168,1);fseek(fpnew,168,1);

}}


}
second = time(NULL);
printf("Time taken %ld", second - first);   //TO FIND CPU TIME
printf("\nAMAD=%lf", amad/(22*18));
printf("\nPixel Difference %ld", peldiff/(22*18));
getch();

}






else{


if(fp2 == NULL){


}

if(fp1 == NULL){


}

for(i1=0; i1<8 ; i1++){

for(j1=0; j1<8 ;j1++){


}}

// for(i1=0; i1<(176 * y11) + x1; i1++)//SKIP PIXELS UPTO INTIAL POINT// ch = fgetc(fp1);

fseek(fp1,(176 * y11) + x1,0);


for(i1 = 0; i1<8; i1++)

{for(j1 =0; j1<8; j1++)

{block1[i1][j1] = fgetc(fp1); }for(k1=0; k1< 168; k1++) //SKIP PIXELS FROM SAME LINE

NOT NEEDED{ch = fgetc(fp1);}

}


ch = fgetc(fp2);




}

for(i1=0; i1<8 ; i1++){

for(j1=0; j1<8 ;j1++){


}}

fclose(fp1);fclose(fp2);}// else loop endreturn(totaldiff);

}

A.8 Three Step Search Program with SSE 2 for 8 x 8 blocks

//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT 3 STEP SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
#include<sse2mmx.h>
#include<xmmintrin.h>


void main(){


first = time(NULL);

for(j=0; j<18 ;j++){

for(i=0; i<22 ;i++){



p = 8;











//-----------------------------------------------------------------------

for(k=0; k<9 ; k++){

sort[k] = diff[i][j][k];

// printf("\nk=%d diff=%d", k, diff[i][j][k]);}

for(k=0; k<9 ;k++){

for(l=0; l<9 ; l++){



}}



for(k=0; k<9; k++){


l = k;}

}

if(l==0){


}

if(l==1){


}

if(l==2){


}

if(l==3){


}

if(l==4){


}

if(l==5){


}

if(l==6){


}

if(l==7){


}

if(l==8){


}











for(i1 = 0; i1<8 ;i1++){



}fseek(fpold,168,1);fseek(fpnew,168,1);

}}

fclose(fpnew);

fclose(fpold);//*/}

}
second = time(NULL);
printf("Time taken %ld", second - first);   //TO FIND CPU TIME
printf("\nAMAD=%lf", amad/(22*18));
printf("\nPixel Difference %ld", peldiff/(22*18));
getch();

}


int blockdiff(int x1,int x2,int y11,int y22) { unsigned char block1[16], block2[16]; int i1,j1,k1,ch,offset1,offset2; int diff1[16][16], totaldiff = 0; FILE *fp1, *fp2; __m64 *b1,*b2,m1; union mmx

{__m64 m;int x[2];

}m;


if(x2 < 0 || y22 < 0 || x2 >168 || y22 >136) {

totaldiff = 100000; }

else {




}

//skip to the initial point and get the result.
for(i1 = 0; i1<8; i1++)
{
    offset1 = (176*(y11 + i1) + x1);
    offset2 = (176*(y22 + i1) + x2);

    fseek(fp1,offset1,SEEK_SET);
    fread(block1,1,8,fp1);

    fseek(fp2,offset2,SEEK_SET);
    fread(block2,1,8,fp2);

    // type casting pointers.
    b1 = (__m64*)block1;
    b2 = (__m64*)block2;

    //SAD for 8 bytes.
    m1 = _m_psadbw(*b1,*b2);
    m.m = m1;
    totaldiff = totaldiff + m.x[0];

}


}// else loop end return(totaldiff);}

/*****************************************************************************
 * File Name : DCTmatrix.c
 *
 * Comment   : This file will help produce ROM values stored to compute DCT
 *             using distributed arithmetic.
 *
 * Project   : ECE734
 * Author    : Shamik Valia
 *
 ****************************************************************************/

/* Arguments for the execution
 *   foo precision
 * command line
 *   gcc -g DCTmatrix.c -o DCTmatrix -lm
 */

#include<stdio.h>
#include<math.h>
#include<stdlib.h>
#include<string.h>   /* strlen() */
#define FOR_HARDWARE
//#define FOR_SIMULATOR

int main(int argc, char *argv[])
{
    FILE *R1,*R2,*R3,*R4,*R5,*R6,*R7,*R8;
    double A,B,C,D,E,F,G,v1;
    int i,j,k,l;
    void dectobinary(double, int, char[]);
    void complement(char[]);

    /* the precision argument is required */
    if (argc != 2) {
        fprintf(stderr, "usage: %s precision\n", argv[0]);
        return 1;
    }

    //char *C;
    char c[atoi(argv[1])+1];

    // printf("am here please print");

    A = cos(M_PI/4);
    B = cos(M_PI/8);
    C = sin(M_PI/8);
    D = cos(M_PI/16);
    E = cos(3*M_PI/16);
    F = sin(3*M_PI/16);
    G = sin(M_PI/16);

    printf("PI = %lf \n",M_PI);
    printf("A = %lf \n",A);
    printf("B = %lf \n",B);
    printf("C = %lf \n",C);
    printf("D = %lf \n",D);
    printf("E = %lf \n",E);
    printf("F = %lf \n",F);
    printf("G = %lf \n",G);

    R1 = fopen("dct_ROM1.dat","w");
    R2 = fopen("dct_ROM2.dat","w");
    R3 = fopen("dct_ROM3.dat","w");
    R4 = fopen("dct_ROM4.dat","w");
    R5 = fopen("dct_ROM5.dat","w");
    R6 = fopen("dct_ROM6.dat","w");
    R7 = fopen("dct_ROM7.dat","w");
    R8 = fopen("dct_ROM8.dat","w");

    /* i, j, k, l are the four address bits of each ROM */
    for(i=0;i<=1;i++) {
     for(j=0;j<=1;j++) {
      for(k=0;k<=1;k++) {
       for(l=0;l<=1;l++) {
#ifdef FOR_SIMULATOR
        v1=0.5*((A*i)+(A*j)+(A*k)+(A*l));
        fprintf(R1,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
        v1=0.5*((B*i)+(C*j)-(C*k)-(B*l));
        fprintf(R2,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
        v1=0.5*((A*i)-(A*j)-(A*k)+(A*l));
        fprintf(R3,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
        v1=0.5*((C*i)-(B*j)+(B*k)-(C*l));
        fprintf(R4,"sequence %d%d%d%d value = %lf \n ",i,j,k,l,v1);
        v1=0.5*((D*i)+(E*j)+(F*k)+(G*l));
        fprintf(R5,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
        v1=0.5*((E*i)-(G*j)-(D*k)-(F*l));
        fprintf(R6,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
        v1=0.5*((F*i)-(D*j)+(G*k)+(E*l));
        fprintf(R7,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
        v1=0.5*((G*i)-(F*j)+(E*k)-(D*l));
        fprintf(R8,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
#endif
#ifdef FOR_HARDWARE
        v1=0.5*((A*i)+(A*j)+(A*k)+(A*l));
        dectobinary(v1,atoi(argv[1]),c);
        fprintf(R1,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
        v1=0.5*((B*i)+(C*j)-(C*k)-(B*l));
        dectobinary(v1,atoi(argv[1]),c);
        fprintf(R2,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
        v1=0.5*((A*i)-(A*j)-(A*k)+(A*l));
        dectobinary(v1,atoi(argv[1]),c);
        fprintf(R3,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
        v1=0.5*((C*i)-(B*j)+(B*k)-(C*l));
        dectobinary(v1,atoi(argv[1]),c);
        fprintf(R4,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
        v1=0.5*((D*i)+(E*j)+(F*k)+(G*l));
        dectobinary(v1,atoi(argv[1]),c);
        fprintf(R5,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
        v1=0.5*((E*i)-(G*j)-(D*k)-(F*l));
        dectobinary(v1,atoi(argv[1]),c);
        fprintf(R6,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
        v1=0.5*((F*i)-(D*j)+(G*k)+(E*l));
        dectobinary(v1,atoi(argv[1]),c);
        fprintf(R7,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
        v1=0.5*((G*i)-(F*j)+(E*k)-(D*l));
        dectobinary(v1,atoi(argv[1]),c);
        fprintf(R8,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
#endif
       }
      }
     }
    }
    fclose (R1);
    fclose (R2);
    fclose (R3);
    fclose (R4);
    fclose (R5);
    fclose (R6);
    fclose (R7);
    fclose (R8);
    return 0;
}

void dectobinary(double no, int precision, char c[])
{
    //double no;
    ///int precision ;
    //given the nos before the decimal point it will give binary representation
    float fixed;
    int decimal;
    int ans = 0, x, i = 0, two_complement = 0;
    void complement(char[]);
    //char C[precision];
    //printf("i am inside");
    if (no < 0) {
        two_complement = 1;
        no = - no;
    }

    decimal = (int)no;
    fixed = no - decimal;
    while(decimal/2 != 0)
    {
        x = decimal % 2;
        ans = ans + x*pow(10,i);
        i++;
        decimal = decimal / 2;
    }
    x = decimal % 2;
    ans = ans + x*pow(10,i);
    sprintf(c,"%d",ans);
    if(strlen(c) != 2)
    {
        if(strlen(c) > 2)
            fprintf(stderr,"error at decimal..more than 2 places");
        c[1] = c[0];
        c[0] = '0';
        c[2] = '\0';
    }
    // printf("stringlenth is %d",strlen(c));
    //given the nos after decimal point , gives the binary.
    for(i = strlen(c); i < precision; i++)
    {
        fixed = 2*fixed;
        if(fixed >= 1.0)
        {
            fixed = fixed - 1.0;
            c[i] = '1';
        }
        else
            c[i] = '0';
    }
    c[i] = '\0';

    if(two_complement == 1) {
        complement(c);
    }
    // return C ;
}

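void complement(char c[])
{
    /* two's complement: scan from the LSB, keep everything up to and
       including the first '1', then invert the remaining bits */
    int i = strlen(c);
    int flag = 0;
    while(i != -1)
    {
        if (flag == 0)
        {
            if(c[i] == '1')
                flag = 1;
            i--;
        }
        else //flag ==0 ends
        { //flag ==1
            if(c[i] == '1')
                c[i] = '0';
            else
                c[i] = '1';
            i--;
        } //flag==1 ends
    } //while ends
}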

/* Real dct implementation.

(the right-hand sides use the original input values)
x0' = x0 + x7 ;
x1' = x1 + x6 ;
x2' = x2 + x5 ;
x3' = x3 + x4 ;

x4' = x0 - x7 ;
x5' = x1 - x6 ;
x6' = x2 - x5 ;
x7' = x3 - x4 ;

X0 = A(x0') + A(x1') + A(x2') + A(x3');
X2 = B(x0') + C(x1') - C(x2') - B(x3');
X4 = A(x0') - A(x1') - A(x2') + A(x3');
X6 = C(x0') - B(x1') + B(x2') - C(x3');

X1 = D(x4') + E(x5') + F(x6') + G(x7');
X3 = E(x4') - G(x5') - D(x6') - F(x7');
X5 = F(x4') - D(x5') + G(x6') + E(x7');
X7 = G(x4') - F(x5') + E(x6') - D(x7');

*************************************************/

#include<stdio.h>
#include<math.h>
#include<stdlib.h>

int main()
{
    short A,B,C,D,E,F,G;
    short int X0,X1,X2,X3,X4,X5,X6,X7;
    short int x0,x1,x2,x3,x4,x5,x6,x7;
    double X0_,X1_,X2_,X3_,X4_,X5_,X6_,X7_;
    int Add_SS2(int, int);
    int n1,n2,n3;
    /*
    A = cos(M_PI/4);
    B = cos(M_PI/8);
    C = sin(M_PI/8);
    D = cos(M_PI/16);
    E = cos(3*M_PI/16);
    F = sin(3*M_PI/16);
    G = sin(M_PI/16);
    */
    /* the same coefficients scaled by 2^15 */
    A = 23170;
    B = 30273;
    C = 12539;
    D = 32138;
    E = 27245;
    F = 18204;
    G = 6392;

    printf("input x0 : ");
    scanf("%hd",&x0);
    printf("\ninput x1 : ");
    scanf("%hd",&x1);
    printf("\ninput x2 : ");
    scanf("%hd",&x2);
    printf("\ninput x3 : ");
    scanf("%hd",&x3);
    printf("\ninput x4 : ");
    scanf("%hd",&x4);
    printf("\ninput x5 : ");
    scanf("%hd",&x5);
    printf("\ninput x6 : ");
    scanf("%hd",&x6);
    printf("\ninput x7 : ");
    scanf("%hd",&x7);

    /*
    X0 = (short int)(((A*x0) + (A*x1) + (A*x2) + (A*x3))>>1);
    X2 = (short int)(((B*x0) + (C*x1) - (C*x2) - (B*x3))>>1);
    X4 = (short int)(((A*x0) - (A*x1) - (A*x2) + (A*x3))>>1);
    X6 = (short int)(((C*x0) - (B*x1) + (B*x2) - (C*x3))>>1);
    X1 = (short int)(((D*x4) + (E*x5) + (F*x6) + (G*x7))>>1);
    X3 = (short int)(((E*x4) - (G*x5) - (D*x6) - (F*x7))>>1);
    X5 = (short int)(((F*x4) - (D*x5) + (G*x6) + (E*x7))>>1);
    X7 = (short int)(((G*x4) - (F*x5) + (E*x6) - (D*x7))>>1);
    */
    n1 = Add_SS2((A*x0), (A*x1));
    n2 = Add_SS2((A*x2), (A*x3));
    n3 = Add_SS2(n1,n2);
    X0 = (short int)(n3>>17);

    n1 = Add_SS2((B*x0), - (C*x2));
    n2 = Add_SS2((C*x1), - (B*x3));
    n3 = Add_SS2(n1,n2);
    X2 = (short int)(n3>>17);
    // X2=X2+ (((B*x_[0]) + (C*x_[1]) - (C*x_[2]) - (B*x_[3]))/(power(2,15-i)));

    n1 = Add_SS2((A*x0), - (A*x1));
    n2 = Add_SS2((A*x3), - (A*x2));
    n3 = Add_SS2(n1,n2);
    X4 = (short int)(n3>>17);
    //X4=X4+ (((A*x_[0]) - (A*x_[1]) - (A*x_[2]) + (A*x_[3]))/(power(2,15-i)));

    n1 = Add_SS2((C*x0), - (B*x1));
    n2 = Add_SS2((B*x2), - (C*x3));
    n3 = Add_SS2(n1,n2);
    X6 = (short int)(n3>>17);
    //X6=X6+ (((C*x_[0]) - (B*x_[1]) + (B*x_[2]) - (C*x_[3]))/(power(2,15-i)));

    n1 = Add_SS2((D*x4), (E*x5));
    n2 = Add_SS2((F*x6), (G*x7));
    n3 = Add_SS2(n1,n2);
    X1 = (short int)(n3>>17);
    //X1=X1+ (((D*x_[4]) + (E*x_[5]) + (F*x_[6]) + (G*x_[7]))/(power(2,15-i)));

    n1 = Add_SS2((E*x4), - (G*x5));
    n2 = Add_SS2(- (D*x6), - (F*x7));
    n3 = Add_SS2(n1,n2);
    X3 = (short int)(n3>>17);
    // X3=X3+ (((E*x_[4]) - (G*x_[5]) - (D*x_[6]) - (F*x_[7]))/(power(2,15-i)));

    n1 = Add_SS2((F*x4), - (D*x5));
    n2 = Add_SS2((G*x6), (E*x7));
    n3 = Add_SS2(n1,n2);
    X5 = (short int)(n3>>17);
    // X5=X5+ ((F*x_[4]) - (D*x_[5]) + (G*x_[6]) + (E*x_[7]))/(power(2,15-i)));

    n1 = Add_SS2((G*x4), - (F*x5));
    n2 = Add_SS2((E*x6), - (D*x7));
    n3 = Add_SS2(n1,n2);
    X7 = (short int)(n3>>17);

    printf("X0 = %hd \n",X0);
    printf("X1 = %hd \n",X1);
    printf("X2 = %hd \n",X2);
    printf("X3 = %hd \n",X3);
    printf("X4 = %hd \n",X4);
    printf("X5 = %hd \n",X5);
    printf("X6 = %hd \n",X6);
    printf("X7 = %hd \n",X7);

    /* reference values computed in floating point */
    X0_ = 0.5*((A*x0) + (A*x1) + (A*x2) + (A*x3));
    X2_ = 0.5*((B*x0) + (C*x1) - (C*x2) - (B*x3));
    X4_ = 0.5*((A*x0) - (A*x1) - (A*x2) + (A*x3));
    X6_ = 0.5*((C*x0) - (B*x1) + (B*x2) - (C*x3));
    X1_ = 0.5*((D*x4) + (E*x5) + (F*x6) + (G*x7));
    X3_ = 0.5*((E*x4) - (G*x5) - (D*x6) - (F*x7));
    X5_ = 0.5*((F*x4) - (D*x5) + (G*x6) + (E*x7));
    X7_ = 0.5*((G*x4) - (F*x5) + (E*x6) - (D*x7));

    printf("X0 = %lf \n",X0_);
    printf("X1 = %lf \n",X1_);
    printf("X2 = %lf \n",X2_);
    printf("X3 = %lf \n",X3_);
    printf("X4 = %lf \n",X4_);
    printf("X5 = %lf \n",X5_);
    printf("X6 = %lf \n",X6_);
    printf("X7 = %lf \n",X7_);

    return 0;
}

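/* Saturating 32-bit signed addition. The range is checked before adding,
   because signed overflow is undefined behaviour in C. */
int Add_SS2(int a, int b)
{
    if (b > 0 && a > 0x7fffffff - b)
        return 0x7fffffff;            /* clamp to the largest value */
    else if (b < 0 && a < (-0x7fffffff - 1) - b)
        return (-0x7fffffff - 1);     /* clamp to the smallest value */
    else
        return a + b;
}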

/*****************************************************************************
 * File Name : ROM.c
 *
 * Comment   : This file will help produce ROM code in verilog
 *
 * Project   : ECE699
 * Author    : Shamik Valia
 * Creation Date : 27 July 2003.
 * Advisor   : Prof Mike Schulte.
 ****************************************************************************/

/* Execution format will be
 *   ROM inputfile inputbits# outputfile outputbits#
 */

#include<stdio.h>

int main(int argc, char *argv[])
{
    FILE *fo;

    //checking for the right syntax given.
    if(argc != 5)
    {
        fprintf(stderr,"Insufficient arguments\n");
        fprintf(stderr,"Requires 4 arguments: inputfile inputbits# outputfile outputbits#\n");
        return 1;
    }

    //open file to read
    // fi = fopen(argv[1],"r");
    //open a new file to write output
    fo = fopen(argv[3],"w");

    //Copyright information...
    fprintf(fo,"//#########################################################\n");
    fprintf(fo,"//\n");
    fprintf(fo,"// File Name:%s \n",argv[3]);
    fprintf(fo,"//\n");
    fprintf(fo,"// Comment : The file is ROM code for inbit#%s and output#%s \n",argv[2],argv[4]);
    fprintf(fo,"// Project: \n");
    fprintf(fo,"// Author : Shamik Valia \n");
    fprintf(fo,"// Creation Date : \n");
    fprintf(fo,"// Advisor:Prof.Mike Schulte\n");
    fprintf(fo,"//\n");
    fprintf(fo,"//#########################################################\n");
    fprintf(fo,"\n\n");

    //Generation of verilog code.
    fprintf(fo,"module rom(\n ");
    fprintf(fo," select, //input select line to ROM \n");
    fprintf(fo," output1, //output data of the ROM \n");
    fprintf(fo," output2 //output data of the ROM \n");
    fprintf(fo,"); \n\n");
    fprintf(fo,"parameter in_bits = %s, \n",argv[2]);
    fprintf(fo," out_bits = %s; \n\n",argv[4]);
    fprintf(fo,"input [in_bits-1:0] select ; \n");
    fprintf(fo,"output [out_bits-1:0] output1 ; \n\n");
    fprintf(fo,"output [out_bits-1:0] output2 ; \n\n");
    fprintf(fo,"reg [out_bits-1:0] output1, output2 ;\n\n");
    fprintf(fo,"always@(select) \n");
    fprintf(fo," case(select) \n");
    fprintf(fo," `include \"%s\" \n ",argv[1]);
    fprintf(fo," endcase \n");
    fprintf(fo,"endmodule \n");

    fclose(fo);
    return 0;
}

Distributed Arithmetic structure

//

module struc_16dap(in1,in2,in3,in4,reset,clk,start,out,done);

input  [15:0] in1,in2,in3,in4;
input         reset,clk,start;
output [20:0] out;
output        done;

wire [3:0]  in1_,in2_,in3_,in4_,in5_,in6_,in7_,in8_,
            in9_,in10_,in11_,in12_,in13_,in14_,in15_,in16_;
wire [3:0]  in_ROM1,in_ROM2;
wire [15:0] out_ROM1,out_ROM2;
wire [18:0] ext_out1,ext_out2;
wire [20:0] adder_out;
wire        reset_reg,shiftAdd,shiftaddcomplement,rd,reset_reg_use;
wire [2:0]  S1,S2;
wire        strobe;

//instantiation of a DAP controller
dap_controller_4 D1(reset_reg_use,done,S1,S2,reset_reg,shiftAdd,shiftaddcomplement,rd,reset,start,clk);

//wire the bits to be put in mux: slice k+1 collects bit k of every input word.
assign in16_ = {in1[15],in2[15],in3[15],in4[15]};
assign in15_ = {in1[14],in2[14],in3[14],in4[14]};
assign in14_ = {in1[13],in2[13],in3[13],in4[13]};
assign in13_ = {in1[12],in2[12],in3[12],in4[12]};
assign in12_ = {in1[11],in2[11],in3[11],in4[11]};
assign in11_ = {in1[10],in2[10],in3[10],in4[10]};
assign in10_ = {in1[9],in2[9],in3[9],in4[9]};
assign in9_  = {in1[8],in2[8],in3[8],in4[8]};
assign in8_  = {in1[7],in2[7],in3[7],in4[7]};
assign in7_  = {in1[6],in2[6],in3[6],in4[6]};
assign in6_  = {in1[5],in2[5],in3[5],in4[5]};
assign in5_  = {in1[4],in2[4],in3[4],in4[4]};
assign in4_  = {in1[3],in2[3],in3[3],in4[3]};
assign in3_  = {in1[2],in2[2],in3[2],in4[2]};
assign in2_  = {in1[1],in2[1],in3[1],in4[1]};
assign in1_  = {in1[0],in2[0],in3[0],in4[0]};

//select the bit slice that drives each ROM input
mux_8 #(4) M1(in_ROM1,in2_,in4_,in6_,in8_,in10_,in12_,in14_,in16_,S1);
mux_8 #(4) M2(in_ROM2,in1_,in3_,in5_,in7_,in9_,in11_,in13_,in15_,S2);

//ROM data structure.
ROM_16 RO1(in_ROM1,out_ROM1,rd);
ROM_16 RO2(in_ROM2,out_ROM2,rd);

//Sign extension for ROM values
assign ext_out1 = {out_ROM1[15],out_ROM1[15],out_ROM1[15],out_ROM1}; //higher value
assign ext_out2 = {out_ROM2[15],out_ROM2[15],out_ROM2[15],out_ROM2}; //lesser value

//adder block
adder Add1(adder_out,ext_out1,ext_out2,out,shiftAdd,shiftaddcomplement,reset_reg,reset_reg_use);

//accumulator register, enabled on every clock
assign strobe = 1'b1;
reg_ #(21) RE1(out,adder_out,clk,strobe,reset_reg);

endmodule

module adder(adder_out,in_adder_,in_adder,r,shiftAdd,shiftaddcomplement,reset_reg,reset_reg_use);

input  [18:0] in_adder_,in_adder;
input  [20:0] r;
input         shiftAdd,shiftaddcomplement,reset_reg,reset_reg_use;
output [20:0] adder_out;
reg    [20:0] adder_out;
reg    [20:0] c,d,e;   // copies of the three operands; they drive no output

always@(in_adder,in_adder_,r,shiftAdd,shiftaddcomplement,reset_reg_use)
begin
  if (shiftAdd == 1'b1)
  begin
    if (reset_reg_use == 1'b1)
    begin
      // first step: the accumulator value r is ignored
      c = {in_adder_,1'b0,1'b0};
      d = {in_adder[18],in_adder,1'b0};
      adder_out = {in_adder_,1'b0,1'b0} + {in_adder[18],in_adder,1'b0};
    end
    else
    begin
      // shift-and-add: accumulator shifted right by two bits with sign extension
      c = {in_adder_,1'b0,1'b0};
      d = {in_adder[18],in_adder,1'b0};
      e = {r[20],r[20],r[20:2]};
      adder_out = {in_adder_,1'b0,1'b0} + ({in_adder[18],in_adder,1'b0}) +
                  ({r[20],r[20],r[20:2]});
    end
  end
  else
  begin
    if (shiftaddcomplement == 1'b1)
    begin
      // last step: the higher ROM value is two's-complemented before the add
      adder_out = {(~in_adder_ + 1'b1),1'b0,1'b0} + ({in_adder[18],in_adder,1'b0}) +
                  ({r[20],r[20],r[20:2]});
      c = {(~in_adder_ + 1'b1),1'b0,1'b0};
      d = {in_adder[18],in_adder,1'b0};
      e = {r[20],r[20],r[20:2]};
    end
    else
    begin
      // zero constants sized to match the 21-bit registers
      c = 21'b0;
      d = 21'b0;
      e = 21'b0;
      adder_out = 21'b0;
    end
  end
end

endmodule

module dap_controller_4(reset_reg_use,done,S1,S2,reset_reg,shiftAdd,shiftaddcomplement,rd,reset,start,clk);

input        reset,clk,start;
output       reset_reg,shiftAdd,shiftaddcomplement,rd,done,reset_reg_use;
output [2:0] S1,S2;
reg          reset_reg,shiftAdd,shiftaddcomplement,rd,reset_reg_use;
reg    [2:0] S1,S2;
reg    [2:0] state,nextstate;
reg          done;

parameter ST0 = 3'b000;
parameter ST1 = 3'b001;
parameter ST2 = 3'b010;
parameter ST3 = 3'b011;
parameter ST4 = 3'b100;
parameter ST5 = 3'b101;
parameter ST6 = 3'b110;
parameter ST7 = 3'b111;

always@(posedge clk or posedge reset)
begin
  // reset_reg = 1'b0;
  // shiftAdd = 1'b0;
  // shiftaddcomplement =1'b0;
  // rd = 1'b1;
  if (reset == 1'b1)
  begin
    state = ST0;
    //reset_reg = 1'b1;
    rd = 1'b0;
    reset_reg_use = 1'b1;
  end
  else
  begin
    case(state)
      ST0 : if (start == 1'b1)
            begin
              S1 = 3'b000;
              S2 = 3'b000;
              nextstate = ST1;
              reset_reg_use = 1'b1;
              shiftAdd = 1'b1;
              rd = 1'b1;
              done = 1'b0;
            end
            else
              nextstate = ST0;

      ST1 : begin
              nextstate = ST2;
              S1 = 3'b001;
              S2 = 3'b001;
              shiftAdd = 1'b1;
              reset_reg_use = 1'b0;
              rd = 1'b1;
            end

      ST2 : begin
              nextstate = ST3;
              S1 = 3'b010;
              S2 = 3'b010;
              shiftAdd = 1'b1;
              reset_reg_use = 1'b0;
              rd = 1'b1;
            end


end


end


end


end

      ST7 : begin
              nextstate = ST0;
              S1 = 3'b111;
              S2 = 3'b111;
              //shiftAdd = 1'b1;
              shiftaddcomplement = 1'b1;
              rd = 1'b1;
              done = 1'b1;
            end

      default : nextstate = ST0;
    endcase

    state = nextstate;
  end
end

endmodule

/*module struc_16;

wire [15:0] in1,in2,in3,in4;
wire [20:0] out;
reg         clk,reset,start;
reg  [63:0] t;
wire [31:0] out1;
wire [20:0] value;
wire [15:0] error;
wire        done;
reg         val;

assign in1 = t[15:0];
assign in2 = t[31:16];
assign in3 = t[47:32];
assign in4 = t[63:48];

check16 c1(in1,in2,in3,in4,out1);
struc_16dap D1(in1,in2,in3,in4,reset,clk,start,out,done);

assign value = out<<1;
//assign error = (val==1'b1) ? (out1[30:15] - out[16:1]) : 16'b0;
assign error = out1[30:15] - out[16:1];

always
  #5 clk = ~clk;

always
begin
  # 5  start = 1'b1;
       val = 1'b0;
       t = t+1;
  # 10 start = 1'b0;
  # 95 val = 1'b1;
end

initial
begin
  clk = 1'b1;
  reset = 1'b1;
  t = 64'h0000_0000_0000_0000;
  #7 reset = 1'b0;
end

endmodule */
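
Software reference model for the distributed arithmetic structure

The listing below is a minimal C sketch of the arithmetic that the struc_16dap hardware above performs, intended only as a way to check results by hand. It is not part of the original design files. The function names da_build_rom and da_inner_product and the coefficient values in the usage note are assumptions made for illustration, since the ROM contents are not included in this listing. The sketch handles one bit slice per step, whereas the hardware reads two ROMs per clock and so covers the sixteen bit positions in eight states. The address bit order follows the Verilog concatenation {in1[k],in2[k],in3[k],in4[k]}, and the sign-bit slice is subtracted, which corresponds to the shiftaddcomplement step of the controller.

```c
#include <stdint.h>

#define DA_INPUTS 4   /* number of input words, as in struc_16dap */
#define DA_BITS   16  /* width of each input word */

/* Fill the 16-entry lookup table. Entry addr holds the sum of the
 * coefficients selected by the address bits. Input 0 (in1) is the most
 * significant address bit, matching {in1[k],in2[k],in3[k],in4[k]}.
 * The coefficient values are supplied by the caller (hypothetical here). */
static void da_build_rom(const int32_t coeff[DA_INPUTS],
                         int32_t rom[1 << DA_INPUTS])
{
    for (unsigned addr = 0; addr < (1u << DA_INPUTS); addr++) {
        int32_t sum = 0;
        for (int i = 0; i < DA_INPUTS; i++) {
            if (addr & (1u << (DA_INPUTS - 1 - i)))
                sum += coeff[i];
        }
        rom[addr] = sum;
    }
}

/* Inner product of coeff and x computed without multiplying by x:
 * for every bit position, form a ROM address from that bit of each
 * input, look up the partial sum, and accumulate it with weight 2^bit.
 * The slice built from the sign bits carries weight -2^15 in two's
 * complement, so it is subtracted instead of added. */
static int64_t da_inner_product(const int32_t rom[1 << DA_INPUTS],
                                const int16_t x[DA_INPUTS])
{
    int64_t acc = 0;
    for (int bit = 0; bit < DA_BITS; bit++) {
        unsigned addr = 0;
        for (int i = 0; i < DA_INPUTS; i++) {
            unsigned b = ((uint16_t)x[i] >> bit) & 1u;
            addr |= b << (DA_INPUTS - 1 - i);
        }
        int64_t term = (int64_t)rom[addr] * ((int64_t)1 << bit);
        if (bit == DA_BITS - 1)
            acc -= term;   /* sign-bit slice */
        else
            acc += term;
    }
    return acc;
}
```

As a usage example, with the illustrative coefficients {3, -5, 7, 11} and inputs {100, -200, 300, -32768}, da_build_rom followed by da_inner_product gives the same value as the direct sum 3*100 + (-5)*(-200) + 7*300 + 11*(-32768). The hardware additionally scales and truncates its result to 21 bits, which this sketch does not model.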
