


CS152 Computer Architecture and Engineering

Final Project

Multiprocessing, Cache Coherency, Synchronization
Multiplier/Divider, TLB

a.k.a. Haste Makes Waste

December 9, 1999
Tony Tseng, David Chang, Mark Chuang, Greg Lawrence, Chris Shumway

TA: Victor Wen


INTRODUCTION AND SUMMARY

For Lab 7, our group implemented a multiprocessor, a level-two cache, a divider/multiplier, and a translation lookaside buffer. The processor itself is essentially unchanged from Lab 6. However, the memory system has gone through major changes to accommodate the additional processor and the level-two cache: it now contains four level-one caches, three arbiters, and one level-two cache. A synchronization module was designed to let programs running on both processors properly share memory through locks. In addition, there is an eight-entry TLB placed before the memory controller. When needed, this TLB may reference a page table located at a fixed point in high memory.

Top Level Block Diagram


Performance Summary

Maximum clock rate = 18.87 MHz

For final_1.s:
Execution time = 865363 ns
Number of cycles = 16328
CPI ~ 2.72 (assuming there are roughly 6000 instructions)

For merge_sort.s:
Execution time = 859851 ns
Number of cycles = 16224
CPI ~ 2.70 (assuming there are roughly 6000 instructions)

For dot_product.s:
Execution time = 124370 ns
Number of cycles = 2347
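As a quick arithmetic check, the CPI and clock-rate figures above follow directly from the cycle counts, the execution times, and the rough 6000-instruction estimate:

```python
# Recompute the summary figures. Times are in nanoseconds; the
# instruction count is the report's rough estimate, not an exact count.
def cpi(cycles, instructions):
    return cycles / instructions

def clock_mhz(cycles, time_ns):
    # cycles per nanosecond is GHz; multiply by 1000 for MHz
    return cycles / time_ns * 1000

print(round(cpi(16328, 6000), 2))          # final_1.s
print(round(cpi(16224, 6000), 2))          # merge_sort.s
print(round(clock_mhz(16328, 865363), 2))  # maximum clock rate in MHz
```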


ALL FEATURES DESCRIPTION

Memory System

The memory system consists of two level-one data caches, two level-one instruction caches, three arbiters, and two DRAM banks. It is organized as in the illustration below.

Whenever there is a request from an instruction cache, it has to go through both arbiters to reach the memory. Data requests have to go through the first arbiter, the level-two cache, and the second arbiter. The arbiters are essentially unchanged from Lab 6, except that the level-one arbiters take in 32 bits from the level-one caches instead of 64 bits.

Synchronization Module

The synchronization module handles lock variables in the address range 0xfffffff0 to 0xffffffff. If a memory request is normal (i.e. the address is outside this range), the module simply relays the access and store signals to the memory. Otherwise, it blocks the signals to the memory and handles the request itself.

[Block diagram: processor, synchronization module, and L1 data cache]


To mimic normal memory behavior, the processor asks for a lock by loading a word and releases it by storing a word. When two processors attempt to grab the same lock at the same time, atomic behavior is properly observed: only one of them can acquire the lock.
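The lock semantics can be sketched behaviorally (the class and method names, the dictionary-backed memory, and the 0-means-acquired convention are our illustration, not the schematic's signals):

```python
# Behavioral model of the synchronization module: addresses in
# [0xfffffff0, 0xffffffff] are intercepted; everything else is relayed
# to memory. A load on a lock address acts as an atomic test-and-set
# (returns the old value, then marks the lock taken); a store releases.
LOCK_LO, LOCK_HI = 0xfffffff0, 0xffffffff

class SyncModule:
    def __init__(self, memory):
        self.memory = memory     # backing store for normal accesses
        self.locks = {}          # lock address -> 1 if currently held

    def load(self, addr):
        if LOCK_LO <= addr <= LOCK_HI:
            old = self.locks.get(addr, 0)
            self.locks[addr] = 1  # grab the lock atomically
            return old            # 0 means the caller now owns the lock
        return self.memory.get(addr, 0)

    def store(self, addr, value):
        if LOCK_LO <= addr <= LOCK_HI:
            self.locks[addr] = 0  # releasing the lock
        else:
            self.memory[addr] = value

sync = SyncModule({})
assert sync.load(0xfffffff0) == 0  # first processor acquires the lock
assert sync.load(0xfffffff0) == 1  # second processor sees it held
sync.store(0xfffffff0, 0)          # release
assert sync.load(0xfffffff0) == 0  # acquirable again
```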

Cache Coherency

The coherency system is another important aspect of multiprocessing. All cached copies of the data at a given address must be updated every time something is stored to that address. To implement this in the easiest way, we assumed that potential simultaneous accesses to the same address are handled by synchronization and software. There are two scenarios to consider when implementing cache coherency. The first is a processor storing to an address held by the other processor's data cache. Since the two processors can be assumed not to store to the same address at the same time, the other processor is not accessing that line, so it can safely be written within the one-cycle hit-and-store time our caches use. The second scenario is a processor storing to an address which is not in the other processor's data cache. In this case, the write-through policy of our L1 caches automatically solves the problem: writing through guarantees that when the other processor accesses that address in the future, the data it reads is valid and recent. Implementing this system was not very hard. While it does not speed anything up, it is a simple scheme that provides cache coherency without hurting performance or adding extra cycles: our system allows a cache to accept two writes during one cycle when needed.
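Both scenarios can be captured in a toy model (class and function names are ours; the hardware performs the two writes within a single cycle rather than sequentially):

```python
# Toy model of the write-through coherency scheme: a store updates the
# writing processor's L1, the other L1 (only if it already holds the
# address), and memory, all conceptually "in the same cycle".
class L1Cache:
    def __init__(self):
        self.lines = {}

    def snoop_update(self, addr, value):
        if addr in self.lines:        # scenario 1: other cache holds it
            self.lines[addr] = value  # scenario 2: miss, nothing to do

def store(addr, value, own_l1, other_l1, memory):
    own_l1.lines[addr] = value        # write into our own cache
    other_l1.snoop_update(addr, value)
    memory[addr] = value              # write-through keeps memory recent

mem = {}
c0, c1 = L1Cache(), L1Cache()
c1.lines[0x100] = 5                   # processor 1 already caches 0x100
store(0x100, 7, c0, c1, mem)          # processor 0 stores to it
assert c1.lines[0x100] == 7 and mem[0x100] == 7
```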

[Diagram: the coherency check uses 2 comparators]


Multiplier / Divider

We implemented both a multiplier and a divider unit similar to the ones presented in class. Each unit consists of a special register connected to a controller. Binary multiplication is done by performing a series of additions, each involving the multiplicand shifted left the appropriate number of times. Instead of shifting the multiplicand left, one can shift the current summation result right, which allows a single 64-bit product register to be used. We designed this register so that it can shift its contents right and load a 32-bit value into both the upper and the lower half. In addition, to handle the carry bit that can occur when two large 32-bit values are added together, it accepts a high-bit-set signal which simply sets the 64th bit of the register to one after a shift operation. When a multiplication is requested, the controller runs a loop asserting the appropriate signals, set according to the version-three hardware multiplier described in lecture.
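The shift-right scheme can be modeled in software; a minimal sketch of the version-three multiplier loop, assuming the register/controller split described above (the names are ours, not the schematic's):

```python
# 32x32 -> 64-bit unsigned multiply with a single 64-bit product
# register: load the multiplier into the low half, then 32 times
# add the multiplicand into the high half if the low bit is set,
# and shift the whole register right by one.
MASK32, MASK64 = (1 << 32) - 1, (1 << 64) - 1

def multu(multiplicand, multiplier):
    product = multiplier & MASK32          # low half holds the multiplier
    for _ in range(32):
        if product & 1:
            hi = (product >> 32) + multiplicand
        else:
            hi = product >> 32
        # a carry out of the 32-bit add lands in bit 63 after the shift,
        # which is exactly the register's "high bit set" signal
        product = ((hi << 32) | (product & MASK32)) >> 1
    return product & MASK64

assert multu(5, 3) == 15
assert multu(0xffffffff, 0xffffffff) == 0xffffffff * 0xffffffff
```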

Division is done in a similar fashion. Instead of performing a series of additions, division involves a series of subtractions, each coupled with a check that the result is non-negative. Again we can use a single 64-bit register, but this time shifting to the left. The divider register can shift its contents both left and right, and it can load a 32-bit value into both the upper and the lower half. When a division is requested, the divide controller runs a loop asserting the correct signals, set according to the hardware divider described in lecture.
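A corresponding behavioral sketch of the restoring-division loop (again with our own names; the hardware asserts shift and subtract signals rather than running software):

```python
# 32/32-bit unsigned restoring division in a single 64-bit register:
# the dividend starts in the low half; each of 32 steps shifts left,
# trial-subtracts the divisor from the high half, and keeps the result
# (setting the new quotient bit) only if it is non-negative.
MASK32 = (1 << 32) - 1

def divu(dividend, divisor):
    reg = dividend & MASK32
    for _ in range(32):
        reg <<= 1                       # shift remainder/quotient left
        hi = reg >> 32
        if hi >= divisor:               # trial subtraction is non-negative
            reg = ((hi - divisor) << 32) | (reg & MASK32) | 1
    return reg & MASK32, reg >> 32      # (quotient, remainder)

assert divu(100, 7) == (14, 2)
assert divu(5, 2) == (2, 1)
```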

Multiplication takes 33 cycles and division takes 34. Division takes one more cycle because a final shift right must be done at the end. The counts are not 32 and 33 respectively because there is some overhead in propagating the results through a register. The multiplier and divider can handle signed numbers as well. This is done by checking beforehand whether either operand is negative and, if so, taking its two's complement and proceeding as normal. After the result is computed, if exactly one of the initial operands was negative, we take the two's complement of the result to get the appropriate sign. Following the MIPS convention, overflow and division by zero must be checked by software.
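The sign handling reduces to a thin wrapper around the unsigned path; a sketch in which Python's arithmetic stands in for the unsigned multiplier unit:

```python
# Signed multiply on top of unsigned hardware: negate negative operands
# beforehand, multiply unsigned, then take the two's complement of the
# result if exactly one operand was negative. Python's * stands in for
# the unsigned unit here; the function name is ours.
MASK64 = (1 << 64) - 1

def mult_signed(a, b):
    neg = (a < 0) != (b < 0)        # exactly one operand negative?
    result = (abs(a) * abs(b)) & MASK64
    if neg:
        result = (-result) & MASK64  # two's complement of the result
    return result                    # raw 64-bit {hi, lo} register value

assert mult_signed(-3, 4) == (-12) & MASK64
assert mult_signed(-3, -4) == 12
```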

In terms of modifications to the datapath, these features required minimal changes. Because we had already implemented unsigned multiply (multu), the datapath and the staller were already in place. The only real change was widening the control signal: previously one bit was enough to tell the multiplier unit to start, but now, because four multiply and divide instructions share the same hardware, the control signal is a 3-bit bus. This gives a separate code for each instruction, as well as a bit telling the hardware to begin processing. The pattern we used is simple: the uppermost bit reflects signed or unsigned, the middle bit signifies when to begin a divide, and the lowest bit signifies when to begin a multiply. Thus the control signal for div is "110", and the code for multu is "001".
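Under the bit order just described, the four encodings can be tabulated (only the div and multu codes are quoted in the text; the mult and divu codes are inferred from the stated pattern):

```python
# The 3-bit mult/div control bus: bit 2 = signed, bit 1 = begin divide,
# bit 0 = begin multiply. The mult and divu encodings below are our
# inference from the pattern described in the report.
CONTROL = {
    "mult":  0b101,   # signed multiply (inferred)
    "multu": 0b001,   # unsigned multiply (quoted in the text)
    "div":   0b110,   # signed divide (quoted in the text)
    "divu":  0b010,   # unsigned divide (inferred)
}

def is_signed(code):   return bool(code & 0b100)
def starts_div(code):  return bool(code & 0b010)
def starts_mult(code): return bool(code & 0b001)

assert CONTROL["div"] == 0b110 and CONTROL["multu"] == 0b001
assert is_signed(CONTROL["mult"]) and starts_mult(CONTROL["mult"])
```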

TLB

We implemented a translation lookaside buffer, otherwise known as a TLB. This buffer maps virtual addresses to physical addresses and thus gives our processor the ability to use virtual memory. The TLB holds only eight entries, so the remaining virtual-to-physical mappings must be stored in a page table that is, for now, in a fixed memory location (0x3e0). This is acceptable because we do not support true virtual memory and relocatable address spaces.

The TLB consists of a number of registers per entry: two 32-bit registers holding the virtual and physical address of each translation, plus two flip-flops storing a valid bit and a reference bit. The valid bit is true when the corresponding translation is valid, and the reference bit is set every time an instruction uses the entry's virtual-to-physical translation. The reference bits are used to approximate a least-recently-used replacement algorithm: when a value is fetched from the page table, entries whose reference bits are zero are favored as replacement victims over those whose bits are set. Every hundred cycles, the reference bits are cleared to zero.
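A behavioral sketch of the entry layout and replacement policy (field and method names are ours; the hardware searches all eight entries with parallel comparators, not a loop):

```python
# Eight-entry TLB with valid and reference bits. A lookup sets the
# reference bit; a refill prefers invalid entries, then entries whose
# reference bit is clear (approximating LRU); reference bits are
# cleared periodically.
class TLB:
    SIZE = 8

    def __init__(self):
        self.entries = [dict(valid=False, ref=False, vpn=0, ppn=0)
                        for _ in range(self.SIZE)]

    def lookup(self, vpn):
        for e in self.entries:        # hardware: 8 comparators in parallel
            if e["valid"] and e["vpn"] == vpn:
                e["ref"] = True
                return e["ppn"]
        return None                   # miss -> go to the page table

    def refill(self, vpn, ppn):
        victim = next((e for e in self.entries if not e["valid"]),
                      next((e for e in self.entries if not e["ref"]),
                           self.entries[0]))
        victim.update(valid=True, ref=True, vpn=vpn, ppn=ppn)

    def clear_refs(self):             # done every hundred cycles
        for e in self.entries:
            e["ref"] = False

tlb = TLB()
tlb.refill(0, 0)                      # bootstrap identity mapping
assert tlb.lookup(0) == 0
assert tlb.lookup(5) is None          # miss: must consult the page table
```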


The TLB is placed right before the actual memory, so values placed in the cache are virtually addressed and do not require translation. Since our processor has no means of multiprogramming, a virtually addressed cache is not a problem: there is only one address space in use at any time. Placing the TLB before the cache would have introduced other complexities, because we have separate instruction and data caches; we would have needed two TLBs, and exactly how the two would interact was not obvious to us. We therefore decided to place the TLB before the actual memory. Normally, when a load or store must access memory because of a cache miss, the translation is done by checking whether the current tag is in the TLB. This lookup is done in parallel using eight comparators. If the translation is available, the matching comparator outputs a high signal which selects the corresponding physical address, and the processor continues as normal. The delay is the sum of the delays of a comparator and a tri-state buffer (1.5 ns + 1 ns = 2.5 ns), so our processor's critical path could potentially increase by 2.5 ns.

If the translation is not present in the TLB, then the address must be fetched from the page table. Since we are not in a multi-programming environment, our page table is in a fixed location in memory. The lookup involves stalling the processor and performing a memory read at the correct index into the page table.

Page Table

Since our processor has a limited amount of RAM, we made our page size very small, just so we can see TLB entries being updated while running our test programs. Real computers have page sizes larger than our entire memory, so virtual memory at that scale would be pointless here: there would be only one page. Besides, without a form of secondary storage such as a hard disk, and without an operating system to handle page-fault exceptions, this is purely a design exercise; it does not help performance (in fact it slightly decreases it, due to TLB-miss lookup time). Our page size is therefore just 32 words, which allows for an even 32 pages per DRAM. This would normally mean a virtual tag 27 bits long (32 bits minus five for the page offset) and a page table larger than all of our small physical memory. We decided to limit the page table to a single memory page, and to devote the uppermost page in memory (0x3e0) to it.

Our situation is even more interesting because we have two separate DRAMs with an interleaved address space. Our approach was to treat the "even" and "odd" pages as parts of the same big page. This works because of spatial locality, especially when fetching instructions: if the data at 0xf00 is accessed, it is very likely that the data at 0xf01 will be accessed next. Thus we need only one page table to cover both DRAMs. By shifting the virtual page address one bit to the left, we essentially ignore which DRAM the data is in, and can keep track of twice as many virtual pages in the same-sized page table.

The method we used is compact and makes it extremely simple to look up virtual page addresses. Each page table entry is six bits long: five bits for the physical page number and one valid bit. By byte-aligning each entry, we leave only two unused bits per byte, but can read each physical page number by simply looking at the appropriate byte address, and the valid bit by shifting. This packs four page table entries per word of the page table, for a total of 128 entries. Our virtual address space thus expands to 128 x 32 x 2 = 8192 words, a fourfold improvement over using only the physical memory in the two DRAMs. (Of course, because there is currently nowhere to store the data that gets paged out of main memory, it is impossible to use more than 64 distinct physical pages in a session; all data in pages that get paged out is lost.)
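The packing can be sketched as follows (the exact field positions are our reading of the description: the low five bits hold the physical page number, bit 5 the valid bit, and bits 7-6 are unused):

```python
# Page table entry packing: one byte per entry, four entries per 32-bit
# word. Bits [4:0] hold the physical page number, bit 5 the valid bit.
# Field positions are our assumption from the report's description.
def pack_entry(ppn, valid):
    return (ppn & 0x1f) | (valid << 5)

def unpack_word(word, index):
    """Extract page-table entry `index` (0-3) from a 32-bit word."""
    byte = (word >> (8 * index)) & 0xff
    return byte & 0x1f, (byte >> 5) & 1      # (ppn, valid)

# Pack four sample entries into one word, then read them back.
word = 0
for i, (ppn, valid) in enumerate([(3, 1), (17, 1), (0, 0), (31, 1)]):
    word |= pack_entry(ppn, valid) << (8 * i)

assert unpack_word(word, 1) == (17, 1)
assert unpack_word(word, 2) == (0, 0)
```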

TLB and Page Table Initialization

Adding a TLB presents a bootstrapping problem: how do we set up the TLB when the processor is first turned on? If we mark everything invalid, no program can run. We therefore initialize one of the eight TLB entries to map virtual address zero to physical address zero. This allows one page's worth of code to run at startup, which can then set up the page table for the rest of the program. Since the page table itself is just a portion of memory, a user program can update it by storing the appropriate values there.


PERFORMANCE SUMMARY

Critical Path

In Lab 5, memory accesses took literally no time, so we put the memory access at the end of the execute stage to ensure it would write at the very beginning of the memory stage. However, we never changed this design even after memory became painfully slow. Essentially, we inadvertently merged the memory stage and the execute stage into one. Therefore, our critical path is the forwarding path from the memory stage to the execute stage of the pipeline.

For example, suppose we have a sequence of two loads:

lw $2, 100($0)
lw $3, 0($0)

To calculate the address of the second load, the result of the first load must first be forwarded from memory to execute. Thus the total time = memory access time + 2 x mux delay + forwarding-unit delay + ALU delay, which explains our pitiful 53 ns cycle time. Because of this careless design error, we are quite certain this is THE critical path.


Performance analysis

Unfortunately, the final mystery program does not work on our Lab 6 datapath, so we really have nothing to compare the performance against. The final mystery program is the only one that works on both single- and multi-processor configurations and takes advantage of the multiprocessor where possible. It would be quite pointless to run a single-processor program on our multiprocessor just to make a performance comparison.

While running final_1.s and merge_sort.s, we obtained quite similar results:

Total time: 865363 ns vs. 859851 ns
Number of cycles: 16328 vs. 16224
CPI: 2.72 vs. 2.70

This is not surprising because they are essentially the same program, except that merge_sort uses locks and final_1 uses registers as locks for communication and synchronization.

The CPI of 2.7 is quite disappointing, though, considering that we are using two processors at the same time. We think the main cause is memory stalls: the memory is simply too slow and inefficient compared to the other components. An additional processor brings more competition for memory access, and our performance takes a direct hit from the increased memory contention.

On the other hand, while running dot_product.s, our execution time and cycle count are way down. The main reason can be attributed to the dual processors: since two processors run at the same time, we are also using two multipliers at the same time. dot_product.s mainly performs multiplications, and the two multipliers certainly help in that respect.


TESTING PHILOSOPHY

Our final-project testing strategy involved a combination of individual and comprehensive tests. The four major parts of our project were the new memory system with cache coherency, the synchronization module, the multiplier/divider, and the TLB. As with all the testing strategies we used this semester in this class, each module was tested independently before being combined into the multiprocessor. Independent testing included inserting these modules into the Lab 6 datapath; by doing these insertions and running old test programs, we were able to verify that each module produced correct data.

Memory System

L1 Cache: The first thing we had to test was the new write-through policy controller. This was done through basic state testing: we wrote command files to push the controller into the right states. Next we tested the new cache-coherency logic within each block of the cache, using command files that stored different values to different addresses at the same time, and made sure the enable logic for the cache data registers was correct and in order. After everything was tested, we created the symbol and put it into our Lab 6 datapath, because we wanted to see whether the write-through would work.

L2 Cache: The most important aspect of the L2 cache to test was its communication with the DRAM. With its extended size, we needed to ensure that the four words were loaded and stored correctly (lower and upper two-word blocks). We first tested the L2 cache independently, looking at its outputs in the schematics. It was difficult to test this module in isolation because writing to memory depends heavily on the arbiter: since our L2 cache has to do two consecutive stores to RAM, we had to use our arbiter to solve timing issues. Later we tested the L2 cache together with just the arbiter and DRAM. When we started testing the whole multiprocessor package, we made a new cache monitor for the L2 cache. We figured that since the L1 caches write through, the L2 cache is the easier place to debug: it contains most of the data touched by the test programs, and that data is always the most up-to-date version.

Modular Arbiter

The arbiter was tested in a similar way to the memory controller. For initial debugging, we made a separate schematic file containing only the arbiter module and tested it in Digital Fusion by asserting and deasserting various control and input signals. After we verified the initial results, we inserted it into the three places in our memory module and tested the memory module as a whole. We simulated various requests from the processors and looked at the outputs on the waveform to make sure everything functioned correctly. The biggest test of these arbiters was issuing four requests at the same time (DATA from processor 1, DATA from processor 2, INST from processor 1, and INST from processor 2). When we started testing the whole multiprocessor package, we made our arbiters assert usage signals to the screen, which made it easier to debug what was going wrong in the arbitration process.

DRAM Controller

No changes were made to the DRAM controller.

Synchronization Module: We first tested the synchronization module in isolation: we put it in a separate schematic, manually controlled its inputs, and observed its outputs step by step. After we were relatively confident of its correctness, we incorporated it into the memory system and ran several small test files involving the synchronization variables (sync.s, cachetest.s, cachetest2.s).

Multiplier/Divider: Because we had added unsigned multiply in Lab 5 as extra credit, a large portion of our testing was already done, specifically the interaction between the multiply/divide unit and the rest of the datapath. Because all four instructions show the same behavior (i.e. stall if an mfhi or mflo is about to be processed), we did not need to change this at all. However, we added two major components, the divider and the two's-complement converter, and these required extensive testing.

Overall: Aside from the normal modular strategy of testing components independently and then together, multiprocessing brought many new complications. The first thing we did once everything was in place was make sure a single processor worked independently (we tied the PC of processor 2 to 0 while running an old test program on processor 1). This was a magnificent testing strategy because it revealed a lot of problems in the memory system. If we had started off by running test programs involving both processors, we would have had twice the ground to cover. By first making sure one processor could work on its own with the rest of the new components, we solved far more problems with greater speed.

Testing our final project was very difficult. We spent countless nights compiling and debugging, then compiling and debugging again, taking a fairly brute-force approach to testing our multiprocessor. It was hard to map out a concise plan of action with so much work to do in so little time, especially when compiling and running test programs takes more than half of it. This philosophy led to complications, but by the time they appeared we had passed the point of no return: there was no way we could have stripped our processor down and tested it piece by piece.


Fortunately, the hours of brute-force testing led to a finished product. The key to testing everything was writing comprehensive test programs. One of the most comprehensive was formed by combining the Lab 5 and Lab 6 mystery programs into a single multiprocessing test program. This was especially useful because we already knew the programs' behavior; in fact, we encountered the exact same bug that we had found and fixed during Lab 6 a couple of weeks earlier.

One of the most important testing techniques we implemented was writing modules that reported what was going on inside each component. The following is a list of components and their corresponding testing modules.

L1 Cache: an error-checking module which asserted signals when multiple blocks were being accessed or overwritten.

L2 Cache: a cache monitor which output the contents of the L2 cache. As explained above, this is important because all the most recent data is in L2, since L1 writes through to it.

Processor: an instruction monitor which output to either a file or the screen. This module let us see which cycle and which instructions were causing our processor to exit unexpectedly.

Another testing technique we began to use in this final lab was setting breakpoints. This was especially important when testing our memory system: because the arbitration process is so complicated, it was necessary to break when the data address lines reached a certain address. We knew which addresses to break on because those were the addresses where the data was wrong in the end. For some of our test programs we would run Lab 6 on one computer and Lab 7 on another, which let us see side by side where the problems arose. Our motto was "STEP BY STEP", and that philosophy proved very wise: it is the only way to go.