
Horizontal demand prefetching: A novel approach to eliminating the jump problem



Daniel C. McCrackin and Barna Szabados, Department of Electrical and Computer Engineering, McMaster University, Hamilton, Ontario L8S 4L7

The principle of a novel prefetching strategy, horizontal demand prefetching, is presented. This mechanism allows deep prefetching without jump-related misses by prefetching shallowly across several independent streams in a horizontal fashion. Multistream processors using this technique can achieve very high memory utilization. The mechanism supports very efficient multitasking and hardware process synchronization. The structure and performance of a prototype minicomputer using this mechanism are presented.


Introduction

Virtually all modern computer systems use some form of instruction prefetching to improve processor performance. Prefetching improves the performance of otherwise sequential von Neumann processors by allowing instruction fetching and execution to proceed concurrently.

Unfortunately, the presence of jump instructions in the instruction stream makes it difficult to correctly determine the next instruction to prefetch. An incorrectly prefetched instruction carries with it a double performance penalty: first, the processor is forced to wait while the correct instruction is fetched, and second, the memory accesses used to fetch the out-of-sequence instruction are wasted.

With deep prefetching, this jump problem can represent a serious degradation of system performance, since typically 15-25% of all executed instructions are taken jumps [1]. Indeed, in the VAX-11/780, about 20% of all memory accesses are for incorrectly prefetched instructions due to taken jumps [3].

Several approaches for reducing the effect of the jump problem appear in the literature; good reviews of these techniques are given in References [4] and [5]. The mechanisms vary in complexity from the simple delayed jump strategy characteristic of RISC designs [6]-[8] to sophisticated branch prediction techniques [1] and branch folding techniques [9]. None of these techniques is able to completely eliminate the jump problem.

This paper presents a novel prefetching strategy which we will call Horizontal Demand Prefetching (HDP) [10]. This strategy prefetches instructions across several independent streams (processes), allowing deep instruction prefetching without prefetch misses. The architecture and performance of a prototype multistream minicomputer demonstrating this mechanism are presented.

Principle

One implementation of instruction prefetching separates a processor into two autonomous sections: an Execution Unit (EU) that executes instructions, and a Bus Interface and Instruction Unit (BIU) that fetches instructions and performs all memory accesses [11].

The decoupling of instruction fetching (by the BIU) and execution (by the EU) allows instructions to be prefetched whenever the processor-memory interface is free. Figure 1 shows an implementation of the horizontal demand prefetching mechanism based on this EU/BIU arrangement.

In HDP, the state information in the EU and in the BIU is replicated N-fold, permitting their contexts to be individually selected. In the EU, the registers, flags and microprogram counter are replicated; in the BIU, the program counter (PC) and prefetch registers are replicated. This replication corresponds to increasing the size of the EU register file, replacing the PC with a small scratch pad, and so on. There is no other replication of hardware.

[Figure 1: An implementation of horizontal demand prefetching. Block diagram: an Execution Unit (registers, flags and microprogram counter) and a Bus Interface and Instruction Unit (PC and prefetch registers) share a memory, with a Stream Control Unit driving the context select lines of both units and exchanging status and control signals with them.]

In itself, this N-fold replication of state information may be used to eliminate context switching overhead. Multitasking for at most N processes requires only manipulation of the EU and BIU context selection signals. Context switching may occur between micro-operations, without any time overhead. Thus the machine becomes a multistream uniprocessor with up to N streams concurrently resident in the processor.

We now eliminate the jump problem entirely by imposing the following restrictions on the processor's operation: First, instruction prefetching is limited to (at most) one instruction for each stream. Second, we enforce an execute-to-fetch delay on each stream. If the EU begins to execute an instruction for stream X, then D cycles must elapse before stream X may be selected for instruction prefetching via the BIU. D is chosen to give sufficient time for the EU to determine whether X's instruction is a taken jump, and if so, what the destination is. Thus the BIU is prevented from fetching the wrong instruction.
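
These two restrictions amount to a simple per-stream eligibility test applied every cycle. The following is a minimal sketch of that test; the field names and the Python framing are ours, not the prototype's.

    from dataclasses import dataclass

    D = 1  # execute-to-fetch delay in cycles (the prototype's value; see below)

    @dataclass
    class Stream:
        prefetch_reg_full: bool = False  # restriction 1: at most one instruction ahead
        exec_started_at: int = -D        # cycle at which the EU last began this stream

    def may_prefetch(s: Stream, now: int) -> bool:
        # A stream may be selected for prefetching only if its prefetch
        # register is empty and D cycles have elapsed since the EU began
        # its current instruction, so any taken jump has been resolved.
        return (not s.prefetch_reg_full) and (now - s.exec_started_at >= D)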

With only one stream executing, this execute-to-fetch delay will actually decrease the processor's overall throughput, as several clock cycles may elapse from the time that a stream's next instruction can be prefetched to the time at which it is allowed to be prefetched. No instruction fetching occurs during this delay, and the processor-memory interface may go idle. With several running streams, however, this negative effect is eliminated, as will be demonstrated.

A Stream Control Unit (SCU) is required to co-ordinate EU and BIU stream context selection and to handle execute-to-fetch delay enforcement. The SCU selects the next stream for which the BIU will prefetch an instruction and selects the next stream for which the EU will execute an instruction. The SCU also ensures that only streams with empty prefetch registers will be prefetched and that only streams with full prefetch registers will be executed, and enforces enough execute-to-fetch delay to circumvent the jump problem. Implicitly, the SCU is also responsible for the correct distribution of processing time to each of the streams.

In order for the SCU to carry out this selection function, it is necessary that it contain some status information for each stream. The simplest SCU might have one bit per stream to indicate the status of each stream's prefetch register. Additional status bits could be used to handle the presence of halted streams and streams waiting for access to shared resources.

The operation of the HDP processor is straightforward. Streams compete for the BIU resource (for prefetching) and the EU resource (for execution). If only one stream is running, the instruction prefetch depth is one, although the enforced execute-to-fetch delay reduces the beneficial effect of this prefetching. If two streams are running, the prefetch depth is two, and the effect of the execute-to-fetch delay is reduced; during one stream's delay, another stream may be prefetched. As more tasks are added, the prefetch depth grows to a maximum depth of N instructions ahead for N streams. Since the prefetch depth is great, overall processor throughput is improved, but since prefetching is done horizontally across independent streams, no prefetch misses can occur.

This increase in overall throughput as more streams are added is one of the unusual properties of the HDP prefetching mechanism. In conventional processors executing multiple processes, performance generally decreases (or at best remains constant) as more processes are added. With the HDP mechanism, total throughput can actually increase until the available memory bandwidth is saturated. Furthermore, context switching is "free" for up to N streams, and contexts may be switched at every micro-operation boundary with no performance penalty. Naturally, if more than the N hardware-supported tasks are required, a secondary context switching mechanism must be provided. With careful design, this need not greatly degrade performance, although it might constrain the maximum rate at which contexts could be switched.

[Figure 2: HDP operation for one and two running streams. (a) One running stream: the BIU fetches instruction 1, idles through the execute-to-fetch delay while the EU executes instruction 1, fetches instruction 2, and idles again. (b) Two running streams: the BIU fetches instruction 1 for stream A, then instruction 1 for stream B while the EU executes stream A's instruction, and so on, so the delay slots of one stream are filled by fetches for the other.]

Clearly, the performance of an HDP processor executing N streams should be similar to that of a processor with error-free prefetching into an N-level prefetch queue. Let Tf denote the number of clock cycles needed to prefetch an instruction into a prefetch register; this is usually a constant. Let Te denote the amount of time required to execute an instruction; since this depends on the particular instruction and upon the particular instruction mix, we may consider it to be a random variable. Te must not include time spent accessing operands in memory, since the memory is not idle at these times and is not available for instruction prefetching. Clearly, if Tf < Te, the prefetching system is faster than the execution system, and the prefetch registers will tend to stay full. Conversely, if Tf > Te, the execution system is faster than the prefetching system, the prefetch registers will tend to empty, and the processor's memory utilization will be higher. In fact, if Tf > Te, then overall throughput will increase as streams are added until the available memory bandwidth is saturated.

Presuming that memory is more expensive than the central processor of a computer system, we should have Tf > Te for best memory utilization. Since Te depends on the actual instruction mix, this requires at least that Tf > mean(Te). Of course the actual memory utilization and throughput of an HDP processor for a given number of running streams depend on the instruction mix characteristics. If two instruction mixes have the same mean(Te), then the one with the wider Te distribution will require deeper prefetching to match the overall throughput and memory utilization of the other.
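
This load/utilization behaviour is easy to reproduce in a toy simulation. The sketch below is ours, not the prototype's logic: it assumes a 3-cycle fetch (as in the prototype described later), a hypothetical execution-time mix with mean(Te) = 2 < Tf, one prefetch register per stream, and a one-cycle execute-to-fetch delay, and it estimates memory utilization as the fraction of cycles the memory interface is busy.

    import random

    random.seed(1)
    T_FETCH = 3               # Tf: cycles per instruction fetch (3-cycle memory)
    EXEC_MIX = [1, 2, 2, 3]   # Te samples; assumed mix with mean(Te) = 2 < Tf
    D = 1                     # execute-to-fetch delay in cycles

    def memory_utilization(n_streams, cycles=50_000):
        pf_full = [False] * n_streams    # one prefetch register per stream
        fetch_ok_at = [0] * n_streams    # enforces the execute-to-fetch delay
        biu_stream = eu_stream = None    # stream occupying each unit, if any
        biu_done = eu_done = 0
        busy = 0
        for t in range(cycles):
            if biu_stream is not None and t >= biu_done:
                pf_full[biu_stream] = True       # fetch completes
                biu_stream = None
            if eu_stream is not None and t >= eu_done:
                eu_stream = None                 # execution completes
            if eu_stream is None:                # EU takes any full stream
                ready = [s for s in range(n_streams) if pf_full[s]]
                if ready:
                    eu_stream = ready[0]
                    pf_full[eu_stream] = False
                    fetch_ok_at[eu_stream] = t + D
                    eu_done = t + random.choice(EXEC_MIX)
            if biu_stream is None:               # BIU takes any eligible stream
                cand = [s for s in range(n_streams)
                        if not pf_full[s] and t >= fetch_ok_at[s]]
                if cand:
                    biu_stream = cand[0]
                    biu_done = t + T_FETCH
            busy += biu_stream is not None       # memory interface in use
        return busy / cycles

    for n in (1, 2, 3, 4, 8):
        print(n, round(memory_utilization(n), 3))
    # Utilization rises sharply from one stream to two, then saturates
    # near 1.0, which is the qualitative shape measured on the prototype.

Note that a stream whose instruction is still executing may be prefetched once its delay has elapsed; this is what lets even a single stream overlap fetching and execution.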

Figure 2 illustrates the operation of HDP for one and two running streams. Note in Figure 2a how the presence of the execute-to-fetch delay may degrade processor throughput for one stream. However, for two streams (Figure 2b), the effect of the execute-to-fetch delay is reduced, since the delay of one stream may be filled with a fetch for another.

[Figure 3: Prototype processor data paths. Block diagram showing the BIU-controlled paths (memory address register, program counter and auxiliary scratch pads, prefetch registers) and the EU-controlled paths (register scratch pad, ALU, flags, memory data register), joined by the internal bus leading to the instruction decoder.]

Related multistream techniques

The multistream architecture most closely related to horizontal demand prefetching is the Cyclic Pipeline Computer (CPC). Among the earliest and best known examples of CPCs are the peripheral processing unit of the CDC 6600 [12] and the peripheral processor of the TI-ASC [13]. A recent paper by Shimizu, Goto and Ichikawa discusses applications of the CPC idea to pipelined memory machines [14].

[Figure 4: Simplified stream control unit structure. Sixteen Status ELements (SEL 0 through SEL 15) hold the status bits for each stream and are updated from the EU and BIU; their requests are arbitrated by a Fetch Priority Resolver (fetch request to the BIU) and an Execute Priority Resolver (execute request to the EU).]

In cyclic pipeline computers, an N-stage circular pipeline contains N offset processes. In each clock cycle, the state information for each process (flags, program counter, registers) moves one stage around the circular pipeline. The pipeline is designed so that memory accesses for each process occur at only one section of the ring and will always occur each time a process rotates through this section. This ensures complete memory interface utilization, and since instructions are not fetched until they are actually needed, no jump problem can occur.

The CPC idea is conceptually simple and can achieve very high memory utilization. Implementation is difficult, however, as all the state information of each stream must travel around the pipeline, with the result that a very wide internal data path is needed. This makes it impractical to implement more than a very few registers in the pipeline. The lack of registers degrades processor performance by increasing the rate of data memory access and complicating compiler code optimization.

In addition to this problem of hardware complexity, one may observe that the performance of cyclic pipeline computers is very load-dependent; all processes must be executing in order to achieve complete memory and CPU utilization. If only two of four processes are in use, then the memory utilization and overall processing speed drop to 50% of the peak value. This makes it difficult to efficiently implement process synchronization mechanisms; a blocked process causes 1/Nth of the processor hardware and memory bandwidth to be wasted.

More recent examples of cyclic pipeline computers are the HEP [15] and the Horizon [16],[17] shared memory multiprocessors. In these machines, the principal reason for using a CPC architecture is to hide network access latencies of the machine's shared memory. HEP and Horizon depart from the classical CPC organization in that the registers for each stream are stored in fast register files; they are not passed around the pipeline. This considerably reduces the pipeline complexity of these machines. Pipeline lengths are relatively short (eight stages on the HEP), but a large number of stream contexts may be stored in the register files. Thus their overall throughput varies linearly with the number of ready streams until there are enough streams available to fully occupy the pipeline.

The performance of an HDP machine is also dependent on the number of executing processes, but the relationship between load and performance is not a simple one. In CPCs, halting a process effectively disables 1/Nth of the processor; in HDP, halting a process removes one level of prefetching. This loss of prefetching depth only slightly affects the processor hardware and memory utilization.

HDP's use of hardware-selectable contexts is similar to that of the Xerox Alto [18] and the Xerox Dorado [19] processors. In these machines, selectable register files are combined with an interrupt-like mechanism to allow very fast response to external events. In the input/output system of the Lincoln TX-2 processor [20], a similar mechanism is implemented at the instruction level. While these unusual interrupt mechanisms reduce the complexity of external high-speed I/O devices, they fail to address the jump problem. Furthermore, in the TX-2, Alto and Dorado processors, only one context is dedicated to program execution or emulation; task context switching must still be done at the machine language level. These processors implement a form of fast hardware interrupt system, while HDP implements an efficient hardware multitasking system that also circumvents the jump problem.

Prototype design

We have constructed a 16-bit prototype minicomputer that implements horizontal demand prefetching for up to 16 streams. The system clock speed is 5 MHz, and as memory takes three cycles to access, the available memory bandwidth is 1.7M 16-bit words per second. The value of D, the execute-to-fetch delay, is one clock cycle. The prototype implements simple process blocking within its stream control unit; operations like semaphore-set-with-wait are implemented in microcode as single machine instructions.

Figure 3 shows the major data paths of the processor's bus interface and instruction unit and execution unit. Note that all process state information is replicated 16-fold. Thus, the program counter is actually a 16 × 16-bit scratch pad in the BIU data paths. The main Register Scratch Pad (RSP) is a 16 × 32 × 16-bit memory, allowing 32 16-bit registers for each of the 16 possible hardware processes. There are 16 16-bit prefetch registers, allowing a maximum instruction look-ahead of one word for each of the 16 streams.
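
Concretely, this replication is just wider addressing into ordinary scratch-pad memories, with the selected stream number supplying the high-order address bits. A sketch of the addressing (sizes from the prototype; the exact bit layout is our assumption):

    N_STREAMS, N_REGS = 16, 32
    rsp = [0] * (N_STREAMS * N_REGS)   # 16 x 32 x 16-bit register scratch pad
    pc = [0] * N_STREAMS               # the PC is a 16 x 16-bit scratch pad
    prefetch = [0] * N_STREAMS         # one 16-bit prefetch register per stream

    def rsp_index(stream: int, reg: int) -> int:
        # Selecting a context amounts to presenting the stream number as the
        # upper address bits of the scratch pad.
        return stream * N_REGS + reg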


[Figure 5: High-speed rotatable priority encoder mechanism. The SEL status bits circulate around a ring under a Rotate Command; each SEL's request feeds a fixed priority encoder, a counter tracks the rotation count, and an adder combines the encoder output with the count to produce the selected stream address and strobe.]

Figure 4 presents a simplified diagram of the Stream Control Unit (SCU). Status information for the streams is contained in a set of 16 Status ELement (SEL) finite state machines. Fetching and execution requests from the SELs are arbitrated by a priority resolution mechanism. One of the more difficult aspects of the prototype design was the development of a priority rotation mechanism that operates fast enough (50 ns) to allow request resolution well within one clock cycle. The direct way of achieving rotatable priorities would have been to combine a barrel shifter with a fixed priority encoder, but this solution was rejected because of its high complexity.

A novel solution to the encoder problem is to use fixed priority encoders and rotate the location of the status information instead. This is done in the prototype by arranging the SELs in a ring as in Figure 5. At each clock cycle, the status elements may either operate normally or shift their status bits one position around the ring. A 4-bit counter tracks the number of rotations, and two 4-bit adders are used to adjust the encoder outputs accordingly. This mechanism achieves the effect of a high-speed rotatable priority encoder with simple hardware.
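
The net effect can be modelled in a few lines. In the sketch below (ours; the rotation direction and the SEL-15-is-highest-priority convention are assumptions consistent with the text), the status bits have been shifted r positions around the ring, a fixed priority encoder picks the highest requesting slot, and an adder maps the slot back to a stream number.

    N = 16

    def resolve(request, r):
        """request[x]: stream x wants service; r: current rotation count."""
        # After r shifts, physical slot s holds the bits of stream (s + r) mod N.
        slot_req = [request[(s + r) % N] for s in range(N)]
        # Fixed priority encoder: highest-numbered requesting slot wins.
        for s in range(N - 1, -1, -1):
            if slot_req[s]:
                return (s + r) % N   # 4-bit adder recovers the stream number
        return None                  # no stream is requesting

Because only a 4-bit counter and 4-bit adders are added to a fixed encoder, the barrel shifter is avoided entirely.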

The effect of this SCU priority structure is that the highest priority SEL requesting service is given first preference. This selection of streams for execution and fetching is done "on demand," rather than in a fixed circular pattern as in cyclic pipeline computers. Since stream priorities may be rotated, any desired distribution of processing time may be given to the running processes, although the prototype rotation controller attempts to give equal time to all processes.

Note that simply rotating stream priorities periodically is not sufficient to ensure that equal time is given to all streams. If only two streams are running and their status bits lie in adjacent SELs, then one stream will have the higher priority for 15/16 of the time. This asymmetry in the time distribution is caused by the presence of halted SELs in the SCU ring. A mechanism is needed to compensate for this effect.

[Figure 6: PCALL instruction operation (PCALL newsp, newpc). When no new process is created, the original process itself continues with SP = newsp and PC = newpc after pushing the original PC and SP, and executes the subprocess code. When a new process is created, the original process continues unchanged while the created process runs the subprocess code with SP = newsp (a zero marker pushed) and PC = newpc.]

The prototype's rotate compensation algorithm dictates simply, "If the highest priority SEL (SEL 15) holds the status bits of a halted stream, then rotate one position." In the prototype, priority rotation may occur at every other clock cycle (i.e., every 400 ns). Since at most 14 shifts will be needed to skip over halted SELs, skipping takes at most 14 × 2 = 28 clock cycles (5.6 µs at 5 MHz). If the priority rotation time quantum is chosen to be much greater than this minimum time, the algorithm will yield a very even priority time distribution. For example, with a time quantum of 1 ms, at most 0.6% of the time quantum is needed for this compensation.
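
In the same model as above, the compensation rule is a short loop (again a sketch; in hardware one shift occurs per 400 ns, not per Python iteration):

    def compensate(halted, r):
        """Rotate while the highest-priority slot holds a halted stream."""
        for _ in range(N):                   # bounded; the text gives 14 shifts worst-case
            if not halted[(N - 1 + r) % N]:  # slot 15 holds stream (15 + r) mod N
                return r
            r = (r + 1) % N                  # shift the ring one position
        return r                             # all streams halted: nothing to run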

Instruction set and compiler

The 16-bit word length of the processor limits the instruction set to 60 double-register address instructions and 128 single-register address instructions. This small number of instruction encodings precludes a fully composable instruction set; instead, a load/store instruction set was implemented. Data may be moved to and from memory using a variety of addressing modes, but arithmetic operations may be carried out only on the register file.

The most interesting feature of the instruction set is the inclusion of task control operations at the machine language level. The CREATE instruction allows the creation of a new process by assigning a halted stream to a new task. A process may terminate itself with the HALT instruction. Alternatively, a process may use the KILL instruction to force another process to halt. It is interesting to note that the CREATE instruction is very fast, requiring only eight clock cycles (1.6 µs) to execute. The HALT and KILL instructions are similarly quick, requiring three clocks (0.6 µs) and seven clocks (1.4 µs), respectively.

Of particular interest are the PCALL (Parallel CALL) and the PRET (Parallel RETurn) instructions. These instructions support constructs like the PAR structure of concurrent Pascal, in which statements may be executed in any order, including concurrently. Since the overall throughput of an HDP machine is greatest when several streams are running, it is desirable that program flow adapt itself to process loading. PCALL and PRET provide a simple mechanism for achieving this, allowing a task to split itself into lightweight concurrent threads.

Figure 6 illustrates the operation of these instructions. If a new task cannot be created (i.e., if there are no halted streams), then PCALL is handled like a subroutine call. A new stack is created for the subprocess, and on it are pushed the old stack pointer (SP) and program counter (PC) values. Control is then transferred to the subprogram. Upon completion, the PRET instruction restores the old SP and PC values, effecting a return to the caller.

However, if there are halted streams available, PCALL creates a new process to execute the subprocess in parallel with the main task. The new subprocess is created with the specified new SP and PC values, and a marker of zero (which is an invalid return address) is pushed on its stack. When the subprocess executes a PRET instruction upon completion, the zero marker is recognized as a signal for the subprocess to halt.
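
The adaptive behaviour of PCALL/PRET can be summarized in pseudocode. The sketch below is ours: stacks are modelled as Python lists, the field names are invented, and the push order (old SP, then old PC) follows the description above.

    from dataclasses import dataclass, field

    @dataclass
    class Stream:
        pc: int = 0
        stack: list = field(default_factory=list)  # top of stack = end of list
        halted: bool = True

    def pcall(caller, streams, new_stack, new_pc):
        idle = next((s for s in streams if s.halted), None)
        if idle is None:
            # Serial fallback: behave like an ordinary subroutine call on the
            # new stack, pushing the old SP and PC, then jumping.
            new_stack.append(caller.stack)
            new_stack.append(caller.pc)
            caller.stack, caller.pc = new_stack, new_pc
        else:
            # Parallel case: assign a halted stream to the subprocess and push
            # a zero marker (an invalid return address) on its stack.
            idle.stack, idle.pc, idle.halted = new_stack, new_pc, False
            idle.stack.append(0)

    def pret(s):
        ret = s.stack.pop()
        if ret == 0:
            s.halted = True          # subprocess done: the stream halts
        else:
            s.pc = ret               # restore the caller's PC...
            s.stack = s.stack.pop()  # ...and the caller's SP (the old stack)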

In the prototype, the PCALL instruction takes 13 clock cycles (2.6 µs) to create a subprocess, or eight clocks (1.6 µs) for a normal call. PRET takes six clocks (1.2 µs) to halt a subprocess, or eight clocks (1.6 µs) for a normal return. Total time for a PCALL/PRET pair is thus at most 19 clocks (3.8 µs). This compares favourably with the normal CALL/RET instructions, which require a total of nine clocks (1.8 µs). Thus the overhead for this adaptive parallel call operation is quite low, provided that PCALL/PRET frequency is not excessive.

In addition to instructions for creating and halting tasks, semaphore operations are implemented in microcode to assist with process synchronization. The SSETW (Semaphore SET with Wait) instruction performs an atomic test-and-set on the specified target memory location. If the location contains zero, then execution proceeds with the next instruction. If the location is non-zero, the process executing the SSETW is suspended by setting its WAIT bit in the stream control unit. The process ceases to execute and may resume only after its WAIT bit is cleared, at which point it repeats the SSETW operation.

The other half of the semaphore mechanism is provided by the SCLR (Semaphore CLeaR) instruction. SCLR has two functions: it clears a specified memory location (a semaphore) and then causes the WAIT bits of all processes to be cleared. The effect is that whenever a process clears a semaphore, all WAITing processes awake and check their own semaphores. This simple mechanism prevents processes which are waiting for resources from wasting the processor's time by continually checking semaphores in a "busy-wait" fashion.
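
A sketch of the pair's semantics (ours; the value written by the test-and-set and the re-test mechanics are assumptions based on the description above):

    from dataclasses import dataclass

    mem = {}   # word-addressed memory; absent locations read as zero

    @dataclass
    class Proc:
        wait_bit: bool = False
        pending: int = -1      # semaphore address to re-test after wake-up

    def ssetw(p, addr):
        # Atomic test-and-set (done in microcode on the prototype).
        if mem.get(addr, 0) == 0:
            mem[addr] = 1      # assumed: claim the semaphore with a non-zero write
        else:
            p.wait_bit = True  # suspend: the stream stops competing for the EU
            p.pending = addr   # SSETW is repeated when the WAIT bit clears

    def sclr(addr, procs):
        mem[addr] = 0          # clear the semaphore, then wake every WAITer;
        for q in procs:        # each awakened process repeats its SSETW and
            q.wait_bit = False # re-suspends if its semaphore is still taken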

The prototype's SSETW instruction takes nine clock cycles (1.8 µs) plus eight clocks (1.6 µs) per semaphore re-test to execute, while the SCLR instruction takes three clocks (0.6 µs) to execute. Overhead for this simple semaphore mechanism is fairly low; even if 15 of the 16 streams are waiting for semaphores, and an SCLR occurs 1,000 times per second, only 2% of the prototype's time is wasted in semaphore re-testing.

A compiler for a Pascal-like high-level language called MPL was developed to allow the prototype to be tested with automatically compiled benchmarks. At present, the compiler performs only temporary-to-register binding, parameter passing in the register file, and simple peephole optimizations. Variables are bound to absolute memory locations; recursion is not supported.

Task control operations are supported in the library as simple procedure calls. It is interesting to note that many task control functions, like SSETW, translate to only two machine language instructions; i.e., the SSETW instruction and a return instruction.

Prototype testing and performance

The load/utilization performance of this processor was measured with two benchmarks: the Sieve of Eratosthenes and the Pascal version of the Dhrystone 2.0. The former is a simple looping and array access benchmark proposed by Gilbreath [21], while the latter is a synthetic systems programming benchmark originally proposed by Weicker [22].

[Figure 7: Memory utilization vs. number of running streams. Memory utilization (0.7 to 1.0) is plotted against the number of running streams (1 to 8) for the Sieve and the Dhrystone 2.0.]

These compiled benchmarks were linked with a special test driver that permitted concurrent copies of the benchmarks to be created, synchronized and measured. Machine performance measurements were carried out during the benchmark runs with the aid of marker bits in the machine's microcode word, allowing characteristics like instruction rate and jump frequency to be measured directly.

Among the most important of these performance characteristics is the machine's memory utilization. Since the HDP mechanism prevents prefetching misses, all memory transfers must contribute to the execution of the programs. Thus overall throughput and memory utilization must be linearly related. Furthermore, one would expect the ratio of total MIPS to memory utilization to be constant for a given benchmark.

Figure 7 shows a graph of memory utilization versus the number of concurrently running processes for the Sieve and the Dhrystone. The graph shows that as the number of running processes is increased, the memory utilization (and hence overall throughput) also increases. The performance change is most pronounced initially, as prefetching depth changes from one to two levels. This is because two-level prefetching is required to overcome the effect of the execute-to-fetch delay enforced by the stream control unit. Adding more levels of prefetching (processes) yields diminishing returns; after three streams, processor performance on the Sieve and the Dhrystone is essentially constant. On the prototype, the N = 4 performance of the Sieve is 2.14 loops per second, while the Dhrystone has an N = 4 performance of 952 Dhrystones per second.

For the Sieve, the ratio of MIPS to memory utilization was found to be 0.704 (S.D. = 0.008), and for the Dhrystone it was found to be 0.851 (S.D. = 0.0006). From this and the maximum memory bandwidth of 1.6667M words/second, one may infer that the Sieve requires an average of 2.4 memory word accesses per instruction, while the Dhrystone requires an average of 2.0 word accesses per instruction. This is consistent with the Sieve's characteristic of being a more memory-intensive benchmark than the Dhrystone.
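
As a quick check of that inference (our arithmetic, using only the figures quoted above): both the instruction rate and the word rate scale with memory utilization, so utilization cancels out of the accesses-per-instruction figure.

    BANDWIDTH = 1.6667  # maximum memory bandwidth, Mwords/s
    for name, ratio in (("Sieve", 0.704), ("Dhrystone", 0.851)):
        # MIPS = ratio * utilization and word rate = BANDWIDTH * utilization,
        # so words per instruction = BANDWIDTH / ratio.
        print(name, round(BANDWIDTH / ratio, 2))
    # -> Sieve 2.37 (about 2.4), Dhrystone 1.96 (about 2.0)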

In Figure 7, one will observe that the Dhrystone memory utilization increases rapidly to 0.962 at N = 3 streams, and then continues to increase at a much slower rate as processes are added. The reason for this behaviour is interesting. In the prototype, there is no special hardware support for multiplication, with the result that the microcoded multiplication routine takes a great deal of time (an average of 46 clock cycles) to execute. This long multiply delay allows all prefetch registers to fill, causing the memory to go idle. As more streams are added, the effect of the long multiply is masked by the deeper effective prefetching queue, yielding a slight performance increase with each extra stream. We expect that if the multiply time were greatly reduced, the Dhrystone would achieve complete memory utilization (and hence maximum overall throughput) at N = 3 streams.

[Figure 8: EU-BIU synchronization time vs. number of running streams. EU time lost for synchronization (0 to 10%) is plotted against the number of running streams (1 to 8) for the Sieve and the Dhrystone 2.0.]

[Figure 9: Effect of HALT compensation on execution skew. Execution skew, (Tmax - Tmin)/Tmax (0 to 0.5), is plotted against the number of adjacent running streams (1 to 8), with the compensation algorithm disabled and enabled.]

In the description of the HDP principle, it was observed that execution unit utilization is traded for higher memory utilization by setting Tf > mean(Te). In the prototype, EU utilization for the benchmarks was defined as the proportion of time during which the EU was performing useful work; i.e., not waiting for an instruction fetch or for other memory accesses. Naturally, the ratio of EU utilization to memory utilization must be constant for a given instruction mix. This ratio was measured to be 0.603 (S.D. = 0.003) for the Sieve and 0.662 (S.D. = 0.0007) for the Dhrystone. Hence at 100% memory utilization, the execution unit is 60% utilized by the Sieve, and would be 66% utilized by the Dhrystone.

When the EU is not performing useful work, it is either waiting for an instruction fetch or is being synchronized to the BIU for data memory access. Figure 8 shows the latter situation, the EU-BIU synchronization time, as a function of the number of running streams. For the case where only one stream is running, there is almost no time spent on EU-BIU synchronization because the BIU is often idle. As loading (and hence BIU utilization) increases, the EU-BIU synchronization time rapidly reaches a nominal value of about 10% for the Sieve and 8.6% for the Dhrystone. This implies that at higher loading, the EU performance could at best be increased by 30% for the Sieve and 25% for the Dhrystone. In other words, 25-30% of the maximum realizable EU utilization is traded for complete memory utilization in the prototype.

The operation of the stream control unit's rotation compensation algorithm is quite satisfactory. For the purposes of this discussion, we define the process execution skew to be (Tmax - Tmin)/Tmax, where Tmax and Tmin are the maximum and minimum times for test processes to complete execution, respectively. Figure 9 compares the Sieve benchmark's execution skew with the compensation algorithm enabled to its skew with the compensation algorithm disabled. The SCU was specially configured to run a maximum of eight streams, as there was insufficient memory to run the full 16. Care was taken to ensure that all processes were created in adjacent SELs.

With compensation enabled, the Sieve execution skew is small (<0.004). However, if compensation is disabled, the skew peaks at N = 5 streams at a value of 0.42, over 100 times greater than that found with compensation enabled. To place this in perspective, a skew of 0.004 would correspond to a 14-second time distribution inequality over one hour; a skew of 0.42 would correspond to a 25-minute inequality. The difference in measured skew indicates that the rotation compensation system does indeed work, and that its presence is essential to ensuring a fair time distribution. Furthermore, it appears that distributing fetch and execute priority does have the same effect as distributing execution time.

To place the HDP prototype's performance in perspective, it is necessary to examine the frequency of jump instructions (Fj) and of taken jump instructions (Fjt) in the benchmark instruction streams. For the Sieve, Fj was measured to be 22.3%, while Fjt was measured to be 12.1%. The Dhrystone showed similar characteristics of Fj = 23.4% and Fjt = 16.6%. We will consider average values of Fj = 23% and Fjt = 14%, which are comparable to jump frequencies found in other studies [1]. Consider a processor identical to the prototype, but with conventional prefetching instead of HDP. We have observed that the probability of a taken jump is about Fjt = 14% for the prototype instruction set. Assuming that the conventional processor manages, on average, to fetch one instruction word ahead, one might expect a performance degradation on the order of 10%, since each taken jump causes one memory access to be discarded. With deeper prefetching, this performance penalty would increase.

Summary and applications

Horizontal demand prefetching uses prefetching across streams to achieve deep prefetching without prefetch misses, allowing very high memory utilization at the expense of processor utilization. The architecture permits context switching with no overhead for a hardware-limited number of tasks, and supports the implementation of process control operations at the microcode level. The performance of the 16-bit prototype rapidly peaks as tasks are added, suggesting that only three or four streams may be needed to achieve significant performance improvement from HDP.

HDP is well-suited to real-time control applications; with an appropriately designed SCU, HDP can easily achieve event response times on the order of one instruction time. At a clock speed of 10 MHz with an average of six clocks per instruction, one could expect latencies on the order of 2 µs. Furthermore, the overall system cost/performance ratio for an HDP machine should be lower than that of conventional processors, as high memory utilization allows slower memory devices to be used for constant processor performance. In low-end systems, the need for separate direct memory access (DMA) devices may be eliminated by having one or two streams handle DMA directly. Since the processor memory utilization approaches 100%, it may be argued that the performance of a cacheless system with stream DMA will be the same as that of a cacheless system with external DMA.


Acknowledgement

The authors wish to acknowledge the financial contribution of the Natural Sciences and Engineering Research Council of Canada.

References

1. Lee, J., and Smith, A., "Branch prediction strategies and branch target buffer design," IEEE Computer, Vol. 17, Jan. 1984, pp. 6-22.

2. Toy, W., Zee, B., Computer Hardware/Software Architecture, Ch. 4, New Jersey: Prentice-Hall, 1986.

3. Hennessy, J.L., "VLSI processor architecture," IEEE Trans. Computers, Vol. C-33, Dec. 1984, pp. 1221-1246.

4. Lilja, D.J., "Reducing the branch penalty in pipelined processors," IEEE Computer, Vol. 21, July 1988, pp. 47-55.

5. McFarling, S., and Hennessy, J., "Reducing the cost of branches," Proc. 13th International Symposium on Computer Architecture, Tokyo, Japan, 1986, pp. 396-403.

6. Chow, P., and Horowitz, M., "Architectural tradeoffs in the design of MIPS-X," Proc. 14th International Symposium on Computer Architecture, Pittsburgh, Pa., June 1987, pp. 300-308.

7. Patterson, D.A., and Sequin, C.H., "A VLSI RISC," IEEE Computer, Vol. 15, Sept. 1982, pp. 8-21.

8. Radin, G., "The 801 minicomputer," IBM J. Research and Development, Vol. 27, May 1983, pp. 237-246.

9. Ditzel, D.R., and McLellan, H.R., "Branch folding in the CRISP microprocessor: reducing branch delay to zero," Proc. 14th International Symposium on Computer Architecture, Pittsburgh, Pa., 1987, pp. 2-9.

10. McCrackin, D.C., "The microcode level timeslicing processor architecture," Ph.D. dissertation (B. Szabados, supervisor), McMaster U., Hamilton, 1988.

11. Flores, I., "Lookahead control in the IBM System 370 Model 165," IEEE Computer, Vol. 11, Nov. 1974, pp. 24-38.

12. Thornton, J.E., Design of a Computer: the Control Data 6600, Glenview, Ill.: Scott, Foresman and Co., 1970.

13. Watson, W.J., "The TI ASC: a highly modular and flexible super computer architecture," in Computer Structures: Principles and Examples, eds. D.P. Siewiorek, C.G. Bell and A. Newell, New York: McGraw-Hill, 1982, pp. 753-762.

14. Shimizu, K., Goto, E., and Ichikawa, S., "CPC (Cyclic Pipeline Computer) - an architecture suited for Josephson and pipelined-memory machines," IEEE Trans. Computers, Vol. C-38, June 1989, pp. 825-832.

15. Smith, B.J., "Architecture and applications of the HEP multiprocessor computer system," Real-Time Signal Processing IV, Proc. SPIE, 1981, pp. 241-248.

16. Kuehn, J.T., and Smith, B.J., "The Horizon supercomputing system: architecture and software," Proc. Supercomputing '88, Orlando, Fla., 1988, pp. 28-34.

17. Thistle, M.R., and Smith, B.J., "A processor architecture for Horizon," Proc. Supercomputing '88, Orlando, Fla., 1988, pp. 35-41.

18. Thacker, C.P., McCreight, E.M., Lampson, B.W., Sproull, R.F., and Boggs, D.R., "Alto: a personal computer," in Computer Structures: Principles and Examples, eds. D.P. Siewiorek, C.G. Bell and A. Newell, New York: McGraw-Hill, 1982, pp. 549-572.

19. Lampson, B.W., and Pier, K.A., "A processor for a high-performance personal computer," Proc. 7th International Symposium on Computer Architecture, La Baule, France, 1980, pp. 146-160.

20. Forgie, J.W., "The Lincoln TX-2 input-output system," Proc. Western Joint Computer Conference, Los Angeles, 1957, pp. 156-160.

21. Gilbreath, J., "A high-level language benchmark," Byte, Vol. 6, Sept. 1981, pp. 180-198.

22. Weicker, R.P., "Dhrystone: a synthetic systems programming benchmark," Comm. ACM, Vol. 27, No. 10, Oct. 1984, pp. 1013-1030.