

Exploring a Stack Architecture

Russell P. Blake
Hewlett-Packard

Introduction

Computer architects design a high-performance system out of the same components that the rest of us would use to build a common one. They do it through an understanding of how the system will actually be programmed. Such was the insight which led to the development of stack-based computers.

Today our intuition can be augmented by experimenting with existing stack systems. The HP3000 Series II computer system [1] was introduced in 1976. Since then HP has undertaken a large number of measurements relating to various aspects of system performance. Some of these shed considerable light on the behavior of its stack in a multiprogramming environment.

Process address space

The stack features of the HP3000 are imbedded in the process address space. (A process is the execution of a program.) Figure 1 shows the address structure of a process. The abbreviations stand for CPU registers dedicated to addressing.

Two separate domains are accessible to the process. The data stack is assigned to the process for its life. The code segment, on the other hand, may change several times a millisecond. Unlike the stack, the code segment may be shared with any other process (subject to security restrictions, naturally). A segment may be any size up to 64K bytes for data segments, or 32K bytes for code.

The registers bounding the code segment are called program base (PB) and program limit (PL) registers. Firmware ensures that these registers always point to the currently executing code segment. The P register points to the next instruction, and is bounds-checked by the firmware against PB and PL. Branches within the segment use P-relative displacements. There are no instructions which can store into memory relative to the PB, P, or PL registers. Thus the unmodifiable code segments are a convenient place to keep constants.

There are five registers in the CPU which are applied to the stack for addressing when its process is actually executing. The process dispatcher establishes this environment prior to giving a process use of the CPU. This is called process launch. The most common path through the dispatcher is firmware; the process selection algorithm is implemented in software [2].

When a process is not executing, its current register values are kept at the base of the stack (in the process control block extension, accessible only to the operating system). Architecturally, process dispatching is handled as the lowest priority external interrupt. Interrupts execute outside the process address space on a special 512-word interrupt control stack (ICS). To launch a process, the dispatcher extracts the register values from the PCBX area and places them in fixed memory locations.


Figure 1. Process address space.


COMPUTER

The interrupt exit (IXIT) instruction sets the register environment for the stack from these pre-arranged cells, and transfers from the ICS to the process in a fashion similar to a procedure return.

Stack address structure. The data limit (DL) and the Z registers ensure that data references by the process are confined to the assigned stack area while in user mode. Addresses in the stack are always formed from displacements relative to the DB, Q, or S registers. The DB register provides access to global variables. The Q register is used to access parameters and local variables of the current procedure. The S register is the logical pointer to the top of stack for this process. As data are loaded onto the top of stack (called TOS), S is incremented towards Z. When data are stored off the stack, into a DB-, Q-, or S-relative area, the S register is automatically decremented appropriately. Should S encounter Z, a trap to the operating system invokes a routine which expands the stack and then resumes execution.

From the viewpoint of the executing procedure, details of addressing within the stack data segment are displayed in Figure 2. Global scalars (integers, reals, pointers, etc.) are located at fixed positions from the DB register. Although an advanced compiler might enforce some restrictions [3], the global scalars are nominally accessible to the outer block and to all the procedures in the program. The invariant instructions in the code segment refer to the global scalars as a fixed distance from the DB register for each scalar. This distance is determined at compile time.

Indirect cells in the global scalar area can contain pointers into the global data structures. These are typically arrays, which are accessed using one level of indirection and post-indexing with the two's complement index (X) register. All indirect pointers in the stack are 16-bit word or byte displacements from DB. Since these are also in two's complement form, negative pointers can be used to access the area from DL to DB, with the optional indexing.

MEMORY REFERENCE INSTRUCTION ADDRESS BIT ENCODING (instruction bits 7-15)

  Address Mode   Mode bits (starting at bit 7)   Word displacement from register
  DB+            0                               0 to 255
  Q+             1 0                             0 to 127
  Q-             1 1 0                           0 to 63
  S-             1 1 1                           0 to 63
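The encoding in the table above can be illustrated with a short decoder. This is only a sketch in Python, not HP firmware: it assumes the mode and displacement occupy a 9-bit field (instruction bits 7 through 15, with bit 7 the most significant), as the table suggests, and the function name `decode_stack_address` is invented for the illustration.

```python
def decode_stack_address(field):
    """Decode a 9-bit mode-plus-displacement field (assumed bits 7-15).

    Returns (address_mode, word_displacement). The prefix bits select the
    base register; the remaining bits are an unsigned word displacement.
    """
    if not 0 <= field < 512:
        raise ValueError("field must fit in 9 bits")
    if field & 0b100000000 == 0:      # prefix 0: DB+, 8-bit displacement
        return ("DB+", field & 0xFF)
    if field & 0b010000000 == 0:      # prefix 10: Q+, 7-bit displacement
        return ("Q+", field & 0x7F)
    if field & 0b001000000 == 0:      # prefix 110: Q-, 6-bit displacement
        return ("Q-", field & 0x3F)
    return ("S-", field & 0x3F)       # prefix 111: S-, 6-bit displacement


if __name__ == "__main__":
    for sample in (0b0_11111111, 0b10_0000101, 0b110_000100, 0b111_000001):
        print(format(sample, "09b"), decode_stack_address(sample))
```

The variable-length prefix gives the largest displacement range to DB-relative globals and the smallest to the Q- and S- areas, which matches the ranges listed in the table.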

This basic structure can be extended easily. A language with multiple lexical levels could employ the first few DB+ locations as the static display [4]. These can be maintained by additions to the procedure call and procedure exit instructions. Lexical level p would be accessed by post-indexing an indirection through the pointer at DB+p. A language which uses a run-time heap for allocating local data structures can easily manipulate the DL to DB area as a stack which grows in the opposite direction from S [5]. In most languages, the distance from DL to DB is controlled by the program calling an operating system procedure. In an experimental version of our systems programming language, called SPL II, the DL area is managed by the compiler to provide for dynamic arrays. (A dynamic array is one which expands after its initial declaration. As an example, a compiler's symbol table might be implemented as a dynamic array.) In APL 3000 the DL area is used to control a process-local, virtual array capability. In effect, the APL "workspace" is paged from the disk to the DL area. Special firmware supplied for this purpose enables the APL program to have as large a workspace as can fit on the disks.

By far the most common variation from the simple structure presented in Figure 2 is split-stack mode, illustrated in Figure 3. In split-stack mode, the DB register has been placed at another data segment, which may contain file, data base, or data communications buffers, along with other information of use to the multiprogramming executive (MPE). This facility is used heavily by MPE to help implement a homemade version of operating system monitors [6]. In Figure 3, DB-relative addressing will access the data in the extra data segment. Bounds checking is limited in split-stack mode, since DB is no longer between DL and S. Therefore, split-stack addressing is permitted only while in privileged mode. Better error confinement would result from bounds checking all memory references, not only by user and privileged instructions but also by I/O activity.


Figure 2. Procedure's view of the stack.

Figure 3. Split stack mode.

May 1977


Procedure Call

The necessity for dividing large programs into modules [7] in order to produce more manageable and reliable software is now well accepted, and the need for computers which can support highly modular programming is no longer in question. Our measurements have shown that in a multiprogramming mix, a procedure call is executed every 90 instructions on average. So the efficiency of the procedure call mechanism in a modular system is extremely relevant to its overall performance.

The stack movements for procedure entry and exit on the HP3000 are sketched in Figure 4. First, space is reserved for the result of the procedure call, if the procedure is a function. This is shown by a zero on the stack in Figure 4b. Parameters, or DB-relative pointers to parameters, are loaded to the top of the stack. This causes the S register to advance towards Z. When the procedure call instruction (PCAL) is executed, a "stack marker" of eight bytes is written at the top of the stack. This remembers the environment of the caller: the code segment and P register, the X register, status bits, and the previous stack marker. The Q register is advanced by the PCAL firmware to point to the new stack marker. Execution ensues in the called procedure, where an add to S (ADDS) instruction can be used to allocate space for local variables.

When the procedure has completed its computation, it executes the EXIT instruction, which contains the number of words to cut off the stack when returning to the calling procedure. Thus the parameters, placed on the stack before PCAL, are eliminated from the stack by EXIT. If the called procedure is a function, it returns a result, as shown in Figure 4e.
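The stack movements described above can be traced with a toy model. The following Python sketch is an illustration under stated assumptions, not the HP3000 firmware: the marker is modeled as four words (eight bytes), the order of the words inside the marker is assumed, the previous Q is stored as an absolute index rather than however the hardware links markers, and the names `StackMachine` and `call_add` are invented here.

```python
class StackMachine:
    """Toy model of PCAL / ADDS / EXIT stack movement (assumed marker layout)."""

    def __init__(self):
        self.stack = []   # data stack; the last element is the top of stack (S)
        self.q = -1       # index of the current stack marker (-1: no marker yet)
        self.x = 0        # index register
        self.p = 0        # program counter
        self.status = 0   # status bits

    def push(self, word):
        self.stack.append(word)

    def pcal(self, target_p):
        # Write the four-word stack marker remembering the caller's environment,
        # then advance Q to point at the new marker.
        self.stack.extend([self.x, self.p, self.status, self.q])
        self.q = len(self.stack) - 1
        self.p = target_p

    def adds(self, n_words):
        # Allocate space for local variables above the marker.
        self.stack.extend([0] * n_words)

    def exit(self, n_params):
        # Drop locals and temporaries, restore the caller's environment,
        # then cut the parameters off the stack.
        del self.stack[self.q + 1:]
        self.q = self.stack.pop()
        self.status = self.stack.pop()
        self.p = self.stack.pop()
        self.x = self.stack.pop()
        if n_params:
            del self.stack[-n_params:]


def call_add(m_value, n_value):
    """Call a two-parameter function that returns m + n, as in Figure 4."""
    machine = StackMachine()
    machine.push(0)               # reserve space for the function result
    machine.push(m_value)         # first parameter, later found at Q-5
    machine.push(n_value)         # second parameter, later found at Q-4
    machine.pcal(target_p=100)    # hypothetical entry address
    machine.adds(2)               # two local variables
    q = machine.q
    machine.stack[q - 6] = machine.stack[q - 5] + machine.stack[q - 4]
    machine.exit(n_params=2)
    return machine


if __name__ == "__main__":
    print(call_add(2, 3).stack)
```

After `exit`, only the function result remains on the stack and Q again refers to the caller's marker, which is the state the caller expects on return.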

Figure 4. Stack movement on procedure entry/exit.

Figure 5. Procedure call instruction logic.

Details of the PCAL instruction's operation [8] are flowcharted in Figure 5. The parameter contained in the PCAL instruction is a displacement negative from PL, into the segment transfer table (STT; see Figure 1). The STT entry indicates whether the procedure is in the current code segment or a different one. If the procedure is "local" to the current segment, the new P is extracted from the STT entry, and the called procedure begins execution. Otherwise the transfer to another segment is effected. The traps indicated by the circles in Figure 5 will cause a firmware transfer to the appropriate MPE internal interrupt routines in code segment 1 [9]. These perform a variety of services, depending on the nature of the trap. An absence trap, for example, will cause the virtual memory software to make the segment present in main memory [10]. Traversing the left-hand side of Figure 5 requires 5.95 microseconds, whereas traversing the right-hand side requires an additional 8.75 microseconds. The EXIT instruction logically reverses the PCAL; a local EXIT takes 7.175 microseconds, whereas an EXIT to another segment takes 5.425 microseconds more.
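The timings quoted above combine into a cost for a complete call and return. The short Python sketch below only adds up the figures from the text; it assumes that the left-hand path of Figure 5 is the local (same-segment) call, and the function name `call_return_time_us` is invented for the illustration.

```python
# Instruction times in microseconds, taken from the text.
PCAL_LOCAL_US = 5.95
PCAL_EXTRA_SEGMENT_US = 8.75     # additional time for a call to another segment
EXIT_LOCAL_US = 7.175
EXIT_EXTRA_SEGMENT_US = 5.425    # additional time for a return to another segment


def call_return_time_us(cross_segment):
    """Time for one PCAL plus the matching EXIT, in microseconds."""
    pcal = PCAL_LOCAL_US + (PCAL_EXTRA_SEGMENT_US if cross_segment else 0.0)
    exit_time = EXIT_LOCAL_US + (EXIT_EXTRA_SEGMENT_US if cross_segment else 0.0)
    return pcal + exit_time


if __name__ == "__main__":
    print("local call and return:        ", call_return_time_us(False))
    print("cross-segment call and return:", call_return_time_us(True))
```

Under these assumptions a local call and return costs about 13.1 microseconds and a cross-segment pair about 27.3 microseconds, so crossing a segment boundary roughly doubles the linkage cost.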

Ackermann's function. Ackermann's function has been widely used to measure the calling performance of a language/system pair. It has two parameters and returns a value. It is highly recursive; i.e., it calls itself persistently to compute its return value. This characteristic is displayed by the algorithm for the function, which we express here in HP3000 Systems Programming Language (SPL):

    integer procedure ACKERMANN(M,N); value M,N; integer M,N;
      ACKERMANN := if M=0 then N+1
                   else if N=0 then ACKERMANN(M-1,1)
                   else ACKERMANN(M-1,ACKERMANN(M,N-1));
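For readers without access to SPL, the same definition can be written in Python. This is a sketch for illustration only: the call counter and the use of `time.process_time` to estimate CPU time per call are additions made here, and the helper names `make_counted_ackermann` and `time_per_call` are invented. Timings obtained this way say nothing about the HP3000.

```python
import time


def make_counted_ackermann():
    """Return (ackermann, call_count) where call_count() reports calls made."""
    calls = 0

    def ackermann(m, n):
        nonlocal calls
        calls += 1
        if m == 0:
            return n + 1
        if n == 0:
            return ackermann(m - 1, 1)
        return ackermann(m - 1, ackermann(m, n - 1))

    def call_count():
        return calls

    return ackermann, call_count


def time_per_call(m, n):
    """CPU seconds per call for one evaluation of ackermann(m, n)."""
    ackermann, call_count = make_counted_ackermann()
    start = time.process_time()
    ackermann(m, n)
    elapsed = time.process_time() - start
    return elapsed / call_count()


if __name__ == "__main__":
    ackermann, call_count = make_counted_ackermann()
    print("ackermann(3, 3) =", ackermann(3, 3))
    print("calls made      =", call_count())
    print("seconds per call:", time_per_call(3, 3))
```

Dividing elapsed CPU time by the number of calls gives a per-call figure of the same kind as the measurements discussed in this section. Small arguments keep the recursion well within Python's default recursion limit.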

Because of its strongly recursive nature, Ackermann's function has long been used in the theory of algorithms and automata. It is of importance in computer design because it can serve as a standard method for evaluating an implementation of procedure calling. Sundblad [11] used Ackermann's function to explore various implementations of Algol, PL/I, and Simula on the 360/75 and the CDC 6600. Wichmann [12,13] has now reported on the performance of Ackermann's function on over 50 language/system pairs. These studies provide a healthy basis for comparison.

The stack architecture of the HP3000 and the separation of code from data provide recursion in its languages. The HP3000 architecture thus permitted us to code and then run the function in several languages. The time per call to Ackermann's function for each compiled language we tested is shown in Table 1. In every case we attempted to duplicate Wichmann's program for executing the algorithm [12]. CPU time was measured by calling an MPE procedure with access to a clock which measures CPU time in the process. These tests were first executed in a timesharing environment. A subsequent run on a stand-alone system yielded identical results.

Table 1. Time to call Ackermann's function in several languages.

  LANGUAGE               TIME PER CALL
  ASSEMBLY-LEVEL SPL     24 µSEC
  SYSTEM-LEVEL SPL       24 µSEC
  HIGH-LEVEL SPL         30 µSEC
  SPL II                 31 µSEC
  FORTRAN                46 µSEC
  BASIC (COMPILED)       71 µSEC

It is clear from Table 1 that choice of language has a dramatic effect on procedure-calling performance. The spread from compiled Basic to SPL II is due to differences in parameter-passing conventions. By definition, Fortran will pass both parameters m and n by reference. Instructions are included in the code generated by the Fortran compiler which push DB-relative pointers to the two parameters. References to parameters in the Fortran version must go indirectly through these pointers. The same is true for Basic: parameters to functions are passed by reference. In addition, the value returned by the function in Basic is returned by reference. This is necessary because with Basic, a function may return a string as easily as a number. So a DB-relative pointer to the cell(s) reserved for the return value will also be created on the top of stack before each function call in Basic.

We conclude that differences in performance from one type of language to the next lie in differences in the way parameters are passed to and from procedures. These differences originate in the external definitions of the languages. They have to do with whether the parameters and return values are passed by value or by reference.

A word of caution about these figures: one should not attempt to deduce the general performance of a programming language on the basis of procedure-calling efficiency alone. Overall language efficiency is more a product of the suitability of the language for the application being implemented.

Four SPL versions are listed in Table 1. The high-level version of SPL was illustrated with the definition of Ackermann's function. The experimental language SPL II is a more modern version of SPL. Its definition of Ackermann's is like the high-level SPL declaration. SPL permits the mixing of lower-level constructs within the high-level syntax. The system-level SPL contains special syntax for explicit reference to the TOS, the X register, and status register conditions. At the very lowest level, direct machine code can be written. As an example, a phrase of the assembly-level SPL code, which evaluates the path through ACKERMANN(M-1, ACKERMANN(M,N-1)), appears as shown in Table 2.

Table 2. Assembly-level SPL version of Ackermann's function.

  Code                  Stack changes
  LOAD   Q-5            << load parameter m to the top of the stack >>
  BE     MO             << go: code to do m = 0 case (not shown) >>
  LDX    Q-4            << index register gets n parameter >>
  BE     NO             << go: code to do n = 0 case (not shown) >>
  ZERO,  XCH            << push 0 onto TOS; exchange m with the 0 >>
  DUP,   DECB           << duplicate TOS, decrement original word >>
  ZERO,  XCH            << push 0 onto TOS, exchange >>
  LDXA,  DECA           << load index reg to TOS, subtract 1 >>
  PCAL   ACKERMANN      << parameters are on the stack >>
  PCAL   ACKERMANN      << last param is result of previous PCAL >>
  STOR   Q-6            << store function result >>
  EXIT   2              << delete 2 parameters and return >>

The differences in the various SPL versions are all related to the use of instructions called "paired stackops." The combination "ZERO,XCH" is a paired stack operator. It is in fact two instructions packed into a single 16-bit word. The assembly-level version contains four paired stackops. The high-level SPL version contains one paired stackop. The SPL II version did not generate any paired stack operators; this caused it to execute one more instruction than its predecessor. Because SPL II is still in development, this difference may soon disappear.

Finally, on the basis of a comparison between our results and previous measurements of other systems [11,12,13], the stack architecture of the HP3000 yields decisive performance benefits for modular programming in high-level languages.


As expected, the interpreted languages on our system were several orders of magnitude slower than the compiled languages. This is a reasonable price for the interactive facilities of interpretation. Basic required 3.5 milliseconds per call, whereas APL was an order of magnitude slower than Basic. This is simply because APL/3000 permits a program to back up to a previous point and continue from there along a different path [14,15]. An important application of this feature is the "safe" APL development environment, wherein all changes to previously reliable software contain an acceptance test. If the test fails, the new routine logs the failure, backs up the environment to some appropriate previous point, and then executes the old, reliable routine [16]. The ability to return execution to a previous environment exacts a certain toll on the efficiency of function calling. But we believe that the price we pay for stability in software is always less than the cost of not having it.

Instruction frequency measurements.

The popularity of the various instructions is a matter of some interest to the CPU designer. It is a truism of computer architecture that the CPU design should make the frequently executed instructions as quick as possible. A stack architecture includes instructions not present in other systems, so our frequencies are especially relevant to the design of future stack systems.

These measurements include the multiprogramming executive's activity in the mix as well as the problem programs' activities. Since computers like the HP3000 are designed to run as a multiprogramming system, they can best be evaluated under a multiprogramming workload.

A laboratory examination of the instruction occurrences during a commercial benchmark taught us that this type of measure will vary somewhat from one load on the system to another. In light of the variations expected from one mix to the next, instruction frequencies are rounded to the closest percent. This particular benchmark used an HP2100 computer system to drive the HP3000 from prepared scripts.

The benchmark we measured had a decided data processing flavor, including 14 interactive sessions making inquiries to a data base which is used to control our manufacturing operations. The inquiry programs were written in Cobol and used the HP3000's Image data-base facility. Three other sessions engaged the Basic interpreter in interactive program development. Then there were five more sessions using the Editor to manipulate Cobol source statements. At the same time, three jobs were running in batch. One was a Cobol compilation, another an RPG compile and go, and the third an SPL compile and go, including a Sort.

One of our CPU designers modified the hardware and firmware of an HP3000 Series II to take the measurements. A measurement microcode routine was written and added to the regular system firmware. The hardware was altered to provide a trap to the measurement microcode after each instruction. The operating system was told to use only the first half of the 512K bytes of main memory, and the last half was used to collect the data: a 32-bit counter for each of the 65,536 possible contents of the 16-bit current instruction register. Each time an instruction occurred, its counter was incremented. Control would then be returned to the standard microcode for processing the next instruction.

Figure 6. Instruction frequency distributions.

The results of the test are grouped by instruction type in Figure 6. The memory reference instructions have two operands. One is a DB-, Q-, or S-relative displacement, possibly indirect, possibly indexed. The other operand is implicitly the top word of the stack. The immediate group operates directly on the top of the stack with a constant value. The branches are either unconditional, conditional on a previous operation, or on the contents of the top of the stack. The stackops operate on the top two registers, but sometimes the top four or only the top one. The privileged memory reference instructions are provided to ease operating system table access. The field and bit group are used to isolate and test bits and bit strings. Linkage and control consist primarily of the PCAL and EXIT instructions. Shifts perform a shift on the top of stack.
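The measurement scheme described above, one 32-bit counter for each possible 16-bit instruction word, can be sketched in a few lines of Python. This is an illustration, not the measurement microcode: the trace is an ordinary list of integers, and the `classify` function that maps an instruction word to a group is a hypothetical stand-in for the real grouping used in Figure 6.

```python
WORD_BITS = 16                      # width of the current instruction register
COUNTER_BYTES = 4                   # one 32-bit counter per instruction word
NUM_COUNTERS = 1 << WORD_BITS       # 65,536 possible instruction words
TABLE_BYTES = NUM_COUNTERS * COUNTER_BYTES


def count_instructions(trace):
    """Increment one wrapping 32-bit counter per executed instruction word."""
    counters = [0] * NUM_COUNTERS
    for word in trace:
        index = word & 0xFFFF
        counters[index] = (counters[index] + 1) & 0xFFFFFFFF
    return counters


def percent_by_group(counters, classify):
    """Sum counters by group and round each share to the closest percent."""
    total = sum(counters)
    if total == 0:
        return {}
    groups = {}
    for word, count in enumerate(counters):
        if count:
            group = classify(word)
            groups[group] = groups.get(group, 0) + count
    return {group: round(100 * count / total) for group, count in groups.items()}


if __name__ == "__main__":
    print("counter table size in bytes:", TABLE_BYTES)
    sample_trace = [0x1000, 0x1001, 0x2000, 0x1000]
    counters = count_instructions(sample_trace)
    # Hypothetical grouping: the top four bits of the instruction word.
    print(percent_by_group(counters, classify=lambda word: word >> 12))
```

The counter table occupies 65,536 x 4 = 262,144 bytes, which is exactly the 256K-byte upper half of the 512K-byte memory that the text says was set aside for data collection.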

Memory reference instructions. The memory reference instructions are dominated by the LOAD instruction. By far the most common single instruction (it accounts for 18 percent of all instructions executed), LOAD pushes a DB-, Q-, S-, or P-relative word onto the top of the stack, possibly indirect, possibly indexed. Its converse, the STOR instruction, accounts for 7 percent of the instructions executed. The LDX instruction, which similarly loads the index register from DB, Q, S, or P, occurs another 3 percent of the time. So only three instructions account for 28 of the 34 percent occurrence of memory reference instructions.

The memory reference group is of interest because it gives hints as to how programs view their address space. The distributions and types of references for the LOAD and STOR instructions are given in Table 3. Each column totals less than 100 percent because we did not include direct array addressing, which involves the index register without indirection. Direct arrays are used heavily by MPE in split stack mode. They account for 13 percent of the LOADs and 6 percent of the STORs.


Table 3. Distribution of memory references.

  Address      Nominal                       Percent     Percent
  Type         Use                           of LOADs    of STORs
  DB+          global scalar                  7           7
  DB+, I, X    global array                   3          10
  Q-           LOAD: value parameter         20          17
               STOR: return value
  Q-, I        reference parameter scalar     4           5
  Q-, I, X     array parameter                5           6
  Q+           local scalar                  27          44
  Q+, I, X     local array                    7           4
  S-           temporary                      2           1
  P±           constant                      12          not allowed

Branch instructions. On the HP3000, branches can be either direct or indirect. The indirect branch instruction points to a cell which contains a two's complement, 16-bit displacement from the indirect cell itself. The direct branch simply points to the jump destination; unconditional branches may be up to 255 words away. In conditional branches, the destination (or indirect cell) must be within 31 words of the branch instruction. The condition code in the status register determines the outcome of the conditional branches. It is set by many instructions, indicating a positive, negative, or zero result. Of all branches, 68 percent depended on the setting of the condition code, 19 percent were unconditional, and 13 percent depended on the value of the low-order bit on the top of the stack. A one in that position is taken as a true condition by SPL, a fact exploited by operating system table entries. Eighty-one percent of the conditional and 86 percent of the unconditional branches were direct P-relative branches. For only direct branches, the distribution of branch distances (expressed as a percentage of direct branches) was as shown in Table 4.

Table 4. Distribution of branch distances.

  Branch       % of Direct    % of Direct
  Distance     BR             BCC
  128-255       5             -
  64-127        3             -
  32-63         3             -
  16-31        42             20
  8-15         10             30
  4-7          12             26
  2-3          15             23
  P±1           9             -

Stackops. The stack operators are those whose operands are implicitly at the top of the stack. Their operation was demonstrated by Ackermann's function. One result of the measurement was that 5 percent of all instructions executed were paired stackops. Paired stackops reduce memory traffic to the CPU and improve the code compression otherwise inherent in the stack architecture.

Of the most common stackops, only one is an arithmetic operator, as shown in Table 5.

Table 5. Dominant stackops.

  DUP     3%   Duplicate top of stack
  STAX    3%   Store top of stack in index reg and delete
  ZERO    2%   Push a zero onto the top of stack
  CMP     1%   Compare top two words, set condition code
  XCH     1%   Exchange top two words
  DECA    1%   Subtract one from the top of stack

Again, percentages are expressed as a fraction of all instructions executed. Much of the use of DUP could probably be eliminated by including a nondestructive STOR instruction, which does not pop the stack, but merely copies it to the specified DB-, Q-, or S-relative location.

Immediates. One quarter of the immediate group were executions of LDXI (load X immediate). The value loaded is between 0 and 255 and is imbedded in the instruction in the code segment. With value once again expressed as a percentage of all executed instructions, the other dominant immediates were as shown in Table 6.

Table 6. Dominant immediates.

  CMPI    3%   Compare immediate value with TOS
  ADDI    2%   Add immediate value to the TOS
  LDI     2%   Load immediate value to the TOS
  SED     2%   Enable, disable external interrupts
  ANDI    1%   And immediate with the TOS

The effect of the operating system is evident in two places. One is the frequency of interrupt management; the other is the frequency of ANDI, which is generated by SPL to isolate the low-order bits in the top of stack. This is commonly used to unpack system table information.
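Unpacking a table entry by masking, as described above, can be sketched in Python. This is an illustration only: the packed word and its field positions are hypothetical, the helper names `low_order_bits` and `extract_field` are invented, and bit 0 is assumed to be the most significant bit of a 16-bit word.

```python
def low_order_bits(word, mask):
    """Isolate low-order bits of a 16-bit word with an AND mask."""
    return word & mask & 0xFFFF


def extract_field(word, start_bit, width):
    """Right-justify a bit field and clear the bits above it.

    Bits are numbered from 0 at the most significant end of a 16-bit
    word (an assumption made for this sketch).
    """
    if width < 1 or start_bit < 0 or start_bit + width > 16:
        raise ValueError("field must lie within a 16-bit word")
    shift = 16 - (start_bit + width)
    return (word >> shift) & ((1 << width) - 1)


if __name__ == "__main__":
    packed = 0b1010_0000_0011_0111      # hypothetical packed table entry
    print(low_order_bits(packed, 0x000F))
    print(extract_field(packed, start_bit=0, width=4))
    print(extract_field(packed, start_bit=8, width=4))
```

A single mask suffices when the wanted field already sits at the low-order end of the word; a field elsewhere in the word needs the shift as well.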

Other Instructions. Over half of the field and bit group are executions of EXF, extract field. This takes a bit field in the top of the stack, right-justifies it, and clears the high-order bits. It is also used heavily in unpacking operating system tables.

The privileged memory reference instructions include the absolute loads and stores, used only by the operating system. The LST (load system table) instruction accounts for 2.5 percent of all the instructions executed. It permits array-type access to main memory resident system tables, which are used primarily by the input/output, virtual memory, and process handling software. Special instructions included to speed the work of the operating system contribute to its multiprogramming performance. Their frequency is evidence of their contribution.

The linkage and control instructions were dominated by PCAL and EXIT at 1.1 percent each; there was a PCAL or an EXIT every 45 instructions. In this group we included the ADDS instruction, used primarily to reserve space for local variables in the Q+ area at the very start of a procedure. There was one ADDS for every other PCAL.

The HP3000 Series II includes extensions to the basic instruction set for packed decimal and floating point arithmetic. Packed decimal instructions accounted for only 0.003 percent of the instructions executed, a number smaller than might be expected. No extended floating point instructions were executed.

The 10 most frequent instructions are displayed in Table 7. These truly unique results provide an instruction profile [17] of a multiprogramming system in actual operation. They differ significantly from those reported after examining individual application programs [18]. The main difference lies in the emphasis on testing and branching which is evident in our commercial, multiprogramming benchmark, and the low profile of the arithmetic operators.

A weighted average instruction execution time for the 10 most frequent instructions is 1.5 microseconds; this includes fetching the instruction from memory. As can be seen in Table 7, the 10 most frequent instructions account for about 60 percent of all instructions executed. The system is fulfilling its design objective since the most frequently executed instructions are especially quick.

Table 7. Ten most frequent instructions in amultiprogramming benchmark.

18%10%7%4%3%3%3%3%3%3%

Load word onto the top of stackBranch on status conditionStore word off the top of stackLoad immediate value into index registerDuplicate the top of stackStore top of stack into index registerUnconditional branchCompare immediate value with top of stackLoad index register from memoryExtract bit field from the top of stack

Stack hardware

On the HP3000 there are four registers in the CPU which may contain up to four words at the top of the stack of the current process. The SR register in the CPU is a three-bit register. It contains the number of top-of-stack registers which currently hold valid data for the current process. This is shown by example in Figure 7, where the stack in memory (SM) register marks the last valid memory location, and the SR value of three indicates how many of the four registers have data belonging to the current computation.

Data items are pulled into the top-of-stack registers when required by an instruction, or pushed into memory to make room for the execution of another. These pushes and pulls are automatically managed by the firmware.

The logic implementing the top of the stack is diagramed in Figure 8. There are mappers, a namer, an adder, the four top-of-stack TR registers, and the SR register. They constitute the top-of-stack register renamer. The renamer logic enables the top of the stack to rotate among the TR registers. The output of the adder is the number of the TR register which currently holds the top word of the stack. The adder output is ignored if SR is zero, meaning the registers are empty.

The namer is a two-bit register which keeps track of which TR register is currently the top of the stack. It is decremented each time an element is pushed onto the stack and incremented each time an element is popped off the stack. It is combined with the SR register count in the adder, whose output enables the mapper to deliver the correct register contents to the ALU.

An example of the renamer in action can be seen in Figure 9. The computation proceeds normally until all four TR registers are filled up. Then the intermediate result deepest in the TR registers is pushed into memory and the SM register is incremented to mark the new

Figure 7. Stack registers extend the stack in memory. (Diagram labels: four top-of-stack CPU registers; SR register = number of CPU registers valid; SM = memory location of last valid stack word in main memory; top of stack = SM + (SR).)

Figure 8. Simplified block diagram of TOS hardware.

Table 7 mnemonics, in row order: LOAD, BCC, STOR, LDXI, DUP, STAX, BR, CMPI, LDX, EXF.

top of stack in memory. The top of the stack in the CPU is renamed to TR3 by subtracting one from the namer. Within a microprogram, the top two stack registers are always referenced as RA and RB. They might be any logically adjacent register pair. That RA maps into TR3 and RB maps into TR0 is a condition of only this particular computation. The renamer thus avoids having to move the operands to specific registers prior to execution of the microprogram.

In the last row of Figure 9, the operand is automatically pulled into the CPU in preparation for execution. The number of TOS registers which must be valid before the instruction begins is a characteristic of the instruction group and is recorded in the lookup table (LUT). This table is used to determine for each instruction the microroutine to be executed. From the LUT data the CPU registers are filled with pulls and emptied with pushes into memory, as required, prior to the execution of each instruction.
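The renaming behavior described above can be sketched in software. The following is a minimal model, not the HP3000 logic itself: it assumes the namer directly indexes the TR register holding the top word, with deeper words at successively higher indices modulo four, and it folds the adder and mappers into that modular arithmetic. The class and method names (`TopOfStack`, `require`, `ra`, `rb`) are hypothetical.

```python
class TopOfStack:
    """Toy model of four renamed top-of-stack registers backed by memory."""

    NUM_TR = 4  # TR0..TR3, as on the Series II

    def __init__(self):
        self.tr = [None] * self.NUM_TR  # register contents
        self.namer = 0                  # assumed: index of the TR holding the top word
        self.sr = 0                     # number of TR registers holding valid data
        self.memory = []                # stack in memory; SM is the last index
        self.pushes = 0                 # register-to-memory transfers
        self.pulls = 0                  # memory-to-register transfers

    def _spill_deepest(self):
        """Push the deepest valid register into memory (SM moves up by one)."""
        deepest = (self.namer + self.sr - 1) % self.NUM_TR
        self.memory.append(self.tr[deepest])
        self.tr[deepest] = None
        self.sr -= 1
        self.pushes += 1

    def _pull_one(self):
        """Pull the top memory word in just below the valid registers."""
        slot = (self.namer + self.sr) % self.NUM_TR
        self.tr[slot] = self.memory.pop()
        self.sr += 1
        self.pulls += 1

    def require(self, count):
        """Pull until `count` registers are valid, as the LUT data demands."""
        while self.sr < count and self.memory:
            self._pull_one()

    def push(self, value):
        if self.sr == self.NUM_TR:
            self._spill_deepest()
        self.namer = (self.namer - 1) % self.NUM_TR  # namer decremented on push
        self.tr[self.namer] = value
        self.sr += 1

    def pop(self):
        self.require(1)
        value = self.tr[self.namer]
        self.tr[self.namer] = None
        self.namer = (self.namer + 1) % self.NUM_TR  # namer incremented on pop
        self.sr -= 1
        return value

    def ra(self):
        return self.tr[self.namer]

    def rb(self):
        return self.tr[(self.namer + 1) % self.NUM_TR]


if __name__ == "__main__":
    tos = TopOfStack()
    for word in (10, 20, 30, 40, 50):
        tos.push(word)
    print("namer:", tos.namer, "SR:", tos.sr, "memory:", tos.memory)
    print("RA:", tos.ra(), "RB:", tos.rb())
```

Under these assumptions, four pushes leave the top word in TR0. A fifth push spills the deepest register to memory and moves the top to TR3, so that RA is TR3 and RB is TR0, matching the situation the text describes for Figure 9. No data moves between registers at any point; only the namer changes.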

Optimal TOS hardware. A point of considerable interest to the CPU architect is the number of registers to include at the top of the stack. In the design we have described, the number of registers at the top of the stack can be increased by enlarging the SR and namer registers. Is there a way to determine the optimal number of registers to include in future systems?

To answer this question, microcode was introduced into the system to emulate up to 16 top-of-stack registers. The emulator maintained pseudo SR registers in the last half of memory. When fewer than four TOS registers were being measured, the emulator pushed any extra registers into memory after each instruction. If more than four were being measured, the emulator kept track, using a pseudo SR, of how many words were on the top of the stack in addition to those counted by the real SR. By using multiple pseudo SR registers, we could measure several configurations during one run.

Under these conditions, it was possible to count the number of pushes and pulls generated by the multiprogramming benchmark described above.

A pull is initiated when an operand at the top of the stack in memory is required in the CPU. Pushes, on the other hand, result from top-of-stack registers filling up. Furthermore, on a PCAL, any occupied top-of-stack registers are pushed into memory.

A pull indicates that the top of the stack has too few registers to hold all the intermediate results. A push also indicates an inability to hold all the intermediate results in the top of the stack. Thus the optimal number of top-of-stack registers will minimize the number of pushes and pulls.

In Figure 10, the number of pushes and pulls is graphed as a function of the number of TOS registers.
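The counting experiment can be imitated on a small scale. The sketch below is not the measurement firmware. It assumes a hypothetical instruction trace in which each entry gives the number of registers that must be valid, the words consumed, the words produced, and whether the instruction flushes the registers as a PCAL does. The function `count_traffic` and the trace contents are illustrative inventions.

```python
def count_traffic(trace, num_regs):
    """Count (pushes, pulls) for a trace run with `num_regs` TOS registers.

    Each trace entry is (need, consumed, produced, flush).
    """
    if num_regs < 1:
        raise ValueError("at least one TOS register is required")
    valid = 0       # words currently held in registers (the SR count)
    in_memory = 0   # words of this computation held in the memory stack
    pushes = pulls = 0
    for need, consumed, produced, flush in trace:
        # Fill registers before execution, as far as the hardware allows.
        while valid < min(need, num_regs) and in_memory > 0:
            valid += 1
            in_memory -= 1
            pulls += 1
        for _ in range(consumed):
            if valid == 0:
                # Operand did not fit in the registers: fetch it from memory.
                in_memory -= 1
                pulls += 1
            else:
                valid -= 1
        for _ in range(produced):
            if valid == num_regs:
                # Registers full: spill the deepest word to memory.
                valid -= 1
                in_memory += 1
                pushes += 1
            valid += 1
        if flush:
            # A call empties every occupied register into memory.
            pushes += valid
            in_memory += valid
            valid = 0
    return pushes, pulls


LOAD = (0, 0, 1, False)
ADD = (2, 2, 1, False)
STOR = (1, 1, 0, False)
CALL = (0, 0, 0, True)

# Hypothetical traces: a deeply nested sum, and a sum interrupted by a call.
DEEP_SUM = [LOAD] * 6 + [ADD] * 5 + [STOR]
SUM_WITH_CALL = [LOAD, LOAD, CALL, ADD, STOR]

if __name__ == "__main__":
    for regs in (1, 2, 4, 8, 16):
        print(regs, count_traffic(DEEP_SUM, regs), count_traffic(SUM_WITH_CALL, regs))
```

In this toy model the traffic for the nested sum falls to zero once the registers can hold all six operands, while the trace containing a call keeps a fixed cost however many registers are provided.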

Figure 9. Top of stack register renamer in action. (Diagram annotations: SM ← SM + 1 when a register is pushed into memory; SM ← SM − 1 when a word is pulled back into the CPU.)

May 1977


Figure 10. Memory references caused by stack activity. (Plot title: pushes and pulls vs. TOS registers. Horizontal axis: number of TOS registers.)

The number of pulls per instruction is displayed with circles. The number of pushes per instruction is shown by X's.

There are 2.2 references to main memory per instruction executed on the Series II. For the four TOS registers on the actual machine, pushes and pulls accounted for a total of 0.085 memory references per instruction. This represents only 4 percent of CPU-memory traffic.

At first glance, Figure 10 indicates that the choice of four registers in the Series II is pleasingly near-optimal. We conclude that for the Series II instruction set, four is indeed a good number. However, the instruction set was clearly defined with four TOS registers in mind. In addition, frequently executed parts of the operating system were hand crafted to perform optimally with the four registers. We feel that these factors have influenced the results, and that a larger number of registers would be optimal for an instruction set designed to use them well. But it is encouraging to discover that so rich an instruction set as that of the HP3000 can be adequately supported by a small number of TOS registers.

As the number of TOS registers was increased by emulation, the number of pushes and pulls dropped to a constant. The pulls level off at about 0.02 per instruction. Further measurements show that one third of these are caused by SCAL, PCAL, and interrupts emptying words into memory which are needed on return. The remaining two thirds of the pulls do not appear to have a single cause.

The pushes level off at about 0.05 per instruction. These are due entirely to emptying the top of stack on a PCAL or ADDS instruction. Once the number of TOS registers exceeds eight, a PCAL will flush the registers before they fill up. If a greater number of TOS registers were available, one might choose not to flush the top of stack on PCAL.
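The arithmetic behind these figures is easy to reproduce. The sketch below uses only the numbers quoted in the text; the constant and function names are illustrative. Adding the two plateau values assumes both are reached at the same register count.

```python
TOTAL_REFS_PER_INSTRUCTION = 2.2  # all main-memory references per instruction
STACK_REFS_FOUR_REGS = 0.085      # pushes plus pulls with four TOS registers
PULL_FLOOR = 0.02                 # pulls per instruction with many registers
PUSH_FLOOR = 0.05                 # pushes per instruction with many registers


def share_of_traffic(stack_refs, total_refs=TOTAL_REFS_PER_INSTRUCTION):
    """Return stack-induced references as a percentage of all references."""
    return 100.0 * stack_refs / total_refs


if __name__ == "__main__":
    four = share_of_traffic(STACK_REFS_FOUR_REGS)
    plateau = share_of_traffic(PULL_FLOOR + PUSH_FLOOR)
    print(f"four registers: {four:.1f}% of CPU-memory traffic")
    print(f"plateau:        {plateau:.1f}% of CPU-memory traffic")
    print(f"pulls due to SCAL, PCAL, and interrupts: {PULL_FLOOR / 3:.4f} per instruction")
```

On these numbers, moving from four registers to the plateau would remove well under one percent of CPU-memory traffic, which agrees with the observation that four registers are near-optimal for this instruction set.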

Acknowledgments

Several engineers at HP's General Systems Division contributed directly to this report. Visionary experiments by John Sell provided the majority of the data. John modified the CPU and wrote the measurement firmware and data reduction programs. Thanks are also due to Tom Carney for his help in defining the multiprogramming mix and preparing the scripts. Doug Jeung, Dick Zimmerman, Cliff Jager, Ken Mintz, Bob Olsen, and Jean Danver all provided valuable assistance with Ackermann's Function. Readers are in debt to Ed Bassart and Terry Hamm for incisive reviews of the manuscript.

References

1. HP3000 General Information Manual, Hewlett-Packard Co., Manual Part No. 30000-90008.

2. R. P. Blake, "Tuning an Operating System for General Purpose Use," Computer Performance Evaluation, Online, Potomac, Maryland, 1976, p. 303.

3. P. B. Hansen, "The Programming Language Concurrent Pascal," IEEE Transactions on Software Engineering, Vol. SE-1, No. 2, June 1975, p. 199.

4. E. A. Hauck and B. A. Dent, "Burroughs' B6500/B7500 Stack Mechanism," AFIPS Conference Proceedings, 1968 SJCC, p. 245.

5. K. V. Nori, U. Ammann, et al., "The PASCAL 'P' Compiler: Implementation Notes," Institut für Informatik, Eidgenössische Technische Hochschule, Zurich, Report No. 10.

6. C. A. R. Hoare, "Monitors: An Operating System Structuring Concept," CACM, Vol. 17, No. 10, Oct. 1974, p. 549.

7. D. L. Parnas, "On the Criteria to be Used in Decomposing Systems into Modules," CACM, Vol. 15, No. 12, Dec. 1972, p. 1053.

8. HP3000 Series II Computer System Machine Instruction Set, Hewlett-Packard Co., Manual Part No. 30000-90022.

9. HP3000 Series II Systems Reference Manual, Hewlett-Packard Co., Manual Part No. 30000-90020.

10. L. E. Shar, "Series II General Purpose Computing Systems," Hewlett-Packard Journal, Vol. 27, No. 12, Aug. 1976.

11. Y. Sundblad, "The Ackermann Function. A Theoretical, Computational, and Formula Manipulative Study," BIT, Vol. 11, 1971, p. 107.

12. B. A. Wichmann, "Ackermann's Function: A Study in the Efficiency of Calling Procedures," BIT, Vol. 16, 1976, p. 103.

13. B. A. Wichmann, "An Experimental Data Base for Computer Performance Prediction," Computer Performance Evaluation, Online, Potomac, Maryland, 1976, p. 239.


14. D. G. Bobrow, "A Model and Stack Implementation of Multiple Environments," CACM, Vol. 16, No. 10, Oct. 1973, p. 591.

15. APL/3000 Reference Manual, Hewlett-Packard Co., Manual Part No. 32105-90002.

16. B. Randell, "System Structure for Software Fault Tolerance," IEEE Transactions on Software Engineering, Vol. SE-1, No. 2, June 1975, p. 220.

17. D. E. Knuth, "An Empirical Study of FORTRAN Programs," Software-Practice and Experience, Vol. 1, 1971, p. 105.

18. G. W. Alexander, "Static and Dynamic Characteristics of XPL Programs," Computer, Nov. 1975, p. 41.

Russell P. Blake is project manager of per[...] Hewlett-Packard's General Systems Division. Since joining HP in 1973, he has been involved in the design and implementation of multiprogramming computer systems. His contributions include an integrated timeshare/batch spooling facility, a segmented virtual memory manager, and an adjustable process dispatcher. He has planned the division's efforts in operating system reliability, and advanced the theory of system performance tuning. Prior to working for HP, he spent three years developing data base/data communications systems. He was graduated from Antioch College with an AB in philosophy and received his MS in computer science from the University of Wisconsin in Madison. Blake has lectured in the U.S. and Europe on the topics of virtual memory and scheduling systems, and system performance modeling and improvement. He is a member of ACM.
