
Korea University, G. Lee - 2009

CRE652 Processor Architecture

Course Objective: To gain
(1) knowledge of the current issues in processor architectures, and
(2) skills for performing architecture research

Ref. Text
• J. Smith and G. Sohi, “The Microarchitecture of Superscalar Processors”, Proceedings of the IEEE, 1995.

• Papers from ISCA, MICRO, and ICCD

• Computer Architecture: A Quantitative Approach, Hennessy and Patterson, Morgan Kaufmann.


Superscalar Processor Model

[Figure: block diagram of the superscalar processor model. Pipeline stages: IF, DP, IS, WB, CT. Blocks: I-Cache, I Buffer, BTB, instr. window, dispatch (scheduler), rename, reservation station, function units, D-Cache, reg. file, ROB.]

• VLIW – EPIC
• SMT

Note (ghlee):
1. The reservation station is now centralized for convenience (this needs heavy multiporting).
2. Operand values are available at three different places: the reservation station, the ROB, and the register file.
Note also that the reservation station may not need to actually hold operand values, which can be provided from the ROB or the register file.


Memory Access Flow

[Figure: memory access flow diagram. Labels: Processor; virtual address (from program counter or load/store instruction); I-TLB; D-TLB; page table; page table pointer register; entry with Dirty = 1; cache; memory.]


Walls:

Limit in performance

ILP Wall

Memory Wall

Power Wall


ILP(Instruction Level Parallelism)

Fundamental limitation: data flow dependency

Practical limiting factors
• Instruction Window Size
    Branch Prediction
• Data dependency
    Register Renaming
• Memory-address Alias
    Memory Disambiguation
• (Resource Conflicts)
• (Memory Latency due to cache misses and lack of ports)


ILP(Instruction Level Parallelism)

With no limiting factors, i.e. an infinite window, infinite renaming registers, perfect branch prediction, and all memory addresses exactly known, the average ILP in programs is known to be quite high.

But with realistic limiting factors, IPC becomes fairly restricted.


ILP Limit

1. Foster and Riseman, “Percolation of code to enhance parallel dispatching and execution”, IEEE Trans. Computers, Vol. C-21, Dec. 1972.

No. of branches bypassed    ILP
0 (basic block)             1.72
1                           2.72
2                           3.61
8                           7.21
36                          14.8
128                         24.4
∞                           51.2


ILP Limit

2. Spec92

H&P-Text Fig. 3.1 p. 157

ILP = 17.9 for li to 150.1 for tomcatv

3. M. A. Postiff, “The Limits of ILP in SPEC95 Applications”, INTERACT-3, ACM Computer Architecture News, Vol. 27, No. 1, Mar. 1999.

With no memory aliasing,

19.62 for li – 3933.03 for mgrid (61.47 for tomcatv)

With stack dependency (for allocating activation record) removed, 81.45 for li – 4003.44 for mgrid

Note (ghlee): Here ILP is extracted after execution by constructing traces that contain only RAW dependencies, so consider this an idealized figure with unlimited resources.
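The trace-based measurement described in the note can be sketched as follows. This is a minimal illustration, not the method of any of the cited studies: the trace format `(dest, sources)`, the unit latency per instruction, and the function name `dataflow_ilp` are all assumptions made here, and memory dependencies are not modeled. Each instruction is scheduled one cycle after the latest producer of its sources, so only RAW dependencies constrain the schedule, and ILP is the instruction count divided by the length of the longest dependency chain.

```python
def dataflow_ilp(trace):
    """Return instructions / critical-path length for a register trace.

    trace: list of (dest, sources) tuples in program order (hypothetical format).
    Assumes unit latency and unlimited resources; only RAW dependencies count.
    """
    ready = {}   # register name -> cycle in which its latest value is produced
    depth = 0    # length of the longest RAW chain seen so far
    for dest, sources in trace:
        # Earliest cycle is one after the latest producer of any source.
        cycle = 1 + max((ready.get(s, 0) for s in sources), default=0)
        if dest is not None:
            # Overwriting the entry models unlimited renaming:
            # WAR and WAW never delay an instruction.
            ready[dest] = cycle
        depth = max(depth, cycle)
    return len(trace) / depth if depth else 0.0


trace = [
    ("r1", []),            # r1 = constant
    ("r2", []),            # r2 = constant
    ("r3", ["r1", "r2"]),  # depends on r1 and r2
    ("r4", ["r1"]),        # depends on r1
    ("r5", ["r3", "r4"]),  # depends on r3 and r4
    ("r6", []),            # independent
]
print(dataflow_ilp(trace))  # 6 instructions over a 3-cycle chain
```

For this toy trace the longest chain is three instructions long, so the idealized ILP is 6 / 3 = 2.0.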


ILP due to practical limiting factors

Limiting Factors: (H&P-Text p. 152 – 170)
• Instruction Window Size
    more instructions to consider, better ILP potential
• Branch Prediction Accuracy
    fewer wasted cycles
• Renaming Registers
    more registers, better chance to remove WAR and WAW
• Memory Aliasing
    more accurate memory dependency
• Resources
    matching function unit types available to ILP



Limiting Factor - Instruction Window Size

Instruction Window: the set of instructions examined for simultaneous execution - reservation station + current fetch

max. no. of comparisons:

no. of completing instructions × no. of instructions waiting to be issued × 2 (assuming at most two source operands/instr)

with a typical window size of 64 to 128, this is time-critical
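To get a feel for the size of this bound, the product can be evaluated directly. The sketch below assumes, purely for illustration, that four instructions complete per cycle and that every window entry is still waiting; neither number comes from the slides, and the function name is hypothetical.

```python
def max_tag_comparisons(completing, waiting, sources_per_instr=2):
    """Upper bound on result-tag comparisons needed in one cycle."""
    return completing * waiting * sources_per_instr


# Hypothetical case: 4 results broadcast per cycle, full window waiting.
for window in (64, 128):
    print(window, max_tag_comparisons(completing=4, waiting=window))
```

Under these assumptions the bound is 512 comparisons for a 64-entry window and 1024 for a 128-entry window, all of which must finish within a cycle.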

Note (ghlee): Remember that instructions are dispatched to reservation stations and then issued to function units once their operands are available. Note that with Tomasulo-like dynamic scheduling, instructions are issued without checking hazards. So, at commit, results are compared against all pending instructions in the reservation stations that are waiting for those results; once the dependency is resolved, the instructions are issued to the function units.



Limiting Factor - Instruction Window Size

e.g. (from H&P-Text Fig. 3.2 p. 159)

[Figure: ILP vs. window size. X-axis: window size (∞, 2K, 512, 128, 32, 8, 4). Y-axis: ILP, 0 to 70. Series: gcc, espresso, li.]

note:
1. effects of window size
2. inefficiency of larger window

Note (ghlee): Looking at more than 2K instructions at each commit seems excessive, yet a window of 512 does not give much benefit over smaller window sizes. Also note that ILP < 64 even with an infinite window size. Recall that reservation station sizes were around 20 for superscalar processors.



Limiting Factor – Branch Prediction

e.g. (from H&P-Text Fig. 3.3 p. 160)

[Figure: ILP vs. branch prediction. X-axis: prediction scheme (perf, comb, bi, stat, none). Y-axis: ILP, 0 to 45. Series: gcc, espresso, li.]

note:
perf: perfect branch prediction
comb: tournament predictor
bi: bimodal predictor (2-bit counter)
stat: static prediction with profiling
none: no prediction

note:
instruction window size: 2K
issue limit: 64
jump prediction with 2K-entry table

Note (ghlee): Jump prediction is for return addresses and absolute jumps, separate from branch prediction.
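The "bi" scheme in this figure is a bimodal predictor built from 2-bit saturating counters. A minimal sketch of that idea is given below; the table size, the PC indexing, the initial counter value, and the class and function names are assumptions chosen for illustration rather than the configuration used in H&P.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by branch address."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Counter values 0..3; start at 1 (weakly not-taken), an arbitrary choice.
        self.counters = [1] * entries

    def _index(self, pc):
        # Drop the two low bits, assuming 4-byte aligned instructions.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        # Values 2 and 3 predict taken; 0 and 1 predict not-taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, outcomes, pc=0x400):
    """Fraction of correct predictions for one branch at a hypothetical pc."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(outcomes)


# A loop branch: taken 9 times, then falls through, repeated 10 times.
loop_branch = ([True] * 9 + [False]) * 10
print(accuracy(BimodalPredictor(), loop_branch))
```

With this outcome pattern the predictor mispredicts the first taken branch and each loop exit, giving 89 correct predictions out of 100. The second counter bit is what keeps a single loop exit from flipping the prediction for the next loop entry.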



Limiting Factor – Renaming Registers

e.g. (from H&P-Text Fig. 3.5 p. 163)

[Figure: ILP vs. additional rename registers. X-axis: no. of additional rename registers (∞, 256, 128, 64, 32, 0). Y-axis: ILP, 0 to 16. Series: gcc, espresso, li.]

note:
instruction window size: 2K
issue limit: 64
combining predictor with 8K entries in total
jump prediction with 2K-entry table
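Renaming helps because giving every destination a fresh register leaves only RAW dependencies. The sketch below illustrates this with an unlimited supply of physical registers, which corresponds to the ∞ point on the x-axis; the instruction format `(dest, sources)` and the names `rename`, `p0`, `p1`, ... are assumptions made for this example.

```python
import itertools


def rename(instrs):
    """Give each destination a fresh physical register.

    instrs: list of (dest, sources) with architectural register names
    (hypothetical format). Registers never written in the sequence keep
    their architectural names.
    """
    fresh = (f"p{i}" for i in itertools.count())
    mapping = {}   # architectural register -> current physical register
    renamed = []
    for dest, sources in instrs:
        # Sources are looked up before the destination is remapped,
        # so an instruction like r1 = r1 + 1 reads the old r1.
        srcs = [mapping.get(s, s) for s in sources]
        phys = next(fresh)
        mapping[dest] = phys
        renamed.append((phys, srcs))
    return renamed


code = [
    ("r1", ["r2", "r3"]),  # r1 = r2 + r3
    ("r4", ["r1"]),        # RAW on r1
    ("r1", ["r5"]),        # WAW with instr 0, WAR with instr 1
    ("r6", ["r1"]),        # RAW on the new r1
]
for line in rename(code):
    print(line)
```

After renaming, the third instruction writes `p2` instead of reusing `r1`, so it no longer has to wait for the first two. With a finite register pool, allocation would stall once the pool is empty, which is the effect the figure measures.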


ILP due to practical limiting factors
Limiting Factor – Memory Aliasing

e.g.
ld $3, #200($4)
st $5, #200($6)

how to be sure about the dependency between the two memory locations ($4)+200 and ($6)+200?

• Perfect – after executing program
• Global reference and stack references
    Global data region
    Stack access for local variables (activation records)
    Unknown, i.e. assume conflicts, for heap region for dynamic data structures
• Inspection – compile-time region analysis


ILP due to practical limiting factors
Limiting Factor – Memory Aliasing

e.g. (from H&P-Text Fig. 3.6 p. 164)

[Figure: ILP vs. aliasing detection schemes. X-axis: scheme (P, G/S, Ins, none). Y-axis: ILP, 0 to 16. Series: gcc, espresso, li.]

note:
instruction window size: 2K
issue limit: 64 with 256 registers
combining predictor with 8K entries in total
jump prediction with 2K-entry table

P: perfect alias resolution
G/S: global/stack
Ins: inspection

Note (ghlee):
Perfect means figuring out memory access conflicts after execution.
G/S marks no conflict between global and stack references, since local variables are allocated on the stack and accessed via the stack pointer, but all pointer variables (i.e. heap) are assumed to conflict.
Inspection checks, within one live range, whether two memory references with the same base register use different displacements.
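The four schemes can be contrasted with a small decision function. This is a simplified sketch under stated assumptions: the `MemRef` record, its `region` tag, and the example addresses are hypothetical, the G/S case treats any pair involving a heap reference as conflicting, and the inspection case assumes the base register is not redefined between the two references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemRef:
    base: str     # base register name
    offset: int   # displacement
    region: str   # "global", "stack", or "heap" (hypothetical tag)
    address: int  # effective address, known only after execution


def may_alias(a, b, scheme):
    """Return True if the scheme must treat the two references as conflicting."""
    if scheme == "perfect":
        # Addresses are known from the executed trace.
        return a.address == b.address
    if scheme == "global_stack":
        # Heap references are assumed to conflict; the rest are resolved exactly.
        if "heap" in (a.region, b.region):
            return True
        return a.address == b.address
    if scheme == "inspection":
        # Same base register with different displacements cannot overlap
        # (assuming the base is unchanged); anything else is assumed to conflict.
        return not (a.base == b.base and a.offset != b.offset)
    if scheme == "none":
        return True
    raise ValueError(f"unknown scheme: {scheme}")


# ld $3, 200($4) and st $5, 200($6), with made-up stack addresses.
load = MemRef(base="$4", offset=200, region="stack", address=0x7000_00C8)
store = MemRef(base="$6", offset=200, region="stack", address=0x7000_01C8)
for scheme in ("perfect", "global_stack", "inspection", "none"):
    print(scheme, may_alias(load, store, scheme))
```

For this pair, perfect and G/S resolution find no conflict, while inspection cannot prove independence because the base registers differ, so the store and load stay ordered.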


ILP Limit

A Realizable Superscalar Processor: H&P-Text Sec. 3.3, with rather realistic assumptions

• 64-issue with no issue restrictions
• Tournament predictor with 1K entries
• 16-entry jump return predictor
• 256 instruction window
• No alias within window
• 64 additional renaming registers

note: no issue restriction is virtually impossible even for a lower issue count, say 16.

Note (ghlee): This is stretched to the extreme for a 'realistic' processor with a billion transistors, which is not quite realistic.


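ILP Limit – Realistic Processor

[Figure: ILP of the realistic processor vs. the ideal one. X-axis: benchmark (gcc, li, doduc). Y-axis: ILP, 0 to 160. Series: real, ideal. Annotation: around 25%.]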


ILP Limit – Realistic Processor

ILP potential in software

ILP limited by resources
• Window size
• Function unit mismatch
• Registers

ILP limited by dependency
• Branch prediction
• False dependency
    Output dependency (WAW)
• Data dependency (RAW)


Processor Architecture Comparison (H&P-Text Sec. 3.6)

Processor | Microarchitecture | Fetch / Issue / Execute | FU | Clock Rate (GHz) | Transistors | Die size | Power
Intel Pentium 4 Extreme | Speculative dynamically scheduled; deeply pipelined; SMT | 3/3/4 | 7 int., 1 FP | 3.8 | 125 M | 122 mm^2 | 115 W
AMD Athlon 64 FX-57 | Speculative dynamically scheduled | 3/3/4 | 6 int., 3 FP | 2.8 | 114 M | 115 mm^2 | 104 W
IBM Power5 (1 CPU only) | Speculative dynamically scheduled; SMT; 2 CPU cores/chip | 8/4/8 | 6 int., 2 FP | 1.9 | 200 M | 300 mm^2 (est.) | 80 W (est.)
Intel Itanium 2 | Statically scheduled VLIW-style | 6/5/11 | 9 int., 2 FP | 1.6 | 592 M | 423 mm^2 | 130 W


Performance on SPECint2000

[Figure: SPEC ratio per benchmark. X-axis: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf. Y-axis: SPEC Ratio, 0 to 3500. Series: Itanium 2, Pentium 4, AMD Athlon 64, Power 5.]


Performance on SPECfp2000

[Figure: SPEC ratio per benchmark. X-axis: wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi. Y-axis: SPEC Ratio, 0 to 14000. Series: Itanium 2, Pentium 4, AMD Athlon 64, Power 5.]


Normalized Performance: Efficiency

[Figure: normalized efficiency per processor. Metrics: SPECInt / MTransistors, SPECFP / MTransistors, SPECInt / mm^2, SPECFP / mm^2, SPECInt / Watt, SPECFP / Watt. Y-axis: 0 to 35. Series: Itanium 2, Pentium 4, AMD Athlon 64, POWER 5.]

Rank         Itanium 2   Pentium 4   Athlon   Power5
Int/Trans    4           2           1        3
FP/Trans     4           2           1        3
Int/area     4           2           1        3
FP/area      4           2           1        3
Int/Watt     4           3           1        2
FP/Watt      2           4           3        1


Superscalar processor

N-way Superscalar:
• Fetch and decode N instructions
• N “ready” instructions “issued” to function units

Stages: fetch, decode, renaming, dispatch, issue, execution, writeback/commit
• After issue, execution begins
• The maximum number of instructions a processor can send simultaneously is the “issue width”
• The actual issue rate is much less

Fetch = Decode > Issue = Execute > Commit


Note: Can we keep going down the superscalar path for better performance?

Increase:
• Instruction window
• Issue width
• Data path width

→ wire delay becomes a more important factor
→ clustered organization may help
    frequent intra-cluster operations
    infrequent inter-cluster operations

Simpler may be better?
But it does not fully utilize the available on-chip resources
Adopting a multiprocessor approach?
How to control multiprocessors for multiple instructions?


Note: Removing dependency limit

1. Current practice/convention of the programming model imposes unnecessary dependency
• WAR and WAW through memory
    because of the way stack frames are allocated and deallocated, a procedure may reuse memory locations that a previous procedure used on the stack
• specific use of registers
    loop counter, return address register, stack pointer

2. Going beyond the data-flow limit
• Data value prediction with speculation
    general value prediction; unlikely
    address value prediction
    constant/loop index value prediction
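Constant and loop-index values are predictable because they repeat or advance by a fixed step. One common way to capture both cases is a stride predictor, sketched below; it is offered only as an illustration, since the slides do not name a specific predictor, and the table organization and the names `StridePredictor` and `count_hits` are assumptions.

```python
class StridePredictor:
    """Predict the next value of an instruction as last value + last stride."""

    def __init__(self):
        self.table = {}   # pc -> (last_value, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None   # no history yet, so no prediction
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, value):
        if pc in self.table:
            last, _ = self.table[pc]
            self.table[pc] = (value, value - last)
        else:
            # First sighting: stride 0 makes this a last-value predictor.
            self.table[pc] = (value, 0)


def count_hits(values, pc=0x100):
    """Count correct predictions for one instruction at a hypothetical pc."""
    predictor = StridePredictor()
    hits = 0
    for v in values:
        hits += predictor.predict(pc) == v
        predictor.update(pc, v)
    return hits


print(count_hits([7, 7, 7, 7]))        # constant value
print(count_hits([0, 4, 8, 12, 16]))   # loop index advancing by 4
```

Both sequences are predicted correctly three times: the constant needs one observation to warm up, and the loop index needs two before the stride is known. A speculative processor would use such a prediction to start dependent instructions early and squash them on a mismatch.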


Dealing with Other Walls

Memory Wall
• Faster multilevel cache
• Non-blocking pipelined cache
• Cache in multicore processor
• Transactional memory

Power Wall
• Lower driving voltage
• Allowing errors


Adding New Functionality

Network and I/O related
• Bypassing OS intervention

Multimedia
• Vector instructions

Trusted Computing
• Trusted Platform Module
