Korea University, G. Lee - 2009

CRE652 Processor Architecture

Course Objective: To gain
(1) knowledge of the current issues in processor architectures, and
(2) skills for performing architecture research

Ref. Text
• J. Smith and G. Sohi, “The Microarchitecture of Superscalar Processors”, Proceedings of the IEEE, 1995.
• Papers from ISCA, MICRO, and ICCD
• Computer Architecture: A Quantitative Approach, Hennessy and Patterson, Morgan Kaufmann.


Superscalar Processor Model

[Figure: superscalar processor block diagram. Pipeline stages: IF, DP, IS, WB, CT. Blocks: I-Cache, BTB, I Buffer, Rename, Instr. window, reservation station, Dispatch (scheduler), Reg. File, Function units, D-Cache, ROB.]

• VLIW – EPIC
• SMT

ghlee
Note that (1) the reservation station is now centralized for convenience (this needs heavy multiporting), and (2) operand values are available in three different places: the reservation station, the ROB, and the register file. Note also that the reservation station may not need to hold operand values itself, since they can be provided from the ROB or the register file.

Memory Access Flow

[Figure: memory access flow. Labels: Processor; virtual address (from program counter or load/store instruction); I-TLB; D-TLB; page table; page table pointer register; entry with Dirty = 1; cache; memory.]
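
The translation step in this flow can be sketched as below. This is a minimal model under stated assumptions, not the design on the slide: `SimpleMMU`, `PAGE_SIZE`, and the dictionary-based page table are hypothetical names, a single TLB stands in for both the I-TLB and D-TLB, and the cache is left out. A store marks the page-table entry dirty.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical)


class SimpleMMU:
    """Toy virtual-to-physical translation with one TLB (illustrative only)."""

    def __init__(self, page_table):
        # page_table: virtual page number -> {"ppn": int, "dirty": bool}
        self.page_table = page_table
        self.tlb = {}          # virtual page number -> page-table entry
        self.tlb_hits = 0
        self.tlb_misses = 0

    def translate(self, vaddr, is_store=False):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        entry = self.tlb.get(vpn)
        if entry is None:
            # TLB miss: walk the page table (a KeyError here models a page fault).
            self.tlb_misses += 1
            entry = self.page_table[vpn]
            self.tlb[vpn] = entry
        else:
            self.tlb_hits += 1
        if is_store:
            entry["dirty"] = True   # the entry now has Dirty = 1
        return entry["ppn"] * PAGE_SIZE + offset


mmu = SimpleMMU({0: {"ppn": 5, "dirty": False}})
load_paddr = mmu.translate(0x10)                   # first access misses in the TLB
store_paddr = mmu.translate(0x20, is_store=True)   # second access hits
```

The TLB caches a reference to the page-table entry, so setting the dirty bit through the TLB is visible in the page table as well.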


Walls:

Limit in performance

ILP Wall

Memory Wall

Power Wall

ILP (Instruction Level Parallelism)

Fundamental limitation: data flow dependency

Practical limiting factors
• Instruction Window Size
  Branch Prediction
• Data dependency
  Register Renaming
• Memory-address Alias
  Memory Disambiguation
• (Resource Conflicts)
• (Memory Latency due to cache misses and lack of ports)

ILP (Instruction Level Parallelism)

With no limiting factors, i.e. an infinite window, infinite renaming registers, perfect branch prediction, and all memory addresses exactly known, the average ILP in programs is known to be quite high.

But with realistic limiting factors, IPC becomes fairly restricted.

ILP Limit

1. Foster and Riseman, “Percolation of Code to Enhance Parallel Dispatching and Execution”, IEEE Trans. Computers, Vol. C-21, Dec. 1972.

No. of branches bypassed    ILP
0 (basic block)             1.72
1                           2.72
2                           3.61
8                           7.21
36                          14.8
128                         24.4
∞                           51.2

ILP Limit

2. SPEC92

H&P-Text Fig. 3.1 p. 157

ILP = 17.9 for li to 150.1 for tomcatv

3. M. A. Postiff, “The Limits of ILP in SPEC95 Applications”, INTERACT-3, ACM Computer Architecture News, Vol. 27, No.1, Mar. 1999

With no memory aliasing,

19.62 for li – 3933.03 for mgrid (61.47 for tomcatv)

  With stack dependency (for allocating activation record) removed, 81.45 for li – 4003.44 for mgrid

ghlee
Here ILP is extracted after execution by constructing traces, which contain only RAW dependencies. So consider this an idealized case with unlimited resources.
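
A trace-based ILP limit of this kind can be sketched as follows. This is a simplified illustration, not the method used in the cited studies: `dataflow_ilp` and the trace format are hypothetical, every instruction is assumed to take one cycle, and only register RAW dependencies are tracked.

```python
def dataflow_ilp(trace):
    """Return instruction count divided by critical-path length.

    trace: list of (dest, srcs) tuples in program order, where dest is a
    register name (or None) and srcs is a list of register names.
    Assumes unit latency and unlimited resources; only RAW dependencies
    constrain the schedule.
    """
    ready_at = {}        # register -> cycle in which its latest value is produced
    critical_path = 0
    for dest, srcs in trace:
        # An instruction can start once its latest-produced source is ready.
        start = max((ready_at.get(s, 0) for s in srcs), default=0)
        finish = start + 1
        if dest is not None:
            ready_at[dest] = finish
        critical_path = max(critical_path, finish)
    return len(trace) / critical_path if critical_path else 0.0


example_trace = [
    ("r1", []),            # r1 = const
    ("r2", []),            # r2 = const
    ("r3", ["r1", "r2"]),  # r3 = r1 + r2
    ("r4", ["r3"]),        # r4 = r3 * 2
    ("r5", []),            # r5 = const
    ("r6", ["r5", "r1"]),  # r6 = r5 + r1
]
example_ilp = dataflow_ilp(example_trace)  # 6 instructions, critical path of 3
```

Because each source is looked up before the destination is overwritten, WAR and WAW reuse of a register name adds no constraint, which matches the "only RAW" assumption.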

ILP due to practical limiting factors

Limiting Factors: (H&P-text p. 152 – 170)
• Instruction Window Size: more instructions to consider, better ILP potential
• Branch Prediction Accuracy: fewer wasted cycles
• Renaming Registers: more registers, better chance to remove WAR and WAW
• Memory Aliasing: more accurate memory dependency
• Resources: matching function unit types available to ILP

ILP due to practical limiting factors

Limiting Factor - Instruction Window Size
Instruction window: the set of instructions examined for simultaneous execution (reservation station + current fetch)

max. no. of comparisons:
no. of completing instructions × no. of instructions waiting to be issued × 2 (assuming at most two source operands per instruction)

With a typical window size of 64 to 128, this comparison is time-critical.

ghlee
Remember that instructions are dispatched to reservation stations and then issued to function units once their operands are available. Note that with Tomasulo-like dynamic scheduling, instructions are sent to the reservation stations without checking hazards. So, at commit, results are compared with all pending instructions in the reservation stations that are waiting for those results; once the dependency is resolved, the instructions are issued to the function units.
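
The comparison count above can be illustrated with a small wakeup sketch. The names `Entry` and `wakeup` are hypothetical, and real hardware performs these tag comparisons in parallel rather than in a loop; the count simply follows the formula on the slide.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    waiting_on: set   # tags of source operands not yet available (at most two)


def wakeup(window, completed_tags, operands_per_instr=2):
    """Broadcast completed result tags to every waiting entry.

    Returns (names of entries that are now ready, number of tag comparisons).
    The comparison count is completing x waiting x operands per instruction.
    """
    comparisons = len(completed_tags) * len(window) * operands_per_instr
    for entry in window:
        entry.waiting_on -= set(completed_tags)
    ready = [entry.name for entry in window if not entry.waiting_on]
    return ready, comparisons


window = [
    Entry("add", {"t1", "t2"}),
    Entry("mul", {"t2"}),
    Entry("sub", {"t3"}),
]
ready, comparisons = wakeup(window, ["t1", "t2"])
```

With the same formula, 4 completing results against 128 waiting instructions already require 4 × 128 × 2 = 1024 comparisons per cycle.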

ILP due to practical limiting factors

Limiting Factor - Instruction Window Size
e.g. (from H&P-Text Fig. 3.2 p. 159)

[Chart: ILP vs. window size. y-axis: ILP, 0 to 70; x-axis: window size ∞, 2K, 512, 128, 32, 8, 4; series: gcc, espresso, li.]

note:
1. effects of window size
2. inefficiency of larger window

ghlee
Examining more than 2K instructions at each commit seems too much, but 512 does not give much benefit over smaller window sizes. Also note that ILP < 64 even with an infinite window size. Recall that reservation station sizes were around 20 for superscalar processors.
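
The effect of window size can be reproduced on a toy trace with the sketch below. It is an assumption-laden model, not the H&P simulator: `windowed_ilp` is a hypothetical name, latency is one cycle, issue width is unlimited, registers are perfectly renamed, and the window is refilled in program order each cycle.

```python
def windowed_ilp(trace, window_size):
    """Instructions per cycle for a trace with a finite instruction window.

    trace: list of (dest, srcs) tuples in program order.
    """
    # Record, for each instruction, the indices of the instructions that
    # produce its source values (RAW dependencies only).
    last_writer = {}
    producers = []
    for i, (dest, srcs) in enumerate(trace):
        producers.append([last_writer[s] for s in srcs if s in last_writer])
        if dest is not None:
            last_writer[dest] = i

    done_cycle = {}    # instruction index -> cycle in which it executed
    window = []        # indices of fetched but not yet executed instructions
    next_fetch = 0
    cycle = 0
    while len(done_cycle) < len(trace):
        # Refill the window in program order.
        while len(window) < window_size and next_fetch < len(trace):
            window.append(next_fetch)
            next_fetch += 1
        cycle += 1
        # Execute everything whose producers finished in an earlier cycle.
        ready = [i for i in window
                 if all(done_cycle.get(p, cycle) < cycle for p in producers[i])]
        for i in ready:
            done_cycle[i] = cycle
        window = [i for i in window if i not in done_cycle]
    return len(trace) / cycle if cycle else 0.0


# Two independent 4-instruction dependency chains, one after the other.
chains = [("a", [])] + [("a", ["a"])] * 3 + [("b", [])] + [("b", ["b"])] * 3
```

On this trace a window of 1 gives ILP 1.0, a window of 4 gives 1.6, and a window of 8 (the whole trace) gives 2.0: a small window cannot see the second chain until the first has drained.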

ILP due to practical limiting factors

Limiting Factor – Branch Prediction
e.g. (from H&P-Text Fig. 3.3 p. 160)

[Chart: ILP vs. branch prediction. y-axis: ILP, 0 to 45; x-axis: perf, comb, bi, stat, none; series: gcc, espresso, li.]

note:
perf: perfect branch prediction
comb: tournament predictor
bi: bimodal predictor (2-bit counter)
stat: static prediction with profiling
none: no prediction

note:
instruction window size: 2K
issue limit: 64
jmp prediction with 2K entry table

ghlee
jmp prediction is for return addresses and absolute jumps, separate from branch prediction.
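
The "bi" scheme (2-bit saturating counters) can be sketched as below. The class name, table size, index function, and initial counter value are assumptions made for illustration, not parameters taken from the H&P study.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by branch address."""

    def __init__(self, entries=2048):
        self.entries = entries
        # Counter values 0..3; 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * entries   # start weakly not taken (assumption)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop byte offset of 4-byte instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Run one branch through the predictor and count wrong predictions."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# A loop branch: taken 9 times, then not taken at loop exit, executed twice.
loop_outcomes = ([True] * 9 + [False]) * 2
mispredictions = count_mispredictions(BimodalPredictor(), 0x400, loop_outcomes)
```

Over these 20 branch executions the sketch mispredicts 3 times: once while warming up and once at each loop exit. The 2-bit hysteresis is what keeps the first iteration of the second loop run correctly predicted.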

ILP due to practical limiting factors

Limiting Factor – Renaming Registers
e.g. (from H&P-Text Fig. 3.5 p. 163)

[Chart: ILP vs. additional rename registers. y-axis: ILP, 0 to 16; x-axis: ∞, 256, 128, 64, 32, 0; series: gcc, espresso, li.]

note:
instruction window size: 2K
issue limit: 64
combining predictor of total 8K entries
jmp prediction with 2K entry table
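
How renaming removes WAR and WAW dependencies can be sketched as below. This assumes an unlimited supply of physical registers, which corresponds to the ∞ point on the chart; `rename` and the `p0, p1, ...` naming are hypothetical.

```python
def rename(trace):
    """Give every destination a fresh physical register.

    trace: list of (dest, srcs) tuples in program order.
    Sources are mapped to the most recent physical register of that
    architectural register, so RAW dependencies are preserved while
    WAR and WAW dependencies on register names disappear.
    """
    mapping = {}      # architectural register -> current physical register
    next_phys = 0
    renamed = []
    for dest, srcs in trace:
        # Read the mapping before updating it, so an instruction that reads
        # and writes the same register sees the old value.
        new_srcs = [mapping.get(s, s) for s in srcs]
        new_dest = f"p{next_phys}"
        next_phys += 1
        mapping[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed


original = [
    ("r1", ["r2", "r3"]),  # r1 = r2 + r3
    ("r2", ["r4"]),        # WAR with the first instruction on r2
    ("r1", ["r2"]),        # WAW with the first instruction on r1, RAW on r2
]
renamed = rename(original)
```

After renaming, the second instruction writes `p1` instead of `r2` and the third writes `p2` instead of `r1`, so only the true dependency (the third instruction reading `p1`) remains.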

ILP due to practical limiting factors
Limiting Factor – Memory Aliasing

e.g.
ld $3, #200($4)
st $5, #200($6)

How can we be sure about the dependency between the two memory locations ($4)+200 and ($6)+200?

• Perfect – after executing the program
• Global reference and Stack references
  Global data region
  Stack access for local variables (activation records)
  Unknown, i.e. assume conflicts, for heap region for dynamic data structures
• Inspection – compile time region analysis

ILP due to practical limiting factors
Limiting Factor – Memory Aliasing

e.g. (from H&P-Text Fig. 3.6 p. 164)

[Chart: ILP vs. aliasing detection schemes. y-axis: ILP, 0 to 16; x-axis: P, G/S, Ins, none; series: gcc, espresso, li.]

note:
instruction window size: 2K
issue limit: 64 with 256 registers
combining predictor of total 8K entries
jmp prediction with 2K entry table

P: perfect alias resolution
G/S: global/stack
Ins: inspection

ghlee
Perfect means figuring out memory access conflicts after execution. G/S marks no conflict between global and stack references, since local variables are allocated on the stack and accessed via the stack pointer, but all pointer variables (i.e. heap) are assumed to conflict. Inspection checks, within one live range, whether two memory references with the same base register use different displacements.
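
The three schemes can be contrasted with a small sketch. This is one reading of the descriptions above, not code from the H&P study: `MemRef`, the region labels, and the function names are hypothetical, and the inspection rule assumes the base register does not change between the two references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemRef:
    base: str     # base register name
    disp: int     # displacement
    region: str   # "global", "stack", or "heap" (hypothetical labels)
    addr: int     # effective address, known only after execution


def conflict_perfect(a, b):
    # P: actual addresses are known, so conflicts are exact.
    return a.addr == b.addr


def conflict_global_stack(a, b):
    # G/S: exact for global and stack references, heap always conflicts.
    if "heap" in (a.region, b.region):
        return True
    return a.addr == b.addr


def conflict_inspection(a, b):
    # Ins: same base register with different displacement cannot overlap.
    if a.base == b.base and a.disp != b.disp:
        return False
    # References into different allocation areas are assumed not to alias.
    if {a.region, b.region} == {"global", "stack"}:
        return False
    return True   # otherwise assume a conflict


ld = MemRef(base="$4", disp=200, region="heap", addr=0x1000 + 200)
st = MemRef(base="$6", disp=200, region="heap", addr=0x2000 + 200)
```

For the `ld`/`st` pair, only the perfect scheme can tell that the two references do not conflict; G/S and inspection both have to assume a dependency and serialize the accesses.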

ILP Limit

A Realizable Superscalar Processor: H&P-Text sec. 3.3 with rather realistic assumptions
• 64-issue with no issue restrictions
• Tournament predictor with 1K entries
• 16-entry jump return predictor
• 256 instruction window
• No alias within window
• 64 additional renaming registers

note: no issue restriction is virtually impossible even for a lower issue count, say 16.

ghlee
This is stretched to the extreme for a 'realistic' processor with a billion transistors, which is not quite realistic.

ILP Limit – Realistic Processor

[Chart: ILP for gcc, li, doduc. y-axis: ILP, 0 to 160; series: real, ideal. Annotation: around 25%.]

ILP Limit – Realistic Processor

ILP potential in software
ILP limited by resources
  Window size
  Function unit mismatch
  Registers
ILP limited by dependency
  Branch prediction
  False Dependency
  Output dependency (WAW)
  Data dependency (RAW)

Processor Architecture Comparison (H&P-Text Sec. 3.6)

Processor | Microarchitecture | Fetch / Issue / Execute | FU | Clock Rate (GHz) | Transistors | Die size | Power
Intel Pentium 4 Extreme | Speculative dynamically scheduled; deeply pipelined; SMT | 3/3/4 | 7 int. 1 FP | 3.8 | 125 M | 122 mm^2 | 115 W
AMD Athlon 64 FX-57 | Speculative dynamically scheduled | 3/3/4 | 6 int. 3 FP | 2.8 | 114 M | 115 mm^2 | 104 W
IBM Power5 (1 CPU only) | Speculative dynamically scheduled; SMT; 2 CPU cores/chip | 8/4/8 | 6 int. 2 FP | 1.9 | 200 M | 300 mm^2 (est.) | 80 W (est.)
Intel Itanium 2 | Statically scheduled VLIW-style | 6/5/11 | 9 int. 2 FP | 1.6 | 592 M | 423 mm^2 | 130 W

Performance on SPECint2000

[Chart: SPEC Ratio, 0 to 3500, per benchmark: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf; series: Itanium 2, Pentium 4, AMD Athlon 64, Power 5.]

Performance on SPECfp2000

[Chart: SPEC Ratio, 0 to 14000, per benchmark: wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi; series: Itanium 2, Pentium 4, AMD Athlon 64, Power 5.]

Normalized Performance: Efficiency

[Chart: efficiency, 0 to 35, for SPECInt / MTransistors, SPECFP / MTransistors, SPECInt / mm^2, SPECFP / mm^2, SPECInt / Watt, SPECFP / Watt; series: Itanium 2, Pentium 4, AMD Athlon 64, POWER 5.]

Rank        Itanium 2   Pentium 4   Athlon   Power5
Int/Trans   4           2           1        3
FP/Trans    4           2           1        3
Int/area    4           2           1        3
FP/area     4           2           1        3
Int/Watt    4           3           1        2
FP/Watt     2           4           3        1

Superscalar processor

N-way Superscalar:
• Fetch and decode N instructions
• N “ready” instructions “issued” to function units
• fetch, decode, renaming, dispatch, issue
• issue, execution, writeback/commit
• After issue, execution begins
• The maximum number of instructions a processor can send simultaneously is the “issue width”.
• The actual issue rate is much less.

Fetch = Decode > Issue = Execute > Commit

Note: Can we keep going down the superscalar path for better performance?

• Increase instruction window
• Issue width
• Data path width
→ wire delay becomes a more important factor
→ clustered organization may help
   frequent intra-cluster operations
   infrequent inter-cluster operations

Simpler may be better?
But it does not fully utilize available on-chip resources
Adapting a multiprocessor approach?
How to control multiprocessors for multiple instructions

Note: Removing dependency limit

1. Current practice/convention of programming model imposes unnecessary dependency
• WAR and WAW through memory: because of the way the stack frame is allocated or deallocated, a procedure may reuse memory locations that a previous procedure used on the stack
• specific use of registers: loop counter, return address register, stack pointer

2. Going beyond data-flow limit
• Data value prediction with speculation
  general value prediction; unlikely
  address value prediction
  constant/loop index value prediction
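
Constant and loop-index value prediction can be sketched with a stride predictor, shown below. This is a generic textbook-style scheme chosen for illustration, not one named on the slide; `StridePredictor` and `count_correct` are hypothetical names, and there is no confidence mechanism.

```python
class StridePredictor:
    """Predict the next result of an instruction as last value + last stride."""

    def __init__(self):
        self.table = {}   # instruction address -> (last value, last stride)

    def predict(self, pc):
        if pc not in self.table:
            return None               # no history yet, so no prediction
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, value):
        if pc in self.table:
            last, _ = self.table[pc]
            self.table[pc] = (value, value - last)
        else:
            self.table[pc] = (value, 0)   # first sighting: assume constant


def count_correct(predictor, pc, values):
    """Feed the actual values of one instruction and count correct predictions."""
    correct = 0
    for value in values:
        if predictor.predict(pc) == value:
            correct += 1
        predictor.update(pc, value)
    return correct


loop_index_hits = count_correct(StridePredictor(), 0x40, [0, 4, 8, 12, 16])
constant_hits = count_correct(StridePredictor(), 0x44, [7, 7, 7])
```

A loop index stepping by 4 is predicted correctly from the third value onward (3 of 5 here), and a constant from the second value onward (2 of 3). A dependent instruction could start speculatively with the predicted value and be squashed on a mismatch.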

Dealing with Other Walls

Memory Wall
• Faster multilevel cache
• Non-blocking pipelined cache
• Cache in multicore processor
• Transactional memory

Power Wall
• Lower driving voltage
• Allowing errors

Adding New Functionality

Network and I/O related
• Bypassing OS intervention

Multimedia
• Vector instructions

Trusted Computing
• Trusted Platform Module