Korea University, G. Lee - 2009
CRE652 Processor Architecture
Course Objective: To gain
(1) knowledge of the current issues in processor architectures, and
(2) skills for performing architecture research
Ref. Text
• J. Smith and G. Sohi, “The Microarchitecture of Superscalar Processors”, Proceedings of the IEEE, 1995.
• Papers from ISCA, MICRO, and ICCD
• Computer Architecture: A Quantitative Approach, Hennessy and Patterson, Morgan Kaufmann.
Superscalar Processor Model
[Figure: block diagram of a superscalar processor. Components: I-Cache, I Buffer, BTB, Rename, Instr. window (reservation station), Dispatch (scheduler), Reg. File, Function units, D-Cache, ROB. Pipeline stages: IF, DP, IS, WB, CT.]
• VLIW – EPIC
• SMT
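Memory Access Flow
[Figure: a virtual address from the program counter or a load/store instruction leaves the processor and passes through the I-TLB or D-TLB on its way to the cache and memory. Also shown: the page table, the page table pointer register, and an entry with Dirty = 1.]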
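note: one way to read this flow is as a lookup that tries the TLB first and falls back to the page table on a miss. The sketch below is only an illustration, not part of the course material. It assumes 4 KiB pages and dictionary-based structures; `PAGE_SIZE`, `translate`, and the table layouts are hypothetical names.

```python
PAGE_SIZE = 4096  # assumed page size (4 KiB); not stated on the slide

def translate(vaddr, tlb, page_table):
    """Translate a virtual address; return (physical address, tlb_hit)."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                      # TLB hit: no page-table access needed
        return tlb[vpn] * PAGE_SIZE + offset, True
    if vpn not in page_table:           # no mapping: the OS would handle a page fault
        raise KeyError(f"page fault for virtual page {vpn}")
    pfn = page_table[vpn]               # TLB miss: consult the page table
    tlb[vpn] = pfn                      # refill the TLB for later accesses
    return pfn * PAGE_SIZE + offset, False

page_table = {0: 7, 1: 3}               # virtual page number -> physical frame number
tlb = {}
first = translate(4100, tlb, page_table)    # TLB miss, then refill
second = translate(4200, tlb, page_table)   # TLB hit on the same page
print(first, second)
```

The first access misses in the TLB and pays for a page-table lookup. The second access to the same page hits.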
Walls:
Limit in performance
ILP Wall
Memory Wall
Power Wall
ILP(Instruction Level Parallelism)
Fundamental limitation: data flow dependency
Practical limiting factors
• Instruction Window Size
    Branch Prediction
• Data dependency
    Register Renaming
• Memory-address Alias
    Memory Disambiguation
• (Resource Conflicts)
• (Memory Latency due to cache-miss and lack of ports)
ILP(Instruction Level Parallelism)
With no limiting factors, i.e. infinite window, infinite renaming registers, perfect branch prediction, and all memory addresses exactly known, the average ILP in programs is known to be quite high.
But with realistic limiting factors, IPC becomes fairly restricted.
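note: under the ideal assumptions above, the ILP of a trace is the number of instructions divided by the length of its longest chain of true (RAW) dependencies. The sketch below illustrates this data-flow limit. It assumes unit latency for every instruction; the function name `dataflow_ilp` and the trace format are hypothetical.

```python
def dataflow_ilp(trace):
    """trace: list of (dest, [sources]) register names, in program order.

    Assumes unit latency, unlimited resources, perfect renaming and perfect
    branch prediction, so only true (RAW) dependencies matter.
    """
    ready = {}   # register -> cycle in which its latest value is produced
    depth = 0    # length of the longest dependency chain seen so far
    for dest, sources in trace:
        # An instruction executes one cycle after its last producer.
        cycle = 1 + max((ready.get(s, 0) for s in sources), default=0)
        ready[dest] = cycle
        depth = max(depth, cycle)
    return len(trace) / depth if trace else 0.0

trace = [
    ("r1", []),            # r1 = const
    ("r2", []),            # r2 = const
    ("r3", ["r1", "r2"]),  # r3 = r1 + r2
    ("r4", ["r1"]),        # r4 = r1 * 2
    ("r5", ["r3", "r4"]),  # r5 = r3 - r4
    ("r6", []),            # r6 = const
]
print(dataflow_ilp(trace))   # 6 instructions, critical path of 3 -> 2.0
```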
ILP Limit
1. Foster and Riseman, “Percolation of code to enhance parallel dispatching and execution”, IEEE Trans. Computers, Vol. C-21, Dec. 1972.
No. of branches bypassed | ILP
0 (basic block) | 1.72
1 | 2.72
2 | 3.61
8 | 7.21
36 | 14.8
128 | 24.4
∞ | 51.2
ILP Limit
2. Spec92
H&P-Text Fig. 3.1 p. 157
ILP = 17.9 for li to 150.1 for tomcatv
3. M. A. Postiff, “The Limits of ILP in SPEC95 Applications”, INTERACT-3, ACM Computer Architecture News, Vol. 27, No.1, Mar. 1999
With no memory aliasing,
19.62 for li – 3933.03 for mgrid (61.47 for tomcatv)
With stack dependency (for allocating activation record) removed, 81.45 for li – 4003.44 for mgrid
ILP due to practical limiting factors
Limiting Factors: (H&P-text p. 152 – 170)
• Instruction Window Size
    more instructions to consider, better ILP potential
• Branch Prediction Accuracy
    fewer wasted cycles
• Renaming Registers
    more registers, better chance to remove WAR and WAW
• Memory Aliasing
    more accurate memory dependency
• Resources
    matching function unit types available to ILP
ILP due to practical limiting factors
Limiting Factor - Instruction Window Size
Instruction window: the set of instructions examined for simultaneous execution (reservation station + current fetch)
max. no. of comparisons:
no. of completing instructions × no. of instructions waiting to be issued × 2 (assuming at most two source operands/instr)
with a typical window size of 64 to 128, this comparison is time-critical
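note: the comparison count above is a simple product, worked out below for the two window sizes mentioned. The figure of 4 completing instructions per cycle is an assumption chosen only for illustration; `max_wakeup_comparisons` is a hypothetical name.

```python
def max_wakeup_comparisons(completing, waiting, sources_per_instr=2):
    """Upper bound on tag comparisons per cycle for operand wakeup."""
    return completing * waiting * sources_per_instr

for window in (64, 128):
    # Assumed: 4 instructions complete per cycle and the whole window is waiting.
    print(window, max_wakeup_comparisons(completing=4, waiting=window))
```

With these assumed numbers the bound is 512 comparisons per cycle for a 64-entry window and 1024 for a 128-entry window.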
ILP due to practical limiting factors
Limiting Factor - Instruction Window Size
e.g. (from H&P-Text Fig. 3.2 p. 159)
[Figure: ILP (y-axis, 0 to 70) vs. window size (x-axis: ∞, 2K, 512, 128, 32, 8, 4) for gcc, espresso, li]
note:
1. effects of window size
2. inefficiency of larger window
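note: the effect of window size can be reproduced on a toy trace. The model below is a simplification and is not the one used for the H&P figure. It assumes unit latency, unlimited function units, and a window that holds the oldest not-yet-issued instructions. `windowed_ilp` and the `deps` format are hypothetical.

```python
def windowed_ilp(deps, window_size):
    """deps[i] lists the earlier instructions whose results instruction i needs."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    n = len(deps)
    issued_at = {}                      # instruction index -> issue cycle
    window, next_idx, cycle = [], 0, 0
    while len(issued_at) < n:
        # Refill the window in program order.
        while len(window) < window_size and next_idx < n:
            window.append(next_idx)
            next_idx += 1
        cycle += 1
        # Issue every windowed instruction whose producers issued in an earlier cycle.
        ready = [i for i in window
                 if all(issued_at.get(p, cycle) < cycle for p in deps[i])]
        for i in ready:
            issued_at[i] = cycle
        window = [i for i in window if i not in issued_at]
    return n / cycle if n else 0.0

# Four independent two-instruction chains, interleaved in program order.
deps = [[], [0], [], [2], [], [4], [], [6]]
for w in (8, 4, 2, 1):
    print(w, windowed_ilp(deps, w))
```

On this trace the ILP drops from 4.0 with a window of 8 to 2.0, 1.6, and 1.0 as the window shrinks to 4, 2, and 1.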
ILP due to practical limiting factors
Limiting Factor – Branch Prediction
e.g. (from H&P-Text Fig. 3.3 p. 160)
[Figure: ILP (y-axis, 0 to 45) vs. branch prediction scheme (x-axis: perf, comb, bi, stat, none) for gcc, espresso, li]
note:
perf: perfect branch prediction
comb: tournament predictor
bi: bimodal predictor (2-bit counter)
stat: static prediction with profiling
none: no prediction
note:
instruction window size: 2K
issue limit: 64
jmp prediction with 2K entry table
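note: the bimodal (2-bit counter) scheme can be sketched in a few lines. This is only an illustration, not the configuration behind the figure. It assumes a 2048-entry table indexed by the low bits of the branch address, with counters starting at weakly not-taken; `BimodalPredictor` and `accuracy` are hypothetical names.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, entries=2048):
        self.entries = entries
        self.counters = [1] * entries   # 0,1 = predict not taken; 2,3 = predict taken

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def accuracy(predictor, outcomes, pc=0x40):
    """Fraction of correct predictions for one branch at address pc."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(outcomes)

# A loop branch: taken 9 times, then not taken once, repeated 10 times.
loop_branch = ([True] * 9 + [False]) * 10
print(accuracy(BimodalPredictor(), loop_branch))
```

For this loop branch the predictor is right 89 times out of 100: it misses once while warming up and then once at every loop exit.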
ILP due to practical limiting factors
Limiting Factor – Renaming Registers
e.g. (from H&P-Text Fig. 3.5 p. 163)
[Figure: ILP (y-axis, 0 to 16) vs. additional rename registers (x-axis: ∞, 256, 128, 64, 32, 0) for gcc, espresso, li]
note:
instruction window size: 2K
issue limit: 64
combining predictor of total 8K entry
jmp prediction with 2K entry table
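note: renaming removes WAR and WAW hazards by giving every result a fresh register. The sketch below assumes an unlimited supply of physical registers, which is the "∞" end of the figure's x-axis; with fewer registers, renaming would have to stall. `rename` and the `p0, p1, ...` naming are hypothetical.

```python
from itertools import count

def rename(trace):
    """trace: list of (dest, [sources]) using architectural register names.

    Returns the instructions with every destination mapped to a fresh
    physical register, so only true (RAW) dependencies remain.
    """
    fresh = count()   # source of new physical register numbers
    mapping = {}      # architectural register -> current physical register
    renamed = []
    for dest, sources in trace:
        # Sources read the latest mapping; unmapped registers are live-ins
        # and keep their architectural name.
        new_sources = [mapping.get(s, s) for s in sources]
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], new_sources))
    return renamed

trace = [
    ("r1", ["r2", "r3"]),   # r1 = r2 + r3
    ("r4", ["r1"]),         # r4 = r1 * 2   (RAW on r1)
    ("r2", ["r5"]),         # r2 = r5 + 1   (WAR: first instruction reads r2)
    ("r1", ["r6"]),         # r1 = r6 - 1   (WAW with first, WAR with second)
]
for instr in rename(trace):
    print(instr)
```

After renaming, the last two instructions write `p2` and `p3` and no longer conflict with the earlier uses of `r2` and `r1`. Only the RAW dependency through `p0` remains.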
ILP due to practical limiting factors
Limiting Factor – Memory Aliasing
e.g.
ld $3, #200($4)
st $5, #200($6)
how can we be sure about the dependency between the two memory locations ($4)+200 and ($6)+200?
• Perfect – known only after executing the program
• Global reference and stack references
    Global data region
    Stack access for local variables (activation records)
    Unknown, i.e. assume conflicts, for the heap region used by dynamic data structures
• Inspection – compile-time region analysis
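note: the question for a load/store pair is whether their address ranges overlap, and what to assume when an address is not yet known. The sketch below is a conservative check, not any of the three schemes above. It assumes 4-byte accesses; `may_alias`, `must_order`, and the register values are hypothetical.

```python
def may_alias(base_a, offset_a, base_b, offset_b, size=4):
    """True if two accesses of `size` bytes overlap, given known base values."""
    start_a = base_a + offset_a
    start_b = base_b + offset_b
    return start_a < start_b + size and start_b < start_a + size

def must_order(load, store, regs):
    """Conservative check: assume a conflict when a base register is unknown."""
    (load_base, load_off), (store_base, store_off) = load, store
    if load_base not in regs or store_base not in regs:
        return True   # address unknown: assume the accesses conflict
    return may_alias(regs[load_base], load_off, regs[store_base], store_off)

load = ("$4", 200)    # ld $3, #200($4)
store = ("$6", 200)   # st $5, #200($6)
print(must_order(load, store, {"$4": 1000, "$6": 1000}))   # same address
print(must_order(load, store, {"$4": 1000, "$6": 2000}))   # disjoint addresses
print(must_order(load, store, {"$4": 1000}))               # $6 not yet known
```

The pair must stay ordered in the first and third cases. Only in the second case, where both addresses are known and disjoint, can the two accesses be reordered.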
ILP due to practical limiting factors
Limiting Factor – Memory Aliasing
e.g. (from H&P-Text Fig. 3.6 p. 164)
[Figure: ILP (y-axis, 0 to 16) vs. aliasing detection scheme (x-axis: P, G/S, Ins, none) for gcc, espresso, li]
note:
instruction window size: 2K
issue limit: 64 with 256 registers
combining predictor of total 8K entry
jmp prediction with 2K entry table
P: perfect alias resolution
G/S: global/stack
Ins: inspection
ILP Limit
A Realizable Superscalar Processor: H&P-Text sec. 3.3, with rather realistic assumptions
• 64-issue with no issue restrictions
• Tournament predictor with 1K entries
• 16-entry jump return predictor
• 256 instruction window
• No alias within window
• 64 additional renaming registers
note: having no issue restrictions is virtually impossible even for a lower issue count, say 16.
ILP Limit – Realistic Processor
[Figure: ILP (y-axis, 0 to 160) for gcc, li, doduc; two series: real, ideal]
around 25%
ILP Limit – Realistic Processor
ILP potential in software
• ILP limited by resources
    Window size
    Function unit mismatch
    Registers
• ILP limited by dependency
    Branch prediction
    False dependency
        Output dependency (WAW)
    Data dependency (RAW)
Processor Architecture Comparison (H&P-Text Sec.3.6)
Processor | Microarchitecture | Fetch / Issue / Execute | FU | Clock Rate (GHz) | Transistors | Die size | Power
Intel Pentium 4 Extreme | Speculative dynamically scheduled; deeply pipelined; SMT | 3/3/4 | 7 int. 1 FP | 3.8 | 125 M | 122 mm^2 | 115 W
AMD Athlon 64 FX-57 | Speculative dynamically scheduled | 3/3/4 | 6 int. 3 FP | 2.8 | 114 M | 115 mm^2 | 104 W
IBM Power5 (1 CPU only) | Speculative dynamically scheduled; SMT; 2 CPU cores/chip | 8/4/8 | 6 int. 2 FP | 1.9 | 200 M | 300 mm^2 (est.) | 80 W (est.)
Intel Itanium 2 | Statically scheduled VLIW-style | 6/5/11 | 9 int. 2 FP | 1.6 | 592 M | 423 mm^2 | 130 W
Performance on SPECint2000
[Figure: SPEC Ratio (y-axis, 0 to 3500) per benchmark (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf) for Itanium 2, Pentium 4, AMD Athlon 64, Power 5]
Performance on SPECfp2000
[Figure: SPEC Ratio (y-axis, 0 to 14000) per benchmark (wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi) for Itanium 2, Pentium 4, AMD Athlon 64, Power 5]
Normalized Performance: Efficiency
[Figure: efficiency (y-axis, 0 to 35) of Itanium 2, Pentium 4, AMD Athlon 64, POWER 5 under six metrics: SPECInt / MTransistors, SPECFP / MTransistors, SPECInt / mm^2, SPECFP / mm^2, SPECInt / Watt, SPECFP / Watt]
Rank | Itanium 2 | Pentium 4 | Athlon | Power5
Int/Trans | 4 | 2 | 1 | 3
FP/Trans | 4 | 2 | 1 | 3
Int/area | 4 | 2 | 1 | 3
FP/area | 4 | 2 | 1 | 3
Int/Watt | 4 | 3 | 1 | 2
FP/Watt | 2 | 4 | 3 | 1
Superscalar processor
N-way Superscalar:
• Fetch and decode N instructions
• N “ready” instructions “issued” to function units
    fetch, decode, renaming, dispatch, issue
    issue, execution, writeback/commit
• After issue, execution begins
• The maximum number of instructions a processor can send simultaneously is the “issue width”. The actual issue rate is much less.
Fetch = Decode > Issue = Execute > Commit
Note: Can we keep going down the superscalar path for better performance?
Increase
• instruction window
• issue width
• data path width
→ wire delay becomes a more important factor
→ clustered organization may help
    frequent intra-cluster operations
    infrequent inter-cluster operations
Simpler may be better?
But it does not fully utilize the available on-chip resources
Adopting a multiprocessor approach?
How to control multiprocessors for multiple instructions
Note: Removing dependency limit
1. Current practice/convention of the programming model imposes unnecessary dependency
• WAR and WAW through memory
    because of the way the stack frame is allocated or deallocated, a procedure may reuse memory locations that a previous procedure on the stack used
• specific use of registers
    loop counter, return address register, stack pointer, etc.
2. Going beyond the data-flow limit
• Data value prediction with speculation
    general value prediction; unlikely
    address value prediction
    constant/loop index value prediction
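note: constant and loop-index values are easy to predict because they follow a fixed stride, while general values are not. The sketch below uses a simple last-value-plus-stride predictor as an assumed example scheme; the slide does not name one. `StridePredictor` and `hit_rate` are hypothetical names.

```python
class StridePredictor:
    """Predict the next value as last value + last observed stride."""

    def __init__(self):
        self.last = None
        self.stride = 0

    def predict(self):
        return None if self.last is None else self.last + self.stride

    def update(self, value):
        if self.last is not None:
            self.stride = value - self.last
        self.last = value

def hit_rate(values):
    """Fraction of values predicted correctly before they are observed."""
    predictor, hits = StridePredictor(), 0
    for v in values:
        hits += predictor.predict() == v
        predictor.update(v)
    return hits / len(values)

print(hit_rate(list(range(0, 40, 4))))            # loop index stepping by 4
print(hit_rate([7] * 10))                         # constant value
print(hit_rate([3, 1, 4, 1, 5, 9, 2, 6, 5, 3]))   # irregular values
```

The loop index and the constant are predicted correctly 8 and 9 times out of 10 once the predictor has warmed up. The irregular sequence is predicted correctly only once, by coincidence.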
Dealing with Other Walls
Memory Wall
• Faster multilevel cache
• Non-blocking pipelined cache
• Cache in multicore processor
• Transactional memory
Power Wall
• Lower driving voltage
• Allowing errors
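note: lowering the driving voltage helps because dynamic (switching) power is commonly modelled as proportional to C × V^2 × f. The sketch below applies that standard model to relative scaling only. It ignores static (leakage) power; `relative_dynamic_power` is a hypothetical name.

```python
def relative_dynamic_power(voltage_scale, frequency_scale=1.0):
    """Dynamic power relative to the baseline, using P proportional to C * V^2 * f."""
    return voltage_scale ** 2 * frequency_scale

print(relative_dynamic_power(0.8))        # 20% lower voltage, same clock
print(relative_dynamic_power(0.8, 0.8))   # voltage and clock both scaled by 0.8
```

Under this model a 20% voltage reduction alone cuts dynamic power to about 64% of the baseline, and to about 51% if the clock is lowered by the same factor.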
Adding New Functionality
• Network and I/O related
    Bypassing OS intervention
• Multimedia
    Vector instructions
• Trusted Computing
    Trusted Platform Module