Lecture 16
Instruction Level Parallelism: Hyper-threading and Limits

Topics
  Hardware threading
  Limits on ILP
Readings

November 30, 2015
CSCE 513 Computer Architecture
– 2 –CSCE 513 Fall 2015
Overview
Last Time
  pthreads
  Readings for GPU programming
    Stanford (iTunes): http://code.google.com/p/stanford-cs193g-sp2010/
    UIUC ECE 498 AL: Applied Parallel Programming
      http://courses.engr.illinois.edu/ece498/al/
    Book (online): David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
      http://courses.engr.illinois.edu/ece498/al/Syllabus.html
New
  Back to Chapter 3; topics revisited: multiple issue; Tomasulo's data hazards for the address field
  Hyperthreading
  Limits on ILP
CSAPP – Bryant and O'Hallaron
T1 ("Niagara")
Target: commercial server applications
  High thread-level parallelism (TLP)
    Large numbers of parallel client requests
  Low instruction-level parallelism (ILP)
    High cache miss rates
    Many unpredictable branches
    Frequent load-load dependencies
Power, cooling, and space are major concerns for data centers
Metric: Performance/Watt/Sq. Ft.
Approach: multicore, fine-grained multithreading, simple pipeline, small L1 caches, shared L2
05/03/23 CS252 s06 T1 5
T1 Architecture
Also ships with 6 or 4 processors
T1 Pipeline
Single issue, in-order, 6-deep pipeline: F, S, D, E, M, W
3 clock delays for loads & branches
Shared units: L1, L2, TLB, execution units, pipe registers
Hazards:
  – Data
  – Structural
T1 Fine-Grained Multithreading
Each core supports four threads and has its own level-one caches (16 KB for instructions and 8 KB for data)
Switches to a new thread on each clock cycle
Idle threads are bypassed in the scheduling
  Threads go idle due to a pipeline delay or cache miss
  The processor is idle only when all 4 threads are idle or stalled
Both loads and branches incur a 3-cycle delay that can only be hidden by other threads
A single set of floating-point functional units is shared by all 8 cores
  Floating-point performance was not a focus for T1
Memory, Clock, Power
16 KB 4-way set-associative I$ per core
8 KB 4-way set-associative D$ per core
3 MB 12-way set-associative shared L2 $
  4 x 750 KB independent banks
  Crossbar switch to connect cores and banks
  2-cycle throughput, 8-cycle latency
  Direct link to DRAM & JBus
  Manages cache coherence for the 8 cores
  CAM-based directory
Coherency is enforced among the L1 caches by a directory associated with each L2 cache block
  Used to track which L1 caches have copies of an L2 block
  By associating each L2 bank with a particular memory bank and enforcing the subset property, T1 can place the directory at the L2 rather than at memory, which reduces the directory overhead
  The L1 data cache is write-through, so only invalidation messages are required; the data can always be retrieved from the L2 cache
1.2 GHz at 72 W typical, 79 W peak power consumption
Write-through policy
  • allocate on load
  • no-allocate on store
Miss Rates: L2 Cache Size, Block Size
[Chart: T1 L2 miss rate (0–2.5%) for cache sizes of 1.5, 3, and 6 MB with 32 B and 64 B blocks, shown for TPC-C and SPECJBB]
Miss Latency: L2 Cache Size, Block Size
[Chart: T1 L2 miss latency (0–200 cycles) for cache sizes of 1.5, 3, and 6 MB with 32 B and 64 B blocks, shown for TPC-C and SPECJBB]
CPI Breakdown of Performance

Benchmark    Per-thread CPI   Per-core CPI   Effective CPI (8 cores)   Effective IPC (8 cores)
TPC-C        7.20             1.80           0.23                      4.4
SPECJBB      5.60             1.40           0.18                      5.7
SPECWeb99    6.60             1.65           0.21                      4.8
Not Ready Breakdown
TPC-C: store buffer full is the largest contributor
SPEC-JBB: atomic instructions are the largest contributor
SPECWeb99: both factors contribute
[Chart: fraction of cycles not ready (0–100%) for TPC-C, SPECJBB, and SPECWeb99, broken down into L1 I miss, L1 D miss, L2 miss, pipeline delay, and other]
Performance: Benchmarks + Sun Marketing

Benchmark \ Architecture                        Sun Fire T2000   IBM p5-550 (2 dual-core Power5)   Dell PowerEdge
SPECjbb2005 (Java server, business ops/sec)     63,378           61,789                            24,208 (SC1425 with dual single-core Xeon)
SPECweb2005 (Web server performance)            14,001           7,881                             4,850 (2850 with two dual-core Xeons)
NotesBench (Lotus Notes performance)            16,061           14,740                            —
HP Marketing View of T1 Niagara
1. Sun's radical UltraSPARC T1 chip is made up of individual cores that have much slower single-thread performance when compared to the higher-performing cores of the Intel Xeon, Itanium, AMD Opteron, or even classic UltraSPARC processors.
2. The Sun Fire T2000 has poor floating-point performance, by Sun's own admission.
3. The Sun Fire T2000 does not support commercial Linux or Windows® and requires a lock-in to Sun and Solaris.
4. The UltraSPARC T1, aka CoolThreads, is new and unproven, having just been introduced in December 2005.
5. In January 2006, a well-known financial analyst downgraded Sun on concerns over the UltraSPARC T1's limitation to only the Solaris operating system, unique requirements, and longer adoption cycle, among other things. [10]
Where is the compelling value to warrant taking such a risk?
Microprocessor Comparison

Processor                        SUN T1         Opteron     Pentium D     IBM Power 5
Cores                            8              2           2             2
Instruction issues/clock/core    1              3           3             4
Peak instruction issues/chip     8              6           6             8
Multithreading                   Fine-grained   No          SMT           SMT
L1 I/D (KB per core)             16/8           64/64       12K uops/16   64/32
L2 (per core / shared)           3 MB shared    1 MB/core   1 MB/core     1.9 MB shared
Clock rate (GHz)                 1.2            2.4         3.2           1.9
Transistor count (M)             300            233         230           276
Die size (mm²)                   379            199         206           389
Power (W)                        79             110         130           125
Performance Relative to Pentium D
[Chart: performance relative to Pentium D (0–6.5) on SPECIntRate, SPECFPRate, SPECJBB05, SPECWeb05, and a TPC-like workload, for Power5, Opteron, and Sun T1]
Performance/mm², Performance/Watt
[Chart: efficiency normalized to Pentium D (0–5.5) for SPECIntRate, SPECFPRate, SPECJBB05, and TPC-C, each per mm² and per Watt, for Power5, Opteron, and Sun T1]
Niagara 2
Improves performance by increasing threads supported per chip from 32 to 64
  8 cores * 8 threads per core
Floating-point unit for each core, not for each chip
Hardware support for the encryption standards AES, 3DES, and elliptic-curve cryptography
Niagara 2 adds a number of 8x PCI Express interfaces directly into the chip, in addition to integrated 10 Gigabit Ethernet XAUI interfaces and Gigabit Ethernet ports
Integrated memory controllers shift support from DDR2 to FB-DIMMs and double the maximum amount of system memory

Kevin Krewell, "Sun's Niagara Begins CMT Flood – The Sun UltraSPARC T1 Processor Released", Microprocessor Report, January 3, 2006
Amdahl's Law Paper
Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", AFIPS Conference Proceedings, (30), pp. 483-485, 1967.
How long is the paper?
How much of it is Amdahl's Law?
What other comments does it make about parallelism besides Amdahl's Law?
Parallel Programmer Productivity
Lorin Hochstein et al., "Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers." International Conference for High Performance Computing, Networking and Storage (SC'05). Nov. 2005
What did they study?
What is the argument that novice parallel programmers are a good target for High Performance Computing?
How can one account for variability in talent between programmers?
What programmers were studied?
What programming styles were investigated?
How big was the multiprocessor?
How did they measure quality?
How did they measure cost?
Gustafson's Law
Amdahl's Law speedup = (s + p) / (s + p/N)
                     = 1 / (s + p/N), taking s + p = 1
N = number of processors
s = sequential (serial) time fraction
p = parallel time fraction
For N = 1024 processors ...
http://mprc.pku.edu.cn/courses/architecture/autumn2005/reevaluating-Amdahls-law.pdf
Scale the problem
Matrix Multiplication Revisited

for (i = 0; i < n; ++i) {
    for (j = 0; j < n; ++j) {
        for (k = 0; k < n; ++k) {
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
        }
    }
}

Note: n³ multiplications, n³ additions, 4n³ memory references?
How can we improve the code?
  Stride through A? B? C?
  Do the references to A[i][k] and C[i][j] work together or against each other on the miss rate?
  Blocking
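The blocking idea can be sketched as below. The tile size BS is an assumed tuning parameter (it would be chosen so that three BS x BS tiles fit in the L1 data cache); this is a minimal illustration, not a tuned kernel.

```c
#include <stddef.h>

#define BS 32  /* assumed tile size; tune so three BS x BS tiles fit in L1 D$ */

/* Blocked (tiled) matrix multiply: C += A * B for n x n row-major matrices.
 * Each (ii, jj, kk) step multiplies one pair of BS x BS tiles, so elements of
 * A, B, and C are reused while still resident in cache instead of being
 * evicted between uses, which lowers the miss rate of the naive triple loop. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t i = ii; i < ii + BS && i < n; ++i)
                    for (size_t j = jj; j < jj + BS && j < n; ++j) {
                        double sum = C[i * n + j];
                        for (size_t k = kk; k < kk + BS && k < n; ++k)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
}
```

The arithmetic is identical to the naive version (same n³ multiply-adds); only the order of the memory references changes.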
Gustafson's Law Model-Scaling Example
Suppose a model is dominated by a matrix multiply, where a given n x n matrix is multiplied a large constant (k) number of times.
  kn³ multiplies and adds (ignore memory references)
If a model of size n = 1024 can be executed in 10 minutes on one processor, then using Gustafson's Law, how big can the model be and still execute in 10 minutes, assuming 1024 processors and 0% serial code?
  With 1% serial code?
  With 10% serial code?
Copyright © 2011, Elsevier Inc. All rights Reserved.
Figure 3.26 ILP available in a perfect processor for six of the SPEC92 benchmarks. The first three programs are integer programs, and the last three are floating-point programs. The floating-point programs are loop intensive and have large amounts of loop-level parallelism.
Figure 3.30 The relative change in the miss rates and miss latencies when executing with one thread per core versus four threads per core on the TPC-C benchmark. The latencies are the actual time to return the requested data after a miss. In the four-thread case, the execution of other threads could potentially hide much of this latency.
Figure 3.31 Breakdown of the status on an average thread. "Executing" indicates the thread issues an instruction in that cycle. "Ready but not chosen" means it could issue but another thread has been chosen, and "not ready" indicates that the thread is awaiting the completion of an event (a pipeline delay or cache miss, for example).
Figure 3.32 The breakdown of causes for a thread being not ready. The contribution to the "other" category varies. In TPC-C, store buffer full is the largest contributor; in SPEC-JBB, atomic instructions are the largest contributor; and in SPECWeb99, both factors contribute.
Figure 3.35 The speedup from using multithreading on one core on an i7 processor averages 1.28 for the Java benchmarks and 1.31 for the PARSEC benchmarks (using an unweighted harmonic mean, which implies a workload where the total time spent executing each benchmark in the single-threaded base set was the same). The energy efficiency averages 0.99 and 1.07, respectively (using the harmonic mean). Recall that anything above 1.0 for energy efficiency indicates that the feature reduces execution time by more than it increases average power. Two of the Java benchmarks experience little speedup and have significant negative energy efficiency because of this. Turbo Boost is off in all cases. These data were collected and analyzed by Esmaeilzadeh et al. [2011] using the Oracle (Sun) HotSpot build 16.3-b01 Java 1.6.0 Virtual Machine and the gcc v4.4.1 native compiler.
Figure 3.41 The Intel Core i7 pipeline structure shown with the memory system components. The total pipeline depth is 14 stages, with branch mispredictions costing 17 cycles. There are 48 load and 32 store buffers. The six independent functional units can each begin execution of a ready micro-op in the same cycle.
Fermi (2010)
~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
230 GB/s DRAM bandwidth

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, Spring 2010, University of Illinois, Urbana-Champaign