Lecture 16
Instruction Level Parallelism: Hyper-threading and Limits

Topics
  Hardware threading
  Limits on ILP
Readings

November 30, 2015
CSCE 513 Computer Architecture
– 2 –CSCE 513 Fall 2015
Overview
Last Time
  pthreads
  Readings for GPU programming
    Stanford (iTunes): http://code.google.com/p/stanford-cs193g-sp2010/
    UIUC ECE 498 AL: Applied Parallel Programming
      http://courses.engr.illinois.edu/ece498/al/
    Book (online): David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
      http://courses.engr.illinois.edu/ece498/al/Syllabus.html
New
  Back to Chapter 3; topics revisited: multiple issue; Tomasulo's data hazards for the address field
  Hyperthreading
  Limits on ILP
CSAPP – Bryant and O'Hallaron
T1 ("Niagara")
Target: commercial server applications
  High thread-level parallelism (TLP)
    Large numbers of parallel client requests
  Low instruction-level parallelism (ILP)
    High cache miss rates
    Many unpredictable branches
    Frequent load-load dependencies
Power, cooling, and space are major concerns for data centers
Metric: Performance/Watt/Sq. Ft.
Approach: multicore, fine-grained multithreading, simple pipeline, small L1 caches, shared L2
05/03/23 CS252 s06 T1 5
T1 Architecture
Also ships with 6 or 4 processors
T1 Pipeline
Single issue, in-order, 6-deep pipeline: F, S, D, E, M, W
3 clock delays for loads & branches
Shared units: L1, L2, TLB, execution units, pipe registers
Hazards:
  – Data
  – Structural
T1 Fine-Grained Multithreading
Each core supports four threads and has its own level-one caches (16 KB for instructions and 8 KB for data)
Switches to a new thread on each clock cycle
Idle threads are bypassed in the scheduling
  Threads go idle due to a pipeline delay or cache miss
  The processor is idle only when all 4 threads are idle or stalled
Both loads and branches incur a 3-cycle delay that can only be hidden by other threads
A single set of floating-point functional units is shared by all 8 cores
  Floating-point performance was not a focus for T1
Memory, Clock, Power
16 KB 4-way set-associative I$ per core
8 KB 4-way set-associative D$ per core
3 MB 12-way set-associative shared L2 $
  4 x 750 KB independent banks
  Crossbar switch to connect cores and banks
  2-cycle throughput, 8-cycle latency
  Direct link to DRAM & JBus
  Manages cache coherence for the 8 cores
  CAM-based directory
Coherency is enforced among the L1 caches by a directory associated with each L2 cache block
  Used to track which L1 caches have copies of an L2 block
  By associating each L2 bank with a particular memory bank and enforcing the subset property, T1 can place the directory at the L2 rather than at memory, which reduces the directory overhead
  The L1 data cache is write-through, so only invalidation messages are required; the data can always be retrieved from the L2 cache
1.2 GHz at 72 W typical, 79 W peak power consumption
Write-through policy
  • allocate on load
  • no-allocate on store
Miss Rates: L2 Cache Size, Block Size
[Chart: T1 L2 miss rate (0–2.5%) for cache sizes of 1.5, 3, and 6 MB with 32 B and 64 B blocks, shown for TPC-C and SPECJBB]
Miss Latency: L2 Cache Size, Block Size
[Chart: T1 L2 miss latency (0–200 cycles) for cache sizes of 1.5, 3, and 6 MB with 32 B and 64 B blocks, shown for TPC-C and SPECJBB]
CPI Breakdown of Performance

Benchmark    Per-thread CPI   Per-core CPI   Effective CPI (8 cores)   Effective IPC (8 cores)
TPC-C        7.20             1.80           0.23                      4.4
SPECJBB      5.60             1.40           0.18                      5.7
SPECWeb99    6.60             1.65           0.21                      4.8
Not Ready Breakdown
TPC-C: store buffer full is the largest contributor
SPEC-JBB: atomic instructions are the largest contributor
SPECWeb99: both factors contribute
[Chart: fraction of cycles not ready (0–100%) for TPC-C, SPECJBB, and SPECWeb99, broken down into L1 I miss, L1 D miss, L2 miss, pipeline delay, and other]
Performance: Benchmarks + Sun Marketing

Benchmark \ Architecture                        Sun Fire T2000   IBM p5-550 (2 dual-core Power5)   Dell PowerEdge
SPECjbb2005 (Java server, business ops/sec)     63,378           61,789                            24,208 (SC1425 with dual single-core Xeon)
SPECweb2005 (Web server performance)            14,001           7,881                             4,850 (2850 with two dual-core Xeons)
NotesBench (Lotus Notes performance)            16,061           14,740                            —
HP Marketing View of T1 Niagara
1. Sun's radical UltraSPARC T1 chip is made up of individual cores that have much slower single-thread performance when compared to the higher-performing cores of the Intel Xeon, Itanium, AMD Opteron, or even classic UltraSPARC processors.
2. The Sun Fire T2000 has poor floating-point performance, by Sun's own admission.
3. The Sun Fire T2000 does not support commercial Linux or Windows® and requires a lock-in to Sun and Solaris.
4. The UltraSPARC T1, aka CoolThreads, is new and unproven, having just been introduced in December 2005.
5. In January 2006, a well-known financial analyst downgraded Sun on concerns over the UltraSPARC T1's limitation to only the Solaris operating system, unique requirements, and longer adoption cycle, among other things. [10]
Where is the compelling value to warrant taking such a risk?
Microprocessor Comparison

Processor                        SUN T1         Opteron     Pentium D     IBM Power 5
Cores                            8              2           2             2
Instruction issues/clock/core    1              3           3             4
Peak instruction issues/chip     8              6           6             8
Multithreading                   Fine-grained   No          SMT           SMT
L1 I/D (KB per core)             16/8           64/64       12K uops/16   64/32
L2 (per core / shared)           3 MB shared    1 MB/core   1 MB/core     1.9 MB shared
Clock rate (GHz)                 1.2            2.4         3.2           1.9
Transistor count (M)             300            233         230           276
Die size (mm²)                   379            199         206           389
Power (W)                        79             110         130           125
Performance Relative to Pentium D
[Chart: performance relative to Pentium D (0–6.5) on SPECIntRate, SPECFPRate, SPECJBB05, SPECWeb05, and a TPC-like workload, for Power5, Opteron, and Sun T1]
Performance/mm², Performance/Watt
[Chart: efficiency normalized to Pentium D (0–5.5) for SPECIntRate, SPECFPRate, SPECJBB05, and TPC-C, each per mm² and per Watt, for Power5, Opteron, and Sun T1]
Niagara 2
Improves performance by increasing threads supported per chip from 32 to 64
  8 cores * 8 threads per core
Floating-point unit for each core, not for each chip
Hardware support for the encryption standards AES, 3DES, and elliptic-curve cryptography
Niagara 2 adds a number of 8x PCI Express interfaces directly into the chip, in addition to integrated 10 Gigabit Ethernet XAUI interfaces and Gigabit Ethernet ports
Integrated memory controllers shift support from DDR2 to FB-DIMMs and double the maximum amount of system memory

Kevin Krewell, "Sun's Niagara Begins CMT Flood – The Sun UltraSPARC T1 Processor Released", Microprocessor Report, January 3, 2006
Amdahl's Law Paper
Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", AFIPS Conference Proceedings, (30), pp. 483-485, 1967.
How long is the paper?
How much of it is Amdahl's Law?
What other comments does it make about parallelism besides Amdahl's Law?
Parallel Programmer Productivity
Lorin Hochstein et al., "Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers." International Conference for High Performance Computing, Networking and Storage (SC'05). Nov. 2005
What did they study?
What is the argument that novice parallel programmers are a good target for High Performance Computing?
How can one account for variability in talent between programmers?
What programmers were studied?
What programming styles were investigated?
How big was the multiprocessor?
How did they measure quality?
How did they measure cost?
Gustafson's Law
Amdahl's Law speedup = (s + p) / (s + p/N)
                     = 1 / (s + p/N), taking s + p = 1
N = number of processors
s = sequential (serial) time fraction
p = parallel time fraction
For N = 1024 processors ...
http://mprc.pku.edu.cn/courses/architecture/autumn2005/reevaluating-Amdahls-law.pdf
Scale the problem
Matrix Multiplication Revisited

for (i = 0; i < n; ++i) {
    for (j = 0; j < n; ++j) {
        for (k = 0; k < n; ++k) {
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
        }
    }
}

Note: n³ multiplications, n³ additions, 4n³ memory references?
How can we improve the code?
  Stride through A? B? C?
  Do the references to A[i][k] and C[i][j] work together or against each other on the miss rate?
  Blocking
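The blocking idea can be sketched as below. The tile size BS is an assumed tuning parameter (it would be chosen so that three BS x BS tiles fit in the L1 data cache); this is a minimal illustration, not a tuned kernel.

```c
#include <stddef.h>

#define BS 32  /* assumed tile size; tune so three BS x BS tiles fit in L1 D$ */

/* Blocked (tiled) matrix multiply: C += A * B for n x n row-major matrices.
 * Each (ii, jj, kk) step multiplies one pair of BS x BS tiles, so elements of
 * A, B, and C are reused while still resident in cache instead of being
 * evicted between uses, which lowers the miss rate of the naive triple loop. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t i = ii; i < ii + BS && i < n; ++i)
                    for (size_t j = jj; j < jj + BS && j < n; ++j) {
                        double sum = C[i * n + j];
                        for (size_t k = kk; k < kk + BS && k < n; ++k)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
}
```

The arithmetic is identical to the naive version (same n³ multiply-adds); only the order of the memory references changes.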
Gustafson's Law Model-Scaling Example
Suppose a model is dominated by a matrix multiply, where a given n x n matrix is multiplied a large constant (k) number of times.
  kn³ multiplies and adds (ignore memory references)
If a model of size n = 1024 can be executed in 10 minutes on one processor, then using Gustafson's Law, how big can the model be and still execute in 10 minutes, assuming 1024 processors and 0% serial code?
  With 1% serial code?
  With 10% serial code?
Copyright © 2011, Elsevier Inc. All rights Reserved.
Figure 3.26 ILP available in a perfect processor for six of the SPEC92 benchmarks. The first three programs are integer programs, and the last three are floating-point programs. The floating-point programs are loop intensive and have large amounts of loop-level parallelism.
Figure 3.30 The relative change in the miss rates and miss latencies when executing with one thread per core versus four threads per core on the TPC-C benchmark. The latencies are the actual time to return the requested data after a miss. In the four-thread case, the execution of other threads could potentially hide much of this latency.
Figure 3.31 Breakdown of the status on an average thread. "Executing" indicates the thread issues an instruction in that cycle. "Ready but not chosen" means it could issue but another thread has been chosen, and "not ready" indicates that the thread is awaiting the completion of an event (a pipeline delay or cache miss, for example).
Figure 3.32 The breakdown of causes for a thread being not ready. The contribution to the "other" category varies. In TPC-C, store buffer full is the largest contributor; in SPEC-JBB, atomic instructions are the largest contributor; and in SPECWeb99, both factors contribute.
Figure 3.35 The speedup from using multithreading on one core on an i7 processor averages 1.28 for the Java benchmarks and 1.31 for the PARSEC benchmarks (using an unweighted harmonic mean, which implies a workload where the total time spent executing each benchmark in the single-threaded base set was the same). The energy efficiency averages 0.99 and 1.07, respectively (using the harmonic mean). Recall that anything above 1.0 for energy efficiency indicates that the feature reduces execution time by more than it increases average power. Two of the Java benchmarks experience little speedup and have significant negative energy efficiency because of this. Turbo Boost is off in all cases. These data were collected and analyzed by Esmaeilzadeh et al. [2011] using the Oracle (Sun) HotSpot build 16.3-b01 Java 1.6.0 Virtual Machine and the gcc v4.4.1 native compiler.
Figure 3.41 The Intel Core i7 pipeline structure shown with the memory system components. The total pipeline depth is 14 stages, with branch mispredictions costing 17 cycles. There are 48 load and 32 store buffers. The six independent functional units can each begin execution of a ready micro-op in the same cycle.
Fermi (2010)
~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
230 GB/s DRAM bandwidth

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, Spring 2010, University of Illinois, Urbana-Champaign