
Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 41: "Case Studies: Intel Montecito and Sun Niagara"

TLP on Chip: HT/SMT and CMP

Intel Montecito

Features

Overview

Dual threads

Thread urgency

Core arbiter

Power efficiency

Foxton technology

Die photo

Sun Niagara or UltraSPARC T1

Features

Pipeline details

Cache hierarchy

Thread selection


Intel Montecito

Features

- Dual-core Itanium 2, each core dual-threaded
- 1.7 billion transistors, 21.5 mm x 27.7 mm die
- 27 MB of on-chip cache in three levels, not shared between the two cores
- 1.8+ GHz, 100 W
- Single-thread enhancements:
  - Extra shifter improves performance of crypto codes by 100%
  - Improved branch prediction
  - Improved data and control speculation recovery
  - Separate L2 instruction and data caches buy a 7% improvement over Itanium 2; the L2I is four times bigger (1 MB)
  - Asynchronous 12 MB L3 cache

Overview

[Figure reproduced from IEEE Micro]

Dual threads

- SMT only for cache, not for core resources
- Simulations showed high resource utilization at the core level, but low utilization of the cache
- Branch predictor is still shared, but uses thread id tags
- Thread switch is implemented by flushing the pipe
  - More like coarse-grain multithreading
- Five thread switch events (listed below)


  - L3 cache miss (immense impact on the in-order pipe) / L3 cache refill
  - Quantum expiry
  - Spin lock / ALAT invalidation
  - Software-directed switch
  - Execution in low-power mode


Thread urgency

- Each thread has eight urgency levels
- Every L3 miss decrements the urgency by one
- Every L3 refill increments the urgency by one, until the urgency reaches 5
- A switch due to time quantum expiry sets the urgency of the switched thread to 7
- Arrival of an asynchronous interrupt for a background thread sets the urgency level of that thread to 6
- A switch due to an L3 miss also requires the urgency levels of the two threads to be compared

[Figure reproduced from IEEE Micro]

Core arbiter


[Figure reproduced from IEEE Micro]


Power efficiency

- Foxton technology:
  - Blind replication of Itanium 2 cores at 90 nm would lead to roughly 300 W peak power consumption (Itanium 2 consumes 130 W peak at 130 nm)
  - If the power consumption is below the ceiling, the voltage is increased, leading to higher frequency and performance
  - 10% boost for enterprise applications
  - Software or the OS can also dictate a frequency change if power saving is required
  - 100 ms response time for the feedback loop
  - Frequency control is achieved by 24 voltage sensors distributed across the chip; the entire chip runs at a single frequency (other than the asynchronous L3)
  - Clock gating found limited application in Montecito

Foxton technology

[Figure reproduced from IEEE Micro]

Embedded microcontroller runs a real-time scheduler to execute various tasks

Die photo


Sun Niagara or UltraSPARC T1

Features

- Eight pipelines or cores, each shared by four threads
- 32-way multithreading on a single chip
- Starting frequency of 1.2 GHz, consumes 60 W
- Shared 3 MB L2 cache, 4-way banked, 12-way set associative, 200 GB/s bandwidth
- Single-issue, six-stage pipe
- Target market is web service, where ILP is limited but TLP is huge (independent transactions)
  - Throughput matters

Pipeline details

[Figure reproduced from IEEE Micro]

- Four threads share a six-stage pipeline
- Shared L1 caches and TLBs
- Dedicated register file per thread
- Fetches two instructions every cycle from a selected thread
- Thread select logic also determines which thread's instruction should be fed into the pipe
- Although the pipe is in-order, there is an 8-entry store buffer per thread (why?)
- Threads may run into structural hazards due to the limited number of FUs


- Divider is granted to the least recently executed thread


Cache hierarchy

- L1 instruction cache: 16 KB, 4-way, 32-byte lines, random replacement
  - Fetches two instructions every cycle
  - If both instructions are useful, the next cycle is free for icache refill
- L1 data cache: 8 KB, 4-way, 16-byte lines, write-through, no-allocate
  - On average 10% miss rate for the target benchmarks
- L2 cache extends the tag to maintain a directory for keeping the core L1s coherent

- L2 cache is writeback with silent clean eviction

Thread selection

- Based on long-latency events such as load, divide, multiply, branch
- Also based on pipeline stalls due to cache misses, traps, or structural hazards
- Speculative issue of load-dependent instructions is given low priority


Solution of Exercise 1

1. [10 points] Suppose you are given a program that does a fixed amount of work, and some fraction s of that work must be done sequentially. The remaining portion of the work is perfectly parallelizable on P processors. Derive a formula for the execution time on P processors and establish an upper bound on the achievable speedup.

Solution: Execution time on P processors: T(P) = s*T(1) + (1-s)*T(1)/P. Speedup = T(1)/T(P) = 1/(s + (1-s)/P). The upper bound is approached as P goes to infinity, so the maximum speedup is 1/s. As expected, the upper bound on the achievable speedup is inversely proportional to the sequential fraction.
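This is just Amdahl's law, and the saturation at 1/s is easy to see numerically. The following few lines of C are my own illustration (the sequential fraction s = 0.1 and the processor counts are arbitrary example values, not part of the exercise):

#include <stdio.h>

/* Speedup under Amdahl's law: 1 / (s + (1-s)/P). */
static double speedup(double s, int P)
{
    return 1.0 / (s + (1.0 - s) / P);
}

int main(void)
{
    double s = 0.1;                          /* example sequential fraction */
    int procs[] = {1, 2, 4, 16, 64, 1024};

    for (int i = 0; i < 6; i++)
        printf("P = %4d  speedup = %.3f\n", procs[i], speedup(s, procs[i]));
    printf("upper bound 1/s = %.3f\n", 1.0 / s);
    return 0;
}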

2. [40 points] Suppose you want to transfer n bytes from a source node S to a destination node D, and there are H links between S and D. Therefore, notice that there are H+1 routers on the path (including the ones in S and D). Suppose W is the node-to-network bandwidth at each router. So at S you require n/W time to copy the message into the router buffer. Similarly, to copy the message from the buffer of the router in S to the buffer of the next router on the path, you require another n/W time. Assuming a store-and-forward protocol, the total time spent doing these copy operations would be (H+2)n/W, and the data will end up in some memory buffer in D. On top of this, at each router we spend R amount of time to figure out the exit port. So the total time taken to transfer n bytes from S to D in a store-and-forward protocol is (H+2)n/W + (H+1)R. On the other hand, if you assume a cut-through protocol, the critical path would just be n/W + (H+1)R. Here we assume the best possible scenario, where only the header routing delay at each node and the startup n/W delay at S are exposed; the rest is pipelined. Now suppose that you are asked to compare the performance of these two routing protocols on an 8x8 grid. Compute the maximum, minimum, and average latency to transfer an n-byte message in this topology for both protocols. Assume the following values: W = 3.2 GB/s and R = 10 ns. Compute for n = 64 and 256. Note that for each protocol you will have three answers (maximum, minimum, average) for each value of n. Here GB means 10^9 bytes, not 2^30 bytes.

Solution: The basic problem is to compute the maximum, minimum, and average values of H; the rest is just substituting the values of the parameters. The maximum value of H is 14 while the minimum is 1. To compute the average, you need to consider all possible messages, compute H for each of them, and take the average. Consider S = (x0, y0) and D = (x1, y1). Then H = |x0-x1| + |y0-y1|. Therefore, average H = (sum over all x0, x1, y0, y1 of |x0-x1| + |y0-y1|) / (64*63), where each of x0, x1, y0, y1 varies from 0 to 7. Clearly, this is the same as (sum over x0, x1 of |x0-x1| + sum over y0, y1 of |y0-y1|) / 63, which in turn is equal to 2*(sum over x0, x1 of |x0-x1|) / 63 = 2*(sum over x0=0 to 7, x1=0 to x0 of (x0-x1) + sum over x0=0 to 7, x1=x0+1 to 7 of (x1-x0)) / 63 = 16/3.
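The hop-count statistics and the resulting latencies are easy to cross-check by brute force. The sketch below is my own illustration (not part of the original solution): it enumerates all ordered source-destination pairs on the 8x8 grid and plugs the minimum, maximum, and average hop counts into the two latency expressions with the given W and R.

#include <stdio.h>
#include <stdlib.h>

#define DIM 8
#define W   3.2e9          /* node-to-network bandwidth, bytes/s */
#define R   10e-9          /* per-router routing decision time, s */

/* Store-and-forward: (H+2)n/W + (H+1)R.  Cut-through: n/W + (H+1)R. */
static double sf(int n, double H) { return (H + 2) * (n / W) + (H + 1) * R; }
static double ct(int n, double H) { return n / W + (H + 1) * R; }

int main(void)
{
    long pairs = 0, hops = 0;
    int hmin = 4 * DIM, hmax = 0;

    for (int x0 = 0; x0 < DIM; x0++)
        for (int y0 = 0; y0 < DIM; y0++)
            for (int x1 = 0; x1 < DIM; x1++)
                for (int y1 = 0; y1 < DIM; y1++) {
                    if (x0 == x1 && y0 == y1) continue;     /* skip S == D */
                    int H = abs(x0 - x1) + abs(y0 - y1);
                    hops += H; pairs++;
                    if (H < hmin) hmin = H;
                    if (H > hmax) hmax = H;
                }

    double havg = (double)hops / pairs;
    printf("H: min=%d max=%d avg=%.4f (16/3 = %.4f)\n", hmin, hmax, havg, 16.0 / 3);

    int sizes[] = {64, 256};
    for (int i = 0; i < 2; i++) {
        int n = sizes[i];
        printf("n=%3d  store-and-forward min/max/avg = %6.1f / %6.1f / %6.1f ns\n",
               n, 1e9 * sf(n, hmin), 1e9 * sf(n, hmax), 1e9 * sf(n, havg));
        printf("n=%3d  cut-through        min/max/avg = %6.1f / %6.1f / %6.1f ns\n",
               n, 1e9 * ct(n, hmin), 1e9 * ct(n, hmax), 1e9 * ct(n, havg));
    }
    return 0;
}

Because both latency formulas are linear in H, evaluating them at the average H directly gives the average latency.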

3. [20 points] Consider a simple computation on an nxn double matrix (each element is 8 bytes) where each element A[i][j] is modified as follows: A[i][j] = A[i][j] - (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1])/4. Suppose you assign one matrix element to one processor (i.e., you have n^2 processors). Compute the total amount of data communication between processors.

Solution: Each processor requires its four neighbors, i.e., 4 x 8 = 32 bytes. So the total amount of data communicated is 32n^2 bytes.

4. [30 points] Consider a machine running at 10^8 instructions per second on some workload with the following mix: 50% ALU instructions, 20% load instructions, 10% store instructions, and 20% branch instructions. Suppose the instruction cache miss rate is 1%, the writeback data cache miss rate is 5%, and the cache line size is 32 bytes. Assume that a store miss requires two cache line transfers, one to load the newly updated line and one to replace the dirty line at a later point in time. If the machine provides a 250 MB/s bus, how many processors can it accommodate at peak bus bandwidth?

Solution: Let us compute the bandwidth requirement of one processor per second. The instruction cache misses 10^6 times, transferring 32 bytes on each miss. Out of 2*10^7 loads, 10^6 miss in the cache, transferring 32 bytes on each miss. Out of 10^7 stores, 5*10^5 miss in the cache, transferring 64 bytes on each miss. Thus, the total amount of data transferred per second is 96*10^6 bytes. So at most two processors can be supported on a 250 MB/s bus.
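The same arithmetic, written out as code using only the numbers given in the problem:

#include <stdio.h>

int main(void)
{
    double ips        = 1e8;      /* instructions per second */
    double load_frac  = 0.20, store_frac = 0.10;
    double imiss_rate = 0.01, dmiss_rate = 0.05;
    double line       = 32.0;     /* bytes per cache line */

    double ibytes = ips * imiss_rate * line;                    /* instruction misses     */
    double lbytes = ips * load_frac  * dmiss_rate * line;       /* load misses            */
    double sbytes = ips * store_frac * dmiss_rate * 2 * line;   /* store misses: 2 lines  */
    double total  = ibytes + lbytes + sbytes;                   /* bytes per second       */

    printf("per-processor demand = %.0f MB/s\n", total / 1e6);
    printf("processors on a 250 MB/s bus = %d\n", (int)(250e6 / total));
    return 0;
}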


Solution of Exercise 2

[Thanks to Saurabh Joshi for some of the suggestions.]

1. [30 points] For each of the memory reference streams given below, compare the cost of executing it on a bus-based SMP that supports (a) the MESI protocol without cache-to-cache sharing, and (b) the Dragon protocol. A read from processor N is denoted by rN, while a write from processor N is denoted by wN. Assume that all caches are empty to start with and that cache hits take a single cycle, misses requiring an upgrade or update take 60 cycles, and misses requiring a whole block transfer take 90 cycles. Assume that all caches are writeback.

Solution:

Stream 1: r1 w1 r1 w1 r2 w2 r2 w2 r3 w3 r3 w3

(a) MESI: read miss, hit, hit, hit, read miss, upgrade, hit, hit, read miss, upgrade, hit, hit. Total latency = 90+1+1+1 + 2*(90+60+1+1) = 397 cycles.
(b) Dragon: read miss, hit, hit, hit, read miss, update, hit, update, read miss, update, hit, update. Total latency = 90+1+1+1 + 2*(90+60+1+60) = 515 cycles.

Stream 2: r1 r2 r3 w1 w2 w3 r1 r2 r3 w3 w1

(a) MESI: read miss, read miss, read miss, upgrade, readX, readX, read miss, read miss, hit, upgrade, readX. Total latency = 90+90+90+60+90+90+90+90+1+60+90 = 841 cycles.
(b) Dragon: read miss, read miss, read miss, update, update, update, hit, hit, hit, update, update. Total latency = 90+90+90+60+60+60+1+1+1+60+60 = 573 cycles.

Stream 3: r1 r2 r3 r3 w1 w1 w1 w1 w2 w3

(a) MESI: read miss, read miss, read miss, hit, upgrade, hit, hit, hit, readX, readX. Total latency = 90+90+90+1+60+1+1+1+90+90 = 514 cycles.
(b) Dragon: read miss, read miss, read miss, hit, update, update, update, update, update, update. Total latency = 90+90+90+1+60*6 = 631 cycles.

[For each stream for each protocol: 5 points]
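If you want to cross-check the totals mechanically, the per-reference classifications above can be transcribed into small tables and summed with the given costs (hit = 1, upgrade/update = 60, whole block transfer = 90 cycles). The sketch below is only a tally of the analyses above, nothing more:

#include <stdio.h>

enum { HIT = 1, UPG = 60, BLK = 90 };   /* cycle costs from the problem statement */

static int total(const int *ev, int n)
{
    int t = 0;
    for (int i = 0; i < n; i++) t += ev[i];
    return t;
}

#define COUNT(a) ((int)(sizeof(a) / sizeof((a)[0])))

int main(void)
{
    /* Event classifications transcribed from the solution text. */
    int mesi1[] = {BLK,HIT,HIT,HIT, BLK,UPG,HIT,HIT, BLK,UPG,HIT,HIT};
    int drag1[] = {BLK,HIT,HIT,HIT, BLK,UPG,HIT,UPG, BLK,UPG,HIT,UPG};
    int mesi2[] = {BLK,BLK,BLK,UPG,BLK,BLK,BLK,BLK,HIT,UPG,BLK};
    int drag2[] = {BLK,BLK,BLK,UPG,UPG,UPG,HIT,HIT,HIT,UPG,UPG};
    int mesi3[] = {BLK,BLK,BLK,HIT,UPG,HIT,HIT,HIT,BLK,BLK};
    int drag3[] = {BLK,BLK,BLK,HIT,UPG,UPG,UPG,UPG,UPG,UPG};

    printf("Stream 1: MESI %d, Dragon %d\n", total(mesi1, COUNT(mesi1)), total(drag1, COUNT(drag1)));
    printf("Stream 2: MESI %d, Dragon %d\n", total(mesi2, COUNT(mesi2)), total(drag2, COUNT(drag2)));
    printf("Stream 3: MESI %d, Dragon %d\n", total(mesi3, COUNT(mesi3)), total(drag3, COUNT(drag3)));
    return 0;
}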

2. [15 points] (a) As cache miss latency increases, does an update protocol become more or less preferable compared to an invalidation-based protocol? Explain.

Solution: If the system is bandwidth-limited, an invalidation protocol will remain the choice. However, if there is enough bandwidth, an update protocol becomes more attractive as the cache miss latency increases, since updates keep remote copies valid and avoid the increasingly expensive misses.

(b) In a multi-level cache hierarchy, would you propagate updates all the way to the first-level cache? What are the alternative design choices?

Solution: If updates are not propagated to the L1 caches, then on an update the L1 copy of the block must be invalidated and subsequently retrieved from the L2 cache.

(c) Why is an update-based protocol not a good idea for multiprogramming workloads running on SMPs?

Solution: Pack-rat. Discussed in class. (Roughly: when processes are descheduled or migrated, stale copies of their data linger in other caches and keep receiving useless updates, wasting bus bandwidth.)

3. [20 points] Assuming all variables to be initialized to zero, enumerate all outcomes possible under sequential consistency for the following code segments.

(a) P1: A=1;
    P2: u=A; B=1;
    P3: v=B; w=A;

Solution: If u=1 and v=1, then w must be 1. So (u, v, w) = (1, 1, 0) is not allowed. All other outcomes are possible.
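One way to convince yourself is to enumerate every interleaving that respects program order and collect the resulting (u, v, w) triples. The short brute-force checker below is my own illustration (the structure and names are mine, not part of the course material); it reports (1, 1, 0) as the only outcome that never occurs.

#include <stdio.h>

/* Case (a): P1: A=1;  P2: u=A; B=1;  P3: v=B; w=A.
   p1, p2, p3 count how many of each processor's operations have retired. */
static int seen[8];   /* indexed by (u<<2)|(v<<1)|w */

static void explore(int p1, int p2, int p3, int A, int B, int u, int v, int w)
{
    if (p1 == 1 && p2 == 2 && p3 == 2) {          /* all five operations done */
        seen[(u << 2) | (v << 1) | w] = 1;
        return;
    }
    if (p1 == 0) explore(1, p2, p3, 1, B, u, v, w);        /* P1: A = 1 */
    if (p2 == 0) explore(p1, 1, p3, A, B, A, v, w);        /* P2: u = A */
    if (p2 == 1) explore(p1, 2, p3, A, 1, u, v, w);        /* P2: B = 1 */
    if (p3 == 0) explore(p1, p2, 1, A, B, u, B, w);        /* P3: v = B */
    if (p3 == 1) explore(p1, p2, 2, A, B, u, v, A);        /* P3: w = A */
}

int main(void)
{
    explore(0, 0, 0, 0, 0, 0, 0, 0);
    for (int i = 0; i < 8; i++)
        printf("(u,v,w) = (%d,%d,%d): %s\n",
               (i >> 2) & 1, (i >> 1) & 1, i & 1,
               seen[i] ? "possible" : "NOT possible under SC");
    return 0;
}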

(b) P1: A=1;
    P2: u=A; v=B;
    P3: B=1;
    P4: w=B; x=A;

Solution: Observe that if u sees the new value of A, v does not see the new value of B, and w sees the new value of B, then x cannot see the old value of A. So (u, v, w, x) = (1, 0, 1, 0) is not allowed. Similarly, if w sees the new value of B, x sees the old value of A, and u sees the new value of A, then v cannot see the old value of B. So (1, 0, 1, 0) is not allowed, which is already eliminated by the above case. All other 15 combinations are possible.

(c) P1: u=A; A=u+1;
    P2: v=A; A=v+1;

Solution: Three outcomes are possible. If both loads execute before either store, the final (u, v, A) = (0, 0, 1). If P1's load and store both complete before P2's load, the final (u, v, A) = (0, 1, 2). Since u and v are symmetric, the outcome (1, 0, 2) arises in the mirror-image interleavings.

(d) P1: fetch-and-inc (A)
    P2: fetch-and-inc (A)

Solution: The final value of A is 2.

4. [30 points] Consider a quad SMP using a MESI protocol (without cache-to-cache sharing). Each processor tries to acquire a test-and-set lock to gain access to a null critical section. Assume that test-and-set instructions always go on the bus and they take the same time as normal read transactions. The initial condition is such that processor 1 has the lock and processors 2, 3, 4 are spinning on their caches waiting for the lock to be released. Every processor gets the lock once, unlocks, and then exits the program. Consider the bus transactions related to the lock/unlock operations only.

(a) What is the least number of transactions executed to get from the initial to the final state? [10 points]

Solution: 1 unlocks, 2 locks, 2 unlocks (no transaction), 3 locks, 3 unlocks (no transaction), 4 locks, 4 unlocks (no transaction). Notice that in the best possible scenario, the timings will be such that when someone is in the critical section no one will even attempt a test-and-set. So when the lock holder unlocks, the cache block will still be in its cache in M state.

(b) What is the worst-case number of transactions? [5 points]

Solution: Unbounded. While someone is holding the lock, the other contending processors may keep invalidating each other an indefinite number of times.

(c) Answer the above two questions if the protocol is changed to Dragon. [15 points]

Solution: Observe that it is an order of magnitude more difficult to implement shared test-and-set locks (LL/SC-based locks are easier to implement) in a machine running an update-based protocol. In a straightforward implementation, on an unlock everyone will update the value in its cache and then will try to do a test-and-set. Observe that the processor which wins the bus and puts its update first will be the one to enter the critical section. Others will observe the update on the bus and must abort their test-and-set attempts. While someone is in the critical section, nothing stops the other contending processors from trying test-and-set (notice the difference with test-and-test-and-set). However, these processors will not succeed in getting entry to the critical section until the unlock happens.

Least number is still 7. A test-and-set or an unlock involves putting an update on the bus.

Worst case is still unbounded.

5. [30 points] Answer the above question for a test-and-test-and-set lock for a 16-processor SMP. The initial condition is such that the lock is released and no one has got the lock yet.

Solution:

MESI:

Best case analysis: 1 locks, 1 unlocks, 2 locks, 2 unlocks, ... This involves exactly 16 transactions (unlocks will not generate any transaction in the best-case timing).

Worst case analysis: Done in class. The first round will have (16 + 15 + 1 + 15) transactions. The second round will have (15 + 14 + 1 + 14) transactions. The last-but-one round will have (2 + 1 + 1 + 1) transactions, and the last round will have one transaction (just the locking of the last processor). The last unlock will not generate any transaction. If you add these up, you will get (1.5P+2)(P-1) + 1. For P=16, this is 391.

Dragon:

Best case analysis: Now both unlocks and locks will generate updates. So the total number of transactions would be 32.

Worst case analysis: The test-and-set attempts in each round will generate updates. The unlocks will also generate updates. Everything else will be cache hits. So the number of transactions is (16+1) + (15+1) + ... + (1+1) = 152.
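The closed forms can be sanity-checked with a few lines of C. Note that the per-round breakdown in the comments is my reading of the (16 + 15 + 1 + 15) terms quoted above (read misses by all contenders, failed test-and-sets, the winning test-and-set, and the re-reads after the winner's invalidation), not something stated explicitly in the lecture.

#include <stdio.h>

int main(void)
{
    int P = 16;

    /* MESI worst case: a round with k contenders costs
       k read misses + (k-1) failed test-and-sets + 1 winning test-and-set
       + (k-1) re-reads after the winner's invalidation = 3k - 1 transactions,
       plus a single transaction in the last round (k = 1). */
    int mesi = 1;
    for (int k = 2; k <= P; k++) mesi += 3 * k - 1;

    /* Dragon worst case: k test-and-set updates + 1 unlock update per round. */
    int dragon = 0;
    for (int k = 1; k <= P; k++) dragon += k + 1;

    printf("MESI worst case   = %d (closed form %g)\n",
           mesi, (1.5 * P + 2) * (P - 1) + 1);
    printf("Dragon worst case = %d\n", dragon);
    return 0;
}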

6. [10 points] If the lock variable is not allowed to be cached, how will the traffic of a test-and-set lock compare against that of a test-and-test-and-set lock?

Solution: In the worst case both would be unbounded. Since the lock cannot be cached, every test goes on the bus in either scheme, so test-and-test-and-set loses its usual advantage.

7. [15 points] You are given a bus-based shared memory machine. Assume that the processors have a cache block size of 32 bytes and A is an array of integers (four bytes each). You want to parallelize the following loop.

for (i = 0; i < 17; i++) {
  for (j = 0; j < 256; j++) {
    A[j] = do_something(A[j]);
  }
}

(a) Under what conditions would it be better to use a dynamically scheduled loop?

Solution: If the runtime of do_something varies a lot depending on its argument value, or if nothing is known about do_something.

(b) Under what conditions would it be better to use a statically scheduled loop?

Solution: If the runtime of do_something is roughly independent of its argument value.

(c) For a dynamically scheduled inner loop, how many iterations should a processor pick each time?


Solution: A multiple of 8 iterations (one cache block holds eight integers), so that each processor writes to whole cache blocks and different processors never share a block, avoiding false sharing.
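As an illustration of what such a dynamically scheduled inner loop could look like, here is a minimal pthreads-based sketch; the chunk size of 8 keeps each processor's writes confined to whole cache blocks. The shared counter, the lock, the stand-in do_something, and all names are my own assumptions, not part of the exercise.

#include <pthread.h>
#include <stdio.h>

#define N        256
#define CHUNK      8      /* one 32-byte cache block = eight 4-byte ints */
#define NTHREADS   4

static int A[N];
static int next_j;                        /* shared loop counter */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static int do_something(int x) { return x + 1; }   /* stand-in for the real work */

/* Each worker repeatedly grabs the next block-aligned chunk of 8 iterations,
   so no two processors ever write into the same cache block of A. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&m);
        int j = next_j;
        next_j += CHUNK;
        pthread_mutex_unlock(&m);
        if (j >= N) break;
        for (int k = j; k < j + CHUNK; k++)
            A[k] = do_something(A[k]);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int iter = 0; iter < 17; iter++) {           /* the outer loop of the exercise */
        next_j = 0;
        for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    }
    printf("A[0] = %d\n", A[0]);                      /* expect 17 */
    return 0;
}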

8. [20 points] The following barrier implementation is wrong. Make as little change as possible to correct it.

struct bar_struct {
  LOCKDEC(lock);
  int count;      // Initialized to zero
  int releasing;  // Initialized to zero
} bar;

void BARRIER (int P) {
  LOCK(bar.lock);
  bar.count++;
  if (bar.count == P) {
    bar.releasing = 1;
    bar.count--;
  } else {
    UNLOCK(bar.lock);
    while (!bar.releasing);
    LOCK(bar.lock);
    bar.count--;
    if (bar.count == 0) {
      bar.releasing = 0;
    }
  }
  UNLOCK(bar.lock);
}

Solution: There are too many problems with this implementation to list here. The corrected barrier code is given below; it requires the addition of one line of code. Notice that the releasing variable nicely captures the notion of sense reversal.

void BARRIER (int P) {
  while (bar.releasing);  // New addition
  LOCK(bar.lock);
  bar.count++;
  if (bar.count == P) {
    bar.releasing = 1;
    bar.count--;
  } else {
    UNLOCK(bar.lock);
    while (!bar.releasing);
    LOCK(bar.lock);
    bar.count--;
    if (bar.count == 0) {
      bar.releasing = 0;
    }
  }
  UNLOCK(bar.lock);
}
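To actually compile and run this barrier, the course's lock macros have to be mapped onto real primitives. One possible mapping, placed before the struct definition above, is shown here; it is purely illustrative (only LOCKDEC/LOCK/UNLOCK come from the course material, everything else is an assumption), and bar.lock must additionally be initialized, e.g. with pthread_mutex_init().

#include <pthread.h>

/* Hypothetical mapping of the course's lock macros onto pthreads, just to make
   the barrier above compilable for experimentation. */
#define LOCKDEC(l)  pthread_mutex_t l
#define LOCK(l)     pthread_mutex_lock(&(l))
#define UNLOCK(l)   pthread_mutex_unlock(&(l))

In addition, count and releasing should be declared volatile (or accessed with proper atomics) so that the two busy-wait loops re-read memory on every iteration; with plain int the compiler is free to hoist the load out of the loop.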


Solution of Exercise 3

1. [0 points] Please categorize yourself as either "CS698Z AND CS622" or "CS622 ONLY".

2. [5+5] Consider a 512-node system. Each node has 4 GB of main memory. The cache block size is 128 bytes. What is the total directory memory size for (a) bitvector, (b) DiriB with i=3?

Solution: (a) Number of cache blocks per node = 2^25, and each directory entry is 512 bits (one presence bit per node), i.e., 64 bytes. So the total directory memory size = (2^25)*64*512 bytes = 1 TB.
(b) Each directory entry holds three 9-bit pointers, i.e., 27 bits. So the total directory memory size = (2^25)*27*512 bits = 27*2^31 bytes = 54 GB.

3. [5+5+5+10] For a simple two-processor NUMA system, the number of cache misses to three virtual pages X, Y, Z is as follows.

Page X: P0 has 14 misses, P1 has 11 misses. P1 takes the first miss.
Page Y: P0 has zero misses, P1 has 18 misses.
Page Z: P0 has 15 misses, P1 has 9 misses. P0 takes the first miss.

The remote-to-local miss latency ratio is four. Evaluate the aggregate time spent in misses by the two processors under each of the following policies. Assume that a local miss takes 400 cycles.

(a) First touch placement.
(b) All three pages on P0.
(c) All three pages on P1.
(d) Best possible application-directed static placement, i.e., a one-time call to the OS to place the three pages.

Solution: (a) First touch: Page X is local to P1, page Y is local to P1, page Z is local to P0. P0 spends (14*1600 + 15*400) cycles, or 28400 cycles. P1 spends (11*400 + 18*400 + 9*1600) cycles, or 26000 cycles. Aggregate: 54400 cycles.

(b) All three pages on P0: P0 spends (14*400 + 15*400) cycles, or 11600 cycles. P1 spends (11*1600 + 18*1600 + 9*1600) cycles, or 60800 cycles. Aggregate: 72400 cycles.

(c) All three pages on P1: P0 spends (14*1600 + 15*1600) cycles, or 46400 cycles. P1 spends (11*400 + 18*400 + 9*400) cycles, or 15200 cycles. Aggregate: 61600 cycles.

(d) To determine the best page-to-node affinity, we need to compute the latency for both choices for each page. This is shown in the following table. Since P0 doesn't access Y, it should be placed on P1; therefore, Y is not included in the table.

Page  Home  Latency of P0  Latency of P1  Aggregate
-----------------------------------------------------
X     P0    5600           17600          23200
X     P1    22400          4400           26800
-----------------------------------------------------
Z     P0    6000           14400          20400
Z     P1    24000          3600           27600
-----------------------------------------------------

Thus X and Z should both be placed on P0, and Y on P1. Adding Y's 18*400 = 7200 cycles on P1, the aggregate latency of this placement is 23200 + 20400 + 7200 = 50800 cycles. As you can see, this is about 7% better than first touch.
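The four aggregates can be cross-checked with a small program that walks the miss counts from the problem over each placement:

#include <stdio.h>

#define LOCAL  400
#define REMOTE 1600

/* misses[page][processor], pages in the order X, Y, Z */
static int misses[3][2] = { {14, 11},
                            { 0, 18},
                            {15,  9} };

static int aggregate(const int home[3])     /* home[page] = node the page lives on */
{
    int total = 0;
    for (int pg = 0; pg < 3; pg++)
        for (int p = 0; p < 2; p++)
            total += misses[pg][p] * (p == home[pg] ? LOCAL : REMOTE);
    return total;
}

int main(void)
{
    int first_touch[3] = {1, 1, 0};   /* X->P1, Y->P1, Z->P0 */
    int all_p0[3]      = {0, 0, 0};
    int all_p1[3]      = {1, 1, 1};
    int best[3]        = {0, 1, 0};   /* X->P0, Y->P1, Z->P0 */

    printf("first touch: %d\n", aggregate(first_touch));
    printf("all on P0:   %d\n", aggregate(all_p0));
    printf("all on P1:   %d\n", aggregate(all_p1));
    printf("best static: %d\n", aggregate(best));
    return 0;
}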


4. [30] Suppose you want to transpose a matrix in parallel. The source matrix is A and the destination matrix is B. Both A and B are decomposed in such a way that each node gets a chunk of consecutive rows. Application-directed page placement is used to map the pages belonging to each chunk in the local memory of the respective nodes. Now the transpose can be done in two ways. The first algorithm, known as the "local read algorithm", allows each node to transpose the band of rows local to it; so naturally this algorithm involves a large fraction of remote writes to matrix B. Assume that the band is wide enough so that there is no false sharing when writing to matrix B. The second algorithm, known as the "local write algorithm", allows each node to transpose a band of columns of A such that all the writes to B are local. Naturally, this algorithm involves a large number of remote reads in matrix A. Assume that the algorithms are properly tiled so that the cache utilization is good. In both cases, before doing the transpose every node reads and writes to its local segment in A, and after doing the transpose every node reads and writes to its local segment in B. Assuming an invalidation-based cache coherence protocol, briefly but clearly explain which algorithm is expected to deliver better performance. How much synchronization does each algorithm require (in terms of the number of critical sections and barriers)? Assume that the caches are of infinite capacity and that a remote write is equally expensive in all respects as a remote read, because in both cases retirement is held up for a sequentially consistent implementation.

Solution: Analysis of the local read algorithm: Before the algorithm starts, the band of rows to be transposed by a processor is already in its cache. Each remote write to a cache block of B involves a 2-hop transaction (requester to home and back) without involving any invalidation. After the transpose, each processor has a large number of remote cache blocks in its cache. Here a barrier is needed before a processor is allowed to work on its local rows of B. After the barrier, each cache read or write miss to B involves a 2-hop intervention (local home to owner and back). So in this algorithm, each processor suffers from local misses before the transpose, 2-hop misses during the transpose, and 2-hop misses after the transpose.

Analysis of the local write algorithm: Before the algorithm starts, the band of columns to be transposed by a processor is quite distributed over the system; in fact, only a 1/P portion of the column band would be in the local cache. Before the transpose starts, a barrier is needed. In the transpose phase, each cache miss to A involves a 2-hop transaction (requester to home's cache and back). Each cache miss to B is a local miss. After the transpose, the local band of rows of B is already in the cache of each processor, so this phase enjoys hits. So in this algorithm, each processor suffers from local misses before the transpose, 2-hop misses and local misses during the transpose, and hits after the transpose.

Both algorithms have the same number of misses, but the local write algorithm has more local misses and fewer remote misses. Both have the same synchronization requirement. The local write algorithm is better, and that's what the SPLASH-2 FFT implements. Cheers!

5. [5+5] If a compiler reorders accesses according to WO and the underlying processor is SC, what is the consistency model observed by the programmer? What if the compiler produces SC code, but the processor is RC?

Solution: WO compiler and SC hardware: the programmer sees WO, because an SC processor will execute whatever is presented to it in the intuitive order. But since the instruction stream presented to it is already WO, it cannot do any better.

SC compiler and RC hardware: programmer sees RC.

6. [5+5] Consider the following piece of code running on a faulty microprocessor system which does not preserve any program order other than true data and control dependence order.

LOCK (L1)
load A
store A
UNLOCK (L1)
load B
store B
LOCK (L1)
load C
store C
UNLOCK (L1)

(a) Insert appropriate WMB and MB instructions to enforce SC. Do not over-insert anything.
(b) Repeat part (a) to enforce RC.

Solution:

(a)
MB
LOCK (L1)
MB
load A
store A
WMB        // need to hold the unlock back until the store to A is done
UNLOCK (L1)
MB
load B
store B
MB
LOCK (L1)
MB
load C
store C
WMB        // need to hold the unlock back until the store to C is done
UNLOCK (L1)
MB

(b)
LOCK (L1)
MB         // this MB wouldn't be needed if the processor were not faulty; you lose some RC advantage
load A
store A
WMB
UNLOCK (L1)
load B
store B
LOCK (L1)
MB         // this MB wouldn't be needed if the processor were not faulty; you lose some RC advantage
load C
store C
WMB
UNLOCK (L1)

7. [5+10] Consider implementing a directory-based protocol to keep the private L1 caches of the cores in a chip-multiprocessor (CMP) coherent. Assume that the CMP has a shared L2 cache. The most natural way of doing it is to attach a directory to the tag of each L2 cache block. An L1 cache miss gets forwarded to the L2 cache. The directory entry is looked up in parallel with the L2 cache block. Does this implementation suit an inclusive hierarchy or an exclusive hierarchy? Explain. If the cache block sizes of the L1 cache and the L2 cache are different, explain briefly how you will manage the state of the L2 cache block on receiving a writeback from the L1 cache. Do not assume a per-sector directory entry in this design.

Solution: This is the typical design for an inclusive hierarchy. For an exclusive hierarchy, you cannot keep a directory entry per L2 cache block. Instead, you must have a separate large directory store holding the directory states of all the blocks in both the L2 and the L1 caches. If the L1 and L2 block sizes are different, you need to keep a count of dirty L1 sectors with the owner; otherwise you won't be able to turn off the L1M state even after the last dirty sector is written back by the L1 cache. This has no correctness problem, but it would involve unnecessary interventions to the L1 caches. Fortunately, you don't need any extra bits for doing this in a medium- to large-scale CMP if the ratio of the L2 to L1 block sizes is not too large. Assume a p-core CMP with an L2 to L1 block size ratio of k. You need log(p) bits to store the owner when the L1M state is set, log(k) bits to store the dirty sector count, and p bits to store the sharer vector if the block is only shared. So as long as p >= log(p) + log(k), we can pack the owner and the dirty sector count into the sharer vector. This condition corresponds to k <= 2^{p-log(p)}, or k <= (2^p)/p. The number on the right-hand side grows very fast with p. From eight cores onward we are in the safe zone.
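A quick tabulation of the packing bound k <= (2^p)/p (my own check, not part of the original solution):

#include <stdio.h>

int main(void)
{
    /* For a p-core CMP with an L2:L1 block-size ratio of k, the owner id
       (log2 p bits) and the dirty-sector count (log2 k bits) fit inside the
       p-bit sharer vector as long as p >= log2(p) + log2(k), i.e. k <= 2^p / p. */
    for (int p = 2; p <= 32; p *= 2)
        printf("p = %2d cores  ->  largest safe k = %lld\n", p, (1LL << p) / p);
    return 0;
}

Already at p = 8 the bound allows k up to 32, which comfortably covers realistic L2-to-L1 block size ratios.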


Self-assessment Exercise

These problems should be tried after module 05 is completed.

1. Consider the following memory organization of a processor. The virtual address is 40 bits, the physical address is 32 bits, and the page size is 8 KB. The processor has a 4-way set associative 128-entry TLB, i.e., 32 sets of 4 ways each. Each page table entry is 32 bits in size. The processor also has a 2-way set associative 32 KB L1 cache with a line size of 64 bytes.

(A) What is the total size of the page table?
(B) Clearly show (with the help of a diagram) the addressing scheme if the cache is virtually indexed and physically tagged. Your diagram should show the widths of the TLB and cache tags.
(C) If the cache were physically indexed and physically tagged, what part of the addressing scheme would change?

2. A set associative cache has a longer hit time than an equally sized direct-mapped cache. Why?

3. The Alpha 21264 has a virtually indexed, virtually tagged instruction cache. Do you see any security/protection issues with this? If yes, explain and offer a solution. How would you maintain correctness of such a cache in a multi-programmed environment?

4. Consider the following segment of C code for adding the elements in each column of an NxN matrix A and putting the result in a vector x of size N.

for (j = 0; j < N; j++) {
  for (i = 0; i < N; i++) {
    x[j] += A[i][j];
  }
}

Assume that the C compiler carries out a row-major layout of matrix A, i.e., A[i][j] and A[i][j+1] are adjacent to each other in memory for all i and j in the legal range, and A[i][N-1] and A[i+1][0] are adjacent to each other for all i in the legal range. Assume further that each element of A and x is a floating-point double, i.e., 8 bytes in size. This code is executed on a modern speculative out-of-order processor with the following memory hierarchy: page size 4 KB, fully associative 128-entry data TLB, 32 KB 2-way set associative single-level data cache with a 32-byte line size, 256 MB DRAM. You may assume that the cache is virtually indexed and physically tagged, although this information is not needed to answer this question. For N=8192, compute the following (please show all the intermediate steps). Assume that every instruction hits in the instruction cache. Assume an LRU replacement policy for physical page frames, TLB entries, and cache sets.

(A) Number of page faults.
(B) Number of data TLB misses.
(C) Number of data cache misses. Assume that x and A do not conflict with each other in the cache.
(D) At most how many memory operations can the processor overlap before coming to a halt? Assume that the instruction selection logic (associated with the issue unit) gives priority to older instructions over younger instructions if both are ready to issue in a cycle.

5. Suppose you are running a program on two machines, both having a single level of cache hierarchy (i.e., only L1 caches). In one machine the cache is virtually indexed and physically tagged, while in the other it is physically indexed and physically tagged. Will there be any difference in cache miss rates when the program is run on these two machines?