
CS4/MSc Parallel Architectures - 2012-2013

Lect. 10: Chip-Multiprocessors (CMP)

• Main driving forces:
  – Complexity of design and verification of wider-issue superscalar processors
  – Performance gains of either wider issue width or deeper pipelines would be only marginal:
    · Limited ILP in applications
    · Wire delays and longer access times of larger structures
  – Power consumption of the large centralized structures necessary in wider-issue superscalar processors would be unmanageable
  – Increased relative importance of throughput-oriented computing as compared to latency-oriented computing
  – Continuation of Moore's law, so that more transistors fit on a chip


Early (ca. 2006) CMPs

• Example: Intel Core Duo
  – 2 cores
    · 12-stage pipeline
    · 2-way simultaneous multithreading (HT)
    · Up to 2.33GHz
    · P6 (Pentium M) microarchitecture
  – 2MB shared L2 cache
  – 151M transistors in 65nm technology
  – Power consumption between 9W and 30W


A Different Design for CMP

• Example: Intel Polaris (2007)
  – 80 cores
    · Single-issue, statically scheduled
    · 3.2GHz (up to 5GHz)
  – No shared L2 or L3 cache
  – No cache coherence
  – "Tiled" approach
    · Core + cache + router
    · Scalable, packet-switched interconnect (8x10 mesh)
  – Power consumption around 62W
• Example: Intel SCC (2010)
  – 48 cores (fully IA-32 compatible)


Now (2012)

• Example: Intel Core i7
  – 2, 4, 6, or 8 cores
    · 2-way simultaneous multithreading (HT)
    · Up to 3.5GHz
    · Sandy Bridge microarchitecture
  – Up to 20MB shared L3
  – Up to 2B transistors in 22nm technology
  – Power consumption between 45W and 150W


CMPs vs. Multi-chip Multiprocessors

• While conceptually similar to traditional multiprocessors, CMPs have specific issues:
  – Off-chip memory bandwidth: the number of pins per package does not increase much
  – On-chip interconnection network: wires and metal layers are a very scarce resource
  – Shared memory hierarchy: processors must share some lower-level cache (LLC, e.g., L2 or L3) and the on-chip links to it
  – Wire delays: the actual physical distances to be crossed for communication affect its latency
  – Power consumption and heat dissipation: both are much harder to fit within the limitations of a single chip package
  – Dark silicon


Shared vs. Private L2 Caches

• Private caches:
  + Less chance of negative interference between processors
  + Simpler interconnections
  – Possibly wasted storage in less loaded parts of the chip
  – Must enforce coherence across the L2s

• Shared caches:
  – More chance of negative interference between processors
  + Possible positive interference between processors
  + Better utilization of storage
  + Single/few threads have access to all resources when other cores are idle
  + No need to enforce coherence across L2 (though coherence must still be enforced across the L1s), and the L2 can act as a coherence point (i.e., directory)
  – The all-to-one interconnect takes up a large area and may become a bottleneck


• Note: L1 caches are tightly integrated into the pipeline and are an inseparable part of the core
• Note: processors nowadays have private L2 caches and a shared L3 cache


Priority Inversion

• In uniprocessors and multi-chip multiprocessors, processes with higher priority are given more resources (e.g., more processors, larger scheduling quanta, more memory/caches) → faster execution
• In CMPs with shared resources (e.g., LLC caches, off-chip memory bandwidth, issue slots with multithreading):
  – Dynamic allocation of resources to threads/processes is oblivious to the OS (e.g., LRU replacement policy in caches)
  – Hardware policies attempt to maximize utilization across the board
  – Hardware treats all threads/processes equally, and threads/processes compete dynamically for resources
• Thus, at run time, a lower priority thread/process may grab a larger share of the resources and execute relatively faster than a higher priority thread/process
• In more general terms, overall quality of service should be directly proportional to priority


Fair Cache Sharing

• Example: gzip co-scheduled with various applications on a CMP with a shared L2 (figure from Kim et al.)
  – Interference in the L2 causes gzip to suffer 3 to 10 times more L2 misses and to run at as low as half of its original speed
  – E.g., gzip + art: gzip suffers 10 times more misses and runs at 40% of its original speed, while art sees only 15% more misses and no significant slowdown


• Fair sharing
  – Condition for fair sharing:

    \[ \frac{T_{shr,1}}{T_{ded,1}} = \frac{T_{shr,2}}{T_{ded,2}} = \cdots = \frac{T_{shr,n}}{T_{ded,n}} \]

    where T_{ded,i} is the execution time of thread i when executed alone on the CMP with a dedicated LLC, and T_{shr,i} is its execution time when sharing the LLC with the other n−1 threads
  – To maximize fair sharing, minimize:

    \[ M_{ij} = |X_i - X_j| \quad \text{where} \quad X_i = \frac{T_{shr,i}}{T_{ded,i}} \]

  – Possible solution: partition the cache into different-sized portions, either statically or at run time
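To make the metric concrete, here is a minimal C sketch (the timing arrays and all names are made up for illustration, not taken from the lecture) that computes each thread's slowdown X_i and reports the worst pairwise unfairness M_ij:

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical per-thread execution times (seconds):
   t_ded[i] = thread i run alone with a dedicated LLC,
   t_shr[i] = thread i run while sharing the LLC. */
static const double t_ded[] = { 10.0, 12.0, 8.0 };
static const double t_shr[] = { 18.0, 14.0, 9.0 };
#define N (sizeof t_ded / sizeof t_ded[0])

int main(void) {
    double x[N];
    for (size_t i = 0; i < N; i++)
        x[i] = t_shr[i] / t_ded[i];          /* slowdown X_i */

    /* Fair sharing wants all X_i equal; M_ij = |X_i - X_j|
       measures how far a pair of threads is from that ideal. */
    double worst = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = i + 1; j < N; j++) {
            double m = fabs(x[i] - x[j]);
            if (m > worst) worst = m;
        }
    printf("worst-case unfairness M = %.2f\n", worst);
    return 0;
}
```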

Partitioning Caches

• HW support for partitioning:
  – Constraining cache placement
  – Constraining cache replacement
• How to partition (a sketch of one replacement-constraining mechanism follows below):
  – Static fair caching
  – Dynamic fair caching
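As an illustration of constraining cache replacement, here is a sketch of way partitioning in C: each core is given a quota of ways per set, and on a miss the replacement logic only victimizes ways consistent with that quota. This is one plausible mechanism in the spirit of the partitioning schemes above, not the lecture's specific design; all names are illustrative.

```c
#define WAYS 8

/* One set of a set-associative cache: the core that owns each way
   and an LRU rank per way (higher rank = older). */
typedef struct {
    int owner[WAYS];
    int lru_rank[WAYS];
} CacheSet;

/* On a miss by `core` (quota must be >= 1): if the core is at or
   above its quota of ways, replace the LRU way among its own ways,
   so its partition cannot grow; otherwise replace the globally LRU
   way, possibly taking a way from a core that is over quota. */
int pick_victim(const CacheSet *s, int core, int quota) {
    int owned = 0;
    for (int w = 0; w < WAYS; w++)
        if (s->owner[w] == core)
            owned++;

    int victim = -1, oldest = -1;
    for (int w = 0; w < WAYS; w++) {
        if (owned >= quota && s->owner[w] != core)
            continue;                      /* at quota: own ways only */
        if (s->lru_rank[w] > oldest) {
            oldest = s->lru_rank[w];
            victim = w;
        }
    }
    return victim;  /* way to evict; caller sets owner[victim] = core */
}
```

Static fair caching would fix each core's `quota` once; dynamic fair caching would periodically recompute the quotas from measured miss rates or slowdowns (the X_i above).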


NUCA LLC Caches

• On-chip LLC caches are expected to continue increasing in size
• Such caches are logically divided into a few (2 to 8) logical banks with independent access
• Banks are physically divided into small (128KB to 512KB) sub-banks
• L3 caches will likely have 32 or more sub-banks
• Increasing wire delays mean that sub-banks closer to a given processor can be accessed more quickly than sub-banks farther away
• Also, some sub-banks will invariably be close to one processor and far from another, and some sub-banks will be at similar distances from a few processors
• Bottom line: a single, uniform access time (dictated by the slowest sub-bank) will be increasingly inefficient


• Key ideas:
  – Allow and exploit the fact that different sub-banks have different access times
  – Dynamically map and migrate the most heavily used lines to the sub-banks closest to the processor that uses them (sketched below)
  – By tweaking the dynamic mapping and migration mechanisms, such NUCA caches can adapt between shared and private organizations
  – Obviously, with such dynamic mapping and migration, searching the cache and performing replacements becomes more expensive
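A minimal sketch of the migration idea, assuming a 1-D row of sub-banks with one core at each end; the layout, latency model, and names are invented for illustration (the original NUCA proposal by Kim et al. uses "generational promotion", moving a line one bank closer on each hit):

```c
#define NBANKS 8   /* sub-banks in a row: core 0 by bank 0, core 1 by bank NBANKS-1 */

/* Sub-bank currently holding a given cache line. */
typedef struct { int bank; } LineLoc;

/* Assumed latency model: closer sub-banks respond faster. */
int access_latency(int core, int bank) {
    int core_bank = (core == 0) ? 0 : NBANKS - 1;
    int hops = (bank > core_bank) ? bank - core_bank : core_bank - bank;
    return 4 + 2 * hops;   /* illustrative: 4-cycle bank + 2 cycles per hop */
}

/* On each hit, migrate the line one sub-bank toward the requester, so
   heavily used lines gradually settle in the requester's fast banks. */
void on_hit(LineLoc *line, int core) {
    int target = (core == 0) ? 0 : NBANKS - 1;
    if (line->bank < target)      line->bank++;
    else if (line->bank > target) line->bank--;
}
```

Note how the same mechanism spans the shared/private spectrum: a line used by both cores oscillates toward the middle banks (shared behavior), while a line used by only one core drifts all the way to that core's end (private behavior).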


Directory Coherence On-Chip?

[Diagram: a two-node CC-NUMA multiprocessor, each node with a CPU, an L2 cache, and memory with its directory, alongside a two-tile CMP, each tile with a CPU, an L1 cache, and an L2 cache slice with its directory.]

• One-to-one mapping from CC-NUMA?
  – L2 cache → L1 cache
  – Main memory → L2 cache
  – Directory entry per memory line → directory entry per L2 cache line
  – Memory lines mapped to physical memory by a first-touch policy at OS page granularity → L2 lines mapped to a physical L2 slice by a first-touch policy at OS page granularity


Directory Coherence On-Chip

• The mapping problem (home node):
  – OS page granularity is too coarse and may lead to imbalance in the mapping
  – Line granularity with first-touch needs a hardware/OS mapping of every individual cache line to a physical L2 slice (too expensive)
  – Solution: map at line granularity, but circularly based on the physical address (mem. line 0 maps to L2 #0, mem. line 1 maps to L2 #1, etc.), as sketched below
  – The problem with this solution is that locality of use is lost!
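The circular (line-interleaved) home mapping fits in one line of C. The 64B line size and 8 slices are assumptions for illustration, not values from the slide:

```c
#include <stdint.h>

#define LINE_BYTES 64u  /* assumed cache-line size */
#define NUM_SLICES 8u   /* assumed number of L2 slices (one per tile) */

/* Consecutive memory lines go to consecutive L2 slices: load is
   perfectly balanced, but a thread's working set is scattered across
   the whole chip, which is exactly why locality of use is lost. */
static inline unsigned home_slice(uint64_t paddr) {
    return (unsigned)((paddr / LINE_BYTES) % NUM_SLICES);
}
```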

• The eviction problem:
  – Upon eviction of an L2 (memory) line, the corresponding directory entry is lost and all L1 cached copies must be invalidated (acceptable for the rare paging case in CC-NUMA, but not for a small L2)
  – Solution: associate directory entries not with L2 cache lines but with cached L1 lines (replicated tags and an exclusive L1/home-L2 relationship)


Exclusivity with Replicated Tags

• The directory contains a copy of the L1 tags of the lines mapped to the home L2, but the L2 does not have to keep the L1 data itself
  + Good: lines can be evicted from the L2 silently (by exclusivity, they are not cached in any L1), and the directory does not change
  – Bad: the replicated tags (i.e., the directory information) grow with the number of L1 caches
    · E.g., for 8 cores with 32KB L1s, 32B lines (i.e., 1024 lines per L1), and full associativity → 8 × 1024 = 8,192 entries per directory

[Diagram: two CMP tiles, each with a CPU, a private L1 cache, and an L2 cache slice whose directory holds replicated L1 tags.]
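The slide's sizing arithmetic, written out as a small C program (parameters copied from the example above):

```c
#include <stdio.h>

int main(void) {
    unsigned cores     = 8;
    unsigned l1_bytes  = 32 * 1024;   /* 32KB L1 per core */
    unsigned line_size = 32;          /* 32B cache lines  */

    unsigned l1_lines    = l1_bytes / line_size;  /* 1024 lines per L1 */
    unsigned dir_entries = cores * l1_lines;      /* 8 x 1024 = 8192   */

    /* Replicated-tag storage grows linearly with the number of L1s:
       each home directory must be able to mirror every L1's tags. */
    printf("replicated-tag entries per directory: %u\n", dir_entries);
    return 0;
}
```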


References and Further Reading

• Early study of chip-multiprocessors:
  "The Case for a Single-Chip Multiprocessor", K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1996.
• More recent study of chip-multiprocessors (throughput-oriented):
  "Maximizing CMP Throughput with Mediocre Cores", J. Davis, J. Laudon, and K. Olukotun, Intl. Conf. on Parallel Architecture and Compilation Techniques, September 2005.
• First NUCA cache proposal (for uniprocessors):
  "An Adaptive, Non-uniform Cache Structure for Wire-delay Dominated On-chip Caches", C. Kim, D. Burger, and S. Keckler, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 2002.


• NUCA cache study for CMPs:
  "Managing Wire Delay in Large Chip-Multiprocessor Caches", B. Beckmann and D. Wood, Intl. Symp. on Microarchitecture, December 2004.
• Fair cache sharing studies:
  "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture", S. Kim, D. Chandra, and Y. Solihin, Intl. Conf. on Parallel Architecture and Compilation Techniques, October 2004.
  "CQoS: A Framework for Enabling QoS in Shared Caches of CMP Platforms", R. Iyer, Intl. Conf. on Supercomputing, June 2004.
• Other studies on priorities and quality of service in CMP/SMT:
  "Symbiotic Job-Scheduling with Priorities for Simultaneous Multithreading Processors", A. Snavely, D. Tullsen, and G. Voelker, Intl. Conf. on Measurement and Modeling of Computer Systems, June 2002.