
CS4/MSc Parallel Architectures - 2012-2013

Lect. 10: Chip-Multiprocessors (CMP)

• Main driving forces:
  – Complexity of design and verification of wider-issue superscalar processors
  – Performance gains of either wider issue width or deeper pipelines would be only marginal:
    · Limited ILP in applications
    · Wire delays and longer access times of larger structures
  – Power consumption of the large centralized structures necessary in wider-issue superscalar processors would be unmanageable
  – Increased relative importance of throughput-oriented computing as compared to latency-oriented computing
  – Continuation of Moore's law, so that more transistors fit on a chip


Early (ca. 2006) CMPs

• Example: Intel Core Duo
  – 2 cores
    · 12-stage pipeline
    · 2-way simultaneous multithreading (HT)
    · Up to 2.33GHz
    · P6 (Pentium M) microarchitecture
  – 2MB shared L2 cache
  – 151M transistors in 65nm technology
  – Power consumption between 9W and 30W


A Different Design for CMP

• Example: Intel Polaris (2007)
  – 80 cores
    · Single-issue, statically scheduled
    · 3.2GHz (up to 5GHz)
  – No shared L2 or L3 cache
  – No cache coherence
  – "Tiled" approach
    · Core + cache + router
    · Scalable, packet-switched interconnect (8x10 mesh)
  – Power consumption around 62W
• Example: Intel SCC (2010)
  – 48 cores (fully IA-32 compatible)


Now (2012)

• Example: Intel Core i7
  – 2, 4, 6, or 8 cores
    · 2-way simultaneous multithreading (HT)
    · Up to 3.5GHz
    · Sandy Bridge microarchitecture
  – Up to 20MB shared L3
  – Up to 2B transistors in 22nm technology
  – Power consumption between 45W and 150W


CMPs vs. Multi-chip Multiprocessors

• While conceptually similar to traditional multiprocessors, CMPs have specific issues:
  – Off-chip memory bandwidth: the number of pins per package does not increase much
  – On-chip interconnection network: wires and metal layers are a very scarce resource
  – Shared memory hierarchy: processors must share some lower-level cache (LLC, e.g., L2 or L3) and the on-chip links to it
  – Wire delays: the actual physical distances to be crossed for communication affect its latency
  – Power consumption and heat dissipation: both are much harder to fit within the limitations of a single chip package
  – Dark silicon


Shared vs. Private L2 Caches

• Private caches:
  + Less chance of negative interference between processors
  + Simpler interconnections
  – Possibly wasted storage in less loaded parts of the chip
  – Must enforce coherence across the L2s

• Shared caches:
  – More chance of negative interference between processors
  + Possible positive interference between processors
  + Better utilization of storage
  + Single/few threads have access to all resources when other cores are idle
  + No need to enforce coherence across L2 (though coherence must still be enforced across the L1s), and the L2 can act as a coherence point (i.e., directory)
  – The all-to-one interconnect takes up a large area and may become a bottleneck


• Note: L1 caches are tightly integrated into the pipeline and are an inseparable part of the core
• Note: processors nowadays have private L2 caches and a shared L3 cache


Priority Inversion

• In uniprocessors and multi-chip multiprocessors, processes with higher priority are given more resources (e.g., more processors, larger scheduling quanta, more memory/caches) → faster execution
• In CMPs with shared resources (e.g., LLC caches, off-chip memory bandwidth, issue slots with multithreading):
  – Dynamic allocation of resources to threads/processes is oblivious to the OS (e.g., LRU replacement policy in caches)
  – Hardware policies attempt to maximize utilization across the board
  – Hardware treats all threads/processes equally, and threads/processes compete dynamically for resources
• Thus, at run time, a lower priority thread/process may grab a larger share of the resources and execute relatively faster than a higher priority thread/process
• In more general terms, overall quality of service should be directly proportional to priority


Fair Cache Sharing

• Example: gzip co-scheduled with various applications on a CMP with a shared L2 (figure from Kim et al.)
  – Interference in the L2 causes gzip to suffer 3 to 10 times more L2 misses and to run at as low as half of its original speed
  – E.g., gzip + art: gzip suffers 10 times more misses and runs at 40% of its original speed, while art sees only 15% more misses and no significant slowdown


• Fair sharing
  – Condition for fair sharing:

    \[ \frac{T_{shr,1}}{T_{ded,1}} = \frac{T_{shr,2}}{T_{ded,2}} = \cdots = \frac{T_{shr,n}}{T_{ded,n}} \]

    where T_{ded,i} is the execution time of thread i when executed alone on the CMP with a dedicated LLC, and T_{shr,i} is its execution time when sharing the LLC with the other n−1 threads
  – To maximize fair sharing, minimize:

    \[ M_{ij} = |X_i - X_j| \quad \text{where} \quad X_i = \frac{T_{shr,i}}{T_{ded,i}} \]

  – Possible solution: partition the cache into different-sized portions, either statically or at run time
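To make the metric concrete, here is a minimal C sketch (the timing arrays and all names are made up for illustration, not taken from the lecture) that computes each thread's slowdown X_i and reports the worst pairwise unfairness M_ij:

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical per-thread execution times (seconds):
   t_ded[i] = thread i run alone with a dedicated LLC,
   t_shr[i] = thread i run while sharing the LLC. */
static const double t_ded[] = { 10.0, 12.0, 8.0 };
static const double t_shr[] = { 18.0, 14.0, 9.0 };
#define N (sizeof t_ded / sizeof t_ded[0])

int main(void) {
    double x[N];
    for (size_t i = 0; i < N; i++)
        x[i] = t_shr[i] / t_ded[i];          /* slowdown X_i */

    /* Fair sharing wants all X_i equal; M_ij = |X_i - X_j|
       measures how far a pair of threads is from that ideal. */
    double worst = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = i + 1; j < N; j++) {
            double m = fabs(x[i] - x[j]);
            if (m > worst) worst = m;
        }
    printf("worst-case unfairness M = %.2f\n", worst);
    return 0;
}
```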

Partitioning Caches

• HW support for partitioning:
  – Constraining cache placement
  – Constraining cache replacement
• How to partition (a sketch of one replacement-constraining mechanism follows below):
  – Static fair caching
  – Dynamic fair caching
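As an illustration of constraining cache replacement, here is a sketch of way partitioning in C: each core is given a quota of ways per set, and on a miss the replacement logic only victimizes ways consistent with that quota. This is one plausible mechanism in the spirit of the partitioning schemes above, not the lecture's specific design; all names are illustrative.

```c
#define WAYS 8

/* One set of a set-associative cache: the core that owns each way
   and an LRU rank per way (higher rank = older). */
typedef struct {
    int owner[WAYS];
    int lru_rank[WAYS];
} CacheSet;

/* On a miss by `core` (quota must be >= 1): if the core is at or
   above its quota of ways, replace the LRU way among its own ways,
   so its partition cannot grow; otherwise replace the globally LRU
   way, possibly taking a way from a core that is over quota. */
int pick_victim(const CacheSet *s, int core, int quota) {
    int owned = 0;
    for (int w = 0; w < WAYS; w++)
        if (s->owner[w] == core)
            owned++;

    int victim = -1, oldest = -1;
    for (int w = 0; w < WAYS; w++) {
        if (owned >= quota && s->owner[w] != core)
            continue;                      /* at quota: own ways only */
        if (s->lru_rank[w] > oldest) {
            oldest = s->lru_rank[w];
            victim = w;
        }
    }
    return victim;  /* way to evict; caller sets owner[victim] = core */
}
```

Static fair caching would fix each core's `quota` once; dynamic fair caching would periodically recompute the quotas from measured miss rates or slowdowns (the X_i above).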


NUCA LLC Caches

• On-chip LLC caches are expected to continue increasing in size
• Such caches are logically divided into a few (2 to 8) logical banks with independent access
• Banks are physically divided into small (128KB to 512KB) sub-banks
• L3 caches will likely have 32 or more sub-banks
• Increasing wire delays mean that sub-banks closer to a given processor can be accessed more quickly than sub-banks farther away
• Also, some sub-banks will invariably be close to one processor and far from another, and some sub-banks will be at similar distances from a few processors
• Bottom line: a single, uniform access time (dictated by the slowest sub-bank) will be increasingly inefficient


• Key ideas:
  – Allow and exploit the fact that different sub-banks have different access times
  – Dynamically map and migrate the most heavily used lines to the sub-banks closest to the processor that uses them (sketched below)
  – By tweaking the dynamic mapping and migration mechanisms, such NUCA caches can adapt between shared and private organizations
  – Obviously, with such dynamic mapping and migration, searching the cache and performing replacements becomes more expensive
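A minimal sketch of the migration idea, assuming a 1-D row of sub-banks with one core at each end; the layout, latency model, and names are invented for illustration (the original NUCA proposal by Kim et al. uses "generational promotion", moving a line one bank closer on each hit):

```c
#define NBANKS 8   /* sub-banks in a row: core 0 by bank 0, core 1 by bank NBANKS-1 */

/* Sub-bank currently holding a given cache line. */
typedef struct { int bank; } LineLoc;

/* Assumed latency model: closer sub-banks respond faster. */
int access_latency(int core, int bank) {
    int core_bank = (core == 0) ? 0 : NBANKS - 1;
    int hops = (bank > core_bank) ? bank - core_bank : core_bank - bank;
    return 4 + 2 * hops;   /* illustrative: 4-cycle bank + 2 cycles per hop */
}

/* On each hit, migrate the line one sub-bank toward the requester, so
   heavily used lines gradually settle in the requester's fast banks. */
void on_hit(LineLoc *line, int core) {
    int target = (core == 0) ? 0 : NBANKS - 1;
    if (line->bank < target)      line->bank++;
    else if (line->bank > target) line->bank--;
}
```

Note how the same mechanism spans the shared/private spectrum: a line used by both cores oscillates toward the middle banks (shared behavior), while a line used by only one core drifts all the way to that core's end (private behavior).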


Directory Coherence On-Chip?

[Diagram: a two-node CC-NUMA multiprocessor, each node with a CPU, an L2 cache, and memory with its directory, alongside a two-tile CMP, each tile with a CPU, an L1 cache, and an L2 cache slice with its directory.]

• One-to-one mapping from CC-NUMA?
  – L2 cache → L1 cache
  – Main memory → L2 cache
  – Directory entry per memory line → directory entry per L2 cache line
  – Memory lines mapped to physical memory by a first-touch policy at OS page granularity → L2 lines mapped to a physical L2 slice by a first-touch policy at OS page granularity


Directory Coherence On-Chip

• The mapping problem (home node):
  – OS page granularity is too coarse and may lead to imbalance in the mapping
  – Line granularity with first-touch needs a hardware/OS mapping of every individual cache line to a physical L2 slice (too expensive)
  – Solution: map at line granularity, but circularly based on the physical address (mem. line 0 maps to L2 #0, mem. line 1 maps to L2 #1, etc.), as sketched below
  – The problem with this solution is that locality of use is lost!
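The circular (line-interleaved) home mapping fits in one line of C. The 64B line size and 8 slices are assumptions for illustration, not values from the slide:

```c
#include <stdint.h>

#define LINE_BYTES 64u  /* assumed cache-line size */
#define NUM_SLICES 8u   /* assumed number of L2 slices (one per tile) */

/* Consecutive memory lines go to consecutive L2 slices: load is
   perfectly balanced, but a thread's working set is scattered across
   the whole chip, which is exactly why locality of use is lost. */
static inline unsigned home_slice(uint64_t paddr) {
    return (unsigned)((paddr / LINE_BYTES) % NUM_SLICES);
}
```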

• The eviction problem:
  – Upon eviction of an L2 (memory) line, the corresponding directory entry is lost and all L1 cached copies must be invalidated (acceptable for the rare paging case in CC-NUMA, but not for a small L2)
  – Solution: associate directory entries not with L2 cache lines but with cached L1 lines (replicated tags and an exclusive L1/home-L2 relationship)


Exclusivity with Replicated Tags

• The directory contains a copy of the L1 tags of the lines mapped to the home L2, but the L2 does not have to keep the L1 data itself
  + Good: lines can be evicted from the L2 silently (by exclusivity, they are not cached in any L1), and the directory does not change
  – Bad: the replicated tags (i.e., the directory information) grow with the number of L1 caches
    · E.g., for 8 cores with 32KB L1s, 32B lines (i.e., 1024 lines per L1), and full associativity → 8 × 1024 = 8,192 entries per directory

[Diagram: two CMP tiles, each with a CPU, a private L1 cache, and an L2 cache slice whose directory holds replicated L1 tags.]
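The slide's sizing arithmetic, written out as a small C program (parameters copied from the example above):

```c
#include <stdio.h>

int main(void) {
    unsigned cores     = 8;
    unsigned l1_bytes  = 32 * 1024;   /* 32KB L1 per core */
    unsigned line_size = 32;          /* 32B cache lines  */

    unsigned l1_lines    = l1_bytes / line_size;  /* 1024 lines per L1 */
    unsigned dir_entries = cores * l1_lines;      /* 8 x 1024 = 8192   */

    /* Replicated-tag storage grows linearly with the number of L1s:
       each home directory must be able to mirror every L1's tags. */
    printf("replicated-tag entries per directory: %u\n", dir_entries);
    return 0;
}
```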


References and Further Reading

• Early study of chip-multiprocessors:
  "The Case for a Single-Chip Multiprocessor", K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1996.
• More recent study of chip-multiprocessors (throughput-oriented):
  "Maximizing CMP Throughput with Mediocre Cores", J. Davis, J. Laudon, and K. Olukotun, Intl. Conf. on Parallel Architecture and Compilation Techniques, September 2005.
• First NUCA cache proposal (for uniprocessors):
  "An Adaptive, Non-uniform Cache Structure for Wire-delay Dominated On-chip Caches", C. Kim, D. Burger, and S. Keckler, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 2002.


• NUCA cache study for CMPs:
  "Managing Wire Delay in Large Chip-Multiprocessor Caches", B. Beckmann and D. Wood, Intl. Symp. on Microarchitecture, December 2004.
• Fair cache sharing studies:
  "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture", S. Kim, D. Chandra, and Y. Solihin, Intl. Conf. on Parallel Architecture and Compilation Techniques, October 2004.
  "CQoS: A Framework for Enabling QoS in Shared Caches of CMP Platforms", R. Iyer, Intl. Conf. on Supercomputing, June 2004.
• Other studies on priorities and quality of service in CMP/SMT:
  "Symbiotic Job-Scheduling with Priorities for Simultaneous Multithreading Processors", A. Snavely, D. Tullsen, and G. Voelker, Intl. Conf. on Measurement and Modeling of Computer Systems, June 2002.