

Scalable Shared-Memory Implementations

Erik Hagersten, Uppsala University, Sweden


Cache-to-cache in snoop-based

[Figure: three threads, each with a cache, on a snooping bus. Thread A runs ...Read A...Write A; thread B runs Read B...Read A. B's Read A becomes a BusRTS (read-to-share); B notes "my RTS, wait for data" while the cache owning A snoops the request ("gotta answer") and supplies the data cache-to-cache.]


"Upgrade" in dir-based

[Figure: three threads with caches and a directory. Thread A runs ...Read A...Write A; thread B runs Read B...Read A. On A's write, the directory is asked "who has a copy?" and sends INV to every sharer; each sharer answers with an ACK before the write can complete.]


Cache-to-cache in dir-based

[Figure: three threads with caches and a directory. Thread A runs ...Read A...Write A; thread B runs Read B...Read A. B's read request goes to the directory ("who has a copy?"), which forwards it to the owning cache as a ReadDemand; the owner supplies the data to B, and an Ack completes the transaction.]


Directory-based coherence: per-cache-line info in the memory

[Figure: three threads, each with a cache holding per-line state; the memory holds A and B together with per-cache-line directory state, maintained by the directory protocol.]


Directory-based snooping: NUMA. Per-cache-line info in the home node

[Figure: two nodes, each with a thread + cache, local memory (A homed in one node, B in the other), a directory protocol engine, and per-cache-line directory state, connected by an interconnect.]


Why directory-based?

• P2P messages => high bandwidth
• Suits out-of-the-box coherence
• Much more scalable!

Note: dir-based can be used to build a uniform-memory architecture (UMA)
• Bandwidth will be great!!
• Memory latency will be OK
• Cache-to-cache latency will not!
• Memory overhead can be high (storing the directory...)


Cache-to-cache in snoop-based

[Figure: the snoop-based cache-to-cache transfer from the earlier slide, repeated for comparison.]


Cache-to-cache in dir-based

[Figure: the directory-based cache-to-cache transfer from the earlier slide, repeated for comparison.]


Fully mapped directory

• k nodes
• Each node is the "home" for 1/k of the memory
• Dir entry per cache line in home memory: k presence bits + 1 dirty bit
• Requests are first sent to the home node's CA

[Figure: memory and directory: one entry per line, holding the presence bits and the dirty bit.]
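A minimal sketch of such a directory entry and the home node's handling of a read request, in C. The structure follows the slide (k presence bits + 1 dirty bit); the function and helper names are illustrative, not Sun's implementation:

```c
/* Hypothetical sketch of a fully mapped directory entry and the
 * home node's read handling. Names are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define K_NODES 16                  /* k nodes; one presence bit each */

typedef struct {
    uint32_t presence;              /* bit i set => node i has a copy */
    bool     dirty;                 /* set => the single sharer owns it */
} dir_entry_t;

/* Stand-ins for the real network/memory actions. */
static void recall_from_owner(int owner) { printf("recall line from node %d\n", owner); }
static void send_data(int node)          { printf("send data to node %d\n", node); }

void handle_read(dir_entry_t *e, int requester)
{
    if (e->dirty) {                 /* exactly one presence bit is set */
        for (int n = 0; n < K_NODES; n++)
            if (e->presence & (1u << n))
                recall_from_owner(n);   /* owner -> shared, memory updated */
        e->dirty = false;
    }
    e->presence |= 1u << requester; /* requester becomes a sharer */
    send_data(requester);           /* data is now clean in home memory */
}

int main(void)
{
    dir_entry_t e = { .presence = 1u << 3, .dirty = true };
    handle_read(&e, 5);             /* node 5 reads a line dirty in node 3 */
    return 0;
}
```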


Reducing the Memory Overhead: SCI

Scalable Coherence Interface (SCI):
• home only holds a pointer to the rest of the directory info [log(N) bits]
• distributed linked list of copies, weaves through the caches
• each cache tag has a pointer to the next cache with a copy
• on a read, add yourself to the head of the list (communication needed)
• on a write, propagate a chain of invalidations down the list
• on a replacement, remove yourself from the list

[Figure: the home memory holds the data and a log N-bit pointer to the first sharer; the sharing list then weaves through the caches of nodes 0, 1, and 2.]
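A hedged sketch of the list operations in C. Real SCI keeps a doubly linked list in the cache tags so a replacement can unlink itself in place; this singly linked model and its names are only illustrative:

```c
/* Hypothetical sketch of an SCI-style distributed sharing list.
 * In real SCI the list lives in the cache tags of the nodes; here
 * it is modeled with plain per-node arrays. */
#include <stdio.h>

#define N_NODES 8
#define NIL     (-1)

static int head = NIL;              /* home: log2(N)-bit pointer to first sharer */
static int next_ptr[N_NODES];       /* per-cache tag: pointer to next sharer */

void sci_read(int node)             /* add yourself to the head of the list */
{
    next_ptr[node] = head;          /* requires communication with the old head */
    head = node;
}

void sci_write(int writer)          /* invalidations propagate down the list */
{
    for (int n = head; n != NIL; n = next_ptr[n])
        if (n != writer)
            printf("invalidate node %d\n", n);
    head = writer;                  /* writer is now the only copy */
    next_ptr[writer] = NIL;
}

int main(void)
{
    for (int n = 0; n < N_NODES; n++) next_ptr[n] = NIL;
    sci_read(2); sci_read(5); sci_read(7);  /* list: 7 -> 5 -> 2 */
    sci_write(5);                           /* invalidates 7 and 2 */
    return 0;
}
```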


Cache Invalidation Patterns

[Bar charts; x-axis: # of invalidations per invalidating write, y-axis: %.]

Barnes-Hut invalidation patterns (%):
0: 1.27   1: 48.35   2: 22.87   3: 10.56   4: 5.33   5: 2.87   6: 1.88   7: 1.40
8-11: 2.50   12-15: 1.06   16-19: 0.61   20-23: 0.24   24-27: 0.28   28-31: 0.20
32-35: 0.06   36-39: 0.10   40-43: 0.07   44-47: 0   48-51: 0   52-55: 0   56-59: 0   60-63: 0.33

Radiosity invalidation patterns (%):
0: 6.68   1: 58.35   2: 12.04   3: 4.16   4: 2.24   5: 1.59   6: 1.16   7: 0.97
8-11: 3.28   12-15: 2.20   16-19: 1.74   20-23: 1.46   24-27: 0.92   28-31: 0.45
32-35: 0.37   36-39: 0.31   40-43: 0.28   44-47: 0.26   48-51: 0.24   52-55: 0.19   56-59: 0.19   60-63: 0.91


Overflow Schemes for Limited Pointers

• Broadcast (DiriB): a broadcast bit is turned on upon overflow; bad for widely-shared invalidated data
• No-broadcast (DiriNB): on overflow, a new sharer replaces one of the old ones (which is invalidated); bad for widely-read data
• Coarse vector (DiriCV): change representation to a coarse vector, 1 bit per k nodes; on a write, invalidate all nodes that a set bit corresponds to

Dir in memory:

[Figure: 16-node example. (a) No overflow: overflow bit = 0 and the entry holds 2 exact pointers. (b) Overflow: overflow bit = 1 and the entry is reinterpreted as an 8-bit coarse vector, one bit per group of 2 nodes.]
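A hedged sketch of the pointer-to-coarse-vector switch for the 16-node example in the figure (2 pointers, 8-bit coarse vector, 1 bit per 2 nodes); all names are illustrative:

```c
/* Hypothetical sketch of a limited-pointer directory entry with
 * coarse-vector overflow (DiriCV). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define N_NODES 16
#define N_PTRS   2
#define GROUP    2                     /* nodes per coarse-vector bit */

typedef struct {
    bool    overflow;                  /* 0: pointers valid, 1: coarse vector */
    union {
        int8_t  ptr[N_PTRS];           /* up to N_PTRS exact sharer ids */
        uint8_t coarse;                /* 1 bit per GROUP nodes */
    } u;
    int     nptrs;                     /* pointers in use (pointer mode only) */
} dir_entry_t;

void add_sharer(dir_entry_t *e, int node)
{
    if (!e->overflow && e->nptrs < N_PTRS) {
        e->u.ptr[e->nptrs++] = (int8_t)node;   /* exact representation */
        return;
    }
    if (!e->overflow) {                /* switch representation on overflow */
        uint8_t cv = 0;
        for (int i = 0; i < e->nptrs; i++)
            cv |= 1u << (e->u.ptr[i] / GROUP);
        e->overflow = true;
        e->u.coarse = cv;
    }
    e->u.coarse |= 1u << (node / GROUP);
}

void invalidate_sharers(const dir_entry_t *e)  /* on a write */
{
    if (!e->overflow) {
        for (int i = 0; i < e->nptrs; i++)
            printf("invalidate node %d\n", e->u.ptr[i]);
    } else {
        for (int n = 0; n < N_NODES; n++)      /* every node a set bit covers */
            if (e->u.coarse & (1u << (n / GROUP)))
                printf("invalidate node %d\n", n);
    }
}

int main(void)
{
    dir_entry_t e = { .overflow = false, .nptrs = 0 };
    add_sharer(&e, 3); add_sharer(&e, 9); add_sharer(&e, 14); /* overflows */
    invalidate_sharers(&e);   /* hits node groups {2,3}, {8,9}, {14,15} */
    return 0;
}
```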


Directory cache

[Figure: the two-node directory-based NUMA from before, with a directory cache (Dir$) added in front of each node's in-memory directory state.]


cc-NUMA issues

Memory placement is key!
• Gotta migrate data to where it's being used
• Gotta have cache affinity:
  - long times between process switches in the OS
  - reschedule a process on the CPU it last ran on
• SGI Origin 2000's migration: always turned off

[Figure: a cc-NUMA: nodes of processor + cache + memory on an interconnect.]


Three options for shared memory

[Figure: three ways to organize processors ($ P) and memory on an interconnect:
• COMA, cache-only (@SICS): each node's memory is an attraction memory (AM)
• NUMA, non-uniform: memory (M) distributed with the processors
• UMA, uniform (a.k.a. SMP): memory equidistant from all processors]

Sun’s E6000 Server Family

Erik Hagersten, Uppsala University, Sweden


What Approach to Shared Memory?

[Figure: four organizations:
(a) Shared cache: processors P1..Pn behind a switch share a first-level cache and interleaved main memory.
(b) Bus-based shared memory: per-processor caches on a bus with main memory and I/O devices (UMA).
(c) Dancehall: per-processor caches on one side of an interconnection network, memories on the other (UMA).
(d) Distributed memory: per-processor caches with memory attached at each node of the network (NUMA).]


Looks like a NUMA but drives like a UMA

[Figure: logical view: processors + caches and memories on opposite sides of an interconnection network (a dancehall). Physical view: each board carries CPUs, caches, and memory, all attached to the same interconnection network.]

• Memory bandwidth scales with the processor count
• One "interconnect load" per (2 x CPU + 2 x Mem)
• Optimize for the dancehall case (no memory shortcut)


SUN Enterprise Overview

• 16 slots with either CPUs or I/O
• Up to 30 UltraSPARC processors (peak 9 GFLOPS)
• Gigaplane™ bus has a peak bandwidth of 2.67 GB/s; up to 30 GB memory
• 16 bus slots, for processing or I/O boards

[Figure: CPU/Mem cards (two CPUs with L2 caches, a memory controller, and a bus interface/switch) and I/O cards on the Gigaplane™ bus (256 data, 41 address + ctrl, 83-100 MHz).]


Enterprise Server E6000

[Figure: E6000: 16 boards on the interconnect, each carrying two processors + caches and memory, plus I/O.]


An E6000 Proc Board

[Figure: E6000 processor board: two CPUs with caches plus memory. An address controller holds processor tags and snoop tags and connects to the address bus (80 signals: addr, uid, arb, ...); the data path connects to the data bus (288 signals: 256 data + ECC).]


An I/O Board

[Figure: two I/O bridges with caches behind an address controller; address bus (80 signals: addr, uid, arb, ...) and data bus (288 signals: 256 data + ECC).]


Split-Transaction Bus

[Figure: a transaction split into an Address/CMD phase and a Data phase separated by the memory access delay; other Address/CMD transactions use the bus in between, after their own bus arbitration.]

• Split each bus transaction into request and response sub-transactions
• Separate arbitration for each phase
• Other transactions may intervene; improves bandwidth dramatically
• Each response is matched to its request
• Buffering between the bus and the cache controllers

[Timing sketch: the address signals carry A A A ... while the data signals return D D ... for earlier requests.]
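A minimal sketch of the matching mechanism, assuming (as on the Gigaplane slides below) that each request carries a unique id (uid) that the response repeats; all names are illustrative:

```c
/* Hypothetical sketch of split-transaction matching: requests carry a
 * uid, and the (possibly out-of-order) response is matched by that uid. */
#include <stdio.h>

#define MAX_OUTSTANDING 8

typedef struct { int uid; int addr; int valid; } pending_t;

static pending_t pending[MAX_OUTSTANDING];

void issue_request(int uid, int addr)          /* address phase */
{
    pending[uid % MAX_OUTSTANDING] = (pending_t){ uid, addr, 1 };
    printf("Address/CMD: read 0x%x, uid=%d\n", addr, uid);
}

void receive_data(int uid)                     /* data phase, maybe much later */
{
    pending_t *p = &pending[uid % MAX_OUTSTANDING];
    if (p->valid && p->uid == uid) {
        printf("Data: uid=%d completes read of 0x%x\n", uid, p->addr);
        p->valid = 0;
    }
}

int main(void)
{
    issue_request(1, 0x1000);
    issue_request(2, 0x2000);   /* intervenes before uid 1's data returns */
    receive_data(2);            /* responses may return out of order */
    receive_data(1);
    return 0;
}
```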


Gigaplane Bus Timing

[Timing diagram: rows Arbitration, Address, State, Tag, Status, Data over bus cycles 0-14 (alternating A/D slots). Rd A (uid1): snoop state Share/~Own resolves in cycles 4-5, status OK, and memory returns data beats D0 D1 tagged uid1. Rd B (uid2): a board asserts Own, so a Cancel stops the memory fetch and the owner supplies the data instead.]


Electrical Characteristics of the Bus

• At most 16 electrical loads per signal: 8 boards from each side (e.g., 15 CPU + 1 I/O)
• 20.5-inch "centerplane"
• Well-controlled impedance
• ~350-400 signals
• Runs at 90/100 MHz


Address Controller

[Figure: the address controller in detail: an output queue (OQ) and an input queue (IQ) face the address bus (addr, arb, aid, did); the coherence protocol engine (CohProt) consults the processor tags and the snoop tags; data flows through a separate data path.]


Dual State Tags

[Figure: the same address controller, with the tags relabeled by role: the processor-side tags hold the access-right state and the snoop-side tags hold the obligation state.]
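A hedged sketch of this split (illustrative names, not Sun's implementation): the obligation state is updated in bus order when the address packet is snooped, while the access right changes only once the data arrives, matching the sequence animated on the following slides:

```c
/* Hypothetical sketch of dual state tags. */
#include <stdio.h>

typedef enum { I, S, O, M } mosi_t;

typedef struct {
    mosi_t access_right;   /* proc tags: what the CPU may do now     */
    mosi_t obligation;     /* snoop tags: what we owe the bus/others */
} dual_tag_t;

void on_own_rtw_address(dual_tag_t *t)  /* our RTW wins arbitration */
{
    t->obligation = M;     /* committed to ownership already...      */
}

void on_own_rtw_data(dual_tag_t *t)     /* ...but may not write until */
{
    t->access_right = M;   /* the data has actually arrived          */
}

int main(void)
{
    dual_tag_t t = { I, I };
    on_own_rtw_address(&t);
    printf("after address: AR=%d OB=%d\n", t.access_right, t.obligation);
    on_own_rtw_data(&t);
    printf("after data:    AR=%d OB=%d\n", t.access_right, t.obligation);
    return 0;
}
```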


Timing of a single read transaction: board 1 reading from memory 2

[Same timing diagram as before (Rd A uid1, Rd B uid2, Share/~Own, Own, Cancel, D0 D1), with a legend distinguishing fixed-latency phases from shortest-possible-latency phases.]


Protocol tuned for timing

[Same timing diagram, annotated: address decode, SRAM (tag) lookup, and DRAM access are overlapped so that a read completes in 11 cycles = ~110 ns.]


State Change on Address Packet: foreign and own transactions queue in the IQ

• Data "A" initially resides in CPU7's cache
• CPU1 issues a store request to "A": a Read-To-Write (RTW) request, ID=d (i.e., a "write request")
• CPU13: LD "A" -> Read-To-Share (RTS) request, ID=e
• CPU15: ST "A" -> RTW request, ID=f

• The mRTW (m = my/own, f = foreign) is stored in IQ_CPU1; the own read transaction is retired when its data arrives
• Later requests for A are queued in IQ_CPU1 behind the mRTW
• IQ_CPU1 will eventually hold: <mRTW_IDd, fRTS_IDe, fRTW_IDf>
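A hedged model of this queueing discipline (illustrative names; the real IQ is hardware): transactions enter in bus order, and everything queued behind the own transaction is serviced only after its data retires it:

```c
/* Hypothetical sketch of the input queue (IQ): own ('m') and foreign
 * ('f') address transactions queue in bus order. */
#include <stdio.h>

typedef struct { char kind; char id; } trans_t;   /* kind: 'm' or 'f' */

#define IQ_MAX 8
static trans_t iq[IQ_MAX];
static int iq_len = 0;

void snoop(char kind, char id)          /* address packet seen on the bus */
{
    iq[iq_len++] = (trans_t){ kind, id };
    /* snoop tags (obligation state) change here, at the address packet */
}

void data_arrives(char id)              /* data packet for our own trans */
{
    /* Retire the own transaction, then service everything queued behind
       it in bus order: proc tags (access right) change only now. */
    int i = 0;
    while (i < iq_len && !(iq[i].kind == 'm' && iq[i].id == id)) i++;
    for (int j = i; j < iq_len; j++)
        printf("%s trans ID=%c serviced\n",
               iq[j].kind == 'm' ? "own" : "foreign", iq[j].id);
    iq_len = 0;
}

int main(void)
{
    snoop('m', 'd');    /* CPU1's own RTW */
    snoop('f', 'e');    /* CPU13's RTS, queued behind the mRTW */
    snoop('f', 'f');    /* CPU15's RTW, queued behind both */
    data_arrives('d');  /* retire mRTW, then answer fRTS and fRTW in order */
    return 0;
}
```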


[Animation, one bus event per frame; Proctags = access right, Snooptags = obligation:
1. The CPU issues Store A; Proctags = I, Snooptags = I.
2. An RTW is placed in the OQ.
3. The own RTW wins the bus (mRTW, ID=d): Snooptags => M at the address packet; the mRTW is queued in the IQ to await its data.
4. A foreign RTS (fRTS, ID=e) is snooped: Snooptags => O; IQ = <mRTW_IDd, fRTS_IDe>.
5. A foreign RTO (fRTO, ID=f) is snooped: Snooptags => I; IQ = <mRTW_IDd, fRTS_IDe, fRTW_IDf>.
6. Data, ID=d arrives: the mRTW retires and Proctags => M.
7. The queued fRTS is serviced: Proctags => O; Data, ID=e is sent.
8. The queued fRTW is serviced: Proctags => I; Data, ID=f is sent.]


A cascade of "write requests"

A initially resides in CPU7's cache. On the bus:
• CPU1: RTW, ID=a
• CPU2: RTW, ID=b
• ...
• CPU5: RTW, ID=f

IQ_1 = <mRTW_IDa, fRTW_IDb>
IQ_2 = <mRTW_IDb, fRTW_IDc>
...
IQ_5 = <mRTW_IDf>
...
IQ_7 = <fRTW_IDa>

[Figure: CPU tags and snoop tags across the boards (I/S/M states).]

Implementing Sun’s SunFire 6800

Erik Hagersten, Uppsala University, Sweden


FirePlane, 24 CPUs

[Figure: CPU boards, each with CPUs + caches, data switches (DCDS), and memory, connected through per-board address repeaters (Addr. Rep.).]

• L2 cache = 8 MB, snoop tags on-chip
• CPU: 1+ GHz UltraSPARC III
• Mem = 4+ GB/CPU


[Figure: a CPU issues an address transaction (CMD, Addr, ID), with ID = <CPU#, Uid>, up through its address repeater.]


[Animation: the address repeaters broadcast the transaction to all boards, and every board snoops it; the board holding the data responds "Here it is!!". The data then returns through the data repeaters (DR) as a Data, ID transaction, matched to the request by ID = <CPU#, Uid>.]

Sun’s WildFire System

Erik Hagersten, Uppsala University, Sweden


Sun’s WildFire System

• Runs unmodified SMP apps in a more scalable way than E6000
• Minor modifications to E6000 snooping required
• CPUs generate local addresses OR global addresses
• Global address --> no replication (NUMA)
• Coherent Memory Replication (~Simple COMA @ SICS)
• Hardware support for detecting migration/replication pages
• Directory cache + address translation cache, backed by memory
• Deterministic directory implementation (easy to verify)


WildFire: One Solaris spanning four nodes


COMA: self-optimizing DSM

[Figure: ccNUMA: nodes of processors + caches + memory on a switch. COMA: the per-node memory is replaced by a large attraction-memory cache.]

COMA:
• Self-optimizing architecture
• Problem at high memory pressure
• Complex hardware and coherence protocol


Adaptive S-COMA of Large SMPs

• A page may have space allocated in many nodes
• HW maintains memory coherence per cache line
• Replication under SW control --> simple HW (S-COMA)
• Adaptive replication algorithm in the OS (R-NUMA)
• Coherent Memory Replication (CMR)
• Hierarchical affinity scheduler (HAS)
• Few large nodes --> simple interconnect and coherence protocol

[Figure: a few large SMP nodes (processors + caches + memory on a switch, 28 CPUs each) joined by an interconnect, up to 4 nodes.]


A WildFire Node

16 slots with either CPUs, I/O, or a WildFire board

[Figure: an E6000 chassis: CPU/Mem cards and I/O cards on the Gigaplane™ bus (256 data, 41 address + ctrl, 83-100 MHz), with a WildFire extension board occupying one bus interface.]

• Up to 28 UltraSPARC processors
• Gigaplane™ bus has a peak bandwidth of 2.67 GB/s
• Local access time of 330 ns (lmbench)


Sun WildFire Interface Board

[Figure: the WildFire interface board: address controller, SRAM, data buffers, and three links ("this space for rent").]


Sun WildFire Interface Board

[Photo: the board's address bus, data bus, and link connectors.]


WildFire as a vanilla "NUMA"

[Figure: two WildFire nodes on an interconnect. Each node holds processors + caches, memory, an interface (I/F), a directory cache (Dir$, 8 bits/line), and Mtag state (2 bits/line) kept with the memory.]


NUMA -- local memory access

[Figure: a local memory access checks the line's Mtag: "Access right OK?" If so, the access completes in local memory with no global action.]


NUMA -- remote memory access

[Figure: a remote memory access crosses the interconnect to the home node, whose Dir$ answers "who has the data?".]

SRAM overhead = 10 bits (8 Dir$ + 2 Mtag) per 512-bit line = 10/512 = ~2% (lower bound 2/512 = 0.4%)


Global Cache Coherence Protocol

[Figure: the global protocol in action: the home modifies its directory entry, a reply (with data) is sent to the requester, and the requester's access right changes when the reply arrives.]


NUMA -- local memory access

[Figure: a local memory access whose Mtag check fails: "Access right OK? NO!!" The request must instead be handled by the global coherence protocol.]


Gigaplane Bus Timing

[Gigaplane timing diagram as before, relabeled with uidA/uidB (Rd A, Rd B, Share/~Own, Own, Cancel, D0 D1); a Rd C (uidC) appears with its slots crossed out (XXX).]


WildFire Bus Extensions

[Timing diagram: for Rd A (uidA) the State phase shows Share/Ignore: WildFire asserts Ignore, the transaction's remaining slots are squashed (XXX), and WildFire later resends the same Rd A uidA.]

• An Ignore transaction squashes an ongoing transaction => it is not put into any IQ
• WildFire eventually reissues the same transaction
• RTSF -- a new transaction that sends data to both the CPU and memory
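A hedged sketch of the Ignore mechanism (the `needs_global_action` test is a stand-in for WildFire's real Mtag/Dir$ checks; all names are illustrative):

```c
/* Hypothetical sketch of the Ignore extension: a transaction that needs
 * global (directory) handling is squashed so no board queues it, is
 * handled globally, and is later reissued identically. */
#include <stdbool.h>
#include <stdio.h>

static bool needs_global_action(int addr) { return addr >= 0x8000; } /* stand-in */

void snoop_address(int addr, int uid)
{
    if (needs_global_action(addr)) {
        printf("Ignore asserted: uid=%d squashed, not put in any IQ\n", uid);
        printf("...global protocol runs...\n");
        printf("WildFire resends: addr=0x%x uid=%d\n", addr, uid);
        return;
    }
    printf("uid=%d enters the IQs as usual\n", uid);
}

int main(void)
{
    snoop_address(0x1000, 1);   /* plain local transaction */
    snoop_address(0x9000, 2);   /* squashed, later reissued */
    return 0;
}
```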


WildFire Directory -- only 4 nodes!!

• k nodes (each with one or more procs)
• With each cache block in memory: k presence bits, 1 dirty bit
• With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit

[Figure: processors + caches on an interconnection network; the memory directory holds presence bits and a dirty bit per block.]

ReadRequest to main memory by processor i:
• if dirty bit OFF, then { read from main memory; turn p[i] ON }
• if dirty bit ON, then { recall the line from the dirty proc (its cache state goes to shared); update memory; turn dirty bit OFF; turn p[i] ON; supply the recalled data to i }
• ...
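The elided cases follow the same pattern. A hedged C sketch of the corresponding WriteRequest handling, consistent with the rules above and with the invalidate/ACK flow on the earlier "Upgrade" slide (all names illustrative):

```c
/* Hypothetical sketch of WriteRequest handling by processor i: all
 * other copies are invalidated (and acknowledged) before i owns it. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define K_NODES 4                     /* WildFire: only 4 nodes */

typedef struct { uint8_t presence; bool dirty; } dir_entry_t;

void handle_write(dir_entry_t *e, int i)
{
    for (int n = 0; n < K_NODES; n++)
        if ((e->presence & (1u << n)) && n != i)
            printf("invalidate node %d, wait for ACK\n", n);
    if (e->dirty)
        printf("recall dirty copy, supply data to node %d\n", i);
    e->presence = (uint8_t)(1u << i); /* i now holds the only copy */
    e->dirty = true;
}

int main(void)
{
    dir_entry_t e = { .presence = 0x6, .dirty = false }; /* nodes 1,2 share */
    handle_write(&e, 0);
    return 0;
}
```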


NUMA "detecting excess misses"

[Figure: a node requests data over the interconnect that the home thought it already had ("I thought you had the data!!"): the miss is flagged as an excess miss.]


Detecting a page for replication

[Figure: the reply returns as Data with an excess-miss bit (E-miss bit) set; a small array of associative counters, matched on the page address (= addr), counts excess misses per page to find candidates for replication.]


OS Initializes a CMR page

[Figure: each node adds an address translation table (AT) with a cache (AT$), matched on the address. To set up a CMR page, the OS creates a new V->P mapping in this node (per page, VA->PA) and initializes the access right of every cache line in the page to INV (per CL).]


An access to a CMR page

[Figure: an access to a CMR page is checked against the per-line access right: "Access right OK?"]


An access to a CMR page (miss)

[Figure: the check fails: "Access right OK? NO!!" The address translation (AT) maps the local page to its global address and the request goes out through the global protocol.]

Address Translation (AT) overhead = 8 B / 8 kB = 0.1%. No extra latency is added.


An access to a CMR page (miss)

[Figure: when the data returns, the line's MTAG is changed to "shared", so subsequent accesses to the line hit locally.]


An access to a CMR page (hit)

[Figure: "Access right OK? YES!" The access is satisfied from local memory with no global traffic.]


Deterministic Directory

• MOSI protocol, fully mapped directory (one bit/node)
• Directory blocking: one outstanding transaction per cache line
• The directory blocks new requests until a completion is received
• The directory state and cache state are always in agreement (except for silent replacement...)

[Figure: message flows between local node L, home H, and remote R:
(a) Write, remote shared: 1: req; 2a: demand to the owner R, 2b: inv to sharers; 3a: reply(D), 3b: ack; 4: compl.
(b) Read, remote dirty: 1: req; 2: demand; 3: reply(D); 4: compl.
(c) Writeback: 1: req; 2: ack/nack; 3: compl.(D).]
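A minimal sketch of the directory-blocking rule, modeling only the per-line busy bit (illustrative names):

```c
/* Hypothetical sketch: at most one outstanding transaction per cache
 * line; the home blocks newer requests until the completion arrives. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool busy;             /* a transaction is in flight for this line */
    /* ... presence bits and dirty bit, as in the fully mapped directory */
} dir_line_t;

bool home_request(dir_line_t *l, int node)
{
    if (l->busy) {
        printf("block request from node %d (line busy)\n", node);
        return false;      /* requester must retry later */
    }
    l->busy = true;        /* block the line until completion */
    printf("demand/reply sent for node %d\n", node);
    return true;
}

void home_completion(dir_line_t *l)
{
    l->busy = false;       /* directory and caches agree again */
}

int main(void)
{
    dir_line_t l = { false };
    home_request(&l, 1);   /* accepted; line blocked */
    home_request(&l, 2);   /* blocked until node 1's completion */
    home_completion(&l);
    home_request(&l, 2);   /* now accepted */
    return 0;
}
```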


Replication Issues Revisited

[Figure: replication vs. "physical" memory consumed (mem_tot): ccNUMA replicates nothing, COMA may replicate up to the whole memory, and WildFire sits in between, controlled by the OS.]

• Only "promising" pages are replicated
• The OS dynamically limits the amount of replication
• The Solaris CMR changes are confined to the hat layer (= the platform port)


Advantages of Multiprocessor Nodes

Pros:
• amortization of fixed node costs over multiple processors
• can use commodity SMPs
• fewer nodes to keep track of in the directory
• much communication may stay within a node (NUCA)
• can share "node caches" (WildFire: Coherent Memory Replication)

Cons:
• bandwidth is shared among the processors and the interface
• the bus may increase latency to local memory
• the snoopy bus at the remote node increases delays there too


Memory cost of replication

Example: replicate 10% of the data in all nodes
• 50 nodes, each with 2 CPUs ==> 490% overhead
• 4 nodes, each with 25 CPUs ==> 30% overhead
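The arithmetic behind these numbers: replicating a fraction r of the data into each of the other k-1 nodes costs (k-1)·r extra memory:

```latex
\text{overhead} = (k-1)\,r
\qquad
\begin{aligned}
k=50,\ r=0.1 &:\quad 49 \times 0.1 = 490\% \\
k=4,\  r=0.1 &:\quad 3  \times 0.1 = 30\%
\end{aligned}
```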


Does migration/replication help?
NAS Parallel Benchmark study (execution time in seconds) [M. Bull, EPCC 2002]

Execution time (s); columns: No Initial Placement (No migr / Migr), then Initial Placement (No migr / Migr):

Shallow   No Repl:   26     5.9   |   6.1    6.1
          Repl:      7.2    6.2   |   6.1    6.1
BT        No Repl:   960    610   |   620    600
          Repl:      590    580   |   580    580
FT        No Repl:   520    330   |   380    260
          Repl:      250    260   |   190    200
SP        No Repl:   1540   780   |   760    780
          Repl:      670    680   |   670    670
MG        No Repl:   230    230   |   240    230
          Repl:      220    220   |   220    220
CG        No Repl:   1060   700   |   940    700
          Repl:      300    280   |   300    290

(Unopt. / SW opt. / HW opt.)


WildFire’s Technology Limits

[Figure: a WildFire node (Dir$ = 8 b/line, Mtag = 2 b/line) annotated with its technology limits:
• SRAM size = DRAM size / 256; snoop frequency
• Dir$ reach >> sum of the cache sizes
• Slow interconnect
• Hard to make busses faster]

Sun’s SunFire 15k/25k

Erik Hagersten, Uppsala University, Sweden


StarCat: Sun Fire 15k/25k (used in Lab 2)


Front Side


Back Side


StarCat, 72 CPUs

[Figure: each CPU board (CPUs + caches, DCDS data switches, memory) pairs with an expander board holding the global coherence protocol, a data repeater, and a Dir$. An active backplane with two 18x18 address crossbars connects 18 slots in total. Within a board: repeater snoop coherence; between boards: P2P directory coherence.]


StarCat Coherence Mechanism

[Figure: a remote request is snooped on the address path (Addr); every data transfer carries DATA + MTAG + ECC = 576 bits, and the MTAG is checked when the data arrives.]


StarCat, 72 CPUs

[Figure: the same organization, with the key directory trick called out: allocate a Dir$ entry only for write requests, and speculate on clean data on a Dir$ miss. In effect: WildFire ("WildCat") coherence without CMR and with a faster interconnect.]


Directory cache, but no directory (broadcast on Dir$ miss)

[Figure: two nodes as before, but each directory protocol engine has only a directory cache (Dir$): there is no in-memory directory state behind it.]
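A hedged sketch of this scheme (illustrative names and sizes): point-to-point on a Dir$ hit; on a miss, speculate that memory is clean while a broadcast snoop checks; allocate entries only on writes, since only written lines can be dirty somewhere:

```c
/* Hypothetical sketch of a directory cache with no backing directory. */
#include <stdio.h>

#define DIR_SETS 1024

typedef struct { int tag; int owner; int valid; } dir_line_t;
static dir_line_t dir_cache[DIR_SETS];

void handle_remote_read(int addr)
{
    dir_line_t *e = &dir_cache[addr % DIR_SETS];
    if (e->valid && e->tag == addr) {
        printf("Dir$ hit: forward read to owner node %d\n", e->owner);
    } else {
        /* No entry: speculate that the line is clean in memory while a
           broadcast snoop confirms no cache holds it dirty. */
        printf("Dir$ miss: speculate on memory data, broadcast snoop\n");
    }
}

void handle_remote_write(int addr, int writer)
{
    dir_line_t *e = &dir_cache[addr % DIR_SETS];
    /* Allocate an entry only for write requests. */
    *e = (dir_line_t){ .tag = addr, .owner = writer, .valid = 1 };
    printf("Dir$ allocate: node %d now owns line 0x%x\n", writer, addr);
}

int main(void)
{
    handle_remote_read(0x40);        /* miss: broadcast + speculate */
    handle_remote_write(0x40, 3);    /* allocate on write */
    handle_remote_read(0x40);        /* hit: point-to-point to node 3 */
    return 0;
}
```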


StarCat Performance Data

[Figure: the StarCat organization as above.]

• Latency = 200-340 ns
• Global bandwidth (GBW) = 43 GB/s; local bandwidth (LBW) = 86 GB/s
• Up to 104 CPUs (trading CPU slots for I/O)