20
Scalable Nanophotonic Interconnect for Cache Coherent Multicores Randy W. Morris, Jr. and Avinash K. Kodi Department of Electrical Engineering and Computer Science Ohio University, Athens, OH 45701 E-mail: [email protected], [email protected] Website: http://oucsace.cs.ohiou.edu/~avinashk/ WINDS 2010: Workshop on the Interaction between Nanophotonic Devices and Systems Atlanta, GA December 5, 2010

0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Scalable Nanophotonic Interconnect for Cache Coherent Multicores

Randy W. Morris, Jr. and Avinash K. KodiDepartment of Electrical Engineering and Computer Science

Ohio University, Athens, OH 45701 E-mail: [email protected], [email protected]

Website: http://oucsace.cs.ohiou.edu/~avinashk/

WINDS 2010: Workshop on the Interaction between Nanophotonic Devices and Systems

Atlanta, GADecember 5, 2010

Page 2: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Talk Outline

• Section I: Motivation & Background

• Section II: Dual Sub-Network for Snoopy Cache Coherent Nanophotonic Architecture

• Section IV: Performance Analysis

• Section V: Future Work

!

Page 3: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Why Nanophotonics?

"

2. Y. Hoskote, “A 5-GHz Mesh Interconnect for A Teraflops Processor,” IEEE Computer Society, 2007 pp. 51-61

Clock Distribution 11%Dual FPMACs 36 %Router & Links 28 %10-port RF 4%IMEM + DMEM 21%

Tile Power: Intel Tera-Flops (65 nm)2

28%

• Power consumption of Network-on-Chips (NoCs) 1 using metallic interconnects is projected to exceed expectation by a factor of 10

1. Reference : J.D.Owens, W.J.Dally, R.Ho, D.N.Jayasimha, S.W.Keckler and L.S.Peh, “Research Challenges for On-Chip Interconnection Networks”, IEEE Micro, vol. 27, no. 5, pp. 96 – 108, September-October 2007.

Nanophotonic Technology

- Low Power

- Small Footprint (10 – 15 !m)

- High Bandwidth (10 – 20 Gbps)

- CMOS Compatibility

Page 4: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

#

Resonant wavelength (!0) !0 ! m= neff ! 2"R

m # an integerneff # effective refractive indexR # radius of the ring resonator

Output Port 1

VR

Input Port 0 Output Port 0

n+ p+ n+ =VOFF=VONVR

Input Port 0 Output Port 0

n+ p+ n+

Micro-ring Resonators

1. Lipson, M., Compact Electro-Optic Modulators on a Silicon Chip, IEEE J. Sel. Top. Quant., Vol. 12, No. 6, Nov.-Dec. 2006, p. 1520-6.2. M. Lipson, Guiding, Modulating and Emitting Light on Silicon - Challenges and Opportunities, IEEE Journal of LightwaveTechnologies, Vol. 23, No. 12, 12 December 2005 (invited).

Page 5: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Cache Coherence

- Write propagation (write by any processor should become visible to all other processors)

- Write serialization (all writes from same or different processors are seen in the same order by all processors)

Snoopy Protocols

P1 P2 P3 P4

$ $ $ $

Memory Broadcast

Easy to ProgramNot Easily Scalable

Directory Protocols

P1 P4

$ $ MD

Interconnection Network

Point-to-Point

DM

ScalableHigh miss latency

Page 6: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Problems with Snoopy Networks

Two major problems with snoopy cache coherent networks

(1) Interconnect bandwidth for broadcasting of memory requests- Bus Networks: Limits one request per cycle - Multiple Buses: Increases cache controllers- Point-to-Point Networks: Selective multicasting & Ordering

(2) Cache Access Rate- Cache tag lookup (latency)- Increased power consumption

Page 7: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

$

Electrical– Split Transactional Bus– Sun Fireplane (SC 2001)– Timestamp Snooping (ASPLOS 2000), Multicast Snooping

(ISCA 2001– Jetty (HPCA 2001), Region Scout (ISCA 2005), Intel QPI– Broadcasting on Ordered Networks (HPCA 2009, MICRO 2009)

Related Work (to name a few)

Optical/Nanophotonic- SYMNET (Trans on Parallel & Dist Systems 2004)- Shared Bus (MICRO 2006), Wavelength Routed Oblivious Network (ASPLOS 2010)- Spectra (ISPLED 2009), ATAC (PACT 2010)

Page 8: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

%

• Advantages of the proposed architecture– Dual sub-networks for memory request

• Broadcast & Multicast networks

– Broadcast network used by all tiles to fetch the missed block

• Network access implemented using tokens• Determines the sharing pattern

– Multicast network to be shared between nodes to send selective requests

• Reduces the broadcast requirement• Simultaneous transient requests in progress to different memory

locations

– Reducing the external laser power by unique power guiding techniques

CC-NPA Architecture

Page 9: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Tile 0

Tile 4

Tile 8

Tile12

Tile 1

Tile 5

Tile 9

Tile13

Tile 2

Tile 6

Tile10

Tile14

Tile 3

Tile 7

Tile11

Tile 15

Proposed Broadcast Sub-Network Architecture: CC-NPA

Control center

Core 0

Core 2

Core 1

Core 3

Sh

ared L2

L1 C

ache

L1 C

ache

L1 C

ache

L1 C

ache

Tran

smitter

Receiver

Page 10: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Power Guiding

As only one core can transmit, route power to a column of cores.

- Reduction in optical power (~75%)

To Colum

n 1 To C

olum

n 2

To Column 3

The active column is determined by the circulating optical tokes

2 dB optical loss

To Column 0

Page 11: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Tile 0

Tile 4

Tile 8

Tile 12 Tile 13

Tile 9

Tile 5

Tile 1 Tile 2

Tile 6

Tile 10

Tile 14 Tile 15

Tile 11

Tile 7

Tile 3

Optical Token System (1/3)

power powerinject inject

return return

Control Center

Requestsa token

inject token

Received Token

power

Page 12: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Tile 0

Tile 4

Tile 8

Tile 12 Tile 13

Tile 9

Tile 5

Tile 1 Tile 2

Tile 6

Tile 10

Tile 14 Tile 15

Tile 11

Tile 7

Tile 3

Optical Token System (2/3)

power powerinject inject

return return

Requestsa token

inject token

power

&&&&&&&&&&&&&&&&&&&&&&&&&&

Token Re-Injected

Page 13: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Tile 0

Tile 4

Tile 8

Tile 12 Tile 13

Tile 9

Tile 5

Tile 1 Tile 2

Tile 6

Tile 10

Tile 14 Tile 15

Tile 11

Tile 7

Tile 3

Optical Token System (3/3)

power powerinject inject

return return

Requestsa token

inject token

power Token Returns

To next column

Fairness can be insured with additional techniques (Fair slot, Two pass)

Page 14: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Proposed Multicast Sub-Network

For larger networks, snoopy-based cache coherence reduces performance- Broadcasting data to all shared tiles, consuming more address bandwidth- Consumes more latency and power at the caches

• Wavelength routed second multicast sub-network

• Filter and route cache requests to nodes that hold the cache data

• Reduction in required bandwidth and power dissipation

• Potential for simultaneous multiple requests (could lead to race conditions)

0

0"2

0"$

0"6

0"&

1

1"2

(() L+ ,adi0 OceanSin6le!Sharer Mutiple!Sharer

Percentage of request with multiple sharers

Page 15: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

'(

Initial Performance Analysis• Performance Comparison

– Simics with Gems Memory Module– FFT, LU, Radiosity, Ocean, Radix, & Water

• Area & Power Analysis

>arameter @alue >arameter @alue

L1AL2!coherence !"#$ Core!(reDuency! 5!&'(

L2!cache siGeAaccoc 256 +,-16"/a1 )hreads!HcoreI 2

L1 cacheAaccoc 64+,-4"/a1 Issue!policy $3"or6er

Cache!line!siGe 64, Memory!SiGe!HKLI 4

Memory!Controllers 16 Mddress!Landwidth!HoptI

)#*!+,-.

Mddress!Landwidth!HelecI

320!&,:;

Simics Parameters

Page 16: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Splash-2 Speed up (16-cores)

0

0"2

0"$

0"6

0"&

1

1"2

1"$

(() L+ ,adiocity ,adi0 ,aytrace

OlectricalCC"N>M

- CC-NPA increases performance by about 25%

Page 17: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Splash-2 Speed up (64-cores)

0

0"Q

1

1"Q

2

2"Q

3

3"Q

$

(() L+ ,adi0 Ocean

OlectricalCC"N>M

- CC-NPA increases performance of up to 2x

Page 18: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

DeTice LossHdLI DeTice LossHdLI

Coupler HLcI 1 (ilter drop!HLfI 1

Non"Linearity HLnI 1 Lendin6 HLLI 1

>hoto"detector HLpI 1 VaTe6uide Crossin6!HLwcI 0<05

Modulator InsertionHLiI

1 ,eceiTer!HL,SISensitiTity

"20 6,=

VaTe6uide!Hper!cmI!HLVI 1<3 Splitter!HLsI 3

Laser Officiently 30> ,in6!modulation 150!?@-A

,in6!Heatin6 100!?@-A )IMA!Tolta6e!amp" 1<1!:@-A!B 100!?@-Ait

Power Analysis

LB LC

LsLi

Lf

LWC

LP,LRS

5×LS + 7×LW + LC + LN + 3×LI + LF + 8×LB+ 100×LWC

"$3"1!dL!Hper!waTelen6thI

Total Power (opt) = 5.44 W (8 wavelengths)

LW

Page 19: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Area Analysis

DeTice Mrea! H!m2)

VaTe6uide!HpitchI (/( "0

Micro"rin6!resonator '**!

>hoto"detector '**!

)IMA!Limitin6!Mmplifier */*!)!( 100!2

Off-Chip Laser

On-ChipModulator

Transmission Medium Photodetector

TIABuffer Chain Limiting Amplifier

Driver for Electronics

Optical Layer

Electronics Layer

On-Chip

Pitch (5.5!m)

Photo-detector (100 !m2)

TIA/Limiting Amp (0.02625 mm2)

Broadcast Sub-Network: 24 mm2 (optical) 51 mm2 (electrical)

Ring Resonator (100 !m2)

Page 20: 0 * ) ) &- '1 * 0 ' 2#,&' 2*,&0&)-' 34$-.*0&5 · 2016. 1. 6. · ¥ Power consumption of Network-on-Chips (NoCs) O u s in g m e ta llic interconnects is projected to exceed expectation

Conclusion & Future Work

• CC-NPA is both a low power & high bandwidth network for future cache coherent many-core processors

• CC-NPA combines the benefits the of snoopy cache coherent protocols and nanophotonics

• CC-NPA provides scalable bandwidth using two sub-networks (broadcast and multicast)

• Future work will involve designing and optimizing the multicast sub-network