Optics for disaggregating Datacenters and · Nikos Pleros The energy problem: World’s No. 1 HPC 200 Pflops (20% of Exascale target) 10MW (50% of the 20MW limit) No.#1: IBM Summit

Optics for disaggregating Datacenters and

disintegrating Computing

ARISTOTLE UNIV. OF THESSALONIKI

N.Terzenidis, M.Moralis-Pegios, S. Pitris, G.Mourgias-Alexandris, A. Tsakyridis, C. Vagionas, K. Vyrsokinos, T. Alexoudi, C. Mitsolidou, N. Pleros

Wireless and Photonics Systems and Networks (WinPhos) Lab

Dept. of Informatics, Aristotle Univ. of Thessaloniki,

Center for Interdisciplinary Research & Innovation (CIRI), Greece

Nikos Pleros

The energy problem: World’s No. 1 HPC

200 Pflops (20%

of Exascale target)

10MW (50% of the

20MW limit)

No.#1: IBM Summit (top500.org Nov. 2018 List)

2014 2016 2025

1.6% >3%(420TWh) ~20% Energy in DataCenters (% of global power) :

Nikos Pleros

The energy efficiency problem

High diversity of workloads with different resource requirements

up to 50% underutilized resources..impacting cost and energy !

Intel Whitepaper, 2015

Nikos Pleros

The way-out

Photonics

ArchitectureTechnology

Disaggregation

Energy Resource utilization

Nikos Pleros

Challenges across the hierarchy

hierarchy>2μsec latency in high-port switches

Energy increases with capacity

< 8-node connectivityin p2p QPI-based

High-latency in non-p2p switched-based

Non-modular settings with >40% on-die caches

Energy dominated by

memory access

Nikos Pleros

Our work

hierarchyHipoλaos Optical Packet

Switch

1024x1024 ports, 10T capacity scalable to >40T

sub-μsec latency

On-board disaggegration with

>8 sockets in p2p

Low-energy, low-latency with Si-Pho

Optical Static RAM technology

Disintegrate via off-chip high-speed optical caches

1.48 cm

18 SOAs

44 I/O

Test

structures

disaggregate at rack-level…

down to disintegration at

chip-level

T. Alexoudi et al, “Optics in Computing: from Photonic Network-on-Chip to Chip-to-Chip Interconnects and Disintegrated Architectures “, JLT 2019

Nikos Pleros

The Hipoλaos Optical Packet Switch

1024x1024 -port Multicasting Si-integration ?

N. Terzenidis et al, IEEE PTL, pp. 1535-

1538, Sept. 2018

M. Moralis-Pegios et al, IEEE PTL

pp. 712-715, April, 2018N. Terzenidis et al, ECOC, Sept. 2018

N. Terzenidis et al, ONDM 2019 to demonstrate performance in a 256-

node network

Nikos Pleros

Beating QPI: a p2p board-

level optical interconnect

for >8 sockets

Disaggregate at board-level

Nikos Pleros

The inner-anatomy: board-level Switch-based

• PCIe• RapidIO

Unlimited connectivity

Increased latency, power,

cost, size

Intel® Xeon®

processor E5-4600

CISC processors INTEL (x86)

Direct P2P - High-End Servers

INTEL QPI, AMD HyperTransport, IBM POWER8 SMP

Links, Oracle SPARK Coherence links, NVIDIA NVlink

Low-latency, low-power

Limited connectivity

Nikos Pleros

QPI

…the dominant P2P link

4 sockets P2P connected

9.6 Gb/s line rate

38.4 GB/s BW (307.2 Gb/s or 102.4 Gb/s /direction)

60ns-240ns Latency

16 pJ/bit TxRx power

efficiency

4s-“glueless”

architecture

102.4Gb/s

Intel Xeon E7-8800v4: 1st Intel Processor to support

up to 8-socket ! (released June 2016)

up to 2-hop

Increased latency

8s- “glueless”

architecture

Nikos Pleros

Going beyond 8 sockets?Bixby (BX) switches

January 2017

Price/performance drastically decreases as #processors increases

Latency increases

65% of QPI Bandwidth wasted for cache coherency updates

Nikos Pleros

www.ict-streams.eu

1300nm SM EOPCB with 50G RF

16x50Gb/s WDM TxRx SiP O-band

16x16 AWGR- O-band

The ICT-STREAMS O-band technology

http://www.ict-streams.eu/

Nikos Pleros

Strictly non-blocking

All-to-All connectivity

Suitable for broadcasting/multicasting

High number of ports (up to 32)

High bandwidth optical links

No O/E/O conversions

Passive (no power consumption)

Time of flight latency

The ICT-STREAMS p2MP architecture

Nikos Pleros

STREAMS vs QPI

QPI STREAMS gain

Sockets/

nodesUp to 8 >16 x 2

#hops up to 2 1 ÷ 2

Line rate 9.6 Gb/s 50 Gb/s x 10

Link BW 307.2 Gb/s 16x50 Gb/s x 2.6

Throughput 2.45 Tb/s 12.8 Tb/s x 5.2

Power

efficiency16 pj/bit < 3.5 pj/bit ÷ 4

Nikos Pleros

S. Pitris et al., Opt. Express 26(5), 2018

8×8 O-band Si-AWGR

The on-board routing platform

Nikos Pleros

Si-based 8x8 O-band AWGR technology

S. Pitris et. al., OFC 2018

S. Pitris et. al., OpEx 2018

10nm ch. spacing channel crosstalk : 11 dB 3-dB bandwidth 5.7nm non-uniformity : 3.5 dB loss, insertion losses: 2.5 dB loss

Nikos Pleros

Si-AWGR on board

AWGR chip

30mm

18m

m

Si-AWGR onto an EOPCB for on-board routing

Testing currently on-going

Nikos Pleros

S. Pitris et al., Opt. Express, 2018.

8×8 O-band Si-AWGR 40G PD & BICMOS TIA

S. Pitris et al., PJ 2018

40 Gb/s O-band RM

S. Pitris et al., IEEE PJ 2018

Multisocket routing @40Gb/s

Nikos Pleros

8x40Gb/s multi-socket Tx/Rx/routing

RM: 1.7pJ/bit @40Gbps

TIA: 4 pJ/bit @40Gbps

5.7pJ/bit for Tx/Rx link (incl. thermal tuning)

S. Pitris et al., IEEE Photon. J., Oct. 2018

Nikos Pleros

4ch Si-pho WDM TxRx

S. Pitris et al., OFC 2019

The WDM Transceiver engine

Nikos Pleros

Mask Layout

Tx1 Tx2 Tx3 Tx4

ER = 4.4 dB ER = 4.1 dB ER = 4.2 dB ER = 4 dB

2m

V

10p

s/d

iv

λ1=1310.25 nm λ2=1303.23 λ4=1324.59 nm λ3=1317.28 nm

4x40Gb/s O-band Si WDM transmitterS. Pitris et. al., OFC 2019

Nikos Pleros

HF

Board

High Speed TX RF traces

High Speed RX RF traces

4-channel 55nm BiCMOS DR

4-channel 55nm BiCMOS TIA

Tx out 1 @ 50 Gb/s NRZ(λ=1314.9 nm)

ER = 4.3 dB

4mV-10ps/div

Rx TIA out 1 @ 30 Gb/s NRZ(λ=1318.4 nm)

10mV-20ps/div

RM-driving voltage = 1.9 Vpp

EE=1.525pJ/bit/RM +0.8 pJ/bit (tun)

TIA output voltage = 120 mVpp

PCTIA=130mW/ch

DC traces

4x50Gb/s on-board WDM transmitter

All 4-channels capable of >50G operation

Nikos Pleros

Power penalty = 0.2 dB @10-9

BER measurements @ 40 Gb/s

H. Ramon et al., IEEE PTL 30, 2018.

1V-driven 50 Gb/s NRZ operation &

transmission at 52 km through SMF

EE = 0.8 pJ/bit @ 50 Gb/s

FDSOI CMOS DR

Micro-ring Modulator

1-Volt 50Gb/s x 52km transmission

Record-high BxL= 2600 Gb-km/s

M. Moralis-Pegios et al., OECC/PSC 2019

Nikos Pleros

Device Ref.Power

consumtpionEE@40Gb/s EE@50Gb/s

LD [1] 6.1 dBm 1 pJ/bit 0.8 pJ/bit

RM DRIVER +

TUNINGThis work 40 mW + 40 mW 2 pJ/bit 1.6 pJ/bit

PD TIA This work 130 mW 3.25 pJ/bit 2.6 pJ/bit

SerDes [2] 11.6 mW 0.29 pJ/bit 0.232 pJ/bit

Totals 6.54 pJ/bit 5.23 pJ/bit

%Reduction to QPI ~60% ~68%

1 S. Pitris et al, PJ 20182 V. Stojanovic, OPEX 2018

~68% energy savings compared to QPI

with >4-node direct connectivity !

The energy-latency gain

Nikos Pleros

Optical RAMs: disintegrate

via off-chip optical caching

Disintegrate at chip-level

Nikos Pleros

ORACLE, Pranay Koka et al, “Silicon-Photonic Network Architectures

for Scalable, Power-Efficient Multi-Chip Systems”, ISCA 2010Yigit Demir et al, “Galaxy: A High-Performance Energy-Efficient

Multi-Chip Architecture Using Photonic Interconnects”, ICS 2014

Why to disintegrate?

Smaller dies (chiplets) interconnected through optics

Relieve from area & power constraints – ease scaling to many-core

>2x speed-up in512-core (Oracle) and 4k-core (Galaxy) settings

Nikos Pleros

Memory speed-up mainly through hierarchical levelsLarge and complex two or three level cache hierarchies

Up to 40% of chip real-estate consumed by caches and X-bar

Still <20GB/s memory bandwidth

[Source: S. Borkar and A.A.Chien, Com. ACM, 2011]

>40%

<20GB/s

[Source] INTEL

Evolution of on-die caches

On-chip caches and memory bandwidth

Nikos Pleros

optical

cache

p-NoC

CMP:

cores

only!

Now!

o/e o/e

Modular and granular setting with cache-free cores and a unified,

shared pool of cache resources

Needs an ultra-fast optical cache for time-sharing between lower-clock cores

Our goal

Nikos Pleros

Give up access times for lower footprint/power

Up to 5GHz RAM speed

Fail to exceed electronics

First to exceed electronics2x the Speed: 10Gbps

A. Tsakyridis et al, CLEO 2019

10 Gbps Optical RAM

Nikos Pleros

BER penalties: Write 6.2 dB, Read 0.4 dB

Packaged Optical Memory

10 Gb/s monolithic InP Flip-Flop

A. Tsakyridis et al, CLEO 2019, A. Tsakyridis et al, OSA Optics Letters 2019

10 Gbps Optical RAM

Nikos Pleros

10 Gbps Optical RAM

Nikos Pleros

Conclusions

hierarchy

Hipoλaos Switch

1024x1024-port

sub-μsec latency

Scalable to 10Tbpswith 113pJ/bit

On-board p2p >8 sockets

5.2pJ/bit link @50Gb/s

Synergy between technology+architecture: 68% energy save over QPI

10GHz Optical SRAM: 2x the speed of electronics

Si-based RAM layout + Optical cache design

Modular, granular, flexible

Nikos Pleros

Contacts:

Prof. Nikos Pleros : [email protected]

Dr. Theoni Alexoudi : [email protected]

www.ict-streams.eu

mailto:[email protected]

mailto:[email protected]

Nikos Pleros

Back-up slides

Nikos Pleros

6.2μm2 footprint

<50psec response

time

Switching energy 6.4 fJ/bit@ 5 GHz

Hybrid PhC-based laser Flip Flop with only 13fJ/bit*In collaboration with F. Raineri and R. Raj, C2N/CNRS Lowest power consumption FF

Input

Output

*T. Alexoudi et al, JSTQE (2016)* D. Fitsios et al, Optics Express, (2016)

The Si-based Optical SRAM

3.2 fJ/bit@ 10 GHz

Nikos Pleros

Fast Optical RAM cell: design+theory > 40GHz experiment @10GHz Record-low power

Optical RAM

Peripheral Circuits Row Access Gate +

Column Decoder Row Decoder Tag Comparator

Experiments @10Gb/s

Access Gate + Col. Decoder

Tag Comparator

Ro

w D

ec

od

er

WDM

optical

word /

data

WDM

optical

memory

address

Optical cache simulations @ 16GHz

The first optical cache design* P. Maniotis et.al., "Optical Buffering for Chip Multiprocessors: A 16GHz Optical Cache Memory Architecture,” JLT, pp.4175-4191, 2013

Nikos Pleros

VPI simulations with exp.verified SOA model

Write operation

average ER 6.2 dB

average AM 1.9 dB

close to experimental values!

Caching with light at 16Gbps

Write to row “00”: Only when both λ1,λ2=0 & RW=0 (green area) the FF contents follow the

Bit contents (blue area)

Nikos Pleros

And compare…

CONVENTIONAL TOPOLOGY OPTICAL TOPOLOGY

8 cores, 2GHz clock

Dedicated L1

Shared L2=8xL1

8 cores, 2GHz clock

16GHz cache clock

Off-die shared L1 – no L2

Nikos Pleros

*P. Maniotis et al."High-Speed Optical Cache Memory as Single-Level SharedCache in Chip-Multiprocessor architectures," in Proceedings of HIPEAC15, 19-21 January, 2015

MEAN 63.4 62.7 19.4 10.7

L1 ConventionalL1+L2 Conventional 2xN GHz Optical

Execution times for PARSEC

Documents

Optics for disaggregating Datacenters and · Nikos Pleros The energy problem: World’s No. 1 HPC 200 Pflops (20% of Exascale target) 10MW (50% of the 20MW limit) No.#1: IBM Summit