Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Optics for disaggregating Datacenters and
disintegrating Computing
ARISTOTLE UNIV. OF THESSALONIKI
N.Terzenidis, M.Moralis-Pegios, S. Pitris, G.Mourgias-Alexandris, A. Tsakyridis, C. Vagionas, K. Vyrsokinos, T. Alexoudi, C. Mitsolidou, N. Pleros
Wireless and Photonics Systems and Networks (WinPhos) Lab
Dept. of Informatics, Aristotle Univ. of Thessaloniki,
Center for Interdisciplinary Research & Innovation (CIRI), Greece
Nikos Pleros
The energy problem: World’s No. 1 HPC
200 Pflops (20%
of Exascale target)
10MW (50% of the
20MW limit)
No.#1: IBM Summit (top500.org Nov. 2018 List)
2014 2016 2025
1.6% >3%(420TWh) ~20% Energy in DataCenters (% of global power) :
Nikos Pleros
The energy efficiency problem
High diversity of workloads with different resource requirements
up to 50% underutilized resources..impacting cost and energy !
Intel Whitepaper, 2015
Nikos Pleros
The way-out
Photonics
ArchitectureTechnology
Disaggregation
Energy Resource utilization
Nikos Pleros
Challenges across the hierarchy
hierarchy>2μsec latency in high-port switches
Energy increases with capacity
< 8-node connectivityin p2p QPI-based
High-latency in non-p2p switched-based
Non-modular settings with >40% on-die caches
Energy dominated by
memory access
Nikos Pleros
Our work
hierarchyHipoλaos Optical Packet
Switch
1024x1024 ports, 10T capacity scalable to >40T
sub-μsec latency
On-board disaggegration with
>8 sockets in p2p
Low-energy, low-latency with Si-Pho
Optical Static RAM technology
Disintegrate via off-chip high-speed optical caches
1.48 cm
18 SOAs
44 I/O
Test
structures
disaggregate at rack-level…
down to disintegration at
chip-level
T. Alexoudi et al, “Optics in Computing: from Photonic Network-on-Chip to Chip-to-Chip Interconnects and Disintegrated Architectures “, JLT 2019
Nikos Pleros
The Hipoλaos Optical Packet Switch
1024x1024 -port Multicasting Si-integration ?
N. Terzenidis et al, IEEE PTL, pp. 1535-
1538, Sept. 2018
M. Moralis-Pegios et al, IEEE PTL
pp. 712-715, April, 2018N. Terzenidis et al, ECOC, Sept. 2018
N. Terzenidis et al, ONDM 2019 to demonstrate performance in a 256-
node network
Nikos Pleros
Beating QPI: a p2p board-
level optical interconnect
for >8 sockets
Disaggregate at board-level
Nikos Pleros
The inner-anatomy: board-level Switch-based
• PCIe• RapidIO
Unlimited connectivity
Increased latency, power,
cost, size
Intel® Xeon®
processor E5-4600
CISC processors INTEL (x86)
Direct P2P - High-End Servers
INTEL QPI, AMD HyperTransport, IBM POWER8 SMP
Links, Oracle SPARK Coherence links, NVIDIA NVlink
Low-latency, low-power
Limited connectivity
Nikos Pleros
QPI
…the dominant P2P link
4 sockets P2P connected
9.6 Gb/s line rate
38.4 GB/s BW (307.2 Gb/s or 102.4 Gb/s /direction)
60ns-240ns Latency
16 pJ/bit TxRx power
efficiency
4s-“glueless”
architecture
102.4Gb/s
Intel Xeon E7-8800v4: 1st Intel Processor to support
up to 8-socket ! (released June 2016)
up to 2-hop
Increased latency
8s- “glueless”
architecture
Nikos Pleros
Going beyond 8 sockets?Bixby (BX) switches
January 2017
Price/performance drastically decreases as #processors increases
Latency increases
65% of QPI Bandwidth wasted for cache coherency updates
Nikos Pleros
www.ict-streams.eu
1300nm SM EOPCB with 50G RF
16x50Gb/s WDM TxRx SiP O-band
16x16 AWGR- O-band
The ICT-STREAMS O-band technology
Nikos Pleros
Strictly non-blocking
All-to-All connectivity
Suitable for broadcasting/multicasting
High number of ports (up to 32)
High bandwidth optical links
No O/E/O conversions
Passive (no power consumption)
Time of flight latency
The ICT-STREAMS p2MP architecture
Nikos Pleros
STREAMS vs QPI
QPI STREAMS gain
Sockets/
nodesUp to 8 >16 x 2
#hops up to 2 1 ÷ 2
Line rate 9.6 Gb/s 50 Gb/s x 10
Link BW 307.2 Gb/s 16x50 Gb/s x 2.6
Throughput 2.45 Tb/s 12.8 Tb/s x 5.2
Power
efficiency16 pj/bit < 3.5 pj/bit ÷ 4
Nikos Pleros
S. Pitris et al., Opt. Express 26(5), 2018
8×8 O-band Si-AWGR
The on-board routing platform
Nikos Pleros
Si-based 8x8 O-band AWGR technology
S. Pitris et. al., OFC 2018
S. Pitris et. al., OpEx 2018
10nm ch. spacing channel crosstalk : 11 dB 3-dB bandwidth 5.7nm non-uniformity : 3.5 dB loss, insertion losses: 2.5 dB loss
Nikos Pleros
Si-AWGR on board
AWGR chip
30mm
18m
m
Si-AWGR onto an EOPCB for on-board routing
Testing currently on-going
Nikos Pleros
S. Pitris et al., Opt. Express, 2018.
8×8 O-band Si-AWGR 40G PD & BICMOS TIA
S. Pitris et al., PJ 2018
40 Gb/s O-band RM
S. Pitris et al., IEEE PJ 2018
Multisocket routing @40Gb/s
Nikos Pleros
8x40Gb/s multi-socket Tx/Rx/routing
RM: 1.7pJ/bit @40Gbps
TIA: 4 pJ/bit @40Gbps
5.7pJ/bit for Tx/Rx link (incl. thermal tuning)
S. Pitris et al., IEEE Photon. J., Oct. 2018
Nikos Pleros
4ch Si-pho WDM TxRx
S. Pitris et al., OFC 2019
The WDM Transceiver engine
Nikos Pleros
Mask Layout
Tx1 Tx2 Tx3 Tx4
ER = 4.4 dB ER = 4.1 dB ER = 4.2 dB ER = 4 dB
2m
V
10p
s/d
iv
λ1=1310.25 nm λ2=1303.23 λ4=1324.59 nm λ3=1317.28 nm
4x40Gb/s O-band Si WDM transmitterS. Pitris et. al., OFC 2019
Nikos Pleros
HF
Board
High Speed TX RF traces
High Speed RX RF traces
4-channel 55nm BiCMOS DR
4-channel 55nm BiCMOS TIA
Tx out 1 @ 50 Gb/s NRZ(λ=1314.9 nm)
ER = 4.3 dB
4mV-10ps/div
Rx TIA out 1 @ 30 Gb/s NRZ(λ=1318.4 nm)
10mV-20ps/div
RM-driving voltage = 1.9 Vpp
EE=1.525pJ/bit/RM +0.8 pJ/bit (tun)
TIA output voltage = 120 mVpp
PCTIA=130mW/ch
DC traces
4x50Gb/s on-board WDM transmitter
All 4-channels capable of >50G operation
Nikos Pleros
Power penalty = 0.2 dB @10-9
BER measurements @ 40 Gb/s
H. Ramon et al., IEEE PTL 30, 2018.
1V-driven 50 Gb/s NRZ operation &
transmission at 52 km through SMF
EE = 0.8 pJ/bit @ 50 Gb/s
FDSOI CMOS DR
Micro-ring Modulator
1-Volt 50Gb/s x 52km transmission
Record-high BxL= 2600 Gb-km/s
M. Moralis-Pegios et al., OECC/PSC 2019
Nikos Pleros
Device Ref.Power
consumtpionEE@40Gb/s EE@50Gb/s
LD [1] 6.1 dBm 1 pJ/bit 0.8 pJ/bit
RM DRIVER +
TUNINGThis work 40 mW + 40 mW 2 pJ/bit 1.6 pJ/bit
PD TIA This work 130 mW 3.25 pJ/bit 2.6 pJ/bit
SerDes [2] 11.6 mW 0.29 pJ/bit 0.232 pJ/bit
Totals 6.54 pJ/bit 5.23 pJ/bit
%Reduction to QPI ~60% ~68%
1 S. Pitris et al, PJ 20182 V. Stojanovic, OPEX 2018
~68% energy savings compared to QPI
with >4-node direct connectivity !
The energy-latency gain
Nikos Pleros
Optical RAMs: disintegrate
via off-chip optical caching
Disintegrate at chip-level
Nikos Pleros
ORACLE, Pranay Koka et al, “Silicon-Photonic Network Architectures
for Scalable, Power-Efficient Multi-Chip Systems”, ISCA 2010Yigit Demir et al, “Galaxy: A High-Performance Energy-Efficient
Multi-Chip Architecture Using Photonic Interconnects”, ICS 2014
Why to disintegrate?
Smaller dies (chiplets) interconnected through optics
Relieve from area & power constraints – ease scaling to many-core
>2x speed-up in512-core (Oracle) and 4k-core (Galaxy) settings
Nikos Pleros
Memory speed-up mainly through hierarchical levelsLarge and complex two or three level cache hierarchies
Up to 40% of chip real-estate consumed by caches and X-bar
Still <20GB/s memory bandwidth
[Source: S. Borkar and A.A.Chien, Com. ACM, 2011]
>40%
<20GB/s
[Source] INTEL
Evolution of on-die caches
On-chip caches and memory bandwidth
Nikos Pleros
optical
cache
p-NoC
CMP:
cores
only!
Now!
o/e o/e
Modular and granular setting with cache-free cores and a unified,
shared pool of cache resources
Needs an ultra-fast optical cache for time-sharing between lower-clock cores
Our goal
Nikos Pleros
Give up access times for lower footprint/power
Up to 5GHz RAM speed
Fail to exceed electronics
First to exceed electronics2x the Speed: 10Gbps
A. Tsakyridis et al, CLEO 2019
10 Gbps Optical RAM
Nikos Pleros
BER penalties: Write 6.2 dB, Read 0.4 dB
Packaged Optical Memory
10 Gb/s monolithic InP Flip-Flop
A. Tsakyridis et al, CLEO 2019, A. Tsakyridis et al, OSA Optics Letters 2019
10 Gbps Optical RAM
Nikos Pleros
10 Gbps Optical RAM
Nikos Pleros
Conclusions
hierarchy
Hipoλaos Switch
1024x1024-port
sub-μsec latency
Scalable to 10Tbpswith 113pJ/bit
On-board p2p >8 sockets
5.2pJ/bit link @50Gb/s
Synergy between technology+architecture: 68% energy save over QPI
10GHz Optical SRAM: 2x the speed of electronics
Si-based RAM layout + Optical cache design
Modular, granular, flexible
Nikos Pleros
Contacts:
Prof. Nikos Pleros : [email protected]
Dr. Theoni Alexoudi : [email protected]
www.ict-streams.eu
Nikos Pleros
Back-up slides
Nikos Pleros
6.2μm2 footprint
<50psec response
time
Switching energy 6.4 fJ/bit@ 5 GHz
Hybrid PhC-based laser Flip Flop with only 13fJ/bit*In collaboration with F. Raineri and R. Raj, C2N/CNRS Lowest power consumption FF
Input
Output
*T. Alexoudi et al, JSTQE (2016)* D. Fitsios et al, Optics Express, (2016)
The Si-based Optical SRAM
3.2 fJ/bit@ 10 GHz
Nikos Pleros
Fast Optical RAM cell: design+theory > 40GHz experiment @10GHz Record-low power
Optical RAM
Peripheral Circuits Row Access Gate +
Column Decoder Row Decoder Tag Comparator
Experiments @10Gb/s
Access Gate + Col. Decoder
Tag Comparator
Ro
w D
ec
od
er
WDM
optical
word /
data
WDM
optical
memory
address
Optical cache simulations @ 16GHz
The first optical cache design* P. Maniotis et.al., "Optical Buffering for Chip Multiprocessors: A 16GHz Optical Cache Memory Architecture,” JLT, pp.4175-4191, 2013
Nikos Pleros
VPI simulations with exp.verified SOA model
Write operation
average ER 6.2 dB
average AM 1.9 dB
close to experimental values!
Caching with light at 16Gbps
Write to row “00”: Only when both λ1,λ2=0 & RW=0 (green area) the FF contents follow the
Bit contents (blue area)
Nikos Pleros
And compare…
CONVENTIONAL TOPOLOGY OPTICAL TOPOLOGY
8 cores, 2GHz clock
Dedicated L1
Shared L2=8xL1
8 cores, 2GHz clock
16GHz cache clock
Off-die shared L1 – no L2
Nikos Pleros
*P. Maniotis et al."High-Speed Optical Cache Memory as Single-Level SharedCache in Chip-Multiprocessor architectures," in Proceedings of HIPEAC15, 19-21 January, 2015
MEAN 63.4 62.7 19.4 10.7
L1 ConventionalL1+L2 Conventional 2xN GHz Optical
Execution times for PARSEC