29
© 2019 Western Digital Corporation or its affiliates. All rights reserved. 4/16/2020 | WESTERN DIGITAL CONFIDENTIAL Datacenter CPUs and SoCs with OmniXtend fabric Zvonimir Z. Bandic, Dejan Vucinic Next Gen Platforms Technologies, Western Digital Corporation

Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

  • Upload
    others

  • View
    1

  • Download
    1

Embed Size (px)

Citation preview

Page 1: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

© 2019 Western Digital Corporation or its affiliates. All rights reserved. 4/16/2020| WESTERN DIGITAL CONFIDENTIAL

Datacenter CPUs and SoCs with OmniXtend fabric

Zvonimir Z. Bandic, Dejan Vucinic

Next Gen Platforms Technologies, Western Digital Corporation

Page 2: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 2| WESTERN DIGITAL CONFIDENTIAL

Forward-Looking Statements

This presentation contains certain forward-looking statements that involve risks and uncertainties, including, but not

limited to, statements regarding: the RISC-V Foundation and its initiatives; our contributions to and investments in the

RISC-V ecosystem; the transition of our devices, platforms and systems to RISC-V architectures; shipments of RISC-V

processor cores; our business strategy, growth opportunities and technology development efforts; market trends and

data growth and its drivers. Forward-looking statements should not be read as a guarantee of future performance or

results, and will not necessarily be accurate indications of the times at, or by, which such performance or results will be

achieved, if at all. Forward-looking statements are subject to risks and uncertainties that could cause actual performance

or results to differ materially from those expressed in or suggested by the forward-looking statements.

Additional key risks and uncertainties include the impact of continued uncertainty and volatility in global economic

conditions; actions by competitors; business conditions; growth in our markets; and pricing trends and fluctuations in

average selling prices. More information about the other risks and uncertainties that could affect our business are listed

in our filings with the Securities and Exchange Commission (the “SEC”) and available on the SEC’s website at

www.sec.gov, including our most recently filed periodic report, to which your attention is directed. We do not undertake

any obligation to publicly update or revise any forward-looking statement, whether as a result of new information,

future developments or otherwise, except as otherwise required by law.

Safe Harbor | Disclaimers

Page 3: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 3| WESTERN DIGITAL CONFIDENTIAL

Agenda

• Datacenter CPU vision

• OmniXtend backplane reference design

• OmniXtend vs. other memory centric concepts

• OmniXtend system design:– Programmable Ethernet switches

– Providing cache coherence

– Dataplane implementation

• It works: two sockets and a switch…

• Performance measurements

• Conclusions and what is next?

Page 4: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 4| WESTERN DIGITAL CONFIDENTIAL

Datacenter CPU vision I

• Multi-threaded, multi-core CPU:– 1) Medium performance, OOO RISC-V Core for

general purpose OS and software applications

– 2) Standardized and open JEDEC interface architecture (NVDIMM-P) for high density emerging non-volatile memories

– 3) Support for standardized memory protocol fabric – e.g. OmniXtend

– 4) Support for high bandwidth and low latency accelerator interfaces:

• Supporting machine learning and inference engine accelerators

– 5) Inference accelerator:

• May contain large number of small RISC-V Cores with vector ALUs (similar to GP-GPU)

• May contain set of hardware primitives allowing optimal acceleration

Page 5: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 5| WESTERN DIGITAL CONFIDENTIAL

Datacenter CPU vision II

• Allows large number of RISC-V compute nodes to connect to universally shared memory (NUMA) – standardized

• Enables memory appliance, aggregation/disaggregation

Page 6: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 6| WESTERN DIGITAL CONFIDENTIAL

OmniXtend backplane reference design

Page 7: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 7| WESTERN DIGITAL CONFIDENTIAL

OmniXtend backplane reference design II

• Barefoot Networks Tofino programmable switch:– 64 4x25G ports

– Supports P4 Programmable Pipeline

• On board Interfaces– 16x QSFP28 connectors for 100G (4x25G)

– 4x QSFP28 connectors for 100G (4x25G) with retimers

– 6x SFF-TA-1002 connectors w/ 8x25G

– 1x SFF-TA-1002 connectors w/ 16x25G

– 4x Firefly connectors for 4x25G Optical MTP/MPO

– 12x 4x25G over TwinAx flyover cables for extension boards

• Extension boards– 2 Extension boards

– 12x EDSFF SFF-TA-1002 connectors w/ 8x25G on each board

– Alternate configurations possible

Page 8: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 8| WESTERN DIGITAL CONFIDENTIAL 8

OmniXtend vs. other memory-centric concepts

•Memory fabric may mean different things to different people:

– Page fault trap leading to RDMA request (incurs context switch and SW overhead)

– Global address translation management in SW, leading to LD/ST across global memory fabric

– Coherence protocol scaled out, global page management and no context switching

Context switch cost comparable to memory access latency

No rewriting of software, scalable like the algorithm

This is OmniXtend

Require software/kernel support and/or rewriting of applications

Page 9: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 9| WESTERN DIGITAL CONFIDENTIAL

Computer and storage server architecture

• Single node heavily dominated by Intel

• Top tier Google, Amazon, MSFT have access to some proprietary interfaces

• We could not get QPI/UPI license since first attempt in HGST (2013) (Zvonimir, Dejan)

• Architecture: CLOSED

• No guarantees – not even PCIe or SATA

• Amount of memory per node is limited

Last 40 years

Page 10: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 10| WESTERN DIGITAL CONFIDENTIAL

Data centers’ memory are struggling to keep up with data explosion

I/O struggling to keep up with CPU innovations

Big data and fast data ecosystem requires a standard scale-out memory-centric architecture

By 2020 IDC predicts the data volume will have reached 40,000 Exabytes

Year

Perf

orm

ance

Year

Dat

a V

olu

me

General purpose architectures no longer sufficient Big Data and Fast Data workloads exceed capability of uniform resource ratios

Page 11: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 11| WESTERN DIGITAL CONFIDENTIAL 11

OmniXtend system design

• OmniXtend is a fully open coherence protocol meant to restore unrestricted interoperability of heterogeneous compute engines with a wide variety of memory and storage technologies

• OmniXtend supports a four-hop MESI protocol and takes advantage of a new wave of Ethernet switches with stateful and programmable data planes to facilitate system scalability

Programmable Switch

802.3

PH

Y802.3

PH

Y

802.3

PH

Y

802.3 PHY

ML Accelerator

802.3 PHY

NVMNVM

NVM

Page 12: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 12| WESTERN DIGITAL CONFIDENTIAL

• Current trends towards modern networking supports 6.4 Tb/s packet switching with application specific header parsing & customized match-action

• Programmable network devices provide:➢Programmable data plane and control plane

➢High throughput (100Gbps) with low processing latency (<0.5s )

➢Data processing while being transmitted

➢Protocol-independent

The auspicious timing of programmable Ethernet switches made the choice of transport easy

• Great Benefits for:➢Distributed Systems

➢ In-Network Computing

➢Distributed Main Memory and Storage Systems

Page 13: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 13| WESTERN DIGITAL CONFIDENTIAL

Protocol independent switch architecture enables coherence message routing over Ethernet

• OmniXtend leverages the programmability of smart switches for parsing, processing and routing coherence messages between computing nodes to suit cache coherence protocol's needs

Memory ALU

Memory ALU

Memory ALU

Memory ALU

Memory ALU

Memory ALU

Memory ALU

Memory ALU

Memory ALU

Memory ALU

Memory ALU

Memory ALU

Pro

gram

mab

le P

arse

r

Pro

gram

mab

le D

epar

ser

. . .

Packet In Packet OutProgrammable Pipeline

Page 14: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 14| WESTERN DIGITAL CONFIDENTIAL

Providing cache coherence

• All of the memory operations on a shared address space are completed through two-stage request and response transactions between active participants:– Master Agent: request for specific permission and perform

memory operations

– Slave Agent: manage access permissions, communicate with main memory, and respond to the original requester

• A Master Agent must first obtain necessary permissions on a specific block before performing memory operations:– Get: Read the data from a specified address

– Put: Write the data at a specified address

– Atomic: Atomically read a data value and write a new value that is the result of some arithmetic or logical operations

Master Agent

Slave Agent

Request Message

Reply Message

Page 15: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 15| WESTERN DIGITAL CONFIDENTIAL

Data plane implementation

– Keeps standard 802.3 L1 frame, interoperates with Barefoot Tofino and future Ethernet switches

– Custom frames are parsed and processed in P4 language

– Enables stateful message processing inside the switching fabric

– FPGA or ASIC switch; not limited to 802.3

• Protocol translation and modification inside fabric:– Requires no new silicon

OmniXtend Header Format

• 100 Gb/s is available today– Clear roadmap to 200 and 400 with 56Gb PAM4 and x8

• Replaces Ethernet L2 with serialized TileLink messages

Page 16: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 16| WESTERN DIGITAL CONFIDENTIAL

Routing OmniXtend messages through a programmable switch• Barefoot Tofino ASIC:

– 64-port 100 GigE switch, 6.4 Tbit/s aggregate throughput, < 400 ns latency

– Fully programmable parser allows describing TileLink message format in P4

– Fully programmable match-action pipeline (a.k.a. “flow tables”) enables flexible coherence message processing to suit cache coherence protocol's needs at line-rate performance

– Modifications to coherence domains, protocols require no new silicon

/**************H E A D E R S ****************/header_type OmniXtend_t {

fields {format : 3;opcode : 3;param : 3;size : 4;domain : 3;source : 16;

}}

/************** P A R S E R ******************/header OmniXtend_t OmniXtend;parser start {

extract(OmniXtend);return ingress;

}

/**********I N G R E S S P R O C E S S I N G **********/action send(port) {modify_field(ig_intr_md_for_tm.ucast_egress_port, port);

}action discard() {

modify_field(ig_intr_md_for_tm.drop_ctl, 1);}table OmniXtend_host{

reads {ig_intr_md.ingress_port : exact;

}actions {

send;discard;

}}

control ingress {apply(OmniXtend_host);

}

Packet data

ParserMatch-ActionUnit 0 Deparser

Packet data

Match-ActionUnit 1

Match-ActionUnit 11

PH

V

Match

Table

PHV’

Key

Parameters

Match-Action

Unit (MAU)Actions

Page 17: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 17| WESTERN DIGITAL CONFIDENTIAL

Performance measurementsSystem Setup

RISC-V SoC with OmniXtendrunning in FPGA

Tofino Switch programmed with P4 code to support OmniXtend

Page 18: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 18| WESTERN DIGITAL CONFIDENTIAL

Performance measurementsMemory access latency

L1 L2 DRAM

Page 19: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 19| WESTERN DIGITAL CONFIDENTIAL

RISC-V system with 8 harts

Page 20: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 20| WESTERN DIGITAL CONFIDENTIAL

Conclusions

• Realization of 7yr+ dream: open source cache coherence protocol and interconnect, and at a low cost

• True enablement of plug-n-play heterogenous compute:– Some more RISC-V ISA work and other ISA work may be needed!

• The backplane:– Allegro files are available to CHIPS Alliance members

• What’s next:– Compute and memory boards – CHIPS Alliance project 2020

Page 21: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

© 2019 Western Digital Corporation or its affiliates. All rights reserved. 4/16/2020| WESTERN DIGITAL CONFIDENTIAL

Page 22: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 22| WESTERN DIGITAL CONFIDENTIAL

Cache Coherence

SSD

HDD

Network

CPU

Core

Cache

Core

Cache

Core

Cache

Core

Cache

Shared

Cache

PCIe Root Complex

Memory Controller

CPU

Core

Cache

Core

Cache

Core

Cache

Core

Cache

Shared

Cache

PCIe Root Complex

Memory Controller

CPU

Core

Cache

Core

Cache

Core

Cache

Core

Cache

Shared

Cache

PCIe Root Complex

Memory Controller

CPU

Core

Cache

Core

Cache

Core

Cache

Core

Cache

Shared

Cache

PCIe Root Complex

Memory Controller

DRAMDRAM

DRAMDRAM PCIe

DDR

Cache Coherency

SystemPCIe RootComplex

StorageInterface

NetworkInterface

SATASAS

NVMe

CPU

Core

Cache

Core

Cache

Core

Cache

Core

Cache

Shared

Cache

PCIe Root Complex

Home Agent and Memory Controller

Multi-CoreCPU

CPU 1

Core 1

Control

ALU

L1 Cache

L2 Cache

Core 2

Control

ALU

L1 Cache

CPU 2

Core 1

Control

ALU

L1 Cache

L2 Cache

Core 2

Control

ALU

L1 Cache

DRAM DRAM

Coherency

CPU Socket 1 CPU Socket 2

Modified (M)Exclusive (E) Shared (S)Invalid (I)

PW/S PR/~SBW

BR+BW

BW

PR/S

PR+PW

BR

PR+BR

BR

PW/~S

BW

PW

PR

PW

Invalid(I)

Shared(S)

Exclusive(E)

Modified(M)

PR = processor readPW = processor write

BR = observed bus readBW = observed bus write

S/~S = shared or not shared

Page 23: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 23| WESTERN DIGITAL CONFIDENTIAL

OmniXtend™ : A Cache Coherent Memory Fabric

Western Digital and SiFivecollaborating on extension of TileLinkcache coherency protocol over ethernet.

• Leverages ubiquitous Ethernet

• Routable and switchable fabric

• Scalable as Ethernet speeds increase

OmniXtend initial specification and FPGA bitstream are now live at https://github.com/westerndigitalcorporation/omnixtend

Page 24: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 24| WESTERN DIGITAL CONFIDENTIAL 24

RDMA Networking latency

• As a direct memory transfer protocol RDMA is already suited to accessing remote byte-addressable persistent memory devices

• Typical latency for RDMA reads: 1.6-1.8us

– For PCIe attached RNICs on host and target nodes

• Simultaneous PCIe tracing reveals as little as 600ns spent actually transferring data over the network

– 70% of latency is spent in memory or PCIe-protocol overhead

• How much can be saved by attaching memory directly to network fabric (i.e., no PCIe overhead)

– Maximal benefit (~1.0 us) would require integrated RNIC interfaces on both Initiator and Target

Page 25: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 25| WESTERN DIGITAL CONFIDENTIAL 25

RDMA to persistent memory latency

• Raw RDMA access to remote PCM via PCIe peer2peer is 26% slower than to remote DRAM

Page 26: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 26| WESTERN DIGITAL CONFIDENTIAL

Providing cache coherence

• Modified: The node has the only valid cached copy which contains dirty data with read and write permission

• Exclusive: The node has read and write permission on a cached copy which is valid and clean

• Shared: The node has read-only permission on a shared cached copy which is valid, clean

• Invalid: The node has no permission on that block

Memory block states

Page 27: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 27| WESTERN DIGITAL CONFIDENTIAL

Providing cache coherence

• A Master Agent must first obtain necessary permissions on a specific memory block through transfer operations to perform read and/or write operations on that specific memory block– None

– Read

– Read+write

• Memory operations are performed through exchanging coherence messages between master and slave agents:– Acquire: A request message from a Master Agent to obtain appropriate permissions on that address

along with a copy of that data block

– Grant Data: A response message from a Slave Agent to acknowledge the reception of an acquire and grant permission to the original requesting Master Agent along with a copy of the data block

– Release: A request message from a Master Agent to downgrade its permissions on a cached copy

– Probe: A request message from a Slave Agent to query, modify or revoke the permissions of a cached copy stored by a particular Master Agent

Coherence messages

Page 28: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 28| WESTERN DIGITAL CONFIDENTIAL

Providing cache coherence

Master (Requester)

I → S

Slave

S → S

E→ S

Acquire to read

Grant Data

Master(Requester)

I → E

Slave

E → I

Acquire to read+write

Grant Data

Master(Sharer)

M → I

Master(Requester)

I → E

Slave

I → I

Acquire to read+write

Grant Data

Probe

ProbeACK and Data

Master(Sharer)

S → I

Master(Requester)

I → E

S → E

Slave

S → I

Acquire to read+write

Grant Data Master(Sharer)

S → I

Master(Sharer)

M → S

Master (Requester)

I → S

Slave

I → S

Acquire to read

Grant Data

Probe

ProbeACK and Data

Master(Requester)

M → I

Slave

I → E

Release + Data

ReleaseACK

(a) Upgrading permissions and transmissions from I to S to perform read operation

(b) Upgrading permissions and transmissions from I/S to E to perform read and write operation

(c) Downgrading the permission on a black and state transmission from M/E to I to relinquish that copy back

Master(Requester)

E → I

Slave

I → E

Release

ReleaseACK

State transitions

Page 29: Datacenter CPUs and SoCs with OmniXtend fabric · Data centers’ memory are struggling to keep up with data explosion I/O struggling to keep up with CPU innovations Big data and

4/16/2020©2019 Western Digital Corporation or its affiliates. All rights reserved. 29| WESTERN DIGITAL CONFIDENTIAL

OmniXtend memory centric system

RISC-V node