© 2017 Arm Limited Arm Tech Symposia 2017
CCIX: a new coherent multichip interconnect for
accelerated use cases
Akira Shimizu | Senior Manager, Operator relations | Arm
Interconnects at different scales
SoC interconnect.
• Connectivity for on-chip processor, accelerator, IO and memory elements
Server node interconnect - ‘scale-up’.
• Simple multichip interconnect (typically PCIe) topology on a PCB motherboard with simple switches and expansion connectors
Rack interconnect - ‘scale-out’.
• Scale-out capabilities with complex topologies connecting 1000’s of server nodes and storage elements
Key drivers for interconnect technology
• Decline of Moore’s law forcing more heterogeneous compute.
• Big data analytics growing at 11.7% CAGR.
• 5G wireless applications requiring 10x more bandwidth, 10x lower latency by 2021.
• Increase in distributed data forcing more network intelligence at faster data rates (10GbE -> 100GbE -> 400GbE).
• Data bandwidth and sharing growth projected at 10x-50x increase vs. present PCIe by 2021.
CCIX™ cache coherent interconnect for accelerators
New class of interconnect for accelerated applications.
Mission of the CCIX Consortium is to develop and promote adoption of an industry standard specification to enable coherent interconnect technologies between general-purpose processors and acceleration devices for efficient heterogeneous computing.
https://www.ccixconsortium.com/
CCIX Consortium Inc
Formed January 2016, incorporated in February 2017.
Complete ecosystem with 34 members and growing.
Hardware specification available for design starts for member companies.
CCIX pronounced: (c’ siks).
Membership tiers: Promoters, Contributors, Adopters. Members include Arteris, Inc.; Guizhou Huaxintong Semiconductor Technology Co., Ltd.; INVECAS INC; Netronome; Phytium Technology Co., Ltd.; PLDA; Shanghai Zhaoxin Semiconductor Co., Ltd.; Silicon Laboratories Inc.; SmartDV Technologies India Private Ltd.
Applications benefiting from CCIX
4G and 5G base stations
Data center search
Embedded computing
High-performance (super)computing
In-memory database processing
Intelligent network acceleration
Machine / deep learning
Mobile edge computing
Video analytics
CCIX multichip connectivity
High performance, low latency.
• CCIX defines 25GT/s (3x performance*)
• Examining 56GT/s (7x performance*) and beyond
• Enabling low latency via light transaction layer
Flexible, scalable interconnect topologies.
• Flexible point-to-point, daisy chained and switched topologies
Seamless integration.
• Runs on existing PCIe transport layer and management stack
• Supports all major instruction set architectures (ISA)
[Diagram: processor, accelerator, smart network and persistent memory connected through a CCIX switch.]
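As a rough sanity check of the “3x performance” claim above, the sketch below compares per-direction bandwidth of a 16-lane link at 8GT/s (PCIe Gen3) and 25GT/s. The 128b/130b encoding is an assumption carried over from PCIe Gen3/Gen4, not something this slide states.

```python
# Rough per-direction bandwidth of an n-lane serial link.
# 128b/130b encoding is an assumption (PCIe Gen3/Gen4 style),
# not a figure taken from the CCIX specification.
def link_bandwidth_gbytes(gt_per_s, lanes, encoding=128 / 130):
    bits_per_s = gt_per_s * 1e9 * lanes * encoding
    return bits_per_s / 8 / 1e9  # GB/s per direction

pcie_gen3 = link_bandwidth_gbytes(8, 16)   # ~15.8 GB/s
ccix_25gt = link_bandwidth_gbytes(25, 16)  # ~49.2 GB/s
print(ccix_25gt / pcie_gen3)               # ~3.1x, matching the "3x" claim
```

The encoding factor cancels in the ratio, so the speed-up is simply 25/8 ≈ 3.1x regardless of the encoding assumed.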
System topology examples
[Diagrams: topologies built from processors, accelerators, memory and a switch joined by CCIX links, including hybrid configurations that also use PCIe.]
Direct attached, daisy chain, mesh and switched topologies.
DMA Engines: The problem with traditional accelerators
Operating system vendors are interested in the opportunity for workload-optimized accelerators.
The traditional DMA approach is to provide a special (Linux) kernel driver for every unique accelerator.
• Requires skilled kernel developers (a driver for each accelerator); the failure mode is catastrophic (system crash/downtime)
The operating systems used tomorrow have already been deployed; updates are 9-12 months apart.
• Drivers must be in “upstream” Linux before we support them, a year+ turnaround for every accelerator
“Trilby”: a DMA-engine-driven, FPGA-based workload accelerator built by Jon Masters for research into the barriers to adoption in the enterprise; it uses the traditional approach of a kernel driver and operating system hacks.
Coherent virtual memory eliminates data transfer overhead
Non-coherent system without Shared Virtual Memory (SVM): software must manage cache maintenance and data copying (“clean and copy”) between processor and accelerator.
Cache coherent system with Shared Virtual Memory (SVM): hardware-managed cache maintenance, with a shared address space and direct memory access between processor and accelerator.
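The contrast above can be sketched as a toy model. The class and method names below are illustrative only, not a real CCIX or accelerator API:

```python
# Toy model of the two flows on this slide; names are illustrative.

class NonCoherentAccelerator:
    """Without SVM, software flushes caches and copies data
    to device-local memory, then copies results back."""
    def offload(self, host_buffer):
        local_copy = list(host_buffer)          # clean + DMA copy in
        local_copy = [x * 2 for x in local_copy]  # device computes
        return list(local_copy)                 # copy results back out

class CoherentSVMAccelerator:
    """With SVM, the device works on the host buffer in place;
    hardware keeps the caches coherent, so no copies are needed."""
    def offload(self, host_buffer):
        for i, x in enumerate(host_buffer):     # same virtual address
            host_buffer[i] = x * 2
        return host_buffer

data = [1, 2, 3]
print(CoherentSVMAccelerator().offload(data))  # [2, 4, 6]
```

Both paths produce the same result; the difference is that the SVM path eliminates the explicit copies and software cache maintenance.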
CCIX acceleration functions that just work in the cloud
[Diagram: a physical machine (processors, DRAM, caches, MMU, IOMMU, other resources and SoC devices) plus other external devices (disks, NICs, FPGAs, GPUs, crypto and other accelerators) hosting a Virtual Machine Monitor (VMM)/hypervisor, guest OSes, virtual machines, containers, apps, JVMs and VNFs across hyper-privileged, privileged and non-privileged levels, with firmware and option ROMs (optional, system-dependent); CCIX functions are exposed at each level.]
CCIX layered architecture
Protocol Layer – coherency protocol, memory read & write flows
Link Layer – formats CCIX messages for the target transport
Transaction Layer – adds optimized packets, manages credit-based flow control
Physical Layer – dual-mode PHY to support extended data rates
[Diagram: the CCIX Protocol Layer and CCIX Link Layer feed a CCIX Transaction Layer that sits alongside the PCIe Transaction Layer; CCIX messages and PCIe packets share the PCIe Data Link Layer and a common CCIX/PCIe Physical Layer (Tx/Rx).]
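The dual transaction layers over a shared link and PHY can be modeled as a toy pipeline. Everything below is illustrative only, not the CCIX specification:

```python
# Toy sketch: CCIX and PCIe transaction layers side by side,
# sharing one data link layer and PHY. Illustrative names only.

def ccix_transaction_layer(msg):
    return {"kind": "ccix", "body": msg}     # optimized CCIX packet

def pcie_transaction_layer(tlp):
    return {"kind": "pcie", "body": tlp}     # standard PCIe TLP

def data_link_layer(packet):
    packet["seq"] = 1                        # sequencing, CRC, retries...
    return packet

def phy_transmit(packet):
    return packet                            # serialize onto the wire

def send(payload, is_ccix):
    tl = ccix_transaction_layer if is_ccix else pcie_transaction_layer
    return phy_transmit(data_link_layer(tl(payload)))

print(send("snoop", True)["kind"])      # ccix
print(send("mem-read", False)["kind"])  # pcie
```

The point of the layering is visible in `send`: only the transaction-layer choice differs, so CCIX coherency traffic reuses the existing PCIe link and physical infrastructure unchanged.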
CCIX example request to home data flows
[Diagrams: four request-to-home flow examples built from requesting caches (ReqCache), home agents (Home), link agents (LA) and memory — accelerator shares processor memory; daisy chain to shared processor memory; shared processor and accelerator memory; shared memory with aggregation.]
Improved efficiency with CCIX transaction layer
• Reduced latency with a lightweight transaction layer
• Improved packet efficiency with an optimized CCIX header
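To see why a smaller header improves packet efficiency, consider the simple ratio below. The byte counts are placeholders chosen for illustration, not values from the CCIX specification:

```python
# Payload efficiency = payload / (payload + header).
# Header sizes here are illustrative placeholders, not CCIX spec values.
def efficiency(payload_bytes, header_bytes):
    return payload_bytes / (payload_bytes + header_bytes)

payload = 64                    # one 64-byte cache line
print(efficiency(payload, 24))  # heavier header: ~0.73
print(efficiency(payload, 12))  # optimized header: ~0.84
```

For cache-line-sized transfers the header is a large fraction of each packet, so shrinking it yields a direct gain in usable bandwidth.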
CCIX port aggregation to boost bandwidth and transactions
CCIX defines a hashing function to steer requests across multiple links.
Aggregation effectively multiplies the bandwidth.
Aggregation could also be used to increase the number of transactions (e.g., 50GT/s vs 25GT/s).
PCIe requires separate address spaces; requests cannot be hashed.
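Hash-based steering can be sketched as below. The hash here is a simple cache-line-granule modulo chosen for illustration; it is not the hashing function the CCIX specification defines:

```python
# Sketch of address-hash link steering for port aggregation.
# A cache-line modulo stands in for the CCIX-defined hash function.

LINE = 64  # bytes per cache line

def steer(address, num_links):
    """Pick which aggregated link carries the request for this address."""
    return (address // LINE) % num_links

# Consecutive cache lines alternate across two aggregated links,
# roughly doubling usable bandwidth for streaming accesses.
lines = [steer(a, 2) for a in range(0, 4 * LINE, LINE)]
print(lines)  # [0, 1, 0, 1]
```

Because every link can reach the same single address space, the hash can spread one request stream across all links, which is exactly what the separate address spaces of plain PCIe aggregation prevent.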
[Diagrams: CCIX with port aggregation — one requesting cache hashed across two links to a home agent and a single memory — versus PCIe with aggregation, where processor and accelerator caches must split memory (Mem0/Mem1) into separate address spaces behind separate PCIe links.]
CCIX SoC integration example
[Diagram: example CoreLink CMN-600 mesh design — crosspoints (XP) linking four DMC-620 memory controllers and CML/RN-I nodes, with CXS and AXI interfaces to third-party PCIe/CCIX IP; the PCIe and CCIX Transaction Layers share the Data Link Layer and a 16-lane PHY running at up to 25Gbps.]
Arm CCIX demonstration vehicle
Arm’s DynamIQ and CoreLink CMN-600 technology.
Cadence CCIX and PCIe controller and PHY IP.
TSMC 7nm process technology.
CCIX connectivity to Xilinx’s Virtex UltraScale+ FPGA.
Xilinx, Arm, Cadence, and TSMC Announce World's First CCIX Silicon Demonstration Vehicle in 7nm Process Technology
Scale-up server node performance with CCIX
CCIX is a class of interconnect providing high performance and low latency for new accelerator use cases.
Easy adoption and simplified development by leveraging today’s data center infrastructure.
IP available from Arm and ecosystem to optimize CCIX SoC today.
• Server, FPGAs, GPUs, network/storage adapters, intelligent networks and custom ASICs
For more information go to:
https://developer.arm.com
Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos!
© Arm 2017
The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks