Exploring ARM’s Cache Coherent Network Technology to Handle Exponential Data-Flow Growth

Winnie Shao, Server and Enterprise Marketing Manager, ARM

ARM Technology Symposia, Nov 25th~29th, 2013 – SH, BJ & SZ

armtechforum.com.cn/2013/2_ExploringARMCache... · 12/6/2013



Page 2

Information Society Driving Enterprise Data Demands

From Sensor to Server: Mobile/IoT, Network, Cloud

Wireless Networks

Residential Gateways

Business Gateways

Network Infrastructure: Backbone, Edge

Server Infrastructure: Routers, Switches, Servers, Storage

Scalable High-Bandwidth I/O: 1Gb/s to 100Gb/s

Throughput, Density, Scalability

Advanced RAS

Choice, Rapid Innovation, TCO

Page 3

The Operator Challenges

18X data traffic increase on mobile networks, 2011-16

107% LTE Capex CAGR, 2010-2014

~80% of network energy consumption attributable to base stations

Source: Cisco Systems, ISuppli, NSN, IEEE
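As a rough sanity check on the 18X figure, the multiple can be converted into an implied compound annual growth rate. This is a minimal sketch: it assumes 2011-16 means five yearly compounding periods, which the slide does not state, and `implied_annual_growth` is a name chosen for this illustration only.

```python
def implied_annual_growth(multiple: float, years: int) -> float:
    """Return the constant yearly growth rate that yields `multiple` after `years`."""
    if multiple <= 0 or years <= 0:
        raise ValueError("multiple and years must be positive")
    return multiple ** (1.0 / years) - 1.0


# Assumption: 2011 -> 2016 is treated as five compounding periods.
mobile_traffic_rate = implied_annual_growth(18.0, 5)
print(f"Implied mobile data traffic growth: {mobile_traffic_rate:.0%} per year")
```

Under that assumption the 18X increase corresponds to roughly 78% growth per year, which is a different measure from the 107% LTE Capex CAGR quoted for 2010-2014.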

Page 4

2013-2016 Infrastructure Trends

3m small cells forecast (2016)

Heterogeneous form factors

Rapid deployment of new services

Server capacity at edge

Scalable processing

Open software platforms

[Diagram labels: 3G, 4G, 2016; Macro, Micro, Small cell]

Page 5

2013-2016 Enterprise Trends

Estimated 20x increase in network traffic

IT is the business

Density, power and cost over raw perf.

One size no longer fits all

Optimized highly integrated SoCs

Open source and end user S/W

Page 6

Server/Networking Equipment – Tomorrow…

One size does not fit all

Increased server specialization

Network flexibility: SDN / NFV

Traditional 2P 2U server

Traditional networking equipment

Software Defined <X>

Accelerated innovation

Flexibility, Manageability, Scalability, Efficiency, Choice

Page 7

ARM Enterprise Building Blocks

Server

Control Plane

Data Plane: Cortex®-A7, Cortex-A53, Future

Features: Acceleration, Scalability

Cloud, Advanced RAS, Security, Reliability

Pipe

Cortex-A15, Cortex-A57, Future

Single-thread Performance

Features: 64-bit, ECC, Advanced RAS

Scalability

AMBA® 5 CHI – coherency for scalable efficient performance

CoreLink™ CCN-500 Family, Future

Throughput density, Scalability, 1Gb-100Gb

Page 8

LSI Delivers Axxia® 5500 Communication Processor for Faster, More Power-Efficient Networks

CoreLink CCN-504 SoC delivered to OEMs

CoreLink CCN-504 suitable for mobile and fixed networks

Based on CoreLink CCN-504

Use of CoreLink CCN-504 supports a variety of network configurations

Wide deployment of CoreLink CCN-504...

“Represent over half of the mobile base station market”…

World’s first cache-coherent, 16-core SMP ARM processor

…“With AXM5500 in hand, our customers are developing next-generation networking systems with the performance, intelligence and scalability to keep pace with the unprecedented growth in network traffic,” said Gene Scuteri, vice president of engineering, Networking Solutions Group, LSI. “The Axxia 5500 multicore family, with its unmatched, highly integrated design and broad features, is an ideal platform on which to build high-performance systems.”

Page 9

CoreLink CCN Pedigree

Physically-Optimized Microarchitecture

Based on the proven CCN-504 microarchitecture and components:

Fully functional silicon running at 1.5 GHz+ in 28 HPM (achieved)

Leverages extensive verification: trillions of cycles in simulation, co-emulation with ARM IP

Proven, non-blocking architecture (AMBA 5 CHI) and microarchitecture (CoreLink CCN-504)

Detailed Microarchitectural Performance Model

Cycle-accurate SystemC performance model: correlated to within 3% of RTL

Extensive performance tuning with numerous internal and external workloads

SystemC TLM interfaces for connecting partner IP or ARM-provided traffic generators

Page 10

CoreLink CCN-508: High-end Networking/Server

[Block diagram: ARM CoreLink™ CCN-508 Cache Coherent Network with Snoop Filter and 1-32MB L3 cache; eight Quad Cortex-A57/A53 clusters, each with L2 cache; four CoreLink DMC-520 memory controllers (x72 DDR4-2667) with PHY; IO Virtualisation with System MMU; PCIe, 10-40GbE, DPI, Crypto, DSP, SATA and USB blocks; NIC-400 Network Interconnect with Flash and GPIO; Interrupt Control]

Up to 4 cores per cluster

Up to 8 coherent clusters

Integrated L3 cache

Quad channel DDR3/4 x72

Up to 24 AMBA interfaces for I/O coherent accelerators and IO

Heterogeneous processors – CPU, GPU, DSP and accelerators

Virtualized Interrupts

Uniform System memory

Peripheral address space
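The configuration limits listed on this slide can be captured in a small checker. This is only an illustrative sketch: `Ccn508Config`, its field names and the lower bounds are hypothetical and do not correspond to any ARM configuration tool; only the upper limits (4 cores per cluster, 8 clusters, 1-32MB L3, quad channel DRAM, 24 AMBA interfaces) come from the slide.

```python
from dataclasses import dataclass

# Upper limits as listed on the slide; lower bounds are assumptions.
LIMITS = {
    "clusters": (1, 8),
    "cores_per_cluster": (1, 4),
    "l3_mb": (1, 32),
    "dram_channels": (1, 4),
    "io_interfaces": (0, 24),
}


@dataclass(frozen=True)
class Ccn508Config:
    """Hypothetical description of one CCN-508 based SoC configuration."""
    clusters: int
    cores_per_cluster: int
    l3_mb: int
    dram_channels: int
    io_interfaces: int

    def validate(self) -> None:
        # Reject any field that falls outside the range in LIMITS.
        for name, (low, high) in LIMITS.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} outside {low}..{high}")

    @property
    def total_cores(self) -> int:
        return self.clusters * self.cores_per_cluster


full = Ccn508Config(clusters=8, cores_per_cluster=4, l3_mb=32,
                    dram_channels=4, io_interfaces=24)
full.validate()
print(f"Maximum coherent cores: {full.total_cores}")
```

With every limit at its maximum, the configuration reaches 8 x 4 = 32 coherent cores.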

Page 11

Traffic Mixes in Networking Infrastructure

The CCN interconnect can scale to different networking solutions, which use different traffic mixes

You can implement your SoC with ARM CPUs, ARM interconnect and DSPs/accelerators

CCN-5xx (C-programmable, mixed traffic types under Linux OS)

Control Processing

User Interface/Scheduling Processing

Backhaul/Packet Processing

Page 12

ARM Networking Use-Cases: BTS

BTS Application

Many-core processing for thread-able workloads

CCN-508 based with mix of Cortex-A57 and A53 with acceleration

L3 Coherency managed by CCN-508

BTS, Wireless Backhaul, Radio-Network-Controller

Security appliances (DPI)

Clusters of 4 cores, moderate frequency

Control Processing, Packet Processing, UE Scheduling

Priority 1, Priority 2, Priority 3

[Block diagram: ARM CoreLink™ CCN-508 Cache Coherent Network with Snoop Filter and 1-32MB L3 cache; eight Quad Cortex-A57/A53 clusters, each with L2 cache; four CoreLink DMC-520 memory controllers (x72 DDR4-3200) with PHY; IO Virtualisation with System MMU; PCIe, 10-40GbE, DPI, Crypto, DSP, SATA and USB blocks; NIC-400 Network Interconnect with Flash and GPIO; Interrupt Control]

Page 13

ARM Networking Use-Cases: CDN/Core

Control Plane/Server Application

EPC, Core Network, Content at Edge

High-performance, single thread cores

Optimized for high frequency

Multiple Cortex-A57/Next Gen. clusters, low core count per cluster

I/O, Storage and Compute Virtualization (KVM/XEN)

Control Processing, Packet Processing, UE Scheduling

Priority 1, Priority 2, Priority 3

Page 14

ARM Partner designs for the Data Center

Partners developing variety of solutions

Leveraging their expertise

More choices for applications

[Partner block diagrams, including the AXE4500 Architecture Diagram. Labels: DDR3/L Controller (64 bits + ECC); Device Bus, Nand, SPI, UARTs, I2C, SDIO; 3 x USB Host/Dev; USB Phy x3; 4 x IDMA, 4 x XOR; 2 x Security Engine; TDM I/F, 32 VoIP Channels; 2 x SATA II; 4 x GbE / QSGMII; PCI-E; 10 lanes SERDES; 16 lanes SERDES; Advanced Power Mng.; Secured Boot; Coherency Fabric; System Crossbar; Shared L2/SRAM 1MB; Shared L2/SRAM 2MB; 2x ARM v6/v7 CPU, w/FPU @ 2GHz; 4x Sheeva CPU, w/FPU; Network Compute Adapter; DDR3/4 Controllers; L3 Cache; Modular Packet Processor; Memory Management Engine; Interfaces & Peripherals; eight PPE blocks; Virtual Pipeline Task Ring; four ARM A15 cores; Shared L2 Cache]

Page 15

ARM Partner designs for the Data Center

ARM Processor: Multi-Core, Quad to 16 core, 1.4-2.4 GHz, 32-bit and 64-bit

ARM Fabric

[Partner block diagrams repeated from page 14]

Page 16

ARM Partner designs for the Data Center

ARM Processor: Multi-Core, Quad to 16 core, 1.4-2.4 GHz, 32-bit and 64-bit

ARM Fabric

Partner differentiated Fabric, Storage, Networking and Security IP

[Partner block diagrams repeated from page 14]

Page 18

Cortex-A57: Flagship Debut of 64-bit

Leading the transition to ARMv8

Streamlined 64-bit ISA

Extended 32-bit functionality

Hardware-accelerated cryptography

Enhanced IEEE SP/DP floating point

Extended soft-error recovery

Highest Performance for next-generation

Improved performance on 32-bit, 64-bit code and floating point

Up to 10x improvement on crypto algorithms

Larger working sets, better branch prediction

Greater focus on memory subsystem

Extremely scalable system architecture

Next-generation CCN for advanced scalable systems

Aligned with Cortex-A53 for big.LITTLE

Highest Performance 64-bit ARM Core

[Block diagram: Cortex-A57 MPCore, Core 1 to 4. Each Cortex-A57 core: ARMv8 32b/64b core, NEON SIMD engine, Floating Point Unit, 48k I-Cache w/Parity, 32k D-Cache w/ECC. Shared: ACP, SCU, L2 w/ECC (512kB ~ 2MB), 128-bit AMBA 4 or AMBA 5 CHI Coherent Bus Interface]

Page 19

The Most Efficient 64b Core

Cortex-A53 – Extreme Power Efficiency

Power efficient performance

In-order, 8-stage, dual-issue pipeline

Improved integer, NEON™, FPU, and memory performance

ARMv8-A

64b support, and 32b w/AArch32

Architectural enhancements

Scalable for many markets

big.LITTLE™ companion to Cortex-A57

AMBA4 ACE and AMBA 5 CHI interfaces

[Block diagram: Cortex-A53 MPCore, Core 1 to 4. Each Cortex-A53 core: ARMv8 32b/64b core, NEON SIMD engine, Floating Point Unit, 8-64K I-Cache with parity, 8-64K D-Cache with ECC. Shared: SCU, ACP, L2 Cache w/ECC (128KB – 2MB), 128-bit AMBA 4 or AMBA 5 CHI Coherent Bus]

Page 20

High Performance Memory Access

CoreLink DMC-520

5th Generation ARM DMC

ECC and RAS features

Performance Profiling

TrustZone Address Space Control

End to end traffic management

System wide QoS, designed and verified with ARM CPUs and CoreLink CCN

Performance: Max bandwidth > 21GB/s per channel; buffering to optimize read/write turnaround

Interfaces: AMBA 5 CHI direct connection to CCN-508; industry standard DFI-3.0 to connect to PHY

Memory: Support for x72 DRAM; DDR3, DDR3L and DDR4 up to DDR4-2667

Low Power Support: Programmable DRAM power modes
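The "> 21GB/s per channel" figure is consistent with the theoretical peak of a DDR4-2667 channel. The sketch below works through that arithmetic; it assumes that an x72 channel carries 64 data bits plus 8 ECC bits (the usual ECC DIMM layout, not stated on the slide) and that 1 GB means 10^9 bytes. The function name is illustrative.

```python
def ddr_peak_bandwidth_gbs(mega_transfers_per_sec: float, data_bits: int = 64) -> float:
    """Theoretical peak bandwidth of one DRAM channel in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = data_bits / 8
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer / 1e9


# Assumption: x72 = 64 data bits + 8 ECC bits, so only 64 bits carry payload.
per_channel = ddr_peak_bandwidth_gbs(2667)
# Four channels, as in the quad channel CCN-508 configuration.
quad_channel = 4 * per_channel

print(f"DDR4-2667 per channel: {per_channel:.1f} GB/s")
print(f"Quad channel total:    {quad_channel:.1f} GB/s")
```

These are peak numbers; sustained bandwidth depends on read/write turnaround and bank behaviour, which is what the controller's buffering is meant to improve.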

Page 21

CoreLink End-to-End QoS Architecture

Enterprise Requirements

“Bursty” Datapath

High peak bandwidth

Multiple interfaces

“Compute intensive” Control Path

Lowest latency requirements

“Latency Sensitive” Air interface

Predictable Latency

CoreLink End-to-End QoS

QoS field for every transaction

Fixed, programmable or regulated

End-to-End propagation

Programmable QoS mechanisms

Regulation of all traffic on ingress

Bandwidth management

Latency management

QoS re-ordering and arbitration

Interconnect ingress

Non-blocking transport

L3 cache

DMC

[Diagram labels: CPU, Accel, IO, L3, DDR]
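Two of the mechanisms named on this slide, regulation of traffic on ingress and QoS-based arbitration, can be illustrated with a small model. This is a hedged sketch and not ARM's implementation: the token-bucket regulator, the class names, and the convention that the QoS value is a 4-bit field where a larger value means higher priority are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass
from itertools import count


@dataclass
class IngressRegulator:
    """Token bucket: a hypothetical stand-in for per-port bandwidth regulation."""
    bytes_per_cycle: float   # long-term bandwidth budget
    burst_bytes: float       # largest amount of credit the port may store
    credit: float = 0.0

    def tick(self) -> None:
        # Accumulate credit each cycle, capped at the burst size.
        self.credit = min(self.burst_bytes, self.credit + self.bytes_per_cycle)

    def admit(self, size_bytes: int) -> bool:
        # Admit the transaction only if enough credit has accumulated.
        if self.credit >= size_bytes:
            self.credit -= size_bytes
            return True
        return False


class QosArbiter:
    """Serves the pending transaction with the highest QoS value, oldest first on ties."""

    def __init__(self) -> None:
        self._heap = []
        self._order = count()

    def push(self, qos: int, source: str) -> None:
        if not 0 <= qos <= 15:
            raise ValueError("QoS value assumed to be a 4-bit field")
        # Negate qos so the min-heap pops the highest priority first.
        heapq.heappush(self._heap, (-qos, next(self._order), source))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]


regulator = IngressRegulator(bytes_per_cycle=32, burst_bytes=64)
admitted_before = regulator.admit(64)   # no credit yet
regulator.tick()
regulator.tick()
admitted_after = regulator.admit(64)    # two cycles of credit cover 64 bytes

arbiter = QosArbiter()
for qos, source in [(3, "io"), (12, "cpu"), (3, "accel")]:
    arbiter.push(qos, source)
service_order = [arbiter.pop() for _ in range(3)]
print(admitted_before, admitted_after, service_order)
```

In this model the latency-critical source gets the high QoS value and is served first, while bursty sources are shaped at ingress so they cannot exceed their budget.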

Page 22

CoreLink CCN-508: Performance

250GB/s peak, 160GB/s sustained interconnect bandwidth

300GB/s peak IO/accelerator bandwidth (50GB/s per port)*

Integrated snoop-filter for highly scalable coherency

Sophisticated end-to-end QoS and system-level power-management

Quad channel DRAM support for DDR3, DDR3L and DDR4 up to DDR4-2667

* At equivalent processor frequencies
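The slide lists an integrated snoop filter as the basis for scalable coherency. The general idea is to track which clusters may hold a cache line so that a request snoops only those clusters instead of broadcasting to all of them. The sketch below illustrates that idea only; the class, its methods and the exact tracking policy are hypothetical and say nothing about how CCN-508 organises its filter.

```python
from collections import defaultdict


class SnoopFilter:
    """Directory-style sketch: records which clusters may hold each cache line."""

    def __init__(self, num_clusters: int) -> None:
        self.num_clusters = num_clusters
        self._sharers = defaultdict(set)   # line address -> set of cluster ids

    def record_fill(self, line_addr: int, cluster: int) -> None:
        # A cluster has brought the line into its cache.
        self._sharers[line_addr].add(cluster)

    def record_evict(self, line_addr: int, cluster: int) -> None:
        # A cluster has dropped the line; forget empty entries.
        self._sharers[line_addr].discard(cluster)
        if not self._sharers[line_addr]:
            del self._sharers[line_addr]

    def snoop_targets(self, line_addr: int, requester: int) -> set:
        # Only clusters recorded as holding the line need to be snooped.
        return set(self._sharers.get(line_addr, ())) - {requester}

    def broadcast_targets(self, requester: int) -> set:
        # Without a filter, every other cluster would be snooped.
        return set(range(self.num_clusters)) - {requester}


snoop_filter = SnoopFilter(num_clusters=8)
snoop_filter.record_fill(0x1000, cluster=2)

filtered = snoop_filter.snoop_targets(0x1000, requester=5)
unfiltered = snoop_filter.broadcast_targets(requester=5)
print(f"Filtered snoops: {len(filtered)}, broadcast snoops: {len(unfiltered)}")
```

With eight clusters, a filtered request for a line held by one other cluster sends one snoop where a broadcast would send seven, which is why the saving grows with cluster count.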

Page 23

Linaro Networking Group (LNG)

Vertical subgroup launched Feb’13

Optimized, open-source networking platform software for scalable networks

Coordinates and multiplies members’ efforts, accelerates product TTM

LNG Members: Sept 2013

Page 24

Linaro Enterprise Group (LEG)

Existing and new members will deliver optimized core open-source software for ARM servers

Reduces costs, eliminates fragmentation, accelerates product time to market

Enables ARM Server vendors to focus on innovation and differentiated value-add

www.linaro.org/server

Page 25

Redefining the Server and Network Equipment

Summary

A true partnership = innovation and choice

CoreLink™ CCN-508 Cache Coherent Network IP

CoreLink™ DMC-520 Dynamic Memory Controller IP
