Architecting a multi-core server SoC for the cloud Barry Wolford Sr. Director of Technology, Chief SoC Architect Qualcomm Datacenter Technologies, Inc. Linley Processor Conference October 4-5, 2017

Qualcomm Centriq Architecting a multi-core server SoC for ......•Distributed architecture Increase parallelism Maximize concurrency •Targeting Cloud and throughput-oriented workloads


Architecting a multi-core server SoC for the cloud

Barry Wolford
Sr. Director of Technology, Chief SoC Architect
Qualcomm Datacenter Technologies, Inc.

Linley Processor Conference October 4-5, 2017


Agenda

• Seeding the Cloud…

• Qualcomm Centriq™ 2400 Overview

• Selected Architecture Features

• Summary

Qualcomm Centriq is a product of Qualcomm Datacenter Technologies, Inc.


The shift to the cloud…

Traditional enterprise

Monolithic | Stateful | OS or VM bound | Scale up | Silo’d

Cloud environments

Microservices | Mix of stateless / stateful | Containerized | Scale out | DevOps | Multi-tenant

More than 50 percent of servers sold by 2020 will be deployed for cloud computing services*

*Source: IDC


…driving new requirements for datacenter infrastructure…

Throughput scalability

Performance at scale

Power efficiency

Workload-optimized infrastructure tiers

Application level redundancy

Efficient resource pooling

Cloud environments

Microservices | Mix of stateless / stateful | Containerized | Scale out | DevOps | Multi-tenant


…driving new considerations in processor architecture…

Among those are…

• Aggregate Performance
◦ Throughput, concurrency, parallelism

• Thread density
◦ VM-hosting, multi-instance/multi-tenant

• Thread isolation
◦ Reliable performance, SLAs

• Quality of service
◦ SLAs, “noisy neighbors”, tail latencies

• Power efficiency
◦ Performance/Watt

Qualcomm Centriq 2400


Purpose-built for the Cloud

• World’s First 10nm Server Processor

• Qualcomm® Falkor™ CPU
◦ Qualcomm's 5th-generation custom core design
◦ ARMv8-compliant / AArch64 only

• Highly integrated Server SoC
◦ Single-chip platform-level solution
◦ Integrated South Bridge

• High core count (up to 48 cores)
◦ High-performance single-threaded CPU
◦ 1 CPU per “thread”

• Distributed architecture
◦ Increase parallelism
◦ Maximize concurrency

• Targeting Cloud and throughput-oriented workloads
◦ Virtualization and Containerization
◦ Multi-instance and Multi-tenancy

Qualcomm Falkor is a product of Qualcomm Datacenter Technologies, Inc.


SoC Overview

[Figure: QDF2400 block diagram. Falkor duplexes (each with two CPUs, per-CPU L1 caches, and an L2) sit on a coherent ring together with the L3 cache, DDR4 memory controllers, PCIe Gen3, SATA, IMC, DMA, and low-speed I/O.]


Foundational Elements

[Figure: SoC floorplan. 24 Falkor duplexes, 12 L3 blocks, and 6 memory controllers (MC) with DDR channels, plus PCIe and SATA blocks on 8-lane SerDes groups, USB, IMC, and other low-speed peripherals, all connected by the Coherent Segmented Ring Interconnect.]


Falkor Core Duplex

Qualcomm System Bus is a product of Qualcomm Technologies, Inc.


LLC & Memory

[Figure: SoC floorplan showing 24 Falkor duplexes, 12 L3 blocks, and 6 memory controllers (MC) with DDR channels, plus PCIe, SATA, and low-speed I/O blocks.]


On-Chip Interconnect

Even Interleave - CW

Odd Interleave - CW

Even Interleave - CCW

Odd Interleave - CCW

[Figure: SoC floorplan (24 Falkor duplexes, 12 L3 blocks, 6 memory controllers with DDR channels, I/O blocks) with the four ring paths from the legend above overlaid.]

Coherent Segmented Ring Interconnect
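The legend names four ring paths: even and odd interleave, each in a clockwise (CW) and counter-clockwise (CCW) direction. The slide does not say how a request is steered onto one of them, so the following is only a minimal sketch under stated assumptions: the interleave is chosen by the parity of the cache-line address (with a 128-byte line, borrowed from the memory compression slide), and the direction is whichever needs fewer hops. The function name, the stop numbering, and the stop count are all hypothetical.

```python
LINE_BYTES = 128  # assumption: line size taken from the compression slide


def select_ring(addr, src_stop, dst_stop, num_stops):
    """Pick one of four rings for a request (illustrative model only).

    Returns (interleave, direction, hops), where interleave is "even" or
    "odd" and direction is "CW" or "CCW".
    """
    # Assumption: even/odd interleave follows the parity of the line address.
    interleave = "odd" if (addr // LINE_BYTES) & 1 else "even"

    # Hop counts in each direction around a ring with num_stops stops.
    cw_hops = (dst_stop - src_stop) % num_stops
    ccw_hops = (src_stop - dst_stop) % num_stops

    # Assumption: take the shorter direction, preferring CW on a tie.
    direction = "CW" if cw_hops <= ccw_hops else "CCW"
    return interleave, direction, min(cw_hops, ccw_hops)
```

With a hypothetical 12-stop ring, a request for address 0x80 from stop 2 to stop 11 would travel on the odd CCW ring in 3 hops rather than 9 hops clockwise. Splitting traffic this way is one plausible reading of how four rings raise aggregate bandwidth.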


Distributed LLC & DDR

[Figure: SoC floorplan with the 12 L3 blocks and 6 memory controllers (MC) highlighted to show their distribution among the 24 Falkor duplexes.]


Distributed PoC & Snoop Filter

[Figure: SoC floorplan with 24 PoS/PoC and snoop filter (PoS/PoC/SnpF) blocks distributed across the chip, shown with the 24 Falkor duplexes, 12 L3 blocks, and 6 memory controllers.]


Distributed IOMMUs

[Figure: SoC floorplan (24 Falkor duplexes, 12 L3 blocks, 6 memory controllers, PCIe, SATA, and low-speed I/O blocks) illustrating the distributed IOMMUs; no IOMMU labels survive in the text.]


L3 Quality of Service (QoS) Extensions

Shared Resource Contention
- Distributed L3 Cache
- Limited/No Allocation Policy Enforcement

QoS Extensions:
• Hardware Abstracted QoS Domain Identifier
◦ Per Client (Core/Virtual Machine, IO/Virtual Function)
• Per-Resource Monitoring and Way-based Allocation
◦ Monitor Utilization per QoSID per L3
◦ Policy Enforcement per QoSID per L3
◦ Instruction/Data Granularity
• Fine-Tune Cache Allocation per Thread or Class of Threads

[Figure: two panels, "No L3 QoS" and "L3 QoS". In each, VM/Thread 0 on CPU 0, VM/Thread 1 on CPU 1, and IO/VF 0 on Device 0 share the L3.]

Improved cache utilization and per-workload performance (lower application latency) for critical workloads.
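The slide names way-based allocation and per-QoSID monitoring but gives no implementation detail. As a minimal sketch of the general technique, assuming a single cache set, a per-QoSID bitmask of ways that may be allocated into, and LRU replacement within the allowed ways, the behavior could be modeled as below. The class name, mask encoding, and replacement policy are all hypothetical and are not taken from the Centriq 2400 design.

```python
class WayPartitionedSet:
    """Toy model of one cache set with per-QoS-ID way masks (hypothetical)."""

    def __init__(self, num_ways, way_masks):
        self.num_ways = num_ways
        # qos_id -> bitmask; bit w set means this QoS ID may allocate in way w.
        self.way_masks = dict(way_masks)
        self.tags = [None] * num_ways      # tag stored in each way
        self.owner = [None] * num_ways     # QoS ID that allocated each way
        self.last_use = [0] * num_ways     # timestamp for LRU ordering
        self.clock = 0

    def access(self, qos_id, tag):
        """Return True on a hit; on a miss, allocate within the allowed ways."""
        self.clock += 1
        # Lookups may hit in any way; the mask restricts allocation only.
        for way in range(self.num_ways):
            if self.tags[way] == tag:
                self.last_use[way] = self.clock
                return True
        mask = self.way_masks[qos_id]
        allowed = [w for w in range(self.num_ways) if (mask >> w) & 1]
        if not allowed:
            return False  # an empty mask acts as a no-allocate policy
        # Prefer an empty way, otherwise evict the least recently used one.
        victim = min(allowed,
                     key=lambda w: (self.tags[w] is not None, self.last_use[w]))
        self.tags[victim] = tag
        self.owner[victim] = qos_id
        self.last_use[victim] = self.clock
        return False

    def occupancy(self, qos_id):
        """Monitoring hook: number of ways currently held by this QoS ID."""
        return sum(1 for o in self.owner if o == qos_id)
```

In this model, giving a latency-critical VM ways 0 and 1 and a streaming neighbor ways 2 and 3 means the neighbor can only evict its own lines, which is the "noisy neighbor" isolation the slide is pointing at.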


Memory Bandwidth Compression

Constrained Memory Bandwidth
- Channel-limited peak MT/s
- Limited number of DDR Channels

Bandwidth Compression:
• Proprietary algorithm
• Inline compression within Memory Controllers
◦ Fully transparent to software
◦ Compress 128B line to 64B when possible
◦ ECC is encoded with compression bit
◦ Very low latency decompression: 2 – 4 cycles
• Effective on compressible, bandwidth-intensive workloads

[Figure: Uncompressed Memory (128B lines), stored as 64B halves: 0a 0b 1a 1b 2a 2b 3a 3b 4a 4b 5a 5b 6a 6b 7a 7b 8a 8b 9a 9b Aa Ab Ba Bb]

[Figure: Compressed Memory: 0 1 2a 2b 3 4 5a 5b 6a 6b 7a 7b 8 9a 9b A Ba Bb]

[Figure: Memory Access Stream without Bandwidth Compression: 0a 0b 1a 1b 2a 2b 3a 3b 4a 4b 5a 5b 6a 6b 7a 7b 8a 8b]

[Figure: Memory Access Stream with Bandwidth Compression: 0 1 2a 2b 3 4 5a 5b 6a 6b 7a 7b 8 9a 9b A Ba Bb]

Increased effective memory bandwidth and reduced power for compressible workloads.
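The effect in the figure can be checked with simple arithmetic. The sketch below assumes each 128B line costs two 64B transfers when uncompressed and one when it compresses to 64B, and it reads the figure as marking lines 0, 1, 3, 4, 8, and A as compressed out of twelve. That reading and the function names are assumptions for illustration; the compression algorithm itself is proprietary and is not modeled.

```python
LINE_BYTES = 128   # uncompressed line size from the slide
HALF_BYTES = 64    # compressed size from the slide


def transfers_needed(compressible):
    """Count 64B transfers for a sequence of lines.

    compressible: iterable of booleans, True if the line fits in 64B.
    """
    return sum(1 if fits else 2 for fits in compressible)


def effective_bandwidth_gain(compressible):
    """Ratio of uncompressed transfers to transfers with compression."""
    flags = list(compressible)
    uncompressed = len(flags) * (LINE_BYTES // HALF_BYTES)
    return uncompressed / transfers_needed(flags)


# Assumed reading of the figure: lines 0, 1, 3, 4, 8 and A are compressed.
COMPRESSED_LINES = {0x0, 0x1, 0x3, 0x4, 0x8, 0xA}
figure_flags = [line in COMPRESSED_LINES for line in range(0xC)]
```

Under that reading, twelve lines need 18 transfers instead of 24, a gain of about 1.33x. This agrees with the access-stream rows, where the same 18 transfer slots cover lines 0 to 8 without compression and lines 0 to B with it.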


Summary


Thank you

For more information, visit us at: www.qualcomm.com & www.qualcomm.com/blog

Nothing in these materials is an offer to sell any of the components or devices referenced herein.

©2017 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.

Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners.

References in this presentation to “Qualcomm” may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT.