Solros: A Data-Centric Operating System Architecture for Heterogeneous...

Preview:

Citation preview

Solros: A Data-Centric Operating System Architecturefor Heterogeneous Computing

Changwoo Min, Woonhak Kang, Mohan Kumar, Sanidhya Kashyap,Steffen Maass, Heeseung Jo, Taesoo Kim

Virginia Tech, eBay, Georgia Tech, Chonbuk National University

April 26, 2018

Changwoo Min Solros: Data-Centric OS April 26, 2018 1 / 21

Cambrian Explosion of Processor Architecture

Changwoo Min Solros: Data-Centric OS April 26, 2018 2 / 21

Specialization of general-purpose processors

Cambrian Explosion of Processor Architecture

Changwoo Min Solros: Data-Centric OS April 26, 2018 2 / 21

Specialization of general-purpose processors

Generalization of co-processors

Cambrian Explosion of Processor Architecture

Changwoo Min Solros: Data-Centric OS April 26, 2018 2 / 21

Specialization of general-purpose processors

Generalization of co-processors

Specialization of co-processors

Blazingly fast IO Devices

Changwoo Min Solros: Data-Centric OS April 26, 2018 3 / 21

Blazingly fast storage/memory

Blazingly fast IO Devices

Changwoo Min Solros: Data-Centric OS April 26, 2018 3 / 21

Blazingly fast storage/memory

Blazingly fast network

Blazingly fast IO Devices

Changwoo Min Solros: Data-Centric OS April 26, 2018 3 / 21

Blazingly fast storage/memory

Blazingly fast network

How to exploit the full potential of such hardware devices without pain?

System-wide performance

Ease of programming

Outline

1 Heterogeneous Computing Architectures

2 Solros: Split-Kernel ApproachSolros ArchitectureOperating System Services

3 Evaluation

Changwoo Min Solros: Data-Centric OS April 26, 2018 4 / 21

Host-Centric Approach

Host OS controls co-processors and IO devices

Examples: OpenCL, CUDA

Application

Host processor

OS

Mem Core

I/O deviceSSD / NIC

Co-processorApplication

Mem Core controldata

Changwoo Min Solros: Data-Centric OS April 26, 2018 5 / 21

Host-Centric Approach

Host OS controls co-processors and IO devices

Examples: OpenCL, CUDA

Application

Host processor

OS

Mem Core

I/O deviceSSD / NIC

Co-processorApplication

Mem Core

controldata

Changwoo Min Solros: Data-Centric OS April 26, 2018 5 / 21

Host-Centric Approach

Host OS controls co-processors and IO devices

Examples: OpenCL, CUDA

Application

Host processor

OS

Mem Core

I/O deviceSSD / NIC

Co-processorApplication

Mem Core

controldata

Changwoo Min Solros: Data-Centric OS April 26, 2018 5 / 21

Host-Centric Approach

Host OS controls co-processors and IO devices

Examples: OpenCL, CUDA

Application

Host processor

OS

Mem Core

I/O deviceSSD / NIC

Co-processorApplication

Mem Core

③ controldata

Changwoo Min Solros: Data-Centric OS April 26, 2018 5 / 21

Host-Centric Approach

Host OS controls co-processors and IO devices

Examples: OpenCL, CUDA

Application

Host processor

OS

Mem Core

I/O deviceSSD / NIC

Co-processorApplication

Mem Core

③ controldata

Problem

Redundant data communicationComplex to program and hard to optimize

Changwoo Min Solros: Data-Centric OS April 26, 2018 5 / 21

Coprocessor-Centric Architecture

Co-processors control IO devices

Examples: Xeon Phi (Linux), GPUfs [ASPLOS13], GPUNet [OSDI14]

Application

Host processor

Mem Core

I/O deviceSSD / NIC

Application

Co-processor

OS

Mem Core

OS

controldata

Changwoo Min Solros: Data-Centric OS April 26, 2018 6 / 21

Coprocessor-Centric Architecture

Co-processors control IO devices

Examples: Xeon Phi (Linux), GPUfs [ASPLOS13], GPUNet [OSDI14]

Application

Host processor

Mem Core

I/O deviceSSD / NIC

Application

Co-processor

OS

Mem Core

OS

controldata

Changwoo Min Solros: Data-Centric OS April 26, 2018 6 / 21

Coprocessor-Centric Architecture

Co-processors control IO devices

Examples: Xeon Phi (Linux), GPUfs [ASPLOS13], GPUNet [OSDI14]

Application

Host processor

Mem Core

I/O deviceSSD / NIC

Application

Co-processor

OS

Mem Core

OS

controldata

Changwoo Min Solros: Data-Centric OS April 26, 2018 6 / 21

Coprocessor-Centric Architecture

Co-processors control IO devices

Examples: Xeon Phi (Linux), GPUfs [ASPLOS13], GPUNet [OSDI14]

Application

Host processor

Mem Core

I/O deviceSSD / NIC

Application

Co-processor

OS

Mem Core

OS

controldata

Problem

Significant effort required for porting IO stack to co-processorNot completely exploiting powerful host processors

Changwoo Min Solros: Data-Centric OS April 26, 2018 6 / 21

Outline

1 Heterogeneous Computing Architectures

2 Solros: Split-Kernel ApproachSolros ArchitectureOperating System Services

3 Evaluation

Changwoo Min Solros: Data-Centric OS April 26, 2018 7 / 21

Solros Goal

Ease of programming

Best use of processor architecture

System-wide optimization

Changwoo Min Solros: Data-Centric OS April 26, 2018 8 / 21

Solros Goal

Ease of programming

Best use of processor architecture

System-wide optimization

Challenge

Co-processor needs IO abstraction

IO stacks is branch-divergent and difficult to parallelize

It needs system-wide information

Changwoo Min Solros: Data-Centric OS April 26, 2018 8 / 21

Solros Architecture

Split-Kernel Architecture

Data-plane OS

Runs on a co-processorProvides IO abstractionDelegates actual IO operations to a control-plane OS

Control-plane OS

Runs on a host processorRuns actual IO stackPerforms system-wide coordination

Changwoo Min Solros: Data-Centric OS April 26, 2018 9 / 21

Solros Architecture

Control-plane OS: actual OS service + system-wide coordination

Data-plane OS: thin communication layer to host processor

controldata

Application

Host processor

OS proxy

Mem Core

I/O deviceSSD / NIC

Application

Co-processor

OS stub

Core Mem

Policy

Changwoo Min Solros: Data-Centric OS April 26, 2018 10 / 21

Solros Architecture

Control-plane OS: actual OS service + system-wide coordination

Data-plane OS: thin communication layer to host processor

controldata

Application

Host processor

OS proxy

Mem Core

I/O deviceSSD / NIC

Application

Co-processor

OS stub

Core Mem

① Policy

Changwoo Min Solros: Data-Centric OS April 26, 2018 10 / 21

Solros Architecture

Control-plane OS: actual OS service + system-wide coordination

Data-plane OS: thin communication layer to host processor

controldata

Application

Host processor

OS proxy

Mem Core

I/O deviceSSD / NIC

Application

Co-processor

OS stub

Core Mem

Policy

Changwoo Min Solros: Data-Centric OS April 26, 2018 10 / 21

Solros Architecture

Control-plane OS: actual OS service + system-wide coordination

Data-plane OS: thin communication layer to host processor

controldata

Application

Host processor

OS proxy

Mem Core

I/O deviceSSD / NIC

Application

Co-processor

OS stub

Core Mem

②③

Policy

Changwoo Min Solros: Data-Centric OS April 26, 2018 10 / 21

Solros Architecture

Control-plane OS: actual OS service + system-wide coordination

Data-plane OS: thin communication layer to host processor

controldata

Application

Host processor

OS proxy

Mem Core

I/O deviceSSD / NIC

Application

Co-processor

OS stub

Core Mem

②③

Policy

Co-processor has OS abstraction with minimal effort

Best use of each of the fat and lean processors

Efficient global coordination among devices (policy)

Changwoo Min Solros: Data-Centric OS April 26, 2018 10 / 21

Operating System Services

1 Transport service

2 Filesystem service

3 Network service

Changwoo Min Solros: Data-Centric OS April 26, 2018 11 / 21

Operating System Services

1 Transport service

2 Filesystem service

3 Network service

Changwoo Min Solros: Data-Centric OS April 26, 2018 12 / 21

Transport Service

High performance data transfer among devices are challenging:

Uniform data transfer among devices

High contention in massively-parallel co-processor

Asymmetric performance between host processor and co-processor

Changwoo Min Solros: Data-Centric OS April 26, 2018 13 / 21

Transport Service

High performance data transfer among devices are challenging:

Uniform data transfer among devices

High contention in massively-parallel co-processor

Asymmetric performance between host processor and co-processor

Our approach

Uniform data transfer ⇒ system-mapped PCIe window

High contention ⇒ combining, replication, interleaving, etc.

Asymmetric performance ⇒ flexibly configurable (host DMA enginevs. co-processor DMA engine)

Changwoo Min Solros: Data-Centric OS April 26, 2018 13 / 21

Transport Service

High performance data transfer among devices are challenging:

Uniform data transfer among devices

High contention in massively-parallel co-processor

Asymmetric performance between host processor and co-processor

Our approach

Uniform data transfer ⇒ system-mapped PCIe window

High contention ⇒ combining, replication, interleaving, etc.

Asymmetric performance ⇒ flexibly configurable (host DMA enginevs. co-processor DMA engine)

See details in the paper

Changwoo Min Solros: Data-Centric OS April 26, 2018 13 / 21

Filesystem Service

Peer-to-peer operation

Buffered operation

Host processor

SSD

DMA engine

Application

Co-processor

File system stub①

File system proxy

File system

PCIe

controldata

Zero-copy of data between co-processor memory and SSDMinimal data transfer

Changwoo Min Solros: Data-Centric OS April 26, 2018 14 / 21

Filesystem Service

Peer-to-peer operation

Buffered operation

Host processor

SSD

DMA engine

Application

Co-processor

File system stub①

File system proxy

File system

PCIe

controldata

Zero-copy of data between co-processor memory and SSDMinimal data transfer

Changwoo Min Solros: Data-Centric OS April 26, 2018 14 / 21

Filesystem Service

Peer-to-peer operation

Buffered operation

Host processor

SSD

DMA engine

Application

Co-processor

File system stub① ③

File system proxy

File system

PCIe

controldata

Zero-copy of data between co-processor memory and SSDMinimal data transfer

Changwoo Min Solros: Data-Centric OS April 26, 2018 14 / 21

Filesystem Service

Peer-to-peer operation

Buffered operation

Host processor

SSD

DMA engine

Application

Co-processor

File system stub① ③

File system proxy

File system

PCIe

controldata

Zero-copy of data between co-processor memory and SSDMinimal data transfer

Changwoo Min Solros: Data-Centric OS April 26, 2018 14 / 21

Filesystem Service

Peer-to-peer operation

Buffered operation

Host processor

SSD

DMA engine

Application

Co-processor

File system stub① ③

File system proxy

File system

PCIe

⑤ controldata

Zero-copy of data between co-processor memory and SSDMinimal data transfer

Changwoo Min Solros: Data-Centric OS April 26, 2018 14 / 21

Filesystem Service

Peer-to-peer operation

Buffered operation

Host processor

SSD

DMA engine

Application

Co-processor

File system stub①

File system proxy

File system

PCIe

controldata

Buffer cache

Reduce storage IO by leveraging shared buffer cache among co-processorsAvoid performance anomaly of peer-to-peer communication over PCIe bus

Changwoo Min Solros: Data-Centric OS April 26, 2018 15 / 21

Filesystem Service

Peer-to-peer operation

Buffered operation

Host processor

SSD

DMA engine

Application

Co-processor

File system stub① ③

File system proxy

Buffer cache

File system

PCIe

controldata

Reduce storage IO by leveraging shared buffer cache among co-processorsAvoid performance anomaly of peer-to-peer communication over PCIe bus

Changwoo Min Solros: Data-Centric OS April 26, 2018 15 / 21

Filesystem Service

Peer-to-peer operation

Buffered operation

Host processor

SSD

DMA engine

Application

Co-processor

File system stub① ③

File system proxy

Buffer cache

File system

PCIe

controldata

Reduce storage IO by leveraging shared buffer cache among co-processorsAvoid performance anomaly of peer-to-peer communication over PCIe bus

Changwoo Min Solros: Data-Centric OS April 26, 2018 15 / 21

Filesystem Service

Peer-to-peer operation

Buffered operation

Host processor

SSD

DMA engine

Application

Co-processor

File system stub① ③

File system proxy

Buffer cache

File system

PCIe

controldata

Reduce storage IO by leveraging shared buffer cache among co-processorsAvoid performance anomaly of peer-to-peer communication over PCIe bus

Changwoo Min Solros: Data-Centric OS April 26, 2018 15 / 21

Filesystem Service

Peer-to-peer operation

Buffered operation

Host processor

SSD

DMA engine

Application

Co-processor

File system stub① ③

File system proxy

Buffer cache

File system

PCIe

controldata

④⑥

Reduce storage IO by leveraging shared buffer cache among co-processorsAvoid performance anomaly of peer-to-peer communication over PCIe bus

Changwoo Min Solros: Data-Centric OS April 26, 2018 15 / 21

Implementation

Host: 2-socket Xeon processor (12 cores each)

Co-processor: 4 Xeon Phi (KNC, 61 cores, Linux, PCIe Gen 3x16)

Storage device: 4 NVMe SSD

NIC: 100 Gbps Ethernet

ModuleLines of code

Added lines Deleted lines

Transport service 1,035 365

File system ServiceStub 5,957 2,073Proxy 2,338 124

Network ServiceStub 2,921 79Proxy 5,609 34

NVMe device driver 924 25SCIF kernel module 60 14

Total 18,844 2,714

Changwoo Min Solros: Data-Centric OS April 26, 2018 16 / 21

Evaluation

Questions:

Performance of Solros services

Impact on real-world applications

Changwoo Min Solros: Data-Centric OS April 26, 2018 17 / 21

Performance of Solros Services

0

0.5

1

1.5

2

2.5

32KB

64KB

128KB

256KB

512KB

1MB

2MB

4MB

0

20

40

60

80

100

10 100 1000

GB

/se

c

(a) file random read on SSD

Phi-SolrosPhi-Linux

Per

cen

tag

eo

fre

qu

ests

(%)

Latency(usec)

(b) TCP latency: 64-byte message

Phi-SolrosPhi-Linux

File IO performance: 19x faster than the stock Linux on Xeon PhiTCP latency (99 percentile): 7x shorter than the stock Linux on Xeon Phi

Changwoo Min Solros: Data-Centric OS April 26, 2018 18 / 21

Performance of Solros Services

0

2

4

6

Phi-Linux Phi-Solros0

2

4

6

8

10

Phi-Linux Phi-Solros

La

ten

cy(m

sec)

(a) file random read on SSD

StorageBlock/Transport

File system

(b) TCP latency: 64-byte message

Proxy/TransportNetwork stack

Significant performance gain in data transportRunning IO stack on co-processor is slower

Changwoo Min Solros: Data-Centric OS April 26, 2018 19 / 21

Real-world Application - Image Search

Image search engine is running on Xeon Phi

Image database is on NVMe SSD (shared read-only)

Image search queries are from network

0

100

200

300

400

0

100

200

0 100 200 300

Ela

pse

dT

ime

(sec

)

Phi-SolrosPhi-Linux

Ba

nd

wid

th(M

B/

s)

Time(sec)

Phi-SolrosPhi-Linux

Solros performs 2x faster than stock Linux on Xeon Phi

Changwoo Min Solros: Data-Centric OS April 26, 2018 20 / 21

Conclusion

Solros, a new operating system architecture for co-processors and fastIO devices

Control-plane, data-plane architecture allow:

Supporting high-level OS abstraction on co-processorEfficient global coordination among devicesNear ideal IO performance from co-processor

We will release source code soon

Changwoo Min Solros: Data-Centric OS April 26, 2018 21 / 21

Image Search - Scalability

Increase the number of Xeon Phi to 4

0

1

2

3

4

1 2 3 4

Nor

ma

lize

dsp

eed

up

The number of Xeon Phi co-processors

Solros load balancing mechanism achieves near linear scaling

Changwoo Min Solros: Data-Centric OS April 26, 2018 22 / 21

Related work

Control-plane/data-plane OS: Arrakis [OSDI’14], IX [OSDI’14]

OS for heterogeneous systems: Helios [SOSP]09], M3 [ASPLOS’16],Hydra [ASPLOS’08]

IO support for GPU: PTask [SOSP’11], GPUfs [ASPLOS’13], GPUnet[OSDI’14]

Changwoo Min Solros: Data-Centric OS April 26, 2018 23 / 21

Transport service

Host's physical address space

Host RAM PCIe Window mmio

DRAM

NICDRAM mmio

Xeon PhiDRAM mmio

NVMe

Host's virtualaddress space

App 1 App 2

...

mmioDevice's physical

address sapce

mapped to

(across devices)

mapped to

(in-host)

Changwoo Min Solros: Data-Centric OS April 26, 2018 24 / 21

Network Service (TCP)

Outbound operation

Inbound operation

Host processor

NIC

Application

Co-processor

TCP stub

TCP proxy

Load balancer

TCP stack

PCIe

controldata

A load balancer on a host distributes incoming TCP connections to one ofleast-loaded co-processors.

See details in the paper.

Changwoo Min Solros: Data-Centric OS April 26, 2018 25 / 21

Discussion

Hardware support other than Xeon Phi

Two atomic instructions: transport serviceMMU: isolation among co-processor applications

Scalability of control-plane OS

Limited by scalability of OS service, PCIe interconnect, andperformance of IO devices

Changwoo Min Solros: Data-Centric OS April 26, 2018 26 / 21

Real-world Application - Text Search

CLucene text indexing engine running on Xeon Phi

Text data is on NVMe SSD

0

500

1000

1500

2000

2500

3000

0.0

0.5

1.0

1.5

0 50 100 150

Ela

pse

dT

ime

(sec

)

Phi-SolrosPhi-Linux

Ba

nd

wid

th(G

B/

s)

Time(sec)

Phi-SolrosPhi-Linux (virtio)

Solros performs 19x faster than stock Linux (ext4/virtio) on Xeon Phi

Changwoo Min Solros: Data-Centric OS April 26, 2018 27 / 21

Recommended