Graph Streaming Processor

A Next-Generation Computing Architecture

Val G. Cook – Chief Software Architect

Satyaki Koneru – Chief Technology Officer

Ke Yin – Chief Scientist

Dinakar Munagala – Chief Executive Officer

THINCI, Inc. • www.thinci.com August 2017


Introduction

• THINCI, Inc. (pronounced “think-eye”) is a 5-year-old strategic/venture-backed technology startup

• Develops silicon for machine learning, computer vision and other strategic parallel workloads

• Provides innovative software along with a comprehensive SDK

• 69-person team (95% engineering & operations)

• Key IP (patents, trade secrets)

– Streaming Graph Processor

– Graph Computing Compiler

• Product Status

– Early Access Program started Q1 2017

– First edition PCIe-based development boards will ship Q4 2017


Levels of Parallelism

• Task Level Parallelism

• Thread Level Parallelism

• Data Level Parallelism

• Instruction Level Parallelism

Key Architectural Choices

• Direct Graph Processing

• Fine-Grained Thread Scheduling

• 2D Block Processing

• Parallel Reduction Instructions

• Hardware Instruction Scheduling

Architectural Objective

Exceptional efficiency via balanced application of multiple parallel execution mechanisms


Task Level Parallelism

Direct Graph Processing


Task Graphs

• Formalized Task Level Parallelism

– Graphs define only computational semantics

– Nodes reference kernels

– Kernels are programs

– Nodes bind to buffers

– Buffers contain structured data

– Data dependencies explicit

• ThinCI Hardware Processes Graphs Natively

– A graph is an execution primitive

– A program is a proper subset of a graph

[Figure: example task graph with nodes A, B, C, D, E, F and G]
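
The task-graph terms on this slide can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the `Node` class, the `run_graph` function, the kernels, and the edges of the four-node graph are assumptions made for illustration. The slide does not state the edges of the pictured graph, and this is not the ThinCI API. The sketch shows nodes referencing kernels, binding to named buffers, and an execution order derived only from explicit data dependencies.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List


@dataclass
class Node:
    name: str
    kernel: Callable[..., int]   # the program this node references
    inputs: List[str]            # names of the buffers the node reads
    output: str                  # name of the buffer the node writes


def run_graph(nodes: List[Node], buffers: Dict[str, int]) -> List[str]:
    """Run every node once, in an order implied by buffer dependencies."""
    producer = {n.output: n.name for n in nodes}
    by_name = {n.name: n for n in nodes}
    # A node depends on whichever node produces each of its input buffers.
    deps = {
        n.name: {producer[b] for b in n.inputs if b in producer}
        for n in nodes
    }
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        node = by_name[name]
        buffers[node.output] = node.kernel(*(buffers[b] for b in node.inputs))
    return order


# Hypothetical diamond graph (not the slide's graph): A feeds B and C,
# and both B and C feed D. Nodes are listed out of order on purpose.
nodes = [
    Node("D", lambda x, y: x + y, ["b_out", "c_out"], "d_out"),
    Node("B", lambda x: x * 2, ["a_out"], "b_out"),
    Node("C", lambda x: x + 10, ["a_out"], "c_out"),
    Node("A", lambda x: x + 1, ["src"], "a_out"),
]
buffers = {"src": 1}
order = run_graph(nodes, buffers)
```

Because the order comes from the buffer bindings alone, B and C have no dependency on each other and could run in either order or at the same time, which is the task-level parallelism the graph makes explicit.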


Graph Based Frameworks

• Graph Processing or Data Flow Graphs

– They are a very old concept, for example Alan Turing’s “Graph Turing Machine”.

– Gaining value as a computation model, particularly in the field of machine learning.

• Graph-based machine learning frameworks have proliferated in recent years.

[Figure: Machine Learning Frameworks, a timeline from 2011 to 2017 showing TensorFlow, Keras, Torch, cuDNN, MxNet, Caffe2, Caffe, CNTK, Chainer, MatConvNet, deeplearning4j, Lasagne, DSSTNE, Apache SINGA, Neural Designer, BIDMach, Kaldi, maxDNN and leaf]


Streaming vs. Sequential Processing

• Sequential Node Processing

– Commonly used by DSPs and GPUs

– Intermediate buffers are written back and forth to memory

– Intermediate buffers are generally non-cacheable globally

– DRAM accesses are costly

• Excessive power

• Excessive latency

• Graph Streaming Processor

– Intermediate buffers are small (~1% of the original size)

– Data is more easily cached

– Benefits of significantly reduced memory bandwidth

• Lower power consumption

• Higher performance

[Figure: Sequential Execution, in which Node A, Node B, Node C and Node D occupy separate spans of the time axis, versus Streaming Execution, in which Nodes A, B, C and D share the same span of the time axis]
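
The difference in intermediate buffer size can be sketched in a few lines of Python. This is a software analogy under stated assumptions, not a model of the hardware: the function names, the three toy kernels, and the chunk size are all hypothetical, and the chunk is chosen as 1% of the input only to mirror the figure quoted on the slide.

```python
from typing import Callable, List, Tuple

Kernel = Callable[[int], int]


def run_sequential(data: List[int], kernels: List[Kernel]) -> Tuple[List[int], int]:
    """Run one node at a time over the whole buffer.

    Returns the result and the largest buffer produced by any single
    node invocation, counted in elements.
    """
    peak = 0
    current = data
    for kernel in kernels:
        current = [kernel(x) for x in current]  # full-size buffer per node
        peak = max(peak, len(current))
    return current, peak


def run_streaming(data: List[int], kernels: List[Kernel], chunk: int) -> Tuple[List[int], int]:
    """Push one small chunk through every node before fetching the next."""
    peak = 0
    result: List[int] = []
    for start in range(0, len(data), chunk):
        block = data[start:start + chunk]
        for kernel in kernels:
            block = [kernel(x) for x in block]  # chunk-size buffer per node
            peak = max(peak, len(block))
        result.extend(block)
    return result, peak


# Hypothetical workload: 1000 elements, three element-wise kernels.
data = list(range(1000))
kernels = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

seq_out, seq_peak = run_sequential(data, kernels)
stream_out, stream_peak = run_streaming(data, kernels, chunk=10)
```

Both functions compute the same result; the streaming variant only ever holds a 10-element buffer between nodes, where the sequential variant holds 1000 elements. Real kernels with neighborhood access (for example convolutions) would need overlapping chunks, which this sketch leaves out.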


Thread Level Parallelism

Fine-Grained Thread Scheduling


Fine-Grained Thread Scheduling

• Thread Scheduler

– Aware of data dependencies

– Dispatches threads when:

• Resources available

• Dependencies satisfied

– Maintains ordered behavior as needed

– Prevents deadlock

• Supports Complex Scenarios

– Aggregates Threads

– Fractures Threads

[Figure: processor block diagram. Quad 0, Quad 1, Quad 2 through Quad N each contain Processor 0 to Processor 3 (each with SPU, MPU, Inst. Scheduler and Thread State), a Special Op Unit and an Arbiter. Other blocks shown: Controller, Thread Scheduler, Instruction Unit, State Unit, Input/Output Unit, Transfer (DMA) Unit, Execution Command Ring Unit, DMA Command Ring Unit, Read Only Cache, Read Write Cache, Read Only Cache Array, Read Write Cache Array, L2 Cache, L3 Cache and AXI Bus Matrix]
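
The two dispatch conditions on this slide (resources available, dependencies satisfied) can be sketched as a small Python simulation. The `schedule` function, the thread names, and the one-step thread lifetime are assumptions for illustration; the slides do not describe how the hardware scheduler is implemented, and this sketch only reports a dependency cycle rather than preventing deadlock the way the hardware is said to.

```python
from typing import Dict, List, Set


def schedule(deps: Dict[str, Set[str]], slots: int) -> List[List[str]]:
    """Return the threads dispatched in each step.

    deps maps a thread name to the threads it must wait for.
    slots is how many threads can run per step (the available resources).
    Every thread is assumed to finish in exactly one step.
    """
    done: Set[str] = set()
    pending = sorted(deps)  # deterministic order for the sketch
    steps: List[List[str]] = []
    while pending:
        # Dependencies satisfied: every prerequisite has already finished.
        ready = [t for t in pending if deps[t] <= done]
        if not ready:
            raise ValueError("cyclic dependencies: no thread can be dispatched")
        batch = ready[:slots]  # resources available: at most `slots` per step
        steps.append(batch)
        done.update(batch)
        pending = [t for t in pending if t not in done]
    return steps


# Hypothetical threads from three graph nodes a, b and c.
deps = {
    "a0": set(),
    "a1": set(),
    "b0": {"a0"},
    "b1": {"a1"},
    "c0": {"b0", "b1"},
}
wide = schedule(deps, slots=2)
narrow = schedule(deps, slots=1)
```

With two slots the five threads finish in three steps; with one slot the same dependencies take five steps, and a thread of node b is dispatched before every thread of node a has run.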


Graph Execution Trace

• Threads can execute from all nodes of the graph simultaneously

• True hardware managed streaming behavior

[Figure: Graph Execution Trace. Horizontal axis: time. Vertical axis: Thread Count/Node. Annotation: Thread life-span]


Data Level Parallelism

2D Block Processing

Parallel Reduction Instructions


2D Block Processing/Reduction Instructions

• Persistent data structures are accessed in blocks

• Arbitrary alignment support

• Provides for “in-place compute”

• Parallel reduction instructions support efficient processing

– Reduced power

– Greater throughput

– Reduced bandwidth

• Better scaling across data types than the 2x scaling of traditional vector pipelines

[Figure: 2D block access on a grid with columns numbered 0 to 15 and rows numbered 0 to 11, marking one dst block and two src blocks]
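
A rough software analogy for block access and parallel reduction follows. It is not the processor's instruction set: `read_block` and `tree_reduce_sum` are hypothetical helpers, and the 16-by-12 grid only echoes the axis numbering in the figure. The first function reads a rectangular block at an arbitrary (unaligned) position; the second sums its elements pairwise, so 16 values need 4 combining steps instead of 15 sequential additions.

```python
from typing import List, Tuple


def read_block(image: List[List[int]], x: int, y: int, w: int, h: int) -> List[List[int]]:
    """Return the h-by-w block whose top-left corner is at column x, row y."""
    return [row[x:x + w] for row in image[y:y + h]]


def tree_reduce_sum(values: List[int]) -> Tuple[int, int]:
    """Sum values pairwise; return (total, number of combining steps).

    All additions within one step are independent of each other, so a
    parallel machine could perform each step at once.
    """
    if not values:
        return 0, 0
    level = list(values)
    steps = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-length levels with the additive identity
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        steps += 1
    return level[0], steps


# Hypothetical 12-row by 16-column image where each pixel holds its index.
image = [[row * 16 + col for col in range(16)] for row in range(12)]

# A 4x4 block starting at column 3, row 4 (not aligned to a multiple of 4).
block = read_block(image, x=3, y=4, w=4, h=4)
flat = [value for row in block for value in row]
total, steps = tree_reduce_sum(flat)
```

The step count grows with the logarithm of the block size, which is the usual argument for dedicated reduction support over a chain of scalar additions.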


Instruction Level Parallelism

Hardware Instruction Scheduling


Hardware Instruction Scheduling

• Scheduling Groups of Four Processors

– Hardware Instruction Picker

– Selects from hundreds of threads

– Targets tens of independent pipelines

[Figure: processor pipeline blocks: Thread Spawn, State Mgmt., Thread State Register Files, Instruction Decode, Instruction Scheduler, Scalar Pipeline, Vector Pipeline, Custom Arithmetic, Flow Control, Memory Ops. and Move Pipeline]
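
The idea of a picker that selects instructions from many threads and targets several independent pipelines can be sketched as a toy simulation. All details here are assumptions: the `issue_cycles` function, the fixed thread priority, the single-cycle instructions, and the three pipeline names are chosen for illustration and do not describe the actual hardware picker.

```python
from collections import deque
from typing import Deque, Dict, List


def issue_cycles(threads: Dict[str, List[str]], pipelines: List[str]) -> List[Dict[str, str]]:
    """Simulate issuing at most one instruction per pipeline per cycle.

    threads maps a thread name to the in-order list of pipelines its
    instructions need. Returns, for each cycle, a mapping from pipeline
    name to the thread that issued to it.
    """
    queues: Dict[str, Deque[str]] = {t: deque(ops) for t, ops in threads.items()}
    trace: List[Dict[str, str]] = []
    while any(queues.values()):
        issued: Dict[str, str] = {}
        for name, ops in queues.items():
            # A thread offers only its next instruction (in order per thread),
            # and a pipeline accepts one instruction per cycle.
            if ops and ops[0] in pipelines and ops[0] not in issued:
                issued[ops[0]] = name
        if not issued:
            raise ValueError("an instruction targets an unknown pipeline")
        for name in issued.values():
            queues[name].popleft()
        trace.append(issued)
    return trace


# Hypothetical workload: three threads, six instructions in total.
threads = {
    "t0": ["vector", "vector"],
    "t1": ["scalar", "vector"],
    "t2": ["memory", "scalar"],
}
trace = issue_cycles(threads, pipelines=["scalar", "vector", "memory"])
```

Here six instructions issue in three cycles because instructions from different threads fill different pipelines in the same cycle. The fixed priority order used in this sketch could starve later threads; a real scheduler would need a fairness policy.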


Programming Model


Programming Model

• Fully Programmable

– No a priori constraints regarding data types, precision or graph topologies

– Fully pipelined concurrent graph execution

– Comprehensive SDK with support for all abstraction levels, assembly to frameworks

• Machine Learning Frameworks

– TensorFlow

– Caffe

– Torch

• OpenVX + OpenCL C/C++ Language Kernels (seeking Khronos conformance post-silicon)

– Provides rich graph creation and execution semantics

– Extended with fully accelerated custom kernel support


Results

• Arithmetic Pipeline Utilization

– 95% for CNNs (VGG16, 8-bit)

• Physical Characteristics

– TSMC 28nm HPC+

– Standalone SoC Mode

– PCIe Accelerator Mode

– SoC Power Estimate: 2.5W