THINCI, Inc. • www.thinci.com August 2017
Graph Streaming Processor
A Next-Generation Computing Architecture
Val G. Cook – Chief Software Architect
Satyaki Koneru – Chief Technology Officer
Ke Yin – Chief Scientist
Dinakar Munagala – Chief Executive Officer
Introduction
• THINCI, Inc. (pronounced "think-eye") is a 5-year-old strategic/venture-backed
technology startup
• Develop silicon for machine learning, computer vision and other
strategic parallel workloads
• Provide innovative software along with a comprehensive SDK
• 69-person team (95% engineering & operations)
• Key IP (patents, trade secrets)
– Graph Streaming Processor
– Graph Computing Compiler
• Product Status
– Early Access Program started Q1 2017
– First edition PCIe-based development boards will ship Q4 2017
Levels of Parallelism
• Task Level Parallelism
• Thread Level Parallelism
• Data Level Parallelism
• Instruction Level Parallelism
Key Architectural Choices
• Direct Graph Processing
• Fine-Grained Thread Scheduling
• 2D Block Processing
• Parallel Reduction Instructions
• Hardware Instruction Scheduling
Architectural Objective
Exceptional efficiency via balanced application of multiple parallel execution mechanisms
Task Graphs
• Formalized Task Level Parallelism
– Graphs define only computational semantics
– Nodes reference kernels
– Kernels are programs
– Nodes bind to buffers
– Buffers contain structured data
– Data dependencies explicit
• ThinCI Hardware Processes Graphs Natively
– A graph is an execution primitive
– A program is a proper subset of a graph
[Figure: example task graph – nodes A through G connected by explicit data dependencies]
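The node/kernel/buffer relationships above can be sketched in a few lines of Python. This is a hedged illustration of the concepts on this slide, not THINCI's API; all names (`Node`, `blur`, `threshold`) are hypothetical.

```python
# Illustrative task-graph model: nodes reference kernels (programs),
# bind to buffers, and data dependencies are explicit edges.
from collections import namedtuple

Node = namedtuple("Node", ["name", "kernel", "inputs", "outputs"])

def blur(bufs):      # kernels are ordinary programs operating on buffers
    bufs["tmp"] = [v // 2 for v in bufs["img"]]

def threshold(bufs):
    bufs["out"] = [1 if v > 10 else 0 for v in bufs["tmp"]]

# Buffers contain structured data; nodes bind to them by name.
buffers = {"img": [8, 30, 44, 2], "tmp": None, "out": None}

graph = [
    Node("A", blur,      inputs=["img"], outputs=["tmp"]),
    Node("B", threshold, inputs=["tmp"], outputs=["out"]),
]

# A graph executor needs only the explicit dependencies: run a node
# once every input buffer it binds to has been produced.
ready = {name for name, buf in buffers.items() if buf is not None}
pending = list(graph)
while pending:
    node = next(n for n in pending if all(i in ready for i in n.inputs))
    node.kernel(buffers)
    ready.update(node.outputs)
    pending.remove(node)

print(buffers["out"])  # [0, 1, 1, 0]
```

Because dependencies are explicit, the executor never consults the kernels themselves to determine a legal order, which is what lets hardware schedule graph nodes directly.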
Graph Based Frameworks
• Graph Processing or Data Flow Graphs
– A very old concept; Alan Turing's "Graph Turing Machine" is an early example.
– Gaining value as a computation model, particularly in the field of machine learning.
• Graph-based machine learning frameworks have proliferated in recent years.
[Figure: timeline (2011–2017) of machine learning frameworks – Caffe, Torch, cuDNN, Kaldi, BIDMach, maxDNN, MatConvNet, deeplearning4j, Lasagne, DSSTNE, Apache SINGA, Neural Designer, leaf, Chainer, CNTK, MxNet, TensorFlow, Keras, Caffe2]
Streaming vs. Sequential Processing
• Sequential Node Processing
– Commonly used by DSPs and GPUs
– Intermediate buffers are written back and forth to memory
– Intermediate buffers are generally non-cacheable globally
– DRAM accesses are costly
• Excessive power
• Excessive latency
• Graph Streaming Processor
– Intermediate buffers are small (~1% of the original size)
– Data is more easily cached
– Benefits of significantly reduced memory bandwidth
• Lower power consumption
• Higher performance
[Figure: execution timelines – Sequential Execution runs nodes A, B, C, D one at a time, with each node's threads (0–6) finishing before the next node begins; Streaming Execution interleaves threads 0–6 from nodes A–D concurrently, shortening the overall timeline]
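The contrast can be illustrated with a toy two-stage pipeline: sequential node processing materializes a full intermediate buffer per node, while streaming keeps only a small tile live at a time. A minimal sketch under assumed stage functions (names are illustrative, not THINCI's implementation):

```python
# Two pipeline stages standing in for graph nodes.
def stage_a(x): return x + 1
def stage_b(x): return x * 2

data = list(range(1000))

# Sequential node processing: the intermediate buffer is as large as
# the data set and must round-trip through memory between nodes.
tmp_full = [stage_a(x) for x in data]        # full 1000-element buffer
out_seq  = [stage_b(x) for x in tmp_full]    # second full pass

# Streaming execution: the intermediate lives only tile-by-tile, so it
# stays cache-resident (here 10/1000 = ~1% of the original size).
TILE = 10
out_stream = []
for i in range(0, len(data), TILE):
    tile = [stage_a(x) for x in data[i:i + TILE]]  # tiny buffer
    out_stream.extend(stage_b(x) for x in tile)

assert out_seq == out_stream  # identical results, far less live data
```

The results are identical; only the peak size of the intermediate data changes, which is where the bandwidth and power savings come from.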
Fine-Grained Thread Scheduling
• Thread Scheduler
– Aware of data dependencies
– Dispatches threads when:
• Resources available
• Dependencies satisfied
– Maintains ordered behavior as needed
– Prevents deadlock
• Supports Complex Scenarios
– Aggregates Threads
– Fractures Threads
[Figure: GSP block diagram – a Thread Scheduler and Instruction Unit feed Quads 0 through N; each quad contains four processors (each with MPU/SPU, instruction scheduler, and thread state), a Special Op Unit, and an Arbiter; shared resources include read-only and read-write cache arrays, L2 and L3 caches, a State Unit, an Input/Output Unit, Execution and DMA Command Ring Units, a DMA Transfer Unit, a Controller, and an AXI Bus Matrix]
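The dispatch policy described above (run a thread only when its dependencies are satisfied and a hardware context is free) can be modelled in a few lines. This is a toy sketch, not the GSP scheduler; the thread set and slot count are hypothetical.

```python
# Dependency-aware dispatch: a thread runs only when every thread it
# depends on has completed AND a resource slot is available.
import collections

Thread = collections.namedtuple("Thread", ["tid", "deps"])

threads = [
    Thread(0, deps=[]),
    Thread(1, deps=[0]),
    Thread(2, deps=[0]),
    Thread(3, deps=[1, 2]),
]

MAX_SLOTS = 2            # resource limit: concurrent thread contexts
completed, order = set(), []
pending = collections.deque(threads)
running = []

while pending or running:
    # Retire running threads (modelled as completing in one step).
    for t in running:
        completed.add(t.tid)
        order.append(t.tid)
    running = []
    # Dispatch: dependencies satisfied AND a slot is free.
    for t in list(pending):
        if len(running) < MAX_SLOTS and all(d in completed for d in t.deps):
            running.append(t)
            pending.remove(t)

print(order)  # [0, 1, 2, 3]
```

Note that ordering falls out of the dependency checks alone: thread 3 cannot be dispatched before both 1 and 2 retire, and because every dependency chain is acyclic the loop always drains, which is the deadlock-prevention property in miniature.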
Graph Execution Trace
• Threads can execute from all nodes of the graph simultaneously
• True hardware managed streaming behavior
[Figure: graph execution trace – thread count per node plotted over time, with each thread's life-span shown; threads from all nodes are active concurrently]
Data Level Parallelism
2D Block Processing
Parallel Reduction Instructions
2D Block Processing/Reduction Instructions
• Persistent data structures are accessed in blocks
• Arbitrary alignment support
• Provides for “in-place compute”
• Parallel reduction instructions support efficient processing
– Reduced power
– Greater throughput
– Reduced bandwidth
• Scaling across data types is better than the 2x-per-precision-step scaling of traditional vector pipelines
[Figure: 2D block access – src blocks read at arbitrary (unaligned) offsets within a 16-wide array are reduced into a dst block]
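As a rough illustration of block access combined with a reduction, the sketch below sums a 2D block read at an arbitrary (unaligned) offset. It shows the idea only; the image size, block shape, and function name are hypothetical, and this is not the GSP instruction set.

```python
# A 16x12 "image" of structured data; each element is x + y so the
# expected reduction is easy to check by hand.
W, H = 16, 12
image = [[x + y for x in range(W)] for y in range(H)]

def block_sum(img, x0, y0, bw, bh):
    """Reduce one bw x bh block; (x0, y0) need not be block-aligned."""
    return sum(img[y][x]
               for y in range(y0, y0 + bh)
               for x in range(x0, x0 + bw))

# A 4x4 block starting at the unaligned offset (3, 5):
# x contributes (3+4+5+6)*4 = 72, y contributes (5+6+7+8)*4 = 104.
print(block_sum(image, 3, 5, 4, 4))  # 176
```

In hardware the whole block reduction is a single instruction rather than a loop, which is where the power, throughput, and bandwidth benefits listed above come from.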
Instruction Level Parallelism
Hardware Instruction Scheduling
Hardware Instruction Scheduling
• Scheduling Groups of Four Processors
– Hardware instruction picker
– Selects from hundreds of threads
– Targets tens of independent pipelines
[Figure: per-processor pipeline diagram – thread spawn and state management feed thread state register files; instruction decode and the instruction scheduler issue to scalar, vector, custom arithmetic, flow control, memory, and move pipelines]
Programming Model
• Fully Programmable
– No a priori constraints on data types, precision, or graph topologies
– Fully pipelined concurrent graph execution
– Comprehensive SDK with support for all abstraction levels, assembly to frameworks
• Machine Learning Frameworks
– TensorFlow
– Caffe
– Torch
• OpenVX + OpenCL C/C++ language kernels (seeking Khronos conformance post-silicon)
– Provides rich graph creation and execution semantics
– Extended with fully accelerated custom kernel support