Kyushu University KL, Malaysia Hardware and Software Requirements for Implementing a...
- Slide 1
- Kyushu University KL, Malaysia
  Hardware and Software Requirements for Implementing a High-Performance Superconductivity Circuits-Based Accelerator
  Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki Murakami
  Kyushu University, Japan
- Slide 2
- CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum (SFQ) circuits
  SFQ-LSRDP project members:
  - Kyushu Univ. (K. Murakami, K. Inoue, H. Honda, F. Mehdipour, H. Kataoka): Architecture, compiler and applications
  - Superconducting Research Lab. (SRL) (S. Nagasawa et al.): SFQ process
  - Yokohama National Univ. (N. Yoshikawa et al.): SFQ-FPU chip, cell library
  - Nagoya Univ. (A. Fujimaki et al.): SFQ-RDP chip, cell library, and wiring
  - Nagoya Univ. (N. Takagi (Leader) et al.): CAD for logic design and arithmetic circuits
  Our mission: Architecture, compiler and application development
- Slide 3
- Outline of Large-Scale Reconfigurable Data-Path (LSRDP) Processor
  SFQ features:
  - High-speed switching and signal transmission
  - Low power consumption
  - Compact implementation (smaller area)
  - Suitable for pipeline processing
- Slide 4
- How it works
  Components: GPP, memory, controller, buffers, LSRDP
  Program flow on the GPP:
    inst;
    conf_LSRDP();
    Loop:
      rearrange_input_data();
      set_IO_info();
      run_LSRDP();
      inst;
      sync_lsrdp();
      rearrange_output_data();
    End_Loop
    inst;
    inst
  Figure annotations, in the order they appear:
  - conf_LSRDP(): conf. bit-stream
  - rearrange_input_data(): GPP, memory controller
  - set_IO_info(): memory controller
  - run_LSRDP()
  - sync_lsrdp(): GPP waiting for the LSRDP; LSRDP terminating the operation
  - rearrange_output_data(): GPP
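The control flow on this slide can be sketched in software. The following is a minimal illustration only: `MockLSRDP`, the `trace` list, the per-element `kernel`, and the bodies of `rearrange_input_data` and `rearrange_output_data` are assumptions made for the example, and a worker thread stands in for the accelerator. Only the call names and their order come from the slide.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, Tuple


class MockLSRDP:
    """Hypothetical software stand-in for the LSRDP and its controller."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._kernel: Optional[Callable[[int], int]] = None
        self._io: List[int] = []
        self._pending: Optional[Future] = None
        self.trace: List[str] = []  # records the order of LSRDP calls

    def conf_LSRDP(self, kernel: Callable[[int], int]) -> None:
        # Stands in for loading the configuration bit-stream once, before the loop.
        self.trace.append("conf_LSRDP")
        self._kernel = kernel

    def set_IO_info(self, staged: Sequence[int]) -> None:
        # Stands in for telling the controller where the staged data lives.
        self.trace.append("set_IO_info")
        self._io = list(staged)

    def run_LSRDP(self) -> None:
        # Starts the accelerator and returns immediately; the GPP keeps running.
        self.trace.append("run_LSRDP")
        kernel, data = self._kernel, self._io
        self._pending = self._pool.submit(lambda: [kernel(x) for x in data])

    def sync_lsrdp(self) -> List[int]:
        # The GPP waits here until the LSRDP terminates the operation.
        self.trace.append("sync_lsrdp")
        return self._pending.result()

    def close(self) -> None:
        self._pool.shutdown()


def rearrange_input_data(block: Sequence[int]) -> List[int]:
    return list(block)  # placeholder for the real data layout change


def rearrange_output_data(out: Sequence[int]) -> Tuple[int, ...]:
    return tuple(out)  # placeholder for the real data layout change


def run_application(blocks, kernel):
    acc = MockLSRDP()
    acc.conf_LSRDP(kernel)
    results = []
    for block in blocks:
        acc.set_IO_info(rearrange_input_data(block))
        acc.run_LSRDP()
        # Other GPP instructions ("inst;") could overlap with the LSRDP here.
        results.append(rearrange_output_data(acc.sync_lsrdp()))
    acc.close()
    return results, acc.trace


results, trace = run_application([[1, 2], [3]], lambda x: x * x)
```

Configuration happens once outside the loop, while staging, launch and synchronization repeat per data block, which is the structure the slide shows.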
- Slide 5
- Architecture Exploration
  LSRDP layouts (figure labels: MCL = 1, MCL = 1, MCL = 2)
  ORN structures:
  - Number of rows = 1.5M, number of columns = 4MCL
  - Number of rows = 2M, number of columns = 6MCL+2
  - Number of rows = 1.5M, number of columns = 4MCL+1
  PE structures (built from FU and TU):
  - Basic PE arch.: 3-inps/2-outs
  - PE arch. I: 4-inps/3-outs
  - PE arch. II: 3-inps/3-outs
- Slide 6
- LSRDP Tool Chain
  Application C code -> modifying application code (inserting LSRDP instructions in the code) -> modified application code
  Flow 1 (assembly code generation for GPP):
  - Inputs: modified application code; LSRDP library file (function definitions & declarations)
  - ISAcc or COINS compiler -> binary code
  Flow 2 (configuration bit-stream generation for the LSRDP):
  - Inputs: modified application code; LSRDP architecture description
  - DFG extraction -> data flow graphs
  - Placing and routing tool -> configuration file + various text & schematic reports
  Simulator -> performance evaluation
- Slide 7
- Mapping DFGs onto LSRDP
  Inputs: DFG, LSRDP architecture description
  Steps:
  - Placing input nodes
  - Placing operational & output nodes
  - Routing nets
  - Routing IO nets
  - Final map
  Figure label: Longest connections
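As a rough illustration of the placement steps, the sketch below assigns each DFG node to a row by its dependency depth and then picks out the longest connections. This is not the authors' placement tool: the as-soon-as-possible row assignment, the function names, and measuring a connection by the number of rows it spans are all assumptions made for the example.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (producer node, consumer node)


def assign_rows(edges: List[Edge]) -> Dict[str, int]:
    """Place input nodes in row 0 and every other node one row below its deepest producer."""
    succs = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    row = {n: 0 for n in nodes if indeg[n] == 0}  # input nodes
    queue = deque(sorted(row))
    while queue:
        u = queue.popleft()
        for v in succs[u]:
            row[v] = max(row.get(v, 0), row[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return row


def longest_connections(edges: List[Edge], row: Dict[str, int]) -> List[Edge]:
    """Return the edges that span the most rows (assumed length measure)."""
    span = {e: row[e[1]] - row[e[0]] for e in edges}
    longest = max(span.values())
    return [e for e in edges if span[e] == longest]


# Tiny hypothetical DFG: out = (a + b) * a
edges = [("a", "add"), ("b", "add"), ("add", "mul"), ("a", "mul"), ("mul", "out")]
rows = assign_rows(edges)
longest = longest_connections(edges, rows)
```

In this toy graph the edge from `a` to `mul` skips a row, so it is the one a router would have to carry through an intermediate row.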
- Slide 8
- Global routing algorithms
  Routing DFG connections between source and destination PEs
  - Exhaustive search-based: very time consuming
  - Branch and bound alg.: very fast
  Figure legend: src, dest, vacant, fully-occupied
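To show why branch and bound can be much faster than exhaustive search, here is a small sketch under a simplified model. The grid model, the cost function (total horizontal displacement), the reading of MCL as a per-row limit on horizontal movement, and the function name are assumptions for illustration, not the algorithm used in the tool.

```python
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]  # (row, column) of a PE


def route_branch_and_bound(src: Cell, dest: Cell, width: int, mcl: int,
                           blocked: Set[Cell]) -> Optional[Tuple[int, List[Cell]]]:
    """Find a minimum-cost route that moves down one row per hop, avoiding fully-occupied PEs."""
    best_cost: Optional[int] = None
    best_path: Optional[List[Cell]] = None

    def dfs(row: int, col: int, cost: int, path: List[Cell]) -> None:
        nonlocal best_cost, best_path
        rows_left = dest[0] - row
        remaining = abs(dest[1] - col)
        # Bound 1: the destination column is out of reach in the rows left.
        if remaining > mcl * rows_left:
            return
        # Bound 2: even the optimistic completion cannot beat the best route so far.
        if best_cost is not None and cost + remaining >= best_cost:
            return
        if rows_left == 0:  # bound 1 guarantees col == dest column here
            best_cost, best_path = cost, list(path)
            return
        lo, hi = max(0, col - mcl), min(width - 1, col + mcl)
        # Branch: try columns closest to the destination first.
        for c in sorted(range(lo, hi + 1), key=lambda x: abs(dest[1] - x)):
            nxt = (row + 1, c)
            if nxt != dest and nxt in blocked:
                continue  # fully-occupied PE cannot forward the signal
            path.append(nxt)
            dfs(row + 1, c, cost + abs(c - col), path)
            path.pop()

    dfs(src[0], src[1], 0, [src])
    return None if best_path is None else (best_cost, best_path)


route = route_branch_and_bound((0, 0), (3, 2), width=6, mcl=1, blocked={(1, 1), (2, 2)})
no_route = route_branch_and_bound((0, 0), (3, 2), width=6, mcl=1, blocked={(1, 0), (1, 1)})
```

An exhaustive search would enumerate every column sequence; the two bounds discard whole subtrees as soon as they are provably infeasible or no better than the best route already found.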
- Slide 9
- Micro-Routing: Problem Definition
  Inputs:
  - LSRDP basic specifications: layout, width (W), MCL, PE arch., etc.
  - List of connections b/w consecutive rows
  - ORN structure, including:
    - The number of CBs and T2s in each row
    - The number of CB rows
    - Topology of connections among CBs
  Output:
  - Detailed routes via cross-bar switches
    - The list of CBs used for routing each connection
    - Configuration of CBs
  Figure: ORN b/w the i-th row and the (i+1)-th row
  A micro-routing algorithm has been implemented for the LSRDP with underlying layout II and PE arch. III
- Slide 10
- ORN Micro-routing: Example
  Micro-nets:
  - (PE1 -> PE5)
  - (PE2 -> PE5, PE6, PE7)
  - (PE3 -> PE6, PE8)
  - (PE4 -> PE7, PE8)
  Legend: 1/2CB: 1-input/2-output; CB: 2-input/2-output
  Figure: PE1 to PE4 connected to PE5 to PE8 through CBs, with CB configuration bits 0001 1011
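The micro-nets of this example can be expanded into two-terminal connections to see what the ORN must deliver. The sketch below does that and adds a toy 2-input/2-output cross-bar. The net list comes from the slide; the helper names and the `select` encoding of the cross-bar are hypothetical and are not the encoding behind the `0001 1011` bits in the figure.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Micro-nets from the slide: source PE -> destination PEs
MICRO_NETS: Dict[str, List[str]] = {
    "PE1": ["PE5"],
    "PE2": ["PE5", "PE6", "PE7"],
    "PE3": ["PE6", "PE8"],
    "PE4": ["PE7", "PE8"],
}


def expand_micro_nets(nets: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Split each multi-destination net into (source, destination) pairs."""
    return [(src, dst) for src, dests in nets.items() for dst in dests]


def fan_in(connections: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count how many connections arrive at each destination PE."""
    counts: Dict[str, int] = defaultdict(int)
    for _, dst in connections:
        counts[dst] += 1
    return dict(counts)


def crossbar_2x2(inputs: Sequence[str], select: Sequence[int]) -> Tuple[str, ...]:
    """Hypothetical CB model: output k forwards inputs[select[k]]."""
    return tuple(inputs[s] for s in select)


connections = expand_micro_nets(MICRO_NETS)
arrivals = fan_in(connections)
swapped = crossbar_2x2(("a", "b"), (1, 0))    # cross configuration
broadcast = crossbar_2x2(("a", "b"), (0, 0))  # one input drives both outputs
```

Each destination PE in the example receives exactly two signals, and the net from PE2 needs a broadcast somewhere in the ORN because it has three destinations.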
- Slide 11
- ORN Micro-Routing Example: Heat 8x2, ORN b/w 3rd and 4th rows
  Figure: routes from the PEs in the 3rd row to the PEs in the 4th row
- Slide 12
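- Specifications of Attempted DFGs
  | DFG | Total # of nodes | # of inputs | # of outputs | # of ops |
  |---|---|---|---|---|
  | Heat-8x1 | 34 | 6 | 4 | 16 |
  | Heat-8x2 | 60 | 8 | 4 | 32 |
  | Heat-16x2 | 172 | 16 | 12 | 96 |
  | Poisson-3x3 | 62 | 18 | 1 | 33 |
  | Vibration-4x2 | 48 | 8 | 4 | 24 |
  | Vibration-8x2 | 136 | 16 | 12 | 72 |
  | Vibration-8x4 | 168 | 16 | 8 | 96 |
  | ERI-1 | 76 | 16 | 9 | 51 |
  | ERI-2 | 67 | 19 | 1 | 47 |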
- Slide 13
- Example of a DFG Mapping: Vibration-8x2
- Slide 14
- Results of routing nets using the proposed algorithms
  | DFG | Avg. hor. C.L. | Avg./max. ver. C.L. | # of global/micro nets to route | Time to map (sec) |
  |---|---|---|---|---|
  | Heat-8x1 | 0.35 | 0.75/3 | 36/64 | 0.015 |
  | Heat-8x2 | 0.44 | 1.32/5 | 68/114 | 1.75 |
  | Heat-16x2 | 0.47 | 1.64/7 | 204/343 | 1.05 |
  | Poisson-3x3 | 0.68 | 2.4/16 | 67/120 | 2074.5 |
  | Vibration-4x2 | 0.46 | 1.58/9 | 50/88 | 0.34 |
  | Vibration-8x2 | 0.42 | 2.15/10 | 154/332 | 2.20 |
  | Vibration-8x4 | 2.48 | 3.72/16 | 348/610 | 6721.3 |
  | ERI-1 | 0.75 | 2.21/9 | 111/374 | 53.61 |
  | ERI-2 | 0.78 | 2.99/9 | 95/332 | 0.327 |
- Slide 15
- Thank You for Your Attention!
  Any Questions?
- Slide 16
- 10 TFLOPS SFQ-RDP computer
  - SFQ RDP: 32 PE x 32 chips (2.5 GFLOPS/PE), operating at 4.2 K
  - SMAC: streaming memory access controller
  - CMOS CPU (one chip)
  - Memory bandwidth per MCM: 256 GB/s (= 16 GB/s x 16 channels)
  - 1024 FPU@MCM chips x 4 MCM
  - 2 TB memory module: FB-DIMM [DDR3@1333MHz, 128 GB] x 16 modules
  - SFQ 0.5 µm process
  Figure labels: SMAC, SB, ORN, FPU
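The headline figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes one FPU per PE, 32 PEs on each of 32 chips per MCM, and 2.5 GFLOPS per PE; those groupings are a reading of the flattened figure text rather than something the slide states explicitly, and the variable names are illustrative.

```python
# Values read from the slide; how they combine is an assumption of this sketch.
PES_PER_CHIP = 32
CHIPS_PER_MCM = 32
MCMS = 4
GFLOPS_PER_PE = 2.5
GB_PER_S_PER_CHANNEL = 16
CHANNELS_PER_MCM = 16
GB_PER_DIMM = 128
DIMMS = 16

fpus_per_mcm = PES_PER_CHIP * CHIPS_PER_MCM                   # 1024 FPUs per MCM
peak_tflops = fpus_per_mcm * MCMS * GFLOPS_PER_PE / 1000      # whole machine
bandwidth_per_mcm = GB_PER_S_PER_CHANNEL * CHANNELS_PER_MCM   # GB/s
memory_tb = GB_PER_DIMM * DIMMS / 1024                        # total memory in TB

print(f"{fpus_per_mcm} FPUs/MCM, {peak_tflops} TFLOPS peak, "
      f"{bandwidth_per_mcm} GB/s per MCM, {memory_tb} TB memory")
```

Under these assumptions the numbers agree with the slide: 1024 FPUs per MCM, roughly 10 TFLOPS over four MCMs, 256 GB/s per MCM, and 2 TB of memory.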
- Slide 17
- Development of RDP Architecture
  Chip micro-architecture:
  - Two types of PEs: F and
  - PE layout: checkered pattern
  - PE: Two Inputs A,B,C; Three Outputs A(*B),B,C
  - Three scales of RDP (small, medium and large scale)
  RDP parameters optimized by total number of JJs:
  | RDP size | # Input | # Output | Width | Height | MCL | Total JJs |
  |---|---|---|---|---|---|---|
  | RDP-S | 19 | 12 | 22 | 14 | 4 | 19387K |
  | RDP-M | 19 | 12 | 24 | 17 | 5 | 27027K |
  | RDP-L | 38 | 24 | 41 | 34 | 6 | 96374K |
  Figure labels: FU, TU, FP, TU, Data Through
- Slide 18
- Development of RDP Compiler
  Application C code -> modifying application code (manual: inserting LSRDP instructions in the code) -> modified code
  Flow 1 (assembly code generation for GPP):
  - Inputs: modified code; RDP library file (function definitions & declarations)
  - ISAcc or COINS compiler -> .asm code for MIPS-based GPP
  Flow 2 (configuration bit-stream generation for the RDP):
  - Inputs: modified code; RDP architecture description
  - DFG extraction (semi-manual) -> data flow graphs
  - Placement and routing tool -> configuration file + various text and schematic reports
  Simulator -> performance evaluation
- Slide 19
- Development of RDP Oriented Algorithms
  - One-dimensional heat and vibrational equations
  - Two-dimensional heat and FDTD equations
  - Two-Electron Repulsion Integral calculation in quantum chemistry
  - Runge-Kutta calculation for ordinary differential equation
  Performance Evaluation
  - Two-dimensional heat equation (1024x1024 mesh): SFQ-RDP 1): 50.6 GFlop/s vs. GPU 2): 63.0 GFlop/s
  1) Evaluation method:
     - RDP: execution time model; DFG has 21 inputs, 9 outputs, and 63 operations
     - GPP: cycle-accurate processor simulator; BW: 159.0 GB/s
  2) T. Aoki and A. Nukada, CUDA programming premier, Kougakusya, ISBN-10: 4777514773, 2009 (in Japanese).