21st International Symposium on Superconductivity, Tsukuba, Japan, October 27-29, 2008. Yokohama National University. Recent development of large-scale reconfigurable data-paths using RSFQ circuits. Nobuyuki Yoshikawa, Department of Electrical and Computer Engineering
Recent development of large-scale reconfigurable data-paths using
RSFQ circuits
Nobuyuki Yoshikawa
Department of Electrical and Computer Engineering,
Yokohama National University, Yokohama, Japan
21st International Symposium on Superconductivity
Tsukuba, Japan, October 27-29, 2008
Coworkers
H. Park, H. Hara, K. Taketomi, T. Kainuma, Y. Yamanashi (Yokohama National University)
I. Kataeva, R. Kasagi, S. Iwasaki, H. Akaike, A. Fujimaki, M. Tanaka, K. Obata, Y. Ito, K. Takagi, N. Takagi (Nagoya University)
H. Honda, K. Inoue, K. Murakami (Kyushu University)
S. Nagasawa, M. Hidaka (SRL/ISTEC)
Outline of This Talk
- Background
- Architecture
- Target system
- Component developments: floating-point adders/multipliers (FPA/FPM), 2 x 2 RDP
- New process and cell library
- Road map
- Summary
Demand on High-Performance Computer
Calculation amount of electronic structure of molecules using the molecular orbital method
A molecule with 1000 atoms
600 TB of ERI calculations composed of a lot of product-sum operations
O(N^4)
Breakdown of Moore’s Law
[Figure: Trends of the clock frequency of recent microprocessors (Pentium III, Celeron, Xeon, Pentium 4), 1998 to 2004. Vertical axis: clock frequency [GHz], log scale from 0.2 to 5. Annotated growth rates: 1.6x / year and 1.1x / year. Source: http://www.intel.com/]
Problem in High-Performance Computers and Our Approach
- Large power consumption
- Memory wall problem
(Single Flux Quantum circuits + new architecture) solves these problems
Josephson junction: Φ0 = h/2e = 2.07 mV·ps
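As a quick numerical check of the flux quantum quoted on this slide, the sketch below evaluates Φ0 = h/2e from the SI-defined values of h and e and converts the result to mV·ps. The variable names are illustrative only.

```python
# Exact SI values of the Planck constant and the elementary charge.
PLANCK_H = 6.62607015e-34             # J*s
ELEMENTARY_CHARGE = 1.602176634e-19   # C

# Flux quantum in Wb (= V*s), then converted to mV*ps.
phi0_weber = PLANCK_H / (2 * ELEMENTARY_CHARGE)
phi0_mv_ps = phi0_weber * 1e3 * 1e12

print(f"Phi0 = {phi0_weber:.4e} Wb = {phi0_mv_ps:.2f} mV*ps")
```

The product form mV·ps is convenient for SFQ circuits because it is the time-integrated voltage of a single flux quantum pulse.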
Large-Scale Reconfigurable Data-Path (LSRDP) using RSFQ Circuits
Many FPUs + reconfigurable network
The data are directly transferred between FPUs.
Reduction of memory wall problem
N. Takagi et al. IEICE Technical Report, SCE2006-36, January 2007.
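The idea of FPUs linked by a reconfigurable network can be illustrated with a small software model. This is only a sketch under assumptions: the function `run_rdp`, the configuration format, and the example computation are hypothetical and are not taken from the slides. Each row of FPUs reads its two operands directly from outputs of the previous row, so intermediate results never go back to memory.

```python
import operator

def run_rdp(rows, inputs):
    """Evaluate a feed-forward data-path (hypothetical model).

    rows   : list of rows; each row is a list of (op, src_a, src_b),
             where src_a and src_b index the outputs of the previous row
             (this index pair plays the role of the routing network).
    inputs : values fed to the first row.
    """
    values = list(inputs)
    for row in rows:
        # Every FPU in the row takes its operands from the previous row only.
        values = [op(values[a], values[b]) for op, a, b in row]
    return values

# Example configuration computing (a + b) * (c + d).
config = [
    [(operator.add, 0, 1), (operator.add, 2, 3)],  # row 0: two adders
    [(operator.mul, 0, 1)],                        # row 1: one multiplier
]
result = run_rdp(config, [1.0, 2.0, 3.0, 4.0])
print(result)
```

Changing `config` changes the computed expression without changing the hardware model, which is the sense in which the data-path is reconfigurable.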
Example of Application of LSRDP
tei(4,4,4,4)=(((3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*f(0,t))/(p**2*q**2)+(4*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*PQx*(QCx+QDx)*(3+2*q*QCx*QDx)*f(1,t))/(p*q*(p+q))(4*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*f(1,t))/(p*q*(p+q))(8*(PAx+PBx)*(3+2*p*PAx*PBx)*(QCx+QDx)*(3+2*q*QCx*QDx)*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p*q*(p+q)**2)+(2*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p*q**2*(p+q)**2)+(2*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p**2*q*(p+q)**2)+(4*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*PQx*(QCx+QDx)*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(q*(p+q)**3)\+(8*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*PQx*(QCx+QDx)*(3+2*q*QCx*QDx)*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(p*(p+q)**3)(8*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(q*(p+q)**3)(4*(PAx+PBx)*PQx*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(p*(p+q)**3)+((3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(q**2*(p+q)**4)(8*(PAx+PBx)*(3+2*p*PAx*PBx)*(QCx+QDx)*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(q*(p+q)**4)(8*(PAx+PBx)*(QCx+QDx)*(3+2*q*QCx*QDx)*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p*(p+q)**4)+(4*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p*q*(p+q)**4)+((3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p**2*(p+q)**4)(4*p*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(q*(p+q)**5)+(8*(3+p*(PAx**2+4*PAx*PBx+PBx**2))
*PQx*(QCx+QDx)*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p+q)**5+(4*PQx*q*(QCx+QDx)*(3+2*q*QCx*QDx)*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p*(p+q)**5)(8*(PAx+PBx)*PQx*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p+q)**5+(8*(PAx+PBx)*(QCx+QDx)*(15*(p+q)**3*f(3,t)+30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))8*p**3*PQx**6*q**3*f(6,t)))/(p+q)**6+(2*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(15*(p+q)**3*f(3,t)30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))+8*p**3*PQx**6*q**3*f(6,t)))/(q*(p+q)**6)+(2*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(15*(p+q)**3*f(3,t)30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))+8*p**3*PQx**6*q**3*f(6,t)))/(p*(p+q)**6)
787 MUL, 261 ADD, 69 FUNC
Data-flow graph mapped to the LSRDP
Electron repulsion integral calculations in the molecular orbital method
while (I < 1000):
I = I+1:
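To give a feel for how such an expression breaks down into arithmetic nodes of a data-flow graph, the sketch below counts the binary operators in one sub-term that appears repeatedly in the expression above. It uses Python's `ast` module. The counting convention (each `**` counted as a separate power node) is an assumption and may differ from the one behind the MUL/ADD/FUNC totals on the slide.

```python
import ast
from collections import Counter

def count_ops(expr: str) -> Counter:
    """Count binary operators in an arithmetic expression by operator type."""
    tree = ast.parse(expr, mode="eval")
    return Counter(
        type(node.op).__name__
        for node in ast.walk(tree)
        if isinstance(node, ast.BinOp)
    )

# One sub-term copied from the tei(4,4,4,4) expression.
term = "3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2))"
ops = count_ops(term)
print(dict(ops))
```

Each counted operator corresponds to one node that would have to be placed on an FPU when the expression is mapped to the data-path.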
LSRDP Architecture: Suitable for RSFQ Circuits
Data flow in one direction.
No loop structure. Need high throughput.
Latency is not so important.
Suitable for bit-serial processing.
Reduced requirement on memory bandwidth.
High switching activity. Heating is serious in semiconductor circuits
Application Fields of LSRDP Processors
Molecular orbital calculation
Diffusion equation
Wave equation
Poisson equation
etc.
Target System: 10-TFLOPS RSFQ-LSRDP Computer
[Block diagram: SMAC units, streaming buffer (SB), and alternating rows of ORNs and FPUs; diagram labels include 4.2 K]
- SFQ RDP: 32 FPUs × 32 chips (4 GFLOPS/FPU)
- SFQ streaming buffer: 64 Kb × 2 chips
- CMOS CPU: 1 chip
- 1024 FPUs per MCM (34 chips) × 4 MCMs
- Memory bandwidth per MCM: 256 GB/s (= 16 GB/s × 16 channels)
- 2-TB memory module (FB-DIMM [DDR3 @ 1333 MHz, 128 GB] × 16 modules)
- SFQ 0.5-µm process
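The figures on this slide can be cross-checked with simple arithmetic. The sketch below multiplies out the quoted numbers. Reading the 34 chips per MCM as 32 RDP chips plus 2 streaming-buffer chips is an assumption, and the slides do not state how the 10-TFLOPS target relates to the raw product of FPU count and per-FPU rate.

```python
# Numbers quoted on the slide.
FPUS_PER_CHIP = 32
RDP_CHIPS_PER_MCM = 32
BUFFER_CHIPS_PER_MCM = 2      # assumed to account for 34 chips per MCM
GFLOPS_PER_FPU = 4
MCMS = 4
CHANNELS = 16
GB_PER_S_PER_CHANNEL = 16
MODULE_GB = 128
MODULES = 16

fpus_per_mcm = FPUS_PER_CHIP * RDP_CHIPS_PER_MCM
chips_per_mcm = RDP_CHIPS_PER_MCM + BUFFER_CHIPS_PER_MCM
raw_peak_tflops = fpus_per_mcm * MCMS * GFLOPS_PER_FPU / 1000
bandwidth_gb_s = CHANNELS * GB_PER_S_PER_CHANNEL
memory_tb = MODULE_GB * MODULES / 1024

print(fpus_per_mcm, chips_per_mcm, raw_peak_tflops, bandwidth_gb_s, memory_tb)
```

The raw product of all FPUs running at 4 GFLOPS comes to about 16.4 TFLOPS, so the 10-TFLOPS target leaves some headroom below that figure.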
Organization of the Project
Profs. K. Murakami, H. Honda (Kyushu Univ.) LSRDP architecture, compiler, algorithm
Profs. N. Takagi, K. Takagi (Nagoya Univ.) CAD for logic design, arithmetic circuits
Prof. N. Yoshikawa (Yokohama National Univ.) RSFQ-FPU chip, cell library
Profs. A. Fujimaki, H. Akaike (Nagoya Univ.) Network, RSFQ-LSRDP chip, cell library
Dr. S. Nagasawa (SRL) Advanced process
Component Development
- Floating-point adder (FPA)
- Floating-point multiplier (FPM)
- Operand routing network (ORN)
- 2 x 2 LSRDP prototype
Floating-Point Numbers
Data format in IEEE754 standard

Format            | Sign | Exponent | Significand (incl. hidden bit)
Half-precision    | 1    | 5        | 11
Single-precision  | 1    | 8        | 24
Double-precision  | 1    | 11       | 53

Single-precision layout: S (1 bit) | E (8 bit) | F (23 bit)
S: Sign
E: Exponent
F: Significand or Fraction
Value: (-1)^S × F × 2^E
Example (single precision, 32 bit): 1.101 × 2^4
0 10000011 10100000000000000000000
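The single-precision example can be reproduced with the standard `struct` module. In IEEE 754 single precision the stored exponent is the true exponent plus a bias of 127, so 2^4 is stored as 131, and the leading 1 of the significand is hidden. The helper name `float32_fields` is illustrative.

```python
import struct

def float32_fields(x: float) -> tuple:
    """Split a value into IEEE 754 single-precision sign, exponent, fraction."""
    bits = int.from_bytes(struct.pack(">f", x), "big")
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # stored with a bias of 127
    fraction = bits & 0x7FFFFF          # 23 bits, hidden leading 1 not stored
    return sign, exponent, fraction

value = 0b1101 / 8 * 2**4               # 1.101 (binary) x 2^4 = 26.0
sign, exponent, fraction = float32_fields(value)
print(f"{sign} {exponent:08b} {fraction:023b}")
```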
Bit-Serial Floating-Point Calculation
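[Figure: bit-serial data streams for the significand and for the exponent and sign, each running between MSB and LSB along the time axis t; labels nf and ne mark the stream lengths]
Two bit-serial data-paths are used for the calculation of significand and exponent.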
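A bit-serial data-path handles one bit per clock and keeps only a small state between clocks. The sketch below models a bit-serial adder with a single carry register. It assumes the bits arrive LSB first. The helper names `to_bits` and `from_bits` are illustrative and not part of the design.

```python
def bit_serial_add(a_bits, b_bits):
    """Add two equal-length bit streams given LSB first.

    Returns the sum stream (LSB first) and the final carry.
    One loop iteration corresponds to one clock cycle.
    """
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                # sum bit of a full adder
        carry = (a & b) | (carry & (a ^ b))      # carry kept for the next clock
    return out, carry

def to_bits(x, n):
    """Integer to n bits, LSB first."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    """LSB-first bits back to an integer."""
    return sum(b << i for i, b in enumerate(bits))

sum_bits, carry_out = bit_serial_add(to_bits(13, 5), to_bits(10, 5))
print(from_bits(sum_bits), carry_out)
```

Only one full adder and one carry storage element are needed regardless of the word length, which is why the hardware cost stays small.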
Timing Parameters in Bit-Serial Calculation
[Figure: timing chart of clock and data signals for Input 1, Input 2, Input 3, and Output 1 of an operation unit; each word is streamed bit-serially between LSB and MSB along the time axis]
Floating-Point Addition: Example
Block Diagram of Bit-Serial FPA
[Block diagram. Inputs: significand of A (Fa), significand of B (Fb), exponent & sign of A (Ea, Sa), exponent & sign of B (Eb, Sb). Blocks: separator circuit, comparator of magnitude (A > B), shifter of A, shifter of B, buffers, normalizer, normalizer controller, sign and exponent's combine circuit. Internal signals: result of "A-B", shift value, effective operation, amount of correction, sign of result, result of operation. Outputs: significand of result, exponent & sign of result. Legend: data signals, control signals.]
(1) Align significand & rounding
(2) Addition (or subtraction)
(3) Normalization
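The three steps (alignment, addition or subtraction, normalization) can be sketched in software as follows. This is a simplified model and not the circuit itself: it works on whole integers rather than bit-serially, it truncates instead of rounding, and the tuple format `(sign, exponent, significand)` and the function names are assumptions made for illustration.

```python
def fp_add(a, b, nf=11):
    """Add two numbers given as (sign, exponent, significand).

    The significand is an nf-bit integer whose top bit is 1, and the value is
    (-1)**sign * significand * 2**(exponent - (nf - 1)).
    """
    (sa, ea, fa), (sb, eb, fb) = a, b
    # Magnitude comparison: make A the operand with the larger magnitude.
    if (ea, fa) < (eb, fb):
        (sa, ea, fa), (sb, eb, fb) = (sb, eb, fb), (sa, ea, fa)
    # (1) Align: shift the smaller significand right (truncation, no rounding).
    fb >>= ea - eb
    # (2) Add or subtract, depending on the effective operation.
    f = fa + fb if sa == sb else fa - fb
    e = ea
    if f == 0:
        return (0, 0, 0)
    # (3) Normalize so that the top bit of the nf-bit significand is 1 again.
    while f >= (1 << nf):
        f >>= 1
        e += 1
    while f < (1 << (nf - 1)):
        f <<= 1
        e -= 1
    return (sa, e, f)

def to_float(x, nf=11):
    """Convert (sign, exponent, significand) to a Python float."""
    s, e, f = x
    return (-1) ** s * f * 2.0 ** (e - (nf - 1))

total = fp_add((0, 0, 1536), (0, -2, 1024))      # 1.5 + 0.25
difference = fp_add((0, 0, 1536), (1, 0, 1280))  # 1.5 - 1.25
print(to_float(total), to_float(difference))
```

The subtraction case shows why a normalizer is needed: the raw difference has leading zeros and must be shifted left while the exponent is decreased.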
Chip Photograph of Half-Precision FPA
CONNECT, cooperated with SRL, NiCT, NU & YNU
*nf : bit length of significand
DC Bias Margin of Each Component Circuits @20GHz
Floating-Point Multiplier
Significand part is calculated by a systolic-array multiplier: Zf = Xf × Yf
Exponent part is calculated by a bit-serial adder: Ze = Xe + Ye
Format: S (1 bit) | E (8 bit) | F (23 bit)
S: Sign, E: Exponent, F: Fraction
Value: (-1)^S × F × 2^E
Systolic-Array Multiplier
- Composed of 1D array of 1-b processing element (PE).
- Small hardware cost: ∝ (bit length)
- High throughput: ~ 1/(bit length)
[Figure: systolic-array multiplier with bit-serial input and output streams running between MSB and LSB]
Chip Photograph of Half-Precision FPM
CONNECT, cooperated with SRL, NiCT, NU & YNU
*nf : bit length of significand
Test Result of FPM@25GHz
[Calculation of exponent part]
(10) + (-2) + 1 = 9 (EX + EY + carry from fraction part)
FX: 11010110111  EX: 11001
FY: 11001010011  EY: 01101
FXY: 10101001110  EXY: 11000
[Figure: measured bit-serial waveforms between LSB and MSB]
Correct operation was confirmed at high speed.
Maximum operating frequency: 31.5 GHz
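The test vector can be checked in software. The sketch below multiplies the two significands, keeps the upper 11 bits, and adds the exponents together with the carry from the fraction part. It assumes that the bit strings are written MSB first, that the 11-bit significand includes the leading 1, that the exponent uses the half-precision bias of 15, and that the product is truncated rather than rounded. Under these assumptions the result matches FXY and EXY on the slide.

```python
def fp_mul(fx, ex, fy, ey, nf=11, bias=15):
    """Multiply two significand/exponent pairs (sign handling omitted).

    fx, fy : nf-bit significands with the top bit set (value in [1, 2)).
    ex, ey : biased exponents.
    Returns (significand, biased exponent, carry from the fraction part).
    """
    product = fx * fy                      # up to 2*nf bits, value in [1, 4)
    carry = product >> (2 * nf - 1)        # 1 if the product is 2 or larger
    fz = product >> (nf - 1 + carry)       # keep the top nf bits (truncation)
    ez = ex + ey - bias + carry            # exponent sum plus fraction carry
    return fz, ez, carry

fz, ez, carry = fp_mul(0b11010110111, 0b11001, 0b11001010011, 0b01101)
print(f"FXY: {fz:011b}  EXY: {ez:05b}  carry: {carry}")
```

In unbiased terms the exponents are 10 and -2, and the carry of 1 gives 9, which is the sum written on the slide.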
Summary of Half-Precision FPUs
# of JJs: 11700 (floating-point adder), 11044 (floating-point multiplier)
Size (mm2): 6.76 x 4.96 (floating-point adder), 6.22 x 3.78 (floating-point multiplier)
Minimum interval (clocks): 12 (nf + 1)
Latency (clocks): 23 (2 nf + 1)
nf : bit length of fraction part
CONNECT, cooperated with SRL, NiCT, NU & YNU
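The interval and latency formulas translate directly into a result rate and a delay once a clock frequency is chosen. The sketch below evaluates them for the half-precision case (nf = 11) at 25 GHz. The second case, single precision (nf = 24) at 100 GHz, is an assumed combination used only to show how the formulas scale. The slides do not state that the same formulas hold for other precisions.

```python
def result_rate(clock_hz, nf):
    """Results per second when one result is issued every nf + 1 clocks."""
    return clock_hz / (nf + 1)

def latency_s(clock_hz, nf):
    """Latency in seconds for a latency of 2*nf + 1 clocks."""
    return (2 * nf + 1) / clock_hz

half_rate = result_rate(25e9, 11)        # half precision at 25 GHz
half_latency = latency_s(25e9, 11)
single_rate = result_rate(100e9, 24)     # assumed: single precision at 100 GHz

print(f"{half_rate / 1e9:.2f} G results/s, {half_latency * 1e9:.2f} ns latency")
print(f"{single_rate / 1e9:.2f} G results/s")
```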
NDRO-based and crossbar-based architectures of ORN
ORN requirements:
- 1-to-N connections, where N is an odd number
- connections to either input of the FPU
NDRO-based ORN [diagram: FPUs linked through NDRO switches]
“+”: small number of Josephson junctions required
“–”: irregular, non-pipelined structure; becomes cumbersome as the complexity increases
Crossbar-based ORN [diagram: M FPUs linked through CBT and ½CBT crossbar switches]
“+”: scalable, pipelined, easily re-designed for any number of N and M
“–”: large number of Josephson junctions required
Comparison of the ORN architectures

NDRO-based ORN
ORN complexity | latency, ps | skew, ps | minimum interval | number of control lines | bias current, A | power, mW | number of JJs
N=3, M=8       | ~60         | ~60      | nf+60ps          | 96                      | 0.6             | 1.5       | ~5500
N=5, M=10      | ~80         | ~80      | nf+80ps          | 200                     | 0.9             | 2.25      | ~8000
N=9, M=32      | ~100        | ~100     | nf+100ps         | 1152                    | 5.5             | 13.75     | ~50500

Crossbar-based ORN
ORN complexity | latency, clocks | skew, ps | minimum interval | number of control lines | bias current, A | power, mW | number of JJs
N=3, M=8       | 6               | ~300     | nf               | 100                     | 0.63            | 1.575     | 6230
N=5, M=10      | 10              | ~500     | nf               | 208                     | 1.41            | 3.525     | 13930
N=9, M=32      | 18              | ~900     | nf               | 1168                    | 8.28            | 20.7      | 77440
The number of JJs of the NDRO-based ORN in the table is an estimate based on the design of the switch for the RDP prototype (N=3, M=4), which consists of 2750 JJs and requires a 300 mA bias current (Iwasaki, not published yet).
A crossbar switch with broadcasting function: 296 JJs
Note that almost the same number of JJs is required for both ORNs if an isometric network (equal-length wiring) is employed in the NDRO-based ORN.
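The bias current and power columns of the comparison are related by a fixed ratio, which can be checked numerically. The sketch below divides power by bias current for every row. The resulting 2.5 mV is an inference from the listed numbers and is not a value stated on the slide. The dictionary names are illustrative.

```python
# (bias current in A, power in mW) for each ORN configuration in the comparison.
ndro = {
    "N=3, M=8": (0.6, 1.5),
    "N=5, M=10": (0.9, 2.25),
    "N=9, M=32": (5.5, 13.75),
}
crossbar = {
    "N=3, M=8": (0.63, 1.575),
    "N=5, M=10": (1.41, 3.525),
    "N=9, M=32": (8.28, 20.7),
}

def implied_bias_voltage_mv(current_a, power_mw):
    """Power divided by current; mW / A gives mV."""
    return power_mw / current_a

voltages = [
    implied_bias_voltage_mv(current, power)
    for table in (ndro, crossbar)
    for current, power in table.values()
]
jj_ratio_largest = 77440 / 50500   # crossbar vs NDRO junction count, N=9, M=32

print([round(v, 3) for v in voltages], round(jj_ratio_largest, 2))
```

For the largest configuration the crossbar-based ORN needs roughly 1.5 times as many junctions as the NDRO-based estimate.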
1-to-2 ORN test
[Circuit diagram: crossbar switches CBT0, CBT1, CBT2; data inputs din0, din1, din2; data outputs dout01, dout02, dout11, dout12; clock signals clkin_lfin (with ladder), clkin_hf, clkin_lfout1, clkin_lfout2; bias lines bias_kern0, bias_kern1, bias_kern2]
1-to-2 ORN: 2043 JJs, bias current 226 mA
Total test circuit: 3098 JJs, total bias current: 359 mA
Example: open466, no. 4 chip F2
Completely functional, exhaustive test
bias_kern0 = -14.6 / 5.3 % does not depend on the pattern
bias_kern1 = -16.1 / 18.3 % for din0 -> dout11, dout12
bias_kern2 = -20.7 / 12.6 % for din0 -> dout11, dout12
minimum!
bias_kern1 = -40.3 / 17.2 % for din1 -> dout01
bias_kern2 = -38 / 12.6 % for din2 -> dout02, dout12
maximum!
[Waveforms: example of the low frequency test, din0 -> dout01, dout02, dout12; traces din0, cross10, bar00, clkin_lfin, dout01, dout11, clkout, clkout1, clkout2, dout02, dout12, cross11, cross01, bar02, bar12]
[Plot: frequency dependence of the bias margins, bias_kern1 margins for din0 -> dout11 routing; upper and lower margin curves, vertical axis from -30 to 20, horizontal axis from 10.842 to 23.480]
Design of 2x2 SFQ-RDP
[Chip layout: ALU, input SR, output SR, ALU controller, ORN, buffers; scale bar 1 mm]
• 11 pipeline stages
• Designed frequency: 25 GHz
• InSR & OutSR length: 16 bits
• Data length: 7 bits
• Bias current: 1.27 A
• Circuit area: 5.90 x 3.68 mm2
• 10839 JJs
Demonstration of 2x2 SFQ-RDP
[Figures: input patterns and output patterns; frequency characteristic of RDP]
The function for each ALU is chosen as shown in the figure.
Maximum operating frequency: 23 GHz
Device Structure of Nb 10-layer Fabrication Process
Layout (scale bar: 30 µm)
■ DCP (M1)
■ Bias pillar (C1, 2, 3, 4, 5, 6, GC): 5 x 5 µm2
■ 6-layer moat (M2, 3, 4, 5, 6, 7)
□ PTL (M3, 5): width 4.8 – 5.5 µm
□ Via of PTLs: less than 12 x 12 µm2
Maximum current value: 2.4 mA (limited by size of contacts)
Cell library
[Cell layouts, scale bar 30 μm: DC/SFQ, SFQ/DC, CBE, D2FF, DFF, JAND, JANDF, JNOR, JNOT, JOR, RTFFB, SPL3, SPLL, T1]
Jc: 10 kA/cm2
βc = 2
Design of Bit-Serial Half Adder using a New Cell Library
Jc: 10 kA/cm2, βc = 2
Logic simulation results of bit-serial half adder
[Block diagram: clock generator, shift register for input, bit-serial adder, shift register for output]
On-Chip High-Speed Test Results of Bit-Serial Half Adder
Jc: 10 kA/cm2, βc = 2
Road Map of RSFQ LSRDP Processor
Time axis: 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014 -
Processes: 2.5 kA/cm2 process, 10 kA/cm2 process, 40 kA/cm2 process
Milestones:
- 25 GHz FPU/RDP
- 60 GHz FPU & LSRDP prototype
- 100 GHz FPU & LSRDP prototype
- 10 TFLOPS LSRDP system development
Summary
Our target is to establish a fundamental technology for high-end supercomputers based on the large-scale reconfigurable data-path (LSRDP) architecture.
Some key components were designed and implemented using the standard Nb process, and their correct operation was demonstrated:
- Half-precision RSFQ FPA and FPM
- Operand routing network (ORN)
- 2 x 2 RDP
The structure of the SRL advanced II process was determined, and a new cell library is under development. 85 GHz operation of a bit-serial half adder was demonstrated.