Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Nobuyuki Yoshikawa

Department of Electrical and Computer Engineering,

Yokohama National University, Yokohama, Japan

21st International Symposium on Superconductivity, Tsukuba, Japan, October 27-29, 2008

Coworkers

H. Park, H. Hara, K. Taketomi, T. Kainuma, Y. Yamanashi (Yokohama National University)

I. Kataeva, R. Kasagi, S. Iwasaki, H. Akaike, A. Fujimaki, M. Tanaka, K. Obata, Y. Ito, K. Takagi, N. Takagi (Nagoya University)

H. Honda, K. Inoue, K. Murakami (Kyushu University)

S. Nagasawa, M. Hidaka (SRL/ISTEC)

Outline of This Talk

- Background
- Architecture
- Target system
- Component developments: floating-point adders/multipliers (FPA/FPM), 2 x 2 RDP
- New process and cell library
- Road map
- Summary

Demand on High-Performance Computer

Computational cost of calculating the electronic structure of molecules with the molecular orbital method:

- A molecule with 1000 atoms requires 600 TB of electron repulsion integral (ERI) calculations, which consist largely of product-sum operations.
- The cost scales as O(N^4).

Breakdown of Moore's Law

[Figure: trends of the clock frequency of recent microprocessors (Pentium III, Celeron, Xeon, Pentium 4), 1998-2004, with growth-rate annotations of 1.6x / year and 1.1x / year. Source: http://www.intel.com/]

Problems in High-Performance Computers and Our Approach

- Large power consumption
- Memory wall problem

Single flux quantum (SFQ) circuits combined with a new architecture address both problems.

Josephson junction

Φ0 = h/2e = 2.07 mV·ps
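The flux quantum value quoted above can be checked numerically. The sketch below is illustrative only and uses the exact SI values of the Planck constant and the elementary charge; the function and variable names are hypothetical.

```python
# Magnetic flux quantum Phi0 = h / (2e), using the exact SI (2019) constants.
PLANCK_H = 6.62607015e-34            # J*s
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def flux_quantum_wb() -> float:
    """Return the flux quantum in webers (V*s)."""
    return PLANCK_H / (2.0 * ELEMENTARY_CHARGE)


phi0 = flux_quantum_wb()
# 1 V*s = 1e3 mV * 1e12 ps, so multiply by 1e15 to express Phi0 in mV*ps.
phi0_mv_ps = phi0 * 1e15
print(f"Phi0 = {phi0_mv_ps:.2f} mV*ps")
```

Read this way, an SFQ pulse carries a fixed voltage-time area: a pulse about 1 ps wide is on the order of 2 mV high.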

Large-Scale Reconfigurable Data-Path (LSRDP) using RSFQ Circuits

A large number of FPUs + a reconfigurable network

- The data are transferred directly between FPUs.
- This reduces the memory wall problem.

N. Takagi et al. IEICE Technical Report, SCE2006-36, January 2007.

Example of Application of LSRDP

tei(4,4,4,4)=(((3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*f(0,t))/(p**2*q**2)+(4*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*PQx*(QCx+QDx)*(3+2*q*QCx*QDx)*f(1,t))/(p*q*(p+q))(4*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*f(1,t))/(p*q*(p+q))(8*(PAx+PBx)*(3+2*p*PAx*PBx)*(QCx+QDx)*(3+2*q*QCx*QDx)*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p*q*(p+q)**2)+(2*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p*q**2*(p+q)**2)+(2*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p**2*q*(p+q)**2)+(4*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*PQx*(QCx+QDx)*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(q*(p+q)**3)\+(8*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*PQx*(QCx+QDx)*(3+2*q*QCx*QDx)*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(p*(p+q)**3)(8*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(q*(p+q)**3)(4*(PAx+PBx)*PQx*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(p*(p+q)**3)+((3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(q**2*(p+q)**4)(8*(PAx+PBx)*(3+2*p*PAx*PBx)*(QCx+QDx)*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(q*(p+q)**4)(8*(PAx+PBx)*(QCx+QDx)*(3+2*q*QCx*QDx)*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p*(p+q)**4)+(4*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p*q*(p+q)**4)+((3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p**2*(p+q)**4)(4*p*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(q*(p+q)**5)+(8*(3+p*(PAx**2+4*PAx*PBx+PBx**2))
*PQx*(QCx+QDx)*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p+q)**5+(4*PQx*q*(QCx+QDx)*(3+2*q*QCx*QDx)*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p*(p+q)**5)(8*(PAx+PBx)*PQx*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p+q)**5+(8*(PAx+PBx)*(QCx+QDx)*(15*(p+q)**3*f(3,t)+30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))8*p**3*PQx**6*q**3*f(6,t)))/(p+q)**6+(2*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(15*(p+q)**3*f(3,t)30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))+8*p**3*PQx**6*q**3*f(6,t)))/(q*(p+q)**6)+(2*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(15*(p+q)**3*f(3,t)30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))+8*p**3*PQx**6*q**3*f(6,t)))/(p*(p+q)**6)

787 MUL, 261 ADD, 69 FUNC

Data-flow graph mapped to the LSRDP

Electron repulsion integral calculations for molecular orbitals

while (I < 1000):

I = I+1:

LSRDP Architecture: Suitable for RSFQ Circuits

- Data flows in one direction; there is no loop structure.
- High throughput is needed; latency is not so important.
- Suitable for bit-serial processing.
- Reduced requirement on memory bandwidth.
- High switching activity, which causes serious heating in semiconductor circuits.

Application Fields of LSRDP Processors

Molecular orbital calculation

Diffusion equation

Wave equation

Poisson equation

Target System: 10-TFLOPS RSFQ-LSRDP Computer

[Figure: system block diagram; labels include SMAC]

- SFQ RDP: 32 FPUs x 32 chips (4 GFLOPS/FPU)
- SFQ streaming buffer: 64 Kb x 2 chips
- CMOS CPU: 1 chip
- 1024 FPUs per MCM (34 chips) x 4 MCMs
- Memory bandwidth per MCM: 256 GB/s (= 16 GB/s x 16 channels)
- 2 TB memory module: FB-DIMM [DDR3 @ 1333 MHz, 128 GB] x 16 modules
- SFQ 0.5 μm process
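The figures on this slide can be cross-checked with simple arithmetic. The sketch below only multiplies out the numbers quoted above; the variable names are hypothetical, and reading the 34 chips per MCM as 32 RDP chips plus 2 streaming-buffer chips is an assumption.

```python
# Numbers quoted on the target-system slide.
FPUS_PER_RDP_CHIP = 32
RDP_CHIPS_PER_MCM = 32
BUFFER_CHIPS_PER_MCM = 2      # assumption: these make up the 34 chips per MCM
GFLOPS_PER_FPU = 4
MCM_COUNT = 4
MEMORY_CHANNELS = 16
GB_PER_S_PER_CHANNEL = 16
DIMM_CAPACITY_GB = 128
DIMM_COUNT = 16

fpus_per_mcm = FPUS_PER_RDP_CHIP * RDP_CHIPS_PER_MCM
sfq_chips_per_mcm = RDP_CHIPS_PER_MCM + BUFFER_CHIPS_PER_MCM
peak_tflops_per_mcm = fpus_per_mcm * GFLOPS_PER_FPU / 1000
peak_tflops_system = peak_tflops_per_mcm * MCM_COUNT
bandwidth_gb_per_s = MEMORY_CHANNELS * GB_PER_S_PER_CHANNEL
memory_tb = DIMM_CAPACITY_GB * DIMM_COUNT / 1024

print(fpus_per_mcm, sfq_chips_per_mcm)          # FPUs and SFQ chips per MCM
print(peak_tflops_per_mcm, peak_tflops_system)  # peak TFLOPS per MCM and in total
print(bandwidth_gb_per_s, memory_tb)            # GB/s per MCM, TB of memory
```

The products reproduce the 1024 FPUs, 256 GB/s and 2 TB stated on the slide. The raw peak of four MCMs comes out above 10 TFLOPS; the slide does not say how the 10-TFLOPS target relates to that peak.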

Organization of the Project

Profs. K. Murakami, H. Honda (Kyushu Univ.): LSRDP architecture, compiler, algorithm

Profs. N. Takagi, K. Takagi (Nagoya Univ.): CAD for logic design, arithmetic circuits

Prof. N. Yoshikawa (Yokohama National Univ.): RSFQ-FPU chip, cell library

Profs. A. Fujimaki, H. Akaike (Nagoya Univ.): network, RSFQ-LSRDP chip, cell library

Dr. S. Nagasawa (SRL): advanced process

Component Development

- Floating-point adder (FPA)
- Floating-point multiplier (FPM)
- Operand routing network (ORN)
- 2 x 2 LSRDP prototype

Floating-Point Numbers

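                   Sign   Exponent   Fraction
Half-precision     1      5          11
Single-precision   1      8          24
Double-precision   1      11         53

The Fraction column counts the significand including the implicit leading 1 (for example, 24 = 23 stored bits + 1).

Field layout (single precision): S (1 bit) | E (8 bits) | F (23 bits)

S: sign

E: exponent

F: significand or fraction

Value = (-1)^S × F × 2^E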

Example (single precision, 32 bits): 1.101 × 2^4 (binary significand)

0 10000011 10100000000000000000000

Data format in the IEEE 754 standard
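The single-precision example can be reproduced with Python's standard `struct` module. This is a small illustrative sketch; `float32_fields` is a hypothetical helper name. In IEEE 754 single precision the stored exponent is the true exponent plus a bias of 127, so an exponent of 4 is stored as 131 = 10000011 in binary.

```python
import struct


def float32_fields(x: float) -> tuple:
    """Split the IEEE 754 single-precision encoding of x into
    (sign, exponent, fraction) bit strings of length 1, 8 and 23."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{raw:032b}"
    return bits[0], bits[1:9], bits[9:]


# 1.101 (binary) = 13/8 = 1.625, and 1.625 * 2**4 = 26.0
value = (0b1101 / 8) * 2**4
sign, exponent, fraction = float32_fields(value)
print(sign, exponent, fraction)
```

The 23 stored fraction bits hold only the digits after the binary point; the leading 1 of the significand is implicit.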

Bit-Serial Floating-Point Calculation

[Figure: bit-serial floating-point data, with the significand on one stream and the exponent and sign on another]

Two bit-serial data-paths are used for the calculation of significand and exponent.
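As a rough software model of what a bit-serial data-path does, the sketch below adds two numbers one bit per clock, keeping a single carry bit as state. It assumes LSB-first ordering and is not a description of the actual RSFQ circuit; all names are hypothetical.

```python
def to_lsb_first(value: int, width: int) -> list:
    """Return the bits of value as a list, least-significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_lsb_first(bits: list) -> int:
    """Inverse of to_lsb_first."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_add(a_bits: list, b_bits: list) -> tuple:
    """Add two LSB-first bit streams, one bit pair per clock.

    Returns (sum_bits, carry_out). The only state carried between
    clocks is the one-bit carry, as in a serial full adder.
    """
    carry = 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry


sum_bits, carry_out = bit_serial_add(to_lsb_first(25, 5), to_lsb_first(13, 5))
print(from_lsb_first(sum_bits), carry_out)
```

Because each clock needs only one full-adder cell, the hardware cost stays small while the word length sets the number of clocks per operation.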

Timing Parameters in Bit-Serial Calculation

[Figure: timing diagram of a bit-serial operation unit, showing clock and data waveforms versus time for Input 1, Input 2, Input 3 and Output 1, each marked "LS"]

Floating-Point Addition: Example

Block Diagram of Bit-Serial FPA

[Figure: block diagram of the bit-serial FPA. Inputs: significand of A, significand of B, exponent & sign of A, exponent & sign of B. Blocks: comparator of magnitude, shifter of A, shifter of B, buffer, separator circuit, normalizer, controller, normalizer & sign-and-exponent combine circuit. Outputs: significand of result, exponent & sign of result. Data signals and control signals are drawn with different line styles.]

(1) Align significands & rounding

(2) Addition (or subtraction)

(3) Normalization
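The three steps of the adder (align, add or subtract, normalize) can be sketched in software as follows. This is a simplified model under stated assumptions, not the chip's algorithm: operands are (sign, exponent, significand) triples with an unbiased exponent and an nf-bit integer significand whose top bit is set, alignment truncates instead of rounding, and special values are ignored. All names are hypothetical.

```python
NF = 11  # significand length in bits, including the leading 1 (half precision)


def to_float(x: tuple, nf: int = NF) -> float:
    """Value of a (sign, exponent, significand) triple."""
    sign, exp, sig = x
    return (-1) ** sign * sig * 2.0 ** (exp - (nf - 1))


def fp_add(a: tuple, b: tuple, nf: int = NF) -> tuple:
    # Magnitude comparison: make `a` the operand with the larger magnitude.
    if (a[1], a[2]) < (b[1], b[2]):
        a, b = b, a
    sign_a, exp_a, sig_a = a
    sign_b, exp_b, sig_b = b

    # (1) Align: shift the smaller significand right (truncating here).
    sig_b >>= exp_a - exp_b

    # (2) Add, or subtract when the signs differ.
    sig = sig_a + sig_b if sign_a == sign_b else sig_a - sig_b
    if sig == 0:
        return (0, 0, 0)

    # (3) Normalize so that the top bit of the nf-bit significand is set.
    exp = exp_a
    while sig >= (1 << nf):
        sig >>= 1
        exp += 1
    while sig < (1 << (nf - 1)):
        sig <<= 1
        exp -= 1
    return (sign_a, exp, sig)


twelve = (0, 3, 1536)   # 1.5 * 2**3
two = (0, 1, 1024)      # 1.0 * 2**1
print(to_float(fp_add(twelve, two)))
```

The magnitude comparison at the start plays the role of the comparator block: it guarantees that the subtraction in step (2) never produces a negative significand.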

Chip Photograph of Half-Precision FPA

CONNECT (in cooperation with SRL, NiCT, NU & YNU)

*nf: bit length of significand

DC Bias Margin of Each Component Circuit @ 20 GHz

Floating-Point Multiplier

The significand part is calculated by a systolic-array multiplier: Zf = Xf × Yf

The exponent part is calculated by a bit-serial adder: Ze = Xe + Ye

Data format: S (1 bit) | E (8 bits) | F (23 bits)

S: sign, E: exponent, F: fraction

Value = (-1)^S × F × 2^E

Systolic-Array Multiplier

- Composed of a 1D array of 1-bit processing elements (PEs).
- Small hardware cost: ∝ (bit length)
- High throughput: ~ 1/(bit length)

[Figure: systolic-array multiplier; labels: Input, SB, Output]

Chip Photograph of Half-Precision FPM

CONNECT (in cooperation with SRL, NiCT, NU & YNU)

*nf: bit length of significand

Test Result of FPM @ 25 GHz

Correct operation was confirmed at high speed.

[Figure: measured waveforms, labeled LSB to MSB]

FX:  11010110111   EX:  11001

FY:  11001010011   EY:  01101

FXY: 10101001110   EXY: 11000

Calculation of the exponent part: (10) + (-2) + 1 = 9, where the terms are EX, EY and the carry from the fraction part.

Maximum operating frequency: 31.5 GHz
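The test vector above can be checked in software. The sketch below rests on assumptions that are not stated on the slide: the printed bit strings are read MSB-first, the 11-bit significand includes the leading 1, the 5-bit exponent uses a bias of 15, and the product is truncated rather than rounded. The function name is hypothetical.

```python
NF = 11      # significand bits including the leading 1 (assumed)
BIAS = 15    # exponent bias for a 5-bit exponent (assumed)


def fp_mul_fields(fx: int, ex: int, fy: int, ey: int,
                  nf: int = NF, bias: int = BIAS) -> tuple:
    """Multiply two normalized operands given as raw fields.

    Returns (fz, ez, carry): truncated significand, biased exponent,
    and the carry from the fraction part into the exponent.
    """
    product = fx * fy                      # up to 2*nf bits
    carry = product >> (2 * nf - 1)        # 1 if the product is >= 2.0
    fz = product >> (nf - 1 + carry)       # renormalize and truncate to nf bits
    ez = ex + ey - bias + carry
    return fz, ez, carry


fz, ez, carry = fp_mul_fields(0b11010110111, 0b11001, 0b11001010011, 0b01101)
print(f"FXY = {fz:011b}, EXY = {ez:05b}, carry = {carry}")
# Unbiased exponents: (25 - 15) + (13 - 15) + 1 = 9, as on the slide.
```

Under these assumptions the computed fields agree with the FXY and EXY values listed in the test result.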

Summary of Half-Precision FPUs

                 Floating-point adder   Floating-point multiplier
# of JJs         11700                  11044
Size (mm^2)      6.76 x 4.96            6.22 x 3.78

Minimum interval (clocks): 12 (nf + 1)

Latency (clocks): 23 (2 nf + 1)

CONNECT (in cooperation with SRL, NiCT, NU & YNU)

nf: bit length of fraction part
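The interval and latency expressions in the summary can be evaluated for other word lengths. The sketch below simply applies nf + 1 and 2 nf + 1; extending them beyond half precision, and evaluating the result rate at 25 GHz, are assumptions made for illustration. The function names are hypothetical.

```python
def bit_serial_timing(nf: int) -> tuple:
    """Return (minimum interval, latency) in clocks for significand length nf,
    using the expressions nf + 1 and 2*nf + 1 from the summary."""
    return nf + 1, 2 * nf + 1


def results_per_second(nf: int, clock_hz: float) -> float:
    """Results per second when a new operand pair enters every minimum interval."""
    interval, _ = bit_serial_timing(nf)
    return clock_hz / interval


for name, nf in (("half", 11), ("single", 24), ("double", 53)):
    interval, latency = bit_serial_timing(nf)
    print(f"{name}: interval {interval} clocks, latency {latency} clocks")

print(f"{results_per_second(11, 25e9) / 1e9:.2f} G results/s at 25 GHz")
```

For nf = 11 this gives the 12-clock interval and 23-clock latency quoted above.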

NDRO-based and crossbar-based architectures of ORN

ORN requirements:
- 1-to-N connections, where N is an odd number
- connections to either input of the FPU

NDRO-based ORN
- "+": small number of Josephson junctions required
- "–": irregular, non-pipelined structure, which becomes cumbersome as the complexity increases

Crossbar-based ORN
- "+": scalable and pipelined; easily redesigned for any N and M
- "–": large number of Josephson junctions required

[Figure: ORN connected to M FPUs; labels: FPU, CBT]

Comparison of the ORN architectures

NDRO-based ORN

ORN complexity   Latency (ps)   Skew (ps)   Minimum interval   Control lines   Bias current (A)   Power (mW)   JJs
N=3, M=8         ~60            ~60         nf + 60 ps         96              0.6                1.5          ~5500
N=5, M=10        ~80            ~80         nf + 80 ps         200             0.9                2.25         ~8000
N=9, M=32        ~100           ~100        nf + 100 ps        1152            5.5                13.75        ~50500

Crossbar-based ORN

ORN complexity   Latency (clocks)   Skew (ps)   Minimum interval   Control lines   Bias current (A)   Power (mW)   JJs
N=3, M=8         6                  ~300        nf                 100             0.63               1.575        6230
N=5, M=10        10                 ~500        nf                 208             1.41               3.525        13930
N=9, M=32        18                 ~900        nf                 1168            8.28               20.7         77440

The JJ counts for the NDRO-based ORN in the table are estimates based on the switch designed for the RDP prototype (N=3, M=4), which consists of 2750 JJs and requires 300 mA of bias current (Iwasaki, not yet published).

A crossbar switch with broadcasting function: 296 JJs

Note that both ORNs require almost the same number of JJs if an isometric (equal-length wiring) network is employed in the NDRO-based ORN.
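A quick consistency check on the comparison tables: dividing each power figure by the corresponding bias current gives the same implied bias voltage in every row. The sketch below transcribes the table values; attributing the table with latency in ps to the NDRO-based ORN is a reading of the slide rather than something it states explicitly, and the dictionary layout is hypothetical.

```python
# (N, M) -> (bias current in A, power in mW, number of JJs), from the tables.
ORN_TABLES = {
    "NDRO-based": {
        (3, 8): (0.6, 1.5, 5500),
        (5, 10): (0.9, 2.25, 8000),
        (9, 32): (5.5, 13.75, 50500),
    },
    "crossbar-based": {
        (3, 8): (0.63, 1.575, 6230),
        (5, 10): (1.41, 3.525, 13930),
        (9, 32): (8.28, 20.7, 77440),
    },
}

implied_mv = []
for arch, rows in ORN_TABLES.items():
    for (n, m), (bias_a, power_mw, jj) in rows.items():
        voltage_mv = power_mw / bias_a     # mW / A = mV
        implied_mv.append(voltage_mv)
        print(f"{arch:15s} N={n}, M={m}: {voltage_mv:.2f} mV, {jj} JJs")

# Ratio of JJ counts between the two architectures at each complexity.
jj_ratio = {
    key: ORN_TABLES["crossbar-based"][key][2] / ORN_TABLES["NDRO-based"][key][2]
    for key in ORN_TABLES["NDRO-based"]
}
print(jj_ratio)
```

The power column therefore appears to be the bias current multiplied by a fixed supply voltage, so the power comparison between the two architectures follows directly from their bias currents.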

1-to-2 ORN test

[Figure: 1-to-2 ORN test circuit; labels include CBT0, CBT2, ladder, clkin_hf, dout01, dout11, dout02, dout12]

1-to-2 ORN: 2043 JJs, bias current 226 mA

Total test circuit: 3098 JJs, total bias current 359 mA

[Figure: bias_kern1 margins for din0 -> dout11 routing; upper and lower margins plotted on a vertical axis from -30 to 20, with horizontal-axis ticks from 10.842 to 23.480; bias lines labeled bias_kern0, bias_kern1, bias_kern2]

Example: open466, no. 4 chip F2

Completely functional in an exhaustive test. Bias margins:

- bias_kern0 = -14.6 / 5.3 % (does not depend on the pattern)
- bias_kern1 = -16.1 / 18.3 % for din0 -> dout11, dout12
- bias_kern2 = -20.7 / 12.6 % for din0 -> dout11, dout12 (minimum)
- bias_kern1 = -40.3 / 17.2 % for din1 -> dout01
- bias_kern2 = -38 / 12.6 % for din2 -> dout02, dout12 (maximum)

[Figure: low-frequency test waveforms; labels include cross10, cross11, cross01, bar00, bar02, bar12, dout01, dout11, dout02, dout12, clkout, clkout1, clkout2]

Example of the low-frequency test: din0 -> dout01, dout02, dout12

Frequency dependence of the bias margins: din0 -> dout11

Design of 2x2 SFQ-RDP

[Figure: chip layout showing the input SR, ALUs, ALU controller, ORN, buffers and output SR; scale bar 1 mm]

• 11 pipeline stages
• Designed frequency: 25 GHz
• InSR & OutSR length: 16 bits
• Data length: 7 bits
• Bias current: 1.27 A
• Circuit area: 5.90 x 3.68 mm^2
• 10839 JJs

Demonstration of 2x2 SFQ-RDP

Frequency characteristic of RDP

[Figure: input patterns and output patterns]

Maximum operating frequency: 23 GHz

The function for each ALU is chosen as shown above.

Device Structure of Nb 10-layer Fabrication Process

Layout

■ DCP (M1)
■ Bias pillar (C1, 2, 3, 4, 5, 6, GC): 5 x 5 μm^2
■ 6-layer moat (M2, 3, 4, 5, 6, 7)
□ PTL (M3, 5): width 4.8 to 5.5 μm
□ Via of PTLs: less than 12 x 12 μm^2

[Layout dimension label: 30 μm]

Maximum current value: 2.4 mA (limited by the size of the contacts)

Cell library

Cells: DC/SFQ, SFQ/DC, CBE, D2FF, DFF, JAND, JANDF, JNOR, JNOT, JOR, RTFFB, SPL3

Jc: 10 kA/cm^2

Design of Bit-Serial Half Adder using a New Cell Library

Jc: 10 kA/cm2

Logic simulation results of bit-serial half adder

[Figure: test circuit consisting of a clock generator, a shift register for input, a bit-serial adder, and a shift register for output]

On-Chip High-Speed Test Results of Bit-Serial Half Adder

Jc: 10 kA/cm2

Road Map of RSFQ LSRDP Processor

[Timeline: 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014 -]

Processes:
- 2.5 kA/cm^2 process
- 10 kA/cm^2 process
- 40 kA/cm^2 process

Milestones:
- 25 GHz FPU/RDP
- 60 GHz FPU & LSRDP prototype
- 100 GHz FPU & LSRDP prototype
- 10 TFLOPS LSRDP system development

Summary

Our target is to establish a fundamental technology for high-end supercomputers based on the large-scale reconfigurable data-path (LSRDP) architecture.

Some key components were designed and implemented using the standard Nb process, and their correct operation was demonstrated:
- Half-precision RSFQ FPA and FPM
- Operand routing network (ORN)
- 2 x 2 RDP

The structure of the SRL advanced II process was determined, and a new cell library is under development. 85 GHz operation of a bit-serial half adder was demonstrated.
