Using Clos Switches in Area Efficient
Asynchronous SDM Routers
Wei Song and Doug Edwards
Advanced Processor Technologies Group
The University of Manchester
Contents
• Asynchronous Network-on-Chip
• Spatial division multiplexing (SDM)
• 2-stage Clos switch
– Motivation
– Clos
– 2-stage Clos switch
• SDM router using Clos switches
– Structure
– Performance
03/07/2011
Advanced Processor Technologies Group
The School of Computer Science
Asynchronous Network-on-Chip
[Figure: a network-on-chip built from routers. Each router serves a processing element (PE) through a network interface. Labels inside a PE: processor, cache/memory, IPs/function blocks.]
A network-on-chip (NoC), or on-chip network, is the state-of-the-art on-chip communication structure for multiprocessor systems.
GALS: globally asynchronous, locally synchronous
Basic Router Structure
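[Figure: a router with buffered input ports Ain, Bin and Cin, a switch, an allocator, and output ports Aout, Bout and Cout. Timing example over time steps 1 to 9: Ain carries HA DA DA TA, Bin carries HB DB DB TB, and Cout carries HA DA DA TA HB DB DB TB.]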
Flow control: the algorithm used to allocate the resources of a router among multiple frames.
Wormhole: frames are divided into flits. The header flit contains the target address and is used to reserve a path; the other flits simply follow that path.
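A minimal Python sketch of the wormhole behaviour described above, assuming two frames that both request the same output port and an allocator that serves the first port with a flit waiting. The function names and the first-come arbitration order are illustrative assumptions, not taken from the router design.

```python
from collections import deque

def split_into_flits(name, data_flits):
    """Build the flit labels of one frame: header, data flits, tail."""
    return [f"H{name}"] + [f"D{name}"] * data_flits + [f"T{name}"]

def wormhole_output(inputs):
    """Serialise frames that compete for one output port.

    inputs maps an input port name to its list of flits. One flit leaves
    per step. The header reserves the output and the tail releases it,
    so flits of different frames are never interleaved.
    """
    queues = {port: deque(flits) for port, flits in inputs.items()}
    owner = None
    timeline = []
    while any(queues.values()):
        if owner is None:
            # Assumed arbitration: first port (in dict order) with a flit waiting
            owner = next(port for port, q in queues.items() if q)
        flit = queues[owner].popleft()
        timeline.append(flit)
        if flit.startswith("T"):
            owner = None  # the tail flit releases the reserved path
    return timeline

frame_a = split_into_flits("A", 2)
frame_b = split_into_flits("B", 2)
cout = wormhole_output({"Ain": frame_a, "Bin": frame_b})
print(" ".join(cout))
```

Frame B has to wait until the tail of frame A has passed, which is the serialisation shown on Cout in the timing example.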
Head-of-Line (HOL)
[Figure: two routers, R1 and R2, each with N, W, E and S ports, and the paths of frames A, B and C through them.]
The spread of blocking: if frame B is blocked by frame A, frame B may in turn block frame C, even though C is not directly related to A.
Spatial Division Multiplexing
[Figure: an SDM router with input ports, output ports, a switch and an allocator. Each virtual circuit has its own input buffer and its own output buffer.]
SDM vs. Wormhole
[Figure: average frame latency (ns, 0 to 1000) against injected traffic (MByte/Node/s, 0 to 350) for the wormhole and SDM routers.]
Component      WH       SDM
Input Buf.     14,303   21,995
Output Buf.     5,935    6,000
Crossbar        4,356   21,744
Arbiters          772   22,208
Overall        25,366   71,956
The area of crossbars
WH: P*P*W
SDM: MP*MP*W/M = MP*P*W
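The two crossbar formulas can be checked numerically. The sketch below assumes that P is the number of router ports, W the link width in bits and M the number of virtual circuits per port. The slide does not define these symbols, and the values P = 5 and W = 32 are hypothetical.

```python
def wormhole_crossbar_area(ports, width):
    """P*P*W: a P x P crossbar in which every crosspoint switches W bits."""
    return ports * ports * width

def sdm_crossbar_area(ports, width, circuits):
    """MP*MP*W/M: an MP x MP crossbar whose crosspoints switch W/M bits."""
    radix = circuits * ports
    # (M*P)^2 * W is always divisible by M, so integer division is exact
    return radix * radix * width // circuits

P, W = 5, 32  # hypothetical: 5 ports, 32-bit links
for M in (1, 2, 4, 8):
    wh = wormhole_crossbar_area(P, W)
    sdm = sdm_crossbar_area(P, W, M)
    assert sdm == M * P * P * W  # the simplified form MP*P*W
    print(f"M={M}: WH={wh}  SDM={sdm}  SDM/WH={sdm // wh}")
```

Under these assumptions the SDM crossbar is M times larger than the wormhole crossbar, which is why the crossbar dominates once several virtual circuits are used.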
[7] W. Song and D. Edwards, “Asynchronous spatial division multiplexing router,” Microprocessors and Microsystems, 35(2), 85-97, 2011.
Clos Switch - Motivation
• The problems of SDM
– High-radix crossbars
– Large crossbar and switch allocator
• Clos networks are the optimal switch structure
• Problems to solve
– Dynamic configuration [11]
– Optimal structure for SDM router (this paper)
Clos Networks
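[Figure: a three-stage Clos network. First stage: input modules IM(1) to IM(k), each n×m, with input ports IP(i,1) to IP(i,n). Second stage: central modules CM(1) to CM(m), each k×k. Third stage: output modules OM(1) to OM(k), each m×n, with output ports OP(j,1) to OP(j,n). Links LI(1,1) to LI(k,m) connect IMs to CMs; links LO(1,1) to LO(m,k) connect CMs to OMs.]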
IP/OP: input/output port
IM: input module
CM: central module
OM: output module
n: number of IPs in an IM
k: number of IMs
m: number of CMs
N = kn: the total number of IPs
When m >= n, the switch is rearrangeably non-blocking
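As a rough illustration of the definitions above, the following Python sketch counts crosspoints in a three-stage Clos network and in a single N×N crossbar, and checks the m >= n condition (the rearrangeable non-blocking condition). The values n = 4, k = 8 and m = 4 are hypothetical, and counting crosspoints is only a first-order proxy for area.

```python
def clos_crosspoints(n, k, m):
    """Crosspoints in a 3-stage Clos network.

    k input modules of size n x m, m central modules of size k x k,
    and k output modules of size m x n.
    """
    return k * n * m + m * k * k + k * m * n

def crossbar_crosspoints(n, k):
    """Crosspoints in one N x N crossbar with N = k*n ports."""
    total_ports = k * n
    return total_ports * total_ports

def satisfies_m_ge_n(n, m):
    """The condition quoted on the slide: m >= n."""
    return m >= n

n, k, m = 4, 8, 4  # hypothetical sizes: N = 32 ports
print("non-blocking condition met:", satisfies_m_ge_n(n, m))
print("Clos crosspoints:    ", clos_crosspoints(n, k, m))
print("crossbar crosspoints:", crossbar_crosspoints(n, k))
```

With these sizes the Clos network needs 512 crosspoints against 1024 for the crossbar. The gap widens as N grows.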
Clos vs. Crossbar
Asynchronous Clos Scheduler
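[Figure: the asynchronous Clos scheduler, built from input module schedulers IMSCH1 to IMSCH8, central module schedulers CMSCH1 to CMSCH4 and output module schedulers OMSCH1 to OMSCH8. Each IMSCHi receives requests reqi,1 to reqi,4. Other labels: IRG, OMRICB, CMRICB, OMRCCB, IMD, CMD, imcfg, cmcfg.]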
[11] W. Song and D. Edwards, “An asynchronous routing algorithm for Clos networks,” in Proc. of International Conference on Application of Concurrency to System Design, 2010, 67-76.
Async vs. Sync Algorithm
[Figure: throughput (0.34 to 0.52) against injected traffic (0.30 to 0.75) for the synchronous scheduler and the asynchronous scheduler. See [11].]
2-stage Clos Switch
[Figure: the 2-stage Clos switch. First stage: five M×M input modules IM(s), IM(w), IM(n), IM(e) and IM(l), each with input ports IP(x,1) to IP(x,M). Second stage: M central modules CM(1) to CM(M), each 5×5, driving output ports OP(s,1) to OP(s,M), OP(w,1) to OP(w,M), OP(n,1) to OP(n,M), OP(e,1) to OP(e,M) and OP(l,1) to OP(l,M).]
Benefits:
1. 2-stage
  1.a lower latency
  1.b smaller
  1.c simpler scheduler
2. CMs can be further simplified
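To give a feel for the "smaller" claim, the sketch below counts crosspoints in the 2-stage structure (five M×M input modules and M central modules of size 5×5) and in a flat 5M×5M crossbar. It counts crosspoints only and ignores buffers, schedulers and wiring, so it is a first-order estimate under those assumptions rather than the area model used by the authors.

```python
DIRECTIONS = ("s", "w", "n", "e", "l")  # south, west, north, east, local

def two_stage_clos_crosspoints(m):
    """Five M x M input modules plus M central modules of size 5 x 5."""
    p = len(DIRECTIONS)
    input_modules = p * m * m
    central_modules = m * p * p
    return input_modules + central_modules

def flat_crossbar_crosspoints(m):
    """A single crossbar connecting all 5*M virtual circuits."""
    radix = len(DIRECTIONS) * m
    return radix * radix

for m in (2, 4, 8):
    clos = two_stage_clos_crosspoints(m)
    flat = flat_crossbar_crosspoints(m)
    print(f"M={m}: 2-stage Clos={clos}  crossbar={flat}  ratio={clos / flat:.2f}")
```

For M = 4 this gives 180 crosspoints against 400, and the ratio keeps falling as M increases.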
Simplified CM
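[Figure: the simplified CM, with inputs Si, Wi, Ni, Ei, Li and outputs So, Wo, No, Eo, Lo. Configuration signals: cfg EL, cfg NL, cfg WL, cfg SL, cfg LE, cfg WE, cfg WN, cfg SN, cfg EN, cfg LN, cfg LW, cfg EW, cfg ES, cfg LS, cfg NS, cfg WS.]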
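The 16 configuration signals are consistent with a 5×5 switch that drops U-turns and the turns forbidden by dimension-order (XY) routing. That reading is an inference: the slide does not name the routing algorithm, and the sketch assumes that "cfg XY" connects input X to output Y. The Python check below enumerates the turns allowed under those assumptions and compares them with the signal names on the slide.

```python
PORTS = "SWNEL"  # south, west, north, east, local

# Configuration signal names as listed on the slide
SLIDE_CFG = {
    "EL", "NL", "WL", "SL", "LE", "WE", "WN", "SN",
    "EN", "LN", "LW", "EW", "ES", "LS", "NS", "WS",
}

def allowed_turns():
    """Input/output pairs allowed under the assumed XY routing rules."""
    turns = set()
    for src in PORTS:
        for dst in PORTS:
            if src == dst:
                continue  # no U-turn back to the port the frame came from
            if src in "NS" and dst in "EW":
                continue  # assumed XY rule: no turn from the Y to the X dimension
            turns.add(src + dst)
    return turns

turns = allowed_turns()
print(len(turns), "crosspoints instead of", len(PORTS) ** 2)
print("matches slide:", turns == SLIDE_CFG)
```

Under these assumptions each CM needs 16 crosspoints rather than the 25 of a full 5×5 crossbar.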
Clos vs. Crossbar
2-stage Clos Scheduler
[Figure: the 2-stage Clos scheduler, built from input module schedulers IMSCHS to IMSCHL and central module schedulers CMSCH0 to CMSCHM-1. Routing requests rt_rS,0 to rt_rS,M-1 through rt_rL,0 to rt_rL,M-1 enter via IRG0 to IRGM-1. Other labels: CMRICB, IMD, CMD, imcfg. The 3-stage asynchronous Clos scheduler (IMSCH1 to IMSCH8, CMSCH1 to CMSCH4, OMSCH1 to OMSCH8) is repeated alongside.]
A New SDM-Clos Router
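[Figure: the SDM-Clos router. Input buffers on South In, West In, North In, East In and Local In feed a 2-stage Clos switch (five M×M modules followed by 5×5 modules) controlled by the Clos scheduler. Output buffers drive South Out, West Out, North Out, East Out and Local Out.]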
Area Breakdown
WH: 1, SDM: 3.9, SDM-Clos: 1.7
Speed Performance
Evaluation: MPEG-4
[Figure: the MPEG-4 application and its mapping onto a 4×3 array of routers (R) and processing elements PE(0,0) to PE(3,2). Block labels: Video Output, SDRAM0, Audio DSP, Audio Output, SDRAM1, Upsampling, MCE, Padding, Media CPU, SDRAM2, BAB, Scaling, context calc., 3D GFX rasterization, iScan, AC/DC, iQuant, iDCT, RISC CPU.]
Network Performance
Throughput requirement: ~3,400 MByte/s
SDM-Clos: small latency overhead; low energy consumption; half of the area
Conclusion
• SDM improves throughput
• Clos switch can reduce the area overhead
• A new 2-stage Clos switch
– Half area (4 virtual circuits)
– Small latency overhead
– Less energy consumption
Thanks
http://opencores.org/project,async_sdm_noc