Using Clos Switches in Area Efficient Asynchronous SDM...

Preview:

Citation preview

Using Clos Switches in Area Efficient

Asynchronous SDM Routers

Wei Song and Doug EdwardsAdvanced Processor Technologies Group

The University of Manchester

Contents

• Asynchronous Network-on-Chip

• Spatial division multiplexing (SDM)

• 2-stage Clos switch

– Motivation

– Clos

– 2-stage Clos switch

• SDM router using Clos switches

– Structure

– Performance

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

2

Asynchronous Network-on-Chip

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

3

ProcessorCache/Memory

IPs/Function Blocks

Router

Processing Element

PE

Router

PE

Router

PE

Router

PE

Router

PE

Router

Network

Interface

Network-on-Chip (NoC) or on-chip network is the state-of-the-art on-chip communication structure for multiprocessor systems.

GALS:Globally asynchronous and locally synchronous

Basic Router Structure

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

4

Buffer

Buffer

Buffer

SwitchAllocator

Ain

Bin

Cin

Aout

Bout

Cout

HA DA DA TA

HB DB DB TB

HA DA DA TA HB DB DB TBtime

1 2 3 4 5 6 7 8 9

Ain

Bin

Cout

Flow control:The algorithm used to allocate resources in a router to multiple frames.

Wormhole:Frames are divided into flits. The header flit contains the target addr. and it is used to reserve a path. Other flits just follow the path.

Head-of-Line (HOL)

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

5

N N

WW E

SS

E

R1 R2

A

B

Frame AFrame BFrame C

C

The spread of blocking:Assuming frame B is blocked by frame A, frame B may also block frame C; however, C is not directly related to A.

Spatial Division Multiplexing

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

6

SwitchAllocator

Input ports

Output ports

A virtual circuit

The output buffer ofa virtual circuitThe input buffer of

a virtual circuit

SDM vs. Wormhole

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

7

0 50 100 150 200 250 300 3500

100

200

300

400

500

600

700

800

900

1000

Avg

. F

ram

e L

ate

ncy (

ns)

Injected Traffic(MByte/Node/s)

Wormhole

SDM

WH SDM

Input Buf. 14,303 21,995

Output Buf. 5,935 6,000

Crossbar 4,356 21,744

Arbiters 772 22,208

Overall 25,366 71,956

The area of crossbarsWH:

P*P*WSDM:

MP*MP*W/M = MP*P*W

[7] W. Song and D. Edwards. “Asynchronous spatial division multiplexing router,”

Microprocessors and Microsystems, 35(2), 85-97, 2011

Clos Switch - Motivation

• The problems of SDM

– High-radix crossbars

– Large crossbar and switch allocator

• Clos networks are the optimal switch

structure

• Problems to solve

– Dynamic configuration [11]

– Optimal structure for SDM router (this paper)

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

8

Clos Networks

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

9

n×m k×k m×n

n×m

n×m

IM(1)

IM(i)

IM(k)

CM(1)

k×k

CM(r)

k×k

CM(m)

OM(1)

m×n

OM(j)

m×n

OM(k)

IP(1,1)

IP(1,n)

IP(k,1)

IP(k,n)

IP(i,1)

IP(i,n)

IP(i,h)

OP(1,1)

OP(1,n)

OP(k,1)

OP(k,n)

OP(j,1)

OP(j,n)

OP(j,h)

LI(1,1)

LI(k,m)

LO(1,1)

LO(m,k)

IP/OP: input/output portIM: input moduleCM: central moduleOM: output modulen: number of IPs in IMk: number of IMsm: number of CMsN = kn: the total number of IPs

When m >= n, the switch is no-blocking

Clos vs. Crossbar

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

10

Asynchronous Clos Scheduler

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

11

IRG

IRG

IRG

IRG

OMRICB

CMRICB

IMD

OMRCCB

CMDimcfg

cmcfg

IMSCHi CMSCHr

CMSCH4IMSCH8

IMSCH1 CMSCH1

OMSCH8

OMSCHj

OMSCH1req1,1

req1,2

req1,3

req1,4

reqi,1reqi,2reqi,3reqi,4

req8,1req8,2req8,3req8,4

[11] W. Song and D. Edwards. “An asynchronous routing algorithm for Clos

networks,” In Proc. of International Conference on Application of Concurrency to

System Design, 2010, 67-76

Async vs. Sync Algorithm

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

12

0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75

0.34

0.36

0.38

0.40

0.42

0.44

0.46

0.48

0.50

0.52T

hro

ug

hp

ut

Injected Traffic

Synchronous Scheduler

Asynchronous Scheduler

[11] W. Song and D. Edwards. “An asynchronous routing algorithm for Clos

networks,” In Proc. of International Conference on Application of Concurrency to

System Design, 2010, 67-76

2-stage Clos Switch

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

13

M×M5×5

IM(s) CM(1)

5×5

CM(r)

5×5

CM(M)

IP(s,1)

IP(s,M)

OP(s,1)

OP(s,M)

M×M

IM(w)IP(w,1)

IP(w,M)

M×M

IM(n)IP(n,1)

IP(n,M)

M×M

IM(e)IP(e,1)

IP(e,M)

M×M

IM(l)IP(l,1)

IP(l,M)

OP(w,1)

OP(w,M)

OP(n,1)

OP(n,M)

OP(e,1)

OP(e,M)

OP(l,1)

OP(l,M)

Benefits:1. 2-stage1.a less latency1.b smaller1.c simpler scheduler2. CMs can be further

simplified

Simplified CM

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

14

Si

Wi

Ni

Ei

Li

So Wo No Eo Lo

cfg EL

cfg NL

cfg WL

cfg SL

cfg LE

cfg WE

cfg WN

cfg SN

cfg EN

cfg LN

cfg LW

cfg EW

cfg ES

cfg LS

cfg NS

cfg WS

Clos vs. Crossbar

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

15

2-stage Clos Scheduler

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

16

IRG0

IRGi

IRGM-1

CMRICB

IMD

CMD

imcfg

IMSCHN

CMSCHr

CMSCHM-1IMSCHL

IMSCHS CMSCH0

rt_rS,0

rt_rS,i

rt_rS,M-1

rt_rN,M-1

rt_rN,0

rt_rL,M-1

rt_rL,0

IRG

IRG

IRG

IRG

OMRICB

CMRICB

IMD

OMRCCB

CMDimcfg

cmcfg

IMSCHi CMSCHr

CMSCH4IMSCH8

IMSCH1 CMSCH1

OMSCH8

OMSCHj

OMSCH1req1,1

req1,2

req1,3

req1,4

reqi,1reqi,2reqi,3reqi,4

req8,1req8,2req8,3req8,4

A New SDM-Clos Router

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

17

M×M5×5

M×M

M×M

M×M

M×M

5×5

5×5

Clos Scheduler

South In

West In

North In

East In

Local In

South Out

West Out

North Out

East Out

Local Out

Input buffers Output buffers2-stage Clos

Area Breakdown

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

18

WH: 1 SDM: 3.9 SDM-Clos: 1.7

Speed Performance

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

19

Evaluation: MPEG-4

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

20

Video Output

SDRAM0

AudioDSP

AudioOuput

SDRAM1

Unsampling MCE

Padding

MediaCPU

SDRAM2

BABScaling context calc.

3D GFX rasteri-zation

iScanAC/DCiQuantiDCT

RISCCPU

190

0.5

60

60040

40

500 250

173

670

32

910

0.5R

PE(0,1)

R R

PE(0,2)

PE(0,0)

R

PE(1,1)

R R

PE(1,2)

PE(1,0)

R

PE(2,1)

R R

PE(2,2)

PE(2,0)

R

PE(3,1)

R R

PE(3,2)

PE(3,0)

9595

425.2595

896.5896.5

0.250.25

330.25

0

335

386

400.25

5050

3700

050

0801.25

796.5796.5

320320

3200 250

695

0320

0.250.25

0.25471.25

790790

4710

7900

102.5102.5

16

335

102.5546.5

125125

1250

125

0

250250

18901280 500

3958

4040

840 1330

31922432.5741

94335861040.5

Network Performance

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

21

Throughput requirement: ~3,400 Mbyte/s

SDM-Clos: small latency overhead; low energy consumption; half of the area

Conclusion

• SDM improves throughput

• Clos switch can reduce the area overhead

• A new 2-stage Clos switch

– Half area (4 virtual circuits)

– Small latency overhead

– Less energy consumption

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

22

Thanks

http://opencores.org/project,async_sdm_noc

03/07/2011Advanced Processor Technologies Group

The School of Computer Science

23

Recommended