
Mapping Stream based Applications to an Intel IXP Network Processor using Compaan

Sjoerd Meijer, Bart Kienhuis, Johan Walters, David Snuijf

University Leiden, LIACS


2

Outline

• Need for multi-processor platforms

• Problem: how to program them

• Solution: compiler support

• Realization: the IMCA back-end

• Conclusion

3

Problem description

• Ferocious appetite for more embedded computation power

[Figure: Required embedded computation power in billion MAC/s (axis 0 to 2500) per year, 2000 to 2006. Application demand (2.5G, 3G Wireless/WCDMA, Voice over IP, Video over IP, MPEG4, HDTV, future broadband standards) outgrows general-purpose DSP/CPU performance: an increasing gap. Source: TI, Xilinx. 1 MAC = 8-bit multiply-accumulate.]

Solution: A need for Multi Processor Systems

4

Problem description – cont.

• Moore’s Law:

– More transistors

• Tooling:

– More lines of code

• Productivity Gap

[Figure: Productivity gap, 1981 to 2009. Logic transistors per chip (K) grow at a 58% per year compound complexity growth rate; productivity in lines of code per man-month (LoC/MM) grows at an 8–10% per year compound rate. The widening difference between the two curves is the productivity gap.]

We know how to build multiprocessor systems, but not how to program them efficiently

5

Problem Description

for( int t=1; t<=P; t++){
  for( int i=1; i<=M; i++ ){
    for( int j=4; j<=N; j++ ){
      r1[i+1][j-3] = F1(...); //stm1
    }
  }
  for( int l=3; l<=M; l++ ){
    for( int m=3; m<=N-1; m++ ){
      if ( l+m<= 7 ){
        r2[l][m] = F2( r1[l-1][m-2] ); //stm2
      }
      if ( l+m>=8 ){
        r2[l][m] = F3( r1[l][N-3] ); //stm3
      }
      ... = F4( r2[l][m] ); //stm4
    }
  }
}

[Figure: The loop nest redrawn as a process network. Nodes F1, F2, F3, and F4 are connected by FIFO1 to FIFO4; producers write with Put() and consumers read with Get().]

• Sequentially ordered

• Storage arrays (R1) are located in global memory

• Processes run autonomously

• Communicate via unbounded FIFOs

We need tools to convert the sequential program to a parallel equivalent
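Replacing the r1 array by FIFOs requires knowing, for every value that stm2 and stm3 read, which iteration of stm1 wrote it. The sketch below checks this for the loop nest on this slide by plain enumeration; a real tool would derive the relation symbolically. The names `DependenceSummary`, `written_by_stm1`, and `analyse_dependences` are hypothetical and only serve the illustration. Solving the index expressions by hand gives the same answer: stm2 at (l, m) reads the value written by stm1 at (i, j) = (l-2, m+1), and stm3 at (l, m) reads the value written at (i, j) = (l-1, N).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary type: how many r1 elements stm2 and stm3 read,
   and whether every read has a producing iteration of stm1. */
typedef struct {
    int  stm2_reads;   /* reads of r1[l-1][m-2], taken when l+m <= 7 */
    int  stm3_reads;   /* reads of r1[l][N-3],   taken when l+m >= 8 */
    bool all_covered;  /* true if every read element is written by stm1 */
} DependenceSummary;

/* stm1 writes r1[i+1][j-3] for 1 <= i <= M and 4 <= j <= N.
   Invert the index expressions and test the loop bounds. */
static bool written_by_stm1(int row, int col, int M, int N)
{
    int i = row - 1;   /* from row = i + 1 */
    int j = col + 3;   /* from col = j - 3 */
    return i >= 1 && i <= M && j >= 4 && j <= N;
}

/* Enumerate the consumer loop nest for one value of t. */
DependenceSummary analyse_dependences(int M, int N)
{
    DependenceSummary s = {0, 0, true};
    assert(M >= 1 && N >= 4);
    for (int l = 3; l <= M; l++) {
        for (int m = 3; m <= N - 1; m++) {
            if (l + m <= 7) {   /* stm2 */
                s.stm2_reads++;
                s.all_covered = s.all_covered &&
                                written_by_stm1(l - 1, m - 2, M, N);
            }
            if (l + m >= 8) {   /* stm3 */
                s.stm3_reads++;
                s.all_covered = s.all_covered &&
                                written_by_stm1(l, N - 3, M, N);
            }
        }
    }
    return s;
}
```

Calling `analyse_dependences(6, 6)` visits 12 consumer iterations, of which 3 take the stm2 branch and 9 the stm3 branch, and every read is covered by a stm1 write.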

6

Our Solution

for j = 1:1:5,
  for i = j:1:5,
    [r(j,i)] = ReadMatrix_Zeros_64x64();
  end
end
for k = 1:1:6,
  for j = 1:1:5,
    [r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
    for i = j+1:1:5,
      [r(j,i), x(k,i), t] = Rotate( r(j,i), x(k,i), t );
    end
  end
end

[Figure: A translator turns the sequential application specification (easy to specify, difficult to map) into a parallel application specification with processes P1 to P5 (difficult to specify, easy to map). The translator is COMPAAN; the mapping targets shown are FPGA and IMCA.]
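To get a feel for the amount of work hidden in the nested loops of the QR specification on this slide, the same loop bounds can be written in C with counters in place of the function calls. This is only a sketch: `CallCounts` and `count_qr_calls` are hypothetical names, and the parameters `n` and `k_max` stand for the constants 5 and 6 in the original loops.

```c
#include <assert.h>

/* Hypothetical counter record: one field per function in the loop nest. */
typedef struct {
    int read_matrix;  /* calls to ReadMatrix_Zeros_64x64 */
    int vectorize;    /* calls to Vectorize */
    int rotate;       /* calls to Rotate */
} CallCounts;

/* Same loop structure as the specification, with counters as loop bodies. */
CallCounts count_qr_calls(int n, int k_max)
{
    CallCounts c = {0, 0, 0};
    assert(n >= 1 && k_max >= 1);

    for (int j = 1; j <= n; j++)
        for (int i = j; i <= n; i++)
            c.read_matrix++;              /* upper triangle of r */

    for (int k = 1; k <= k_max; k++) {
        for (int j = 1; j <= n; j++) {
            c.vectorize++;                /* diagonal element r(j,j) */
            for (int i = j + 1; i <= n; i++)
                c.rotate++;               /* elements right of the diagonal */
        }
    }
    return c;
}
```

For the bounds on the slide, `count_qr_calls(5, 6)` gives 15 ReadMatrix calls, 30 Vectorize calls, and 60 Rotate calls.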

7

Questions

• Q1: Can we use the IXP architecture for stream-based applications?

• Q2: Can we map applications written as a KPN onto the IXP?

• Q3: Can we program the IXP using Compaan?

8

Intel IXP2400 Network Processor

• Optimized for streaming

– 2.6 Gbit ethernet connection

• Built to operate in real time on internet traffic

– 8 microengines, each with 8 hardware-supported threads

• Completely programmable in C

– An SDK is available for programming

9

IXP2400

• 8 microengines

• 1 XScale core

• MSF interface

• SRAM: 128 MB

• DRAM: 1 GB

• Scratchpad: 4 K

[Figure: IXP2400 block diagram. Eight microengines in two clusters (ME 0:0 to ME 0:3 and ME 1:0 to ME 1:3), the Intel XScale core, the Media Switch Fabric interface with RBUF and TBUF, scratchpad memory, CAP, hash unit, SRAM controllers 0 and 1, DRAM controller 0, and PCI. Attached outside the chip: external media device(s), optional host CPU and PCI bus devices, SRAM, and DRAM.]

10

Microengine

• Local memory: 640 long words

• Registers

– GPR

– Read Xfer

– Write Xfer

– Next neighbour

• Instruction store: 4 K

• 8 threads

[Figure: Microengine block diagram. 4K instruction store; 640 long words of local memory (LM addr 0 and LM addr 1); two banks of 128 GPRs (A bank and B bank); 128 next-neighbour registers; 128 SRAM and 128 DRAM read Xfer registers; 128 SRAM and 128 DRAM write Xfer registers; 16-entry CAM; CRC unit with CRC remainder; local CSRs; execution datapath (shift, add, subtract, multiply, find first bit set) fed by the A and B operands and immediates. Inputs: S-Push bus, D-Push bus, from next neighbour. Outputs: S-Pull bus, D-Pull bus, to next neighbour.]

11

Intel IXP Network Processor

• Not used on a large scale

• Difficult to program:

– Write code per microengine

– Infrastructure

– Synchronization

– Non-unified complex memory model

Conclusion: easier programming needed

12

Static Affine Nested Loop Programs

• Nested loop: all statements occur within loops

• Static: control flow known at compile time

• Affine: ax+b expressions

• Explicit array references are required to extract data-level parallelism in the application
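The three properties above can be illustrated with two small C functions. This is a sketch with made-up names (`sum_affine`, `sum_indirect`, `SIZE`): the first loop has a constant trip count and an affine index expression, so its memory accesses are known at compile time; the second has a data-dependent exit condition and an indirect index, so it falls outside the class.

```c
#include <assert.h>

#define SIZE 8   /* assumed compile-time loop bound */

/* Inside the class: static control flow (fixed bounds) and an affine
   index expression 2*i + 1. The array must hold at least 2*SIZE elements. */
int sum_affine(const int *a)
{
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
        sum += a[2 * i + 1];
    return sum;
}

/* Outside the class: the loop stops on a data value, and a[idx[i]] is an
   indirect (non-affine) access that cannot be resolved at compile time. */
int sum_indirect(const int *a, const int *idx, int n)
{
    int sum = 0;
    int i = 0;
    assert(n >= 0);
    while (i < n && a[idx[i]] >= 0) {
        sum += a[idx[i]];
        i++;
    }
    return sum;
}
```

With `a[i] = i` for 16 elements, `sum_affine(a)` adds the odd-indexed elements 1, 3, ..., 15 and returns 64.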

13

Kahn Process Network (KPN)

• Process Networks

– Processes run autonomously

– Communicate via unbounded FIFOs

– Synchronize via blocking read

• Process is either

– executing (Execute)

– communicating (Put/Get)

• BENEFITS:

– Deterministic Behavior

– Distributed Control

– Distributed Memory

[Figure: Three processes A, B, and C, executing functions Fa, Fb, and Fc and connected by FIFOs. Each process repeats a cycle of Get, Execute, and Put.]
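The determinism benefit can be made concrete with a small single-threaded simulation. The sketch below is an illustration under stated assumptions, not code from the tool flow: the FIFO is bounded (`FIFO_CAPACITY` is an arbitrary choice standing in for the unbounded Kahn FIFO), a blocked Put or Get is modelled by returning `false` so the process retries on its next turn, and all names are hypothetical. A producer puts the tokens 1..n and a consumer sums their squares; the result does not depend on which process is scheduled first.

```c
#include <assert.h>
#include <stdbool.h>

#define FIFO_CAPACITY 4   /* assumed bound; a Kahn FIFO is unbounded */

typedef struct {
    int data[FIFO_CAPACITY];
    int head;    /* index of the oldest token */
    int count;   /* number of tokens stored */
} Fifo;

/* Put: returns false when the FIFO is full, so the caller retries later. */
static bool fifo_put(Fifo *f, int v)
{
    if (f->count == FIFO_CAPACITY) return false;
    f->data[(f->head + f->count) % FIFO_CAPACITY] = v;
    f->count++;
    return true;
}

/* Get: returns false when the FIFO is empty; this models the blocking read. */
static bool fifo_get(Fifo *f, int *v)
{
    if (f->count == 0) return false;
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return true;
}

/* Runs the two-process network until the consumer has seen n tokens.
   consumer_first selects which process gets the first turn in each round. */
int run_network(int n, bool consumer_first)
{
    Fifo f = {{0}, 0, 0};
    int next = 1, received = 0, sum = 0, token;
    assert(n >= 0);
    while (received < n) {
        for (int turn = 0; turn < 2; turn++) {
            bool consumer_turn = (turn == 0) == consumer_first;
            if (consumer_turn) {
                if (fifo_get(&f, &token)) {    /* Get */
                    sum += token * token;      /* Execute */
                    received++;
                }
            } else if (next <= n && fifo_put(&f, next)) {   /* Put */
                next++;
            }
        }
    }
    return sum;
}
```

Both `run_network(10, true)` and `run_network(10, false)` return 385, the sum of the squares of 1 to 10.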

14

Threads

• 8 Threads per microengine

• Minimal cost for swapping contexts

• Registers and memory divided between threads:

– Private: a copy for each thread

– Shared: all threads see the same value

• Instruction store shared by all threads

• 1 process → 1 thread

• Threads run in a round-robin schedule

• Non-pre-emptive: the programmer must swap contexts explicitly
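The scheduling model on this slide can be mimicked in plain C. In the sketch below, each call to the hypothetical function `thread_step` stands for the code a thread runs between two explicit context swaps, and `run_round_robin` visits the 8 threads in fixed order. It is a host-side simulation with made-up names, not microengine code.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_THREADS 8    /* threads per microengine */
#define MAX_TRACE   64   /* assumed size of the trace buffer */

typedef struct {
    int remaining;       /* private per-thread state: work units left */
} Context;

/* Does one unit of work and returns; returning plays the role of the
   programmer-inserted context swap. Returns false if nothing was done. */
static bool thread_step(Context *ctx)
{
    if (ctx->remaining == 0) return false;
    ctx->remaining--;
    return true;
}

/* Visits the threads round robin until none has work left. The first
   MAX_TRACE executed thread ids are stored in trace; the return value is
   the total number of executed steps. */
int run_round_robin(const int work[NUM_THREADS], int trace[MAX_TRACE])
{
    Context ctx[NUM_THREADS];
    int steps = 0;
    bool progress = true;

    for (int t = 0; t < NUM_THREADS; t++) {
        assert(work[t] >= 0);
        ctx[t].remaining = work[t];
    }
    while (progress) {
        progress = false;
        for (int t = 0; t < NUM_THREADS; t++) {
            if (thread_step(&ctx[t])) {
                progress = true;
                if (steps < MAX_TRACE) trace[steps] = t;
                steps++;
            }
        }
    }
    return steps;
}
```

With work units `{2, 1, 0, 0, 0, 0, 0, 1}`, the threads execute in the order 0, 1, 7, 0: a thread never loses the processor in the middle of a step, and idle threads are simply skipped.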

15

Signals

• Synchronization between IXP elements

– Between threads and microengines

– Memory access

• Wait for signal → context switch

__declspec(sram_read_reg) x; SIGNAL sig;
__declspec(scratch)* addr = 0x400;
scratch_read(&x, addr, 1, sig_done, sig);
do_other_work();
__wait_for_all(&sig);
y = x;

16

Available FIFOs

• In hardware:

– 16 Scratchpad memory rings

– 128 SRAM rings

– 7 Next neighbour rings

• In software:

– Local memory

– Direct Xfer register access

– Scratchpad memory

17

FIFO mappings

• Small and frequently used:

– Scratchpad HW

– (Next neighbour rings)

• Small and less frequently used:

– Scratchpad SW

• Large and/or not frequently used:

– SRAM rings
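The mapping rules on this slide can be written down as a small decision function. This is a sketch under assumptions: the cut-off `SMALL_FIFO_MAX_WORDS` is an invented value (the slide gives no number for "small"), usage frequency is reduced to three hypothetical levels, and the fallback to a software FIFO when no hardware scratchpad ring is free is an added assumption. Next neighbour rings are left out.

```c
#include <assert.h>

typedef enum { USE_FREQUENT, USE_LESS_FREQUENT, USE_RARE } Usage;
typedef enum { SCRATCH_HW_RING, SCRATCH_SW_FIFO, SRAM_RING } Channel;

#define SMALL_FIFO_MAX_WORDS 64   /* assumed cut-off for a "small" FIFO */

/* Picks a communication channel for one FIFO of the process network.
   free_hw_rings is the number of scratchpad hardware rings still unused. */
Channel choose_channel(int size_words, Usage usage, int free_hw_rings)
{
    assert(size_words > 0 && free_hw_rings >= 0);

    /* Large and/or not frequently used: SRAM ring. */
    if (size_words > SMALL_FIFO_MAX_WORDS || usage == USE_RARE)
        return SRAM_RING;

    /* Small and frequently used: hardware scratchpad ring, if one is left. */
    if (usage == USE_FREQUENT && free_hw_rings > 0)
        return SCRATCH_HW_RING;

    /* Small and less frequently used (or no hardware ring left). */
    return SCRATCH_SW_FIFO;
}
```

For example, `choose_channel(16, USE_FREQUENT, 3)` selects a hardware scratchpad ring, while the same FIFO with `free_hw_rings == 0` ends up as a software FIFO in scratchpad memory.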

18

Process mapping

An interesting presentation related to this mapping problem is: An ILP Formulation for System-Level Application Mapping on Network Processor Architectures, Chris Ostler and Karam S. Chatha, Proceedings of DATE’07, Nice, France, 16-20 April 2007

19

Tool Flow Overview

• IMCA: IXP Mapper for Compaan Applications

20

Code generation

• Visitor design pattern

– Visits platform description, writes code per element

• One “C” file per microengine

• FIFOs accessed uniformly

– Static functions to implement FIFO code

– Port struct to specify FIFO variables

21

Using the IMCA environment

[Figure: IMCA environment, with elements labelled "Packet".]

22

Results

• QR algorithm

– 5 nodes

– 12 FIFOs

Arch.            # clock cycles   CPU freq.   Time (µs)
FPGA, full HW    213              108 MHz     2
FPGA, 5 MB       3865             100 MHz     39
IXP              40247            600 MHz     67
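The time column follows from the other two: execution time in microseconds is the number of clock cycles divided by the clock frequency in MHz. A minimal check in C, with the hypothetical helper `execution_time_us`:

```c
#include <assert.h>

/* Execution time in microseconds for a cycle count at a clock in MHz
   (cycles per microsecond). */
double execution_time_us(long cycles, double freq_mhz)
{
    assert(cycles >= 0 && freq_mhz > 0.0);
    return (double)cycles / freq_mhz;
}
```

Applied to the three rows, this gives about 1.97 µs for 213 cycles at 108 MHz, 38.65 µs for 3865 cycles at 100 MHz, and about 67.08 µs for 40247 cycles at 600 MHz, which round to the reported 2, 39, and 67 µs.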

23

Discussion

• This work is a first attempt; many issues are still open

– Selection of the right communication channel

– Binding of the KPN processes to threads and microengines

– The MSF takes a lot of resources; what is the minimum required?

• The future of the IXP is uncertain; perhaps the Cell processor is an interesting next research platform

24

Conclusion

• Q1: The IXP can be used for streaming applications

– Yes, we showed it for QR and DWT

• Q2: We automatically mapped QR

– Yes, we can map the FIFO communication onto the communication channels and the processes onto the threads

• Q3: The IMCA back-end generates IXP code

– Yes, we can use Compaan to automatically generate the processes and FIFO channels that are subsequently mapped onto the IXP
