Experimental Performance Evaluation For Reconfigurable Computer Systems: The GRAM Benchmarks

Chitalwala, E., El-Ghazawi, T., Gaj, K.

The George Washington University, George Mason University.

MAPLD 2004, Washington DC

Chitalwala 2 2004 MAPLD – 1010

Abbreviations

BRAM – Block RAM

GRAM – Generalized Reconfigurable Architecture Model

LM – Local Memory

Max – Maximum

MAP – Multi Adaptive Processor

MPM – Microprocessor Memory

OCM – On-Chip Memory

PE – Processing Element

Trans Perms – Transfer of Permissions


Outline

Problem Statement

GRAM Description

Assumptions and Methodology

Testbed Description: SRC-6E

Results

Conclusion and Future Direction


Problem Statement

Develop a standardized model of Reconfigurable Architectures.

Define a set of synthetic benchmarks based on this model to analyze performance and discover bottlenecks.

Evaluate the system against the peak performance specifications given by the manufacturer.

Prove the concept by using these benchmarks to assess and dynamically characterize the performance of a reconfigurable system, using the SRC-6E as a test case.

Generalized Reconfigurable Architecture Model (GRAM)


GRAM Benchmarks: Objective

To measure maximum sustainable data transfer rates and latency between the various elements of the GRAM.

Dynamically characterize the performance of the system against system peak performance.


Generalized Reconfigurable Architecture Model (GRAM)


GRAM Elements

PE – Processing Element

OCM – On-Chip Memory

LM – Local Memory

Interconnect Network / Shared Memory

Bus Interface

Microprocessor Memory


GRAM Benchmarks

OCM – OCM: Measure max. sustainable bandwidth and latency between two OCMs residing on different PEs.

OCM – LM: Measure max. sustainable bandwidth and latency between OCM and LM in either direction.

OCM - Shared Memory: Measure max. sustainable bandwidth and latency between OCM and Shared Memory in either direction.

Shared Memory – MPM: Measure max. sustainable bandwidth and latency between Shared Memory and MPM in either direction.


GRAM Benchmarks

OCM – MPM: Measure max. sustainable bandwidth and latency between OCM and MPM in either direction.

LM – MPM: Measure max. sustainable bandwidth and latency between LM and MPM in either direction.

LM – LM: Measure max. sustainable bandwidth and latency between two LMs in either direction.

LM – Shared Memory: Measure max. sustainable bandwidth and latency between LM and Shared Memory in either direction.

GRAM Assumptions


Assumptions

All devices on board are fed through a single clock

No direct path between the Local Memories of individual elements

Connections for add-on cards may exist but are not shown

The generalized architecture has been created based on precedents set by past and current manufacturers of Reconfigurable Systems.


Methodology

Data paths can be parallelized to the maximum extent possible.

Inputs and Outputs have been kept symmetrical.

Hardware timers have been used to measure times taken to transfer data.

Measurements have been taken for transfers of increasingly large amounts of data.

Data must be verified for correctness after transfers.

Multiple paths may exist between the elements specified. Our aim will be to measure the fastest path available.

All experiments will be conducted using the programming model and library functions of the system.
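The measurement procedure reduces to a simple calculation once the hardware timer values are available. The Python sketch below shows one way to turn elapsed cycle counts into bandwidth figures for a sweep of transfer sizes and to pick the maximum sustained value. The function names, the sample cycle counts, and the 100 MHz default clock are illustrative assumptions, not code or data from the SRC-6E libraries.

```python
def bandwidth_mbytes_per_s(num_bytes, elapsed_cycles, clock_hz=100e6):
    """Bandwidth in MBytes/s (1 MByte = 10**6 bytes) for one timed transfer."""
    if elapsed_cycles <= 0:
        raise ValueError("elapsed_cycles must be positive")
    seconds = elapsed_cycles / clock_hz
    return num_bytes / seconds / 1e6


def sweep(samples, clock_hz=100e6):
    """Map (bytes, cycles) pairs to (bytes, bandwidth) pairs."""
    return [(n, bandwidth_mbytes_per_s(n, c, clock_hz)) for n, c in samples]


def max_sustained(samples, clock_hz=100e6):
    """Largest bandwidth observed over the sweep."""
    return max(bw for _, bw in sweep(samples, clock_hz))


# Hypothetical (bytes transferred, elapsed cycles) pairs standing in for
# hardware timer readings at increasing transfer sizes.
SAMPLES = [(8, 21), (80, 30), (800, 120), (8000, 1020)]

if __name__ == "__main__":
    for size, bw in sweep(SAMPLES):
        print(f"{size:>6} bytes: {bw:8.2f} MBytes/s")
```

Keeping the clock rate as a parameter lets the same calculation be reused on a system with a different timer frequency.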

Testbed Description: SRC-6E


Hardware Architecture of the SRC-6E

[Figure: block diagram of the SRC-6E hardware. Recoverable labels: P boards carrying P3/P4 µPs with L2 caches, MIOC, private memory (1.5 GB), PCI slot, SNAP, and a connection to Ethernet; MAP III board carrying control chips, user chips (Xilinx XC2V6000), and on-board memory (4 MB x 6). Link annotations: 800 Mbytes/sec, 800/1600 Mbytes/sec, and 64 x 6.]


Programming Model of the SRC-6E

[Figure: compilation flow for the SRC-6E. Recoverable labels, in diagram order: .c or .f files and .mc or .mf files; µP compiler and MAP compiler; .o files; linker; .v files; .vhd or .v files; logic synthesis; .ngo files; place & route; .bin files; application executable.]


GRAM Benchmarks for the SRC-6E

[Figure: SRC-6E block diagram annotated with peak bandwidths and the benchmarked paths. Recoverable labels: P boards with P3/P4 µPs (1/3 GHz), L2, MIOC, PCI slot, SNAP, µProcessor memory (1.5 GB), and Ethernet; MAP III board with user chips and on-board memory (24 MB). Bandwidth annotations: 8000, 800, 4800 (6 x 800), 2400 (4800*), and 800/1600 MBytes/s. Paths marked: OCM - OCM, OCM – Shared Memory, OCM - MPM, Shared Memory to MPM.]


GRAM Benchmarks for the SRC-6E

| Benchmark | SRC-6E |
| --- | --- |
| OCM – OCM | BRAM – BRAM |
| OCM – LM | NA |
| OCM – Shared Memory | BRAM – On-Board Memory |
| Shared Memory – MPM | On-Board Memory – Common Memory |
| OCM – MPM | BRAM – Common Memory |
| LM – MPM | NA |
| LM – LM | NA |
| LM – Shared Memory | NA |
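The mapping from GRAM benchmarks to SRC-6E components can be held in a small lookup structure, which makes it easy to select the benchmarks that apply to a given machine. The sketch below transcribes the mapping for the SRC-6E; the dictionary and function names are hypothetical, and `None` stands for the NA entries.

```python
# GRAM benchmark -> SRC-6E realization; None marks entries listed as NA.
SRC6E_MAPPING = {
    "OCM - OCM": "BRAM - BRAM",
    "OCM - LM": None,
    "OCM - Shared Memory": "BRAM - On-Board Memory",
    "Shared Memory - MPM": "On-Board Memory - Common Memory",
    "OCM - MPM": "BRAM - Common Memory",
    "LM - MPM": None,
    "LM - LM": None,
    "LM - Shared Memory": None,
}


def applicable_benchmarks(mapping):
    """Return the GRAM benchmarks that have a realization on the target system."""
    return [name for name, target in mapping.items() if target is not None]


if __name__ == "__main__":
    for name in applicable_benchmarks(SRC6E_MAPPING):
        print(f"{name}: {SRC6E_MAPPING[name]}")
```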

Results


Block Diagram for a Single Bank Transfer between OCM and Shared Memory

µProcessor Memory to Shared Memory (DMA_in)

Start_timer

Read_timer(ht0)

Read_timer(ht1)

Shared Memory to OCM

Read_timer(ht2)

OCM to Shared Memory

Read_timer(ht3)

Shared Memory to µProcessor Memory (DMA_out)

Read_timer(ht4)
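One plausible reading of this diagram is that each transfer stage is bracketed by two consecutive timer reads, so the cycles spent in a stage are the difference between adjacent readings. The Python sketch below works under that assumption. The pairing of stages with ht0 to ht4, the function name, and the sample readings are all hypothetical and are not taken from the SRC-6E programming environment.

```python
# Assumed order of stages between consecutive timer reads ht0..ht4.
STAGES = [
    "MPM to Shared Memory (DMA_in)",
    "Shared Memory to OCM",
    "OCM to Shared Memory",
    "Shared Memory to MPM (DMA_out)",
]


def stage_cycles(readings):
    """Convert cumulative timer readings [ht0..ht4] into per-stage cycle counts."""
    if len(readings) != len(STAGES) + 1:
        raise ValueError("expected one more reading than there are stages")
    deltas = [later - earlier for earlier, later in zip(readings, readings[1:])]
    if any(d < 0 for d in deltas):
        raise ValueError("timer readings must be non-decreasing")
    return dict(zip(STAGES, deltas))


if __name__ == "__main__":
    # Hypothetical readings, in clock cycles.
    for stage, cycles in stage_cycles([5, 1005, 1025, 1040, 5240]).items():
        print(f"{stage}: {cycles} cycles")
```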


Latency

| Transfer | Minimum Data Transferred | Latency in Clock Cycles (Pentium III) | Latency in Clock Cycles (Pentium IV) | Latency in μs (Pentium III) | Latency in μs (Pentium IV) |
| --- | --- | --- | --- | --- | --- |
| Shared Memory to OCM | 1 word* | 20 | 20 | 0.20 | 0.20 |
| OCM to Shared Memory | 1 word* | 15 | 15 | 0.15 | 0.15 |
| OCM to OCM (Bridge Port) | 1 word* | 11 | 11 | 0.11 | 0.11 |
| Shared Memory to MPM | 4 words* | 4200 | 2100 | 42 | 21 |
| MPM to Shared Memory | 4 words* | 1000 | 1000 | 10 | 10 |

*1 word = 64 bits
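The cycle and microsecond columns are related by the clock period. The latency figures are consistent with a 100 MHz clock (20 cycles corresponds to 0.20 μs); that rate is inferred from the numbers here rather than stated on the slide. A minimal Python sketch of the conversion, with hypothetical names:

```python
CLOCK_HZ = 100e6  # assumption: inferred from 20 cycles <-> 0.20 microseconds


def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a latency in clock cycles to microseconds."""
    return cycles / clock_hz * 1e6


# (Pentium III cycles, Pentium IV cycles) per transfer, from the latency figures.
LATENCY_CYCLES = {
    "Shared Memory to OCM": (20, 20),
    "OCM to Shared Memory": (15, 15),
    "OCM to OCM (Bridge Port)": (11, 11),
    "Shared Memory to MPM": (4200, 2100),
    "MPM to Shared Memory": (1000, 1000),
}

if __name__ == "__main__":
    for name, (p3, p4) in LATENCY_CYCLES.items():
        print(f"{name}: {cycles_to_us(p3):.2f} us (P III), {cycles_to_us(p4):.2f} us (P IV)")
```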


Latency

The difference between the Shared Memory to OCM and OCM to Shared Memory latencies is due to the read latency of the on-board memory (OBM, 6 clocks) versus that of BRAM (1 clock).

When transferring data from the MPM to Shared Memory, writes are issued at each clock cycle and there is no startup latency involved.

When reading data from the Shared Memory to the MPM, an additional five clock cycles are required to transfer data after the read has been issued.
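A simple way to see how startup latency shapes measured bandwidth is a first-order model in which a transfer costs a fixed latency plus one word per clock. The Python sketch below uses that model. It is an illustrative assumption, not the authors' analysis: the 8 bytes per cycle corresponds to one 64-bit word per clock, and the 100 MHz clock is inferred from the latency figures.

```python
def effective_bandwidth(num_bytes, latency_cycles, bytes_per_cycle=8, clock_hz=100e6):
    """Effective MBytes/s when a transfer costs latency + num_bytes / bytes_per_cycle cycles."""
    transfer_cycles = latency_cycles + num_bytes / bytes_per_cycle
    seconds = transfer_cycles / clock_hz
    return num_bytes / seconds / 1e6


if __name__ == "__main__":
    # With a 20-cycle startup latency, small transfers are latency-bound and
    # large transfers approach the 800 MBytes/s peak of a 64-bit path at 100 MHz.
    for size in (8, 80, 800, 8000, 80000, 800000):
        print(f"{size:>7} bytes: {effective_bandwidth(size, 20):7.2f} MBytes/s")
```

Under this model the bandwidth rises with transfer size and levels off near the peak, which is why measurements are taken over increasingly large transfers.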


[Figure: Data Path from OCM to OCM Using Transfer Of Permissions. Recoverable labels: processing element (FPGA) containing OCM1 and OCM2; Shared Memory banks A to F, 4 MB each; path width annotations of 64 and 192.]


[Figure: Data Path from OCM to OCM Using The Bridge Port and the Streaming Protocol. Recoverable labels: Shared Memory banks A to F, 4 MB each; processing element (FPGA 2); OCM1; path width annotations of 64.]


P III & IV: Bandwidth: OCM and OCM (BM#1)

[Chart: Pentium III and Pentium IV: OCM - OCM. X axis: Number Of Bytes (1 to 100000). Y axis: Bandwidth (MBytes/s), 0 to 160. Series: Bandwidth (Bridge).]


P III: Bandwidth: OCM and OCM (BM#1)

[Chart: Pentium III: OCM - OCM. X axis: Number Of Bytes (1 to 1000000). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: Bandwidth (Trans Perms), Peak Bandwidth.]


P IV: Bandwidth: OCM and OCM (BM#1)

[Chart: Pentium IV: OCM - OCM. X axis: Number Of Bytes (1 to 1000000). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: Bandwidth (Trans Perms), Peak Bandwidth.]


P IV: Bandwidth: OCM and OCM (BM#1) (Streaming Protocol in Bridge Port)

[Chart: Pentium IV: OCM - OCM. X axis: Number Of Bytes (1 to 1000000). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: Bandwidth (Bridge), Peak Bandwidth.]


[Figure: Data Path from OCM to MPM and Shared Memory to MPM. Recoverable labels: OCM1, OCM2, OCM3; Shared Memory banks A to F, 4 MB each; Control FPGA; SNAP; Microprocessor Memory; path width annotations of 64.]


P III: Bandwidth: OCM and Shared Memory for a single bank

[Chart: Pentium III: OCM - Shared Memory. X axis: Number Of Bytes (1 to 1000000). Left Y axis: Bandwidth (MBytes/s), 0 to 900. Right Y axis: Bits/Clock, 0 to 70. Series: BW (Shared Memory to OCM) (Mbytes/sec), BW (OCM to Shared Memory) (Mbytes/sec), Peak Bandwidth, Bits/clock (Shared Memory to OCM), Bits/clock (OCM to Shared Memory).]


P III: Bandwidth: OCM and Shared Memory

[Chart: Pentium III: OCM - Shared Memory. X axis: Number Of Bytes (1 to 1000000). Y axis: Bandwidth (MBytes/s), 0 to 3000. Series: Shared Memory to OCM, OCM to Shared Memory, Peak Bandwidth.]


P IV: Bandwidth: OCM and Shared Memory

[Chart: Pentium IV: OCM - Shared Memory. X axis: Number Of Bytes (1 to 1000000). Y axis: Bandwidth (MBytes/s), 0 to 3000. Series: Shared Memory to OCM, OCM to Shared Memory, Peak Bandwidth.]


P III: Bandwidth: OCM and µP Memory

[Chart: Pentium III: OCM - MPM. X axis: Number Of Bytes (1 to 100000). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: µP Memory to OCM, OCM to µP Memory, Peak Bandwidth.]


P IV: Bandwidth: OCM and µP Memory

[Chart: Pentium IV: MPM - OCM. X axis: Number of Bytes (1 to 1000000). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: µP Memory to OCM, OCM to µP Memory, Peak Bandwidth.]


P III: Bandwidth: Shared Memory and µP Memory (BM#5)

[Chart: Pentium III: Shared Memory - MPM. X axis: Number Of Bytes (1 to 1E+08). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: µP Memory to Shared Memory, Shared Memory to µP Memory, Peak Bandwidth.]


P IV: Bandwidth: Shared Memory and µP Memory

[Chart: Pentium IV: Shared Memory - MPM. X axis: Number Of Bytes (1 to 10000000). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: MPM to Shared Memory, Shared Memory to MPM, Peak Bandwidth.]


P III: Bandwidth: Shared Memory and µP Memory

[Chart: Pentium III: Shared Memory - MPM. X axis: Number Of Bytes (1 to 1E+08). Left Y axis: Bandwidth (MBytes/s), 0 to 900. Right Y axis: Bits/Clock, 0 to 30. Series: Peak Bandwidth, Bits/clock Into Shared Memory, Bits/clock Into µP Memory.]


P IV: Bandwidth: Shared Memory and µP Memory

[Chart: Pentium IV: Shared Memory - MPM. X axis: Number Of Bytes (1 to 1E+08). Left Y axis: Bandwidth (MBytes/s), 0 to 900. Right Y axis: Bits/Clock, 0 to 70. Series: Peak Bandwidth, Bits/clock Into Shared Memory, Bits/clock Into µP Memory.]


[Figure: Data Path from FPGA Register to Shared Memory. Recoverable labels: Shared Memory banks A to F, 4 MB each; Register; path width annotations of 64.]


P III: Bandwidth: Shared Memory and Register

[Chart: Pentium III: Shared Memory - FPGA Register. X axis: Number Of Bytes (1 to 1000000). Left Y axis: Bandwidth (MBytes/s), 0 to 900. Right Y axis: Bits/Clock, 0 to 70. Series: Shared Memory to Reg, Reg to Shared Memory, Peak Bandwidth, Bits/Clock (Shared Memory to Reg), Bits/Clock (Reg to Shared Memory).]

Conclusion & Future Direction


GRAM Summation for Pentium III

| Benchmarks | Peak Performance (Mbytes/s) | Maximum Sustainable Bandwidth Measured (Mbytes/s) | Efficiency (%) | Normalized Transfer Rate (compared with PCI-X @ 133 MHz, 32 bits, unidirectional) |
| --- | --- | --- | --- | --- |
| OCM – OCM a (Bridge Port) | 800 | 149 | 18.6 | 0.28 |
| OCM – OCM b (Trans Perms) | 800 | 793 | 99.13 | 1.5 |
| OCM – OCM c (Streaming) | 800 | NA | NA | NA |
| OCM – LM | NA | NA | NA | NA |
| OCM → Shared Memory / Shared Memory → OCM | 2400* | 2373 / 2373 | 98.8 / 98.8 | 4.46 |
| OCM → MPM / MPM → OCM | 800 / 800 | 182.8 / 227.3 | 22.85 / 28.41 | 0.34 / 0.43 |
| Shared Memory → MPM / MPM → Shared Memory | 800 / 800 | 203 / 314 | 25.3 / 39.3 | 0.38 / 0.59 |
| Shared Memory → Reg / Reg → Shared Memory | 800 / 800 | 798 / 798 | 99.75 / 99.75 | 1.5 / 1.5 |
| LM – MPM | NA | NA | NA | NA |
| LM – LM | NA | NA | NA | NA |
| LM – Shared Memory | NA | NA | NA | NA |

* For three banks
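The two derived columns follow from simple ratios: efficiency is measured bandwidth over peak bandwidth, and the normalized transfer rate is measured bandwidth over a reference bus rate. The Python sketch below assumes a reference of 133 MHz x 4 bytes = 532 MBytes/s for the 32-bit PCI-X baseline. That figure is inferred from the column heading and agrees with the Pentium III values to rounding, but it is not stated explicitly on the slide. Function names are hypothetical.

```python
# Assumed baseline: PCI-X at 133 MHz, 32 bits (4 bytes), unidirectional.
PCIX_BASELINE_MBYTES_PER_S = 133 * 4


def efficiency_percent(measured, peak):
    """Measured bandwidth as a percentage of peak bandwidth."""
    return 100.0 * measured / peak


def normalized_rate(measured, baseline=PCIX_BASELINE_MBYTES_PER_S):
    """Measured bandwidth relative to the reference bus rate."""
    return measured / baseline


if __name__ == "__main__":
    # (benchmark, peak MBytes/s, measured MBytes/s) from the Pentium III results.
    rows = [
        ("OCM - OCM (Bridge Port)", 800, 149),
        ("OCM - OCM (Trans Perms)", 800, 793),
        ("OCM -> Shared Memory", 2400, 2373),
    ]
    for name, peak, measured in rows:
        print(f"{name}: {efficiency_percent(measured, peak):.2f} %, "
              f"{normalized_rate(measured):.2f}x PCI-X")
```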


GRAM Summation for Pentium IV

| Benchmarks | Peak Performance (Mbytes/s) | Maximum Sustainable Bandwidth Measured (Mbytes/s) | Efficiency (%) | Normalized Transfer Rate (compared with PCI-X @ 133 MHz, 32 bits, unidirectional) |
| --- | --- | --- | --- | --- |
| OCM – OCM a (Bridge Port) | 800 | 149 | 18.6 | 0.28 |
| OCM – OCM b (Trans Perms) | 800 | 797.39 | 99.67 | 1.5 |
| OCM – OCM c (Streaming) | 800 | 799.49 | 100 | 1.5 |
| OCM – LM | NA | NA | NA | NA |
| OCM → Shared Memory / Shared Memory → OCM | 2400* | 2392 / 2390 | 99.6 / 99.6 | 4.5 / 4.5 |
| OCM → MPM / MPM → OCM | 800 / 800 | 578 / 562 | 72.25 / 70.25 | 1.08 / 1.05 |
| Shared Memory → MPM / MPM → Shared Memory | 800 / 800 | 796 / 799 | 99.5 / 99.8 | 1.5 / 1.5 |
| Shared Memory → Reg / Reg → Shared Memory | 800 / 800 | 798 / 798 | 99.75 / 99.75 | 1.5 / 1.5 |
| LM – MPM | NA | NA | NA | NA |
| LM – LM | NA | NA | NA | NA |
| LM – Shared Memory | NA | NA | NA | NA |

* For three banks


Conclusions

The type of components used plays a major role in determining system performance, as seen in the difference between the Pentium III and the Pentium IV versions of the SRC-6E.

The software environment and its state of development play a role in determining how effectively a program can utilize the hardware. This is clear from the difference in bandwidth achieved across the Bridge ports with the Carte 1.6.2 release and the Carte 1.7 release.


Conclusions …

The GRAM Summation Tables help to serve machine architects in the following ways:

The efficiency column indicates how well a particular communication channel is being utilized within the hardware context. If the efficiency is low, architects may be able to improve performance using a firmware improvement. If efficiency is high and the normalized bandwidth is low, then they should consider a hardware upgrade.

By looking at the normalized bandwidths obtained from the GRAM benchmarks, designers can also determine whether the data transfer rates are balanced across the architectural modules. This helps identify bottlenecks.

Designers can find out which channels have the maximum efficiency and can hence fine-tune their applications to exploit these channels to achieve the maximum data transfer rate.
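The rule of thumb for architects can be written as a small decision function. In the Python sketch below, the two thresholds (90 % efficiency and a normalized rate of 1.0) are hypothetical cut-offs chosen only for illustration. The slides say "low" and "high" without giving numbers, and the function name is likewise an assumption.

```python
def diagnose(efficiency_pct, normalized_rate, eff_threshold=90.0, norm_threshold=1.0):
    """Suggest where to look for improvement on one communication channel."""
    if efficiency_pct < eff_threshold:
        # The channel is underused relative to its own peak.
        return "firmware"
    if normalized_rate < norm_threshold:
        # The channel is well used but slow relative to the reference bus.
        return "hardware"
    return "ok"


if __name__ == "__main__":
    # (channel, efficiency %, normalized rate) taken from the summation tables.
    channels = [
        ("OCM - OCM (Bridge Port)", 18.6, 0.28),
        ("OCM - OCM (Trans Perms), P IV", 99.67, 1.5),
        ("OCM -> MPM, P III", 22.85, 0.34),
    ]
    for name, eff, norm in channels:
        print(f"{name}: {diagnose(eff, norm)}")
```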


Conclusions …

In addition, the GRAM Summation tables also provide the following information to application developers:

The tables can tell a designer what bottlenecks to expect and where these bottlenecks lie.

By comparing the figures for Efficiency and the Normalized transfer rates, designers can understand whether the bottlenecks are caused by the hardware or the software.

By observing the GRAM Summation tables, designers can predict the performance of a pre-designed application on a particular reconfigurable system.


Future Direction

Benchmarks can be expanded to include end-to-end performance from asymmetrical and synthetic workloads.

The Benchmarks can also include tables to characterize the performance of reconfigurable computers as it compares to modern parallel architectures. A performance to cost analysis can also be considered.
