
Experimental Performance Evaluation For Reconfigurable Computer Systems: The GRAM Benchmarks

Chitalwala, E., El-Ghazawi, T., Gaj, K.

The George Washington University, George Mason University.

MAPLD 2004, Washington DC


Chitalwala 2 2004 MAPLD – 1010

Abbreviations

BRAM – Block RAM
GRAM – Generalized Reconfigurable Architecture Model
LM – Local Memory
Max – Maximum
MAP – Multi Adaptive Processor
MPM – Microprocessor Memory
OCM – On-Chip Memory
PE – Processing Element
Trans Perms – Transfer of Permissions


Outline

Problem Statement
GRAM Description
Assumptions and Methodology
Testbed Description: SRC-6E
Results
Conclusion and Future Direction


Problem Statement

Develop a standardized model of Reconfigurable Architectures.

Define a set of synthetic benchmarks based on this model to analyze performance and discover bottlenecks.

Evaluate the system against the peak performance specifications given by the manufacturer.

Prove the concept by using these benchmarks to assess and dynamically characterize the performance of a reconfigurable system, using the SRC-6E as a test case.


Generalized Reconfigurable Architecture Model (GRAM)


GRAM Benchmarks: Objective

To measure maximum sustainable data transfer rates and latency between the various elements of the GRAM.

Dynamically characterize the performance of the system against system peak performance.


Generalized Reconfigurable Architecture Model (GRAM)


GRAM Elements

PE – Processing Element

OCM – On-Chip Memory

LM – Local Memory

Interconnect Network / Shared Memory

Bus Interface

Microprocessor Memory


GRAM Benchmarks

OCM – OCM: Measure max. sustainable bandwidth and latency between two OCMs residing on different PEs.

OCM – LM: Measure max. sustainable bandwidth and latency between OCM and LM in either direction.

OCM - Shared Memory: Measure max. sustainable bandwidth and latency between OCM and Shared Memory in either direction.

Shared Memory – MPM: Measure max. sustainable bandwidth and latency between Shared Memory and MPM in either direction.


GRAM Benchmarks

OCM – MPM: Measure max. sustainable bandwidth and latency between OCM and MPM in either direction.

LM – MPM: Measure max. sustainable bandwidth and latency between LM and MPM in either direction.

LM – LM: Measure max. sustainable bandwidth and latency between LM and LM in either direction.

LM – Shared Memory: Measure max. sustainable bandwidth and latency between LM and Shared Memory in either direction.


GRAM Assumptions


Assumptions

All devices on board are fed through a single clock

No direct path between the Local Memories of individual elements

Connections for add-on cards may exist but are not shown

The generalized architecture has been created based on precedents set by past and current manufacturers of Reconfigurable Systems.


Methodology

Data paths can be parallelized to the maximum extent possible.

Inputs and Outputs have been kept symmetrical.

Hardware timers have been used to measure times taken to transfer data.

Measurements have been taken for transfers of increasingly large amounts of data.

Data must be verified for correctness after transfers.

Multiple paths may exist between the elements specified. Our aim will be to measure the fastest path available.

All experiments will be conducted using the programming model and library functions of the system.
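The methodology above (timed transfers of increasing size, with verification after each transfer) can be sketched in ordinary Python. This is only an illustration under stated assumptions: a software timer stands in for the hardware timers, an in-memory copy stands in for a real transfer path, and the names `measure_transfer` and `sweep` are hypothetical, not part of the SRC-6E programming model.

```python
import time

def measure_transfer(copy_fn, size_bytes, repeats=3):
    """Time copy_fn on size_bytes of data; return the best bandwidth in MBytes/s."""
    src = bytes(i % 251 for i in range(size_bytes))  # recognizable test pattern
    best_ns = None
    for _ in range(repeats):
        start = time.perf_counter_ns()          # software stand-in for a hardware timer
        dst = copy_fn(src)
        elapsed = time.perf_counter_ns() - start
        if dst != src:                          # verify data after every transfer
            raise ValueError("data corrupted during transfer")
        best_ns = elapsed if best_ns is None else min(best_ns, elapsed)
    best_ns = max(best_ns, 1)                   # guard against a zero reading
    return size_bytes / (best_ns * 1e-9) / 1e6  # bytes per second -> MBytes/s

def sweep(copy_fn, max_exp=5):
    """Measure bandwidth for transfers of 1, 10, ..., 10**max_exp bytes."""
    return [(10 ** e, measure_transfer(copy_fn, 10 ** e)) for e in range(max_exp + 1)]

# Hypothetical "transfer": a plain memory copy.
results = sweep(lambda buf: bytes(bytearray(buf)))
for size, mbytes_s in results:
    print(f"{size:>7} bytes  {mbytes_s:10.2f} MBytes/s")
```

Taking the best of several repeats approximates the maximum sustainable rate for each size; a real benchmark would replace the copy with the system's own library calls.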


Testbed Description: SRC-6E

Hardware Architecture of the SRC-6E

[Figure: block diagram of the SRC-6E hardware. P Board: P3/P4 µPs with L2 caches, MIOC, private memory (1.5 GB), PCI slot, SNAP, link to Ethernet. MAP III Board: control chip, on-board memory (4 MB x 6), user chips (Xilinx XC2V6000). Data path labels: 64 x 6. Rate labels: 800 Mbytes/sec and 800/1600 Mbytes/sec.]

Programming Model of the SRC-6E

[Figure: compilation flow. .c or .f files pass through the μP compiler and .mc or .mf files pass through the MAP compiler; both produce .o files for the linker. .v files and .vhd or .v files pass through logic synthesis (.ngo files) and place & route (.bin files). The linker output is the application executable.]

GRAM Benchmarks for the SRC-6E

[Figure: SRC-6E block diagram annotated with the benchmark paths OCM - OCM, OCM – Shared Memory, OCM - MPM, and Shared Memory to MPM. P Board: P3/P4 µP (1/3 GHz) with L2, MIOC, PCI slot, SNAP, µProcessor Memory (1.5 GB), Ethernet. MAP III Board: control chip, on-board memory (24 MB), user chips. Rate labels that survive extraction include 4800 (6 x 800), 2400 (4800*), and 800/1600 MBytes/s.]

GRAM Benchmarks for the SRC-6E

Benchmark              SRC-6E
OCM – OCM              BRAM – BRAM
OCM – LM               NA
OCM – Shared Memory    BRAM – On-Board Memory
Shared Memory – MPM    On-Board Memory – Common Memory
OCM – MPM              BRAM – Common Memory
LM – MPM               NA
LM – LM                NA
LM – Shared Memory     NA
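The mapping from GRAM benchmarks to SRC-6E components can be held in a small lookup table, which makes it easy to select the benchmarks that apply to a given machine. The sketch below simply transcribes the mapping above; the names `GRAM_TO_SRC6E` and `applicable` are illustrative, and `None` is used here to stand for "NA".

```python
# GRAM benchmark -> SRC-6E realization, transcribed from the slide (None = NA).
GRAM_TO_SRC6E = {
    "OCM - OCM": "BRAM - BRAM",
    "OCM - LM": None,
    "OCM - Shared Memory": "BRAM - On-Board Memory",
    "Shared Memory - MPM": "On-Board Memory - Common Memory",
    "OCM - MPM": "BRAM - Common Memory",
    "LM - MPM": None,
    "LM - LM": None,
    "LM - Shared Memory": None,
}

def applicable(mapping):
    """Return only the benchmarks that have a realization on the target system."""
    return {name: target for name, target in mapping.items() if target is not None}

for name, target in applicable(GRAM_TO_SRC6E).items():
    print(f"{name:<22} -> {target}")
```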


Results

Block Diagram for a Single-Bank Transfer between OCM and Shared Memory

Start_timer
Read_timer(ht0)
µProcessor Memory to Shared Memory (DMA_in)
Read_timer(ht1)
Shared Memory to OCM
Read_timer(ht2)
OCM to Shared Memory
Read_timer(ht3)
Shared Memory to µProcessor Memory (DMA_out)
Read_timer(ht4)
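If the timer reads ht0 through ht4 are cumulative cycle counts, the time spent in each stage is the difference between consecutive reads. The sketch below assumes exactly that, and also assumes a 100 MHz clock (inferred from the latency figures in this deck, where 20 clock cycles correspond to 0.20 μs). The function name and the sample readings are hypothetical.

```python
STAGES = [
    "MPM to Shared Memory (DMA_in)",
    "Shared Memory to OCM",
    "OCM to Shared Memory",
    "Shared Memory to MPM (DMA_out)",
]

def stage_times_us(ht, clock_mhz=100.0):
    """Convert five cumulative timer reads (in cycles) to per-stage times in microseconds."""
    if len(ht) != len(STAGES) + 1:
        raise ValueError("expected timer reads ht0..ht4")
    # cycles / MHz = microseconds
    return {stage: (ht[i + 1] - ht[i]) / clock_mhz for i, stage in enumerate(STAGES)}

# Hypothetical timer readings, not measured data.
for stage, us in stage_times_us([0, 1000, 1020, 1035, 5235]).items():
    print(f"{stage:<32} {us:8.2f} us")
```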

Latency

Columns: Minimum Data Transferred; Latency in Clock Cycles (Pentium III, Pentium IV); Latency in μs (Pentium III, Pentium IV).

Transfer                    Min. Data     Cycles (P III)  Cycles (P IV)  μs (P III)  μs (P IV)
Shared Memory to OCM        1 word*       20              20             0.20        0.20
OCM to Shared Memory        1 word*       15              15             0.15        0.15
OCM to OCM (Bridge Port)    1 word*       11              11             0.11        0.11
Shared Memory to MPM        4 words*      4200            2100           42          21
MPM to Shared Memory        4 words*      1000            1000           10          10

*1 word = 64 bits
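The two latency columns are related by the clock frequency. The slides do not state the clock rate, but every row is consistent with 100 MHz (for example, 20 cycles and 0.20 μs), so the sketch below treats 100 MHz as an inferred assumption and checks the Pentium III and Pentium IV figures against it.

```python
def cycles_to_us(cycles, clock_mhz=100.0):
    """Convert a latency in clock cycles to microseconds (cycles / MHz = us)."""
    return cycles / clock_mhz

# (transfer, cycles on P III, cycles on P IV, us on P III, us on P IV) from the latency table.
LATENCY_ROWS = [
    ("Shared Memory to OCM", 20, 20, 0.20, 0.20),
    ("OCM to Shared Memory", 15, 15, 0.15, 0.15),
    ("OCM to OCM (Bridge Port)", 11, 11, 0.11, 0.11),
    ("Shared Memory to MPM", 4200, 2100, 42, 21),
    ("MPM to Shared Memory", 1000, 1000, 10, 10),
]

for name, c3, c4, us3, us4 in LATENCY_ROWS:
    ok = abs(cycles_to_us(c3) - us3) < 1e-9 and abs(cycles_to_us(c4) - us4) < 1e-9
    print(f"{name:<26} consistent with 100 MHz: {ok}")
```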


Latency

The difference between read and write times for the OCM and Shared Memory is due to the read latency of the on-board memory, OBM (6 clocks), vs. BRAM (1 clock).

When transferring data from the MPM to Shared Memory, writes are issued at each clock cycle and there is no startup latency involved.

When reading data from the Shared Memory to the MPM, there is an additional five clock cycles required to transfer data after the read has been issued.

[Figure: Data Path from OCM to OCM Using Transfer Of Permissions. Elements shown: two processing elements (FPGAs), each with OCM1 and OCM2, and Shared Memory banks A through F (4 MB each). Path width labels: 64 and 192.]

[Figure: Data Path from OCM to OCM Using The Bridge Port and the Streaming Protocol. Elements shown: processing elements FPGA 1 and FPGA 2, each with OCM1, and Shared Memory banks A through F (4 MB each). Path width labels: 64.]

P III & IV: Bandwidth: OCM and OCM (BM#1)

[Chart: Pentium III and Pentium IV: OCM - OCM. X axis: Number Of Bytes, 1 to 100000 (log scale). Y axis: Bandwidth (MBytes/s), 0 to 160. Series: Bandwidth (Bridge).]

P III: Bandwidth: OCM and OCM (BM#1)

[Chart: Pentium III: OCM - OCM. X axis: Number Of Bytes, 1 to 1000000 (log scale). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: Bandwidth (Trans Perms), Peak Bandwidth.]

P IV: Bandwidth: OCM and OCM (BM#1)

[Chart: Pentium IV: OCM - OCM. X axis: Number Of Bytes, 1 to 1000000 (log scale). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: Bandwidth (Trans Perms), Peak Bandwidth.]

P IV: Bandwidth: OCM and OCM (BM#1) (Streaming Protocol in Bridge Port)

[Chart: Pentium IV: OCM - OCM. X axis: Number Of Bytes, 1 to 1000000 (log scale). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: Bandwidth (Bridge), Peak Bandwidth.]

[Figure: Data Path from OCM to MPM and Shared Memory to MPM. Elements shown: a processing element (FPGA) with OCM1, OCM2, and OCM3; Shared Memory banks A through F (4 MB each); the Control FPGA; SNAP; Microprocessor Memory. Path width labels: 64.]

P III: Bandwidth: OCM and Shared Memory for a single bank

[Chart: Pentium III: OCM - Shared Memory. X axis: Number Of Bytes, 1 to 1000000 (log scale). Left Y axis: Bandwidth (MBytes/s), 0 to 900. Right Y axis: Bits/Clock, 0 to 70. Series: BW (Shared Memory to OCM) (Mbytes/sec), BW (OCM to Shared Memory) (Mbytes/sec), Peak Bandwidth, Bits/clock (Shared Memory to OCM), Bits/clock (OCM to Shared Memory).]
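The two vertical axes of this chart express the same quantity in different units. As a rough sketch, and assuming a 100 MHz clock (inferred from the latency figures, not stated on the slides) and 1 Mbyte = 10^6 bytes, a 64-bit path moving one word per clock gives the 800 Mbytes/s peak per bank, and several banks in parallel scale that peak linearly. The function names are illustrative.

```python
def bits_per_clock(bandwidth_mbytes_s, clock_mhz=100.0):
    """Convert a bandwidth in MBytes/s to bits transferred per clock cycle."""
    return bandwidth_mbytes_s * 8.0 / clock_mhz

def peak_mbytes_s(width_bits, clock_mhz=100.0, banks=1):
    """Peak bandwidth of `banks` parallel paths, each moving width_bits per clock."""
    return width_bits / 8.0 * clock_mhz * banks

print(bits_per_clock(800))          # 64.0 bits per clock on a single bank at peak
print(peak_mbytes_s(64))            # 800.0 MBytes/s for one 64-bit bank
print(peak_mbytes_s(64, banks=3))   # 2400.0 MBytes/s for three banks
print(peak_mbytes_s(64, banks=6))   # 4800.0 MBytes/s for six banks (6 x 800)
```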

P III: Bandwidth: OCM and Shared Memory

[Chart: Pentium III: OCM - Shared Memory. X axis: Number Of Bytes, 1 to 1000000 (log scale). Y axis: Bandwidth (MBytes/s), 0 to 3000. Series: Shared Memory to OCM, OCM to Shared Memory, Peak Bandwidth.]

P IV: Bandwidth: OCM and Shared Memory

[Chart: Pentium IV: OCM - Shared Memory. X axis: Number Of Bytes, 1 to 1000000 (log scale). Y axis: Bandwidth (MBytes/s), 0 to 3000. Series: Shared Memory to OCM, OCM to Shared Memory, Peak Bandwidth.]

P III: Bandwidth: OCM and µP Memory

[Chart: Pentium III: OCM - MPM. X axis: Number Of Bytes, 1 to 100000 (log scale). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: µP Memory to OCM, OCM to µP Memory, Peak Bandwidth.]

P IV: Bandwidth: OCM and µP Memory

[Chart: Pentium IV: MPM - OCM. X axis: Number of Bytes, 1 to 1000000 (log scale). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: µP Memory to OCM, OCM to µP Memory, Peak Bandwidth.]

P III: Bandwidth: Shared Memory and µP Memory (BM#5)

[Chart: Pentium III: Shared Memory - MPM. X axis: Number Of Bytes, 1 to 1E+08 (log scale). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: µP Memory to Shared Memory, Shared Memory to µP Memory, Peak Bandwidth.]

P IV: Bandwidth: Shared Memory and µP Memory

[Chart: Pentium IV: Shared Memory - MPM. X axis: Number Of Bytes, 1 to 10000000 (log scale). Y axis: Bandwidth (MBytes/s), 0 to 900. Series: MPM to Shared Memory, Shared Memory to MPM, Peak Bandwidth.]

P III: Bandwidth: Shared Memory and µP Memory

[Chart: Pentium III: Shared Memory - MPM. X axis: Number Of Bytes, 1 to 1E+08 (log scale). Left Y axis: Bandwidth (MBytes/s), 0 to 900. Right Y axis: Bits/Clock, 0 to 30. Series: µP Memory to Shared Memory, Shared Memory to µP Memory, Peak Bandwidth, Bits/clock Into Shared Memory, Bits/clock Into µP Memory.]

P IV: Bandwidth: Shared Memory and µP Memory

[Chart: Pentium IV: Shared Memory - MPM. X axis: Number Of Bytes, 1 to 1E+08 (log scale). Left Y axis: Bandwidth (MBytes/s), 0 to 900. Right Y axis: Bits/Clock, 0 to 70. Series: µP Memory to Shared Memory, Shared Memory to µP Memory, Peak Bandwidth, Bits/clock Into Shared Memory, Bits/clock Into µP Memory.]

[Figure: Data Path from FPGA Register to Shared Memory. Elements shown: a register in processing element FPGA 1 and Shared Memory banks A through F (4 MB each). Path width labels: 64.]

P III: Bandwidth: Shared Memory and Register

[Chart: Pentium III: Shared Memory - FPGA Register. X axis: Number Of Bytes, 1 to 1000000 (log scale). Left Y axis: Bandwidth (MBytes/s), 0 to 900. Right Y axis: Bits/Clock, 0 to 70. Series: Shared Memory to Reg, Reg to Shared Memory, Peak Bandwidth, Bits/Clock (Shared Memory to Reg), Bits/Clock (Reg to Shared Memory).]


Conclusion & Future Direction

GRAM Summation for Pentium III

Columns: Peak = Peak Performance (Mbytes/s); Measured = Maximum Sustainable Bandwidth Measured (Mbytes/s); Normalized = Normalized Transfer Rate (compared with PCI-X @ 133 MHz, 32 bits, unidirectional).

Benchmarks                                  Peak      Measured      Efficiency (%)   Normalized
OCM – OCM a (Bridge Port)                   800       149           18.6             0.28
OCM – OCM b (Trans Perms)                   800       793           99.13            1.5
OCM – OCM c (Streaming)                     800       NA            NA               NA
OCM – LM                                    NA        NA            NA               NA
OCM → Shared Memory / Shared Memory → OCM   *2400     2373/2373     98.8/98.8        4.46
OCM → MPM / MPM → OCM                       800/800   182.8/227.3   22.85/28.41      0.34/0.43
Shared Memory → MPM / MPM → Shared Memory   800/800   203/314       25.3/39.3        0.38/0.59
Shared Memory → Reg / Reg → Shared Memory   800/800   798/798       99.75/99.75      1.5/1.5
LM – MPM                                    NA        NA            NA               NA
LM – LM                                     NA        NA            NA               NA
LM – Shared Memory                          NA        NA            NA               NA

* For three banks
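The Efficiency and Normalized columns can be derived from the first two. The sketch below assumes that efficiency is measured bandwidth divided by peak, and that the PCI-X reference is 133 MHz x 4 bytes = 532 Mbytes/s; that reference value is an inference rather than something the slides state, although it reproduces entries such as 0.28 for 149 Mbytes/s to within rounding. The function names are illustrative.

```python
PCIX_MBYTES_S = 133 * 32 // 8  # assumed reference: PCI-X @ 133 MHz, 32 bits = 532 MBytes/s

def efficiency_pct(measured, peak):
    """Measured sustainable bandwidth as a percentage of peak."""
    return 100.0 * measured / peak

def normalized_rate(measured, reference=PCIX_MBYTES_S):
    """Measured bandwidth relative to the reference bus."""
    return measured / reference

# (benchmark, peak, measured) taken from the Pentium III table.
rows = [
    ("OCM - OCM (Bridge Port)", 800, 149),
    ("OCM - OCM (Trans Perms)", 800, 793),
    ("OCM -> Shared Memory (three banks)", 2400, 2373),
    ("OCM -> MPM", 800, 182.8),
]
for name, peak, measured in rows:
    print(f"{name:<36} {efficiency_pct(measured, peak):6.2f} %  {normalized_rate(measured):5.2f}")
```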

GRAM Summation for Pentium IV

Columns: Peak = Peak Performance (Mbytes/s); Measured = Maximum Sustainable Bandwidth Measured (Mbytes/s); Normalized = Normalized Transfer Rate (compared with PCI-X @ 133 MHz, 32 bits, unidirectional).

Benchmarks                                  Peak      Measured      Efficiency (%)   Normalized
OCM – OCM a (Bridge Port)                   800       149           18.6             0.28
OCM – OCM b (Trans Perms)                   800       797.39        99.67            1.5
OCM – OCM c (Streaming)                     800       799.49        100              1.5
OCM – LM                                    NA        NA            NA               NA
OCM → Shared Memory / Shared Memory → OCM   *2400     2392/2390     99.6/99.6        4.5/4.5
OCM → MPM / MPM → OCM                       800/800   578/562       72.25/70.25      1.08/1.05
Shared Memory → MPM / MPM → Shared Memory   800/800   796/799       99.5/99.8        1.5/1.5
Shared Memory → Reg / Reg → Shared Memory   800/800   798/798       99.75/99.75      1.5/1.5
LM – MPM                                    NA        NA            NA               NA
LM – LM                                     NA        NA            NA               NA
LM – Shared Memory                          NA        NA            NA               NA

* For three banks


Conclusions

The type of components used plays a major role in determining system performance, as seen in the difference between the Pentium III and the Pentium IV versions of the SRC-6E.

The software environment and its state of development play a role in determining how effectively a program is able to utilize the hardware. This is clear when observing the difference in bandwidth achieved across the Bridge ports using the Carte 1.6.2 release and the Carte 1.7 release.

Conclusions …

The GRAM Summation Tables help machine architects in the following ways:

The efficiency column indicates how well a particular communication channel is being utilized within the hardware context. If the efficiency is low, architects may be able to improve performance through a firmware improvement. If the efficiency is high and the normalized bandwidth is low, they should consider a hardware upgrade.

By looking at the normalized bandwidths obtained from the GRAM benchmarks, designers can also determine whether the data transfer rates are balanced across the architectural modules. This helps identify bottlenecks.

Designers can find out which channels have the maximum efficiency and can hence fine-tune their applications to exploit these channels and achieve the maximum data transfer rate.
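The rule of thumb for architects can be written as a small decision function. The slides do not say what counts as "low" or "high", so the thresholds below (50 % efficiency, a normalized rate of 1.0) and the name `diagnose` are purely hypothetical; the sample inputs are taken from the summation tables except for the last one, which is invented to exercise the hardware branch.

```python
def diagnose(efficiency_pct, normalized_rate, eff_threshold=50.0, norm_threshold=1.0):
    """Suggest where to look for an improvement on one communication channel.

    The thresholds are illustrative assumptions, not values from the slides.
    """
    if efficiency_pct < eff_threshold:
        return "low efficiency: consider a firmware improvement"
    if normalized_rate < norm_threshold:
        return "high efficiency, low normalized rate: consider a hardware upgrade"
    return "channel is well utilized"

print(diagnose(18.6, 0.28))   # OCM - OCM via Bridge Port
print(diagnose(99.5, 1.5))    # Shared Memory -> MPM on the Pentium IV system
print(diagnose(95.0, 0.5))    # invented values, not from the tables
```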

Conclusions …

In addition, the GRAM Summation tables provide the following information to application developers:

The tables can tell a designer what bottlenecks to expect and where these bottlenecks lie.

By comparing the figures for Efficiency and the Normalized transfer rates, designers can understand whether the bottlenecks are created by the hardware or the software.

By observing the GRAM Summation tables, designers can predict the performance of a pre-designed application on a particular reconfigurable system.
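One simple way to use the tables for prediction is a first-order model in which transfer time is the startup latency plus the data size divided by the sustained bandwidth. This model is an assumption made here for illustration, not a method given in the slides; the numbers plugged in are the Shared Memory to MPM latency and bandwidth figures reported for the two systems.

```python
def transfer_time_us(n_bytes, bandwidth_mbytes_s, latency_us=0.0):
    """Estimate transfer time in microseconds: latency + size / sustained bandwidth.

    With 1 MByte = 10**6 bytes, bytes divided by MBytes/s gives microseconds directly.
    """
    return latency_us + n_bytes / bandwidth_mbytes_s

size = 1_000_000  # a hypothetical 1 MByte result buffer moved from Shared Memory to MPM
p3 = transfer_time_us(size, bandwidth_mbytes_s=203, latency_us=42)
p4 = transfer_time_us(size, bandwidth_mbytes_s=796, latency_us=21)
print(f"Pentium III system: {p3:8.1f} us")
print(f"Pentium IV system:  {p4:8.1f} us")
```

For small transfers the latency term dominates, while for large transfers the bandwidth term does, which is why both the latency table and the summation tables are needed for an estimate.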


Future Direction

Benchmarks can be expanded to include end-to-end performance from asymmetrical and synthetic workloads.

The Benchmarks can also include tables to characterize the performance of reconfigurable computers as it compares to modern parallel architectures. A performance to cost analysis can also be considered.