Experimental Performance Evaluation For Reconfigurable Computer Systems: The GRAM Benchmarks
Chitalwala. E., El-Ghazawi. T., Gaj. K.,
The George Washington University, George Mason University.
MAPLD 2004, Washington DC
Chitalwala 2 2004 MAPLD – 1010
Abbreviations
BRAM – Block RAM
GRAM – Generalized Reconfigurable Architecture Model
LM – Local Memory
Max – Maximum
MAP – Multi Adaptive Processor
MPM – Microprocessor Memory
OCM – On-Chip Memory
PE – Processing Element
Trans Perms – Transfer of Permissions
Outline
Problem Statement
GRAM Description
Assumptions and Methodology
Testbed Description: SRC-6E
Results
Conclusion and Future Direction
Problem Statement
Develop a standardized model of Reconfigurable Architectures.
Define a set of synthetic benchmarks based on this model to analyze performance and discover bottlenecks.
Evaluate the system against the peak performance specifications given by the manufacturer.
Prove the concept by using these benchmarks to assess and dynamically characterize the performance of a reconfigurable system, using the SRC-6E as a test case.
Generalized Reconfigurable Architecture Model (GRAM)
GRAM Benchmarks: Objective
To measure maximum sustainable data transfer rates and latency between the various elements of the GRAM.
Dynamically characterize the performance of the system against system peak performance.
Generalized Reconfigurable Architecture Model (GRAM)
GRAM Elements
PE – Processing Element
OCM – On-Chip Memory
LM – Local Memory
Interconnect Network / Shared Memory
Bus Interface
Microprocessor Memory
GRAM Benchmarks
OCM – OCM: Measure max. sustainable bandwidth and latency between two OCMs residing on different PEs.
OCM – LM: Measure max. sustainable bandwidth and latency between OCM and LM in either direction.
OCM - Shared Memory: Measure max. sustainable bandwidth and latency between OCM and Shared Memory in either direction.
Shared Memory – MPM: Measure max. sustainable bandwidth and latency between Shared Memory and MPM in either direction.
GRAM Benchmarks
OCM – MPM: Measure max. sustainable bandwidth and latency between OCM and MPM in either direction.
LM – MPM: Measure max. sustainable bandwidth and latency between LM and MPM in either direction.
LM – LM: Measure max. sustainable bandwidth and latency between LM and LM in either direction.
LM – Shared Memory: Measure max. sustainable bandwidth and latency between LM and Shared Memory in either direction.
GRAM Assumptions
Assumptions
All devices on board are fed by a single clock.
There is no direct path between the Local Memories of individual elements.
Connections for add-on cards may exist but are not shown.
The generalized architecture has been created based on precedents set by past and current manufacturers of Reconfigurable Systems.
Methodology
Data paths can be parallelized to the maximum extent possible.
Inputs and outputs have been kept symmetrical.
Hardware timers have been used to measure the time taken to transfer data.
Measurements have been taken for transfers of increasingly large amounts of data.
Data must be verified for correctness after transfers.
Multiple paths may exist between the elements specified; our aim is to measure the fastest path available.
All experiments are conducted using the programming model and library functions of the system.
Testbed Description: SRC-6E
Hardware Architecture of the SRC-6E
[Diagram: two µP boards and a MAP III board. Each µP board holds two P3/P4 µPs with L2 caches, an MIOC, Private Memory (1.5 GB), a PCI slot carrying a SNAP card, and an Ethernet connection. The MAP III board holds two User Chips (Xilinx XC2V6000), a Control Chip, and On-Board Memory (4 MB x 6). The SNAP-to-MAP links run at 800/1600 Mbytes/sec; the six 64-bit On-Board Memory banks each connect at 800 Mbytes/sec.]
Programming Model of the SRC-6E
[Diagram: compilation flow. .c or .f files → µP Compiler → .o files. .mc or .mf files → MAP Compiler → .v files → Logic Synthesis (from .vhd or .v files) → .ngo files → Place & Route → .bin files. The Linker combines the .o and .bin files into the Application Executable.]
GRAM Benchmarks for the SRC-6E
[Diagram: GRAM benchmark paths mapped onto the SRC-6E. Each µP board: two P3/P4 µPs (1/3 GHz) with L2 caches, an MIOC, µProcessor Memory (1.5 GB), and a PCI slot carrying a SNAP card; boards connect over Ethernet and 800/1600 MBytes/s links. MAP III board: two User Chips and a Control Chip connected to On-Board Memory (24 MB) over 4800 (6 x 800) MBytes/s paths, with 2400 (4800*) MBytes/s between the User Chips. Highlighted benchmark paths: OCM - OCM, OCM - Shared Memory, OCM - MPM, Shared Memory to MPM.]
GRAM Benchmarks for the SRC-6E

Benchmark               SRC-6E
OCM – OCM               BRAM – BRAM
OCM – LM                NA
OCM – Shared Memory     BRAM – On-Board Memory
Shared Memory – MPM     On-Board Memory – Common Memory
OCM – MPM               BRAM – Common Memory
LM – MPM                NA
LM – LM                 NA
LM – Shared Memory      NA
Results
Block Diagram for a Single-Bank Transfer between OCM and Shared Memory
[Diagram: timed sequence. Start_timer; µProcessor Memory to Shared Memory (DMA_in); Read_timer(ht0); Read_timer(ht1); Shared Memory to OCM; Read_timer(ht2); OCM to Shared Memory; Read_timer(ht3); Shared Memory to µProcessor Memory (DMA_out); Read_timer(ht4).]
Latency

Transfer                   Minimum Data    Latency in Clock Cycles    Latency in µs
                           Transferred     P III      P IV            P III    P IV
Shared Memory to OCM       1 word*         20         20              0.20     0.20
OCM to Shared Memory       1 word*         15         15              0.15     0.15
OCM to OCM (Bridge Port)   1 word*         11         11              0.11     0.11
Shared Memory to MPM       4 words*        4200       2100            42       21
MPM to Shared Memory       4 words*        1000       1000            10       10
*1 word = 64 bits
Latency
The difference between read and write times for the OCM and Shared Memory is due to the read latency of OBM (6 clocks) vs. BRAM (1 clock).
When transferring data from the MPM to Shared Memory, writes are issued at each clock cycle and there is no startup latency involved.
When reading data from Shared Memory to the MPM, an additional five clock cycles are required to transfer data after the read has been issued.
Data Path from OCM to OCM Using Transfer of Permissions
[Diagram: two Processing Elements (FPGAs), each with OCM1 and OCM2, connected over 64-bit paths to Shared Memory banks A–F (4 MB each); the combined path into each PE is 192 bits wide.]
Data Path from OCM to OCM Using the Bridge Port and the Streaming Protocol
[Diagram: Processing Element FPGA 1 (OCM1) connected to Processing Element FPGA 2 (OCM1) over a 64-bit bridge port; Shared Memory banks A–F (4 MB each) attach over 64-bit paths.]
P III & IV: Bandwidth: OCM and OCM (BM#1)
[Chart: Pentium III and Pentium IV, OCM - OCM. X axis: Number of Bytes (1 to 100,000); Y axis: Bandwidth (MBytes/s), 0–160. Series: Bandwidth (Bridge).]
P III: Bandwidth: OCM and OCM (BM#1)
[Chart: Pentium III, OCM - OCM. X axis: Number of Bytes (1 to 1,000,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: Bandwidth (Trans Perms), Peak Bandwidth.]
P IV : Bandwidth: OCM and OCM (BM#1)
[Chart: Pentium IV, OCM - OCM. X axis: Number of Bytes (1 to 1,000,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: Bandwidth (Trans Perms), Peak Bandwidth.]
P IV : Bandwidth: OCM and OCM (BM#1) (Streaming Protocol in Bridge Port)
[Chart: Pentium IV, OCM - OCM over the Bridge Port with the Streaming Protocol. X axis: Number of Bytes (1 to 1,000,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: Bandwidth (Bridge), Peak Bandwidth.]
Data Path from OCM to MPM and Shared Memory to MPM
[Diagram: a Processing Element (FPGA) with OCM1–OCM3 connected over 64-bit paths to Shared Memory banks A–F (4 MB each); the Control FPGA and SNAP link the Shared Memory to the Microprocessor Memory.]
P III: Bandwidth: OCM and Shared Memory for a single bank
[Chart: Pentium III, OCM - Shared Memory. X axis: Number of Bytes (1 to 1,000,000); left Y axis: Bandwidth (MBytes/s), 0–900; right Y axis: Bits/Clock, 0–70. Series: BW (Shared Memory to OCM), BW (OCM to Shared Memory), Peak Bandwidth, Bits/clock (Shared Memory to OCM), Bits/clock (OCM to Shared Memory).]
P III: Bandwidth: OCM and Shared Memory
[Chart: Pentium III, OCM - Shared Memory. X axis: Number of Bytes (1 to 1,000,000); Y axis: Bandwidth (MBytes/s), 0–3000. Series: Shared Memory to OCM, OCM to Shared Memory, Peak Bandwidth.]
P IV: Bandwidth: OCM and Shared Memory
[Chart: Pentium IV, OCM - Shared Memory. X axis: Number of Bytes (1 to 1,000,000); Y axis: Bandwidth (MBytes/s), 0–3000. Series: Shared Memory to OCM, OCM to Shared Memory, Peak Bandwidth.]
P III: Bandwidth: OCM and µP Memory
[Chart: Pentium III, OCM - MPM. X axis: Number of Bytes (1 to 100,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: µP Memory to OCM, OCM to µP Memory, Peak Bandwidth.]
P IV: Bandwidth: OCM and µP Memory
[Chart: Pentium IV, MPM - OCM. X axis: Number of Bytes (1 to 1,000,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: µP Memory to OCM, OCM to µP Memory, Peak Bandwidth.]
P III: Bandwidth: Shared Memory and µP Memory (BM#5)
[Chart: Pentium III, Shared Memory - MPM. X axis: Number of Bytes (1 to 100,000,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: µP Memory to Shared Memory, Shared Memory to µP Memory, Peak Bandwidth.]
P IV: Bandwidth: Shared Memory and µP Memory
[Chart: Pentium IV, Shared Memory - MPM. X axis: Number of Bytes (1 to 10,000,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: MPM to Shared Memory, Shared Memory to MPM, Peak Bandwidth.]
P III: Bandwidth: Shared Memory and µP Memory
[Chart: Pentium III, Shared Memory - MPM. X axis: Number of Bytes (1 to 100,000,000); left Y axis: Bandwidth (MBytes/s), 0–900; right Y axis: Bits/Clock, 0–30. Series: µP Memory to Shared Memory, Shared Memory to µP Memory, Peak Bandwidth, Bits/clock Into Shared Memory, Bits/clock Into µP Memory.]
P IV: Bandwidth: Shared Memory and µP Memory
[Chart: Pentium IV, Shared Memory - MPM. X axis: Number of Bytes (1 to 100,000,000); left Y axis: Bandwidth (MBytes/s), 0–900; right Y axis: Bits/Clock, 0–70. Series: µP Memory to Shared Memory, Shared Memory to µP Memory, Peak Bandwidth, Bits/clock Into Shared Memory, Bits/clock Into µP Memory.]
Data Path from FPGA Register to Shared Memory
[Diagram: a register on Processing Element FPGA 1 connected over a 64-bit path to Shared Memory banks A–F (4 MB each).]
P III: Bandwidth: Shared Memory and Register
[Chart: Pentium III, Shared Memory - FPGA Register. X axis: Number of Bytes (1 to 1,000,000); left Y axis: Bandwidth (MBytes/s), 0–900; right Y axis: Bits/Clock, 0–70. Series: Shared Memory to Reg, Reg to Shared Memory, Peak Bandwidth, Bits/Clock (Shared Memory to Reg), Bits/Clock (Reg to Shared Memory).]
Conclusion & Future Direction
GRAM Summation for Pentium III

Benchmark | Peak Performance (Mbytes/s) | Max. Sustainable Bandwidth Measured (Mbytes/s) | Efficiency (%) | Normalized Transfer Rate (vs. PCI-X @ 133 MHz, 32 bits, unidirectional)
OCM – OCM a (Bridge Port) | 800 | 149 | 18.6 | 0.28
OCM – OCM b (Trans Perms) | 800 | 793 | 99.13 | 1.5
OCM – OCM c (Streaming) | 800 | NA | NA | NA
OCM – LM | NA | NA | NA | NA
OCM → Shared Memory / Shared Memory → OCM | *2400 | 2373/2373 | 98.8/98.8 | 4.46
OCM → MPM / MPM → OCM | 800/800 | 182.8/227.3 | 22.85/28.41 | 0.34/0.43
Shared Memory → MPM / MPM → Shared Memory | 800/800 | 203/314 | 25.3/39.3 | 0.38/0.59
Shared Memory → Reg / Reg → Shared Memory | 800/800 | 798/798 | 99.75/99.75 | 1.5/1.5
LM – MPM | NA | NA | NA | NA
LM – LM | NA | NA | NA | NA
LM – Shared Memory | NA | NA | NA | NA
* For three banks
GRAM Summation for Pentium IV

Benchmark | Peak Performance (Mbytes/s) | Max. Sustainable Bandwidth Measured (Mbytes/s) | Efficiency (%) | Normalized Transfer Rate (vs. PCI-X @ 133 MHz, 32 bits, unidirectional)
OCM – OCM a (Bridge Port) | 800 | 149 | 18.6 | 0.28
OCM – OCM b (Trans Perms) | 800 | 797.39 | 99.67 | 1.5
OCM – OCM c (Streaming) | 800 | 799.49 | 100 | 1.5
OCM – LM | NA | NA | NA | NA
OCM → Shared Memory / Shared Memory → OCM | *2400 | 2392/2390 | 99.6/99.6 | 4.5/4.5
OCM → MPM / MPM → OCM | 800/800 | 578/562 | 72.25/70.25 | 1.08/1.05
Shared Memory → MPM / MPM → Shared Memory | 800/800 | 796/799 | 99.5/99.8 | 1.5/1.5
Shared Memory → Reg / Reg → Shared Memory | 800/800 | 798/798 | 99.75/99.75 | 1.5/1.5
LM – MPM | NA | NA | NA | NA
LM – LM | NA | NA | NA | NA
LM – Shared Memory | NA | NA | NA | NA
* For three banks
Conclusions
The type of components used plays a major role in determining system performance, as seen in the Pentium III and Pentium IV versions of the SRC-6E.
The software environment and its state of development determine how effectively a program can utilize the hardware. This is clear from the difference in bandwidth achieved across the Bridge ports between the Carte 1.6.2 and Carte 1.7 releases.
Conclusions …
The GRAM Summation Tables serve machine architects in the following ways:
The efficiency column indicates how well a particular communication channel is utilized within the hardware context. If efficiency is low, architects may be able to improve performance with a firmware improvement; if efficiency is high but the normalized bandwidth is low, they should consider a hardware upgrade.
By comparing the normalized bandwidths obtained from the GRAM benchmarks, designers can determine whether data transfer rates are balanced across the architectural modules. This helps identify bottlenecks.
Designers can find out which channels have the maximum efficiency and fine-tune their applications to exploit these channels, achieving the maximum data transfer rate.
Conclusions …
In addition, the GRAM Summation tables provide the following information to application developers:
The tables tell a designer what bottlenecks to expect and where these bottlenecks lie.
By comparing the figures for efficiency and normalized transfer rate, designers can determine whether the bottlenecks are created by the hardware or by the software.
By observing the GRAM summation tables, designers can predict the performance of a pre-designed application on a particular reconfigurable system.
Future Direction
The benchmarks can be expanded to include end-to-end performance under asymmetrical and synthetic workloads.
The benchmarks can also include tables characterizing the performance of reconfigurable computers compared with modern parallel architectures. A performance-to-cost analysis can also be considered.