Experimental Performance Evaluation For Reconfigurable Computer Systems: The GRAM Benchmarks
Chitalwala. E., El-Ghazawi. T., Gaj. K.,
The George Washington University, George Mason University.
MAPLD 2004, Washington DC
Chitalwala 2 2004 MAPLD – 1010
Abbreviations
BRAM – Block RAM
GRAM – Generalized Reconfigurable Architecture Model
LM – Local Memory
Max – Maximum
MAP – Multi Adaptive Processor
MPM – Microprocessor Memory
OCM – On-Chip Memory
PE – Processing Element
Trans Perms – Transfer of Permissions
Outline
Problem Statement
GRAM Description
Assumptions and Methodology
Testbed Description: SRC-6E
Results
Conclusion and Future Direction
Problem Statement
Develop a standardized model of Reconfigurable Architectures.
Define a set of synthetic benchmarks based on this model to analyze performance and discover bottlenecks.
Evaluate the system against the peak performance specifications given by the manufacturer.
Prove the concept by using these benchmarks to assess and dynamically characterize the performance of a reconfigurable system, using the SRC-6E as a test case.
Generalized Reconfigurable Architecture Model (GRAM)
GRAM Benchmarks: Objective
To measure maximum sustainable data transfer rates and latency between the various elements of the GRAM.
Dynamically characterize the performance of the system against system peak performance.
Generalized Reconfigurable Architecture Model (GRAM)
GRAM Elements
PE – Processing Element
OCM – On-Chip Memory
LM – Local Memory
Interconnect Network / Shared Memory
Bus Interface
Microprocessor Memory
GRAM Benchmarks
OCM – OCM: Measure max. sustainable bandwidth and latency between two OCMs residing on different PEs.
OCM – LM: Measure max. sustainable bandwidth and latency between OCM and LM in either direction.
OCM - Shared Memory: Measure max. sustainable bandwidth and latency between OCM and Shared Memory in either direction.
Shared Memory – MPM: Measure max. sustainable bandwidth and latency between Shared Memory and MPM in either direction.
GRAM Benchmarks
OCM – MPM: Measure max. sustainable bandwidth and latency between OCM and MPM in either direction.
LM – MPM: Measure max. sustainable bandwidth and latency between LM and MPM in either direction.
LM – LM: Measure max. sustainable bandwidth and latency between LM and LM in either direction.
LM – Shared Memory: Measure max. sustainable bandwidth and latency between LM and Shared Memory in either direction.
GRAM Assumptions
Assumptions
All devices on board are fed by a single clock.
There is no direct path between the Local Memories of individual elements.
Connections for add-on cards may exist but are not shown.
The generalized architecture has been created based on precedents set by past and current manufacturers of Reconfigurable Systems.
Methodology
Data paths can be parallelized to the maximum extent possible.
Inputs and outputs have been kept symmetrical.
Hardware timers have been used to measure the time taken to transfer data.
Measurements have been taken for transfers of increasingly large amounts of data.
Data must be verified for correctness after transfers.
Multiple paths may exist between the elements specified; our aim is to measure the fastest path available.
All experiments are conducted using the programming model and library functions of the system.
Testbed Description: SRC-6E
Hardware Architecture of the SRC-6E
[Diagram: two µP boards and a MAP III board. Each µP board holds two P3/P4 µPs with L2 caches, an MIOC, Private Memory (1.5 GB), a PCI slot carrying a SNAP card, and an Ethernet connection. The MAP III board holds two User Chips (Xilinx XC2V6000), a Control Chip, and On-Board Memory (4 MB x 6). The SNAP-to-MAP links run at 800/1600 Mbytes/sec; the six 64-bit On-Board Memory banks each connect at 800 Mbytes/sec.]
Programming Model of the SRC-6E
[Diagram: compilation flow. .c or .f files → µP Compiler → .o files. .mc or .mf files → MAP Compiler → .v files → Logic Synthesis (from .vhd or .v files) → .ngo files → Place & Route → .bin files. The Linker combines the .o and .bin files into the Application Executable.]
GRAM Benchmarks for the SRC-6E
[Diagram: GRAM benchmark paths mapped onto the SRC-6E. Each µP board: two P3/P4 µPs (1/3 GHz) with L2 caches, an MIOC, µProcessor Memory (1.5 GB), and a PCI slot carrying a SNAP card; boards connect over Ethernet and 800/1600 MBytes/s links. MAP III board: two User Chips and a Control Chip connected to On-Board Memory (24 MB) over 4800 (6 x 800) MBytes/s paths, with 2400 (4800*) MBytes/s between the User Chips. Highlighted benchmark paths: OCM - OCM, OCM - Shared Memory, OCM - MPM, Shared Memory to MPM.]
GRAM Benchmarks for the SRC-6E

Benchmark               SRC-6E
OCM – OCM               BRAM – BRAM
OCM – LM                NA
OCM – Shared Memory     BRAM – On-Board Memory
Shared Memory – MPM     On-Board Memory – Common Memory
OCM – MPM               BRAM – Common Memory
LM – MPM                NA
LM – LM                 NA
LM – Shared Memory      NA
Results
Block Diagram for a Single-Bank Transfer between OCM and Shared Memory
[Diagram: timed sequence. Start_timer; µProcessor Memory to Shared Memory (DMA_in); Read_timer(ht0); Read_timer(ht1); Shared Memory to OCM; Read_timer(ht2); OCM to Shared Memory; Read_timer(ht3); Shared Memory to µProcessor Memory (DMA_out); Read_timer(ht4).]
Latency

Transfer                   Minimum Data    Latency in Clock Cycles    Latency in µs
                           Transferred     P III      P IV            P III    P IV
Shared Memory to OCM       1 word*         20         20              0.20     0.20
OCM to Shared Memory       1 word*         15         15              0.15     0.15
OCM to OCM (Bridge Port)   1 word*         11         11              0.11     0.11
Shared Memory to MPM       4 words*        4200       2100            42       21
MPM to Shared Memory       4 words*        1000       1000            10       10
*1 word = 64 bits
Latency
The difference between read and write times for the OCM and Shared Memory is due to the read latency of OBM (6 clocks) vs. BRAM (1 clock).
When transferring data from the MPM to Shared Memory, writes are issued at each clock cycle and there is no startup latency involved.
When reading data from Shared Memory to the MPM, an additional five clock cycles are required to transfer data after the read has been issued.
Data Path from OCM to OCM Using Transfer of Permissions
[Diagram: two Processing Elements (FPGAs), each with OCM1 and OCM2, connected over 64-bit paths to Shared Memory banks A–F (4 MB each); the combined path into each PE is 192 bits wide.]
Data Path from OCM to OCM Using the Bridge Port and the Streaming Protocol
[Diagram: Processing Element FPGA 1 (OCM1) connected to Processing Element FPGA 2 (OCM1) over a 64-bit bridge port; Shared Memory banks A–F (4 MB each) attach over 64-bit paths.]
P III & IV: Bandwidth: OCM and OCM (BM#1)
[Chart: Pentium III and Pentium IV, OCM - OCM. X axis: Number of Bytes (1 to 100,000); Y axis: Bandwidth (MBytes/s), 0–160. Series: Bandwidth (Bridge).]
P III: Bandwidth: OCM and OCM (BM#1)
[Chart: Pentium III, OCM - OCM. X axis: Number of Bytes (1 to 1,000,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: Bandwidth (Trans Perms), Peak Bandwidth.]
P IV : Bandwidth: OCM and OCM (BM#1)
[Chart: Pentium IV, OCM - OCM. X axis: Number of Bytes (1 to 1,000,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: Bandwidth (Trans Perms), Peak Bandwidth.]
P IV : Bandwidth: OCM and OCM (BM#1) (Streaming Protocol in Bridge Port)
[Chart: Pentium IV, OCM - OCM over the Bridge Port with the Streaming Protocol. X axis: Number of Bytes (1 to 1,000,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: Bandwidth (Bridge), Peak Bandwidth.]
Data Path from OCM to MPM and Shared Memory to MPM
[Diagram: a Processing Element (FPGA) with OCM1–OCM3 connected over 64-bit paths to Shared Memory banks A–F (4 MB each); the Control FPGA and SNAP link the Shared Memory to the Microprocessor Memory.]
P III: Bandwidth: OCM and Shared Memory for a single bank
[Chart: Pentium III, OCM - Shared Memory. X axis: Number of Bytes (1 to 1,000,000); left Y axis: Bandwidth (MBytes/s), 0–900; right Y axis: Bits/Clock, 0–70. Series: BW (Shared Memory to OCM), BW (OCM to Shared Memory), Peak Bandwidth, Bits/clock (Shared Memory to OCM), Bits/clock (OCM to Shared Memory).]
P III: Bandwidth: OCM and Shared Memory
[Chart: Pentium III, OCM - Shared Memory. X axis: Number of Bytes (1 to 1,000,000); Y axis: Bandwidth (MBytes/s), 0–3000. Series: Shared Memory to OCM, OCM to Shared Memory, Peak Bandwidth.]
P IV: Bandwidth: OCM and Shared Memory
[Chart: Pentium IV, OCM - Shared Memory. X axis: Number of Bytes (1 to 1,000,000); Y axis: Bandwidth (MBytes/s), 0–3000. Series: Shared Memory to OCM, OCM to Shared Memory, Peak Bandwidth.]
P III: Bandwidth: OCM and µP Memory
[Chart: Pentium III, OCM - MPM. X axis: Number of Bytes (1 to 100,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: µP Memory to OCM, OCM to µP Memory, Peak Bandwidth.]
P IV: Bandwidth: OCM and µP Memory
[Chart: Pentium IV, MPM - OCM. X axis: Number of Bytes (1 to 1,000,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: µP Memory to OCM, OCM to µP Memory, Peak Bandwidth.]
P III: Bandwidth: Shared Memory and µP Memory (BM#5)
[Chart: Pentium III, Shared Memory - MPM. X axis: Number of Bytes (1 to 100,000,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: µP Memory to Shared Memory, Shared Memory to µP Memory, Peak Bandwidth.]
P IV: Bandwidth: Shared Memory and µP Memory
[Chart: Pentium IV, Shared Memory - MPM. X axis: Number of Bytes (1 to 10,000,000); Y axis: Bandwidth (MBytes/s), 0–900. Series: MPM to Shared Memory, Shared Memory to MPM, Peak Bandwidth.]
P III: Bandwidth: Shared Memory and µP Memory
[Chart: Pentium III, Shared Memory - MPM. X axis: Number of Bytes (1 to 100,000,000); left Y axis: Bandwidth (MBytes/s), 0–900; right Y axis: Bits/Clock, 0–30. Series: µP Memory to Shared Memory, Shared Memory to µP Memory, Peak Bandwidth, Bits/clock Into Shared Memory, Bits/clock Into µP Memory.]
P IV: Bandwidth: Shared Memory and µP Memory
[Chart: Pentium IV, Shared Memory - MPM. X axis: Number of Bytes (1 to 100,000,000); left Y axis: Bandwidth (MBytes/s), 0–900; right Y axis: Bits/Clock, 0–70. Series: µP Memory to Shared Memory, Shared Memory to µP Memory, Peak Bandwidth, Bits/clock Into Shared Memory, Bits/clock Into µP Memory.]
Data Path from FPGA Register to Shared Memory
[Diagram: a register on Processing Element FPGA 1 connected over a 64-bit path to Shared Memory banks A–F (4 MB each).]
P III: Bandwidth: Shared Memory and Register
[Chart: Pentium III, Shared Memory - FPGA Register. X axis: Number of Bytes (1 to 1,000,000); left Y axis: Bandwidth (MBytes/s), 0–900; right Y axis: Bits/Clock, 0–70. Series: Shared Memory to Reg, Reg to Shared Memory, Peak Bandwidth, Bits/Clock (Shared Memory to Reg), Bits/Clock (Reg to Shared Memory).]
Conclusion & Future Direction
GRAM Summation for Pentium III

Benchmark | Peak Performance (Mbytes/s) | Max. Sustainable Bandwidth Measured (Mbytes/s) | Efficiency (%) | Normalized Transfer Rate (vs. PCI-X @ 133 MHz, 32 bits, unidirectional)
OCM – OCM a (Bridge Port) | 800 | 149 | 18.6 | 0.28
OCM – OCM b (Trans Perms) | 800 | 793 | 99.13 | 1.5
OCM – OCM c (Streaming) | 800 | NA | NA | NA
OCM – LM | NA | NA | NA | NA
OCM → Shared Memory / Shared Memory → OCM | *2400 | 2373/2373 | 98.8/98.8 | 4.46
OCM → MPM / MPM → OCM | 800/800 | 182.8/227.3 | 22.85/28.41 | 0.34/0.43
Shared Memory → MPM / MPM → Shared Memory | 800/800 | 203/314 | 25.3/39.3 | 0.38/0.59
Shared Memory → Reg / Reg → Shared Memory | 800/800 | 798/798 | 99.75/99.75 | 1.5/1.5
LM – MPM | NA | NA | NA | NA
LM – LM | NA | NA | NA | NA
LM – Shared Memory | NA | NA | NA | NA
* For three banks
GRAM Summation for Pentium IV

Benchmark | Peak Performance (Mbytes/s) | Max. Sustainable Bandwidth Measured (Mbytes/s) | Efficiency (%) | Normalized Transfer Rate (vs. PCI-X @ 133 MHz, 32 bits, unidirectional)
OCM – OCM a (Bridge Port) | 800 | 149 | 18.6 | 0.28
OCM – OCM b (Trans Perms) | 800 | 797.39 | 99.67 | 1.5
OCM – OCM c (Streaming) | 800 | 799.49 | 100 | 1.5
OCM – LM | NA | NA | NA | NA
OCM → Shared Memory / Shared Memory → OCM | *2400 | 2392/2390 | 99.6/99.6 | 4.5/4.5
OCM → MPM / MPM → OCM | 800/800 | 578/562 | 72.25/70.25 | 1.08/1.05
Shared Memory → MPM / MPM → Shared Memory | 800/800 | 796/799 | 99.5/99.8 | 1.5/1.5
Shared Memory → Reg / Reg → Shared Memory | 800/800 | 798/798 | 99.75/99.75 | 1.5/1.5
LM – MPM | NA | NA | NA | NA
LM – LM | NA | NA | NA | NA
LM – Shared Memory | NA | NA | NA | NA
* For three banks
Conclusions
The type of components used plays a major role in determining system performance, as seen in the Pentium III and Pentium IV versions of the SRC-6E.
The software environment and its state of development determine how effectively a program can utilize the hardware. This is clear from the difference in bandwidth achieved across the Bridge ports between the Carte 1.6.2 and Carte 1.7 releases.
Conclusions …
The GRAM Summation Tables serve machine architects in the following ways:
The efficiency column indicates how well a particular communication channel is utilized within the hardware context. If efficiency is low, architects may be able to improve performance with a firmware improvement; if efficiency is high but the normalized bandwidth is low, they should consider a hardware upgrade.
By comparing the normalized bandwidths obtained from the GRAM benchmarks, designers can determine whether data transfer rates are balanced across the architectural modules. This helps identify bottlenecks.
Designers can find out which channels have the maximum efficiency and fine-tune their applications to exploit these channels, achieving the maximum data transfer rate.
Conclusions …
In addition, the GRAM Summation tables provide the following information to application developers:
The tables tell a designer what bottlenecks to expect and where these bottlenecks lie.
By comparing the figures for efficiency and normalized transfer rate, designers can determine whether the bottlenecks are created by the hardware or by the software.
By observing the GRAM summation tables, designers can predict the performance of a pre-designed application on a particular reconfigurable system.
Future Direction
The benchmarks can be expanded to include end-to-end performance under asymmetrical and synthetic workloads.
The benchmarks can also include tables characterizing the performance of reconfigurable computers compared with modern parallel architectures. A performance-to-cost analysis can also be considered.