Matrix Multiplication on FPGA
Final presentation
One semester – winter 2014/15
By: Dana Abergel and Alex Fonariov
Supervisor: Mony Orbach
High Speed Digital System Laboratory
Motivation and Background
• Matrix multiplication is a computationally expensive mathematical operation.
• A naive implementation of the common algorithm can cost a lot of resources and time.
• An efficient matrix multiplication implementation is therefore needed.
Project Goals
• Implementation of an efficient algorithm and infrastructure:
  • Minimum FPGA resources.
  • Minimum run time.
  • Maximum throughput.
• Examine the trade-off.
• Working with memory interfaces (DDR).
Development Platform
• VHDL
• Simulation – ModelSim/Vivado
• Xilinx VC709 Evaluation Board
• FPGA – Virtex-7
• Synthesis – Vivado
Algorithm and Specifications
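\[
A_{128 \times 128} \times B_{128 \times 128} = C_{128 \times 128}
\]
\[
A_{i,j}:\ 8\ \text{bit}, \qquad B_{i,j}:\ 8\ \text{bit}, \qquad C_{i,j}:\ 23\ \text{bit}
\]
\[
C_{i,j} = \sum_{n=1}^{128} A_{i,n} \, B_{n,j}
\]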
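The specification can be checked with a small software reference model. The sketch below is not the project's code: it assumes unsigned 8-bit elements (the slides do not state signedness), and the names `matmul`, `max_sum` and `result_bits` are illustrative only. It also shows why 23 bits are enough for each element of C.

```python
# Reference model of C = A x B, plus a check of the 23-bit result width.
# Assumption: elements are unsigned 8-bit values.
N = 128
ELEM_BITS = 8


def matmul(a, b):
    """Return C with C[i][j] = sum over n of a[i][n] * b[n][j]."""
    size = len(a)
    return [[sum(a[i][n] * b[n][j] for n in range(size)) for j in range(size)]
            for i in range(size)]


# Worst case: every element is 255, so each of the 128 products is 255 * 255.
max_elem = (1 << ELEM_BITS) - 1
max_sum = N * max_elem * max_elem
result_bits = max_sum.bit_length()

if __name__ == "__main__":
    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(max_sum, result_bits)
```

The worst-case sum needs 16 bits for one product plus 7 more bits for adding 128 of them, which matches the 23-bit width of C.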
Block Diagram
• Two clock domains (100 MHz and 200 MHz).
• Read FIFO:
  • Write width: 512 bit.
  • Read width: 1024 bit.
• Write FIFO:
  • Write width: 64 bit.
  • Read width: 32 bit.
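[Figure: block diagram. DDR1 and DDR2 connect to the FPGA logic through DDR_Interconnect. Two input FIFOs each feed a Multiplier → Adder chain ([1023:0] input, [15:0] products, [22:0] sum). Each sum is padded with 9'b0 in bits [31:23] to form a [31:0] word for the results FIFO. Clock labels: 100 MHz and 200 MHz.]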
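The FIFO widths above imply a 1:2 widening on the read path (two 512-bit writes per 1024-bit read) and a 2:1 narrowing on the write path (one 64-bit write per two 32-bit reads). The sketch below models that behaviour in software. The word ordering (first-written word in the upper bits) and the function names `widen` and `narrow` are assumptions for illustration; the slides do not specify the ordering.

```python
# Software model of asymmetric FIFO width conversion.
# Assumption: the first narrow word written lands in the most significant bits.

def widen(words, in_bits=512, ratio=2):
    """Combine each group of `ratio` narrow words into one wide word."""
    out = []
    for start in range(0, len(words) - ratio + 1, ratio):
        wide = 0
        for w in words[start:start + ratio]:
            wide = (wide << in_bits) | w
        out.append(wide)
    return out


def narrow(words, out_bits=32, ratio=2):
    """Split each wide word into `ratio` narrow words, upper part first."""
    mask = (1 << out_bits) - 1
    out = []
    for w in words:
        for part in range(ratio - 1, -1, -1):
            out.append((w >> (out_bits * part)) & mask)
    return out


if __name__ == "__main__":
    wide = widen([1, 2])                 # two 512-bit writes -> one 1024-bit read
    print(wide[0].bit_length())
    print(narrow([(0xAAAAAAAA << 32) | 0x55555555]))
```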
Multiplier
• Consists of 129 multipliers.
• Each multiplier's input width is 8 bit.
• Each multiplier's output width is 16 bit.
• The multipliers are implemented in DSP slices.
• One additional multiplier is used for the valid bit.
[Figure: multiplier stage. Row_Reg [1023:0] and a second [1023:0] input are split into [7:0] lanes; each lane pair drives one multiplier with a [15:0] output. The valid bit, zero-padded in bits [7:1], drives the additional multiplier, whose output is labelled Valid_din [15:0].]
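The multiplier stage can be sketched as a lane-wise product of two 1024-bit words. This is a behavioural model only, not the VHDL: the lane ordering (lane 0 in bits [7:0]), the unsigned interpretation, the choice to model the valid-bit multiplier as `valid * 1`, and the names `unpack`, `pack` and `multiply_stage` are all assumptions.

```python
# Behavioural model of the multiplier stage: 128 data lanes plus one valid lane.
LANES = 128
ELEM_BITS = 8
ELEM_MASK = (1 << ELEM_BITS) - 1


def unpack(word):
    """Split a 1024-bit integer into 128 unsigned bytes (lane 0 = bits [7:0])."""
    return [(word >> (ELEM_BITS * k)) & ELEM_MASK for k in range(LANES)]


def pack(elems):
    """Inverse of unpack: build a 1024-bit integer from 128 bytes."""
    word = 0
    for k, e in enumerate(elems):
        word |= (e & ELEM_MASK) << (ELEM_BITS * k)
    return word


def multiply_stage(row_word, col_word, valid_in):
    """Return the 128 16-bit products and the valid flag carried alongside them."""
    products = [r * c for r, c in zip(unpack(row_word), unpack(col_word))]
    valid_out = valid_in * 1  # assumed behaviour of the 129th multiplier
    return products, valid_out


if __name__ == "__main__":
    prods, valid = multiply_stage(pack(range(LANES)), pack([2] * LANES), 1)
    print(prods[:4], valid)
```

Routing the valid bit through its own multiplier plausibly keeps it aligned with the data, since it then sees the same latency as the products.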
Adder
• Consists of 127 adders.
• The adder width increases by 1 bit at each stage as the data advances through the pipeline.
• Pipelined implementation.
[Figure: adder tree. Pairs of [15:0] products are summed into [16:0] values, then [17:0], and so on through [21:0] to a final [22:0] result.]
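The adder counts and widths above follow from a binary reduction tree over 128 products. The sketch below is a software model under the assumption of unsigned 16-bit inputs; `adder_tree` is an illustrative name, and the model ignores pipeline registers and timing.

```python
# Model of a binary adder tree: each level halves the operand count
# and grows the word width by one bit.

def adder_tree(products, in_bits=16):
    """Return (sum, number of adders used, output width of each level)."""
    level = list(products)
    width = in_bits
    adders = 0
    widths = []
    while len(level) > 1:
        width += 1
        level = [level[k] + level[k + 1] for k in range(0, len(level), 2)]
        # Every partial sum must fit in the width of its level.
        assert all(v < (1 << width) for v in level)
        adders += len(level)
        widths.append(width)
    return level[0], adders, widths


if __name__ == "__main__":
    total, adders, widths = adder_tree([255 * 255] * 128)
    print(total, adders, widths)
```

With 128 inputs the levels use 64 + 32 + 16 + 8 + 4 + 2 + 1 = 127 adders, and the seven levels take the width from 16 bits to 23 bits.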
DDR_Interconnect (IP Integrator)
• Contains the following IPs:
  • Axi_data_mover
  • Axi_interconnect
  • mig
Memory Organization
[Figure: memory layout. Each row of M1 and each column of M2 is stored as one [1023:0] word.
DDR1: Row1(M1), Column1(M2), Column3(M2), Column5(M2), …, Column127(M2), Row3(M1), Column1(M2), Column3(M2), Column5(M2), …, Column127(M2), Row5(M1), …
DDR2: Row2(M1), Column2(M2), Column4(M2), Column6(M2), …, Column128(M2), Row4(M1), Column2(M2), Column4(M2), Column6(M2), …, Column128(M2), …
Results (DDR2): Elem(1,1), Elem(2,1), Elem(1,2), Elem(2,2), Elem(1,3), Elem(2,3), …, Elem(128,128), each stored as a [31:0] word.]
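The layout can be summarised in a few helper functions. This is a sketch under assumptions read off the figure labels: odd-numbered rows and columns go to DDR1 and even-numbered ones to DDR2, element 0 of a row or column occupies bits [7:0], and each 23-bit result is stored in a 32-bit word with the upper 9 bits zero. All function and constant names are hypothetical.

```python
# Sketch of the memory layout: bank selection, word packing, storage footprint.
N = 128
ELEM_BITS = 8
SUM_BITS = 23
RESULT_WORD_BITS = 32


def bank_for(index):
    """Map a 1-based row or column index to its DDR bank (assumed odd/even split)."""
    return "DDR1" if index % 2 == 1 else "DDR2"


def line_word(elems):
    """Pack one row or column (128 unsigned bytes) into a 1024-bit word."""
    assert len(elems) == N
    word = 0
    for k, e in enumerate(elems):
        assert 0 <= e < (1 << ELEM_BITS)
        word |= e << (ELEM_BITS * k)
    return word


def result_word(total):
    """Store a 23-bit sum in a 32-bit word; bits [31:23] stay zero."""
    assert 0 <= total < (1 << SUM_BITS)
    return total


input_bytes_per_matrix = N * N * ELEM_BITS // 8   # bytes for one input matrix
result_bytes = N * N * RESULT_WORD_BITS // 8      # bytes for the result matrix

if __name__ == "__main__":
    print(bank_for(1), bank_for(128), input_bytes_per_matrix, result_bytes)
```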
Flow Chart
Read row(i) (fifo1) → Read row(i+1) (fifo2) → Read col1 (fifo1) → Read col2 (fifo2) → Read col3 (fifo1) → Read col4 (fifo2) → … → Read col128 (fifo2) → i = i+2, repeat
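The read order in the flow chart can be written as a generator. This is an illustrative sketch, not the project's state machine: the function names and the tuple format are assumptions, and indices are 1-based as in the slides.

```python
# Generate the read sequence of the flow chart: two rows, then all 128 columns,
# alternating between fifo1 (odd indices) and fifo2 (even indices).

def read_schedule(first_row, n=128):
    """Yield (kind, index, fifo) for one pass starting at row `first_row`."""
    yield ("row", first_row, "fifo1")
    yield ("row", first_row + 1, "fifo2")
    for col in range(1, n + 1):
        yield ("col", col, "fifo1" if col % 2 == 1 else "fifo2")


def full_schedule(n=128):
    """Yield every read for the whole multiplication, advancing i by 2 each pass."""
    for i in range(1, n + 1, 2):
        yield from read_schedule(i, n)


if __name__ == "__main__":
    for step in list(read_schedule(1))[:6]:
        print(step)
```

Under this model one pass issues 130 reads, and the 64 passes needed for 128 rows issue 8,320 reads in total.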
Design Verification
• BIST (Built-In Self-Test).
• The memory tests (C code running on the MicroBlaze) were adapted to our purposes:
  • Loading the matrices into the DDR.
  • Reading the result matrix from the DDR.
Verification Process
• Three bitstream files are involved in the process:
  1. A BIST that was modified to write the matrices to the DDRs.
  2. Our design, which reads the matrices, performs the arithmetic and writes the result matrix to the DDR.
  3. A BIST that was modified to read the result matrix and print it on the screen.
Performance
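• Total run time: 1.1 s (220,940,637 clock cycles at 200 MHz).
• Throughput:
\[
\frac{128 \times 128 \times 23}{220{,}940{,}637} = \frac{376{,}832}{220{,}940{,}637} \approx 0.0017\ \text{result bits per clock cycle}
\]
• Total FPGA utilization: [utilization figure]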
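The performance figures can be reproduced with a few lines of arithmetic. The constants come from the slide; the variable names are illustrative, and the throughput unit (result bits per clock cycle) is inferred from the numerator and denominator of the formula.

```python
# Reproduce the run time and throughput numbers from the measured cycle count.
CYCLES = 220_940_637        # measured clock cycles
CLOCK_HZ = 200_000_000      # 200 MHz
N = 128
RESULT_BITS = 23

run_time_s = CYCLES / CLOCK_HZ
total_result_bits = N * N * RESULT_BITS
throughput_bits_per_cycle = total_result_bits / CYCLES

if __name__ == "__main__":
    print(f"run time:   {run_time_s:.2f} s")
    print(f"result bits: {total_result_bits}")
    print(f"throughput: {throughput_bits_per_cycle:.4f} bits/cycle")
```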
Summary and Conclusions
• Using two DDRs and a pipelined design gives higher throughput and lower latency.
• The complete set of arithmetic operations finishes in 1.1 seconds.
• We considered several ways of verification:
  • PCI Express
  • UART
  • MicroBlaze
  • BIST
• Suggestion: build an automatic verification environment.
• Results:
  • Memory reading, multiplication and summing, and writing back to memory, with good latency and throughput.
  • Low FPGA utilization.
Thank you!