Matrix Multiplication on FPGA Final presentation One semester – winter 2014/15 By : Dana Abergel and Alex Fonariov Supervisor : Mony Orbach High Speed

Matrix Multiplication on FPGA

Final presentation
One semester – winter 2014/15

By: Dana Abergel and Alex Fonariov
Supervisor: Mony Orbach

High Speed Digital System Laboratory


Motivation and Background

• Matrix multiplication is a complex mathematical operation.

• A naive implementation of the standard algorithm needs on the order of n³ multiply-accumulate operations for n×n matrices, which costs a lot of resources and time.

• An efficient matrix multiplication implementation is needed.

Project Goals

• Implementation of an efficient algorithm and infrastructure:
  • Minimum FPGA resources.
  • Minimum run time.
  • Maximum throughput.
• Examine the trade-off.
• Working with memory interfaces (DDR).

Development Platform

• VHDL
• Simulation – ModelSim/Vivado
• Xilinx – VC709 Evaluation Board
• FPGA – Virtex-7
• Synthesis – Vivado

Algorithm and Specifications

$$A_{128 \times 128} \times B_{128 \times 128} = C_{128 \times 128}$$

$$C_{i,j} = \sum_{n=1}^{128} A_{i,n} \, B_{n,j}$$

• $A_{i,j}$: 8 bit
• $B_{i,j}$: 8 bit
• $C_{i,j}$: 23 bit
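The specification above can be checked with a small software reference model. The sketch below is only an illustration in Python, not the project's VHDL; the function name `matmul_reference` is hypothetical, and unsigned 8-bit inputs are an assumption (the slides do not state signedness). It also shows why 23 bits are enough for each element of C under that assumption.

```python
N = 128          # matrix dimension from the specification
IN_BITS = 8      # width of each element of A and B
OUT_BITS = 23    # width of each element of C


def matmul_reference(a, b):
    """Compute C[i][j] = sum over n of A[i][n] * B[n][j] with plain integers."""
    size = len(a)
    return [[sum(a[i][n] * b[n][j] for n in range(size)) for j in range(size)]
            for i in range(size)]


# Worst case for unsigned inputs (assumption): every element is 2**8 - 1 = 255.
max_elem = (1 << IN_BITS) - 1
max_sum = N * max_elem * max_elem          # largest possible element of C
fits_in_out_bits = max_sum < (1 << OUT_BITS)
```

Each product of two 8-bit values needs 16 bits, and summing 128 = 2⁷ of them adds 7 more bits, which gives 16 + 7 = 23.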

Block Diagram

• Two clock domains: 100 MHz and 200 MHz.
• Read FIFO:
  • Write width: 512 bit
  • Read width: 1024 bit
• Write FIFO:
  • Write width: 64 bit
  • Read width: 32 bit

[Block diagram: DDR1 and DDR2 connect to the FPGA through DDR_Interconnect. Inside the FPGA, two parallel datapaths each run FIFO → Multiplier ([1023:0] in, [15:0] products) → Adder ([22:0] out). Each 23-bit sum is extended to a 32-bit word [31:0] by placing 9'b0 in bits [31:23], then passed to the results FIFO.]

Multiplier

• Consists of 129 multipliers: 128 for the data and one additional multiplier for the valid bit.
• Each multiplier input is 8 bits wide.
• Each multiplier output is 16 bits wide.
• The multipliers are implemented in DSP slices.

[Figure: Row_Reg [1023:0] and the incoming 1024-bit word [1023:0] are split into 8-bit slices [7:0]. Each pair of slices feeds one multiplier, which produces a 16-bit product [15:0]. Valid_din feeds the additional multiplier, zero-padded (labels 9'b0, [7:1]).]
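As a rough behavioural illustration of the multiplier stage, the Python sketch below slices two 1024-bit words into 128 lanes of 8 bits and multiplies matching lanes. It is not the project's VHDL. The helper names (`pack_word`, `unpack_word`, `multiplier_stage`), the lane ordering (lane 0 in bits [7:0]), and the unsigned interpretation are assumptions, and the valid-bit multiplier is left out.

```python
LANES = 128      # 8-bit elements per 1024-bit word
ELEM_BITS = 8    # multiplier input width


def pack_word(lanes):
    """Pack 128 unsigned 8-bit values into one 1024-bit integer (lane 0 = bits [7:0])."""
    word = 0
    for k, value in enumerate(lanes):
        word |= (value & 0xFF) << (ELEM_BITS * k)
    return word


def unpack_word(word):
    """Split a 1024-bit integer into 128 unsigned 8-bit lanes."""
    mask = (1 << ELEM_BITS) - 1
    return [(word >> (ELEM_BITS * k)) & mask for k in range(LANES)]


def multiplier_stage(row_reg, column):
    """Multiply matching lanes of Row_Reg and the incoming column word."""
    return [r * c for r, c in zip(unpack_word(row_reg), unpack_word(column))]


# Usage: worst-case inputs, every lane holds 255.
products = multiplier_stage(pack_word([255] * LANES), pack_word([255] * LANES))
```

Every product stays below 2¹⁶, which matches the 16-bit multiplier output width on the slide.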

Adder

• Consists of 127 adders.
• The adder width increases by 1 bit at each pipeline stage, from 16-bit inputs [15:0] to a 23-bit result [22:0].
• Pipeline implementation.

[Figure: pairs of 16-bit products [15:0] feed 17-bit adders [16:0]; their outputs feed 18-bit adders [17:0], and so on through [21:0] to the final 23-bit sum [22:0].]
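The adder count and the growing widths are what a binary adder tree over 128 inputs would give. The Python sketch below models such a tree level by level; it is an assumption about the structure based on the figures on this slide, not the project's VHDL, and the function name `adder_tree` is hypothetical. It expects a power-of-two number of inputs.

```python
def adder_tree(products, in_bits=16):
    """Sum the inputs pairwise, level by level, as a pipelined binary tree would.

    Returns the final sum, the total number of adders used, and the adder
    output width at each level.
    """
    level = list(products)
    width = in_bits
    adders = 0
    widths = []
    while len(level) > 1:
        # One pipeline stage: add neighbouring pairs.
        level = [level[k] + level[k + 1] for k in range(0, len(level), 2)]
        adders += len(level)
        width += 1                      # one extra bit per stage for the carry
        widths.append(width)
        # Every partial sum must fit in this stage's width.
        assert all(v < (1 << width) for v in level)
    return level[0], adders, widths


# Usage: 128 worst-case 16-bit products (255 * 255 each).
total, adder_count, stage_widths = adder_tree([255 * 255] * 128)
```

With 128 inputs the stages hold 64, 32, 16, 8, 4, 2 and 1 adders, 127 in total, and seven stages take the width from 16 to 23 bits.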

DDR_Interconnect (IP Integrator)

• Contains the following IPs:
  • Axi_data_mover
  • Axi_interconnect
  • mig

Memory Organization

[Figure: memory layout. Operand entries are 1024 bits wide ([1023:0]); result entries are 32 bits wide ([31:0]).]

• DDR1 (operands): Row1(M1), Column1(M2), Column3(M2), Column5(M2), ..., Column127(M2), Row3(M1), Column1(M2), Column3(M2), Column5(M2), ..., Column127(M2), Row5(M1), ...
• DDR2 (operands): Row2(M1), Column2(M2), Column4(M2), Column6(M2), ..., Column128(M2), Row4(M1), Column2(M2), Column4(M2), Column6(M2), ..., Column128(M2), ...
• DDR2 (results): Elem(1,1), Elem(2,1), Elem(1,2), Elem(2,2), Elem(1,3), Elem(2,3), ..., Elem(128,128)

Flow Chart

1. Read row(i) (fifo1)
2. Read row(i+1) (fifo2)
3. Read col1 (fifo1), col2 (fifo2), col3 (fifo1), col4 (fifo2), ..., col128 (fifo2)
4. i ← i+2, then repeat from step 1
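The read order in the flow chart can be written out as a short generator. This is a Python sketch of the sequence only, with hypothetical names (`read_schedule`, and the string labels for rows, columns and FIFOs); it assumes the loop repeats for every pair of rows until all 128 rows are consumed.

```python
def read_schedule(n=128):
    """Yield (item, fifo) pairs in the order given by the flow chart."""
    for i in range(1, n + 1, 2):
        yield (f"row{i}", "fifo1")          # odd row goes to fifo1
        yield (f"row{i + 1}", "fifo2")      # the next row goes to fifo2
        for col in range(1, n + 1):
            # Odd columns go to fifo1, even columns to fifo2.
            yield (f"col{col}", "fifo1" if col % 2 == 1 else "fifo2")
        # i advances by 2 for the next pair of rows.


# Usage: the first few reads for a small 4x4 case.
first_reads = list(read_schedule(4))[:6]
```

For n = 128 the schedule contains 64 row pairs, each followed by 128 column reads, so 64 × 130 = 8,320 reads in total.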

Design Verification

• BIST (Built-In Self Test)
• Memory tests (C code running on the MicroBlaze) were adapted to our purposes:
  • Loading the matrices to the DDR.
  • Reading the result matrix from the DDR.

Verification Process

• Three bitstream files are involved in the process:
  1. A BIST that was modified to write the matrices to the DDRs.
  2. Our design, which reads the matrices, performs the arithmetic, and writes the result matrix to the DDR.
  3. A BIST that was modified to read the result matrix and print it on the screen.

Performance

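• Total run time: 1.1 sec (220,940,637 clock cycles at 200 MHz)
• Throughput:

$$\frac{128 \times 128 \times 23}{220{,}940{,}637} = \frac{376{,}832}{220{,}940{,}637} \approx 0.0017 \ \text{bits per clock cycle}$$

• Total FPGA utilization: [figure: utilization report]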
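The run time and throughput figures can be reproduced from the cycle count with a few lines of arithmetic. The Python sketch below uses only the numbers on this slide; the variable names are illustrative, and reading the throughput as result bits per clock cycle is an interpretation of the formula.

```python
CLOCK_HZ = 200_000_000     # 200 MHz clock
CYCLES = 220_940_637       # measured clock cycles
N = 128                    # matrix dimension
RESULT_BITS = 23           # width of each element of C

run_time_s = CYCLES / CLOCK_HZ                      # total run time in seconds
result_bits = N * N * RESULT_BITS                   # bits in the result matrix
throughput_bits_per_cycle = result_bits / CYCLES    # result bits per clock cycle

print(f"run time:   {run_time_s:.2f} s")
print(f"throughput: {throughput_bits_per_cycle:.4f} bits/cycle")
```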

Summary and Conclusions

• Higher throughput and lower latency as a result of using two DDRs and a pipelined design.
• The full product (128³ = 2,097,152 multiplications plus the accompanying additions) completes in 1.1 seconds.
• We considered several verification methods:
  • PCI Express
  • UART
  • MicroBlaze
  • BIST
• Suggestion: build an automatic verification environment.
• Results:
  • Memory reads, multiplication and summation, and write-back to memory, at the run time and throughput reported under Performance.
  • Low FPGA utilization.

Thank you!
