33
18643F17L10S1, James C. Hoe, CMU/ECE/CALCM, ©2017 18643 Lecture 10: Vivado CtoIP HLS James C. Hoe Department of ECE Carnegie Mellon University

18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S1, James C. Hoe, CMU/ECE/CALCM, ©2017

18‐643 Lecture 10:Vivado C‐to‐IP HLS

James C. HoeDepartment of ECE

Carnegie Mellon University

Page 2: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S2, James C. Hoe, CMU/ECE/CALCM, ©2017

Housekeeping• Your goal today: learn how to tell Vivado HLS what you really want and understand what Vivado HLS is telling you

• Notices– Handout #4: lab 2, due noon, 10/6– 3.5 weeks to project proposal

• Readings– Ch 15, The Zynq Book (skim Ch 14)– Vivado Design Suite User Guide: High‐Level Synthesis (UG902)

Page 3: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017

Tortoise and Hare• Tortoise 

– delivers exact optimal implementation to a fullyspecified objective (functional + tuning)

– perfection takes timesay last 10% of quality takes up 90% of the time

• Hare– only gets to 90% quality– delivers the design 10 times faster

This hare doesn’t take a nap after one design . . .

Page 4: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S4, James C. Hoe, CMU/ECE/CALCM, ©2017

Good Enough Box

The Design Race

1/perf

power

out‐of‐time 90%

best possibleeducated guess

hey, it works

Page 5: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S5, James C. Hoe, CMU/ECE/CALCM, ©2017

Why the Hare Wins• In real design projects

– don’t always know exact target initially– can’t land first shot on target anyway– good enough really is good enough– hitting schedule is everything

show at COMDEX in Nov or bust in Dec• There are a lot more rabbits than turtles in this world; there are not enough turtles in this world

Even more turkeys . . . but that’s a different class

All characters appearing in this story are fictitious. Any resemblance to real persons, living or dead, is purely coincidental.

Page 6: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S6, James C. Hoe, CMU/ECE/CALCM, ©2017

Vivado HLS

Page 7: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S7, James C. Hoe, CMU/ECE/CALCM, ©2017

Function‐to‐IP, not Program‐to‐HW • **Object of design is an IP module**• Designer still in charge (garbage in, garbage out)

– specify functionality as algorithm (in C)– specify structure as pragmas (beyond C)– set optimization constraints (beyond C)Offload bit‐ and cycle‐level design/opt. to tools

• Vivado HLS (formerly AutoESL; formerly UCLA)– never mind all of C (what’s main( )? what malloc?)– never mind all usages of allowed subset (all loops okay, but static ones actually work well)

– what else beyond C might a HW designer need (types, interface, structural hints)

Page 8: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S8, James C. Hoe, CMU/ECE/CALCM, ©2017

What does Vivado see?int fibi(int n) {

int last=1; int lastlast=0; int temp;

if (n==0) return 0;if (n==1) return 1;

for(;n>1;n--) {temp=last+lastlast;lastlast=last;last=temp;

}

return temp;}

Page 9: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S9, James C. Hoe, CMU/ECE/CALCM, ©2017

Function to IP Block

n fibi

int fibi(int n) { . . . .return ...;

}

ap_clkap_rst

ap_readyap_doneap_idleap_start

What if I want multiple outputs?

Don’t lookinside yet

Page 10: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S10, James C. Hoe, CMU/ECE/CALCM, ©2017

ap_clk

ap_ready

ap_done

ap_rst

AP_CTRL_HS Block Protocol

ap_idle

inputsconsumed

output valid

ap_start

ready for new ap_start

1

2

1’

3

I

O

Page 11: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S11, James C. Hoe, CMU/ECE/CALCM, ©2017

Function Invocation:Latency vs Throughput

start ready done

start ready done

start ready done

minimuminitiation interval

latency

Page 12: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S12, James C. Hoe, CMU/ECE/CALCM, ©2017

Other Block Control Options• ap_ctrl_chain

– separate input producer and output consumer– ap_continue: driven by the consumer to backpressure the block and producer

– IF a block reaches “done” AND ap_continue is deasserted, the block will hold ap_done and keep output valid until ap_continue is asserted

• AXI compatible port interfaces– software on ARM interacts with the block using fxn‐call‐like interfaces (input, output, start, etc.) 

– IP‐specific .h and routines generated automatically

Page 13: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S13, James C. Hoe, CMU/ECE/CALCM, ©2017

Scalar I/O Port Timing• By default (ap_none)

– input ports should be stable between ap_start and ap_ready

– output port is valid when ap_done• 3 asynchronous handshake options on input

– ap_vld only: consumes only if input valid– ap_ack only: signals back when input consumed– ap_hs: ap_vld + ap_ack

• HLS’s job to follow protocol   nap_vldap_ack

Page 14: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S14, James C. Hoe, CMU/ECE/CALCM, ©2017

Pass‐by‐Reference Argumentsvoid fibi(int *n, int *fib) {

int last=1; int lastlast=0; int temp;int nn=*n;

if (nn==0) { *fib=0; *n=0; return; }if (nn==1) { *fib=1; *n=0; return; }for(;nn>1;nn--) {temp=last+lastlast;lastlast=last;last=temp;

}

*fib=last; *n=lastlast;}

Page 15: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S15, James C. Hoe, CMU/ECE/CALCM, ©2017

Pass‐by‐Reference I/O

Don’t lookinside yet

n_ifib

void fibi(int *n, int *fib) {. . . .*n in RHS and LHS; *fib in LHS only. . . .

} used before assigned

ap_clk ap_readyap_doneap_idle

ap_rstap_start

n_o

They are not really “pointers”• do not evaluate *(fib+1) or fib• except to pretend to be a fifo

Page 16: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S16, James C. Hoe, CMU/ECE/CALCM, ©2017

All I/O Options

Fig 1‐49, Vivado Design Suite User Guide: High‐Level Synthesis

Page 17: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S17, James C. Hoe, CMU/ECE/CALCM, ©2017

Array Arguments#define N (1<<10)void D2XPY (double Y[N], double X[N]) {

for(i=0; i<N; i++) {Y[i]=2*X[i]+Y[i];

}} X_q0[63:0]

X_ce0X_addr0[9:0]

Y_q0[63:0]Y_ce0Y_we0

Y_addr0[9:0]

*could ask to   use separateread and writeports

Page 18: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S18, James C. Hoe, CMU/ECE/CALCM, ©2017

Array Arg Options• By default, array args become BRAM ports

– array must be fixed size– can use 2 ports for bandwidth or split read/write

• If array arg is accessed always consecutively AND only either read or written– can become ap_fifo port – i.e., no addresses, just push or pop

• Array args can also become AXI or a generic bus master portsScheduler handles port sharing and dynamic delays

Page 19: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S19, James C. Hoe, CMU/ECE/CALCM, ©2017

Time to Look Inside

n fibi

ap_clk ap_readyap_doneap_idle

ap_rstap_start

Page 20: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S20, James C. Hoe, CMU/ECE/CALCM, ©2017

MMM (yet again)void mmm(char A[N][N], char B[N][N], short C[N][N) {

for(int i=0; i<N; i++) {for(int j=0; j<N; j++) {

C[i][j]=0;for(int k=0; k<N; k++) {

C[i][j] += A[i][k]*B[k][j];}

}}

}

mmm

BRAM Rd

BRAM Rd

N2‐by‐8b BRAM

N2‐by‐8b BRAM

N2‐by‐8b BRAM Same example as

Zynq Book Tutorial 3

BRAM Rd/Wr

keep it simple

Page 21: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S21, James C. Hoe, CMU/ECE/CALCM, ©2017

rd0 Ard0 B rd0 C

A*B accum wr0 C

rd0 Ard0 B rd0 C

A*B accum wr0 C

rd0 Ard0 B rd0 C

A*B accum wr0 C

Structural Pragma: Pipelining• Fully elaborate scope (e.g., unroll loops)• Find minimum “iteration interval (II)” schedule

– II >= num stages a resource instance is used– II >= RAW hazard distance

• E.g., to pipeline C[i][j]+=A[i][k]*B[k][j];

rd0 Ard0 B rd0 C

A*B accum wr0 Cstructuralconflict, II>=2(II>=1 if 2‐port)

RAW hazard,II>=3

Page 22: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S22, James C. Hoe, CMU/ECE/CALCM, ©2017

HLS Analysis and Visualization// Zynq Book Tutorial 3, Sol#2for(int i=0; i<5; i++) {

for(int j=0; j<5; j++) {C[i][j]=0;for(int k=0; k<5; k++) {

#pragma HLS PIPELINEC[i][j] += A[i][k]*B[k][j];

[Vivado HLS Screenshots]

Page 23: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S23, James C. Hoe, CMU/ECE/CALCM, ©2017

// Zynq Book Tutorial 3, Sol#3for(int i=0; i<5; i++) {

for(int j=0; j<5; j++) {C[i][j]=0;

#pragma HLS PIPELINEfor(int k=0; k<5; k++) {

C[i][j] += A[i][k]*B[k][j];

Design by Trial and Error

[Vivado HLS Screenshots]

Page 24: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S24, James C. Hoe, CMU/ECE/CALCM, ©2017

Design by Trial and Error

[Vivado HLS Screenshots]

// Zynq Book Tutorial 3, Sol#4#program HLS ARRAY_RESHAPE variable=A, dim=2#program HLS ARRAY_RESHAPE variable=B, dim=1for(int i=0; i<5; i++) {

for(int j=0; j<5; j++) {C[i][j]=0;

#pragma HLS PIPELINEfor(int k=0; k<5; k++) {

C[i][j] += A[i][k]*B[k][j];

A and B reshaped to readentire row/column at a time? 

What if N>>5?

Page 25: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S25, James C. Hoe, CMU/ECE/CALCM, ©2017

Recall from Last Time

for(i=…for(j=…for(k=…GET C[i][j]

for(k=…for(i=…for(j=…GET C[i][j]

for(i=…for(j=…GET C[i][j]

fully unrolled inner loops

parallel kernelpipelines

Page 26: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S26, James C. Hoe, CMU/ECE/CALCM, ©2017

With Algo. Rewrite (Option 1)

[Vivado HLS Screenshots]

// assume C initialized to 0for(int k=0; k<5; k++)

for(int i=0; i<5; i++) {for(int j=0; j<5; j++) {

#pragma HLS PIPELINEC[i][j]+= A[i][k]*B[k][j];

can fix by disableflattening

From here we can play with pragmas to sensibly widen concurrency if needed

Page 27: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S27, James C. Hoe, CMU/ECE/CALCM, ©2017

With Algo. Rewrite (Option 2)

[Vivado HLS Screenshots]

for(int i=0; i<5; i++) {for(int j=0; j<5; j++) {short Ctemp=0;for(int k=0; k<5; k++)

#pragma HLS PIPELINECtemp += A[i][k]*B[k][j];

C[i][j]=Ctemp;

can fix by disableflattening

HLS figured outforwarding

Page 28: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S28, James C. Hoe, CMU/ECE/CALCM, ©2017

• Loop Unroll (full and partial)– amortize loop control overhead– increase loop‐body size, hence“ILP” and scheduling flexibility

• Loop Merge– combine loop‐bodies of independent loops of same control

– improve parallelism and scheduling• Loop Flatten

– streamline loop‐nest control– reduce start/finish stutter

Pragma Crib Sheet: Loops

4 iter

2 iter (unroll by2)

2x (2 iter) 4 iter

2+2 iter 2 iter merged

fully unrolled 

longer steadystate

Page 29: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S29, James C. Hoe, CMU/ECE/CALCM, ©2017

• Map– multiple arrays in same BRAM– no perf loss if no scheduling conflicts

• Reshape– change BRAM aspect ratio to widen ports– higher bandwidth on consecutive addresses

• Partition– map 1 array to multiple BRAMs– multiple independent ports if no bank conflicts

Pragma Crib Sheet: Arrays addr

data

A lot more you can control; must read UG902

Page 30: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S30, James C. Hoe, CMU/ECE/CALCM, ©2017

Design by Explorationreference algorithm &testbench

algorithm forsynthesis

pragmas

HLS &analysis

goodenough

RTL

RTL backend

co‐simulationvalidation

no

yes

not good enoughafter backend

When this takes only minutes, a little trial‐and‐error is okay  (just a little!!!!)

Page 31: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S31, James C. Hoe, CMU/ECE/CALCM, ©2017

Putting it in context (from last time)• Why hardware design is hard

– reason #1: low level abstraction – reason #2: unrestricted design freedom – reason #3: massive concurrency

• C‐to‐HW (i.e., C‐to‐RTL) compiler bridges the gap between functionality and implementation– fill in the details below the functional abstraction– make good decisions when filling in the details– extract parallelism from a sequential specification

Vivado does its part fast and without mistakes

Page 32: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S32, James C. Hoe, CMU/ECE/CALCM, ©2017

Parting Thoughts• Vivado doesn’t turn program into HW• Vivado doesn’t turn programmer into HW designer• Multifaceted benefits to HW designer 

– algo. development/debug/validate in SW– pragma steering (no RTL hacking, machine tuning)– fast analysis and visualization– data type support

it is about more than adding “double” to Verilog– built‐in, stylized IP interfaces– integration with the rest of Vivado and Zynq!!

• We are entering a new era for FPGAs 

Page 33: 18 643 Lecture 10: Vivado C to IP HLSjhoe/course/ece643/F16... · 2018-06-27 · 18‐643‐F17‐L10‐S3, James C. Hoe, CMU/ECE/CALCM, ©2017 Tortoise and Hare •Tortoise – delivers

18‐643‐F17‐L10‐S33, James C. Hoe, CMU/ECE/CALCM, ©2017

Vivado “Software‐Defined” SoC

Screenshot, page 24, SDSoC Environment Getting Started (UG1028)