Final Quiz Instructions

Final

Quiz Instructions

Regulations: https://www.seas.upenn.edu/~ese532/fall2020/final_details.pdf

Question 1 (1 pt)

I certify that I have complied with the University of Pennsylvania’s Code of Academic Integrity and the exam regulations (https://www.seas.upenn.edu/~ese532/fall2020/final_details.pdf) in completing this exam.

True

False

Consider the following application in answering the questions on this exam:

#define NFRAMES 256
#define MAX_RESULTS 8*NFRAMES
#define frame_type ap_int<240>
#define lookup_result_type ap_int<128>
#define STATELEN 12
#define KEYLEN (STATELEN+8)
#define KEY_MASK 0x0FFFFF
#define VAL_MASK 0x0FFF
#define NUM_SLOTS 16384
#define BUCKET_MASK 0x0FFFFFFFF
#define slot_type uint32_t
#define BUCKET_CAPACITY 4
#include<stdint.h>
extern lookup_result_type lookup[NUM_SLOTS];
extern uint16_t init_lookup[256];
void extract_compress(frame_type frames[NFRAMES],
                      uint8_t bitlocs[64],
                      uint16_t bitpos[KEYLEN],
                      uint16_t results[MAX_RESULTS],
                      int *num_results)
{
  uint64_t tmp[NFRAMES];
  for (int i=0;i<NFRAMES;i++) { // Loop A
    uint64_t result=0;
    int finalpos=1;
    frame_type val=frames[i];
    for (int j=0;j<64;j++) { // Loop B
      uint8_t bitloc=bitlocs[j];
      for (int k=128;k>0;k=k/2) { // Loop C
        if ((bitloc&0x01)==1)
          val=val/k;
        bitloc=bitloc/2;
      }
      if ((val&0x01)==1)
        result|=finalpos;
      finalpos=finalpos*2;
    }
    tmp[i]=result;
  }
  int result_count=0;
  int state=0;
  for (int i=0;i<NFRAMES;i++) { // Loop D
    uint64_t val64=tmp[i]; // val64 is input to pipeline pix
    for (int b=0;b<8;b++) // Loop E (also shown in pipeline pix)
    {
      <see pipeline in next box; complete code for question 4>
    }
  }
  *num_results=result_count;
  return;
}
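As a reading aid for Loop C (a minimal sketch, not part of the exam code): each set bit of bitloc divides val by a power of two, so for a non-negative value the loop behaves like a single right shift. The sketch below assumes a uint64_t stand-in for frame_type, since ap_int<240> needs the HLS headers, and the function names are hypothetical. For a negative ap_int value, division rounds toward zero and would not match a shift.

```c
#include <stdint.h>

/* Loop C as written in the listing, applied to a 64-bit stand-in value. */
uint64_t loop_c_divide(uint64_t val, uint8_t bitloc) {
    for (int k = 128; k > 0; k = k / 2) {
        if ((bitloc & 0x01) == 1)
            val = val / k;
        bitloc = bitloc / 2;
    }
    return val;
}

/* Bit j of bitloc (LSB first) is paired with k = 2^(7-j),
 * so a set bit j contributes (7-j) to the total shift. */
unsigned loop_c_shift_amount(uint8_t bitloc) {
    unsigned shift = 0;
    for (int j = 0; j < 8; j++) {
        if ((bitloc >> j) & 1)
            shift += 7 - j;
    }
    return shift;
}

/* Same result as loop_c_divide for unsigned values, as one shift. */
uint64_t loop_c_shift(uint64_t val, uint8_t bitloc) {
    return val >> loop_c_shift_amount(bitloc);
}
```

Under these assumptions the largest possible shift is 7+6+...+0 = 28 bits, reached when bitloc is 0xFF.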

Here is a pipeline that implements loop E and the code (and loops) inside it.

Memories bitpos, lookup, init_lookup, and results are arrays defined in the code above. Registers val64, b, state, and result_count are variables defined in the code above.


Question 2 (5 pts)

Assuming:

- up to 6-input xors or ors can be packed into a single 1ns operation
- a comparison followed by a mux can be performed in a single 1ns operation
- a mux alone can be performed in 0.2ns
- each memory operation can be performed in a single 2ns operation
- an add or increment can occur in a single 1ns operation
- bitpos registers are preloaded

What is the cycle-bound II for loop E in ns?
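For reference, a cycle-bound II is usually taken as the largest delay around any loop-carried dependence cycle, divided by the number of iterations that dependence spans. The sketch below shows only that arithmetic. The struct, the function name, and the example cycles in the usage note are hypothetical and are not read off the pipeline figure.

```c
#include <stddef.h>

/* One loop-carried dependence cycle: the summed delay of the operations
 * on the cycle, and how many iterations the dependence spans. */
struct dep_cycle {
    double delay_ns;
    int distance;
};

/* The cycle bound is set by the slowest cycle (delay per iteration). */
double cycle_bound_ii_ns(const struct dep_cycle *cycles, size_t n) {
    double ii = 0.0;
    for (size_t c = 0; c < n; c++) {
        double bound = cycles[c].delay_ns / cycles[c].distance;
        if (bound > ii)
            ii = bound;
    }
    return ii;
}
```

For instance, a hypothetical cycle made of one 2ns memory operation and one 1ns add with distance 1 would bound the II at 3ns, even if a second cycle needed only 1.2ns.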


Question 3 (5 pts)

Explain your II answer to the previous question.

Question 4 (10 pts)

Provide code for the body of Loop E based on the pipeline shown.

Recreate loops as flagged in the diagram.

Consider the following baseline system:

We start with a baseline, single processor system as shown.




- For simplicity throughout, we will treat non-memory indexing adds (subtracts count as adds), compares, logical operations (&&, ||, |, ^, &), min, max, divides, and multiplies as the only compute operations. We'll assume the other operations take negligible time or can be run in parallel (ILP) with the listed compute and memory operations. (Some consequences: You may ignore loop and conditional overheads in processor runtime estimates; you may ignore computations in array indices.)
- Baseline processor can execute one operation (as defined in the previous bullet) per cycle and runs at 1 GHz.
- Reads from and writes to the 1 MB main memory issue in one cycle, but require 5 cycles of latency (including issue) to get the first 64b result; memory can supply one 64b read or write each cycle. Reads larger than 64b return 64b per cycle following the first result.
- Up to 64b reads from and writes to the 1 KB scratchpad memory take 1 cycle.
- By default, all arrays live in the main memory and all array references are to main memory.
- Assume non-array variables live in registers.
- Assume all additions are associative. Max and min are associative.
- A lookup in a small memory (1 KB or smaller) can complete in 1ns.
- A write to the pipeline accelerator above can be performed in one cycle.
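The main-memory timing rule above can be written out as a small helper. This is a sketch of one reading of the rule, covering a single isolated read that is not overlapped with other accesses; the function name is hypothetical.

```c
/* Cycles for one isolated main-memory read of `bits` bits:
 * 5 cycles (including issue) for the first 64b word,
 * then one more cycle for each additional 64b word. */
unsigned main_mem_read_cycles(unsigned bits) {
    unsigned words = (bits + 63) / 64; /* round up to whole 64b words */
    if (words == 0)
        return 0;
    return 5 + (words - 1);
}
```

Under this reading, a 64b read takes 5 cycles and a 240b frame_type read spans four 64b words, so it takes 5 + 3 = 8 cycles.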

Question 5 (5 pts)

Estimate the throughput in cycles per frame for loop A running on the baseline processor.

Question 6 (5 pts)

Explain your throughput answer above.

Question 7 (4 pts)

Where is the bottleneck in throughput processing frames?

Loop A compute

Loop A memory

Loop E compute

Loop E memory

Question 8 (4 pts)

What is the Amdahl's Law speedup if you were to accelerate the bottleneck identified in the previous question? Support your answer with calculations.
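For reference, Amdahl's Law gives the overall speedup as 1 / ((1 - f) + f / s), where f is the fraction of the original runtime spent in the accelerated part and s is the speedup of that part. The sketch below only encodes the formula; the function names are hypothetical, and the numbers in the usage note are illustrative rather than derived from this application.

```c
/* Overall speedup when a fraction f of the runtime is sped up by factor s. */
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

/* Upper limit on speedup as s grows without bound: only the
 * unaccelerated fraction (1 - f) remains. */
double amdahl_limit(double f) {
    return 1.0 / (1.0 - f);
}
```

With illustrative values f = 0.75 and s = 3, the overall speedup is 1 / (0.25 + 0.25) = 2, and no amount of acceleration of that part can exceed 4.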

Question 9 (4 pts)

What is the smallest granularity that you can profitably stream data between Loop A and Loop E?

Entire tmp[] (all NFRAMES words in tmp[], each of which is a 64b word)

single 64b word

no streaming possible

Question 10 (10 pts)

Use the scratchpad memory to accelerate memory operations in Loop A.

Indicate which data you place in the scratchpad.

Provide code or other clear description of how you modify the provided code for Loop A to exploit the scratchpad memory.

Use part of this box to provide justification for the numerical answer in the next question.

Question 11 (4 pts)

For your revised code in the previous question, what is the throughput in cycles per frame for the revised implementation of Loop A?

Question 12 (14 pts)

Classify each loop as sequential, reduce, or data parallel:

Loop A

Loop B

Loop C

Loop D

Loop E

Loop F

Loop G
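As a reminder of the three categories, here is a minimal sketch with one generic loop of each kind. These functions are hypothetical illustrations and are not taken from the application code above.

```c
#include <stddef.h>

/* Data parallel: iteration i reads only x[i] and writes only y[i],
 * so all iterations could run at the same time. */
void scale_all(const unsigned *x, unsigned *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = 2 * x[i];
}

/* Reduce: every iteration updates sum, but + is associative, so the
 * additions could be regrouped into a tree. */
unsigned sum_all(const unsigned *x, size_t n) {
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Sequential: each iteration needs the previous state, and the update
 * is nonlinear with no obvious associative regrouping. */
unsigned chain_all(const unsigned *x, size_t n) {
    unsigned state = 0;
    for (size_t i = 0; i < n; i++)
        state = (state * state + x[i]) & 0xFFFu;
    return state;
}
```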

Question 13 (5 pts)

What is the cycle-bound II (unlimited hardware) for loop A (assuming no bottleneck on input frames or output tmp)?



Question 14 (5 pts)

What is the latency bound (unlimited hardware) for loop A executing all NFRAMES frames (assuming no bottleneck on input frames or output tmp)?

Question 15 (5 pts)

Support your numeric answers to the previous two questions.

Question 16 (9 pts)

Describe a VLIW architecture (types of operators and numbers of each, custom memories (memory ports) as needed) for executing Loop A that has a Resource Bound throughput (cycles per frame) that is the same throughput as the pipeline for Loop E such that Loop A is no longer the bottleneck. For simplicity, you may assume a monolithic, multi-ported register file. Try to identify an architecture with minimum hardware achieving the target resource bound. Provide description and calculations to support your answer.
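For reference, a resource bound is commonly computed per operator type as the number of operations of that type divided by the number of units that can execute it, rounded up, with the largest such value setting the bound. The sketch below encodes only that calculation. The function name is hypothetical, and the counts in the usage note are illustrative, not operation counts for Loop A.

```c
#include <assert.h>
#include <stddef.h>

/* Resource-bound cycles: for each operator type t, op_counts[t] operations
 * must share unit_counts[t] units; the busiest type sets the bound. */
unsigned resource_bound_cycles(const unsigned *op_counts,
                               const unsigned *unit_counts,
                               size_t n_types) {
    unsigned bound = 0;
    for (size_t t = 0; t < n_types; t++) {
        assert(unit_counts[t] > 0); /* every used type needs a unit */
        unsigned cycles = (op_counts[t] + unit_counts[t] - 1) / unit_counts[t];
        if (cycles > bound)
            bound = cycles;
    }
    return bound;
}
```

With illustrative counts of 12 operations on 3 units and 9 operations on 2 units, the per-type bounds are 4 and 5 cycles, so the resource bound is 5 cycles. Working backward from a target bound to unit counts uses the same division.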

Question 17 (5 pts)

Given the resources identified in the previous question, how close to the target resource bound will a scheduled computation achieve? Explain your reasoning.
