Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Final
! This is a preview of the published version of the quiz
Started: Dec 20 at 9:48am
Quiz InstructionsRegulations: https://www.seas.upenn.edu/~ese532/fall2020/final_details.pdf
1 ptsQuestion 1
True
False
I certify that I have complied with the University of Pennsylvania’s Code ofAcademic Integrity and the exam regulationshttps://www.seas.upenn.edu/~ese532/fall2020/final_details.pdf(https://www.seas.upenn.edu/~ese532/fall2020/midterm_details.pdf) in completing thisexam.
Consider the following application in answering the questions on this exam:
#define NFRAMES 256#define MAX_RESULTS 8*NFRAMES#define frame_type ap_int<240>#define lookup_result_type ap_int<128>#define STATELEN 12#define KEYLEN (STATELEN+8)#define KEY_MASK 0x0FFFFF
#define KEY_MASK 0x0FFFFF#define VAL_MASK 0x0FFF#define NUM_SLOTS 16384#define BUCKET_MASK 0x0FFFFFFFF#define slot_type uint32_t#define BUCKET_CAPACITY 4#include<stdint.h>extern lookup_result_type lookup[NUM_SLOTS];extern uint16_t init_lookup[256];void extract_compress(frame_type frames[NFRAMES], uint8_t bitlocs[64], uint16_t bitpos[KEYLEN], uint16_t results[MAX_RESULTS], int *num_results){ uint64_t tmp[NFRAMES]; for (int i=0;i<NFRAMES;i++) { // Loop A uint64_t result=0; int finalpos=1; frame_type val=frames[i]; for (int j=0;j<64;j++) { // Loop B uint8_t bitloc=bitlocs[j]; for (int k=128;k>0;k=k/2) { // Loop C if ((bitloc&0x01)==1) val=val/k; bitloc=bitloc/2; } if ((val&0x01)==1) result|=finalpos; finalpos=finalpos*2; } tmp[i]=result; } int result_count=0; int state=0; for (int i=0;i<NFRAMES;i++) { // Loop D uint64_t val64=tmp[i]; // val64 is input to pipeline pix for (int b=0;b<8;b++) // Loop E (also shown in pipeline pix)
for (int b=0;b<8;b++) // Loop E (also shown in pipeline pix) { <see pipeline in next box; complete code for question 4> } } *num_results=result_count; return;}
Here is a pipeline that implements loop E and the code (and loops) inside it.
Memories bitpos, lookup, init_lookup, and results are arrays defined in thecode above. Registers val64, b, state, and result_count are variables defined inthe code above.
5 ptsQuestion 2
Assuming:
up to 6-input xors or ors can be packed into a single 1ns operationa comparison followed by a mux can be performed in a single 1ns operation
mux alone can also be performed in a 0.2nseach memory operation can be performed in a single 2ns operationan add or increment can occur in a single 1ns operationbitpos registers are preloaded
What is the cycle-bound II for loop E in ns?
5 pts
"HTML Editor
Question 3
Explain your II answer to previous question.
� � � � � � � � � � � � � � �
� � # $ � � 12pt Paragraph
0 words�
10 ptsQuestion 4
Provide code for the body of Loop E based on the pipeline show.
Recreate loops as flagged in the diagram.
"HTML Editor
� � � � � � � � � � � � � � �
� � # $ � � 12pt Paragraph
0 words�
Consider the following baseline system:
We start with a baseline, single processor system as shown.
For simplicity throughout, we will treat non-memory indexing adds (subtractscount as adds), compares, logical operations (&&, ||,|,^&), min, max, divides,and multplies as the only compute operations. We'll assume theother operations take negligible time or can be run in parallel (ILP) with thelisted compute and memory operations. (Some consequences: You may ignoreloop and conditional overheads in processor runtime estimates; you mayignore computations in array indicies.)Baseline processor can execute one operation (as defined previous bullet) percycle and runs at 1 GHz.Reads from and writes to the 1 MB main memory issue in one cycle, butrequire 5 cycles of latency (including issue) to get the first 64b result;memory can supply one 64b read or write each cycle. Reads larger than 64breturn 64b per cycle following the first result.Up to 64b reads from and writes to the 1 KB scratchpad memory take 1 cycle.By default, all arrays live in the main memory and all array referencesare to main memory.Assume non-array variables live in registers.Assume all additions are associative. Max and min are associative.A lookup in a small memory (1KB or small) can complete in 1ns.A write to the pipeline accelerator above can be performed in one cycle.
5 ptsQuestion 5
Estimate the throughput in cycles per frame for loop A running on the baselineprocessor.
5 pts
"HTML Editor
Question 6
Explain your throughput answer above.
� � � � � � � � � � � � � � �
� � # $ � � 12pt Paragraph
0 words�
4 ptsQuestion 7
Loop A compute
Loop A memory
Loop E compute
Loop E memory
Where is the bottleneck in throughput processing frames?
4 pts
"HTML Editor
Question 8
What is the Amdahl's Law speedup if you were to accelerate the bottleneckidentified in the previous question? Support your answer with calculations.
� � � � � � � � � � � � � � �
� � # $ � � 12pt Paragraph
0 words�
4 ptsQuestion 9
Entire tmp[] (all NFRAMES words in tmp[], each of which is a 64b word)
single 64b word
no streaming possible
What is the smallest granularity that you can profitably stream data between LoopA and Loop E?
10 ptsQuestion 10
"HTML Editor
Use the scratchpad memory to accelerate memory operations in Loop A.
Indicate which data you place in the scratchpad.
Provide code or other clear description of how you modify the provided code forLoop A to exploit the scratchpad memory.
Use part of this box to provide justification to the numerical answer in the nextquestion.
� � � � � � � � � � � � � � �
� � # $ � � 12pt Paragraph
0 words�
4 ptsQuestion 11
For your revised code in the previous question, what is the throughput in cyclesper frame for the revised implementation of Loop A?
14 ptsQuestion 12
Classify each loop as sequential, reduce, or data parallel:
Loop A [ Select ]
Loop B [ Select ]
Loop C [ Select ]
Loop D [ Select ]
Loop E [ Select ]
Loop F [ Select ]
Loop G [ Select ]
5 ptsQuestion 13
What is the cycle-bound II (unlimited hardware) for loop A (assuming no bottleneckon input frames or output tmp)?
5 ptsQuestion 14
What is the latency bound (unlimited hardware) for loop A executing all NFRAMEframes (assuming no bottleneck on input frames or output tmp)?
5 pts
"HTML Editor
Question 15
Support your numeric answers to the previous two questions.
� � � � � � � � � � � � � � �
� � # $ � � 12pt Paragraph
0 words�
9 pts
"HTML Editor
Question 16
Describe a VLIW architecture (types of operators and numbers of each, custommemories (memory ports) as needed) for executing Loop A that has a ResourceBound throughput (cycles per frame) that is the same throughput as the pipelinefor Loop E such that Loop A is no longer the bottleneck. For simplicity, you mayassume a monolithic, multi-ported register file. Try to identify an architecture withminimum hardware achieving the target resource bound. Provide description andcalculations to support your answer.
� � � � � � � � � � � � � � �
� � # $ � � 12pt Paragraph
0 words�
5 pts
"HTML Editor
Question 17
Given the resources identified in the previous question, how close to the targetresource bound will a scheduled computation achieve? Explain your reasoning.
� � � � � � � � � � � � � � �
� � # $ � � 12pt Paragraph
Quiz saved at 9:49am
0 words�
Submit Quiz