
Final

This is a preview of the published version of the quiz.

Started: Dec 20 at 9:48am

Quiz Instructions

Regulations: https://www.seas.upenn.edu/~ese532/fall2020/final_details.pdf

Question 1 (1 pt)

I certify that I have complied with the University of Pennsylvania’s Code of Academic Integrity and the exam regulations https://www.seas.upenn.edu/~ese532/fall2020/final_details.pdf (https://www.seas.upenn.edu/~ese532/fall2020/midterm_details.pdf) in completing this exam.

  • True
  • False

Consider the following application in answering the questions on this exam:

#define NFRAMES 256
#define MAX_RESULTS 8*NFRAMES
#define frame_type ap_int<240>
#define lookup_result_type ap_int<128>
#define STATELEN 12
#define KEYLEN (STATELEN+8)
#define KEY_MASK 0x0FFFFF
#define VAL_MASK 0x0FFF
#define NUM_SLOTS 16384
#define BUCKET_MASK 0x0FFFFFFFF
#define slot_type uint32_t
#define BUCKET_CAPACITY 4
#include <stdint.h>
extern lookup_result_type lookup[NUM_SLOTS];
extern uint16_t init_lookup[256];

void extract_compress(frame_type frames[NFRAMES],
                      uint8_t bitlocs[64],
                      uint16_t bitpos[KEYLEN],
                      uint16_t results[MAX_RESULTS],
                      int *num_results)
{
  uint64_t tmp[NFRAMES];
  for (int i=0;i<NFRAMES;i++) { // Loop A
    uint64_t result=0;
    int finalpos=1;
    frame_type val=frames[i];
    for (int j=0;j<64;j++) { // Loop B
      uint8_t bitloc=bitlocs[j];
      for (int k=128;k>0;k=k/2) { // Loop C
        if ((bitloc&0x01)==1)
          val=val/k;
        bitloc=bitloc/2;
      }
      if ((val&0x01)==1)
        result|=finalpos;
      finalpos=finalpos*2;
    }
    tmp[i]=result;
  }
  int result_count=0;
  int state=0;
  for (int i=0;i<NFRAMES;i++) { // Loop D
    uint64_t val64=tmp[i]; // val64 is input to pipeline pix
    for (int b=0;b<8;b++) // Loop E (also shown in pipeline pix)
    {
      <see pipeline in next box; complete code for question 4>
    }
  }
  *num_results=result_count;
  return;
}
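To experiment with Loop A outside an HLS toolchain, the following is a minimal sketch that restates Loops B and C in plain C. It assumes a 64-bit unsigned stand-in for frame_type (the real type is ap_int<240>) and non-negative frame values, so that dividing by a power of two k behaves like a right shift by log2(k). The helper names shift_for_bitloc and extract_bits are hypothetical and do not appear in the exam code.

```c
#include <assert.h>
#include <stdint.h>

/* Total right shift that Loop C applies for one bitloc value.
 * Loop C walks k = 128, 64, ..., 1 while consuming bitloc LSB first,
 * so bit i of bitloc selects a divide by 2^(7-i). */
unsigned shift_for_bitloc(uint8_t bitloc)
{
    unsigned shift = 0;
    for (unsigned i = 0; i < 8; i++) {
        if ((bitloc >> i) & 1u)
            shift += 7u - i;
    }
    return shift;
}

/* Loops B and C for one frame. uint64_t stands in for ap_int<240>
 * (an assumption, so this compiles without HLS headers). */
uint64_t extract_bits(uint64_t frame, const uint8_t bitlocs[64])
{
    uint64_t result = 0;
    uint64_t finalpos = 1;  /* 64-bit here so that bit 63 is representable */
    uint64_t val = frame;   /* loaded once per frame: shifts accumulate over j */
    for (int j = 0; j < 64; j++) {
        unsigned s = shift_for_bitloc(bitlocs[j]);
        assert(s < 64);     /* the largest possible shift is 7+6+...+0 = 28 */
        val >>= s;
        if (val & 1u)
            result |= finalpos;
        finalpos <<= 1;
    }
    return result;
}
```

As in the exam code, val is not reloaded inside Loop B, so each bitlocs[j] shifts the value left over from the previous j.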

Here is a pipeline that implements loop E and the code (and loops) inside it.

Memories bitpos, lookup, init_lookup, and results are arrays defined in the code above. Registers val64, b, state, and result_count are variables defined in the code above.

Question 2 (5 pts)

Assuming:

  • up to 6-input xors or ors can be packed into a single 1ns operation
  • a comparison followed by a mux can be performed in a single 1ns operation
  • a mux alone can be performed in 0.2ns
  • each memory operation can be performed in a single 2ns operation
  • an add or increment can occur in a single 1ns operation
  • bitpos registers are preloaded

What is the cycle-bound II for loop E in ns?
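As a reminder of how a cycle bound is evaluated, the sketch below sums the operator delays around one loop-carried dependence cycle and divides by the number of iterations the cycle spans; the loop's II is the largest such value over all cycles. The delay constants come from the assumptions above. The function names are hypothetical, and any cycle passed in is the caller's own reading of the pipeline, not something this sketch derives from the figure.

```c
#include <assert.h>

/* Operator delays in ns, copied from the assumptions listed above. */
#define DELAY_XOR_OR  1.0  /* up to 6-input xor or or */
#define DELAY_CMP_MUX 1.0  /* comparison followed by a mux */
#define DELAY_MUX     0.2  /* mux alone */
#define DELAY_MEM     2.0  /* any memory operation */
#define DELAY_ADD     1.0  /* add or increment */

/* Bound from one dependence cycle: total delay / iteration distance. */
double cycle_bound(const double *delays, int n, int distance)
{
    assert(n > 0 && distance > 0);
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += delays[i];
    return total / distance;
}

/* The loop's cycle-bound II is the maximum over all of its cycles. */
double loop_ii(const double *bounds, int n)
{
    assert(n > 0);
    double worst = bounds[0];
    for (int i = 1; i < n; i++)
        if (bounds[i] > worst)
            worst = bounds[i];
    return worst;
}
```

For a purely hypothetical cycle made of one memory read followed by a compare-and-mux that feeds the next iteration, cycle_bound gives 3.0 ns.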

Question 3 (5 pts)

Explain your II answer to the previous question.


Question 4 (10 pts)

Provide code for the body of Loop E based on the pipeline shown.

Recreate loops as flagged in the diagram.


Consider the following baseline system:

We start with a baseline, single processor system as shown.

  • For simplicity throughout, we will treat non-memory indexing adds (subtracts count as adds), compares, logical operations (&&, ||, |, ^, &), min, max, divides, and multiplies as the only compute operations. We'll assume the other operations take negligible time or can be run in parallel (ILP) with the listed compute and memory operations. (Some consequences: you may ignore loop and conditional overheads in processor runtime estimates; you may ignore computations in array indices.)
  • The baseline processor can execute one operation (as defined in the previous bullet) per cycle and runs at 1 GHz.
  • Reads from and writes to the 1 MB main memory issue in one cycle, but require 5 cycles of latency (including issue) to get the first 64b result; memory can supply one 64b read or write each cycle. Reads larger than 64b return 64b per cycle following the first result.
  • Up to 64b reads from and writes to the 1 KB scratchpad memory take 1 cycle.
  • By default, all arrays live in the main memory and all array references are to main memory.
  • Assume non-array variables live in registers.
  • Assume all additions are associative. Max and min are associative.
  • A lookup in a small memory (1 KB or smaller) can complete in 1ns.
  • A write to the pipeline accelerator above can be performed in one cycle.
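The main-memory timing rule above can be written as a small helper. This is a minimal sketch of the latency of a single isolated read under the stated model (5 cycles to the first 64b word, then one more cycle per additional 64b word); the function name main_mem_read_cycles is hypothetical, and whether consecutive reads overlap in a given loop is a modeling decision the sketch does not make.

```c
#include <assert.h>

/* Latency in cycles of one isolated main-memory read of `bits` bits:
 * 5 cycles (including issue) for the first 64b word, then 64b per cycle. */
int main_mem_read_cycles(int bits)
{
    assert(bits > 0);
    int words = (bits + 63) / 64;  /* number of 64b transfers, rounded up */
    return 5 + (words - 1);
}
```

For example, a 240-bit frame_type spans four 64b words, so main_mem_read_cycles(240) returns 8, while an 8-bit bitlocs entry returns 5.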

Question 5 (5 pts)

Estimate the throughput in cycles per frame for loop A running on the baseline processor.

Question 6 (5 pts)

Explain your throughput answer above.


Question 7 (4 pts)

Where is the bottleneck in throughput processing frames?

  • Loop A compute
  • Loop A memory
  • Loop E compute
  • Loop E memory

Question 8 (4 pts)

What is the Amdahl's Law speedup if you were to accelerate the bottleneck identified in the previous question? Support your answer with calculations.
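For reference, Amdahl's Law gives the overall speedup as 1 / ((1 - f) + f / s), where f is the fraction of runtime spent in the accelerated part and s is the speedup of that part. The sketch below is a generic calculator; the function names are hypothetical, and the values of f and s for this exam must come from your own cycle estimates.

```c
#include <assert.h>

/* Overall speedup when a fraction f of runtime is sped up by factor s. */
double amdahl_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

/* Limit as s grows without bound: the accelerated part costs nothing. */
double amdahl_limit(double f)
{
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

As an illustration with made-up numbers, a part that takes 75% of the runtime caps the achievable speedup at amdahl_limit(0.75), which is 4.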


Question 9 (4 pts)

What is the smallest granularity that you can profitably stream data between Loop A and Loop E?

  • Entire tmp[] (all NFRAMES words in tmp[], each of which is a 64b word)
  • single 64b word
  • no streaming possible

Question 10 (10 pts)

Use the scratchpad memory to accelerate memory operations in Loop A.

Indicate which data you place in the scratchpad.

Provide code or other clear description of how you modify the provided code for Loop A to exploit the scratchpad memory.

Use part of this box to provide justification to the numerical answer in the next question.


Question 11 (4 pts)

For your revised code in the previous question, what is the throughput in cycles per frame for the revised implementation of Loop A?

Question 12 (14 pts)

Classify each loop as sequential, reduce, or data parallel:

Loop A [ Select ]

Loop B [ Select ]

Loop C [ Select ]

Loop D [ Select ]

Loop E [ Select ]

Loop F [ Select ]

Loop G [ Select ]

Question 13 (5 pts)

What is the cycle-bound II (unlimited hardware) for loop A (assuming no bottleneck on input frames or output tmp)?

Question 14 (5 pts)

What is the latency bound (unlimited hardware) for loop A executing all NFRAMES frames (assuming no bottleneck on input frames or output tmp)?

Question 15 (5 pts)

Support your numeric answers to the previous two questions.


Question 16 (9 pts)

Describe a VLIW architecture (types of operators and numbers of each, custom memories (memory ports) as needed) for executing Loop A that has a Resource Bound throughput (cycles per frame) that is the same throughput as the pipeline for Loop E such that Loop A is no longer the bottleneck. For simplicity, you may assume a monolithic, multi-ported register file. Try to identify an architecture with minimum hardware achieving the target resource bound. Provide description and calculations to support your answer.
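A resource bound for a VLIW design is the largest, over operator classes, of the operation count in that class divided by the number of units of that class, rounded up. The sketch below is a generic calculator for checking a candidate architecture; the function names are hypothetical, and the operation counts for Loop A are left for you to derive from the code.

```c
#include <assert.h>

/* Cycles needed by one operator class: ceil(ops / units). */
int class_bound(int ops, int units)
{
    assert(ops >= 0 && units > 0);
    return (ops + units - 1) / units;
}

/* Resource bound: the most heavily loaded operator class sets the pace. */
int resource_bound(const int *ops, const int *units, int n_classes)
{
    assert(n_classes > 0);
    int worst = 0;
    for (int i = 0; i < n_classes; i++) {
        int b = class_bound(ops[i], units[i]);
        if (b > worst)
            worst = b;
    }
    return worst;
}
```

With made-up counts of 100 operations on 4 units and 30 operations on 1 unit, the bound is 30 cycles, set by the second class; adding units to the first class would not help.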


Question 17 (5 pts)

Given the resources identified in the previous question, how close to the target resource bound will a scheduled computation achieve? Explain your reasoning.
