Design Automation of Co-Processors for Application Specific Instruction Set Processors
Seng Lin Shee
Outline
1. Introduction
2. Justification & Aims
3. Work done & Accomplishments
4. Current Research
5. Customized Architecture
6. Future work
ASIPs in General
• ASICs vs GPPs: power & performance vs design / manufacturing cost
• ASIPs are a hybrid of the two
• Main characteristic: highly configurable
• Consist of a base processor and optional components
• Today's ASIPs are extensible: Xtensa, Jazz, PEAS-III, ARCtangent, Nios, SP5-flex
Aim
• Automatically create coprocessors for critical loops
• Create coprocessors that are small in area, low in power and fast
• Maximize parallelism
• Design the methodology to create the coprocessor
• Create estimation methods / ILP formulation
Related Work
• [Ernst1993] Hardware/software cosynthesis for microcontrollers
  – A standard processor is connected over the main memory bus to a co-processing ASIC/FPGA
  – Disadvantage: yields only small improvements; no parallelism involved; performance can even degrade
• [Stitt2003] Dynamic Hardware/Software Partitioning: A First Approach
  – Hardware approach to profile the program dynamically
  – Synthesizes onto an FPGA; dynamic partitioning extracts the appropriate loop
  – Disadvantage: only small regions of code; single-cycle loop body; sequential addressing of the memory block; the number of iterations must be predetermined
• CriticalBlue
  – Provides a complete methodology with a toolset for converting functions to individual coprocessors on the Cascade platform
  – Disadvantage: no parallelism between coprocessor and base processor; the coprocessor is a separate component on the bus
Contributions
• Coprocessors are generally separate components from the main processor, connected via the main memory bus
• My contributions:
  – Coprocessors can execute loops over multiple clock cycles
  – Maximum parallelism
  – No limit on loop size
  – Minimized resource usage, reducing area
  – A methodology to generate such a coprocessor
  – Reduced communication overhead
  – Accurate prediction of the improvement for a code segment given a certain constraint and architectural configuration
Project Tools
• Rapid Embedded Hardware/Software System Generation [Peddersen J., Shee S. L., Janapsatya A., Parameswaran S.], presented at the 18th IEEE/ACM International Conference on VLSI Design, January 2005
  – Uses ASIPmeister to generate the core, then adapts the RTL to complete the processor
  – Can include or exclude any instruction
  – Automatic generation of an application-specific instruction set
  – Implements the Portable Instruction Set Architecture (PISA), part of the SimpleScalar framework
  – Support for extended instructions
  – Contributions:
    • A full SimpleScalar-architecture (integer) processor core, synthesizable into an SoC or onto an FPGA for prototyping
    • A novel approach to generating a processor with various subsets of instructions
More Tools
• Modified SimpleScalar toolset to support SYSCALL on the SimpleScalar CPU
  – Takes advantage of the cache & memory features in SimpleScalar
  – Matches the clock cycle count of the hardware version
  – Provides memory dump support
• Loop detection software
  – Detects the most frequently occurring outermost loops
  – Refers back to the line numbers in the C source code
  – "Dynamic Characteristics of Loops", [Kobayashi M. 1984]
• Memory dump file analyser
• Hot Function Detector
  – Provides statistics on how much time is spent in each function
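The loop-detection idea above can be sketched as simple counter instrumentation. This is a minimal sketch under assumed mechanics (the actual tool's implementation and interface are not described in the source): count how often each loop body executes, then report the most frequent one as the hotspot candidate; the real tool maps such counts back to C source line numbers.

```c
#include <assert.h>

/* one counter per instrumented loop header (illustrative) */
static unsigned long loop_count[2];

int sum_pixels(const int *px, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {   /* loop 0 */
        loop_count[0]++;
        s += px[i];
    }
    return s;
}

int max_pixel(const int *px, int n) {
    int m = px[0];
    for (int i = 1; i < n; i++) {   /* loop 1 */
        loop_count[1]++;
        if (px[i] > m) m = px[i];
    }
    return m;
}

/* index of the most frequently executed loop so far */
int hottest_loop(void) {
    return loop_count[0] >= loop_count[1] ? 0 : 1;
}
```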
High Level Synthesis Approach
• Previous tools used: SUIF, MACHsuif (particularly for unrolling loops)
• Use SPARK for coprocessor creation (inner control)
  – a C-to-VHDL high-level synthesis framework that employs a set of innovative compiler, parallelizing compiler, and synthesis transformations
  – takes behavioural ANSI-C code as input, schedules it using speculative code motions and loop transformations, runs an interconnect-minimizing resource binding pass and generates a finite state machine for the scheduled design graph; a backend code generation pass outputs synthesizable register-transfer level (RTL) VHDL
• SPARK: A High-Level Synthesis Framework for Applying Parallelizing Compiler Transformations [Gupta2003]
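To make the input format concrete, this is the kind of behavioural ANSI-C kernel an HLS tool such as SPARK accepts: a loop whose operations can be scheduled with code motions, bound to RTL resources, and driven by a generated FSM. The kernel itself is illustrative only, not the cjpeg hotspot.

```c
#include <assert.h>

/* behavioural C kernel; the multiply and the add are separate
 * operations the HLS scheduler may chain or overlap across
 * iterations before emitting FSM + datapath in VHDL */
void saxpy(int n, int a, const int *x, const int *y, int *out) {
    for (int i = 0; i < n; i++) {
        out[i] = a * x[i] + y[i];
    }
}
```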
How improvements are obtained
[Figure: comparison of a single-pipeline load / computation / store sequence over one iteration against the unrolled version and my method]
HLS Coprocessor Features
• Register file sharing
• A wrapper to control the execution of the inner coprocessor
• SCPR & BCPR instructions
• Disadvantages:
  – A destination register can only be read after the write-back stage; latency equals the number of pipeline stages
  – Loops are very hard to form if inputs must be fetched on every iteration
  – A wrapper must always be built just to accommodate the SPARK-generated component
  – Number of inputs / outputs = number of arguments
More details
• Detected the loop hotspot in the cjpeg program
• Created the coprocessor using the HLS approach
• Simulated using ModelSim
• Synthesized with the tcbn90gwc technology libraries through the Synopsys Design Compiler
• Given a 10 ns clock constraint: 416.7 MHz; 6,199 µm²; 2,562 NAND gates
[Chart: Loop Execution, clock cycles (0 to 4,500,000) for the original vs the modified program]
[Chart: Program Execution, clock cycles (0 to 20,000,000) for the original vs the modified program]
Why the HLS approach was used
• Used to unroll loops
• To find out how much parallelism can be obtained
• Parallelism is limited by how many register ports can be read at any one time
• Area and power of the register file increase linearly with the number of ports
• However, loop unrolling is only beneficial if the fetches / stores are done in parallel
• We need multiple resources, but we only have one base processor. Bottleneck!
• No need to fetch data at the last moment
GPR configuration vs register file size:

  Ports         2R / 1W   4R / 2W   5R / 3W   8R / 4W
  NAND gates    19,185    27,813    34,101    42,432

Loop unrolling example (unrolled by a factor of two):

  for (i = 0; i < 100; i++)
      g();

  for (i = 0; i < 100; i += 2) {
      g();
      g();
  }
Customized Architecture
• Highly integrated coprocessor architecture
• Something like a coprocessor, but integrated within the base processor
• Makes full use of unused registers (r8-r15, r24-r25)
• All calculations in the loop (where possible) are done in the coprocessor
• The base processor just fetches the required data from memory and stores the result back to memory
• The coprocessor taps into signals to know when data is ready and when to start execution
• Assumptions:
  – No multitasking
  – No preemption, no interrupts
  – The coprocessor does not stall the CPU; its execution time is already known at creation time; NOPs are used
• Problems:
  – Latency equals the number of pipeline stages
  – Not good for loops with short / simple computations
Advantages
• Saves register usage
• Fetches data immediately when it is ready at the WB stage
• Easy coprocessor task generation; basic block grouping
• Full control of instruction synthesis
• Maximized parallelism; address calculations are also performed
• Memory I/O tasks are given to the base processor
• No branch calculations
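A minimal software model of this division of labour, with illustrative names (not the actual RTL interface): the base processor performs all loads and stores, while the coprocessor computes on values the base processor has already fetched, so no branch or address calculation happens on the coprocessor side.

```c
#include <assert.h>

/* work the integrated coprocessor would do each iteration
 * (arbitrary example computation) */
static int coproc_step(int a, int b) {
    return (a + b) / 2;
}

/* base processor side: addressing, loads, stores, loop control */
int run_loop(const int *mem_in, int *mem_out, int n) {
    int stored = 0;
    for (int i = 0; i + 1 < n; i += 2) {
        int a = mem_in[i];          /* base processor: load  */
        int b = mem_in[i + 1];      /* base processor: load  */
        int r = coproc_step(a, b);  /* coprocessor: compute  */
        mem_out[stored++] = r;      /* base processor: store */
    }
    return stored;                  /* number of results stored */
}
```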
Custom Coprocessor Integration
[Diagram: the custom coprocessor sits inside the base processor, sharing the GPR and tapping the ID and WB pipeline stages via CoReg and CoDone signals]
Custom Coprocessor Creation Methodology
[Flowchart: the application written in C / C++ undergoes initial profiling to find critical functions and loops; a heuristic algorithm selects the segments of code to be handled by the coprocessor; the coprocessor segments are generated in source code (assembly code); the instruction set is reduced in ASIPmeister using a preconfigured processor library (ASIPmeister libraries etc.) and the custom coprocessor architectural model; the ICOP creation methodology and the custom coprocessor integration methodology then yield a set of coprocessors, a list of processes & loop control, and the completed CPU and program]
Verification Methodology
• The C/C++ program is run through hardware simulation (ModelSim) and software simulation (SimpleScalar)
• The memory dump file and execution time produced by both simulations should be identical
• The same method is applied for verification of the ICOP architecture
• Sim-hexbin (a program developed for this work) is used to obtain an output file from the dump file for comparison purposes
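The core dump comparison can be sketched as follows; the function name and signature are assumptions for illustration, not the actual sim-hexbin interface.

```c
#include <assert.h>
#include <stddef.h>

/* Compare the memory dump from the hardware simulation (ModelSim)
 * against the one from the software simulation (SimpleScalar).
 * Returns -1 if the dumps are identical, otherwise the offset of
 * the first mismatching byte. */
long first_mismatch(const unsigned char *hw_dump,
                    const unsigned char *sw_dump, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (hw_dump[i] != sw_dump[i])
            return (long)i;
    return -1;
}
```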
Loop Identification
[Chart: Critical Loops, percentage of execution time vs image size (K pixels) for the loops at jcphuff.c:643, jcdctmgr.c:232, jchuff.c:766, jchuff.c:684, jchuff.c:673, jfdctint.c:220 and jfdctint.c:155]
How did we fare?
• Detected the (same as previous) loop hotspot in the cjpeg program
• Created the coprocessor using the Custom Coprocessor Methodology
• Simulated using ModelSim
• Synthesized with the tcbn90gwc technology libraries through the Synopsys Design Compiler
• Given a 10 ns clock constraint: 166.9 MHz (1 GHz possible); 16,203 µm²; 6,698 NAND gates
• Has the potential to require less area
[Chart: Loop Execution, clock cycles (0 to 4.5 M cycles) for the original program, the HLS coprocessor and the custom coprocessor]
[Chart: Program Execution, clock cycles (17 to 22 M cycles) for the original program, the HLS coprocessor and the custom coprocessor]
HLS vs Custom Coprocessor
[Chart: Loop Energy Consumption, energy (µJ, 0 to 180) for the base processor, the HLS coprocessor and the custom coprocessor]
[Chart: Processor Size, NAND gates (0 to 80,000) for the base processor, the HLS coprocessor and the custom coprocessor]
[Chart: Coprocessor Size, NAND gates (0 to 7,000) for the HLS coprocessor and the custom coprocessor]
Memory Latency Effect
Loop Improvement (%) vs memory access latency (I-cache / D-cache cycles):

  Latency           0/0     18/2    36/4    54/6    72/8    90/10   108/12  126/14
  ICACHE            159.58  156.66  156.92  156.33  156.50  155.62  155.72  154.91
  DCACHE            159.58  103.72  69.25   52.17   42.11   35.48   30.78   27.20
  ICACHE & DCACHE   159.58  103.92  69.07   52.20   42.03   35.44   30.76   27.17
Memory Latency Effect
Overall Improvement (%) vs memory access latency (I-cache / D-cache cycles):

  Latency           0/0    18/2   36/4   54/6   72/8   90/10  108/12  126/14
  ICACHE            13.97  8.70   6.44   5.22   4.44   3.91   3.52    3.22
  DCACHE            13.97  10.38  7.39   5.57   4.37   3.51   2.95    2.50
  ICACHE & DCACHE   13.97  7.61   5.07   3.89   3.21   2.76   2.46    2.23
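A sketch of how such an improvement percentage could be computed from cycle counts. The definition relative to the new cycle count is an assumption, chosen because the loop figures exceed 100%; the exact definition used in the charts is not stated in the source.

```c
#include <assert.h>

/* improvement measured relative to the new cycle count, so that
 * halving the cycles gives a 100% improvement */
double improvement_pct(double orig_cycles, double new_cycles) {
    return (orig_cycles - new_cycles) / new_cycles * 100.0;
}
```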
Input Data Behaviour
[Chart: Critical Loops in CJPEG, percentage of execution time vs image size (0 to 1,200,000 pixels) for the loops at jcphuff.c:643, jcdctmgr.c:232, jchuff.c:766, jchuff.c:684, jchuff.c:673, jfdctint.c:220 and jfdctint.c:155]
Future Work
• Formalize the methodology
• More concrete model of the coprocessor
• Model to predict performance improvement
• Be able to decide when the ICOP architecture is feasible
• Analyze performance improvements on a variety of benchmark applications