Design Automation of Co-Processors for Application Specific Instruction Set Processors
Seng Lin Shee
Outline
1. Introduction
2. Justification & Aims
3. Work done & Accomplishments
4. Current Research
5. Customized Architecture
6. Future work
ASIPs in General
• ASICs vs GPPs: power & performance vs design / manufacturing cost
• ASIPs are a hybrid of the two
• Main characteristic: highly configurable
• Consist of a base processor and optional components
• Today's ASIPs are extensible: Xtensa, Jazz, PEAS-III, ARCtangent, Nios, SP5-flex
Aim
• Automatically create coprocessors for critical loops
• Create coprocessors that are small in area, low in power and fast
• Maximize parallelism
• Design the methodology to create the coprocessor
• Create estimation methods / ILP formulation
Related Work
• [Ernst1993] Hardware/software cosynthesis for microcontrollers
  – A standard processor is connected over the main memory bus to a co-processing ASIC/FPGA
  – Disadvantage: yields only small improvements; no parallelism involved; performance can even degrade
• [Stitt2003] Dynamic Hardware/Software Partitioning: A First Approach
  – Hardware approach to profile the program dynamically
  – Synthesizes onto an FPGA; dynamic partitioning extracts the appropriate loop
  – Disadvantage: only small regions of code; single-cycle loop body; sequential addressing of the memory block; the number of iterations must be predetermined
• CriticalBlue
  – Provides a complete methodology with a toolset for converting functions to individual coprocessors on the Cascade platform
  – Disadvantage: no parallelism between coprocessor and base processor; the coprocessor is a separate component on the bus
Contributions
• Coprocessors are generally separate components from the main processor, connected via the main memory bus
• My contributions:
  – Coprocessors can execute loops over multiple clock cycles
  – Maximum parallelism
  – No limit on loop size
  – Minimized resource usage, reducing area
  – A methodology to generate such a coprocessor
  – Reduced communication overhead
  – Accurate prediction of the improvement for a code segment given a certain constraint and architectural configuration
Project Tools
• Rapid Embedded Hardware/Software System Generation [Peddersen J., Shee S. L., Janapsatya A., Parameswaran S.], presented at the 18th IEEE/ACM International Conference on VLSI Design, January 2005
  – Uses ASIPmeister to generate the core, then adapts the RTL to complete the processor
  – Can include or exclude any instruction
  – Automatic generation of an application-specific instruction set
  – Implements the Portable Instruction Set Architecture (PISA), part of the SimpleScalar framework
  – Support for extended instructions
  – Contributions:
    • A full SimpleScalar-architecture (integer) processor core, synthesizable into an SoC or onto an FPGA for prototyping
    • A novel approach to generating a processor with various subsets of instructions
More Tools
• Modified SimpleScalar toolset to support SYSCALL on the SimpleScalar CPU
  – Takes advantage of the cache & memory features in SimpleScalar
  – Matches the clock cycle count of the hardware version
  – Provides memory dump support
• Loop detection software
  – Detects the most frequently occurring outermost loops
  – Refers back to the line numbers in the C source code
  – "Dynamic Characteristics of Loops", [Kobayashi M. 1984]
• Memory dump file analyser
• Hot Function Detector
  – Provides statistics on how much time is spent in each function
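The loop-detection idea above can be sketched as simple counter instrumentation. This is a minimal sketch under assumed mechanics (the actual tool's implementation and interface are not described in the source): count how often each loop body executes, then report the most frequent one as the hotspot candidate; the real tool maps such counts back to C source line numbers.

```c
#include <assert.h>

/* one counter per instrumented loop header (illustrative) */
static unsigned long loop_count[2];

int sum_pixels(const int *px, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {   /* loop 0 */
        loop_count[0]++;
        s += px[i];
    }
    return s;
}

int max_pixel(const int *px, int n) {
    int m = px[0];
    for (int i = 1; i < n; i++) {   /* loop 1 */
        loop_count[1]++;
        if (px[i] > m) m = px[i];
    }
    return m;
}

/* index of the most frequently executed loop so far */
int hottest_loop(void) {
    return loop_count[0] >= loop_count[1] ? 0 : 1;
}
```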
High Level Synthesis Approach
• Previous tools used: SUIF, MACHsuif (particularly for unrolling loops)
• Use SPARK for coprocessor creation (inner control)
  – a C-to-VHDL high-level synthesis framework that employs a set of innovative compiler, parallelizing compiler, and synthesis transformations
  – takes behavioural ANSI-C code as input, schedules it using speculative code motions and loop transformations, runs an interconnect-minimizing resource binding pass and generates a finite state machine for the scheduled design graph; a backend code generation pass outputs synthesizable register-transfer level (RTL) VHDL
• SPARK: A High-Level Synthesis Framework for Applying Parallelizing Compiler Transformations [Gupta2003]
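To make the input format concrete, this is the kind of behavioural ANSI-C kernel an HLS tool such as SPARK accepts: a loop whose operations can be scheduled with code motions, bound to RTL resources, and driven by a generated FSM. The kernel itself is illustrative only, not the cjpeg hotspot.

```c
#include <assert.h>

/* behavioural C kernel; the multiply and the add are separate
 * operations the HLS scheduler may chain or overlap across
 * iterations before emitting FSM + datapath in VHDL */
void saxpy(int n, int a, const int *x, const int *y, int *out) {
    for (int i = 0; i < n; i++) {
        out[i] = a * x[i] + y[i];
    }
}
```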
How improvements are obtained
[Figure: comparison of a single-pipeline load / computation / store sequence over one iteration against the unrolled version and my method]
HLS Coprocessor Features
• Register file sharing
• A wrapper to control the execution of the inner coprocessor
• SCPR & BCPR instructions
• Disadvantages:
  – A destination register can only be read after the write-back stage; latency equals the number of pipeline stages
  – Loops are very hard to form if inputs must be fetched on every iteration
  – A wrapper must always be built just to accommodate the SPARK-generated component
  – Number of inputs / outputs = number of arguments
More details
• Detected the loop hotspot in the cjpeg program
• Created the coprocessor using the HLS approach
• Simulated using ModelSim
• Synthesized with the tcbn90gwc technology libraries through the Synopsys Design Compiler
• Given a 10 ns clock constraint: 416.7 MHz; 6,199 µm²; 2,562 NAND gates
[Chart: Loop Execution, clock cycles (0 to 4,500,000) for the original vs the modified program]
[Chart: Program Execution, clock cycles (0 to 20,000,000) for the original vs the modified program]
Why the HLS approach was used
• Used to unroll loops
• To find out how much parallelism can be obtained
• Parallelism is limited by how many register ports can be read at any one time
• Area and power of the register file increase linearly with the number of ports
• However, loop unrolling is only beneficial if the fetches / stores are done in parallel
• We need multiple resources, but we only have one base processor. Bottleneck!
• No need to fetch data at the last moment
GPR configuration vs register file size:

  Ports         2R / 1W   4R / 2W   5R / 3W   8R / 4W
  NAND gates    19,185    27,813    34,101    42,432

Loop unrolling example (unrolled by a factor of two):

  for (i = 0; i < 100; i++)
      g();

  for (i = 0; i < 100; i += 2) {
      g();
      g();
  }
Customized Architecture
• Highly integrated coprocessor architecture
• Something like a coprocessor, but integrated within the base processor
• Makes full use of unused registers (r8-r15, r24-r25)
• All calculations in the loop (where possible) are done in the coprocessor
• The base processor just fetches the required data from memory and stores the result back to memory
• The coprocessor taps into signals to know when data is ready and when to start execution
• Assumptions:
  – No multitasking
  – No preemption, no interrupts
  – The coprocessor does not stall the CPU; its execution time is already known at creation time; NOPs are used
• Problems:
  – Latency equals the number of pipeline stages
  – Not good for loops with short / simple computations
Advantages
• Saves register usage
• Fetches data immediately when it is ready at the WB stage
• Easy coprocessor task generation; basic block grouping
• Full control of instruction synthesis
• Maximized parallelism; address calculations are also performed
• Memory I/O tasks are given to the base processor
• No branch calculations
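A minimal software model of this division of labour, with illustrative names (not the actual RTL interface): the base processor performs all loads and stores, while the coprocessor computes on values the base processor has already fetched, so no branch or address calculation happens on the coprocessor side.

```c
#include <assert.h>

/* work the integrated coprocessor would do each iteration
 * (arbitrary example computation) */
static int coproc_step(int a, int b) {
    return (a + b) / 2;
}

/* base processor side: addressing, loads, stores, loop control */
int run_loop(const int *mem_in, int *mem_out, int n) {
    int stored = 0;
    for (int i = 0; i + 1 < n; i += 2) {
        int a = mem_in[i];          /* base processor: load  */
        int b = mem_in[i + 1];      /* base processor: load  */
        int r = coproc_step(a, b);  /* coprocessor: compute  */
        mem_out[stored++] = r;      /* base processor: store */
    }
    return stored;                  /* number of results stored */
}
```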
Custom Coprocessor Integration
[Diagram: the custom coprocessor sits inside the base processor, sharing the GPR and tapping the ID and WB pipeline stages via CoReg and CoDone signals]
Custom Coprocessor Creation Methodology
[Flowchart: the application written in C / C++ undergoes initial profiling to find critical functions and loops; a heuristic algorithm selects the segments of code to be handled by the coprocessor; the coprocessor segments are generated in source code (assembly code); the instruction set is reduced in ASIPmeister using a preconfigured processor library (ASIPmeister libraries etc.) and the custom coprocessor architectural model; the ICOP creation methodology and the custom coprocessor integration methodology then yield a set of coprocessors, a list of processes & loop control, and the completed CPU and program]
Verification Methodology
• The C/C++ program is run through hardware simulation (ModelSim) and software simulation (SimpleScalar)
• The memory dump file and execution time produced by both simulations should be identical
• The same method is applied for verification of the ICOP architecture
• Sim-hexbin (a program developed for this work) is used to obtain an output file from the dump file for comparison purposes
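The core dump comparison can be sketched as follows; the function name and signature are assumptions for illustration, not the actual sim-hexbin interface.

```c
#include <assert.h>
#include <stddef.h>

/* Compare the memory dump from the hardware simulation (ModelSim)
 * against the one from the software simulation (SimpleScalar).
 * Returns -1 if the dumps are identical, otherwise the offset of
 * the first mismatching byte. */
long first_mismatch(const unsigned char *hw_dump,
                    const unsigned char *sw_dump, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (hw_dump[i] != sw_dump[i])
            return (long)i;
    return -1;
}
```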
Loop Identification
[Chart: Critical Loops, percentage of execution time vs image size (K pixels) for the loops at jcphuff.c:643, jcdctmgr.c:232, jchuff.c:766, jchuff.c:684, jchuff.c:673, jfdctint.c:220 and jfdctint.c:155]
How did we fare?
• Detected the (same as previous) loop hotspot in the cjpeg program
• Created the coprocessor using the Custom Coprocessor Methodology
• Simulated using ModelSim
• Synthesized with the tcbn90gwc technology libraries through the Synopsys Design Compiler
• Given a 10 ns clock constraint: 166.9 MHz (1 GHz possible); 16,203 µm²; 6,698 NAND gates
• Has the potential to require less area
[Chart: Loop Execution, clock cycles (0 to 4.5 M cycles) for the original program, the HLS coprocessor and the custom coprocessor]
[Chart: Program Execution, clock cycles (17 to 22 M cycles) for the original program, the HLS coprocessor and the custom coprocessor]
HLS vs Custom Coprocessor
[Chart: Loop Energy Consumption, energy (µJ, 0 to 180) for the base processor, the HLS coprocessor and the custom coprocessor]
[Chart: Processor Size, NAND gates (0 to 80,000) for the base processor, the HLS coprocessor and the custom coprocessor]
[Chart: Coprocessor Size, NAND gates (0 to 7,000) for the HLS coprocessor and the custom coprocessor]
Memory Latency Effect
Loop Improvement (%) vs memory access latency (I-cache / D-cache cycles):

  Latency           0/0     18/2    36/4    54/6    72/8    90/10   108/12  126/14
  ICACHE            159.58  156.66  156.92  156.33  156.50  155.62  155.72  154.91
  DCACHE            159.58  103.72  69.25   52.17   42.11   35.48   30.78   27.20
  ICACHE & DCACHE   159.58  103.92  69.07   52.20   42.03   35.44   30.76   27.17
Memory Latency Effect
Overall Improvement (%) vs memory access latency (I-cache / D-cache cycles):

  Latency           0/0    18/2   36/4   54/6   72/8   90/10  108/12  126/14
  ICACHE            13.97  8.70   6.44   5.22   4.44   3.91   3.52    3.22
  DCACHE            13.97  10.38  7.39   5.57   4.37   3.51   2.95    2.50
  ICACHE & DCACHE   13.97  7.61   5.07   3.89   3.21   2.76   2.46    2.23
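A sketch of how such an improvement percentage could be computed from cycle counts. The definition relative to the new cycle count is an assumption, chosen because the loop figures exceed 100%; the exact definition used in the charts is not stated in the source.

```c
#include <assert.h>

/* improvement measured relative to the new cycle count, so that
 * halving the cycles gives a 100% improvement */
double improvement_pct(double orig_cycles, double new_cycles) {
    return (orig_cycles - new_cycles) / new_cycles * 100.0;
}
```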
Input Data Behaviour
[Chart: Critical Loops in CJPEG, percentage of execution time vs image size (0 to 1,200,000 pixels) for the loops at jcphuff.c:643, jcdctmgr.c:232, jchuff.c:766, jchuff.c:684, jchuff.c:673, jfdctint.c:220 and jfdctint.c:155]
Future Work
• Formalize the methodology
• More concrete model of the coprocessor
• Model to predict performance improvement
• Be able to decide when the ICOP architecture is feasible
• Analyze performance improvements on a variety of benchmark applications