SPARK
Accelerating ASIC designs through parallelizing high-level synthesis
Sumit Gupta
Rajesh Gupta
©2003 Spark Team, Confidential 2
Outline
The target
The problem
The technology
The competition
The market opportunity
The people
The status
The plan
A Chip Is A Wonderful Thing!
A typical chip, circa 2006:
– 50 square millimeters
– 50 million transistors
– 1-10 GHz, 100-1000 MOP/sq mm, 10-100 MIPS/mW
– 300 mm wafers, 10,000 units/wafer, 20K wafers/month
– $5 per part
It does not matter what you build: processor, MEMS, networking, wireless, memory
But it takes $20M to build one today, going to $50+M
So there is a strong incentive to port your application, system, box to the “chip”
But Design Decisions Matter!
Technical Target
Anyone and everyone with a technology IP to grind (build on-chip)
– E.g., WLAN, cellphone chips:
• about 50 GOPS in baseband (BB) processing
– and about 72 other application ‘markets’ enhanced by ASIC/FPGA parts
More technically
– Behavioral descriptions with complex and nested conditionals and loops.
The Problem
Doing chip design in a system house is an increasingly costly proposition
– Case study: Conexant's 802.11a chip
• 9 months from PRD to parts
• 7 months from PRD to synthesizable RTL
• The pain is in getting the algorithm right for the chip implementation
Would love a “compiler”, but “push-button” tools just do not work.
Enter High-Level Synthesis
[Figure: design-flow diagram. Blocks: Task Analysis, HW/SW Partitioning, Hardware Behavioral Description, Software Behavioral Description, High-Level Synthesis, Software Compiler. Target components: ASIC, Processor Core, Memory, FPGA, I/O.]
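Poor QOR, even Poor Controllability
[Figure: generated architecture with Memory, ALU, Control, and Data path blocks, shown next to the control-data flow graph of the fragment below. In the graph, x = a + b and c = a < b feed an If node on c; the T branch holds d = e - f, the F branch holds g = h + i; j = d * g and l = e + x follow.]
x = a + b;
c = a < b;
if (c) then d = e - f;
else g = h + i;
j = d * g;
l = e + x;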
The Technology: Enter SPARK
[Figure: SPARK flow from C Input to VHDL Output. Stages: Source-Level Compiler Transformations, Original CDFG, Optimized CDFG, Scheduling & Binding, Scheduling Compiler & Dynamic Transformations.]
By the time you get to the CDFG, it is already too late
Parallelize (judiciously) and submerge it with HLS.
Why SPARK, Why Now?
The chip designer is finally
– letting go of the cycle boundary in design
– being replaced by non-chip types
Education and awareness through
– Synopsys Behavioral Compiler
– But not ready to be the dominator…
SPARK changes the landscape
– Parallelizing compilation as the ‘power tool’
SPARK Core Strengths
Focus on
– Transformations that increase the amount of parallelism available in the source description
– Tight integration with parallelizing compiler transformations
Provide an HLS toolbox for the micro-architect
– Fire the circuit designer.
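Note: one way to picture a parallelism-increasing transformation is speculation applied to the conditional fragment from the earlier "Poor QOR" slide. The sketch below is an illustration in plain C, not SPARK output. The `Result` struct, the function names, and passing the incoming values of `d` and `g` as parameters are assumptions made so the fragment compiles on its own.

```c
#include <stdbool.h>

/* Hypothetical container for the two values the fragment produces. */
typedef struct {
    int j;
    int l;
} Result;

/* Baseline: the subtract or the add runs only after the compare
 * result c is known, so both are control-dependent on c. */
Result fragment_baseline(int a, int b, int e, int f, int h, int i,
                         int d, int g)
{
    int x = a + b;
    bool c = a < b;
    if (c) {
        d = e - f;
    } else {
        g = h + i;
    }
    Result r = { d * g, e + x };
    return r;
}

/* Speculated: both branch operations are computed unconditionally,
 * so they no longer wait for the compare; c only selects which
 * result is committed. */
Result fragment_speculated(int a, int b, int e, int f, int h, int i,
                           int d, int g)
{
    int x = a + b;
    bool c = a < b;
    int d_spec = e - f;   /* speculative: discarded when c is false */
    int g_spec = h + i;   /* speculative: discarded when c is true  */
    d = c ? d_spec : d;   /* multiplexer-like commit */
    g = c ? g : g_spec;
    Result r = { d * g, e + x };
    return r;
}
```

Both functions return the same values for the same inputs. The speculated form trades an extra operation per evaluation for the freedom to schedule the subtract, the add, and the compare in the same cycle.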
The POC and The Experiments
Intel ILD design
– Produced a design that fundamentally restructures the input description (the way a designer would, and no tool could)
Bunch of other media benchmarks
– 40-70% improvement in delay for the same area
– Based on Synopsys backend
See appendix.
The Market Opportunity
The big picture
– Semi is $140B, Fabless Semi is $15B
– EDA currently is about $4B
Current EDA market
– $1B Synthesis and verification
• $400M synthesis, $400M verification, $200M E.
– $3B in PDA, IP and Design Services. $400M Synthesis
– 90% is RTL and below. Market movement and ‘structural’ changes.
Future ESL and Synthesis Market
Keys to growth
– ASIC focus (including structured ASICs)
– ‘Power tool’ key to commanding high ASPs
Challenge
– The raid of the FPGAs
• In which case, PHLS will be OEM’d
– ASICs mired in Nano swamp
• Attention shifts to PDA, stationary semi market
The Competition
The early educator: Synopsys BC
– Classical HLS that just does not work, fundamentally flawed
The improviser: Cadence Get2Chip A2C
– Done a good job at RTL
The others
– Celoxica, Forte, Synfora, BlueSpec
– “Boutiques” primarily targeted for “somebody else”
The Competition
Vendor | Tool | Approach | Limitation
Synopsys | Behav. Compiler | Traditional HLS: synthesis from subset of SystemC and Behav VHDL | No parallelizing and beyond basic block (BBB) transformations
Cadence/Get2Chip | A2C | Traditional HLS; closely tied to logic synthesis | No parallelizing and BBB trafos
Celoxica | DK Design Suite | Uses explicitly parallelized input in Handel-C; traditional HLS | No pure behavioral input such as C or SystemC
Forte DS | Cynthesizer | Traditional HLS from SystemC with design space exploration | No parallel and BBB trafos
Synfora | NA | Maps applications to a VLIW processor and a pipelined array of processors – uses parallelizing transformations in VLIW compiler | Does not do HLS at all – it’s more of a mapping tool from C to a processor array
BlueSpec | NA | Based on term rewriting systems; starts from a description closer to RTL than to behav | Not HLS – input is behav code already scheduled into states
What Do We Want To Do?
Make it accessible to SystemC, SystemVerilog
– Front end architecture to port it across
Implement missing compiler passes
– Really standard stuff but missing piece now
Work out a design flow
– Build a path to existing RTL flow incl. validation
Industry strength characterization
Secure IP rights
Synergistic Activities
SPARK release on the web
– Mailing list
– Build the users group
– Expand to SystemC User Community
Kluwer book in preparation
– Announcement at DATE, Feb 2004
– Availability at DAC, June 2004
Exit Strategy
Not yet worked out, but…
Build a stand-alone EDA company
– As a standalone it would not work unless complemented by verification
Build to be bought
– As an HLS company
License technology
– Companies that have shown interest in licensing it
• Poseidon Systems, Cadence
SPARK History
A joint project
– Rajesh Gupta, Nikil Dutt, Alex Nicolau
Kicked off in Fall 1999
– First Ph.D., Sumit Gupta, 2003
Supported by
– Semiconductor Research Corporation (SRC)
– Intel grant as a match to UC Micro
– National Science Foundation.
Copyright Sumit Gupta 2003
Case Study: Intel Instruction Length Decoder
[Figure: a Stream of Instructions enters the Instruction Length Decoder; the Instruction Buffer is shown divided into First Insn, Second Insn, and Third Instruction.]
ILD Synthesis: Resulting Architecture
[Figure: a Multi-cycle Sequential Architecture is turned into a Single cycle Parallel Architecture. Transformations applied: Speculate Operations, Fully Unroll Loop, Eliminate Loop Index Variable.]
Our toolbox approach enables us to develop a script to synthesize applications from different domains
Final design looks close to the actual implementation done by Intel
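Note: the three transformations named on this slide (speculate operations, fully unroll the loop, eliminate the loop index variable) can be sketched on a toy length decoder. This is a simplified illustration, not the Intel ILD and not SPARK's generated design. The 4-byte buffer, the `insn_length` rule, and both function names are assumptions.

```c
#include <stddef.h>

#define BUF_BYTES 4   /* hypothetical buffer size, kept tiny for clarity */

/* Hypothetical length rule (NOT the Intel encoding): the low two bits
 * of an instruction's first byte give its length, 1 to 4 bytes. */
static int insn_length(unsigned char first_byte)
{
    return (first_byte & 0x3) + 1;
}

/* Sequential form: each iteration needs the previous length to know
 * where the next instruction starts, so the loop is inherently
 * multi-cycle. Sets start[k] = 1 where an instruction begins. */
void mark_starts_sequential(const unsigned char buf[BUF_BYTES],
                            int start[BUF_BYTES])
{
    for (size_t k = 0; k < BUF_BYTES; k++) {
        start[k] = 0;
    }
    size_t pos = 0;                       /* loop index variable */
    while (pos < BUF_BYTES) {
        start[pos] = 1;
        pos += (size_t)insn_length(buf[pos]);
    }
}

/* Parallel form, written out for BUF_BYTES == 4: the loop is fully
 * unrolled, the index variable is gone, and a length is computed
 * speculatively at every byte as if an instruction started there. */
void mark_starts_parallel(const unsigned char buf[BUF_BYTES],
                          int start[BUF_BYTES])
{
    int len0 = insn_length(buf[0]);       /* speculative lengths */
    int len1 = insn_length(buf[1]);
    int len2 = insn_length(buf[2]);
    /* No length is needed at byte 3: an instruction starting there
     * cannot produce another start inside the buffer. */

    start[0] = 1;
    start[1] = start[0] && len0 == 1;
    start[2] = (start[0] && len0 == 2) || (start[1] && len1 == 1);
    start[3] = (start[0] && len0 == 3) || (start[1] && len1 == 2) ||
               (start[2] && len2 == 1);
}
```

The parallel form contains only combinational selects on speculatively computed lengths, which is the kind of structure that can collapse into a single cycle. The cost is that some of the length computations are thrown away.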
Target Applications
Design | # of Ifs | # of Loops | # Non-Empty Basic Blocks | # of Operations
MPEG-1 pred1 | 4 | 2 | 17 | 123
MPEG-1 pred2 | 11 | 6 | 45 | 287
MPEG-2 dp_frame | 18 | 4 | 61 | 260
GIMP tiler | 11 | 2 | 35 | 150
Scheduling & Logic Synthesis Results
[Figure: two bar charts, MPEG-1 Pred1 Function and MPEG-1 Pred2 Function, on a normalized axis from 0 to 1.2. Metrics: Longest Path (l cyc), Critical Path (c ns), Total Delay (c*l), Unit Area. Bar series: Non-speculative CMs: Within BBs & Across Hier Blocks; + Speculative Code Motions; + Pre-Synthesis Transforms; + Dynamic CSE. Percentage annotations on the charts: 42%, 10%, 36%, 36%, 8%, 39%.]
Overall: 63-66 % improvement in Delay
Almost constant Area
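Note: the charts label total delay as c*l, the critical path in ns times the longest path in cycles. A minimal sketch of that arithmetic, and of an improvement figure computed against a baseline, is below. The function names are assumptions, and the deck does not state exactly how its percentages were derived.

```c
/* Total delay in ns: critical path c (ns) times longest path l (cycles). */
double total_delay_ns(double c_ns, double l_cycles)
{
    return c_ns * l_cycles;
}

/* Fractional reduction in total delay of an optimized design relative
 * to a baseline design (0.5 means the delay was halved). */
double delay_improvement(double base_c_ns, double base_l_cycles,
                         double opt_c_ns, double opt_l_cycles)
{
    double base = total_delay_ns(base_c_ns, base_l_cycles);
    double opt = total_delay_ns(opt_c_ns, opt_l_cycles);
    return (base - opt) / base;
}
```

With made-up numbers (not taken from the charts), a baseline of c = 10 ns and l = 100 cycles against an optimized c = 12.5 ns and l = 40 cycles gives `delay_improvement(10.0, 100.0, 12.5, 40.0) == 0.5`. The critical path can grow while total delay still drops, as long as the cycle count falls faster.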
Scheduling & Logic Synthesis Results
[Figure: two bar charts, MPEG-2 DpFrame Function and GIMP Tiler Function, on a normalized axis from 0 to 1.2. Metrics: Longest Path (l cyc), Critical Path (c ns), Total Delay (c*l), Unit Area. Bar series: Non-speculative CMs: Within BBs & Across Hier Blocks; + Speculative Code Motions; + Pre-Synthesis Transforms; + Dynamic CSE. Percentage annotations on the charts: 14%, 20%, 1%, 33%, 41%, 52%.]
Overall: 48-76 % improvement in Delay
Almost constant Area