Industrial Experiences Pioneering Asynchronous Commercial Design
Peter A. Beerel
Fulcrum Microsystems
Calabasas Hills, CA, USA

Agenda
• Introduction to Fulcrum
• Description of Integrated Pipelining
  - Fulcrum’s clockless circuit architecture
• Description of Fulcrum’s Design Flow
• Overview of Nexus
  - Fulcrum’s Terabit crossbar
• Overview of PivotPoint
  - Fulcrum’s first commercial product

Company Snapshot
• “Clockless” semiconductor company
• Located in Calabasas, CA (30 people)
• Technology proven in large-scale designs
• Backed by top-tier investors (raised $14M in June)
• Formed out of Caltech (1/00)

Fulcrum’s Integrated Pipelining
• Robust, power efficient, and high performance
• Fast delay-insensitive style using domino logic without latches (developed at Caltech by Fulcrum’s founders)
[Figure: three Dual-Rail Domino Logic stages with Acknowledge signals]

Integrated Pipelining
• Harnessing the power of domino logic
  - Addresses delay variability with completion sensing
  - Addresses power inefficiency with async handshakes
  - Leverages more efficient “N” transistors
[Figure: Leaf Cells A, B, and C, each a Dual-Rail Domino Logic block with Control, plus Input Completion Detection and Output Completion Detection]
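Completion sensing on dual-rail signals can be illustrated in software. The following is a minimal sketch, not Fulcrum's circuit: it assumes the common dual-rail convention in which each bit uses a "true" rail and a "false" rail, both rails low means neutral (precharged), and exactly one rail high means valid data. The function names and the rail ordering are hypothetical.

```python
from typing import List, Tuple

Rail = Tuple[int, int]  # (true_rail, false_rail); ordering is an assumption
NEUTRAL: Rail = (0, 0)  # both rails low: no data present


def encode_bit(bit: int) -> Rail:
    """Encode one logical bit on two rails."""
    return (1, 0) if bit else (0, 1)


def is_valid(rails: Rail) -> bool:
    """A bit is valid when exactly one rail is high."""
    if rails == (1, 1):
        raise ValueError("both rails high is not a legal dual-rail state")
    return rails in ((1, 0), (0, 1))


def completion_detected(word: List[Rail]) -> bool:
    """Completion is signalled only once every bit of the word is valid."""
    return all(is_valid(r) for r in word)


def all_neutral(word: List[Rail]) -> bool:
    """The word has returned to neutral when every bit is neutral."""
    return all(r == NEUTRAL for r in word)
```

Because validity is carried by the data itself, a stage can tell when its inputs have arrived without reference to a clock, which is how completion sensing absorbs delay variability.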

Hierarchical Design
• Multi-level hierarchy of communicating blocks
• At each level blocks communicate along channels
[Figure, level 1: the ASIC]
[Figure, level 2: Main FSM, Register Bank, Memory, Adder/Mult., Subtract/Divider]
[Figure, level 3: Reg A, Reg B, Reg C, Adder, Multiplier; cells BN-1, BN-2, BN-3 and FAN-1, FAN-2, FAN-3, FA0; labels: channels, leaf cells]

Leaf Cells
• Definition
  - Smallest block that performs logic and communicates via channels
  - Based on small number of pipeline templates guiding design
  - Forms basic building block for physical design
• Features
  - Facilitates high throughput and low latency
  - Provides easy timing validation and analog verification
  - ~1,000 digital leaf cell types compose our leaf cell library
  - ~200 additional subtypes for different environments (e.g., loads)
[Figure: leaf cell with blocks labeled F, RCD, LCD, and C]

Template-Based Cell Design
• Each pipeline style (QDI, timed…) has a different blueprint
• Library uses a blueprint to implement the lowest level blocks
[Figure: blueprint for a QDI N-input M-output pipeline stage (blocks RCD, F, LCD, C)]
[Figure: 2-input 1-output pipeline stage (blocks RCD, F, C, and two LCD)]
[Figure: 1-input 2-output pipeline stage (blocks F, LCD, C, and two RCD)]

Summary of Characteristics
• Delay-insensitive timing model
  - Gates and wires can have arbitrary delays
• 4-phase 1of4 handshake
  - Uses 4 wires to send 2 bits
  - Plus an acknowledge wire for flow control
  - Returned to neutral between each data transfer
  - Self shielding
• Precharge domino logic plus async handshake
• Low latency; high frequency; robust
• Auto power conservation; zero standby power
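The 1of4 encoding and the four-phase handshake can be sketched as follows. This is an illustrative model under stated assumptions, not Fulcrum's implementation: it assumes the 2-bit value selects which of the four wires goes high, and that the acknowledge wire is active high (real designs may use an inverted "enable" instead). All function names are hypothetical.

```python
from typing import List, Optional, Tuple

Wires = Tuple[int, int, int, int]
NEUTRAL: Wires = (0, 0, 0, 0)


def encode_1of4(value: int) -> Wires:
    """Send 2 bits by raising exactly one of four wires."""
    if not 0 <= value <= 3:
        raise ValueError("a 1of4 symbol carries exactly 2 bits (0..3)")
    return tuple(1 if i == value else 0 for i in range(4))


def decode_1of4(wires: Wires) -> Optional[int]:
    """Return the 2-bit value, or None while the wires are neutral."""
    if sum(wires) == 0:
        return None
    if sum(wires) != 1:
        raise ValueError("more than one wire high is not a legal 1of4 state")
    return wires.index(1)


def four_phase_transfer(value: int) -> List[Tuple[Wires, int]]:
    """List the (data wires, acknowledge) states of one transfer."""
    data = encode_1of4(value)
    return [
        (NEUTRAL, 0),  # idle: wires neutral, acknowledge low
        (data, 0),     # phase 1: sender raises one wire
        (data, 1),     # phase 2: receiver raises acknowledge
        (NEUTRAL, 1),  # phase 3: sender returns the wires to neutral
        (NEUTRAL, 0),  # phase 4: receiver lowers acknowledge
    ]
```

Only one wire switches per symbol, and every transfer ends back in the neutral state, which matches the "returned to neutral between each data transfer" property listed above.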

Fulcrum Design Flow
• Hierarchical design flow
  - Executable specifications
  - Formal decomposition
  - Creates design hierarchy
• Semi-custom synthesis & layout
  - Hierarchical floor planning
  - Automated transistor sizing
  - Semi-automated physical design
• Supports synchronous & asynchronous designs
  - Hard macro from place & route
[Figure: design flow stages: Design Specification; Architecture Design & Verification; Micro-architecture Design & Verification; Synthesis & Floor Planning; Physical Design; Database Release to Manufacturing; with Mitered Simulation & Verification alongside]

Managing Design Hierarchy
• Proprietary object-oriented hardware language
  - Integrated hierarchical design/verification language
• Defines cell specification & implementation
  - Specification: Java or communicating sequential processes (CSP)
  - Implementation: multiple forms
    - Sub-cells
    - Sub-cells defined in terms of specification or implementation
• Defines integrated test environment for each cell
  - Enables verification at all pairs of levels
• Efficiency features
  - Supports refinement of cells and channels

Physical Design
• Layout hierarchy based on design hierarchy
  - Hierarchical floor-planning semi-automated
  - Large scale hand placement before sizing
  - Long distance channels planned carefully
• Timing closure by construction
  - Placement drives sizing
  - Can insert extra pipelining on long wires late in design
• Tradeoffs between performance and design time
  - Hand layout where necessary
  - Automated layout where possible
• Goals
  - Full-custom density and speed within ASIC design time

Design Verification: System-Level
• Mission
  - Verify that executable spec = written spec + gate-level model
• Use industry-standard tools & methods
  - Cadence NCSIM and efficient Java-Verilog interface
  - Directed random testing
  - Line & functional coverage
[Figure: Test Bench (Test Cases, Traffic Generator & Checker, Configuration Manager, Monitor, Bus Functional Model) and Device Under Test (Executable Spec, Gate-level Verilog Model)]

Design Verification: Unit-Level
• Mitered co-simulation for unit-level verification
  - Check correctness of digital model by comparing it to golden CSP/Java model
• Features
  - Framework automated and regressed
  - Checks correctness
  - Checks delay insensitivity and/or throughput and latency
[Figure: Test Engine with a high-level model (Java/CSP) and a low-level model (CSP/PRS/CDL); blocks labeled Log, ==, and Copy]
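The idea of mitered co-simulation, copying the same stimulus to a golden model and a lower-level model and comparing what each produces, can be sketched in a few lines. This is a hypothetical stand-in for the flow described above: the 8-bit adder, both model functions, and the `miter` helper are invented for illustration, whereas the real flow compares Java/CSP models against CSP/PRS/CDL descriptions.

```python
import random
from typing import Callable, Iterable, List, Tuple


def golden_adder(a: int, b: int) -> int:
    """High-level (golden) model: 8-bit addition with wraparound."""
    return (a + b) & 0xFF


def lowlevel_adder(a: int, b: int) -> int:
    """Lower-level model: the same function as a bitwise ripple-carry adder."""
    result, carry = 0, 0
    for i in range(8):
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result


def miter(golden: Callable[..., int], dut: Callable[..., int],
          stimuli: Iterable[Tuple[int, ...]]) -> List[tuple]:
    """Copy each stimulus to both models and record every disagreement."""
    mismatches = []
    for args in stimuli:
        expected, actual = golden(*args), dut(*args)
        if expected != actual:
            mismatches.append((args, expected, actual))
    return mismatches


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a regression run is repeatable
    stimuli = [(rng.randrange(256), rng.randrange(256)) for _ in range(200)]
    print("mismatches:", miter(golden_adder, lowlevel_adder, stimuli))
```

An empty mismatch list means the two models agreed on every stimulus applied; it does not by itself show delay insensitivity or throughput, which the slide lists as separate checks.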

Analog Verification: Charge Sharing
• SPICE-based charge sharing analysis
• Test case generation and analysis automated
• Charge-sharing problems solved in numerous ways
  - Symmetrization
  - Less transistor sharing
  - Delay perturbations
[Figure: Synthesis, Charge Sharing Test Generator, SPICE]

Synthesis: Gate Generation / Sizing
• Automated generation of transistor netlists
• Dynamic logic generation
  - Transistor sharing
  - Symmetrization
  - Gate-library matching
• Transistor sizing
  - Path-based sizing to meet amortized unit-delay model
• Micro-architecture feedback
  - Identifies where fanout limits performance
[Figure: CSP, Gate Library, Floor planning Information, Logic Synthesis, Transistor Sizing, CDL Netlist]

Fulcrum QDI v. Synchronous Flows
• Save clock tree design, analysis, optimization, and verification
• No timing closure problems
  - Unexpected long-wire bottlenecks easily solved with additional pipeline buffers late in design cycle
• QDI/DI timing model reduces timing analysis challenges
• Fulcrum QDI hierarchical design facilitates:
  - Composability, re-use, and early bug detection
• Hierarchical floorplanning improves predictability of wires
• Template-based leaf cell designs simplify logic design
• Design reuse reduces criticality of high-level synthesis
• Decomposition methodology amenable to formal verification

Globally Asynchronous, Locally Synchronous
• SoC designs: many cores with different clock domains
• Async circuits can interconnect multiple sync cores in an SoC design, eliminating global clock distribution and simplifying clock domain crossing
• Fulcrum’s “Nexus” is a high speed on-chip interconnect:
  - 16 port, 36 bit asynchronous crossbar
  - Asynchronous cross-chip channels
  - Async-sync clock domain converters
  - Runs at 1.35GHz in 130nm process

Nexus System-on-Chip Interconnect
• Non-blocking crossbar
• 16 full-duplex ports
• Flow control extends through the crossbar
• Full speed arbitration
• Arbitrary-length “bursts”
• Bridges clock domains
• Scales in bit width and ports
• Process portable
[Figure: Generic Nexus Example. Legend: synchronous IP block, asynchronous IP block, pipelined repeater, clock domain converter]

Nexus Burst Format
[Figure: burst leaving the Source Module (incoming from source): To, then data words D1, D2, D3, …, DN; burst arriving at the Target Module (outgoing to target): From, then data words D1, D2, D3, …, DN. The tail bit is 0 on D1, D2, D3 and 1 on DN.]
• Data: 36 bit
• Tail: 1 bit
• Control: 4 bit
• Arbitrary-length source-routed bursts provide flexibility
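The burst format can be modelled with a small data structure. This is a sketch under assumptions, not the Nexus specification: it assumes the 4-bit control field carries the "To" port on the way into the crossbar and is replaced by the "From" port on the way out, and the class and function names are hypothetical. The field widths (36-bit data, 1-bit tail, 4-bit control) and the tail bit marking the last word come from the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple

DATA_BITS = 36     # data word width from the slide
CONTROL_BITS = 4   # control field width from the slide


@dataclass
class Burst:
    control: int                  # assumed: "To" port going in, "From" port coming out
    words: List[Tuple[int, int]]  # (36-bit data, 1-bit tail)


def make_burst(to_port: int, data: List[int]) -> Burst:
    """Build an arbitrary-length burst; only the last word has tail = 1."""
    if not 0 <= to_port < 2 ** CONTROL_BITS:
        raise ValueError("port does not fit in the control field")
    if not data:
        raise ValueError("a burst needs at least one data word")
    if any(not 0 <= d < 2 ** DATA_BITS for d in data):
        raise ValueError("data word does not fit in 36 bits")
    last = len(data) - 1
    return Burst(to_port, [(d, 1 if i == last else 0) for i, d in enumerate(data)])


def route(burst: Burst, from_port: int) -> Tuple[int, Burst]:
    """Model the crossbar: pick the target from "To", then deliver with "From"."""
    target = burst.control
    return target, Burst(from_port, list(burst.words))
```

For example, `route(make_burst(5, [10, 20, 30]), from_port=2)` delivers to port 5 a burst whose control field now reads 2, so the target knows where the data came from.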

Sync-to-Async Conversion
• Synchronous Request / Grant FIFO protocol
  - Data transferred if request and grant both high on rising edge of clock
  - Compensates for any skew on asynchronous side
  - Low latency: 1/2 to 3/2 clock cycles at A2S
• Seamlessly bridges different clock domains
[Figure: S2A converter between a Synchronous Datapath (Request, Grant, clock) and an Asynchronous Datapath; A2S converter between an Asynchronous Datapath and a Synchronous Datapath (Request, Grant, clock)]
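The transfer rule of the request/grant protocol, that a word moves only on a rising clock edge where request and grant are both high, can be sketched as a cycle-by-cycle model. The function name and the list-based representation of the sampled signals are hypothetical; the sketch does not model the converter's latency or the asynchronous side.

```python
from collections import deque
from typing import Iterable, List, Sequence, Tuple


def request_grant_transfers(words: Iterable[str],
                            request: Sequence[int],
                            grant: Sequence[int]) -> List[Tuple[int, str]]:
    """Return (cycle, word) for each rising edge on which a word transfers.

    request[i] and grant[i] are the levels sampled at rising edge i.
    """
    queue = deque(words)
    transferred = []
    for cycle, (req, gnt) in enumerate(zip(request, grant)):
        # A word moves only when both sides agree on this clock edge.
        if req and gnt and queue:
            transferred.append((cycle, queue.popleft()))
    return transferred


if __name__ == "__main__":
    print(request_grant_transfers(["a", "b", "c"],
                                  request=[1, 1, 0, 1, 1],
                                  grant=[0, 1, 1, 1, 0]))
```

In this run only edges 1 and 3 have both signals high, so two of the three words transfer and the third stays queued.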

Arbitration and Ordering
• Unrelated sender/receiver links are independent
• Bursts sent from multiple input ports to the same output port are serviced fairly by built-in arbitration circuitry
• Bursts from A to B remain ordered
• Producer-consumer and global-store-ordering satisfied
  - A sends X to B, A notifies C, C can read X from B
  - A writes X to B, A writes Y to C, if D reads Y from C, it can read X from B
• Split transactions implement loads
  - Load request and load completion bursts
  - Load completions returned out-of-order
• Can tunnel common bus and cache coherence protocols

Example: Load/Store Systems
• Option 1: Pure Master/Target Ports
  - Masters send Requests to Targets, which may return Completions
  - Each port must either be a Master or a Target so that Completions are never blocked by Requests
  - Devices which need to be both Masters and Targets are given two separate full-duplex ports
  - Could use two separate Nexus crossbars
• Option 2: Peers
  - Modules which are both Masters and Targets implement an internal buffer to hold Requests so that Completions can bypass them
• All Masters or Peers restrict number of outstanding Requests to avoid overflowing Request buffers

Example: Switch Fabric
• Each module maintains input/output queues for traffic to/from each other module
• Data is sent from an input queue to an output queue over Nexus as a series of short bursts
• Flow control credits for each output queue are sent backward
• Eliminates head-of-line blocking
• Segmentation, buffering, and overspeed optimize performance during congestion
• Used in PivotPoint, Fulcrum’s first chip product.
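Credit-based flow control between an input queue and one output queue can be sketched as below. This is a generic illustration of the technique named on the slide, not PivotPoint's actual design: the class name, the one-credit-per-burst accounting, and the queue depth are assumptions.

```python
from collections import deque


class CreditedLink:
    """One sender-to-output-queue link guarded by flow control credits."""

    def __init__(self, output_queue_slots: int) -> None:
        # The sender starts with one credit per free slot in the output queue.
        self.credits = output_queue_slots
        self.output_queue: deque = deque()

    def try_send(self, burst: str) -> bool:
        """Send a burst only if a credit is available; never overflow."""
        if self.credits == 0:
            return False  # sender holds the burst in its input queue
        self.credits -= 1
        self.output_queue.append(burst)
        return True

    def drain_one(self) -> str:
        """The output side consumes a burst and sends one credit backward."""
        burst = self.output_queue.popleft()
        self.credits += 1
        return burst
```

Because credits are tracked per output queue, a full queue toward one module stalls only the traffic bound for that module, which is why this scheme avoids head-of-line blocking.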

Nexus Silicon Validation
[Figure: plot of Nexus crossbar]
[Figure: block diagram of Nexus Validation Chip (ALU, S1 through S7, Serial IO)]

TSMC 130nm LV Results
Proc    V     GHz    ns    pJ/bit
Low-K   1.2   1.35   2.0   10.4
Low-K   1.0   1.11   2.4    7.0
FSG     1.2   1.10   2.5   11.2
FSG     1.0   0.87   3.1    7.6

• Crossbar area: 1.75mm^2
• Total interconnect area: 4.15mm^2
• Peak cross-section bandwidth: 778Gb/s
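The quoted peak bandwidth is consistent with a simple product of the figures given elsewhere in the deck. The sketch below assumes that "peak cross-section bandwidth" means ports × data bits × clock frequency, which is an interpretation rather than something the slides state; the function name is hypothetical.

```python
PORTS = 16       # crossbar ports, from the deck
DATA_BITS = 36   # data width per port, from the deck


def peak_bandwidth_gbps(ports: int, data_bits: int, freq_ghz: float) -> float:
    """Assumed definition: every port moves one data word per cycle."""
    return ports * data_bits * freq_ghz


if __name__ == "__main__":
    # Frequencies from the TSMC 130nm LV results table.
    for proc, volts, ghz in [("Low-K", 1.2, 1.35), ("Low-K", 1.0, 1.11),
                             ("FSG", 1.2, 1.10), ("FSG", 1.0, 0.87)]:
        gbps = peak_bandwidth_gbps(PORTS, DATA_BITS, ghz)
        print(f"{proc} {volts}V: {gbps:.1f} Gb/s")
```

At 1.35 GHz this gives 16 × 36 × 1.35 = 777.6 Gb/s, which rounds to the 778 Gb/s on the slide.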

Nexus Summary
• Nexus is an asynchronous crossbar interconnect designed to connect up to 16 synchronous modules in a SoC
• Nexus can be used to implement load/store systems as well as switch fabrics
• Systems using Nexus can be tested with standard equipment
• Nexus runs up to 1.35GHz in TSMC 130nm
• Asynchronous interconnect is now viable for very high performance SoC designs

PivotPoint Blade Interconnect
• World’s first high-performance clockless chip
• Large-scale SoC design
  - >32.5M transistors (83% async)
  - 14 separate clock domains
• Includes key Fulcrum IP
  - Nexus Terabit Crossbar
  - Quad-port 600MHz async SRAM
• Operates at over 1GHz
• Delivers 192Gbps of non-blocking switching capacity
• Testable via standard tools
  - JTAG; scan chain
• Activity-based power scaling
• 9-month project
[Figure: Generic System “Blade”: X8, SPI-4, I/O (Phy/MAC), Backplane Interface, and four CPU/NPU/ASIC/FPGA blocks]

PivotPoint Leverages Nexus
• Flexible architecture
  - 6 duplex SPI-4.2 interfaces
  - All paths are independent
• Optimized for performance
  - Up to 14.4Gbps per interface
  - Up to 32Gbps per Nexus port
  - Full-rate buffer memories
  - Lossless flow control
• Easily configurable
  - 16-bit CPU interface
  - JTAG support
• Modest size and power
  - ~2 Watt per active interface
  - 1036 ball package
• 3ns latency
• A true SoC GALS design
[Figure: block diagram with six groups, each showing two SPI-4 / 16KB Buffer blocks and a Route Table; Control Bus (Serial Tree); CPU Interface; JTAG Interface; Boundary Scan]

Testing – A Multi-Dimensional Approach
• DFT
  - Synchronous scan chains for synchronous logic
  - Asynchronous scan-chain-like structures for asynchronous logic and sync-async interfaces
  - Standardized JTAG interface for testing
• Fault-Grading
  - Verilog fault-model for domino logic
  - Industry-standard fault grading tools
• BIST
  - Use Nexus for observability in Nexus-based SoCs
  - RAM self test and repair

Differentiating Through Technology
Leveraging our clockless technology foundation
• Differentiated Product Offering
  - High performance (latency, capacity)
  - Power efficient (linear scaling)
  - Robust in operation
• Clockless Technology Foundation
  - Silicon proven and customer validated
  - Mature CAD flow (integrated with commercial tools)
  - Robust cell library (thousands of unique cells)
• Unique IP Blocks
  - Unmatched performance
  - Extremely robust (power and temperature)
  - Easy to integrate (benign behavior)

Thank You!
“A group of engineers wants to turn the microprocessor world on its head by doing the unthinkable: tossing out the clock and letting the signals move about unencumbered. For those designers, inspired by research conducted at Caltech, clocks are for wimps.”
Anthony Cataldo, EE Times

Peter A. Beerel, PhD
VP Strategic [email protected]
818.871.8100
www.fulcrummicro.com
26775 Malibu Hills Road, Suite 200, Calabasas Hills, CA 91301