Logic Emulation and Prototyping: It’s the Interconnect (Rent rules) Mike Butts NVIDIA RAMP at Stanford, August 2010


Logic Emulation and Prototyping: It’s the Interconnect

(Rent rules)

Mike Butts, NVIDIA

RAMP at Stanford, August 2010


Mike Butts - RAMP - August, 2010 2

In the beginning
• I’ve always been a computer architect.
• Before the ASIC (early 1980’s) we built computers with off-the-shelf chips.
  – Am2901 bit slices, PALs, 7400 logic. Just hook up some parts and run it now.
• Full-speed wire-wrapped prototypes. When it ran it shipped.
• Design Verification: It doesn’t crash.
• Debug visibility: scope, maybe LA.
• Design revision: wire-wrap gun.
• Project time: months, not years.
• Example: Kurzweil 1978
  – Nova clone for Kurzweil Reading Machine
  – 2901s, 74F TTL, 16Kb DRAMs, 4 MHz clock
  – When the prototype ran the reading machine app for three days without crashing, I released the design to manufacturing.


Then came the ASIC Tapeout
• Must get the design perfect before tapeout.
• Emergence of EDA, design capture, logic simulation: “Daisy/Mentor/Valid”
• Simulation is very slow, must write testbenches, can’t run the real app.
• This makes the design process very conservative. Crimps architect’s style.
• To me EDA has always been a bit of a video game.


FPGAs Emerge!
• Real hardware! We can prototype again!
• But simulators are automatic, and FPGA tools are strange and hard. What if we had an automatic box of FPGAs that plugs into an ASIC socket? Emulate!
• Many FPGAs are needed. How to interconnect? Extend the row-column FPGA architecture:

XC2064 FPGA, 64 CLBs, 1986

Sample, US 5,109,353, 1992


First Logic Emulator Product
• Quickturn RPM: 1989
• Nearest-neighbor interconnect
• Hard to get expected logic capacity, hard to manage delays.
• But it worked!

Sample, US 5,109,353, 1992


First big success: Intel P5

• Quickturn worked closely with Intel to emulate the original Pentium microarchitecture: P5.
  – Ten RPM systems were cabled together, and the design was manually broken up into RPM-sized segments which were emulated.

• “The emulator had one more benefit: blunting the spread of RISC. At a technology forum for PC companies and software developers last November (1991), (Intel VP Albert Yu) dialed it up and ran a Lotus 1-2-3 spreadsheet from a terminal. The crowd was astonished that a model was already working. Six months later, Compaq Computer Corp. scrubbed its plans for a RISC-based PC.” - Business Week 6/1/1992 “Inside Intel”


But row/column doesn’t scale
• Logic circuit topology is not flat, 2D nearest-neighbor. Wires go anywhere.
• FPGA pins get used up by nets that are just passing through. Long delays.
• Quickturn RPM had serious capacity, placement and routing issues.
• It turns out the wires and pins of an FPGA are its most precious resource.
  – 80-90% of FPGA transistors are interconnect.
  – “We charge for the wires, the gates are free” -- Altera VP Eng. Clive McCarthy, 1994
• Logic density follows Moore’s Law, but packaging and pin counts do not.
  – Not even the square root (perimeter).

• Logic emulators inevitably outstripped FPGA pin counts. Why???


Rent’s Rule
• The problem of how many pins to provide for each partition of a system came up in the IBM 1401 project, 1960.
• Ed Rent found this empirical rule for the relationship between pins per logic block and the number of gates in the block:

$p = K g^{r}$

where p = pins, g = gates, r is the “Rent exponent”, and K is the “Rent constant”.


Rent’s Rule
• IBM 1401 used a Standard Modular System (SMS) of logic modules, backplanes and chassis, with standard pin counts. How to size? Rent’s Rule.
• Rent never published, but in 1971 Landman and Russo did.
  B. S. Landman, R. L. Russo, On a Pin Versus Block Relationship For Partitions of Logic Graphs, IEEE Trans. Comp., vol. C-20, 1971.
• Profound influence on system architecture and CAD/EDA tools.
• Different Rent coefficients apply to different environments.
• Empirical. Theory? Inconclusive.
  – Exponent > 0.5: global connectivity.
  – Constant > 1: net fanout.
• Rent’s Rule guided FPGA emulation system architecture.

We used $p = 2.5\,g^{0.57}$

IEEE Solid-State Circuits magazine, winter 2010
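
As a rough sketch of how the rule is applied, the snippet below evaluates Rent’s Rule with the constants quoted on this slide (K = 2.5, r = 0.57). The function names and the inverse helper are illustrative assumptions, not part of the original talk.

```python
RENT_K = 2.5   # Rent constant quoted on the slide
RENT_R = 0.57  # Rent exponent quoted on the slide


def rent_pins(gates, k=RENT_K, r=RENT_R):
    """Pins that Rent's Rule predicts for a block of `gates` gates: p = K * g**r."""
    return k * gates ** r


def gates_for_pins(pins, k=RENT_K, r=RENT_R):
    """Inverse of the rule (hypothetical helper): gates a pin budget can support."""
    return (pins / k) ** (1.0 / r)


if __name__ == "__main__":
    for g in (1024, 5120):
        print(f"{g} gates -> about {rent_pins(g):.0f} pins")
    print(f"144 pins -> about {gates_for_pins(144):.0f} gates")
```

Reading the rule backwards, as `gates_for_pins` does, gives a feel for how much logic a fixed pin budget can keep busy when the netlist is cut arbitrarily.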


Emulators: Big Green Button
• A logic emulator is automatic and universal. It takes any arbitrary netlist and implements it in standard hardware, with little or no user intervention.
• Uniform hardware, uniform-size FPGAs.
• Design netlist is cut arbitrarily into many equal partitions to keep the chips full.
  – Balanced k-way partitioning (NP-hard)
• This means Rent’s Rule applies.
• An FPGA prototype is manual and specific. Hardware is usually chosen for one project, the design is manually partitioned according to its modular structure, FPGAs are sized accordingly.
• System modules naturally have smaller pinouts than arbitrary cuts. Rent’s Rule does not apply. (Well, yes it does but weakly.)

G. Schelle, et al., Intel Nehalem Processor Core Made FPGA Synthesizable, ACM FPGA 2010

M. Butts, “Emulators”, Wiley Encyclopedia of Electrical and Electronics Engineering, 1999.
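
The contrast between arbitrary cuts and modular cuts can be illustrated with a toy pin count. This is a minimal sketch under stated assumptions: the netlist, the gate names, and the `pins_per_partition` helper are all hypothetical, and each net that crosses a partition boundary is assumed to cost one pin on every partition it touches.

```python
from collections import defaultdict


def pins_per_partition(nets, part_of):
    """Count external pins per partition.

    nets: iterable of nets, each an iterable of gate ids.
    part_of: dict mapping gate id -> partition id.
    A net spanning more than one partition uses one pin on each of them.
    """
    pins = defaultdict(int)
    for net in nets:
        parts = {part_of[g] for g in net}
        if len(parts) > 1:
            for p in parts:
                pins[p] += 1
    return dict(pins)


# Two tightly connected modules {a, b, c} and {d, e, f} joined by one net.
NETS = [("a", "b"), ("b", "c"), ("a", "c"),
        ("d", "e"), ("e", "f"), ("d", "f"),
        ("c", "d")]

# Cut along the module boundary versus an arbitrary equal-size cut.
MODULAR = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
ARBITRARY = {"a": 0, "b": 0, "d": 0, "c": 1, "e": 1, "f": 1}

if __name__ == "__main__":
    print("modular cut:  ", pins_per_partition(NETS, MODULAR))
    print("arbitrary cut:", pins_per_partition(NETS, ARBITRARY))
```

Both cuts put three gates in each partition, but the cut that ignores module boundaries needs several times as many pins, which is the situation an automatic emulator compiler faces.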


Rent’s Rule says FPGA Pins are Precious
• XC3090: 640 LUTs, 5K gates. Rent’s Rule says 325 pins, FPGA has 144 pins, only 44%.
• Lesson: FPGA pins are vital to FPGA emulator capacity. => Separate interconnect
• Crossbar is ideal
  – Interconnects any pins, any way, with any fanout
  – Uniform delay: one level
• Far too expensive: $O(n^2)$
• Far more fanout than needed, average net fanout is 2 to 3.
• Doesn’t take advantage of FPGA pin routability.

Butts, US 5,036,473, 1991
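
The XC3090 figures on this slide can be checked by hand. The arithmetic below assumes the constants K = 2.5 and r = 0.57 quoted earlier in the talk and takes “5K gates” as 5120 gates (640 LUTs at 8 gates each, the ratio used in the pin-shortage table later in the deck).

```latex
\begin{align*}
p &= K g^{r} = 2.5 \times 5120^{0.57} \approx 325 \text{ pins} \\
\frac{p_{\text{real}}}{p} &= \frac{144}{325} \approx 0.44
\end{align*}
```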


Partial Crossbar Interconnect
• Drop out most of the crosspoints, leaving a partial crossbar.
  – Group FPGA pins into subsets,
  – Fully populate crosspoints within each subset,
  – Leave the rest out.
• For each net, find a subset which can route it.
  – High fanout nets first.
• Map nets to FPGA pins accordingly.
• Still uniform single-level delay.
• Symmetrical, no placement needed.
• Scalable: O(n)

Butts, US 5,036,473, 1991
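
The routing step above (“for each net, find a subset which can route it, high fanout nets first”) can be sketched as a simple greedy pass. This is an illustrative assumption about how such a router might look, not the patented algorithm or Quickturn’s compiler: `route_nets`, its parameters, and the first-fit choice of subset are all hypothetical.

```python
def route_nets(nets, num_subsets, pins_per_subset):
    """Greedy first-fit routing of inter-FPGA nets onto pin subsets.

    nets: dict mapping net name -> set of FPGA ids the net must reach.
    num_subsets: number of pin subsets (one crossbar per subset).
    pins_per_subset: pins each FPGA contributes to each subset.
    Returns dict net name -> subset index, or None if the net is unroutable.
    """
    fpgas = set().union(*nets.values()) if nets else set()
    # Free pin count for every (FPGA, subset) pair.
    free = {(f, s): pins_per_subset for f in fpgas for s in range(num_subsets)}
    assignment = {}
    # Highest fanout first; ties broken by name so the result is deterministic.
    for name in sorted(nets, key=lambda n: (-len(nets[n]), n)):
        chosen = None
        for s in range(num_subsets):
            # A subset can route the net only if every FPGA on the net
            # still has a free pin in that same subset.
            if all(free[(f, s)] > 0 for f in nets[name]):
                chosen = s
                break
        if chosen is not None:
            for f in nets[name]:
                free[(f, chosen)] -= 1
        assignment[name] = chosen
    return assignment


if __name__ == "__main__":
    demo = {"n1": {0, 1, 2}, "n2": {0, 1}, "n3": {1, 2}, "n4": {0, 2}}
    print(route_nets(demo, num_subsets=2, pins_per_subset=2))
```

Because every subset reaches every FPGA, the router only has to pick a subset; which physical pin inside the subset is used does not matter, which is why no placement step is needed.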


Partial Crossbar Systems
• Redraw: Group each subset’s crosspoints into a crossbar chip for that subset
• Each crossbar has pins to every FPGA, and vice versa.
• Make crossbar chip or use cheap FPGA
• Multilevel for systems: second-level crossbars on the backplane.
• Max delay is three hops.
• Cost is slightly higher than O(n). Scalable.

• Partial crossbar interconnect made large-scale logic emulation practical.

Butts, US 5,036,473, 1991
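
To make the cost argument concrete, here is a minimal sketch comparing crosspoint counts for a full crossbar and a single-level partial crossbar built from fixed-size crossbar chips. The counting model (a c-pin crossbar counted as c² crosspoints), the function names, and the example board sizes are assumptions for illustration only.

```python
def full_crossbar_crosspoints(num_fpgas, pins_per_fpga):
    """One crossbar spanning every FPGA pin: n**2 crosspoints for n pins."""
    total_pins = num_fpgas * pins_per_fpga
    return total_pins ** 2


def partial_crossbar_crosspoints(num_fpgas, pins_per_fpga, xbar_pins):
    """Many small crossbar chips of `xbar_pins` pins each.

    Every crossbar takes an equal share of pins from every FPGA, so
    xbar_pins must be a multiple of num_fpgas and must divide the total.
    """
    total_pins = num_fpgas * pins_per_fpga
    if xbar_pins % num_fpgas or total_pins % xbar_pins:
        raise ValueError("crossbar size does not divide the pins evenly")
    num_xbars = total_pins // xbar_pins
    return num_xbars * xbar_pins ** 2


if __name__ == "__main__":
    # Hypothetical boards: 128 usable pins per FPGA, 56-pin crossbar chips.
    for fpgas in (14, 28):
        full = full_crossbar_crosspoints(fpgas, 128)
        part = partial_crossbar_crosspoints(fpgas, 128, 56)
        print(f"{fpgas} FPGAs: full={full}, partial={part}")
```

With the crossbar chip size held fixed, doubling the number of FPGAs doubles the partial-crossbar cost but quadruples the full-crossbar cost. Once each crossbar can no longer spare a pin for every FPGA, a second crossbar level is needed, which is one reason the cost ends up slightly above O(n).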


History of FPGA Emulators, 1989-2000

Nearest-neighbor architecture
• Quickturn RPM (1989): First commercial emulator
• Virtual Machine Works (1994): Virtual Wires pin multiplexing

Partial Crossbar architecture
• Mentor Realizer (1989): First hardware, emulated Apple II mobo
• Mentor Realizer (1991): Proof-of-concept system prototype
  – 8 logic boards (14 XC3090 FPGAs, 32 XC2018 xbars), 64 XC2018 2nd-level xbars
• Mentor sold this logic emulator technology to Quickturn (1992).
• Quickturn Enterprise (1993): First commercial partial crossbar emulator
  – 11 logic boards (46 XC3090s, 46 custom xbars), 144 2nd-level xbars, 330K gates
• HP Teramac (1995): Configurable computing research machine: 1M gates
• Quickturn System Realizer (1995): XC4000 series, 2M gates
• Quickturn Mercury Plus (2000): Large custom emulation FPGA, 20M gates


FPGA Emulation Clocking Issues
• ASIC and custom chips have gated clocks, latches, many clock domains. FPGAs can introduce their own violations.
• FPGA interconnect delay is very hard to manage.
  – FPGAs use dedicated low-skew clock networks.
• Gated clocks: must run clock through logic blocks. Hold-time violations: clock gets sooner than the data.
• Latches: timing of both edges matters, plus there’s latch transparency.
• How to reliably map these to FPGA? Re-synthesis.
  – Map gated clocks to FPGA FF clock enables (which is the gate, which is the clock?)
  – Map latches into flops, using 2x clocking.
• Emulators developed sophisticated design mapping techniques.


Emulator User Psychology
• Emulators were often hard to use, especially in the early days.
  – First-time users + clocking issues = errors.
  – Ultra-high pincount backplanes, cabling = errors.
• This trained users to blame the emulator.
• After weeks of effort, they finally get their design up and running on the emulator. A bug is found. What is their response?
  a) “Wonderful! It found a bug in our design. We’re getting value from all this expense.”
  b) “It’s not our design, it’s your emulator.”
• User starts running diagnostics and swapping boards.
• Swap enough boards and guess what happens.....
• Solutions: Locked board extractors, Better emulators.

Emulators have thousands of pins per board


1995: Quickturn System Realizer
• Up to 990 FPGAs (Xilinx XC4013), custom crossbar chips
• Logic board: 45 FPGAs, 100 K gates
  – 2500 pins to backplane, 900 pins in-circuit or LA
• Max system 22 boards, 2M gates, 14 MB RAM
• Built-in LAPG
• 14K I/Os for multiple systems
• Compiler 100KG/hr
• Two-level partial crossbar connects 990 FPGAs in 3 hops max.


2000: Mercury Plus FPGA

• Custom FPGA for emulation
• Five-level partial crossbar across entire 20M gate system:
  – Logic cluster: full crossbar
  – Two partial crossbar levels on-chip
  – Two more levels in the system
• 10x faster compile
• Predictable capacity and delays
• 6-LUTs, FFs, RAMs
  – hold time trimmers
• Full visibility, on-chip logic analyzer
• QT’s last FPGA emulator


FPGA Pin Shortage Gets Worse Over Time

Device       LCs (4-LUT)      Gates  Rent pins  Real pins*  Shortfall
XC2064               128       1024        130          58       2.24
XC3090               640       5120        325         144       2.26
XC4062              5472      43776       1105         352       3.14
XC40200            16758     134064       2092         448       4.67
XCV800             21168     169344       2390         512       4.67
XC2V6000           67584     540672       4631        1104       4.20
XC4VLX160         200448    1603584       8607         960       8.97
XC6VLX550T        549888    4399104      15299        1200      12.75
XC7V2000T        1954560   15636480      31521        1200      26.27

• Using FPGAs directly in logic emulators falls to Rent’s Rule
  – FPGA-based emulators were always starved for pins.
  – Xilinx FPGAs from the beginning. Altera, other FPGAs are similar.

* ordinary pins only, SERDES latency is too long for logic emulation
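
The shortfall column can be reproduced from the other columns. The sketch below assumes the constants quoted earlier in the talk (K = 2.5, r = 0.57) and 8 gates per logic cell, a ratio inferred from the Gates and LCs columns rather than stated on the slide. The `shortfall` helper and the selection of rows are illustrative.

```python
RENT_K = 2.5       # Rent constant quoted earlier in the talk
RENT_R = 0.57      # Rent exponent quoted earlier in the talk
GATES_PER_LC = 8   # assumption: inferred from the table (Gates / LCs)

# (logic cells, real pins) for a few rows of the table.
DEVICES = {
    "XC2064": (128, 58),
    "XC3090": (640, 144),
    "XC7V2000T": (1954560, 1200),
}


def shortfall(logic_cells, real_pins):
    """Ratio of the pins Rent's Rule asks for to the pins the package has."""
    gates = logic_cells * GATES_PER_LC
    rent_pins = RENT_K * gates ** RENT_R
    return rent_pins / real_pins


if __name__ == "__main__":
    for name, (lcs, pins) in DEVICES.items():
        print(f"{name:10s} shortfall = {shortfall(lcs, pins):.2f}")
```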


FPGA Emulator Pin Multiplexing

Babb et al., “Logic Emulation with Virtual Wires”, vol. 16, pp. 609 - 626, 1997.

Xilinx data book

• Multiple nets per pin, slower design clock
• Quickturn:
  – Asynchronous free-running high-speed using DDR IOBs
  – Transparent to the emulated design
• VMW: Virtual Wires
  – Synchronous to design
  – Modify design netlist: Eval/mux/latch, many levels
  – Multiple clock domains?
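
The basic trade of pin multiplexing, more nets per pin in exchange for a slower design clock, can be sketched with a toy time-division model. This is a deliberately simplified illustration: it covers a single hop between two FPGAs, ignores synchronization overhead and multi-hop paths, and the function names are hypothetical rather than taken from Quickturn or Virtual Wires.

```python
def tdm_transfer(net_values):
    """Carry several logical nets over one physical pin, one net per fast-clock slot.

    net_values: list of 0/1 values, one per logical net sharing the pin.
    Returns the values as reconstructed on the receiving FPGA.
    """
    received = [None] * len(net_values)
    for slot, value in enumerate(net_values):  # one fast-clock tick per slot
        pin = value            # sender-side mux drives the shared pin
        received[slot] = pin   # receiver latches the pin into that net's register
    return received


def max_design_clock_hz(pin_clock_hz, nets_per_pin):
    """Upper bound on the design clock for a single multiplexed hop.

    Every net needs its own slot before the design can take its next step.
    """
    return pin_clock_hz / nets_per_pin


if __name__ == "__main__":
    print(tdm_transfer([1, 0, 1, 1]))
    print(max_design_clock_hz(100e6, 4))
```

Paths that cross several FPGAs pay this penalty on each hop, which is why deeper multiplexing pushes the emulated clock down so quickly.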


Continuous to Discrete Time

• As FPGAs got further and further from Rent’s Rule, FPGA emulators went to deeper and deeper pin multiplexing.
• Continuous time:
  – Pure FPGA emulator runs in the continuous time of the design. Signals propagate as in the real hardware, just with different delays.
• Continuous / discrete time mix:
  – Pin-multiplexed FPGA emulator runs in an ad-hoc mix of continuous and discrete time. Yet pins still mostly lie idle.
• Discrete time:
  – Go all the way into discrete time == levelized simulation
• Now it’s a massively parallel computer


Processor-based Emulation
• Levelize netlist, evaluate all gates every cycle, level-by-level.
• No branches: deep pipelining, fast, massively parallel, very scalable.
• Compile-time net scheduling: Emulated design escapes Rent’s Rule
• IBM Yorktown Simulation Engine, Monty Denneau, DAC 1982.
  – “... high speed special purpose parallel processor designed and built at the IBM Thomas J. Watson Research Center to simulate logical operation ... up to 2,000,000 gates at a rate exceeding 3 billion gate computations per second”
• IBM Engineering Verification Engine, Beece et al., DAC 1988.
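
Levelized evaluation can be sketched in a few lines. The example below is a minimal software illustration under stated assumptions, not the YSE or EVE design: the netlist format, the `levelize` and `evaluate` helpers, and the full-adder example are hypothetical, and the netlist is assumed to be purely combinational with no feedback loops.

```python
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NOT": lambda a: a ^ 1,
}


def levelize(gates, inputs):
    """Group gate outputs by level: a gate sits one level above its deepest input.

    gates: dict mapping output net -> (op name, list of input nets).
    inputs: iterable of primary input nets (level 0).
    """
    level = {net: 0 for net in inputs}

    def depth(net):
        if net not in level:
            _, ins = gates[net]
            level[net] = 1 + max(depth(i) for i in ins)
        return level[net]

    by_level = {}
    for net in gates:
        by_level.setdefault(depth(net), []).append(net)
    return [by_level[lv] for lv in sorted(by_level)]


def evaluate(gates, schedule, input_values):
    """Evaluate every gate once, level by level, with no data-dependent branches."""
    values = dict(input_values)
    for nets in schedule:
        for net in nets:  # gates within a level are independent of each other
            op, ins = gates[net]
            values[net] = OPS[op](*(values[i] for i in ins))
    return values


# One-bit full adder as the example netlist.
FULL_ADDER = {
    "t": ("XOR", ["a", "b"]),
    "sum": ("XOR", ["t", "cin"]),
    "c1": ("AND", ["a", "b"]),
    "c2": ("AND", ["t", "cin"]),
    "cout": ("OR", ["c1", "c2"]),
}

if __name__ == "__main__":
    schedule = levelize(FULL_ADDER, ["a", "b", "cin"])
    print(schedule)
    print(evaluate(FULL_ADDER, schedule, {"a": 1, "b": 0, "cin": 1}))
```

The schedule is fixed at compile time, so every gate in a level can be handed to a different processor, and the communication between levels is known in advance rather than discovered at run time.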


Quickturn CoBALT
Wm. Beausoleil et al., IBM

• 1997 commercialization of IBM engines
• 8M gates, 1 MHz emulation speed
• IBM HW, QT front end compiler
• Maps multi clock domains, latches, gated clocks onto single faster clock, making use of FPGA compiler experience
• Compiles 1M gates / hour
• Full custom 100 MHz 250um chip with 64 logic processors
• 65 chips / board


Processor-based Emulation in 2000’s

• IBM technology and team acquired by QT, then QT acquired by Cadence
• FPGA emulators dropped
• 2002: Palladium
  – 128M gates, 0.75 MHz
  – Full visibility
  – Compile 30M gates / hour
  – Multi-user
• 2004: Palladium II
  – 256M gates, 1.5 MHz
• 2007: Palladium III
  – 256M gates, 2 MHz
• 2010: Palladium XP
  – 2000M gates, 4 MHz

Palladium XP


Emulation at NVIDIA

One of the largest emulation labs in the world


Early Emulation Success
• In 1995, CEO Jensen Huang “spent $1 million, a third of the company’s cash, on a technology known as emulation, which allows engineers to play with virtual copies of their graphics chips before they put them into silicon. That allowed Nvidia to speed a new graphics chip to market every six to nine months, a pace the company has sustained ever since.” - Forbes, 1/7/08

• RIVA 128, or "NV3", was one of the first consumer graphics processing units to integrate 2D and 3D acceleration. When announced in 1997, the market found the specifications hard to believe: performance superior to market-leader 3dfx. RIVA 128 shipped in volume, and the combination of its low cost and high performance made it a popular choice for OEMs.

Wikipedia


Emulation in 2005
The specific verification goals that were required for the GeForce 6800 project include:
• Bring up a new generation of GPUs on an accelerated verification platform in a one-week time frame. Derivative chips must be brought up in a few days.
• Automate the Compile-Run-Debug process so that ASIC design engineers could use an accelerated verification platform.
• Verify GPU and frame-buffer/system-memory interaction.
• Validate AGP/PCI-bus interface functions.
• Ensure functionality at various levels of abstraction (RTL and gates).
• Expand accelerated verification solution to ATPG and BIST applications.
- Chip Design Magazine, January 2005


Emulation Today
• 2010: Cadence Palladium XP
• Up to 2 billion gates, up to 4 MHz, up to 512 users
  – Compile up to 35M gates / hour on 1 PC
• Full visibility to all signals
• Integrates with logic and power simulation, SystemC/C++ models, prototype hardware
• System integration steps used at NVIDIA:
  – Design and verify the silicon itself.
    • Power analysis is vital.
  – Run silicon in the virtual system (such as a PC), verify that the GPU works in a system.
  – Run lots of software applications on the virtualized platform.
- “NVidia Engineer Cites HW/SW Integration Challenges”, 5/5/10, cadence.com


FPGA Prototyping today
• FPGA prototyping is widely used as a verification tool by chip development projects (not to mention RAMP of course).
• Practical for one to four to maybe ten FPGAs.
  – 2-4M gates each, typically 10 to 50 MHz
• Prototypes are rarely disclosed; two research efforts were:
  – Atom CPU in one Virtex-5 LX330, 50 MHz (ACM FPGA ‘09)
  – Nehalem CPU in five FPGAs, 520 kHz due to 18- to 24-way pin multiplexing (ACM FPGA ‘10)


Future

• State-of-the-art projects continue to rely heavily on processor-based emulation and FPGA prototyping for tapeouts.

• State-of-the-art tapeouts today cost $50-100M++.
  – Only possible for established $B vendors.
  – Very hard to get new chip startups funded.
• Therefore, ASIC project starts are dropping.
• FPGAs and GPUs are the only processing silicon that scales with Moore’s Law (so far).
  – Their vendors are the “foundries” for new HW efforts.

• Off-the-shelf chips: we’re coming full circle.


The Ultimate Interconnect
Human brain: $10^{11}$ neurons, $10^{14}$ to $10^{15}$ total synapses, 20-40 W, somewhat reconfigurable.

“The Brain Unveiled”, Technology Review, Nov-Dec, 2008