Why do so many chips fail?

Ira Chayut, Verification Architect (opinions are my own and do not necessarily represent the opinion of my employer)

Failure rate of first silicon is rising

“… research by Collett International revealed that 52% of complex application specific integrated circuits (ASICs) required a respin and the reason was largely due to functional errors.” (http://www.techonline.com/community/ed_resource/feature_article/36655)

Who is to blame? (There must be someone to blame!)

Management – they didn’t provide enough resources

HW Engineering – they created the functional errors

Verification – they didn’t catch the functional errors

Architecture – they didn’t focus on testability

Marketing – they kept changing the specs

People don’t kill chips, complexity kills chips

http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s99/papers/2_src.pdf (1999) — Projected numbers are a bit lower than current reality – a dual core AMD Opteron has 233 million transistors and the Intel Itanium 2 has 592 million transistors

Complexity increases exponentially

[Figure: transistors per chip, in millions of transistors (axis 0 to 1600), by year (1995 to 2015)]

Chip component count increases exponentially over time (Moore’s law)

Interactions increase super-exponentially

IP reuse and parallel design teams facilitate more functions with fewer HW engineers per function and more functions per chip

Verification effort gets combinatorially more difficult as functions are added

Why verification is not able to keep up

Verification effort gets combinatorially more difficult as functions are added

BUT

Verification staffing/time cannot be made combinatorially larger to compensate

AND

Chip lifetimes are too short to allow for complete testing

THUS

Chips will continue to have ever-increasing functional errors as chips get more complex
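To make the combinatorial argument above concrete, here is a minimal Python sketch. It assumes, purely for illustration, that any function on a chip may interact with any other; the function names are hypothetical and do not come from the original slides. It counts the pairwise interactions and the multi-function combinations that a complete test plan would, in principle, have to cover.

```python
from math import comb


def pairwise_interactions(n_functions: int) -> int:
    """Number of distinct pairs of functions that could interact."""
    return comb(n_functions, 2)


def multiway_combinations(n_functions: int) -> int:
    """Number of subsets containing two or more functions.

    All 2**n subsets, minus the empty set and the n single-function subsets.
    """
    return 2 ** n_functions - n_functions - 1


if __name__ == "__main__":
    # Doubling the function count roughly quadruples the pairs,
    # while the multi-way combinations grow exponentially.
    for n in (4, 8, 16, 32):
        print(n, pairwise_interactions(n), multiway_combinations(n))
```

Under this assumption, staffing that grows linearly with the number of functions falls behind even the pairwise count, and far behind the multi-way count.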

Limiting the number of architectural and functional errors

Thorough unit-level verification testing

Small simulations run faster

Avoids combinatorial explosion of interactions

Well defined interfaces between blocks with assertions and formal verification techniques to reduce inter-block problems

Emulation or FPGA prototyping to accelerate testing
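As an illustration of an interface assertion between blocks, the following Python sketch checks a recorded signal trace against one rule. The valid/ready handshake and the rule itself are assumptions chosen for the example, not something the slides specify: once a sender asserts valid and the receiver is not ready, valid must stay high and the data must not change. In practice such checks would usually be written as assertions in the hardware description or verification language.

```python
from typing import List, NamedTuple


class Cycle(NamedTuple):
    """One clock cycle of a hypothetical valid/ready interface."""
    valid: bool
    ready: bool
    data: int


def check_handshake(trace: List[Cycle]) -> List[str]:
    """Return violation messages for the assumed stall rule.

    A transfer is stalled when valid is high and ready is low. In the
    cycle after a stall, valid must still be high and data unchanged.
    """
    violations = []
    for i in range(1, len(trace)):
        prev, cur = trace[i - 1], trace[i]
        stalled = prev.valid and not prev.ready
        if stalled and not cur.valid:
            violations.append(f"cycle {i}: valid dropped before ready")
        elif stalled and cur.data != prev.data:
            violations.append(f"cycle {i}: data changed while stalled")
    return violations
```

Because the rule involves only the signals at the block boundary, it can be checked in every unit-level simulation without knowing what either block does internally.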

How to live with functional errors

Successful companies have learned how to ship chips with functional and architectural errors: time-to-market pressures and chip complexity force the delivery of chips that are not perfect (even if perfection were possible). How can this be done better?

For a long while, DRAMs have been made with extra components to allow a less-than-perfect chip to provide full device function and to ship

How to do the same with architectural features? How can full device function exist in the presence of architectural or implementation omissions or errors?
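The DRAM approach mentioned above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a description of any real DRAM: the class name, the row-level granularity, and the remapping table are all hypothetical. Rows found to be defective at test time are redirected to spare rows, so the user still sees a full-size, fully working device.

```python
class RedundantMemory:
    """Memory with spare rows that stand in for rows found defective."""

    def __init__(self, rows: int, spare_rows: int, bad_rows: set):
        if len(bad_rows) > spare_rows:
            raise ValueError("more defective rows than spares; chip cannot ship")
        self.rows = rows
        # Spare rows live at physical indices rows .. rows + spare_rows - 1.
        self.remap = {bad: rows + i for i, bad in enumerate(sorted(bad_rows))}
        self.cells = [0] * (rows + spare_rows)

    def _physical(self, row: int) -> int:
        """Translate a user-visible row to the physical row actually used."""
        if not 0 <= row < self.rows:
            raise IndexError("row out of range")
        return self.remap.get(row, row)

    def write(self, row: int, value: int) -> None:
        self.cells[self._physical(row)] = value

    def read(self, row: int) -> int:
        return self.cells[self._physical(row)]
```

The architectural question in the slide is how to provide the equivalent of the `remap` table for features rather than for memory rows.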

Architecture support

Embrace Perl’s motto: “There's More Than One Way to Do It” — allow for multiple ways of accomplishing all critical specified functions

Analogous to Design for Test (DFT) and Design for Verification (DFV), we should start thinking about Architect for Verification (AFV)

[Thanks to Dave Whipp for the AFV phrase and acronym]

In some problem domains, such as networking, upper-layer protocols can recover from some silicon errors, though at the cost of a performance penalty whenever the recovery path is used
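The networking case can be sketched as a checksum-and-retry loop. This is an illustrative model, not a real protocol: `FlakyLink` is a hypothetical stand-in for silicon that corrupts some frames, and the only real library call used is `zlib.crc32` from the Python standard library. The returned attempt count makes the performance penalty visible.

```python
import zlib


class FlakyLink:
    """Hypothetical link whose first `bad_sends` transmissions flip one bit."""

    def __init__(self, bad_sends: int):
        self.bad_sends = bad_sends

    def transmit(self, frame: bytes) -> bytes:
        if self.bad_sends > 0:
            self.bad_sends -= 1
            # Corrupt the lowest bit of the first byte of the frame.
            return bytes([frame[0] ^ 0x01]) + frame[1:]
        return frame


def send_reliably(link: FlakyLink, payload: bytes, max_attempts: int = 5):
    """Send payload with a CRC-32 trailer, retrying on a checksum mismatch.

    Returns (received_payload, attempts_used); raises if every attempt fails.
    """
    frame = payload + zlib.crc32(payload).to_bytes(4, "big")
    for attempt in range(1, max_attempts + 1):
        received = link.transmit(frame)
        body, crc = received[:-4], int.from_bytes(received[-4:], "big")
        if zlib.crc32(body) == crc:
            return body, attempt
    raise RuntimeError("link failed on every attempt")
```

The data arrives intact either way; the cost of the hardware error is the extra transmissions.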

Architecture support, continued

A programmable abstraction layer between the real hardware and user’s API can hide functional warts — hardware catches specific operations and either directs them to one of multiple hardware implementations, or signals a software trap

Pyramid minicomputers hid the assembly language from users, so the compiler could work around problems

Transmeta maps standard machine language to a hidden processor architecture, so translation software can work around problems

Soft hardware can allow chip redesign after silicon is frozen (and shipped!)
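The programmable abstraction layer described above can be sketched as a dispatch table. Everything here is hypothetical and for illustration only: the class and method names, the deliberately buggy "hardware" multiplier, and the shift-and-add fallback are assumptions, not part of the original talk. Each operation has an ordered list of implementations; one found to be broken after silicon can be disabled, and if none remains the layer raises a software trap.

```python
class SoftwareTrap(Exception):
    """Raised when no working hardware implementation remains for an operation."""


class AbstractionLayer:
    def __init__(self):
        self.routes = {}       # operation name -> ordered list of (name, impl)
        self.disabled = set()  # (operation, impl name) pairs known to be broken

    def register(self, op, name, impl):
        self.routes.setdefault(op, []).append((name, impl))

    def disable(self, op, name):
        """Post-silicon fix: stop routing `op` to a faulty implementation."""
        self.disabled.add((op, name))

    def execute(self, op, *args):
        for name, impl in self.routes.get(op, []):
            if (op, name) not in self.disabled:
                return impl(*args)
        raise SoftwareTrap(op)


def buggy_hw_multiply(a, b):
    """Stand-in for a fast unit with a functional error for one operand value."""
    return a * b + 1 if a == 6 else a * b


def shift_add_multiply(a, b):
    """Slower alternative path; assumes b is a non-negative integer."""
    result = 0
    while b > 0:
        if b & 1:
            result += a
        a <<= 1
        b >>= 1
    return result
```

A possible usage: register both multipliers under the operation name `"mul"`, observe the wrong result for 6 times 7, then call `disable("mul", "hw")` so the same request is served by the slower but correct path.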

Summary

Ever increasing chip complexity prevents total testing before tape-out (or even before shipping)

AFV techniques can keep chip verification from being subject to combinatorial explosion

We have to accept that there will be architectural and functional failures in every advanced chip that is built

Architecture support is needed to allow failures to be worked around or fixed post-silicon
