ON THE ORIGIN OF BUGS
…Or, Understanding Hardware Bugs And How To Avoid Them
BRYAN DICKMAN DVCLUB: NOVEMBER 26TH 2019
Where do bugs come from? What are the common ways that bugs are introduced into designs? What can design engineers and verification engineers jointly do to avoid them?
BRYAN DICKMAN: VALYTIC CONSULTING LIMITED
- A senior technology manager and people leader
- Recognised industry expert with 35 years of experience in the semiconductor industry
- 22 years of leading engineering teams at Arm:
  - IP Design-Verification delivery over an era when many new methodologies were introduced into engineering workflows
  - Development of Design and Verification best practices
  - Engineering data strategies that exploit modern data science practices to drive rich engineering insights and process/workflow improvements
  - Senior Director within the Technology Services Group
  - Experienced developer of people
- T&VS Associate
- Acuerdo Limited Associate (Joe Convey)
https://www.linkedin.com/in/bryan-dickman-74b1a914/?originalSubdomain=uk
INTRODUCTION AND DISCLAIMERS
- The following is a personal perspective
- None of the data shown is real data – it is hypothetical
- It's all established thinking, but some of it is more established in software development today
- I have a whitepaper on the subject to follow
WHY DO WE STILL HAVE BUGS?
…and jobs as DV engineers?
BUG FREE UTOPIA?
- ASSERTION: All complex designs contain bugs
  - No design can ever be 100% bug-free, no matter how hard you try!
- Verification is a time- and resource-limited quest to find as many bugs as possible before shipping
- Verification completeness is not generally achievable:
  - Test planning can never be 100% complete
  - Coverage models can never be 100% complete
  - Infinite verification cycles are not possible
  - …and so bugs will be missed
- Verification should employ strategies that increase the chances of finding all bugs
- Designers should employ strategies that minimize bug risks
TWO DISCIPLINES…With A Lot In Common
DESIGN:
- Logic Design
- Behavioral modelling
- RTL coding
- Synthesis
- DFT
- Timing Analysis
- Power Analysis
- Implementation

VERIFICATION:
- 'ABC'
- Testbench Design
- Coverage
- Test Planning
- UVM
- Software test development
- Hardware Acceleration
- Deep Formal

IN COMMON:
- Specification
- Architecture
- Micro-Architecture design
- System Architecture
- Benchmarking
- Simulation
- Waveform Analysis
- Software understanding
- Verification techniques
- Assertions
- Formal (for assertions)
- Scripting/building workflows
- Data Analysis/Data Science
WHEN IS A BUG A BUG?…Only When You See It?
- Observable:
  - An error is eventually seen (preferably detected by verification)
  - Lockups and denial of service – easy to detect – a reset might work around them
  - Data Corruption – if silent/undetected, consequences might be severe
  - How 'rare' is it? What takes days and weeks to detect in the verification env might manifest as a debilitating failure rate in silicon
- Non-Observable:
  - Might be 'spotted' by chance, or found with formal?
  - Might be an error in coding that is masked or unreachable
  - Fix it or leave it? Weigh up the risks!
- Vulnerabilities:
  - Reachable by malicious code – a risk for security!
- Non-operational functions:
  - Debug or event counters – software developers impacted
- Safety-critical and Reliability:
  - DRAM and logic sensitivity to SEUs – e.g. ECC functions for SECDED – error rates in the 10⁻¹⁰ to 10⁻¹⁷ errors/bit·h range
- Performance and Power:
  - Non-functional – but entitled performance is lost through a coding error, or too much power is consumed by the device
- Clocking and Reset:
  - Asynchronous events can lead to meta-stability
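The SECDED behaviour mentioned above (Single-Error Correction, Double-Error Detection) can be illustrated with a toy Hamming(8,4) code in Python. This is a hypothetical sketch for intuition only, not production ECC logic:

```python
def secded_encode(data4: int) -> list:
    """Encode 4 data bits as Hamming(7,4) plus an overall parity bit (SECDED)."""
    d = [(data4 >> i) & 1 for i in range(4)]      # d0..d3
    p1 = d[0] ^ d[1] ^ d[3]                       # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                       # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]                       # covers positions 4,5,6,7
    code = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # positions 1..7
    p0 = 0
    for b in code:                                # overall parity over the 7 bits
        p0 ^= b
    return code + [p0]                            # 8-bit SECDED word

def secded_decode(word: list):
    """Return (data, status): corrects any single-bit flip, detects double flips."""
    code, p0 = list(word[:7]), word[7]
    syndrome = 0
    for pos in range(1, 8):                       # XOR of positions holding a 1
        if code[pos - 1]:
            syndrome ^= pos
    overall = p0
    for b in code:
        overall ^= b                              # 0 when total parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:                            # odd number of flips -> single error
        if syndrome:
            code[syndrome - 1] ^= 1               # syndrome gives the error position
        status = "corrected"                      # (syndrome 0 => p0 itself flipped)
    else:                                         # even flips, bad syndrome -> DED
        return None, "double-error detected"
    data = code[2] | (code[4] << 1) | (code[5] << 2) | (code[6] << 3)
    return data, status
```

Any single flipped bit decodes back to the original data with status "corrected"; two flipped bits are reported as a detected-but-uncorrectable error, which is exactly the property that makes SECDED viable against rare SEUs.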
WHEN ARE BUGS FOUND?…The Sooner The Cheaper!
- OPINION: most bugs should be flushed out in the early stages (say with 30% of the work) – these are the easy finds
- Verification work (the consumption of resources and human effort to find, debug and fix) is disproportionately high for the remaining 10% of bugs
  - But these are often the critical ones
- Focus workflow and methodology improvements on this later stage for the biggest ROI
COMMON ROOT CAUSES
- Bugs occur while creating code:
  - Spec errors/ambiguities
  - Interface misunderstandings/spec
  - Typos and Copy-Paste errors
  - Incorrect verification env assumptions
  - Missing code
- Or while changing code (code churn):
  - Adding features
  - Fixing bugs
  - Performance tuning
- Or as a consequence of COMPLEXITY
A WORD ON COMPLEXITY
- It's complicated because…
  - The architecture specification is complex, to meet function and performance targets
  - The architecture implementation (or micro-architecture) uses a catalog of complex 'clever tricks' to meet performance targets (the Art of Design)
- It's complicated due to…
  - Behavior is no longer fully understood
  - Design partitioning is sub-optimal
  - Code style is more implied-gates and less behavioral
  - Comments are missing or, worse, incorrect
  - Code health has deteriorated – accumulation of technical (code) debt
- Is it measurable?
  - Engineering experience and gut feel!
  - LOCs
  - McCabe Cyclomatic Complexity
  - Code indentation complexity
  - Other metrics from Verilog compilers and linting tools, e.g. #registers, #wires, gated clocks, redundant code, logic depth
  - A search will reveal a limited number of tools to measure RTL complexity – most use McCabe
- If I can measure it, I can visualize it and then decide how to act upon it, e.g. refactor code or intensify verification
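Code indentation complexity, one of the metrics listed above, is cheap to compute. A minimal Python sketch, assuming 2-space indentation counts as one nesting level and `//` line comments as in Verilog:

```python
def indentation_complexity(source: str, tab_size=4, indent_unit=2):
    """Whitespace complexity proxy: deeper nesting => higher score."""
    depths = []
    for raw in source.splitlines():
        line = raw.expandtabs(tab_size)
        stripped = line.lstrip()
        if not stripped or stripped.startswith("//"):   # skip blanks and comments
            continue
        depths.append((len(line) - len(stripped)) // indent_unit)
    if not depths:
        return {"total": 0, "max": 0, "mean": 0.0}
    return {"total": sum(depths),
            "max": max(depths),
            "mean": round(sum(depths) / len(depths), 2)}

rtl = """module m;
  always @(posedge clk) begin
    if (a) begin
      q <= d;
    end
  end
endmodule"""

# indentation_complexity(rtl) -> {'total': 9, 'max': 3, 'mean': 1.29}
```

Tracking this per file over development time gives a crude but language-agnostic view of where nesting (and hence cognitive load) is accumulating.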
ESTABLISHED BUG AVOIDANCE
- Coding rules and static linting
- Design reviewing / code scrubs
- Designer assertions
- Formal correct-by-construction
- If all else fails… implement feature toggle bits
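As a toy illustration of a static lint rule, the sketch below flags Verilog `case` statements with no `default` arm, a classic source of unintended latches. It is a hypothetical regex-based sketch, not a real linter:

```python
import re

def lint_missing_default(source: str):
    """Toy lint rule: flag Verilog case statements that lack a default arm."""
    findings = []
    open_cases = []                                   # stack of [start_line, has_default]
    for no, raw in enumerate(source.splitlines(), 1):
        code = raw.split("//")[0]                     # strip line comments
        if re.search(r"\bcase[xz]?\b", code):         # case / casex / casez opens a block
            open_cases.append([no, False])
        if re.search(r"\bdefault\b", code) and open_cases:
            open_cases[-1][1] = True
        if re.search(r"\bendcase\b", code) and open_cases:
            start, has_default = open_cases.pop()
            if not has_default:
                findings.append(f"line {start}: case without default arm")
    return findings
```

Real lint tools do this on a parsed netlist rather than with regexes, but the principle is the same: catch the bug class mechanically, before simulation ever runs.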
SOFTWARE DEVELOPMENT IDEAS
- In the last decade we have seen the emergence of DevOps in the software development world (rooted in Lean and then Agile)
- Most mainstream software platforms are developed and operated using the DevOps model
  - It enables 10s, 100s, even 1000s of software deployments per day, while achieving stability, reliability, high availability and security
- What DevOps principles might apply to hardware development?
- Automate Testing: Continuous Integration
- Trunk-based development
- Code Refactoring (build in >20%)
- Integrate Performance Testing
- Test-Driven Development
- Pair Programming (automate with Gerrit)
- Blameless Post-Mortems (Retrospectives)
- Swarm on Defects
- Telemetry: continuous analysis of metrics
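A pre-merge gate in the Continuous Integration spirit above can be sketched as a pipeline that stops at the first failing stage. The stage names and stub lambdas below are hypothetical stand-ins for real lint/compile/simulation commands:

```python
def run_gate(stages):
    """Minimal pre-merge CI gate: run stages in order, stop at the first failure."""
    results = {}
    for name, stage in stages:
        ok = stage()
        results[name] = "pass" if ok else "FAIL"
        if not ok:                        # block the merge, skip later stages
            break
    return results

stages = [
    ("lint",    lambda: True),            # stand-ins for real tool invocations
    ("compile", lambda: True),
    ("smoke",   lambda: False),           # a failing smoke regression blocks the merge
    ("full",    lambda: True),            # never reached once smoke fails
]

# run_gate(stages) -> {'lint': 'pass', 'compile': 'pass', 'smoke': 'FAIL'}
```

The fail-fast ordering (cheap checks first, expensive regressions last) is what keeps trunk-based development viable when commits land many times a day.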
HOW CAN DATA AND ANALYTICS HELP?…assuming I am managing my data!
COVERAGE ANALYTICS
- Not wishing to state the obvious…
- It is well-established practice to track all available coverage metrics, aiming for 100% coverage or as close as possible with analysis
- Remembering…

Covered != Verified
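Coverage roll-up across regression runs can be sketched as a simple union of per-bin hit counts. The bin names and run data below are hypothetical:

```python
def merge_coverage(runs):
    """Union per-bin hit counts across regression runs; report score and holes."""
    merged = {}
    for run in runs:                               # each run: {bin_name: hit_count}
        for bin_name, hits in run.items():
            merged[bin_name] = merged.get(bin_name, 0) + hits
    holes = sorted(b for b, h in merged.items() if h == 0)
    score = 100.0 * sum(h > 0 for h in merged.values()) / len(merged)
    return round(score, 1), holes

runs = [{"fifo_full": 3, "fifo_empty": 0},
        {"fifo_full": 0, "fifo_empty": 0, "overflow": 0}]

# merge_coverage(runs) -> (33.3, ['fifo_empty', 'overflow'])
```

The holes list, not the headline percentage, is where the analysis effort goes; and even a hit bin only tells you the stimulus reached that state, not that the behaviour there was checked.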
BUG ANALYTICS
- Collecting and tracking bug data from your bug database is a great way to visualize where things are at
- We eventually expect a plateau
  - But the plateau might just be a very shallow curve
  - And there may be many false summits en route
- Time to review and change something?
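One way to detect such a plateau is a rolling mean of the weekly bug-discovery rate. A sketch with hypothetical weekly counts and an arbitrary threshold:

```python
def rolling_rate(weekly_bugs, window=4):
    """Rolling mean of bugs found per week across a sliding window."""
    return [sum(weekly_bugs[i:i + window]) / window
            for i in range(len(weekly_bugs) - window + 1)]

def plateaued(weekly_bugs, window=4, threshold=0.5):
    """True when the most recent discovery rate has dropped below threshold."""
    rates = rolling_rate(weekly_bugs, window)
    return bool(rates) and rates[-1] < threshold

bugs_per_week = [8, 7, 5, 6, 4, 2, 1, 0, 0, 0]   # hypothetical project data

# plateaued(bugs_per_week) -> True
```

The windowing smooths over the false summits: one quiet week is not a plateau, but a month of near-zero finds is at least a trigger to review.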
CORRELATING DATA
- Better insights are gained from correlating bug data with other data, such as verification effort (machine/human hours): "I'm running good cycles, but bugs are no longer being found"
- And/or commit data: "not only are no bugs being found, the design and verification codebases are stable"
- From that you judge when to stop!
  - Or migrate to the next platform, e.g. Emulation, FPGA?
- And what if a late bug is then found?
  - How does that impact my sign-off verification target?
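Such correlations can be quantified with something as simple as a Pearson coefficient over weekly series. A pure-Python sketch with hypothetical data (cycles rising while bug finds fall gives a strong negative correlation):

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5      # std-dev numerators
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

weekly_cycles = [1, 2, 3, 4, 5]    # hypothetical: effort ramping up
weekly_bugs   = [9, 6, 4, 2, 1]    # hypothetical: finds tailing off

# pearson(weekly_cycles, weekly_bugs) -> strongly negative (about -0.99)
```

A sustained strong negative correlation between effort and finds, combined with flat commit activity, is the quantitative version of "maybe it's time to stop, or move to emulation".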
BUG PREDICTION…How Great Would That Be?
- It might be possible, and it may be a worthy endeavor to try
- It has been done before (see DVClub 2011 – Greg Smith*)
- Searching finds several papers on doing this for software
  - Some use Machine Learning techniques
- So long as you have good datasets and have collected the relevant design metrics for bugs
- Experiment with different training approaches, e.g. Decision Trees, Naïve Bayes, Artificial Neural Networks (ANNs), to find the best prediction models
- Be aware of social factors and differences between teams when looking at historical datasets
* https://www.testandverification.com/DVClub/24_Jan_2011/Greg_Smith.pdf
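As a minimal flavor of such a prediction model, the sketch below fits a one-split "decision stump" on hypothetical per-module metrics (commit count, indentation complexity) labelled buggy/clean. A real study would use proper ML libraries, larger datasets and held-out validation:

```python
def train_stump(samples, labels):
    """Fit a one-split decision stump: the (feature, threshold) pair that best
    separates buggy (1) from clean (0) modules by training accuracy."""
    best = (0, 0, -1.0)                               # (feature, threshold, accuracy)
    for f in range(len(samples[0])):
        for t in sorted({s[f] for s in samples}):     # candidate thresholds
            preds = [1 if s[f] >= t else 0 for s in samples]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if acc > best[2]:
                best = (f, t, acc)
    return best

def predict(stump, sample):
    f, t, _ = stump
    return 1 if sample[f] >= t else 0

# Hypothetical modules: (commit_count, indentation_complexity), 1 = had bugs
samples = [(5, 1.0), (48, 1.2), (60, 3.1), (8, 2.8)]
labels  = [0, 1, 1, 0]

# train_stump(samples, labels) -> (0, 48, 1.0): high commit count predicts bugs
```

Even this toy illustrates the workflow: collect per-module metrics, label them against the bug database, fit, then point extra verification effort at the modules the model flags.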
CODEBASE ANALYTICS…Exploiting Version Control Data
- Another book recommendation
- Another idea from the software world
- Ideas on how to extract insights from Git
- Hotspots indicate complex code with a high commit rate – what's going on there? Complexity tracking over development time. Refactoring indicators.
- Correlate this with defects
- Unexpected couplings between modules that frequently get committed together
- Social aspects of code development – how many different editors, code that is now 'abandoned'
- Architecture and Project Management insights
Screenshot of Hotspot visualization taken from CodeScene: https://codescene.io/projects/171/jobs/15343/results/code/hotspots/system-map (Permission kindly granted by Adam Tornhill)
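The hotspot idea (commit frequency x churn) can be sketched directly from `git log --numstat` output. The log text below is hypothetical:

```python
def hotspots(numstat_log: str, top=5):
    """Rank files by commit count x total churn, from `git log --numstat` text."""
    commits, churn = {}, {}
    for line in numstat_log.splitlines():
        parts = line.split("\t")                  # numstat rows: added<TAB>deleted<TAB>path
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            added, deleted, path = int(parts[0]), int(parts[1]), parts[2]
            commits[path] = commits.get(path, 0) + 1
            churn[path] = churn.get(path, 0) + added + deleted
    scored = sorted(((p, commits[p] * churn[p]) for p in commits),
                    key=lambda kv: -kv[1])
    return scored[:top]

log = ("12\t3\trtl/decoder.v\n"
       "5\t1\trtl/alu.v\n"
       "\n"
       "7\t2\trtl/decoder.v\n")

# hotspots(log) -> [('rtl/decoder.v', 48), ('rtl/alu.v', 6)]
```

The isdigit guard also skips binary-file rows (`-` columns) and commit headers. Files that score high are touched often and changed heavily: prime candidates for refactoring or intensified verification.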
THANK-YOU
- Hardware and software developers both develop code
- It's a different mindset – but there are lessons from software that can be reused for hardware, such as CI, Pair Programming/Gerrit, Refactoring, Complexity analysis, Bug Prediction, Git analytics. Take some time to look at DevOps!
- Critical hardware bugs can be very costly (litho masks, packaging, end products), but so can software bugs in modern business-critical, high-availability and high-security platforms
- Successful teams will use data analytics to gain insights and apply improvements that reduce cost, shorten schedules and improve quality (fewer bugs!)