
Page 1: Coverage Metrics and Error Models

Serdar Tasiran, Kurt Keutzer

Department of Electrical Engineering & Computer Sciences

University of California, Berkeley

Page 2: Problem Statement

- Designers tape out IC designs without a clear idea of
  - the number and severity of errors that remain in the design
  - the comprehensiveness of the functional verification that has been performed
- In theory, formal verification conclusively verifies the IC design
  - But in practice it is often impossible to perform formal verification
- To improve the current situation practically, a broader variety of verification techniques is being used
  - Formal verification: model checking, theorem proving
  - Simulation (using informal coverage metrics)
- This catches more bugs, but in the end designers still do not know
  - how comprehensive their verification is
  - whether a subset of the techniques applied could provide the same comprehensiveness

Page 3: The Need for Coverage Metrics

- A mechanism is needed to
  - quantify the degree of verification achieved by a combination of techniques
  - provide information about the unchecked aspects of the design
- That mechanism is coverage metrics. Good coverage metric(s) will enable us to
  - evaluate existing validation approaches (model checking, simulation given a set of test vectors, ...)
  - compare and correlate these approaches
  - assess whether a certain degree of verification comprehensiveness has been achieved
  - guide the generation of new test vectors to exercise unchecked parts of designs

Page 4: Future System Overview

(Block diagram of the envisioned flow: simulation driver (vectors), simulation engine, simulation model (HDL), simulation monitor (yes/no), coverage analysis, diagnosis of unverified portions, and vector generation. Part of the flow is marked "OUR FOCUS NOW".)

Page 5: Research Approach

- Survey existing coverage approaches from
  - Protocol conformance testing
  - Software testing
  - Hardware verification (CAD)

Page 6: Research Approach

For each domain, examine:
1. Form of spec and implementation
2. Testing/verification goal
3. Error models
4. Coverage metrics
   4.1 Does the metric correlate well with errors? (intuitively, experimentally)
       - Effectiveness: likelihood of catching bugs given X% coverage
   4.2 Applicability: how easy is it to achieve coverage? Likelihood of catching bugs per unit computation time
   4.3 Relevance for hardware verification purposes
5. Comparison of metrics, theoretically and empirically: X% C1 coverage vs. Y% C2 coverage; computational cost, size of test set required, ...

Then: Conclusions and Discussion

Page 7: Executive Summary - 1

- Strictly formal/mathematical analysis of coverage metrics is unlikely to identify one best or most comprehensive metric
  - No such conclusion emerges from existing work; existing results have the form
    "For testing scenario A, according to probabilistic measure P, metric C1 is more likely to detect errors than metric C2"
  - At the very least, many coverage metrics are likely to be incommensurable
- Three directions for future research
  - Look for new mathematical relationships (beyond "subsumes") that give a formal ordering of techniques
  - Statistical/empirical data gathering
    - Situation analogous to fault models in manufacturing test
    - Evaluate and compare metrics by collecting good statistical data to support claims
    - Engage with a customer, e.g. Intel
  - Search for intuitively better coverage metrics

Page 8: Executive Summary - 2

- Software testing motivates a number of interesting new coverage metrics
  - Dataflow-based coverage metrics: all definitions-uses associations, definition-use interactions for different variables, dependence coverage
  - Mutation coverage
- Mutation testing could lead to a new (interactive) diagnostic tool

Page 9: Protocol Conformance/FSM Testing (1,2)

- SPEC: single deterministic (extended) FSM, written in ESTELLE, LOTOS, SDL
- IMPL: single deterministic (extended) FSM running inside a bigger system
- EFSM: the FSM describes control; extra state variables hold data structures and context variables
  - Mostly a small control part: <100 states; a bigger data part: 50-100 input and output variables
- VERIFICATION GOAL: IMPL is equivalent to or contained in SPEC
  - Not only I/O behavior: states and transitions must match
  - State of IMPL is not visible: black-box testing of IMPL vs. SPEC

(Figure: example EFSM with states sinit, s1, s2, s3, s4 and labeled transitions such as "i = 1 / o = 0; x := wait" and "i = 0, x = wait / o = 1; x := done".)

Page 10: Protocols, FSMs: Error Models (3)

- Assumption: faulty machines still satisfy correct interface behavior
- Structural faults (some 20 "mutation operators" in Mothra):
  - Output fault: the output on a transition is incorrect
  - Transfer fault: a transition leads to the wrong state
  - Additional/missing transition on a given present state and input
  - Extra/missing state
  - Blocked or dropped input
- Sequencing faults:
  - Missing alternative path
  - Improper nesting of loops
  - WHILE instead of REPEAT
  - Wrong logical expression controlling a loop or conditional statement
- Arithmetic/manipulative errors:
  - Ignoring overflow
  - Wrong operator (e.g. >, ≥, *, +)
  - Wrong initial value
  - Reference to the wrong variable

Page 11: Protocols, FSMs: Control Coverage (4)

- Cover each state and transition of the SPEC FSM (see the sketch below)
  - Transition tours
  - Distinguishing sequences for states
  - Unique I/O sequences for states
  - ...
- Verify that the outputs of IMPL match SPEC

(Figure: FSM fragment with states sinit, s1, s2 and distinguishing sequences for s1 and s2.)

- A different technique: fault functions
  - Define sets of suspicious transitions
  - Represent them as a non-deterministic FSM
  - Cover all states and transitions of the NFSM
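
As a concrete illustration of the state/transition bookkeeping behind these control coverage criteria, here is a minimal Python sketch. The FSM, its inputs, and its outputs are invented for the example; they are not taken from the slides.

```python
# Minimal sketch: measure state/transition coverage of a deterministic FSM
# for a given input sequence. The FSM and the test inputs are illustrative only.

fsm = {  # (state, input) -> (next_state, output)
    ("s_init", 0): ("s1", 1),
    ("s_init", 1): ("s2", 0),
    ("s1", 0): ("s2", 0),
    ("s1", 1): ("s_init", 1),
    ("s2", 0): ("s_init", 0),
    ("s2", 1): ("s1", 1),
}

def run(test_inputs, start="s_init"):
    """Run the FSM, recording which states and transitions were exercised."""
    state = start
    visited_states = {state}
    visited_transitions = set()
    outputs = []
    for i in test_inputs:
        next_state, out = fsm[(state, i)]
        visited_transitions.add((state, i))
        visited_states.add(next_state)
        outputs.append(out)
        state = next_state
    return outputs, visited_states, visited_transitions

all_states = {s for (s, _) in fsm} | {ns for (ns, _) in fsm.values()}
outputs, st, tr = run([0, 1, 1, 0, 1, 0])
print("state coverage     :", len(st) / len(all_states))
print("transition coverage:", len(tr) / len(fsm))
# Conformance testing would additionally compare `outputs` against the SPEC's outputs.
```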

Page 12: Protocols: Datapath Coverage (4)

- Exercise all branches; exercise all definitions-uses paths
- Definition-use coverage: determine all definition-use pairs
  - May be exponential in the number of branches
- For each definition-use path:
  - Get to its initial state
  - Determine if the d-u path is traversable
    - Constraint solving (SAT-like approach)
    - May need to traverse loops several times to make the path traversable
- Control + datapath: repeat the same control test with different data values
  - All control states/branches + all definitions-uses
  - Cover all d-u paths first, then cover the remaining transitions

(Figure: the example EFSM again, with the definition of x ("x := wait") and its use ("i = 0, x = wait / o = 1; x := done") highlighted.)

Page 13: Protocols: Domain Coverage (4)

- Sub-domain of a data variable: the range of values that make the program go through the same control path
- Exercise all of the sub-domains of a variable
  - Exercise each sub-domain: extremes and a few intermediate values
- BUT: protocols are reactive, not a single set of parameters; must define this criterion over time
- [Zhu, Vuong '97] Behavior of a protocol:
  - Set of execution sequences (a1,r1), (a2,r2), (a3,r3), ...
    - ai: event i
    - ri: recursion depth of event i
  - Probably doesn't apply to hardware

(Figure: input space partitioned into sub-domains D1-D5.)

Page 14: Protocols: A Metric for Execution Sequences (4)

- [Zhu, Vuong '97] Infinite behavior space; want to cover it with finite tests
- Define a distance between sequences:
  D(a, b) = d(a1,b1) + 1/2 d(a2,b2) + 1/4 d(a3,b3) + ...,  with 0 <= d(ai,bi) <= 1
  - Rationale: differences later on in the sequence don't matter as much
- This yields a metric space; cover the space with finitely many spheres of radius ε
  - Can pick test sets of decreasing ε
- Problem: if the metric says two sequences are close, are they really close in terms of bug-catching probability?
- Generalize to data parameters: (a1(v1,...,vn), r1), (a2(v1,...,vm), r2), ...
  - The distance metric depends on v1,...,vm as well
- No results on applicability (a small sketch of the distance follows below)
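
A minimal Python sketch of the sequence distance summarized above: per-event distances are weighted by 1/2^i so that later differences matter less. The event alphabet and the per-event distance d() are illustrative assumptions, not from the slides.

```python
# Weighted distance between two protocol execution sequences.

def event_distance(a, b):
    """Distance between two events, normalized to [0, 1]; here simply 0/1."""
    return 0.0 if a == b else 1.0

def sequence_distance(seq_a, seq_b):
    """D(a, b) = sum_i (1/2**i) * d(a_i, b_i), padding the shorter sequence."""
    n = max(len(seq_a), len(seq_b))
    total = 0.0
    for i in range(n):
        a = seq_a[i] if i < len(seq_a) else None
        b = seq_b[i] if i < len(seq_b) else None
        total += event_distance(a, b) / (2 ** i)
    return total

# Two hypothetical event sequences: they differ only late in the run, so they
# end up within a small radius of each other.
s1 = ["connect", "send", "ack", "send", "ack", "close"]
s2 = ["connect", "send", "ack", "send", "nack", "abort"]
print(sequence_distance(s1, s2))   # 1/16 + 1/32 = 0.09375
```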

Page 15: Protocol Cov. Metrics: Mutation Adequacy (4)

- Effectiveness (fault coverage) =
  (# of faulty machines caught by testing method) / (# of faulty machines in population)
  (sketched below)
- Impractical to look at all (faulty) machines; use mutation operators to generate a representative population
- Mutation operators for protocols:
  - Alter the output/next state on one (two, three, ..., n) transitions
  - Add/remove a state
  - Local perturbations to the state transition graph or protocol description
- Each mutant is checked for conformance/equivalence
- Too many mutants:
  - Limit the type and number of mutations
  - Limit the number of states in the implementation
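
To make the fault-coverage ratio concrete, here is a small Python sketch that applies one protocol mutation operator (redirect the next state of a single transition, a "transfer fault") to the toy FSM used earlier and measures what fraction of the resulting mutants a fixed test sequence catches. The spec FSM and test inputs are invented assumptions.

```python
# Fault coverage = (# faulty machines caught) / (# faulty machines in population).

spec = {  # (state, input) -> (next_state, output)
    ("s_init", 0): ("s1", 1), ("s_init", 1): ("s2", 0),
    ("s1", 0): ("s2", 0),     ("s1", 1): ("s_init", 1),
    ("s2", 0): ("s_init", 0), ("s2", 1): ("s1", 1),
}
states = sorted({s for (s, _) in spec})

def outputs(machine, test_inputs, start="s_init"):
    """Black-box run: only the output sequence is observable."""
    state, outs = start, []
    for i in test_inputs:
        state, o = machine[(state, i)]
        outs.append(o)
    return outs

def transfer_fault_mutants(machine):
    """All machines that differ from `machine` in the next state of one transition."""
    for key, (next_state, out) in machine.items():
        for wrong in states:
            if wrong != next_state:
                mutant = dict(machine)
                mutant[key] = (wrong, out)
                yield mutant

test = [0, 1, 1, 0, 1, 0]
population = list(transfer_fault_mutants(spec))
caught = sum(outputs(m, test) != outputs(spec, test) for m in population)
print("fault coverage:", caught, "/", len(population))
```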

Page 16: Protocol Metrics: Correlation w/ Errors (4.1)

- No systematic study to determine whether the fault models are realistic
  - Compound mutations may not be an accurate model for natural faults
- [Frankl, Weiss, Hu '96]
  - Effectiveness vs. coverage curve is flat, then increases sharply after 95% coverage
  - Error-exposing ability is good only at high coverage levels
  - Weed out unexecutable d-u paths; require 100% d-u coverage
- Coverage vs. effectiveness depends on the distribution of faulty programs over subdomains of
  - data variables
  - mutations
- Effectiveness should be defined relative to the test generation strategy
  - Will reflect the probability distribution of test sets better
- Results derived from protocols with few (actual) errors:
  - Errors revealed by few tests; otherwise any large enough test set works

Page 17: Protocol Cov. Metrics: Applicability (4.2)

- Difficult to achieve high coverage for elaborate metrics; may require a large test set
- Mutation testing is costly; coverage per unit resource is better for other metrics
  - Must determine whether a mutant is equivalent
  - If cost is not a factor, mutation testing does well at catching bugs
- 100% branch coverage is not effective enough for protocols
  - For general software, 70% is considered excellent
- Encouraging: all-uses coverage exposes missing-path errors (the designer forgets to specify what to do in a certain case)
  - Interesting because structural testing is supposed to be poor in this case
  - Some d-u associations would have taken the missing path
- Test suites are much smaller if the machine has a reliable reset

Page 18: Protocol Cov. Metrics: Relevance (4.3)

- Protocol SPEC is described in terms of "abstract interfaces"
  - Correct-interface assumption: high-level actions are implemented correctly by the tester/protocol interface
- Implication for hardware testing:
  - Can specify the FSM in terms of high-level actions (transactions), each implemented over several clock cycles
  - Determine how much of the high-level behavior is exercised
- Useful to study effectiveness vs. coverage for various metrics, using actual and artificial faults
- HW analogs of datapath metrics:
  - Definition: loading/initialization of a data register (via a certain path)
  - Predicate use: register fans out to the control part of the design
  - Computation use: register fans out to another data register
  - Compute, e.g., definition-use path coverage
    - Like netlist coverage; can be achieved by tagging data registers

Page 19: Protocols: Comparison of Metrics (5)

- C1 subsumes C2 iff, for all programs P and test sets T, a T that gives 100% C1 coverage also gives 100% C2 coverage
- Does "C1 subsumes C2" mean that test sets for C1 are more likely to uncover bugs than test sets for C2?
- Elaborate metrics are more effective at X% coverage partly because a large test set is needed to reach X% coverage
- Branch coverage vs. random testing: not much improvement for comparable test set size
- All definitions-uses coverage: requires a lot of computation for high coverage
  - Example: 40 d-u tests are as effective as 250 random tests, but the d-u tests require computation to construct
- If cost is not a factor, mutation testing is better at catching bugs
- Random testing is not as good as path testing, but is a good enough, cost-effective alternative

Page 20: Software Testing (1,2)

- SPEC: model-based specs (written in Z, VDM, ...)
  - State space (set of typed variables, invariant on the variables)
  - Required operations (dynamics of the system)
    - Predicates: pre-/post-condition pairs
  - Can be executable: FSMs, state-charts
  - Evaluate all predicates; the output evaluates to TRUE or FALSE
- Property-oriented specs: axiomatic or algebraic specs
  - Algebraic spec: a set of equations the program should satisfy
  - A term represents either a sequence of calls to the program or a value; check whether the two are equal
- IMPL: a program written in an imperative programming language
  - May have procedure and function calls
- Verification goal: do the outputs of IMPL satisfy or match SPEC?

Page 21: SW Error Models (3)

- Mutation operators mimic likely errors.

COMPUTATION ERRORS
- Statement errors
  - Wrong control keyword (WHILE instead of REPEAT, etc.)
- Predicate errors
  - Expression/evaluation errors in Boolean predicates
  - Decision variable of a "case" statement
  - Missing control path
- Assignment errors
  - Reference to the wrong variable, wrong expression

DOMAIN ERRORS
- Error in specifying domain boundaries

(Figure: input space partitioned into sub-domains D1-D5.)

Page 22: SW Covg. Metrics (Adequacy Criteria) (4)

- Classification 1:
  - Program-based
  - SPEC-based
  - Combined
  - Interface-based: inputs and outputs adhere to the required format
    - Random (statistical) testing: probability distribution over the input space
- Classification 2:
  - Structural metrics
  - Fault-based metrics
  - Error-based metrics

Page 23: SW: Program-based Structural Testing (4)

CONTROL-FLOW-BASED METRICS
- Basics:
  1. Statement coverage
  2. Branch coverage
  3. Decision (multiple condition) coverage
  4. Path coverage
- 1, 2 and 3 are usually too weak: they miss errors. Also undecidable: code may be unreachable
- 4 is ideal but impractical: too many paths. Choose a representative subset of paths:
  - Simple/elementary paths
  - Length-n paths
  - Level-i paths: check all elementary paths; at the next level, check unexercised elementary subpaths and cycles
  - Only check a linearly independent set of paths: there are e - n + (# of SCCs) of them (the cyclomatic number; see the sketch below)

(Figure: level-1 and level-2 paths in an example flow graph.)
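
A minimal Python sketch of the cyclomatic number, using the common single-connected-component formulation V(G) = E - N + 2 for a single-entry/single-exit flow graph (the slide's phrasing in terms of SCCs corresponds to the variant where the exit is linked back to the entry). The example CFG, an if-else followed by a loop, is an invented illustration.

```python
# Cyclomatic number of a control-flow graph: V(G) = E - N + 2.

cfg = {                        # node -> successors
    "entry":     ["cond"],
    "cond":      ["then", "else"],
    "then":      ["join"],
    "else":      ["join"],
    "join":      ["loop_head"],
    "loop_head": ["loop_body", "exit"],
    "loop_body": ["loop_head"],
    "exit":      [],
}

def cyclomatic_number(graph):
    nodes = len(graph)
    edges = sum(len(succs) for succs in graph.values())
    return edges - nodes + 2

print(cyclomatic_number(cfg))   # 3: the if-else and the loop each add one to the base of 1
```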

Page 24: SW: Program-based Structural Testing (4)

DATA-FLOW-BASED METRICS
- GOAL: don't fold data variables into the state space; find a meaningful and efficient way to exercise them
- Definition occurrence: x := 3
- Use occurrence (global use):
  - Computational use: z := x + 1
  - Predicate use: if (x > 2) then ...
- All-definitions: each definition of a variable is exercised by some path
- All-uses: each (feasible) use of a variable is covered by a path
  - Variants: computational uses, predicate uses, all c-uses/some p-uses, all p-uses/some c-uses, ...
- There may be many paths through which a definition reaches a use (see the sketch below)
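
The sketch below enumerates definition-use associations for one variable in a small control-flow graph: a definition reaches a use if some path between them redefines nothing. The example program and CFG are invented illustrations.

```python
# Enumerate def-use associations for variable "x" in a toy CFG.

cfg = {          # node -> successors
    "n1": ["n2"],           # n1: x := 3          (def of x)
    "n2": ["n3", "n4"],     # n2: if (x > 2)      (p-use of x)
    "n3": ["n5"],           # n3: z := x + 1      (c-use of x)
    "n4": ["n5"],           # n4: x := 0          (redefinition of x)
    "n5": [],               # n5: print(x)        (c-use of x)
}
defs = {"n1": {"x"}, "n4": {"x"}}
uses = {"n2": {"x"}, "n3": {"x"}, "n5": {"x"}}

def du_pairs(var):
    """All (def node, use node) pairs reachable by a definition-clear path."""
    pairs = set()
    for d, dvars in defs.items():
        if var not in dvars:
            continue
        stack, seen = list(cfg[d]), set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            if var in uses.get(n, set()):
                pairs.add((d, n))
            if var in defs.get(n, set()):
                continue        # path is no longer definition-clear past n
            stack.extend(cfg[n])
    return sorted(pairs)

print(du_pairs("x"))
# [('n1', 'n2'), ('n1', 'n3'), ('n1', 'n5'), ('n4', 'n5')]
```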

Page 25: SW: Program-based Structural Testing (4)

DATA-FLOW-BASED METRICS (cont'd)
- All definition-use pairs: for every definition of x and every (feasible) path q via which that definition reaches a use of x, the test set exercises a path p such that q is a sub-path of p
- Interactions between different variables must be exercised [Ntafos]
  - k def-ref interaction: [d1(x1), u1(x1), d2(x2), u2(x2), d3(x3), ..., dk(xk), uk(xk)]
    - di: definition of variable xi; ui: use of variable xi
    - ui and di+1 are at the same node; di reaches ui
    - The xi's and the nodes need not be distinct
  - Interaction path for a k-def-ref interaction: a path p = n1, n2, n3, ..., nk through the nodes of the interaction
  - Required k-tuples criterion: for all j-def-ref interactions L with 1 <= j <= k, there is an interaction path for every feasible L

(Figure: a definition "x := 3" reaching the predicate "x > 2" via alternative paths q1, q2, q3, and a chain of nodes n1, n2, n3, ..., nk.)

Page 26: SW: Program-based Structural Testing (4, 4.1, 4.2)

DATA-FLOW-BASED METRICS (cont'd)
- (Ordered) context coverage: for a node n that uses variables x1, x2, ..., xn, exercise paths p = n1, n2, n3, ..., n where each ni has a definition of xi that reaches n, i.e. the subpath ni ... n is definition-free for xi
- Dependence coverage: determine syntactically/semantically whether the execution of one statement affects another; if so, exercise a path between them

APPLICATION ISSUES
- Should structured data be considered as a single entity?
  - May identify a def-use path for an array even though no array element satisfies the def-use
  - Treating arrays element-by-element is difficult if they are dynamically indexed
- Interprocedural data-flow dependencies: module instantiations, formal vs. actual variable uses

Page 27: SW: Spec-based Structural Testing (4)

- SPEC can be used to
  - determine if the outputs of IMPL are correct
  - provide information to select test cases
  - measure test set adequacy
- Example: must exercise all feasible combinations of sub-expression values
  - A feasible combination of atomic predicates = a sub-domain of the input space

PARTITION TESTING
- (Automatically if possible) identify categories for each input parameter or environment variable
  - Characteristics enumerated in the pre-condition of the spec
  - Characteristics intrinsic to the variable (parameter type, etc.)

Page 28: SW: Spec-based Structural Testing (4)

PARTITION TESTING
- Choice: a partition of the domain of one group of variables
  - Example: A ∨ B partitioned into ~(A ∨ B), A, and B
  - Example: n ∈ 0 ... 63; predicates in the program: n > 0, 1 <= n <= 15; partition: {0}, [1 ... 15], [16 ... 63]
- All-combinations criterion: exercise every possible combination of choices
- Each-choice-used criterion: exercise each choice as part of some combination
- Base-choice criterion:
  - Base choice: the combination of parameters representing normal operation of the software
  - Each choice is used in combination with the base choices for the rest of the variables (the three criteria are sketched below)
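
A minimal Python sketch of the three partition-testing criteria above. The parameters, their choices, and the base (nominal) combination are invented examples; "each-choice-used" is implemented with a simple greedy scheme that covers every choice at least once.

```python
from itertools import product

choices = {                       # parameter -> its choices (illustrative)
    "n":      ["0", "1..15", "16..63"],
    "mode":   ["read", "write"],
    "parity": ["even", "odd"],
}
base = {"n": "1..15", "mode": "read", "parity": "even"}   # assumed nominal case

params = list(choices)

# All-combinations: every combination of choices (3 * 2 * 2 = 12 frames).
all_combinations = [dict(zip(params, combo))
                    for combo in product(*(choices[p] for p in params))]

# Each-choice-used: enough frames that every choice appears at least once.
width = max(len(v) for v in choices.values())
each_choice_used = [{p: choices[p][min(i, len(choices[p]) - 1)] for p in params}
                    for i in range(width)]

# Base-choice: start from the base frame and vary one parameter at a time.
base_choice = [dict(base, **{p: c})
               for p in params for c in choices[p] if c != base[p]]

print(len(all_combinations), len(each_choice_used), len(base_choice))  # 12 3 4
```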

Page 29: SW: Fault-Based Adequacy Criteria (4)

- Structural testing
- Fault-based testing: adequacy of a test set = its ability to detect faults
  - Methods:
    - Error seeding
    - Mutation testing
      - Program mutation testing
      - SPEC mutation testing
- Error-based testing

Page 30: SW Fault-Based Metrics: Error Seeding (4, 4.1)

- Originally proposed to estimate the number of faults in software
- Introduce artificial faults at random, unknown to the tester
  - Assumption: these faults are representative of the inherent faults
- Test the software, counting artificial and inherent faults separately
- r = (# of artificial faults found) / (total # of artificial faults) = measure of test adequacy
- f = # of inherent faults found by testing
  - Estimated # of inherent faults in the program = (1/r) * f  (a worked example follows below)
- Advantage: can be used to evaluate any testing method
- Drawbacks:
  - The measure depends on how the faults are introduced
  - Error seeding is difficult to implement: often done manually, and artificial errors are much easier to find
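
A worked instance of the error-seeding estimate, with invented numbers:

```python
# Error-seeding estimate: 20 faults seeded, 15 of them found by the test
# campaign, along with 6 previously unknown (inherent) faults.
seeded, seeded_found, inherent_found = 20, 15, 6

r = seeded_found / seeded                  # test adequacy measure: 0.75
estimated_inherent = inherent_found / r    # (1/r) * f = 8.0

print(r, estimated_inherent)               # 0.75 8.0
print("estimated faults remaining:", estimated_inherent - inherent_found)  # 2.0
```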

Page 31: SW Fault-Based Metrics: Mutation Cov. (4)

- Procedure for program P:
  - Create a set of alternative programs: MUTANTS
  - Construct test set T
  - For each mutant M, run tests from T
    - Either P and M differ on a test: the mutant dies
    - Or T is exhausted: the mutant lives
- Live mutants provide valuable information. A mutant lives because
  - the test data is inadequate
    - If a large proportion of mutants live, the test data does not convince us that P is correct
    - Live mutants point to "un-exercised" aspects of the program
  - or the mutant is equivalent to the program
    - Only a small fraction should be this way
- Mutation adequacy = (Dead mutants) / (Non-equivalent mutants)  (see the sketch below)
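
A minimal Python sketch of the mutation-adequacy procedure above. The program under test, its three mutants, and the test set are invented; in practice mutants are produced automatically by mutation operators.

```python
def original(x, y):
    return x if x > y else y          # intended behavior: max(x, y)

mutants = [
    ("'>' -> '>='", lambda x, y: x if x >= y else y),   # equivalent mutant
    ("'>' -> '<'",  lambda x, y: x if x < y else y),
    ("y -> x",      lambda x, y: x if x > y else x),
]

tests = [(3, 5), (5, 3), (4, 4)]

dead = 0
for name, mutant in mutants:
    if any(mutant(x, y) != original(x, y) for (x, y) in tests):
        dead += 1                     # some test distinguishes mutant from original
    else:
        print("live mutant:", name)   # inadequate tests, or an equivalent mutant

# Adequacy is computed over *non-equivalent* mutants only; here we know by
# inspection that the first mutant is equivalent.
non_equivalent = len(mutants) - 1
print("mutation adequacy:", dead / non_equivalent)   # 2/2 = 1.0
```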

Page 32: SW: Generating Mutants (4)

- Mutation operator: replace one syntactic structure with another
  - Designed based on previous design experience
- Statement analysis: make sure every line and branch is necessary
  - Replace a statement with CONTINUE, TRAP
  - Replace logicals and relationals with TRUE or FALSE
  - Replace DO with FOR, etc.
- Predicate analysis: exercise predicate boundaries
  - Alter limit sub-expressions by small amounts
  - Insert absolute value operators into predicate sub-expressions
  - Alter relational operators
- Domain analysis
  - Change constants and sub-expressions by small amounts
  - Insert absolute value operators (if syntactically correct)
- Coincidental correctness analysis
  - Change data references and operators to alternatives (wherever syntactically correct)

Page 33: SW: Mutant Testing (4)

- Mutant testing assumptions:
  - Competent programmer hypothesis: programs are near-perfect; errors are small deviations from the intended program
  - Coupling effect hypothesis: simple and complex errors are coupled
    - If a test kills a simple non-equivalent mutant, it will also kill a complex non-equivalent mutant
- Trying to validate the coupling effect:
  - A test set that kills mutants well also kills mutants of mutants well
  - But do second-order mutants model complex faults? Not clear; experiments say no
- Local correctness:
  - For program P, define a neighborhood of programs N(P)
  - P is locally correct w.r.t. N iff for all Q in N(P), either Q is equivalent to P, or Q fails on at least one test point in the test set
  - Local correctness implies correctness if N(P) includes at least one correct program

Page 34: Mutation Analysis: Applicability (4.2)

PROS
- Easy automation: applying mutation operators, running mutants
- Interactive test environment: if a mutant doesn't fail while the original fails, it is easy to examine the mutation and determine whether there is an error
- Other testing methods are special cases of mutant testing
  - Example: statement and branch coverage

CONS
- Expensive (both in time and space): an n-line program yields O(n^2) mutants
- Human cost of examining live mutants
  - Empirically ~10% of all mutants
  - Must decide whether they're equivalent to the original program (HARD)
  - If not, must create a new test case to kill the mutant

Page 35: Mutation Testing: Improvements (4.2)

- Weak mutation testing: mutate and test components instead of the whole program
  - Same # of mutants, but no need to run the whole program
- Firm mutation testing: select a portion of the program and a subset of the parameters to be mutated
  - Sensitive to the selection; higher human cost
- Constrained mutation testing: omit the few operators that cause the most dead mutants
  - Quality of the test sets remains almost the same
  - Mutation testing cost is reduced
  - Empirical data: still quadratic cost, although reduced significantly

Page 36: Mutation Testing: Improvements (4.2)

- Ordered mutation testing: define an order < between mutants
  - a < b implies that if a test t kills b, then it also kills a
  - Check b first; check a only if b survives
  - Order the mutation operators. Example: replace = by ≠, <, >
- A similar order can be placed on test data
- Experiments are needed to determine practical effectiveness

Page 37: SW Fault-based: SPEC-Mutation Testing (4)

- Aimed at catching bugs due to misinterpreting the SPEC, or errors in the SPEC itself
- Plant faults in the SPEC; check what fraction of the faulty SPECs is caught by the test set
  - New operators applied to pre-/post-condition pairs
  - Some program mutations don't work well for SPECs: replacing clauses with TRUE or FALSE is useless or uninteresting
- Two testing methods:
  - Non-executable SPEC: check if the program satisfies the mutated SPEC
  - Executable SPEC: check if the mutant SPEC gives the same result as the original SPEC

Page 38: SW: Error-Based Adequacy Criteria (4)

- Structural testing
- Fault-based testing
- Error-based testing
  - IDEA: check programs on error-prone points
  - Partition the input-output space so that behavior within a subdomain is equivalent
    - One test case is representative of all data in the sub-domain
    - May want to pick a few more to increase confidence in the implementation

Page 39: SW Error-Based: Domain Analysis (4)

- SPEC-based input space partitioning: the SPEC requires the same function on the data
  - Even when the SPEC is formal, there is no general mechanical method to partition the input space
  - Idea: for a given set of pre-/post-condition pairs, put them into CNF:
    (P1(I) => Q1(I,O)) AND (P2(I) => Q2(I,O)) AND ...
    Input data that satisfies Pi makes up domain i
- Program (IMPL)-based input space partitioning: two data points belong to the same subdomain if they cause the same computation
  - Same computation = same execution path
- Combined program- and SPEC-based domain analysis methods:
  - Perform the partitioning based on the two, separately
  - Find the partition that refines their intersection
  - For each partition, choose a sufficient test set

(Figure: input space partitioned into sub-domains D1-D5.)

Page 40: Domain Analysis: Test Case Selection (4)

- Recall: program errors
  - Domain errors: the program selects the boundaries of the domains incorrectly
  - Computation errors: the implementation of the computation is wrong
- Boundary analysis for domain errors (a small sketch follows below)
  - N x 1 domain adequacy
    - For N sub-domains D1, D2, ..., DN with boundaries B1, B2, ..., BN:
      at least N test cases on each Bi and one test case "off" Bi
      - If Bi is part of Di, the "off" test case should be outside Di; otherwise it should be inside Di
    - Detects parallel shifts of linear boundaries
  - N x N domain adequacy
    - N test cases on each Bi, and N linearly independent test cases just off Bi
    - Detects parallel shifts and rotations of linear boundaries
  - Using vertices improves the efficiency of boundary testing
    - V x V domain adequacy: test each vertex and another point just off it
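
A minimal Python sketch of N x 1 boundary testing for one linear boundary in two dimensions (N = 2): two "on" points on the boundary and one "off" point just beyond it. The boundary x + y <= 10, the subdomain names, and the offsets are invented for illustration.

```python
def classify(x, y):
    """Subdomain membership as implemented; the boundary x + y = 10 belongs to D1."""
    return "D1" if x + y <= 10 else "D2"

eps = 1e-3
on_points  = [(10.0, 0.0), (0.0, 10.0)]        # N = 2 points on the boundary
off_points = [(5.0 + eps, 5.0 + eps)]          # 1 point just outside D1

# Since the boundary is part of D1, the "on" points must land in D1 and the
# "off" point must land outside D1. A parallel shift of the boundary in the
# implementation would misclassify at least one of these points.
assert all(classify(x, y) == "D1" for (x, y) in on_points)
assert all(classify(x, y) == "D2" for (x, y) in off_points)
print("N x 1 boundary tests pass for boundary x + y = 10")
```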

Page 41: Domain Analysis: Applicability (4.2)

- For certain classes of programs (linear functions, multinomials, etc.), one can choose a small subset of inputs that guarantees correctness
- Major drawback: too complicated to apply to complex input spaces
  - Example: process control software
  - Difficult to come up with metrics for non-numeric inputs
  - Difficult to partition the input domain for reactive programs (recall protocols)
- Computation and boundary analysis methods should be used in a complementary way

Page 42: Overall SW Test Applicability (4.2)

- Most metrics are effective only at high coverage levels
- Complexity of data-flow testing:
  - Experiments correlate adequate test-set size with the # of decisions in the program
  - Observation: a large proportion of paths are infeasible, and weeding them out requires computation
  - Just looking at the # of test cases is misleading; must consider how much computation goes into constructing it
- Mutant testing is quadratic in the number of variables and variable references, and in the number of software units
- Structural coverage metrics: linear-sized adequate test sets
- Most SW metrics can be adapted to apply to HW
  - Mutation coverage may be costly and unreliable: difficult to check the equivalence of a mutant and the original

Page 43: SW Test Adequacy Criteria: Comparison (5)

- Little experimental data to compare effectiveness; nothing conclusive
- Duran & Ntafos: use simulation to compare
  - random testing
  - partition testing: choose a given ni cases from partition pi
  - Empirical result: 100 random cases ~ 50 partition cases; the extra effort of partition testing is not justified
- Data from SW supports data from protocols: for comparable test size,
  - a sophisticated metric is not more effective and requires extra computation
  - sophisticated metrics are better at finding difficult bugs
- Formal analysis of relationships between criteria:
  - The subsumes relation compares the severity of testing methods, not their effectiveness for a given % coverage
  - Only statistical measures exist for "better bug detection ability"

Page 44: "Subsumes" Relation between Criteria (5)

(Figure: directed graph of the "subsumes" relation among criteria, including all-paths, all level-i paths, all d-u paths, required k-tuples, ordered context, (un-ordered) context, strong mutation, firm mutation, weak mutation, all-uses, all-p-uses/some-c-uses, all-c-uses/some-p-uses, all-p-uses, all-c-uses, all-definitions, required pairs, cyclomatic adequacy, branch coverage, and statement coverage.)

Page 45: Other Relationships between Criteria

- For sub-domain based criteria, view each criterion as a multi-set of subsets of the input space:
  C1: { D1, D2, D3, ..., Dm }
  C2: { E1, E2, E3, ..., En }
- C1 narrows C2: for each Ei there is a Dj with Dj ⊆ Ei
- C1 covers C2: each Ei = Dj1 ∪ Dj2 ∪ ... ∪ Djk
- C1 partitions C2: each Ei = Dj1 ∪ Dj2 ∪ ... ∪ Djk with the Dj's disjoint
- C1 properly covers C2: C1 covers C2, and the covers of the Ei's make up a proper subset of D1, ..., Dm
- C1 properly partitions C2: C1 partitions C2, and the covers of the Ei's make up a proper subset of D1, ..., Dm
- For random and statistical experiments, compare statistical measures of the fault-detecting ability of the various criteria
  - Positive correlation; sometimes a clean theoretical proof of implication

Page 46: Implications between Relations (5)

(Figure: implication diagram relating the relations "universally properly partition", "universally partition", "universally properly cover", "universally cover", "universally narrow", and "subsume".)

- "Universally" means the relation holds for all programs and specs.

Page 47: Universally Properly Cover Relation (5)

(Figure: diagram of the "universally properly covers" relation among criteria, including ordered-context coverage, (un-ordered)-context coverage, required k-tuples, all-uses, all p-uses, (limited) mutation coverage, multiple-condition coverage, decision condition coverage, decision coverage, and atomic condition coverage.)

Page 48: CAD for HW (1,2)

- SPEC: English description, properties in some (temporal) logic, invariants
- IMPL: RTL netlist
- Verification goal: IMPL "satisfies" SPEC
- Problem: the SPEC is almost never "complete"
  - Berkeley designers' opinion: high coverage according to some metric is more convincing

Page 49: CAD: Error Models (3)

- Wrong connection in the gate-level netlist
- Perturbation to the state-transition graph (a la protocols)
- Timing errors
  - A control pulse arrives too late or too early
  - A state is entered or exited too soon
- Computational errors
  - Error in control predicates
  - Missed cases: control goes down the wrong path
  - Assignment errors

Page 50: CAD: Coverage Metrics (4, 4.1)

CONTROL EVENT COVERAGE [Ho & Horowitz '96]
- FSM coverage for the control variables controlling the datapath
- Ge: control event graph. Project the control FSM onto the variables at the control-datapath interface
  - No need to consider other control variables
- Assumptions:
  - The design is already partitioned into datapath and control
  - The datapath does not hold any control state
  - Only the sequencing of datapath commands matters, not their timing
- Automatically extracted from Verilog (with user annotations)
  - Comments in the Verilog highlight important control state variables
  - Transitive set-of-support: capture the logic that controls these variables
  - Derive a list of "independent" variables; the coverage tool will use this
- Transition coverage = (# of control events taken) / (total reachable control events in Ge)  (see the sketch below)
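
A minimal Python sketch of the transition-coverage ratio above: project simulator state dumps onto the control variables at the control-datapath interface and count which projected transitions (control events) were taken. The variable names, dumps, and reachable-event set are invented examples, not the tool's actual format.

```python
interface_vars = ["load", "sel"]          # control variables visible to the datapath

# One simulation run: a list of full state dumps (dicts of all state variables).
dump = [
    {"load": 0, "sel": 0, "cnt": 0, "tmp": 7},
    {"load": 1, "sel": 0, "cnt": 1, "tmp": 7},
    {"load": 1, "sel": 1, "cnt": 2, "tmp": 3},
    {"load": 0, "sel": 1, "cnt": 3, "tmp": 3},
    {"load": 1, "sel": 0, "cnt": 4, "tmp": 3},
]

def project(state):
    return tuple(state[v] for v in interface_vars)

taken = {(project(a), project(b)) for a, b in zip(dump, dump[1:])
         if project(a) != project(b)}       # projected control events actually taken

# Reachable control events in Ge (would come from reachability analysis).
reachable = {((0, 0), (1, 0)), ((1, 0), (1, 1)), ((1, 1), (0, 1)),
             ((0, 1), (1, 0)), ((1, 0), (0, 0)), ((0, 1), (0, 0))}

print("transition coverage:", len(taken & reachable) / len(reachable))  # 4/6
```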

Page 51: CAD: Coverage Metrics (4, 4.1)

CONTROL EVENT COVERAGE [Ho & Horowitz '96] (cont'd)
- Take the global state graph; project out the independent variables
- Take state and transition dumps from simulators; check which states/transitions have been covered
- PROBLEM: state explosion. Heuristics:
  - Graph pruning using don't cares
    - If a variable is written every cycle, "zero" it when it is not read; efficient to determine statically
  - Approximating the state space
    - Project out variables that are close to primary inputs (more likely to be close to non-deterministic)

Page 52: Control Event Coverage: Applicability (4.2)

CONTROL EVENT COVERAGE [Ho & Horowitz '96]
- Empirical data:
  - In general, the same test set gives less control event coverage than full state/event coverage
  - Highlights important tests that are missed
- Full coverage analysis gives a huge # of untested scenarios: hard to use this data
  - For conventional coverage, it may be useful to project these onto fewer variables
- It is difficult to exercise an uncovered scenario in full coverage
  - Using fewer variables and over-approximation, it is easier to incrementally construct a test scenario

Page 53: CAD Cov. Metrics: Tag Coverage (4)

"Observability-based" Coverage Metric [Fallah, Devadas, Keutzer '96]
- Tags:
  - A mechanism to extend standard coverage metrics with observability requirements ['96]
  - Capture assignment/computation errors ['98]
- DISCLAIMER: bugs do not always manifest themselves as an incorrect value of some HDL variable
  - Errors of omission, wrong global assumptions, program goes down the wrong control path
- IDEA: tag a variable with +Δ or -Δ: a deviation from its intended value
  - Optimistic assumption: the deviation is big enough to propagate in each case. Example: x > y + Δ
- Run a set of simulation vectors, tagging one variable assignment at a time, using a tag calculus
- Determine which tags propagate to the output; calculate the % propagated (a small sketch follows below)
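
A minimal Python sketch of the tag idea described above: attach a positive-deviation tag to one assignment, propagate it through a tiny block of RTL-like assignments with a crude "any tagged operand taints the result" calculus, and check whether it reaches the output. The toy design, vectors, and propagation rules are invented illustrations, not the published tag calculus.

```python
def simulate(inputs, tagged_var=None):
    """Return (values, tags); tags[v] is True if v carries the injected deviation."""
    v = dict(inputs)
    tag = {k: False for k in inputs}

    def assign(dst, value, deps):
        v[dst] = value
        # Optimistic propagation: dst is tagged if any operand is tagged,
        # or if dst is the assignment chosen for tag injection.
        tag[dst] = (dst == tagged_var) or any(tag[d] for d in deps)

    assign("sum", v["a"] + v["b"], ["a", "b"])
    assign("sel", v["a"] > 2, ["a"])
    # The mux masks "sum" when sel is False, so a tag on "sum" only reaches
    # "out" on vectors that actually select it.
    assign("out", v["sum"] if v["sel"] else 0, ["sum", "sel"] if v["sel"] else ["sel"])
    return v, tag

vectors = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
propagated = sum(simulate(vec, tagged_var="sum")[1]["out"] for vec in vectors)
print("tag coverage for 'sum':", propagated, "/", len(vectors))   # 1 / 2
```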

Page 54: CAD Tag Coverage (4.1, 4.2)

"Observability-based" Coverage Metric [Fallah, Devadas, Keutzer '96]
- There is full observability of internal nodes during simulation, BUT this information may be incomprehensible even to the designer
- Observability-based coverage gives more meaningful % numbers
  - A random test may exercise a line but not propagate its effect to an output
  - According to the observability-based metric, user-given test cases yield much better coverage than random ones
    - Not necessarily the case if observability isn't considered
- Overhead for computing controllability is not too much: 1.5-4x simulation time
- Captures most errors that can be caught by structural metrics
  - BUT: produces more errors than need to be analyzed; find an error model with fewer candidate errors

Page 55: CAD: Coverage Metrics (4)

- Reachability analysis with a coverage goal: bias simulation/search to achieve more coverage
- Coverage-directed state space search [Aziz, et al.]
  - Guard (decision) coverage is the metric
  - Give low priority to states yielding few new guards
- Guided search of the state space
  - Guidepost coverage: set intermediate goals for reaching a designated state set; bias the search to maximize achievement of the intermediate goals
  - Useful; captures intuition about how to reach a state inside a big state space
  - Requires a lot of designer effort
- Saturated simulation [Aziz, Kukula, Shiple]
  - Pick a subset of transitions or next states to exercise all controller-pair states or transitions

Page 56: CAD: Coverage Metrics (4, 4.1)

- Netlist coverage (0-In): based on circuit structure
  - Registers: loaded, loaded with unique values, read, initialized
    - Like the definitions, definitions-uses, etc. criteria
  - Counters: overflow, underflow
    - Exercises domain boundaries
  - Register-to-register paths: are all (feasible) ones exercised?
  - Line, branch, single-controller FSM, pair-arc coverage
- Architectural coverage: coverage of a high-level, behavioral machine ("transaction level")
- Berkeley Wireless Center designers complain: "No metric relating to timing errors"
  - Pulse timing; a state is entered or exited too soon

Page 57: Commercial Coverage Tools

- SureFire (SureCov), Design Acceleration Inc. (Coverscan), SummitDesign, TransEDA (HDLCover, VeriCov), interHDL (CoverIt), Veritools, ...
  - Blocks, arcs (branches), expressions, FSM states and transitions, sequences, pair FSM coverage
- Covermeter
  - Statement, block, branch, condition coverage
  - Register and net toggle coverage
  - FSM coverage
  - Data transfer coverage (register transfers and buses) ??
  - Invariant coverage / assertion checking ??
- SureFire (SureSolve)
  - Functional verification suite and automatic testbench generation
  - Exercises 90 to 100% of all reachable HDL constructs !!

Page 58: Conclusions I

- Difficult to draw sharp conclusions from existing formal/mathematical relationships among metrics
  - Only statistical comparisons exist, for particular testing scenarios
  - Formal relations between metrics (such as "subsumes") only indirectly correlate with "bug-detection ability"
- Metrics need to be compared intuitively and experimentally using actual test sets, and actual designs and errors
- Existing experimental results comparing metrics are interesting but limited
  - Should not be taken as conclusive for HW
- Factors to be considered when comparing metrics:
  - Effectiveness at 100% coverage
  - Test set size vs. X% coverage, and the cost of constructing this test set
  - Coverage per unit resource
  - The type of bugs the metric is well-tailored to catch
- A "design errors database" would be useful, perhaps indispensable, for studying the above

Page 59: Conclusions II

- System models and error models from all three domains can be used for HW.

PROTOCOLS
- Error models for the control part of protocols are too low-level
  - May be useful for mutation coverage, but do not capture actual errors
- Control coverage methods are unlikely to be practical for HW:
  - They require detailed coverage of the state transition graph
  - They are based on black-box testing
- Datapath and domain coverage metrics are very similar to SW and can be useful

SOFTWARE
- Most comprehensively studied, but the experimental data is not conclusive
- Data-flow and dependence-based metrics are likely to be useful for HW
  - They have RTL netlist analogs. Example: definition-use paths
  - The tag propagation approach can be applied
- Best to complement IMPL-based metrics with SPEC-based metrics

Page 60: Conclusions III

SOFTWARE (cont'd)
- Domain (partition) testing metrics are likely to be useful for HW
  - A subdomain = an assignment of values to the wires that go from the datapath into control
  - Domain coverage metrics can be applied to these assignments: each choice used, base choice used, all combinations
- Mutation analysis/adequacy is likely to uncover interesting bugs
  - Live mutants show what part of the design is not covered
  - A good intuitive measure for "having simulated enough"
  - But computationally expensive: quadratic in the size of the description
- SPEC-mutation testing is more computationally viable and uncovers incompleteness in specs
- Must derive a good set of mutation operators for HW IMPL and SPEC

Page 61: Conclusions IV

CAD
- Control event coverage: a meaningful subset of control variables to consider
  - There may be other useful subsets
- Tag coverage is a useful tool for computing other sorts of coverage: all definitions-uses, different variable interactions
- For tag coverage, one can choose which circuit nodes are comprehensible to the designer
  - Declare these "observable"; compute which errors propagate to the observable nodes

Page 62: Conclusions V

- No work seems to exist in the HW domain for
  - Datapath coverage metrics
  - Metrics to cover timing errors
  - Mutation testing
- No experimental work on
  - Effectiveness of metrics
  - Comparison of metrics
- Such experimental work will be essential if no meaningful formal relationships can be derived (cf. manufacturing test: stuck-open faults vs. stuck-at faults vs. bridging faults)