Commandment VI
If a function be advertised to return an error code in the event of difficulties, thou shalt check for that error code, yea, even though the checks triple the size of thy code and produce aches in thy typing fingers, for if thou thinkest “it cannot happen to me”, the gods shall surely punish thee for thy arrogance.
Administrivia
Office hours moved on Wed. 9:00-10:00; 1:00-2:00
Grades on P1, Milestone 1 back today
Grade scale for Milestone 1
- Code functionality (60%): compiles; meets/implements Map interface (syntax); semantically correct; bulletproof; style/design; meets performance specs
- Performance documentation (20%): logical flow; substantiates points/conclusions; tests all required performance stats; writing/grammar/usage/comprehensibility
- User/API documentation (15%): documents all methods; method docs agree with spec; gives full contractual (non-method) spec; bugs/spec violations documented; writing/grammar/usage/comprehensibility
- Etc. (5%): proper handin protocol; generated javadoc correctly; CVS logs
The Joy of Test
Or: How to know that you got what you came for, once you think you have it....
The “Software Life Cycle”
1) Requirements gathering
2) Design and specification
3) Implementation
4) Verification
   1) Unit testing
   2) Integration testing
5) Validation
6) Delivery
7) Maintenance
8) Retirement
Ok, but what do all of those mean?
- Requirements: What the customer wants
- Specification: What the thing will do, measurably
- Verification: Does it do what the spec says it should do?
- Validation: Does it actually do what the customer wants?
Requirements gathering
- What does the customer want? What does it do?
- Not how should it work! That’s your job!
- What does the customer say that he/she wants?
  - “It should be blue”
  - “It’s just one exception case, it can’t be very hard, can it?”
- Problem is: you think about these things very differently than the customer. (Usually.)
- They will not break it down into modules for you!
- They will probably have trouble giving you good test cases.
- Customer signoff
Specification and Design?
- What will it do, measurably/quantifiably?
- What is the high-level breakdown of the program?
  - Modules
  - Data structures/storage
  - Algorithms
  - Data flow
  - UI
- Testable requirements!
- Customer signoff on specs (maybe)
Implementation
- Code! Code like the wind!
- More than that, though...
  - Document
  - Test
  - Feedback from customer
    - Requirements/spec approvals/modifications
    - Demo prototypes
  - Test -- always refer back to specs
  - Test some more!
Does it do what we said it would?
- Verification: does it meet the spec?
- Does it match the measurable requirements?
  - Data format
  - I/O
  - UI
  - Resource usage
  - etc...
- Should have test cases for everything in your spec (preferably written before implementation!)
Types of testing I: “Glass box”
- Exercise all known code paths
  - Force code down different “if” or “case” conditions
  - Force exceptions to be thrown
  - Check edge cases (null input, etc.)
  - Valid and invalid inputs
  - Neurotic cases
- You know the guts of the code -- build test cases that try all of those paths
- Easier (less hard, anyway) on small modules/subroutines
Types of testing II: “Black box”
- Don’t have (don’t look at) code -- it’s a black box
- Know inputs/outputs (contract)
- Build test cases against known I/O relationships
  - E.g., testing java.util.HashMap (see the sketch after this slide)
- Picking good test cases is hard
  - Shoot for known edge cases
  - Try to violate assumptions
- Often the case for security assessments
- Often, all that’s possible for large programs/multi-module things
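As a concrete illustration of black box testing against a published contract, here is a minimal JUnit (3.x-style) sketch that exercises java.util.HashMap purely through the documented Map behavior; the test class name and the particular cases are mine, not part of the slides.

    import java.util.HashMap;
    import java.util.Map;
    import junit.framework.TestCase;

    // Black box: rely only on the documented Map contract,
    // never on HashMap's internal hashing or resizing behavior.
    public class HashMapContractTest extends TestCase {

        public void testPutThenGetReturnsValue() {
            Map map = new HashMap();
            map.put("key", "value");
            assertEquals("value", map.get("key"));
        }

        public void testPutReplacesOldValueAndReturnsIt() {
            Map map = new HashMap();
            map.put("key", "old");
            Object previous = map.put("key", "new");
            assertEquals("old", previous);      // put() returns the value it displaced
            assertEquals("new", map.get("key"));
            assertEquals(1, map.size());        // replacement, not a second entry
        }

        public void testEdgeCases() {
            Map map = new HashMap();
            assertNull(map.get("missing"));     // absent key -> null
            map.put(null, "nullKeyAllowed");    // HashMap permits a null key
            assertEquals("nullKeyAllowed", map.get(null));
        }
    }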
Types of testing III: “Unit tests”
- Orthogonal to glass/black box
- Testing the behavior of a small “unit” -- e.g., a class
- Does this thing, all by itself, do what its spec/contract says it should, regardless of other parts of the system?
- Usually simple to automate (e.g., JUnit -- see the sketch below)
- Good candidate for glass box
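To make “simple to automate” concrete, here is a minimal JUnit 3.x-style unit test sketch. The Tokenizer class, its tokenize method, and the expected results are hypothetical stand-ins for whatever small unit your own spec defines, not an API mandated by the course.

    import java.util.List;
    import junit.framework.TestCase;

    // Unit test: checks one class in isolation against its own contract,
    // without running the rest of the spam filter.
    public class TokenizerTest extends TestCase {

        public void testSplitsOnWhitespace() {
            Tokenizer tok = new Tokenizer();            // hypothetical project class
            List tokens = tok.tokenize("cheap viagra now");
            assertEquals(3, tokens.size());
            assertEquals("cheap", tokens.get(0));
        }

        public void testEmptyInputGivesNoTokens() {
            Tokenizer tok = new Tokenizer();
            assertTrue(tok.tokenize("").isEmpty());     // edge case from the (assumed) spec
        }
    }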
Types of testing IV: “Integration tests”
- Testing the behavior of everything, once it’s all plugged together
- Does it all work correctly in concert?
- Classes A and B might work brilliantly individually, but break when put together
- Often a problem in spec/contract/interface design
- Good candidate for black box testing
- Sometimes automatable
Testing types so far
                 Glass                                  Black
    Unit         Small units: classes, data structs,   Behavior of 3rd-party modules,
                 short code segments                   independent test, etc.
    Integration  Data flow, code paths,                Whole system: total performance,
                 system-level logic, etc.              multi-module functionality
One more testing idea
- “Regression testing”: does it still work?
- Have changes elsewhere in the code broken existing functionality?
  - E.g., new data structs violate input assumptions of unchanged modules
- Very, very important!
  - Most (~90%) of development is maintenance/updates on existing code bases
  - Understanding the consequences of “small” changes is very difficult
- Automate if at all possible! (One way to do that is sketched below.)
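One low-effort way to automate regression testing here is to collect every test you write into a single JUnit suite and rerun it after each change. A minimal sketch, reusing the hypothetical test classes from the earlier examples:

    import junit.framework.Test;
    import junit.framework.TestSuite;

    // Regression suite: rerun the whole accumulated test set after every change,
    // so a "small" edit that breaks old functionality is caught immediately.
    public class RegressionSuite {
        public static Test suite() {
            TestSuite suite = new TestSuite("All project tests");
            suite.addTestSuite(TokenizerTest.class);
            suite.addTestSuite(HashMapContractTest.class);
            return suite;
        }
    }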
In practice, for Project 1
- Two significant questions to answer:
  - Does it meet spec? (Does it correctly implement the Bayesian spam filter algorithms we defined?)
  - Does it actually classify SPAM well? Which tokenizer works better?
- First question: same sorts of tests we’ve been discussing
- Second question: trickier
Testing a learning system
- Problem is that answers change, depending on training data
  - Very little data => poor statistics => poor classification accuracy
  - Uncharacteristic data => stats don’t match real training data
- A learning system gets better the more training data it sees
Example of learning over time

    Token         Frequency
    Hello         1/6
    Nigeria       2/6
    Viagra        2/6
    Opportunity   1/6
Example of learning over time

    Token         Frequency
    Hello         23/314128
    Nigeria       242/314128
    Viagra        198/314128
    Opportunity   53/314128
    Business      12/314128
    XXX           93/314128
    Friend        127/314128
    Hot           47/314128
Measuring improvements over time
[Plot: SPAM accuracy (y-axis, 0 to 1) vs. # tokens in training data (x-axis); Pr[Spam] marked as a reference level]
So what’s accuracy?
accuracy = (# correct classifications) / (total # classifications)

You actually have 3 different accuracies:
- SPAM accuracy: (# correct SPAM) / (# total SPAM)
- NORMAL accuracy: (# correct NORMAL) / (# total NORMAL)
- Overall accuracy: (# correct, any kind) / (# total emails)
Should measure all of them:
- Learning curve
- Asymptotic accuracy
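A made-up numeric example (the counts are invented for illustration only): suppose the test set has 90 SPAM and 110 NORMAL emails, and the filter classifies 72 of the SPAM and 104 of the NORMAL correctly. Then SPAM accuracy = 72/90 = 0.80, NORMAL accuracy = 104/110 ≈ 0.95, and overall accuracy = (72 + 104)/200 = 0.88. The three numbers can differ substantially, which is exactly why you should report all of them.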
Testing procedure pseudocode
Select a batch of “test” emails (mix of SPAM and NORMAL)
For N = 1, 2, ...
    Train system on N emails (SPAM + NORMAL)
    For each email t in the “test” set
        Run BSFTest on email t
        Record whether the classification is correct or not
    Average (# correct) / (total # test emails)
    Plot point on accuracy plot
Average ~10-20 accuracy values at large N for the asymptotic accuracy
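A sketch of the same procedure in Java. The SpamFilter interface, the LabeledEmail class, and the factory are hypothetical placeholders -- your actual Project 1 classes (whatever BSFTest wraps) will differ -- so treat this as the shape of the loop, not as a required API.

    import java.util.List;

    // Hypothetical placeholder types -- substitute your actual Project 1 classes.
    class LabeledEmail {
        String text;       // raw email text (or file contents)
        boolean isSpam;    // known correct label
        LabeledEmail(String text, boolean isSpam) { this.text = text; this.isSpam = isSpam; }
    }

    interface SpamFilter {
        void train(String emailText, boolean isSpam);  // add one labeled email to the statistics
        boolean classify(String emailText);            // true = SPAM, false = NORMAL
    }

    public class LearningCurveExperiment {

        interface SpamFilterFactory { SpamFilter newFilter(); }

        // For each N, train a fresh filter on the first N training emails,
        // then measure accuracy on the held-out test set. Returns one
        // accuracy value per N; plot these to get the learning curve.
        static double[] learningCurve(SpamFilterFactory factory, List training, List test) {
            double[] curve = new double[training.size()];
            for (int n = 1; n <= training.size(); n++) {
                SpamFilter filter = factory.newFilter();   // start from scratch each time
                for (int i = 0; i < n; i++) {
                    LabeledEmail ex = (LabeledEmail) training.get(i);
                    filter.train(ex.text, ex.isSpam);
                }
                int correct = 0;
                for (int i = 0; i < test.size(); i++) {
                    LabeledEmail ex = (LabeledEmail) test.get(i);
                    if (filter.classify(ex.text) == ex.isSpam) {
                        correct++;
                    }
                }
                curve[n - 1] = (double) correct / test.size();
            }
            return curve;
        }
    }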
Some (hopefully) helpful hints
- Choose (# SPAM)/(# NORMAL) about the same in the train and test data sets (why?)
- Make sure that the train data and test data are different emails (why?)
- Make sure there are roughly the same kinds of emails in the train and test sets
  - The mailFromMom folder has very different statistics than the mailFromCS351 folder
  - If you have both in the test data, make sure both are in the train data too
- Pick emails from similar date ranges for train and test (why?)
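One plausible way to build train and test sets that respect these hints is to shuffle all the labeled emails once (with a fixed seed, for reproducibility) and then split the shuffled list: the SPAM/NORMAL mix, mail folders, and date ranges stay roughly balanced between the two sets, and no email lands in both. A sketch using only the JDK -- the shuffle-then-split choice is mine, not something the project mandates:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class DataSplit {

        // Shuffle the labeled emails, then take the first 'trainSize' for training
        // and the rest for testing. Returns { trainList, testList }.
        static List[] split(List labeledEmails, int trainSize, long seed) {
            List shuffled = new ArrayList(labeledEmails);
            Collections.shuffle(shuffled, new Random(seed));  // fixed seed -> reproducible split
            List train = new ArrayList(shuffled.subList(0, trainSize));
            List test = new ArrayList(shuffled.subList(trainSize, shuffled.size()));
            return new List[] { train, test };
        }
    }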
What’s the difference?
- Which tokenizer is better? How do you know?
- Do the testing procedure on both -- compare results
  - Asymptotic accuracy -- whose is higher? Is it better overall, just on SPAM, just on NORMAL?
  - Learning curve -- steeper is better!
- Are certain sorts of emails better/more classifiable than others?
- Does handling the headers help?