Commandment VI
If a function be advertised to return an error code in the event of difficulties, thou shalt check for that error code, yea, even though the checks triple the size of thy code and produce aches in thy typing fingers, for if thou thinkest “it cannot happen to me”, the gods shall surely punish thee for thy arrogance.
Administrivia
Office hours moved on Wed. 9:00-10:00; 1:00-2:00
Grades on P1, Milestone 1 back today
Grade scale for Milestone 1
- Code functionality (60%): compiles; meets/implements Map interface (syntax); semantically correct; bulletproof; style/design; meets performance specs
- Performance documentation (20%): logical flow; substantiates points/conclusions; tests all required performance stats; writing/grammar/usage/comprehensibility
- User/API documentation (15%): documents all methods; method docs agree with spec; gives full contractual (non-method) spec; bugs/spec violations documented; writing/grammar/usage/comprehensibility
- Etc. (5%): proper handin protocol; generated javadoc correctly; CVS logs
The Joy of Test
Or: How to know that you got what you came for, once you think you have it....
The “Software Life Cycle”
1) Requirements gathering
2) Design and specification
3) Implementation
4) Verification
   1) Unit testing
   2) Integration testing
5) Validation
6) Delivery
7) Maintenance
8) Retirement
Ok, but what do all of those mean?
- Requirements: What the customer wants
- Specification: What the thing will do, measurably
- Verification: Does it do what the spec says it should do?
- Validation: Does it actually do what the customer wants?
Requirements gathering
- What does the customer want? What does it do?
- Not how should it work! That’s your job!
- What does the customer say that he/she wants?
  - “It should be blue”
  - “It’s just one exception case, it can’t be very hard, can it?”
- Problem is: you think about these things very differently than the customer. (Usually.)
- They will not break it down into modules for you!
- They will probably have trouble giving you good test cases.
- Customer signoff
Specification and Design?
- What will it do, measurably/quantifiably?
- What is the high-level breakdown of the program?
  - Modules
  - Data structures/storage
  - Algorithms
  - Data flow
  - UI
- Testable requirements!
- Customer signoff on specs (maybe)
Implementation
- Code! Code like the wind!
- More than that, though...
  - Document
  - Test
  - Feedback from customer
    - Requirements/spec approvals/modifications
    - Demo prototypes
  - Test -- always refer back to specs
  - Test some more!
Does it do what we said it would?
- Verification: does it meet the spec?
- Does it match the measurable requirements?
  - Data format
  - I/O
  - UI
  - Resource usage
  - etc...
- Should have test cases for everything in your spec (preferably written before implementation!)
Types of testing I: “Glass box”
- Exercise all known code paths
  - Force code down different “if” or “case” conditions
  - Force exceptions to be thrown
  - Check edge cases (null input, etc.)
  - Valid and invalid inputs
  - Neurotic cases
- You know the guts of the code -- build test cases that try all of those paths
- Easier (less hard, anyway) on small modules/subroutines
Types of testing II: “Black box”
- Don’t have (don’t look at) code -- it’s a black box
- Know inputs/outputs (contract)
- Build test cases against known I/O relationships
  - E.g., testing java.util.HashMap (see the sketch after this slide)
- Picking good test cases is hard
  - Shoot for known edge cases
  - Try to violate assumptions
- Often the case for security assessments
- Often, all that’s possible for large programs/multi-module things
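As a concrete illustration of black box testing against a published contract, here is a minimal JUnit (3.x-style) sketch that exercises java.util.HashMap purely through the documented Map behavior; the test class name and the particular cases are mine, not part of the slides.

    import java.util.HashMap;
    import java.util.Map;
    import junit.framework.TestCase;

    // Black box: rely only on the documented Map contract,
    // never on HashMap's internal hashing or resizing behavior.
    public class HashMapContractTest extends TestCase {

        public void testPutThenGetReturnsValue() {
            Map map = new HashMap();
            map.put("key", "value");
            assertEquals("value", map.get("key"));
        }

        public void testPutReplacesOldValueAndReturnsIt() {
            Map map = new HashMap();
            map.put("key", "old");
            Object previous = map.put("key", "new");
            assertEquals("old", previous);      // put() returns the value it displaced
            assertEquals("new", map.get("key"));
            assertEquals(1, map.size());        // replacement, not a second entry
        }

        public void testEdgeCases() {
            Map map = new HashMap();
            assertNull(map.get("missing"));     // absent key -> null
            map.put(null, "nullKeyAllowed");    // HashMap permits a null key
            assertEquals("nullKeyAllowed", map.get(null));
        }
    }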
Types of testing III: “Unit tests”
- Orthogonal to glass/black box
- Testing the behavior of a small “unit” -- e.g., a class
- Does this thing, all by itself, do what its spec/contract says it should, regardless of other parts of the system?
- Usually simple to automate (e.g., JUnit -- see the sketch below)
- Good candidate for glass box
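To make “simple to automate” concrete, here is a minimal JUnit 3.x-style unit test sketch. The Tokenizer class, its tokenize method, and the expected results are hypothetical stand-ins for whatever small unit your own spec defines, not an API mandated by the course.

    import java.util.List;
    import junit.framework.TestCase;

    // Unit test: checks one class in isolation against its own contract,
    // without running the rest of the spam filter.
    public class TokenizerTest extends TestCase {

        public void testSplitsOnWhitespace() {
            Tokenizer tok = new Tokenizer();            // hypothetical project class
            List tokens = tok.tokenize("cheap viagra now");
            assertEquals(3, tokens.size());
            assertEquals("cheap", tokens.get(0));
        }

        public void testEmptyInputGivesNoTokens() {
            Tokenizer tok = new Tokenizer();
            assertTrue(tok.tokenize("").isEmpty());     // edge case from the (assumed) spec
        }
    }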
Types of testing IV: “Integration tests”
- Testing the behavior of everything, once it’s all plugged together
- Does it all work correctly in concert?
- Classes A and B might work brilliantly individually, but break when put together
- Often a problem in spec/contract/interface design
- Good candidate for black box testing
- Sometimes automatable
Testing types so far
                 Glass                                  Black
    Unit         Small units: classes, data structs,   Behavior of 3rd-party modules,
                 short code segments                   independent test, etc.
    Integration  Data flow, code paths,                Whole system: total performance,
                 system-level logic, etc.              multi-module functionality
One more testing idea
- “Regression testing”: does it still work?
- Have changes elsewhere in the code broken existing functionality?
  - E.g., new data structs violate input assumptions of unchanged modules
- Very, very important!
  - Most (~90%) of development is maintenance/updates on existing code bases
  - Understanding the consequences of “small” changes is very difficult
- Automate if at all possible! (One way to do that is sketched below.)
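One low-effort way to automate regression testing here is to collect every test you write into a single JUnit suite and rerun it after each change. A minimal sketch, reusing the hypothetical test classes from the earlier examples:

    import junit.framework.Test;
    import junit.framework.TestSuite;

    // Regression suite: rerun the whole accumulated test set after every change,
    // so a "small" edit that breaks old functionality is caught immediately.
    public class RegressionSuite {
        public static Test suite() {
            TestSuite suite = new TestSuite("All project tests");
            suite.addTestSuite(TokenizerTest.class);
            suite.addTestSuite(HashMapContractTest.class);
            return suite;
        }
    }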
In practice, for Project 1
- Two significant questions to answer:
  - Does it meet spec? (Does it correctly implement the Bayesian spam filter algorithms we defined?)
  - Does it actually classify SPAM well? Which tokenizer works better?
- First question: same sorts of tests we’ve been discussing
- Second question: trickier
Testing a learning system
- Problem is that answers change, depending on training data
  - Very little data => poor statistics => poor classification accuracy
  - Uncharacteristic data => stats don’t match real training data
- A learning system gets better the more training data it sees
Example of learning over time

    Token         Frequency
    Hello         1/6
    Nigeria       2/6
    Viagra        2/6
    Opportunity   1/6
Example of learning over time

    Token         Frequency
    Hello         23/314128
    Nigeria       242/314128
    Viagra        198/314128
    Opportunity   53/314128
    Business      12/314128
    XXX           93/314128
    Friend        127/314128
    Hot           47/314128
Measuring improvements over time
[Plot: SPAM accuracy (y-axis, 0 to 1) vs. # tokens in training data (x-axis); Pr[Spam] marked as a reference level]
So what’s accuracy?
accuracy = (# correct classifications) / (total # classifications)

You actually have 3 different accuracies:
- SPAM accuracy: (# correct SPAM) / (# total SPAM)
- NORMAL accuracy: (# correct NORMAL) / (# total NORMAL)
- Overall accuracy: (# correct, any kind) / (# total emails)
Should measure all of them:
- Learning curve
- Asymptotic accuracy
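A made-up numeric example (the counts are invented for illustration only): suppose the test set has 90 SPAM and 110 NORMAL emails, and the filter classifies 72 of the SPAM and 104 of the NORMAL correctly. Then SPAM accuracy = 72/90 = 0.80, NORMAL accuracy = 104/110 ≈ 0.95, and overall accuracy = (72 + 104)/200 = 0.88. The three numbers can differ substantially, which is exactly why you should report all of them.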
Testing procedure pseudocode
Select a batch of “test” emails (mix of SPAM and NORMAL)
For N = 1, 2, ...
    Train system on N emails (SPAM + NORMAL)
    For each email t in the “test” set
        Run BSFTest on email t
        Record whether the classification is correct or not
    Average (# correct) / (total # test emails)
    Plot point on accuracy plot
Average ~10-20 accuracy values at large N for the asymptotic accuracy
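A sketch of the same procedure in Java. The SpamFilter interface, the LabeledEmail class, and the factory are hypothetical placeholders -- your actual Project 1 classes (whatever BSFTest wraps) will differ -- so treat this as the shape of the loop, not as a required API.

    import java.util.List;

    // Hypothetical placeholder types -- substitute your actual Project 1 classes.
    class LabeledEmail {
        String text;       // raw email text (or file contents)
        boolean isSpam;    // known correct label
        LabeledEmail(String text, boolean isSpam) { this.text = text; this.isSpam = isSpam; }
    }

    interface SpamFilter {
        void train(String emailText, boolean isSpam);  // add one labeled email to the statistics
        boolean classify(String emailText);            // true = SPAM, false = NORMAL
    }

    public class LearningCurveExperiment {

        interface SpamFilterFactory { SpamFilter newFilter(); }

        // For each N, train a fresh filter on the first N training emails,
        // then measure accuracy on the held-out test set. Returns one
        // accuracy value per N; plot these to get the learning curve.
        static double[] learningCurve(SpamFilterFactory factory, List training, List test) {
            double[] curve = new double[training.size()];
            for (int n = 1; n <= training.size(); n++) {
                SpamFilter filter = factory.newFilter();   // start from scratch each time
                for (int i = 0; i < n; i++) {
                    LabeledEmail ex = (LabeledEmail) training.get(i);
                    filter.train(ex.text, ex.isSpam);
                }
                int correct = 0;
                for (int i = 0; i < test.size(); i++) {
                    LabeledEmail ex = (LabeledEmail) test.get(i);
                    if (filter.classify(ex.text) == ex.isSpam) {
                        correct++;
                    }
                }
                curve[n - 1] = (double) correct / test.size();
            }
            return curve;
        }
    }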
Some (hopefully) helpful hints
- Choose (# SPAM)/(# NORMAL) about the same in the train and test data sets (why?)
- Make sure that the train data and test data are different emails (why?)
- Make sure there are roughly the same kinds of emails in the train and test sets
  - The mailFromMom folder has very different statistics than the mailFromCS351 folder
  - If you have both in the test data, make sure both are in the train data too
- Pick emails from similar date ranges for train and test (why?)
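One plausible way to build train and test sets that respect these hints is to shuffle all the labeled emails once (with a fixed seed, for reproducibility) and then split the shuffled list: the SPAM/NORMAL mix, mail folders, and date ranges stay roughly balanced between the two sets, and no email lands in both. A sketch using only the JDK -- the shuffle-then-split choice is mine, not something the project mandates:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class DataSplit {

        // Shuffle the labeled emails, then take the first 'trainSize' for training
        // and the rest for testing. Returns { trainList, testList }.
        static List[] split(List labeledEmails, int trainSize, long seed) {
            List shuffled = new ArrayList(labeledEmails);
            Collections.shuffle(shuffled, new Random(seed));  // fixed seed -> reproducible split
            List train = new ArrayList(shuffled.subList(0, trainSize));
            List test = new ArrayList(shuffled.subList(trainSize, shuffled.size()));
            return new List[] { train, test };
        }
    }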
What’s the difference?
- Which tokenizer is better? How do you know?
- Do the testing procedure on both -- compare results
  - Asymptotic accuracy -- whose is higher? Is it better overall, just on SPAM, just on NORMAL?
  - Learning curve -- steeper is better!
- Are certain sorts of emails better/more classifiable than others?
- Does handling the headers help?