DESCRIPTION
So you’re a big data and distributed systems “expert”: you’ve collected 500 billion data points, thrown them into sci-lib-of-the-week, you’re using Hadoop backed onto those cool AWS GPU instances, you’ve let it grind away for days, and it’s spat out the answer to life, the universe and everything. But is it really better than a coin toss? How do you validate whether your data analysis algorithm works? Are you learning a solution to your problems or just the data you already have? What problems can you encounter when analysing your data? How do you solve them, and what can you do easily under the time pressures of a business environment?
ARE YOU BETTER THAN A COIN TOSS?
BY JOHN OLIVER AND RICHARD WARBURTON
WHO ARE WE?
Why you should care
The Fundamentals
Practical Problems
Applying the Theory
"EXPERTS" AREN'T VERY GOOD
BIG DATA SOLVES ALL KNOWN PROBLEMS
... HELPS
VALIDATION = TESTS FOR DATA
PART 1: FUNDAMENTALS
NULL HYPOTHESIS
Until proven otherwise there is no relationship between phenomena
WHEN YOU HEAR "WOLF!" THERE IS A WOLF NEARBY
                        Cry "Wolf!"       Stay Quiet
Wolf Nearby             Ok                False Negative
It's really a chicken!  False Positive    Ok
WHY IS THIS IMPORTANT?
It is better that ten guilty persons escape than that one innocent suffer
- William Blackstone
STATIC ANALYSIS
COST BENEFIT ANALYSIS
Costs a lot to jail an innocent man
Costs very little to show someone an inappropriate house
Credibility, Liberty, Morality are also costs
CHOOSE THE RIGHT MEASUREMENT
There's more than one concept of accuracy
RECALL
number of true positives / number of actual positives
PRECISION
number of true positives / number of predicted positives
F MEASURE
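The three measures above can be computed directly from paired lists of actual and predicted labels. The following is a minimal sketch, not the speakers' code: the function name `precision_recall_f1` and the toy labels are made up for illustration, and the F measure is assumed to be the common F1 variant (the harmonic mean of precision and recall).

```python
def precision_recall_f1(actual, predicted):
    """Return (precision, recall, F1) for two equal-length lists of booleans."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    # Fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical toy data: was a wolf really nearby, and did we cry "Wolf!"?
wolf_nearby = [True, True, True, False, False, False, False, False]
cried_wolf = [True, True, False, True, False, False, False, False]
precision, recall, f1 = precision_recall_f1(wolf_nearby, cried_wolf)
```

With two true positives, one false positive and one false negative, all three values come out at two thirds.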
CASE STUDY: MEMORY LEAKS
About 10% of our dataset had memory leaks
Predict "never leaks memory" ~= 0.9 accuracy, but F1 = 0
Our algorithm ~= 0.9 accuracy and F1 ~= 0.9
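The gap between accuracy and F1 for the "never leaks memory" predictor can be reproduced on synthetic data. This sketch assumes a made-up dataset of 100 programs of which 10 leak; it is not the dataset from the case study, and `accuracy_and_f1` is an illustrative helper.

```python
def accuracy_and_f1(actual, predicted):
    """Return (accuracy, F1) for two equal-length lists of booleans."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    accuracy = (tp + tn) / len(actual)
    # F1 written in terms of counts: 2TP / (2TP + FP + FN).
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Synthetic data (assumption): 10 leaking programs out of 100.
leaks = [True] * 10 + [False] * 90
never_leaks = [False] * 100  # the trivial "never leaks memory" predictor
baseline_accuracy, baseline_f1 = accuracy_and_f1(leaks, never_leaks)
```

The trivial predictor is right 90 times out of 100, yet it finds no leaks at all, so its F1 is zero.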
PROBLEM: RELIABILITY OF MEASUREMENT
RULE OF THUMB
If it looks like random noise, it probably is random noise.
SOLUTION: CHECK YOUR DATA
Low Standard Deviation
Coefficient of Variation = Standard Deviation / Mean
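A small sketch of the coefficient of variation using Python's `statistics` module. The timing values are invented, and the sample standard deviation (`statistics.stdev`) is an assumption, since the slide does not say whether the sample or population form was meant. The ratio is only meaningful when the mean is non-zero.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


# Hypothetical benchmark timings in milliseconds.
steady = [100, 101, 99, 100, 102, 98]
noisy = [100, 150, 60, 130, 40, 120]
cv_steady = coefficient_of_variation(steady)
cv_noisy = coefficient_of_variation(noisy)
```

Both series have the same mean, but the second has a far larger coefficient of variation, which is a hint that its measurements are not reliable.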
CAVEAT: NON-NORMAL DISTRIBUTIONS
SOLUTION: GO MAD
MEDIAN ABSOLUTE DEVIATION
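The median absolute deviation is the median of each sample's distance from the median. A minimal sketch, with invented latency figures that include one extreme outlier:

```python
import statistics


def median_absolute_deviation(samples):
    """Median of the absolute distances from the median."""
    centre = statistics.median(samples)
    return statistics.median([abs(x - centre) for x in samples])


# Hypothetical latencies in milliseconds, with a single outlier.
latencies = [10, 11, 9, 10, 12, 10, 500]
mad = median_absolute_deviation(latencies)
spread = statistics.stdev(latencies)
```

Here the outlier drags the standard deviation well above 100, while the MAD stays at 1, which is why it suits skewed or heavy-tailed data better.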
PROBLEM: EXPERIMENTAL FLUKES
IS YOUR A/B TEST A HEISEN TEST?
SOLUTION: P-VALUE
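One way to obtain a p-value for an A/B test without distributional assumptions is a permutation test: shuffle the pooled measurements many times and count how often a random split differs at least as much as the observed one. The slides do not say which test was used, so this is only one possible sketch; the function name, the number of rounds and the data are all illustrative.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, rounds=2000, seed=0):
    """Two-sided p-value for the difference in means of groups a and b."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(rounds):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / rounds


# Hypothetical response times (seconds) for two variants.
variant_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
variant_b = [13.0, 12.9, 13.4, 13.1, 12.8, 13.3, 13.2, 12.7]
p_different = permutation_p_value(variant_a, variant_b)
p_same = permutation_p_value(variant_a, variant_a)
```

A small p-value says the observed gap would rarely arise by chance alone. Comparing a variant with itself gives a p-value of 1, since every shuffle is at least as extreme as a difference of zero.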
SCIENCE WORKS - B****ES!
PART 2: PRACTICAL PROBLEMS
PROBLEM: FALSE PROPHETS
I'M AN EXPERT, LISTEN TO ME!
SOLUTION: ESTABLISH GOALS AND HYPOTHESIS THEN TEST SOLUTIONS
PROBLEM: CODE QUALITY
The math works :-) the code does not :-(
@headinthebox
GROWTH IN A TIME OF DEBT
SOLUTION: SOFTWARE ENGINEERING PRACTICES
Everyone Lies
- House
SOLUTION: UNDERSTAND BIASES AND DESIGN AROUND THEM
Gay couples should have an equal right to get married, not just to have civil partnerships
Populus: 65% vs 27%
Marriage should continue to be defined as a life-long exclusive commitment between a man and a woman
ComRes + Catholic Voices: 22% vs 70%
ACQUIESCENCE BIAS
Answer yes if there’s a positive connotation
REMOVAL OF PARTICULAR ADVERTISING AND SPONSORSHIP BANS
FOR: 1045 AGAINST: 731 ABSTAIN: 121 Motion Carried
MAINTAINING AN ETHICAL UNION BY REAFFIRMING ADVERTISING AND SPONSORSHIP BANS
FOR: 858 AGAINST: 755 ABSTAIN: 166 Motion Carried
SOLUTION: PHRASE QUESTIONS NEUTRALLY
And only have one question
SOCIAL DESIRABILITY
Poor people overestimate their income, rich people underestimate it.
SOLUTIONS
Anonymisation
Confidentiality
Randomized Response
Bogus Pipeline
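Of the solutions listed, randomized response is the one that lends itself to a short simulation. In the variant sketched here (an assumption, since the slide names the technique without giving a protocol), each respondent privately flips a coin: on heads they answer truthfully, on tails they answer with a second coin flip. No single answer reveals the truth, but the population rate can still be recovered because P(yes) = 0.5 * true_rate + 0.25.

```python
import random


def randomized_answer(truth, rng):
    """Answer truthfully on heads; otherwise answer with a second coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(answers):
    """Invert P(yes) = 0.5 * true_rate + 0.25."""
    yes_rate = sum(answers) / len(answers)
    return 2 * yes_rate - 0.5


rng = random.Random(42)
# Hypothetical population: 30% would truthfully answer "yes".
population = [i < 3000 for i in range(10_000)]
answers = [randomized_answer(truth, rng) for truth in population]
estimated_rate = estimate_true_rate(answers)
```

The estimate is noisier than a direct survey of the same size, which is the price paid for giving each respondent deniability.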
BIAS TOWARDS THE FIRST ANSWER OF A QUESTION
Make sure to randomise the order of answers
WHAT WILL THE NEXT CRISIS IN WASHINGTON BE?
Fight over the debt ceiling
Difficulty averting automatic cuts to the Pentagon
Failure to pass basic budget bills
All of the above
http://www.foxnews.com/politics/elections/2012/you-decide/what-will-next-crisis-washington-be
PROBLEM: CORRELATION DOESN’T IMPLY CAUSALITY
DATABASE AND NETWORK ACTIVITY CORRELATING
Performance Diagnosis: was actually a GC Problem.
SOLUTION: DOMAIN KNOWLEDGE
SOLUTIONS
Use domain knowledge - ask Pilots
Stratified sample sets
Measure outcomes - are planes surviving more?
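A stratified sample draws the same fraction from every subgroup, so a small but important group is not lost by chance. A minimal sketch follows; the record layout, the `key` function and the returned/lost split are invented for illustration and are not data from the talk.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from each stratum identified by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        # Take at least one record so that no stratum disappears entirely.
        count = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, count))
    return sample


# Hypothetical records: 90 planes that returned and 10 that were lost.
planes = [("returned", i) for i in range(90)] + [("lost", i) for i in range(10)]
sample = stratified_sample(planes, key=lambda plane: plane[0], fraction=0.2)
lost_in_sample = sum(1 for plane in sample if plane[0] == "lost")
```

A plain random sample of 20 could easily contain no lost planes at all; the stratified one always contains two.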
BE RIGOROUS
PART 3: APPLYING THE THEORY
CORRELATION
A MEASURE OF THE STRENGTH OF DEPENDENCE BETWEEN TWO VARIABLES
PEARSON CORRELATION
Err...Just look it up
(Assumes linear relationship)
Range   Strength
<0.4    Weak/No Correlation
<0.7    Some Correlation
>0.7    Strong Correlation
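For readers who do look it up, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand and maps the result onto the strength bands in the table; treating exactly 0.7 as strong and applying the bands to the absolute value are assumptions, and the sample data is invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


def strength(r):
    """Label |r| using the bands from the table."""
    magnitude = abs(r)
    if magnitude < 0.4:
        return "Weak/No Correlation"
    if magnitude < 0.7:
        return "Some Correlation"
    return "Strong Correlation"


# Hypothetical, perfectly linear measurements.
disk_io = [1, 2, 3, 4, 5]
system_time = [2, 4, 6, 8, 10]
r = pearson(disk_io, system_time)
label = strength(r)
```

Because the coefficient assumes a linear relationship, a low value does not rule out a strong non-linear dependence.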
CASE STUDY: PERFORMANCE PROBLEM WITH HIGH SYSTEM TIME
Hypothesis: caused by Disk I/O
Correlation Strength: 0.78453
MACHINE LEARNING
Application of statistics to learn a relationship
HOW MANY CLUSTERS?
WHERE'S THE ELBOW?
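The elbow heuristic runs a clustering algorithm for increasing numbers of clusters and looks for the point where the error stops falling sharply. The slides do not name the algorithm, so this sketch assumes k-means (Lloyd's algorithm) on one-dimensional data with a deterministic start; `kmeans_sse` and the points are illustrative only.

```python
def kmeans_sse(points, k, iterations=20):
    """Run 1-D k-means and return the within-cluster sum of squared errors."""
    ordered = sorted(points)
    # Deterministic start: spread the initial centres across the sorted data.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centres[c]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its cluster; keep it if the cluster is empty.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centres) for p in points)


# Hypothetical data with three obvious clumps.
points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
sse = {k: kmeans_sse(points, k) for k in range(1, 5)}
```

On this data the error falls steeply up to three clusters and barely moves after that, so the elbow sits at k = 3.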
FITTING
SOLUTION: CROSS VALIDATION
CHOOSE CROSS VALIDATION DATA WISELY
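In k-fold cross validation the data is split into k parts, and each part takes a turn as the validation set while the rest is used for training. A minimal sketch of the index bookkeeping, with an illustrative function name and fold count:

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, validation) index lists for k shuffled folds over n items."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n=10, k=5))
```

Shuffling is itself a choice about the validation data: for time-ordered or grouped records a random split can leak information from the training set into the validation set, so a split by time or by group may be the wiser option.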
SELF VALIDATING
Ensemble methods - Train lots of weak classifiers and merge
RANDOM FOREST AND BAGGING
Divide the data into bootstrap sets
Use the rest for calculating error
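A bootstrap set is drawn with replacement from the training data, which leaves some records out of every draw; those left-over ("out-of-bag") records are what the error is calculated on. This sketch only shows the split itself, with a made-up dataset size, and does not train any classifier.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; indices never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(1)
in_bag, out_of_bag = bootstrap_split(1000, rng)
oob_fraction = len(out_of_bag) / 1000
```

For large datasets roughly 37% of the records (about 1/e) end up out-of-bag for each bootstrap set, so every weak classifier gets its own held-out data for free.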
LEARNING CURVES
HOW MUCH IS TOO MUCH?
MONITOR PRODUCTION DATA...IT CHANGES
Does it look like the same data that you learnt with?
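One crude way to ask whether production data still looks like the training data is to measure how far the production mean has moved, in units of the training standard deviation. This is only a sketch of the idea for a single numeric feature; the function name, the data and any alerting threshold are assumptions, and a real monitor would likely compare whole distributions.

```python
import statistics


def drift_score(training, production):
    """Shift of the production mean, in training standard deviations."""
    shift = abs(statistics.mean(production) - statistics.mean(training))
    return shift / statistics.stdev(training)


# Hypothetical values of one feature.
training = [10, 12, 11, 9, 10, 11, 12, 9, 10, 11]
similar = [11, 10, 9, 12, 10, 11]
shifted = [18, 19, 17, 20, 18, 19]
score_similar = drift_score(training, similar)
score_shifted = drift_score(training, shifted)
```

A score near zero suggests the feature looks as it did during training, while a score of several standard deviations is a signal to re-validate the model.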
A/B TEST NEW SYSTEMS
Satisfaction/Profit/Traffic...
COMMON THREADS
Training set errors are misleading
Cross Validation, Production Monitored Values are the ones that really matter
Visualise and compare these errors
CONCLUSION
Analytics are increasingly important
Wide variety of statistical and practical tips to get them right
Have fun and Best of luck!
@johno_oliver @RichardWarburto
QUESTIONS?
http://insightfullogic.com