Upload
hoangkhuong
View
216
Download
0
Embed Size (px)
Citation preview
Disclaimer
n Iamnota“bigdataanalyst”,merelyastatisticianworkingasa
consultantforcompanieswith– sometimes– largedatasets
- Moreoften,verysmalldatasets
n My“bigdata”problemsaremoreacollectionofanawfullotofsmall
dataproblems
n Asaconsultant,thecustomerbigdatahardware
- iscentraltodefineaworkingsolution
- isoftenfixedandsometimesnotadapted
- wasprobablythebestwhentheyinstalledit(10yearsago)
n Asaconsultant,thecustomersoftware…
- isoftenfixed(protectedenvironment… e.g.unabilitytoinstallanR
package…)
Bayesianstatistics
n WhyBayesianstatistics?
- Focusonprediction insteadofmodelparameters
(notsayingthatparametersaren’tvaluable!)
- Integrateparameteruncertaintyandmeasurementerrorintothepredictivedistribution
(worksforalltypesofmodelsà unifiedframework)
- Don’tstoptobinaryanswers(go/nogoorpass/fail)
- Allowcombinationofknowledgethroughtheprior distribution,inanaturalschemeofupdatingpriorknowledgeusingBayestheorem
- Probability isacoherentmeasureofplausibilityofaneventoccurring,giventhemodelhypothesisanddata,insteadofthefrequencyoftheevent
n Allpointsaboveareindependentofthesizeofthedataset
Bayesianstatistics
Simulationswherethe“newobservations”aredrawnfromdistribution“centered”onestimatedlocationanddispersionparameters(treatedwronglyas“truevalues”).
PosteriorPredictionsFirst,bydrawingameanandavariancefromtheposteriorsand,second,drawinganobservationfromresultingdistribution
Case1:Timeseriesanalysis
n Identifyoutlierontimeseriesmadeofcountdata
- Compliance:authoritiessaidtothebanks“ifyouhavetheabilitytodetectweirdpatternsinyourcustomerdataandraisealert,pleasedoit”
- Onepatternthatcanbefoundeasilyiswhenanumberofaggregated
transactionsbetweentwoentitiesisnot“Normal”
- Thenacustomercanidentifyarootcauseoranoncomplianceissue
• Stronglinkwithstatisticalprocesscontrol(SPC)
5 10 15 20
1500
025
000
3500
0
Monthly aggregates
months
# tra
nsac
tions
Timeseriesanalysis
n Identifyoutlierontimeseriesmadeofcountdata
- Poissonornegativebinomialregression
- Deriveapredictionintervalforthenextnumberofaggregatedtransactions
withlargecoverage(e.g.99%)
- Ifapointisoutside,itmeansitdoesnotbelongtothesamepopulation(it
wouldoccuronlyin1-99%ofthecaseifthetimeserieswasbehaving
normally)
5 10 15 20
1500
025
000
3500
0
Monthly aggregates
months
# tra
nsac
tions
5 10 15 20
1500
025
000
3500
0
Monthly aggregates
months
# tra
nsac
tions
Timeseriesanalysis
n SomeexampleofRcode(Normalapproximation,noARstructure)
library(MASS)
mod<- glm.nb(formula=y~month,data=datas)#Log-linkisimplicit
pred <- predict(mod,data.frame(month=1:(nrow(datas)+1)),type="link",se.fit =TRUE)
X=model.matrix(mod)
xprimex_1=solve(t(X)%*%X)
Xpred =rbind(X,t(c(1,24)))
S=sqrt(sum(mod$residuals^2)/(ndata-2))
Root=sqrt(1+diag(Xpred%*%xprimex_1%*%t(Xpred)))
meanpred =exp(pred$fit)
PIpredLL =exp(pred$fit - qt(0.975,df=ndata-2)*S*Root)
PIpredUU =exp(pred$fit +qt(0.975,df=ndata-2)*S*Root)
(Here,95%bilateral
quantiles)
ExamplewithStancode(samemodel)
model<- "
data{
int<lower=1>N; //rowsofdata
vector[N]x; //predictor
int<lower=0>y[N];//response
}
parameters {
real<lower=0>phi;//neg.binomialdispersionparameter
realb0; //intercept
realb1; //slope
}
model {
//priors:
phi~cauchy(0,20);
b0~normal(0,20);
b1~normal(0,20);
//datamodel:
y~neg_binomial_2_log(b0+b1*x,phi);
}
"
stanmod <- stan(model_code =model,
data=list(N=nrow(datas),x=datas$month,y=datas$y),
iter=10000,chains=4)
chains<- extract(stanmod,pars=c("b0","b1","phi"))
x=1:24
N=length(x)
simul =matrix(0,ncol=length(x),nrow=length(chains[[1]]))
for(i inseq_along(x)){
#drawfromthepredictiveateverymonths
predictive[,i]=rnegbin(length(chains$b0),
mu=exp(chains$bo +chains$b1*(x[i])),
theta=chains$phi)
}
PIPpred =apply(predictive[i,2,function(l){
quantile(l,probs =c((1-0.95)/2,(1+0.95)/2)))
}
02
46
810
12
datas$month
datas$TR
N_SEN
T
201301 201308 201403 201410 201505 201512
Timeseriesanalysis
n Whya“non-Bayesian”version?
- Onmillionsofsubsetsofdata,runninganMCMCsamplercanbehopeless
(dependingonthecustomerarchitecture)
- Generally,itisinterestingtoverifyifananalyticalsolutioncanbeidentified
- Unfortunately,thisisnotthecasefornegativebinomialmodel
- Asshown,A“Normal”approximationcanbedeveloped,andcoverages
verifiedinvarioussimulations
• ButitisnotedthatarealBayesianpredictionismorepowerful,especially
withsmallcounts
Green:Bayesianinterval(inStan)
Red:Normalapprox.(Rcode)
Timeseriesanalysis
n Sofar,theproposedsolutionanswersyes/no
n Customersmayhavethousandsofalerts,allneedstobeverified
n Howtoimprovethesolution,byrankingthesealerts
- Provideaposteriorprobabilitythatthedatapointisout-of-trend(OOT)
n ExamplefollowingStanimplementation
>ecdf(predictive[,24])(35000) #24isthelasttimepoint(notincludedinthemodel)
[1]0.98915#probabilitytobe OOT
n Instead of reportingyes/no,it is better toreportP(OOT)
Timeseriesanalysis
n Difficultiesspecifictobigdata
- Customerhardwareandsoftwarepoorlyadapted
- Structureofthedata
Dataarestructured… butstill,plentyofproblems,leadingtoadditionaldata
consolidations
- Findingrobustsimplemodelsandhandlingerrorcase
E.g.insomecases,onemodelwillnotconvergeandRwouldcrashiftheerror
isnothandled(try….catchmechanism)
n Taketimetobuildappropriatemodels,onwhichyoumake
predictiveinference!!
- Ifthemodelsarenotgood,inferenceisquestionable… (sensitivity↘↘)
- Easyinsmalldata… Closetoimpossibleinbigdata
- Howtodeveloprobustmodelswithouthavingseen(all) thedata?
Case2:Accelerometerdata
n Pre-clinicaldataaregatheredonanimaltoobtainanearlyideaof
drugefficacy
- 32or64mice
- Followedonlineduring3to6weeks
- Videos(likeparkingsecurityvideos)
- 3-dimensionalaccelerometerdata(adeviceisattachedontheirback)
- Samplingfrequency~100hz
n Miceareepileptic-induced,andthedifferenttreatmentgroups
(control,placebo,dose1,dose2,etc.)shouldseedifferent
(significant?)numbersofcrisis
n Problemisthatitisnotpossibletoanalyzeeveryvideo
- Instead,usethesignalfromaccelerometerstoautomaticallydetectseizures
Accelerometerdata
n Thegoalistodetectepilepticseizuresfromaccelerometersignals
n Butsometimes,micearescratching,running,dancing…
Accelerometerdata
n Afirstalgorithmhasbeendevelopedtobeverysensitive(HMM
modelonsimplefeatures)
- Findsignalpatterns(“nomovement– movement– nomovement”)
- Canberun“online”duringexperiment
- ~100%detection
n Butthefalsepositiveratewas~97%
- Highlyskilledscientistsneedtoconfirmseizuresvisually(takesabout20
sec./suspicion)
- Totalnumberoffoundseizureisabout15k!!
- About100h+ofvideoanalysis(don’tdothat),forasmallexperiment
- …thisisnotaccountingforcoffeebreak… suchrateisnotacceptable
n Let’scall‘suspicions’thedetectedseizuresofar
Accelerometerdata
n Toimprove:
- Addafilterontopofthesuspicion,basedonmorespecificfeatures
- Asnobodyknowswhatisagoodfeature
1) Extractmanyfeatures,asorthogonalaspossible(frequencies,amplitude,
rollingSD)
2) Trainasupportvectormachine(theskilledscientistsalreadydidthe
100h+hoursanalysisonsomeexperiments,sowehavetrainingdata…)
3) TunetheSVM(mainly,weightedSVMduetoclasshighlyunbalanced)
4) Verify/optimizeusingstratifiedcross-validation
predicted
actual 0 1 total
0 178998 24102 203100
1 103 5304 5407
total 179101 29406 208507
missedseizure = 2%reductioninworkingtime = 86%
Accelerometerdata
n Implementation
- NoSVMavailablein“bigdata”Rpackagessuchassparklyr orrsparkling
- Sad,butnotsuchanissueasweworkonlyonthepredefinedsuspicions
Onlyabout15kchunksaccelerometerdata…
- Suspicionaccelerometerdatacanbeaccessedonebyone
• Featurescanbeextracted
• Thissummaryiseasilyhandledbyonemachine
• Stillembarrassinglyparallel…
n ”Bayesian”SVMnotveryavailable
- Answerwillremainssimple“yes/no”decision
- Norankingofthesuspicionsgivenhowlikelytheywouldbearealseizure
- PleaseimplementitinBoomSpikeSlabJ
Accelerometerdata
n Thestatisticalquestion
- Thereis,andwillalwaysbe,atrade-offbetweensensitivityandfalse
positiverate
n Whatistheimpactofmissingafewseizures
- Clinicalrelevance
n Simulationstudy
Conclusions
n Bayesianstatisticsallowprovidingbetteranswerthroughtheuseof
thecompletepredictivedistribution
- Useitwhenfeasible
- Approximationisnotacrime
n Makingmillionsofmodelsonsmalldataishard
- Taketimetoadjustyourmodel
- Inferenceotherwiseisquestionable
n Besuretoanswertheveryquestionofthecustomer/scientist
- OptimizingspecificityandsensitivityorRMSECVisnotsufficient
- Theresultsaregenerallyusedbyothers(wearejustasmallpieceinabig
process)
Lookattheimpactoferrorsinthefinaloutcome,insteadoftakingextratime
tocontinuetryingtooptimizeyourclassifier