+Analytics for Crowd Sourced Data: Case Studies

Bimal Roy, Indian Statistical Institute

Analytics - Indian Statistical Instituteigwoa/Analytics_for_Crowd...Work Plan Estimation of tiger population using pugmark data Collection of Training Data ! Get replicated measurements



+Crowd Sourced Data

Data generated through voluntary participation in a study, rather than obtained from a suitably designed scientific study.

  • Data DOES NOT follow a pre-decided model

  • Data NOT obtained from a Designed Experiment

Example : Mail-in questionnaire

“How much do you spend on film tickets every month?”

+Case Study I

Estimation of tiger population using pugmark data

  • Data: tiger pugmarks across the Sunderbans
  • Pugmark features:
    • Distance between centroids
    • Areas of triangles, toes, pad
    • Ratios of axes, aspect ratios
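The feature families named above can be illustrated with a small sketch. It assumes each toe and the pad have been digitised as a simple polygon of (x, y) points. That representation, the helper names and the bounding-box definition of aspect ratio are assumptions made for illustration, not the measurement protocol used in the study.

```python
import numpy as np

def polygon_area_centroid(pts):
    """Shoelace area and centroid of a simple polygon given as an (n, 2) array."""
    x, y = pts[:, 0], pts[:, 1]
    xn, yn = np.roll(x, -1), np.roll(y, -1)
    cross = x * yn - xn * y
    signed_area = cross.sum() / 2.0
    cx = ((x + xn) * cross).sum() / (6.0 * signed_area)
    cy = ((y + yn) * cross).sum() / (6.0 * signed_area)
    return abs(signed_area), np.array([cx, cy])

def triangle_area(a, b, c):
    """Area of the triangle spanned by three points (e.g. three centroids)."""
    return 0.5 * abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))

def aspect_ratio(pts):
    """Longer over shorter side of the axis-aligned bounding box (assumed definition)."""
    w, h = pts.max(axis=0) - pts.min(axis=0)
    return max(w, h) / min(w, h)

# Hypothetical outlines: one pad and two toes, in arbitrary units.
pad = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]])
toe1 = np.array([[0.0, 4.0], [1.0, 4.0], [1.0, 5.0], [0.0, 5.0]])
toe2 = np.array([[3.0, 4.0], [4.0, 4.0], [4.0, 5.0], [3.0, 5.0]])

pad_area, pad_c = polygon_area_centroid(pad)
_, toe1_c = polygon_area_centroid(toe1)
_, toe2_c = polygon_area_centroid(toe2)

toe_distance = np.linalg.norm(toe1_c - toe2_c)   # distance between centroids
tri_area = triangle_area(pad_c, toe1_c, toe2_c)  # area of a centroid triangle
pad_aspect = aspect_ratio(pad)                   # aspect ratio of the pad
```

Each pugmark then becomes one row of such numbers, which is the form the later screening and clustering steps work on.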

+Presumptions

Pugmark features for different tigers should be different in some respect.

  • In what respect, and in which features?

  • How large is this difference?

Two pugmarks collected from distant locations at about the same time should belong to different tigers.

  • How far is far enough?

  • How can the timestamps be estimated?

+Issues to consider

  • Between-tiger variation clearly exceeds within-tiger variation

  • Pugmarks made by the same tiger at the same time cannot be a large distance apart

  • Pugmark features may differ depending on the soil type

+Work Plan

Collection of Training Data
  • Get replicated measurements of features from a single trail to estimate the within-tiger variation.
  • Repeat for different trails, far apart, to estimate the between-tiger variation.

Clustering Strategy
  • Use study on movement patterns
  • Identify useful variables / features
  • Develop a clustering strategy
  • Validate using simulation and test data

+Training Data

Issues
  • Impossible to choose a large number of trail sites while ensuring they are far enough apart
  • Trails are generally short; not enough good replicates for within variation

Final harvest
  • 25 hind pugmarks from 8 distinct trails
  • 39 more hind pugmarks from 13 trails
  • 12 pugmarks from census 2004 data

+Assumed Model

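Initial model for pugmark k

$$y_{ijkl} = \mu_j + \alpha_i + \gamma_{il} + \varepsilon_{ijkl}, \qquad E(\varepsilon_{ijkl}) = 0, \quad D(\varepsilon_{ijkl}) = \Sigma_j$$

with $j = 1, 2, 3$ (soil type), $i = 1, \ldots, n_j$ (trail), $l = 1, 2$ (L/R) and $k = 1, \ldots, n_{ijl}$, where $n_j$ = number of trails from soil type $j$ and $n_{ijl}$ = number of type $l$ pugmarks from trail $i$ in soil type $j$.

Final model (after feature elimination)

$$y_{ik} = \alpha_i + \varepsilon_{ik}, \qquad E(\varepsilon_{ik}) = 0, \quad D(\varepsilon_{ik}) = \Sigma$$

with $i = 1, \ldots, I$ and $k = 1, \ldots, n_i$.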

+Screening of Features

Preliminary screening
  • 7 features discarded: significant difference between L/R marks
  • 4 features discarded: significant variance across soil types
  • 2 features discarded: insignificant ‘tiger effect’

Screening for clustering
  • Select best linear combinations of features based on between and within dispersion matrices
  • Choose the number of combinations by cross-validation error rate

8 principal components selected from top 19 variables
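One common way to pick linear combinations from between and within dispersion matrices is to take the leading eigenvectors of W⁻¹B (canonical discriminant directions). The sketch below assumes that reading. The slides do not spell out the exact criterion, so the function and its toy matrices are illustrative only, and the cross-validation step that would choose `n_components` is not shown.

```python
import numpy as np

def discriminant_directions(W, B, n_components):
    """Leading eigenpairs of W^{-1} B, sorted by decreasing eigenvalue.

    Each eigenvalue is the between-to-within variance ratio along its direction.
    """
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    evals, evecs = evals.real, evecs.real
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], evecs[:, order]

# Toy dispersion matrices: feature 1 separates groups far better than feature 2.
W = np.array([[2.0, 0.0], [0.0, 1.0]])
B = np.array([[8.0, 0.0], [0.0, 1.0]])

ratios, directions = discriminant_directions(W, B, n_components=1)
# Projecting the feature matrix X onto `directions` (X @ directions)
# gives the reduced variables used for clustering.
```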

+Clustering Strategy

Hierarchical or K-means clustering can be done, but neither gives any clue about the number of clusters.

Fresh idea for estimation of K
  • Likelihood is maximized by singleton clusters
  • Need to put a penalty on too many clusters
  • Departure from the known pattern of between-tiger dispersion (observed in training data) can be used as the penalty
  • Include in the likelihood a distribution for the between variation

+Estimation of K

  • Assume $\Sigma = I$ and $B$ is known
  • Form the normal likelihood
  • Integrate over the cluster means $\alpha_i$
  • Maximize over $\alpha$
  • Maximize over different cluster partitions $P_1, \ldots, P_K$
  • Maximize over K
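The steps above can be sketched for a fixed partition as follows. The sketch assumes cluster means are drawn as α_i ~ N(α, B) with within-cluster covariance Σ = I, so the stacked observations of a cluster of size n have covariance I + J_n ⊗ B once α_i is integrated out. As a simplification, the overall mean α is replaced by the grand mean rather than maximized over, and no search over partitions is attempted. All names and the toy data are hypothetical.

```python
import numpy as np

def cluster_log_marginal(Yc, alpha, B):
    """Log marginal density of one cluster after integrating out its mean."""
    n, p = Yc.shape
    # Stacked (point-major) covariance: identity plus a B block for every pair of points.
    cov = np.eye(n * p) + np.kron(np.ones((n, n)), B)
    r = (Yc - alpha).reshape(-1)
    _, logdet = np.linalg.slogdet(cov)
    quad = r @ np.linalg.solve(cov, r)
    return -0.5 * (n * p * np.log(2.0 * np.pi) + logdet + quad)

def partition_log_marginal(Y, labels, B, alpha=None):
    """Sum of cluster log marginals for the partition encoded by `labels`."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    if alpha is None:
        alpha = Y.mean(axis=0)  # plug-in assumption, not a full maximization
    return sum(cluster_log_marginal(Y[labels == c], alpha, B)
               for c in np.unique(labels))

# Toy data: two tight groups of six points around (-5, 0) and (5, 0).
offsets = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0],
                    [0.0, 1.0], [0.5, 0.5], [-0.5, -0.5]])
Y = np.vstack([offsets + [-5.0, 0.0], offsets + [5.0, 0.0]])
B = 25.0 * np.eye(2)

ll_two = partition_log_marginal(Y, [0] * 6 + [1] * 6, B)   # true grouping
ll_singletons = partition_log_marginal(Y, np.arange(12), B)
ll_one = partition_log_marginal(Y, np.zeros(12, dtype=int), B)
```

With the between-cluster distribution included, the all-singletons partition no longer attains the largest likelihood in this toy case; the two-group partition does.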

+Shape of Likelihood

+Tigers missed in Sample

  • Assume SRSWR sampling from a large population of pugmarks
  • Assume multinomial selection of individual tigers

If the true number of tigers is K, the multinomial model gives the probability that a pugmark sample of size N will include exactly k tigers.

Calculate the expected count of tigers included in the sample, and scale up accordingly.
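As an illustration of this step, the sketch below assumes that every tiger is equally likely to have produced each sampled pugmark. That is the simplest multinomial case and an assumption made here, since the slides do not state the cell probabilities. Under it, the number of distinct tigers in the sample follows the classical occupancy distribution.

```python
from math import comb

def prob_k_distinct(K, N, k):
    """P(exactly k of K equally likely tigers appear among N pugmarks)."""
    if k < 0 or k > min(K, N):
        return 0.0
    # Inclusion-exclusion count of ways N draws can cover a given set of k tigers.
    onto = sum((-1) ** j * comb(k, j) * (k - j) ** N for j in range(k + 1))
    return comb(K, k) * onto / K ** N

def expected_distinct(K, N):
    """Expected number of distinct tigers represented in the sample."""
    return K * (1.0 - (1.0 - 1.0 / K) ** N)

# Hypothetical numbers: 10 tigers, 15 pugmarks.
K, N = 10, 15
dist = [prob_k_distinct(K, N, k) for k in range(K + 1)]
mean_from_dist = sum(k * p for k, p in enumerate(dist))
```

The ratio `K / expected_distinct(K, N)` is one way to express how far an observed cluster count would need to be scaled up under this assumption.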

+Final Results

  • Official data in 2004: estimated count 249 (Forest Ministry, WB)

  • Result for Census 2004: estimated count 75, 95% interval (46, 106)

+Case Study II

Software reliability from crowd-sourced testing

Can the process of Software Testing be Crowd Sourced?

Yes, if every user of the software can voluntarily report a defect upon discovering it.

What data would we have?
  • Reported software defects arising from uncontrolled usage
  • Usage rate of the software is a function of time, and unknown
  • Distribution of defect type depends on unknown use cases

+Challenges

How can we compare the reliability of two versions of a software product using data from their Bug Databases?

What kind of software reliability metrics are meaningful?

Minor defects

Major defects

+New reliability metrics

Mean Number of Defects to Failure (MNDF)
  • Expected number of observations of all other types of defects before a defect of a particular type is observed

R(N)
  • Probability of observing no defect of a particular type in the next N defects to be discovered

Properties
  • Neither of the two depends on the unknown usage of the software
  • Both may be used to compare the reliability of two software products with respect to a specific type of defect (e.g., security)
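To make the two definitions concrete, here is a deliberately simplified sketch. It assumes that each newly discovered defect is of the type of interest independently with a fixed probability `p`. That is much cruder than the model in the slides, where this probability may change as defects accumulate, but it shows why neither metric involves the usage rate: both are defined on the sequence of discovered defects, not on calendar time.

```python
def mndf(p):
    """Mean number of other-type defects seen before the first defect of the target type.

    Under the constant-probability assumption this is the mean of a geometric
    distribution counting failures before the first success.
    """
    return (1.0 - p) / p

def reliability(p, N):
    """R(N): probability that none of the next N discovered defects is of the target type."""
    return (1.0 - p) ** N

# Hypothetical comparison of two products with respect to security defects.
p_a, p_b = 0.20, 0.05
mndf_a, mndf_b = mndf(p_a), mndf(p_b)
r_a, r_b = reliability(p_a, 3), reliability(p_b, 3)
```

A larger MNDF or a larger R(N) indicates that defects of the target type turn up less often among the defects discovered.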

+Assumptions

Data: $(T_1, N(T_1)), \ldots, (T_n, N(T_n))$, where $N(t) = [N_0(t), \ldots, N_m(t)]$ denotes the multivariate counts of different types of defects.
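A bug database export can be turned into this form with a few lines of code. The sketch assumes each report is a `(time, defect_type)` pair with defect types coded as integers `0..m`. Both the input layout and the function name are assumptions made for illustration.

```python
def cumulative_counts(reports, n_types):
    """Return [(T_j, N(T_j))] where N(T_j) is the tuple of per-type counts up to T_j."""
    counts = [0] * n_types
    sequence = []
    for t, defect_type in sorted(reports):  # process reports in time order
        counts[defect_type] += 1
        sequence.append((t, tuple(counts)))
    return sequence

# Hypothetical reports: times in days since release, two defect types.
reports = [(2.0, 1), (0.5, 0), (3.1, 1)]
data = cumulative_counts(reports, n_types=2)
```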

+Possible Models

Inspired by Jelinski and Moranda
  • Every additional discovery of a defect causes a linear decrease in the propensity of discovering the next.

Inspired by Logarithmic Poisson Model
  • Considers dependence between the discovery of different types of defects

Independent Logarithmic Poisson Model
  • Discovery processes for the different types of defects are independent of each other
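For orientation, the classical single-type versions of the two models that inspired these variants can be written down in a few lines. These are the textbook forms (Jelinski-Moranda hazard and Musa-Okumoto logarithmic Poisson intensity), not the multi-type models of the slides. The parameter names are assumptions.

```python
import math

def jm_hazard(phi, total_defects, found):
    """Jelinski-Moranda: discovery hazard is proportional to the defects remaining,
    so it drops by phi with every discovery."""
    return phi * (total_defects - found)

def log_poisson_mean(t, lam0, theta):
    """Musa-Okumoto: expected number of discoveries by time t."""
    return math.log(1.0 + lam0 * theta * t) / theta

def log_poisson_intensity(t, lam0, theta):
    """Musa-Okumoto: intensity decays exponentially in the expected number found."""
    return lam0 * math.exp(-theta * log_poisson_mean(t, lam0, theta))

# Linear decrease under Jelinski-Moranda.
step_drop = jm_hazard(0.1, 10, 3) - jm_hazard(0.1, 10, 4)

# Intensity at two time points under the logarithmic Poisson model.
start_intensity = log_poisson_intensity(0.0, lam0=2.0, theta=0.5)
later_intensity = log_poisson_intensity(4.0, lam0=2.0, theta=0.5)
```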

+Likelihood Equations

The likelihood of the data sequence is proportional to [expression missing in source]

Under proportionality of the independent Logarithmic Poisson Model, [expression missing in source]

Hence, the maximization of the partial likelihood can be performed by estimating a multinomial regression model between $Z_j$ and $N(T_j-)$.
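The multinomial-regression view can be sketched as follows. The code assumes that the probability of the j-th discovered defect being of type i, given the counts just before its discovery, has the multinomial-logit form exp(a_i − θ_i N_i) / Σ_l exp(a_l − θ_l N_l). This specific form is an assumption chosen to match the logarithmic Poisson idea, since the slides' own expressions are not available. Only the order of defect types enters, so the unknown usage rate plays no role.

```python
import numpy as np

def partial_log_likelihood(a, theta, types):
    """Partial log-likelihood of an ordered sequence of defect types.

    a, theta : per-type intercepts and decay parameters (hypothetical names)
    types    : defect type Z_j of each discovery, in order of discovery
    """
    a = np.asarray(a, dtype=float)
    theta = np.asarray(theta, dtype=float)
    counts = np.zeros(len(a))          # N(T_j-): counts before the j-th discovery
    ll = 0.0
    for z in types:
        eta = a - theta * counts
        eta = eta - eta.max()          # stabilise the softmax
        ll += eta[z] - np.log(np.exp(eta).sum())
        counts[z] += 1
    return ll

# Hypothetical sequence of five discoveries over three defect types.
types = [0, 2, 1, 0, 0]
ll_null = partial_log_likelihood([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], types)
ll_decay = partial_log_likelihood([0.0, 0.0, 0.0], [0.3, 0.3, 0.3], types)
```

In practice this function would be handed to a numerical optimiser, or the same fit obtained from any multinomial regression routine with the running counts as covariates.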

+Reliability Metrics

  • Mean Number of Defects to Failure
  • Probability of not observing a defect of a specified type in the next N observations

Neither metric depends on the underlying usage rate.

+Python Bug Database


+References

  • V. T. Subrahmaniam, A. Dewanji and B. K. Roy, “A Semi-parametric Reliability Model for Analysis of a Bug-Database with Multiple Defect Types”, Technometrics, 2014

Thank You!