SFselect-E: Ensemble classification techniques for detecting signatures of natural selection from site frequency spectrum

Andrew Stewart, JHU, Spring 2014




Introduction
● Searching for signatures of selection
● SFselect (Ronen, 2013)
● Multi-K (Whiteman, 2010)
● Introducing: SFselect-E


Contents
1) The selection classification problem
2) Overview of SVM classification with SFselect
3) Ensemble preprocessing with Multi-*
4) Generating model variance
5) Introducing SFselect-E
6) Experimental Results
7) Conclusion


Natural selection
● Population genetics
● Evolution: descent with modification
● Selection
  o Directional: positive, negative
  o Neutral


Classifying natural selection
● Record of demographic history
● Increased linkage disequilibrium (LD), reduced variation
● Site frequency spectrum
  o e.g., Tajima’s D
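As a concrete instance of an SFS-based statistic, the sketch below computes Tajima's D from an unfolded site frequency spectrum using the standard formula (Tajima, 1989). The function name `tajimas_d` and the input layout (entry `i - 1` holds the number of sites whose derived allele appears in `i` of `n` sampled chromosomes) are assumptions made for this illustration and are not taken from SFselect.

```python
import math


def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS.

    sfs[i - 1] = number of segregating sites with derived allele count i,
    for i = 1 .. n - 1, where n is the number of sampled chromosomes.
    (Input layout is an assumption for this sketch.)
    """
    n = len(sfs) + 1
    if n < 4:
        raise ValueError("need at least 4 sampled chromosomes")
    s = sum(sfs)  # number of segregating sites
    if s == 0:
        raise ValueError("D is undefined without segregating sites")

    # Mean pairwise differences (pi), computed from the spectrum.
    pi = sum(x * 2 * i * (n - i) for i, x in enumerate(sfs, start=1)) / (n * (n - 1))

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between the two estimators of theta, scaled by its std. dev.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

A spectrum proportional to 1/i (the neutral expectation) gives D close to zero, while an excess of rare variants, as expected after a sweep, gives a negative D.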


Background: SFselect (Ronen, 2013)
● Scaled Site Frequency Spectrum
● Linear kernel Support Vector Machines
● Trained on extensive population simulations
  o SFselect, SFselect-s, SFselect-XP


Background: Multi-K Clustering
● Bootstrap aggregation
  o Random sampling
  o Aggregation method
  o Highly accurate, but computationally expensive
● Multi-K
  o Iterative K-means clustering
  o Classify new points based on centroid proximity
  o Optimize Kend with cross validation
● Multi-KX, Multi-SVD


Generating ensemble diversity
● Generating ensemble diversity
  o Generalizers
  o Specializers
● Applied to SFS classification:
  o Improve overall classification accuracy?
  o Produce classifiers robust to wide variations in genetic diversity


SFselect-E
● SFselect General SVM
● SFselect-E: Bagging approach
● SFselect-E: Multi-K approach


Population simulations
● 1000 individuals
● s = [0.005, 0.01, 0.02, 0.04, 0.08]
● t = [0, 50, 150, 200, …, 3500, 4000]
● n = 500
● labels = [-1, 1] (neutral, selected)


Training the standard model
● Compute allele frequencies
● Scale, normalize, bin into vectors
● Train a linear kernel SVM on the entire dataset
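The first two steps above can be sketched as a small feature-extraction function. This is an illustration under stated assumptions, not the SFselect code: it assumes "scaled" means weighting frequency class i by i (so the neutral expectation is roughly flat), that bins are equal-width over the frequency classes, and that normalization means dividing by the vector sum. The name `scaled_sfs_vector` and its parameters are hypothetical. The SVM training step is not shown.

```python
def scaled_sfs_vector(derived_counts, n, n_bins):
    """Turn per-site derived allele counts into a binned, scaled SFS vector.

    derived_counts: derived allele count at each site (0 .. n)
    n:              number of sampled chromosomes
    n_bins:         length of the output feature vector (assumed parameter)
    """
    # 1) Site frequency spectrum over segregating sites only (0 < count < n).
    sfs = [0.0] * (n - 1)
    for c in derived_counts:
        if 0 < c < n:
            sfs[c - 1] += 1

    # 2) Scale: weight frequency class i by i (assumed scaling).
    scaled = [(i + 1) * x for i, x in enumerate(sfs)]

    # 3) Bin: equal-width bins over the n - 1 frequency classes.
    bins = [0.0] * n_bins
    for i, v in enumerate(scaled):
        b = min(i * n_bins // (n - 1), n_bins - 1)
        bins[b] += v

    # 4) Normalize to unit sum so regions with different numbers of
    #    segregating sites are comparable (assumed normalization).
    total = sum(bins)
    return [v / total for v in bins] if total > 0 else bins
```

Each simulated population would yield one such vector, paired with its label (-1 for neutral, 1 for selected), as input to the classifier.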


Computational limits
● Very time intensive
  o Population simulations
  o Vectorization of SFS
  o Training SVMs on SFS
● Simulations grouped/indexed by replicate
  o Proved a major limitation on ensemble sampling


SFselect-E: Bagging approach
● Random sampling
  o k = 100, n = 200
● Aggregation
  o Majority voting
● Validation
  o Cross validation
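The sampling and aggregation steps can be sketched as follows, reading k = 100 as the number of component models and n = 200 as the size of each random sample. To keep the example self-contained, a nearest-centroid classifier stands in for the linear kernel SVM used in SFselect; sampling with replacement and breaking tied votes toward neutral are also assumptions. All function names here are hypothetical.

```python
import random


def train_centroid(X, y):
    """Stand-in base learner (NOT an SVM): one mean vector per class."""
    centroids = {}
    for label in (-1, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        if rows:
            centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, model[lab])))


def train_bagged(X, y, k=100, n=200, seed=0):
    """Train k models, each on n examples drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        idx = [rng.randrange(len(X)) for _ in range(n)]
        models.append(train_centroid([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagged(models, x):
    """Majority vote over labels in {-1, 1}; a tie falls back to -1 (neutral)."""
    votes = sum(predict_centroid(m, x) for m in models)
    return 1 if votes > 0 else -1
```

Swapping `train_centroid` and `predict_centroid` for an SVM trainer and its prediction function would leave the sampling and voting logic unchanged.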


SFselect-E: Multi-K approach
● Iterative K-means clustering of D
● Kstart = 2 : Kend = 8
● Train on each K
● Cross validation to determine optimal Kend
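The clustering step can be sketched with a plain K-means run once for each K from Kstart to Kend. The slides do not say how the per-cluster models are trained or combined, so this sketch covers only the partitioning of D and the routing of a new point to its nearest centroid. The function names, the random initialization, and the iteration cap are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest(centroids, x):
    """Index of the centroid closest to x (used to route new points)."""
    return min(range(len(centroids)), key=lambda j: sq_dist(centroids[j], x))


def k_means(X, k, iters=50, seed=0):
    """Basic Lloyd's algorithm; initial centroids are k random data points."""
    rng = random.Random(seed)
    centroids = [list(x) for x in rng.sample(X, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in X:
            groups[nearest(centroids, x)].append(x)
        # An empty cluster keeps its previous centroid.
        new = [[sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
               for j, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    return centroids


def multi_k_partitions(X, k_start=2, k_end=8):
    """Cluster D once per K in [k_start, k_end]; returns {K: centroids}."""
    return {k: k_means(X, k) for k in range(k_start, k_end + 1)}
```

One plausible use is to train a component classifier on the members of each cluster and let a new SFS vector be classified by the model whose centroid `nearest` selects, with Kend then chosen by cross validation.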


Experimental analysis: K-fold C.V.
How to cross validate an ensemble?

For each fold Ki, hold out Ki and train on D - Ki

Test classifier on Ki

Report mean accuracy (proportion of correct classifications)
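The procedure above can be sketched as a generic routine that takes any training function and prediction function, so that an ensemble is cross validated by passing in its trainer as a whole. The one-feature threshold learner is included only to make the example runnable; it and every name in the block are hypothetical.

```python
import random


def k_fold_accuracy(X, y, train_fn, predict_fn, k=5, seed=0):
    """Mean held-out accuracy over k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering D
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in idx if i not in held]  # D minus the held-out fold
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict_fn(model, X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


def train_threshold(X, y):
    """Toy learner: midpoint between the class means of feature 0."""
    pos = [x[0] for x, lab in zip(X, y) if lab == 1]
    neg = [x[0] for x, lab in zip(X, y) if lab == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    """Label 1 above the threshold, -1 otherwise."""
    return 1 if x[0] > threshold else -1
```

Because every component model is rebuilt from the training folds only, no held-out example can leak into the ensemble that is later tested on it.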


Experimental analysis: C.V. Results

Model                     Accuracy
Standard SFselect SVM     74.28
Bagged SFselect-E SVM     73.86
Multi-K SFselect-E SVM    NA


Experimental analysis: Time series
● For t = [0, 4000], test Dt
  o Neutral vs Selected
  o Dependent T-Test on time sample accuracies

p-value of 2.0136 × 10^-24
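A dependent (paired) t-test compares two accuracy series measured at the same time points, for example one series per model. The sketch below computes only the t statistic and its degrees of freedom; turning these into a p-value needs the Student t distribution, which the Python standard library does not provide. The function name and the pairing of accuracies by time point are assumptions about how the test was set up.

```python
import math


def paired_t_statistic(a, b):
    """Paired t statistic for two equal-length series of accuracies.

    Returns (t, degrees_of_freedom). a[i] and b[i] must be measured
    at the same time point.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two series of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean / math.sqrt(var / n), n - 1
```

A large absolute t with many time points corresponds to a very small p-value, meaning the paired differences are unlikely to be centered on zero.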


Conclusions
● SFselect-E consistent with SFselect
  o No separation of specialized classifiers
  o Smaller subsets?
● Limitations of structure of training data as implemented in SFselect
● Model variance best obtained by separating by s, t


Conclusions
● Computing time for training a major obstacle
● Multi-SVD preprocessing could reduce training time
● Refactoring required first


Future work
● Refactor to treat populations independently
● Bagging: random sampling across s, t
● Multi-K: hierarchical clustering of training data
● Multi-KX, Multi-SVD
● SFselect-s as component models


Future work
● Cross population: SFselect-XP, XP-SFS
● Cross species: SFS + conserved regions
● XS-SFS
● Tune ensemble diversity to population genetic diversity