Ensemble classification techniques for detecting signatures of natural selection from site frequency spectrum


  • SFselect-E: Ensemble classification techniques for detecting signatures of natural selection from the site frequency spectrum (Andrew Stewart, JHU, Spring 2014)
  • Introduction Searching for signatures of selection SFselect (Ronen, 2013) Multi-K (Whiteman, 2010) Introducing: SFselect-E
  • Contents 1) The selection classification problem 2) Overview of SVM classification with SFselect 3) Ensemble preprocessing with Multi-* 4) Generating model variance 5) Introducing SFselect-E 6) Experimental Results 7) Conclusion
  • Natural selection Population genetics Evolution: Descent with modification Selection o Directional Positive Negative o Neutral
  • Classifying natural selection Record of demographic history Increased LD, reduced variation Site frequency spectrum o e.g., Tajima's D
  • Background: SFselect (Ronen, 2013) Scaled site frequency spectrum Linear kernel Support Vector Machines Trained on extensive population simulations o SFselect, SFselect-s, SFselect-XP
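The slides show no code, but the core SFselect idea (a linear-kernel SVM trained on scaled SFS vectors) can be sketched as follows, assuming NumPy and scikit-learn. The data here is synthetic and `scaled_sfs` is an illustrative helper, not the SFselect implementation:

```python
import numpy as np
from sklearn.svm import SVC

def scaled_sfs(frequencies, n_bins=10):
    """Bin derived-allele frequencies into a normalized SFS vector."""
    hist, _ = np.histogram(frequencies, bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

rng = np.random.default_rng(0)
# Synthetic stand-in: "neutral" regions get roughly uniform frequencies,
# "selected" regions are skewed toward high-frequency derived alleles.
neutral = [scaled_sfs(rng.uniform(0, 1, 200)) for _ in range(50)]
selected = [scaled_sfs(rng.beta(5, 1, 200)) for _ in range(50)]
X = np.vstack(neutral + selected)
y = np.array([-1] * 50 + [1] * 50)   # -1 = neutral, 1 = selected

svm = SVC(kernel="linear").fit(X, y)
```

On data this cleanly separable the linear decision boundary falls out immediately; the real method's difficulty is in the simulations and feature scaling, not the SVM fit.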
  • Background: Multi-K Clustering Bootstrap aggregation o Random sampling o Aggregation method o Highly accurate, but computationally expensive Multi-K o Iterative K-means clustering o Classify new points based on centroid proximity o Optimize Kend with cross validation Multi-KX, Multi-SVD
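The centroid-proximity step of Multi-K might look like the following toy sketch (an assumption-laden illustration, not Whiteman's code): cluster the training data with K-means, label each centroid by the majority class of its members, then assign new points the label of the nearest centroid.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def fit_centroid_classifier(X, y, k):
    """Cluster X with K-means, then label each centroid by the
    majority class (-1 or 1) of the points assigned to it."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels = np.array([
        int(np.sign(y[km.labels_ == c].sum())) or 1  # break ties toward 1
        for c in range(k)
    ])
    return km.cluster_centers_, labels

def predict(centroids, labels, X_new):
    """Assign each new point the label of its nearest centroid."""
    nearest = cdist(X_new, centroids).argmin(axis=1)
    return labels[nearest]

# Two well-separated synthetic classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)
centroids, labels = fit_centroid_classifier(X, y, k=4)
preds = predict(centroids, labels, X)
```

Iterating this fit over K = Kstart..Kend and scoring each by cross-validation is what the slides describe as optimizing Kend.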
  • Generating ensemble diversity Generating ensemble diversity o Generalizers o Specializers Applied to SFS classification: o Improve overall classification accuracy? o Produce classifiers robust to wide variations in genetic diversity
  • SFselect-E SFselect General SVM SFselect-E: Bagging approach SFselect-E: Multi-K approach
  • Population simulations 1000 individuals s = [0.005, 0.01, 0.02, 0.04, 0.08] t = [0, 50, 150, 200, ..., 3500, 4000] n = 500 labels = [-1, 1] (neutral, selected)
  • Training the standard model Compute allele frequencies Scale, normalize, bin into vectors Trained linear kernel SVM on entire dataset
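The vectorization pipeline on this slide (compute allele frequencies, then scale, normalize, and bin) might look like this in NumPy; the genotype matrix is synthetic and both helpers are illustrative names, not SFselect internals:

```python
import numpy as np

def allele_frequencies(genotypes):
    """Derived-allele frequency at each site, from a 0/1 genotype
    matrix of shape (individuals, sites)."""
    return genotypes.mean(axis=0)

def sfs_vector(freqs, n_bins=10):
    """Bin the frequencies of segregating sites into a fixed-length
    vector and normalize it to sum to 1."""
    seg = freqs[(freqs > 0) & (freqs < 1)]       # segregating sites only
    hist, _ = np.histogram(seg, bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

rng = np.random.default_rng(0)
# Toy genotypes: 1000 individuals x 500 sites, per-site allele frequency
# drawn uniformly (matching the slide's 1000 individuals, n = 500).
G = (rng.random((1000, 500)) < rng.random(500)).astype(int)
v = sfs_vector(allele_frequencies(G))
```

Each simulated population then contributes one such vector, labeled -1 (neutral) or 1 (selected), to the SVM training set.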
  • Computational limits Very time intensive o Population simulations o Vectorization of SFS o Training SVMs on SFS Simulations grouped/indexed by replicate o Proved a major limitation on ensemble sampling
  • SFselect-E: Bagging approach Random sampling o k = 100, n = 200 Aggregation o Majority voting Validation o Cross validation
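A hedged sketch of the bagging scheme above (k bootstrap samples of size n, aggregated by majority vote), assuming scikit-learn; the usage example uses smaller values than the slide's k = 100, n = 200 only to keep the toy fast:

```python
import numpy as np
from sklearn.svm import SVC

def bagged_svms(X, y, k=100, n=200, seed=0):
    """Train k linear SVMs, each on a bootstrap sample of n points."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=n)   # sample with replacement
        models.append(SVC(kernel="linear").fit(X[idx], y[idx]))
    return models

def majority_vote(models, X_new):
    """Aggregate component predictions in {-1, 1} by majority vote."""
    votes = np.stack([m.predict(X_new) for m in models])
    return np.sign(votes.sum(axis=0))

# Synthetic two-class data; an odd k avoids tied votes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)
models = bagged_svms(X, y, k=25, n=100)
preds = majority_vote(models, X)
```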
  • SFselect-E: Multi-K approach Iterative K-means clustering of D Kstart = 2 : Kend = 8 Train on each K Cross validation to determine optimal Kend
  • Experimental analysis: K-fold C.V. How do we cross-validate an ensemble? For each fold Ki, hold out Ki and train on D - Ki Test the classifier on Ki Report mean accuracy (proportion of correct classifications)
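The hold-out scheme on this slide is standard K-fold cross-validation; as a sketch with scikit-learn's `KFold` (toy data, illustrative only; the ensemble variants just swap in a different model inside the loop):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def kfold_accuracy(X, y, k=5, seed=0):
    """Mean held-out accuracy: for each fold Ki, train on D - Ki,
    test on Ki, then average the fold accuracies."""
    accs = []
    folds = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        model = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
        accs.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(accs))

# Synthetic two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 5)), rng.normal(2, 1, (60, 5))])
y = np.array([-1] * 60 + [1] * 60)
acc = kfold_accuracy(X, y, k=5)
```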
  • Experimental analysis: C.V. results Model accuracy (%): Standard SFselect SVM: 74.28 Bagged SFselect-E SVM: 73.86 Multi-K SFselect-E SVM: N/A
  • Experimental analysis: Time series For t = [0, 4000], test Dt o Neutral vs. selected o Dependent (paired) t-test on time-sample accuracies p-value of 2.0136 × 10^-24
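The dependent t-test pairs the two classifiers' accuracies at each time point t. A sketch with `scipy.stats.ttest_rel` on synthetic per-time-point accuracies (these numbers are made up for illustration and are not the talk's results):

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Hypothetical accuracies of two classifiers at 40 shared time points.
acc_standard = rng.normal(0.74, 0.02, 40)
acc_ensemble = acc_standard + rng.normal(0.01, 0.005, 40)  # small paired shift

# Paired test: each time point contributes one (standard, ensemble) pair.
stat, p = ttest_rel(acc_standard, acc_ensemble)
```

Because the test is paired, even a small but consistent per-time-point difference yields a very small p-value, which is how a p on the order of 10^-24 can coexist with near-identical mean accuracies.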
  • Conclusions SFselect-E consistent with SFselect o No separation of specialized classifiers o Smaller subsets? Limitations of structure of training data as implemented in SFselect Model variance best obtained by separating by s, t.
  • Conclusions Computing time for training a major obstacle Multi-SVD preprocessing could reduce training time Refactoring required first
  • Future work Refactor to treat populations independently Bagging: random sampling across s, t Multi-K: hierarchical clustering of training data Multi-KX, Multi-SVD SFselect-s as component models
  • Future work Cross population: SFselect-XP, XP-SFS Cross species: SFS + conserved regions XS-SFS Tune ensemble diversity to population genetic diversity