Ensemble classification techniques for detecting signatures of natural selection from site frequency spectrum


  • SFselect-E: Ensemble classification techniques for detecting signatures of natural selection from the site frequency spectrum (Andrew Stewart, JHU, Spring 2014)
  • Introduction Searching for signatures of selection SFselect (Ronen, 2013) Multi-K (Whiteman, 2010) Introducing: SFselect-E
  • Contents 1) The selection classification problem 2) Overview of SVM classification with SFselect 3) Ensemble preprocessing with Multi-* 4) Generating model variance 5) Introducing SFselect-E 6) Experimental Results 7) Conclusion
  • Natural selection Population genetics Evolution: Descent with modification Selection o Directional Positive Negative o Neutral
  • Classifying natural selection Record of demographic history Increased LD, reduced variation Site frequency spectrum o e.g., Tajima's D
  • Background: SFselect (Ronen, 2013) Scaled site frequency spectrum Linear kernel Support Vector Machines Trained on extensive population simulations o SFselect, SFselect-s, SFselect-XP
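The slides show no code, but the core SFselect idea (a linear-kernel SVM trained on scaled SFS vectors) can be sketched as follows, assuming NumPy and scikit-learn. The data here is synthetic and `scaled_sfs` is an illustrative helper, not the SFselect implementation:

```python
import numpy as np
from sklearn.svm import SVC

def scaled_sfs(frequencies, n_bins=10):
    """Bin derived-allele frequencies into a normalized SFS vector."""
    hist, _ = np.histogram(frequencies, bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

rng = np.random.default_rng(0)
# Synthetic stand-in: "neutral" regions get roughly uniform frequencies,
# "selected" regions are skewed toward high-frequency derived alleles.
neutral = [scaled_sfs(rng.uniform(0, 1, 200)) for _ in range(50)]
selected = [scaled_sfs(rng.beta(5, 1, 200)) for _ in range(50)]
X = np.vstack(neutral + selected)
y = np.array([-1] * 50 + [1] * 50)   # -1 = neutral, 1 = selected

svm = SVC(kernel="linear").fit(X, y)
```

On data this cleanly separable the linear decision boundary falls out immediately; the real method's difficulty is in the simulations and feature scaling, not the SVM fit.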
  • Background: Multi-K Clustering Bootstrap aggregation o Random sampling o Aggregation method o Highly accurate, but computationally expensive Multi-K o Iterative K-means clustering o Classify new points based on centroid proximity o Optimize Kend with cross validation Multi-KX, Multi-SVD
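The centroid-proximity step of Multi-K might look like the following toy sketch (an assumption-laden illustration, not Whiteman's code): cluster the training data with K-means, label each centroid by the majority class of its members, then assign new points the label of the nearest centroid.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def fit_centroid_classifier(X, y, k):
    """Cluster X with K-means, then label each centroid by the
    majority class (-1 or 1) of the points assigned to it."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels = np.array([
        int(np.sign(y[km.labels_ == c].sum())) or 1  # break ties toward 1
        for c in range(k)
    ])
    return km.cluster_centers_, labels

def predict(centroids, labels, X_new):
    """Assign each new point the label of its nearest centroid."""
    nearest = cdist(X_new, centroids).argmin(axis=1)
    return labels[nearest]

# Two well-separated synthetic classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)
centroids, labels = fit_centroid_classifier(X, y, k=4)
preds = predict(centroids, labels, X)
```

Iterating this fit over K = Kstart..Kend and scoring each by cross-validation is what the slides describe as optimizing Kend.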
  • Generating ensemble diversity Generating ensemble diversity o Generalizers o Specializers Applied to SFS classification: o Improve overall classification accuracy? o Produce classifiers robust to wide variations in genetic diversity
  • SFselect-E SFselect General SVM SFselect-E: Bagging approach SFselect-E: Multi-K approach
  • Population simulations 1000 individuals s = [0.005, 0.01, 0.02, 0.04, 0.08] t = [0, 50, 150, 200, ..., 3500, 4000] n = 500 labels = [-1, 1] (neutral, selected)
  • Training the standard model Compute allele frequencies Scale, normalize, bin into vectors Trained linear kernel SVM on entire dataset
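The vectorization pipeline on this slide (compute allele frequencies, then scale, normalize, and bin) might look like this in NumPy; the genotype matrix is synthetic and both helpers are illustrative names, not SFselect internals:

```python
import numpy as np

def allele_frequencies(genotypes):
    """Derived-allele frequency at each site, from a 0/1 genotype
    matrix of shape (individuals, sites)."""
    return genotypes.mean(axis=0)

def sfs_vector(freqs, n_bins=10):
    """Bin the frequencies of segregating sites into a fixed-length
    vector and normalize it to sum to 1."""
    seg = freqs[(freqs > 0) & (freqs < 1)]       # segregating sites only
    hist, _ = np.histogram(seg, bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

rng = np.random.default_rng(0)
# Toy genotypes: 1000 individuals x 500 sites, per-site allele frequency
# drawn uniformly (matching the slide's 1000 individuals, n = 500).
G = (rng.random((1000, 500)) < rng.random(500)).astype(int)
v = sfs_vector(allele_frequencies(G))
```

Each simulated population then contributes one such vector, labeled -1 (neutral) or 1 (selected), to the SVM training set.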
  • Computational limits Very time intensive o Population simulations o Vectorization of SFS o Training SVMs on SFS Simulations grouped/indexed by replicate o Proved a major limitation on ensemble sampling
  • SFselect-E: Bagging approach Random sampling o k = 100, n = 200 Aggregation o Majority voting Validation o Cross validation
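A hedged sketch of the bagging scheme above (k bootstrap samples of size n, aggregated by majority vote), assuming scikit-learn; the usage example uses smaller values than the slide's k = 100, n = 200 only to keep the toy fast:

```python
import numpy as np
from sklearn.svm import SVC

def bagged_svms(X, y, k=100, n=200, seed=0):
    """Train k linear SVMs, each on a bootstrap sample of n points."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=n)   # sample with replacement
        models.append(SVC(kernel="linear").fit(X[idx], y[idx]))
    return models

def majority_vote(models, X_new):
    """Aggregate component predictions in {-1, 1} by majority vote."""
    votes = np.stack([m.predict(X_new) for m in models])
    return np.sign(votes.sum(axis=0))

# Synthetic two-class data; an odd k avoids tied votes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)
models = bagged_svms(X, y, k=25, n=100)
preds = majority_vote(models, X)
```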
  • SFselect-E: Multi-K approach Iterative K-means clustering of D Kstart = 2 : Kend = 8 Train on each K Cross validation to determine optimal Kend
  • Experimental analysis: K-fold C.V. How do we cross-validate an ensemble? For each fold Ki, hold out Ki and train on D - Ki Test the classifier on Ki Report mean accuracy (proportion of correct classifications)
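The hold-out scheme on this slide is standard K-fold cross-validation; as a sketch with scikit-learn's `KFold` (toy data, illustrative only; the ensemble variants just swap in a different model inside the loop):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def kfold_accuracy(X, y, k=5, seed=0):
    """Mean held-out accuracy: for each fold Ki, train on D - Ki,
    test on Ki, then average the fold accuracies."""
    accs = []
    folds = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        model = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
        accs.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(accs))

# Synthetic two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 5)), rng.normal(2, 1, (60, 5))])
y = np.array([-1] * 60 + [1] * 60)
acc = kfold_accuracy(X, y, k=5)
```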
  • Experimental analysis: C.V. results Model accuracy (%): Standard SFselect SVM: 74.28 Bagged SFselect-E SVM: 73.86 Multi-K SFselect-E SVM: N/A
  • Experimental analysis: Time series For t = [0, 4000], test Dt o Neutral vs. selected o Dependent (paired) t-test on time-sample accuracies p-value of 2.0136 × 10^-24
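The dependent t-test pairs the two classifiers' accuracies at each time point t. A sketch with `scipy.stats.ttest_rel` on synthetic per-time-point accuracies (these numbers are made up for illustration and are not the talk's results):

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Hypothetical accuracies of two classifiers at 40 shared time points.
acc_standard = rng.normal(0.74, 0.02, 40)
acc_ensemble = acc_standard + rng.normal(0.01, 0.005, 40)  # small paired shift

# Paired test: each time point contributes one (standard, ensemble) pair.
stat, p = ttest_rel(acc_standard, acc_ensemble)
```

Because the test is paired, even a small but consistent per-time-point difference yields a very small p-value, which is how a p on the order of 10^-24 can coexist with near-identical mean accuracies.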
  • Conclusions SFselect-E consistent with SFselect o No separation of specialized classifiers o Smaller subsets? Limitations of structure of training data as implemented in SFselect Model variance best obtained by separating by s, t.
  • Conclusions Computing time for training a major obstacle Multi-SVD preprocessing could reduce training time Refactoring required first
  • Future work Refactor to treat populations independently Bagging: random sampling across s, t Multi-K: hierarchical clustering of training data Multi-KX, Multi-SVD SFselect-s as component models
  • Future work Cross population: SFselect-XP, XP-SFS Cross species: SFS + conserved regions XS-SFS Tune ensemble diversity to population genetic diversity