Ensemble classification techniques for detecting signatures of natural selection from site frequency spectrum


  • SFselect-E: Ensemble classification techniques for detecting signatures of natural selection from the site frequency spectrum. Andrew Stewart, JHU, Spring 2014
  • Introduction Searching for signatures of selection SFselect (Ronen, 2013) Multi-K (Whiteman, 2010) Introducing: SFselect-E
  • Contents 1) The selection classification problem 2) Overview of SVM classification with SFselect 3) Ensemble preprocessing with Multi-* 4) Generating model variance 5) Introducing SFselect-E 6) Experimental Results 7) Conclusion
  • Natural selection Population genetics Evolution: Descent with modification Selection o Directional Positive Negative o Neutral
  • Classifying natural selection Record of demographic history Increased LD, reduced variation Site frequency spectrum o e.g., Tajima's D
  • Background: SFselect (Ronen, 2013) Scaled Site Frequency Spectrum Linear kernel Support Vector Machines Trained on extensive population simulations o SFselect, SFselect-s, SFselect-XP
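A minimal sketch of the kind of feature SFselect works with: derived allele frequencies binned into a site frequency spectrum, scaled, and fed to a linear-kernel SVM. The bin count, the frequency-based scaling weights, and the toy beta-distributed data are illustrative assumptions, not the published SFselect settings.

```python
# Sketch only: bin derived allele frequencies into a scaled SFS vector and fit a
# linear-kernel SVM. Bin count and scaling weights are illustrative assumptions.
import numpy as np
from sklearn.svm import LinearSVC

def scaled_sfs(freqs, n_bins=10):
    """Histogram derived allele frequencies, normalize, and up-weight high-frequency bins."""
    hist, edges = np.histogram(freqs, bins=n_bins, range=(0.0, 1.0))
    sfs = hist / max(hist.sum(), 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    return sfs * centers  # frequency-weighted scaling (toy choice)

rng = np.random.default_rng(0)
# Toy windows of allele frequencies standing in for neutral vs. selected regions.
neutral = [scaled_sfs(rng.beta(0.5, 2.0, 100)) for _ in range(200)]
selected = [scaled_sfs(rng.beta(2.0, 0.5, 100)) for _ in range(200)]
X = np.vstack(neutral + selected)
y = np.array([-1] * 200 + [1] * 200)  # -1 = neutral, 1 = selected, as in the slides

clf = LinearSVC(C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```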
  • Background: Multi-K Clustering Bootstrap aggregation o Random sampling o Aggregation method o Highly accurate, but computationally expensive Multi-K o Iterative K-means clustering o Classify new points based on centroid proximity o Optimize Kend with cross validation Multi-KX, Multi-SVD
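A rough sketch of the centroid-proximity idea behind Multi-K, under simplifying assumptions (plain scikit-learn KMeans, majority label per cluster). The helper names fit_centroid_classifier and predict_by_centroid are hypothetical, not from the Multi-K code.

```python
# Sketch of Multi-K-style classification by centroid proximity (simplified assumption):
# cluster the training data, give each centroid the majority label of its members,
# then label new points by their nearest centroid.
import numpy as np
from sklearn.cluster import KMeans

def fit_centroid_classifier(X, y, k):
    """K-means on X; each centroid gets the majority label of its assigned points."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    centroid_labels = np.empty(k, dtype=int)
    for c in range(k):
        members = y[km.labels_ == c]
        centroid_labels[c] = 1 if members.size == 0 or members.mean() >= 0 else -1
    return km, centroid_labels

def predict_by_centroid(km, centroid_labels, X_new):
    """Label each new point with the label of its nearest centroid."""
    return centroid_labels[km.predict(X_new)]

# Toy usage on two Gaussian clouds standing in for neutral / selected vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (100, 5)), rng.normal(1, 1, (100, 5))])
y = np.array([-1] * 100 + [1] * 100)
km, labels = fit_centroid_classifier(X, y, k=4)
print(predict_by_centroid(km, labels, X[:3]))
```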
  • Generating ensemble diversity Generating ensemble diversity o Generalizers o Specializers Applied to SFS classification: o Improve overall classification accuracy? o Produce classifiers robust to wide variations in genetic diversity
  • SFselect-E SFselect General SVM SFselect-E: Bagging approach SFselect-E: Multi-K approach
  • Population simulations 1000 individuals s = [0.005, 0.01, 0.02, 0.04, 0.08] t = [0, 50, 150, 200, ..., 3500, 4000] n = 500 labels = [-1, 1] (neutral, selected)
  • Training the standard model Compute allele frequencies Scale, normalize, bin into vectors Train a linear kernel SVM on the entire dataset
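A minimal sketch of this training flow, assuming the input is a 0/1 haplotype matrix per simulated region (sites x haplotypes). The frequency computation, the scikit-learn StandardScaler step, and LinearSVC are my stand-ins for the slide's "compute, scale, normalize, bin" pipeline, not the exact SFselect implementation.

```python
# Sketch of the standard-model training flow (assumed stand-in for SFselect's own code):
# derived allele frequencies -> binned SFS vector -> normalization -> linear-kernel SVM.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def region_to_vector(haplotypes, n_bins=10):
    """haplotypes: (n_sites, n_haplotypes) 0/1 matrix for one simulated region."""
    freqs = haplotypes.mean(axis=1)                      # derived allele frequency per site
    hist, _ = np.histogram(freqs, bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)                     # binned, normalized SFS vector

rng = np.random.default_rng(2)
regions = [rng.integers(0, 2, size=(500, 50)) for _ in range(400)]  # toy regions
X = np.vstack([region_to_vector(r) for r in regions])
y = rng.choice([-1, 1], size=len(regions))               # toy labels; real ones come from s, t

model = make_pipeline(StandardScaler(), LinearSVC(C=1.0)).fit(X, y)
```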
  • Computational limits Very time intensive o Population simulations o Vectorization of SFS o Training SVMs on SFS Simulations grouped/indexed by replicate o Proved a major limitation on ensemble sampling
  • SFselect-E: Bagging approach Random sampling o k = 100, n = 200 Aggregation o Majority voting Validation o Cross validation
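A sketch of the bagging step under the parameters on the slide (k = 100 bootstrap samples of n = 200 training examples, aggregated by majority vote). The member classifier (LinearSVC) and the helper names are my assumptions.

```python
# Sketch of bagged SFselect-E (assumptions: LinearSVC members, k bootstrap bags of
# size n, majority voting over -1/+1 predictions).
import numpy as np
from sklearn.svm import LinearSVC

def train_bagged_svms(X, y, k=100, n=200, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.choice(len(X), size=n, replace=True)    # bootstrap sample of training set
        models.append(LinearSVC(C=1.0).fit(X[idx], y[idx]))
    return models

def majority_vote(models, X_new):
    votes = np.stack([m.predict(X_new) for m in models])  # (k, n_points) matrix of -1/+1
    return np.where(votes.sum(axis=0) >= 0, 1, -1)        # ties resolved toward +1
```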
  • SFselect-E: Multi-K approach Iterative K-means clustering of D Kstart = 2 : Kend = 8 Train on each K Cross validation to determine optimal Kend
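A sketch of the Kend selection loop, reusing fit_centroid_classifier and predict_by_centroid from the Multi-K sketch above. The 5-fold split and the accuracy criterion are assumptions about how the cross-validation was set up.

```python
# Sketch: try K = 2..8, score each with K-fold cross-validation, keep the best K.
# Reuses fit_centroid_classifier / predict_by_centroid from the Multi-K sketch above.
import numpy as np
from sklearn.model_selection import KFold

def select_k_end(X, y, k_start=2, k_end=8, n_splits=5):
    best_k, best_acc = k_start, -np.inf
    for k in range(k_start, k_end + 1):
        fold_accs = []
        for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
            km, labels = fit_centroid_classifier(X[train], y[train], k)
            preds = predict_by_centroid(km, labels, X[test])
            fold_accs.append((preds == y[test]).mean())
        if np.mean(fold_accs) > best_acc:
            best_k, best_acc = k, float(np.mean(fold_accs))
    return best_k, best_acc
```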
  • Experimental analysis: K-fold C.V. How do we cross-validate an ensemble? For each fold Ki, hold out Ki, train on D - Ki Test the classifier on Ki Report the mean accuracy (proportion of correct classifications) across folds
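One way to read this slide: the whole ensemble-building procedure is re-run inside each fold, so the held-out fold never influences any member model. A sketch, reusing train_bagged_svms and majority_vote from the bagging sketch above; the 5-fold setup is an assumption.

```python
# Sketch of K-fold cross-validation of the whole ensemble: rebuild the bagged
# ensemble on D - Ki and score it on the held-out fold Ki, then average.
# Reuses train_bagged_svms / majority_vote from the bagging sketch above.
import numpy as np
from sklearn.model_selection import KFold

def cv_ensemble_accuracy(X, y, n_splits=5, seed=0):
    fold_accs = []
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        models = train_bagged_svms(X[train], y[train])
        preds = majority_vote(models, X[test])
        fold_accs.append((preds == y[test]).mean())
    return float(np.mean(fold_accs))
```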
  • Experimental analysis: C.V. results o Standard SFselect SVM: 74.28 o Bagged SFselect-E SVM: 73.86 o Multi-K SFselect-E SVM: N/A
  • Experimental analysis: Time series For each t in [0, 4000], test Dt o Neutral vs. Selected o Dependent t-test on time-sample accuracies, p-value of 2.0136 x 10^-24
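The dependent (paired) t-test compares the two classifiers' accuracies at the same time points. A minimal sketch with scipy; the accuracy arrays below are placeholders, not the reported results.

```python
# Sketch of the dependent (paired) t-test over per-time-point accuracies.
# The arrays are placeholders; real values would be one accuracy per time sample t
# for each classifier being compared.
import numpy as np
from scipy.stats import ttest_rel

acc_standard = np.array([0.71, 0.74, 0.76, 0.75, 0.73])  # placeholder
acc_ensemble = np.array([0.70, 0.73, 0.77, 0.74, 0.72])  # placeholder
stat, pvalue = ttest_rel(acc_standard, acc_ensemble)
print(f"t = {stat:.3f}, p = {pvalue:.3g}")
```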
  • Conclusions SFselect-E is consistent with SFselect o No separation of specialized classifiers o Smaller subsets? Limitations of the structure of the training data as implemented in SFselect Model variance is best obtained by separating by s and t
  • Conclusions Computing time for training is a major obstacle Multi-SVD preprocessing could reduce training time Refactoring required first
  • Future work Refactor to treat populations independently Bagging: random sampling across s, t Multi-K: hierarchical clustering of training data Multi-KX, Multi-SVD SFselect-s as component models
  • Future work Cross population: SFselect-XP, XP-SFS Cross species: SFS + conserved regions XS-SFS Tune ensemble diversity to population genetic diversity