

RNA fitness landscapes

To estimate the generalization performance, a fitness landscape is needed in which the fitness of any genotype can be determined easily and accurately. Using a real empirical landscape is intractable. Quasi-empirical fitness landscapes based on RNA secondary structure, which can be computed easily, provide a solution.

Fitness of a genotype

¤  Simple fitness function: Distance between the MFE structure and the target structure.

¤  Plastic fitness function: Average distance between all structures in the suboptimal ensemble and the target structure, weighted by the Boltzmann probability.

These landscapes are not representative of the real fitness of RNA sequences, but they provide a useful approximation and have previously been shown to have similar properties to real fitness landscapes [3].
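The two distance measures behind the simple and plastic fitness functions can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes structures are given as dot-bracket strings, uses base-pair distance as the structure distance, and takes the suboptimal ensemble as a precomputed list of (structure, free energy) pairs. The function names and the value of `kT` (roughly RT at 37 °C in kcal/mol) are assumptions, and the poster does not state how a distance is mapped to a fitness value.

```python
import math

def base_pairs(structure):
    """Return the set of (i, j) base pairs in a dot-bracket string."""
    stack, pairs = [], set()
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            pairs.add((stack.pop(), j))
    return pairs

def bp_distance(a, b):
    """Base-pair distance: number of pairs present in exactly one structure."""
    return len(base_pairs(a) ^ base_pairs(b))

def simple_distance(mfe_structure, target):
    """Distance used by the simple fitness function (MFE structure only)."""
    return bp_distance(mfe_structure, target)

def plastic_distance(ensemble, target, kT=0.61632):
    """Boltzmann-weighted mean distance over a suboptimal ensemble.

    ensemble: list of (dot-bracket structure, free energy in kcal/mol).
    kT is an assumed value, approximately RT at 37 degrees C.
    """
    e_min = min(energy for _, energy in ensemble)
    # Subtracting e_min leaves the normalised weights unchanged
    # and avoids overflow in exp().
    weights = [math.exp(-(energy - e_min) / kT) for _, energy in ensemble]
    total = sum(weights)
    return sum(w * bp_distance(s, target)
               for (s, _), w in zip(ensemble, weights)) / total
```

In practice the MFE structure and the suboptimal ensemble would come from an RNA folding package; they are passed in as plain data here so that the sketch stays self-contained.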

Results

[Figure: % deviance explained (0.0 to 1.0) for Simple+MEEP, Simple+ME, Plastic+MEEP and Plastic+ME on the Random, Dense subset, Evolved (Simple) and Evolved (Plastic) datasets.]

Exploring the validity of representing high-dimensional fitness landscapes with a quadratic approximation

Louis du Plessis* and Sebastian Bonhoeffer

Institute for Integrative Biology, ETH Zürich,

Universitätstrasse 16, 8092 Zürich, Switzerland, *[email protected], +41 44 632 33 37

¤  Simulate RNA sequences

¤  Calculate fitness (simple/plastic)

¤  Train GKRR on subset (ME/MEEP)

¤  Evaluate approximation on test set

Generalized kernel ridge regression (GKRR)

[Figure: % deviance explained (0.0 to 1.0) against maximum Hamming distance from target (0 to 100) for Simple+MEEP and Plastic+MEEP.]


[Figure: RNA secondary structure drawing (scale ticks: 0, 2.2).]

Measure the fitness of some genotypes (points on the landscape)

AUCCUUUUGUAGUUCAUGAGUGUGAUGAUUGUGUGUUCAUACGCUUGUGAAGAUGUGCCACCCUUGAACCUUGUUACGACAUUGGCACAUUACCCGUCUGAUC

A predictive model for in vitro replicative fitness of HIV, which estimates main fitness effects and pairwise epistatic interactions of mutations with a generalized kernel ridge regression (GKRR), was presented in [1]. The predictive power of the resulting quadratic approximation was found to be high when tested on an independent dataset. A quantitative analysis of the resulting fitness landscape further showed that it was less rugged, less neutral and more correlated than expected [2]. However, the model was only tested on highly biased datasets (HIV sequences under selective pressure), and because of the high dimensionality of the sequence space it is intractable to assess its performance on all possible sequence variants.

Fitness landscape (unobserved)

Use GKRR to approximate the complete fitness landscape for each possible genotype

Evaluate the accuracy of the approximation (Generalization performance)

ME model: Main effects only

MEEP model: Main effects and epistatic interactions
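The difference between the two models can be made concrete by looking at the inputs each one would regress on. The sketch below is an assumption about the encoding, not the scheme used in [1]: every (position, nucleotide) combination gets one indicator for the main effects, and every pair of positions gets one indicator per nucleotide pair for the epistatic terms. A model that is linear in the combined features is quadratic in the main-effect indicators, which is the sense of "quadratic approximation" in the title. All function names are hypothetical.

```python
from itertools import combinations

ALPHABET = "ACGU"

def main_effect_features(seq):
    """One 0/1 indicator per (position, nucleotide): inputs of an ME model."""
    return [1 if base == letter else 0 for base in seq for letter in ALPHABET]

def pairwise_features(seq):
    """One 0/1 indicator per pair of positions and pair of nucleotides."""
    features = []
    for i, j in combinations(range(len(seq)), 2):
        for a in ALPHABET:
            for b in ALPHABET:
                features.append(1 if (seq[i] == a and seq[j] == b) else 0)
    return features

def meep_features(seq):
    """Main effects followed by pairwise terms: inputs of an MEEP model."""
    return main_effect_features(seq) + pairwise_features(seq)
```

For a sequence of length L this gives 4L main-effect features and 16·L(L-1)/2 pairwise features, so the MEEP feature space grows quadratically with sequence length. That growth is one reason a kernel formulation is attractive for this kind of model.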

Questions

¤  How important is it that the sequences used for training the model are representative of the sequences the model is tested on?

¤  How important is it that the fitness measured and trained on is the true underlying fitness function?

Methodology

Simulated datasets

Genotype Phenotype

¤  Random (Unbiased): Sample random sequences.

¤  Dense subset (Genotype bias): All genotypes close to the target genotype.

¤  Evolved (Phenotype bias): Select sequences according to their fitness (simple/plastic); all phenotypes close to the target phenotype.
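The first two sampling schemes could be sketched as below. This is an illustration under stated assumptions rather than the procedure used for the poster: the helper names are hypothetical, and `dense_neighbour` draws the number of mutations uniformly between 0 and `max_distance`, which does not sample genotypes uniformly within the Hamming ball. The evolved datasets are omitted because they depend on a selection scheme the poster does not describe.

```python
import random

ALPHABET = "ACGU"

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def random_sequence(length, rng):
    """Unbiased sample: each position drawn uniformly from the alphabet."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def dense_neighbour(target, max_distance, rng):
    """Genotype-biased sample: at most max_distance mutations of the target.

    Assumes max_distance <= len(target).
    """
    n_mutations = rng.randint(0, max_distance)
    seq = list(target)
    # Mutate distinct positions so the Hamming distance equals n_mutations.
    for pos in rng.sample(range(len(target)), n_mutations):
        seq[pos] = rng.choice([b for b in ALPHABET if b != target[pos]])
    return "".join(seq)
```

Passing an explicit `random.Random` instance as `rng` keeps the sampling reproducible across runs.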

Predictive power of the MEEP model as the size of the landscape is increased. Datasets are sequences randomly sampled up to a specified Hamming distance from the target. Data points are means from 6-fold CV; a linear least-squares fit is also shown.

6-fold CV of predictive power of different models and fitness functions on simulated datasets (70,000 sequences, 217 bp, target structure is the MFE structure of human U3 snoRNA).
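The evaluation described in this caption can be sketched as a generic k-fold loop. This is a minimal sketch under assumptions, not the authors' code: it assumes Gaussian errors, in which case the fraction of deviance explained reduces to one minus the ratio of the residual to the total sum of squares on the held-out fold. The function names and the `fit` callback (which takes training inputs and targets and returns a prediction function) are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=6, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]

def deviance_explained(observed, predicted):
    """1 - residual SS / total SS (Gaussian case); observed must vary."""
    mean = sum(observed) / len(observed)
    residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    null = sum((y - mean) ** 2 for y in observed)
    return 1.0 - residual / null

def cross_validate(xs, ys, fit, k=6, seed=0):
    """Mean deviance explained over k held-out folds."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # fit() returns a function mapping one input to a prediction.
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(deviance_explained(
            [ys[i] for i in test_idx],
            [predict(xs[i]) for i in test_idx]))
    return sum(scores) / k
```

A model that only predicts the mean of the held-out values scores 0, and a perfect predictor scores 1, which matches the 0.0 to 1.0 axis of the figures.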

Acknowledgments

We would like to thank Gabriel Leventhal, Trevor Hinkley and Nicco Yu for insightful discussions on the topic.

References

[1] T Hinkley, J Martins et al. Nat Genet, 43, 2011.

[2] RD Kouyos, GE Leventhal, T Hinkley et al. PLoS Genet, 8(3), 2012.

[3] P Schuster, W Fontana, PF Stadler and IL Hofacker. Proc R Soc B, 255(1344), 1994.

Summary

¤  GKRR loses predictive power exponentially as the size of the landscape is increased.

¤  GKRR cannot predict the fitness of randomly sampled sequences.

¤  On evolved datasets the fit is still good even if the fitness function is not the one used to evolve the dataset.

¤  The plastic fitness function is more powerful on smaller landscapes, but the simple fitness function works better on more divergent landscapes.