
Widespread Purifying Selection on RNA Structure in Mammals - Martin Smith


WIDESPREAD PURIFYING SELECTION ON RNA STRUCTURE IN MAMMALS

Martin A. Smith, Tanja Gesel, Peter F. Stadler, John S. Mattick

Garvan Institute, Sydney, Australia [[email protected]]


I. Generating positive controls for algorithm benchmarking

1. Select random RNA family
2. Select subset of sequences randomly
3. Emulate genomic alignment by realigning with MAFFT
4. Use native RFAM alignments as reference
5. Select random sub-alignment simulating a sliding window
6. Submit to RNA structure prediction algorithms
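A minimal Python sketch of the sampling steps in this workflow (choosing a random subset of sequences and a random window over the alignment) might look as follows. The function names, the dict-based alignment representation and the parameters `n_seqs` and `window` are assumptions made for illustration; the poster does not specify them. Realignment with MAFFT and the submission to the prediction tools are left to external programs.

```python
import random


def sample_positive_control(alignment, n_seqs, window, rng=None):
    """Pick a random subset of aligned sequences and a random window.

    alignment: dict mapping sequence name -> gapped sequence (equal lengths).
    Returns (names, start, sub_alignment). All names here are hypothetical.
    """
    rng = rng or random.Random()
    # Randomly select a subset of sequences from the family alignment.
    names = rng.sample(sorted(alignment), min(n_seqs, len(alignment)))
    length = len(alignment[names[0]])
    # Random window start, simulating one position of a sliding window.
    start = rng.randrange(max(1, length - window + 1))
    sub = {name: alignment[name][start:start + window] for name in names}
    return names, start, sub


def strip_gaps(sub_alignment):
    """Remove gap characters so an external aligner can realign the sequences."""
    return {
        name: seq.replace("-", "").replace(".", "")
        for name, seq in sub_alignment.items()
    }
```

The native (structure-based) window returned by `sample_positive_control` can serve as the reference, while the output of `strip_gaps` is what would be handed to a sequence aligner to emulate a genomic alignment.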

Most identified genetic variants associated with complex diseases occur in non-coding regions of the genome with no evidence of purifying evolutionary selection.

Using an optimised sliding-window approach, we report with unprecedented accuracy that a large proportion of 35 sequenced mammalian genomes harbors evolutionarily conserved RNA structure motifs.

We propose that the higher-order structural components of RNA serve as a flexible and modular evolutionary platform for the diversification of genetic regulatory mechanisms, assisted by low penetrance of affected alleles and by compensatory base-pairing.

Over 75% of the human genome is processed into RNA, with only 2% encoding proteins.

II. Performance on benchmarking data

[Figure: specificity and sensitivity (0% to 100%) on partial structure alignments [RFAM] and partial sequence alignments [RFAM+MAFFT].]

III. Performance on chr10

[Figure: false discovery rate; values shown: 13.6%, 9.2%, [5-22]%. Legend: Genomic background, RNAz 2.0, SISSIz 2.0, SISSIz 2.0 [+R].]

[Figure: density (0 to 0.3) of the number of species in alignment (5 to 35).]
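The sensitivity, specificity and false discovery rate named in these panels can be illustrated with the standard confusion-matrix definitions. The sketch below assumes counts of true/false positives and negatives are available; how the poster derived its counts (for example, how negatives were generated from the genomic background) is not stated, so the function is only an illustration.

```python
def benchmark_metrics(tp, fp, tn, fn):
    """Standard classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives and false negatives (hypothetical inputs).
    """
    nan = float("nan")
    return {
        # Fraction of real structures that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Fraction of negative controls correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        # Fraction of all predictions that are false.
        "false_discovery_rate": fp / (fp + tp) if (fp + tp) else nan,
    }
```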

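[Figure: chart with values 2%, 10%, 4%, 5%, 1%, 17%, 13%, 48%; labels not recoverable.]

[Figure, panel IV.B: overlap between 2D structures, Gerp++, SyPhi-merged and PhastCons; region values 3.5%, 6.4%, 0.9%, 6.8%, 3.5%, 0.8%, 3.3%, 13.6%, 4.4%, 1.3%, 0.7%, 4.1%, 0.7%, 1.3%, 0.6%.]

[Figure, panel IV.A: Exonic 8%, Intergenic 41%, Intronic 55%; exonic subcategories 3’UTR, 5’UTR, CDS and Non-coding (0.3%), with values 3%, 3%, 2%.]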

[Figure: count (log scale, 10 to 100000) against overlapping predictions (nt, 0 to 4000).]

[Figure: density (0 to 0.08) of mean pairwise identity (%), 30 to 90.]

[Figure: density (0 to 0.06) of G+C content (%), 20 to 80.]
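The two alignment properties plotted here, mean pairwise identity and G+C content, can be computed from a list of aligned sequences. The sketch below is one common way to do it and makes its own assumptions: identity is measured only over columns where neither sequence has a gap, and G+C content ignores gaps and ambiguous characters. The poster does not say which conventions were used.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Percent identity over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def mean_pairwise_identity(seqs):
    """Average percent identity over all pairs of aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    scores = [pairwise_identity(a, b) for a, b in combinations(seqs, 2)]
    return sum(scores) / len(scores)


def gc_content(seqs):
    """Percent G+C among unambiguous nucleotides, ignoring gaps."""
    bases = [c for s in seqs for c in s.upper() if c in "ACGTU"]
    if not bases:
        return 0.0
    return 100.0 * sum(c in "GC" for c in bases) / len(bases)
```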

SISSIz 2.0: Compares a native consensus structure prediction against a background distribution of randomized alignments.

SISSIz 2.0 [+R]: Similar to regular SISSIz, but employs a RIBOSUM substitution matrix to score compensatory mutations.

RNAz 2.0: Employs a regression model trained on known RNA structures to classify sampled alignments as structured or non-structured.
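The idea of comparing a native score against a background of randomized alignments can be sketched as a z-score. This is not SISSIz itself: the tool uses its own null model to generate randomized alignments, whereas the sketch below substitutes a plain column shuffle, and `score_fn` is a hypothetical callable standing in for a consensus structure score (lower meaning more stable, as with a folding energy).

```python
import random
import statistics


def shuffle_columns(alignment, rng):
    """Permute alignment columns (a simplified stand-in null model)."""
    columns = list(zip(*alignment))
    rng.shuffle(columns)
    return ["".join(row) for row in zip(*columns)]


def structure_z_score(alignment, score_fn, n_random=100, seed=0):
    """Z-score of the native score against shuffled-alignment scores.

    alignment: list of equal-length aligned sequences.
    score_fn: hypothetical scoring function, alignment -> float.
    A strongly negative z-score means the native alignment scores lower
    (more stable) than expected under the shuffled background.
    """
    rng = random.Random(seed)
    native = score_fn(alignment)
    background = [score_fn(shuffle_columns(alignment, rng))
                  for _ in range(n_random)]
    mu = statistics.mean(background)
    sigma = statistics.stdev(background)
    if sigma == 0:
        return 0.0  # background is constant; no evidence either way
    return (native - mu) / sigma
```

Shuffling whole columns keeps each column's conservation pattern intact while destroying the positional relationships that base pairs depend on, which is why it is a convenient, if crude, background for illustration.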

[Figure: overlap between predictions.]

[Figure: average runtime for 200 nt (s), axis 0 to 12.]

IV. Optimised genome-wide screen

Access predicted structures in the UCSC Genome Browser

A. Genomic distribution of evolutionarily conserved RNA structures

B. Overlap with annotated sequence constrained elements

[Figure: fold enrichment vs uniform distribution (axis 0.5 to 2.5) for Exonic, Intronic, CDS, 3’UTR, 5’UTR, Intergenic, Non-coding and Repeats.]
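Fold enrichment against a uniform distribution is commonly taken as the share of predictions falling in a category divided by the share of the genome that category occupies. The sketch below follows that reading; it is an assumption about the calculation rather than the poster's stated method, the dictionary inputs are hypothetical, and it treats the categories as disjoint.

```python
def fold_enrichment(predicted_nt, category_nt):
    """Observed/expected ratio per annotation category.

    predicted_nt: nucleotides of predictions per category (hypothetical).
    category_nt: nucleotides of the genome per category (hypothetical).
    Values above 1 indicate more predictions than a uniform spread
    over the genome would give; values below 1 indicate depletion.
    """
    total_pred = sum(predicted_nt.values())
    total_genome = sum(category_nt.values())
    result = {}
    for cat, size in category_nt.items():
        observed = predicted_nt.get(cat, 0) / total_pred
        expected = size / total_genome
        result[cat] = observed / expected
    return result
```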

www.martinalexandersmith.com/ECS

>4,000,000 high-confidence predictions

Garvan Institute, Sydney, Australia
Interdisciplinary Centre for Bioinformatics, Leipzig, Germany
Centre for Integrative Bioinformatics, Vienna, Austria

Less than 10% of the genome is currently defined as evolutionarily constrained.