Widespread Purifying Selection on RNA Structure in Mammals - Martin Smith

WIDESPREAD PURIFYING SELECTION ON RNA STRUCTURE IN MAMMALS

Martin A. Smith, Tanja Gesel, Peter F. Stadler, John S. Mattick

Garvan Institute, Sydney, Australia [[email protected]]


I. Generating positive controls for algorithm benchmarking

1. Select random RNA family
2. Select subset of sequences randomly
3. Emulate genomic alignment by realigning with MAFFT
4. Use native RFAM alignments as reference
5. Select random sub-alignment simulating a sliding window
6. Submit to RNA structure prediction algorithms
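The sampling steps of this workflow (selecting a random subset of sequences, then a random sub-alignment that mimics a sliding window) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `sample_positive_control`, its parameters and the toy alignment are assumptions, and the MAFFT realignment step is left out because it relies on an external program.

```python
import random

def sample_positive_control(alignment, n_seqs, window, rng=None):
    """Pick a random subset of sequences and a random window of columns.

    `alignment` maps sequence names to gapped, equal-length strings.
    All names and parameters here are hypothetical.
    """
    rng = rng or random.Random()
    names = rng.sample(sorted(alignment), n_seqs)
    length = len(alignment[names[0]])
    window = min(window, length)
    start = rng.randrange(0, length - window + 1)
    sub = {name: alignment[name][start:start + window] for name in names}
    # Drop columns that contain only gaps once the subset has been taken.
    keep = [i for i in range(window) if any(seq[i] != "-" for seq in sub.values())]
    trimmed = {name: "".join(seq[i] for i in keep) for name, seq in sub.items()}
    return start, trimmed

# Toy alignment, invented for illustration only.
toy = {"a": "GGC-AUGCC", "b": "GGCAAUGCC", "c": "GCC-AUGGC", "d": "GGC-AAGCC"}
start, sub_alignment = sample_positive_control(toy, n_seqs=3, window=5, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps a benchmark set reproducible between runs.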

Most identified genetic variants associated with complex diseases occur in non-coding regions of the genome with no evidence of purifying evolutionary selection.

Using an optimised sliding-window approach, we detect evolutionarily conserved RNA structure motifs with unprecedented accuracy and report that they occur across a large proportion of 35 sequenced mammalian genomes.
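A sliding-window scan splits a long genomic alignment into overlapping column ranges that are scored one at a time. The sketch below shows only that windowing step. The window of 200 columns echoes the 200 nt figure used for the runtime comparison on this poster, but treating it as the window size is an assumption, and the step of 100 is purely hypothetical.

```python
def sliding_windows(alignment_length, window=200, step=100):
    """Yield (start, end) column ranges that cover an alignment.

    The window and step values are illustrative assumptions.
    """
    if alignment_length <= window:
        yield 0, alignment_length
        return
    start = 0
    while start + window < alignment_length:
        yield start, start + window
        start += step
    # Anchor the last window to the end so that no columns are skipped.
    yield alignment_length - window, alignment_length

windows = list(sliding_windows(450))
```

Overlapping windows trade extra computation for a lower chance of cutting a structure in half at a window boundary.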

We propose that the higher-order structural components of RNA serve as a flexible and modular evolutionary platform for the diversification of genetic regulatory mechanisms, assisted by low penetrance of affected alleles and by compensatory base-pairing.

Over 75% of the human genome is processed into RNA, with only 2% encoding proteins.

II. Performance on benchmarking data

[Figure: sensitivity and specificity (0% to 100%) on partial structure alignments [RFAM] and partial sequence alignments [RFAM+MAFFT].]

III. Performance on chr10

[Figure: false discovery rate. Values shown: 13.6%, 9.2%, [5-22]%.]

[Figure: density distributions of species in alignment (5 to 35), mean pairwise identity (30% to 70%) and G+C content (20% to 80%). Legend: Genomic background, RNAz 2.0, SISSIz 2.0, SISSIz 2.0 [+R].]

[Figure: count (log scale) against overlapping predictions (nt).]

[Figures for section IV: overlap of predicted 2D structures with Gerp++, SyPhi-merged and PhastCons elements; genomic distribution of predictions, with labels 3’UTR, 5’UTR, CDS, Non-coding (0.3%), Exonic 8%, Intergenic 41%, Intronic 55%.]
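The sensitivity, specificity and false discovery rate shown in the performance panels follow the usual confusion-matrix definitions. The sketch below computes them from raw counts. The function name and the example counts are hypothetical, and the poster does not state how its false discovery rate on chr10 was estimated, so this shows only the general formulas.

```python
def benchmark_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity and false discovery rate.

    tp/fp/tn/fn are counts of true/false positives and negatives.
    """
    def ratio(numerator, denominator):
        # An undefined ratio is reported as NaN rather than raising.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true structures recovered
        "specificity": ratio(tn, tn + fp),  # share of negatives correctly rejected
        "fdr": ratio(fp, fp + tp),          # share of predictions that are false
    }

# Hypothetical counts, for illustration only.
metrics = benchmark_metrics(tp=80, fp=10, tn=90, fn=20)
```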

SISSIz 2.0: Compares a native consensus structure prediction against a background distribution of randomized alignments.
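Comparing a native prediction against a randomized background is commonly summarised as a z-score. The sketch below assumes the consensus folding energies have already been computed elsewhere; generating and folding the randomized alignments is not shown. The function name and the energy values are hypothetical.

```python
from statistics import mean, stdev

def structure_z_score(native_energy, background_energies):
    """Z-score of a native consensus energy against a randomized background.

    Energies are in kcal/mol; a strongly negative z-score means the native
    alignment folds more stably than its randomized counterparts.
    """
    mu = mean(background_energies)
    sigma = stdev(background_energies)
    if sigma == 0:
        raise ValueError("background energies have zero variance")
    return (native_energy - mu) / sigma

# Hypothetical energies, for illustration only.
z = structure_z_score(-40.0, [-30.0, -28.0, -32.0, -31.0, -29.0])
```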

SISSIz 2.0 [+R]: Similar to regular SISSIz, but employs a RIBOSUM substitution matrix to score compensatory mutations.
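A compensatory mutation changes both partners of a base pair while keeping them able to pair (for example G-C in one species and A-U in another). The sketch below counts such events between two alignment columns that are predicted to pair. It is a plain count, not the RIBOSUM-based scoring used by SISSIz, and the function name and example columns are hypothetical.

```python
from itertools import combinations

# Watson-Crick and G-U wobble pairs, uppercase RNA alphabet assumed.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def count_compensatory(column_i, column_j):
    """Count sequence pairs whose paired bases both differ yet still pair.

    column_i and column_j hold the bases of two paired alignment columns,
    one character per sequence, in the same sequence order.
    """
    pairs = list(zip(column_i, column_j))
    count = 0
    for (a1, b1), (a2, b2) in combinations(pairs, 2):
        both_pair = (a1, b1) in CANONICAL and (a2, b2) in CANONICAL
        if both_pair and a1 != a2 and b1 != b2:
            count += 1
    return count

# Hypothetical columns from a four-sequence alignment.
n_compensatory = count_compensatory("GGAC", "CCUG")
```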

RNAz 2.0: Employs a regression model trained on known RNA structures to classify sampled alignments as structured or non-structured.

[Figure: overlap between predictions; average runtime for 200 nt (s).]

IV. Optimised genome-wide screen

Access predicted structures in the UCSC Genome Browser

A. Genomic distribution of evolutionarily conserved RNA structures

B. Overlap with annotated sequence constrained elements

[Figure: fold enrichment vs uniform distribution for exonic, intronic, CDS, 3’UTR, 5’UTR, intergenic, non-coding and repeat regions.]

www.martinalexandersmith.com/ECS

>4,000,000 high-confidence predictions

Garvan Institute, Sydney, Australia; Interdisciplinary Centre for Bioinformatics, Leipzig, Germany; Centre for Integrative Bioinformatics, Vienna, Austria

Less than 10% of the genome is currently defined as evolutionarily constrained.
