Lecture 7. Topics in RNA Bioinformatics (Identification of RNA Structures) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology


CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015 2

Lecture outline
1. Sequence-based prediction methods
2. RNA footprinting and high-throughput methods

Last update: 17-Oct-2015


SEQUENCE-BASED PREDICTION METHODS

Part 1

RNA structures
• Some RNAs have strong structural features highly related to their functions

[Figure: example structures of tRNA, snoRNA and rRNA]
Image sources: http://www.bio.miami.edu/dana/pix/tRNA.jpg; http://lowelab.ucsc.edu/images/CDBox.jpg; http://www.daviddarling.info/images/ribosome_and_RNA.jpg

Secondary vs. tertiary structure
• Four levels of molecular structure:
  – Primary: The sequence
  – Secondary: Local interactions
  – Tertiary: Global interactions
  – Quaternary: Inter-molecule interactions
• Both secondary and tertiary RNA structures are meaningful
  – However, more work has been devoted to identifying/predicting RNA secondary structures
  – Secondary structure is also the focus of this lecture

Last update: 20-Oct-2015

Methods for predicting RNA structures
• Wikipedia contains a comprehensive list: https://en.wikipedia.org/wiki/List_of_RNA_structure_prediction_software
• Main classes:
  – Models specific to a particular type of RNA
  – Based on a single sequence
    • Minimum free energy (MFE)
    • Partition function
  – Based on comparison of multiple sequences

Type-specific models
• Example: tRNAscan-SE for finding tRNAs and predicting tRNA structures
• Three main phases:
  1. Running tRNAscan and the Pavesi algorithm to find candidate tRNAs
  2. Using a covariance model to identify the more confident candidates
  3. Trimming the candidates and predicting the detailed secondary structures

Workflow of tRNAscan-SE

[Figure: tRNAscan-SE workflow, with Step 1 and Step 2 marked]
Image credit: Lowe and Eddy, Nucleic Acids Research 25(5):955-964, (1997)

1a. tRNAscan
• Features used:
  – Invariant and semi-invariant bases
  – Potential base-pairing structures consistent with the cloverleaf secondary structure
    • The aminoacyl arm, the D arm, the anticodon arm and the T-Ψ-C arm
  – Length and position of potential intron sequences

Image credit: Fichant and Burks, Journal of Molecular Biology 220(3):659-671, (1991)

1a. tRNAscan: sequential tests

Image credit: Fichant and Burks, Journal of Molecular Biology 220(3):659-671, (1991)

1b. The Pavesi algorithm
• Frequency tables based on 231 nuclear tRNA genes:

Image credit: Pavesi et al., Nucleic Acids Research 22(7):1247-1256, (1994)

1b. The Pavesi algorithm: workflow

Image credit: Pavesi et al., Nucleic Acids Research 22(7):1247-1256, (1994)

2. Chomsky hierarchy of languages

Image source: http://www.cs.utexas.edu/users/novak/cs343283.html

2. Context-free grammar
• Use of context-free grammars in RNA secondary structure representation: To capture pairing relationships. Example:

Productions:
$$P = \{S_0 \to S_1,\ S_1 \to C S_2 G,\ S_1 \to A S_2 U,\ S_2 \to A S_3 U,\ S_3 \to S_4 S_9,\ S_4 \to U S_5 A,\ S_5 \to C S_6 G,\ S_6 \to A S_7,\ S_7 \to U S_7,\ S_7 \to G S_8,\ S_8 \to G,\ S_8 \to U,\ S_9 \to A S_{10} U,\ S_{10} \to C S_{10} G,\ S_{10} \to G S_{11} C,\ S_{11} \to A S_{12} U,\ S_{12} \to U S_{13},\ S_{13} \to C\}$$

One possible derivation:
$$S_0 \Rightarrow S_1 \Rightarrow CS_2G \Rightarrow CAS_3UG \Rightarrow CAS_4S_9UG \Rightarrow CAUS_5AS_9UG \Rightarrow CAUCS_6GAS_9UG \Rightarrow CAUCAS_7GAS_9UG \Rightarrow CAUCAGS_8GAS_9UG \Rightarrow CAUCAGGGAS_9UG \Rightarrow CAUCAGGGAAS_{10}UUG \Rightarrow CAUCAGGGAAGS_{11}CUUG \Rightarrow CAUCAGGGAAGAS_{12}UCUUG \Rightarrow CAUCAGGGAAGAUS_{13}UCUUG \Rightarrow CAUCAGGGAAGAUCUCUUG$$

Example credit: Sakakibara et al., CPM 289-306, (1994)
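The derivation above can be replayed mechanically. The following Python sketch encodes the productions listed on the slide and performs a leftmost derivation. The helper `derive` and the list `CHOICES` (the index of the production chosen at each step) are illustrative names introduced here and are not part of the cited work.

```python
# Productions from the slide; each right-hand side is a list of symbols.
PRODUCTIONS = {
    "S0": [["S1"]],
    "S1": [["C", "S2", "G"], ["A", "S2", "U"]],
    "S2": [["A", "S3", "U"]],
    "S3": [["S4", "S9"]],
    "S4": [["U", "S5", "A"]],
    "S5": [["C", "S6", "G"]],
    "S6": [["A", "S7"]],
    "S7": [["U", "S7"], ["G", "S8"]],
    "S8": [["G"], ["U"]],
    "S9": [["A", "S10", "U"]],
    "S10": [["C", "S10", "G"], ["G", "S11", "C"]],
    "S11": [["A", "S12", "U"]],
    "S12": [["U", "S13"]],
    "S13": [["C"]],
}


def derive(choices, start="S0"):
    """Leftmost derivation: choices[k] selects the production used at step k."""
    form = [start]
    steps = ["".join(form)]
    for choice in choices:
        # Find the leftmost nonterminal in the current sentential form.
        idx = next(i for i, sym in enumerate(form) if sym in PRODUCTIONS)
        # Replace it by the chosen right-hand side.
        form[idx:idx + 1] = PRODUCTIONS[form[idx]][choice]
        steps.append("".join(form))
    return steps


# Production indices that reproduce the derivation shown on the slide.
CHOICES = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

if __name__ == "__main__":
    print(" => ".join(derive(CHOICES)))
```

Productions of the form X → a Y b are what encode a base pair: the two terminals are emitted together, so the pair (a, b) ends up on opposite sides of everything derived from Y.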

2. Parse tree and RNA structure

Figure credit: Sakakibara et al., The 5th Annual Symposium on Combinatorial Pattern Matching 289-306, (1994)

2. SCFG and CM
• Stochastic context-free grammar (SCFG): context-free grammar with probabilistic derivation
• Covariance model (CM): model for representing RNA sequence and structure profiles based on SCFG
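To make "probabilistic derivation" concrete, the sketch below attaches a probability to each production of the example grammar from the context-free grammar slide and multiplies them along that slide's derivation. The probability values in `RULE_PROB` are hypothetical and chosen only for illustration; they are not taken from the lecture or from any trained model.

```python
import math

# Hypothetical rule probabilities (NOT from the slides). For each left-hand
# side with alternatives, the probabilities sum to 1.
RULE_PROB = {
    ("S1", "C S2 G"): 0.6, ("S1", "A S2 U"): 0.4,
    ("S7", "U S7"): 0.3, ("S7", "G S8"): 0.7,
    ("S8", "G"): 0.5, ("S8", "U"): 0.5,
    ("S10", "C S10 G"): 0.2, ("S10", "G S11 C"): 0.8,
}


def derivation_log_prob(rules_used):
    """Sum of log-probabilities of the rules used in a derivation.

    Rules missing from RULE_PROB are treated as the only production for
    their left-hand side, i.e. probability 1.
    """
    return sum(math.log(RULE_PROB.get(rule, 1.0)) for rule in rules_used)


# Rules used, in order, to derive CAUCAGGGAAGAUCUCUUG in the earlier example.
DERIVATION = [
    ("S0", "S1"), ("S1", "C S2 G"), ("S2", "A S3 U"), ("S3", "S4 S9"),
    ("S4", "U S5 A"), ("S5", "C S6 G"), ("S6", "A S7"), ("S7", "G S8"),
    ("S8", "G"), ("S9", "A S10 U"), ("S10", "G S11 C"), ("S11", "A S12 U"),
    ("S12", "U S13"), ("S13", "C"),
]

prob = math.exp(derivation_log_prob(DERIVATION))

if __name__ == "__main__":
    print(f"P(derivation) = {prob:.3f}")
```

Working in log space is a common choice because derivations of realistic RNAs multiply many small probabilities.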

2. SCFG and CM

[Figure: building a covariance model in three stages: (1) input multiple sequence alignment and consensus structure; (2) construction of guide tree from consensus structure; (3) output CM. White state: consensus; gray state: indels]
Image credit: INFERNAL user’s guide

Node   Description
MATP   Pair
MATL   Single strand, left
MATR   Single strand, right
BIF    Bifurcation
ROOT   Root
BEGL   Begin, left
BEGR   Begin, right
END    End

Based on a single sequence
• Minimum free energy (MFE): Finding the RNA structure (pairing of bases) that minimizes the free energy
  – More pairing
  – More stable pairing
    • Strong GC pairing
    • Stable structures such as stacking pairs
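The "more pairing" idea can be illustrated in its simplest form by maximizing the number of base pairs with dynamic programming. Note that this is the classic Nussinov-style simplification, swapped in here for illustration; it is not the free energy model used by MFE tools. The allowed pairs (including G-U wobble) and the minimum hairpin loop size of 3 are assumptions of this sketch.

```python
# Assumed set of allowed base pairs (Watson-Crick plus G-U wobble).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases in a hairpin loop


def max_pairs(seq):
    """Maximum number of non-crossing base pairs in seq."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: j is unpaired
            for k in range(i, j - MIN_LOOP):  # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_pairs("GGGAAAUCC"))
```

Counting pairs treats every pair as equally favorable; an energy model refines this by rewarding G-C pairs and stacked pairs more strongly, which is what the "more stable pairing" bullet refers to.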

Turner’s free energy model
• The total free energy of an RNA structure is taken to be the sum of the free energies of its sub-structures

Image credit: http://www.clcbio.com/scienceimages/rna_prediction/RNA_structure_prediction_web.png

Turner’s energy parameters
• For hairpin loops:

Table credit: Mathews et al., Journal of Molecular Biology 288(5):911-940, (1999)

Energy minimization
• Dynamic programming (without pseudoknots)

$$V(j) = \min \begin{cases} V(j-1) & j \text{ is unpaired} \\ \min_{1 \le i < j} \{ \mathrm{VP}(i,j) + V(i-1) \} & j \text{ pairs with } i \end{cases}$$

$$\mathrm{VP}(i,j) = \min \begin{cases} \mathrm{eS}(i,j) + \mathrm{VP}(i+1,j-1) & (i,j) \text{ and } (i+1,j-1) \text{ form stacking pairs} \\ \mathrm{eH}(i,j) & (i,j) \text{ closes a hairpin loop} \\ \mathrm{VBI}(i,j) & (i,j) \text{ closes an internal loop} \\ \mathrm{VM}(i,j) & (i,j) \text{ closes a multi loop} \end{cases}$$

$$\mathrm{VBI}(i,j) = \min_{i_1,j_1:\ i<i_1<j_1<j} \{ \mathrm{eBI}(i,j,i_1,j_1) + \mathrm{VP}(i_1,j_1) \}$$

$$\mathrm{VM}(i,j) = \min_{i_1,j_1,\ldots,i_k,j_k:\ i<i_1<j_1<\cdots<i_k<j_k<j} \left\{ \mathrm{eM}(i,j,i_1,j_1,\ldots,i_k,j_k) + \sum_{h=1}^{k} \mathrm{VP}(i_h,j_h) \right\}$$
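A minimal sketch of these recurrences follows, under loudly stated assumptions: the energy functions `eS`, `eH` and `eBI` use made-up toy values rather than Turner parameters, the multi-loop case VM is omitted entirely (treated as infinite), and only the minimum energy is returned, without traceback. It is meant to show the shape of the recursion, not to reproduce any real folding program.

```python
import math

INF = math.inf
# Assumed set of allowed base pairs (Watson-Crick plus G-U wobble).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


# Toy energies in kcal/mol. Hypothetical values, NOT Turner parameters.
def eS(i, j):
    """Stacking of pair (i, j) on pair (i+1, j-1)."""
    return -3.0


def eH(i, j):
    """Hairpin loop closed by (i, j); needs at least 3 unpaired bases."""
    return 4.0 if j - i - 1 >= 3 else INF


def eBI(i, j, i1, j1):
    """Bulge or internal loop between (i, j) and (i1, j1)."""
    return 1.0 + 0.5 * ((i1 - i - 1) + (j - j1 - 1))


def mfe(seq):
    """Minimum free energy under the toy model (multi-loops omitted)."""
    n = len(seq)
    s = " " + seq  # shift to 1-based indexing, as in the recurrences
    # VP[i][j] = minimum energy of s[i..j] given that i pairs with j.
    VP = [[INF] * (n + 2) for _ in range(n + 2)]
    for span in range(4, n):
        for i in range(1, n - span + 1):
            j = i + span
            if (s[i], s[j]) not in PAIRS:
                continue
            best = min(eH(i, j), eS(i, j) + VP[i + 1][j - 1])
            # VBI: enumerate the inner pair (i1, j1), skipping the stacking case.
            for i1 in range(i + 1, j):
                for j1 in range(i1 + 1, j):
                    if i1 == i + 1 and j1 == j - 1:
                        continue
                    best = min(best, eBI(i, j, i1, j1) + VP[i1][j1])
            VP[i][j] = best
    # V[j] = minimum energy of the prefix s[1..j].
    V = [0.0] * (n + 1)
    for j in range(1, n + 1):
        V[j] = V[j - 1]  # j is unpaired
        for i in range(1, j):  # j pairs with i
            V[j] = min(V[j], VP[i][j] + V[i - 1])
    return V[n]


if __name__ == "__main__":
    print(mfe("GGGAAAUCC"))
```

The exhaustive enumeration of inner pairs makes this sketch O(n^4); practical implementations bound the internal loop size and handle multi-loops with additional tables.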

Partition function
• Issues of MFE:
  – Solution may not be optimal
  – There can be different structures with similar free energy
• The partition function sums over the relative likelihood of all possible secondary structures:

$$Q = \sum_{S} e^{-\Delta G(S)/RT}$$

  – S: Possible secondary structures
  – ΔG(S): Gibbs free energy change
  – R: Gas constant
  – T: Absolute temperature
• Probability of a particular structure s:

$$\Pr(s) = \frac{e^{-\Delta G(s)/RT}}{Q}$$

• Probability that bases i and j form a pair:

$$\Pr(i \text{ and } j \text{ form a pair}) = \frac{\sum_{s:\ (i,j) \in s} e^{-\Delta G(s)/RT}}{Q}$$

Mathews, RNA 10(8):1178-1190, (2004)
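For a very short sequence, these quantities can be computed by brute force. The sketch below enumerates every non-crossing secondary structure and applies the formulas directly. The energy model is a deliberately crude assumption (a fixed `PAIR_ENERGY` per base pair), as are the minimum loop size and the set of allowed pairs; real tools compute Q with dynamic programming over a full energy model instead of enumerating.

```python
import math

# Assumed set of allowed base pairs (Watson-Crick plus G-U wobble).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3         # assumed minimum hairpin loop size
R = 0.0019872        # gas constant in kcal/(mol*K)
T = 310.15           # absolute temperature in K (37 degrees Celsius)
PAIR_ENERGY = -1.0   # hypothetical free energy change per base pair, kcal/mol


def structures(seq, i, j):
    """Yield each non-crossing set of base pairs on seq[i..j] exactly once."""
    if j - i < MIN_LOOP + 1:
        yield frozenset()  # too short to contain any pair
        return
    yield from structures(seq, i, j - 1)  # j is unpaired
    for k in range(i, j - MIN_LOOP):      # j pairs with k
        if (seq[k], seq[j]) in PAIRS:
            for left in structures(seq, i, k - 1):
                for inner in structures(seq, k + 1, j - 1):
                    yield left | inner | {(k, j)}


def boltzmann(seq):
    """Return Q, structure probabilities and base-pair probabilities."""
    rt = R * T
    weights = {s: math.exp(-PAIR_ENERGY * len(s) / rt)
               for s in structures(seq, 0, len(seq) - 1)}
    Q = sum(weights.values())
    probs = {s: w / Q for s, w in weights.items()}
    pair_prob = {}
    for s, p in probs.items():
        for pair in s:
            pair_prob[pair] = pair_prob.get(pair, 0.0) + p
    return Q, probs, pair_prob


if __name__ == "__main__":
    Q, probs, pair_prob = boltzmann("GGGAAAUCC")
    print(f"{len(probs)} structures, Q = {Q:.2f}")
    for pair, p in sorted(pair_prob.items()):
        print(pair, round(p, 3))
```

The base-pair probabilities summarize the whole ensemble, which addresses the second issue listed above: several structures with similar free energy all contribute, instead of one of them being reported alone.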

Based on multiple sequences
• The conservation of a base and the co-conservation of a base pair in multiple sequences can help resolve ambiguous cases
  – In fact, a CM can be trained from a multiple sequence alignment
• Main types:
  – Joint optimization
  – Consensus/alignment of individual structures
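One common way to quantify the co-conservation of two alignment columns is their mutual information. The slide does not name a specific measure, so the choice of mutual information is an assumption of this sketch, and the four-sequence `ALIGNMENT` is a made-up toy example.

```python
import math
from collections import Counter


def mutual_information(alignment, col_a, col_b):
    """Mutual information (in bits) between two columns of an alignment."""
    n = len(alignment)
    freq_a = Counter(seq[col_a] for seq in alignment)
    freq_b = Counter(seq[col_b] for seq in alignment)
    freq_ab = Counter((seq[col_a], seq[col_b]) for seq in alignment)
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi


# Hypothetical toy alignment: columns 0 and 6 co-vary while staying
# complementary; columns 1 and 5 are perfectly conserved.
ALIGNMENT = [
    "GCAAAGC",
    "CCAAAGG",
    "ACAAAGU",
    "UCAAAGA",
]

if __name__ == "__main__":
    print(mutual_information(ALIGNMENT, 0, 6))
    print(mutual_information(ALIGNMENT, 1, 5))
```

Compensatory changes give high mutual information, whereas a perfectly conserved pair gives zero because neither column varies. This is why conservation and co-conservation are complementary signals.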


RNA FOOTPRINTING AND HIGH-THROUGHPUT METHODS

Part 2

RNA footprinting
• A traditional way to study RNA secondary structures
  – Preferentially cleave or mark nucleotides with a particular structural property

Image credit: Novikova et al., International Journal of Molecular Sciences 14(12):23672-23684, (2013)

Some probes that can be used

Probe                    Size (Dalton)   Structural preference
DMS (dimethyl sulfate)   126             Mark unpaired bases
1M7                      222             Mark unpaired bases
RNase V1                 15,900          Cleave paired bases
RNase ONE                27,000          Cleave unpaired bases
Nuclease S1              32,000          Cleave unpaired bases
Nuclease P1              36,000          Cleave unpaired bases

High-throughput RNA footprinting
• Enzyme-based: After enzymatic treatment, sequence the resulting fragments to identify the cleavage sites
  – And thus bases with the structural property
• Chemical-probe-based: A chemical adduct can terminate reverse transcription. The termination point can be identified by sequencing

Parallel Analysis of RNA Structure (PARS)

Image credit: Kertesz et al., Nature 467(7311):103-107, (2010)

DMS-seq

Image credit: Rouskin et al., Nature 505(7485):701-705, (2014)

Potential confounding factors
• General:
  – Expression level of transcripts
    • Need control/comparison
  – Sequence bias
  – Issues in read alignment
    • Blind tail: Fragments that are too short cannot be aligned correctly
  – Experimental efficiency

Potential confounding factors
• Method-specific:
  – DMS mainly modifies adenines and cytosines
  – Increasing read count towards 3’ end in DMS-seq
  – Natural polymerase drop-off in chemical-probe-based methods
  – Preference due to secondary vs. tertiary structure (e.g., steric hindrance in enzyme-based methods)

Data normalization
• Normalization strategies:
  – Transcript level: Comparison using
    • Standard RNA-seq data
    • Control experiment (with some steps not carried out)
    • Data from two different enzymes (PARS)
  – Increasing read count: Smoothing by local window
  – Polymerase drop-off: Modeling it explicitly
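The two-enzyme comparison can be sketched as a per-nucleotide log ratio, in the spirit of the PARS score (log2 of reads from the paired-base-cleaving enzyme, RNase V1, over reads from the unpaired-base-cleaving enzyme, nuclease S1). The function name, the pseudocount, and the normalization by total counts are assumptions made for this illustration and are not details taken from the slide.

```python
import math


def log_ratio_scores(v1_counts, s1_counts, pseudocount=1.0):
    """Per-nucleotide log2 ratio of V1 (paired) to S1 (unpaired) read fractions.

    Positive scores suggest a paired base, negative scores an unpaired one.
    Because each library is divided by its own total, the transcript's
    expression level and the library depth cancel out of the ratio.
    """
    if len(v1_counts) != len(s1_counts):
        raise ValueError("both count vectors must cover the same positions")
    n = len(v1_counts)
    v1_total = sum(v1_counts) + pseudocount * n
    s1_total = sum(s1_counts) + pseudocount * n
    scores = []
    for v1, s1 in zip(v1_counts, s1_counts):
        v1_frac = (v1 + pseudocount) / v1_total
        s1_frac = (s1 + pseudocount) / s1_total
        scores.append(math.log2(v1_frac / s1_frac))
    return scores


if __name__ == "__main__":
    # Hypothetical read counts at three positions of one transcript.
    print(log_ratio_scores([9, 0, 3], [0, 9, 3]))
```

The pseudocount keeps the ratio finite at positions with zero reads in one library; its value is a tuning choice, not something prescribed by the lecture.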

Example: Poisson linear model
• Modeling local sequence bias

$$\log \omega_i = \alpha + \sum_{k=1}^{K} \sum_{h \in \{A,C,G\}} \beta_{kh}\, I(b_{ik} = h) + \varepsilon$$

  – $\omega_i$: measured read count of nucleotide i
  – $\alpha$: actual expression level of transcript
  – $b_{ik}$: the k-th nucleotide within the length-K local sub-sequence around nucleotide i
  – $\beta_{kh}$: bias coefficient

Li et al., Genome Biology 11(5):R50, (2010)
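The deterministic part of this model is simple to evaluate once the coefficients are known. The sketch below computes the right-hand side without the error term for one nucleotide. The coefficient values in `BETA` and the value of `ALPHA` are hypothetical, and fitting them from data is not shown. Since the sum runs only over h in {A, C, G}, the fourth base acts as the reference level with coefficient 0.

```python
import math


def log_expected_count(subseq, alpha, beta):
    """alpha + sum_k sum_h beta[k][h] * I(b_ik == h), without the error term.

    subseq: the length-K local sub-sequence around nucleotide i.
    beta:   list of K dicts mapping 'A', 'C', 'G' to bias coefficients.
    """
    if len(subseq) != len(beta):
        raise ValueError("sub-sequence length must equal K")
    total = alpha
    for k, base in enumerate(subseq):
        # Bases without a coefficient (the reference base) contribute 0.
        total += beta[k].get(base, 0.0)
    return total


# Hypothetical coefficients for K = 3 (illustration only).
ALPHA = 2.0
BETA = [
    {"A": 0.2, "C": -0.1, "G": 0.0},
    {"A": 0.0, "C": 0.3, "G": -0.2},
    {"A": 0.1, "C": 0.0, "G": 0.4},
]

if __name__ == "__main__":
    for subseq in ("ACG", "TTT"):
        log_count = log_expected_count(subseq, ALPHA, BETA)
        print(subseq, round(math.exp(log_count), 2))
```

With the bias term estimated, the sequence-dependent part can be separated from the part that reflects expression level or, in structure probing, the structural signal.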

In vivo vs. in vitro data

Image credit: Rouskin et al., Nature 505(7485):701-705, (2014)

Using structure-probing data
• The high-throughput RNA footprinting (structure-probing) data only tell whether a base is paired or not, but not which other base it pairs with
• The data can be used to help RNA secondary structure prediction

Using structure-probing data
• Several ways to use the data:
  – Free energy penalty
  – Pseudo free energy terms
  – Discrepancy minimization
  – Identifying closest structure centroid

Example: StructureFold
• Overall workflow:

Image credit: Tang et al., Bioinformatics 31(16):2668-2675, (2015)

Example: RNApbfold
• Minimizing discrepancies between predicted ($p_i(\vec{\epsilon})$) and measured ($q_i$) probabilities of bases being unpaired:

$$F(\vec{\epsilon}) = \sum_{\mu} \frac{\epsilon_\mu^2}{\tau_\mu^2} + \sum_{i=1}^{n} \frac{1}{\sigma_i^2} \left( p_i(\vec{\epsilon}) - q_i \right)^2$$

  – $\vec{\epsilon}$: Perturbation of the energy parameter values
  – $\tau_\mu^2$, $\sigma_i^2$: Variance terms indicating uncertainty

Washietl et al., Nucleic Acids Research 40(10):4261-4272, (2012)
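The objective can be evaluated directly once the predicted unpaired probabilities are available. In the sketch below they are passed in as a precomputed list `p`, which is an assumption made to keep the example self-contained: in the actual method they depend on the perturbation vector and would have to be recomputed by folding for each candidate perturbation. The function name and all numeric values are hypothetical.

```python
import math


def discrepancy_objective(eps, tau, p, q, sigma):
    """F = sum_mu (eps_mu / tau_mu)^2 + sum_i ((p_i - q_i) / sigma_i)^2.

    eps:   perturbations of the energy parameters
    tau:   standard deviations expressing uncertainty in the energy model
    p:     predicted probabilities of being unpaired (given eps)
    q:     measured probabilities of being unpaired
    sigma: standard deviations expressing uncertainty in the measurements
    """
    if len(eps) != len(tau) or not (len(p) == len(q) == len(sigma)):
        raise ValueError("mismatched vector lengths")
    penalty = sum((e / t) ** 2 for e, t in zip(eps, tau))
    mismatch = sum(((pi - qi) / si) ** 2 for pi, qi, si in zip(p, q, sigma))
    return penalty + mismatch


if __name__ == "__main__":
    # Hypothetical values for two energy parameters and two nucleotides.
    value = discrepancy_objective(
        eps=[0.5, -1.0], tau=[1.0, 2.0],
        p=[0.9, 0.2], q=[1.0, 0.0], sigma=[0.5, 0.5],
    )
    print(round(value, 3))
```

The two variance terms set the trade-off: small tau values trust the energy model and discourage large perturbations, while small sigma values trust the probing data and push the prediction towards the measurements.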

Example: RNApbfold
• Sample results:

Image credit: Washietl et al., Nucleic Acids Research 40(10):4261-4272, (2012)

Summary
• Sequence-based RNA secondary structure prediction
  – For specific types of RNA
  – Single sequence
    • Minimum free energy (MFE)
    • Partition function
  – Multiple sequences
• High-throughput RNA structure probing
  – Modification of objective function
  – Selection of appropriate structures