Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology


CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015 2

Lecture outline
1. Identification of RNAs
2. Identification of RNA structures, interactions and functions

Last update: 6-Oct-2015

IDENTIFICATION OF RNAS

Part 1

Understanding machine language
• This is what a PDF file looks like when we open it in binary mode (shown as hexadecimal numbers).
• How do we interpret it?

Understanding machine language
[Figure: hexadecimal dump of a PDF file with annotated elements: version number, language, number of pages]
Want to know more? Look for a standard called ISO 32000.

Understanding machine language
• We looked for elements that were easy to interpret
  – There were many parts whose meanings were not as obvious
  – It would be more complicated if it were an executable program instead, as it would contain both control and data elements
• In general, we tried to separate the long piece of content into elements/element types, and annotate each of them
  – Meanings of some elements can be determined with the help of other elements (e.g., number of pages)
  – The next (more difficult) step is to understand the relative locations of the different elements and how they interact with each other

Understanding genomic language
• Now, how do we interpret the human genome?

......TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGACACATGCTAGCGCGTCGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTACCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCACCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGC......

Understanding genomic language
• Again, we first look for functional elements
• We focus on genes in this lecture
• Classification:
  – By end-product: protein-coding vs. non-coding
  – By type: mRNAs, tRNAs, miRNAs, lncRNAs, ...
  – Sub-elements at the transcriptional level: whole transcripts, exons, introns, ...
  – Sub-elements at the translational level: 5’UTR, coding sequence, 3’UTR, ...

Structure of a protein-coding gene

Image source: http://www.carolguze.com/text/442-1-humangenome.shtml

Human gene annotation sets
• RefSeq (NCBI, National Center for Biotechnology Information, USA National Institutes of Health)
  – Standard for most biologists
• Ensembl (EMBL-EBI, European Molecular Biology Laboratory-European Bioinformatics Institute)
  – Automatic annotation
• Havana (Wellcome Trust Sanger Institute)
• Gencode (ENCODE, Encyclopedia of DNA Elements)
  – Based on latest experimental data
  – Level 1: Experimentally validated
  – Level 2: Manually checked, but without experimental support
  – Level 3: Automatic annotation
• UCSC, University of California at Santa Cruz

Each has different versions

Comparison of gene annotation sets

Image source: Harrow et al., Genome Research 22(9):1760-1774, (2012)

Comparison of gene annotation sets
Example: p53
[Figure: p53 annotation tracks labeled UCSC, Gencode v17, Gencode v14, Gencode v7, RefSeq and Ensembl]

Annotation file formats
• GFF format (from http://genome.ucsc.edu/FAQ/FAQformat.html): tab-delimited. Fields:
  1. seqname - The name of the sequence. Must be a chromosome or scaffold.
  2. source - The program that generated this feature.
  3. feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".
  4. start - The starting position of the feature in the sequence. The first base is numbered 1.
  5. end - The ending position of the feature (inclusive).
  6. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".
  7. strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
  8. frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.
  9. group - All lines with the same group are linked together into a single item.
• GTF format: Similar to GFF, except that the group field is replaced by a list of attributes given as <name> <value> pairs, each terminated by a semicolon

Example
• Gencode v12 GTF file:

chr1 ENSEMBL exon 17021 17055 . - . gene_id "ENSG00000227232.3"; transcript_id "ENST00000430492.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "WASH7P-202"; level 3; havana_gene "OTTHUMG00000000958.1";
chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.1"; transcript_id "ENSG00000243485.1"; gene_type "antisense"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "antisense"; transcript_status "NOVEL"; transcript_name "MIR1302-11"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
...
chr1 HAVANA gene 34554 36081 . - . gene_id "ENSG00000237613.2"; transcript_id "ENSG00000237613.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A"; level 2; havana_gene "OTTHUMG00000000960.1";
chr1 HAVANA transcript 34554 36081 . - . gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";
chr1 HAVANA exon 35721 36081 . - . gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";
chr1 HAVANA CDS 35721 35736 . - 0 gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";
chr1 HAVANA start_codon 35734 35736 . - 0 gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";

Key (fields highlighted on the slide): annotation set, feature, gene name, transcript type, annotation level
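To make the nine-field layout concrete, here is a minimal Python sketch that parses one GTF record into a dictionary. The function name `parse_gtf_line` and the field list `GTF_FIELDS` are illustrative names, not part of any library. Real GTF files are tab-delimited; the sketch splits on any whitespace so that it also accepts the space-separated rendering shown in the example, and it keeps only the last value when an attribute name repeats.

```python
import re

# Names of the first eight fixed fields, in order (the ninth is the attribute list).
GTF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame"]

def parse_gtf_line(line):
    """Parse one GTF record into a dict; attributes go under 'attributes'."""
    # Split into at most 9 pieces so the attribute list stays in one piece.
    parts = line.strip().split(None, 8)
    if len(parts) != 9:
        raise ValueError("expected 9 fields, got %d" % len(parts))
    record = dict(zip(GTF_FIELDS, parts[:8]))
    record["start"] = int(record["start"])  # 1-based
    record["end"] = int(record["end"])      # inclusive
    # Attributes look like: name "value"; name value; ...
    attrs = {}
    pattern = r'(\S+)\s+(?:"([^"]*)"|([^\s;]+))\s*;'
    for name, quoted, bare in re.findall(pattern, parts[8]):
        attrs[name] = quoted if quoted else bare
    record["attributes"] = attrs
    return record
```

Because start and end are both 1-based and inclusive, the feature length is `end - start + 1`; for the CDS record in the example that is 35736 - 35721 + 1 = 16 bases.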

Gene annotation: The process
• How to discover genes?
  – Experimental:
    • EST (Expressed Sequence Tag) libraries
    • Tiling microarrays
    • RNA sequencing
    • ...
    (Require observed expression)
  – Computational:
    • Similarity search
    • Simple features
    • Machine learning
    • Hidden Markov Models
    • ...

Computational gene finding – similarity search
• Find sequences that are similar to annotated genes
  – DNA (blastn)
  – Protein (blastx/tblastx): 6-frame translation

Reading frame:
+3   L V R T
+2   T C S Y
+1   N L F V
     5’-AACTTGTTCGTACA-3’
     3’-TTGAACAAGCATGT-5’
-1   K N T C
-2   S T R V
-3   V Q E Y
Image credit: Wikipedia

[Figure: alignment grid for two sequences labeled s and r, with GCGTGACTTTCT along one axis and ACGTTGCT along the other]
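The 6-frame translation can be sketched in a few lines of Python. This is an illustrative sketch, not the code used by blastx/tblastx: the helper names are made up here, the standard genetic code is assumed, and only the letters A, C, G and T are handled (an N would raise a KeyError).

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons from position 0; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames
```

For the sequence in the figure, the negative frames come out in the 5’ to 3’ direction of the reverse strand, so they read in the opposite order from the figure, where residues are written left to right under the bottom strand (for example, frame -1 is C T N K here and K N T C in the figure).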

Computational gene finding – simple features
• Based on sequence information only – “Ab initio gene finding”
  – Open reading frame (ORF)
    • Existence of start and stop codons in-frame and within a reasonable distance
      – More complicated when introns are present
  – Splice junctions
    • Grammar rules or probabilistic models
  – Promoter signals
    • TATA boxes
    • CpG islands
    • ...
  – Codon bias
  – ...

Image source: http://www.blackwellpublishing.com/ridley/a-z/codon_bias.asp
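As an illustration of the ORF feature, the sketch below scans the three forward frames for an ATG followed in-frame by a stop codon. It is a deliberately simple example: the function name `find_orfs` and the `min_codons` threshold are assumptions made here, the reverse strand is not scanned (run it on the reverse complement for that), and introns are ignored, which is exactly the complication the slide points out.

```python
def find_orfs(dna, min_codons=30):
    """Find forward-strand ORFs.

    Returns (start, end, frame) tuples with 0-based start, exclusive end
    (the stop codon is included), for ORFs with at least `min_codons`
    codons before the stop codon.
    """
    dna = dna.upper()
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in stops:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

The length threshold plays the role of "within a reasonable distance": short ATG-to-stop stretches occur often by chance, so only longer ones are reported.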

Combining features
• How to combine the various features?
• Essentially a machine learning problem
  – For each window (e.g., 100-400bp), compute the various features
  – Gather some positive examples (known coding genes)
  – Gather some negative examples (known non-genic regions)
  – Train a statistical model that can tell whether the window (or the middle nucleotide) is likely genic/coding
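The "compute the various features for each window" step might look like the sketch below. The two features chosen here (GC content and the observed/expected CpG ratio commonly used to flag CpG islands) are only examples, and the function name, window size and step size are assumptions; the resulting feature vectors would then be passed to whatever classifier is trained on the positive and negative examples.

```python
def window_features(dna, window=200, step=100):
    """Compute simple sequence features for sliding windows.

    Returns one dict per window with the 0-based window start, the GC
    fraction, and the observed/expected CpG ratio
    (number of CG dinucleotides * window length) / (number of C * number of G).
    """
    dna = dna.upper()
    features = []
    for start in range(0, len(dna) - window + 1, step):
        w = dna[start:start + window]
        c, g = w.count("C"), w.count("G")
        cpg = w.count("CG")
        expected = c * g / window
        features.append({
            "start": start,
            "gc": (c + g) / window,
            "cpg_oe": cpg / expected if expected else 0.0,
        })
    return features
```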

Computational gene finding – machine learning
• GRAIL: Neural network-based method

Image credit: Uberbacher and Mural, PNAS 88(24):11261-11265, (1991)

Fine-grained modeling
• All the above methods have limitations:
  – Similarity search: Only for genes with annotated homologs
  – Simple features: Each feature is weak, and thus can lead to false positives and false negatives
  – Machine learning (in that form): Does not fully utilize information about neighboring positions, also not able to tell precise element boundaries
• Need methods that provide finer-grained modeling of gene structures

Hidden Markov Models (HMMs)
• Hidden Markov Models are statistical models for modeling unobserved information based on an observed data sequence
  – Observed data: DNA sequence
  – Unobserved information:
    • State of each nucleotide (exon, intron, etc.)
    • Transition probability between states
    • Emission probabilities: E.g., what is the probability of emitting a certain nucleotide in the exon state?

HMM example
• Suppose you have two coins, one biased and one unbiased. Which coin was used each time if you observed the sequence <T, H, T>?

Coin used:  ? ? ?
Observed:   T H T

A possible model:
[Figure: two states, A and B, each emitting H or T, with arrows labeled by the probabilities 0.5, 0.5, 0.9, 0.1, 0.8, 0.2, 0.5, 0.5, 0.25, 0.75]

A possible run:
Coin used:  B A A
Observed:   T H T

HMM algorithms
• There are algorithms for the following problems:
  – Given a model $\lambda$, compute the data likelihood of the observed sequence, $\Pr(O \mid \lambda)$
    • Forward algorithm
    • Backward algorithm
  – Given a model and an observed sequence, determine the most likely state sequence, $\arg\max_Q \Pr(Q \mid O, \lambda)$
    • Viterbi algorithm
  – Given a set of states and a series of observed data sequences, estimate the transition and emission probabilities
    • Baum-Welch algorithm
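The first two problems can be illustrated on the two-coin example with a short Python sketch of the forward and Viterbi algorithms. The parameter values below reuse numbers that appear on the example slide, but which number belongs to which arrow is an assumption made here (A is taken as the unbiased coin, B as the biased one), so the outputs illustrate the algorithms rather than reproduce the slide.

```python
# Hypothetical two-coin model; the assignment of numbers is an assumption.
STATES = ("A", "B")
INIT = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.9, "B": 0.1},
         "B": {"A": 0.2, "B": 0.8}}
EMIT = {"A": {"H": 0.5, "T": 0.5},      # unbiased coin
        "B": {"H": 0.25, "T": 0.75}}    # biased coin

def forward(obs, states=STATES, init=INIT, trans=TRANS, emit=EMIT):
    """Forward algorithm: likelihood of the observations, summed over all state paths."""
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

def viterbi(obs, states=STATES, init=INIT, trans=TRANS, emit=EMIT):
    """Viterbi algorithm: most likely state path and its joint probability."""
    delta = {s: init[s] * emit[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] * trans[r][s])
            ptr[s] = best
            delta[s] = prev[best] * trans[best][s] * emit[s][o]
        backpointers.append(ptr)
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1], delta[last]
```

Calling `forward(["T", "H", "T"])` gives the data likelihood under this assumed model, and `viterbi(["T", "H", "T"])` gives the single most likely sequence of coins. For long sequences, a practical implementation would work with log probabilities to avoid numerical underflow.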

Computational gene finding – HMMs
• GENSCAN:
  – Both transcription (exon/intron) and translation (UTR/CDS)
  – Positive and negative strands
  – Single-exon vs. multi-exon genes
  – Three different frames
• One type of generalized HMMs (GHMMs): Emission of a sequence instead of a single nucleotide

Image credit: Burge and Karlin, Journal of Molecular Biology 268(1):78-94, (1997)

Computational gene finding – HMMs
• VEIL: Multi-level models
[Figures: overall model; exon and stop codon model]

Image credit: Henderson et al., Journal of Computational Biology 4(2):127-141, (1997)

Gene finding in the post-NGS era
• With the invention of RNA-seq, the ability to experimentally discover gene locations has been greatly improved:
  1. Sequence all RNAs
  2. Map them to a reference genome or perform de novo assembly
• Issues:
  – Experimental noise
  – Mapping:
    • Availability of a good reference genome
    • Mapping of split reads and paired-end reads
  – Assembly:
    • Lots of ambiguity
  – Cell/tissue/condition-specific expression
    • Over- and under-representation of certain transcripts
  – Biochemical activity vs. biological function

Split mapping
• TopHat2

Image credit: Kim et al., Genome Biology 14(4):R36, (2013)

Transcript isoforms [Project]
• Given a set of RNA-seq short reads mapped to a gene, determine the transcript isoforms present and their relative abundance
• Cufflinks

Image credit: Trapnell et al., Nature Biotechnology 28(5):511-515, (2010)

Non-coding RNAs (ncRNAs)
• Non-coding RNAs are RNAs that function without being translated into proteins
• Many types:

Type | Abbreviation | Function
Ribosomal RNA | rRNA | Translation
Transfer RNA | tRNA | Translation
Small nuclear RNA | snRNA | Splicing
Small nucleolar RNA | snoRNA | Nucleotide modifications
MicroRNA | miRNA | Gene regulation
Small interfering RNA | siRNA | Gene regulation
Long non-coding RNA (>200nt) | lncRNA | Various (mostly unknown)
… | … | …

Identifying non-coding RNAs [project]
• Some features:
  – Strong evolutionary conservation
  – Strong secondary structure
  – Weak coding potential
  – (For small RNA) Strong signals in RNA-seq data selected for small RNA
  – (For non-polyadenylated RNA) Weak signals in RNA-seq data enriched for poly-A RNA
  – ...

Machine learning for identifying ncRNAs

Image credit: Lu, Yip et al., Genome Research 21(2):276-285, (2011)

Identifying long non-coding RNAs

Image credit: Nam and Bartel, Genome Research 22(12):2529-2540, (2012)

Structural models for ncRNA
• Some small RNAs have strong structural features, which can be used to identify them from genomic sequences
[Figures: tRNA; snoRNA]

Image sources: http://www.bio.miami.edu/dana/pix/tRNA.jpg, http://lowelab.ucsc.edu/images/CDBox.jpg

Covariance models
[Figure panels: input multiple sequence alignment and consensus structure; construction of guide tree from consensus structure; output CM]
Image credit: INFERNAL user’s guide

Node | Description
MATP | Pair
MATL | Single strand, left
MATR | Single strand, right
BIF | Bifurcation
ROOT | Root
BEGL | Begin, left
BEGR | Begin, right
END | End

Rfam
• For RNA, analogous to Pfam (protein family)
• Mirrors:
  – Sanger Institute, Wellcome Trust Foundation, UK: http://rfam.sanger.ac.uk/
  – Howard Hughes Medical Institute Janelia Farm Research Campus, USA: http://rfam.janelia.org/

Rfam
• Three classes of families:
  – Non-coding RNA genes
  – Structured cis-regulatory elements
  – Self-splicing RNAs
• Each family provides the following:
  – Covariance models (CMs, slightly more complicated than profile HMMs) (patterns)
  – Multiple sequence alignments (conservation)
    • Seed alignment (one or more experimentally validated examples, possibly with other high-confidence predicted members)
    • Full alignment (based on CMs built from the Infernal software)
  – Consensus secondary structures (conservation)
• Current status:
  – Version 12.0 (July 2014), with 2450 families

Rfam
• If a sequence is queried against a family, a bit score is given to indicate how likely it is to belong to the family as compared to the background: $\text{bit-score} = \log_2(P_{CM} / P_{null})$
• Source of secondary structures in Rfam
  – Literature
    • Experimentally validated
    • Predictions
  – Predictions using the WAR software
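A small numeric sketch of the bit score: each additional bit means the sequence is twice as likely under the covariance model as under the null (background) model. The function below is an illustration only, not Infernal's implementation; it takes natural-log probabilities as input, an assumption made here because the raw probabilities of long sequences underflow in floating point.

```python
import math

def bit_score(log_p_cm, log_p_null):
    """Bit score log2(P_CM / P_null), computed from natural-log probabilities."""
    return (log_p_cm - log_p_null) / math.log(2)
```

For example, a sequence that is 8 times more likely under the CM than under the null model scores 3 bits, and a sequence that is equally likely under both scores 0.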

Example: RF00005
• Alignment

Image credit: Rfam

Example: RF00005
• Secondary structure

Image credit: Rfam

Pseudogenes
• Pseudogenes are former genes that have lost their ability to code for (the original) protein
• Classification:
  – By mechanism of creation:
    • Non-processed pseudogenes: Mutation (e.g., premature stop codon)
    • Processed pseudogenes: Reverse transcription (missing introns)
  – By copy of gene:
    • Duplicated copy
    • The only copy (unitary pseudogenes)

Identifying pseudogenes
• Look for sequences similar to annotated coding genes or with strong coding potential
• Consider those that cannot produce the corresponding protein

Image credit: Zhang et al., Bioinformatics 22(12):1437-1439, (2006)

Circular RNAs
• Some RNAs take a circular form
  – Due to back-splicing of exons
  – More stable
  – May act as miRNA decoys

Image credit: Wilusz and Sharp, Science 340(6131):440-441, (2013)

Identification of circular RNAs [project]
• Detection of back-splicing
  – Based on genomic location of exon annotation
  – Need to be distinguished from structural variants (SVs)

Image credit: Gao et al., Genome Biology 16(1):4, (2015)
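One way to read "based on genomic location of exon annotation" is sketched below: a splice junction observed in a split read is called a back-splice if both of its ends sit on annotated exon boundaries but the acceptor lies upstream of the donor in transcript order. This is a simplified illustration under assumed inputs, not the method of the cited paper; the function name and the junction representation (donor position, acceptor position) are hypothetical, and the sketch does not address the separate problem of ruling out structural variants.

```python
def find_back_splices(junctions, exons, strand):
    """Return junctions that look like back-splices.

    junctions: list of (donor, acceptor) genomic positions, where donor is
               the last exonic base before the junction and acceptor is the
               first exonic base after it, in transcript (5' to 3') order.
    exons:     list of (start, end) for one gene, 1-based inclusive, start <= end.
    strand:    '+' or '-'.
    """
    if strand == "+":
        donors = {end for start, end in exons}
        acceptors = {start for start, end in exons}
    else:
        # On the minus strand an exon's 3' end is its smaller genomic coordinate.
        donors = {start for start, end in exons}
        acceptors = {end for start, end in exons}
    hits = []
    for donor, acceptor in junctions:
        annotated = donor in donors and acceptor in acceptors
        # Linear splicing joins a donor to a downstream acceptor;
        # back-splicing joins it to an acceptor at or upstream of its own exon.
        reversed_order = acceptor < donor if strand == "+" else acceptor > donor
        if annotated and reversed_order:
            hits.append((donor, acceptor))
    return hits
```

With exons at 100-200, 300-400 and 500-600 on the plus strand, the junction (200, 300) is ordinary splicing, while (400, 300) joins the end of the second exon back to its own start, as expected for a single-exon circle.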


IDENTIFICATION OF RNA STRUCTURES, INTERACTIONS AND FUNCTIONS

Part 2

Identification of RNA structures [lecture]
• Sequence-based
  – Sequence conservation/co-conservation
  – Minimizing free energy
  – Partition function: Sample from the probabilistic distribution of structures
• Sequencing-based
  – RNA footprinting
  – High-throughput versions
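As a simplified stand-in for free energy minimization, the sketch below implements the Nussinov algorithm, which maximizes the number of nested base pairs by dynamic programming. Real energy-based methods score stacks and loops with measured thermodynamic parameters; this example only counts pairs. The function name, the allowed pairs (Watson-Crick plus G-U wobble) and the minimum hairpin loop of 3 unpaired bases are assumptions made for illustration.

```python
def nussinov(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence (Nussinov algorithm)."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j] (inclusive)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j is unpaired
            # case 2: j pairs with some k, leaving at least min_loop bases between them
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

The same table-filling pattern carries over to energy minimization: the pair count is replaced by an energy term and the maximum by a minimum.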

Identification of RNA interactions
• With DNAs
  – Sequence complementarity
• With RNAs
  – Sequence complementarity (more specific)
• With proteins
  – More difficult
  – High-throughput methods [project]

Micro RNAs (miRNAs)
• A miRNA can have its own gene, or can be within the intron of another gene
• After a number of processing steps, the final product is a single-stranded RNA that becomes part of the RNA-induced silencing complex (RISC)

Image credit: Narayanese at Wikipedia

miRNA targeting
• miRNA triggers mRNA cleavage or translational repression

Image credit: Kelvinsong at Wikipedia

miRNA naming convention
• Pre-miRNA: mir-<number> (e.g., mir-29)
• Mature miRNA: miR-<number> (e.g., miR-29)
• Specifying the species of origin: <species>-miR-<number> (e.g., hsa-miR-29)
• Nearly identical miRNAs: miR-<number><letter> (e.g., miR-29b)
• Pre-miRNAs at different genomic locations that code for 100% identical mature miRNAs: mir-<number>-<number> (e.g., mir-194-1)
• If two mature miRNAs are from different arms of the same pre-miRNA:
  – Standard: miR-<number>-<3p | 5p> (e.g., miR-337-3p)
  – If expression levels are known, the one with the lower expression level: miR-<number>* (e.g., miR-123*)
• Several of these conventions can be combined (e.g., hsa-miR-125a-5p)
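The naming rules can be captured in a single regular expression, sketched here in Python. The pattern and the function name `parse_mirna_name` are written for this illustration and cover only the forms listed above; in particular, the species prefix is assumed to be three or four lowercase letters, and names outside these rules (such as let-7) are not handled.

```python
import re

MIRNA_NAME = re.compile(
    r"^(?:(?P<species>[a-z]{3,4})-)?"   # optional species prefix, e.g. hsa
    r"(?P<kind>mir|miR)-"               # mir = precursor, miR = mature
    r"(?P<number>\d+)"
    r"(?P<letter>[a-z])?"               # nearly identical family member
    r"(?:-(?P<copy>\d+))?"              # genomic copy of the precursor
    r"(?:-(?P<arm>[35]p))?"             # arm of the precursor
    r"(?P<star>\*)?$"                   # lower-expressed arm
)

def parse_mirna_name(name):
    """Split a miRNA name into its parts; returns None if it does not match."""
    m = MIRNA_NAME.match(name)
    if m is None:
        return None
    parts = m.groupdict()
    parts["mature"] = parts.pop("kind") == "miR"
    parts["star"] = parts["star"] is not None
    return parts
```

For example, `parse_mirna_name("hsa-miR-125a-5p")` yields species `hsa`, number `125`, letter `a` and arm `5p`, with `mature` set to True.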

Page 50: Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Prediction of miRNA targets
• A miRNA recognizes and binds specific features on its target mRNAs (usually in the 3' UTR in animals)
  – In plants, usually a near-exact match
  – In animals, usually a good match at a ~6-nucleotide “seed region”
    • Possibly with effects from other positions
  – Secondary structure
• No perfect prediction algorithms
  – Poor consistency between the predictions of different algorithms
• The number of validated examples is small
  – How many?
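
As a minimal sketch of the seed-matching idea for animal miRNAs, the code below scans a 3' UTR for sites that are perfectly complementary to a 6-nucleotide seed. Taking the seed as positions 2-7 of the miRNA is a common convention and is assumed here; the slide only says "~6 nucleotide". The function names and both sequences are made up for illustration. This sketch ignores the other factors mentioned above (other positions, secondary structure), so it should not be read as a real target predictor.

```python
# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence."""
    return rna.translate(COMPLEMENT)[::-1]


def seed_sites(mirna, utr, seed_start=2, seed_len=6):
    """Return 0-based start positions in `utr` that pair perfectly with the seed.

    The seed is taken from `mirna` at 1-based position `seed_start`
    (assumption: positions 2-7 by default).
    """
    seed = mirna[seed_start - 1 : seed_start - 1 + seed_len]
    site = reverse_complement(seed)  # what the seed would pair with on the mRNA
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)  # allow overlapping sites
    return hits


if __name__ == "__main__":
    toy_mirna = "UACGGAUCAGCUAGCUAGCUAG"   # made-up sequence, 5' to 3'
    toy_utr = "GGGAUCCGUAAACCCAUCCGUUU"    # made-up 3' UTR, 5' to 3'
    print(seed_sites(toy_mirna, toy_utr))
```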

Page 51: Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Identification of miRNA targets

Image credit: Thomas et al., Nature Structural & Molecular Biology 17(10):1169-1174, (2010)

Page 52: Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Prediction of miRNA targets
• Comparison of some prediction methods:

Image credit: Thomas et al., Nature Structural & Molecular Biology 17(10):1169-1174, (2010)

Page 53: Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Experimental validation of miRNA targets
• Classification of evidence based on TarBase 7.0:
  – Not all are direct evidence

Method | Throughput | Intended use
Reporter Genes | Low | Validation of miRNA:UTR (or binding region) interaction
Northern Blotting | Low | Relative effect of miRNA on mRNA levels
qPCR | Low | Quantification of miRNA effect on mRNA levels
Western Blot | Low | Relative assessment of miRNA effect on protein concentration
ELISA | Low | Quantification of miRNA effect on protein concentration
5' RLM-RACE | Low | Identification of cleaved mRNA targets
Microarrays | High | High-throughput assessment of miRNA effect on mRNA expression
RNA-Seq | High | High-throughput assessment of miRNA effect on mRNA expression
Quantitative Proteomics (e.g., pSILAC) | High | High-throughput assessment of miRNA effects on protein concentration
AGO-IP | High | Identification of enriched transcripts (miRNAs and mRNAs) in AGO immunoprecipitates
HITS-CLIP | High | Sequencing of AGO binding regions on targeted transcripts
PAR-CLIP | High | Sequencing of AGO binding regions on targeted transcripts
CLASH / PAR-CLIP + Ligation | High | Sequencing of AGO binding regions on targeted transcripts. Production of chimeric miRNA:mRNA reads for the identification of interacting pairs.
Biotin miRNA tagging | High/Low | Pull-down of biotin-tagged miRNAs and estimation of bound transcript content using qPCR (Low yield), microarrays (High-throughput) and RNA-Seq (High-throughput)
IMPACT-Seq | High | Pull-down of biotin-tagged miRNAs, identification of interacting pairs and binding regions.
PARE / Degradome-Seq | High | High-throughput identification of cleaved mRNA targets
3Life | High | High-throughput reporter gene assay
miTRAP | High | miRNA trapping by RNA baiting

Table credit: Vlachos et al., Nucleic Acids Research 43(D1):D153-D159, (2015)

Page 54: Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

miRNA networks

Image credit: Liu et al., Briefings in Bioinformatics 15(1):1-19, (2014)

Page 55: Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

miRNA networks

Image credit: Liu et al., Briefings in Bioinformatics 15(1):1-19, (2014)

Page 56: Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

The ceRNA hypothesis
• Competing endogenous RNAs (ceRNAs): different target RNAs compete for the miRNAs that target them in common
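
One simple way to look for candidate ceRNA pairs is to list, for every pair of transcripts, the miRNAs that target both. The sketch below does only that. The function name `shared_mirnas` and the toy miRNA and gene names are hypothetical. Sharing miRNAs is only a starting point: the code says nothing about expression levels or about whether competition actually occurs.

```python
from itertools import combinations


def shared_mirnas(targets_of):
    """Map each pair of transcripts to the set of miRNAs that target both.

    `targets_of` maps a miRNA name to the set of transcripts it targets.
    Pairs with no miRNA in common are omitted.
    """
    # Invert the mapping: transcript -> miRNAs targeting it.
    mirnas_of = {}
    for mirna, targets in targets_of.items():
        for transcript in targets:
            mirnas_of.setdefault(transcript, set()).add(mirna)

    pairs = {}
    for a, b in combinations(sorted(mirnas_of), 2):
        common = mirnas_of[a] & mirnas_of[b]
        if common:
            pairs[(a, b)] = common
    return pairs


if __name__ == "__main__":
    toy_targets = {  # made-up interactions
        "miR-A": {"gene1", "gene2"},
        "miR-B": {"gene1", "gene2", "gene3"},
        "miR-C": {"gene4"},
    }
    for pair, mirnas in shared_mirnas(toy_targets).items():
        print(pair, sorted(mirnas))
```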

Image credit: Salmena et al., Cell 146(3):353-358, (2011)

Page 57: Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

miRBase
• List of miRNAs (and miRNA families)
  – Latest release (as of October 2015): Release 21 (June 2014), 28645 entries

Page 58: Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Modeling RNA-RNA interactions
• Computational pipeline:

Image credit: Schmitz et al., Nucleic Acids Research 42(12):7539-7552, (2014)

Page 59: Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Protein-RNA interactions
• Experimental methods:
  – Probed by immunoprecipitation (with or without cross-linking), or by oligo(dT) pull-down
  – Many types of experiment: RIP-seq, HITS-CLIP, PAR-CLIP, iCLIP, gPAR-CLIP, ...
  – Each type of data has its own properties and biases
    • Need proper data processing and normalization

Page 60: Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

RNA functions
• By sequence homology
  – Borrowing the known annotated function of another RNA with a similar sequence
• By structure
  – Borrowing the known annotated function of another RNA with a similar structure
• By target
  – Borrowing the known annotated function of the target gene
• De novo annotation
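
The "by sequence homology" idea can be sketched as a nearest-neighbour lookup: find the annotated RNA most similar to the query and borrow its function if the similarity is high enough. For brevity, this example uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a proper sequence alignment tool. The function name, the 0.8 threshold, and the toy sequences and annotations are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def transfer_annotation(query, annotated, min_similarity=0.8):
    """Borrow the function of the most similar annotated RNA.

    `annotated` maps an RNA name to a (sequence, function) tuple.
    Returns (function, name, similarity), or None if nothing is similar enough.
    Similarity is SequenceMatcher's ratio, a crude substitute for alignment.
    """
    best_name, best_score = None, 0.0
    for name, (sequence, _function) in annotated.items():
        score = SequenceMatcher(None, query, sequence).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None or best_score < min_similarity:
        return None
    return annotated[best_name][1], best_name, best_score


if __name__ == "__main__":
    toy_db = {  # made-up sequences and annotations
        "rnaA": ("ACGUACGUACGU", "toy function A"),
        "rnaB": ("GGGGCCCCUUUU", "toy function B"),
    }
    print(transfer_annotation("ACGUACGUACGA", toy_db))
    print(transfer_annotation("UUUUUUUUUUUU", toy_db))
```

The same lookup would apply to the "by structure" case if the similarity measure compared predicted structures instead of sequences.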

Page 61: Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Function of lncRNAs
• More variability, less known
• Four main archetypes:

Image credit: Wang and Chang, Molecular Cell 43(6):904-914, (2011)

Page 62: Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Summary
• Identification of RNAs in a genome
  – mRNAs
  – Structured RNAs
  – miRNAs
  – Pseudogenes
  – Long non-coding RNAs
  – Circular RNAs
• Identification of RNA structures, interactions and functions