RNA-seq short intro - Göteborgs universitet (bio.lundberg.gu.se/courses/vt13/rnaseq_intro.pdf)

RNA-Seq practical

Basic processing: UNIX tools and IGV

Erik Larsson

RNA-Seq practical

•  TopHat
   –  Alignment
•  IGV
   –  Visualization
•  Cufflinks
   –  Gene discovery
   –  Find differentially expressed genes

<3% coding sequence

~40% coding genes

[Figure: excerpt of human genomic DNA sequence]

~60% transcribed

The human transcriptome (according to GENCODE v11)

19,999 Protein-coding
12,534 Pseudogene
10,419 LncRNA
1,944 SnRNA
1,756 MicroRNA
1,521 SnoRNA
1,190 Misc. RNA

Shahrouki, Larsson, Frontiers in Genetics 2012

RNA-seq, RNA sequencing, transcriptome sequencing, total RNA-seq, mRNA-seq, miRNA-seq…

•  Many names, sometimes meaning the same thing
•  All about characterizing RNA with next-generation sequencing (NGS) in one way or another

Microarrays vs. RNA-seq

Microarrays:

•  Simultaneously quantify most known genes

RNA-seq:

•  Simultaneously quantify all known genes at high accuracy
•  Identify new genes
•  Study splicing patterns
•  Discover mutations
•  Fusion transcripts
•  Find viruses
•  Allele-specific expression
•  …

New toys

•  Applied Biosystems 3730 (2002): 50,000-100,000 bp per run
•  Illumina HiSeq 2000 (2010): ~200,000,000,000 bp per run
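
As a rough back-of-the-envelope comparison of the two per-run figures above, the fold difference can be worked out as follows. This is only a sketch; the variable names are made up for illustration.

```python
# Per-run output as quoted on the slide, in base pairs.
sanger_bp_per_run = (50_000, 100_000)   # Applied Biosystems 3730: low and high figure
hiseq_bp_per_run = 200_000_000_000      # Illumina HiSeq 2000: approximate figure

# Fold increase, using the high and the low Sanger figure respectively.
fold_low = hiseq_bp_per_run // sanger_bp_per_run[1]
fold_high = hiseq_bp_per_run // sanger_bp_per_run[0]

print(f"Roughly {fold_low:,} to {fold_high:,} times more bases per run")
```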

NGS principle (Illumina/Solexa)

•  Add labeled nucleotides, primers, polymerase
•  Take picture to figure out first base in each cluster
•  Remove terminators and repeat everything many times

Source: Illumina

Standard RNA-seq workflow (polyA+)

•  Isolate polyA+
•  Fragmentation
•  Add random primers
•  cDNA synthesis (first and second strand)
•  Ligate adapters
•  Sequencing

Directional/strand-specific RNA-seq: dUTP method

Levin et al, Nature Methods 2010

[Figure: steps of the dUTP method: RNA, dsDNA with uracil (U) incorporated in one strand, adapters, UNG treatment]

RNA-seq data analysis

•  Alignment
•  Gene discovery
•  Expression quantification
•  Testing for differential expression
•  Variant discovery

Pairwise alignment

•  Figure out where one sequence belongs within another sequence

•  Trivial if not for substitutions, insertions, deletions

Genome: TGCGTACGCTCGATAGCTCGCATCGCTAGCCTCGCATAGCTAGCGATCGT
                         ||||||||||||||||||| |||||||
RNA:                     TCGCATCGCTAGCCTCGCAGAGCTAGC
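
The example above can be reproduced with a minimal sketch that slides the RNA sequence along the genome and counts mismatches at each position. It handles substitutions only, not insertions or deletions, and the function name `best_ungapped_hit` is made up for illustration; real aligners use far more efficient index-based methods.

```python
def best_ungapped_hit(read, reference):
    """Return (offset, mismatches) of the placement with the fewest mismatches.

    Brute-force scan over every offset; substitutions only, no gaps.
    """
    best_offset, best_mismatches = None, len(read) + 1
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


genome = "TGCGTACGCTCGATAGCTCGCATCGCTAGCCTCGCATAGCTAGCGATCGT"
rna = "TCGCATCGCTAGCCTCGCAGAGCTAGC"

offset, mismatches = best_ungapped_hit(rna, genome)

# Draw the alignment: "|" for a match, " " for a mismatch.
window = genome[offset:offset + len(rna)]
bars = "".join("|" if a == b else " " for a, b in zip(rna, window))
print(genome)
print(" " * offset + bars)
print(" " * offset + rna)
```

The scan costs time proportional to genome length times read length, which is why this approach is only useful as an illustration.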

Aligning RNA-seq reads

•  Why? Figure out where the reads were transcribed from
•  Required prior to most analyses

Two main options:

•  Align to transcriptome
   –  Fast, simple
   –  Avoids problems with “spliced”/junction-spanning reads
•  Align to genome
   –  Requires a specialized RNA-seq aligner (one that can handle junction-spanning reads)

Gapped alignments

•  Aligners for RNA-seq will need to handle gapped alignments

•  Junction-spanning reads will otherwise be lost

[Figure: genome, spliced mRNA (AAA = poly-A tail), NGS reads]
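
To illustrate why junction-spanning reads need gapped alignment, here is a toy sketch. The sequences are invented, and `find_spliced` is a hypothetical helper that uses exact matching only. A read taken from the spliced mRNA cannot be found as one contiguous piece in the genome, but it can be found once it is split into two parts separated by a gap (the intron).

```python
def find_spliced(read, genome, min_anchor=4):
    """Find a read in the genome, allowing one gap (a toy model of an intron).

    Returns a dict describing the hit, or None. Exact matching only.
    """
    pos = genome.find(read)
    if pos != -1:
        # The read matches contiguously, no gap needed.
        return {"left_start": pos, "left_len": len(read), "right_start": None, "gap": 0}
    # Try every split point that leaves at least min_anchor bases on each side.
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        while left_pos != -1:
            # The right part must start downstream of the left part, after a gap.
            right_pos = genome.find(right, left_pos + split + 1)
            if right_pos != -1:
                return {"left_start": left_pos, "left_len": split,
                        "right_start": right_pos, "gap": right_pos - (left_pos + split)}
            left_pos = genome.find(left, left_pos + 1)
    return None


# Invented toy gene: two exons separated by one intron.
exon1 = "ATGGCCTTAGCA"
intron = "GTAAGTCCCTTTTCCCTTCTAG"
exon2 = "GACCTGAAGTTC"
genome = "TTTT" + exon1 + intron + exon2 + "CCCC"
mrna = exon1 + exon2

read = mrna[6:18]                 # spans the exon1/exon2 junction
ungapped = genome.find(read)      # -1: the read is not contiguous in the genome
hit = find_spliced(read, genome)
print(ungapped, hit)
```

The reported gap equals the intron length in this toy case. Real splice-aware aligners also tolerate mismatches and check splice-site signals, which this sketch does not attempt.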

Splice-junction aware aligners

•  TopHat
   –  Popular option, big online user community
   –  Finds new junctions but can be guided by known annotation
   –  Cuts up reads into smaller pieces and calls the Bowtie short-read aligner
•  SOAPsplice
•  SpliceMap
•  …
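
The idea of cutting reads into smaller pieces can be sketched as follows. This is a loose illustration and not TopHat's actual implementation: exact string search stands in for the Bowtie aligner, the segment length of 6 is chosen to fit the invented toy sequences, and all function names are hypothetical.

```python
def split_into_segments(read, segment_length):
    """Cut a read into consecutive pieces of at most segment_length bases."""
    return [read[i:i + segment_length] for i in range(0, len(read), segment_length)]


def map_segments(read, genome, segment_length=6):
    """Map each segment independently; -1 means the segment was not found."""
    hits = []
    for index, segment in enumerate(split_into_segments(read, segment_length)):
        # str.find is a stand-in for a short-read aligner here.
        hits.append((index * segment_length, segment, genome.find(segment)))
    return hits


# Invented toy gene: two exons separated by one intron.
exon1 = "ATGGCCTTAGCA"
intron = "GTAAGTCCCTTTTCCCTTCTAG"
exon2 = "GACCTGAAGTTC"
genome = "TTTT" + exon1 + intron + exon2 + "CCCC"

read = "GCCTTAGCAGACCTGAAG"   # last 9 bases of exon1 followed by first 9 of exon2
hits = map_segments(read, genome)
for read_offset, segment, genome_pos in hits:
    print(read_offset, segment, genome_pos)
```

The first and last segments map 34 bases apart in the genome although they are only 12 bases apart in the read, and the middle segment does not map at all. That pattern is the hint that a splice junction lies inside the middle segment.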

TopHat output visualized using IGV (human ACTB locus)

Transcriptome assembly/gene discovery

•  Task:
   –  Use aligned reads to discover genes and figure out transcript structures
•  Tools:
   –  Cufflinks
      •  Most popular choice
      •  Lots of online support, actively developed
   –  Scripture
   –  Trans-ABySS

Cufflinks discovers new transcripts/genes from aligned reads

[Figure: aligned reads, discovered transcript isoforms, abundance estimates]
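
Cufflinks reports abundance in FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below shows only that normalization, with invented numbers; Cufflinks itself uses a statistical model to assign fragments to isoforms before normalizing, which is not reproduced here. The function name `fpkm` and its parameters are made up for illustration.

```python
def fpkm(fragments_on_transcript, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments_on_transcript * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Invented example: 500 fragments on a 2,000 bp transcript,
# in a library with 20 million mapped fragments.
example = fpkm(500, 2_000, 20_000_000)
print(example)
```

Dividing by transcript length and by library size makes values comparable between long and short transcripts and between samples sequenced to different depths.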

Testing for differential expression

•  Normal t-test not optimal
   –  RNA-seq is “digital” (read counts) rather than continuous
•  Negative binomial distribution is better
   –  edgeR, DESeq
      •  Run in the R environment
   –  Cuffdiff (Cufflinks package)
      •  + Easy: use alignments without prior quantification
      •  + Can test for differential splicing
      •  - Very conservative
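
One way to see why a negative binomial model suits read counts is sketched below. The replicate counts are invented, and the method-of-moments dispersion estimate is a simplification, not what edgeR, DESeq or Cuffdiff actually compute. Counts from biological replicates typically vary more than a Poisson model (variance equal to the mean) allows, and the negative binomial adds a dispersion parameter to absorb that extra variance.

```python
import statistics

# Invented read counts for one gene across five biological replicates.
replicate_counts = [90, 110, 150, 60, 95]

mean = statistics.mean(replicate_counts)
variance = statistics.variance(replicate_counts)   # sample variance (n - 1)

# Under a Poisson model the variance would equal the mean.
# The negative binomial models variance = mu + alpha * mu**2.
dispersion = (variance - mean) / mean**2            # method-of-moments estimate of alpha


def nb_variance(mu, alpha):
    """Variance of a negative binomial with mean mu and dispersion alpha."""
    return mu + alpha * mu**2


print(f"mean={mean}, variance={variance}, dispersion={dispersion:.3f}")
```

Here the sample variance is about ten times the mean, so a Poisson model would badly understate the variability between replicates.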

http://bio.lundberg.gu.se/courses/vt13/rnaseq.html

Read intro carefully

Good luck!
