Seyed Abolfazl Motahari RNA-seq Data Analysis Basics

Seyed Abolfazl Motahari RNA-seq - ce.sharif.educe.sharif.edu/courses/93-94/2/ce795-1/resources/root/RNA-seq.pdf · Recent studies analyzed gene expression and alternative splicing

Page 1

Seyed Abolfazl Motahari

RNA-seq Data Analysis

Basics

Page 2

Next Generation Sequencing

[Figure labels: Biological Samples; Data; Data Volume; Cost]

Page 3

Big Data Analysis in Biology

• Disease diagnosis
• Biological protection
• Drug design
• Production of new biological materials
• Control of biological systems
• Data analysis

Page 4

Read

The sequence generated by a sequencing machine from a DNA fragment.

[Figure: DNA Fragment → Sequencer → Read: ACCGTAACCTACTTAGTA]

Page 5

Paired-end Read

Data from a pair of reads sequenced from ends of the same DNA fragment. The genomic distance between the reads is approximately known and is used to constrain assembly solutions.

[Figure: DNA Fragment → Sequencer → Reads: ACCGTAACCTACTTAGTA and CTAGTAACCTTAACCGTA]
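As a rough illustration of how an approximately known distance can constrain placement, the sketch below keeps only those candidate positions for the second read that imply a plausible fragment length. The function name, the coordinates, the expected fragment length and the tolerance are hypothetical values chosen for the example; none of them come from the slides.

```python
def plausible_mate_positions(pos1, candidates2, read_len, expected_fragment, tolerance):
    """Keep candidate start positions of read 2 that fit the expected fragment length.

    pos1 is the start of read 1; each candidate is a possible start of read 2
    further along the same reference. All values are in bases.
    """
    kept = []
    for pos2 in candidates2:
        # The fragment spans from the start of read 1 to the end of read 2.
        fragment_len = (pos2 + read_len) - pos1
        if abs(fragment_len - expected_fragment) <= tolerance:
            kept.append(pos2)
    return kept


# Hypothetical example: read 2 matches three places (for instance a repeat),
# but only one of them is consistent with a fragment of about 350 bases.
kept = plausible_mate_positions(
    pos1=1000,
    candidates2=[1250, 5400, 90210],
    read_len=100,
    expected_fragment=350,
    tolerance=50,
)
print(kept)
```

Only the candidate at 1250 implies a fragment close to the expected length, so the other two placements are discarded.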

Page 6

Mate-pair Read

Similar to paired-end reads, but with longer inserts.

[Figure labels: Long DNA Fragment (12 kb); Biotinylated (B); Shearing; Capture]

Page 7

Sequencing Technologies

Basic Methods
• Maxam-Gilbert Sequencing
• Sanger Sequencing

Next Generation Methods
• Polony Sequencing
• 454 Pyrosequencing
• Illumina Sequencing
• SOLiD Sequencing
• Ion Torrent Semiconductor Sequencing
• DNA Nanoball Sequencing
• Heliscope Single Molecule Sequencing
• Single Molecule Real-time Sequencing

Methods in Development
• Nanopore DNA Sequencing
• Tunneling Currents DNA Sequencing
• Sequencing by Hybridization
• Sequencing with Mass Spectrometry
• Microfluidic Sanger Sequencing
• Microscopy-based Sequencing

Page 8

Comparison of Sequencers

Technology   Read Length (bases)   Error Rate   Paired-end
Sanger       up to 2k              2%           Yes
ABI/SOLiD    75                    2%           Yes
Illumina     100-150               2%           Yes
Roche/454    400-600               4%           No
IonTorrent   200                   4%           No
PacBio       up to 15k             18%          Yes

Potentially, all sequencing technologies can be used to sequence mate-pair libraries obtained by the circularization of long DNA fragments.

Page 9

NGS Applications

[Figure labels: Biological samples; Genomes (DNA); Transcriptome (RNA); Size selection; Capturing; Whole Genome; Capturing Regions; RNA-seq; miRNA-seq; CLIP-seq; Whole Exome; Epigenomics; ChIP-seq; methyl-seq; DNase-seq]

Page 10

Typical Workflow

DNA Prep → Library Prep → Chip Prep → Sequencing
Raw Data → Pre-processing → Assembly (De Novo/Alignment) → App

Page 11

Tasks

Page 12

UNIX

Advantages of Unix
- Unix is more flexible and can be installed on many different types of machines, including mainframe computers, supercomputers and microcomputers.
- Unix is more stable and does not go down as often as Windows does, and therefore requires less administration and maintenance.
- Unix has greater built-in security and permissions features than Windows.
- Unix possesses much greater processing power than Windows.
- Unix is the leader in serving the Web. About 90% of the Internet relies on Unix operating systems running Apache, the world's most widely used Web server, which is free.
- Software upgrades from Microsoft often require the user to purchase new or more hardware or prerequisite software. That is not the case with Unix.
- The mostly free or inexpensive open-source operating systems, such as Linux and BSD, with their flexibility and control, prove to be very attractive to (aspiring) computer wizards. Many of the smartest programmers are developing state-of-the-art software free of charge for the fast-growing "open-source movement".
- Unix also inspires novel approaches to software design, such as solving problems by interconnecting simpler tools instead of creating large monolithic application programs.

Page 13

UNIX

Page 14

Command Line

Graphical User Interface (GUI)

Command Line Interface (CLI)

Page 15

General Command

command [-options] [args ...]

[Annotations on the example below: the “prompt”; the current directory (“path”); the host]

MacBook-Pro:abolfazl$ bowtie -v 3 -S human reads.fastq > aligned.sam
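To make the pattern `command [-options] [args ...]` concrete, here is a minimal Python sketch that splits the example line into its parts with the standard `shlex` module. The variable names are illustrative only. As far as I know, in bowtie (version 1) `-v 3` allows up to three mismatches and `-S` requests SAM output, so the `3` is the value of `-v` rather than a positional argument; the sketch does not try to model that, it only separates tokens that look like options.

```python
import shlex

line = "bowtie -v 3 -S human reads.fastq > aligned.sam"
tokens = shlex.split(line)

# Everything from ">" onwards is handled by the shell (output redirection)
# and is not passed to the program itself.
if ">" in tokens:
    cut = tokens.index(">")
    argv, redirect_target = tokens[:cut], tokens[cut + 1]
else:
    argv, redirect_target = tokens, None

command = argv[0]                                      # the program to run
options = [t for t in argv[1:] if t.startswith("-")]   # tokens that look like options

print(command)
print(options)
print(redirect_target)
```

The remaining tokens (`3`, `human`, `reads.fastq`) are option values and arguments; which is which depends on the program's own conventions.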

Page 16

Unix File System

The Path

/home/Abolfazl/Data

/
├── Applications
├── bin
├── home
│   ├── Abolfazl
│   │   └── Data
│   └── Sajjad
└── usr
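A path is simply the list of directories walked from the root `/` down to the target. The short sketch below, using Python's standard `pathlib`, takes the path from the slide apart; it only manipulates the string and does not require the directories to exist.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/home/Abolfazl/Data")

print(path.parts)    # components from the root downwards
print(path.parent)   # the directory that contains Data
print(path.name)     # the last component
```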

Page 17

Genome Format

Fasta Format

Header:
>gi|123440403|ref|NC_008800.1| Yersinia enterocolitica subsp. enterocolitica 8081 chromosome, complete genome

Sequence:
GATCTTTTTATTTAAAGATCTCTTTATTAGATCTCTTATTAGGATCATGACCTTCTGTGGATAAGTGATTATTCACATTTAAGATCATGTGATTAAGGAGGATCGTTTGCTGTGAATGATCGGTGATCCTATTGCGTATAAGCTGGGATCTAAATGGCATGTTATGCACAGGCACTTTAAGTTACTAAGGTTGTTATGTGGATATGTACTGCTTATACCCTGCTTTCAAGCTTACTTATCCACATTCGTTCGCGTGATCTTTAAGCAAATTAGAGTAAATTAATCCAGTTTTTAACCCAAATCTCTGCCGGATCCTCAGGAATTTCATGTTTGATGACGTCAATTTCTAAAATATCACCCACACGAATGGCTCCCTGGATGATCAGTTGCTGATCCAATTTTCTGACCGCACCACAGAAAGTGTCATATTCTGAACTGCCCAAACCAACAGCACCAAAGCGAACCTGTGAGAGATCCGGTCTCTGCTGCTCGATTTGTTCTAATAAGGGTTGAAGATTGTCTGGCAGATCACCTGCACCATGAGTGGAAGTGATTATTAGCCACATACCATCCAAGGTCAGCTCGTCTAATTCCGGGCCATGCAGAGTTTCTGTCGTGAAACCCGCCTCTTCTAATTTCTCAGCTAAATGTTCAGCAACGTATTCAGCACTGCCAAGCGTACTGCCACTGATCAAGGTAATGTCAGCCATAAAGACCCCAACCGAAGTAATGAACCGGTATTGTACGCTGTGAATCAGCTGGGATCTACCTGTGGATAATGTGGGTATAGTTATTTAGTGCTCAGGGCACGATGGTACGCATGATGGGGTTTTGCAGGGAAATAAGAGTCTCGGTTGACTGGATCTCATCAATAGTTTGGATCTTGTTGATAAGTACCTGTTGCAGTGCATCTATCGATTTACACATGACCTTAATAAAGATGCTGTAATGGCCAGTGGTGTAATAGGCCTCGACAACTTCTTCTAAACTTTCCAGTTTTTTTAATGCAGAAGGGTAATCTTTGGCACTTTTCAAAATGATGCCGATGAA
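The header/sequence structure can be read with a few lines of Python. This is a minimal sketch, not a replacement for an established parser: `parse_fasta` is a hypothetical helper name, and the example input uses only the first bases of the sequence shown on the slide, wrapped over two lines to show that a sequence may span several lines.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = [
    ">gi|123440403|ref|NC_008800.1| Yersinia enterocolitica subsp. enterocolitica 8081 chromosome, complete genome",
    "GATCTTTTTATTTAAAGATCTCTTTATTAGATCTCTTATTAGGATCATGACC",
    "TTCTGTGGATAAGTGATTATTCACATTTAAGATCATGTGATTAAGGAGGATC",
]
records = list(parse_fasta(example))
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

With a real file, the same function can be fed an open file object, for example `parse_fasta(open("genome.fasta"))`, where the file name is a placeholder.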

Page 18

Read Format (Fastq)

FASTQ files extend FASTA files in that they provide both sequence and quality. A FASTQ record thus typically consists of four lines:

1. A line starting with @, containing the sequence identifier (header)
2. The actual sequence
3. A line starting with +, after which the sequence identifier is optional
4. A line with quality values (quality scores), encoded as ASCII characters

As such, the 2nd and 4th lines must have the same length. One such entry is given below, showing one sequence beginning "ATGTCT...".

@HWI-ST999:102:D1N6AACXX:1:1101:1235:1936 1:N:0:
ATGTCTCCTGGACCCCTCTGTGCCCAAGCTCCTCATGCATCCTCCTCAGCAACTTGTCCTGTAGCTGAGGCTCACTGACTACCAGCTGCAG
+
1:DAADDDF<B<AGF=FGIEHCCD9DG=1E9?D>CF@HHG??B<GEBGHCG;;CDB8==C@@>>GII@@5?A?@B>CEDCFCC:;?CCCAC
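The four-line record and the equal-length rule can be checked with a short Python sketch. `parse_fastq` and `phred_scores` are hypothetical helper names. The record is the one from the slide, split so that the sequence and quality lines have the same length. The quality offset of 33 (Phred+33) is an assumption: the slide does not state the encoding, but characters such as `1` and `:` would give negative scores under an offset of 64.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) from an iterable of FASTQ lines."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(rows) - 3, 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string into integer scores (offset 33 is assumed)."""
    return [ord(ch) - offset for ch in qual]


record = [
    "@HWI-ST999:102:D1N6AACXX:1:1101:1235:1936 1:N:0:",
    "ATGTCTCCTGGACCCCTCTGTGCCCAAGCTCCTCATGCATCCTCCTCAGCAACTTGTCCTGTAGCTGAGGCTCACTGACTACCAGCTGCAG",
    "+",
    "1:DAADDDF<B<AGF=FGIEHCCD9DG=1E9?D>CF@HHG??B<GEBGHCG;;CDB8==C@@>>GII@@5?A?@B>CEDCFCC:;?CCCAC",
]
reads = list(parse_fastq(record))
name, seq, qual = reads[0]
scores = phred_scores(qual)
print(name)
print(len(seq), len(qual))
print(min(scores), max(scores))
```

Each score Q relates to an estimated error probability of 10^(-Q/10), so a score of 20 corresponds to roughly one expected error in 100 base calls.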

Page 19

RNA-seq

WHY?
1- To assemble the transcriptome
2- To find the expression levels

Assembly Paradigms (depending on other sources):
1- De Novo
2- Genome Guided
3- Transcriptome Guided

News and Views. Nature Biotechnology, Volume 28, Number 5, May 2010, p. 421

Advancing RNA-Seq analysis
Brian J Haas & Michael C Zody

Brian J. Haas and Michael C. Zody are at the Broad Institute, Cambridge, Massachusetts, USA.

New methods for analyzing RNA-Seq data enable de novo reconstruction of the transcriptome.

Sequencing of RNA has long been recognized as an efficient method for gene discovery [1] and remains the gold standard for annotation of both coding and noncoding genes [2]. Compared with earlier methods, massively parallel sequencing of RNA (RNA-Seq) [3] has vastly increased the throughput of RNA sequencing and allowed global measurement of transcript abundance. Two reports in this issue introduce approaches for RNA-Seq analysis that capture genome-wide transcription and splicing in unprecedented detail. Trapnell et al. [4] describe a software package, Cufflinks, for simultaneous discovery of transcripts and quantification of expression levels and apply it to study gene expression and splicing during the differentiation of mouse myoblast cells. Taking a similar approach, Guttman et al. [5] use software called Scripture to reannotate the transcriptomes of three mouse cell lines, defining complete gene models for hundreds of new large intergenic noncoding RNAs (lincRNAs) [6].

Although transcript sequencing has been possible for nearly 20 years, until recently it required the construction of clone libraries. Projects to determine full-length gene structures for human, mouse and other important models have taken years to complete [7]. With new sequencing technologies, no cloning is needed, allowing direct sequencing of cDNA fragments. In a matter of days and at a small fraction of the cost of earlier projects, one can achieve reasonably complete coverage of a transcriptome [8]. But this approach has been hindered by a substantial challenge: without cloning, one cannot know a priori which reads came from which transcripts. Recent studies analyzed gene expression and alternative splicing by mapping short RNA-Seq reads to previously known or predicted transcripts [9,10]. Although highly informative, such studies are inherently limited to known genes and to alternative splicing across previously identified splice junctions. To fully leverage RNA-Seq data for biological discovery, one should be able to reconstruct transcripts and accurately measure their relative abundance without reference to an annotated genome.

Previous efforts to reconstruct transcripts from short RNA-Seq reads have followed two general strategies (Fig. 1). The first, a de novo assembly approach implemented in the ABySS software [11], reduces the annotation problem to that of aligning full-length cDNAs, which is well handled by several algorithms. This method is also applicable to the discovery of transcripts that are missing or incomplete in the reference genome and to RNA-Seq data from organisms lacking a genome reference.

[Figure 1 labels: RNA-Seq reads; Align reads to genome; Assemble transcripts de novo; Assemble transcripts from spliced alignments; More abundant; Less abundant; Align transcripts to genome; Genome]

Figure 1 Strategies for reconstructing transcripts from RNA-Seq reads. The ‘align-then-assemble’ approach (left) taken by Trapnell et al. [4] and Guttman et al. [5] first aligns short RNA-Seq reads to the genome, accounting for possible splicing events, and then reconstructs transcripts from the spliced alignments. The ‘assemble-then-align’ approach (right) first assembles transcript sequences de novo (that is, directly from the RNA-Seq reads). These transcripts are then splice-aligned to the genome to delineate intron and exon structures and variations between alternatively spliced transcripts. As de novo assembly is likely to work only for the most abundant transcripts, the align-then-assemble method should be more sensitive, although this warrants further investigation. RNA-Seq reads are colored according to the transcript isoform from which they were derived. Protein-coding regions of reconstructed transcript isoforms are depicted in dark colors.

Page 20

De Novo Assembly

Reads → Transcriptome Assembly → Quantitative Analysis

Page 21

Assembly with Available Genome

RNA-seq Reads (RNA) + Genome (DNA) → Transcriptome Assembly → Quantitative Analysis

Page 22

Assembly with Available Transcriptome

RNA-seq Reads (RNA) + Transcriptome (RNA) → Quantitative Analysis

Page 23

Assembly with Available Genome

Page 24

Pipeline

1. Sequencing: RNA-seq reads (2 x 100 bp) (.fastq files)
2. Quality Control: FASTX/FastQC
3. Read Alignment: Bowtie/Tophat
4. Transcript assembly: Cufflinks
5. Gene identification: Cufflinks (cuffmerge)
6. Differential expression: Cuffdiff (A:B Comparison)
7. Visualization: IGV

Additional inputs:
• Reference genome (.fasta files)
• Gene annotation (.gtf files)

Page 25

Quality Control

Quality assessment is the first step of the bioinformatics pipeline of RNA-Seq.

Filtering data: removing low-quality sequences or bases (trimming), adaptors, contaminants or overrepresented sequences to ensure a coherent final result.

Packages:

• FastQC
  Developed in Java. Results are presented as permanent HTML reports.

• FASTX
  - Conversion from FASTQ to FASTA format
  - Statistics on quality
  - Removing sequencing adapters; filtering and cutting sequences based on quality
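To show what quality-based trimming and filtering mean in practice, here is a minimal Python sketch. It is not the algorithm used by FastQC or FASTX; the function names, the threshold of 20 and the Phred+33 offset are assumptions made for the example, and the read is a toy one.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Cut bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_filter(qual, min_mean_q=20, offset=33):
    """Keep a read only if its mean quality reaches min_mean_q."""
    scores = [ord(ch) - offset for ch in qual]
    return bool(scores) and sum(scores) / len(scores) >= min_mean_q


# Toy read: the last two bases have very low quality ('#' = 2, '!' = 0).
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIIII#!")
print(trimmed_seq, trimmed_qual)
print(passes_filter(trimmed_qual))
print(passes_filter("####"))
```

Trimming removes the unreliable tail of an otherwise good read, while filtering discards reads that are poor throughout.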

Page 26

FastQC

Page 27

FASTX

Page 28

Read Mapping Strategies

Short Read Aligners

• Unspliced aligners: bowtie, bwa, soap
• Spliced aligners: tophat, MapSplice, SpliceMap

Garber et al. Nature Methods 8, 469–477 (2011)

Page 29

Aligners

Which read aligner should I use?

http://wwwdev.ebi.ac.uk/fg/hts_mappers/

[Figure legend: RNA; Bisulfite; DNA; microRNA]

Source: Module 2 – RNA-seq alignment and visualization, bioinformatics.ca

Page 30

Short Read Aligners

Name                              Paired-end   Use FASTQ quality   Gapped          Multi-threaded
Bowtie                            Yes          Yes                 No              Yes
BWA                               Yes          No                  Yes             Yes
SHRiMP                            Yes          Yes                 Yes             Yes
SOAP, SOAP2, SOAP3 and SOAP3-dp   Yes          No                  SOAP3-dp: Yes   Yes

Description

• Bowtie: Based on the Burrows-Wheeler transform. 1.3 GB memory footprint for the human genome. Aligns more than 25 million Illumina reads in 1 CPU hour.
• BWA: Based on the Burrows-Wheeler transform. It is a bit slower than Bowtie but allows indels in alignment.
• SHRiMP: Indexes the reference genome as of version 2. Uses masks to generate possible keys.
• SOAP, SOAP2, SOAP3 and SOAP3-dp: SOAP is robust with a small (1-3) number of gaps and mismatches. SOAP2 uses a bidirectional BWT and is much faster than the first version. SOAP3 is a GPU-accelerated version. SOAP3-dp is also GPU-accelerated.

Page 31

Bowtie

[Figure label: Exons]

bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]
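The usage line can be read as a template: options first, then the index base name (`<ebwt>`), then either a pair of mate files (`-1`, `-2`) or a single reads file, and finally an optional output file (`<hit>`). The sketch below assembles such an argument list in Python. The helper name `bowtie_argv` and the file names are hypothetical; only the flags that appear on the slides are used, and the command is printed rather than executed.

```python
def bowtie_argv(ebwt, single=None, mate1=None, mate2=None, hit=None, options=()):
    """Assemble a bowtie argument list following the usage line on the slide."""
    argv = ["bowtie", *options, ebwt]
    if mate1 is not None and mate2 is not None:
        argv += ["-1", mate1, "-2", mate2]   # paired-end input (<m1>, <m2>)
    elif single is not None:
        argv.append(single)                  # unpaired input (<s>)
    else:
        raise ValueError("give either single, or both mate1 and mate2")
    if hit is not None:
        argv.append(hit)                     # optional output file (<hit>)
    return argv


# Hypothetical file names; "human" is the index name used earlier in the slides.
paired = bowtie_argv(
    "human",
    mate1="reads_1.fastq",
    mate2="reads_2.fastq",
    hit="aligned.sam",
    options=["-S"],
)
print(" ".join(paired))
```

If bowtie is installed, a list built this way could be handed to `subprocess.run`, which avoids shell quoting issues with unusual file names.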

Page 32

Tophat

tophat [options]* <genome_index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]

[Figure label: Exons]

Page 33

Tophat

tophat [options]* <genome_index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]

[Figure labels: Exons 1, 2, 3; isoforms]

Page 34

Tophat

• TopHat is a ‘splice-aware’ RNA-seq read aligner
• Requires a reference genome
• Breaks reads into pieces, uses the ‘bowtie’ aligner to first align these pieces
• Then extends alignments from these seeds and resolves exon edges (splice junctions)

Source: Module 2 – RNA-seq alignment and visualization, bioinformatics.ca (Bowtie/TopHat)

Trapnell et al. 2009
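The seed-and-extend idea in the bullets above can be illustrated with a toy Python sketch. This is not TopHat's actual algorithm: exact string search (`str.find`) stands in for the bowtie aligner, only the first and last pieces of the read are used as seeds, and the genome, exons and read are made-up sequences.

```python
def align_exact(genome, piece):
    """Stand-in for an unspliced aligner: start of an exact match, or -1."""
    return genome.find(piece)


def spliced_align(genome, read, seg_len):
    """Return a list of (start, end) genome blocks covering the read, or None."""
    # 1. Try to place the whole read contiguously (no splice needed).
    pos = align_exact(genome, read)
    if pos != -1:
        return [(pos, pos + len(read))]

    # 2. Break the read into pieces and align the outer pieces as seeds.
    left = align_exact(genome, read[:seg_len])
    right = align_exact(genome, read[-seg_len:])
    if left == -1 or right == -1:
        return None
    right_end = right + seg_len

    # 3. Extend from the seeds: look for the split point where the left part
    #    of the read matches after `left` and the rest matches up to `right_end`.
    for i in range(seg_len, len(read) - seg_len + 1):
        donor_end = left + i
        acceptor_start = right_end - (len(read) - i)
        if (acceptor_start > donor_end
                and genome[left:donor_end] == read[:i]
                and genome[acceptor_start:right_end] == read[i:]):
            return [(left, donor_end), (acceptor_start, right_end)]
    return None


# Toy genome: two exons separated by an intron (all sequences are invented).
exon1, intron, exon2 = "ACGTTGCATC", "GT" + "C" * 10 + "AG", "TTAGGCATCA"
genome = exon1 + intron + exon2

# A read that spans the exon1/exon2 junction, as in a spliced transcript.
read = exon1[-6:] + exon2[:6]
blocks = spliced_align(genome, read, seg_len=4)
print(blocks)
```

The two returned blocks end and start exactly at the exon edges, and the gap between them is the intron that an unspliced aligner could not bridge.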

Page 35

Cufflinks

Trapnell et al. Nature Biotechnology 28, 511–515 (2010)

Page 36

Integrative Genomics Viewer (IGV)

[Screenshot labels: Genomic Coordinates; Data Panel; Annotation Heatmap; Cytoband; Genome Features; Track Names]

Page 37

IGV

Page 38

IGV