
BGI RNA-Seq Analysis


RNA-Seq analysis

The transcriptome is the total set of transcripts, both mRNA and non-coding RNA, in one cell or a population of cells under specific conditions. Transcriptome analysis lays the foundation for research into gene structure and function. Built on next-generation high-throughput sequencing technologies, RNA-Seq has found applications in many research fields, including fundamental science, medical research and drug development.

Services

1. RNA-seq without reference genome (De novo transcriptome)

1.1 Sequencing and basic data processing

We will first test the quality of total RNA provided by the customer. If the sample is qualified, we will then conduct the following technical route: sample preparation→sequencing.

The basic data analysis includes image recognition, base calling, filtering of adapter sequences and detection of sample contamination.
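The adapter-filtering step can be illustrated with a minimal sketch. It assumes exact matching of a 3' adapter and is not the filter used in the service itself; the adapter sequence and the `min_overlap` parameter are placeholders chosen for illustration.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 5) -> str:
    """Remove a 3' adapter from a read, using exact matching only."""
    # Case 1: the whole adapter occurs in the read; cut from its first base.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with the first few bases of the adapter.
    # Try the longest possible overlap first, down to min_overlap.
    for overlap in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    # No adapter detected: return the read unchanged.
    return read


# Hypothetical adapter and reads, for illustration only.
ADAPTER = "AGATCGGAAGAGC"
full = trim_adapter("ACGTACGTAGATCGGAAGAGCTTT", ADAPTER)
partial = trim_adapter("ACGTACGTAGATCGG", ADAPTER)
clean = trim_adapter("ACGTACGT", ADAPTER)
```

Production trimmers usually tolerate mismatches and take base quality into account; exact matching is used here only to keep the idea visible.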

1.2 Bioinformatics analysis

1) Statistics and quality assessment of output data

2) Contig length distribution

3) Scaffold-gene length distribution

4) Functional annotation of the scaffold-gene

5) GO categories of the scaffold-gene

6) Differentially expressed scaffold-gene

7) Protein function prediction and classification

8) Enriched metabolic pathway of scaffold-gene

9) Enriched GO categories of differentially expressed scaffold-gene
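A length distribution such as the one reported for contigs or scaffold-genes can be sketched as follows. The bin size and the example lengths are assumptions, and N50 is added here as a commonly reported assembly summary rather than an item named in the list above.

```python
from collections import Counter


def length_distribution(lengths, bin_size=100):
    """Count sequences per length bin; keys are the lower bound of each bin."""
    bins = Counter((n // bin_size) * bin_size for n in lengths)
    return dict(sorted(bins.items()))


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for n in sorted(lengths, reverse=True):
        running += n
        if running * 2 >= total:
            return n
    return 0  # empty input


# Hypothetical contig lengths in nt, for illustration only.
contig_lengths = [150, 220, 280, 450, 900]
distribution = length_distribution(contig_lengths)
contig_n50 = n50(contig_lengths)
```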

2. RNA-Seq with reference genome (In reference transcriptome)

2.1 Sequencing and basic data processing

We will first test the quality of total RNA provided by the customer. If the sample is qualified, we will then conduct the following technical route: sample preparation→sequencing.

Figure 1. RNA-Seq (De novo)

The basic data analysis includes image recognition, base calling, filtering of adapter sequences and detection of sample contamination.

2.2 Bioinformatics analysis

1) Summary of data output and alignment to reference sequences

2) Distribution of reads in reference genome

3) Randomness assessment of sequencing

4) Gene coverage and sequencing depth

5) Differentially expressed genes

6) Optimization of gene structure

7) Identification of alternatively spliced transcripts

8) Identification of novel genes

9) Identification of gene fusion
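Gene coverage and sequencing depth, item 4 in the list above, can be computed from read alignments roughly as sketched here. The coordinate convention (0-based, half-open, relative to the gene) and the example alignments are assumptions for illustration.

```python
def coverage_and_depth(gene_length, alignments):
    """Return (fraction of gene positions covered, mean depth per position).

    alignments: iterable of (start, end) pairs, 0-based and half-open,
    given relative to the start of the gene.
    """
    depth = [0] * gene_length
    for start, end in alignments:
        # Clip each alignment to the gene boundaries before counting.
        for pos in range(max(start, 0), min(end, gene_length)):
            depth[pos] += 1
    covered = sum(1 for d in depth if d > 0)
    return covered / gene_length, sum(depth) / gene_length


# Hypothetical 10-nt gene with two overlapping reads.
coverage, mean_depth = coverage_and_depth(10, [(0, 4), (2, 6)])
```

In this toy case positions 0 to 5 are covered, so coverage is 0.6, and the eight aligned bases spread over ten positions give a mean depth of 0.8.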

3. Non-coding RNA analysis

We will first test the quality of total RNA or size-fractionated RNA (e.g., 200-700 nt) provided by the customer. If the sample is qualified, we will then conduct the following technical route: sample preparation→sequencing.

Experimental pipeline

Figure 2. RNA-Seq (In reference)

Figure 3. Flowchart of RNA-Seq

Application of RNA-Seq

1. Identification of genes (De novo transcriptome only)

2. Structure of transcripts: identification of untranslated regions (UTRs), intron boundaries, alternative splicing, start codons, etc.

3. Identification of non-coding units: non-coding RNA, microRNA precursors, etc.

4. Determining gene expression at the transcriptional level

5. Identification of new transcription units

Technical features of RNA-Seq

Capacity             RNA-Seq
Detected signals     Digital signals
Detected range       Nearly all the transcripts
Detected accuracy    From several copies to 100,000 copies
Resolution           Allele-specific expression, alternative splicing

Case Study

Discover new alternatively spliced transcripts

Marc Sultan et al. reported that RNA-Seq can detect 25% more genes than microarrays. A global survey of messenger RNA splicing events identified 94,241 splice junctions (4096 of which were previously unidentified) in a study of human embryonic kidney cells and B cells.

Figure 4. RNA-Seq versus microarrays

A. Comparison of the number of expressed genes detected by RNA-Seq and microarrays

B. Distribution of the RNA-Seq NEs and the proportion of genes detected on microarrays. Genes missed by microarrays are shown with gray (HEK) and black (B cells) bars. Genes detected by microarrays are shown with light red (HEK) and dark red (B cells) bars.

Identify 5’ and 3’ UTRs in yeast

After comparing 5’ RACE results with RNA-Seq results, researchers found both methods identified 5’ boundaries within 50 bp of one another for 786 genes (77.9%). RNA-Seq could identify the 3’ boundary precisely.
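The kind of comparison described above, counting genes whose boundaries from two methods fall within 50 bp of one another, can be sketched as follows. The gene names and coordinates are hypothetical and are not data from the cited study.

```python
def boundary_agreement(rnaseq, race, tolerance=50):
    """Count genes whose 5' boundaries from two methods differ by <= tolerance bp.

    rnaseq, race: dicts mapping gene name -> boundary coordinate.
    Returns (genes in agreement, genes compared, fraction in agreement).
    """
    shared = rnaseq.keys() & race.keys()
    agree = sum(1 for g in shared if abs(rnaseq[g] - race[g]) <= tolerance)
    fraction = agree / len(shared) if shared else 0.0
    return agree, len(shared), fraction


# Hypothetical boundaries, for illustration only.
rnaseq_5p = {"geneA": 1000, "geneB": 2000, "geneC": 3000, "geneD": 4000}
race_5p = {"geneA": 1010, "geneB": 2100, "geneC": 2960, "geneD": 4050}
agree, compared, fraction = boundary_agreement(rnaseq_5p, race_5p)
```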

Figure 5. Identify 5’ and 3’ UTRs in yeast using RNA-Seq

A. The 5′ UTRs determined by RNA-Seq and by 5′ RACE for gene YKL004W

B. 3′ UTR determined by RNA-Seq for gene YDR419W. A colored box represents an ORF and an arrow indicates the transcription direction.

Detect more low abundance transcripts

In a rice RNA-Seq project at the Beijing Genomics Institute, we found that RNA-Seq can detect more low-abundance genes than traditional methods (Fig. 6-A).


Figure 6. Transcriptome study can detect more low-abundance transcripts than cDNA sequencing

A. The length distribution of newly identified transcripts.

B. A comparison of expression level between novel transcripts and cDNA genes.

Detect gene fusion

Researchers at the University of Michigan performed transcriptome sequencing of patient cell lines and tumor samples, using 454 together with the GA (Solexa), to discover new gene fusions in prostate cancer. This established high-throughput sequencing as a reliable method for discovering new gene fusions and other disease-related mutations.

Quantify RNA expression level

Researchers at Yale University found a strong correlation (R = 0.9775) between the qPCR and RNA-Seq data for 34 genes predicted to be expressed at high, medium and low levels.
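A correlation of this kind is typically computed on log2-transformed expression values. The sketch below assumes Pearson's correlation coefficient and uses made-up numbers; it does not reproduce the published data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical expression values for three genes, for illustration only.
rnaseq_log2 = [math.log2(v) for v in (8, 64, 512)]
qpcr_log2 = [math.log2(v) for v in (4, 32, 256)]
r = pearson_r(rnaseq_log2, qpcr_log2)
```

Taking logarithms first keeps a few very highly expressed genes from dominating the coefficient.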

Figure 7. RNA-Seq detects gene fusion

Figure 8. Comparison of 34 ORFs identified by RNA-Seq and qPCR at the transcriptional level (x-axis: RNA-seq (log2); y-axis: qPCR (log2); correlation = 0.9775)

References

Sultan M, Schulz MH, et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321 (5891), 956 (2008).

Maher CA, Kumar-Sinha C, Cao X, et al. Transcriptome sequencing to detect gene fusions in cancer. Nature 458 (7234), 97 (2009).

Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, et al. The Transcriptional Landscape of the Yeast Genome Defined by RNA Sequencing. Science 320 (5881), 1344 (2008).

Wilhelm BT, Marguerat S, Watt S, et al. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature 453 (7199), 1239 (2008).

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5 (7), 621 (2008).

FAQ

1. What are the sample requirements? Please provide total RNA at a concentration of no less than 400 ng/μl and a quantity of no less than 20 μg; the minimum acceptable quantity is 10 μg. RNA quality requirements: OD260/280 of 1.8-2.2, 28S/18S > 1.8, and RIN ≥ 8. Customers should ship the RNA a week before sequencing.

2. Can the Beijing Genomics Institute (BGI) perform transcriptome analysis of bacteria? Yes. We recommend that the customer submit purified mRNA or cDNA rather than total RNA.

3. Can BGI perform non-coding RNA sequencing? Yes. We recommend that the customer submit RNA free of rRNA and tRNA.

4. How many unigenes can be retrieved from 1 Gb of sequencing data? In general, more than 6000 unigenes longer than 1 kb can be identified from 1 Gb of sequencing data. However, the exact number will vary according to the nature of the sample.

5. What species has BGI sequenced? We have sequenced many model organisms and major crops, for example Homo sapiens, Nematoda, silkworm, Arabidopsis thaliana, rice and corn. Many novel structures and transcripts were identified. We have also performed transcriptome sequencing of many species without a reference genome, such as trees, flowers, vegetables, insects, fishes and fungi.
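The sample requirements stated in question 1 can be written down as a simple pre-shipment check. The thresholds come from the FAQ above; the function name, argument names and messages are hypothetical.

```python
def check_total_rna(conc_ng_per_ul, quantity_ug, od260_280, ratio_28s_18s, rin):
    """Return a list of unmet requirements; an empty list means the sample passes."""
    problems = []
    if conc_ng_per_ul < 400:
        problems.append("concentration below 400 ng/ul")
    if quantity_ug < 10:
        # 20 ug is requested, but 10 ug is the stated minimum.
        problems.append("quantity below the 10 ug minimum")
    if not 1.8 <= od260_280 <= 2.2:
        problems.append("OD260/280 outside 1.8-2.2")
    if ratio_28s_18s <= 1.8:
        problems.append("28S/18S not above 1.8")
    if rin < 8:
        problems.append("RIN below 8")
    return problems


# Hypothetical measurements, for illustration only.
good_sample = check_total_rna(500, 25, 2.0, 2.0, 8.5)
poor_sample = check_total_rna(300, 8, 1.6, 1.5, 7.0)
```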


Small RNA analysis

RNA is one of the most important biomolecules; together with DNA and protein, it forms the framework of life. Small RNAs regulate processes such as cell development and growth, gene transcription and translation, and gene silencing. Small RNA sequencing is based on Solexa technology: deep sequencing yields numerous small fragments of 18 to 30 nt, which we compare with those of known and related species to find differences between samples, predict novel miRNAs and further study their function.

Services

Sequencing and basic data processing

We will first test the quality of small RNA provided by the customer. If the sample is qualified, we will then conduct the following technical route: sample preparation→TA clone→sequencing reaction.

The basic data analysis includes image recognition, base calling, filtering of adapter sequences and detection of sample contamination.

Bioinformatics analysis

Items of basic bioinformatics analysis

Length distribution of small RNA

Mapping small RNA sequences to genome sequences and exploring features of distribution along each chromosome

Differential small RNA between two samples

Comparing small RNA sequences with known miRNAs deposited at miRBase (miRBase 13.0)

Identification of rRNA, tRNA, snRNA and snoRNA against Rfam (9.1) and GenBank

Identifying repeats associated with small RNAs

Identifying degraded mRNA fragments and siRNA candidates

Items of basic bioinformatics analysis

Annotating and classifying miRNA

Prediction of novel miRNA

Expression of miRNA

Differential expression analysis of miRNA gene and construction of miRNA expression profiles

Clustering analysis of differentially expressed miRNA

Target prediction of miRNA (plants only)

Technical features

High throughput: more than 2.5 million reads can be obtained through single-pass sequencing.

High resolution: single-base differences can be detected.

High accuracy: digital signals accurately detect copy numbers ranging from several to hundreds of thousands.

Experimental pipeline

Figure 5-1 Experimental pipeline of small RNA analysis

Case study

About 20 μg of total RNA was extracted from an animal tissue, sequenced by high-throughput sequencing and then analyzed bioinformatically.

Length distribution of small RNA

The lengths of the small RNAs are centered on 22 nt (more than 90%), which indicates that the small RNA sequencing is reliable. The length distribution of small RNA from the tissue is shown in Fig. 5-2.
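A length distribution like this one can be tabulated from cleaned reads as sketched below. The 18-30 nt window follows the fragment range given in the introduction; the reads themselves are synthetic placeholders.

```python
from collections import Counter


def small_rna_length_distribution(reads, min_len=18, max_len=30):
    """Fraction of reads at each length within [min_len, max_len]."""
    counts = Counter(len(r) for r in reads if min_len <= len(r) <= max_len)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {n: counts[n] / total for n in range(min_len, max_len + 1)}


# Synthetic reads: three of 22 nt, one of 21 nt, and one 35-nt read
# that falls outside the window and is ignored.
reads = ["A" * 22, "C" * 22, "G" * 22, "T" * 21, "A" * 35]
distribution = small_rna_length_distribution(reads)
```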

Figure 5-2 The distribution of small RNA

Annotating and classifying small RNA sequences

After removing adaptor contaminants and low-quality sequences, 3,333,504 reads are generated. The sequences are aligned to the miRBase, mRNA/EST and rRNA databases to identify known and candidate miRNAs (Fig. 5-3).

Figure 5-3 Proportion of miRNAs and other categories of RNA

In these data, most unique reads map to exons, but miRNAs account for the major part of the total reads. This abundance of miRNA reads makes the prediction of novel miRNAs more reliable.
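The difference between unique reads and total reads can be made concrete with a small sketch. It assumes each read has already been assigned a category; the sequences and category labels are placeholders.

```python
from collections import Counter, defaultdict


def summarize_categories(annotated_reads):
    """Return {category: (unique sequences, total reads)}.

    annotated_reads: iterable of (sequence, category), one entry per read.
    """
    total = Counter()
    unique = defaultdict(set)
    for sequence, category in annotated_reads:
        total[category] += 1
        unique[category].add(sequence)
    return {cat: (len(unique[cat]), total[cat]) for cat in total}


# Placeholder data: one miRNA sequence read five times,
# and three different exon fragments read once each.
annotated = [("TGAGGTAGTAGGTTGTATAGTT", "miRNA")] * 5 + [
    ("ACGTACGTACGTACGTACGA", "exon"),
    ("ACGTACGTACGTACGTACGC", "exon"),
    ("ACGTACGTACGTACGTACGG", "exon"),
]
summary = summarize_categories(annotated)
```

Here exon fragments dominate the unique sequences while miRNA dominates the total read count, the same pattern described for Fig. 5-3.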

Different expression patterns analysis on miRNAs

Different miRNAs show different expression patterns in the same tissue (Fig. 5-4). This is related to differences between tissues and to the selective expression of genes.

Figure 5-4 Expression profile for part of miRNAs in the same tissue

Figure 5-5 Expression profile for part of miRNAs in the different tissue (axis label: Expression level (10K))

As shown in Fig. 5-5, the expression of miRNA is tissue-specific (A, B, C, D, E, F, G, H, I and J indicate different tissues; hsa-let-7b and hsa-miR-22 indicate different miRNA genes).

Identification of miRNA nucleotide bias

As a special class of RNA, miRNAs usually have U rather than G at the 5’ end. Positions 2 and 4 are depleted of U. Generally speaking, all positions except the fourth are depleted of G. miRNAs show high sequence conservation and strong temporal and tissue specificity. The count of all variants of a miRNA gene can be used as a digital measure of its expression level (Fig. 5-6).
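Nucleotide bias at a given position can be measured by counting bases across all miRNA sequences, as in this minimal sketch. The sequences are invented, and positions are taken to be 1-based from the 5’ end.

```python
from collections import Counter


def base_frequencies(sequences, position):
    """Frequency of A, C, G and U at a 1-based position from the 5' end."""
    bases = Counter(s[position - 1] for s in sequences if len(s) >= position)
    total = sum(bases.values())
    if total == 0:
        return {}
    return {b: bases.get(b, 0) / total for b in "ACGU"}


# Invented sequences, for illustration only.
sequences = ["UGAG", "UAGC", "UCGA", "AGGU"]
first_base = base_frequencies(sequences, 1)
```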

Figure 5-6 The distribution of all bases

Identification of miRNA related to repeat sequence

Besides acting as sequence-specific guides that regulate mRNA stability or inhibit protein synthesis, small RNAs include novel types, discovered in many recent studies, that bind different Argonaute proteins and are involved in important biological processes such as chromatin maintenance and transposon control. These small RNAs are usually derived from highly repeated elements and are called repeat-associated small RNAs (a term often used interchangeably with piRNA). According to the type of Argonaute protein they bind, these small RNAs can be further divided into different classes (Fig. 5-7).

Figure 5-7 The distribution of repeats (axis label: Percent)

Prediction of new miRNA candidates

miRNA precursors have a characteristic fold-back structure, which can be used to predict novel miRNAs. By folding the genome sequence flanking a small RNA and then analyzing its structural features, we can identify novel miRNA candidates (Fig. 5-8).
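The fold-back idea can be illustrated with a deliberately crude stand-in for real structure prediction: fold the candidate precursor around a central loop and count how many arm positions can base-pair. This is an assumption-laden sketch, not the prediction method used in the analysis; real tools compute secondary structure with free-energy models and allow bulges and mismatches. The loop length and the example sequences are placeholders.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def paired_fraction(precursor, loop_len=4):
    """Fraction of arm positions that pair when the sequence folds back on itself.

    The 5' arm is compared, without gaps, against the 3' arm read from the
    end inward; the central loop_len bases (or more) are left unpaired.
    """
    arm = (len(precursor) - loop_len) // 2
    if arm <= 0:
        return 0.0
    five_prime = precursor[:arm]
    three_prime = precursor[-arm:][::-1]
    paired = sum((a, b) in PAIRS for a, b in zip(five_prime, three_prime))
    return paired / arm


# Placeholder sequences: a perfect toy hairpin and a sequence that cannot pair.
hairpin_score = paired_fraction("GGAUCAAAAGAUCC")
flat_score = paired_fraction("AAAAAAAAAAAAAA")
```

A candidate with a high paired fraction would then be passed to a proper folding program for confirmation.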

Figure 5-8 the identification of novel miRNA

Expression differences of miRNA between two samples

The same gene can be expressed differently under different conditions. The expression levels of known miRNAs are compared between samples using Log2-ratio plots and scatter plots (Figures 5-9, 5-10).
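A Log2-ratio comparison between two samples can be sketched as follows. It assumes expression is first normalized to reads per million total reads (TPM) and that a floor value stands in for miRNAs absent from one sample; both the floor and the counts are illustrative assumptions.

```python
import math


def to_tpm(counts):
    """Normalize raw read counts to reads per million total reads."""
    total = sum(counts.values())
    return {name: n * 1_000_000 / total for name, n in counts.items()}


def log2_ratios(tpm_a, tpm_b, floor=1.0):
    """log2(B / A) for every miRNA seen in either sample.

    Values below `floor` (including missing miRNAs) are raised to `floor`
    so that the ratio is always defined.
    """
    names = tpm_a.keys() | tpm_b.keys()
    return {
        name: math.log2(max(tpm_b.get(name, 0.0), floor)
                        / max(tpm_a.get(name, 0.0), floor))
        for name in names
    }


# Hypothetical read counts for two miRNAs at two time points.
day2 = to_tpm({"miR-a": 100, "miR-b": 900})
day7 = to_tpm({"miR-a": 400, "miR-b": 600})
ratios = log2_ratios(day2, day7)
```

In this toy example miR-a rises fourfold between the two samples, giving a Log2 ratio of 2, while miR-b falls and gets a negative ratio.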

Figure 5-9 Scatter plot of different samples (x-axis: Expression level (Day 2); y-axis: Expression level (Day 7); logarithmic scales)

Figure 5-10 Log2-ratio of different samples

Clustering differentially expressed miRNA between two samples

miRNA genes, with expression standardized to 1 TPM, are clustered by sequence similarity, so that similar sequences group together. Red indicates an upward trend, green indicates a downward trend, and gray marks genes that are not expressed in any sample.

Figure 5-11 cluster analysis of miRNA


Frequently asked questions

1. What are the sample requirements? Please provide total RNA at a concentration of no less than 750 ng/μl and a quantity of no less than 20 μg. We recommend that customers avoid using spin columns to extract total RNA. We use an Agilent instrument to measure the RIN, so it is best to send total RNA. You can also check your sample by OD measurement or on a gel.

2. What is the TA clone for? TA cloning is performed after library construction to check the quality of the library. We select more than 80 fragments for TA cloning, sequence them by the Sanger method and compare the insert fragments with the database. This step helps avoid bad results after sequencing.

3. What should the customer provide besides samples? You should provide genome information, including exon/intron and repeat annotation. If the sample has no genome, you must provide the information for the most closely related species.

4. How long will it take to get the data? We promise to submit the report within 50 working days after confirming that payment has been received.

5. How can I understand the data? What should I do? The BGI miRNA result report includes a section describing the analysis methods, which explains how the results were produced. The README can also help you find the answer.