RNA-Seq Lab Radhika S. Khetani RNA-Seq Lab v5 | Radhika S. Khetani1 Powerpoint by Casey Hanson

RNA-Seq Lab v5 | Radhika S. Khetani 1

RNA-Seq Lab

Radhika S. Khetani

Powerpoint by Casey Hanson


Exercise

1. Use the Tuxedo Suite to:

a. Align RNA-Seq reads using TopHat(splice-aware aligner).

b. Perform reference-based transcriptome assembly with CuffLinks.

c. Obtain a new transcriptome using CuffLinks & CuffMerge.

d. Use CuffDiff to obtain a list of differentially expressed genes.

e. Report a list of significantly expressed genes.

2. Use a genome browser and visualization tool to observe the

aligned data and the new transcriptome.


Premise

1. Procedure:

Run 1A: Allow TopHat to select splice junctions de novo and

proceed through the steps without giving the software known

genes/gene models.

Run 1B: Force TopHat to use only known splice junctions (i.e. known

genes/gene models) and proceed through the steps making sure we

are doing our analysis in the context of these gene models.

2. Evaluation:

a. 2 metrics: # of mapped reads and # of significantly

different identified genes

b. Compare new transcriptome to known genes.

Question: Is their a difference in our results if the Tuxedo Suit is run 2 different ways?


sample replicate

# fastq name # reads

controlReplicate

1thrombin_control.tx

t 10,953

experimental

Replicate 1 thrombin_expt.txt 12,027

name description

chr22.faFasta file with the sequence of

chromosome 22 from the human genome (hg19 – UCSC)

genes-chr22.gtfGTF file with gene annotation, known genes

(hg19 – UCSC)

RNA-Seq: 100 bp, single end data

Genome & gene information

Data Sources


Step 0A: Accessing the IGB BioclusterOpen Putty.exe

In the hostname textbox type:

biocluster.igb.Illinois.edu

Click Open

If popup appears, Click Yes

Enter login credentials assigned to you; example, user class45.

Now you are all set!


Step 0B: Lab SetupThe lab is located in the following directory:

~/mayo/khetani

This directory contains the finished version of the lab (i.e. the version of the lab after the tutorial). Consult it if you unsure about your runs. You don’t have write permissions to the lab directory. Create a working directory of this lab in your home directory for your output to be stored. Note ~ is a symbol in unix paths referring to your home directory. Copy the files

Make sure you login to a machine on the cluster using the qsub command. The exact syntax for this command is given below. This particular command will login you into a reserved computer (denoted by classroom) with 4 cpus with an interactive session. You only need to do this once.$ mkdir ~/khetani # Make working directory in your home directory,

$ cp ~/mayo/khetani/data/* ~/khetani # Copy data to your working

directory.

$ qsub –I –l ncpus=4 # Login to a computer on cluster.


Step 0C: Shared Desktop Directory

For viewing and manipulating files on the classroom computers, we provide a shared directory in the following folder on the desktop:

classes/mayo

In today’s lab, we will be using the following folder in the shared directory:

classes/mayo/khetani


We will login to the biocluster and examine various files.

Step 1: Access the Biocluster

$ qsub –I –l ncpus=4 # Open session on node with 4 cpus.

SKIP IF DONE

$ cd ~/khetani/ # Change to the directory of today’s lab session.

$ head chr22.fa # Examine first 10 lines of chr22 sequence file.

$ head –n 12 thrombin_control.txt

# Examine first 12 lines (first 3 reads) from control.

$ tail –n 12 thrombin_expt.txt

# Examine last 12 lines (last 3 reads) from experimental sample.

$ head genes-chr22.gtf # Examine first 10 lines of chr22 genes file.


Assume the data is of good quality and quality trimming has taken place.

We will now use Bowtie in the Tuxedo Suite to build a chromosome index

of chr22.

Step 2: Create an Index using Bowtie

$ module load tophat2/2.0.8 # Load Tophat2 and

dependencies.

$ bowtie2-build chr22.fa chr22 # Make Bowtie index for

chr22.

# This index is essential for the rest of the exercise.

# Make sure you use the correct bowtie version (type bowtie-

build).


In this exercise, we will be aligning RNA-Seq reads to a reference

genome in the absence of gene models. Splice junctions will be

found de novo.

.

Run 1A: de novo Alignment


We will now align RNA-Seq reads to chr22 using mostly default parameters.

Note: We are not providing gene information. TopHat will find

splice junctions de novo.

Step 3A: Align Reads Using TopHat

$ tophat –p 4 –o ctrl chr22 thrombin_control.txt

# Run TopHAT using chr22 as reference and sequences in the

control.

# -p indicates the number of cpus (4).

# -o indicates the output directory (ctrl).

$ tophat –p 4 –o expt chr22 thrombin_expt.txt

# Run TopHAT using chr22 as reference and sequences in the

experimental sample.

# -o indicates the output directory (expt).


We will now evaluate our alignment by observing how many reads DID NOT

align to the reference genome chr22.

Step 4A: Evaluate de novo Alignment

$ samtools view –c ctrl/unmapped.bam

# -c instructs the view tool to count the unmapped

reads.

# The result should be 101 unmapped reads.

$ samtools view –c expt/unmapped.bam



For the purpose of visualization, we will create an index for the mapped

reads.

Step 5A: Create Index for Mapped Reads

$ samtools index ctrl/accepted_hits.bam

# Create index of mapped reads in control.

$ samtools index expt/accepted_hits.bam

# Create index of mapped reads in experimental

sample.


In this exercise, we will be aligning RNA-Seq reads to a reference

genome in the presence of gene information. This obviates the

need for TopHat to find splice junctions de novo.

.

Run 1B: Informed Alignment


We will now align RNA-Seq reads to chr22 using mostly default parameters

for TopHat and information on the location of genes on chr22.

Step 3B: Align Reads Using Gene Info

$ tophat –p 4 –G genes-chr22.gtf –o ctrl-genes chr22

thrombin_control.txt

# Run TopHAT using chr22 as reference and sequences in the control.

# -p indicates the number of cpus (4).

# -o indicates the output directory (ctrl-genes).

# -G indicates the gene file to use to aid in alignment.

$ tophat –p 4 –G genes-chr22.gtf –o expt-genes chr22 thrombin_expt.txt

# Run TopHAT using chr22 as reference and sequences in the experimental

sample.

# -o indicates the output directory (expt-genes).


We will now evaluate our informed alignment by observing how many

reads DID NOT align to the reference genome chr22.

Step 4B: Evaluate Informed Alignment

$ samtools view –c ctrl-genes/unmapped.bam

# -c instructs the view tool to count the unmapped

reads.


$ samtools view –c expt-genes/unmapped.bam



For the purpose of visualization, we will create an index for the mapped

reads in the informed alignment.

Step 5B: Create Index for Mapped Reads

$ samtools index ctrl-genes/accepted_hits.bam

# Create index of mapped reads in control.

$ samtools index expt-genes/accepted_hits.bam

# Create index of mapped reads in experimental

sample.


sample # fastq name # readsUnmapped Reads

de novo (Run 1A)

Informed (Run 1B)

control thrombin_control.txt 10,953 101 27

experimental thrombin_expt.txt 12,027 63 39

Checkpoint 1: Comparison of Alignments

There are fewer unmapped reads with the informed alignment, or Run 1B

(i.e. when we use the known genes, and known splice sites)!

TopHat’s prediction of splice junctions de novo is not working very well for

this dataset. (This is likely due to the low number of reads in our dataset.)

Conclusions


Next, we will utilize our RNA-Seq alignments to assembly gene

transcripts, thereby permitting us to get relative gene abundances

between the two samples (control and experimental).

.

Finding Differentially Expressed Genes


For the de-novo alignment (Run 1A) , we will run the program CuffLinks in

order to obtain gene transcripts from our aligned RNA-Seq reads .

There is no need to conduct this step for the informed

alignment (Run 1B) because we have the locations of known genes

already.

Step 6: Assemble Transcripts using Cufflinks

$ module load cufflinks/2.1.1 # Load cufflinks v2 and dependencies

$ cufflinks –p 4 –o cuff-ctrl ctrl/accepted_hits.bam

# -p indicates the number of processors to use (4)

# -o indicates the output directory (cuff-ctrl)

$ cufflinks -p 4 –o cuff-expt expt/accepted_hits.bam

# -o indicates the output directory (cuff-expt)


For the de-novo alignment (Run 1A) , we will run the program CuffMerge

in order to merge our assembled transcripts.

There is no need to conduct this step for Run 1B.

Step 7: Merge Transcripts Using CuffMerge

$ echo -e

"cuff-ctrl/transcripts.gtf\ncuff-expt/transcripts.gtf" >

gtf.list.txt

# Create a text file named gtf.list.txt with the following

contents:

cuff-ctrl/transcripts.gtf

cuff-expt/transcripts.gtf

$ cuffmerge –o cuffmerge gtf.list.txt

# -o indicates the output directory (cuffmerge)


For both alignments, de-novo (Run 1A) and informed (Run 1B), we aim to

collected the abundances of the expressed genes. To do this, we will utilize

the CuffDiff program.

We need only a gene (.gtf) file and alignment (.bam) files to calculate

differentially expressed genes between the different sample groups

(control and experimental).

Step 8A: Gene Expression Using CuffDiff

$ cuffdiff –p 4 –o cuffdiff cuffmerge/merged.gtf

expt/accepted_hits.bam ctrl/accepted_hits.bam

# -p indicates the number of processors to use (4)

# -o indicates the output directory(cuffdiff)

$ cuffdiff –p 4 –o cuffdiff-genes genes-chr22.gtf

expt/accepted_hits.bam ctrl/accepted_hits.bam

# -o indicates the output directory (cuffdiff-genes)





Step 8B: Gene Expression Using CuffDiff

$ head cuffdiff-genes/gene_exp.diff # Examine first 10 lines of

file.

# We want all rows where the 14th column is yes. The awk command is

convenient for file parsing. We’ll see a Galaxy interface for awk in a

later lab.

$ awk ‘{if ($14==“yes”) print $0}’ cuffdiff/gene_exp.diff >

cuffdiff/gene_exp.SIG.diff

$ awk ‘{if ($14==“yes”) print $0}’ cuffdiff-genes/gene_exp.diff >

cuffdiff-genes/gene_exp.SIG.diff





We will count the lines in the .SIG.DIFF files to see how many genes are

differentially expressed in each of the two alignments.

Step 8C: Gene Expression Using CuffDiff

$ wc –l cuffdiff/gene_exp.SIG.diff cuffdiff-genes/gene_exp.SIG.DIFF

# Count the number of lines in each .SIG.DIFF files

# Output

3 cuffdiff/gene_exp.SIG.diff

0 cuffdiff-genes/gene_exp.SIG.diff

Only the de-novo alignment (Run 1A) reports any differentially

expressed genes between experiment and control !


The Integrative Genomics Viewer (IGV) is a tool that supports the visualization

of mapped reads to a reference genome, among other functionalities. We will use

it to observe where hits were called for the de-novo alignment (Run 1A) for the two

samples (control and experimental), the new transcriptome generated by

CuffMerge, and the differentially expressed genes.

.

Visualization Using IGV


In this step, we will start IGV and load the chr22.fa file, the known genes

file

(genes-chr22.gtf), the hits for both sample groups, and the merged

transcriptome. These files are located in classes/mayo/khetani on the

desktop.

Step 9: Start IGV

Graphical Instruction: Load Genome

1. Within IGV, click the FILE tab on the menu bar.

2. Click the the ‘Load Genome from File’ option.

3. In the browser window, select chr22.fa (genome).

Graphical Instruction: Load Other Files

1. Within IGV, click the FILE tab on the menu bar.

2. Click the ‘Load from File’ option.

3. Select the genes-chr22.gtf file (known genes file).

4. Perform Steps 1-3 for the files to the right.

Files to Load

genes-chr22.gtf

ctrl_accepted_hits.b

am

expt_accepted_hits.

bam

merged.gtf


Step 10A: Visualization With IGVYour browser window should look similar to the picture below:


Step 10B: Visualization With IGVClick here and type the following location of a differentially expressed gene:

chr22:19960675-19963235

Move to the left and right of the gene. What do you see?


Step 10C: Visualization with IGV

Looks like the new transcriptome (merged.gtf) compares

poorly to the known gene models. This is very likely due to

the very low number of reads in our dataset.

We can see that there are many more reads for one dataset

compared to the other. Hence, it makes sense that the gene

was called as being differentially expressed.

Note the intron spanning reads.


Conclusion

Today we did the following:

1. Used the Tuxedo Suite to:

a. Aligned RNA-Seq reads using TopHat(splice-aware aligner).

b. Performed reference-based transcriptome assembly with CuffLinks.

c. Obtained a new transcriptome using CuffLinks & CuffMerge.

d. Used CuffDiff to obtain a list of differentially expressed genes.

e. Reported a list of significantly expressed genes.

2. Used a genome browser and visualization tool to observe

the aligned data and the new transcriptome.

Documents

RNA-Seq Lab Radhika S. Khetani RNA-Seq Lab v5 | Radhika S. Khetani1 Powerpoint by Casey Hanson