User Showcase Astrocyte BICF Workflow RNA-Seq Analysis
Xin Luo Ph.D. BICF
02/22/2017
What Can Gene Expression Tell Us?
• How do gene expression patterns determine cell identity?
• What are the differentially expressed genes in a specific disease as compared to healthy control?
• How does gene expression respond to different stimuli or drug treatment?
• How does gene expression change across different developmental stages?
• What are the downstream targets or pathways of your factor of interest?
… and much, much more …
RNA-Seq vs Microarray

Technology                                                    RNA-Seq       Microarray
Development year                                              2008          1995
Method                                                        Sampling      Hybridization
Prior knowledge                                               No            Yes
Reproducibility                                               Yes           Yes
Dynamic range                                                 >8000-fold    Hundred-fold
Novel transcript detection (fusion, non-coding transcripts)   Yes           No
Novel splicing isoform detection                              Yes           No
De novo analysis for species without reference genome         Yes           No
Required RNA amount                                           >100 ng       ~1 μg
General RNA-‐Seq Workflow
Everything's connected slide by Dündar et al. (2015)
Experimental Design Affecting Your Analysis
• Whole transcriptome vs mRNA
• Single-end vs paired-end
  – Paired-end produces more accurate alignments
  – Paired-end allows for transcript-level analysis
  – Single-end is cheaper
• Number of Reads
  – 10-50M is a good range
  – Aim for at least 20M
• Read Length
  – Longer reads produce better alignments; minimum 50 bp paired-end or 100 bp single-end for gene quantification
  – ChIP-seq, smallRNA-seq, RIP-seq, CLIP-seq: 50 nt single-end
Experimental Design Affecting Your Analysis
• Number of Samples
  – Your power to detect an effect depends on:
    – Effect size (difference between group means)
    – Within-group variance
    – Sample size
  – The more samples the better; minimum 3 per group
  – Five samples sequenced to 20M reads each offer more power than 2 samples sequenced to 50M reads each
• Stranded
  – Can distinguish expression of overlapping genes
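The three factors listed under Number of Samples can be illustrated with a textbook normal approximation. The sketch below is only that: it uses a two-sample z-test, not the model edgeR fits, it ignores sequencing depth, and the effect size and standard deviation passed in are made-up values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect      -- difference between group means (hypothetical units)
    sd          -- within-group standard deviation (assumed equal in both groups)
    n_per_group -- number of samples in each group
    """
    normal = NormalDist()
    # Standard error of the difference between two group means.
    standard_error = sd * sqrt(2.0 / n_per_group)
    z_critical = normal.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / standard_error
    # Probability that the test statistic lands in either rejection tail.
    return normal.cdf(shift - z_critical) + normal.cdf(-shift - z_critical)


# Illustrative values only: same effect and variance, different group sizes.
for n in (2, 3, 5):
    print(n, round(approx_power(effect=1.0, sd=0.5, n_per_group=n), 3))
```

With the effect and variance held fixed, the approximation rises as samples are added to each group, which is the direction of the argument on the slide.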
Strand-specific RNA-Seq
image from GATC Biotech
How to decide strand
Reverse stranded
Stranded
RNA-Seq Analysis Pipeline
http://www.utsouthwestern.edu/labs/bioinformatics/services/data-analysis/rnaseq-pipeline.html
RNA-Seq Analysis Essence
• Preprocessing and normalization
• Differential gene expression analysis
• QC
• Visualization
• Pathway and gene set enrichment analysis
• Different splicing isoforms
• Fusion and variants
• Introduction to BioHPC
  – First Wednesday of every month
  – 10:30 AM-Noon @ NL6.215
• Please attend so you can get an account to try this out
Launch Workflows using Astrocyte
Allows groups to give easy access to their analysis pipelines via the web
Astrocyte – BioHPC Workflow Platform
• Standardized workflows
• Simple web forms
• Online documentation & results visualization*
Workflows run on HPC cluster without developer or user needing cluster knowledge
Slide contribution: David Trudgian@BioHPC
astrocyte.biohpc.swmed.edu
Browse workflows
Create a New Project
Add Data to Your Project
Add Data to Your Project
For NGS experiments, this is recommended.
Use the cat command to combine reads from different sequencing lanes, e.g.
    cat S1.*.R1.fastq.gz > S1.R1.fastq.gz
Use the cp command to copy files to the astrocyte_incoming folder for uploading, e.g.
    cp *.fastq.gz /project/apps/astrocyte/astrocyte_incoming/xluo4
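For anyone scripting the lane-merging step rather than typing the cat command per sample, here is a minimal Python sketch of the same idea. It relies on the fact that gzip files concatenated byte-for-byte still form a valid gzip stream. The function name and the lane naming in the usage comment (L001, L002) are hypothetical; only the S1.*.R1.fastq.gz pattern comes from the slide.

```python
import glob
import os
import shutil


def combine_lanes(pattern, out_path):
    """Concatenate all files matching `pattern` into `out_path`, in sorted order.

    Sorting keeps the lane order identical for R1 and R2, so paired reads
    stay in step when both mates are combined the same way.
    """
    out_abs = os.path.abspath(out_path)
    # Never feed the output file back into itself if it matches the pattern.
    lane_files = sorted(
        path for path in glob.glob(pattern) if os.path.abspath(path) != out_abs
    )
    if not lane_files:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(out_path, "wb") as combined:
        for path in lane_files:
            with open(path, "rb") as lane:
                shutil.copyfileobj(lane, combined)
    return lane_files


# Hypothetical usage, with lane files such as S1.L001.R1.fastq.gz:
#   combine_lanes("S1.*.R1.fastq.gz", "S1.R1.fastq.gz")
#   combine_lanes("S1.*.R2.fastq.gz", "S1.R2.fastq.gz")
```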
Make Your Design File

SampleID  SampleGroup  SubjectID  SampleName  FullPathToFqR1  FullPathToFqR2
N4        Normal       4          N4          N4.R1.fastq.gz  N4.R2.fastq.gz
N5        Normal       5          N5          N5.R1.fastq.gz  N5.R2.fastq.gz
N7        Normal       7          N7          N7.R1.fastq.gz  N7.R2.fastq.gz
T2        Tumor        2          T2          T2.R1.fastq.gz  T2.R2.fastq.gz
T5        Tumor        5          T5          T5.R1.fastq.gz  T5.R2.fastq.gz
T7        Tumor        7          T7          T7.R1.fastq.gz  T7.R2.fastq.gz
Make Your Design File
• Use tab as the delimiter
  – In Excel, save as “Text (tab delimited)”
• Column names MUST be exactly the same as documented
• If there is no SubjectID, use the same number/character for all rows
• For single-end sequencing, leave FullPathToFqR2 empty
• For all contents, no hyphens (“-”)
• For all contents, no spaces
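The rules above are easy to check before uploading. The following is a minimal sketch of such a check, not part of Astrocyte itself: the function name is hypothetical, and the expected column list is copied from the example design file in this deck, so the workflow documentation remains the authority. It also assumes that single-end rows keep an empty trailing FullPathToFqR2 field, as Excel's tab-delimited export does.

```python
import csv
import io

# Column names as they appear in the example design file (an assumption;
# confirm against the workflow documentation).
EXPECTED_COLUMNS = [
    "SampleID", "SampleGroup", "SubjectID",
    "SampleName", "FullPathToFqR1", "FullPathToFqR2",
]


def check_design(text):
    """Return a list of problems found in tab-delimited design-file text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # ignore blank lines
    if not rows:
        return ["design file is empty"]
    problems = []
    if rows[0] != EXPECTED_COLUMNS:
        problems.append("header must be exactly: " + "\t".join(EXPECTED_COLUMNS))
    for number, row in enumerate(rows[1:], start=1):
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(
                f"data row {number}: expected {len(EXPECTED_COLUMNS)} "
                f"tab-separated fields, found {len(row)}"
            )
        for value in row:
            if " " in value or "-" in value:
                problems.append(f"data row {number}: {value!r} contains a space or '-'")
    return problems
```

An empty list means none of the listed rules were violated; it does not confirm that the FASTQ files named in the design file were actually uploaded.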
Select Your Data Files and Set Up Workflow
SELECT YOUR FILES
http://software.broadinstitute.org/gsea/msigdb/index.jsp
Project is running
Timeline of The Whole Run
Download/Visualize Your Results
Vizapp needs about 30s to start if there is no queue. You need to refresh the page.
You can also choose individual files to download to your local computer
Data source: E-GEOD-60424 - Next generation sequencing of human immune cell subsets across diseases
Comparisons
• Comparisons are based on SampleGroup
  – All pair-wise comparisons are run
  – Each comparison can be identified by its file name, e.g. A_B.edgeR.txt
  – The log fold change is reported for A relative to B (log of A/B)
  – If you want B relative to A, multiply logFC by -1
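The naming and sign convention above can be sketched in a few lines. Both helpers are hypothetical names, and the filename parser assumes group names contain no underscore, which the slides do not state. Negating logFC works because log(B/A) = -log(A/B).

```python
def parse_comparison(file_name):
    """Split 'A_B.edgeR.txt' into ('A', 'B'); logFC in that file is A relative to B."""
    suffix = ".edgeR.txt"
    if not file_name.endswith(suffix):
        raise ValueError(f"{file_name!r} does not end with {suffix}")
    # Assumes the first underscore separates the two group names.
    group_a, separator, group_b = file_name[: -len(suffix)].partition("_")
    if not separator or not group_a or not group_b:
        raise ValueError(f"{file_name!r} is not of the form A_B{suffix}")
    return group_a, group_b


def reverse_comparison(log_fold_changes):
    """Turn A-relative-to-B log fold changes into B-relative-to-A by negation."""
    return {gene: -value for gene, value in log_fold_changes.items()}


# Hypothetical gene names and values, for illustration only.
numerator, denominator = parse_comparison("Tumor_Normal.edgeR.txt")
flipped = reverse_comparison({"GENE1": 1.5, "GENE2": -2.0})
```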
Vizapp: Align Stats
Vizapp: Gene Type Stats
Vizapp: MDS and PCA Analysis
Vizapp: Sample Distances
Vizapp: Gene Compare
Vizapp: DEA
• Uses edgeR results
• Filter the gene list by different parameters
• Sort by different columns
• Download the data table
Vizapp: DEA Heatmap
• Filter the gene list by different parameters
• Choose different comparisons
• Supports user-defined gene lists (official gene symbols)
• Supports pathways
Vizapp: Quantitative Set Analysis for Gene Expression (QuSAGE)
http://bioconductor.org/packages/release/bioc/html/qusage.html
Vizapp: Quantitative Set Analysis for Gene Expression (QuSAGE)
Vizapp: Gene Alt Splicing
Different transcripts’ expression in sample groups
Vizapp: Alternative Splicing
Common Errors and Solutions
• Make sure the delimiter is tab
• Make sure the column names are the same as specified in the documentation
• Make sure the file names in the design file match the uploaded files
Common Errors and Solutions
• Not all files are uploaded
  – This is related to the proxy setting
  – Use the auto-detect proxy setting
Other Useful Tools
• Gene Set Enrichment Analysis (GSEA)
  – GSEA: http://software.broadinstitute.org/gsea/index.jsp
  – MSigDB: http://software.broadinstitute.org/gsea/msigdb/index.jsp
  – GenePattern: http://software.broadinstitute.org/cancer/software/genepattern/
  – Use countTable.logCPM.txt to generate a .gct file, or edgeR.results.txt to generate a .rnk file, in Excel as inputs
• User-designed heatmaps with Morpheus
  – https://software.broadinstitute.org/morpheus/
  – Use countTable.logCPM.txt to create a .txt file as input for Morpheus
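The slide suggests building the .gct and .rnk inputs in Excel; the same files can be produced with a short script. The sketch below writes the GCT 1.2 layout (a "#1.2" line, a row-count/column-count line, a header line, then one row per gene) and a two-column .rnk list. It works on in-memory values because the column layout of countTable.logCPM.txt and edgeR.results.txt is not shown in the deck. The function names, gene names, and the choice of logFC as the ranking score are all assumptions.

```python
def to_gct(sample_names, rows):
    """Build GCT 1.2 text from rows of (gene_name, [one value per sample])."""
    lines = [
        "#1.2",
        f"{len(rows)}\t{len(sample_names)}",
        "\t".join(["NAME", "Description"] + list(sample_names)),
    ]
    for gene, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{gene}: expected {len(sample_names)} values")
        # "na" is a placeholder description; any text is allowed in this column.
        lines.append("\t".join([gene, "na"] + [str(value) for value in values]))
    return "\n".join(lines) + "\n"


def to_rnk(scores):
    """Build .rnk text (gene<TAB>score), highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{gene}\t{score}\n" for gene, score in ranked)


# Hypothetical values for illustration only.
gct_text = to_gct(["N4", "T2"], [("GENE1", [1.0, 2.5]), ("GENE2", [0.0, 3.0])])
rnk_text = to_rnk({"GENE1": -1.5, "GENE2": 2.0})
```

Ranking by logFC alone is one common choice; a signed significance score is another, and which suits a given analysis is left to the user.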
Other Useful Tools
• Complex designs
  – Use countTable.txt as input for complex designs in the edgeR or DESeq2 R packages
  – https://bioconductor.org/packages/release/bioc/html/edgeR.html
  – https://bioconductor.org/packages/release/bioc/html/DESeq2.html
• Motif search/promoter analysis with HOMER motif search
  – Use edgeR.results.txt (differentially expressed gene list) as input
  – http://homer.salk.edu/homer/motif/
Acknowledgement
• Brandi Cantarel
• David Trudgian
• Beibei Chen
• BioHPC team https://portal.biohpc.swmed.edu/content/
• BICF team http://www.utsouthwestern.edu/labs/bioinformatics/