Next Generation Sequencing Analysis Web Toolkit

Updated for 2016-03-16

Genomic and transcriptomic sequencing are now commonplace in projects. Now very cheap!

Large NGS projects in departments throughout the


Now a routine tool for labs with little previous genomic / transcriptomic experience.

Most common experiment across the University:

Use RNA-Seq to identify gene expression changes in response to a stimulus or caused by a disease.

Next Generation Sequencing


Let’s focus on this today, but you can do other things on our systems!

Typical RNA Seq Workflow


Prepare / obtain samples for different conditions (A, B)

Extract RNA and prepare library for sequencing

Run library on Illumina sequencer

Obtain short-read sequences

Typical RNA Seq Workflow – Data Analysis


Check quality and/or filter reads

Align to the genome or transcriptome

Quantify transcript abundance across conditions

Identify significant differences in expression between conditions
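The four analysis steps above can be sketched as a small shell script. This is only an illustration under stated assumptions: the sample names, output directories, index path and annotation file are placeholders, and the script prints the commands (dry run) unless `DRY_RUN=0` is set. The tools shown (FastQC, TopHat, Cuffquant, Cuffdiff) are one possible choice for each step, not the only one.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the four analysis steps for single-end reads.
# Assumption: all paths and sample names below are placeholders.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

analyze_sample() {
    local sample="$1" index="$2" gtf="$3"
    # 1. Check read quality (FastQC needs an existing output directory)
    run mkdir -p "qc_${sample}"
    run fastqc -o "qc_${sample}/" "${sample}.fastq"
    # 2. Align reads to the genome
    run tophat -o "tophat_${sample}/" -p 8 "$index" "${sample}.fastq"
    # 3. Quantify transcript abundance for this sample
    run cuffquant -o "quant_${sample}/" -p 8 "$gtf" "tophat_${sample}/accepted_hits.bam"
}

compare_conditions() {
    local gtf="$1" a="$2" b="$3"
    # 4. Test for expression differences between the two conditions
    run cuffdiff -o diff_out/ -p 8 -L "$a,$b" "$gtf" \
        "quant_${a}/abundances.cxb" "quant_${b}/abundances.cxb"
}
```

A typical call would be `analyze_sample A /path/to/index genes.gtf`, repeated for sample B, followed by `compare_conditions genes.gtf A B`.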

Big Data, Big Problems


* Data from the NHGRI Genome Sequencing Program (GSP)

With sequencing throughput rapidly outpacing Moore’s Law for compute power, we find ourselves facing a major CPU and storage problem

e.g.: Running Cuffquant on a BioHPC compute node (32 cores, 128 GB)

Sample A: 3 MB, 94,428 reads, ~5 mins

Sample B: 4 GB, 157,530,392 reads, ~2 hours 32 mins
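As a rough worked calculation, the two Cuffquant timings above can be turned into reads per minute. The run times are approximate on the slide, so the results are only indicative; integer division is used for simplicity.

```shell
# Figures taken from the two Cuffquant runs quoted above.
reads_a=94428
minutes_a=5
reads_b=157530392
minutes_b=$((2 * 60 + 32))   # 2 hours 32 mins

# Integer reads-per-minute for each sample
rate_a=$((reads_a / minutes_a))
rate_b=$((reads_b / minutes_b))

echo "Sample A: ${rate_a} reads/min"
echo "Sample B: ${rate_b} reads/min"
echo "B has $((reads_b / reads_a))x the reads and takes $((minutes_b / minutes_a))x the time"
```

The larger sample has roughly 1,668 times as many reads and takes roughly 30 times as long, which suggests that fixed start-up cost accounts for much of the small run. Even so, a single large sample occupies a full node for hours.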

The BioHPC Solution: Easy Accessible Cloud Storage


[Diagram labels: 50 GB; BioHPC File Exchange; BioHPC Cluster compute nodes; Lamella Cloud Storage/Gateway; UTSW private cloud; 1-way access; 2-way access]

Large amounts of high-performance storage, easily accessible

Total Storage space: 3.7PB

The BioHPC Solution : Powerful Computational Resource


112 nodes – 128GB, 256GB, 384GB, GPU

CPU cores: 4700

GPU cores: 19968

Memory: 25 TB

Powerful compute cluster – run multiple tasks each faster than on a workstation

Nucleus Computer Cluster

The BioHPC Solution: Various BioHPC Tools to Help You


Batch Scripts & Command Line Tools

Various NGS tools available as modules on the cluster, for expert users.

Galaxy

Flexible environment with many tools, workflow designer, for advanced users.

NGS Web Toolkit

Simple workflows built from modules. Step-by-step with

customizable parameters.

Workflow Platform

Run standard workflow/pipeline from web, for beginners.

4 approaches for NGS analysis on BioHPC

More Flexible

Easier to use

Transfer and manage your sequence data

Retrieving and storing sequencing files (scp, ftp, removable hard disk)

Understanding the file system (file permissions etc.)

Understand and use command line tools
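A minimal sketch of the first two skills listed above: copying sequence files with `scp` and restricting permissions on a data directory. The user name, host name and paths are hypothetical placeholders, so the helper only prints the `scp` command rather than running it.

```shell
# Build (but do not run) an scp command for a directory of sequence files.
# Assumption: user, host and destination path are placeholders.
scp_command() {
    local src="$1" user="$2" host="$3" dest="$4"
    printf 'scp -r %s %s@%s:%s\n' "$src" "$user" "$host" "$dest"
}

# Restrict a data directory: owner has full access, group can read and
# enter it, everyone else has no access (mode 750).
lock_down() {
    chmod 750 "$1"
}
```

For example, `scp_command fastq_files/ myuser cluster.example.org /path/to/project/` prints the command to review before running it, and `lock_down fastq_files/` followed by `ls -ld fastq_files/` shows the resulting `drwxr-x---` mode.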

Option 1: NGS Analysis with Traditional Linux Command Line Tools


bowtie2-build genome.fa hg19

fastqc -o OutputDirectory/ inputFile.fastq

tophat -o TophatOutput/ -p 8 /programs/indexes/hg19 Experiment1.fastq

* Summarized from:

Option 1: NGS Analysis with Traditional Linux Command Line Tools


Software/tools: module avail

Genome Database: /project/apps_database/iGenomes

Common NGS tools and Illumina iGenome databases are available on the cluster.

Experts can write their own pipelines using cluster sbatch jobs.
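As a sketch of what such an sbatch job might look like, the helper below writes a small job script for a single TopHat alignment. Several details are assumptions: the module name `tophat`, the four-hour time limit and the log file name are illustrative, and no partition is specified because partition names are site-specific. Check `module avail` for the real module names.

```shell
# Write a SLURM batch script that aligns one sample with TopHat.
# Arguments: sample name, path to the Bowtie index, output script file.
write_tophat_job() {
    local sample="$1" index="$2" out="$3"
    cat > "$out" <<EOF
#!/bin/bash
#SBATCH --job-name=tophat_${sample}
#SBATCH --nodes=1
#SBATCH --time=04:00:00
#SBATCH --output=tophat_${sample}.%j.log

# Assumption: the module is called 'tophat'; confirm with 'module avail'.
module load tophat

tophat -o tophat_${sample}/ -p 8 ${index} ${sample}.fastq
EOF
}
```

Usage would be along the lines of `write_tophat_job Experiment1 /programs/indexes/hg19 job.sh` followed by `sbatch job.sh`. Generating one script per sample is a simple way to process many samples in parallel on the cluster.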

Option 2 - BioHPC Galaxy Service


BioHPC Portal -> Cloud Services -> Galaxy (

Reproducible workflows, with many available tools, via the web. Widely used by many institutions.

Separate Training Session: Galaxy at BioHPC (08/17/2016)

Option 3 - BioHPC NGS Pipeline


BioHPC Portal -> Cloud Services -> NGS Web Toolkit (

Provides an easy-to-use web interface to these command line-driven tools and allows users to run multiple sequencing samples simultaneously.

Option 4 - BioHPC Workflow Platform


Under Construction

Separate Training Session: Introduction to the BioHPC Workflow Platform (05/18/2016)

A web-based platform that gives easy access to standard workflows running on the BioHPC compute cluster.

NGS Web Toolkit


• Background
  o alpha release (April, 2015)
  o beta release (March, 2016)

• Prepare and Upload Data

• Demo: Follow a simple RNA-Seq differential expression analysis
  o Training Notes

• Access results

• Future Development
  o Application Improvements (new features & security enhancement)
  o User requests (new parameters, software, modules, etc.)

Background: Modules


[Workflow diagram modules: Raw Reads; Check Quality; Trim Reads; Map Reads; Processing Mapped Reads; Assemble Reads; Cuffmerge Transcripts; Cuffdiff; DESeq2-Diff]

[Legend: Existing Module; Optional Module; New Module]

* Users are encouraged to propose new modules and software.

DESeq2 can be used when each sample in a group has matched sample(s) in other group(s). For example, in a control vs. treatment experiment, a subject before and after treatment can be viewed as a pair.

Background: Genome Databases


If a reference genome is not available in our system, you may choose to Build Your Own.

Available Genome Databases

Human reference genomes : GRCh37, hg19

Mouse reference genomes : GRCm38, mm9

A ‘toy’ example we can show you in real time (hopefully!)

75,000 reads from chr19, extracted from a larger study

2 Conditions – brain tissue vs adrenal tissue

What’s the difference in expression for the limited number of transcripts we can see in this data?

What’s unique to the brain tissue?

Courtesy Galaxy Project, Illumina Body Map:

Example 1 – Brain vs Adrenal



NGS Web Toolkit Demo

See Handouts

Access results


• Download small files (~20MB) directly from web

• Create symbolic link of project folder

• Access data from BioHPC cluster/Lamella Gateway (Output Path)

* All results are read-only; the only change you can make is to delete a whole module from the web interface.
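If you prefer to create a symbolic link to the output path by hand from the cluster, a minimal sketch looks like this. The output path and link name are whatever you choose, and the helper refuses to overwrite anything that already exists at the link location.

```shell
# Create a symbolic link pointing at a project's output path.
# Arguments: the output path shown by the toolkit, and the link name to create.
link_project() {
    local output_path="$1" link_name="$2"
    if [ -e "$link_name" ] || [ -L "$link_name" ]; then
        echo "refusing to overwrite existing path: $link_name" >&2
        return 1
    fi
    ln -s "$output_path" "$link_name"
}
```

Because the results are read-only, the link is a convenient way to browse them from your own directory without copying large files.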



gene_exp.diff – Summary of differentially expressed genes

CELF5 & TUBB4A transcripts are present in brain tissue but not in adrenal tissue

Called as significant – but remember this is a toy example (no replicates etc.)
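To pull the significant genes out of `gene_exp.diff` on the command line, a short `awk` filter is usually enough. This sketch assumes the standard tab-separated Cuffdiff layout, in which column 3 is the gene name, column 10 the log2 fold change, column 13 the q-value and column 14 the `significant` flag (`yes` / `no`). Check the header line of your own file before relying on the column numbers.

```shell
# Print gene name, log2 fold change and q-value for rows that Cuffdiff
# marked as significant. The first line (header) is skipped.
significant_genes() {
    awk -F'\t' 'NR > 1 && $14 == "yes" { print $3 "\t" $10 "\t" $13 }' "$1"
}
```

Running `significant_genes gene_exp.diff` in the Cuffdiff output folder lists only the genes flagged as significant, which is easier to scan than the full table.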



Yes – Antibody staining data for CELF5 agrees with our findings here.

CELF5 is an RNA-binding protein expressed in the brain, implicated in the regulation of pre-mRNA alternative splicing.



We would like to thank Dr. Zhiyu (Sylvia) Zhao and Dr. Liang Shi from the Children's Research Institute for their development of, and assistance with, this RNA-Seq pipeline.

Future Development and Acknowledgements


Application Improvements

o New features (e.g. linking data for customized data)

o Security enhancement (HIPAA)

* Note: Contact us if you want to upload any identifiable/confidential data

Upon user requests

o Add new software and parameters

o Design new modules

o Develop other web-based applications

* Send Email to:
