DESCRIPTION
Unprecedented computational capabilities and high-throughput data collection methods promise a new era of personalised, evidence-based healthcare, utilising individual genomic profiles to tailor health management, as demonstrated by recent successes in rare genetic disorders and stratified cancer treatments. However, processing genomic information at a scale relevant for the health system remains challenging due to high demands on data reproducibility and data provenance. Furthermore, the computational requirements call for a large investment in compute hardware and IT personnel, which is a barrier to entry for small laboratories and difficult to maintain at peak times for larger institutes. This hampers the creation of time-reliable production informatics environments for clinical genomics. Commercial cloud computing frameworks, such as Amazon Web Services (AWS), provide an economical alternative to in-house compute clusters as they allow outsourcing of computation to third-party providers while retaining software and compute flexibility. To cater for this resource-hungry, fast-paced yet sensitive environment of personalised medicine, we developed NGSANE, a Linux-based, HPC-enabled framework that minimises the overhead of setting up and processing new projects yet maintains full flexibility of custom scripting and data provenance when processing raw sequencing data either on a local cluster or on Amazon’s Elastic Compute Cloud (EC2).
Denis C. Bauer | Bioinformatics | @allPowerde | 08 July 2014
CSIRO COMPUTATIONAL INFORMATICS
Population-scale high-throughput sequencing data analysis
By Melody
Talk Overview
• Background: CSIRO/Omics Project
• Methods: NGS Data Processing on HPC/Cloud
• Research Outcome: Cancer and Microbes in Colorectal Cancer
62% of our people hold university degrees (2000 doctorates, 500 masters)
With our university partners, we develop 650 postgraduate research students
Top 1% of global research institutions in 14 of 22 research fields
Top 0.1% in 4 research fields
CSIRO: Who we are
Sites: Darwin, Alice Springs, Geraldton (2 sites), Atherton, Townsville (2 sites), Rockhampton, Toowoomba, Gatton, Myall Vale, Narrabri, Mopra, Parkes, Griffith, Belmont, Geelong, Hobart, Sandy Bay, Wodonga, Newcastle, Armidale (2 sites), Perth (3 sites), Adelaide (2 sites), Sydney (5 sites), Canberra (7 sites), Murchison, Cairns, Irymple, Melbourne (5 sites), Werribee (2 sites), Brisbane (6 sites), Bribie Island
People: 6500
Divisions: 13
Locations: 58
Flagships: 11
Budget: $1B+
The Commonwealth Scientific and Industrial Research Organisation
Our business units
12 Research Divisions
11 National Research Flagships
+ National Research Facilities and Collections
FOOD, HEALTH & LIFE SCIENCE INDUSTRIES
ENVIRONMENT
MANUFACTURING, MATERIALS & MINERALS
ENERGY
INFORMATION & COMMUNICATIONS
+ Transformational Capability Platforms
Our track record: top inventions
1. FAST WLAN (Wireless Local Area Network)
2. POLYMER BANKNOTES
3. RELENZA FLU TREATMENT
4. EXTENDED WEAR CONTACTS
5. AEROGARD
6. TOTAL WELLBEING DIET
7. RAFT POLYMERISATION
8. BARLEYMAX
9. SELF TWISTING YARN
10. SOFTLY WASHING LIQUID
Part 1: The ‘omics project
The goal of the project is to investigate the susceptibility to colorectal cancer in the context of obesity and the gut microbiome
Data from Pilot Study
Full Cohort: 500 (178 to date) individuals from colorectal resection at the John Hunter Hospital, Newcastle Private Hospital and Royal Newcastle Centre (surgeons Dr Brian Draganic, Dr Peter Pockney & Dr Steve Smith) organized by Dr Desma Grice and Prof Rodney Scott (University of Newcastle)
• Objective: capture genomic variants reliably in tumour, normal and adipose tissue.
• Sequence effort:
  • 12 tumour -> 6 lanes (2-plex)
  • 12 normal -> 3 lanes (4-plex)
  • 12 adipose -> 3 lanes (4-plex)
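The lane counts in the sequence effort follow from dividing the number of libraries by the multiplexing level and rounding up. A minimal sketch of that arithmetic, assuming one library per sample and tissue (the function name lanes_needed is hypothetical):

```shell
#!/usr/bin/env bash
# Lanes needed when `samples` libraries are pooled `plex` per lane (rounded up).
lanes_needed() {
    local samples=$1 plex=$2
    echo $(( (samples + plex - 1) / plex ))
}

tumour=$(lanes_needed 12 2)    # 2-plex
normal=$(lanes_needed 12 4)    # 4-plex
adipose=$(lanes_needed 12 4)   # 4-plex
echo "tumour=$tumour normal=$normal adipose=$adipose total=$(( tumour + normal + adipose ))"
# prints: tumour=6 normal=3 adipose=3 total=12
```

Halving the plex level for the tumour libraries doubles their share of a lane, which is where the additional depth comes from.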
Considerations before sequencing: Undersampling
More depth needed due to potentially low cellularity in the tumour sample
Figure: tumour sample vs normal sample, with additional depth for the tumour sample
• Objective: process samples avoiding confounding factors
Considerations before sequencing: Flowcell design
Figure: Normal, Adipose and Tumour libraries of subjects L1, L2, O1 and O2. Two layouts are compared: "Sequence on one lane each" versus 4-plex pools per tissue "Sequenced over 3 lanes".
Subject every sample to the same lane and flowcell effects by multiplexing (labelling every sample with an identifying barcode)
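The multiplexing idea can be sketched as a layout in which every lane carries every barcoded sample, so lane and flowcell effects hit all samples alike. This is an illustration only, not the project's actual sample sheet: the function name multiplex_layout and the barcode labels BC1 to BC4 are hypothetical.

```shell
#!/usr/bin/env bash
# Print one line per (lane, sample): every lane carries every barcoded sample.
multiplex_layout() {
    local lanes=$1; shift
    local lane i sample
    for (( lane = 1; lane <= lanes; lane++ )); do
        i=1
        for sample in "$@"; do
            # Columns: lane, hypothetical barcode label, sample name
            printf 'lane%d\tBC%d\t%s\n' "$lane" "$i" "$sample"
            i=$(( i + 1 ))
        done
    done
}

# One 4-plex pool (two lean, two obese subjects) sequenced over 3 lanes
multiplex_layout 3 L1 L2 O1 O2
```

Because each subject appears once in every lane, a lane-specific artefact cannot be mistaken for a lean versus obese difference.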
• Population-scale sequencing with more samples than Illumina barcodes: an imbalanced flowcell design splits samples and pairs the halves with different partners (e.g. Lean Subject 1.1 + Obese Subject 1.1; Lean Subject 1.2 + Obese Subject 3.2)
Considerations for Omics Proj.: Flowcell design
Figure: flowcell layout for Normal (4-plex), Adipose (4-plex) and Tumour (2-plex) libraries across Lane1 to Lane8, 12 lanes in total. Split samples are paired as L1.1 + O1.1, L2.1 + O2.1, L3.1 + O3.1, L4.1 + O4.1, L1.2 + O3.2, L2.2 + O4.2, L3.2 + O1.2, L4.2 + O2.2.
L=Lean O=Obese
L1.1=Lean individual 1 part 1 (of 2) ...
Auer PL, Doerge RW. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781
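The split-and-pair scheme can be written down as a small generator. This is a sketch that assumes four lean and four obese subjects and a partner offset of two for the second halves, which is how the slide's example (Lean Subject 1.2 with Obese Subject 3.2) reads; the function name pairing_layout is hypothetical.

```shell
#!/usr/bin/env bash
# First halves pair lean subject i with obese subject i; second halves pair
# lean subject i with the obese subject shifted by `shift_by` positions.
pairing_layout() {
    local n=$1 shift_by=$2 i partner
    for (( i = 1; i <= n; i++ )); do
        echo "Lane$i: L$i.1 + O$i.1"
    done
    for (( i = 1; i <= n; i++ )); do
        partner=$(( (i - 1 + shift_by) % n + 1 ))
        echo "Lane$(( n + i )): L$i.2 + O$partner.2"
    done
}

pairing_layout 4 2
```

With these arguments the fifth line reads "Lane5: L1.2 + O3.2", so each lean subject shares a lane with two different obese partners.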
Blue Monster says
Design your experiment with project-specific pitfalls in mind
Auer PL et al. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781
Part 2: NGS Data Processing
Minimize project set-up overhead while providing easily adaptable processing modules for NGS analysis on high-performance-compute clusters/cloud
Resource consumption for Variant Calling
qsub -t 1-36 task.qsub
Script
Submission
Scheduler
Processing 36 samples (2.7 TB of data) requires on average 128 hours of CPU time (standard error = 15) and 77 GB of RAM (standard error = 0.34)
Figure axes: CPU (hours), Real time (hours), Memory (GB)
#PBS -l nodes=2:ppn=8
High-Performance-Compute
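A job array such as the one submitted with qsub -t 1-36 runs the same script once per index. The sketch below shows how a task.qsub body might map its index to a sample. It assumes a Torque-style scheduler, which exports the index as PBS_ARRAYID; the file name samples.txt, the variable SAMPLE_LIST and the helper sample_for_task are hypothetical.

```shell
#!/usr/bin/env bash
#PBS -l nodes=2:ppn=8
# Sketch of a task.qsub body for: qsub -t 1-36 task.qsub
# Assumption: Torque-style arrays export the task index as PBS_ARRAYID.

# Return the sample name on line $1 of the list file $2.
sample_for_task() {
    sed -n "${1}p" "$2"
}

task_id=${PBS_ARRAYID:-1}            # fall back to 1 outside the scheduler
list=${SAMPLE_LIST:-samples.txt}     # hypothetical list, one sample per line
if [ -r "$list" ]; then
    echo "task $task_id processes $(sample_for_task "$task_id" "$list")"
fi
```

Each of the 36 tasks then works on its own sample, and the scheduler decides how many run at the same time.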
doi:10.1038/nbt.2421
Tailored processing for different sequencing applications
Wet-lab Protocols
Production Informatics
Variant Calling
Methylation Sites
Gene Expression
Despite different approaches we want to use the same processing framework!
Wish list for a framework
• reusability
• cutting edge
• data security
• HPC environment
• reproducibility
• robustness
• adaptability
• knowledge transfer (publication)
• efficient
Assess experimental success quickly
DEMO - files
Project X
  fastq
    Exp1
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2
      Run3_read1.fastq
We can start from raw fastq files: here 3 files (Run1-3) in 2 different conditions (Exp1-2)
DEMO – setting up config file
#********************
# Data
#********************
declare -a DIR; DIR=( Exp1 Exp2 )
#********************
# Tasks
#********************
RUNMAPPINGBOWTIE2="1" # mapping with bowtie2
#********************
# Paths
#********************
# reference genome
FASTA=/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa
We specify the folders NGSANE should run on and what to do (here: bowtie2 mapping). We can also specify project-specific settings (here: use the iGenomes reference)
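Because the config file is plain bash, a driver script could load it with source and loop over the DIR array. The sketch below illustrates that idea with a cut-down config; it is not NGSANE's actual trigger.sh, and the helper list_tasks is hypothetical.

```shell
#!/usr/bin/env bash
# Write a cut-down config to a temporary file ...
config=$(mktemp)
cat > "$config" <<'EOF'
declare -a DIR; DIR=( Exp1 Exp2 )
RUNMAPPINGBOWTIE2="1"
EOF

# ... then load it and report which task is switched on for each folder.
list_tasks() {
    local dir
    source "$1"    # defines DIR and RUNMAPPINGBOWTIE2 for this function
    for dir in "${DIR[@]}"; do
        if [ "${RUNMAPPINGBOWTIE2:-0}" = "1" ]; then
            echo "$dir: bowtie2"
        fi
    done
}

list_tasks "$config"
rm -f "$config"
```

Keeping data folders, task switches and paths in one sourced file means a new project only needs a new config, not new scripts.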
DEMO – dry run
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt
[NGSANE] Trigger mode: [empty] (dry run)
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup environment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
We run NGSANE in dry run to test what jobs it would submit
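The dry-run idea is a common safety pattern: list what would be done by default and act only when an explicit mode is given. A minimal sketch of that pattern follows; it is an illustration rather than NGSANE's implementation, the function trigger_sketch is hypothetical, and the submission step is replaced by an echo so the example runs without a scheduler.

```shell
#!/usr/bin/env bash
# List every input; only "submit" when the mode argument is "armed".
trigger_sketch() {
    local mode=${1:-}
    [ $# -gt 0 ] && shift
    local fastq
    for fastq in "$@"; do
        echo "[TODO] $fastq"
        if [ "$mode" = "armed" ]; then
            # A real run would hand the job to the scheduler here
            echo "[ JOB] submitted $fastq"
        fi
    done
}

trigger_sketch "" Exp1/Run1_read1.fastq Exp2/Run3_read1.fastq   # dry run
trigger_sketch armed Exp1/Run1_read1.fastq                        # armed
```

Making the harmless mode the default means a mistyped command lists jobs instead of launching them.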
DEMO – submit
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt armed
[NGSANE] Trigger mode: armed
Double check! Then type safetyoff and hit enter to launch the job: safetyoff
... take cover!
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup environment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424899
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424900
[NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
Jobnumber 2424901
We submit HPC jobs. Check out the returned qsub identifiers.
DEMO – scheduler
bau04c@burnet-login:/NGSANEDEMO> qstat -u bau04c
burnet-srv.idpx.hpsc.csiro.au:
                                                                               Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
2424899.burnet-s     bau04c      normal   NGs_bowtie2_RunM   9085     1      2     -- 00:05 R 00:00
2424900.burnet-s     bau04c      normal   NGs_bowtie2_RunM   9178     1      2     -- 00:05 R 00:00
2424901.burnet-s     bau04c      normal   NGs_bowtie2_RunM   9353     1      2     -- 00:05 R 00:00
Three HPC jobs run in parallel because there were three fastq files. There is no limit to the number of files that can be processed in parallel: easy scale-up to populations.
DEMO – report
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt html
[NGSANE] Trigger mode: html
>>>>> Generate HTML report
>>>>> startdate Fri Jan 24 08:02:37 EST 2014
>>>>> hostname burnet-login
>>>>> makeSummary.sh -k /NGSANEDEMO/config.txt
--R
--R version 3.0.0 (2013-04-03) -- "Masked Marvel"
--Python
--Python 2.7.2
QC - bowtie2
>>>>> Generate HTML report - FINISHED
>>>>> enddate Fri Jan 24 08:02:39 EST 2014
More report examples
Now create the HTML overview page to check whether the jobs finished successfully and what the results are (bowtie2: mapping statistics)
DEMO - files
Project X
  Summary HTML
  Exp1
    Bowtie
      Run1.bam
      Run2.bam
  Exp2
    Bowtie
      Run3.bam
  fastq
    Exp1
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2
      Run3_read1.fastq
The resulting file structure: every experiment has a folder with the tasks as subfolders and in them the results (here: bam files)
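The mapping from an input file to its result location can be sketched with basename and dirname. The naming below follows the demo log (fastq/Exp1/Run1_read1.fastq leads to Exp1/bowtie2/Run1.asd.bam); the helper bam_for_fastq is hypothetical and the suffixes are taken from that log rather than from NGSANE's code.

```shell
#!/usr/bin/env bash
# Derive <experiment>/<task>/<run>.asd.bam from a fastq path and a task name.
bam_for_fastq() {
    local fastq=$1 task=$2
    local exp run
    exp=$(basename "$(dirname "$fastq")")    # e.g. Exp1
    run=$(basename "$fastq" _read1.fastq)    # e.g. Run1
    echo "$exp/$task/$run.asd.bam"
}

bam_for_fastq fastq/Exp1/Run1_read1.fastq bowtie2
# prints: Exp1/bowtie2/Run1.asd.bam
```

A predictable layout like this lets later steps, such as the summary report, find each task's results without extra bookkeeping.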
NGSANE currently supports
• Transfer data (smbclient)
• Quality Control (GATK, FastQC, RNA-SeQC, custom summaries, user code)
• Trimming (Cutadapt, Trim Galore, Trimmomatic)
• Mapping (BWA, Bowtie1, Bowtie2, TopHat)
• Transcript Quantification (Cufflinks, HTSeq, bedtools)
• Variant calling (GATK, samtools)
• Variant annotation (ANNOVAR)
• 3D Genome structure (HiCUP, Fit-Hi-C, hiclib, HOMER)
For details see https://github.com/BauerLab/ngsane/wiki/How-to-use-the-virtual-machine
Blue Monster says
Analyze your data to be reproducible and well documented with tools that scale well to larger datasets
Buske FA et al. NGSANE: a lightweight production informatics framework for high-throughput data analysis. Bioinformatics. 2014 PMID: 24470576
Part 3: Combining Omics Data
Seeing the full picture requires taking all information into account
Result overview: traditional differential analysis
1. 722 genes differentially expressed (DE) between tumour and normal
   • QC: good concordance with genes known to be up/down regulated in CRC
2. 841 differentially methylated (DM) genomic regions, mostly hypermethylated
   • QC: good concordance with previously reported gut methylation profile
Fernandez et al. Genome Res. 2012
CSIRO in-house
Figure panels: Known DE gene; Known DM locations
Microbial Population: traditional population survey
Paul Greenfield
Data integration
(image credit: Francis Tabary)
DNA methylation: Blood signatures in Adipose and Gut samples
Tim Peters
Some gut/adipose samples have blood-like signatures.
Exonseq: blood-signatures stem from a blood-plasma protein
Contamination by ADM2, a gene expressed in blood plasma
Figure axes: Individuals; Contamination (%); expression
Plasma protein ADM2 makes up most of the human material in the digesta (number of reads mapping to human genome)
Medical History: Blood potentially resulting from medication
CARTIA: 14, 50, 57
WARFARIN: 40
ASPIRIN: 59, 7
COPLAVIX: 12
No anti-clotting drug: 2, 62, 4
No medication: 19, 20
Wilcoxon rank sum test p-value = 0.02
Anti-thrombosis drugs significantly enriched in individuals with human material in digesta.
Microbial data: Blood “liking” opportunistic bacteria are enriched in contaminated samples
E. coli, Salmonella, etc.
Opportunistic pathogens. Respond to inflammation and bleeding
Bacterial marker for low-level chronic gut bleeding?
Blue Monster says
Integrating different ‘omics data is still a challenge.
Three things to remember
• Good experimental design is necessary (even) in sequencing experiments
• Reproducible, documented data analysis is key (e.g. NGSANE, a lightweight flexible tool for large-scale sequence data analysis on high-performance systems and Amazon’s elastic cloud)
• Promising research opportunities are in the integration of multiple high-throughput data sources
COMPUTATIONAL INFORMATICS
Thank you
Computational Informatics
Denis C. Bauer
t +61 2 9123 4567
e [email protected]
www.csiro.au/bioinformatics
Buske et al., Bioinformatics, Jan 2014
More talks online: http://www.slideshare.net/allPowerde
Twitter: @allPowerde
Fabian A. Buske, Susan Clark, Hugh French, Martin Smith: Garvan Institute of Medical Research, Sydney, Australia
Robert Dunne, Tim Peters, Paul Greenfield, Piotr Szul, Tomasz Bednarz: Computational Informatics, CSIRO, Australia
Garry Hannan: Animal, Food and Health Sciences, CSIRO, Australia
Rodney Scott: University of Newcastle, Australia
Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO's Transformational Capability Platform; CSIRO’s IM&T; Science and Industry Endowment Fund
http://www.genome-engineering.com.au/