Population-scale high-throughput sequencing data analysis

Denis C. Bauer | Bioinformatics | @allPowerde | 08 July 2014

CSIRO COMPUTATIONAL INFORMATICS


By Melody

Talk Overview


• Background: CSIRO/Omics Project

• Methods: NGS Data Processing on HPC/Cloud

• Research Outcome: Cancer and Microbes in Colorectal Cancer


• 62% of our people hold university degrees: 2000 doctorates, 500 masters
• With our university partners, we develop 650 postgraduate research students
• Top 1% of global research institutions in 14 of 22 research fields
• Top 0.1% in 4 research fields

CSIRO: Who we are

[Map of CSIRO sites: Darwin; Alice Springs; Geraldton (2 sites); Atherton; Townsville (2 sites); Rockhampton; Toowoomba; Gatton; Myall Vale; Narrabri; Mopra; Parkes; Griffith; Belmont; Geelong; Hobart; Sandy Bay; Wodonga; Newcastle; Armidale (2 sites); Perth (3 sites); Adelaide (2 sites); Sydney (5 sites); Canberra (7 sites); Murchison; Cairns; Irymple; Melbourne (5 sites); Werribee (2 sites); Brisbane (6 sites); Bribie Island]

People: 6500
Divisions: 13
Locations: 58
Flagships: 11
Budget: $1B+

The Commonwealth Scientific and Industrial Research Organisation


Our business units

12 Research Divisions
11 National Research Flagships
+ National Research Facilities and Collections
+ Transformational Capability Platforms

• FOOD, HEALTH & LIFE SCIENCE INDUSTRIES
• ENVIRONMENT
• MANUFACTURING, MATERIALS & MINERALS
• ENERGY
• INFORMATION & COMMUNICATIONS


Our track record: top inventions

1. FAST WLAN (Wireless Local Area Network)
2. POLYMER BANKNOTES
3. RELENZA FLU TREATMENT
4. EXTENDED WEAR CONTACTS
5. AEROGARD
6. TOTAL WELLBEING DIET
7. RAFT POLYMERISATION
8. BARLEYMAX
9. SELF TWISTING YARN
10. SOFTLY WASHING LIQUID


Part 1: The ‘omics project

The goal of the project is to investigate the susceptibility to colorectal cancer in the context of obesity and the gut microbiome


Data from Pilot Study

Full cohort: 500 individuals (178 to date) undergoing colorectal resection at the John Hunter Hospital, Newcastle Private Hospital and Royal Newcastle Centre (surgeons Dr Brian Draganic, Dr Peter Pockney and Dr Steve Smith), organised by Dr Desma Grice and Prof Rodney Scott (University of Newcastle)

• Objective: reliably capture genomic variants in tumour, normal and adipose tissue.

• Sequencing effort:
  • 12 tumour -> 6 lanes (2-plex)
  • 12 normal -> 3 lanes (4-plex)
  • 12 adipose -> 3 lanes (4-plex)

Considerations before sequencing: Undersampling

More depth needed due to potentially low cellularity in the tumour sample

[Figure: sequencing depth of the tumour sample compared with the normal sample, showing the additional depth for the tumour]
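The lane allocation of the pilot gives a feel for how much extra depth the tumour samples receive. The sketch below is only an illustration: it assumes every lane yields the same number of reads and that the share of reads coming from tumour cells equals the tumour cellularity. Neither assumption is stated on the slides, and the function names are hypothetical.

```python
from fractions import Fraction

def lane_share(samples: int, lanes: int) -> Fraction:
    """Fraction of one lane's output that each sample receives."""
    return Fraction(lanes, samples)

def tumour_cell_share(share: Fraction, cellularity: Fraction) -> Fraction:
    """Lane share attributable to tumour cells (assumes reads scale with cellularity)."""
    return share * cellularity

tumour = lane_share(samples=12, lanes=6)  # 2-plex: half a lane per sample
normal = lane_share(samples=12, lanes=3)  # 4-plex: a quarter of a lane per sample

relative_depth = tumour / normal          # how much deeper tumour is sequenced
break_even = normal / tumour              # cellularity at which tumour-cell depth matches normal

print("tumour vs normal depth:", relative_depth)
print("break-even cellularity:", break_even)
```

Under these assumptions, doubling the depth compensates for a tumour sample in which only half of the cells are tumour cells; lower cellularity would need more lanes per sample.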


• Objective: process samples avoiding confounding factors

Considerations before sequencing: Flowcell design

[Figure: flowcell layouts for samples L1, L2, O1 and O2 from Normal, Adipose and Tumour tissue. Panel labels: "Sequence on one lane each" and "Sequenced over 3 lanes" (4-plex for each tissue)]

Subject every sample to the same lane and flowcell effects by multiplexing (labelling every sample with an identifying barcode)
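The multiplexing idea can be sketched as follows: every sample gets its own barcode and the barcoded pool is loaded on every lane, so no sample is tied to a single lane. This is a minimal illustration; the barcode names (`BC1`, ...) and lane names are hypothetical placeholders.

```python
def multiplex(samples, lanes):
    """Give every sample its own barcode and place the barcoded pool on every lane."""
    # Hypothetical barcode names, one per sample
    barcodes = {sample: f"BC{i + 1}" for i, sample in enumerate(samples)}
    # Each lane carries the complete pool, so lane effects hit all samples equally
    return {lane: [(sample, barcodes[sample]) for sample in samples] for lane in lanes}

layout = multiplex(["L1", "L2", "O1", "O2"], ["lane1", "lane2", "lane3"])
for lane, pool in layout.items():
    print(lane, pool)
```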


• Population-scale sequencing with more samples than Illumina barcodes: an imbalanced flowcell design splits each sample and pairs the halves with different partners (e.g. Lean Subject 1.1 + Obese Subject 1.1; Lean Subject 1.2 + Obese Subject 3.2)

Considerations for Omics Proj.: Flowcell design

[Figure: flowcell layout; each pairing is shown for Normal, Adipose and Tumour]

Lane 1: L1.1 + O1.1
Lane 2: L2.1 + O2.1
Lane 3: L3.1 + O3.1
Lane 4: L4.1 + O4.1
Lane 5: L1.2 + O3.2
Lane 6: L2.2 + O4.2
Lane 7: L3.2 + O1.2
Lane 8: L4.2 + O2.2

Plex labels: 4-plex, 4-plex, 2-plex
12 lanes
L = Lean, O = Obese
L1.1 = Lean individual 1, part 1 (of 2) ...

Auer PL, Doerge RW. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781
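The pairing pattern of the layout above can be generated programmatically. The sketch below assumes four lean and four obese subjects, each split into two parts, and a partner shift of two for the second part; the shift is inferred from the lanes shown, not stated on the slide, and the function name is hypothetical.

```python
def split_pairing(n_subjects=4, shift=2):
    """Pair part 1 of each lean subject with the matching obese subject,
    and part 2 with an obese subject shifted by `shift` positions."""
    lanes = []
    for part in (1, 2):
        offset = 0 if part == 1 else shift
        for i in range(n_subjects):
            partner = (i + offset) % n_subjects
            lanes.append((f"L{i + 1}.{part}", f"O{partner + 1}.{part}"))
    return lanes

lanes = split_pairing()
for number, (lean, obese) in enumerate(lanes, start=1):
    print(f"Lane {number}: {lean} + {obese}")
```

Because the two halves of a sample meet different partners, a lane effect cannot be mistaken for an effect of one particular lean/obese pair.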


Blue Monster says

Design your experiment with project-specific pitfalls in mind

Auer PL et al. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781


Part 2: NGS Data Processing

Minimize project set-up overhead while providing easily adaptable processing modules for NGS analysis on high-performance-compute clusters/cloud


Resource consumption for Variant Calling

Script: #PBS -l nodes=2:ppn=8

Submission: qsub -t 1-36 task.qsub

Scheduler: High-Performance-Compute

Processing 36 samples (2.7 TB of data) requires on average 128 hours of CPU time (standard error = 15) and 77 GB of RAM (standard error = 0.34)

[Figure: CPU (hours), real time (hours) and memory (GB)]
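How the script directive and the array submission fit together can be sketched as below. This assumes a Torque-style scheduler, where `qsub -t` starts one task per index and exposes the index as `PBS_ARRAYID`; the per-sample command `./process_sample.sh` is a hypothetical placeholder, not part of the talk.

```python
def array_job_script(nodes, ppn, command):
    """Build the text of a job script that runs `command` once per array index."""
    lines = [
        "#!/bin/bash",
        f"#PBS -l nodes={nodes}:ppn={ppn}",
        "# PBS_ARRAYID holds the index of this task within the array (Torque assumption)",
        f'{command} "$PBS_ARRAYID"',
    ]
    return "\n".join(lines) + "\n"

def submit_command(first, last, script_name):
    """Argument list for submitting `script_name` as an array job."""
    return ["qsub", "-t", f"{first}-{last}", script_name]

script = array_job_script(nodes=2, ppn=8, command="./process_sample.sh")
command = submit_command(1, 36, "task.qsub")

print(script)
print(" ".join(command))
```

The sketch only builds the script text and the command line; nothing is submitted.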


doi:10.1038/nbt.2421

Tailored processing for different sequencing applications

Wet-lab Protocols | Production Informatics

• Variant Calling
• Methylation Sites
• Gene Expression

Despite different approaches we want to use the same processing framework!


Wish list for a framework

• reusability
• cutting edge
• data security
• HPC environment
• reproducibility
• robustness
• adaptability
• knowledge transfer (publication)
• efficient




Assess experimental success quickly


DEMO - files

Project X
  fastq
    Exp1
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2
      Run3_read1.fastq

We can start from raw fastq files: here 3 files (Run1-3) in 2 different conditions (Exp1-2)


DEMO – setting up config file

#********************
# Data
#********************
declare -a DIR; DIR=( Exp1 Exp2 )

#********************
# Tasks
#********************
RUNMAPPINGBOWTIE2="1" # mapping with bowtie2

#********************
# Paths
#********************
# reference genome
FASTA=/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa


We specify the folders NGSANE should run on and what to do (here: bowtie2 mapping). We can also specify project-specific settings (here: use iGenomes)
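To see what such a config file encodes, the sketch below pulls the folder list and the enabled tasks out of the text. It is only an illustration of the file's content and not how NGSANE itself reads the file; the function name and regular expressions are assumptions.

```python
import re

CONFIG = '''
declare -a DIR; DIR=( Exp1 Exp2 )
RUNMAPPINGBOWTIE2="1" # mapping with bowtie2
FASTA=/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa
'''

def read_config(text):
    """Return (folders, enabled tasks) from a bash-style config text."""
    # Folder names listed inside DIR=( ... )
    folders = re.search(r"DIR=\(\s*(.*?)\s*\)", text).group(1).split()
    # Task switches are variables starting with RUN that are set to 1
    tasks = re.findall(r'^(RUN\w+)="?1"?', text, flags=re.MULTILINE)
    return folders, tasks

folders, tasks = read_config(CONFIG)
print("Folders:", folders)
print("Tasks:", tasks)
```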

DEMO – dry run

bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt
[NGSANE] Trigger mode: [empty] (dry run)
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup enviroment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2

We run NGSANE in dry-run mode to check which jobs it would submit
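The logic of a dry run, listing what would be done without submitting anything, can be sketched as follows. The mapping from input file to output path mirrors the names in the log above (`<folder>/bowtie2/<run>.asd.bam`); the function itself is hypothetical and not NGSANE code.

```python
from pathlib import PurePosixPath

def plan_jobs(fastq_files, task="bowtie2", suffix=".asd.bam"):
    """Map each read-1 fastq file to the output a mapping job would create."""
    jobs = []
    for path in map(PurePosixPath, fastq_files):
        folder = path.parent.name                       # e.g. Exp1
        run = path.name.replace("_read1.fastq", "")     # e.g. Run1
        jobs.append({"input": str(path), "output": f"{folder}/{task}/{run}{suffix}"})
    return jobs

jobs = plan_jobs([
    "Exp1/Run1_read1.fastq",
    "Exp1/Run2_read1.fastq",
    "Exp2/Run3_read1.fastq",
])
for job in jobs:
    print("[TODO]", job["input"], "->", job["output"])
```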


DEMO – submit

bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt armed
[NGSANE] Trigger mode: armed
Double check! Then type safetyoff and hit enter to launch the job: safetyoff
... take cover!
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup environment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424899
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424900
[NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
Jobnumber 2424901

We submit the HPC jobs. Check the returned qsub identifiers.


DEMO – scheduler

bau04c@burnet-login:/NGSANEDEMO> qstat -u bau04c

burnet-srv.idpx.hpsc.csiro.au:
                                                                               Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
2424899.burnet-s     bau04c      normal   NGs_bowtie2_RunM   9085     1      2     -- 00:05 R 00:00
2424900.burnet-s     bau04c      normal   NGs_bowtie2_RunM   9178     1      2     -- 00:05 R 00:00
2424901.burnet-s     bau04c      normal   NGs_bowtie2_RunM   9353     1      2     -- 00:05 R 00:00

Three HPC jobs run in parallel because there were three fastq files. There is no limit on the number of files processed in parallel, so scaling up to populations is easy.


DEMO – report

bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt html
[NGSANE] Trigger mode: html
>>>>> Generate HTML report
>>>>> startdate Fri Jan 24 08:02:37 EST 2014
>>>>> hostname burnet-login
>>>>> makeSummary.sh -k /NGSANEDEMO/config.txt
--R
--R version 3.0.0 (2013-04-03) -- "Masked Marvel"
--Python
--Python 2.7.2
QC - bowtie2
>>>>> Generate HTML report - FINISHED
>>>>> enddate Fri Jan 24 08:02:39 EST 2014

Now create the HTML overview page to check whether the jobs finished successfully and what the results are (bowtie2: mapping statistics)

More report examples: http://www.hpsc.csiro.au/users/bau04c/datahome/Sandbox/smokebox_ngsane/smokebox/result/

DEMO - files

Project X
  Summary HTML
  Exp1
    Bowtie
      Run1.bam
      Run2.bam
  Exp2
    Bowtie
      Run3.bam
  fastq
    Exp1
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2
      Run3_read1.fastq

The resulting file structure: every experiment has a folder with the tasks as subfolders and in them the results (here: bam files)


NGSANE Currently supports

• Transfer data (smbclient)
• Quality Control (GATK, FastQC, RNA-SeQC, custom summaries, user code)
• Trimming (Cutadapt, Trimgalore, Trimmomatic)
• Mapping (BWA, Bowtie1, Bowtie2, Tophat)
• Transcript Quantification (cufflinks, htseq, bedtools)
• Variant calling (GATK, samtools)
• Variant annotation (annovar)
• 3D Genome structure (Hicup, fit-hi-c, Hiclib, Homer)


For details see https://github.com/BauerLab/ngsane/wiki/How-to-use-the-virtual-machine


Blue Monster says

Analyze your data reproducibly and document it well, using tools that scale to larger datasets

Buske FA et al. NGSANE: a lightweight production informatics framework for high-throughput data analysis. Bioinformatics. 2014 PMID: 24470576


Part 3: Combining Omics Data

Seeing the full picture requires taking all information into account


Result overview: traditional differential analysis

1. 722 genes differentially expressed (DE) between tumour and normal
   • QC: good concordance with genes known to be up/down regulated in CRC

2. 841 differentially methylated (DM) genomic regions, mostly hypermethylated
   • QC: good concordance with previously reported gut methylation profile

[Figure panels: known DE gene (CSIRO in-house); known DM locations (Fernandez et al. Genome Res. 2012)]


Microbial Population: traditional population survey

Paul Greenfield

Data integration

(image credit: Francis Tabary)


DNA methylation: Blood signatures in Adipose and Gut samples

Tim Peters

Some gut/adipose samples have blood-like signatures.


Exonseq: blood-signatures stem from a blood-plasma protein

Contamination by ADM2, a gene expressed in blood plasma

[Figure: contamination (%) and expression per individual]

Plasma protein ADM2 makes up most of the human material in the digesta (number of reads mapping to the human genome)


Medical History: Blood potentially resulting from medication

• CARTIA: 14, 50, 57
• WARFARIN: 40
• ASPIRIN: 59, 7
• COPLAVIX: 12
• No anti-clotting drug: 2, 62, 4
• No medication: 19, 20

Wilcoxon rank sum test p-value = 0.02

Anti-thrombosis drugs significantly enriched in individuals with human material in digesta.
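For readers unfamiliar with the test, a Wilcoxon rank sum test compares the ranks of two groups rather than their raw values. The sketch below computes an exact two-sided p-value by enumerating all rank assignments; it assumes no tied values and runs on toy numbers that are not the study data.

```python
from itertools import combinations

def rank_sum_exact(x, y):
    """Rank sum of x and exact two-sided p-value (assumes no tied values)."""
    pooled = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n, total = len(x), len(pooled)
    observed = sum(rank[value] for value in x)
    expected = n * (total + 1) / 2  # mean rank sum under the null hypothesis
    # Every way of assigning n of the ranks to the first group is equally likely
    sums = [sum(c) for c in combinations(range(1, total + 1), n)]
    extreme = sum(abs(s - expected) >= abs(observed - expected) for s in sums)
    return observed, extreme / len(sums)

# Toy values for illustration only
w, p = rank_sum_exact([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print("rank sum:", w, "p-value:", p)
```

With two groups of three, even a perfect separation cannot reach a small p-value, which is one reason group sizes matter when interpreting results of this kind.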


Microbial data: Blood “liking” opportunistic bacteria are enriched in contaminated samples

E. coli, Salmonella, etc.

Opportunistic pathogens. Respond to inflammation and bleeding

Bacterial marker for low-level chronic gut bleeding?


Blue Monster says

Integrating different ‘omics data is still a challenge.


Three things to remember

• Good experimental design is necessary (even) in sequencing experiments

• Reproducible, documented data analysis is key (e.g. NGSANE, a lightweight flexible tool for large-scale sequence data analysis on high-performance systems and Amazon’s elastic cloud)

• Promising research opportunities are in the integration of multiple high-throughput data sources


COMPUTATIONAL INFORMATICS

Thank you

Computational Informatics
Denis C. Bauer
t +61 2 9123 4567
e [email protected]
w www.csiro.au/bioinformatics

Buske et al., Bioinformatics, Jan 2014

More talks online: http://www.slideshare.net/allPowerde
Twitter: @allPowerde

Fabian A. Buske, Susan Clark, Hugh French, Martin Smith
Garvan Institute of Medical Research, Sydney, Australia

Robert Dunne, Tim Peters, Paul Greenfield, Piotr Szul, Tomasz Bednarz
Computational Informatics, CSIRO, Australia

Garry Hannan
Animal Food and Health Science, CSIRO, Australia

Rodney Scott
University of Newcastle, Australia

Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO's Transformational Capability Platform; CSIRO's IM&T; Science and Industry Endowment Fund

http://www.genome-engineering.com.au/
