CPTR title slide

ReSeqTB data platform

pipeline threshold values

Jamie Posey, PhD

CDC

Pipeline Scheme

Pipeline flowchart

Key steps in the pipeline

• Input data validation & QC

• Species specificity check

• Sequence reads mapping & refinement

• Variant calling

• Functional Annotation & Lineage Analysis

Input data validation & QC

Quality Scores

Quality score    Accuracy (%)
Q10              90
Q20              99
Q30              99.9
Q40              99.99
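
The table follows from the standard Phred definition, where a quality score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch that reproduces the values above (the function name `phred_to_accuracy` is illustrative and not part of the pipeline):

```python
def phred_to_accuracy(q: float) -> float:
    """Return base-call accuracy in percent for a Phred quality score q."""
    error_probability = 10 ** (-q / 10)  # standard Phred definition
    return 100 * (1 - error_probability)


for q in (10, 20, 30, 40):
    print(f"Q{q}: {phred_to_accuracy(q):.2f}%")
```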

Input data validation & QC

• FASTQ format files

-From next-generation sequencing platforms

-Specifically Illumina sequencing

• FastQValidator Version 1.0.5

Are the sequence reads in FASTQ format or not?
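
To illustrate what the format question means, here is a rough structural check for four-line FASTQ records. It is only a sketch under simplifying assumptions (no wrapped sequence lines, no character-set checks) and does not reproduce the checks FastQValidator performs; `looks_like_fastq` is a hypothetical helper.

```python
def looks_like_fastq(lines):
    """Return True if lines form well-structured four-line FASTQ records.

    Assumption: each record is exactly four lines (header, sequence,
    separator, qualities) with no line wrapping.
    """
    lines = [ln.rstrip("\n") for ln in lines]
    if not lines or len(lines) % 4 != 0:
        return False
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        # Header starts with '@', separator line starts with '+'.
        if not header.startswith("@") or not plus.startswith("+"):
            return False
        # One quality character per base.
        if not seq or len(seq) != len(qual):
            return False
    return True
```

For example, `looks_like_fastq(["@r1", "ACGT", "+", "IIII"])` returns `True`, while a record whose quality string is shorter than its sequence is rejected.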

Input data validation & QC

• Prinseq-lite.pl Version 1.0.5

-Trim reads based on quality threshold

QC threshold: average read sequence quality of Q20
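
The average-quality threshold can be sketched as follows. This assumes Phred+33 quality encoding (typical of recent Illumina output) and treats Q20 as an inclusive cut-off. Both are assumptions. The helper names are illustrative and this is not how Prinseq itself is implemented.

```python
def mean_phred(quality_string: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality_string:
        raise ValueError("empty quality string")
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def passes_q20(quality_string: str, threshold: float = 20.0) -> bool:
    """True if the read's mean quality meets the threshold (inclusive, by assumption)."""
    return mean_phred(quality_string) >= threshold
```

Under Phred+33, the character `I` encodes Q40 and `#` encodes Q2, so a read with qualities `IIII` passes and one with `####` fails.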

Species Specificity check

• Kraken version 0.10.5

-Is the percentage of reads mapping to the Mycobacterium tuberculosis complex (MTBC) above the acceptable threshold?

QC threshold: percentage of reads mapping to MTBC > 90%
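
A hedged sketch of how this check might be applied to a Kraken report. It assumes the tab-separated report layout (clade percentage in the first column, scientific name in the sixth) and reads the threshold as strictly greater than 90%. The function names and the name-based clade lookup are illustrative choices, not the pipeline's actual code.

```python
def mtbc_percent(report_lines, clade_name="Mycobacterium tuberculosis complex"):
    """Return the percentage of reads assigned to the named clade, or 0.0 if absent.

    Assumed columns: percent, clade reads, direct reads, rank code,
    taxonomy ID, indented scientific name (tab-separated).
    """
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip malformed or empty lines
        if fields[5].strip().lower() == clade_name.lower():
            return float(fields[0])
    return 0.0


def passes_species_check(report_lines, threshold=90.0):
    """True if the MTBC read percentage exceeds the threshold."""
    return mtbc_percent(report_lines) > threshold
```

A sample with 95% of reads in the MTBC clade would pass. One with no MTBC line in its report yields 0.0 and fails.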

Sequencing reads mapping & refinement

• Reference Genome: H37Rv (NC_000962.3)

• BWA MEM: Version 0.7.12

- Mapping Tool

• QC: Qualimap Version 2.1

- Output: quality report on the mapping

Sequencing reads mapping & refinement

• Removing duplicate reads

-Picard tools Version 1.134

• Cleaning indels & recalibration

-GATK Version 3.4.0

• Calculation of coverage statistics

Variant Calling

• Samtools & Bcftools Version 1.2

-QC threshold: minimum base call quality Q20

-QC threshold: minimum mapping quality Q20

-QC threshold: minimum read depth ≥ 10X

-QC threshold: SNP clusters, defined as 3 SNPs within 10 nucleotide bases
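
The thresholds above can be sketched as two small checks. The window definition (three SNPs whose positions span fewer than 10 bases), the inclusive comparisons, and both function names are assumptions made for illustration. The slides do not show the pipeline's own filtering script.

```python
def passes_site_thresholds(depth, base_quality, mapping_quality,
                           min_depth=10, min_baseq=20, min_mapq=20):
    """True if a site meets the depth, base quality, and mapping quality minimums."""
    return (depth >= min_depth
            and base_quality >= min_baseq
            and mapping_quality >= min_mapq)


def clustered_snps(positions, cluster_size=3, window=10):
    """Return SNP positions that fall in a cluster.

    A cluster is assumed to be cluster_size SNPs lying within a stretch
    of `window` consecutive bases.
    """
    ordered = sorted(set(positions))
    flagged = set()
    for i in range(len(ordered) - cluster_size + 1):
        group = ordered[i:i + cluster_size]
        # First and last SNP fit inside one window of `window` bases.
        if group[-1] - group[0] < window:
            flagged.update(group)
    return flagged
```

For instance, SNPs at positions 100, 104, and 109 fall within a single 10-base window and are flagged, whereas 100, 104, and 110 are not.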

Pipeline flowchart

Filter VCF file: custom script

Functional annotation & lineage analysis: SnpEff Ver. 4.1 & custom script

Mapping to ReSeqTB database

Input: VCF file (Raw)

Filtered VCF file

Output: Annotation Report & Lineage Report

Functional Annotation and Lineage Analysis

• Filtering output VCF file

-Custom loci bed list & vcftools Version 0.1.126

• Initial annotation

-SnpEff Version 4.1

• Reformatting annotation and Lineage analysis

-Custom Script

Annotation Report

Lineage Report

Summary of UVP analysis

Total isolates analyzed: 3717

Number passed all checks: 3570

Total failed QC: 147

- Failed Kraken specificity: 67

- Flagged for multiple rrs/rrl mutations: 76

- Mixed infection: 4

Distribution of MTBC major lineages in dataset

Phylogenetic representation of Isolates in dataset

[Figure: lineages shown are Bovis, East Asian, East African Indian, West African L5, Indo-Oceanic, West African L6, and Euro American]

Antibiotic resistance profile across major lineages

Summary

• The Unified Variant Pipeline (UVP) is comprehensive and includes additional genomic data analysis steps (species and lineage specificity, custom annotations).

• It applies current versions of bioinformatics tools and sets quality thresholds at all steps of the pipeline to ensure confidence in variant calls.

• Validation of the annotation results against a number of other variant calling pipelines, including PhyResSE (Feuerriegel et al. 2015), shows agreement across most variant positions.

Thank You!
