Variant Calling Pipeline Erika Villa Bioinformatics Core Facility 10/17/2018

Variant Calling Pipeline

Erika Villa

Bioinformatics Core Facility

10/17/2018

Genome

A genome is the entire set of genetic material for an organism: the blueprint of life that contains the information needed to grow, develop, survive and reproduce.

The human genome

~3 billion base pairs of DNA across 23 pairs of chromosomes.

~20,000 protein coding genes

No individuals are genetically identical

But we are more similar than we are different

More than 99 percent of the human genome is the same in all people.

Differences in less than 1 percent of our genome account for the vast diversity in humans across the globe.

Projects that give us insight about human differences

2015: The 1000 Genomes Project found that a typical individual differs from the reference at 4.1-5 million sites (~20 million bp).

2017: dbSNP lists 324 million variants for humans.

Exome

The exome is a subset of the genome composed of only exons. Exons are the coding regions of a gene.

The exons of all our genes make up approximately 1.5% of our genome.

Exonic regions are thought to harbor ~85% of the mutations with large effects on disease.

Some important DNA sequences lie outside the exome, in noncoding DNA, and have important biological functions, such as regulating the coding regions of the genome.

Sequencing Approaches

Whole Genome (WGS), Whole Exome (WES), Target Gene Panel

Target Gene Panel

• A gene panel is a gene-level subset of the exome

• It contains a subset of exons for a select group of genes

• Gene Panels are useful if you need to do deep sequencing > 1000X

• Many clinical tumor tests use gene panels.

What portion of the genome do you want to sequence?

Pros and Cons of WGS vs WES

Whole Genome

• Cost: ~$1300 for 30-40X coverage

• Detectable variants: all variants possible. Sequence can better predict large structural changes, including CNVs, large indels, etc.

• Pros: more uniform coverage of the protein-coding regions

Whole Exome

• Cost: ~$500 for 100X coverage

• Detectable variants: restricted to exonic regions

• In somatic/mosaic conditions you might need >1000X coverage

• Pros: generates less data to store and analyze

Human Reference Genome

To make assertions about genetic variation we rely on a reference

Reference Genome: A representative example of a species' genetic makeup

Curated by Genome Reference Consortium (GRC)

• GRCh37/hg19 (2009): derived from thirteen anonymous volunteers from Buffalo, NY.

• GRCh38/hg38 (Dec 2013): includes ALT contigs. More representative of the population.

2001 (150,000 gaps), 2009 (250 gaps), 2013 (12 gaps)

Build 38 was a significant ‘upgrade’, and due to its accuracy and reputation it is the ‘go to’ reference for many large scale projects

Catalogs of Human Variation

HapMap

• The International HapMap Project used SNP genotyping arrays to develop a haplotype map (HapMap) of the human genome.

1000G

• The 1000 Genomes Project sequenced >1000 genomes from pure and admixed populations to identify variation in the human genome.

ExAC

• ExAC collected the SNP and indel calls in ~26K genomes/exomes and their prevalence in different populations.

gnomAD

• The Genome Aggregation Database (gnomAD) aggregates and harmonizes exome and genome sequencing data from over 120K exomes and 15K genomes.

Types of variation in Genome

• Single Nucleotide Polymorphisms(SNPs or SNVs)

• Short Insertions/Deletions (Indels)

• Large Structural Variations (SVs)

[Figure: example sequences compared with a reference, illustrating SNVs (substitutions), indels (insertions and deletions) and SVs (inversions)]
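The small-variant categories on this slide can be told apart from the reference and alternate alleles alone. The following is a minimal Python sketch, assuming VCF-style REF and ALT strings; the function name is hypothetical and large structural variants are not handled.

```python
def classify_small_variant(ref: str, alt: str) -> str:
    """Classify a variant from VCF-style REF and ALT allele strings.

    Assumption: alleles are plain base strings (no symbolic alleles
    such as <DEL>), as is typical for SNVs and short indels.
    """
    if len(ref) == len(alt):
        return "substitution"   # same length: bases were swapped
    if len(ref) < len(alt):
        return "insertion"      # ALT carries extra bases
    return "deletion"           # ALT is missing bases present in REF


print(classify_small_variant("C", "T"))
print(classify_small_variant("A", "ATT"))
print(classify_small_variant("ATT", "A"))
```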

Some SNPs of Interest

EXAMPLES

• Non-synonymous mutations

- Result in an amino acid change

- Affect the protein sequence

- Types of non-synonymous mutations

* Missense

* Nonsense: also described as stop_gained

Diseases can be driven by various types of genetic alterations

Examine variants and understand features

Original       Synonymous     Missense       Missense       Nonsense
GAG            GAA            GAT            GTG            TAG
Glutamic Acid  Glutamic Acid  Aspartic Acid  Valine         Stop codon
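The codon examples on this slide can be reproduced by translating each codon with the standard genetic code and comparing the amino acids. This is a minimal Python sketch; the function name is hypothetical, and it looks at a single codon in isolation rather than a full transcript annotation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" is a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Label a single-codon change as synonymous, missense or nonsense."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"      # same amino acid, protein unchanged
    if alt_aa == "*":
        return "nonsense"        # new stop codon (stop_gained)
    if ref_aa == "*":
        return "stop_lost"       # original stop codon removed
    return "missense"            # different amino acid


for alt in ("GAA", "GAT", "GTG", "TAG"):
    print("GAG ->", alt, classify_codon_change("GAG", alt))
```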

Features used in biological sequence annotation

Effects that we see in variants

In a gene? In an exon? Protein coding change? http://www.sequenceontology.org/

Structural Variants: variation in the structure of an organism's chromosome. Typically a structural variant affects a sequence of about 1 kb to 3 Mb in length.

1 kb = 10^3 bp

1 Mb = 10^6 bp

1 Gb = 10^9 bp

Alterations in Genome

• A genetic disorder is a genetic problem caused by one or more abnormalities in the genome.

• A single-gene disorder is the result of a single mutated gene.

• Autosomal dominant disorders occur with only one mutated copy of the gene.

• Recessive disorders require that both copies be mutated.

• X-linked dominant disorders are caused by mutations in genes on the X chromosome.

• Mitochondrial inheritance, also known as maternal inheritance, applies to genes encoded by mitochondrial DNA.

Inherited Diseases

Complex Disease

• Complex diseases are caused by a combination of genetic, environmental, and lifestyle factors, most of which have not yet been identified.

• Some examples include Alzheimer's disease, scleroderma, asthma, Parkinson's disease, multiple sclerosis, osteoporosis, connective tissue diseases, kidney diseases, autoimmune diseases, etc

Somatic Mutations

[Figure: somatic mutation vs. germline mutation]

Somatic mutation: An alteration in DNA that occurs after conception. Somatic mutations can occur in any of the cells of the body except the germ cells and therefore are not passed on to children. Can cause cancer or other diseases.

Somatic Disease

• Acquired diseases are caused by acquired mutations in a gene or group of genes that occur during a person's life.

• These include many cancers, as well as some forms of neurofibromatosis.

Mosaicism

• Mosaicism involves the presence of two or more populations of cells with different genotypes in one individual who has developed from a single fertilized egg.

• Intersex conditions can be caused by mosaicism where some cells in the body have XX and others XY chromosomes.

• Other endogenous factors can also lead to mosaicism, including mobile elements, DNA polymerase slippage, and unbalanced chromosomal segregation.

• Exogenous factors include nicotine and UV radiation.

Germline and Somatic Workflows for Variant Discovery

BICF and BioHPC

Alignment Workflow

First Step for Germline and Somatic Workflows

Alignment: BWA

Burrows-Wheeler Aligner

“BWA is carefully designed to achieve a good balance between performance and accuracy”

SE and PE reads

Difficulties: ambiguity caused by repeats and sequencing errors.

Human Reference Sequences

- GRCh37/hg19

- GRCh38

Other Organisms Reference Sequences

Available for e.g. Mouse (mm10/GRCm38)

Others not available
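The slides do not show the exact command the pipeline runs, so the following is only a sketch of how a `bwa mem` call for single-end or paired-end reads might be assembled in Python. It assumes the reference has already been indexed with `bwa index`; the function name, file names and read-group values are hypothetical.

```python
def bwa_mem_command(reference, fastq_r1, fastq_r2=None,
                    sample_id="sample", threads=4):
    """Build (but do not run) a bwa mem command line.

    -t sets the number of threads; -R sets the read group header line,
    in which bwa expects a literal backslash-t between fields.
    """
    read_group = f"@RG\\tID:{sample_id}\\tSM:{sample_id}\\tPL:ILLUMINA"
    cmd = ["bwa", "mem", "-t", str(threads), "-R", read_group,
           reference, fastq_r1]
    if fastq_r2:                 # paired-end: add the second fastq file
        cmd.append(fastq_r2)
    return cmd


# Hypothetical file names; bwa writes SAM to standard output.
print(" ".join(bwa_mem_command("GRCh38.fa", "S0001.R1.fastq.gz",
                               "S0001.R2.fastq.gz", sample_id="S0001")))
```

The returned list could be passed to `subprocess.run` with standard output redirected to a file.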

Alignment: Deduping

With or without UMI

Why are we so worried about sequence duplication?

• When DNA is sequenced, PCR is used to amplify the sequencing library to ensure that only DNA with “a known adapter” is sequenced.

• Since PCR has a small error rate, “early errors” can be amplified and could skew your results.

• We remove duplicates to remove potential noise.
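The idea behind deduplication, with or without UMIs, can be sketched as grouping reads that share the same alignment coordinates and keeping one representative per group. The Python sketch below is a simplification under stated assumptions: reads are plain dictionaries with hypothetical keys, and the best read is chosen by mapping quality, whereas production tools use more detailed criteria.

```python
def mark_duplicates(reads, use_umi=False):
    """Return one boolean per read: True if it is marked as a duplicate.

    Reads with the same chromosome, start position and strand are treated
    as copies of one original molecule. With use_umi=True the UMI is added
    to the key, so distinct molecules that happen to share coordinates
    are kept apart.
    """
    best = {}  # grouping key -> index of the read kept for that group
    for i, read in enumerate(reads):
        key = (read["chrom"], read["pos"], read["strand"])
        if use_umi:
            key += (read["umi"],)
        if key not in best or read["mapq"] > reads[best[key]]["mapq"]:
            best[key] = i
    kept = set(best.values())
    return [i not in kept for i in range(len(reads))]


reads = [
    {"chrom": "chr1", "pos": 100, "strand": "+", "mapq": 60, "umi": "AAC"},
    {"chrom": "chr1", "pos": 100, "strand": "+", "mapq": 50, "umi": "AAC"},
    {"chrom": "chr1", "pos": 100, "strand": "+", "mapq": 60, "umi": "GGT"},
]
print(mark_duplicates(reads))
print(mark_duplicates(reads, use_umi=True))
```

Without UMIs the third read is discarded even though it came from a different molecule, which is why UMI-aware deduplication matters for very deep sequencing.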

Alignment: Indel Realignment

• Why does GATK need Indel Realignment?

• Sometimes alignment algorithms align reads inconsistently, placing the alignment gaps in different places.

• Indel Realignment uses “known” gold standard indels to realign these gaps.

Alignment Workflow: Recalibration

• Why does GATK need Base Recalibration?

• Every base has a quality score, and variant callers rely on these scores

• Quality scores are prone to different types of biases

• Base recalibration detects systematic errors made by the sequencer when it estimates the quality score of each base call
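Base quality scores use the Phred scale, Q = -10 log10(p), where p is the probability that the base call is wrong. The Python sketch below shows the conversion and a simplified way to compare a reported quality with an empirical one computed from observed mismatches. The add-one smoothing is an assumption made for illustration and is not presented as GATK's actual model.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def empirical_quality(mismatches: int, total_bases: int) -> float:
    """Estimate the quality of a group of bases from observed mismatches.

    Assumption: mismatches are counted only at sites not known to be
    variant, and add-one smoothing avoids an infinite score when no
    mismatches are seen.
    """
    return error_prob_to_phred((mismatches + 1) / (total_bases + 2))


# Bases reported as Q30 that mismatch the reference about 1% of the time
# behave like Q20 bases, so their scores would be adjusted downward.
print(phred_to_error_prob(30))
print(empirical_quality(9, 998))
```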

Germline Workflow

Germline Union VCF

Variant Callers

• Strelka2: https://github.com/Illumina/strelka

– Sangtae Kim, Konrad Scheffler, Aaron L Halpern, Mitchell A Bekritsky, Eunho Noh, Morten Källberg, Xiaoyu Chen, Doruk Beyter, Peter Krusche, Christopher T Saunders. Strelka2: Fast and accurate variant calling for clinical sequencing applications.

• Speedseq: https://github.com/hall-lab/speedseq

– Chiang C, Layer RM, Faust GG, Lindberg MR, Rose DB, Garrison EP, Marth GT, Quinlan AR, Hall IM. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Methods. 2015 Oct;12(10):966-8.

• Platypus: http://www.well.ox.ac.uk/platypus

– Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SR; WGS500 Consortium, Wilkie AO, McVean G, Lunter G. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet. 2014 Aug;46(8):912-8.

• GATK: https://software.broadinstitute.org/gatk/

– Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella KV, Altshuler D, Gabriel S, DePristo MA. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013;43:11.10.1-33.

Recommended Filtering for Germline Testing

• Depth >20

• LOF or Missense (Coding Changes)

• Alt Read Ct >3

• Mutation Allele Frequency (MAF) >0.10

• If novel:

- Called by 2+ callers
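The germline filtering recommendations above can be expressed as a single predicate. This Python sketch assumes each variant has already been annotated and is held in a dictionary; the key names are hypothetical, and MAF is taken here to mean alternate reads divided by depth.

```python
CODING_EFFECTS = {"lof", "missense"}


def passes_germline_filter(variant: dict) -> bool:
    """Apply the recommended germline filters to one annotated variant."""
    if variant["depth"] <= 20:
        return False
    if variant["effect"] not in CODING_EFFECTS:
        return False
    if variant["alt_reads"] <= 3:
        return False
    # Assumption: mutation allele frequency = alt reads / total depth.
    if variant["alt_reads"] / variant["depth"] <= 0.10:
        return False
    # Novel variants need support from at least two callers.
    if variant["novel"] and variant["n_callers"] < 2:
        return False
    return True


known = {"depth": 40, "alt_reads": 18, "effect": "missense",
         "novel": False, "n_callers": 1}
print(passes_germline_filter(known))
print(passes_germline_filter(dict(known, novel=True)))
```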

Important Terminology to understand

Different tumor cells can show distinct morphological and phenotypic profiles, e.g. cell morphology and gene expression.

Somatic Workflow

Somatic Variant Callers

• Shimmer: https://github.com/nhansen/Shimmer

– Hansen NF, Gartner JJ, Mei L, Samuels Y, Mullikin JC. Shimmer: detection of genetic alterations in tumors using next-generation sequence data. Bioinformatics. 2013 Jun 15;29(12):1498-503.

• Virmid: https://sourceforge.net/projects/virmid/

– Kim S, Jeong K, Bhutani K, Lee J, Patel A, Scott E, Nam H, Lee H, Gleeson JG, Bafna V. Virmid: accurate detection of somatic mutations with sample impurity inference. Genome Biol. 2013 Aug 29;14(8):R90

• VarScan: http://dkoboldt.github.io/varscan/

– Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012 Mar;22(3):568-76.

• Speedseq: https://github.com/hall-lab/speedseq

– Chiang C, Layer RM, Faust GG, Lindberg MR, Rose DB, Garrison EP, Marth GT, Quinlan AR, Hall IM. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Methods. 2015 Oct;12(10):966-8.

• MuTect:

- https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php

- http://archive.broadinstitute.org/cancer/cga/mutect

Recommended Filtering for Somatic Mutations

• Depth >20

• LOF or Missense (Coding Changes)

• MAF(Normal) * 5 < MAF(Tumor)

• In COSMIC >5 Subjects
  - Tumor: Alt Read Ct >3
  - Tumor: MAF >0.01

• Others
  - Tumor: Alt Read Ct >8
  - Tumor: MAF >0.05
  - Tumor: Called by 2+ Callers
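The somatic recommendations have two tiers: looser thresholds for variants already seen in more than five COSMIC subjects and stricter ones for everything else. The Python sketch below is one possible reading of the slide. The dictionary keys are hypothetical, MAF is assumed to be alternate reads divided by depth, and the depth threshold is assumed to apply to the tumor sample.

```python
CODING_EFFECTS = {"lof", "missense"}


def passes_somatic_filter(v: dict) -> bool:
    """Apply the recommended somatic filters to one tumor/normal variant."""
    if v["tumor_depth"] <= 20:          # assumption: depth refers to tumor
        return False
    if v["effect"] not in CODING_EFFECTS:
        return False
    tumor_maf = v["tumor_alt_reads"] / v["tumor_depth"]
    normal_maf = (v["normal_alt_reads"] / v["normal_depth"]
                  if v["normal_depth"] else 0.0)
    # The tumor frequency must exceed five times the normal frequency.
    if not normal_maf * 5 < tumor_maf:
        return False
    if v["cosmic_subjects"] > 5:        # known hotspot: looser thresholds
        return v["tumor_alt_reads"] > 3 and tumor_maf > 0.01
    return (v["tumor_alt_reads"] > 8 and tumor_maf > 0.05
            and v["n_callers"] >= 2)


hotspot = {"tumor_depth": 400, "tumor_alt_reads": 8,
           "normal_depth": 100, "normal_alt_reads": 0,
           "effect": "missense", "cosmic_subjects": 50, "n_callers": 1}
print(passes_somatic_filter(hotspot))
print(passes_somatic_filter(dict(hotspot, cosmic_subjects=0)))
```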

Annotations

Disease Studies

• ClinVar: ClinVar is a freely accessible, public archive that aggregates information about genomic variation and its relationship to human health.

• GWAS Catalog: GWAS Catalog is a quality-controlled, manually curated, literature-derived collection of published GWAS assaying at least 100,000 single-nucleotide polymorphisms (SNPs) and all SNP-trait associations with P < 1 × 10^-5.

• Decipher: The DECIPHER database contains data from 27,302 patients who have given consent to broad data sharing; DECIPHER also supports more limited sharing via consortia. Used by the clinical community to share and compare phenotypic and genotypic data.

Cancer Datasets and Annotation

• Clinical Interpretation of Variants in Cancer (CIViC)

• Catalogue of Somatic Mutations in Cancer (COSMIC)
  - Gene Fusions
  - Gene Census
  - Curated Genes
  - Drug Resistance (so far 9 genes)
  - Genome Wide Screens

• The Cancer Genome Atlas (TCGA)
  - Tons of data: RNASeq, CNV, WES, WGS, etc.

Astrocyte - BioHPC Workflow Platform

astrocyte.biohpc.swmed.edu

or

portal.biohpc.swmed.edu: Cloud Services -> Astrocyte Workflow Platform

Standardized Workflow

Simple Web Forms

Online documentation & results visualization

Workflows run on HPC cluster without developer or user needing cluster knowledge

Bioinformatics Core Facility (BICF)

BICF provides bioinformatics, statistical and data management support to researchers on campus.

BICF functions as a conduit between bioinformatics research programs and the clinical and basic science research community at UTSW.

Please email [email protected] with questions or comments about the workflow.

Create New Project

Add Data To Your Project

Adding Data To Your Project

# For NGS experiments, this is recommended

Data to Import

• Design File: tab-delimited *.txt file with sample names, Family/Group names, fastq file names

• Fastq Files: One or two fastq files per sample

• Capture Bed file: tab delimited file with target capture region in bed format. (Must contain at least 3 columns specifying chromosome, chromosome start position and chromosome end position)
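A capture file can be checked before upload by confirming that each line has at least the three required columns. The Python sketch below assumes the usual BED convention of a 0-based start and an exclusive end; the function name is hypothetical.

```python
def parse_bed_line(line: str):
    """Parse one BED line into (chromosome, start, end).

    Raises ValueError if fewer than three tab-separated columns are
    present, if the coordinates are not integers, or if the interval
    is empty or negative. Extra columns are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns, found {len(fields)}")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end <= start:
        raise ValueError(f"invalid interval {start}-{end} on {chrom}")
    return chrom, start, end


print(parse_bed_line("chr1\t100\t200\tgeneA\n"))
```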

Make A Design File

FamilyID: This ID will be used to call samples in batch

SampleID: This ID will be used to name all workflow-produced files. E.g. S0001 will produce S0001.bam

FqR1: Name of the fastq file R1 (not full path)

FqR2: Name of the fastq file R2 (not full path)
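A design file with these four columns can also be written programmatically instead of through Excel. The Python sketch below uses the standard `csv` module with a tab delimiter. The column order and the sample values are assumptions for illustration, so check them against the workflow documentation.

```python
import csv
import io

COLUMNS = ["FamilyID", "SampleID", "FqR1", "FqR2"]


def write_design_file(samples, handle):
    """Write a tab-delimited design file with one row per sample."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for s in samples:
        # FqR2 is left empty for single-end samples.
        writer.writerow([s["FamilyID"], s["SampleID"],
                         s["FqR1"], s.get("FqR2", "")])


buffer = io.StringIO()  # stands in for open("mydesignfile.txt", "w")
write_design_file(
    [{"FamilyID": "F1", "SampleID": "GM12878",
      "FqR1": "GM12878.R1.fastq.gz", "FqR2": "GM12878.R2.fastq.gz"}],
    buffer,
)
print(buffer.getvalue())
```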

Rules for Making Design File

• Use tab as delimiter
  - Excel: save as “Text (tab delimited)”

• If no SubjectID, use the same number/character for all rows

• If no FqR2, leave them empty

• For all contents, no “-”

• For all contents, no spaces

• Column names MUST be exactly the same as documented
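The rules above can be checked automatically before a design file is uploaded. This Python sketch is a hypothetical validator: it assumes the required columns are FamilyID, SampleID, FqR1 and FqR2, and that only FqR2 may be left empty.

```python
REQUIRED_COLUMNS = ["FamilyID", "SampleID", "FqR1", "FqR2"]


def validate_design(text: str) -> list:
    """Return a list of problems found in a tab-delimited design file."""
    lines = text.splitlines()
    if not lines:
        return ["design file is empty"]
    header = lines[0].split("\t")
    errors = [f"missing column: {c}" for c in REQUIRED_COLUMNS
              if c not in header]
    for n, line in enumerate(lines[1:], start=2):
        cells = line.split("\t")
        if len(cells) != len(header):
            errors.append(f"line {n}: expected {len(header)} fields, "
                          f"found {len(cells)}")
            continue
        for name, value in zip(header, cells):
            if "-" in value or " " in value:
                errors.append(f"line {n}: {name} contains '-' or a space")
            if value == "" and name != "FqR2":
                errors.append(f"line {n}: {name} is empty")
    return errors


good = "FamilyID\tSampleID\tFqR1\tFqR2\nF1\tS0001\tS0001.R1.fastq.gz\t\n"
bad = "FamilyID\tSampleID\tFqR1\tFqR2\nF1\tS-0001\tmy file.fastq.gz\t\n"
print(validate_design(good))
print(validate_design(bad))
```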

Run Workflow in this Project

My Project

Select Project

Select your data files, set up the workflow and submit

SELECT YOUR FILES

• mydesignfile.txt

• mycapturefile.bed

• GM12878.R1.fastq.gz

• GM12878.R2.fastq.gz

Project is Queued/Running/Complete

Keep Trying: My first attempt below

Make sure you have all the appropriate files selected

BICF Help Desk

Email: [email protected]

Hours: 10-11am Daily

Location: E4.380

Timeline of Germline Workflow

One Sample

Key Files for Germline Pipeline

• VCF file: SNPs/Indels for each sample
  - SampleID.germline.vcf.gz

• Coverage Histogram for each sample
  - SampleID.coverage_histogram.png

• Cumulative Distribution Plot for all samples
  - coverage_cdf.png

• QC for all samples
  - SampleID.sequence.stats.txt

• Structural Variants (unfiltered)
  - SampleID.sssv.sv.vcf.gz.annot.txt

• Copy Number for each sample
  - SampleID.cnvcalls.txt

Key Files for Somatic Mutation Pipeline

• VCF file: SNPs/Indels for each sample
  - FamilyID.somatic.vcf.gz

• Match Check File
  - FamilyID_matched.txt

• QC for tumor-normal pairs
  - FamilyID.sequence.stats.txt

BAM files can be viewed on

Reference: same as analysis reference

http://newbam.iobio.io/

VCF Files can be viewed by

http://vcf.iobio.io

Thank you

Questions?