The Queensland Brain Institute | Variant calling for disease association (1/2) Ordering the haystack 12/27/21 [www.absolutefab.com]

Variant (SNPs/Indels) calling in DNA sequences, Part 1


Abstract: This session focuses on the first steps involved in identifying SNPs from whole-genome, exome-capture or targeted resequencing data. The different approaches for mapping reads to a DNA reference sequence will be introduced and quality metrics discussed.


Page 1: Variant (SNPs/Indels) calling in DNA sequences, Part 1

The Queensland Brain Institute |

Variant calling for disease association (1/2): Ordering the haystack

April 11, 2023

[www.absolutefab.com]

Page 2: Variant (SNPs/Indels) calling in DNA sequences, Part 1


Quick recap: Production informatics

[Figure: pipeline diagram with the labels Sequencing, Image, FASTQ, Quality Control, Projects]

• Sequencing -> Images -> Conversion (demultiplexing)
• Resulting file type: FASTQ
• Several projects can be processed on one flowcell
• One project can have several samples

Page 3: Variant (SNPs/Indels) calling in DNA sequences, Part 1


Production Informatics and Bioinformatics

• Basic Production Informatics: produce raw sequence reads
  – Product: fastq; time: 5 days
• Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs)
  – Product: bam, vcf,…; time: 3 weeks
• Bioinformatics Research: analyze the data; uncover the biological meaning
  – Product: paper; time: >6 months

Times are per one-flowcell project.

Page 4: Variant (SNPs/Indels) calling in DNA sequences, Part 1


Where in the genome do the reads come from?

Reads Alignment

Page 5: Variant (SNPs/Indels) calling in DNA sequences, Part 1


Short read mapping

• A brute-force algorithm would take years to process one lane: data structures matter!
  – Constant trade-off: speed vs. sensitivity
  – To date >50 read mapping tools
• Two categories
  – Hash tables: MAQ, ELAND, SOAP, BFAST, RazerS, Novoalign
  – Suffix trees: BWA, SOAP2, BOWTIE

Thomas Keane 9th European Conference on Computational Biology 26th September, 2010

Bao S, Jiang R, Kwan W, Wang B, Ma X, Song YQ. Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet. 2011 Apr 28. PubMed PMID: 21525877.

Page 6: Variant (SNPs/Indels) calling in DNA sequences, Part 1


Hash table based aligners

• Modifications
  – Speed-up: spaced seeds, e.g. 111010010100110111
  – Gapped seeds: q-grams
• Hash the reads: MAQ, ELAND, ZOOM and SHRiMP
  – Potentially much smaller memory requirements
• Hash the reference: SOAP, BFAST and MOSAIK
  – Constant memory cost, one-time effort
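The "hash the reference" idea with a spaced seed can be sketched in a few lines of Python. This is a toy illustration only, not how SOAP, BFAST or MOSAIK are implemented: the function names, the short mask `1101011` and the example sequences are all assumptions made for this sketch. Positions marked `0` in the mask are ignored when hashing, so a mismatch that falls on them does not prevent a seed hit.

```python
from collections import defaultdict


def apply_mask(window, mask):
    """Keep only the bases at positions where the spaced-seed mask is '1'."""
    return "".join(base for base, bit in zip(window, mask) if bit == "1")


def index_reference(reference, mask):
    """Hash every masked reference window to its 0-based start positions."""
    span = len(mask)
    index = defaultdict(list)
    for start in range(len(reference) - span + 1):
        key = apply_mask(reference[start:start + span], mask)
        index[key].append(start)
    return index


def candidate_positions(read, index, mask):
    """Return the sorted 0-based read start positions implied by seed hits."""
    span = len(mask)
    hits = set()
    for offset in range(len(read) - span + 1):
        key = apply_mask(read[offset:offset + span], mask)
        for ref_start in index.get(key, []):
            hits.add(ref_start - offset)
    return sorted(hits)


# Hypothetical toy data: the read is reference[6:18] with one mismatch (A -> C).
reference = "AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT"
mask = "1101011"
read = "TTAGCTAAGATA"

index = index_reference(reference, mask)
candidates = candidate_positions(read, index, mask)
print(candidates)  # candidate start positions; the true origin (6) is among them
```

The index is built once per reference (the "one-time effort" above); each candidate position would still need to be verified by a full alignment of the read.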

Thomas Keane 9th European Conference on Computational Biology 26th September, 2010

Page 7: Variant (SNPs/Indels) calling in DNA sequences, Part 1


Suffix trees and the Burrows-Wheeler transform

• Suffix trees are much faster
  – E.g. BWA is ~20 times faster than hash-based MAQ
• The Burrows-Wheeler transform makes them practical by keeping the memory footprint manageable

Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60. PMID: 19451168

Reference: queensland
BWT(Ref): dlnuesae$nq

Rotated        Sorted
queensland$    $queensland
ueensland$q    and$queensl
eensland$qu    d$queenslan
ensland$que    eensland$qu
nsland$quee    ensland$que
sland$queen    land$queens
land$queens    nd$queensla
and$queensl    nsland$quee
nd$queensla    queensland$
d$queenslan    sland$queen
$queensland    ueensland$q
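The rotate-and-sort construction can be reproduced with a short Python sketch. It is the naive textbook construction, meant only to illustrate the transform on the slide's example; real aligners build the index with far more memory-efficient algorithms. The function name `bwt` is an assumption of this sketch.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, read the last column."""
    s = text + sentinel  # the sentinel sorts before every letter
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("queensland"))  # dlnuesae$nq
```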

Page 8: Variant (SNPs/Indels) calling in DNA sequences, Part 1


Find exact matches in transformed sequence

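P     BWT matrix     C
0     $queensland    1
8     and$queensl    1
10    d$queenslan    1
3     eensland$qu    1
4     ensland$que    1
7     land$queens    1
9     nd$queensla    1
5     nsland$quee    2
1     queensland$    1
6     sland$queen    2
2     ueensland$q    1

P: start position of the row in the reference. C: how many times the row's last letter has occurred in the last column, counting from the top.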

Read: ensl

Reference: q u e e n s l a n d
Position:  1 2 3 4 5 6 7 8 9 10

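1. Search backwards, starting from the last letter of the read
2. Find letter i in the last column
3. Jump to the count-th occurrence of letter i in the first column
4. Set i to be the letter in the last column of that row
5. Repeat steps 3 and 4 until the whole read has been processed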
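The backward search can be sketched in Python as follows. This is the standard FM-index formulation (a row range that narrows with each letter of the read, processed right to left), which is closely related to but not a literal transcription of the steps listed above. The function names are assumptions of this sketch, and the occurrence counts are recomputed naively rather than stored in a lookup table as a real aligner would.

```python
def build_fm_index(text, sentinel="$"):
    """Build a naive FM index: suffix array, last column, first-column offsets."""
    s = text + sentinel
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The letter preceding each sorted suffix forms the last column (the BWT).
    last_column = "".join(s[i - 1] for i in suffix_array)
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    # first_row[ch]: index of the first row whose first letter is ch.
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    return suffix_array, last_column, first_row


def backward_search(read, suffix_array, last_column, first_row):
    """Return the sorted 1-based reference positions where read matches exactly."""
    lo, hi = 0, len(last_column)  # current row range [lo, hi)
    for ch in reversed(read):
        if ch not in first_row:
            return []
        lo = first_row[ch] + last_column[:lo].count(ch)
        hi = first_row[ch] + last_column[:hi].count(ch)
        if lo >= hi:
            return []
    return sorted(suffix_array[row] + 1 for row in range(lo, hi))


fm_index = build_fm_index("queensland")
print(backward_search("ensl", *fm_index))  # [4]
```

The result agrees with the table: the row `ensland$que` has P = 4, which is where the read `ensl` starts in the reference.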

John Pearson Winter School in Mathematical and Computational Biology 5-9 July 2010

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25; PMID: 19261174.

Page 9: Variant (SNPs/Indels) calling in DNA sequences, Part 1


Which aligner to use?

• Hash-based approaches are more suitable for divergent alignments
  – Rule of thumb:
    • <2% divergence -> BWT, e.g. human alignments
    • >2% divergence -> hash-based approach, e.g. wild mouse strain alignments

However, the field develops fast: don't be sentimental about any one tool

Thomas Keane 9th European Conference on Computational Biology 26th September, 2010

Page 10: Variant (SNPs/Indels) calling in DNA sequences, Part 1


File format: SAM/BAM

The SAM Format Specification (v1.4-r962), April 17, 2011

ref      AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
+r001/1        TTAGATAAAGGATA*CTG
+r002         aaaAGATAA*GGATA
+r003       gcctaAGCTAA
+r004                     ATAGCT..............TCAGC
-r003                            ttagctTAGGC
-r001/2                                        CAGCGCCAT

Plus an unlimited number of additional fields in the form TAG:TYPE:VALUE, e.g. NM (edit distance)
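A SAM alignment line is tab-separated: eleven mandatory columns followed by any number of TAG:TYPE:VALUE fields. The sketch below is a minimal reader assuming the column names of the SAM specification; the function name is hypothetical, only integer (`i`) tags are converted, and the example record is modelled on read r001/1 with an `NM` value made up for illustration. For real data a dedicated library is the safer choice.

```python
MANDATORY_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one SAM alignment line into mandatory fields and optional tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(MANDATORY_FIELDS):
        raise ValueError("a SAM alignment line needs 11 mandatory columns")
    record = {}
    for name, value in zip(MANDATORY_FIELDS, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    tags = {}
    for field in columns[len(MANDATORY_FIELDS):]:
        tag, type_code, value = field.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    record["TAGS"] = tags
    return record


# Illustrative record; the NM tag value is an assumption, not taken from the slide.
example_line = "\t".join([
    "r001", "163", "ref", "7", "30", "8M2I4M1D3M",
    "=", "37", "39", "TTAGATAAAGGATACTG", "*", "NM:i:1",
])
record = parse_sam_line(example_line)
print(record["QNAME"], record["POS"], record["CIGAR"], record["TAGS"])
```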

Page 11: Variant (SNPs/Indels) calling in DNA sequences, Part 1


Flag

Hex    0x80  0x40  0x20  0x10  0x8  0x4  0x2  0x1
Bit    128   64    32    16    8    4    2    1
163 =  1     0     1     0     0    0    1    1
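The flag is a sum of bits, so decoding it is a matter of testing each bit in turn. The sketch below covers only the eight bits shown above, with meanings taken from the SAM specification; the label wording and the function name are choices made for this sketch.

```python
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
]


def decode_flag(flag):
    """List the meaning of every bit that is set in a SAM flag."""
    return [label for bit, label in FLAG_BITS if flag & bit]


# 163 = 128 + 32 + 2 + 1
for label in decode_flag(163):
    print(label)
```

For 163 this lists: read paired, read mapped in proper pair, mate on reverse strand, second in pair.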

Page 12: Variant (SNPs/Indels) calling in DNA sequences, Part 1


CIGAR String
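A CIGAR string is a run-length encoded list of alignment operations, for example M (match or mismatch), I (insertion), D (deletion), N (skipped region) and S (soft clip). The sketch below is a minimal parser assuming the operation letters of the SAM specification; the function names are hypothetical. The example `8M2I4M1D3M` is the CIGAR of read r001/1 in the specification's example alignment.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar):
    """Turn a CIGAR string into a list of (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything other than length/operation pairs.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar):
    """Return (read bases consumed, reference bases spanned)."""
    ops = parse_cigar(cigar)
    query = sum(length for length, op in ops if op in CONSUMES_QUERY)
    reference = sum(length for length, op in ops if op in CONSUMES_REFERENCE)
    return query, reference


print(parse_cigar("8M2I4M1D3M"))
print(cigar_lengths("8M2I4M1D3M"))  # (17, 16)
```

The read is 17 bases long (`TTAGATAAAGGATACTG`) but covers only 16 reference bases, because the two inserted bases outweigh the single deleted one.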

Page 13: Variant (SNPs/Indels) calling in DNA sequences, Part 1


Visualizing BAM files: IGV

Exome capture

http://www.broadinstitute.org/igv/

Whole genome sequencing

Page 14: Variant (SNPs/Indels) calling in DNA sequences, Part 1


BAM file: Quality control

• Percentage mapped
  – Aim for 80%
• Coverage
  – Aim for coverage >10x
• Duplicates
  – Aim for <1% (whole genome)
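These three metrics can be approximated directly from the alignment records. The sketch below is a rough illustration under stated assumptions: reads are given as hypothetical `(flag, read_length)` pairs, unmapped and duplicate reads are recognised by the SAM flag bits 0x4 and 0x400, and mean coverage is estimated as mapped bases divided by the target length. Dedicated QC tools compute these figures more carefully.

```python
UNMAPPED = 0x4
DUPLICATE = 0x400


def mapping_qc(reads, target_length):
    """Summarise percentage mapped, mean coverage and duplicate rate.

    reads: iterable of (flag, read_length) pairs (a simplification for this sketch).
    target_length: size of the genome or capture target in bases.
    """
    total = mapped = duplicates = mapped_bases = 0
    for flag, length in reads:
        total += 1
        if flag & UNMAPPED:
            continue
        mapped += 1
        mapped_bases += length
        if flag & DUPLICATE:
            duplicates += 1
    if total == 0 or target_length <= 0:
        raise ValueError("need at least one read and a positive target length")
    return {
        "percent_mapped": 100.0 * mapped / total,
        "mean_coverage": mapped_bases / target_length,
        "percent_duplicates": 100.0 * duplicates / mapped if mapped else 0.0,
    }


def passes_rule_of_thumb(qc):
    """Apply the targets above: 80% mapped, coverage >10, duplicates <1%."""
    return (
        qc["percent_mapped"] >= 80.0
        and qc["mean_coverage"] > 10.0
        and qc["percent_duplicates"] < 1.0
    )


# Made-up toy data: nine mapped reads and one unmapped read of 100 bases each.
toy_reads = [(0, 100)] * 9 + [(UNMAPPED, 100)]
toy_qc = mapping_qc(toy_reads, target_length=50)
print(toy_qc, passes_rule_of_thumb(toy_qc))
```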

//cluster-vm.qbi.uq.edu/<yourProject>

Page 15: Variant (SNPs/Indels) calling in DNA sequences, Part 1


Three things to remember

1. Getting the mapping right is critical
2. QC means checking the mapping statistics and visualizing the BAM file
3. Knowing where the reads are does not necessarily tell you about their function

Page 16: Variant (SNPs/Indels) calling in DNA sequences, Part 1


Next week: Part 2

Abstract: This session will focus on the steps involved in identifying genomic variants after an initial mapping has been achieved: improving the mapping, SNP and indel calling, and variant filtering/recalibration will be introduced and quality metrics discussed.

http://climbers.net/blog/Exhibiting-at-Cliffhanger-12-13th-July-Sheffield

Page 17: Variant (SNPs/Indels) calling in DNA sequences, Part 1


Walk-in-clinic