Genetics 211 - 2017 Lecture 3: High Throughput Sequencing Pt II
Gavin Sherlock
January 24th 2017

Leveraging Multiple Technologies

•  Illumina is great, because you can get a ton of data
    –  BUT read length is short
•  PacBio/Oxford Nanopore is great, because read lengths are long
    –  BUT the data quality is poor
•  Two approaches have been used:
    1.  Hybrid error correction
        •  Use short reads to correct long PacBio reads, and then assemble those
    2.  Use long reads to improve existing (short read or Sanger based) assemblies
        •  E.g. with 24× mapped coverage of PacBio long reads applied to a D. pseudoobscura short read assembly, 99% of gaps were addressed, with 69% being closed and a further 12% improved.

Long “Synthetic Reads” aka Moleculo

1.  Fragment genomic DNA
2.  Size select (10 kb)
3.  Polish, ligate amplification adaptors
4.  Dilute the ~10 kb DNA to 500 molecules per well
5.  Amplify, fragment, add sequencing adaptors
6.  Pool
7.  Sequence
8.  Separate, based on well barcode
9.  Remove barcodes, assemble 10 kb fragments
10.  Assemble genome from 10 kb fragments
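
The "separate, based on well barcode" step is essentially a grouping operation. A minimal sketch, assuming (hypothetically) that every read starts with a fixed-length well barcode; the barcode length, the function name and the toy reads are invented for illustration and are not taken from the Moleculo protocol.

```python
from collections import defaultdict

BARCODE_LEN = 4  # hypothetical barcode length, chosen only for this example


def split_by_well(reads, barcode_len=BARCODE_LEN):
    """Group reads by their leading well barcode and strip the barcode."""
    wells = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        wells[barcode].append(insert)
    return dict(wells)


# Toy reads: the first four bases stand in for the well barcode.
reads = ["AAAAGTCCAC", "CCCCTTAGGA", "AAAACACCTA"]
wells = split_by_well(reads)
for barcode, inserts in sorted(wells.items()):
    print(barcode, inserts)
```

Each well's reads would then be assembled on their own, which keeps the problem small because a well holds only a few hundred ~10 kb molecules.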

Synthetic Read Characteristics

Using Hi-C data to aid assemblies

•  Hi-C is a proximity ligation method, aimed at reconstructing the 3 dimensional structure of a genome

•  Originally developed with the idea of looking at how the genome of an organism for which a good reference exists is organized

•  But the probability of intrachromosomal contacts is on average much higher than that of interchromosomal contacts.

•  Although the probability of interaction decays rapidly with linear distance, even loci separated by >200 Mb on the same chromosome are more likely to interact than loci on different chromosomes

Hi-C Protocol

1.  Crosslink and isolate
2.  Digest and biotin fill-in
3.  Ligation and DNA isolation
4.  Biotin removal and size fractionation
5.  Pulldown, adapter ligation and deep sequencing

Hi-C Data

[Figure: contact probability for intrachromosomal versus interchromosomal pairs of loci]

Applying Hi-C data to assemblies

LACHESIS (ligating adjacent chromatin enables scaffolding in situ)
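
Because contigs from the same chromosome share more Hi-C links than contigs from different chromosomes, link counts can be used to group contigs into chromosome-sized clusters. The sketch below illustrates that idea with a greedy average-linkage merge on a toy matrix. It is not the LACHESIS implementation: the contig names, link counts and function name are all made up for the example.

```python
from itertools import combinations


def cluster_contigs(links, contigs, n_groups):
    """Merge the two groups with the highest average link count until n_groups remain."""
    groups = [frozenset([c]) for c in contigs]

    def avg_links(a, b):
        total = sum(links.get(frozenset((x, y)), 0) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(groups) > n_groups:
        a, b = max(combinations(groups, 2), key=lambda pair: avg_links(*pair))
        groups = [g for g in groups if g not in (a, b)] + [a | b]
    return sorted(sorted(g) for g in groups)


# Toy Hi-C link counts between contig pairs (made-up numbers).
links = {
    frozenset(("c1", "c2")): 90, frozenset(("c1", "c3")): 60,
    frozenset(("c2", "c3")): 80, frozenset(("c4", "c5")): 75,
    frozenset(("c1", "c4")): 5, frozenset(("c2", "c5")): 4,
    frozenset(("c3", "c4")): 6,
}
groups = cluster_contigs(links, ["c1", "c2", "c3", "c4", "c5"], n_groups=2)
print(groups)  # [['c1', 'c2', 'c3'], ['c4', 'c5']]
```

Here the number of groups is supplied by the caller, mirroring the case where the expected chromosome count is known in advance.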

Results

•  Applied to human contigs, using Hi-C data from an embryonic cell line

•  Most scaffolds (n = 13,528, comprising 98.2% of the length of the shotgun assembly) were clustered into one of the 23 groups

•  Nearly all of these groups corresponded to individual chromosomes, with the exception of the X chromosome, whose arms were in separate groups

•  Within each chromosome group, the vast majority of the length of the clustered scaffolds was successfully ordered and oriented by LACHESIS (94.4%)

Principle Can also be applied to Metagenomic Samples

Suggests multipronged approach to de novo Assembly

•  Provide high coverage data over various different physical distances
    –  Shotgun fragment
        •  Typical paired end, short read sequencing – can be very high coverage
    –  Up to 12-kbp mate-pair
        •  Allows you to bridge many repeats
    –  CPT-seq/10X Genomics
        •  Longer range contiguity (limited to ~10Mb genomes and above)
    –  Synthetic reads
        •  Generate some complete reads that span repeats, but significant PCR bias (limited to ~100Mb genomes and above)
    –  Long reads (PacBio, Oxford Nanopore)
        •  Low coverage and high error rates (though in a hybrid strategy that may become unimportant, as they can be used for scaffolding)
    –  Hi-C
        •  Incorporates longer range genome information

Applications of Next-Gen Sequencing

[Figure: workflow diagram for genome, chromatin and transcriptome; recoverable labels follow]

•  de novo sequencing: assembly, annotation
•  Resequencing: mapping, detection of variants
•  ATAC-Seq: mapping, identify open chromatin
•  ChIP-Seq: mapping, detection of binding sites
•  RNA-Seq: mapping, transcript detection and quantification
•  Hi-C: mapping, 3D reconstruction

•  Genome variability, metagenomics, genome modifications, detection of mutations, association studies, phylogeny, evolution
•  Nucleosome positioning, 3D genome architecture, active promoters, interactions between nucleic acids and proteins, chromatin modifications
•  Transcript identity, transcript abundance, RNA editing, SNPs, allele specific expression, regulation

Mapping Short Reads

•  Many options; often a trade off between speed, resources and sensitivity.

•  Several open source projects to solve this problem, continually improving in speed and memory requirements.

•  New features being added all the time.

•  When dealing with short read data, make sure you have the very latest versions of the software you’re using, as some are updated frequently.

Alignment

@HWI-EAS412_4:1:1:1376:380
AATGATACGGCGACCACCGAGATCTA

Approaches to Short Read Alignment

Two main approaches

•  Hash-based mapping:
    –  Hashing of reads (e.g. Maq, Eland, SHRiMP)
    –  Hashing of genome (e.g. Novoalign, SHRiMP2)
•  Indexing of tree-like structures:
    –  Bowtie (ungapped)
    –  Bowtie2 (gapped)
    –  BWA (gapped)
    –  These all use suffix arrays/Burrows-Wheeler Transform (BWT), coupled with an FM index

Hashing

•  A hash function simply converts a string (“key”) to an integer (“value”).

•  The integer is then used as an index in an array, for fast look up.

[Figure: keys are passed through a hash function and assigned to buckets]
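
As a concrete illustration of hashing applied to a genome, the sketch below uses a 2-bit-per-base encoding as the hash function, so each k-mer becomes an integer that indexes directly into an array of buckets. This is a toy example under stated assumptions, not how any particular aligner is implemented: the function names, the value of k and the reference string are all illustrative, and bases other than A, C, G, T (such as N) are not handled.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(kmer):
    """Convert a DNA k-mer (the key) to an integer (the value), 2 bits per base."""
    value = 0
    for base in kmer:
        value = value * 4 + BASE_CODE[base]
    return value


def build_index(reference, k):
    """Bucket i holds every offset whose k-mer hashes to i."""
    buckets = [[] for _ in range(4 ** k)]
    for offset in range(len(reference) - k + 1):
        buckets[kmer_hash(reference[offset:offset + k])].append(offset)
    return buckets


def seed_hits(read, buckets, k):
    """Candidate reference offsets for the first k bases of a read."""
    return buckets[kmer_hash(read[:k])]


reference = "GTCCACCTAGTACCATTTGT"  # toy sequence for illustration
buckets = build_index(reference, k=4)
print(kmer_hash("ACCT"))                 # 23
print(seed_hits("ACCTAG", buckets, k=4))  # [4]
```

A lookup costs one hash computation and one array access regardless of reference length; the price is memory, since the bucket array grows as 4^k.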

MAQ

•  First widely used open source short read aligner
•  Very fast, but at the cost of accuracy (ungapped)
•  Uses hashing to index sequence reads, then scans with reference sequence
•  Guaranteed to find alignments with up to 2 mismatches
•  Can take advantage of paired end reads
•  Uses quality scores to determine best alignments
•  Generally no longer used, as it has been superseded by newer aligners

http://sourceforge.net/projects/maq/

Tries

•  The term trie comes from retrieval
•  Sometimes pronounced tree, sometimes try.
•  Is a tree representing a collection of strings, with one common node per prefix
•  Smallest tree such that:
    –  Each edge is labeled with a character c ∈ Σ
    –  A node has at most one outgoing edge labeled c, for c ∈ Σ
    –  Each key is “spelled out” along some path starting at the root

Trie Example

Key       Value
instant   1
internal  2
internet  3

Example taken from slides by Ben Langmead

Tries – another example

We can index T with a trie. The trie maps substrings to the offsets where they occur.

T = gtccacctagtaccatttgt

Substring  Offset(s)
ac         4
ag         8
at         14
cc         2, 12
ct         6
gt         0, 18
ta         10
tt         16

Example taken from slides by Ben Langmead
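
The index in this example can be reproduced with a small trie built from nested dictionaries. This is a minimal sketch: the function names and the `"offsets"` key used to store values at a node are choices made for the illustration, and the 2-mers are taken at every second offset to match the offsets listed for T.

```python
def build_trie(text, k, step):
    """Insert the k-mers starting at offsets 0, step, 2*step, ... into a trie."""
    root = {}
    for offset in range(0, len(text) - k + 1, step):
        node = root
        for ch in text[offset:offset + k]:
            node = node.setdefault(ch, {})  # at most one outgoing edge per character
        node.setdefault("offsets", []).append(offset)
    return root


def lookup(trie, key):
    """Spell out the key from the root; return the offsets stored at the end."""
    node = trie
    for ch in key:
        if ch not in node:
            return []
        node = node[ch]
    return node.get("offsets", [])


T = "gtccacctagtaccatttgt"
trie = build_trie(T, k=2, step=2)
print(lookup(trie, "cc"))  # [2, 12]
print(lookup(trie, "gt"))  # [0, 18]
print(lookup(trie, "aa"))  # []
```

Keys that share a prefix share a path: "ac", "ag" and "at" all hang off the single "a" edge at the root.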

Bowtie

•  Also uses quality scores to find best alignments.
•  Uses “Burrows-Wheeler index” to keep its memory footprint small.
•  Can find alignments with up to 3 mismatches in the first L bases of the read.
•  Only ungapped alignments
•  Also supports paired end reads.
•  Bowtie2 supports gapped alignments too.

http://bowtie-bio.sourceforge.net/

Bowtie Algorithm

[Figure sequence, repeated over several slides: step-by-step alignment of the read below]

@HWI-EAS412_4:1:1:1376:380
AATGATACGGCGACCACCGAGATCTA

[Final frames: the same read shown as AATAATACGGCGACCACCGAGATCTA, which differs from the original at the fourth base (G to A)]

BWA

•  From the author of Maq, but now also uses Burrows-Wheeler transform to significantly speed it up, and use less memory.

•  Can also find small indels, in contrast to both Maq and Bowtie.

•  Is slightly slower than Bowtie, but its ability to find indels makes it more useful if SNVs are important to you.

Current Practice

•  Most people use bwa for mapping their short read data if they want to discover variants

•  Novoalign is actually a more accurate aligner, but slower, and often impractical for large genomes (though we often use it for yeast)

•  If not interested in variants, people use bowtie for speed

•  There is no standard benchmark dataset, though see:
    –  Holtgrewe et al. (2011). A novel and well-defined benchmarking method for second generation read mapping. BMC Bioinformatics 12:210.

•  It doesn’t hurt to experiment

SAM files

•  Sequence Alignment/Map format
•  Is a concise file format that contains information about how sequence reads map to a reference genome
•  Can be further compressed in BAM format, which is a binary format of SAM.
•  Can also be sorted and indexed to provide fast random access, using SAMtools (more on this in a minute).
•  Requires ~1 byte per input base to store sequences, qualities and meta information.
•  Supports paired-end reads (and color space).
•  Is produced by bowtie, bwa
•  SAM can be converted to pileup format.

SAM format

No.  Name   Description
1    QNAME  Query NAME of the read or read pair
2    FLAG   Bitwise FLAG
3    RNAME  Reference Sequence Name
4    POS    1-Based leftmost POSition of clipped alignment
5    MAPQ   MAPping Quality (Phred-scaled)
6    CIGAR  Extended CIGAR string (operations: MIDNSHP)
7    MRNM   Mate Reference NaMe (‘=’ if same as RNAME)
8    MPOS   1-Based leftmost Mate POSition
9    ISIZE  Inferred Insert SIZE
10   SEQ    Query SEQuence on the same strand as the reference
11   QUAL   Query QUALity (ASCII-33=Phred base quality)

SAM Format

coor 12345678901234  5678901234567890123456789012345
ref  AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT

r001+      TTAGATAAAGGATA*CTG
r004+                 ATAGCT..............TCAGC
r001-                                      CAGCGCCAT

@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r001 83 ref 37 30 9M * 0 0 CAGCGCCAT *
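
To make the fields concrete, the sketch below parses the three alignment lines from this example, decodes the bitwise FLAG, and uses the CIGAR string to work out where each alignment ends on the reference. It assumes tab-separated fields, as in real SAM files (the slide shows them space-separated). The function names and flag labels are illustrative, and the CIGAR pattern also accepts the `=` and `X` operations from later versions of the format.

```python
import re

FLAG_BITS = [
    (0x1, "paired"), (0x2, "proper_pair"), (0x4, "unmapped"),
    (0x8, "mate_unmapped"), (0x10, "reverse"), (0x20, "mate_reverse"),
    (0x40, "first_in_pair"), (0x80, "second_in_pair"),
]
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """List the names of the bits that are set in a SAM FLAG."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def reference_end(pos, cigar):
    """1-based position of the last reference base covered by the alignment."""
    # Only M, D, N, = and X consume reference bases; I, S, H and P do not.
    span = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")
    return pos + span - 1


def parse_sam_line(line):
    """Pick out the fields used in this example from one alignment line."""
    f = line.rstrip("\n").split("\t")
    return {"qname": f[0], "flag": int(f[1]), "rname": f[2], "pos": int(f[3]),
            "mapq": int(f[4]), "cigar": f[5], "seq": f[9], "qual": f[10]}


def phred_scores(qual):
    """Decode a QUAL string: Phred score = ASCII code - 33."""
    return [ord(ch) - 33 for ch in qual]


sam_lines = [
    "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*",
    "r004\t0\tref\t16\t30\t6M14N5M\t*\t0\t0\tATAGCTTCAGC\t*",
    "r001\t83\tref\t37\t30\t9M\t*\t0\t0\tCAGCGCCAT\t*",
]
records = [parse_sam_line(line) for line in sam_lines]
for rec in records:
    end = reference_end(rec["pos"], rec["cigar"])
    print(rec["qname"], rec["pos"], end, decode_flag(rec["flag"]))
print(phred_scores("II5"))  # [40, 40, 20]
```

The computed end positions (22, 40 and 45) agree with where each read stops in the alignment picture above.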

Alignment Visualization

•  SAMtools has its own terminal-based text visualization tool, called tview

•  Very simple, but fast, and useful for looking at SNPs and indels

Alignment Visualization

•  For more complex viewing needs, use either GenomeView or IGV

•  Can display BAM files, plus annotation tracks
•  Java based dynamic visualization software
•  Shows SNPs, indels, spliced reads and a signal track

http://genomeview.sourceforge.net/
http://www.broadinstitute.org/igv/

•  Also check out JBrowse for online viewing of BAM files

NB: Browser-driven analyses are necessarily anecdotal and at best semi-quantitative.

Genome Resequencing

•  SNP/indel discovery
    –  Can now multiplex 96 haploid yeast strains on a single lane of the HiSeq 2500 (~30x coverage each)
    –  1 lane of HiSeq = ~10x human genome
•  Structural variant discovery
•  Copy Number changes
•  Novel sequence discovery
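
The two multiplexing figures above can be cross-checked with simple arithmetic, since the bases required equal genome size × coverage × number of samples. The genome sizes below are approximate values assumed for this check (about 12.1 Mb for S. cerevisiae and about 3.1 Gb for human); they are not stated on the slide.

```python
YEAST_GENOME_BP = 12.1e6  # approximate S. cerevisiae genome size (assumption)
HUMAN_GENOME_BP = 3.1e9   # approximate human genome size (assumption)


def bases_needed(genome_bp, coverage, n_samples=1):
    """Total sequenced bases required for the given depth across all samples."""
    return genome_bp * coverage * n_samples


yeast_lane = bases_needed(YEAST_GENOME_BP, coverage=30, n_samples=96)
human_lane = bases_needed(HUMAN_GENOME_BP, coverage=10)
print(f"96 yeast strains at 30x: {yeast_lane / 1e9:.1f} Gb")
print(f"1 human genome at 10x:   {human_lane / 1e9:.1f} Gb")
```

Both come out at roughly 30 to 35 Gb, so the two statements describe about the same lane output.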

SNP and indel calling

•  SNPs and indels are ‘called’ from SAM/BAM files
•  Typically use GATK (Genome Analysis Toolkit), which will identify nucleotides with support for variation
    –  Sounds trivial, but typically lots of false positives
    –  Harder in a diploid than a haploid
    –  Harder with lower coverage
    –  You *must* read the “Best Practice Variant Detection with the GATK” page before using GATK; it changes frequently
    –  Make sure the version of GATK you are using (it is updated frequently) corresponds to the latest “best practices”
    –  Always document what you did – this is best done in the form of a Perl or Python script that glues GATK calls into a pipeline, and that documents its parameters.
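
A simple binomial model shows why calling is harder in a diploid and at lower coverage. The sketch assumes error-free reads, that a heterozygous variant appears in each read with probability 0.5 (1.0 in a haploid), and a hypothetical rule that at least 3 supporting reads are needed to make a call. None of this reflects GATK's actual statistical model.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


MIN_ALT_READS = 3  # hypothetical calling threshold, for illustration only

for depth in (4, 10, 30):
    het = prob_at_least(MIN_ALT_READS, depth, 0.5)  # heterozygous site in a diploid
    hap = prob_at_least(MIN_ALT_READS, depth, 1.0)  # variant site in a haploid
    print(f"depth {depth:2d}: diploid het {het:.3f}, haploid {hap:.3f}")
```

Under these assumptions a heterozygous site at 4x depth reaches the threshold less than a third of the time, whereas a haploid variant at the same depth always does.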

VCF files

•  VCF stands for variant call format
•  Produced by the GATK, and describes all variant positions derived from BAM files
•  Meta information lines, preceded by ##, indicate the source of data used, and how the file was generated.
•  Variants are not annotated – to do this, use a tool such as SnpEff or ANNOVAR
    –  Helps you zero in on which variants might be of interest
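
The layout of a VCF file (## meta lines, one # header line, then tab-separated records) can be seen in a few lines of parsing code. The file content below is entirely made up for illustration, including the caller name, chromosome names and positions, and only the eight fixed columns are shown.

```python
VCF_TEXT = """##fileformat=VCFv4.2
##source=hypothetical_caller
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chrI\t1042\t.\tA\tG\t87.3\tPASS\tDP=31
chrII\t55210\t.\tCT\tC\t45.0\tPASS\tDP=28
"""


def parse_vcf(text):
    """Split VCF text into meta lines and a list of per-variant dictionaries."""
    meta, header, records = [], None, []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line[2:])                # meta information line
        elif line.startswith("#"):
            header = line[1:].split("\t")        # column names
        elif line:
            records.append(dict(zip(header, line.split("\t"))))
    return meta, records


meta, records = parse_vcf(VCF_TEXT)
print(meta)
for rec in records:
    kind = "SNP" if len(rec["REF"]) == len(rec["ALT"]) == 1 else "indel"
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], kind)
```

Note that nothing in these records says which gene is affected or what the consequence is, which is why a separate annotation step is needed.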

Copy Number Variation

•  Sequence coverage of a region can allow detection of amplifications or deletions – i.e. duplicated regions will have higher coverage.

•  Power to detect such regions depends on their size, copy number and number of reads.
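
One simple way to turn coverage into copy number calls is to compare mean depth in fixed windows against a control sample and look at the log2 ratio. The sketch below is an illustration under stated assumptions, not the method of any particular tool: the use of a control, the window size, the 0.58 threshold (about log2 of 1.5) and the depth values are all made up.

```python
from math import log2


def window_means(depths, window):
    """Mean depth in consecutive, non-overlapping windows."""
    return [
        sum(depths[i:i + window]) / window
        for i in range(0, len(depths) - window + 1, window)
    ]


def call_windows(sample, control, window, threshold=0.58):
    """Label each window by the log2 ratio of sample to control depth."""
    calls = []
    pairs = zip(window_means(sample, window), window_means(control, window))
    for idx, (s, c) in enumerate(pairs):
        ratio = log2((s + 1e-9) / (c + 1e-9))  # tiny pseudocount avoids log2(0)
        if ratio >= threshold:
            state = "gain"
        elif ratio <= -threshold:
            state = "loss"
        else:
            state = "neutral"
        calls.append((idx * window, state))
    return calls


control = [30] * 12                        # toy per-base depths
sample = [30] * 4 + [60] * 4 + [31] * 4    # middle window has doubled coverage
calls = call_windows(sample, control, window=4)
print(calls)  # [(0, 'neutral'), (4, 'gain'), (8, 'neutral')]
```

Larger windows average out more sampling noise but can only detect larger events, which is the size and read-number trade-off described above.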

CNV Detection Power

Chiang et al, 2009. Nature Methods 6, 99-103.

Coverage Shows Amplification

[Figure: sequence coverage plot showing amplification at HXT7 and HXT6]

Structural Variation

•  Using paired-end read data, it’s possible to identify structural variants:

[Figure: paired-end signatures of structural variants, including new sequence. Adapted from Korbel et al., 2007.]
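
The basic paired-end logic is that read pairs spanning a deletion in the sample map farther apart on the reference than the library insert size predicts, pairs spanning an insertion map closer together, and pairs with unexpected relative orientation suggest an inversion. A minimal sketch of that classification, assuming a known insert size mean and standard deviation; the three-standard-deviation cutoff, the function name and the numbers are hypothetical and not taken from any of the tools named in this lecture.

```python
def classify_pair(span, expected, sd, expected_orientation=True, n_sd=3):
    """Classify one mapped read pair from its span on the reference."""
    if not expected_orientation:
        return "inversion candidate"
    if span > expected + n_sd * sd:
        return "deletion candidate"   # mates map too far apart
    if span < expected - n_sd * sd:
        return "insertion candidate"  # mates map too close together
    return "concordant"


# Toy library: 500 bp inserts with a standard deviation of 50 bp.
for span in (520, 3000, 200):
    print(span, classify_pair(span, expected=500, sd=50))
print(510, classify_pair(510, expected=500, sd=50, expected_orientation=False))
```

A single discordant pair is weak evidence; in practice one would look for several independent pairs supporting the same event.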

Structural Variant Identification

•  There are several tools, e.g.:
    –  BreakDancer (used by WashU Sequencing Center)
    –  VariationHunter
    –  MoDIL
    –  inGAP-SV
        •  has a nice visualization tool (http://ingap.sourceforge.net/)
    –  PEMer (developed by the Gerstein lab)

•  No real robust comparison between these exists on a standard dataset

•  Ask around, find out what others are using, get latest versions and experiment

Identifying Novel Sequences

•  Typically after aligning your data, you have some number of left over reads, which didn’t map to your reference sequence

•  These are either reads with a bunch of errors, or correspond to additional sequence in your sample of interest that is not present in the reference sequence

•  Can often assemble these left over reads into contigs of novel sequence, for example using Velvet, after which you can annotate them.

•  Often interesting things in your unmapped reads, so don’t just discard them without thinking about it
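
Collecting the leftover reads amounts to selecting SAM records whose FLAG has the unmapped bit (0x4) set and writing them back out as FASTQ, ready for an assembler. A minimal sketch on toy, tab-separated SAM lines; the read `r900` and its sequence are invented for the example. In practice SAMtools can do the same filtering directly on BAM files (for instance `samtools view -f 4`).

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def unmapped_reads_to_fastq(sam_lines):
    """Return FASTQ text for every alignment line with the unmapped bit set."""
    out = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & UNMAPPED:
            name, seq, qual = fields[0], fields[9], fields[10]
            out.append(f"@{name}\n{seq}\n+\n{qual}\n")
    return "".join(out)


sam_lines = [
    "@SQ\tSN:ref\tLN:45",
    "r004\t0\tref\t16\t30\t6M14N5M\t*\t0\t0\tATAGCTTCAGC\t*",
    "r900\t4\t*\t0\t0\t*\t*\t0\t0\tGGCTTACGATCC\tIIIIIIIIIIII",  # hypothetical read
]
fastq = unmapped_reads_to_fastq(sam_lines)
print(fastq, end="")
```

The resulting FASTQ can be passed to an assembler such as Velvet to build contigs of candidate novel sequence.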

Recommended Reading

File Formats and Tools

•  Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R.; 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16):2078-9.

•  Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin, R.; 1000 Genomes Project Analysis Group. (2011). The variant call format and VCFtools. Bioinformatics 27(15):2156-8.

•  Cingolani, P., Platts, A., Wang, le L., Coon, M., Nguyen, T., Wang, L., Land, S.J., Lu, X., Ruden, D.M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6(2):80-92.

Mapping and Variant Calling

•  Li, H., Ruan, J. and Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18(11):1851-8. (MAQ)

•  Li, R., Yu, C., Li, Y., Lam, T.W., Yiu, S.M., Kristiansen, K. and Wang, J. (2009). SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15):1966-7.

•  Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754-60. (BWA)

•  Langmead, B., Trapnell, C., Pop, M. and Salzberg, S.L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3):R25. (Bowtie)

•  Langmead, B. and Salzberg, S. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods 9:357-359.

•  Hwang, S., Kim, E., Lee, I., Marcotte, E.M. (2015). Systematic comparison of variant calling pipelines using gold standard personal exome variants. Sci Rep. 5:17875.