Alexis Dereeper CIBA courses – Brasil 2011

Detection and analysis of SNP polymorphisms

Objectives

• To know and use the available packages/tools for SNP and INDEL detection from NGS data (assembly of NGS data)

• To consider the difficulties encountered when analysing next-generation sequencing data (distinguishing sequencing errors, paralogs and allelic variation)

• Detect SNPs and assign genotypes to every polymorphic position

• Easily exploit polymorphism data via a Web-based application (genetic diversity, LD)

• Obtain an exploitable dataset to send for the design of a high-throughput SNP chip (Illumina VeraCode technology)

Workflow: Short reads (Solexa) → Mapping (SAM) → Assignment of genotypes → List of SNPs → Exploitation of polymorphism data → Design of an Illumina SNP chip

Allelic variations

Ind1 ATTGTGTCGTAACGTATGTCATGTCGT
Ind2 ATTGTGTCGGAACGTATGTCATGTCGT
Ind3 ATTGTGTCGKAACGTATGTCATGTCGT

List of SNPs

867 A/G
1998 T/C
2341 T/G
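The Ind1 to Ind3 example shows a heterozygous genotype written with an IUPAC code (K stands for G or T). As a minimal sketch, assuming the sequences are already aligned and of equal length, polymorphic positions could be listed as follows. The function name `polymorphic_sites` and the dictionary layout are illustrative choices, not part of any tool named in the slides.

```python
# IUPAC nucleotide codes restricted to single bases and two-base ambiguity codes.
IUPAC_TO_ALLELES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
}

# The three aligned sequences shown on the slide.
ALIGNMENT = {
    "Ind1": "ATTGTGTCGTAACGTATGTCATGTCGT",
    "Ind2": "ATTGTGTCGGAACGTATGTCATGTCGT",
    "Ind3": "ATTGTGTCGKAACGTATGTCATGTCGT",
}


def polymorphic_sites(alignment):
    """Return (1-based position, alleles) for every column with more than one allele."""
    sites = []
    for position, column in enumerate(zip(*alignment.values()), start=1):
        alleles = set()
        for code in column:
            # An IUPAC ambiguity code contributes both of its alleles.
            alleles.update(IUPAC_TO_ALLELES[code])
        if len(alleles) > 1:
            sites.append((position, "/".join(sorted(alleles))))
    return sites


if __name__ == "__main__":
    for position, alleles in polymorphic_sites(ALIGNMENT):
        print(position, alleles)
```

Expanding the ambiguity code before comparing columns means a heterozygous individual alone is enough to flag a site as polymorphic.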

Tablet

• Graphical viewer for assembly of NGS data

• Accepts different formats: ACE, SAM, BAM

Automatic detection of SNP from SAM assembly

Example of pipelines feasible with the Galaxy system: 3 alternatives

[Diagram: modules grouped by package (SamTools, GATK, PicardTools, VarScan, SNiPlay Utilities)]

• SAM assembly → SAM-to-BAM → Generate Pileup → Pileup file → Pileup2snp → SNP tabular file

• Fastq → FastQ Groomer → Mapping BWA → AddReadGroupIntoSam → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file → VCFToFastaAlignments → FASTA alignments with IUPAC

• SAM assembly → SamToFastaAlignments → FASTA alignments with IUPAC

VarScan

Program for SNP detection from a Pileup file: Pileup2snp. Another module, Pileup2indel, exists for indels but is not yet implemented in Galaxy SouthGreen.

Pileup format

Text file describing, for each position: reference base, depth of coverage, variations (read bases), quality

seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
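To make the read-bases column of the pileup format concrete, here is a small sketch that counts the bases observed at one position. It assumes the classic samtools pileup conventions: `.` and `,` match the reference, `^` is followed by one mapping-quality character, `$` ends a read, and `+n`/`-n` introduce indels. The function name `count_pileup_bases` is hypothetical and this is not how Pileup2snp itself is implemented.

```python
import re
from collections import Counter


def count_pileup_bases(ref, bases):
    """Count reference and non-reference bases in a pileup read-bases string."""
    counts = Counter()
    i = 0
    while i < len(bases):
        char = bases[i]
        if char == "^":
            # Read start: the next character encodes mapping quality, skip both.
            i += 2
            continue
        if char == "$":
            # Read end marker carries no base.
            i += 1
            continue
        if char in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases.
            length_digits = re.match(r"\d+", bases[i + 1:]).group()
            i += 1 + len(length_digits) + int(length_digits)
            continue
        if char in ".,":
            counts[ref.upper()] += 1
        elif char != "*":
            # Any other letter is a mismatch (upper case: forward strand, lower: reverse).
            counts[char.upper()] += 1
        i += 1
    return counts


if __name__ == "__main__":
    # Read-bases column of position 277 from the example above.
    print(dict(count_pileup_bases("T", "....,,.,.,.C.,,,.,..G.")))
```

The `^` case is handled before the indel case on purpose: in the first example line the mapping-quality character after `^` happens to be `+`, which must not be read as an insertion.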

SamToFastaAlignments and AceToFastaAlignments: SNiPlay utilities for management of NGS data

Input: Assembly (ACE format) or Mapping (SAM format)

Threshold values per genotype

[Figure: for each genotype (genotype1, genotype2, genotype3), three values under the columns Depth, Frequency, Depth. The labels distinguish a depth threshold for the position, a heterozygosity (frequency) threshold, and a depth threshold for heterozygosity estimation. Rows shown: 1 0 1; 4 0.3 2; 4 0.3 2. Example bases shown: W, Y, A, A, T]

Output, for each contig (e.g. CL1Contig1):

• CL1Contig1.align.fa (+ CL1Contig2.align.fa, CL2Contig1.align.fa …): FASTA alignments including IUPAC codes

• List of heterozygous positions

• Stats: estimation of average heterozygosity for each genotype
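The following sketch illustrates how a depth threshold and a frequency threshold could turn allele counts into a consensus base with IUPAC codes. The defaults (depth 4, frequency 0.3) echo values shown on the slide, but how SamToFastaAlignments actually applies them is an assumption here, and `call_genotype` is a hypothetical name.

```python
# Two-base IUPAC ambiguity codes, keyed by the unordered pair of alleles.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("AC"): "M", frozenset("CG"): "S", frozenset("AT"): "W",
}


def call_genotype(counts, min_depth=4, min_freq=0.3):
    """Return a base, an IUPAC code for a heterozygote, or 'N' if depth is too low.

    counts maps each observed base to its number of supporting reads.
    """
    depth = sum(counts.values())
    if depth < min_depth or depth == 0:
        return "N"
    # Rank alleles from most to least supported.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    major = ranked[0][0]
    if len(ranked) > 1:
        minor, minor_count = ranked[1]
        # Call a heterozygote only if the second allele is frequent enough.
        if minor_count > 0 and minor_count / depth >= min_freq:
            return IUPAC[frozenset((major, minor))]
    return major


if __name__ == "__main__":
    print(call_genotype({"A": 6, "T": 4}))
    print(call_genotype({"T": 22, "A": 1}))
    print(call_genotype({"A": 2}))
```

Requiring a minimum frequency for the second allele is one simple way to keep isolated sequencing errors from being reported as heterozygous positions.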

GATK (Genome Analysis ToolKit)

• Package for analysis of NGS data.

• Developed for the analysis of human medical resequencing projects (1000 Genomes, The Cancer Genome Atlas)

• Includes tools for depth analysis, quality score recalibration, SNP/InDel discovery

• Complementary to two other packages: SamTools, PicardTools

PREPROCESS:

* Index human genome (Picard), we used HG18 from UCSC.
* Convert Illumina reads to Fastq format
* Convert Illumina 1.6 read quality scores to standard Sanger scores

FOR EACH SAMPLE:

1. Align samples to genome (BWA), generates SAI files.
2. Convert SAI to SAM (BWA)
3. Convert SAM to BAM binary format (SAM Tools)
4. Sort BAM (SAM Tools)
5. Index BAM (SAM Tools)
6. Identify target regions for realignment (Genome Analysis Toolkit)
7. Realign BAM to get better Indel calling (Genome Analysis Toolkit)
8. Reindex the realigned BAM (SAM Tools)
9. Call Indels (Genome Analysis Toolkit)
10. Call SNPs (Genome Analysis Toolkit)
11. View aligned reads in BAM/BAI (Integrative Genomics Viewer)
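Steps 1 to 10 of the per-sample procedure can be sketched as a dry-run command plan. This is a sketch under stated assumptions: the file names are hypothetical, the samtools options follow samtools 1.x, and the GATK walker names and flags (`-T`, `-targetIntervals`, `-glm`) follow the legacy GATK 1.x to 3.x command line, which differs from GATK 4. Nothing is executed; the function only builds argument lists.

```python
import shlex


def build_plan(ref, fastq, sample, gatk_jar="GenomeAnalysisTK.jar"):
    """Return the per-sample commands (steps 1-10) as lists of arguments."""
    sai, sam = f"{sample}.sai", f"{sample}.sam"
    bam, sorted_bam = f"{sample}.bam", f"{sample}.sorted.bam"
    intervals, realigned = f"{sample}.intervals", f"{sample}.realigned.bam"
    gatk = ["java", "-jar", gatk_jar]  # path to the jar is an assumption
    return [
        ["bwa", "aln", "-f", sai, ref, fastq],                 # 1. align, writes SAI
        ["bwa", "samse", "-f", sam, ref, sai, fastq],          # 2. SAI to SAM
        ["samtools", "view", "-b", "-S", "-o", bam, sam],      # 3. SAM to BAM
        ["samtools", "sort", "-o", sorted_bam, bam],           # 4. sort BAM
        ["samtools", "index", sorted_bam],                     # 5. index BAM
        gatk + ["-T", "RealignerTargetCreator", "-R", ref,     # 6. realignment targets
                "-I", sorted_bam, "-o", intervals],
        gatk + ["-T", "IndelRealigner", "-R", ref, "-I", sorted_bam,  # 7. realign
                "-targetIntervals", intervals, "-o", realigned],
        ["samtools", "index", realigned],                      # 8. reindex
        gatk + ["-T", "UnifiedGenotyper", "-R", ref, "-I", realigned,  # 9. indels
                "-glm", "INDEL", "-o", f"{sample}.indels.vcf"],
        gatk + ["-T", "UnifiedGenotyper", "-R", ref, "-I", realigned,  # 10. SNPs
                "-glm", "SNP", "-o", f"{sample}.snps.vcf"],
    ]


if __name__ == "__main__":
    for command in build_plan("ref.fa", "reads.fastq", "sample1"):
        print(shlex.join(command))
```

Keeping the plan as data makes it easy to review the commands, or to pass each list to `subprocess.run` once the tool versions installed locally have been checked against these flags.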

[Diagram: GATK pipeline in Galaxy for several samples (RC1, RC2, RC3, RC4, ….), using read groups]

• Per sample: Fastq (RC1, RC2, RC3, RC4, ….) → FastQ Groomer → Mapping BWA → AddReadGroupIntoSam → SAM with read group

• All samples: mergeSam → Global SAM with read group → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file

• Alternative: Fastq (RC1) + Fastq (RC2) + Fastq (RC3) + Fastq (RC4) → Fastq global → FastQ Groomer → Mapping BWA → AddReadGroupIntoSam → Global SAM with read group → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file
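To show what tagging a SAM file with a read group involves, here is a minimal sketch based on the SAM specification: an `@RG` header line with `ID` and `SM` fields, and an `RG:Z:` optional tag on every alignment record. It is not the code of AddReadGroupIntoSam or mergeSam; the function names and the tiny input records are illustrative only.

```python
def add_read_group(sam_lines, rg_id, sample):
    """Return SAM lines with an @RG header line and an RG:Z tag on every record."""
    header, records = [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Header lines start with '@'; read names cannot.
            header.append(line)
        else:
            records.append(f"{line}\tRG:Z:{rg_id}")
    return header + [f"@RG\tID:{rg_id}\tSM:{sample}"] + records


def merge_sam(tagged_files):
    """Concatenate tagged SAM files: unique header lines first, then all records."""
    header, records = [], []
    for lines in tagged_files:
        for line in lines:
            if line.startswith("@"):
                if line not in header:
                    header.append(line)
            else:
                records.append(line)
    return header + records


if __name__ == "__main__":
    # Hypothetical one-record SAM files for two samples.
    rc1 = ["@SQ\tSN:seq1\tLN:1000", "r1\t0\tseq1\t10\t60\t4M\t*\t0\t0\tACGT\t<<<<"]
    rc2 = ["@SQ\tSN:seq1\tLN:1000", "r2\t0\tseq1\t20\t60\t4M\t*\t0\t0\tTTGA\t<<<<"]
    merged = merge_sam([add_read_group(rc1, "RC1", "RC1"),
                        add_read_group(rc2, "RC2", "RC2")])
    print("\n".join(merged))
```

The read group is what lets a genotyper working on the global SAM/BAM attribute each read to its sample and report one genotype per individual.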

VCF format (Variant Call Format)

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3

Advantages: describes the variations at each position and gives the genotype assignment of each individual
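As an illustration of how a VCF record carries both the variation and the genotypes, the sketch below decodes one data line into per-sample alleles. It assumes whitespace-separated columns as displayed on the slide (real VCF files are tab-separated) and a `GT` key in the FORMAT column; `parse_vcf_record` is a hypothetical helper, not a library function.

```python
def parse_vcf_record(header_line, data_line):
    """Return (fixed fields, genotype per sample) for one VCF data line."""
    columns = header_line.lstrip("#").split()
    fields = data_line.split()
    # The first nine columns are fixed: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT.
    record = dict(zip(columns[:9], fields[:9]))
    # Allele index 0 is REF, indices 1..n are the comma-separated ALT alleles.
    alleles = [record["REF"]] + record["ALT"].split(",")
    keys = record["FORMAT"].split(":")
    genotypes = {}
    for sample, value in zip(columns[9:], fields[9:]):
        call = dict(zip(keys, value.split(":")))
        # '|' marks a phased genotype, '/' an unphased one.
        separator = "|" if "|" in call["GT"] else "/"
        genotypes[sample] = separator.join(
            alleles[int(index)] if index != "." else "."
            for index in call["GT"].split(separator)
        )
    return record, genotypes


if __name__ == "__main__":
    header = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002"
    line = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
            "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51")
    record, genotypes = parse_vcf_record(header, line)
    print(record["POS"], record["REF"], record["ALT"], genotypes)
```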

Other functionalities of GATK

• DepthOfCoverage module: reports the sequencing depth of coverage for each gene, each position and each individual

• ReadBackedPhasing module: determines, where possible, the allele association (phase or haplotype) at heterozygous positions…

[Figure: phasing example; surviving label: "And not AGGGGA"]

SNiPlay: Web-based application for polymorphism analysis

http://sniplay.cirad.fr

Options of SNiPlay

Select the VCF format

Load the VCF file

Load reference file

Select the Rice genome as reference

Design of Illumina chip

[Figure: Submission file for Illumina; Analysis with the BeadStudio software; Cartesian coordinates; Genotyping file]

Allelic files

• DARwin format

@DARwin 5.0 - ALLELIC - 2
33 20
N° 50 50 122 122 218 218 245 245 261 261 290 290 356
1 1 1 1 1 3 3 3 3 4 4 2 2 2
2 1 1 1 1 3 3 1 3 4 4 2 2 2
3 1 1 1 1 3 3 3 3 4 4 2 2 2
4 1 1 1 1 3 3 3 3 4 4 2 2 2

• .inp format for Phase

33
10
P 49 121 217 244 260 289
SSSSSSSSSS
#cARB
A A G G T C C A T T
A A G G T C C A T T
#cSYR
A A G A T C C A T C
A A G G T C C A T T

• PED format

cARB 1 0 0 1 0 1 1 1 1 3 3 3 3 4 4 2 2 2 2 1 1 4 4 4 4
cSYR 2 0 0 1 0 1 1 1 1 3 3 1 3 4 4 2 2 2 2 1 1 4 4 2 4
cARA 3 0 0 1 0 1 1 1 1 3 3 3 3 4 4 2 2 2 2 1 1 4 4 4 4

• Format for TASSEL (association studies)

33 10:2
50 122 218 245 261 290 356 461 467 560
cARB A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cSYR A:A A:A G:G A:G T:T C:C C:C A:A T:T C:T
cARA A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cORL A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cLAR A:G A:G A:G A:G C:T C:C C:C A:A T:T C:T
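The PED and TASSEL examples describe the same genotypes in two encodings: the PED rows are consistent with the coding A=1, C=2, G=3, T=4 preceded by six leading columns. The sketch below converts a TASSEL-style row into a PED-style row under that assumption; `to_ped_line` is a hypothetical helper and not SNiPlay's exporter.

```python
# Numeric allele coding consistent with the PED rows shown on the slide.
ALLELE_CODE = {"A": "1", "C": "2", "G": "3", "T": "4"}


def to_ped_line(name, index, genotypes, sex="1", phenotype="0"):
    """Build a PED row from a name, a running index and 'X:Y' genotypes.

    Leading columns follow the slide's layout: name, index, two parent
    columns set to 0, sex, phenotype.
    """
    alleles = []
    for genotype in genotypes.split():
        first, second = genotype.split(":")
        alleles += [ALLELE_CODE[first], ALLELE_CODE[second]]
    return " ".join([name, str(index), "0", "0", sex, phenotype] + alleles)


if __name__ == "__main__":
    print(to_ped_line("cSYR", 2, "A:A A:A G:G A:G T:T C:C C:C A:A T:T C:T"))
```

Run on the cSYR row of the TASSEL example, this reproduces the cSYR row of the PED example.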

Annotation of SNPs

SeqLib library

Diversity analysis

Haplotype networks

High frequency haplotypes

Low frequency haplotype

Group distribution within this haplotype

Distance between 2 haplotypes (number of mutations)
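The distance between two haplotypes, counted as a number of mutations, can be sketched as a simple position-by-position comparison of two equal-length allele strings. The function name `mutation_distance` is illustrative, and the two haplotypes are the pair listed for cSYR in the Phase-format example of these slides.

```python
def mutation_distance(haplotype_a, haplotype_b):
    """Number of positions at which two equal-length haplotypes differ."""
    if len(haplotype_a) != len(haplotype_b):
        raise ValueError("haplotypes must cover the same SNP positions")
    return sum(a != b for a, b in zip(haplotype_a, haplotype_b))


if __name__ == "__main__":
    print(mutation_distance("AAGATCCATC", "AAGGTCCATT"))
```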

Allele sharing between groups

External file (optional)

Individual, group
Ind1, Table
Ind2, Table
Ind3, Table
Ind4, East
Ind5, East
Ind6, East
Ind7, East
Ind8, West
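One plausible reading of allele sharing between groups is: pool the alleles seen in each group at a site, then intersect the pools pairwise. The sketch below reads the optional group file and applies that idea. The genotypes are hypothetical values invented for the illustration, and the exact statistic computed by SNiPlay may differ.

```python
import csv
import io
from itertools import combinations

# Content of the optional external file, as shown on the slide.
GROUP_FILE = """Individual, group
Ind1, Table
Ind2, Table
Ind3, Table
Ind4, East
Ind5, East
Ind6, East
Ind7, East
Ind8, West
"""


def read_groups(text):
    """Map each group name to the list of its individuals."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row
    groups = {}
    for row in reader:
        if len(row) >= 2:
            groups.setdefault(row[1].strip(), []).append(row[0].strip())
    return groups


def shared_alleles(groups, genotypes):
    """Return, for each pair of groups, the alleles present in both at one site."""
    pools = {}
    for group, members in groups.items():
        pool = set()
        for individual in members:
            pool.update(genotypes.get(individual, ""))
        pools[group] = pool
    return {(a, b): pools[a] & pools[b] for a, b in combinations(sorted(pools), 2)}


if __name__ == "__main__":
    # Hypothetical genotypes at a single SNP, for illustration only.
    genotypes = {"Ind1": "AA", "Ind2": "AA", "Ind3": "AA", "Ind4": "AG",
                 "Ind5": "AA", "Ind6": "GG", "Ind7": "AG", "Ind8": "GG"}
    for pair, alleles in shared_alleles(read_groups(GROUP_FILE), genotypes).items():
        print(pair, sorted(alleles))
```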
