Detection and analysis of SNP polymorphisms
Alexis DEREEPER
21 November 2013 Training course in Bioinformatics, Montpellier, 18-22 November 2013
What is a SNP?
• A SNP (single nucleotide polymorphism) is a single-base change in a DNA sequence that occurs in a significant proportion of a large population (a frequency above 1% is a commonly used threshold)
• SNPs in coding regions may (or may not) alter the protein structure and function
Why study SNPs?
• Population genetics: population stratification, linkage disequilibrium…
• Define markers for genetic maps
• Analysis of genome structure, genome evolution
• Genome-wide association studies (GWAS): SNPs can be used to estimate predisposition to disease and to predict specific genetic traits
• Functional analysis: alteration of protein structure and function
Re-sequencing projects: a deluge of SNPs
Using NGS technologies: resequencing → mapping → SNPs
SNPs from RNASeq: example of Arcad project
Strategy: comparative population genomics with transcriptomics data
• Available reference (genome/transcriptome)?
  Yes: Solexa sequencing → Mapping on reference
  No: 454 sequencing → De novo reference assembly → Solexa sequencing → Mapping on reference
• Ortholog/paralog assignment
• Polymorphism database in adapted format: redundancy, open reading frame, CDS/UTR
• Diversity study: comparative domestication, life history trait impact, functional evolution
• Crop breeding SNP database: functional annotation, selection footprint
Strategy for SNP discovery from NGS
For each individual (ind1, ind2, ind3, ind4, …):
• Fastq → FastQ Groomer → Mapping BWA → Add or Replace Groups (PicardTools) → BAM with read group
On all individuals together:
• mergeSam (PicardTools) → Global BAM with read group
• IndelRealigner (GATK) → UnifiedGenotyper (GATK) → VCF file
• DepthOfCoverage (GATK) → Depth file
FASTQ Format
• Standard format of short reads from NGS data
FASTQ file → TEXT file
STRUCTURE:
@HWUSI-EAS454_0006:1:112:14105:5498#CTTGTA
CGCCAAGAAGTGTAGCAAAACGGCAGAGCTCGTGGATTAAACAAACAGAGGATTTCGGTGAGGATTGAGGGGGAGT
+
cfffcfeffdeefefffcffffffffcffeffffdffffafcfffffdffffdfefeddf^eececfffdfcbffb
@HWUSI-EAS454_0006:1:37:16314:3410#CTTGTA
AGTGTAGCAAAACGGCAGAGCTCGTGGATTAAACAAACAGAGGATTTCGGTGAGGATTGAGGGGGAGTGGTGGCCG
+
`bTbbccccceeeeeceeeecccYeedded`ceec]dddde^a`deeeec\`dddcbaadadYd`]]Jc_^bc^^\
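The four-line record structure above (header starting with `@`, bases, `+` separator, one quality character per base) can be read with a few lines of Python. This is only a minimal sketch: the names `FastqRecord` and `read_fastq` are hypothetical, and it assumes each record occupies exactly four lines, as in the example.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header line without the leading '@'
    sequence: str  # base calls
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# First record of the slide, used as in-memory test data.
example = io.StringIO(
    "@HWUSI-EAS454_0006:1:112:14105:5498#CTTGTA\n"
    "CGCCAAGAAGTGTAGCAAAACGGCAGAGCTCGTGGATTAAACAAACAGAGGATTTCGGTGAGGATTGAGGGGGAGT\n"
    "+\n"
    "cfffcfeffdeefefffcffffffffcffeffffdffffafcfffffdffffdfefeddf^eececfffdfcbffb\n"
)
records = list(read_fastq(example))
for record in records:
    print(record.name, len(record.sequence))
```

For real files, `open("reads.fastq")` (a placeholder file name) would replace the `io.StringIO` object.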
FASTQ Format: quality scores
• Each quality character encodes a score as its ASCII code minus 64: f → Quality = 38 (102 – 64)
@HWUSI-EAS454_0006:1:112:14105:5498#CTTGTA
CGCCAAGAAGTGTAGCAAAACGGCAGAGCTCGTGGATTAAACAAACAGAGGATTTCGGTGAGGATTGAGGGGGAGT
+
cfffcfeffdeefefffcffffffffcffeffffdffffafcfffffdffffdfefeddf^eececfffdfcbffb
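The decoding rule on this slide (ASCII code minus 64) can be sketched as follows. The function names are hypothetical. The offset of 64 is taken from the slide; the Sanger offset of 33 and the relation Q = -10·log10(P) are the standard Phred conventions, which is what a "groomer" step converts to.

```python
from typing import List


def phred_scores(quality: str, offset: int = 64) -> List[int]:
    """Convert a FASTQ quality string to integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


def to_sanger(quality: str, offset: int = 64) -> str:
    """Re-encode a quality string with the Sanger offset of 33."""
    return "".join(chr(ord(char) - offset + 33) for char in quality)


# First ten quality characters of the read shown above.
scores = phred_scores("cfffcfeffd")
print(scores)
print(to_sanger("f"))         # 'f' (102) has score 38, written as chr(38 + 33)
print(error_probability(38))  # probability that a base with score 38 is wrong
```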
SAM Format
• Standard format for mapping of NGS data
• Sequence Alignment/Map (SAM): tab-delimited text format
• Binary Alignment/Map (BAM): compressed binary version of SAM
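As an illustration of the SAM text format, the sketch below splits one alignment line into its mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) and tests two FLAG bits. The class and function names are hypothetical, and the alignment line is invented for the example, not taken from the course data.

```python
from typing import NamedTuple


class SamAlignment(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flag
    rname: str  # reference sequence name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # alignment description
    seq: str    # read sequence
    qual: str   # read qualities

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)   # bit 0x4: read unmapped

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)  # bit 0x10: read on the reverse strand


def parse_sam_line(line: str) -> SamAlignment:
    """Parse the mandatory fields of one SAM alignment line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamAlignment(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        seq=fields[9], qual=fields[10],
    )


# Invented example line: a 10-base read mapped on the reverse strand.
line = "read1\t16\tchr01\t1042\t37\t10M\t*\t0\t0\tCGCCAAGAAG\tcfffcfeffd"
alignment = parse_sam_line(line)
print(alignment.rname, alignment.pos, alignment.is_reverse)
```

In practice a dedicated library or SAMtools would normally be used; the sketch only shows how the columns are laid out.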
Visualization of SNPs in mapping alignment
Tablet
• Graphical viewer for assembly of NGS data
• Accepts different formats: ACE, SAM, BAM
SNP discovery from NGS data
GATK (Genome Analysis Toolkit)
• Package for the analysis of NGS data
• Developed for the analysis of human medical resequencing projects (1000 Genomes, The Cancer Genome Atlas)
• Includes tools for depth analysis, quality score recalibration and SNP/InDel discovery
• Complements two other packages: SAMtools and Picard tools
PREPROCESS:
• Index the human genome (Picard); we used HG18 from UCSC
• Convert Illumina reads to Fastq format
• Convert Illumina 1.6 read quality scores to standard Sanger scores
FOR EACH SAMPLE:
1. Align samples to the genome (BWA); generates SAI files
2. Convert SAI to SAM (BWA)
3. Convert SAM to BAM binary format (SAMtools)
4. Sort BAM (SAMtools)
5. Index BAM (SAMtools)
6. Identify target regions for realignment (Genome Analysis Toolkit)
7. Realign BAM to get better InDel calling (Genome Analysis Toolkit)
8. Reindex the realigned BAM (SAMtools)
9. Call InDels (Genome Analysis Toolkit)
10. Call SNPs (Genome Analysis Toolkit)
11. View aligned reads in BAM/BAI (Integrative Genomics Viewer)
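The per-sample steps 1 to 10 above can be sketched as a list of command lines built in Python. This is an illustration under several assumptions, not the course's actual script: the file names, the helper `sample_commands` and the jar name are placeholders; the options follow BWA, SAMtools 1.x and GATK 3.x conventions and may differ in other versions (check each tool's help); steps 9 and 10 are merged here into one UnifiedGenotyper call with `-glm BOTH`; and GATK additionally expects read groups in the BAM, which this list of steps does not show.

```python
import shlex
from typing import List


def sample_commands(sample: str,
                    reference: str = "reference.fasta",
                    gatk_jar: str = "GenomeAnalysisTK.jar") -> List[List[str]]:
    """Return the per-sample commands as argument lists, without running them."""
    fastq = f"{sample}.fastq"
    sai, sam = f"{sample}.sai", f"{sample}.sam"
    bam, sorted_bam = f"{sample}.bam", f"{sample}.sorted.bam"
    intervals, realigned = f"{sample}.intervals", f"{sample}.realigned.bam"
    gatk = ["java", "-jar", gatk_jar]
    return [
        # 1-2: align with BWA, then convert SAI to SAM
        ["bwa", "aln", "-f", sai, reference, fastq],
        ["bwa", "samse", "-f", sam, reference, sai, fastq],
        # 3-5: SAM to BAM, sort, index
        ["samtools", "view", "-b", "-S", "-o", bam, sam],
        ["samtools", "sort", "-o", sorted_bam, bam],
        ["samtools", "index", sorted_bam],
        # 6-8: find realignment targets, realign, reindex
        gatk + ["-T", "RealignerTargetCreator", "-R", reference,
                "-I", sorted_bam, "-o", intervals],
        gatk + ["-T", "IndelRealigner", "-R", reference, "-I", sorted_bam,
                "-targetIntervals", intervals, "-o", realigned],
        ["samtools", "index", realigned],
        # 9-10: call InDels and SNPs together (assumption: -glm BOTH)
        gatk + ["-T", "UnifiedGenotyper", "-R", reference, "-I", realigned,
                "-glm", "BOTH", "-o", f"{sample}.vcf"],
    ]


commands = sample_commands("ind1")
for command in commands:
    print(shlex.join(command))  # dry run: only print the command lines
```

To execute the steps, each argument list could be passed to `subprocess.run(command, check=True)` once the tools are installed and the paths adapted.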
VCF Format (Variant Call Format)
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3
Advantages: describes the variation at each position together with the genotype assigned to each sample
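To make the column layout concrete, the sketch below parses the first data line of the example above: the eight fixed columns, the INFO key/value pairs, and the per-sample fields described by FORMAT. The function names are hypothetical. Real VCF files are tab-delimited; splitting on any whitespace is an assumption made here because the slide shows the columns separated by spaces.

```python
from typing import Dict, List


def parse_vcf_record(line: str, samples: List[str]) -> Dict[str, object]:
    """Parse one VCF data line into a dictionary."""
    fields = line.split()
    chrom, pos, variant_id, ref, alt, qual, filt, info, fmt = fields[:9]
    info_map: Dict[str, object] = {}
    for item in info.split(";"):
        key, has_value, value = item.partition("=")
        info_map[key] = value if has_value else True  # flags such as DB have no value
    keys = fmt.split(":")
    calls = {
        sample: dict(zip(keys, column.split(":")))
        for sample, column in zip(samples, fields[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": variant_id, "ref": ref,
        "alt": alt.split(","), "qual": None if qual == "." else float(qual),
        "filter": filt, "info": info_map, "samples": calls,
    }


def genotype_alleles(gt: str, ref: str, alt: List[str]) -> List[str]:
    """Translate a GT value such as '1|0' into bases (0 = REF, 1 = first ALT)."""
    alleles = [ref] + alt
    separator = "|" if "|" in gt else "/"  # '|' phased, '/' unphased
    return ["." if index == "." else alleles[int(index)]
            for index in gt.split(separator)]


line = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
        "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51")
record = parse_vcf_record(line, ["NA00001", "NA00002"])
for sample, call in record["samples"].items():
    print(sample, genotype_alleles(call["GT"], record["ref"], record["alt"]))
```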
GATK (other functionalities)
• DepthOfCoverage module: reports the sequencing depth of coverage for each gene, each position and each individual
• ReadBackedPhasing module: determines, where possible, which alleles are associated (phase or haplotype) at heterozygous positions…
And not AGG GGA
GATK (other functionalities)
(Figure: the SNP discovery workflow, extended with further GATK modules)
• DepthOfCoverage → Depth file
• ReadBackedPhasing → Phased VCF
• VariantFiltration → Filtered VCF
The SNiPlay project
SNiPlay: Web-based application for polymorphism analysis
http://sniplay.cirad.fr
(Figure: the SNP discovery workflow from Fastq files to VCF file and depth file, here with samples RC1, RC2, RC3, RC4, …)
SNiPlay: parameters and options
• Select the VCF format
• Load the VCF file, the reference and the depth file
• Indicate groups of individuals for SNP comparison
• Filter SNPs on a minimum depth of coverage
• Select the banana genome
• Check the steps to be performed
SNiPlay: SNP statistics
SNiPlay and Illumina genotyping chip
Cartesian coordinates
Genotyping file
Submission file for Illumina
Analysis with the BeadStudio software
Design of Illumina chip
SNiPlay: Allelic files
Provides various formats of allelic files:
• PED format
cARB 1 0 0 1 0 1 1 1 1 3 3 3 3 4 4 2 2 2 2 1 1 4 4 4 4
cSYR 2 0 0 1 0 1 1 1 1 3 3 1 3 4 4 2 2 2 2 1 1 4 4 2 4
cARA 3 0 0 1 0 1 1 1 1 3 3 3 3 4 4 2 2 2 2 1 1 4 4 4 4
• DARwin format
@DARwin 5.0 - ALLELIC - 2
33 20
N° 50 50 122 122 218 218 245 245 261 261 290 290 356
1 1 1 1 1 3 3 3 3 4 4 2 2 2
2 1 1 1 1 3 3 1 3 4 4 2 2 2
3 1 1 1 1 3 3 3 3 4 4 2 2 2
4 1 1 1 1 3 3 3 3 4 4 2 2 2
• .inp format for Phase
33
10
P 49 121 217 244 260 289
SSSSSSSSSS
#cARB
A A G G T C C A T T
A A G G T C C A T T
#cSYR
A A G A T C C A T C
A A G G T C C A T T
• Format for TASSEL (association studies)
33 10:2
50 122 218 245 261 290 356 461 467 560
cARB A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cSYR A:A A:A G:G A:G T:T C:C C:C A:A T:T C:T
cARA A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cORL A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cLAR A:G A:G A:G A:G C:T C:C C:C A:A T:T C:T
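The PED and TASSEL samples on this slide describe the same genotypes with two encodings: comparing the cARB and cSYR rows suggests the numeric coding A=1, C=2, G=3, T=4. The sketch below converts a TASSEL-style genotype row into the corresponding PED row. The function name is hypothetical, the coding is inferred from the slide rather than from SNiPlay itself, and the four values after the individual number (0 0 1 0) are simply copied from the slide.

```python
ALLELE_CODES = {"A": "1", "C": "2", "G": "3", "T": "4"}  # inferred from the slide


def to_ped_line(name: str, number: int, genotypes: str) -> str:
    """Convert genotypes such as 'A:A G:T' into a PED-style line."""
    codes = []
    for genotype in genotypes.split():
        first, second = genotype.split(":")
        codes += [ALLELE_CODES[first], ALLELE_CODES[second]]
    # Leading columns as on the slide: name, number, then 0 0 1 0.
    return " ".join([name, str(number), "0", "0", "1", "0"] + codes)


tassel_row = "A:A A:A G:G A:G T:T C:C C:C A:A T:T C:T"  # cSYR row of the slide
ped_line = to_ped_line("cSYR", 2, tassel_row)
print(ped_line)
```

The printed line can be compared with the cSYR row of the PED sample.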
SNiPlay : Annotation of SNPs
1) Locate SNP on a genome
• using Blast
• or using a GFF file if the reference corresponds to genes/CDS
2) Annotate SNPs with SnpEff program
SNiPlay : Diversity analysis
SeqLib library
SNiPlay : Haplotype network
High frequency haplotypes
Low frequency haplotype
Group distribution within this haplotype
Distance between 2 haplotypes (number of mutations)
Haplophyle
SNiPlay : Comparison of SNPs between groups
Individual, group
Ind1, Table
Ind2, Table
Ind3, Table
Ind4, East
Ind5, East
Ind6, East
Ind7, East
Ind8, West
External file (optional)
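The optional external file shown above lists one individual per line with its group, after a header line. A minimal sketch of reading such a file into a group → individuals mapping is given below; the function name `read_groups` is hypothetical and the parsing rules (comma separator, one header line) are read off the slide, not taken from the SNiPlay documentation.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO


def read_groups(handle: TextIO) -> Dict[str, List[str]]:
    """Map each group name to the list of its individuals."""
    groups: Dict[str, List[str]] = defaultdict(list)
    reader = csv.reader(handle, skipinitialspace=True)
    next(reader, None)  # skip the header line
    for row in reader:
        if len(row) >= 2:  # ignore blank or incomplete lines
            groups[row[1].strip()].append(row[0].strip())
    return dict(groups)


# Content of the slide, used as in-memory test data.
group_file = io.StringIO(
    "Individual, group\n"
    "Ind1, Table\nInd2, Table\nInd3, Table\n"
    "Ind4, East\nInd5, East\nInd6, East\nInd7, East\n"
    "Ind8, West\n"
)
groups = read_groups(group_file)
for group, individuals in groups.items():
    print(group, individuals)
```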
SNiPlay: Population structure analysis
Admixture
• Test different values of K (estimates of probability that samples are structured in K populations)
• For the best value of K, the application shows the Q estimates for each individual (the estimated proportion of the individual's ancestry assigned to each population)
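As a small illustration of how Q estimates can be read, the sketch below parses a table with one row per individual and one column per population, then reports the population with the largest Q value for each individual. The whitespace-separated layout is assumed to match Admixture's Q output files; the function names are hypothetical and the numbers are invented for K = 3.

```python
from typing import Dict, List, Tuple


def parse_q(text: str) -> List[List[float]]:
    """Parse a whitespace-separated Q table: one row per individual."""
    return [[float(value) for value in line.split()]
            for line in text.splitlines() if line.strip()]


def main_population(names: List[str],
                    q_rows: List[List[float]]) -> Dict[str, Tuple[int, float]]:
    """For each individual, return (population number, Q) of the largest Q."""
    result = {}
    for name, row in zip(names, q_rows):
        best = max(range(len(row)), key=row.__getitem__)
        result[name] = (best + 1, row[best])  # populations numbered from 1
    return result


# Invented Q values for three individuals and K = 3.
q_text = """0.90 0.05 0.05
0.10 0.80 0.10
0.30 0.30 0.40
"""
assignment = main_population(["Ind1", "Ind2", "Ind3"], parse_q(q_text))
for name, (population, q_value) in assignment.items():
    print(name, population, q_value)
```

An individual such as the third one, with no dominant Q value, is better described as admixed than assigned to a single population.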
SNiPlay : GWAS analysis
Tassel, MLMM…
• Genome Wide Association Studies (GWAS)
• Estimate the association between SNPs and a phenotypic trait
• Displays Manhattan plots: GWAS test results (-log10 of the p-value) plotted along the chromosomes
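The quantity plotted on the vertical axis of a Manhattan plot is simply -log10 of each SNP's p-value, so small p-values appear as high points. A minimal sketch of that transformation follows; the function name is hypothetical and the three association results are invented for the example.

```python
import math
from typing import List, Tuple

Association = Tuple[str, int, float]  # (chromosome, position, p-value)


def manhattan_points(results: List[Association]) -> List[Tuple[str, int, float]]:
    """Return (chromosome, position, -log10(p)) for every SNP with p > 0."""
    return [(chrom, pos, -math.log10(p)) for chrom, pos, p in results if p > 0]


# Invented results for three SNPs.
results = [("chr1", 867, 0.5), ("chr1", 1998, 1e-8), ("chr2", 2341, 0.01)]
points = manhattan_points(results)
for chrom, pos, height in points:
    print(chrom, pos, round(height, 2))

# The strongest association is the highest point of the plot.
top = max(points, key=lambda point: point[2])
```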
SNP analysis in an allopolyploid
SNiPloid
• Compares SNPs observed in the allotetraploid with those observed between the parental genomes
• Categorizes SNPs into different evolutionary scenarios
• Attempts to assign alleles to subgenomes
• Estimates the subgenomic contribution to the transcriptome
SNPs in the Banana Genome Hub
Objectives of the exercises
• Become familiar with and use the available packages/tools for SNP and InDel detection from NGS data (assembly of NGS data)
• Consider the difficulties encountered when analysing next-generation sequencing data (distinguishing sequencing errors, paralogs and allelic variation)
• Detect SNPs and assign genotypes at every polymorphic position
• Easily exploit polymorphism data via a Web-based application (genetic diversity, LD)
• Obtain an exploitable dataset to send for the design of a high-throughput SNP chip (Illumina VeraCode technology)
Short reads (Solexa)
Mapping (SAM)
Exploitation of polymorphism data
Design of an Illumina SNP chip
Assignment of genotypes
Ind1 ATTGTGTCGTAACGTATGTCATGTCGT
Ind2 ATTGTGTCGGAACGTATGTCATGTCGT
Ind3 ATTGTGTCGKAACGTATGTCATGTCGT
Allelic variations
List of SNPs
867 A/G
1998 T/C
2341 T/G
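The three aligned sequences on this slide differ at a single position, where Ind3 carries the IUPAC ambiguity code K (G or T), that is, a heterozygous genotype. The sketch below scans aligned sequences of equal length, expands IUPAC codes into two alleles and lists the polymorphic positions with each individual's genotype. The function name is hypothetical, and only unambiguous bases and two-base IUPAC codes are handled (no gaps or N).

```python
from typing import Dict, List, Tuple

# Standard IUPAC codes for one or two bases, written as diploid genotypes.
IUPAC = {
    "A": "AA", "C": "CC", "G": "GG", "T": "TT",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}

Snp = Tuple[int, str, Dict[str, str]]  # (1-based position, alleles, genotypes)


def find_snps(sequences: Dict[str, str]) -> List[Snp]:
    """List polymorphic positions in a set of aligned genotype sequences."""
    lengths = {len(sequence) for sequence in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned and of equal length")
    snps = []
    for index in range(lengths.pop()):
        genotypes = {name: IUPAC[sequence[index]]
                     for name, sequence in sequences.items()}
        alleles = sorted(set("".join(genotypes.values())))
        if len(alleles) > 1:  # more than one allele observed: polymorphic
            snps.append((index + 1, "/".join(alleles), genotypes))
    return snps


sequences = {
    "Ind1": "ATTGTGTCGTAACGTATGTCATGTCGT",
    "Ind2": "ATTGTGTCGGAACGTATGTCATGTCGT",
    "Ind3": "ATTGTGTCGKAACGTATGTCATGTCGT",
}
snps = find_snps(sequences)
for position, alleles, genotypes in snps:
    print(position, alleles, genotypes)
```

Positions here are counted within the short example sequences, so they do not correspond to the positions 867, 1998 and 2341 of the SNP list on the slide.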