Detection and analysis of SNP polymorphisms Alexis DEREEPER 21 November 2013 Training course in Bioinformatics, Montpellier, 18-22 November 2013

Detection and analysis of SNP polymorphisms

Alexis DEREEPER

21 November 2013 Training course in Bioinformatics, Montpellier, 18-22 November 2013

What is a SNP?

• A SNP (single nucleotide polymorphism) is a single-base change in a DNA sequence that occurs in a significant proportion of a large population (by convention, more than 1%)

• SNPs in coding regions may (or may not) alter the protein structure and function

Why study SNPs?

• Population genetics: population stratification, linkage disequilibrium…

• Define markers for genetic maps

• Analysis of genome structure, genome evolution

• Genome-wide association studies: SNPs can be used to estimate predisposition to disease and to predict specific genetic traits

• Functional analysis: alteration of protein structure and function

Re-sequencing projects: a deluge of SNPs

Using NGS technologies: resequencing → mapping → SNPs

SNPs from RNASeq: example of Arcad project

Strategy: comparative population genomics with transcriptomics data

Workflow diagram (boxes, in reading order):

• Available reference (genome/transcriptome)? Yes / No
• 454 sequencing, de novo reference assembly
• Solexa sequencing, mapping on the reference
• Ortholog/paralog assignment
• Polymorphism database in adapted format: redundancy, open reading frame, CDS/UTR
• Diversity study: comparative domestication, life history trait impact, functional evolution
• Crop breeding SNP database: functional annotation, selection footprint

Strategy for SNP discovery from NGS

Workflow diagram (tools from GATK and PicardTools):

• For each individual (ind1, ind2, ind3, ind4, …): Fastq → FastQ Groomer → Mapping BWA → Add or Replace Groups → BAM with read group
• mergeSam → Global BAM with read group
• Global BAM with read group → IndelRealigner → UnifiedGenotyper → VCF file
• Global BAM with read group → DepthOfCoverage → Depth file

FASTQ Format

• Standard format of short reads from NGS data

FASTQ file → TEXT file

STRUCTURE:

@HWUSI-EAS454_0006:1:112:14105:5498#CTTGTA

CGCCAAGAAGTGTAGCAAAACGGCAGAGCTCGTGGATTAAACAAACAGAGGATTTCGGTGAGGATTGAGGGGGAGT

+

cfffcfeffdeefefffcffffffffcffeffffdffffafcfffffdffffdfefeddf^eececfffdfcbffb

@HWUSI-EAS454_0006:1:37:16314:3410#CTTGTA

AGTGTAGCAAAACGGCAGAGCTCGTGGATTAAACAAACAGAGGATTTCGGTGAGGATTGAGGGGGAGTGGTGGCCG

+

`bTbbccccceeeeeceeeecccYeedded`ceec]dddde^a`deeeec\`dddcbaadadYd`]]Jc_^bc^^\
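A minimal sketch of how this four-line structure can be read, assuming one record is always exactly four non-empty lines (identifier starting with "@", sequence, "+" separator, quality string). The names `FastqRecord` and `parse_fastq` are illustrative only and do not come from any tool used in the course.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str      # read identifier, without the leading "@"
    sequence: str  # base calls
    quality: str   # one quality character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four non-empty lines."""
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at non-empty line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)


# First read shown on the slide.
example = """\
@HWUSI-EAS454_0006:1:112:14105:5498#CTTGTA
CGCCAAGAAGTGTAGCAAAACGGCAGAGCTCGTGGATTAAACAAACAGAGGATTTCGGTGAGGATTGAGGGGGAGT
+
cfffcfeffdeefefffcffffffffcffeffffdffffafcfffffdffffdfefeddf^eececfffdfcbffb
"""

records = list(parse_fastq(example.splitlines()))
print(records[0].name, len(records[0].sequence))
```

Real FASTQ files are usually large, so a streaming reader (or an existing library) would be preferable to loading every line as done here.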

FASTQ Format

f → Quality = 38 (ASCII code 102 – offset 64, i.e. the Illumina Phred+64 encoding)

@HWUSI-EAS454_0006:1:112:14105:5498#CTTGTA

CGCCAAGAAGTGTAGCAAAACGGCAGAGCTCGTGGATTAAACAAACAGAGGATTTCGGTGAGGATTGAGGGGGAGT

+

cfffcfeffdeefefffcffffffffcffeffffdffffafcfffffdffffdfefeddf^eececfffdfcbffb
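The calculation above can be sketched for a whole quality string. This assumes the Phred+64 encoding implied by the "102 – 64" example; the function names are illustrative. The second helper shows the kind of re-encoding to the Sanger (Phred+33) convention that a grooming step performs, as a simplified illustration rather than the actual FastQ Groomer code.

```python
from typing import List


def phred_scores(quality: str, offset: int = 64) -> List[int]:
    """Decode each quality character into its Phred score."""
    return [ord(ch) - offset for ch in quality]


def to_sanger(quality: str, source_offset: int = 64) -> str:
    """Re-encode a quality string with the Sanger offset of 33."""
    return "".join(chr(ord(ch) - source_offset + 33) for ch in quality)


print(phred_scores("f"))      # [38], as on the slide
print(phred_scores("cf^"))    # [35, 38, 30]
print(to_sanger("f"))         # G
```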

SAM Format

• Standard format for mapping of NGS data

• Sequence Alignment/Map (SAM) is the text format; Binary Alignment/Map (BAM) is its compressed binary equivalent
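As an illustration of the text form, each SAM alignment line has eleven mandatory tab-separated fields followed by optional tags. The sketch below splits one line into those fields; the alignment line itself is a made-up example, not data from the course, and `parse_sam_line` is an illustrative name.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line: str) -> dict:
    """Split one alignment line into its mandatory fields and optional tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    record["tags"] = cols[11:]
    # Two commonly inspected bits of the FLAG field.
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    record["is_reverse"] = bool(record["flag"] & 0x10)
    return record


# Hypothetical read mapped on the reverse strand, with a read-group tag.
line = "read1\t16\tchr1\t100\t37\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tRG:Z:ind1"
aln = parse_sam_line(line)
print(aln["rname"], aln["pos"], aln["is_reverse"], aln["tags"])
```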

Visualization of SNPs in mapping alignment

Tablet

• Graphical viewer for assembly of NGS data

• Accepts different formats: ACE, SAM, BAM

SNP discovery from NGS data

GATK (Genome Analysis Toolkit)

• Package for analysis of NGS data.

• Developed for the analysis of Human medical resequencing projects (1000 Genomes, The Cancer Genome Atlas)

• Includes tools for depth analysis, quality score recalibration, SNP/InDel discovery

• Complementary to two other packages: SamTools and PicardTools

PREPROCESS:

• Index the human genome (Picard); we used HG18 from UCSC
• Convert Illumina reads to FASTQ format
• Convert Illumina 1.6 read quality scores to standard Sanger scores

FOR EACH SAMPLE:

1. Align samples to the genome (BWA); this generates SAI files
2. Convert SAI to SAM (BWA)
3. Convert SAM to BAM binary format (SAM Tools)
4. Sort BAM (SAM Tools)
5. Index BAM (SAM Tools)
6. Identify target regions for realignment (Genome Analysis Toolkit)
7. Realign BAM to get better Indel calling (Genome Analysis Toolkit)
8. Reindex the realigned BAM (SAM Tools)
9. Call Indels (Genome Analysis Toolkit)
10. Call SNPs (Genome Analysis Toolkit)
11. View aligned reads in BAM/BAI (Integrative Genomics Viewer)
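The per-sample steps can be sketched as a list of shell commands built in Python. This is only an outline under several assumptions: GATK 2/3-style walker names run through `GenomeAnalysisTK.jar`, samtools 1.x syntax for sorting, single-end reads, an already indexed reference, and file names that are purely hypothetical. Steps 9 and 10 are collapsed here into one UnifiedGenotyper call with `-glm BOTH`; exact options depend on the installed versions, and read groups would still need to be added before running GATK.

```python
from typing import List


def build_commands(sample: str, fastq: str, reference: str,
                   gatk_jar: str = "GenomeAnalysisTK.jar") -> List[str]:
    """Return the shell commands for one sample, in execution order."""
    sai, sam, bam = f"{sample}.sai", f"{sample}.sam", f"{sample}.bam"
    sorted_bam = f"{sample}.sorted.bam"
    intervals = f"{sample}.intervals"
    realigned = f"{sample}.realigned.bam"
    vcf = f"{sample}.vcf"
    gatk = f"java -jar {gatk_jar}"
    return [
        f"bwa aln {reference} {fastq} > {sai}",                    # step 1
        f"bwa samse {reference} {sai} {fastq} > {sam}",            # step 2
        f"samtools view -bS {sam} > {bam}",                        # step 3
        f"samtools sort -o {sorted_bam} {bam}",                    # step 4
        f"samtools index {sorted_bam}",                            # step 5
        f"{gatk} -T RealignerTargetCreator -R {reference} "
        f"-I {sorted_bam} -o {intervals}",                         # step 6
        f"{gatk} -T IndelRealigner -R {reference} -I {sorted_bam} "
        f"-targetIntervals {intervals} -o {realigned}",            # step 7
        f"samtools index {realigned}",                             # step 8
        f"{gatk} -T UnifiedGenotyper -R {reference} "
        f"-I {realigned} -glm BOTH -o {vcf}",                      # steps 9-10
    ]


commands = build_commands("ind1", "ind1.fastq", "reference.fasta")
for command in commands:
    print(command)
```

The function only prints the commands; running them (for example with `subprocess.run(command, shell=True, check=True)`) is left out so that each step can be inspected first.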

VCF Format (Variant Call Format)

##fileformat=VCFv4.0

##fileDate=20090805

##source=myImputationProgramV3.1

##reference=1000GenomesPilot-NCBI36

##phasing=partial

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">

##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">

##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">

##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">

##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">

##FILTER=<ID=q10,Description="Quality below 10">

##FILTER=<ID=s50,Description="Less than 50% of samples have data">

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">

##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002

20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51

20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3

Advantages: describes the variation at each position and the genotype assigned to each sample
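To make the layout of a data line concrete, here is a minimal sketch that splits the first record above into its fixed columns, its INFO key/value pairs and its per-sample fields. Real VCF files are tab-separated (the slide shows spaces), so the example line is rebuilt with tabs; `parse_vcf_record` is an illustrative name, not part of GATK or SNiPlay.

```python
from typing import List


def parse_vcf_record(line: str, sample_names: List[str]) -> dict:
    """Parse one VCF data line into fixed columns, INFO and sample fields."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = cols[:9]
    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True   # flags such as DB carry no value
    keys = fmt.split(":")
    samples = {name: dict(zip(keys, col.split(":")))
               for name, col in zip(sample_names, cols[9:])}
    return {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
            "alt": alt.split(","), "qual": qual, "filter": flt,
            "info": info_dict, "samples": samples}


line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51",
])
record = parse_vcf_record(line, ["NA00001", "NA00002"])
print(record["pos"], record["info"]["DP"], record["samples"]["NA00002"]["GT"])
```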

GATK (other functionalities)

• DepthOfCoverage module: reports the sequencing depth of coverage for each gene, each position and each individual

• ReadBackedPhasing module: determines, where possible, which alleles are associated on the same chromosome (phase or haplotype) at heterozygous positions…

And not AGG GGA
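In a VCF file, phasing shows up in the GT field: alleles separated by "|" are phased, alleles separated by "/" are not. A small sketch of how such a genotype string can be decoded (the function name is illustrative):

```python
from typing import List, Optional, Tuple


def decode_genotype(gt: str, ref: str,
                    alts: List[str]) -> Tuple[List[Optional[str]], bool]:
    """Return the called alleles and whether the genotype is phased."""
    phased = "|" in gt
    separator = "|" if phased else "/"
    alleles = [ref] + list(alts)          # index 0 is REF, 1.. are ALT alleles
    called = [None if idx == "." else alleles[int(idx)]
              for idx in gt.split(separator)]
    return called, phased


print(decode_genotype("1|0", "G", ["A"]))   # (['A', 'G'], True)
print(decode_genotype("0/1", "T", ["A"]))   # (['T', 'A'], False)
```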

GATK (other functionalities)

Workflow diagram:

• For each individual (ind1, ind2, ind3, ind4, …): Fastq → FastQ Groomer → Mapping BWA → Add or Replace Groups → BAM with read group
• mergeSam → Global BAM with read group
• Global BAM with read group → IndelRealigner → UnifiedGenotyper → VCF file
• Global BAM with read group → DepthOfCoverage → Depth file
• ReadBackedPhasing → Phased VCF
• VariantFiltration → Filtered VCF

The SNiPlay project

SNiPlay: Web-based application for polymorphism analysis

http://sniplay.cirad.fr

Workflow diagram:

• For each sample (RC1, RC2, RC3, RC4, …): Fastq → FastQ Groomer → Mapping BWA → Add or Replace Groups → BAM with read group
• mergeSam → Global BAM with read group
• Global BAM with read group → IndelRealigner → UnifiedGenotyper → VCF file
• Global BAM with read group → DepthOfCoverage → Depth file

SNiPlay: parameters and options

Select the VCF format

Load the VCF file, the reference and the depth file

Indicate groups of individuals to make SNP comparison

Filter SNPs by minimum depth of coverage

Select the banana genome

Check the steps to be performed
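The depth filter can be illustrated with a short sketch. SNiPlay works from the depth file produced by DepthOfCoverage; as a simplifying assumption, the code below instead reads the total depth from the `DP` key of the VCF INFO column. Function names and the threshold are hypothetical.

```python
from typing import Iterable, List, Optional


def info_depth(line: str) -> Optional[int]:
    """Return the DP value of the INFO column, or None if it is absent."""
    info = line.rstrip("\n").split("\t")[7]
    for item in info.split(";"):
        if item.startswith("DP="):
            return int(item[3:])
    return None


def filter_by_depth(lines: Iterable[str], min_depth: int) -> List[str]:
    """Keep header lines and the records whose depth reaches min_depth."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        depth = info_depth(line)
        if depth is not None and depth >= min_depth:
            kept.append(line)
    return kept


vcf_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2",
    "20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017",
]
filtered = filter_by_depth(vcf_lines, min_depth=12)
print(len(filtered))   # header + one record
```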

SNiPlay: SNP statistics

SNiPlay and Illumina genotyping chip

Cartesian coordinates

Genotyping file

Submission file for Illumina

Analysis with the BeadStudio software

Design of Illumina chip

SNiPlay: Allelic files

Provides various formats of allelic files:

• PED format

cARB 1 0 0 1 0 1 1 1 1 3 3 3 3 4 4 2 2 2 2 1 1 4 4 4 4
cSYR 2 0 0 1 0 1 1 1 1 3 3 1 3 4 4 2 2 2 2 1 1 4 4 2 4
cARA 3 0 0 1 0 1 1 1 1 3 3 3 3 4 4 2 2 2 2 1 1 4 4 4 4

• DARwin format

@DARwin 5.0 - ALLELIC - 2
33 20
N° 50 50 122 122 218 218 245 245 261 261 290 290 356
1 1 1 1 1 3 3 3 3 4 4 2 2 2
2 1 1 1 1 3 3 1 3 4 4 2 2 2
3 1 1 1 1 3 3 3 3 4 4 2 2 2
4 1 1 1 1 3 3 3 3 4 4 2 2 2

• .inp format for Phase

33
10
P 49 121 217 244 260 289
SSSSSSSSSS
#cARB
A A G G T C C A T T
A A G G T C C A T T
#cSYR
A A G A T C C A T C
A A G G T C C A T T

• Format for TASSEL (association studies)

33 10:2
50 122 218 245 261 290 356 461 467 560
cARB A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cSYR A:A A:A G:G A:G T:T C:C C:C A:A T:T C:T
cARA A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cORL A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cLAR A:G A:G A:G A:G C:T C:C C:C A:A T:T C:T
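Comparing the excerpts, the numeric genotypes appear to use the coding A=1, C=2, G=3, T=4. Under that assumption, and reading the six leading columns as the usual PED columns (family, individual, father, mother, sex, phenotype), a row of "A:G"-style genotypes can be turned into a PED line as sketched below. The function is illustrative and is not SNiPlay's own converter.

```python
from typing import List

# Assumed allele coding, inferred from the excerpts on the slide.
ALLELE_CODES = {"A": "1", "C": "2", "G": "3", "T": "4"}


def genotypes_to_ped(name: str, index: int, genotypes: List[str],
                     sex: str = "1", phenotype: str = "0") -> str:
    """Build one PED line from colon-separated diploid genotypes."""
    fields = [name, str(index), "0", "0", sex, phenotype]
    for genotype in genotypes:
        first, second = genotype.split(":")
        # Unknown symbols are written as 0 (missing).
        fields.append(ALLELE_CODES.get(first, "0"))
        fields.append(ALLELE_CODES.get(second, "0"))
    return " ".join(fields)


row = "A:A A:A G:G A:G T:T C:C C:C A:A T:T C:T".split()
ped_line = genotypes_to_ped("cSYR", 2, row)
print(ped_line)
```

For the cSYR row this reproduces the PED line shown on the slide.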

SNiPlay: Annotation of SNPs

1) Locate SNPs on a genome

• using Blast

• or using a GFF file if the reference corresponds to genes/CDS

2) Annotate SNPs with the SnpEff program
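The GFF route in step 1 amounts to an interval lookup: a SNP belongs to every feature whose start and end (1-based, inclusive in GFF) enclose its position. A minimal sketch, using made-up GFF lines and an illustrative function name; it does not reproduce what SnpEff does in step 2.

```python
from typing import Iterable, List, Tuple


def features_at(gff_lines: Iterable[str], seqid: str,
                position: int) -> List[Tuple[str, str]]:
    """Return (feature type, attributes) for features covering the position."""
    hits = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        start, end = int(cols[3]), int(cols[4])
        if cols[0] == seqid and start <= position <= end:
            hits.append((cols[2], cols[8]))
    return hits


# Hypothetical annotation for one gene with two CDS segments.
gff = [
    "##gff-version 3",
    "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\texample\tCDS\t100\t300\t.\t+\t0\tParent=gene1",
    "chr1\texample\tCDS\t600\t900\t.\t+\t0\tParent=gene1",
]
print(features_at(gff, "chr1", 250))   # gene and first CDS
print(features_at(gff, "chr1", 450))   # gene only (intron)
```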

SNiPlay: Diversity analysis

SeqLib library

SNiPlay: Haplotype network

High-frequency haplotypes

Low-frequency haplotype

Group distribution within this haplotype

Distance between 2 haplotypes (number of mutations)

Haplophyle

SNiPlay: Comparison of SNPs between groups

Individual, group

Ind1, Table

Ind2, Table

Ind3, Table

Ind4, East

Ind5, East

Ind6, East

Ind7, East

Ind8, West

External file (optional)
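A file in this two-column layout can be read into one list of individuals per group, which is the starting point for any per-group comparison. The sketch below assumes a comma-separated file with one header line; `read_groups` is an illustrative name and not part of SNiPlay.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


def read_groups(text: str) -> Dict[str, List[str]]:
    """Map each group name to the individuals assigned to it."""
    groups = defaultdict(list)
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)                      # skip the "Individual, group" header
    for row in reader:
        if len(row) >= 2:             # ignore blank or incomplete lines
            groups[row[1].strip()].append(row[0].strip())
    return dict(groups)


group_file = """\
Individual, group
Ind1, Table
Ind2, Table
Ind3, Table
Ind4, East
Ind5, East
Ind6, East
Ind7, East
Ind8, West
"""
groups = read_groups(group_file)
print(groups)
```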

SNiPlay: Population structure analysis

Admixture

• Test different values of K (estimates of probability that samples are structured in K populations)

• For the best value of K, the application shows Q estimates for each individual (probability that the individual belongs to each population)
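One simple way to read the Q estimates is to assign each individual to the population with the largest Q value, optionally leaving it unassigned when no value reaches a chosen cutoff. The sketch below uses made-up Q values for K = 3; the function name and the 0.8 cutoff are assumptions for illustration, not SNiPlay or Admixture defaults.

```python
from typing import Dict, List, Optional


def assign_populations(q_values: Dict[str, List[float]],
                       cutoff: float = 0.0) -> Dict[str, Optional[int]]:
    """Assign each individual to its highest-Q population (1-based)."""
    assignment = {}
    for individual, q in q_values.items():
        best = max(range(len(q)), key=lambda k: q[k])
        assignment[individual] = best + 1 if q[best] >= cutoff else None
    return assignment


# Hypothetical Q estimates for three individuals and K = 3.
q_values = {
    "Ind1": [0.90, 0.05, 0.05],
    "Ind4": [0.10, 0.85, 0.05],
    "Ind8": [0.40, 0.35, 0.25],   # admixed: no population dominates
}
print(assign_populations(q_values))              # every individual assigned
print(assign_populations(q_values, cutoff=0.8))  # Ind8 left unassigned
```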

SNiPlay: GWAS analysis

Tassel, MLMM…

• Genome Wide Association Studies (GWAS)

• Estimate the association between SNPs and a phenotypic trait

• Display Manhattan plots: GWAS statistical tests (-log10 of the p-value) plotted along the chromosomes
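The y-axis of a Manhattan plot is just a transformation of each p-value. The sketch below computes it for a few made-up association results; the Bonferroni line is a common convention added here as an assumption and is not stated in the course material.

```python
import math
from typing import List, Tuple


def manhattan_points(results: List[Tuple[str, int, float]]
                     ) -> List[Tuple[str, int, float]]:
    """Convert (chromosome, position, p-value) into plotting coordinates."""
    return [(chrom, pos, -math.log10(p)) for chrom, pos, p in results]


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Height of the significance line for n_tests independent tests."""
    return -math.log10(alpha / n_tests)


# Hypothetical GWAS results: chromosome, position, p-value.
results = [("chr1", 867, 0.2), ("chr1", 1998, 1e-8), ("chr2", 2341, 0.001)]
points = manhattan_points(results)
threshold = bonferroni_threshold(n_tests=1000)
significant = [(c, pos) for c, pos, y in points if y > threshold]
print(points)
print(round(threshold, 3), significant)
```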

SNP analysis in an allopolyploid

SNiPloid

• Compares SNPs observed in the allotetraploid to those observed between the parental genomes

• Categorizes SNPs into different evolutionary scenarios

• Attempts to assign alleles to subgenomes

• Estimates the subgenomic contribution to the transcriptome

SNPs in the Banana Genome Hub

Objectives of the exercises

• Get to know and use the available packages/tools for SNP and INDEL detection from NGS data (assembly of NGS data)

• Think about the difficulties encountered when analysing next-generation sequencing data (distinguishing sequencing errors, paralogs and allelic variation)

• Detect SNPs and assign genotypes to every polymorphic position

• Easily exploit polymorphism data via a Web-based application (genetic diversity, LD)

• Obtain an exploitable dataset to send for the design of a high-throughput SNP chip (Illumina VeraCode technology)

Short reads Solexa

Mapping SAM

Exploitation of polymorphism data

Design of an Illumina SNP chip

Genotype assignment

Ind1 ATTGTGTCGTAACGTATGTCATGTCGT
Ind2 ATTGTGTCGGAACGTATGTCATGTCGT
Ind3 ATTGTGTCGKAACGTATGTCATGTCGT

Allelic variations

List of SNPs

867 A/G 1998 T/C 2341 T/G
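The three sequences above differ at a single position, where Ind3 carries the IUPAC code K (G or T), i.e. a heterozygous genotype. A minimal sketch of how a list of SNPs can be derived from such aligned genotype sequences, assuming equal-length sequences and only the two-base IUPAC ambiguity codes (function name illustrative):

```python
from typing import Dict, List, Tuple

# Single bases and two-base IUPAC ambiguity codes.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}


def list_snps(sequences: Dict[str, str]) -> List[Tuple[int, str]]:
    """Return (1-based position, alleles) for every polymorphic column."""
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned and of equal length")
    snps = []
    for i in range(lengths.pop()):
        alleles = set()
        for seq in sequences.values():
            alleles |= IUPAC.get(seq[i].upper(), set())  # unknown symbols ignored
        if len(alleles) > 1:
            snps.append((i + 1, "/".join(sorted(alleles))))
    return snps


sequences = {
    "Ind1": "ATTGTGTCGTAACGTATGTCATGTCGT",
    "Ind2": "ATTGTGTCGGAACGTATGTCATGTCGT",
    "Ind3": "ATTGTGTCGKAACGTATGTCATGTCGT",
}
snps = list_snps(sequences)
print(snps)   # [(10, 'G/T')]
```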