Reference genome based sequence variation detection Computational Biology Service Unit (CBSU) Cornell Center for Comparative and Population Genomics (3CPG) Center for Vertebrate Genomics (CVG) CBSU/3CPG/CVG Joint Workshop Series

Source: cbsu.tc.cornell.edu/lab/doc/variation_workshop_2011_v3.pdf · 2014. 10. 31.

Reference genome based sequence variation detection

Computational Biology Service Unit (CBSU)

Cornell Center for Comparative and Population Genomics (3CPG)

Center for Vertebrate Genomics (CVG)

CBSU/3CPG/CVG Joint Workshop Series 

Two different data analysis strategies

• Assembly
• Alignment

De novo Assembly

ACGGTACCTAAACCGG
TACCTAAACCGGA
ACGAGCAACACGGTACCTA
TACCTAAACCGGACCCGGAAAGAC
ACGGTAGCTAAACCGG
TAGCTAAACCGGA
ACGAGCAACACGGTAGCTA
TAGCTAAACCGGACCCGGAAAGAC

......ACGAGCAACACGGTACCTAAACCGGACCCGGAAAGAC.....
......ACGAGCAACACGGTAGCTAAACCGGACCCGGAAAGAC.....

Reference Alignment

ACGGTACCTAAACCGG
TACCTAAACCGGA
ACGAGCAACACGGTACCTA
TACCTAAACCGGACCCGGAAAGAC
ACGGTAGCTAAACCGG
TAGCTAAACCGGA
ACGAGCAACACGGTAGCTA
TAGCTAAACCGGACCCGGAAAGAC

Reference Genome: C
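The idea of the reference alignment slide can be sketched in a few lines of Python: each read is placed on the reference where it has the fewest mismatches, and positions where the aligned reads disagree with the reference are reported. This is a toy illustration only. The reference sequence is an assumption (the first contig from the De novo Assembly slide), and the exhaustive mismatch search is a stand-in for a real aligner such as BWA, which uses an index rather than a brute-force scan.

```python
from collections import Counter, defaultdict

# Assumption: the first contig from the De novo Assembly slide is the reference.
REFERENCE = "ACGAGCAACACGGTACCTAAACCGGACCCGGAAAGAC"

# The eight reads shown on the slide.
READS = [
    "ACGGTACCTAAACCGG", "TACCTAAACCGGA",
    "ACGAGCAACACGGTACCTA", "TACCTAAACCGGACCCGGAAAGAC",
    "ACGGTAGCTAAACCGG", "TAGCTAAACCGGA",
    "ACGAGCAACACGGTAGCTA", "TAGCTAAACCGGACCCGGAAAGAC",
]


def best_offset(read, ref, max_mismatches=1):
    """Return (offset, mismatches) of the best ungapped placement, or None."""
    best = None
    for offset in range(len(ref) - len(read) + 1):
        mismatches = sum(1 for a, b in zip(read, ref[offset:]) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (offset, mismatches)
    return best


def pileup(reads, ref):
    """Count the bases observed at each reference position (0-based)."""
    columns = defaultdict(Counter)
    for read in reads:
        hit = best_offset(read, ref)
        if hit is None:
            continue  # read could not be placed; skip it
        offset, _ = hit
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return columns


def call_variants(reads, ref):
    """Report 1-based positions where any aligned base differs from the reference."""
    variants = []
    for pos, counts in sorted(pileup(reads, ref).items()):
        if any(base != ref[pos] for base in counts):
            variants.append((pos + 1, ref[pos], dict(counts)))
    return variants


if __name__ == "__main__":
    for position, ref_base, counts in call_variants(READS, REFERENCE):
        print(position, ref_base, counts)
```

With the reads from the slide, half of the reads covering the variable site carry C and half carry G, which is the pattern expected for a heterozygous site.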

Chr    Position   Ref  Coverage Depth   Genotypes               Gene
chr1   24515167   C       5   11    3   T()     C()     T()
chr1   45396856   G      13    7    9   C()     G()     C()
chr1   68417006   G      43   18    6   A()     G()     A()
chr1   90162621   A      15   99  255   M(AC)   A()     A()
chr1   90162696   G      17  134  255   G()     R(GA)   G()
chr1   90162750   C      19  108  176   Y(CT)   Y(CT)   C()
chr1   90162816   G      30   72  106   G()     K(GT)   K(GT)
chr1   90162975   G     162   48  255   G()     R(GA)   G()
chr1   90163027   C     100    6  255   C()     Y(CT)   Y(CT)
chr1   90163136   A     152   17  176   A()     R(AG)   R(AG)
chr1   90163167   C     132   25  218   C()     M(CA)   M(CA)
chr1   90163191   T      91   19  227   T()     Y(TC)   Y(TC)
chr1   90164490   A     173   16  103   A()     M(AC)   M(AC)
chr1   90164557   A     100   66  137   A()     R(AG)   A()
chr1   90164612   A      62   48  107   A()     R(AG)   R(AG)
chr1   90164677   A      88   37   64   R(AG)   A()     R(AG)
chr1   90165817   T      88   35   56   Y(TC)   Y(TC)   T()
…      …          …       …    …    …   …       …       …
…      …          …       …    …    …   …       …       …
chr17  72952985   C      23   26   31   T()     Y(TC)   T()
chr18  7355152    G      23   34    3   A()     G()     A()
chr18  7355177    A      16   29    3   C()     A()     C()
chr18  25274226   T      28   35   22   C()     Y(CT)   C()
chr18  34475963   A      25   12   25   G(KT)   R(GA)   G()
chr18  38133671   G      69   63   21   C(SG)   G()     G()
chr18  65363507   G      14   29    3   T(KG)   G()     T()
chr18  65363509   T      18   31    3   G(KT)   T()     G()
chr18  71606111   C       9   32    5   A()     C()     A()
chr19  46381078   A       8   12    6   G(RA)   A()     G()

With a limited number of individuals, whole genome/exome sequencing does not always reveal the causative mutations.

Sequence a mapping population

Reference genome based sequence variation detection

FASTQ files
  -> Step 1: Alignment
SAM/BAM files
  -> Step 2: Call SNP/INDELs
VCF file

Reference genome based sequence variation detection

Step 3: Filter SNP/INDELs

Step 4: Annotate SNP/INDELs

Reference genome based sequence variation detection

Step 1: Alignment

• BWA: Li H. and Durbin R. (2009) Bioinformatics, 25:1754-60

Step 2: Call SNP/INDELs

• SAMtools: Li H. et al. Bioinformatics, 25, 2078-9

or

• GATK + Picard: Broad Institute

Reference genome based sequence variation detection

Step 3: Filtering

• GATK
• Write your own code

Step 4: Annotation

• Annovar: http://www.openbioinformatics.org/annovar/

Standard file formats

• FASTQ
• SAM/BAM
• VCF

FASTQ file:

@20F75AAXX:5:1:335:1565
ACCTTGTTGAGAAACAGGAGGTGTTGTTCTTCAAAG
+20F75AAXX:5:1:335:1565
]]]]][]][][[][]Z[[[][[[[][[[[][[[[[R
@20F75AAXX:5:1:466:1056
GGAAGCAACAGCTAATACATGAATGGATATCGATCG
+20F75AAXX:5:1:466:1056
[]]]]][]]]Y]]]][Y[[[[[[[[[[Y[Y[YW[[[
@20F75AAXX:5:1:256:1724
GCCCAACAAAGACCGGTCACCAAAGACAGATGATTC
+20F75AAXX:5:1:256:1724
]][]][]][[[[]L[[[[][[[Z[[[[[S[[ZW[[[
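Each FASTQ record spans four lines: an `@` header, the bases, a `+` separator, and one quality character per base. The sketch below reads such records and converts the quality characters to integer scores. The offset of 64 is an assumption based on the characters in the example (older Illumina pipelines used it; current Sanger-style FASTQ uses 33), and the function names are illustrative rather than taken from any library.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality_string) from 4-line FASTQ records."""
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], sequence, quality


def decode_quality(quality, offset=64):
    """Convert quality characters to integers (offset 64 is an assumption)."""
    return [ord(char) - offset for char in quality]


EXAMPLE = """\
@20F75AAXX:5:1:335:1565
ACCTTGTTGAGAAACAGGAGGTGTTGTTCTTCAAAG
+20F75AAXX:5:1:335:1565
]]]]][]][][[][]Z[[[][[[[][[[[][[[[[R
"""

if __name__ == "__main__":
    for name, sequence, quality in parse_fastq(EXAMPLE.splitlines()):
        scores = decode_quality(quality)
        print(name, len(sequence), min(scores), max(scores))
```

If the decoded scores come out negative or implausibly high, the offset is probably wrong for that file.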

SAM file:

HWI-EAS83_20F7TAAXX:1:1:379:338 16 4 157555988 25 36M * 0 0 AGAAAACTGCAAAGCACGAGTCTAGCAGATACCCTT h?DhhhLDPOhhhhhhhhhhhhhhhhhhhhhhhhhh XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:2C32G0
HWI-EAS83_20F7TAAXX:1:1:98:170 16 4 28122708 37 36M * 0 0 GCACCCTTTAACTCGGGCTAACTATCTTGCTTCACC VbINbYZh_hUhQhd\^hfhhhhhhhhhhhhhhhhh XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:33G2
HWI-EAS83_20F7TAAXX:1:1:582:80 4 * 0 0 * * 0 0 ATGGCTGCCTCGCAGAATCGAAAGTTAGTGCCGCAC hfhhhhahh`hhAVhEhahQKHKQA_IIPPF@DhEV
HWI-EAS83_20F7TAAXX:1:1:169:517 16 3 170277940 25 36M * 0 0 AAAACCATATCTGCTGGAAACTCTGCTTCCACAAGC CDhKDBhDhFaGghMhahhhhPhhhhhhhhhhhhhh XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:0T0C34

Information encoded in SAM file

• Sequence (forward strand of the reference genome)

• Quality score

• Alignment information (position, strand, mismatches, gaps)

• Ambiguous alignments

• Paired-end information

• Read group
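The fields listed above can be pulled out of a SAM line with a short parser. The sketch below assumes tab-delimited input, as required by the SAM format (the slide shows spaces only because of how the text was laid out), and uses two FLAG bits: 0x4 for an unmapped read and 0x10 for a read aligned to the reverse strand. The function name and the returned dictionary layout are illustrative choices.

```python
def parse_sam_line(line):
    """Parse one SAM alignment line into a dictionary of selected fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")

    # Optional fields look like TAG:TYPE:VALUE; integer types are converted.
    tags = {}
    for item in fields[11:]:
        key, value_type, value = item.split(":", 2)
        tags[key] = int(value) if value_type == "i" else value

    flag = int(fields[1])
    return {
        "qname": fields[0],
        "flag": flag,
        "rname": fields[2],
        "pos": int(fields[3]),       # 1-based leftmost position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "seq": fields[9],
        "qual": fields[10],
        "tags": tags,
        "unmapped": bool(flag & 0x4),
        "reverse_strand": bool(flag & 0x10),
    }


# First and third records from the slide, rebuilt with tab separators.
MAPPED = "\t".join([
    "HWI-EAS83_20F7TAAXX:1:1:379:338", "16", "4", "157555988", "25", "36M",
    "*", "0", "0", "AGAAAACTGCAAAGCACGAGTCTAGCAGATACCCTT",
    "h?DhhhLDPOhhhhhhhhhhhhhhhhhhhhhhhhhh",
    "XT:A:U", "NM:i:2", "X0:i:1", "X1:i:0", "XM:i:2", "XO:i:0", "XG:i:0",
    "MD:Z:2C32G0",
])
UNMAPPED = "\t".join([
    "HWI-EAS83_20F7TAAXX:1:1:582:80", "4", "*", "0", "0", "*", "*", "0", "0",
    "ATGGCTGCCTCGCAGAATCGAAAGTTAGTGCCGCAC",
    "hfhhhhahh`hhAVhEhahQKHKQA_IIPPF@DhEV",
])

if __name__ == "__main__":
    for line in (MAPPED, UNMAPPED):
        record = parse_sam_line(line)
        print(record["qname"], record["rname"], record["pos"],
              "unmapped" if record["unmapped"] else "mapped")
```

In the first record, FLAG 16 means the read aligned to the reverse strand, NM:i:2 reports two mismatches, and MD:Z:2C32G0 gives their positions. The third record has FLAG 4, so it did not align and carries no position.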

BAM is a compressed SAM file

• BAM file is several times smaller than SAM;

• BAM file can be indexed and queried;

• Most software operates directly on BAM;

• BAM format can potentially replace FASTQ format.

VCF file - variant call format

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF  ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G    A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T    A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A    G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T    .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTCT G,GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3
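A VCF data line has eight fixed columns, then a FORMAT column and one column per sample. The sketch below splits one line into those parts: the INFO column becomes a dictionary (flags such as DB map to True) and each sample column is matched against the FORMAT keys. It assumes tab-delimited input and takes the sample names as an argument rather than reading the `#CHROM` header; the function names are illustrative.

```python
def parse_info(info):
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    parsed = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            parsed[key] = value
        elif item and item != ".":
            parsed[item] = True  # flag without a value
    return parsed


def parse_vcf_line(line, sample_names):
    """Parse one tab-delimited VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    format_keys = fields[8].split(":") if len(fields) > 8 else []
    samples = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": float(qual),
        "filter": filt,
        "info": parse_info(info),
        "samples": samples,
    }


# First data line of the example, rebuilt with tab separators.
LINE = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])

if __name__ == "__main__":
    record = parse_vcf_line(LINE, ["NA00001", "NA00002", "NA00003"])
    print(record["chrom"], record["pos"], record["ref"], record["alt"])
    for name, call in record["samples"].items():
        print(name, call["GT"], call["DP"])
```

In the GT field, `0` is the reference allele and `1` the first alternate allele; `|` marks a phased genotype and `/` an unphased one.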

Alignment with BWA

Commonly used parameters:

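Alignment step (aln):

-n: maximum edit distance if given as an integer, or the fraction of missing alignments given a 2% uniform base error rate if given as a float (default 0.04)

-o: maximum number of gap opens (default 1)

Write SAM file step (samse or sampe):

-n: maximum number of alignments to report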
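A single-end BWA run with these parameters could be assembled as follows. This is a sketch that only builds the command lines and does not execute them. The file names are hypothetical, the `-f` option (write output to a file) and the `samse -n 3` default are taken from the BWA manual for the `aln`/`samse` workflow, and paired-end data would use `sampe` with two `.sai` files instead.

```python
import shlex


def bwa_single_end_commands(reference, fastq, prefix,
                            max_diff=0.04, max_gap_opens=1, max_hits=3):
    """Build the index, aln, and samse command lines for one FASTQ file."""
    sai = f"{prefix}.sai"   # intermediate suffix-array coordinates
    sam = f"{prefix}.sam"   # final alignments
    return [
        ["bwa", "index", reference],
        ["bwa", "aln", "-n", str(max_diff), "-o", str(max_gap_opens),
         "-f", sai, reference, fastq],
        ["bwa", "samse", "-n", str(max_hits), "-f", sam,
         reference, sai, fastq],
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for command in bwa_single_end_commands("genome.fa", "sample1.fastq", "sample1"):
        print(shlex.join(command))
```

Keeping each command as a list of arguments makes it straightforward to pass to `subprocess.run(command, check=True)` once the paths point at real files.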

- Converting SAM to BAM
- Index BAM

*** If you want to use the Broad GATK software to call SNPs, do not use SAMtools; always use Picard for processing SAM and BAM files.

Samtools: view; index

Picard: SamFormatConverter; BuildBamIndex
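For the SAMtools route, the two steps might look like the sketch below, which builds the `samtools view` and `samtools index` command lines and leaves execution to the caller. The file names are hypothetical. Indexing expects a coordinate-sorted BAM, so a sort step would normally sit between the two commands; it is left out here because its syntax differs between SAMtools versions.

```python
import shlex
import subprocess


def sam_to_indexed_bam_commands(sam_path, bam_path):
    """Build the SAMtools commands for SAM -> BAM conversion and indexing."""
    return [
        # -S: input is SAM (needed by older versions), -b: write BAM
        ["samtools", "view", "-b", "-S", "-o", bam_path, sam_path],
        # writes <bam_path>.bai; the BAM must be coordinate-sorted
        ["samtools", "index", bam_path],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names; printed rather than executed.
    for command in sam_to_indexed_bam_commands("sample1.sam", "sample1.bam"):
        print(shlex.join(command))
```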

BAM file can be visualized with IGV software

Clean up the BAM file

• Mark possible PCR duplicates
  ** For sequence reads with the exact same sequence, only one copy is kept.

• Base quality score recalibration
  • Phred quality score: 20 -> 1% error rate.
  • Illumina quality score: 0 to 62; needs to be calibrated to reflect the error rate.

• Local realignment around indels
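The Phred scale relates a quality score Q to an error probability p by Q = -10 log10(p), so Q20 corresponds to a 1% error rate and Q30 to 0.1%. The sketch below implements both directions and adds a toy empirical quality computed from observed mismatches. The pseudocount smoothing in that last function is an assumption for illustration, not the GATK recalibration model.

```python
import math


def phred_to_error_prob(quality):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-quality / 10)


def error_prob_to_phred(probability):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(probability)


def empirical_quality(mismatches, observations):
    """Toy empirical quality from mismatch counts (pseudocounts are an assumption)."""
    error_rate = (mismatches + 1) / (observations + 2)
    return error_prob_to_phred(error_rate)


if __name__ == "__main__":
    for quality in (10, 20, 30, 40):
        print(quality, phred_to_error_prob(quality))
    # Bases reported at some quality that mismatch 10 times in 1000 observations
    # behave like roughly Q20 bases, whatever score the instrument assigned.
    print(empirical_quality(10, 1000))
```

Recalibration compares reported scores with error rates observed like this, typically at sites not known to be polymorphic, and adjusts the reported scores to match.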

Multi‐sample SNP and INDEL calling

• Use Unified Genotyper (GATK) or mpileup (SAMtools) to call SNPs and INDELs from multiple samples.

• Set the variant calling thresholds
  Emission threshold: Q10 (>10x), Q3 (<10x)
  Confidence threshold: Q30 (>10x), Q4 (<10x)
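The threshold choice above can be written as a small lookup. In this sketch, a site with quality below the emission threshold is not reported, a site between the two thresholds is reported as low quality, and a site at or above the confidence threshold is a confident call. Treating a coverage of exactly 10x as shallow is an assumption, since the slide only gives values for >10x and <10x, and the function names are illustrative.

```python
def calling_thresholds(mean_coverage):
    """Return emission and confidence thresholds for a given mean coverage."""
    if mean_coverage > 10:
        return {"emit": 10, "call": 30}
    # Assumption: exactly 10x is handled like shallow coverage.
    return {"emit": 3, "call": 4}


def classify_call(variant_quality, mean_coverage):
    """Classify a variant by its Phred-scaled quality."""
    thresholds = calling_thresholds(mean_coverage)
    if variant_quality < thresholds["emit"]:
        return "not_emitted"
    if variant_quality < thresholds["call"]:
        return "low_quality"
    return "confident"


if __name__ == "__main__":
    for quality, coverage in [(5, 30), (25, 30), (50, 30), (5, 8)]:
        print(quality, coverage, classify_call(quality, coverage))
```

The lower thresholds at shallow coverage reflect that few reads can never produce a high score at any single site, so a strict cutoff would discard most true variants.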

Filtering

• Read depth (DP)

• Allele frequency (AF)

• Number of samples with data (NS)
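A filter on these three INFO fields could look like the sketch below, in the spirit of the "write your own code" option. The cutoff values are hypothetical placeholders and should be chosen per project; the INFO strings in the usage example come from the VCF example shown earlier.

```python
def parse_info(info):
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    parsed = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            parsed[key] = value
        elif item and item != ".":
            parsed[item] = True
    return parsed


def passes_filters(info, min_depth=10, min_allele_freq=0.05, min_samples=2):
    """Keep a variant only if DP, AF, and NS all meet the (hypothetical) cutoffs."""
    depth = int(info.get("DP", 0))
    samples_with_data = int(info.get("NS", 0))
    # AF may list one frequency per alternate allele, separated by commas.
    frequencies = [float(x) for x in str(info.get("AF", "")).split(",") if x]
    return (
        depth >= min_depth
        and samples_with_data >= min_samples
        and any(freq >= min_allele_freq for freq in frequencies)
    )


if __name__ == "__main__":
    for info_string in ("NS=3;DP=14;AF=0.5;DB;H2", "NS=3;DP=11;AF=0.017"):
        print(info_string, passes_filters(parse_info(info_string)))
```

A variant without an AF entry is rejected by this sketch; whether that is appropriate depends on how the caller populates the INFO column.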

• SAM ‐> BAM

• Flag possible PCR duplicates

• Quality score calibration

• INDEL realignment

• Call variants on multiple samples

• Filtering

SAMtools GATK/Picard

* SAMtools mpileup has a built-in realignment tool.

** Limited filtering function. Poor documentation.

GATK Documentation: http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2

SAMtools Variant Calling Documentation: http://samtools.sourceforge.net/mpileup.shtml

Practical aspects

1. Experimental Design.

2. Computational Resource at Cornell.

Whole genome sequencing vs. targeted sequencing

Target enrichment by array-based or in-solution capture technology (e.g., exome sequencing).

Whole genome sequencing vs. Genotyping by Sequencing (GBS)

[Figure: ApeK I sites in Line 1, Line 2, and Line 3]

Ed Buckler Lab (http://www.maizegenetics.net/gbs-overview)

Advantages of GBS over whole genome sequencing

1. Reduced cost by multiplexing;

2. Possible to map markers that are not on the reference genome.

To identify causative mutations in a mutant strain, it is necessary to use both sequencing and genetic linkage analysis.

Mapping and Mutation Identification of the Pooled F2 population

[Figure: crossing scheme: parents (X), F1, F2]

Using SHOREmap for mapping and mutation identification

SHOREmap: Schneeberger K. et al. (2009) Nat Methods 6(8):550-1.

Alternative approach: test for enrichment of new mutations

Zuryn et al. (2010) A Strategy for Direct Mapping and Identification of Mutations by Whole-Genome Sequencing. Genetics 186: 427–430

Computational Resource at Cornell

CBSU / 3CPG BioHPC Laboratory (625 Rhodes Hall)

Office Hours: 1:00 to 3:00 PM every Monday.

Email [email protected] to get a BioHPC Lab account.

Training workshops

• Linux for Biologists

• Programming workshop (PERL)