
Exome Sequencing and Data Analysis

Whitepaper | Scigenom.com




Background

The protein-coding (exonic) regions constitute ~1.5% of the human genome. An estimated 85% of known variations/mutations underlying disease-related traits occur in the exons of the genome [1]. Mutations identified in the exonic regions are typically more actionable than variations that occur in non-coding regions. Exome sequencing involves selective capture and sequencing of these protein-coding regions of the genome. Compared to whole genome sequencing, exome sequencing is more cost effective and provides better per-base coverage.

The first exome sequencing report was published in 2009 [2]. Since then, over 600 publications have reported the use of exome sequencing for various applications. In particular, exome sequencing has been applied to:

• Characterization of mutations in inherited disorders and rare syndromes [2-7];

• Understanding complex genetic disorders and disease risk, e.g. hypertension, autism, idiopathic generalized epilepsy and familial combined hypolipidemia [8-11];

• Identification of cancer driver gene mutations, e.g. recent exome capture studies involving glioblastoma, colon cancer, prostate cancer and melanoma have identified several novel driver mutations [12-16];

• Diagnosis and treatment of cancer and other diseases [17, 18].

Figure 1. The number of exome sequencing papers indexed in Medline since 2009.



Targeted Capture

Select regions of the genome are enriched by hybridizing capture probes to capture genomic DNA of interest. Captured DNA is adapted to generate a library that is then sequenced and analyzed.

Figure 2. Target capture sequencing

Service Offerings

1) Exome Sequencing @ SciGenom

SciGenom offers exome capture, sequencing and analysis services using off-the-shelf capture kits from

(a) Roche NimbleGen SeqCap EZ

(b) Agilent SureSelect

(c) Illumina TruSeq capture



SciGenom also provides custom capture based on user-defined regions of interest from genomes that have a reliable published sequence.

Sequencing Platform: Illumina HiSeq 2000

Coverage: 50-100X (or any other level of coverage defined by customer)

Table 1: Details for ready-to-use human capture probe sets

Capture kit | Target region size | Number of target genes | Number of target exons
Illumina TruSeq Exome | 62 Mb | 20,794 | 201,121
Agilent SureSelect Human All Exon 50 Mb | 50 Mb | 20,718 | 331,518
Agilent SureSelect Human All Exon V4 | 51 Mb | 20,965 | 334,378
Agilent SureSelect Human All Exon V4 + UTRs | 71 Mb | 20,965 | 335,765
Roche NimbleGen SeqCap EZ Version 2.0 | 36.5 Mb | 30,000 | 300,000
Roche NimbleGen SeqCap EZ Version 3.0 | 64 Mb | 24,685 | -

Workflow

The following is a typical workflow for exome sequencing.

We first isolate genomic DNA, or use DNA provided by the customer that meets our quality specifications. The genomic DNA is sheared, and the fragments are used to perform capture using oligonucleotide probes from NimbleGen, Illumina or Agilent exome capture kits. Captured fragments are adapted to produce libraries that we sequence on the Illumina HiSeq 2000, or other platforms as appropriate, typically generating paired-end 2 x 100 bp sequence reads resulting in 50-100x coverage. After quality-control filtering, the generated sequence data are analyzed using our proprietary exome analysis pipeline for variant calling and annotation.
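The link between read yield and depth of coverage can be sketched with simple arithmetic: mean coverage is roughly the number of on-target bases sequenced divided by the target size. The Python snippet below is an illustration only and is not part of the SciGenom pipeline; the function names and the 60% on-target fraction are assumptions chosen for the example.

```python
import math


def mean_coverage(read_pairs, read_length, target_bp, on_target_fraction):
    """Approximate mean depth over the target: on-target bases / target size."""
    sequenced_bases = read_pairs * 2 * read_length  # two mates per read pair
    return sequenced_bases * on_target_fraction / target_bp


def read_pairs_needed(depth, read_length, target_bp, on_target_fraction):
    """Invert mean_coverage(): read pairs required for a desired mean depth."""
    bases_needed = depth * target_bp / on_target_fraction
    return math.ceil(bases_needed / (2 * read_length))


# Example: 62 Mb target, 2 x 100 bp reads, assumed 60% of bases on target.
pairs = read_pairs_needed(100, 100, 62_000_000, 0.6)
print(pairs, round(mean_coverage(pairs, 100, 62_000_000, 0.6), 1))
```

Real capture experiments have uneven depth across targets, so a mean value of this kind is a planning estimate rather than a guarantee for every exon.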

Table 2. Sample Specifications

For other sample types including plant tissue please contact us.

2) Bioinformatics Analysis @ SciGenom

The large amount of data generated by a next-generation sequencer requires a systematic approach for translating it into useful and meaningful information. At SciGenom, we have developed a pipeline for performing various bioinformatics analyses of exome data, including alignment, high-quality variant calling and comprehensive annotation. Here we briefly discuss our bioinformatics analysis pipeline and the list of deliverables for exome data.

Bioinformatics Exome Analysis Workflow



Below is the detailed workflow followed at SciGenom for the bioinformatics analysis of exome sequencing data.

Figure 3. Workflow of exome analysis

I. Alignment & Recalibration

We perform the following steps for read alignment and recalibration of base quality:

1. Read quality checking – The following parameters are checked from the fastq file:
a. Base quality score distribution
b. Sequence quality score distribution
c. Average base content per read
d. GC distribution in the reads
e. PCR amplification issue
f. Over-represented sequences
g. Biasing of k-mers
h. Read-length distribution

Based on the quality report of fastq files, we trim the sequences to retain the high quality portion of the sequence and remove low-quality sequences from further analysis.

2. Read alignment – The reads are aligned to the GRCh37/hg19 version of the human genome. The alignment is performed using the BWA program, and only uniquely mapped reads are reported.

3. Read filtering – The aligned reads are post-processed to remove unwanted reads or alignments. This includes removal of PCR duplicates and fragments exceeding a defined size.



4. Read realignment – The filtered reads are then realigned using the GATK toolkit to correct misalignments that would otherwise produce false-positive variant calls.

5. Read quality recalibration – After realignment, the base qualities of the reads are recalibrated using the GATK recalibrator to correct the variation in base quality reported by sequencers, in order to provide more accurate quality scores for variant calling.
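The read filtering step (step 3) can be illustrated with a small, self-contained sketch. It does not reproduce the tools used in the pipeline: the `Fragment` record, its fields and the 1,000 bp size limit are hypothetical. Duplicates are defined here as fragments sharing chromosome, start, end and strand, and the copy with the highest mapping quality is kept.

```python
from typing import Iterable, List, NamedTuple


class Fragment(NamedTuple):
    name: str
    chrom: str
    start: int   # leftmost aligned position of the fragment
    end: int     # rightmost aligned position of the fragment
    strand: str
    mapq: int


def filter_fragments(fragments: Iterable[Fragment], max_size: int = 1000) -> List[Fragment]:
    """Drop oversized fragments, then keep one fragment per duplicate group."""
    best = {}
    for frag in fragments:
        if frag.end - frag.start + 1 > max_size:
            continue  # fragment exceeds the (assumed) size limit
        key = (frag.chrom, frag.start, frag.end, frag.strand)
        # Keep the duplicate with the highest mapping quality.
        if key not in best or frag.mapq > best[key].mapq:
            best[key] = frag
    return sorted(best.values(), key=lambda f: (f.chrom, f.start, f.end))


frags = [
    Fragment("a", "chr1", 100, 250, "+", 30),
    Fragment("b", "chr1", 100, 250, "+", 60),   # duplicate of "a", better MAPQ
    Fragment("c", "chr1", 100, 2500, "+", 60),  # too long
    Fragment("d", "chr2", 5, 120, "-", 20),
]
print([f.name for f in filter_fragments(frags)])
```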

II. Variant Calling

The following steps are followed to identify variants in the sample(s).

1. Combining samples – If the study includes multiple samples then we combine the samples before performing actual variant calling.

2. Variant calling – The GATK variant calling program is used to identify single nucleotide polymorphisms (SNPs)/point mutations and short indels in the exome samples.
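To give a feel for how a caller weighs read evidence at a single site, the sketch below scores the three diploid genotypes with a simple binomial model. This is a teaching example and not the GATK algorithm, which uses per-base qualities and further modelling; the function names and the 1% error rate are assumptions.

```python
import math


def genotype_log_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Log10 likelihood of the allele counts under each diploid genotype.

    error_rate must lie strictly between 0 and 1. The binomial coefficient
    is omitted because it is identical for all three genotypes.
    """
    # Probability that a single read shows the alternate allele, per genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    return {
        gt: ref_count * math.log10(1.0 - p) + alt_count * math.log10(p)
        for gt, p in p_alt.items()
    }


def call_genotype(ref_count, alt_count, error_rate=0.01):
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_log_likelihoods(ref_count, alt_count, error_rate)
    return max(likelihoods, key=likelihoods.get)


for counts in [(30, 0), (14, 16), (0, 25)]:
    print(counts, call_genotype(*counts))
```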

III. Variant Annotation

The variants identified are then annotated and compared with various databases.

1. Variant annotation – Genomic coordinates, gene position and mutation type (silent, missense or non-sense) are identified for each variant. The annotation is performed using the in-house VariMAT (Variation and Mutation Annotation Toolkit) program.

2. OncoMD annotation – The identified variants are compared with SciGenom’s OncoMD (Oncology Mutation Database), which contains curated somatic and germline mutation information.

3. Comparison with disease-related databases – The variants are compared with various disease-relevant mutation resources, e.g. OMIM and SNPedia.

4. Comparison with common SNP databases – The variants are compared with SNPs and indels reported by the 1000-genomes project, common SNP databases (dbSNP 135) and various published personal genomes.
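The mutation-type assignment in step 1 (silent, missense or non-sense) can be sketched with the standard genetic code. The snippet below is an assumption-laden illustration and says nothing about how VariMAT is implemented: `classify_snv` is a hypothetical helper that takes the reference codon on the coding strand, the 0-based position of the change within the codon, and the alternate base.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(ref_codon, pos_in_codon, alt_base):
    """Classify a coding single-nucleotide change by its protein effect."""
    alt_codon = ref_codon[:pos_in_codon] + alt_base + ref_codon[pos_in_codon + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"  # extra category, not listed in the text above
    return "missense"


print(classify_snv("GAA", 2, "G"))  # GAA -> GAG
print(classify_snv("GAA", 0, "T"))  # GAA -> TAA
print(classify_snv("GAA", 1, "T"))  # GAA -> GTA
```

A production annotator also has to handle strand, transcript choice, splice sites and indels, all of which are outside the scope of this sketch.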

3) Typical deliverables for exome sequencing and analysis

Described below are the deliverables designed for human samples. These will change depending on the species to be sequenced.

I. Alignment & Recalibration

1. Quality check result of the fastq file
a. Base quality report
b. Base distribution report
c. Sequence quality report
d. Base content report
e. Read GC content report
f. PCR amplification/Read duplication related report
g. Enrichment of specific sequences report

2. Read alignment report
a. Total number of aligned and unaligned reads
b. Filtered read count
c. Chromosome-wise read alignment distribution
d. Estimated fragment length distribution
e. Coverage across genes and exons
f. Mismatch position distribution
g. Quality distribution of mismatches
h. Distribution of mismatch type
i. Alignment and filtered files in BAM and SAM file format

3. Quality recalibration report
a. Recalibrated file in BAM file format
b. Recalibration report

II. Variant Calling

a. VCF file for identified SNPs and indels
b. Distribution of mapping quality of identified variants
c. Zygosity status of the variants
d. Summary of variant types

III. Variant Annotation

a. Gene/intergenic annotation of the variants
b. Exonic, intronic, 3’UTR, 5’UTR, promoter region, conserved transcription factor binding sites, conserved intergenic region, and active transcription region identified using ChIP-Seq experiments
c. Silent, missense, non-sense, frame-shift, in-frame annotation of genic variants
d. Comparison with OncoMD somatic mutation and germline collection
e. Comparison with OMIM, SNPedia and other relevant databases
f. Comparison with common SNP databases: 1000-genomes project, dbSNP 135, and personal genome variants



Case Study - Gastric Cancer Study Data Analysis

We re-analyzed the gastric adenocarcinoma exome samples published by Zang et al. in 2012 [19]. The authors found recurrent somatic mutations in genes related to cell adhesion and chromatin remodeling pathways.

We downloaded the raw exome datasets from the NCBI Sequence Read Archive (Accession # SRS04582) and analyzed them using our pipeline.

Below are some examples of the analyzed sample data:

Table 3. Summary of an exome sample data

Parameter | Value
Filename | SRR504964.fastq
Organism | Homo sapiens
Study Information | Deep sequencing of gastric exome
Sample Information | Normal tissue from gastric cancer patient
Tissue/Cell type | NA
Sequencing platform | Illumina GA II
Total Sequences | 51,487,668
Sequence length (bp) | 76
Mean fragment size (bp) | 110
Paired-end | Yes
% GC | 46



Quality Check and Filtering

The sequencing quality parameters of the raw (fastq) reads are measured. The base quality, base percentages and GC% for the fastq file are shown in Figure 4. We did not observe any base bias or issues with the quality of the bases, so we used all bases for alignment.

Figure 4. (a) Base quality distribution in the sequencing reads, (b) Base percentage in the sequencing reads, (c) GC% distribution in the sequencing reads, (d) Theoretical and experimental GC% distribution of the sequencing reads.
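Two of the quantities plotted in Figure 4, the mean base quality at each read position and the overall GC%, can be computed from a fastq file in a few lines. The following is a minimal sketch rather than the tool used in the pipeline; it assumes 4-line fastq records and Phred+33 quality encoding, and the function names are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (sequence, quality_string) for each 4-line fastq record."""
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of file (or a blank line)
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def summarize(handle, phred_offset=33):
    """Return (mean quality per read position, overall GC percentage)."""
    qual_sums, counts = [], []
    gc = total = 0
    for seq, qual in read_fastq(handle):
        for i, ch in enumerate(qual):
            if i == len(qual_sums):  # first read reaching this position
                qual_sums.append(0)
                counts.append(0)
            qual_sums[i] += ord(ch) - phred_offset
            counts[i] += 1
        gc += sum(base in "GCgc" for base in seq)
        total += len(seq)
    mean_quality = [s / n for s, n in zip(qual_sums, counts)]
    gc_percent = 100.0 * gc / total if total else 0.0
    return mean_quality, gc_percent


example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nII5I\n")
print(summarize(example))
```

Passing an open file, e.g. `summarize(open("sample.fastq"))`, works the same way; the file name here is only a placeholder.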



Quality Recalibration using GATK

The original quality values are biased towards higher quality. We use the GATK toolkit to recalibrate the base quality scores; after recalibration, the quality scores follow a normal distribution. The original and recalibrated quality score distributions are shown in Fig. 5.

Figure 5. Quality calibration using the GATK recalibrator.
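The idea behind recalibration can be shown with a deliberately simplified sketch: group bases by their reported quality, count how often they mismatch the reference at sites not known to be variant, and convert the observed error rate back to a Phred score, Q = -10 log10(error rate). This is a conceptual illustration and not the GATK implementation, which conditions on additional covariates; the function name and the add-one smoothing are assumptions.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map each reported quality to an empirically observed Phred quality.

    observations: iterable of (reported_quality, is_mismatch) pairs for
    bases at sites that are not known variants.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        errors[reported_q] += int(is_mismatch)
    table = {}
    for q, n in totals.items():
        error_rate = (errors[q] + 1) / (n + 2)  # smoothing avoids log10(0)
        table[q] = -10.0 * math.log10(error_rate)
    return table


# 998 bases reported as Q40, 9 of which mismatch the reference.
obs = [(40, True)] * 9 + [(40, False)] * 989
print(empirical_qualities(obs))
```

In this made-up example the bases reported as Q40 behave like roughly Q20 bases, which is the kind of over-optimistic reporting that recalibration is meant to correct.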



Read Alignment Summary

The reads are aligned to the hg19/GRCh37 version of the human genome. The aligned reads are post-processed to filter out reads that are of low quality or map to multiple locations. The average fragment length, chromosome-wise read distribution, total aligned reads, mismatch distribution and various other statistics are calculated from the alignment file. Some of these statistics calculated for a sample are shown in Fig. 6.

Figure 6. Read Alignment Summary (a) Total aligned, unaligned and filtered reads, (b) Distribution of aligned reads in chromosomes, (c) Fragment length distribution, (d) Fraction of passed and failed reads in chromosomes, (e) Mismatch type with respect to hg19 genome found in the sample



Variant Calling Statistics

Our pipeline tracks various parameters for the predicted variants in the sample. We applied several filtering criteria to retain only the high-quality variants detected in the sample. A summary of some of these parameters for a sample is shown in Fig. 7.

Figure 7. Variant summary (a) Identified variant count, (b) Percentage of called variants of different class, (c) Type of SNV allele, (d) Minor allele frequency for heterozygous variants, (e) Read depth for passed and failed variants, (f) Mapping quality distribution for passed and failed variants
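The filtering criteria themselves are not listed in this whitepaper, so the sketch below only shows the general shape of such a filter using the quantities named in Figure 7 (read depth, mapping quality, allele fraction). The `Variant` record and every threshold are hypothetical values chosen for illustration.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref_depth: int   # reads supporting the reference allele
    alt_depth: int   # reads supporting the alternate allele
    mapq: float      # mean mapping quality of reads at the site


def passes_filters(v, min_depth=10, min_mapq=30.0, min_alt_fraction=0.2):
    """Return True if the variant meets all (assumed) quality thresholds."""
    depth = v.ref_depth + v.alt_depth
    if depth < min_depth or v.mapq < min_mapq:
        return False
    return v.alt_depth / depth >= min_alt_fraction


def minor_allele_fraction(v):
    """Fraction of reads supporting the less frequent of the two alleles."""
    depth = v.ref_depth + v.alt_depth
    return min(v.ref_depth, v.alt_depth) / depth if depth else 0.0


calls = [
    Variant("chr1", 1000, 12, 10, 55.0),
    Variant("chr1", 2000, 3, 2, 60.0),     # too shallow
    Variant("chr2", 3000, 40, 3, 58.0),    # alternate allele fraction too low
    Variant("chr3", 4000, 15, 15, 12.0),   # poor mapping quality
]
print([(v.chrom, v.pos) for v in calls if passes_filters(v)])
```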

Variant Annotation Statistics

The high-quality variant calls are processed using our annotation toolkit (VariMAT). A comprehensive annotation of variants is performed, including annotation to genes, miRNAs, repeats, splice sites, conserved transcription factor binding sites and enhancers. Our pipeline also predicts the silent, missense and damaging variants present in the sample. A summary of the annotation for various samples in the gastric cancer study is shown in Fig. 8.



Figure 8. Annotation of SNVs to gene annotation, repeat, miRNA, conserved TFBS
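Assigning a variant to a feature such as a gene, repeat or miRNA is at heart an interval lookup. The sketch below shows one common way to do this with a sorted list and binary search; it is not a description of VariMAT. The `RegionIndex` class, the coordinates and the labels are made up, and the regions on each chromosome are assumed not to overlap.

```python
import bisect


class RegionIndex:
    """Point lookup over sorted, non-overlapping regions per chromosome."""

    def __init__(self, regions):
        # regions: iterable of (chrom, start, end, label), 1-based inclusive.
        self._by_chrom = {}
        for chrom, start, end, label in sorted(regions):
            self._by_chrom.setdefault(chrom, []).append((start, end, label))
        self._starts = {
            chrom: [r[0] for r in rs] for chrom, rs in self._by_chrom.items()
        }

    def annotate(self, chrom, pos):
        """Return the label of the region containing pos, else 'intergenic'."""
        starts = self._starts.get(chrom, [])
        i = bisect.bisect_right(starts, pos) - 1  # last region starting <= pos
        if i >= 0:
            start, end, label = self._by_chrom[chrom][i]
            if start <= pos <= end:
                return label
        return "intergenic"


index = RegionIndex([
    ("chr1", 100, 200, "geneA exon 1"),
    ("chr1", 500, 650, "geneA exon 2"),
    ("chr2", 50, 90, "miRNA X"),
])
for site in [("chr1", 150), ("chr1", 300), ("chr2", 90), ("chr3", 10)]:
    print(site, index.annotate(*site))
```

Overlapping features, as occur with real gene models, would call for an interval tree or a similar structure instead of this simple index.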

Conclusion

Exome sequencing has been widely adopted for variation discovery. It is now being adopted for variant detection in the clinic for making treatment decisions. At SciGenom, we have the expertise and the technologies needed to collect, analyze and annotate exome data.



References

1. Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir A, Bakkaloglu A, Ozen S, Sanjad S et al: Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci U S A 2009, 106(45):19096-19101.

2. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE et al: Targeted capture and massively parallel sequencing of 12 human exomes. Nature 2009, 461(7261):272-276.

3. Gibson WT, Hood RL, Zhan SH, Bulman DE, Fejes AP, Moore R, Mungall AJ, Eydoux P, Babul-Hirji R, An J et al: Mutations in EZH2 cause Weaver syndrome. Am J Hum Genet 2012, 90(1):110-118.

4. Hoischen A, van Bon BW, Rodriguez-Santiago B, Gilissen C, Vissers LE, de Vries P, Janssen I, van Lier B, Hastings R, Smithson SF et al: De novo nonsense mutations in ASXL1 cause Bohring-Opitz syndrome. Nat Genet 2011, 43(8):729-731.

5. Kalay E, Yigit G, Aslan Y, Brown KE, Pohl E, Bicknell LS, Kayserili H, Li Y, Tuysuz B, Nurnberg G et al: CEP152 is a genome maintenance protein disrupted in Seckel syndrome. Nat Genet 2011, 43(1):23-26.

6. Simpson MA, Deshpande C, Dafou D, Vissers LE, Woollard WJ, Holder SE, Gillessen-Kaesbach G, Derks R, White SM, Cohen-Snuijf R et al: De novo mutations of the gene encoding the histone acetyltransferase KAT6B cause Genitopatellar syndrome. Am J Hum Genet 2012, 90(2):290-294.

7. Zuchner S, Dallman J, Wen R, Beecham G, Naj A, Farooq A, Kohli MA, Whitehead PL, Hulme W, Konidari I et al: Whole-exome sequencing links a variant in DHDDS to retinitis pigmentosa. Am J Hum Genet 2011, 88(2):201-206.

8. Boyden LM, Choi M, Choate KA, Nelson-Williams CJ, Farhi A, Toka HR, Tikhonova IR, Bjornson R, Mane SM, Colussi G et al: Mutations in kelch-like 3 and cullin 3 cause hypertension and electrolyte abnormalities. Nature 2012, 482(7383):98-102.

9. Chahrour MH, Yu TW, Lim ET, Ataman B, Coulter ME, Hill RS, Stevens CR, Schubert CR, Greenberg ME, Gabriel SB et al: Whole-exome sequencing and homozygosity analysis implicate depolarization-regulated neuronal genes in autism. PLoS Genet 2012, 8(4):e1002635.

10. Heinzen EL, Depondt C, Cavalleri GL, Ruzzo EK, Walley NM, Need AC, Ge D, He M, Cirulli ET, Zhao Q et al: Exome sequencing followed by large-scale genotyping fails to identify single rare variants of large effect in idiopathic generalized epilepsy. Am J Hum Genet 2012, 91(2):293-302.

11. Musunuru K, Pirruccello JP, Do R, Peloso GM, Guiducci C, Sougnez C, Garimella KV, Fisher S, Abreu J, Barry AJ et al: Exome sequencing, ANGPTL3 mutations, and familial combined hypolipidemia. N Engl J Med 2010, 363(23):2220-2227.

12. Barbieri CE, Baca SC, Lawrence MS, Demichelis F, Blattner M, Theurillat JP, White TA, Stojanov P, Van Allen E, Stransky N et al: Exome sequencing identifies recurrent SPOP, FOXA1 and MED12 mutations in prostate cancer. Nat Genet 2012, 44(6):685-689.

13. Nikolaev SI, Rimoldi D, Iseli C, Valsesia A, Robyr D, Gehrig C, Harshman K, Guipponi M, Bukach O, Zoete V et al: Exome sequencing identifies recurrent somatic MAP2K1 and MAP2K2 mutations in melanoma. Nat Genet 2012, 44(2):133-139.

14. Schwartzentruber J, Korshunov A, Liu XY, Jones DT, Pfaff E, Jacob K, Sturm D, Fontebasso AM, Quang DA, Tonjes M et al: Driver mutations in histone H3.3 and chromatin remodelling genes in paediatric glioblastoma. Nature 2012, 482(7384):226-231.

15. Stark MS, Woods SL, Gartside MG, Bonazzi VF, Dutton-Regester K, Aoude LG, Chow D, Sereduk C, Niemi NM, Tang N et al: Frequent somatic mutations in MAP3K5 and MAP3K9 in metastatic melanoma identified by exome sequencing. Nat Genet 2012, 44(2):165-169.

16. Timmermann B, Kerick M, Roehr C, Fischer A, Isau M, Boerno ST, Wunderlich A, Barmeyer C, Seemann P, Koenig J et al: Somatic mutation profiles of MSI and MSS colorectal cancer identified by whole exome next generation sequencing and bioinformatics analysis. PLoS One 2010, 5(12):e15661.

17. Dixon-Salazar TJ, Silhavy JL, Udpa N, Schroth J, Bielas S, Schaffer AE, Olvera J, Bafna V, Zaki MS, Abdel-Salam GH et al: Exome sequencing can improve diagnosis and alter patient management. Sci Transl Med 2012, 4(138):138ra178.

18. Roychowdhury S, Iyer MK, Robinson DR, Lonigro RJ, Wu YM, Cao X, Kalyana-Sundaram S, Sam L, Balbin OA, Quist MJ et al: Personalized oncology through integrative high-throughput sequencing: a pilot study. Sci Transl Med 2011, 3(111):111ra121.

19. Zang ZJ, Cutcutache I, Poon SL, Zhang SL, McPherson JR, Tao J, Rajasegaran V, Heng HL, Deng N, Gan A et al: Exome sequencing of gastric adenocarcinoma identifies recurrent somatic mutations in cell adhesion and chromatin remodeling genes. Nat Genet 2012, 44(5):570-574.
