

Complete mitochondrial genome NGS investigation on Ion Torrent PGM™ platform and population genetic studies in Northern Han Chinese

Yishu Zhou1, Jiao Yu2, Fei Guo3, Jinling Zhao2, Feng Liu2, Hongying Shen2, Bin Zhao2, Fei Jia2, Zhu Sun2, He Song1, Xianhua Jiang 1,2 *

1 China Medical University School of Forensic Medicine, No. 77, Puhe Road, Shenyang North New Area, Shenyang, Liaoning 110122, P.R. China
2 Criminal Science and Technology Institute of Liaoning Province, No. 2, Qishan Middle Road, Huanggu District, Shenyang, Liaoning 110032, P.R. China
3 Department of Forensic Medicine, National Police University of China, No. 83, Tawan Street, Huanggu District, Shenyang, Liaoning 110854, P.R. China

* Corresponding author at: Criminal Science and Technology Institute of Liaoning Province, E-mail address: jiangxianhua_2006@aliyun.

Introduction

Previous studies have often been restricted to sequencing HVS-I, II and III of the control region (CR), together with some specific single nucleotide polymorphisms (SNPs) in the coding region (CodR). Such partial information may limit the polymorphism information content of this genetic marker and hinder its application in practical forensic casework. In this study, we developed strategies for complete mtGenome sequencing on the Ion Torrent Personal Genome™ Machine (PGM™) platform and investigated mtGenome features of the Northern Han Chinese population to evaluate their application in forensic science.

Strategies

Sequencing

Data analysis

• Three in-house Perl scripts were developed for primary data analysis to screen out uncertain positions and samples from variant call format (VCF) reports.

• Both IGV and NextGENe software were used for base-by-base review.

• Online tools were used for haplogroup assignment, including Mthap (http://dna.jameslick.com/mthap), EMMA (www.empop.org), and PhyloTree (www.phylotree.org).
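The in-house Perl scripts are not reproduced on the poster. As a rough illustration of the screening step only, the following Python sketch separates VCF records into confident and uncertain calls by read depth. It assumes that the INFO column carries a `DP` (depth) key; the 500 × cut-off, the function names and the two example records are hypothetical and are not taken from the authors' scripts or data.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=812;AO=790' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info


def screen_vcf(lines, min_depth=500):
    """Split VCF records into (confident, uncertain) lists of (pos, ref, alt).

    A record is treated as uncertain when DP is missing or below min_depth.
    The threshold is an assumption for illustration only.
    """
    confident, uncertain = [], []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        pos, ref, alt = int(fields[1]), fields[3], fields[4]
        depth = parse_info(fields[7]).get("DP", "")
        record = (pos, ref, alt)
        if depth.isdigit() and int(depth) >= min_depth:
            confident.append(record)
        else:
            uncertain.append(record)
    return confident, uncertain


# Made-up records for demonstration; not data from the study.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chrM\t73\t.\tA\tG\t900\tPASS\tDP=1480",
    "chrM\t16189\t.\tT\tC\t120\tPASS\tDP=212",
]
confident, uncertain = screen_vcf(example)
print(confident)  # [(73, 'A', 'G')]
print(uncertain)  # [(16189, 'T', 'C')]
```

Records placed in the uncertain list would then be the ones passed on to base-by-base review in a viewer such as IGV.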

Results and discussion

Coverage and strand balance

• Some regions showed particularly low coverage, mostly within the HVS and NADH dehydrogenase (ND) coding regions.

• Most high reverse-strand biases were located in regions with low coverage relative to the rest of the mtGenome.

• Most of these regions appeared to coincide with homopolymer stretches.

Homopolymers

• Low coverage and high reverse-strand bias on the PGM™ platform were mainly attributable to homopolymers, especially a single long stretch (≥ 8 bp) and/or multiple consecutive stretches within a small region.

Reliability of Torrent Variant Caller v.4.0

• TVC was more reliable at ≥ 1500 × average coverage and in homopolymers of ≤ 5 bp.

• For homopolymers ≥ 6 bp (especially ≥ 8 bp) and average coverage ≤ 500 ×, variants should be authenticated by visual inspection in the affected regions and, where necessary, across the complete mtGenome.
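One possible reading of these thresholds can be written as a small decision rule. The Python sketch below measures the longest homopolymer run in the sequence surrounding a variant and returns a review scope. The mapping of low coverage to genome-wide review and of long homopolymers to regional review is an interpretation of the bullets above, and the function names and example sequences are hypothetical.

```python
from itertools import groupby


def longest_homopolymer(seq):
    """Length of the longest run of identical bases in seq (0 if empty)."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)


def review_scope(flank_seq, average_coverage):
    """Suggest how much visual inspection a call needs.

    Assumed interpretation of the poster's thresholds:
      average coverage <= 500 x -> review across the complete mtGenome
      homopolymer >= 6 bp       -> review the affected region
      otherwise                 -> accept the automated call
    """
    if average_coverage <= 500:
        return "genome-wide"
    if longest_homopolymer(flank_seq) >= 6:
        return "regional"
    return "none"


# Made-up flanking sequences for demonstration.
print(longest_homopolymer("ACCCCCCTA"))   # 6
print(review_scope("ACGTTTGA", 1600))     # none
print(review_scope("ACCCCCCTA", 1600))    # regional
print(review_scope("ACGT", 450))          # genome-wide
```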

Population genetics

Summary statistics of mtGenomes from 107 Northern Han Chinese individuals.

a Cn indels at positions 309, 315 and 16193 were excluded from all calculations; 523–524DEL, 521–524DEL, 524.1A and 524.2C, and 8281–8289DEL were each treated as one variant; PHPs were treated as variants.

b Haplogroups for HVS-I, HVS-I/HVS-II, CR and CodR were assigned by mthap, while haplogroups for the mtGenome were confirmed by EMPOP.

• Sequencing the complete mtGenome lowered the RMP by 26.19% relative to sequencing HVS-I alone.
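The poster does not state the formulas used for RMP and HD. The Python sketch below assumes the standard definitions, RMP = Σ pᵢ² and HD = n(1 − Σ pᵢ²)/(n − 1), where pᵢ is the frequency of haplotype i among n samples. Under that assumption, 107 samples that all carry distinct haplotypes give the mtGenome values reported here (RMP ≈ 0.0093, HD = 1), and the 26.19% figure follows from the reported RMP values for HVS-I (0.0126) and the mtGenome (0.0093). The haplotype labels are placeholders.

```python
from collections import Counter


def random_match_probability(haplotypes):
    """Sum of squared haplotype frequencies."""
    n = len(haplotypes)
    return sum((count / n) ** 2 for count in Counter(haplotypes).values())


def haplotype_diversity(haplotypes):
    """Unbiased haplotype diversity: n / (n - 1) * (1 - sum of p_i^2)."""
    n = len(haplotypes)
    return n / (n - 1) * (1 - random_match_probability(haplotypes))


# 107 samples, every haplotype observed once (placeholder labels).
all_unique = [f"hap{i}" for i in range(107)]
print(round(random_match_probability(all_unique), 4))  # 0.0093
print(round(haplotype_diversity(all_unique), 4))       # 1.0

# Relative decrease in RMP from HVS-I to the complete mtGenome.
print(round((0.0126 - 0.0093) / 0.0126 * 100, 2))      # 26.19
```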

Haplogroup resolution

• Haplotypes based on the complete mtGenome could be assigned to more accurate haplogroups than haplotypes based on the control region only.

Conclusions

• This study outlines strategies for complete mtGenome sequencing on the Ion Torrent PGM™ platform and for NGS data analysis.

• In our experience, one operator can process about 30 samples per week on the PGM™ platform.

• The TVC is more reliable for samples with higher average coverage (e.g., ≥ 1500 ×) and for homopolymers of ≤ 5 bp.

• Sequencing the complete mtGenome markedly improved resolution compared with the subsets of the molecule historically targeted for human identification.

• We therefore believe that NGS technology has strong potential for complete mtGenome analysis compared with traditional methods.

Acknowledgements

This study was supported by a grant (No. 201201ZDYJ001) from the key research project of the Ministry of Public Security, China. The authors wish to thank Walther Parson and Simone Nagl from EMPOP for their advice and efforts on the data evaluation. We also thank Qingqing Zhang from the Department of Field Bioinformatics Support (FBS) of Thermo Fisher Scientific for advice on Perl compilation.

Sequencing workflow (figure): long range PCR amplification of the mtGenome (16569 bp) in two fragments (Fragment A, 8.3 kbp; Fragment B, 8.6 kbp; primers L644, H877, H8982 and L8789) → library construction (fragmentation, adapter ligation and size selection) → clonal amplification → PGM sequencing (314 chip: 4 samples; 316 chip: 15 samples; 318 chip: 30 samples).
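The fragment sizes can be checked with a little arithmetic on the circular genome. The Python sketch below assumes that the numbers in the primer names are coordinates on the 16569 bp reference, and that L644 pairs with H8982 and L8789 pairs with H877; this pairing is inferred from the stated fragment sizes and is not spelled out on the poster.

```python
MT_LENGTH = 16569  # length of the mtGenome stated in the workflow figure


def amplicon_length(start, end, genome_length=MT_LENGTH):
    """Inclusive length from start to end on a circular genome.

    The modulo handles amplicons that run past the origin (end < start).
    """
    return (end - start) % genome_length + 1


fragment_a = amplicon_length(644, 8982)   # assumed pairing L644 / H8982
fragment_b = amplicon_length(8789, 877)   # assumed pairing L8789 / H877

print(fragment_a)                          # 8339, about 8.3 kbp
print(fragment_b)                          # 8658, about 8.6 kbp
print(fragment_a + fragment_b - MT_LENGTH) # 428 bp covered by both fragments
```

Under these assumptions the two amplicons together span the whole molecule, with a combined overlap of 428 bp at the two junctions.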

Variant-call accuracy (pie charts; error variant vs. correct variant):

All samples (average coverage 1269 ×): 2.66% error, 97.34% correct
Good samples (average coverage 1561 ×): 1.37% error, 98.63% correct
Poor samples (average coverage 632 ×): 5.26% error, 94.74% correct
Outside homopolymer regions: 1.08% error, 98.92% correct
Within homopolymer regions: 7.58% error, 92.42% correct

Homopolymer   Error rate
3 bp          1.19% (6/504)
4 bp          1.90% (4/211)
5 bp          4.96% (7/141)
6 bp          32.14% (9/28)
7 bp          25.00% (2/8)
8 bp          75.00% (3/4)
≥ 9 bp        53.75% (43/80)
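Pooling the per-length counts makes the contrast between short and long homopolymers easier to see. The Python sketch below uses only the error and total counts listed for each homopolymer length; the grouping into 3–5 bp and ≥ 6 bp is chosen here to mirror the ≤ 5 bp reliability threshold and is not a calculation shown on the poster.

```python
# (errors, total calls) per homopolymer length, as listed on the poster.
counts = {
    "3": (6, 504),
    "4": (4, 211),
    "5": (7, 141),
    "6": (9, 28),
    "7": (2, 8),
    "8": (3, 4),
    ">=9": (43, 80),
}


def pooled_error_rate(lengths):
    """Percentage of erroneous calls pooled over the given length classes."""
    errors = sum(counts[key][0] for key in lengths)
    total = sum(counts[key][1] for key in lengths)
    return round(100 * errors / total, 2)


print(pooled_error_rate(counts))                  # 7.58 (74/976)
print(pooled_error_rate(["3", "4", "5"]))         # 1.99 (17/856)
print(pooled_error_rate(["6", "7", "8", ">=9"]))  # 47.5 (57/120)
```

The pooled rate over all length classes, 7.58%, matches the error fraction reported within homopolymer regions.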

Summary statistics table (the last three columns give the percentage increase between successive regions):

                                   HVS-I    HVS-I/HVS-II   CR       mtGenome   HVS-I → HVS-I/HVS-II   HVS-I/HVS-II → CR   CR → mtGenome
# Variants a                       522      892            1102     4022       -                      -                   -
# Haplotypes a                     94       102            105      107        8.51%                  2.94%               1.90%
# Unique haplotypes a              84       98             103      107        16.67%                 5.10%               3.88%
Mean # of pairwise differences a   7.36     9.66           11.41    39.15      31.25%                 18.12%              243.12%
Range of differences               (values not recovered)
HD                                 0.9967   0.9989         0.9996   1          0.22%                  0.07%               0.04%
RMP                                0.0126   0.0104         0.0097   0.0093     –17.46%                –6.73%              –4.12%
(row label not recovered)          65       75             79       88         15.38%                 5.33%               11.39%
# Haplogroups b                    43       56             60       74         30.23%                 7.14%               23.33%

Haplogroup assignment by sample

Sample   CR            mtGenome
N021     M9a'b         G
N044     M9a1a1c1b1    D4
N079     M33-16362     G2b1a
N083     D4k           C4a1-195
N088     M74           D4j7/D4j-16311
N094     R6            D4-195C
N096     M33-16362     G1c

For Research, Forensic or Paternity Use Only. Not for use in diagnostic procedures.