

ABSTRACT

The 1000 Genomes Project is the first international collaborative project to sequence the genomes of populations at low coverage (4X) in order to identify genetic variants with frequencies of at least 1%. The generated data provide a comprehensive resource on human genetic variation through the characterization of many millions of multi-allelic SNPs, several classes of structural variants (SVs), and their haplotype contexts. Analysis of variation data is a critical step in the interpretation of sequencing data. Although genetic variant data from the 1000 Genomes Project are freely and publicly available for research, high-resolution analysis is yet to be done. To address this deficiency, we began exploring the genetic variant data of 96 Pakistani individuals (PJL sub-population). VCFtools was used to mine genomic variant data specific to the PJL sub-population. Sample analysis revealed a total of 62 families, 42 of which are included in the trio study. SNPs, InDels and transition-to-transversion (Ts/Tv) ratios were calculated for each chromosome. SNP densities calculated at 1 Mb intervals showed that chr4 is the least variable chromosome in the PJL sub-population, while chr22 has the most variable pattern. Principal Component Analysis (PCA) was performed with the R statistical package to observe the relationships among chromosomes within the PJL data and to develop a comparative model against the whole 1000 Genomes Project data. This study will also help to identify demographic history in the future.

INTRODUCTION

Understanding genetic variation is essential for decoding the traces that evolution has left in our genomes, and the availability of whole-genome sequence data now allows us to interpret these signals at a resolution never possible before. Genetic variation in humans generally follows clines defined by geographical regions, and there are possibly very few fixed differences between any pair of continents or populations. Nevertheless, genetic differences among populations exist, mainly reflecting past demographic events. Common population-specific SNPs are non-randomly distributed throughout the genome. In some cases, differences accumulate as an adaptation to population-specific environmental pressure, a process known as positive selection.

Pakistan is situated at the crossroads of the Indian subcontinent, Central Asia, and the Middle East. With an ethnically and linguistically diverse population of >170 million, Pakistan is the sixth most populous country in the world. Most of the Pakistani population has an ancestral north Indian origin and is genetically close to Middle Easterners, Central Asians and Europeans. The data produced by the 1000 Genomes Project have made it possible to reconstruct the complex evolutionary history of the human species in remarkable detail. The Pakistani sub-population (PJL) data within the South Asian population represent characteristic variation sets that will be an important asset for improving the genetic variation map of this region. Remarkably, this simple approach, if applied to whole-genome sequences from large population samples, usually seems to lead directly to the functional variants responsible for the differentiated phenotype.

In this study, we set out to understand the genetic differences between the PJL sub-population and the 1000 Genomes Project as a whole. In this initial phase of the project, the emphasis was on counting the genomic variants, supported by multivariate analysis, to develop a model representing the divergence at the chromosome level. In the future, the information generated by this work will be used to further explore the abundant phenotypic variation and to uncover evolutionary history.

COMPUTATIONAL METHODS

The *.vcf files (v4.2) for each chromosome (chr1–chr22) were downloaded from EBI's FTP server of the 1000 Genomes Project, along with other accessory files (ped file, panel file, etc.).

The vcf-subset script was used with the following options to generate *.vcf files containing only the PJL samples:

"-c": the list of PJL sample IDs to be kept in the PJL *.vcf files for each chromosome, and

"-p": print only those sites that have alternative alleles in the PJL samples, skipping any sites where all PJL samples carry the REF allele.
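The filtering behaviour described above can be illustrated with a minimal Python sketch. This is not the vcf-subset script itself and makes no attempt to reproduce its internals (for example, it does not update INFO fields); it only mirrors the two steps as the text describes them: keep a chosen set of sample columns, then drop sites where every kept sample is homozygous reference. The function names and the sample IDs (`S1`, `S2`, `S3`) are hypothetical.

```python
def has_alt_allele(sample_field):
    """Return True if the GT sub-field carries at least one non-REF allele."""
    genotype = sample_field.split(":")[0]           # GT is the first FORMAT key
    alleles = genotype.replace("|", "/").split("/")  # phased or unphased
    return any(allele not in ("0", ".") for allele in alleles)


def subset_vcf_lines(lines, keep_samples):
    """Yield VCF lines restricted to keep_samples, skipping all-REF sites.

    lines: iterable of VCF text lines (meta lines, header line, records).
    keep_samples: set of sample IDs to retain (hypothetical IDs in the demo).
    """
    keep_idx = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line                               # pass meta lines through
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Columns 0-8 are the fixed VCF columns; samples start at column 9.
            keep_idx = [i for i, name in enumerate(fields)
                        if i >= 9 and name in keep_samples]
            yield "\t".join(fields[:9] + [fields[i] for i in keep_idx])
            continue
        if keep_idx is None:
            raise ValueError("record encountered before the #CHROM header line")
        genotypes = [fields[i] for i in keep_idx]
        if any(has_alt_allele(g) for g in genotypes):
            yield "\t".join(fields[:9] + genotypes)


demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0|0\t0|0\t0|1",
]
subset = list(subset_vcf_lines(demo, {"S1", "S2"}))
```

In the demo, the site at position 200 is dropped because only the excluded sample `S3` carries the alternative allele.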

BCFtools stats (1.1+htslib-1.1) was used to count SNPs and InDels and to compute the Ts/Tv ratio; SNP densities were calculated in defined bins of 1 Mb using the SNPdensity output statistics option.
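For readers unfamiliar with these statistics, the sketch below shows one way they can be computed in plain Python. It is an illustration under stated assumptions, not a reimplementation of BCFtools or VCFtools: SNPs are given as hypothetical (REF, ALT) single-base pairs, positions are 1-based, and the binning convention (bin k covers positions k × 1,000,000 + 1 through (k + 1) × 1,000,000) is chosen here for simplicity and may differ from the tools' own.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps purine<->purine or pyrimidine<->pyrimidine."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Return (transitions, transversions, Ts/Tv) for (ref, alt) base pairs."""
    ts = tv = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    ratio = ts / tv if tv else float("nan")  # undefined without transversions
    return ts, tv, ratio


def snp_density(positions, bin_size=1_000_000):
    """Count SNPs per bin of bin_size bases, using 1-based positions."""
    return Counter((pos - 1) // bin_size for pos in positions)


ts, tv, ratio = ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
density = snp_density([1, 1_000_000, 1_000_001, 2_500_000])
```

Here the three example substitutions give two transitions and one transversion, and the four example positions fall into bins 0, 0, 1 and 2.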

Perl API scripts of VCFtools (v0.1.11) were used to extract the PJL sub-population.

CONCLUSION

Genetic variant data of the PJL sub-population show that adaptation has been frequent in our evolutionary history.

Much more focus is needed on chr4 and chr22 of the PJL data, as these two chromosomes have the most distinctive patterns.

PCA provides simple models representing the comparative behavior at the population level.

Using the 1000 Genomes Project data, a more comprehensive genetic variation map of PJL will be produced to support the study of evolutionary pressure on the PJL genome.

REFERENCES

1. http://www.1000genomes.org/

2. The 1000 Genomes Project Consortium, An integrated map of genetic variation from 1,092 human genomes, Nature 491, 2012, 56–65, doi:10.1038/nature11632.

Analysis of Genomic Variants of Pakistani Sub-Population Sequenced by the 1000 Genomes Project

Waqasuddin Khan,* Ishtiaq A. Khan, and M. Kamran Azim

Jamil-ur-Rahman Center for Genome Research, Dr. Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences,

University of Karachi, Karachi-75270, Pakistan.

*[email protected]

The 96 PJL sample (individual) IDs were extracted manually from the 1000 Genomes Project's panel file.
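The manual extraction of sample IDs could also be scripted. The sketch below assumes the panel file is tab-separated with a header row containing `sample` and `pop` columns; that layout, the function name and the sample IDs in the demo are assumptions for illustration and should be checked against the actual file.

```python
import csv
import io


def samples_for_population(panel_text, population="PJL"):
    """Return sample IDs whose population code matches `population`.

    panel_text: contents of a tab-separated panel file with a header row
    that includes 'sample' and 'pop' columns (assumed layout).
    """
    reader = csv.DictReader(io.StringIO(panel_text), delimiter="\t")
    return [row["sample"] for row in reader if row["pop"] == population]


# Hypothetical panel contents; real IDs and rows will differ.
demo_panel = (
    "sample\tpop\tsuper_pop\tgender\n"
    "SAMPLE_A\tPJL\tSAS\tmale\n"
    "SAMPLE_B\tGBR\tEUR\tfemale\n"
    "SAMPLE_C\tPJL\tSAS\tfemale\n"
)
pjl_ids = samples_for_population(demo_panel)
```

The resulting list can be joined with commas or written to a file, one ID per line, for use as the sample list when subsetting the VCF files.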

PCA was performed with the R statistical package (v3.1.2).

RESULTS AND DISCUSSION

Fig. 1. SNP and InDel counts.

Fig. 2. InDel frequency as calculated by BCFtools.

Fig. 3. Substitution types as calculated by BCFtools.

Fig. 4. Counts of Ts and Tv and their ratios as calculated by BCFtools.

Fig. 5. To adjust SNP ratios to a scale of 0–1, corrected SNP counts were calculated with the following formula:

\[
\text{Corrected SNP Count} = \frac{\text{Ratio of Total SNP Count} - \text{Minimum SNP Count}}{\text{Maximum SNP Count} - \text{Minimum SNP Count}}
\]
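The corrected SNP count formula has the form of a min-max rescaling, which maps the smallest value to 0 and the largest to 1. A minimal Python sketch follows; the input values are hypothetical per-chromosome figures, and returning zeros when all values are equal is a choice made here to avoid division by zero, not something specified in the source.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] via (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the formula is undefined, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical per-chromosome values, for illustration only.
corrected = min_max_scale([10, 20, 30])
```

With the example input, the minimum (10) maps to 0.0, the midpoint (20) to 0.5 and the maximum (30) to 1.0.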

Fig. 6. Heat map of corrected SNP counts produced with an R function. Colors are scaled according to the transformed SNP densities, running from orange (low Z-scores) to light yellow (high Z-scores). Each row represents a chromosome, labelled on the vertical axis (right), and each column shows the SNP densities, labelled on the horizontal axis (bottom) of the heat map. The dendrogram obtained from the hierarchical cluster analysis is displayed on the left. Chromosomes are clustered on the basis of SNP densities.

Fig. 7. Exploratory multivariate analysis of SNP densities with an R package. PCA of (A) the PJL sub-population and (B) the 1000 Genomes Project. Both quantitative and qualitative variables, along with supplementary variables and observations, were included in the analysis. The red circle in (A) marks chr4, which stands out most strongly in terms of SNP densities.
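As background for the PCA in Fig. 7, the following pure-Python sketch extracts the first principal component of a small data matrix by power iteration on the covariance matrix. This is a swapped-in, simplified technique for illustration and not the R package used for the figure: it handles quantitative variables only, returns a single component, and assumes observations (for example, chromosomes) are rows and variables (for example, density bins) are columns. The demo matrix is hypothetical.

```python
import math


def center_columns(rows):
    """Subtract each column's mean from its entries."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix (divides by n - 1) of centered rows."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores, explained variance ratio) for PC1.

    Power iteration converges to the dominant eigenvector provided the
    starting vector is not orthogonal to it.
    """
    centered = center_columns(rows)
    cov = covariance_matrix(centered)
    p = len(cov)
    vec = [1.0 / math.sqrt(p)] * p              # uniform unit start vector
    for _ in range(iterations):
        nxt = mat_vec(cov, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:                         # no variance in the data
            break
        vec = [x / norm for x in nxt]
    eigenvalue = sum(v * c for v, c in zip(vec, mat_vec(cov, vec)))
    total_var = sum(cov[i][i] for i in range(p))
    scores = [sum(x * w for x, w in zip(row, vec)) for row in centered]
    ratio = eigenvalue / total_var if total_var else float("nan")
    return vec, scores, ratio


# Hypothetical matrix whose second column is exactly twice the first.
loadings, scores, explained = first_principal_component(
    [[1, 2], [2, 4], [3, 6], [4, 8]])
```

Because the two demo columns are perfectly correlated, the first component points along (1, 2) and accounts for essentially all of the variance.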

Table 1. Individuals in the PJL sub-population as reported by the 1000 Genomes Project

Relationship         Individuals
Father               52
Mother               52
Child                54
Unrelated            1
Total Individuals    159

Table 2. Classification of families on the basis of the individuals selected, as reported by the 1000 Genomes Project

Family composition                     Families
Families Having Father Only            3
Families Having Mother Only            3
Families Having Child Only             3
Families Having Father/Child           4
Families Having Mother/Child           4
Families Having Father/Mother          3
Families Having Father/Mother/Child    42
Total Families                         63
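A classification like the one in Table 2 can be derived by grouping individuals by family and recording which roles are present. The sketch below assumes each record is a (family ID, relationship) pair, such as might be read from the ped file; the record layout, the role labels and the family IDs are assumptions for illustration.

```python
from collections import Counter, defaultdict


def classify_families(records):
    """Count families by the set of roles present in each family.

    records: iterable of (family_id, relationship) pairs, where the
    relationship is a label such as 'father', 'mother' or 'child'.
    Returns a Counter keyed by the sorted roles joined with '/'.
    """
    roles_by_family = defaultdict(set)
    for family_id, relationship in records:
        roles_by_family[family_id].add(relationship)
    return Counter("/".join(sorted(roles))
                   for roles in roles_by_family.values())


# Hypothetical records, for illustration only.
demo_records = [
    ("F1", "father"), ("F1", "mother"), ("F1", "child"),
    ("F2", "mother"), ("F2", "child"),
    ("F3", "child"),
]
family_counts = classify_families(demo_records)
```

In the demo, family `F1` is counted as a complete trio, `F2` as a mother/child pair and `F3` as child only.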