2016 INBRE In Silico SNP Analysis

Acknowledgements

We would like to thank Plymouth State University, the PSU Research Advisory Council, and the PSU Student Research Advisory Council. Research reported in this poster was supported by an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under grant number P20GM103506.

Conclusions

Department of Biological Sciences at Plymouth State University, Plymouth NH

References

In Silico Analysis of the Impact of SNPs on the CTGF Protein

KM Jesseman, SL Peterson, and HE Doherty PhD

Future Directions

Top Scoring Variants Predicted to Alter CTGF Protein Structure

WMV Scores are Normally Distributed

Overexpression of Connective Tissue Growth Factor (CTGF) during healing leads to excessive deposition of extracellular matrix (ECM) proteins, known as fibrosis. Fibrotic tissue buildup decreases organ function and can eventually lead to organ failure. Nonsynonymous single nucleotide polymorphisms (SNPs) have the potential to alter a protein’s structure and/or its function. Nonsynonymous SNPs in the CTGF gene were evaluated using nine programs that analyze the potential impact each SNP has on the CTGF protein. Values from all of the programs were combined into a weighted majority vote (WMV) score and each SNP was designated as “deleterious,” “possibly deleterious,” “possibly neutral,” or “neutral” based on the calculated WMV value. Overall, CTGF SNP WMV scores were normally distributed. Regions of both high and low SNP density were observed in the CTGF protein. Two domains within CTGF contained a relatively large number of SNPs predicted to be deleterious. Further analysis is needed to determine whether these areas are important functional regions or are simply less conserved in humans. The three variants that received the highest WMV scores were modeled using Phyre2 to investigate their impact on CTGF protein structure. The predicted structures suggest these SNPs have a considerable impact on the structure of the protein and therefore they may affect the function as well. In the future, SNP impacts on CTGF function will be investigated in a tissue culture model of wounding. Identification of SNPs that may alter fibrosis risk will aid in the development of individualized treatments to reduce severity of fibrotic disease and improve long-term prognosis.

In Silico Analysis of Missense SNPs

• A spreadsheet of the 913 CTGF SNPs was downloaded from Ensembl (Build 38, Release 85) and sorted by type in Microsoft Excel 2016
• 131 of 136 CTGF missense SNPs were used in further analyses
• Nonsense SNPs and SNPs that alter translation start site were excluded from WMV analysis due to program limitations
• Predictions of each SNP’s potential impact on protein function were generated using web-based programs: Mutation Assessor (release 3), PANTHER-PSEP (v9.0), PROVEAN (v1.1.3), Align GVGD, MutPred, PolyPhen2 (HumDiv), SNPs&GO, and FATHMM.
• SIFT scores provided by PROVEAN were also used in our analysis
• For the program Align GVGD, an alignment between the 41 most closely related mammalian CTGF homologues was used

Weighted Majority Vote Scoring

• Weighted Majority Vote (WMV) scores were calculated for each SNP
• Results from each program for individual SNPs were given a score between -2 and +2 as described by Frousios et al. (2013). Possible program outputs and their corresponding scores are shown in Table 1. In general:
  • A score of -2 was used when a mutation was predicted to be neutral by a program
  • A score of +2 was used when a mutation was predicted to be deleterious
  • Scores of +1 were used when a program predicted intermediate probabilities of deleteriousness (e.g. “possibly damaging”)
• Total WMV scores were calculated by summing scores for each SNP from all programs. SNPs were assigned to one of four groups based on their score:
  • “Deleterious” - total WMV between 10 and 18
  • “Possibly Deleterious” - total WMV between 1 and 9
  • “Possibly Neutral” - total WMV between 0 and -9
  • “Neutral” - total WMV between -10 and -18
• A heat map of SNPs by WMV score was produced using Illustrator for Biological Sequences (v 1.0.1)
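The scoring and grouping rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' workflow (the poster describes the analysis as done with web-based programs and a spreadsheet). The output labels and scores are transcribed from Table 1; the function names, the dictionary layout, and the example SNP are hypothetical.

```python
# Score assigned to each categorical program output, transcribed from Table 1.
PROGRAM_SCORES = {
    "SIFT": {"Damaging": 2, "Tolerated": -2},
    "Mutation Assessor": {"High": 2, "Medium": 1, "Low": 1, "Neutral": -2},
    "PANTHER-PSEP": {"Probably Damaging": 2, "Possibly Damaging": 1,
                     "Probably Benign": -2},
    "PROVEAN": {"Deleterious": 2, "Neutral": -2},
    "Align GVGD": {"C65": 2, "C55": 2, "C45": 1, "C35": 1, "C25": 1,
                   "C15": -2, "C0": -2},
    "FATHMM": {"Deleterious": 2, "Neutral": -2},
    "PolyPhen2": {"Probably Damaging": 2, "Possibly Damaging": 1, "Benign": -2},
    "SNPs&GO": {"Disease": 2, "Neutral": -2},
}


def mutpred_score(prob):
    """MutPred reports a probability; Table 1 bins it into +2, +1, or -2."""
    if prob > 0.8:
        return 2
    if prob >= 0.5:
        return 1
    return -2


def wmv_total(predictions):
    """Sum the per-program scores for one SNP."""
    total = 0
    for program, output in predictions.items():
        if program == "MutPred":
            total += mutpred_score(output)
        else:
            total += PROGRAM_SCORES[program][output]
    return total


def classify(total):
    """Assign one of the four groups from the total WMV score."""
    if total >= 10:
        return "Deleterious"
    if total >= 1:
        return "Possibly Deleterious"
    if total > -10:
        return "Possibly Neutral"
    return "Neutral"


# Hypothetical SNP: eight programs lean deleterious, one (FATHMM) neutral.
example_snp = {
    "SIFT": "Damaging", "Mutation Assessor": "Medium",
    "PANTHER-PSEP": "Probably Damaging", "PROVEAN": "Deleterious",
    "Align GVGD": "C65", "MutPred": 0.85, "FATHMM": "Neutral",
    "PolyPhen2": "Probably Damaging", "SNPs&GO": "Disease",
}
total = wmv_total(example_snp)
print(total, classify(total))  # 13 Deleterious
```

Keeping MutPred separate from the lookup table reflects that it is the only program in Table 1 scored from a numeric probability rather than a category label.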

Protein Modeling

• Phyre2 (http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index) was used to predict the structures of the full-length wild type (WT) CTGF protein before and after addition of SNPs
• Predicted structures were colored and visualized with UCSF Chimera (v 1.10.1)

Methods

Abstract

SNP Prediction Programs

SNP Density is Variable within CTGF

Figure 2: CTGF protein with heat map depicting locations of the 131 CTGF SNPs colored by WMV score. The 17 SNPs predicted to be deleterious are labeled. Blue colored boxes represent functional domains or motifs and grey areas are unclassified regions in between. The CTGF protein has a Secretion-Signaling Peptide (SP) sequence, an Insulin-like Growth Factor Binding Protein domain (IGFBP), a Von Willebrand Factor Type C like domain (VWFC), a Thrombospondin Type I-like domain (TSP), and a C-terminal Cysteine Knot (CTCK) motif. The C-terminal end of the protein also contains a heparin binding (HB) region.

Results: The SP domain only contains neutral SNPs and has the lowest density of SNPs, suggesting that this region is less tolerant of them and highly conserved in humans. The areas with the fewest SNPs, such as the N-terminal ends of the IGFBP or CTCK domains, appear to be well-conserved and also likely play important roles in structure or function. There are large clusters of SNPs located in the central to C-terminal regions of the IGFBP and CTCK domains. These SNP-dense areas are likely under less selective pressure. The TSP and IGFBP domains contain a large number of SNPs predicted to be deleterious. Presence of deleterious SNPs in regions well-conserved across species suggests either that the regions may be less conserved and less functionally important in humans or that these SNPs are occurring in important functional regions and may have strong potential for phenotypic impact. Further analysis is required to elucidate the effects of the SNPs at these positions.

Table 1: The nine programs used to evaluate the nonsynonymous SNPs within CTGF. The type of data each program uses to make predictions, a description of how the program works, and the outputs of each program and their corresponding WMV scores are listed. Each program uses structural/biochemical information, evolutionary information, machine learning, or a combination of these methods. No two programs used the same methodology to predict impact on the protein. Programs relying exclusively on structural information were not used because there currently is no crystal structure of CTGF. *Abbreviations: Sorting Intolerant From Tolerant (SIFT), Protein ANnotation THrough Evolutionary Relationship – Position Specific Evolutionary Preservation (PANTHER-PSEP), PROtein Variation Effect ANalyzer (PROVEAN), Align Grantham Variation Grantham Deviation (Align GVGD), Functional Analysis Through Hidden Markov Models (FATHMM), and SNPs and Gene Ontology (SNPs&GO).


Figure 1: Distribution of total WMV scores for the 131 CTGF missense SNPs.

Results: SNPs were categorized as “deleterious” or “neutral” only if the WMV score was ≥ 10 (deleterious) or ≤ -10 (neutral), which mathematically requires at least 7 of the 9 programs to give the same prediction. By this method, 17 SNPs were predicted to be deleterious and 23 SNPs were predicted to be neutral. The remaining 91 SNPs fell between these thresholds, with 48 predicted to be possibly deleterious and 43 possibly neutral. Overall, the scores follow a roughly normal distribution.
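The "at least 7 of the 9 programs" statement can be checked by brute force. The sketch below assumes, following Table 1, that five programs (Mutation Assessor, PANTHER-PSEP, Align GVGD, MutPred, PolyPhen2) can return -2, +1, or +2 and the other four return only -2 or +2. It reads "the same prediction" as "a score of the same sign", which is an interpretation rather than something the poster states. Variable names are illustrative.

```python
from itertools import product

# Allowed scores per program, following Table 1.
THREE_LEVEL = (-2, 1, 2)   # programs with an intermediate (+1) output
TWO_LEVEL = (-2, 2)        # programs with only two outputs
ALLOWED = [THREE_LEVEL] * 5 + [TWO_LEVEL] * 4

combos = list(product(*ALLOWED))  # 3**5 * 2**4 = 3888 score combinations

# Fewest positive-scoring programs among combinations reaching a total >= 10.
min_deleterious = min(
    sum(s > 0 for s in combo) for combo in combos if sum(combo) >= 10
)

# Fewest negative-scoring programs among combinations reaching a total <= -10.
min_neutral = min(
    sum(s < 0 for s in combo) for combo in combos if sum(combo) <= -10
)

print(min_deleterious, min_neutral)  # 7 7

# The four reported group sizes also account for all 131 missense SNPs.
counts = {"Deleterious": 17, "Possibly Deleterious": 48,
          "Possibly Neutral": 43, "Neutral": 23}
print(sum(counts.values()))  # 131
```

The boundary cases are seven programs scoring +2 with two scoring -2 (total +10), and the mirror image for neutral (total -10).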

1. Adzhubei IA, et al. (2010) A method and server for predicting damaging missense mutations. Nature Methods, 7(4), 248-249.
2. Adzhubei I, et al. (2013) Predicting functional effect of human missense mutations using PolyPhen-2. Current Protocols in Human Genetics, 7-20.
3. Calabrese R, et al. (2009) Functional annotations improve the predictive score of human disease-related mutations in proteins. Human Mutation, 30(8), 1237-1244.
4. Capriotti E, et al. (2005) I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Research, 33(suppl 2), W306-W310.
5. Capriotti E, et al. (2013) WS-SNPs&GO: a web server for predicting the deleterious effect of human protein variants using functional annotation. BMC Genomics, 14(3), 1.
6. Choi Y, et al. (2012) Predicting the functional effect of amino acid substitutions and indels. PLoS ONE, 7(10), e46688.
7. Colwell A, et al. (2005) Hypertrophic scar fibroblasts have increased connective tissue growth factor expression after transforming growth factor-beta stimulation. Plastic and Reconstructive Surgery, 116(5), 1387-1390.
8. Datta A, et al. (2015) Functional and Structural Consequences of Damaging Single Nucleotide Polymorphisms in Human Prostate Cancer Predisposition Gene RNASEL. BioMed Research International, 2015.
9. De Baets G, et al. (2011) SNPeffect 4.0: on-line prediction of molecular and structural effects of protein-coding variants. Nucleic Acids Research, gkr996.
10. Dessein A, et al. (2009) Variants of CTGF are associated with hepatic fibrosis in Chinese, Sudanese, and Brazilians infected with Schistosomes. The Journal of Experimental Medicine, 206(11), 2321-2328.
11. Frazier K, et al. (1996) Stimulation of Fibroblast Cell Growth, Matrix Production, and Granulation Tissue Formation by Connective Tissue Growth Factor. The Journal of Investigative Dermatology, 107(3), 404-411.
12. Frousios K, et al. (2013) Predicting the functional consequences of non-synonymous DNA sequence variants—evaluation of bioinformatics tools and development of a consensus strategy. Genomics, 102(4), 223-228.
13. Igarashi A, et al. (1993) Regulation of Connective Tissue Growth Factor Gene Expression in Human Skin Fibroblasts and During Wound Repair. Molecular Biology of the Cell, 4, 637-645.
14. Ivkovic S, et al. (2003) Connective tissue growth factor coordinates chondrogenesis and angiogenesis during skeletal development. Development, 130(12), 2779-2791.
15. Kawaguchi Y, et al. (2009) Association study of a polymorphism of the CTGF gene and susceptibility to systemic sclerosis in the Japanese population. Annals of the Rheumatic Diseases, 68(12), 1921-1924.
16. Kelley LA, et al. (2015) The Phyre2 web portal for protein modeling, prediction and analysis. Nature Protocols, 10(6), 845-858.
17. Kumar P, et al. (2009) Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature Protocols, 4(7), 1073-1081.
18. Li B, et al. (2009) Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics, 25(21), 2744-2750.
19. Liu W, et al. (2015) IBS: an illustrator for the presentation and visualization of biological sequences. Bioinformatics, 31(20), 3359-3361.
20. Mathe E, et al. (2006) Computational approaches for predicting the biological effect of p53 missense mutations: a comparison of three sequence analysis based methods. Nucleic Acids Research, 34(5), 1317-1325.
21. Mi H, et al. (2013) Large-scale gene function analysis with the PANTHER classification system. Nature Protocols, 8(8), 1551-1566.
22. Reva B, et al. (2011) Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Research, gkr407.
23. Schwarz JM, et al. (2010) MutationTaster evaluates disease-causing potential of sequence alterations. Nature Methods, 7(8), 575-576.
24. Shihab HA, et al. (2013) Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Human Mutation, 34(1), 57-65.
25. Tang H & Thomas PD (2016) PANTHER-PSEP: predicting disease-causing genetic variants using position-specific evolutionary preservation. Bioinformatics, btw222.
26. Thomas PD, et al. (2006) Applications for protein sequence–function evolution data: mRNA/protein expression analysis and coding SNP scoring tools. Nucleic Acids Research, 34(suppl 2), W645-W650.
27. Tang H & Thomas PD (2016) Tools for Predicting the Functional Impact of Nonsynonymous Genetic Variation. Genetics, 203(2), 635-647.
28. Wynn TA (2007) Common and unique mechanisms regulate fibrosis in various fibroproliferative diseases. The Journal of Clinical Investigation, 117(3), 524-529.

• CTGF shows substantial variability within the population represented by SNPs deposited in the Ensembl database

• The SP domain is well-conserved and likely important for the function of human CTGF

• Further research is needed to determine if the IGFBP or TSP domains are functionally important

• C82F, C213R, and C228F variants may result in protein structure, function, and/or stability changes

• Classification of novel SNPs identified in our PSU sample population using WMV scores and protein modeling

• Investigation of published CTGF SNPs in other regions of the gene

• Characterization of deleterious mutations of interest in a tissue culture model of wounding

Figure 3: Predicted WT CTGF protein structure and changes in the predicted structure after C82F, C213R, or C228F mutations. A) Predicted structure of the WT CTGF protein. B-D) Zoomed in views of the Cysteine residues affected by the C82F, C213R, and C228F mutations respectively. E-G) Zoomed in views of the mutated residues C82F, C213R, and C228F respectively. H-J) Full predicted CTGF protein models after mutations C82F, C213R, and C228F respectively. The backbones of all models are colored by confidence from red (>90%) to yellow (<90% but >10%) to blue (<10%). All cysteine residues are highlighted in green and affected amino acids are colored by element (gray - Carbon, yellow - Sulphur, blue - Nitrogen).

Results: All mutations modeled change a Cysteine (C) to a Phenylalanine (F) or an Arginine (R). Cysteine is small, polar, neutrally charged, and important in creating disulfide bonds. In the C82F (B, E, and H) and C228F (D, G, and J) variants, the small residue is replaced with a very bulky and hydrophobic aromatic ring which abrogates potential disulfide bonds and produces very large changes in predicted protein structure. Furthermore, these mutations place a hydrophobic amino acid near the surface of the CTGF protein, likely making it less stable. In the C213R (C, F, and I) variant, the Cysteine has no clear binding partner in the WT predicted protein structure. However, the change does place a large and hydrophilic group in a very compact area in the interior of the protein, resulting in changes in the protein structure that place the variant Arginine closer to the protein surface. Overall, the three variants with the highest WMV scores are predicted to considerably alter the structure of the CTGF protein.

The protein connective tissue growth factor (CTGF) is involved in many processes within the body including development, extracellular matrix (ECM) maintenance, cell proliferation, and wound healing. Overexpression of the CTGF gene during healing leads to scar tissue development, also known as fibrosis, from the excessive deposition of ECM proteins such as collagens. CTGF overexpression can lead to the excess scar tissue seen in pulmonary fibrosis, liver cirrhosis, cardiac fibrosis and eventually cause organ system failures. The lack of targeted therapies for fibrotic diseases has created a need for additional research into refined therapeutic approaches to prevent disease progression and organ failure. Presence of single nucleotide polymorphisms (SNPs) in CTGF could alter ECM deposition, induce structural changes, and potentially alter protein function. Nonsynonymous SNPs are of particular interest because they cause an amino acid change. There are currently 136 published nonsynonymous SNPs in the CTGF gene (Ensembl build 38, release 85). Importantly, variations within the CTGF gene that alter structure and/or function of the CTGF protein could impact the severity of fibrosis in patients.

Programs designed for evaluating nonsynonymous SNPs were used to analyze the impact each SNP had on the resulting CTGF protein. Predictions were combined using a weighted majority vote (WMV) score. The SNPs that were predicted to be most deleterious were further analyzed with the Phyre2 protein modeling program to evaluate their impact on CTGF protein structure. The most deleterious variants can further be characterized in tissue culture by inserting a vector with the mutated CTGF gene into fibroblasts. Identification of the SNPs that may increase or decrease risk of fibrosis could aid in the development of individualized treatments for fibrotic diseases.

Introduction

[Figure 3 panels: columns "Predicted CTGF WT Protein" and "Predicted CTGF Mutated Protein"; rows C82F, C213R, C228F; panel labels E-J.]

Program | Program Type (Structural/Biochemical, Evolutionary, Machine Learning) | Description | Outputs and WMV Scores
SIFT* | ✓ | Calculates a conservation score and scaled probability based on BLAST protein sequence alignments of similar proteins, inside or outside of a protein’s family | Damaging +2; Tolerated -2
Mutation Assessor | ✓ | Evaluates evolutionary conservation data based on alignments derived exclusively from protein families and/or subfamilies to score the mutation | High +2; Medium/Low +1; Neutral -2
PANTHER-PSEP* | ✓ | Uses evolutionary branch length to predict whether a SNP may be damaging or benign | Probably Damaging +2; Possibly Damaging +1; Probably Benign -2
PROVEAN* | ✓ ✓ | Compares protein alignment scores between the WT vs. variant AA and WT vs. homologous proteins. Differences in these alignment scores are used to predict whether a SNP is Deleterious or Neutral | Deleterious +2; Neutral -2
Align GVGD* | ✓ ✓ | Compares biochemical variation across WT AAs in aligned species, then determines whether the mutant of interest falls within or outside the range of natural variation across aligned species | Class C55, C65 +2; Class C25, C35, C45 +1; Class C0, C15 -2
MutPred | ✓ ✓ | Uses machine learning to predict loss or gain of structural or functional properties and calculates the probability of a SNP being deleterious | Prob > 0.8 +2; 0.5 ≤ Prob ≤ 0.8 +1; Prob < 0.5 -2
FATHMM* | ✓ ✓ | Uses machine learning based upon multiple sequence alignments and alignments of conserved protein domains with the protein of interest to predict deleteriousness and possible phenotypic outcomes | Deleterious +2; Neutral -2
PolyPhen2 (HumDiv) | ✓ ✓ ✓ | Uses machine learning based upon sequence annotations, evolutionary conservation, biochemical properties of amino acids, and protein structure to assign a score | Probably Damaging +2; Possibly Damaging +1; Benign -2
SNPs&GO* | ✓ ✓ ✓ | Uses machine learning assessing biochemical, sequence profile, conservation, and gene ontology characteristics to calculate probability of the SNP causing disease | Disease +2; Neutral -2

[Figure 1 plot: bars for the categories Deleterious, Possibly Deleterious, Possibly Neutral, and Neutral; category axis labeled "Total WMV Score"; value axis ticks from 0 to 50. Color scale labels: Deleterious, Neutral.]
