Chapter-3
HOMOLOGY MODELING
3.1 HOMOLOGY MODELING
One technique that can be applied to generate reasonable models of protein
structures is homology modeling. This procedure, also termed comparative modeling
or knowledge-based modeling, develops a three-dimensional model from a protein
sequence based on the structures of homologous proteins.
3.1.1 Swiss-Prot
Swiss-Prot is a protein sequence database maintained by the Swiss Institute of
Bioinformatics (SIB). It was established in 1986 and strives to provide reliable
protein sequences with a high level of annotation, such as the description of a
protein's function, its domain structure, post-translational modifications and
variants. Swiss-Prot later joined forces with other protein resources to form the
UniProt consortium. The UniProt Knowledgebase (UniProtKB) provides the central
database of protein sequences with accurate, consistent and rich sequence and
functional annotation, and is among the most widely used protein information
resources. The group also develops and maintains other databases, including
PROSITE, a database of protein families and domains, and ENZYME, a database of
enzyme nomenclature.
For this study, the protein sequence database was scanned for Glutathione
S-transferase (GST) sequences, and the resulting GST proteins were listed so that
one could be selected for homology modeling.
Figure 3. 1: Image of Swiss prot protein sequence database.
3.1.2 Protein Sequence Selection
The protein Q08392 was identified from the Swiss-Prot database. In total, six
proteins were observed:
1. Q08392 (GSTA1_CHICK),
2. Q08393 (GSTA2_CHICK),
3. P80894 [Antechinus stuartii (Brown marsupial mouse)],
4. P46428 [Anopheles gambiae (African malaria mosquito)],
5. Q7REH6 [Plasmodium yoelii yoelii],
6. P46423 [Hyoscyamus muticus (Egyptian henbane)].
The protein-protein BLAST program (blastp) from NCBI was used to scan each
query protein sequence against the PDB structure database. Swiss-Prot entry Q08392
was scanned first, and the other five protein sequences were scanned in the same
way. The results are discussed in detail in the Results section. The GST sequences
extracted from the Swiss-Prot protein sequence database are given below in FASTA
format.
1. Q08392 (GSTA1_CHICK)
>sp|Q08392|GSTA1_CHICK Glutathione S-transferase OS=Gallus gallus PE=2 SV=1
MSGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPMVEI
DGMKMVQTRAILNYIAGKYNLYGKDLKERALIDMYVEGLADLYELIMMNVVQPADKKEEH
LANALDKAANRYFPVFEKVLKDHGHDFLVGNKLSRADVHLLETILAVEESKPDALAKFPL
LQSFKARTSNIPNIKKFLQPGSQRKPRLEEKDIPRLMAIFH
2. Q08393 (GSTA2_CHICK)
>sp|Q08393|GSTA2_CHICK Glutathione S-transferase OS=Gallus gallus PE=2 SV=2
MAGKPKLHYTRGRGKMESIRWLLAAAGVEFEEEFIEKKEDLEKLRNDGSLLFQQVPMVEIDGMKMVQSR
AILCYIAGKYNLYGKDLKERAWIDMYVEGTTDLMGMIMALPFQAADVKEKNIALITERATTRYFPVYEK
ALKDHGQDYLVGNKLSWADIHLLEAILMTEELKSDILSAFPLLQAFKGRMSNVPTIKKFLQPGSQRKPP
LDEKSIANVRKIFSF
3. Antechinus stuartii (Brown marsupial mouse)
>sp|P80894|GSTA1_ANTST Glutathione S-transferase OS=Antechinus stuartii PE=1 SV=1
MAGEQNIKYFNIKGRMEAIRWLLAVAGVEFEEKFFETKEQLQKLKETVLLFQQVPMVEIDGMKLVQTRA
ILHYIAEKYNLLGKDMKEHAQIIMYSEGTMDLMELIMIYPFLKGEEKKQRLVEIANKAKGRYFPAFENV
LKTHGQNFLVGNQLSMADVQLFEAILMVEEKVPDALSGFPLLQAFKTRISNIPTVKTFLAPGSKRKPVP
DAKYVEDIIKIFYF
4. Anopheles gambiae (African malaria mosquito)
>sp|P46428|GST_ANOGA Glutathione S-transferase OS=Anopheles gambiae GN=GstS1 PE=2 SV=4
MPDYKVYYFNVKALGEPLRFLLSYGNLPFDDVRITREEWPALKPTMPMGQMPVLEVDGKKVHQSVAMSR
YLANQVGLAGADDWENLMIDTVVDTVNDFRLKIAIVAYEPDDMVKEKKMVTLNNEVIPFYLTKLNVIAK
ENNGHLVLGKPTWADVYFAGILDYLNYLTKTNLLENFPNLQEVVQKVLDNENVKAYIAKRPITEV
5. Plasmodium yoelii yoelii
>sp|Q7REH6|GST_PLAYO Glutathione S-transferase OS=Plasmodium yoelii yoelii GN=GST PE=3 SV=1
MTYLYNFFFFFFFFFSRGKAELIRLIFAYLQVKYTDIRFGVNGDAFAEFNNFKKEKEIPFNQVPILEIG
GLILAQSQAIVRYLSKKYNISGNGELNEFYADMIFCGVQDIHYKFNNTNLFKQNETTFLNEELPKWSGY
FEKLLQKNNTNYFVGDTITYADLAVFNLYDDIESKYPNCLKNFPLLKAHIELISNIPNIKHYIANRKES
VY
6. Hyoscyamus muticus (Egyptian henbane)
>sp|P46423|GSTF_HYOMU Glutathione S-transferase OS=Hyoscyamus muticus PE=1 SV=1
MGMKLHGPAMSPAVMRVIATLKEKDLDFELVPVNMQAGDHKKEPFITLNPFGQVPAFEDGDLKLFESRA
ITQYIAHTYADKGNQLLANDPKKMAIMSVWMEVESQKFDPVASKLTFEIVIKPMLGMVTDDAAVAENEE
KLGKVLDVYESRLKDSKYLGGDSFTLADLHHAPAMNYLMGTK
VKSLFDSRPHVSAWCADILARPAWSKAIEYKQ
These six sequences were selected for this study from the thousands of sequences
available. No specific criterion was followed during selection, as almost all
sequences submitted to the database are identical.
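The FASTA records listed above use the UniProt header layout ">sp|accession|entry_name description". As a minimal sketch, such records could be read in Python as shown here. The function name parse_fasta is illustrative and was not part of the original study, and the code assumes that each header sits on a single line.

```python
def parse_fasta(text):
    """Parse FASTA text into {accession: (entry_name, sequence)}.

    Assumes UniProt-style headers: >sp|ACCESSION|ENTRY_NAME description
    """
    records = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header fields are separated by '|'; the third field also
            # carries the free-text description after the entry name.
            _, accession, rest = line[1:].split("|", 2)
            entry_name = rest.split()[0]
            records[accession] = (entry_name, "")
        elif accession is not None:
            # Sequence lines are concatenated onto the current record.
            name, seq = records[accession]
            records[accession] = (name, seq + line)
    return records


# First two sequence lines of entry Q08392, copied from the listing above.
example = """>sp|Q08392|GSTA1_CHICK Glutathione S-transferase OS=Gallus gallus PE=2 SV=1
MSGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPMVEI
DGMKMVQTRAILNYIAGKYNLYGKDLKERALIDMYVEGLADLYELIMMNVVQPADKKEEH
"""
records = parse_fasta(example)
entry_name, sequence = records["Q08392"]
print(entry_name, len(sequence))
```

Keying the records by accession makes it straightforward to pick out one entry, such as Q08392, for the later search steps.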
3.1.3 Sequence Retrieval
The Sequence Retrieval System (SRS) is a data integration, analysis and display
tool for genomics, bioinformatics and related data. It was developed at the
European Bioinformatics Institute (EBI) at Hinxton, UK, and provides a homogeneous
interface to over 80 biological databases. These include databases of sequences,
transcription factors, metabolic pathways, and application results such as BLAST
and FASTA, as well as protein 3-D structures, genomes, mappings, mutations, and
locus-specific mutations. SRS integrates heterogeneous databanks in molecular
biology and genome analysis, and several dozen servers worldwide provide access to
over 300 different databanks via the World Wide Web. Additional technology for
integrating externally developed applications into the package provides further
capabilities for biological data analysis.
3.1.4 BLAST
The Basic Local Alignment Search Tool (BLAST) is the most popular database
searching program because of its combination of speed and sensitivity. It finds
regions of local similarity between sequences: the program compares nucleotide or
protein sequences to sequence databases and calculates the statistical significance
of the matches. BLAST is a heuristic method for finding the highest-scoring locally
optimal alignments between a query sequence and a database. It can be used to infer
functional and evolutionary relationships between sequences and to help identify
members of gene families. The statistics allow the probability of obtaining an
alignment without gaps (an HSP, or high-scoring segment pair) with a particular
score to be estimated, and the BLAST algorithm permits nearly all HSPs above a
cutoff to be located efficiently in a database.
The fundamental problem of a sequence similarity search against a DNA or protein
sequence database is to infer structural, or even functional, homology from a
sequence similarity score. This score is calculated from pairwise sequence
alignments using a variety of algorithms. Homologous genes are genes that have
evolved from a common ancestral gene through duplications and mutations. Homology
is a powerful inference because it can be inferred with high confidence from
statistically significant similarity scores. It is also informative because
homologous sequences generally share a common 3D structure, although their
functions can be quite different. In the following discussion, the BLAST search
algorithm is used as the example because it is the most popular similarity search
program.
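The link between an HSP score and its statistical significance can be illustrated with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. The sketch below is only an illustration: the values chosen for lambda and K and the database length are assumptions made for the example, not parameters taken from the searches in this study.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance HSPs scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalised score, so that E = query_len * db_len * 2 ** (-bits)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed, illustrative parameters (not taken from the study).
LAMBDA = 0.318
K = 0.13
QUERY_LEN = 221        # length of Q08392
DB_LEN = 10_000_000    # hypothetical total database length

for score in (40, 80, 160):
    e = expect_value(score, QUERY_LEN, DB_LEN, LAMBDA, K)
    print(score, round(bit_score(score, LAMBDA, K), 1), e)
```

Because the score sits in the exponent, doubling the raw score shrinks the expect value by many orders of magnitude, which is why a very small E-value is read as strong evidence of homology.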
3.2 How BLAST Works
3.2.1 The Basics
The BLAST algorithm is heuristic, which means that it relies on some smart
shortcuts to perform the search faster. BLAST performs "local" alignments. Most
proteins are modular in nature, with functional domains often being repeated within
the same protein as well as across different proteins from different species. The
BLAST algorithm is tuned to find these domains or shorter stretches of sequence
similarity. The local alignment approach also means that an mRNA can be aligned
with a piece of genomic DNA, as is frequently required in genome assembly and
analysis. If BLAST instead attempted to align two sequences over their entire
lengths (known as a global alignment), fewer similarities would be detected,
especially with respect to domains and motifs.
When a query is submitted via one of the BLAST Web pages, the sequence, plus any
other input information such as the database to be searched, word size and expect
value, is fed to the algorithm on the BLAST server. BLAST works by first making a
look-up table of all the “words” in the query sequence (short subsequences, three
letters long by default for proteins) and of “neighbouring words”, i.e., similar
words. The sequence database is then scanned for these “hot spots”. When a match is
identified, it is used to initiate gap-free and gapped extensions of the “word”.
BLAST does not search GenBank flat files (or any subset of GenBank flat files)
directly. Rather, sequences are made into BLAST databases. Each entry is split, and
two files are formed, one containing just the header information and one containing
just the sequence information. These are the data that the algorithm uses. If BLAST
is to be run in “stand-alone” mode, the data file could consist of local, private
data, downloaded NCBI BLAST databases, or a combination of the two.
To run blastp on the NCBI website, the Glutathione S-transferase query sequence
(Q08392) was downloaded in FASTA format and searched against the PDB database. All
parameters were kept at their defaults except the scoring matrix, and the results
obtained were reported.
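The look-up table and gap-free extension described above can be sketched in a few lines of Python. This is a simplified illustration and not the NCBI implementation: it seeds only on exact three-letter words (no neighbouring words), scores with an assumed +1/-1 match/mismatch scheme instead of a substitution matrix, and stops extending when the score falls more than an assumed x_drop below the best score seen. All function names are hypothetical.

```python
def word_index(query, w=3):
    """Look-up table: each w-letter word of the query -> its start positions."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return index


def extend_ungapped(query, subject, qi, si, w=3, match=1, mismatch=-1, x_drop=3):
    """Extend a word hit in both directions without gaps.

    Returns (score, query_start, query_end) of the best extension found.
    """
    def walk(q, s, step):
        best = running = best_len = n = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += match if query[q] == subject[s] else mismatch
            n += 1
            if running > best:
                best, best_len = running, n
            elif best - running > x_drop:
                break  # score dropped too far below the best: stop
            q += step
            s += step
        return best, best_len

    right_score, right_len = walk(qi + w, si + w, +1)
    left_score, left_len = walk(qi - 1, si - 1, -1)
    score = w * match + right_score + left_score
    return score, qi - left_len, qi + w + right_len


def best_hsp(query, subject, w=3):
    """Scan the subject for word hits and keep the best ungapped extension."""
    index = word_index(query, w)
    best = (0, 0, 0)
    for si in range(len(subject) - w + 1):
        for qi in index.get(subject[si:si + w], []):
            best = max(best, extend_ungapped(query, subject, qi, si, w))
    return best


# First 30 residues of Q08392 and of the 1ML6 sequence used in this chapter.
query = "MSGKPVLHYANTRGRMESVRWLLAAAGVEF"
subject = "AGKPVLHYFNARGRMECIRWLLAAAGVEFE"
score, start, end = best_hsp(query, subject)
print(score, query[start:end])
```

Indexing the query once and scanning the subject for word hits is what lets the search skip most of the database without attempting a full alignment.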
Figure 3. 2: BLAST input sequence showing PDB database being chosen for analysis.
3.2.2 WU-BLAST2
WU-BLAST2 stands for Washington University Basic Local Alignment Search Tool. It
is a sensitive and fast alignment tool that gives good alignments between query and
subject sequences. The WU-BLAST program was used to scan the query sequence against
the PDB protein database using default options. Changing the scoring matrix
produced variation in the results, i.e. in the score and E-value.
Figure 3. 3: Wu BLAST2 input sequence showing Protein Structure Sequence database being chosen for
analysis
3.2.3 FASTA
FASTA stands for FAST-ALL, reflecting the fact that it can be used for a fast
protein comparison or a fast nucleotide comparison. The program achieves a high
level of sensitivity for similarity searching at high speed. This is achieved by
performing optimized searches for local alignments using a substitution matrix. The
trade-off between speed and sensitivity is controlled by the ktup parameter, which
specifies the size of the word. Increasing ktup decreases the number of background
hits. Not every word hit is investigated; instead, the program initially looks for
segments containing several nearby hits. The high speed of the program is achieved
by using the observed pattern of word hits to identify potential matches before
attempting the more time-consuming optimized search.
FASTA analysis was performed by pasting the query sequence into the input box. All
parameters were kept at their defaults except the scoring matrix; the results are
discussed in detail in the Results section. An example of FASTA input, run against
the PDB database with the BLOSUM45 matrix, is shown below.
Figure 3. 4: FASTA input sequence showing Protein Structure Sequence database Against GST Sequence
for analysis
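The effect of the ktup parameter can be illustrated by counting identical words on each diagonal (query position minus subject position) for the first 60 residues of Q08392 and the matching part of 1ML6. The sketch below covers only this first word-matching stage, with a hypothetical function name. It does not reproduce the rescoring and optimization steps of the FASTA program.

```python
from collections import Counter


def diagonal_hits(query, subject, ktup=2):
    """Count identical ktup-letter words per diagonal (query pos - subject pos)."""
    positions = {}
    for j in range(len(subject) - ktup + 1):
        positions.setdefault(subject[j:j + ktup], []).append(j)
    hits = Counter()
    for i in range(len(query) - ktup + 1):
        for j in positions.get(query[i:i + ktup], []):
            hits[i - j] += 1
    return hits


query = "MSGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPMVEI"
template = "AGKPVLHYFNARGRMECIRWLLAAAGVEFEEKFIQSPEDLEKLKKDGNLMFDQVPMVEI"

for ktup in (1, 2):
    hits = diagonal_hits(query, template, ktup)
    diagonal, count = hits.most_common(1)[0]
    print("ktup:", ktup, "total hits:", sum(hits.values()),
          "best diagonal:", diagonal, "hits on it:", count)
```

With the larger word size far fewer scattered hits are recorded, while the diagonal that carries the true alignment still stands out. This is the speed versus sensitivity trade-off described above.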
3.2.4 PDB
The RCSB PDB provides a variety of tools and resources for studying the structure
of biological macromolecules and their relationship to sequence, function, and
disease. The RCSB is a member of the wwPDB, whose mission is to ensure that the PDB
archive remains an international resource with uniform data. The site offers tools
for browsing, searching and reporting that utilize the data resulting from ongoing
efforts to create a more consistent and comprehensive archive. The PDB is a
repository for the 3-D structural data of large biological molecules, such as
proteins and nucleic acids. The archive contains information about experimentally
determined structures of proteins, nucleic acids, and complex assemblies. As a
member of the wwPDB, the RCSB PDB curates and annotates PDB data according to
agreed-upon standards.
Users can perform simple and advanced searches based on annotations relating to
sequence, structure and function. The molecules are visualized, downloaded, and
analyzed by users who range from students to specialized scientists. The Protein
Data Bank (PDB) format provides a standard representation for macromolecular
structure data derived from X-ray diffraction and NMR studies. This representation
was created in the 1970s, and a large amount of software using it has been written.
If the contents of the PDB are thought of as primary data, then there are hundreds
of derived (i.e., secondary) databases that categorize the data differently. For
example, both SCOP and CATH categorize structures according to type of structure
and assumed evolutionary relations, while GO categorizes structures based on genes.
In this study, the database was used to download structures in PDB file format
(.pdb extension) for homology modeling. The structural data, summary information,
sequence length, X-ray parameters, resolution, Ramachandran plot and other factors
were carefully studied.
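Because the PDB format is column-based, the coordinates in a downloaded .pdb file can be read with simple string slicing. The following sketch extracts the C-alpha atoms from ATOM records. The function name is hypothetical, and the two sample lines are invented for illustration. They are not taken from the 1ML6 entry.

```python
def parse_ca_atoms(pdb_text):
    """Return (chain, residue number, residue name, x, y, z) per C-alpha atom."""
    atoms = []
    for line in pdb_text.splitlines():
        # Atom name occupies columns 13-16 of an ATOM record.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            atoms.append((
                line[21],            # chain identifier (column 22)
                int(line[22:26]),    # residue sequence number (columns 23-26)
                line[17:20],         # residue name (columns 18-20)
                float(line[30:38]),  # x coordinate (columns 31-38)
                float(line[38:46]),  # y coordinate (columns 39-46)
                float(line[46:54]),  # z coordinate (columns 47-54)
            ))
    return atoms


# Invented sample records, laid out in the fixed PDB columns.
sample = (
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N\n"
    "ATOM      2  CA  ALA A   1      11.639   6.071  -5.147  1.00  0.00           C\n"
)
print(parse_ca_atoms(sample))
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in a PDB file may touch without a separating space.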
Figure 3. 5: Image showing the search page of Protein Data Bank
3.2.5 Sequence Search Using Hill climbing Algorithm (SSHC)
Step 1: Start
Step 2: Take the unknown sequence Sk
Step 3: Take the set of known sequences KSs as the search space
Step 4: Apply the hill climbing technique to match Sk with KSs:
If Sk[i,j] (Intersection) KSs[i,j] > Sk[i,k] (Intersection) KSs[i,k]
Then Sk[i,j] (Intersection) KSs[i,j] is the result
Step 5: If no matches are found with KSs, then Sk is a new sequence
Step 6: Stop
KNOWN PROTEIN SEQUENCE 1ML6:
AGKPVLHYFNARGRMECIRWLLAAAGVEFEEKFIQSPEDLEKLKKDGNLMFDQVPMV
EIDGMKLVQTRAILNYIATKYDLYGKDMKERALIDMYTEGILDLTEMIGQLVLCPPD
QREAKTALAKDRTKNRYLPAFEKVLKSHGQDYLVGNRLTRVDVHLLELLLYVEELDA
SLLTPFPLLKAFKSRISSLPNVKKFLQPGSQRKPPLDAKQIEEARKVFKF
UNKNOWN PROTEIN SEQUENCE Q08392:
MSGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPM
VEIDGMKMVQTRAILNYIAGKYNLYGKDLKERALIDMYVEGLADLYELIMMNVVQPA
DKKEEHLANALDKAANRYFPVFEKVLKDHGHDFLVGNKLSRADVHLLETILAVEESK
PDALAKFPLLQSFKARTSNIPNIKKFLQPGSQRKPRLEEKDIPRLMAIFH
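One possible reading of Step 4 is a greedy search in which a candidate match is replaced whenever a neighbouring candidate shares more residues with the known sequence. The sketch below applies that idea to the alignment offset between the unknown and the known sequence. It is an interpretation offered for illustration, not the implementation used in this work, and the function names are hypothetical. Only the first line of each sequence above is used to keep the example short.

```python
def identity_at_offset(unknown, known, offset):
    """Count identical residues when unknown[i + offset] is paired with known[i]."""
    return sum(
        1
        for i in range(len(known))
        if 0 <= i + offset < len(unknown) and unknown[i + offset] == known[i]
    )


def hill_climb_offset(unknown, known, start=0):
    """Greedy search over offsets; stops at the first local maximum."""
    current = start
    current_score = identity_at_offset(unknown, known, current)
    while True:
        # Evaluate the two neighbouring offsets and keep the better one.
        neighbours = [current - 1, current + 1]
        best = max(neighbours, key=lambda o: identity_at_offset(unknown, known, o))
        best_score = identity_at_offset(unknown, known, best)
        if best_score <= current_score:
            return current, current_score  # no neighbour improves: stop
        current, current_score = best, best_score


known = "AGKPVLHYFNARGRMECIRWLLAAAGVEFEEKFIQSPEDLEKLKKDGNLMFDQVPMV"    # 1ML6
unknown = "MSGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPM"  # Q08392
offset, matches = hill_climb_offset(unknown, known)
print(offset, matches)
```

As with any hill climbing search, the result depends on the starting point: a start far from the true offset may end in a local maximum, in which case several starting offsets would have to be tried.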
3.2.6 Script
<HTML><BODY>
<TABLE border=1>
<?php
// Two protein sequences to compare. Each is split across lines for
// readability and joined with "." so that no line breaks end up in the strings.
$str1 =
"MGAASGRRGPGLLLPLPLLLLLPPQPALALDPGLQPGNFSADEAGAQLFAQSYNSS" .
"AEQVFQSVAASWAHDTNITAENARRQEEAALLSQEFAEAWGQKAKELYEPIWQNFTD" .
"PQLRRIGAVRTLGSANLPLAKRQQYNALLSNMSRIYSTAKVCLPNKTATCWSLDPDL" .
"TNILASSRSYAMLLFAWEGWHNAAGIPLKPLYEDFTALSNEAYKQDGFTDTGAYWRS" .
"WYNSPTFEDDLEHLYQQLEPLYLNLHAFVRRALHRRYGDRYINLRGPIPAHLLGDMW" .
"AQSWENIYDMVVPFPDKPNLDVTSTMLQQGWNATHMFRVAEEFFTSLELSPMPPEFW" .
"EGSMLEKPADGREVVCHASAWDFYNRKDFRIKQCTRVTMDQLSTVHHEMGHIQYYLQ" .
"YKDLPVSLRRGANPGFHEAIGDVLALSVSTPEHLHKIGLLDRVTNDTESDINYLLKM" .
"ALEKIAFLPFGYLVDQWRWGVFSGRTPPSRYNFDW";
$str2 =
"MGNTTSDRVSGERHGAKAARSEGAGGHAPGKEHKIMVGSTDDPSVFSLPDSKLPGDK" .
"EFVSWQQDLEDSVKPTQQARPTVIRWSEGGKEVFISGSFNNWSTKIPLIKSHNDFVA" .
"ILDLPEGEHQYKFFVDGQWVHDPSEPVVTSQLGTINNLIHVKKSDFEVFDALKLDSM" .
"ESSETSCRDLSSSPPGPYGQEMYAFRSEERFKSPPILPPHLLQVILNKDTNISCDPA" .
"LLPEPNHVMLNHLYALSIKDSVMVLSATHRYKKKYVTTLLYKPI";
?>
<TR>
<TD>String length</TD>
<TD><?php print_r(strlen($str1)); ?></TD>
<TD><?php print_r($str1); ?></TD>
</TR>
<TR>
<TD>String length</TD>
<TD><?php print_r(strlen($str2)); ?></TD>
<TD><?php print_r($str2); ?></TD>
</TR>
</table>
<table style="float:left" border=0>
<?php
$chars = array('');
$chars1 = array();
$red= array();
$c=0;
$high=0;
// Enumerate every substring of $str2 (start $l, length $k) and count
// how often it occurs in $str1.
for ($l = 0; $l < strlen($str2); $l++) {
for ($k = 1; $k <= strlen($str2) - $l; $k++) {
$chunk = substr($str2, $l, $k);
$cnt = substr_count($str1, $chunk);
if ($cnt == 0) {
// A longer substring from the same start cannot match either.
break;
}
if (!isset($red[$chunk])) {
// First time this shared substring is seen: record its count.
$c++;
if ($cnt > $high)
$high = $cnt;
$red[$chunk] = $cnt;
}
}
}
foreach($red as $i => $value){
?>
<tr><td><b><?php print_r($i); ?></b></td>
<TD> - <?php echo $red[$i]; ?> - </TD>
<?php
echo " <TD >".(($red[$i]/sizeof($red))*100)."%</TD> ";
for($j=0;$j<$red[$i];$j++)
echo " <TD bgcolor='blue'>.</TD> ";
echo "</tr>";
}
?>
<tfoot></tfoot>
</table>
<?php
echo sizeof($red) . "<span style='float:left'><b>Total match count is " . $c . " with highest frequency as " . $high . "</b></span>";
?>
</BODY></HTML>
3.2.7 Modeller9v1
MODELLER is a computer program that models three-dimensional structures of
proteins and their assemblies by satisfaction of spatial restraints. It implements
an automated approach to comparative protein structure modeling. Briefly, the core
modeling procedure begins with an alignment of the sequence to be modeled (the
target) with related known 3D structures (the templates). This alignment is usually
the input to the program. The output is a 3D model of the target sequence
containing all main-chain and side-chain non-hydrogen atoms. The model is thus
obtained from the given alignment.
Figure 3. 6: Modeller workspace showing on windows command prompt
3.3 HOMOLOGY MODELING METHODOLOGY
3.3.1 Step 1
Comparative models were constructed for various gene/protein sequences to study
the sequences in their structural context and to suggest site-directed mutagenesis
experiments for elucidating specificity changes in this apparent case of convergent
evolution of enzymatic specificity. To perform homology modeling, BLAST analysis
was carried out using the BLOSUM45 (BL-45) matrix against the protein structure
sequence database with the Swiss-Prot entries Q08392, P80894, Q08393, P46428,
Q7REH6 and P46423. Of these, Q08392 was taken into consideration because it showed
the highest similarity and identity and the lowest E-value. Only this sequence was
therefore used for modeling.
3.3.2 Step 2
>>PDB:1ML6 mol:protein length:221 Glutathione S-Trans (221 aa)
initn: 784 init1: 784 opt: 786 Z-score: 1835.9 bits: 346.9 E(): 2.3e-95
Smith-Waterman score: 786; 66.2% identity (83.1% similar) in 219 aa overlap (2-220:1-219)

Q08392    1 MSGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPMVEI 60
             .::::::: :.::::: .:::::::::::::::..  :::.::: ::.:.: :::::::
1ML6      1  AGKPVLHYFNARGRMECIRWLLAAAGVEFEEKFIQSPEDLEKLKKDGNLMFDQVPMVEI 59

Q08392   61 DGMKMVQTRAILNYIAGKYNLYGKDLKERALIDMYVEGLADLYELIMMNVVQPADKKEEH 120
            ::::.::::::::::: ::.:::::.:::::::::.::. :: :.: . :. : :..:
1ML6     60 DGMKLVQTRAILNYIATKYDLYGKDMKERALIDMYTEGILDLTEMIGQLVLCPPDQREAK 119

Q08392  121 LANALDKAANRYFPVFEKVLKDHGHDFLVGNKLSRADVHLLETILAVEESKPDALAKFPL 180
             : : :.. :::.:.:::::: ::.:.::::.:.:.:::::: .: :::     :. :::
1ML6    120 TALAKDRTKNRYLPAFEKVLKSHGQDYLVGNRLTRVDVHLLELLLYVEELDASLLTPFPL 179

Q08392  181 LQSFKARTSNIPNIKKFLQPGSQRKPRLEEKDIPRLMAIFH 221
            :..::.: :..::.:::::::::::: :. : :     .:
1ML6    180 LKAFKSRISSLPNVKKFLQPGSQRKPPLDAKQIEEARKVFKF 221
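The identity figure in the report above is the number of identical aligned positions divided by the length of the overlap: 66.2% of 219 positions corresponds to about 145 identical residues. A minimal sketch of that calculation is given below. The function name is hypothetical and '-' is assumed to be the gap character. The demonstration uses only the first 59 aligned positions of the report.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the gap-free aligned positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        raise ValueError("no aligned residue pairs")
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


# Residues 2-60 of Q08392 against residues 1-59 of 1ML6 (first block above).
query_part = "SGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPMVEI"
template_part = "AGKPVLHYFNARGRMECIRWLLAAAGVEFEEKFIQSPEDLEKLKKDGNLMFDQVPMVEI"
print(round(percent_identity(query_part, template_part), 1))
```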
3.3.3 Step 3
The following series of commands in Modeller9v1 generates the model, optimizes it
and superimposes it on the template:
1. mod9v1 search.py
2. mod9v1 malign.py
3. mod9v1 get-model.py
4. mod9v1 optimize.py
5. mod9v1 superpose.py
3.3.4 Files in Modeller9v1
1. Search file:
Figure 3. 7: Modeller workspace showing sequence search file
In the search file, the target sequence file name is specified as Q08392,
together with its extension. The command “mod9v1 search.py” runs the search using
this file.
2. Alignment file:
Figure 3. 8: Modeller workspace showing sequence alignment file
In this file, the alignment should be the same as the alignment obtained from the
FASTA alignment program. The command mod9v1 malign.py checks this alignment and
also checks the template (1ML6) sequence against the template structure. The 1ML6
sequence in the alignment must exactly match the sequence in the 1ML6 PDB atom
file.
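MODELLER reads alignments in PIR format, in which every entry has a ">P1;code" line, a description line of ten colon-separated fields, and the sequence terminated by "*". The sketch below writes such entries for the target and the template. The helper name, the residue range 1 to 221 and the chain identifier A used for 1ML6 are assumptions for illustration and would have to be checked against the actual PDB file. Only short sequence fragments are used here.

```python
def pir_entry(code, aligned_sequence, description_line, width=60):
    """Format one PIR alignment entry; the sequence is terminated by '*'."""
    body = aligned_sequence + "*"
    lines = [body[i:i + width] for i in range(0, len(body), width)]
    return ">P1;" + code + "\n" + description_line + "\n" + "\n".join(lines) + "\n"


# Target: a sequence without a structure.
target = pir_entry(
    "Q08392",
    "MSGKPVLHYANTRGRMESVRWLLAAAGVEF",
    "sequence:Q08392:::::::0.00: 0.00",
)

# Template: fields are type, code, first residue, chain, last residue, chain,
# followed by four optional fields left empty. Range and chain are assumed.
template = pir_entry(
    "1ML6",
    "-AGKPVLHYFNARGRMECIRWLLAAAGVEF",
    "structureX:1ML6:1:A:221:A::::",
)

print(target + template)
```

The leading "-" in the template entry is a gap, so that the template starts under the second residue of the target, as in the FASTA alignment of Step 2.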
3. Get-model file:
Figure 3. 9: Modeller workspace showing get-model file
In the get-model file, the file name of the known template structure and the file
name of the target protein sequence have to be specified. Here the target protein
is specified as Q08392 and the template structure as 1ML6.
Certain modifications were made here, such as
Starting model = 1
Ending model = 5
so that five models are generated.
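The get-model file itself is shown only as a figure, so the following is a sketch of what such a script typically contains, assuming MODELLER's Python interface (the environ and automodel classes). The alignment file name alignment.ali is a hypothetical placeholder. The import is placed inside the function so that the sketch can be loaded on a machine without MODELLER. Calling the function requires a working MODELLER installation together with the alignment and template files.

```python
def build_models(alnfile="alignment.ali", template="1ML6", target="Q08392",
                 first=1, last=5):
    """Build comparative models of the target from one template.

    alnfile is a hypothetical file name for the PIR alignment of the
    target with the template.
    """
    # Imported here so this sketch loads even where MODELLER is absent.
    from modeller import environ
    from modeller.automodel import automodel

    env = environ()
    model = automodel(env, alnfile=alnfile, knowns=template, sequence=target)
    model.starting_model = first   # index of the first model to build
    model.ending_model = last      # index of the last model (five in total)
    model.make()
    return model
```

With starting and ending indices of 1 and 5, MODELLER builds five candidate models of Q08392, from which the best one can then be chosen for optimization.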
4. Optimize file:
Figure 3. 10: Modeller workspace showing optimized file
In this file, the name of the modeled protein has to be given for optimization.
Here the modeled protein name is specified as Q08392. The command mod9v1
optimize.py runs the optimization and yields the modeled protein in a stable,
minimum-energy state.
5. Superpose:
Figure 3. 11: Modeller workspace showing superimposed file
Here the modeled protein file name is specified as Q08392 and the template file
name as 1ML6 for superimposition. The command mod9v1 superpose.py superimposes
these two proteins and reports the RMSD (root mean square deviation) value.
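The RMSD is the square root of the mean squared distance between equivalent atoms of the two structures. The sketch below computes it for two lists of already superimposed coordinates. It does not perform the superposition itself, which MODELLER carries out before reporting the value, and the function name is illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the two equivalent atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Two toy structures of two atoms each, shifted by 1 unit along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
template = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model, template))
```

A lower RMSD between the model and the template indicates that the modeled backbone follows the template structure more closely.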