EC in Bioinformatics and Drug Design

Dr. Michael L. Raymer
Department of Computer Science and Engineering
Bioinformatics

[Diagram: overlapping fields of bioinformatics: computational molecular biology, genomics, proteomics, functional genomics, structural bioinformatics]
DNA is the blueprint for life

• Every cell in your body has 23 pairs of chromosomes in the nucleus.
• The genes in these chromosomes determine all of your physical attributes.
Image source: Crane digital, http://www.cranedigital.com/
Mapping the Genome

• The Human Genome Project has provided us with a draft of the entire human genome.
• Four bases: A, T, C, G
• 3.12 billion base pairs
• 99% of these are the same
• Polymorphisms = where they differ
How does the code work?

• Template for construction of proteins
The genome is information:

• TATAAGCTGACTGTCACTGA
[Figure: one codon of the DNA sequence, and the protein structure 3apr.pdb]

• 4 bases (A, G, C, T) encode 20 amino acids
• The representation is much more complex in many ways than in EC.
Proteins: Molecular machinery

• Proteins in your muscles allow you to move: myosin and actin
Proteins: Molecular machinery

• Enzymes (digestion, catalysis)
• Structure (collagen)
Proteins: Molecular machinery

• Signaling (hormones, kinases)
• Transport (energy, oxygen)
Image source: Crane digital, http://www.cranedigital.com/
Growth of biological databases
[Chart: GenBank base-pair growth, 1982-1999, rising from 1 million to 3,841 million base pairs. Source: GenBank]

[Chart: growth in 3D structure holdings. Source: http://www.rcsb.org/pdb/holdings.html]
Applications

• What can we do with all this information?
  - Cure diseases: computational drug design
  - Model protein structure
  - Understand relationships between organisms (phylogenies)
Example Case: HIV Protease

1. Exposure & infection
2. HIV enters your cell
3. Your own cell reads the HIV “code” and creates the HIV proteins.
4. New viral proteins prepare HIV for infection of other cells.
Image source: http://whyfiles.org/035aids/index.html. © George Eade, Eade Creative Services, Inc.
HIV Protease as a drug target

• Many drugs bind to protein active sites.
• This HIV protease can no longer prepare HIV proteins for infection, because an inhibitor is already bound in its active site.
HIV Protease + Peptidyl inhibitor (1A8G.PDB)
Drug Discovery

• Target identification: What protein can we attack to stop the disease from progressing?
• Lead discovery & optimization: What sort of molecule will bind to this protein?
• Toxicology: Does it kill the patient? Does it have side effects? Does it get to the problem spots?
Drug Development Life Cycle

• Discovery (2 to 10 years)
• Preclinical testing (lab and animal testing)
• Phase I (20-30 healthy volunteers used to check for safety and dosage)
• Phase II (100-300 patient volunteers used to check for efficacy and side effects)
• Phase III (1,000-5,000 patient volunteers used to monitor reactions to long-term drug use)
• FDA review & approval
• Post-marketing testing
$600-700 million!
7-15 years!
Drug lead screening

5,000 to 10,000 compounds screened
→ 250 lead candidates in preclinical testing
→ 5 drug candidates enter clinical testing
→ 80% pass Phase I
→ 30% pass Phase II
→ 80% pass Phase III
→ one drug approved by the FDA
Finding drug leads

• Once we have a target, how do we find some compounds that might bind to it?
• The old way: exhaustive screening
• The new way: computational screening!
Chemistry 101

• Like everything else in the universe, proteins and drugs are made up of atoms.
• Some atoms, like oxygen, tend to have a negative charge.
• Some, like nitrogen, tend to be positively charged.
• When the two come together, they attract like magnets.
The Goal:

[Diagram: a protein binding a drug lead]

Drug Lead Screening & Docking

• Complementarity: shape, chemical, electrostatic
But it gets more complicated

• About 60% of our body weight is water.
• Most proteins are surrounded by water molecules.
• These water molecules interact with protein-drug complexes.
Protein-water interactions

• Water has both negatively and positively charged atoms.
• It can bridge gaps between drugs and proteins.

[Diagram: a water molecule bridging the protein surface and the ligand]
Will the water stay?

• When a drug comes close to a protein, some of the water molecules are displaced.

[Diagram: the ligand displacing water molecules from the protein surface]
Pattern Recognition Model

[Diagram: sample → transducer → raw measurement data → feature vector (f1, f2, ..., fd) = pattern of dimensionality d]
Training and Testing

[Diagram: labeled training data (feature vectors f1-f5 with class labels) are used to train a classifier; the trained classifier then predicts the class of an unknown test pattern]
Temperature Factor (B-Value)

How wiggly is it? Here a protein (dihydrofolate reductase) is colored by temperature factor.
Atomic Density (ADN)

How crowded is it? The atomic density of this water molecule is 5.
Prediction of water molecules

Blue spheres are water molecules predicted to stay in the active site. Wire-mesh spheres are water molecules predicted to be displaced, booted out by the ligand.
Nearest Neighbor Classification

[Plot: Feature 1 vs. Feature 2, showing class 1 training samples, class 2 training samples, and a new unknown (test) sample]
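For concreteness, a minimal k-nearest-neighbor sketch in Python with NumPy; the toy data are illustrative, not from the slides:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    votes = Counter(y_train[nearest])             # count class labels
    return votes.most_common(1)[0][0]             # majority class

# Toy two-feature example
X = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])
y = np.array([1, 1, 2, 2])
print(knn_predict(X, y, np.array([1.2, 0.8]), k=3))  # -> 1
```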
Feature Weighted knn

[Plots: (a) class 1, class 2, and an unknown sample in Feature 1 vs. Feature 2 space; (b) the same data with the scale of one feature extended, which changes the unknown's nearest neighbors]
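Feature weighting amounts to rescaling each axis inside the distance computation, as in this small sketch (the weight values are made up for illustration):

```python
import numpy as np

def weighted_distance(a, b, w):
    """Euclidean distance with each feature axis scaled by its weight."""
    return np.sqrt(np.sum(w * (a - b) ** 2))

# Stretching feature 2 (w2 > 1) makes differences along that axis
# dominate the neighbor ranking; all weights = 1 recovers plain kNN.
a, b = np.array([1.0, 2.0]), np.array([2.0, 4.0])
print(weighted_distance(a, b, np.array([1.0, 1.0])))  # ~2.24
print(weighted_distance(a, b, np.array([1.0, 9.0])))  # ~6.08
```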
GA & knn InteractionGA & knn Interaction
Genetic Algorithm
W1 W2 W3 W4 W5
W1 W2 W3 W4 W5
W1 W2 W3 W4 W5
W1 W2 W3 W4 W5
...
KNN Classifier
W1
W2
Masked Weight Vector & kMasked Weight Vector & k
Fitness: How is it calculated?
Weighting and Masking

• How do we sample feature subsets?
  - Weight below a threshold value: slow sampling
  - Masking: distinct mutation rates, or multiple mask bits
• Intron effect
• Classifier parameters (k) on the chromosome

[Chromosome layout: W1 W2 W3 W4 W5 | M1 M2 M3 M4 M5 | k. For example, a weight gene of 73.2 with its mask bit set to 0 contributes nothing.]
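A sketch of decoding such a chromosome; the weight threshold value is an assumption, since the slides name the mechanism but not the number:

```python
import numpy as np

def decode_chromosome(chrom, d=5, w_threshold=0.01):
    """Split [W1..Wd | M1..Md | k] and apply masking.

    A feature is used only if its mask bit is 1 and its weight exceeds
    the threshold; masked weights are zeroed out.
    """
    w = np.asarray(chrom[:d], dtype=float)     # feature weights
    m = np.asarray(chrom[d:2 * d], dtype=int)  # mask bits
    k = int(chrom[2 * d])                      # k for the kNN classifier
    w_eff = np.where((m == 1) & (w > w_threshold), w, 0.0)
    return w_eff, k

w_eff, k = decode_chromosome([73.2, 0.5, 0.0, 1.2, 8.0,  0, 1, 1, 1, 0,  7])
print(w_eff, k)  # [0.  0.5 0.  1.2 0. ] 7
```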
The Cost Function

• We can direct the search toward any objective:
  - Classification accuracy
  - Class balance
  - Feature subset parsimony (reduce d)
• The GA minimizes the cost function:
    cost(w, k) = C_acc · err(w, k) + C_pars · nonzero(w)
               + C_vote · incorrect_votes(w, k) + C_bal · bal(w, k)
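Transcribed directly into Python, with the component terms passed in as functions; the coefficient values shown are placeholders, since the slides do not give them:

```python
def cost(w, k, err, nonzero, incorrect_votes, bal,
         C_acc=1.0, C_pars=0.01, C_vote=0.5, C_bal=0.1):
    """Weighted sum the GA minimizes; coefficients trade off objectives."""
    return (C_acc * err(w, k)                    # classification error
            + C_pars * nonzero(w)                # feature-subset parsimony
            + C_vote * incorrect_votes(w, k)     # wrong kNN votes
            + C_bal * bal(w, k))                 # class-balance penalty
```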
UCI Data Set Results

Hepatitis     Train/Tune   Test    Features
Bayes           85.3       65.7      19
Knn             87.1       73.4      19
GA/knn          86.0       69.6       8.1
EP/knn          87.2       73.1       8.9

Wine          Train/Tune   Test    Features
Bayes           98.8       94.7      13
Knn             94.9       94.3      13
GA/knn          99.7       94.8       6.0
EP/knn          99.5       93.2       6.2

Ionosphere    Train/Tune   Test    Features
Bayes           93.0       90.1      34
Knn             83.4       93.2      34
GA/knn          95.0       91.9       8.5
EP/knn          93.2       92.3      13.5

Pima          Train/Tune   Test    Features
Bayes           76.10      64.60      8
Knn             73.50      71.50      8
GA/knn          80.00      72.10      3.1
EP/knn          79.10      72.90      3.9

• Typically we see modest gains in classification accuracy, with significant reduction in features.
Blake, C. L. and Merz, C. J. (1998). UCI Repository of Machine Learning Databases. University of California, Irvine, Dept. of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html
Classical Methods & SFFSClassical Methods & SFFSMethod Train/Tune Test FeaturesGA/knn 98.5% 98.4% 3SFFS/knn 98.0% 98.1% 6Linear Discriminant 93.8% 88.4% 21Quadratic Discriminant 89.7% 95.3% 21Nearest Neighobr 100.0% 96.1% 21Bayes (independent) 97.1% 92.4% 21Bayes (second order) 97.7% 98.5% 21Neural Net (Back prop) 99.5% 99.3% 21Predictive Value Max. 99.8% 99.4% 21CART Tree 99.8% 84.9% 21
Method Train/Tune Test FeaturesGA/knn 90.4% 90.6% 2SFFS/knn 88.5% 91.4% 3Linear Discriminant 88.7% 86.8% 7Quadratic Discriminant 79.3% 73.6% 7Nearest Neighobr 100.0% 82.1% 7Bayes (independent) 88.7% 83.0% 7Bayes (second order) 95.3% 81.1% 7Neural Net (Back prop) 90.0% 85.8% 7Predictive Value Max. 91.5% 89.6% 7CART Tree 90.0% 84.9% 7
• Thyroid data Best feature
selection performance
within 1% of best accuracy
• Appendicitis data Best feature
selection within 1% of best
accuracy
• Thyroid data Best feature
selection performance
within 1% of best accuracy
• Appendicitis data Best feature
selection within 1% of best
accuracy
Weiss, S. and Kapouleas, I. (1990). An Empirical Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods. Morgan Kaufmann.
Feature Extraction Framework

[Diagram: the genetic algorithm proposes features and classifier parameters to a classifier; fitness based on accuracy flows back to the GA]
Reducing Computation Time

• Most of the compute time is spent calculating distances and finding nearest neighbors.
• Branch-and-bound knn [1] scales:
  - linearly with d
  - polynomially with n: O(n^c), 1.0 < c < 2.0
• Can we use a faster classifier?
[1] Fukunaga, K. and Narendra, P. M. (1975). A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 750–753.
The Bayes Classifier

• Properties are well understood and thoroughly explored in the literature
• Training data are summarized; classification of test samples is rapid
• Provably optimal when the multivariate feature distribution is known for each class
Class-conditional Distributions

[Plot: P(x) vs. x, showing the two class-conditional densities P(x|ω1) and P(x|ω2)]

• MLE optimal when:
  - Equal prior probabilities
  - Equal error costs
• Generalizes to d dimensions
Multiple Dimensions

[Plots: per-feature class-conditional densities P(xi|ωj): P(x1|ω1), P(x1|ω2), P(x2|ω1), P(x2|ω2)]
The “Naïve” Bayes Classifier

• We now have P(xi|ωj) for each feature i and each class j
• Naïve approach: assume all features are independent:

    P(x|ωj) = P(x1|ωj) · P(x2|ωj) · ... · P(xd|ωj)

• This assumption is almost always false
• As long as monotonicity holds, the decision rule is still valid
Unequal Prior Probabilities

• When we know the prior probabilities for the classes, we can use Bayes rule:

    P(ωi|x) = P(x|ωi) P(ωi) / Σ(j=1..C) P(x|ωj) P(ωj)

• Bayes decision rule: assign to the class for which the posterior probability is highest
• P(ωi): prior probability
Hybridizing the Bayes Classifier

• Unfortunately, the Bayes classifier is invariant to feature weighting.

[Histograms: proportion of training samples vs. feature value, before and after rescaling the feature by a factor of 1/10. The axis shrinks but the areas do not: P(80 < x < 100) = 0.045 becomes P(8 < x < 10) = 0.045]
Bayes Discriminant Function

• Bayes decision rule: given x, decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i

• Two-class discriminant function:

    g(x) = P(ω1|x) - P(ω2|x)

  Since P(ωi|x) = P(x|ωi) P(ωi) / Σj P(x|ωj) P(ωj), and the denominator is common to both classes, decide ω1 when:

    P(x|ω1) P(ω1) > P(x|ω2) P(ω2)
Naïve Bayes Discriminant

    P(x|ω1) P(ω1) > P(x|ω2) P(ω2)

• Independence assumption:

    P(x1|ω1) P(x2|ω1) ... P(xd|ω1) P(ω1) > P(x1|ω2) P(x2|ω2) ... P(xd|ω2) P(ω2)

• Since log(ab) = log(a) + log(b):

    g(x) = Σi log P(xi|ω1) + log P(ω1) - Σi log P(xi|ω2) - log P(ω2)
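A minimal sketch of this two-class log discriminant, assuming Gaussian per-feature likelihoods as stand-ins for whatever density estimates are used in practice; all names are illustrative:

```python
import numpy as np

def gauss_logpdf(v, mu, sigma):
    """Log density of a Gaussian; stands in for histogram-based estimates."""
    return -0.5 * ((v - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def naive_bayes_g(x, params1, params2, prior1, prior2):
    """Two-class naive Bayes log discriminant: decide class 1 if g(x) > 0."""
    g = np.log(prior1) - np.log(prior2)
    for i, v in enumerate(x):
        g += gauss_logpdf(v, *params1[i]) - gauss_logpdf(v, *params2[i])
    return g

# One feature, class means 0 and 3 (illustrative parameters)
print(naive_bayes_g([0.4], [(0.0, 1.0)], [(3.0, 1.0)], 0.5, 0.5) > 0)  # True
```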
A Parameterized Discriminant

    P*(x|ωi) = C1 log P(x1|ωi) + C2 log P(x2|ωi) + ... + Cd log P(xd|ωi) + log P(ωi)
• C1, C2 … Cd are optimized by an evolutionary algorithm.
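In code, the only change from the naive Bayes discriminant above is a per-feature coefficient on each log likelihood; these C values are what the evolutionary algorithm evolves (names illustrative):

```python
def parameterized_logp(x, logpdfs, log_prior, C):
    """P*(x|class) = sum_i C_i * log P(x_i|class) + log P(class).

    With all C_i = 1 this reduces to the ordinary naive Bayes
    discriminant; the EA tunes C to compensate for violations of
    the independence assumption. Decide class 1 if the class-1
    score exceeds the class-2 score.
    """
    return sum(c * lp(v) for c, lp, v in zip(C, logpdfs, x)) + log_prior
```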
An Alternative Discriminant

• Sum of weighted marginal probabilities:

    P*(x|ωi) = C1 P(x1|ωi) + C2 P(x2|ωi) + ... + Cd P(xd|ωi) + P(ωi)
Conserved vs. Non-Conserved

Bootstrap Accuracy (%)   Balance       |      Feature Weights
Disp   Cons   Total   Mean  StDev   K  | ADN    AHP    BVAL   HBDP   HBDW   MOB    ABVAL  NBVAL
65.44  62.96  64.20   3.88  2.94   65  | 0.000  0.000  0.413  0.135  0.137  0.315  0.000  0.000
65.16  62.08  63.62   3.80  3.19   29  | 0.000  0.000  0.667  0.000  0.000  0.333  0.000  0.000
65.56  60.77  63.16   5.35  3.55   25  | 0.000  0.000  0.463  0.000  0.000  0.323  0.000  0.214
62.08  64.14  63.11   4.15  2.75   37  | 0.000  0.000  0.891  0.000  0.000  0.109  0.000  0.000
63.49  62.52  63.00   3.52  2.56   77  | 0.000  0.000  0.308  0.163  0.225  0.304  0.000  0.000
65.30  60.45  62.87   5.36  3.72   17  | 0.000  0.000  0.841  0.000  0.000  0.159  0.000  0.000
58.76  66.19  62.47   7.74  4.24   97  | 0.459  0.291  0.250  0.000  0.000  0.000  0.000  0.000
61.79  62.94  62.36   3.27  2.47   27  | 0.000  0.371  0.629  0.000  0.000  0.000  0.000  0.000
62.86  61.45  62.16   3.79  2.49   23  | 0.000  0.000  0.372  0.240  0.000  0.203  0.000  0.184
62.04  62.26  62.15   3.50  2.34    7  | 0.000  0.000  0.571  0.156  0.000  0.273  0.000  0.000
60.68  63.36  62.02   4.26  3.06   87  | 0.000  0.118  0.558  0.323  0.000  0.000  0.000  0.000
62.68  60.76  61.72   4.14  3.30   17  | 0.000  0.252  0.352  0.397  0.000  0.000  0.000  0.000
62.93  60.47  61.70   4.20  3.25   67  | 0.000  0.000  0.421  0.000  0.000  0.579  0.000  0.000
61.00  62.16  61.58   3.58  2.50   13  | 0.018  0.388  0.441  0.000  0.000  0.000  0.000  0.153
60.40  62.52  61.46   3.84  2.79   63  | 0.000  0.000  0.227  0.000  0.773  0.000  0.000  0.000
61.99  60.86  61.42   3.18  2.45   19  | 0.000  0.051  0.417  0.058  0.000  0.474  0.000  0.000
61.13  61.59  61.36   3.39  2.55   15  | 0.000  0.000  0.392  0.293  0.000  0.207  0.000  0.108
57.71  64.60  61.16   7.14  3.81   19  | 0.000  0.000  1.000  0.000  0.000  0.000  0.000  0.000
62.33  58.90  60.62   4.58  3.57   57  | 0.000  0.000  0.881  0.000  0.000  0.000  0.000  0.119
60.65  59.95  60.30   2.78  2.24   75  | 0.000  0.317  0.000  0.000  0.000  0.000  0.000  0.683
59.83  60.68  60.25   3.40  2.48   69  | 0.000  0.000  0.000  0.336  0.000  0.000  0.000  0.664

(Shading in the original slide distinguishes higher weights from lower weights.)
Cosine-based kNN Classifier

[Plot: class A and class B training points and a test pattern in Feature 1 vs. Feature 2 space]

k=5 classification: among the 5 points with the smallest angles with respect to the test point, 3 are class A, so the test point is labeled as class A.
• A cosine similarity metric finds the k points with the most similar angles. The cosine between two vectors xi and xj is defined as:

    cos(xi, xj) = (xi · xj) / (‖xi‖ ‖xj‖)

• Once the k most similar points have been identified, the class label is assigned according to the function:

    q(x) = Σ(i=1..k) cos(x, xi) c(xi)

  where c(xi) = 1 if xi belongs to the positive class, and -1 if xi belongs to the negative class. If q is positive, the query point is assigned to the positive class; otherwise it is assigned to the negative class.
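A compact sketch of this cosine-based kNN rule (the toy data are illustrative):

```python
import numpy as np

def cos_knn_predict(X, c, x, k=5):
    """Cosine kNN: vote among the k training points most similar in angle.

    X: training vectors (rows), c: labels in {+1, -1}, x: query vector.
    """
    sims = X @ x / (np.linalg.norm(X, axis=1) * np.linalg.norm(x))
    top = np.argsort(sims)[-k:]       # k largest cosines = smallest angles
    q = np.sum(sims[top] * c[top])    # similarity-weighted vote
    return 1 if q > 0 else -1

X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]])
c = np.array([1, 1, -1, -1, 1])
print(cos_knn_predict(X, c, np.array([0.8, 0.3]), k=3))  # -> 1
```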
Feature Extraction Techniques

Shifting the origin in the search space may affect classification.

[Plots: class A, class B, and a test pattern in Feature 1 vs. Feature 2 space, before and after an origin shift, and before and after extending the scale of feature 2]

Assigning different weights to each feature may also affect classification, to a lesser extent.

After the origin is shifted above, the test point, originally labeled class A, is relabeled class B. After feature 2 is extended above, the test point, originally labeled class B, is relabeled class A.
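A tiny demonstration of why the origin shift matters for a cosine metric: translating both vectors changes the angle between them (values illustrative):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
offset = np.array([5.0, 5.0])
print(cos_sim(a, b))                    # 0.0  (orthogonal)
print(cos_sim(a + offset, b + offset))  # ~0.98 (nearly parallel after shift)
```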
GA/Classifier Hybrid Architecture

[Diagram: the genetic algorithm maintains a population of chromosomes of the form W1 W2 ... W8 | O1 O2 ... O8 | K, each passed to the cosine kNN classifier]

• Weight vector: weights to use for each feature axis during classification
• Offset vector: feature offsets for the cosine point of reference during classification
• K is also optimized
• Fitness: based on the number of correct predictions using the weight vector and the number of masked features
Population-Adaptive Mutation

[Diagram: mutation range within [min, max], wide early in the run and narrow later]

When a feature is chosen for mutation, its range for possible mutation depends on the variance of that feature across the genetic algorithm's population. In early generations, variance is high, so the range is larger. Later, as the population begins to converge, variance decreases and the range is smaller.
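One way to implement this, sketched below; tying the mutation step to the population standard deviation is an assumption about the exact mechanism, which the slides describe only qualitatively:

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_mutate(population, gene, scale=1.0):
    """Mutate one gene of a random individual; the step size tracks the
    population's current spread for that gene, so mutations shrink
    automatically as the population converges."""
    sigma = population[:, gene].std()
    individual = rng.integers(len(population))
    population[individual, gene] += rng.normal(0.0, scale * sigma)
    return population

pop = rng.uniform(0.0, 1.0, size=(20, 5))  # 20 individuals, 5 genes
adaptive_mutate(pop, gene=2)
```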
Probe Site Generation

Aspartic protease (2apr) with crystallographically observed (yellow) and computer-generated (green) water molecules.
Standard Classifier Results

                      Accuracy (%)
Classifier          Total    Fake     Real    Balance
Logistic            69.331   65.495   73.164    7.668
NeuralNetwork       69.293   66.003   72.582    6.579
VotedPerceptron     69.246   66.754   71.737    4.983
J48                 68.992   68.745   69.239    0.494
Euclidean kNN/GA    68.18    67.75    68.6      0.85
Kstar               67.941   62.378   73.502   11.124
ADTree              66.476   59.429   73.521   14.092
Ibk (k = 3)         65.762   65.177   66.347    1.171
NaiveBayes          61.414   44.741   78.085   33.344
DecisionStump       56.061   24.08    88.038   63.958
HyperPipes          50.709   99.981    1.446   98.535

All of the above classifiers are from the WEKA collection of machine learning algorithms and use 10-fold cross-validation, except for the Euclidean kNN/GA classifier, which is another EC-optimized classifier from Michael Raymer's research; it uses bootstrap validation.
cosKnn/GA Classifier Results

Bootstrap Accuracy (%)     Balance        |  Feature (weight, offset) pairs:
Total   Fake    Real    Mean    StDev   K  |  ADN, AHP, HBDP, HBDW, ABVAL, NBVAL
69.910  67.778  72.041   4.359  2.640  80  |  (.252, 11.843) (.177, 7.695) (.352, 11.771) (.105, 1.306) (.113, -2.749)
69.477  65.791  73.162   7.375  2.996  67  |  (.271, 8.422) (.253, 15.000) (.154, 14.897) (.113, 15.000) (.209, 12.885)
69.423  64.382  74.465  10.083  3.054  45  |  (.242, 5.356) (.222, -9.702) (.205, 5.863) (.165, -15.000) (.166, -9.638)
69.311  68.901  69.720   2.346  1.601  83  |  (.306, 2.278) (.297, 14.794) (.134, 0.050) (.161, 13.334) (.102, 1.305)
69.227  68.626  69.828   2.651  1.900  47  |  (.375, 2.689) (.102, 0.113) (.217, 9.788) (.164, 6.253) (.142, -4.472)
69.218  69.720  68.716   2.461  1.863  46  |  (.250, 0.070) (.131, 10.949) (.161, 0.976) (.167, 14.160) (.126, 14.413) (.165, 4.982)
69.191  67.578  70.803   3.655  2.267  63  |  (.170, -0.957) (.198, 8.989) (.141, -2.203) (.153, 13.966) (.126, 7.411) (.213, -1.678)
69.184  68.624  69.745   2.611  1.863  39  |  (.252, 3.408) (.189, 7.372) (.331, 9.834) (.228, 12.698)
69.091  68.962  69.221   2.400  1.751  84  |  (.234, 9.261) (.130, 15.000) (.093, 10.208) (.265, 14.664) (.099, 2.815) (.178, -1.226)
68.954  68.484  69.423   2.559  1.881  43  |  (.262, 3.861) (.262, 13.153) (.151, 0.409) (.229, 15.000) (.095, -5.000)
68.938  66.444  71.432   5.033  2.874  39  |  (.159, 5.596) (.267, 12.055) (.101, -1.87) (.144, 10.342) (.108, -0.715) (.221, -2.710)
68.932  69.169  68.695   2.707  2.098  38  |  (.175, -0.606) (.140, 13.072) (.159, 7.227) (.242, -0.819) (.283, 11.729)
68.926  65.711  72.141   6.430  2.764  70  |  (.329, -1.447) (.182, 1.703) (.177, -8.284) (.087, 10.891) (.225, -11.527)
68.876  67.415  70.338   3.290  2.196  58  |  (.244, 3.784) (.187, 10.179) (.243, 10.528) (.166, 15.000) (.159, 8.734)
68.811  69.254  68.367   2.707  1.936  33  |  (.386, 15.000) (.096, -8.129) (.131, 14.271) (.388, -15.000)

(Shading in the original slide distinguishes higher weights from lower weights.)
Results
Protein Structure Modeling

• The holy grail of computational biology: given the sequence of amino acids in a protein, how will it fold?

[Figure: from sequence (TATAAGCTGACTGTCACTGA) to folded structure]
[Flowchart: protein structure modeling, expert vs. automated. Stages include: target protein; fold recognition (profile vs. profile); target-to-template alignment against candidate structures 1, 2, ..., x, ..., y; align and score; optimize alignment; fragment selection; evaluate the core; key residue comparison; final model]
The first four strands of OB-folds...

The fifth strand, clustered

The second helix, clustered
Selecting a Model

• A genetic algorithm evolves a population of models
• Start with ~50: the original 20, plus 30 random
• Double the population size by mutation and crossover
• Test each of the new structures and dismiss half
• Fitness function (a sketch of the loop follows below)
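A minimal sketch of the evolutionary loop those bullets describe, assuming hypothetical mutate, crossover, and fitness operators for structure models, which the slides do not specify:

```python
import random

def evolve(models, fitness, mutate, crossover, generations=100):
    """(mu + mu)-style GA: double the population by variation, keep the better half."""
    for _ in range(generations):
        offspring = []
        while len(offspring) < len(models):        # double the population size
            if random.random() < 0.5:
                offspring.append(mutate(random.choice(models)))
            else:
                parent_a, parent_b = random.sample(models, 2)
                offspring.append(crossover(parent_a, parent_b))
        pool = models + offspring                  # test old and new structures
        pool.sort(key=fitness, reverse=True)       # higher fitness is better
        models = pool[:len(pool) // 2]             # dismiss half
    return models
```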