EC in Bioinformatics and Drug Design Dr. Michael L. Raymer Department of Computer Science and Engineering


Page 1: EC in Bioinformatics and Drug Design

Dr. Michael L. Raymer
Department of Computer Science and Engineering

Page 2: Bioinformatics

[Diagram: bioinformatics and related fields: computational molecular biology, genomics, proteomics, functional genomics, structural bioinformatics]

Page 3: DNA is the blueprint for life

• Nearly every cell in your body has 23 pairs of chromosomes in the nucleus
• The genes in these chromosomes largely determine your inherited physical attributes.

Image source: Crane digital, http://www.cranedigital.com/

Page 4: Mapping the Genome

• The Human Genome Project has provided us with a draft of the entire human genome.
• Four bases: A, T, C, G
• 3.12 billion base pairs
• More than 99% of these are the same from person to person
• Polymorphisms = where they differ

Page 5: How does the code work?

• Template for construction of proteins

Page 6: The genome is information

• TATAAGCTGACTGTCACTGA

[Figure: one codon (three bases) marked in the sequence; protein structure 3apr.pdb]

4 bases: A, G, C, T
20 amino acids

• The representation is much more complex in many ways than in EC
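As a small illustration of the slide above, the sketch below splits the example sequence into three-base codons and counts how many distinct codons four bases allow. The function name `split_codons` is hypothetical, and starting the reading frame at the first base is an assumption made only for this example.

```python
from itertools import product

BASES = "ACGT"

def split_codons(seq):
    """Split a DNA string into consecutive 3-base codons.

    Any trailing partial codon (1 or 2 bases) is dropped.
    """
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

sequence = "TATAAGCTGACTGTCACTGA"  # example sequence from the slide
codons = split_codons(sequence)

# Four bases taken three at a time give 4**3 possible codons,
# which is more than enough to encode 20 amino acids.
all_codons = ["".join(p) for p in product(BASES, repeat=3)]

print(codons)
print(len(all_codons))
```

Because 64 codons map onto 20 amino acids, the genetic code is redundant: several codons encode the same amino acid.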

Page 7: Proteins: Molecular machinery

• Proteins in your muscles allow you to move: myosin and actin

Page 8: Proteins: Molecular machinery

• Enzymes (digestion, catalysis)
• Structure (collagen)

Page 9: Proteins: Molecular machinery

• Signaling (hormones, kinases)
• Transport (energy, oxygen)

Image source: Crane digital, http://www.cranedigital.com/

Page 10: Growth of biological databases

[Chart: GenBank base-pair growth, in millions of base pairs. Source: GenBank]

Year (19xx):  82  83  84  85  86  87  88  89  90  91  92   93   94   95   96   97     98     99
Millions:     1   2   3   5   10  16  24  35  49  72  101  157  217  385  652  1,160  2,009  3,841

[Chart: growth in 3D structures. Source: http://www.rcsb.org/pdb/holdings.html]

Page 11: Applications

• What can we do with all this information?
  Cure diseases: computational drug design
  Model protein structure
  Understand relationships between organisms (phylogenies)

Page 12: Example Case: HIV Protease

1. Exposure & infection
2. HIV enters your cell
3. Your own cell reads the HIV “code” and creates the HIV proteins.
4. New viral proteins prepare HIV for infection of other cells.

http://whyfiles.org/035aids/index.html
© George Eade, Eade Creative Services, Inc.

Page 13: HIV Protease as a drug target

• Many drugs bind to protein active sites.
• This HIV protease can no longer prepare HIV proteins for infection, because an inhibitor is already bound in its active site.

HIV Protease + Peptidyl inhibitor (1A8G.PDB)

Page 14: Drug Discovery

• Target identification
  What protein can we attack to stop the disease from progressing?
• Lead discovery & optimization
  What sort of molecule will bind to this protein?
• Toxicology
  Does it kill the patient? Does it have side effects? Does it get to the problem spots?

Page 15: Drug Development Life Cycle

[Timeline chart: development stages plotted against years 0 to 16]

Discovery (2 to 10 years)
Preclinical testing (lab and animal testing)
Phase I (20-30 healthy volunteers, used to check for safety and dosage)
Phase II (100-300 patient volunteers, used to check for efficacy and side effects)
Phase III (1000-5000 patient volunteers, used to monitor reactions to long-term drug use)
FDA review & approval
Post-marketing testing

$600-700 million!
7 to 15 years!

Page 16: Drug lead screening

5,000 to 10,000 compounds screened
250 lead candidates in preclinical testing
5 drug candidates enter clinical testing
80% pass Phase I
30% pass Phase II
80% pass Phase III
One drug approved by the FDA
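The funnel on this slide can be checked with a little arithmetic. The sketch below multiplies the three phase pass rates quoted on the slide to see roughly how many of the 5 clinical candidates are expected to survive. Treating the pass rates as independent probabilities is an assumption made for illustration, and the variable names are hypothetical.

```python
# Figures are taken from the slide; combining them as independent
# probabilities is an assumption of this sketch.
candidates_entering_clinic = 5
pass_rates = {"Phase I": 0.80, "Phase II": 0.30, "Phase III": 0.80}

expected = float(candidates_entering_clinic)
for phase, rate in pass_rates.items():
    expected *= rate
    print(f"after {phase}: {expected:.2f} candidates expected")

# Fraction of the screened compounds that this represents,
# for the low and high ends of the screening range on the slide.
for screened in (5_000, 10_000):
    print(f"{screened} screened -> {expected / screened:.6f} approved per compound")
```

The product 5 × 0.8 × 0.3 × 0.8 is just under one, which is consistent with the slide's "one drug approved" out of thousands of compounds screened.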

Page 17: Finding drug leads

• Once we have a target, how do we find some compounds that might bind to it?
• The old way: exhaustive screening
• The new way: computational screening!

Page 18: Chemistry 101

• Like everything else in the universe, proteins and drugs are made up of atoms.
• Some atoms, like oxygen, tend to carry a (partial) negative charge.
• Some, like nitrogen in a protonated amine group, tend to be positively charged.
• When the two come together, opposite charges attract, much like opposite poles of magnets.

Page 19: The Goal:

[Figure: a protein and a drug lead]

Page 20: Drug Lead Screening & Docking

• Complementarity
  Shape
  Chemical
  Electrostatic

Page 21: But it gets more complicated

• About 60% of our body weight is water.
• Most proteins are surrounded by water molecules.
• These water molecules interact with protein-drug complexes.

Page 22: Protein-water interactions

• Water has both negatively and positively charged atoms.
• It can bridge gaps between drugs and proteins.

[Figure: a water molecule bridging the protein surface and the ligand]

Page 23: Will the water stay?

• When a drug comes close to a protein, some of the water molecules are displaced.

[Figure: protein surface, ligand, and water molecule]

Page 24: Pattern Recognition Model

[Diagram: sample, then transducer etc., then raw measurement data, then a feature vector (f1, f2, f3, f4, f5) = pattern, with d features]

Page 25: Training and Testing

[Diagram: labeled training data (feature vectors f1 ... f5, each labeled C or N) are used to build a classifier; the classifier then produces a classification/prediction for an unlabeled feature vector f1 ... f5]

Page 26: Temperature Factor (B-Value)

How wiggly is it?
Here a protein (dihydrofolate reductase) is colored by temperature factor.

Page 27: Atomic Density (ADN)

How crowded is it?
The atomic density of this water molecule is 5.

Page 28: Prediction of water molecules

Blue spheres are water molecules predicted to stay in the active site.

Wire mesh spheres are water molecules predicted to be displaced (booted out by the ligand).

Page 29: Nearest Neighbor Classification

[Figure: scatter plot of Feature 1 versus Feature 2 showing class 1 training samples, class 2 training samples, and a new unknown (test) sample]
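The nearest-neighbor idea pictured on this slide can be sketched in a few lines: an unknown sample takes the majority class of its k closest training samples. The function name `knn_classify`, the Euclidean distance, and the toy data are assumptions made for illustration and are not taken from the talk.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority label among the k training samples nearest to query.

    train is a list of (feature_vector, label) pairs.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy two-feature training data (invented for this example)
train = [
    ((1.0, 1.0), "class 1"), ((1.2, 0.8), "class 1"), ((0.9, 1.3), "class 1"),
    ((3.0, 3.2), "class 2"), ((3.1, 2.9), "class 2"), ((2.8, 3.0), "class 2"),
]

label = knn_classify(train, (1.1, 1.1), k=3)
print(label)
```

Using an odd k in a two-class problem avoids tied votes.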

Page 30: Feature Weighted knn

[Figure: two scatter plots of Feature 1 versus Feature 2 showing Class 1, Class 2, and an unknown sample. Panel a: original scale. Panel b: scale of one feature extended.]
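To make the effect in the two panels concrete, the sketch below shows how stretching one feature axis can change which training sample is nearest. The weighted Euclidean distance, the helper names, and the two-point data set are assumptions chosen so that the label flips; they are not the data behind the figure.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance after multiplying each feature axis by its weight."""
    return math.sqrt(sum((w * (x - y)) ** 2 for x, y, w in zip(a, b, weights)))

def nearest_label(train, query, weights):
    """Label of the single nearest training sample under the given weights."""
    return min(train, key=lambda item: weighted_distance(item[0], query, weights))[1]

train = [((0.0, 2.0), "Class 1"), ((3.0, 0.0), "Class 2")]
query = (0.0, 0.0)

# Equal weights: distances are 2.0 (Class 1) and 3.0 (Class 2)
unweighted = nearest_label(train, query, (1.0, 1.0))

# Feature 2 stretched by a factor of 2: distances become 4.0 and 3.0
extended = nearest_label(train, query, (1.0, 2.0))

print(unweighted, extended)
```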

Page 31: GA & knn Interaction

[Diagram: the genetic algorithm maintains a population of weight vectors (W1 W2 W3 W4 W5, ...). A masked weight vector and k are passed to the KNN classifier, which returns a fitness value to the genetic algorithm.]

Fitness: how is it calculated?

Page 32: Weighting and Masking

• How do we sample feature subsets?
  Weight below a threshold value: slow sampling
  Masking: distinct mutation rates, or multiple mask bits
• Intron effect
• Classifier parameters (k) on the chromosome

[Diagram: chromosome layout W1 W2 W3 W4 W5 M1 M2 M3 M4 M5 k, with example gene values 73.2 and 0]
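The chromosome layout on this slide (five weights, five mask bits, then k) can be decoded as in the sketch below. The function name `decode`, the flat-list encoding, and all gene values other than 73.2 and 0 are hypothetical; the slide does not specify how the genes are stored.

```python
def decode(chromosome, n_features):
    """Split a flat chromosome [W1..Wd, M1..Md, k] into masked weights and k.

    A mask bit of 0 removes the feature regardless of its weight.
    """
    weights = chromosome[:n_features]
    masks = chromosome[n_features:2 * n_features]
    k = int(chromosome[2 * n_features])
    masked = [w if m else 0.0 for w, m in zip(weights, masks)]
    return masked, k

# W1..W5, then M1..M5, then k
chromosome = [73.2, 12.0, 5.5, 40.1, 9.9, 0, 1, 1, 0, 1, 7]
masked_weights, k = decode(chromosome, n_features=5)

print(masked_weights, k)
```

With a separate mask bit, a single mutation can switch a feature off or back on while its weight is preserved, which samples feature subsets faster than waiting for a weight to drift below a threshold.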

Page 33: The Cost Function

• We can direct the search toward any objective.
  Classification accuracy
  Class balance
  Feature subset parsimony (reduce d)
• The GA minimizes the cost function:

$$\mathrm{cost}(\mathbf{w}, k) = C_{acc} \cdot \mathrm{err}(\mathbf{w}, k) + C_{pars} \cdot \mathrm{nonzero}(\mathbf{w}) + C_{vote} \cdot \mathrm{incorrect\_votes}(\mathbf{w}, k) + C_{bal} \cdot \mathrm{bal}(\mathbf{w}, k)$$
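The cost function on this slide is a weighted sum of four terms: error rate, number of nonzero weights, incorrect votes, and class balance. A minimal sketch follows; the coefficient values and the sample inputs are hypothetical, since the slide does not give the values of C_acc, C_pars, C_vote, or C_bal.

```python
def cost(err, n_nonzero, incorrect_votes, balance,
         c_acc=1.0, c_pars=0.1, c_vote=0.01, c_bal=0.5):
    """Weighted sum of the four objectives; lower is better.

    The default coefficients are placeholders chosen for illustration.
    """
    return (c_acc * err
            + c_pars * n_nonzero
            + c_vote * incorrect_votes
            + c_bal * balance)

# Example: 20% error, 3 features kept, 10 incorrect votes, 0.05 class imbalance
total = cost(err=0.20, n_nonzero=3, incorrect_votes=10, balance=0.05)
print(round(total, 3))
```

Raising `c_pars` relative to `c_acc` pushes the search toward smaller feature subsets at the expense of accuracy.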

Page 34: UCI Data Set Results

Hepatitis     Train/Tune   Test    Features
Bayes         85.3         65.7    19
Knn           87.1         73.4    19
GA/knn        86.0         69.6    8.1
EP/knn        87.2         73.1    8.9

Wine          Train/Tune   Test    Features
Bayes         98.8         94.7    13
Knn           94.9         94.3    13
GA/knn        99.7         94.8    6.0
EP/knn        99.5         93.2    6.2

Ionosphere    Train/Tune   Test    Features
Bayes         93.0         90.1    34
Knn           83.4         93.2    34
GA/knn        95.0         91.9    8.5
EP/knn        93.2         92.3    13.5

Pima          Train/Tune   Test    Features
Bayes         76.10        64.60   8
Knn           73.50        71.50   8
GA/knn        80.00        72.10   3.1
EP/knn        79.10        72.90   3.9

• Typically we see modest gains in classification accuracy, with significant reduction in features.

Blake, C. L. and Merz, C. J. (1998). UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html

Page 35: Classical Methods & SFFS

• Thyroid data: best feature selection performance within 1% of best accuracy

Method                    Train/Tune   Test     Features
GA/knn                    98.5%        98.4%    3
SFFS/knn                  98.0%        98.1%    6
Linear Discriminant       93.8%        88.4%    21
Quadratic Discriminant    89.7%        95.3%    21
Nearest Neighbor          100.0%       96.1%    21
Bayes (independent)       97.1%        92.4%    21
Bayes (second order)      97.7%        98.5%    21
Neural Net (Back prop)    99.5%        99.3%    21
Predictive Value Max.     99.8%        99.4%    21
CART Tree                 99.8%        84.9%    21

• Appendicitis data: best feature selection within 1% of best accuracy

Method                    Train/Tune   Test     Features
GA/knn                    90.4%        90.6%    2
SFFS/knn                  88.5%        91.4%    3
Linear Discriminant       88.7%        86.8%    7
Quadratic Discriminant    79.3%        73.6%    7
Nearest Neighbor          100.0%       82.1%    7
Bayes (independent)       88.7%        83.0%    7
Bayes (second order)      95.3%        81.1%    7
Neural Net (Back prop)    90.0%        85.8%    7
Predictive Value Max.     91.5%        89.6%    7
CART Tree                 90.0%        84.9%    7

Weiss, S. and Kapouleas, I. (1990). An Empirical Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods. Morgan Kaufmann.

Page 36: Feature Extraction Framework

[Diagram: the genetic algorithm passes features and classifier parameters to the classifier; the classifier returns a fitness based on accuracy]

Page 37: Reducing Computation Time

• Most of the compute time is spent calculating distances and finding nearest neighbors. Branch and bound knn [1] scales:
  • linearly with d,
  • polynomially with n: O(n^c), 1.0 < c < 2.0
• Can we use a faster classifier?

[1] Fukunaga, K. and Narendra, P. M. (1975). A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 750–753.

Page 38: The Bayes Classifier

• Properties are well understood and thoroughly explored in the literature
• Training data are summarized, so classification of test samples is rapid
• Provably optimal when the multivariate feature distribution is known for each class

Page 39: Class-conditional Distributions

[Figure: class-conditional densities $P(x \mid \omega_1)$ and $P(x \mid \omega_2)$, plotted as P(x) versus x]

• MLE optimal when
  Equal prior probabilities
  Equal error costs
• Generalizes to d dimensions

Page 40: Multiple Dimensions

[Figure: class-conditional densities $P(x_i \mid \omega_j)$ for each feature and class: $P(x_1 \mid \omega_1)$, $P(x_1 \mid \omega_2)$, $P(x_2 \mid \omega_1)$, $P(x_2 \mid \omega_2)$]

Page 41: The “Naïve” Bayes Classifier

• We now have $P(x_i \mid \omega_j)$ for each feature i and each class j
• Naïve approach: assume all features are independent

$$P(\mathbf{x} \mid \omega_j) = P(x_1 \mid \omega_j)\, P(x_2 \mid \omega_j) \cdots P(x_d \mid \omega_j)$$

• This assumption is almost always false
• As long as monotonicity holds, the decision rule is still valid
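A minimal sketch of the naïve Bayes idea on this slide: multiply the per-feature class-conditional probabilities for each class (here summed as logs, which preserves the ordering) and pick the class with the larger score. The lookup-table representation, the function name, and the toy probabilities are all assumptions made for illustration; class priors are passed in explicitly and set equal.

```python
import math

def naive_bayes_scores(x, cond_prob, priors):
    """Return log(P(class) * product_i P(x_i | class)) for each class.

    cond_prob[c][i] maps a value of feature i to P(x_i = value | class c).
    """
    scores = {}
    for c, prior in priors.items():
        log_p = math.log(prior)
        for i, value in enumerate(x):
            log_p += math.log(cond_prob[c][i][value])
        scores[c] = log_p
    return scores

# Toy tables for two discretized features (invented numbers)
cond_prob = {
    "w1": [{"lo": 0.8, "hi": 0.2}, {"lo": 0.6, "hi": 0.4}],
    "w2": [{"lo": 0.3, "hi": 0.7}, {"lo": 0.1, "hi": 0.9}],
}
priors = {"w1": 0.5, "w2": 0.5}

scores = naive_bayes_scores(("hi", "hi"), cond_prob, priors)
decision = max(scores, key=scores.get)
print(decision)
```

Working in log space avoids numerical underflow when d is large, and because the logarithm is monotonic the winning class is unchanged.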

Page 42: Unequal Prior Probabilities

• When we know the prior probabilities for the classes, we can use Bayes rule:

$$P(\omega_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{\sum_{j=1}^{C} P(\mathbf{x} \mid \omega_j)\, P(\omega_j)}$$

Prior probability: $P(\omega_i)$

Bayes decision rule: assign to the class for which the posterior probability is highest

Page 43: Hybridizing the Bayes Classifier

• Unfortunately the Bayes classifier is invariant to feature weighting

[Figure: histogram of proportion of training samples versus feature value, drawn once on a 0 to 200+ axis and once on a 0 to 20+ axis. The bin probability is the same on both scales:]

P(80 < x < 100) = 0.045
P(8 < x < 10) = 0.045
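The invariance shown on this slide can be reproduced with a histogram estimate: rescaling a feature rescales the bin edges with it, so the proportion of samples in the corresponding bin is unchanged. The sample values and the helper name below are invented for illustration and do not reproduce the 0.045 figure from the slide.

```python
def bin_proportion(values, low, high):
    """Fraction of values falling in the half-open bin [low, high)."""
    inside = sum(1 for v in values if low <= v < high)
    return inside / len(values)

values = [12, 35, 47, 81, 95, 110, 150, 88, 60, 199]

# Scale the feature down by a factor of 10, as in the second axis on the slide
scaled = [v / 10 for v in values]

p_original = bin_proportion(values, 80, 100)
p_scaled = bin_proportion(scaled, 8, 10)

print(p_original, p_scaled)
```

Because the estimated probabilities do not change, a simple linear weight on a feature has no effect on a Bayes classifier, unlike a distance-based classifier such as knn.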

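Page 44: Bayes Discriminant Function

• Bayes Decision Rule:

$$\text{given } \mathbf{x}, \text{ decide } \omega_i \text{ if: } P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x}) \quad \forall j \neq i$$

• Two-class Discriminant Function:

$$g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \omega_1) P(\omega_1) - P(\mathbf{x} \mid \omega_2) P(\omega_2)}{\sum_{i=1}^{2} P(\mathbf{x} \mid \omega_i) P(\omega_i)}$$

Since the denominator is positive, the sign of g is given by the numerator alone:

$$g(\mathbf{x}) = P(\mathbf{x} \mid \omega_1) P(\omega_1) - P(\mathbf{x} \mid \omega_2) P(\omega_2)$$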

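Page 45: Naïve Bayes Discriminant

$$g(\mathbf{x}) = P(\mathbf{x} \mid \omega_1) P(\omega_1) - P(\mathbf{x} \mid \omega_2) P(\omega_2)$$

• Independence Assumption:

$$g(\mathbf{x}) = P(x_1 \mid \omega_1) P(x_2 \mid \omega_1) \cdots P(x_d \mid \omega_1) P(\omega_1) - P(x_1 \mid \omega_2) P(x_2 \mid \omega_2) \cdots P(x_d \mid \omega_2) P(\omega_2)$$

$$a > b \iff \log(a) > \log(b)$$

$$g(\mathbf{x}) = \left[ \log P(\mathbf{x} \mid \omega_1) + \log P(\omega_1) \right] - \left[ \log P(\mathbf{x} \mid \omega_2) + \log P(\omega_2) \right]$$

Under the independence assumption, $\log P(\mathbf{x} \mid \omega_j) = \sum_{i=1}^{d} \log P(x_i \mid \omega_j)$.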

Page 46: A Parameterized Discriminant

$$P^{*}(\omega_i \mid \mathbf{x}) = C_1 \log P(x_1 \mid \omega_i) + C_2 \log P(x_2 \mid \omega_i) + \cdots + C_d \log P(x_d \mid \omega_i) + \log P(\omega_i)$$

• C1, C2 … Cd are optimized by an evolutionary algorithm.
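The sketch below illustrates how per-feature coefficients can change a Bayes decision, assuming the parameterized discriminant has the form sum_j C_j log P(x_j | ω_i) + log P(ω_i). That form, the function names, and the toy probabilities are assumptions made for this example; in the talk the coefficients are found by an evolutionary algorithm rather than set by hand.

```python
import math

def weighted_log_discriminant(marginals, prior, coefficients):
    """Compute sum_j C_j * log P(x_j | class) + log P(class)."""
    weighted = sum(c * math.log(p) for c, p in zip(coefficients, marginals))
    return weighted + math.log(prior)

# P(x_j | class) for one test sample with two features (invented numbers)
marginals = {"class 1": (0.2, 0.4), "class 2": (0.7, 0.1)}
priors = {"class 1": 0.5, "class 2": 0.5}

def decide(coefficients):
    """Pick the class with the largest parameterized discriminant value."""
    return max(
        marginals,
        key=lambda c: weighted_log_discriminant(marginals[c], priors[c], coefficients),
    )

plain = decide((1.0, 1.0))   # ordinary naive Bayes: 0.08 versus 0.07
masked = decide((1.0, 0.0))  # feature 2 switched off: 0.2 versus 0.7

print(plain, masked)
```

A coefficient of zero removes a feature entirely, so the same encoding supports both feature weighting and feature selection.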

Page 47: An Alternative Discriminant

• Sum of weighted marginal probabilities

$$P^{*}(\omega_i \mid \mathbf{x}) = C_1 P(x_1 \mid \omega_i) + C_2 P(x_2 \mid \omega_i) + \cdots + C_d P(x_d \mid \omega_i) + P(\omega_i)$$

Page 48: Conserved vs. Non-Conserved

Bootstrap Accuracy (%)   Balance              Feature Weights
Disp    Cons    Total    Mean   StDev   K     ADN     AHP     BVAL    HBDP    HBDW    MOB     ABVAL   NBVAL
65.44   62.96   64.20    3.88   2.94    65    0.000   0.000   0.413   0.135   0.137   0.315   0.000   0.000
65.16   62.08   63.62    3.80   3.19    29    0.000   0.000   0.667   0.000   0.000   0.333   0.000   0.000
65.56   60.77   63.16    5.35   3.55    25    0.000   0.000   0.463   0.000   0.000   0.323   0.000   0.214
62.08   64.14   63.11    4.15   2.75    37    0.000   0.000   0.891   0.000   0.000   0.109   0.000   0.000
63.49   62.52   63.00    3.52   2.56    77    0.000   0.000   0.308   0.163   0.225   0.304   0.000   0.000
65.30   60.45   62.87    5.36   3.72    17    0.000   0.000   0.841   0.000   0.000   0.159   0.000   0.000
58.76   66.19   62.47    7.74   4.24    97    0.459   0.291   0.250   0.000   0.000   0.000   0.000   0.000
61.79   62.94   62.36    3.27   2.47    27    0.000   0.371   0.629   0.000   0.000   0.000   0.000   0.000
62.86   61.45   62.16    3.79   2.49    23    0.000   0.000   0.372   0.240   0.000   0.203   0.000   0.184
62.04   62.26   62.15    3.50   2.34    7     0.000   0.000   0.571   0.156   0.000   0.273   0.000   0.000
60.68   63.36   62.02    4.26   3.06    87    0.000   0.118   0.558   0.323   0.000   0.000   0.000   0.000
62.68   60.76   61.72    4.14   3.30    17    0.000   0.252   0.352   0.397   0.000   0.000   0.000   0.000
62.93   60.47   61.70    4.20   3.25    67    0.000   0.000   0.421   0.000   0.000   0.579   0.000   0.000
61.00   62.16   61.58    3.58   2.50    13    0.018   0.388   0.441   0.000   0.000   0.000   0.000   0.153
60.40   62.52   61.46    3.84   2.79    63    0.000   0.000   0.227   0.000   0.773   0.000   0.000   0.000
61.99   60.86   61.42    3.18   2.45    19    0.000   0.051   0.417   0.058   0.000   0.474   0.000   0.000
61.13   61.59   61.36    3.39   2.55    15    0.000   0.000   0.392   0.293   0.000   0.207   0.000   0.108
57.71   64.60   61.16    7.14   3.81    19    0.000   0.000   1.000   0.000   0.000   0.000   0.000   0.000
62.33   58.90   60.62    4.58   3.57    57    0.000   0.000   0.881   0.000   0.000   0.000   0.000   0.119
60.65   59.95   60.30    2.78   2.24    75    0.000   0.317   0.000   0.000   0.000   0.000   0.000   0.683
59.83   60.68   60.25    3.40   2.48    69    0.000   0.000   0.000   0.336   0.000   0.000   0.000   0.664

Page 49: Cosine-based kNN Classifier

[Figure: scatter plot of Feature 1 versus Feature 2 showing Class A, Class B, and a test pattern]

k=5 classification: Among 5 points with the smallest angles with respect to the test point, 3 are class A; the test point is labeled as class A.

• A cosine similarity metric finds k points with the most similar angles. The cosine between 2 vectors xi and xj is defined as:

$$\cos(\mathbf{x}_i, \mathbf{x}_j) = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\|\mathbf{x}_i\| \, \|\mathbf{x}_j\|}$$

• Once the k most similar points have been identified, the class label is assigned according to the function:

$$q(\mathbf{x}) = \sum_{i=1}^{k} \cos(\mathbf{x}, \mathbf{x}_i)\, c(\mathbf{x}_i)$$

where c(xi) = 1 if xi belongs to the positive class; -1 if xi belongs to the negative class.

If q is positive, the query point is assigned to the positive class, otherwise it is assigned to the negative class.
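The cosine-based rule described on this slide can be sketched as follows: find the k training points with the largest cosine similarity to the query, sum the similarities signed by class, and take the sign of the result. The function names and the toy data are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def cos_knn(train, query, k):
    """Return (q, label) for the query point.

    train is a list of (vector, c) pairs with c = +1 or -1.
    q is the sum of cos(query, x_i) * c(x_i) over the k most similar points.
    """
    nearest = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)[:k]
    q = sum(cosine(vec, query) * c for vec, c in nearest)
    return q, (1 if q > 0 else -1)

# Toy data: positive class lies near the feature-1 axis, negative near feature 2
train = [
    ((1.0, 0.1), 1), ((1.0, 0.2), 1), ((1.0, 0.5), 1),
    ((0.1, 1.0), -1), ((0.2, 1.0), -1),
]

q, label = cos_knn(train, (1.0, 0.3), k=3)
print(round(q, 3), label)
```

Because only angles matter, multiplying a vector by a positive constant does not change its similarity to the query, which is why the choice of origin becomes important for this classifier.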

Page 50: Feature Extraction Techniques

Shifting the origin in the search space may affect classification.

[Figure: Class A, Class B, and a test pattern plotted on Feature 1 versus Feature 2, before and after an origin shift]

After the origin is shifted above, the test point, originally labeled class A, is relabeled class B.

Assigning different weights to each feature may also affect classification, to a lesser extent.

[Figure: the same kind of plot before and after Feature 2 is extended]

After feature 2 is extended above, the test point, originally labeled class B, is relabeled class A.
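The origin-shift effect can be reproduced with two training points. In the sketch below the cosine is measured from a movable point of reference, and moving that point changes which training sample is most similar to the test point. The helper names and coordinates are invented so that the label flips from class A to class B, as in the first figure; they are not the data from the talk.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def shift(v, offset):
    """Express v relative to a new origin located at offset."""
    return tuple(x - o for x, o in zip(v, offset))

def most_similar_label(train, query, offset):
    """Label of the training point with the smallest angle to the query,
    with angles measured from the point of reference given by offset."""
    q = shift(query, offset)
    return max(train, key=lambda item: cosine(shift(item[0], offset), q))[1]

train = [((4.0, 2.0), "class A"), ((2.0, 3.0), "class B")]
test_point = (2.0, 1.0)

before = most_similar_label(train, test_point, offset=(0.0, 0.0))
after = most_similar_label(train, test_point, offset=(2.0, 0.0))

print(before, after)
```

From the original origin the test point lies on the same ray as the class A point; seen from (2, 0) it lies on the same ray as the class B point instead.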

Page 51: GA/Classifier Hybrid Architecture

[Diagram: the genetic algorithm holds a population of feature weights, offsets, and K, each individual laid out as W1 W2 ... W8 O1 O2 ... O8 K. The pairs (W1, O1), (W2, O2), ... and K are passed to the cosine KNN classifier, which returns a fitness value to the genetic algorithm.]

Weight vector: weights to use for each feature axis during classification
Offset vector: feature offsets for the cosine point of reference during classification
Fitness: based on the number of correct predictions using the weight vector and the number of masked features
K is also optimized

Page 52: Population-Adaptive Mutation

[Figure: two min-to-max gene ranges, one with a wide mutation range and one with a narrow mutation range]

When a feature is chosen for mutation, its range for possible mutation depends on the variance of that feature across the genetic algorithm’s population. In early generations, variance is high, so the range is larger.

Later, as the population begins to converge, variance decreases and the range is smaller.
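One way to realize the idea on this slide is to draw the mutation step from a range whose width follows the spread of that gene across the current population. In the sketch below the half-width equals the population standard deviation times a `scale` factor; that particular mapping from variance to range, the uniform step, and the function name are assumptions, since the slide does not give the exact formula.

```python
import random
import statistics

def adaptive_mutate(population, individual, gene, rng, scale=1.0):
    """Perturb one gene by a uniform step bounded by the gene's population spread.

    Returns the mutated copy and the half-width of the range that was used.
    """
    spread = statistics.pstdev([ind[gene] for ind in population])
    half_width = scale * spread
    mutated = list(individual)
    mutated[gene] += rng.uniform(-half_width, half_width)
    return mutated, half_width

rng = random.Random(1)

# Early generation: gene values are widely scattered
early = [[0.0], [10.0], [5.0], [2.0]]
# Late generation: the population has nearly converged
late = [[4.9], [5.0], [5.1], [5.0]]

mutated_early, width_early = adaptive_mutate(early, early[0], 0, rng)
mutated_late, width_late = adaptive_mutate(late, late[0], 0, rng)

print(round(width_early, 3), round(width_late, 3))
```

The effect is a search that explores broadly at first and fine-tunes automatically as the population converges, without a hand-set mutation schedule.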

Page 53: Probe Site Generation

Aspartic protease (2apr) with crystallographically observed (yellow) and computer-generated (green) water molecules.

Page 54: Standard Classifier Results

                     Accuracy
Classifier           Total    Fake     Real     Balance
Logistic             69.331   65.495   73.164   7.668
NeuralNetwork        69.293   66.003   72.582   6.579
VotedPerceptron      69.246   66.754   71.737   4.983
J48                  68.992   68.745   69.239   0.494
Euclidean kNN/GA     68.18    67.75    68.6     0.85
KStar                67.941   62.378   73.502   11.124
ADTree               66.476   59.429   73.521   14.092
IBk (k = 3)          65.762   65.177   66.347   1.171
NaiveBayes           61.414   44.741   78.085   33.344
DecisionStump        56.061   24.08    88.038   63.958
HyperPipes           50.709   99.981   1.446    98.535

All of the above classifiers are from the WEKA collection of machine learning algorithms and use 10-fold cross validation, except for the Euclidean kNN/GA classifier, which is another EC-optimized classifier from Michael Raymer’s research; it uses bootstrap validation.

Page 55: cosKnn/GA Classifier Results

Feature columns on the slide: ADN, AHP, HBDP, HBDW, ABVAL, NBVAL. Each entry is a (weight, offset) pair, listed in slide order; cells for masked features were blank on the slide.

Bootstrap Accuracy (%)     Balance
Total    Fake     Real     Mean     StDev   K    Feature Weights, Offsets
69.910   67.778   72.041   4.359    2.640   80   (.252, 11.843) (.177, 7.695) (.352, 11.771) (.105, 1.306) (.113, -2.749)
69.477   65.791   73.162   7.375    2.996   67   (.271, 8.422) (.253, 15.000) (.154, 14.897) (.113, 15.000) (.209, 12.885)
69.423   64.382   74.465   10.083   3.054   45   (.242, 5.356) (.222, -9.702) (.205, 5.863) (.165, -15.000) (.166, -9.638)
69.311   68.901   69.720   2.346    1.601   83   (.306, 2.278) (.297, 14.794) (.134, 0.050) (.161, 13.334) (.102, 1.305)
69.227   68.626   69.828   2.651    1.900   47   (.375, 2.689) (.102, 0.113) (.217, 9.788) (.164, 6.253) (.142, -4.472)
69.218   69.720   68.716   2.461    1.863   46   (.250, 0.070) (.131, 10.949) (.161, 0.976) (.167, 14.160) (.126, 14.413) (.165, 4.982)
69.191   67.578   70.803   3.655    2.267   63   (.170, -0.957) (.198, 8.989) (.141, -2.203) (.153, 13.966) (.126, 7.411) (.213, -1.678)
69.184   68.624   69.745   2.611    1.863   39   (.252, 3.408) (.189, 7.372) (.331, 9.834) (.228, 12.698)
69.091   68.962   69.221   2.400    1.751   84   (.234, 9.261) (.130, 15.000) (.093, 10.208) (.265, 14.664) (.099, 2.815) (.178, -1.226)
68.954   68.484   69.423   2.559    1.881   43   (.262, 3.861) (.262, 13.153) (.151, 0.409) (.229, 15.000) (.095, -5.000)
68.938   66.444   71.432   5.033    2.874   39   (.159, 5.596) (.267, 12.055) (.101, -1.87) (.144, 10.342) (.108, -0.715) (.221, -2.710)
68.932   69.169   68.695   2.707    2.098   38   (.175, -0.606) (.140, 13.072) (.159, 7.227) (.242, -0.819) (.283, 11.729)
68.926   65.711   72.141   6.430    2.764   70   (.329, -1.447) (.182, 1.703) (.177, -8.284) (.087, 10.891) (.225, -11.527)
68.876   67.415   70.338   3.290    2.196   58   (.244, 3.784) (.187, 10.179) (.243, 10.528) (.166, 15.000) (.159, 8.734)
68.811   69.254   68.367   2.707    1.936   33   (.386, 15.000) (.096, -8.129) (.131, 14.271) (.388, -15.000)

Page 56: Results

Page 57: Protein Structure Modeling

• The holy grail of computational biology: given the sequence of amino acids in a protein, how will it fold?
• TATAAGCTGACTGTCACTGA

Page 58: Protein Structure Modeling

[Diagram: modeling pipeline for a target protein, with automated and expert-driven stages. Labeled steps: fold recognition (profile vs. profile); target-to-template alignment (align, score) against structure 1, structure 2, ..., structure x, ..., structure y; optimize alignment; key residue comparison; fragment selection; evaluate the core; final model.]

Page 59: The first four strands of OB-folds...

Page 60: The fifth strand, clustered

Page 61: The second helix, clustered

Page 62: Selecting a Model

• A genetic algorithm evolves a population of models
• Start with ~50: the original 20 and 30 random
• Double the population size by mutation and cross-over
• Test each of the new structures and dismiss half
• Fitness function
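The loop outlined on this slide (double the population by mutation and cross-over, evaluate, dismiss half) can be sketched generically. Everything concrete below is an assumption made for illustration: individuals are short lists of numbers rather than protein models, the fitness is a toy function, and "dismiss half" is read as keeping the better half of parents and offspring combined.

```python
import random

def evolve(population, fitness, mutate, crossover, generations, rng):
    """Each generation: create as many offspring as parents, then keep the fitter half."""
    for _ in range(generations):
        size = len(population)
        offspring = []
        while len(offspring) < size:
            a, b = rng.sample(population, 2)
            offspring.append(mutate(crossover(a, b, rng), rng))
        combined = population + offspring  # population size is doubled
        combined.sort(key=fitness, reverse=True)
        population = combined[:size]       # dismiss the worse half
    return population

# Toy stand-ins for model scoring and variation operators
def fitness(ind):
    return -sum(x * x for x in ind)

def mutate(ind, rng):
    child = list(ind)
    child[rng.randrange(len(child))] += rng.gauss(0.0, 0.1)
    return child

def crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

rng = random.Random(0)
initial = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(50)]
final = evolve(initial, fitness, mutate, crossover, generations=20, rng=rng)

print(max(map(fitness, initial)), max(map(fitness, final)))
```

Keeping parents in the selection pool means the best model found so far is never lost, so the best fitness cannot decrease from one generation to the next.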