Upload
butest
View
397
Download
1
Embed Size (px)
DESCRIPTION
Citation preview
Introduction to Bioinformatics9. Machine Learning for
Protein Structure Prediction #1
Course 341
Department of Computing
Imperial College, London
© Simon Colton
Remember the Scenario
We have found a gene in mice– Which when active makes them immune to a disease
The gene codes for a protein, the protein has a shape and the shape dictates what it does
Humans share 96% of their genes with mice
So, what does the human protein look like?
The Database Approach
If two sequences are sequentially similar– Then they are very likely to code for similar proteins
Find the best match for the mouse gene– In terms of sequences– From a large database of individual human genes– Or from a database of families of genes
Infer protein structure from knowledge of matched genes – If lucky, a structure of one of them may already be known
There is another way…
Machine learning: general set of techniques– For teaching a computer to make predictions– By observing given correct predictions (being trained)
Special type of prediction– Classification of objects into classes
E.g., images into faces/cars/landscapes E.g., drugs into toxic/non-toxic (binary classification)
We want to predict a protein’s structure– Given its sequence
A Good Approach
Look at regions of a protein– i.e., lengths of residues
Define ways to describe the regions– So that we can infer the structure of a protein
From a description of all its regions
Learn methods for predicting:– What type of region a particular residue will be in
Apply this to a protein sequence– To find contiguous regions with same description– Put regions together to predict entire structure
For example
G A G D G A N A A A
Alpha Alpha Alpha Alpha Inter Inter Beta Beta Beta Beta
Trained Predictor
Alpha Helix Beta Sheet
Further Processing
Two Main Questions
How do we describe protein structures?– What are alpha helices and beta-sheets?– Covered in the next lecture
How do we train our predictors?– Covered in this lecture (and the start of the next…)
Machine Learning in a Nutshell
Examples in
Predictor out
Learning is by example– More examples, better predictors– For some methods, the examples are used once– For other methods, they are used repeatedly
Machine Learning Considerations
What is the problem for the predictor to address?
What is the nature of our data?
How will we represent the predictor?
How will we train the predictor?
How will we test how good the predictor is?
Types of Learning Problemsin Bioinformatics
Class membership– e.g., predictive toxicology
Prediction of sequences– e.g., sequences of protein sub-structures
Classification hierarchies– e.g., folds, families, super-families
Shape descriptions– e.g., binding site descriptions
Temporal models– e.g., activity of cells, metabolic pathways
Learning Data
Data comes in many forms, in particular:– Objects (to be classified/predicted for)– Classifications/predictions of objects– Features of objects (to use in the prediction)
Problems with data– Imprecise information (e.g., badly recorded data)– Irrelevant information (e.g., additional features)– Incorrect information (e.g., wrong classifications)– Missing classifications– Missing features for sets of objects
Types of Representations
Logical– Decision trees, grammars, logic programs– Symbolic, understandable representations
Probabilistic– Neural networks, Hidden Markov Models, SVMs– Mathematical functions, not easy to understand
Mixed– Bayesian Networks, Stochastic Logic Programs– Have advantages of both, more difficult to work with
Advantages of Representations
Probabilistic models:– Can handle noisy and imprecise data– Useful when there is a notion of uncertainty (in data/hypothesis)– Well-founded (300 years of development)– Good statistical algorithms for estimation
Logical models– Richness of description– Extensibility - probabilities, sequences, space, time, actions– Clarity of results– Well-founded (2400 years of development)
Decision Tree Representations
Input is a set of features– Describing an example/situation
Many “if-then” choices– Leaves are decision
Logical representation:– “If then” is implication– Branches are conjunctions– Different branches comprise
A disjunction
Artificial Neural Networks
Layers of nodes– Input is transformed into
numbers– Weighted averages are fed into
nodes High or low numbers come
out of nodes– A Threshold function
determines whether high or low Output nodes will “fire” or not
– Determines classification For an example
Logic Program Representations
Logic programs are a subset of first order logic They consist of sets of Horn clauses
– Horn clause: A conjunction of literals implying a single literal
Can easily be interpreted At the heart of the Prolog programming language
Learning Decision Trees
Problem: what feature do nodes in the tree test?– And what happens for each case
ID3 algorithm:– Uses a notion of “Information gain”– Based on entropy: how (dis)organised data is– Chooses the node with the highest information gain
As the node to add to the tree next
– Then restricts examples for next node
Learning Artificial Neural Networks
First problem: layer structure– Usually done through trial and error
Main problem: choosing the weights– Uses a back-propagation algorithm to train them
Each example is given– If currently correctly classified, that’s fine– If not, the errors from the output are passed back
Propagated in order to change the weights throughout Only very small changes are made (avoid un-doing good work)
Once all examples have been given– We start again, until some termination conditions (accuracy) met– Often requires thousands of such training ‘epochs’
Learning Logic Programs
A notion of generality is used– One logic program is more general than another
If one follows from another (sort of) [subsumption]
A search space of logic programs is defined– Constrained using a language bias
Some programs search from general to specific– Using rules of deduction to go between sentences
Other programs search from specific to general– Using inverted rules of deduction
Search is guided by:– Performance of the LP with respect to classifying training examples– Information theoretic calculations to avoid over-specialisations
Testing Learned Predictors #1
Imperative to test on unseen examples– Cannot report accuracy on examples which have been used to
train the predictor, because the results will be heavily biased
Simple method: Hold back– When number of examples > 200 (roughly)– Split into a training set and a test set (e.g., 80%/20%)– Never let the training algorithm see the test set– Report the accuracy of the predictor on the test set only– Have to worry about statistics with smaller numbers
This kind of testing came a little late to bioinformatics– Beware conclusions drawn about badly tested predictors
N-Fold Cross Validation
Leave one out– For m < 20 examples– Train on m-1 examples, test predictor on left out example– Do this for every example and report the average accuracy
N-fold cross validation– Randomly split into n mutually exclusive sets (partitions)– For every set S
Train using all examples from the other n-1 sets Test predictor on S, record the accuracy
– Report the average accuracy over all the sets– 10-fold cross validation is common
Testing Learned Predictors #2
Often different consideration for different contexts– E.g., false positives/negatives in medical diagnosis
Confusion matrix– For binary prediction tasks
Predicted F Predicted T
number = anumber = b
(false pos)
number = c
(false neg)number = d
Actually F
Actually T
Let t = a+b+c+d Predictive accuracy = (a+d)/t Majority class = max ((a+b)/t, (c+d)/t) Precision = Selectivity = d/(b+d) Recall = Sensitivity = d/(c+d)
Comparing Learning Methods
A very simple method: Majority class predictor – Predict everything to be in the majority class– Trained predictors must beat this to be credible
N-fold cross validation results are compared– To show an advance in predictor technology
However, accuracy is not the only consideration– Speed, memory and comprehensibility
Overfitting
It’s easy to over-train predictors If a predictor is substantially better for the training set than
the test set, it is overfitting– Essentially, it has memorised aspects of the examples, rather than
generalising properties of them– This is bad: think of a completely new example– Individual learning schemes have coping methods
Easy general approach to avoiding overfitting:– Maintain a validation set to perform tests on during training
When performance on the validation set degrades, stop learning– Be careful of blips in predictive accuracy (leave a while, then come back)
Note: this shouldn’t be used as the testing set