millicent-wright
COMPUTATIONAL BIOLOGY
OUTLINE
Proteins
DNA
RNA
Genetics and evolution
The Sequence Matching Problem
RNA Sequence Matching
Complexity of the Algorithms
DEFINITION
Computational biology encompasses all computational methods and theories applicable to molecular biology, along with computer-based techniques for solving biological problems.
PROTEINS
Building blocks of living organisms
Large molecules composed of sequences of amino acids
There are 20 amino acids, which are divided into classes:
hydrophobic (h-phob)
hydrophilic (h-phil)
polar (pos, neg), i.e. positively or negatively charged
Amino acid      Sym  Class     Amino acid      Sym  Class
Alanine         A    h-phob    Leucine         L    h-phob
Arginine        R    pos       Lysine          K    pos
Asparagine      N    h-phil    Methionine      M    h-phob
Aspartic acid   D    neg       Phenylalanine   F    h-phob
Cysteine        C    h-phil    Proline         P    h-phob
Glutamine       Q    h-phil    Serine          S    h-phil
Glutamic acid   E    neg       Threonine       T    h-phil
Glycine         G    h-phob    Tryptophan      W    h-phob
Histidine       H    pos       Tyrosine        Y    h-phil
Isoleucine      I    h-phob    Valine          V    h-phob
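The table can be held as a simple lookup structure. The following is a minimal sketch; the names AMINO_ACID_CLASS and classify are illustrative assumptions, not part of the slides.

```python
# Class of each amino acid, keyed by its one-letter symbol (taken from the table).
AMINO_ACID_CLASS = {
    "A": "h-phob", "R": "pos",    "N": "h-phil", "D": "neg",    "C": "h-phil",
    "Q": "h-phil", "E": "neg",    "G": "h-phob", "H": "pos",    "I": "h-phob",
    "L": "h-phob", "K": "pos",    "M": "h-phob", "F": "h-phob", "P": "h-phob",
    "S": "h-phil", "T": "h-phil", "W": "h-phob", "Y": "h-phil", "V": "h-phob",
}


def classify(protein):
    """Return the class of every residue in a protein written in one-letter symbols."""
    return [AMINO_ACID_CLASS[symbol] for symbol in protein]


# Usage: a hypothetical four-residue fragment.
print(classify("MKDS"))
```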
DNA
Blueprint of living organisms
DNA is composed of two strands held together by weak hydrogen bonds
Each strand is a sequence of nucleotides
DNA has four bases, which are classified into two chemical types
Base      Symbol  Type
Adenine   A       Purine
Thymine   T       Pyrimidine
Cytosine  C       Pyrimidine
Guanine   G       Purine
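The two strands can be related in code. This sketch assumes the standard base pairing (A with T, C with G), which the slides do not state explicitly; the names PAIR, PURINES, base_type and reverse_complement are illustrative.

```python
# Standard base pairing between the two strands (an assumption added here).
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}
PURINES = {"A", "G"}


def base_type(base):
    """Chemical type of a DNA base."""
    return "Purine" if base in PURINES else "Pyrimidine"


def reverse_complement(strand):
    """Sequence of the opposite strand, read in its own direction."""
    return "".join(PAIR[base] for base in reversed(strand))


print(reverse_complement("ACCTG"))
```

Under this pairing, every bonded pair joins one purine to one pyrimidine.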
DNA DOUBLE HELIX
RNA
RNA is chemically very similar to DNA
There are two important differences:
The four bases present in RNA are adenine (A), guanine (G), cytosine (C) and uracil (U); uracil takes the place of thymine
RNA nucleotides contain a different sugar molecule (ribose instead of deoxyribose)
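At the level of symbols, the first difference amounts to a change of alphabet. A minimal sketch (the function name dna_to_rna is an assumption for illustration):

```python
def dna_to_rna(dna):
    """Rewrite a DNA sequence in the RNA alphabet: uracil (U) replaces thymine (T)."""
    return dna.replace("T", "U")


print(dna_to_rna("ACCTGAGAG"))
```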
GENETICS AND EVOLUTION
Mutation
Natural selection
Genetic drift
SEQUENCE MATCHING PROBLEM
Matching DNA, RNA, or protein sequences between a diseased organism and a healthy organism
Proteins are long sequences, and DNA strands are even longer
We match them by breaking them into shorter subsequences
Breaking and matching are done using the notion of alignment.
SEQUENCE MATCHING EXAMPLE
Consider two amino acid sequences:
ACCTGAGAG
ACGTGGCAG
Sequence alignment:
A C C T G A G - A G
A C G T G - G C A G
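One common way to compute such an alignment is dynamic programming (the Needleman-Wunsch method). The slides do not say which method they use, so this is a sketch under assumptions: the scoring values (match +1, mismatch -1, gap -1) and the function name align are chosen here for illustration.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming.

    Returns (score, row_a, row_b), where '-' marks a gap.
    The default scores are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


total, top, bottom = align("ACCTGAGAG", "ACGTGGCAG")
print(total)
print(top)
print(bottom)
```

Several alignments can share the best score, so the rows printed may differ from the alignment on the slide while scoring at least as well under these assumed values.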
FINITE STATE MACHINES IN BLAST
BLAST is used to find out which of the sequences in a database are related to a new given sequence
The BLAST system is a three-step process:
1. Examine the query string and select a set of substrings of length w (between 4 and 20) that are good for producing matches
2. Build a deterministic finite state machine (DFSM) that uses the set of substrings to find the sequences with the highest local matches in the database
3. Examine the matches found in step 2 and try to build longer matching sequences
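The three steps can be sketched in a much simplified form. This is not the BLAST implementation: it assumes every length-w substring of the query is used as a seed (real BLAST selects them), it uses a dictionary lookup where the slides describe a DFSM, and it extends only while characters match exactly. The names seeds and find_hits are illustrative.

```python
def seeds(query, w):
    """Step 1 (simplified): every length-w substring of the query, with its offsets."""
    table = {}
    for i in range(len(query) - w + 1):
        table.setdefault(query[i:i + w], []).append(i)
    return table


def find_hits(query, subject, w):
    """Steps 2 and 3 (simplified): locate seed matches in subject, then extend them.

    Returns (query_start, subject_start, length) triples, longest first.
    """
    table = seeds(query, w)  # stands in for the DFSM of step 2
    hits = set()
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], []):
            # Extend to the left while the characters keep matching.
            q_start, s_start = i, j
            while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
                q_start -= 1
                s_start -= 1
            # Extend to the right in the same way.
            q_end, s_end = i + w, j + w
            while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
                q_end += 1
                s_end += 1
            hits.add((q_start, s_start, q_end - q_start))
    return sorted(hits, key=lambda hit: -hit[2])


# Usage with a hypothetical database entry.
print(find_hits("ACCTGAGAG", "TTACCTGATT", w=4))
```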
REGULAR EXPRESSIONS SPECIFY PROTEIN MOTIF
By aligning a collection of related proteins we can define a motif
Example: E S G H D T Y Y N K N R M D T T T T T S W Q S R G S D T T T P D M T A G P T T W R N T
Once a motif is defined, we can search for occurrences of it in other protein sequences by using regular expressions
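Searching for a motif with a regular expression might look like the sketch below. The pattern is a made-up motif chosen only for illustration, not the motif from the slide, and the test sequences are hypothetical.

```python
import re

# Hypothetical motif: G, then H or S, then D, then one to three T, then Y or P.
MOTIF = re.compile(r"G[HS]DT{1,3}[YP]")


def find_motif(protein):
    """Return (start, matched_text) for every occurrence of MOTIF in protein."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(protein)]


for sequence in ["ESGHDTYYNKN", "RGSDTTTPDM", "ACCTGAGAG"]:
    print(sequence, find_motif(sequence))
```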
HMM FOR SEQUENCE MATCHING
HMMs are used when the sequences become fairly diverse
We can capture the variations among the members of a family and the probabilities associated with them
Using HMMs we can find the best alignment between two sequences and decide which family a given new sequence belongs to
An HMM profile is given by M = (K, O, π, A, B):
K is a set of n states, one for each position in the sequence
O is the output alphabet
π contains the initial state probabilities
A contains the transition probabilities
B contains the output probabilities
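Given M = (K, O, π, A, B), two standard computations are the most probable state path (the Viterbi algorithm) and the total probability of a sequence (the forward algorithm). The slides name neither algorithm, so this is a sketch under that assumption. The two-state model below is a toy invented for illustration; it is not a real protein family profile.

```python
def viterbi(obs, K, pi, A, B):
    """Most probable state path for obs; returns (probability, path)."""
    best = {s: (pi[s] * B[s][obs[0]], [s]) for s in K}
    for symbol in obs[1:]:
        step = {}
        for s in K:
            # Best way to arrive in state s from any previous state r.
            prob, path = max(((best[r][0] * A[r][s], best[r][1]) for r in K),
                             key=lambda item: item[0])
            step[s] = (prob * B[s][symbol], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


def forward(obs, K, pi, A, B):
    """Total probability that the model emits obs, summed over all state paths."""
    alpha = {s: pi[s] * B[s][obs[0]] for s in K}
    for symbol in obs[1:]:
        alpha = {s: sum(alpha[r] * A[r][s] for r in K) * B[s][symbol] for s in K}
    return sum(alpha.values())


# Toy model (all numbers are assumptions made for this example).
K = ["s1", "s2"]
pi = {"s1": 1.0, "s2": 0.0}
A = {"s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s1": 0.0, "s2": 1.0}}
B = {"s1": {"A": 0.9, "C": 0.1}, "s2": {"A": 0.2, "C": 0.8}}

print(viterbi("AC", K, pi, A, B))
print(forward("AC", K, pi, A, B))
```

To decide which family a new sequence belongs to, one option is to compute forward() under each family's model and pick the model with the highest probability.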
EXAMPLE OF HMM DESCRIBING PROTEIN SEQUENCE FAMILY
RNA SEQUENCE MATCHING AND SECONDARY STRUCTURE PREDICTION USING THE TOOLS OF CONTEXT-FREE LANGUAGES
In RNA, a change to a single nucleotide in a stem region could completely alter the molecule's shape and its function
So a change in the stem must be matched by a corresponding change in the paired nucleotide
Context-free languages are used to describe these nested dependencies and the secondary structure
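The nested dependency can be illustrated with a recognizer for a grammar of the form S -> x S y | loop, where x and y are complementary bases. This sketch assumes only the pairs A-U and C-G and a minimum loop of three nucleotides; the sequences and the name stem_length are hypothetical.

```python
# Complementary RNA base pairs assumed for this sketch (A-U and C-G only).
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def stem_length(rna, min_loop=3):
    """Count the nested base pairs of a hairpin, working inward from both ends.

    Each accepted pair corresponds to one use of the rule S -> x S y;
    whatever is left in the middle is treated as the loop.
    """
    i, j = 0, len(rna) - 1
    pairs = 0
    # Keep at least min_loop unpaired nucleotides between the innermost pair.
    while j - i - 1 >= min_loop and (rna[i], rna[j]) in PAIRS:
        pairs += 1
        i += 1
        j -= 1
    return pairs


print(stem_length("GGCAAAAGCC"))  # original hairpin
print(stem_length("GACAAAAGCC"))  # one stem nucleotide changed
print(stem_length("GACAAAAGUC"))  # paired nucleotide changed to match
```

The second sequence loses most of its stem, and the compensating change in the third restores it.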
EXAMPLE
COMPLEXITY OF ALGORITHMS USED IN COMPUTATIONAL BIOLOGY
Approaches to many of the problems described here are computationally expensive, for example those that break large protein and DNA molecules up into substrings
Finding a shortest superstring of a set of substrings is NP-hard
Conversion to a decision problem:
SHORTEST-SUPERSTRING = {<S, K> : S is a set of strings and there exists some superstring T such that every element of S is a substring of T and T has length less than or equal to K}
SHORTEST-SUPERSTRING is NP-complete
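For very small inputs the decision problem can be answered by exhaustive search. The sketch below tries every ordering of the strings and merges neighbours with maximal overlap, so its running time grows factorially with the size of S, in line with the hardness result. The function names are illustrative assumptions.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring(strings):
    """Exhaustive search for a shortest superstring; only practical for tiny sets."""
    unique = set(strings)
    # A string contained in another string is covered automatically.
    core = [s for s in unique if not any(s != t and s in t for t in unique)]
    if not core:
        return ""
    best = None
    for order in permutations(core):
        merged = order[0]
        for s in order[1:]:
            merged += s[overlap(merged, s):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


def in_shortest_superstring(strings, k):
    """Decide whether <S, K> is in SHORTEST-SUPERSTRING."""
    return len(shortest_superstring(strings)) <= k


fragments = ["ACG", "CGT", "GTA"]  # hypothetical fragments
print(shortest_superstring(fragments))
print(in_shortest_superstring(fragments, 5), in_shortest_superstring(fragments, 4))
```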
REFERENCE
Elaine Rich, Automata, Computability and Complexity: Theory and Applications (book).
http://en.wikipedia.org/wiki/Computational_biology
Thank you