
COMPUTATIONAL BIOLOGY

OUTLINE

Proteins
DNA
RNA
Genetics and evolution
The Sequence Matching Problem
RNA Sequence Matching
Complexity of the Algorithms

DEFINITION

Computational biology encompasses all computational methods and theories applicable to molecular biology, together with the computer-based techniques used for solving biological problems.

PROTEINS

Building blocks of living organisms

Large molecules composed of sequences of amino acids

There are 20 amino acids, divided into classes: hydrophobic (h-phob), hydrophilic (h-phil), polar with a positive (pos) or negative (neg) charge

Amino acid     Sym  Class    Amino acid     Sym  Class
Alanine        A    h-phob   Leucine        L    h-phob
Arginine       R    pos      Lysine         K    pos
Asparagine     N    h-phil   Methionine     M    h-phob
Aspartic acid  D    neg      Phenylalanine  F    h-phob
Cysteine       C    h-phil   Proline        P    h-phob
Glutamine      Q    h-phil   Serine         S    h-phil
Glutamic acid  E    neg      Threonine      T    h-phil
Glycine        G    h-phob   Tryptophan     W    h-phob
Histidine      H    pos      Tyrosine       Y    h-phil
Isoleucine     I    h-phob   Valine         V    h-phob

DNA

Blueprint of living organisms

DNA is composed of two strands held together by weak hydrogen bonds

Each strand is a sequence of nucleotides

DNA has four bases, which are classified into two chemical types

Base      Symbol  Type
Adenine   A       Purine
Thymine   T       Pyrimidine
Cytosine  C       Pyrimidine
Guanine   G       Purine

DNA DOUBLE HELIX

RNA

RNA is chemically very similar to DNA

There are two important differences:

The four bases present in RNA are adenine (A), guanine (G), cytosine (C), and uracil (U); uracil takes the place of thymine

RNA nucleotides contain a different sugar molecule (ribose)

GENETICS AND EVOLUTION

Mutation

Natural selection

Genetic drift

SEQUENCE MATCHING PROBLEM

Matching DNA, RNA, or protein sequences between a diseased organism and a healthy organism

Proteins are long, and DNA strands are even longer

We match them by breaking them into shorter subsequences

Breaking and matching are done using the notion of alignment.

SEQUENCE MATCHING EXAMPLE

Consider two amino acid sequences:

ACCTGAGAG
ACGTGGCAG

Sequence alignment:

A C C T G A G – A C
A C G T G – G C A C
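The notion of alignment can be made concrete with a small sketch. The Python code below performs a global alignment by dynamic programming (the Needleman-Wunsch method, which the slides do not name). The scoring values (match = 1, mismatch = -1, gap = -1) and the function name are illustrative assumptions, not taken from the source.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)

    def pair_score(i, j):
        # Score for placing a[i - 1] opposite b[j - 1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two symbols
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling align("ACCTGAGAG", "ACGTGGCAG") returns a score and two gapped strings of equal length. Other scoring values can produce a different alignment, so the result should not be expected to reproduce the alignment on the slide.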

FINITE STATE MACHINES IN BLAST

BLAST is used to find which of the sequences in a database are related to a newly given query sequence

The BLAST system is a three-step process:

1. Examine the query string and select a set of substrings of length w (between 4 and 20) that are good for producing matches

2. Build a DFSM (deterministic finite state machine) from that set of substrings and use it to find the sequences with the highest local matches in the database

3. Examine the matches found in step 2 and try to extend them into longer matching sequences
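The three steps can be sketched in Python as follows. This is a simplified illustration, not the BLAST implementation: every length-w substring of the query is used as a seed (no selection or scoring), a dictionary lookup stands in for the DFSM of step 2, and the extension in step 3 is ungapped and accepts exact matches only. All function names are hypothetical.

```python
def seed_words(query, w):
    """Step 1 (simplified): map each length-w substring of the query to its positions."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return words


def find_hits(query, subject, w):
    """Step 2 (simplified): scan the subject for seed words.

    Yields (query_pos, subject_pos) pairs. A dict lookup replaces the DFSM.
    """
    words = seed_words(query, w)
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            yield i, j


def extend(query, subject, i, j, w):
    """Step 3 (simplified): grow an exact-match seed left and right without gaps."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + w, j + w
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return query[start_q:end_q]


def best_local_match(query, subject, w=4):
    """Longest exact shared segment reachable from any seed, or '' if no seed hits."""
    best = ""
    for i, j in find_hits(query, subject, w):
        segment = extend(query, subject, i, j, w)
        if len(segment) > len(best):
            best = segment
    return best
```

Running best_local_match over each database sequence and keeping the sequences with the longest segments mirrors the overall flow of the three steps.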

REGULAR EXPRESSIONS SPECIFY PROTEIN MOTIF

By aligning a collection of related proteins, we can define a motif

Example: E S G H D T Y Y N K N R M D T T T T T S W Q S R G S D T T T P D M T A G P T T W R N T

Once a motif is defined, we can search for occurrences of it in other protein sequences by using regular expressions
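Searching with a motif can be sketched with Python's re module. The motif below is hypothetical and chosen only for illustration (D, then two T, then either T or P); it is not derived from the example alignment, and the function name is an assumption.

```python
import re

# Hypothetical motif: D, T, T, then either T or P.
MOTIF = re.compile(r"DTT[TP]")


def find_motif(protein):
    """Return (start, matched_text) for each non-overlapping motif occurrence."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(protein)]
```

A character class such as [TP] expresses a position where the aligned proteins disagree, while a plain letter expresses a position that is conserved across the family.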

HMM FOR SEQUENCE MATCHING

HMMs are used when the sequences become fairly diverse

We can capture the variations among the members of a family and the probabilities associated with them

Using HMMs, we can find the best alignment between two sequences and determine which family a given new sequence belongs to

An HMM profile is given by M = (K, O, π, A, B):

K is a set of n states, one for each position in the sequence
O is the output alphabet
π contains the initial state probabilities
A contains the transition probabilities
B contains the output probabilities
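Deciding which family a new sequence belongs to comes down to asking how probable the sequence is under each family's model. A minimal sketch of that computation for M = (K, O, π, A, B) follows, using the forward algorithm (not named in the slides). The two-state model and all of its probabilities are invented for illustration and are far smaller than a real profile.

```python
def sequence_probability(seq, states, pi, A, B):
    """Forward algorithm: P(seq | M), summed over all state paths.

    pi[s]    initial probability of state s
    A[s][t]  probability of moving from state s to state t
    B[s][o]  probability that state s outputs symbol o
    """
    if not seq:
        return 1.0
    # alpha[s] = probability of emitting the prefix so far and ending in state s
    alpha = {s: pi[s] * B[s].get(seq[0], 0.0) for s in states}
    for symbol in seq[1:]:
        alpha = {
            t: sum(alpha[s] * A[s][t] for s in states) * B[t].get(symbol, 0.0)
            for t in states
        }
    return sum(alpha.values())


# Toy model (all numbers are assumptions): state "h" favors A and L,
# state "p" favors S and T.
STATES = ["h", "p"]
PI = {"h": 0.6, "p": 0.4}
A = {"h": {"h": 0.7, "p": 0.3}, "p": {"h": 0.4, "p": 0.6}}
B = {
    "h": {"A": 0.4, "L": 0.4, "S": 0.1, "T": 0.1},
    "p": {"A": 0.1, "L": 0.1, "S": 0.4, "T": 0.4},
}
```

With one such model per family, a new sequence would be assigned to the family whose model gives it the highest probability.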

EXAMPLE OF HMM DESCRIBING PROTEIN SEQUENCE FAMILY

RNA SEQUENCE MATCHING AND SECONDARY STRUCTURE PREDICTION USING THE TOOLS OF CONTEXT-FREE LANGUAGES

In RNA, a change to a single nucleotide in a stem region could completely alter the molecule's shape and its function

So a change in the stem must be matched by a corresponding change in the paired nucleotide

Context-free languages are used to describe these nested dependencies and the secondary structure
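The nested pairing can be written as a context-free grammar, for example S -> a S u | u S a | c S g | g S c | L, where L derives an unpaired loop. The sketch below follows that rule to count the nested pairs of a single stem-loop. It rests on two assumptions that are not in the slides: only the pairs A-U and C-G count, and the loop must keep at least three unpaired nucleotides. The function name is hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def stem_pairs(rna, min_loop=3):
    """Count nested base pairs formed by peeling matching ends off the sequence.

    Mirrors the rule S -> x S y for complementary x and y: the outermost
    nucleotides pair, then the next ones in, until a mismatch is found or
    fewer than min_loop unpaired nucleotides would remain in the loop.
    """
    left, right = 0, len(rna) - 1
    pairs = 0
    while right - left - 1 >= min_loop and (rna[left], rna[right]) in PAIRS:
        pairs += 1
        left += 1
        right -= 1
    return pairs
```

For instance, stem_pairs("GGCAAAAGCC") finds three pairs. Changing only the first nucleotide to A breaks the outermost pair and the count drops to zero, while changing the last nucleotide to U as well restores all three pairs.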

EXAMPLE

COMPLEXITY OF ALGORITHMS USED IN COMPUTATIONAL BIOLOGY

Many of the approaches described here are computationally demanding, for example those that break large protein and DNA molecules into substrings

Finding the shortest superstring of a set of strings is NP-hard

Conversion to a decision problem: SHORTEST-SUPERSTRING = {<S, K> : S is a set of strings and there exists some superstring T such that every element of S is a substring of T and T has length less than or equal to K}, which is NP-complete
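Membership in NP rests on the fact that a proposed superstring T can be checked quickly. The sketch below shows that check, followed by a brute-force decision procedure that tries every ordering of the strings and is therefore usable only for very small sets. All function names are hypothetical.

```python
from itertools import permutations


def is_certificate(S, K, T):
    """Polynomial-time check: T contains every string in S and len(T) <= K."""
    return len(T) <= K and all(s in T for s in S)


def merge(a, b):
    """Append b to a, reusing the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return a + b


def shortest_superstring(S):
    """Exhaustive search over all orderings; time grows factorially with len(S)."""
    # Strings contained in another string never need a slot of their own.
    unique = set(S)
    core = [s for s in unique if not any(s != t and s in t for t in unique)]
    best = None
    for order in permutations(core):
        candidate = ""
        for s in order:
            candidate = merge(candidate, s)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best if best is not None else ""


def in_shortest_superstring(S, K):
    """Decide whether <S, K> is in SHORTEST-SUPERSTRING (small inputs only)."""
    return len(shortest_superstring(S)) <= K
```

The contrast between the two functions is the point: checking a given T takes time polynomial in the input size, while the search shown here for a suitable T examines every ordering.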

REFERENCE

Elaine Rich, Automata, Computability and Complexity: Theory and Applications (book).

http://en.wikipedia.org/wiki/Computational_biology


Thank you
