Computer Science and Bioinformatics

James Edwards and Rajinder Singh Bhatti.

Biology and Computer Science?

Initially, Biology depended on Chemistry to make major strides: Biochemistry.

Biology then needed to work at the atomic level to explain phenomena: Biophysics.

The modern era of Biology needs to interpret a wealth of data, using tools that only Computer Science is able to provide: hence Bioinformatics.

What is Bioinformatics?

The study of computational methods to expand the use of biological data (data-oriented).

Often (incorrectly) used in place of the term ‘Computational Biology’, which is a slightly different discipline.

Computational Biology is the use of computational and mathematical methods to study or simulate biological systems (hypothesis-oriented).

[Source: National Institutes of Health]

Overlaps Between the two Disciplines

[Venn diagram with regions numbered 1 to 4]

1 – Bioinformatics problems

2 – Computational Biology problems

3 – Problems in both categories

4 – Problems in neither category

Motivation for Bioinformatics

Quote from Donald Knuth, 1974 Turing Award winner:

“…I can’t be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on. It’s at that level.” [Source: Wikiquote]

Can biological life be equated with computing? Results so far would suggest the answer is yes!

Common Bioinformatics Problems

Finding and assessing similarities between strings (next slides).

Detecting patterns in strings.

Constructing trees of the evolution of organisms.

Classifying new data by clustering existing data.

Also applications of Machine Vision to detect interactions between proteins.

Data Structures Used in Biology

Strings for representing sequences (e.g. DNA, RNA, Amino Acid Sequences).

“ATACGGCGCGCAAGGCT”

“TATGCCGCGCGTTCCGA”
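The two example strings appear to be complementary DNA strands (A pairs with T, C pairs with G). A minimal Python sketch of working with sequences as plain strings follows; the names `COMPLEMENT`, `complement` and `reverse_complement` are illustrative choices, not part of the slides.

```python
# Translation table mapping each DNA base to its pairing base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Return the complementary strand, read in the same direction."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return complement(seq)[::-1]


if __name__ == "__main__":
    # The first example string from the slide.
    print(complement("ATACGGCGCGCAAGGCT"))
```

Because a sequence is just a string, the usual string operations (slicing, searching, comparison) apply directly, which is why the alignment problems on the later slides are phrased as string problems.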

Trees for representing the evolution of organisms and other purposes.

[Figure: evolutionary tree whose labelled nodes include Prokaryotes, Eukaryotes, Reptiles and Birds]

Data Structures (Cont..)

Graphs can represent signalling pathways (often found in neural networks).

[Figure: example graph with numeric labels]

3D points and their linkages can represent protein structures.

First Instance of a Problem – DNA Shotgun Sequencing

In order to derive a DNA sequence, the DNA must first be duplicated many times.

ATCACCGTAAGAGGAATCACCGTAAGAGGAATCACCGTAAGAGGAATCACCGTAAGAGGA

The copies are then broken at random into smaller pieces named ‘fragments’, which can be separated by size using Gel Electrophoresis.

ATCACCGTAAGAGGAATCACCGTAAGAGGAATCACCGTAAGAGGA

ATCCCGTAGGAAAGCCGT

AAGATCA

This is a very simplified instance of the problem; typically each fragment is between 250 and 1000 bases long.

TAAGTA

Alignments – the Smith-Waterman Method

How do we identify fragments which link together?

We can use dynamic programming to compute optimal alignment scores between fragments.

Each step of an alignment is scored as a match (+1), a mismatch (-1) or a gap (-1/3 x length of gap).

The score in each cell is the best total score of an already computed neighbouring cell plus the score of the step taken to reach the current cell. If a score would be below 0, it is set to 0.

The first row (and the first column) is always filled with 0s.

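[Figure: 3 x 3 scoring matrix for two three-letter sequences (labelled A, T, A and C, T, A), with one cell scoring 1 and the rest 0]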

The Smith-Waterman Algorithm (Continued)

Following this, trace back a path through the optimum alignment, starting at the highest number in the matrix and ending at the first 0.

In this case it is: ‘AT’

The algorithm is expensive: O(NM) run time and O(NM) storage complexity.

It always finds the optimum solution.

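The fill and trace-back steps can be sketched in Python as follows. This is a minimal sketch, assuming the scoring stated on the slides (match +1, mismatch -1) and reading the gap cost as a linear penalty of 1/3 per gap character; the function name `smith_waterman` and the use of `Fraction` to keep the thirds exact are illustrative choices.

```python
from fractions import Fraction

MATCH = Fraction(1)
MISMATCH = Fraction(-1)
GAP = Fraction(-1, 3)  # per gap character, so a gap of length k costs k/3


def smith_waterman(a: str, b: str):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # First row and first column stay at 0.
    H = [[Fraction(0)] * cols for _ in range(rows)]
    best, best_pos = Fraction(0), (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            H[i][j] = max(
                Fraction(0),          # scores below 0 are reset to 0
                H[i - 1][j - 1] + step,  # match or mismatch
                H[i - 1][j] + GAP,       # gap in b
                H[i][j - 1] + GAP,       # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest score until a 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = MATCH if a[i - 1] == b[j - 1] else MISMATCH
        if H[i][j] == H[i - 1][j - 1] + step:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + GAP:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("ACGTACGT", "ACGTCGT"))
```

The nested loops over every cell are where the O(NM) run time comes from, and the full matrix `H` is the O(NM) storage.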

Alternative to SW Algorithm

Sequences are usually, at the very least, tens of thousands of characters long.

This makes O(NM) runtime (and storage complexity) unacceptable.

Alternative: use the BLAST (Basic Local Alignment Search Tool) algorithm.

It gives a much more reasonable run time, roughly O(N+M) in practice.

However, it does not always compute the best solution.

BLAST Algorithm

Computing an entire matrix of values will always require N x M space.

Iterating over all of those values will always require N x M time.

Solution: ignore parts of the alignment which are unlikely to improve the score.

This improves the storage complexity, as only a single alignment must be stored.

It also improves the runtime complexity, as at each stage of the algorithm only the optimum so far is processed.

BLAST Illustrated

Consider forming an alignment between two sequences:

CTCTCTCTCATTGATTGCGGGGGG

GGGGGGGGGATTGATTGCCCCCCC

The strings at the beginning and end are very unlikely to improve the score of the alignment. Therefore no gaps or mismatches are computed for them in the matrix.

---------ATTGATTGC------

---------ATTGATTGC------
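The idea of skipping unpromising regions can be sketched as a simple "seed and extend" search in Python. This is a heavily simplified illustration and not the real BLAST implementation: it assumes exact-match seeds of length `k`, extends them only while characters match exactly, and allows no gaps. The function name `seed_and_extend` and the default `k=4` are illustrative assumptions.

```python
def seed_and_extend(a: str, b: str, k: int = 4):
    """Return (shared segment, start in a, start in b) for the longest
    exact, ungapped match that contains at least one k-letter seed."""
    # Index every k-letter word of a by its starting positions.
    index = {}
    for i in range(len(a) - k + 1):
        index.setdefault(a[i:i + k], []).append(i)

    best = (0, 0, 0)  # (length, start in a, start in b)
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):
            # Extend the seed to the left while the characters agree.
            lo_a, lo_b = i, j
            while lo_a > 0 and lo_b > 0 and a[lo_a - 1] == b[lo_b - 1]:
                lo_a, lo_b = lo_a - 1, lo_b - 1
            # Extend the seed to the right while the characters agree.
            hi_a, hi_b = i + k, j + k
            while hi_a < len(a) and hi_b < len(b) and a[hi_a] == b[hi_b]:
                hi_a, hi_b = hi_a + 1, hi_b + 1
            if hi_a - lo_a > best[0]:
                best = (hi_a - lo_a, lo_a, lo_b)

    length, start_a, start_b = best
    return a[start_a:start_a + length], start_a, start_b


if __name__ == "__main__":
    print(seed_and_extend("CTCTCTCTCATTGATTGCGGGGGG",
                          "GGGGGGGGGATTGATTGCCCCCCC"))
```

Only the neighbourhoods of seed hits are examined, so no N x M matrix is built; the price is that an alignment needing gaps or mismatches can be missed, which matches the point that the best solution is not always found.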

Alignments: Relation to Shotgun Sequencing

So now there is a way to measure which fragments are likely to align, but we still need a way to find the correct order efficiently.

An in-depth algorithm is beyond the scope of this presentation.

However, the best current techniques are:

Greedy methods (align every element, then use only the best solutions).

Evolutionary algorithms (start with an initial set of solutions, computing the sum of alignment scores, then ‘evolve’ the set of solutions in each iteration).

The problem is NP-hard, so these techniques give approximations.
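A greedy method can be sketched in Python as follows. This is a toy illustration only: it assumes error-free fragments, measures how well two fragments link by their exact suffix/prefix overlap instead of an alignment score, and ignores fragments contained inside other fragments. The names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags[0] if frags else ""


if __name__ == "__main__":
    # Three overlapping pieces of the 15-base example sequence.
    print(greedy_assemble(["AAGAGGA", "ATCACCGT", "CCGTAAGA"]))
```

Each round makes the locally best merge and never revisits it, which keeps the method fast but means the result is an approximation and not a guaranteed optimum.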

Relating Computer Science to Biology

What have we Computer Science students studied so far in this MSc course that can be of use to Bioinformatics?

Data Mining

Artificial Intelligence

Heuristic approaches (e.g. Knowledge Representation – Logics)

Algorithm Techniques

Data Mining and Bioinformatics: How and why?

Some of you do COMP 527 Data Mining with Rob

Why Data Mining is essential in Bioinformatics.

KDD (Knowledge Discovery in Databases) is the process of finding useful information and patterns in data.

Data Mining is the use of algorithms to extract information and patterns derived by the KDD process.

Graphical techniques such as brushing, data smoothing etc.

Data Mining and Bioinformatics: Algorithm implementation examples

Data Mining algorithms are used to tackle problems in Bioinformatics, in conjunction with microarray technology.

Predict a patient's outcome, such as survival time, disease recurrence, health risk assessments etc.

How does Data Mining help? Accurate predictions could help provide better treatment!

AI and Bioinformatics: Artificial Intelligence?

Research in genetics, molecular biology etc. generates enormous amounts of data

Use AI to extract useful information from the wealth of available data

Build good probabilistic models (gene models)

AI provides several powerful algorithms and techniques for solving these problems using the stored data

AI and Bioinformatics: AI techniques used

Neural networks (Biological and Artificial)

Hidden Markov models (Probabilistic Statistical models)

Bayesian networks (Models logic)

and many others...

Logic and Bioinformatics

Biology works by applying prior knowledge (“what is known”) to unknown entities.

Therefore Biology is said to be knowledge-based (rather than axiom-based).

Use pre-existing knowledge to make inferences about the item under investigation.

Description Logic?

Description Logic and Bioinformatics: Why Description Logic?

A decidable logic with good systems

It is impossible for a single biologist to deal with all of a domain's knowledge!

Similar to programmers writing extremely complex programs without an IDE to help with libraries

Medical diagnosis systems make good use of ABOX and TBOX assertions

For example, determine if a patient's problem is an element of a particular known disease

Description Logic Example

TBOX

sick ≡ person ⊓ ∃isInfected.Cancer

non_sick ≡ person ⊓ ∃isInfected.Cold

ABOX

Tim : person

Steven : person

Cancer : Problem

Cold : Problem

(Tim, Cancer) : isInfected

(Steven, Cold) : isInfected
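The kind of question such a knowledge base answers can be sketched in Python. This is a toy lookup and not a Description Logic reasoner: it assumes the TBOX is read as "a sick person is a person who isInfected with Cancer" and "a non_sick person is a person who isInfected with Cold", and it only checks facts that are explicitly asserted in the ABOX. The helper names `has_role_to`, `is_sick` and `is_non_sick` are hypothetical.

```python
# ABOX concept assertions, written as (individual, concept).
concept_assertions = {
    ("Tim", "person"),
    ("Steven", "person"),
    ("Cancer", "Problem"),
    ("Cold", "Problem"),
}

# ABOX role assertions, written as (subject, object, role).
role_assertions = {
    ("Tim", "Cancer", "isInfected"),
    ("Steven", "Cold", "isInfected"),
}


def has_role_to(individual: str, role: str, filler: str) -> bool:
    """True if the ABOX asserts (individual, filler) : role."""
    return (individual, filler, role) in role_assertions


def is_sick(individual: str) -> bool:
    """Assumed reading of the TBOX: a person who isInfected with Cancer."""
    return ((individual, "person") in concept_assertions
            and has_role_to(individual, "isInfected", "Cancer"))


def is_non_sick(individual: str) -> bool:
    """Assumed reading of the TBOX: a person who isInfected with Cold."""
    return ((individual, "person") in concept_assertions
            and has_role_to(individual, "isInfected", "Cold"))


if __name__ == "__main__":
    for name in ("Tim", "Steven"):
        print(name, "sick:", is_sick(name), "non_sick:", is_non_sick(name))
```

A real reasoner would derive such memberships from the definitions themselves and would not treat a missing assertion as false; the sketch only shows how TBOX definitions and ABOX facts combine to classify an individual.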

Improvements: How far has Bioinformatics come?

“One is struck both by how far the field has come in a relatively short period of time, and also by how far it has yet to go.” - Jessica D. Tenenbaum

The discipline of Bioinformatics has vastly improved over recent years due to:

Fast technological development of the computer industry

Demand for Computer Scientists: more computer scientists than ever before!

Biological “unknown” discoveries: things that are discovered with no previous knowledge base

Growth of sub-disciplines of Biology, such as molecular Biology

Improvements: How far will Bioinformatics go?

This depends entirely on whether the gap between Biology and Computer Science increases or decreases.

The gap increases if:

Educational institutions decide to ignore Bioinformatics

Put emphasis on prospective students

Computer Scientists choose to ignore Biology

Biologists choose to ignore Computer Science

Closing the gap I

Biologists cannot build their own analytical tools

Computer Scientists don't know what to build!

Closing the gap II

Putting a Computer Scientist (Data Mining expert) into a room with a Biologist investigator won't solve the problem.

Boundaries such as methodologies and discipline language are a problem.

Closing the gap III

Computer Science is the “science of the artificial”

Biology is the “science of discovery”

The only way to bridge the gap is for both parties to learn the fundamentals of each other's science

Breakthroughs of Bioinformatics

Spatial patterns of structures for understanding protein folding, evolution, and biological functions

To predict protein functions, we develop a method by rapidly matching local surfaces and by incorporating evolutionary information specific to individual binding region via a Bayesian Monte Carlo approach.

These kinds of breakthroughs encourage the computer industry to get involved and work with Biology.

Related Problems? Are there any other disciplines which involve the similar integration of Computer Science with Biology?

Cheminformatics/Chemoinformatics

The application of informatics tools to solve discovery chemistry problems

An integral component of hit and lead generation

Development of new computational methods or efficient algorithms for chemical software, and pharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery

Related Problems? Are there any other disciplines which involve the similar integration of Computer Science with Biology?

Other similar interests

Ecoinformatics

Geoinformatics

Quantum informatics

Astroinformatics

Business informatics

And many others...

Follow-ups of Jacques Cohen

Bioinformatics: An Introduction for Computer Scientists is a previous publication from Jacques Cohen. It aims to encourage Computer Scientists to get involved with Biology.

Updating Computer Science Education was released after Bioinformatics and Computer Science.

It talks about persuading the next generation of Computer Scientists that Computer Science is more than just programming.

Who is Jacques Cohen?

Has served at Brandeis University since 1968.

Doctor in the fields of analysis of algorithms, parsing and compiling, memory management, logic and constraint logic programming, and parallelism

Recently started researching his interest in Bioinformatics

His most recent publication is about methods used in microarray data interpretation.

See http://www.cs.brandeis.edu/~jc/publications.html

References and related material (All web links last accessed 4th February 2008)

Shotgun sequencing: G. Luque, E. Alba Torres and S. Khuri, Assembling DNA Fragments with a Distributed Genetic Algorithm, Parallel Computing for Bioinformatics and Computational Biology, Wiley-Interscience, New Jersey, 2006, Chapter 12, pp. 285-302.

L.D. Paulson, Bioinformatics Experiences Important Breakthroughs, 2005, pp. 26-27

J. Cohen, Bioinformatics: An Introduction for Computer Scientists, ACM Computing Surveys, 36(2), 122-158, 2004.

B. Tjaden, J. Cohen, A Survey of Computational Methods used in Microarray Data Interpretation, Applied Mycology and Biotechnology, Bioinformatics 6, 2006.

J. Cohen, Updating Computer Science Education, Communications of the ACM, 48(6), 29-31, 2005.

J. Cohen, Computational Molecular Biology: A Promising Application Using Logic Programming and Constraint Logic Programming, Lecture Notes in Artificial Intelligence, 1999.

R. Stevens, C.A. Goble and S. Bechhofer, Ontology-based Knowledge Representation for Bioinformatics, 2000.

Jinyan Li, Limsoon Wong and Qiang Yang, Data Mining in Bioinformatics, 2005.

Various material about Bioinformatics, http://www.aaai.org/AITopics/html/bioinf.html

Data Mining in Bioinformatics, http://www.dbs.informatik.uni-muenchen.de/Forschung/Bioinformatics/
