Upload
kermit-burt
View
27
Download
1
Embed Size (px)
DESCRIPTION
James Edwards and Rajinder Singh Bhatti. Computer Science and Bioinformatics. http://www.csee.umbc.edu/~smer1/bioinformatics.gif. Biology and Computer Science?. Initially Biology depended on Chemistry to make major strides Biochemistry - PowerPoint PPT Presentation
Citation preview
Computer Science and Bioinformatics
James Edwards and Rajinder Singh Bhatti.
http://www.csee.umbc.edu/~smer1/bioinformatics.gif
Biology and Computer Science?
Initially Biology depended on Chemistry to make major strides Biochemistry
Biology then needed to work at atomic level explaining phenomena Biophysics
The modern era of Biology needs to interpret a wealth of data, tools that only computer Science is able to provide Hence Bioinformatics
What is Bioinformatics?
The study of computational methods to expand the use of biological data (Data Orientated).
Often (incorrectly) used instead of the term ‘Computational Biology’. However this is a slightly different discipline.
Computational Biology is the use of computational and mathematical methods to study or simulate biological systems (Hypothesis Orientated).
[source National Institutes of health]
Overlaps Between the two Disciplines
1 23
4
1 – Bioinformatics problems 2 – Computational Biology problems 3 – Problems in both categories 4 – Problems in neither category
Motivation for Bioinformatics
Quote from Donald Knuth- 1974 Turing Award winner:
“…I can’t be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on. It’s at that level.” [source – Wikiquotes]
Can Biological life be equated with Computing? Results so far would suggest the answer is
yes!
Common Bioinformatics Problems
Finding and assessing Similarities between Strings (next slides).
Detecting patterns in strings. Constructing trees of the evolution of
organisms. Classifying new data by clustering
existing data. Also applications of Machine Vision to
detect interactions between proteins
Data Structures Used in Biology
Strings for representing sequences (e.g. DNA, RNA, Amino Acid Sequences).
“ATACGGCGCGCAAGGCT”“TATGCCGCGCGTTCCGA”
Trees for representing the evolution of organisms and other purposes.
Prokaryotes
Eukaryotes
............
Reptiles Birds
…… …… …… ……
Data Structures (Cont..)
Graphs can represent signalling pathways (often found in Neural networks).
1 11
2
0
1
1
3d Points and their Linkages can represent protein structures.
First Instance of a Problem – DNA Shotgun Sequencing
In order to derive a DNA sequence, the DNA must first be duplicated many times.
ATCACCGTAAGAGGAATCACCGTAAGAGGAATCACCGTAAGAGGAATCACCGTAAGAGGA
It must then be processed by Gel Electrophoresis, which ‘chops’ the DNA into smaller pieces named ‘fragments’.
ATCACCGTAAGAGGAATCACCGTAAGAGGAATCACCGTAAGAGGA
ATCCCGTAGGAAAGCCGT
AAGATCA
This is a very simplified Instance of the problem typically each fragment can be between 250 and 1000 Bases long.
TAAGTA
Alignments – the Smith Waterman Method.
How do we identify fragments which link together?
Can use dynamic programming to compute optimal alignment scores between fragments.
Align with either match (1) gap -(1/3 x length of gap) or mismatch (-1).
The score in each cell is the best total score from an already chosen cell/row + the cost of the alignment. If a score is < 0 it is said to be 0.
The first row is always filled with 0’s
000A
010T
000A
CTA
The Smith Waterman algorithm (Continued)
Following this trace back a path through the optimum alignment starting at the highest number in the matrix to the first 0.
In this case it is: ‘AT’ Algorithm extremely expensive O(NM) run time
and O(NM) storage complexity. Always finds optimum solution.
000A
010T
000A
CTA
Alternative to SW Algorithm
Sequences are usually at the very least tens of thousands of characters long Makes O(NM) runtime (and storage
complexity) unacceptable. Alternative – use BLAST (Basic Local
Alignment Search Tool) Algorithm. Gives a much more reasonable run time of
O(N+M). However does not always compute best
solution.
BLAST Algorithm
Computing an entire matrix of values will always require N x M space.
Iterating over values will always require N x M Space. Solution: Ignore parts of the alignment which are
unlikely to improve the score. This improves the Storage Complexity as only
a singular alignment must be stored. It also improves the Runtime Complexity as
at each stage of the algorithm only the optimum so far is processed.
BLAST Illustrated
The strings at the beginning and end are very unlikely to improve the score of the alignment. Therefore no gap and mismatches are
computed in the matrix
CTCTCTCTCATTGATTGCGGGGGG
GGGGGGGGGATTGATTGCCCCCCC
---------ATTGATTGC------
---------ATTGATTGC------
Consider forming an alignment between two sequences:
Alignments Relation to Shotgun Sequencing.
So now there is a way to measure which fragments are likely to align we still need a way to find the correct order efficiently.
In depth Algorithm beyond scope of presentation
However the best current techniques are: Greedy Methods (align every element – then
use only best solutions). Evolutionary Algorithms (start with initial set of
solutions, computing sum of alignment scores then ‘evolve’ set of solutions in each iteration).
Problem is NP- Hard – Techniques give Approximations.
Relating Computer Science to Biology
What have us Computer Science students studied so far in this MSc course that can have some use to Bioinformatics?
Data Mining Artificial Intelligence Heuristic approaches (e.g. Knowledge Representation –
Logics) Algorithm Techniques
Data Mining and BioinformaticsHow and why?
Some of you do COMP 527 Data Mining with Rob
Why Data Mining is essential in Bioinformatics. KDD (Knowledge Discovery DB) is the process of
finding useful information and patterns in data.
Data Mining is the use of algorithms to extract information and patterns derived by the KDD process.
Graphical Techniques such as Brush, Data smoothing etc.
Data Mining and BioinformaticsAlgorithm implementation examples
Data Mining algorithm use for tackling problems in Bioinformatics In conjunction with microarray Technology
Predict a patients outcome, such as survival time disease recurrence health risk assessments etc...
How does Data Mining help? Accurate predictions could help provide better
treatment!
AI and BioinformaticsArtificial Intelligence?
Research in genetics, molecular biology etc. generate enormous amounts of data
Use AI to extract useful information from the wealth of available data
Build good probabilistic models (gene models)
AI provides several powerful algorithms and techniques solving these problems using the stored data
AI and BioinformaticsAI techniques used
Neural networks (Biological and Artificial) Hidden Markov models (Probabilistic
Statistical models) Bayesian networks (Models logic) and many others....
Logic and Bioinformatics
Biology works by applying prior knowledge “what is known” to unknown entities. Therefore Biology said to be knowledge-
based (rather than axiom based) Use pre-existing knowledge to make
inferences about the item under investigation.
Description Logic?
Description Logic and Bioinformatics Why description Logic?
decidable logic with good systems
impossible for a single biologist to deal with all
of a domains knowledge!
similar to programmers writing extremely
complex programs without an IDE to help
with libraries
medical diagnosis systems make good use of
ABOX and TBOX assertions for example, determine if a patients problem is
an element of a particular known disease
Description Logic Example
TBOX sick person isInfected.Cancer non_sick person
isInfected.Cold
ABOXTim : person Steven : person
Cancer : Problem Cold : Problem
(Tim, Cancer) : isInfected
(Steven, Cold) : isInfected
Improvements How far has Bioinformatics come?
“One is struck both by how far the field has come in a relatively short period of time, and also by how far it has yet to go.” - Jessica D. Tenenbaum
The discipline of Bioinformatics has vastly improved over recent years due to Fast technological development of the computer
industry Demand for Computer Scientists - more computer
scientists than ever before! Biological “unknown” discoveries – things that are
discovered with no previous knowledge base Growing of sub-Biology interests, such as
molecular Biology
Improvements How far will Bioinformatics go?
Thoroughly depends if the gap between Biology and Computer Science increases or decreases The gap increases if educational institutions
decide ignore Bioinformatics Put emphasis on prospective students
Computer Scientists choose to ignore Biology Biologists choose to ignore Computer Science
Closing the gap I
Biologists cannot build their own analytical tools
Computer Scientists don't know what to build!
Closing the gap II
Putting a Computer Scientist (Data Mining expert) into a room with a Biologist investigator wont solve the problem Boundaries such as methodologies and
discipline language are a problem.
Closing the gap III
Computer Science is the “science of the artificial”
Biology is the “science of discovery”
The only way to bridge the gap is for both parties to learn the basic fundamentals of each science
Breakthroughs of Bioinformatics
Spatial patterns of structures for understanding protein folding, evolution, and biological functions
To predict protein functions, we develop a method by rapidly matching local surfaces and by incorporating evolutionary information specific to individual binding region via a Bayesian Monte Carlo approach.
These kinds of breakthroughs encourage the computer industry to get involved and work with Biology.
Related Problems?Are there any other disciplines which involve the similar integration of Computer Science with Biology?
Cheminformatics/Chemoinformatics the application of informatics tools to solve
discovery chemistry problems an integral component of hit and lead generation development of new computational methods or
efficient algorithms for chemical software, and pharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery
Related Problems? Are there any other disciplines which involve the similar integration of Computer Science with Biology?
Other similar interests Ecoinformatics Geoinformatics Quantum informatics Astroinformatics Business informatics And many others...
Follow ups of Jacques Cohen
Bioinformatics—an introduction for computer scientists is a previous publication from Jacques Cohen aims to encourage Computer Scientists to get
involved with Biology Updating Computer Science Education released
after Bioinformatics and Computer Science Talks about encouraging the next generation of
Computer Scientists that Computer Science is more than just programming.
Who is Jacques Cohen?
Currently serving Brandies University since 1968. Docter in the field of analysis of algorithms, parsing
and compiling, memory management, logic and constraint logic programming, and parallelism
Recently started researching his interest of Bioinformatics
His most recent publication is about methods used in microarray Data Interpretation See http://www.cs.brandeis.edu/~jc/publications.html
References and related material(All web links last accessed 4th February 2008)
Shotgun sequencingG. Luque, E. Alba Torres and S. Khuri, Assembling DNA Fragments with a Distributed Genetic Algorithm, Parallel Computing for Bioinformatics and Computational Biology, Wiley-Interscience, New Jersey, 2006, Chapter 12, pp. 285-302.
L.D. Paulson, Bioinformatics Experiences Important Breakthroughs, 2005, pp. 26-27
J. Cohen, Bioinformatics: An Introduction for Computer Scientists, ACM Computing Surveys, 36(2), 122-158, 2004.
B. Tjaden, J. Cohen, A Survey of Computational Methods used in Microarray Data Interpretation, Applied Mycology and Biotechnology, Bioinformatics 6, 2006.
J. Cohen, Updating Computer Science Education, Communications of the ACM, 48(6), 29-31, 2005.
J. Cohen, Computational Molecular Biology: A Promising Application Using Logic Programming and Constraint Logic Programming, Lecture Notes in Artificial Intelligence, 1999.
R. Stevens, C.A. Goble and S. Bechhofer, Ontology-based Knowledge Representation for Bioinformatics, 2000.
Jinyan Li, Limsoon Wong and Qiang Yang, Data Mining in Bioinformatics, 2005.
Various material about Bioinformatics, http://www.aaai.org/AITopics/html/bioinf.html
Data Mining in Bioinformatics, http://www.dbs.informatik.uni-muenchen.de/Forschung/Bioinformatics/