JM - http://folding.chmcc.org 1
Introduction to Bioinformatics: Lecture III
Genome Assembly and String Matching
Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
Outline of the lecture
Physical mapping problem and the resulting computational challenges
Ordering clone libraries: from the consecutive ones to global optimization methods
Applications of exact string matching methods
Towards the shortest superstring problem and the shotgun assembly problem
Literature watch
Aloy et al., “Structure-Based Assembly of Protein Complexes in Yeast”, Science 303,
As a way of getting acquainted with protein pathways and their intersection with structural studies.
Assembling physical maps of a genome
[Figure: markers located along the DNA]
Physical mapping problem: create and locate in the genome of interest a set of markers (e.g. stretches of DNA that hybridize to a given probe).
With a sufficiently dense and ordered set of markers, any newly sequenced DNA fragment that is long enough to cover at least one marker can be mapped to a rough location on the genome.
One of the early goals of the Human Genome Project was to select and map a set of STS markers such that there would be at least one STS in each stretch of 100 kb of the genome.
Physical mapping and the problem of ordering clone libraries with STS markers
[Figure: DNA with clones 1 to 4 and STS markers 1 to 5]
Definition A clone library consists of a set of short DNA fragments, called clones, that originated in a stretch of the studied DNA.
Definition A sequence tagged site (STS) is a DNA substring which occurs only once in the DNA of interest. One may think of STSs as a set of indices to which new DNA sequences can be referenced.
Problem What is the minimum length of the STSs that could (at least in principle) provide the requested coverage for the Human genome?
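One rough way to approach the problem above is a counting argument, sketched here under assumptions that are not part of the lecture: the genome is treated as a random string over {A, C, G, T}, and its length is taken to be about 3 x 10^9 bases (an assumed round figure). A substring of length k can only be expected to be unique if there are at least as many distinct k-mers as genome positions, i.e. 4^k >= genome size. The function name and the constant below are illustrative.

```python
def min_unique_length(genome_size: int, alphabet_size: int = 4) -> int:
    """Smallest k with alphabet_size**k >= genome_size (random-sequence model)."""
    k = 0
    distinct = 1  # number of distinct strings of length k
    while distinct < genome_size:
        distinct *= alphabet_size
        k += 1
    return k


HUMAN_GENOME_SIZE = 3_000_000_000  # assumed round figure, in base pairs

print(min_unique_length(HUMAN_GENOME_SIZE))  # prints 16
```

This is only a lower bound in principle. Real genomes are repetitive rather than random, so a marker of this length is not guaranteed to be unique.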
The problem of ordering clone libraries with STS markers can be cast (and solved) as the consecutive ones problem
[Figure: DNA with clones 1 to 4 and STS markers 1 to 5]
Our task is to reconstruct the original order of the STSs (and thus order the clone library) given this data.
Assuming that the STS probes are unique and that there are no hybridization errors, the problem can be cast as the consecutive ones problem and efficiently solved using CS techniques (PQ-tree algorithm, Booth and Lueker, 1976).
The true location of the STSs and clones is not known. However, for each clone the list of STSs hybridizing to it is given.
The consecutive ones problem and its solution
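Hybridization matrix with the STSs (columns) in the order 3 5 1 4 2:

        STS   3  5  1  4  2
Clone 1       1  0  0  1  0
Clone 2       0  0  1  0  1
Clone 3       1  0  0  1  1
Clone 4       1  1  0  1  0

[Figure: DNA with clones 1 to 4 and STS markers 1 to 5]

The same matrix with the STSs (columns) in the order 1 2 3 4 5:

        STS   1  2  3  4  5
Clone 1       0  0  1  1  0
Clone 2       1  1  0  0  0
Clone 3       0  1  1  1  0
Clone 4       0  0  1  1  1

For a binary hybridization matrix find a permutation of its columns such that in each row all ones are located in a block of consecutive entries.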
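The statement of the consecutive ones problem can be checked directly on the small example above. The sketch below is a brute-force search over all column permutations, not the PQ-tree algorithm mentioned earlier, and is only feasible for a handful of STSs. The function and variable names are illustrative.

```python
from itertools import permutations


def has_consecutive_ones(row):
    """True if all ones in the row form one contiguous block."""
    ones = [i for i, x in enumerate(row) if x == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def consecutive_ones_orders(matrix):
    """Yield every column order for which each row has its ones in one block."""
    n_cols = len(matrix[0])
    for order in permutations(range(n_cols)):
        if all(has_consecutive_ones([row[j] for j in order]) for row in matrix):
            yield order


# Hybridization matrix from the slide; columns are STSs in the order 3 5 1 4 2.
STS_LABELS = [3, 5, 1, 4, 2]
MATRIX = [
    [1, 0, 0, 1, 0],  # clone 1
    [0, 0, 1, 0, 1],  # clone 2
    [1, 0, 0, 1, 1],  # clone 3
    [1, 1, 0, 1, 0],  # clone 4
]

solutions = [[STS_LABELS[j] for j in order]
             for order in consecutive_ones_orders(MATRIX)]
for solution in solutions:
    print(solution)
```

The order 1 2 3 4 5 is among the solutions, but it is not the only one: the reversed order always works as well, so the hybridization data alone need not determine a unique ordering.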
Fortunately errors make life more interesting …
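Hybridization matrix with the STSs (columns) in the order 5 4 1 3 2:

        STS   5  4  1  3  2
Clone 1       0  1  0  1  0
Clone 2       0  1  1  0  1
Clone 3       0  1  0  1  1
Clone 4       1  0  0  1  0

[Figure: DNA with clones 1 to 4 and STS markers 1 to 5]

The same matrix with the STSs (columns) in the order 1 2 3 4 5:

        STS   1  2  3  4  5
Clone 1       0  0  1  1  0
Clone 2       1  1  0  1  0
Clone 3       0  1  1  1  0
Clone 4       0  0  1  0  1

In the presence of experimental errors the problem becomes a global optimization problem (see Pevzner, Chapter 3).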
Heuristic solutions may still provide good probe ordering
The number of “gaps” (blocks of zeros in rows) in the hybridization matrix may be used as a cost function, since hybridization errors typically split blocks of ones (false negatives) or split a gap into two gaps (false positives).
The problem of finding a permutation that minimizes the number of gaps can be cast as a Traveling Salesman Problem (TSP), in which the cities are the columns of the hybridization matrix (plus an additional column of zeros) and the distance between two cities is the number of positions in which the two columns differ (Hamming distance).
Thus, an efficient algorithm is unlikely in the general case (unless P=NP), and heuristic solutions are being sought that provide good probe ordering, at least for most cases (e.g. Alizadeh et al., 1995).
Problem Is the correct order of the STSs in the example from the previous slide providing the shortest cycle for the corresponding TSP?
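The TSP formulation can be explored numerically on the example with errors from the previous slide. The sketch below computes the cycle length for a given column order (with the extra column of zeros as the start and end city) and finds the shortest cycle by brute force, which is only feasible for very small matrices. Along such a cycle each block of ones in a row is entered once and left once, so the cycle length is twice the total number of blocks of ones, and fewer blocks means fewer gaps between them. The function names are illustrative.

```python
from itertools import permutations


def blocks_of_ones(row):
    """Number of maximal runs of ones in a row."""
    return sum(1 for i, x in enumerate(row)
               if x == 1 and (i == 0 or row[i - 1] == 0))


def hamming(u, v):
    """Number of positions in which two columns differ."""
    return sum(a != b for a, b in zip(u, v))


def tour_length(matrix, order):
    """Cycle length: zero column, then the columns in the given order, then back."""
    zero = [0] * len(matrix)
    cols = [zero] + [[row[j] for row in matrix] for j in order] + [zero]
    return sum(hamming(cols[i], cols[i + 1]) for i in range(len(cols) - 1))


def shortest_tour(matrix):
    """Brute force over all column orders; returns (length, order)."""
    n_cols = len(matrix[0])
    return min((tour_length(matrix, order), order)
               for order in permutations(range(n_cols)))


# Matrix with hybridization errors, columns in the true STS order 1 2 3 4 5.
MATRIX = [
    [0, 0, 1, 1, 0],  # clone 1
    [1, 1, 0, 1, 0],  # clone 2
    [0, 1, 1, 1, 0],  # clone 3
    [0, 0, 1, 0, 1],  # clone 4
]

true_order = tuple(range(5))
total_blocks = sum(blocks_of_ones(row) for row in MATRIX)
assert tour_length(MATRIX, true_order) == 2 * total_blocks

best_length, best_order = shortest_tour(MATRIX)
print("true order:", tour_length(MATRIX, true_order))
print("shortest:  ", best_length, [j + 1 for j in best_order])
```

Comparing the two printed lengths answers the problem above for this particular matrix.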
Map location of anonymous DNA as a string matching problem
A sufficiently long string of anonymous yet sequenced DNA can be placed on the physical map by finding which STSs are contained in this sequence.
Due to the size of the problem, efficiency is very important.
Millions of STSs are available at present and their total length is typically much larger than the length of the DNA sequence to be mapped.
Assuming no sequencing errors, the problem can be cast as the exact set matching problem and solved efficiently using, for example, suffix trees.
Generalized suffix tree or inexact string matching methods need to be used when some errors are allowed.
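The exact set matching step can be sketched as follows. Instead of a suffix tree, this sketch uses a sorted suffix array with binary search, a simpler index with the same role, and builds it naively. The index is built on the DNA fragment to be mapped (the shorter side), and each STS is then looked up in it. The sequences and STS names are made-up toy data, and the function names are illustrative.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, sorted lexicographically (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """All start positions of pattern in text, via binary search on the suffix array."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


def map_fragment(fragment, sts_set):
    """Return {STS name: positions} for every STS that occurs exactly in the fragment."""
    suffix_array = build_suffix_array(fragment)
    hits = {}
    for name, sts in sts_set.items():
        positions = find_occurrences(fragment, suffix_array, sts)
        if positions:
            hits[name] = positions
    return hits


# Toy data (hypothetical sequences, far shorter than real STSs).
FRAGMENT = "ACGTTGCAACGT"
STS_SET = {"sts_a": "GTTG", "sts_b": "CAAC", "sts_c": "TTTT"}

print(map_fragment(FRAGMENT, STS_SET))  # {'sts_a': [2], 'sts_b': [6]}
```

Positions are 0-based. Each lookup costs a logarithmic number of string comparisons in the fragment length, so the total work is dominated by reading the STS set once.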
Strings, sequences and string operations
String exact matching problem
Solving the exact matching problem: conceptual simplicity vs. computational complexity
Computationally efficient and elegant solutions
The idea of the suffix tree method
A string with m characters has m suffixes, which can be represented as m leaves of a rooted directed tree. Consider for example T=cabca.
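[Figure: suffix tree for T=cabca; edge labels are shown with the leaf (suffix) numbers]
root
  ca
    bca$   leaf 1
    $      leaf 4
  a
    bca$   leaf 2
    $      leaf 5
  bca$     leaf 3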
For simplicity, one leaf (due to the terminal character $) is not included.
Problem What is the reason for adding the terminal character?
Why does it work?
A substring of a string is a prefix of a suffix in that string. For example, the substring P=ab is a prefix of the suffix abca in T=cabca. Thus, if P occurs in T, there is a leaf in the suffix tree whose path label from the root starts with P.
[Figure: suffix tree for T=cabca, as on the previous slide]
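The "prefix of a suffix" argument can be tried out with a small sketch. For brevity it builds an uncompressed suffix trie (one character per edge, quadratic size) rather than a true suffix tree with merged edge labels, and it keeps the leaf for the suffix consisting of $ alone. The matching idea is the same: walk down from the root along P, then report the leaves below. The function names and the "leaf" key are illustrative.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a nested-dict trie."""
    text += "$"  # terminal character: no suffix is then a prefix of another
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based suffix number, as on the slide
    return root


def leaves_below(node):
    """Suffix numbers of all leaves in the subtree rooted at node."""
    found = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


def occurrences(trie, pattern):
    """1-based start positions of pattern in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []  # pattern is not a prefix of any suffix
        node = node[ch]
    return leaves_below(node)


trie = build_suffix_trie("cabca")
print(occurrences(trie, "ab"))  # [2]
print(occurrences(trie, "ca"))  # [1, 4]
print(occurrences(trie, "cc"))  # []
```

Once the index is built, the walk takes time proportional to the length of P plus the number of leaves visited, independent of where in T the occurrences lie.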
As a related problem consider motif search, as implemented in PROSITE. Explain how the finite automata formalism is used for motif search.
General idea: ordered fingerprints and the notion of closeness between DNA fragments
Hierarchical sequencing: physical maps, clone libraries and shotgun
Definition The algorithmic problem of shotgun sequence assembly is to deduce the sequence of the DNA string from a set of sequenced and partially overlapping short substrings derived from that string.
Analogy to physical map assembly: the DNA sequence of a substring may be viewed as a precise ordered fingerprint (in analogy to STSs), and the suffix-prefix match determines if two substrings would be assembled together.
In general, the shortest superstring problem (find the shortest string that contains each string from a certain set of strings as its substring) is NP-hard and heuristics are being developed to address the problem.
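One common heuristic of this kind is greedy merging: repeatedly join the two strings with the largest suffix-prefix overlap. The sketch below is a textbook version under simplifying assumptions (error-free reads, a single strand, no repeats to resolve). It is not a description of how Phrap or any other assembler works, and the reads are made-up toy data.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Greedy heuristic: repeatedly merge the pair with the largest overlap."""
    # Drop duplicates and fragments that are contained in another fragment.
    strings = [s for s in set(fragments)
               if not any(s != t and s in t for t in fragments)]
    strings.sort()  # deterministic tie-breaking
    while len(strings) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left string, index of right string)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = strings[i] + strings[j][k:]
        rest = [s for idx, s in enumerate(strings) if idx not in (i, j)]
        # Strings already contained in the merged string need no further work.
        strings = [s for s in rest if s not in merged] + [merged]
    return strings[0] if strings else ""


# Toy reads (hypothetical), each overlapping the next by 7 characters.
READS = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]

print(greedy_superstring(READS))  # ATTAGACCTGCCGGAATAC
```

The result always contains every input string, but it is not guaranteed to be the shortest possible superstring.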
Get the relevant sequences to compare them: conservation and differences
Problem | Algorithms | Programs
Sequencing | Fragment assembly problem, the Shortest Superstring Problem | Phrap (Green, 1994)
Gene finding | Hidden Markov Models, pattern recognition methods | GenScan (Burge & Karlin, 1997)
Sequence comparison | Pairwise and multiple sequence alignments: dynamic programming, heuristic methods | BLAST (Altschul et al., 1990)