28
Using DERIVE 6.1 To Generate DNA Fractal Representations Joel Jeffrey’s Chaos Game Representation, Genomic Signatures, and Aligning Sequences L. Carl Leinbach Gettysburg College Gettysburg, PA USA [email protected]

Using DERIVE 6.1 To Generate DNA Fractal - TIME 2012

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Using DERIVE 6.1 To Generate

DNA Fractal Representations

Joel Jeffrey’s Chaos Game

Representation, Genomic

Signatures, and Aligning Sequences

L. Carl Leinbach

Gettysburg College

Gettysburg, PA USA

[email protected]

The Chaos Game

The Chaos Game (Jeffrey’s name) is basically an Iterated Function System

dependent upon points arranged in some geometric configuration. The best

known example is that of Sierpinski’s Triangle.

In this case, the point configuration is the vertices of a triangle located in the

plane, say at vertices, v1 = (0,0), v2 = (1/2, 1), and v3 = (1,0). The sequence of

points (xi, yi) is then chosen with (x0, y0 ) = (1/2, 1/ 2) and

𝑥𝑖 = 𝑥𝑖−1 + 𝑣𝑟𝑖+ 𝑥𝑖−1

2

for I = 1, 2, . . .

𝑦𝑖= 𝑦𝑖−1 + 𝑣𝑟2 +𝑦𝑖−1

2

At each step vr is randomly chosen from among the vertices v1, v2, and v3.

This process is easily programmed in DERIVE and the result graphed.

Sierpinski(T, n, i, j, k, X, Y) ≔ Prog i ≔ 1 P ≔ [1/2, 1/2] X ≔ [P↓1] Y ≔ [P↓2] Loop i ≔ i + 1 If i > n RETURN VECTOR([X↓j, Y↓j], j, 1, n) k ≔ RANDOM(3) + 1 P ≔ [(P↓1 + T↓k↓1)/2, (P↓2 + T↓k↓2)/2] X ≔ INSERT(P↓1, X, i) Y ≔ INSERT(P↓2, Y, i)

Below is an example of a typical program for constructing the Sierpinski

Triangle and the result of the program after 10,000 iterations. The result is

always the same, independent of the choice of the starting point and, remarkably,

the quality of the pseudo-random number generator picking the target vertex.

Sierpinski(T, 10000)

The Birth of the Chaos Game Representation

In his 1990 article in Nucleic Acids Research , Jeffrey noted that if an iterated function

system using the same “Chaos Game” strategy is applied to figures with 5, 6, 7 as well

as three vertices, the result is the same independent of the quality of the random number

generator. However, with 8 vertices, and to a lesser extent 4 vertices, the result was

dramatically different.

Using a good random number generator with 8 vertices, the result of the Chaos Game

was an almost uniformly filled octagon. When the game was played using the random

number generator from Turbo Pascal 3.0 (a notoriously bad generator) the result was

an elaborate pattern generating a circle within a circle and the circles connected by 8

spidery lines. The DOS Basic v2.1 function RND produced no visible pattern at all for

8 vertices.

This experience, according to Jeffrey’s account, led him to the conclusion that the

Chaos Game can display certain kinds of non randomness visually. In particular,

patterns in a four vertex Chaos Game having the vertices labeled starting at the bottom

left and proceeding clockwise with the letters, “A”, “C” , “G”, and “T” could be

observed by controlling the moves with successive letters in a particular DNA sequence.

He called the resulting pattern the “Chaos Game Representation” of the sequence.

An All Too Brrief

Introduction to DNA

Much work preceded the spectacular 1953 announcement of the

double helical structure of DNA by Watson and Crick. Nucleic

Acid was known prior to 1820. DNA was first identified in 1869

and, simultaneously Mendel was doing his pioneering research in

genetics. In the 1940’s Linus Pauling predicted a three

dimensional structure for the DNA molecule. The X-ray

diffraction work of Rosalind Franklin and Maurice Wilkins

played a major role in the Watson–Crick discovery.

DNA consists chains made up from four nucleotides,

two Purines:

Adenine - A

Guanine - G

and two Pyrimidines:

Thymine – T

Cytosine - C

What Watson and Crick Discovered

Erwin Chargaff, an Austrian born biochemist ,who

immigrated to the United States during the Nazi

era, discovered two rules in 1950 that were

fundamental to the discovery of the double helical

structure of DNA.

1. In natural DNA the number of guanine units

equals the number of cytozine units and the

number of adenine units equals the number of

thymine units

2. The composition of DNA varies from one

species to another.

However he did not grasp the connection of the first

rule to the structure of DNA.

Using Chargaff’s first ‘rule’, the results from Roslind Franklin and Maurice

Wilkins’ Xray diffraction work, and Linus Pauling’s work on the three dimensional

structure of molecules, Watson an Crick painstakingly constructed a cardboard

model of the DNA molecule showing the correct nucleic acid bonds and bonding

distances. Thus, they became the first to accurately formulate the molecule’s

structure.

The Chaos Game Representation (CGR) of DNA

Since A always bonds with T and C always bods with G, one strand of the double helix

is a reflection of the other. Thus, if we know the sequence for one strand , we know the

other; When we see a listing of the nucleic acid content of DNA, we see a single

sequence. These sequences are the inputs for the CGR.

The DERIVE program for obtaining the CGR is very similar to that for generating

Sierpinski’s Triangle. Other than the fact that we are using the corners of a square as

the reference points, the main difference is that there is no need for a random number

generator. The following code is the heart of the program:

L ≔ D↓j If L = "A" R ≔ A If L = "C" R ≔ C If L = "G" R ≔ G R ≔ T P ≔ [(P↓1 + R↓1)/2, (P↓2 + R↓2)/2]

NOTE: A, C, G, and T are the locations of the four corners of the unit square.

C

A T

G

Below is the CGR for Human Hemoglobin B: (HUMHBB) on Chromozome 11 The sequence is 73308 base pairs (bp) long. The starting point was (1/2, 1/2). The location of the most recent ly calculated point was a global variable. The results were plotted after each 20000 bp’s and the end of the sequence. Accurate rational arithmetic was essential for correctly locating points

The Fractal Nature Of The CGR

The pattern exhibited by the entire Hemoglobin B shown has a

double scoop in the upper right hand or quadrant of the CGR.

This reflects a paucity of the CG dinucleotide in the sequence.

This double scoop is exhibited in the displays shown below.

Lower Right Quadrant Lower Right Octant

The Shape of the CGR

For Any Sequence Of Reasonable Length

If we choose any two base pairs that are at least a few thousand positions apart in the

sequence, we will see the same basic pattern exhibited. Shown below is the CGR of

the HUMHBB sequence between bp 53000 and 58271

According to Jeffrey the double scoop shape is has been found only in DNA sequences

for vertebrates and in the human and chimpanzee HIV virus. And ,it can be used to

classify these viruses.

Another Example of the Fractal Behavior

To illustrate that the phenomenon observed for the Human

Hemoglobin B is not a one time occurrence, The CGR was applied

to Gordonia otitidis, a pathogen that can cause blood infections. On

the left is the complete sequence for the Gordonia and on the right is

a partial sequence involving 2500 base pairs. Note that we do not

have the double scoops in the CG region; however there is a paucity

in the AT region.

Godonia otitdis – 9421 bp CGR for bp 3241 to 5678

Jeffrey’s Observations

1. The k-th point on the CGR of a sequence corresponds to the first k-long initial

subsequence and no other subsequence. Thus there is a one-to-one correspondence

between the subsequences of a gene (anchored at a start) and the points of the CGR.

2. Visible patterns in the CGR correspond to some pattern in the sequence of bases..

3. The resolution of the screen limits the amount of detail that may be shown in the CGR.

However, the fractal nature of the CGR allows for for any portion of the CGR to be

magnified to show fine structure.

4. Adjacent bases in the sequence are not plotted adjacent to each other. Euclidean distance

in the CGR therefore implies a new metric on subsequences, or bases.

5. Generally, two close points in a CGR correspond to different sequences, however two

points that are close, but in different quadrants may belong to the same sequence.

6. Points 4 and 5 imply that the visible patterns in a CGR represent global as well as local

patterns. A density (or paucity) of points in a region corresponds to a large (or small)

number of sequences with suffixes in that region.

7. Any mathematical description of the CGR of a sequence is a characterization of the

underlying sequence.

Jeffrey’s Theorem

Theorem 1. In a CGR whose side is of length 1, two sequences with suffix of length

k are contained within the square with side of length 2-k. Further, the center of the

square is give by the following recursive definition:

(a) The center of the suffix of 0 length is (1/2, 1/2)

(b) If the center of the square containing sequences with suffix w is at (x, y),

then

i. the center of the square containing sequences with suffix wa is

(x/2, y/2);

ii. the center of the square containing sequences with suffix wc is

(x/2, (y+1)/2);

iii. the center of the square containing sequences with suffix wg is at

((x+1)/2, (y+1)2);

iv. the center of the square containing sequences with suffix wt (or wu) is

at ((x+1)/2, y/2).

Conversely, all points within this square correspond to sequences with this suffix.

Illustrating Jeffrey’s Theorem

Starting at the center of the square with coordinates A = [0,0], C = [0,1], G = [1,1]

and T = [1,0], trace each successive nucleotide base pair in th e string. Each

point in the plot represents the entire sequence to that point. Also note the

position of the G term is closer to the first T term than it is to C.

G GC GCT GCTT

GCTTA

A Very Interesting Observation

If we look at the point representing A in our short sequence. We see that it is in the

center of a square obtained by successively subdividing the unit square into sub

quadrants five times. Furthermore, no other sequence of five terms could have ended

up at that point! Following the lead from Jeffrey’s Theorem: A is in the center of a

1X1 box; the second T, a 2X2 box; the first T, a 4X4 box; C, an 8X8 box; and G a

16X16 box, or 1/4 of the unit square, i.e. each successive letter is in a box that is

1/4 the size of the box containing the letter preceding it in the sequence.

An Illustration of The

Independence of Starting Point

On the CGR Graph below the sequence GCTTA was plotted from three different

initial points: [1/2, 1/2] (black), [2/3, 5/9] (red), and [9/13, 13/15] (blue). Note that

all three ending points ended in the same 1/32 X 1/32 region of the graph.

Towards Constructing a Genomic Signature

The location of a point of the CGR can allow us to reconstruct the last n terms of a

sequence where n is the size of n-th sub-sub- ٠٠٠-sub quadrant of the unit square on

which the CGR is constructed. Consequently, any sequence ending in GCTTA will end

up in this 5-th sub quadrant. The example sequence is in the middle since it started at

the middle of the unit square. This can result in a further refinement for identifying the

genomic source of the sequence..

Deschavanne’s Genomic Signature

Suppose that we number each of the possible 5-th order sub quadrants according to the

following scheme:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ٠٠٠ 32 33 34 35 ٠ ٠ ٠ 64

٠٠

٠٠ ٠٠ ٠٠

993 ٠ ٠ ٠ 1024

The five letter sequences starting at the upper left and proceeding down to the lower right

Listed according to the folowing scheme:

CCCCC, GCCCC, CGCCC, GGCCC, ٠٠٠TTTTT

Note that CCCCC corresponds to cell 1, GGGGG to cell 32, AAAAA to cell 993 and

TTTTT to cell 1024.

In 1999, Patrick J. Deschavanne and his team of researchers saw that by comparing the

distribution of position s in the plane of the CGR of a sequence could be used to identify

sequences and also to differentiate between sequences having similar appearing CGR’s , but

being from different sources.

Creating a Genomic Signature

The heart of the “Genomic Signature” is that the number of points of the CGR that lie in

each of the specified sub regions of the CGR normalize across the entire region will

disclose the concentration of each of the possible oligonucleotides of a certain length

that are in the sequence.

Below are CGR’s of sub sequences of Beta Hemoglobin from a Human, a Whale , and a

Chimpanzee.

HUMAN WHALE CHIMPANZEE

816 bp 640 bp 1729 bp

At first glace these CGR’s appear to be very similar and one might be hard pressed to

tell which CGR came from which sequence

The Genomic Sequence Revealed Given the matrix, M, showing the normalized concentrations in each block of the CGR,

construct a 3-D point (x,y,z), where (x,y) is the center f the block and z is the number of

points lying in that block divided by the maximum number of points in a block of M,

the infinity norm of M. The following graphs were derived from the three CGRs on the

previous slide.

Human Whale Chimpanzee

A natural concern is that the difference in these graphs may be a function of the

subsequence and its size that is chosen for each species. In his paper Deschvanne and

his team considered the distance between several samples from several species and

several samples of reasonable size within each species and showed that the distances

between interspecies samples are much more tightly grouped about their mean than

the distances between the means of different species.

Taken From

Genomic Signature: Characterization and Classification of

Species Assessed by Chaos Game Representation of Sequences

Deschvanne, Giron, Vilain, Fagot, and Fertil

Sequence Homology Why Sequence Alignment is Important

DNA sequences develop from pre-existing sequences instead of being invented by nature

from scratch . This fact makes sequence alignment an important part of biological

research. If the DNA is taken from an area which has a function or structure from a

known sequence, then , if the sequences exhibit regions of similarity, then it is possible

that the surce f the new sequence will exhibit similar structure or function. When a match

occurs, we say the old and the new sequence are homologous.

At this point we will look for sequence similarity, a first step towards sequence homology.

Several tools exist and are publically available to researchers. The most commonly used

both for DNA and protein sequences is called BLAST (Basic Linear Alignment Tool)

developed by Stephan Altschul, Warren Gish, and David Lipman of the National Center for Biotechnical Information.. It looks for short portions of each sequence that are identical and then stretches in each direction and looks for a good alignment.

The identical sub sequences are called “conserved sequences.” We will show how some

practitioners in the biotechnical field are uing Jeffrey’a CGR as a means of finding

conserved sequences.

A DERIVE Program

For Finding Conserved Sequences

It has been noted that if two CGR points reside in the same block of the 1024 block of a

32 X 32 grid imposed upon the CGR they are the result of mapping the exact same six

letter sequence of the DNA alphabet. The problem is one of finding a strategy for

quickly finding the box in which a point of the CGR resides. We circumvent this point

by saying that since the box is 1

64 X

1

64 , their l1norm of the distance between the two

points in the box is at most 1

32 . We will use a much stricter criterion of considering

only points that lie within 1

256 units of each other. This criterion is not “fool proof”, but

gives a good start for

the search:

An Example

We will consider the following two sequences:

Seq1 ≔ TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAGAC

Seq2 ≔ AATTGCCGTCGTTTTCAGCAGTTATGTCAGATC

The result of the previous program is: [12, 27]

This means that our search will begin at the 12-th letter of sequence 1 and the 27-th of

sequence 2 If, as is hoped, the points are in the same grid box, the five letters preceding

these positions will match and we extend to the right until our criterion on the CGR

points fails. The result

For the two sequences

above is:

[TTATGTCAG, [17, 32],

TTATGTCAG]

Indicating that the conserved

sequence extends from

character 7 = 12 -5 to 17 in

seequence 1 and from 22 to

32 in sequence 2.

Shot Gunning The Genome

The DNA making up the human genome is estimated to be about 3.2 Mb in

size. Only about 1.5% of this total is protein coding DNA The rest is made

up of other components; non coding RNA genes, regulatory sequences,

introns, non coding DNA (aka junk DNA).

To fully sequence the human genome in one shot would be an impossible

task. In the late 1940’s , before DNA structure was even known, Sanger was

awarded a Nobel Prize for determining the amino acid sequence of insulin.

He digested insulin in enzymes that broke up the sequence and was able to

sequence the short sequences rather easily. Then since the fragments

overlapped, he was able (no small task) to put the puzzle back together. In

1974, Russian scientist, Andrey Mirzabekov, whiole visiting Harvard found a

way to selectively break DNA at A and G. Later Allan Maxam and Walter

Gilbert found a method to break DNA at every C and T. The short sequences

could be sequenced and then using combinatorics algorithms, be put back

together. This is known as shotgunning.

A Small Example

Of Shot Gunning

The following three sequences were shot gunned from a protein sequnce.

sg1 ≔ GGATCGTCGTCG

sg2 ≔ CGTCGTCGAACAATG

sg3 ≔ CGAACAATGTCGGA

Using the Derive program we are able to solve this very simple example.

First we try sequence 1 and sequence 2

ConsSeq(sg1, sg2)

[CGTCG, [14, 10], CGTCG]

Indeed there is overlap and it occurs at the end of sequence 1 and beginning of sequence

2. However, we still need to check if there is also overlap between seqquence 1 and 3.

ConsSeq(sg1, sg3)

No Sequence Found

So now what about 2 and 3?

ConsSeq(sg3, sg2)

[AACAA, [10, 15], AACAA]

Here we see one of the weaknesses of the heuristic in determining the same block of the

CGR and also the fact that we are dealing with the end of one sequence and the

beginning of another. The program missed th e beginning CG and final TG. There is

still work to be done! – Next time.

Why CAS? Recovery:

Each point of the CGR is the history of the sequence up to that point. Given the

simple nature of the IFS (Iterrated Function System) used to generate the CGR, the

sequence can be completely recovered from the position of the point and the position of

the starting point. A finite word length decimal approximation will quickly loose this

accuracy. Rational arithmetic of the CAS does not have this restriction.

Quoting from Almeida: “The nucleotide sequence can (also) be recovered from the

CGR coordinate, where the the number of the bases resolved is a function of the

resolution of the CGR coordinates.

Sequence Comparison:

We use a grid of size 2𝑛 𝑋 2𝑛 and store the location of the n-length oligonucleotide

sequence represented by the corresponding grid box containing the CGR points from

the sequences under comparison and then extend then extend the sequence as far as

possible within each sequence. Thus, the CGR jump starts the comparison procedure.

In this paper oligonucleotide of length 6 were used.

Quoting from Joseph: “For long identical sequence segments, the minimum value of

the distance value may go below the minimum possible floating point variable… in

double precision the variable becomes zero when the length of identical segments is

greater than 64.”

References

1. Almeida J. S., Carrico J. A., Maretzek A., Nobel P. A. and Fletcher M., “Analysis of

Genomic Sequences by Chaos Game Representation, Bioinformatics, v. 17, no. 4,

2001, pp 429 – 437

2. Deschavanne P. J., Giron P., Vilain J., Fagot G., and Fertil B., “Genomic Signature:

Characterization and Classification by Chaos Game Representation of Sequences” ,

Molecular Biology and Evolution, http://mbe.oxfordjournals.org/ Stanley Sawer,

reviewing editor, June 1999

3. Isaev, A., Introduction to Mathematical Methods in Bioinformatics, Springer, Berlin,

2004

4. Jeffrey, H. J., “Chaos game Representation of Gene Structure”, Nucleic Acids

Research,1 v 18, no, 8, 1990

5. Jones N. C. and Pevsner P. A., An Introduction to Bioinformatics Algorithms, The

MIT Press, Cambridge, Mass., 2004

6. Joseph J. and Sasikumar, R., “Chaos Game Representation of Whole Genomes”,

BioMed Central (Open Access), 10 pages, May, 2006

7. Pevsner, P., Computational Molecular Biology, An Algoritmic Approach, The MIT

Press. Cambridge, Mass., 2000, (Second Printing), 2001