
1. Protein structure study via residue environment – Residues Solvent Accessibility Environment in Globins Protein Family

2. Statistical linguistic study of DNA sequences*

Ka Lok Ng

Department of Information Management, Ling Tung College

*In collaboration with S.P. Li,

Institute of Physics,

Academia Sinica


Statistical linguistic study of DNA sequences

1. Linguistic study models – Zipf law and Compound Poisson Distribution

2. Compound Poisson Distribution study of the Fortran language and DNA sequences

3. Entropic segmentation method

4. Compound Poisson Distribution study of the DNA segments



Zipf Law

Zipf's law states that

rf = C

where r is the rank of a word, f is the frequency of occurrence of the word, and C is a constant that depends on the text being analyzed. The relation is linear in a double-logarithmic plot, with a slope of approximately -1 for all languages studied.

DNA sequence study – coding and non-coding regions (mammals, invertebrates, eukaryotic viruses, bacteria)

Reference

R.N. Mantegna, S.V. Buldyrev, A.L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, and H.E. Stanley, "Linguistic Features of Noncoding DNA Sequences," Physical Review Letters 73, no. 23, pp. 3169-3172 (1994).

Sequence types: Zipf analysis of 6-tuples in mammalian, invertebrate, yeast chromosome III, eukaryotic virus, prokaryotic and bacterial DNA sequences.

Results: They found that non-coding sequences have a consistently larger slope than coding sequences, suggesting that non-coding sequences bear more resemblance to a natural language than coding sequences do.
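A Zipf analysis of n-tuples can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the cited paper: the function names `tuple_counts` and `zipf_slope` are hypothetical, the words are assumed to be overlapping n-tuples read with a window shifted one base at a time, and the slope is assumed to come from an ordinary least-squares fit over all ranks.

```python
from collections import Counter
import math


def tuple_counts(seq, n=6):
    """Count overlapping n-tuples (window shifted by one base; an assumption)."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def zipf_slope(counts):
    """Least-squares slope of log10(f) versus log10(r); rank 1 is the most frequent word."""
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log10(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy usage on a made-up sequence (not real data); short words keep the toy example meaningful.
toy = "ACGTACGTTTGACCATGACGTTAGCAACGT"
print(zipf_slope(tuple_counts(toy, n=2)))
```

A slope near -1 corresponds to the rf = C behavior described for natural languages; in practice the fit is usually restricted to a range of ranks, which this sketch does not do.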

[Figure: Zipf plot of log f versus log r]



Word frequency distribution - Compound Poisson Distribution

Consider an author's total vocabulary of V words, with probabilities of occurrence $\pi_1 < \pi_2 < \dots < \pi_V$.

The frequency distribution for a specific word with probability of occurrence $\pi_i$ to appear r = 1, 2, … times in a total word count of N tokens is given by

$$\phi(r \mid N) = \int_0^1 \binom{N}{r}\, \pi^r (1-\pi)^{N-r}\, \psi(\pi)\, d\pi$$

Replacing the binomial by the Poisson distribution, assuming $\psi(\pi)$ is a mixing distribution, and integrating over the probability distribution, one obtains

$$\phi(r) = \frac{\left((1-\theta)^{1/2}\right)^{\gamma}}{K_{\gamma}\!\left(\alpha(1-\theta)^{1/2}\right)} \, \frac{(\alpha\theta/2)^{r}}{r!} \, K_{r+\gamma}(\alpha)$$

where $-\infty < \gamma < \infty$, $0 < \theta < 1$ and $\alpha > 0$ are three parameters and $K_{r+\gamma}(\cdot)$ is the modified Bessel function of the second kind of order $r+\gamma$. For $\gamma = -0.5$, the mixing distribution $\psi$ is the inverse Gaussian distribution.
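The distribution φ(r) can be evaluated numerically without a Bessel-function library in the special case where the order parameter equals -1/2 (the inverse Gaussian mixing case). The sketch below is an illustration under that assumption only; the function name `sichel_pmf` and the parameter names `alpha` and `theta` are hypothetical, and the parameter values in the usage lines are arbitrary, not fitted values. It uses the closed form φ(0) = exp(-α(1 - √(1-θ))), which follows from K_{1/2}(x) = √(π/(2x)) e^{-x}, together with the three-term recurrence of the Bessel functions rewritten for φ(r), which avoids overflow at large r.

```python
import math


def sichel_pmf(alpha, theta, r_max):
    """Return [phi(0), ..., phi(r_max)] for the order parameter fixed at -1/2 (an assumption)."""
    if not (alpha > 0 and 0 < theta < 1):
        raise ValueError("need alpha > 0 and 0 < theta < 1")
    # Closed form for r = 0 when the Bessel order is -1/2.
    phi = [math.exp(-alpha * (1.0 - math.sqrt(1.0 - theta)))]
    if r_max >= 1:
        # K_{1/2} equals K_{-1/2}, so phi(1) = (alpha * theta / 2) * phi(0).
        phi.append(0.5 * alpha * theta * phi[0])
    for r in range(1, r_max):
        # From K_{v+1}(x) = K_{v-1}(x) + (2v/x) K_v(x) with v = r - 1/2.
        nxt = (theta * (2 * r - 1) / (2 * (r + 1))) * phi[r] \
            + ((alpha * theta) ** 2 / (4 * r * (r + 1))) * phi[r - 1]
        phi.append(nxt)
    return phi


# Illustrative parameter values only.
p = sichel_pmf(alpha=4.5, theta=0.85, r_max=2000)
print(sum(p))  # should be very close to 1 if enough terms are kept
```

Because this form includes r = 0, a comparison with observed word counts that start at r = 1 would need the zero class removed and the remaining probabilities renormalized by 1 - φ(0).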

[Figure: φ(r) versus r (r from 0 to 30) for Fortran programs; panels labeled COCO1 (a450t85) and CONVERT (a250t85).]

[Figure: φ(r) versus r (r from 0 to 100) for mammal sequences; panels labeled HUMHDABCD (a750t95) and HUMMMDBC (a770t95).]

[Figure: φ(r) versus r (r from 0 to 100) for invertebrate sequences; panels labeled CEC0749 (a640t95) and CELTWIMUSC (a660t95).]

[Figure: φ(r) versus r (r from 0 to 100) for eukaryotic virus sequences; panels labeled ASFV55KB (a530t95) and HE1CG (a730t99).]

[Figure: φ(r) versus r (r from 0 to 100) for bacteria sequences; panels labeled ECOWU85 (a990t96) and ECOUW87 (a980t97).]



Chi-square test

$$\chi^2 = \sum_i \frac{(O_i - T_i)^2}{T_i}$$

$O_i$ is the observed frequency

$T_i$ is the theoretical frequency
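As a small illustration, the chi-square statistic sums (O - T)^2 / T over all categories. The function name `chi_square` and the numbers in the usage line are hypothetical and do not come from the study.

```python
def chi_square(observed, theoretical):
    """Sum of (O_i - T_i)^2 / T_i over all categories i."""
    if len(observed) != len(theoretical):
        raise ValueError("observed and theoretical must have the same length")
    if any(t <= 0 for t in theoretical):
        raise ValueError("theoretical frequencies must be positive")
    return sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))


# Made-up frequencies, for illustration only.
print(chi_square([12, 8], [10, 10]))
```

A smaller value indicates closer agreement between the observed word frequencies and the fitted distribution; judging significance additionally requires the number of degrees of freedom.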



Segmentation method

• How should a sentence be defined?
• DNA sequences are not random sequences
• Examples are CpG islands and repeated sequences
• Look for subsequences that differ from the rest of the sequence
• Segment the DNA according to its {A, T, C, G} base composition by the entropic segmentation method (a method used in image segmentation)
• Let $S = \{a_1, a_2, \dots, a_N\}$, where the a's are symbols over the alphabet $A = \{A_1, \dots, A_k\}$, for example {A, T, C, G}
• Consider a segmentation at position n, which results in $S^{(1)} = \{a_1, a_2, \dots, a_n\}$ and $S^{(2)} = \{a_{n+1}, a_{n+2}, \dots, a_N\}$
• Let $F^{(1)} = \{f_1^{(1)}, \dots, f_k^{(1)}\}$ and $F^{(2)} = \{f_1^{(2)}, \dots, f_k^{(2)}\}$ be the relative nucleotide frequencies over the alphabet A
• The Jensen-Shannon divergence between the two distributions is given by

$$D_{JS}(F^{(1)}, F^{(2)}) = H\!\left(\pi_1 F^{(1)} + \pi_2 F^{(2)}\right) - \left(\pi_1 H(F^{(1)}) + \pi_2 H(F^{(2)})\right)$$

where

$$H(F) = -\sum_{i=1}^{k} f_i \log_2 f_i$$

is the Shannon entropy of the distribution F and $\pi_1 + \pi_2 = 1$.

To look for subsequences, one maximizes $D_{JS}$ over the cut position n. The segmentation process halts according to a chosen significance level.
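A single segmentation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the two weights are assumed to be the relative segment lengths n/N and (N-n)/N, and the significance-level test that decides whether to accept a cut is left out.

```python
import math
from collections import Counter


def shannon_entropy(counts, total):
    """Shannon entropy (in bits) of the relative frequencies counts[x] / total."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def js_divergence_at(seq, n):
    """Jensen-Shannon divergence between seq[:n] and seq[n:], weighted by segment length."""
    total = len(seq)
    left, right = Counter(seq[:n]), Counter(seq[n:])
    w1, w2 = n / total, (total - n) / total  # assumed weights, w1 + w2 = 1
    # With length weights, the weighted mixture is the composition of the whole sequence.
    h_mix = shannon_entropy(left + right, total)
    return h_mix - (w1 * shannon_entropy(left, n) + w2 * shannon_entropy(right, total - n))


def best_cut(seq):
    """Return (n, D_JS) for the cut position with maximal divergence."""
    best_n, best_d = None, -1.0
    for n in range(1, len(seq)):
        d = js_divergence_at(seq, n)
        if d > best_d:
            best_n, best_d = n, d
    return best_n, best_d


# Toy sequence with an obvious compositional boundary.
print(best_cut("AAAAAAAATTTTTTTT"))
```

This version recounts both segments at every position, so it costs O(N^2) per step; keeping running counts while moving the cut would reduce that to O(N). A full segmentation would apply `best_cut` recursively to each resulting segment until no cut passes the chosen significance level.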

References

P. Bernaola-Galvan, R. Roman-Roldan, and J. L. Oliver, "Compositional segmentation and long range fractal correlations in DNA sequences," Phys. Rev. E 53, pp. 5181-5189 (1996).



Summary

1. The compound Poisson distribution fits the frequency distributions of 6 bp and 7 bp DNA words, and of the segmentation domains, quite well; we consider it a better description than the Zipf law.

2. The compound Poisson distribution gives the correct overall normalization factor.

3. We noticed that, of the three parameters, one controls the long-range behavior (i.e. the less frequently occurring, rare words), another controls the short-range behavior (i.e. the more frequently occurring, frequent words), and the third seems to control the overall slope (i.e. the syntax or style) of the distribution φ(r).

4. It is still premature to suggest that DNA sequences resemble natural language or that they can be modeled by linguistic methodology.

In linguistics - representation of linguistic expressions

Morpheme → word → phrase → sentence → text

Biological implications

Study the statistical significance of word frequency

• Naively, are words rare because they disrupt replication or gene expression?

• Do words of significant frequency survive natural selection?
