Hierarchical Cluster Structures and
Symmetries in Genomic Sequences
Andrei Zinovyev
Institut des Hautes Études Scientifiques
Math@Bio group of M. Gromov
Plan of the talk
Genomic sequences: geometric approach, clustering
Genomic sequence as text
Basic 7-cluster structure
Global structure of codon frequencies
Internal structure of codon frequencies
Applications
Introduction
Frequency dictionaries
Genomic sequence as a text in unknown language
tagggrcgcacgtggtgagctgatgctaggg
frequency dictionaries:
t a g g g r c g c a c g t g g t g a g c t g a t g c t a g g g    N = 4 = 4^1
ta gg gr cg ca cg tg gt ga gc tg at gc ta gg    N = 16 = 4^2
tag ggr cgc acg tgg tga gct gat gct agg    N = 64 = 4^3
tagg grcg cacg tggt gagc tgat gcta gggr    N = 256 = 4^4
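A minimal Python sketch of a frequency dictionary, assuming non-overlapping words read from the first letter and the alphabet a, c, g, t. The function name `frequency_dictionary` and the choice to skip words that contain other symbols (such as r) are illustrative assumptions, not something stated in the talk.

```python
from collections import Counter
from itertools import product

ALPHABET = "acgt"  # assumed alphabet; other symbols (e.g. 'r') are skipped


def frequency_dictionary(text, k):
    """Return the frequencies of all 4**k words of length k.

    Words are read without overlap, starting at the first letter.
    """
    words = [text[i:i + k] for i in range(0, len(text) - k + 1, k)]
    # Assumption: drop words containing symbols outside the alphabet.
    words = [w for w in words if set(w) <= set(ALPHABET)]
    counts = Counter(words)
    total = len(words)
    dictionary = {}
    for letters in product(ALPHABET, repeat=k):
        word = "".join(letters)
        dictionary[word] = counts[word] / total if total else 0.0
    return dictionary


sequence = "tagggrcgcacgtggtgagctgatgctaggg"
for k in (1, 2, 3, 4):
    # The dictionary always has N = 4**k entries, most of them zero here.
    print(k, len(frequency_dictionary(sequence, k)))
```

Each dictionary is a point in R^N with N = 4^k, which is what makes the geometric view of a text possible.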
gggrcgccacgttggtgagctgatgctagggrcgacgtgg
tagggrcgcacgtggtgagctgatgctagggrcgacgtgg
agggrcgcacgtggtgagctgatgctagggrcgacgtggc
..cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc…
From text to geometry
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
10^7
cgtggtgagctgatgctagggrcgcacggtgagctgatgctagggrcgcacacttgagctgatgctagggrcgcacaattcgtgagctgatgctagggrcgcacggtg……gagctgatgctagggrcgcacaagtga
length ~300-400
3000-4000 fragments
R^N
Method of visualization: principal components analysis
R^N → R^2
PCA plot
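The path from text to a PCA plot can be sketched as follows. This is an illustration under stated assumptions, not the author's code: the window width of 300 and step of 100 are picked to match the fragment length quoted above, the function names are hypothetical, and PCA is done with a plain NumPy SVD.

```python
from itertools import product

import numpy as np

# Index of each of the 64 triplets over the assumed alphabet a, c, g, t.
TRIPLET_INDEX = {"".join(p): i for i, p in enumerate(product("acgt", repeat=3))}


def triplet_vector(fragment):
    """Frequencies of non-overlapping triplets of one fragment (point in R^64)."""
    v = np.zeros(64)
    for i in range(0, len(fragment) - 2, 3):
        word = fragment[i:i + 3]
        if word in TRIPLET_INDEX:  # skip triplets with other symbols
            v[TRIPLET_INDEX[word]] += 1
    total = v.sum()
    return v / total if total else v


def fragment_vectors(genome, width=300, step=100):
    """Slide a window along the genome; one row of frequencies per fragment."""
    starts = range(0, len(genome) - width + 1, step)
    return np.array([triplet_vector(genome[s:s + width]) for s in starts])


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)                      # center the data cloud
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # coordinates in R^2
```

Calling `pca_2d(fragment_vectors(genome))` on a genome string gives one 2D point per fragment, ready for a scatter plot.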
Chapter 1
Basic 7-cluster structure
(level 1 of non-randomness)
Caulobacter crescentus
singles N=4
doublets N=16
triplets N=64
quadruplets N=256
!!!
The information in a genomic sequence is encoded by non-overlapping triplets
First explanation
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
tga tgc tag ggr cgc acg tgg
ctg atg cta ggg rcg cac gtg
Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc
gct gat gct agg grc gca cgt
gtgaatcggtgggtgagtgtgctgctatgagc
atc ggt ggg tga gtg tgc tgc
tcg gtg ggt gag tgt gct gct
cgg tgg gtg agt gtg ctg ctg
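The three triplet readings above differ only by the position at which reading starts. A small Python sketch, with a hypothetical helper name, that reproduces the three phase shifts for the inner part of the example sequence:

```python
def triplets_in_phase(text, shift):
    """Split text into non-overlapping triplets, starting at offset shift (0, 1 or 2)."""
    return [text[i:i + 3] for i in range(shift, len(text) - 2, 3)]


fragment = "atcggtgggtgagtgtgctgctg"
for shift in (0, 1, 2):
    print(shift, " ".join(triplets_in_phase(fragment, shift)))
```

The same text therefore yields three different triplet frequency vectors, one per phase.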
Non-coding parts
gtgagctgatgctagggr cgcacgaat
Point mutations: insertions, deletions
a
Mean-field approximation for triplet frequencies
F_IJK ≈ P^1_I P^2_J P^3_K
F_IJK : frequency of triplet IJK (I, J, K ∈ {A, C, G, T}):
F_AAA, F_AAT, F_AAC … F_GGC, F_GGG : 64 numbers
letter frequency + correlations
P^j_I : 12 numbers
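A sketch of the mean-field approximation in Python, assuming the 12 numbers are the frequencies of each letter at each of the three triplet positions and that the approximation is their product. Function names and the array layout (rows = positions, columns = a, c, g, t) are illustrative choices.

```python
import numpy as np

LETTERS = "acgt"  # column order of the position-specific frequency table


def position_frequencies(triplets):
    """3 x 4 table: frequency of each letter at triplet positions 1, 2, 3.

    Assumes every triplet consists only of the letters a, c, g, t.
    """
    P = np.zeros((3, 4))
    for word in triplets:
        for j, letter in enumerate(word):
            P[j, LETTERS.index(letter)] += 1
    return P / len(triplets)


def mean_field(P):
    """4 x 4 x 4 table of products P[0][I] * P[1][J] * P[2][K]."""
    return np.einsum("i,j,k->ijk", P[0], P[1], P[2])


P = position_frequencies(["atg", "gat", "gct", "agg"])
M = mean_field(P)
print(P.size, M.size)  # 12 numbers approximate 64 numbers
```

The difference between the observed 64 frequencies and `M` is what the letter frequencies alone do not explain, that is, the correlations.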
Why hexagonal symmetry?
0-+
-+0
+0-
+-0
-0+
0+-
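One way to read the six sign patterns: they are exactly the three cyclic shifts of 0-+ together with the three cyclic shifts of its reversal +-0. The short check below confirms this; linking the cyclic shifts to the three reading phases and the reversal to the opposite strand is an interpretation, not something the slide states.

```python
def cyclic_shifts(pattern):
    """All cyclic shifts of a string, starting with the string itself."""
    return [pattern[i:] + pattern[:i] for i in range(len(pattern))]


forward = cyclic_shifts("0-+")         # three shifts of the pattern
backward = cyclic_shifts("0-+"[::-1])  # three shifts of the reversed pattern
print(forward, backward)
```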
GC-content = P_C + P_G
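GC-content as the sum of the frequencies of c and g can be computed directly from the text. A minimal sketch, assuming symbols other than a, c, g, t are ignored (the function name is hypothetical):

```python
def gc_content(text):
    """Fraction of g and c among the letters a, c, g, t of the text."""
    letters = [ch for ch in text.lower() if ch in "acgt"]
    return sum(ch in "gc" for ch in letters) / len(letters)


print(gc_content("tagggrcgcacgtggtgagctgatgctaggg"))
```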
Chapter 2
Global structure of codon frequencies
(143 complete bacterial genomes)
Genome codon usage and mean-field approximation
ggtgaATG gat gct agg … gtc gca cgc TAAtgagct
…
correct frameshift
64 frequencies F_IJK
…
ggtgaATG gat gct agg … gtc gca cgc TAAtgagct
12 frequencies P^1_I, P^2_J, P^3_K
Global structure of codon frequencies
eubacteria
archaea
P^j_I are linear functions of GC-content
Four symmetry types of the basic 7-cluster structure
eubacteria
flower-like
degenerated
perpendicular triangles
parallel triangles
Chapter 3
Internal structure of codon frequencies
(level 2 of non-randomness)
Second level of hierarchy
?
Distribution of genes
R^64
function1 function2
function3
Fast-growing bacteria
IV
II
I
III
Genes of class I (most genes)
Genes of class II (highly expressed)
Genes of class III (unusual)
Genes of class IV (hydrophobic proteins)
Escherichia coli
Genes of class I (most genes)
Genes of class II (highly expressed)
Genes of class III (unusual)
Genes of class IV (hydrophobic proteins)
Chapter 4
Applications
Computational gene prediction
Accuracy >90%
Protein expression optimization
IV
II
I
III
gene sequence S, protein A
gene sequence S’, same protein A, higher expression
Web-site
http://www.ihes.fr/~zinovyev/7clusters
cluster structures in genomic sequences
Papers
Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences. 2004. arXiv e-print.
Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology. V.3, 0039.
Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).
People
Dr. Tanya Popova, Institute of Computational Modeling, Russia
Professor Alexander Gorban, University of Leicester, UK