Hierarchical Cluster Structures and Symmetries in Genomic Sequences Andrei Zinovyev Institut des Hautes Études Scientifiques Math@Bio group of M.Gromov

Hierarchical Cluster Structures and

Symmetries in Genomic Sequences

Andrei Zinovyev

Institut des Hautes Études Scientifiques

Math@Bio group of M.Gromov

Plan of the talk

Genomic sequences: geometric approach, clustering

Genomic sequence as text

Basic 7-cluster structure

Global structure of codon frequencies

Internal structure of codon frequencies

Applications

Introduction

Frequency dictionaries

Genomic sequence as a text in unknown language

tagggrcgcacgtggtgagctgatgctaggg

frequency dictionaries:

t a g g g r c g c a c g t g g t g a g c t g a t g c t a g g g ($N = 4 = 4^1$)

ta gg gr cg ca cg tg gt ga gc tg at gc ta gg ($N = 16 = 4^2$)

tag ggr cgc acg tgg tga gct gat gct agg ($N = 64 = 4^3$)

tagg grcg cacg tggt gagc tgat gcta gggr ($N = 256 = 4^4$)
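The dictionaries above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `frequency_dictionary` is hypothetical, and skipping words that contain symbols outside `acgt` (such as the `r` in the slide's example) is an assumption made here so that the dictionary size stays bounded by $4^k$.

```python
from collections import Counter

def frequency_dictionary(text, k, alphabet="acgt"):
    """Frequencies of non-overlapping words of length k, read from position 0.

    Words containing symbols outside `alphabet` are skipped (an assumption
    of this sketch, not something stated on the slide).
    """
    words = [text[i:i + k] for i in range(0, len(text) - k + 1, k)]
    words = [w for w in words if all(ch in alphabet for ch in w)]
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

seq = "tagggrcgcacgtggtgagctgatgctaggg"
for k in (1, 2, 3, 4):
    d = frequency_dictionary(seq, k)
    # The number of distinct words observed can never exceed 4**k.
    print(k, 4 ** k, len(d))
```

On a short example most of the $4^k$ possible words never occur; the dictionaries only become informative on long sequences.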

gggrcgccacgttggtgagctgatgctagggrcgacgtgg

tagggrcgcacgtggtgagctgatgctagggrcgacgtgg

agggrcgcacgtggtgagctgatgctagggrcgacgtggc

..cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc…

From text to geometry

cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc

$10^7$

cgtggtgagctgatgctagggrcgcacggtgagctgatgctagggrcgcacacttgagctgatgctagggrcgcacaattcgtgagctgatgctagggrcgcacggtg……gagctgatgctagggrcgcacaagtga

length~300-400

3000-4000 fragments

$\mathbb{R}^N$
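One way to read this slide: the sequence is cut into fragments of length about 300, and each fragment becomes a point in an N-dimensional space through its frequency dictionary. The sketch below does this for triplets (N = 64). The names `triplet_vector` and `fragments_to_points` and the step between fragments are assumptions of this illustration; only the fragment length comes from the slide.

```python
from itertools import product

ALPHABET = "acgt"
TRIPLETS = ["".join(p) for p in product(ALPHABET, repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}

def triplet_vector(fragment):
    """64-dimensional vector of non-overlapping triplet frequencies."""
    vec = [0.0] * len(TRIPLETS)
    n = 0
    for i in range(0, len(fragment) - 2, 3):
        j = INDEX.get(fragment[i:i + 3])
        if j is not None:  # skip triplets with symbols outside acgt
            vec[j] += 1.0
            n += 1
    return [v / n for v in vec] if n else vec

def fragments_to_points(genome, width=300, step=100):
    """Slide a window of `width` letters along the genome.

    `step` is a hypothetical parameter; the slide only gives the
    fragment length (about 300 to 400).
    """
    return [triplet_vector(genome[s:s + width])
            for s in range(0, len(genome) - width + 1, step)]
```

Each fragment is always read from its own first letter, so fragments that start at different offsets relative to a gene see that gene in different reading frames.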

Method of visualization: principal components analysis

$\mathbb{R}^N \to \mathbb{R}^2$

PCA plot
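A minimal sketch of the projection onto the first two principal components, assuming NumPy is available. The function name `pca_project` is hypothetical, and the slides do not say which PCA implementation was used; this version centers the data and takes the leading right singular vectors.

```python
import numpy as np

def pca_project(points, n_components=2):
    """Project rows of `points` onto their first principal components."""
    X = np.asarray(points, dtype=float)
    Xc = X - X.mean(axis=0)  # center each coordinate
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T
```

Applied to the frequency vectors of all fragments, the two returned columns are the coordinates of a PCA plot.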

Chapter 1

Basic 7-cluster structure

(level 1 of non-randomness)

Caulobacter crescentus

singles N=4

doublets N=16

triplets N=64

quadruplets N=256

!!!

The information in a genomic sequence is encoded by non-overlapping triplets

First explanation

cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc

Basic 7-cluster structure

gtgagctgatgctagggrcgcacgtggtgagc

gct gat gct agg grc gca cgt

ctg atg cta ggg rcg cac gtg

tga tgc tag ggr cgc acg tgg

gtgaatcggtgggtgagtgtgctgctatgagc

atc ggt ggg tga gtg tgc tgc

tcg gtg ggt gag tgt gct gct

cgg tgg gtg agt gtg ctg ctg
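The three readings of each sequence on this slide differ only in where the first triplet starts. A small sketch that reproduces them; the function name `frame_triplets` is hypothetical:

```python
def frame_triplets(text, shift):
    """Non-overlapping triplets of `text`, starting at offset `shift`."""
    return [text[i:i + 3] for i in range(shift, len(text) - 2, 3)]

seq = "gtgagctgatgctagggrcgcacgtggtgagc"
for shift in (4, 5, 6):
    # Offsets 4, 5, 6 give the three rows shown for the first sequence.
    print(shift, " ".join(frame_triplets(seq, shift)[:7]))
```

Consecutive offsets give three different triplet distributions for the same stretch of text. With three frames for genes on each of the two strands plus the non-coding parts, this accounts for seven clusters.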

Non-coding parts

gtgagctgatgctagggr cgcacgaat

Point mutations: insertions, deletions

a

Mean-field approximation for triplet frequencies

$F_{IJK} \approx P^1_I P^2_J P^3_K$

$F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$):

$F_{AAA}, F_{AAT}, F_{AAC}, \ldots, F_{GGC}, F_{GGG}$: 64 numbers

letter frequency + correlations

$P^j_i$: 12 numbers
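The reduction from 64 numbers to 12 can be sketched as follows. The function names are hypothetical; the code only illustrates the product formula, with the position-specific letter frequencies estimated from a list of triplets.

```python
from collections import Counter
from itertools import product

def position_frequencies(triplets, alphabet="acgt"):
    """P[j][i]: frequency of letter i at position j (0, 1, 2) of the triplets."""
    n = len(triplets)
    counts = [Counter(t[j] for t in triplets) for j in range(3)]
    return [{x: counts[j][x] / n for x in alphabet} for j in range(3)]

def mean_field(P, alphabet="acgt"):
    """Mean-field triplet frequencies: product of position-specific frequencies."""
    return {"".join(t): P[0][t[0]] * P[1][t[1]] * P[2][t[2]]
            for t in product(alphabet, repeat=3)}

triplets = ["atg", "atg", "gcc", "gca"]
P = position_frequencies(triplets)   # 3 positions x 4 letters = 12 numbers
F = mean_field(P)                    # 64 numbers rebuilt from those 12
```

The difference between the observed triplet frequencies and `F` is the part of the signal carried by correlations between positions rather than by letter frequencies alone.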

Why hexagonal symmetry?

0-+

-+0

+0-

+-0

-0+

0+-

GC-content = $P_C + P_G$

Chapter 2

Global structure of codon frequencies

(143 complete bacterial genomes)

Genome codon usage and mean-field approximation

ggtgaATG gat gct agg … gtc gca cgc TAAtgagct

correct frameshift

64 frequencies $F_{IJK}$

ggtgaATG gat gct agg … gtc gca cgc TAAtgagct

12 frequencies $P^1_I$, $P^2_J$, $P^3_K$

Global structure of codon frequencies

eubacteria

archaea

$P^J_I$ are linear functions of GC-content

Four symmetry types of the basic 7-cluster structure

eubacteria

flower-like

degenerated

perpendicular triangles

parallel triangles

Chapter 3

Internal structure of codon frequencies

(level 2 of non-randomness)

Second level of hierarchy

?

Distribution of genes

R64

function1 function2

function3

Fast-growing bacteria

IV

II

I

III

Genes of class I (most genes)

Genes of class II (highly expressed)

Genes of class III (unusual)

Genes of class IV (hydrophobic proteins)

Escherichia coli

Genes of class I (most genes)

Genes of class II (highly expressed)

Genes of class III (unusual)

Genes of class IV (hydrophobic proteins)

Chapter 4

Applications

Computational gene prediction

Accuracy >90%

Protein expression optimization

IV

II

I

III

gene sequence S, protein A

gene sequence S’, same protein A, higher expression
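The slide does not spell out how S’ is obtained from S. One plausible reading, sketched here as an assumption, is synonymous recoding: each codon of S is replaced by the synonymous codon used most often in a reference set of highly expressed genes, so the protein is unchanged. The function names and the reference set are hypothetical; the codon table is the standard genetic code.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks stop codons.
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def codons(seq):
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def translate(seq):
    # Raises KeyError on codons with symbols outside ACGT.
    return "".join(CODE[c] for c in codons(seq))

def preferred_codons(reference_genes):
    """For each amino acid, the synonymous codon most used in the reference genes."""
    usage = Counter(c for g in reference_genes for c in codons(g) if c in CODE)
    best = {}
    for codon, aa in CODE.items():
        if aa not in best or usage[codon] > usage[best[aa]]:
            best[aa] = codon
    return best

def recode(seq, best):
    """Rewrite `seq` with preferred codons; the encoded protein is unchanged."""
    return "".join(best[CODE[c]] for c in codons(seq))
```

A usage example: with `best = preferred_codons(highly_expressed_genes)`, the sequence `recode(S, best)` satisfies `translate(recode(S, best)) == translate(S)`. Whether expression actually increases depends on the organism and is not something this sketch can show.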

Web-site

http://www.ihes.fr/~zinovyev/7clusters

cluster structures in genomic sequences

Papers

Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences. 2004. arXiv e-print.

Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology. V.3, 0039.

Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).

People

Dr. Tanya Popova, Institute of Computational Modeling, Russia

Professor Alexander Gorban, University of Leicester, UK