New issues in storage and analysis

Christophe Roos - MediCel ltd, christophe.roos@medicel.fi

Annotating genomes with functional information: automatic but without errors?

High throughput data acquisition


Spring 2002 - Christophe Roos - 6/6 Functional genomics

Genome annotation

• Annotation is the sum of all non-sequence information that can be connected to a sequence

[Figure: kinds of information linked to a gene. Labels: Sequence; Genome location; Sequence homologs in other genomes; Phylogenetic inference; Expression info; Raw images; Numerical values; Cluster genes; Functional chemistry; Metabolic profiles; Cofactors and metabolites; Metabolic map locator; Connectors to other maps; Structure; Structure annotation; SS assignments; Electron density; Raw data; Experimental data]


Genome annotation

• Laboratory experiments are the primary source of information about what genes do. A single data point may require several experiments.

• Ideally, all of this data should be associated, i.e. hyperlinked among databases.

– Magpie is an environment for genome annotation

• Compare genomes to learn how their structure affects function

– Bacteria have modules of genes that function together, organised in ‘operons’

– Higher organisms need to pack the DNA to fit it into the nucleus. Activating a gene means unpacking the DNA, which is not efficient if it is done for each gene separately


Functional genomics

• High throughput technologies give us long lists of the parts of systems (chromosomes, genomes, cells, etc). We can now analyse how these parts work together to produce the complexity of organisms.

• The function of the genome is

– Metabolism: metabolic pathways convert chemical energy derived from food into useful work in the cell.

– Regulation: regulatory pathways are biochemical mechanisms that control what genomic DNA does. They switch genes on and off in a controlled way.

– Signalling: signalling pathways control the movement of information (chemicals) from one component to another on many levels

– Construction

• Functional genomics tries to map these pathways


Analysing the activity of the genome

• Genomics: look at transcriptional activity of genes

– Transcription: When a gene is transcriptionally active, messenger RNA (mRNA) is synthesised. The amount of mRNA from each active gene varies over time.

– Turnover: Different mRNA species have different half-lives.

– Translation: The production of an mRNA does not imply that the corresponding protein is translated. Transcripts can also be produced for storage and later use.

– Technically feasible: it is possible to isolate all mRNAs from cells and to quantify them within certain limits.

• Proteomics: look at proteins instead of transcripts

– Limited: Presently, acceptable efficiency comes at the expense of insufficient quality

– Closer to ’reality’ since the proteins are the players


EST: Expressed sequence tags

• ESTs are partial sequences of cDNA clones. cDNA clones are DNA synthesised in vitro using mRNA as template.

– Why? cDNA is more stable than mRNA

– How? cDNA can be made ‘en masse’ starting from total cellular mRNA isolates. cDNA libraries are specific for tissue, developmental time, stimulation etc.

– Therefore, looking at cDNA is looking at mRNA is looking at active genes.

– To look at cDNA means sequencing (part of) it.

• Clones are picked at random (10’000-200’000)

• Sequenced from one or both ends once (no proofreading)

• Sequences entered into EST sequence databases
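The sampling idea behind ESTs can be sketched in a few lines. This is a toy illustration only: the library contents, the sequences and the function name `sample_ests` are invented here, and only a single-pass read from one end of each clone is modelled.

```python
import random
from collections import Counter

def sample_ests(cdna_clones, n_picked, read_length, seed=0):
    """Pick clones at random and keep one single-pass read from one end of each."""
    rng = random.Random(seed)
    picked = [rng.choice(cdna_clones) for _ in range(n_picked)]
    # An EST is only a partial sequence: the first `read_length` bases of the clone.
    return [(gene, sequence[:read_length]) for gene, sequence in picked]

# Hypothetical cDNA library: an abundant transcript is represented by more clones.
cdna_clones = (
    [("geneA", "ATGGCGTACGTTAGCCTGA")] * 8
    + [("geneB", "ATGCCCTTTGGGAAATAG")] * 2
)

ests = sample_ests(cdna_clones, n_picked=50, read_length=10)

# Counting tags per gene gives a rough picture of which genes were active.
tag_counts = Counter(gene for gene, _ in ests)
print(tag_counts)
```

Because clones are picked at random, the number of tags per gene roughly follows the abundance of the corresponding mRNA in the library. A real EST read would also carry sequencing errors, since each clone is read only once.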


EST: Expressed sequence tags

• Construct a clone by inserting a piece of DNA into a ’vector’.

• The vector and its insert behave as an independent unit (’plasmid’) in the bacterial host. The vector carries some additional genes that allow for selection (only those bacteria with the vector will survive on antibiotics)

• Amplify and sequence

• Iterate (in parallel)


DNA hybridisation

• DNA is a double helix that can be separated into two strands by denaturing treatment. Each strand becomes ’sticky’ and attempts to renature with complementary single-strand sequences to form hybrids.

• Single-strand DNA from all known genes of a given species can be attached to a matrix, then probed with labelled cDNA molecules from a given sample. Only complementary probes will hybridise, and they can be detected if they have been labelled beforehand (radioactivity, fluorescent stain, ...)

• The technique can be multiplexed:

– High density arrays carrying sticky probes from a full genome

– Parallel hybridisation with cDNA from various sources
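The pairing rule behind hybridisation can be illustrated with a minimal sketch. It assumes perfect base pairing only, whereas real hybridisation tolerates some mismatches and depends on conditions such as temperature; the spot names and sequences are invented for the example.

```python
# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the strand that would pair with `seq` in a double helix."""
    return seq.translate(COMPLEMENT)[::-1]

def hybridises(probe, target):
    """True if the single-stranded probe pairs perfectly somewhere on the target."""
    return reverse_complement(probe) in target

# Hypothetical array: spot name -> single-stranded DNA attached to the matrix.
array_spots = {
    "geneA": "ATGGCGTACGTTAGC",
    "geneB": "ATGCCCTTTGGGAAA",
}

# Hypothetical labelled cDNA fragments prepared from one sample.
labelled_cdna = ["GCTAACGTACGC", "TTTTTTTTTTTT"]

# Signal per spot = number of labelled fragments that stick to it.
signal = {
    spot: sum(hybridises(cdna, seq) for cdna in labelled_cdna)
    for spot, seq in array_spots.items()
}
print(signal)
```

Here only the first fragment finds a complementary stretch, and only on the `geneA` spot, so that spot alone would light up.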


The process of using microarrays

Building the chip: massive PCR, PCR purification and preparation, preparing slides, printing

Preparing RNA: cell culture and harvest, RNA isolation, cDNA production

Hybridising the chip: post-processing, array hybridisation, probe labelling, data analysis


The output: the image raw data

[Figure: the array is scanned with laser 1 and laser 2, the emission is recorded, and the two images are overlaid and normalised for analysis]

cDNA is prepared from two samples (in this example) and labelled, each sample with a distinct color. The array is then hybridised with the double probe and the signal is recorded as images
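Once spot intensities have been extracted from the two images, a common way to compare the samples is the log ratio of the two channels. The sketch below is an illustration under stated assumptions: the intensities are invented, background correction is taken as already done, and centring on the median is one simple normalisation choice among several.

```python
import math
from statistics import median

# Hypothetical background-corrected spot intensities, one dict per labelled sample.
channel_1 = {"geneA": 1200.0, "geneB": 300.0, "geneC": 800.0}
channel_2 = {"geneA": 600.0, "geneB": 600.0, "geneC": 1600.0}

def log_ratios(ch1, ch2):
    """log2(ch1 / ch2) for every spot measured in both channels."""
    return {g: math.log2(ch1[g] / ch2[g]) for g in ch1 if g in ch2}

def median_normalise(ratios):
    """Shift all log ratios so that their median is zero."""
    centre = median(ratios.values())
    return {g: r - centre for g, r in ratios.items()}

raw = log_ratios(channel_1, channel_2)
normalised = median_normalise(raw)
print(raw)
print(normalised)
```

Median centring assumes that most genes do not change between the two samples, so that a systematic offset between the channels reflects labelling or scanning bias, not biology.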


Problems in image analysis

• Noise

• Spot detection and intensity

• Alignment of the overlay


A set of experiments on yeast...

• Each row represents one gene

• Each column represents one experiment

– The columns have been organised into related sets of experiments (ALPH, ELU,...)

• The colors indicate gene activity (from high to absent)


Clustering the resulting data

• Looking at 10’000 genes is not easy

• Group genes into clusters of genes that behave the same way over a set of several experiments

– Hierarchical clustering

– K-means clustering

– Self-organising maps (SOM)

– Etc.
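As an illustration of one of the listed methods, here is a deliberately small k-means sketch in plain Python. The expression profiles are invented, and starting from the first k profiles is an assumption made only to keep the example reproducible; real analyses typically use random restarts and a dedicated library.

```python
import math

def kmeans(profiles, k, iterations=20):
    """Tiny k-means: `profiles` maps gene name -> list of expression values."""
    genes = list(profiles)
    # Assumption for reproducibility: use the first k profiles as starting centroids.
    centroids = [list(profiles[g]) for g in genes[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assign each gene to its nearest centroid (Euclidean distance).
        assignment = {
            g: min(range(k), key=lambda c: math.dist(profiles[g], centroids[c]))
            for g in genes
        }
        # Move each centroid to the mean profile of its members.
        for c in range(k):
            members = [profiles[g] for g in genes if assignment[g] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical log ratios for four genes over four experiments.
profiles = {
    "gene1": [2.0, 1.8, -0.1, 0.0],
    "gene2": [-1.5, -1.7, 0.2, 0.1],
    "gene3": [1.9, 2.1, 0.0, -0.2],
    "gene4": [-1.6, -1.4, 0.1, 0.0],
}

clusters = kmeans(profiles, k=2)
print(clusters)
```

Genes with similar profiles across the experiments end up with the same cluster number, which is the grouping the slide describes.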


The overall process with microarrays

• Microarray data has to be used in a larger frame of experimentation


Making a model of the data

[Figure labels: Sequence, Structure, Function; Interaction, Network, Function; Genome, Transcriptome, Proteome]

1. Elements

2. Binary relations

3. Networks

[Figure labels for network types: Pathway, Assembly, Neighbour, Cluster, Hierarchical tree, Genome]


Comparing networks

• Gain new biological information by comparison of networks

• What is the metric?

• How is it done? Is it simply a problem of graph isomorphism?

Pathway vs. Pathway

Pathway vs. Genome

Genome vs. Genome

Cluster vs. Pathway


Biological graph comparison

• Search heuristically for clusters of correspondence

[Figure: Graph 1 (nodes A to K) and Graph 2 (nodes a to k) are linked by a list of correspondences (A - a, B - b, C - c, D - d, ...). A clustering algorithm searches the two graphs for clusters of corresponding nodes.]
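One simple heuristic for finding clusters of correspondence is sketched below. It is a stand-in chosen for illustration, not necessarily the algorithm behind the figure: it keeps the links of graph 1 whose endpoints map, through the correspondence table, onto a link of graph 2, and then reports the connected groups. The edges are invented, since the figure's edges cannot be recovered from the text.

```python
from collections import defaultdict

def conserved_clusters(edges1, edges2, correspondence):
    """Group nodes of graph 1 whose links also exist, via the
    correspondence table, in graph 2."""
    edges2 = {frozenset(e) for e in edges2}
    neighbours = defaultdict(set)
    for u, v in edges1:
        if u in correspondence and v in correspondence:
            if frozenset((correspondence[u], correspondence[v])) in edges2:
                neighbours[u].add(v)
                neighbours[v].add(u)
    # Connected components of the conserved-edge graph.
    clusters, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Hypothetical edges for the two graphs and a one-to-one correspondence table.
graph1 = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "G"), ("G", "H")]
graph2 = [("a", "b"), ("b", "c"), ("e", "g"), ("d", "k")]
correspondence = {name: name.lower() for name in "ABCDEGH"}

result = conserved_clusters(graph1, graph2, correspondence)
print(result)
```

With these inputs the links A-B, B-C and E-G are conserved, giving the clusters A, B, C and E, G. This avoids solving full graph isomorphism, at the price of depending entirely on the quality of the correspondence table.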


Example: genomic, metabolic, structural

Genome-pathway comparison reveals the correlation between (a) the physical coupling of genes in the genome (operon structure) and (b) the functional coupling of gene products in the pathway

E. coli genome

hisL hisG hisD hisC hisB hisH hisA hisF hisI

yefM yzzB
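The comparison can be sketched as a search for runs of neighbouring genes whose products belong to the same pathway. The gene order below is the his row listed on the slide; the set of genes assigned to the histidine pathway is supplied here as an assumption for illustration and should be checked against a curated pathway database.

```python
def coupled_runs(genome_order, pathway_genes, min_length=2):
    """Maximal runs of neighbouring genes whose products all belong to one pathway."""
    runs, current = [], []
    for gene in genome_order:
        if gene in pathway_genes:
            current.append(gene)
        else:
            if len(current) >= min_length:
                runs.append(current)
            current = []
    if len(current) >= min_length:
        runs.append(current)
    return runs

# Gene order as listed on the slide for the E. coli his region.
genome_order = ["hisL", "hisG", "hisD", "hisC", "hisB",
                "hisH", "hisA", "hisF", "hisI"]

# Assumption: genes whose products are enzymes of histidine biosynthesis.
histidine_pathway = {"hisG", "hisD", "hisC", "hisB", "hisH", "hisA", "hisF", "hisI"}

runs = coupled_runs(genome_order, histidine_pathway)
print(runs)
```

A long run such as hisG to hisI is the signature described on the slide: genes that sit next to each other in the genome (physical coupling) encode enzymes that work in the same pathway (functional coupling).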


Example: genomic, metabolic, structural

[Figure: metabolic map of HISTIDINE METABOLISM, with links to the pentose phosphate cycle and purine metabolism. Enzymes are shown as EC numbers, for example 2.4.2.17, 3.6.1.31, 3.5.4.19, 5.3.1.16, 2.6.1.9, 3.1.3.15 and 1.1.1.23. Compounds shown include PRPP, phosphoribosyl-ATP, phosphoribosyl-AMP, phosphoribosyl-formimino-AICAR-P, phosphoribulosyl-formimino-AICAR-P, imidazole-glycerol-3P, imidazole-acetol P, L-histidinol-P, L-histidinal, L-histidine, histamine, carnosine, anserine, 1-methyl-L-histidine, imidazole-4-acetate and N-formyl-L-aspartate.]


Example: genomic, metabolic, structural

[Figure: metabolic map "...NE, TYROSINE AND TRYPTOPHAN BIOSYNTHESIS", with links to tyrosine metabolism, tryptophan metabolism, alkaloid biosynthesis I, folate biosynthesis and ubiquinone biosynthesis. Enzymes are shown as EC numbers, for example 2.6.1.9, 2.6.1.57, 2.6.1.1, 2.6.1.5, 4.2.1.20, 4.2.1.10, 4.2.1.51, 4.2.1.91 and 5.4.99.5. Compounds shown include 3-dehydroquinate, 3-dehydroshikimate, shikimate, chorismate, prephenate, phenylpyruvate, 4-hydroxyphenylpyruvate, pretyrosine, tyrosine, anthranilate, indole, L-tryptophan and 4-aminobenzoate.]

SCOP hierarchical tree

1. All alpha

2. All beta

3. Alpha and beta (a/b)

3.1 beta/alpha (TIM)-barrel

3.2 Cellulases

...

3.74 Thiolase

3.75 Cytidine deaminase

4. Alpha and beta (a+b)

5. Multi-domain (alpha and beta)

6. Membrane and cell surface pro

7. Small proteins

8. Peptides

9. Designed proteins

10. Non-protein


More challenges?

The list of genes that are activated, inactivated or unaffected when two samples are compared becomes more informative if the genes can be placed on maps from which functions can be deduced.
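A minimal sketch of such a mapping is shown below. Both the gene statuses and the gene-to-map assignments are invented for illustration; in practice they would come from the expression analysis and from a pathway database.

```python
from collections import Counter

# Hypothetical result of comparing two samples, after thresholding.
gene_status = {
    "hisG": "up",
    "hisD": "up",
    "hisC": "up",
    "trpA": "down",
    "aroB": "unchanged",
}

# Hypothetical gene -> map assignments; a gene may sit on several maps.
gene_to_maps = {
    "hisG": ["Histidine metabolism"],
    "hisD": ["Histidine metabolism"],
    "hisC": ["Histidine metabolism", "Aromatic amino acid biosynthesis"],
    "trpA": ["Aromatic amino acid biosynthesis"],
    "aroB": ["Aromatic amino acid biosynthesis"],
}

def summarise_by_map(gene_status, gene_to_maps):
    """Count activated, inactivated and unaffected genes on each map."""
    summary = {}
    for gene, status in gene_status.items():
        for map_name in gene_to_maps.get(gene, ["unmapped"]):
            summary.setdefault(map_name, Counter())[status] += 1
    return summary

summary = summarise_by_map(gene_status, gene_to_maps)
for map_name, counts in summary.items():
    print(map_name, dict(counts))
```

A map on which most genes move in the same direction, as the histidine map does here, suggests a function to look at more closely. A flat gene list would not show that pattern.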
