Identifying conserved spatial patterns in genomes

Identifying conserved spatial patterns

in genomes

Rose HobermanComputer Science Department

Carnegie Mellon University

University of Chicago0ct 20, 2006

My focus: Spatial Comparative Genomics

Understanding genome structure, especially how the spatial arrangement of elements within the genome changes and

evolves.

A simple model of a genome

an ordered list of genes

4 53 71 62 8 9 11 12 1310 14 15 1716 19 20183

Genomic Change

speciation

Sequence Mutation + Chromosomal Rearrangements

species 2

species 1

Ancestral genome

Types of Genomic Rearrangements

4 53 71 62

8 97 11 12104 531 62 13 14 15 1716 19 2018

8 9 11 12 1310 14 15 1716 19 2018

Inversions

Duplications/Insertions

3 89111213 10141517 161920 18

20191813 14 15 1716

8 97 11 12104 531 62 431 2 2013 14 15 1716

Fissions and fusions

4 53 71 2 3 13141517 161920 18 3 891112 10

Types of Genomic Rearrangements

InversionsDuplications/InsertionsLoss

An Essential Task forSpatial Comparative Genomics

4 53 71 2

8 97 11 12104 531 62 431 2

13141517 161920 18

2013 14 15 1716

891112 10

Species 2

Species 1

Identify chromosomal regions that descended from the same region in the genome of the common ancestor

Outline

Introduction and Motivation Evolution of spatial organization Applications: why identify related genomic

regions?

Problem Background Why is this challenging? Introduction to cluster finding

Results Statistics for pairwise clusters Statistics for three-way clusters

Pevzner, Tesler. Genome Research 2003

Identification of homologous chromosomal segments is a key task in comparative genomics1. Genome evolution

Reconstruct history of chromosomal rearrangements

Infer ancestral genetic map

Phylogeny reconstruction

Identify ancient whole genome duplications

Identification of homologous chromosomal segments is a key task in comparative genomics

Guillaume Bourque et al. Genome Research 2004

1. Genome evolution Reconstruct

history of chromosomal rearrangements

Whole genome duplication

chromosome 2chromosome 1

Ancestral chromosome

Identify ancient whole genome duplications McLysaght et al

Nature Genetics, 2002.

Infer functional associations

Predict operons

Identify horizontal transfers

2. Understand gene function and regulation in bacteria

Identification of conserved chromosomal structure is a key task in comparative genomics

Insertion

Inversions

Outline

regions?

Closely related genomes

Related regions are easy to identify

4 53 71 2

8 97 11 12104 531 62 431 2

13141517 161920 18

2013 14 15 1716

891112 10Species 1

Species 2

Five hundred million years...

Homologous regions are harder to detect, but there is still spatial evidence of common ancestry Similar gene content Neither gene content nor order is perfectly

preserved

More Distantly Related Genomes

3 71 2

89 11 12104 51 62 31 2

89 11121310 141517 161920

2013 1415 17 16

Gene clusters Similar gene content Neither gene content nor order is perfectly

3 71 2

89 11 12104 51 62 31 2

89 11121310 141517 161920

2013 1415 17 16

The signature of diverged regions

A Framework for Identifying Gene Clusters

1. Find homologous genes

2. Formally define a “gene cluster”

3. Design an algorithm to find clusters

4. Statistically verify clusters

Why Validate Clusters Statistically?

After sufficient time has passed, gene order will become randomized

Uniform random data tends to be “clumpy”

Some genes will end up close together in both genomes simply by chance

43 71 2

89 11 12104 51 62 31 2

1310 141517 161920

2013 1415 17 16

Cluster Statistics

r-windows

max-gap

2 regionsDurand & Sankoff,

2003my work

3 regions my work unsolved

Cluster Models

Outline

regions?

3. Design an algorithm to find clusters

Gene Homology

Identification of homologous gene pairs generally based on sequence similarity conserved genomic context is also informative

Assumptions matches are binary (similarity scores are

discarded) each gene is homologous to at most one other

gene in the other genome

1. Find homologous genes Formally define a “gene cluster”

3. Devise an algorithm to identify clusters

Where are the gene clusters?

Intuitive notion: pairs of regions that are dense with homologs

How can we formalize this intuition?

The r-window cluster definition

Two windows of size r that share at least m homologous gene pairs

r is a user-specified parameter

(Calvacanti et al 03, Durand and Sankoff 03, Friedman and Hughes 01, Raghupathy and Durand 05)

A max-gap chain

The distance or “gap” between genes is equal to the number of intervening genes

A set of genes in a genome form a max-gap chain if the gap between adjacent genes is never

greater than g (a user-specified parameter)

g= 2 gap= 3

The max-gap cluster definition

g= 2 g= 3gap= 3

A set of genes form a max-gap cluster in two genomes if

1. the genes forms a max-gap chain in each genome

2. the cluster is maximal (i.e. not contained within a larger cluster)

2. Formally define a “gene cluster” Devise an algorithm to identify

clusters

Max-gap search algorithms

Many genomic studies use a max-gap criteria Each group designs their own search algorithm These are often greedy algorithms, but greedy

algorithms miss disordered max-gap clusters

Hoberman et al, RECOMB Comp Genomics 2005

Greedy, Agglomerative Algorithms

1. initialize a cluster as a single homologous pair

2. search for a gene in proximity on both chromosomes

3. either extend the cluster and repeat, or terminate

Greedy Algorithms Impose Order Constraints

A max-gap cluster of size four

A greedy, agglomerative algorithm will not find this cluster since there is no max-gap cluster of size

Algorithms and Definition Mismatch

Greedy, bottom-up algorithms will not find all max-gap clusters

There is an efficient divide-and-conquer algorithm to find maximal max-gap clusters (Bergeron et al, WABI, 2002)

Cluster statistics depend on the search space, which depends on which algorithm is used

3. Devise an algorithm to identify clusters

Statistically verify clustersAn example…

Example: Whole genome self-comparison to detect duplicated blocks

Chose g=30

Compared all human chromosomes to all other chromosomes to find max-gap gene clusters

McLysaght et al. Nature Genetics, 2002.

How can we use statistical analysis?

Could two regions display this degree of similarity simply by chance?

Is g=30 a reasonable choice of gap size? Are larger clusters less likely to occur by

chance? How large does a cluster have to be before we

are surprised to observe it?

McLysaght et al. Nature Genetics, 2002.

3 genes duplicatedout of ~25 13 genes

29 genes

10 genes duplicatedout of ~100

Chr 17

Outline

regions?

Cluster Statistics

r-windows

max-gap

2003my work

Cluster Models

The max-gap definition is the most widely used cluster definition in genomic analyses

Yet there is no formal statistical model for max-gap clusters

Overbeek et al 1999: inferring functional coupling of genes in bacteriaVision et al 2000: origins of genomic duplications in ArabidopsisFriedman and Hughes 2001: gene duplication and structure of eukaryotic

genomesTamames 2001: evolution of gene order conservation in prokaryotesVandepoele et al 2002: microcolinearity between Arabidopsis and riceMcLysaght et al 2002: genomic duplication during early chordate

evolutionSimillion et al 2002: hidden duplications in Arabidopsis Blanc et al 2003: recent polyploidy in ArabidopsisLuc et al 2003: gene teams for comparative genomicsChen et al 2004: operon prediction in newly sequenced bacteriaBourque et al 2005: comparison of mammalian and chicken genome

architectures …

Is this cluster biologically meaningful? Could it have occurred in a comparison of

two random genomes?

The QuestionSuppose two whole genomes were

compared, and this max-gap cluster was identified:

Statistical Testing

Hypothesis testing Alternate hypothesis: shared ancestry Null hypothesis: random gene order

Discard clusters that could have arisen under the null model

Determine the probability of observing a similar cluster under the null hypothesis

Given an allowed gap size of g, what is the probability of observing a max-gap cluster

containing exactly h matching gene pairs?

How do we calculate this probability?

The Problem

The Inputs

n: number of genes in each genome

m: number of matching genes pairs

g: the maximum gap allowed in a cluster

h: number of matching genes in the cluster

m=6 n=22

The probability when m = n

…the probability of a max-gap cluster is 1

(regardless of the allowed gap size)

If gene content is identical…

Probability of a cluster of size h when m < n

h genes m genesm-h genes

1. Create chains of h genes in both

genomes

2. Place m-h remaining genes so they do not

extend the cluster

3. Normalize to get a probability

Basic approachEnumerate all ways to:

Probability of observing a cluster of size h

number of ways to place h genes so they form a chain

in both genomes

number of ways to place m-h remaining genes so they do not

extend the cluster

All configurations of m gene pairs in two genomes of size n

in both genomes

extend the cluster

Number of ways to place h genes in two genomes so they form a cluster

h genes m genesm-h genes

Choose h genes to compose

the cluster

Assign each gene to a

selected spot in each genome

Select h spots in each genome, so they form a

max-gap chain

Hoberman, Sankoff, Durand RECOMB Comparative Genomics, 2004

in both genomes

extend the cluster

Approach: design a rule specifying where the genes can

be placed so that the cluster is not extended count the positions that satisfy the rule

Counting the number of ways to place m-h genes outside the cluster

Rule 1: A gene can go anywhere except in the cluster (the white box).

gaps ≤ 1

Too lenient

overcounts

Rule 2: Every gene must be at least g+1 positions from

the cluster (outside the grey box).

Too strict

undercounts

h = 3 gap > 1

gap > 1

Rule 2: Every gene must be at least g+1 positions from

the cluster (outside the grey box).

Too strict

undercounts

gap > 1

Rule 3: At most one member of each gene pair can be in

the grey box.

Too lenient

overcounts

Rule 3: At most one member of each gene pair can be in

the grey box.

Counting the number of ways to place m-h genes outside the clustergaps ≤ 1

Too lenient

overcounts

Acceptable positions for a gene depend on the positions of the remaining genes

Use strict and lenient rules to calculate upper and lower bounds on G

Estimating G

Upper bound: Erroneously

enumerates this configuration

Lower bound: Fails to enumerate this configuration

in both genomes

extend the cluster

Hoberman, Sankoff, Durand Journal of Computational Biology, 2005

What can we learn from this statistical result? Are we less likely to observe a large cluster

(containing more gene pairs) than a small cluster?

How large does a cluster have to be before we are surprised to observe it?

How do we choose the maximum allowed gap value?

Whole-genome comparison cluster statistics

n=1000, m=250

h (cluster size)

With a significance

threshold of 10-4, any cluster

containing 8 or more genes is

significant.

E. coli vs B. Subtilis

Under null hypothesis, by g=25 all genes should form a single cluster

Completecluster doesn’t form until g=110

Typical operonsizes

clusters above the orange line are significant at the .001 level

Algorithm:Bergeron et al, 02

Statistics:Hoberman et al, 05

By g=9, probabilities no longer decrease monotonically with cluster size

Statistical analysis of max-gap gene clusters

1. Data analysis: Identifies statistically significant max-gap clusters

2. Search: Suggests a more principled approach for choosing a gap size that will yield significant clusters

3. Modeling: Provides insight into criteria for cluster definitions larger clusters should be less likely to occur by

chance

Outline

regions?

Cluster Statistics

r-windows

max-gap

2003my work

Cluster Models

Joint work with Narayanan Raghupathy

Duplicated segments can be particularly difficult to identify

chromosome 2

chromosome 1

Duplicated segments can be particularly difficult to identify

Reciprocal gene loss

Comparisons of multiple genomes offer significantly more power to detect highly diverged homologous segments

Arabidopsis thaliana region 1

Rice chr 2

Arabidopsis thaliana region 2

Simillion et al, PNAS 2002

At 2At 1

126 26

At 2At 1

Existing Statistical Approaches

x23x123

x3x13x1

1. Only consider x123, genes shared between all regions disregards pairwise overlaps2. Conduct multiple pairwise comparisons does not consider the greater impact of genes

shared among all three regions requires that at least two pairwise clusters are

independently significant

Durand and Sankoff, J Comp Biology, 2003Methods reviewed in Simillion et al, Bioessays, 2004

No existing statistical approach considers all these quantities in a single test.

r-windows for pairwise comparison

Two windows of size r that share at least m homologous gene pairs

With two regions there is a natural test statistic

An r-window cluster spanning two regions can be characterized by the number of shared

genes, m.

mr-m r-m

Durand and Sankoff, J Comp Biology, 2003

The probability that two random windows share m

genes:

With three regions there are many more quantities to consider

Three-way overlap: x123

Pairwise overlaps: x12, x23, x13

x23x123

x2x12x1

We want to consider both types of overlaps

Our r-window statistics for three windows

Given three randomly selected windows of length r,

we derived equations for the probability of observing such a cluster

under a null hypothesis of random gene order.

),,,( 232313131212123123 xXxXxXxXP

Raghupathy, Hoberman, Durand. APBC 2007.

x23x123

x2x12x1

First tests that uses both pairwise and three-way overlaps

How serious are these limitations?

Pairwise tests have a number of limitations:

1. They do not consider the greater impact of genes shared among all three regions

2. They require that at least two pairwise comparisons are independently significant

Our derived expressions can be used to compare pairwise and three-way cluster probabilities for typical genome parameters and window sizes.

What is the impact of ignoring three-way genes?

a) Three genes are shared by all three windows

b) Three distinct genes are shared by each pair of windows

Each pair of regions shares exactly

three genes

Which cluster is least likely to occur by chance?

n=5000, r=100

hxxij 123,00, 123 xhxij

(a) is less likely to occur than (b)

a) Three genes are shared by all three windows

b) Three distinct genes are shared by each pair of windows

Pairwise tests cannot distinguish these two clusters

but pairwise tests score them identically

How much of a limitation is this?

Pairwise tests have a number of limitations:

1. They do not consider the greater impact of genes shared among all three regions

2. They require that at least two pairwise comparisons are independently significant

Our derived expressions can be used to compare pairwise and three-way cluster probabilities for typical genome parameters and window sizes.

When no genes appear in all three regions, are pairwise statistical tests sufficient?

n=5000, r =100, x123=0

α=0.001

Two Pairwise Tests

Three-window

When no genes appear in all three regions, are pairwise statistical tests sufficient?

The three-window test is strictly more general than two pairwise tests

The three-window test will always reject the null hypothesis for any cluster in which the pairwise tests reject the null hypothesis

Two Pairwise Tests

Three-window

These results suggest that multi-region tests can identify more distantly related homologous regions.

There is an ongoing debate about whether there was 1 or 2 whole genome duplications in early vertebrates

More powerful statistical tests might help resolve this issue

Summary1. First statistical tests for max-gap clusters,

the most commonly used definition determines minimum significant cluster size allows more principled selection of gap

parameter reveals problems with max-gap definition

2. More sensitive statistical tests for comparisons of three genomic regions first test to consider both pairwise and three-

way overlaps can identify more distantly related regions may be able to resolve outstanding questions in

molecular evolution

Acknowledgements

Collaborators Dannie Durand, Biological Sciences and

Computer Science, CMU David Sankoff, Math and Statistics University of

Ottawa Narayanan Raghupathy, Biological Sciences,

Funders Barbara Lazarus Women@IT Fellowship The Sloan Foundation

Thanks

Questions?

Advantages of an analytical approach

Analyzing incomplete datasets Principled parameter selection Understanding statistical trends Insight into tradeoffs between

definitions Efficiency Accuracy

1 2 3 4 5 …. n-L+1 …. n

Ways to place the leftmost gene in the chain, so there are at least L-1 places left

The maximum length of the chain is: L = (h-1)g + h

The number of ways to create a chain of h genes

Ways to place the leftmost gene in the chain, so there are at least L-1 slots left

There are h-1 gaps in a chain of h genes

Choices for the size of each gap

(from 0 to g)

Chains near the end of

the genome

Ways to place the leftmost gene in the chain, so there are at least L-1 slots left

There are h-1 gaps in a chain of h genes

1 2 3 4 5 …. n-L+1 …. n

Choices for the size of each gap

(from 0 to g)

l = w-1

Gaps are constrained:

And sum of gaps is constrained:

Counting clusters at the end of the genome

g1 g2 g3 gm-1

A known solution:

l = w-1

Gaps are constrained:

And sum of gaps is constrained:

Counting clusters at the end of the genome

d(m,g,m)

Cluster Length

d(m,g,m)

d(m,g,m) + d(m,g,m+1) +

d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2)

d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2) + d(m,g,w-1)

ww-1 … m+1m

d(m,g,m+1)+

d(m,g,m) d(m,g,m+1)+

Line of Symmetry

l = w-1

Exploiting Symmetry

m+1 w-1

=m+2 w-2 1

g-1 g-1

d(m,g,m)

Cluster Length

d(m,g,m)

d(m,g,m) + d(m,g,m+1) +

d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2)

d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2) + d(m,g,w-1)

ww-1 … m+1m

d(m,g,m+1)+

d(m,g,m) d(m,g,m+1)+

(g+1)m-1

l = w-1

Number of ways to position h genes in a genome of n genes so they form a max-gap chain

Ways to placeremaining h-1

Starting positions near end

Starting positions

Identifying conserved spatial patterns in genomes

Documents

Selecting Genomes for Reconstruction of A ncestral Genomes

BB30055: Genes and genomes Genomes - Dr. MV Hejmadi (bssmvh@bath.ac.uk)

ÒViroids: filling the lower size niche for RNA genomes Ó Flores.pdf · Viroides Presencia de una región central conservada (CCR) With a Central Conserved Region (CCR) Without a

MOTIF-EM Large Macromolecular Assemblies (LMAs ...robotics.stanford.edu/~mitul/motifEM/motifEM.ppt.pdf1 MOTIF-EM : an Automated Computational Tool for Identifying Conserved Domains

Browsing Genomes with Ensembl - CNIObioinfo.cnio.es/files/training/Fifth_Genome_Browsing... · 2011-11-10 · Compare conserved regions with the position of genes and population variation

Genomes & Developmental Control Prediction and … · Genomes & Developmental Control Prediction and characterisation of a highly conserved, remote and cAMP responsive enhancer that

Identifying candidate Aspergillus pathogenicity factors by ......Functional annotation relative frequencies clustered strains by species A total of 216 Aspergillus genomes were initially

Genomes & Biotechnology

Microbial Genomes !

Chapter 21: Genomes & Their Evolution - Untitled Page chapter 21-F16.pdf · Chapter 21: Genomes & Their Evolution 1. Sequencing & Analyzing Genomes 2. How Genomes Evolve. 1. Sequencing

Evolution (1 st lecture). Finding Elements in DNA Conserved by Evolution Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes

Functionally conserved architecture of hepatitis C …Functionally conserved architecture of hepatitis C virus RNA genomes David M. Maugera, Michael Goldenb, Daisuke Yamanec,d, Sara

Bioinformatics approaches to gene expression - … · Bioinformatics approaches . to gene expression. ... will cover genomes (Ch. 12-18). ... Identifying protein-coding genes in genomic

Evolution of Plant Genomes Phillip McClean April, 2012mcclean/plsc731... · Here we have genes that have the same conserved order found near the two ends of the A. thaliana chromosome

Organelle genomes

Large-Scale Bioinformatics Analysis of Bacillus Genomes ... · Large-Scale Bioinformatics Analysis of Bacillus Genomes Uncovers Conserved Roles of Natural Products in Bacterial Physiology

2018 Identification and characterization of Coronaviridae genomes from Vietnamese bats and rats based on conserved prote

Barcodes Genomes

The Plant Journal A lineage-speciﬁc centromere ...€¦ · retrotransposons. Unlike most transposons in plant genomes, the centromeric retrotransposon (CR) family is conserved over

RESEARCH ARTICLE Open Access Complete plastid ......Background The plastid genome has remained remarkably conserved throughout the evolution of land plants (reviewed in [1-3]). Genomes