
Pavel Morozov March 3 Legionella Functional Genomics Project


Legionella pneumophila

• An intracellular pathogen that can invade and replicate inside human macrophages and causes a potentially fatal human infection, Legionnaires' disease.
• Transmitted through inhaling mist droplets containing the bacteria.
• Has an extraordinary ability to survive in many different ecological niches (axenic cultures, biofilms with other organisms, and intracellular vacuoles of amoebae, ciliates and human cells).
• In order to replicate, Legionella must be inside protozoa (amoebae such as Acanthamoeba), which are single-cell eukaryotes, or inside human lung macrophages or monocytes.

[Figure: the Legionella intracellular infection cycle, stages I to VI. Labels: adhesion, invasion; inhibition of lysosome fusion; recruitment of ER; evasion; modulation of host-cell gene expression; replication. Page from Science.]

Complete genome of Legionella pneumophila (strain Philadelphia 1).

[Figure: genome map. Tracks: genes on the direct strand, genes on the reverse strand, G+C content, GC skew. Marked features: oriC, replication terminus, region 3 (efflux), region 7 (tra/trb region, F-plasmid).]

Legionella pneumophila (strain Philadelphia 1) genome. The highlighted regions are noteworthy because their G+C content and GC skew differ from the genome average and their ORFs show a skewed strand preference. These computationally determined regions turn out to contain gene clusters that belong to specific categories (e.g., the ribosomal protein cluster), or that correspond to points of genome rearrangement or were acquired by horizontal transfer. Some examples are shown in more detail below.
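The two tracks named in the genome map can be computed with a sliding window. The sketch below is only an illustration: the function name `gc_windows`, the window and step sizes, and the use of the common definition GC skew = (G - C) / (G + C) are assumptions, since the slides do not say how the tracks were calculated.

```python
def gc_windows(seq, window, step):
    """Yield (start, gc_content, gc_skew) for sliding windows over seq.

    gc_content = (G + C) / window
    gc_skew    = (G - C) / (G + C), taken as 0.0 when the window has no G or C
    """
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        gc_content = (g + c) / window
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        yield start, gc_content, gc_skew


# Toy sequence; a real genome would use windows of thousands of bases.
rows = list(gc_windows("GGGCATATCCCG", window=4, step=4))
for start, content, skew in rows:
    print(start, content, skew)
```

Windows whose values differ strongly from the genome-wide average are the kind of region highlighted on the map.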

Project goals

• Study molecular mechanisms (genetics and regulation) of
  – Legionella ability to survive in different ecological niches.
  – Legionella infection.
• Extended genome annotation of Legionella species (Philadelphia, Paris, Lens strains).
• Custom whole-genome microarrays.
• Network reconstruction and modeling.

Microarray Design. History of Legionella Microarrays.

June 2001
• 1,344 clones in triplicate
• 40% of the genome

October 2003
• 3,230 clones
• 90% of the genome

September 2005
• 2,997 70-mer oligos
• Whole-genome array
• 3,005 genes in duplicate
• 640 reference controls

Requirements for Microarray Probes.

The goal was to design 70-mer probes covering all protein- and RNA-coding genes, plus control probes for testing background and array properties.

Requirements common to all probes:
– should not contain short nucleotide stretches that are too abundant;
– should be free from secondary structure elements;
– should have approximately the same melting temperature.

Requirements specific to gene probes:
– 70-mers should be unique (occur once) in the experimental system (Legionella, human, E. coli).

Requirements specific to array control probes:
– should not exist in the experimental system (Legionella, human, E. coli).

Microarray probe design using unique oligonucleotides of particular length.

[Schematic: a CDS or genomic sequence drawn 5' to 3', marked with unique 14-mer oligonucleotides and overrepresented 8-mers, with the 70-mer microarray probe indicated.]

In simplified form, probe selection can be described as selecting regions with the maximum number of unique oligonucleotides (here 14 bp long) and the minimum number of overrepresented shorter oligonucleotides (here 8 bp long).

In the actual study we have to use oligonucleotides of different lengths and also check the probe melting temperature.

Using unique oligonucleotides to design probes automatically removes secondary structure issues.
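The simplified selection rule can be sketched as a scan over all candidate windows. This is a minimal illustration, not the author's program: the function name `best_probe_start`, the per-position flag lists, and the tie-breaking rule (most unique oligonucleotides first, then fewest overrepresented ones) are assumptions. Melting temperature is not checked here.

```python
def best_probe_start(unique_flags, over_flags, probe_len, uniq_len, over_len):
    """Return (start, n_unique, n_over) for the best probe window, or None.

    unique_flags[i] is True if the uniq_len-mer starting at position i is unique.
    over_flags[i] is True if the over_len-mer starting at position i is
    overrepresented. Both lists have one entry per sequence position.
    """
    best = None
    n = len(unique_flags)
    for start in range(0, n - probe_len + 1):
        # Only oligonucleotides lying fully inside the probe window are counted.
        n_unique = sum(unique_flags[start:start + probe_len - uniq_len + 1])
        n_over = sum(over_flags[start:start + probe_len - over_len + 1])
        key = (n_unique, -n_over)  # assumed ranking: more unique, then fewer overrepresented
        if best is None or key > best[0]:
            best = (key, start, n_unique, n_over)
    return best[1:] if best else None


# Toy example with a 4-base "probe" and 2-base oligonucleotides.
unique_flags = [False, True, True, True, False, False]
over_flags = [True, False, False, False, False, True]
choice = best_probe_start(unique_flags, over_flags, probe_len=4, uniq_len=2, over_len=2)
print(choice)
```

For the array described here the call would use probe_len=70, uniq_len=14 and over_len=8.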

Choosing length of oligonucleotides

[Schematic: a DNA or RNA (genomic or mRNA) sequence with oligonucleotides of length n, n+1, n+2, n+3, ..., n+k; labels: ancestors, descendants.]

[Figure, panels A and B: number of novel unique oligonucleotides versus oligonucleotide length. A) Simulations for length 1 Mb (lengths 6 to 18, y-axis in units of 10^5). B) Human chromosome X (lengths 10 to 20, y-axis in units of 10^7).]

Distribution of ancestral and descendant unique oligonucleotides by length. The solid line denotes the sum of the two distributions, the dotted line the ancestral oligonucleotides, and the dashed line the descendants. A) Results of simulation for genomes of size 1 Mb. B) Real data for human chromosome X.

Distribution of ancestors and descendants of various lengths.

For each position we can define the length L at which the oligonucleotide starting at this position becomes unique. All oligonucleotides starting at this position that are longer than L are also unique. There are therefore two types of unique oligonucleotides: those that do not contain a shorter unique oligonucleotide, which we call ancestors, and those that do, which we call descendants. It is enough to keep, for each position, the length at which the oligonucleotide first becomes unique in order to have complete information about the distribution of unique oligonucleotides in a particular sequence region.

Sequence region and ancestors for each position (-1 if not known):

a t g c a c t a g c t a g c t a g t c g ...
12,14,-1,-1,15,10,10,11,10,14,-1,-1,13,15,12,-1,-1,-1,12,16...

[Schematic: candidate probes P1, P2, ..., Pi marked along the sequence.]

Design of probes using unique oligonucleotides positional information.

For each potential probe Pi, a vector can be defined that holds the number of unique oligonucleotides of each length (both ancestors and descendants): Vi = {0,0,0,0,0,0,0,2,3,4,2,3,5,6,7}.

A gold standard vector can be defined as G = {0,0,0,0,0,0,n1,n1-1,n1-2,n1-3,…}.

Euclidean distance is a reliable choice of measure for the distance between Vi and G:

D(V_i, G) = \sqrt{\sum_{j \in L} (V_i(j) - G(j))^2}

where L is the set of oligonucleotide lengths used. The probes with the minimal distance to the gold standard were chosen.
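The distance formula translates directly into code. The sketch below assumes each candidate probe is already summarized as a list of counts, one entry per oligonucleotide length; the names `distance_to_reference` and `pick_best` and the toy vectors are hypothetical.

```python
import math


def distance_to_reference(v, g):
    """Euclidean distance between a probe vector v and the reference vector g."""
    if len(v) != len(g):
        raise ValueError("vectors must cover the same set of lengths")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, g)))


def pick_best(candidates, reference):
    """Return the name of the candidate whose vector is closest to the reference."""
    return min(candidates, key=lambda name: distance_to_reference(candidates[name], reference))


# Toy vectors, not taken from the real array design.
reference = [0, 4, 3]
candidates = {"P1": [0, 1, 3], "P2": [0, 4, 2]}
print(pick_best(candidates, reference))
```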

• Enumerating oligonucleotides
– Binary arithmetic: 00 stands for A, 01 for T, 10 for C and 11 for G.
  T G A T → binary 01 11 00 01 → decimal 113 (142 if the bit string is read in reverse order, least significant bit first).
– Enumeration is complete, dense, and nonredundant.
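A minimal sketch of this enumeration, using the two-bit codes given above. The bit order is an assumption: here the first base goes into the most significant bits, so TGAT maps to 113. The function names `encode` and `decode` are hypothetical.

```python
from itertools import product

CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BASES = "ATCG"  # index in this string equals the two-bit code


def encode(oligo):
    """Map an oligonucleotide to an integer, two bits per base."""
    value = 0
    for base in oligo.upper():
        value = (value << 2) | CODE[base]
    return value


def decode(value, length):
    """Inverse of encode for an oligonucleotide of known length."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


print(encode("TGAT"), decode(113, 4))

# Complete, dense, nonredundant: the 4**4 possible 4-mers map onto 0..255 exactly once.
all_codes = sorted(encode("".join(p)) for p in product(BASES, repeat=4))
```

Because every integer from 0 to 4^n - 1 corresponds to exactly one oligonucleotide, the code can be used directly as an index into a counting table.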

• Counting oligonucleotides
– Direct counting.
– The complete space of possible oligonucleotides grows as 4^n.
– The memory size of current computers allows handling oligonucleotides of length up to 16 on a PC and up to 18 on Sun Solaris. With algorithm enhancements we can go up to 24 (but there is no need).
– The best resolution is provided by length 18 for the human genome and by lengths 12-14 for most bacterial genomes.

Oligo length   Space without coding   Space with coding
 4                           256                   32
 5                         1,024                  128
 6                         4,096                  512
 7                        16,384                2,048
 8                        65,536                8,192
 9                       262,144               32,768
10                     1,048,576              131,072
11                     4,194,304              524,288
12                    16,777,216            2,097,152
13                    67,108,864            8,388,608
14                   268,435,456           33,554,432
15                 1,073,741,824          134,217,728
16                 4,294,967,296          536,870,912
17                17,179,869,184        2,147,483,648
18                68,719,476,736        8,589,934,592
19               274,877,906,944       34,359,738,368
20             1,099,511,627,776      137,438,953,472

Legend (rows were color-coded on the slide):
- computable on desktop
- computable on workstation with big memory
- computable on workstation with big memory with enhanced algorithm
- hardly computable

Finding unique oligonucleotides.
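Direct counting can be sketched as one table cell per possible oligonucleotide, indexed by its two-bit code. The sketch makes several simplifying assumptions that are not from the slides: it uses one byte per cell (so the table takes 4^n bytes, more than the "space with coding" column, which is 4^n / 8), it caps counts at 2 because only "absent", "unique" and "repeated" matter, and it counts one strand only. The name `count_oligos` is hypothetical.

```python
CODE = {"A": 0, "T": 1, "C": 2, "G": 3}


def count_oligos(sequences, k):
    """Return a table of length 4**k with 0 = absent, 1 = unique, 2 = repeated."""
    table = bytearray(4 ** k)
    mask = 4 ** k - 1
    for seq in sequences:
        value, valid = 0, 0
        for base in seq.upper():
            code = CODE.get(base)
            if code is None:  # ambiguous symbol such as N: restart the window
                value, valid = 0, 0
                continue
            value = ((value << 2) | code) & mask  # rolling two-bit code of the last k bases
            valid += 1
            if valid >= k and table[value] < 2:
                table[value] += 1
    return table


table = count_oligos(["ATGCAT", "GCA"], k=2)
n_unique = sum(1 for c in table if c == 1)
n_absent = sum(1 for c in table if c == 0)
print(n_unique, n_absent)
```

Each base is visited once, so the running time is linear in the total length of the sequences, while the table size depends only on k.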

Program realization and data formats.

Storing data in Rich FASTA format

Results of the search for unique oligonucleotides are stored in "rich" FASTA format. Essentially, it is a linear record of positional information, like a regular FASTA file, but with additional coded information for each position.

0 1 0 0 1 1 0 0

Fields coded for each position: symbol; length of the first unique oligonucleotide at this position; overrepresented flag.
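The exact bit layout of a "rich" FASTA record is not given on the slide. Purely as an illustration, the sketch below assumes one byte per position split into 2 bits for the symbol, 5 bits for the length of the first unique oligonucleotide (with 0 standing for "not known"), and 1 bit for the overrepresented flag. Under that assumed split, the example byte 01001100 decodes to T, length 6, not overrepresented. The names `pack` and `unpack` are hypothetical.

```python
SYMBOLS = "ATCG"  # index equals the two-bit code: A=00, T=01, C=10, G=11


def pack(symbol, first_unique_len, overrepresented):
    """Pack one position into a byte: [2 bits symbol][5 bits length][1 bit flag].

    This field split is an assumption made for illustration only.
    """
    if not 0 <= first_unique_len < 32:
        raise ValueError("length must fit in 5 bits")
    return (SYMBOLS.index(symbol) << 6) | (first_unique_len << 1) | int(overrepresented)


def unpack(byte):
    """Return (symbol, first_unique_len, overrepresented) from a packed byte."""
    return SYMBOLS[byte >> 6], (byte >> 1) & 0b11111, bool(byte & 1)


print(format(pack("T", 6, False), "08b"), unpack(0b01001100))
```

Whatever the real layout is, the point is the same: one fixed-size record per position keeps the file linear and lets probe design read positional information without recounting oligonucleotides.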

[Flowchart of the programs. Elements shown: list of FASTA files (genomes etc.); u_find.exe / u_findm.exe for the minimal oligonucleotide length; u_find.exe / u_findm.exe for all desired oligonucleotide lengths; u_code.exe; FASTA file marked for unique oligonucleotides; u_design.exe; microarray probes.]

Design of control probes using non-existing oligonucleotides information.

Goal: a sequence that has no homology to any genome (no BLAST hits over threshold).

• Selecting nonexistent oligonucleotides
• Overlapping and merging oligonucleotides
• Choosing probes from merged sequences (probe selection by temperature and secondary structure)

Nonexistent oligonucleotides:
AATGCTAGCTA
ATGCTAGCTAC
CTAGCTACGGA
AGCTACGGAAT

Nonexistent sequence (merged):
AATGCTAGCTACGGAAT . . . . . .

Selected probe:
ATGCTAGCTACGGA

12-mers: 39,855 nonexistent out of 16,777,216 (0.24%); 640 probes selected.
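The first two steps, selecting nonexistent oligonucleotides and merging overlapping ones, can be sketched as follows. This is a simplified illustration under stated assumptions: `absent_kmers` enumerates all 4^k oligonucleotides with a Python set (practical only for small k), `merge_chain` merges the oligonucleotides greedily in the order given, and the minimum overlap of 8 bases is an arbitrary choice. All function names are hypothetical.

```python
from itertools import product


def absent_kmers(sequences, k):
    """Return all k-mers over ACGT that occur in none of the sequences."""
    present = set()
    for seq in sequences:
        s = seq.upper()
        for i in range(len(s) - k + 1):
            present.add(s[i:i + k])
    candidates = ("".join(p) for p in product("ACGT", repeat=k))
    return [kmer for kmer in candidates if kmer not in present]


def overlap(left, right, min_len):
    """Length of the longest suffix of left that is a prefix of right (>= min_len), else 0."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def merge_chain(oligos, min_overlap):
    """Greedily extend the first oligonucleotide with each following one that overlaps it."""
    merged = oligos[0]
    for oligo in oligos[1:]:
        n = overlap(merged, oligo, min_overlap)
        if n == 0:
            break
        merged += oligo[n:]
    return merged


oligos = ["AATGCTAGCTA", "ATGCTAGCTAC", "CTAGCTACGGA", "AGCTACGGAAT"]
print(merge_chain(oligos, min_overlap=8))
print(len(absent_kmers(["ACGT"], 2)))
```

A merged sequence can contain intermediate oligonucleotides that were not in the input list, so a candidate control probe would still need to be checked against the genomes before the temperature and secondary-structure filters are applied.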

Properties of proposed probe design method.

• Finding unique and nonexistent oligonucleotides takes computational time linear in the size of the genomes used.
• Once the experimental system is represented in "rich" FASTA format, design of new probes becomes extremely fast and can be repeated as often as needed to create probes for a new set of CDS or genomic regions.
• Probes selected using unique oligonucleotides automatically reduce the presence of hairpins in RNA secondary structure.
• The method can be applied to experimental systems with multiple unrelated genomes (genomes can be as far from each other as eukaryotes and prokaryotes).
• The method is efficient for control probe selection.
• Problem: the method does not provide a robust estimate of sequence homology between a probe and the rest of the genomes; at the same time, the selected probes have the lowest possible homology to the rest of the genome.
• The method provides valuable statistics about oligonucleotide usage in particular genomes and genome sets.

2,997 70-mer oligos, 3,005 genes in all (with duplicates)

640 reference controls

Legionella in Microbial Communities.

• Biofilms are not just a bunch of microbes; they are a special environment, protected from the harsh outside by a special polysaccharide layer that is produced by other microbes in the community.
• Microbial communities in biofilms have shared metabolic and regulatory networks.
• Biofilms provide an excellent environment for horizontal gene transfer.
• Since biofilms prevent antibiotics and other biocides from reaching the pathogens, biofilms are a significant reservoir of health-hazardous pathogens.
• Legionella can survive in biofilms but cannot form them by itself, only as part of the microbial community.

Similar applications and potential use of proposed method.

• Evolutionary studies (traces of ancient events?).
  Hsieh et al., Minimal model for genome evolution and growth. Phys Rev Lett. 2003 Jan 10;90(1):018101.
  Jordan et al., A universal trend of amino acid gain and loss in protein evolution. Nature. 2005 Feb 10;433(7026):633-8. Epub 2005 Jan 19.

• Use in organism and sequence identification: metagenomics.
  Metagenomics: "the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species." (Chen and Pachter, University of California, Berkeley)
  Bailey & Ulrich, Molecular profiling approaches for identifying novel biomarkers. Expert Opin Drug Saf. 2004 Mar;3(2):137-51. Review.
  Palmer et al., Rapid quantitative profiling of complex microbial populations. Nucleic Acids Res. 2006 Jan 10;34(1):e5.

Annotation and Web Pages.

[Architecture diagram. Client side: clickable interactive interface, browser (Java, HTML). Server side: local databases, local methods, update engine, SQL engine, memory engine. Links between the two sides: request, data transfer, supervision. Also shown: remote databases, remote methods.]

ADAPTERS LAYER: converting and performing requests, formatting output. UNIX web server, Perl scripts, Java, C.

Solved technical problems:

• Setting up the server side
• Setting up MySQL server and services
• Tools for importing and parsing external databases
  – scripts to process flat files (Perl, MySQL):
    • extracting related information
    • formatting into SQL database
    • formatting into static HTML
  – scripts to parse remote databases (Perl, Java, MySQL):
    • extracting related information
    • formatting into SQL database
    • formatting into static HTML
  – update engine (under construction)
• Web page development (HTML, JavaScript, CSS)
  – Testing with Explorer, Firefox, Opera, Safari.

[Color legend on the slide: ongoing; solved.]

Sources of Information

[Diagram. Types of source: publicly available data, results of computations, proprietary data. Kinds of information: sequence/genome, functional domains, literature, function and annotation, pathways, categories, GO. Resources shown: PFAM, PDB, PRODOM, PROSITE, TRANSFAC, SMART, GeneNet, MetaCyc, MEDLINE, NCBI, EMBL, TIGR, individual genomes.]

Current list of integrated databases

Parsed for Legionella-related information, organized and stored locally:
– NCBI
– EMBL
– UniProt
– InterPro
– PIRSF (PIR superfamily/family)
– Pfam
– PRINTS
– PRODOM
– PROSITE
– HSSP
– MedLine/PubMed
– MetaCyc
– NMPDR/FIG
– KEGG

WEB site scheme

[Diagram. Core: web server and scripts, SQL database, integrated tools. Content categories: precompiled, semi-dynamic, dynamic (by user requests). Components shown: static interactive tables; static interactive gene descriptions; interactive genome map; interactive data retrieval into interactive tables; interactive tools (BLAST, HMM, REMOTE_SEARCH (SMART, PROSITE etc.)); search history.]

Legionella Genome Browser. Interactive.

You can:
• Choose scale and region
• Follow links to tables and annotation data
• Choose annotation tracks to display and track parameters
• Choose various color schemes
• Add custom annotation tracks

Interactive tables

• Row operations: select/unselect, show/hide rows
• Column (field) operations: show/hide column
• Sorting columns

Interactive Tables by Demands.

Snapshots of the NMPDR annotation pages

[Screenshot: icmP, icmR. Region comparisons in other genomes by sequence homology: L. pn Phil1, Coxiella burnetii.]

Visualization of the gene expression in NMPDR system

[Screenshot labels: pathway, reactions, Legionella gene info, expression ratios.]

Excel table to interactive table

Study gene expression:
1. during intracellular growth and under various environmental stresses
2. in axenically- and protozoan-grown Legionella
3. in Legionella-containing biofilms

4. Develop models (gene networks and reporter genes) that describe relevant patterns of gene expression
(gene networks = expressed genes + their regulators)

Expressed genes: Original Gene Function Assignments

[Flowchart. Elements shown: ~3000 genes; ORF finders, BLAST; KEGG pathways plus missing members; Gene Ontology, MetaCyc; LegCyc: 181 pathways. Numbers shown: 560 assignments, 678 assignments, 72%.]

Confirm absence of these genes:
1. Use lower stringency search
2. BLAST expected genes to Legionella genome sequence
3. Search for probable motif combinations

histidine biosynthesis

LegCyc

Legionella metabolic pathway overview (a portion)

Search for transcription factor binding sites

Clusters of co-expressed genes

Predicted operons

Experimental confirmation of the predicted promoters. Transcription start sites.

Use of confirmed motifs to identify additional co-regulated genes.

[Figure: gel image, lanes 1 to 8; labels: lvh, lvrA.]

• Promoter manipulations

• Co-expressed gene sets

• Regulatory networks

TF site prediction (in silico).

Microbiology Department, Prof. Shuman
• Gene knockout
• Phenotypic analysis

Columbia Genome Center, Jing Ju lab, S. Kalachikov, S. Pompu
Gene expression microarrays
• Clusters of coexpressed genes
• Regulatory genes knockout results (expression)

Molecular biology methods
Gene expression microarrays
• RT-PCR
• Transcription factors
• Promoter verification

Computational Analysis, Morozov Pavel, Morozova Irina
• operon structures
• putative promoters and transcriptional regulation sites
• detailed gene annotation
• regulatory network reconstruction