BI 83201: The Literature of Computational Genomics
Instructor: Prof. Jeffrey Chuang
Meeting Time: Fridays 10:30-11:45, Higgins 465
Requirements: Read and discuss 1-2 papers per week. Grading will be based on participation. Attendance at all sessions is mandatory.
Course website: bioinformatics.bc.edu/chuanglab/courses.htm
All papers will be available online at least 1 week before discussion. Students will be assigned sections/figures, for which they will be expected to lead the discussion, including asking other students questions.
Office Hours by arrangement: Contact Jeff at [email protected], Phone: 2-0804, Higgins 444B (soon to be moving to Higgins 420).
Changing perspectives in yeast research nearly a decade after the
genome sequence
Kara Dolinski and David Botstein
BI 83201 : Literature of Computational Genomics
January 27, 2006
I. Introduction
The yeast Saccharomyces cerevisiae was the first eukaryote to have its genome sequenced (1996).
Its genome is 12 million base pairs long (the human genome is 3 billion), spread over 16 chromosomes.
It was chosen because it has an extensive history as a model organism, along with the worm C. elegans and the fly D. melanogaster.
Major Benefits of Sequencing the Yeast Genome
Ability to identify clones via sequencing, rather than genetic or physical mapping methods.
Creation of yeast strains, each with a deletion of one gene, for every gene in the genome.
Whole genome expression assays.
A “grand unification,” showing that protein sequence similarity persists between yeast, mouse, human, fly, and worm; i.e., functional similarity often also means sequence similarity.
From the parts list to the system level: Goals of post-genome-sequence yeast research
Understand and annotate every functional feature in the genome.
Understand the interactions of every feature: “systems biology.”
A central goal of yeast research remains the determination of the biological role of every sequence feature in the yeast genome. The most remarkable change has been the shift in perspective from focus on individual genes and functionalities to a more global view of how the cellular networks and systems interact and function together to produce the highly evolved organism we see today.
Genes and their biological roles
1995: The number of characterized genes was 1000-2000.
2006: 5773 genes in the genome. 4299 are characterized.
Annotation of individual functions remains challenging.
Resource             Data type                     Reference                                  URL
SGD                  Several                       Ball et al. 2001                           http://www.yeastgenome.org
CYGD/MIPS            Several                       Guldener et al. 2005                       http://mips.gsf.de/genre/proj/yeast/
bioGRID              Genetic/physical interaction  Breitkreutz et al. 2003                    http://biodata.mshri.on.ca/yeast_grid/
BIND                 Genetic/physical interaction  Bader et al. 2003                          http://www.blueprint.org/bind/bind.php
DIP                  Physical interaction          Xenarios et al. 2002                       http://dip.doe-mbi.ucla.edu/dip/Main.cgi
MINT                 Physical interaction          Zanzoni et al. 2002                        http://160.80.34.4/mint/
IntAct               Physical interaction          Hermjakob et al. 2004b                     http://www.ebi.ac.uk/intact/index.html
Deletion Consortium  Phenotype analysis            Giaever et al. 2002; Winzeler et al. 1999  http://www-sequence.stanford.edu/group/yeast_deletion_project/data_sets.html
GEO                  Microarray                    Edgar et al. 2002                          http://www.ncbi.nlm.nih.gov/geo/
Array Express        Microarray                    Brazma et al. 2003                         http://www.ebi.ac.uk/arrayexpress/
YMGV                 Microarray                    Marc et al. 2001                           http://www.transcriptome.ens.fr/ymgv/
SMD                  Microarray                    Gollub et al. 2003                         http://smd.stanford.edu/
List of the major sources of yeast functional genomics data; in addition to the main SGD site, yeast genome data are also distributed via SGD Lite (http://sgdlite.princeton.edu), a lightweight yeast genome database, which is built from GMOD components and can be downloaded and installed locally.
Gene expression technology and the emergence of system-level biology
Two major expression technologies developed:
SAGE (Serial Analysis of Gene Expression)
mRNA Microarrays
SAGE: Serial Analysis of Gene Expression
Figure 1. Yeast genome microarray. The actual size of the microarray is 18 mm by 18 mm.
DeRisi et al., Science 24 October 1997: Vol. 278, no. 5338, pp. 680-686
Example of an mRNA Expression Microarray
Defining functional or regulatory subsystems, or “modules”
Study all the genes that respond to certain stresses, e.g. temperature change, starvation, radiation.
Study genes that are active in “natural” behaviors: cell cycle, sporulation, pheromone response.
Identify genes that are often co-expressed and/or co-regulated, such as ribosomal genes.
(C) Seven members of a class of genes marked by early induction with a peak in mRNA levels at 18.5 hours. Each of these genes contains STRE motif repeats in its upstream promoter region.
Science 24 October 1997:Vol. 278. no. 5338, pp. 680 - 686
Distinct temporal patterns of induction or repression help to group genes that share regulatory properties.
It is quite rare for genes to have unchanging expression levels across different experiments; for example, expression of the yeast actin (ACT1) gene, which was traditionally used as a control in Northern blots to ensure that equivalent levels of RNA were loaded in each well, changes significantly in several diverse types of microarray experiments.
Expression Levels Are Highly Condition Dependent
Analysis and Display of Genome-scale Data
How can such a vast amount of expression data be analyzed, managed, and presented?
Clustering algorithms group genes with similar expression profiles over different experiments.
Eisen et al. (1998) PNAS 95:14863
Clustering of Gene Expression Profiles
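To make "similar expression profiles" concrete, here is a minimal sketch of correlation-based, average-linkage clustering, in the spirit of the approach cited above. The gene names, expression values, and the 0.8 similarity cutoff are hypothetical, and the greedy loop is written for readability rather than for genome-scale data.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles, min_similarity):
    """Repeatedly merge the two most similar clusters until the best
    average pairwise correlation falls below min_similarity."""
    clusters = [[gene] for gene in profiles]

    def similarity(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        if similarity(clusters[i], clusters[j]) < min_similarity:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]

# Hypothetical log-ratio profiles over four time points.
profiles = {
    "geneA": [0.1, 0.9, 2.0, 2.8],   # induced over time
    "geneB": [0.0, 1.1, 1.9, 3.1],   # induced over time
    "geneC": [2.5, 1.4, 0.6, -0.2],  # repressed over time
    "geneD": [3.0, 2.1, 0.9, 0.1],   # repressed over time
}
clusters = average_linkage(profiles, min_similarity=0.8)
print(sorted(clusters))
```

The induced genes and the repressed genes end up in separate groups because their profiles are anti-correlated with each other.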
Gene Ontology
A functional annotation system to allow one to search for biases in clusters of genes.
Broad terms are the parents to more specific terms.
Consistent annotation system across species.
A Clustered Group of Genes and Its Functional Annotation
The Gene Ontology allows one to assess the statistical significance of a cluster's bias toward particular functional categories.
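One common way to quantify such a bias is a hypergeometric tail probability: given how many genes in the genome carry a term, how surprising is it to see at least this many in the cluster? The sketch below assumes that test. The genome size of 5773 comes from the slides above, while the term and cluster counts are hypothetical.

```python
from math import comb

def hypergeom_pvalue(genome_size, annotated_in_genome, cluster_size, annotated_in_cluster):
    """Probability of drawing at least `annotated_in_cluster` annotated genes
    when `cluster_size` genes are sampled without replacement from the genome."""
    total = comb(genome_size, cluster_size)
    upper = min(annotated_in_genome, cluster_size)
    tail = sum(
        comb(annotated_in_genome, k)
        * comb(genome_size - annotated_in_genome, cluster_size - k)
        for k in range(annotated_in_cluster, upper + 1)
    )
    return tail / total

# Hypothetical example: a term annotated to 150 of 5773 genes,
# found on 8 genes in a cluster of 20.
p = hypergeom_pvalue(5773, 150, 20, 8)
print(f"p-value: {p:.3g}")
```

In practice many terms are tested per cluster, so a multiple-testing correction would normally be applied on top of this raw p-value.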
Insights into the global transcriptional network
Co-regulated genes should share a common transcription factor binding site.
Computational methods to search for motifs shared among co-regulated genes (REDUCE, AlignACE, MODEM).
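As a rough illustration of the idea behind motif searches (not the algorithm of REDUCE, AlignACE, or MODEM), the sketch below lists k-mers present in every promoter of a co-regulated set and reports how often each occurs in a background set. All sequences and names are hypothetical toy data.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_motifs(cluster_promoters, background_promoters, k):
    """Map each k-mer found in every cluster promoter to the fraction of
    background promoters that also contain it (lower is more specific)."""
    shared = set.intersection(*(kmers(s, k) for s in cluster_promoters.values()))
    background = list(background_promoters.values())
    return {
        kmer: sum(kmer in seq for seq in background) / len(background)
        for kmer in sorted(shared)
    }

# Hypothetical promoters of co-regulated genes and of unrelated genes.
cluster = {
    "p1": "AATGACTCGG",
    "p2": "CCTGACTCAT",
    "p3": "GTTGACTCTA",
}
background = {
    "b1": "ACGTACGTAC",
    "b2": "GGGCCCAATT",
}
motifs = candidate_motifs(cluster, background, k=6)
print(motifs)
```

Real motif finders tolerate mismatches, score both strands, and model background composition statistically; exact k-mer matching is only the simplest starting point.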
YCL030C HIS4
SCer GCAGTCGAACTGACTCTAATAGTGACTCCGGTAAATTAGTTAATTAATTGCTAAACCCATGCACAGTGACTCACGTTTTTTTATCAGTCATTCGA
SPar GCAGTCGAACTGACTCTAATAGTGACTCCGGTAAATTAGTTAATTAATTGCTAAACCCATGCACAGTGACTCATGTTTTTT-ATCAGTCATTCGA
SMik GCGGTCAAACTGACTCTAATAGTGACTCCGGTAAATTAGTTAATTAATTGCTAAACCCATGCACAGTGACTCATGCTTTCT-ATCAGTCATTCGA
SBay -TGAACGAACTGACTCTAATAGTGACTCTGGTAAATTAGTTAATTAATTTCTAAACCCATGCACAGTGACTCATGTTTTGTTATCAGTCATTCGT
          * ********************* ******************** *********************** * *** * ************

SCer TATAGAAGGTAAGAAAAGGATATGACT----ATGAACAGTAGTATACTGTGTATATAATAGATATGGAACGTTATATTCACCTCCGATGTGTGTT
SPar TAGAGAAGGTAAGAAAAGGATATGACT----ATGAACAGTAATATACTATGTATATAATAGATAAGGAACGTTATATTCACCTTGGATGTGTGTT
SMik TACAGA-GGTAAGAAAAGCGAACTACT----AAGAACAGTGGTACATGGTGTATATAATAGATAAGGAACAT-GTATTCACTTTTAATGTGAGTT
SBay TAAAGA-AGAAAGAGAGGAAGATGACTCAAAATAAATACTAGTGTATTGTGTATATAACAGAGATGGAACACTGGATTC-CACCTAATGTGTGTT
     ** ***  * **** * *   *  ***    *  ** * *  *  *   ********* *** * *****     **** *     ***** ***

SCer GTACATACATAAAAATATCATAGCACAACTGCGCTGTGTAA---TAGTAATACAATAGTTTACAAAATTTTTTTTCTGAATA---
SPar GTACATACATAAGAATATCATACTACAAGTGCGCTGTGTAA---TAGTAACATAATAGTTAACAA-----TTTTTTTGAATA---
SMik GTCTATA-AGAAGAATAGTATACCACAAGCGTGCTGTGTAACGATAATAATATAACAATTTACAAGATT-TTTTTTTGAATA---
SBay GTCCATACATAGAATTAGTATACCACAATTGCGCTGTGTAA---TAATAACATAATAGATTACAAAA---TTTTGGAAAAAAAAA
     **  *** * *  * **  ***  ****  * *********   ** *** * ** *  * ****     ****    ** *

Labeled features in the figure: GCN4, BAS1, PHO2, RAP1, GCN4, TATA
Comparative Genomic Approaches to Finding Transcription Factor Binding Sites
Alignments of 4 – 13 yeast species, to determine unusually conserved motifs.
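A minimal sketch of how "unusually conserved" stretches can be read off a multiple alignment: mark columns where all species agree, then report runs of such columns. The input is the first 30 columns of the YCL030C HIS4 alignment from these slides. The function names and the minimum run length of 6 are assumptions for illustration, and a real analysis would also need a model of the expected background conservation.

```python
def conservation_line(aligned):
    """'*' where every sequence has the same non-gap base, ' ' elsewhere."""
    columns = zip(*aligned.values())
    return "".join(
        "*" if len(set(col)) == 1 and col[0] != "-" else " " for col in columns
    )

def conserved_runs(aligned, min_len):
    """(start index, sequence) for each run of >= min_len fully conserved columns."""
    line = conservation_line(aligned)
    reference = next(iter(aligned.values()))
    runs, start = [], None
    for i, mark in enumerate(line + " "):  # trailing space closes a final run
        if mark == "*" and start is None:
            start = i
        elif mark != "*" and start is not None:
            if i - start >= min_len:
                runs.append((start, reference[start:i]))
            start = None
    return runs

# First 30 alignment columns from the four-species alignment in these slides.
aligned = {
    "SCer": "GCAGTCGAACTGACTCTAATAGTGACTCCG",
    "SPar": "GCAGTCGAACTGACTCTAATAGTGACTCCG",
    "SMik": "GCGGTCAAACTGACTCTAATAGTGACTCCG",
    "SBay": "-TGAACGAACTGACTCTAATAGTGACTCTG",
}
print(conservation_line(aligned))
print(conserved_runs(aligned, min_len=6))
```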
DNA-binding proteins are crosslinked to DNA with formaldehyde in vivo.
Isolate the chromatin. Shear DNA along with bound proteins into small fragments.
Bind antibodies specific to the DNA-binding protein to isolate the complex by precipitation. Reverse the cross-linking to release the DNA and digest the proteins.
Use PCR to amplify specific DNA sequences to see if they were precipitated with the antibody.
Chromatin Immunoprecipitation to Determine Binding Sites
Integration of Data Sources
Harbison and colleagues (2004) used a combination of experimental (ChIP-chip), comparative genomics, and motif discovery methods to identify putative DNA binding sites for >200 transcription factors in yeast.
A Bayesian network takes as input different properties of sequence elements upstream of a gene and outputs the likelihood of that gene exhibiting a particular expression pattern.
Interaction Networks
Synthetic lethal interactions
Protein-DNA interactions
Protein-protein interactions
Synthetic Lethal Interactions
Genetic interaction network representing the synthetic lethal/sick interactions determined by SGA analysis. Genes are represented as nodes, and interactions are represented as edges that connect the nodes. The network includes up to 1000 genes and 4000 interactions.
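The nodes-and-edges description above maps directly onto an undirected graph. The sketch below stores such a network as adjacency sets and lists highly connected genes ("hubs"). The gene names, interactions, and degree cutoff are hypothetical.

```python
from collections import defaultdict

def build_network(interactions):
    """Undirected adjacency sets from a list of (gene, gene) interaction pairs."""
    neighbors = defaultdict(set)
    for a, b in interactions:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors

def hubs(neighbors, min_degree):
    """Genes with at least min_degree distinct interaction partners."""
    return sorted(g for g, partners in neighbors.items() if len(partners) >= min_degree)

# Hypothetical synthetic lethal pairs.
interactions = [
    ("g1", "g2"), ("g1", "g3"), ("g1", "g4"),
    ("g2", "g3"), ("g5", "g4"),
]
network = build_network(interactions)
print({gene: len(partners) for gene, partners in sorted(network.items())})
print(hubs(network, min_degree=3))
```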
Protein-DNA interactions
Transcription factor
Binding site
Motifs in the E. coli Transcriptional Regulatory Network
Nature Genetics 31, 64 - 68 (2002)
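One well-known network motif in transcriptional networks is the feed-forward loop, in which regulator X controls Y and both X and Y control Z. The sketch below enumerates such triples in a directed edge list. The edges are hypothetical, and a real motif analysis would also compare the counts against randomized networks to judge whether the pattern is over-represented.

```python
def feed_forward_loops(edges):
    """All (x, y, z) with edges x->y, y->z and x->z among three distinct nodes."""
    targets = {}
    for regulator, target in edges:
        targets.setdefault(regulator, set()).add(target)
    loops = []
    for x, x_targets in targets.items():
        for y in x_targets:
            if y == x:
                continue  # ignore autoregulation
            for z in targets.get(y, ()):
                if z not in (x, y) and z in x_targets:
                    loops.append((x, y, z))
    return sorted(loops)

# Hypothetical transcription factor -> target edges.
edges = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("X", "W")]
loops = feed_forward_loops(edges)
print(loops)
```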
Protein-Protein interactions
Problem: Experiments are not robust.
Verification by checking for co-expression of orthologs in other species.
Check for “joint” sequence conservation of orthologs.
Other data integration methods.
Outline of the comprehensive two-hybrid analysis. We cloned almost all yeast ORFs individually as a DNA-binding domain fusion (bait) in a MATa strain and as an activation domain fusion (prey) in a MATα strain, and subsequently divided them into pools, each containing 96 clones. These bait and prey clone pools were systematically mated with each other, and the diploid cells formed were selected for the simultaneous activation of three reporter genes (ADE2, HIS3, and URA3) followed by sequence tagging to obtain ISTs.
PNAS | April 10, 2001 | vol. 98 | no. 8 | 4569-4574
Protein-Protein Interactions
Conclusions and some thoughts about the Future
Most new understanding has come from comparative genomics.
Genome-scale data has provided new goals.
Other important areas: allelic effects, gene localization, metabolism dynamics, how selection operates on networks.
Philosophy: how should large-scale data be used to generate and test hypotheses?