Department of Biostatistics
Laura Kelly Vaughan, Ph.D., Assistant Professor
Section on Statistical Genetics
[email protected]
Data Mining: Functional Statistical
Genetics & Bioinformatics
NCBI (National Center for Biotechnology Information)
Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be discerned.
http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html
Integrative Data Analysis
Genetic studies tend to focus on one data source
  Genetic variation
  RNA levels
  Blood biochemistry
This fails to utilize the information contained in the connections among these variables…
Central Dogma of Molecular Biology
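DNA → RNA → Protein → Phenotype
[Figure labels: Replication; TXN (transcription); TSN (translation); PTM (post-translational modification). Fields shown: Genetics, Structural Genomics, Functional Genomics (Transcriptomics), Proteomics, Metabolomics, Phenomics]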
Different sources of annotation data
  Gene Ontology
  Pathways/Networks
  Protein/protein interactions
  Literature
  Functional annotations
  Expression
  Cross species
  Cellular localization
  Methylation
  ChIP
  Sequence similarity
  Promoter & Regulatory Network
  Protein domains
Gene Ontology
www.geneontology.org
The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in a species-independent manner.
  biological processes - series of events accomplished by one or more ordered assemblies of molecular functions
  cellular components - parts of the cell
  molecular functions - activities, such as catalytic or binding activities, that occur at the molecular level
http://www.yeastgenome.org/help/images/cytokinesisDAGrels.jpg
Example of a GO annotation
What is a Pathway?
Physical and functional interactions between genes and gene products
  Metabolic pathways
  Kinase based signaling cascades
  Transcriptional signaling pathways
TNF Signaling
[Figure: TNF signaling pathway diagram. Labeled components include TNF, TNFR1, TNFR2, TRADD, FADD, RAIDD, RIP, TRAF2, TRAF3, I-TRAF, SODD, MADD, cIAP, NIK, IKKs, IκBs (and IκB degradation), NF-κB, MEKKs, TAK1, JNKK1, JNK1, p38, ERKs, ATFs, Elk1, c-Jun, c-Fos, ceramides, BID/tBID, CytoC, APAF1, and caspases 1, 2, 3, 6, 7, 8 and 9. Labeled outcomes: Apoptosis; Gene Expression and Cell Survival. © 2007-2009 SABiosciences.com]
What is a Network?
Graphical representation of relationships between genes, gene products, or other objects
Formed with information such as
  Genes in interacting pathways
  Gene products that share protein-protein interactions
  Gene product protein-nucleotide relationships
  Regulatory relationships
  Metabolic interactions
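As a minimal sketch of how such a network can be assembled, the snippet below links two genes whenever they share a pathway. The pathway names and gene memberships are hypothetical placeholders, not curated annotations, and real tools combine several of the information sources listed above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical pathway memberships, for illustration only.
PATHWAYS = {
    "pathway_1": ["geneA", "geneB", "geneC"],
    "pathway_2": ["geneC", "geneD"],
    "pathway_3": ["geneE"],
}

def build_network(pathways):
    """Return an adjacency map linking genes that share at least one pathway."""
    adjacency = defaultdict(set)
    for members in pathways.values():
        for gene in members:
            adjacency[gene]  # touch the key so isolated genes still appear as nodes
        for g1, g2 in combinations(sorted(set(members)), 2):
            adjacency[g1].add(g2)
            adjacency[g2].add(g1)
    return dict(adjacency)

network = build_network(PATHWAYS)
for gene in sorted(network):
    print(gene, "->", sorted(network[gene]))
```

An adjacency map keeps the structure easy to extend: other edge types (protein-protein interactions, regulatory relationships) could be added as further passes over the same dictionary.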
Metabolic Disease Network
Lee D. et al. PNAS 2008;105:9880-9885 ©2008 by National Academy of Sciences
Analysis tools
Numerous methods have been developed to aid in the interpretation of biological experiments
2 basic categories
  Pre-analysis methods, where the raw data is grouped together & the groups are tested
    Dimension reduction
  Post-analysis methods, where significant or interesting results are grouped together to identify trends
Before you start…
There are many methods available for integrative data analysis
Before you choose one, you must properly define the questions you are trying to answer…
What is your hypothesis?
DBA ~10 mins
Methods
Unsupervised, or data based methods
  Utilizes all the data to identify trends
  Hypothesis generating
Supervised, or prior information based
  Requires the user to provide a ‘training set’ of genes
  Hypothesis testing
Gene Set Analysis
A test statistic is calculated to measure the deviation of gene-set expression measurements from the null hypothesis of no association with the phenotype
The statistical significance (P-value) for each gene set is calculated based on permutation of samples
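The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the gene-set statistic (mean absolute difference in group means) is one simple choice among many, and the expression values, gene names and labels are made up.

```python
import random
from statistics import mean

def gene_set_statistic(expr, labels, gene_set):
    """Mean absolute difference in group means over the genes in the set.

    This statistic is an assumption chosen for simplicity.
    """
    diffs = []
    for gene in gene_set:
        case = [v for v, lab in zip(expr[gene], labels) if lab == 1]
        ctrl = [v for v, lab in zip(expr[gene], labels) if lab == 0]
        diffs.append(abs(mean(case) - mean(ctrl)))
    return mean(diffs)

def permutation_p_value(expr, labels, gene_set, n_perm=1000, seed=0):
    """Empirical P-value obtained by permuting the sample labels."""
    rng = random.Random(seed)
    observed = gene_set_statistic(expr, labels, gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if gene_set_statistic(expr, shuffled, gene_set) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Made-up data: 6 samples, label 1 = phenotype present, 0 = absent.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "g1": [5.1, 4.9, 5.3, 1.0, 1.2, 0.9],
    "g2": [4.0, 4.2, 3.9, 2.0, 2.1, 1.8],
    "g3": [1.0, 2.0, 1.5, 1.4, 1.1, 1.9],
}
p_assoc = permutation_p_value(expr, labels, ["g1", "g2"])
p_null = permutation_p_value(expr, labels, ["g3"])
print("associated set:", p_assoc, "unassociated set:", p_null)
```

Permuting samples rather than genes keeps the gene-gene correlation structure intact within each permuted data set.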
Types of enrichment methods
Class 1 - Singular enrichment (SEA)
  P-value calculated on each term from pre-selected list & enrichment terms are listed
Class 2 - Gene set enrichment (GSEA)
  All genes (without pre-selection) are included
  No need to select list
  Experimental values integrated into P-value calculations
  Pairwise comparisons (e.g., disease vs. control)
  Most appropriate for expression data
Class 3 - Modular enrichment (MEA)
  Predetermined list, with term-term or gene-gene relationships included in enrichment P-value calculation
  Closest to nature of biological data structure
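For Class 1 (SEA), a common way to obtain the per-term P-value is a one-sided hypergeometric test on the overlap between the pre-selected list and the genes annotated to the term. The slides do not say which test a given tool uses, so treat this as an assumed, typical choice; the counts in the usage line are made up.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_term, n_list, n_overlap):
    """P(X >= n_overlap) when n_list genes are drawn without replacement
    from n_background genes, of which n_term carry the annotation term."""
    upper = min(n_term, n_list)
    total = comb(n_background, n_list)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20000 background genes, 200 annotated to the term,
# 100 genes in the pre-selected list, 8 of them annotated to the term.
print(hypergeom_enrichment_p(20000, 200, 100, 8))
```

The test is run once per term, so the resulting P-values still need a multiple-testing adjustment.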
DAVID
Provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes
Extensive annotation database
  Includes both pathways and GO
SEA and MEA algorithms
Visualization tools
http://david.abcc.ncifcrf.gov/
DAVID and LVH gene expression
GO clustering of significant genes between different mouse treatment groups
Stansfield et al 2009 Cardiopulmonary Support and Physiology
Babelomics Suite
Suite of web tools for the functional profiling of genome scale experiments
Multiple annotation sources
  Pathways, GO, regulation, text mining, interactions
Allows for functional enrichment
Several gene set methods
  Mostly SEA methods
http://babelomics.bioinfo.cipf.es/
Babelomics and thyroid carcinoma
Montero-Conde et al 2008 Oncogene
Identified 1031 genes with differential expression
Enriched pathways included
  MAP kinase
  TGF-β
  Focal adhesion
  Cell motility
  Activation of actin polymerization
  Cell cycle
Identified 30 genes that predict prognosis with 95% accuracy
GSEA
Computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).
http://www.broad.mit.edu/gsea/
GSEA: Steps in the Methodology
  Define a Gene Set from prior knowledge
  Order the genes by correlation with phenotype
  Estimate the gene set’s Enrichment Score
  Assess Statistical Significance using permutation tests
  Adjust for Multiple Hypothesis
Subramanian et al., PNAS, 2005
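The enrichment score step can be sketched as a weighted running sum over the ranked gene list, following the general form described by Subramanian et al. This is a simplified illustration, not the Broad implementation: gene names and correlation values are made up, and the permutation and multiple-testing steps are left out.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score.

    ranked_genes: genes ordered by decreasing correlation with the phenotype.
    scores: the matching correlation values, in the same order.
    gene_set: the a priori defined set of genes.
    Returns the maximum deviation from zero of the running sum.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    hit_weights = [abs(s) ** weight if hit else 0.0
                   for s, hit in zip(scores, in_set)]
    norm = sum(hit_weights)
    if norm == 0:
        raise ValueError("all genes in the set have a zero score")
    running = 0.0
    best = 0.0
    for w, hit in zip(hit_weights, in_set):
        # Step up at a gene in the set, step down otherwise.
        running += w / norm if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Made-up ranked list, already sorted by decreasing correlation.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
corr = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
print(enrichment_score(genes, corr, {"g1", "g2"}))
print(enrichment_score(genes, corr, {"g5", "g6"}))
```

A set concentrated at the top of the list gives a positive score and a set concentrated at the bottom gives a negative one; significance would then come from recomputing the score after permuting the phenotype labels.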
Biological pathways involved in chemotherapy response in breast cancer
Tordai et al 2008 Breast Cancer Research
GSEA comparing chemotherapy responders and non-responders among ER+ breast cancer tumors
Of >850 gene sets, 4 were significant
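With more than 850 gene sets tested, some adjustment for multiple hypotheses is needed before calling any set significant. GSEA reports its own FDR estimate; the sketch below instead uses the Benjamini-Hochberg procedure, a swapped-in and widely used alternative, applied to made-up P-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical per-gene-set P-values.
raw = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(raw))
```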
Significance Analysis of Function and Expression (SAFE)
Generalization and extension of the GSEA method
2 stage permutation based approach to assess significant changes in gene expression across experimental conditions
  First computes gene-specific local statistics to test for association between gene expression and the phenotype.
  Gene-specific statistics are then used to estimate a global statistic that detects shifts in the local statistics within a gene category.
The significance of the global statistic is assessed by repeatedly permuting the response values.
SAFE implements a rank-based global statistic that enables a better use of marginally significant genes than those based on a p-value cutoff.
http://www.bioconductor.org/packages/bioc/1.6/src/contrib/html/safe.html
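The two stages can be sketched as below. This is not the Bioconductor safe package: the Welch-type t statistic as the local statistic, the rank sum of absolute local statistics as the global statistic, the lack of tie handling, and all data values are assumptions made for illustration.

```python
import random
from statistics import mean, variance

def local_stats(expr, response):
    """Stage 1: a Welch-type t statistic per gene (assumed local statistic)."""
    stats = {}
    for gene, values in expr.items():
        a = [v for v, r in zip(values, response) if r == 1]
        b = [v for v, r in zip(values, response) if r == 0]
        se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
        stats[gene] = (mean(a) - mean(b)) / se if se > 0 else 0.0
    return stats

def global_stat(stats, category):
    """Stage 2: rank sum of |local statistic| for genes in the category.

    Rank 1 is the smallest absolute statistic; ties are not given midranks.
    """
    ordered = sorted(stats, key=lambda g: abs(stats[g]))
    ranks = {g: i + 1 for i, g in enumerate(ordered)}
    return sum(ranks[g] for g in category)

def safe_p_value(expr, response, category, n_perm=500, seed=0):
    """Empirical P-value from repeatedly permuting the response values."""
    rng = random.Random(seed)
    observed = global_stat(local_stats(expr, response), category)
    permuted = list(response)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        if global_stat(local_stats(expr, permuted), category) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up expression data: 8 samples, response 1 vs 0.
response = [1, 1, 1, 1, 0, 0, 0, 0]
expr = {
    "g1": [5.2, 4.8, 5.5, 5.0, 2.1, 1.9, 2.4, 2.0],
    "g2": [3.9, 4.3, 4.1, 4.4, 2.8, 3.1, 2.7, 3.0],
    "g3": [1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.1],
    "g4": [2.5, 2.2, 2.9, 2.4, 2.6, 2.3, 2.8, 2.5],
    "g5": [0.4, 0.9, 0.6, 0.7, 0.5, 0.8, 0.6, 0.9],
    "g6": [3.3, 3.0, 3.6, 3.1, 3.2, 3.4, 2.9, 3.5],
}
print(safe_p_value(expr, response, {"g1", "g2"}))
```

Because the global statistic works on ranks, a gene contributes even when its own local statistic would not pass a P-value cutoff.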
Dietary resveratrol and aging in mice
SAFE analysis based on GO annotations
Overlap of GO classes significantly affected by caloric restriction and by low dose resveratrol
Barger et al 2008 PLoS One
Supervised Analysis: Endeavour
Web based prioritization of candidate genes
Infers models for the training set in each data source
Applies each model to the candidate genes to rank them against the profile of the training set
Merges rankings from each data source to give global ranking of genes
http://homes.esat.kuleuven.be/~bioiuser/endeavour/endeavour.php
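The rank-merging idea can be sketched as follows. This is not Endeavour's algorithm: averaging rank ratios is a deliberately simple stand-in for its fusion step, the per-source similarity scores are assumed to come from models already fitted to the training set, and all source names, candidate names and scores are hypothetical.

```python
def rank_ratios(scores):
    """Convert similarity scores (higher = closer to the training set)
    into rank ratios in (0, 1]; the best candidate gets the smallest ratio."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {gene: (i + 1) / n for i, gene in enumerate(ordered)}

def merge_rankings(per_source_scores):
    """Average the rank ratios across data sources and return candidates
    ordered from best to worst. Sources lacking a gene are skipped for it."""
    ratios = [rank_ratios(s) for s in per_source_scores.values()]
    genes = set().union(*(r.keys() for r in ratios))
    merged = {}
    for gene in genes:
        vals = [r[gene] for r in ratios if gene in r]
        merged[gene] = sum(vals) / len(vals)
    return sorted(merged, key=lambda g: (merged[g], g))

# Hypothetical similarity of each candidate to the training-set model.
scores = {
    "expression":   {"cand1": 0.9, "cand2": 0.4, "cand3": 0.1, "cand4": 0.5},
    "literature":   {"cand1": 0.7, "cand2": 0.8, "cand3": 0.2},
    "interactions": {"cand1": 0.6, "cand2": 0.3, "cand3": 0.5},
}
print(merge_rankings(scores))
```

Working with rank ratios rather than raw scores lets sources with different scales, and with different numbers of scored candidates, be combined.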
Copyright restrictions may apply.
Tranchevent, L.-C. et al. Nucl. Acids Res. 2008 36:W377-W384; doi:10.1093/nar/gkn325
ENDEAVOUR: the algorithm behind the wizard
Genetic disorder prioritization using Endeavour
Network Analysis
Dynamic representation of cellular process through the incorporation of annotation & experimental data
  Structures are not fixed and change with context
Many methods available…
Suderman & Hallett 2007 Bioinformatics
Ingenuity IPA
Pathway Analysis of WTCCC Type 2 Hypertension GWAs
No single SNP was significant at the genome wide level
High degree of relationship between pathways suggests multiple related mechanisms
Large number of low penetrance risk alleles
Pathway analysis with MetaCore
Torkamani et al. 2008 Genomics
English, S. B. et al. Bioinformatics 2007 23:2910-2917; doi:10.1093/bioinformatics/btm483
The next step: Translational Science
Integration of 49 genome wide experiments for the prediction of previously unknown obesity related genes
  Greatly outperforms individual experiments
References
Song & Black 2008. BMC Bioinformatics 9:502
Huang et al 2009. NAR 37(1):1-13
Chen et al 2008. Nature 452(27)429-435
Dinu et al 2007. Journal of Biomedical Info 40:75-760
Al-Shahrour et al. NAR 36:W341-346
Barry et al 2005. Bioinformatics 21(9):1943-1949
Huang et al. Nature Protocols 4(1):44-57
Tranchevent et al 2008. NAR 36:W377-384
Mehta et al 2006. Physiol Genomics 28:24-32
Suderman & Hallett. Bioinformatics 23(20):2651-2659
Dinu et al 2008. Briefings in Bioinformatics
Curtis et al 2005. Trends in Biotech 23(8)
Price and Shmulevich 2007. Current Op in Biotech 18:365-370
Zhang et al 2008. BMC Systems Bio 2:5
Werner 2008. Current Op in Biotech 19:50-54
Lui et al 2007. BMC Bioinformatics 8:431
Goeman & Bühlmann 2007. Bioinformatics 23(8):980-987
Rivals et al 2007. Bioinformatics 23(4):401-407
Nam & Kim 2008. Briefings in Bioinformatics 9(3) 89-97