Identifying Potential Biomarkers for Chronic Fatigue Syndrome via Classification Model Ensemble Mining
Ben Goertzel, PhD
Biomind LLC
www.biomind.com
General Methodology
Using machine learning to find nonlinear combinations of genes,
mutations or clinical indicators that are associated with diseases,
toxic reactions, symptoms, or other phenotypic qualities
Summary
CAMDA 2006 data was analyzed using a novel classification model ensemble mining methodology.
Genetic programming and heuristic search are applied to learn ensembles of classification rules that distinguish CFS from Control (one set of classifiers based on microarray data, one set based on SNP data).
These ensembles are then statistically analyzed to identify genes, gene categories, and combinations thereof that appear to play important roles in characterizing CFS.
The results of this analysis include:
  potential microarray and SNP based diagnostic rules for CFS
  lists of SNPs, genes and gene categories that are potentially significant biomarkers for CFS (and are different from those found via simple statistical category-differentiation analysis)
Overall, our results appear compatible with a system-theoretic
view of CFS which views the disorder as a complex pattern of
activity across the organism including interlinked disturbances in
neural and endocrine systems.
Conceptual Hypothesis
Recent Related Work
Goertzel BN, Pennachin C, de Souza Coelho L, Maloney EM, Jones JF, Gurbaxani B. Allostatic load is associated with symptoms in chronic fatigue syndrome patients. Pharmacogenomics. 2006 Apr;7(3):485-94.
Goertzel BN, Pennachin C, de Souza Coelho L, Gurbaxani B, Maloney EM, Jones JF. Combinations of single nucleotide polymorphisms in neuroendocrine effector and receptor genes predict chronic fatigue syndrome. Pharmacogenomics. 2006 Apr;7(3):475-83.
Maloney EM, Gurbaxani BM, Jones JF, de Souza Coelho L, Pennachin C, Goertzel BN. Chronic fatigue syndrome and high allostatic load. Pharmacogenomics. 2006 Apr;7(3):467-73.
Gurbaxani BM, Jones JF, Goertzel BN, Maloney EM. Linear data mining the Wichita clinical matrix suggests sleep and allostatic load involvement in chronic fatigue syndrome. Pharmacogenomics. 2006 Apr;7(3):455-65.
Overview
Microarray Data Analysis
SNP Data Analysis
Analyzing Microarray Data via Supervised Categorization
Often more statistically meaningful than clustering, and allows one to do clustering of features based on whether they're used in the same categorization models.
The researcher must divide the data into two or more categories, e.g.:
  Case vs. Control
  Early vs. Late (in a time series experiment)
  Multiclass categorization: which kind of cancer?
Algorithms learn rules (models) that predict which category a microarray gene expression profile falls into, by combining expression values in an automatically learned mathematical formula.
Supervised Categorization Algorithms
Many supervised categorization algorithms exist, each with strengths and weaknesses.
Unlike with clustering, a choice may be made based on rigorous validation methodology.
  Decision trees
  Neural networks
  Logistic regression
  Support vector machines
  Genetic programming
  Etc.
Applications of Supervised Categorization Analysis
Classification models may be used as diagnostic rules.
Classification models may be studied to yield intuitive insight (particularly interesting in the case of model ensembles).
Example Classification Rule Learned via Genetic Programming
if (NM_005110 + NM_001614) / NM_002230 - 0.3 * NM_002297 > 1
then Case
else Control
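As a rough illustration, the example rule above could be evaluated for one sample along the following lines. This is only a sketch: the function name `classify`, the dictionary layout, and the expression values are assumptions made for illustration, not part of the Biomind tools.

```python
def classify(expr):
    """Apply the example rule to one sample.

    `expr` maps a transcript accession to its (normalized) expression value.
    A zero value for NM_002230 would need special handling; it is ignored here.
    """
    score = (expr["NM_005110"] + expr["NM_001614"]) / expr["NM_002230"] \
        - 0.3 * expr["NM_002297"]
    return "Case" if score > 1 else "Control"


# Hypothetical expression values, for illustration only.
sample = {"NM_005110": 1.2, "NM_001614": 0.9, "NM_002230": 1.5, "NM_002297": 0.4}
label = classify(sample)
```

The rule is a single arithmetic expression over four transcripts, which is what makes rules of this kind easy to inspect by hand.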
Using Ontologies to Make Enhanced Feature Vectors
Traditional statistical and machine learning methods characterize
individuals by expression values alone.
Gene Expression Data Enhancement
Example Classification Rule Learned via Genetic
Programming
Enhanced Feature based on GO
Enhanced Feature based on PIR
Example Ontology-Based Classification Rule in Biomind
ArrayGenius User Interface
Classification Accuracy Using Biomind Tools
Categorization Model Ensembles
Generally there will be many qualitatively different classification models distinguishing one category from another.
For diagnostics, one needs only a single good rule, though voting across a model ensemble may give better accuracy than any individual learned rule.
For gaining qualitative understanding, statistical analysis of feature usage across models in an ensemble appears to be quite valuable.
Genetic programming is a particularly useful technique here, because each learned model tends to use a relatively small number of features.
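Voting across a model ensemble can be sketched as a simple majority vote. The slides do not say which voting scheme is used, so unweighted majority voting is an assumption here, as are the three toy rules and the feature names `g1` to `g3`.

```python
from collections import Counter


def vote(models, sample):
    """Return the label predicted by the largest number of models."""
    votes = Counter(model(sample) for model in models)
    return votes.most_common(1)[0][0]


# Three hypothetical rules over made-up features g1..g3.
ensemble = [
    lambda s: "Case" if s["g1"] + s["g2"] > 1 else "Control",
    lambda s: "Case" if s["g2"] * s["g3"] > 0.5 else "Control",
    lambda s: "Case" if s["g1"] - s["g3"] > 0 else "Control",
]
sample = {"g1": 0.8, "g2": 0.6, "g3": 0.9}
prediction = vote(ensemble, sample)
```

Using an odd number of models avoids ties in the two-category case.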
Important Features Analysis
Given a classification model ensemble, one can list the features that occur in the greatest number of models.
These are NOT necessarily the same features that provide the greatest differentiation between the two categories, considered individually.
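Counting how many models use each feature might look like the sketch below. Representing each model only by the set of features it uses is an assumption, and the three toy models are made up for illustration.

```python
from collections import Counter


def feature_usage(ensemble):
    """Count, for each feature, how many models in the ensemble use it."""
    counts = Counter()
    for features in ensemble:
        counts.update(set(features))  # count each feature once per model
    return counts


# Each hypothetical model is represented by the set of features it uses.
ensemble = [
    {"NM_005110", "NM_001614", "GO:0004407"},
    {"NM_005110", "GO:0004407"},
    {"NM_002230", "GO:0004407"},
]
usage = feature_usage(ensemble)
top = usage.most_common(2)  # the two most frequently used features
```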
Classification Model Utilization Based Clustering
Associate each feature (gene, GO, etc.) with the set of successful classification models that use this feature.
Interpret these sets as meta feature vectors, one for each direct or enhanced feature in the original dataset.
Cluster these meta feature vectors.
The resulting clusters are sets of genes or GOs that have interesting interactions in the context of the classification problem at hand.
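The steps above can be sketched as follows. The slides do not name a distance measure or a clustering algorithm, so the Jaccard distance, the single-linkage grouping, and the `max_dist` cutoff used here are all assumptions chosen to keep the example small.

```python
from itertools import combinations


def meta_vectors(ensemble):
    """Map each feature to the set of model indices that use it."""
    usage = {}
    for i, features in enumerate(ensemble):
        for f in features:
            usage.setdefault(f, set()).add(i)
    return usage


def jaccard_distance(a, b):
    """1 minus the overlap between two sets of model indices."""
    return 1 - len(a & b) / len(a | b)


def cluster(usage, max_dist=0.5):
    """Single-linkage grouping: link features whose distance is within max_dist."""
    parent = {f: f for f in usage}

    def find(f):
        while parent[f] != f:
            f = parent[f]
        return f

    for a, b in combinations(usage, 2):
        if jaccard_distance(usage[a], usage[b]) <= max_dist:
            parent[find(a)] = find(b)

    groups = {}
    for f in usage:
        groups.setdefault(find(f), set()).add(f)
    return list(groups.values())


# Four hypothetical models over made-up features A..D.
ensemble = [{"A", "B"}, {"A", "B", "C"}, {"C", "D"}, {"D"}]
clusters = cluster(meta_vectors(ensemble))
```

In this toy ensemble, features `A` and `B` are always used together and so fall into one cluster, while `C` and `D` stay separate.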
Example Application: Analysis of CFS Data
Collaboration with Dr. Suzanne Vernon at CDC
Specific Aims:
  To determine which genes are consistently expressed in the peripheral blood
  To determine if genes and background knowledge could be used to classify CFS
  To determine if there was a common (perturbed) CFS pathway
  To understand differences between post-EBV fatigue and other types of CFS
Details of CFS Data
We analyzed:
  CAMDA (Wichita) dataset
  Exercise dataset
  Post-Infectious Fatigue dataset
Gene expression:
  Noise filtering: started with 30,000 genes
  Exercise dataset reduced to 1,921
  Wichita dataset reduced to 10,812
  Log-transformed and Z-score normalized
Background knowledge:
  Exercise dataset added 377 GO and 145 PIR features
  Wichita dataset added 1405 GO and 1413 PIR features
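The log-transform and Z-score step might be sketched as below. The slides do not give the log base, the axis of normalization, or the variance estimator; base 2, per-gene normalization across samples, and the sample standard deviation are assumptions here.

```python
import math
import statistics


def normalize(values):
    """Log-transform, then Z-score, one gene's expression values across samples.

    Assumes strictly positive inputs and at least two distinct values.
    """
    logged = [math.log2(v) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation
    return [(x - mean) / sd for x in logged]


raw = [2.0, 4.0, 8.0]  # hypothetical intensities for one gene
z = normalize(raw)
```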
Results
Confusion Matrix on CAMDA Data (Leave-One-Out Validation)
10 Most Important Features
Comparison of Clustering Approaches (Example Clusters)
Expression-Based Clustering
GO:0000118 histone deacetylase complex
GO:0004407 histone deacetylase activity
GO:0005667 transcription factor complex
GO:0006476 protein amino acid deacetylation
GO:0016570 histone modification
GO:0016575 histone deacetylation
GO:0019213 deacetylase activity
NM_001527 Homo sapiens histone deacetylase 2 (HDAC2), mRNA
Model Usage Based Clustering
GO:0004407 histone deacetylase activity
AC016882
GO:0042221 response to chemical substance
GO:0008628 induction of apoptosis by hormones
ENSG00000086758
AB053232
GO:0009991 response to extracellular stimulus
Interpretation of Histone Cluster
The relation between histone deacetylase and apoptosis is now well known.
It has been demonstrated that caspase-2 and -3, which are part of the caspase superfamily, the major group of proteins responsible for triggering apoptosis (Cryns and Yuan, 1998), are able to interact with and cleave the amino terminal portion of histone deacetylase 4, which accumulates in the nucleus and interacts physically with the transcription factor MEF2C, thus preventing this factor from activating anti-death signals that would allow cell survival (Paroni et al, 2004).
Coexposure of cells to HDIs in conjunction with STI571 has been observed to down-regulate proteins related to response to extracellular stimulus, such as phospho-extracellular signal-regulated kinase (ERK) (Yu et al, 2003).
Signal interaction map for cluster derived from CFS
data
13 of 17 identified genes appear in the hairball (320 nodes, 1603 edges)
Produced in MetaCore product by GeneGo
Measuring Clustering Quality
The quality of a clustering was measured as the product homogeneity
x separation. Homogeneity is calculated as the average of the
distances of all members of the cluster to their nearest
cluster-mates. Separation is simply the minimum distance from any
given member of the cluster to elements outside the cluster. These
particular definitions were used in order to minimize the influence
of the size of the cluster on its quality.
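The definitions above translate fairly directly into code. The sketch below follows them literally; the distance function is left as a parameter because the slides do not say which metric was used, and the one-dimensional points are made up for illustration.

```python
def homogeneity(cluster, dist):
    """Average distance from each member to its nearest cluster-mate."""
    nearest = [
        min(dist(a, b) for j, b in enumerate(cluster) if j != i)
        for i, a in enumerate(cluster)
    ]
    return sum(nearest) / len(nearest)


def separation(cluster, outside, dist):
    """Minimum distance from any cluster member to any element outside it."""
    return min(dist(a, o) for a in cluster for o in outside)


def quality(cluster, outside, dist):
    """Product of homogeneity and separation, as defined in the text."""
    return homogeneity(cluster, dist) * separation(cluster, outside, dist)


def absolute_difference(x, y):
    """Toy distance on one-dimensional points (an assumption)."""
    return abs(x - y)


members = [0.0, 1.0, 3.0]  # hypothetical cluster with at least two members
others = [10.0]            # hypothetical elements outside the cluster
q = quality(members, others, absolute_difference)
```

Both quantities depend only on nearest-neighbour and minimum distances, which is consistent with the stated aim of limiting the influence of cluster size.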
Model Utilization Based versus Conventional Clustering
Useful feature map: Exercise Challenge
Exercise challenge: 31 of 100 most useful features were in GO categories that map to the following pathways:
Exercise challenge: 62 of 100 most useful features were genes that map to the following pathways:
RAD (DNA repair)
mRNA processing, ribosomes, translation
Useful feature map: Wichita
Wichita: 80 of 100 most useful features were GO categories and genes that map to several pathways.
Noticeably absent are features in DNA repair initiation, transcription, gene expression and mRNA processing.
Useful feature map: Post-Infective Fatigue
Post-infective fatigue: 79 of 100 most useful features were GO categories and genes that map to mRNA processing and spliceosome formation.
spliceosome formation and mRNA processing
Interpretation of CFS Microarray Results
Features that make up these models indicate widespread disruption of cell homeostasis in both the Exercise and Wichita studies.
Features that make up the PIF classification model identify mRNA processing pathways known to be disrupted by EBV.
CFS SNP data analysis
CFS SNP data publicly available through the CAMDA 2006 challenge (see http://www.camda.duke.edu/camda06)
Genes pre-selected for SNP analysis because of possible CFS involvement
SNP data processed with Biomind software
SNP-Sets as Pattern Strength Classifiers
Each Pattern Strength Classifier is simply a list of SNPs and a
threshold. For a given individual being evaluated by a given rule,
the sum of SNP incidences is computed in the following way: if the
individual has a SNP s (present in the SNP list of the rule) in
heterozygosis, then the value 2 is summed for s; if s is present in
homozygosis, then 1 is summed; finally, if s is undetermined for
that individual, then 0 is summed. After this sum is computed for
all SNPs in the rule list, the value is compared with the rule
threshold: if it is greater than the threshold, the individual is
classified as CFS, otherwise Control. For each SNP-set, the
threshold value is selected that allows the SNP-set to achieve the
maximum accuracy for distinguishing Case vs. Control.
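A Pattern Strength Classifier as described above could be sketched like this. The per-genotype values (2, 1, 0) are copied from the text as written; the genotype labels, the treatment of SNPs absent from an individual's record (counted as 0), the candidate thresholds, and the SNP identifiers are assumptions for illustration, not the Biomind implementation.

```python
# Per-genotype contributions, taken as stated in the text above.
WEIGHTS = {"het": 2, "hom": 1, "undetermined": 0}


def strength(snp_list, genotypes):
    """Sum the contributions of the rule's SNPs for one individual.

    SNPs missing from `genotypes` contribute 0 (an assumption).
    """
    return sum(WEIGHTS.get(genotypes.get(snp), 0) for snp in snp_list)


def classify(snp_list, threshold, genotypes):
    """CFS if the summed strength exceeds the threshold, otherwise Control."""
    return "CFS" if strength(snp_list, genotypes) > threshold else "Control"


def best_threshold(snp_list, individuals, labels):
    """Pick the threshold giving maximum accuracy on the supplied data."""
    sums = [strength(snp_list, g) for g in individuals]
    candidates = sorted(set(sums) | {min(sums) - 1})

    def accuracy(t):
        hits = sum(("CFS" if s > t else "Control") == y
                   for s, y in zip(sums, labels))
        return hits / len(labels)

    return max(candidates, key=accuracy)


# Hypothetical SNP identifiers and individuals.
rule = ["snpA", "snpB"]
people = [
    {"snpA": "het", "snpB": "hom"},
    {"snpA": "het"},
    {"snpB": "hom"},
    {"snpA": "undetermined"},
]
labels = ["CFS", "CFS", "Control", "Control"]
threshold = best_threshold(rule, people, labels)
```

Because the threshold is tuned on the same data it is scored on, the accuracy it reports is optimistic; held-out validation such as leave-one-out would be needed for a fair estimate.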
SNP Based Classifiers for CFS vs. Control
Important Genes for Differentiating CFS vs. Control
Conceptual Hypothesis