Identifying Potential Biomarkers for Chronic Fatigue Syndrome via Classification Model Ensemble Mining

Ben Goertzel, PhD

Biomind LLC

www.biomind.com

General Methodology
Using machine learning to find nonlinear combinations of genes, mutations or clinical indicators that are associated with diseases, toxic reactions, symptoms, or other phenotypic qualities

Summary
CAMDA 2006 data was analyzed using a novel classification model ensemble mining methodology
Genetic programming and heuristic search are applied to learn ensembles of classification rules that distinguish CFS from Control
    (one set of classifiers based on microarray data
    one set based on SNP data)
These ensembles are then statistically analyzed to identify genes, gene categories, and combinations thereof that appear to play important roles in characterizing CFS.
The results of this analysis include
    potential microarray and SNP based diagnostic rules for CFS
    lists of SNPs, genes and gene categories that are potentially significant biomarkers for CFS (and are different from those found via simple statistical category-differentiation analysis).

Summary

Overall, our results appear compatible with a system-theoretic view of CFS which views the disorder as a complex pattern of activity across the organism including interlinked disturbances in neural and endocrine systems.

Conceptual Hypothesis

Recent Related Work
Goertzel BN, Pennachin C, de Souza Coelho L, Maloney EM, Jones JF, Gurbaxani B. Allostatic load is associated with symptoms in chronic fatigue syndrome patients. Pharmacogenomics. 2006 Apr;7(3):485-94.

Goertzel BN, Pennachin C, de Souza Coelho L, Gurbaxani B, Maloney EM, Jones JF. Combinations of single nucleotide polymorphisms in neuroendocrine effector and receptor genes predict chronic fatigue syndrome. Pharmacogenomics. 2006 Apr;7(3):475-83.

Maloney EM, Gurbaxani BM, Jones JF, de Souza Coelho L, Pennachin C, Goertzel BN. Chronic fatigue syndrome and high allostatic load. Pharmacogenomics. 2006 Apr;7(3):467-73.

Gurbaxani BM, Jones JF, Goertzel BN, Maloney EM. Linear data mining the Wichita clinical matrix suggests sleep and allostatic load involvement in chronic fatigue syndrome. Pharmacogenomics. 2006 Apr;7(3):455-65.

Overview
Microarray Data Analysis
SNP Data Analysis

Analyzing Microarray Data via Supervised Categorization
Often more statistically meaningful than clustering
    and allows one to do clustering of features based on whether they're used in the same categorization models
The researcher must divide the data into two or more categories, e.g.
    Case vs. Control
    Early vs. Late (in a time series experiment)
    Multiclass categorization: which kind of cancer?
Algorithms learn rules (models) that predict which category a microarray gene expression profile falls into, via combining expression values in an automatically learned mathematical formula

Supervised Categorization Algorithms
Many supervised categorization algorithms exist, each with strengths and weaknesses
Unlike with clustering, a choice may be made based on rigorous validation methodology

Decision trees
Neural networks
Logistic regression
Support vector machines
Genetic programming
Etc.

Applications of Supervised Categorization Analysis
Classification models may be used as diagnostic rules
Classification models may be studied to yield intuitive insight
    particularly interesting in the case of model ensembles

Example Classification Rule Learned via Genetic Programming
if (NM_005110 + NM_001614) / NM_002230 - 0.3 * NM_002297 > 1
then Case
else Control
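
The rule above can be read as an ordinary function of four expression values. The sketch below is only an illustration: the function name and the sample values are invented, and it assumes the expression of NM_002230 is nonzero.

```python
def classify_profile(expr):
    """Apply the example rule to one expression profile (dict: accession -> value)."""
    score = (expr["NM_005110"] + expr["NM_001614"]) / expr["NM_002230"] \
        - 0.3 * expr["NM_002297"]
    return "Case" if score > 1 else "Control"

# Hypothetical expression values, made up for illustration only.
sample = {"NM_005110": 2.0, "NM_001614": 1.5, "NM_002230": 1.0, "NM_002297": 0.5}
print(classify_profile(sample))
```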

Using Ontologies to Make Enhanced Feature Vectors
Traditional statistical and machine learning methods characterize individuals by expression values alone.

Gene Expression Data Enhancement

Example Classification Rule Learned via Genetic Programming
Enhanced Feature based on GO
Enhanced Feature based on PIR

Example Ontology-Based Classification Rule in Biomind ArrayGenius User Interface

Classification Accuracy Using Biomind Tools

Categorization Model Ensembles
Generally there will be many qualitatively different classification models distinguishing one category from another
For diagnostics, one needs only a single good rule
    though voting across a model ensemble may give better accuracy than any individual learned rule
For gaining qualitative understanding, statistical analysis of feature usage across models in an ensemble appears to be quite valuable
Genetic programming is a particularly useful technique here, because each learned model tends to use a relatively small number of features
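
Voting across an ensemble can be sketched as a simple majority vote. This is an assumed voting scheme, not necessarily the one used in the Biomind tools; the three rules and the features g1, g2, g3 are hypothetical.

```python
from collections import Counter

def majority_vote(models, sample):
    """Return the label predicted by most models in the ensemble."""
    votes = Counter(model(sample) for model in models)
    return votes.most_common(1)[0][0]

# Three hypothetical rules over made-up features g1, g2, g3.
models = [
    lambda s: "Case" if s["g1"] + s["g2"] > 1.0 else "Control",
    lambda s: "Case" if s["g2"] - s["g3"] > 0.0 else "Control",
    lambda s: "Case" if s["g1"] * s["g3"] > 0.5 else "Control",
]
sample = {"g1": 0.8, "g2": 0.6, "g3": 0.2}
print(majority_vote(models, sample))
```

An odd number of models avoids ties in a two-class problem.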

Important Features Analysis
Given a classification model ensemble, one can list the features that occur in the greatest number of models
These are NOT necessarily the same features that provide the greatest differentiation between the two categories, considered individually
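
Counting feature occurrences across an ensemble might look like the sketch below. Each model is represented only by the set of features it uses; the ensemble shown is hypothetical and the function name is an assumption.

```python
from collections import Counter

def feature_usage(ensemble):
    """Count, for each feature, the number of models that use it."""
    counts = Counter()
    for features in ensemble:
        counts.update(set(features))  # count each feature once per model
    return counts

# Hypothetical ensemble: each model is listed as the features it uses.
ensemble = [
    {"NM_005110", "GO:0004407"},
    {"GO:0004407", "NM_002230"},
    {"GO:0004407", "NM_005110", "NM_002297"},
]
for feature, n_models in feature_usage(ensemble).most_common(2):
    print(feature, n_models)
```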

Classification Model Utilization Based Clustering
Associate each feature (gene, GO, etc.) with the set of successful classification models that use this feature
Interpret these sets as meta feature vectors, one for each direct or enhanced feature in the original dataset
Cluster these meta feature vectors
The resulting clusters are sets of genes or GOs that have interesting interactions in the context of the classification problem at hand
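
A minimal sketch of these steps follows. The slides do not name the distance measure or the clustering algorithm, so Jaccard distance and a single-linkage threshold grouping are assumptions made here for illustration; the feature names are hypothetical.

```python
def meta_vectors(ensemble):
    """Map each feature to a 0/1 vector with one entry per model."""
    features = sorted(set().union(*ensemble))
    return {f: [1 if f in model else 0 for model in ensemble] for f in features}

def jaccard_distance(u, v):
    """1 - (models using both features) / (models using either feature)."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1.0 - both / either if either else 0.0

def cluster(vectors, max_dist):
    """Single-linkage grouping: features within max_dist end up together."""
    clusters = []
    for name, vec in vectors.items():
        linked = [c for c in clusters
                  if any(jaccard_distance(vec, vectors[m]) <= max_dist for m in c)]
        merged = {name}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters

# Hypothetical ensemble of four models, each given as its feature set.
ensemble = [{"geneA", "geneB"}, {"geneA", "geneB", "geneC"},
            {"geneC", "geneD"}, {"geneD"}]
print(cluster(meta_vectors(ensemble), max_dist=0.5))
```

Here geneA and geneB are grouped because they are used by exactly the same models, regardless of how similar their expression profiles are.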

Example Application: Analysis of CFS Data
Collaboration with Dr. Suzanne Vernon at CDC
Specific Aims
    To determine which genes are consistently expressed in the peripheral blood
    To determine if genes and background knowledge could be used to classify CFS
    To determine if there was a common (perturbed) CFS pathway
    To understand differences between post-EBV fatigue and other types of CFS

Details of CFS Data
We analyzed
    CAMDA (Wichita) dataset
    Exercise dataset
    Post-Infectious Fatigue dataset
Gene expression:
    Noise filtering: started with 30,000 genes
        Exercise dataset reduced to 1,921
        Wichita dataset reduced to 10,812
    Log-transformed and Z-score normalized
Background knowledge
    Exercise dataset added 377 GO and 145 PIR features
    Wichita dataset added 1405 GO and 1413 PIR features
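
The log transform and Z-score normalization step can be sketched as below. The slides do not give the log base, the SD estimator, or whether scaling is per gene or per array; base 2 and the population SD are assumptions here, and the input must be positive and non-constant.

```python
import math
from statistics import mean, pstdev

def log_zscore(values):
    """Log2-transform positive intensities, then scale to mean 0 and SD 1."""
    logged = [math.log2(v) for v in values]
    mu, sigma = mean(logged), pstdev(logged)
    return [(x - mu) / sigma for x in logged]

# Hypothetical raw intensities for one gene across four samples.
print(log_zscore([1.0, 2.0, 4.0, 8.0]))
```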

Results

Confusion Matrix on CAMDA Data (Leave-One-Out Validation)

10 Most Important Features

Comparison of Clustering Approaches (Example Clusters)
Expression-Based Clustering

GO:0000118 histone deacetylase complex
GO:0004407 histone deacetylase activity
GO:0005667 transcription factor complex
GO:0006476 protein amino acid deacetylation
GO:0016570 histone modification
GO:0016575 histone deacetylation
GO:0019213 deacetylase activity
NM_001527 Homo sapiens histone deacetylase 2 (HDAC2), mRNA
Model Usage Based Clustering

GO:0004407 histone deacetylase activity
AC016882
GO:0042221 response to chemical substance
GO:0008628 induction of apoptosis by hormones
ENSG00000086758
AB053232
GO:0009991 response to extracellular stimulus

Interpretation of Histone Cluster
The relation between histone deacetylase and apoptosis is now well known
It was demonstrated that caspase-2 and -3, which are part of the superfamily of caspases, the major group of proteins responsible for triggering apoptosis (Cryns and Yuan, 1998), are able to interact with and cleave the amino-terminal portion of histone deacetylase 4, which accumulates in the nucleus and interacts physically with the transcription factor MEF2C, thus preventing this factor from activating anti-death signals that would allow cell survival (Paroni et al, 2004).
Coexposure of cells to HDIs in conjunction with STI571 has been observed to down-regulate proteins related to response to extracellular stimulus, such as phospho-extracellular signal-regulated kinase (ERK) (Yu et al, 2003).

Signal interaction map for cluster derived from CFS data
13 of 17 identified genes in the hairball (320 nodes, 1603 edges)

Produced in MetaCore product by GeneGo

Measuring Clustering Quality
The quality of a clustering was measured as the product homogeneity x separation. Homogeneity is calculated as the average of the distances of all members of the cluster to their nearest cluster-mates. Separation is simply the minimum distance from any given member of the cluster to elements outside the cluster. These particular definitions were used in order to minimize the influence of the size of the cluster on its quality.
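
The definitions above can be computed literally as in the sketch below. Euclidean distance is an assumption, the slide does not say whether homogeneity is rescaled or inverted before the product is taken, and the function requires a cluster with at least two members and at least one element outside it.

```python
import math

def cluster_quality(points, labels, cluster_id):
    """Return (homogeneity, separation, product) for one cluster."""
    inside = [p for p, l in zip(points, labels) if l == cluster_id]
    outside = [p for p, l in zip(points, labels) if l != cluster_id]
    # Homogeneity: average distance from each member to its nearest cluster-mate.
    homogeneity = sum(
        min(math.dist(p, q) for j, q in enumerate(inside) if j != i)
        for i, p in enumerate(inside)
    ) / len(inside)
    # Separation: smallest distance between a member and any outside element.
    separation = min(math.dist(p, q) for p in inside for q in outside)
    return homogeneity, separation, homogeneity * separation

# Hypothetical 2-D points in two clusters.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]
print(cluster_quality(points, labels, cluster_id=0))
```

Because both quantities use nearest-neighbour distances rather than sums over all pairs, neither grows automatically with cluster size.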

Model Utilization Based versus Conventional Clustering

Useful feature map Exercise Challenge
Exercise challenge: 31 of the 100 most useful features were GO categories that map to the following pathways:
Exercise challenge: 62 of the 100 most useful features were genes that map to the following pathways:
RAD (DNA repair)
mRNA processing, ribosomes, translation

Useful feature map Wichita
Wichita: 80 of the 100 most useful features were GO categories and genes that map to several pathways.
Noticeably absent are features in DNA repair initiation, transcription, gene expression and mRNA processing.

Useful feature map Post-Infective Fatigue
Post-infective fatigue: 79 of the 100 most useful features were GO categories and genes that map to mRNA processing and spliceosome formation.
spliceosome formation and mRNA processing

Interpretation of CFS Microarray Results
Features that make up these models indicate widespread disruption of cell homeostasis in both the Exercise and Wichita studies.
Features that make up the PIF classification model identify mRNA processing pathways known to be disrupted by EBV.

CFS SNP data analysis
CFS SNP data publicly available through the CAMDA 2006 challenge (see http://www.camda.duke.edu/camda06)
Genes pre-selected for SNP analysis because of possible CFS involvement
SNP data processed with Biomind software

SNP-Sets as Pattern Strength Classifiers
Each Pattern Strength Classifier is simply a list of SNPs and a threshold. For a given individual being evaluated by a given rule, the sum of SNP incidences is computed in the following way: if the individual has a SNP s (present in the SNP list of the rule) in heterozygosis, then the value 2 is summed for s; if s is present in homozygosis, then 1 is summed; finally, if s is undetermined for that individual, then 0 is summed. After this sum is computed for all SNPs in the rule list, the value is compared with the rule threshold: if it is greater than the threshold, the individual is classified as CFS, otherwise Control. For each SNP-set, the threshold value is selected that allows the SNP-set to achieve the maximum accuracy for distinguishing Case vs. Control.
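
A Pattern Strength Classifier as described above can be sketched as follows. The weights are taken as stated in the text (heterozygous 2, homozygous 1, undetermined 0); treating a SNP missing from an individual's record as 0, the genotype encoding, the SNP names, and the function names are all assumptions for illustration.

```python
# Weights as stated in the text above; None stands for undetermined.
WEIGHTS = {"het": 2, "hom": 1, None: 0}

def strength(genotypes, snp_set):
    """Sum the weights of the rule's SNPs for one individual."""
    # A SNP missing from the record is treated as undetermined (assumption).
    return sum(WEIGHTS[genotypes.get(snp)] for snp in snp_set)

def psc_classify(genotypes, snp_set, threshold):
    """CFS if the summed strength exceeds the threshold, otherwise Control."""
    return "CFS" if strength(genotypes, snp_set) > threshold else "Control"

def best_threshold(individuals, labels, snp_set):
    """Pick the threshold giving the highest CFS-vs-Control accuracy."""
    scores = [strength(g, snp_set) for g in individuals]
    candidates = sorted(set(scores) | {min(scores) - 1})
    def accuracy(t):
        hits = sum((s > t) == (l == "CFS") for s, l in zip(scores, labels))
        return hits / len(labels)
    return max(candidates, key=accuracy)

# Hypothetical SNP names and genotype records.
snp_set = ["snp_a", "snp_b"]
individuals = [{"snp_a": "het", "snp_b": "hom"}, {"snp_a": "het"},
               {"snp_b": "hom"}, {}]
labels = ["CFS", "CFS", "Control", "Control"]
print(best_threshold(individuals, labels, snp_set))
```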

SNP Based Classifiers for CFS vs. Control

Important Genes for Differentiating CFS vs. Control

Conceptual Hypothesis
