Supporting Information for:

Pathway Analysis using Random Forests

Herbert Pang1, Aiping Lin2, Matthew Holford1, Bradley E. Enerson3, Bin Lu5, Michael P. Lawton5, Eugenia Floyd5 and Hongyu Zhao1,2,4

1Division of Biostatistics, Department of Epidemiology and Public Health, 2W. M. Keck Biotechnology Resource Laboratory, 3Boyer Center for Molecular Medicine, 4Department of Genetics, Yale University School of Medicine, New Haven, CT, 06520 USA and 5Pfizer Groton Laboratories, Safety Sciences, Groton, CT, 06340 USA

Contents of Supporting Information

Figures
FS1. Outlier plots for Random Forests classification
FS2. Error Rate (OOB) plot versus the number of trees
FS3. Overlapping genes of top ranked pathways for classification
FS4. Other top ranked pathways for classification
FS5. MSE plot versus the number of trees for Random Forests regression
FS6. MDS plot for the proximity matrix of the 29 canine cases for ‘one carbon pool by folate’ pathway
FS7. Overlapping genes of top ranked pathways for regression

Tables
TS1. Pathways ranked by OOB error rates with outliers
TS2. A description of the 29 canine dataset
TS3. Change in classification error rate after removal of outliers
TS4. Change in classification error after removing overlapping genes
TS5. Simple T-test
TS6. Fisher’s Exact Test results
TS7. Gene Set Enrichment Analysis (GSEA) results

Others
DMS1. Proximity matrix
DS1. Applications of RF classification



FS1. Outlier plots for Random Forests classification

These five pathways are the top ranked pathways when all 31 dogs are included; they have the lowest OOB error rates.

FS2. Error Rate (OOB) plot versus the number of trees

As the number of trees can affect the classification error, we ran Random Forests with 500 (the default), 5,000, 50,000 and 100,000 trees. The classification error became stable at around 5,000 trees for most pathways. Some examples of plots of the OOB error rate versus the number of trees are given in the figures. These plot the estimated error rate against the number of trees for the pathway Hypoxia and p53 in the Cardiovascular system; with about 1,000 trees, the OOB error has stabilized. For some pathways there are more fluctuations, such as the Pertussis toxin-insensitive CCR5 Signaling pathway. We therefore decided to use 50,000 trees. Black lines represent the actual classification error rate; green and red lines are the upper and lower confidence bounds, respectively.
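The idea of reading off where an OOB error curve stabilizes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stable_from`, the tolerance of 0.01, and the error trace are all hypothetical and are not values or code from the study.

```python
def stable_from(tree_counts, oob_errors, tol=0.01):
    """Return the smallest tree count from which every later OOB error
    stays within `tol` of the final value (None for empty input)."""
    if not tree_counts or len(tree_counts) != len(oob_errors):
        return None
    final = oob_errors[-1]
    stable_at = None
    for n, err in zip(tree_counts, oob_errors):
        if abs(err - final) <= tol:
            if stable_at is None:
                stable_at = n      # first point inside the tolerance band
        else:
            stable_at = None       # curve left the band; restart the search
    return stable_at


# Hypothetical OOB error trace (NOT values from the paper).
tree_counts = [100, 500, 1000, 5000, 50000]
oob_errors = [0.172, 0.103, 0.069, 0.069, 0.069]
print(stable_from(tree_counts, oob_errors))
```

A pathway whose curve fluctuates for longer would return a larger tree count, which is one way to motivate choosing a generous number of trees for all pathways.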


FS3. Overlapping genes with positive importance measure of top ranked pathways for classification (pathways are rectangular nodes, genes are diamond-shaped nodes)

FS4. Other top ranked pathways for classification with positive importance measure genes (pathways are rectangular nodes, genes are diamond-shaped nodes)

FS5. MSE plot versus the number of trees for Random Forests regression

FS6. MDS plot for the proximity matrix of the 29 canine cases for ‘one carbon pool by folate’ pathway

FS7. Overlapping genes with positive importance measure of top ranked pathways for regression (pathways are rectangular nodes, genes are diamond-shaped nodes)

TS1. Pathways ranked by OOB error rates of less than or equal to 12.9% without outliers

TS2. A description of the 31 canines in our dataset

The detection of animals #10 and #19 (marked with *) as outliers seems biologically plausible: each of these two dogs had a lesion score markedly lower or higher than the other three dogs sampled at the same time point and dosage group. They are therefore classified into the other lesion group, which is consistent with what we observed in the outlier plots.

Dog number  Total Lesion Number  Treatment
1           0                    6h_control
2           0                    6h_control
3           0                    6h_control
4           0                    6h_control
5           1                    6h_Dose2
6           5                    6h_Dose2
7           2                    6h_Dose2
8           1                    6h_Dose2
9           2                    6h_Dose10
10*         0                    6h_Dose10
11          3                    6h_Dose10
12          8                    6h_Dose10
13          0                    16h_control
14          0                    16h_control
15          0                    16h_control
16          0                    16h_control
17          1                    16h_Dose2
18          0                    16h_Dose2
19*         3                    16h_Dose2
20          0                    16h_Dose2
21          1                    16h_Dose10
22          6                    16h_Dose10
23          21                   16h_Dose10
24          6                    16h_Dose10
25          0                    Recovery_control
26          0                    Recovery_control
27          0                    Recovery_control
28          0                    Recovery_control
29          0                    Recovery_Dose10
30          2 (healing)          Recovery_Dose10
31          3 (healing)          Recovery_Dose10

TS3. Change in classification error rate after removal of outliers

Pathway   without outliers   with outliers
BC-Low-density lipoprotein (LDL) pathway during atherogenesis(BC)   3.4%   12.90%
BC-Msp-Ron Receptor Signaling Pathway(BC)   3.4%   12.90%
BC-Hypoxia and p53 in the Cardiovascular system(BC)   3.4%   19.35%
BC-Role of Ran in mitotic spindle regulation(BC)   6.9%   22.58%
BC-Granzyme A mediated Apoptosis Pathway(BC)   6.9%   16.13%
BC-CTCF First Multivalent Nuclear Factor(BC)   6.9%   9.68%
BC-CDK Regulation of DNA Replication(BC)   6.9%   19.35%
BC-Sumoylation by RanBP2 Regulates Transcriptional Repression(BC)   6.9%   19.35%
Circadian rhythm   6.9%   9.68%
leukocyte adhesion   6.9%   9.68%
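As a reading aid, the percentages in TS3 are consistent with whole numbers of misclassified dogs out of 29 (without outliers) or 31 (with outliers). This is an arithmetic observation rather than something the source states, and the helper name `error_rate` is hypothetical:

```python
def error_rate(misclassified, total):
    """OOB error rate in percent for a given number of misclassified samples."""
    return 100.0 * misclassified / total


# Without outliers: 29 dogs, reported to one decimal place.
for k in (1, 2):
    print(f"{k}/29 = {error_rate(k, 29):.1f}%")

# With outliers: 31 dogs, reported to two decimal places.
for k in (3, 4, 5, 6, 7):
    print(f"{k}/31 = {error_rate(k, 31):.2f}%")
```

Under this reading, a change from 12.90% to 3.4% corresponds to going from four misclassified dogs out of 31 to one out of 29.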

TS4. Change in classification error after removing overlapping genes

These results allow researchers to focus on a smaller set of genes. When the overlapping genes among those pathways were removed, pathways in which most of the high-importance genes were overlapping genes had, as expected, their classification power significantly decreased. The WNT Signaling pathway maintained a 6.9% error rate because, even after the overlapping genes were removed, it still contained the important genes PLCL2, LRP5, and DVL1. The Hypoxia and p53 in the Cardiovascular system pathway, which still contained the gene HIF1A after the overlapping genes were removed, continued to perform very well, with only 6.9% error. The same holds for Role of Ran in mitotic spindle regulation, Circadian Rhythm and Leukocyte Adhesion, as each of them contains a set of unique important genes. The table below lists the changes in classification error for the top 15 pathways once overlapping genes are removed. The increase in classification error may serve as an indicator of how these pathways work together.

Pathway   OOB error   # of genes   OOB error without overlap
BC-Low-density lipoprotein (LDL) pathway during atherogenesis(BC)   3.4%   4   31.0%
BC-Msp-Ron Receptor Signaling Pathway(BC)   3.4%   6   37.9%
BC-Hypoxia and p53 in the Cardiovascular system(BC)   3.4%   14   6.9%
BC-Role of Ran in mitotic spindle regulation(BC)   6.9%   8   10.3%
BC-Granzyme A mediated Apoptosis Pathway(BC)*   6.9%   5   6.9%
BC-CTCF First Multivalent Nuclear Factor(BC)   6.9%   18   31.0%
BC-CDK Regulation of DNA Replication(BC)   6.9%   10   41.4%
BC-Sumoylation by RanBP2 Regulates Transcriptional Repression(BC)   6.9%   8   24.1%
Circadian rhythm   6.9%   9   10.3%
BC-Nitric Oxide Signaling Pathway(BC)   6.9%   14   44.8%
Aminosugars metabolism*   6.9%   19   6.9%
Wnt signaling pathway   6.9%   68   6.9%
Aminoacyl-tRNA biosynthesis*   6.9%   19   6.9%
BC-Pertussis toxin-insensitive CCR5 Signaling in Macrophage(BC)   6.9%   15   51.7%
leukocyte adhesion   6.9%   59   10.3%

*pathways without overlapped genes
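The step of removing overlapping genes before refitting can be sketched as follows. This is a minimal illustration assuming that "overlapping" means a gene belongs to more than one of the pathways under consideration; the function name and the pathway and gene labels are hypothetical placeholders, not data from the study.

```python
from collections import Counter


def remove_overlapping_genes(pathways):
    """Drop every gene that belongs to more than one pathway.

    `pathways` maps a pathway name to a list of gene symbols; the result
    maps each name to its sorted list of pathway-specific genes.
    """
    counts = Counter(g for genes in pathways.values() for g in set(genes))
    return {
        name: sorted(g for g in set(genes) if counts[g] == 1)
        for name, genes in pathways.items()
    }


# Hypothetical memberships (placeholders, not the canine pathways).
pathways = {
    "pathway_A": ["GENE1", "GENE2", "GENE3"],
    "pathway_B": ["GENE2", "GENE4"],
    "pathway_C": ["GENE5"],
}
unique = remove_overlapping_genes(pathways)
print(unique)
```

A pathway such as `pathway_C`, which shares no genes with the others, is returned unchanged; this matches the rows marked with * in the table, whose error rate stays the same.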

TS5. Simple T-test

A simple t-test gives over 5,000 significant genes (probes) at the 0.01 cutoff. After mapping them back to pathways and ranking the pathways by the number of genes mapped or by the proportion of genes mapped, we were not able to find as many interesting pathways as with Random Forests classification. Similar results were noted for the 0.05 cutoff.

Mapped genes using the 0.01 cutoff ranked by number of genes mapped

Pathway Name   Total Genes in Pathway   Dataset Genes in Pathway   Count
MAPK signaling pathway   217   L-Cf-7912_i_at(GADD45A) GL-Cf-10   55
Purine metabolism   133   5_at(NT5C3) GL-Cf-6083_at(PRPS1   31
MAPKinase Signaling Pathway(BC)   64   f-16_at(TGFB1) GL-Cf-3172_at(MAP   25
Tryptophan metabolism   85   -21_s_at(CAT) GL-Cf-483_at(MAOA   24
Glycerolipid metabolism   96   GL-Cf-12655_at(B4GALT1) GL-Cf-7   24
Cell cycle   81   CCNB2) GL-Cf-9398_at(CDC25A) G   24
Integrin-mediated cell adhesion   82   _at(PAK1) GL-Cf-3637_at(PDPK1) G   24
Arginine and proline metabolism   72   S2A) GL-Cf-168_i_at(VNN1) GL-Cf-   21
TGF-beta signaling pathway   57   6_at(TGFB1) GL-Cf-1716_at(DCN) G   21
Complement and coagulation cascades   91   6_at(THBD) GL-Cf-13131_at(KLKB1   21
Wnt signaling pathway   110   L-Cf-1062_at(PPP2CA) GL-Cf-5993   20
Pyrimidine metabolism   70   162_at(TYMS) GL-Cf-9832_at(NME   19
Fatty acid metabolism   89   -Cf-136_i_at(CYP2D6) GL-Cf-179_s   18
Histidine metabolism   34   f-6906_at(FTSJ1) GL-Cf-483_at(MA   18
Proteasome   48   209_at(PSMA1) GL-Cf-3260_at(PSM   18
Leukocyte Adhesion(user defined)   207   AP2K1) GL-Cf-353_at(IL13) GL-Cf-1   18
Glycine, serine and threonine metabolism   42   _at(MAOB) GL-Cf-1575_at(HSD17B   16
Lysine degradation   51   49) GL-Cf-2807_at(DLST) GL-Cf-145   16
Starch and sucrose metabolism   73   -Cf-12561_at(UGP2) GL-Cf-7834_s_   16
Cholera - Infection   72   61G) GL-Cf-562_at(SEC61A1) GL-C   16
Tyrosine metabolism   53   f-483_at(MAOA) GL-Cf-6453_s_at(M   15
Keratinocyte Differentiation(BC)   44   L-Cf-4239_at(MAP2K1) GL-Cf-7825   14
N-Glycans biosynthesis   46   f-5347_at(FNTB) GL-Cf-2114_i_at(M   13
Nicotinate and nicotinamide metabolism   57   f-4378_at(QPRT) GL-Cf-2271_at(NT   13

Mapped genes using the 0.01 cutoff ranked by proportions of genes mapped

Pathway Name   Total Genes in Pathway   Dataset Genes in Pathway   Count
Phosphorylation of MEK1 by cdk5/p35 down regulates the MAP kinase pathway(BC)   7   GL-Cf-312_at(HRAS) GL-Cf-10291_at(MAPK1) GL-Cf-4239_at(MAP2K1) GL-Cf-7825_at(MAP2K1) GL-Cf-5049_at(MAP2K2) GL-Cf-3465_at(MAPK3) GL-Cf-13757_at(CDK5)   6
Ethylbenzene degradation   5   GL-Cf-10101_at(NAT5) GL-Cf-9641_at(SLC27A2) GL-Cf-4255_at(GCN5L2) GL-Cf-9945_at(ARD1)   4
Segmentation Clock(BC)   5   GL-Cf-12414_at(GSK3B) GL-Cf-811_at(NOTCH1) GL-Cf-3735_at(DVL1) GL-Cf-9613_at(LFNG)   4
Visual Signal Transduction(BC)   5   GL-Cf-78_at(PDE6A) GL-Cf-135_at(PDE6G) GL-Cf-105_at(RHO) GL-Cf-1990_at(SLC25A22)   4
Small Leucine-rich Proteoglycan (SLRP) molecules(BC)   5   GL-Cf-1716_at(DCN) GL-Cf-552_at(DCN) GL-Cf-551_at(BGN) GL-Cf-13689_i_at(KERA) GL-Cf-7865_at(LUM)   4
Alkaloid biosynthesis II   4   GL-Cf-1637_at(ODC1) GL-Cf-5071_at(ABP1) GL-Cf-5141_at(AOC2)   3
Nitrobenzene degradation   7   GL-Cf-4621_at(HEMK) GL-Cf-3953_at(HRMT1L2) GL-Cf-1054_at(WBSCR22) GL-Cf-6906_at(FTSJ1) GL-Cf-12067_at(LCMT1)   5
NO2-dependent IL 12 Pathway in NK cells(BC)   9   GL-Cf-111_at(NOS2A) GL-Cf-392_at(CD3E) GL-Cf-115_at(CD4) GL-Cf-6400_at(CXCR3) GL-Cf-6228_at(JAK2) GL-Cf-4462_at(JAK2)   6
Anthrax Toxin Mechanism of Action(BC)   3   GL-Cf-4239_at(MAP2K1) GL-Cf-7825_at(MAP2K1) GL-Cf-5049_at(MAP2K2)   2
BRCA1-dependent Ub-ligase activity(BC)   3   GL-Cf-12002_at(BARD1) GL-Cf-6246_at(FANCD2)   2
Aminophosphonate metabolism   8   GL-Cf-4621_at(HEMK) GL-Cf-3953_at(HRMT1L2) GL-Cf-1054_at(WBSCR22) GL-Cf-6906_at(FTSJ1) GL-Cf-12067_at(LCMT1)   5
RNA polymerase III transcription(BC)   5   GL-Cf-648_at(TBP) GL-Cf-9041_at(SSB) GL-Cf-6267_at(GTF3C5)   3
Inhibition of Cellular Proliferation by Gleevec(BC)   12   GL-Cf-312_at(HRAS) GL-Cf-4239_at(MAP2K1) GL-Cf-7825_at(MAP2K1) GL-Cf-2341_at(BAD) GL-Cf-6228_at(JAK2) GL-Cf-4462_at(JAK2) GL-Cf-463_at(MYC) GL-Cf-3465_at(MAPK3)   7
Deregulation of CDK5 in Alzheimers Disease(BC)   14   GL-Cf-536_at(APP) GL-Cf-576_at(APP) GL-Cf-12414_at(GSK3B) GL-Cf-1614_at(MAPT) GL-Cf-4205_at(CAPN1) GL-Cf-12220_at(CSNK1D) GL-Cf-5144_at(CSNK1D) GL-Cf-5646_at(CSNK1A1) GL-Cf-1062_at(PPP2CA) GL-Cf-13757_at(CDK5)   8
Erythropoietin mediated neuroprotection through NF-kB(BC)   9   GL-Cf-413_at(EPO) GL-Cf-8219_at(HIF1A) GL-Cf-1968_at(NFKB1) GL-Cf-6228_at(JAK2) GL-Cf-4462_at(JAK2)   5
Overview of telomerase protein component gene hTert Transcriptional Regulation(BC)   9   GL-Cf-13296_at(TP53) GL-Cf-13252_at(HDAC1) GL-Cf-6496_i_at(MAX) GL-Cf-7987_at(SP1) GL-Cf-463_at(MYC)   5
Pentose and glucuronate interconversions   11   GL-Cf-2869_at(AKR1B1) GL-Cf-12561_at(UGP2) GL-Cf-7834_s_at(GUSB) GL-Cf-4358_at(UGDH) GL-Cf-441_at(UGT1A6) GL-Cf-2666_at(UGT2B11)   6
Ubiquinone biosynthesis   11   GL-Cf-4621_at(HEMK) GL-Cf-3953_at(HRMT1L2) GL-Cf-1054_at(WBSCR22) GL-Cf-6906_at(FTSJ1) GL-Cf-5347_at(FNTB) GL-Cf-12067_at(LCMT1)   6
Chondroitin / Heparan sulfate biosynthesis   11   GL-Cf-5546_at(NDST1) GL-Cf-8194_at(NDST1) GL-Cf-12441_at(EXTL3) GL-Cf-195_at(HS3ST1) GL-Cf-7905_at(HS2ST1) GL-Cf-14013_at(B4GALT7) GL-Cf-2751_at(B3GAT1)   6
Circadian rhythm   13   GL-Cf-4563_at(NR1D1) GL-Cf-12220_at(CSNK1D) GL-Cf-5144_at(CSNK1D) GL-Cf-5646_at(CSNK1A1) GL-Cf-13095_at(ARNTL) GL-Cf-2569_at(CRY1) GL-Cf-3602_at(BHLHB2) GL-Cf-2206_at(PER2)   7
Histidine metabolism   34   L-Cf-6906_at(FTSJ1) GL-Cf-483_at(MAOA)   18
Growth Hormone Signaling Pathway(BC)   18   K1) GL-Cf-12808_i_at(GHR) GL-Cf-40_at(GH   9
EPO Signaling Pathway(BC)   12   GL-Cf-312_at(HRAS) GL-Cf-4239_at(MAP2K1) GL-Cf-7825_at(MAP2K1) GL-Cf-413_at(EPO) GL-Cf-6228_at(JAK2) GL-Cf-4462_at(JAK2) GL-Cf-3465_at(MAPK3)   6
Estrogen-responsive protein Efp controls cell cycle and breast tumors growth(BC)   10   GL-Cf-13296_at(TP53) GL-Cf-1596_at(CCNB2) GL-Cf-3027_at(SMURF1) GL-Cf-9557_at(ESR1) GL-Cf-13757_at(CDK5)   5
Role of Erk5 in Neuronal Survival(BC)   10   GL-Cf-312_at(HRAS) GL-Cf-10291_at(MAPK1) GL-Cf-9015_at(MEF2C) GL-Cf-10087_at(RPS6KA1) GL-Cf-3465_at(MAPK3)   5
IL 3 signaling pathway(BC)   10   GL-Cf-312_at(HRAS) GL-Cf-4239_at(MAP2K1) GL-Cf-7825_at(MAP2K1) GL-Cf-6228_at(JAK2) GL-Cf-4462_at(JAK2) GL-Cf-3465_at(MAPK3)   5
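The two rankings used in TS5 differ only in the sort key. The sketch below takes four (total, count) pairs from the TS5 tables and ranks them both ways; the variable names are illustrative, and the t-test step that produces the counts is not reproduced here.

```python
# pathway -> (total genes in pathway, significant genes mapped at the 0.01 cutoff),
# four rows copied from the TS5 tables
pathways = {
    "MAPK signaling pathway": (217, 55),
    "Purine metabolism": (133, 31),
    "Ethylbenzene degradation": (5, 4),
    "Alkaloid biosynthesis II": (4, 3),
}

# Rank by the raw number of mapped genes (favours large pathways).
by_count = sorted(pathways, key=lambda p: pathways[p][1], reverse=True)

# Rank by the proportion of the pathway that is mapped (favours small pathways).
by_proportion = sorted(
    pathways, key=lambda p: pathways[p][1] / pathways[p][0], reverse=True
)

print(by_count)
print(by_proportion)
```

The count ranking puts the large MAPK signaling pathway first, while the proportion ranking puts the five-gene Ethylbenzene degradation pathway first, which illustrates why the two tables share so few entries.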

TS6. Fisher’s Exact Test results compared (lesion vs non-lesion) with FDR multiple hypothesis correction

Top ranked pathways in Random Forests classification with their p-values using Fisher’s Exact Test

Pathway Name   P-Value   S|NS (Pathway)   # of genes in pathway
Aminoacyl-tRNA biosynthesis   0.134323   6|11   36
Sumoylation by RanBP2 Regulates Transcriptional Repression(BC)   0.155747   3|4   11
Aminosugars metabolism   0.233562   6|12   28
Leukocyte Adhesion(user defined)   0.3438   6|36   207
Msp/Ron Receptor Signaling Pathway(BC)   0.589935   2|3   4
Wnt signaling pathway   0.866301   12|44   110
Hypoxia and p53 in the Cardiovascular system(BC)   0.705787   3|8   27
Low-density lipoprotein (LDL) pathway during atherogenesis(BC)   1   1|2   7
Granzyme A mediated Apoptosis Pathway(BC)   1   1|4   10
CTCF: First Multivalent Nuclear Factor(BC)   1   3|11   25
CDK Regulation of DNA Replication(BC)   1   2|8   13
Nitric Oxide Signaling Pathway(BC)   1   2|8   29
Pertussis toxin-insensitive CCR5 Signaling in Macrophage(BC)   1   2|7   24

Pathway Name   P-Value   S|NS (Pathway)   # of genes in pathway
Oxidative phosphorylation   0.000762   1|51   178
Circadian rhythm   0.00136   6|2   13
Inhibition of Matrix Metalloproteinases(BC)   0.005098   5|2   17
Small Leucine-rich Proteoglycan (SLRP) molecules(BC)   0.007248   4|1   5
TGF-beta signaling pathway   0.012722   13|21   57
Deregulation of CDK5 in Alzheimers Disease(BC)   0.021264   5|4   14
Ethylbenzene degradation   0.035771   4|3   5
Lissencephaly gene (LIS1) in neuronal migration and development(BC)   0.035771   4|3   17
Spliceosomal Assembly(BC)   0.035771   4|3   19
Role of Ran in mitotic spindle regulation(BC)   0.035771   4|3   12
Metabolism of Anandamide, an Endogenous Cannabinoid(BC)   0.041791   2|0   4
Proepithelin Conversion to Epithelin and Wound Repair Control(BC)   0.041791   2|0   8
Cycling of Ran in nucleocytoplasmic transport(BC)   0.041791   2|0   7
Glycolysis / Gluconeogenesis   0.045172   3|37   100
Selenoamino acid metabolism   0.047803   6|8   30
Rac 1 cell motility signaling pathway(BC)   0.047803   6|8   28
Proteasome   0.049834   10|17   48
gamma-Hexachlorocyclohexane degradation   0.055444   8|13   40
Double Stranded RNA Induced Gene Expression(BC)   0.061286   3|2   14
Carbon fixation   0.061343   0|17   34
ATP synthesis   0.068856   1|23   58
Angiotensin-converting enzyme 2 regulates heart function(BC)   0.078187   5|7   29
Nicotinate and nicotinamide metabolism   0.082715   4|39   57
Propanoate metabolism   0.096781   2|26   49
WNT Signaling Pathway(BC)   0.098093   6|9   30

TS7. Gene Set Enrichment Analysis (GSEA) results

Enrichment for the lesion phenotype with a nominal p-value less than 0.15

GS   SIZE   ES   NES   NOM p-val   FDR q-val   FWER p-val
1 BC-The role of FYVE-finger proteins in vesicle transport   9   0.63   1.5   0.049   1   0.885
3 BC-Regulation of p27 Phosphorylation during Cell Cycle Progression   10   0.52   1.34   0.146   1   0.987
2 BC-Small Leucine-rich Proteoglycan (SLRP) molecules   6   0.79   1.36   0.147   1   0.983
7 BC-Stathmin and breast cancer resistance to antimicrotubule agents   13   0.49   1.26   0.153   1   0.998

Enrichment for the no-lesion phenotype with a nominal p-value less than 0.15

GS   SIZE   ES   NES   NOM p-val   FDR q-val   FWER p-val
4 BC-Spliceosomal Assembly   9   -0.9   -1.72   0   1   0.401
5 Proteasome   30   -0.81   -1.59   0.02   0.909   0.706
2 BC-Eukaryotic protein translation   18   -0.77   -1.67   0.021   1   0.534
7 BC-Regulation of eIF4e and p70 S6 Kinase   8   -0.83   -1.58   0.027   0.747   0.748
3 Ganglioside biosynthesis   10   -0.69   -1.62   0.029   1   0.654
5 BC-Regulation of Spermatogenesis by CREM   5   -0.79   -1.53   0.032   1   0.844
4 BC-Ghrelin: Regulation of Food Intake and Energy Homeostasis   9   -0.67   -1.61   0.033   0.949   0.66
6 Protein export   16   -0.78   -1.58   0.035   0.842   0.738
14 DNA polymerase   13   -0.53   -1.48   0.063   0.816   0.913
11 Aminoacyl-tRNA biosynthesis   19   -0.67   -1.5   0.072   0.939   0.893
15 Nucleotide sugars metabolism   10   -0.65   -1.46   0.075   0.898   0.936
20 BC-Granzyme A mediated Apoptosis Pathway   5   -0.84   -1.41   0.076   0.945   0.964
19 BC-Repression of Pain Sensation by the Transcriptional Regulator DREAM   7   -0.67   -1.41   0.081   0.991   0.964
9 Selenoamino acid metabolism   16   -0.59   -1.49   0.087   0.742   0.9
13 BC-Hypoxia and p53 in the Cardiovascular system   14   -0.62   -1.49   0.092   0.839   0.907
17 BC-Mechanism of Protein Import into the Nucleus   7   -0.74   -1.43   0.095   0.74   0.941
22 Nitrobenzene degradation   8   -0.57   -1.38   0.106   1   0.976
18 BC-Internal Ribosome entry pathway   11   -0.65   -1.42   0.111   0.953   0.959
12 N-Glycans biosynthesis   32   -0.51   -1.49   0.113   0.882   0.899
29 Aminophosphonate metabolism   9   -0.5   -1.3   0.119   1   0.994
23 BC-VEGF, Hypoxia, and Angiogenesis   30   -0.45   -1.37   0.123   1   0.98
27 Flavonoids, stilbene and lignin biosynthesis   7   -0.69   -1.32   0.127   1   0.992
21 BC-Role of Ran in mitotic spindle regulation   8   -0.62   -1.39   0.131   1   0.973
24 BC-Monocyte and its Surface Molecules   7   -0.66   -1.35   0.136   1   0.987
16 leukocyte adhesion   59   -0.41   -1.39   0.143   0.77   0.965
10 Ribosome   54   -0.62   -1.5   0.143   1   0.884
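The ES column is a running-sum statistic over a ranked gene list. The sketch below follows the weighted Kolmogorov-Smirnov-like definition in Subramanian et al. (2005); it is a simplified illustration with hypothetical gene names and scores, and it does not compute NES, nominal p-values, or FDR, which additionally require phenotype permutations.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score for `gene_set`.

    `ranked_genes` is ordered from most positively to most negatively
    correlated with the phenotype; `scores` holds the matching correlations.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    norm = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if norm == 0:
        raise ValueError("scores of genes in the set must not all be zero")

    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        # Step up (weighted by the score) on a hit, step down on a miss.
        running += abs(s) ** p / norm if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running      # keep the maximum deviation from zero
    return best


# Hypothetical ranked list (placeholders, not genes from the canine data).
genes = ["a", "b", "c", "d"]
scores = [2.0, 1.0, -1.0, -2.0]
es_top = enrichment_score(genes, scores, {"a", "b"})
es_bottom = enrichment_score(genes, scores, {"c", "d"})
print(es_top, es_bottom)
```

A gene set concentrated at the top of the ranking gives a positive ES and one concentrated at the bottom gives a negative ES, matching the sign convention of the two tables.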

DMS1. Proximity matrix

Apart from identifying which pathways are important in sample classification, we can also look at which animals are anomalies. For every pathway, we can find out which dogs are misclassified. For a pathway of interest, the researcher can use the proximity measure, which tells us which subjects (dogs) are more similar to each other. This is particularly useful for seeing, for example, how similar a set of recovery dogs is to dogs measured at other time points. The following proximity table shows how similar the dogs are to each other for the Hypoxia and p53 pathway, which corresponds to the MDS plot in the paper:

      1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29
 1 1.00
 2 0.67 1.00
 3 0.56 0.54 1.00
 4 0.53 0.48 0.61 1.00
 5 0.18 0.17 0.16 0.22 1.00
 6 0.23 0.20 0.22 0.30 0.48 1.00
 7 0.16 0.15 0.14 0.21 0.66 0.46 1.00
 8 0.28 0.25 0.20 0.28 0.46 0.50 0.47 1.00
 9 0.14 0.13 0.12 0.18 0.64 0.47 0.64 0.46 1.00
10 0.26 0.25 0.24 0.28 0.45 0.31 0.45 0.30 0.51 1.00
11 0.27 0.22 0.21 0.31 0.48 0.32 0.49 0.31 0.51 0.43 1.00
12 0.47 0.47 0.51 0.43 0.19 0.20 0.19 0.21 0.18 0.33 0.30 1.00
13 0.64 0.66 0.56 0.58 0.16 0.24 0.16 0.25 0.12 0.24 0.24 0.43 1.00
14 0.33 0.35 0.34 0.29 0.26 0.19 0.25 0.17 0.25 0.40 0.28 0.45 0.29 1.00
15 0.60 0.62 0.47 0.44 0.13 0.16 0.13 0.20 0.12 0.23 0.27 0.46 0.60 0.33 1.00
16 0.29 0.28 0.30 0.35 0.32 0.27 0.30 0.25 0.33 0.30 0.25 0.32 0.29 0.35 0.21 1.00
17 0.48 0.49 0.52 0.46 0.20 0.25 0.20 0.28 0.19 0.33 0.20 0.46 0.44 0.36 0.43 0.37 1.00
18 0.49 0.52 0.50 0.46 0.20 0.27 0.21 0.29 0.19 0.33 0.21 0.48 0.46 0.38 0.44 0.38 0.60 1.00
19 0.20 0.19 0.19 0.27 0.34 0.31 0.35 0.27 0.37 0.39 0.30 0.30 0.19 0.41 0.17 0.31 0.29 0.30 1.00
20 0.13 0.12 0.12 0.18 0.41 0.37 0.42 0.33 0.43 0.46 0.38 0.22 0.13 0.32 0.13 0.30 0.20 0.23 0.60 1.00
21 0.10 0.11 0.12 0.15 0.47 0.43 0.47 0.39 0.50 0.46 0.41 0.20 0.11 0.28 0.09 0.35 0.19 0.19 0.53 0.61 1.00
22 0.13 0.13 0.12 0.18 0.50 0.43 0.50 0.41 0.53 0.46 0.43 0.19 0.12 0.28 0.11 0.38 0.18 0.20 0.47 0.55 0.63 1.00
23 0.28 0.31 0.22 0.23 0.20 0.33 0.19 0.34 0.19 0.19 0.21 0.12 0.36 0.10 0.35 0.21 0.18 0.19 0.16 0.18 0.19 0.22 1.00
24 0.19 0.22 0.13 0.15 0.28 0.29 0.27 0.31 0.26 0.23 0.34 0.08 0.27 0.10 0.26 0.23 0.09 0.10 0.21 0.22 0.22 0.23 0.46 1.00
25 0.49 0.49 0.41 0.45 0.17 0.23 0.16 0.25 0.14 0.19 0.30 0.29 0.58 0.20 0.56 0.24 0.31 0.32 0.17 0.12 0.10 0.11 0.49 0.42 1.00
26 0.57 0.56 0.48 0.52 0.14 0.20 0.13 0.21 0.11 0.19 0.25 0.36 0.66 0.26 0.62 0.24 0.38 0.39 0.16 0.10 0.08 0.10 0.48 0.38 0.69 1.00
27 0.24 0.26 0.13 0.14 0.26 0.22 0.24 0.25 0.23 0.29 0.32 0.11 0.29 0.14 0.31 0.26 0.12 0.13 0.22 0.23 0.22 0.25 0.44 0.48 0.40 0.38 1.00
28 0.51 0.51 0.43 0.46 0.13 0.20 0.13 0.19 0.11 0.14 0.25 0.32 0.59 0.23 0.57 0.23 0.33 0.34 0.11 0.08 0.08 0.10 0.47 0.40 0.67 0.69 0.39 1.00
29 0.41 0.41 0.33 0.36 0.20 0.14 0.18 0.15 0.16 0.19 0.32 0.27 0.49 0.25 0.46 0.30 0.23 0.24 0.16 0.13 0.13 0.15 0.39 0.46 0.56 0.60 0.45 0.61 1.00

Dog 1 has proximity measures higher than 0.6 with dogs 2, 13 (+1) and 15 (+1), as does dog 26, which is similar to both dog 13 (+1) and dog 15 (+1). In addition, dog 26 is similar to dogs 25 (+2), 28 (+2) and 29 (+2), and so forth. Other pairwise proximity measures with values greater than 0.6 are in bold.
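Scanning a lower-triangular proximity matrix for highly similar pairs is straightforward. The sketch below uses only the first four rows of the matrix above (dogs 1 to 4); the function name `similar_pairs` is hypothetical, and the strict "greater than 0.6" comparison is an assumption about how the threshold was applied.

```python
# First four rows of the lower-triangular proximity matrix (dogs 1-4).
lower = [
    [1.00],
    [0.67, 1.00],
    [0.56, 0.54, 1.00],
    [0.53, 0.48, 0.61, 1.00],
]


def similar_pairs(lower, threshold=0.6):
    """Return (dog_i, dog_j, proximity) for off-diagonal entries above threshold."""
    pairs = []
    for i, row in enumerate(lower):
        for j, value in enumerate(row[:-1]):   # row[:-1] skips the diagonal 1.00
            if value > threshold:
                pairs.append((j + 1, i + 1, value))   # 1-based dog numbers
    return pairs


pairs = similar_pairs(lower)
print(pairs)
```

For an MDS plot, a common choice is to convert each proximity into a dissimilarity such as one minus the proximity before computing the embedding; that step is not shown here.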

DS1. Applications of RF classification

Canine dataset

As there is a close tie between cytokines and inflammatory diseases, the only important gene in the IFN alpha signaling pathway, STAT2, was probably picked up because the STAT family members are phosphorylated in response to cytokines and growth factors, according to Entrez Gene (Maglott et al. 2005). SELE (E-selectin) plays the role of an adhesion ligand for recruiting leukocytes during inflammation (Harari et al. 1999) and is a good biomarker for endothelial function and inflammatory response (Hetzel et al. 2005; Eriksson et al. 2005). DNCL1 (also known as DYNLL1), which mediates DNA damage-induced p53 nuclear accumulation, is closely tied to the Hypoxia and p53 in the Cardiovascular system pathway (Lo et al. 2005).

Breast Cancer dataset

Pathway   OOB Error Rate   Num of Genes
76 BC-CARM1 and Regulation of the Estrogen Receptor   0.081633   24
156 Fructose and mannose metabolism   0.081633   39
163 BC-GATA3 participate in activating the Th2 cytokine genes expression   0.081633   21
171 Glycolysis - Gluconeogenesis   0.081633   68
326 BC-Regulation of BAD phosphorylation   0.081633   24
74 Carbon fixation   0.102041   25
126 BC-Downregulated of MTA-3 in ER-negative Breast Tumors   0.102041   19
140 BC-Estrogen-responsive protein Efp controls cell cycle and breast tumors growth   0.102041   10
221 JAK-Stat Signaling Pathway   0.102041   71
238 BC-mCalpain and friends in Cell motility   0.102041   33
288 Pentose phosphate pathway   0.102041   22
434 Valine_ leucine and isoleucine degradation   0.102041   46
227 Limonene and pinene degradation   0.122449   26
234 BC-Map Kinase Inactivation of SMRT Corepressor   0.122449   16
291 Phenylalanine metabolism   0.122449   22
351 BC-Role of ERBB2 in Signal Transduction and Oncology   0.122449   29
389 Sulfur metabolism   0.122449   9
423 BC-Trefoil Factors Initiate Mucosal Healing   0.122449   33
425 Tryptophan metabolism   0.122449   94
428 Tyrosine metabolism   0.122449   37

See text.

Gender dataset

Pathway   OOB Error Rate   Num of Genes
479 TESTIS_GENES_FROM_XHX_AND_NETAFFX   0.03125   111
480 GNF_FEMALE_GENES   0.0625   116
367 ST_Dictyostelium_discoideum_cAMP_Chemotaxis_Pathway   0.15625   55
445 XINACT   0.1875   34
478 WILLARD_INACT   0.1875   31
358 SIG_Regulation_of_the_actin_cytoskeleton_by_Rho_GTPases   0.21875   67
213 MAP00252_Alanine_and_aspartate_metabolism   0.25   36
256 MAP00910_Nitrogen_metabolism   0.28125   36
353 SIG_CHEMOTAXIS   0.28125   85

See text.

p53

These data are used to identify targets of the transcription factor p53. p53 regulates gene expression in response to various cellular stresses (Subramanian et al. 2005). The majority of these pathways are related to p53 function, for example Hypoxia and p53 in the Cardiovascular system, p53 signaling pathway, and p53 Pathway. Since hypoxic stress is related to DNA damage, and p53 regulates the cell cycle if DNA damage is present, it is not surprising to see pathways related to the cell cycle, such as SA_G1_AND_S_Phases and G2 pathway, as well as pathways like Cell death and DNA damage signaling.

Pathway   OOB Error Rate   Num of Genes
316 SA_G1_AND_S_PHASES   0.12   24
130 g2Pathway   0.14   44
319 SA_PROGRAMMED_CELL_DEATH   0.14   24
91 DNA_DAMAGE_SIGNALLING   0.16   139
246 mitochondriaPathway   0.16   33
274 p53hypoxiaPathway   0.16   40
21 badPathway   0.18   41
23 bcl2family_and_reg_network   0.18   59
61 chemicalPathway   0.18   44
273 p53_signalling   0.18   153
275 p53Pathway   0.18   40

In these two Lung cancer studies, the geneset gmt was not available online; therefore, we used the pathways from KEGG and BioCarta instead.

Lung_a

Some of these pathways, such as Protein export, are p53 related, since p53 and MDM2 are proteins which carry nuclear export signaling amino acids. Nitric oxide is observed in human tumor cell lines and is related to pulmonary vascular function and dysfunction. Nitric oxide plays a role in both the Actions of Nitric Oxide in the Heart pathway and the Phenylalanine, tyrosine and tryptophan biosynthesis pathway. There are also pathways related to the response to hypoxia, including the TSP-1 pathway, which is related to tumor growth, and others such as the first one, which involves wound repair control.

Pathway   OOB Error Rate   Num of Genes
304 BC-Proepithelin Conversion to Epithelin and Wound Repair Control   0.232558   11
308 Protein export   0.232558   7
4 BC-Actions of Nitric Oxide in the Heart   0.244186   34
288 Phenylalanine, tyrosine and tryptophan biosynthesis   0.244186   9
298 BC-Platelet Amyloid Precursor Protein Pathway   0.244186   19
422 BC-TSP-1 Induced Apoptosis in Microvascular Endothelial Cell   0.244186   10

Lung_b

In this dataset, the first pathway is related to glycolysis and blood sugar. It has been observed that tumors can lead to low glucose levels. Other pathways, such as Double Stranded RNA Induced Gene Expression, contain p53 response genes. Finally, the BRCA1-dependent Ub-ligase activity pathway contains ubiquitin, which is connected to the proteasome. As the proteasome plays an important role in the regulation of proteins in the cell cycle, apoptosis and angiogenesis, it is reasonable to see these two pathways among the ones with a small OOB error rate.

Pathway   OOB Error Rate   Num of Genes
384 BC-Steps in the Glycosylation of Mammalian N-linked Oligosaccarides   0.274193548   19
125 BC-Double Stranded RNA Induced Gene Expression   0.322580645   19
346 BC-RNA polymerase III transcription   0.322580645   6
65 BC-BRCA1-dependent Ub-ligase activity   0.338709677   7
310 Proteasome   0.35483871   38

Eriksson,E.E. et al. (2005) Powerful inflammatory properties of large vein endothelium in vivo, Arterioscler Thromb Vasc Biol, 25(4), 723-728.

Harari,O.A. et al. (1999) Targeting an adenoviral gene vector to cytokine-activated vascular endothelium via E-selectin, Nature Gene Therapy, 6(5), 801-807.

Hetzel,J. et al. (2005) Rapid effects of rosiglitazone treatment on endothelial function and inflammatory biomarkers, Arterioscler Thromb Vasc Biol, 25(9), 1804-1809.

Lo,K. et al. (2005) The 8-kDa Dynein Light Chain Binds to p53-binding Protein 1 and Mediates DNA Damage-induced p53 Nuclear Accumulation, J. Biol. Chem, 280(9), 8172-8179.

Maglott,D. et al. (2005) Entrez Gene: gene-centered information at NCBI, Nucleic Acids Research, 33 (Database issue), D54-D58.

Subramanian,A. et al. (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, PNAS, 102(43), 15545-15550.