Statistical Models - Purdue University

Big Data Training for Translational Omics Research

Statistical Models

Unit 3 Session 1

Outline

• Logistic regression: ROC curve and AUC
• Linear regression
• Kaplan-Meier plot and log-rank test
• Cox proportional hazards model

Logistic Model

• The logistic model is commonly used for case/control studies
• Usage scenario: the response is binary, say disease/healthy or recurrence/non-recurrence

$$\log\frac{p(\text{status} = \text{recurrence})}{1 - p(\text{status} = \text{recurrence})} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$$

• Here the $x_i$ are the predictors and the $\beta_i$ are the parameters of interest
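The link between the log-odds on the left-hand side and a probability can be checked numerically. The sketch below is illustrative only: the data are simulated and the names `x` and `status` are assumptions, not part of the course data. It fits a logistic model and confirms that applying the inverse logit (`plogis`) to the linear predictor gives the fitted probabilities.

```r
# Simulated data; the names x and status are illustrative only
set.seed(1)
x <- rnorm(200)
status <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x))

fit <- glm(status ~ x, family = binomial(link = "logit"))

# type = "link" returns the log-odds, type = "response" the probability
log_odds <- predict(fit, type = "link")
prob <- predict(fit, type = "response")

# plogis() is the inverse logit, so both routes should agree
max_diff <- max(abs(plogis(log_odds) - prob))
```

A positive coefficient for `x` means the odds of recurrence increase with `x`; `exp(coef(fit))` gives the corresponding odds ratios.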

Linear Model

• Response: continuous, say weight or gene expression
• Predictors: any variables (say gene expression)
• Model: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \epsilon$
• Assumption: the error terms are i.i.d., $\epsilon \sim N(0, \sigma^2)$
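As a small illustration of the model above, the following sketch simulates data with known coefficients (all values are made up) and compares the `lm` estimates with the solution of the normal equations. It is a sanity check under these assumptions, not part of the course scripts.

```r
# Simulated data with known coefficients (1, 2, -0.5) and sigma = 0.3
set.seed(2)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)

fit <- lm(y ~ x1 + x2)

# Same estimates from the normal equations (X'X) beta = X'y
X <- cbind(1, x1, x2)
beta_hat <- solve(crossprod(X), crossprod(X, y))
coef_diff <- max(abs(coef(fit) - as.vector(beta_hat)))

# Residual standard error, an estimate of sigma
sigma_hat <- summary(fit)$sigma
```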

Survival Methods

• Kaplan-Meier plot: visually checking the survival curves between groups
• Cox proportional hazards model and log-rank test: formal statistical tests
• Response: survival time (say DFS) and censor indicator
• Predictors: any variables (say group or specific genes)
• Coding: recurrence, censor = 1; non-recurrence, censor = 0
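To make the censor coding concrete, here is a hand-rolled Kaplan-Meier estimate in base R. The follow-up times are hypothetical and the helper `km_estimate` is an illustrative name. In practice `survfit` from the survival package does this work; the sketch only shows the product-limit formula S(t) = prod(1 - d_i / n_i) over the observed event times.

```r
# Hypothetical follow-up times (months) and event indicator:
# censor = 1 means recurrence was observed, censor = 0 means censored
time   <- c(2, 3, 3, 5, 7, 8, 10)
censor <- c(1, 1, 0, 1, 0, 1, 0)

km_estimate <- function(time, censor) {
  event_times <- sort(unique(time[censor == 1]))
  surv <- numeric(length(event_times))
  s <- 1
  for (i in seq_along(event_times)) {
    t_i <- event_times[i]
    n_i <- sum(time >= t_i)                # still at risk just before t_i
    d_i <- sum(time == t_i & censor == 1)  # events observed at t_i
    s <- s * (1 - d_i / n_i)
    surv[i] <- s
  }
  data.frame(time = event_times, surv = surv)
}

km <- km_estimate(time, censor)
```

Censored samples never trigger a drop in the curve, but they do leave the risk set, which is why the step at month 5 uses 4 samples at risk rather than 5.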

Load Data

• Toy example data

    toy_data <- read.csv("toy_example_data.csv")

Logistic Model

• Response: recurrence/non-recurrence status
• Predictor: the expression of gene HOXB13

    library(ROCR)  # provides prediction() and performance()

    # logistic regression: use gene HOXB13 to predict the recur/non-recur status
    fitlogistic <- glm(status ~ gene_HOXB13, data = toy_data, family = binomial(link = "logit"))
    summary(fitlogistic)

    # plot ROC curve
    p <- predict(fitlogistic, type = "response")
    pr <- prediction(p, toy_data$status)
    prf <- performance(pr, measure = "tpr", x.measure = "fpr")
    plot(prf, main = "ROC plot of logistic regression")

    # calculate the AUC
    auc <- performance(pr, measure = "auc")
    auc <- auc@y.values[[1]]
    auc
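The AUC reported above can also be read as the probability that a randomly chosen recurrence case gets a higher score than a randomly chosen non-recurrence case. The sketch below computes it from ranks in base R. The scores and labels are hypothetical and `auc_rank` is an illustrative helper, not a ROCR function.

```r
# Hypothetical predicted probabilities and true labels (1 = recurrence)
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)
label <- c(1,   1,   0,   1,   0,    1,   0,   0)

auc_rank <- function(score, label) {
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  r <- rank(score)  # average ranks are used for ties
  # Mann-Whitney form: rank sum of positives, shifted and scaled to [0, 1]
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_rank(score, label)
```

Here 13 of the 16 positive/negative pairs are ordered correctly, so the AUC is 13/16 = 0.8125.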

Logistic Regression Result

ROC Curve

Linear Model

• Response: expression of HOXB13
• Predictor: expression of IL17BR

    # linear model: use gene IL17BR to predict another gene, HOXB13
    fitlm <- lm(gene_HOXB13 ~ gene_IL17BR, data = toy_data)
    summary(fitlm)

Linear Regression Result

Kaplan-Meier Plot

• We use the Kaplan-Meier plot and the log-rank test to check whether survival time differs significantly between groups (say high/low ratio groups)

    library(survival)  # provides Surv() and survfit()

    ratiosurv <- survfit(Surv(time, censor) ~ ratio_group, data = toy_data)
    autoplot(ratiosurv, pVal = TRUE, pX = 0.25, pY = 0.25, title = paste0("Kaplan-Meier plot of toy example"), yLab = "Survival Probability")
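The log-rank test itself can be run with `survdiff` from the survival package. The following is a minimal sketch on made-up data: the column names follow the slides, the values are hypothetical. The p-value is computed from the chi-square statistic with (number of groups - 1) degrees of freedom.

```r
library(survival)

# Hypothetical toy data: column names follow the slides, values are made up
toy <- data.frame(
  time        = c(2, 3, 4, 6, 9, 5, 8, 12, 15, 20),
  censor      = c(1, 1, 1, 1, 0, 1, 0, 1, 0, 0),
  ratio_group = rep(c("high", "low"), each = 5)
)

# Log-rank test of equal survival curves across groups
lr <- survdiff(Surv(time, censor) ~ ratio_group, data = toy)

# Chi-square statistic with (number of groups - 1) degrees of freedom
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
```

Printing `lr` shows the observed and expected event counts per group, which is often more informative than the p-value alone.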

Kaplan-Meier Plot

Cox Proportional Hazards Model

• We use the high/low ratio group to predict the survival probability. Here the response is the survival time together with the censor information

    fitcox <- coxph(Surv(time, censor) ~ group, data = toy_data)
    summary(fitcox)
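The main quantity to read off a Cox fit is the hazard ratio, exp(coef). The sketch below uses hypothetical data (same column names as the slides, made-up values) and sets "low" as the reference level, so the hazard ratio compares the "high" group against the "low" group.

```r
library(survival)

# Hypothetical toy data; "low" is the reference level of the group factor
toy <- data.frame(
  time   = c(2, 3, 4, 6, 9, 5, 8, 12, 15, 20),
  censor = c(1, 1, 1, 1, 0, 1, 0, 1, 0, 0),
  group  = factor(rep(c("high", "low"), each = 5), levels = c("low", "high"))
)

fit <- coxph(Surv(time, censor) ~ group, data = toy)

# exp(coef) is the hazard ratio of the "high" group relative to "low"
hazard_ratio <- unname(exp(coef(fit)))

# Columns of conf.int: exp(coef), exp(-coef), lower .95, upper .95
ci <- summary(fit)$conf.int
```

A hazard ratio above 1 means the "high" group recurs faster. With only ten samples the confidence interval is wide, so the numbers here only illustrate where to look in the output.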

Cox Model Result

Data Downloading, Processing and Analysis

Outline

• Download data
• Parsing data
• Normalization
• Variance-based filtering (top 25%)
• T-test-based filtering (based on the P-value cutoff)

The above steps are implemented in the "get_DEG_table.R" script

Data Availability

• Microdissected dataset: GSE1378
  http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1378
• Whole tissue dataset: GSE1379
  http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379
• The easiest way to download the data is the "getGEO" function from the "GEOquery" package

Use "getGEO" to Download Data

• We have already downloaded the data; you can use the "getGEO" function to get the data locally or online
• Local (loading_method = 'local')

    geo_Name <- 'GSE1378'
    geodata2 <- getGEO(filename = paste0("geo_data/", geo_Name, "_series_matrix.txt.gz"), GSEMatrix = TRUE)

• Online (loading_method = 'online')

    geodata <- getGEO(geo_Name, GSEMatrix = TRUE, destdir = "geo_data")

• Set the loading_method variable in the get_DEG_table function to "local" or "online" to change how the data are loaded
• Note that the downloaded geno matrix is in log2 scale

Parsing Data

• Extract the geno matrix, pheno table and feature table

    idx <- 1
    geno <- assayData(geodata[[idx]])$exprs
    pheno <- pData(phenoData(geodata[[idx]]))
    feature <- as(featureData(geodata[[idx]]), "data.frame")

• Parse the phenotype table to get the variables Age, Size, DFS and censor

    # n is the number of samples; each entry has the form key=value,
    # so the even positions after splitting on "=" hold the values
    infos_df$Age = as.numeric(unlist(strsplit(infos_df$X9, split = "="))[seq(2, 2 * n, 2)])
    infos_df$Size = as.numeric(unlist(strsplit(infos_df$X3, split = "="))[seq(2, 2 * n, 2)])
    infos_df$DFS = as.numeric(unlist(strsplit(infos_df$X10, split = "="))[seq(2, 2 * n, 2)])
    infos_df$censor = ifelse(infos_df$status == "Status=recur", 1, 0)
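The parsing step can be tried on a few strings without downloading anything. In the sketch below the "key=value" strings are hypothetical stand-ins for the phenotype columns; in particular the label "Status=non-recur" is an assumption. Two equivalent ways of extracting the value are shown.

```r
# Hypothetical phenotype strings in "key=value" form
pheno_col <- c("Age=45", "Age=61", "Age=38")
status    <- c("Status=recur", "Status=non-recur", "Status=recur")

# Drop everything up to and including "=", then convert to numeric
age <- as.numeric(sub("^[^=]*=", "", pheno_col))

# Same result with strsplit(), taking the second piece of each entry
age2 <- as.numeric(vapply(strsplit(pheno_col, "=", fixed = TRUE),
                          function(parts) parts[2], character(1)))

# Recurrence is coded as censor = 1, everything else as 0
censor <- ifelse(status == "Status=recur", 1, 0)
```

Working entry by entry avoids the index arithmetic over the flattened `unlist` result and still works if a value is missing for some sample.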

Normalization

• Gene-wise normalization (subtract each sample's median log2 value, taken across genes)

    tmp_gm <- apply(geno, 2, median)
    geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)

• Sample-wise normalization (divide each gene by its mean value across samples, in the original scale)

    geno <- apply(geno, c(1, 2), function(x) 2 ^ x)
    geno <- t(apply(geno, 1, function(x) x / (mean(x))))
    geno <- apply(geno, c(1, 2), function(x) log2(x))
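The two normalization steps can be written more compactly with `sweep`. This is a sketch on a small hypothetical matrix with genes in rows and samples in columns. It is intended to do the same thing as the code above, not to replace the course script.

```r
# Hypothetical log2 expression matrix: 4 genes (rows) x 3 samples (columns)
geno <- matrix(c(5, 7, 9, 11,
                 6, 8, 10, 12,
                 4, 6, 8, 14),
               nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Step 1: subtract each sample's (column's) median log2 value
geno_centered <- sweep(geno, 2, apply(geno, 2, median), "-")

# Step 2: go back to the original scale, divide each gene (row) by its
# mean across samples, then return to the log2 scale
raw <- 2 ^ geno_centered
geno_norm <- log2(sweep(raw, 1, rowMeans(raw), "/"))
```

After step 1 every sample has median 0 on the log2 scale; after step 2 every gene has mean 1 on the original scale.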

Variance-Based Filtering

• Calculate the variance of each gene and keep the genes in the top 25% by variance (above the 75th percentile)

    var_geno <- apply(geno, 1, var)
    var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
    feature_var_filtered <- feature[var_filtered_idx, ]
    geno_var_filtered <- geno[var_filtered_idx, ]
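A small worked example of the variance filter, on a hypothetical matrix built so that gene i has variance proportional to i squared. With 8 genes, the 75th-percentile cutoff keeps the two most variable ones.

```r
# Hypothetical matrix: 8 genes (rows) x 4 samples; gene i has spread proportional to i
geno <- t(sapply(1:8, function(i) i * c(-1, 0, 0, 1)))
rownames(geno) <- paste0("gene", 1:8)

# Per-gene variance and the top-25% filter
var_geno <- apply(geno, 1, var)
keep <- var_geno > quantile(var_geno, 0.75)

# drop = FALSE keeps a matrix even if only one gene survives the filter
geno_filtered <- geno[keep, , drop = FALSE]
```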

T-Test-Based Filtering

• For each gene, do a t-test between the recurrence and non-recurrence groups. The status variable indicates the group information

    tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
    pvalue_list[i] <- tmp_test$p.value

• Filter the genes by the P-value cutoff

    ttest_filtered_idx <- which(pvalue_list < cutoff)
    feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
    geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
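The slide shows only the body of the per-gene loop. A complete loop might look like the sketch below, under stated assumptions: the data are hypothetical, and the cutoff of 0.05 is illustrative, not the one used for the sample results.

```r
# Hypothetical data: 2 genes (rows) x 8 samples, 4 samples per group
status <- factor(rep(c("non-recur", "recur"), each = 4))
geno <- rbind(
  geneA = c(1, 2, 3, 4, 11, 12, 13, 14),  # clearly shifted between groups
  geneB = c(1, 2, 3, 4,  2,  1,  4,  3)   # identical group means
)

# One two-sided t-test per gene, collecting the p-values
pvalue_list <- numeric(nrow(geno))
for (i in seq_len(nrow(geno))) {
  sdata <- data.frame(gene_express = geno[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

cutoff <- 0.05  # illustrative cutoff only
ttest_filtered_idx <- which(pvalue_list < cutoff)
```

Only geneA passes the filter; geneB has equal group means, so its t statistic is 0 and its p-value is 1. Note that `t.test` uses the Welch (unequal variance) version by default.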

Sample Results (GSE1378, microdissected; cutoff 0.0011)

Sample Results (GSE1379, whole tissue dataset; cutoff 0.0011)

Statistical Modeling (Examples)

Outline

• Select the overlapped genes between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and dendrogram
• Univariate logistic regression for the selected genes and the two-gene ratio predictor
• Multivariate logistic regression (size and the other two potential predictors)
• Survival analysis part 1: Kaplan-Meier plot
• Survival analysis part 2: Cox proportional hazards model

Overlapped Genes

• In the preprocessing step we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379
• We used the genes that overlap between these two DEG tables for the subsequent analysis
• GSE1378: micro-dissected breast cancer cells (LCM)
• GSE1379: whole tissue sections
• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111) and AI240933 (EST)
• We will study the prognostic value of these markers

Heatmap and Dendrogram

• We use a heatmap and dendrogram to visually check the relationship (correlation) among genes or samples
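The dendrogram beside a heatmap comes from hierarchical clustering. A minimal sketch of that step follows, assuming a correlation-based distance and average linkage. Both are illustrative choices, since the slides do not state which were used, and the expression values are hypothetical.

```r
# Hypothetical expression matrix: 4 genes (rows) x 5 samples
expr <- rbind(
  g1 = c(1, 2, 3, 4, 5),
  g2 = c(2, 4, 6, 8, 10.5),  # rises with g1
  g3 = c(5, 4, 3, 2, 1),     # falls as g1 rises
  g4 = c(10, 8, 6, 4, 2.5)   # falls with g3
)

# Correlation-based distance between genes:
# 0 = perfectly correlated, 2 = perfectly anti-correlated
gene_dist <- as.dist(1 - cor(t(expr)))

# Average-linkage hierarchical clustering; plot(hc) would draw the dendrogram
hc <- hclust(gene_dist, method = "average")
clusters <- cutree(hc, k = 2)
```

Cutting the tree into two groups puts g1 with g2 and g3 with g4, which is the block structure one would expect to see in the heatmap.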

Heatmap (microdissected, GSE1378)

Consistent with the paper

Heatmap (whole section tissue, GSE1379)

Model Set 1

• Univariate logistic regression for each gene
  – Response variable: recur/non-recur status
  – Predictors: one of the overlapped genes, HOXB13, IL17BR (AF208111) or AI240933 (EST)

Model Set 2

• Univariate logistic regression for the ratio of genes
  – Response variable: recur/non-recur status
  – Predictor: HOXB13/IL17BR

Model Set 3

• Multivariate logistic regression
  – Response variable: recur/non-recur status
  – Predictors: tumor size, HOXB13/IL17BR, PGR and ERBB2

Model Set 4

• Survival model
  – Response variable: DFS (disease-free survival time) and censor
  – Predictor: use "-intercept/beta" from the logistic regression as the cutoff to divide the samples into two groups, a high ratio group and a low ratio group
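The "-intercept/beta" cutoff is the ratio value at which the fitted log-odds are zero, that is, where the predicted recurrence probability is exactly 0.5. The sketch below checks this on simulated data. All values are illustrative, and "high" corresponds to higher predicted risk only when beta is positive.

```r
# Simulated log2 gene ratio and recurrence status; values are illustrative only
set.seed(3)
gene_ratio <- rnorm(150)
status <- rbinom(150, size = 1, prob = plogis(0.6 + 1.5 * gene_ratio))

fit <- glm(status ~ gene_ratio, family = binomial(link = "logit"))
b <- coef(fit)

# At ratio = -intercept/beta the log-odds are 0, so the fitted probability is 0.5
cutoff <- unname(-b[1] / b[2])
prob_at_cutoff <- unname(plogis(b[1] + b[2] * cutoff))

# Samples above the cutoff form the "high" ratio group
ratio_group <- ifelse(gene_ratio > cutoff, "high", "low")
```

The resulting `ratio_group` vector is the kind of two-level grouping used as the predictor in the Kaplan-Meier plot and the Cox model.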

Important Note

• Please remember there are two datasets, GSE1378 and GSE1379
• The same sets of models can be fitted on both datasets
• You need to set the working dataset variable:

    working_dataset = "GSE1379"  # whole tissue section
    working_dataset = "GSE1378"  # microdissected breast cancer cells

• We use the working dataset GSE1378 as the example

Univariate Logistic Regression for Each Gene

• As an example we check the gene HOXB13

    gb_acc = "BC007092"  # HOXB13
    geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
    logit_data = data.frame(status = infos_df$status, gene = geno_selected)
    # the predictor is the gene column of logit_data
    fit <- glm(status ~ gene, data = logit_data, family = binomial(link = "logit"))
    p <- predict(fit, type = "response")
    pr <- prediction(p, infos_df$status)
    prf <- performance(pr, measure = "tpr", x.measure = "fpr")
    plot(prf, main = paste0("ROC plot of gene ", gb_acc))
    auc <- performance(pr, measure = "auc")
    auc <- auc@y.values[[1]]
    auc

Sample Output (gene HOXB13)

ROC (AUC = 0.796, gene HOXB13)

Univariate Logistic Regression (HOXB13/IL17BR)

    gb_acc1 = "BC007092"  # HOXB13
    gb_acc2 = "AF208111"  # IL17BR
    geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
    geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
    # in the log2 scale the ratio is the difference
    gene_ratio = geno_selected1 - geno_selected2
    logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)
    # fit the model; the predictor is the ratio column of logit_data
    fit <- glm(status ~ ratio, data = logit_data, family = binomial(link = "logit"))
    summary(fit)
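The comment "in the log2 scale the ratio is the difference" is the identity log2(a / b) = log2(a) - log2(b). A quick numerical check, with made-up raw-scale expression values:

```r
# Hypothetical raw-scale expression values for two genes across 4 samples
hoxb13 <- c(120, 300, 80, 950)
il17br <- c(240, 150, 80, 95)

# log2 of the ratio equals the difference of the log2 values
ratio_direct <- log2(hoxb13 / il17br)
ratio_diff <- log2(hoxb13) - log2(il17br)
```

A log2 ratio of -1 means HOXB13 is at half the level of IL17BR, 0 means equal levels, and +1 means double.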

Sample Output (HOXB13/IL17BR)

ROC (AUC = 0.84, HOXB13/IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)

    gb_acc1 = "BC007092"  # HOXB13
    gb_acc2 = "AF208111"  # IL17BR
    gene_name3 = "PGR_3UTR1"  # PGR
    gene_name4 = "BF108852"  # ERBB2
    geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
    geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
    geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
    geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
    # in the log2 scale the ratio is the difference
    gene_ratio = geno_selected1 - geno_selected2
    logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
    # fit the multivariate logistic regression
    fit <- glm(status ~ ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
    summary(fit)

Sample Output (Multivariate)

ROC (AUC = 0.86, Multivariate)

Kaplan-Meier Plot (gene ratio high/low group, cutoff = -1.2)

Cox Proportional Hazards Model (gene ratio high/low group, cutoff = -1.2)

    fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
    summary(fitcox)

Sample Output (Cox)

Validation: GSE6532

• Link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Total number of markers: 54675
• The genes HOXB13 and IL17RB and the ESTs are included in this dataset
• We use this dataset for validation
• Result: they are not significant on this independent set

Page 2: Statistical Models - Purdue University · Big Data Training for Translational Omics Research Survival Methods • Kaplan-Meier plot: visually checking the survival curve between groups

Big Data Training for Translational Omics Research

Outline

bull Logistic Regressionndash ROC curve and AUC

bull Linear Regression

bull Kaplan-Meier plot and log-rank test

bull Cox Proportional odds model

Big Data Training for Translational Omics Research

Logistic Model

bull Logistic model is used for casecontrol study

bull Usage scenario when the response is binary say diseasehealthy or recurrencenon-recurrence

log119901 119904119905119886119905119906119904 = 119903119890119888119906119903119903119890119899119888119890

1 minus 119901 119904119905119886119905119906119904 = 119903119890119888119906119903119903119890119899119888119890= β0 + 12057311199091 +⋯+ 120573119899119909119899

bull Where 119909119894 are predictors and 120573119894 are the parameters of interest

Big Data Training for Translational Omics Research

Linear Model

bull Response continuous say weight or

gene expression

bull Predictors any variables (say gene

expression)

bull Model 119910 = β0 + 12057311199091 +⋯+ 120573119899119909119899 + 120598

bull Assumptions error term 120598 sim 119894119894119889 119873 0 1205902

Big Data Training for Translational Omics Research

Survival Methods

bull Kaplan-Meier plot visually checking the survival curve

between groups

bull Cox Proportional odds model and log-rank test as formal

statistical test

bull Response survival time (say DFS) and censor

bull Predictors any variables (say group or specific genes)

bull Recurrence censor = 1 and Non-recurrence censor = 0

Big Data Training for Translational Omics Research

Load data

bull Toy example datatoy_datalt- readcsv(toy_example_datacsv)

Big Data Training for Translational Omics Research

Logistic Model

bull Response --- recurrencenon-recurrence status

bull Predictor --- the expression of gene HOXB13

logistic regresion use gene HOXB13 to predict the recurnon-recur status

fitlogistic lt- glm(status~ gene_HOXB13data = toy_datafamily = binomial(link = logit))

summary(fitlogistic)

plot ROC curvep lt- predict(fitlogistic type=response)

pr lt- prediction(p toy_data$status)

prf lt- performance(pr measure = tpr xmeasure = fpr)

plot(prfmain=ROC plot of logistic regression)

calculate the auc

auc lt- performance(pr measure = auc)

auc lt- aucyvalues[[1]]auc

Big Data Training for Translational Omics Research

Logistic Regression Result

Big Data Training for Translational Omics Research

ROC Curve

Big Data Training for Translational Omics Research

Linear Model

bull Response --- expression of HOXB13

bull Predictor --- expression of IL17BR linear model use gene IL17BR to predict another gene HOXB13

HOXB13fitlmlt- lm(gene_HOXB13~gene_IL17BRdata = data_toy)

summary(fitlm)

Big Data Training for Translational Omics Research

Linear Regression Result

Big Data Training for Translational Omics Research

Kaplan-Meier Plot

bull We use Kaplan-Meier plot and log-rank

test to check whether the survival time is

significantly different from each other

between groups (say highlow ratio

group)

ratiosurv lt- survfit(Surv(timecensor) ~ ratio_group data = toy_data)

autoplot(ratiosurvpVal = TpX=025pY =025title = paste0(Kaplan-Meier plot of

toy example )yLab = Survival Probability)

Big Data Training for Translational Omics Research

Kaplan-Meier Plot

Big Data Training for Translational Omics Research

Cox Proportional Odds Model

bull We use highlow ratio group to predict the

survival probability Here the response is

the survival time and the censor

information

fitcox lt- coxph(Surv(timecensor) ~ group data = toy_data)

summary(fitcox)

Big Data Training for Translational Omics Research

Cox Model Result

Big Data Training for Translational Omics Research

Data Downloading Processing

and Analysis

Big Data Training for Translational Omics Research

Outline

bull Download data

bull Parsing data

bull Normalization

bull Variance based filtering (top 25)

bull T test based filtering(based on the P-value cutoff)

The above steps are implemented in

ldquoget_DEG_tableRrdquo script

Big Data Training for Translational Omics Research

Data Availability

bull Microdissected dataset GSE1378

httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE

1378

bull Whole tissue dataset GSE1379

httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE

1379

bull The easiest way to download data is using ldquogetGEOrdquo

function from ldquoGEOqueryrdquo package

Big Data Training for Translational Omics Research

Use ldquogetGEOrdquo to Download Databull We have downloaded the data you can use

ldquogetGEOrdquo function to get data locally or online

bull Local (loading_method = lsquolocalrsquo)geo_Name lt- lsquoGSE1378rsquo

geodata2 lt-getGEO(filename paste0(geo_datageo_Name_series_matrixtxtgz) GSEMatrix = TRUE)

bull Online (loading_method = lsquoonlinersquo)geodata lt- getGEO(geo_Name GSEMatrix = TRUEdestdir = geo_data)

bull You can set loading_method variable in the get_DEG_table function to rdquolocalrdquo or ldquoonlinerdquo to change the way of downloading data

bull Note that the downloaded geno matrix is in log2scale

Big Data Training for Translational Omics Research

Parsing Data

bull Extract the geno matrix pheno table and

feature tableidx lt- 1 geno lt- assayData(geodata[[idx]])$exprs

pheno lt- pData(phenoData(geodata[[idx]]))

feature lt- as(featureData(geodata[[idx]]) dataframe)

bull Parsing phenotype table to get variable

Age Size DFS censorinfos_df$Age = asnumeric(unlist(strsplit(infos_df$X9 split = =))[seq(2 2 n 2)])

infos_df$Size = asnumeric(unlist(strsplit(infos_df$X3 split = =))[seq(2 2 n 2)])

infos_df$DFS = asnumeric(unlist(strsplit(infos_df$X10 split = =))[seq(2 2 n 2)])

infos_df$censor = ifelse(infos_df$status == Status=recur 1 0)

Big Data Training for Translational Omics Research

Normalization

bull Gene wise normalization (subtract the

median log2 value)tmp_gm lt- apply(geno 2 median)

geno lt- geno - matrix(rep(1 numOfGene) numOfGene 1)

matrix(tmp_gm 1 n)

bull Sample wise normalization (divided

by mean value in original scale)geno lt- apply(geno c(1 2) function(x) 2 ^ x )

geno lt- t(apply(geno 1 function(x) x (mean(x)) ))

geno lt- apply(geno c(1 2) function(x) log2(x) )

Big Data Training for Translational Omics Research

Variance Based Filtering

bull Calculate the variance for each gene and

choose the top 25 variance based filtering (75th percentile)

var_geno lt- apply(geno 1 var)

var_filtered_idx lt- var_geno gt quantile(var_geno 075)

feature_var_filtered lt- feature[var_filtered_idx]

geno_var_filtered lt- geno[var_filtered_idx]

Big Data Training for Translational Omics Research

T test Based Filtering

bull For each gene do T test between the

recurrence and non-recurrence group

The status variable indicates the group

informationtmp_test lt- ttest(gene_express ~ status data = sdata alternative =

twosided)

pvalue_list[i] lt- tmp_test$pvalue

bull Fitering the gene by the P-value cutoffttest_filtered_idx lt- which(pvalue_list lt cutoff)

feature_ttest_filtered lt- feature_var_filtered[ttest_filtered_idx]

geno_ttest_filtered lt- geno_var_filtered[ttest_filtered_idx]

Big Data Training for Translational Omics Research

Sample Results

(GSE1378microdissected 00011 cutoff)

Big Data Training for Translational Omics Research

Sample Results

(GSE1379 whole tissue dataset cutoff 00011)

Big Data Training for Translational Omics Research

Statistical Modeling(examples)

Big Data Training for Translational Omics Research

Outline

bull Select overlapped genes between GSE1378 and

GSE1379 for subsequent analysis

bull Heatmap and Dendrogram

bull Univariate logistic regression for selected genes and

two-gene ratio predictor

bull Multivariate logistic regression (size and the other two

potential predictors)

bull Survival analysis part 1 Kaplan-Meier plot

bull Survival analysis part 2 Cox proportional odds model

Big Data Training for Translational Omics Research

Overlapped Genes

bull In the prepossessing step we obtained two DEG tables

for the datasets GSE1378 and GSE1379

bull We used the overlapped genes in this two DEG tables

for the subsequent analysis

bull GSE1378 Micro-dissected breast cancer cell (LCM)

bull GSE1379 Whole tissue section

bull The overlapped genes are HOXB13 (identified twice as

AI208111 and BC007092) IL17BR (AF2080111) and

AI240933 (EST)

bull We will study the prognostic value of these markers

Big Data Training for Translational Omics Research

Heatmap and Dendrogram

bull We use Heatmap and Dendrogram to

Visually check the relationship

(correlation) among genes or samples

Big Data Training for Translational Omics Research

Heatmap(microdissectedGSE1378)

consistent with the paper

Big Data Training for Translational Omics Research

Heatmap(whole section tissue GSE 1379)

Big Data Training for Translational Omics Research

Model Set 1

bull Univariate logistic regression for each

gene

ndash Response variable recurnon-recur

status

ndash Predictors one of the overlapped

genes HOXB13 IL17BR(AF2080111)

AI240933(EST)

Big Data Training for Translational Omics Research

Model Set 2

bull Univariate logistic regression for

ratio of genes

ndash Response variable recurnon-recur

status

ndash Predictors HOXB13IL17BR

Big Data Training for Translational Omics Research

Model Set 3

bull Multivariate logistic regression

ndash Response variable recurnon-

recur

ndash Predictors tumor size

HOXB13IL17BR PGR and ERBB2

Big Data Training for Translational Omics Research

Model Set 4

bull Survival model

ndash Response variable DFS (disease free

survival time) censor

ndash Predictor use ldquo-interceptbetardquo from

logistic regression as the cutoff to

divide the sample into two groups high

ratio group and low ratio group

Big Data Training for Translational Omics Research

Important Note

bull Please remember there are two datasets GSE1378 and GSE1379

bull Can fit the same sets of model on these two datasets

bull Need to set the working dataset variable

working_dataset = GSE1378 whole tissue sectionGSE1379

working_dataset = GSE1378 microdissected breast cancer cells

GSE1378

bull Use working dataset GSE1378 as example

Big Data Training for Translational Omics Research

Univariate Logistic Regression

for Each Gene

bull As an example we check the gene HOXB13gb_acc = BC007092 HOXB13

geno_selected = geno[which(feature$GB_ACC == gb_acc)]

logit_data = dataframe(status = infos_df$statusgene = geno_selected )

fit lt- glm(status~ geno_selecteddata = logit_datafamily = binomial(link = logit))

p lt- predict(fit type=response)

pr lt- prediction(p infos_df$status)

prf lt- performance(pr measure = tpr xmeasure = fpr)

plot(prfmain=paste0(ROC plot of gene gb_acc))

auc lt- performance(pr measure = auc)

auc lt- aucyvalues[[1]]

auc

Big Data Training for Translational Omics Research

Sample Output (gene HOXB13 )

Big Data Training for Translational Omics Research

ROC (auc 0796 gene HOXB13 )

Big Data Training for Translational Omics Research

Univariate Logistic Regression (HOXB13IL17BR)

gb_acc1 = BC007092 HOXB13

gb_acc2 = AF208111 IL17BR

geno_selected1 = geno[which(feature$GB_ACC == gb_acc1)]

geno_selected2 = geno[which(feature$GB_ACC == gb_acc2)]

in the log2 scale the ratio is the difference

gene_ratio = geno_selected1-geno_selected2

logit_data = dataframe(status = infos_df$statusgene1 = geno_selected1 gene2 =

geno_selected2ratio =gene_ratio)

fit the model

fit lt- glm(status~ gene_ratiodata = logit_datafamily = binomial(link = logit))

summary(fit)

Big Data Training for Translational Omics Research

Sample Output(HOXB13IL17BR)

Big Data Training for Translational Omics Research

ROC (auc=084 HOXB13IL17BR)

Big Data Training for Translational Omics Research

Multivariate Logistic Regression(tumor size gene ratio PGR ERBB2)

gb_acc1 = BC007092 HOXB13

gb_acc2 = AF208111 IL17BR

gene_name3 = PGR_3UTR1 PGR

gene_name4 = BF108852 ERBB2

geno_selected1 = geno[which(feature$GB_ACC == gb_acc1)]

geno_selected2 = geno[which(feature$GB_ACC == gb_acc2)]

geno_selected3 = geno[which(feature$GeneName == gene_name3)]

geno_selected4 = geno[which(feature$GeneName == gene_name4)]

in the log2 scale the ratio is the difference

gene_ratio = geno_selected1-geno_selected2

logit_data = dataframe(status = infos_df$statussize = infos_df$Sizegene1 = geno_selected1 gene2 =

geno_selected2ratio =gene_ratiogene3= geno_selected3gene4= geno_selected4)

fit the multinvariate logistic regression

fit lt- glm(status~ gene_ratio+size+gene3+gene4data = logit_datafamily = binomial(link = logit))

summary(fit)

Big Data Training for Translational Omics Research

Sample Output (Multivariate)

Big Data Training for Translational Omics Research

ROC (auc = 086 Multivariate )

Big Data Training for Translational Omics Research

Kaplan-Meier Plot

(gene ratio highlow group cutoff = -12)

Big Data Training for Translational Omics Research

Cox Proportional Odds Model

(gene ratio highlow group cutoff = -12)

fitcox lt- coxph(Surv(timecensor) ~ group data = surv_data)

summary(fitcox)

Big Data Training for Translational Omics Research

Sample Output (Cox)

Big Data Training for Translational Omics Research

Validation GSE6532

bull The link to this dataset

httpwwwncbinlmnihgovgeoqueryacccgiacc=gse6532

bull Sample size87

bull Number of total markers 54675

bull Gene HOXB13IL17RB and ESTs are included in this dataset

bull We use this dataset as validation

bull Result They are not significant on this independent set

Page 3: Statistical Models - Purdue University · Big Data Training for Translational Omics Research Survival Methods • Kaplan-Meier plot: visually checking the survival curve between groups

Big Data Training for Translational Omics Research

Logistic Model

bull Logistic model is used for casecontrol study

bull Usage scenario when the response is binary say diseasehealthy or recurrencenon-recurrence

log119901 119904119905119886119905119906119904 = 119903119890119888119906119903119903119890119899119888119890

1 minus 119901 119904119905119886119905119906119904 = 119903119890119888119906119903119903119890119899119888119890= β0 + 12057311199091 +⋯+ 120573119899119909119899

bull Where 119909119894 are predictors and 120573119894 are the parameters of interest

Big Data Training for Translational Omics Research

Linear Model

bull Response continuous say weight or

gene expression

bull Predictors any variables (say gene

expression)

bull Model 119910 = β0 + 12057311199091 +⋯+ 120573119899119909119899 + 120598

bull Assumptions error term 120598 sim 119894119894119889 119873 0 1205902

Big Data Training for Translational Omics Research

Survival Methods

bull Kaplan-Meier plot visually checking the survival curve

between groups

bull Cox Proportional odds model and log-rank test as formal

statistical test

bull Response survival time (say DFS) and censor

bull Predictors any variables (say group or specific genes)

bull Recurrence censor = 1 and Non-recurrence censor = 0

Big Data Training for Translational Omics Research

Load data

bull Toy example datatoy_datalt- readcsv(toy_example_datacsv)

Big Data Training for Translational Omics Research

Logistic Model

bull Response --- recurrencenon-recurrence status

bull Predictor --- the expression of gene HOXB13

logistic regresion use gene HOXB13 to predict the recurnon-recur status

fitlogistic lt- glm(status~ gene_HOXB13data = toy_datafamily = binomial(link = logit))

summary(fitlogistic)

plot ROC curvep lt- predict(fitlogistic type=response)

pr lt- prediction(p toy_data$status)

prf lt- performance(pr measure = tpr xmeasure = fpr)

plot(prfmain=ROC plot of logistic regression)

calculate the auc

auc lt- performance(pr measure = auc)

auc lt- aucyvalues[[1]]auc

Big Data Training for Translational Omics Research

Logistic Regression Result

Big Data Training for Translational Omics Research

ROC Curve

Big Data Training for Translational Omics Research

Linear Model

bull Response --- expression of HOXB13

bull Predictor --- expression of IL17BR linear model use gene IL17BR to predict another gene HOXB13

HOXB13fitlmlt- lm(gene_HOXB13~gene_IL17BRdata = data_toy)

summary(fitlm)

Big Data Training for Translational Omics Research

Linear Regression Result

Big Data Training for Translational Omics Research

Kaplan-Meier Plot

bull We use Kaplan-Meier plot and log-rank

test to check whether the survival time is

significantly different from each other

between groups (say highlow ratio

group)

ratiosurv lt- survfit(Surv(timecensor) ~ ratio_group data = toy_data)

autoplot(ratiosurvpVal = TpX=025pY =025title = paste0(Kaplan-Meier plot of

toy example )yLab = Survival Probability)

Big Data Training for Translational Omics Research

Kaplan-Meier Plot

Big Data Training for Translational Omics Research

Cox Proportional Odds Model

bull We use highlow ratio group to predict the

survival probability Here the response is

the survival time and the censor

information

fitcox lt- coxph(Surv(timecensor) ~ group data = toy_data)

summary(fitcox)

Big Data Training for Translational Omics Research

Cox Model Result

Big Data Training for Translational Omics Research

Data Downloading Processing

and Analysis

Big Data Training for Translational Omics Research

Outline

bull Download data

bull Parsing data

bull Normalization

bull Variance based filtering (top 25)

bull T test based filtering(based on the P-value cutoff)

The above steps are implemented in

ldquoget_DEG_tableRrdquo script

Data Availability

• Microdissected dataset: GSE1378
  http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1378
• Whole tissue dataset: GSE1379
  http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379
• The easiest way to download data is using the "getGEO" function from the "GEOquery" package

Use "getGEO" to Download Data

• We have already downloaded the data; you can use the "getGEO" function to get the data locally or online
• Local (loading_method = 'local')
geo_Name <- 'GSE1378'
geodata2 <- getGEO(filename = paste0("geo_data/", geo_Name, "_series_matrix.txt.gz"), GSEMatrix = TRUE)
• Online (loading_method = 'online')
geodata <- getGEO(geo_Name, GSEMatrix = TRUE, destdir = "geo_data")
• You can set the loading_method variable in the get_DEG_table function to "local" or "online" to change how the data are loaded
• Note that the downloaded geno matrix is in log2 scale
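One possible way to wrap the local/online switch is sketched below. The helpers `series_matrix_path` and `load_geo` are hypothetical names, not part of the course script, and the `geo_data` directory is an assumption; `getGEO` is called only if GEOquery is installed.

```r
# Hypothetical helper: build the path of a locally stored series matrix file
series_matrix_path <- function(geo_name, data_dir = "geo_data") {
  file.path(data_dir, paste0(geo_name, "_series_matrix.txt.gz"))
}

# Hypothetical wrapper around GEOquery::getGEO for the two loading methods
load_geo <- function(geo_name, loading_method = c("local", "online"),
                     data_dir = "geo_data") {
  loading_method <- match.arg(loading_method)
  if (!requireNamespace("GEOquery", quietly = TRUE)) {
    stop("The GEOquery package is required")
  }
  if (loading_method == "local") {
    path <- series_matrix_path(geo_name, data_dir)
    if (!file.exists(path)) stop("File not found: ", path)
    GEOquery::getGEO(filename = path, GSEMatrix = TRUE)
  } else {
    # Needs network access; files are cached in data_dir
    GEOquery::getGEO(geo_name, GSEMatrix = TRUE, destdir = data_dir)
  }
}

local_path <- series_matrix_path("GSE1378")
```

Checking that the file exists before calling `getGEO` gives a clearer error than letting the parser fail on a missing path.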

Parsing Data

• Extract the geno matrix, pheno table, and feature table
idx <- 1
geno <- assayData(geodata[[idx]])$exprs
pheno <- pData(phenoData(geodata[[idx]]))
feature <- as(featureData(geodata[[idx]]), "data.frame")
• Parse the phenotype table to get the variables Age, Size, DFS, and censor
infos_df$Age = as.numeric(unlist(strsplit(infos_df$X9, split = "="))[seq(2, 2 * n, 2)])
infos_df$Size = as.numeric(unlist(strsplit(infos_df$X3, split = "="))[seq(2, 2 * n, 2)])
infos_df$DFS = as.numeric(unlist(strsplit(infos_df$X10, split = "="))[seq(2, 2 * n, 2)])
infos_df$censor = ifelse(infos_df$status == "Status=recur", 1, 0)
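The parsing pattern above splits each "key=value" string and keeps every second element. A small sketch on made-up strings (not real GEO records; the column names only mimic the slide) shows what it produces:

```r
# Made-up phenotype strings in "key=value" form
pheno_demo <- data.frame(
  X9     = c("Age=62", "Age=55", "Age=70"),
  X3     = c("Size=1.7", "Size=2.5", "Size=1.1"),
  status = c("Status=recur", "Status=non-recur", "Status=recur"),
  stringsAsFactors = FALSE
)
n <- nrow(pheno_demo)

# unlist(strsplit(...)) gives "Age","62","Age","55",...; even positions are values
age <- as.numeric(unlist(strsplit(pheno_demo$X9, split = "="))[seq(2, 2 * n, 2)])

# An equivalent alternative: strip everything up to and including "="
size <- as.numeric(sub("^[^=]*=", "", pheno_demo$X3))

# 1 = recurrence (event), 0 = non-recurrence (censored)
censor <- ifelse(pheno_demo$status == "Status=recur", 1, 0)
```

The `sub()` form does not depend on `n` and keeps working if a value is missing, which makes it a somewhat safer choice when the records are irregular.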

Normalization

• Gene wise normalization (within each sample, subtract the median log2 value taken across genes)
tmp_gm <- apply(geno, 2, median)
geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
• Sample wise normalization (for each gene, divide by its mean value across samples in the original scale)
geno <- apply(geno, c(1, 2), function(x) 2 ^ x)
geno <- t(apply(geno, 1, function(x) x / (mean(x))))
geno <- apply(geno, c(1, 2), function(x) log2(x))
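The two normalization steps can be written more compactly with `sweep`. This is a sketch on a simulated matrix (rows are genes, columns are samples); it is an alternative formulation, not the course script itself.

```r
set.seed(1)
# Simulated log2 expression matrix: 5 genes x 4 samples
geno_demo <- matrix(rnorm(20, mean = 8), nrow = 5, ncol = 4)

# Step 1: subtract each column's (sample's) median
col_median <- apply(geno_demo, 2, median)
step1 <- sweep(geno_demo, 2, col_median, "-")

# Same result with the matrix-product form used on the slide
step1_slide <- geno_demo - matrix(rep(1, 5), 5, 1) %*% matrix(col_median, 1, 4)

# Step 2: back to the original scale, divide each row (gene) by its mean,
# then return to the log2 scale
raw   <- 2^step1
step2 <- log2(sweep(raw, 1, rowMeans(raw), "/"))
```

After step 1 every column has median 0, and after step 2 every row has mean 1 on the original scale.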

Variance-Based Filtering

• Calculate the variance for each gene and keep the top 25% (variance above the 75th percentile)
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
feature_var_filtered <- feature[var_filtered_idx, ]
geno_var_filtered <- geno[var_filtered_idx, ]
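A small sketch of the variance filter on simulated data (40 genes, 6 samples; all values are made up) shows that a quarter of the genes survive:

```r
set.seed(2)
# Simulated expression matrix: rows are genes, columns are samples
geno_demo <- matrix(rnorm(40 * 6), nrow = 40, ncol = 6,
                    dimnames = list(paste0("gene", 1:40), paste0("s", 1:6)))

# Per-gene variance across samples
var_geno <- apply(geno_demo, 1, var)

# Logical index: TRUE for genes above the 75th percentile of variance
keep <- var_geno > quantile(var_geno, 0.75)

# drop = FALSE keeps the result a matrix even if only one gene is kept
geno_top <- geno_demo[keep, , drop = FALSE]
n_kept <- nrow(geno_top)
```

Low-variance genes carry little information about group differences, so removing them first reduces the number of tests in the next step.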

T-Test-Based Filtering

• For each gene, do a t-test between the recurrence and non-recurrence groups. The status variable indicates the group
tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value
• Filter the genes by the P-value cutoff
ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
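The slide shows only the body of the per-gene loop. A sketch of the full loop on simulated data follows; the matrix, the group sizes, and the cutoff value are illustrative assumptions.

```r
set.seed(3)
n_gene <- 50
n_per_group <- 10
status <- factor(rep(c("non-recur", "recur"), each = n_per_group))

# Simulated expression matrix: rows are genes, columns are samples
geno_demo <- matrix(rnorm(n_gene * 2 * n_per_group), nrow = n_gene)
# Give gene 1 a real shift in the recurrence group
geno_demo[1, status == "recur"] <- geno_demo[1, status == "recur"] + 5

pvalue_list <- numeric(n_gene)
for (i in seq_len(n_gene)) {
  sdata <- data.frame(gene_express = geno_demo[i, ], status = status)
  # Welch two-sample t-test (the default of t.test), two-sided
  tmp_test <- t.test(gene_express ~ status, data = sdata,
                     alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

cutoff <- 0.001  # illustrative value
ttest_filtered_idx <- which(pvalue_list < cutoff)
```

With many genes tested at once, some small p-values arise by chance, so a strict cutoff or a multiple-testing adjustment such as `p.adjust` is usually considered.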

Sample Results
(GSE1378, microdissected; cutoff 0.0011)

Sample Results
(GSE1379, whole tissue dataset; cutoff 0.0011)

Statistical Modeling (Examples)

Outline

• Select overlapped genes between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and dendrogram
• Univariate logistic regression for selected genes and the two-gene ratio predictor
• Multivariate logistic regression (size and the other two potential predictors)
• Survival analysis part 1: Kaplan-Meier plot
• Survival analysis part 2: Cox proportional hazards model

Overlapped Genes

• In the preprocessing step, we obtained two DEG tables for the datasets GSE1378 and GSE1379
• We used the genes that overlap between these two DEG tables for the subsequent analysis
  – GSE1378: micro-dissected breast cancer cells (LCM)
  – GSE1379: whole tissue section
• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111), and AI240933 (EST)
• We will study the prognostic value of these markers
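Selecting the overlap amounts to intersecting the accession columns of the two DEG tables. In this sketch the two vectors are hypothetical; the accessions "X00001" and "Y00002" are placeholders, not real DEG results.

```r
# Hypothetical accession lists from the two DEG tables
deg_1378 <- c("BC007092", "AF208111", "AI240933", "X00001")
deg_1379 <- c("AF208111", "Y00002", "BC007092", "AI240933")

# Accessions present in both tables
overlap <- intersect(deg_1378, deg_1379)

# Positions of the overlapping accessions in the first table,
# for example to subset the matching rows of the expression matrix
idx_1378 <- match(overlap, deg_1378)
```

Matching on accession rather than on gene name avoids ambiguity when one gene is represented by several probes, as with HOXB13 here.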

Heatmap and Dendrogram

• We use the heatmap and dendrogram to visually check the relationship (correlation) among genes or samples
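A dendrogram of this kind can be built from a correlation-based distance. The sketch below uses simulated genes and average linkage; both the data and the linkage choice are assumptions, since the slides do not say which settings were used.

```r
set.seed(4)
# Simulated expression: 4 genes x 12 samples
expr <- matrix(rnorm(4 * 12), nrow = 4,
               dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                               paste0("s", 1:12)))
# Make geneB closely track geneA so that they should cluster together
expr["geneB", ] <- expr["geneA", ] + rnorm(12, sd = 0.1)

# Gene-gene correlation across samples, turned into a distance
gene_cor  <- cor(t(expr))
gene_dist <- as.dist(1 - gene_cor)

# Hierarchical clustering; the tree is what the dendrogram displays
gene_tree <- hclust(gene_dist, method = "average")
```

Calling `heatmap(expr, Rowv = as.dendrogram(gene_tree))` would draw the heatmap with this tree on the row axis.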

Heatmap (microdissected, GSE1378)
consistent with the paper

Heatmap (whole tissue section, GSE1379)

Model Set 1

• Univariate logistic regression for each gene
  – Response variable: recur/non-recur status
  – Predictors: one of the overlapped genes HOXB13, IL17BR (AF208111), AI240933 (EST)

Model Set 2

• Univariate logistic regression for the ratio of genes
  – Response variable: recur/non-recur status
  – Predictor: HOXB13/IL17BR

Model Set 3

• Multivariate logistic regression
  – Response variable: recur/non-recur
  – Predictors: tumor size, HOXB13/IL17BR, PGR, and ERBB2

Model Set 4

• Survival model
  – Response variable: DFS (disease-free survival time), censor
  – Predictor: use "-intercept/beta" from the logistic regression as the cutoff to divide the samples into two groups: high-ratio group and low-ratio group
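The cutoff "-intercept/beta" is the predictor value at which the fitted log-odds beta0 + beta1 * x equals zero, that is, where the predicted probability is 0.5. A sketch on simulated data (the true coefficients 1 and 2 are arbitrary choices for illustration):

```r
set.seed(5)
n <- 200
ratio  <- rnorm(n)                              # stands in for the log2 gene ratio
status <- rbinom(n, 1, plogis(1 + 2 * ratio))   # simulated 0/1 outcome

fit <- glm(status ~ ratio, family = binomial(link = "logit"))
b <- coef(fit)

# Solve beta0 + beta1 * x = 0 for x
cutoff <- unname(-b[1] / b[2])

# Divide the samples into high- and low-ratio groups at the cutoff
group <- ifelse(ratio > cutoff, "high", "low")

# The fitted probability at the cutoff should be 0.5
p_at_cutoff <- unname(predict(fit, newdata = data.frame(ratio = cutoff),
                              type = "response"))
```

When the slope is positive, samples in the "high" group are those the logistic model assigns a probability above 0.5.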

Important Note

• Please remember that there are two datasets: GSE1378 and GSE1379
• The same sets of models can be fitted on both datasets
• You need to set the working dataset variable
working_dataset = "GSE1379"  # whole tissue section, GSE1379
working_dataset = "GSE1378"  # microdissected breast cancer cells, GSE1378
• We use the working dataset GSE1378 as the example

Univariate Logistic Regression for Each Gene

• As an example, we check the gene HOXB13
gb_acc = "BC007092"  # HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
logit_data = data.frame(status = infos_df$status, gene = geno_selected)
fit <- glm(status ~ geno_selected, data = logit_data, family = binomial(link = "logit"))
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
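The AUC reported by ROCR equals the probability that a randomly chosen recurrence case receives a higher score than a randomly chosen non-recurrence case. As a cross-check that needs no extra package, it can be computed from ranks; `auc_rank` is a hypothetical helper written for this sketch, not a ROCR function.

```r
# Rank-based AUC (Mann-Whitney form); label: 1 = case, 0 = control
auc_rank <- function(score, label) {
  r  <- rank(score)        # average ranks, so ties count as one half
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Tiny worked example: 2 cases, 2 controls
score <- c(0.1, 0.4, 0.35, 0.8)
label <- c(0, 0, 1, 1)
auc_demo <- auc_rank(score, label)
```

Of the four case-control pairs, three have the case scored higher, so `auc_demo` is 0.75.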

Sample Output (gene HOXB13)

ROC (AUC = 0.796, gene HOXB13)

Univariate Logistic Regression (HOXB13/IL17BR)

gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)
# fit the model
fit <- glm(status ~ gene_ratio, data = logit_data, family = binomial(link = "logit"))
summary(fit)

Sample Output (HOXB13/IL17BR)

ROC (AUC = 0.84, HOXB13/IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)

gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ gene_ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)
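Coefficients of a logistic regression are on the log-odds scale, so they are often exponentiated into odds ratios for reporting. The sketch below does this on simulated data; the data frame `demo` and its effect sizes are invented, and only the predictor names echo the slide.

```r
set.seed(6)
n <- 300
# Simulated predictors (illustrative only)
demo <- data.frame(
  ratio = rnorm(n),
  size  = runif(n, 0.5, 5),
  gene3 = rnorm(n),
  gene4 = rnorm(n)
)
# Simulated outcome that depends on ratio and size only
demo$status <- rbinom(n, 1, plogis(-1 + 1.2 * demo$ratio + 0.3 * demo$size))

fit_multi <- glm(status ~ ratio + size + gene3 + gene4,
                 data = demo, family = binomial(link = "logit"))

# Odds ratio per one-unit increase in each predictor, others held fixed
odds_ratios <- exp(coef(fit_multi))

# Estimates, standard errors, z values, and p-values as a matrix
coef_table <- summary(fit_multi)$coefficients
```

Here all predictors are taken from columns of `demo`, which lets `predict` be applied to new data frames with the same column names.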

Sample Output (Multivariate)

ROC (AUC = 0.86, multivariate)

Kaplan-Meier Plot
(gene ratio high/low group, cutoff = -1.2)

Cox Proportional Hazards Model
(gene ratio high/low group, cutoff = -1.2)

fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)

Sample Output (Cox)

Validation: GSE6532

• The link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Number of total markers: 54675
• Genes HOXB13, IL17RB, and the ESTs are included in this dataset
• We use this dataset for validation
• Result: they are not significant on this independent set

Page 5: Statistical Models - Purdue University · Big Data Training for Translational Omics Research Survival Methods • Kaplan-Meier plot: visually checking the survival curve between groups

Big Data Training for Translational Omics Research

Survival Methods

bull Kaplan-Meier plot visually checking the survival curve

between groups

bull Cox Proportional odds model and log-rank test as formal

statistical test

bull Response survival time (say DFS) and censor

bull Predictors any variables (say group or specific genes)

bull Recurrence censor = 1 and Non-recurrence censor = 0

Big Data Training for Translational Omics Research

Load data

bull Toy example datatoy_datalt- readcsv(toy_example_datacsv)

Big Data Training for Translational Omics Research

Logistic Model

bull Response --- recurrencenon-recurrence status

bull Predictor --- the expression of gene HOXB13

logistic regresion use gene HOXB13 to predict the recurnon-recur status

fitlogistic lt- glm(status~ gene_HOXB13data = toy_datafamily = binomial(link = logit))

summary(fitlogistic)

plot ROC curvep lt- predict(fitlogistic type=response)

pr lt- prediction(p toy_data$status)

prf lt- performance(pr measure = tpr xmeasure = fpr)

plot(prfmain=ROC plot of logistic regression)

calculate the auc

auc lt- performance(pr measure = auc)

auc lt- aucyvalues[[1]]auc

Big Data Training for Translational Omics Research

Logistic Regression Result

Big Data Training for Translational Omics Research

ROC Curve

Big Data Training for Translational Omics Research

Linear Model

bull Response --- expression of HOXB13

bull Predictor --- expression of IL17BR linear model use gene IL17BR to predict another gene HOXB13

HOXB13fitlmlt- lm(gene_HOXB13~gene_IL17BRdata = data_toy)

summary(fitlm)

Big Data Training for Translational Omics Research

Linear Regression Result

Big Data Training for Translational Omics Research

Kaplan-Meier Plot

bull We use Kaplan-Meier plot and log-rank

test to check whether the survival time is

significantly different from each other

between groups (say highlow ratio

group)

ratiosurv lt- survfit(Surv(timecensor) ~ ratio_group data = toy_data)

autoplot(ratiosurvpVal = TpX=025pY =025title = paste0(Kaplan-Meier plot of

toy example )yLab = Survival Probability)

Big Data Training for Translational Omics Research

Kaplan-Meier Plot

Big Data Training for Translational Omics Research

Cox Proportional Odds Model

bull We use highlow ratio group to predict the

survival probability Here the response is

the survival time and the censor

information

fitcox lt- coxph(Surv(timecensor) ~ group data = toy_data)

summary(fitcox)

Big Data Training for Translational Omics Research

Cox Model Result

Big Data Training for Translational Omics Research

Data Downloading Processing

and Analysis

Big Data Training for Translational Omics Research

Outline

bull Download data

bull Parsing data

bull Normalization

bull Variance based filtering (top 25)

bull T test based filtering(based on the P-value cutoff)

The above steps are implemented in

ldquoget_DEG_tableRrdquo script

Big Data Training for Translational Omics Research

Data Availability

bull Microdissected dataset GSE1378

httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE

1378

bull Whole tissue dataset GSE1379

httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE

1379

bull The easiest way to download data is using ldquogetGEOrdquo

function from ldquoGEOqueryrdquo package

Big Data Training for Translational Omics Research

Use ldquogetGEOrdquo to Download Databull We have downloaded the data you can use

ldquogetGEOrdquo function to get data locally or online

bull Local (loading_method = lsquolocalrsquo)geo_Name lt- lsquoGSE1378rsquo

geodata2 lt-getGEO(filename paste0(geo_datageo_Name_series_matrixtxtgz) GSEMatrix = TRUE)

bull Online (loading_method = lsquoonlinersquo)geodata lt- getGEO(geo_Name GSEMatrix = TRUEdestdir = geo_data)

bull You can set loading_method variable in the get_DEG_table function to rdquolocalrdquo or ldquoonlinerdquo to change the way of downloading data

bull Note that the downloaded geno matrix is in log2scale

Big Data Training for Translational Omics Research

Parsing Data

bull Extract the geno matrix pheno table and

feature tableidx lt- 1 geno lt- assayData(geodata[[idx]])$exprs

pheno lt- pData(phenoData(geodata[[idx]]))

feature lt- as(featureData(geodata[[idx]]) dataframe)

bull Parsing phenotype table to get variable

Age Size DFS censorinfos_df$Age = asnumeric(unlist(strsplit(infos_df$X9 split = =))[seq(2 2 n 2)])

infos_df$Size = asnumeric(unlist(strsplit(infos_df$X3 split = =))[seq(2 2 n 2)])

infos_df$DFS = asnumeric(unlist(strsplit(infos_df$X10 split = =))[seq(2 2 n 2)])

infos_df$censor = ifelse(infos_df$status == Status=recur 1 0)

Big Data Training for Translational Omics Research

Normalization

bull Gene wise normalization (subtract the

median log2 value)tmp_gm lt- apply(geno 2 median)

geno lt- geno - matrix(rep(1 numOfGene) numOfGene 1)

matrix(tmp_gm 1 n)

bull Sample wise normalization (divided

by mean value in original scale)geno lt- apply(geno c(1 2) function(x) 2 ^ x )

geno lt- t(apply(geno 1 function(x) x (mean(x)) ))

geno lt- apply(geno c(1 2) function(x) log2(x) )

Big Data Training for Translational Omics Research

Variance Based Filtering

bull Calculate the variance for each gene and

choose the top 25 variance based filtering (75th percentile)

var_geno lt- apply(geno 1 var)

var_filtered_idx lt- var_geno gt quantile(var_geno 075)

feature_var_filtered lt- feature[var_filtered_idx]

geno_var_filtered lt- geno[var_filtered_idx]

Big Data Training for Translational Omics Research

T test Based Filtering

bull For each gene do T test between the

recurrence and non-recurrence group

The status variable indicates the group

informationtmp_test lt- ttest(gene_express ~ status data = sdata alternative =

twosided)

pvalue_list[i] lt- tmp_test$pvalue

bull Fitering the gene by the P-value cutoffttest_filtered_idx lt- which(pvalue_list lt cutoff)

feature_ttest_filtered lt- feature_var_filtered[ttest_filtered_idx]

geno_ttest_filtered lt- geno_var_filtered[ttest_filtered_idx]

Big Data Training for Translational Omics Research

Sample Results

(GSE1378microdissected 00011 cutoff)

Big Data Training for Translational Omics Research

Sample Results

(GSE1379 whole tissue dataset cutoff 00011)

Big Data Training for Translational Omics Research

Statistical Modeling(examples)

Big Data Training for Translational Omics Research

Outline

bull Select overlapped genes between GSE1378 and

GSE1379 for subsequent analysis

bull Heatmap and Dendrogram

bull Univariate logistic regression for selected genes and

two-gene ratio predictor

bull Multivariate logistic regression (size and the other two

potential predictors)

bull Survival analysis part 1 Kaplan-Meier plot

bull Survival analysis part 2 Cox proportional odds model

Big Data Training for Translational Omics Research

Overlapped Genes

bull In the prepossessing step we obtained two DEG tables

for the datasets GSE1378 and GSE1379

bull We used the overlapped genes in this two DEG tables

for the subsequent analysis

bull GSE1378 Micro-dissected breast cancer cell (LCM)

bull GSE1379 Whole tissue section

bull The overlapped genes are HOXB13 (identified twice as

AI208111 and BC007092) IL17BR (AF2080111) and

AI240933 (EST)

bull We will study the prognostic value of these markers

Big Data Training for Translational Omics Research

Heatmap and Dendrogram

bull We use Heatmap and Dendrogram to

Visually check the relationship

(correlation) among genes or samples

Big Data Training for Translational Omics Research

Heatmap(microdissectedGSE1378)

consistent with the paper

Big Data Training for Translational Omics Research

Heatmap(whole section tissue GSE 1379)

Big Data Training for Translational Omics Research

Model Set 1

bull Univariate logistic regression for each

gene

ndash Response variable recurnon-recur

status

ndash Predictors one of the overlapped

genes HOXB13 IL17BR(AF2080111)

AI240933(EST)

Big Data Training for Translational Omics Research

Model Set 2

bull Univariate logistic regression for

ratio of genes

ndash Response variable recurnon-recur

status

ndash Predictors HOXB13IL17BR

Big Data Training for Translational Omics Research

Model Set 3

bull Multivariate logistic regression

ndash Response variable recurnon-

recur

ndash Predictors tumor size

HOXB13IL17BR PGR and ERBB2

Big Data Training for Translational Omics Research

Model Set 4

bull Survival model

ndash Response variable DFS (disease free

survival time) censor

ndash Predictor use ldquo-interceptbetardquo from

logistic regression as the cutoff to

divide the sample into two groups high

ratio group and low ratio group

Big Data Training for Translational Omics Research

Important Note

bull Please remember there are two datasets GSE1378 and GSE1379

bull Can fit the same sets of model on these two datasets

bull Need to set the working dataset variable

working_dataset = GSE1378 whole tissue sectionGSE1379

working_dataset = GSE1378 microdissected breast cancer cells

GSE1378

bull Use working dataset GSE1378 as example

Big Data Training for Translational Omics Research

Univariate Logistic Regression for Each Gene

• As an example, we check the gene HOXB13

library(ROCR)  # provides prediction() and performance()
gb_acc = "BC007092"  # HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
logit_data = data.frame(status = infos_df$status, gene = geno_selected)
fit <- glm(status ~ geno_selected, data = logit_data, family = binomial(link = "logit"))
# predicted recurrence probabilities, then the ROC curve and the AUC
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

Sample Output (gene: HOXB13)

ROC (AUC = 0.796, gene: HOXB13)
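To make the AUC number concrete: it is the probability that a randomly chosen recurrent sample gets a higher fitted probability than a randomly chosen non-recurrent one. The sketch below computes it in base R with the rank-sum (Mann-Whitney) identity instead of ROCR, on simulated data; the helper auc_rank and all data are assumptions made for illustration.

```r
# Hypothetical simulated data; the names are illustrative only.
set.seed(3)
n <- 150
gene <- rnorm(n)
status <- rbinom(n, 1, plogis(1.2 * gene))   # 1 = recur, 0 = non-recur

fit <- glm(status ~ gene, family = binomial(link = "logit"))
p <- predict(fit, type = "response")

# AUC via the rank-sum (Mann-Whitney) identity.
# auc_rank is a hypothetical helper, not part of ROCR.
auc_rank <- function(score, label) {
  r <- rank(score)                 # average ranks handle ties
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc <- auc_rank(p, status)
auc
```

An AUC of 0.5 corresponds to random guessing and 1 to perfect separation of the two groups.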

Univariate Logistic Regression (HOXB13/IL17BR)

gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)
# fit the model
fit <- glm(status ~ gene_ratio, data = logit_data, family = binomial(link = "logit"))
summary(fit)

Sample Output (HOXB13/IL17BR)

ROC (AUC = 0.84, HOXB13/IL17BR)
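The comment "in the log2 scale, the ratio is the difference" follows from log2(a/b) = log2(a) - log2(b). A small check with made-up expression values (the numbers are assumptions, chosen only so the arithmetic is easy to follow):

```r
# Hypothetical expression values on the original (linear) scale.
hoxb13 <- c(8, 2, 16)
il17br <- c(2, 4, 16)

# Ratio on the original scale, then log2-transformed.
log2_of_ratio <- log2(hoxb13 / il17br)

# Difference of the log2-transformed values.
diff_of_log2 <- log2(hoxb13) - log2(il17br)

log2_of_ratio
diff_of_log2
```

Both vectors are identical, which is why the slides subtract the two log2 expression rows to obtain the HOXB13/IL17BR ratio.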

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)

gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2,
                        ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ gene_ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)

Sample Output (Multivariate)

ROC (AUC = 0.86, Multivariate)

Kaplan-Meier Plot
(gene ratio high/low group, cutoff = -12)

Cox Proportional Hazards Model
(gene ratio high/low group, cutoff = -12)

library(survival)  # provides Surv() and coxph()
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)

Sample Output (Cox)
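The quantity of interest in the Cox fit is the hazard ratio between the two groups, exp(coefficient). Below is a minimal sketch on simulated data, assuming censor = 1 means the event (recurrence) was observed and 0 means censored; the simulated rates and the data frame are illustrative assumptions, not the GSE1378 data.

```r
library(survival)  # recommended package shipped with standard R installations

# Hypothetical simulated data: the "high" group has twice the hazard of the "low" group.
set.seed(2)
n <- 200
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
rate <- ifelse(group == "high", 0.2, 0.1)
event_time <- rexp(n, rate)
censor_time <- runif(n, 0, 15)
surv_data <- data.frame(
  time = pmin(event_time, censor_time),
  censor = as.integer(event_time <= censor_time),  # 1 = event observed, 0 = censored
  group = group
)

fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
hr <- exp(coef(fitcox))      # estimated hazard ratio, high vs low
ci <- exp(confint(fitcox))   # 95% confidence interval for the hazard ratio
hr
ci
```

A hazard ratio above 1 with a confidence interval that excludes 1 indicates that the high group recurs faster than the low group.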

Validation: GSE6532

• The link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Number of total markers: 54675
• The genes HOXB13, IL17RB and the ESTs are included in this dataset
• We use this dataset as validation
• Result: they are not significant on this independent set


Load data

• Toy example data
toy_data <- read.csv("toy_example_data.csv")

Logistic Model

• Response --- recurrence/non-recurrence status
• Predictor --- the expression of gene HOXB13

library(ROCR)  # provides prediction() and performance()
# logistic regression: use gene HOXB13 to predict the recur/non-recur status
fitlogistic <- glm(status ~ gene_HOXB13, data = toy_data, family = binomial(link = "logit"))
summary(fitlogistic)
# plot ROC curve
p <- predict(fitlogistic, type = "response")
pr <- prediction(p, toy_data$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = "ROC plot of logistic regression")
# calculate the auc
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

Logistic Regression Result

ROC Curve

Linear Model

• Response --- expression of HOXB13
• Predictor --- expression of IL17BR

# linear model: use gene IL17BR to predict another gene, HOXB13
fitlm <- lm(gene_HOXB13 ~ gene_IL17BR, data = toy_data)
summary(fitlm)

Linear Regression Result
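For readers without the toy CSV at hand, the same linear fit can be tried on simulated data. This is a sketch under stated assumptions: the column names mirror the slide, but the simulated slope, intercept and noise level are made up for illustration.

```r
# Hypothetical simulated expression values; the true slope is -0.8.
set.seed(4)
n <- 100
gene_IL17BR <- rnorm(n)
gene_HOXB13 <- 1 - 0.8 * gene_IL17BR + rnorm(n, sd = 0.5)
toy <- data.frame(gene_HOXB13, gene_IL17BR)

fitlm <- lm(gene_HOXB13 ~ gene_IL17BR, data = toy)

# Coefficient table: estimate, standard error, t value and p-value per term.
est <- coef(summary(fitlm))
est

# Residuals, which can be inspected against the iid normal error assumption.
res <- residuals(fitlm)
qqnorm(res)
qqline(res)
```

The slope row of the table is the estimated change in HOXB13 expression per unit change in IL17BR expression.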

Kaplan-Meier Plot

• We use the Kaplan-Meier plot and the log-rank test to check whether survival time differs significantly between groups (say, the high and low ratio groups)

ratiosurv <- survfit(Surv(time, censor) ~ ratio_group, data = toy_data)
autoplot(ratiosurv, pVal = TRUE, pX = 0.25, pY = 0.25, title = paste0("Kaplan-Meier plot of toy example"), yLab = "Survival Probability")

Kaplan-Meier Plot
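The log-rank p-value can also be computed explicitly with survdiff() from the survival package and drawn with base graphics, rather than through the autoplot() call above. A minimal sketch on simulated data follows; the group labels, rates and censoring scheme are assumptions made for illustration.

```r
library(survival)

# Hypothetical simulated data: the "high" ratio group has the larger hazard.
set.seed(5)
n <- 200
ratio_group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time <- rexp(n, ifelse(ratio_group == "high", 0.25, 0.1))
censor_time <- runif(n, 0, 12)
toy <- data.frame(
  time = pmin(event_time, censor_time),
  censor = as.integer(event_time <= censor_time),  # 1 = event observed, 0 = censored
  ratio_group = ratio_group
)

# Kaplan-Meier curves per group (strata follow the factor level order: low, high).
km <- survfit(Surv(time, censor) ~ ratio_group, data = toy)
plot(km, col = c("blue", "red"), xlab = "Time", ylab = "Survival probability")
legend("topright", legend = levels(toy$ratio_group), col = c("blue", "red"), lty = 1)

# Log-rank test: chi-square statistic with (number of groups - 1) degrees of freedom.
lr <- survdiff(Surv(time, censor) ~ ratio_group, data = toy)
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
p_value
```

The plot is the visual check and the p-value is the formal test of the same group difference.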

Cox Proportional Hazards Model

• We use the high/low ratio group to predict survival. Here the response is the survival time together with the censoring information.

fitcox <- coxph(Surv(time, censor) ~ group, data = toy_data)
summary(fitcox)

Cox Model Result

Data Downloading, Processing and Analysis

Outline

• Download data
• Parsing data
• Normalization
• Variance-based filtering (top 25%)
• T-test-based filtering (based on the P-value cutoff)

The above steps are implemented in the "get_DEG_table.R" script.

Data Availability

• Microdissected dataset GSE1378: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1378
• Whole tissue dataset GSE1379: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379
• The easiest way to download the data is to use the "getGEO" function from the "GEOquery" package

Use "getGEO" to Download Data

• We have already downloaded the data; you can use the "getGEO" function to get the data locally or online
• Local (loading_method = 'local')
geo_Name <- 'GSE1378'
geodata2 <- getGEO(filename = paste0("geo_data/", geo_Name, "_series_matrix.txt.gz"), GSEMatrix = TRUE)
• Online (loading_method = 'online')
geodata <- getGEO(geo_Name, GSEMatrix = TRUE, destdir = "geo_data")
• You can set the loading_method variable in the get_DEG_table function to "local" or "online" to change how the data are obtained
• Note that the downloaded geno matrix is in log2 scale

Parsing Data

• Extract the geno matrix, pheno table and feature table
idx <- 1
geno <- assayData(geodata[[idx]])$exprs
pheno <- pData(phenoData(geodata[[idx]]))
feature <- as(featureData(geodata[[idx]]), "data.frame")
• Parse the phenotype table to get the variables Age, Size, DFS and censor
infos_df$Age = as.numeric(unlist(strsplit(infos_df$X9, split = "="))[seq(2, 2 * n, 2)])
infos_df$Size = as.numeric(unlist(strsplit(infos_df$X3, split = "="))[seq(2, 2 * n, 2)])
infos_df$DFS = as.numeric(unlist(strsplit(infos_df$X10, split = "="))[seq(2, 2 * n, 2)])
infos_df$censor = ifelse(infos_df$status == "Status=recur", 1, 0)
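The phenotype columns hold "key=value" strings, so the parsing lines split each string on "=" and keep every second element. A small sketch on made-up strings shows what that indexing does; the example values, including the "Status=non-recur" label, are assumptions and not taken from the GEO record.

```r
# Hypothetical phenotype strings in "key=value" form.
x9 <- c("Age=55", "Age=62", "Age=47")
n <- length(x9)

# Approach used in the slides: split on "=", flatten, keep every second element.
age_split <- as.numeric(unlist(strsplit(x9, split = "="))[seq(2, 2 * n, 2)])

# Equivalent alternative: drop everything up to and including the first "=".
age_sub <- as.numeric(sub("^[^=]*=", "", x9))

# Event indicator: 1 for recurrence, 0 otherwise.
status <- c("Status=recur", "Status=non-recur", "Status=recur")
censor <- ifelse(status == "Status=recur", 1, 0)

age_split
age_sub
censor
```

The seq(2, 2 * n, 2) indexing relies on every string containing exactly one "="; the sub() variant does not depend on that.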

Normalization

• Gene wise normalization (subtract the median log2 value; the median is taken over the genes within each sample)
tmp_gm <- apply(geno, 2, median)
geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
• Sample wise normalization (divide by the mean value in the original scale; the mean is taken over the samples for each gene)
geno <- apply(geno, c(1, 2), function(x) 2 ^ x)
geno <- t(apply(geno, 1, function(x) x / (mean(x))))
geno <- apply(geno, c(1, 2), function(x) log2(x))
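The two normalization steps can be checked on a small matrix. The sketch below uses sweep() as a compact equivalent of the outer-product and apply() calls; the 5 x 4 matrix is simulated and the step names are assumptions made for illustration.

```r
# Hypothetical log2 expression matrix: 5 genes (rows) x 4 samples (columns).
set.seed(6)
geno <- matrix(rnorm(20, mean = 8), nrow = 5, ncol = 4)

# Step 1: subtract each sample's (column's) median log2 value.
col_med <- apply(geno, 2, median)
step1 <- sweep(geno, 2, col_med, "-")

# The same step written with the explicit outer product.
step1_outer <- geno - matrix(1, nrow(geno), 1) %*% matrix(col_med, 1, ncol(geno))

# Step 2: divide each gene (row) by its mean on the original scale, then return to log2.
lin <- 2 ^ step1
step2 <- log2(sweep(lin, 1, rowMeans(lin), "/"))

apply(step1, 2, median)   # column medians after step 1
rowMeans(2 ^ step2)       # row means on the original scale after step 2
```

After step 1 every column median is zero, and after step 2 every gene has mean 1 on the original scale.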

Variance-Based Filtering

• Calculate the variance for each gene and keep the top 25%
# variance-based filtering (75th percentile)
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
feature_var_filtered <- feature[var_filtered_idx, ]
geno_var_filtered <- geno[var_filtered_idx, ]
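A quick way to see what the 75th-percentile filter keeps is to run it on a simulated matrix. This is a sketch only; the 40 x 6 matrix and its row and column names are assumptions made for illustration.

```r
# Hypothetical expression matrix: 40 genes (rows) x 6 samples (columns).
set.seed(7)
geno <- matrix(rnorm(40 * 6), nrow = 40, ncol = 6,
               dimnames = list(paste0("gene", 1:40), paste0("s", 1:6)))

# Per-gene variance across samples.
var_geno <- apply(geno, 1, var)

# Keep genes whose variance is above the 75th percentile (the top 25%).
keep <- var_geno > quantile(var_geno, 0.75)
geno_var_filtered <- geno[keep, , drop = FALSE]

dim(geno_var_filtered)
```

With 40 genes and no tied variances, the filter retains the 10 most variable genes.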

T-Test-Based Filtering

• For each gene, do a t-test between the recurrence and non-recurrence groups. The status variable indicates the group information.
tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value
• Filter the genes by the P-value cutoff
ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
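The slide shows only the body of the per-gene loop. A self-contained sketch of the whole loop on simulated data is given below; the matrix, the group sizes, the shifted gene and the cutoff value are assumptions made for illustration.

```r
# Hypothetical data: 50 genes, 10 non-recurrent and 10 recurrent samples.
set.seed(8)
n_gene <- 50
status <- factor(rep(c("non-recur", "recur"), each = 10))
geno <- matrix(rnorm(n_gene * 20), nrow = n_gene,
               dimnames = list(paste0("gene", seq_len(n_gene)), NULL))

# Make the first gene truly differential between the groups (illustrative only).
geno[1, status == "recur"] <- geno[1, status == "recur"] + 5

# One two-sided t-test per gene; store the p-values.
pvalue_list <- numeric(n_gene)
for (i in seq_len(n_gene)) {
  sdata <- data.frame(gene_express = geno[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

# Keep the genes whose p-value is below the cutoff.
cutoff <- 0.001
ttest_filtered_idx <- which(pvalue_list < cutoff)
rownames(geno)[ttest_filtered_idx]
```

Note that t.test() runs Welch's unequal-variance test by default; pass var.equal = TRUE for the classical pooled-variance version.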

Sample Results
(GSE1378, microdissected, cutoff 0.0011)

Sample Results
(GSE1379, whole tissue dataset, cutoff 0.0011)

Statistical Modeling (examples)

Outline

• Select overlapped genes between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and dendrogram
• Univariate logistic regression for selected genes and the two-gene ratio predictor
• Multivariate logistic regression (size and the other two potential predictors)
• Survival analysis part 1: Kaplan-Meier plot
• Survival analysis part 2: Cox proportional hazards model

Overlapped Genes

• In the preprocessing step we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379
• We used the genes that overlap between these two DEG tables for the subsequent analysis
• GSE1378: micro-dissected breast cancer cells (LCM)
• GSE1379: whole tissue section
• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111) and AI240933 (EST)
• We will study the prognostic value of these markers

Heatmap and Dendrogram

• We use the heatmap and dendrogram to visually check the relationship (correlation) among genes or samples
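One common way to build such a dendrogram is hierarchical clustering on a correlation-based distance (1 - correlation), which base R's heatmap() can also use. The sketch below is one possible choice, not necessarily the settings behind the course figures; the four simulated genes and the average-linkage method are assumptions.

```r
# Hypothetical data: genes A/B share one pattern, genes C/D share another.
set.seed(9)
base1 <- rnorm(12)
base2 <- rnorm(12)
geno <- rbind(
  geneA = base1 + rnorm(12, sd = 0.1),
  geneB = base1 + rnorm(12, sd = 0.1),
  geneC = base2 + rnorm(12, sd = 0.1),
  geneD = base2 + rnorm(12, sd = 0.1)
)
colnames(geno) <- paste0("s", 1:12)

# Distance between genes: 1 - Pearson correlation across samples.
gene_cor <- cor(t(geno))
gene_dist <- as.dist(1 - gene_cor)

# Dendrogram from average-linkage hierarchical clustering, cut into two clusters.
hc <- hclust(gene_dist, method = "average")
clusters <- cutree(hc, k = 2)
plot(hc, main = "Gene dendrogram (1 - correlation)")

# Heatmap with the same distance applied to rows (genes) and columns (samples).
heatmap(geno, distfun = function(m) as.dist(1 - cor(t(m))), scale = "row")

clusters
```

Genes with correlated expression end up on neighbouring branches, which is the relationship the heatmap slides are meant to show.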

Heatmap (microdissected, GSE1378)
Consistent with the paper

Heatmap (whole tissue section, GSE1379)



Page 8: Statistical Models - Purdue University · Big Data Training for Translational Omics Research Survival Methods • Kaplan-Meier plot: visually checking the survival curve between groups

Big Data Training for Translational Omics Research

Logistic Regression Result

Big Data Training for Translational Omics Research

ROC Curve

Big Data Training for Translational Omics Research

Linear Model

bull Response --- expression of HOXB13

bull Predictor --- expression of IL17BR linear model use gene IL17BR to predict another gene HOXB13

HOXB13fitlmlt- lm(gene_HOXB13~gene_IL17BRdata = data_toy)

summary(fitlm)

Big Data Training for Translational Omics Research

Linear Regression Result

Big Data Training for Translational Omics Research

Kaplan-Meier Plot

bull We use Kaplan-Meier plot and log-rank

test to check whether the survival time is

significantly different from each other

between groups (say highlow ratio

group)

ratiosurv lt- survfit(Surv(timecensor) ~ ratio_group data = toy_data)

autoplot(ratiosurvpVal = TpX=025pY =025title = paste0(Kaplan-Meier plot of

toy example )yLab = Survival Probability)

Big Data Training for Translational Omics Research

Kaplan-Meier Plot

Big Data Training for Translational Omics Research

Cox Proportional Odds Model

bull We use highlow ratio group to predict the

survival probability Here the response is

the survival time and the censor

information

fitcox lt- coxph(Surv(timecensor) ~ group data = toy_data)

summary(fitcox)

Big Data Training for Translational Omics Research

Cox Model Result

Big Data Training for Translational Omics Research

Data Downloading Processing

and Analysis

Big Data Training for Translational Omics Research

Outline

bull Download data

bull Parsing data

bull Normalization

bull Variance based filtering (top 25)

bull T test based filtering(based on the P-value cutoff)

The above steps are implemented in

ldquoget_DEG_tableRrdquo script

Big Data Training for Translational Omics Research

Data Availability

bull Microdissected dataset GSE1378

httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE

1378

bull Whole tissue dataset GSE1379

httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE

1379

bull The easiest way to download data is using ldquogetGEOrdquo

function from ldquoGEOqueryrdquo package

Big Data Training for Translational Omics Research

Use ldquogetGEOrdquo to Download Databull We have downloaded the data you can use

ldquogetGEOrdquo function to get data locally or online

bull Local (loading_method = lsquolocalrsquo)geo_Name lt- lsquoGSE1378rsquo

geodata2 lt-getGEO(filename paste0(geo_datageo_Name_series_matrixtxtgz) GSEMatrix = TRUE)

bull Online (loading_method = lsquoonlinersquo)geodata lt- getGEO(geo_Name GSEMatrix = TRUEdestdir = geo_data)

bull You can set loading_method variable in the get_DEG_table function to rdquolocalrdquo or ldquoonlinerdquo to change the way of downloading data

bull Note that the downloaded geno matrix is in log2scale

Big Data Training for Translational Omics Research

Parsing Data

bull Extract the geno matrix pheno table and

feature tableidx lt- 1 geno lt- assayData(geodata[[idx]])$exprs

pheno lt- pData(phenoData(geodata[[idx]]))

feature lt- as(featureData(geodata[[idx]]) dataframe)

bull Parsing phenotype table to get variable

Age Size DFS censorinfos_df$Age = asnumeric(unlist(strsplit(infos_df$X9 split = =))[seq(2 2 n 2)])

infos_df$Size = asnumeric(unlist(strsplit(infos_df$X3 split = =))[seq(2 2 n 2)])

infos_df$DFS = asnumeric(unlist(strsplit(infos_df$X10 split = =))[seq(2 2 n 2)])

infos_df$censor = ifelse(infos_df$status == Status=recur 1 0)

Big Data Training for Translational Omics Research

Normalization

bull Gene wise normalization (subtract the

median log2 value)tmp_gm lt- apply(geno 2 median)

geno lt- geno - matrix(rep(1 numOfGene) numOfGene 1)

matrix(tmp_gm 1 n)

bull Sample wise normalization (divided

by mean value in original scale)geno lt- apply(geno c(1 2) function(x) 2 ^ x )

geno lt- t(apply(geno 1 function(x) x (mean(x)) ))

geno lt- apply(geno c(1 2) function(x) log2(x) )

Big Data Training for Translational Omics Research

Variance Based Filtering

bull Calculate the variance for each gene and

choose the top 25 variance based filtering (75th percentile)

var_geno lt- apply(geno 1 var)

var_filtered_idx lt- var_geno gt quantile(var_geno 075)

feature_var_filtered lt- feature[var_filtered_idx]

geno_var_filtered lt- geno[var_filtered_idx]

Big Data Training for Translational Omics Research

T test Based Filtering

bull For each gene do T test between the

recurrence and non-recurrence group

The status variable indicates the group

informationtmp_test lt- ttest(gene_express ~ status data = sdata alternative =

twosided)

pvalue_list[i] lt- tmp_test$pvalue

bull Fitering the gene by the P-value cutoffttest_filtered_idx lt- which(pvalue_list lt cutoff)

feature_ttest_filtered lt- feature_var_filtered[ttest_filtered_idx]

geno_ttest_filtered lt- geno_var_filtered[ttest_filtered_idx]

Sample Results
(GSE1378, microdissected, 0.0011 cutoff)

Sample Results
(GSE1379, whole tissue dataset, cutoff 0.0011)

Statistical Modeling (examples)

Outline

• Select overlapped genes between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and dendrogram
• Univariate logistic regression for selected genes and the two-gene ratio predictor
• Multivariate logistic regression (size and the other two potential predictors)
• Survival analysis part 1: Kaplan-Meier plot
• Survival analysis part 2: Cox proportional hazards model

Overlapped Genes

• In the preprocessing step, we obtained two DEG tables for the datasets GSE1378 and GSE1379
• We used the genes that overlap between these two DEG tables for the subsequent analysis
• GSE1378: micro-dissected breast cancer cells (LCM)
• GSE1379: whole tissue section
• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111) and AI240933 (EST)
• We will study the prognostic value of these markers

Heatmap and Dendrogram

• We use a heatmap and a dendrogram to visually check the relationship (correlation) among genes or samples
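As an illustration only, a correlation-based dendrogram and heatmap can be drawn with base R. The toy matrix, the choice of 1 - correlation as the distance, and average linkage are assumptions; the slides do not state which settings were used.

```r
set.seed(4)
# Toy matrix: 6 genes (rows) x 8 samples (columns).
expr <- matrix(rnorm(6 * 8), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("s", 1:8)))
# Make gene2 track gene1 closely so the pair clusters together.
expr[2, ] <- expr[1, ] + rnorm(8, sd = 0.05)

# Distance between genes: 1 - Pearson correlation of their profiles.
gene_dist <- as.dist(1 - cor(t(expr)))
gene_tree <- hclust(gene_dist, method = "average")

# Dendrogram of genes, then a heatmap that reuses the same gene ordering.
plot(gene_tree, main = "Gene dendrogram (toy data)")
heatmap(expr, Rowv = as.dendrogram(gene_tree), scale = "row")
```

Genes with highly correlated profiles, such as gene1 and gene2 here, are joined at the lowest branch of the dendrogram.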

Heatmap (microdissected, GSE1378)
consistent with the paper

Heatmap (whole tissue section, GSE1379)

Model Set 1

• Univariate logistic regression for each gene
  – Response variable: recur/non-recur status
  – Predictors: one of the overlapped genes, HOXB13, IL17BR (AF208111), AI240933 (EST)

Model Set 2

• Univariate logistic regression for the ratio of genes
  – Response variable: recur/non-recur status
  – Predictors: HOXB13/IL17BR

Model Set 3

• Multivariate logistic regression
  – Response variable: recur/non-recur
  – Predictors: tumor size, HOXB13/IL17BR, PGR and ERBB2

Model Set 4

• Survival model
  – Response variable: DFS (disease-free survival time) and censor
  – Predictor: use "-intercept/beta" from the logistic regression as the cutoff to divide the samples into two groups, a high ratio group and a low ratio group
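Why "-intercept/beta" works as a cutoff: in a one-predictor logistic model, logit(p) = intercept + beta * ratio equals 0 exactly when ratio = -intercept/beta, which is where the fitted recurrence probability is 0.5. A sketch on simulated data (the data and the true coefficients 0.5 and 1.5 are invented):

```r
set.seed(5)
n <- 200
ratio <- rnorm(n)
# Simulated recurrence status (1 = recur) from a known logistic model.
status <- rbinom(n, 1, plogis(0.5 + 1.5 * ratio))

fit <- glm(status ~ ratio, family = binomial(link = "logit"))
b <- coef(fit)

# Cutoff where the fitted probability crosses 0.5.
cutoff <- unname(-b[1] / b[2])
p_at_cutoff <- unname(predict(fit, newdata = data.frame(ratio = cutoff),
                              type = "response"))

# Split the samples into a high ratio group and a low ratio group.
ratio_group <- ifelse(ratio > cutoff, "high", "low")
```

Samples on one side of the cutoff have a fitted recurrence probability above 0.5 and those on the other side below 0.5; which side is which depends on the sign of beta.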

Important Note

• Please remember there are two datasets, GSE1378 and GSE1379
• The same sets of models can be fitted on both datasets
• You need to set the working dataset variable

working_dataset = "GSE1378"  # whole tissue section: GSE1379
working_dataset = "GSE1378"  # microdissected breast cancer cells: GSE1378

• We use the working dataset GSE1378 as the example

Univariate Logistic Regression for Each Gene

• As an example, we check the gene HOXB13

library(ROCR)  # provides prediction() and performance()

gb_acc = "BC007092"  # HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
logit_data = data.frame(status = infos_df$status, gene = geno_selected)
fit <- glm(status ~ geno_selected, data = logit_data, family = binomial(link = "logit"))
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
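For readers without ROCR, the AUC can also be computed in base R from ranks. This is an alternative sketch, not the method used in the slides: it relies on the fact that the AUC equals the probability that a randomly chosen recurrence case gets a higher fitted score than a randomly chosen non-recurrence case (ties counted as one half). The data and the helper auc_rank are hypothetical.

```r
set.seed(6)
status <- rep(c(1, 0), each = 15)     # 1 = recur, 0 = non-recur
gene   <- rnorm(30, mean = status)    # toy expression, higher in recurrences

fit <- glm(status ~ gene, family = binomial(link = "logit"))
p <- predict(fit, type = "response")

# Rank-based (Mann-Whitney) AUC; average ranks handle ties.
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc <- auc_rank(p, status)

# Cross-check by comparing every positive/negative pair directly.
pos <- p[status == 1]
neg <- p[status == 0]
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
```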

Sample Output (gene HOXB13)

ROC (AUC = 0.796, gene HOXB13)

Univariate Logistic Regression (HOXB13/IL17BR)

gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)
# fit the model
fit <- glm(status ~ gene_ratio, data = logit_data, family = binomial(link = "logit"))
summary(fit)
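The comment "in the log2 scale, the ratio is the difference" follows from log2(a / b) = log2(a) - log2(b). A tiny check with invented intensities:

```r
# Original-scale intensities (invented values).
hoxb13 <- c(8, 16, 4)
il17br <- c(2, 4, 4)

# Ratio taken on the original scale, then logged.
log_of_ratio <- log2(hoxb13 / il17br)

# Difference of the log2 values, as used for gene_ratio in the slides.
diff_of_logs <- log2(hoxb13) - log2(il17br)
```

Both vectors equal 2, 2, 0, so subtracting log2 expression values gives the log2 of the two-gene ratio.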

Sample Output (HOXB13/IL17BR)

ROC (AUC = 0.84, HOXB13/IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)

gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ gene_ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)

Sample Output (Multivariate)

ROC (AUC = 0.86, Multivariate)

Kaplan-Meier Plot
(gene ratio high/low group, cutoff = -12)
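A sketch of how a Kaplan-Meier plot and the accompanying log-rank test can be produced with the survival package. The data are simulated (exponential times, censoring drawn independently of time) and the name surv_toy is hypothetical; base plot() is used here in place of whatever plotting helper produced the slide figure.

```r
library(survival)

set.seed(7)
n <- 60
group <- rep(c("high", "low"), each = n / 2)
# Invented survival times: the "high" ratio group fails faster.
time <- rexp(n, rate = ifelse(group == "high", 0.2, 0.05))
censor <- rbinom(n, 1, 0.8)           # 1 = event observed, 0 = censored
surv_toy <- data.frame(time, censor, group)

# Kaplan-Meier estimate per group, drawn as step curves.
km <- survfit(Surv(time, censor) ~ group, data = surv_toy)
plot(km, col = c("red", "blue"), xlab = "Time", ylab = "Survival probability")
legend("topright", legend = names(km$strata), col = c("red", "blue"), lty = 1)

# Log-rank test for a difference between the two curves.
lr <- survdiff(Surv(time, censor) ~ group, data = surv_toy)
p_logrank <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
```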

Cox Proportional Hazards Model
(gene ratio high/low group, cutoff = -12)

fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)
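To show what summary(fitcox) reports, here is a sketch on simulated data that extracts the hazard ratio and its 95% confidence interval directly. The data frame cox_toy and its values are invented; setting "low" as the reference level is an assumption made so that the coefficient reads as high versus low.

```r
library(survival)

set.seed(8)
n <- 80
# "low" is the reference level, so the coefficient compares high vs low.
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
time <- rexp(n, rate = ifelse(group == "high", 0.2, 0.05))
censor <- rbinom(n, 1, 0.8)           # 1 = event observed, 0 = censored
cox_toy <- data.frame(time, censor, group)

fitcox <- coxph(Surv(time, censor) ~ group, data = cox_toy)

# Hazard ratio (exp of the coefficient) and its 95% confidence interval.
hr <- exp(coef(fitcox))
ci <- exp(confint(fitcox))
```

A hazard ratio above 1 means the high ratio group has a higher estimated rate of recurrence at any given time than the low ratio group.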

Sample Output (Cox)

Validation: GSE6532

• The link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Number of total markers: 54675
• Genes HOXB13, IL17RB and ESTs are included in this dataset
• We use this dataset as validation
• Result: they are not significant on this independent set

pheno lt- pData(phenoData(geodata[[idx]]))

feature lt- as(featureData(geodata[[idx]]) dataframe)

bull Parsing phenotype table to get variable

Age Size DFS censorinfos_df$Age = asnumeric(unlist(strsplit(infos_df$X9 split = =))[seq(2 2 n 2)])

infos_df$Size = asnumeric(unlist(strsplit(infos_df$X3 split = =))[seq(2 2 n 2)])

infos_df$DFS = asnumeric(unlist(strsplit(infos_df$X10 split = =))[seq(2 2 n 2)])

infos_df$censor = ifelse(infos_df$status == Status=recur 1 0)

Normalization

• Gene-wise normalization (subtract the median log2 value):

tmp_gm <- apply(geno, 2, median)
geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)

• Sample-wise normalization (divide by the mean value on the original scale):

geno <- apply(geno, c(1, 2), function(x) 2 ^ x)
geno <- t(apply(geno, 1, function(x) x / (mean(x))))
geno <- apply(geno, c(1, 2), function(x) log2(x))
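As a rough illustration of the two normalization steps, the sketch below applies them to a small simulated matrix with genes in rows and samples in columns. Note that the first step takes the median over the genes within each sample (column), and the second divides each gene (row) by its mean across samples. The use of sweep() and rowMeans() is a substitution for the matrix product and apply() calls on the slide, and the simulated data are an assumption.

```r
set.seed(1)
numOfGene <- 5   # rows = genes
n <- 4           # columns = samples
geno <- matrix(rnorm(numOfGene * n, mean = 8), numOfGene, n)

# Step 1: subtract each sample's (column's) median log2 value.
# sweep() does the same job as the ones-vector %*% medians construction.
tmp_gm <- apply(geno, 2, median)
geno_centered <- sweep(geno, 2, tmp_gm, FUN = "-")

# Step 2: divide each gene (row) by its mean on the original scale,
# then return to the log2 scale.
geno_raw <- 2 ^ geno_centered
geno_scaled <- geno_raw / rowMeans(geno_raw)
geno_norm <- log2(geno_scaled)

# Checks: column medians are 0 after step 1; row means are 1 on the raw scale.
print(apply(geno_centered, 2, median))
print(rowMeans(2 ^ geno_norm))
```

Working on whole matrices avoids the element-by-element apply(geno, c(1, 2), ...) calls, which give the same result but are slower on large expression matrices.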

Variance-Based Filtering

• Calculate the variance of each gene and keep the genes in the top 25% by variance (above the 75th percentile):

var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
feature_var_filtered <- feature[var_filtered_idx, ]
geno_var_filtered <- geno[var_filtered_idx, ]
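A small simulated example may help show what the variance filter keeps. The matrix size, gene names, and the one-column feature table below are assumptions made for the sketch.

```r
set.seed(2)
geno <- matrix(rnorm(40 * 6), nrow = 40, ncol = 6,
               dimnames = list(paste0("gene", 1:40), paste0("s", 1:6)))
feature <- data.frame(ID = rownames(geno), stringsAsFactors = FALSE)

# Per-gene variance across samples, and a logical index of the top 25%.
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)

# drop = FALSE keeps the matrix / data frame shape even if one row survives.
geno_var_filtered <- geno[var_filtered_idx, , drop = FALSE]
feature_var_filtered <- feature[var_filtered_idx, , drop = FALSE]

print(dim(geno_var_filtered))
```

With 40 genes and no tied variances, the strict inequality keeps the 10 most variable genes.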

T-Test-Based Filtering

• For each gene, run a t-test between the recurrence and non-recurrence groups. The status variable indicates the group:

tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value

• Filter the genes by the P-value cutoff:

ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
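The slide shows only the body of the per-gene loop. One way the full loop might look is sketched below on simulated data; the group sizes, the shifted gene, and the cutoff value are assumptions chosen so that the example has a clear answer.

```r
set.seed(3)
n_gene <- 20
status <- factor(rep(c("recur", "non-recur"), each = 10))
geno_var_filtered <- matrix(rnorm(n_gene * 20), nrow = n_gene)
# Shift gene 1 upward in the recurrence group so it should pass the filter.
geno_var_filtered[1, status == "recur"] <-
  geno_var_filtered[1, status == "recur"] + 5

cutoff <- 0.001  # hypothetical cutoff for this toy data
pvalue_list <- numeric(n_gene)
for (i in seq_len(n_gene)) {
  # One small data frame per gene: expression values plus group labels.
  sdata <- data.frame(gene_express = geno_var_filtered[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata,
                     alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

ttest_filtered_idx <- which(pvalue_list < cutoff)
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, , drop = FALSE]
print(ttest_filtered_idx)
```

Because one test is run per gene, a strict cutoff is a simple guard against false positives; a formal multiple-testing correction such as p.adjust() would be an alternative, though it is not what the slides use.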

Sample Results (GSE1378, microdissected; cutoff 0.0011)

Sample Results (GSE1379, whole tissue dataset; cutoff 0.0011)

Statistical Modeling (examples)

Outline

• Select the genes that overlap between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and dendrogram
• Univariate logistic regression for the selected genes and the two-gene ratio predictor
• Multivariate logistic regression (size and the other two potential predictors)
• Survival analysis, part 1: Kaplan-Meier plot
• Survival analysis, part 2: Cox proportional hazards model

Overlapped Genes

• In the preprocessing step, we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379
• We use the genes that overlap between these two DEG tables for the subsequent analysis
• GSE1378: micro-dissected breast cancer cells (LCM)
• GSE1379: whole tissue sections
• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111), and AI240933 (EST)
• We will study the prognostic value of these markers
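Selecting the overlap amounts to intersecting the accession columns of the two DEG tables. The sketch below assumes each table has a GB_ACC column, as in the feature table used elsewhere in these slides; the table contents are hypothetical, and "X00001" and "X00002" are made-up non-overlapping entries.

```r
# Hypothetical DEG tables; only the GB_ACC column matters here.
deg_1378 <- data.frame(
  GB_ACC = c("AI208111", "BC007092", "AF208111", "AI240933", "X00001"),
  stringsAsFactors = FALSE
)
deg_1379 <- data.frame(
  GB_ACC = c("BC007092", "X00002", "AI240933", "AF208111", "AI208111"),
  stringsAsFactors = FALSE
)

# Accessions present in both tables, and the matching rows of the first table.
overlap_acc <- intersect(deg_1378$GB_ACC, deg_1379$GB_ACC)
overlap_table <- deg_1378[deg_1378$GB_ACC %in% overlap_acc, , drop = FALSE]

print(overlap_acc)
```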

Heatmap and Dendrogram

• We use a heatmap and dendrogram to visually check the relationship (correlation) among genes or samples
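The dendrogram beside a heatmap comes from hierarchical clustering of the rows or columns. As a hedged sketch, the code below clusters six simulated genes using a correlation-based distance; the distance and linkage choices are assumptions for illustration and are not necessarily those behind the slides' figures.

```r
set.seed(4)
# Toy expression matrix: 6 genes x 8 samples. Genes 1-3 follow an increasing
# pattern and genes 4-6 a decreasing one, so two clusters are expected.
pattern_a <- 1:8
pattern_b <- 8:1
geno <- rbind(
  t(replicate(3, pattern_a + rnorm(8, sd = 0.1))),
  t(replicate(3, pattern_b + rnorm(8, sd = 0.1)))
)
rownames(geno) <- paste0("gene", 1:6)

# Correlation-based distance between genes, then hierarchical clustering.
gene_dist <- as.dist(1 - cor(t(geno)))
gene_tree <- hclust(gene_dist, method = "average")
clusters <- cutree(gene_tree, k = 2)
print(clusters)

# heatmap(geno, scale = "row") would draw a heatmap with row and column
# dendrograms (using its own default distance and linkage).
```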

Heatmap (microdissected, GSE1378): consistent with the paper

Heatmap (whole tissue section, GSE1379)

Model Set 1

• Univariate logistic regression for each gene
  – Response variable: recur/non-recur status
  – Predictor: one of the overlapped genes: HOXB13, IL17BR (AF208111), AI240933 (EST)

Model Set 2

• Univariate logistic regression for the ratio of genes
  – Response variable: recur/non-recur status
  – Predictor: HOXB13:IL17BR

Model Set 3

• Multivariate logistic regression
  – Response variable: recur/non-recur
  – Predictors: tumor size, HOXB13:IL17BR, PGR, and ERBB2

Model Set 4

• Survival model
  – Response variable: DFS (disease-free survival time), censor
  – Predictor: use “-intercept/beta” from the logistic regression as the cutoff to divide the samples into two groups: a high ratio group and a low ratio group
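The “-intercept/beta” cutoff is the ratio value at which the fitted logistic model predicts a recurrence probability of 0.5, because the linear predictor intercept + beta * x is zero there. A minimal sketch on simulated data follows; the sample size and the true coefficients 0.5 and 1.5 are assumptions.

```r
set.seed(5)
n <- 200
gene_ratio <- rnorm(n)
# Simulate recurrence so that higher ratios mean higher recurrence odds.
status <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * gene_ratio))

fit <- glm(status ~ gene_ratio, family = binomial(link = "logit"))
b <- coef(fit)
cutoff <- unname(-b["(Intercept)"] / b["gene_ratio"])

# At the cutoff the fitted recurrence probability is 0.5.
p_at_cutoff <- predict(fit, newdata = data.frame(gene_ratio = cutoff),
                       type = "response")

# Samples above the cutoff form the high ratio group, the rest the low group.
ratio_group <- ifelse(gene_ratio > cutoff, "high", "low")
print(cutoff)
print(table(ratio_group))
```

When beta is positive, as simulated here, the high ratio group is the one with predicted recurrence probability above 0.5.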

Important Note

• Please remember there are two datasets: GSE1378 and GSE1379
• The same sets of models can be fit on both datasets
• You need to set the working dataset variable:

working_dataset = "GSE1379"  # whole tissue section: GSE1379
working_dataset = "GSE1378"  # microdissected breast cancer cells: GSE1378

• We use the working dataset GSE1378 as the example

Univariate Logistic Regression for Each Gene

• As an example, we check the gene HOXB13:

library(ROCR)  # provides prediction() and performance()
gb_acc = "BC007092"  # HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
logit_data = data.frame(status = infos_df$status, gene = geno_selected)
fit <- glm(status ~ geno_selected, data = logit_data, family = binomial(link = "logit"))
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
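To make the AUC less of a black box, the sketch below computes it in base R from the fitted probabilities, without ROCR. It uses the rank-sum identity: the AUC equals the probability that a randomly chosen recurrence case receives a higher predicted probability than a randomly chosen non-recurrence case. The simulated data and the helper name auc_rank are assumptions.

```r
set.seed(6)
n <- 100
gene <- rnorm(n)
status <- rbinom(n, 1, plogis(1.2 * gene))   # 1 = recur, 0 = non-recur

fit <- glm(status ~ gene, family = binomial(link = "logit"))
p <- predict(fit, type = "response")

# AUC from ranks (ties receive average ranks, i.e. count as 1/2).
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc <- auc_rank(p, status)

# ROC points: true and false positive rates at every observed threshold.
thr <- sort(unique(p), decreasing = TRUE)
tpr <- sapply(thr, function(t) mean(p[status == 1] >= t))
fpr <- sapply(thr, function(t) mean(p[status == 0] >= t))
# plot(c(0, fpr), c(0, tpr), type = "s", xlab = "FPR", ylab = "TPR")

print(auc)
```

Note that this AUC is computed on the same samples used to fit the model, so it tends to be optimistic compared with performance on an independent dataset.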

Sample Output (gene HOXB13)

ROC (AUC = 0.796, gene HOXB13)

Univariate Logistic Regression (HOXB13:IL17BR)

gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
# on the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)
# fit the model
fit <- glm(status ~ gene_ratio, data = logit_data, family = binomial(link = "logit"))
summary(fit)
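The comment that the ratio is a difference on the log2 scale follows from log2(a / b) = log2(a) - log2(b). A tiny check with made-up intensities:

```r
# Made-up intensities on the original (non-log) scale.
hoxb13_raw <- c(8, 2, 16)
il17br_raw <- c(2, 4, 16)

# The log2 of the ratio equals the difference of the log2 values.
log_ratio_direct <- log2(hoxb13_raw / il17br_raw)
log_ratio_diff   <- log2(hoxb13_raw) - log2(il17br_raw)

print(log_ratio_diff)   # 2 -1 0
```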

Sample Output (HOXB13:IL17BR)

ROC (AUC = 0.84, HOXB13:IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)

gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# on the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ gene_ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)

Sample Output (Multivariate)

ROC (AUC = 0.86, Multivariate)

Kaplan-Meier Plot (gene ratio high/low group, cutoff = -12)

Cox Proportional Hazards Model (gene ratio high/low group, cutoff = -12)

fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)

Sample Output (Cox)

Validation: GSE6532

• Link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Total number of markers: 54675
• The genes HOXB13 and IL17RB and the ESTs are included in this dataset
• We use this dataset for validation
• Result: the markers are not significant on this independent set


Kaplan-Meier Plot

• We use the Kaplan-Meier plot and the log-rank test to check whether survival time differs significantly between groups (say, the high/low ratio groups)

ratiosurv <- survfit(Surv(time, censor) ~ ratio_group, data = toy_data)
autoplot(ratiosurv, pVal = TRUE, pX = 0.25, pY = 0.25, title = paste0("Kaplan-Meier plot of toy example"), yLab = "Survival Probability")
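The log-rank test named in the bullet does not appear as code on the slide. A minimal sketch using survdiff() from the survival package is given below; the simulated toy_data, the hazard rates, and the censoring window are assumptions, and base plot() stands in for the autoplot() call.

```r
library(survival)  # recommended package shipped with standard R installations

set.seed(7)
n <- 60
ratio_group <- rep(c("high", "low"), each = n / 2)
# Made-up survival times: the high ratio group fails faster on average.
event_time <- rexp(n, rate = ifelse(ratio_group == "high", 0.10, 0.03))
censor_time <- runif(n, 5, 60)
toy_data <- data.frame(
  time = pmin(event_time, censor_time),
  censor = as.numeric(event_time <= censor_time),  # 1 = event, 0 = censored
  ratio_group = ratio_group
)

# Kaplan-Meier estimate per group and the log-rank test between groups.
ratiosurv <- survfit(Surv(time, censor) ~ ratio_group, data = toy_data)
logrank <- survdiff(Surv(time, censor) ~ ratio_group, data = toy_data)
p_value <- 1 - pchisq(logrank$chisq, df = length(logrank$n) - 1)

print(ratiosurv)
print(p_value)
# plot(ratiosurv, col = 1:2, xlab = "Time", ylab = "Survival Probability")
```

Here censor = 1 marks an observed event, which matches the coding on the parsing slide, where censor is 1 for recurrence.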

Cox Proportional Hazards Model

• We use the high/low ratio group to predict the survival probability. Here the response is the survival time together with the censoring information

fitcox <- coxph(Surv(time, censor) ~ group, data = toy_data)
summary(fitcox)
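For readers who want to run the Cox fit without the course data, here is a hedged sketch on simulated data. The hazard rates, sample size, and factor coding are assumptions; the quantity of interest is the hazard ratio exp(coef) of the high group relative to the low group.

```r
library(survival)  # recommended package shipped with standard R installations

set.seed(8)
n <- 80
# "low" is the reference level, so the coefficient compares high vs. low.
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time <- rexp(n, rate = ifelse(group == "high", 0.09, 0.03))
censor_time <- runif(n, 5, 60)
toy_data <- data.frame(
  time = pmin(event_time, censor_time),
  censor = as.numeric(event_time <= censor_time),  # 1 = event, 0 = censored
  group = group
)

fitcox <- coxph(Surv(time, censor) ~ group, data = toy_data)
hazard_ratio <- exp(coef(fitcox))

# Hazard ratio with its 95% confidence interval.
print(summary(fitcox)$conf.int)
```

A hazard ratio above 1 would indicate that the high group experiences events at a higher rate than the low group at any given time, under the proportional hazards assumption.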

Cox Model Result

Data Downloading, Processing, and Analysis



Big Data Training for Translational Omics Research

Cox Proportional Odds Model

bull We use highlow ratio group to predict the

survival probability Here the response is

the survival time and the censor

information

fitcox lt- coxph(Surv(timecensor) ~ group data = toy_data)

summary(fitcox)

Big Data Training for Translational Omics Research

Cox Model Result

Big Data Training for Translational Omics Research

Data Downloading Processing

and Analysis

Big Data Training for Translational Omics Research

Outline

bull Download data

bull Parsing data

bull Normalization

bull Variance based filtering (top 25)

bull T test based filtering(based on the P-value cutoff)

The above steps are implemented in

ldquoget_DEG_tableRrdquo script

Big Data Training for Translational Omics Research

Data Availability

bull Microdissected dataset GSE1378

httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE

1378

bull Whole tissue dataset GSE1379

httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE

1379

bull The easiest way to download data is using ldquogetGEOrdquo

function from ldquoGEOqueryrdquo package

Big Data Training for Translational Omics Research

Use ldquogetGEOrdquo to Download Databull We have downloaded the data you can use

ldquogetGEOrdquo function to get data locally or online

bull Local (loading_method = lsquolocalrsquo)geo_Name lt- lsquoGSE1378rsquo

geodata2 lt-getGEO(filename paste0(geo_datageo_Name_series_matrixtxtgz) GSEMatrix = TRUE)

bull Online (loading_method = lsquoonlinersquo)geodata lt- getGEO(geo_Name GSEMatrix = TRUEdestdir = geo_data)

bull You can set loading_method variable in the get_DEG_table function to rdquolocalrdquo or ldquoonlinerdquo to change the way of downloading data

bull Note that the downloaded geno matrix is in log2scale

Big Data Training for Translational Omics Research

Parsing Data

bull Extract the geno matrix pheno table and

feature tableidx lt- 1 geno lt- assayData(geodata[[idx]])$exprs

pheno lt- pData(phenoData(geodata[[idx]]))

feature lt- as(featureData(geodata[[idx]]) dataframe)

bull Parsing phenotype table to get variable

Age Size DFS censorinfos_df$Age = asnumeric(unlist(strsplit(infos_df$X9 split = =))[seq(2 2 n 2)])

infos_df$Size = asnumeric(unlist(strsplit(infos_df$X3 split = =))[seq(2 2 n 2)])

infos_df$DFS = asnumeric(unlist(strsplit(infos_df$X10 split = =))[seq(2 2 n 2)])

infos_df$censor = ifelse(infos_df$status == Status=recur 1 0)

Normalization

• Gene-wise normalization (within each sample, subtract the median log2 value taken across genes)
    tmp_gm <- apply(geno, 2, median)
    geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)

• Sample-wise normalization (each gene is divided by its mean value across samples, in the original scale)
    geno <- apply(geno, c(1, 2), function(x) 2 ^ x)
    geno <- t(apply(geno, 1, function(x) x / (mean(x))))
    geno <- apply(geno, c(1, 2), function(x) log2(x))
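
The two normalization steps can also be written with sweep(), which avoids building the helper matrices. The sketch below runs on a small simulated matrix (rows are genes, columns are samples); the names geno_centered, raw and geno_norm are illustrative and are not taken from the course script.

```r
set.seed(1)
geno <- matrix(rnorm(20, mean = 8), nrow = 5, ncol = 4)  # 5 genes x 4 samples, log2 scale

# Step 1: subtract each sample's (column's) median log2 value
geno_centered <- sweep(geno, 2, apply(geno, 2, median), "-")

# Step 2: back to the original scale, divide each gene (row) by its mean,
# then return to the log2 scale
raw <- 2 ^ geno_centered
geno_norm <- log2(sweep(raw, 1, rowMeans(raw), "/"))
```

After step 1 every column has median 0; after step 2 every gene has mean 1 in the original scale.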

Variance Based Filtering

• Calculate the variance of each gene and keep the top 25% of genes by variance (those above the 75th percentile)
    var_geno <- apply(geno, 1, var)
    var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
    feature_var_filtered <- feature[var_filtered_idx, ]
    geno_var_filtered <- geno[var_filtered_idx, ]
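
As a quick check of the variance filter, the following sketch applies it to simulated data. The 40 x 6 matrix and the one-column feature table are made up for illustration.

```r
set.seed(11)
geno <- matrix(rnorm(40 * 6), nrow = 40, ncol = 6)   # 40 genes x 6 samples (simulated)
feature <- data.frame(ID = paste0("gene", seq_len(40)))

var_geno <- apply(geno, 1, var)                       # one variance per gene (row)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)

# drop = FALSE keeps the matrix / data.frame shape even if few rows survive
feature_var_filtered <- feature[var_filtered_idx, , drop = FALSE]
geno_var_filtered <- geno[var_filtered_idx, , drop = FALSE]
```

With 40 genes, 10 pass the filter. The drop = FALSE argument is an addition here: without it, a one-column feature table would collapse to a plain vector.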

T-Test Based Filtering

• For each gene, run a t-test between the recurrence and non-recurrence groups. The status variable indicates the group.
    tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
    pvalue_list[i] <- tmp_test$p.value

• Filter the genes by the P-value cutoff
    ttest_filtered_idx <- which(pvalue_list < cutoff)
    feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
    geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
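
The slide shows only the body of the per-gene loop. A minimal sketch of the full loop, on simulated data, might look like the following. The data, the cutoff of 0.001 and the shift added to gene 1 are assumptions made for illustration.

```r
set.seed(42)
n_gene <- 50
n_sample <- 20
geno_var_filtered <- matrix(rnorm(n_gene * n_sample), nrow = n_gene)
status <- factor(rep(c("recur", "non-recur"), each = n_sample / 2))

# Make gene 1 clearly different between the two groups
is_recur <- status == "recur"
geno_var_filtered[1, is_recur] <- geno_var_filtered[1, is_recur] + 5

# One two-sided t-test per gene
pvalue_list <- numeric(n_gene)
for (i in seq_len(n_gene)) {
  sdata <- data.frame(gene_express = geno_var_filtered[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

cutoff <- 0.001
ttest_filtered_idx <- which(pvalue_list < cutoff)
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, , drop = FALSE]
```

Because many genes are tested, some small P-values are expected by chance alone, which is one reason to use a strict cutoff.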

Sample Results
(GSE1378, microdissected dataset, cutoff 0.0011)

Sample Results
(GSE1379, whole tissue dataset, cutoff 0.0011)

Statistical Modeling (examples)

Outline

• Select the overlapped genes between GSE1378 and GSE1379 for subsequent analysis

• Heatmap and dendrogram

• Univariate logistic regression for the selected genes and the two-gene ratio predictor

• Multivariate logistic regression (size and the other two potential predictors)

• Survival analysis part 1: Kaplan-Meier plot

• Survival analysis part 2: Cox proportional hazards model

Overlapped Genes

• In the preprocessing step we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379

• We use the genes that overlap between these two DEG tables for the subsequent analysis

• GSE1378: micro-dissected breast cancer cells (LCM)

• GSE1379: whole tissue section

• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111) and AI240933 (EST)

• We will study the prognostic value of these markers

Heatmap and Dendrogram

• We use a heatmap and dendrogram to visually check the relationship (correlation) among genes or samples
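
The slides do not show how the dendrogram is built. One common choice, assumed here, is hierarchical clustering on a correlation distance. The sketch uses a simulated expression matrix in which gene2 is constructed to track gene1; all names are hypothetical.

```r
set.seed(5)
expr <- matrix(rnorm(6 * 10), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("s", 1:10)))
expr["gene2", ] <- expr["gene1", ] + rnorm(10, sd = 0.1)  # gene2 tracks gene1

# Correlation distance between genes: highly correlated genes are close
gene_dist <- as.dist(1 - cor(t(expr)))
hc <- hclust(gene_dist, method = "average")

# Cut the tree into 3 groups; gene1 and gene2 should land in the same group
clusters <- cutree(hc, k = 3)
```

Calling plot(hc) draws the dendrogram, and base R's heatmap(expr) draws a heatmap with row and column dendrograms attached (by default it clusters on Euclidean rather than correlation distance).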

Heatmap (microdissected, GSE1378)

Consistent with the paper

Heatmap (whole tissue section, GSE1379)

Model Set 1

• Univariate logistic regression for each gene
  – Response variable: recur/non-recur status
  – Predictors: one of the overlapped genes HOXB13, IL17BR (AF208111), AI240933 (EST)

Model Set 2

• Univariate logistic regression for the ratio of genes
  – Response variable: recur/non-recur status
  – Predictor: HOXB13/IL17BR

Model Set 3

• Multivariate logistic regression
  – Response variable: recur/non-recur
  – Predictors: tumor size, HOXB13/IL17BR, PGR and ERBB2

Model Set 4

• Survival model
  – Response variables: DFS (disease-free survival time) and censor
  – Predictor: use “-intercept/beta” from the logistic regression as the cutoff to divide the samples into two groups, a high ratio group and a low ratio group
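
The “-intercept/beta” cutoff in Model Set 4 is the predictor value at which the fitted logistic model gives a probability of exactly 0.5, because the log-odds intercept + beta * x equal zero there. The sketch below illustrates this on simulated data; the coefficients 0.5 and 1.5 used to generate the data are arbitrary assumptions.

```r
set.seed(2)
n <- 80
ratio <- rnorm(n)                                   # simulated log2(HOXB13/IL17BR)
status <- rbinom(n, 1, plogis(0.5 + 1.5 * ratio))   # 1 = recur (simulated)

fit <- glm(status ~ ratio, family = binomial(link = "logit"))
b <- coef(fit)
cutoff <- unname(-b[1] / b[2])                      # "-intercept/beta"

# The fitted probability at the cutoff is 0.5
p_at_cutoff <- predict(fit, newdata = data.frame(ratio = cutoff), type = "response")

# Split the samples into a high ratio group and a low ratio group
group <- ifelse(ratio > cutoff, "high", "low")
```

When beta is positive, samples in the "high" group are those whose predicted recurrence probability exceeds 0.5.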

Important Note

• Please remember that there are two datasets: GSE1378 and GSE1379

• The same sets of models can be fitted on both datasets

• You need to set the working dataset variable
    working_dataset = "GSE1378"   # whole tissue section: GSE1379
                                  # microdissected breast cancer cells: GSE1378

• We use the working dataset GSE1378 as the example

Univariate Logistic Regression for Each Gene

• As an example, we check the gene HOXB13
    library(ROCR)   # prediction(), performance()
    gb_acc = "BC007092"   # HOXB13
    geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
    # factor() is needed in R >= 4.0, where data.frame() no longer converts strings
    logit_data = data.frame(status = factor(infos_df$status), gene = geno_selected)
    # geno_selected is taken from the workspace (same values as logit_data$gene)
    fit <- glm(status ~ geno_selected, data = logit_data, family = binomial(link = "logit"))
    p <- predict(fit, type = "response")
    pr <- prediction(p, infos_df$status)
    prf <- performance(pr, measure = "tpr", x.measure = "fpr")
    plot(prf, main = paste0("ROC plot of gene ", gb_acc))
    auc <- performance(pr, measure = "auc")
    auc <- auc@y.values[[1]]
    auc
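
To make the AUC less of a black box, it can be computed directly from the fitted probabilities: it is the probability that a randomly chosen recurrence case gets a higher score than a randomly chosen non-recurrence case. The sketch below uses base R only (not ROCR) and simulated data, so the numbers are unrelated to the HOXB13 result.

```r
set.seed(7)
n <- 60
status <- rep(c(0, 1), each = n / 2)        # 1 = recurrence (simulated)
gene <- rnorm(n, mean = status)             # expression shifted in the recurrence group
logit_data <- data.frame(status = status, gene = gene)

fit <- glm(status ~ gene, data = logit_data, family = binomial(link = "logit"))
p <- predict(fit, type = "response")

# Rank-sum (Mann-Whitney) form of the AUC
n1 <- sum(status == 1)
n0 <- sum(status == 0)
auc <- (sum(rank(p)[status == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Same quantity by comparing every case/control pair (ties count one half)
cases <- p[status == 1]
controls <- p[status == 0]
auc_pairs <- mean(outer(cases, controls, ">") + 0.5 * outer(cases, controls, "=="))
```

Both expressions give the same value, which should also agree with the ROCR "auc" measure for the same scores and labels.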

Sample Output (gene HOXB13)

ROC (AUC = 0.796, gene HOXB13)

Univariate Logistic Regression (HOXB13/IL17BR)

    gb_acc1 = "BC007092"   # HOXB13
    gb_acc2 = "AF208111"   # IL17BR
    geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
    geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
    # in the log2 scale, the ratio is the difference
    gene_ratio = geno_selected1 - geno_selected2
    logit_data = data.frame(status = factor(infos_df$status), gene1 = geno_selected1,
                            gene2 = geno_selected2, ratio = gene_ratio)
    # fit the model (gene_ratio is taken from the workspace; same values as logit_data$ratio)
    fit <- glm(status ~ gene_ratio, data = logit_data, family = binomial(link = "logit"))
    summary(fit)

Sample Output (HOXB13/IL17BR)

ROC (AUC = 0.84, HOXB13/IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)

    gb_acc1 = "BC007092"      # HOXB13
    gb_acc2 = "AF208111"      # IL17BR
    gene_name3 = "PGR_3UTR1"  # PGR
    gene_name4 = "BF108852"   # ERBB2
    geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
    geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
    geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
    geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
    # in the log2 scale, the ratio is the difference
    gene_ratio = geno_selected1 - geno_selected2
    logit_data = data.frame(status = factor(infos_df$status), size = infos_df$Size,
                            gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio,
                            gene3 = geno_selected3, gene4 = geno_selected4)
    # fit the multivariate logistic regression
    fit <- glm(status ~ gene_ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
    summary(fit)

Sample Output (Multivariate)

ROC (AUC = 0.86, multivariate)

Kaplan-Meier Plot
(gene ratio high/low group, cutoff = -12)

Cox Proportional Hazards Model
(gene ratio high/low group, cutoff = -12)

    library(survival)   # Surv(), coxph()
    fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
    summary(fitcox)
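
The slides show only the Cox fit. A minimal end-to-end sketch of the survival step (Kaplan-Meier estimate, log-rank test, Cox model) on simulated data follows. The hazard rates and censoring rate are arbitrary assumptions; as in the parsing step, censor = 1 marks an observed event.

```r
library(survival)

set.seed(3)
n <- 100
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))

# Exponential event times; the "high" group is simulated with a higher hazard
time_event <- rexp(n, rate = ifelse(group == "high", 0.2, 0.1))
time_cens <- rexp(n, rate = 0.05)                 # independent censoring times

surv_data <- data.frame(
  time = pmin(time_event, time_cens),
  censor = as.integer(time_event <= time_cens),   # 1 = event observed, 0 = censored
  group = group
)

km <- survfit(Surv(time, censor) ~ group, data = surv_data)      # Kaplan-Meier curves
lr <- survdiff(Surv(time, censor) ~ group, data = surv_data)     # log-rank test
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)    # Cox model
hr <- exp(coef(fitcox))                           # hazard ratio, high vs low
```

plot(km) draws the two survival curves, and summary(fitcox) reports the hazard ratio with its confidence interval and tests.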

Sample Output (Cox)

Validation: GSE6532

• The link to this dataset:
  http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532

• Sample size: 87

• Number of total markers: 54675

• The genes HOXB13, IL17RB and the ESTs are included in this dataset

• We use this dataset for validation

• Result: they are not significant on this independent set

Cox Model Result

Data Downloading, Processing and Analysis

Outline

• Download data

• Parsing data

• Normalization

• Variance based filtering (top 25%)

• T-test based filtering (based on the P-value cutoff)

The above steps are implemented in the “get_DEG_table.R” script

Page 17: Statistical Models - Purdue University · Big Data Training for Translational Omics Research Survival Methods • Kaplan-Meier plot: visually checking the survival curve between groups

Big Data Training for Translational Omics Research

Outline

bull Download data

bull Parsing data

bull Normalization

bull Variance based filtering (top 25)

bull T test based filtering(based on the P-value cutoff)

The above steps are implemented in

ldquoget_DEG_tableRrdquo script

Big Data Training for Translational Omics Research

Data Availability

bull Microdissected dataset GSE1378

httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE

1378

bull Whole tissue dataset GSE1379

httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE

1379

bull The easiest way to download data is using ldquogetGEOrdquo

function from ldquoGEOqueryrdquo package

Big Data Training for Translational Omics Research

Use ldquogetGEOrdquo to Download Databull We have downloaded the data you can use

ldquogetGEOrdquo function to get data locally or online

bull Local (loading_method = lsquolocalrsquo)geo_Name lt- lsquoGSE1378rsquo

geodata2 lt-getGEO(filename paste0(geo_datageo_Name_series_matrixtxtgz) GSEMatrix = TRUE)

bull Online (loading_method = lsquoonlinersquo)geodata lt- getGEO(geo_Name GSEMatrix = TRUEdestdir = geo_data)

bull You can set loading_method variable in the get_DEG_table function to rdquolocalrdquo or ldquoonlinerdquo to change the way of downloading data

bull Note that the downloaded geno matrix is in log2scale

Big Data Training for Translational Omics Research

Parsing Data

bull Extract the geno matrix pheno table and

feature tableidx lt- 1 geno lt- assayData(geodata[[idx]])$exprs

pheno lt- pData(phenoData(geodata[[idx]]))

feature lt- as(featureData(geodata[[idx]]) dataframe)

bull Parsing phenotype table to get variable

Age Size DFS censorinfos_df$Age = asnumeric(unlist(strsplit(infos_df$X9 split = =))[seq(2 2 n 2)])

infos_df$Size = asnumeric(unlist(strsplit(infos_df$X3 split = =))[seq(2 2 n 2)])

infos_df$DFS = asnumeric(unlist(strsplit(infos_df$X10 split = =))[seq(2 2 n 2)])

infos_df$censor = ifelse(infos_df$status == Status=recur 1 0)

Big Data Training for Translational Omics Research

Normalization

bull Gene wise normalization (subtract the

median log2 value)tmp_gm lt- apply(geno 2 median)

geno lt- geno - matrix(rep(1 numOfGene) numOfGene 1)

matrix(tmp_gm 1 n)

bull Sample wise normalization (divided

by mean value in original scale)geno lt- apply(geno c(1 2) function(x) 2 ^ x )

geno lt- t(apply(geno 1 function(x) x (mean(x)) ))

geno lt- apply(geno c(1 2) function(x) log2(x) )

Big Data Training for Translational Omics Research

Variance Based Filtering

bull Calculate the variance for each gene and

choose the top 25 variance based filtering (75th percentile)

var_geno lt- apply(geno 1 var)

var_filtered_idx lt- var_geno gt quantile(var_geno 075)

feature_var_filtered lt- feature[var_filtered_idx]

geno_var_filtered lt- geno[var_filtered_idx]

Big Data Training for Translational Omics Research

T test Based Filtering

bull For each gene do T test between the

recurrence and non-recurrence group

The status variable indicates the group

informationtmp_test lt- ttest(gene_express ~ status data = sdata alternative =

twosided)

pvalue_list[i] lt- tmp_test$pvalue

bull Fitering the gene by the P-value cutoffttest_filtered_idx lt- which(pvalue_list lt cutoff)

feature_ttest_filtered lt- feature_var_filtered[ttest_filtered_idx]

geno_ttest_filtered lt- geno_var_filtered[ttest_filtered_idx]

Big Data Training for Translational Omics Research

Sample Results

(GSE1378microdissected 00011 cutoff)

Big Data Training for Translational Omics Research

Sample Results

(GSE1379 whole tissue dataset cutoff 00011)

Big Data Training for Translational Omics Research

Statistical Modeling(examples)

Big Data Training for Translational Omics Research

Outline

bull Select overlapped genes between GSE1378 and

GSE1379 for subsequent analysis

bull Heatmap and Dendrogram

bull Univariate logistic regression for selected genes and

two-gene ratio predictor

bull Multivariate logistic regression (size and the other two

potential predictors)

bull Survival analysis part 1 Kaplan-Meier plot

bull Survival analysis part 2 Cox proportional odds model

Overlapped Genes

• In the preprocessing step we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379

• We use the genes that appear in both DEG tables for the subsequent analysis

• GSE1378: micro-dissected breast cancer cells (LCM)

• GSE1379: whole tissue sections

• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111), and AI240933 (EST)

• We will study the prognostic value of these markers
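Selecting the overlap amounts to intersecting the accession columns of the two DEG tables. The sketch below assumes each table has a `GB_ACC` column; the tables themselves and the `toy_A` / `toy_B` entries are hypothetical.

```r
# Hypothetical DEG tables; only the GB_ACC column matters here
deg_1378 <- data.frame(GB_ACC = c("BC007092", "AF208111", "AI240933", "toy_A"),
                       stringsAsFactors = FALSE)
deg_1379 <- data.frame(GB_ACC = c("AI240933", "toy_B", "BC007092", "AF208111"),
                       stringsAsFactors = FALSE)

# Accessions present in both tables
overlap_acc <- intersect(deg_1378$GB_ACC, deg_1379$GB_ACC)
overlap_acc
```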

Heatmap and Dendrogram

• We use the heatmap and dendrogram to visually check the relationship (correlation) among genes or samples
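As a rough illustration of how a correlation-based dendrogram and heatmap can be produced in base R, the sketch below builds six toy genes in two anti-correlated groups. The data, the `1 - correlation` distance, and average linkage are assumptions made for this example; the slides do not state which settings were used.

```r
set.seed(3)

# Two hypothetical expression patterns across 8 samples
base_a <- 1:8
base_b <- 8:1

# Genes g1-g3 follow pattern a, genes g4-g6 follow pattern b
expr_toy <- rbind(
  g1 = base_a + rnorm(8, sd = 0.1),
  g2 = base_a + rnorm(8, sd = 0.1),
  g3 = base_a + rnorm(8, sd = 0.1),
  g4 = base_b + rnorm(8, sd = 0.1),
  g5 = base_b + rnorm(8, sd = 0.1),
  g6 = base_b + rnorm(8, sd = 0.1)
)
colnames(expr_toy) <- paste0("s", 1:8)

# Correlation-based distance between genes, then hierarchical clustering
gene_dist <- as.dist(1 - cor(t(expr_toy)))
gene_tree <- hclust(gene_dist, method = "average")
clusters <- cutree(gene_tree, k = 2)

# Heatmap with the gene dendrogram on the rows; rows are scaled for display
heatmap(expr_toy, Rowv = as.dendrogram(gene_tree), scale = "row")
```

Cutting the tree into two clusters recovers the two gene groups, which is the kind of structure the heatmap makes visible.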

Heatmap (microdissected, GSE1378)

Consistent with the paper

Heatmap (whole tissue section, GSE1379)

Model Set 1

• Univariate logistic regression for each gene

– Response variable: recur/non-recur status

– Predictors: one of the overlapped genes: HOXB13, IL17BR (AF208111), AI240933 (EST)

Model Set 2

• Univariate logistic regression for the ratio of genes

– Response variable: recur/non-recur status

– Predictors: HOXB13/IL17BR

Model Set 3

• Multivariate logistic regression

– Response variable: recur/non-recur

– Predictors: tumor size, HOXB13/IL17BR, PGR, and ERBB2

Model Set 4

• Survival model

– Response variable: DFS (disease-free survival time) and censor

– Predictor: use "-intercept/beta" from the logistic regression as the cutoff to divide the samples into two groups, a high ratio group and a low ratio group
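The "-intercept/beta" cutoff is the predictor value at which the fitted logistic model gives a probability of exactly 0.5, since the log-odds intercept + beta * x equal zero there. A minimal sketch on simulated data follows; the sample size and the true coefficients are assumptions for illustration only.

```r
set.seed(7)

# Hypothetical log2 ratio and recurrence status for 80 samples
ratio <- rnorm(80)
status <- rbinom(80, 1, plogis(0.6 + 1.5 * ratio))

fit <- glm(status ~ ratio, family = binomial(link = "logit"))
b <- coef(fit)

# Cutoff where the fitted log-odds are zero, i.e. predicted probability 0.5
cutoff <- unname(-b[1] / b[2])

p_at_cutoff <- unname(predict(fit, newdata = data.frame(ratio = cutoff),
                              type = "response"))

# Split the samples into the two groups used by the survival model
group <- factor(ifelse(ratio > cutoff, "high", "low"),
                levels = c("low", "high"))
table(group)
```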

Important Note

• Please remember that there are two datasets: GSE1378 and GSE1379

• The same sets of models can be fitted on both datasets

• Set the working_dataset variable to choose one of them

working_dataset = "GSE1379"  # whole tissue section: GSE1379

working_dataset = "GSE1378"  # microdissected breast cancer cells: GSE1378

• We use the working dataset GSE1378 as the example

Univariate Logistic Regression for Each Gene

• As an example we check the gene HOXB13

library(ROCR)  # provides prediction() and performance()

gb_acc = "BC007092"  # HOXB13

geno_selected = geno[which(feature$GB_ACC == gb_acc), ]

logit_data = data.frame(status = infos_df$status, gene = geno_selected)

fit <- glm(status ~ gene, data = logit_data, family = binomial(link = "logit"))

p <- predict(fit, type = "response")

pr <- prediction(p, infos_df$status)

prf <- performance(pr, measure = "tpr", x.measure = "fpr")

plot(prf, main = paste0("ROC plot of gene ", gb_acc))

auc <- performance(pr, measure = "auc")

auc <- auc@y.values[[1]]

auc
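For readers without ROCR installed, the same quantities can be sketched in base R. The example below uses simulated data and computes the AUC from ranks (the Mann-Whitney form); the helper `auc_rank` is a hypothetical name introduced here, not part of ROCR or of the course code.

```r
set.seed(11)

# Hypothetical data: one gene, binary recurrence status
n <- 60
gene <- rnorm(n)
status <- rbinom(n, 1, plogis(-0.5 + 2 * gene))
logit_data <- data.frame(status = status, gene = gene)

fit <- glm(status ~ gene, data = logit_data, family = binomial(link = "logit"))
p <- predict(fit, type = "response")

# AUC as the probability that a random case scores higher than a random control
auc_rank <- function(score, label) {
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(rank(score)[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc <- auc_rank(p, logit_data$status)

# ROC curve: true and false positive rates over all thresholds
thr <- sort(unique(p), decreasing = TRUE)
tpr <- sapply(thr, function(t) mean(p[status == 1] >= t))
fpr <- sapply(thr, function(t) mean(p[status == 0] >= t))
plot(c(0, fpr), c(0, tpr), type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a random classifier
auc
```

With a single predictor and a positive coefficient, the fitted probabilities are a monotone transform of the gene value, so the AUC of the model equals the AUC of the raw expression.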

Sample Output (gene HOXB13)

ROC (AUC = 0.796, gene HOXB13)

Univariate Logistic Regression (HOXB13/IL17BR)

gb_acc1 = "BC007092"  # HOXB13

gb_acc2 = "AF208111"  # IL17BR

geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]

geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]

# in the log2 scale the ratio is the difference

gene_ratio = geno_selected1 - geno_selected2

logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)

# fit the model

fit <- glm(status ~ ratio, data = logit_data, family = binomial(link = "logit"))

summary(fit)
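The comment "in the log2 scale the ratio is the difference" is the identity log2(a / b) = log2(a) - log2(b). A quick numeric check, with made-up expression values:

```r
# Hypothetical raw (non-log) expression values for three samples
hoxb13_raw <- c(8, 2, 16)
il17br_raw <- c(2, 4, 4)

# log2 of the ratio ...
log2_ratio <- log2(hoxb13_raw / il17br_raw)

# ... equals the difference of the log2 values
log2_diff <- log2(hoxb13_raw) - log2(il17br_raw)

log2_ratio  # 2 -1 2
log2_diff   # 2 -1 2
```

This is why the code subtracts the two log2-scale expression vectors rather than dividing them.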

Sample Output (HOXB13/IL17BR)

ROC (AUC = 0.84, HOXB13/IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)

gb_acc1 = "BC007092"  # HOXB13

gb_acc2 = "AF208111"  # IL17BR

gene_name3 = "PGR_3UTR1"  # PGR

gene_name4 = "BF108852"  # ERBB2

geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]

geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]

geno_selected3 = geno[which(feature$GeneName == gene_name3), ]

geno_selected4 = geno[which(feature$GeneName == gene_name4), ]

# in the log2 scale the ratio is the difference

gene_ratio = geno_selected1 - geno_selected2

logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)

# fit the multivariate logistic regression

fit <- glm(status ~ ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))

summary(fit)
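One way to judge whether the extra predictors add anything beyond the gene ratio is a likelihood ratio test between the nested models. The sketch below uses simulated data; the column names mirror the slide, but the values and effect sizes are invented for illustration.

```r
set.seed(13)

# Hypothetical predictors for 100 samples
n <- 100
logit_data <- data.frame(
  ratio = rnorm(n),
  size  = rnorm(n, mean = 2, sd = 0.8),
  gene3 = rnorm(n),
  gene4 = rnorm(n)
)
# Simulated status depends on ratio and size only
logit_data$status <- rbinom(
  n, 1, plogis(-0.3 + 1.2 * logit_data$ratio + 0.5 * (logit_data$size - 2))
)

fit_uni <- glm(status ~ ratio, data = logit_data,
               family = binomial(link = "logit"))
fit_multi <- glm(status ~ ratio + size + gene3 + gene4, data = logit_data,
                 family = binomial(link = "logit"))

# Likelihood ratio (chi-square) test of the nested models
anova(fit_uni, fit_multi, test = "Chisq")
```

The larger model can never have a higher deviance than the nested one on the same data, so an apparent gain in AUC on the training set should be read with that in mind.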

Sample Output (Multivariate)

ROC (AUC = 0.86, Multivariate)

Kaplan-Meier Plot

(gene ratio high/low group, cutoff = -1.2)
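The slides do not include the code for this plot. A minimal sketch with the `survival` package, on simulated data, could look like the following. The event rates, follow-up times, and group sizes are assumptions; as in the parsing step of this deck, `censor` is coded 1 when the event (recurrence) is observed and 0 otherwise.

```r
library(survival)
set.seed(5)

# Hypothetical data: 40 samples per group, higher hazard in the "high" group
n_per_group <- 40
group <- factor(rep(c("low", "high"), each = n_per_group),
                levels = c("low", "high"))
event_time <- rexp(2 * n_per_group, rate = ifelse(group == "high", 0.3, 0.1))
followup <- runif(2 * n_per_group, min = 5, max = 15)

surv_data <- data.frame(
  time = pmin(event_time, followup),            # observed DFS time
  censor = as.integer(event_time <= followup),  # 1 = event observed
  group = group
)

# Kaplan-Meier curves per group
km_fit <- survfit(Surv(time, censor) ~ group, data = surv_data)
plot(km_fit, col = c("blue", "red"),
     xlab = "DFS time", ylab = "Disease-free proportion")
legend("topright", legend = levels(surv_data$group),
       col = c("blue", "red"), lty = 1)

# Log-rank test as the formal comparison of the two curves
lr <- survdiff(Surv(time, censor) ~ group, data = surv_data)
p_logrank <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
p_logrank
```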

Cox Proportional Hazards Model

(gene ratio high/low group, cutoff = -1.2)

library(survival)  # provides Surv() and coxph()

fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)

summary(fitcox)
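To show what the fitted model reports, the sketch below runs `coxph` on simulated two-group data and extracts the hazard ratio. The data are hypothetical; `censor` is again coded 1 for an observed event.

```r
library(survival)
set.seed(17)

# Hypothetical data: the "high" group has three times the hazard of "low"
n_per_group <- 40
group <- factor(rep(c("low", "high"), each = n_per_group),
                levels = c("low", "high"))
event_time <- rexp(2 * n_per_group, rate = ifelse(group == "high", 0.3, 0.1))
followup <- runif(2 * n_per_group, min = 5, max = 15)
surv_data <- data.frame(
  time = pmin(event_time, followup),
  censor = as.integer(event_time <= followup),
  group = group
)

fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)

# exp(coef) is the hazard ratio of "high" relative to the reference "low"
hazard_ratio <- unname(exp(coef(fitcox)))
hazard_ratio

# 95% confidence interval for the hazard ratio
summary(fitcox)$conf.int
```

A hazard ratio above 1 means the high ratio group has events at a higher rate than the low ratio group throughout follow-up, under the proportional hazards assumption.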

Sample Output (Cox)

Validation: GSE6532

• The link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532

• Sample size: 87

• Total number of markers: 54675

• The genes HOXB13 and IL17RB and the ESTs are included in this dataset

• We use this dataset for validation

• Result: the markers are not significant on this independent set


Data Availability

• Microdissected dataset GSE1378: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1378

• Whole tissue dataset GSE1379: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379

• The easiest way to download the data is the "getGEO" function from the "GEOquery" package

Use "getGEO" to Download Data

• We have already downloaded the data; you can use the "getGEO" function to load it locally or fetch it online

• Local (loading_method = 'local')

library(GEOquery)

geo_Name <- 'GSE1378'

geodata2 <- getGEO(filename = paste0(geo_data, geo_Name, "_series_matrix.txt.gz"), GSEMatrix = TRUE)

• Online (loading_method = 'online')

geodata <- getGEO(geo_Name, GSEMatrix = TRUE, destdir = geo_data)

• Set the loading_method variable in the get_DEG_table function to "local" or "online" to change how the data is loaded

• Note that the downloaded geno matrix is in log2 scale

Parsing Data

• Extract the geno matrix, pheno table, and feature table

idx <- 1

geno <- assayData(geodata[[idx]])$exprs

pheno <- pData(phenoData(geodata[[idx]]))

feature <- as(featureData(geodata[[idx]]), "data.frame")

• Parse the phenotype table to get the variables Age, Size, DFS, and censor

infos_df$Age = as.numeric(unlist(strsplit(infos_df$X9, split = "="))[seq(2, 2 * n, 2)])

infos_df$Size = as.numeric(unlist(strsplit(infos_df$X3, split = "="))[seq(2, 2 * n, 2)])

infos_df$DFS = as.numeric(unlist(strsplit(infos_df$X10, split = "="))[seq(2, 2 * n, 2)])

infos_df$censor = ifelse(infos_df$status == "Status=recur", 1, 0)
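The `strsplit` idiom above relies on every entry having the form `key=value`: splitting on "=" and unlisting gives key, value, key, value, and so on, and the even positions are the values. A small sketch with made-up strings (the `Age=...` format is an assumption about how the phenotype columns look):

```r
# Hypothetical phenotype strings in "key=value" form
x9 <- c("Age=55", "Age=61", "Age=47")
n <- length(x9)

# Split on "=", flatten, and keep every second element (the values)
age <- as.numeric(unlist(strsplit(x9, split = "="))[seq(2, 2 * n, 2)])

# Alternative that does not depend on the position bookkeeping:
# remove everything up to and including the first "="
age_alt <- as.numeric(sub("^[^=]*=", "", x9))

age      # 55 61 47
age_alt  # 55 61 47
```

The `sub` variant is a little more robust, because a missing "=" in one entry would shift every later position in the `strsplit` version.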

Normalization

• Gene wise normalization (subtract the median log2 value)

tmp_gm <- apply(geno, 2, median)  # median of each column of geno

geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)

• Sample wise normalization (divide by the mean value in the original scale)

geno <- apply(geno, c(1, 2), function(x) 2 ^ x)  # back to the original scale

geno <- t(apply(geno, 1, function(x) x / mean(x)))  # divide each row of geno by its mean

geno <- apply(geno, c(1, 2), function(x) log2(x))  # return to the log2 scale
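The two steps can be written more compactly with `sweep` and vectorized arithmetic. The sketch below checks on a toy matrix that this gives the same result as the matrix-product form. It assumes rows are genes and columns are samples, so the first step centers each column at its median and the second rescales each row by its mean on the original scale. `geno_toy` is hypothetical.

```r
# Hypothetical log2 matrix: 4 genes (rows) x 3 samples (columns)
geno_toy <- matrix(1:12, nrow = 4)
numOfGene <- nrow(geno_toy)
n <- ncol(geno_toy)

# Step 1: subtract the median of each column
tmp_gm <- apply(geno_toy, 2, median)
centered_mm <- geno_toy -
  matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
centered_sw <- sweep(geno_toy, 2, tmp_gm, "-")  # same result, less bookkeeping

# Step 2: divide each row by its mean on the original scale, then log2 again
lin <- 2 ^ centered_sw
scaled <- lin / rowMeans(lin)  # the vector recycles down columns, so row i uses mean i
geno_norm <- log2(scaled)

geno_norm
```

After the first step every column has median zero, and after the second every row has mean 1 on the original scale.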

Page 20: Statistical Models - Purdue University · Big Data Training for Translational Omics Research Survival Methods • Kaplan-Meier plot: visually checking the survival curve between groups

Big Data Training for Translational Omics Research

Parsing Data

bull Extract the geno matrix pheno table and

feature tableidx lt- 1 geno lt- assayData(geodata[[idx]])$exprs

pheno lt- pData(phenoData(geodata[[idx]]))

feature lt- as(featureData(geodata[[idx]]) dataframe)

bull Parsing phenotype table to get variable

Age Size DFS censorinfos_df$Age = asnumeric(unlist(strsplit(infos_df$X9 split = =))[seq(2 2 n 2)])

infos_df$Size = asnumeric(unlist(strsplit(infos_df$X3 split = =))[seq(2 2 n 2)])

infos_df$DFS = asnumeric(unlist(strsplit(infos_df$X10 split = =))[seq(2 2 n 2)])

infos_df$censor = ifelse(infos_df$status == Status=recur 1 0)

Big Data Training for Translational Omics Research

Normalization

bull Gene wise normalization (subtract the

median log2 value)tmp_gm lt- apply(geno 2 median)

geno lt- geno - matrix(rep(1 numOfGene) numOfGene 1)

matrix(tmp_gm 1 n)

bull Sample wise normalization (divided

by mean value in original scale)geno lt- apply(geno c(1 2) function(x) 2 ^ x )

geno lt- t(apply(geno 1 function(x) x (mean(x)) ))

geno lt- apply(geno c(1 2) function(x) log2(x) )

Big Data Training for Translational Omics Research

Variance Based Filtering

bull Calculate the variance for each gene and

choose the top 25 variance based filtering (75th percentile)

var_geno lt- apply(geno 1 var)

var_filtered_idx lt- var_geno gt quantile(var_geno 075)

feature_var_filtered lt- feature[var_filtered_idx]

geno_var_filtered lt- geno[var_filtered_idx]

Big Data Training for Translational Omics Research

T test Based Filtering

bull For each gene do T test between the

recurrence and non-recurrence group

The status variable indicates the group

informationtmp_test lt- ttest(gene_express ~ status data = sdata alternative =

twosided)

pvalue_list[i] lt- tmp_test$pvalue

bull Fitering the gene by the P-value cutoffttest_filtered_idx lt- which(pvalue_list lt cutoff)

feature_ttest_filtered lt- feature_var_filtered[ttest_filtered_idx]

geno_ttest_filtered lt- geno_var_filtered[ttest_filtered_idx]

Big Data Training for Translational Omics Research

Sample Results

(GSE1378microdissected 00011 cutoff)

Big Data Training for Translational Omics Research

Sample Results

(GSE1379 whole tissue dataset cutoff 00011)

Big Data Training for Translational Omics Research

Statistical Modeling(examples)

Big Data Training for Translational Omics Research

Outline

bull Select overlapped genes between GSE1378 and

GSE1379 for subsequent analysis

bull Heatmap and Dendrogram

bull Univariate logistic regression for selected genes and

two-gene ratio predictor

bull Multivariate logistic regression (size and the other two

potential predictors)

bull Survival analysis part 1 Kaplan-Meier plot

bull Survival analysis part 2 Cox proportional odds model

Big Data Training for Translational Omics Research

Overlapped Genes

bull In the prepossessing step we obtained two DEG tables

for the datasets GSE1378 and GSE1379

bull We used the overlapped genes in this two DEG tables

for the subsequent analysis

bull GSE1378 Micro-dissected breast cancer cell (LCM)

bull GSE1379 Whole tissue section

bull The overlapped genes are HOXB13 (identified twice as

AI208111 and BC007092) IL17BR (AF2080111) and

AI240933 (EST)

bull We will study the prognostic value of these markers

Big Data Training for Translational Omics Research

Heatmap and Dendrogram

bull We use Heatmap and Dendrogram to

Visually check the relationship

(correlation) among genes or samples

Big Data Training for Translational Omics Research

Heatmap(microdissectedGSE1378)

consistent with the paper

Big Data Training for Translational Omics Research

Heatmap(whole section tissue GSE 1379)

Big Data Training for Translational Omics Research

Model Set 1

bull Univariate logistic regression for each

gene

ndash Response variable recurnon-recur

status

ndash Predictors one of the overlapped

genes HOXB13 IL17BR(AF2080111)

AI240933(EST)

Big Data Training for Translational Omics Research

Model Set 2

bull Univariate logistic regression for

ratio of genes

ndash Response variable recurnon-recur

status

ndash Predictors HOXB13IL17BR

Big Data Training for Translational Omics Research

Model Set 3

bull Multivariate logistic regression

ndash Response variable recurnon-

recur

ndash Predictors tumor size

HOXB13IL17BR PGR and ERBB2

Big Data Training for Translational Omics Research

Model Set 4

bull Survival model

ndash Response variable DFS (disease free

survival time) censor

ndash Predictor use ldquo-interceptbetardquo from

logistic regression as the cutoff to

divide the sample into two groups high

ratio group and low ratio group

Big Data Training for Translational Omics Research

Important Note

bull Please remember there are two datasets GSE1378 and GSE1379

bull Can fit the same sets of model on these two datasets

bull Need to set the working dataset variable

working_dataset = GSE1378 whole tissue sectionGSE1379

working_dataset = GSE1378 microdissected breast cancer cells

GSE1378

bull Use working dataset GSE1378 as example

Big Data Training for Translational Omics Research

Univariate Logistic Regression

for Each Gene

bull As an example we check the gene HOXB13gb_acc = BC007092 HOXB13

geno_selected = geno[which(feature$GB_ACC == gb_acc)]

logit_data = dataframe(status = infos_df$statusgene = geno_selected )

fit lt- glm(status~ geno_selecteddata = logit_datafamily = binomial(link = logit))

p lt- predict(fit type=response)

pr lt- prediction(p infos_df$status)

prf lt- performance(pr measure = tpr xmeasure = fpr)

plot(prfmain=paste0(ROC plot of gene gb_acc))

auc lt- performance(pr measure = auc)

auc lt- aucyvalues[[1]]

auc

Big Data Training for Translational Omics Research

Sample Output (gene HOXB13 )

Big Data Training for Translational Omics Research

ROC (auc 0796 gene HOXB13 )

Big Data Training for Translational Omics Research

Univariate Logistic Regression (HOXB13IL17BR)

gb_acc1 = BC007092 HOXB13

gb_acc2 = AF208111 IL17BR

geno_selected1 = geno[which(feature$GB_ACC == gb_acc1)]

geno_selected2 = geno[which(feature$GB_ACC == gb_acc2)]

in the log2 scale the ratio is the difference

gene_ratio = geno_selected1-geno_selected2

logit_data = dataframe(status = infos_df$statusgene1 = geno_selected1 gene2 =

geno_selected2ratio =gene_ratio)

fit the model

fit lt- glm(status~ gene_ratiodata = logit_datafamily = binomial(link = logit))

summary(fit)

Big Data Training for Translational Omics Research

Sample Output(HOXB13IL17BR)

Big Data Training for Translational Omics Research

ROC (auc=084 HOXB13IL17BR)

Big Data Training for Translational Omics Research

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)

gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression (all predictors are columns of logit_data)
fit <- glm(status ~ ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)

Sample Output (Multivariate)

ROC (AUC = 0.86, Multivariate)

Kaplan-Meier Plot
(gene ratio high/low group, cutoff = -12)

Cox Proportional Hazards Model
(gene ratio high/low group, cutoff = -12)

library(survival)  # provides Surv() and coxph()
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)
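The slides show only the Cox fit; the Kaplan-Meier plot and the log-rank test can be obtained from the same `survival` package. The following is a sketch on simulated data, not the course dataset: the rates, sample size, and censoring scheme are invented, and `censor` is assumed to be coded 1 = event observed, 0 = censored, which is what `Surv()` expects for its second argument.

```r
library(survival)  # Surv(), survfit(), survdiff(), coxph()

set.seed(1)
n <- 60
# Simulated data: all names and values here are illustrative only.
surv_data <- data.frame(
  group = factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
)
# Shorter event times in the "high" group (larger hazard rate).
event_time  <- rexp(n, rate = ifelse(surv_data$group == "high", 0.3, 0.1))
censor_time <- runif(n, min = 2, max = 15)
surv_data$time   <- pmin(event_time, censor_time)
surv_data$censor <- as.numeric(event_time <= censor_time)  # 1 = event, 0 = censored

# Kaplan-Meier curves per group.
km <- survfit(Surv(time, censor) ~ group, data = surv_data)
plot(km, col = c("blue", "red"), xlab = "Time", ylab = "Survival probability")
legend("topright", legend = levels(surv_data$group), col = c("blue", "red"), lty = 1)

# Log-rank test for a difference between the two curves.
lr <- survdiff(Surv(time, censor) ~ group, data = surv_data)
print(lr)

# Cox proportional hazards model; exp(coef) is the hazard ratio of high vs. low.
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
hazard_ratio <- exp(coef(fitcox))
print(hazard_ratio)
```

With a single binary predictor, the log-rank test and the score test reported by `summary(fitcox)` address the same question, so the two usually lead to the same conclusion.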

Sample Output (Cox)

Validation: GSE6532

• The link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Number of total markers: 54675
• Genes HOXB13, IL17RB, and ESTs are included in this dataset
• We use this dataset as validation
• Result: They are not significant on this independent set


Normalization

• Gene wise normalization (subtract the median log2 value)
tmp_gm <- apply(geno, 2, median)  # median of each column of geno
geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)

• Sample wise normalization (divided by mean value in original scale)
geno <- apply(geno, c(1, 2), function(x) 2 ^ x)  # back to the original scale
geno <- t(apply(geno, 1, function(x) x / (mean(x))))  # divide each row by its mean
geno <- apply(geno, c(1, 2), function(x) log2(x))  # return to the log2 scale
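The two normalization steps can be checked on a small matrix. The sketch below assumes genes in rows and samples in columns (consistent with `apply(geno, 1, var)` in the filtering step) and uses simulated values; `centered_outer`, `centered_sweep`, `raw`, and `scaled` are hypothetical names. It also shows that `sweep()` gives the same result as the outer-product construction.

```r
set.seed(1)
numOfGene <- 5
n <- 4
# Toy log2 expression matrix: genes in rows, samples in columns (simulated).
geno <- matrix(rnorm(numOfGene * n, mean = 8), nrow = numOfGene, ncol = n)

# Slide version: build a numOfGene x n matrix of column medians and subtract it.
tmp_gm <- apply(geno, 2, median)
centered_outer <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)

# Equivalent with sweep(): subtract tmp_gm[j] from every entry of column j.
centered_sweep <- sweep(geno, 2, tmp_gm, FUN = "-")

# Second step: divide each row by its mean on the original (2^x) scale, then log2 again.
raw <- 2 ^ centered_sweep
scaled <- log2(raw / rowMeans(raw))

print(all.equal(centered_outer, centered_sweep))
print(apply(centered_sweep, 2, median))  # column medians after centering
```

After the first step every column has median zero; after the second step every row has mean one on the original scale.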

Variance Based Filtering

• Calculate the variance for each gene and keep the top 25% of genes by variance (above the 75th percentile)
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
feature_var_filtered <- feature[var_filtered_idx, ]
geno_var_filtered <- geno[var_filtered_idx, ]
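A quick way to see what the filter keeps is to run it on a simulated matrix. This sketch assumes 100 genes by 10 samples of random values; the gene names and the `feature` table are made up for illustration.

```r
set.seed(1)
# Simulated matrix: 100 genes (rows) x 10 samples (columns); names are illustrative.
geno <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)
rownames(geno) <- paste0("gene", 1:100)
feature <- data.frame(GB_ACC = rownames(geno), stringsAsFactors = FALSE)

var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)

# drop = FALSE keeps the matrix / data frame shape even if only one row is left.
geno_var_filtered <- geno[var_filtered_idx, , drop = FALSE]
feature_var_filtered <- feature[var_filtered_idx, , drop = FALSE]

print(dim(geno_var_filtered))
```

The logical index keeps `feature` and `geno` aligned row by row, which matters because later steps look genes up through `feature`.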

T test Based Filtering

• For each gene, do a t-test between the recurrence and non-recurrence groups. The status variable indicates the group information.
# run for each gene i (the loop itself is not shown on the slide)
tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value

• Filter the genes by the P-value cutoff
ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
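The slide shows only the body of the per-gene test. One way the surrounding loop might look is sketched below on simulated data; the matrix size, the shifted gene, and the cutoff of 0.001 are assumptions made for this toy example, not values from the course.

```r
set.seed(1)
n_gene <- 20
n_sample <- 30
# Simulated log2 expression (genes x samples) and a binary status; illustrative only.
geno_var_filtered <- matrix(rnorm(n_gene * n_sample), nrow = n_gene)
status <- factor(rep(c("non-recurrence", "recurrence"), each = n_sample / 2))
# Shift gene 1 in the recurrence group so that it is truly different.
is_recur <- status == "recurrence"
geno_var_filtered[1, is_recur] <- geno_var_filtered[1, is_recur] + 3

pvalue_list <- numeric(n_gene)
for (i in seq_len(n_gene)) {
  sdata <- data.frame(gene_express = geno_var_filtered[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

cutoff <- 0.001  # hypothetical cutoff for this toy example
ttest_filtered_idx <- which(pvalue_list < cutoff)
print(ttest_filtered_idx)
```

Note that `t.test()` uses the Welch (unequal variance) test by default; pass `var.equal = TRUE` if the pooled-variance version is intended.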

Sample Results
(GSE1378, microdissected, cutoff 0.0011)

Sample Results
(GSE1379, whole tissue dataset, cutoff 0.0011)

Statistical Modeling (examples)

Outline

• Select overlapped genes between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and Dendrogram
• Univariate logistic regression for selected genes and two-gene ratio predictor
• Multivariate logistic regression (size and the other two potential predictors)
• Survival analysis part 1: Kaplan-Meier plot
• Survival analysis part 2: Cox proportional hazards model

Overlapped Genes

• In the preprocessing step, we obtained two DEG tables for the datasets GSE1378 and GSE1379
• We used the overlapped genes in these two DEG tables for the subsequent analysis
• GSE1378: Micro-dissected breast cancer cells (LCM)
• GSE1379: Whole tissue section
• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111), and AI240933 (EST)
• We will study the prognostic value of these markers

Heatmap and Dendrogram

• We use a heatmap and dendrogram to visually check the relationship (correlation) among genes or samples
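The slides do not show how the heatmaps were drawn. A minimal base-R sketch, assuming a small simulated genes-by-samples matrix and a correlation-based distance (both are illustrative choices, not the course's settings):

```r
set.seed(1)
# Simulated log2 expression: 6 genes (rows) x 8 samples (columns); illustrative names.
geno <- matrix(rnorm(6 * 8), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("sample", 1:8)))
# Make gene2 track gene1 closely so the two cluster together.
geno["gene2", ] <- geno["gene1", ] + rnorm(8, sd = 0.1)

# Correlation-based distance between genes, then hierarchical clustering.
gene_dist <- as.dist(1 - cor(t(geno)))
gene_tree <- hclust(gene_dist, method = "average")
plot(gene_tree, main = "Gene dendrogram")

# Heatmap with row and column dendrograms (base R clusters both by default).
heatmap(geno, scale = "row")

# Cutting the tree into clusters shows which genes group together.
clusters <- cutree(gene_tree, k = 3)
print(clusters)
```

Note that `heatmap()` clusters with Euclidean distance by default; pass `distfun` and `hclustfun` if the correlation-based tree should drive the heatmap as well.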

Heatmap (microdissected, GSE1378)
consistent with the paper

Heatmap (whole tissue section, GSE1379)

Model Set 1

• Univariate logistic regression for each gene
  – Response variable: recur/non-recur status
  – Predictors: one of the overlapped genes: HOXB13, IL17BR (AF208111), AI240933 (EST)

Model Set 2

• Univariate logistic regression for ratio of genes
  – Response variable: recur/non-recur status
  – Predictors: HOXB13:IL17BR

Model Set 3

• Multivariate logistic regression
  – Response variable: recur/non-recur
  – Predictors: tumor size, HOXB13:IL17BR, PGR, and ERBB2

Model Set 4

• Survival model
  – Response variable: DFS (disease free survival time), censor
  – Predictor: use “-intercept/beta” from logistic regression as the cutoff to divide the samples into two groups: high ratio group and low ratio group
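The cutoff "-intercept/beta" is the ratio value at which the fitted logistic model predicts a recurrence probability of 0.5: setting intercept + beta * x = 0 gives x = -intercept / beta. A sketch on simulated data (the coefficients 0.5 and 1.5, the sample size, and all variable names are assumptions for illustration):

```r
set.seed(1)
n <- 200
# Simulated log2 gene ratio and recurrence status; all values are illustrative.
ratio <- rnorm(n)
status <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * ratio))

fit <- glm(status ~ ratio, family = binomial(link = "logit"))
beta <- coef(fit)

# Ratio value at which the fitted probability of recurrence is exactly 0.5.
cutoff <- unname(-beta["(Intercept)"] / beta["ratio"])

# Check: the fitted probability at the cutoff is 0.5.
p_at_cutoff <- predict(fit, newdata = data.frame(ratio = cutoff), type = "response")

# Split samples into the two groups used in the survival analysis.
group <- factor(ifelse(ratio > cutoff, "high", "low"), levels = c("low", "high"))
print(cutoff)
print(table(group))
```

The resulting `group` factor is the kind of predictor that the Kaplan-Meier plot and the Cox model take as input.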

Important Note

• Please remember there are two datasets: GSE1378 and GSE1379
• The same sets of models can be fitted on these two datasets
• Need to set the working dataset variable (choose one)
working_dataset = "GSE1379"  # whole tissue section
working_dataset = "GSE1378"  # microdissected breast cancer cells
• Use working dataset GSE1378 as example

Univariate Logistic Regression for Each Gene

• As an example, we check the gene HOXB13
library(ROCR)  # provides prediction() and performance()
gb_acc = "BC007092"  # HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
logit_data = data.frame(status = infos_df$status, gene = geno_selected)
# gene is the geno_selected column of logit_data
fit <- glm(status ~ gene, data = logit_data, family = binomial(link = "logit"))
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc





Page 25: Statistical Models - Purdue University · Big Data Training for Translational Omics Research Survival Methods • Kaplan-Meier plot: visually checking the survival curve between groups

Big Data Training for Translational Omics Research

Sample Results

(GSE1379 whole tissue dataset cutoff 00011)

Big Data Training for Translational Omics Research

Statistical Modeling(examples)

Big Data Training for Translational Omics Research

Outline

bull Select overlapped genes between GSE1378 and

GSE1379 for subsequent analysis

bull Heatmap and Dendrogram

bull Univariate logistic regression for selected genes and

two-gene ratio predictor

bull Multivariate logistic regression (size and the other two

potential predictors)

bull Survival analysis part 1 Kaplan-Meier plot

bull Survival analysis part 2 Cox proportional odds model

Big Data Training for Translational Omics Research

Overlapped Genes

bull In the prepossessing step we obtained two DEG tables

for the datasets GSE1378 and GSE1379

bull We used the overlapped genes in this two DEG tables

for the subsequent analysis

bull GSE1378 Micro-dissected breast cancer cell (LCM)

bull GSE1379 Whole tissue section

bull The overlapped genes are HOXB13 (identified twice as

AI208111 and BC007092) IL17BR (AF2080111) and

AI240933 (EST)

bull We will study the prognostic value of these markers

Big Data Training for Translational Omics Research

Heatmap and Dendrogram

bull We use Heatmap and Dendrogram to

Visually check the relationship

(correlation) among genes or samples

Big Data Training for Translational Omics Research

Heatmap(microdissectedGSE1378)

consistent with the paper

Big Data Training for Translational Omics Research

Heatmap(whole section tissue GSE 1379)

Model Set 1

• Univariate logistic regression for each gene
  – Response variable: recurrence/non-recurrence status
  – Predictor: one of the overlapping genes, HOXB13, IL17BR (AF208111), or AI240933 (EST)

Model Set 2

• Univariate logistic regression for the ratio of genes
  – Response variable: recurrence/non-recurrence status
  – Predictor: HOXB13/IL17BR

Model Set 3

• Multivariate logistic regression
  – Response variable: recurrence/non-recurrence status
  – Predictors: tumor size, HOXB13/IL17BR, PGR, and ERBB2

Model Set 4

• Survival model
  – Response variable: DFS (disease-free survival time) and the censoring indicator
  – Predictor: use "-intercept/beta" from the logistic regression as the cutoff to divide the samples into two groups, a high-ratio group and a low-ratio group
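
One way to read the "-intercept/beta" cutoff, assuming it is taken from the univariate logistic model with the gene ratio x as the only predictor: it is the ratio value at which the fitted recurrence probability equals 0.5. Writing β0 for the intercept and β1 for the ratio coefficient:

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x,
\qquad
p = \tfrac{1}{2}
\;\Longleftrightarrow\;
\beta_0 + \beta_1 x = 0
\;\Longleftrightarrow\;
x = -\frac{\beta_0}{\beta_1}.
```

If β1 > 0, samples with x above this cutoff have a fitted probability above 0.5; if β1 < 0, the direction reverses.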

Important Note

• Remember that there are two datasets, GSE1378 and GSE1379
• The same sets of models can be fitted on both datasets
• Set the working dataset variable to choose between them:

# working_dataset = "GSE1379"  # whole tissue section
working_dataset = "GSE1378"  # micro-dissected breast cancer cells

• We use the working dataset GSE1378 as the example

Univariate Logistic Regression for Each Gene

• As an example, we check the gene HOXB13

library(ROCR)  # provides prediction() and performance()
gb_acc = "BC007092"  # HOXB13
# assumes geno has one row per probe (probes x samples), matching the rows of feature
geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
logit_data = data.frame(status = infos_df$status, gene = geno_selected)
fit <- glm(status ~ gene, data = logit_data, family = binomial(link = "logit"))
p <- predict(fit, type = "response")  # fitted recurrence probabilities
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
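
To make the AUC less of a black box, here is a sketch that computes it by hand in base R on simulated data. The variables `gene` and `status` and the simulation coefficients are hypothetical. The rank formula is the Mann-Whitney form of the AUC, which should agree with the area under the ROC curve that a package such as ROCR reports.

```r
set.seed(7)
n <- 100
# Simulated log2 expression and 0/1 recurrence status (hypothetical data).
gene   <- rnorm(n)
status <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * gene))

fit <- glm(status ~ gene, family = binomial(link = "logit"))
p <- predict(fit, type = "response")

# AUC as the Mann-Whitney statistic: the probability that a randomly chosen
# recurrence case has a higher fitted probability than a randomly chosen
# non-recurrence case (ties count one half, via average ranks).
n1 <- sum(status == 1)
n0 <- sum(status == 0)
auc <- (sum(rank(p)[status == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```

An AUC of 0.5 corresponds to a predictor no better than chance, and 1 to perfect separation of the two groups. Because the AUC here is computed on the same samples used to fit the model, it tends to be optimistic.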

Sample Output (gene HOXB13)

ROC (AUC = 0.796, gene HOXB13)

Univariate Logistic Regression (HOXB13/IL17BR)

gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
# assumes geno has one row per probe (probes x samples), matching the rows of feature
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
# on the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1,
                        gene2 = geno_selected2, ratio = gene_ratio)
# fit the model
fit <- glm(status ~ ratio, data = logit_data, family = binomial(link = "logit"))
summary(fit)
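
A sketch of how the fitted ratio model could be turned into the high/low grouping used later in the survival analysis, on simulated data. The names `gene1`, `gene2`, `cutoff` and `group`, the simulation coefficients, and the rule "above the cutoff is high" are assumptions for illustration only.

```r
set.seed(3)
n <- 120
# Simulated log2 expression for two hypothetical genes.
gene1 <- rnorm(n, mean = 8)
gene2 <- rnorm(n, mean = 9)
ratio <- gene1 - gene2  # log2(gene1 / gene2) = log2(gene1) - log2(gene2)
status <- rbinom(n, size = 1, prob = plogis(1 + 1.2 * ratio))

fit <- glm(status ~ ratio, family = binomial(link = "logit"))
b <- coef(fit)

# "-intercept/beta": the ratio at which the fitted probability is 0.5.
cutoff <- unname(-b["(Intercept)"] / b["ratio"])

# Split the samples into a high-ratio and a low-ratio group at the cutoff.
group <- ifelse(ratio > cutoff, "high", "low")
table(group, status)
```

The cross-table gives a quick check of how well the single cutoff separates recurrence from non-recurrence before the grouping is passed on to a survival model.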

Sample Output (HOXB13/IL17BR)

ROC (AUC = 0.84, HOXB13/IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)

gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
# assumes geno has one row per probe (probes x samples), matching the rows of feature
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# on the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size,
                        gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio,
                        gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)

Sample Output (Multivariate)

ROC (AUC = 0.86, multivariate)

Kaplan-Meier Plot

(gene ratio high/low group, cutoff = -12)

Cox Proportional Hazards Model

(gene ratio high/low group, cutoff = -12)

library(survival)  # provides Surv() and coxph()
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)
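
A self-contained sketch of the survival steps (Kaplan-Meier curves, log-rank test, Cox model) on simulated data, using the survival package. The cutoff of 0, the hazard rates, and the censoring scheme are invented for illustration. The sketch also assumes that `censor` is coded 1 for an observed event and 0 for a censored sample, which is what `Surv()` expects for its second argument; the coding in the original `surv_data` is not stated on the slides.

```r
library(survival)

set.seed(42)
n <- 200
# Hypothetical log2 gene ratio, split at a hypothetical cutoff of 0.
ratio <- rnorm(n)
group <- factor(ifelse(ratio > 0, "high", "low"), levels = c("low", "high"))

# Simulated event times: the high-ratio group has three times the hazard.
event_time  <- rexp(n, rate = ifelse(group == "high", 0.3, 0.1))
censor_time <- runif(n, min = 0, max = 15)
surv_data <- data.frame(
  time   = pmin(event_time, censor_time),
  censor = as.integer(event_time <= censor_time),  # 1 = event, 0 = censored
  group  = group
)

# Kaplan-Meier curves per group (strata follow the factor levels: low, high).
km <- survfit(Surv(time, censor) ~ group, data = surv_data)
plot(km, col = c("blue", "red"), xlab = "Time", ylab = "Disease-free survival")
legend("topright", legend = levels(surv_data$group), col = c("blue", "red"), lty = 1)

# Log-rank test and Cox proportional hazards model for the same comparison.
logrank <- survdiff(Surv(time, censor) ~ group, data = surv_data)
fitcox  <- coxph(Surv(time, censor) ~ group, data = surv_data)
exp(coef(fitcox))  # estimated hazard ratio, high vs. low
```

The Kaplan-Meier plot is the visual check, while the log-rank test and the Cox model give the formal comparison; the exponentiated Cox coefficient is the hazard ratio of the high-ratio group relative to the low-ratio group.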

Sample Output (Cox)

Validation: GSE6532

• Link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Total number of markers: 54675
• The genes HOXB13 and IL17RB and the ESTs are included in this dataset
• We use this dataset for validation
• Result: the markers are not significant on this independent set

Page 30: Statistical Models - Purdue University · Big Data Training for Translational Omics Research Survival Methods • Kaplan-Meier plot: visually checking the survival curve between groups

Big Data Training for Translational Omics Research

Heatmap(microdissectedGSE1378)

consistent with the paper

Big Data Training for Translational Omics Research

Heatmap(whole section tissue GSE 1379)

Big Data Training for Translational Omics Research

Model Set 1

bull Univariate logistic regression for each

gene

ndash Response variable recurnon-recur

status

ndash Predictors one of the overlapped

genes HOXB13 IL17BR(AF2080111)

AI240933(EST)

Big Data Training for Translational Omics Research

Model Set 2

bull Univariate logistic regression for

ratio of genes

ndash Response variable recurnon-recur

status

ndash Predictors HOXB13IL17BR

Big Data Training for Translational Omics Research

Model Set 3

bull Multivariate logistic regression

ndash Response variable recurnon-

recur

ndash Predictors tumor size

HOXB13IL17BR PGR and ERBB2

Big Data Training for Translational Omics Research

Model Set 4

bull Survival model

ndash Response variable DFS (disease free

survival time) censor

ndash Predictor use ldquo-interceptbetardquo from

logistic regression as the cutoff to

divide the sample into two groups high

ratio group and low ratio group

Big Data Training for Translational Omics Research

Important Note

bull Please remember there are two datasets GSE1378 and GSE1379

bull Can fit the same sets of model on these two datasets

bull Need to set the working dataset variable

working_dataset = GSE1378 whole tissue sectionGSE1379

working_dataset = GSE1378 microdissected breast cancer cells

GSE1378

bull Use working dataset GSE1378 as example

Big Data Training for Translational Omics Research

Univariate Logistic Regression

for Each Gene

bull As an example we check the gene HOXB13gb_acc = BC007092 HOXB13

geno_selected = geno[which(feature$GB_ACC == gb_acc)]

logit_data = dataframe(status = infos_df$statusgene = geno_selected )

fit lt- glm(status~ geno_selecteddata = logit_datafamily = binomial(link = logit))

p lt- predict(fit type=response)

pr lt- prediction(p infos_df$status)

prf lt- performance(pr measure = tpr xmeasure = fpr)

plot(prfmain=paste0(ROC plot of gene gb_acc))

auc lt- performance(pr measure = auc)

auc lt- aucyvalues[[1]]

auc

Big Data Training for Translational Omics Research

Sample Output (gene HOXB13 )

Big Data Training for Translational Omics Research

ROC (auc 0796 gene HOXB13 )

Big Data Training for Translational Omics Research

Univariate Logistic Regression (HOXB13IL17BR)

gb_acc1 = BC007092 HOXB13

gb_acc2 = AF208111 IL17BR

geno_selected1 = geno[which(feature$GB_ACC == gb_acc1)]

geno_selected2 = geno[which(feature$GB_ACC == gb_acc2)]

in the log2 scale the ratio is the difference

gene_ratio = geno_selected1-geno_selected2

logit_data = dataframe(status = infos_df$statusgene1 = geno_selected1 gene2 =

geno_selected2ratio =gene_ratio)

fit the model

fit lt- glm(status~ gene_ratiodata = logit_datafamily = binomial(link = logit))

summary(fit)

Big Data Training for Translational Omics Research

Sample Output(HOXB13IL17BR)

Big Data Training for Translational Omics Research

ROC (auc=084 HOXB13IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)

gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ gene_ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)
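The coefficients printed by `summary(fit)` are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below shows this on simulated data; the data frame, its columns `ratio` and `size`, and the effect sizes are made up for illustration and do not reproduce the slide's results.

```r
# Simulated stand-in data (assumption): two predictors instead of the slide's four.
set.seed(2)
n <- 200
logit_data <- data.frame(
  ratio = rnorm(n),                      # hypothetical log2 gene ratio
  size  = rnorm(n, mean = 2, sd = 0.5)   # hypothetical tumor size
)
logit_data$status <- rbinom(
  n, 1, plogis(-1 + 1.0 * logit_data$ratio + 0.3 * logit_data$size)
)

fit <- glm(status ~ ratio + size, data = logit_data,
           family = binomial(link = "logit"))

# exp(coefficient) = odds ratio per one-unit increase in the predictor
odds_ratios <- exp(coef(fit))
# Wald 95% confidence intervals, transformed to the odds-ratio scale
ci <- exp(confint.default(fit))
cbind(OR = odds_ratios, ci)
```

Because the ratio is on the log2 scale, its odds ratio describes the change in odds of recurrence for each doubling of HOXB13 relative to IL17BR, holding the other predictors fixed.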

Sample Output (Multivariate)

ROC (AUC = 0.86, multivariate)

Kaplan-Meier Plot

(gene ratio high/low group, cutoff = -12)
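The slides show the resulting plot but not the code that draws it. One way to produce a Kaplan-Meier plot and the accompanying log-rank test with the `survival` package is sketched below. The data are simulated, the cutoff of 0 is arbitrary, and the column names `time`, `censor`, and `group` are chosen to mirror the Cox slide; it is an assumption here that `censor` equals 1 when the event was observed.

```r
library(survival)  # Surv(), survfit(), survdiff()

# Simulated stand-in data (assumption): not the GSE1378 follow-up data.
set.seed(3)
n <- 80
gene_ratio <- rnorm(n)
cutoff <- 0  # hypothetical cutoff, for illustration only
group <- factor(ifelse(gene_ratio > cutoff, "high", "low"),
                levels = c("low", "high"))
true_time <- rexp(n, rate = ifelse(group == "high", 0.10, 0.05))
follow_up <- runif(n, 5, 30)
surv_data <- data.frame(
  time   = pmin(true_time, follow_up),
  censor = as.integer(true_time <= follow_up),  # 1 = event observed, 0 = censored
  group  = group
)

# Kaplan-Meier curves, one per group (strata follow the factor level order)
km <- survfit(Surv(time, censor) ~ group, data = surv_data)
plot(km, col = c("blue", "red"), xlab = "Time",
     ylab = "Disease-free survival probability")
legend("topright", legend = levels(surv_data$group),
       col = c("blue", "red"), lty = 1)

# Log-rank test for a difference between the two curves
lr <- survdiff(Surv(time, censor) ~ group, data = surv_data)
lr
```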

Cox Proportional Hazards Model

(gene ratio high/low group, cutoff = -12)

library(survival)  # provides Surv() and coxph()

# Cox regression of disease-free survival on the high/low ratio group
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)
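The quantity usually read off a Cox fit is the hazard ratio between the groups, which is the exponentiated coefficient. A minimal sketch on simulated data follows; the group sizes and event rates are invented, and `censor` is again assumed to be 1 for an observed event.

```r
library(survival)  # Surv(), coxph(), cox.zph()

# Simulated stand-in data (assumption): not the GSE1378 follow-up data.
set.seed(4)
n <- 100
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
true_time <- rexp(n, rate = ifelse(group == "high", 0.10, 0.05))
follow_up <- runif(n, 5, 30)
surv_data <- data.frame(
  time   = pmin(true_time, follow_up),
  censor = as.integer(true_time <= follow_up),  # 1 = event observed, 0 = censored
  group  = group
)

fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)

hr <- exp(coef(fitcox))        # hazard ratio, high vs. low ratio group
hr_ci <- exp(confint(fitcox))  # 95% confidence interval for the hazard ratio
ph_check <- cox.zph(fitcox)    # test of the proportional hazards assumption

hr
hr_ci
ph_check
```

A hazard ratio above 1 indicates that the high ratio group has events at a faster rate than the low ratio group.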

Sample Output (Cox)

Validation: GSE6532

• The link to this dataset:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Number of total markers: 54675
• Genes HOXB13, IL17RB, and ESTs are included in this dataset
• We use this dataset as validation
• Result: They are not significant on this independent set

Heatmap (whole tissue section, GSE1379)

Model Set 1

• Univariate logistic regression for each gene
– Response variable: recur/non-recur status
– Predictors: one of the overlapped genes: HOXB13, IL17BR (AF208111), AI240933 (EST)

Model Set 2

• Univariate logistic regression for the ratio of genes
– Response variable: recur/non-recur status
– Predictors: HOXB13:IL17BR

Model Set 3

• Multivariate logistic regression
– Response variable: recur/non-recur
– Predictors: tumor size, HOXB13:IL17BR, PGR, and ERBB2

Model Set 4

• Survival model
– Response variable: DFS (disease-free survival time), censor
– Predictor: use “-intercept/beta” from logistic regression as the cutoff to divide the sample into two groups: high ratio group and low ratio group
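A likely rationale for this cutoff, offered as an interpretation rather than something the slides spell out: in the fitted univariate model for the gene ratio x, the predicted recurrence probability p equals 0.5 exactly where the linear predictor is zero. Writing \beta_0 for the fitted intercept and \beta_1 for the fitted slope (the "beta" above):

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x,
\qquad
p = \tfrac{1}{2}
\;\Longleftrightarrow\;
\beta_0 + \beta_1 x = 0
\;\Longleftrightarrow\;
x = -\frac{\beta_0}{\beta_1}
```

When \beta_1 > 0, samples with a ratio above this value have a predicted recurrence probability above 0.5 and form the high ratio group.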

Important Note

• Please remember there are two datasets: GSE1378 and GSE1379
• The same sets of models can be fit on both datasets
• Need to set the working dataset variable

# whole tissue section: GSE1379
# microdissected breast cancer cells: GSE1378
working_dataset = "GSE1378"

• Use working dataset GSE1378 as the example

Page 37: Statistical Models - Purdue University · Big Data Training for Translational Omics Research Survival Methods • Kaplan-Meier plot: visually checking the survival curve between groups

Big Data Training for Translational Omics Research

Univariate Logistic Regression

for Each Gene

bull As an example we check the gene HOXB13gb_acc = BC007092 HOXB13

geno_selected = geno[which(feature$GB_ACC == gb_acc)]

logit_data = dataframe(status = infos_df$statusgene = geno_selected )

fit lt- glm(status~ geno_selecteddata = logit_datafamily = binomial(link = logit))

p lt- predict(fit type=response)

pr lt- prediction(p infos_df$status)

prf lt- performance(pr measure = tpr xmeasure = fpr)

plot(prfmain=paste0(ROC plot of gene gb_acc))

auc lt- performance(pr measure = auc)

auc lt- aucyvalues[[1]]

auc

Big Data Training for Translational Omics Research

Sample Output (gene HOXB13 )

Big Data Training for Translational Omics Research

ROC (auc 0796 gene HOXB13 )

Big Data Training for Translational Omics Research

Univariate Logistic Regression (HOXB13IL17BR)

gb_acc1 = BC007092 HOXB13

gb_acc2 = AF208111 IL17BR

geno_selected1 = geno[which(feature$GB_ACC == gb_acc1)]

geno_selected2 = geno[which(feature$GB_ACC == gb_acc2)]

in the log2 scale the ratio is the difference

gene_ratio = geno_selected1-geno_selected2

logit_data = dataframe(status = infos_df$statusgene1 = geno_selected1 gene2 =

geno_selected2ratio =gene_ratio)

fit the model

fit lt- glm(status~ gene_ratiodata = logit_datafamily = binomial(link = logit))

summary(fit)

Big Data Training for Translational Omics Research

Sample Output(HOXB13IL17BR)

Big Data Training for Translational Omics Research

ROC (auc=084 HOXB13IL17BR)

Big Data Training for Translational Omics Research

Multivariate Logistic Regression(tumor size gene ratio PGR ERBB2)

gb_acc1 = BC007092 HOXB13

gb_acc2 = AF208111 IL17BR

gene_name3 = PGR_3UTR1 PGR

gene_name4 = BF108852 ERBB2

geno_selected1 = geno[which(feature$GB_ACC == gb_acc1)]

geno_selected2 = geno[which(feature$GB_ACC == gb_acc2)]

geno_selected3 = geno[which(feature$GeneName == gene_name3)]

geno_selected4 = geno[which(feature$GeneName == gene_name4)]

in the log2 scale the ratio is the difference

gene_ratio = geno_selected1-geno_selected2

logit_data = dataframe(status = infos_df$statussize = infos_df$Sizegene1 = geno_selected1 gene2 =

geno_selected2ratio =gene_ratiogene3= geno_selected3gene4= geno_selected4)

fit the multinvariate logistic regression

fit lt- glm(status~ gene_ratio+size+gene3+gene4data = logit_datafamily = binomial(link = logit))

summary(fit)

Big Data Training for Translational Omics Research

Sample Output (Multivariate)

Big Data Training for Translational Omics Research

ROC (auc = 086 Multivariate )

Big Data Training for Translational Omics Research

Kaplan-Meier Plot

(gene ratio highlow group cutoff = -12)

Big Data Training for Translational Omics Research

Cox Proportional Odds Model

(gene ratio highlow group cutoff = -12)

fitcox lt- coxph(Surv(timecensor) ~ group data = surv_data)

summary(fitcox)

Big Data Training for Translational Omics Research

Sample Output (Cox)

Big Data Training for Translational Omics Research

Validation: GSE6532

• The link to this dataset:
  http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532

• Sample size: 87

• Total number of markers: 54675

• Genes HOXB13 and IL17RB, along with the ESTs, are included in this dataset

• We use this dataset as a validation set

• Result: they are not significant on this independent set
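Validation on an independent set typically means freezing the model fitted on the discovery data and scoring the new samples with it. The sketch below illustrates that workflow on simulated data only: the validation set is generated with no true signal purely to show what a non-replicating marker can look like, and the helper names `simulate_set` and `auc_rank` are hypothetical. None of this reproduces the GSE6532 analysis.

```r
# Rank-based AUC (Mann-Whitney identity); label must be coded 0/1.
auc_rank <- function(score, label) {
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(rank(score)[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Simulated stand-in data; these are NOT the values used in the slides.
simulate_set <- function(n, beta) {
  ratio <- rnorm(n)
  data.frame(ratio = ratio,
             status = rbinom(n, 1, plogis(-0.5 + beta * ratio)))
}

set.seed(6)
train <- simulate_set(200, beta = 1.2)  # discovery set with a real effect
valid <- simulate_set(87, beta = 0)     # assumed: no effect in the validation set

# Fit on the discovery set only, then score the validation samples.
fit <- glm(status ~ ratio, data = train, family = binomial(link = "logit"))
valid_score <- predict(fit, newdata = valid, type = "response")

auc_train <- auc_rank(fitted(fit), train$status)
auc_valid <- auc_rank(valid_score, valid$status)
print(c(train = auc_train, validation = auc_valid))

# Re-testing the marker within the validation set alone.
fit_valid <- glm(status ~ ratio, data = valid, family = binomial(link = "logit"))
print(summary(fit_valid)$coefficients)
```

A drop in AUC toward 0.5 and a non-significant coefficient in the validation set are the kind of evidence that a marker does not replicate.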
