Big Data Training for Translational Omics Research
Statistical Models
Unit 3 Session 1

Outline
• Logistic regression: ROC curve and AUC
• Linear regression
• Kaplan-Meier plot and log-rank test
• Cox proportional hazards model

Logistic Model
• The logistic model is commonly used in case-control studies.
• Usage scenario: the response is binary, e.g. disease/healthy or recurrence/non-recurrence.
$$\log\frac{p(\text{status} = \text{recurrence})}{1 - p(\text{status} = \text{recurrence})} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$$
• Here $x_i$ are the predictors and $\beta_i$ are the parameters of interest.
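
To make the log-odds formula above concrete, here is a minimal sketch that converts a linear predictor into a probability. The coefficient values and predictor values are hypothetical and chosen only for illustration.

```r
# Hypothetical coefficients, chosen only for illustration.
beta0 <- -1.0   # intercept
beta1 <- 0.8    # change in log-odds per unit of the predictor

# Linear predictor (log-odds) for a few predictor values.
x <- c(-2, 0, 1.25, 3)
log_odds <- beta0 + beta1 * x

# Inverse logit: p = 1 / (1 + exp(-log_odds)).
# plogis() is the base-R function that computes the same quantity.
p <- 1 / (1 + exp(-log_odds))

# exp(beta1) is the odds ratio for a one-unit increase in x.
odds_ratio <- exp(beta1)

print(data.frame(x = x, log_odds = log_odds, p = round(p, 3)))
```

At x = 1.25 the log-odds are exactly 0, so the predicted probability is 0.5.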

Linear Model
• Response: continuous, e.g. weight or gene expression
• Predictors: any variables (e.g. gene expression)
• Model: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \epsilon$
• Assumption: the error terms are i.i.d., $\epsilon \sim N(0, \sigma^2)$
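
As a small illustration of the linear model and its error assumption, the sketch below simulates data from a known line and recovers the coefficients with lm(). The true coefficients (2 and 0.5), the noise level and the sample size are all assumptions made for this example.

```r
set.seed(1)

# Simulate from y = 2 + 0.5 * x + eps, with eps ~ N(0, 0.3^2) (hypothetical values).
n <- 200
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3)

# Ordinary least squares fit.
fit <- lm(y ~ x)
print(coef(fit))

# Residuals estimate the error term; with an intercept they sum to zero.
res <- residuals(fit)
print(summary(res))
```

Plotting res against fitted(fit) is a common informal check of the constant-variance assumption.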

Survival Methods
• Kaplan-Meier plot: visually compare the survival curves between groups
• Cox proportional hazards model and log-rank test: formal statistical tests
• Response: survival time (e.g. DFS) and the censoring indicator
• Predictors: any variables (e.g. group or specific genes)
• Recurrence: censor = 1; non-recurrence: censor = 0
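
The Kaplan-Meier estimate behind the plot can be computed by hand, which shows how the censor coding above is used. This is a minimal base-R sketch on made-up data; the vectors time and censor are hypothetical.

```r
# Hypothetical toy data: follow-up time and indicator (1 = recurrence, 0 = censored).
time   <- c(2, 3, 3, 5, 7, 8, 8, 10)
censor <- c(1, 1, 0, 1, 0, 1, 1, 0)

# Distinct times at which at least one recurrence occurs.
event_times <- sort(unique(time[censor == 1]))

# Number still at risk just before each event time, and number of events at it.
n_at_risk <- sapply(event_times, function(t) sum(time >= t))
n_events  <- sapply(event_times, function(t) sum(time == t & censor == 1))

# Kaplan-Meier estimate: running product of (1 - events / at risk).
surv <- cumprod(1 - n_events / n_at_risk)

km <- data.frame(time = event_times, n_at_risk, n_events, surv)
print(km)
```

Censored samples never trigger a drop in the curve, but they do leave the risk set, which is why the denominator shrinks between event times.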

Load data
• Toy example data
toy_data <- read.csv("toy_example_data.csv")

Logistic Model
• Response: recurrence/non-recurrence status
• Predictor: the expression of gene HOXB13
# logistic regression: use gene HOXB13 to predict the recur/non-recur status
library(ROCR)  # provides prediction() and performance()
fitlogistic <- glm(status ~ gene_HOXB13, data = toy_data, family = binomial(link = "logit"))
summary(fitlogistic)
# plot the ROC curve
p <- predict(fitlogistic, type = "response")
pr <- prediction(p, toy_data$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = "ROC plot of logistic regression")
# calculate the AUC
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
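
For readers who want to see what the AUC measures, the following sketch computes it without any add-on package, using the rank (Mann-Whitney) formulation: the AUC is the fraction of recurrence/non-recurrence pairs in which the recurrence sample receives the higher score. The helper auc_rank and the toy vectors are hypothetical, not part of the course scripts.

```r
# Hypothetical predicted probabilities and true labels (1 = recurrence).
p      <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
status <- c(1,   1,   0,   1,   0,    1,   0,   0)

# AUC from ranks: (sum of positive ranks - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg).
auc_rank <- function(score, label) {
  r <- rank(score)                 # average ranks, so ties count as half
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_rank(p, status)
print(auc)  # 13 of the 16 positive/negative pairs are ordered correctly: 0.8125
```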

[Figure: Logistic Regression Result]

[Figure: ROC Curve]

Linear Model
• Response: expression of HOXB13
• Predictor: expression of IL17BR
# linear model: use gene IL17BR to predict another gene, HOXB13
fitlm <- lm(gene_HOXB13 ~ gene_IL17BR, data = toy_data)
summary(fitlm)

[Figure: Linear Regression Result]

Kaplan-Meier Plot
• We use the Kaplan-Meier plot and the log-rank test to check whether survival time differs significantly between groups (e.g. the high/low ratio groups).
library(survival)  # provides Surv() and survfit()
ratiosurv <- survfit(Surv(time, censor) ~ ratio_group, data = toy_data)
# autoplot() for survfit objects with the pVal/pX/pY/yLab arguments is assumed to come from the survMisc package
autoplot(ratiosurv, pVal = TRUE, pX = 0.25, pY = 0.25, title = paste0("Kaplan-Meier plot of ", "toy example"), yLab = "Survival Probability")

[Figure: Kaplan-Meier Plot]

Cox Proportional Hazards Model
• We use the high/low ratio group to predict survival. Here the response is the survival time together with the censoring information.
fitcox <- coxph(Surv(time, censor) ~ group, data = toy_data)
summary(fitcox)
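
The log-rank test and the Cox fit can be tried end to end on made-up data. This is a sketch that assumes the survival package is installed; the data frame toy and all of its values are hypothetical.

```r
library(survival)

# Hypothetical toy data: censor = 1 means recurrence, 0 means censored.
toy <- data.frame(
  time   = c(5, 8, 12, 20, 30, 10, 14, 25, 35, 40),
  censor = c(1, 1, 1,  0,  1,  1,  0,  1,  0,  0),
  group  = rep(c("high", "low"), each = 5)
)

# Log-rank test: compares observed and expected events across groups.
lr <- survdiff(Surv(time, censor) ~ group, data = toy)
p_logrank <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)

# Cox proportional hazards model: exp(coef) is the hazard ratio
# of the "low" group relative to the reference level "high".
fit <- coxph(Surv(time, censor) ~ group, data = toy)
hr <- exp(coef(fit))

print(p_logrank)
print(hr)
```

With a single two-level predictor, the score test reported by summary(fit) and the log-rank test address the same question, so their p-values are usually close.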

[Figure: Cox Model Result]

Data Downloading, Processing and Analysis

Outline
• Download data
• Parsing data
• Normalization
• Variance-based filtering (top 25%)
• t-test-based filtering (based on the p-value cutoff)
The above steps are implemented in the "get_DEG_table.R" script.

Data Availability
• Microdissected dataset (GSE1378): http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1378
• Whole tissue dataset (GSE1379): http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379
• The easiest way to download the data is the "getGEO" function from the "GEOquery" package.

Use "getGEO" to Download Data
• We have already downloaded the data; you can use the "getGEO" function to load the data either locally or online.
library(GEOquery)
• Local (loading_method = 'local')
geo_Name <- 'GSE1378'
geodata2 <- getGEO(filename = paste0("geo_data/", geo_Name, "_series_matrix.txt.gz"), GSEMatrix = TRUE)
• Online (loading_method = 'online')
geodata <- getGEO(geo_Name, GSEMatrix = TRUE, destdir = "geo_data")
• You can set the loading_method variable in the get_DEG_table function to "local" or "online" to change how the data are loaded.
• Note that the downloaded geno matrix is in log2 scale.

Parsing Data
• Extract the geno matrix, pheno table and feature table
idx <- 1
geno <- assayData(geodata[[idx]])$exprs
pheno <- pData(phenoData(geodata[[idx]]))
feature <- as(featureData(geodata[[idx]]), "data.frame")
• Parse the phenotype table to get the variables Age, Size, DFS and censor
infos_df$Age = as.numeric(unlist(strsplit(infos_df$X9, split = "="))[seq(2, 2 * n, 2)])
infos_df$Size = as.numeric(unlist(strsplit(infos_df$X3, split = "="))[seq(2, 2 * n, 2)])
infos_df$DFS = as.numeric(unlist(strsplit(infos_df$X10, split = "="))[seq(2, 2 * n, 2)])
infos_df$censor = ifelse(infos_df$status == "Status=recur", 1, 0)
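
The parsing idiom above relies on every phenotype field having the form "key=value". The sketch below shows what it does on a few made-up strings; age_raw and status_raw are hypothetical stand-ins for the columns of infos_df.

```r
# Hypothetical phenotype strings in "key=value" form, one per sample.
age_raw    <- c("Age=55", "Age=62", "Age=48")
status_raw <- c("Status=recur", "Status=non-recur", "Status=recur")
n <- length(age_raw)

# strsplit() + unlist() yields key, value, key, value, ...;
# the even positions 2, 4, ..., 2n hold the values.
parts <- unlist(strsplit(age_raw, split = "="))
age <- as.numeric(parts[seq(2, 2 * n, 2)])

# An alternative that does not depend on n: drop everything up to the "=".
age_alt <- as.numeric(sub("^.*=", "", age_raw))

# Event indicator: 1 for recurrence, 0 otherwise.
censor <- ifelse(status_raw == "Status=recur", 1, 0)

print(data.frame(age, age_alt, censor))
```

The sub() variant is a little more robust, because it still works if a value is missing for one sample and strsplit() returns fewer pieces than expected.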

Normalization
• Gene-wise normalization (for each sample, subtract the median log2 value across genes)
tmp_gm <- apply(geno, 2, median)
geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
• Sample-wise normalization (divide each gene by its mean value across samples, in the original scale)
geno <- apply(geno, c(1, 2), function(x) 2 ^ x)
geno <- t(apply(geno, 1, function(x) x / (mean(x))))
geno <- apply(geno, c(1, 2), function(x) log2(x))
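
The two normalization steps can be checked on a small random matrix. This sketch uses a hypothetical 5-gene by 4-sample matrix and shows that the outer-product form used above matches the more compact sweep() idiom; it is an illustration, not the course script.

```r
set.seed(42)

# Hypothetical log2 expression matrix: rows are genes, columns are samples.
num_genes <- 5
n <- 4
geno <- matrix(rnorm(num_genes * n, mean = 8), nrow = num_genes, ncol = n)

# Step 1: subtract each column's median (outer-product form vs. sweep()).
col_med <- apply(geno, 2, median)
step1_outer <- geno - matrix(rep(1, num_genes), num_genes, 1) %*% matrix(col_med, 1, n)
step1_sweep <- sweep(geno, 2, col_med, "-")

# Step 2: go back to the original scale, divide each row by its mean,
# then return to log2 scale.
lin <- 2 ^ step1_sweep
step2 <- log2(lin / rowMeans(lin))

print(round(apply(step1_sweep, 2, median), 10))  # column medians are now 0
print(round(rowMeans(2 ^ step2), 10))            # row means are 1 in original scale
```

Because 2^x and log2() are vectorized, they can be applied to the whole matrix at once instead of element by element with apply().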

Variance-Based Filtering
• Calculate the variance of each gene and keep the top 25% most variable genes (variance above the 75th percentile)
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
feature_var_filtered <- feature[var_filtered_idx, ]
geno_var_filtered <- geno[var_filtered_idx, ]

t-Test-Based Filtering
• For each gene, run a t-test between the recurrence and non-recurrence groups. The status variable indicates the group.
tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value
• Filter the genes by the p-value cutoff
ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
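
The variance filter and the per-gene t-test can be chained as in the sketch below. Everything here is simulated: the matrix size, the gene names, the shift added to gene1 and the cutoff of 0.001 are assumptions made for the example.

```r
set.seed(7)

# Hypothetical data: 40 genes x 20 samples, 10 samples per group.
num_genes <- 40
n <- 20
status <- factor(rep(c("non-recur", "recur"), each = n / 2))
geno <- matrix(rnorm(num_genes * n), nrow = num_genes,
               dimnames = list(paste0("gene", 1:num_genes), NULL))

# Make gene1 clearly higher in the recurrence group.
geno[1, status == "recur"] <- geno[1, status == "recur"] + 4

# Variance filter: keep genes above the 75th percentile of variance.
var_geno <- apply(geno, 1, var)
keep <- var_geno > quantile(var_geno, 0.75)
geno_var_filtered <- geno[keep, , drop = FALSE]

# Two-sided t-test per remaining gene; collect the p-values.
pvalue_list <- apply(geno_var_filtered, 1,
                     function(g) t.test(g ~ status, alternative = "two.sided")$p.value)

# Keep genes below the p-value cutoff.
cutoff <- 0.001
selected <- names(pvalue_list)[pvalue_list < cutoff]
print(selected)
```

Using apply() over rows replaces the explicit loop over i, and drop = FALSE keeps the result a matrix even if only one gene passes the filter.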

[Figure: Sample Results (GSE1378, microdissected dataset, cutoff 0.0011)]

[Figure: Sample Results (GSE1379, whole tissue dataset, cutoff 0.0011)]

Statistical Modeling (examples)

Outline
• Select the overlapped genes between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and dendrogram
• Univariate logistic regression for the selected genes and the two-gene ratio predictor
• Multivariate logistic regression (size and the other two potential predictors)
• Survival analysis part 1: Kaplan-Meier plot
• Survival analysis part 2: Cox proportional hazards model

Overlapped Genes
• In the preprocessing step we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379.
• We use the genes that appear in both DEG tables for the subsequent analysis.
• GSE1378: micro-dissected breast cancer cells (LCM)
• GSE1379: whole tissue section
• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111) and AI240933 (EST).
• We will study the prognostic value of these markers.

Heatmap and Dendrogram
• We use a heatmap and a dendrogram to visually check the relationship (correlation) among genes or samples.
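
A dendrogram of this kind is typically built from a correlation-based distance. The sketch below is one possible way to do it in base R on simulated data; the matrix, the gene names and the choice of average linkage are assumptions, not necessarily what the course script uses.

```r
set.seed(11)

# Hypothetical expression matrix: 4 genes x 6 samples.
geno <- matrix(rnorm(4 * 6), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("s", 1:6)))

# Make gene2 track gene1 closely so the two are highly correlated.
geno[2, ] <- geno[1, ] + rnorm(6, sd = 0.1)

# Distance between genes: 1 - Pearson correlation of their profiles.
d <- as.dist(1 - cor(t(geno)))

# Hierarchical clustering; plot(hc) would draw the dendrogram.
hc <- hclust(d, method = "average")
print(hc$merge[1, ])  # the first merge joins the two closest genes

# heatmap(geno) would draw the matrix with row and column dendrograms attached.
```

Negative entries in hc$merge refer to single genes by row index, so the first row shows which pair was merged first.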

[Figure: Heatmap (microdissected, GSE1378), consistent with the paper]

[Figure: Heatmap (whole tissue section, GSE1379)]

Model Set 1
• Univariate logistic regression for each gene
– Response variable: recur/non-recur status
– Predictors: one of the overlapped genes HOXB13, IL17BR (AF208111), AI240933 (EST)

Model Set 2
• Univariate logistic regression for the ratio of genes
– Response variable: recur/non-recur status
– Predictor: HOXB13/IL17BR

Model Set 3
• Multivariate logistic regression
– Response variable: recur/non-recur status
– Predictors: tumor size, HOXB13/IL17BR, PGR and ERBB2

Model Set 4
• Survival model
– Response variable: DFS (disease-free survival time) and censor
– Predictor: use "-intercept/beta" from the logistic regression as the cutoff to divide the samples into two groups, a high ratio group and a low ratio group
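
The "-intercept/beta" cutoff is the predictor value at which the fitted log-odds are zero, that is, where the predicted recurrence probability is exactly 0.5. The sketch below illustrates this on simulated data; the simulated coefficients (0.5 and 1.5) and the variable names are hypothetical.

```r
set.seed(3)

# Simulated data: log-odds of recurrence = 0.5 + 1.5 * ratio (hypothetical model).
n <- 300
ratio <- rnorm(n)
status <- rbinom(n, 1, plogis(0.5 + 1.5 * ratio))

fit <- glm(status ~ ratio, family = binomial(link = "logit"))
b <- coef(fit)

# Solve b0 + b1 * x = 0 for x: the cutoff is -intercept / beta.
cutoff <- unname(-b["(Intercept)"] / b["ratio"])

# At the cutoff the predicted probability is 0.5.
p_at_cutoff <- unname(predict(fit, newdata = data.frame(ratio = cutoff),
                              type = "response"))

# Split the samples into high and low ratio groups.
group <- ifelse(ratio > cutoff, "high", "low")
print(cutoff)
print(table(group))
```

The resulting group variable can then be used as the predictor in the Kaplan-Meier and Cox analyses.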

Important Note
• Remember that there are two datasets: GSE1378 and GSE1379.
• The same sets of models can be fitted on both datasets.
• You need to set the working dataset variable:
# working_dataset = "GSE1379"  # whole tissue section
working_dataset = "GSE1378"  # microdissected breast cancer cells
• We use the working dataset GSE1378 as the example.

Univariate Logistic Regression for Each Gene
• As an example, we check the gene HOXB13
gb_acc = "BC007092"  # HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
logit_data = data.frame(status = infos_df$status, gene = geno_selected)
fit <- glm(status ~ gene, data = logit_data, family = binomial(link = "logit"))
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

[Figure: Sample Output (gene HOXB13)]

[Figure: ROC (AUC = 0.796, gene HOXB13)]

Univariate Logistic Regression (HOXB13/IL17BR)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)
# fit the model
fit <- glm(status ~ ratio, data = logit_data, family = binomial(link = "logit"))
summary(fit)
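
The comment "in the log2 scale, the ratio is the difference" follows from log2(a / b) = log2(a) - log2(b). A quick numeric check, using made-up original-scale intensities:

```r
# Hypothetical original-scale intensities for three samples.
hoxb13 <- c(200, 50, 800)
il17br <- c(100, 400, 800)

# Ratio computed in the original scale, then logged.
log_ratio_direct <- log2(hoxb13 / il17br)

# Difference of the log2 values, as done with the log2-scale geno matrix.
log_ratio_diff <- log2(hoxb13) - log2(il17br)

print(log_ratio_direct)  # 1, -3, 0
print(log_ratio_diff)    # same values
```

A positive value means HOXB13 is higher than IL17BR in that sample, and a value of 0 means the two are equal.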

[Figure: Sample Output (HOXB13/IL17BR)]

[Figure: ROC (AUC = 0.84, HOXB13/IL17BR)]

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)

[Figure: Sample Output (Multivariate)]

[Figure: ROC (AUC = 0.86, multivariate)]

[Figure: Kaplan-Meier Plot (gene ratio high/low group, cutoff = -12)]

Cox Proportional Hazards Model
(gene ratio high/low group, cutoff = -12)
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)

[Figure: Sample Output (Cox)]

Validation: GSE6532
• The link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Number of total markers: 54675
• The genes HOXB13 and IL17RB and the ESTs are included in this dataset.
• We use this dataset for validation.
• Result: the markers are not significant on this independent set.
Big Data Training for Translational Omics Research
Outline
bull Logistic Regressionndash ROC curve and AUC
bull Linear Regression
bull Kaplan-Meier plot and log-rank test
bull Cox Proportional odds model
Big Data Training for Translational Omics Research
Logistic Model
bull Logistic model is used for casecontrol study
bull Usage scenario when the response is binary say diseasehealthy or recurrencenon-recurrence
log119901 119904119905119886119905119906119904 = 119903119890119888119906119903119903119890119899119888119890
1 minus 119901 119904119905119886119905119906119904 = 119903119890119888119906119903119903119890119899119888119890= β0 + 12057311199091 +⋯+ 120573119899119909119899
bull Where 119909119894 are predictors and 120573119894 are the parameters of interest
Big Data Training for Translational Omics Research
Linear Model
bull Response continuous say weight or
gene expression
bull Predictors any variables (say gene
expression)
bull Model 119910 = β0 + 12057311199091 +⋯+ 120573119899119909119899 + 120598
bull Assumptions error term 120598 sim 119894119894119889 119873 0 1205902
Big Data Training for Translational Omics Research
Survival Methods
bull Kaplan-Meier plot visually checking the survival curve
between groups
bull Cox Proportional odds model and log-rank test as formal
statistical test
bull Response survival time (say DFS) and censor
bull Predictors any variables (say group or specific genes)
bull Recurrence censor = 1 and Non-recurrence censor = 0
Big Data Training for Translational Omics Research
Load data
bull Toy example datatoy_datalt- readcsv(toy_example_datacsv)
Big Data Training for Translational Omics Research
Logistic Model
bull Response --- recurrencenon-recurrence status
bull Predictor --- the expression of gene HOXB13
logistic regresion use gene HOXB13 to predict the recurnon-recur status
fitlogistic lt- glm(status~ gene_HOXB13data = toy_datafamily = binomial(link = logit))
summary(fitlogistic)
plot ROC curvep lt- predict(fitlogistic type=response)
pr lt- prediction(p toy_data$status)
prf lt- performance(pr measure = tpr xmeasure = fpr)
plot(prfmain=ROC plot of logistic regression)
calculate the auc
auc lt- performance(pr measure = auc)
auc lt- aucyvalues[[1]]auc
Big Data Training for Translational Omics Research
Logistic Regression Result
Big Data Training for Translational Omics Research
ROC Curve
Big Data Training for Translational Omics Research
Linear Model
bull Response --- expression of HOXB13
bull Predictor --- expression of IL17BR linear model use gene IL17BR to predict another gene HOXB13
HOXB13fitlmlt- lm(gene_HOXB13~gene_IL17BRdata = data_toy)
summary(fitlm)
Big Data Training for Translational Omics Research
Linear Regression Result
Big Data Training for Translational Omics Research
Kaplan-Meier Plot
bull We use Kaplan-Meier plot and log-rank
test to check whether the survival time is
significantly different from each other
between groups (say highlow ratio
group)
ratiosurv lt- survfit(Surv(timecensor) ~ ratio_group data = toy_data)
autoplot(ratiosurvpVal = TpX=025pY =025title = paste0(Kaplan-Meier plot of
toy example )yLab = Survival Probability)
Big Data Training for Translational Omics Research
Kaplan-Meier Plot
Big Data Training for Translational Omics Research
Cox Proportional Odds Model
bull We use highlow ratio group to predict the
survival probability Here the response is
the survival time and the censor
information
fitcox lt- coxph(Surv(timecensor) ~ group data = toy_data)
summary(fitcox)
Big Data Training for Translational Omics Research
Cox Model Result
Big Data Training for Translational Omics Research
Data Downloading Processing
and Analysis
Big Data Training for Translational Omics Research
Outline
bull Download data
bull Parsing data
bull Normalization
bull Variance based filtering (top 25)
bull T test based filtering(based on the P-value cutoff)
The above steps are implemented in
ldquoget_DEG_tableRrdquo script
Big Data Training for Translational Omics Research
Data Availability
bull Microdissected dataset GSE1378
httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE
1378
bull Whole tissue dataset GSE1379
httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE
1379
bull The easiest way to download data is using ldquogetGEOrdquo
function from ldquoGEOqueryrdquo package
Big Data Training for Translational Omics Research
Use ldquogetGEOrdquo to Download Databull We have downloaded the data you can use
ldquogetGEOrdquo function to get data locally or online
bull Local (loading_method = lsquolocalrsquo)geo_Name lt- lsquoGSE1378rsquo
geodata2 lt-getGEO(filename paste0(geo_datageo_Name_series_matrixtxtgz) GSEMatrix = TRUE)
bull Online (loading_method = lsquoonlinersquo)geodata lt- getGEO(geo_Name GSEMatrix = TRUEdestdir = geo_data)
bull You can set loading_method variable in the get_DEG_table function to rdquolocalrdquo or ldquoonlinerdquo to change the way of downloading data
bull Note that the downloaded geno matrix is in log2scale
Big Data Training for Translational Omics Research
Parsing Data
bull Extract the geno matrix pheno table and
feature tableidx lt- 1 geno lt- assayData(geodata[[idx]])$exprs
pheno lt- pData(phenoData(geodata[[idx]]))
feature lt- as(featureData(geodata[[idx]]) dataframe)
bull Parsing phenotype table to get variable
Age Size DFS censorinfos_df$Age = asnumeric(unlist(strsplit(infos_df$X9 split = =))[seq(2 2 n 2)])
infos_df$Size = asnumeric(unlist(strsplit(infos_df$X3 split = =))[seq(2 2 n 2)])
infos_df$DFS = asnumeric(unlist(strsplit(infos_df$X10 split = =))[seq(2 2 n 2)])
infos_df$censor = ifelse(infos_df$status == Status=recur 1 0)
Big Data Training for Translational Omics Research
Normalization
bull Gene wise normalization (subtract the
median log2 value)tmp_gm lt- apply(geno 2 median)
geno lt- geno - matrix(rep(1 numOfGene) numOfGene 1)
matrix(tmp_gm 1 n)
bull Sample wise normalization (divided
by mean value in original scale)geno lt- apply(geno c(1 2) function(x) 2 ^ x )
geno lt- t(apply(geno 1 function(x) x (mean(x)) ))
geno lt- apply(geno c(1 2) function(x) log2(x) )
Big Data Training for Translational Omics Research
Variance Based Filtering
bull Calculate the variance for each gene and
choose the top 25 variance based filtering (75th percentile)
var_geno lt- apply(geno 1 var)
var_filtered_idx lt- var_geno gt quantile(var_geno 075)
feature_var_filtered lt- feature[var_filtered_idx]
geno_var_filtered lt- geno[var_filtered_idx]
Big Data Training for Translational Omics Research
T test Based Filtering
bull For each gene do T test between the
recurrence and non-recurrence group
The status variable indicates the group
informationtmp_test lt- ttest(gene_express ~ status data = sdata alternative =
twosided)
pvalue_list[i] lt- tmp_test$pvalue
bull Fitering the gene by the P-value cutoffttest_filtered_idx lt- which(pvalue_list lt cutoff)
feature_ttest_filtered lt- feature_var_filtered[ttest_filtered_idx]
geno_ttest_filtered lt- geno_var_filtered[ttest_filtered_idx]
Big Data Training for Translational Omics Research
Sample Results
(GSE1378microdissected 00011 cutoff)
Big Data Training for Translational Omics Research
Sample Results
(GSE1379 whole tissue dataset cutoff 00011)
Big Data Training for Translational Omics Research
Statistical Modeling(examples)
Big Data Training for Translational Omics Research
Outline
bull Select overlapped genes between GSE1378 and
GSE1379 for subsequent analysis
bull Heatmap and Dendrogram
bull Univariate logistic regression for selected genes and
two-gene ratio predictor
bull Multivariate logistic regression (size and the other two
potential predictors)
bull Survival analysis part 1 Kaplan-Meier plot
bull Survival analysis part 2 Cox proportional odds model
Big Data Training for Translational Omics Research
Overlapped Genes
bull In the prepossessing step we obtained two DEG tables
for the datasets GSE1378 and GSE1379
bull We used the overlapped genes in this two DEG tables
for the subsequent analysis
bull GSE1378 Micro-dissected breast cancer cell (LCM)
bull GSE1379 Whole tissue section
bull The overlapped genes are HOXB13 (identified twice as
AI208111 and BC007092) IL17BR (AF2080111) and
AI240933 (EST)
bull We will study the prognostic value of these markers
Big Data Training for Translational Omics Research
Heatmap and Dendrogram
bull We use Heatmap and Dendrogram to
Visually check the relationship
(correlation) among genes or samples
Big Data Training for Translational Omics Research
Heatmap(microdissectedGSE1378)
consistent with the paper
Big Data Training for Translational Omics Research
Heatmap(whole section tissue GSE 1379)
Big Data Training for Translational Omics Research
Model Set 1
bull Univariate logistic regression for each
gene
ndash Response variable recurnon-recur
status
ndash Predictors one of the overlapped
genes HOXB13 IL17BR(AF2080111)
AI240933(EST)
Big Data Training for Translational Omics Research
Model Set 2
bull Univariate logistic regression for
ratio of genes
ndash Response variable recurnon-recur
status
ndash Predictors HOXB13IL17BR
Big Data Training for Translational Omics Research
Model Set 3
bull Multivariate logistic regression
ndash Response variable recurnon-
recur
ndash Predictors tumor size
HOXB13IL17BR PGR and ERBB2
Big Data Training for Translational Omics Research
Model Set 4
bull Survival model
ndash Response variable DFS (disease free
survival time) censor
ndash Predictor use ldquo-interceptbetardquo from
logistic regression as the cutoff to
divide the sample into two groups high
ratio group and low ratio group
Big Data Training for Translational Omics Research
Important Note
bull Please remember there are two datasets GSE1378 and GSE1379
bull Can fit the same sets of model on these two datasets
bull Need to set the working dataset variable
working_dataset = GSE1378 whole tissue sectionGSE1379
working_dataset = GSE1378 microdissected breast cancer cells
GSE1378
bull Use working dataset GSE1378 as example
Big Data Training for Translational Omics Research
Univariate Logistic Regression
for Each Gene
bull As an example we check the gene HOXB13gb_acc = BC007092 HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc)]
logit_data = dataframe(status = infos_df$statusgene = geno_selected )
fit lt- glm(status~ geno_selecteddata = logit_datafamily = binomial(link = logit))
p lt- predict(fit type=response)
pr lt- prediction(p infos_df$status)
prf lt- performance(pr measure = tpr xmeasure = fpr)
plot(prfmain=paste0(ROC plot of gene gb_acc))
auc lt- performance(pr measure = auc)
auc lt- aucyvalues[[1]]
auc
Big Data Training for Translational Omics Research
Sample Output (gene HOXB13 )
Big Data Training for Translational Omics Research
ROC (auc 0796 gene HOXB13 )
Big Data Training for Translational Omics Research
Univariate Logistic Regression (HOXB13IL17BR)
gb_acc1 = BC007092 HOXB13
gb_acc2 = AF208111 IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1)]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2)]
in the log2 scale the ratio is the difference
gene_ratio = geno_selected1-geno_selected2
logit_data = dataframe(status = infos_df$statusgene1 = geno_selected1 gene2 =
geno_selected2ratio =gene_ratio)
fit the model
fit lt- glm(status~ gene_ratiodata = logit_datafamily = binomial(link = logit))
summary(fit)
Big Data Training for Translational Omics Research
Sample Output(HOXB13IL17BR)
Big Data Training for Translational Omics Research
ROC (auc=084 HOXB13IL17BR)
Big Data Training for Translational Omics Research
Multivariate Logistic Regression(tumor size gene ratio PGR ERBB2)
gb_acc1 = BC007092 HOXB13
gb_acc2 = AF208111 IL17BR
gene_name3 = PGR_3UTR1 PGR
gene_name4 = BF108852 ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1)]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2)]
geno_selected3 = geno[which(feature$GeneName == gene_name3)]
geno_selected4 = geno[which(feature$GeneName == gene_name4)]
in the log2 scale the ratio is the difference
gene_ratio = geno_selected1-geno_selected2
logit_data = dataframe(status = infos_df$statussize = infos_df$Sizegene1 = geno_selected1 gene2 =
geno_selected2ratio =gene_ratiogene3= geno_selected3gene4= geno_selected4)
fit the multinvariate logistic regression
fit lt- glm(status~ gene_ratio+size+gene3+gene4data = logit_datafamily = binomial(link = logit))
summary(fit)
Big Data Training for Translational Omics Research
Sample Output (Multivariate)
Big Data Training for Translational Omics Research
ROC (auc = 086 Multivariate )
Big Data Training for Translational Omics Research
Kaplan-Meier Plot
(gene ratio highlow group cutoff = -12)
Big Data Training for Translational Omics Research
Cox Proportional Odds Model
(gene ratio highlow group cutoff = -12)
fitcox lt- coxph(Surv(timecensor) ~ group data = surv_data)
summary(fitcox)
Big Data Training for Translational Omics Research
Sample Output (Cox)
Big Data Training for Translational Omics Research
Validation GSE6532
bull The link to this dataset
httpwwwncbinlmnihgovgeoqueryacccgiacc=gse6532
bull Sample size87
bull Number of total markers 54675
bull Gene HOXB13IL17RB and ESTs are included in this dataset
bull We use this dataset as validation
bull Result They are not significant on this independent set
Big Data Training for Translational Omics Research
Logistic Model
bull Logistic model is used for casecontrol study
bull Usage scenario when the response is binary say diseasehealthy or recurrencenon-recurrence
log119901 119904119905119886119905119906119904 = 119903119890119888119906119903119903119890119899119888119890
1 minus 119901 119904119905119886119905119906119904 = 119903119890119888119906119903119903119890119899119888119890= β0 + 12057311199091 +⋯+ 120573119899119909119899
bull Where 119909119894 are predictors and 120573119894 are the parameters of interest
Big Data Training for Translational Omics Research
Linear Model
bull Response continuous say weight or
gene expression
bull Predictors any variables (say gene
expression)
bull Model 119910 = β0 + 12057311199091 +⋯+ 120573119899119909119899 + 120598
bull Assumptions error term 120598 sim 119894119894119889 119873 0 1205902
Big Data Training for Translational Omics Research
Survival Methods
bull Kaplan-Meier plot visually checking the survival curve
between groups
bull Cox Proportional odds model and log-rank test as formal
statistical test
bull Response survival time (say DFS) and censor
bull Predictors any variables (say group or specific genes)
bull Recurrence censor = 1 and Non-recurrence censor = 0
Big Data Training for Translational Omics Research
Load data
bull Toy example datatoy_datalt- readcsv(toy_example_datacsv)
Big Data Training for Translational Omics Research
Logistic Model
bull Response --- recurrencenon-recurrence status
bull Predictor --- the expression of gene HOXB13
logistic regresion use gene HOXB13 to predict the recurnon-recur status
fitlogistic lt- glm(status~ gene_HOXB13data = toy_datafamily = binomial(link = logit))
summary(fitlogistic)
plot ROC curvep lt- predict(fitlogistic type=response)
pr lt- prediction(p toy_data$status)
prf lt- performance(pr measure = tpr xmeasure = fpr)
plot(prfmain=ROC plot of logistic regression)
calculate the auc
auc lt- performance(pr measure = auc)
auc lt- aucyvalues[[1]]auc
Big Data Training for Translational Omics Research
Logistic Regression Result
Big Data Training for Translational Omics Research
ROC Curve
Big Data Training for Translational Omics Research
Linear Model
bull Response --- expression of HOXB13
bull Predictor --- expression of IL17BR linear model use gene IL17BR to predict another gene HOXB13
HOXB13fitlmlt- lm(gene_HOXB13~gene_IL17BRdata = data_toy)
summary(fitlm)
Big Data Training for Translational Omics Research
Linear Regression Result
Big Data Training for Translational Omics Research
Kaplan-Meier Plot
bull We use Kaplan-Meier plot and log-rank
test to check whether the survival time is
significantly different from each other
between groups (say highlow ratio
group)
ratiosurv lt- survfit(Surv(timecensor) ~ ratio_group data = toy_data)
autoplot(ratiosurvpVal = TpX=025pY =025title = paste0(Kaplan-Meier plot of
toy example )yLab = Survival Probability)
Big Data Training for Translational Omics Research
Kaplan-Meier Plot
Big Data Training for Translational Omics Research
Cox Proportional Odds Model
bull We use highlow ratio group to predict the
survival probability Here the response is
the survival time and the censor
information
fitcox lt- coxph(Surv(timecensor) ~ group data = toy_data)
summary(fitcox)
Big Data Training for Translational Omics Research
Cox Model Result
Big Data Training for Translational Omics Research
Data Downloading Processing
and Analysis
Big Data Training for Translational Omics Research
Outline
bull Download data
bull Parsing data
bull Normalization
bull Variance based filtering (top 25)
bull T test based filtering(based on the P-value cutoff)
The above steps are implemented in
ldquoget_DEG_tableRrdquo script
Big Data Training for Translational Omics Research
Data Availability
bull Microdissected dataset GSE1378
httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE
1378
bull Whole tissue dataset GSE1379
httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE
1379
bull The easiest way to download data is using ldquogetGEOrdquo
function from ldquoGEOqueryrdquo package
Big Data Training for Translational Omics Research
Use ldquogetGEOrdquo to Download Databull We have downloaded the data you can use
ldquogetGEOrdquo function to get data locally or online
bull Local (loading_method = lsquolocalrsquo)geo_Name lt- lsquoGSE1378rsquo
geodata2 lt-getGEO(filename paste0(geo_datageo_Name_series_matrixtxtgz) GSEMatrix = TRUE)
bull Online (loading_method = lsquoonlinersquo)geodata lt- getGEO(geo_Name GSEMatrix = TRUEdestdir = geo_data)
bull You can set loading_method variable in the get_DEG_table function to rdquolocalrdquo or ldquoonlinerdquo to change the way of downloading data
bull Note that the downloaded geno matrix is in log2scale
Big Data Training for Translational Omics Research
Parsing Data
bull Extract the geno matrix pheno table and
feature tableidx lt- 1 geno lt- assayData(geodata[[idx]])$exprs
pheno lt- pData(phenoData(geodata[[idx]]))
feature lt- as(featureData(geodata[[idx]]) dataframe)
bull Parsing phenotype table to get variable
Age Size DFS censorinfos_df$Age = asnumeric(unlist(strsplit(infos_df$X9 split = =))[seq(2 2 n 2)])
infos_df$Size = asnumeric(unlist(strsplit(infos_df$X3 split = =))[seq(2 2 n 2)])
infos_df$DFS = asnumeric(unlist(strsplit(infos_df$X10 split = =))[seq(2 2 n 2)])
infos_df$censor = ifelse(infos_df$status == Status=recur 1 0)

Normalization
• Gene-wise normalization (subtract the median log2 value):
  tmp_gm <- apply(geno, 2, median)
  geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
• Sample-wise normalization (divide by the mean value on the original scale):
  geno <- apply(geno, c(1, 2), function(x) 2 ^ x)
  geno <- t(apply(geno, 1, function(x) x / mean(x)))
  geno <- apply(geno, c(1, 2), function(x) log2(x))
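The two normalization steps can be checked on a small matrix. The sketch below uses simulated values and vectorized helpers (`sweep`, `rowMeans`) in place of the explicit matrix product; it is meant as an illustration of the arithmetic, not as the script's implementation.

```r
set.seed(1)
# Hypothetical log2 expression matrix: 5 genes (rows) x 4 samples (columns).
geno <- matrix(rnorm(20, mean = 8), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("s", 1:4)))

# Step 1: subtract each column's median log2 value.
col_median <- apply(geno, 2, median)
centered <- sweep(geno, 2, col_median, FUN = "-")

# Same result as the explicit matrix product used in the slide code.
centered_mat <- geno - matrix(1, nrow(geno), 1) %*% matrix(col_median, 1, ncol(geno))

# Step 2: go back to the original scale, divide each row by its mean,
# and return to the log2 scale.
raw <- 2 ^ centered
scaled <- log2(raw / rowMeans(raw))

print(round(apply(centered, 2, median), 10))
```

After step 1 every column has median zero, and after step 2 every row has mean one on the original scale.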

Variance-Based Filtering
• Calculate the variance of each gene and keep the top 25% (genes above the 75th percentile of variance):
  var_geno <- apply(geno, 1, var)
  var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
  feature_var_filtered <- feature[var_filtered_idx, ]
  geno_var_filtered <- geno[var_filtered_idx, ]
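As a quick check of the filter, the sketch below applies the same rule to a simulated matrix of 100 genes; the data and the `ID` column are hypothetical. With 100 distinct variances, exactly 25 genes lie above the 75th percentile.

```r
set.seed(2)
# Hypothetical matrix: 100 genes x 10 samples, with gene-specific spread.
gene_sd <- runif(100, min = 0.1, max = 2)
geno <- matrix(rnorm(100 * 10, sd = rep(gene_sd, times = 10)), nrow = 100)
feature <- data.frame(ID = paste0("probe", 1:100))

var_geno <- apply(geno, 1, var)                 # one variance per gene (row)
keep <- var_geno > quantile(var_geno, 0.75)     # above the 75th percentile

# drop = FALSE keeps the data frame / matrix shape even for a single row.
feature_var_filtered <- feature[keep, , drop = FALSE]
geno_var_filtered <- geno[keep, , drop = FALSE]

print(dim(geno_var_filtered))
```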

t-Test-Based Filtering
• For each gene, run a t-test between the recurrence and non-recurrence groups. The status variable indicates the group:
  tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
• Filter the genes by the p-value cutoff:
  ttest_filtered_idx <- which(pvalue_list < cutoff)
  feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
  geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
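The slide shows only the body of the per-gene loop. A minimal sketch of the full loop, on simulated data with one deliberately shifted gene, might look as follows; the data, the cutoff value and the group sizes are assumptions made for illustration.

```r
set.seed(3)
# Hypothetical data: 50 genes x 20 samples, 10 recurrence and 10 non-recurrence.
status <- factor(rep(c("recur", "non-recur"), each = 10))
geno <- matrix(rnorm(50 * 20), nrow = 50)
geno[1, status == "recur"] <- geno[1, status == "recur"] + 4  # one shifted gene

cutoff <- 0.001
pvalue_list <- numeric(nrow(geno))
for (i in seq_len(nrow(geno))) {
  # One small data frame per gene: expression values plus group labels.
  sdata <- data.frame(gene_express = geno[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata,
                     alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

ttest_filtered_idx <- which(pvalue_list < cutoff)
print(ttest_filtered_idx)
```

Because many genes are tested at once, a strict cutoff (or a multiple-testing correction such as `p.adjust`) helps limit false positives.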

Sample Results (GSE1378, microdissected; cutoff 0.0011)

Sample Results (GSE1379, whole-tissue dataset; cutoff 0.0011)

Statistical Modeling (Examples)

Outline
• Select the genes that overlap between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and dendrogram
• Univariate logistic regression for the selected genes and for the two-gene ratio predictor
• Multivariate logistic regression (size and the other two potential predictors)
• Survival analysis, part 1: Kaplan-Meier plot
• Survival analysis, part 2: Cox proportional hazards model

Overlapped Genes
• In the preprocessing step we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379.
• We use the genes that overlap between these two DEG tables for the subsequent analysis.
• GSE1378: microdissected breast cancer cells (LCM)
• GSE1379: whole-tissue sections
• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111) and AI240933 (EST).
• We will study the prognostic value of these markers.

Heatmap and Dendrogram
• We use a heatmap and dendrogram to visually check the relationship (correlation) among genes or samples.
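A heatmap with a correlation-based dendrogram can be drawn with base R alone. The sketch below uses a simulated 6 x 8 matrix in which three genes share a common signal; the data, the average-linkage choice and the `1 - correlation` distance are assumptions for illustration, not necessarily what produced the figures in this deck.

```r
set.seed(4)
# Hypothetical expression matrix: 6 genes x 8 samples; genes 1-3 share a signal.
signal <- rnorm(8)
geno <- rbind(t(replicate(3, signal + rnorm(8, sd = 0.2))),
              matrix(rnorm(3 * 8), nrow = 3))
rownames(geno) <- paste0("gene", 1:6)
colnames(geno) <- paste0("s", 1:8)

# Correlation-based distance between genes, then hierarchical clustering.
gene_dist <- as.dist(1 - cor(t(geno)))
gene_tree <- hclust(gene_dist, method = "average")

pdf(NULL)  # draw to a null device so the sketch also runs without a display
plot(as.dendrogram(gene_tree), main = "Gene dendrogram")
heatmap(geno, Rowv = as.dendrogram(gene_tree), scale = "row")
invisible(dev.off())

# Cut the tree into two clusters to see which genes group together.
clusters <- cutree(gene_tree, k = 2)
print(clusters)
```

Remove the `pdf(NULL)` and `dev.off()` lines to see the plots in an interactive session.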

Heatmap (microdissected, GSE1378)
Consistent with the paper.

Heatmap (whole-tissue section, GSE1379)

Model Set 1
• Univariate logistic regression for each gene
  – Response variable: recur/non-recur status
  – Predictors: one of the overlapped genes HOXB13, IL17BR (AF208111), AI240933 (EST)

Model Set 2
• Univariate logistic regression for the ratio of genes
  – Response variable: recur/non-recur status
  – Predictor: HOXB13/IL17BR

Model Set 3
• Multivariate logistic regression
  – Response variable: recur/non-recur status
  – Predictors: tumor size, HOXB13/IL17BR, PGR and ERBB2

Model Set 4
• Survival model
  – Response variable: DFS (disease-free survival time) and censor
  – Predictor: use "-intercept/beta" from the logistic regression as the cutoff to divide the samples into two groups, a high-ratio group and a low-ratio group
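The "-intercept/beta" cutoff in Model Set 4 is the ratio at which the fitted logistic model predicts a recurrence probability of exactly 0.5, because intercept + beta * x = 0 there. The sketch below illustrates this on simulated data; the variable names and the simulated coefficients are assumptions.

```r
set.seed(5)
# Hypothetical data: a log2 gene ratio and a binary recurrence status.
ratio <- rnorm(200)
status <- rbinom(200, size = 1, prob = plogis(0.8 + 1.5 * ratio))

fit <- glm(status ~ ratio, family = binomial(link = "logit"))
b <- coef(fit)

# Solve intercept + beta * x = 0 for x: the ratio where P(recurrence) = 0.5.
cutoff <- unname(-b["(Intercept)"] / b["ratio"])

# Split the samples into high-ratio and low-ratio groups at the cutoff.
ratio_group <- ifelse(ratio > cutoff, "high", "low")
p_at_cutoff <- predict(fit, newdata = data.frame(ratio = cutoff),
                       type = "response")

print(cutoff)
print(table(ratio_group))
```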

Important Note
• Please remember that there are two datasets, GSE1378 and GSE1379.
• The same sets of models can be fitted on both datasets.
• Set the working dataset variable accordingly:
  working_dataset = "GSE1379"  # whole-tissue section, GSE1379
  working_dataset = "GSE1378"  # microdissected breast cancer cells, GSE1378
• We use the working dataset GSE1378 as the example.

Univariate Logistic Regression for Each Gene
• As an example, we check the gene HOXB13:
  library(ROCR)  # provides prediction() and performance()
  gb_acc = "BC007092"  # HOXB13
  geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
  logit_data = data.frame(status = infos_df$status, gene = geno_selected)
  # use the column stored in logit_data as the predictor
  fit <- glm(status ~ gene, data = logit_data, family = binomial(link = "logit"))
  p <- predict(fit, type = "response")
  pr <- prediction(p, infos_df$status)
  prf <- performance(pr, measure = "tpr", x.measure = "fpr")
  plot(prf, main = paste0("ROC plot of gene ", gb_acc))
  auc <- performance(pr, measure = "auc")
  auc <- auc@y.values[[1]]
  auc
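The AUC reported by ROCR can also be understood without the package: it equals the probability that a randomly chosen recurrence case receives a higher predicted probability than a randomly chosen non-recurrence case. The sketch below computes it in base R from the rank-sum (Mann-Whitney) identity; the helper name `auc_rank` and the simulated data are assumptions, and this is an alternative to, not a copy of, the ROCR calls above.

```r
# AUC from the rank-sum (Mann-Whitney) identity; ties get average ranks.
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(6)
# Hypothetical data: one gene's expression and a 0/1 recurrence status.
gene <- rnorm(100)
status <- rbinom(100, size = 1, prob = plogis(1.2 * gene))

fit <- glm(status ~ gene, family = binomial(link = "logit"))
p <- predict(fit, type = "response")

auc <- auc_rank(p, status)
print(auc)
```

An AUC of 0.5 corresponds to a predictor that is no better than chance, and 1 to perfect separation of the two groups.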

Sample Output (gene HOXB13)

ROC (AUC = 0.796, gene HOXB13)

Univariate Logistic Regression (HOXB13/IL17BR)
  gb_acc1 = "BC007092"  # HOXB13
  gb_acc2 = "AF208111"  # IL17BR
  geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
  geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
  # on the log2 scale, the ratio is the difference
  gene_ratio = geno_selected1 - geno_selected2
  logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)
  # fit the model, using the ratio column stored in logit_data
  fit <- glm(status ~ ratio, data = logit_data, family = binomial(link = "logit"))
  summary(fit)
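The comment "on the log2 scale, the ratio is the difference" follows from log2(a / b) = log2(a) - log2(b). A small numeric check with made-up expression values:

```r
# Hypothetical expression values on the original (non-log) scale.
hoxb13 <- c(8, 2, 16)
il17br <- c(2, 4, 16)

# log2(a / b) equals log2(a) - log2(b), so the ratio is a difference of logs.
log_ratio_direct <- log2(hoxb13 / il17br)
log_ratio_diff <- log2(hoxb13) - log2(il17br)

print(log_ratio_diff)
```

A positive value means HOXB13 is expressed more strongly than IL17BR in that sample, a negative value the opposite.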

Sample Output (HOXB13/IL17BR)

ROC (AUC = 0.84, HOXB13/IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)
  gb_acc1 = "BC007092"  # HOXB13
  gb_acc2 = "AF208111"  # IL17BR
  gene_name3 = "PGR_3UTR1"  # PGR
  gene_name4 = "BF108852"  # ERBB2
  geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
  geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
  geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
  geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
  # on the log2 scale, the ratio is the difference
  gene_ratio = geno_selected1 - geno_selected2
  logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
  # fit the multivariate logistic regression, using the columns stored in logit_data
  fit <- glm(status ~ ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
  summary(fit)

Sample Output (Multivariate)

ROC (AUC = 0.86, multivariate)

Kaplan-Meier Plot
(gene ratio high/low group, cutoff = -1.2)

Cox Proportional Hazards Model
(gene ratio high/low group, cutoff = -1.2)
  fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
  summary(fitcox)
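The main quantity to read from a Cox fit is the hazard ratio, exp(coef), of one group relative to the other. The sketch below fits the same kind of model on simulated data using the `survival` package; the data-generating rates, the censoring scheme and the column names are assumptions made for illustration.

```r
library(survival)  # recommended package shipped with standard R installations

set.seed(7)
# Hypothetical survival data: the "high" group has a larger recurrence hazard.
n <- 200
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time <- rexp(n, rate = ifelse(group == "high", 0.2, 0.05))
censor_time <- runif(n, min = 5, max = 15)
surv_data <- data.frame(
  time = pmin(event_time, censor_time),
  censor = as.numeric(event_time <= censor_time),  # 1 = recurrence observed
  group = group
)

fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)

# exp(coef) is the hazard ratio of the "high" group relative to "low".
hazard_ratio <- unname(exp(coef(fitcox)))
print(hazard_ratio)
print(summary(fitcox)$conf.int)
```

A hazard ratio above 1 means the high-ratio group recurs faster than the low-ratio group; the confidence interval printed by `summary()` shows how precisely it is estimated.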

Sample Output (Cox)

Validation: GSE6532
• Link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Total number of markers: 54675
• The genes HOXB13 and IL17RB and the ESTs are included in this dataset.
• We use this dataset for validation.
• Result: the markers are not significant on this independent set.

Load data
• Toy example data:
  toy_data <- read.csv("toy_example_data.csv")

Logistic Model
• Response: recurrence/non-recurrence status
• Predictor: the expression of gene HOXB13
  # logistic regression: use gene HOXB13 to predict the recur/non-recur status
  fitlogistic <- glm(status ~ gene_HOXB13, data = toy_data, family = binomial(link = "logit"))
  summary(fitlogistic)
  # plot the ROC curve
  p <- predict(fitlogistic, type = "response")
  pr <- prediction(p, toy_data$status)
  prf <- performance(pr, measure = "tpr", x.measure = "fpr")
  plot(prf, main = "ROC plot of logistic regression")
  # calculate the AUC
  auc <- performance(pr, measure = "auc")
  auc <- auc@y.values[[1]]
  auc

Logistic Regression Result

ROC Curve

Linear Model
• Response: expression of HOXB13
• Predictor: expression of IL17BR
  # linear model: use gene IL17BR to predict another gene, HOXB13
  fitlm <- lm(gene_HOXB13 ~ gene_IL17BR, data = toy_data)
  summary(fitlm)
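To see what `summary()` reports for such a fit, the sketch below runs the same model on simulated data with the column names used on the slide; the data frame `toy` and its coefficients are hypothetical. It also checks the textbook identity that, with a single predictor, the slope equals cov(x, y) / var(x).

```r
set.seed(8)
# Hypothetical toy data with the column names used on the slide.
gene_IL17BR <- rnorm(60)
toy <- data.frame(
  gene_IL17BR = gene_IL17BR,
  gene_HOXB13 = 1 - 0.7 * gene_IL17BR + rnorm(60, sd = 0.5)
)

fitlm <- lm(gene_HOXB13 ~ gene_IL17BR, data = toy)

# Estimated slope and its p-value from the coefficient table.
slope <- unname(coef(fitlm)["gene_IL17BR"])
slope_pvalue <- summary(fitlm)$coefficients["gene_IL17BR", "Pr(>|t|)"]

# For a single predictor, the slope equals cov(x, y) / var(x).
slope_by_hand <- cov(toy$gene_IL17BR, toy$gene_HOXB13) / var(toy$gene_IL17BR)

print(c(slope = slope, p_value = slope_pvalue))
```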

Linear Regression Result

Kaplan-Meier Plot
• We use the Kaplan-Meier plot and the log-rank test to check whether survival time differs significantly between groups (say, the high- and low-ratio groups).
  ratiosurv <- survfit(Surv(time, censor) ~ ratio_group, data = toy_data)
  autoplot(ratiosurv, pVal = TRUE, pX = 0.25, pY = 0.25, title = paste0("Kaplan-Meier plot of toy example"), yLab = "Survival Probability")
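The log-rank p-value shown on the plot can also be obtained directly with `survdiff()` from the `survival` package. The sketch below does this on simulated data and draws the curves with base graphics rather than `autoplot`; the data, rates and follow-up limit are assumptions made for illustration.

```r
library(survival)  # recommended package shipped with standard R installations

set.seed(9)
# Hypothetical toy data: two ratio groups with different recurrence rates.
n <- 100
ratio_group <- rep(c("high", "low"), each = n / 2)
event_time <- rexp(n, rate = ifelse(ratio_group == "high", 0.3, 0.05))
toy <- data.frame(
  time = pmin(event_time, 10),             # follow-up ends at 10 time units
  censor = as.numeric(event_time <= 10),   # 1 = recurrence, 0 = censored
  ratio_group = ratio_group
)

# Kaplan-Meier estimate per group, drawn with base graphics.
km <- survfit(Surv(time, censor) ~ ratio_group, data = toy)
pdf(NULL)  # null device so the sketch also runs without a display
plot(km, col = c("red", "blue"), xlab = "Time", ylab = "Survival Probability")
legend("topright", legend = names(km$strata), col = c("red", "blue"), lty = 1)
invisible(dev.off())

# Log-rank test: chi-square statistic with (number of groups - 1) df.
lr <- survdiff(Surv(time, censor) ~ ratio_group, data = toy)
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
print(p_value)
```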

Kaplan-Meier Plot

Cox Proportional Hazards Model
• We use the high/low ratio group to predict survival. Here the response is the survival time together with the censoring information.
  fitcox <- coxph(Surv(time, censor) ~ group, data = toy_data)
  summary(fitcox)

Cox Model Result
Load data
bull Toy example datatoy_datalt- readcsv(toy_example_datacsv)

Logistic Model
• Response --- recurrence/non-recurrence status
• Predictor --- the expression of gene HOXB13
# logistic regression: use gene HOXB13 to predict the recur/non-recur status
fitlogistic <- glm(status ~ gene_HOXB13, data = toy_data, family = binomial(link = "logit"))
summary(fitlogistic)
# plot the ROC curve; prediction() and performance() come from the ROCR package
library(ROCR)
p <- predict(fitlogistic, type = "response")
pr <- prediction(p, toy_data$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = "ROC plot of logistic regression")
# calculate the AUC
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
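To see what the ROC curve and the AUC measure, the sketch below recomputes both in base R without ROCR. The data are simulated stand-ins for toy_data (status is assumed to be coded 0/1), and the names sim_data, auc_trap and auc_rank are made up for this illustration.

```r
# Simulated stand-in for toy_data (assumption: status is coded 0/1).
set.seed(1)
n <- 200
gene_HOXB13 <- rnorm(n)
status <- rbinom(n, 1, plogis(-0.5 + 1.5 * gene_HOXB13))
sim_data <- data.frame(status = status, gene_HOXB13 = gene_HOXB13)

fit <- glm(status ~ gene_HOXB13, data = sim_data, family = binomial(link = "logit"))
p <- predict(fit, type = "response")

# ROC points: sweep a threshold from high to low over the predicted probabilities.
thresholds <- sort(unique(c(-Inf, p, Inf)), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(p[sim_data$status == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(p[sim_data$status == 0] >= t))

# AUC as the area under the (fpr, tpr) curve, by the trapezoidal rule.
auc_trap <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# AUC as a rank statistic: the chance that a recurrence case scores above a control.
r <- rank(p)
n1 <- sum(sim_data$status == 1)
n0 <- sum(sim_data$status == 0)
auc_rank <- (sum(r[sim_data$status == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
```

Both numbers agree, which is one way to read the AUC: the probability that a randomly chosen recurrence sample gets a higher predicted probability than a randomly chosen non-recurrence sample.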

[Figure: Logistic Regression Result]
[Figure: ROC Curve]

Linear Model
• Response --- expression of HOXB13
• Predictor --- expression of IL17BR
# linear model: use gene IL17BR to predict another gene, HOXB13
fitlm <- lm(gene_HOXB13 ~ gene_IL17BR, data = toy_data)
summary(fitlm)
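A minimal sketch of the same fit on simulated data, to show which parts of the result are usually read off. The true slope of -0.8 and the data frame sim_data are made up for the illustration.

```r
# Simulated stand-in for the two expression columns (slope -0.8 is made up).
set.seed(2)
n <- 100
gene_IL17BR <- rnorm(n)
gene_HOXB13 <- 0.5 - 0.8 * gene_IL17BR + rnorm(n, sd = 0.5)
sim_data <- data.frame(gene_HOXB13 = gene_HOXB13, gene_IL17BR = gene_IL17BR)

fitlm <- lm(gene_HOXB13 ~ gene_IL17BR, data = sim_data)

coef_table <- coef(summary(fitlm))   # estimate, std. error, t value, p-value
slope <- coef_table["gene_IL17BR", "Estimate"]
slope_ci <- confint(fitlm)["gene_IL17BR", ]   # 95% confidence interval
r_squared <- summary(fitlm)$r.squared

# Residuals should look like noise around zero if the linear model is adequate.
res <- residuals(fitlm)
```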

[Figure: Linear Regression Result]

Kaplan-Meier Plot
• We use the Kaplan-Meier plot and the log-rank test to check whether survival time differs significantly between groups (say, the high- and low-ratio groups)
library(survival)  # provides Surv() and survfit()
ratiosurv <- survfit(Surv(time, censor) ~ ratio_group, data = toy_data)
autoplot(ratiosurv, pVal = TRUE, pX = 0.25, pY = 0.25, title = paste0("Kaplan-Meier plot of toy example"), yLab = "Survival Probability")
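The log-rank test mentioned above can also be run directly with survdiff() from the survival package. The sketch below uses simulated data; the hazard rates, the censoring range and the name sim_data are assumptions made for the illustration, and censor = 1 marks an observed event as in the deck.

```r
library(survival)

# Simulated survival data: the "high" group gets a higher hazard (made-up rates).
set.seed(3)
n <- 120
ratio_group <- rep(c("high", "low"), each = n / 2)
event_time <- rexp(n, rate = ifelse(ratio_group == "high", 0.10, 0.04))
censor_time <- runif(n, 5, 40)
sim_data <- data.frame(
  time = pmin(event_time, censor_time),
  censor = as.integer(event_time <= censor_time),  # 1 = event observed
  ratio_group = ratio_group
)

# Kaplan-Meier curves per group, drawn with base graphics.
km <- survfit(Surv(time, censor) ~ ratio_group, data = sim_data)
plot(km, col = c("red", "blue"), xlab = "Time", ylab = "Survival Probability")
legend("topright", legend = names(km$strata), col = c("red", "blue"), lty = 1)

# Log-rank test: chi-square statistic with (number of groups - 1) degrees of freedom.
lr <- survdiff(Surv(time, censor) ~ ratio_group, data = sim_data)
p_logrank <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
```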

[Figure: Kaplan-Meier Plot]

Cox Proportional Hazards Model
• We use the high/low ratio group to predict survival. Here the response is the survival time together with the censoring indicator
fitcox <- coxph(Surv(time, censor) ~ group, data = toy_data)
summary(fitcox)
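The quantity usually reported from a Cox fit is the hazard ratio between groups. A minimal sketch on simulated data follows; the true hazard ratio of 2 and the name sim_data are made up, and cox.zph() is included only as one common check of the proportional hazards assumption.

```r
library(survival)

# Simulated data: the "high" group has twice the hazard of the "low" group (made up).
set.seed(4)
n <- 200
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time <- rexp(n, rate = ifelse(group == "high", 0.10, 0.05))
censor_time <- runif(n, 5, 40)
sim_data <- data.frame(
  time = pmin(event_time, censor_time),
  censor = as.integer(event_time <= censor_time),  # 1 = event observed
  group = group
)

fitcox <- coxph(Surv(time, censor) ~ group, data = sim_data)

hr <- unname(exp(coef(fitcox)))   # hazard ratio, high vs. low
hr_ci <- exp(confint(fitcox))     # 95% confidence interval for the hazard ratio
ph_check <- cox.zph(fitcox)       # test of the proportional hazards assumption
```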

[Figure: Cox Model Result]

Data Downloading, Processing, and Analysis

Outline
• Download data
• Parsing data
• Normalization
• Variance-based filtering (top 25%)
• t-test-based filtering (based on the p-value cutoff)
The above steps are implemented in the "get_DEG_table.R" script

Data Availability
• Microdissected dataset GSE1378: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1378
• Whole tissue dataset GSE1379: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379
• The easiest way to download the data is the "getGEO" function from the "GEOquery" package

Use "getGEO" to Download Data
• We have already downloaded the data; you can use the "getGEO" function to load it locally or fetch it online
• Local (loading_method = 'local'):
geo_Name <- 'GSE1378'
geodata2 <- getGEO(filename = paste0("geo_data/", geo_Name, "_series_matrix.txt.gz"), GSEMatrix = TRUE)
• Online (loading_method = 'online'):
geodata <- getGEO(geo_Name, GSEMatrix = TRUE, destdir = "geo_data")
• Set the loading_method variable in the get_DEG_table function to "local" or "online" to choose how the data is loaded
• Note that the downloaded geno matrix is on the log2 scale

Parsing Data
• Extract the geno matrix, pheno table and feature table:
idx <- 1
geno <- assayData(geodata[[idx]])$exprs
pheno <- pData(phenoData(geodata[[idx]]))
feature <- as(featureData(geodata[[idx]]), "data.frame")
• Parse the phenotype table to get the variables Age, Size, DFS and censor:
infos_df$Age = as.numeric(unlist(strsplit(infos_df$X9, split = "="))[seq(2, 2 * n, 2)])
infos_df$Size = as.numeric(unlist(strsplit(infos_df$X3, split = "="))[seq(2, 2 * n, 2)])
infos_df$DFS = as.numeric(unlist(strsplit(infos_df$X10, split = "="))[seq(2, 2 * n, 2)])
infos_df$censor = ifelse(infos_df$status == "Status=recur", 1, 0)
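The parsing step relies on each phenotype entry having the form "key=value", so that splitting on "=" yields two pieces per sample and the values sit at the even positions. A small sketch with made-up strings (the real contents of the X9 and status columns are not shown in the deck, so x9 and status here are hypothetical):

```r
# Hypothetical phenotype strings in "key=value" form.
x9 <- c("Age=56", "Age=61", "Age=48")
status <- c("Status=recur", "Status=non-recur", "Status=recur")
n <- length(x9)

# Approach from the slide: flatten all pieces and keep the even positions.
age_a <- as.numeric(unlist(strsplit(x9, split = "="))[seq(2, 2 * n, 2)])

# Alternative: take the second piece of each split, sample by sample.
age_b <- as.numeric(sapply(strsplit(x9, split = "="), `[`, 2))

# Alternative: strip everything up to and including the first "=".
age_c <- as.numeric(sub("^[^=]*=", "", x9))

# Event indicator: 1 for recurrence, 0 otherwise.
censor <- ifelse(status == "Status=recur", 1, 0)
```

The per-sample variants do not depend on every entry splitting into exactly two pieces, which may matter if a field is missing.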

Normalization
• Gene wise normalization (within each sample, subtract the median log2 value taken over all genes):
tmp_gm <- apply(geno, 2, median)
geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
• Sample wise normalization (for each gene, divide by the mean value taken over all samples, on the original scale):
geno <- apply(geno, c(1, 2), function(x) 2 ^ x)
geno <- t(apply(geno, 1, function(x) x / (mean(x))))
geno <- apply(geno, c(1, 2), function(x) log2(x))
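The two normalization steps can be written more compactly with sweep() and rowMeans(). The sketch below checks on a small random matrix (genes in rows, samples in columns, as in the deck) that the compact form gives the same result as the slide code; the matrix and the step1_/step2_ names are made up for the illustration.

```r
# Small random log2 matrix: genes in rows, samples in columns.
set.seed(6)
numOfGene <- 5
n <- 4
geno <- matrix(rnorm(numOfGene * n, mean = 8), numOfGene, n)

# Step 1: subtract each column's median, slide version and sweep() version.
tmp_gm <- apply(geno, 2, median)
step1_slide <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
step1_sweep <- sweep(geno, 2, tmp_gm, "-")

# Step 2: on the original scale divide each row by its mean, then return to log2.
lin <- 2 ^ step1_sweep
step2_slide <- log2(t(apply(lin, 1, function(x) x / mean(x))))
step2_short <- log2(lin / rowMeans(lin))  # the row-mean vector recycles down the rows
```

After step 1 every column has median 0 on the log2 scale; after step 2 every row has mean 1 on the original scale.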

Variance-Based Filtering
• Calculate the variance of each gene and keep the top 25% most variable genes (variance above the 75th percentile):
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
feature_var_filtered <- feature[var_filtered_idx, ]
geno_var_filtered <- geno[var_filtered_idx, ]
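A quick check of the filter on a made-up matrix of 40 genes: with the 75th percentile as the threshold, exactly the 10 most variable genes are kept. The feature table here is a hypothetical one-column stand-in.

```r
# Made-up expression matrix: 40 genes (rows) by 6 samples (columns).
set.seed(7)
geno <- matrix(rnorm(40 * 6), nrow = 40, ncol = 6,
               dimnames = list(paste0("gene", 1:40), paste0("s", 1:6)))
feature <- data.frame(ID = rownames(geno), stringsAsFactors = FALSE)

var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)

# drop = FALSE keeps the matrix / data frame shape even if only one gene survives.
geno_var_filtered <- geno[var_filtered_idx, , drop = FALSE]
feature_var_filtered <- feature[var_filtered_idx, , drop = FALSE]
```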

t-Test-Based Filtering
• For each gene, run a t-test between the recurrence and non-recurrence groups. The status variable indicates the group:
tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value
• Filter the genes by the p-value cutoff:
ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
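The slide shows only the body of the per-gene loop. One way the full loop might look is sketched below on simulated data, where the first gene is given a made-up shift between groups so that it passes the filter. The loop structure and the construction of sdata are assumptions; the 0.0011 cutoff is the one quoted in the sample results.

```r
# Simulated data: 50 genes, 20 samples, two groups of 10.
set.seed(8)
n_gene <- 50
n_sample <- 20
status <- factor(rep(c("recur", "non-recur"), each = n_sample / 2))
geno_var_filtered <- matrix(rnorm(n_gene * n_sample), n_gene, n_sample)
# Make the first gene truly differential (shift of 5 is made up).
geno_var_filtered[1, status == "recur"] <- geno_var_filtered[1, status == "recur"] + 5

# One two-sided t-test per gene.
pvalue_list <- numeric(n_gene)
for (i in seq_len(n_gene)) {
  sdata <- data.frame(gene_express = geno_var_filtered[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

# Keep the genes whose p-value is below the cutoff.
cutoff <- 0.0011
ttest_filtered_idx <- which(pvalue_list < cutoff)
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, , drop = FALSE]
```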

[Figure: Sample Results (GSE1378, microdissected; cutoff 0.0011)]
[Figure: Sample Results (GSE1379, whole tissue dataset; cutoff 0.0011)]

Statistical Modeling (Examples)

Outline
• Select overlapped genes between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and dendrogram
• Univariate logistic regression for selected genes and the two-gene ratio predictor
• Multivariate logistic regression (size and the other two potential predictors)
• Survival analysis part 1: Kaplan-Meier plot
• Survival analysis part 2: Cox proportional hazards model

Overlapped Genes
• In the preprocessing step we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379
• We use the genes that overlap between these two DEG tables for the subsequent analysis
• GSE1378: micro-dissected breast cancer cells (LCM)
• GSE1379: whole tissue section
• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111) and AI240933 (EST)
• We will study the prognostic value of these markers

Heatmap and Dendrogram
• We use a heatmap and dendrogram to visually check the relationship (correlation) among genes or samples
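A minimal base-R sketch of the idea, on a made-up matrix of four genes in which geneB is constructed to track geneA. The gene names and the choice of a correlation-based distance with average linkage are assumptions for the illustration, not necessarily what the course script uses.

```r
# Made-up expression matrix: 4 genes (rows) by 10 samples (columns).
set.seed(9)
geno_small <- matrix(rnorm(4 * 10), nrow = 4,
                     dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                                     paste0("s", 1:10)))
# Make geneB follow geneA closely so the two should cluster together.
geno_small["geneB", ] <- geno_small["geneA", ] + rnorm(10, sd = 0.1)

# Correlation-based distance between genes, then hierarchical clustering.
gene_dist <- as.dist(1 - cor(t(geno_small)))
gene_tree <- hclust(gene_dist, method = "average")
plot(gene_tree)  # dendrogram of the genes

# heatmap() draws row and column dendrograms using its own default
# (Euclidean) distance unless distfun is supplied.
heatmap(geno_small, scale = "row")

# Cutting the tree into 3 clusters leaves exactly one merged pair.
clusters <- cutree(gene_tree, k = 3)
```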

[Figure: Heatmap (microdissected, GSE1378); consistent with the paper]
[Figure: Heatmap (whole tissue section, GSE1379)]

Model Set 1
• Univariate logistic regression for each gene
– Response variable: recur/non-recur status
– Predictors: one of the overlapped genes: HOXB13, IL17BR (AF208111), AI240933 (EST)

Model Set 2
• Univariate logistic regression for the ratio of genes
– Response variable: recur/non-recur status
– Predictor: HOXB13/IL17BR

Model Set 3
• Multivariate logistic regression
– Response variable: recur/non-recur
– Predictors: tumor size, HOXB13/IL17BR, PGR and ERBB2

Model Set 4
• Survival model
– Response variable: DFS (disease-free survival time) and censor
– Predictor: use "-intercept/beta" from the logistic regression as the cutoff to divide the samples into two groups, a high-ratio group and a low-ratio group
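The "-intercept/beta" cutoff in Model Set 4 is the ratio value at which the fitted logistic model gives a recurrence probability of 0.5: setting intercept + beta * x = 0 gives x = -intercept/beta. A small sketch on simulated data follows; the coefficients 1.2 and 1.0 are made up, and labelling the group above the cutoff as "high" assumes a positive beta.

```r
# Simulated ratio predictor and binary status (made-up coefficients).
set.seed(10)
n <- 300
ratio <- rnorm(n)
status <- rbinom(n, 1, plogis(1.2 + 1.0 * ratio))

fit <- glm(status ~ ratio, family = binomial(link = "logit"))
b <- coef(fit)

# Ratio value where the fitted log-odds are 0, i.e. the fitted probability is 0.5.
cutoff <- unname(-b[1] / b[2])
p_at_cutoff <- unname(predict(fit, newdata = data.frame(ratio = cutoff),
                              type = "response"))

# Split the samples into high- and low-ratio groups at the cutoff.
group <- ifelse(ratio > cutoff, "high", "low")
```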

Important Note
• Remember that there are two datasets: GSE1378 and GSE1379
• The same sets of models can be fit on both datasets
• You need to set the working dataset variable:
working_dataset = "GSE1379"  # whole tissue section: GSE1379
working_dataset = "GSE1378"  # microdissected breast cancer cells: GSE1378
• We use working dataset GSE1378 as the example

Univariate Logistic Regression for Each Gene
• As an example, we check the gene HOXB13:
gb_acc = "BC007092"  # HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
logit_data = data.frame(status = infos_df$status, gene = geno_selected)
fit <- glm(status ~ geno_selected, data = logit_data, family = binomial(link = "logit"))
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

[Figure: Sample Output (gene HOXB13)]
[Figure: ROC (AUC = 0.796, gene HOXB13)]

Univariate Logistic Regression (HOXB13/IL17BR)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
# on the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)
# fit the model
fit <- glm(status ~ gene_ratio, data = logit_data, family = binomial(link = "logit"))
summary(fit)
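The comment "on the log2 scale, the ratio is the difference" follows from log2(a / b) = log2(a) - log2(b). A tiny check with made-up intensities:

```r
# Made-up intensities on the original (non-log) scale.
hoxb13 <- c(120, 40, 300)
il17br <- c(60, 160, 75)

ratio_direct <- log2(hoxb13 / il17br)        # log2 of the ratio
ratio_diff <- log2(hoxb13) - log2(il17br)    # difference of the log2 values
# Both give 1, -2, 2 (ratios of 2, 1/4 and 4).
```

This is why the code subtracts the two log2 expression vectors instead of dividing them.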

[Figure: Sample Output (HOXB13/IL17BR)]
[Figure: ROC (AUC = 0.84, HOXB13/IL17BR)]

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# on the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ gene_ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)

[Figure: Sample Output (Multivariate)]
[Figure: ROC (AUC = 0.86, multivariate)]
[Figure: Kaplan-Meier Plot (gene ratio high/low group, cutoff = -12)]

Cox Proportional Hazards Model
(gene ratio high/low group, cutoff = -12)
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)

[Figure: Sample Output (Cox)]

Validation: GSE6532
• Link to the dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Number of total markers: 54675
• The genes HOXB13 and IL17RB and the ESTs are included in this dataset
• We use this dataset for validation
• Result: the markers are not significant on this independent set
Big Data Training for Translational Omics Research
Logistic Regression Result
Big Data Training for Translational Omics Research
ROC Curve
Big Data Training for Translational Omics Research
Linear Model
bull Response --- expression of HOXB13
bull Predictor --- expression of IL17BR linear model use gene IL17BR to predict another gene HOXB13
HOXB13fitlmlt- lm(gene_HOXB13~gene_IL17BRdata = data_toy)
summary(fitlm)
Big Data Training for Translational Omics Research
Linear Regression Result
Big Data Training for Translational Omics Research
Kaplan-Meier Plot
bull We use Kaplan-Meier plot and log-rank
test to check whether the survival time is
significantly different from each other
between groups (say highlow ratio
group)
ratiosurv lt- survfit(Surv(timecensor) ~ ratio_group data = toy_data)
autoplot(ratiosurvpVal = TpX=025pY =025title = paste0(Kaplan-Meier plot of
toy example )yLab = Survival Probability)
Big Data Training for Translational Omics Research
Kaplan-Meier Plot
Big Data Training for Translational Omics Research
Cox Proportional Odds Model
bull We use highlow ratio group to predict the
survival probability Here the response is
the survival time and the censor
information
fitcox lt- coxph(Surv(timecensor) ~ group data = toy_data)
summary(fitcox)
Big Data Training for Translational Omics Research
Cox Model Result
Big Data Training for Translational Omics Research
Data Downloading Processing
and Analysis
Big Data Training for Translational Omics Research
Outline
bull Download data
bull Parsing data
bull Normalization
bull Variance based filtering (top 25)
bull T test based filtering(based on the P-value cutoff)
The above steps are implemented in
ldquoget_DEG_tableRrdquo script
Big Data Training for Translational Omics Research
Data Availability
bull Microdissected dataset GSE1378
httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE
1378
bull Whole tissue dataset GSE1379
httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE
1379
bull The easiest way to download data is using ldquogetGEOrdquo
function from ldquoGEOqueryrdquo package
Big Data Training for Translational Omics Research
Use ldquogetGEOrdquo to Download Databull We have downloaded the data you can use
ldquogetGEOrdquo function to get data locally or online
bull Local (loading_method = lsquolocalrsquo)geo_Name lt- lsquoGSE1378rsquo
geodata2 lt-getGEO(filename paste0(geo_datageo_Name_series_matrixtxtgz) GSEMatrix = TRUE)
bull Online (loading_method = lsquoonlinersquo)geodata lt- getGEO(geo_Name GSEMatrix = TRUEdestdir = geo_data)
bull You can set loading_method variable in the get_DEG_table function to rdquolocalrdquo or ldquoonlinerdquo to change the way of downloading data
bull Note that the downloaded geno matrix is in log2scale
Big Data Training for Translational Omics Research
Parsing Data
bull Extract the geno matrix pheno table and
feature tableidx lt- 1 geno lt- assayData(geodata[[idx]])$exprs
pheno lt- pData(phenoData(geodata[[idx]]))
feature lt- as(featureData(geodata[[idx]]) dataframe)
bull Parsing phenotype table to get variable
Age Size DFS censorinfos_df$Age = asnumeric(unlist(strsplit(infos_df$X9 split = =))[seq(2 2 n 2)])
infos_df$Size = asnumeric(unlist(strsplit(infos_df$X3 split = =))[seq(2 2 n 2)])
infos_df$DFS = asnumeric(unlist(strsplit(infos_df$X10 split = =))[seq(2 2 n 2)])
infos_df$censor = ifelse(infos_df$status == Status=recur 1 0)
Big Data Training for Translational Omics Research
Normalization
bull Gene wise normalization (subtract the
median log2 value)tmp_gm lt- apply(geno 2 median)
geno lt- geno - matrix(rep(1 numOfGene) numOfGene 1)
matrix(tmp_gm 1 n)
bull Sample wise normalization (divided
by mean value in original scale)geno lt- apply(geno c(1 2) function(x) 2 ^ x )
geno lt- t(apply(geno 1 function(x) x (mean(x)) ))
geno lt- apply(geno c(1 2) function(x) log2(x) )
Big Data Training for Translational Omics Research
Variance Based Filtering
bull Calculate the variance for each gene and
choose the top 25 variance based filtering (75th percentile)
var_geno lt- apply(geno 1 var)
var_filtered_idx lt- var_geno gt quantile(var_geno 075)
feature_var_filtered lt- feature[var_filtered_idx]
geno_var_filtered lt- geno[var_filtered_idx]
Big Data Training for Translational Omics Research
T test Based Filtering
bull For each gene do T test between the
recurrence and non-recurrence group
The status variable indicates the group
informationtmp_test lt- ttest(gene_express ~ status data = sdata alternative =
twosided)
pvalue_list[i] lt- tmp_test$pvalue
bull Fitering the gene by the P-value cutoffttest_filtered_idx lt- which(pvalue_list lt cutoff)
feature_ttest_filtered lt- feature_var_filtered[ttest_filtered_idx]
geno_ttest_filtered lt- geno_var_filtered[ttest_filtered_idx]
Big Data Training for Translational Omics Research
Sample Results
(GSE1378microdissected 00011 cutoff)
Big Data Training for Translational Omics Research
Sample Results
(GSE1379 whole tissue dataset cutoff 00011)
Big Data Training for Translational Omics Research
Statistical Modeling(examples)
Big Data Training for Translational Omics Research
Outline
bull Select overlapped genes between GSE1378 and
GSE1379 for subsequent analysis
bull Heatmap and Dendrogram
bull Univariate logistic regression for selected genes and
two-gene ratio predictor
bull Multivariate logistic regression (size and the other two
potential predictors)
bull Survival analysis part 1 Kaplan-Meier plot
bull Survival analysis part 2 Cox proportional odds model
Big Data Training for Translational Omics Research
Overlapped Genes
bull In the prepossessing step we obtained two DEG tables
for the datasets GSE1378 and GSE1379
bull We used the overlapped genes in this two DEG tables
for the subsequent analysis
bull GSE1378 Micro-dissected breast cancer cell (LCM)
bull GSE1379 Whole tissue section
bull The overlapped genes are HOXB13 (identified twice as
AI208111 and BC007092) IL17BR (AF2080111) and
AI240933 (EST)
bull We will study the prognostic value of these markers
Big Data Training for Translational Omics Research
Heatmap and Dendrogram
bull We use Heatmap and Dendrogram to
Visually check the relationship
(correlation) among genes or samples
Big Data Training for Translational Omics Research
Heatmap(microdissectedGSE1378)
consistent with the paper
Big Data Training for Translational Omics Research
Heatmap(whole section tissue GSE 1379)
Big Data Training for Translational Omics Research
Model Set 1
bull Univariate logistic regression for each
gene
ndash Response variable recurnon-recur
status
ndash Predictors one of the overlapped
genes HOXB13 IL17BR(AF2080111)
AI240933(EST)
Big Data Training for Translational Omics Research
Model Set 2
bull Univariate logistic regression for
ratio of genes
ndash Response variable recurnon-recur
status
ndash Predictors HOXB13IL17BR
Big Data Training for Translational Omics Research
Model Set 3
bull Multivariate logistic regression
ndash Response variable recurnon-
recur
ndash Predictors tumor size
HOXB13IL17BR PGR and ERBB2
Big Data Training for Translational Omics Research
Model Set 4
bull Survival model
ndash Response variable DFS (disease free
survival time) censor
ndash Predictor use ldquo-interceptbetardquo from
logistic regression as the cutoff to
divide the sample into two groups high
ratio group and low ratio group
Big Data Training for Translational Omics Research
Important Note
bull Please remember there are two datasets GSE1378 and GSE1379
bull Can fit the same sets of model on these two datasets
bull Need to set the working dataset variable
working_dataset = GSE1378 whole tissue sectionGSE1379
working_dataset = GSE1378 microdissected breast cancer cells
GSE1378
bull Use working dataset GSE1378 as example
Big Data Training for Translational Omics Research
Univariate Logistic Regression
for Each Gene
bull As an example we check the gene HOXB13gb_acc = BC007092 HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc)]
logit_data = dataframe(status = infos_df$statusgene = geno_selected )
fit lt- glm(status~ geno_selecteddata = logit_datafamily = binomial(link = logit))
p lt- predict(fit type=response)
pr lt- prediction(p infos_df$status)
prf lt- performance(pr measure = tpr xmeasure = fpr)
plot(prfmain=paste0(ROC plot of gene gb_acc))
auc lt- performance(pr measure = auc)
auc lt- aucyvalues[[1]]
auc
Big Data Training for Translational Omics Research
Sample Output (gene HOXB13 )
Big Data Training for Translational Omics Research
ROC (auc 0796 gene HOXB13 )

Univariate Logistic Regression (HOXB13/IL17BR)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)
# fit the model
fit <- glm(status ~ gene_ratio, data = logit_data, family = binomial(link = "logit"))
summary(fit)
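
The comment "in the log2 scale, the ratio is the difference" is the identity log2(a / b) = log2(a) - log2(b). A small check with made-up expression values on the original scale:

```r
# Made-up expression values on the original (linear) scale.
hoxb13 <- c(8, 2, 16)
il17br <- c(2, 4, 16)

ratio_log2 <- log2(hoxb13 / il17br)        # log2 of the ratio
diff_log2  <- log2(hoxb13) - log2(il17br)  # difference of the log2 values

ratio_log2
all.equal(ratio_log2, diff_log2)
```

This is why the slide code can subtract the two rows of the log2-scale geno matrix to obtain the gene ratio.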

Sample Output (HOXB13/IL17BR)

ROC (AUC = 0.84, HOXB13/IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ gene_ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)

Sample Output (Multivariate)

ROC (AUC = 0.86, Multivariate)

Kaplan-Meier Plot
(gene ratio high/low groups, cutoff = -1.2)

Cox Proportional Hazards Model
(gene ratio high/low groups, cutoff = -1.2)
library(survival)  # provides Surv() and coxph()
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)

Sample Output (Cox)

Validation: GSE6532
• Link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Total number of markers: 54675
• The genes HOXB13 and IL17RB and the ESTs are included in this dataset
• We use this dataset for validation
• Result: the markers are not significant on this independent set

ROC Curve

Linear Model
• Response --- expression of HOXB13
• Predictor --- expression of IL17BR
# linear model: use gene IL17BR to predict another gene, HOXB13
fitlm <- lm(gene_HOXB13 ~ gene_IL17BR, data = toy_data)
summary(fitlm)
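
To show what summary(fitlm) reports, here is a minimal sketch on simulated data. The data frame toy, the true slope of 0.8, and the noise level are assumptions chosen for illustration; only the column names follow the slide.

```r
# Simulated stand-in for the toy data (assumption): HOXB13 depends linearly on IL17BR.
set.seed(2)
n <- 200
toy <- data.frame(gene_IL17BR = rnorm(n))
toy$gene_HOXB13 <- 1 + 0.8 * toy$gene_IL17BR + rnorm(n, sd = 0.5)

fitlm <- lm(gene_HOXB13 ~ gene_IL17BR, data = toy)

# Estimated slope: expected change in HOXB13 per unit change in IL17BR.
slope <- unname(coef(fitlm)["gene_IL17BR"])
# Proportion of variance in HOXB13 explained by the model.
r2 <- summary(fitlm)$r.squared
c(slope = slope, r.squared = r2)
```

The estimated slope should land near the simulated value, and the same two quantities appear in the coefficient table and the R-squared line of summary(fitlm).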

Linear Regression Result

Kaplan-Meier Plot
• We use the Kaplan-Meier plot and the log-rank test to check whether survival time differs significantly between groups (say, the high/low ratio groups)
library(survival)  # provides Surv() and survfit()
ratiosurv <- survfit(Surv(time, censor) ~ ratio_group, data = toy_data)
# autoplot() with pVal/pX/pY/yLab is not base R; the package or helper that provides it is not named on the slide
autoplot(ratiosurv, pVal = TRUE, pX = 0.25, pY = 0.25, title = paste0("Kaplan-Meier plot of toy example"), yLab = "Survival Probability")
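
The p-value drawn on the plot comes from the log-rank test. A minimal sketch of obtaining it explicitly with survdiff() from the survival package follows. The simulated data, the event rates, and the censoring window are assumptions for illustration.

```r
library(survival)

# Simulated stand-in for the toy data (assumption): the high-ratio group
# is given the higher recurrence rate.
set.seed(3)
n <- 100
ratio_group <- rep(c("high", "low"), each = n / 2)
event_time <- rexp(n, rate = ifelse(ratio_group == "high", 0.10, 0.04))
censor_time <- runif(n, min = 5, max = 30)
toy <- data.frame(
  time = pmin(event_time, censor_time),
  censor = as.integer(event_time <= censor_time),  # 1 = recurrence observed
  ratio_group = ratio_group
)

km <- survfit(Surv(time, censor) ~ ratio_group, data = toy)
lr <- survdiff(Surv(time, censor) ~ ratio_group, data = toy)

# Log-rank p-value from the chi-square statistic (1 degree of freedom for two groups).
p_logrank <- pchisq(lr$chisq, df = 1, lower.tail = FALSE)
p_logrank

# Draw the two survival curves when running interactively.
if (interactive()) plot(km, lty = 1:2, xlab = "Time", ylab = "Survival Probability")
```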

Kaplan-Meier Plot

Cox Proportional Hazards Model
• We use the high/low ratio group to predict survival. Here the response is the survival time together with the censor information
library(survival)  # provides Surv() and coxph()
fitcox <- coxph(Surv(time, censor) ~ group, data = toy_data)
summary(fitcox)
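
The main number to read from summary(fitcox) is the hazard ratio, exp(coef), with its confidence interval. The sketch below extracts both on simulated data. The data and the choice of "low" as the reference level are assumptions for illustration.

```r
library(survival)

# Simulated stand-in for the toy data (assumption).
set.seed(4)
n <- 100
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time <- rexp(n, rate = ifelse(group == "high", 0.10, 0.04))
censor_time <- runif(n, min = 5, max = 30)
toy <- data.frame(time = pmin(event_time, censor_time),
                  censor = as.integer(event_time <= censor_time),
                  group = group)

fitcox <- coxph(Surv(time, censor) ~ group, data = toy)

hr <- unname(exp(coef(fitcox)))   # hazard ratio, high vs low
ci <- exp(confint(fitcox))        # 95% confidence interval for the hazard ratio
hr
ci
```

A hazard ratio above 1 means the "high" group has the higher estimated recurrence hazard relative to the reference group.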

Cox Model Result

Data Downloading, Processing, and Analysis

Outline
• Download data
• Parsing data
• Normalization
• Variance-based filtering (top 25%)
• T-test-based filtering (based on the P-value cutoff)
The above steps are implemented in the "get_DEG_table.R" script

Data Availability
• Microdissected dataset: GSE1378
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1378
• Whole tissue dataset: GSE1379
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379
• The easiest way to download the data is the "getGEO" function from the "GEOquery" package

Use "getGEO" to Download Data
• We have already downloaded the data; you can use the "getGEO" function to load it locally or to fetch it online
• Local (loading_method = 'local')
library(GEOquery)  # provides getGEO()
geo_Name <- 'GSE1378'
geodata2 <- getGEO(filename = paste0("geo_data/", geo_Name, "_series_matrix.txt.gz"), GSEMatrix = TRUE)
• Online (loading_method = 'online')
geodata <- getGEO(geo_Name, GSEMatrix = TRUE, destdir = "geo_data")
• Set the loading_method variable in the get_DEG_table function to "local" or "online" to change how the data are loaded
• Note that the downloaded geno matrix is in log2 scale

Parsing Data
• Extract the geno matrix, pheno table, and feature table
idx <- 1
geno <- assayData(geodata[[idx]])$exprs
pheno <- pData(phenoData(geodata[[idx]]))
feature <- as(featureData(geodata[[idx]]), "data.frame")
• Parse the phenotype table to get the variables Age, Size, DFS, and censor
# each entry has the form "key=value"; after splitting on "=", the values sit at the even positions
infos_df$Age = as.numeric(unlist(strsplit(infos_df$X9, split = "="))[seq(2, 2 * n, 2)])
infos_df$Size = as.numeric(unlist(strsplit(infos_df$X3, split = "="))[seq(2, 2 * n, 2)])
infos_df$DFS = as.numeric(unlist(strsplit(infos_df$X10, split = "="))[seq(2, 2 * n, 2)])
infos_df$censor = ifelse(infos_df$status == "Status=recur", 1, 0)
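
The parsing step can be tried on a few made-up strings. The values below and the "Status=non-recur" spelling are assumptions for illustration; only the "key=value" pattern and the "Status=recur" label come from the slide.

```r
# Made-up phenotype strings in the "key=value" form (assumption).
x_age <- c("Age=55", "Age=62", "Age=48")
n <- length(x_age)

# strsplit() yields key, value, key, value, ...; keep the even positions.
parts <- unlist(strsplit(x_age, split = "="))
age <- as.numeric(parts[seq(2, 2 * n, 2)])
age

# Equivalent alternative: drop everything up to and including "=".
age2 <- as.numeric(sub("^[^=]*=", "", x_age))

# Recurrence indicator: 1 for recurrence, 0 otherwise.
status <- c("Status=recur", "Status=non-recur", "Status=recur")
censor <- ifelse(status == "Status=recur", 1, 0)
censor
```

The sub() variant does not depend on n and still works if a value is missing after the "=" sign, in which case it gives NA for that entry.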

Normalization
• Gene-wise normalization: subtract the median log2 value, computed across genes for each sample
tmp_gm <- apply(geno, 2, median)
geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
• Sample-wise normalization: divide each gene by its mean across samples, in the original scale
geno <- apply(geno, c(1, 2), function(x) 2 ^ x)
geno <- t(apply(geno, 1, function(x) x / (mean(x))))
geno <- apply(geno, c(1, 2), function(x) log2(x))
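
The two normalization steps can also be written with sweep() and rowMeans(), which avoids the explicit matrix of ones and the element-wise apply() calls. The sketch below runs on a small simulated matrix (genes in rows, samples in columns, an assumption matching the slide code) and checks that both forms of the median step agree.

```r
# Simulated log2 expression matrix: 6 genes x 4 samples (assumption).
set.seed(5)
numOfGene <- 6
n <- 4
geno <- matrix(rnorm(numOfGene * n, mean = 8), numOfGene, n)

# Median subtraction as on the slide: outer product of ones and column medians.
tmp_gm <- apply(geno, 2, median)
geno_a <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)

# The same operation with sweep().
geno_b <- sweep(geno, 2, tmp_gm, FUN = "-")

# Mean scaling in the original scale, vectorised.
lin <- 2 ^ geno_b
geno_c <- log2(lin / rowMeans(lin))

all.equal(geno_a, geno_b)
```

After these steps every column of geno_b has median zero, and every row of 2 ^ geno_c has mean one.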

Variance Based Filtering
• Calculate the variance for each gene and keep the top 25% (variance above the 75th percentile)
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
feature_var_filtered <- feature[var_filtered_idx, ]
geno_var_filtered <- geno[var_filtered_idx, ]
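
A quick way to confirm that the filter keeps a quarter of the genes is to run it on a simulated matrix. The matrix size and gene names below are assumptions for illustration.

```r
# Simulated expression matrix: 100 genes x 10 samples (assumption).
set.seed(6)
geno <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)
rownames(geno) <- paste0("gene", 1:100)

var_geno <- apply(geno, 1, var)                          # variance of each gene
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)  # TRUE for the top quarter

# drop = FALSE keeps the result a matrix even if a single gene passes.
geno_var_filtered <- geno[var_filtered_idx, , drop = FALSE]
dim(geno_var_filtered)
```

With 100 genes and distinct variances, 25 rows remain and all 10 sample columns are preserved.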

T Test Based Filtering
• For each gene, do a t test between the recurrence and non-recurrence groups. The status variable indicates the group information
tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value
• Filter the genes by the P-value cutoff
ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
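
The slide shows only the body of the per-gene loop. A minimal sketch of the full loop on simulated data follows. The matrix, the group sizes, and the planted difference in gene 1 are assumptions; the 0.0011 cutoff is the one quoted for the sample results.

```r
# Simulated data: 50 genes x 40 samples, 20 samples per group (assumption).
set.seed(7)
num_gene <- 50
status <- factor(rep(c("non-recur", "recur"), each = 20))
geno <- matrix(rnorm(num_gene * 40), nrow = num_gene)
# Plant one truly different gene so the filter has something to find.
geno[1, status == "recur"] <- geno[1, status == "recur"] + 5

pvalue_list <- numeric(num_gene)
for (i in seq_len(num_gene)) {
  sdata <- data.frame(gene_express = geno[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

cutoff <- 0.0011
ttest_filtered_idx <- which(pvalue_list < cutoff)
ttest_filtered_idx
```

Because one test is run per gene, a strict cutoff such as this one limits how many genes pass by chance alone.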

Sample Results
(GSE1378, microdissected, cutoff 0.0011)

Sample Results
(GSE1379, whole tissue dataset, cutoff 0.0011)

Statistical Modeling (examples)

Outline
• Select overlapped genes between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and dendrogram
• Univariate logistic regression for selected genes and the two-gene ratio predictor
• Multivariate logistic regression (size and the other two potential predictors)
• Survival analysis part 1: Kaplan-Meier plot
• Survival analysis part 2: Cox proportional hazards model

Overlapped Genes
• In the preprocessing step we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379
• We used the genes that overlap between these two DEG tables for the subsequent analysis
• GSE1378: micro-dissected breast cancer cells (LCM)
• GSE1379: whole tissue section
• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111), and AI240933 (EST)
• We will study the prognostic value of these markers
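
Finding the overlap amounts to intersecting the accession columns of the two DEG tables. In this sketch the two small tables are made up; "X00001" and "Y00002" are placeholder accessions, and the GB_ACC column name follows the feature table used elsewhere in the slides.

```r
# Made-up DEG tables; only the GB_ACC column matters for the overlap.
deg_1378 <- data.frame(GB_ACC = c("BC007092", "AF208111", "AI240933", "X00001"))
deg_1379 <- data.frame(GB_ACC = c("AF208111", "Y00002", "BC007092", "AI240933"))

# Accessions present in both tables, in the order of the first table.
overlap_acc <- intersect(deg_1378$GB_ACC, deg_1379$GB_ACC)
overlap_acc
```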

Heatmap and Dendrogram
• We use the heatmap and dendrogram to visually check the relationship (correlation) among genes or samples
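
The dendrogram beside a heatmap comes from hierarchical clustering. A minimal sketch of clustering genes by correlation follows. The six simulated genes and the choice of 1 - correlation as the distance are assumptions for illustration, and the slides do not state which distance the course script uses.

```r
# Six simulated genes over 20 samples: three track a signal, three track its negative.
set.seed(8)
n_sample <- 20
signal <- rnorm(n_sample)
expr <- rbind(
  g1 = signal + rnorm(n_sample, sd = 0.1),
  g2 = signal + rnorm(n_sample, sd = 0.1),
  g3 = signal + rnorm(n_sample, sd = 0.1),
  g4 = -signal + rnorm(n_sample, sd = 0.1),
  g5 = -signal + rnorm(n_sample, sd = 0.1),
  g6 = -signal + rnorm(n_sample, sd = 0.1)
)

gene_cor <- cor(t(expr))              # gene-by-gene correlation matrix
hc <- hclust(as.dist(1 - gene_cor))   # dendrogram from correlation distance
clusters <- cutree(hc, k = 2)         # cut the tree into two groups
clusters

# Draw the heatmap with this dendrogram when running interactively.
if (interactive()) heatmap(expr, Rowv = as.dendrogram(hc), scale = "row")
```

Genes g1 to g3 fall into one cluster and g4 to g6 into the other, which is the kind of block structure the heatmap makes visible.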

Heatmap (microdissected, GSE1378)
consistent with the paper

Heatmap (whole tissue section, GSE1379)

Model Set 1
• Univariate logistic regression for each gene
– Response variable: recur/non-recur status
– Predictors: one of the overlapped genes, HOXB13, IL17BR (AF208111), or AI240933 (EST)

Model Set 2
• Univariate logistic regression for the ratio of genes
– Response variable: recur/non-recur status
– Predictor: HOXB13/IL17BR

Model Set 3
• Multivariate logistic regression
– Response variable: recur/non-recur
– Predictors: tumor size, HOXB13/IL17BR, PGR, and ERBB2
Linear Regression Result
Big Data Training for Translational Omics Research
Kaplan-Meier Plot
bull We use Kaplan-Meier plot and log-rank
test to check whether the survival time is
significantly different from each other
between groups (say highlow ratio
group)
ratiosurv lt- survfit(Surv(timecensor) ~ ratio_group data = toy_data)
autoplot(ratiosurvpVal = TpX=025pY =025title = paste0(Kaplan-Meier plot of
toy example )yLab = Survival Probability)
Big Data Training for Translational Omics Research
Kaplan-Meier Plot
Big Data Training for Translational Omics Research
Cox Proportional Odds Model
bull We use highlow ratio group to predict the
survival probability Here the response is
the survival time and the censor
information
fitcox lt- coxph(Surv(timecensor) ~ group data = toy_data)
summary(fitcox)
Big Data Training for Translational Omics Research
Cox Model Result
Big Data Training for Translational Omics Research
Data Downloading Processing
and Analysis
Big Data Training for Translational Omics Research
Outline
bull Download data
bull Parsing data
bull Normalization
bull Variance based filtering (top 25)
bull T test based filtering(based on the P-value cutoff)
The above steps are implemented in
ldquoget_DEG_tableRrdquo script
Big Data Training for Translational Omics Research
Data Availability
bull Microdissected dataset GSE1378
httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE
1378
bull Whole tissue dataset GSE1379
httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE
1379
bull The easiest way to download data is using ldquogetGEOrdquo
function from ldquoGEOqueryrdquo package
Big Data Training for Translational Omics Research
Use ldquogetGEOrdquo to Download Databull We have downloaded the data you can use
ldquogetGEOrdquo function to get data locally or online
bull Local (loading_method = lsquolocalrsquo)geo_Name lt- lsquoGSE1378rsquo
geodata2 lt-getGEO(filename paste0(geo_datageo_Name_series_matrixtxtgz) GSEMatrix = TRUE)
bull Online (loading_method = lsquoonlinersquo)geodata lt- getGEO(geo_Name GSEMatrix = TRUEdestdir = geo_data)
bull You can set loading_method variable in the get_DEG_table function to rdquolocalrdquo or ldquoonlinerdquo to change the way of downloading data
bull Note that the downloaded geno matrix is in log2scale
Big Data Training for Translational Omics Research
Parsing Data
bull Extract the geno matrix pheno table and
feature tableidx lt- 1 geno lt- assayData(geodata[[idx]])$exprs
pheno lt- pData(phenoData(geodata[[idx]]))
feature lt- as(featureData(geodata[[idx]]) dataframe)
bull Parsing phenotype table to get variable
Age Size DFS censorinfos_df$Age = asnumeric(unlist(strsplit(infos_df$X9 split = =))[seq(2 2 n 2)])
infos_df$Size = asnumeric(unlist(strsplit(infos_df$X3 split = =))[seq(2 2 n 2)])
infos_df$DFS = asnumeric(unlist(strsplit(infos_df$X10 split = =))[seq(2 2 n 2)])
infos_df$censor = ifelse(infos_df$status == Status=recur 1 0)
Big Data Training for Translational Omics Research
Normalization
bull Gene wise normalization (subtract the
median log2 value)tmp_gm lt- apply(geno 2 median)
geno lt- geno - matrix(rep(1 numOfGene) numOfGene 1)
matrix(tmp_gm 1 n)
bull Sample wise normalization (divided
by mean value in original scale)geno lt- apply(geno c(1 2) function(x) 2 ^ x )
geno lt- t(apply(geno 1 function(x) x (mean(x)) ))
geno lt- apply(geno c(1 2) function(x) log2(x) )
Big Data Training for Translational Omics Research
Variance Based Filtering
bull Calculate the variance for each gene and
choose the top 25 variance based filtering (75th percentile)
var_geno lt- apply(geno 1 var)
var_filtered_idx lt- var_geno gt quantile(var_geno 075)
feature_var_filtered lt- feature[var_filtered_idx]
geno_var_filtered lt- geno[var_filtered_idx]
Big Data Training for Translational Omics Research
T test Based Filtering
bull For each gene do T test between the
recurrence and non-recurrence group
The status variable indicates the group
informationtmp_test lt- ttest(gene_express ~ status data = sdata alternative =
twosided)
pvalue_list[i] lt- tmp_test$pvalue
bull Fitering the gene by the P-value cutoffttest_filtered_idx lt- which(pvalue_list lt cutoff)
feature_ttest_filtered lt- feature_var_filtered[ttest_filtered_idx]
geno_ttest_filtered lt- geno_var_filtered[ttest_filtered_idx]
Big Data Training for Translational Omics Research
Sample Results
(GSE1378microdissected 00011 cutoff)
Big Data Training for Translational Omics Research
Sample Results
(GSE1379 whole tissue dataset cutoff 00011)
Big Data Training for Translational Omics Research
Statistical Modeling(examples)
Big Data Training for Translational Omics Research
Outline
bull Select overlapped genes between GSE1378 and
GSE1379 for subsequent analysis
bull Heatmap and Dendrogram
bull Univariate logistic regression for selected genes and
two-gene ratio predictor
bull Multivariate logistic regression (size and the other two
potential predictors)
bull Survival analysis part 1 Kaplan-Meier plot
bull Survival analysis part 2 Cox proportional odds model
Big Data Training for Translational Omics Research
Overlapped Genes
bull In the prepossessing step we obtained two DEG tables
for the datasets GSE1378 and GSE1379
bull We used the overlapped genes in this two DEG tables
for the subsequent analysis
bull GSE1378 Micro-dissected breast cancer cell (LCM)
bull GSE1379 Whole tissue section
bull The overlapped genes are HOXB13 (identified twice as
AI208111 and BC007092) IL17BR (AF2080111) and
AI240933 (EST)
bull We will study the prognostic value of these markers
Big Data Training for Translational Omics Research
Heatmap and Dendrogram
bull We use Heatmap and Dendrogram to
Visually check the relationship
(correlation) among genes or samples
Big Data Training for Translational Omics Research
Heatmap(microdissectedGSE1378)
consistent with the paper
Big Data Training for Translational Omics Research
Heatmap(whole section tissue GSE 1379)
Big Data Training for Translational Omics Research
Model Set 1
bull Univariate logistic regression for each
gene
ndash Response variable recurnon-recur
status
ndash Predictors one of the overlapped
genes HOXB13 IL17BR(AF2080111)
AI240933(EST)
Big Data Training for Translational Omics Research
Model Set 2
bull Univariate logistic regression for
ratio of genes
ndash Response variable recurnon-recur
status
ndash Predictors HOXB13IL17BR
Big Data Training for Translational Omics Research
Model Set 3
bull Multivariate logistic regression
ndash Response variable recurnon-
recur
ndash Predictors tumor size
HOXB13IL17BR PGR and ERBB2
Big Data Training for Translational Omics Research
Model Set 4
bull Survival model
ndash Response variable DFS (disease free
survival time) censor
ndash Predictor use ldquo-interceptbetardquo from
logistic regression as the cutoff to
divide the sample into two groups high
ratio group and low ratio group
Big Data Training for Translational Omics Research
Important Note
bull Please remember there are two datasets GSE1378 and GSE1379
bull Can fit the same sets of model on these two datasets
bull Need to set the working dataset variable
working_dataset = GSE1378 whole tissue sectionGSE1379
working_dataset = GSE1378 microdissected breast cancer cells
GSE1378
bull Use working dataset GSE1378 as example
Big Data Training for Translational Omics Research
Univariate Logistic Regression
for Each Gene
bull As an example we check the gene HOXB13gb_acc = BC007092 HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc)]
logit_data = dataframe(status = infos_df$statusgene = geno_selected )
fit lt- glm(status~ geno_selecteddata = logit_datafamily = binomial(link = logit))
p lt- predict(fit type=response)
pr lt- prediction(p infos_df$status)
prf lt- performance(pr measure = tpr xmeasure = fpr)
plot(prfmain=paste0(ROC plot of gene gb_acc))
auc lt- performance(pr measure = auc)
auc lt- aucyvalues[[1]]
auc
Big Data Training for Translational Omics Research
Sample Output (gene HOXB13 )
Big Data Training for Translational Omics Research
ROC (auc 0796 gene HOXB13 )

Univariate Logistic Regression (HOXB13/IL17BR)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
# in the log2 scale the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1,
                        gene2 = geno_selected2, ratio = gene_ratio)
# fit the model
fit <- glm(status ~ ratio, data = logit_data, family = binomial(link = "logit"))
summary(fit)

Sample Output (HOXB13/IL17BR)

ROC (AUC = 0.84, HOXB13/IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# in the log2 scale the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size,
                        gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio,
                        gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)

Sample Output (Multivariate)

ROC (AUC = 0.86, Multivariate)

Kaplan-Meier Plot
(gene ratio high/low group, cutoff = -1.2)

Cox Proportional Hazards Model
(gene ratio high/low group, cutoff = -1.2)
library(survival)  # provides Surv() and coxph()
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)

Sample Output (Cox)

Validation: GSE6532
• The link to this dataset:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Number of total markers: 54675
• The genes HOXB13 and IL17RB and the ESTs are included in this dataset
• We use this dataset for validation
• Result: they are not significant on this independent set

Kaplan-Meier Plot
• We use the Kaplan-Meier plot and the log-rank test to check whether survival time
differs significantly between groups (say, the high and low ratio groups)
ratiosurv <- survfit(Surv(time, censor) ~ ratio_group, data = toy_data)
autoplot(ratiosurv, pVal = TRUE, pX = 0.25, pY = 0.25,
         title = "Kaplan-Meier plot of toy example", yLab = "Survival Probability")
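
The log-rank test itself can be run with `survdiff()` from the survival package. The sketch below uses simulated data, because the toy CSV is not reproduced here. The data frame `toy`, the rates, and the censoring window are all assumptions for illustration. The curves are drawn with the base `plot()` method instead of `autoplot()`.

```r
library(survival)

# Simulated two-group survival data (hypothetical values)
set.seed(42)
n <- 100
toy <- data.frame(ratio_group = factor(rep(c("high", "low"), each = n / 2)))
event_time  <- rexp(n, rate = ifelse(toy$ratio_group == "high", 0.2, 0.05))
censor_time <- runif(n, 5, 30)
toy$time   <- pmin(event_time, censor_time)
toy$censor <- as.integer(event_time <= censor_time)   # 1 = event observed, 0 = censored

# Kaplan-Meier estimate per group
km <- survfit(Surv(time, censor) ~ ratio_group, data = toy)
plot(km, col = c("red", "blue"), xlab = "Time", ylab = "Survival Probability")

# Log-rank test comparing the two curves
lr <- survdiff(Surv(time, censor) ~ ratio_group, data = toy)
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_value
```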

[Figure: Kaplan-Meier plot of the toy example]

Cox Proportional Hazards Model
• We use the high/low ratio group to predict survival. Here the response is
the survival time together with the censor information
fitcox <- coxph(Surv(time, censor) ~ group, data = toy_data)
summary(fitcox)
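
As a hedged illustration of how to read a Cox fit, the sketch below fits the same kind of model on simulated data and extracts the hazard ratio with its confidence interval. The data frame `surv_toy` and all simulation settings are hypothetical.

```r
library(survival)

# Simulated data: the "high" group has a higher event rate (hypothetical values)
set.seed(7)
n <- 120
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time  <- rexp(n, rate = ifelse(group == "high", 0.15, 0.05))
censor_time <- runif(n, 5, 40)
surv_toy <- data.frame(
  time   = pmin(event_time, censor_time),
  censor = as.integer(event_time <= censor_time),   # 1 = event, 0 = censored
  group  = group
)

fit <- coxph(Surv(time, censor) ~ group, data = surv_toy)
hr <- exp(coef(fit))      # hazard ratio, high vs. low (low is the reference level)
ci <- exp(confint(fit))   # 95% confidence interval for the hazard ratio
hr
ci
```

A hazard ratio above 1 means the high group has the higher estimated hazard. `summary(fit)` reports the same quantities together with the Wald, likelihood-ratio, and score tests.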
Cox Model Result

Data Downloading, Processing, and Analysis

Outline
• Download data
• Parsing data
• Normalization
• Variance-based filtering (top 25%)
• t-test-based filtering (based on the P-value cutoff)
The above steps are implemented in the "get_DEG_table.R" script

Data Availability
• Microdissected dataset: GSE1378
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1378
• Whole tissue dataset: GSE1379
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379
• The easiest way to download the data is the "getGEO" function from the "GEOquery" package

Use "getGEO" to Download Data
• We have already downloaded the data; you can use the "getGEO" function to load it locally or fetch it online
• Local (loading_method = 'local')
library(GEOquery)
geo_Name <- 'GSE1378'
geodata2 <- getGEO(filename = paste0("geo_data/", geo_Name, "_series_matrix.txt.gz"), GSEMatrix = TRUE)
• Online (loading_method = 'online')
geodata <- getGEO(geo_Name, GSEMatrix = TRUE, destdir = "geo_data")
• You can set the loading_method variable in the get_DEG_table function to "local" or "online" to change how the data is loaded
• Note that the downloaded geno matrix is in log2 scale

Parsing Data
• Extract the geno matrix, pheno table, and feature table
idx <- 1
geno <- assayData(geodata[[idx]])$exprs
pheno <- pData(phenoData(geodata[[idx]]))
feature <- as(featureData(geodata[[idx]]), "data.frame")
• Parse the phenotype table to get the variables Age, Size, DFS, and censor
# each entry has the form "name=value"; keep every second element (the values)
infos_df$Age = as.numeric(unlist(strsplit(infos_df$X9, split = "="))[seq(2, 2 * n, 2)])
infos_df$Size = as.numeric(unlist(strsplit(infos_df$X3, split = "="))[seq(2, 2 * n, 2)])
infos_df$DFS = as.numeric(unlist(strsplit(infos_df$X10, split = "="))[seq(2, 2 * n, 2)])
infos_df$censor = ifelse(infos_df$status == "Status=recur", 1, 0)
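
The parsing idiom above is easier to follow on a tiny example. The strings below are made up, and only the "name=value" layout is taken from the slide. The `sub()` variant is an alternative offered here, not part of the original script. It avoids the index arithmetic and still works if a value is missing.

```r
# Hypothetical characteristic strings in "name=value" form
x <- c("Age=45", "Age=61", "Age=38")
n <- length(x)

# strsplit gives name1, value1, name2, value2, ...; keep positions 2, 4, ..., 2n
parts <- unlist(strsplit(x, split = "="))
age <- as.numeric(parts[seq(2, 2 * n, 2)])

# Alternative: strip everything up to and including the first "="
age2 <- as.numeric(sub("^[^=]*=", "", x))

# Recurrence is coded as censor = 1, non-recurrence as censor = 0
status <- c("Status=recur", "Status=non-recur", "Status=recur")
censor <- ifelse(status == "Status=recur", 1, 0)
```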

Normalization
• Gene wise normalization (subtract the median log2 value)
# median of each column (sample) of the log2 matrix
tmp_gm <- apply(geno, 2, median)
# subtract each column's median from every entry in that column
geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
• Sample wise normalization (divide by the mean value in the original scale)
geno <- apply(geno, c(1, 2), function(x) 2 ^ x)        # back to the original scale
geno <- t(apply(geno, 1, function(x) x / (mean(x))))   # divide each row by its mean
geno <- apply(geno, c(1, 2), function(x) log2(x))      # return to the log2 scale
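
The two normalization steps can be written more compactly with `sweep()` and vectorised arithmetic. The sketch below runs on a small simulated matrix (5 genes by 4 samples, values assumed to be log2) and checks that the compact form agrees with the outer-product form. The object names are hypothetical.

```r
set.seed(3)
geno <- matrix(rnorm(20, mean = 8), nrow = 5, ncol = 4)   # rows = genes, columns = samples

# Step 1: subtract each column's median log2 value
col_med <- apply(geno, 2, median)
geno1 <- sweep(geno, 2, col_med, "-")

# Same result as the outer-product form (ones vector times the row of medians)
geno1_alt <- geno - matrix(1, nrow(geno), 1) %*% matrix(col_med, 1, ncol(geno))

# Step 2: divide each row by its mean on the original scale, then return to log2
lin <- 2^geno1
geno2 <- log2(lin / rowMeans(lin))   # a length-nrow vector recycles down the columns
```

After step 1 every column has median 0, and after step 2 every row has mean 1 on the original scale.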

Variance-Based Filtering
• Calculate the variance of each gene and keep the top 25% most variable genes
(variance above the 75th percentile)
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
feature_var_filtered <- feature[var_filtered_idx, ]
geno_var_filtered <- geno[var_filtered_idx, ]

t-Test-Based Filtering
• For each gene, do a t-test between the recurrence and non-recurrence groups.
The status variable indicates the group information
tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value
• Filter the genes by the P-value cutoff
ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
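
The slide shows only the body of the per-gene loop. A self-contained sketch of the whole filtering pipeline, on simulated data with a planted signal in the first five genes, might look as follows. The matrix sizes, the planted shift, and the cutoff of 0.001 are assumptions for illustration, not the values used in get_DEG_table.R.

```r
set.seed(11)
n_gene <- 200
n_samp <- 40
status <- factor(rep(c("recur", "non-recur"), each = n_samp / 2))
geno <- matrix(rnorm(n_gene * n_samp), n_gene, n_samp)
# Planted signal: genes 1-5 are shifted upward in the recurrence group
geno[1:5, status == "recur"] <- geno[1:5, status == "recur"] + 3

# Variance filter: keep genes above the 75th percentile of variance
var_geno <- apply(geno, 1, var)
keep_var <- var_geno > quantile(var_geno, 0.75)
geno_var <- geno[keep_var, , drop = FALSE]

# Two-sided t-test per remaining gene
pvalue_list <- apply(geno_var, 1, function(g) {
  t.test(g ~ status, alternative = "two.sided")$p.value
})

# P-value filter
cutoff <- 0.001
keep_t <- which(pvalue_list < cutoff)
geno_deg <- geno_var[keep_t, , drop = FALSE]
dim(geno_deg)
```

Using `apply()` avoids the explicit loop index `i`; the `drop = FALSE` arguments keep the result a matrix even if only one gene survives a filter.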

Sample Results
(GSE1378, microdissected; cutoff 0.0011)

Sample Results
(GSE1379, whole tissue dataset; cutoff 0.0011)

Statistical Modeling (Examples)

Outline
• Select the overlapped genes between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and dendrogram
• Univariate logistic regression for the selected genes and the two-gene ratio predictor
• Multivariate logistic regression (size and the other two potential predictors)
• Survival analysis part 1: Kaplan-Meier plot
• Survival analysis part 2: Cox proportional hazards model

Overlapped Genes
• In the preprocessing step we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379
• We use the genes that overlap between these two DEG tables for the subsequent analysis
• GSE1378: micro-dissected breast cancer cells (LCM)
• GSE1379: whole tissue section
• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111), and AI240933 (EST)
• We will study the prognostic value of these markers
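
Selecting the overlap between two DEG tables amounts to intersecting their identifier columns. A minimal sketch, assuming each table has a `GB_ACC` column as the feature table does; the two tables here are made up and "X00001" and "X00002" are placeholder IDs.

```r
# Made-up DEG tables; only the column name GB_ACC follows the slides
deg_1378 <- data.frame(GB_ACC = c("BC007092", "X00001", "AF208111"),
                       stringsAsFactors = FALSE)
deg_1379 <- data.frame(GB_ACC = c("AF208111", "BC007092", "X00002"),
                       stringsAsFactors = FALSE)

# Accessions present in both tables, in the order of the first table
overlap <- intersect(deg_1378$GB_ACC, deg_1379$GB_ACC)

# Rows of the first table restricted to the overlap
deg_overlap <- deg_1378[deg_1378$GB_ACC %in% overlap, , drop = FALSE]
overlap
```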

Heatmap and Dendrogram
• We use the heatmap and dendrogram to visually check the relationship
(correlation) among genes or samples
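
The slides do not show the plotting code, so the following is only a sketch of one common way to do this in base R: cluster genes on correlation distance and pass the resulting dendrogram to `heatmap()`. The simulated matrix, the average-linkage choice, and all names are assumptions.

```r
set.seed(5)
expr <- matrix(rnorm(6 * 10), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("s", 1:10)))
expr[2, ] <- expr[1, ] + rnorm(10, sd = 0.1)   # gene2 closely tracks gene1

# Correlation distance between genes: highly correlated genes are close
d <- as.dist(1 - cor(t(expr)))
hc <- hclust(d, method = "average")

plot(hc)                                                  # dendrogram of genes
heatmap(expr, Rowv = as.dendrogram(hc), scale = "row")    # heatmap with the same row ordering
```

Because gene1 and gene2 are almost perfectly correlated, they are the first pair joined in the dendrogram and appear next to each other in the heatmap.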

Heatmap (microdissected, GSE1378)
consistent with the paper

Heatmap (whole tissue section, GSE1379)

Model Set 1
• Univariate logistic regression for each gene
– Response variable: recur/non-recur status
– Predictor: one of the overlapped genes, HOXB13, IL17BR (AF208111), or AI240933 (EST)

Model Set 2
• Univariate logistic regression for the ratio of genes
– Response variable: recur/non-recur status
– Predictor: HOXB13/IL17BR ratio
Big Data Training for Translational Omics Research
Cox Proportional Odds Model
bull We use highlow ratio group to predict the
survival probability Here the response is
the survival time and the censor
information
fitcox lt- coxph(Surv(timecensor) ~ group data = toy_data)
summary(fitcox)
Big Data Training for Translational Omics Research
Cox Model Result
Big Data Training for Translational Omics Research
Data Downloading Processing
and Analysis
Big Data Training for Translational Omics Research
Outline
bull Download data
bull Parsing data
bull Normalization
bull Variance based filtering (top 25)
bull T test based filtering(based on the P-value cutoff)
The above steps are implemented in
ldquoget_DEG_tableRrdquo script
Big Data Training for Translational Omics Research
Data Availability
bull Microdissected dataset GSE1378
httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE
1378
bull Whole tissue dataset GSE1379
httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE
1379
bull The easiest way to download data is using ldquogetGEOrdquo
function from ldquoGEOqueryrdquo package
Big Data Training for Translational Omics Research
Use ldquogetGEOrdquo to Download Databull We have downloaded the data you can use
ldquogetGEOrdquo function to get data locally or online
bull Local (loading_method = lsquolocalrsquo)geo_Name lt- lsquoGSE1378rsquo
geodata2 lt-getGEO(filename paste0(geo_datageo_Name_series_matrixtxtgz) GSEMatrix = TRUE)
bull Online (loading_method = lsquoonlinersquo)geodata lt- getGEO(geo_Name GSEMatrix = TRUEdestdir = geo_data)
bull You can set loading_method variable in the get_DEG_table function to rdquolocalrdquo or ldquoonlinerdquo to change the way of downloading data
bull Note that the downloaded geno matrix is in log2scale
Big Data Training for Translational Omics Research
Parsing Data
bull Extract the geno matrix pheno table and
feature tableidx lt- 1 geno lt- assayData(geodata[[idx]])$exprs
pheno lt- pData(phenoData(geodata[[idx]]))
feature lt- as(featureData(geodata[[idx]]) dataframe)
bull Parsing phenotype table to get variable
Age Size DFS censorinfos_df$Age = asnumeric(unlist(strsplit(infos_df$X9 split = =))[seq(2 2 n 2)])
infos_df$Size = asnumeric(unlist(strsplit(infos_df$X3 split = =))[seq(2 2 n 2)])
infos_df$DFS = asnumeric(unlist(strsplit(infos_df$X10 split = =))[seq(2 2 n 2)])
infos_df$censor = ifelse(infos_df$status == Status=recur 1 0)
Big Data Training for Translational Omics Research
Normalization
bull Gene wise normalization (subtract the
median log2 value)tmp_gm lt- apply(geno 2 median)
geno lt- geno - matrix(rep(1 numOfGene) numOfGene 1)
matrix(tmp_gm 1 n)
bull Sample wise normalization (divided
by mean value in original scale)geno lt- apply(geno c(1 2) function(x) 2 ^ x )
geno lt- t(apply(geno 1 function(x) x (mean(x)) ))
geno lt- apply(geno c(1 2) function(x) log2(x) )
Big Data Training for Translational Omics Research
Variance Based Filtering
bull Calculate the variance for each gene and
choose the top 25 variance based filtering (75th percentile)
var_geno lt- apply(geno 1 var)
var_filtered_idx lt- var_geno gt quantile(var_geno 075)
feature_var_filtered lt- feature[var_filtered_idx]
geno_var_filtered lt- geno[var_filtered_idx]
Big Data Training for Translational Omics Research
T test Based Filtering
bull For each gene do T test between the
recurrence and non-recurrence group
The status variable indicates the group
informationtmp_test lt- ttest(gene_express ~ status data = sdata alternative =
twosided)
pvalue_list[i] lt- tmp_test$pvalue
bull Fitering the gene by the P-value cutoffttest_filtered_idx lt- which(pvalue_list lt cutoff)
feature_ttest_filtered lt- feature_var_filtered[ttest_filtered_idx]
geno_ttest_filtered lt- geno_var_filtered[ttest_filtered_idx]
Big Data Training for Translational Omics Research
Sample Results
(GSE1378microdissected 00011 cutoff)
Big Data Training for Translational Omics Research
Sample Results
(GSE1379 whole tissue dataset cutoff 00011)
Big Data Training for Translational Omics Research
Statistical Modeling(examples)
Big Data Training for Translational Omics Research
Outline
bull Select overlapped genes between GSE1378 and
GSE1379 for subsequent analysis
bull Heatmap and Dendrogram
bull Univariate logistic regression for selected genes and
two-gene ratio predictor
bull Multivariate logistic regression (size and the other two
potential predictors)
bull Survival analysis part 1 Kaplan-Meier plot
bull Survival analysis part 2 Cox proportional odds model
Big Data Training for Translational Omics Research
Overlapped Genes
bull In the prepossessing step we obtained two DEG tables
for the datasets GSE1378 and GSE1379
bull We used the overlapped genes in this two DEG tables
for the subsequent analysis
bull GSE1378 Micro-dissected breast cancer cell (LCM)
bull GSE1379 Whole tissue section
bull The overlapped genes are HOXB13 (identified twice as
AI208111 and BC007092) IL17BR (AF2080111) and
AI240933 (EST)
bull We will study the prognostic value of these markers
Big Data Training for Translational Omics Research
Heatmap and Dendrogram
bull We use Heatmap and Dendrogram to
Visually check the relationship
(correlation) among genes or samples
Big Data Training for Translational Omics Research
Heatmap(microdissectedGSE1378)
consistent with the paper
Big Data Training for Translational Omics Research
Heatmap(whole section tissue GSE 1379)
Big Data Training for Translational Omics Research
Model Set 1
bull Univariate logistic regression for each
gene
ndash Response variable recurnon-recur
status
ndash Predictors one of the overlapped
genes HOXB13 IL17BR(AF2080111)
AI240933(EST)
Big Data Training for Translational Omics Research
Model Set 2
bull Univariate logistic regression for
ratio of genes
ndash Response variable recurnon-recur
status
ndash Predictors HOXB13IL17BR
Big Data Training for Translational Omics Research
Model Set 3
bull Multivariate logistic regression
ndash Response variable recurnon-
recur
ndash Predictors tumor size
HOXB13IL17BR PGR and ERBB2
Big Data Training for Translational Omics Research
Model Set 4
bull Survival model
ndash Response variable DFS (disease free
survival time) censor
ndash Predictor use ldquo-interceptbetardquo from
logistic regression as the cutoff to
divide the sample into two groups high
ratio group and low ratio group
Big Data Training for Translational Omics Research
Important Note
bull Please remember there are two datasets GSE1378 and GSE1379
bull Can fit the same sets of model on these two datasets
bull Need to set the working dataset variable
working_dataset = GSE1378 whole tissue sectionGSE1379
working_dataset = GSE1378 microdissected breast cancer cells
GSE1378
bull Use working dataset GSE1378 as example
Big Data Training for Translational Omics Research
Univariate Logistic Regression
for Each Gene
bull As an example we check the gene HOXB13gb_acc = BC007092 HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc)]
logit_data = dataframe(status = infos_df$statusgene = geno_selected )
fit lt- glm(status~ geno_selecteddata = logit_datafamily = binomial(link = logit))
p lt- predict(fit type=response)
pr lt- prediction(p infos_df$status)
prf lt- performance(pr measure = tpr xmeasure = fpr)
plot(prfmain=paste0(ROC plot of gene gb_acc))
auc lt- performance(pr measure = auc)
auc lt- aucyvalues[[1]]
auc
Big Data Training for Translational Omics Research
Sample Output (gene HOXB13 )
Big Data Training for Translational Omics Research
ROC (auc 0796 gene HOXB13 )
Big Data Training for Translational Omics Research
Univariate Logistic Regression (HOXB13IL17BR)
gb_acc1 = BC007092 HOXB13
gb_acc2 = AF208111 IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1)]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2)]
in the log2 scale the ratio is the difference
gene_ratio = geno_selected1-geno_selected2
logit_data = dataframe(status = infos_df$statusgene1 = geno_selected1 gene2 =
geno_selected2ratio =gene_ratio)
fit the model
fit lt- glm(status~ gene_ratiodata = logit_datafamily = binomial(link = logit))
summary(fit)
Big Data Training for Translational Omics Research
Sample Output(HOXB13IL17BR)
Big Data Training for Translational Omics Research
ROC (auc=084 HOXB13IL17BR)
Big Data Training for Translational Omics Research
Multivariate Logistic Regression(tumor size gene ratio PGR ERBB2)
gb_acc1 = BC007092 HOXB13
gb_acc2 = AF208111 IL17BR
gene_name3 = PGR_3UTR1 PGR
gene_name4 = BF108852 ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1)]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2)]
geno_selected3 = geno[which(feature$GeneName == gene_name3)]
geno_selected4 = geno[which(feature$GeneName == gene_name4)]
in the log2 scale the ratio is the difference
gene_ratio = geno_selected1-geno_selected2
logit_data = dataframe(status = infos_df$statussize = infos_df$Sizegene1 = geno_selected1 gene2 =
geno_selected2ratio =gene_ratiogene3= geno_selected3gene4= geno_selected4)
fit the multinvariate logistic regression
fit lt- glm(status~ gene_ratio+size+gene3+gene4data = logit_datafamily = binomial(link = logit))
summary(fit)
Big Data Training for Translational Omics Research
Sample Output (Multivariate)
Big Data Training for Translational Omics Research
ROC (auc = 086 Multivariate )
Big Data Training for Translational Omics Research
Kaplan-Meier Plot
(gene ratio highlow group cutoff = -12)
Big Data Training for Translational Omics Research
Cox Proportional Hazards Model
(gene ratio high/low group, cutoff = -12)
library(survival)  # provides Surv() and coxph()
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)
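The quantity usually read off a Cox fit is the hazard ratio, exp(coef), with its confidence interval. A minimal sketch on simulated data follows; the rates, sample size, and group labels are assumptions for illustration only.

```r
library(survival)

set.seed(2)
n <- 80
# Simulated (hypothetical) two-group survival data
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time <- rexp(n, rate = ifelse(group == "high", 0.20, 0.05))
censor_time <- runif(n, min = 5, max = 15)
surv_data <- data.frame(
  time   = pmin(event_time, censor_time),
  censor = as.numeric(event_time <= censor_time),  # 1 = event observed
  group  = group
)

fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)

# Hazard ratio of the "high" group relative to the "low" (reference) group
hr <- exp(coef(fitcox))
hr_ci <- exp(confint(fitcox))  # 95% confidence interval on the ratio scale
hr
hr_ci
```

A hazard ratio above 1 would indicate a higher recurrence hazard in the "high" group than in the reference group.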
Sample Output (Cox)
Validation: GSE6532
• Link to this dataset:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Total number of markers: 54675
• The genes HOXB13 and IL17RB and the ESTs are included in this dataset
• We use this dataset for validation
• Result: the markers are not significant on this independent set
Cox Model Result
Data Downloading, Processing, and Analysis
Outline
• Download data
• Parsing data
• Normalization
• Variance based filtering (top 25%)
• T test based filtering (based on the P-value cutoff)
The above steps are implemented in the "get_DEG_table.R" script
Data Availability
• Microdissected dataset GSE1378:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1378
• Whole tissue dataset GSE1379:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379
• The easiest way to download the data is the "getGEO" function from the "GEOquery" package
Use "getGEO" to Download Data
• We have already downloaded the data; you can use the "getGEO" function to load it locally or fetch it online
library(GEOquery)  # provides getGEO()
• Local (loading_method = 'local')
geo_Name <- 'GSE1378'
geodata2 <- getGEO(filename = paste0("geo_data/", geo_Name, "_series_matrix.txt.gz"), GSEMatrix = TRUE)
• Online (loading_method = 'online')
geodata <- getGEO(geo_Name, GSEMatrix = TRUE, destdir = "geo_data")
• Set the loading_method variable in the get_DEG_table function to "local" or "online" to choose how the data is loaded
• Note that the downloaded geno matrix is in the log2 scale
Parsing Data
• Extract the geno matrix, pheno table, and feature table
idx <- 1
geno <- assayData(geodata[[idx]])$exprs
pheno <- pData(phenoData(geodata[[idx]]))
feature <- as(featureData(geodata[[idx]]), "data.frame")
• Parse the phenotype table to get the variables Age, Size, DFS, and censor
# each entry looks like "name=value"; after splitting on "=", the values sit at the even positions
infos_df$Age = as.numeric(unlist(strsplit(infos_df$X9, split = "="))[seq(2, 2 * n, 2)])
infos_df$Size = as.numeric(unlist(strsplit(infos_df$X3, split = "="))[seq(2, 2 * n, 2)])
infos_df$DFS = as.numeric(unlist(strsplit(infos_df$X10, split = "="))[seq(2, 2 * n, 2)])
infos_df$censor = ifelse(infos_df$status == "Status=recur", 1, 0)
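The strsplit pattern above can be tried on a toy vector. In this sketch the strings in x9 are hypothetical stand-ins for entries of the phenotype table, whose exact format is assumed here to be "key=value".

```r
# Hypothetical characteristic strings in "key=value" form
x9 <- c("Age=45", "Age=62", "Age=58")
n <- length(x9)

# Splitting on "=" gives key, value, key, value, ...; the values sit at even positions
parts <- unlist(strsplit(x9, split = "="))
age <- as.numeric(parts[seq(2, 2 * n, 2)])
age

# An alternative that does not depend on n: strip everything up to the first "="
age_alt <- as.numeric(sub("^[^=]*=", "", x9))
identical(age, age_alt)
```

The even-position approach assumes every entry contains exactly one "="; the sub() variant is a little more tolerant of irregular entries.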
Normalization
• Gene wise normalization (subtract the median log2 value)
tmp_gm <- apply(geno, 2, median)  # median of each column of geno
geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
• Sample wise normalization (divide by the mean value in the original scale)
geno <- apply(geno, c(1, 2), function(x) 2 ^ x)  # back to the original scale
geno <- t(apply(geno, 1, function(x) x / (mean(x))))  # each row divided by its own mean
geno <- apply(geno, c(1, 2), function(x) log2(x))  # return to the log2 scale
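The same two steps can be written with sweep(), which avoids building the matrix of ones. This is a sketch on a small simulated matrix; the dimensions and values are assumptions, and step1 and step2 are illustrative names.

```r
set.seed(3)
numOfGene <- 5
n <- 4
# Simulated log2 expression matrix (rows and columns laid out as in geno)
geno <- matrix(rnorm(numOfGene * n, mean = 8), nrow = numOfGene, ncol = n)

# Step 1: subtract each column's median
col_med <- apply(geno, 2, median)
step1 <- sweep(geno, 2, col_med, FUN = "-")

# Same result as the matrix-product form used on the slide
step1_matrix_form <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(col_med, 1, n)
all.equal(step1, step1_matrix_form)

# Step 2: divide each row by its mean in the original (2^x) scale, then go back to log2
raw <- 2^step1
step2 <- log2(sweep(raw, 1, rowMeans(raw), FUN = "/"))
```

After step 1 every column has median 0 in the log2 scale; after step 2 every row has mean 1 in the original scale.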
Variance Based Filtering
• Calculate the variance of each gene and keep the genes in the top 25% by variance (above the 75th percentile)
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
feature_var_filtered <- feature[var_filtered_idx, ]
geno_var_filtered <- geno[var_filtered_idx, ]
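A quick way to see what the filter keeps is to run it on simulated data. In the sketch below the matrix size, the gene and sample names, and the one-column feature table are assumptions for illustration.

```r
set.seed(4)
# Simulated expression matrix: 200 genes (rows) x 10 samples (columns)
geno <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10,
               dimnames = list(paste0("gene", 1:200), paste0("s", 1:10)))
feature <- data.frame(ID = rownames(geno), stringsAsFactors = FALSE)

# Per-gene variance and the 75th-percentile threshold
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)

# drop = FALSE keeps the data frame / matrix structure even for a single column or row
feature_var_filtered <- feature[var_filtered_idx, , drop = FALSE]
geno_var_filtered <- geno[var_filtered_idx, , drop = FALSE]
dim(geno_var_filtered)
```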
T test Based Filtering
• For each gene, do a T test between the recurrence and non-recurrence groups. The status variable indicates the group information
tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value
• Filter the genes by the P-value cutoff
ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
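The slide shows only the body of the per-gene loop. One way the full loop might look is sketched below on simulated data; the group sizes, the shifted first gene, and the cutoff value are assumptions chosen for illustration.

```r
set.seed(5)
n_gene <- 50
status <- factor(rep(c("non-recur", "recur"), each = 10))
# Simulated filtered expression matrix: genes in rows, 20 samples in columns
geno_var_filtered <- matrix(rnorm(n_gene * 20), nrow = n_gene)
# Shift the first gene in the recurrence group so that it differs between groups
geno_var_filtered[1, status == "recur"] <- geno_var_filtered[1, status == "recur"] + 3

pvalue_list <- numeric(n_gene)
for (i in seq_len(n_gene)) {
  # One small data frame per gene: expression values plus group labels
  sdata <- data.frame(gene_express = geno_var_filtered[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

cutoff <- 0.001  # hypothetical cutoff
ttest_filtered_idx <- which(pvalue_list < cutoff)
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, , drop = FALSE]
```

Because one test is run per gene, the cutoff is applied to unadjusted P-values here, as on the slide.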
Sample Results
(GSE1378, microdissected dataset, cutoff 0.0011)
Sample Results
(GSE1379, whole tissue dataset, cutoff 0.0011)
Statistical Modeling (examples)
Outline
• Select overlapped genes between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and dendrogram
• Univariate logistic regression for selected genes and the two-gene ratio predictor
• Multivariate logistic regression (size and the other two potential predictors)
• Survival analysis part 1: Kaplan-Meier plot
• Survival analysis part 2: Cox proportional hazards model
Overlapped Genes
• In the preprocessing step, we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379
• We used the genes that overlap between these two DEG tables for the subsequent analysis
• GSE1378: micro-dissected breast cancer cells (LCM)
• GSE1379: whole tissue section
• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111), and AI240933 (EST)
• We will study the prognostic value of these markers
Heatmap and Dendrogram
• We use the heatmap and dendrogram to visually check the relationship (correlation) among genes or samples
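A base-R sketch of both plots follows, run on a small simulated matrix. The gene and sample names, the correlation-based distance, and the average-linkage choice are assumptions for illustration; the slides do not state which settings were used.

```r
set.seed(8)
# Simulated log2 expression: 6 genes x 8 samples (hypothetical names)
geno_small <- matrix(rnorm(48), nrow = 6,
                     dimnames = list(paste0("gene", 1:6), paste0("sample", 1:8)))
# Make gene2 track gene1 closely so the two are highly correlated
geno_small["gene2", ] <- geno_small["gene1", ] + rnorm(8, sd = 0.1)

# Correlation-based distance between genes, then hierarchical clustering
gene_dist <- as.dist(1 - cor(t(geno_small)))
gene_tree <- hclust(gene_dist, method = "average")
plot(gene_tree)  # dendrogram of the genes

# Heatmap with row and column dendrograms (heatmap() builds its own trees,
# by default from Euclidean distances); rows are scaled before colouring
heatmap(geno_small, scale = "row")
```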
Heatmap (microdissected, GSE1378)
Consistent with the paper
Heatmap (whole tissue section, GSE1379)
Model Set 1
• Univariate logistic regression for each gene
- Response variable: recur/non-recur status
- Predictors: one of the overlapped genes HOXB13, IL17BR (AF208111), AI240933 (EST)
Model Set 2
• Univariate logistic regression for the ratio of genes
- Response variable: recur/non-recur status
- Predictor: HOXB13/IL17BR
Model Set 3
• Multivariate logistic regression
- Response variable: recur/non-recur status
- Predictors: tumor size, HOXB13/IL17BR, PGR, and ERBB2
Model Set 4
• Survival model
- Response variable: DFS (disease-free survival time) and censor
- Predictor: use "-intercept/beta" from the logistic regression as the cutoff to divide the samples into two groups, a high ratio group and a low ratio group
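The cutoff -intercept/beta is the ratio value at which a univariate logistic model predicts a recurrence probability of 0.5, because the log-odds intercept + beta * x equals zero there. The sketch below checks this on simulated data; the coefficients used to simulate and the sample size are assumptions for illustration.

```r
set.seed(6)
n <- 100
gene_ratio <- rnorm(n)
# Simulated status: a higher ratio gives a higher recurrence probability (hypothetical coefficients)
status <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * gene_ratio))

fit <- glm(status ~ gene_ratio, family = binomial(link = "logit"))
b <- coef(fit)

# Cutoff: the ratio at which intercept + beta * x = 0
cutoff <- unname(-b["(Intercept)"] / b["gene_ratio"])

# At the cutoff the fitted recurrence probability is 0.5
p_at_cutoff <- predict(fit, newdata = data.frame(gene_ratio = cutoff), type = "response")
p_at_cutoff

# Split the samples into the two groups used by the survival models
group <- ifelse(gene_ratio > cutoff, "high", "low")
table(group)
```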
Important Note
• Please remember there are two datasets: GSE1378 and GSE1379
• The same sets of models can be fitted on both datasets
• You need to set the working dataset variable
working_dataset = "GSE1378"  # "GSE1379": whole tissue section; "GSE1378": microdissected breast cancer cells
• We use the working dataset GSE1378 as the example
Univariate Logistic Regression for Each Gene
• As an example, we check the gene HOXB13
library(ROCR)  # provides prediction() and performance()
gb_acc = "BC007092"  # HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
logit_data = data.frame(status = infos_df$status, gene = geno_selected)
fit <- glm(status ~ geno_selected, data = logit_data, family = binomial(link = "logit"))
# predicted recurrence probabilities, then the ROC curve and its AUC
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
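The AUC reported above can be read as the probability that a randomly chosen recurrence case receives a higher predicted score than a randomly chosen non-recurrence case. The base-R sketch below computes it two ways on simulated data; the simulated predictor and its effect size are assumptions, and this is a cross-check rather than the code used on the slides.

```r
set.seed(7)
n <- 100
gene <- rnorm(n)
# Simulated binary status driven by the gene (hypothetical effect size)
status <- rbinom(n, size = 1, prob = plogis(1.2 * gene))

fit <- glm(status ~ gene, family = binomial(link = "logit"))
p <- predict(fit, type = "response")

# AUC from the rank-sum (Mann-Whitney) statistic
n1 <- sum(status == 1)
n0 <- sum(status == 0)
r <- rank(p)
auc_rank <- (sum(r[status == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Same quantity by direct comparison of every case/control pair (ties count one half)
cmp <- outer(p[status == 1], p[status == 0],
             FUN = function(a, b) (a > b) + 0.5 * (a == b))
auc_pairs <- mean(cmp)

c(auc_rank = auc_rank, auc_pairs = auc_pairs)
```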
Data Availability
bull Microdissected dataset GSE1378
httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE
1378
bull Whole tissue dataset GSE1379
httpwwwncbinlmnihgovgeoqueryacccgiacc=GSE
1379
bull The easiest way to download data is using ldquogetGEOrdquo
function from ldquoGEOqueryrdquo package
Big Data Training for Translational Omics Research
Use ldquogetGEOrdquo to Download Databull We have downloaded the data you can use
ldquogetGEOrdquo function to get data locally or online
bull Local (loading_method = lsquolocalrsquo)geo_Name lt- lsquoGSE1378rsquo
geodata2 lt-getGEO(filename paste0(geo_datageo_Name_series_matrixtxtgz) GSEMatrix = TRUE)
bull Online (loading_method = lsquoonlinersquo)geodata lt- getGEO(geo_Name GSEMatrix = TRUEdestdir = geo_data)
bull You can set loading_method variable in the get_DEG_table function to rdquolocalrdquo or ldquoonlinerdquo to change the way of downloading data
bull Note that the downloaded geno matrix is in log2scale
Big Data Training for Translational Omics Research
Parsing Data
bull Extract the geno matrix pheno table and
feature tableidx lt- 1 geno lt- assayData(geodata[[idx]])$exprs
pheno lt- pData(phenoData(geodata[[idx]]))
feature lt- as(featureData(geodata[[idx]]) dataframe)
bull Parsing phenotype table to get variable
Age Size DFS censorinfos_df$Age = asnumeric(unlist(strsplit(infos_df$X9 split = =))[seq(2 2 n 2)])
infos_df$Size = asnumeric(unlist(strsplit(infos_df$X3 split = =))[seq(2 2 n 2)])
infos_df$DFS = asnumeric(unlist(strsplit(infos_df$X10 split = =))[seq(2 2 n 2)])
infos_df$censor = ifelse(infos_df$status == Status=recur 1 0)
Big Data Training for Translational Omics Research
Normalization
bull Gene wise normalization (subtract the
median log2 value)tmp_gm lt- apply(geno 2 median)
geno lt- geno - matrix(rep(1 numOfGene) numOfGene 1)
matrix(tmp_gm 1 n)
bull Sample wise normalization (divided
by mean value in original scale)geno lt- apply(geno c(1 2) function(x) 2 ^ x )
geno lt- t(apply(geno 1 function(x) x (mean(x)) ))
geno lt- apply(geno c(1 2) function(x) log2(x) )
Big Data Training for Translational Omics Research
Variance Based Filtering
bull Calculate the variance for each gene and
choose the top 25 variance based filtering (75th percentile)
var_geno lt- apply(geno 1 var)
var_filtered_idx lt- var_geno gt quantile(var_geno 075)
feature_var_filtered lt- feature[var_filtered_idx]
geno_var_filtered lt- geno[var_filtered_idx]

T-Test-Based Filtering
• For each gene, run a t-test between the recurrence and non-recurrence groups. The status variable indicates the group.
tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value
• Filter the genes by the p-value cutoff
ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
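The slide shows only the body of the per-gene loop. A sketch of the full loop on simulated data follows; the toy matrix, the group sizes, and the cutoff value are assumptions made for illustration.

```r
set.seed(2)
# Toy data: 6 genes x 20 samples; only gene 1 differs between the groups
status <- factor(rep(c("non-recur", "recur"), each = 10))
toy_geno <- matrix(rnorm(6 * 20), nrow = 6)
toy_geno[1, status == "recur"] <- toy_geno[1, status == "recur"] + 5

cutoff <- 0.001  # illustrative threshold
pvalue_list <- numeric(nrow(toy_geno))
for (i in seq_len(nrow(toy_geno))) {
  # One small data frame per gene: expression values plus group labels
  sdata <- data.frame(gene_express = toy_geno[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata,
                     alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

# Indices of the genes whose p-value falls below the cutoff
ttest_filtered_idx <- which(pvalue_list < cutoff)
ttest_filtered_idx
```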

Sample Results
(GSE1378, microdissected; cutoff 0.0011)

Sample Results
(GSE1379, whole tissue dataset; cutoff 0.0011)

Statistical Modeling (examples)

Outline
• Select the overlapping genes between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and dendrogram
• Univariate logistic regression for the selected genes and the two-gene ratio predictor
• Multivariate logistic regression (size and the other two potential predictors)
• Survival analysis part 1: Kaplan-Meier plot
• Survival analysis part 2: Cox proportional hazards model

Overlapping Genes
• In the preprocessing step we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379
• We use the genes that appear in both DEG tables for the subsequent analysis
• GSE1378: micro-dissected breast cancer cells (LCM)
• GSE1379: whole tissue sections
• The overlapping genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111), and AI240933 (EST)
• We will study the prognostic value of these markers

Heatmap and Dendrogram
• We use a heatmap and a dendrogram to visually check the relationship (correlation) among genes or samples
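A sketch of a correlation-based dendrogram and heatmap using only base R, on simulated data. The choice of 1 - correlation as the distance and of average linkage is an assumption for illustration; the slides do not state which settings were used.

```r
set.seed(3)
# Toy matrix: 6 genes x 8 samples; genes 1-3 share one pattern, genes 4-6 another
base_a <- rnorm(8)
base_b <- rnorm(8)
toy_geno <- rbind(
  t(replicate(3, base_a + rnorm(8, sd = 0.1))),
  t(replicate(3, base_b + rnorm(8, sd = 0.1)))
)
rownames(toy_geno) <- paste0("gene", 1:6)

# Correlation-based distance between genes, then hierarchical clustering
gene_dist <- as.dist(1 - cor(t(toy_geno)))
gene_tree <- hclust(gene_dist, method = "average")
clusters <- cutree(gene_tree, k = 2)

# Dendrogram alone, then a heatmap whose rows follow the same tree
plot(gene_tree, main = "Gene dendrogram")
heatmap(toy_geno, Rowv = as.dendrogram(gene_tree), scale = "row")
```

Genes with similar profiles across samples end up on neighboring branches, which is the visual check described in the bullet.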

Heatmap (microdissected, GSE1378)
Consistent with the paper

Heatmap (whole tissue section, GSE1379)

Model Set 1
• Univariate logistic regression for each gene
– Response variable: recur/non-recur status
– Predictors: one of the overlapping genes: HOXB13, IL17BR (AF208111), AI240933 (EST)

Model Set 2
• Univariate logistic regression for the ratio of genes
– Response variable: recur/non-recur status
– Predictors: HOXB13/IL17BR

Model Set 3
• Multivariate logistic regression
– Response variable: recur/non-recur
– Predictors: tumor size, HOXB13/IL17BR, PGR, and ERBB2

Model Set 4
• Survival model
– Response variable: DFS (disease-free survival time) and censor
– Predictor: use "-intercept/beta" from the logistic regression as the cutoff to divide the samples into two groups, a high-ratio group and a low-ratio group
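The cutoff -intercept/beta is the ratio value at which the fitted log-odds, intercept + beta * ratio, equals zero, so the predicted recurrence probability there is 0.5. The sketch below checks this on simulated data; the data and all variable names are hypothetical.

```r
set.seed(4)
# Simulated log2 ratio and a binary recurrence status that depends on it
ratio <- rnorm(200)
status <- rbinom(200, size = 1, prob = plogis(0.5 + 1.5 * ratio))

fit <- glm(status ~ ratio, family = binomial(link = "logit"))
beta <- coef(fit)

# Log-odds = intercept + slope * ratio is zero at this value
cutoff <- unname(-beta["(Intercept)"] / beta["ratio"])

# The fitted recurrence probability at the cutoff is 0.5
p_at_cutoff <- predict(fit, newdata = data.frame(ratio = cutoff),
                       type = "response")

# Split the samples into high-ratio and low-ratio groups
group <- factor(ifelse(ratio > cutoff, "high", "low"))
table(group)
```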

Important Note
• Remember that there are two datasets, GSE1378 and GSE1379
• The same sets of models can be fitted on both datasets
• Set the working_dataset variable to one of:
working_dataset = "GSE1379"  # whole tissue section: GSE1379
working_dataset = "GSE1378"  # microdissected breast cancer cells: GSE1378
• We use the working dataset GSE1378 as the example

Univariate Logistic Regression for Each Gene
• As an example, we check the gene HOXB13
library(ROCR)  # provides prediction() and performance()
gb_acc = "BC007092"  # HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
# glm() needs a factor (or 0/1) response
logit_data = data.frame(status = factor(infos_df$status), gene = geno_selected)
fit <- glm(status ~ gene, data = logit_data, family = binomial(link = "logit"))
# fitted recurrence probabilities, then the ROC curve and its AUC
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
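The AUC equals the probability that a randomly chosen recurrence case receives a higher fitted probability than a randomly chosen non-recurrence case. The sketch below computes that quantity directly in base R on simulated data. It is swapped in for the ROCR calls so that it runs without extra packages, and all names in it are illustrative.

```r
set.seed(5)
# Simulated expression of one gene and a status that depends on it
gene <- rnorm(100)
status <- rbinom(100, size = 1, prob = plogis(1.2 * gene))

fit <- glm(status ~ gene, family = binomial(link = "logit"))
p <- predict(fit, type = "response")

# Rank-based (Mann-Whitney) form of the AUC
n_pos <- sum(status == 1)
n_neg <- sum(status == 0)
auc_rank <- (sum(rank(p)[status == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

# Pairwise definition: fraction of (case, control) pairs ranked correctly,
# counting ties as one half
pos <- p[status == 1]
neg <- p[status == 0]
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

c(auc_rank, auc_pairs)
```

Both forms give the same number, which is why an AUC of 0.5 corresponds to a predictor no better than chance.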

Sample Output (gene HOXB13)

ROC (AUC = 0.796, gene HOXB13)

Univariate Logistic Regression (HOXB13/IL17BR)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = factor(infos_df$status), gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)
# fit the model
fit <- glm(status ~ ratio, data = logit_data, family = binomial(link = "logit"))
summary(fit)

Sample Output (HOXB13/IL17BR)

ROC (AUC = 0.84, HOXB13/IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = factor(infos_df$status), size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)

Sample Output (Multivariate)

ROC (AUC = 0.86, Multivariate)

Kaplan-Meier Plot
(gene ratio high/low groups, cutoff = -1.2)
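The slide shows only the resulting figure. A sketch of how such a plot and the accompanying log-rank test could be produced with the survival package follows, on simulated data. The hazard rates, follow-up times, and colors are made up; censor = 1 marks an observed recurrence, as in the rest of this session.

```r
library(survival)

set.seed(6)
# Simulated disease-free survival: the high-ratio group has the higher hazard
n <- 60
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time <- rexp(n, rate = ifelse(group == "high", 0.3, 0.1))
follow_up <- runif(n, min = 2, max = 15)       # administrative censoring times
time <- pmin(event_time, follow_up)
censor <- as.numeric(event_time <= follow_up)  # 1 = recurrence observed, 0 = censored
surv_data <- data.frame(time, censor, group)

# Kaplan-Meier curves, one per group (strata follow the factor levels: low, high)
km <- survfit(Surv(time, censor) ~ group, data = surv_data)
plot(km, col = c("blue", "red"), xlab = "Time", ylab = "Disease-free survival")
legend("topright", legend = levels(surv_data$group),
       col = c("blue", "red"), lty = 1)

# Log-rank test for a difference between the two curves
lr <- survdiff(Surv(time, censor) ~ group, data = surv_data)
lr$chisq
```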

Cox Proportional Hazards Model
(gene ratio high/low groups, cutoff = -1.2)
library(survival)  # provides coxph() and Surv()
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)
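coxph fits the proportional hazards model h(t | group) = h0(t) * exp(beta * group), so exp(beta) is the hazard ratio between the two groups. The sketch below fits the model on simulated data and extracts that ratio with its confidence interval; the simulated rates and all names other than coxph, Surv, coef, and confint are assumptions.

```r
library(survival)

set.seed(7)
# Simulated data: the high-ratio group recurs at three times the rate of the low group
n <- 80
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time <- rexp(n, rate = ifelse(group == "high", 0.3, 0.1))
follow_up <- runif(n, min = 2, max = 15)
surv_data <- data.frame(
  time = pmin(event_time, follow_up),
  censor = as.numeric(event_time <= follow_up),  # 1 = recurrence observed
  group = group
)

fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)

# exp(coef) is the estimated hazard ratio of the high group relative to the low group
hazard_ratio <- exp(coef(fitcox))
ci <- exp(confint(fitcox))
hazard_ratio
ci
```

A hazard ratio above 1 with a confidence interval that excludes 1 would indicate a higher recurrence hazard in the high-ratio group.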

Sample Output (Cox)

Validation: GSE6532
• Link to this dataset:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Total number of markers: 54675
• The genes HOXB13 and IL17RB and the ESTs are included in this dataset
• We use this dataset for validation
• Result: the markers are not significant on this independent set
Big Data Training for Translational Omics Research
Variance Based Filtering
bull Calculate the variance for each gene and
choose the top 25 variance based filtering (75th percentile)
var_geno lt- apply(geno 1 var)
var_filtered_idx lt- var_geno gt quantile(var_geno 075)
feature_var_filtered lt- feature[var_filtered_idx]
geno_var_filtered lt- geno[var_filtered_idx]

t-Test Based Filtering
• For each gene, run a two-sided t-test between the recurrence and non-recurrence groups
# the status variable indicates the group information
tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value
• Filter the genes by the p-value cutoff
ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, ]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, ]
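The slide shows only the body of the per-gene loop. One way the full loop might look is sketched below on simulated data; the matrix contents, the group labels and the shifted gene are assumptions for illustration, and the 0.0011 cutoff is the one quoted on the sample-results slides. Note that `t.test()` uses the Welch (unequal variance) test by default.

```r
set.seed(2)
n_genes <- 50
status <- factor(rep(c("recur", "non-recur"), each = 10))
geno_var_filtered <- matrix(rnorm(n_genes * 20), nrow = n_genes)
# Shift gene 1 in the recurrence group so that it is clearly differential
geno_var_filtered[1, status == "recur"] <- geno_var_filtered[1, status == "recur"] + 5
feature_var_filtered <- data.frame(GB_ACC = paste0("ACC", seq_len(n_genes)))  # hypothetical IDs

cutoff <- 0.0011
pvalue_list <- numeric(n_genes)
for (i in seq_len(n_genes)) {
  # One gene at a time: expression values plus the group label
  sdata <- data.frame(gene_express = geno_var_filtered[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx, , drop = FALSE]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx, , drop = FALSE]
```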

Sample Results (GSE1378, microdissected; cutoff 0.0011)

Sample Results (GSE1379, whole tissue section; cutoff 0.0011)

Statistical Modeling (Examples)

Outline
• Select the overlapping genes between GSE1378 and GSE1379 for subsequent analysis
• Heatmap and dendrogram
• Univariate logistic regression for the selected genes and the two-gene ratio predictor
• Multivariate logistic regression (tumor size and two other potential predictors)
• Survival analysis, part 1: Kaplan-Meier plot
• Survival analysis, part 2: Cox proportional hazards model

Overlapped Genes
• In the preprocessing step we obtained two DEG (differentially expressed gene) tables, one for each of the datasets GSE1378 and GSE1379
• We use the genes that overlap between these two DEG tables for the subsequent analysis
• GSE1378: micro-dissected breast cancer cells (LCM)
• GSE1379: whole tissue sections
• The overlapping genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111), and AI240933 (EST)
• We will study the prognostic value of these markers

Heatmap and Dendrogram
• We use a heatmap and dendrogram to visually check the relationship (correlation) among genes or samples
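The slides do not show the plotting code, so the following is only a sketch of one common approach in base R: cluster genes on a correlation-based distance with `hclust()` and draw the heatmap with `heatmap()`. The four simulated genes and ten samples are assumptions made for illustration.

```r
set.seed(3)
# Simulated expression: genes g1/g2 share one pattern, g3/g4 share another
base1 <- sin(1:10)
base2 <- cos(1:10)
expr <- rbind(g1 = base1 + rnorm(10, sd = 0.1),
              g2 = base1 + rnorm(10, sd = 0.1),
              g3 = base2 + rnorm(10, sd = 0.1),
              g4 = base2 + rnorm(10, sd = 0.1))
colnames(expr) <- paste0("s", 1:10)

# Correlation-based distance between genes, then hierarchical clustering
gene_dist <- as.dist(1 - cor(t(expr)))
gene_tree <- hclust(gene_dist, method = "average")
plot(gene_tree, main = "Gene dendrogram")

# heatmap() draws row and column dendrograms next to the image
heatmap(expr, scale = "row")

# Cutting the tree into two clusters recovers the two simulated patterns
clusters <- cutree(gene_tree, k = 2)
clusters
```

Genes that end up on the same branch have highly correlated expression profiles across samples, which is the relationship the heatmap is meant to make visible.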

Heatmap (microdissected, GSE1378): consistent with the paper

Heatmap (whole tissue section, GSE1379)

Model Set 1
• Univariate logistic regression for each gene
– Response variable: recurrence/non-recurrence status
– Predictor: one of the overlapping genes, HOXB13, IL17BR (AF208111), or AI240933 (EST)

Model Set 2
• Univariate logistic regression for the ratio of genes
– Response variable: recurrence/non-recurrence status
– Predictor: the HOXB13/IL17BR ratio

Model Set 3
• Multivariate logistic regression
– Response variable: recurrence/non-recurrence status
– Predictors: tumor size, the HOXB13/IL17BR ratio, PGR, and ERBB2

Model Set 4
• Survival model
– Response variable: DFS (disease-free survival time) and the censoring indicator
– Predictor: use "-intercept/beta" from the logistic regression as the cutoff to divide the samples into two groups, a high-ratio group and a low-ratio group
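The "-intercept/beta" cutoff has a simple interpretation: in a logistic model logit(p) = b0 + b1 * x, the fitted probability is exactly 0.5 where b0 + b1 * x = 0, that is, at x = -b0/b1. The sketch below checks this on simulated data; the simulated coefficients and sample size are assumptions, not values from the course datasets.

```r
set.seed(4)
n <- 200
gene_ratio <- rnorm(n)
# Simulated status: recurrence becomes more likely as the ratio increases
status <- rbinom(n, 1, plogis(0.5 + 1.5 * gene_ratio))

fit <- glm(status ~ gene_ratio, family = binomial(link = "logit"))
b <- coef(fit)
cutoff <- unname(-b["(Intercept)"] / b["gene_ratio"])

# The fitted recurrence probability at the cutoff is 0.5
p_at_cutoff <- unname(predict(fit, newdata = data.frame(gene_ratio = cutoff),
                              type = "response"))
p_at_cutoff

# Split the samples into high-ratio and low-ratio groups
group <- ifelse(gene_ratio > cutoff, "high", "low")
table(group)
```

Samples above the cutoff are those the logistic model would classify as more likely than not to recur (when the slope is positive), which is why this value is a natural threshold for the survival grouping.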

Important Note
• Please remember that there are two datasets, GSE1378 and GSE1379
• The same sets of models can be fitted on both datasets
• Set the working dataset variable to choose between them
working_dataset = "GSE1379"  # whole tissue section, GSE1379
working_dataset = "GSE1378"  # microdissected breast cancer cells, GSE1378
• We use the working dataset GSE1378 as the example

Univariate Logistic Regression for Each Gene
• As an example, we check the gene HOXB13
library(ROCR)  # provides prediction() and performance()
gb_acc = "BC007092"  # HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
logit_data = data.frame(status = infos_df$status, gene = geno_selected)
fit <- glm(status ~ geno_selected, data = logit_data, family = binomial(link = "logit"))
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
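As a cross-check on the reported AUC, the same quantity can be computed without ROCR: the AUC equals the probability that a randomly chosen recurrence case gets a higher predicted score than a randomly chosen non-recurrence case. The helper `auc_rank()` below is a hypothetical name introduced for this sketch, and the six scores are toy values.

```r
# AUC via the Mann-Whitney rank statistic (base R only).
# scores: predicted probabilities; labels: 1 = recurrence, 0 = non-recurrence
auc_rank <- function(scores, labels) {
  r <- rank(scores)                 # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,   0)
auc_rank(scores, labels)  # 8 of the 9 case/control pairs are ordered correctly: 0.889
```

On real data, `auc_rank(p, infos_df$status)` should agree with the ROCR value, provided the status is coded 0/1 with 1 meaning recurrence.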

Sample Output (gene HOXB13)

ROC (AUC = 0.796, gene HOXB13)

Univariate Logistic Regression (HOXB13/IL17BR)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
# on the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio)
# fit the model
fit <- glm(status ~ gene_ratio, data = logit_data, family = binomial(link = "logit"))
summary(fit)
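The comment "on the log2 scale, the ratio is the difference" follows from log2(a / b) = log2(a) - log2(b). A tiny numeric check, using made-up raw intensities rather than values from the datasets:

```r
# Hypothetical raw (unlogged) intensities for three samples
hoxb13_raw <- c(8, 2, 16)
il17br_raw <- c(2, 8, 4)

ratio_then_log <- log2(hoxb13_raw / il17br_raw)        # 2 -2 2
log_then_diff  <- log2(hoxb13_raw) - log2(il17br_raw)  # 2 -2 2
all.equal(ratio_then_log, log_then_diff)               # TRUE
```

This is why the code subtracts the two expression vectors instead of dividing them: the expression values are assumed to be already log2-transformed.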

Sample Output (HOXB13/IL17BR)

ROC (AUC = 0.84, HOXB13/IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# on the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ gene_ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)

Sample Output (Multivariate)

ROC (AUC = 0.86, multivariate)

Kaplan-Meier Plot
(gene ratio high/low groups, cutoff = -12)
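The slides show the Kaplan-Meier plot without its code. A sketch of how such a plot and the accompanying log-rank test might be produced with the `survival` package follows; the data are simulated, and the column names `time`, `censor` and `group` are assumed to match the `surv_data` used in the Cox model, with censor = 1 meaning recurrence was observed.

```r
library(survival)
set.seed(5)
n <- 100
group <- factor(rep(c("high", "low"), each = n / 2), levels = c("low", "high"))
# Simulated: the high-ratio group recurs sooner
event_time <- rexp(n, rate = ifelse(group == "high", 0.2, 0.05))
censor_time <- runif(n, 5, 15)  # end of follow-up
surv_data <- data.frame(
  time = pmin(event_time, censor_time),
  censor = as.integer(event_time <= censor_time),  # 1 = recurrence observed
  group = group
)

# Kaplan-Meier curves, one per group (strata follow the factor levels: low, high)
km_fit <- survfit(Surv(time, censor) ~ group, data = surv_data)
plot(km_fit, col = c("blue", "red"), xlab = "Time", ylab = "Disease-free survival")
legend("topright", legend = levels(surv_data$group), col = c("blue", "red"), lty = 1)

# Log-rank test for a difference between the two curves
lr <- survdiff(Surv(time, censor) ~ group, data = surv_data)
lr
```

The plot gives the visual comparison, and `survdiff()` supplies the formal chi-square test of whether the two survival curves differ.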

Cox Proportional Hazards Model
(gene ratio high/low groups, cutoff = -12)
library(survival)  # provides Surv() and coxph()
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)
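To read the Cox fit, the coefficient is usually exponentiated into a hazard ratio. The sketch below does this on simulated data in which the high-ratio group is given three times the hazard of the low-ratio group; the data and the true hazard ratio are assumptions for illustration only.

```r
library(survival)
set.seed(6)
n <- 200
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time <- rexp(n, rate = ifelse(group == "high", 0.3, 0.1))  # true hazard ratio = 3
censor_time <- runif(n, 5, 15)
surv_data <- data.frame(time = pmin(event_time, censor_time),
                        censor = as.integer(event_time <= censor_time),
                        group = group)

fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
hr <- exp(coef(fitcox))     # hazard ratio, high vs. low
ci <- exp(confint(fitcox))  # 95% confidence interval on the hazard-ratio scale
hr
ci

# Check of the proportional-hazards assumption (Schoenfeld residuals)
cox.zph(fitcox)
```

A hazard ratio above 1 means the high-ratio group recurs at a higher rate; `summary(fitcox)` reports the same quantity in its `exp(coef)` column.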

Sample Output (Cox)

Validation: GSE6532
• Link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Total number of markers: 54675
• The genes HOXB13 and IL17RB and the ESTs are included in this dataset
• We use this dataset for validation
• Result: the markers are not significant on this independent set
Outline
bull Select overlapped genes between GSE1378 and
GSE1379 for subsequent analysis
bull Heatmap and Dendrogram
bull Univariate logistic regression for selected genes and
two-gene ratio predictor
bull Multivariate logistic regression (size and the other two
potential predictors)
bull Survival analysis part 1 Kaplan-Meier plot
bull Survival analysis part 2 Cox proportional odds model

Overlapped Genes
• In the preprocessing step we obtained two DEG (differentially expressed gene) tables, one for each of the datasets GSE1378 and GSE1379
• We use the genes that overlap between these two DEG tables for the subsequent analysis
• GSE1378: micro-dissected breast cancer cells (LCM)
• GSE1379: whole tissue section
• The overlapped genes are HOXB13 (identified twice, as AI208111 and BC007092), IL17BR (AF208111) and AI240933 (EST)
• We will study the prognostic value of these markers

Heatmap and Dendrogram
• We use a heatmap and a dendrogram to visually check the relationship (correlation) among genes or samples
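A minimal sketch of how such a plot could be drawn in base R, on simulated data. The matrix `expr`, its dimensions and the probe and sample names are hypothetical stand-ins for the course objects. Correlation distance with average linkage is an assumed choice, not necessarily the one used for the slides.

```r
# Simulated log2 expression matrix: 6 probes (rows) x 10 samples (columns).
set.seed(1)
expr <- matrix(rnorm(60), nrow = 6,
               dimnames = list(paste0("probe", 1:6), paste0("sample", 1:10)))

# Correlation-based distance between probes: 1 - Pearson correlation of rows.
gene_dist <- as.dist(1 - cor(t(expr)))

# Hierarchical clustering; plotting the result draws the dendrogram.
gene_tree <- hclust(gene_dist, method = "average")
plot(gene_tree, main = "Dendrogram of probes (simulated data)")

# heatmap() draws the matrix with row and column dendrograms attached.
# The same correlation distance is used for rows (probes) and columns (samples).
hm <- heatmap(expr,
              distfun = function(m) as.dist(1 - cor(t(m))),
              hclustfun = function(d) hclust(d, method = "average"),
              scale = "row")
```

Probes or samples that end up on neighbouring branches have similar expression profiles under the chosen distance.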

Heatmap (microdissected, GSE1378)
Consistent with the paper

Heatmap (whole tissue section, GSE1379)

Model Set 1
• Univariate logistic regression for each gene
– Response variable: recur/non-recur status
– Predictor: one of the overlapped genes, HOXB13, IL17BR (AF208111) or AI240933 (EST)

Model Set 2
• Univariate logistic regression for the ratio of genes
– Response variable: recur/non-recur status
– Predictor: HOXB13/IL17BR

Model Set 3
• Multivariate logistic regression
– Response variable: recur/non-recur status
– Predictors: tumor size, HOXB13/IL17BR, PGR and ERBB2

Model Set 4
• Survival model
– Response variable: DFS (disease-free survival time) and the censor indicator
– Predictor: use "-intercept/beta" from the logistic regression as the cutoff to divide the samples into two groups, a high-ratio group and a low-ratio group
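The cutoff "-intercept/beta" is the ratio value at which a one-predictor logistic model gives a fitted recurrence probability of 0.5, because the linear predictor intercept + beta * x is zero exactly at x = -intercept/beta. The sketch below illustrates this on simulated data. The data frame `sim_data` and the true coefficients are made up for illustration and are not the course data.

```r
# Simulated data: a log2 gene ratio and a binary recurrence status.
set.seed(42)
n <- 200
ratio <- rnorm(n)
status <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * ratio))
sim_data <- data.frame(status = status, ratio = ratio)

fit <- glm(status ~ ratio, data = sim_data, family = binomial(link = "logit"))

# Cutoff where the fitted recurrence probability equals 0.5.
b <- coef(fit)
cutoff <- unname(-b["(Intercept)"] / b["ratio"])

# Check: the predicted probability at the cutoff is 0.5.
p_at_cutoff <- predict(fit, newdata = data.frame(ratio = cutoff), type = "response")

# Split the samples into a low-ratio and a high-ratio group.
sim_data$group <- factor(ifelse(sim_data$ratio > cutoff, "high", "low"),
                         levels = c("low", "high"))
table(sim_data$group)
```

The resulting `group` factor is the kind of two-level predictor that the Kaplan-Meier plot and the Cox model take as input.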

Important Note
• Please remember there are two datasets, GSE1378 and GSE1379
• The same sets of models can be fitted on both datasets
• The working dataset variable needs to be set
working_dataset = "GSE1378"  # "GSE1378": microdissected breast cancer cells; "GSE1379": whole tissue section
• We use the working dataset GSE1378 as the example

Univariate Logistic Regression for Each Gene
• As an example, we check the gene HOXB13
library(ROCR)  # provides prediction() and performance()
gb_acc = "BC007092"  # HOXB13
# expression values of the selected probe (assumes geno is a probes-by-samples matrix)
geno_selected = geno[which(feature$GB_ACC == gb_acc), ]
logit_data = data.frame(status = infos_df$status, gene = geno_selected)
fit <- glm(status ~ gene, data = logit_data, family = binomial(link = "logit"))
# predicted recurrence probabilities, then the ROC curve and the AUC
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
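The AUC can be read as the probability that a randomly chosen recurrent sample receives a higher predicted probability than a randomly chosen non-recurrent one. The sketch below computes it in base R through the Mann-Whitney rank-sum identity, as a cross-check that needs no extra package. The helper `auc_rank` is a hypothetical name introduced here, and this is not the ROCR code path used on the slide.

```r
# AUC via the Mann-Whitney rank-sum identity (ties receive average ranks).
# scores: predicted probabilities; labels: 0/1 with 1 = positive class.
auc_rank <- function(scores, labels) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny hand-checkable example: 2 positives, 2 negatives.
scores <- c(0.9, 0.4, 0.6, 0.2)
labels <- c(1,   1,   0,   0)
auc_toy <- auc_rank(scores, labels)
auc_toy  # 0.75: 3 of the 4 positive-negative pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation of the two classes.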

Sample Output (gene HOXB13)

ROC (AUC = 0.796, gene HOXB13)

Univariate Logistic Regression (HOXB13/IL17BR)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
# expression values of the two probes (assumes geno is a probes-by-samples matrix)
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
# on the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1,
                        gene2 = geno_selected2, ratio = gene_ratio)
# fit the model
fit <- glm(status ~ ratio, data = logit_data, family = binomial(link = "logit"))
summary(fit)
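The ratio predictor relies on the identity log2(a/b) = log2(a) - log2(b), so subtracting two log2 expression values gives the log2 of the expression ratio. A short check with made-up intensities (the vectors `raw_gene1` and `raw_gene2` are hypothetical):

```r
# Hypothetical raw intensities for two genes in three samples.
raw_gene1 <- c(8, 16, 2)
raw_gene2 <- c(2, 4, 8)

log_ratio  <- log2(raw_gene1 / raw_gene2)        # log2 of the ratio
difference <- log2(raw_gene1) - log2(raw_gene2)  # difference of the log2 values

all.equal(log_ratio, difference)  # TRUE
log_ratio                         # 2 2 -2
```

A value of 2 means the first gene is four times as abundant as the second, and -2 means it is four times less abundant.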

Sample Output (HOXB13/IL17BR)

ROC (AUC = 0.84, HOXB13/IL17BR)

Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)
gb_acc1 = "BC007092"  # HOXB13
gb_acc2 = "AF208111"  # IL17BR
gene_name3 = "PGR_3UTR1"  # PGR
gene_name4 = "BF108852"  # ERBB2
# expression values of the selected probes (assumes geno is a probes-by-samples matrix)
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 = geno[which(feature$GeneName == gene_name3), ]
geno_selected4 = geno[which(feature$GeneName == gene_name4), ]
# on the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
logit_data = data.frame(status = infos_df$status, size = infos_df$Size,
                        gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio,
                        gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)
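The coefficients printed by `summary(fit)` are on the log-odds scale. Exponentiating them gives odds ratios, which are often easier to report. The sketch below does this on simulated data with two predictors. The data frame `sim` and its true coefficients are invented for illustration, and the intervals are Wald intervals from `confint.default`, an assumed choice.

```r
# Simulated data: a log2 gene ratio, a tumor size and a binary status.
set.seed(7)
n <- 300
sim <- data.frame(ratio = rnorm(n), size = runif(n, 1, 5))
sim$status <- rbinom(n, 1, plogis(-1 + 1.2 * sim$ratio + 0.3 * sim$size))

fit_multi <- glm(status ~ ratio + size, data = sim,
                 family = binomial(link = "logit"))

# Odds ratios: multiplicative change in the odds of recurrence
# per one-unit increase of each predictor, holding the other fixed.
odds_ratios <- exp(coef(fit_multi))

# Wald 95% confidence intervals, transformed to the odds-ratio scale.
ci <- exp(confint.default(fit_multi))

cbind(odds_ratio = odds_ratios, ci)
```

An interval that excludes 1 corresponds to a Wald test that is significant at the 5% level for that predictor.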

Sample Output (Multivariate)

ROC (AUC = 0.86, Multivariate)

Kaplan-Meier Plot
(gene ratio high/low group, cutoff = -1.2)
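A minimal sketch of a Kaplan-Meier plot with a log-rank test, using the `survival` package on simulated data. The data frame `surv_sim`, the hazard rates and the censoring scheme are assumptions made for illustration. As in the course material, censor = 1 marks an observed recurrence and censor = 0 a non-recurrence.

```r
library(survival)  # recommended package shipped with standard R installations

# Simulated data: two groups with different recurrence hazards.
set.seed(3)
n <- 100
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time <- rexp(n, rate = ifelse(group == "high", 0.10, 0.04))
censor_time <- runif(n, 5, 40)                   # independent censoring times
time <- pmin(event_time, censor_time)            # observed follow-up time
censor <- as.integer(event_time <= censor_time)  # 1 = recurrence observed
surv_sim <- data.frame(time, censor, group)

# Kaplan-Meier estimate per group and the corresponding plot.
km <- survfit(Surv(time, censor) ~ group, data = surv_sim)
plot(km, col = c("blue", "red"), xlab = "Time", ylab = "Disease-free survival")
legend("topright", legend = levels(surv_sim$group), col = c("blue", "red"), lty = 1)

# Log-rank test for a difference between the two survival curves.
lr <- survdiff(Surv(time, censor) ~ group, data = surv_sim)
p_logrank <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_logrank
```

The plot gives the visual comparison, and the log-rank p-value is the formal test of whether the curves differ.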

Cox Proportional Hazards Model
(gene ratio high/low group, cutoff = -1.2)
library(survival)  # provides Surv() and coxph()
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)
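`coxph` fits a proportional hazards model, so the exponentiated coefficient is a hazard ratio between the groups. The sketch below fits the model on simulated data and also runs `cox.zph`, a check of the proportional hazards assumption that is not part of the slide code. The data frame `surv_sim` and its hazard rates are invented for illustration.

```r
library(survival)  # recommended package shipped with standard R installations

# Simulated data: the high-ratio group has a higher recurrence hazard.
set.seed(3)
n <- 100
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time <- rexp(n, rate = ifelse(group == "high", 0.10, 0.04))
censor_time <- runif(n, 5, 40)                   # independent censoring times
time <- pmin(event_time, censor_time)            # observed follow-up time
censor <- as.integer(event_time <= censor_time)  # 1 = recurrence observed
surv_sim <- data.frame(time, censor, group)

# Cox proportional hazards model with the group as the only predictor.
fitcox_sim <- coxph(Surv(time, censor) ~ group, data = surv_sim)

# Hazard ratio of the high-ratio group relative to the low-ratio group.
hr <- exp(coef(fitcox_sim))
hr

# Test of the proportional hazards assumption (scaled Schoenfeld residuals).
ph_check <- cox.zph(fitcox_sim)
ph_check
```

A hazard ratio above 1 indicates a higher recurrence hazard in the high-ratio group. A small p-value from `cox.zph` would suggest the proportional hazards assumption is questionable.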

Sample Output (Cox)

Validation: GSE6532
• The link to this dataset:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Total number of markers: 54675
• The genes HOXB13, IL17RB and the ESTs are included in this dataset
• We use this dataset as validation
• Result: they are not significant on this independent set
Big Data Training for Translational Omics Research
Model Set 2
bull Univariate logistic regression for
ratio of genes
ndash Response variable recurnon-recur
status
ndash Predictors HOXB13IL17BR
Big Data Training for Translational Omics Research
Model Set 3
bull Multivariate logistic regression
ndash Response variable recurnon-
recur
ndash Predictors tumor size
HOXB13IL17BR PGR and ERBB2
Big Data Training for Translational Omics Research
Model Set 4
bull Survival model
ndash Response variable DFS (disease free
survival time) censor
ndash Predictor use ldquo-interceptbetardquo from
logistic regression as the cutoff to
divide the sample into two groups high
ratio group and low ratio group
Big Data Training for Translational Omics Research
Important Note
bull Please remember there are two datasets GSE1378 and GSE1379
bull Can fit the same sets of model on these two datasets
bull Need to set the working dataset variable
working_dataset = GSE1378 whole tissue sectionGSE1379
working_dataset = GSE1378 microdissected breast cancer cells
GSE1378
bull Use working dataset GSE1378 as example
Big Data Training for Translational Omics Research
Univariate Logistic Regression
for Each Gene
bull As an example we check the gene HOXB13gb_acc = BC007092 HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc)]
logit_data = dataframe(status = infos_df$statusgene = geno_selected )
fit lt- glm(status~ geno_selecteddata = logit_datafamily = binomial(link = logit))
p lt- predict(fit type=response)
pr lt- prediction(p infos_df$status)
prf lt- performance(pr measure = tpr xmeasure = fpr)
plot(prfmain=paste0(ROC plot of gene gb_acc))
auc lt- performance(pr measure = auc)
auc lt- aucyvalues[[1]]
auc
Big Data Training for Translational Omics Research
Sample Output (gene HOXB13 )
Big Data Training for Translational Omics Research
ROC (auc 0796 gene HOXB13 )
Big Data Training for Translational Omics Research
Univariate Logistic Regression (HOXB13/IL17BR)
gb_acc1 <- "BC007092"  # HOXB13
gb_acc2 <- "AF208111"  # IL17BR
# expression values of each gene (assumes geno is a genes x samples matrix)
geno_selected1 <- geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 <- geno[which(feature$GB_ACC == gb_acc2), ]
# on the log2 scale, the ratio is the difference
gene_ratio <- geno_selected1 - geno_selected2
logit_data <- data.frame(status = infos_df$status, gene1 = geno_selected1,
                         gene2 = geno_selected2, ratio = gene_ratio)
# fit the model
fit <- glm(status ~ ratio, data = logit_data, family = binomial(link = "logit"))
summary(fit)
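The subtraction used for the gene ratio follows from the logarithm rule for quotients, assuming (as the code comment states) that the expression values in geno are already on the log2 scale:

```latex
\log_2\!\left(\frac{\mathrm{HOXB13}}{\mathrm{IL17BR}}\right)
  = \log_2(\mathrm{HOXB13}) - \log_2(\mathrm{IL17BR})
```

A difference of 1 on this scale therefore corresponds to a twofold expression ratio.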
Sample Output (HOXB13/IL17BR)
ROC (AUC = 0.84, HOXB13/IL17BR)
Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)
gb_acc1 <- "BC007092"  # HOXB13
gb_acc2 <- "AF208111"  # IL17BR
gene_name3 <- "PGR_3UTR1"  # PGR
gene_name4 <- "BF108852"  # ERBB2
# expression values of each gene (assumes geno is a genes x samples matrix)
geno_selected1 <- geno[which(feature$GB_ACC == gb_acc1), ]
geno_selected2 <- geno[which(feature$GB_ACC == gb_acc2), ]
geno_selected3 <- geno[which(feature$GeneName == gene_name3), ]
geno_selected4 <- geno[which(feature$GeneName == gene_name4), ]
# on the log2 scale, the ratio is the difference
gene_ratio <- geno_selected1 - geno_selected2
logit_data <- data.frame(status = infos_df$status, size = infos_df$Size,
                         gene1 = geno_selected1, gene2 = geno_selected2, ratio = gene_ratio,
                         gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = "logit"))
summary(fit)
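The coefficients printed by `summary(fit)` are log odds ratios, so exponentiating them gives odds ratios that are often easier to interpret. The sketch below shows one way to tabulate them with Wald 95% confidence intervals. It runs on simulated data, so `sim_data` and all of its values are assumptions for illustration, not the GSE1378 results.

```r
# Simulated stand-in data (hypothetical values, not GSE1378).
set.seed(2)
n <- 300
sim_data <- data.frame(
  ratio = rnorm(n),
  size  = rnorm(n, mean = 2, sd = 0.5),
  gene3 = rnorm(n),
  gene4 = rnorm(n)
)
lin_pred <- -0.5 + 1.0 * sim_data$ratio + 0.4 * sim_data$size
sim_data$status <- rbinom(n, size = 1, prob = plogis(lin_pred))

fit <- glm(status ~ ratio + size + gene3 + gene4,
           data = sim_data, family = binomial(link = "logit"))

# Odds ratios with Wald 95% confidence intervals, one row per coefficient.
est <- coef(summary(fit))
z <- qnorm(0.975)
odds_ratios <- data.frame(
  OR    = exp(est[, "Estimate"]),
  lower = exp(est[, "Estimate"] - z * est[, "Std. Error"]),
  upper = exp(est[, "Estimate"] + z * est[, "Std. Error"]),
  p     = est[, "Pr(>|z|)"]
)
odds_ratios
```

An odds ratio above 1 for `ratio` would mean that a higher gene ratio is associated with higher odds of recurrence, holding the other predictors fixed.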
Sample Output (Multivariate)
ROC (AUC = 0.86, Multivariate)
Kaplan-Meier Plot
(gene ratio high/low group, cutoff = -1.2)
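A Kaplan-Meier plot and the accompanying log-rank test can be produced with the survival package. The following is a minimal sketch on simulated data: the construction of `surv_data` (columns `time`, `censor`, `group`) is an assumption chosen to match the variable names used in the Cox model call, not the GSE1378 data.

```r
library(survival)  # Surv(), survfit(), survdiff()

# Simulated stand-in for surv_data (hypothetical values, not GSE1378).
set.seed(3)
n <- 200
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time  <- rexp(n, rate = ifelse(group == "high", 0.3, 0.1))
censor_time <- runif(n, min = 0, max = 15)
surv_data <- data.frame(
  time   = pmin(event_time, censor_time),
  censor = as.integer(event_time <= censor_time),  # 1 = recurrence, 0 = censored
  group  = group
)

# Kaplan-Meier curves, one per group (strata follow the factor level order).
km <- survfit(Surv(time, censor) ~ group, data = surv_data)
plot(km, col = c("blue", "red"), xlab = "Time", ylab = "Disease-free survival")
legend("topright", legend = levels(surv_data$group), col = c("blue", "red"), lty = 1)

# Log-rank test for a difference between the two survival curves.
lr <- survdiff(Surv(time, censor) ~ group, data = surv_data)
lr_p <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
```

The plot gives the visual comparison between groups, while `lr_p` is the p-value of the formal log-rank test.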
Cox Proportional Hazards Model
(gene ratio high/low group, cutoff = -1.2)
library(survival)  # provides Surv() and coxph()
fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fitcox)
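The quantity of interest from `coxph()` is usually the hazard ratio between the two groups, which is the exponentiated coefficient. The sketch below extracts it together with a 95% confidence interval and runs a check of the proportional hazards assumption. It uses simulated data, so `surv_data` and its values here are assumptions for illustration only.

```r
library(survival)  # Surv(), coxph(), cox.zph()

# Simulated stand-in for surv_data (hypothetical values, not GSE1378).
set.seed(4)
n <- 200
group <- factor(rep(c("low", "high"), each = n / 2), levels = c("low", "high"))
event_time  <- rexp(n, rate = ifelse(group == "high", 0.3, 0.1))
censor_time <- runif(n, min = 0, max = 15)
surv_data <- data.frame(
  time   = pmin(event_time, censor_time),
  censor = as.integer(event_time <= censor_time),  # 1 = recurrence, 0 = censored
  group  = group
)

fitcox <- coxph(Surv(time, censor) ~ group, data = surv_data)

# Hazard ratio of the high-ratio group relative to the low-ratio group, with 95% CI.
hr <- exp(coef(fitcox))
hr_ci <- exp(confint(fitcox))

# Test of the proportional hazards assumption based on scaled Schoenfeld residuals.
ph_check <- cox.zph(fitcox)
```

Because "low" is the first factor level, it acts as the reference group, so `hr` above 1 indicates a higher recurrence hazard in the high-ratio group.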
Sample Output (Cox)
Validation: GSE6532
• Link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Total number of markers: 54675
• Genes HOXB13, IL17RB, and ESTs are included in this dataset
• We use this dataset for validation
• Result: these markers are not significant on this independent dataset