How to screen out liars
Yh. Taguchi
Department of Physics, Chuo University, Tokyo, Japan
・What are liars?
・Two examples to which PCA based unsupervised FE was applied
 >The first example: Transgenerational Epigenetics
 >Next example: Epigenetic therapy (NSCLC cell line reprogramming)
・Mathematical details of our methodology
Our proposed method!
Two kinds of liars often hide behind bioinformatic analyses of massive omics data (this is already well known.....)
False positives
“This gene is expressed differently between controls and treated samples, because such a difference would occur by chance only with a very small probability (say, P = 10^-4)!”
However, it may not be true, since if you consider N = 10^4 genes, P = 10^-4 is not rare at all: about N × P = 1 gene is expected to reach it by chance alone.
“OK, then we consider P-values multiplied by N. If the product is still small enough, we regard the event as true.”
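The arithmetic behind this slide can be checked directly. The sketch below assumes independent tests whose null P-values are uniform on [0, 1], with N = 10^4 genes and a threshold of P = 10^-4; the helper name `bonferroni` is ours, and multiplying a P-value by N is the standard Bonferroni correction that the quoted remedy describes.

```python
# Expected number of false positives when many genes are tested at once.
N = 10**4        # number of genes tested
alpha = 1e-4     # per-gene P-value threshold

# Under the null hypothesis, on average N * alpha genes pass purely by chance.
expected_false_positives = N * alpha

# Probability that at least one null gene passes the threshold.
prob_at_least_one = 1 - (1 - alpha) ** N


def bonferroni(p, n_tests):
    """Multiply a raw P-value by the number of tests, capped at 1."""
    return min(1.0, p * n_tests)


print(expected_false_positives)
print(round(prob_at_least_one, 3))
print(bonferroni(1e-4, N))   # no longer small after correction
print(bonferroni(1e-7, N))   # still small after correction
```

A raw P-value therefore has to be far below 1/N before the corrected value stays small, which is exactly what makes false negatives likely on noisy data.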
False negatives
“There are no genes expressed differently between controls and treated samples, since no P-values less than 1/N = 10^-4 were observed...”
This also may not be true, since noise (very common in biology...) can weaken significance.
What shall we do?
One solution is …. “Detective strategy”
“Examine suspects separately. If they say the same thing, it could be true.”
Only one truth exists.
This strategy is well known under the name “Cross Validation (CV)”. However, CV also often fails when applied to feature extraction (FE).
“Case Closed”
For example, leave-one-out CV (LOOCV) for FE
Task: identify a limited number of genes that discriminate healthy controls from patients well.
[Figure: each of the eight samples is left out in turn, and FE on the remaining samples gives Gene set 1, Gene set 2, ・・・, Gene set 8]
Check coincidence between the eight trials. If coincident, it is true...
Reality is tough
Task: Identify a limited number of circulating miRNAs that discriminate patients from healthy controls. Yh. Taguchi and Y. Murakami, BMC Research Notes (2014)
[Figure: mean probability of each miRNA selection and accuracy under LOOCV for several diseases; labels: LOOCV FE, Lasso; axis ticks 0.5, 0.8]
Not enough coincidence. “Detective strategy” fails, too.
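The coincidence check between leave-one-out trials can be sketched as follows. This is only an illustration on synthetic data: the gene count, the effect size, the scoring rule (difference of class means over a pooled standard deviation) and the definition of the mean selection probability are all assumptions of ours, not the procedure of the cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, k = 1000, 10
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 4 controls, 4 patients
X = rng.normal(size=(n_genes, labels.size))    # genes x samples, pure noise
X[:10, labels == 1] += 1.0                     # 10 genes with a modest true shift


def top_k_genes(X, labels, k):
    """Rank genes by |difference of class means| / pooled std and keep the top k."""
    a, b = X[:, labels == 0], X[:, labels == 1]
    pooled = np.sqrt(a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) + 1e-12
    score = np.abs(a.mean(axis=1) - b.mean(axis=1)) / pooled
    return set(np.argsort(score)[-k:].tolist())


# Leave each sample out in turn and redo the feature extraction.
gene_sets = []
for left_out in range(labels.size):
    keep = np.arange(labels.size) != left_out
    gene_sets.append(top_k_genes(X[:, keep], labels[keep], k))

# How often is each gene selected over the eight trials?
counts = np.zeros(n_genes)
for s in gene_sets:
    counts[list(s)] += 1

ever_selected = counts > 0
mean_selection_probability = counts[ever_selected].mean() / labels.size
always_selected = int((counts == labels.size).sum())
print(mean_selection_probability, always_selected)
```

If the eight gene sets agreed perfectly, the mean selection probability would be 1 and `always_selected` would equal k; values well below that indicate that the selected genes depend on which sample was left out.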
Alternative strategy: principal component analysis (PCA) based unsupervised FE.
First: two unpublished biological works
Second: introduce methodology (the audience may be more interested in biology than in mathematical details)
For other published results: search “Yh. Taguchi” in Google Scholar.
The first example: Transgenerational Epigenetics (TGE)
Phenotype transfers between generations without DNA modification
(also featured in “Cell Best of 2014”)
F3 generation of F0 pregnant female exposed to endocrine disruptor
[Figure: lineage F0♀, F1♂, F2, F3]
Abnormalities in F3, whose inherited DNA was not exposed to the endocrine disruptor?
Yes!
・male infertility (Guerrero-Bosagna, PLoS ONE, 2013)
・anxiety behavior (Skinner, PLoS ONE, 2008)
・mate preference (Skinner, BMC Genom., 2013)
・various diseases (Anway, Endocrinology, 2006) (on prostate, kidney, immune system, testis, and tumor development)
・reprogramming of primordial germ cells (Skinner, PLoS ONE, 2013)
・stress responses (Crews, PNAS, 2012)
However, how TGE takes place is still poorly understood.
Authors' conclusion: “A comparison between the germ cell differential DNA methylation regions and the differentially expressed genes indicated no significant overlap”
Observing significant overlaps would be interesting....
Skinner, PLoS ONE, 2013: Primordial germ cells in F3 generation at E13 and E16, gene expression/promoter methylation
Our strategy.....
[Figure: FE is applied separately to gene expression and to promoter methylation (F3 generation; vinclozolin treated and control, each at E13 and E16), selecting N' genes from each out of a total of N ~ 10^4 genes. N'' genes are common to both selections, and the overlap N'' for a given N' is evaluated by P.]
Results.....
[Figure: P-value of the overlap, with reference levels P = 0.05, P = 10^-2 and P = 10^-3]
Significant overlaps detected!
N'' = 48 genes with RefSeq ID
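The slides do not state which test was used to assess the overlap. One common choice, assumed here purely for illustration, is the hypergeometric tail probability: if N' genes are selected from each data type out of N genes at random, how likely is an overlap of at least N'' genes? The function name and the values of N' below are hypothetical.

```python
from math import comb


def overlap_p_value(n_total, n_a, n_b, n_common):
    """P(overlap >= n_common) when n_a and n_b genes are drawn at random from n_total."""
    denom = comb(n_total, n_b)
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, i) * comb(n_total - n_a, n_b - i)
               for i in range(n_common, upper + 1))
    return tail / denom


# Illustrative numbers only: N ~ 10^4 as on the slide, N' = 300 is hypothetical.
n_total, n_prime = 10_000, 300
expected_overlap = n_prime * n_prime / n_total
p_48 = overlap_p_value(n_total, n_prime, n_prime, 48)
print(expected_overlap, p_48)
```

Because exact integer binomials are used, the function stays accurate even when the tail probability is extremely small.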
Are the selected N'' = 48 genes biologically reasonable?
Previously reported: various diseases (Anway, Endocrinology, 2006) (on tumor, prostate, kidney, testis, immune system)
Based on literature searches, 22 of the 48 genes turned out to be related to these tissues/diseases.
Aberrant expression associated with aberrant promoter methylation of these genes may be a causal factor of TGE mediated diseases.
In addition to this.... Chemokine signaling pathway
CCL3, PF4, CCR2, CMKLR1
Some reported relationships to vinclozolin
Some reported relationships to diseases (kidney, prostate, testis, tumor, immunology)
Thus, dysfunction of the chemokine signaling pathway may cause TGE mediated diseases in the F3 generation
Furthermore ….
Three leucine rich repeat (LRR) proteins (LRRN3, PRAMEL1, and LRRTM1) are included.
LRR proteins were recently reported to play critical roles in neural systems.
de Wit et al., Annu. Rev. Cell Dev. Biol. 2011. 27:697–729, Role of Leucine-Rich Repeat Proteins in the Development and Function of Neural Circuits
And ….
LRRN3 and LRRTM3
While ….
2012 (101), pp. 811–818
Aberrant gene expression associated with aberrant promoter methylation of LRR proteins may cause TGE mediated nervous system disorders.
In conclusion ….
PCA based unsupervised FE could identify significant overlap between aberrant gene expression and aberrant promoter methylation in TGE.
Many of the identified genes (22 of 48) were related to the various diseases previously reported.
Multiple genes belong to the chemokine signaling pathway or are LRR proteins, both of which possibly cause TGE mediated diseases.
Possibly, we have successfully screened out liars... (experimental validations are of course needed)
Next example: Epigenetic therapy toward non-small cell lung cancer (NSCLC)
Epigenetic therapy: drugs targeting epigenetics, e.g., promoter methylation, histone modification
Many reports in vivo
DNA methyltransferase inhibitor
However, smaller number of reports in vitro
Epigenetic therapy cannot target specific proteins/genes. Thus, in vitro studies may not be able to reproduce in vivo studies.
→ Considering NSCLC cell line reprogramming, instead, because reprogramming alters epigenetic markers that are also targeted by epigenetic therapy. Thus, detailed investigation of reprogrammed NSCLC cell lines may let us identify genes targeted by epigenetic therapy.
Targeted dataset of NSCLC cell line reprogramming experiment: Mahalingam et al, Sci. Rep., 2012.
Eight cell lines:
・H1 (ES cell)
・H358 (NSCLC)
・H460 (NSCLC)
・IMR90 (human Caucasian fetal lung fibroblast)
・iPCH358 (reprogrammed cell line)
・iPCH460 (reprogrammed cell line)
・iPSIMR90 (reprogrammed cell line)
・piPCH358 (re-differentiated iPCH358)
Gene expression + promoter methylation
Our strategy.....
[Figure: FE is applied separately to gene expression and to promoter methylation (differentiated vs. undifferentiated cell lines), selecting N' genes from each out of a total of N genes. N'' genes are common to both selections, and the overlap N'' for a given N' is evaluated by P.]
Advantages of our strategy:
・Integrated analysis of gene expression and promoter methylation (cf. usually, significance is tested in gene expression and promoter methylation separately, and the results are integrated afterwards)
・Usable for unordered multiclass problems (cf. integration of pairwise comparisons, e.g., by t-test)
・Easy to combine with other FE methods applicable to multiclass problems (e.g., ANOVA)
[Figure: log10 P of the overlap versus N' in four panels for PC3 and PC4; P = 0.05 marked in each panel]
Significant overlaps observed
Various biological significance
(A) Associations with cancer related genes reported by the gendoo server
(B) Significant negative correlations between gene expression and promoter methylation
(C) At least one study that reported a direct/indirect relationship with NSCLC
Do the identified genes include candidates to be targeted by epigenetic therapy?
YES. SFRP1 expression is distinct between HDAC(*) inhibitor-resistant cell lines and non-resistant cell lines
Miyanaga, A. et al. Antitumor activity of histone deacetylase inhibitors in non-small cell lung cancer cells: development of a molecular predictive model. Mol. Cancer Ther. 7, 1923–1930 (2008).
(*) Histone deacetylase
H3K9K14ac of SFRP1 increases during treatment with an HDAC inhibitor for NSCLC cell lines.
Tang, Y. A. et al. PLoS ONE 5, e12417 (2010).
Not NSCLC
What is the biological function of SFRP1?
SFRP1 deactivates the Wnt signaling pathway.
R. Surana et al. / Biochimica et Biophysica Acta 1845 (2014) 53–65
[Figure: Wnt1 and SFRP1; MD by GROMACS]
In conclusion ….
PCA based unsupervised FE could identify significant overlap between aberrant gene expression and aberrant promoter methylation in reprogramming NSCLC cell lines.
Among those identified, we proposed SFRP1 as a candidate target gene of epigenetic therapy because ...
・SFRP1 expression is distinct between HDAC inhibitor-resistant and non-resistant NSCLC cell lines
・SFRP1 expression in NSCLC cell lines increases with HDAC inhibitor treatment
・SFRP1 is a known deactivator of the Wnt signaling pathway
Possibly, we have successfully screened out liars... (experimental validations are of course needed)
What is PCA based unsupervised FE and why does it work so well?
Intuitive synthetic example
[Figure: data matrix of 100 features × 20 samples; 5 + 5 features are distinct among classes, the other 90 features are not]
・20 samples classified into 4 classes
・only 10 features are distinct among the four classes
Task: Identify the 10 features without information about classes
Embedding genes into 2D with PCA
Genes distinct between the four categories are placed as outliers
Without category labeling = unsupervised
Thus, we can identify genes distinct between the four categories without using category labeling (unsupervised). How can we do this?
PC1 (the first principal component) is automatically selected to represent the distinction between the four categories. How can this happen?
PCA is designed to represent majority group behavior. In this data set, PC1 happens to be the component that represents the distinction between categories, since the distinction between the four categories is the only feature that differs from random values.
[Figure: PC1]
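The synthetic example above can be reproduced in a few lines. This is a minimal sketch under our own assumptions: Gaussian noise, class effects of -3, -1, 1, 3 with opposite sign in the two groups of five features, per-sample standardisation, and PCA computed through an SVD. The class labels are used only to build the data and to check the result afterwards, never by the PCA itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_samples = 100, 20
classes = np.repeat(np.arange(4), 5)           # 20 samples in 4 classes (hidden from PCA)
class_effect = np.array([-3.0, -1.0, 1.0, 3.0])[classes]

X = rng.normal(size=(n_features, n_samples))   # 90 features stay pure noise
X[:5] += class_effect                          # 5 features rise across classes
X[5:10] -= class_effect                        # 5 features fall across classes

# Standardise each sample, then embed the features (rows) with PCA via SVD.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
feature_scores = U * S      # one row per feature, one column per PC
sample_loadings = Vt        # one row per PC, one column per sample

# Features far from the origin along PC1 are the outliers.
pc1 = feature_scores[:, 0]
selected = np.sort(np.argsort(np.abs(pc1))[-10:])
print(selected)

# Afterwards, check with the labels that PC1 tracks the class structure.
r = np.corrcoef(sample_loadings[0], class_effect)[0, 1]
print(abs(r))
```

In this toy setting the ten outliers along PC1 are the ten planted features, because the class pattern is the only structure that stands out from the random values.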
Back to real applications …...
Transgenerational epigenetics
PC2 for mRNA and PC1 for promoter methylation were selected, respectively, because these two show the most significant distinction between E13 and E16.
Then, outliers (genes or probes) along PC1 and PC2 were selected.
Epigenetic therapy (NSCLC cell line reprogramming)
① Compute correlation coefficients r among PC1 ... PC24 (mRNA) + PCM1 ... PCM24 (promoter methylation)
② Perform UPGMA (hierarchical clustering) using 1 − |r| as distance.
③ PC3 and PC4 were identified as the most coincident pairs of PCs between mRNA and promoter methylation
Outliers (genes) along PC3/PC4 were selected!
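Steps ① to ③ can be sketched on toy data as follows. The two synthetic layers, the shared sample pattern, the use of six PCs per layer instead of 24, and the helper names are all assumptions for illustration; UPGMA is obtained here through SciPy's `linkage` with `method="average"`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
n_genes, n_samples, n_pcs = 500, 24, 6
shared = rng.normal(size=n_samples)     # hidden sample pattern common to both layers


def toy_layer(sign):
    """Noise matrix (genes x samples) with 30 genes following the shared pattern."""
    X = rng.normal(size=(n_genes, n_samples))
    X[:30] += sign * 2.0 * shared
    return X


def sample_loadings(X, n_pcs):
    """Standardise each sample and return the first n_pcs right singular vectors."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.linalg.svd(Z, full_matrices=False)[2][:n_pcs]


pcs = np.vstack([sample_loadings(toy_layer(+1), n_pcs),    # toy "mRNA" PCs
                 sample_loadings(toy_layer(-1), n_pcs)])   # toy "methylation" PCs
names = [f"PC{i + 1}" for i in range(n_pcs)] + [f"PCM{i + 1}" for i in range(n_pcs)]

r = np.corrcoef(pcs)                    # step 1: correlation between every pair of PCs
dist = 1.0 - np.abs(r)                  # step 2: 1 - |r| as distance
np.fill_diagonal(dist, 0.0)
tree = linkage(squareform(dist, checks=False), method="average")   # UPGMA

# Step 3: leaf-to-leaf merges that join one PC from each layer, closest first.
pairs = [(names[int(a)], names[int(b)], d) for a, b, d, _ in tree
         if a < 2 * n_pcs and b < 2 * n_pcs and (a < n_pcs) != (b < n_pcs)]
print(pairs[0])
```

In the toy data the planted pattern dominates, so the coincident pair is the first PC of each layer; in the real data described on the slide it was PC3 and PC4 instead.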
PCA based unsupervised FE can identify features distinct between categories without using category labeling. Thus, it is expected to have superior power to identify genes critical for the properties considered, e.g., treated vs. control.
Conclusions
・PCA based unsupervised FE was proposed.
・PCA based unsupervised FE was applied to two biological examples.
 → Transgenerational epigenetics
 → Epigenetic therapy (NSCLC cell line reprogramming)
・Selected genes are biologically reasonable.
Successfully screened out liars!
Funding:
KAKENHI 23300357, 26120528
Chuo University Joint Research Grant
Review Article:
Yh. Taguchi, Hideaki Umeyama, Mitsuo Iwadate, Yoshiki Murakami, Akira Okamoto: Heuristic Principal Component Analysis-Based Unsupervised Feature Extraction and Its Application to Bioinformatics
http://dx.doi.org/10.4018/9781466666115.ch007
In “Big Data Analytics in Bioinformatics and Healthcare”, IGI Global pub.
Replacing PCA based unsupervised FE with categorical regression (ANOVA)
[Figure: P-value of the overlap versus N' with P = 0.05 marked; N ~ 10^4]
ANOVA: N' = 300 gives N'' = 8
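For comparison, a supervised ANOVA-based FE can be sketched as below. Unlike PCA based unsupervised FE, it needs the class labels. The one-way F statistic is computed by hand with NumPy; the toy data, the effect sizes and the choice of ranking by F and keeping the top N' features are our assumptions, not the exact procedure behind the slide.

```python
import numpy as np


def anova_f(X, classes):
    """One-way ANOVA F statistic for every row (gene) of X, given one class label per column."""
    groups = np.unique(classes)
    n, k = X.shape[1], groups.size
    grand = X.mean(axis=1, keepdims=True)
    ss_between = np.zeros(X.shape[0])
    ss_within = np.zeros(X.shape[0])
    for g in groups:
        Xg = X[:, classes == g]
        mg = Xg.mean(axis=1, keepdims=True)
        ss_between += Xg.shape[1] * (mg - grand).ravel() ** 2
        ss_within += ((Xg - mg) ** 2).sum(axis=1)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy data: 100 features x 20 samples in 4 classes, 10 features carry a class effect.
rng = np.random.default_rng(0)
classes = np.repeat(np.arange(4), 5)
effect = np.array([-6.0, -2.0, 2.0, 6.0])[classes]
X = rng.normal(size=(100, 20))
X[:10] += effect

F = anova_f(X, classes)
n_prime = 10                                   # number of features to keep (hypothetical)
selected = np.sort(np.argsort(F)[-n_prime:])
print(selected)
```

The same overlap analysis can then be run on the features chosen this way for each data type.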