Katholieke Universiteit Leuven – ESAT/SCD – Steunpunt O&O Indicatoren 14-08-2007 1/24
Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis
Frizo Janssens, Wolfgang Glänzel, Bart De Moor
Introduction
Text mining
Bibliometrics & network analysis
Hybrid clustering
Dynamic hybrid
mapping of bioinformatics
Conclusions
Poster #17
Overview of the presentation
• Introduction
• General context & objectives
• Clustering
• Text mining framework
• Bibliometrics, citation analysis
• Hybrid (integrated) clustering
• Linear combination
• Fisher’s inverse chi-square method
• Dynamic hybrid mapping of bioinformatics
• Conclusions
• Further research
General context
• Mapping of scientific and technological fields by using clustering algorithms and techniques from bibliometrics and text mining
• Complementary views on a document set → different perceptions of similarity
• Textual information: number of words in common
• Citation networks, bibliometric properties
• Goal:
• Integrate text mining & bibliometrics (hybrid approach)
• Better clustering and classification performance
• Mapping cognitive structure and dynamics of bioinformatics
Agglomerative hierarchical clustering
[Figure: (a), (b) data matrix of ‘objects’ (Person 1, Person 2, Person 3, …, Person 20) by features (length, hair color, interest in football, …); (c) pairwise distance matrix (e.g. Euclidean) between P1, P2, P3, …, P20 with a zero diagonal. Agglomerative hierarchical clustering with a ‘linkage’ criterion builds a binary tree (hypothetical dendrogram); cutting it at 2 clusters gives 10 women and 10 men. Which feature has more discriminative power?]
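The procedure on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the six "persons" and their two feature values are invented toy data, Euclidean distance is used as on the slide, and average linkage is only one possible ‘linkage’ choice.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, n_clusters, linkage="average"):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    n = len(points)
    # Pairwise distance matrix between objects (zero diagonal).
    dist = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # start: every object is its own cluster
    merges = []                         # merge history, i.e. the dendrogram

    def cluster_distance(c1, c2):
        pairs = [dist[i][j] for i in c1 for j in c2]
        if linkage == "single":
            return min(pairs)
        if linkage == "complete":
            return max(pairs)
        return sum(pairs) / len(pairs)  # average linkage

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical toy data: (body length in m, interest in football between 0 and 1).
people = [(1.62, 0.1), (1.65, 0.2), (1.60, 0.0),
          (1.82, 0.9), (1.85, 0.8), (1.80, 1.0)]
clusters, merges = agglomerative(people, n_clusters=2)
groups = sorted(sorted(c) for c in clusters)
print(groups)
```

Stopping at `n_clusters` corresponds to cutting the dendrogram at that level; `merges` keeps the full merge order and merge distances for drawing the tree.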
Indexing in Vector Space Model
• Digital documents (Doc 1, Doc 2, Doc 3, …, Doc n) → text extraction (.txt)
• Neglect structure, stop word removal, stemming, phrase detection, …
• ‘Bags of words’ remain
• ‘Indexing’, weighting (e.g., TF-IDF)
• Term-by-document matrix A (rows: vocabulary; columns: documents)

          Doc 1   Doc 2   …   Doc n
Term 1    0.4     0.2     …   0
Term 2    0.1     0.55    …   0
Term 3    0.25    0       …   0
Term 4    0       0.16    …   0.03
…         …       …       …   …
Term m    0       0.21    …   0.42

• Similarity between documents = cosine of angle between vectors
[Figure: example digital document (title page of “Towards Mapping Library and Information Science”) passing through text extraction; Doc 1 and Doc 2 drawn as vectors in the plane spanned by Term 1 and Term 2]
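The indexing and similarity steps can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the three tokenized toy documents are invented, stop word removal and stemming are assumed to have happened already, and tf · log(n/df) is only one common TF-IDF variant.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # One column of the term-by-document matrix A.
        vectors.append([tf[term] * math.log(n / df[term]) for term in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine of the angle between two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical 'bags of words' after preprocessing.
docs = [
    ["protein", "structure", "prediction"],
    ["protein", "sequence", "alignment"],
    ["microarray", "gene", "expression"],
]
vocab, vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # share the term "protein"
sim_02 = cosine(vecs[0], vecs[2])  # no terms in common
print(sim_01, sim_02)
```

A text-based distance for clustering can then be taken as 1 minus the cosine similarity.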
Bibliometrics and network analysis
• Bibliographic coupling
[Figure: bibliographic coupling between two documents x and y]
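Bibliographic coupling links two documents through the references they share. The sketch below counts the shared references of x and y and turns the count into a distance. The cosine-style normalization is an assumption made for illustration, since the slides do not state which normalization was used, and the reference identifiers are invented.

```python
import math

def coupling_strength(refs_x, refs_y):
    """Number of cited references that documents x and y have in common."""
    return len(set(refs_x) & set(refs_y))

def coupling_distance(refs_x, refs_y):
    """1 - cosine-normalized coupling strength (assumed normalization)."""
    rx, ry = set(refs_x), set(refs_y)
    if not rx or not ry:
        return 1.0  # a document without references is coupled to nothing
    similarity = len(rx & ry) / math.sqrt(len(rx) * len(ry))
    return 1.0 - similarity

# Hypothetical reference lists of two documents.
refs_x = ["r1", "r2", "r3"]
refs_y = ["r2", "r3", "r4", "r5"]
strength = coupling_strength(refs_x, refs_y)
distance = coupling_distance(refs_x, refs_y)
print(strength, distance)
```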
Hybrid (integrated) clustering
• Integrate complementary information
• Textual content
• Citations
• Other bibliometric indicators
• Intermediate integration
• Pairwise distances calculated in separate spaces
• Incorporated before clustering
Hybrid clustering: intermediate integration
[Figure: text-based distance matrix Dtext and distance matrix based on bibliometrics Dbibl (both documents × documents, zero diagonal) are combined into the integrated distance matrix Di]
Integration by
• Weighted linear combination
• Fisher’s inverse chi-square method
Hierarchical clustering using
• Text-based distances
• Distances based on co-citation or bibliographic coupling
• Integrated distances
Internal validation: number of clusters?
• Dendrogram
• Silhouette curves
• Silhouette plot
• Stability diagram
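Silhouette values, one of the internal validation tools listed above, can be computed directly from any of the distance matrices. The sketch below assumes the standard definition s(i) = (b − a) / max(a, b), where a is the mean distance to the other members of the object's own cluster and b is the smallest mean distance to another cluster. The 4 × 4 distance matrix is invented toy data.

```python
def silhouette_values(dist, labels):
    """Silhouette value of every object; needs at least two clusters."""
    n = len(labels)
    values = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            values.append(0.0)  # singleton cluster: s(i) = 0 by convention
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == other)
            / sum(1 for j in range(n) if labels[j] == other)
            for other in set(labels) if other != labels[i]
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom else 0.0)
    return values

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical distance matrix: documents 0,1 are close, documents 2,3 are close.
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
good = silhouette_values(dist, [0, 0, 1, 1])
bad = silhouette_values(dist, [0, 1, 0, 1])
print(mean(good), mean(bad))
```

Plotting the mean silhouette value against the number of clusters gives a curve whose peaks suggest candidate cluster numbers.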
Weighted linear combination (linco)
• Di = α · Dtext + (1 − α) · Dbibl
• Attractive, easy, and scalable
• However, neglects differences in distributional characteristics!
• [Figure: histograms of mutual distances (<1) based on text (left) and bibliographic coupling, BC (right)]
• Unequal or unfair contribution of data sources
• Implicitly favoring text over bibliometric information or vice versa
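The weighted linear combination is a one-liner per matrix entry. The sketch below is illustrative only: the two 3 × 3 distance matrices are invented, chosen so that the text distances spread over a range while most coupling distances sit at 1, which is the kind of distributional difference the slide warns about.

```python
def linear_combination(d_text, d_bibl, alpha):
    """Integrated distances Di = alpha * Dtext + (1 - alpha) * Dbibl."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * t + (1.0 - alpha) * b for t, b in zip(row_t, row_b)]
        for row_t, row_b in zip(d_text, d_bibl)
    ]

# Hypothetical distance matrices (documents x documents, zero diagonal).
d_text = [[0.0, 0.6, 0.9],
          [0.6, 0.0, 0.8],
          [0.9, 0.8, 0.0]]
d_bibl = [[0.0, 1.0, 1.0],
          [1.0, 0.0, 0.7],
          [1.0, 0.7, 0.0]]
d_i = linear_combination(d_text, d_bibl, alpha=0.5)
print(d_i)
```

With α = 1 the result reduces to the text-only distances and with α = 0 to the bibliometric ones. Intermediate values mix raw distances without correcting for their different distributions.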
Fisher’s inverse chi-square method
• ‘Omnibus statistic’ from statistical meta-analysis
• Combine p-values from multiple sources
• Freed from distributional differences
• Avoids overcompensation of either data source
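For orientation, the classical statistic and the weighted two-source form that appears later in this presentation can be written out as follows. The use of the natural logarithm and the range of λ are assumptions. The chi-square distribution in the first line holds only for independent p-values under the null hypothesis.

```latex
% Classical Fisher statistic for K independent p-values:
X^2 = -2 \sum_{k=1}^{K} \ln p_k \;\sim\; \chi^2_{2K}

% Weighted two-source form (\lambda \in [0,1] weights text against citations):
p_i = -2 \ln\!\left( p_1^{\lambda}\, p_2^{\,1-\lambda} \right)
    = -2 \left[ \lambda \ln p_1 + (1-\lambda) \ln p_2 \right]
```

For λ = 1/2 the weighted form equals −(ln p1 + ln p2), which is half the classical statistic for K = 2.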
Fisher’s inverse chi-square method
[Figure: workflow, reconstructed from the panel labels]
1. ‘Real’ text data (terms × documents) and ‘real’ citation data (citations × documents) → distance matrices Dt and Dbc (documents × documents, zero diagonal)
2. Randomize both data sets → randomized text data and randomized citation data → distance matrices of the randomized data
3. Cumulative distribution function (cdf: cumulative share vs. distance) of the randomized distances, one per data source
4. Each real text distance y is mapped through the text cdf to p-value p1; each real citation distance z is mapped through the citation cdf to p-value p2 → two matrices of p-values
5. Fisher’s omnibus: pi = −2 · log(p1^λ · p2^(1−λ)) → integrated p-values → Di
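The workflow can be sketched in plain Python under explicit assumptions. The p-value of a real distance is taken here as the cumulative share of randomized distances that are less than or equal to it, as the cdf panels suggest. All distance values are invented. The slides do not show how the omnibus statistic is turned back into the entries of Di, so the sketch stops at the statistic.

```python
import math
from bisect import bisect_right

def empirical_p_values(real, randomized):
    """p-value of each real distance = share of randomized distances <= it."""
    reference = sorted(randomized)
    return [bisect_right(reference, d) / len(reference) for d in real]

def fisher_omnibus(p1, p2, lam=0.5, eps=1e-12):
    """Integrated statistic -2 * log(p1**lam * p2**(1 - lam))."""
    p1 = max(p1, eps)  # guard against log(0)
    p2 = max(p2, eps)
    return -2.0 * (lam * math.log(p1) + (1.0 - lam) * math.log(p2))

# Hypothetical distances for two document pairs and for randomized data.
real_text, rand_text = [0.65, 0.95], [0.6, 0.7, 0.8, 0.9, 1.0]
real_bc, rand_bc = [0.92, 1.0], [0.9, 0.95, 1.0, 1.0, 1.0]

p_text = empirical_p_values(real_text, rand_text)
p_bc = empirical_p_values(real_bc, rand_bc)
stats = [fisher_omnibus(a, b) for a, b in zip(p_text, p_bc)]
print(p_text, p_bc, stats)
```

Under this convention a small p-value means the real distance is unusually small compared with random data, so a larger statistic indicates a more closely related document pair. Because both sources pass through their own cdf first, their different distance distributions no longer affect the combination.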
Fisher’s inverse chi-square method
• [Figure: histogram of pairwise document distances for text and BC]
• [Figure: histogram of p-values for real data w.r.t. randomized datasets]
Conclusions from previous research
• Text-only >> cited references
• SVD greatly improves results, especially for text (LSI)
• Best performance: integration!
• Fisher’s inverse chi-square
• Significantly > text-only, link-only, & concatenation
• No significant difference from the linear combinations (linco) when SVD is used
• Generic: can incorporate distances with highly dissimilar distributions
• Weighted linco: good option if LSI is used
• F. Janssens, V. Tran Quoc, W. Glänzel, and B. De Moor. Integration of textual content and link information for accurate clustering of science fields. In Proceedings of the I International Conference on Multidisciplinary Information Sciences & Technologies (InSciT2006). Current Research in Information Sciences and Technologies, volume I, pages 615–619, Mérida, Spain, October 2006.
Dynamic hybrid mapping of bioinformatics
Total: 7401
Number of clusters and LSI factors
Number of clusters: stability diagram
Number of clusters: link-based Silhouette values
Dendrogram
1. RNA structure prediction
2. Protein structure prediction
3. Systems biology & molecular networks
4. Phylogeny & evolution
5. Genome sequencing & assembly
6. Gene/promoter/motif prediction
7. Molecular DBs & annotation platforms
8. Multiple sequence alignment
9. Microarray analysis
Dynamics
Dynamic term networks
Conclusions
• Main contributions
• Hybrid clustering (of bioinformatics)
• Clustering and classification significantly improved
• Generic: other application domains
• Further research
• Fuzzy clustering
• Semi-supervised clustering and active learning
• Spectral clustering
• Other matrix decompositions (e.g., NMF)
• Multilinear (tensor) algebra
• Mapping the world’s total yearly publication output
• Detect emerging and converging clusters & hot topics
• Science-technology interaction