Katholieke Universiteit Leuven – ESAT/SCD – Steunpunt O&O Indicatoren, 14-08-2007

Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Frizo Janssens, Wolfgang Glänzel, Bart De Moor


Introduction

Text mining

Bibliometrics & network analysis

Hybrid clustering

Dynamic hybrid

mapping of bioinformatics

Conclusions

Poster #17

Overview of the presentation

• Introduction
• General context & objectives
• Clustering
• Text mining framework
• Bibliometrics, citation analysis
• Hybrid (integrated) clustering
    • Linear combination
    • Fisher’s inverse chi-square method
• Dynamic hybrid mapping of bioinformatics
• Conclusions
• Further research


General context

• Mapping of scientific and technological fields by using clustering algorithms and techniques from bibliometrics and text mining
• Complementary views on a document set → different perceptions of similarity
    • Textual information: number of words in common
    • Citation networks, bibliometric properties
• Goal:
    • Integrate text mining & bibliometrics (hybrid approach)
    • Better clustering and classification performance
    • Mapping the cognitive structure and dynamics of bioinformatics


Agglomerative hierarchical clustering

[Figure: (a) a data matrix of 20 ‘objects’ (Person 1, Person 2, Person 3, …, Person 20; 10 women, 10 men) described by the features length and hair color; (b) the same objects with a third feature, interest in football, which may add discriminative power (?); (c) a distance matrix (e.g., Euclidean) between the persons P1, P2, P3, …, P20, with zeros on the diagonal. Agglomerative hierarchical clustering (‘linkage’) of these distances yields a binary tree (hypothetical dendrogram), here cut into 2 clusters.]
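As a minimal sketch of the procedure illustrated on this slide, the following pure-Python snippet builds a Euclidean distance matrix for a handful of hypothetical "persons" and merges clusters bottom-up until two clusters remain. The feature values, the choice of average linkage, and all function names are illustrative assumptions and are not taken from the slides.

```python
import math

def euclidean_matrix(points):
    """Pairwise Euclidean distances between feature vectors."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def average_linkage(dist, a, b):
    """Mean distance between all members of cluster a and cluster b."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

def agglomerate(dist, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    merges = []  # (cluster_a, cluster_b, linkage distance) per merge step
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(dist, clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical objects: (length in m, hair-color code, interest in football)
persons = [
    (1.62, 0.9, 0.0),
    (1.65, 0.8, 0.1),
    (1.60, 0.7, 0.0),
    (1.83, 0.2, 1.0),
    (1.80, 0.1, 0.9),
    (1.85, 0.3, 1.0),
]
D = euclidean_matrix(persons)
clusters, merges = agglomerate(D, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

The recorded merge steps are what a dendrogram visualizes: each entry joins two subtrees at the height given by the linkage distance.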


Indexing in Vector Space Model

• Text extraction: digital documents (Doc 1, Doc 2, Doc 3, …, Doc n) are converted to plain text (.txt)
• Neglect structure, stop word removal, stemming, phrase detection, …; ‘bags of words’ remain
• ‘Indexing’, weighting (e.g., TF-IDF)
• Similarity between documents = cosine of the angle between their vectors

Term-by-document matrix A (rows: vocabulary)

            Doc 1   Doc 2   ...   Doc n
  Term 1    0.4     0.2     ...   0
  Term 2    0.1     0.55    ...   0
  Term 3    0.25    0       ...   0
  Term 4    0       0.16    ...   0.03
  ...       ...     ...     ...   ...
  Term m    0       0.21    ...   0.42

[Figure: Doc 1, Doc 2, and Doc 3 drawn as vectors in the space spanned by Term 1 and Term 2. The example input document is the title page of “Towards Mapping Library and Information Science” (Frizo Janssens, Jacqueline Leta, Wolfgang …; affiliations in Leuven, Rio de Janeiro, and Budapest).]
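The indexing steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the three toy documents are invented, the weighting is one common TF-IDF variant (raw term frequency times log(n/df)), and stop word removal, stemming, and phrase detection are skipped.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vector per document) for tokenized documents.

    Each vector corresponds to one column of a term-by-document matrix.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical, already tokenized documents
docs = [
    "protein structure prediction".split(),
    "rna structure prediction".split(),
    "microarray gene expression".split(),
]
vocab, doc_vectors = tfidf_vectors(docs)
sim_01 = cosine(doc_vectors[0], doc_vectors[1])  # two shared terms
sim_02 = cosine(doc_vectors[0], doc_vectors[2])  # no shared terms
print(round(sim_01, 3), sim_02)
```

A text-based distance can then be taken as 1 minus the cosine similarity, which is one common convention and an assumption here.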


Bibliometrics and network analysis

• Bibliographic coupling

[Figure: two documents, x and y]
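Two documents are bibliographically coupled when their reference lists share at least one cited item. A minimal sketch follows; the reference identifiers are invented, and the cosine-style normalization and the conversion to a distance are assumptions rather than something stated on the slide.

```python
import math

def coupling_strength(refs_x, refs_y):
    """Number of cited references shared by documents x and y."""
    return len(set(refs_x) & set(refs_y))

def coupling_cosine(refs_x, refs_y):
    """Shared references normalized by the geometric mean of list lengths."""
    rx, ry = set(refs_x), set(refs_y)
    if not rx or not ry:
        return 0.0
    return len(rx & ry) / math.sqrt(len(rx) * len(ry))

# Hypothetical reference lists of documents x and y
refs_x = ["r1", "r2", "r3", "r4"]
refs_y = ["r3", "r4", "r5", "r6"]

strength = coupling_strength(refs_x, refs_y)
similarity = coupling_cosine(refs_x, refs_y)
distance = 1.0 - similarity  # assumed conversion from similarity to distance
print(strength, similarity, distance)
```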


Hybrid (integrated) clustering

• Integrate complementary information
    • Textual content
    • Citations
    • Other bibliometric indicators
• Intermediate integration
    • Pairwise distances calculated in separate spaces
    • Incorporated before clustering


Hybrid clustering: intermediate integration

[Figure: a text-based distance matrix Dtext and a distance matrix based on bibliometrics Dbibl (both documents × documents, with zero diagonal) are combined into an integrated distance matrix Di by either
    • a weighted linear combination, or
    • Fisher’s inverse chi-square method.]

Hierarchical clustering using
• Text-based distances
• Distances based on co-citation or bibliographic coupling
• Integrated distances

Internal validation: number of clusters?
• Dendrogram
• Silhouette curves
• Silhouette plot
• Stability diagram
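Silhouette values can be computed directly from a distance matrix and a cluster assignment, which makes them usable with text-based, citation-based, or integrated distances alike. The sketch below uses the standard silhouette definition; the small distance matrix and both labelings are invented for illustration, and at least two clusters are assumed.

```python
def silhouette_values(dist, labels):
    """Silhouette value per object, given pairwise distances and cluster labels."""
    members = {}
    for i, lab in enumerate(labels):
        members.setdefault(lab, []).append(i)
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in members[lab] if j != i]
        if not own:  # singleton cluster: silhouette defined as 0
            values.append(0.0)
            continue
        # a: mean distance to the other members of the object's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in idx) / len(idx)
            for other, idx in members.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom else 0.0)
    return values

# Hypothetical distance matrix for four documents
D = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]
good = silhouette_values(D, [0, 0, 1, 1])
bad = silhouette_values(D, [0, 1, 0, 1])
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
print(round(mean_good, 3), round(mean_bad, 3))
```

Plotting the mean silhouette value against the number of clusters gives a silhouette curve; a higher mean suggests a better-separated solution.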


Weighted linear combination (linco)

• $D_i = \alpha \cdot D_{text} + (1-\alpha) \cdot D_{bibl}$
• Attractive, easy, and scalable
• However, it neglects differences in distributional characteristics!
    • Unequal or unfair contribution of data sources
    • Implicitly favoring text over bibliometric information or vice versa

[Figure: histograms of mutual distances (<1) based on text (left) and bibliographic coupling, BC (right); axis tick labels: 140 000 and 700]
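The weighted linear combination is a one-liner per matrix entry. In the sketch below the two small distance matrices are invented; they merely mimic the situation on the slide, where the citation-based distances are concentrated near 1 while the text-based distances are spread more widely.

```python
def linco(d_text, d_bibl, alpha):
    """Entry-wise alpha * d_text + (1 - alpha) * d_bibl for equally sized matrices."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * t + (1.0 - alpha) * b for t, b in zip(row_t, row_b)]
        for row_t, row_b in zip(d_text, d_bibl)
    ]

# Hypothetical distance matrices for three documents
d_text = [
    [0.0, 0.4, 0.6],
    [0.4, 0.0, 0.5],
    [0.6, 0.5, 0.0],
]
d_bibl = [
    [0.0, 0.90, 1.00],
    [0.90, 0.0, 0.95],
    [1.00, 0.95, 0.0],
]
d_integrated = linco(d_text, d_bibl, alpha=0.5)
print(d_integrated[0][1])
```

Because the two inputs occupy different ranges, the same alpha does not give the sources equal influence on the ordering of pairs, which is the drawback named above.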


Fisher’s inverse chi-square method

• ‘Omnibus statistic’ from statistical meta-analysis
• Combines p-values from multiple sources
• Freed from distributional differences
• Avoids overcompensation of either data source


Fisher’s inverse chi-square method

[Figure: workflow of the method.
    1. Distance matrices Dt (text) and Dbc (bibliographic coupling), both documents × documents with zero diagonal, are computed from the ‘real’ text data (terms × documents) and the ‘real’ citation data (citations × documents).
    2. Both data sets are randomized, and distance matrices are computed from the randomized text data and the randomized citation data.
    3. For each data source, the cumulative distribution function (cdf; cumulative share against distance, both axes from 0 to 1) of the randomized distances converts each real distance (y for text, z for citations) into a p-value (p1 and p2), giving two documents × documents matrices of p-values.
    4. Fisher’s omnibus combines them into integrated p-values, $p_i = -2 \cdot \log\left(p_1^{\lambda} \cdot p_2^{1-\lambda}\right)$, which form the integrated matrix Di.]
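The p-value step and the omnibus formula can be sketched as follows. Several things here are assumptions: the null distances are invented stand-ins for distances computed on randomized data, the empirical p-value uses +1 smoothing so that the logarithm never sees zero, and the final conversion of the statistic into the matrix Di is left out because the slide does not spell it out.

```python
import math
from bisect import bisect_right

def empirical_p(distance, null_sorted):
    """Share of randomized-data distances <= distance, with +1 smoothing."""
    rank = bisect_right(null_sorted, distance)
    return (rank + 1) / (len(null_sorted) + 1)

def fisher_omnibus(p1, p2, lam=0.5):
    """-2 * log(p1**lam * p2**(1 - lam)), evaluated as a sum of logs."""
    return -2.0 * (lam * math.log(p1) + (1.0 - lam) * math.log(p2))

# Hypothetical distances obtained from randomized text and citation data
null_text = sorted([0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95])
null_bc = sorted([0.90, 0.92, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1.00])

# Pair A is unusually close in both spaces, pair B is not
p1_a = empirical_p(0.50, null_text)
p2_a = empirical_p(0.91, null_bc)
p1_b = empirical_p(0.93, null_text)
p2_b = empirical_p(0.99, null_bc)

stat_a = fisher_omnibus(p1_a, p2_a)
stat_b = fisher_omnibus(p1_b, p2_b)
print(round(stat_a, 3), round(stat_b, 3))
```

Each distance is judged against the null distribution of its own source, so the different scales of text and citation distances no longer matter; a larger statistic indicates a pair that is closer than chance in the combined view.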


Fisher’s inverse chi-square method

• Histogram of pairwise document distances for text and BC
• Histogram of p-values for real data w.r.t. randomized datasets


Conclusions from previous research

• Text-only >> cited references
• SVD greatly improves results, especially for text (LSI)
• Best performance: integration!
• Fisher's inverse chi-square
    • Significantly > text-only, link-only, & concatenation
    • No significant difference with lincos when SVD is used
    • Generic: can incorporate distances with highly dissimilar distributions
• Weighted linco: good option if LSI is used

• F. Janssens, V. Tran Quoc, W. Glänzel, and B. De Moor. Integration of textual content and link information for accurate clustering of science fields. In Proceedings of the I International Conference on Multidisciplinary Information Sciences & Technologies (InSciT2006). Current Research in Information Sciences and Technologies, volume I, pages 615–619, Mérida, Spain, October 2006.


Dynamic hybrid mapping of bioinformatics

Total: 7401


Number of clusters and LSI factors


Number of clusters: stability diagram


Number of clusters: link-based Silhouette values


Dendrogram

1. RNA structure prediction

2. Protein structure prediction

3. Systems biology & molecular networks

4. Phylogeny & evolution

5. Genome sequencing & assembly

6. Gene/promoter/motif prediction

7. Molecular DBs & annotation platforms

8. Multiple sequence alignment

9. Microarray analysis


Dynamics


Dynamic term networks


Conclusions

• Main contributions
    • Hybrid clustering (of bioinformatics)
    • Clustering and classification significantly improved
    • Generic: other application domains
• Further research
    • Fuzzy clustering
    • Semi-supervised clustering and active learning
    • Spectral clustering
    • Other matrix decompositions (e.g., NMF)
    • Multilinear (tensor) algebra
    • Mapping the world’s total yearly publication output
    • Detect emerging and converging clusters & hot topics
    • Science-technology interaction
