Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis. Frizo Janssens, Wolfgang Glänzel, and Bart De Moor. Presented by Cindy Burklow. CS 685: Special Topics in Data Mining, Professor Dr. Jinze Liu, University of Kentucky, April 17th, 2008.


Page 1: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Frizo Janssens, Wolfgang Glänzel, and Bart De Moor

Presented by Cindy Burklow

CS 685: Special Topics in Data Mining
Professor Dr. Jinze Liu
University of Kentucky
April 17th, 2008

Page 2: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Outline
Introduction
Motivation
Related Work
Proposed Models
Proposed Algorithms
Results: Hybrid & Dynamic Clustering
Discussion of Pros and Cons
Questions
References

Page 3: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Introduction
Bioinformatics …
◦ Computer Science
◦ Information Technology
◦ Solves problems in Biomedicine
Goal of Paper: Investigate
◦ Cognitive structure
◦ Dynamics of bioinformatics core
◦ Sub-disciplines
◦ ISI Web of Science & MEDLINE
◦ Retrieval of core literature in bioinformatics

Page 4: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

MeSH = Medical Subject Headings

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 360, 368, KDD '07. ACM, San Jose, CA, August 2007.

Page 5: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis
Page 6: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis
Page 7: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Motivation
Bioinformatics field …
◦ Dynamic
◦ Evolving discipline
◦ Fast growth rate
Monitor current trends
Predict future direction
Decision Making
◦ Grants
◦ Business Ventures
◦ Research Opportunities

Page 8: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.

Page 9: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Related Work
Web mining
Bibliometrics
Text mining & citation analysis
◦ Mapping of knowledge
◦ Charting science & technology fields
Textual & graph-based approaches
◦ Different perceptions of similarity between documents or groups of documents

Page 10: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Related Work

Establishing the Data Set
Patra & Mishra – Bibliometric Study
◦ MeSH term based
◦ Liberal delineation strategy with maximal recall
◦ Broader interpretation of bioinformatics
◦ Less restricted search strategy
◦ Broader coverage of underlying database
◦ 14,563 journal papers

Page 11: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Related Work
Hybrid Clustering
◦ He – Unsupervised spectral clustering of web pages
◦ Wang & Kitsuregawa – Contents-link coupled clustering algorithm of web pages
Dynamic hybrid clustering
◦ Mei & Zhai – Temporal Text Mining
◦ Kullback-Leibler divergence for coherent themes & Hidden Markov Models
◦ Griffiths & Steyvers – Latent Dirichlet Allocation with hot topics in PNAS abstracts

Page 12: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Models: Data Set
Bibliometric Retrieval Strategy
Novel subject delineation strategy
◦ Retrieve core literature
◦ Combines textual components & bibliometrics, citation-based techniques
◦ Web of Science Edition of Thomson Scientific: 7401 bioinformatics-related papers, 1981 to 2004; titles, abstracts, author keywords, and MeSH terms

Page 13: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Models – Text Analysis
◦ All text was indexed with the Jakarta Lucene Platform
◦ Encoded in Vector Space Model using TF-IDF weighting scheme
◦ Text-based similarities: cosine of the angle between the vector representations of two papers
◦ Stop words excluded during indexing
◦ Porter Stemmer: all remaining terms from titles and abstracts
◦ Bigrams: candidate list of MeSH descriptors, author keywords, and noun phrases
◦ Latent Semantic Indexing (LSI) – 10 terms
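The text-similarity step on this slide can be illustrated with a minimal Python sketch. It assumes a textbook TF-IDF variant (raw term frequency times log inverse document frequency), which is not necessarily the weighting Lucene applies, and it skips stemming, bigram detection, and LSI. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (dict: term -> weight).

    Assumption: weight = raw term count * ln(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({t: counts[t] * math.log(n_docs / doc_freq[t]) for t in counts})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus: tokens already lowercased and stop words removed.
toy_docs = [
    ["protein", "structure", "prediction"],
    ["protein", "structure", "alignment"],
    ["microarray", "gene", "expression"],
]
toy_vectors = tfidf_vectors(toy_docs)
```

In this toy corpus the first two papers share two terms, so their cosine is positive, while the first and third share none and score zero.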

Page 14: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Models – Citation Analysis
Citation Graphs
Link-based algorithms
◦ HITS
◦ PageRank
Representative Publications
[Diagram labels: Documents; Quantify Similarities; Text-based; Citation-based; Co-citation; Bibliographic coupling (BC); Boolean Input Vectors; Cosine; Combine]

Image Reference: Google Logo from http://www.google.com
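As a rough illustration of the citation-based similarity named in the diagram, the sketch below computes bibliographic coupling as the cosine between two Boolean reference vectors. For Boolean vectors this reduces to the number of shared references divided by the geometric mean of the two reference-list lengths. The function name and reference identifiers are hypothetical, and the slide does not say exactly how the authors normalize the measure.

```python
import math


def bibliographic_coupling(refs_a, refs_b):
    """Cosine similarity between two papers' Boolean reference vectors.

    Each paper is represented by the set of works it cites; the cosine of the
    corresponding 0/1 vectors is |A & B| / sqrt(|A| * |B|).
    """
    a, b = set(refs_a), set(refs_b)
    if not a or not b:
        return 0.0  # a paper without references cannot be coupled
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical reference lists: two shared references out of four each.
paper_1 = ["r1", "r2", "r3", "r4"]
paper_2 = ["r3", "r4", "r5", "r6"]
coupling = bibliographic_coupling(paper_1, paper_2)
```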

Page 15: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Models – Clustering
Agglomerative Hierarchical Clustering Algorithm with Ward’s Method
Hard Clustering Algorithm:
◦ Every publication is assigned to exactly 1 cluster.

Image Reference: Clustering Analysis - http://en.wikipedia.org/wiki/Data_clustering
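A minimal sketch of agglomerative clustering with Ward's criterion is shown below. It takes a precomputed distance matrix and uses the Lance-Williams update on squared distances. This is a naive illustration rather than the authors' implementation, and the function name and toy data are hypothetical. Cutting the hierarchy at `k` clusters yields a hard clustering: each item ends up in exactly one cluster.

```python
def ward_clusters(dist, k):
    """Merge items bottom-up with Ward's criterion until k clusters remain.

    dist: symmetric matrix of pairwise distances (list of lists), k >= 1.
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}
    # Squared distances between active clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > k:
        a, b = min(d2, key=d2.get)  # closest pair under Ward's criterion
        na, nb = len(clusters[a]), len(clusters[b])
        updated = {}
        for c, members in clusters.items():
            if c in (a, b):
                continue
            nc = len(members)
            dac = d2[tuple(sorted((a, c)))]
            dbc = d2[tuple(sorted((b, c)))]
            # Lance-Williams update for Ward's method (squared distances).
            updated[c] = ((na + nc) * dac + (nb + nc) * dbc - nc * d2[(a, b)]) / (na + nb + nc)
        merged = clusters.pop(a) + clusters.pop(b)
        d2 = {pair: v for pair, v in d2.items() if a not in pair and b not in pair}
        for c, v in updated.items():
            d2[(c, next_id)] = v  # new ids are always larger than existing ones
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical 1-D toy data: two tight groups far apart.
points = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(p - q) for q in points] for p in points]
toy_partition = ward_clusters(toy_dist, 2)
```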

Page 16: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Models – Clustering
Optimal number of clusters
Combine Distance-based & Stability-based Methods Strategy
Dendrogram observation
Silhouette Curves: Mean text- and citation-based
Stability Diagram
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, 365, KDD '07. ACM, San Jose, CA, August 2007.
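The silhouette criterion mentioned on this slide can be sketched as follows. For each item, `a` is the mean distance to the other members of its own cluster and `b` is the lowest mean distance to any other cluster; the silhouette value is `(b - a) / max(a, b)`. A silhouette curve then plots the mean value against the number of clusters. The function names are hypothetical, and giving singleton clusters a value of 0 is a common convention assumed here.

```python
def silhouette_values(dist, labels):
    """Per-item silhouette values from a distance matrix and hard cluster labels."""
    members = {}
    for i, lab in enumerate(labels):
        members.setdefault(lab, []).append(i)
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in members[lab] if j != i]
        if not own or len(members) < 2:
            values.append(0.0)  # convention for singletons or a single cluster
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in idx) / len(idx)
            for other, idx in members.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def mean_silhouette(dist, labels):
    """Average silhouette value over all items; higher means better separation."""
    vals = silhouette_values(dist, labels)
    return sum(vals) / len(vals)


# Hypothetical toy data: compare a sensible labeling with a poor one.
points = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(p - q) for q in points] for p in points]
good = mean_silhouette(toy_dist, [0, 0, 1, 1])
poor = mean_silhouette(toy_dist, [0, 1, 0, 1])
```

Evaluating `mean_silhouette` for each candidate number of clusters and looking for a peak is one way to read such a curve. The slide combines it with the dendrogram and a stability diagram rather than relying on it alone.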

Page 17: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Proposed Algorithm – Hybrid Clustering
Cluster Input: Distances
Combining text mining and bibliometrics
◦ Integrate text & citation info early in the mapping process, before applying the clustering algorithm
Weighted linear combination
Fisher’s inverse chi-square method
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 362, 363, KDD '07. ACM, San Jose, CA, August 2007.
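The two integration options named on this slide can be sketched in Python. The first is a weighted linear combination of a text-based and a citation-based distance. The second is the standard form of Fisher's method for two p-values: the statistic `-2 * (ln p1 + ln p2)` follows a chi-square distribution with 4 degrees of freedom, whose tail probability has the closed form `p1*p2 * (1 - ln(p1*p2))`. The slides do not show how the paper converts distances to p-values or whether it weights them, so the inputs here are assumed to be p-values already; function names and the `alpha` parameter are hypothetical.

```python
import math


def weighted_linear(d_text, d_cite, alpha=0.5):
    """Weighted linear combination of a text-based and a citation-based distance.

    alpha is a hypothetical weight in [0, 1] placed on the text-based distance.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * d_text + (1.0 - alpha) * d_cite


def fisher_combine(p_text, p_cite):
    """Combine two independent p-values with Fisher's method (unweighted).

    Uses the closed-form chi-square tail for 4 degrees of freedom:
    P(X >= x) = exp(-x / 2) * (1 + x / 2), with x = -2 * ln(p_text * p_cite).
    """
    product = p_text * p_cite
    if product <= 0.0:
        return 0.0  # avoid log(0); a zero p-value dominates the combination
    return product * (1.0 - math.log(product))


# Hypothetical inputs for one pair of papers.
hybrid_distance = weighted_linear(0.2, 0.6, alpha=0.5)
hybrid_p_value = fisher_combine(0.05, 0.05)
```

One design difference worth noting: the linear combination requires both distances to be on comparable scales, whereas combining p-values sidesteps the scale question by first mapping each distance onto a probability.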

Page 18: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 363, KDD '07. ACM, San Jose, CA, August 2007.

Page 19: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Proposed Algorithm – Dynamic Hybrid Clustering
Goal: Match & track clusters through time
Process:
◦ Separate hybrid clustering for each period
◦ Determine optimal number of clusters: dendrogram, silhouette curve, Ben-Hur stability plot
◦ Construct complete graph: all cluster centroids from each period as nodes; edge weights as mutual cosine similarities in LSS
◦ Form cluster chains: keep edge weights > threshold T1; allow qualifying clusters to join > threshold T2
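The chain-forming step can be sketched roughly as below. The slide gives only the two thresholds, so the details are assumptions made for illustration: only clusters from different periods are compared, edges above `t1` link clusters into chains (connected components), and a cluster left unlinked may join the chain of its most similar already-linked neighbour if that similarity exceeds `t2`. The function names and toy centroids are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two dense vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_chains(centroids, t1, t2):
    """Group per-period clusters into chains.

    centroids: dict mapping (period, cluster_id) -> centroid vector.
    Returns chains with at least two clusters, each a sorted list of keys.
    """
    nodes = list(centroids)
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    # Assumption: only clusters from different periods are matched.
    sims = {
        (a, b): cosine(centroids[a], centroids[b])
        for a, b in combinations(nodes, 2)
        if a[0] != b[0]
    }
    # Step 1: strong edges (similarity > t1) define the initial chains.
    linked = set()
    for (a, b), sim in sims.items():
        if sim > t1:
            parent[find(a)] = find(b)
            linked.update((a, b))
    # Step 2: unlinked clusters join their best linked neighbour if above t2.
    for n in nodes:
        if n in linked:
            continue
        best, best_sim = None, t2
        for (a, b), sim in sims.items():
            other = b if a == n else a if b == n else None
            if other in linked and sim > best_sim:
                best, best_sim = other, sim
        if best is not None:
            parent[find(n)] = find(best)
    chains = {}
    for n in nodes:
        chains.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in chains.values() if len(c) > 1)


# Hypothetical centroids for three periods in a 2-D space.
toy_centroids = {
    (1, "a"): [1.0, 0.0],
    (2, "a"): [0.9, 0.1],
    (2, "b"): [0.0, 1.0],
    (3, "a"): [0.8, 0.6],
}
toy_chains = cluster_chains(toy_centroids, t1=0.9, t2=0.7)
```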

Page 20: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.

Page 21: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.

Page 22: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Results – Hybrid Clustering
Silhouette Curve

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, KDD '07. ACM, San Jose, CA, August 2007.

Page 23: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Results – Hybrid Clustering
Silhouette Curve

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, KDD '07. ACM, San Jose, CA, August 2007.

Page 24: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Results – Hybrid Clustering
Stability

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.

Page 25: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Results – Hybrid Clustering
Dendrogram

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.

Page 26: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Results – Hybrid Clustering
Cluster Characterization (cluster: number of papers)
◦ RNA structure prediction: 205
◦ Protein structure prediction: 1167
◦ Systems biology & molecular networks: 694
◦ Phylogeny & evolution: 749
◦ Genome sequencing & assembly: 640
◦ Gene / promoter / motif prediction: 995
◦ Molecular DBs & annotation platforms: 1091
◦ Multiple sequence alignment: 713
◦ Microarray analysis: 1147

Page 27: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Results – Dynamic Clustering
Histogram

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.

Page 28: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Results – Dynamic Clustering
Cluster Chains

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.

Page 29: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Yearly Publication Output among Cluster Chains

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.

Page 30: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Dynamic Term Network

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.

Page 31: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Pros & Cons
Pros
◦ Offers fresh perspective on clustering
◦ Integrates various techniques
◦ Provides insight into bioinformatics
Cons
◦ Challenge of selecting the optimal number of clusters still exists
◦ There are many steps required to implement their approach

Page 32: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Questions

Page 33: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

References
Janssens, F., Glänzel, W., and De Moor, B. 2007. Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12-15, 2007). KDD '07. ACM, New York, NY, 360-369. DOI= http://doi.acm.org/10.1145/1281192.1281233
ISI Web of Science Image: http://apps.isiknowledge.com/WOS_GeneralSearch_input.do?highlighted_tab=WOS&product=WOS&last_prod=WOS&SID=3DamC8GFDKmpBLhFOIM&search_mode=GeneralSearch
PubMed Image: http://www.ncbi.nlm.nih.gov/pubmed/
The Apache Jakarta Project: http://lucene.apache.org/java/1_4_3/
Fisher’s Method: http://en.wikipedia.org/wiki/Fisher%27s_method
“Data Mining - Concepts and Techniques” by Han and Kamber, Morgan Kaufmann, 2006. (ISBN: 1-55860-901-6)