Data Base and Data Mining Group of Politecnico di Torino (DBMG)

Identifying collaborations among researchers: a pattern-based approach

Tokyo, August 11 2017

Elena Baralis, Luca Cagliero, Mohammad Reza Kavoosifar, Paolo Garza

wing.comp.nus.edu.sg/~birndl-sigir2017/BIRNDL_Cagliero.pdf

Outline
  • Analyzed data
  • Addressed problem
  • Related works
  • The pattern-based approach
  • Experimental results
  • Conclusions and future work

Analyzed data
  • Electronic versions of scientific publications
  • Available through Digital Libraries (DL) and online databases, e.g., PubMed, OMIM
  • Publishers’ digital libraries give limited or free access to
    • conference proceedings
    • books
    • journal papers

Analyzed data
  • How to find the scientific publications of major interest?
    • Topic-driven searches
    • Author-driven searches
    • All of the above

Analyzed data
  • What are the most relevant publications written by an author?
    • Author-driven query
    • Publications are ranked by
      • number of received citations
      • date
      • popularity (e.g., number of reads)

Analyzed data
  • What are the most relevant publications written by an author on a specific topic?
    • Author- and topic-driven query
    • The author’s publications covering the topic under analysis are selected and ranked

Analyzed data
  • What are the most fruitful collaborations among multiple authors?
    • No deterministic solution
    • Hard to solve using simple queries
      • For each topic?
      • For each combination of authors?
      • How to combine and rank the results?

Addressed problem
  • Issue
    • Identify fruitful collaborations among researchers

Addressed problem
  • Expected result (automatically inferred from DL data)
    • A list of significant topics
    • For each topic, the groups of researchers who have produced the most relevant publication records
    • Groups of researchers of arbitrary size
    • Ranked lists (of both topics and groups)

Related works
  • Citation content analysis
    • Analyze position in the text and semantics of citations
    • E.g., [Zhang et al., JASIST 2013], [Kim et al., BIRNDL 2016]
  • Researcher networks
    • Profile researchers and compute similarities
    • E.g., ArnetMiner [Tang et al., KDD 2008]
  • Reviewer assignment
    • Assist editors in the peer review of scientific papers
    • Given a pool of candidate reviewers, what papers should be assigned to each of them?
    • E.g., [Kou et al., SIGMOD 2015], [Kou et al., VLDB 2015]

The pattern-based solution
  • Unsupervised data mining approach
    • Apply an itemset mining algorithm
    • Discover patterns representing the most significant correlations between authors and topics
      • The Authors-Topic Patterns (ATPs)
    • Group and rank ATPs to ease manual exploration

Weighted itemset mining
  • Weighted transactional data
    • Set of weighted transactions
    • Each transaction represents a different publication
    • Each transaction consists of a set of items
    • Items are either authors or topics
    • Transactions are weighted by a relevance weight (e.g., the number of received citations)
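The weighted transactional data described above could be represented roughly as follows. This is a minimal sketch, not the authors' implementation: the `WeightedTransaction` class, the `make_transaction` helper, and the toy author names and citation counts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightedTransaction:
    """One publication: its items (authors and topics) plus a relevance weight."""
    items: frozenset  # e.g. {("Author", "Smith L."), ("Topic", "Z")}
    weight: float     # e.g. the number of received citations


def make_transaction(authors, topics, citations):
    """Build a transaction whose items are tagged as either Author or Topic."""
    author_items = frozenset(("Author", a) for a in authors)
    topic_items = frozenset(("Topic", t) for t in topics)
    return WeightedTransaction(items=author_items | topic_items,
                               weight=float(citations))


# Hypothetical toy dataset: names and citation counts are invented.
dataset = [
    make_transaction(["Smith L.", "Rossi M."], ["Z"], 40),
    make_transaction(["Smith L."], ["Z", "Y"], 10),
    make_transaction(["Rossi M."], ["Y"], 5),
]
```

Tagging each item with its kind keeps authors and topics distinguishable even when they share a transaction, which later pattern definitions rely on.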

Weighted itemset mining
  • Weighted itemsets
    • A weighted k-itemset is a set of k items that co-occur in a weighted dataset (e.g., {(Author: Smith L.), (Topic, Z)} is a 2-itemset)
    • The traditional support of an itemset in a weighted dataset is its observed frequency of occurrence, i.e., it disregards transaction weights
  • Extraction task
    • Discover all itemsets whose support is above a given (user-specified) threshold
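To make the notion of traditional support concrete, the sketch below counts occurrences while ignoring weights, then enumerates all itemsets that reach a threshold. It is an exhaustive enumeration suitable only for toy inputs, not an efficient miner. The function names, the `minsup` parameter, the inclusive threshold, and the toy data are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy data: (items, weight) pairs with invented values.
transactions = [
    (frozenset({("Author", "Smith L."), ("Author", "Rossi M."), ("Topic", "Z")}), 40),
    (frozenset({("Author", "Smith L."), ("Topic", "Z")}), 10),
    (frozenset({("Author", "Rossi M."), ("Topic", "Y")}), 5),
]


def support(itemset, transactions):
    """Traditional support: number of transactions containing the itemset.

    Transaction weights are deliberately ignored.
    """
    return sum(1 for items, _weight in transactions if itemset <= items)


def frequent_itemsets(transactions, minsup):
    """Enumerate every itemset with support >= minsup (assumed inclusive)."""
    all_items = sorted(set().union(*(items for items, _ in transactions)))
    result = {}
    for k in range(1, len(all_items) + 1):
        for combo in combinations(all_items, k):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= minsup:
                result[candidate] = count
    return result
```

With `minsup=2` on this toy data, the pair {(Author: Smith L.), (Topic, Z)} is retained because it occurs in two transactions, regardless of how many citations those publications received.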

Authors-Topic Pattern
  • Pattern definition and characteristics
    • An ATP is a combination of a set of author items (one or more) and a topic item, i.e., k items that co-occur in a weighted dataset, e.g., {(Author: Smith L.), (Topic, Z)}
    • The influence of an ATP I in a dataset D is a linear combination of the number of citations C(pj) of the publications pj associated with transactions in D
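One possible reading of the influence measure can be sketched as follows. The slides do not give the coefficients of the linear combination, so this sketch assumes a single constant coefficient (`coeff`, default 1.0) and assumes that only publications whose transaction contains the whole pattern contribute. It also assumes an ATP has exactly one topic item. The function names and toy data are hypothetical.

```python
# Hypothetical toy data: (items, citations) pairs with invented values.
publications = [
    (frozenset({("Author", "Smith L."), ("Author", "Rossi M."), ("Topic", "Z")}), 40),
    (frozenset({("Author", "Smith L."), ("Topic", "Z")}), 10),
    (frozenset({("Author", "Rossi M."), ("Topic", "Y")}), 5),
]


def is_atp(itemset):
    """Check the assumed ATP shape: one or more authors and exactly one topic."""
    kinds = [kind for kind, _value in itemset]
    n_authors = kinds.count("Author")
    n_topics = kinds.count("Topic")
    return n_authors >= 1 and n_topics == 1 and n_authors + n_topics == len(kinds)


def influence(itemset, publications, coeff=1.0):
    """Assumed influence: coeff times the citations C(p_j), summed over the
    publications whose transaction contains every item of the pattern."""
    return sum(coeff * citations
               for items, citations in publications
               if itemset <= items)
```

Under these assumptions, the pattern {(Author: Smith L.), (Topic, Z)} collects the citations of both publications that Smith L. wrote on topic Z, whereas the three-item pattern that adds Rossi M. collects only those of the joint publication.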

ATP mining
  • Extraction task
    • Extract all ATPs whose influence is above a given threshold mininf
    • FP-Growth-like extraction [Cagliero & Garza, TKDE 2013]
  • ATP clustering and ranking
    • ATPs are grouped by topic and length (i.e., the collaboration group size) and ranked by decreasing influence
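The extraction and ranking steps above can be illustrated with a small sketch. It replaces the FP-Growth-like algorithm cited on the slide with plain exhaustive enumeration of author subsets, which is practical only for toy inputs. It also assumes that influence is the unweighted sum of citations and that the `mininf` threshold is inclusive. All function names and data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (authors, topics, citations) with invented values.
publications = [
    ({"Smith L.", "Rossi M."}, {"Z"}, 40),
    ({"Smith L."}, {"Z"}, 10),
    ({"Rossi M."}, {"Y"}, 5),
]


def mine_atps(publications, mininf):
    """Return {(author_group, topic): influence} for ATPs with influence >= mininf.

    Exhaustive enumeration, not the FP-Growth-like method cited on the slide.
    """
    influence = defaultdict(int)
    for authors, topics, citations in publications:
        for topic in topics:
            for size in range(1, len(authors) + 1):
                for group in combinations(sorted(authors), size):
                    influence[(group, topic)] += citations
    return {atp: inf for atp, inf in influence.items() if inf >= mininf}


def group_and_rank(atps):
    """Group ATPs by (topic, group size) and rank by decreasing influence."""
    groups = defaultdict(list)
    for (authors, topic), inf in atps.items():
        groups[(topic, len(authors))].append((authors, inf))
    for key in groups:
        # Ties are broken alphabetically to keep the output deterministic.
        groups[key].sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(groups)


ranked = group_and_rank(mine_atps(publications, mininf=10))
```

Grouping by topic and group size means that a two-author team is compared only with other two-author teams on the same topic, so larger groups are not penalized for the naturally smaller number of publications they share.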

Case study
  • Real context
    • Discovery of research collaborations that have conducted influential studies on genomics and genetics
    • Data acquired from the open Online Mendelian Inheritance in Man (OMIM) Digital Library
      • Part of the National Center for Biotechnology Information (NCBI) system of databases
    • For each genetic disorder
      • The list of related publications
      • The authors of each publication
      • The set of genes correlated with the disorder

Case study
  • Pattern validation
    • For each gene and genetic disorder pick the top-5 ATPs
    • Research question A: Are the research team and the topic really correlated with each other?
      • Comparison with top-3 publications returned by author-driven queries on PubMed
    • Research question B: Among the topics addressed by the team, is the topic indicated in the pattern the most influential one?
      • Comparison with top-ranked publication according to PubMed search

Case study
  • Results for top-5 ATPs on genetic disorders

Conclusions and future work
  • Summarizing…
    • Knowledge discovery from DL data
    • Pattern-based solution to identify fruitful collaborations between researchers
    • New type of interpretable pattern modelling correlation between authors and topics
    • Promising results on data related to genomics/genetics
  • Ongoing work
    • Integration of more advanced topic detection algorithms
    • Differentiate authors’ contribution based on their position in the authors’ list
    • Application to reviewer assignment problem

Conclusions

Thanks for the attention.

Questions?