
Sparsification and Sampling of Networks for Collective Classification


Page 1: Sparsification and Sampling of Networks for Collective Classification


Sparsification and Sampling of Networks for Collective Classification

Tanwistha Saha, Huzefa Rangwala and Carlotta Domeniconi
Department of Computer Science
George Mason University
Fairfax, VA, USA

Page 2: Sparsification and Sampling of Networks for Collective Classification

Outline

Introduction

Motivation

Related Work

Proposed Methods

Results

Conclusion and Future Work

Page 3: Sparsification and Sampling of Networks for Collective Classification

Sparsification and Sampling of Networks for Collective Classification

Given:

Partially labeled weighted network

Node attributes for all the nodes

Goal:

Predict the labels of the unlabeled nodes in the network

Points to consider:

Networks with fewer edges can be formed using sparsification algorithms

The selection of labeled nodes for training influences the overall accuracy, which motivates research on sampling algorithms for collective classification

Page 4: Sparsification and Sampling of Networks for Collective Classification

Sample Input Network (partially labeled)

Page 5: Sparsification and Sampling of Networks for Collective Classification

Relational Network Sparsification

The study of networks involves relational learning

A relational network consists of nodes representing entities and edges representing pairwise interactions

Edges can be weighted or unweighted

Weights represent the similarity between pairs of nodes

Edges with low weights do not carry much information, so we can remove them based on some criterion

Sparsify the network without losing much information (a minimal sketch of this idea follows)
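Assuming a networkx graph whose edge weights store the similarity scores, a thresholding helper might look like this (the name and threshold are illustrative, not the exact method evaluated later):

```python
import networkx as nx

def threshold_sparsify(G, min_weight):
    """Return a copy of G that keeps only edges with weight >= min_weight."""
    H = nx.Graph()
    H.add_nodes_from(G.nodes(data=True))  # keep every node and its attributes
    H.add_edges_from((u, v, d) for u, v, d in G.edges(data=True)
                     if d.get("weight", 1.0) >= min_weight)
    return H

# e.g., H = threshold_sparsify(G, 0.3) keeps edges scored at least 0.3
```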

Page 6: Sparsification and Sampling of Networks for Collective Classification

Example: Network with noisy edges

Page 7: Sparsification and Sampling of Networks for Collective Classification

Example: Noisy edges removed!

Page 8: Sparsification and Sampling of Networks for Collective Classification

Importance of Sparsification in Networks

Problems:

Data analysis is time-consuming

Noisy edges do not convey useful information in relational data

Solutions:

Identify and remove the noisy edges

Make sure to remove noisy edges only, and not the others!

Classify the unlabeled nodes in the sparsified network using collective classification, and compare the results with the unsparsified network

Page 9: Sparsification and Sampling of Networks for Collective Classification

Graph sparsification methods for clustering

(GS) Global Graph Sparsification (Satuluri et al. SIGMOD 2011)

(LS) Local Graph Sparsification (Satuluri et al. SIGMOD 2011)

Drawbacks:

The methods were designed for fast clustering and are not well suited for classification

All edges are treated equally

The sparsified network becomes more disconnected (a sketch of the two selection rules follows)
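For reference, a rough sketch of the two rules, assuming an undirected networkx graph whose edge weights already hold the similarity scores (Satuluri et al. actually estimate structural similarity via minwise hashing, so this is a simplification):

```python
import math
import networkx as nx

def global_sparsify(G, keep_frac):
    """GS sketch: keep the top keep_frac fraction of edges ranked globally."""
    ranked = sorted(G.edges(data=True),
                    key=lambda e: e[2].get("weight", 1.0), reverse=True)
    H = nx.Graph()
    H.add_nodes_from(G.nodes())
    H.add_edges_from(ranked[: int(keep_frac * G.number_of_edges())])
    return H

def local_sparsify(G, e=0.5):
    """LS sketch: each node of degree d keeps its ceil(d**e) strongest edges."""
    H = nx.Graph()
    H.add_nodes_from(G.nodes())
    for u in G:
        nbrs = sorted(G[u].items(),
                      key=lambda kv: kv[1].get("weight", 1.0), reverse=True)
        for v, d in nbrs[: math.ceil(G.degree(u) ** e)]:
            H.add_edge(u, v, **d)
    return H
```

The global rule can strand weakly connected regions entirely, while the local rule spreads the edge budget across nodes; neither checks connectivity, which motivates the adaptive variant below.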

Page 10: Sparsification and Sampling of Networks for Collective Classification

Global Graph Sparsification

(Satuluri et al. SIGMOD 2011)

(Figure: the globally sparsified network, showing singleton nodes and a disconnected component)

Page 11: Sparsification and Sampling of Networks for Collective Classification

Local Graph Sparsification

(Satuluri et al. SIGMOD 2011)

In addition to the edges marked red, some more edges (marked blue) were removed. The edges removed by this method are not necessarily a superset of the edges removed by the global sparsification method.

(Figure annotation: removal of this edge disconnects the graph)

Page 12: Sparsification and Sampling of Networks for Collective Classification

Adaptive Global Sparsifier

Aims to address the drawbacks of LS and GS

Does not remove an edge if its removal would make the graph more disconnected

Note:

This method is less aggressive in removing edges compared to local and global sparsification algorithms by Satuluri et al.

(Saha et al. SBP 2013)
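A minimal sketch consistent with this description, assuming an undirected networkx graph with similarity scores stored as edge weights (the exact criteria in Saha et al. SBP 2013 may differ):

```python
import networkx as nx

def adaptive_global_sparsify(G, keep_frac):
    """Drop low-similarity edges, but never one whose removal would
    increase the number of connected components."""
    H = G.copy()
    n_remove = G.number_of_edges() - int(keep_frac * G.number_of_edges())
    # Consider the lowest-scoring edges for removal first.
    ranked = sorted(G.edges(data="weight", default=1.0), key=lambda e: e[2])
    for u, v, w in ranked[:n_remove]:
        before = nx.number_connected_components(H)
        H.remove_edge(u, v)
        if nx.number_connected_components(H) > before:
            H.add_edge(u, v, weight=w)  # restore it (the "mauve" edges below)
    return H
```

Testing whether (u, v) is a bridge would be cheaper than recounting components, but the direct check stays closest to the slides' description.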

Page 13: Sparsification and Sampling of Networks for Collective Classification

Adaptive Global Sparsifier

Keep the edges with top similarity scores (here, score >= 0.3)

Page 14: Sparsification and Sampling of Networks for Collective Classification

Adaptive Global Sparsifier (contd.)

Removing red edges doesn’t increase the number of connected components

Mauve-colored edges have low similarity scores, but we put them back to avoid disconnecting components

Page 15: Sparsification and Sampling of Networks for Collective Classification

Collective Classification in Networks

Input: A graph G = (V, E) with a given percentage of labeled nodes for training, and node features for all the nodes

Output: Predicted labels of the test nodes

Model:

1. Relational features and node features are used to train a local classifier on the labeled nodes

2. Test node labels are initialized with the labels predicted by the local classifier from node attributes alone

3. Inference proceeds through iterative classification of the test nodes until a convergence criterion is reached (a compact sketch follows the figure)

(Figure: a network of researchers with labels MLDM, SW, AI and Bio, and one unlabeled node marked "?")
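A compact sketch of these three steps, assuming scikit-learn's LogisticRegression as the local classifier and neighbor-label proportions as the relational features (both choices and all helper names are illustrative; the experiments that follow use wvRN instead):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def relational_features(G, node, labels, classes):
    """Fraction of a node's neighbors currently assigned to each class."""
    counts = np.zeros(len(classes))
    for v in G[node]:
        if v in labels:
            counts[classes.index(labels[v])] += 1.0
    return counts / max(G.degree(node), 1)

def ica(G, X, y_train, classes, max_iter=10):
    """Iterative classification sketch. X: node -> attribute vector."""
    labels = dict(y_train)
    test = [v for v in G if v not in y_train]
    feats = lambda v: np.concatenate([X[v], relational_features(G, v, labels, classes)])
    # Step 1: train the local classifier on attribute + relational features.
    clf = LogisticRegression(max_iter=1000)
    clf.fit([feats(v) for v in y_train], [y_train[v] for v in y_train])
    # Step 2: bootstrap test labels (relational features start mostly empty).
    for v in test:
        labels[v] = clf.predict([feats(v)])[0]
    # Step 3: re-classify iteratively until the labels stop changing.
    for _ in range(max_iter):
        new = {v: clf.predict([feats(v)])[0] for v in test}
        if all(new[v] == labels[v] for v in test):
            break
        labels.update(new)
    return labels
```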

Page 16: Sparsification and Sampling of Networks for Collective Classification

Datasets & Experiments

Cora: a citation network, i.e., a directed graph of 2708 research papers, each belonging to one of 7 research areas (classes) in Computer Science (data downloaded from http://www.cs.umd.edu/projects/linqs/projects/lbc/index.html)

DBLP: a co-authorship network among 5602 researchers in 6 different areas of computer science (raw data downloaded from http://arnetminer.org and processed)

Number of edges retained by the different sparsification algorithms at sparsification ratio s = 70%:

Dataset   Total edges   Adaptive Global Sparsifier   Global Sparsifier   Local Sparsifier
Cora      5429          3850                         3800                2429
DBLP      17265         12251                        12086               6859

Page 17: Sparsification and Sampling of Networks for Collective Classification

Experiments (contd.)

Weighted Vote Relational Neighbor (wvRN) is used as the base collective classification algorithm (Macskassy and Provost, JMLR 2007)

Baseline methods: Global Sparsification Algorithm (GS) and Local Sparsification Algorithm (LS) (Satuluri et al. SIGMOD 2011)

Performance metric: Accuracy of Classification
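For context, a small sketch of the wvRN idea, in which each unlabeled node's class distribution is the weighted average of its neighbors' current distributions (the update schedule and convergence handling are simplified here):

```python
import numpy as np

def wvrn(G, y_train, classes, n_iter=50):
    """Weighted-vote relational neighbor with relaxation-style updates."""
    k = len(classes)
    P = {}
    for v in G:
        if v in y_train:
            P[v] = np.eye(k)[classes.index(y_train[v])]  # clamped one-hot
        else:
            P[v] = np.full(k, 1.0 / k)                   # uniform start
    for _ in range(n_iter):
        for v in G:
            if v in y_train:
                continue
            votes = sum(G[v][u].get("weight", 1.0) * P[u] for u in G[v])
            total = np.sum(votes)
            if total > 0:
                P[v] = votes / total
    return {v: classes[int(np.argmax(P[v]))] for v in G if v not in y_train}
```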

Page 18: Sparsification and Sampling of Networks for Collective Classification

Results

(Figures: classification accuracy on Cora and on DBLP)

Page 19: Sparsification and Sampling of Networks for Collective Classification

Sampling for Collective Classification

A good sample from a dataset should inherit all of its characteristics

Forest fire sampling, node sampling, edge sampling with induction (Ahmed et al. ICWSM 2012)

We argue that the “goodness” of a sample should be defined based on the problem we want to solve

Rationale:

Choosing the training sample should ensure that each test node is connected to at least one training node

Why? To facilitate collective classification by ensuring that test nodes have useful relational features computed from training nodes! (A quick check of this property follows.)
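This requirement is easy to verify; a short check, assuming a networkx-style graph and a set of training nodes (the names are mine):

```python
def every_test_node_covered(G, train_nodes):
    """True if every node outside the training set has at least one
    neighbor inside it -- the property the sampler should preserve."""
    return all(any(u in train_nodes for u in G[v])
               for v in G if v not in train_nodes)
```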

Page 20: Sparsification and Sampling of Networks for Collective Classification

Adaptive Forest Fire Sampling

A modified version of Forest Fire Sampling (Leskovec et al. KDD 2005)

Selects a random node as the “seed node” to start and marks it as “visited”

“Adaptive” because it randomly selects only a certain percentage of the edges incident on a visited node to propagate along, marking the nodes at the other end of those edges as “visited”

Maintains a queue of unvisited nodes as the propagation spreads through the network

Ensures that each test node is connected to at least one training node (see the sketch below)
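A sketch of the burning mechanics under these assumptions (function and parameter names are mine, and the published method adds the logic guaranteeing that every test node touches a training node):

```python
import random
from collections import deque

def adaptive_forest_fire_sample(G, train_frac, burn_frac=0.3, seed=None):
    """Grow a training set by 'burning' a random fraction of the edges
    incident on each visited node."""
    rng = random.Random(seed)
    target = int(train_frac * G.number_of_nodes())
    visited, queue = set(), deque()
    while len(visited) < target:
        if not queue:  # (re)start the fire from a random unvisited node
            queue.append(rng.choice([v for v in G if v not in visited]))
        u = queue.popleft()
        if u in visited:
            continue
        visited.add(u)
        nbrs = [v for v in G[u] if v not in visited]
        rng.shuffle(nbrs)
        queue.extend(nbrs[: int(burn_frac * len(nbrs)) + 1])  # burn a fraction
    return visited  # training nodes; all remaining nodes are test nodes
```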

Page 21: Sparsification and Sampling of Networks for Collective Classification

Adaptive Forest Fire Sampling of a network with 19 nodes

(Figure: the sampled network, with the test nodes highlighted)

Page 22: Sparsification and Sampling of Networks for Collective Classification

Experiments

Baseline classifiers used to compare Random Sampling with Adaptive Forest Fire Sampling:

wvRN (Macskassy et al. JMLR 2007)

Multi-class SVM (Crammer and Singer JMLR 2001, Tsochantaridis et al. ICML 2004)

RankNN for single labeled data (Saha et al. ICMLA 2012)

Page 23: Sparsification and Sampling of Networks for Collective Classification

Results (Cora citation network)

(Figures: classification accuracy under Random Sampling and under Adaptive Forest Fire Sampling)

Page 24: Sparsification and Sampling of Networks for Collective Classification

Conclusions

Introduced a sparsification method for collective classification on network datasets that removes edges without losing much information, achieving comparable accuracies

Introduced a network sampling algorithm for facilitating collective classification

These algorithms work on single-labeled networks; in the future we would extend these approaches to handle multi-labeled networks as well

These algorithms are designed for static networks; an interesting direction would be to formulate sampling methods for networks that change over time

Page 25: Sparsification and Sampling of Networks for Collective Classification

Thank You!