JUNG: Java Universal Network/Graph Framework


Algorithms for Data Mining and Querying with Graphs

Investigators: Padhraic Smyth, Sharad Mehrotra

University of California, Irvine

Students: Joshua O’Madadhain, Dawit Seid, Jon Hutchins

JUNG: Java Universal Network/Graph Framework

GAAL: A General-Purpose Graph Query Language

Link Prediction Algorithms

Example of software built using JUNG: Netsight, an interactive graph visualization and analysis tool

- extensible, open source software library (API) for graph/network modeling, analysis, and visualization

- can decorate graphs, vertices, edges with any JUNG object

- complex filtering/transformation/subset management

- includes library of network and graph algorithms

- clustering, centrality, importance, paths, flows, etc

- includes visualization API, or can use other visualization APIs (e.g. prefuse)

- supports graphs, hypergraphs, parallel edges, mixed-mode graphs, k-partite graphs

- active user/developer community

- 30,000 downloads, 1.3 million page visits

- ranked #60 out of 100k Sourceforge projects

- used in social network analysis, games, trust metrics, upcoming version of HP Zoomgraph, email visualization, and Netsight

JUNG software is publicly available at http://jung.sourceforge.net

We have developed a general predictive learning approach that uses historical graph data to learn a predictive model of whether a link is likely to exist between any pair of nodes A and B in a future time period. The prediction model uses both structural graph features around A and B and individual node attributes for A and B. For example, for co-author graphs, features can include the distance between A and B in the co-author graph, properties of A’s and B’s graph neighborhoods, and topic models in the form of probability distributions characterizing A’s and B’s research interests.
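The structural features mentioned above can be illustrated with a minimal sketch. It assumes an undirected co-author graph stored as adjacency sets and computes three features for a candidate pair: shortest-path distance, number of common neighbors, and Jaccard overlap of the two neighborhoods. The Jaccard measure, the function names, and the toy graph are assumptions made for illustration; the topic-model features and the learned model itself are not shown.

```python
from collections import deque

def shortest_path_length(adj, source, target):
    """Breadth-first search distance between two nodes; None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

def pair_features(adj, a, b):
    """Structural features for the candidate link (a, b)."""
    na, nb = set(adj.get(a, ())), set(adj.get(b, ()))
    common = na & nb
    union = na | nb
    return {
        "distance": shortest_path_length(adj, a, b),
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
    }

# Hypothetical co-author graph (undirected adjacency sets).
coauthors = {
    "A": {"C", "D"},
    "B": {"C", "E"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": {"B"},
}
features = pair_features(coauthors, "A", "B")
```

A feature dictionary like this, computed on a historical snapshot of the graph, could serve as one input row for whatever classifier is trained to predict future links.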

We have developed a new query language called GAAL that lets users express complex relational queries on attributed graphs, supporting queries on graph properties, aggregation operations, and scalability to very large graphs. In 2005 we extended this approach to provide an algebraic framework for spatio-temporal analysis of semantic graphs.
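To make the idea of a relational query with aggregation on an attributed graph concrete, here is a small sketch in plain Python rather than GAAL. It assumes Researcher, Write, and Cite relations like those in the example schema of this poster, with made-up rows, and it assumes that a Cite row (Pid1, Pid2) means paper Pid1 cites paper Pid2. The query counts, for each researcher, the citations received by the papers they wrote.

```python
from collections import Counter

# Hypothetical rows for three relations of the example schema.
researcher = {1: "Ada", 2: "Ben"}      # Rid -> name
write = [(1, 10), (1, 11), (2, 12)]    # (Rid, Pid); the pos attribute is omitted
cite = [(12, 10), (12, 11), (11, 10)]  # (Pid1, Pid2), assumed: Pid1 cites Pid2

def citations_per_researcher(write, cite):
    """Join Write with Cite on the cited paper and aggregate by researcher."""
    cited = Counter(pid2 for _pid1, pid2 in cite)
    totals = Counter()
    for rid, pid in write:
        totals[rid] += cited[pid]
    return totals

totals = citations_per_researcher(write, cite)
```

A graph query language expresses the same join-and-aggregate pattern declaratively over node and edge types instead of over hand-built lists.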

Figure: architecture and example schema. Components shown:

- Visualization/Analysis Applications (NetSight)
- Graph Schema Definition Interface
- GAAL Language (Graph Querying Algebra)
- DBMS-Specific Adapters
- Extensible ORDBMS
- Graph Schema and Other Metadata
- Multi-relational (attributed graph) representation of entity/event data

Example relations in the schema:

- Researcher (Rid, name, URL)
- Paper (Pid, title, abs, year)
- Cite (Pid1, Pid2)
- Write (Rid, Pid, pos)
- Institute (Iid, name, type)
- WorksIn (Rid, Iid)

Algorithms for Ranking Nodes in Dynamic Networks

Email Rankings and Organizational Structure

We have developed a novel algorithmic approach to the problem of determining the importance of nodes in a network where the links occur over time, e.g., an email network or a co-author network. The concept is similar to centrality ideas in social networks, and HITS and PageRank for Web page ranking, but produces a “dynamic rank” such that the rank of each node varies over time as it receives messages in the network.
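The poster does not spell out the ranking algorithm, so the following is not the authors' method. It is a deliberately simple stand-in, an exponentially decayed count of received messages, that only illustrates what a rank varying over time as messages arrive can look like. The half-life parameter, the function name, and the toy messages are all assumptions.

```python
from collections import defaultdict

def decayed_rank(messages, t, half_life):
    """Score each recipient at time t by its received messages,
    where a message's weight halves every `half_life` time units."""
    scores = defaultdict(float)
    for sent_at, _sender, recipient in messages:
        if sent_at <= t:
            scores[recipient] += 0.5 ** ((t - sent_at) / half_life)
    return dict(scores)

# Hypothetical message log: (time, sender, recipient).
messages = [
    (0, "ann", "bob"),
    (1, "cat", "bob"),
    (9, "bob", "ann"),
    (10, "cat", "ann"),
]
early = decayed_rank(messages, t=2, half_life=2)
late = decayed_rank(messages, t=10, half_life=2)
```

In this toy log, bob scores highest at the early time and ann at the late time, which is the qualitative behavior a dynamic rank is meant to capture.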

Example of Rankings over Time

Results on KDD Challenge/Biobase Data

This 2005 prediction competition evaluated different approaches to link prediction. The specific problem was to predict new collaborations among 300,000 medical researchers in 2002, based on co-author relations in 128,000 papers published from 1998 to 2001. The figure to the right shows the “lift curve”: the number of true new collaborations found by our models’ rankings relative to a random ranking. In the top 50 predictions, for example, our models predict between 40 and 45 true collaborations (versus about 3 for a random ranking).
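The lift implied by these numbers can be worked out directly. The sketch below divides the model's hits in the top k by the hits expected from a random ranking, using the figures quoted above (40 to 45 versus about 3, so the result is approximate). The helper names and the toy ranking are assumptions made for illustration.

```python
def hits_in_top_k(ranked_pairs, true_pairs, k):
    """Count how many of the top-k ranked pairs are true new collaborations."""
    return sum(1 for pair in ranked_pairs[:k] if pair in true_pairs)

def lift(model_hits, random_hits):
    """Ratio of model hits to the hits expected from a random ranking."""
    return model_hits / random_hits

# Figures quoted in the text for the top 50 predictions.
low = lift(40, 3)   # roughly 13.3
high = lift(45, 3)  # 15.0

# Toy ranking to show how the hit count itself would be obtained.
ranked = [("a", "b"), ("a", "c"), ("b", "d")]
truth = {("a", "b"), ("b", "d")}
toy_hits = hits_in_top_k(ranked, truth, 2)
```

Under these figures, the top 50 predictions contain roughly 13 to 15 times as many true collaborations as a random ranking would.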

Data: Corporate Email History (1 million emails, 21 months, 628 individuals)

Figure: algebraic framework linking graph queries with spatial and temporal queries.

- Graph Query results: node set, edge set, property value, graph, graph-set
- STProject(G, P, O, T, F) output: a triple of (target-property, spatial-property, objectId)
- Spatial Queries and Temporal Queries results: point set, relationship set (pair of points), spatial/temporal value
- GProject(G, I, R) output: graph that embeds input data

LEGEND:

G - base graph
P - set of spatial/temporal properties
O - node/link type with the target property
T - set of target properties of nodes/edges (optional)
F - a filtering condition (optional)
I - spatial/temporal query output
R - relationships to be used in embedding of I

• STProject: projects out spatial/temporal properties and a target property of nodes/links for spatial analysis.

• GProject: embeds a set of nodes or a new relationship type in the graph; nodes that have the same type as those in I are filtered out.

Query: Find the news source that had the most coverage of the most heavily damaged regions during the tsunami disaster.

Step 1: Spatial projection:

STProject(TsunamiGraph, source_agency, report, mentionedCity, , )

Step 2: Find the top 3 cities with most damage:

Distinguish(topic=damage, , sum, 3)

Step 3: Project into graph to find the sources:

GProject(TsunamiGraph, topCities, range(mentionedCity))

Step 4: Find the top 3 sources using graph query language:

SELECT ?source
FROM {?report, mentionedCity, ?city, ?source, reports, ?report}
WHERE ?city IN <topCities>
GROUP BY source_agency
AGGREGATE branchwise count(reports) INTO ?source.repCount
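The four steps can be mimicked on toy data to show the flow of the computation. This is a plain Python stand-in, not an implementation of STProject, Distinguish, GProject, or GAAL: it assumes each report carries a source agency, one mentioned city, and a numeric damage weight, all of which are hypothetical. It sums damage per city, keeps the top cities, and then counts reports per source that mention those cities.

```python
from collections import Counter

# Hypothetical reports: (report_id, source_agency, mentioned_city, damage_weight)
reports = [
    ("r1", "AgencyX", "CityA", 5),
    ("r2", "AgencyX", "CityB", 1),
    ("r3", "AgencyY", "CityA", 4),
    ("r4", "AgencyY", "CityC", 2),
    ("r5", "AgencyZ", "CityD", 3),
    ("r6", "AgencyX", "CityA", 2),
]

def top_cities(reports, n):
    """Steps 1-2: project (city, damage) pairs and keep the n most damaged cities."""
    damage = Counter()
    for _rid, _src, city, weight in reports:
        damage[city] += weight
    return [city for city, _total in damage.most_common(n)]

def top_sources(reports, cities, n):
    """Steps 3-4: restrict to reports on the chosen cities and count per source."""
    wanted = set(cities)
    counts = Counter(src for _rid, src, city, _w in reports if city in wanted)
    return counts.most_common(n)

cities = top_cities(reports, 2)
sources = top_sources(reports, cities, 1)
```

The split into two functions mirrors the separation in the query plan between the spatial analysis (which cities) and the graph query (which sources).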

Figure: GRAPH SCHEMA for the tsunami news graph. Labels shown: Topic, Report, Source_agency, Reporter, city, country, name, dateHour, fileHour, fileDate, reports, mentions, About, basedIn, Replicates, References, mentionedCity, fileLocation, filedBy. Annotations: “Damage related topics”, “Differentiate these”.

Architecture

Query Example

Algebraic Framework
