
JUNG: JAVA Universal Network/Graph Framework


Algorithms for Data Mining and Querying with Graphs

Investigators: Padhraic Smyth, Sharad Mehrotra

University of California, Irvine

Students: Joshua O’Madadhain, Dawit Seid, Jon Hutchins

JUNG: JAVA Universal Network/Graph Framework

Example of software built using JUNG: Netsight, an interactive graph visualization and analysis tool.

- extensible, open source software library (API) for graph/network modeling, analysis, and visualization

- can decorate graphs, vertices, edges with any JUNG object

- complex filtering/transformation/subset management

- includes library of network and graph algorithms: clustering, centrality, importance, paths, flows, etc.

- includes visualization API, or can use other visualization APIs (e.g., prefuse)

- supports graphs, hypergraphs, parallel edges, mixed-mode graphs, k-partite graphs

- active user/developer community: 30,000 downloads, 1.3 million page visits; ranked #60 out of 100k SourceForge projects

- used in social network analysis, games, trust metrics, upcoming version of HP Zoomgraph, email visualization, and Netsight

JUNG software is publicly available at http://jung.sourceforge.net

Link Prediction Algorithms

We have developed a general predictive learning approach that uses historical graph data to learn a predictive model of whether a link is likely to exist between any pair of nodes A and B in a future time period. The prediction model uses both structural graph features around A and B and individual node attributes for A and B. For example, for co-author graphs, features can include the distance between A and B in the co-author graph, properties of A’s and B’s graph neighborhoods, and topic models in the form of probability distributions characterizing A’s and B’s research interests.
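As a minimal sketch of the kinds of pairwise features described above (not the project's implementation), the following Python computes a graph distance, two neighborhood-overlap measures, and a topic similarity for a pair of authors. The specific choices here (common-neighbor count, Jaccard overlap, cosine similarity) and the toy graph and topic distributions are assumptions made for illustration; the poster does not state which neighborhood properties or similarity measures were used.

```python
from collections import deque


def shortest_path_length(adj, a, b):
    """Breadth-first-search distance between a and b; None if unreachable."""
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def topic_similarity(p, q):
    """Cosine similarity between two topic distributions (dict: topic -> prob)."""
    dot = sum(prob * q.get(topic, 0.0) for topic, prob in p.items())
    norm_p = sum(v * v for v in p.values()) ** 0.5
    norm_q = sum(v * v for v in q.values()) ** 0.5
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def pair_features(adj, topics, a, b):
    """Feature dictionary for the candidate link (a, b)."""
    na, nb = set(adj.get(a, ())), set(adj.get(b, ()))
    common, union = na & nb, na | nb
    return {
        "distance": shortest_path_length(adj, a, b),
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "topic_similarity": topic_similarity(topics[a], topics[b]),
    }


# Hypothetical co-author graph (undirected, stored as adjacency sets).
adj = {
    "A": {"C", "D"},
    "B": {"C", "E"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": {"B"},
}
# Hypothetical topic distributions for the two authors of interest.
topics = {
    "A": {"graphs": 0.8, "databases": 0.2},
    "B": {"graphs": 0.6, "databases": 0.4},
}

features = pair_features(adj, topics, "A", "B")
print(features)
```

A feature vector like this, computed on historical data for each candidate pair and labeled by whether the link later appeared, could then be fed to any standard classifier.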

GAAL: A General-Purpose Graph Query Language

We have developed a new query language called GAAL that allows users to express complex relational queries on attributed graphs. It supports queries on graph properties and aggregation operations, and is designed to scale to very large graphs. In 2005 we extended this approach to provide an algebraic framework for spatio-temporal analysis of semantic graphs.

Architecture

Multi-relational (attributed graph) representation of entity/event data:

Researcher(Rid, name, URL)

Paper(Pid, title, abs, year)

Institute(Iid, name, type)

Write(Rid, Pid, pos)

Cite(Pid1, Pid2)

WorksIn(Rid, Iid)

Components shown in the architecture figure: Visualization/Analysis Applications (NetSight); GAAL Language (Graph Querying Algebra); Graph Schema Definition Interface; Graph Schema and Other Metadata; DBMS-Specific Adapters; Extensible ORDBMS.

Algorithms for Ranking Nodes in Dynamic Networks

Figure: Email Rankings and Organizational Structure

We have developed a novel algorithmic approach to the problem of determining the importance of nodes in a network where the links occur over time, e.g., an email network or a co-author network. The concept is similar to centrality measures in social network analysis and to HITS and PageRank for Web page ranking, but it produces a “dynamic rank”: the rank of each node varies over time as it receives messages in the network.
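To make the idea of a time-varying rank concrete, here is a deliberately simple Python sketch. It is not the algorithm developed in this project, whose details are not given here. It assumes a hypothetical scheme in which a node's score rises by one each time it receives a message and decays exponentially between messages; unlike HITS or PageRank, it ignores the importance of the sender. The function name, the `half_life` parameter, and the toy message log are all illustrative.

```python
import math
from collections import defaultdict


def dynamic_rank(messages, half_life):
    """Return a list of (time, scores) snapshots, one per message.

    messages: iterable of (time, sender, recipient), sorted by time.
    half_life: time after which an unrefreshed score halves (assumed scheme).
    """
    decay = math.log(2) / half_life
    scores = defaultdict(float)
    last_time = None
    history = []
    for t, _sender, recipient in messages:
        if last_time is not None:
            # Decay every existing score for the time elapsed since the last message.
            factor = math.exp(-decay * (t - last_time))
            for node in scores:
                scores[node] *= factor
        scores[recipient] += 1.0  # receiving a message raises the recipient's score
        last_time = t
        history.append((t, dict(scores)))
    return history


# Hypothetical message log: (time, sender, recipient).
messages = [(0, "ann", "bob"), (1, "cat", "bob"), (10, "bob", "ann")]
history = dynamic_rank(messages, half_life=1.0)

for t, snapshot in history:
    ranking = sorted(snapshot, key=snapshot.get, reverse=True)
    print(t, ranking)
```

In this toy log, bob leads while he is receiving messages, and ann overtakes him once his score has decayed and she receives a message of her own.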

Figure: Example of Rankings over Time

Results on KDD Challenge/Biobase Data

This prediction competition in 2005 evaluated different approaches for link prediction. The specific problem was to predict new collaborations among 300,000 medical researchers in 2002, based on co-author relations in 128,000 papers published from 1998 to 2001. The figure shows the “lift curve”: the ratio of the number of true new collaborations predicted by our models’ rankings to the number found by a random ranking. In the top 50 predictions, for example, our models predict between 40 and 45 true collaborations (versus about 3 for a random ranking).
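The lift implied by these figures can be worked out directly. The short Python sketch below does the arithmetic for the top-50 numbers quoted above (treating "about 3" as exactly 3, which is an approximation) and also shows one common way to compute lift at k from a ranked list. The helper names and the toy candidate pairs are hypothetical.

```python
def lift(model_hits, random_hits):
    """Ratio of true links found by the model to those found by a random ranking."""
    return model_hits / random_hits


def lift_at_k(ranked_pairs, true_pairs, k, num_candidates):
    """Lift at k: hits in the top k, divided by the hits expected from k random picks."""
    hits = sum(1 for pair in ranked_pairs[:k] if pair in true_pairs)
    expected_random = k * len(true_pairs) / num_candidates
    return hits / expected_random


# Figures quoted in the text for the top 50 predictions.
low, high = lift(40, 3), lift(45, 3)
print(f"lift at top 50: {low:.1f} to {high:.1f}")

# Toy example: 4 candidate pairs, 2 of them true, model ranks both first.
ranked = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
truth = {("a", "b"), ("b", "c")}
toy_lift = lift_at_k(ranked, truth, k=2, num_candidates=len(ranked))
print(toy_lift)
```

Under that approximation, the top-50 lift is roughly 13 to 15.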

Data for the email ranking example: corporate email history, 1 million emails, 21 months, 628 individuals.

Algebraic Framework

The framework connects graph queries (over node sets, edge sets, property values, graphs, and graph-sets) with spatial and temporal queries (over point sets, relationship sets (pairs of points), and spatial/temporal values) through two operators:

• STProject(G, P, O, T, F): projects out spatial/temporal properties and a target property of nodes/links for spatial analysis. Figure label: a triple of (target-property, spatial-property, objectId).

• GProject(G, I, R): embeds a set of nodes or a new relationship type in the graph; nodes that have the same type as those in I are filtered out. Figure label: graph that embeds input data.

LEGEND:

G - base graph

P - set of spatial/temporal properties

O - node/link type with the target property

T - set of target properties of nodes/edges (optional)

F - a filtering condition (optional)

I - spatial/temporal query output

R - relationships to be used in embedding of I

Query Example

Query: find the news source that had the most coverage of the most heavily damaged regions during the tsunami disaster.

Step 1: Spatial projection:

    STProject(TsunamiGraph, source_agency, report, mentionedCity,,)

Step 2: Find the top 3 cities with the most damage:

    Distinguish(topic=damage,,sum,3)

Step 3: Project into the graph to find the sources:

    GProject(TsunamiGraph, topCities, range(mentionedCity))

Step 4: Find the top 3 sources using the graph query language:

    SELECT ?source
    FROM {?report, mentionedCity, ?city,
          ?source, reports, ?report}
    WHERE ?city IN <topCities>
    GROUP BY source_agency
    AGGREGATE branchwise count(reports) INTO ?source.repCount)
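The four steps can be mimicked on a small in-memory data set. The Python sketch below is only an analogy under stated assumptions: it does not implement GAAL, STProject, Distinguish, or GProject, and their exact semantics are not specified here. It assumes each report carries a hypothetical numeric damage-topic weight, that "most damage" means the largest sum of those weights per city, and that "coverage" means the number of reports mentioning a top city. All data values and names are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical reports: (report_id, source_agency, mentioned_city, damage_weight).
reports = [
    ("r1", "AgencyX", "CityA", 0.9),
    ("r2", "AgencyX", "CityB", 0.8),
    ("r3", "AgencyY", "CityA", 0.7),
    ("r4", "AgencyY", "CityD", 0.1),
    ("r5", "AgencyZ", "CityC", 0.6),
    ("r6", "AgencyX", "CityC", 0.5),
    ("r7", "AgencyZ", "CityD", 0.2),
]

# Step 1 (analogous to STProject): keep only the spatial property (city),
# the target property (damage weight), and the object id.
projection = [(weight, city, rid) for rid, _src, city, weight in reports]

# Step 2 (analogous to Distinguish with sum and 3): top 3 cities by summed damage.
damage_by_city = defaultdict(float)
for weight, city, _rid in projection:
    damage_by_city[city] += weight
top_cities = sorted(damage_by_city, key=damage_by_city.get, reverse=True)[:3]

# Step 3 (analogous to GProject): restrict the report-to-city relationship
# to the cities selected in step 2.
reports_on_top_cities = [r for r in reports if r[2] in top_cities]

# Step 4 (analogous to the GROUP BY / AGGREGATE query): count reports per source.
rep_count = Counter(src for _rid, src, _city, _w in reports_on_top_cities)
top_sources = rep_count.most_common(3)

print(top_cities)
print(top_sources)
```

On this toy data, AgencyX has the most reports on the three most heavily damaged cities.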

GRAPH SCHEMA (figure): schema of the tsunami graph. Labels in the figure: Topic, Report, Reporter, Source_agency, city, country, reports, mentions, About, basedIn, Replicates, References, mentionedCity, fileLocation, filedBy, name, date, Hour, fileHour, fileDate. An annotation marks the damage-related topics ("Differentiate these").
