
Directed Study Report

Name: Yibo Yao

WSU ID: 11252107

Advisor: Dr. Holder

School of Electrical Engineering and Computer Science

Washington State University, Pullman, WA 99164

Collection and Streaming of Graph Datasets

I Abstract

During the last decade, a large amount of data with structural information has

emerged in every aspect of human life. A graph-based representation has been used to

depict the entities and their relationships in those datasets, with vertices representing

entities and edges representing relational links among the entities. In this directed

study, we have collected several graph datasets which have dynamic natures from

various application domains and streamed them into GraphML format representations

according to certain time spans. This task will help us to understand the implicit

relations of entities within each dataset and facilitate our future research on graph-

based data mining.

II Introduction

Graph-structured data is becoming increasingly abundant and important in many

domains. The purpose of graph-based data mining is to discover interesting patterns

or novel knowledge within graph datasets. In past years, most graph miners have focused on static graphs which do not change over time. However, with the rapid emergence of time series data, the existing static graph representations and miners cannot handle such data because they lack dynamic properties.

There are many dynamic graph datasets in our lives which show temporal changes in entities or their relations, such as citation networks, biological networks, communication networks, and social networks. The relations between entities in many real-world systems usually occur within a certain time period. To study the temporal interactions in these real-world network graphs, we need a collection of typical graph datasets which show temporal properties in their entities or relations, and a good representation which can describe the dynamic nature of graphs, in order to facilitate the development of mining techniques on dynamic graph datasets. For these

reasons, in our directed study, we want to collect around ten graph datasets and stream

them into time series GraphML representations which will capture the dynamic

aspects of these datasets.

The rest of the report is organized as follows. Section III gives a brief introduction to the GraphML format, which is the graph representation we've used in our study. Section IV details the dynamic graph datasets we've collected and their corresponding GraphML representations. Some conclusions are drawn in Section V.

III GraphML

GraphML is an XML-based syntax for defining graphs. A GraphML format file can

describe the structural properties of a graph. It supports various forms of graphs

(directed graphs, undirected graphs, hierarchical graphs, and so on). We can use the

following syntax to define a graph:

<graph id="G" edgedefault="directed">
<node id="n0"/>
<node id="n1"/>
……
<edge source="n0" target="n1"/>
……
</graph>

A defined graph is denoted by a graph element. The declarations of nodes and edges are nested inside a graph element. A node is defined with a node element, and an edge with an edge element. GraphML is flexible enough to represent all kinds of graphs, and a single file can contain directed and undirected edges at the same time. The XML attribute edgedefault in a graph element declares the default direction of edges. Each node has a unique node ID, which is defined by the XML attribute id in a node element. An edge is declared by its two endpoints with the XML attributes source and target in an edge element, where the values of source and target must be IDs of nodes.

For a more detailed description of the GraphML format, readers are referred to [1, 2]. However, one point worth mentioning is the GraphML-Attributes extension mechanism, which allows additional information to be attached to the elements of a graph by using key/data labels. For example, we can use the following attribute declaration:

<key id="d" for="node" attr.name="color" attr.type="string">
<default>yellow</default>
</key>

to define a GraphML-attribute with a default value for a node. In the definition of a node, if there is no key/data label for attribute "d", then the color of that node will be yellow. If there is a key/data label for attribute "d" of that node, for instance,

<node id="n0">
<data key="d">green</data>
</node>

then the value of attribute "d" of that node will be green. We'll use this mechanism in our study to describe many properties of the entities and their relations in graphs.

Additionally, in this directed study, we have adopted a modification to the standard GraphML representation. For time series GraphML files, at the time corresponding to a certain file, we do not require that all existing nodes and edges be declared in that file; if they are not defined in that file, they must have been defined in a previous file. In each file, we only include the nodes and edges which were added or deleted at the corresponding time. Nodes and edges which were created before that time and are not removed at that time are regarded as existing by default, without explicit declarations. Therefore, we have defined attributes named "modification" for nodes and edges respectively, whose default value is "add" in both cases:

<key id="d_n" for="node" attr.name="modification" attr.type="string">
<default>add</default>
</key>
<key id="d_e" for="edge" attr.name="modification" attr.type="string">
<default>add</default>
</key>

If a node or an edge is deleted at the time corresponding to a file, we need to explicitly define that node or that edge like this:

<node id="n0">
<data key="d_n">delete</data>
</node>
……
<edge source="n1" target="n2">
<data key="d_e">delete</data>
</edge>

We have been using this convention throughout our study to represent the changes of nodes and edges across a series of GraphML instances.
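To make the convention concrete, here is a minimal Python sketch (not part of the original study) of how a chronologically ordered stream of such GraphML files could be replayed to rebuild the cumulative graph at any point in time. The function name, the assumption that file paths sort chronologically, and the use of the networkx library are our own illustrative choices.

# A minimal sketch of replaying the "modification" convention described
# above: nodes and edges default to "add"; deletions are explicit.
import glob
import networkx as nx

def replay(paths):
    """Apply a sequence of GraphML deltas to one cumulative directed graph."""
    current = nx.DiGraph()
    for path in sorted(paths):          # assumes names sort chronologically
        delta = nx.read_graphml(path)
        for n, attrs in delta.nodes(data=True):
            if attrs.get("modification", "add") == "delete":
                if current.has_node(n):
                    current.remove_node(n)   # also drops incident edges
            else:
                current.add_node(n, **{k: v for k, v in attrs.items()
                                       if k != "modification"})
        for u, v, attrs in delta.edges(data=True):
            if attrs.get("modification", "add") == "delete":
                if current.has_edge(u, v):
                    current.remove_edge(u, v)
            else:
                current.add_edge(u, v, **{k: w for k, w in attrs.items()
                                          if k != "modification"})
    return current

# e.g. g = replay(glob.glob("as_daily/*.graphml"))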

IV Datasets Description

We collected 11 graph datasets from various application domains. This section describes each of them in detail.

1. Autonomous System

The AS (Autonomous System) dataset was downloaded from SNAP at Stanford

University [3,4]. It depicts a communication network of who-talks-to-whom from the

BGP (Border Gateway Protocol) logs. It was originally collected by the University of Oregon Route Views Project, which aims to obtain real-time information about the

global routing system from the perspectives of several different backbones and

locations around the Internet.

This dataset contains 733 daily files which span an interval of 785 days from

November 8, 1997, to January 2, 2000. In total, there are 6474 nodes and 13233 edges

in the whole dataset. Each file has timestamp information within the file name. This

dataset exhibits the additions and deletions of nodes and edges over time. In the

GraphML form, we adopt the representation method used in those original data files,

where nodes represent routers in autonomous systems and edges represent

communication links between routers. So the 733 daily instances have been converted

into 733 daily GraphML files with each GraphML file representing communication

events of a certain day. Here is a sample GraphML format file:

<graph id="G4" edgedefault="directed">
<node id="1"/>
<node id="1740"/>
<node id="1881">
<data key="d_n">delete</data>
</node>
……
<edge source="1" target="1740">
<data key="d0">1997-11-08</data>
</edge>
<edge source="2" target="1881">
<data key="d0">1997-11-07</data>
<data key="d_e">delete</data>
</edge>
……
</graph>

In this GraphML representation, each node has a unique numeric ID, and each directed edge has a source node and a target node specified by their IDs. Additionally, each edge has an attribute attached to it which gives the time at which the link between the two nodes was established.
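For illustration only, the following sketch (not the conversion script actually used in the study) shows how two consecutive daily AS snapshots could be diffed into the edge additions and deletions recorded in such a daily file. It assumes the raw SNAP files are whitespace-separated "from_id to_id" edge lists with '#' comment lines; the helper names are hypothetical.

# Hypothetical helpers: diff two consecutive daily AS snapshots into the
# added/deleted edges written to the daily GraphML instances.
def read_edges(path):
    edges = set()
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            src, dst = line.split()[:2]
            edges.add((src, dst))
    return edges

def daily_delta(prev_path, curr_path):
    prev, curr = read_edges(prev_path), read_edges(curr_path)
    added = curr - prev      # written with the default modification="add"
    deleted = prev - curr    # written with an explicit modification="delete"
    return added, deleted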

2. Citation Network

For citation-like graphs, we've collected three datasets. Two of them are paper citation

networks downloaded from SNAP at Stanford University [4,5,6,7], and the third is a patent citation network obtained from NBER [8,9].

(1) Paper Citations

The two paper citation datasets, Hep-Ph (high energy physics phenomenology) and

Hep-Th (high energy physics theory), were originally released at the 2003 KDD Cup

[30]. They were collected from the e-print arXiv and covered papers in the period

from February 1992 to March 2002 (122 months). Each paper has a unique ID, and if

a paper i cites paper j, then there is a directed edge from i to j. The raw information

we collected from SNAP contains the citations between papers and the submission date of

each paper. There are 34546 papers and 421578 citations included in the Hep-Ph

dataset, and 27770 papers and 352807 citations included in the Hep-Th dataset. In the

GraphML representation, we use nodes to represent papers and edges to represent

citations between papers. The dynamics of these two paper citation networks are

characterized by additions of nodes and edges. The original data files were converted

into 122 monthly GraphML format files with each one describing paper citations in a

certain month. Here is an example of the GraphML file:

<graph id="G2" edgedefault="directed">
<node id="9203210">
<data key="d0">1992-03-15</data>
</node>
……
<edge source="9203210" target="9801208">
<data key="d1">1992-03-15</data>
</edge>
……
</graph>

Each node has an attribute which gives the submission date of that paper, and each edge also carries time information which indicates the creation time of the link (it is actually the submission time of the source node). Nodes which are not listed in the GraphML file of a certain month but appear as target nodes of edges represent earlier papers which were submitted to arXiv before that month.

(2) Patent Citations

The dataset of patent citations comprises detailed information about 2923922 US patents granted between January 1963 and December 1999, and 16522438 citations made to these patents between 1975 and 1999. Each patent has been assigned a unique numeric ID. The original data files contain detailed information about each patent, such as grant year and date, application year, country and state of the first inventor, patent class,

and so on. We have converted the original dataset into 37 yearly GraphML instances

with each instance giving information on the patents and citations of a corresponding year.

These GraphML files will show the dynamic nature of this dataset: additions of nodes

and edges over time.

In the GraphML files, nodes represent patents and edges represent

citations between patents. There are 21 attributes for a node: grant year, grant date,

application year, country and state of first inventor, assignee ID, assignee type,

number of claims, main patent class, technological category, technological sub-

category, number of citations made, number of citations received, percent of citations

made to patents granted since 1963, measure of generality, measure of originality,

mean forward citation lag, mean backward citation lag, share of self-citations-made

upper bound, share of self-citations-made lower bound, share of self-citations-

received upper bound, share of self-citations-received lower bound. Each edge has an

attribute attached to it, which specifies the establishment time of that citation link (it's

actually the grant time of the source node of that edge). Here is an example of the

GraphML representation of the patent citation network:

<graph id="G28" edgedefault="directed">

<node id="4890340">

<data key="d0">1990</data>

<data key="d1">10959</data>

<data key="d2">1988</data>

<data key="d3">"US", "CA"</data>

<data key="d4">0</data>

<data key="d5">1</data>

<data key="d6">10</data>

<data key="d7">4</data>

<data key="d8">6</data>

<data key="d9">65</data>

<data key="d10">18</data>

<data key="d11">4</data>

<data key="d12">0.9444</data>

<data key="d13">0.5</data>

<data key="d14">0.7128</data>

<data key="d15">6.5</data>

<data key="d16">17</data>

<data key="d17">0</data>

<data key="d18">0</data>

<data key="d19">0</data>

<data key="d20">0</data>

</node>

……
<edge source="4890340" target="643867">
<data key="d21">1990, 10959</data>
</edge>
……
</graph>
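As a side illustration, the 21 node attributes listed above could be declared as GraphML keys d0 through d20, as sketched below. The short attribute names and the attr.type values are our own shorthand assumptions, not the declarations used in the actual files.

# Hypothetical generator for the patent-node <key> declarations; the id
# numbering d0..d20 follows the order of the example node above.
PATENT_ATTRS = [
    "grant_year", "grant_date", "application_year", "country_state",
    "assignee_id", "assignee_type", "num_claims", "main_class",
    "tech_category", "tech_subcategory", "citations_made",
    "citations_received", "pct_citations_since_1963", "generality",
    "originality", "fwd_citation_lag", "bwd_citation_lag",
    "selfcites_made_upper", "selfcites_made_lower",
    "selfcites_received_upper", "selfcites_received_lower",
]

def key_declarations(attrs=PATENT_ATTRS, target="node"):
    """Return one <key> declaration per patent node attribute."""
    return "\n".join(
        f'<key id="d{i}" for="{target}" attr.name="{name}" attr.type="string"/>'
        for i, name in enumerate(attrs)
    )

# print(key_declarations())  # emits <key id="d0" .../> through <key id="d20" .../>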

3. Movie Database

There are two independent datasets in the Movie database: the MovieLens dataset and the Hetrec dataset. Both were collected from the GroupLens Research Project [24].

(1) MovieLens

The MovieLens dataset [25] records 1000209 anonymous ratings of approximately

3900 movies made by 6040 MovieLens [26] users. The original data files have been translated into a GraphML representation with 1040 daily instances; the dynamic nature of this movie dataset is characterized by additions of nodes and edges day by day. There are two kinds of nodes in each GraphML file: user nodes and movie nodes, both represented by unique numeric IDs. Each user node has the attributes gender, age, occupation and zip code attached to it. Each movie node has the attributes title and genres. A directed edge represents a rating from a user node to a movie node, with two attributes: the rating value and the timestamp of the rating made by that user. Here is an example:

<graph id="G94" edgedefault="directed">

<node id="4530">

<data key="d0">F</data>

<data key="d1">35-44</data>

<data key="d2">academic/educator</data>

<data key="d3">92103</data>

</node>

<node id="1894">

<data key="d4">Six Days Seven Nights (1998)</data>

<data key="d5">Adventure|Comedy|Romance</data>

</node>

……

<edge source="4530" target="1894">

<data key="d6">2</data>

<data key="d7">2000-07-27 23:36:30</data>

</edge>

……

</graph>

<edge source="4890340" target="643867">

<data key="d21">1990, 10959</data>

</edge>

……

</graph>

(2) Hetrec

The Hetrec dataset [27] is an extension of the previous MovieLens dataset. It links the movies in MovieLens with their corresponding webpages at the Internet Movie Database (IMDb) [28] and the Rotten Tomatoes (RT) movie review system [29]. It contains detailed information about each movie in MovieLens [26]. There are 10197 movies, 4060

directors and 95321 actors in it. We have extracted the information from the original

data files and converted them into 98 yearly GraphML files, with each one recording

the movies produced in a corresponding year and the relations among movies,

directors and actors. This movie dataset shows an 'addition' dynamic property, with

new nodes and edges being added over time. There are three kinds of nodes in each

GraphML file: movie node, director node and actor node. For a movie node, the

following information has been attached to it as attributes: type, title, year, genres,

country, tag, rtAllCriticsRating, rtAllCriticsNumReviews, rtAllCriticsNumFresh,

rtAllCriticsNumRotten, rtAllCriticsScore, rtTopCriticsRating,

rtTopCriticsNumReviews, rtTopCriticsNumFresh, rtTopCriticsNumRotten,

rtTopCriticsScore, rtAudienceRating, rtAudienceNumRatings, rtAudienceScore. The

other two kinds of nodes have an attribute which specifies whether they are actor

nodes or director nodes. There are two kinds of directed edges in each GraphML

instance. One is labeled "acted-by" and represents the link from a movie node to an actor node, with a value giving the actor's ranking in that movie's actor list. The other is labeled "directed-by" and represents the link from a movie node to a

director node. Following is an example:

<graph id="G48" edgedefault="directed">

<node id="2015">

<data key="d">Movie</data>

<data key="d0">The AbsentMinded Professor</data>

<data key="d1">1961</data>

<data key="d2">Children Comedy Fantasy</data>

<data key="d3">USA</data>

<data key="d4">900,1;9450,1;</data>

<data key="d5">7.2</data>

<data key="d6">20</data>

<data key="d7">16</data>

<data key="d8">4</data>

<data key="d9">80</data>

<data key="d10">0</data>

<data key="d11">2</data>

<data key="d12">2</data>

<data key="d13">0</data>

<data key="d14">100</data>

<data key="d15">3.2</data>

<data key="d16">1425</data>

<data key="d17">61</data>

</node>
<edge source="2015" target="robert_stevenson">
<data key="d19">directed-by</data>
</edge>
<edge source="2015" target="1070626-david_lewis">
<data key="d18">9</data>
</edge>
……
</graph>

4. Social Network Growth

The Social Network Growth data consists of three independent datasets: Flickr-

Growth, Youtube-Growth and Facebook-Growth. They were collected from Online

Social Networks Research [20]. These social network growth datasets focus on the ways in which new user-user links are formed. Therefore, their dynamics are characterized by additions of both nodes and edges over time. In the Flickr-Growth dataset [21,22], there are 950143 users and over 9.7 million links. We converted the

Flickr-Growth dataset into 133 daily GraphML files which span an interval from

November 3, 2006 to May 18, 2007. There are 1138499 users and 4945383 links

covered in the period from December 10, 2006 to January 15, 2007 in the Youtube-

Growth dataset [21,23]. They have been translated into 37 daily GraphML instances.

The Facebook-Growth dataset [10,11] includes information about the evolving

link structure on the New Orleans regional network in Facebook. There are 90269

users and a list of 3646662 user-user links with timestamps representing the

establishment time of those links in the dataset. We have converted the original data

files into 869 daily GraphML instances which span an interval from September 5,

2006 to January 21, 2009.

In each GraphML instance of the above datasets, nodes represent users and directed edges represent user-user links in these social networks. Every edge has an attribute attached to it which specifies the time at which the link was established. Here is an example of the

GraphML file:

<graph id="G133" edgedefault="directed">

<node id="11" />

……

<edge source="11" target="2555436">

<data key="d0">2007-05-17</data>

</edge>

……

</graph>
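To make the conversion concrete, here is a rough sketch (not the actual scripts used in the study) of how a list of timestamped user-user links, such as the Facebook-Growth data, could be grouped into daily GraphML instances following the convention of Section III. The input format of one "source target unix-timestamp" triple per line, the output file naming, and the function names are assumptions for illustration.

# Hypothetical conversion of timestamped links into daily GraphML deltas.
import datetime
from collections import defaultdict

def links_by_day(path):
    """Group (src, dst, day) link records by the calendar day of the timestamp."""
    days = defaultdict(list)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            src, dst, ts = line.split()[:3]
            day = datetime.date.fromtimestamp(int(ts)).isoformat()
            days[day].append((src, dst, day))
    return days

def write_daily_graphml(days, prefix="facebook"):
    seen = set()  # users already declared in an earlier daily instance
    header = ('<graphml xmlns="http://graphml.graphdrawing.org/xmlns">\n'
              '<key id="d0" for="edge" attr.name="time" attr.type="string"/>')
    for i, day in enumerate(sorted(days)):
        body = [f'<graph id="G{i}" edgedefault="directed">']
        for src, dst, d in days[day]:
            for n in (src, dst):
                if n not in seen:  # declare only users appearing for the first time
                    seen.add(n)
                    body.append(f'<node id="{n}"/>')
            body.append(f'<edge source="{src}" target="{dst}">'
                        f'<data key="d0">{d}</data></edge>')
        body.append('</graph>\n</graphml>')
        with open(f'{prefix}_{day}.graphml', 'w') as out:
            out.write(header + '\n' + '\n'.join(body))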

<edge source="2015" target="robert_stevenson">

<data key="d19">directed-by</data>

</edge>

<edge source="2015" target="1070626-david_lewis">

<data key="d18">9</data>

</edge>

……

</graph>

5. Tencent Weibo

The dataset of Tencent Weibo was released at 2012-KDD Cup [18, 19]. It represents a

sampled snapshot of the Tencent Weibo users' preferences for various items:

recommendations to users and followee-follower relationships. The data consists of

10 million users and 50000 items with over 300 million recommendation records and

about 3 million social network "following" actions. We extracted user profiles, item profiles, follower-followee relations and recommendation information from the original data files. Based on the assumption that the follower-followee relations were created before the original data was crawled, we created a single independent GraphML file which records these links. The nodes represent users with four attributes (birth year, gender, number of tweets, tag IDs), and the edges represent their followee-follower relationships with a constant label 'Follower-of' on them. Following is an example:

<graph id="G0" edgedefault="directed">
<node id="2042897">
<data key="d0">1986</data>
<data key="d1">1</data>
<data key="d2">312</data>
<data key="d3">16;57;70;35;9;20;21;30;153</data>
</node>
……
<edge source="2042897" target="1000021">
<data key="d8">Follower-of</data>
</edge>
……
</graph>

For the recommendation relations, we have converted the original data into daily GraphML instances which span a period of 32 days from October 11, 2011 to November 11, 2011. The dynamic nature is characterized by additions of nodes and links over time. In each GraphML file, nodes represent users and items, and directed edges represent user-item recommendation events. A user node has four attributes (birth year, gender, number of tweets, tag IDs), and an item node has one attribute (category). Each recommendation edge carries two attributes: the timestamp of the recommendation event and a value indicating whether the user accepted or rejected the recommended item. Here is an example:

<graph id="G10" edgedefault="directed">
<node id="2042897">
<data key="d0">1986</data>
<data key="d1">1</data>
<data key="d2">312</data>
<data key="d3">16;57;70;35;9;20;21;30;153</data>
</node>
……
<edge source="2042897" target="1774848">
<data key="d6">2011-10-20 00:21:56</data>
<data key="d7">1</data>
</edge>
……
</graph>

6. Yahoo! Instant Messenger

This dataset was provided as part of the Yahoo! Webscope program [16,17]. It

contains a sample of the Yahoo! Messenger communication events. The data was

generated by a small subset of Yahoo! Messenger users from different zip codes for

28 days starting from April 1, 2008. There are 100000 unique users from 5649 unique zip codes in the dataset. We extracted the sender's ID, the receiver's ID, the sender's zip code and the timestamp of each communication to create our GraphML instances. There are 28 daily GraphML files, which exhibit both additions and deletions of nodes and edges day by day. Each file records, for a certain day, the first communication event in which a sender sends an Instant Message (IM) to a receiver from a new zip code. Hence, in the GraphML representation, each node represents a user, while each edge represents an IM communication between two users. Three attributes are attached to every edge: the sender's zip code, the timestamp of the event, and whether or not the receiver has the sender on his buddy list. Here is an example:

<edge source="2042897" target="1774848">

<data key="d6">2011-10-20 00:21:56</data>

<data key="d7">1</data>

</edge>

……

</graph>

<graph id="G26" edgedefault="directed">

<node id="U00000"/>

<node id="U93882">

<data key="d_n">delete</data>

</node>

……

<edge source="U00000" target="U93722">

<data key="d0">Z0780</data>

<data key="d1">D26-SAT 16:17:59</data>

<data key="d2">y</data>

</edge>

<edge source="U00000" target="U93882">

<data key="d0">Z0711</data>

<data key="d1">D25-FRI 12:17:32</data>

<data key="d2">y</data>

<data key="d_e">delete</data>

</edge>

……

</graph>

V Conclusions and Future Directions

In this directed study, we have collected several graph datasets characterized by their

dynamic natures. We adopted the GraphML format to stream the collected original

data into time series representations which can capture the evolving natures of those

datasets. The datasets from different application domains have different intrinsic

dynamic properties within them. Hence, we used different time-window strategies to stream them, for example, streaming paper citations into monthly instances and streaming the social network growth and Yahoo! IM data into daily instances.

We hope the presented time series representations in GraphML format will help to

facilitate graph mining techniques on dynamic graph-structured data, especially in our

future research on supervised learning on dynamic graphs.

For the supervised learning tasks on these streamed datasets, we need to define

proper positive and negative examples for training. Here, we suggest some strategies

to classify each of the streamed datasets in Table 1; an illustrative labeling sketch follows the table. Our future work is to develop graph-based mining algorithms which will perform supervised classification tasks on these time series dynamic graph datasets.

Table 1. Positive vs. negative examples for supervised learning

AS, Paper citations, Patent citations, Social network growth: subgraphs containing nodes whose out-degree exceeds a threshold value vs. subgraphs containing nodes whose out-degree is below that value

MovieLens: movie nodes whose ratings are above a threshold value vs. movie nodes whose ratings are below that value

Hetrec: directors or actors for whom the number of movies directed or acted in exceeds a threshold value vs. those for whom that number is below the threshold

Tencent Weibo: subgraphs including users who accept recommended items vs. subgraphs including users who reject recommended items

Yahoo! IM: subgraphs containing users who communicate within a single locale vs. subgraphs containing users who communicate across multiple locales
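As one concrete illustration of Table 1, the following hedged sketch labels AS examples by out-degree. The threshold value, the use of 1-hop neighborhood (ego) subgraphs, and the function name are illustrative assumptions, not choices fixed by this report.

# Hypothetical labeling for the AS row of Table 1: a node's 1-hop
# neighborhood subgraph is positive when the node's out-degree exceeds
# a chosen threshold.
import networkx as nx

def label_as_examples(graph, threshold=10):
    examples = []
    for node in graph.nodes:
        subgraph = nx.ego_graph(graph, node, radius=1)
        label = 1 if graph.out_degree(node) > threshold else 0
        examples.append((subgraph, label))
    return examples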

References

1. http://www.cs.brown.edu/~rt/gdhandbook/chapters/graphml.pdf

2. http://graphml.graphdrawing.org/primer/graphml-primer.html

3. http://snap.stanford.edu/data/as.html

4. J. Leskovec, J. Kleinberg and C. Faloutsos. “Graphs over Time: Densification Laws,

Shrinking Diameters and Possible Explanations”. ACM SIGKDD International Conference

on Knowledge Discovery and Data Mining (KDD), 2005.

5. http://snap.stanford.edu/data/cit-HepPh.html

6. http://snap.stanford.edu/data/cit-HepTh.html

7. J. Gehrke, P. Ginsparg, J. M. Kleinberg. “Overview of the 2003 KDD Cup”. SIGKDD

Explorations 5(2): 149-151, 2003

8. http://www.nber.org/patents/

9. B. H. Hall, A. B. Jaffe, and M. Trajtenberg. "The NBER Patent Citation Data File: Lessons,

Insights and Methodological Tools." NBER Working Paper 8498.

10. http://socialnetworks.mpi-sws.org/data-wosn2009.html

11. B. Viswanath, A. Mislove, M. Cha and K. P. Gummadi. “On the Evolution of User Interaction

in Facebook”. Proceedings of the 2nd ACM SIGCOMM Workshop on Social Networks

(WOSN'09), August, 2009.

12. N. Eagle, A. Pentland, and D. Lazer. "Inferring Social Network Structure using Mobile

Phone Data", Proceedings of the National Academy of Sciences, 106(36), pp. 15274-

15278.

13. http://reality.media.mit.edu/

14. http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/1998data.html

15. K. Kendall. “A database of computer attacks for the evaluation of intrusion detection

systems”. Master's thesis, Massachusetts Institute of Technology, 1998.

16. http://webscope.sandbox.yahoo.com/catalog.php?datatype=g

17. Yahoo! Webscope dataset ydata-ymessenger-user-communication-pattern-v1_0

[http://research.yahoo.com/Academic_Relations]

18. http://www.kddcup2012.org/c/kddcup2012-track1

19. http://sigkdd.org/kdd2012/kddcup.shtml

20. http://socialnetworks.mpi-sws.org/

21. http://socialnetworks.mpi-sws.org/data-wosn2008.html

22. A. Mislove, H. S. Koppula, K. P. Gummadi, P. Druschel and B. Bhattacharjee. “Growth of the

Flickr Social Network”. Proceedings of the 1st ACM SIGCOMM Workshop on Social

Networks (WOSN'08). Seattle, WA. August, 2008.

23. A. Mislove. “Online Social Networks: Measurement, Analysis, and Applications to

Distributed Information Systems”. PhD thesis, Rice University, Department of Computer

Science. May 2009.

24. http://www.grouplens.org/

25. http://www.grouplens.org/node/73

26. http://www.movielens.org/

27. http://www.grouplens.org/node/462

28. http://www.imdb.com

29. http://www.rottentomatoes.com

30. http://www.cs.cornell.edu/projects/kddcup/

Appendix I

Table 2. Statistics of Datasets in the Report

Dataset      Nodes; Edges           Period                         Instance Span   Dynamics
AS           6474; 13233            Nov 8, 1997 – Jan 2, 2000      daily           addition, deletion
HepPh        34546; 421578          Feb 1992 – Mar 2002            monthly         addition
HepTh        27770; 352807          Feb 1992 – Mar 2002            monthly         addition
Patent       2923922; 16522438      Jan 1963 – Dec 1999            yearly          addition
MovieLens    9940; 1000209          Apr 25, 2000 – Feb 28, 2003    daily           addition
Hetrec       109578; N/A            1903 – 2011                    yearly          addition
Flickr       950143; 9700000        Nov 3, 2006 – May 18, 2007     daily           addition
Youtube      1138499; 4945383       Dec 10, 2006 – Jan 15, 2007    daily           addition
Facebook     90269; 3646662         Sep 5, 2006 – Jan 21, 2009     daily           addition
Tencent      10050000; 3000000000   Oct 11, 2011 – Nov 11, 2011    daily           addition
Yahoo IM     100000; 3179718        Apr 1, 2008 – Apr 28, 2008     daily           addition, deletion