Gephi icwsm-tutorial


ICWSM’11 Tutorial: Exploratory Network Analysis with Gephi

Instructors: Sébastien Heymann (seb@gephi.org), Julian Bilcke (julian.bilcke@gephi.org)

July 17, 2011 | 1 PM - 4 PM

Exploratory Network Analysis with Gephi

This tutorial is an introduction to Gephi, an open-source software for visualizing and manipulating graphs and networks.

Gephi aims to cover the complete chain from data import to aesthetic refinement and interaction.

Users interact with the visualization and manipulate structures, shapes and colors to reveal hidden properties.

The goal is to help data analysts form hypotheses and intuitively discover patterns or errors in large data collections.

At the end, the participants will walk away with the practical knowledge enabling them to use Gephi for their own projects.


Exploratory Network Analysis with Gephi

The tutorial starts with a brief introduction to the network exploration process and a hands-on demonstration of the essential functionalities of Gephi.

Participants are guided step by step through the complete chain of representation, manipulation, layout, analysis and aesthetic refinement. Next, teams work on real datasets.

Finally, they present their preliminary results. The tutorial concludes with a general question and answer session.


Requirements

Bring your own laptop with Java and Gephi installed. Gephi should be updated (menu Help > Check for Updates).

Bring a mouse with a wheel.

Bring a dataset of your own if you want, and verify that it loads correctly in Gephi [1].

[1] http://gephi.org/users/supported-graph-formats/

Workshop Schedule - Part I

Exploratory Network Analysis

• Exploratory Data Analysis
• Exploratory Network Analysis
• Looking for Orderness in Data
• Examples
• Guideline

Introduction to Gephi

• Approach and Community
• Networked Data
• Quick Start Demo

* 30 min break *

Workshop Schedule - Part II

Hands-On!

• Team Work on a Dataset
• Presentation of Preliminary Results

Q&A

Exploratory Data Analysis

“The greatest value of a picture is when it forces us to notice what we never expected to see”

started with John Tukey (1962)

Confirmatory: results
Exploratory: intuition
Serendipity: surprise

Exploratory Data Analysis

Non-linear processing chain of Ben Fry in Computational Information Design (2004)

Dummy Example

P2P file size distribution (Latapy et al., 2008)

Observation: visual saliences on specific file sizes

External knowledge: these sizes correspond to films

New hypothesis on data: films are highly exchanged, so the study might dig in this direction

Exploratory Network Analysis

1. See the network
1st graph viz tool: Pajek (1996), Vladimir Batagelj, Andrej Mrvar

2. Interact in real time
Gephi prototype (2008): group, filter, compute metrics...

3. Build a visual language
size by rank, color by partition, label, curved edges, thickness...

Looking for a “Simple Small Truth”?

Drew Conway, What Data Visualization Should Do:
1. Make complex things simple
2. Extract small information from large data
3. Present truth, do not deceive

http://www.dataists.com/2010/10/what-data-visualization-should-do-simple-small-truth/

Looking for Orderness in Data

Vary three cursors simultaneously to extract meaningful patterns:

• at different levels: from MICRO level to MACRO level
• on multiple dimensions: from 1 dimension to N dimensions
• over time: from T+0 to T+N

“Zoom” cursor on Quantitative Data

Global
- connectivity
- density
- centralization

Local
- communities
- bridges between communities
- local centers vs periphery

Individual
- centrality
- distances
- neighborhood
- location
- local authority vs hub

MICRO level MACRO level
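
As a rough illustration of moving the “zoom” cursor, the sketch below computes one global measure (density) and one individual measure (in-degree and out-degree) on a toy directed graph. The edge list is made-up sample data, and the plain-Python computation is only an assumption about how one might check such numbers by hand; it is not how Gephi computes them.

```python
from collections import defaultdict

# Toy directed edge list (hypothetical data, for illustration only).
edges = [("a", "b"), ("a", "d"), ("b", "c"), ("e", "a"), ("c", "e")]
nodes = sorted({node for edge in edges for node in edge})

# Global level: density of a directed graph without self-loops,
# i.e. observed edges divided by the n * (n - 1) possible ones.
n = len(nodes)
density = len(edges) / (n * (n - 1))

# Individual level: in-degree and out-degree of each node.
in_degree = defaultdict(int)
out_degree = defaultdict(int)
for source, target in edges:
    out_degree[source] += 1
    in_degree[target] += 1

print(f"density = {density:.2f}")
for node in nodes:
    print(node, "in:", in_degree[node], "out:", out_degree[node])
```

Local measures such as communities or bridges need more elaborate algorithms and are better left to the tool's built-in statistics.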

“Crossing” cursor on Qualitative Data

Social
- who with whom
- communities
- brokerage
- influence and power
- homophily

Semantic
- topics
- thematic clusters

Geographic
- spatial phenomena

1 dimension N dimensions

“Timeline” cursor on Temporal Data

Evolution of social ties

Evolution of communities

Evolution of topics

T+0 T+N

Mapping an Innovation Center
Collaborations on projects at Images et Réseaux

Themes and content

Actors

Territory

Franck Ghitalla & Ecole de Design de Nantes

Mapping Scientific Cooperations

Network Map: a Series of Choices

corpus

data

algorithms

thresholds

graphical operations

communication goals

Guideline

Guideline by number of nodes (# nodes):

1 - 100: lists, with edges as a bonus; focus on qualitative data

100 - 1,000: How do attributes explain the structure?
• easy to read, “obvious” patterns
• focus on entities (in context)
• metrics are tools to describe the graph (centrality, bridging...)
• links help to build and interpret categories of entities
challenge: mix attribute crossing and connectivity

1,000 - 50,000: How does the structure explain attributes?
• hard to read, problem of “hidden signals”: track patterns with various layouts and filtering
• focus on structures
• metrics are tools to build the graph (cosine similarity...)
• categories help to understand the structure
challenge: pattern recognition

> 50,000: requires high computational power

Gephi now!

Gephi in a Nutshell

« Like Photoshop™ for graphs. »

Helps data analysts reveal patterns and trends, highlight outliers and tell stories with their data.

• Network visualization platform

• Open source, supported by a community

• Built for performance and usability

• Extensible by plug-ins

• Windows, MacOS X, Linux

Gephi Community

Communities

Contributors

Mathieu Bastian, Mathieu Jacomy, Eduardo Ramos Ibañez, Sébastien Heymann, Guillaume Ceccarelli, André Panisson, Antonio Patriarca, Cezary Bartosiak, Martin Škurla, Patrick McSweeney, Yi Du, Hélder Suzuki, Daniel Bernardes, Ernesto Aneiro, Keheliya Gallaba, Luiz Ribeiro, Urban Škudnik, Vojtech Bardiovsky, Yudi Xue

Nonprofit organization

Community Mission

Provide a “sustainable” software

Maintain the technical ecosystem

Build a business ecosystem

Face cutting-edge technological challenges with a long-term vision

Distribute the software in Open Source

Community Values

Open innovation: ideas and features come from the entire community.

Decisions are taken with transparency.

We consider this technology a public good, and will keep it open source.

Diversity of Usages

business, leisure :-), communication, academic, art

Diversity of Network Encoding

Textual

V = { a, b, c, d, e }
E = { (a,b), (a,d), (b,c), (e,a), (c,e) }

Tabular

  a b c d e
a - 1 - 1 -
b - - 1 - -
c - - - - 1
d - - - - -
e 1 - - - -

XML

<graph>
  <nodes>
    <node id="a" />
    <node id="b" />
    <node id="c" />
    <node id="d" />
    <node id="e" />
  </nodes>
  <edges>
    <edge source="a" target="b" />
    <edge source="a" target="d" />
    <edge source="b" target="c" />
    <edge source="e" target="a" />
    <edge source="c" target="e" />
  </edges>
</graph>

Graphical

and many others...
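
To make the equivalence of these encodings concrete, here is a minimal sketch that starts from the textual form (V and E as on the slide) and derives the tabular and XML forms. The function names `to_matrix` and `to_xml` are hypothetical helpers introduced for this illustration, and the XML uses the generic element names from the slide rather than any specific file format.

```python
import xml.etree.ElementTree as ET

# Textual encoding from the slide: vertex set V and directed edge set E.
V = ["a", "b", "c", "d", "e"]
E = [("a", "b"), ("a", "d"), ("b", "c"), ("e", "a"), ("c", "e")]


def to_matrix(vertices, edges):
    """Tabular encoding: matrix[i][j] == 1 when there is an edge i -> j."""
    index = {v: i for i, v in enumerate(vertices)}
    matrix = [[0] * len(vertices) for _ in vertices]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


def to_xml(vertices, edges):
    """XML encoding using the generic element names shown on the slide."""
    graph = ET.Element("graph")
    nodes_el = ET.SubElement(graph, "nodes")
    for v in vertices:
        ET.SubElement(nodes_el, "node", {"id": v})
    edges_el = ET.SubElement(graph, "edges")
    for source, target in edges:
        ET.SubElement(edges_el, "edge", {"source": source, "target": target})
    return ET.tostring(graph, encoding="unicode")


matrix = to_matrix(V, E)
for v, row in zip(V, matrix):
    print(v, " ".join("1" if cell else "-" for cell in row))
print(to_xml(V, E))
```

The matrix costs one cell per pair of nodes, while the edge list and XML forms grow with the number of edges, which is one reason sparse networks are usually exchanged as edge lists or XML.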

Software I/O

Input
• file: CSV, Pajek NET, Guess GDF, GEXF, GraphML, Graphviz DOT, UCInet DL, Netdraw VNA, Tulip TLP, Excel Spreadsheet
• databases: MySQL, PostgreSQL, SQL Server, Neo4j
• graph streaming
• user input

Output
• file: CSV, Pajek NET, Guess GDF, GEXF, GraphML, Excel Spreadsheet, SVG, PDF, PNG

Choosing a File Format

Table of features supported by Gephi

Features (columns): Edge List/Matrix Structure, XML Structure, Edge Weight, Attributes, Visualization Attributes, Attribute Default Value, Hierarchical Graphs, Dynamics

Formats (rows): CSV, DL Ucinet, DOT Graphviz, GDF, GEXF, GML, GraphML, NET Pajek, TLP Tulip, VNA Netdraw, Spreadsheet*

* spreadsheets can be loaded in the Data Laboratory

Do you need...

Formats shown: GEXF, Spreadsheet, GraphML, Guess GDF, GML, UCINet DL, Netdraw VNA, Graphviz DOT, Pajek NET, CSV, Tulip TLP

Scale: from many features to few features

File type: XML, Tabular, Text
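
For readers preparing their own dataset, the following sketch writes a small directed, weighted graph as GEXF using only the Python standard library. It assumes GEXF version 1.2 and uses only the basic `node` (`id`, `label`) and `edge` (`id`, `source`, `target`, `weight`) attributes; the function name `build_gexf` and the sample data are hypothetical, and the resulting file should still be checked by loading it in Gephi.

```python
import xml.etree.ElementTree as ET

# Namespace of GEXF 1.2 (the version is an assumption of this sketch).
GEXF_NS = "http://www.gexf.net/1.2draft"


def build_gexf(nodes, weighted_edges):
    """Return a GEXF document string for a directed, weighted graph.

    nodes: iterable of (id, label) pairs.
    weighted_edges: iterable of (source_id, target_id, weight) triples.
    """
    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(root, "graph", {"defaultedgetype": "directed"})
    nodes_el = ET.SubElement(graph, "nodes")
    for node_id, label in nodes:
        ET.SubElement(nodes_el, "node", {"id": node_id, "label": label})
    edges_el = ET.SubElement(graph, "edges")
    for edge_id, (source, target, weight) in enumerate(weighted_edges):
        ET.SubElement(edges_el, "edge", {
            "id": str(edge_id),
            "source": source,
            "target": target,
            "weight": str(weight),
        })
    return ET.tostring(root, encoding="unicode")


# Made-up sample data.
sample_nodes = [("0", "alice"), ("1", "bob"), ("2", "chloe")]
sample_edges = [("0", "1", 3.0), ("2", "0", 1.0)]
document = build_gexf(sample_nodes, sample_edges)
print(document)
```

To save the result, write `document` to a file with a `.gexf` extension, for example with `open("sample.gexf", "w", encoding="utf-8")`.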

Using Gephi

DEMO

Team work

1. Create a team of 2~3 people.

2. Choose a dataset.

3. Explore it for 1 hour.

4. Two teams present their preliminary findings.

Dataset #1: GitHub Software Repository

“GitHub is an application used by nearly a million people to store over two million code repositories, making GitHub the largest code host in the world.”

Started in 2008, it provides the features of an online social network and a software repository to lower the barriers to collaboration and make it easier to contribute code.

https://github.com

Dataset #1: GitHub Software Repository

Data extracted by Franck Cuny* at Linkfluence SAS

1st release in March 2010 -> this poster
2nd release in June 2011 -> your data

Network of user profiles

Nodes: people with at least one repository who are followed by at least two other people
Edges: A follows B

Network of repositories

Nodes: repositories
Edges: A shares a developer with B

Very few research publications on this OSN!

* franck.cuny@linkfluence.net

Dataset #1: GitHub Software Repository

Data extracted by a crawl using the GitHub API
Seed: 10 well-known contributors in the Perl community

Networks by country: Japan, France, United States
Networks by language: Perl, PHP, Python, Ruby

Node attributes:
• user country
• number of followers
• main programming language

Edges:
• directed
• weight = number of projects A has forked from B

Dataset #1: GitHub Software Repository

Your mission (should you decide to accept it): find research hypotheses based on your exploration

Example question: are the Perl communities based on geography?
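
One hedged way to start on a question like this is to measure how much of the edge weight stays within a single country. The sketch below does so on made-up sample data; the user names, countries, weights and the helper `same_country_share` are all hypothetical and do not come from the GitHub dataset.

```python
# Hypothetical sample in the spirit of the dataset: user -> country attribute.
country = {
    "alice": "France",
    "bob": "France",
    "chloe": "Japan",
    "daiki": "Japan",
    "erin": "United States",
}

# Directed edges as (source, target, weight); all values are made up.
edges = [
    ("alice", "bob", 2),
    ("bob", "alice", 1),
    ("chloe", "daiki", 3),
    ("daiki", "alice", 1),
    ("erin", "bob", 1),
]


def same_country_share(edges, country):
    """Weighted share of edges whose two endpoints have the same country."""
    total = sum(weight for _, _, weight in edges)
    same = sum(
        weight for source, target, weight in edges
        if country[source] == country[target]
    )
    return same / total


share = same_country_share(edges, country)
print(f"{share:.0%} of the edge weight stays within a country")
```

A high share alone does not prove that communities follow geography: it should be compared with what random mixing would give for the same country sizes, and checked visually by coloring nodes by country.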

Dataset #2: The Irish Blogosphere

Blogroll Network

Nodes: blogs with more than two blogroll links
Edges: blogroll link (in-link)

Post-link Network

Nodes: blogs with more than two blogroll links
Edges: hyperlink inside a post from one blog to another (post-link)

“Identifying Representative Textual Sources in Blog Networks”. K. Wade, D. Greene, C. Lee, D. Archambault, P. Cunningham (2011) http://mlg.ucd.ie/blogs

Dataset #2: The Irish Blogosphere

Data extracted by a crawl at distance 2 from the seed for the in-links and Google Blog Search for the post-links.
Seed: 21 popular blogs, winners of the “2010 Irish Blog Awards”

Node attributes:
• post count = total number of posts by blog
• category = from the Irish blog index at www.irishblogdirectory.com, where available
• infomap_comm = community to which a node belongs (Infomap algorithm)
• gce_comms = overlapping communities (GCE algorithm)
• moses_comms = overlapping communities (MOSES algorithm)

Edges:
• directed
• weight = number of hyperlinks in the Post-link network

crawl at distance 2 from the seed

Dataset #2: The Irish Blogosphere

Your mission: explore and try to confirm the official results

Hands-On!

Start:

• Load a graph
• Apply a layout
• Color the nodes by a qualitative variable in Partition Panel
• Size the nodes by a quantitative variable in Ranking Panel
• Start to explore... compute metrics, filter the network

End:

• Export maps to PDF in Preview Tab
• Save
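
The two visual mappings used in these steps can be sketched outside Gephi to show the idea: a quantitative variable is rescaled to a node size, and each value of a qualitative variable gets its own color. The functions `rank_sizes` and `partition_colors`, the size range and the palette are assumptions made for this illustration; they are not Gephi's implementation of the Ranking and Partition panels.

```python
def rank_sizes(values, min_size=10.0, max_size=50.0):
    """Linearly rescale a quantitative attribute to node sizes."""
    low, high = min(values.values()), max(values.values())
    span = high - low
    if span == 0:
        # All nodes share the same value: give them all the minimum size.
        return {node: min_size for node in values}
    return {
        node: min_size + (value - low) / span * (max_size - min_size)
        for node, value in values.items()
    }


def partition_colors(categories, palette):
    """Assign one palette color per distinct category, in sorted order."""
    distinct = sorted(set(categories.values()))
    color_of = {
        category: palette[i % len(palette)]
        for i, category in enumerate(distinct)
    }
    return {node: color_of[category] for node, category in categories.items()}


# Made-up node attributes.
followers = {"a": 5, "b": 20, "c": 35}
language = {"a": "Perl", "b": "Ruby", "c": "Perl"}

sizes = rank_sizes(followers)
colors = partition_colors(language, ["#1f77b4", "#ff7f0e"])
print(sizes)
print(colors)
```

With heavily skewed variables such as follower counts, a linear mapping lets a few large nodes dominate, so trying a non-linear scale is a reasonable next step.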

Presentations

GitHub Repository Irish Blogosphere

Gephi Documentation

Web Site: http://gephi.org
Support: http://forum.gephi.org
Wiki: http://wiki.gephi.org
Source code: https://launchpad.net/gephi

Online Tutorials
http://gephi.org/users/quick-start/
http://gephi.org/users/tutorial-visualization/
http://gephi.org/users/tutorial-layouts/
http://wiki.gephi.org/index.php/Import_CSV_Data
http://wiki.gephi.org/index.php/Import_Dynamic_Data

Tutorial in Spanish
https://code.google.com/p/camon/wiki/Taller_Gephi

Supported Graph Formats
http://gephi.org/users/supported-graph-formats/

Thank You!

Caspar David Friedrich - Wanderer Above the Sea of Fog

Credits

[slide 11] images from Drew Conway

http://www.dataists.com/2010/10/what-data-visualization-should-do-simple-small-truth/

[slide 22 top left] Benoît Vidal at MFG Labs

[slide 22 bottom center] Franck Ghitalla at UTC

[slide 22 right] Studies in MA Digital Fashion at LCF by Peter Jeun Ho Tsang

http://jeunhotsang.com/blog/2010/12/07/prototype/

[slide 27] sketches from Ben Fry, Computational Information Design

Special thanks to Franck Ghitalla and Mathieu Jacomy for their insightful discussions.