

Visualisation and Analysis of the Internet Movie Database∗

Adel Ahmed†
School of IT, University of Sydney
NICTA, Australia

Vladimir Batagelj‡
Discrete and Computational Mathematics
University of Ljubljana, Slovenia

Xiaoyan Fu§
NICTA, Australia

Seok-Hee Hong¶
School of IT, University of Sydney
NICTA, Australia

Damian Merrick‖
School of IT, University of Sydney
NICTA, Australia

Andrej Mrvar∗∗
Social Science Informatics
University of Ljubljana, Slovenia

ABSTRACT

In this paper, we present a case study for the visualisation and analysis of large and complex temporal multivariate networks derived from the Internet Movie DataBase (IMDB). Our approach is to integrate network analysis methods with visualisation in order to address scalability and complexity issues. In particular, we defined new analysis methods such as (p,q)-core and 4-ring to identify important dense subgraphs and short cycles from the huge bipartite graphs. We applied island analysis for a specific time slice in order to identify important and meaningful subgraphs. Further, a temporal Kevin Bacon graph and a temporal two mode network are extracted in order to provide insight and knowledge on the evolution.

Keywords: Large and Complex Networks, Case Study, Visualisation, Network Analysis, IMDB.

Index Terms: H.5.2 [Information Interfaces and Presentation]: User Interfaces - Algorithms; I.3.6 [Computer Graphics]: Methodology and Techniques

1 INTRODUCTION

Recent technological advances have led to the production of large amounts of data, and consequently to many large and complex network models across a number of domains. Examples include:

• Webgraphs: the entities are web pages and the relationships are hyperlinks. These graphs are huge: the whole graph consists of billions of nodes.

• Social networks: These include telephone call graphs (used to trace terrorists), money movement networks (used to detect money laundering), and citation networks or collaboration networks. The size of the network can be medium to very large.

• Biological networks: Protein-protein interaction (PPI) networks, metabolic pathways, gene regulatory networks and phylogenetic networks are used by biologists to analyse and engineer biochemical materials. In general, they are smaller, with thousands of nodes. However, the relationships in these networks are very complex.

∗This paper is based on the winning entry of the Graph Drawing Competition 2005 [7] and an invited presentation at the Sunbelt Viszard Session [9].

†e-mail: [email protected]
‡e-mail: [email protected]
§e-mail: [email protected]
¶e-mail: [email protected]
‖e-mail: [email protected]
∗∗e-mail: [email protected]

Understanding these networks is a key enabler for many applications. Good analysis methods are needed for these networks, and some are available. However, such methods are not useful unless the results are effectively communicated to humans. Visualisation can be an effective tool for the understanding of such networks. Good visualisation reveals the hidden structure of the networks and amplifies human understanding, thus leading to new insights, new findings and possible predictions for the future.

We can identify the following challenging research issues for analysis and visualisation of large and complex networks:

• Scalability: Webgraphs or telephone call graphs gathered by AT&T have billions of nodes. In some cases, it is impossible to visualise the whole graph, or even to load the whole graph into main memory. Hence, the design of new analysis and visualisation methods for huge networks is a key research challenge from databases to computer graphics.

• Complexity: Relationships between actors in a social network, for example, can have a multitude of attributes (for example, observed behavior can be confirmed or unconfirmed, relationships can be directed or undirected, and weighted by probabilities). Also, biological networks are quite complex in nature; for example, metabolic pathways have only a few thousand nodes, but their relationships and interactions are very complex. The data may be given by nature, but some parts of the data may be unknown to human scientists. The design of analysis and visualisation methods to resolve these complexity issues is the second research challenge.

• Network Dynamics: Real world networks are always changing over time. Many social networks, such as webgraphs, evolve relatively slowly over time. In some cases, such as telephone call networks, the data arrives as a very fast stream. Effective and efficient modeling, analysis and visualisation for dynamic networks are challenging research topics.

One approach to solving these challenging issues is an integration of analysis with visualisation and interaction. Analysis tools for networks are not useful without visualisation, and visualisation tools are not useful unless they are linked to analysis. Further, interaction is necessary to find out more details or insights from the visualisation.

In this paper, we present a case study for our approach to integrating analysis, visualisation and interaction using large and complex temporal multivariate networks derived from the IMDB (Internet Movie Data Base). In general, the IMDB is a huge and very rich data set with many attributes. Note that the IMDB data set has become a challenging data set for visualisation researchers [7, 9].

For example, a multi-scale approach for visualisation of small world networks was used for data sets from IMDB [3]. A visualization approach for dynamic affiliation networks in which events are characterized by a set of descriptors was presented [6]. A radial ripple metaphor was devised to display the passing of time and to convey relations among the different constituents through appropriate layout. Note that the method is suitable for an egocentric perspective.

Asia-Pacific Symposium on Visualisation 2007, 5-7 February, Sydney, NSW, Australia. 1-4244-0809-1/07/$20.00 © 2007 IEEE

Figure 1: Arcs with multiplicity at least 8 [network drawing of titles and performers, e.g. Royal Rumble, Survivor Series, Eurovision Song Contest, The; McMahon, Vince; Margrethe II]

As the first step of our approach, we integrate network analysis methods [5, 10] with visualisation. In particular, we defined new analysis methods such as (p,q)-core and 4-ring to identify important dense subgraphs and short cycles from the huge bipartite graphs. We applied island analysis for a specific time slice in order to identify important and meaningful subgraphs of the large and complex network. Further, a temporal Kevin Bacon graph and a temporal two mode network are extracted and visualised in order to provide insight and knowledge on the evolution of the IMDB data set.

This paper is organised as follows. In the next Section, we present a simple analysis of the IMDB data set. In Section 3, we present the integration of network analysis methods with visualisation for large bipartite graphs including (p,q)-core, 4-ring and island. Section 4 presents visual analysis based on the Kevin Bacon number. Section 5 presents galaxy metaphor visualisation of a temporal two mode actor-movie network, and a visual analysis of the two mode network with company attributes. Section 6 concludes.

2 BASIC CHARACTERISTICS OF IMDB

The source of the original data is the Internet Movie Database. We transformed the contest data into a temporal network with some additional vectors and partitions describing the properties of vertices. The IMDB network is bipartite (two mode) and has 1324748 = 428440 + 896308 vertices and 3792390 arcs. 9927 of the arcs in the network are multiple (parallel) arcs. The nature of the appearance of multiple arcs can be seen in Figure 1, where all arcs with multiplicity of at least 8 are displayed.

Note that in the analysis that follows, we treat multiple arcs as single. The IMDB network consists of 132714 weak components.
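The two preprocessing steps described above (collapsing multiple arcs and counting weak components) can be illustrated with a minimal Python sketch. It is not the Pajek procedure used for the paper; the function names and the toy arc list are hypothetical, and vertices without any arc are simply not seen by this sketch.

```python
from collections import Counter


def arc_multiplicities(arcs):
    """Count how often each (movie, actor) arc occurs in the arc list."""
    return Counter(arcs)


def weak_components(arcs):
    """Return the weakly connected components as a list of vertex sets.

    Uses a simple union-find structure; arc direction is ignored.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in arcs:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


# Hypothetical toy data: movie m1 lists actor a1 twice (a multiple arc).
arcs = [("m1", "a1"), ("m1", "a1"), ("m1", "a2"), ("m2", "a3")]

mult = arc_multiplicities(arcs)
multiple = {arc: c for arc, c in mult.items() if c >= 2}  # arcs drawn in a Figure 1 style view
simple_arcs = list(mult)                                  # multiple arcs treated as single
components = weak_components(simple_arcs)
```

Raising the multiplicity threshold from 2 to 8 would select arcs in the manner of Figure 1.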

3 VISUALISATION AND ANALYSIS OF LARGE BIPARTITE NETWORKS

There are few direct specialized methods for analyzing bipartite networks, especially large ones. Because of the size of the IMDB network, the standard reduction of the entire network to one or the other derived 1-mode network was not an option. This motivated us to design and implement two new methods for analysis of bipartite networks:

• bipartite version of cores: (p,q)-cores

• 4-ring weights on lines

Table 1: (p,q : n1,n2) for IMDB

 p    q :   n1    n2 |  p   q :   n1    n2 |  p   q :  n1   n2
 1 1590 : 1590     1 | 22  24 : 1854  1153 | 43  14 :  29   83
 2  516 :  788     3 | 23  23 :   47    56 | 44  14 :  29   83
 3  212 : 1705    18 | 24  23 :   34    39 | 45  13 :  30   95
 4  151 : 4330   154 | 25  22 :   42    53 | 46  13 :  29   94
 5  131 : 4282   209 | 26  22 :   31    38 | 47  12 :  29  101
 6  115 : 3635   223 | 27  22 :   31    38 | 48  12 :  28  100
 7  101 : 3224   244 | 28  20 :   36    53 | 49  12 :  26   95
 8   88 : 2860   263 | 29  20 :   35    52 | 50  11 :  27  111
 9   77 : 3467   393 | 30  19 :   35    59 | 51  11 :  26  110
10   69 : 3150   428 | 31  19 :   35    59 | 52  11 :  16   79
11   63 : 2442   382 | 32  19 :   34    57 | 53  10 :  35  162
12   56 : 2479   454 | 33  18 :   34    62 | 54  10 :  35  162
13   50 : 3330   716 | 34  18 :   34    62 | 55  10 :  34  162
14   46 : 2460   596 | 35  18 :   33    61 | 56  10 :  34  162
15   42 : 2663   739 | 36  17 :   33    65 | 57   9 :  35  187
16   39 : 2173   678 | 37  16 :   33    75 | 58   9 :  33  180
17   35 : 2791   995 | 38  16 :   30    73 | 59   9 :  33  180
18   32 : 2684  1080 | 39  16 :   29    70 | 60   9 :  32  178
19   30 : 2395  1063 | 40  15 :   29    77 | 61   9 :  31  177
20   28 : 2216  1087 | 41  15 :   28    76 | 62   9 :  31  177
21   26 : 1988  1087 | 42  15 :   28    76 | 63   8 :  31  202

3.1 (p,q)-core Analysis

The subset of vertices C ⊆ V is a (p,q)-core in a bipartite (2-mode) network N = (V1,V2;L), V = V1 ∪ V2, if and only if

a. in the induced subnetwork K = (C1,C2;L(C)), C1 = C ∩ V1, C2 = C ∩ V2, it holds that ∀v ∈ C1 : degK(v) ≥ p and ∀v ∈ C2 : degK(v) ≥ q;

b. C is the maximal subset of V satisfying condition a.

The basic properties of bipartite cores are:

• C(0,0) = V

• K(p,q) is not always connected

• (p1 ≤ p2) ∧ (q1 ≤ q2) ⇒ C(p2,q2) ⊆ C(p1,q1)

Using (p,q)-cores, we can identify important dense structures in large and complex networks. We designed a very efficient O(m) algorithm to find (p,q)-cores and implemented it in Pajek.
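The definition of a (p,q)-core suggests a peeling procedure: repeatedly delete vertices of V1 with fewer than p remaining neighbours and vertices of V2 with fewer than q remaining neighbours. The following Python sketch illustrates this idea under our own assumptions; it is not the Pajek implementation, and the function name `pq_core` and the toy data are hypothetical. Each vertex is queued at most once and each edge is inspected a constant number of times, so the work is linear in the number of edges.

```python
from collections import defaultdict, deque


def pq_core(edges, p, q):
    """Return (C1, C2), the vertex sets of the (p,q)-core.

    edges: iterable of (u, v) pairs with u in V1 and v in V2
           (multiple arcs are treated as single).
    Only vertices that occur in at least one edge are considered.
    """
    adj = {1: defaultdict(set), 2: defaultdict(set)}
    for u, v in edges:
        adj[1][u].add(v)
        adj[2][v].add(u)

    bound = {1: p, 2: q}
    deg = {side: {x: len(ns) for x, ns in adj[side].items()} for side in (1, 2)}
    removed = {1: set(), 2: set()}

    # Start with every vertex that already violates its degree bound.
    queue = deque((side, x) for side in (1, 2)
                  for x, d in deg[side].items() if d < bound[side])

    while queue:
        side, x = queue.popleft()
        if x in removed[side]:
            continue
        removed[side].add(x)
        other = 3 - side
        for y in adj[side][x]:
            if y in removed[other]:
                continue
            deg[other][y] -= 1
            # Queue y exactly when it first drops below its bound.
            if deg[other][y] == bound[other] - 1:
                queue.append((other, y))

    return set(deg[1]) - removed[1], set(deg[2]) - removed[2]


# Hypothetical toy network: two movies sharing three actors, plus a third
# movie attached to a single actor.
toy = [("m1", "a1"), ("m1", "a2"), ("m1", "a3"),
       ("m2", "a1"), ("m2", "a2"), ("m2", "a3"),
       ("m3", "a1")]
core_movies, core_actors = pq_core(toy, 3, 2)
```

In the toy network, the pendant movie m3 is peeled off for (p,q) = (3,2), while no vertex survives for (p,q) = (4,1).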

Since there are many (p,q)-cores, we must answer the question of how to select the interesting ones among them. To help the user in these decisions, we implemented a table of the cores' characteristics n1 = |C1(p,q)|, n2 = |C2(p,q)| and k, the number of components in K(p,q) (see Table 1 and 2). We look for (p,q)-cores where

• n1 + n2 ≤ selected threshold

• there are big jumps from C(p−1,q) and C(p,q−1) to C(p,q).

For example, we selected the (247,2)-core and the (27,22)-core. From the labels we can see that the corresponding topics are wrestling and pornography. See Figures 2 and 3.

3.2 4-ring Analysis

A k-ring is a simple closed chain of length k. Using k-rings we can define a weight of edges as wk(e) = number of different k-rings containing the edge e ∈ E.

Since for a complete graph Kr, r ≥ k ≥ 3, we have wk(Kr) = (r−2)!/(r−k)!, the edges belonging to cliques have large weights. Therefore, these weights can be used to identify the dense parts of a network. For example, all r-cliques of a network belong to the (r−2)-edge cut for the weight w3.
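The closed form for complete graphs can be checked by brute force on small cases. The sketch below is an illustrative check rather than the method used in Pajek; the helper names `ring_weight` and `complete_graph` are hypothetical. It counts the k-rings through an edge {u, v} as the simple paths with k−1 edges that lead from v back to u, since removing the edge from a ring leaves exactly one such path.

```python
from math import factorial


def ring_weight(adj, u, v, k):
    """Number of k-rings (simple cycles of length k >= 3) containing edge {u, v}.

    adj: dict mapping each vertex to the set of its neighbours.
    """
    def extend(x, visited, steps_left):
        # steps_left edges remain to get from x to u.
        if steps_left == 1:
            return 1 if u in adj[x] else 0
        return sum(extend(y, visited | {y}, steps_left - 1)
                   for y in adj[x] if y not in visited)

    # u and v are marked as visited so that they cannot be inner vertices.
    return extend(v, {u, v}, k - 1)


def complete_graph(r):
    """Adjacency sets of the complete graph on vertices 0..r-1."""
    return {i: set(range(r)) - {i} for i in range(r)}


# Compare the brute-force count with (r-2)!/(r-k)! for small r and k.
checks = {(r, k): ring_weight(complete_graph(r), 0, 1, k)
          for r in range(3, 7) for k in range(3, r + 1)}
```

For instance, every edge of K4 lies on two 3-rings and on two 4-rings.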


Figure 2: (247,2)-core [network drawing of the titles Royal Rumble and Survivor Series and the performers linked to them]

Figure 3: (27,22)-core [network drawing of 31 titles, e.g. Royal Rumble, Summerslam, 'Raw Is War', and 38 performers, e.g. Angle, Kurt; McMahon, Vince]

The 3-ring weights were already available [8]. However, there are no 3-rings in the IMDB network. The densest substructures are complete bipartite subgraphs Kp,q. They contain many 4-rings. This motivated us to design a method to find 4-ring weights. We implemented it in Pajek.
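To make the 4-ring weight concrete, the following Python sketch computes w4 for every edge of a small bipartite graph. It is a direct, unoptimised illustration and not the Pajek algorithm; the function name `four_ring_weights` is hypothetical. A 4-ring through the edge (u, v) has the form u, v, u', v', so it is fixed by a second neighbour u' of v and a common neighbour v' ≠ v of u and u'.

```python
from collections import defaultdict


def four_ring_weights(edges):
    """Return {(u, v): w4} for a bipartite graph given as (u, v) pairs,
    with u in V1 and v in V2 (multiple arcs are treated as single)."""
    edges = set(edges)
    adj1, adj2 = defaultdict(set), defaultdict(set)
    for u, v in edges:
        adj1[u].add(v)
        adj2[v].add(u)

    weights = {}
    for u, v in edges:
        w = 0
        for u2 in adj2[v]:
            if u2 != u:
                # Common neighbours of u and u2, excluding v itself.
                w += len(adj1[u] & adj1[u2]) - 1
        weights[(u, v)] = w
    return weights


# In a complete bipartite graph K_{3,4} every edge lies on (3-1)*(4-1) = 6 four-rings.
k34 = [(f"m{i}", f"a{j}") for i in range(3) for j in range(4)]
k34_weights = four_ring_weights(k34)

# A path has no rings at all, so all weights are 0.
path_weights = four_ring_weights([("m1", "a1"), ("m1", "a2"), ("m2", "a2")])
```

Edges inside a complete bipartite subgraph Kp,q receive a weight of at least (p−1)(q−1), which is why dense parts stand out under w4.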

Table 2: (p,q : n1,n2) for IMDB

Size  Freq    Size  Freq    Size  Freq    Size  Freq
----------------------------------------------------
   2  5512      20    19      38     4      59     2
   3  1978      21    18      39     3      61     1
   4  1639      22    15      40     2      64     1
   5   968      23     9      42     2      67     1
   6   666      24    13      43     3      70     1
   7   394      25    12      45     3      73     1
   8   257      26     6      46     4      76     1
   9   209      27     6      47     5      82     1
  10   148      28     5      48     1      86     1
  11   118      29     6      49     2     106     1
  12    87      30     3      50     2     122     1
  13    55      31     6      51     1     135     1
  14    62      32     5      52     2     144     1
  15    46      33     3      53     1     163     1
  16    39      34     1      54     2     269     1
  17    27      35     5      55     1     301     1
  18    28      36     4      57     1     332     2
  19    29      37     7      58     1     673     1
----------------------------------------------------

Figure 4: Charlie Brown [network drawing of titles, e.g. Charlie Brown Christmas, Snoopy Come Home, and performers, e.g. Melendez, Bill; Robbins, Peter (I)]

To identify interesting substructures, we applied the simple islands procedure for the weight w4. It takes around three minutes to compute the w4 weights on a 1400 MHz, 1 GB RAM computer, and 13 seconds to determine the islands. We obtained 12465 simple line islands on 56086 vertices. Their size distribution is shown in Table 2.

There are 94 islands of size at least 30, and only 10 of size over 100. The largest island corresponds to wrestling. Each island represents a special topic. We visualized only some of them. For example, see Figures 4, 5, 6, 7 and 8.

3.3 Time slices and Island Analysis

By extracting a time slice from the complete network, we can identify the main groups in selected time periods. Islands can identify important subgraphs of large networks based on the value of attributes [4].

To illustrate this, we extracted the time slice 1935-1950. There are 223 simple islands [4] for w4 on 1774 vertices. For example, we selected island 6, 'Dona Macabra'; see Figure 9.

4 TEMPORAL CO-STARRING NETWORK: KEVIN-BACON NETWORK

We extracted a small important subset of the actors in the IMDB network and constructed from it a dynamic visualisation of a 1-mode network showing the co-appearance of actors in films.

To define a sufficiently small important subgraph, we first considered only nodes in the network with a Kevin Bacon number of 1. The Kevin Bacon number of an actor is a similar concept to the Erdős number of a mathematician; it represents the length of the shortest path in the movie star collaboration network from the actor to Kevin Bacon.

Figure 5: Mower, Jack and Phelps, Lee [network drawing of titles, e.g. Meet John Doe, Yankee Doodle Dandy, and performers, e.g. Flowers, Bess; Mower, Jack; Phelps, Lee (I)]

Figure 6: Adult [network drawing of performers, e.g. Jeremy, Ron; Byron, Tom]

The data set was divided into time slices of a decade in length (e.g. 1920s, 1930s, etc.), and the set of actors reduced in each decade to only those who had co-starred in at least 5 films with another actor with a Kevin Bacon number of 1. The sizes of the graphs for each of these time slices are given in Table 3.
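The Kevin Bacon number and the per-decade reduction can be sketched as follows. This is an illustrative Python sketch, not the code used for the paper: the function names, the `cast` dictionary layout and the toy data are hypothetical, and we assume one reading of the filter, namely that an actor with Kevin Bacon number 1 is kept if some other actor with Kevin Bacon number 1 shares at least `min_films` films with them in the decade.

```python
from collections import Counter, defaultdict, deque
from itertools import combinations


def bacon_numbers(cast, source):
    """Breadth-first search over the actor-movie network.

    cast: dict mapping each movie to an iterable of its actors.
    Returns {actor: number of co-starring steps from source}.
    """
    movies_of = defaultdict(set)
    for movie, actors in cast.items():
        for a in actors:
            movies_of[a].add(movie)

    dist = {source: 0}
    seen_movies = set()
    queue = deque([source])
    while queue:
        a = queue.popleft()
        for m in movies_of[a]:
            if m in seen_movies:
                continue
            seen_movies.add(m)
            for b in cast[m]:
                if b not in dist:
                    dist[b] = dist[a] + 1
                    queue.append(b)
    return dist


def frequent_costars(decade_cast, group, min_films=5):
    """Keep the members of `group` who share at least `min_films` movies
    of the decade with another member of `group`."""
    pair_count = Counter()
    for actors in decade_cast.values():
        members = sorted(set(actors) & group)
        for a, b in combinations(members, 2):
            pair_count[(a, b)] += 1
    keep = set()
    for (a, b), count in pair_count.items():
        if count >= min_films:
            keep.update((a, b))
    return keep


# Hypothetical toy data; a lower threshold is used so the example stays small.
cast = {"f1": ["Bacon", "A", "B"], "f2": ["A", "B", "C"],
        "f3": ["C", "D"], "f4": ["E"]}
dist = bacon_numbers(cast, "Bacon")
bacon_one = {a for a, d in dist.items() if d == 1}
kept = frequent_costars(cast, bacon_one, min_films=2)
```

Actors that cannot be reached from the source, such as E in the toy data, simply receive no number.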

The 1-mode co-starring networks of these reduced sets of actorswere constructed for each decade, and a three-dimensional layoutwas generated for each using the Scale-free network layout [2]inGEOMI [1]. Nodes in the force-directed layout were restricted tolie on one of three concentric spheres, depending on the degree ofthe node [2]. The colouring of each node was also used to indicatethe degree. The size of each node was dependant on the number of

Abid el gassad

Abid el mal

Abu Ahmad

Abu Dahab

Abu Hadid

Aguazet seif

Amir el antikam

Ana bint min?

Ana zanbi eh?

Ard el ahlam

Ashki limin?

Asrar el naas

Baad al wedah

Baba Amin

Batal lil nehaya

Beyt al Taa

Cass el azab

Ebn el-hetta

Elf laila wa laila

Fatat el mina

Fatawa, El

Fatawat el Husseinia

Ghaltet ab

Ghazal al-banat

Haked, El

Hamida

Hareb min el ayyam

Hub fil zalam

Ibn al ajar

Imlak, ElIskanderija... lih?Laab bil nar, El

Maktub alal guebin

Malak el zalem, El

Massiada, Al

Matloub zawja fawran

Mohtal, ElMurra kulshi, El

Namrud, El

Nashal, El

Nassab, El

Osta Hassan, El

Port Said

Rasif rakam khamsa

Sawak nus el lail

Sittat afarit, al-

Souk el selahTarik el saada

Zalamuni el habaieb

Zoj el azeb, El

Hamama, Faten

Rostom, Hind

Soltan, Hoda

El Dekn, Tewfik

El-Meliguy, Mahmoud

Hamdi, Imad

Riad, Hussein

Sarhan, Shukry

Shawqi, Farid

Figure 7: Shawqi, Farid and El-Meliguy, Mahmoud

Figure 8: Polizeiruf 110 and Starkes Team

movies in which the corresponding actor starred in that particular decade. Similarly, the width of an edge was used to represent the number of co-appearances between two actors in a decade.
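The two encodings described above (node size from movies per decade, edge width from co-appearances per decade) can be derived directly from cast lists. The following is a minimal sketch rather than the authors' implementation; the `(title, year, cast)` record format, the function name `decade_weights` and the sample data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def decade_weights(records, decade_start):
    """Return (node_size, edge_width) for one decade.

    records: iterable of (title, year, cast) tuples (hypothetical format).
    node_size[actor]   = number of movies the actor appeared in that decade.
    edge_width[(a, b)] = number of co-appearances of a and b that decade,
                         with a < b so each pair has one canonical key.
    """
    node_size = Counter()
    edge_width = Counter()
    for _title, year, cast in records:
        if not decade_start <= year < decade_start + 10:
            continue  # movie lies outside the requested decade
        actors = sorted(set(cast))  # de-duplicate and fix pair orientation
        node_size.update(actors)
        edge_width.update(combinations(actors, 2))
    return node_size, edge_width


# Made-up sample data, for illustration only.
sample = [
    ("M1", 1984, ["a", "b"]),
    ("M2", 1985, ["b", "c", "a"]),
    ("M3", 1991, ["a", "b"]),
]
sizes, widths = decade_weights(sample, 1980)
print(sizes["a"], widths[("a", "b")])
```

Sorting each cast before pairing keeps every undirected pair under a single key, so co-appearances are not split between `(a, b)` and `(b, a)`.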

To effectively illustrate the evolution of the co-starring network, we display smooth animations between the layouts of subsequent decades. The animations are broken into several parts shown one after the other in time, in order to aid retention of the mental map. First, nodes and edges not present in the second layout are faded out. Nodes present in both the first and second layouts are then animated to their new positions in the second layout. Nodes new to the second layout burst out from the centre and come to rest in their calculated positions, and finally new edges are faded in to show the new collaborations in the second decade. The animation is downloadable from http://www.it.usyd.edu.au/~dmerrick/gd05contest/gd05-final.avi
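The staged transition can be expressed as set operations on two consecutive layouts. The sketch below is only an illustration under stated assumptions: the dictionary-based layout format, the function names `transition_stages` and `interpolate`, and the choice of linear interpolation are not taken from the paper.

```python
def transition_stages(nodes_a, edges_a, nodes_b, edges_b):
    """Split the change from layout A to layout B into animation stages.

    nodes_a, nodes_b: dict mapping node -> (x, y) position in each layout.
    edges_a, edges_b: sets of frozenset({u, v}) edges in each layout.
    """
    return {
        # Stage 1: elements that disappear are faded out.
        "fade_out_nodes": set(nodes_a) - set(nodes_b),
        "fade_out_edges": edges_a - edges_b,
        # Stage 2: surviving nodes move to their new positions.
        "move_nodes": set(nodes_a) & set(nodes_b),
        # Stage 3: new nodes burst out from the centre.
        "burst_in_nodes": set(nodes_b) - set(nodes_a),
        # Stage 4: new edges are faded in.
        "fade_in_edges": edges_b - edges_a,
    }


def interpolate(start, end, t):
    """Linear interpolation between two (x, y) points for 0 <= t <= 1."""
    return (start[0] + t * (end[0] - start[0]),
            start[1] + t * (end[1] - start[1]))


# Made-up layouts, for illustration only.
layout_a = {"u": (0.0, 0.0), "v": (2.0, 0.0)}
layout_b = {"v": (4.0, 2.0), "w": (1.0, 1.0)}
stages = transition_stages(layout_a, {frozenset("uv")},
                           layout_b, {frozenset("vw")})
midpoint = interpolate(layout_a["v"], layout_b["v"], 0.5)
print(stages["fade_out_nodes"], stages["burst_in_nodes"], midpoint)
```

Playing the stages one after another, instead of blending them, means the viewer only has to track one kind of change at a time.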


Figure 9: Dona Macabra

KB1                          V          E
Initial                      1324748    3792390
all decades, no filtering    2742       336060
1910s, ≥ 5 films             16         18
1920s, ≥ 5 films             4          2
1930s, ≥ 5 films             25         53
1940s, ≥ 5 films             17         17
1950s, ≥ 5 films             19         18
1960s, ≥ 5 films             16         35
1970s, ≥ 5 films             79         411
1980s, ≥ 5 films             59         73
1990s, ≥ 5 films             207        425
2000s, ≥ 5 films             124        208

Table 3: Graph sizes per decade of co-starring network

This process was continued for all decade slices from 1911 through to 2004, and the result can be seen in the downloadable animation. Figures 10 to 14 show snapshots of the animation from the 1960s through to the early 2000s.

The visualisation revealed a number of interesting facts. One unexpected finding was the substantial number of actors with a Kevin Bacon number of 1 in the early years of the twentieth century, some of whom could clearly not have co-starred in a film with Kevin Bacon. This revealed some problems in the collection of the movie data set. The years of some movies had been recorded incorrectly, while edges to other movies that possessed the same name as a movie of a prior decade were all recorded as belonging to the earlier movie.

In the 1960s (Figure 10), the visualisation shows a clique involving the US president John F. Kennedy. This is due to the assassination of Kennedy in 1963, and the subsequent barrage of documentaries that were produced detailing the event. The other actors in the clique (Jacqueline Kennedy, John and Nellie Connally, etc.) were all present at the assassination. They are present in this data set since the movie JFK, starring Kevin Bacon, included real archive footage of the assassination. The Kennedys continue through to later decades in the visualisation, illustrating the vast number of documentary films developed that were based on this event.

The 1970s, shown in Figure 11, see the first large connected group of Hollywood actors who continue as big names to this day. James Earl Jones, Robert Redford, Steve Martin and John Travolta all appear in this group.

Figure 10: The co-starring actors visualisation (1960s)

Figure 11: The co-starring actors visualisation (1970s)

The visualisation of the 1980s (Figure 12) highlights some particularly close-knit groups of actors. Comedy stars Chevy Chase, Dan Aykroyd and Bill Murray appear due to roles in Saturday Night Live, Caddyshack and Spies Like Us. Also present are Jim Cummings, Jack Angel and Rob Paulsen, who have quite high degrees due to their involvement as voice actors in many short cartoons and episodes.

These groups continue into the 1990s, where the groups of actors become much larger and more highly connected (Figure 13). More well-established modern actors like Whoopi Goldberg, Tom Hanks and Dennis Hopper become particularly prominent in this decade.

Finally, in the 2000s, we see some particularly interesting and unexpected phenomena (Figure 14). First, music stars such as Britney Spears, Beyonce Knowles and Sheryl Crow appear with very high degree and connectedness, due to their participation in numerous music award shows. Secondly, on the other side of the visualisation, popular actor Arnold Schwarzenegger links politicians to the movie stars and musicians in the rest of the co-starring network. This was primarily due to Schwarzenegger’s entry into politics, when he became the governor of the US state of California. Following this event, he was in several political documentaries in which Bill Clinton also appeared. Bill Clinton, in turn, is linked through documentaries and archival footage to other famous politicians, such as Ronald Reagan, Richard Nixon and John F. Kennedy.

Figure 12: The co-starring actors visualisation (1980s)

Figure 13: The co-starring actors visualisation (1990s)

5 A GALAXY OF MOVIE STARS OF TEMPORAL ACTOR-MOVIE NETWORK

This section describes a galaxy of movie stars of the temporal actor-movie network with animation (in order to see the overview), and a visualisation of the network for a specific time slice (in order to see the details).

First we consider a “galaxy of stars” metaphor of the movie-actor network. The main idea is to map the “movie stars” in a movie (i.e. animation) of a galaxy of stars which displays actor-movie interactions.

Figure 14: The co-starring actors visualisation (2000s)

Representing as much information as possible without introducing overwhelming visual complexity has always been a challenge when visualising large data sets. We define important subgraphs to reduce visual complexity. In particular, we define the “stars” from the IMDB as follows:

• every star actor must have been in more than 12 movies over the whole time period

• every star movie must have more than 12 actors

• each star actor must have played in between three and six movies in each year
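The three criteria above can be sketched as a filter over actor-movie appearances. This is a hedged illustration, not the authors' code: the `(actor, movie, year)` triple format and the name `select_stars` are hypothetical, and the sketch assumes that the third criterion is checked for the single year being drawn, with "between three and six" read as inclusive.

```python
from collections import defaultdict


def select_stars(appearances, year):
    """Return (star_actors, star_movies) for the layout of one year.

    appearances: iterable of (actor, movie, year) triples (hypothetical format).
    """
    movies_of = defaultdict(set)  # actor -> movies over the whole period
    cast_of = defaultdict(set)    # movie -> actors
    in_year = defaultdict(set)    # actor -> movies in the requested year
    for actor, movie, movie_year in appearances:
        movies_of[actor].add(movie)
        cast_of[movie].add(actor)
        if movie_year == year:
            in_year[actor].add(movie)

    # Star movies: more than 12 actors.
    star_movies = {m for m, cast in cast_of.items() if len(cast) > 12}
    # Star actors: more than 12 movies overall, and (assumed reading)
    # three to six movies, inclusive, in the requested year.
    star_actors = {
        a for a, movies in movies_of.items()
        if len(movies) > 12 and 3 <= len(in_year.get(a, ())) <= 6
    }
    return star_actors, star_movies


# Made-up data: actor "s" has 13 movies, 4 of them in 1950;
# movie "big" has a cast of 13.
sample = [("s", f"m{i}", 1950 if i < 4 else 1951) for i in range(13)]
sample += [(f"x{i}", "big", 1950) for i in range(13)]
actors_1950, movies_1950 = select_stars(sample, 1950)
print(actors_1950, movies_1950)
```

Under this reading an actor can be a star in one year's layout and absent from the next, which is consistent with nodes appearing and disappearing between frames of the animation.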

We again use a bipartite (2-mode) network model. There are two types of nodes: actor nodes and movie nodes. Actor nodes are displayed as stars in the night sky, and edges are displayed as faint lines joining up “constellations” of actors (see Figure 15). Edges with bends are displayed between actor and movie nodes; however, movie nodes are hidden; in this manner, collaboration between actors can easily be seen. In this case, the picture not only reduces the visual complexity (especially for edges), but also represents actor-movie and actor-actor interactions at the same time.
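One way to read the hidden-movie-node drawing is that every actor-movie edge is still drawn to the movie's position, while only actor nodes are rendered. A minimal sketch under that assumption follows; the function name `drawable_elements` and the data format are hypothetical.

```python
def drawable_elements(positions, actor_movie_edges):
    """Return (visible_points, segments) for the constellation drawing.

    positions: dict node -> (x, y) for both actor and movie nodes.
    actor_movie_edges: iterable of (actor, movie) pairs.
    Movie nodes keep their positions but are not drawn, so two segments
    that meet at the same movie read as one bent actor-actor edge.
    """
    edges = list(actor_movie_edges)
    actors = {actor for actor, _movie in edges}
    visible_points = {actor: positions[actor] for actor in actors}
    segments = [(positions[actor], positions[movie]) for actor, movie in edges]
    return visible_points, segments


# Made-up layout: two actors joined through one hidden movie node "m".
layout = {"a": (0, 0), "b": (2, 0), "m": (1, 1)}
points, segments = drawable_elements(layout, [("a", "m"), ("b", "m")])
print(points, segments)
```

With k actors in a movie this draws k segments instead of the k(k-1)/2 lines a direct actor-actor drawing would need, which is one way to see why the edge clutter is reduced.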

To produce an overview of the temporal network dynamics, we computed a layout for each year from 1907 to 2004 and produced an animation. A two-dimensional force-directed layout was generated for each year’s subgraph using GEOMI [1]. The animation is performed between each layout, in a similar manner to the animation of the co-starring actors network in the previous section. The animation is available from http://www.it.usyd.edu.au/~dmerrick/gd05contest/gd05-final.avi

Having obtained an overview of the temporal network from the animation, we now focus on the details of specific years of the network to observe some interesting patterns in specific time periods.

Figure 16 shows part of the layout of year 1918. The three actors shown co-starred in five movies together; on the other hand, they did not appear in any other movies. Only one of the movies includes actors from outside. This kind of pattern can usually be found in the early years.

Figures 17 and 18 show a different pattern. They are both captured from the layout of year 1983. In Figure 17, nineteen actors co-starred in a masterpiece. In Figure 18, the same group of people starred in a series of movies together, whilst also appearing in other movies with actors from outside the group. Compared to the pattern of early years in Figure 16, one may gain some knowledge and insight about the trends of the movie industry from Figure 17.


Figure 15: A frame from the galaxy of stars animation

Figure 16: Actor collaboration pattern in early years.

Further insights can be discovered by combining company attributes in the visualisation, as Figures 19 to 22 show. There are two clusters in 1985. To assist with analysis, we display the movie nodes with their labels. The two clusters are normal movies and adult movies.

Figures 19 to 22 show some patterns in the evolution: before the 1990s, these two types of movies were clearly separated, meaning that they were produced by different companies with different actors. That is, the two groups seldom collaborated. However, these two groups then started to merge into one big group. The actors started to move around between different companies for collaboration. For example, see the year 1994, where it is difficult to separate these two groups in the picture. This may be an indication of a possible change in the movie industry, as well as in the social network of actors. This visualisation can be a useful supplement to formal analysis methods.

6 CONCLUSION

Integration of good analysis methods with proper visualisation methods is an effective approach to gain an insight into large and complex networks. Our next step is to further integrate various analysis methods with visualisation on different data sets. A formal evaluation on the insights and knowledge derived then needs to be carried out.

Figure 17: Many actors co-starring in one movie.

Figure 18: Same group of people in several movies.

Ultimately, appropriate interaction methods need to be integrated in order to complete our visual analysis framework for large and complex networks.

REFERENCES

[1] A. Ahmed, T. Dwyer, M. Forster, X. Fu, J. Ho, S. Hong, D. Koschutzki, C. Murray, N. Nikolov, A. Tarassov, R. Taib and K. Xu, GEOMI: GEometry for Maximum Insight, Proc. of Graph Drawing 2006, pp. 468-479, 2006.

[2] A. Ahmed, T. Dwyer, S. Hong, C. Murray, L. Song and Y. Wu, Visualisation and Analysis of Large and Complex Scale-free Networks, Proc. of EuroVis 2005, pp. 18, 2005.

[3] D. Auber, Y. Chiricota, F. Jourdan and G. Melançon, Multiscale Visualization of Small World Networks, Proc. of InfoVis, pp. 75-81, 2003.

[4] V. Batagelj, Analysis of large networks - Islands, Dagstuhl seminar 03361: Algorithmic Aspects of Large and Complex Networks, 2003.

[5] U. Brandes and T. Erlebach, Network Analysis: methodological foundations, Springer, 2005.

[6] U. Brandes, M. Hoefer and C. Pich, Affiliation Dynamics with an Application to Movie-Actor Biographies, Proc. of EuroVis 2006, pp. 179-186, 2006.

[7] Graph Drawing 2005 Competition, http://gd2005.org/

[8] Pajek, http://vlado.fmf.uni-lj.si/pub/networks/pajek/

[9] Sunbelt XXVI 2006 Viszard Session.

[10] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications, Cambridge University Press, 1994.


Figure 19: Layout of 1985

Figure 20: Layout of 1988

Figure 21: Layout of 1991

Figure 22: Layout of 1994
