

Wikis as Social Networks: Evolution and Dynamics

Ralf Klamma and Christian Haasler

RWTH Aachen University, Information Systems

Ahornstr. 55, 52056 Aachen, Germany, (klamma|haasler)@dbis.rwth-aachen.de

Abstract. Despite the enormous success of wikis in public and corporate knowledge sharing projects, we do not know much about the evolution and dynamics of wikis. Our approach is to analyze wikis as social networks and apply dynamic network analysis to them. In our prototypical environment we handle the complex data management problems arising when dealing with different wiki engines and different sizes of wiki dumps. The analysis and visualization of evolving wiki networks allow wiki stakeholders to research the social dynamics of their wikis.

1 Introduction

Wikis have become very successful in huge collaborative projects like Wikipedia, an encyclopedia with millions of entries in hundreds of languages, edited by a countless crowd of editors. In organizational settings, too, wikis have been introduced as Web 2.0 tools for knowledge sharing and project management [1–3]. Therefore, the question of what makes a wiki project particularly successful has become very interesting for research and practice. Many added-value services and businesses have been built around the wiki concept, which has led to a great variety of wiki software and wiki hosting services. Even if one can start a wiki with a simple ‘wiki-on-a-stick’ solution like TiddlyWiki, the maintenance of a large collaborative wiki demands more elaborate platforms. Two well-known providers of wiki platforms based on the MediaWiki engine are the Wikimedia Foundation and the hosting service Wikia. A variety of wiki projects are hosted on the MediaWiki engine, e.g. Wikipedia. We concentrate on this engine here. Due to the chosen architecture, other engines can also be supported. We demonstrate this with the TikiWiki engine.

The enormous number of public and organizational wikis has created a long tail [4] of wikis. Besides the very successful and very visible wiki-based knowledge creation and sharing projects, there are many others with far fewer editors and edits.

If we want to give wiki stakeholders tools for analyzing the dynamics and the evolution of their wikis, we have to deal with different wiki hosting software, with wikis ranging from a few hundred nodes to millions of nodes, with wiki dumps from unreliable public and corporate sources, with data management problems, and with complex algorithmic problems. In particular, hosting the analysis and visualization of different wiki engines for the whole long tail of wikis is still a true challenge.

Social networks in computer-mediated communication have also drawn a lot of scientific attention, e.g. [5–7, 1]. In this paper we concentrate on the dynamic analysis of wikis, especially dynamic network analysis (DNA). DNA [8] is an emerging area of science that advances traditional social network analysis (SNA) with the idea that networks evolve over time through changes of nodes and changes of links between nodes. We argue in this paper that DNA is applicable to wikis. For wiki users, wiki managers, and wiki hosting services it is extremely important to know whether a wiki is still going to grow in numbers of authors, edits and articles or whether it is entering a phase of stagnation. It is important to know if and when non-existing articles will be created and edited by users. When a node (a wiki page, an editor, a URL) is ‘important’ at the moment, will it stay important over the lifetime of the wiki or will its importance change over time? If a network is heterogeneous, will it become homogeneous after a while or will it stay that way forever?

In the Web 2.0 not only wikis but also other new media have become tremendously successful [9–11]. By developing standard operations for Web 2.0 data analysis and visualization we hope to encourage communities to apply dynamic network analysis, thus increasing their agency in a world where we leave billions of virtual footprints day by day. To serve the needs of different stakeholders and communities in DNA we have developed a framework called the MediaBase [12]. A MediaBase consists of three elements: (a) a collection of crawlers specialized for distinct Web 2.0 media like blogs, wikis, pods, feeds, and so on; (b) multimedia databases fed by the crawlers, with a common metamodel for all the different media, artifacts, actors, and communities, leading to a community-oriented cross-media repository; (c) a collection of web-based analysis and visualization tools for DNA. Examples of MediaBases are available for technology enhanced learning communities (www.prolearn-academy.org), for German cultural science communities (www.graeclus.de), and for the cultural heritage management of the UNESCO world heritage Bamyian Valley in Afghanistan (www.bamyian-development.org). The WikiWatcher introduced in this paper is part of the MediaBase.

The rest of the paper is organized as follows. In Section 2 we analyze prior approaches and open issues. In Section 3 we characterize wikis as social networks to which DNA is applicable. Section 4 describes the design and architecture of our software prototype WikiWatcher. In Section 5 we present the main results of our analysis of different wikis. We conclude the paper with a discussion and an outlook on further research.


2 Static Analysis of Wikis

There is already a body of literature on the analysis of wikis, and most of the studies concentrate on static aspects. In general, those studies can be classified into studies which make use of the publicly available wiki data (dumps) themselves and studies making use of additional data like access log files [13]. In this paper, we concentrate on the analysis of publicly available wiki dumps. In this regard, we can further distinguish studies concentrating on the static analysis of wiki dumps [14, 15] from those concentrating on the dynamic aspects. But first, let us start with a well-known example: Wikipedia.

Wikipedia is the most researched wiki. We mention only a few studies to characterize the scientific progress in the DNA of wikis. Among the first comprehensive studies of Wikipedia was the 2005 study by Voß [14], which measured Wikipedia according to its network characteristics. In particular, the article reported the changes in size of the wiki database and in the number of articles, words, users and links. Among other things, Voß found that the distribution of links is scale-free with respect to growth and preferential attachment [16]. Wilkinson and Huberman evaluated qualitatively the collaboration of the Wikipedia community. They showed that the accretion of edits to an article is described by a simple stochastic mechanism, resulting in a heavy tail of highly visible articles with a large number of edits [17]. They found that the quality of an article depends on the number of its modifications. Kittur et al. [7] examined the success of Wikipedia. In particular, they analyzed whether it rests on a great number of contributors who each deal with only a few articles (‘wisdom of the crowd’) or on a small elite group of contributors that has the lion’s share (‘power of the few’). The latter is true. The work of Priedhorsky et al. [13] builds on this qualitative view of Wikipedia. They dealt with vandalism in Wikipedia articles. For this purpose, two types of information were used: the Wikipedia articles themselves and their log files. By this means it could be measured which article revisions the visitors had viewed and whether each was an intact or a damaged version. The researchers aimed to quantify the influence of article edits and revisions on the visitors. The number of vandalized pages viewed by real readers is extremely low. Further research classified users with respect to their position in online communities like wikis. Despite Wikipedia’s equal treatment of editors, some members seem to take a leading role [18].

Most research in this area targets Wikipedia and is not readily applicable to arbitrary wikis. A general concept for handling and analyzing any wiki is missing. Wikis are applied in a variety of social and organizational environments. It would be useful to obtain methods and tools for interpreting the social structures that emerge incidentally there. Hence, the motivations of this work are to provide a view of wikis as social networks, to build formal network models, to apply SNA measurements, to visualize wiki networks, and to consider the dynamic aspect of wiki networks. The data basis of most projects and studies is built on wiki log files or direct database access. With respect to the ‘open’ wiki concept, this work uses only public wiki data. Most wikis offer automatically generated dumps, which can also be requested e.g. via the MediaWiki page Special:Export. Articles, links and references as well as authors are treated as network components (actors), not only as a growing number. We apply Actor-Network Theory [19] to the data management, i.e. we do not differentiate between human and non-human actors, and we can aggregate groups of actors into a new actor. Actor relationships and dependencies evolve during any given period. The static network structure of classic SNA is extended accordingly. Qualitative characteristics such as a leading role in the community become measurable by network analysis methods. Characteristics like scale-freeness, hierarchical structures, shortest paths and the centrality (‘importance’) of network components can be measured and analyzed in a time series context. Thus, it is possible to illustrate social change and evolution in wiki networks.

Collaborative work, which plays a fundamental role in wikis, can now be visualized by means of dynamic network visualization. Not only ‘clinical’ numbers and measured values but also graph visualizations help to identify strategic actors and their activities. These possibilities help in dealing with wiki actors, e.g. in a social, economic or security-relevant way. Our concept and our implementation are introduced in the following.

3 Wikis as Dynamic Networks

Social science deals with the analysis of relations between different kinds of actors, such as single persons, interacting groups or organisations. Social network analysis is concerned with patterns of relationships between social actors [20]. Social networks can be seen as constructs of relations and entities like actors and artifacts. Wikis conform to these qualitative aspects of social networks. The main idea of ‘writing articles in common’ by Ward Cunningham [21] can be realized only if wiki users collaborate in creating, modifying and maintaining articles. These writing processes imply different kinds of social networks. Wiki users as well as wiki pages can be seen as objects in a social network that help to achieve the aim of establishing the wiki. Writing articles in common creates relations between the participating authors and hence edges between author nodes in the network. Wiki pages (articles) and their link structure can only be ensured by wiki users. Like most networks in social science, our different kinds of wiki networks evolve during the editing process. At this point the restriction of SNA becomes visible: it lacks dynamic components, although social networks exhibit characteristics like growth and adjustment. To address this challenge we apply dynamic network analysis. The static view is enhanced by a dynamic one, in which the evolution process considers the agency and behavior of the network actors. This is realized by adding one or more time parameters to the networks. The corresponding models are introduced in the following.

A well-known classification of network topology is introduced briefly. Existing empirical and theoretical results indicate that complex networks can be divided into the two major classes of homogeneous and heterogeneous networks [22]. This classification is based on the connectivity distribution $P(k)$, which gives the probability that an arbitrary node is connected to $k$ other nodes [22]. Homogeneous networks are characterized by almost the same number of links at each node.

In contrast, heterogeneous networks are often characterized by the existence of clusters, i.e. aggregations of nodes. Furthermore, their degree distribution is such that not all nodes in the network have the same number of edges [23]. The corresponding distribution function follows the power law $P(k) \sim k^{-\gamma}$ [22, 24].
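As a minimal sketch of how the connectivity distribution $P(k)$ could be estimated from an edge list, consider the following Python fragment. The function name degree_distribution and the toy graphs are illustrative assumptions and not part of WikiWatcher; isolated nodes are ignored because they never appear in an edge list.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: P(k)} for an undirected graph given as (u, v) pairs.

    P(k) is the fraction of nodes that have exactly k links.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)                      # nodes that occur in at least one edge
    counts = Counter(degree.values())    # how many nodes have degree k
    return {k: c / n for k, c in sorted(counts.items())}

# Homogeneous toy example: a ring, every node has two links.
ring = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
# Heterogeneous toy example: one hub holds all links.
star = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]

print(degree_distribution(ring))
print(degree_distribution(star))
```

In the ring all probability mass sits on a single degree, whereas the star spreads it between many weakly connected nodes and one strongly connected hub, which is the qualitative difference between the two network classes.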

Centrality indices [25] are seminal measures for determining the ‘important components’ in social networks. They quantify central nodes, which can often be spotted intuitively in the visualized networks. Degree centrality, which refers to the distribution of links as described above, is one of the simplest measures for determining the influence of a node on its neighbors. For undirected graphs, $d(v)$ is the number of edges adjacent to node $v$; $d^-(v)$ and $d^+(v)$ are defined analogously for directed graphs.
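For directed graphs such as the article graph, the two degree values can be counted in one pass over the edge list. The sketch below assumes the common convention that $d^-(v)$ counts incoming and $d^+(v)$ counts outgoing edges, which the text does not state explicitly; the function name in_out_degree and the sample links are hypothetical.

```python
from collections import defaultdict

def in_out_degree(edges):
    """Return {node: (in_degree, out_degree)} for directed (source, target) pairs."""
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    nodes = set()
    for source, target in edges:
        outdeg[source] += 1
        indeg[target] += 1
        nodes.update((source, target))
    return {v: (indeg[v], outdeg[v]) for v in nodes}

# Hypothetical page links: (linking page, linked page)
links = [("Football", "Sport"), ("Goal", "Football"), ("Referee", "Football")]
degrees = in_out_degree(links)
print(degrees["Football"])
```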

Closeness centrality measures the closeness of a node to all other nodes in the network [26]. In contrast to degree centrality, there is no local restriction any more. The closeness centrality $C_C(v)$ of a node $v$ is defined as follows, where $d(v, t)$ is the distance from $v$ to a node $t \in V$.

$$C_C(v) = \frac{1}{\sum_{t \in V} d(v, t)}$$
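The definition can be illustrated with a breadth-first search, which yields the distances $d(v, t)$ in an unweighted graph. This is a sketch under stated assumptions: the graph is connected and given as an adjacency dictionary, and the function name closeness is chosen for illustration only.

```python
from collections import deque

def closeness(adj, v):
    """Closeness centrality of v: 1 / (sum of shortest-path distances from v).

    adj maps each node to an iterable of its neighbours (unweighted graph).
    Nodes that cannot be reached from v are ignored in this sketch.
    """
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return 1 / total if total else 0.0

# Path graph a - b - c: the middle node is closest to all others.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(closeness(path, "b"), closeness(path, "a"))
```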

Betweenness centrality is based on shortest-path measurements. It indicates which nodes have a strong influence on the network: they control the information flow through the network, since many shortest paths go through them [26]. Betweenness centrality is defined as follows, where $\sigma_{st}$ denotes the number of shortest paths between nodes $s$ and $t$, and $\sigma_{st}(v)$ denotes the number of shortest paths between $s$ and $t$ that pass through $v$.

$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
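One well-known way to evaluate this sum for unweighted graphs is Brandes' algorithm, sketched below. Whether yFiles or WikiWatcher use this exact procedure is not stated in the text, so the code is only an illustration; it sums over ordered pairs $(s, t)$ as in the formula, so values for undirected graphs are twice those obtained when counting unordered pairs.

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness centrality for an unweighted graph.

    adj maps each node to an iterable of neighbours. Returns {node: C_B(node)},
    summed over ordered pairs (s, t) with s != v != t.
    """
    cb = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma) and predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    return cb

path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness(path))
```

In the path graph only the middle node lies on a shortest path between two other nodes, so it is the only node with a non-zero value.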

With respect to the interdependent wiki actors, two network models are established. While the term network refers to the informal concept describing an object composed of elements and interactions or connections between these elements, the natural means to model networks mathematically is provided by the notion of graphs [27].

An article graph is such a graph with directed edges. It can be constructed intuitively by considering wikis as a part of the World Wide Web (WWW). Each wiki article (‘page’) induces a node that is labeled by the page title and its namespace. As one can observe in the XML dumps as well as in the page URLs, almost every page is denoted by its namespace and title (separated by a colon). Namespaces help to group wiki pages. For example, the Wikipedia page Category:Football denotes the category page of football. Page names without a namespace prefix refer to the main namespace of the wiki (denoted by ARTICLE in the following). As in the WWW, articles are linked to each other. Links can be set arbitrarily by wiki users, either to other articles or to external resources like ‘normal’ web pages. Furthermore, it is possible to set links to wiki articles that do not exist yet. In the standard wiki theme those links are colored red. Due to the evolution of a wiki and its time-dependent graphs, four different types of nodes have to be considered:

– ‘Normal’ article nodes, type existing: They already exist in the wiki and in the XML dump. They have a text body and at least one revision.

– Article nodes of type requested: They refer to requested articles to which a link is set in another article. Requested articles are created at a later point within the XML dump, at which their type changes to existing. A usual way to create new articles is to set a link to them in special pages called Seed or Sandbox.

– Article nodes of type never exists: Wiki dumps correspond to a certain time period that begins at the creation of the wiki and ends at the moment of the dump creation. Never existing articles can be seen as part of the set of requested articles, but in contrast to them they are not created before the end of the wiki dump.

– URL nodes: They refer to URL artifacts that are referenced in the text body.

Naturally, the last three node types only possess incoming edges. The set of all nodes is denoted by $V_{article}$, the set of edges by $E_{article}$. Since the graph depends on a certain timestamp, it is defined as $G_{article}(t) = (V_{article}, E_{article})$ where $t \in TS$, with $TS$ a set of timestamps. The ‘oldest’ element of $TS$ corresponds to the wiki creation, the ‘youngest’ one to the point of the wiki dump.
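To make the time dependence of $G_{article}(t)$ concrete, the following sketch builds a snapshot from a list of revisions. It rests on an assumption that the text does not spell out: the graph at time $t$ uses, for every page, the links of its latest revision at or before $t$. The record layout (title, timestamp, links) and the function name article_graph are hypothetical.

```python
def article_graph(revisions, t):
    """Build a snapshot (nodes, edges) of the article graph at timestamp t.

    revisions: iterable of (title, timestamp, links), where links lists the
    page titles and URLs referenced in that revision's text body.
    Assumption: the latest revision of each page with timestamp <= t counts.
    """
    latest = {}
    for title, ts, links in revisions:
        if ts <= t and (title not in latest or ts > latest[title][0]):
            latest[title] = (ts, links)
    nodes = set(latest)              # pages that exist at time t
    edges = set()
    for title, (_, links) in latest.items():
        for target in links:
            nodes.add(target)        # requested, never existing or URL node
            edges.add((title, target))
    return nodes, edges

revisions = [
    ("Football", 1, ["Sport"]),
    ("Sport", 3, ["Football"]),
    ("Football", 5, ["Sport", "Goal", "http://example.com"]),
]
print(article_graph(revisions, 2))
print(article_graph(revisions, 5))
```

At $t = 2$ the page Sport is only a link target without a revision of its own, whereas at $t = 5$ it exists and links back to Football.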

Author graphs cannot be perceived in such an intuitive way as article graphs. They are built on the collaboration of wiki users (authors). The main idea of modifying wiki articles is the equality of wiki authors: every web user is allowed to participate. Of course, due to vandalism there are some exceptions and restrictions for some articles with more or less sensitive content. Furthermore, just a few users have special admin rights, but this does not affect the model. A simple distinction classifies authors into the two types anonymous authors and registered authors. Anonymous authors are denoted by their IP address, registered authors by their username. Consequently, the node set $V_{author}$ contains all authors involved in the wiki. A social relation (undirected edge) between two authors arises when they have worked on a common wiki article, i.e. through an intentional or unintentional collaboration by modifying the text body. $E_{author}$ denotes the set of collaboration edges. Due to the high dynamics and growth of a wiki, an author graph is time-dependent, too. In contrast to article graphs, author graphs take two timestamps $t_0, t_1 \in TS$ as input parameters. Thus, $G_{author}(t_0, t_1) = (V_{author}, E_{author})$ determines the graph in which those nodes are connected whose authors have worked on a common article during the given time period.

According to the introduced wiki graphs and network models, a system database was established. It considers all entities and their dependencies that are required to generate author networks and article networks.
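The definition of $G_{author}(t_0, t_1)$ given above can be illustrated with a short sketch. The record layout (article, author, timestamp) and the function name author_graph are assumptions; for simplicity the node set is restricted to authors who were active in the chosen period.

```python
from collections import defaultdict
from itertools import combinations

def author_graph(revisions, t0, t1):
    """Build (nodes, edges) of the author graph for the period [t0, t1].

    revisions: iterable of (article, author, timestamp).
    Two authors are linked if both modified the same article in the period.
    """
    authors_by_article = defaultdict(set)
    for article, author, ts in revisions:
        if t0 <= ts <= t1:
            authors_by_article[article].add(author)
    nodes, edges = set(), set()
    for authors in authors_by_article.values():
        nodes |= authors
        # Sorting gives every undirected edge one canonical orientation.
        for a, b in combinations(sorted(authors), 2):
            edges.add((a, b))
    return nodes, edges

revisions = [
    ("Football", "Liz", 1),
    ("Football", "Tim", 2),
    ("Sport", "Joe", 3),
    ("Football", "123.45.67.89", 9),  # anonymous author, denoted by IP address
]
print(author_graph(revisions, 1, 3))
```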


Fig. 1. Entity relationship diagram. (Figure not reproduced; it relates the entities wiki article, article revision, wiki author and url via the relationships isA, modified and refersTo.)

Figure 1 shows the entity relationship diagram of the system database. The entities wiki author and wiki article take center stage. They correspond to the network nodes described above. Articles can be identified by their labels consisting of namespace and title. The attribute type reflects one of the three article node types. An article revision is a previous version of an article or the current article itself. Due to the wiki concept, every article revision is saved in the wiki database. Revisions are additionally identified by their revision timestamp. Their size on disk can be an interesting attribute, too. Every article revision may contain an arbitrary number of links in its text body to other articles, but not to article revisions. This follows from the link format, which does not contain any revision or time information, e.g. [[Mathematics]] and [[Category:Science]]. Each article text body may also contain web links of the format [http://example.com] (square brackets are optional). They correspond to the entity URL, which can be part of an arbitrary number of revisions. Last but not least, every wiki author can be identified by name and type (anonymous/registered). Every author may have modified or created 0 to n article revisions, but every revision was modified by exactly one author.
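A relational schema along the lines of this description might look as follows. This is only a sketch: the actual WikiWatcher schema is not given in the text, the table and column layout is inferred from the attribute names in Figure 1, and SQLite is used as a stand-in for the database system.

```python
import sqlite3

# Hypothetical schema; table names and column assignments are assumptions.
SCHEMA = """
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE,
    type      TEXT NOT NULL CHECK (type IN ('anonymous', 'registered'))
);
CREATE TABLE article (
    article_id INTEGER PRIMARY KEY,
    wiki_id    INTEGER NOT NULL,
    namespace  TEXT NOT NULL,
    title      TEXT NOT NULL,
    type       TEXT NOT NULL
               CHECK (type IN ('existing', 'requested', 'never exists')),
    UNIQUE (wiki_id, namespace, title)
);
CREATE TABLE revision (
    revision_id INTEGER PRIMARY KEY,
    article_id  INTEGER NOT NULL REFERENCES article(article_id),
    author_id   INTEGER NOT NULL REFERENCES author(author_id),
    timestamp   TEXT NOT NULL,
    size        INTEGER
);
CREATE TABLE url (
    url_id  INTEGER PRIMARY KEY,
    address TEXT NOT NULL UNIQUE
);
-- A revision refers to articles (not to other revisions) and to URLs.
CREATE TABLE revision_article_link (
    revision_id INTEGER NOT NULL REFERENCES revision(revision_id),
    article_id  INTEGER NOT NULL REFERENCES article(article_id),
    PRIMARY KEY (revision_id, article_id)
);
CREATE TABLE revision_url_link (
    revision_id INTEGER NOT NULL REFERENCES revision(revision_id),
    url_id      INTEGER NOT NULL REFERENCES url(url_id),
    PRIMARY KEY (revision_id, url_id)
);
"""

def create_database(path=":memory:"):
    """Create the sketch schema and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the references
    conn.executescript(SCHEMA)
    return conn

conn = create_database()
conn.execute("INSERT INTO author VALUES (1, 'Liz', 'registered')")
conn.execute("INSERT INTO article VALUES (1, 1, 'ARTICLE', 'Football', 'existing')")
conn.execute("INSERT INTO revision VALUES (1, 1, 1, '2008-01-01T10:00:00Z', 120)")
```

The NOT NULL foreign key on revision.author_id expresses that every revision was modified by exactly one author, so an author row has to exist before a revision that refers to it can be stored.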

4 Design and Architecture

For the realization of the dynamic network analysis of wikis, a two-stage system was developed. As mentioned before, it is called WikiWatcher. WikiWatcher consists of a prototype that contains parsing modules for extracting the required network data from the dump files. Furthermore, a graphical user interface was built offering functions for the generation, visualization and measurement of wiki networks. The data used by the system derives from wikis running the MediaWiki engine and conforms to the introduced network models.

The conceptual design of the system is shown in Figure 2.

Fig. 2. Conceptual design of the system. (Figure not reproduced; Stage 1 covers generating XML dump/export files and parsing wiki data with transfer into a relational database of wiki network data and metadata, Stage 2 covers generating networks, visualization, measurement and network analysis.)

Stage 1 realizes the SAX-based parser tool that gets XML dumps as input data. It extracts the data needed for generating networks later on and transfers it into a system database. The system database serves as an interface between both stages.

Stage 2 uses the network data previously stored in the system database. The prototype at this stage offers functions for generating wiki networks according to input parameters that correspond to the network models mentioned before. It also visualizes networks and applies methods and algorithms of social network analysis to them. These methods support the verification of assumptions and hypotheses about the behavior and characteristics of wiki networks and help to accomplish the dynamic network analysis.

Perl was chosen as the programming language for stage 1 because of its convenient existing modules and its support for regular expressions. One of the most important questions was how to extract network information from a wiki. Due to the tremendous amount of information in wikis like Wikipedia, Wikiversity or Wikia Search, the most feasible and effective way was to use the SAX standard (Simple API for XML). It allows parsing XML documents in linear time and constant memory space. These properties are essential when treating XML dumps with a size in the range of several gigabytes.

In general, the SAX parser works event-based, i.e. when a certain XML tag or attribute occurs, the parser invokes user-defined methods. Element or attribute values can be read and prepared for further processing. Afterwards the memory is cleared and can be reused.
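The event-based style can be illustrated with Python's standard xml.sax module (the original tool is written in Perl, so this is a sketch of the technique, not the authors' code). The sample document is a strongly simplified, hypothetical excerpt of a MediaWiki dump, and the class name RevisionHandler is an assumption. For brevity the handler collects its results in a list; a constant-memory parser would hand each record to the database instead.

```python
import xml.sax

SAMPLE = """<mediawiki>
  <page>
    <title>Football</title>
    <revision>
      <timestamp>2008-01-01T10:00:00Z</timestamp>
      <contributor><username>Liz</username></contributor>
      <text>See [[Sport]].</text>
    </revision>
    <revision>
      <timestamp>2008-01-02T09:30:00Z</timestamp>
      <contributor><ip>123.45.67.89</ip></contributor>
      <text>See [[Sport]] and [[Goal]].</text>
    </revision>
  </page>
</mediawiki>"""

class RevisionHandler(xml.sax.ContentHandler):
    """Collects (title, timestamp, author) for every revision element."""

    def __init__(self):
        super().__init__()
        self.revisions = []
        self._buffer = []    # character data of the element being read
        self._title = None
        self._current = {}

    def startElement(self, name, attrs):
        self._buffer = []
        if name == "revision":
            self._current = {}

    def characters(self, content):
        # May be called several times per element, so collect the pieces.
        self._buffer.append(content)

    def endElement(self, name):
        text = "".join(self._buffer).strip()
        if name == "title":
            self._title = text
        elif name == "timestamp":
            self._current["timestamp"] = text
        elif name in ("username", "ip"):
            self._current["author"] = text
        elif name == "revision":
            self.revisions.append((self._title,
                                   self._current.get("timestamp"),
                                   self._current.get("author")))
        self._buffer = []    # release the data of the finished element

handler = RevisionHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
print(handler.revisions)
```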

The parser tool of WikiWatcher considers in particular the dynamic aspect of wikis. This means that only dumps generated with the option full are processed. These dumps contain wiki pages with all their revisions and timestamps. The schema of the system database that was built for storing the wiki information follows the ER diagram. For example, one relational dependency specifies that each revision is handled by exactly one author. Thus, the authors table has to be filled before the revisions table can be filled. Hence, the parser is divided into sub-modules that take care of these restrictions. After extracting some ‘heading information’ like the wiki name, wiki URL, namespaces etc., the first pass through the XML dump extracts all participating authors. Because redundancies occur ‘naturally’ in XML documents, some database constraints have to be set. In wiki dumps, for instance, author names and IP addresses are stored redundantly.

Another problem arises from the different types of article nodes that result from the evolution of a wiki. This also affects the parsing sequence. The text body of a revision may contain links to other articles (pages) that do not exist at this point in time. Because of the sequential parsing process, two cases can arise: the article occurs later on, or the article never exists in the whole dump. To obtain all articles with their types (existing, requested, never existing) and to store them in the pages table, there must be a first pass that collects the existing pages and a second pass through the text body of each revision that scans for links to requested articles (‘the timestamp of the current revision must be smaller than the timestamp of the first revision of the requested article’) and for links to articles that will not exist at all.
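The two passes can be sketched as follows. The first function records the timestamp of the first revision of every page in the dump, the second classifies a link target relative to the timestamp of the linking revision. Function names and record layouts are illustrative assumptions; timestamps are ISO 8601 strings in UTC, which compare correctly as plain strings.

```python
def first_revisions(revisions):
    """Pass 1: map every page in the dump to its earliest revision timestamp.

    revisions: iterable of (title, timestamp).
    """
    first = {}
    for title, ts in revisions:
        if title not in first or ts < first[title]:
            first[title] = ts
    return first

def classify_target(target, link_ts, first):
    """Pass 2: type of a link target as seen from a revision saved at link_ts."""
    if target not in first:
        return "never exists"   # no page with this title in the whole dump
    if link_ts < first[target]:
        return "requested"      # linked before its first revision was saved
    return "existing"

first = first_revisions([
    ("Football", "2008-01-01T10:00:00Z"),
    ("Football", "2008-02-01T10:00:00Z"),
    ("Goal", "2008-03-01T08:00:00Z"),
])
print(classify_target("Goal", "2008-02-01T10:00:00Z", first))
```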

While scanning the revision text, all occurring URLs are stored in the system database. The last step is to store further revision data like timestamps, size and participating authors (their IDs) and to save the link structure of each revision to other articles and URLs (with references to previously saved data records).

Scanning links and URLs requires appropriate regular expressions. Links may have the form [[Football]], [[Football (Soccer)|Football]] or [http://some.url]. One challenge was that the MediaWiki software allows links to external web pages to be set either with or without square brackets. For URLs without brackets, trailing symbols like question marks, exclamation marks or commas are not considered when generating the corresponding web page link, but ambiguities may occur. Another challenge is posed by typing or syntax errors made by wiki users. Wrong ‘links’ like [[Football] or [[http://www.example.com]] may appear and result in incorrect data records. When editing Wikipedia articles, users are often required to preview the page before saving a modification, but this cannot avoid the problem completely. Because XML dumps are treated as sequential data streams, the time complexity is $O(n)$ and the space complexity is $O(c)$, with $n$ the length of the XML document and $c$ a constant.

For the representation of wiki networks, the Java graph toolkit yFiles has been integrated into WikiWatcher at stage 2. It provides classes and methods for the generation, visualization and measurement of networks. A graphical user interface was built that provides elements to choose parameters like namespaces, node types (article networks) and author types (author networks) as well as timestamps and measurement methods. It further provides functions to compute mean values, standard deviations etc. as well as functions to export visualized networks into common formats. A general problem in research is the representation and visualization of very large networks with more than 10,000 nodes and edges.
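Returning to the link formats discussed for stage 1, the following sketch shows how such links might be scanned with regular expressions. The patterns are illustrative assumptions and not the expressions used in WikiWatcher; they cover only the link forms named in the text and skip the two malformed examples.

```python
import re

# [[Target]] or [[Target|label]]; the target may not contain brackets or '|'.
INTERNAL = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
# [http://...] or [http://... label]; single brackets only.
EXTERNAL = re.compile(r"(?<!\[)\[(https?://[^\s\]]+)(?:\s[^\]]*)?\](?!\])")
# URLs written without brackets.
BARE = re.compile(r"(?<![\[\w])https?://[^\s\[\]<>]+")

def extract_links(text):
    """Return (article_targets, urls) found in a revision text body."""
    articles = [m.group(1).strip() for m in INTERNAL.finditer(text)
                # [[http://...]] is a syntax error, not an article link
                if not m.group(1).strip().lower().startswith(("http://", "https://"))]
    urls = [m.group(1) for m in EXTERNAL.finditer(text)]
    # Trailing '?', '!' and ',' are not part of a bracketless URL.
    urls += [m.group(0).rstrip("?!,") for m in BARE.finditer(text)]
    return articles, urls

text = ("See [[Football]] and [[Football (Soccer)|Football]], "
        "[http://some.url] or http://example.com/page, "
        "but not [[Football] or [[http://www.example.com]].")
print(extract_links(text))
```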

While the parser at stage 1 runs in linear time and constant space, some measurement methods at stage 2, such as centrality indices, are costly to compute [28]. For unweighted graphs, the complexity of computing the betweenness centrality using yFiles is $O(|V| \cdot |E|)$; for closeness centrality it is $O(|V|^2 + |V| \cdot |E|)$. The system database that serves as an intermediary between both stages is realized with IBM DB2.

5 Dynamic Network Analysis of Wikis

The characteristics of wiki networks, their behavior, rates of change and distinctive features during the evolution process are the basis of the DNA. The developed prototype WikiWatcher allows conducting the DNA by offering methods for the measurement and representation of wiki networks. It permits the verification of assumptions about network characteristics. Not only the stage 2 prototype but also the system database allows querying and analyzing wiki data. While the measurement data at stage 2 refers mainly to the network structure, the information gained at the database level refers more to the network dimension of wikis. Structural aspects correspond to characteristics like centrality, clustering, network diameter or shortest-path issues, whereas dimensional aspects cover properties like the number of articles and authors, the number of modifications, and the size of articles as well as their rates of change. We start with a couple of well-known ideas and come to some newer hypotheses later.

The rate of new authors/articles into a wiki network falls off after a period of time. The idea stems from the ‘foundation fever’ of a wiki. Figure 3 shows the growth of the number of authors and of articles for a few wikis. In general, the assumption cannot be verified: a fall-off in the rate of growth could not be determined in either case. The characteristics of the growth may depend on semantic aspects of a wiki, e.g. current events that motivate new users to write new articles, so they have to be examined for each wiki individually. In the case of Wikia Search the cause seems clear: the wiki went public in January 2008, which is visible as a sharp bend in both network types. The measurements of Wikipedia (Simple English) show a progressive growth rate, whereas the growth of Wikiversity fluctuates at times, and other wikis may grow in leaps and bounds. Because Wikiversity’s articles are strongly categorised, further namespaces are included. A remarkable observation emerged that was not anticipated when the two growth rates were considered separately: new authors joining a wiki mostly create new articles rather than work on already existing ones.

Fig. 3. Rate of growth (author/article networks).
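The growth curves discussed for Figure 3 rest on counting, per time step, the authors and articles that appear for the first time. A minimal sketch in Python, assuming revision records are available as (author, article, date) tuples; the function name and the record layout are illustrative and are not taken from WikiWatcher:

```python
from collections import defaultdict
from datetime import date


def monthly_new_counts(revisions):
    """Count, per calendar month, the authors and articles seen for the first time.

    `revisions` is an iterable of (author, article, date) tuples (assumed layout).
    Months without any new author or article are omitted from the result.
    """
    seen_authors, seen_articles = set(), set()
    new_authors = defaultdict(int)
    new_articles = defaultdict(int)
    for author, article, day in sorted(revisions, key=lambda r: r[2]):
        month = (day.year, day.month)
        if author not in seen_authors:
            seen_authors.add(author)
            new_authors[month] += 1
        if article not in seen_articles:
            seen_articles.add(article)
            new_articles[month] += 1
    months = sorted(set(new_authors) | set(new_articles))
    return [(m, new_authors[m], new_articles[m]) for m in months]


# Hypothetical toy data: (author, article, date of revision)
revisions = [
    ("alice", "Main", date(2008, 1, 3)),
    ("alice", "Help", date(2008, 1, 9)),
    ("bob", "Main", date(2008, 2, 1)),
    ("carol", "FAQ", date(2008, 2, 5)),
]
growth = monthly_new_counts(revisions)
# -> [((2008, 1), 1, 2), ((2008, 2), 2, 1)]
```

Plotting the cumulative sums of the two count columns against the month would give curves of the kind described above.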

Wiki networks are heterogeneous during the whole evolution process. In homogeneous networks the number of links k per node is about the average 〈k〉 [22]. Such a uniform distribution could not be verified in (social) wiki networks. Measuring the degree centrality showed an imbalance between the network nodes in terms of their links. As in many social structures, a small portion of actors have an above-average number of links and do most of the work, i.e. editing articles and establishing new relations. This is shown in Figure 4, where two author networks are given (left, center). To come into contact with other users, one needs to edit a lot of articles, but users of this kind are the minority. This distinctive heterogeneity occurs not only in author networks but also in article networks (see Figure 4, right). For article networks this is shown in Figure 5 by means of the degree centrality. Incoming as well as outgoing article edges (links) are observed over a certain time period. In all considered wikis the measurements showed a persistently high standard deviation of the number of edges per node. Depending on semantic issues, the standard deviation of outgoing links may be very high. This is the case for the Aachen Wiki, which serves as an information wiki for the city of Aachen and as an index, which naturally has many outgoing references.

Fig. 4. Heterogeneous author/article networks

Fig. 5. Article networks: degree centrality and standard deviation
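The degree-based heterogeneity measure described here can be sketched as follows. The example assumes an article network given as a directed edge list; the function name, the node names and the toy data are hypothetical, and the population standard deviation is used as one possible choice.

```python
from collections import Counter
from statistics import mean, pstdev


def degree_stats(edges, nodes):
    """Return in-degrees, out-degrees and their mean/standard deviation.

    `edges` is a list of (source, target) article links, `nodes` lists all
    articles so that nodes without links are counted with degree 0.
    """
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    outs = [out_deg[n] for n in nodes]
    ins = [in_deg[n] for n in nodes]
    summary = {
        "mean_out": mean(outs), "std_out": pstdev(outs),
        "mean_in": mean(ins), "std_in": pstdev(ins),
    }
    return in_deg, out_deg, summary


# An index-like page that links to every other article (toy example).
nodes = ["Index", "A", "B", "C", "D"]
edges = [("Index", "A"), ("Index", "B"), ("Index", "C"), ("Index", "D"),
         ("A", "Index")]
in_deg, out_deg, summary = degree_stats(edges, nodes)
# The out-degree standard deviation exceeds the mean out-degree,
# while every node has exactly one incoming link.
```

A standard deviation of the out-degree that is large relative to the mean, as in this toy index page, is the kind of signal the measurements in Figure 5 refer to.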

Central nodes hold their important role during the evolution process. As described, the ‘importance’ of a node can be determined by using the betweenness centrality: nodes with a high value are those through which most shortest paths in the network pass. This measurement was done for Wikia Search for the time period August 2004 to August 2005. The left side of Figure 6 gives the betweenness centrality of every registered author over time (unnormalized for a better view). As with the degree centrality, only a small part of the authors have a high betweenness centrality. In general they hold or increase their high value during the evolution process. The same observation can be made in article networks. The right side of Figure 6 shows the betweenness centrality for Jabber Wiki, a wiki about Jabber, as the name suggests.

Fig. 6. Betweenness centrality of author and article networks (left: CB of the authors of Wikia Search over months 0 to 12; right: CB of the articles of Jabber Wiki over months 0 to 12)

One of the most evident characteristics of wiki networks is their heterogeneity during the entire evolution process. In homogeneous networks the number of links k per node is about the average 〈k〉 [22]. Such homogeneous structures do not appear in wiki author or wiki article networks. Instead, there are a few nodes with many adjacent edges and plenty of nodes with only a few edges. Figure 7 gives an impression of heterogeneous networks. The author network (circular layout) is a collaboration network of anonymous and registered users of the BerlinWiki, which is hosted on Wikia. The article network (organic layout) gives the status of the German Wikia itself in May 2008, including all namespaces. Links to requested but nonexistent articles as well as URLs are excluded because they have only incoming edges. Even so, a strongly unbalanced edge distribution can be observed. The exponential distribution of wiki networks during the whole evolution is also shown by the betweenness and degree centrality indices. Furthermore, the standard deviation of the number of edges points to these characteristics. The heterogeneity can also be explained semantically by considering the ‘intention’ of certain articles. For instance, a few articles have the character of an index, with many outgoing links and references.

Fig. 7. Heterogeneous author/article networks

Considering this issue, a consequential step is to divide (registered) authors into two classes. The distinguishing criterion is how intensively an author participates in articles across the whole wiki. The system database outputs the number of revisions for every author. With the authors ordered by this number, a line can be drawn where the gap between the revision numbers of the authors is large. For example, in the Wikipedia (Simple English) only 377 authors did 93% of the work (revisions), while almost 5,000 authors did only a small part of it (7%). This phenomenon could be observed in all considered wikis, independent of their size: a predominant number of revisions can be attributed to a small group of authors. After a short period of time a small group of users gathers around an article.
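The split of authors into a small, very active class and a large remainder can be illustrated with a short Python sketch. Note that it swaps in a simpler criterion than the one described: instead of looking for the largest gap in the ordered revision counts, it takes the smallest group of top authors that covers a given share of all revisions. The function name, the threshold and the counts are assumptions for illustration only.

```python
def core_authors(revision_counts, share=0.9):
    """Smallest group of most active authors covering at least `share` of all revisions.

    `revision_counts` maps author -> number of revisions and is assumed to be
    non-empty with a positive total. Returns the group and the share it covers.
    """
    ranked = sorted(revision_counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(revision_counts.values())
    core, covered = [], 0
    for author, count in ranked:
        if covered >= share * total:
            break
        core.append(author)
        covered += count
    return core, covered / total


# Hypothetical revision counts per registered author.
counts = {"a": 60, "b": 33, "c": 3, "d": 2, "e": 1, "f": 1}
core, covered_share = core_authors(counts, share=0.9)
# core == ["a", "b"]; two of six authors account for 93% of the revisions
```

Run against the per-author revision numbers from the system database, such a function would report how few authors carry most of the work.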

Registered authors often serve as ‘connectors’ of anonymous author network components. This is another phenomenon that could be discovered in the connections between registered and anonymous wiki users. Working with a number of wikis has shown that it has to be verified for each wiki individually. In the given example of Wikia Search (see left side of Figure 8), it is remarkable that anonymous authors can be identified in a certain way. Although they can only be spotted by their IP addresses, after a period of time they can be divided into single groups, i.e. graph components that are completely separated from each other. It would be interesting for further research to decompose the IP addresses and to locate them geographically. In this way, it could be observed which addresses belong to single users. Adding the registered authors to an anonymous author network yields only one strongly connected network component. The Wikia Search example gives the state of the anonymous author network between t0 = July 15, 2004 and t1 = January 7, 2008 (shortly before the ‘official start’).

Fig. 8. ‘Connectors’ in author and article networks
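The connector effect can be reproduced on toy data with a union-find structure. The sketch below assumes that two authors are linked whenever they edited the same article and treats the network as undirected; the function name, the author names and the IP addresses (taken from the documentation range 192.0.2.0/24) are hypothetical.

```python
def author_components(edits):
    """Group authors into components; authors are linked by a shared article.

    `edits` is an iterable of (author, article) pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_editor = {}
    for author, article in edits:
        root = find(author)
        if article in first_editor:
            # Merge with the component of the article's first editor.
            parent[root] = find(first_editor[article])
        else:
            first_editor[article] = author

    groups = {}
    for author in list(parent):
        groups.setdefault(find(author), set()).add(author)
    return sorted(groups.values(), key=lambda g: sorted(g))


anonymous = [("192.0.2.1", "A"), ("192.0.2.2", "A"), ("192.0.2.3", "B")]
registered = [("Angela", "A"), ("Angela", "B")]

before = author_components(anonymous)              # two separate groups
after = author_components(anonymous + registered)  # one merged component
```

In this toy case the single registered author who edits both articles is what joins the two anonymous groups.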

Nodes with a high betweenness rate are gateways to the rest of the web. As shown on the right side of Figure 8, which gives the state of the article-URL network of the AachenWiki in May 2008, a few article nodes have many outgoing edges to external resources (web pages). These articles can be important as ‘connectors’ to the WWW. An interesting question is whether articles with a high degree centrality with respect to external pages are also articles with a high betweenness centrality with respect to other wiki articles. Articles with a high CB control the information flow within a network. They are important for clicking through the wiki and must be protected against vandalism and other damage. Figure 9 shows ten of the most important article nodes of the Unofficial Google Wiki, which is hosted on Wikia, from January to December 2007. First, one can observe that every node keeps an almost constant betweenness centrality during the whole period. At this point a nice side effect emerges: vandalism was detected by using the model. On July 31, 2007 (see month 8) the content of the main article Google Wiki was deleted completely. This of course removed all edges to other articles. Hence no shortest path could go through the main page, which implies a betweenness centrality of 0, visible in the diagram as a ‘gap’.

Fig. 9. Betweenness and degree centrality of article nodes
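The drop of the betweenness centrality to 0 after a page is blanked can be reproduced on a toy article network. The following Python sketch uses the accumulation scheme of Brandes [28] for directed, unweighted graphs; whether the prototype computes CB in exactly this way is not stated here, and the graph and page names are made up.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness centrality of a directed, unweighted graph.

    `adj` maps each node to a list of successor nodes.
    """
    nodes = set(adj) | {w for targets in adj.values() for w in targets}
    cb = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        stack = []
        preds = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0)   # number of shortest paths from s
        dist = dict.fromkeys(nodes, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                      # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(nodes, 0.0)
        while stack:                      # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    return cb


# Toy wiki: every article links to the main page and back.
wiki = {"Main": ["A", "B", "C"], "A": ["Main"], "B": ["Main"], "C": ["Main"]}
# Blanking the main page removes all of its outgoing links.
blanked = dict(wiki, Main=[])

cb_before = betweenness(wiki)["Main"]     # 6.0: all paths between A, B, C use Main
cb_after = betweenness(blanked)["Main"]   # 0.0: no shortest path passes through Main
```

Tracking such values per snapshot and flagging a sudden fall to 0 for a previously central article is one simple way the ‘gap’ could be turned into a vandalism signal.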

The right side of Figure 9 shows a stable number of outgoing edges to URLs. Both measurements show the strong heterogeneity of wiki networks. In general, a correlation between the article-article CB and the article-URL CD cannot be assumed. However, in the wikis examined, all articles with a CB greater than 0 contain a number of URLs. Hence, these nodes are important in both ways: for the internal structure of the wiki and as connectors to the ‘real world’.

Complex networks become denser during their evolution. The work of Leskovec et al. [29, 30] on the Densification Power Law showed that complex networks may become denser during their evolution and growth. Generally, this could not be verified for wiki author networks. Figure 10 reflects two essential characteristics. A few wikis were examined with respect to their shortest path lengths. The measurements begin at the creation of the wikis (first month) and end at the moment of the XML dump. At each measurement point the largest strongly connected component of an author network was considered, and the average shortest path length from one author node to another was computed. Because collaboration (intended or unintended) is so easy, users are connected to other users very quickly. (Remember, two authors just need to work on the same article.) On average, the shortest path length to another author is not longer than 3.

Fig. 10. Lengths of shortest paths in author networks
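The measurement described above, the average shortest path length within the largest component of an author network, can be sketched with breadth-first search. The example treats the collaboration network as undirected; the helper names and the toy network are assumptions.

```python
from collections import deque


def undirected(edges):
    """Build a symmetric adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def average_path_length(adj):
    """Mean shortest path length over all ordered pairs of the largest component."""
    remaining, largest = set(adj), set()
    while remaining:
        component = set(bfs_distances(adj, next(iter(remaining))))
        remaining -= component
        if len(component) > len(largest):
            largest = component
    if len(largest) < 2:
        return 0.0
    total = sum(d for v in largest for d in bfs_distances(adj, v).values())
    return total / (len(largest) * (len(largest) - 1))


# Two groups of authors, each working on 'their' article.
groups = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e"), ("d", "f"), ("e", "f")]
separate = average_path_length(undirected(groups))
# One new collaboration between c and d merges the two groups.
merged = average_path_length(undirected(groups + [("c", "d")]))
```

In this toy case the average rises from 1.0 to 1.8 at the moment the two groups are joined by a single link.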

But after a period of ‘self-discovery’ the average distances stagnate at nearly 2 for all examined author networks. This kind of wiki self-discovery is depicted in Figure 11, which shows the author networks (anonymous and registered authors) of Wikia (de) in July and August 2007 respectively. In the beginning, authors work in small groups on ‘their’ articles. More and more new authors join the network. After this first phase of the evolution, the strongly connected components merge into one single component (apart from some isolated nodes). The figure shows the important author link between the two components. Of course, the average distance increases when a new component is connected (see the peak in Figure 10). But due to more and more interactions between authors, the average distances level off at 2 until the measurement ends. Hence, a growing densification during the evolution process could not be determined.
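The analysis above judges densification by path lengths. As a complementary check, the Densification Power Law of [29, 30] relates the number of edges e(t) to the number of nodes n(t) as e(t) ∝ n(t)^a, where an exponent a greater than 1 indicates densification. A minimal sketch of estimating a from network snapshots by a least-squares fit on the log-log scale is given below; this fit is not part of the measurements reported here, and the snapshot values are synthetic.

```python
from math import log


def densification_exponent(snapshots):
    """Least-squares slope of log(edges) against log(nodes).

    `snapshots` is a list of (number_of_nodes, number_of_edges) pairs with
    at least two distinct node counts.
    """
    xs = [log(n) for n, _ in snapshots]
    ys = [log(e) for _, e in snapshots]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic snapshots: edges grow like nodes ** 1.5 (densifying) ...
densifying = [(4, 8), (16, 64), (64, 512), (256, 4096)]
# ... versus edges growing proportionally to the nodes (no densification).
proportional = [(10, 20), (100, 200), (1000, 2000)]

a_dense = densification_exponent(densifying)
a_flat = densification_exponent(proportional)
```

For the synthetic data the fitted exponents come out at about 1.5 and 1.0 respectively.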

Fig. 11. Author network evolution

6 Conclusions and Outlook

In this paper, our aim was to establish a dynamic network analysis view on wikis. Wikis are continuously mutating and growing network structures. We introduced different quantitative and qualitative characterizations of wiki networks to model the evolution and dynamics of wikis. In the formal network models we introduced different node types like authors, articles, revisions, and URLs. Each node is annotated with a time component, allowing us to track complex changes in the structures over time. Due to the limited space, we cannot present all the hypotheses we tested in the study; see [31, 32] for more details. We highlighted here that a predominant number of revisions can be attributed to a small group of authors. We described that an anonymous user can be spotted by her editing behavior regardless of the IP address. We demonstrated that wiki pages with a high betweenness centrality also contain a lot of external links, thus serving as gateways to the external web. Finally, we had a closer look at the assumed densification of wiki networks, which could not be confirmed.

The applied DNA refers to structural aspects of wiki networks. Measurements of centrality indices revealed a growing heterogeneity in wiki networks. As in other social networks, we could identify a strong hierarchical structure of important and unimportant nodes. Furthermore, we have built a bridge to the Small World Phenomenon [33–35], which is frequently found in social science. We showed a continuous growth in the number of authors and articles with a remarkable correlation between the two. However, no general assertion could be made about the kind of growth; this has to be checked in each particular case. It nevertheless offers interesting starting points for further research on cross-medial network types like author–article networks. What effect does a weighting of edges have? What kind of influence do minor edits have? In addition, a semantic analysis of the corresponding discussion, talk or user pages in terms of growth and change may be interesting.

What benefits does DNA have for wikis? Besides giving an overview of hidden interrelationships and pointing out remarkable actors, there are some further applications. Vandalism is widespread on the web, and wikis are affected by it, too. So it is necessary to protect particular areas and articles. Wikipedia protects its articles on semantic grounds, i.e. if an article has sensitive content. By means of network analysis, however, articles could be protected according to their importance in order to guarantee a secure information flow in the network. There may be further advantages of considering wiki networks, e.g. economic or social applications that are based on network measurements. These can give recommendations to users according to the gained network data, for instance commercial advertisements or social information.

In this paper only wikis based on the MediaWiki software were considered. To generate the networks according to the models we implemented a two-stage system. It takes care of extracting the data with a crawler, transferring them into a system database, and preparing them for generating and visualizing networks as well as for applying measurement methods. Stage 1 is able to manage XML dumps of arbitrary file size. Parsing is done in linear time and constant space using SAX. Stage 2 uses existing graph drawing libraries and their network analysis algorithms. One of the biggest problems was handling ‘big’ wikis as measured by their number of nodes. The English Wikipedia contains more than 2 million articles (nodes) and its German counterpart still 1 million articles (nodes). How to generate, analyze and visualize such huge networks remains a research challenge. The main aspect of wikis (‘writing articles in common’) is shared by all wiki systems. Every wiki can be represented as a mutating and developing network. Due to the two-stage design, modifications for other wiki software are easily done. We already implemented a modified first stage for the content management system TikiWiki; stage 2 remains untouched. Besides other things like export format, namespaces and author types, the different tagging of links had to be considered (see Table 1).

MediaWiki                  TikiWiki
[[article]]                ((article))
[[article|description]]    ((article|description))
[http://example.com]       [http://example.com]
[http://example.com eg]    [http://example.com|eg]

Table 1. Tagging of links
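The link syntaxes of Table 1 can be recognized with one pair of regular expressions per engine. The following Python sketch is an illustration only: the pattern table and the function name are hypothetical, and the patterns cover just the four forms listed in the table (no nested brackets, templates or other markup).

```python
import re

# One pattern for internal and one for external links per wiki engine.
LINK_PATTERNS = {
    "mediawiki": {
        "internal": re.compile(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]"),
        "external": re.compile(r"\[(https?://[^\]\s]+)(?: ([^\]]*))?\]"),
    },
    "tikiwiki": {
        "internal": re.compile(r"\(\(([^)|]+)(?:\|([^)]*))?\)\)"),
        "external": re.compile(r"\[(https?://[^\]\s|]+)(?:\|([^\]]*))?\]"),
    },
}


def extract_links(text, engine):
    """Return (internal article names, external URLs) found in `text`."""
    patterns = LINK_PATTERNS[engine]
    internal = [m.group(1).strip() for m in patterns["internal"].finditer(text)]
    external = [m.group(1) for m in patterns["external"].finditer(text)]
    return internal, external


mw = extract_links(
    "See [[Aachen|the city]] and [[Dom]] or [http://example.com eg].",
    "mediawiki",
)
tiki = extract_links(
    "See ((Aachen|the city)) and ((Dom)) or [http://example.com|eg].",
    "tikiwiki",
)
# Both calls yield (["Aachen", "Dom"], ["http://example.com"])
```

Keeping the patterns in a table keyed by engine means that supporting a further wiki engine amounts to adding one more entry.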

By adjusting the parser module of stage 1 according to the new requirements, it is possible to adapt the system to the most common wikis and even to TikiWiki. In this manner DNA methods can be applied to arbitrary wikis based on arbitrary wiki engines.
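A streaming parser module of this kind can be sketched with Python's built-in SAX support. The sketch assumes a MediaWiki-style XML dump with page, title, revision, timestamp, contributor, username and ip elements; the handler class, the callback and the sample data are hypothetical and are not the parser of the described system. Only the small fields of interest are buffered, so memory use does not grow with the size of the dump.

```python
import io
import xml.sax


class RevisionHandler(xml.sax.ContentHandler):
    """Streams (title, author, timestamp) triples to a callback."""

    WANTED = {"title", "timestamp", "username", "ip"}

    def __init__(self, on_revision):
        super().__init__()
        self.on_revision = on_revision
        self.buffer = None  # collects text only inside wanted elements
        self.title = self.author = self.timestamp = None

    def startElement(self, name, attrs):
        self.buffer = [] if name in self.WANTED else None

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if self.buffer is not None:
            text = "".join(self.buffer).strip()
            if name == "title":
                self.title = text
            elif name == "timestamp":
                self.timestamp = text
            else:  # username (registered) or ip (anonymous)
                self.author = text
            self.buffer = None
        elif name == "revision":
            self.on_revision(self.title, self.author, self.timestamp)
            self.author = self.timestamp = None


sample = b"""<mediawiki>
  <page>
    <title>Hauptseite</title>
    <revision>
      <timestamp>2008-05-01T10:00:00Z</timestamp>
      <contributor><username>Angela</username></contributor>
      <text>[[FAQ]]</text>
    </revision>
    <revision>
      <timestamp>2008-05-02T11:30:00Z</timestamp>
      <contributor><ip>192.0.2.7</ip></contributor>
      <text>[[FAQ]] [[Helga]]</text>
    </revision>
  </page>
</mediawiki>"""

rows = []
xml.sax.parse(io.BytesIO(sample), RevisionHandler(lambda *r: rows.append(r)))
```

In a full first stage the callback would write each triple to the system database instead of appending it to a list, and an engine-specific link extractor could be applied to the text element in the same pass.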

For dynamic network analysis, incremental dumps are more appropriate than the static dumps we used for extracting timed information from the evolving wiki. Future work also includes the design of such incremental dump options for existing wiki software.

7 Acknowledgments

This work was supported by the German National Science Foundation (DFG) within the collaborative research center SFB/FK 427 ‘Media and Cultural Communication’, within the research cluster established under the excellence initiative of the German government ‘Ultra High-Speed Mobile Information and Communication (UMIC)’ and within the cluster project CONTICI. We thank our colleagues for the inspiring discussions.

References

1. Aronsson, L.: Operation of a large scale, general purpose wiki website: Experience from susning.nu’s first nine months in service. In Carvalho, J.a.A., Hubler, A., Baptista, A.A., eds.: Proceedings of the 6th International ICCC/IFIP Conference on Electronic Publishing, Karlovy Vary, Czech Republic (November 2002) 27–37

2. Lamb, B.: Wiki open spaces: Wikis, ready or not. Educause Review 39(5) (September/October 2004) http://www.educause.edu/apps/er/erm04/erm045.asp, last accessed: November 2008.


3. Aguiar, A., David, G.: Wikiwiki: weaving heterogeneous software artifacts. In: WikiSym ’05: Proceedings of the 2005 international symposium on Wikis, New York, NY, USA, ACM (2005) 67–74

4. Anderson, C.: The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion (2006)

5. Vega-Redondo, F.: Complex Social Networks. Econometric Society Monographs. Cambridge University Press, Cambridge (2007)

6. Adler, B.T., de Alfaro, L.: A content-driven reputation system for the Wikipedia. In: WWW. (2007) 261–270

7. Kittur, A., Chi, E.H., Pendleton, B.A., Suh, B., Mytkowicz, T.: Power of the few vs. wisdom of the crowd: Wikipedia and the rise of the bourgeoisie. In: 25th Annual ACM Conference on Human Factors in Computing Systems (CHI 2007); 2007 April 28 - May 3; San Jose, CA. (2007)

8. Carley, K.M.: Dynamic network analysis. In Breiger, R., Carley, K.M., eds.: Summary of the NRC workshop on Social Network Modeling and Analysis, National Research Council (2003)

9. Kumar, R., Novak, J., Raghavan, P., Tomkins, A.: Structure and evolution of blogspace. Communications of the ACM 47(12) (2004) 35–39

10. Vossen, G., Hagemann, S.: Unleashing Web 2.0: From Concepts to Creativity. Morgan Kaufmann, Burlington, MA (2007)

11. Aigrain, P.: The individual and the collective in open information communities. In: 16th BLED Electronic Commerce Conference. (June 2003) http://hdl.handle.net/2038/957, last accessed: November 2008.

12. Klamma, R., Spaniol, M., Jarke, M.: Pattern-based cross media social network analysis for technology enhanced learning in Europe. In Nejdl, W., Tochtermann, K., eds.: Proceedings of the First European Conference on Technology Enhanced Learning, Crete, Greece, October 3-5. Volume 4227 of LNCS., Berlin Heidelberg, Springer-Verlag (2006) 242–256

13. Priedhorsky, R., Chen, J., Lam, S.T.K., Panciera, K., Terveen, L., Riedl, J.: Creating, destroying, and restoring value in Wikipedia. In: GROUP ’07: Proceedings of the 2007 international ACM conference on Supporting group work, New York, NY, USA, ACM (2007) 259–268

14. Voß, J.: Measuring Wikipedia. In: Proceedings of the 10th International Conference of the International Society for Scientometrics and Informetrics. (2005)

15. Hu, M., Lim, E.P., Sun, A., Lauw, H.W., Vuong, B.Q.: Measuring article quality in Wikipedia: models and evaluation. In: CIKM ’07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, New York, NY, USA, ACM (2007) 243–252

16. Barabasi, A.L., Albert, R., Jeong, H.: Mean-field theory for scale-free random networks. Physica A: Statistical Mechanics and its Applications 272 (1999) 173–187

17. Wilkinson, D.M., Huberman, B.A.: Assessing the value of cooperation in Wikipedia. First Monday 12(4) (April 2007)

18. Reagle Jr., J.M.: Do as I do: authorial leadership in Wikipedia. In: WikiSym ’07: Proceedings of the 2007 international symposium on Wikis, New York, NY, USA, ACM (2007) 143–156

19. Latour, B.: On recalling ANT. In Law, J., Hassard, J., eds.: Actor-Network Theory and After. Oxford (1999) 15–25

20. Breiger, R.L.: The analysis of social networks. In Hardy, M., Bryman, A., eds.: Handbook of Data Analysis. London, SAGE Publications (2004) 505–526


21. Cunningham, W.: Invitation to the patterns list. http://c2.com/cgi/wiki?InvitationToThePatternsList (2005)

22. Albert, R., Jeong, H., Barabasi, A.L.: Error and attack tolerance of complex networks. Nature 406 (2000) 378–382

23. Albert, R., Barabasi, A.L.: Statistical mechanics of complex networks. Reviews of Modern Physics 74 (2002) 47

24. Barabasi, A.L., Albert, R.: Emergence of scaling in random networks. Science 286 (1999) 509

25. Koschutzki, D., Lehmann, K.A., Peeters, L., Richter, S., Tenfelde-Podehl, D., Zlotowski, O.: Centrality indices. In Brandes, U., Erlebach, T., eds.: Network Analysis: Methodological Foundations. Springer (2005)

26. Brandes, U., Kenis, P., Wagner, D.: Communicating centrality in policy network drawings. IEEE Transactions on Visualization and Computer Graphics 9(2) (2003) 241–253

27. Brandes, U., Erlebach, T.: Fundamentals. In Brandes, U., Erlebach, T., eds.: Network Analysis: Methodological Foundations. Springer (2005)

28. Brandes, U.: A faster algorithm for betweenness centrality. Journal of Mathematical Sociology 25(2) (2001) 163–177

29. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1(1) (2007) 1–40

30. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graphs over time: densification laws, shrinking diameters and possible explanations. In: KDD ’05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, New York, NY, USA, ACM (2005) 177–187

31. Haasler, C.: Dynamische Netzwerkanalyse von Wikis. Diplomarbeit, RWTH Aachen, Lehrstuhl für Informatik 5 (December 2007)

32. Klamma, R., Haasler, C.: Dynamic network analysis of wikis. In: Proceedings of I-Know’08 and I-Media’08, International Conferences on Knowledge Management and New Media Technology, Graz, Austria, September 3-5, 2008, Journal of Universal Computer Science (J.UCS) (2008) 161–168

33. Milgram, S.: The small-world problem. Psychology Today 1(1) (1967) 60–67

34. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393 (1998) 440–442

35. Adamic, L.A.: The small world web. In: ECDL ’99: Proceedings of the Third European Conference on Research and Advanced Technology for Digital Libraries, London, UK, Springer-Verlag (1999) 443–452