

Background

Wikipedia is an open-source encyclopedia built on wiki technology. It relies on a large community of editors who share the goal of providing a clear and unbiased source of global information. Wikipedia also depends on a smaller group of devoted administrators (admins), who are elected by the community and serve it by settling disputes and cleaning up pages. Much HCI research seeks to determine what makes editors want to become admins, what signals a good admin, and what choices editors make on their way to becoming one.

Objective

We sought to determine how the amount and type of work performed by admin nominees differed over time, depending on whether they were selected to become admins. This information could serve as an evaluation metric for nominees and admin voters, but most significantly as an audit of how well Wikipedia selects its admins and of how the decision affects nominees before and after it is made.

Wikipedia Nominee and Administrator Evaluation Metrics

David Bunker, Robert Kraut PhD

Setup and Methods

We analyzed the revision data of a set of 1438 Wikipedia nominees who had a request for adminship (RFA) between January 2003 and January 2006. Of the nominees we investigated, 856 became admins and 627 did not.

We evaluated nominee activity by creating, for each editor, a vector of the activity that editor performed in each month. All data were centered on the RFA date, which lets us compare nominees with one another at the same stage of their progress.
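The RFA-centered monthly vector described above could be built roughly as follows. This is a minimal sketch, not the code used for the poster: the function names, the 12-month window, and the sample dates are all hypothetical.

```python
from datetime import date

def month_offset(when: date, rfa: date) -> int:
    """Whole calendar months from the RFA month to `when` (negative = before the RFA)."""
    return (when.year - rfa.year) * 12 + (when.month - rfa.month)

def activity_vector(revision_dates, rfa, window=12):
    """Count revisions per month for offsets -window..+window around the RFA month.

    The window size is an assumption; the poster does not state one.
    """
    counts = [0] * (2 * window + 1)
    for d in revision_dates:
        k = month_offset(d, rfa)
        if -window <= k <= window:
            counts[k + window] += 1  # index `window` is the RFA month itself
    return counts

# Hypothetical nominee: four revisions around an RFA held in June 2005.
rfa_date = date(2005, 6, 15)
revisions = [date(2005, 4, 2), date(2005, 6, 1), date(2005, 6, 30), date(2005, 9, 9)]
vector = activity_vector(revisions, rfa_date, window=3)
```

Because every vector uses the RFA month as its midpoint, position k in one nominee's vector is directly comparable to position k in another's, regardless of the calendar date of each RFA.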

We extracted the number of revisions, pages, talk revisions, and user page revisions, along with their relative proportions. The number of revisions is an indicator of interest in Wikipedia and of admin quality. Revisions/pages gives some indication of the breadth of the nominee's edits. Finally, talk revisions/total revisions indicates how much the user interacts with the Wikipedia community as opposed to just cleaning articles.
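As an illustration of the two ratios, a minimal sketch for one nominee-month follows. The function name, the zero-handling convention, and the sample counts are assumptions made for this example, not details from the study.

```python
def activity_ratios(total_revs, pages, talk_revs):
    """Return (revisions per page, talk share of revisions) for one nominee-month.

    Months with no pages or no revisions yield 0.0; this convention is an
    assumption, since the poster does not say how empty months were treated.
    """
    revs_per_page = total_revs / pages if pages else 0.0
    talk_share = talk_revs / total_revs if total_revs else 0.0
    return revs_per_page, talk_share

# Hypothetical month: 120 revisions spread over 40 pages, 30 of them on talk pages.
revs_per_page, talk_share = activity_ratios(120, 40, 30)
```

A higher revisions/pages value means edits are concentrated on fewer pages, while a higher talk share means more of the nominee's activity is discussion rather than article editing.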

We stored all data in Hadoop and extracted the desired statistics using Pig, a MapReduce-based parallel programming language.

The NMF Algorithm

Results Analysis

The basis spectra shown indicate two idealized trajectories produced by the NMF algorithm. The first is a rapid increase followed by decay; the second is a rapid increase followed by more sustained activity. This holds for both features analyzed. The peak could indicate that people are nominated when they are at peak performance and then drop back toward the mean. Through its regularization, NMF also indicates a correlation between sustained activity and those chosen, and between decay and those not chosen.

It is interesting that activity drops off much more slowly for successful candidates than for unsuccessful ones. This may suggest that the selection process works. It could also mean that becoming an admin sustains activity, or that not becoming an admin discourages candidates. Because some people who were not granted adminship continue to be productive, the data could also indicate that they should have been given adminship.

The average vectors indicate a steep drop in activity after the RFA period. Although those who became admins continue to revise, tend to revise more per page, and use their talk pages more than those who did not become admins, the results are still similar for both groups.

Future Work

Although we have found a steep drop in nominee activity after the RFA date and, using NMF, have roughly correlated a greater drop-off with those who were refused adminship, it remains unclear how voters could have determined the degree of drop-off at the time they voted rather than in hindsight. NMF could potentially be applied to determine these features.

                Spectra 1          Spectra 2          Spectra 3
Became Admin    401/856 = 46.8%    296/856 = 34.6%    139/856 = 16.2%
Not Admin       196/627 = 31.3%    272/627 = 43.4%    109/627 = 17.4%

[Figure: Basis Spectra ((Talk Page Revs)/(All Revs)), panels 1, 2, 3]

                       Spectra 1          Spectra 2
Became Admin           320/856 = 37.4%    536/856 = 62.6%
Didn't Become Admin    333/627 = 53.1%    295/627 = 47.1%
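One way counts like those in the tables could be produced is to assign each nominee to the basis spectrum with the largest NMF weight and then tabulate the assignments by outcome. The poster does not state its assignment rule, so the largest-weight rule, the function names, and the toy weights below are all assumptions.

```python
from collections import Counter

def dominant_spectrum(weights):
    """1-based index of the basis spectrum with the largest weight (assumed rule)."""
    return max(range(len(weights)), key=lambda j: weights[j]) + 1

def tabulate(nominees):
    """Map (became_admin, spectrum) to (count, group size, percent of group).

    `nominees` is an iterable of (became_admin, weights) pairs.
    """
    counts, totals = Counter(), Counter()
    for became_admin, weights in nominees:
        counts[(became_admin, dominant_spectrum(weights))] += 1
        totals[became_admin] += 1
    return {
        key: (n, totals[key[0]], 100.0 * n / totals[key[0]])
        for key, n in counts.items()
    }

# Hypothetical weights for four nominees and two spectra.
toy_nominees = [
    (True, [0.1, 0.9]),
    (True, [0.8, 0.2]),
    (False, [0.7, 0.3]),
    (True, [0.2, 0.5]),
]
table = tabulate(toy_nominees)
```

Each entry has the same shape as a table cell: a count, the size of the outcome group, and the resulting percentage.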

[Figure: Basis Spectra (All Revs Total), panels 1, 2]

[Figure: Average Revs, Average Revs/Pages, and Average Talk/Revs; panels: Became Admin, Didn't Become Admin, Both results Combined]

Non-negative matrix factorization (NMF) is a useful tool for finding the features that distinguish graphs. In our case, NMF takes as input the vectors of user data over time and forms a small number of graphs (two or three in our analyses) that best represent the data. These graphs are called the basis. "Best represent" means that the original data can be recreated as closely as possible from linear combinations of these graphs. This is called deflation. The coefficients of each linear combination are called the weights. NMF works by performing a form of gradient descent defined by its update equations.

There are many different NMF algorithms available; we chose the regularized, multiplicative-update Lee-Seung algorithm (EMML) with the Kullback-Leibler cost function.
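For reference, the unregularized Lee-Seung multiplicative updates for the Kullback-Leibler cost are commonly written as shown here. The notation Y ≈ AX, with A holding the basis graphs and X the weights, is an assumption chosen to match the subscripts of the sparsity coefficients αSa and αSx. The regularized variant used in this work adds sparsity terms that are not reproduced in this block.

```latex
% Assumed notation: Y (data) is approximated by A X,
% with A the basis matrix and X the weight matrix, all entries non-negative.
\[
D(Y \,\|\, AX) = \sum_{i,k} \left( y_{ik} \log \frac{y_{ik}}{[AX]_{ik}} - y_{ik} + [AX]_{ik} \right)
\]
\[
x_{jk} \leftarrow x_{jk} \, \frac{\sum_{i} a_{ij} \, y_{ik} / [AX]_{ik}}{\sum_{i} a_{ij}},
\qquad
a_{ij} \leftarrow a_{ij} \, \frac{\sum_{k} x_{jk} \, y_{ik} / [AX]_{ik}}{\sum_{k} x_{jk}}
\]
```

Each update multiplies the current entry by a non-negative factor, so A and X stay non-negative without any explicit projection step.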

Regularization is important. We want to separate successful nominees from unsuccessful ones, and NMF is well suited to this sort of feature matching. We used the NMFLAB Matlab toolkit and additional scripts to process the data.

To enforce a sparse representation, we used the regularized EMML update equations. For our data, we want the weights to be sparse, whereas sparsity does not matter for the basis graphs, so we set the sparsity coefficients αSa = 0 and αSx = 0.3.
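The idea of sparse weights with an unconstrained basis can be sketched in NumPy as follows. This is not the NMFLAB implementation. The function name `sparse_kl_nmf`, the L1-style penalty added to the denominator of each update, the months-by-nominees orientation of Y, and the toy data are all assumptions. Only the coefficient values 0 and 0.3 come from the text.

```python
import numpy as np

def sparse_kl_nmf(Y, rank, alpha_x=0.3, alpha_a=0.0, n_iter=200, seed=0, eps=1e-9):
    """KL-cost multiplicative-update NMF with L1-style sparsity penalties.

    Y is assumed to be (months x nominees). Returns A (months x rank), whose
    columns are basis trajectories scaled to sum to 1, and X (rank x nominees),
    the per-nominee weights. The penalty form is an assumed variant, not
    necessarily the rule NMFLAB applies.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = Y.shape
    A = rng.random((n_rows, rank)) + eps
    X = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        ratio = Y / (A @ X + eps)  # elementwise y / [AX]
        # Weight update; alpha_x > 0 shrinks the weights toward zero.
        X *= (A.T @ ratio) / (A.sum(axis=0)[:, None] + alpha_x)
        ratio = Y / (A @ X + eps)
        # Basis update; alpha_a = 0 leaves the basis unpenalized.
        A *= (ratio @ X.T) / (X.sum(axis=1)[None, :] + alpha_a)
        # Rescale so each basis column sums to 1; the product A @ X is unchanged.
        scale = A.sum(axis=0, keepdims=True) + eps
        A /= scale
        X *= scale.T
    return A, X

# Toy data: two hypothetical trajectories (spike-and-decay, sustained) mixed
# with random non-negative weights for eight made-up nominees.
toy_basis = np.array([[1, 3, 6, 2, 1, 0.5], [1, 2, 4, 4, 4, 4]], dtype=float).T
toy_weights = np.random.default_rng(1).random((2, 8)) * 10
toy_Y = toy_basis @ toy_weights
A_est, X_est = sparse_kl_nmf(toy_Y, rank=2)
```

Setting `alpha_x` above zero while leaving `alpha_a` at zero mirrors the stated choice: the penalty pushes each nominee toward being described by few basis graphs, while the shapes of the graphs themselves are left unconstrained.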