What Did They Do? Deriving High-Level Edit Histories in Wikis
Peter Kin-Fong Fong and Robert P. Biuk-Aghai
Data Analytics and Collaborative Computing Group
Department of Computer and Information Science
Faculty of Science and Technology
University of Macau
Motivation
1000s of edits & editors a day
But:
Which edits are significant?
Who provides significant edits?
Look at wiki article history? Tedious!
Automatically analyze the nature of the edits
Try to answer:
What Did They Do?
[Chart: database edits per day by language pair, from en–de (~450,000) down to ca–no (~15,000); source: http://stats.wikimedia.org/EN/PlotsPngDatabaseEdits.htm]
Revision History
Looking at Byte Count?
2,070 bytes (new) – 2,055 bytes (old) = 15 bytes changed
But:
56 bytes were cut
71 bytes were added
56 + 71 = 127 bytes changed
Looking at Minor vs. Non-Minor Edit?
→ Minor edit flag: may not be available; subjective
53 bytes changed (flagged Minor)
1 byte changed (flagged Non-Minor)
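The byte-count pitfall above can be reproduced with a character-level diff. This is an illustrative sketch using Python's difflib, not the authors' tooling:

```python
import difflib

def bytes_changed(old: str, new: str) -> dict:
    """Net byte difference vs. bytes actually cut and added.

    Sketch using difflib.SequenceMatcher at character level;
    not the authors' exact algorithm.
    """
    sm = difflib.SequenceMatcher(a=old, b=new)
    cut = added = 0
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("delete", "replace"):
            cut += i2 - i1      # bytes removed from the old version
        if tag in ("insert", "replace"):
            added += j2 - j1    # bytes added in the new version
    return {"net": len(new) - len(old), "cut": cut, "added": added}
```

As in the slide's example, a small net change (15 bytes) can hide far more actual editing (127 bytes cut plus added).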
Related Work
Related Work: Text Differencing
An O(ND) Difference Algorithm and Its Variations (Myers, 1986)
Longest common subsequence method
Does not take movement into account
Basis of Wikipedia “Diff” function
Alpha Bravo Charlie Delta Echo
Alpha Charlie Delta Golf Bravo
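The limitation can be demonstrated directly (difflib's SequenceMatcher here, standing in for Myers' algorithm): the relocated word "Bravo" surfaces as a deletion plus a re-insertion, never as a movement.

```python
import difflib

old = "Alpha Bravo Charlie Delta Echo".split()
new = "Alpha Charlie Delta Golf Bravo".split()

# An LCS-style diff has no "move" operation, so a moved word is
# reported once as removed and once as added.
ops = [op for op in difflib.SequenceMatcher(a=old, b=new).get_opcodes()
       if op[0] != "equal"]
deleted = [w for tag, i1, i2, j1, j2 in ops for w in old[i1:i2]]
inserted = [w for tag, i1, i2, j1, j2 in ops for w in new[j1:j2]]
```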
Related Work: Text Differencing
The String-to-String Correction Problem with Block Moves (Tichy, 1984)
Diff by matching blocks of text, regardless of position
Movement can be detected
Basis of Wikitrust
Text Differencing Granularity
Flexible Diff-ing in a collaborative writing system (Neuwirth et al., 1992)
Differencing on word level only is not sufficient
Hierarchical decomposition:
Paragraph, Sentence, Phrase, Word, Character
Further decompose only when similarity is high
Wikipedia “Diff” function works at two levels:
Paragraph & Word
Always does a word-level diff when a paragraph is changed
Often produces hard-to-read difference statements
A hard-to-read Wikipedia Diff
Related Work: Edit Categorization
X. de Pedro Puente, 2007
Wiki e-learning environment
Requires its student editors to categorize their own edits
Markup improvement, New information, …
Gorgeon and Swanson, 2009
Studying the evolution of a concept
Classified each of 3,665 edits by eye: “simple but tedious”
Categories:
Commonly used: Vandalism, Spam
Self-defined: Challenged, Unchallenged
Problems
Produce an intuitive difference statement between two versions of text
Rate the significance of an edit
A metric that can be compared across different articles
Proposed Design
Wiki Edit History Analyzer
MediaWiki → Revisions of Article
→ 1. Lexical Analyzer → Sequence of Tokens
→ 2. Text Differencing Engine → List of Basic Edit Actions
→ 3. Action Categorizer → High-Level Edit Actions
→ 4. History Summarizer → Summary of changes
Step 1: Lexical Analyzer
Break the raw text into tokens of words and symbols
Divide the whole article into sentences
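A minimal sketch of this step; the actual token classes and sentence-boundary rules of the analyzer are assumptions:

```python
import re

def tokenize(text: str):
    """Break raw text into tokens of words and symbols.

    Simplified sketch: a word is a run of word characters, every
    other non-space character is its own symbol token.
    """
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(text: str):
    """Divide text into sentences, splitting naively after ., ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```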
Step 2: Text Differencing Engine
Two-level differencing:
Sentence level
Token (word & markup) level
Approximate sentence matching
Matching rate:
Movement detection
Target: minimize the number of movement actions
Only consider segments with 4 or more tokens, to avoid falsely tagging common words
m = 2 · c_m / (c_o + c_n)
(c_m: matched tokens; c_o, c_n: token counts of the old and new sentence)
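Assuming the matching rate is the common-token ratio m = 2·c_m / (c_o + c_n), a sketch:

```python
from collections import Counter

def matching_rate(old_tokens, new_tokens) -> float:
    """Approximate sentence matching rate m = 2*c_m / (c_o + c_n).

    c_m is taken as the multiset intersection of the two token lists;
    the exact formula is reconstructed from the slides.
    """
    c_m = sum((Counter(old_tokens) & Counter(new_tokens)).values())
    return 2 * c_m / (len(old_tokens) + len(new_tokens))
```

Sentence pairs whose rate exceeds a threshold (80% in the spelling-correction rule that follows) are treated as versions of the same sentence.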
Step 3: Action Categorizer
Rule-based categorization
Example: Spelling correction
If the matching rate of two sentences > 80%:
Calculate the character-level edit distance
If edit distance ≤ 3 ⇒ Spelling correction
Rosaleen's story to the hunstman/wolf: A she-wolf who arrives at a village.
Rosaleen's story to the huntsman/wolf: A she-wolf who arrives at a village.
93% matching rate, edit distance = 2
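The rule can be sketched as follows; edit_distance is a standard Levenshtein implementation, and the thresholds are the ones on the slide:

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def is_spelling_correction(old: str, new: str, rate: float) -> bool:
    """Rule from the slides: matching rate above 80% and character
    edit distance of at most 3."""
    return rate > 0.80 and edit_distance(old, new) <= 3
```

On the slide's example, "hunstman" vs. "huntsman" has edit distance 2, so with a 93% matching rate the pair is categorized as a spelling correction.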
Step 3: Action Categorizer
Wikipedia-specific edit categories:
Wikify
Inter-language links
Spelling correction
Add / Modify category
Add references
Content re-organization
Content re-writing
…
Step 4: History Summarizer
Summarize edits
Generate edit summary
Calculate edit significance
Edit significance by weighted sum
s = s_high + s_basic

s_high = Σ_{x=1}^{m} Σ_{i=1}^{n} w_{x,i} · c_{x,i}

s_basic = Σ_{i=1}^{n} (w_{ins,i} · c_{ins,i} + w_{del,i} · c_{del,i} + w_{repl,i} · c_{repl,i} + w_{mov,i} · c_{mov,i})

(w: weights; c: counts, over high-level action types x and basic actions insert, delete, replace, move)
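A sketch of the weighted-sum significance, collapsing the double sum to a single edit's action counts. All weight values here are hypothetical placeholders; choosing them is listed as ongoing work:

```python
# Hypothetical weights; the real values are an open question in the
# slides ("Decide weights of the edit significance model").
HIGH_LEVEL_WEIGHTS = {"spelling_correction": 0.5, "add_references": 3.0}
BASIC_WEIGHTS = {"ins": 1.0, "del": 0.8, "repl": 1.2, "mov": 0.6}

def edit_significance(high_counts: dict, basic_counts: dict) -> float:
    """Weighted sum s = s_high + s_basic over action counts,
    following the formulas on the previous slide."""
    s_high = sum(HIGH_LEVEL_WEIGHTS.get(a, 1.0) * c
                 for a, c in high_counts.items())
    s_basic = sum(BASIC_WEIGHTS[a] * c for a, c in basic_counts.items())
    return s_high + s_basic
```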
Prototype
Currently performs the first 3 steps
History summarizer under development
Implemented in Java, PHP front-end to MediaWiki
Produces categorized edit statements
At early alpha stage
Source code available at http://sourceforge.net/projects/weha/
Prototype Screenshot
Compare with Wikipedia diff
Preliminary Evaluation: Overview
10 articles
Length from 2,000 to 41,000 characters
Consecutive revisions with the most edit actions were chosen (2 – 18 edit actions)
10 student volunteers, about equally distributed in terms of:
Gender (6 male, 4 female)
Technical background (4 have, 6 haven’t)
Education level (5 undergrads, 5 postgrads)
Each student evaluated 2 articles
Preliminary Evaluation: Process
Printouts of both versions were presented
End-user view, not source wiki text
Identical paragraphs removed to reduce evaluators’ workload
Evaluators mark and categorize changes manually
Present the edit list generated by our prototype
For each item in the list, ask the evaluator whether they agreed or not
State the reason for disagreement, if any
Preliminary Evaluation: Results
Average agreement rate: 84.1%
11 out of 20 evaluations agreed 100% with our edit list
Remaining 9 evaluations ranged from 33.3% to 88.9%
No significant differences between the different evaluator groups
Preliminary Evaluation: Results
Low agreement rate sample examined:
Old: electrons quickly go around the nucleus
New: electrons move around the nucleus very quickly
“quickly” not considered moved because single word is lower than movement recognition threshold (4 tokens)
Possible improvement:
Consider intra-sentence moves regardless of length of token sequence
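The length threshold and the proposed relaxation can be sketched together: a greedy approximation of movement detection (not the authors' exact algorithm, which aims to minimize the number of movement actions) with the 4-token threshold as a parameter. Lowering min_len corresponds to the improvement above.

```python
MIN_MOVE_LEN = 4  # threshold from the slides: ignore shorter segments

def detect_moves(old, new, min_len=MIN_MOVE_LEN):
    """Greedy sketch: report (old_pos, new_pos, length) for segments
    of at least `min_len` tokens that occur in both versions at
    different positions."""
    moves = []
    i = 0
    while i <= len(old) - min_len:
        # grow the longest segment starting at old[i] found in new
        length = 0
        while (i + length < len(old)
               and _find(new, old[i:i + length + 1]) != -1):
            length += 1
        if length >= min_len:
            j = _find(new, old[i:i + length])
            if j != i:          # position changed -> movement
                moves.append((i, j, length))
            i += length
        else:
            i += 1
    return moves

def _find(seq, sub):
    """Index of sublist `sub` in list `seq`, or -1."""
    for k in range(len(seq) - len(sub) + 1):
        if seq[k:k + len(sub)] == sub:
            return k
    return -1

# With the default threshold, the single word "quickly" in the
# slide's example is not reported as moved; min_len=1 would catch it.
```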
Conclusions
Our contributions:
New difference model for wiki content
Design of edit history analyzer
Prototype implementation
Ongoing Work
Intuitiveness of edit statements
Found some issues in preliminary evaluation
More adjustments & experiments needed
Decide weights of the edit significance model
Any good method to consult the Wikipedia community?
Design of questionnaire and follow-up experiments
Prototype source code available at:
http://sourceforge.net/projects/weha/
Data Analytics and Collaborative Computing Group
http://www.fst.umac.mo/se/dacc/