An Unsupervised Approach for the Detection of Outliers in Corpora
David Guthrie, Louise Guthrie, Yorick Wilks
The University of Sheffield
Corpora in CL
• Increasingly common in computational linguistics to use textual resources gathered automatically
o IR, scraping the Web, etc.
• Construct corpora from specific blogs, bulletin boards, websites (Wikipedia, RottenTomatoes)
Corpora Can Contain Errors
• IR and scraping can lead to errors in precision
• Can contain entries that might be considered spam:
o Advertising
o Gibberish messages
o (More subtly) information that is an opinion rather than a fact, e.g. rants about political figures
Difficult to verify
• The quality of corpora has a dramatic impact on the results of QA, ASR, TC, etc.
• Creation and validation of corpora has generally relied on humans
Goals
• Improve the consistency and quality of corpora
• Automatically identify and remove text from corpora that does not belong
Approach
• Treat the problem as a type of outlier detection
• We aim to find pieces of text in a corpus that differ significantly from the majority of text in that corpus and thus are ‘outliers’
Method
• Characterize each piece of text (document, segment, paragraph, …) in our corpus as a vector of features
• Use these vectors to construct a matrix, X, which has number of rows equal to the pieces of text in the corpus and number of columns equal to the number of features
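The matrix construction above can be sketched as follows. This is a minimal illustration, not the authors' implementation; `extract_features` is a hypothetical stand-in that computes a few toy surface features rather than the paper's full feature set.

```python
import numpy as np

def extract_features(text):
    # Toy stand-in for the real feature extractor:
    # word count, average word length, fraction of capitalized words.
    words = text.split()
    n_words = max(len(words), 1)
    avg_word_len = sum(len(w) for w in words) / n_words
    frac_upper = sum(w[0].isupper() for w in words) / n_words
    return [len(words), avg_word_len, frac_upper]

def build_feature_matrix(pieces):
    # One row per piece of text, one column per feature.
    return np.array([extract_features(p) for p in pieces], dtype=float)

X = build_feature_matrix(["The cat sat.", "Buy NOW!!! Cheap pills online."])
# X has shape (2, 3): 2 pieces of text, 3 features.
```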
Feature Matrix
[Figure: the feature matrix X, with one row per piece of text and one column per feature]
Represent each piece of text as a vector of features
Characterizing Text
• 158 features computed for every piece of text (many of which have been used successfully for genre classification by Biber, Kessler, Argamon, …)
o Simple Surface Features
o Readability Measures
o POS Distributions (RASP)
o Vocabulary Obscurity
o Emotional Affect (General Inquirer Dictionary)
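As a rough sketch of two of the feature families listed above, here is a readability measure (the standard Flesch reading-ease formula) and a crude vocabulary-obscurity score. The syllable heuristic and the tiny common-word list are toy assumptions, not the paper's actual resources.

```python
import re

COMMON = {"the", "a", "is", "of", "and", "to", "in", "it"}  # toy frequency list

def count_syllables(word):
    # Rough heuristic: each run of vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch formula: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

def vocabulary_obscurity(text):
    # Fraction of tokens not in a common-word list (higher = more obscure).
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    if not words:
        return 0.0
    return sum(w not in COMMON for w in words) / len(words)
```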
Identify outlying Text
Outliers are ‘hidden’
SDE
• Use the Stahel-Donoho Estimator (SDE) to identify outliers
o Project the data down to one dimension and measure the outlyingness of each piece of text in that dimension
o For every piece of text, the goal is to find a projection of the data that maximizes its robust z-score
o Especially suited to data with a large number of dimensions (features)
[Figure: two 1-D projections of the same data. In one, the robust z-score of the furthest point is < 3; in the other, the robust z-score of the triangles exceeds 12 standard deviations]
SDE
• SD(x_i) = max over directions a of |x_i·a − median_j(x_j·a)| / mad_j(x_j·a), where:
o a is a direction (unit-length vector)
o x_i·a is the projection of row x_i onto direction a
o mad is the median absolute deviation
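The projection-based outlyingness described above can be sketched as follows. Since the maximum over all directions cannot be computed exactly, this sketch approximates it with random unit directions (the count of 500 is an arbitrary choice here, not taken from the paper).

```python
import numpy as np

def mad(v):
    # Median absolute deviation of a 1-D array.
    return np.median(np.abs(v - np.median(v)))

def stahel_donoho(X, n_directions=500, rng=None):
    # Approximate Stahel-Donoho outlyingness for each row of X.
    rng = np.random.default_rng(rng)
    n, p = X.shape
    sd = np.zeros(n)
    for _ in range(n_directions):
        a = rng.normal(size=p)
        a /= np.linalg.norm(a)          # unit-length direction
        proj = X @ a                    # project every row onto a
        scale = mad(proj)
        if scale == 0:
            continue                    # degenerate direction, skip
        z = np.abs(proj - np.median(proj)) / scale  # robust z-scores
        sd = np.maximum(sd, z)          # keep each row's worst projection
    return sd
```

Rows whose score exceeds a chosen cutoff would then be flagged as outliers, as on the next slide.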
Outliers have a large SD
• The outlyingness scores SD(x_i) for each piece of text are then sorted, and all pieces of text above a cutoff are marked as outliers
Experiments
• In each experiment we randomly select 50 segments of text from the Gigaword corpus of newswire and insert one piece of text from a different source to act as an ‘outlier’
• Measure the accuracy of automatically identifying the inserted segment as an outlier
• We varied the size of the pieces of text from 100 to 1000 words
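One trial of the experimental setup above can be sketched like this. The segment data and the scoring function are hypothetical stand-ins (`outlyingness` would be the SDE scorer over the feature matrix); this only shows the insert-and-identify protocol.

```python
import random

def run_trial(newswire_segments, outlier_text, outlyingness):
    # 50 randomly chosen newswire segments plus one inserted outlier.
    pieces = random.sample(newswire_segments, 50) + [outlier_text]
    random.shuffle(pieces)
    scores = outlyingness(pieces)        # one outlyingness score per piece
    predicted = max(range(len(pieces)), key=lambda i: scores[i])
    return pieces[predicted] == outlier_text  # did we flag the inserted text?

def accuracy(n_trials, newswire_segments, outlier_text, outlyingness):
    hits = sum(run_trial(newswire_segments, outlier_text, outlyingness)
               for _ in range(n_trials))
    return hits / n_trials
```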
Anarchist Cookbook
• Very different genre from newswire. The writing is much more procedural (e.g. instructions to build telephone phreaking devices) and also very informal (e.g. ``When the fuse contacts the balloon, watch out!!!'')
• Randomly select one segment from the Anarchist Cookbook, insert it among the newswire segments, and attempt to identify the outlier. This is repeated 200 times for each segment size (100, 500, and 1,000 words)
Cookbook Results
• Remember that we are not using any training data, and there is only a 1/51 (1.96%) chance of guessing the outlier correctly
Machine Translations
• 35 thousand words of Chinese news articles were hand-picked (Wei Liu) and translated into English using Google's Chinese-to-English translation engine
• Similar genre to English newswire but translations are far from perfect and so the language use is very odd
• 200 test collections are created for each segment size as before
MT Results
Conclusions and Future Work
• Outlier detection can be a valuable tool for corpus linguistics (if we want a homogeneous corpus)
o Automatically clean corpora
o Does not require training data or human annotation
• This method can be used reliably for relatively large pieces of text (1,000 words). The threshold could be adjusted to ensure high precision at the expense of recall
• Looking at ways to increase accuracy by more intelligently picking directions for SDE and the cutoff to use for outliers