An Unsupervised Approach for the Detection of Outliers in Corpora
David Guthrie, Louise Guthrie, Yorick Wilks
The University of Sheffield
Corpora in CL
• Increasingly common in computational linguistics to use textual resources gathered automatically
o IR, scraping the Web, etc.
• Construct corpora from specific blogs, bulletin boards, websites (Wikipedia, RottenTomatoes)
Corpora Can Contain Errors
• IR and scraping can lead to errors in precision
• Can contain entries that might be considered spam:
o Advertising
o Gibberish messages
o (More subtly) information that is an opinion rather than a fact, e.g. rants about political figures
Difficult to verify
• The quality of corpora has a dramatic impact on the results of QA, ASR, TC, etc.
• Creation and validation of corpora has generally relied on humans
Goals
• Improve the consistency and quality of corpora
• Automatically identify and remove text from corpora that does not belong
Approach
• Treat the problem as a type of outlier detection
• We aim to find pieces of text in a corpus that differ significantly from the majority of text in that corpus and thus are ‘outliers’
Method
• Characterize each piece of text (document, segment, paragraph, …) in our corpus as a vector of features
• Use these vectors to construct a matrix, X, which has number of rows equal to the pieces of text in the corpus and number of columns equal to the number of features
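The matrix construction above can be sketched as follows. This is a minimal illustration, not the authors' implementation; `extract_features` is a hypothetical stand-in that computes a few toy surface features rather than the paper's full feature set.

```python
import numpy as np

def extract_features(text):
    # Toy stand-in for the real feature extractor:
    # word count, average word length, fraction of capitalized words.
    words = text.split()
    n_words = max(len(words), 1)
    avg_word_len = sum(len(w) for w in words) / n_words
    frac_upper = sum(w[0].isupper() for w in words) / n_words
    return [len(words), avg_word_len, frac_upper]

def build_feature_matrix(pieces):
    # One row per piece of text, one column per feature.
    return np.array([extract_features(p) for p in pieces], dtype=float)

X = build_feature_matrix(["The cat sat.", "Buy NOW!!! Cheap pills online."])
# X has shape (2, 3): 2 pieces of text, 3 features.
```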
Feature Matrix
[Figure: the feature matrix X, with one row per piece of text and one column per feature]
Represent each piece of text as a vector of features
Characterizing Text
• 158 features computed for every piece of text (many of which have been used successfully for genre classification by Biber, Kessler, Argamon, …)
o Simple Surface Features
o Readability Measures
o POS Distributions (RASP)
o Vocabulary Obscurity
o Emotional Affect (General Inquirer Dictionary)
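As a rough sketch of two of the feature families listed above, here is a readability measure (the standard Flesch reading-ease formula) and a crude vocabulary-obscurity score. The syllable heuristic and the tiny common-word list are toy assumptions, not the paper's actual resources.

```python
import re

COMMON = {"the", "a", "is", "of", "and", "to", "in", "it"}  # toy frequency list

def count_syllables(word):
    # Rough heuristic: each run of vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch formula: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

def vocabulary_obscurity(text):
    # Fraction of tokens not in a common-word list (higher = more obscure).
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    if not words:
        return 0.0
    return sum(w not in COMMON for w in words) / len(words)
```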
Identify outlying Text
Outliers are ‘hidden’
SDE
• Use the Stahel-Donoho Estimator (SDE) to identify outliers
o Project the data down to one dimension and measure the outlyingness of each piece of text in that dimension
o For every piece of text, the goal is to find a projection of the data that maximizes its robust z-score
o Especially suited to data with a large number of dimensions (features)
[Figure: two 1-D projections of the same data. In one, the robust z-score of the furthest point is < 3; in the other, the robust z-score of the triangles exceeds 12 standard deviations]
SDE
• SD(x_i) = max over directions a of |x_i·a − median_j(x_j·a)| / mad_j(x_j·a), where:
o a is a direction (unit-length vector)
o x_i·a is the projection of row x_i onto direction a
o mad is the median absolute deviation
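The projection-based outlyingness described above can be sketched as follows. Since the maximum over all directions cannot be computed exactly, this sketch approximates it with random unit directions (the count of 500 is an arbitrary choice here, not taken from the paper).

```python
import numpy as np

def mad(v):
    # Median absolute deviation of a 1-D array.
    return np.median(np.abs(v - np.median(v)))

def stahel_donoho(X, n_directions=500, rng=None):
    # Approximate Stahel-Donoho outlyingness for each row of X.
    rng = np.random.default_rng(rng)
    n, p = X.shape
    sd = np.zeros(n)
    for _ in range(n_directions):
        a = rng.normal(size=p)
        a /= np.linalg.norm(a)          # unit-length direction
        proj = X @ a                    # project every row onto a
        scale = mad(proj)
        if scale == 0:
            continue                    # degenerate direction, skip
        z = np.abs(proj - np.median(proj)) / scale  # robust z-scores
        sd = np.maximum(sd, z)          # keep each row's worst projection
    return sd
```

Rows whose score exceeds a chosen cutoff would then be flagged as outliers, as on the next slide.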
Outliers have a large SD
• The outlyingness scores SD(x_i) for each piece of text are then sorted, and all pieces of text above a cutoff are marked as outliers
Experiments
• In each experiment we randomly select 50 segments of text from the Gigaword corpus of newswire and insert one piece of text from a different source to act as an ‘outlier’
• Measure the accuracy of automatically identifying the inserted segment as an outlier
• We varied the size of the pieces of text from 100 to 1000 words
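One trial of the experimental setup above can be sketched like this. The segment data and the scoring function are hypothetical stand-ins (`outlyingness` would be the SDE scorer over the feature matrix); this only shows the insert-and-identify protocol.

```python
import random

def run_trial(newswire_segments, outlier_text, outlyingness):
    # 50 randomly chosen newswire segments plus one inserted outlier.
    pieces = random.sample(newswire_segments, 50) + [outlier_text]
    random.shuffle(pieces)
    scores = outlyingness(pieces)        # one outlyingness score per piece
    predicted = max(range(len(pieces)), key=lambda i: scores[i])
    return pieces[predicted] == outlier_text  # did we flag the inserted text?

def accuracy(n_trials, newswire_segments, outlier_text, outlyingness):
    hits = sum(run_trial(newswire_segments, outlier_text, outlyingness)
               for _ in range(n_trials))
    return hits / n_trials
```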
Anarchist Cookbook
• Very different genre from newswire. The writing is much more procedural (e.g. instructions to build telephone phreaking devices) and also very informal (e.g. ``When the fuse contacts the balloon, watch out!!!'')
• Randomly select one segment from the Anarchist Cookbook, insert it among the newswire segments, and attempt to identify the outlier. This is repeated 200 times for each segment size (100, 500, and 1,000 words)
Cookbook Results
• Remember that we are not using any training data, and there is only a 1/51 (1.96%) chance of guessing the outlier correctly
Machine Translations
• 35 thousand words of Chinese news articles were hand-picked (Wei Liu) and translated into English using Google's Chinese-to-English translation engine
• Similar genre to English newswire but translations are far from perfect and so the language use is very odd
• 200 test collections are created for each segment size as before
MT Results
Conclusions and Future Work
• Outlier detection can be a valuable tool for corpus linguistics (if we want a homogeneous corpus)
o Automatically clean corpora
o Does not require training data or human annotation
• This method can be used reliably for relatively large pieces of text (1,000 words). The threshold could be adjusted to ensure high precision at the expense of recall
• Looking at ways to increase accuracy by more intelligently picking directions for SDE and the cutoff to use for outliers